When AWR is not enough, a.k.a. Why your app is not cloud-ready. #SMARTDB


22.01.2020
by Kamil Stawiarski

In my previous post I described a case of "enq: TX – row lock contention" that was actually a network latency problem.

More and more those kinds of problems (it seems like history happens all over again) caused my friend (Radosław Kut) and me to write a simple tool to analyze SQL performance from a network perspective.

We called this tool STADO (SQL Tracefile Analyzer Designed for Oracle). It parses a tcpdump outout from application server to identify the longest running queries from application and network perspective.

For now it can show you:

  • SQL ID
  • Ratio – the amount of time spent in network layer
  • Ela App (ms) – time spent in application for executing the query
  • Ela Net (ms) – time spent in network layer for executing the query
  • Exec – number of executions
  • Ela Stddev App – Stddev for time spent from an app perspective
  • Ela App/Exec – AVG time spent on execution from an app perspective
  • Ela Stddev Net – Stddev for time spent from a network perspective
  • Ela Net/Exec – AVG time spent on execution from a network perspective
  • P – amount of packets used for executing this cursor
  • S – how many sessions where executing this cursor
  • RC – how many times a cursor was reused, reducing the amount of parse calls

Let me show you how this tool works, based on a simple example.

I have two database servers (Oracle XE 18c):

  • 10.0.0.8 – remote destination with ping around 64ms
  • 192.168.56.118 – local destination with ping below 1ms

I’ll execute the following script on both servers:

1set arraysize &&1
2set timing on
3 
4alter system flush shared_pool;
5 
6select /*NET_TEST*/ * from EMPLOYEES;
7 
8select sql_id, elapsed_time/1000 as ela_ms, executions
9from v$sql
10where sql_text like '%/*NET_TEST*/%'
11order by elapsed_time;
12 
13exit

While executing a script, I’ll be dumping network traffic with tcpdump like this:

1inter@piclerick stado % sudo tcpdump -i any -w sqls.pcap

Let’s analyze our query performance!

REMOTE

1inter@piclerick instantclient_18_1 % ./sqlplus hr/hr@10.0.0.8:1521/netst @emps 2

And the result is:

1107 rows selected.
2 
3Elapsed: 00:00:03.28
4 
5SQL_ID        ELA_MS EXECUTIONS
6------------- ---------- ----------
7956vu0d29ah05     13.931      1
83bxuudrscy1jn     41.594      1

Let’s compare SQL execution time from a database perspective and from an app perspective. We can see, that we spent almost 14ms executing a query, but the whole process from an app perspective took 3,28s!!!

Let’s check the same execution on local server:

LOCAL

1107 rows selected.
2 
3Elapsed: 00:00:00.03
4 
5SQL_ID        ELA_MS EXECUTIONS
6------------- ---------- ----------
7956vu0d29ah05      7.702      1
83bxuudrscy1jn     26.069      1

Database shows, that a query was around 2 times faster on a local server, comparing to the remote. The absolute difference is 6,229ms – but the actual difference, actual "wallclock" difference is more than 3s! That is huge.

And of course, that’s nothing new – Elaped time in V$SQL is "Elapsed time (in microseconds) used by this cursor for parsing, executing, and fetching", but it won’t show you the actual time spent on processing this SQL.

So, let’s say, that my local server is much slower than my new server in a remote destination.
I reduced the number of CPUs and RAM and made sure to keep a database a bit busy on my local machine.

Now my results are:

LOCAL

1107 rows selected.
2 
3Elapsed: 00:00:00.10
4 
5SQL_ID        ELA_MS EXECUTIONS
6------------- ---------- ----------
7956vu0d29ah05     18.101      1
83bxuudrscy1jn     47.954      1

REMOTE

1107 rows selected.
2 
3Elapsed: 00:00:03.30
4 
5SQL_ID        ELA_MS EXECUTIONS
6------------- ---------- ----------
7956vu0d29ah05      8.627      1
83bxuudrscy1jn     52.103      1

From a database perspective, the second query is much faster! But the application shows something very different.

Great, now let’s see the same performance, from a TCPDUMP perspective, using STADO:

1inter@piclerick stado % ./stado
2STADO (SQL Tracedump Analyzer Doing Oracle) by Radoslaw Kut and Kamil Stawiarski
3Pcap file analyzer for finding TOP SQLs from an APP perspective
4  -C string
5        <dir> directory path to write SQL Charts i.e. -C DevApp
6  -d int
7        Debug flag
8  -f string
9        path to PCAP file for analyzing
10  -i string
11        IP address of database server
12  -p string
13        Listener port for database server
14  -s string
15        Sort by: ela (Elapsed Time), elax (Elapsed time / Executions), pckt (Packets), rtt (Network Elapsed Time), rttx (Elapsed Time Net / Exec), rat (Ela Net / Ela App) (default "ela")

LOCAL – fetchsize 2

1inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
2SQL ID      Ratio       Ela App (ms)    Ela Net(ms) Exec    Ela Stddev App  Ela App/Exec    Ela Stddev Net  Ela Net/Exec    P   S   RC
3------------------------------------------------------------------------------------------------------------------------------------------------------------
45jng2h1ttn71y   0.000000    259.280000  0.000000    1   0.000000    259.280000  0.000000    0.000000    2   1   0
53bxuudrscy1jn   0.000000    87.572000   0.000000    1   0.000000    87.572000   0.000000    0.000000    3   1   0
68wmvpvzmgjmyx   0.000000    2.400000    0.000000    1   0.000000    2.400000    0.000000    0.000000    3   1   0
75w5f39jcwb5nv   0.000000    4.343000    0.000000    1   0.000000    4.343000    0.000000    0.000000    2   1   0
8638x13474m3x2   0.000000    1.894000    0.000000    1   0.000000    1.894000    0.000000    0.000000    2   1   0
92s9mmb6g8kbqb   0.000000    2.210000    0.000000    1   0.000000    2.210000    0.000000    0.000000    2   1   0
10956vu0d29ah05   0.542286    108.854000  59.030000   1   0.000000    108.854000  0.000000    59.030000   109 1   0
11 
12Sum App Time(s): 0.46655299999999994
13Sum Net Time(s): 0.05903
14 
15No. SQLs: 7
16 
17192.168.56.118 32 kb
18 
19 
20    Time frame:  2020-01-22 12:39:40.448777 +0100 CET  <=>  2020-01-22 12:39:41.19642 +0100 CET
21    Time frame duration (s):  0.747643

REMOTE – fetchsize 2

1inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
2SQL ID      Ratio       Ela App (ms)    Ela Net(ms) Exec    Ela Stddev App  Ela App/Exec    Ela Stddev Net  Ela Net/Exec    P   S   RC
3------------------------------------------------------------------------------------------------------------------------------------------------------------
48wmvpvzmgjmyx   0.000000    59.969000   0.000000    1   0.000000    59.969000   0.000000    0.000000    3   1   0
55w5f39jcwb5nv   0.000000    68.520000   0.000000    1   0.000000    68.520000   0.000000    0.000000    2   1   0
6638x13474m3x2   0.000000    63.750000   0.000000    1   0.000000    63.750000   0.000000    0.000000    2   1   0
72s9mmb6g8kbqb   0.000000    59.799000   0.000000    1   0.000000    59.799000   0.000000    0.000000    2   1   0
85jng2h1ttn71y   0.000000    241.471000  0.000000    1   0.000000    241.471000  0.000000    0.000000    2   1   0
93bxuudrscy1jn   0.000000    108.820000  0.000000    1   0.000000    108.820000  0.000000    0.000000    3   1   0
10956vu0d29ah05   0.972269    3200.125000 3111.382000 1   0.000000    3200.125000 0.000000    3111.382000 109 1   0
11 
12Sum App Time(s): 3.8024539999999996
13Sum Net Time(s): 3.111382
14 
15No. SQLs: 7
16 
1710.0.0.8 32 kb
18 
19 
20    Time frame:  2020-01-22 12:40:05.489516 +0100 CET  <=>  2020-01-22 12:40:10.377233 +0100 CET
21    Time frame duration (s):  4.887717

Here we can see clearly which SQL caused the biggest problem, and that time spent on the network was actually 97% of the whole execution time! (In a case of the local server the network time was only 54% – 49ms)

In most cases, this problem is due to a chatty application – like in this case: 107 rows were returned by 109 packets. That’s crazy.
In SQL*Plus we can modify fetch size with a parameter arraysize (default value is 14, which means that SQL*Plus fetches 14 rows at a time).

Let’s see how arraysize parameter influences the execution time:

LOCAL – fetchsize 14

1inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
2SQL ID      Ratio       Ela App (ms)    Ela Net(ms) Exec    Ela Stddev App  Ela App/Exec    Ela Stddev Net  Ela Net/Exec    P   S   RC
3------------------------------------------------------------------------------------------------------------------------------------------------------------
43bxuudrscy1jn   0.000000    72.221000   0.000000    1   0.000000    72.221000   0.000000    0.000000    3   1   0
58wmvpvzmgjmyx   0.000000    1.298000    0.000000    1   0.000000    1.298000    0.000000    0.000000    3   1   0
65w5f39jcwb5nv   0.000000    10.774000   0.000000    1   0.000000    10.774000   0.000000    0.000000    2   1   0
7638x13474m3x2   0.000000    17.649000   0.000000    1   0.000000    17.649000   0.000000    0.000000    2   1   0
82s9mmb6g8kbqb   0.000000    1.427000    0.000000    1   0.000000    1.427000    0.000000    0.000000    2   1   0
95jng2h1ttn71y   0.000000    199.157000  0.000000    1   0.000000    199.157000  0.000000    0.000000    2   1   0
10956vu0d29ah05   0.226816    41.937000   9.512000    1   0.000000    41.937000   0.000000    9.512000    17  1   0
11 
12Sum App Time(s): 0.344463
13Sum Net Time(s): 0.009512
14 
15No. SQLs: 7
16 
17192.168.56.118 27 kb
18 
19 
20    Time frame:  2020-01-22 14:14:23.996268 +0100 CET  <=>  2020-01-22 14:14:24.45883 +0100 CET
21    Time frame duration (s):  0.462562

REMOTE – fetchsize 14

1inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
2SQL ID      Ratio       Ela App (ms)    Ela Net(ms) Exec    Ela Stddev App  Ela App/Exec    Ela Stddev Net  Ela Net/Exec    P   S   RC
3------------------------------------------------------------------------------------------------------------------------------------------------------------
43bxuudrscy1jn   0.000000    116.133000  0.000000    1   0.000000    116.133000  0.000000    0.000000    3   1   0
58wmvpvzmgjmyx   0.000000    59.618000   0.000000    1   0.000000    59.618000   0.000000    0.000000    3   1   0
65w5f39jcwb5nv   0.000000    64.575000   0.000000    1   0.000000    64.575000   0.000000    0.000000    2   1   0
7638x13474m3x2   0.000000    59.540000   0.000000    1   0.000000    59.540000   0.000000    0.000000    2   1   0
82s9mmb6g8kbqb   0.000000    59.320000   0.000000    1   0.000000    59.320000   0.000000    0.000000    2   1   0
95jng2h1ttn71y   0.000000    256.340000  0.000000    1   0.000000    256.340000  0.000000    0.000000    2   1   0
10956vu0d29ah05   0.836899    484.620000  405.578000  1   0.000000    484.620000  0.000000    405.578000  17  1   0
11 
12Sum App Time(s): 1.1001460000000003
13Sum Net Time(s): 0.405578
14 
15No. SQLs: 7
16 
1710.0.0.8 27 kb
18 
19 
20    Time frame:  2020-01-22 14:13:49.553632 +0100 CET  <=>  2020-01-22 14:13:51.847139 +0100 CET
21    Time frame duration (s):  2.293507

LOCAL – fetchsize 200

1inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
2SQL ID      Ratio       Ela App (ms)    Ela Net(ms) Exec    Ela Stddev App  Ela App/Exec    Ela Stddev Net  Ela Net/Exec    P   S   RC
3------------------------------------------------------------------------------------------------------------------------------------------------------------
43bxuudrscy1jn   0.000000    87.994000   0.000000    1   0.000000    87.994000   0.000000    0.000000    3   1   0
58wmvpvzmgjmyx   0.000000    1.121000    0.000000    1   0.000000    1.121000    0.000000    0.000000    3   1   0
65w5f39jcwb5nv   0.000000    2.315000    0.000000    1   0.000000    2.315000    0.000000    0.000000    2   1   0
7638x13474m3x2   0.000000    14.832000   0.000000    1   0.000000    14.832000   0.000000    0.000000    2   1   0
82s9mmb6g8kbqb   0.000000    1.259000    0.000000    1   0.000000    1.259000    0.000000    0.000000    2   1   0
95jng2h1ttn71y   0.000000    187.470000  0.000000    1   0.000000    187.470000  0.000000    0.000000    2   1   0
10956vu0d29ah05   0.019606    17.546000   0.344000    1   0.000000    17.546000   0.000000    0.344000    9   1   0
11 
12Sum App Time(s): 0.31253699999999995
13Sum Net Time(s): 0.00034399999999999996
14 
15No. SQLs: 7
16 
17192.168.56.118 26 kb
18 
19 
20    Time frame:  2020-01-22 14:30:09.225337 +0100 CET  <=>  2020-01-22 14:30:09.664531 +0100 CET
21    Time frame duration (s):  0.439194

REMOTE – fetchsize 200

1inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
2SQL ID      Ratio       Ela App (ms)    Ela Net(ms) Exec    Ela Stddev App  Ela App/Exec    Ela Stddev Net  Ela Net/Exec    P   S   RC
3------------------------------------------------------------------------------------------------------------------------------------------------------------
42s9mmb6g8kbqb   0.000000    63.342000   0.000000    1   0.000000    63.342000   0.000000    0.000000    2   1   0
55jng2h1ttn71y   0.000000    237.884000  0.000000    1   0.000000    237.884000  0.000000    0.000000    2   1   0
63bxuudrscy1jn   0.000000    119.722000  0.000000    1   0.000000    119.722000  0.000000    0.000000    3   1   0
78wmvpvzmgjmyx   0.000000    64.242000   0.000000    1   0.000000    64.242000   0.000000    0.000000    3   1   0
85w5f39jcwb5nv   0.000000    67.841000   0.000000    1   0.000000    67.841000   0.000000    0.000000    2   1   0
9638x13474m3x2   0.000000    64.604000   0.000000    1   0.000000    64.604000   0.000000    0.000000    2   1   0
10956vu0d29ah05   0.488563    127.828000  62.452000   1   0.000000    127.828000  0.000000    62.452000   11  1   0
11 
12Sum App Time(s): 0.745463
13Sum Net Time(s): 0.062452
14 
15No. SQLs: 7
16 
1710.0.0.8 26 kb
18 
19 
20    Time frame:  2020-01-22 14:33:46.121799 +0100 CET  <=>  2020-01-22 14:33:48.165371 +0100 CET
21    Time frame duration (s):  2.043572

As we can see, we reduced the time of SQL statement from more than 3000ms to around 128ms by modifying the size of single fetch – fewer packets equals fewer round trips.

Still, the network plays a significant role in the performance of this query (~ 62%), and the bigger latency – the slower the query will be.
Imagine that you have an application with a lot of this kind of chatty cursors, no PL/SQL logic and everything based on JAVA or Python or something like this… Well, most of you don’t have to imagine – you see those systems every day!

If you take this kind of system and migrate it to some fancy new hardware with a turbo-cool automagic hocus-pocus self-problem-solving platform, will it run faster? If this marvelous platform is not a network switch away from your app server, the answer is simple and brutal – NO.

Cloud won’t solve your problems. If your app is designed in a shitty way – cloud will escalate your problems. There is no magic – only physics.

If you want to check if your systems are network-sensitive and chatty – just make a tcpdump and analyze output with STADO, which can be found here:
https://github.com/ora600pl/stado


Contact us

Database Whisperers sp. z o. o. sp. k.
al. Jerozolimskie 200, 3rd floor, room 342
02-486 Warszawa
NIP: 5272744987
REGON:362524978
+48 508 943 051
+48 661 966 009
info@ora-600.pl

Newsletter Sign up to be updated