In my previous post I described a case of "enq: TX – row lock contention" that was actually a network latency problem.
More and more those kinds of problems (it seems like history happens all over again) caused my friend (Radosław Kut) and me to write a simple tool to analyze SQL performance from a network perspective.
We called this tool STADO (SQL Tracefile Analyzer Designed for Oracle). It parses a tcpdump outout from application server to identify the longest running queries from application and network perspective.
For now it can show you:
- SQL ID
- Ratio – the amount of time spent in network layer
- Ela App (ms) – time spent in application for executing the query
- Ela Net (ms) – time spent in network layer for executing the query
- Exec – number of executions
- Ela Stddev App – Stddev for time spent from an app perspective
- Ela App/Exec – AVG time spent on execution from an app perspective
- Ela Stddev Net – Stddev for time spent from a network perspective
- Ela Net/Exec – AVG time spent on execution from a network perspective
- P – amount of packets used for executing this cursor
- S – how many sessions where executing this cursor
- RC – how many times a cursor was reused, reducing the amount of parse calls
Let me show you how this tool works, based on a simple example.
I have two database servers (Oracle XE 18c):
- 10.0.0.8 – remote destination with ping around 64ms
- 192.168.56.118 – local destination with ping below 1ms
I’ll execute the following script on both servers:
set arraysize &&1 set timing on alter system flush shared_pool; select /*NET_TEST*/ * from EMPLOYEES; select sql_id, elapsed_time/1000 as ela_ms, executions from v$sql where sql_text like '%/*NET_TEST*/%' order by elapsed_time; exit
While executing a script, I’ll be dumping network traffic with tcpdump like this:
inter@piclerick stado % sudo tcpdump -i any -w sqls.pcap
Let’s analyze our query performance!
REMOTE
inter@piclerick instantclient_18_1 % ./sqlplus hr/hr@10.0.0.8:1521/netst @emps 2
And the result is:
107 rows selected. Elapsed: 00:00:03.28 SQL_ID ELA_MS EXECUTIONS ------------- ---------- ---------- 956vu0d29ah05 13.931 1 3bxuudrscy1jn 41.594 1
Let’s compare SQL execution time from a database perspective and from an app perspective. We can see, that we spent almost 14ms executing a query, but the whole process from an app perspective took 3,28s!!!
Let’s check the same execution on local server:
LOCAL
107 rows selected. Elapsed: 00:00:00.03 SQL_ID ELA_MS EXECUTIONS ------------- ---------- ---------- 956vu0d29ah05 7.702 1 3bxuudrscy1jn 26.069 1
Database shows, that a query was around 2 times faster on a local server, comparing to the remote. The absolute difference is 6,229ms – but the actual difference, actual "wallclock" difference is more than 3s! That is huge.
And of course, that’s nothing new – Elaped time in V$SQL is "Elapsed time (in microseconds) used by this cursor for parsing, executing, and fetching", but it won’t show you the actual time spent on processing this SQL.
So, let’s say, that my local server is much slower than my new server in a remote destination.
I reduced the number of CPUs and RAM and made sure to keep a database a bit busy on my local machine.
Now my results are:
LOCAL
107 rows selected. Elapsed: 00:00:00.10 SQL_ID ELA_MS EXECUTIONS ------------- ---------- ---------- 956vu0d29ah05 18.101 1 3bxuudrscy1jn 47.954 1
REMOTE
107 rows selected. Elapsed: 00:00:03.30 SQL_ID ELA_MS EXECUTIONS ------------- ---------- ---------- 956vu0d29ah05 8.627 1 3bxuudrscy1jn 52.103 1
From a database perspective, the second query is much faster! But the application shows something very different.
Great, now let’s see the same performance, from a TCPDUMP perspective, using STADO:
inter@piclerick stado % ./stado STADO (SQL Tracedump Analyzer Doing Oracle) by Radoslaw Kut and Kamil Stawiarski Pcap file analyzer for finding TOP SQLs from an APP perspective -C string <dir> directory path to write SQL Charts i.e. -C DevApp -d int Debug flag -f string path to PCAP file for analyzing -i string IP address of database server -p string Listener port for database server -s string Sort by: ela (Elapsed Time), elax (Elapsed time / Executions), pckt (Packets), rtt (Network Elapsed Time), rttx (Elapsed Time Net / Exec), rat (Ela Net / Ela App) (default "ela")
LOCAL – fetchsize 2
inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat SQL ID Ratio Ela App (ms) Ela Net(ms) Exec Ela Stddev App Ela App/Exec Ela Stddev Net Ela Net/Exec P S RC ------------------------------------------------------------------------------------------------------------------------------------------------------------ 5jng2h1ttn71y 0.000000 259.280000 0.000000 1 0.000000 259.280000 0.000000 0.000000 2 1 0 3bxuudrscy1jn 0.000000 87.572000 0.000000 1 0.000000 87.572000 0.000000 0.000000 3 1 0 8wmvpvzmgjmyx 0.000000 2.400000 0.000000 1 0.000000 2.400000 0.000000 0.000000 3 1 0 5w5f39jcwb5nv 0.000000 4.343000 0.000000 1 0.000000 4.343000 0.000000 0.000000 2 1 0 638x13474m3x2 0.000000 1.894000 0.000000 1 0.000000 1.894000 0.000000 0.000000 2 1 0 2s9mmb6g8kbqb 0.000000 2.210000 0.000000 1 0.000000 2.210000 0.000000 0.000000 2 1 0 956vu0d29ah05 0.542286 108.854000 59.030000 1 0.000000 108.854000 0.000000 59.030000 109 1 0 Sum App Time(s): 0.46655299999999994 Sum Net Time(s): 0.05903 No. SQLs: 7 192.168.56.118 32 kb Time frame: 2020-01-22 12:39:40.448777 +0100 CET <=> 2020-01-22 12:39:41.19642 +0100 CET Time frame duration (s): 0.747643
REMOTE – fetchsize 2
inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat SQL ID Ratio Ela App (ms) Ela Net(ms) Exec Ela Stddev App Ela App/Exec Ela Stddev Net Ela Net/Exec P S RC ------------------------------------------------------------------------------------------------------------------------------------------------------------ 8wmvpvzmgjmyx 0.000000 59.969000 0.000000 1 0.000000 59.969000 0.000000 0.000000 3 1 0 5w5f39jcwb5nv 0.000000 68.520000 0.000000 1 0.000000 68.520000 0.000000 0.000000 2 1 0 638x13474m3x2 0.000000 63.750000 0.000000 1 0.000000 63.750000 0.000000 0.000000 2 1 0 2s9mmb6g8kbqb 0.000000 59.799000 0.000000 1 0.000000 59.799000 0.000000 0.000000 2 1 0 5jng2h1ttn71y 0.000000 241.471000 0.000000 1 0.000000 241.471000 0.000000 0.000000 2 1 0 3bxuudrscy1jn 0.000000 108.820000 0.000000 1 0.000000 108.820000 0.000000 0.000000 3 1 0 956vu0d29ah05 0.972269 3200.125000 3111.382000 1 0.000000 3200.125000 0.000000 3111.382000 109 1 0 Sum App Time(s): 3.8024539999999996 Sum Net Time(s): 3.111382 No. SQLs: 7 10.0.0.8 32 kb Time frame: 2020-01-22 12:40:05.489516 +0100 CET <=> 2020-01-22 12:40:10.377233 +0100 CET Time frame duration (s): 4.887717
Here we can see clearly which SQL caused the biggest problem, and that time spent on the network was actually 97% of the whole execution time! (In a case of the local server the network time was only 54% – 49ms)
In most cases, this problem is due to a chatty application – like in this case: 107 rows were returned by 109 packets. That’s crazy.
In SQL*Plus we can modify fetch size with a parameter arraysize (default value is 14, which means that SQL*Plus fetches 14 rows at a time).
Let’s see how arraysize parameter influences the execution time:
LOCAL – fetchsize 14
inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat SQL ID Ratio Ela App (ms) Ela Net(ms) Exec Ela Stddev App Ela App/Exec Ela Stddev Net Ela Net/Exec P S RC ------------------------------------------------------------------------------------------------------------------------------------------------------------ 3bxuudrscy1jn 0.000000 72.221000 0.000000 1 0.000000 72.221000 0.000000 0.000000 3 1 0 8wmvpvzmgjmyx 0.000000 1.298000 0.000000 1 0.000000 1.298000 0.000000 0.000000 3 1 0 5w5f39jcwb5nv 0.000000 10.774000 0.000000 1 0.000000 10.774000 0.000000 0.000000 2 1 0 638x13474m3x2 0.000000 17.649000 0.000000 1 0.000000 17.649000 0.000000 0.000000 2 1 0 2s9mmb6g8kbqb 0.000000 1.427000 0.000000 1 0.000000 1.427000 0.000000 0.000000 2 1 0 5jng2h1ttn71y 0.000000 199.157000 0.000000 1 0.000000 199.157000 0.000000 0.000000 2 1 0 956vu0d29ah05 0.226816 41.937000 9.512000 1 0.000000 41.937000 0.000000 9.512000 17 1 0 Sum App Time(s): 0.344463 Sum Net Time(s): 0.009512 No. SQLs: 7 192.168.56.118 27 kb Time frame: 2020-01-22 14:14:23.996268 +0100 CET <=> 2020-01-22 14:14:24.45883 +0100 CET Time frame duration (s): 0.462562
REMOTE – fetchsize 14
inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat SQL ID Ratio Ela App (ms) Ela Net(ms) Exec Ela Stddev App Ela App/Exec Ela Stddev Net Ela Net/Exec P S RC ------------------------------------------------------------------------------------------------------------------------------------------------------------ 3bxuudrscy1jn 0.000000 116.133000 0.000000 1 0.000000 116.133000 0.000000 0.000000 3 1 0 8wmvpvzmgjmyx 0.000000 59.618000 0.000000 1 0.000000 59.618000 0.000000 0.000000 3 1 0 5w5f39jcwb5nv 0.000000 64.575000 0.000000 1 0.000000 64.575000 0.000000 0.000000 2 1 0 638x13474m3x2 0.000000 59.540000 0.000000 1 0.000000 59.540000 0.000000 0.000000 2 1 0 2s9mmb6g8kbqb 0.000000 59.320000 0.000000 1 0.000000 59.320000 0.000000 0.000000 2 1 0 5jng2h1ttn71y 0.000000 256.340000 0.000000 1 0.000000 256.340000 0.000000 0.000000 2 1 0 956vu0d29ah05 0.836899 484.620000 405.578000 1 0.000000 484.620000 0.000000 405.578000 17 1 0 Sum App Time(s): 1.1001460000000003 Sum Net Time(s): 0.405578 No. SQLs: 7 10.0.0.8 27 kb Time frame: 2020-01-22 14:13:49.553632 +0100 CET <=> 2020-01-22 14:13:51.847139 +0100 CET Time frame duration (s): 2.293507
LOCAL – fetchsize 200
inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat SQL ID Ratio Ela App (ms) Ela Net(ms) Exec Ela Stddev App Ela App/Exec Ela Stddev Net Ela Net/Exec P S RC ------------------------------------------------------------------------------------------------------------------------------------------------------------ 3bxuudrscy1jn 0.000000 87.994000 0.000000 1 0.000000 87.994000 0.000000 0.000000 3 1 0 8wmvpvzmgjmyx 0.000000 1.121000 0.000000 1 0.000000 1.121000 0.000000 0.000000 3 1 0 5w5f39jcwb5nv 0.000000 2.315000 0.000000 1 0.000000 2.315000 0.000000 0.000000 2 1 0 638x13474m3x2 0.000000 14.832000 0.000000 1 0.000000 14.832000 0.000000 0.000000 2 1 0 2s9mmb6g8kbqb 0.000000 1.259000 0.000000 1 0.000000 1.259000 0.000000 0.000000 2 1 0 5jng2h1ttn71y 0.000000 187.470000 0.000000 1 0.000000 187.470000 0.000000 0.000000 2 1 0 956vu0d29ah05 0.019606 17.546000 0.344000 1 0.000000 17.546000 0.000000 0.344000 9 1 0 Sum App Time(s): 0.31253699999999995 Sum Net Time(s): 0.00034399999999999996 No. SQLs: 7 192.168.56.118 26 kb Time frame: 2020-01-22 14:30:09.225337 +0100 CET <=> 2020-01-22 14:30:09.664531 +0100 CET Time frame duration (s): 0.439194
REMOTE – fetchsize 200
inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat SQL ID Ratio Ela App (ms) Ela Net(ms) Exec Ela Stddev App Ela App/Exec Ela Stddev Net Ela Net/Exec P S RC ------------------------------------------------------------------------------------------------------------------------------------------------------------ 2s9mmb6g8kbqb 0.000000 63.342000 0.000000 1 0.000000 63.342000 0.000000 0.000000 2 1 0 5jng2h1ttn71y 0.000000 237.884000 0.000000 1 0.000000 237.884000 0.000000 0.000000 2 1 0 3bxuudrscy1jn 0.000000 119.722000 0.000000 1 0.000000 119.722000 0.000000 0.000000 3 1 0 8wmvpvzmgjmyx 0.000000 64.242000 0.000000 1 0.000000 64.242000 0.000000 0.000000 3 1 0 5w5f39jcwb5nv 0.000000 67.841000 0.000000 1 0.000000 67.841000 0.000000 0.000000 2 1 0 638x13474m3x2 0.000000 64.604000 0.000000 1 0.000000 64.604000 0.000000 0.000000 2 1 0 956vu0d29ah05 0.488563 127.828000 62.452000 1 0.000000 127.828000 0.000000 62.452000 11 1 0 Sum App Time(s): 0.745463 Sum Net Time(s): 0.062452 No. SQLs: 7 10.0.0.8 26 kb Time frame: 2020-01-22 14:33:46.121799 +0100 CET <=> 2020-01-22 14:33:48.165371 +0100 CET Time frame duration (s): 2.043572
As we can see, we reduced the time of SQL statement from more than 3000ms to around 128ms by modifying the size of single fetch – fewer packets equals fewer round trips.
Still, the network plays a significant role in the performance of this query (~ 62%), and the bigger latency – the slower the query will be.
Imagine that you have an application with a lot of this kind of chatty cursors, no PL/SQL logic and everything based on JAVA or Python or something like this… Well, most of you don’t have to imagine – you see those systems every day!
If you take this kind of system and migrate it to some fancy new hardware with a turbo-cool automagic hocus-pocus self-problem-solving platform, will it run faster? If this marvelous platform is not a network switch away from your app server, the answer is simple and brutal – NO.
Cloud won’t solve your problems. If your app is designed in a shitty way – cloud will escalate your problems. There is no magic – only physics.
If you want to check if your systems are network-sensitive and chatty – just make a tcpdump and analyze output with STADO, which can be found here:
https://github.com/ora600pl/stado