When AWR is not enough, a.k.a. Why your app is not cloud-ready. #SMARTDB


22.01.2020
by Kamil Stawiarski

In my previous post I described a case of "enq: TX – row lock contention" that was actually a network latency problem.

More and more those kinds of problems (it seems like history happens all over again) caused my friend (Radosław Kut) and me to write a simple tool to analyze SQL performance from a network perspective.

We called this tool STADO (SQL Tracefile Analyzer Designed for Oracle). It parses a tcpdump outout from application server to identify the longest running queries from application and network perspective.

For now it can show you:

  • SQL ID
  • Ratio – the amount of time spent in network layer
  • Ela App (ms) – time spent in application for executing the query
  • Ela Net (ms) – time spent in network layer for executing the query
  • Exec – number of executions
  • Ela Stddev App – Stddev for time spent from an app perspective
  • Ela App/Exec – AVG time spent on execution from an app perspective
  • Ela Stddev Net – Stddev for time spent from a network perspective
  • Ela Net/Exec – AVG time spent on execution from a network perspective
  • P – amount of packets used for executing this cursor
  • S – how many sessions where executing this cursor
  • RC – how many times a cursor was reused, reducing the amount of parse calls

Let me show you how this tool works, based on a simple example.

I have two database servers (Oracle XE 18c):

  • 10.0.0.8 – remote destination with ping around 64ms
  • 192.168.56.118 – local destination with ping below 1ms

I’ll execute the following script on both servers:

set arraysize &&1
set timing on

alter system flush shared_pool;

select /*NET_TEST*/ * from EMPLOYEES;

select sql_id, elapsed_time/1000 as ela_ms, executions
from v$sql
where sql_text like '%/*NET_TEST*/%'
order by elapsed_time;

exit

While executing a script, I’ll be dumping network traffic with tcpdump like this:

inter@piclerick stado % sudo tcpdump -i any -w sqls.pcap

Let’s analyze our query performance!

REMOTE

inter@piclerick instantclient_18_1 % ./sqlplus hr/hr@10.0.0.8:1521/netst @emps 2

And the result is:

107 rows selected.

Elapsed: 00:00:03.28

SQL_ID		  ELA_MS EXECUTIONS
------------- ---------- ----------
956vu0d29ah05	  13.931	  1
3bxuudrscy1jn	  41.594	  1

Let’s compare SQL execution time from a database perspective and from an app perspective. We can see, that we spent almost 14ms executing a query, but the whole process from an app perspective took 3,28s!!!

Let’s check the same execution on local server:

LOCAL

107 rows selected.

Elapsed: 00:00:00.03

SQL_ID		  ELA_MS EXECUTIONS
------------- ---------- ----------
956vu0d29ah05	   7.702	  1
3bxuudrscy1jn	  26.069	  1

Database shows, that a query was around 2 times faster on a local server, comparing to the remote. The absolute difference is 6,229ms – but the actual difference, actual "wallclock" difference is more than 3s! That is huge.

And of course, that’s nothing new – Elaped time in V$SQL is "Elapsed time (in microseconds) used by this cursor for parsing, executing, and fetching", but it won’t show you the actual time spent on processing this SQL.

So, let’s say, that my local server is much slower than my new server in a remote destination.
I reduced the number of CPUs and RAM and made sure to keep a database a bit busy on my local machine.

Now my results are:

LOCAL

107 rows selected.

Elapsed: 00:00:00.10

SQL_ID		  ELA_MS EXECUTIONS
------------- ---------- ----------
956vu0d29ah05	  18.101	  1
3bxuudrscy1jn	  47.954	  1

REMOTE

107 rows selected.

Elapsed: 00:00:03.30

SQL_ID		  ELA_MS EXECUTIONS
------------- ---------- ----------
956vu0d29ah05	   8.627	  1
3bxuudrscy1jn	  52.103	  1

From a database perspective, the second query is much faster! But the application shows something very different.

Great, now let’s see the same performance, from a TCPDUMP perspective, using STADO:

inter@piclerick stado % ./stado
STADO (SQL Tracedump Analyzer Doing Oracle) by Radoslaw Kut and Kamil Stawiarski
Pcap file analyzer for finding TOP SQLs from an APP perspective
  -C string
    	<dir> directory path to write SQL Charts i.e. -C DevApp
  -d int
    	Debug flag
  -f string
    	path to PCAP file for analyzing
  -i string
    	IP address of database server
  -p string
    	Listener port for database server
  -s string
    	Sort by: ela (Elapsed Time), elax (Elapsed time / Executions), pckt (Packets), rtt (Network Elapsed Time), rttx (Elapsed Time Net / Exec), rat (Ela Net / Ela App) (default "ela")

LOCAL – fetchsize 2

inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
SQL ID		Ratio		Ela App (ms)	Ela Net(ms)	Exec	Ela Stddev App	Ela App/Exec	Ela Stddev Net	Ela Net/Exec	P	S	RC
------------------------------------------------------------------------------------------------------------------------------------------------------------
5jng2h1ttn71y	0.000000	259.280000	0.000000	1	0.000000	259.280000	0.000000	0.000000	2	1	0
3bxuudrscy1jn	0.000000	87.572000	0.000000	1	0.000000	87.572000	0.000000	0.000000	3	1	0
8wmvpvzmgjmyx	0.000000	2.400000	0.000000	1	0.000000	2.400000	0.000000	0.000000	3	1	0
5w5f39jcwb5nv	0.000000	4.343000	0.000000	1	0.000000	4.343000	0.000000	0.000000	2	1	0
638x13474m3x2	0.000000	1.894000	0.000000	1	0.000000	1.894000	0.000000	0.000000	2	1	0
2s9mmb6g8kbqb	0.000000	2.210000	0.000000	1	0.000000	2.210000	0.000000	0.000000	2	1	0
956vu0d29ah05	0.542286	108.854000	59.030000	1	0.000000	108.854000	0.000000	59.030000	109	1	0

Sum App Time(s): 0.46655299999999994
Sum Net Time(s): 0.05903

No. SQLs: 7

192.168.56.118 32 kb


	Time frame:  2020-01-22 12:39:40.448777 +0100 CET  <=>  2020-01-22 12:39:41.19642 +0100 CET
	Time frame duration (s):  0.747643

REMOTE – fetchsize 2

inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
SQL ID		Ratio		Ela App (ms)	Ela Net(ms)	Exec	Ela Stddev App	Ela App/Exec	Ela Stddev Net	Ela Net/Exec	P	S	RC
------------------------------------------------------------------------------------------------------------------------------------------------------------
8wmvpvzmgjmyx	0.000000	59.969000	0.000000	1	0.000000	59.969000	0.000000	0.000000	3	1	0
5w5f39jcwb5nv	0.000000	68.520000	0.000000	1	0.000000	68.520000	0.000000	0.000000	2	1	0
638x13474m3x2	0.000000	63.750000	0.000000	1	0.000000	63.750000	0.000000	0.000000	2	1	0
2s9mmb6g8kbqb	0.000000	59.799000	0.000000	1	0.000000	59.799000	0.000000	0.000000	2	1	0
5jng2h1ttn71y	0.000000	241.471000	0.000000	1	0.000000	241.471000	0.000000	0.000000	2	1	0
3bxuudrscy1jn	0.000000	108.820000	0.000000	1	0.000000	108.820000	0.000000	0.000000	3	1	0
956vu0d29ah05	0.972269	3200.125000	3111.382000	1	0.000000	3200.125000	0.000000	3111.382000	109	1	0

Sum App Time(s): 3.8024539999999996
Sum Net Time(s): 3.111382

No. SQLs: 7

10.0.0.8 32 kb


	Time frame:  2020-01-22 12:40:05.489516 +0100 CET  <=>  2020-01-22 12:40:10.377233 +0100 CET
	Time frame duration (s):  4.887717

Here we can see clearly which SQL caused the biggest problem, and that time spent on the network was actually 97% of the whole execution time! (In a case of the local server the network time was only 54% – 49ms)

In most cases, this problem is due to a chatty application – like in this case: 107 rows were returned by 109 packets. That’s crazy.
In SQL*Plus we can modify fetch size with a parameter arraysize (default value is 14, which means that SQL*Plus fetches 14 rows at a time).

Let’s see how arraysize parameter influences the execution time:

LOCAL – fetchsize 14

inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
SQL ID		Ratio		Ela App (ms)	Ela Net(ms)	Exec	Ela Stddev App	Ela App/Exec	Ela Stddev Net	Ela Net/Exec	P	S	RC
------------------------------------------------------------------------------------------------------------------------------------------------------------
3bxuudrscy1jn	0.000000	72.221000	0.000000	1	0.000000	72.221000	0.000000	0.000000	3	1	0
8wmvpvzmgjmyx	0.000000	1.298000	0.000000	1	0.000000	1.298000	0.000000	0.000000	3	1	0
5w5f39jcwb5nv	0.000000	10.774000	0.000000	1	0.000000	10.774000	0.000000	0.000000	2	1	0
638x13474m3x2	0.000000	17.649000	0.000000	1	0.000000	17.649000	0.000000	0.000000	2	1	0
2s9mmb6g8kbqb	0.000000	1.427000	0.000000	1	0.000000	1.427000	0.000000	0.000000	2	1	0
5jng2h1ttn71y	0.000000	199.157000	0.000000	1	0.000000	199.157000	0.000000	0.000000	2	1	0
956vu0d29ah05	0.226816	41.937000	9.512000	1	0.000000	41.937000	0.000000	9.512000	17	1	0

Sum App Time(s): 0.344463
Sum Net Time(s): 0.009512

No. SQLs: 7

192.168.56.118 27 kb


	Time frame:  2020-01-22 14:14:23.996268 +0100 CET  <=>  2020-01-22 14:14:24.45883 +0100 CET
	Time frame duration (s):  0.462562

REMOTE – fetchsize 14

inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
SQL ID		Ratio		Ela App (ms)	Ela Net(ms)	Exec	Ela Stddev App	Ela App/Exec	Ela Stddev Net	Ela Net/Exec	P	S	RC
------------------------------------------------------------------------------------------------------------------------------------------------------------
3bxuudrscy1jn	0.000000	116.133000	0.000000	1	0.000000	116.133000	0.000000	0.000000	3	1	0
8wmvpvzmgjmyx	0.000000	59.618000	0.000000	1	0.000000	59.618000	0.000000	0.000000	3	1	0
5w5f39jcwb5nv	0.000000	64.575000	0.000000	1	0.000000	64.575000	0.000000	0.000000	2	1	0
638x13474m3x2	0.000000	59.540000	0.000000	1	0.000000	59.540000	0.000000	0.000000	2	1	0
2s9mmb6g8kbqb	0.000000	59.320000	0.000000	1	0.000000	59.320000	0.000000	0.000000	2	1	0
5jng2h1ttn71y	0.000000	256.340000	0.000000	1	0.000000	256.340000	0.000000	0.000000	2	1	0
956vu0d29ah05	0.836899	484.620000	405.578000	1	0.000000	484.620000	0.000000	405.578000	17	1	0

Sum App Time(s): 1.1001460000000003
Sum Net Time(s): 0.405578

No. SQLs: 7

10.0.0.8 27 kb


	Time frame:  2020-01-22 14:13:49.553632 +0100 CET  <=>  2020-01-22 14:13:51.847139 +0100 CET
	Time frame duration (s):  2.293507

LOCAL – fetchsize 200

inter@piclerick stado % ./stado -f sqls.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
SQL ID		Ratio		Ela App (ms)	Ela Net(ms)	Exec	Ela Stddev App	Ela App/Exec	Ela Stddev Net	Ela Net/Exec	P	S	RC
------------------------------------------------------------------------------------------------------------------------------------------------------------
3bxuudrscy1jn	0.000000	87.994000	0.000000	1	0.000000	87.994000	0.000000	0.000000	3	1	0
8wmvpvzmgjmyx	0.000000	1.121000	0.000000	1	0.000000	1.121000	0.000000	0.000000	3	1	0
5w5f39jcwb5nv	0.000000	2.315000	0.000000	1	0.000000	2.315000	0.000000	0.000000	2	1	0
638x13474m3x2	0.000000	14.832000	0.000000	1	0.000000	14.832000	0.000000	0.000000	2	1	0
2s9mmb6g8kbqb	0.000000	1.259000	0.000000	1	0.000000	1.259000	0.000000	0.000000	2	1	0
5jng2h1ttn71y	0.000000	187.470000	0.000000	1	0.000000	187.470000	0.000000	0.000000	2	1	0
956vu0d29ah05	0.019606	17.546000	0.344000	1	0.000000	17.546000	0.000000	0.344000	9	1	0

Sum App Time(s): 0.31253699999999995
Sum Net Time(s): 0.00034399999999999996

No. SQLs: 7

192.168.56.118 26 kb


	Time frame:  2020-01-22 14:30:09.225337 +0100 CET  <=>  2020-01-22 14:30:09.664531 +0100 CET
	Time frame duration (s):  0.439194

REMOTE – fetchsize 200

inter@piclerick stado % ./stado -f sqlsr.pcap -i "192.168.56.118 or 10.0.0.8" -p 1521 -s rat
SQL ID		Ratio		Ela App (ms)	Ela Net(ms)	Exec	Ela Stddev App	Ela App/Exec	Ela Stddev Net	Ela Net/Exec	P	S	RC
------------------------------------------------------------------------------------------------------------------------------------------------------------
2s9mmb6g8kbqb	0.000000	63.342000	0.000000	1	0.000000	63.342000	0.000000	0.000000	2	1	0
5jng2h1ttn71y	0.000000	237.884000	0.000000	1	0.000000	237.884000	0.000000	0.000000	2	1	0
3bxuudrscy1jn	0.000000	119.722000	0.000000	1	0.000000	119.722000	0.000000	0.000000	3	1	0
8wmvpvzmgjmyx	0.000000	64.242000	0.000000	1	0.000000	64.242000	0.000000	0.000000	3	1	0
5w5f39jcwb5nv	0.000000	67.841000	0.000000	1	0.000000	67.841000	0.000000	0.000000	2	1	0
638x13474m3x2	0.000000	64.604000	0.000000	1	0.000000	64.604000	0.000000	0.000000	2	1	0
956vu0d29ah05	0.488563	127.828000	62.452000	1	0.000000	127.828000	0.000000	62.452000	11	1	0

Sum App Time(s): 0.745463
Sum Net Time(s): 0.062452

No. SQLs: 7

10.0.0.8 26 kb


	Time frame:  2020-01-22 14:33:46.121799 +0100 CET  <=>  2020-01-22 14:33:48.165371 +0100 CET
	Time frame duration (s):  2.043572

As we can see, we reduced the time of SQL statement from more than 3000ms to around 128ms by modifying the size of single fetch – fewer packets equals fewer round trips.

Still, the network plays a significant role in the performance of this query (~ 62%), and the bigger latency – the slower the query will be.
Imagine that you have an application with a lot of this kind of chatty cursors, no PL/SQL logic and everything based on JAVA or Python or something like this… Well, most of you don’t have to imagine – you see those systems every day!

If you take this kind of system and migrate it to some fancy new hardware with a turbo-cool automagic hocus-pocus self-problem-solving platform, will it run faster? If this marvelous platform is not a network switch away from your app server, the answer is simple and brutal – NO.

Cloud won’t solve your problems. If your app is designed in a shitty way – cloud will escalate your problems. There is no magic – only physics.

If you want to check if your systems are network-sensitive and chatty – just make a tcpdump and analyze output with STADO, which can be found here:
https://github.com/ora600pl/stado

Search:


Kontakt

Database Whisperers sp. z o. o. sp. k.
Al. Jerozolimskie 200
Wejście B, III piętro/ pokój 342
02-486 Warszawa
NIP: 5272744987
REGON:362524978
+48 508 943 051
+48 661 966 009
info@ora-600.pl

Newsletter zapisz się żeby być na bieżąco