InfiniBand SR-IOV on Exadata OVM


26.04.2016
by Kamil Stawiarski

Virtual hosts on Exadata with OVM are HVM, not PV. This is one of the limitations of InfiniBand SR-IOV – it can't be used with PV guests. So QEMU is used to emulate the hardware:

[root@exa2dbadm01 ~]# egrep "builder|qemu" /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
builder = 'hvm'
device_model = '/usr/lib/xen/bin/qemu-dm'
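
A quick, hedged way to confirm the virtualization mode from inside the guest (a sketch, assuming the guest kernel exposes the standard Xen sysfs interface; the comments describe typical output):

cat /sys/hypervisor/type        # prints "xen" for Xen guests (PV and HVM alike)
dmesg | grep -i "xen hvm"       # HVM guests log the HVM callback vector setup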

While accessing a physical device from within a DOMU, we can see that the actual work is being done on the DOM0 machine.

Physical I/Os on the filesystem

Physical disks on DOMUs are presented as loop devices:

[root@exa2dbadm01 ~]# losetup -a | grep exa2adm01vm02.arrowecs.hub
/dev/loop4: [0803]:57999656 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img*)
/dev/loop5: [0803]:57999797 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0*)
/dev/loop6: [0803]:57999799 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexad*)
/dev/loop7: [0803]:57999798 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2*)
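
A single loop device can also be mapped back to its backing image directly, for example (assuming the loop driver exposes the backing_file attribute in sysfs):

losetup /dev/loop6                          # shows the file bound to loop6
cat /sys/block/loop6/loop/backing_file      # same information via sysfs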

The "disk" variable actually points to symbolic links to the image files that back the loop devices above:

[root@exa2dbadm01 ~]# grep disk /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
disk = ['file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img,xvda,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img,xvdb,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img,xvdc,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img,xvdd,w']
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img
lrwxrwxrwx 1 root root 62 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img
lrwxrwxrwx 1 root root 75 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0.2.160119.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img
lrwxrwxrwx 1 root root 75 mar 30 23:03 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2.160119-3.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img
lrwxrwxrwx 1 root root 67 mar 30 23:04 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img
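
To resolve all of the disk entries in one pass, a small illustrative one-liner can parse vm.cfg and follow the symlinks (a sketch, using only the paths shown above):

grep ^disk /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg \
  | grep -o "file:[^,]*" | cut -d: -f2 \
  | while read img; do readlink -f "$img"; done   # prints the real image files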

Let's check the filesystems and LVM volumes on the exa2adm01vm02 virtual host:

[oracle@exa2adm01vm02 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       24G  6.5G   17G  29% /
tmpfs                  12G  164M   12G   2% /dev/shm
/dev/xvda1            496M   34M  437M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       20G   15G  3.8G  80% /u01
/dev/xvdb              50G   14G   34G  30% /u01/app/12.1.0.2/grid
/dev/xvdc              50G  8.5G   39G  19% /u01/app/oracle/product/12.1.0.2/dbhome_1
[root@exa2adm01vm02 ~]# pvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvda2 VGExaDb lvm2 a--  24,50g  508,00m
  /dev/xvdd1 VGExaDb lvm2 a--  58,00g 1020,00m

If I create a tablespace in /u01/app/oracle/oradata, it will actually be located in the /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img file, which is accessible through the /dev/loop6 device.
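
For reference, a minimal sketch of how such a test tablespace could be created on the guest (the size and the OMF destination are illustrative assumptions; the datafile name shown later suggests Oracle-Managed Files under /u01/app/oracle/oradata):

sqlplus / as sysdba <<'EOF'
-- place Oracle-Managed Files on the /u01 filesystem backed by pv1_vgexadb.img
alter system set db_create_file_dest='/u01/app/oracle/oradata';
-- small test tablespace used for the I/O tracing below
create tablespace TBS_TEST datafile size 100M;
EOF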

Let's try to prove it. On DOM0 there are kernel threads associated with all the loop devices:

[root@exa2dbadm01 ~]# ps aux | grep loop | grep -v grep
root     193012  0.0  0.0      0     0 ?        S<   Mar31   0:59 [loop4]
root     193075  0.0  0.0      0     0 ?        S<   Mar31   0:17 [loop5]
root     193101  0.0  0.0      0     0 ?        S<   Mar31   1:23 [loop6]
root     193130  0.0  0.0      0     0 ?        S<   Mar31   0:15 [loop7]
root     364287  0.0  0.0      0     0 ?        S<   Mar30   0:56 [loop0]
root     364346  0.0  0.0      0     0 ?        S<   Mar30   0:18 [loop1]
root     364372  0.0  0.0      0     0 ?        S<   Mar30   1:09 [loop2]
root     364399  0.0  0.0      0     0 ?        S<   Mar30   0:16 [loop3]

We will trace the [loop6] kernel thread while doing some I/Os at the virtual host level.

Virtual host (exa2adm01vm02):

SQL> select file_name
  2  from dba_data_files
  3  where tablespace_name='TBS_TEST';

FILE_NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/RICO/datafile/o1_mf_tbs_test_cjncko3x_.dbf

DOM0

[root@exa2dbadm01 ~]# perf record -g -p 193101
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.040 MB perf.data (~1759 samples) ]
[root@exa2dbadm01 ~]# perf report
# Events: 172  cpu-clock
#
# Overhead  Command      Shared Object                          Symbol
# ........  .......  .................  ..............................
#
    53.49%    loop6  [kernel.kallsyms]  [k] xen_hypercall_xen_version
              |
              --- xen_hypercall_xen_version
                  check_events
                 |
                 |--32.61%-- __blk_run_queue
                 |          |
                 |          |--86.67%-- __make_request
                 |          |          generic_make_request
                 |          |          submit_bio
                 |          |          dio_post_submission
                 |          |          __blockdev_direct_IO_bvec
                 |          |          ocfs2_direct_IO_bvec
                 |          |          mapping_direct_IO
                 |          |          generic_file_direct_write_iter
                 |          |          ocfs2_file_write_iter
                 |          |          aio_write_iter
                 |          |          aio_kernel_submit
                 |          |          lo_rw_aio
                 |          |          loop_thread
                 |          |          kthread
                 |          |          kernel_thread_helper
                 |          |
                 |           --13.33%-- blk_run_queue
                 |                     scsi_run_queue
                 |                     scsi_next_command
                 |                     scsi_end_request
                 |                     scsi_io_completion
                 |                     scsi_finish_command
                 |                     scsi_softirq_done
                 |                     blk_done_softirq
                 |                     __do_softirq
                 |                     call_softirq
                 |                     do_softirq
                 |                     irq_exit
                 |                     xen_evtchn_do_upcall
                 |                     xen_do_hypervisor_callback
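
The same conclusion can be corroborated without perf by sampling the block-layer statistics of the loop device on DOM0 while the guest is writing (a sketch, assuming the standard /sys/block/<dev>/stat layout, where field 7 is sectors written):

before=$(awk '{print $7}' /sys/block/loop6/stat)   # sectors written so far
sleep 10                                           # run the guest I/O test now
after=$(awk '{print $7}' /sys/block/loop6/stat)
echo "sectors written to loop6 in 10s: $((after - before))"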

Tracing the network

With network traffic we can observe a situation similar to the physical device emulation.
When transferring data to and from the virtual machine over Ethernet, we can see that DOM0 is doing the actual work with the xen_netback driver.

To measure this, I'll transfer a file between a cell storage server and the virtual guest system over the admin network. During this operation I'll measure the network traffic with the nttop.stp script (a SystemTap script provided by sourceware.org – https://sourceware.org/systemtap/examples/).
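
For completeness, a hedged sketch of how such a SystemTap script is typically run on DOM0 (assuming SystemTap and the matching kernel debuginfo packages are installed and the script has been downloaded from the examples page):

stap -v nttop.stp   # periodically prints per-process XMIT/RECV statistics per device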

From the virtual guest:

[root@exa2adm01vm02 ~]# scp 10.8.8.53:*.rpm .
root@10.8.8.53's password:
cell-12.1.2.3.0_LINUX.X64_160207.3-1.x86_64.rpm

At the DOM0:

  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
35476     0 eth0      16351       0     960       0 netback/2
    0     0 eth0          0   14750       0  281612 swapper
    0     0 vif6.0    14747       0  281813       0 swapper
376527     0 eth0          0     181       0    3567 perl
376527     0 vif6.0      181       0    3569       0 perl
    0     0 eth1          0      12       0       0 swapper
338283     0 eth0          0      11       0     194 LGWRExaWatcher.
338283     0 vif6.0       11       0     194       0 LGWRExaWatcher.
39142     0 eth0          0      10       0     200 python
39142     0 vif6.0       10       0     200       0 python

At the top we can see the netback/2 process, which is responsible for transmitting the data.
Below is the perf call graph of the netback/2 process while transmitting data through the public network.

[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~831 samples) ]

[root@exa2dbadm01 ~]# perf report
# Events: 86  cpu-clock
#
# Overhead    Command      Shared Object                          Symbol
# ........  .........  .................  ..............................
#
    69.77%  netback/2  [kernel.kallsyms]  [k] xen_hypercall_grant_table_op
            |
            --- xen_hypercall_grant_table_op
               |
               |--98.33%-- xen_netbk_rx_action
               |          xen_netbk_kthread
               |          kthread
               |          kernel_thread_helper
               |
                --1.67%-- xen_netbk_tx_action
                          xen_netbk_kthread
                          kthread
                          kernel_thread_helper

But when we record the netback/2 process while the transfer uses the InfiniBand address, we see that no work has been performed:

[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (~491 samples) ]

[root@exa2dbadm01 ~]# perf report

The perf.data file has no samples!

This is because of the SR-IOV implementation in OVM. On DOM0 we can see one PF (Physical Function) and 16 VFs (Virtual Functions) of the InfiniBand PCI card:

[root@exa2dbadm01 ~]# lspci | grep -i infiniband
19:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
19:00.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:02.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
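
On reasonably recent kernels the VF configuration of the PF can also be checked through sysfs (a sketch; these attributes may not be present on every Exadata DOM0 kernel):

cat /sys/bus/pci/devices/0000:19:00.0/sriov_totalvfs   # maximum VFs the card supports
cat /sys/bus/pci/devices/0000:19:00.0/sriov_numvfs     # VFs currently enabled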

In the VM configuration file there is information about assigning the PCI address of the InfiniBand card to the virtual machine:

[root@exa2dbadm01 ~]# grep "ib_" /EXAVMIMAGES/GuestImages/exa2adm01vm0[1-2].arrowecs.hub/vm.cfg
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]

Although the parameters are the same for both DOMUs, each virtual guest is assigned only one exclusive VF at system startup:

[root@exa2dbadm01 ~]# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  8192     4     r----- 376916.6
exa2adm01vm01.arrowecs.hub                   2 49152     8     -b---- 365576.8
exa2adm01vm02.arrowecs.hub                   6 12288     2     -b---- 276681.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm01.arrowecs.hub
Vdev Device
04.0 0000:19:00.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm02.arrowecs.hub
Vdev Device
04.0 0000:19:00.2

After those assignments I have only 14 VFs left:

[root@exa2dbadm01 ~]# xl pci-list-assignable-devices
0000:19:00.3
0000:19:00.4
0000:19:00.5
0000:19:00.6
0000:19:00.7
0000:19:01.0
0000:19:01.1
0000:19:01.2
0000:19:01.3
0000:19:01.4
0000:19:01.5
0000:19:01.6
0000:19:01.7
0000:19:02.0

So this is actually a limitation on how many virtual machines I can run in the Exadata OVM environment.
On X4, X5 and X6 you have 63 VFs.
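
A quick way to check how many VFs are still free for new guests is simply to count the assignable devices (illustrative one-liner):

xl pci-list-assignable-devices | wc -l   # on this DOM0 only the remaining InfiniBand VFs are assignable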

