InfiniBand SR-IOV on Exadata OVM


26.04.2016
by Kamil Stawiarski

Virtual hosts on Exadata with OVM are HVM, not PV. This is one of the limitations of InfiniBand SR-IOV – it can’t be used with PV guests. So QEMU is used to emulate the hardware:

[root@exa2dbadm01 ~]# egrep "builder|qemu" /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
builder = 'hvm'
device_model = '/usr/lib/xen/bin/qemu-dm'
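
We can also confirm the virtualization mode from inside the guest – a minimal check, assuming the guest kernel is recent enough to expose the /sys/hypervisor/guest_type attribute (older kernels may not have it):

[root@exa2adm01vm02 ~]# cat /sys/hypervisor/guest_type
HVM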

While accessing a physical device from within a DOMU, we can see that the actual work is being done on the DOM0 machine.

Tracing physical IOs

Physical disks on DOMUs are presented as loop devices:

[root@exa2dbadm01 ~]# losetup -a | grep exa2adm01vm02.arrowecs.hub
/dev/loop4: [0803]:57999656 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img*)
/dev/loop5: [0803]:57999797 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0*)
/dev/loop6: [0803]:57999799 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexad*)
/dev/loop7: [0803]:57999798 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2*)

The "disk" variable actually points to symbolic links that resolve to the image files backing the above loop devices:

[root@exa2dbadm01 ~]# grep disk /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
disk = ['file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img,xvda,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img,xvdb,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img,xvdc,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img,xvdd,w']
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img
lrwxrwxrwx 1 root root 62 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img
lrwxrwxrwx 1 root root 75 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0.2.160119.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img
lrwxrwxrwx 1 root root 75 mar 30 23:03 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2.160119-3.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img
lrwxrwxrwx 1 root root 67 mar 30 23:04 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img
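
If we want the whole mapping at once, a small one-liner (a sketch that relies on the vm.cfg layout shown above) resolves every entry of the disk list to its backing image:

[root@exa2dbadm01 ~]# grep -o "file:[^,]*" /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg | sed 's/^file://' | xargs -n1 readlink -f
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0.2.160119.img
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2.160119-3.img
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img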

Let’s check the filesystems and the LVM layout on the exa2adm01vm02 virtual host:

[oracle@exa2adm01vm02 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       24G  6.5G   17G  29% /
tmpfs                  12G  164M   12G   2% /dev/shm
/dev/xvda1            496M   34M  437M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       20G   15G  3.8G  80% /u01
/dev/xvdb              50G   14G   34G  30% /u01/app/12.1.0.2/grid
/dev/xvdc              50G  8.5G   39G  19% /u01/app/oracle/product/12.1.0.2/dbhome_1
[root@exa2adm01vm02 ~]# pvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvda2 VGExaDb lvm2 a--  24,50g  508,00m
  /dev/xvdd1 VGExaDb lvm2 a--  58,00g 1020,00m
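
To check which of these PVs holds the extents of a particular logical volume (and therefore which image file on DOM0 receives the IOs), we can list the LV-to-device mapping – an illustrative check:

[root@exa2adm01vm02 ~]# lvs -o lv_name,devices VGExaDb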

If I create a tablespace in /u01/app/oracle/oradata, it will actually be located in the /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img file, which is accessible through the /dev/loop6 device.
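
For the test a small tablespace is enough – a minimal sketch (the name matches the query below; it assumes db_create_file_dest points to /u01/app/oracle/oradata, which is why the datafile gets an OMF name):

SQL> create tablespace tbs_test datafile size 100M;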

Let’s try to prove it. At the DOM0 there are kernel threads associated with all loop devices:

[root@exa2dbadm01 ~]# ps aux | grep loop | grep -v grep
root     193012  0.0  0.0      0     0 ?        S<   Mar31   0:59 [loop4]
root     193075  0.0  0.0      0     0 ?        S<   Mar31   0:17 [loop5]
root     193101  0.0  0.0      0     0 ?        S<   Mar31   1:23 [loop6]
root     193130  0.0  0.0      0     0 ?        S<   Mar31   0:15 [loop7]
root     364287  0.0  0.0      0     0 ?        S<   Mar30   0:56 [loop0]
root     364346  0.0  0.0      0     0 ?        S<   Mar30   0:18 [loop1]
root     364372  0.0  0.0      0     0 ?        S<   Mar30   1:09 [loop2]
root     364399  0.0  0.0      0     0 ?        S<   Mar30   0:16 [loop3]

We will trace the [loop6] kernel thread while doing some IOs at the virtual host level – if the claim is right, the stack should show the loop thread submitting direct IOs to the OCFS2 filesystem that hosts the image file.

Virtual Host (exa2adm01vm02):

SQL> select file_name
  2  from dba_data_files
  3  where tablespace_name='TBS_TEST';

FILE_NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/RICO/datafile/o1_mf_tbs_test_cjncko3x_.dbf
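
To keep the loop6 thread busy while perf is recording, any write-heavy statement against this tablespace will do – a hypothetical workload, not necessarily the exact one used here:

SQL> create table t_test tablespace tbs_test
  2  as select * from dba_objects;

SQL> alter system checkpoint;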

DOM0:

[root@exa2dbadm01 ~]# perf record -g -p 193101
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.040 MB perf.data (~1759 samples) ]
[root@exa2dbadm01 ~]# perf report
# Events: 172  cpu-clock
#
# Overhead  Command      Shared Object                          Symbol
# ........  .......  .................  ..............................
#
    53.49%    loop6  [kernel.kallsyms]  [k] xen_hypercall_xen_version
              |
              --- xen_hypercall_xen_version
                  check_events
                 |
                 |--32.61%-- __blk_run_queue
                 |          |
                 |          |--86.67%-- __make_request
                 |          |          generic_make_request
                 |          |          submit_bio
                 |          |          dio_post_submission
                 |          |          __blockdev_direct_IO_bvec
                 |          |          ocfs2_direct_IO_bvec
                 |          |          mapping_direct_IO
                 |          |          generic_file_direct_write_iter
                 |          |          ocfs2_file_write_iter
                 |          |          aio_write_iter
                 |          |          aio_kernel_submit
                 |          |          lo_rw_aio
                 |          |          loop_thread
                 |          |          kthread
                 |          |          kernel_thread_helper
                 |          |
                 |           --13.33%-- blk_run_queue
                 |                     scsi_run_queue
                 |                     scsi_next_command
                 |                     scsi_end_request
                 |                     scsi_io_completion
                 |                     scsi_finish_command
                 |                     scsi_softirq_done
                 |                     blk_done_softirq
                 |                     __do_softirq
                 |                     call_softirq
                 |                     do_softirq
                 |                     irq_exit
                 |                     xen_evtchn_do_upcall
                 |                     xen_do_hypervisor_callback

Tracing network

With network traffic we can observe a situation similar to the physical device emulation.
When transferring data to and from a virtual machine over Ethernet, we can see that DOM0 does the actual work with the xen_netback driver.

To measure this, I’ll transfer a file between a cell storage server and the virtual guest system over the admin network – during this operation I’ll measure the network traffic with the nettop.stp script (a SystemTap script provided by sourceware.org – https://sourceware.org/systemtap/examples/).
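
On the DOM0 the script can be started like this (assuming the systemtap runtime and the kernel debuginfo matching the running kernel are installed):

[root@exa2dbadm01 ~]# stap -v nettop.stp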

From Virtual Guest:

[root@exa2adm01vm02 ~]# scp 10.8.8.53:*.rpm .
root@10.8.8.53's password:
cell-12.1.2.3.0_LINUX.X64_160207.3-1.x86_64.rpm

At the DOM0:

  PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
35476     0 eth0      16351       0     960       0 netback/2
    0     0 eth0          0   14750       0  281612 swapper
    0     0 vif6.0    14747       0  281813       0 swapper
376527     0 eth0          0     181       0    3567 perl
376527     0 vif6.0      181       0    3569       0 perl
    0     0 eth1          0      12       0       0 swapper
338283     0 eth0          0      11       0     194 LGWRExaWatcher.
338283     0 vif6.0       11       0     194       0 LGWRExaWatcher.
39142     0 eth0          0      10       0     200 python
39142     0 vif6.0       10       0     200       0 python

At the top we can see the netback/2 process, which is responsible for transmitting the data.
Below is the perf call graph of the netback/2 process, recorded while it was transmitting data through the public network:

[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~831 samples) ]

[root@exa2dbadm01 ~]# perf report
# Events: 86  cpu-clock
#
# Overhead    Command      Shared Object                          Symbol
# ........  .........  .................  ..............................
#
    69.77%  netback/2  [kernel.kallsyms]  [k] xen_hypercall_grant_table_op
            |
            --- xen_hypercall_grant_table_op
               |
               |--98.33%-- xen_netbk_rx_action
               |          xen_netbk_kthread
               |          kthread
               |          kernel_thread_helper
               |
                --1.67%-- xen_netbk_tx_action
                          xen_netbk_kthread
                          kthread
                          kernel_thread_helper

But when we try to record the netback/2 process while the transfer goes over the InfiniBand address, we will see that it performs no work at all:

[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (~491 samples) ]

[root@exa2dbadm01 ~]# perf report

The perf.data file has no samples!

This is because of the SR-IOV implementation in OVM. At the DOM0 we can see one PF (Physical Function) and 16 VFs (Virtual Functions) of the InfiniBand PCI card:

[root@exa2dbadm01 ~]# lspci | grep -i infiniband
19:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
19:00.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:02.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
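
The VF limit itself can also be read from the card's SR-IOV PCI capability – an illustrative check (requires root; the actual counts depend on how the mlx4_core driver was configured):

[root@exa2dbadm01 ~]# lspci -s 19:00.0 -vvv | grep -A4 "SR-IOV"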

In the VM configuration file there is information about assigning the PCI address of the InfiniBand card to the virtual machine:

[root@exa2dbadm01 ~]# grep "ib_" /EXAVMIMAGES/GuestImages/exa2adm01vm0[1-2].arrowecs.hub/vm.cfg
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]

Although the parameters are the same for both DOMUs, each virtual guest is assigned only one exclusive VF at system startup:

[root@exa2dbadm01 ~]# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  8192     4     r----- 376916.6
exa2adm01vm01.arrowecs.hub                   2 49152     8     -b---- 365576.8
exa2adm01vm02.arrowecs.hub                   6 12288     2     -b---- 276681.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm01.arrowecs.hub
Vdev Device
04.0 0000:19:00.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm02.arrowecs.hub
Vdev Device
04.0 0000:19:00.2
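
From inside the guest the assigned VF shows up as a regular PCI device at the virtual slot listed above – an illustrative check:

[root@exa2adm01vm02 ~]# lspci | grep -i mellanox
00:04.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)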

After those assignments I have only 14 VFs left:

[root@exa2dbadm01 ~]# xl pci-list-assignable-devices
0000:19:00.3
0000:19:00.4
0000:19:00.5
0000:19:00.6
0000:19:00.7
0000:19:01.0
0000:19:01.1
0000:19:01.2
0000:19:01.3
0000:19:01.4
0000:19:01.5
0000:19:01.6
0000:19:01.7
0000:19:02.0

So this is actually a limitation on how many virtual machines I can run in the Exadata OVM environment.
On X-4, X-5 and X-6 you have 63 VFs.
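
A quick way to check how many VFs are still free for new guests is to count the assignable list:

[root@exa2dbadm01 ~]# xl pci-list-assignable-devices | wc -l
14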

