Virtual hosts on Exadata with OVM are HVM, not PV. This is one of the limitations of InfiniBand SR-IOV – it can't be used with PV guests. Therefore qemu is used to emulate the hardware:
[root@exa2dbadm01 ~]# egrep "builder|qemu" /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
builder = 'hvm'
device_model = '/usr/lib/xen/bin/qemu-dm'
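The guest type can also be checked with a quick script. A minimal sketch that extracts the builder type from a vm.cfg fragment; the /tmp path and the two sample lines are stand-ins for the real configuration file:

```shell
# Sample vm.cfg fragment (stand-in for the real file on the DOM0)
cat > /tmp/vm.cfg <<'EOF'
builder = 'hvm'
device_model = '/usr/lib/xen/bin/qemu-dm'
EOF

# Extract the builder type: 'hvm' means full hardware emulation through qemu-dm
builder=$(sed -n "s/^builder *= *'\(.*\)'/\1/p" /tmp/vm.cfg)
echo "$builder"
```

On the Exadata node itself, the same sed expression can be pointed directly at the guest's vm.cfg under /EXAVMIMAGES/GuestImages.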
While accessing a physical device from within a DOMU, we can see that the actual work is being done on the DOM0 machine.
Physical I/Os on the filesystem
Physical disks on DOMUs are presented through loop devices:
[root@exa2dbadm01 ~]# losetup -a | grep exa2adm01vm02.arrowecs.hub
/dev/loop4: [0803]:57999656 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img*)
/dev/loop5: [0803]:57999797 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0*)
/dev/loop6: [0803]:57999799 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexad*)
/dev/loop7: [0803]:57999798 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2*)
The "disk" variable in vm.cfg actually points to symbolic links that resolve to the image files backing the above loop devices:
[root@exa2dbadm01 ~]# grep disk /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
disk = ['file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img,xvda,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img,xvdb,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img,xvdc,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img,xvdd,w']
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img
lrwxrwxrwx 1 root root 62 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img
lrwxrwxrwx 1 root root 75 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0.2.160119.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img
lrwxrwxrwx 1 root root 75 mar 30 23:03 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2.160119-3.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img
lrwxrwxrwx 1 root root 67 mar 30 23:04 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img
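Each disk entry can be decomposed with plain shell string operations. A sketch using one of the entries from the listing above; on a live system you would then follow the symlink with readlink -f to reach the /EXAVMIMAGES image:

```shell
# One entry from the vm.cfg 'disk' list (format: file:<path>,<guest device>,<mode>)
cfg="file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img,xvda,w"

# Strip the 'file:' prefix and the ',xvda,w' suffix to get the backing file path
img=${cfg#file:}; img=${img%%,*}

# The second comma-separated field is the device name seen inside the guest
dev=$(echo "$cfg" | cut -d, -f2)
echo "$dev -> $img"
```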
Let's check the filesystems and LVMs on the exa2adm01vm02 virtual host:
[oracle@exa2adm01vm02 ~]$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1   24G  6.5G   17G  29% /
tmpfs                          12G  164M   12G   2% /dev/shm
/dev/xvda1                    496M   34M  437M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1   20G   15G  3.8G  80% /u01
/dev/xvdb                      50G   14G   34G  30% /u01/app/12.1.0.2/grid
/dev/xvdc                      50G  8.5G   39G  19% /u01/app/oracle/product/12.1.0.2/dbhome_1
[root@exa2adm01vm02 ~]# pvs
  PV          VG      Fmt  Attr PSize  PFree
  /dev/xvda2  VGExaDb lvm2 a--  24,50g  508,00m
  /dev/xvdd1  VGExaDb lvm2 a--  58,00g 1020,00m
If I create a tablespace under /u01/app/oracle/oradata, it will actually be located in the /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img file, which is accessible through the /dev/loop6 device.
Let's try to prove it. On DOM0 there are kernel threads associated with all loop devices:
[root@exa2dbadm01 ~]# ps aux | grep loop | grep -v grep
root 193012  0.0  0.0  0  0 ?  S<  Mar31  0:59 [loop4]
root 193075  0.0  0.0  0  0 ?  S<  Mar31  0:17 [loop5]
root 193101  0.0  0.0  0  0 ?  S<  Mar31  1:23 [loop6]
root 193130  0.0  0.0  0  0 ?  S<  Mar31  0:15 [loop7]
root 364287  0.0  0.0  0  0 ?  S<  Mar30  0:56 [loop0]
root 364346  0.0  0.0  0  0 ?  S<  Mar30  0:18 [loop1]
root 364372  0.0  0.0  0  0 ?  S<  Mar30  1:09 [loop2]
root 364399  0.0  0.0  0  0 ?  S<  Mar30  0:16 [loop3]
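To pick the right PID for tracing, the kernel-thread name can be paired with its PID programmatically. A sketch that parses sample ps lines taken from the listing above; the resulting PID is what gets fed to perf record -p:

```shell
# Sample 'ps aux' lines for two loop kernel threads (from the listing above)
cat > /tmp/ps.txt <<'EOF'
root 193012 0.0 0.0 0 0 ? S< Mar31 0:59 [loop4]
root 193101 0.0 0.0 0 0 ? S< Mar31 1:23 [loop6]
EOF

# Field 11 is the command name; strip the brackets and print "thread pid"
awk '{ gsub(/[][]/, "", $11); print $11, $2 }' /tmp/ps.txt
```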
We will trace the [loop6] process while doing some I/Os at the virtual host level.
Virtual host (exa2adm01vm02):
SQL> select file_name
  2  from dba_data_files
  3  where tablespace_name='TBS_TEST';

FILE_NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/RICO/datafile/o1_mf_tbs_test_cjncko3x_.dbf
At the DOM0:
[root@exa2dbadm01 ~]# perf record -g -p 193101
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.040 MB perf.data (~1759 samples) ]
[root@exa2dbadm01 ~]# perf report
# Events: 172 cpu-clock
#
# Overhead  Command      Shared Object                          Symbol
# ........  .......  .................  ..............................
#
    53.49%    loop6  [kernel.kallsyms]  [k] xen_hypercall_xen_version
              |
              --- xen_hypercall_xen_version
                  check_events
                  |
                  |--32.61%-- __blk_run_queue
                  |          |
                  |          |--86.67%-- __make_request
                  |          |           generic_make_request
                  |          |           submit_bio
                  |          |           dio_post_submission
                  |          |           __blockdev_direct_IO_bvec
                  |          |           ocfs2_direct_IO_bvec
                  |          |           mapping_direct_IO
                  |          |           generic_file_direct_write_iter
                  |          |           ocfs2_file_write_iter
                  |          |           aio_write_iter
                  |          |           aio_kernel_submit
                  |          |           lo_rw_aio
                  |          |           loop_thread
                  |          |           kthread
                  |          |           kernel_thread_helper
                  |          |
                  |           --13.33%-- blk_run_queue
                  |                      scsi_run_queue
                  |                      scsi_next_command
                  |                      scsi_end_request
                  |                      scsi_io_completion
                  |                      scsi_finish_command
                  |                      scsi_softirq_done
                  |                      blk_done_softirq
                  |                      __do_softirq
                  |                      call_softirq
                  |                      do_softirq
                  |                      irq_exit
                  |                      xen_evtchn_do_upcall
                  |                      xen_do_hypervisor_callback
Tracing network
With network traffic we can observe a situation similar to the physical device emulation.
When transferring data to and from a virtual machine over Ethernet, we can see that DOM0 does the actual work with the xen_netback driver.
To measure this, I'll transfer a file between a cell storage server and the virtual guest system over the admin network. During this operation I'll measure the network traffic with the nttop.stp script (a SystemTap script provided by sourceware.org – https://sourceware.org/systemtap/examples/).
From the virtual guest:
[root@exa2adm01vm02 ~]# scp 10.8.8.53:*.rpm .
root@10.8.8.53's password:
cell-12.1.2.3.0_LINUX.X64_160207.3-1.x86_64.rpm
At the DOM0:
   PID  UID  DEV     XMIT_PK  RECV_PK  XMIT_KB  RECV_KB  COMMAND
 35476    0  eth0      16351        0      960        0  netback/2
     0    0  eth0          0    14750        0   281612  swapper
     0    0  vif6.0    14747        0   281813        0  swapper
376527    0  eth0          0      181        0     3567  perl
376527    0  vif6.0      181        0     3569        0  perl
     0    0  eth1          0       12        0        0  swapper
338283    0  eth0          0       11        0      194  LGWRExaWatcher.
338283    0  vif6.0       11        0      194        0  LGWRExaWatcher.
 39142    0  eth0          0       10        0      200  python
 39142    0  vif6.0       10        0      200        0  python
At the top we can see the netback/2 process, which is responsible for transmitting the data.
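The per-device totals can be aggregated from this kind of output. A sketch that sums the RECV_KB column per device, fed with a few sample data lines from the table above:

```shell
# Sample data lines from the nttop.stp output above (header omitted)
cat > /tmp/nttop.txt <<'EOF'
35476 0 eth0 16351 0 960 0 netback/2
0 0 eth0 0 14750 0 281612 swapper
0 0 vif6.0 14747 0 281813 0 swapper
EOF

# Column 3 is DEV, column 7 is RECV_KB; sum received KB per device
awk '{ recv[$3] += $7 } END { for (d in recv) print d, recv[d] }' /tmp/nttop.txt | sort
```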
Here is the perf call-graph of the netback/2 process while it transmits data over the public network:
[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~831 samples) ]
[root@exa2dbadm01 ~]# perf report
# Events: 86 cpu-clock
#
# Overhead    Command      Shared Object                            Symbol
# ........  .........  .................  ................................
#
    69.77%  netback/2  [kernel.kallsyms]  [k] xen_hypercall_grant_table_op
            |
            --- xen_hypercall_grant_table_op
                |
                |--98.33%-- xen_netbk_rx_action
                |           xen_netbk_kthread
                |           kthread
                |           kernel_thread_helper
                |
                 --1.67%-- xen_netbk_tx_action
                           xen_netbk_kthread
                           kthread
                           kernel_thread_helper
But when we record the netback/2 process while using the InfiniBand address, we will see that no work has been performed:
[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (~491 samples) ]
[root@exa2dbadm01 ~]# perf report
The perf.data file has no samples!
This is because of the SR-IOV implementation in OVM. On DOM0 we can see one PF (Physical Function) and 16 VFs (Virtual Functions) of the InfiniBand PCI card:
[root@exa2dbadm01 ~]# lspci | grep -i infiniband
19:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
19:00.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:02.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
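Counting functions this way is easy to script, since the "Virtual Function" string distinguishes a VF from the PF. A sketch fed with a few sample lines from the lspci listing above:

```shell
# Sample lspci lines (from the listing above): one PF and two VFs
cat > /tmp/lspci.txt <<'EOF'
19:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
19:00.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
EOF

# Count the VFs; on the real DOM0 this would report 16
grep -c "Virtual Function" /tmp/lspci.txt
```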
In the VM configuration file there is information about assigning the PCI address of the InfiniBand card to the virtual machine:
[root@exa2dbadm01 ~]# grep "ib_" /EXAVMIMAGES/GuestImages/exa2adm01vm0[1-2].arrowecs.hub/vm.cfg
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]
Although the parameters are the same for both DOMUs, each virtual guest is assigned exactly one exclusive VF at system startup:
[root@exa2dbadm01 ~]# xm list
Name                          ID   Mem VCPUs      State   Time(s)
Domain-0                       0  8192     4     r----- 376916.6
exa2adm01vm01.arrowecs.hub     2 49152     8     -b---- 365576.8
exa2adm01vm02.arrowecs.hub     6 12288     2     -b---- 276681.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm01.arrowecs.hub
Vdev Device
04.0 0000:19:00.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm02.arrowecs.hub
Vdev Device
04.0 0000:19:00.2
After those assignments only 14 VFs are left:
[root@exa2dbadm01 ~]# xl pci-list-assignable-devices
0000:19:00.3
0000:19:00.4
0000:19:00.5
0000:19:00.6
0000:19:00.7
0000:19:01.0
0000:19:01.1
0000:19:01.2
0000:19:01.3
0000:19:01.4
0000:19:01.5
0000:19:01.6
0000:19:01.7
0000:19:02.0
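Since each line of that output is one free VF, the remaining guest capacity is just a line count. A sketch fed with a few sample entries from the listing above (on the real DOM0 the count would be 14):

```shell
# Sample entries from 'xl pci-list-assignable-devices' (subset of the listing above)
cat > /tmp/assignable.txt <<'EOF'
0000:19:00.3
0000:19:00.4
0000:19:02.0
EOF

# One line per free VF = number of additional guests that can still get InfiniBand
wc -l < /tmp/assignable.txt
```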
So this is actually a limitation on how many virtual machines can run in an Exadata OVM environment.
On X4, X5, and X6 you have 63 VFs.