How checkpoint queue actually works (the basic scenario)


11.06.2024
by Kamil Stawiarski

This article is true for Oracle 19.19 and 19.23 on ARM. On x86_64 there’s something funny going on with location of the buffer address in the private memory (finding appropriate register at kcbklbc call)- I need a bit more time to compare those two compilations, but the idea is the same 🙂

During my journey as an evil cloud admin, I was trying to force Oracle to write a database block that I modified directly in memory – skipping all Oracle mechanisms and audit tools.

When I was trying to achieve this, the biggest problem was to understand how the checkpoint queue actually works – where is it in the memory and how the block is being linked. Of course Oracle Core: Essential Internals for DBAs and Developers by Jonathan Lewis explains the mechanics but I needed actual function names and offsets in the structures.

So I started investigating Oracle checkpointing mechanism 🙂

Let’s start from the beginning.

We will read one block of a table into a buffer cache:

SQL> select salary
  2  from hr.employees
  3  where employee_id=100;

    SALARY
----------
     24000

Now we can dump the contents of the buffer cache to find our block details:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_3934.trc
SQL> oradebug dump buffers 1
Statement processed.

Based on data object id we can find a block metadata in our trace file:

BH (0x176f870c0) file#: 10 rdba: 0x01008915 (4/35093) class: 1 ba: 0x176514000
  set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,0
  dbwrid: 0 obj: 74271 objn: 74271 tsn: [3/1] afn: 10 hint: f
  hash: [0x1194860e0,0x1194860e0] lru: [0x176f87310,0x176f87040]
  ckptq: [NULL] fileq: [NULL]
  objq: [0x10828db90,0x10828db90] objaq: [0x10828db80,0x10828db80]
  st: XCURRENT md: NULL fpin: 'kdswh02: kdsgrp' fscn: 0x2d9cd7 tch: 2
  flags:
  LRBA: [0x0.0.0] LSCN: [0x0] HSCN: [0x0] HSUB: [65535]
  Printing buffer operation history (latest change first): (cnt 9)
  01. sid:09 L145:zib:mk:EXCL         02. sid:09 L212:zib:bic:FSQ
  03. sid:09 L122:zgb:set:st          04. sid:09 L830:olq1:clr:WRT+CKT
  05. sid:09 L951:zgb:lnk:objq        06. sid:09 L372:zgb:set:MEXCL
  07. sid:09 L123:zgb:no:FEN          08. sid:09 L083:zgb:ent:fn
  09. sid:13 L203:w_ini_dc:bic:FVB

This shows us, that the block we are looking for is located at this address: 0x176f870c0

Let’s produce a quick tool to have sneak peek into the memory contents:

[root@vrick19 ~]# cat showme.sh
#!/bin/bash
PID=$1
OFF=$2
LEN=8
MOD="+0"

if [[ $3 -gt 8 ]]; then
        LEN=$3
fi

if [[ $4 ]]; then
        MOD=$4
fi

ps aux | grep ${PID} | grep ora_ | grep -v grep

python2 << !
import binascii, struct
def showme(pid, offset, len=8):
    f = open("/proc/" + str(pid) + "/mem")
    f.seek(offset${MOD})
    x = binascii.hexlify(f.read(len))
    f.close()
    y=0
    if len==8:
        y=hex(struct.Struct("Q").unpack(binascii.unhexlify(x))[0])
    return x,y

print("Memory contents: ", showme(${PID}, int("${OFF}"), ${LEN}))

!

When you provide PID and offset to this little script, it will show you the memory contents.

If we check the first 16 bytes of our block address, we will see an interesting thing:

[root@vrick19 ~]# ./showme.sh 3637 $((16#176f870c0)) 16
('Memory contents: ', ('c070f87601000000c070f87601000000', 0))

It is the same address (0x176f870c0) repeated twice!

This is because those two first 8byte numbers are the numbers responsible for our checkpoint queue! If those two are different then the actual address – the block is on the queue! Simple and elegant 🙂

So if we want to understand which function is modifying those numbers, we can create a watchpoint in GDB:

(gdb) watch *0x176f870c0
Hardware watchpoint 1: *0x176f870c0
(gdb) c
Continuing.

Now, that we are watching this memory address with GDB, we can perform UPDATE and COMMIT on our block:

SQL> !cat u.sql
update hr.employees set salary=salary where employee_id=100
/
select xid from v$transaction
/
commit
/


SQL> @u

1 row updated.


XID
----------------
0A000A0093020000

GDB shows as follows:

Hardware watchpoint 1: *0x176f870c0

Old value = 1995993280
New value = 430851264
0x000000000bba3224 in kcbklbc ()

Based on that we know that the function that changes this memory region is called kcbklbc – we can user orafun.info to reveal it’s probable description: "kernel cache buffers checkpoint queue manipulation link buffer into checkpoint queue"

Sounds perfect!

There are two GDB commands that can reveal the mysteries of this function at this stage of execution:

  • info reg – display current values of the registers
  • disas – disassemble the current function

Based on that, we can find out that just a few instruction before, function kcbklbc is calling kcbkgcql (kernel cache buffers checkpoint queue manipulation get checkpoint queue latch)

After some digging and interpreting Assembler instructions, we can produce the following D script:

#!/usr/sbin/dtrace -s
#pragma D option quiet

pid$1::kcbklbc:entry
/pid == $1/
{
	printf("Manipulating checkpoint queue %x\n", uregs[1]);
	tracemem(uregs[0], 8*8);
	printf("--------------------------------\n");
}

pid$1::kcbkgcql:entry
/pid == $1/
{
	self->rba = *(uint*) copyin(uregs[0]+200, 4);
	self->objd = *(uint*) copyin(uregs[0]+208, 4);
	printf("\tGeting checkpoint queue latch for buffer at address 0x%x [rba: 0x%x, objd: %d]\n", uregs[0], self->rba, self->objd);
}

pid$1::kcbkgcql:return
/pid == $1/
{
	printf("\t\tGot latch address  0x%x\n", uregs[0]);

	self->dummy = *(long*) copyin(uregs[0]+160, 8);
	printf("\t\tAddress of dummy block at the end of the queue 0x%x\n", self->dummy);

	self->last_real_block = *(long*) copyin(self->dummy+8, 8);
	printf("\t\tAddress of the last real block in the queue is 0x%x\n\n", self->last_real_block);

	printf("\t\tFinal ckptq: [0x%x,0x%x]\n\n\n\n", self->dummy, self->last_real_block);
}

For x86 the DTrace script would be a bit different (because of the register names and usage):

#!/usr/sbin/dtrace -s
#pragma D option quiet

pid$1::kcbklbc:entry
/pid == $1/
{
	printf("Manipulating checkpoint queue %x\n", uregs[R_EBX]);
	tracemem(uregs[R_EBX], 8*8);
	printf("--------------------------------\n");
}

pid$1::kcbkgcql:entry
/pid == $1/
{
	self->rba = *(uint*) copyin(uregs[R_EBX]+200, 4);
	self->objd = *(uint*) copyin(uregs[R_EBX]+208, 4);
	printf("\tGeting checkpoint queue latch for buffer at address 0x%x [rba: 0x%x, objd: %d]\n", uregs[R_EBX], self->rba, self->objd);
}

pid$1::kcbkgcql:return
/pid == $1/
{
	printf("\t\tGot latch address  0x%x\n", uregs[R_EAX]);

	self->dummy = *(long*) copyin(uregs[R_EAX]+160, 8);
	printf("\t\tAddress of dummy block at the end of the queue 0x%x\n", self->dummy);

	self->last_real_block = *(long*) copyin(self->dummy+8, 8);
	printf("\t\tAddress of the last real block in the queue is 0x%x\n\n", self->last_real_block);

	printf("\t\tFinal ckptq: [0x%x,0x%x]\n\n\n\n", self->dummy, self->last_real_block);
}

Let’s try to explain what’s going on up there (I explain based on ARM registers but I’ll get the point):

  • pid$1::kcbklbc:entry – each time a process that I’m tracing is calling the function kcbklbc…
  • tracemem(uregs[0], 8*8); – print 64 bytes of memory pointed by the first register (X0 in our case since I did the test on ARM); if you look closely, you will see that there’s an address of the block to be manipulated;
  • pid$1::kcbkgcql:entry – each time the process calls kcbkgcql (the first register holds the buffer address)…
  • self->rba = *(uint*) copyin(uregs[0]+200, 4); – 200 bytes from the beginning is RBA of the block;
  • self->objd = *(uint*) copyin(uregs[0]+208, 4); – 208 bytes from the beginning is DATA_OBJECT_ID of the manipulated block;
  • pid$1::kcbkgcql:return – each time the function kcbkgcql ends…
  • printf("\t\tGot latch address 0x%x\n", uregs[0]); – print latch address, which is on the return registry (which can be matched with ADDR column from V$LATCH_CHILDREN);
  • self->dummy = *(long*) copyin(uregs[0]+160, 8); – 160 bytes from the latch address is the address of the last buffer in the queue… but not exactly – the last buffer is a DUMMY buffer from which we can traverse the whole checkpoint queue;
  • self->last_real_block = *(long*) copyin(self->dummy+8, 8); – so this is the address of the actual last block in the queue – my block will have to be between this one and the dummy one;

Once I have all those information, the function can set all appropriate values. In the Assembler it could look like this:

  • bl 0xbba4070 <kcbkgcql> – call the latch function;
  • mov x28, x0 – store value of X0 (returned value by the function) in register X28;
  • ldr x0, [x28, #160] – load the address that is located 160 bytes from the latch address returned by the latch function (this is the address of the DUMMY block);
  • ldr x11, [x0, #8] – find the previous block address from the DUMMY block – the actual last block in the queue and store in X11 register;
  • stp x11, x9, [sp, #208] – store X11 (the actual last block address) and X9 (we don’t care what’s there) 208 bytes from the beginning of the stack;
  • ldp x12, x13, [sp, #208] – load it to X12 and X13 – right now X12 contains the address of the actual last block after which I want to link;
  • ldr x0, [x12] – first 8 bytes from the X12 address is the address of the DUMMY block – remember it in X0;
  • stp x0, x12, [x19] – X19 contains address of my buffer, so I want to store the pair of next (DUMMY) and previous (actual last block) at the address of my buffer;
  • str x19, [x12] – now I can store my address at former last buffer in the queue;
  • ldr x16, [x19] – load address of the next (DUMMY) buffer in X16;
  • str x19, [x16, #8] – store my address at the position of the PREV buffer which is 8 bytes from the beginning of the DUMMY address;

A few simple instructions and we have our block linked!

Easy, right?

This is entry point for investigating at least 3 things:

  • buffer busy waits – because it occurs each time you are manipulating a queue to often;
  • the difference between ARM and x86 architectures – because Oracle seems to link blocks into the queue during different stages of transaction state (EDIT: thanks to Tanel Poder I don’t have to investigate it anymore – it has occured that on my Intel database I had supplemental logging turned on and it disabled IMU);
  • the final hack – because we are close to put a block on a checkpoint queue manually and force Oracle to write our changes 🙂

One thing is sure – the more my daughter will sleep at night the sooner I will dig into those topics 😉


Contact us

Database Whisperers sp. z o. o. sp. k.
al. Jerozolimskie 200, 3rd floor, room 342
02-486 Warszawa
NIP: 5272744987
REGON:362524978
+48 508 943 051
+48 661 966 009
info@ora-600.pl

Newsletter Sign up to be updated