Halloween started really scary this year – some company called us and said, that they have some problems with their Oracle Database Standard Edition.
Apparently they had a storage crash six months before, which caused some blocks to get corrupted and from that time no one has noticed that RMAN is not producing any backups.
But the system was working fine so no one was complaining at the business level… until the Halloween 2022.
It has occurred that twice a year they do some complex reporting and while doing it, they noticed corrupted blocks in some tablespaces.
Blockrecover would be a blessing but it is possible only in Enterprise Edition, so they tried to restore and recover affected database files from 6 months before, and it was not a surprise that the process of applying that amount of archivelogs lasted a very, very long time!
So they called us to verify if there is any possibility to resume their business process within a few hours and not… days.
Well, let’s give it a try!
SQL> select * from hr.employees;
select * from hr.employees
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 19, block # 205)
ORA-01110: data file 19:
'/u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_us
ers_ktc4b0t1_.dbf'
The errors looked similar to the one above. But there were much more database blocks 🙂 In rico2 blocks were filled with some random data:
rico2 > d
File: /u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_ktc4b0t1_.dbf(15)
Block: 205 Offsets: 0 to 512 Dba: 0x3c000cd
---------------------------------------------------------------
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
64757061 64757061 64757061 64757061 | dupadupadupadupa
Of course waiting a few days for applying hundreds of gigabytes of archivelogs was not an option… but a thought came to me, that if no one noticed this error before, those blocks were not being touched since the storage crash!
So in theory it could be possible to just restore the database files from before the crash, find the blocks and copy them from the source datafiles to the current ones.
The only problem is, how to be sure, that a block was not changed during this whole time?
It is actually pretty easy! You have to scan all archivelogs and verify if the affected blocks were ever modified. To do that, you have to have a block id.
To calculate a valid block id, you have to construct it from relative file number and block number in that file.
So if the relative file number is 15 and corrupted block number is 205, the block id can be calculated like this: 4194304*15+205 = 62914765.
4194304 is a total number of blocks in one regular datafile.
62914765 in hex is 0x3c000cd so in little endian format will have the following structure: cd 00 c0 03
If you know that, you write a simple tool to scan binary files and look for that pattern in parallel. I modified my tool focs_vm and created a new tool: focs_file.
If given archivelog contains change vectors for specified database block address (DBA), the output will look similar to this one:
[root@beta23c focs_file]# ./target/release/focs_file -P 4 -p 'cd 00 c0 03' -f /u01/app/oracle/fra/ORCL/archivelog/2022_94_ktbxmbn7_.arc
Scanning memory from 0 to 22856448 in a separate thread
Scanning memory from 22856448 to 45712896 in a separate thread
Scanning memory from 68569344 to 91425792 in a separate thread
Scanning memory from 45712896 to 68569344 in a separate thread
Scanned: 87 %
Found 296 positions in a chunk
Length: 256 (0x100) bytes
0000: cd 00 c0 03 66 0e bc 00 00 00 00 00 01 01 1d 78 ....f..........x
0010: 04 00 00 00 00 00 00 00 06 00 14 00 10 00 1c 78 ...............x
0020: 1d 78 01 00 01 80 02 00 00 00 00 00 c8 00 c0 03 .x..............
0030: 00 00 00 00 66 0e bc 00 00 80 00 00 00 00 00 00 ....f...........
0040: 00 00 00 00 2e 00 14 00 18 00 10 00 33 00 04 00 ............3...
0050: 01 22 00 00 55 b6 02 00 5e 00 00 00 10 80 20 0f ."..U...^..... .
0060: 64 00 00 00 01 00 00 00 66 0e bc 00 01 00 00 00 d.......f.......
0070: 0b df 89 2e 00 00 00 00 0d 15 01 00 13 00 01 00 ................
0080: ce 00 c0 03 66 0e bc 00 00 00 00 00 01 01 1d 78 ....f..........x
0090: 04 00 00 00 00 00 00 00 06 00 14 00 10 00 1c 78 ...............x
00a0: 1d 78 01 00 01 80 02 00 00 00 00 00 c8 00 c0 03 .x..............
00b0: 00 00 00 00 66 0e bc 00 00 80 00 00 00 00 00 00 ....f...........
00c0: 00 00 00 00 64 00 00 00 01 12 00 00 66 0e bc 00 ....d.......f...
00d0: 01 00 00 00 0b df 89 2e 00 00 00 00 0d 15 01 00 ................
00e0: 13 00 01 00 cf 00 c0 03 66 0e bc 00 00 00 00 00 ........f.......
00f0: 01 01 1d 78 04 00 00 00 00 00 00 00 06 00 14 00 ...x............
Length: 256 (0x100) bytes
0000: cd 00 c0 03 ca 00 c0 03 fa 12 03 01 01 00 00 00 ................
0010: 00 00 00 36 0b 02 01 00 13 00 01 00 cd 00 c0 03 ...6............
0020: 66 0e bc 00 00 00 00 00 02 00 1d 78 04 00 00 00 f..........x....
0030: 00 00 00 00 1c 00 18 00 32 00 02 00 06 00 04 00 ........2.......
0040: 05 00 0c 00 07 00 07 00 03 00 00 00 00 00 02 00 ................
0050: 01 0d 00 00 00 00 00 00 08 00 0b 00 73 0c 00 00 ............s...
0060: a4 19 40 02 7c 02 64 00 cd 00 c0 03 ca 00 c0 03 ..@.|.d.........
0070: fa 12 02 01 01 00 00 00 2c 01 0b 00 00 00 00 00 ........,.......
0080: 60 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 `...............
0090: 3e 00 00 00 00 00 03 00 e8 52 40 02 c2 02 bc 00 >........R@.....
00a0: 53 74 65 76 65 6e ff ff 4b 69 6e 67 53 4b 49 4e Steven..KingSKIN
00b0: 47 00 14 00 35 31 35 2e 31 32 33 2e 34 35 36 37 G...515.123.4567
00c0: 78 67 06 11 01 01 01 00 41 44 5f 50 52 45 53 00 xg......AD_PRES.
00d0: c3 03 29 00 c1 5b 02 00 08 01 00 00 01 00 00 00 ..)..[..........
00e0: 66 0e bc 00 03 00 00 00 0b df 89 2e 00 00 00 00 f...............
00f0: 05 01 20 00 12 00 ff ff a4 19 40 02 66 0e bc 00 .. .......@.f...
If not – there will be nothing to show:
[root@beta23c focs_file]# ./target/release/focs_file -P 4 -p 'cd 00 c0 03' -f /u01/app/oracle/fra/ORCL/archivelog/2022_12_23/o1_mf_1_95_ktbxvbx9_.arc
Scanning memory from 0 to 305536 in a separate thread
Scanning memory from 305536 to 611072 in a separate thread
Scanning memory from 916608 to 1222144 in a separate thread
Scanning memory from 611072 to 916608 in a separate thread
So if there were no changes found to the affected blocks, we can just restore a datafile to the new location:
RMAN> run {
2> set newname for datafile 19 to new;
3> restore datafile 19;
4> }
executing command: SET NEWNAME
Starting restore at 23-DEC-22
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=17 device type=DISK
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00019 to /u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_%u_.dbf
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fra/ORCL/F07C665E552C42C5E0530400000A57EF/backupset/2022_12_23/o1_mf_nnndf_TAG20221223T120731_ktc317h6_.bkp
channel ORA_DISK_1: piece handle=/u01/app/oracle/fra/ORCL/F07C665E552C42C5E0530400000A57EF/backupset/2022_12_23/o1_mf_nnndf_TAG20221223T120731_ktc317h6_.bkp tag=TAG20221223T120731
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished restore at 23-DEC-22
Verify the block is there and it is not corrupted:
rico2 > map
File: /u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_ktc6t6no_.dbf(16)
Block: 205 Dba: 0x40000cd
------------------------------------------------------------
DATA Table/Cluster
struct kcbh, 20 bytes @0
struct ktbbh, 72 bytes @20
struct kdbh, 14 bytes @100
struct kdbt[1], 4 bytes @114
sb2 kdbr[98] @118
ub1 freespace[874] @314
ub1 rowdata[7000] @1188
ub4 tailchk @8188
And copy it manually to the appropriate place:
>>> source = open("/u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_ktc6t6no_.dbf", "rb")
>>> target = open("/u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_ktc4b0t1_.dbf", "rb+")
>>> source.seek(205*8192)
1679360
>>> source_block=source.read(8192)
>>> target.seek(205*8192)
1679360
>>> target.write(source_block)
8192
>>> target.close()
Let’s see if Oracle can handle this brutal behavior:
RMAN> backup validate datafile 19;
Starting backup at 23-DEC-22
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=747 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00019 name=/u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_ktc4b0t1_.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
19 OK 0 133 641 12324606
File Name: /u01/app/oracle/oradata/ORCL/F07C665E552C42C5E0530400000A57EF/datafile/o1_mf_users_ktc4b0t1_.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 30
Index 0 19
Other 0 458
Finished backup at 23-DEC-22
SQL> ;
1* select /*+ FULL(e) */ count(*) from hr.employees e
SQL> /
Execution Plan
----------------------------------------------------------
Plan hash value: 1756381138
------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| EMPLOYEES | 107 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
7 consistent gets
0 physical reads
0 redo size
577 bytes sent via SQL*Net to client
108 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Now the customer has to do some tests and verify that everything works good, but the processing can go on 🙂
This is extremely risky and I wouldn’t recommend it unless you are completely fucked and have no other possibility of rescuing your data! The business has to be aware that this is not supported and that the only supported method of fixing this problem is just to wait for archivelogs to be applied! But sometimes each hour costs a lot of money and it’s worth the risk.
It was another case of Schrödinger’s Backup 🙂