【Oracle数据库恢复】ORA-00600[25026】错误解析

该ORA-00600[25026]错误一般是当ORACLE查看检测tablespace表空间时发现一个无效的tablespace ID或者RDBA所引起,其2个变量的含义为:

Arg [a] — tablespace ID
Arg [b] — rdba

 

ORA-00600[25026]错误属于Kernel File management Tablespace component模块,其可能的影响包括 实例失败,可能的物理块损坏等。

其相关的BUG列表如下:

 

如果自己搞不定可以找诗檀软件专业ORACLE数据库修复团队成员帮您恢复!

诗檀软件专业数据库修复团队

服务热线 : 13764045638   QQ号:47079569    邮箱:service@parnassusdata.com

 

NB Bug Fixed Description
13503554 11.2.0.4, 12.2.0.0 Various ORA-600 errors crashing the apply process in a downstreams environment
17604137 12.1.0.2, 12.2.0.0 ORA-600 [25026] when running query on table being dropped
12912297 11.2.0.3.BP07, 11.2.0.4, 12.1.0.1 CREATE GLOBAL TEMPORARY TABLE incorrectly allows a tablespace group causing ORA-600 [25026] on insert
+ 9399991 11.1.0.7.5, 11.2.0.1.3, 11.2.0.1.BP04, 11.2.0.2, 12.1.0.1 Assorted Internal Errors and Dumps (mostly under kkpa*/kcb*) from SQL against partitioned tables
8630146 10.2.0.5, 11.2.0.1 OERI [25026] / OERI [25012] by SMON while recovering a transaction against a dropped tablespace
6249879 10.2.0.5, 11.2.0.1 OERI [kcbgcur_1] / [25026] / ORA-1502 from DML on table with CONSTRAINTS
4545196 10.2.0.4, 11.1.0.6 Corrupt index leaf by RMAN CONVERT from cross platform transport of compress-key indexes

【Oracle数据库恢复】ORA-00600[25027]错误解析

ORA-00600[25027]错误的触发原因是ORACLE检测到一个无效的表空间号TSN Tablespace Number或者相对文件号Relative File Number。

该ORA-00600[25027]的2个变量各代表:

arg[a] Tablespace Number表空间号

arg[b] 十进制的相对数据块号Relative Data Block Address (RDBA)

 

该ORA-00600[25027]错误相关的模块为Kernel File management Tablespace component,其影响为可能的物理块损坏。

当该错误触发后 如果 arg[b] 即RDBA为0,则该错误可能由于索引问题引起。

可以使用如下查询来获得有问题的索引:

 

 select do.owner,do.object_name, do.object_type,sysind.flags
     from dba_objects do, sys.ind$ sysind
     where do.object_id = sysind.obj#
     and bitand(sysind.flags,4096)=4096;

如果上面的查询返回了数据行,则建议用户进一步检查查询所获得的对象,并考虑drop这些对象来绕过错误。

 

进一步可以对trace文件中指向的表做一个analyze table validate structure cascade,来进一步确认该问题。

与ORA-00600[25027]相关的一些BUG列表如下:

 

如果自己搞不定可以找诗檀软件专业ORACLE数据库修复团队成员帮您恢复!

 

诗檀软件专业数据库修复团队

 

服务热线 : 13764045638   QQ号:47079569    邮箱:service@parnassusdata.com

 

NB Bug Fixed Description
14010183 11.2.0.3.BP22, 11.2.0.4.BP03, 12.1.0.2, 12.2.0.0 ORA-600 [ktspfundo:objdchk_kcbgcur_3] in SMON after failed temp segment merge load
13503554 11.2.0.4, 12.2.0.0 Various ORA-600 errors crashing the apply process in a downstreams environment
13785716 11.2.0.4, 12.1.0.1 Intermittent ORA-600 [25027] during upgrade from 10.2 to 11.2
11661824 11.2.0.1.BP09 Assorted Dumps by SQL*LOADER using DIRECT and PARALLEL after exadata bp8 is applied
10067246 12.2.0.0 ORA-600 [25027] ORA-7445 [kauxs_do_dml_cooperation] by CREATE INDEX ONLINE
14138130 11.2.0.3.5, 11.2.0.3.BP13, 11.2.0.4, 12.1.0.1 SGA memory corruption / ORA-7445 when modifying uncompressed blocks of an HCC-compressed segment
13330018 11.2.0.4, 12.1.0.1 ora-600 [ktspfmb_add1], [4294959240] occurred, then cannot recover with ora-600[25027]
13103913 11.2.0.2.BP15, 11.2.0.3.3, 11.2.0.3.BP03, 11.2.0.4, 12.1.0.1 ORA-600 [25027] [ts#] [1] or false ORA-1 during dml while index is being rebuilt online
10394825 11.2.0.3, 12.1.0.1 ORA-600[25027] [..] [0] inserting to ASSM segment
10329146 11.2.0.1.BP10, 11.2.0.2.2, 11.2.0.2.BP03, 11.2.0.2.GIBUNDLE02, 11.2.0.2.GIPSU02, 11.2.0.3, 12.1.0.1 Lost write in ASM with multiple DBWs and a disk is offlined and then onlined
+ 10209232 11.1.0.7.7, 11.2.0.1.BP08, 11.2.0.2.1, 11.2.0.2.BP02, 11.2.0.2.GIBUNDLE01, 11.2.0.3, 12.1.0.1 ORA-1578 / ORA-600 [3020] Corruption. Misplaced Blocks and Lost Write in ASM
+ 9399991 11.1.0.7.5, 11.2.0.1.3, 11.2.0.1.BP04, 11.2.0.2, 12.1.0.1 Assorted Internal Errors and Dumps (mostly under kkpa*/kcb*) from SQL against partitioned tables
* 9145541 11.1.0.7.4, 11.2.0.1.2, 11.2.0.2, 12.1.0.1 OERI[25027]/OERI[4097]/OERI[4000]/ORA-1555 in plugged datafile after CREATE CONTROLFILE in 11g
8837919 11.2.0.2, 12.1.0.1 DBV / RMAN enhanced to detect ASSM blocks with ktbfbseg but not ktbfexthd flag set as in Bug 8803762
8803762 11.1.0.7.6, 11.2.0.1.2, 11.2.0.1.BP06, 11.2.0.2, 12.1.0.1 ORA-600[kdsgrp1], ORA-600[25027] or wrong results on 11g database upgrade from 9i
8716064 11.2.0.2, 12.1.0.1 Analyze Table Validate Structure fails on ADG standby with several errors
+ 8597106 11.2.0.1.BP06, 11.2.0.2, 12.1.0.1 Lost Write in ASM when normal redundancy is used
7251049 11.2.0.1.BP08, 11.2.0.2, 12.1.0.1 Corruption in bitmap index introduced when using transportable tablespaces
8437213 10.2.0.4.3, 10.2.0.5, 11.1.0.7.7, 11.2.0.1 ASSM first level bitmap block corruption
8356966 11.2.0.1 ORA-7445 [kdr9ir2rst] by DBMS_ADVISOR or false ORA-1498 by ANALYZE on COMPRESS table
* 8198906 10.2.0.5, 11.2.0.1 OERI [kddummy_blkchk] / OERI [5467] for an aborted transaction of allocating extents
* 7263842 10.2.0.4.2, 10.2.0.5, 11.1.0.7.1, 11.2.0.1 ORA-955 during CTAS / OERI [ktsircinfo_num1] / dictionary inconsistency for PARTITIONED Tables
6666915 10.2.0.5, 11.1.0.7, 11.2.0.1 OERI[25027] / dictionary corruption from concurrent partition DDL
6025993 10.2.0.5, 11.1.0.6 ORA-600 [25027] in flashback archiving queries
4925342 9.2.0.8, 10.2.0.3, 11.1.0.6 OERI [25027] / OERI [25012] on IOT analyze estimate statistics
* 7190270 10.2.0.4.1, 10.2.0.5 Various ORA-600 errors / dictionary inconsistency from CTAS / DROP
4310371 9.2.0.8, 10.2.0.2 OERI [25027] from concurrent startup / shutdown in RAC
4177651 10.2.0.1 Row migration within a MERGE may OERI[25027]
4020195 10.1.0.5, 10.2.0.1 OERI 25027 can occur in RAC accessing transported tablespace
4000840 9.2.0.7, 10.1.0.4, 10.2.0.1 Update of a row with more than 255 columns can cause block corruption
3963135 10.1.0.5, 10.2.0.1 OERI[kcbgcur_3] / OERI:25027 during bitmap index updates
3829900 10.1.0.4, 10.2.0.1 OERI[25027] possible accessing index in 10g
2942185 9.2.0.6, 10.1.0.4, 10.2.0.1 Corruption occurs on direct path load into IOT with ADDED columns
3085057 10.1.0.2 ORA-600: [25027] from ALTER TABLE .. SHRINK SPACE CASCADE
2926182 9.2.0.5, 10.1.0.2 OERI[25027] / ORA-22922 accessing LOB columns in IOT in AFTER UPDATE trigger

 

【ORACLE数据库恢复】ORA-00600[KCLCHKBLK]

ORA-600: internal error code, arguments: [kclchkblk_3/4]错误的相关症状可能如下:

QQ截图20140404232335

 

  • 当从备份中restore过数据库并做了一个incomplete recovery不完全恢复时
  • 以resetlogs 选项打开数据库
  • 在打开数据库之后可能遇到以下的错误:
      • ORA-00600 [kclchkblk_4]
        ORA-00600 [2662]
  • 其错误堆栈stack trace 为kclchkblk kcbzib kcbgcur ktfbhget ktftfcload

 

触发该错误的原因是:

 

ORA-600[KCLCHKBLK_4]是由临时文件中的块上的SCN过高引起,  ORA-600[2662]错误也是类似的原因。

该问题可能是由于在OPEN RESETLOGS之后临时文件temporary file没有正确初始化所引起

这个ORA-00600[KCLCHKBLK]中的KCLCHKBLK 代表 check block scn after a disk read

ORA-600 [kclchkblk_3] [a] [b] [c]

The error is reported when the block SCN is less than the last block SCN.
Called by the cache layer after reading a block from disk.

The block SCN is checked to make sure it is not greater than the recent
SCN and not less than the last block SCN seen.

ARGUMENTS:
Arg [a]  Current SCN WRAP
Arg [b]  Current SCN BASE
Arg [c]  Block type

FUNCTIONALITY:
Check Block scn

IMPACT:
MEMORY CORRUPTION
POSSIBLE PHYSICAL CORRUPTION

 

 

 

如果自己搞不定可以找ASKMACLEAN专业ORACLE数据库修复团队成员帮您恢复!

 

已知的一些ORA-00600[KCLCHKBLK3/4]的相关BUG如下:

 

NB Bug Fixed Description
17273253 12.1.0.1.1 Various ORA-600 corruption errors with ASM
14034244 11.2.0.3.BP09, 11.2.0.4, 12.1.0.1 Lost write type corruption using ASM in 11.2.0.3
12718090 11.2.0.3.1, 11.2.0.3.BP02, 11.2.0.4, 12.1.0.1 ORA-600 [kclchkblk_3] can occur in RAC
+ 10425010 11.2.0.3, 12.1.0.1 Stale data blocks may be returned by Exadata FlashCache
* 10205230 11.2.0.1.6, 11.2.0.1.BP09, 11.2.0.2.2, 11.2.0.2.BP04, 11.2.0.3, 12.1.0.1 ORA-600 / corruption possible during shutdown in RAC
10071193 11.2.0.2.BP02, 11.2.0.3, 12.1.0.1 Lost write / ORA-600 [kclchkblk_3] / ORA-600 [3020] in RAC – superceded
+ 8597106 11.2.0.1.BP06, 11.2.0.2, 12.1.0.1 Lost Write in ASM when normal redundancy is used
12905058 11.2.0.3.1, 11.2.0.3.BP02, 11.2.0.4.1, 11.2.0.4.BP02 Rare ASM corruption after disk resync
4767885 10.2.0.1 OERI [kclchkblk_3] possible in RAC
3601253 9.2.0.6, 10.1.0.2 OERI[kclchkblk_3] possible accessing transported tablespace in RAC
2000029 9.0.1.3, 9.2.0.1 OERI:KCLCHKBLK_3 possible in RAC after reconfiguration

 

 

NB Bug Fixed Description
14351566 11.2.0.3.8, 11.2.0.3.BP21, 11.2.0.4, 12.1.0.1 ORA-600 [kclchkblk_4] when doing flash back

 

NB Bug Fixed Description
10112810 11.2.0.3, 12.1.0.1 ORA-600[kjbcrc:g] / ORA-600[kclchkblk_2] / [kjbconvert:modes] With Flash Cache Enabled on RAC
2873045 9.2.0.4, 10.1.0.2 OERI:[KCLCHKBLK_2] possible in RAC

【数据恢复】NOLOGGING UNRECOVERABLE ORA-26040解析

 

 

SQL> select count(*) from abc;
select count(*) from abc
*
第 1 行出现错误:
ORA-01578: ORACLE 数据块损坏 (文件号 17, 块号 131)
ORA-01110: 数据文件 17:
‘C:\APP\XIANGBLI\ORADATA\MACLEAN\DATAFILE\O1_MF_NLOGGING_9475OCS5_.DBF’
ORA-26040: 数据块是使用 NOLOGGING 选项加载的

 
SQL> select UNRECOVERABLE_CHANGE# , UNRECOVERABLE_time from v$datafile where file#=17;

UNRECOVERABLE_CHANGE# UNRECOVERABLE_
——————— ————–
6486756 26-9月 -13

 

把 (文件号 17, 块号 131) dump出来看一下

 

 

*** 2013-09-26 10:07:46.584
Start dump data blocks tsn: 17 file#:17 minblk 131 maxblk 131
Block dump from cache:
Dump of buffer cache at level 4 for pdb=0 tsn=17 rdba=71303299
Block dump from disk:
buffer tsn: 17 rdba: 0x04400083 (17/131)
scn: 0x0.62faac seq: 0xff flg: 0x04 tail: 0xfaac00ff
frmt: 0x02 chkval: 0xa2a1 type: 0x00=unknown
Hex dump of block: st=0, typ_found=0
Dump of memory from 0x000000000BFF2200 to 0x000000000BFF4200
00BFF2200 0000A200 04400083 0062FAAC 04FF0000 [……@…b…..]
00BFF2210 0000A2A1 FFFFFFFF FFFFFFFF FFFFFFFF […………….]
00BFF2220 FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF […………….]
Repeat 508 times
00BFF41F0 FFFFFFFF FFFFFFFF FFFFFFFF FAAC00FF […………….]
End dump data blocks tsn: 17 file#: 17 minblk 131 maxblk 131

 

scn: 0x0.62faac seq: 0xff

==》 对应的SCN为6486700,可以看到内容除了头部一点外 全是0XFF

 

dump 对应redo logfile 可以看到

REDO RECORD – Thread:1 RBA: 0x000074.00015418.0078 LEN: 0x003c VLD: 0x01 CON_UID: 0
SCN: 0x0000.0062faac SUBSCN: 1 09/26/2013 10:04:39
CHANGE #1 INVLD CON_ID:0 AFN:17 DBA:0x04400083 BLKS:0x000d OBJ:123054 SCN:0x0000.0062faac SEQ:1 OP:19.2 ENC:0
Direct Loader invalidate block range redo entry

 

OP:19.2=》Layer 19 : Direct Loader Log Blocks – KCOCODLB Opcode 2 : Invalidate range – KCBLCOIR

==》这里在redo里标记了 直接路径加载造成块失效的范围,在redo logfile dump中可以看到大量类似数据

 

即当recover时读取redo,读到“Direct Loader invalidate block range redo entry”信息时,则将对应的数据块的内容除了kcbh头部外全部记录为0XFF

当Oracle读取到这些块时就会知道这些块是SOFT Corrupt ,原因是nologging造成的。

Block is marked as SOFT Corrupt and has 0xff along the block. This is the
format used by Oracle to mark a block as corrupt due to redo invalidation
(NOLOGGING).
NOLOGGING OPERATION in redo:

 

 

利用RMAN检测数据库坏块的脚本

虽然我们也可以通过dbv(db file verify)工具做到对单个数据文件的坏块检测,但是直接使用RMAN的”backup validate check logical database;”结合V$DATABASE_BLOCK_CORRUPTION视图要方便地多。

如果自己搞不定可以找ASKMACLEAN专业ORACLE数据库修复团队成员帮您恢复!

Script:

1) $ rman target / nocatalog

2) RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup validate check logical database;
}

3) select * from V$DATABASE_BLOCK_CORRUPTION ;

REM www.askmaclean.com & www.askmaclean.com

4) If V$DATABASE_BLOCK_CORRUPTION contains rows please run this query to 
find the objects that contains the corrupted blocks:

SELECT e.owner,
       e.segment_type,
       e.segment_name,
       e.partition_name,
       c.file#,
       greatest(e.block_id, c.block#) corr_start_block#,
       least(e.block_id + e.blocks - 1, c.block# + c.blocks - 1) corr_end_block#,
       least(e.block_id + e.blocks - 1, c.block# + c.blocks - 1) -
       greatest(e.block_id, c.block#) + 1 blocks_corrupted,
       null description
  FROM dba_extents e, v$database_block_corruption c
 WHERE e.file_id = c.file#
   AND e.block_id <= c.block# + c.blocks - 1
   AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner,
       s.segment_type,
       s.segment_name,
       s.partition_name,
       c.file#,
       header_block corr_start_block#,
       header_block corr_end_block#,
       1 blocks_corrupted,
       'Segment Header' description
  FROM dba_segments s, v$database_block_corruption c
 WHERE s.header_file = c.file#
   AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner,
       null segment_type,
       null segment_name,
       null partition_name,
       c.file#,
       greatest(f.block_id, c.block#) corr_start_block#,
       least(f.block_id + f.blocks - 1, c.block# + c.blocks - 1) corr_end_block#,
       least(f.block_id + f.blocks - 1, c.block# + c.blocks - 1) -
       greatest(f.block_id, c.block#) + 1 blocks_corrupted,
       'Free Block' description
  FROM dba_free_space f, v$database_block_corruption c
 WHERE f.file_id = c.file#
   AND f.block_id <= c.block# + c.blocks - 1
   AND f.block_id + f.blocks - 1 >= c.block#
 order by file#, corr_start_block#;

SELECT tablespace_name, segment_type, owner, segment_name
FROM dba_extents
WHERE file_id = &fileid
and &blockid between block_id AND block_id + blocks - 1;

Oracle内部错误:ORA-00600[13013][5001]故障诊断一例

周五被叫到客户现场解决一套10.2.0.4 RAC数据库上的ORA-00600[13013]内部错误问题,这个问题同事已经在上午通过远程拨号了解过情况,
初步判断是索引存在讹误corruption引起的600。

前期诊断

同事在我抵达现场之前已经做了初步的诊断,该10.2.0.4上的RAC系统主用节点的告警日志中多次出现ORA-00600:[13013], [5001]、ORA-00600:[qertbFetchByRowID]及ORA-00600: [25027] 等内部错误,具体的日志如下:

Fri Sep 16 01:16:54 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Fri Sep 16 01:16:55 2011
Trace dumping is performing id=[cdmp_20110916011655]
Fri Sep 16 01:17:06 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
以下为13013 trace
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Current SQL statement for this session:
UPDATE CUST_SUBSCRB_PERSONAL A SET a.cust_mobile='A04204441' WHERE a.subscrbid=71524739
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               1000D8FE4 ? 0785F7165 ?
10501ED28 ? FFFFFFFFFFF6748 ?
06C626E72 ?
ksesic6+0060         bl       kgesiv               110072D08 ? 7000000AC52B8F8 ?
7000000AC52B3A8 ? 07FFFFFFF ?
000000000 ?
updThreePhaseExe+0c  bl       01F9D7A8
4c
updexe+041c          bl       updThreePhaseExe     7000000AC52B8F8 ? 000000000 ?
110540128 ? FFFFFFFFFFF7C80 ?
opiexe+27d8          bl       updexe               7000000AC52B8F8 ?
FFFFFFFFFFF82B8 ?
kpoal8+0edc          bl       opiexe               FFFFFFFFFFFB434 ?
FFFFFFFFFFFB198 ?
FFFFFFFFFFF9608 ?
opiodr+0ae0          bl       _ptrgl
ttcpip+1020          bl       _ptrgl
opitsk+1124          bl       01F9F2A0
opiino+0990          bl       opitsk               000000000 ? 000000000 ?
============
Plan Table
============
------------------------------------------------------+-----------------------------------+
| Id  | Operation           | Name                    | Rows  | Bytes | Cost  | Time      |
------------------------------------------------------+-----------------------------------+
| 0   | UPDATE STATEMENT    |                         |       |       |     1 |           |
| 1   |  UPDATE             | CUST_SUBSCRB_PERSONAL   |       |       |       |           |
| 2   |   INDEX UNIQUE SCAN | CUST_SUBSCRB_PERSONAL_PK|     1 |    33 |     1 |  00:00:01 |
------------------------------------------------------+-----------------------------------+
===============================================================================================================
同时还伴随有ORA-00600:[qertbFetchByRowID]出现
Fri Sep 16 01:08:57 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT * from CUST_SUBSCRB_PERSONAL A WHERE a.subscrbid=307557025
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              578318D500000003 ?
FFFFFFFFFFF80D0 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgerinv+00dc         bl       _ptrgl
kgeasnmierr+004c     bl       kgerinv              11048AE68 ? 000000000 ?
000000001 ? 104BDEC18 ?
110489398 ?
qertbFetchByRowID+0  bl       03F2EF38
d34
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?
============
Plan Table
============
---------------------------------------------------------------+-----------------------------------+
| Id  | Operation                    | Name                    | Rows  | Bytes | Cost  | Time      |
---------------------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT             |                         |       |       |     1 |           |
| 1   |  TABLE ACCESS BY INDEX ROWID | CUST_SUBSCRB_PERSONAL   |     1 |   287 |     1 |  00:00:01 |
| 2   |   INDEX UNIQUE SCAN          | CUST_SUBSCRB_PERSONAL_PK|     1 |       |     1 |  00:00:01 |
---------------------------------------------------------------+-----------------------------------+
===============================================================================================================
此外还会出现ORA-00600: [25027], [6], [1443670797], [], [], [], [], []
Fri Sep 16 15:06:00 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_1872436.trc:
ORA-00600: internal error code, arguments: [25027], [6], [1443670797], [], [], [], [], []
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [25027], [6], [4278267905], [], [], [], [], []
Current SQL statement for this session:
select q.subscrbid,q.serv_lvl,q.cust_lvl
from cust_subscrb_personal q
where q.serv_lvl is null
or q.cust_lvl is null
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               000000000 ? 000000000 ?
000000000 ? 105673724 ?
00000000C ?
ksesic2+0060         bl       kgesiv               000000245 ? 0FFFFFFFF ?
700000010013BD0 ?
700000010018078 ? 110000998 ?
krtd2abh+040c        bl       ksesic2              61C3000061C3 ? 000000000 ?
000000006 ? 000000000 ?
0FF013001 ? 000000FA0 ?
105673FB8 ? 105673FA8 ?
kcbgtcr+24a0         bl       krtd2abh             700000471D95738 ?
7000001A9F0F050 ?
FFFFFFFFFFF78E0 ?
700000471D93500 ? 000000003 ?
ktrget+04ac          bl       kcbgtcr              1105A75A8 ? 000002000 ?
000000000 ? 000001940 ?
kdsgrp+00bc          bl       ktrget               000000064 ? 110000998 ?
700000010008000 ?
kdsgnp+0444          bl       kdsgrp               000000000 ? 000000000 ?
FFFFFFFFFFF80D0 ?
kafger+08fc          bl       kdsgnp               000000000 ? 000000000 ?
111119E38 ?
kdstf1100101km+0f58  bl       kafger               1105A74D8 ? 111119E38 ?
FFFFFFFFFFF8190 ? 000000000 ?
70000046DCA8488 ? 000000000 ?
0FFFF9608 ? 110471948 ?
kdsttgr+1a04         bl       kdstf1100101km       111119E38 ? 0000102A1 ?
70000042DF6DC50 ?
FFFFFFFFFFF8418 ? 0FFFFFFFF ?
000001FE8 ? 000000000 ?
000000000 ?
qertbFetch+09b8      bl       kdsttgr              111119E38 ? 000000418 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000418 ? BE4610195B2C ?
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?
============
Plan Table
============
--------------------------------------------------+-----------------------------------+
| Id  | Operation          | Name                 | Rows  | Bytes | Cost  | Time      |
--------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |                      |       |       |  4898 |           |
| 1   |  TABLE ACCESS FULL | CUST_SUBSCRB_PERSONAL|    15 |   165 |  4898 |  00:00:59 |
--------------------------------------------------+-----------------------------------+

我们先从ORA-00600:[13013], [5001]内部错误入手,Mos Note <How to resolve ORA-00600 [13013], [5001] [ID 816784.1]> 比较翔实地介绍了该600错误,其argument的具体含义如下:

Arg [a] Passcount
Arg [b] Data Object number
Arg [c] Tablespace Relative DBA of block containing the row to be updated
Arg [d] Row Slot number
Arg [e] Relative DBA of block being updated (should be same as [c])
Arg [f] Code

我们这里的[13013], [5001], [66209], [50730514], [23], [50730514], [3], [] 就是

data_object_id = 66209 = 0x102A1
RDBA = 50730514 即  3061612 fild id=12  block id =398866
Row Slot number = 23
Code = 3

翻译过来就是对data_object_id为66209 的对象在12号数据文件上的398866块的第23行做代码为3的检测时发现了异常,通过data_object_id可以定位到该对象就是正在执行的SQL语句中的CUST_SUBSCRB_PERSONAL表。

ORA-00600:[13013], [5001]可能由表上的数据行或索引中的记录逻辑讹误(logical Corruption)所造成,这里要强调一下是逻辑讹误,而非物理损坏。逻辑讹误一般是由于Oracle Bug或者memory Corruption 所引起的,该ORA-00600:[13013], [5001]已知可能由5085288和4549673等多个Bug 引发,在版本10.2.0.4 上这些bug 触发概率较高。

伴随ORA-00600:[13013], [5001]发生的ORA-600 [qertbFetchByRowID]常由索引损坏引起,介绍了该问题:

ORA-600 [qertbFetchByRowID] Select Queries on 10.2.0.4 Database [ID 755592.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.4 and later   [Release: 10.2 and later ]
Symptoms
The following error was received during SELECT operation:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
The call stack contains:
qertbFetchByRowID <- qergiFetch <- rwsfcd <- qerhjFetch
< - qerjotFetch <- rwsfcd <- qertqoFetch <- qerpxSlaveFetch <- qerpxFetch
Cause
Index corruption.
Solution
1.
Analyze the affected object to look for logical corruption.
SQL> analyze table TABLENAME validate structure cascade;
This will place an exclusive lock on the table.
If an error is reported try dropping and recreating the index(es) first.
2.
If this problem is occurring on a version less than 10.2.0.4, ensure the patch for
Bug 4883635 has been applied if Materialized Views are used.
BUG:5866783 - ORA-00600: [QERTBFETCHBYROWID] ON SELECT FROM ONE TABLE

而ORA-00600: [25027]也是类似的问题:

ORA-600 [25027] [ID 284433.1]
PURPOSE:
This article represents a partially published OERI note.
It has been published because the ORA-600 error has been
reported in at least one confirmed bug.
Therefore, the SUGGESTIONS section of this article may help
in terms of identifying the cause of the error.
This specific ORA-600 error may be considered for full publication
at a later date. If/when fully published, additional information
will be available here on the nature of this error.
ERROR:
Format: ORA-600 [25027] [a] [b]
VERSIONS:
versions 9.2 and above
ARGUMENTS:
Arg [a]  Tablespace Number (TSN)
Arg [b]  Decimal Relative Data Block Address (RDBA)
SUGGESTIONS:
1. If the Arg [b] (the RDBA) is 0 (zero), then this could be due to fake indexes.
The following query will list fake indexes:
select do.owner,do.object_name, do.object_type,sysind.flags
from dba_objects do, sys.ind$ sysind
where do.object_id = sysind.obj#
and bitand(sysind.flags,4096)=4096;
If the above query returns any rows, check the objects involved and consider
dropping them as they can cause this error. 
2. Run analyze table validate structure on the table referenced in the Current SQL statement in
the related trace file.
If the Known Issues section below does not help in terms of identifying
a solution, please submit the trace files and alert.log to Oracle
Support Services for further analysis.

但是请注意在该故障示例中ORA-00600: [25027]给出的Arg [b] Decimal Relative Data Block Address (RDBA)是一个完全不相干的RDBA,具体原因尚不明确。

小插曲

这当中有一个小的插曲,原来客户这里的应用人员需要尽快在该CUST_SUBSCRB_PERSONAL表上执行一段查询语句,但是该语句一旦执行就会遇到ORA-600 [qertbFetchByRowID]错误,因为是周五小周末,所以如果今天无法运行的话,就要拖到下个礼拜了。

所以被要求优先解决该语句执行的问题,首先看了一下该语句的执行计划:

SQL> explain plan for select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
Explained.
SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
------------------------------------------------
Plan hash value: 3616548176
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |     1 |    31 |   231   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE               |                           |     1 |    31 |            |          |
|*  2 |   HASH JOIN                   |                           |   425 | 13175 |   231   (1)| 00:00:03 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |   425 |  8075 |   208   (0)| 00:00:03 |
|*  4 |     INDEX RANGE SCAN          | CUST_SUBSCRB_PERSONAL_2IX | 42462 |       |    20   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL          | WZY_BS20110916            | 39792 |   466K|    22   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("A"."SVCNUM"="B"."SVCNUM")
3 - filter(TO_NUMBER("A"."SERV_LVL")=0)
4 - access("A"."COUNTYID"='A00')
Note
-----
- dynamic sampling used for this statement
23 rows selected.
SQL> select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

以上执行计划中对表上的索引CUST_SUBSCRB_PERSONAL_2IX做了range scan后通过获得的rowid到表上去fetch记录(qertbFetchByRowID),在实际fetch by rowid 的时候引发了ORA-600错误。

因为需求较为紧急,所以我还是考虑能否使执行计划绕过该索引,投石问路给语句加上了RULE提示,再次执行:

SQL> select /*+ rule */
2  count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
3  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
COUNT(*)
----------
11559
Execution Plan
----------------------------------------------------------
Plan hash value: 2851452146
-------------------------------------------------------------------
| Id  | Operation                     | Name                      |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |
|   1 |  SORT AGGREGATE               |                           |
|   2 |   NESTED LOOPS                |                           |
|   3 |    TABLE ACCESS FULL          | WZY_BS20110916            |
|*  4 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |
|   5 |     AND-EQUAL                 |                           |
|*  6 |      INDEX RANGE SCAN         | CUST_SUBSCRB_SVCNUM_2006  |
|*  7 |      INDEX RANGE SCAN         | CUST_SUBSCRB_PERSONAL_2IX |
-------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter(TO_NUMBER("A"."SERV_LVL")=0)
6 - access("A"."SVCNUM"="B"."SVCNUM")
7 - access("A"."COUNTYID"='A00')
Note
-----
- rule based optimizer used (consider using cbo)

想不到居然执行成功了,但是执行计划当中仍有CUST_SUBSCRB_PERSONAL_2IX这个索引,这让我潜意识中认识到很可能是表而非索引存在逻辑讹误。

不管怎么说至少解了燃眉之急,先把这个方法告诉应用人员,然后回过头来继续诊断。

初步验证索引

为了确定到底是表还是索引存在逻辑讹误,一般需要使用analyze table validate structure cascade命令以验证表和索引的结构,但是该命令会要求以共享方式锁住表(TM mode=4),对于更新频繁的生产系统中的关键应用表,这是不可接受的。同事在接手这个故障时已经考虑过该问题了,所以他推荐用查询的方式来检验到底是那些索引存在corruption,具体方法如下:

通过ROWID_CREATE 构造出ROWID
SQL> select dbms_rowid.ROWID_CREATE(1,66209,12,398866,23) from dual;
DBMS_ROWID.ROWID_C
------------------
AAAQKhAAMAABhYSAAX
SQL> select * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> SELECT INDEX_NAME,COLUMN_NAME FROM DBA_IND_COLUMNS WHERE TABLE_NAME='CUST_SUBSCRB_PERSONAL' ORDER BY 1;
INDEX_NAME                     COLUMN_NAME
------------------------------ ----------------------------------------
CUST_SUBSCRB_CARD_NUM          VIP_CARD_NUM
CUST_SUBSCRB_MANAGERID_2006    MANAGERID
CUST_SUBSCRB_PERSONAL_2IX      COUNTYID
CUST_SUBSCRB_PERSONAL_3IX      CUST_LVL
CUST_SUBSCRB_PERSONAL_PK       SUBSCRBID
CUST_SUBSCRB_SERV_COUNTY       SERV_COUNTYID
CUST_SUBSCRB_SVCNUM_2006       SVCNUM
IDX_CUST_SUBSCRB_PERSONAL_01   SERV_LVL
之后强制使用index提示使用不同的索引
SQL> select /*+ INDEX(a CUST_SUBSCRB_CARD_NUM ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_MANAGERID_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_3IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
SQL> select /*+ INDEX(a CUST_SUBSCRB_SERV_COUNTY) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
SQL> select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

可以看到使用以上方法在其中4个索引上遇到了ORA-00600:[qertbFetchByRowID]。 但是使用该方法还是无法彻底搞清楚逻辑到底存在于table还是index上?

 

必要的工具

 

这里我们要介绍一下validate structure还存在online在线使用的选项,在online模式下整个验证过程不会在表上加任何TM锁,这一点具体可以使用10704 事件来证明:

 

10704, 00000, "Print out information about what enqueues are being obtained"
// *Cause:  When enabled, prints out arguments to calls to ksqcmi and
//          ksqlrl and the return values.
// *Action: Level indicates details:
//   Level: 1-4: print out basic info for ksqlrl, ksqcmi
//          5-9: also print out stuff in callbacks:  ksqlac, ksqlop
//          10+: also print out time for each line
SQL> create table validate_me (t1 int);
Table created.
SQL> create index ind_validate_me on validate_me(t1);
Index created.
SQL> insert into validate_me select rownum  from dba_tables where rownum<201;
200 rows created.
SQL> commit;
Commit complete.
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug event 10704 trace name context forever,level 10;
Statement processed.
SQL> analyze table validate_me validate structure cascade online;
Table analyzed.
SQL> oradebug tracefile_name;
c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_3020.trc
g11r2_ora_3020.trc=========================================================================
*** 2011-09-18 20:55:25.373
Oradebug command 'event 10704 trace name context forever,level 10' console output: <none>
*** 2011-09-18 20:55:49.765
ksqgtl *** TX-00060005-000006a9 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x31434194, ktcdix=2147483647, topxcb=0x31434194
ktcipt(topxcb)=0x0
*** 2011-09-18 20:55:49.766
ksucti: init txn DID from session DID
ksqgtl:
ksqlkdid: 0001-001B-00000006
*** 2011-09-18 20:55:49.766
*** ksudidTrace: ksqgtl
ktcmydid(): 0001-001B-00000006
ksusesdi:   0000-0000-00000000
ksusetxn:   0001-001B-00000006
ksqgtl: RETURNS 0
*** 2011-09-18 20:55:49.766
ksqrcl: TX,60005,6a9
ksqrcl: returns 0

但是validate strucutre online也有它的缺点,那就是在线模式下结构验证命令将不填充索引的状态信息到index_stats视图,如:

SQL> analyze index  ind_validate_me validate structure  ;
Index analyzed.
SQL> select count(*) from index_stats;
COUNT(*)
----------
1
SQL> conn / as sysdba
Connected.
SQL> analyze index  ind_validate_me validate structure  online;
Index analyzed.
SQL> select count(*) from index_stats;
COUNT(*)
----------
0

但是因为我们这里只要用到validate structure的结构验证功能,而对索引的详细状态没有兴趣,所以我们可以充分利用该online模式。

 

具体验证

 

使用validate structure online具体验证该问题表和表上的索引:

先仅对表进行验证,以区分到底是表还是索引存在逻辑讹误 
SQL> analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online;
analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online
*
ERROR at line 1:
ORA-01498: block check failure - see trace file
若验证发现问题会出现ORA-01498错误,并产生trace 文件
kdrchk:  row is marked as both a Key and Clustered
prow=0x7000001f241c45c flag=0xff
Block Checking: DBA = 50730514, Block Type = KTB-managed data block
data header at 0x7000001f241c07c
kdbchk: bad row tab 0, slot 23
Block header dump:  0x03061612
Object id on Block? Y
seg/obj: 0x102a1  csc: 0xb43.ecde68ca  itc: 3  flg: E  typ: 1 - DATA
brn: 0  bdba: 0x3061609 ver: 0x01 opc: 0
inc: 0  exflg: 0
............
可以看到问题发生在  23 号槽位上
tab 0, row 23, @0x3e0
tl: 4 fb: KCHDFLPN lb: 0x1  cc: 0 cki: 1
tab 0, row 24, @0x3de
tl: 2 fb: --HD---N lb: 0x30
tab 0, row 25, @0x3dc
tl: 2 fb: --HD---- lb: 0x30
tab 0, row 26, @0x3d8
tl: 4 fb: KCHDFLPN lb: 0xff  cc: 0 cki: 255

这里居然第23个row piece的 flag 是KCHDFLPN 即实际fb = 0xff,也就是该row piece同时被标记为key和clustered(row is marked as both a Key and Clustered),因此不管当服务进程尝试update该问题行记录或者通过ROWID访问该row时都出现了ORA-00600错误,虽然其错误代码不同,但都是由于该数据块中第23行记录的flag存在讹误引起的。

关于该ORA-00600:[13013], [5001]问题的成因和解决方法,更多内容可以参考<手工模拟Oracle数据块逻辑讹误引发ORA-00600:[13013], [5001]一例>一文。

 

待修订!

手工模拟Oracle数据块逻辑讹误引发,ORA-00600:[13013] [5001]一例

上周在客户那里遇到了一例由Oracle Bug引发的表数据块逻辑讹误触发ORA-00600:[13013], [5001]的问题,这里为了更好地说明该问题,于是萌发了手工模拟该数据块逻辑讹误的想法。

基础知识

Oracle中表的数据块由块头、事务槽、行字典、行数据等多种结构组成。 行数据(rowdata)实际是由许多row piece 行片组成的,每一条row piece的头部都有flag、locks、cols(cc)三个标志位。

其中flag标记了该row piece的类型,该flag位占用一个字节,其不同的bit位代表不同的含义,见下表:

ROW_CLUSTER_KEY = 0x80;              KDRHFK
ROW_CTABLE_NUMBER = 0x40;            KDRHFC
ROW_HEAD_PIECE = 0x20;               KDRHFH
ROW_DELETED_ROW = 0x10;              KDRHFD
ROW_FIRST_PIECE = 0x08;              KDRHFF
ROW_LAST_PIECE = 0x04;               KDRHFL
ROW_FROM_PREVIOUS = 0x02;            KDRHFP
ROW_CONTINUE_NEXT = 0x01;            KDRHFN

一般来说最普通的一条row piece是普通堆表(heap table)的未被删除的且无行迁移/链接的,其flag位应为

普通row的flag一般为
Single Row =
ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE= 0x20 + 0x08 + 0x04= 0x2c
===================================================================================
cluster key的flag一般为
Cluster Key =
ROW_CLUSTER_KEY + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE=
KDRHFL, KDRHFF, KDRHFH, KDRHFK =0x80 + 0x2c =  0xac
BBED> x /rn
rowdata[68]                                 @8166
-----------
flag@8166: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)
lock@8167: 0x00
cols@8168:    1
kref@8169:    1
mref@8171:    1
hrid@8173:0x01800014.0
nrid@8179:0x01800014.0
col    0[2] @8185: 10 
===================================================================================
Cluster Row =
ROW_CTABLE_NUMBER + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE =
(KDRHFL, KDRHFF, KDRHFH, KDRHFC) = 0x6c 
BBED> x /rncc
rowdata[0]                                  @8098
----------
flag@8098: 0x6c (KDRHFL, KDRHFF, KDRHFH, KDRHFC)
lock@8099: 0x00
cols@8100:   10
col    0[2] @8102: 200
col    1[8] @8105: Jennifer
col    2[6] @8114: Whalen
col    3[7] @8121: JWHALEN
col   4[12] @8129: 515.123.4444
col    5[7] @8142: w....
col    6[7] @8150: AD_ASST
col    7[2] @8158: 
col    8[0] @8161: *NULL*
col    9[3] @8162: .
出现ORA-00600:[13013], [5001]且Arg [f] Code =3 代表这一row piece的flag >0xc0,
也就是该行片同时被标记为key和clustered(row is marked as both a Key and Clustered), 其检验代码为check code 6251。
当flag >= 0xc0 时 会出现kdrchk: row is marked as both a Key and Clustered Block 12 failed with check code 6251
当 0xac >flag >= 0xa0 时 会 kdrchk: row is Key and Not only piece of key Block 12 failed with check code 6255
当 flag = 0x43 是 会出现 kdrchk: C and neither of H or F Block 12 failed with check code 6263
当 flag = 0x83 时 会出现 kdrchk: row is marked both as a Key and being continued Block 12 failed with check code 6254

 

当Oracle进程访问数据块时首先会校验block的sum值并与block中的CHECKSUM值进行对比,若一致则说明该block没有物理讹误。但是光这一项检查是不够的,不足以保证block无误。所以Oracle引入了一些列的逻辑检验,每一种逻辑检验对应一个检测代码(check code),这些检测包括row piece的flag、cols(cc)状态是否正确等。

实际负责这类逻辑检验的函数包括:kdbchk、kddummy_blkchk、kco_blkchk、kdBlkCheckError、kdrchk等等。

这里当服务进程访问到问题数据块,检测代码发现其flag为0xff(KCHDFLPN),该flag从逻辑上讲是冲突的,所以检测代码认为该row piece存在异常,进而会引发update的ORA-00600:[13013], [5001]或查询的ORA-600 [qertbFetchByRowID]内部错误。

这里需要说明一下的是,很多人认为dbv工具时无法检测出逻辑讹误的,实际上dbv、rman、validate structure和bbed-verify均可以检测出一定程度的逻辑讹误,但是最可靠的还是db_block_checksum=true情况下的validate structure [online]验证命令。从另一个角度来说,普通的dbv只能做单一的检测,而无法做到交叉地检验,从而了解表和索引上的不一致问题,但是validate structure online却可以做到。


正式模拟

以上我们了解了ORA-00600:[13013], [5001]内部错误是如何被引发的,那么下面手工模拟该错误也就不困难了,当然这里需要用到bbed工具。

以下我们会创建实验用的tablespace,table,index:

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> select * from global_name;
GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com
/* 创建实验用的表空间  */
SQL> create tablespace maclean datafile '/home/oracle/maclean.dbf' size 20M;
Tablespace created.
SQL> create table tv tablespace maclean as select rownum t1,'find me' t2 from
dba_tables where rownumcreate index ind_tv on tv(t1) tablespace users;
Index created.
SQL> update tv set t2='corrption here' where t1=200;
update tv set t2='corrption here' where t1=200
*
ERROR at line 1:
ORA-12899: value too large for column "SYS"."TV"."T2" (actual: 14, maximum: 7)
SQL> alter table tv modify t2 varchar2(200);
Table altered.
SQL> update tv set t2='corruption here' where t1=200;
1 row updated.
SQL> commit;
Commit complete.
/* 以上创建了示例用表,其中t1=200的记录是之后将会
手动修改为存在讹误的行             */
SQL> select dump(200,16) from dual;
DUMP(200,16)
-----------------
Typ=2 Len=2: c2,3
/* 通过16进制码可以方便找出该t1=200的记录行 */ 
SQL> alter system checkpoint;
System altered.
SQL> alter tablespace maclean read only;
Tablespace altered.
SQL> select dbms_rowid.rowid_block_number(rowid) bno ,dbms_rowid.rowid_relative_fno(rowid) fno from tv;
BNO FNO
---------- ----------
12 6
[oracle@rh2 ~]$ cp maclean.dbf maclean.dbf.bak

 

接着使用BBED工具找到目标行并实施手工修改:

 

[oracle@rh2 ~]$ bbed filename=maclean.dbf mode=edit
Password:
BBED: Release 2.0.0.0.0 - Limited Production on Sun Sep 18 22:14:59 2011
Copyright (c) 1982, 2007, Oracle. All rights reserved.
BBED> set blocksize 8192
BLOCKSIZE 8192
BBED> set block 13
BLOCK# 13
BBED> map /v
File: maclean.dbf (0)
Block: 13 Dba:0x00000000
------------------------------------------------------------
KTB Data Block (Table/Cluster)
struct kcbh, 20 bytes @0
ub1 type_kcbh @0
ub1 frmt_kcbh @1
ub1 spare1_kcbh @2
ub1 spare2_kcbh @3
ub4 rdba_kcbh @4
ub4 bas_kcbh @8
ub2 wrp_kcbh @12
ub1 seq_kcbh @14
ub1 flg_kcbh @15
ub2 chkval_kcbh @16
ub2 spare3_kcbh @18
struct ktbbh, 96 bytes @20
ub1 ktbbhtyp @20
union ktbbhsid, 4 bytes @24
struct ktbbhcsc, 8 bytes @28
b2 ktbbhict @36
ub1 ktbbhflg @38
ub1 ktbbhfsl @39
ub4 ktbbhfnx @40
struct ktbbhitl[3], 72 bytes @44
struct kdbh, 14 bytes @124
ub1 kdbhflag @124
b1 kdbhntab @125
b2 kdbhnrow @126
sb2 kdbhfrre @128
sb2 kdbhfsbo @130
sb2 kdbhfseo @132
b2 kdbhavsp @134
b2 kdbhtosp @136
struct kdbt[1], 4 bytes @138
b2 kdbtoffs @138
b2 kdbtnrow @140
sb2 kdbr[200] @142
ub1 freespace[4725] @542
ub1 rowdata[2921] @5267
ub4 tailchk @8188
BBED> find /x c203
File: maclean.dbf (0)
Block: 13 Offsets: 5271 to 5782 Dba:0x00000000
------------------------------------------------------------------------
c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e 64206d65
2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420 6d652c00
0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65 2c000203
c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00 0203c202
5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203 c2025c07
66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202 5a076669
6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807 66696e64
206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669 6e64206d
652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64 206d652c
000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d 652c0002
03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c 000203c2
024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002 03c2024d
0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2 024b0766
696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249 0766696e
64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766 696e6420
6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e 64206d65
找到了t1=200的偏移值为5271
则其fb的偏移值为5271 -4 = 5267
BBED> set offset 5267
OFFSET 5267
BBED> d
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
2c020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e
/* 找到指定行的地址为5267,其当前flag为正常的0x2c  */
BBED> x /rnc
rowdata[0] @5267
----------
flag@5267: 0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@5268: 0x02
cols@5269: 2
col 0[2] @5270: 200
col 1[15] @5273: corruption here
修改该flag 为 0xff BBED> modify /x 0xff
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
ff020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e
BBED> x /rnc
rowdata[0] @5267
----------
flag@5267: 0xff (KDRHFN, KDRHFP, KDRHFL, KDRHFF, KDRHFD, KDRHFH, KDRHFC, KDRHFK)
lock@5268: 0x02
cols@5269: 0
ckix@5270: 2
BBED> sum apply
Check value for File 0, Block 13:
current = 0x0000, required = 0x0000
我们使用bbed的verify命令验证数据块会发现问题flag
BBED> verify
DBVERIFY - Verification starting
FILE = maclean.dbf
BLOCK = 12
kdrchk: row is marked as both a Key and Clustered
prow=0x7f5335f05693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f5335f0427c
kdbchk: bad row tab 0, slot 199
Block 12 failed with check code 6251
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0
使用dbv工具是也可以验证这种逻辑讹误的
[oracle@rh2 ~]$ dbv file=maclean.dbf
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Sep 18 22:27:49 2011
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = maclean.dbf
kdrchk: row is marked as both a Key and Clustered
prow=0x7f9ef25f7693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f9ef25f627c
kdbchk: bad row tab 0, slot 199
Page 12 failed with check code 6251
DBVERIFY - Verification complete
Total Pages Examined : 2560
Total Pages Processed (Data) : 1
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 11
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 2548
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 691691 (0.691691)

 

回到sqlplus中访问之前修改的数据行,触发ORA-600[13013] [5001]错误:

 

SQL> alter system flush buffer_cache;
System altered.
SQL> update tv set t2='correct here' where t1=200;
update tv set t2='correct here' where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [13013], [5001], [52937],
[25165836], [199], [25165836], [3], []
PLAN_TABLE_OUTPUT
---------------------------------------------------------
Plan hash value: 568795662
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | UPDATE STATEMENT  |        |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  UPDATE           | TV     |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| IND_TV |     1 |   115 |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"=200)
SQL> select * from tv where t1=200;
select * from tv where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [],
[], [], []
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1015724781
--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TV     |     1 |   115 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_TV |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"=200)

可以看到当正好update到问题行记录时如预料出现了ORA-00600:[13013], [5001]错误,而ACCESS BY INDEX ROWID时出现了ORA-00600:[qertbFetchByRowID]。

解决方案

1.在有备份的情况下可以通过blockrecovery在线修复该问题数据块:

RMAN> blockrecover datafile 6 block 12;

Starting blockrecover at 18-SEP-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=144 devtype=DISK

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished blockrecover at 18-SEP-11

但是请注意如果该逻辑讹误确实是由Oracle Bug引起的话,那么很有可能blockrecover也无能为力,那么可以借鉴第二种方法。

 

2. 第二种方法针对没有备份可用的数据库或者recover数据块不管用的场景,可以设置10231事件并ctas复制该表,但是这种方法可能会丢失有问题的行记录:

SQL> alter session set events ‘10231 trace name context forever, level 10’

SQL> Create table.TABLE_COPY as select * from TABLE;

 

了解更多关于kdrchk函数的信息:

Add check for continued row piece pointing to itself with
corruption description:
"kdrchk: Row piece pointing to itself"
DB_BLOCK_CHECKING = MEDIUM will check for row pieces where the
next rowid (nrid) points to itself (chained row points to itself).
It produces error ORA-600 [kddummy_blkchk] or ORA-600 [kdBlkCheckError]
with check code [6266] (3rd ORA-600 argument).
DBVERIFY reports the same corruption description if the block is corrupt on disk.
RMAN when run with the CHECK LOGICAL option reports it as
corruption_type=CORRUPT/LOGICAL in v$database_block_corruption.
"ANALYZE TABLE  VALIDATE STRUCTURE" produces error ORA-1498 and trace file
shows the same corruption description.
With this fix in place DBMS_REPAIR can be used to identify and mark the affected
block as Soft Corrupt producing error ORA-1578 and it can be skipped it for DML's
using DBMS_REPAIR.SKIP_CORRUPT_BLOCKS.
[CM][SG][event 1][domain Q423][mem 0] Joining shared group
kdrchk: column length 0 but not null
prow=0x2a97f4d9d6 flag=0x2c column=57
Block Checking: DBA = 29635651, Block Type = KTB-managed data block
data header at 0x2a97f4be7c
kdbchk: bad row tab 0, slot 2
data_block_dump,data header at 0x2a97d113d8
data_block_dump,data header at 0x2a97d113d8
kdrchk: found invalid symbol reference 48
reference to delete symbol
valid symbol range [0,78)
Block Checking: DBA = 411055291, Block Type = KTB-managed data block
data header at 0x68a3f4
kdbchk: bad row tab 0, slot 4
Page 13499 failed with check code 6265
kdrchk: C and neither of H or F
prow=0x4282803ae flag=0x41
Block Checking: DBA = 322963095, Block Type = KTB-managed data block
data header at 0x42828007c
kdrchk: column length 0 but not null
prow=0x10021035e flag=0x2c column=40
Block Checking: DBA = 25189259, Block Type = KTB-managed data block
data header at 0x10020fe7c
kdbchk: bad row tab 0, slot 0
Page 23435 failed with check code 6264
kdrchk: column length 0 but not null
prow=0x1002122e5 flag=0x2c column=40
Block Checking: DBA = 25189260, Block Type = KTB-managed data block
kdrchk:  row is marked as both a Key and Clustered
prow=0xd2bfa981 flag=0xff
File#67, Block#74754
kdbchk: bad row tab 0, slot 0
kdrchk:  no columns, but has one of P or N
prow=0x934fbffa flag=0x31
DIAGNOSTIC ANALYSIS:
====================
A look at the block dump in the analyze trace file revealed two very
suspicious looking rows:
tab 0, row 0, @0x1ede
tl: 2 fb: --HD---N lb: 0x0
tab 0, row 1, @0x1edc
tl: 2 fb: --HD---N lb: 0x0
The flag bytes in these rows look incorrect.

待修订!

沪ICP备14014813号

沪公网安备 31010802001379号