Data recovery: a case of resolving ORA-600 [kghstack_free2] [kghstack_err+0068]

 


 

 

The database alert.log reported the following errors:

Database Characterset is AL32UTF8
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 16
replication_dependency_tracking turned off (no async multimaster replication found)
Thu Sep 20 09:46:09 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_856120.trc:
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Starting background process QMNC
QMNC started with pid=53, OS id=1229222
Thu Sep 20 09:46:11 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_856120.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:12 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_856120.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:13 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_856120.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:15 2012
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Thu Sep 20 09:46:16 2012
Completed: ALTER DATABASE OPEN
Thu Sep 20 09:46:18 2012
Doing block recovery for file 370 block 140818
Block recovery from logseq 336560, block 264 to scn 12971246479133
Thu Sep 20 09:46:18 2012
Recovery of Online Redo Log: Thread 1 Group 4 Seq 336560 Reading mem 0
Mem# 0: /oradata4/macleanDB/redolog/redolog4a.log
Mem# 1: /oradata4/macleanDB/redolog/redolog4b.log
Block recovery completed at rba 336560.27935.16, scn 3020.445245214
Thu Sep 20 09:46:18 2012
SMON: slave died unexpectedly, downgrading to serial recovery
Thu Sep 20 09:46:18 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_smon_729208.trc:
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:20 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_smon_729208.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:21 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_smon_729208.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:22 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_smon_729208.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 09:46:24 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_pmon_786638.trc:
ORA-00474: SMON process terminated with error
Thu Sep 20 09:46:24 2012
PMON: terminating instance due to error 474
Instance terminated by PMON, pid = 786638
Thu Sep 20 10:38:31 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is enabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 500
sessions = 555
__shared_pool_size = 637534208
__large_pool_size = 16777216
__java_pool_size = 16777216
__streams_pool_size = 0
sga_target = 21474836480
control_files = /oradata1/macleanDB/control1/control01.ctl, /oradata1/macleanDB/control2/control02.ctl, /oradata1/macleanDB/control3/control03.ctl
db_block_size = 8192
__db_cache_size = 20786970624
compatible = 10.2.0.3.0
log_archive_dest_1 = LOCATION=/oraarch/maclean
log_archive_format = %t_%s_%r.dbf
db_files = 4000
db_file_multiblock_read_count= 16
undo_management = AUTO
undo_tablespace = UNDOTBS2
undo_retention = 7200
remote_login_passwordfile= EXCLUSIVE
audit_sys_operations = TRUE
db_domain =
dispatchers = (PROTOCOL=TCP) (SERVICE=macleanXDB)
job_queue_processes = 10
background_dump_dest = /user/oracle/admin/maclean/bdump
user_dump_dest = /user/oracle/admin/maclean/udump
core_dump_dest = /user/oracle/admin/maclean/cdump
audit_file_dest = /user/oracle/admin/maclean/adump
audit_trail = DB, EXTENDED
db_name = maclean
open_cursors = 300
pga_aggregate_target = 10737418240
MMAN started with pid=4, OS id=368872
PSP0 started with pid=3, OS id=184496
PMON started with pid=2, OS id=94512
DBW0 started with pid=5, OS id=331874
DBW1 started with pid=6, OS id=418128
LGWR started with pid=7, OS id=397498
CKPT started with pid=8, OS id=508388
SMON started with pid=9, OS id=459050
RECO started with pid=10, OS id=471536
CJQ0 started with pid=11, OS id=442822
MMON started with pid=12, OS id=180548
Thu Sep 20 10:38:38 2012
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=13, OS id=352574
Thu Sep 20 10:38:38 2012
starting up 1 shared server(s) ...
Thu Sep 20 10:38:42 2012
ALTER DATABASE MOUNT
Thu Sep 20 10:38:47 2012
Setting recovery target incarnation to 2
Thu Sep 20 10:38:47 2012
Successful mount of redo thread 1, with mount id 2376692082
Thu Sep 20 10:38:47 2012
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Thu Sep 20 10:38:47 2012
ALTER DATABASE OPEN
Thu Sep 20 10:38:49 2012
Beginning crash recovery of 1 threads
parallel recovery started with 15 processes
Thu Sep 20 10:38:50 2012
Started redo scan
Thu Sep 20 10:38:51 2012
Completed redo scan
27994 redo blocks read, 8334 data blocks need recovery
Thu Sep 20 10:38:52 2012
Started redo application at
Thread 1: logseq 336560, block 3
Thu Sep 20 10:38:52 2012
Recovery of Online Redo Log: Thread 1 Group 4 Seq 336560 Reading mem 0
Mem# 0: /oradata4/macleanDB/redolog/redolog4a.log
Mem# 1: /oradata4/macleanDB/redolog/redolog4b.log
Thu Sep 20 10:38:53 2012
Completed redo application
Thu Sep 20 10:38:56 2012
Completed crash recovery at
Thread 1: logseq 336560, block 27997, scn 12971246499154
8334 data blocks read, 8334 data blocks written, 27994 redo blocks read
Thu Sep 20 10:38:57 2012
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=32, OS id=770284
ARC1 started with pid=33, OS id=762350
Thu Sep 20 10:38:57 2012
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
Thu Sep 20 10:38:58 2012
Thread 1 advanced to log sequence 336561 (thread open)
Thread 1 opened at log sequence 336561
Current log# 2 seq# 336561 mem# 0: /oradata1/macleanDB/redolog/redolog2a.log
Current log# 2 seq# 336561 mem# 1: /oradata1/macleanDB/redolog/redolog2b.log
Successful open of redo thread 1
Thu Sep 20 10:38:58 2012
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Sep 20 10:38:58 2012
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Thu Sep 20 10:38:58 2012
ARC0: Becoming the heartbeat ARCH
Thu Sep 20 10:38:58 2012
SMON: enabling cache recovery
Thu Sep 20 10:39:03 2012
Successfully onlined Undo Tablespace 5.
Thu Sep 20 10:39:03 2012
SMON: enabling tx recovery
Thu Sep 20 10:39:03 2012
Database Characterset is AL32UTF8
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 16
Thu Sep 20 10:39:04 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_491552.trc:
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=52, OS id=770310
Thu Sep 20 10:39:06 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_491552.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 10:39:08 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_491552.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 10:39:08 2012
Errors in file /user/oracle/admin/maclean/bdump/maclean_p003_491552.trc:
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Thu Sep 20 10:39:11 2012
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
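
A log this long is easier to read once it is reduced to counts. The following is a minimal Python sketch, not part of the original case, that tallies the ORA- codes and the block-recovery targets in alert.log lines. The function name `summarize_alert_log` and the two regular expressions are assumptions made for illustration; they only match the line shapes seen in the excerpt above.

```python
import re
from collections import Counter

# Patterns for the two line shapes of interest in the excerpt above.
ORA_RE = re.compile(r"^(ORA-\d{5}):")
BLOCK_RE = re.compile(r"Doing block recovery for file (\d+) block (\d+)")


def summarize_alert_log(lines):
    """Return (ORA code counts, block-recovery target counts)."""
    ora_codes = Counter()
    blocks = Counter()
    for raw in lines:
        line = raw.strip()
        ora = ORA_RE.match(line)
        if ora:
            ora_codes[ora.group(1)] += 1
        blk = BLOCK_RE.search(line)
        if blk:
            # Key is (file#, block#) as integers.
            blocks[(int(blk.group(1)), int(blk.group(2)))] += 1
    return ora_codes, blocks


# A few lines copied from the alert.log above.
SAMPLE = """\
ORA-07445: exception encountered: core dump [kghstack_err+0068] [SIGSEGV] [Address not mapped to object] [0x778670101010109] [] []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00602: internal programming exception
ORA-00600: internal error code, arguments: [kghstack_free2], [], [], [], [], [], [], []
Doing block recovery for file 370 block 140818
Doing block recovery for file 370 block 140818
""".splitlines()

ora_counts, block_counts = summarize_alert_log(SAMPLE)
print(ora_counts)
print(block_counts.most_common(1))  # [((370, 140818), 2)]
```

Run against the full log, a summary like this makes it plain that every block recovery attempt targets the same block, file 370 block 140818.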

 

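After internal event 10513 was set at level 2, SMON stopped recovering dead transactions, and the instance no longer terminated unexpectedly once the database was opened.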

SQL> show parameter event

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
event string 10513 trace name context forev
er, level 2

Checking the datafiles with the RMAN validate command and dbv showed one logically corrupt block and one physically corrupt block in table MAC_maclean_TAB.
SQL> select * from V$DATABASE_BLOCK_CORRUPTION ;

FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
370 140818 1 1.2971E+13 LOGICAL
84 23875 1 0 CORRUPT
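The user could tolerate losing a small amount of data, and running blockrecover in RMAN for the physically corrupt block above (FILE 84 BLOCK 23875) spent a great deal of time just locating that block. We therefore decided to use ROWID ranges to skip the corrupt blocks and copy out the vast majority of the table's rows.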

SQL> select dbms_rowid.rowid_create(1,353223,370,140818,0) from dual;

DBMS_ROWID.ROWID_C
------------------
AABWPHAFyAAAiYSAAA
SQL> select dbms_rowid.rowid_create(1,353223,370,140819,0) from dual;

DBMS_ROWID.ROWID_C
------------------
AABWPHAFyAAAiYTAAA
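
To see where these boundary values come from, here is a minimal Python sketch of the extended ROWID layout: 6 characters for the data object number, 3 for the relative file number, 6 for the block number and 3 for the row number, each in a base-64 alphabet of A-Z, a-z, 0-9, + and /. The helper names `_b64` and `extended_rowid` are made up for illustration. The sketch only mirrors what dbms_rowid.rowid_create(1, ...) returns for these inputs and is no substitute for the package.

```python
import string

# Base-64 alphabet used by extended ROWIDs: A-Z, a-z, 0-9, +, /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def _b64(value, width):
    """Encode a non-negative integer as exactly `width` base-64 characters."""
    chars = []
    for _ in range(width):
        chars.append(ALPHABET[value % 64])
        value //= 64
    if value:
        raise ValueError("value does not fit in %d characters" % width)
    return "".join(reversed(chars))


def extended_rowid(data_object_id, relative_fno, block_no, row_no):
    """Assemble OOOOOO + FFF + BBBBBB + RRR."""
    return (
        _b64(data_object_id, 6)
        + _b64(relative_fno, 3)
        + _b64(block_no, 6)
        + _b64(row_no, 3)
    )


# First row slot of the corrupt block and of the block right after it.
low = extended_rowid(353223, 370, 140818, 0)
high = extended_rowid(353223, 370, 140819, 0)
print(low)   # AABWPHAFyAAAiYSAAA
print(high)  # AABWPHAFyAAAiYTAAA
```

Rows with a rowid below `low` or at or above `high` lie outside block 140818. The two INSERT statements that follow use exactly these two values as their predicates.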

 

alter session set events '10231 trace name context forever, level 10';
SQL> select count(*) from MAC_maclean_TAB;

COUNT(*)
----------
8450840

 

insert /*+ APPEND */ into MAC_maclean_TAB_BAK select /*+ ROWID(A) */ * from MAC_maclean_TAB A where rowid < 'AABWPHAFyAAAiYSAAA';

insert /*+ APPEND */ into MAC_maclean_TAB_BAK select /*+ ROWID(A) */ * from MAC_maclean_TAB A where rowid >= 'AABWPHAFyAAAiYTAAA';
SQL> select count(*) from MAC_maclean_TAB_BAK;
COUNT(*)
----------
8450796

alter table MAC_maclean_TAB rename to MAC_maclean_TAB_20120920_BAK;
alter table MAC_maclean_TAB_BAK rename to MAC_maclean_TAB;

 

SQL>
SQL> select count(*) from MAC_maclean_TAB;

COUNT(*)
----------
8450796

SQL> analyze table MAC_maclean_TAB validate structure;

Table analyzed.

drop table MAC_maclean_TAB_20120920_BAK;
The original MAC_maclean_TAB held 8450840 rows in total; 8450796 rows were actually recovered, so 44 rows were lost.

Afterwards, a CREATE TABLE command was used to overwrite the block that held the logical corruption, so that the ORA-600 problem above does not recur.
create table LARGE_TABLE (t1 int) tablespace MAC_REPORT_TBS ;

alter table LARGE_TABLE allocate extent (datafile '/maclean/maclean_32.dbf' size 10M);

 
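Validate every datafile in the database to find latent corrupt blocks and keep this kind of problem from recurring. With RMAN, the following script can serve as a reference: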

RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup validate check logical database;
}
The corrupt blocks found this way are summarized in the V$DATABASE_BLOCK_CORRUPTION view:

select * from V$DATABASE_BLOCK_CORRUPTION ;

Oracle internal errors: a case of diagnosing ORA-00600 [13013] [5001]

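On Friday I was called to a customer site to resolve an ORA-00600 [13013] internal error on a 10.2.0.4 RAC database. A colleague had already looked into it that morning over a remote dial-up connection,
and his preliminary judgment was that index corruption was causing the ORA-600.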

Initial diagnosis

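Before I arrived on site, my colleague had already done a preliminary diagnosis. The alert log of the primary node of this 10.2.0.4 RAC system showed repeated internal errors, including ORA-00600: [13013], [5001], ORA-00600: [qertbFetchByRowID] and ORA-00600: [25027]. The log reads as follows: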

Fri Sep 16 01:16:54 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Fri Sep 16 01:16:55 2011
Trace dumping is performing id=[cdmp_20110916011655]
Fri Sep 16 01:17:06 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
The 13013 trace follows:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Current SQL statement for this session:
UPDATE CUST_SUBSCRB_PERSONAL A SET a.cust_mobile='A04204441' WHERE a.subscrbid=71524739
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               1000D8FE4 ? 0785F7165 ?
10501ED28 ? FFFFFFFFFFF6748 ?
06C626E72 ?
ksesic6+0060         bl       kgesiv               110072D08 ? 7000000AC52B8F8 ?
7000000AC52B3A8 ? 07FFFFFFF ?
000000000 ?
updThreePhaseExe+0c  bl       01F9D7A8
4c
updexe+041c          bl       updThreePhaseExe     7000000AC52B8F8 ? 000000000 ?
110540128 ? FFFFFFFFFFF7C80 ?
opiexe+27d8          bl       updexe               7000000AC52B8F8 ?
FFFFFFFFFFF82B8 ?
kpoal8+0edc          bl       opiexe               FFFFFFFFFFFB434 ?
FFFFFFFFFFFB198 ?
FFFFFFFFFFF9608 ?
opiodr+0ae0          bl       _ptrgl
ttcpip+1020          bl       _ptrgl
opitsk+1124          bl       01F9F2A0
opiino+0990          bl       opitsk               000000000 ? 000000000 ?
============
Plan Table
============
------------------------------------------------------+-----------------------------------+
| Id  | Operation           | Name                    | Rows  | Bytes | Cost  | Time      |
------------------------------------------------------+-----------------------------------+
| 0   | UPDATE STATEMENT    |                         |       |       |     1 |           |
| 1   |  UPDATE             | CUST_SUBSCRB_PERSONAL   |       |       |       |           |
| 2   |   INDEX UNIQUE SCAN | CUST_SUBSCRB_PERSONAL_PK|     1 |    33 |     1 |  00:00:01 |
------------------------------------------------------+-----------------------------------+
===============================================================================================================
ORA-00600: [qertbFetchByRowID] appeared alongside it:
Fri Sep 16 01:08:57 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT * from CUST_SUBSCRB_PERSONAL A WHERE a.subscrbid=307557025
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              578318D500000003 ?
FFFFFFFFFFF80D0 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgerinv+00dc         bl       _ptrgl
kgeasnmierr+004c     bl       kgerinv              11048AE68 ? 000000000 ?
000000001 ? 104BDEC18 ?
110489398 ?
qertbFetchByRowID+0  bl       03F2EF38
d34
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?
============
Plan Table
============
---------------------------------------------------------------+-----------------------------------+
| Id  | Operation                    | Name                    | Rows  | Bytes | Cost  | Time      |
---------------------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT             |                         |       |       |     1 |           |
| 1   |  TABLE ACCESS BY INDEX ROWID | CUST_SUBSCRB_PERSONAL   |     1 |   287 |     1 |  00:00:01 |
| 2   |   INDEX UNIQUE SCAN          | CUST_SUBSCRB_PERSONAL_PK|     1 |       |     1 |  00:00:01 |
---------------------------------------------------------------+-----------------------------------+
===============================================================================================================
ORA-00600: [25027], [6], [1443670797], [], [], [], [], [] also occurred:
Fri Sep 16 15:06:00 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_1872436.trc:
ORA-00600: internal error code, arguments: [25027], [6], [1443670797], [], [], [], [], []
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [25027], [6], [4278267905], [], [], [], [], []
Current SQL statement for this session:
select q.subscrbid,q.serv_lvl,q.cust_lvl
from cust_subscrb_personal q
where q.serv_lvl is null
or q.cust_lvl is null
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               000000000 ? 000000000 ?
000000000 ? 105673724 ?
00000000C ?
ksesic2+0060         bl       kgesiv               000000245 ? 0FFFFFFFF ?
700000010013BD0 ?
700000010018078 ? 110000998 ?
krtd2abh+040c        bl       ksesic2              61C3000061C3 ? 000000000 ?
000000006 ? 000000000 ?
0FF013001 ? 000000FA0 ?
105673FB8 ? 105673FA8 ?
kcbgtcr+24a0         bl       krtd2abh             700000471D95738 ?
7000001A9F0F050 ?
FFFFFFFFFFF78E0 ?
700000471D93500 ? 000000003 ?
ktrget+04ac          bl       kcbgtcr              1105A75A8 ? 000002000 ?
000000000 ? 000001940 ?
kdsgrp+00bc          bl       ktrget               000000064 ? 110000998 ?
700000010008000 ?
kdsgnp+0444          bl       kdsgrp               000000000 ? 000000000 ?
FFFFFFFFFFF80D0 ?
kafger+08fc          bl       kdsgnp               000000000 ? 000000000 ?
111119E38 ?
kdstf1100101km+0f58  bl       kafger               1105A74D8 ? 111119E38 ?
FFFFFFFFFFF8190 ? 000000000 ?
70000046DCA8488 ? 000000000 ?
0FFFF9608 ? 110471948 ?
kdsttgr+1a04         bl       kdstf1100101km       111119E38 ? 0000102A1 ?
70000042DF6DC50 ?
FFFFFFFFFFF8418 ? 0FFFFFFFF ?
000001FE8 ? 000000000 ?
000000000 ?
qertbFetch+09b8      bl       kdsttgr              111119E38 ? 000000418 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000418 ? BE4610195B2C ?
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?
============
Plan Table
============
--------------------------------------------------+-----------------------------------+
| Id  | Operation          | Name                 | Rows  | Bytes | Cost  | Time      |
--------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |                      |       |       |  4898 |           |
| 1   |  TABLE ACCESS FULL | CUST_SUBSCRB_PERSONAL|    15 |   165 |  4898 |  00:00:59 |
--------------------------------------------------+-----------------------------------+

We start with the ORA-00600: [13013], [5001] internal error. MOS Note <How to resolve ORA-00600 [13013], [5001] [ID 816784.1]> describes this ORA-600 in detail; its arguments have the following meaning:

Arg [a] Passcount
Arg [b] Data Object number
Arg [c] Tablespace Relative DBA of block containing the row to be updated
Arg [d] Row Slot number
Arg [e] Relative DBA of block being updated (should be same as [c])
Arg [f] Code

In our case, [13013], [5001], [66209], [50730514], [23], [50730514], [3], [] means:

data_object_id = 66209 = 0x102A1
RDBA = 50730514 = 0x3061612, that is, file id = 12, block id = 398866
Row Slot number = 23
Code = 3

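In other words, an anomaly was found while performing the check with code 3 on row 23 of block 398866 in datafile 12, for the object whose data_object_id is 66209. The data_object_id identifies that object as the CUST_SUBSCRB_PERSONAL table in the SQL statement being executed.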
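
The RDBA-to-file/block conversion used above can be checked with a few lines of Python. This is a minimal sketch that assumes the usual smallfile layout, where the upper 10 bits of the 32-bit relative DBA hold the relative file number and the lower 22 bits hold the block number. Bigfile tablespaces are laid out differently and are not covered. The function name `decode_rdba` is made up for illustration.

```python
def decode_rdba(rdba):
    """Split a 32-bit relative DBA into (relative file#, block#).

    Assumes a smallfile tablespace: 10 bits of file number followed by
    22 bits of block number.
    """
    file_no = rdba >> 22          # upper 10 bits
    block_no = rdba & 0x3FFFFF    # lower 22 bits
    return file_no, block_no


# Arguments [c] and [e] of the ORA-00600 [13013] in this case.
print(hex(50730514))          # 0x3061612
print(decode_rdba(50730514))  # (12, 398866)
```

On a live database, DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE and DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK perform the same split.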

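ORA-00600: [13013], [5001] can be caused by logical corruption in the table's rows or in index entries. Note that this is logical corruption, not physical damage. Logical corruption is generally caused by an Oracle bug or by memory corruption. This error is known to be triggered by several bugs, including 5085288 and 4549673, and these bugs are relatively likely to be hit on version 10.2.0.4.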

The ORA-600 [qertbFetchByRowID] that accompanied ORA-00600: [13013], [5001] is often caused by index corruption. The following MOS note describes the problem:

ORA-600 [qertbFetchByRowID] Select Queries on 10.2.0.4 Database [ID 755592.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.4 and later   [Release: 10.2 and later ]
Symptoms
The following error was received during SELECT operation:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
The call stack contains:
qertbFetchByRowID <- qergiFetch <- rwsfcd <- qerhjFetch
<- qerjotFetch <- rwsfcd <- qertqoFetch <- qerpxSlaveFetch <- qerpxFetch
Cause
Index corruption.
Solution
1.
Analyze the affected object to look for logical corruption.
SQL> analyze table TABLENAME validate structure cascade;
This will place an exclusive lock on the table.
If an error is reported try dropping and recreating the index(es) first.
2.
If this problem is occurring on a version less than 10.2.0.4, ensure the patch for
Bug 4883635 has been applied if Materialized Views are used.
BUG:5866783 - ORA-00600: [QERTBFETCHBYROWID] ON SELECT FROM ONE TABLE

ORA-00600: [25027] is a similar problem:

ORA-600 [25027] [ID 284433.1]
PURPOSE:
This article represents a partially published OERI note.
It has been published because the ORA-600 error has been
reported in at least one confirmed bug.
Therefore, the SUGGESTIONS section of this article may help
in terms of identifying the cause of the error.
This specific ORA-600 error may be considered for full publication
at a later date. If/when fully published, additional information
will be available here on the nature of this error.
ERROR:
Format: ORA-600 [25027] [a] [b]
VERSIONS:
versions 9.2 and above
ARGUMENTS:
Arg [a]  Tablespace Number (TSN)
Arg [b]  Decimal Relative Data Block Address (RDBA)
SUGGESTIONS:
1. If the Arg [b] (the RDBA) is 0 (zero), then this could be due to fake indexes.
The following query will list fake indexes:
select do.owner,do.object_name, do.object_type,sysind.flags
from dba_objects do, sys.ind$ sysind
where do.object_id = sysind.obj#
and bitand(sysind.flags,4096)=4096;
If the above query returns any rows, check the objects involved and consider
dropping them as they can cause this error. 
2. Run analyze table validate structure on the table referenced in the Current SQL statement in
the related trace file.
If the Known Issues section below does not help in terms of identifying
a solution, please submit the trace files and alert.log to Oracle
Support Services for further analysis.

Note, however, that in this case the Arg [b] Decimal Relative Data Block Address (RDBA) reported by ORA-00600: [25027] is a completely unrelated RDBA; the reason is not yet clear.

A small interlude

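There was a small interlude along the way. The customer's application staff needed to run a query against the CUST_SUBSCRB_PERSONAL table as soon as possible, but the statement hit ORA-600 [qertbFetchByRowID] every time it ran. It was Friday, just before the weekend, so if it could not run that day it would have to wait until the following week.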

I was therefore asked to get that statement running first. I began by looking at its execution plan:

SQL> explain plan for select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
Explained.
SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
------------------------------------------------
Plan hash value: 3616548176
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |     1 |    31 |   231   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE               |                           |     1 |    31 |            |          |
|*  2 |   HASH JOIN                   |                           |   425 | 13175 |   231   (1)| 00:00:03 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |   425 |  8075 |   208   (0)| 00:00:03 |
|*  4 |     INDEX RANGE SCAN          | CUST_SUBSCRB_PERSONAL_2IX | 42462 |       |    20   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL          | WZY_BS20110916            | 39792 |   466K|    22   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("A"."SVCNUM"="B"."SVCNUM")
3 - filter(TO_NUMBER("A"."SERV_LVL")=0)
4 - access("A"."COUNTYID"='A00')
Note
-----
- dynamic sampling used for this statement
23 rows selected.
SQL> select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

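In the plan above, the index CUST_SUBSCRB_PERSONAL_2IX on the table is range scanned, and the rowids obtained are then used to fetch rows from the table (qertbFetchByRowID). The ORA-600 was raised during the actual fetch by rowid.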

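Because the request was urgent, I considered whether the execution plan could be made to avoid that index. As a first probe I added the RULE hint to the statement and ran it again: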

SQL> select /*+ rule */
2  count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
3  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
COUNT(*)
----------
11559
Execution Plan
----------------------------------------------------------
Plan hash value: 2851452146
-------------------------------------------------------------------
| Id  | Operation                     | Name                      |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |
|   1 |  SORT AGGREGATE               |                           |
|   2 |   NESTED LOOPS                |                           |
|   3 |    TABLE ACCESS FULL          | WZY_BS20110916            |
|*  4 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |
|   5 |     AND-EQUAL                 |                           |
|*  6 |      INDEX RANGE SCAN         | CUST_SUBSCRB_SVCNUM_2006  |
|*  7 |      INDEX RANGE SCAN         | CUST_SUBSCRB_PERSONAL_2IX |
-------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter(TO_NUMBER("A"."SERV_LVL")=0)
6 - access("A"."SVCNUM"="B"."SVCNUM")
7 - access("A"."COUNTYID"='A00')
Note
-----
- rule based optimizer used (consider using cbo)

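Unexpectedly, it succeeded, yet the execution plan still used the CUST_SUBSCRB_PERSONAL_2IX index. This made me suspect that the logical corruption was in the table rather than in the index.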

Either way, the urgent need was met. I passed this workaround to the application staff and went back to the diagnosis.

Preliminary index verification

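To determine whether the logical corruption is in the table or in an index, one would normally run analyze table validate structure cascade to verify the structure of the table and its indexes. That command, however, locks the table in share mode (TM mode=4), which is unacceptable for a critical, frequently updated application table on a production system. My colleague had already considered this when he took on the case, so he recommended using queries to find out which indexes were corrupt. The method is as follows: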

Build the ROWID with ROWID_CREATE
SQL> select dbms_rowid.ROWID_CREATE(1,66209,12,398866,23) from dual;
DBMS_ROWID.ROWID_C
------------------
AAAQKhAAMAABhYSAAX
SQL> select * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> SELECT INDEX_NAME,COLUMN_NAME FROM DBA_IND_COLUMNS WHERE TABLE_NAME='CUST_SUBSCRB_PERSONAL' ORDER BY 1;
INDEX_NAME                     COLUMN_NAME
------------------------------ ----------------------------------------
CUST_SUBSCRB_CARD_NUM          VIP_CARD_NUM
CUST_SUBSCRB_MANAGERID_2006    MANAGERID
CUST_SUBSCRB_PERSONAL_2IX      COUNTYID
CUST_SUBSCRB_PERSONAL_3IX      CUST_LVL
CUST_SUBSCRB_PERSONAL_PK       SUBSCRBID
CUST_SUBSCRB_SERV_COUNTY       SERV_COUNTYID
CUST_SUBSCRB_SVCNUM_2006       SVCNUM
IDX_CUST_SUBSCRB_PERSONAL_01   SERV_LVL
Then force each index in turn with the INDEX hint
SQL> select /*+ INDEX(a CUST_SUBSCRB_CARD_NUM ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_MANAGERID_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_3IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
SQL> select /*+ INDEX(a CUST_SUBSCRB_SERV_COUNTY) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
no rows selected
SQL> select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
SQL> select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

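As shown above, four of the indexes hit ORA-00600 [qertbFetchByRowID]. This test alone, however, still cannot tell us whether the logical corruption lies in the table or in an index.

Necessary tools

ANALYZE ... VALIDATE STRUCTURE also has an ONLINE option. In online mode the validation takes no TM lock on the table at any point, which can be demonstrated with event 10704: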

 

10704, 00000, "Print out information about what enqueues are being obtained"
// *Cause:  When enabled, prints out arguments to calls to ksqcmi and
//          ksqlrl and the return values.
// *Action: Level indicates details:
//   Level: 1-4: print out basic info for ksqlrl, ksqcmi
//          5-9: also print out stuff in callbacks:  ksqlac, ksqlop
//          10+: also print out time for each line
SQL> create table validate_me (t1 int);
Table created.
SQL> create index ind_validate_me on validate_me(t1);
Index created.
SQL> insert into validate_me select rownum  from dba_tables where rownum<201;
200 rows created.
SQL> commit;
Commit complete.
SQL> oradebug setmypid;
Statement processed.
SQL> oradebug event 10704 trace name context forever,level 10;
Statement processed.
SQL> analyze table validate_me validate structure cascade online;
Table analyzed.
SQL> oradebug tracefile_name;
c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_3020.trc
g11r2_ora_3020.trc
=========================================================================
*** 2011-09-18 20:55:25.373
Oradebug command 'event 10704 trace name context forever,level 10' console output: <none>
*** 2011-09-18 20:55:49.765
ksqgtl *** TX-00060005-000006a9 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x31434194, ktcdix=2147483647, topxcb=0x31434194
ktcipt(topxcb)=0x0
*** 2011-09-18 20:55:49.766
ksucti: init txn DID from session DID
ksqgtl:
ksqlkdid: 0001-001B-00000006
*** 2011-09-18 20:55:49.766
*** ksudidTrace: ksqgtl
ktcmydid(): 0001-001B-00000006
ksusesdi:   0000-0000-00000000
ksusetxn:   0001-001B-00000006
ksqgtl: RETURNS 0
*** 2011-09-18 20:55:49.766
ksqrcl: TX,60005,6a9
ksqrcl: returns 0
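
The claim that no TM enqueue is taken can be checked mechanically against the trace. The following is a minimal sketch, assuming the `ksqgtl *** TYPE-ID1-ID2 mode=N` line format seen in this excerpt; the function name `enqueue_gets` is made up for illustration and the pattern may need adjusting for other versions.

```python
import re

# A few lines copied from the 10704 trace excerpt above.
TRACE = """\
ksqgtl *** TX-00060005-000006a9 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x31434194, ktcdix=2147483647, topxcb=0x31434194
ksqgtl: RETURNS 0
ksqrcl: TX,60005,6a9
ksqrcl: returns 0
"""

# Assumed format of an enqueue-get line: type, id1 and id2 in hex, then the mode.
GET_RE = re.compile(
    r"^ksqgtl \*\*\* (\w{2})-([0-9a-fA-F]{8})-([0-9a-fA-F]{8}) mode=(\d+)"
)

def enqueue_gets(trace_text):
    """Return (type, id1, id2, mode) for every enqueue get found in the trace."""
    gets = []
    for line in trace_text.splitlines():
        m = GET_RE.match(line)
        if m:
            gets.append((m.group(1), int(m.group(2), 16),
                         int(m.group(3), 16), int(m.group(4))))
    return gets

gets = enqueue_gets(TRACE)
print(gets)
print("TM requested:", any(g[0] == "TM" for g in gets))
```

On this excerpt the only enqueue requested is a TX in mode 6, which agrees with the statement that online validation takes no TM lock.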

However, validate structure online has a drawback: in online mode the command does not populate the INDEX_STATS view with index statistics, for example:

SQL> analyze index  ind_validate_me validate structure  ;
Index analyzed.
SQL> select count(*) from index_stats;
COUNT(*)
----------
1
SQL> conn / as sysdba
Connected.
SQL> analyze index  ind_validate_me validate structure  online;
Index analyzed.
SQL> select count(*) from index_stats;
COUNT(*)
----------
0

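Here, though, we only need the structural check and have no interest in detailed index statistics, so online mode suits us well.

Verification

Use validate structure online to check the problem table and its indexes:

First validate the table alone, to determine whether the logical corruption is in the table or in an index: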
SQL> analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online;
analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online
*
ERROR at line 1:
ORA-01498: block check failure - see trace file
If validation finds a problem, ORA-01498 is raised and a trace file is written:
kdrchk:  row is marked as both a Key and Clustered
prow=0x7000001f241c45c flag=0xff
Block Checking: DBA = 50730514, Block Type = KTB-managed data block
data header at 0x7000001f241c07c
kdbchk: bad row tab 0, slot 23
Block header dump:  0x03061612
Object id on Block? Y
seg/obj: 0x102a1  csc: 0xb43.ecde68ca  itc: 3  flg: E  typ: 1 - DATA
brn: 0  bdba: 0x3061609 ver: 0x01 opc: 0
inc: 0  exflg: 0
............
The problem is at slot 23:
tab 0, row 23, @0x3e0
tl: 4 fb: KCHDFLPN lb: 0x1  cc: 0 cki: 1
tab 0, row 24, @0x3de
tl: 2 fb: --HD---N lb: 0x30
tab 0, row 25, @0x3dc
tl: 2 fb: --HD---- lb: 0x30
tab 0, row 26, @0x3d8
tl: 4 fb: KCHDFLPN lb: 0xff  cc: 0 cki: 255

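Surprisingly, the flag of row piece 23 is KCHDFLPN, i.e. fb = 0xff, so the row piece is marked as both key and clustered ("row is marked as both a Key and Clustered"). That is why an ORA-00600 is raised whether a server process tries to update the row or fetch it by ROWID. The error arguments differ, but both are caused by the corrupt flag of row 23 in this data block.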
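
The ROWID used in the earlier queries can be decoded by hand to confirm it points at the same block and slot. This is a minimal sketch, assuming the usual extended ROWID layout (6 characters of data object id, 3 of relative file number, 6 of block number, 3 of row number, in a base-64 alphabet) and a DBA that packs the file number above a 22-bit block number; the helper names are made up for illustration.

```python
# Assumed base-64 alphabet for extended ROWIDs: A-Z, a-z, 0-9, +, /.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_to_int(text):
    """Convert a base-64 ROWID component to an integer."""
    value = 0
    for ch in text:
        value = value * 64 + ALPHABET.index(ch)
    return value

def decode_rowid(rowid):
    """Split an 18-character extended ROWID into its four components."""
    if len(rowid) != 18:
        raise ValueError("extended ROWID must be 18 characters")
    return {
        "data_object_id": b64_to_int(rowid[0:6]),
        "relative_fno": b64_to_int(rowid[6:9]),
        "block": b64_to_int(rowid[9:15]),
        "row": b64_to_int(rowid[15:18]),
    }

info = decode_rowid("AAAQKhAAMAABhYSAAX")
print(info)
print("object id:", hex(info["data_object_id"]))

# Combine file and block into a DBA (10 bits of file, 22 bits of block).
dba = (info["relative_fno"] << 22) | info["block"]
print("DBA:", dba, hex(dba))
```

The decoded object id 0x102a1 matches `seg/obj: 0x102a1` in the trace, the DBA matches `DBA = 50730514` and the block header dump value 0x03061612, and the row number matches the bad slot 23.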

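For more on the cause and resolution of ORA-00600 [13013], [5001], see the article "A worked example: manually simulating Oracle data block logical corruption that triggers ORA-00600 [13013], [5001]".

To be revised!

A worked example: manually simulating Oracle data block logical corruption that triggers ORA-00600 [13013], [5001]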

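Last week at a customer site I ran into a case where an Oracle bug caused logical corruption in a table data block, which in turn triggered ORA-00600 [13013], [5001]. To explain the problem better, I decided to reproduce the block corruption by hand.

Background

A table data block in Oracle is made up of several structures: the block header, the transaction slots (ITL), the row directory, the row data, and so on. The row data actually consists of many row pieces, and each row piece begins with three header fields: flag, lock, and cols (cc).

The flag identifies the type of the row piece. It occupies one byte, and each bit has its own meaning, as listed below: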

ROW_CLUSTER_KEY = 0x80;              KDRHFK
ROW_CTABLE_NUMBER = 0x40;            KDRHFC
ROW_HEAD_PIECE = 0x20;               KDRHFH
ROW_DELETED_ROW = 0x10;              KDRHFD
ROW_FIRST_PIECE = 0x08;              KDRHFF
ROW_LAST_PIECE = 0x04;               KDRHFL
ROW_FROM_PREVIOUS = 0x02;            KDRHFP
ROW_CONTINUE_NEXT = 0x01;            KDRHFN
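
The bit table above maps directly onto the eight-character `fb` string printed in block dumps. A minimal sketch of that mapping, using only the bit values listed above (the function name `fb_string` is made up for illustration):

```python
# Bit values copied from the table above; the letter is the last
# character of the corresponding KDRHF* name.
FLAG_BITS = [
    (0x80, "K"),  # ROW_CLUSTER_KEY
    (0x40, "C"),  # ROW_CTABLE_NUMBER
    (0x20, "H"),  # ROW_HEAD_PIECE
    (0x10, "D"),  # ROW_DELETED_ROW
    (0x08, "F"),  # ROW_FIRST_PIECE
    (0x04, "L"),  # ROW_LAST_PIECE
    (0x02, "P"),  # ROW_FROM_PREVIOUS
    (0x01, "N"),  # ROW_CONTINUE_NEXT
]

def fb_string(flag):
    """Render a flag byte the way a block dump shows it, e.g. --H-FL--."""
    return "".join(letter if flag & bit else "-" for bit, letter in FLAG_BITS)

for flag in (0x2c, 0xac, 0x6c, 0x31, 0xff):
    print(hex(flag), fb_string(flag))
```

A flag of 0xff sets every bit, which is why the dump shows KCHDFLPN for the corrupt row piece.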

The most ordinary row piece belongs to a plain heap table, is not deleted, and is neither migrated nor chained. Its flag is normally:
Single Row =
ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE= 0x20 + 0x08 + 0x04= 0x2c
===================================================================================
The flag of a cluster key is normally:
Cluster Key =
ROW_CLUSTER_KEY + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE=
KDRHFL, KDRHFF, KDRHFH, KDRHFK =0x80 + 0x2c =  0xac
BBED> x /rn
rowdata[68]                                 @8166
-----------
flag@8166: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)
lock@8167: 0x00
cols@8168:    1
kref@8169:    1
mref@8171:    1
hrid@8173:0x01800014.0
nrid@8179:0x01800014.0
col    0[2] @8185: 10 
===================================================================================
Cluster Row =
ROW_CTABLE_NUMBER + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE =
(KDRHFL, KDRHFF, KDRHFH, KDRHFC) = 0x6c 
BBED> x /rncc
rowdata[0]                                  @8098
----------
flag@8098: 0x6c (KDRHFL, KDRHFF, KDRHFH, KDRHFC)
lock@8099: 0x00
cols@8100:   10
col    0[2] @8102: 200
col    1[8] @8105: Jennifer
col    2[6] @8114: Whalen
col    3[7] @8121: JWHALEN
col   4[12] @8129: 515.123.4444
col    5[7] @8142: w....
col    6[7] @8150: AD_ASST
col    7[2] @8158: 
col    8[0] @8161: *NULL*
col    9[3] @8162: .
When ORA-00600 [13013], [5001] is raised with Arg [f] Code = 3, the flag of the row piece is >= 0xc0,
that is, the row piece is marked as both key and clustered (row is marked as both a Key and Clustered); the corresponding check code is 6251.
flag >= 0xc0: kdrchk: row is marked as both a Key and Clustered; Block 12 failed with check code 6251
0xac > flag >= 0xa0: kdrchk: row is Key and Not only piece of key; Block 12 failed with check code 6255
flag = 0x43: kdrchk: C and neither of H or F; Block 12 failed with check code 6263
flag = 0x83: kdrchk: row is marked both as a Key and being continued; Block 12 failed with check code 6254

 

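When an Oracle process reads a data block, it first computes the block's checksum and compares it with the CHECKSUM value stored in the block; if they match, the block has no physical corruption. That check alone is not enough to guarantee the block is sound, so Oracle adds a series of logical checks, each identified by a check code. These checks cover, for example, whether a row piece's flag and cols (cc) values are valid.

The functions that perform these logical checks include kdbchk, kddummy_blkchk, kco_blkchk, kdBlkCheckError, kdrchk, and others.

Here, when the server process reaches the problem block, the check code finds a flag of 0xff (KCHDFLPN). That flag is logically contradictory, so the row piece is treated as corrupt, which leads to the internal errors ORA-00600 [13013], [5001] on update and ORA-00600 [qertbFetchByRowID] on query.

It is worth noting that many people believe dbv cannot detect logical corruption. In fact dbv, RMAN, validate structure, and the BBED verify command can all detect logical corruption to some degree, but the most reliable is validate structure [online] with db_block_checksum=true. Put another way, plain dbv checks each block in isolation and cannot cross-check a table against its indexes to find inconsistencies between them, whereas validate structure online can.


Simulation

Now that we know how ORA-00600 [13013], [5001] is triggered, reproducing it by hand is not difficult. It does require the BBED tool.

First we create the tablespace, table, and index for the experiment: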

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> select * from global_name;
GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com
/* Create the tablespace for the experiment */
SQL> create tablespace maclean datafile '/home/oracle/maclean.dbf' size 20M;
Tablespace created.
SQL> create table tv tablespace maclean as select rownum t1,'find me' t2 from
dba_tables where rownum<201;
Table created.
SQL> create index ind_tv on tv(t1) tablespace users;
Index created.
SQL> update tv set t2='corrption here' where t1=200;
update tv set t2='corrption here' where t1=200
*
ERROR at line 1:
ORA-12899: value too large for column "SYS"."TV"."T2" (actual: 14, maximum: 7)
SQL> alter table tv modify t2 varchar2(200);
Table altered.
SQL> update tv set t2='corruption here' where t1=200;
1 row updated.
SQL> commit;
Commit complete.
/* The statements above create the sample table; the row with t1=200 is the one
that will later be modified by hand to make it corrupt */
SQL> select dump(200,16) from dual;
DUMP(200,16)
-----------------
Typ=2 Len=2: c2,3
/* The hexadecimal form makes it easy to locate the t1=200 row */
SQL> alter system checkpoint;
System altered.
SQL> alter tablespace maclean read only;
Tablespace altered.
SQL> select dbms_rowid.rowid_block_number(rowid) bno ,dbms_rowid.rowid_relative_fno(rowid) fno from tv;
BNO FNO
---------- ----------
12 6
[oracle@rh2 ~]$ cp maclean.dbf maclean.dbf.bak
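
The DUMP output c2,3 for the value 200 follows Oracle's base-100 NUMBER encoding, which is what makes the row searchable by hex pattern. The following is a minimal sketch for positive values only, assuming the commonly described layout (first byte is 0xc1 plus the base-100 exponent, each following byte is a base-100 digit plus one); the function name is made up for illustration.

```python
from decimal import Decimal

def decode_positive_number(raw):
    """Decode the internal bytes of a positive Oracle NUMBER.

    Zero and negative values use a different encoding and are
    deliberately not handled in this sketch.
    """
    if not raw or raw[0] <= 0x80:
        raise ValueError("only positive NUMBER values are handled here")
    exponent = raw[0] - 0xC1              # power of 100 of the first digit
    value = Decimal(0)
    for i, byte in enumerate(raw[1:]):
        digit = byte - 1                  # stored digits are offset by one
        value += digit * Decimal(100) ** (exponent - i)
    return value

print(decode_positive_number(bytes.fromhex("c203")))    # bytes shown by DUMP(200,16)
print(decode_positive_number(bytes.fromhex("c20264")))  # neighbouring key in the tv block dump
```

Searching the block for the byte pattern c203 therefore lands on the column value of the t1=200 row.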

 

Next, use BBED to locate the target row and modify it by hand:

 

[oracle@rh2 ~]$ bbed filename=maclean.dbf mode=edit
Password:
BBED: Release 2.0.0.0.0 - Limited Production on Sun Sep 18 22:14:59 2011
Copyright (c) 1982, 2007, Oracle. All rights reserved.
BBED> set blocksize 8192
BLOCKSIZE 8192
BBED> set block 13
BLOCK# 13
BBED> map /v
File: maclean.dbf (0)
Block: 13 Dba:0x00000000
------------------------------------------------------------
KTB Data Block (Table/Cluster)
struct kcbh, 20 bytes @0
ub1 type_kcbh @0
ub1 frmt_kcbh @1
ub1 spare1_kcbh @2
ub1 spare2_kcbh @3
ub4 rdba_kcbh @4
ub4 bas_kcbh @8
ub2 wrp_kcbh @12
ub1 seq_kcbh @14
ub1 flg_kcbh @15
ub2 chkval_kcbh @16
ub2 spare3_kcbh @18
struct ktbbh, 96 bytes @20
ub1 ktbbhtyp @20
union ktbbhsid, 4 bytes @24
struct ktbbhcsc, 8 bytes @28
b2 ktbbhict @36
ub1 ktbbhflg @38
ub1 ktbbhfsl @39
ub4 ktbbhfnx @40
struct ktbbhitl[3], 72 bytes @44
struct kdbh, 14 bytes @124
ub1 kdbhflag @124
b1 kdbhntab @125
b2 kdbhnrow @126
sb2 kdbhfrre @128
sb2 kdbhfsbo @130
sb2 kdbhfseo @132
b2 kdbhavsp @134
b2 kdbhtosp @136
struct kdbt[1], 4 bytes @138
b2 kdbtoffs @138
b2 kdbtnrow @140
sb2 kdbr[200] @142
ub1 freespace[4725] @542
ub1 rowdata[2921] @5267
ub4 tailchk @8188
BBED> find /x c203
File: maclean.dbf (0)
Block: 13 Offsets: 5271 to 5782 Dba:0x00000000
------------------------------------------------------------------------
c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e 64206d65
2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420 6d652c00
0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65 2c000203
c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00 0203c202
5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203 c2025c07
66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202 5a076669
6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807 66696e64
206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669 6e64206d
652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64 206d652c
000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d 652c0002
03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c 000203c2
024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002 03c2024d
0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2 024b0766
696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249 0766696e
64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766 696e6420
6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e 64206d65
The t1=200 value is found at offset 5271,
so its flag byte (fb) is at offset 5271 - 4 = 5267:
BBED> set offset 5267
OFFSET 5267
BBED> d
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
2c020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e
/* The target row starts at offset 5267; its current flag is the normal 0x2c */
BBED> x /rnc
rowdata[0] @5267
----------
flag@5267: 0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@5268: 0x02
cols@5269: 2
col 0[2] @5270: 200
col 1[15] @5273: corruption here
Change the flag to 0xff:
BBED> modify /x 0xff
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
ff020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e
BBED> x /rnc
rowdata[0] @5267
----------
flag@5267: 0xff (KDRHFN, KDRHFP, KDRHFL, KDRHFF, KDRHFD, KDRHFH, KDRHFC, KDRHFK)
lock@5268: 0x02
cols@5269: 0
ckix@5270: 2
BBED> sum apply
Check value for File 0, Block 13:
current = 0x0000, required = 0x0000
BBED's verify command flags the bad flag byte when it checks the block:
BBED> verify
DBVERIFY - Verification starting
FILE = maclean.dbf
BLOCK = 12
kdrchk: row is marked as both a Key and Clustered
prow=0x7f5335f05693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f5335f0427c
kdbchk: bad row tab 0, slot 199
Block 12 failed with check code 6251
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0
The dbv utility can also detect this kind of logical corruption:
[oracle@rh2 ~]$ dbv file=maclean.dbf
DBVERIFY: Release 10.2.0.4.0 - Production on Sun Sep 18 22:27:49 2011
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = maclean.dbf
kdrchk: row is marked as both a Key and Clustered
prow=0x7f9ef25f7693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f9ef25f627c
kdbchk: bad row tab 0, slot 199
Page 12 failed with check code 6251
DBVERIFY - Verification complete
Total Pages Examined : 2560
Total Pages Processed (Data) : 1
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 11
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 2548
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 691691 (0.691691)
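
The "minus 4" used earlier to get from the column data to the flag byte falls out of the row piece layout: one byte each for flag, lock, and cols, then one length byte before the first column's data. The following sketch parses the first bytes of the BBED dump taken before the modification. It assumes a plain heap row piece (flag 0x2c, no cluster fields) and single-byte column lengths; the function name is made up for illustration.

```python
def parse_heap_row_piece(buf, offset=0):
    """Parse flag, lock, cols, then a (length, data) pair for each column."""
    flag, lock, cols = buf[offset], buf[offset + 1], buf[offset + 2]
    pos = offset + 3
    columns = []
    for _ in range(cols):
        length = buf[pos]                 # assumed to be a single length byte
        pos += 1
        columns.append(bytes(buf[pos:pos + length]))
        pos += length
    return {"flag": flag, "lock": lock, "cols": cols,
            "columns": columns, "end": pos}

ROW_OFFSET = 5267  # block offset of the row piece, from the BBED session

# First 28 bytes of the dump at offset 5267, before the flag was changed.
dump = bytes.fromhex(
    "2c020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203"
)

row = parse_heap_row_piece(dump)
print(hex(row["flag"]), row["lock"], row["cols"])
print(row["columns"][0].hex(), row["columns"][1].decode("ascii"))
print("col 0 data starts at block offset", ROW_OFFSET + 4)
```

The parsed fields agree with the `x /rnc` output: flag 0x2c, lock 2, two columns, and the text "corruption here" in the second column.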

 

Back in SQL*Plus, accessing the modified row triggers ORA-600 [13013] [5001]:

 

SQL> alter system flush buffer_cache;
System altered.
SQL> update tv set t2='correct here' where t1=200;
update tv set t2='correct here' where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [13013], [5001], [52937],
[25165836], [199], [25165836], [3], []
PLAN_TABLE_OUTPUT
---------------------------------------------------------
Plan hash value: 568795662
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | UPDATE STATEMENT  |        |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  UPDATE           | TV     |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| IND_TV |     1 |   115 |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"=200)
SQL> select * from tv where t1=200;
select * from tv where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [],
[], [], []
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1015724781
--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TV     |     1 |   115 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_TV |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T1"=200)

As expected, the update that touches the problem row raises ORA-00600 [13013], [5001], while TABLE ACCESS BY INDEX ROWID raises ORA-00600 [qertbFetchByRowID].
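
Two of the ORA-00600 [13013] arguments line up with the dbv and BBED output: 25165836 equals the DBA printed by Block Checking, and 199 equals the slot reported by kdbchk. A DBA can be split into a relative file number and a block number, which ties it back to the `BNO 12, FNO 6` query. A minimal sketch, assuming the usual split of 10 bits for the file and 22 bits for the block (the function name is made up for illustration):

```python
def split_dba(dba):
    """Split a data block address into (relative file number, block number)."""
    return dba >> 22, dba & 0x3FFFFF

# DBA from the ORA-00600 [13013] arguments and the dbv output.
print(split_dba(25165836))

# DBA from the customer trace in the first part of the article.
print(split_dba(50730514))
```

The first value splits into file 6, block 12, the same block that `blockrecover datafile 6 block 12` targets.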

Solutions

1. If a backup is available, the problem block can be repaired online with block recovery:

RMAN> blockrecover datafile 6 block 12;

Starting blockrecover at 18-SEP-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=144 devtype=DISK

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished blockrecover at 18-SEP-11

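Note, however, that if the logical corruption really was caused by an Oracle bug, blockrecover may well be unable to fix it. In that case, consider the second method.

2. The second method applies when no backup is available or when recovering the block does not help. Set event 10231 and copy the table with CTAS. Be aware that the problem rows may be lost:

SQL> alter session set events '10231 trace name context forever, level 10';

SQL> create table TABLE_COPY as select * from TABLE;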

 

More information about the kdrchk function:

Add check for continued row piece pointing to itself with
corruption description:
"kdrchk: Row piece pointing to itself"
DB_BLOCK_CHECKING = MEDIUM will check for row pieces where the
next rowid (nrid) points to itself (chained row points to itself).
It produces error ORA-600 [kddummy_blkchk] or ORA-600 [kdBlkCheckError]
with check code [6266] (3rd ORA-600 argument).
DBVERIFY reports the same corruption description if the block is corrupt on disk.
RMAN when run with the CHECK LOGICAL option reports it as
corruption_type=CORRUPT/LOGICAL in v$database_block_corruption.
"ANALYZE TABLE  VALIDATE STRUCTURE" produces error ORA-1498 and trace file
shows the same corruption description.
With this fix in place DBMS_REPAIR can be used to identify and mark the affected
block as Soft Corrupt producing error ORA-1578 and it can be skipped it for DML's
using DBMS_REPAIR.SKIP_CORRUPT_BLOCKS.
[CM][SG][event 1][domain Q423][mem 0] Joining shared group
kdrchk: column length 0 but not null
prow=0x2a97f4d9d6 flag=0x2c column=57
Block Checking: DBA = 29635651, Block Type = KTB-managed data block
data header at 0x2a97f4be7c
kdbchk: bad row tab 0, slot 2
data_block_dump,data header at 0x2a97d113d8
data_block_dump,data header at 0x2a97d113d8
kdrchk: found invalid symbol reference 48
reference to delete symbol
valid symbol range [0,78)
Block Checking: DBA = 411055291, Block Type = KTB-managed data block
data header at 0x68a3f4
kdbchk: bad row tab 0, slot 4
Page 13499 failed with check code 6265
kdrchk: C and neither of H or F
prow=0x4282803ae flag=0x41
Block Checking: DBA = 322963095, Block Type = KTB-managed data block
data header at 0x42828007c
kdrchk: column length 0 but not null
prow=0x10021035e flag=0x2c column=40
Block Checking: DBA = 25189259, Block Type = KTB-managed data block
data header at 0x10020fe7c
kdbchk: bad row tab 0, slot 0
Page 23435 failed with check code 6264
kdrchk: column length 0 but not null
prow=0x1002122e5 flag=0x2c column=40
Block Checking: DBA = 25189260, Block Type = KTB-managed data block
kdrchk:  row is marked as both a Key and Clustered
prow=0xd2bfa981 flag=0xff
File#67, Block#74754
kdbchk: bad row tab 0, slot 0
kdrchk:  no columns, but has one of P or N
prow=0x934fbffa flag=0x31
DIAGNOSTIC ANALYSIS:
====================
A look at the block dump in the analyze trace file revealed two very
suspicious looking rows:
tab 0, row 0, @0x1ede
tl: 2 fb: --HD---N lb: 0x0
tab 0, row 1, @0x1edc
tl: 2 fb: --HD---N lb: 0x0
The flag bytes in these rows look incorrect.

To be revised!

Examine 11g automatic block Corruption recovery

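Within the 11g high-availability framework, Data Guard brings many useful features. The most eye-catching is clearly Active Data Guard, which almost completely changed the Oracle HA landscape. Besides Active Data Guard, Automatic Block Media Repair is another 11g Data Guard feature that should not be overlooked. Through the ABMR background process, it automatically ships healthy copies of data blocks from the physical standby to the primary database to replace blocks found corrupt on the primary. Likewise, if a corrupt block is found on the physical standby, the same feature can repair it. Note that using this feature does not require changing db_lost_write_protect from its default of NONE; ABMR's automatic repair does not depend on that parameter.

Digging deeper, we find that Automatic Block Media Repair is actually controlled by a set of hidden parameters, including: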

Parameter                   Value     Description
_auto_bmr                   enabled   enable/disable Auto BMR
_auto_bmr_req_timeout       60        Auto BMR Requester Timeout
_auto_bmr_sess_threshold    30        Auto BMR Request Session Threshold
_auto_bmr_pub_timeout       10        Auto BMR Publish Timeout
_auto_bmr_fc_time           60        Auto BMR Flood Control Time
_auto_bmr_bg_time           3600      Auto BMR Process Run Time
_auto_bmr_sys_threshold     100       Auto BMR Request System Threshold

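Clearly the hidden parameter _auto_bmr is the switch for the ABMR feature, and its default value is enabled. The other parameters define ABMR timeouts and thresholds and are not covered here.

Let's try this high-availability block repair feature in practice: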

SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE	11.2.0.2.0	Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production
/* The demo database is the fairly recent 11.2.0.2 */
SQL> conn maclean/maclean
Connected.
SQL> create tablespace abmr datafile size 10M;
Tablespace created.
SQL> select file_name from dba_data_files where tablespace_name='ABMR';
FILE_NAME
--------------------------------------------------------------------------------
/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf
SQL>  create table need_repair tablespace abmr  as select 1 t1 from dual;
Table created.
SQL> exec dbms_stats.gather_table_stats('MACLEAN','NEED_REPAIR');
PL/SQL procedure successfully completed.
SQL> select * from need_repair;
T1
----------
1
SQL>  select dbms_rowid.rowid_block_number(rowid) from need_repair;
DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
------------------------------------
131
SQL> alter system flush buffer_cache;
System altered.
SQL> alter system flush buffer_cache;
System altered.
[maclean@rh6 ~]$ bbed FILENAME=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf mode=edit
Password: 
BBED: Release 2.0.0.0.0 - Limited Production on Sat Mar 26 20:42:25 2011
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
************* !!! For Oracle Internal Use only !!! ***************
BBED> set block 131
BLOCK#         	131
BBED> corrupt block 131
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
Block marked media corrupt.
/* Above, we used BBED to corrupt the data block that holds the only row of the need_repair table.
Without the 11g automatic block corruption recovery feature,
the query should fail with ORA-01578: ORACLE data block corrupted
*/
SQL> conn maclean/maclean
Connected.
SQL> select * from need_repair;
T1
----------
1
/* The query above did not fail; the block was evidently repaired by the ABMR process started in the background */
/* The alert log records the ABMR repair */
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24289.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x00000083
last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x000006ff
check value in block header: 0x154d
computed block checksum: 0x0
Reading datafile '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf' for corruption at rdba: 0x01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)
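/* Above, after the dedicated server process read the block from disk twice and found it corrupt both times, it started the ABMR background process.
ABMR does not start with the instance; a server process starts it only when needed
*/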
Starting background process ABMR
ABMR started with pid=33, OS id=24293 
Automatic block media recovery service is active.
Automatic block media recovery requested for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
Automatic block media recovery successful for (file# 5, block# 131)
WARNING: AutoBMR fixed mismatched on-disk single block 83 with in-mem rdba 1400083.
/* ABMR is also not a fatal background process; killing it has no ill effect */
[maclean@rh6 ~]$ ps -ef|grep abmr
maclean  24293     1  0 20:43 ?        00:00:00 ora_abmr_PROD
maclean  24390 22254  0 20:49 pts/0    00:00:00 grep abmr
[maclean@rh6 ~]$ kill -9 24293
/* If you do not want Automatic Block Media Repair, or the feature itself is causing problems,
setting _auto_bmr to disabled turns it off and usually works around the problem */
SQL> alter system set "_auto_bmr"=disabled;
System altered.
SQL> select * from need_repair;
select * from need_repair
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5:
'/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf'
/* As expected, ORA-01578 is raised */
Relevant alert log content:
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24742.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x00000083
last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x000006ff
check value in block header: 0x154d
computed block checksum: 0x0
Reading datafile '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf' for corruption at rdba: 0x01400083 (file 5, block 131)
Reread (file 5, block 131) found same corrupt data (no logical check)
Corrupt Block Found
TSN = 5, TSNAME = ABMR
RFN = 5, BLK = 131, RDBA = 20971651
OBJN = 13773, OBJD = 13773, OBJECT = NEED_REPAIR, SUBOBJECT = 
SEGMENT OWNER = MACLEAN, SEGMENT TYPE = Table Segment
Errors in file /s01/diag/rdbms/prod/PROD/trace/PROD_ora_24742.trc  (incident=5081):
ORA-01578: ORACLE data block corrupted (file # 5, block # 131)
ORA-01110: data file 5: '/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf'
Incident details in: /s01/diag/rdbms/prod/PROD/incident/incdir_5081/PROD_ora_24742_i5081.trc
2011-03-26 21:08:18.718000 +08:00
Sweep [inc][5081]: completed
Hex dump of (file 5, block 131) in trace file /s01/diag/rdbms/prod/PROD/incident/incdir_5081/PROD_m000_24753_i5081_a.trc
Corrupt block relative dba: 0x01400083 (file 5, block 131)
Bad header found during validation
Data in bad block:
type: 6 format: 2 rdba: 0x00000083
last change scn: 0x0000.00000000 seq: 0xff flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x000006ff
check value in block header: 0x154d
computed block checksum: 0x0
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Reread of blocknum=131, file=/standby/oradata/PROD/datafile/o1_mf_abmr_6rvqp087_.dbf. found same corrupt data
Checker run found 1 new persistent data failures
Dumping diagnostic data in directory=[cdmp_20110326210819], requested by (instance=1, osid=24742), summary=[incident=5081].
2011-03-26 21:08:21.458000 +08:00
Sweep [inc2][5081]: completed

to be continued …………

ARCHIVER ERROR ORA-00354: CORRUPT REDO LOG BLOCK HEADER

Problem Description:
ORA-16038: log 2 sequence# 13831 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 2 thread 1: '/oradata/3/TOOLS/stdby_redo/srl1.log'

LOG FILE
---------------
Filename = alert_TOOLS5_from_1021.log
See ...
...
Wed Oct 28 11:41:59 2009
Primary database is in MAXIMUM AVAILABILITY mode
Standby controlfile consistent with primary
RFS[1]: Successfully opened standby log 1: '/oradata/3/TOOLS/stdby_redo/srl0.log'
Wed Oct 28 11:42:00 2009
ARC0: Log corruption near block 604525 change 10551037679542 time ?
Wed Oct 28 11:42:00 2009
Errors in file /tools/oracle/admin/TOOLS/bdump/tools_arc0_2143.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 604525 change 10551037679542 time 10/28/2009 11:29:50
ORA-00312: online log 2 thread 1: '/oradata/3/TOOLS/stdby_redo/srl1.log'
ARC0: All Archive destinations made inactive due to error 354
Wed Oct 28 11:42:00 2009
ARC0: Closing local archive destination LOG_ARCHIVE_DEST_2: '/oradata/3/TOOLS/archive/dgarc/1_13831_635534096.arc' (error 354)
(TOOLS)
Committing creation of archivelog '/oradata/3/TOOLS/archive/dgarc/1_13831_635534096.arc' (error 354)
ARCH: Archival stopped, error occurred. Will continue retrying
Wed Oct 28 11:42:05 2009
ORACLE Instance TOOLS - Archival Error
Wed Oct 28 11:42:05 2009
ORA-16038: log 2 sequence# 13831 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 2 thread 1: '/oradata/3/TOOLS/stdby_redo/srl1.log'
Wed Oct 28 11:42:05 2009
Errors in file /tools/oracle/admin/TOOLS/bdump/tools_arc0_2143.trc:
ORA-16038: log 2 sequence# 13831 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 2 thread 1: '/oradata/3/TOOLS/stdby_redo/srl1.log'
Wed Oct 28 11:43:04 2009
ARCH: Archival stopped, error occurred. Will continue retrying
Wed Oct 28 11:43:04 2009
ORACLE Instance TOOLS - Archival Error
Wed Oct 28 11:43:04 2009
Primary database is in MAXIMUM AVAILABILITY mode
Changing standby controlfile to RESYNCHRONIZATION level
Wed Oct 28 11:43:04 2009
ORA-16014: log 1 sequence# 13832 not archived, no available destinations
ORA-00312: online log 1 thread 1: '/oradata/3/TOOLS/stdby_redo/srl0.log'
Wed Oct 28 11:43:04 2009
Errors in file /tools/oracle/admin/TOOLS/bdump/tools_arc1_2145.trc:
ORA-16014: log 1 sequence# 13832 not archived, no available destinations
ORA-00312: online log 1 thread 1: '/oradata/3/TOOLS/stdby_redo/srl0.log'
RFS[1]: Successfully opened standby log 2: '/oradata/3/TOOLS/stdby_redo/srl1.log'
Wed Oct 28 11:43:13 2009
RFS[3]: Archived Log: '/oradata/3/TOOLS/archive/dgarc/1_13831_635534096.arc'
Wed Oct 28 11:43:14 2009
RFS LogMiner: Registered logfile [/oradata/3/TOOLS/archive/dgarc/1_13831_635534096.arc] to LogMiner session id [4]
Wed Oct 28 11:43:15 2009
LOGMINER: Begin mining logfile for session 4 thread 1 sequence 13831, /oradata/3/TOOLS/archive/dgarc/1_13831_635534096.arc
Wed Oct 28 11:44:03 2009
RFS[3]: Archived Log: '/oradata/3/TOOLS/archive/dgarc/1_13832_635534096.arc'
...
LOG FILE
---------------
Filename = alert_TOOLS6_from_1021.log
See ...
...
Wed Oct 28 11:16:01 2009
Thread 1 advanced to log sequence 13830 (LGWR switch)
Current log# 8 seq# 13830 mem# 0: /oradata/1/redo/TOOLS/redo1a.log
Current log# 8 seq# 13830 mem# 1: /oradata/2/redo/TOOLS/redo1b.log
Current log# 8 seq# 13830 mem# 2: /oradata/3/redo/TOOLS/redo1c.log
Wed Oct 28 11:29:50 2009
LGWR: Standby redo logfile selected to archive thread 1 sequence 13831
LGWR: Standby redo logfile selected for thread 1 sequence 13831 for destination LOG_ARCHIVE_DEST_2
Wed Oct 28 11:29:50 2009
Thread 1 advanced to log sequence 13831 (LGWR switch)
Current log# 9 seq# 13831 mem# 0: /oradata/1/redo/TOOLS/redo2a.log
Current log# 9 seq# 13831 mem# 1: /oradata/2/redo/TOOLS/redo2b.log
Current log# 9 seq# 13831 mem# 2: /oradata/3/redo/TOOLS/redo2c.log
Wed Oct 28 11:41:59 2009
LGWR: Standby redo logfile selected to archive thread 1 sequence 13832
LGWR: Standby redo logfile selected for thread 1 sequence 13832 for destination LOG_ARCHIVE_DEST_2
Wed Oct 28 11:41:59 2009
Thread 1 advanced to log sequence 13832 (LGWR switch)
Current log# 10 seq# 13832 mem# 0: /oradata/1/redo/TOOLS/redo3a.log
Current log# 10 seq# 13832 mem# 1: /oradata/2/redo/TOOLS/redo3b.log
Current log# 10 seq# 13832 mem# 2: /oradata/3/redo/TOOLS/redo3c.log
Wed Oct 28 11:43:04 2009
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Standby redo logfile selected to archive thread 1 sequence 13833
LGWR: Standby redo logfile selected for thread 1 sequence 13833 for destination LOG_ARCHIVE_DEST_2
Wed Oct 28 11:43:04 2009
Thread 1 advanced to log sequence 13833 (LGWR switch)
Current log# 11 seq# 13833 mem# 0: /oradata/1/redo/TOOLS/redo4a.log
Current log# 11 seq# 13833 mem# 1: /oradata/2/redo/TOOLS/redo4b.log
Current log# 11 seq# 13833 mem# 2: /oradata/3/redo/TOOLS/redo4c.log
Wed Oct 28 11:45:04 2009
Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED
LGWR: Standby redo logfile selected to archive thread 1 sequence 13834
LGWR: Standby redo logfile selected for thread 1 sequence 13834 for destination LOG_ARCHIVE_DEST_2
Wed Oct 28 11:45:05 2009
Thread 1 advanced to log sequence 13834 (LGWR switch)
Current log# 8 seq# 13834 mem# 0: /oradata/1/redo/TOOLS/redo1a.log
Current log# 8 seq# 13834 mem# 1: /oradata/2/redo/TOOLS/redo1b.log
Current log# 8 seq# 13834 mem# 2: /oradata/3/redo/TOOLS/redo1c.log
Wed Oct 28 11:46:03 2009
Thread 1 cannot allocate new log, sequence 13835
Checkpoint not complete
Current log# 8 seq# 13834 mem# 0: /oradata/1/redo/TOOLS/redo1a.log
Current log# 8 seq# 13834 mem# 1: /oradata/2/redo/TOOLS/redo1b.log
Current log# 8 seq# 13834 mem# 2: /oradata/3/redo/TOOLS/redo1c.log
Wed Oct 28 11:46:10 2009
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Standby redo logfile selected to archive thread 1 sequence 13835
LGWR: Standby redo logfile selected for thread 1 sequence 13835 for destination LOG_ARCHIVE_DEST_2
Wed Oct 28 11:46:11 2009
Thread 1 advanced to log sequence 13835 (LGWR switch)
Current log# 9 seq# 13835 mem# 0: /oradata/1/redo/TOOLS/redo2a.log
Current log# 9 seq# 13835 mem# 1: /oradata/2/redo/TOOLS/redo2b.log
Current log# 9 seq# 13835 mem# 2: /oradata/3/redo/TOOLS/redo2c.log
Wed Oct 28 11:48:03 2009
Thread 1 cannot allocate new log, sequence 13836
Checkpoint not complete
Current log# 9 seq# 13835 mem# 0: /oradata/1/redo/TOOLS/redo2a.log
Current log# 9 seq# 13835 mem# 1: /oradata/2/redo/TOOLS/redo2b.log
Current log# 9 seq# 13835 mem# 2: /oradata/3/redo/TOOLS/redo2c.log
Wed Oct 28 11:48:06 2009
...
On the standby, at 11:42 on 2009-10-28, the archiver tried to archive the standby
redo log file and encountered this error:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 604525 change 10551037679542 time 10/28/2009 11:29:50
ORA-00312: online log 2 thread 1: '/oradata/3/TOOLS/stdby_redo/srl1.log'
Errors in file /tools/oracle/admin/TOOLS/bdump/tools_arc0_2143.trc

The standby RFS process then retrieved the complete log file from the primary, and log apply continued as usual.
The fact that the standby redo log was corrupt, and was identified as corrupt by the ARCn process, suggests that some kind of I/O error caused the damage.
The alert.log confirms that RFS fetched a fresh copy of the log that had been corrupted on the standby, which resolved the issue.
What remains is a matter for the system administrators: they should check the I/O subsystem for underlying problems.
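
The fields in the ORA-00353 message (block number, change number and timestamp) are what tie the corruption to a particular log switch: the time 11:29:50 matches the moment the primary selected the standby redo log for sequence 13831. As a minimal sketch, these fields can be extracted from an alert log line with a regular expression. The function name parse_ora_00353 and the pattern are illustrative assumptions based only on the message format shown above, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Pattern for the ORA-00353 line as it appears in the alert log above.
ORA_00353 = re.compile(
    r"ORA-00353: log corruption near block (?P<block>\d+) "
    r"change (?P<change>\d+) "
    r"time (?P<time>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
)


def parse_ora_00353(line):
    """Return (block, change#, timestamp) from an ORA-00353 line, or None."""
    match = ORA_00353.search(line)
    if match is None:
        return None
    # The log prints the date as month/day/year.
    when = datetime.strptime(match.group("time"), "%m/%d/%Y %H:%M:%S")
    return int(match.group("block")), int(match.group("change")), when


line = ("ORA-00353: log corruption near block 604525 "
        "change 10551037679542 time 10/28/2009 11:29:50")
print(parse_ora_00353(line))
```

Comparing the returned timestamp with the log switch times in the primary alert log is one way to confirm which sequence the damaged standby redo log belonged to.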

The following script collects Data Guard primary site diagnostic information:

Overview
--------
This script is intended to provide an easy method to provide information
necessary to troubleshoot Data Guard issues.

Script Notes
-------------
This script is intended to be run via sqlplus as the SYS or Internal user.

Script
-------
- - - - - - - - - - - - - - - - Script begins here - - - - - - - - - - - - - - - -

-- NAME: dg_prim_diag.sql (Run on PRIMARY with a LOGICAL or PHYSICAL STANDBY)
-- ------------------------------------------------------------------------
-- Copyright 2002, Oracle Corporation
-- LAST UPDATED: 2/23/04

-- Usage: @dg_prim_diag
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is to be used to assist in collecting information to help
-- troubleshoot Data Guard issues with an emphasis on Logical Standby.
-- ------------------------------------------------------------------------
-- DISCLAIMER:
-- This script is provided for educational purposes only. It is NOT
-- supported by Oracle World Wide Technical Support.
-- The script has been tested and appears to work as intended.
-- You should always run new scripts on a test instance initially.
-- ------------------------------------------------------------------------
-- Script output is as follows:

set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hhmi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool dg_prim_diag_&&dbname&&timestamp&&suffix
set linesize 79
set pagesize 35
set trim on
set trims on
alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
set feedback on
select to_char(sysdate) time from dual;

set echo on

-- In the following the database_role should be primary as that is what
-- this script is intended to be run on. If protection_level is different
-- than protection_mode then for some reason the mode listed in
-- protection_mode experienced a need to downgrade. Once the error
-- condition has been corrected the protection_level should match the
-- protection_mode after the next log switch.

column role format a7 tru
column name format a10 wrap

select name,database_role role,log_mode,
protection_mode,protection_level
from v$database;

-- ARCHIVER can be (STOPPED | STARTED | FAILED). FAILED means that the
-- archiver failed to archive a log last time, but will try again within 5
-- minutes. LOG_SWITCH_WAIT is the ARCHIVE LOG/CLEAR LOG/CHECKPOINT event
-- that log switching is waiting for. Note that if ALTER SYSTEM SWITCH
-- LOGFILE is hung, but there is room in the current online redo log, then
-- the value is NULL.

column host_name format a20 tru
column version format a9 tru

select instance_name,host_name,version,archiver,log_switch_wait
from v$instance;

-- The following query gives us information about catpatch.
-- This way we can tell if the procedure doesn't match the image.

select version, modified, status from dba_registry
where comp_id = 'CATPROC';

-- Force logging is not mandatory but is recommended. Supplemental
-- logging must be enabled if the standby associated with this primary is
-- a logical standby. During normal operations it is acceptable for
-- SWITCHOVER_STATUS to be SESSIONS ACTIVE or TO STANDBY.

column force_logging format a13 tru
column remote_archive format a14 tru
column dataguard_broker format a16 tru

select force_logging,remote_archive,
supplemental_log_data_pk,supplemental_log_data_ui,
switchover_status,dataguard_broker
from v$database;

-- This query produces a list of all archive destinations. It shows if
-- they are enabled, what process is servicing that destination, if the
-- destination is local or remote, and if remote what the current mount ID
-- is.

column destination format a35 wrap
column process format a7
column archiver format a8
column ID format 99
column mid format 99

select dest_id "ID",destination,status,target,
schedule,process,mountid mid
from v$archive_dest order by dest_id;

-- This select will give further detail on the destinations as to what
-- options have been set. Register indicates whether or not the archived
-- redo log is registered in the remote destination control file.

set numwidth 8
column ID format 99

select dest_id "ID",archiver,transmit_mode,affirm,async_blocks async,
net_timeout net_time,delay_mins delay,reopen_secs reopen,
register,binding
from v$archive_dest order by dest_id;

-- The following select will show any errors that occurred the last time
-- an attempt was made to archive to the destination. If ERROR is
-- blank and status is VALID then the archive completed correctly.

column error format a55 wrap

select dest_id,status,error from v$archive_dest;

-- The query below will determine if any error conditions have been
-- reached by querying the v$dataguard_status view (view only available in
-- 9.2.0 and above):

column message format a80

select message, timestamp
from v$dataguard_status
where severity in ('Error','Fatal')
order by timestamp;

-- The following query will determine the current sequence number
-- and the last sequence archived. If you are remotely archiving
-- using the LGWR process then the archived sequence should be one
-- higher than the current sequence. If remotely archiving using the
-- ARCH process then the archived sequence should be equal to the
-- current sequence. The applied sequence information is updated at
-- log switch time.

select ads.dest_id,max(sequence#) "Current Sequence",
max(log_sequence) "Last Archived"
from v$archived_log al, v$archive_dest ad, v$archive_dest_status ads
where ad.dest_id=al.dest_id
and al.dest_id=ads.dest_id
group by ads.dest_id;

-- The following select will attempt to gather as much information as
-- possible from the standby. SRLs are not supported with Logical Standby
-- until Version 10.1.

set numwidth 8
column ID format 99
column "SRLs" format 99
column Active format 99

select dest_id id,database_mode db_mode,recovery_mode,
protection_mode,standby_logfile_count "SRLs",
standby_logfile_active ACTIVE,
archived_seq#
from v$archive_dest_status;

-- Query v$managed_standby to see the status of processes involved in
-- shipping redo on this system. Does not include processes needed to
-- apply redo.

select process,status,client_process,sequence#
from v$managed_standby;

-- The following query is run on the primary to see if SRL's have been
-- created in preparation for switchover.

select group#,sequence#,bytes from v$standby_log;

-- The above SRL's should match in number and in size with the ORL's
-- returned below:

select group#,thread#,sequence#,bytes,archived,status from v$log;

-- Non-default init parameters.

set numwidth 5
column name format a30 tru
column value format a48 wra
select name, value
from v$parameter
where isdefault = 'FALSE';

spool off

- - - - - - - - - - - - - - - - Script ends here - - - - - - - - - - - - - - - -

A second script, DG_phy_stby_diag.sql, collects the same kind of information on a physical standby:

Overview
--------

This script is intended to provide an easy method to provide information
necessary to troubleshoot Data Guard issues.

Script Notes
-------------

This script is intended to be run via sqlplus as the SYS or Internal user.

Script
-------

- - - - - - - - - - - - - - - - Script begins here - - - - - - - - - - - - - - - -

-- NAME: DG_phy_stby_diag.sql
-- ------------------------------------------------------------------------
-- AUTHOR:
-- Michael Smith - Oracle Support Services - DataServer Group
-- Copyright 2002, Oracle Corporation
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is to be used to assist in collecting information to help
-- troubleshoot Data Guard issues.
-- ------------------------------------------------------------------------
-- DISCLAIMER:
-- This script is provided for educational purposes only. It is NOT
-- supported by Oracle World Wide Technical Support.
-- The script has been tested and appears to work as intended.
-- You should always run new scripts on a test instance initially.
-- ------------------------------------------------------------------------
-- Script output is as follows:

set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hhmi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool dgdiag_phystby_&&dbname&&timestamp&&suffix
set lines 200
set pagesize 35
set trim on
set trims on
alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
set feedback on
select to_char(sysdate) time from dual;

set echo on


-- ARCHIVER can be (STOPPED | STARTED | FAILED). FAILED means that the archiver failed
-- to archive a log last time, but will try again within 5 minutes. LOG_SWITCH_WAIT
-- is the ARCHIVE LOG/CLEAR LOG/CHECKPOINT event that log switching is waiting for.
-- Note that if ALTER SYSTEM SWITCH LOGFILE is hung, but there is room in the current
-- online redo log, then the value is NULL.

column host_name format a20 tru
column version format a9 tru
select instance_name,host_name,version,archiver,log_switch_wait from v$instance;

-- The following select will give us the generic information about how this standby is
-- set up. The database_role should be standby as that is what this script is intended
-- to be run on. If protection_level is different than protection_mode then for some
-- reason the mode listed in protection_mode experienced a need to downgrade. Once the
-- error condition has been corrected the protection_level should match the protection_mode
-- after the next log switch.

column ROLE format a7 tru
select name,database_role,log_mode,controlfile_type,protection_mode,protection_level
from v$database;

-- Force logging is not mandatory but is recommended. Supplemental logging should be enabled
-- on the standby if a logical standby is in the configuration. During normal
-- operations it is acceptable for SWITCHOVER_STATUS to be SESSIONS ACTIVE or NOT ALLOWED.

column force_logging format a13 tru
column remote_archive format a14 tru
column dataguard_broker format a16 tru
select force_logging,remote_archive,supplemental_log_data_pk,supplemental_log_data_ui,
switchover_status,dataguard_broker from v$database;

-- This query produces a list of all archive destinations and shows if they are enabled,
-- what process is servicing that destination, if the destination is local or remote,
-- and if remote what the current mount ID is. For a physical standby we should have at
-- least one remote destination that points to the primary, but it should be deferred.

COLUMN destination FORMAT A35 WRAP
column process format a7
column archiver format a8
column ID format 99

select dest_id "ID",destination,status,target,
archiver,schedule,process,mountid
from v$archive_dest;

-- If the protection mode of the standby is set to anything higher than max performance
-- then we need to make sure the remote destination that points to the primary is set
-- with the correct options else we will have issues during switchover.

select dest_id,process,transmit_mode,async_blocks,
net_timeout,delay_mins,reopen_secs,register,binding
from v$archive_dest;

-- The following select will show any errors that occurred the last time an attempt was
-- made to archive to the destination. If ERROR is blank and status is VALID then
-- the archive completed correctly.

column error format a55 tru
select dest_id,status,error from v$archive_dest;

-- Determine if any error conditions have been reached by querying the v$dataguard_status
-- view (view only available in 9.2.0 and above):

column message format a80
select message, timestamp
from v$dataguard_status
where severity in ('Error','Fatal')
order by timestamp;

-- The following query is run to get the status of the SRL's on the standby. If the
-- primary is archiving with the LGWR process and SRL's are present (in the correct
-- number and size) then we should see a group# active.

select group#,sequence#,bytes,used,archived,status from v$standby_log;

-- The above SRL's should match in number and in size with the ORL's returned below:

select group#,thread#,sequence#,bytes,archived,status from v$log;

-- Query v$managed_standby to see the status of processes involved in the
-- configuration.

select process,status,client_process,sequence#,block#,active_agents,known_agents
from v$managed_standby;

-- Verify the last sequence# received and the last sequence# applied to the standby
-- database.

select al.thrd "Thread", almax "Last Seq Received", lhmax "Last Seq Applied"
from (select thread# thrd, max(sequence#) almax
from v$archived_log
where resetlogs_change#=(select resetlogs_change# from v$database)
group by thread#) al,
(select thread# thrd, max(sequence#) lhmax
from v$log_history
where first_time=(select max(first_time) from v$log_history)
group by thread#) lh
where al.thrd = lh.thrd;

-- The V$ARCHIVE_GAP fixed view on a physical standby database only returns the next
-- gap that is currently blocking redo apply from continuing. After resolving the
-- identified gap and starting redo apply, query the V$ARCHIVE_GAP fixed view again
-- on the physical standby database to determine the next gap sequence, if there is
-- one.

select * from v$archive_gap;

-- Non-default init parameters.

set numwidth 5
column name format a30 tru
column value format a50 wra
select name, value
from v$parameter
where isdefault = 'FALSE';

spool off

- - - - - - - - - - - - - - - - Script ends here - - - - - - - - - - - - - - - -

[Oracle Data Recovery] A case of the ORA-00600 [6711] error

On a 10.2.0.4 system running on Linux, the ORA-00600 [6711] internal error appeared frequently in the alert log:

 

Wed Sep  1 21:24:30 2010
Errors in file /s01/10gdb/admin/YOUYUS/bdump/youyus_smon_5622.trc:
ORA-00600: internal error code, arguments: [6711], [4256248], [1], [4256242], [0], [], [], []
Wed Sep  1 21:24:31 2010
Non-fatal internal error happenned while SMON was doing logging scn->time mapping.

 

 

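MOS has only a very brief Note on internal error 6711. It states that the error is most likely caused by latent corruption in data dictionary tables that are stored in a cluster; the Note does not even explain what the error's arguments mean.
We can make an educated guess, however. Because the error is corruption-related, the factors involved are most likely obj#, file# and block#. The two numbers 4256248 and 4256242 look very much like data block addresses (DBAs); read as DBAs, they point to blocks 61944 and 61938 of datafile 1, respectively. Let's see which object these blocks belong to: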

SQL> set linesize 200;
SQL> select segment_name, segment_type
2    from dba_extents
3   where relative_fno = 1
4     and (61938 between block_id and block_id + blocks - 1 or
5         61944 between block_id and block_id + blocks - 1);
SEGMENT_NAME                                                                      SEGMENT_TYPE
--------------------------------------------------------------------------------- ------------------
SMON_SCN_TO_TIME                                                                  CLUSTER
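
The DBA arithmetic used before the query can be sketched as follows. This assumes the usual layout of a 32-bit data block address in a smallfile tablespace (upper 10 bits for the relative file number, lower 22 bits for the block number) and, like the text, assumes that the second and fourth arguments of the error are DBAs. The helper names decode_dba and ora600_arguments are hypothetical.

```python
import re

FILE_SHIFT = 22                      # low 22 bits hold the block number
BLOCK_MASK = (1 << FILE_SHIFT) - 1   # 0x3FFFFF


def decode_dba(dba):
    """Split a 32-bit data block address into (relative file#, block#)."""
    return dba >> FILE_SHIFT, dba & BLOCK_MASK


def ora600_arguments(line):
    """Return the bracketed arguments of an ORA-00600 line as strings."""
    return re.findall(r"\[([^\]]*)\]", line)


line = ("ORA-00600: internal error code, arguments: "
        "[6711], [4256248], [1], [4256242], [0], [], [], []")
args = ora600_arguments(line)

# Assumption: arguments 2 and 4 are the DBAs of the blocks involved.
for raw in (args[1], args[3]):
    file_no, block_no = decode_dba(int(raw))
    print(raw, "-> file", file_no, "block", block_no)
```

Run as a script, this reports file 1, block 61944 for 4256248 and file 1, block 61938 for 4256242, the same blocks used in the dba_extents query. Inside the database, DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE and DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK perform the same split.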

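As expected, the query returns a cluster. SMON_SCN_TO_TIME is the base cluster of the SMON_SCN_TIME table, which records the timestamps that correspond to SCNs in the database. Looking directly at sql.bsq, the file used to create the data dictionary, shows their structure in more detail: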

cat $ORACLE_HOME/rdbms/admin/sql.bsq|grep -A 24 "create cluster smon_scn_to_time"
create cluster smon_scn_to_time (
thread number                         /* thread, compatibility */
)
/
create index smon_scn_to_time_idx on cluster smon_scn_to_time
/
create table smon_scn_time (
thread number,                         /* thread, compatibility */
time_mp number,                        /* time this recent scn represents */
time_dp date,                          /* time as date, compatibility */
scn_wrp number,                        /* scn.wrp, compatibility */
scn_bas number,                        /* scn.bas, compatibility */
num_mappings number,
tim_scn_map raw(1200),
scn number default 0,                  /* scn */
orig_thread number default 0           /* for downgrade */
) cluster smon_scn_to_time (thread)
/
create unique index smon_scn_time_tim_idx on smon_scn_time(time_mp)
/
create unique index smon_scn_time_scn_idx on smon_scn_time(scn)
/

The script above shows that several indexes exist on this cluster and its table, so all of these objects need to be validated:

SQL> analyze table SMON_SCN_TIME validate structure;
Table analyzed.
SQL> analyze table SMON_SCN_TIME validate structure cascade;
Table analyzed.
SQL> analyze cluster SMON_SCN_TO_TIME validate structure;
Cluster analyzed.
SQL> analyze cluster SMON_SCN_TO_TIME validate structure cascade;
analyze cluster SMON_SCN_TO_TIME validate structure cascade
*
ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file

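The picture is now clear: the problem lies in smon_scn_to_time_idx, the index on the SMON_SCN_TO_TIME cluster, which most likely contains logical corruption. Fortunately only the index is affected, so once the cause is known the fix is much easier: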

SQL> alter index smon_scn_to_time_idx rebuild ;
Index altered.
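/* When an index is corrupted, a plain rebuild is often ineffective: during the rebuild, ORA-00600 [6711] appeared in the alert log again. */
/* The corrupted index has to be dropped completely and then created again. */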
SQL> drop index smon_scn_to_time_idx ;
Index dropped.
SQL> create index smon_scn_to_time_idx on cluster smon_scn_to_time;
Index created.
/* The problem is now resolved; the error no longer appears in the alert log. */
/* That's great! */

How to make BBED (Oracle Block Browser and EDitor Tool) on Unix/Linux/Windows

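“BBED (Oracle Block Browser and EDitor Tool) is a tool for directly viewing and modifying the data in datafiles. It is an Oracle internal tool that can change the contents of Oracle datafile blocks directly; put simply, it is a binary editor for Oracle. The tool is not supported by Oracle, so no executable is built by default and it has to be compiled before use.”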

 

Building the tool on 10g is fairly simple:

 

[maclean@rh2 ~]$ cd $ORACLE_HOME/rdbms/lib
[maclean@rh2 lib]$ make -f ins_rdbms.mk $ORACLE_HOME/rdbms/lib/bbed
make: `/s01/10gdb/rdbms/lib/bbed' is up to date.
[maclean@rh2 lib]$ rm bbed
[maclean@rh2 lib]$ make -f ins_rdbms.mk $ORACLE_HOME/rdbms/lib/bbed
Linking BBED utility (bbed)
rm -f /s01/10gdb/rdbms/lib/bbed
gcc -o /s01/10gdb/rdbms/lib/bbed -L/s01/10gdb/rdbms/lib/ -L/s01/10gdb/lib/ -L/s01/10gdb/lib/stubs/  /s01/10gdb/lib/s0main.o /s01/10gdb/rdbms/lib/ssbbded.o /s01/10gdb/rdbms/lib/sbbdpt.o `cat /s01/10gdb/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 /s01/10gdb/rdbms/lib/defopt.o -ldbtools10 -lclntsh  `cat /s01/10gdb/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /s01/10gdb/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10 -lmm -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 `cat /s01/10gdb/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lnro10 `cat /s01/10gdb/lib/ldflags`    -lnsslb10 -lncrypt10 -lnsgr10 -lnzjs10 -ln10 -lnnz10 -lnl10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10   -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10 -lclient10 -lnnetd10  -lvsn10 -lcommon10 -lgeneric10 -lsnls10 -lnls10  -lcore10 -lsnls10 -lnls10 -lcore10 -lsnls10 -lnls10 -lxml10 -lcore10 -lunls10 -lsnls10 -lnls10 -lcore10 -lnls10   `cat /s01/10gdb/lib/sysliblist` -Wl,-rpath,/s01/10gdb/lib -lm    `cat /s01/10gdb/lib/sysliblist` -ldl -lm   -L/s01/10gdb/lib
[maclean@rh2 lib]$ cp bbed $ORACLE_HOME/bin
[maclean@rh2 lib]$ bbed
Password:
BBED: Release 2.0.0.0.0 - Limited Production on Thu Sep 2 14:18:27 2010
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
BBED>
/* You may be wondering what the password is. Heh .. 🙂 */

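In 11.2.0.1 the object files ssbbded.o and sbbdpt.o, which are needed to build the bbed executable, have been removed. Fortunately, these two object files can be taken from a 10g home and used to build bbed in 11.2.0.1.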

[maclean@rh2 lib]$ make -f ins_rdbms.mk $ORACLE_HOME/rdbms/lib/bbed
Linking BBED utility (bbed)
rm -f /s01/11gdb/rdbms/lib/bbed
gcc -o /s01/11gdb/rdbms/lib/bbed -m64 -L/s01/11gdb/rdbms/lib/ -L/s01/11gdb/lib/ -L/s01/11gdb/lib/stubs/  /s01/11gdb/lib/s0main.o /s01/11gdb/rdbms/lib/ssbbded.o /s01/11gdb/rdbms/lib/sbbdpt.o `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -ldbtools11 -lclntsh  `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnro11 `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnnz11 -lzt11 -lztkg11 -lztkg11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11 -lmm -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11 `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnro11 `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11   -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11 -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11   `cat /s01/11gdb/lib/sysliblist` -Wl,-rpath,/s01/11gdb/lib -lm    `cat /s01/11gdb/lib/sysliblist` -ldl -lm   -L/s01/11gdb/lib
gcc: /s01/11gdb/rdbms/lib/ssbbded.o: No such file or directory
gcc: /s01/11gdb/rdbms/lib/sbbdpt.o: No such file or directory
[maclean@rh2 ~]$ cp /s01/10gdb/rdbms/lib/ssbbded.o /s01/11gdb/rdbms/lib
[maclean@rh2 ~]$ cp /s01/10gdb/rdbms/lib/sbbdpt.o  /s01/11gdb/rdbms/lib
[maclean@rh2 ~]$ cp /s01/10gdb/rdbms/mesg/bbedus.ms* /s01/11gdb/rdbms/mesg/
/* bbed needs the two message files bbedus.msg and bbedus.msb */
[maclean@rh2 lib]$ make -f ins_rdbms.mk $ORACLE_HOME/rdbms/lib/bbed
Linking BBED utility (bbed)
rm -f /s01/11gdb/rdbms/lib/bbed
gcc -o /s01/11gdb/rdbms/lib/bbed -m64 -L/s01/11gdb/rdbms/lib/ -L/s01/11gdb/lib/ -L/s01/11gdb/lib/stubs/  /s01/11gdb/lib/s0main.o /s01/11gdb/rdbms/lib/ssbbded.o /s01/11gdb/rdbms/lib/sbbdpt.o `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -ldbtools11 -lclntsh  `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnro11 `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnnz11 -lzt11 -lztkg11 -lztkg11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11 -lmm -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11 `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lnro11 `cat /s01/11gdb/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11   -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11 -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11   `cat /s01/11gdb/lib/sysliblist` -Wl,-rpath,/s01/11gdb/lib -lm    `cat /s01/11gdb/lib/sysliblist` -ldl -lm   -L/s01/11gdb/lib
[maclean@rh2 lib]$ file bbed
bbed: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
[maclean@rh2 lib]$ size bbed
text    data     bss     dec     hex filename
154473   43448      32  197953   30541 bbed
[maclean@rh2 lib]$ ldd bbed
libclntsh.so.11.1 => /s01/11gdb/lib/libclntsh.so.11.1 (0x00002b042b883000)
libnnz11.so => /s01/11gdb/lib/libnnz11.so (0x00002b042dead000)
libdl.so.2 => /lib64/libdl.so.2 (0x00000039f2400000)
libm.so.6 => /lib64/libm.so.6 (0x00000039f2000000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039f2800000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00000039f5c00000)
libc.so.6 => /lib64/libc.so.6 (0x00000039f1c00000)
libaio.so.1 => /usr/lib64/libaio.so.1 (0x00002b042e293000)
/lib64/ld-linux-x86-64.so.2 (0x00000039f1800000)
[maclean@rh2 lib]$ cp bbed $ORACLE_HOME/bin
[maclean@rh2 lib]$ which bbed
/s01/11gdb/bin/bbed
[maclean@rh2 lib]$ bbed
Password:
BBED: Release 2.0.0.0.0 - Limited Production on Thu Sep 2 15:18:37 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
BBED>
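
The 11.2.0.1 build above failed only at link time, when gcc could not find the two object files. A small pre-check can report what is missing before make is run. The following is a minimal sketch: the function name missing_bbed_files is hypothetical, and the file list is simply the set of files copied from the 10g home in the session above.

```python
import os

# Files the 11.2.0.1 build above had to borrow from a 10g home,
# given relative to ORACLE_HOME.
REQUIRED = (
    "rdbms/lib/ssbbded.o",
    "rdbms/lib/sbbdpt.o",
    "rdbms/mesg/bbedus.msb",
    "rdbms/mesg/bbedus.msg",
)


def missing_bbed_files(oracle_home):
    """List the required files that are absent under oracle_home."""
    return [rel for rel in REQUIRED
            if not os.path.isfile(os.path.join(oracle_home, rel))]


if __name__ == "__main__":
    # Falls back to the current directory if ORACLE_HOME is not set.
    home = os.environ.get("ORACLE_HOME", "")
    for rel in missing_bbed_files(home):
        print("missing:", rel)
```

An empty result only means the files are present; it does not guarantee that object files from another release will link cleanly.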

 
