Oracle Internal Error: A Case of ORA-00600 [17175]

A 10.2.0.4 system running on HP-UX hit the Oracle internal error ORA-00600 [17175]. The relevant log entries are as follows:

Wed Dec 1 01:57:55 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_pmon_3250.trc:
ORA-00600: internal error code, arguments: [17175], [255], [], [], [], [], [], []
ORA-00601: cleanup lock conflict
Wed Dec 1 01:57:57 2010
Trace dumping is performing id=[cdmp_20101201015757]
Wed Dec 1 01:58:05 2010
LGWR: terminating instance due to error 472
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms1_3291.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms2_3293.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms3_3295.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lms0_3289.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lmon_3283.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Errors in file /u01/app/oracle/admin/xgp2/bdump/xgp21_lmd0_3287.trc:
ORA-00472: PMON process terminated with error
Wed Dec 1 01:58:05 2010
Shutting down instance (abort)
License high water mark = 421

/u01/app/oracle/admin/xgp2/bdump/xgp21_pmon_3250.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name: HP-UX
Node name: XGP2_db1
Release: B.11.31
Version: U
Machine: ia64
Instance name: xgp21
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 3250, image: oracle@XGP2_db1 (PMON)

*** SERVICE NAME:(SYS$BACKGROUND) 2010-12-01 01:57:55.933
*** SESSION ID:(333.1) 2010-12-01 01:57:55.933
*** 2010-12-01 01:57:55.933
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [17175], [255], [], [], [], [], [], []
ORA-00601: cleanup lock conflict


ksedst <- ksedmp <- ksfdmp <- kgeriv <- kgesiv
<- kgesic1 <- kghcln <- kslilcr <- $cold_ksl_cleanup <- ksepop
<- kgepop <- kgesev <- ksesec0 <- $cold_kslges <- ksl_get_child_latch
<- kslgpl <- es <- ksfglt <- kghext_numa <- ksmasgn
<- kghnospc <- $cold_kghalo <- ksmdacnk <- ksmdget <- ksosp_alloc
<- ksoreq_submit <- ksbsrv <- kmmssv <- kmmlsa <- kmmlod
<- ksucln <- ksbrdp <- opirip <- $cold_opidrv <- sou2o
<- $cold_opimai_real <- main <- main_opd_entry

PROCESS STATE
-------------
Process global information:
process: c00000018d000078, call: c00000018d252238, xact: 0000000000000000, curses: c00000018d2508a8, usrses: c00000018d2508a8
----------------------------------------
SO: c00000018d000078, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: c00000018d252238/c00000018d252238, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 48
last post received-location: ksoreq_reply
last process to post me: c00000018d037978 1 64
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c00000018d001058 1 6
(latch info) wait_event=0 bits=90
holding (efd=5) c00000020001d500 Parent+children shared pool level=7
Location from where latch is held: kghfrunp: alloc: clatch nowait:
Context saved from call: 0
state=busy, wlstate=free
holding (efd=5) c00000020000b5f8 OS process allocation level=4
Location from where latch is held: ksoreq_submit:
Context saved from call: 13835058076152957304
state=busy, wlstate=free
Process Group: DEFAULT, pseudo proc: c0000004dd263230
O/S info: user: oracle, term: UNKNOWN, ospid: 3250
OSD pid info: Unix process pid: 3250, image: oracle@XGP2_db1 (PMON)


SO: c0000004df4d5f28, type: 19, owner: c00000018d000078, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=4
outq=0 sndq=0 opid=2 prmb=0x0
mbg[i]=(2 19) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(4 1) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=20 mop[q]=1 pendq=0 zmbq=0
nonksxp_recvs=0
------------process 0xc0000004df4d5f28--------------------
proc version : 0
Local node : 0
pid : 3250
lkp_node : 0
svr_mode : 0
proc state : KJP_NORMAL
Last drm hb acked : 0
Total accesses : 181
Imm. accesses : 180
Locks on ASTQ : 0
Locks Pending AST : 0
Granted locks : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
----------------------------------------
SO: c00000018d2f3610, type: 11, owner: c00000018d000078, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c00000018d000078,
event: 1, last message event: 1,
last message waited event: 1, messages read: 0
channel: (c0000004dd29fdb0) scumnt mount lock
scope: 1, event: 19, last mesage event: 0,
publishers/subscribers: 0/19,
messages published: 0
SO: c00000018d2508a8, type: 4, owner: c00000018d000078, flag: INIT/-/-/0x00
(session) sid: 333 trans: 0000000000000000, creator: c00000018d000078, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0002-00000003, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
last wait for 'latch: shared pool' blocking sess=0x0000000000000000 seq=342 wait_time=175677 seconds since wait started=0
address=c0000002000fff60, number=d6, tries=7
Dumping Session Wait History
for 'latch: shared pool' count=1 wait_time=175677
address=c0000002000fff60, number=d6, tries=7
for 'latch: shared pool' count=1 wait_time=97554
address=c0000002000fff60, number=d6, tries=6
for 'latch: shared pool' count=1 wait_time=78023
address=c0000002000fff60, number=d6, tries=5
for 'latch: shared pool' count=1 wait_time=38978
address=c0000002000fff60, number=d6, tries=4
for 'latch: shared pool' count=1 wait_time=38942
address=c0000002000fff60, number=d6, tries=3
for 'latch: shared pool' count=1 wait_time=19435
address=c0000002000fff60, number=d6, tries=2
for 'latch: shared pool' count=1 wait_time=12655
address=c0000002000fff60, number=d6, tries=1
for 'latch: shared pool' count=1 wait_time=8
address=c0000002000fff60, number=d6, tries=0
for 'os thread startup' count=1 wait_time=144253
=0, =0, =0
for 'os thread startup' count=1 wait_time=141360
=0, =0, =0


SO: c00000018d2f3500, type: 11, owner: c00000018d000078, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c00000018d000078,
event: 2, last message event: 40,
last message waited event: 40, messages read: 1
channel: (c0000004dd29bbd8) system events broadcast channel
scope: 2, event: 224634, last mesage event: 40,
publishers/subscribers: 0/161,
messages published: 1


SO: c00000018d252238, type: 3, owner: c00000018d000078, flag: INIT/-/-/0x00
(call) sess: cur c00000018d2508a8, rec 0, usr c00000018d2508a8; depth: 0
----------------------------------------
SO: c00000018d2594b0, type: 5, owner: c00000018d252238, flag: INIT/-/-/0x00
(enqueue) PR-00000000-00000000 DID: 0001-0002-00000003
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0xc0000004df401718, mode: X, lock_flag: 0x0
own: 0xc00000018d2508a8, sess: 0xc00000018d2508a8, proc: 0xc00000018d000078, prv: 0xc0000004df401728
----------------------------------------
SO: c00000018d30b710, type: 16, owner: c00000018d000078, flag: INIT/-/-/0x00
(osp req holder)
CHILD REQUESTS:
(osp req) type=2(BACKGROUND) flags=0x20001(STATIC/-) state=1(INITED) err=0
pg=0 arg1=0 arg2=(null) reply=(null) pname=S018
pid=0 parent=c00000018d30b710 fulfill=0000000000000000
----------------------------------------
SO: c0000004dbff09c0, type: 192, owner: c0000004dbff09c0, flag: -/-/-/0x00

Searching Metalink for the ORA-00600 [17175] internal error turns up a large amount of information on it:

Keywords: ora-00600 [17175]

1. Bug 6250251: ORA-00600 17175 DURING KGI CLEANUP - DUMP - ORADEBUG
--ora-600 followed by ora-601 and instance crash with ORA-17175.
--Also, setting of heap check event triggers this problem. In this case
--it is event="10235 trace name context forever, level 27"

2. Bug 4216668 - Dump from INSERT / MERGE on internal columns (Doc ID 4216668.8)
--INSERT or MERGE commands might core dump if operating on object types and internal columns are involved.

3. Bug 7590297: ORA-600 [17175] [255] ORA-601: CLEANUP LOCK CONFLICT CRASHED THE DATABASE

4. SR 3-2296150050
--The error occurred while Oracle was cleaning up shared pool latch/heap information for a process
that died midway.
--There is no data corruption associated with this error.
--This is evident from the function kghcln in the trace stack at which it failed.
--This problem is usually the symptom of some earlier problem with the latch.
--Either after a process has died, or a process has signaled an error while holding a shared pool latch,
and the index to the shared pool latch is invalid.

--Bug 7590297 was raised for this issue, but it could not be progressed because the required information was unavailable.
--From a few earlier known issues: PMON may sometimes signal ORA-601
while trying to start up additional shared servers or dispatchers.
--The workaround suggested there was to start the instance with the maximum number of shared servers.

--Can you reproduce the problem? If the instance has been restarted, the issue may not persist, as it is related to memory.
--If the issue persists, we have to do the following to monitor the instance and investigate further:

--1. Set the following event in parameter file:
--event="10257 trace name context forever, level 10"
--event="601 trace name SYSTEMSTATE level 10"

--The first event will cause PMON to dump info about shared server startup.
--The second event will cause PMON to do a system state dump when the 601 occurs.

--2. You should also track this at intervals and save the historical results from:

--SQL> select e.total_waits, e.total_timeouts, e.time_waited
--       from v$session_event e, v$session s, v$bgprocess b
--      where b.name='PMON' and s.paddr=b.paddr and e.sid=s.sid and e.event='process startup';

5. SR 3-2123025401
--=== ODM Solution / Action Plan ===
--Disabled NUMA for resolution

6. SR 7314313.994
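
The diagnostic steps quoted in item 4 could be set up roughly as follows. This is only a sketch under several assumptions: the instance uses an spfile, the events are set under the guidance of Oracle Support, and the table name pmon_startup_wait_hist is hypothetical. The PMON trace above records waits named 'os thread startup' rather than 'process startup', so the filter accepts both names; whether this release still reports 'process startup' is an assumption to verify.

```sql
-- Diagnostic events from item 4 (SR 3-2296150050), written to the spfile.
-- They take effect at the next instance restart; SID='*' covers all RAC instances.
ALTER SYSTEM SET event =
  '10257 trace name context forever, level 10',
  '601 trace name SYSTEMSTATE level 10'
  SCOPE=SPFILE SID='*';

-- Hypothetical history table for the periodic PMON wait samples.
CREATE TABLE pmon_startup_wait_hist (
  sample_time    DATE,
  event          VARCHAR2(64),
  total_waits    NUMBER,
  total_timeouts NUMBER,
  time_waited    NUMBER
);

-- Run at intervals (for example from a scheduler job) to save one sample
-- of the PMON wait statistics on the local instance.
INSERT INTO pmon_startup_wait_hist
SELECT SYSDATE, e.event, e.total_waits, e.total_timeouts, e.time_waited
  FROM v$session_event e, v$session s, v$bgprocess b
 WHERE b.name = 'PMON'
   AND s.paddr = b.paddr
   AND e.sid = s.sid
   AND e.event IN ('process startup', 'os thread startup');
COMMIT;
```

Comparing consecutive rows in the history table shows how much time PMON spent waiting on process startup between samples.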

Analysis:

Bug 6250251 and Bug 4216668 are not applicable to this case.
Bug 7590297 is applicable to this case: the call stack and the error message are the same as in this case.
But this bug is suspended because the requested information is not available.

SR 3-2296150050: same error message, same DB version, similar call stack; closed without solution.
SR 3-2123025401: same error message, same DB version, similar call stack.
The issue happened twice in that SR and was resolved by disabling NUMA.
SR 7314313.994: same error message, same DB version, similar call stack; closed without solution.

ERROR:
ORA-600 [17175] [a]

VERSIONS:
versions 9.2 to 10.1

DESCRIPTION:

This error occurs when we are cleaning up a shared pool latch (either after a process has died,
or a process has signaled an error while holding a shared pool latch),
and the index to the shared pool latch is invalid.

ARGUMENTS:
Arg [a] index of the latch recovery structure - usually 255

FUNCTIONALITY:
Generic Heap Manager

IMPACT:
INSTANCE HANG
PROCESS FAILURE
INSTANCE FAILURE

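The following is the action plan given by Oracle GCS. GCS believes that the vast majority of ORA-00600 [17xxx] errors are caused by memory-related problems, which are often resolved once the instance is restarted. GCS also suggests setting shared_servers=max_shared_servers and observing further: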

From the uploaded files it looks like ORA-00600 [17175] errors were reported
and the instance crashed. What is the current status after the restart of the database?
Are the same errors still being reported, and is the instance still crashing?
Most ORA-00600 [17xxx] errors are memory related and might have been resolved by the database restart.


Looking further at the uploaded trace file, the failing functions and the error closely
match Bug 6958493, which was closed as a duplicate of base Bug 6962340, which in turn was closed
because the error could not be reproduced.


A similar issue is also reported in Bug 3104250, which is fixed in 10g, but that doesn't mean
you cannot get this error for a new reason, and the same workaround may still fix it.
We need to implement the workaround and set shared_servers=max_shared_servers
if the error reproduces again. If the issue still repeats, we can file a new bug with development for it.


ACTION PLAN
===========
1. Monitor the alert log for ORA-00600 [17175] errors for the next few days. If the database still crashes,
set shared_servers=max_shared_servers and see whether the problem is resolved.
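
A minimal sketch of that workaround follows. In the PMON trace above, the pending request in the process state is for a background process named S018, submitted through ksoreq_submit, which is consistent with PMON trying to start an additional shared server. Pre-starting every shared server means PMON should not need to spawn more later. The value 20 is hypothetical, and an spfile is assumed so that SCOPE=BOTH works.

```sql
-- Check the current shared server settings (run as a DBA user).
SELECT name, value
  FROM v$parameter
 WHERE name IN ('shared_servers', 'max_shared_servers');

-- Hypothetical value: replace 20 with the max_shared_servers value reported above.
-- SID='*' applies the change to every RAC instance; SCOPE=BOTH requires an spfile.
ALTER SYSTEM SET shared_servers = 20 SCOPE=BOTH SID='*';
```

If max_shared_servers is not explicitly set on the system, choose a value for it first and then set shared_servers to the same number.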