An example of Oracle internal error ORA-00600: [pfri.c: pfri8: plio mismatch ]

An 11.2.0.1 database on Linux x86-64 hit ORA-00600: [pfri.c: pfri8: plio mismatch ]. The log is as follows:

 

ORA-00600: internal error code, arguments: [pfri.c: pfri8: plio mismatch ], [], [], [], [], [], [], [], [], [], [], []
ORA-04061: existing state of package body "APPS.OE_ORDER_UTIL" has been invalidated
ORA-04065: not executed, altered or dropped package body "APPS.OE_ORDER_UTIL"

 

After working with MOS (My Oracle Support), the problem was confirmed as Bug 9691456 (fixed in 11.2.0.2 and 12.1.0.0), ORA-600 [pfri.c: pfri8: plio mismatch] with Editions:

 

Here is a direct hit on the error message from your alert.log:
APPS Packages Become Invalid With ORA-600 [pfri.c: pfri8: plio mismatch ] (Doc ID 1323867.1)

Bug 9691456  ORA-600 [pfri.c: pfri8: plio mismatch] with Editions
 This note gives a brief overview of bug 9691456.
 The content was last updated on: 18-NOV-2010
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release)
        11.2.0.2 (Server Patch Set) 

Description

    With Editions enabled some invalidation operation do not properly invalidate 
    stubs which can lead to problems such as ORA-600 [pfri.c: pfri8: plio mismatch].

 

The fix is to upgrade to 11.2.0.2 and apply the latest Patch Set Update (PSU).
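As a rough illustration of the version boundary in the bug note, the following sketch compares a dotted Oracle version string with the first fixed patch set, 11.2.0.2. The function names are made up for this example, and the check is only meaningful for 11.2 installations, since the note confirms 11.2.0.1 as the affected release.

```python
def parse_version(text):
    """Turn a dotted version such as '11.2.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def _pad(version, width):
    """Right-pad a version tuple with zeros so tuples compare field by field."""
    return version + (0,) * (width - len(version))


# First patch set that carries the fix, as stated in the bug note above.
FIRST_FIXED_PATCH_SET = parse_version("11.2.0.2")


def predates_fix(version_text):
    """Return True when the given version is older than 11.2.0.2."""
    version = parse_version(version_text)
    width = max(len(version), len(FIRST_FIXED_PATCH_SET))
    return _pad(version, width) < _pad(FIRST_FIXED_PATCH_SET, width)


if __name__ == "__main__":
    for candidate in ("11.2.0.1", "11.2.0.2", "11.2.0.2.3"):
        print(candidate, "predates fix:", predates_fix(candidate))
```

The version string itself would normally come from `v$version` or `opatch lsinventory`; how it is obtained is left out of the sketch.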

Large Memory Footprints on AIX

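Connor McDonald, an Oracle geek, shared a problem where 11g dedicated server processes on AIX consume far too much memory. It was eventually confirmed as the bug "11G SERVER PROCESSES CONSUMING MUCH MORE MEMORY THAT 10G OR 9I". The related documents are: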

 

Memory Footprint For Dedicated Server Processes More Than Doubled After 11g Upgrade On AIX Platform [ID 1246995.1]

Bug 9796810: 11G SERVER PROCESSES CONSUMING MUCH MORE MEMORY THAT 10G OR 9I

Bug 10190759: PROCESSES CONSUMING ADDITIONAL MEMORY DUE TO ‘USLA HEAP’

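As these notes show, the problem appears only after upgrading from 9i/10g to 11g. As a confirmed upgrade bug it deserves everyone's attention, since upgrades like this will become more and more common over the next few years. Hopefully the bug will be fixed in 11.2.0.3.

In fact, I ran into a similar large process footprint problem on 10.2.0.3: after the customer applied a one-off patch [6110331], the RSS of each server process rose noticeably and memory usage on the host increased sharply. An SR was filed for that problem too, but it was never confirmed as a bug. The customer asked Oracle GCS why the RSS had grown, but the answer was vague.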

1. Have you installed the patch 10190759 ?

Review the note:
Memory Footprint For Dedicated Server Processes More Than Doubled After 11g Upgrade On AIX Platform (Doc ID 1246995.1)

If you have not installed the patch, there is one available for 11.2.0.2.0, 11.2.0.2.2 and 11.2.0.2.3.

If you need me to review the patches you have installed, you can upload the opatch listing:

opatch lsinventory -patch -detail

2. If you have already installed the patch 10190759 then

The additional memory seen allocated to oracle processes in the 11.2 release is a consequence of the additional link options added to the oracle link
line, -bexpfull and -brtllib. The two link options were specifically added in 11.2.0.1 to support the online patching feature.
Patch Name or Number: 10190759

 

Changes in the make file have been implemented such that you can relink without these options (-bexpfull and -brtllib) to avoid
the additional memory overhead incurred by adding these options. These changes are available via a one-off patch.

This is a known bug: BUG:10190759 – PROCESSES CONSUMING ADDITIONAL MEMORY DUE TO ‘USLA HEAP’

Install  Patch: 10190759

ORA-01652 even though there is sufficient space in RECYCLE BIN

Bug 6977045 can cause ORA-1652 to be raised even though there is sufficient space in the recycle bin. Versions below 11.2 are believed to be affected.

[oracle@rh2 ~]$ oerr ora 1652
01652, 00000, "unable to extend temp segment by %s in tablespace %s"
// *Cause:  Failed to allocate an extent of the required number of blocks for
//          a temporary segment in the tablespace indicated.
// *Action: Use ALTER TABLESPACE ADD DATAFILE statement to add one or more
//          files to the tablespace indicated.
Bug 6977045  ORA-1652 even though there is sufficient space in RECYCLE BIN
This note gives a brief overview of bug 6977045.
The content was last updated on: 06-DEC-2010
Affects:
Product (Component)	Oracle Server (Rdbms)
Range of versions believed to be affected 	Versions BELOW 11.2
Versions confirmed as being affected
11.1.0.7
Platforms affected	Generic (all / most platforms affected)
Fixed:
This issue is fixed in
11.2.0.1 (Base Release)
11.1.0.7 Patch 32 on Windows Platforms
Symptoms:
Related To:
Error May Occur
Storage Space Usage Affected
ORA-1652
Recycle Bin
Description
Under space pressure an ORA-1652 may be signalled even if there is sufficient
space in the recyclebin.
Rediscovery Notes:
Under space pressure, space allocation fails, even though there
is sufficient free space in recycle bin.
Workaround
Turn off the recycle bin.
OR
Purge the recyclebin.
Hdr: 12582291 11.1.0.7 RDBMS 11.1.0.7 SPACE PRODID-5 PORTID-59
Abstract: UPDATING A LOB FAILS WHILE CLEARING RECYCLE BIN EVEN WHEN ENOUGH FREE SPACE IS A
BUG TYPE CHOSEN
===============
Code
SubComponent: Recovery
======================
DETAILED PROBLEM DESCRIPTION
============================
An OCI application module tried to update a LOB object, and this operation
internally & recursively tried to clear off a few segments from the recycle
bin. As ct. had enabled triggers preventing uncontrolled droppings of
segments, this apparently prevented the application module from succeeding.
Further, since this error did not show up on the application module that
failed, this customer-facing critical application of this large enterprise
was down for considerable time.
DIAGNOSTIC ANALYSIS
===================
None. This bug is raised mainly as a Q/A to get clarifications for customer,
who is demanding an answer and possible action plan so that they can prevent
such disastrous situation in future.
WORKAROUND?
===========
Yes
WORKAROUND INFORMATION
======================
Disable the trigger or not using the recycle bin (Though neither operation
is acceptable to ct. because of their business reasons).
TECHNICAL IMPACT
================
Critical application module fails.
RELATED ISSUES (bugs, forums, RFAs)
===================================
None (MOS Note 978045.1 was referenced by ct.)
Hdr: 6977045 10.2 RDBMS 10.2 RAM DATA PRODID-5 PORTID-23 ORA-1652
Abstract: ORA-1652  LMT SPACE NOT REALLOCATED CORRECTLY AFTER DROP TABLE
*** 04/16/08 12:57 pm ***
TAR:
----
6880393.992
PROBLEM:
--------
ORA-12801: error signaled in parallel query server P038
ORA-1652: unable to extend temp segment by 320 in tablespace ERROR_TS
After dropping a table in a LMT the space is not properly returned to the
tablespace datafiles .
Only after purge tablespace error_ts; do we see the space returned correctly.
Subsequently the test plan is successful and the table is created.
DIAGNOSTIC ANALYSIS:
--------------------
See attached test case. test_output.log
WORKAROUND:
-----------
none
RELATED BUGS:
-------------
REPRODUCIBILITY:
----------------
TEST CASE:
----------
See attached test case. test_output.log
STACK TRACE:
------------
SUPPORTING INFORMATION:
-----------------------
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------
DIAL-IN INFORMATION:
--------------------
IMPACT DATE:
------------
*** 04/16/08 01:29 pm ***
*** 04/16/08 02:04 pm ***
the problem here is that even though the objects are occupying the same space
when they were created, dba_free_space shows one datafile to contain all the
free space reclaimed by the drop table command.
*** 04/16/08 02:35 pm ***
Please confirm this is a duplicate of bug 5083393.
*** 04/17/08 10:56 am ***
*** 04/17/08 05:09 pm ***
*** 04/17/08 05:14 pm *** (CHG: Sta->10)
*** 04/17/08 05:14 pm ***
*** 04/21/08 11:06 am *** (CHG: Sta->16)
*** 04/21/08 11:06 am ***
please review uploaded file ora_test1.log.
Patch 5083393 has been applied to this instance and the test was ran against
this patch.
Notice the query immedatly following the ORA_1652 error.  The temporary
segments seem to be causing the failure and specifically segment 1199.88012  .
*** 04/22/08 01:55 pm ***
Current SQL statement for this session:
create table seckle.my_test2_tb
nologging tablespace error_ts
parallel (degree 6)
as
select * from ecm.E08401AH_GEMINI_CMF_WIDE_TB
ERROR parallelizer slave or internal
qbas:54482
pgakid:2 pgadep:0
qerpx: error stack: OER(12805)
qbas_qerpxs: 54482
dfo_qerpxs: 0x4b7ba89e0 dfo1_qerpxs: 0x4b7ba9178
ntq_qerpxs: 1 ntqi_qerpxs: 0
nbfs_qerpxs: 0
nobj_qerpxs: 2  ngdef_qerpxs: 1
mflg_qerpxs: 0x2c
slave set 1 DFO dump:
kkfdo: (0x4b7ba9178)
kkfdo->kkfdochi: (0x0)
kkfdo->kkfdopar: (0x0)
kkfdo->kkfdonxt: (0x0)
kkfdo->kkfdotqi: 0
kkfdo->kkfdontbl: 2
kkfdo->kkfdongra: 1
kkfdo->kkfdofigra: 0
kkfdo->kkfdoflg: 0x2818
kkfdo->kkfdooct: 1
kkfdo->kkfdonumeopn: 0
Output table queue: (0x4b7fab1b8)
kxfqd     : 0x4b7fa5728
kxfqdtqi  : 0            TQ id
kxfqdcc   : 0x14         TQ: from slave set 1 to QC
kxfqdpty  : 4
kxfqdsmp  : 0            number of samples
kxfqdflg  : 0x4
kxfqdfmt  :              TQ format
kxfqfnco  : 5            number of TQ columns
kxfqfnky  : 0            number of key columns
TQ column        kxfqcbfl   kxfqcdty   kxfqcflg   kxfqcplen
kxfqfcol[   0]:  4          23         0x0          4
kxfqfcol[   1]:  32720      23         0x80         32720
kxfqfcol[   2]:  1          23         0x0          1
kxfqfcol[   3]:  76         23         0x0          76
kxfqfcol[   4]:  32720      23         0x0          32720
slave set 2 DFO dump:
np_qerpxm: 6 mflg_qerpxm: 0xa7
cdfo_qerpxm: 0x4b7ba9178 (tqid 0) sdfo_qerpxm: 0x0 (tqid -1)
ctqh_qerpxm: 0xffffffff79378ac8 dump:
kxfqh     : 0xffffffff79378ac8
kxfqhflg  : 0x15         TQ handle open
kxfqhmkr  : 0x4          QC
kxfqhpc   : 2            1:producer 2:consumer 3:ranger
kxfqepty  : 4
kxfqhnsam : 6
kxfqhnth  : 6
kxfqhdsc  :              TQ descriptor
kxfqd     : 0x4b7fa5728
kxfqdtqi  : 0            TQ id
kxfqdcc   : 0x14         TQ: from slave set 1 to QC
kxfqdpty  : 4
kxfqdsmp  : 0            number of samples
kxfqdflg  : 0x4
kxfqdfmt  :              TQ format
kxfqfnco  : 5            number of TQ columns
kxfqfnky  : 0            number of key columns
TQ column        kxfqcbfl   kxfqcdty   kxfqcflg   kxfqcplen
kxfqfcol[   0]:  4          23         0x0          4
kxfqfcol[   1]:  32720      23         0x80         32720
kxfqfcol[   2]:  1          23         0x0          1
kxfqfcol[   3]:  76         23         0x0          76
kxfqfcol[   4]:  32720      23         0x0          32720
dnst_qerpxm[cur,par]: 6,0 dcnt_qerpxm[cur,par]: 0,0
ppxv_qerpxm[0]: 0xffffffff79377f50 count[np..1]:1 1 1 1 1 1
pqv1_qerpxm: 0xffffffff79377f38 bits[np..1]: 111111
pqv2_qerpxm: 0xffffffff79377f40 bits[np..1]: 000000

If the recycle bin is enabled, check tablespace free space in dba_free_space together with the space held by the recycle bin, for example with views like these:

create view dba_free_space_pre10g as
select ts.name TABLESPACE_NAME,
fi.file# FILE_ID,
f.block# BLOCK_ID,
f.length * ts.blocksize BYTES,
f.length BLOCKS,
f.file# RELATIVE_FNO
from sys.ts$ ts, sys.fet$ f, sys.file$ fi
where ts.ts# = f.ts#
and f.ts# = fi.ts#
and f.file# = fi.relfile#
and ts.bitmapped = 0
union all
select /*+ ordered use_nl(f) use_nl(fi) */
ts.name TABLESPACE_NAME,
fi.file# FILE_ID,
f.ktfbfebno BLOCK_ID,
f.ktfbfeblks * ts.blocksize BYTES,
f.ktfbfeblks BLOCKS,
f.ktfbfefno RELATIVE_FNO
from sys.ts$ ts, sys.x$ktfbfe f, sys.file$ fi
where ts.ts# = f.ktfbfetsn
and f.ktfbfetsn = fi.ts#
and f.ktfbfefno = fi.relfile#
and ts.bitmapped <> 0
and ts.online$ in (1, 4)
and ts.contents$ = 0
/
create view dba_free_space_recyclebin as
select /*+ ordered use_nl(u) use_nl(fi) */
ts.name TABLESPACE_NAME,
fi.file# FILE_ID,
u.ktfbuebno BLOCK_ID,
u.ktfbueblks * ts.blocksize BYTES,
u.ktfbueblks BLOCKS,
u.ktfbuefno RELATIVE_FNO
from sys.recyclebin$ rb, sys.ts$ ts, sys.x$ktfbue u, sys.file$ fi
where ts.ts# = rb.ts#
and rb.ts# = fi.ts#
and u.ktfbuefno = fi.relfile#
and u.ktfbuesegtsn = rb.ts#
and u.ktfbuesegfno = rb.file#
and u.ktfbuesegbno = rb.block#
and ts.bitmapped <> 0
and ts.online$ in (1, 4)
and ts.contents$ = 0
union all
select ts.name TABLESPACE_NAME,
fi.file# FILE_ID,
u.block# BLOCK_ID,
u.length * ts.blocksize BYTES,
u.length BLOCKS,
u.file# RELATIVE_FNO
from sys.ts$ ts, sys.uet$ u, sys.file$ fi, sys.recyclebin$ rb
where ts.ts# = u.ts#
and u.ts# = fi.ts#
and u.segfile# = fi.relfile#
and u.ts# = rb.ts#
and u.segfile# = rb.file#
and u.segblock# = rb.block#
and ts.bitmapped = 0
/

dba_free_space_pre10g shows the real free space, as in 9i, while dba_free_space_recyclebin shows the space that is still held by objects in the recycle bin.
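To show how the two views might be read together, here is a small sketch that sums bytes per tablespace from both result sets. The row format `(tablespace_name, bytes)` and the sample figures are assumptions for illustration; in practice the rows would come from querying the two views with any Oracle client.

```python
from collections import defaultdict


def summarize_space(free_rows, recyclebin_rows):
    """Sum bytes per tablespace from the two views.

    Each row is assumed to be a (tablespace_name, bytes) pair, e.g. the
    TABLESPACE_NAME and BYTES columns of dba_free_space_pre10g and
    dba_free_space_recyclebin.
    """
    totals = defaultdict(lambda: {"free": 0, "recyclebin": 0})
    for tablespace, nbytes in free_rows:
        totals[tablespace]["free"] += nbytes
    for tablespace, nbytes in recyclebin_rows:
        totals[tablespace]["recyclebin"] += nbytes
    # "after_purge" is what would be free once the recycle bin is purged.
    return {
        tablespace: {
            "free": parts["free"],
            "recyclebin": parts["recyclebin"],
            "after_purge": parts["free"] + parts["recyclebin"],
        }
        for tablespace, parts in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from a real system.
    free = [("ERROR_TS", 10 * 1024 * 1024), ("USERS", 5 * 1024 * 1024)]
    recycle = [("ERROR_TS", 300 * 1024 * 1024)]
    for name, parts in sorted(summarize_space(free, recycle).items()):
        print(name, parts)
```

A tablespace with little real free space but a large recycle bin share is the situation in which the bug can surface, so purging the recycle bin is the first thing to try there.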

Brain Split?

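A real split-brain is rare, but we did run into one. Two months ago we ran failure tests on a two-node 10.2.0.4 RAC system on AIX 6.1. The main test was to unplug the RAC interconnect cable and check whether CRS handles the loss of the private network correctly.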

 

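During the actual test neither host rebooted, yet the CSS daemon on each side concluded that the other node was down. Both nodes therefore believed they were the survivor, which is what we call split-brain. In this state the instances usually hang because of the LMON process.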

 

Let's compare the logs from the two nodes for further analysis:

 

STEP 1: At 20:41:19 the cable was physically unplugged. The nodes could no longer communicate, and the 600s misscount countdown began
Node 1:
[    CSSD]2010-06-22 20:41:21.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:22.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(3) checkin(s)
.........
[    CSSD]2010-06-22 20:51:17.956 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:18.963 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:19.963 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis2 (2), flags 0x0001, state 3, wt4c 0
/* After the countdown completes, node 1 starts evicting node 2 */
Node 2:
[    CSSD]2010-06-22 20:41:19.598 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:20.599 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(3) checkin(s)
......................
[    CSSD]2010-06-22 20:51:15.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:16.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:17.878 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis1 (1), flags 0x0001, state 3, wt4c 0
/* Likewise, node 2 starts evicting node 1 ten minutes later, at 20:51 */
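The countdown in STEP 1 can be measured directly from the trace. The sketch below parses CSSD lines in the format shown above and returns the seconds between the first missed check-in and the "Eviction started" message. The regular expression is written against these sample lines only and is an assumption about the format, not a general CSSD log parser.

```python
import re
from datetime import datetime

# Written against lines such as:
# [    CSSD]2010-06-22 20:41:21.465 [3342] >TRACE:   clssnmPollingThread: ...
LINE_RE = re.compile(
    r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"\s+\[\d+\]\s+>\w+:\s+(.*)"
)


def parse_line(line):
    """Return (timestamp, message) for a CSSD trace line, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return stamp, match.group(2)


def countdown_seconds(lines):
    """Seconds from the first missed check-in to the start of the eviction."""
    first_miss = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip elided or non-matching lines
        stamp, message = parsed
        if "missed(" in message and first_miss is None:
            first_miss = stamp
        elif "Eviction started" in message and first_miss is not None:
            return (stamp - first_miss).total_seconds()
    return None


if __name__ == "__main__":
    node1 = [
        "[    CSSD]2010-06-22 20:41:21.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(2) checkin(s)",
        "[    CSSD]2010-06-22 20:51:18.963 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(599) checkin(s)",
        "[    CSSD]2010-06-22 20:51:19.963 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis2 (2), flags 0x0001, state 3, wt4c 0",
    ]
    print(countdown_seconds(node1))
```

On the node 1 excerpt this gives roughly 598.5 seconds, which matches the 600s misscount once you allow for the first logged miss already being `missed(2)`.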
STEP 2: Both nodes read the disk heartbeat (clssnmReadDskHeartbeat), and each concludes that the other node is down
Node 1:
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:20.965 [2057] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:21.714 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4185) LATS(1628594178) Disk lastSeqNo(4185)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2010-06-22 20:51:22.718 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4186) LATS(1628595183) Disk lastSeqNo(4186)
[    CSSD]2010-06-22 20:51:22.964 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:51:23.722 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4187) LATS(1628596186) Disk lastSeqNo(4187)
[    CSSD]2010-06-22 20:51:24.724 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4188) LATS(1628597189) Disk lastSeqNo(4188)
.............................
[    CSSD]2010-06-22 20:59:49.953 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4692) LATS(1629102418) Disk lastSeqNo(4692)
[    CSSD]2010-06-22 20:59:50.057 [3085] >TRACE:   clssgmPeerDeactivate: node 2 (gis2), death 0, state 0x80000001 connstate 0xf
[    CSSD]2010-06-22 20:59:50.104 [1029] >TRACE:   clssnm_skgxncheck: CSS daemon failed on node 2
[    CSSD]2010-06-22 20:59:50.382 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112a6c5b0) proc(112a5a190) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Start
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Evicting node 2, birth 1, death 3, killme 1
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(1629103696)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(2) down, LATS(1629102418),timeout(1278)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: Ack message type (15)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSendUpdate: syncSeqNo(3)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 0 () left cluster
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 1, state (3/3) unique (1277207505/1277207505) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (1277206874/1277206874) prevConuni(1277206874) birth (1/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 2 (gis2) left cluster
[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: SYNC(3) from node(1) completed
[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: NODE 1 (gis1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmReconfigThread:  started for reconfig (3)
[    CSSD]2010-06-22 20:59:51.310 [4114] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_gisdb) birth(1/0)
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DBGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DGGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DAALL_DB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGGISDBALL type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRCSN_SRV_gisdb1 type 3
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 3
[    CSSD]2010-06-22 20:59:51.311 [3085] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishMasterNode: MASTER for 3 is node(1) birth(2)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmChangeMasterNode: requeued 1 RPCs
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2010-06-22 20:59:51.312 [4114] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes
[    CSSD]CLSS-3001: local node number 1, master node number 1
/* After about 8 minutes of disk heartbeats, node 1 reports "CSS daemon failed on node 2", formally treats node 2 as having left the cluster, and completes the reconfiguration */
Node 2:
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:18.892 [2057] >TRACE:   clssnmSendVoteInfo: node(2) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:19.287 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3548) LATS(351788040) Disk lastSeqNo(3548)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2010-06-22 20:51:20.288 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3549) LATS(351789041) Disk lastSeqNo(3549)
[    CSSD]2010-06-22 20:51:21.308 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3550) LATS(351790062) Disk lastSeqNo(3550)
...........................
[    CSSD]2010-06-22 20:59:46.122 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4051) LATS(352294875) Disk lastSeqNo(4051)
[    CSSD]2010-06-22 20:59:46.341 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112947c70) proc(112946f90) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:46.355 [2314] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50210) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50cd0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a536f0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a4eb90) proc (112a4eef0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a69250) proc (112a67e10)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112946050) proc (112945e50)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: waited 0 seconds on 6 IO capable clients
[    CSSD]2010-06-22 20:59:46.494 [2314] >WARNING: clssgmClientShutdown: graceful shutdown completed.
[    CSSD]2010-06-22 20:59:47.130 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4052) LATS(352295883) Disk lastSeqNo(4052)
[    CSSD]2010-06-22 21:34:40.167 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.
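/* Node 2 also performed its heartbeats correctly and considered node(1) down. It was eventually shut down manually. After the network fault was repaired, CSS restarted at 21:34 */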
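The symptom in STEP 2 can be checked mechanically: each node keeps logging "node(N) is down" about its peer while the `wrtcnt` value in the same message keeps increasing. The sketch below flags that pattern. It assumes that `wrtcnt` is the heartbeat counter read from the peer's slot on the voting disk, which is my reading of the trace rather than something the logs state, and the function names are made up for this example.

```python
import re

DOWN_RE = re.compile(
    r"clssnmReadDskHeartbeat: node\((\d+)\) is down\. rcfg\(\d+\) wrtcnt\((\d+)\)"
)


def peers_reported_down(lines):
    """Map peer node number -> wrtcnt values seen in 'is down' messages."""
    seen = {}
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            seen.setdefault(int(match.group(1)), []).append(int(match.group(2)))
    return seen


def _still_advancing(counts):
    """True when the counter grew, i.e. the peer kept writing heartbeats."""
    return len(counts) >= 2 and counts[-1] > counts[0]


def looks_like_split_brain(node_a, log_a, node_b, log_b):
    """Both nodes call each other down while both disk heartbeats advance."""
    a_sees_b = peers_reported_down(log_a).get(node_b, [])
    b_sees_a = peers_reported_down(log_b).get(node_a, [])
    return _still_advancing(a_sees_b) and _still_advancing(b_sees_a)


if __name__ == "__main__":
    node1_log = [
        "[    CSSD]2010-06-22 20:51:21.714 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4185) LATS(1628594178) Disk lastSeqNo(4185)",
        "[    CSSD]2010-06-22 20:59:49.953 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4692) LATS(1629102418) Disk lastSeqNo(4692)",
    ]
    node2_log = [
        "[    CSSD]2010-06-22 20:51:19.287 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3548) LATS(351788040) Disk lastSeqNo(3548)",
        "[    CSSD]2010-06-22 20:59:46.122 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4051) LATS(352294875) Disk lastSeqNo(4051)",
    ]
    print(looks_like_split_brain(1, node1_log, 2, node2_log))
```

This is a heuristic for reading logs after the fact, not a replacement for the eviction logic in CSS itself.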

 

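If you read the logs above carefully, you will probably spot the line "Oracle Database 10g CSS Release 10.2.0.1.0". Isn't this RAC supposed to be 10.2.0.4? Why is CSS still at 10.2.0.1? An investigation afterwards revealed that the engineer from the domestic contractor that installed the system ("X码") forgot to run the final root102.sh script when patching CRS. That script updates the OCR/voting disk, the actual CRS binaries and so on. If it is not run after the patch installation finishes, the upgrade has no real effect and only the information in oraInventory is updated.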

 

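At first the X码 engineer flatly refused to admit that the script had been skipped. In fact, applying the 10.2.0.4 CRS patch on AIX 6.1 requires granting specific capabilities to the oracle user, which is different from AIX 5.3: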

 

chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH oracle
/* verify the setting */
lsuser -f oracle | grep capabilities
capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH

 

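If the oracle user has not been granted these capabilities, root102.sh fails with an error. Another indicator is the pre10204/pre10205 directory: once root102.sh has been run, a directory named like pre$VERSION appears under $ORA_CRS_HOME/install. If it is missing, the script was most likely never run, although it may also have been deleted afterwards (deleting it is not recommended).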
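The directory check can be scripted. The sketch below lists directories named `pre` followed by digits under an install directory. The pattern is inferred from the pre10204/pre10205 names mentioned above, and reading the location from the `ORA_CRS_HOME` environment variable is an assumption about how the shell is set up.

```python
import os
import re


def find_pre_dirs(install_dir):
    """Return sorted names of pre<digits> directories under install_dir."""
    if not os.path.isdir(install_dir):
        return []
    return sorted(
        name
        for name in os.listdir(install_dir)
        if re.fullmatch(r"pre\d+", name)
        and os.path.isdir(os.path.join(install_dir, name))
    )


if __name__ == "__main__":
    # ORA_CRS_HOME is assumed to be exported in the current environment.
    crs_home = os.environ.get("ORA_CRS_HOME", "")
    found = find_pre_dirs(os.path.join(crs_home, "install"))
    if found:
        print("root102.sh has probably been run:", ", ".join(found))
    else:
        print("no pre<version> directory found; root102.sh may not have been run")
```

An empty result is only a hint, for the reason given above: the directory may have been removed after the script ran.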

 

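With this information, tracing the root cause of the split-brain became much simpler. CRS 10.2.0.1 has a large number of bugs, and quite a few new parameters and mechanisms were added to CRS between 10.2.0.1 and 10.2.0.4. Searching MOS for the keywords "brain split CSS" turns up the following case: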

Hdr: 8293652 10.2.0.3 PCW 10.2.0.3 OSD PRODID-5 PORTID-46
Abstract: CSS CANNOT HANDLE SPLIT-BRAIN AND DB INSTANCE RECEIVES ORA-29740
PROBLEM:
——–
config:
2-node RAC: Node1 (pdb01) and Node2 (pdb02)
There’s no time difference between two nodes.

pdb02 got ORA-29740 and terminated at “Tue Feb 17 12:13:06 2009”
ORA-29740 occured with reason 1.
After ORA-29740 happened, the instance won’t be able to start
until rebooting OS.
After rebooting OS, everything was fine and instances were up.

DIAGNOSTIC ANALYSIS:
——————–
clssnmReadDskHeartbeat: node(2) is down. rcfg(8) wrtcnt(2494425)
LATS(1205488794) Disk lastSeqNo(2494425)
nodes
clssgmMasterSendDBDone: group/lock status synchronization complete
nodes

WORKAROUND:
———–
none

RELATED BUGS:
————-

REPRODUCIBILITY:
—————-
once at ct’s env.

TEST CASE:
———-

STACK TRACE:
————

SUPPORTING INFORMATION:
———————–

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-

DIAL-IN INFORMATION:
——————–

IMPACT DATE:
————

Does ct apply any CRS(bundle) patch ?

When problem happen, cssd can’t connect each other via interconnect,
but both cssd can do heartbeat to voting disk.
However, both cssd consider that “I’m survivor”.
Looking into node 1 cssd.
* 12:02:30.856 – Initiated sync
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms

* Checking voting disk, and find node2 is still voting and living.
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo:
Checking disk info…
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo:
node(2) timeout(20) state_network(5) state_disk(3) misstime(0)
[ CSSD]2009-02-17 12:02:31.878 [1262557536] >TRACE: clssnmCheckDskInfo:
node(2) disk HB found, network state 5, disk state(3) misstime(1010)

* Compared cluster size and confirmed it can survive.
[ CSSD]2009-02-17 12:02:34.885 [1262557536] >TRACE: clssnmCheckDskInfo:
node 2, iz-pdb02, state 5 with leader 2 has smaller cluster size 1;
my cluster size 1 with leader 1

* Then finished
[ CSSD]2009-02-17 12:02:34.886 [1262557536] >TRACE: clssnmDoSyncUpdate:
Sync Complete!
*** 03/08/09 11:23 pm ***
Looking into node 2 cssd log.

* 12:02:20.647 – initiated sync protocol
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms

* Checking disk and find node1 does not do disk heartbeart for 59690 ms.
it would have waited for misscount and considered node 1 is dead
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo:
Checking disk info…
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo:
node(1) timeout(59690) state_network(5) state_disk(3) misstime(61000)

* node2 is the only active member of cluster, finished.
[ CSSD]2009-02-17 12:02:22.723 [1262557536] >TRACE: clssnmDoSyncUpdate:
Sync Complete!
*** 03/08/09 11:45 pm ***
So, strange point is, node 2 cssd says node 1 cssd didn’t do
disk heartbeat for 60 seconds.

Looking into node1 cssd log just before initiating sync. We see 87sec gap.
———————————–
[ CSSD]2009-02-17 12:00:19.354 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x784d80) proc(0x7749b0) pid(14746) proto(10:2:1:1)
[ CSSD]2009-02-17 12:00:33.338 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x7d8620) proc(0x75cfb0) pid() proto(10:2:1:1)
[ CSSD]2009-02-17 12:01:03.688 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x76a390) proc(0x75cfb0) pid(13634) proto(10:2:1:1)
[ CSSD]2009-02-17 12:02:30.855 [1168148832] >WARNING: clssnmDiskPMT:
sltscvtimewait timeout (69200)
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >WARNING: clssnmeventhndlr:
Receive failure with node 2 (iz-pdb02), state 3, con(0x72b980),
probe((nil)), rc=10
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >TRACE: clssnmDiscHelper:
iz-pdb02, node(2) connection failed, con (0x72b980), probe((nil))
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7

As an interesting point, clssgmClientConnectMsg does not show message,
but nm polling thread/disk ping thread does not warn timeout.
(Usually it should write message first at 50% of timeout = 30 sec)

And “sltscvtimewait timeout (69200)” message means, DiskPingMonitor thread
does not run for a 69200 ms whereas it just wants to sleep 1 second.

These suggest, cssd does not scheduled about 70 seconds on node 1.
I don’t see any log from DiskPingThread, but I assume it is suspended
at some point also, and back to work after 70 seconds.
Please check OS message file to see any interesting error is recorded.
To prevent this issue, keep watching OS performance to see if any
extreme high load does not happen.

Recommended solution it to go to 10.2.0.4 and use oprocd so that
we can expect oprocd kill node1 in such case.

 

 

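The case above shows the same split-brain problem: with "cssd can’t connect each other via interconnect", both sides conclude "I’m survivor". The MOS recommendation is to upgrade to 10.2.0.4, where the oprocd process can prevent this kind of disaster.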
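The 87-second silence that the MOS analysis points out in the node 1 log can be found by scanning for the largest gap between consecutive CSSD timestamps. This is a minimal sketch written against the line format in the excerpts above. Wrapped continuation lines carry no timestamp and are simply skipped.

```python
import re
from datetime import datetime

STAMP_RE = re.compile(r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def largest_gap(lines):
    """Return (gap_seconds, before, after) for the longest silence in the log.

    Returns (0.0, None, None) when fewer than two timestamps are found.
    """
    previous = None
    best = (0.0, None, None)
    for line in lines:
        match = STAMP_RE.search(line)
        if match is None:
            continue  # continuation line without a timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap > best[0]:
                best = (gap, previous, stamp)
        previous = stamp
    return best


if __name__ == "__main__":
    sample = [
        "[ CSSD]2009-02-17 12:00:33.338 [1199618400] >TRACE: clssgmClientConnectMsg:",
        "Connect from con(0x7d8620) proc(0x75cfb0) pid() proto(10:2:1:1)",
        "[ CSSD]2009-02-17 12:01:03.688 [1199618400] >TRACE: clssgmClientConnectMsg:",
        "Connect from con(0x76a390) proc(0x75cfb0) pid(13634) proto(10:2:1:1)",
        "[ CSSD]2009-02-17 12:02:30.855 [1168148832] >WARNING: clssnmDiskPMT:",
        "sltscvtimewait timeout (69200)",
    ]
    gap, before, after = largest_gap(sample)
    print(f"{gap:.3f}s of silence between {before} and {after}")
```

A long silence in a daemon that normally logs every second is a hint that the process was not scheduled, which is the conclusion the analysis reaches for node 1.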

 

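The problem was finally resolved by upgrading to 10.2.0.5. This case shows that in China's IT environment we sometimes have to check every detail ourselves. I did not notice at first that the upgrade was incomplete either; a colleague pointed it out to me. It was a very basic mistake: had the contractor's X码 engineer carefully followed their own documentation step by step, or paid attention to the instructions shown in the installer window during the upgrade, none of this would have happened. And it was not just this system: the same engineer had installed and upgraded two other systems as well, and the same problem turned up in all of them during later failure tests.

The facts speak for themselves: details determine success or failure!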

DataGuard Managed recovery hang

Our team deleted some archived logs by mistake. We rolled the standby database forward to an SCN with an RMAN incremental recovery, then did a manual recovery to sync it with the primary. Managed recovery is now failing:
alter database recover managed standby database disconnect

Alert log has :

Fri Jan 22 13:50:22 2010
Attempt to start background Managed Standby Recovery process
MRP0 started with pid=12
MRP0: Background Managed Standby Recovery process started
Media Recovery Waiting for thread 1 seq# 193389
Fetching gap sequence for thread 1, gap sequence 193389-193391
Trying FAL server: ITS
Fri Jan 22 13:50:28 2010
Completed: alter database recover managed standby database d
Fri Jan 22 13:53:25 2010
Failed to request gap sequence. Thread #: 1, gap sequence: 193389-193391
All FAL server has been attempted.
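To see exactly which archived logs the standby is still waiting for, the "Failed to request gap sequence" message can be expanded into individual sequence numbers. The sketch below assumes the message format shown in the alert log excerpt above; the function name is made up for this example.

```python
import re

FAILED_GAP_RE = re.compile(
    r"Failed to request gap sequence\. Thread #: (\d+), gap sequence: (\d+)-(\d+)"
)


def unresolved_gaps(alert_lines):
    """Return {thread: sorted list of missing sequence numbers}."""
    missing = {}
    for line in alert_lines:
        match = FAILED_GAP_RE.search(line)
        if match is None:
            continue
        thread, low, high = (int(group) for group in match.groups())
        # The range in the message is inclusive on both ends.
        missing.setdefault(thread, set()).update(range(low, high + 1))
    return {thread: sorted(seqs) for thread, seqs in missing.items()}


if __name__ == "__main__":
    alert = [
        "Fetching gap sequence for thread 1, gap sequence 193389-193391",
        "Trying FAL server: ITS",
        "Failed to request gap sequence. Thread #: 1, gap sequence: 193389-193391",
        "All FAL server has been attempted.",
    ]
    print(unresolved_gaps(alert))
```

The resulting list tells you which sequence numbers to look for on the primary, or to ship and register manually if FAL keeps failing.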

Managed recovery was working earlier today after the RMAN incremental recovery and resolved two gaps automatically. But it now appears hung, with the standby falling behind the primary.

SQL> show parameter fal
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
fal_client string ITS_STBY
fal_server string ITS
[v08k608:ITS:oracle]$ tnsping ITS_STBY
TNS Ping Utility for Solaris: Version 9.2.0.7.0 - Production on 22-JAN-2010 15:01:17
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/oracle/product/9.2.0/network/admin/sqlnet.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL= TCP)(Host= v08k608.am.mot.com)(Port= 1526)) (CONNECT_DATA = (SID = ITS)))
OK (10 msec)
[v08k608:ITS:oracle]$ tnsping ITS
TNS Ping Utility for Solaris: Version 9.2.0.7.0 - Production on 22-JAN-2010 15:01:27
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/oracle/product/9.2.0/network/admin/sqlnet.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL= TCP)(Host= 187.10.68.75)(Port= 1526)) (CONNECT_DATA = (SID = ITS)))
OK (320 msec)
Primary has :
SQL> show parameter log_archive_dest_2
log_archive_dest_2 string SERVICE=DRITS_V08K608 reopen=6
0 max_failure=10 net_timeout=1
80 LGWR ASYNC=20480 OPTIONAL
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
log_archive_dest_state_2 string ENABLE
[ITS]/its15/oradata/ITS/arch> tnsping DRITS_V08K608
TNS Ping Utility for Solaris: Version 9.2.0.7.0 - Production on 22-JAN-2010 15:03:24
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/oracle/product/9.2.0/network/admin/sqlnet.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL= TCP)(Host= 10.177.13.57)(Port= 1526)) (CONNECT_DATA = (SID = ITS)))
OK (330 msec)

The ARCH processes on the primary database may have hung because of the bug below, so they could not ship the missing archived log
files to the standby database.

BUG 6113783 ARC PROCESSES CAN HANG INDEFINITELY ON NETWORK
[ Not published so not viewable in My Oracle Support ]
Fixed 11.2, 10.2.0.5 patchset

We can work around the issue by killing the ARCH processes on the primary site; they are respawned
automatically and immediately, without harming the primary database.

[maclean@rh2 ~]$ ps -ef|grep arc
maclean   8231     1  0 22:24 ?        00:00:00 ora_arc0_PROD
maclean   8233     1  0 22:24 ?        00:00:00 ora_arc1_PROD
maclean   8350  8167  0 22:24 pts/0    00:00:00 grep arc
[maclean@rh2 ~]$ kill -9 8231 8233
[maclean@rh2 ~]$ ps -ef|grep arc
maclean   8389     1  0 22:25 ?        00:00:00 ora_arc0_PROD
maclean   8391     1  1 22:25 ?        00:00:00 ora_arc1_PROD
maclean   8393  8167  0 22:25 pts/0    00:00:00 grep arc
The alert log will then show:
Fri Jul 30 22:25:27 EDT 2010
ARCH: Detected ARCH process failure
ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC0 started with pid=26, OS id=8389
Fri Jul 30 22:25:27 EDT 2010
ARC0: Archival started
ARC1: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=27, OS id=8391
Fri Jul 30 22:25:27 EDT 2010
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
Fri Jul 30 22:25:27 EDT 2010
ARC1: Becoming the heartbeat ARCH

In fact, in 10g Oracle respawns all non-fatal background processes, as long as we do not kill one of the fatal ones.
For example:

[maclean@rh2 ~]$ ps -ef|grep ora_|grep -v grep
maclean  14264     1  0 23:16 ?        00:00:00 ora_pmon_PROD
maclean  14266     1  0 23:16 ?        00:00:00 ora_psp0_PROD
maclean  14268     1  0 23:16 ?        00:00:00 ora_mman_PROD
maclean  14270     1  0 23:16 ?        00:00:00 ora_dbw0_PROD
maclean  14272     1  0 23:16 ?        00:00:00 ora_lgwr_PROD
maclean  14274     1  0 23:16 ?        00:00:00 ora_ckpt_PROD
maclean  14276     1  0 23:16 ?        00:00:00 ora_smon_PROD
maclean  14278     1  0 23:16 ?        00:00:00 ora_reco_PROD
maclean  14338     1  0 23:16 ?        00:00:00 ora_arc0_PROD
maclean  14340     1  0 23:16 ?        00:00:00 ora_arc1_PROD
maclean  14452     1  0 23:17 ?        00:00:00 ora_s000_PROD
maclean  14454     1  0 23:17 ?        00:00:00 ora_d000_PROD
maclean  14456     1  0 23:17 ?        00:00:00 ora_cjq0_PROD
maclean  14458     1  0 23:17 ?        00:00:00 ora_qmnc_PROD
maclean  14460     1  0 23:17 ?        00:00:00 ora_mmon_PROD
maclean  14462     1  0 23:17 ?        00:00:00 ora_mmnl_PROD
maclean  14467     1  0 23:17 ?        00:00:00 ora_q000_PROD
maclean  14568     1  0 23:18 ?        00:00:00 ora_q001_PROD
[maclean@rh2 ~]$ ps -ef|grep ora_|grep -v pmon|grep -v ckpt |grep -v lgwr|grep -v smon|grep -v grep|grep -v dbw|grep -v psp|grep -v mman |grep -v rec|awk '{print $2}'|xargs kill -9
The alert log will then show:
Fri Jul 30 23:20:58 EDT 2010
ARCH: Detected ARCH process failure
ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC0 started with pid=20, OS id=14959
Fri Jul 30 23:20:58 EDT 2010
ARC0: Archival started
ARC1: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
Fri Jul 30 23:20:58 EDT 2010
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC1 started with pid=21, OS id=14961
ARC1: Becoming the heartbeat ARCH
Fri Jul 30 23:21:29 EDT 2010
found dead shared server 'S000', pid = (10, 3)
found dead dispatcher 'D000', pid = (11, 3)
Fri Jul 30 23:22:29 EDT 2010
Restarting dead background process CJQ0
Restarting dead background process QMNC
CJQ0 started with pid=12, OS id=15124
Fri Jul 30 23:22:29 EDT 2010
Restarting dead background process MMON
QMNC started with pid=13, OS id=15126
Fri Jul 30 23:22:29 EDT 2010
Restarting dead background process MMNL
MMON started with pid=14, OS id=15128
MMNL started with pid=16, OS id=15132
That's all right!
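
The long chain of grep -v filters used above can be collapsed into a single pattern. The following is only a minimal sketch and was not part of the original test: the function name nonfatal_pids is hypothetical, and the list of fatal processes is simply the set that the command above excludes (pmon, psp, mman, dbw, lgwr, ckpt, smon, reco). It prints the PIDs and does not kill anything.

```shell
# Print the PIDs of Oracle background processes that are NOT in the
# excluded ("fatal") set used above. Reads `ps -ef` style output on stdin:
# column 2 is the PID, column 8 is the command name.
nonfatal_pids() {
    awk '$8 ~ /^ora_/ && $8 !~ /^ora_(pmon|psp[0-9]|mman|dbw[0-9a-z]|lgwr|ckpt|smon|reco)_/ { print $2 }'
}

# Sample lines copied from the ps listing above.
sample='maclean  14264     1  0 23:16 ?        00:00:00 ora_pmon_PROD
maclean  14272     1  0 23:16 ?        00:00:00 ora_lgwr_PROD
maclean  14338     1  0 23:16 ?        00:00:00 ora_arc0_PROD
maclean  14456     1  0 23:17 ?        00:00:00 ora_cjq0_PROD'

printf '%s\n' "$sample" | nonfatal_pids
```

On a live system the input would come from ps -ef instead of the sample text. Reviewing the printed list before passing it to kill is the safer habit.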

ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [729], [10992], [SPACE LEAK] Example

The customer got this error every other day on version 9.2.0.7. They had increased the shared pool from 450MB to 704MB. Let's look at the alert.log and the most recently generated trace file.

SQL> l
1  select  nam.ksppinm NAME,
2  val.KSPPSTVL VALUE
3  from x$ksppi nam,
4  x$ksppsv val
5  where nam.indx = val.indx
6  and  nam.ksppinm like '%shared%'
7* order by 1
SQL> /
NAME                                                              VALUE
----------------------------------------------------------------  ----------
_all_shared_dblinks
_shared_pool_reserved_min_alloc                                   4400
_shared_pool_reserved_pct                                         5
hi_shared_memory_address                                          0
max_shared_servers                                                20
shared_memory_address                                             0
shared_pool_reserved_size                                         31876710
shared_pool_size                                                  738197504
shared_server_sessions                                            0
shared_servers                                                    0
10 rows selected.
SQL>  select FREE_SPACE,LAST_FAILURE_SIZE,REQUEST_FAILURES,LAST_MISS_SIZE  from v$shared_pool_reserved;
FREE_SPACE LAST_FAILURE_SIZE  REQUEST_FAILURES LAST_MISS_SIZE
---------- -----------------  ---------------- --------------
19018368               456               725              0
1 row selected.
Alert log
~~~~~~~~~~
Thu May 28 19:05:11 2009
Errors in file  /u01/app/oracle/admin/preg062/udump/preg062_ora_17314.trc:
ORA-00600:  internal error code, arguments: [729], [10992], [space leak], [], [],  [], [], []
Trace File
~~~~~~~~~~~
Dump file  /u01/app/oracle/admin/preg062/udump/preg062_ora_17314.trc
Oracle9i  Enterprise Edition Release 9.2.0.7.0 - 64bit Production
With the  Partitioning, OLAP and Oracle Data Mining options
JServer Release  9.2.0.7.0 - Production
ORACLE_HOME =  /u01/app/oracle/product/920preg062
System name:	SunOS
Node name:	 iccscorp
Release:	5.9
Version:	Generic_122300-22
Machine:	sun4u
Instance  name: preg062
Error
-----
ORA-00600: internal error code,  arguments: [729], [10992], [space leak], [], [], [], [], []
Current  SQL
-----------
None
Call Stack
----------
ksedmp  kgeriv kgesiv ksesic2 ksmuhe ksmugf ksuxds ksudel opilof opiodr ttcpip  opitsk opiino opiodr opidrv sou2o main start
Session info
------------
SO:  411536570, type: 4, owner: 40e583e08, flag: INIT/-/-/0x00
(session) trans: 0, creator: 40e583e08, flag: (41) USR/- BSY/-/-/DEL/-/-
DID: 0001-00F9-00000F5B, short-term DID:  0000-0000-00000000
txn branch: 0
oct:  0, prv: 0, sql: 417fbbf18, psql: 416fa9840, user: 31/MATRIXTWO
O/S info: user: matrixadmin, term: , ospid: 17281, machine: iccscorp
program: mql@iccscorp (TNS V1-V3)
last wait for  'SQL*Net message from client' blocking sess=0x0 seq=3208 wait_time=836
driver id=54435000, #bytes=1, =0
ORA-04031  details
~~~~~~~~~~~~~
Begin 4031 Diagnostic Information
Allocation  Request
-------------------
Allocation request for: kkslpkp -  literal info.
Heap: 3d6fb45f0, size: 4200
Call stack
-----------
ksm_4031_dump   ksmasg  kghnospc  kghalp  kghsupmm  kghssgai  kkslpkp  kkslpgo  kkepsl   kkecdn  kkotap  kkoiqb  kkooqb  kkoqbc  apakkoqb
apaqbd  apadrv   opitca  kkssbt  kksfbc  kkspfda  kpodny  kpoal8  opiodr  ttcpip  opitsk   opiino  opiodr  opidrv  sou2o  main
Session Info
-------------
SO:  411536570, type: 4, owner: 40e583e08, flag: INIT/-/-/0x00
(session) trans: 0, creator: 40e583e08, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0001-00F9-00000F5B, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 4311e4e30,  psql: 4311e4e30, user: 31/MATRIXTWO
O/S info: user: matrixadmin,  term: , ospid: 17281, machine: iccscorp
program:  mql@iccscorp (TNS V1-V3)
application name: mql@iccscorp (TNS  V1-V3), hash value=0
last wait for 'SQL*Net message from client'  blocking sess=0x0 seq=3196 wait_time=1975
driver  id=54435000, #bytes=1, =0
Number of Subpools and allocations
----------------------------------
===============================
Memory  Utilization of Subpool 1
===============================
Allocation Name          Size
_________________________   __________
"free memory              "    25065216
"miscellaneous             "    14914048
===============================
Memory  Utilization of Subpool 2
===============================
Allocation Name          Size
_________________________   __________
"free memory              "     9306608
"miscellaneous             "    19358000
===============================
Memory  Utilization of Subpool 3
===============================
Allocation Name          Size
_________________________   __________
"free memory              "    25209192
"miscellaneous             "    10192440
===============================
Memory  Utilization of Subpool 4
===============================
Allocation Name          Size
_________________________   __________
"free memory              "    15005800
"miscellaneous             "    11097176
LIBRARY CACHE STATISTICS:
namespace            gets hit ratio      pins hit ratio    reloads   invalids
--------------  --------- --------- --------- --------- ---------- ----------
CRSR            400143894     0.951 1821611655     0.969   10619950      63892
TABL/PRCD/TYPE  230543353     0.996 255666572     0.934    7504796          0
Connection  Mode & Relevant parameters
--------------------------------------
sga_max_size       = 3159332528
shared_pool_size =  738197504
db_cache_size       =  956301312
cursor_sharing      = SIMILAR
pga_aggregate_target  = 2097152000

It seems the ORA-04031 is the main issue: it triggered the ORA-00600 [729] error after the session was abnormally terminated or killed.

The memory request on the "shared pool" failed while trying to allocate 4200 bytes, even though each of the 4 subpools had between 9 and 25 MB of free memory.
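
To make that comparison concrete, the free memory per subpool can be pulled out of the 4031 diagnostic section of the trace. This is a minimal sketch: the function name subpool_free is hypothetical, and it assumes the trace layout shown above (a "Memory Utilization of Subpool N" heading followed by a quoted "free memory" line).

```shell
# Summarise free memory per subpool from an ORA-04031 trace excerpt on stdin.
subpool_free() {
    awk '
        /Utilization of Subpool/ { subpool = $NF }   # remember the current subpool number
        /^"free memory/          { printf "subpool %s: %.1f MB free\n", subpool, $NF / 1048576 }
    '
}

# Excerpt copied from the trace file above.
subpool_free <<'EOF'
Memory  Utilization of Subpool 1
"free memory              "    25065216
"miscellaneous             "    14914048
Memory  Utilization of Subpool 2
"free memory              "     9306608
"miscellaneous             "    19358000
Memory  Utilization of Subpool 3
"free memory              "    25209192
"miscellaneous             "    10192440
Memory  Utilization of Subpool 4
"free memory              "    15005800
"miscellaneous             "    11097176
EOF
```

Every subpool reports several megabytes free while a 4200-byte request fails, which points to fragmentation rather than an undersized pool.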

I have reviewed the alert, trace and RDA report and following are my findings.

# Shared_pool_size is 738197504 and 4 subpools are used.
# Memory request failed for 4200 bytes.
# None of the components in subpools are showing any abnormal growth.

Suggestion
—————-
The issue does not exactly match any known bug. Modifying the memory-related parameters should help to avoid these errors.

1) Reduce the number of subpools from 4 to 2 by setting "_kghdsidx_count"=2, then restart the database. This will also help to reduce shared pool fragmentation. Refer to Note 396940.1.

SQL> alter system set "_kghdsidx_count"=2 scope=spfile;

2) The failed memory request is for 4200 bytes, which is below the current _shared_pool_reserved_min_alloc of 4400.
Set _shared_pool_reserved_min_alloc=4000 so that requests larger than 4000 bytes can be allocated from the reserved area.

SQL> alter system set "_shared_pool_reserved_min_alloc"=4000 scope=spfile;

3) Set shared_pool_reserved_size to 10 to 15% of the shared pool size by setting the _shared_pool_reserved_pct parameter.

SQL> alter system set "_shared_pool_reserved_pct"=10 scope=spfile;

Implement the above changes and restart the database. This should reduce shared pool fragmentation and help to avoid the ORA-04031/ORA-00600 [729] errors.
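
The arithmetic behind suggestions 2 and 3 can be checked quickly in the shell. This is only an illustrative sketch: the helper uses_reserved is hypothetical, and it assumes the rule stated above, namely that a request may use the reserved area only when it is larger than _shared_pool_reserved_min_alloc. The sizes come from the parameter listing and the trace file.

```shell
# Values taken from the parameter listing and the 4031 trace above.
shared_pool_size=738197504
request_size=4200

# Assumption: a request may use the reserved area only when it is larger
# than the _shared_pool_reserved_min_alloc threshold.
uses_reserved() {   # usage: uses_reserved <request_bytes> <min_alloc_bytes>
    [ "$1" -gt "$2" ]
}

# Compare the current threshold (4400) with the suggested one (4000).
for min_alloc in 4400 4000; do
    if uses_reserved "$request_size" "$min_alloc"; then
        echo "min_alloc=$min_alloc: a $request_size byte request can use the reserved area"
    else
        echo "min_alloc=$min_alloc: a $request_size byte request cannot use the reserved area"
    fi
done

# Reserved area size at 10% and 15% of the shared pool (integer bytes).
for pct in 10 15; do
    echo "reserved at ${pct}%: $(( shared_pool_size * pct / 100 )) bytes"
done
```

With the current threshold of 4400 the 4200-byte request is not eligible for the reserved area; lowering the threshold to 4000 makes it eligible.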

After applying the above changes, the error has not occurred again.

Famous summary stack trace from an Oracle version 8.1.7.4.0 bug note

As this bug note states:

PROBLEM:
——–
Customer frequently receives the following errors during rollback of a
transaction using the Portal application:

ORA-603: ORACLE server session terminated by fatal error
ORA-600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []

ORA-600: internal error code, arguments: [25012], [3], [15], [], [], [], [], []

DIAGNOSTIC ANALYSIS:
——————–
Alert.log:
~~~~~~~~~~
Wed May 19 12:47:28 2004
Errors in file /opt/oracle/admin/ORTPTP/udump/ortptp_ora_6363.trc:
ORA-603: ORACLE server session terminated by fatal error
ORA-600: internal error code, arguments: [6856], [0], [0], [], [], [], [], []
Wed May 19 14:38:39 2004
Errors in file /opt/oracle/admin/ORTPTP/udump/ortptp_ora_782.trc:
ORA-600: internal error code, arguments: [25012], [3], [15], [], [], [], [], []

Tablespace 3 = TEMP tablespace.

Block dump in tracefile ortptp_ora_21207.trc points to TEMP tablespace and
TEMP segment:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Block header dump:  0x00c0b917
Object id on Block? Y
seg/obj: 0xc0b916  csc: 0x00.18f4bc  itc: 1  flg: O  typ: 1 – DATA
fsl: 0  fnx: 0x0 ver: 0x01
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

WORKAROUND:
———–

RELATED BUGS:
————-
3562030

REPRODUCIBILITY:
—————-
Frequently

TEST CASE:
———-

STACK TRACE:
————
Summary Stack
ksedmp             # KSE: dump the process state
kgeriv             # KGE Record Internal error code (with Va_list) (IGNORE)
kgeasi             # Raise an error on an ASSERTION failure (IGNORE)
kdbmrd             ? Module Notes: kdb.c – Kernel Data Block structure and internal manipulation
kdoqmd             ? Module Notes: kdo.c – Kernel Data Operations
kcoapl             NAME: kcoapl – Kernel Cache Op APpLy
kcbchg1
kcbchg
ktuapundo          ktuapundo – Kernel Transaction Undo APply UNdo
ktbapundo          ktbapundo – Kernel Transaction Block APply UNdo
kdoiur             declare local objects
kcoubk             kcoubk – Kernel Cache Op Undo callBacK – invoke undo callback routine
ktundo             ktundo – Kernel Transaction UNDO
ktubko             Get undo record to rollback transaction, non-CR only
ktuabt             ktuabt – Kernel Transaction Undo ABorT
ktcrab             KTC: Kernel Transaction Control Real ABort – Abort a transaction.
ktdabt
k2labo             abort session: first abort aborts tx
k2send             TESTING SUPPORT:
xctrol             XaCTion ROLlback: Rollback the current transaction of the current session.
opiodr             OPIODR: ORACLE code request driver – route the current request
ttcpip             TTCPIP: Two Task Common PIPe read/write
opitsk             opitsk – Two Task Oracle Side Function Dispatcher
opiino             opiino – ORACLE Program Interface INitialize Opi
opiodr             OPIODR: ORACLE code request driver – route the current request
opidrv             # opidrv – ORACLE Program Interface DRiVer (IGNORE)
sou2o              # Main Oracle executable entry point
main               # Standard executable entry point
start              # C program entry point (IGNORE)
**********************************************************************************************

Another summary:

drepprep     perform the document indexing
evapls    EVAluate any PLSql function
kcmclscn    check Lamport SCN
kcsadj1    adjust SCN
kgesinv    KGE Signal Internal (Named) error (with VA_list)
kghalo    KGH: main allocation entry point
kghalp    KGH: Allocate permanent memory
kghfnd    KGH: Find a chunk of at least the minimum size
kghfrunp    KGH: Ask client to free unpinned space
kghfrx    Free extent. This is called when a heap is unpinned to request that it
kghgex    KGH: Get a new extent
kghnospc    KGH: There is no space available in the heap
kghpmalo    KGH: Find and return a permanent chunk of space
kghxal    Allocate a fixed size piece of shared memory.
kglhpd    KGL HeaP Deallocate
kglobcl    KGL OBject CLear all tables
kglpnal    KGL PiN ALlOcate
kglpnc    KGL: PiN heaps and load data pieces of a Cursor object
kglpndl    KGL PiN DeLete
kglrfcl    KGL ReFerence CLear
kgmexec    KGM EXECute
kkmpost    POST PROCESSING
kksalx    ALlocate ‘size’ bytes from the eXecution-time heap
kkscls    KKS: Close the cursor, user is done with it
kkspfda    Multiple context area management
kkssbt    KKS: set bind types
kksscl    KKS: scan child list?
koklcopy    KOK Lob COPY.
koklcpb2c    KOK Lob CoPy Binary data (BFILE/BLOB) into Clob
kolfgdir    KOL File Get DIRectory object, path and FileNames.
kpuexec    KPU: Execute
kpuexecv8    KPU: Execute V8
kpurcsc    KPU Remote Call with ServiceContext, Callbacks
kqdgtc    return an open and parsed cursor for the given statement
kqldprr    KQLD Parent Referential constraint Read
kqllod    KQL: database object load
kqlsadd    kqlsadd – KQLS ADD a new element to a subordinate set
kqlslod    KQLS: Load all subordinate set elements for a given heap
kslcll    KSL: Clean up after a given latch
kslcllt    Clean up after a given latch
kslilcr    invoke latch cleanup routine:
ksmapg    KSM: Callback function for allocating a PGA extent, calls OSD to alloc
ksmasg    Callback function for allocating an SGA extent.
kssxdl    KSS: delete SO ignoring all except severe errors. cleans latches
ksucln    KSUCLN: Cleanup detached process
ksudlc    delete call
ksudlp    KSU: delete process.called when user detaches or during cleanup by PMON
ksuxda    KSUCLN: Attempt to delete all processes that are marked dead.
ksuxdl    KSUCLN: Delete state object for PMON
ksuxfl    KSU: Find dead processes and cleanup their latches. Called by PMON
kxfpbgpc    Get Permanent Chunks
kxfpbgtc    Buffer Allocation Get Chunk
kxfpnfy    KXFP: NotiFY (component notifier)
kxfxse    KXFX: execute
kxstcls    Trace cursor closing
opicca    ORACLE Program Interface: Clear Context Area
opiclo    ORACLE Program Interface: CLOse cursor
opiprs    ORACLE Program Interface: PaRSe
opitca    OPITCA: sets up the context area
pextproc    Pefm call EXTernal PROCedure
qerocStart    This function creates a collection iterator row-source to iterate
qkadrv    QKADRV: allocate query structures
qkajoi    QKAJOI: Query Kernel Allocation: JOIn processing
qximeop    QXIM Evaluate OPerand
rpicls    RPI: Recursive Program Interface CLoSe
selexe    SELEXE: prepare context area for fetch
xtyinpr    XTY Insert Numeric PRecision operator

 

ORA-600 Lookup Error Categories

Applies to:

Oracle Server – Enterprise Edition – Version:
Oracle Server – Personal Edition – Version:
Oracle Server – Standard Edition – Version:
Information in this document applies to any platform.
Checked for relevance 04-Jun-2009

Purpose

This note aims to provide a high level overview of the internal errors which may be encountered on the Oracle Server (sometimes referred to as the Oracle kernel). It is written to provide a guide to where a particular error may live and give some indication as to what the impact of the problem may be. Where a problem is reproducible and connected with a specific feature, you might obviously try not using the feature. If there is a consistent nature to the problem, it is good practice to ensure that the latest patchsets are in place and that you have taken reasonable measures to avoid known issues.

For repeatable issues for which the ora-600 tool has not listed a likely cause, it is worth constructing a test case. Where this is possible, it greatly reduces the resolution time of any issue. It is important to remember that, in many instances, the Server is very flexible and a workaround can very often be achieved.

Scope and Application

This bulletin provides Oracle DBAs with an overview of internal database errors.

Disclaimer: Every effort has been made to provide a reasonable degree of accuracy in what has been stated. Please consider that the details provided only serve to provide an indication of functionality and, in some cases, may not be wholly correct.

ORA-600 Lookup Error Categories

In the Oracle Server source, there are two types of ora-600 error :

  • the first parameter is a number which reflects the source component or layer the error is connected with; or
  • the first parameter is a mnemonic which indicates the source module where the error originated. This type of internal error is now used in preference to an internal error number.

Both types of error may be possible in the Oracle server.

Internal Errors Categorised by number range

The following table provides an indication of internal error codes used in the Oracle server. Thus, if ora-600[X] is encountered, it is possible to glean some high level background information: the error is generated in the Y layer, which indicates that there may be a problem with Z.

Ora-600 Base Functionality Description
1 Service Layer The service layer has within it a variety of service related components which are associated with in memory related activities in the SGA such as, for example : the management of Enqueues, System Parameters, System state objects (these objects track the use of structures in the SGA by Oracle server processes), etc.. In the main, this layer provides support to allow process communication and provides support for locking and the management of structures to support multiple user processes connecting and interacting within the SGA. Note : vos  – Virtual Operating System provides features to support the functionality above.  As the name suggests it provides base functionality in much the same way as is provided by an Operating System.

Ora-600 Base Functionality Description
1 vos Component notifier
100 vos Debug
300 vos Error
500 vos Lock
700 vos Memory
900 vos System Parameters
1100 vos System State object
1110 vos Generic Linked List management
1140 vos Enqueue
1180 vos Instance Locks
1200 vos User State object
1400 vos Async Msgs
1700 vos license Key
1800 vos Instance Registration
1850 vos I/O Services components
2000 Cache Layer Where errors are generated in this area, it is advisable to check whether the error is repeatable and whether the error is perhaps associated with recovery or undo type operations; where this is the case and the error is repeatable, this may suggest some kind of hardware or physical issue with a data file, control file or log file. The Cache layer is responsible for making the changes to the underlying files as well as managing the related memory structures in the SGA. Note : rcv indicates recovery. It is important to remember that the Oracle cache layer is effectively going through the same code paths as used by the recovery mechanism.

Ora-600 Base Functionality Description
2000 server/rcv Cache Op
2100 server/rcv Control File mgmt
2200 server/rcv Misc (SCN etc.)
2400 server/rcv Buffer Instance Hash Table
2600 server/rcv Redo file component
2800 server/rcv Db file
3000 server/rcv Redo Application
3200 server/cache Buffer manager
3400 server/rcv Archival & media recovery component
3600 server/rcv recovery component
3700 server/rcv Thread component
3800 server/rcv Compatibility segment

It is important  to consider when the error occurred and the context in which the error was generated. If the error does not reproduce, it may be an in memory issue.

4000 Transaction Layer Primarily the transaction layer is involved with maintaining structures associated with the management of transactions.  As with the cache layer , problems encountered in this layer may indicate some kind of issue at a physical level. Thus it is important to try and repeat the same steps to see if the problem recurs.

Ora-600 Base Functionality Description
4000 server/txn Transaction Undo
4100 server/txn Transaction Undo
4210 server/txn Transaction Parallel
4250 server/txn Transaction List
4300 space/spcmgmt Transaction Segment
4400 txn/lcltx Transaction Control
4450 txn/lcltx distributed transaction control
4500 txn/lcltx Transaction Block
4600 space/spcmgmt Transaction Table
4800 dict/rowcache Query Row Cache
4900 space/spcmgmt Transaction Monitor
5000 space/spcmgmt Transaction Extent

It is important to try and determine what the object involved in any reproducible problem is. Then use the analyze command. For more information, please refer to the analyze command as detailed in the context of  Note:28814.1; in addition, it may be worth using the dbverify as discussed in Note:35512.1.

6000 Data Layer The data layer is responsible for maintaining and managing the data in the database tables and indexes. Issues in this area may indicate some kind of physical issue at the object level and therefore, it is important to try and isolate the object and then perform an analyze on the object to validate its structure.

Ora-600 Base Functionality Description
6000 ram/data, ram/analyze, ram/index data, analyze command and index related activity
7000 ram/object lob related errors
8000 ram/data general data access
8110 ram/index index related
8150 ram/object general data access

Again, it is important to try and determine what the object involved in any reproducible problem is. Then use the analyze command. For more information, please refer to the analyze command as detailed in the context of  Note:28814.1; in addition, it may be worth using the dbverify as discussed in Note:35512.1.

12000 User/Oracle Interface & SQL Layer Components This layer governs the user interface with the Oracle server. Problems generated by this layer usually indicate : some kind of presentation or format error in the data received by the server, i.e. the client may have sent incomplete information; or there is some kind of issue which indicates that the data is received out of sequence

Ora-600 Base Functionality Description
12200 progint/kpo, progint/opi lob related; errors at interface level on server side, xa, etc.
12300 progint/if OCI interface to coordinating global transactions
12400 sqlexec/rowsrc table row source access
12600 space/spcmgmt operations associated with tablespace : alter / create / drop operations ; operations associated with create table / cluster
12700 sqlexec/rowsrc bad rowid
13000 dict/if dictionary access routines associated with kernel compilation
13080 ram/index kernel Index creation
13080 sqllang/integ constraint mechanism
13100 progint/opi archival and Media Recovery component
13200 dict/sqlddl alter table mechanism
13250 security/audit audit statement processing
13300 objsupp/objdata support for handling of object generation and object access
14000 dict/sqlddl sequence generation
15000 progint/kpo logon to Oracle
16000 tools/sqlldr sql loader related

You should try and repeat the issue and with the use of sql trace , try and isolate where exactly the issue may be occurring within the application.

14000 System Dependent Component internal error values This layer manages interaction with the OS. Effectively it acts as the glue which allows the Oracle server to interact with the OS. The types of operation which this layer manages are indicated as follows.

Ora-600 Base Functionality Description
14000 osds File access
14100 osds Concurrency management;
14200 osds Process management;
14300 osds Exception-handler or signal handler management
14500 osds Memory allocation
15000 security/dac, security/logon, security/ldap local user access validation; challenge / response activity for remote access validation; auditing operation; any activities associated with granting and revoking of privileges; validation of password with external password file
15100 dict/sqlddl this component manages operations associated with creating, compiling (altering), renaming, invalidating, and dropping  procedures, functions, and packages.
15160 optim/cbo cost based optimizer layer is used to determine optimal path to the data based on statistical information available on the relevant tables and indexes.
15190 optim/cbo cost based optimizer layer. Used in the generation of a new index to determine how the index should be created. Should it be constructed from the table data or from another index.
15200 dict/shrdcurs used in creating sharable context area associated with shared cursors
15230 dict/sqlddl manages the compilation of triggers
15260 dict/dictlkup, dict/libcache dictionary lookup and library cache access
15400 server/drv manages alter system and alter session operations
15410 progint/if manages compilation of pl/sql packages and procedures
15500 dict/dictlkup performs dictionary lookup to ensure semantics are correct
15550 sqlexec/execsvc, sqlexec/rowsrc hash join execution management; parallel row source management
15600 sqlexec/pq component provides support for Parallel Query operation
15620 repl/snapshots manages the creation of snapshot or materialized views as well as related snapshot / MV operations
15640 repl/defrdrpc layer containing various functions for examining the deferred transaction queue and retrieving information
15660 jobqs/jobq manages the operation of the Job queue background processes
15670 sqlexec/pq component provides support for Parallel Query operation
15700 sqlexec/pq component provides support for Parallel Query operation; specifically mechanism for starting up and shutting down query slaves
15800 sqlexec/pq component provides support for Parallel Query operation
15810 sqlexec/pq component provides support for Parallel Query operation; specifically functions for creating mechanisms through which Query co-ordinator can communicate with PQ slaves;
15820 sqlexec/pq component provides support for Parallel Query operation
15850 sqlexec/execsvc component provides support for the execution of SQL statements
15860 sqlexec/pq component provides support for Parallel Query operation
16000 loader sql Loader direct load operation;
16150 loader this layer is used for ‘C’ level call outs to direct loader operation;
16200 dict/libcache this is part of library Cache operation. Amongst other things it manages the dependency of SQL objects and tracks who is permitted to access these objects;
16230 dict/libcache this component is responsible for managing access to remote objects as part of library Cache operation;
16300 mts/mts this component relates to MTS (Multi Threaded Server) operation
16400 dict/sqlddl this layer contains functionality which allows tables to be loaded / truncated and their definitions to be modified. This is part of dictionary operation;
16450 dict/libcache this layer provides support for multi-instance access to the library cache; this functionality is applicable therefore to OPS environments;
16500 dict/rowcache this layer provides support to load / cache Oracle’s dictionary in memory in the library cache;
16550 sqlexec/fixedtab this component maps data structures maintained in the Oracle code to fixed tables such that they can be queried using the SQL layer;
16600 dict/libcache this layer performs management of data structures within the library cache;
16651 dict/libcache this layer performs management of dictionary related information within library Cache;
16701 dict/libcache this layer provides library Cache support to support database creation and forms part of the bootstrap process;
17000 dict/libcache this is the main library Cache manager. This Layer maintains the in memory representation of cached sql statements together with all the necessary support that this demands;
17090 generic/vos this layer implements error management operations: signalling errors, catching errors, recovering from errors, setting error frames, etc.;
17100 generic/vos Heap manager. The Heap manager manages the storage of internal data in an orderly and consistent manner. There can be many heaps serving various purposes; and heaps within heaps. Common examples are the SGA heap, UGA heap and the PGA heap. Within a Heap there are consistency markers which aim to ensure that the Heap is always in a consistent state. Heaps are used extensively and are in memory structures – not on disk.
17200 dict/libcache this component deals with loading remote library objects into the local library cache with information from the remote database.
17250 dict/libcache more library cache errors ; functionality for handling pipe operation associated with dbms_pipe
17270 dict/instmgmt this component manages instantiations of procedures, functions, packages, and cursors in a session. This provides a means to keep track of what has been loaded in the event of process death;
17300 generic/vos manages certain types of memory allocation structure.  This functionality is an extension of the Heap manager.
17500 generic/vos relates to various I/O operations. These relate to async i/o operation,  direct i/o operation and the management of writing buffers from the buffer cache by potentially a number of database writer processes;
17625 dict/libcache additional library Cache supporting functions
17990 plsql plsql ‘standard’ package related issues
18000 txn/lcltx transaction and savepoint management operations
19000 optim/cbo cost based optimizer related operations
20000 ram/index bitmap index and index related errors.
20400 ram/partnmap operations on partition related objects
20500 server/rcv server recovery related operation
21000 repl/defrdrpc, repl/snapshot, repl/trigger replication related features
23000 oltp/qs AQ related errors.
24000 dict/libcache operations associated with managing stored outlines
25000 server/rcv tablespace management operations
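
As a rough illustration of how the numeric table can be used, the Python sketch below maps the first argument of an ORA-600 error to a row of the table. It copies only the rows from 17000 to 20000, and it assumes that each row covers every value from its own number up to (but not including) the next row's number; that reading is an assumption of this sketch, and the names `RANGES`, `UPPER_BOUND` and `classify` are hypothetical.

```python
import bisect

# Rows copied from the table above (17000 through 20000 only).
# Assumption: each row covers [its number, next row's number).
RANGES = [
    (17000, "dict/libcache", "main library cache manager"),
    (17090, "generic/vos", "error management"),
    (17100, "generic/vos", "heap manager"),
    (17200, "dict/libcache", "loading remote library objects"),
    (17250, "dict/libcache", "more library cache errors, dbms_pipe"),
    (17270, "dict/instmgmt", "instantiation management"),
    (17300, "generic/vos", "memory allocation structures"),
    (17500, "generic/vos", "I/O operations"),
    (17625, "dict/libcache", "additional library cache functions"),
    (17990, "plsql", "plsql 'standard' package"),
    (18000, "txn/lcltx", "transaction and savepoint management"),
    (19000, "optim/cbo", "cost based optimizer"),
    (20000, "ram/index", "bitmap index and index related errors"),
]
UPPER_BOUND = 20400  # first row NOT included in this subset
_STARTS = [start for start, _, _ in RANGES]


def classify(first_arg):
    """Return (layer, summary) for a numeric first argument, or None
    when the value falls outside the rows copied into RANGES."""
    if first_arg < _STARTS[0] or first_arg >= UPPER_BOUND:
        return None
    # Index of the last row whose starting number is <= first_arg.
    pos = bisect.bisect_right(_STARTS, first_arg) - 1
    _, layer, summary = RANGES[pos]
    return layer, summary


if __name__ == "__main__":
    for arg in (17059, 17182, 17285, 25000):
        print(arg, classify(arg))
```

Values outside the copied rows return None rather than a guess, so the sketch never attributes an error to a layer it has no row for.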

Internal Errors Categorised by mnemonic

The following table lists the mnemonic error stems that are possible. If you have encountered ora-600[kkjsrj:1], for example, look down the Error Mnemonic column (entries are in alphabetical order) until you find the matching stem. In this case, kkj indicates that something unexpected has occurred in job queue operation.

Error Mnemonic(s) Functionality Description
ain ainp ram/index ain – alter index; ainp –  alter index partition management operation
apacb optim/rbo used by optimizer in connect by processing
atb atbi atbo ctc ctci cvw dict/sqlddl alter table , create table (IOT) or cluster operations as well as create view related operations (with constraint handling functionality)
dbsdrv sqllang/parse alter / create database operation
ddfnet progint/distrib various distributed operations on remote dictionary
delexe sqlexec/dmldrv manages the delete statement operation
dix ram/index manages drop index or validate index operation
dtb dict/sqlddl manages drop table operation
evaa2g evah2p dbproc/sqlfunc various functions involved in evaluating operand outcomes such as: addition, average, OR operator, bitwise AND, bitwise OR, concatenation, as well as Oracle related functions: count(), dump(), etc. The list is extensive.
expcmo expgon dbproc/expreval handles expression evaluation with respect to two operands being equivalent
gra security/dac manages the granting and revoking of privilege rights to a user
gslcsq plsldap support for operations with an LDAP server
insexe sqlexec/dmldrv handles the insert statement operation
jox progint/opi functionality associated with the Java compiler and with the Java runtime environment within the Server
k2c k2d progint/distrib support for database to database operation in distributed environments as well as providing, with respect to the 2-phase commit protocol, a globally unique Database id
k2g k2l txn/disttx support for the 2 phase commit protocol and the coordination of the various states in managing the distributed transaction
k2r k2s k2sp progint/distrib k2r – user interface for managing distributed transactions and combining distributed results ; k2s – handles logging on, starting a transaction, ending a transaction and recovering a transaction; k2sp – management of savepoints in a distributed environment.
k2v txn/disttx handles distributed recovery operation
kad cartserv/picklercs handles OCIAnyData implementation
kau ram/data manages the modification of indexes for insert, update and delete operations on IOTs
kcb kcbb kcbk kcbl kcbs kcbt kcbw kcbz cache manages Oracle’s buffer cache operation as well as operations used by capabilities such as direct load, hash clusters, etc.
kcc kcf rcv manages and coordinates operations on the control file(s)
kcit context/trigger internal trigger functionality
kck rcv compatibility related checks associated with the compatible parameter
kcl cache background lck process which manages locking in a RAC or parallel server multiple instance environment
kco kcq kcra kcrf kcrfr kcrfw kcrp kcrr kcs kct kcv rcv various buffer cache operation such as quiesce operation , managing fast start IO target, parallel recovery operation , etc.
kd ram/data support for row level dependency checking and some log miner operations
kda ram/analyze manages the analyze command and collection of statistics
kdbl kdc kdd ram/data support for direct load operation, cluster space management and deleting rows
kdg ram/analyze gathers information about the underlying data and is used by the analyze command
kdi kdibc3 kdibco kdibh kdibl kdibo kdibq kdibr kdic kdici kdii kdil kdir kdis kdiss kdit kdk ram/index support for the creation of indexes on tables and IOTs and index lookup
kdl kdlt ram/object lob and temporary lob management
kdo ram/data operations on data such as inserting a row piece or deleting a row piece
kdrp ram/analyze underlying support for operations provided by the dbms_repair package
kds kdt kdu ram/data operations on data such as retrieving a row and updating existing row data
kdv kdx ram/index functionality for dumping index and managing index blocks
kfc kfd kfg asm support for ASM file and disk operations
kfh kfp kft rcv support for writing to file header and transportable tablespace operations
kgaj kgam kgan kgas kgat kgav kgaz argusdbg/argusdbg support for Java Debug Wire Protocol (JDWP) and debugging facilities
kgbt kgg kgh kghs kghx kgkp vos kgbt – support for BTree operations; kgg – generic lists processing; kgh – Heap Manager: managing the internal structures within the SGA / UGA / PGA and ensuring their integrity; kghs – Heap manager with Stream support; kghx – fixed sized shared memory manager; kgkp – generic services scheduling policies
kgl kgl2 kgl3 kgla kglp kglr kgls dict/libcache generic library cache operation
kgm kgmt ilms support for inter language method services – or calling one language from another
kgrq kgsk kgski kgsn kgss vos support for priority queue and scheduling; capabilities for Numa support;  Service State object manager
kgupa kgupb kgupd0 kgupf kgupg kgupi kgupl kgupm kgupp kgupt kgupx kguq2 kguu vos service related activities associated with the Process Monitor (PMON); spawning or creating of background processes; debugging; managing process address space; managing the background processes; etc.
kgxp vos inter process communication related functions
kjak kjat kjb kjbl kjbm kjbr kjcc kjcs kjctc kjcts kjcv kjdd kjdm kjdr kjdx kjfc kjfm kjfs kjfz kjg kji kjl kjm kjp kjr kjs kjt kju kjx ccl/dlm dlm related functionality ; associated with RAC or parallel server operation
kjxgf kjxgg kjxgm kjxgn kjxgna kjxgr ccl/cgs provides communication & synchronisation associated with GMS or OPS related functionality as well as name service and OPS Instance Membership Recovery Facility
kjxt ccl/dlm DLM request message management
kjzc kjzd kjzf kjzg kjzm ccl/diag support for diagnosability amongst OPS related services
kkb dict/sqlddl support for operations which load/change table definitions
kkbl kkbn kkbo objsupp/objddl support for tables with lobs , nested tables and varrays as well as columns with objects
kkdc kkdl kkdo dict/dictlkup support for constraints, dictionary lookup and dictionary support for objects
kke optim/cbo query engine cost engine; provides support functions that provide cost estimates for queries under a number of different circumstances
kkfd sqlexec/pq support for performing parallel query operation
kkfi optim/cbo optimizer support for matching of expressions against functional indexes
kkfr kkfs sqlexec/pq support for rowid range handling as well as for building parallel query operations
kkj jobqs/jobq job queue operation
kkkd kkki dict/dbsched resource manager related support. Additionally, provides underlying functions provided by dbms_resource_manager and dbms_resource_manager_privs packages
kklr dict/sqlddl provides functions used to manipulate LOGGING and/or RECOVERABLE attributes of an object (non-partitioned table or index or  partitions of a partitioned table or index)
kkm kkmi dict/dictlkup provides various semantic checking functions
kkn ram/analyze support for the analyze command
kko kkocri optim/cbo Cost based Optimizer operation : generates alternative execution plans in order to find the optimal / quickest access to the data.  Also , support to determine cost and applicability of  scanning a given index in trying to create or rebuild an index or a partition thereof
kkpam kkpap ram/partnmap support for mapping predicate keys expressions to equivalent partitions
kkpo kkpoc kkpod dict/partn support for creation and modification of partitioned objects
kkqg kkqs kkqs1 kkqs2 kkqs3 kkqu kkqv kkqw optim/vwsubq query rewrite operation
kks kksa kksh kksl kksm dict/shrdcurs support for managing shared cursors/ shared sql
kkt dict/sqlddl support for creating, altering and dropping trigger definitions as well as handling the trigger operation
kkxa repl/defrdrpc underlying support for dbms_defer_query package operations
kkxb dict/sqlddl library cache interface for external tables
kkxl dict/plsicds underlying support for the dbms_lob package
kkxm progint/opi support for inter language method services
kkxs dict/plsicds underlying support for the dbms_sys_sql package
kkxt repl/trigger support for replication internal trigger operation
kkxwtp progint/opi entry point into the plsql compiler
kky drv support for alter system/session commands
kkz kkzd kkzf kkzg kkzi kkzj kkzl kkzo kkzp kkzq kkzr kkzu kkzv repl/snapshot support for snapshots or Materialized View validation and operation
kla klc klcli klx tools/sqlldr support for direct path sql loader operation
kmc kmcp kmd kmm kmr mts/mts support for Multi Threaded Server (MTS) operation: manage and operate the virtual circuit mechanism, handle the dispatching of messages, administer shared servers, and collect and maintain statistics associated with MTS
knac knafh knaha knahc knahf knahs repl/apply replication apply operation associated with Oracle streams
kncc repl/repcache support for replication related information stored and maintained in library cache
kncd knce repl/defrdrpc replication related enqueue and dequeue of transaction data as well as other queue related operations
kncog repl/repcache support for loading replication object group information into library cache
kni repl/trigger support for replication internal trigger operation
knip knip2 knipi knipl knipr knipu knipu2 knipx repl/intpkg support for replication internal package operation.
kno repl/repobj support for replication objects
knp knpc knpcb knpcd knpqc knps repl/defrdrpc operations associated with propagating transactions to a remote node and coordination of this activity.
knst repl/stats replication statistics collection
knt kntg kntx repl/trigger support for replication internal trigger operation
koc objmgmt/objcache support for managing ADTs objects in the OOCI heap
kod objmgmt/datamgr support for persistent storage for objects : for read/write objects, to manage object IDs, and to manage object concurrency and recovery.
koh objmgmt/objcache object heap manager provides memory allocation services for objects
koi objmgmt/objmgr support for object types
koka objsupp/objdata support for reading images, inserting images, updating images, and deleting images based on object references (REFs).
kokb kokb2 objsupp/objsql support for nested table objects
kokc objmgmt/objcache support for pinning , unpinning and freeing objects
kokd objsupp/datadrv driver on the server side for managing objects
koke koke2 koki objsupp/objsql support for managing objects
kokl objsupp/objdata lob access
kokl2 objsupp/objsql lob DML and programmatic interface support
kokl3 objsupp/objdata object temporary LOB support
kokle kokm objsupp/objsql object SQL evaluation functions
kokn objsupp/objname naming support for objects
koko objsupp/objsup support functions to allow oci/rpi to communicate with Object Management Subsystem (OMS).
kokq koks koks2 koks3 koksr objsupp/objsql query optimisation for objects , semantic checking and semantic rewrite operations
kokt kokt2 kokt3 objsupp/objddl object compilation type manager
koku kokv objsupp/objsql support for unparse object operators and object view support
kol kolb kole kolf kolo objmgmt/objmgr support for object Lob buffering , object lob evaluation and object Language/runtime functions for Opaque types
kope2 kopi2 kopo kopp2 kopu koputil kopz objmgmt/pickler 8.1 engine implementation,  implementation of image ops for 8.1+ image format together with various pickler related support functions
kos objsupp/objsup object Stream interfaces for images/objects
kot kot2 kotg objmgmt/typemgr support for dynamic type operations to create, delete, and  update types.
koxs koxx objmgmt/objmgt object generic image Stream routines and miscellaneous generic object functions
kpcp kpcxlt progint/kpc Kernel programmatic connection pooling and kernel programmatic common type XLT translation routines
kpki progint/kpki kernel programmatic interface support
kpls cartserv/corecs support for string formatting operations
kpn progint/kpn support for server to server communication
kpoal8 kpoaq kpob kpodny kpodp kpods kpokgt kpolob kpolon kpon progint/kpo support for programmatic operations
kpor progint/opi support for streaming protocol used by replication
kposc progint/kpo support for scrollable cursors
kpotc progint/opi oracle side support functions for setting up trusted external procedure callbacks
kpotx kpov progint/kpo support for managing local and distributed transaction coordination.
kpp2 kpp3 sqllang/parse kpp2 – parse routines for dimensions; kpp3 – parse support for create/alter/drop summary statements
kprb kprc progint/rpi support for executing sql efficiently on the Oracle server side as well as for copying data types during rpi operations
kptsc progint/twotask callback functions provided to all streaming operation as part of replication functionality
kpu kpuc kpucp progint/kpu Oracle kernel side programmatic user interface,  cursor management functions and client side connection pooling support
kqan kqap kqas argusdbg/argusdbg server-side notifiers and callbacks for debug operations.
kql kqld kqlp dict/libcache SQL Library Cache manager – manages the sharing of sql statements in the shared pool
kqr dict/rowcache row cache management. The row cache consists of a set of facilities to provide fast access to table definitions and locking capabilities.
krbi krbx krby krcr krd krpi rcv Backup and recovery related operations: krbi – dbms_backup_restore package underlying support; krbx – proxy copy controller; krby – image copy; krcr – Recovery Controlfile Redo; krd – Recover Datafiles (Media & Standby Recovery); krpi – support for the package dbms_pitr
krvg krvt rcv/vwr krvg – support for generation of redo associated with DDL; krvt – support for redo log miner viewer (also known as log miner)
ksa ksdp ksdx kse ksfd ksfh ksfq ksfv ksi ksim ksk ksl ksm ksmd ksmg ksn ksp kspt ksq ksr kss ksst ksu ksut vos support for various kernel associated capabilities
ksx sqlexec/execsvc support for query execution associated with temporary tables
ksxa ksxp ksxr vos support for various kernel associated capabilities in relation to OPS or RAC operation
kta space/spcmgmt support for DML locks and temporary tables associated with table access
ktb ktbt ktc txn/lcltx transaction control operations at the block level : locking block, allocating space within the block , freeing up space, etc.
ktec ktef ktehw ktein ktel kteop kteu space/spcmgmt support for extent management operations: ktec – extent concurrency operations; ktef – extent format; ktehw – extent high water mark operations; ktein – extent information operations; ktel – extent support for sql loader; kteop – extent operations: add extent to segment, delete extent, resize extent, etc.; kteu – redo support for operations changing segment header / extent map
ktf txn/lcltx flashback support
ktfb ktfd ktft ktm space/spcmgmt ktfb – support for bitmapped space manipulation of files/tablespaces;  ktfd – dictionary-based extent management; ktft – support for temporary file manipulation; ktm – SMON operation
ktp ktpr ktr ktri txn/lcltx ktp – support for parallel transaction operation; ktpr – support for parallel transaction recovery; ktr – kernel transaction read consistency; ktri – support for dbms_resumable package
ktsa ktsap ktsau ktsb ktscbr ktsf ktsfx ktsi ktsm ktsp ktss ktst ktsx ktt kttm space/spcmgmt support for checking and verifying space usage
ktu ktuc ktur ktusm txn/lcltx internal management of undo and rollback segments
kwqa kwqi kwqic kwqid kwqie kwqit kwqj kwqm kwqn kwqo kwqp kwqs kwqu kwqx oltp/qs support for advanced queuing: kwqa – advanced queue administration; kwqi – support for AQ PL/SQL trusted callouts; kwqic – common AQ support functions; kwqid – AQ dequeue support; kwqie – AQ enqueue support; kwqit – time management operation; kwqj – job queue scheduler for propagation; kwqm – Multiconsumer queue IOT support; kwqn – queue notifier; kwqo – AQ support for checking instType checking options; kwqp – queueing propagation; kwqs – statistics handling; kwqu – handles lob data; kwqx – support for handling transformations
kwrc kwre oltp/re rules engine evaluation
kxcc kxcd kxcs sqllang/integ constraint processing
kxdr sqlexec/dmldrv DML driver entrypoint
kxfp kxfpb kxfq kxfr kxfx sqlexec/pq parallel query support
kxhf kxib sqlexec/execsvc kxhf – support for hash join file and memory management; kxib – index buffering operations
kxs dict/instmgmt support for executing shared cursors
kxti kxto kxtr dbproc/trigger support for trigger operation
kxtt ram/partnmap support for temporary table operations
kxwph ram/data support for managing attributes of the segment of a table / cluster / table-partition
kza security/audit support for auditing operations
kzar security/dac support for application auditing
kzck security/crypto encryption support
kzd security/dac support for dictionary access by security related functions
kzec security/dbencryption support inserting and retrieving encrypted objects into and out of the database
kzfa kzft security/audit support for fine grained auditing
kzia security/logon identification and authentication operations
kzp kzra kzrt kzs kzu kzup security/dac security related operations associated with privileges
msqima msqimb sqlexec/sqlgen support for generating sql statements
ncodef npi npil npixfr progint/npi support for managing remote network connection from  within the server itself
oba sqllang/outbufal operator buffer allocate for various types of operators : concatenate, decode, NVL, etc.  the list is extensive.
ocik progint/oci OCI oracle server functions
opiaba opidrv opidsa opidsc opidsi opiexe opifch opiino opilng opipar opipls opirip opitsk opix progint/opi OPI Oracle server functions – these are at the top of the server stack and are called indirectly by the client in order to serve the client request.
orlr objmgmt/objmgr support for C language interfaces to user-defined types (UDTs)
orp objmgmt/pickler oracle’s external pickler / opaque type interfaces
pesblt pfri pfrsqc plsql/cox pesblt – pl/sql built in interpreter; pfri – pl/sql runtime; pfrsqc – pl/sql callbacks for array sql and dml with returning
piht plsql/gen/utl support for pl/sql implementation of utl_http package
pirg plsql/cli/utl_raw support for pl/sql implementation of utl_raw package
pism plsql/cli/utl_smtp support for pl/sql implementation of utl_smtp package
pitcb plsql/cli/utl_tcp support for pl/sql implementation of utl_tcp package
piur plsql/gen/utl_url support for pl/sql implementation of utl_url package
plio plsql/pkg pl/sql object instantiation
plslm plsql/cox support for NCOMP processing
plsm pmuc pmuo pmux objmgmt/pol support for pl/sql handling of collections
prifold priold plsql/cox support to allow rpc forwarding to an older release
prm sqllang/param parameter handling associated with sql layer
prsa prsc prssz sqllang/parse prsa – parser for alter cluster command; prsc – parser for create database command; prssz – support for parse context to be saved
psdbnd psdevn progint/dbpsd psdbnd – support for managing bind variables; psdevn – support for pl/sql debugger
psdicd progint/plsicds small number of ICD to allow pl/sql to call into ‘C’ source
psdmsc psdpgi progint/dbpsd psdmsc – pl/sql system dependent miscellaneous functions ; psdpgi – support for opening and closing cursors in pl/sql
psf plsql/pls pl/sql service related functions for instantiating called pl/sql unit in library cache
qbadrv qbaopn sqllang/qrybufal provides allocation of buffer and control structures in query execution
qcdl qcdo dict/dictlkup qcdl – query compile semantic analysis; qcdo – query compile dictionary support for objects
qci dict/shrdcurs support for SQL language parser and semantic analyser
qcop qcpi qcpi3 qcpi4 qcpi5 sqllang/parse support for query compilation parse phase
qcs qcs2 qcs3 qcsji qcso dict/dictlkup support for semantic analysis by SQL compiler
qct qcto sqllang/typeconv qct – query compile type check operations; qcto –  query compile type check operators
qcu sqllang/parse various utilities provided for sql compilation
qecdrv sqllang/qryedchk driver performing high level checks on sql language query capabilities
qerae qerba qerbc qerbi qerbm qerbo qerbt qerbu qerbx qercb qercbi qerco qerdl qerep qerff qerfi qerfl qerfu qerfx qergi qergr qergs qerhc qerhj qeril qerim qerix qerjm qerjo qerle qerli qerlt qerns qeroc qeroi qerpa qerpf qerpx qerrm qerse qerso qersq qerst qertb qertq qerua qerup qerus qervw qerwn qerxt sqlexec/rowsrc row source operators: qerae – row source (And-Equal) implementation; qerba – Bitmap Index AND row source; qerbc – bitmap index compaction row source; qerbi – bitmap index creation row source; qerbm – QERB Minus row source; qerbo – Bitmap Index OR row source; qerbt – bitmap convert row source; qerbu – Bitmap Index Unlimited-OR row source; qerbx – bitmap index access row source; qercb – row source: connect by; qercbi – support for connect by; qerco – count row source; qerdl – row source delete; qerep – explosion row source; qerff – row source fifo buffer; qerfi – first row row source; qerfl – filter row source definition; qerfu – row source: for update; qerfx – fixed table row source; qergi – granule iterator row source; qergr – group by rollup row source; qergs – group by sort row source; qerhc – row sources hash clusters; qerhj – row source Hash Join; qeril – In-list row source; qerim – Index Maintenance row source; qerix – Index row source; qerjo – row source: join; qerle – linear execution row source implementation; qerli – parallel create index; qerlt – row source populate Table; qerns – group by No Sort row source; qeroc – object collection iterator row source; qeroi – extensible indexing query component; qerpa – partition row sources; qerpf – query execution row source: prefetch; qerpx – row source: parallelizer; qerrm – remote row source; qerse – row source: set implementation; qerso – sort row source; qersq – row source for sequence number; qerst – query execution row sources: statistics; qertb – table row source; qertq – table queue row source; qerua – row source: union-All; qerup – update row source; qerus – upsert row source; qervw – view row source; qerwn – WINDOW row source; qerxt – external table fetch row source
qes3t qesa qesji qesl qesmm qesmmc sqlexec/execsvc run time support for sql execution
qkacon qkadrv qkajoi qkatab qke qkk qkn qkna qkne sqlexec/rwsalloc SQL query dynamic structure allocation routines
qks3t sqlexec/execsvc query execution service associated with temp table transformation
qksmm qksmms qksop sqllang/compsvc qksmm –  memory management services for the SQL compiler; qksmms – memory management simulation services for the SQL compiler; qksop – query compilation service for operand processing
qkswc sqlexec/execsvc support for temp table transformation associated with the WITH clause.
qmf xmlsupp/util support for ftp server; implements processing of ftp commands
qmr qmrb qmrs xmlsupp/resolver support hierarchical resolver
qms xmlsupp/data support for storage and retrieval of XOBs
qmurs xmlsupp/uri support for handling URIs
qmx qmxsax xmlsupp/data qmx – xml support; qmxsax – support for handling sax processing
qmxtc xmlsupp/sqlsupp support for ddl  and other operators related to the sql XML support
qmxtgx xmlsupp support for transformation : ADT -> XML
qmxtsk xmlsupp/sqlsupp XMLType support functions
qsme summgmt/dict summary management expression processing
qsmka qsmkz dict/dictlkup qsmka – support to analyze request in order to determine whether a summary could be created that would be useful; qsmkz – support for create/alter summary semantic analysis
qsmp qsmq qsmqcsm qsmqutl summgmt/dict qsmp – summary management partition processing; qsmq – summary management dictionary access; qsmqcsm – support for create / drop / alter summary and related dimension operations; qsmqutl – support for summaries
qsms summgmt/advsvr summary management advisor
qxdid objsupp/objddl support for domain index ddl operations
qxidm objsupp/objsql support for extensible index dml operations
qxidp objsupp/objddl support for domain index ddl partition operations
qxim objsupp/objsql extensible indexing support for objects
qxitex qxopc qxope objsupp/objddl qxitex – support for create / drop indextype; qxope – execution time support for operator  callbacks; qxope – execution time support for operator DDL
qxopq qxuag qxxm objsupp/objsql qxopq – support for queries with user-defined operators; qxuag – support for user defined aggregate processing; qxxm – queries involving external tables
rfmon rfra rfrdb rfrla rfrm rfrxpt drs implements 9i data guard broker monitor
rnm dict/sqlddl manages rename statement operation
rpi progint/rpi recursive procedure interface which handles the environment setup where multiple recursive statements are executed from one top level statement
rwoima sqlexec/rwoprnds row operand operations
rwsima sqlexec/rowsrc row source implementation/retrieval according to the defining query
sdbima sqlexec/sort manages and performs sort operation
selexe sqlexec/dmldrv handles the operation of select statement execution
skgm osds platform specific memory management routines interfacing with O.S. allocation functions
smbima sor sqlexec/sort manages and performs sort operation
sqn dict/sqlddl support for parsing references to sequences
srdima srsima stsima sqlexec/sort manages and performs sort operation
tbsdrv space/spcmgmt operations for executing create / alter / drop tablespace and related supporting functions
ttcclr ttcdrv ttcdty ttcrxh ttcx2y progint/twotask two task common layer which provides high level interaction and negotiation functions for Oracle client when communicating with the server.  It also provides important function of converting client side data / data types into equivalent on the server and vice versa
uixexe ujiexe updexe upsexe sqlexec/dmldrv support for : index maintenance operations, the execution of the update statement and associated actions connected with update as well as the upsert command which combines the operations of update and insert
vop optim/vwsubq view optimisation related functionality
xct txn/lcltx support for the management of transactions and savepoint operations
xpl sqlexec/expplan support for the explain plan command
xty sqllang/typeconv type checking functions
zlke security/ols/intext label security error handling component
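
The stem lookup described above the mnemonic table can be sketched in Python as follows. This is only an illustration under stated assumptions: the `STEMS` dictionary holds a handful of rows copied from the table, the function names are hypothetical, and trying the longest prefix first is a design choice of this sketch (so that a more specific stem such as kdi would win over kd), not something the note prescribes.

```python
import re

# A few stems copied from the table above; illustrative subset only.
STEMS = {
    "kkj": ("jobqs/jobq", "job queue operation"),
    "kgh": ("vos", "heap manager"),
    "kd": ("ram/data", "row level dependency checking, log miner"),
    "kdi": ("ram/index", "index creation and index lookup"),
    "pfri": ("plsql/cox", "pl/sql runtime"),
    "plio": ("plsql/pkg", "pl/sql object instantiation"),
}


def first_argument(message):
    """Return the text inside the first [...] of an ORA-600 message."""
    match = re.search(r"\[([^\]]*)\]", message)
    return match.group(1).strip() if match else None


def lookup(message):
    """Return (stem, layer, description) for the first argument of an
    ORA-600 message, or None if no stem in STEMS matches."""
    arg = first_argument(message)
    if not arg:
        return None
    # The mnemonic is the leading run of letters and digits; purely
    # numeric arguments belong to the numeric table instead.
    token = re.match(r"[A-Za-z][A-Za-z0-9]*", arg)
    if token is None:
        return None
    name = token.group(0).lower()
    # Try the longest prefix first so the most specific stem wins.
    for length in range(len(name), 0, -1):
        stem = name[:length]
        if stem in STEMS:
            layer, description = STEMS[stem]
            return stem, layer, description
    return None


if __name__ == "__main__":
    print(lookup("ora-600[kkjsrj:1]"))
    print(lookup("ORA-00600: internal error code, arguments: "
                 "[pfri.c: pfri8: plio mismatch ], [], []"))
```

With the subset above, the kkjsrj example from the text resolves to the kkj row, and the pfri.c argument from the error at the top of this article resolves to the pfri row.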

Script to Collect RAC Diagnostic Information (racdiag.sql)

Script:

-- NAME: RACDIAG.SQL
-- SYS OR INTERNAL USER, CATPARR.SQL ALREADY RUN, PARALLEL QUERY OPTION ON
-- ------------------------------------------------------------------------
-- AUTHOR:
-- Michael Polaski - Oracle Support Services
-- Copyright 2002, Oracle Corporation
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is intended to provide a user friendly guide to troubleshoot
-- RAC hung sessions or slow performance scenarios. The script includes
-- information to gather a variety of important debug information to determine
-- the cause of a RAC session level hang. The script will create a file
-- called racdiag_<dbname>_<timestamp>.out in your local directory while dumping hang analyze
-- dumps in the user_dump_dest(s) and background_dump_dest(s) on all nodes.
--
-- ------------------------------------------------------------------------
-- DISCLAIMER:
-- This script is provided for educational purposes only. It is NOT
-- supported by Oracle World Wide Technical Support.
-- The script has been tested and appears to work as intended.
-- You should always run new scripts on a test instance initially.
-- ------------------------------------------------------------------------
-- Script output is as follows:
set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hhmi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool racdiag_&&dbname&&timestamp&&suffix
set lines 200
set pagesize 35
set trim on
set trims on
alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
alter session set timed_statistics = true;
set feedback on
select to_char(sysdate) time from dual;
set numwidth 5
column host_name format a20 tru
select inst_id, instance_name, host_name, version, status, startup_time
from gv$instance
order by inst_id;
set echo on
-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see if
-- the file is being generated.
oradebug -g all dump systemstate 267
-- WAITING SESSIONS:
-- The entries that are shown at the top are the sessions that have
-- waited the longest amount of time that are waiting for non-idle wait
-- events (event column). You can research and find out what the wait
-- event indicates (along with its parameters) by checking the Oracle
-- Server Reference Manual or look for any known issues or documentation
-- by searching Metalink for the event name in the search bar. Example
-- (include single quotes): [ 'buffer busy due to global cache' ].
-- Metalink and/or the Server Reference Manual should return some useful
-- information on each type of wait event. The inst_id column shows the
-- instance where the session resides and the SID is the unique identifier
-- for the session (gv$session). The p1, p2, and p3 columns will show
-- event specific information that may be important to debug the problem.
-- To find out what the p1, p2, and p3 indicates see the next section.
-- Items with wait_time of anything other than 0 indicate we do not know
-- how long these sessions have been waiting.
--
set numwidth 10
column state format a7 tru
column event format a25 tru
column last_sql format a40 tru
select sw.inst_id, sw.sid, sw.state, sw.event, sw.seconds_in_wait seconds,
sw.p1, sw.p2, sw.p3, sa.sql_text last_sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and sw.seconds_in_wait > 0
and (sw.inst_id = s.inst_id and sw.sid = s.sid)
and (s.inst_id = sa.inst_id and s.sql_address = sa.address)
order by seconds desc;
-- EVENT PARAMETER LOOKUP:
-- This section will give a description of the parameter names of the
-- events seen in the last section. p1text is the parameter name for
-- p1 in the WAITING SESSIONS section while p2text is the parameter
-- name for p2 and p3text is the parameter name for p3. The
-- parameter values in the first section can be helpful for debugging
-- the wait event.
--
column event format a30 tru
column p1text format a25 tru
column p2text format a25 tru
column p3text format a25 tru
select distinct event, p1text, p2text, p3text
from gv$session_wait sw
where sw.event not in ('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and seconds_in_wait > 0
order by event;
-- GES LOCK BLOCKERS:
-- This section will show us any sessions that are holding locks that
-- are blocking other users. The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session. The grant_level will show us how the GES lock is granted to
-- the user. The request_level will show us what status we are trying to
-- obtain.  The state column will show us what status the lock is in.
-- The last column shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocker = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;
-- GES LOCK WAITERS:
-- This section will show us any sessions that are waiting for locks that
-- are blocked by other users. The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session. The grant_level will show us how the GES lock is granted to
-- the user. The request_level will show us what status we are trying to
-- obtain.  The state column will show us what status the lock is in.
-- The last column shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocked = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;
-- LOCAL ENQUEUES:
-- This section will show us if there are any local enqueues. The inst_id will
-- show us the instance that the session resides on while the sid will be a
-- unique identifier for the session. The addr column will show the lock
-- address. The type will show the lock type. The id1 and id2 columns will
-- show specific parameters for the lock type.
--
set numwidth 12
column event format a12 tru
select l.inst_id, l.sid, l.addr, l.type, l.id1, l.id2,
decode(l.block,0,'blocked',1,'blocking',2,'global') block,
sw.event, sw.seconds_in_wait sec
from gv$lock l, gv$session_wait sw
where (l.sid = sw.sid and l.inst_id = sw.inst_id)
and l.block in (0,1)
order by l.type, l.inst_id, l.sid;
-- LATCH HOLDERS:
-- If there is latch contention or 'latch free' wait events in the WAITING
-- SESSIONS section we will need to find out which processes are holding
-- latches. The inst_id will show us the instance that the session resides
-- on while the sid will be a unique identifier for the session. The username
-- column will show the session's username. The os_user column will show the OS
-- user that the user logged in as. The name column will show us the type
-- of latch being waited on. You can search Metalink for the latch name in
-- the search bar. Example (include single quotes):
-- [ 'library cache' latch ]. Metalink should return some useful information
-- on the type of latch.
--
set numwidth 5
select distinct lh.inst_id, s.sid, s.username, p.username os_user, lh.name
from gv$latchholder lh, gv$session s, gv$process p
where (lh.sid = s.sid and lh.inst_id = s.inst_id)
and (s.inst_id = p.inst_id and s.paddr = p.addr)
order by lh.inst_id, s.sid;
-- LATCH STATS:
-- This view will show us latches with less than optimal hit ratios.
-- The inst_id will show us the instance for the particular latch. The
-- latch_name column will show us the type of latch. You can search Metalink
-- for the latch name in the search bar. Example (include single quotes):
-- [ 'library cache' latch ]. Metalink should return some useful information
-- on the type of latch. The hit_ratio shows the fraction of gets (0 to 1) that
-- acquired the latch without a miss.
--
column latch_name format a30 tru
select inst_id, name latch_name,
round((gets-misses)/decode(gets,0,1,gets),3) hit_ratio,
round(sleeps/decode(misses,0,1,misses),3) "SLEEPS/MISS"
from gv$latch
where round((gets-misses)/decode(gets,0,1,gets),3) < .99
and gets != 0
order by round((gets-misses)/decode(gets,0,1,gets),3);
-- No Wait Latches:
--
select inst_id, name latch_name,
round(immediate_gets/decode(immediate_gets+immediate_misses,0,1,
immediate_gets+immediate_misses),3) hit_ratio,
round(sleeps/decode(immediate_misses,0,1,immediate_misses),3) "SLEEPS/MISS"
from gv$latch
where immediate_gets + immediate_misses > 0
and round(immediate_gets/decode(immediate_gets+immediate_misses,0,1,
immediate_gets+immediate_misses),3) < .99
order by round(immediate_gets/decode(immediate_gets+immediate_misses,0,1,
immediate_gets+immediate_misses),3);
-- GLOBAL CACHE CR PERFORMANCE
-- This shows the average latency of a consistent block request.
-- AVG CR BLOCK RECEIVE TIME is the average latency of a consistent-read
-- request round-trip from the requesting instance to the holding instance
-- and back to the requesting instance. It should typically be about 15
-- milliseconds, depending on your system configuration and volume. If
-- your CPU has limited idle time and your system typically processes
-- long-running queries, then the latency may be higher. However, it is
-- possible to have an average latency of less than one millisecond with
-- User-mode IPC. Latency can be influenced by a high value for the
-- DB_FILE_MULTIBLOCK_READ_COUNT parameter. This is because a requesting
-- process can issue more than one request for a block depending on the
-- setting of this parameter. Correspondingly, the requesting process may
-- wait longer. Also check interconnect bandwidth, OS TCP settings, and OS
-- UDP settings if AVG CR BLOCK RECEIVE TIME is high.
--
set numwidth 20
column "AVG CR BLOCK RECEIVE TIME (ms)" format 9999999.9
select b1.inst_id, b2.value "GCS CR BLOCKS RECEIVED",
b1.value "GCS CR BLOCK RECEIVE TIME",
((b1.value / b2.value) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"
from gv$sysstat b1, gv$sysstat b2
where b1.name = 'global cache cr block receive time' and
b2.name = 'global cache cr blocks received' and b1.inst_id = b2.inst_id
or b1.name = 'gc cr block receive time' and
b2.name = 'gc cr blocks received' and b1.inst_id = b2.inst_id ;
-- GLOBAL CACHE LOCK PERFORMANCE
-- This shows the average global enqueue get time.
-- Typically AVG GLOBAL LOCK GET TIME should be 20-30 milliseconds. The
-- elapsed time for a get includes the allocation and initialization of a
-- new global enqueue. If the average global enqueue get (global cache
-- get time) or average global enqueue conversion times are excessive,
-- then your system may be experiencing timeouts. See the 'WAITING SESSIONS',
-- 'GES LOCK BLOCKERS', 'GES LOCK WAITERS', and 'TOP 10 WAIT EVENTS ON SYSTEM'
-- sections if the AVG GLOBAL LOCK GET TIME is high.
--
set numwidth 20
column "AVG GLOBAL LOCK GET TIME (ms)" format 9999999.9
select b1.inst_id, (b1.value + b2.value) "GLOBAL LOCK GETS",
b3.value "GLOBAL LOCK GET TIME",
(b3.value / (b1.value + b2.value) * 10) "AVG GLOBAL LOCK GET TIME (ms)"
from gv$sysstat b1, gv$sysstat b2, gv$sysstat b3
where b1.name = 'global lock sync gets' and
b2.name = 'global lock async gets' and b3.name = 'global lock get time'
and b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id
or b1.name = 'global enqueue gets sync' and
b2.name = 'global enqueue gets async' and b3.name = 'global enqueue get time'
and b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id;
-- RESOURCE USAGE
-- This section will show how much of our resources we have used.
--
set numwidth 8
select inst_id, resource_name, current_utilization, max_utilization,
initial_allocation
from gv$resource_limit
where max_utilization > 0
order by inst_id, resource_name;
-- DLM TRAFFIC INFORMATION
-- This section shows how many tickets are available in the DLM. If the
-- TCKT_WAIT column says "YES" then we have run out of DLM tickets which
-- could cause a DLM hang. Make sure that you also have enough TCKT_AVAIL.
--
set numwidth 5
select * from gv$dlm_traffic_controller
order by TCKT_AVAIL;
-- DLM MISC
--
set numwidth 10
select * from gv$dlm_misc;
-- LOCK CONVERSION DETAIL:
-- This view shows the types of lock conversion being done on each instance.
--
select * from gv$lock_activity;
-- TOP 10 WRITE PINGING/FUSION OBJECTS
-- This view shows the top 10 objects for write pings across instances.
-- The inst_id column shows the node that the block was pinged on. The name
-- column shows the object name of the offending object. The file# shows the
-- offending file number (gc_files_to_locks). The STATUS column will show the
-- current status of the pinged block. The READ_PINGS will show us read
-- converts and the WRITE_PINGS will show us objects with write converts.
-- Any rows that show up are objects that are concurrently accessed across
-- more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_writes) desc)
where rownum < 11
order by WRITE_PINGS desc;
-- TOP 10 READ PINGING/FUSION OBJECTS
-- This view shows the top 10 objects for read pings. The inst_id column shows
-- the node that the block was pinged on. The name column shows the object
-- name of the offending object. The file# shows the offending file number
-- (gc_files_to_locks). The STATUS column will show the current status of the
-- pinged block. The READ_PINGS will show us read converts and the WRITE_PINGS
-- will show us objects with write converts. Any rows that show up are objects
-- that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_reads) desc)
where rownum < 11
order by READ_PINGS desc;
-- TOP 10 FALSE PINGING OBJECTS
-- This view shows the top 10 objects for false pings. This can be avoided by
-- better gc_files_to_locks configuration. The inst_id column shows the node
-- that the block was pinged on. The name column shows the object name of the
-- offending object. The file# shows the offending file number
-- (gc_files_to_locks). The STATUS column will show the current status of the
-- pinged block. The READ_PINGS will show us read converts and the WRITE_PINGS
-- will show us objects with write converts. Any rows that show up are objects
-- that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$false_ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_writes) desc)
where rownum < 11
order by WRITE_PINGS desc;
-- INITIALIZATION PARAMETERS:
-- Non-default init parameters for each node.
--
set numwidth 5
column name format a30 tru
column value format a50 wra
column description format a60 tru
select inst_id, name, value, description
from gv$parameter
where isdefault = 'FALSE'
order by inst_id, name;
-- TOP 10 WAIT EVENTS ON SYSTEM
-- This view will provide a summary of the top wait events in the db.
--
set numwidth 10
column event format a25 tru
select inst_id, event, time_waited, total_waits, total_timeouts
from (select inst_id, event, time_waited, total_waits, total_timeouts
from gv$system_event where event not in ('rdbms ipc message','smon timer',
'pmon timer', 'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
order by time_waited desc)
where rownum < 11
order by time_waited desc;
-- SESSION/PROCESS REFERENCE:
-- This section is very important for most of the above sections to find out
-- which user/os_user/process is identified to which session/process.
--
set numwidth 7
column event format a30 tru
column program format a25 tru
column username format a15 tru
select p.inst_id, s.sid, s.serial#, p.pid, p.spid, p.program, s.username,
p.username os_user, sw.event, sw.seconds_in_wait sec
from gv$process p, gv$session s, gv$session_wait sw
where (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by p.inst_id, s.sid;
-- SYSTEM STATISTICS:
-- All System Stats with values of > 0. These can be referenced in the
-- Server Reference Manual
--
set numwidth 5
column name format a60 tru
column value format 9999999999999999999999999
select inst_id, name, value
from gv$sysstat
where value > 0
order by inst_id, name;
-- CURRENT SQL FOR WAITING SESSIONS:
-- Current SQL for any session in the WAITING SESSIONS list
--
set numwidth 5
column sql format a80 wra
select sw.inst_id, sw.sid, sw.seconds_in_wait sec, sa.sql_text sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.sid = s.sid (+)
and sw.inst_id = s.inst_id (+)
and s.inst_id = sa.inst_id
and s.sql_address = sa.address
and sw.event not in ('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and sw.seconds_in_wait > 0
order by sw.seconds_in_wait desc;
-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see
-- if the file is being generated.
oradebug -g all dump systemstate 267
set echo off
select to_char(sysdate) time from dual;
spool off
-- ---------------------------------------------------------------------------
Prompt;
Prompt racdiag output files have been written to:;
Prompt;
host pwd
Prompt alert log and trace files are located in:;
column host_name format a12 tru
column name format a20 tru
column value format a60 tru
select distinct i.host_name, p.name, p.value
from gv$instance i, gv$parameter p
where p.inst_id = i.inst_id (+)
and p.name like '%_dump_dest'
and p.name != 'core_dump_dest';
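
The averages reported by the GLOBAL CACHE CR PERFORMANCE, GLOBAL CACHE LOCK PERFORMANCE, and LATCH STATS sections are simple ratios of cumulative counters. The sketch below reproduces that arithmetic outside the database so the numbers can be checked by hand. It assumes, as the script's multiplication by 10 implies, that the time statistics are reported in centiseconds. The function names and the sample counter values are hypothetical and are not part of racdiag.sql.

```python
def avg_ms(time_cs, count):
    """Average latency in milliseconds.

    time_cs: cumulative time in centiseconds (assumed unit, see lead-in)
    count:   number of events (e.g. CR blocks received, global lock gets)
    Returns None when count is zero, instead of dividing by zero.
    """
    if count == 0:
        return None
    return time_cs * 10 / count


def latch_hit_ratio(gets, misses):
    """Fraction of latch gets that did not miss, rounded to 3 decimals.

    Mirrors round((gets-misses)/gets, 3) from the LATCH STATS query.
    Returns None when there were no gets.
    """
    if gets == 0:
        return None
    return round((gets - misses) / gets, 3)


if __name__ == "__main__":
    # Hypothetical counter values, not taken from a real system.
    print(avg_ms(1500, 10000))         # 1500 cs over 10000 blocks
    print(latch_hit_ratio(1000, 25))   # 25 misses in 1000 gets
```

Because the underlying counters are cumulative since instance startup, these averages describe the whole uptime. To judge a specific interval, take two snapshots and apply the same ratio to the differences.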

Sample Output:

TIME
--------------------
AUG-11-2001 12:06:36
1 row selected.
INST_ID INSTANCE_NAME    HOST_NAME            VERSION        STATUS  STARTUP_TIME
------- ---------------- -------------------- -------------- ------- ------------
      1 V9201            opcbsol1             9.2.0.1.0      OPEN    AUG-01-2002
      2 V9202            opcbsol2             9.2.0.1.0      OPEN    JUL-09-2002
2 rows selected.
SQL>
SQL> -- Taking Hanganalyze Dumps
SQL> -- This may take a little while...
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug setinst all
Statement processed.
SQL> oradebug -g def hanganalyze 3
Hang Analysis in /u02/32bit/app/oracle/admin/V9232/bdump/v92321_diag_29495.trc
SQL>
SQL> -- WAITING SESSIONS:
SQL> -- The entries that are shown at the top are the sessions that have
SQL> -- waited the longest amount of time that are waiting for non-idle wait
SQL> -- events (event column).  You can research and find out what the wait
SQL> -- event indicates (along with its parameters) by checking the Oracle
SQL> -- Server Reference Manual or look for any known issues or documentation
SQL> -- by searching Metalink for the event name in the search bar.  Example
SQL> -- (include single quotes): [ 'buffer busy due to global cache' ].
SQL> -- Metalink and/or the Server Reference Manual should return some useful
SQL> -- information on each type of wait event.  The inst_id column shows the
SQL> -- instance where the session resides and the SID is the unique identifier
SQL> -- for the session (gv$session).  The p1, p2, and p3 columns will show
SQL> -- event specific information that may be important to debug the problem.
SQL> -- To find out what the p1, p2, and p3 indicates see the next section.
SQL> -- Items with wait_time of anything other than 0 indicate we do not know
SQL> -- how long these sessions have been waiting.
SQL> --
