Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例

一套Linux x86-64上的11.2.0.1 4节点RAC系统中LMS GCS服务进程遭遇到内部错误ORA-00600[kjbmprlst:shadow],导致节点实例意外终止,具体日志如下:

Fri Jul 08 02:04:43 2011
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc  (incident=1011732):
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/PROD/PROD1/incident/incdir_1011732/PROD1_lms1_536_i1011732.trc
Fri Jul 08 02:04:44 2011
Trace dumping is performing id=[cdmp_20110708020444]
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 536): terminating the instance due to error 484
Fri Jul 08 02:04:45 2011
opiodr aborting process unknown ospid (27387) as a result of ORA-1092
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_diag_513.trc
Fri Jul 08 02:04:54 2011
Termination issued to instance processes. Waiting for the processes to exit
Fri Jul 08 02:04:58 2011
ORA-1092 : opitsk aborting process

该ORA-00600[kjbmprlst:shadow]错误定位为11.2.0.1上的Bug 10121589或Bug 9458781:

Bug 10121589  ORA-600 [kjbmprlst:shadow] can occur in RAC
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release)
        11.2.0.2 Bundle Patch 2 for Exadata Database
        11.2.0.1 Bundle Patch 7 for Exadata Database 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow] 

    RAC (Real Application Clusters) / OPS 

Description

    An ORA-600 [kjbmprlst:shadow] can occur if the fix for bug 9979039
    is present.

    Note:
     One off patches for 10200390 should also include this fix.

Bug 9458781  Missing close message to master leaves closed lock dangling crashing the instance with assorted Internal error

Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions >= 11.2.0.1 but BELOW 11.2.0.2
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        11.2.0.2 (Server Patch Set)
        11.2.0.1 Bundle Patch 4 for Exadata Database 

Symptoms:

Related To:

    Instance May Crash
    Internal Error May Occur (ORA-600)
    ORA-600 [KJBMPRLST:SHADOW]
    ORA-600 [KJBMOCVT:RID]
    ORA-600 [KJBRREF:PKEY]
    ORA-600 [KJBRASR:PKEY] 

    RAC (Real Application Clusters) / OPS 

Description

    A lock is closed without sending a message to the master.
    This causes closed lock dangling at the master crashing the instance with different internal errors.

    Reported internal errors so far are :
    - KJBMPRLST:SHADOW
    - KJBMOCVT:RID
    - KJBRREF:PKEY
    - KJBRASR:PKEY

该kjbmprlst:shadow内部函数用以管理kjbm shadow锁(/libserver10.a/kjbm.o )信息,存在某个已关闭的lock没有及时message给master node的代码漏洞,目前除了安装补丁外没有已验证的workaround办法(disable drm似乎是无效的):

oradebug lkdebug (track resources, take dumps)
KCL history
KJBL history
KJL history

PCM (GCS) and non-PCM (GES) resources are kept separate and use separate code paths.
GES:
Resource table: kjr and kjrt
Lock table: kjlt
Processes: kjpt
GCS:
Resource table: kjbr
Lock table: kjbl

DLM Structures (continued)
/* PCM resource structure */
typedef struct kjbr {                                /* 68 bytes on sun4u */
  kjsolk       hash_q_kjbr;                             /* hash list : hp */
  ub4          resname_kjbr[2];	                     /* the resource name */
  kjsolk       scan_q_kjbr; /* chain to lmd scan q of grantable resources */
  kjsolk       grant_q_kjbr;                 /* list of granted resources */
  kjsolk       convert_q_kjbr;       /* list of resources being converted */
  ub4          diskscn_bas_kjbr;         /* scn(base) known to be on disk */
  ub2          diskscn_wrap_kjbr;        /* scn(wrap) known to be on disk */
  ub2          writereqscn_wrap_kjbr;    /* scn(wrap) requested for write */
  ub4          writereqscn_bas_kjbr;     /* scn(base) requested for write */
  struct kjbl *sender_kjbr;                 /* lock elected to send block */
  ub2          senderver_kjbr;                  /* version# of above lock */
  ub2          writerver_kjbr;                  /* version# of lock below */
  struct kjbl *writer_kjbr;                /* lock elected to write block */
  ub1          mode_role_kjbr; /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1          flags_kjbr;                        /* ignorewip, free etc. */
  ub1          rfpcount_kjbr;                      /* refuse ping counter */
  ub1          history_kjbr;                /* resource operation history */
  kxid         xid_kjbr;                          /* split transaction ID */
} kjbr ;

/* kjbl - PCM lock structure
** Clients and most of the DLM will use the KJUSER* or KJ_* modes and kscns  */

typedef struct kjbl {                                /* 52 bytes on sun4u */
  union {                     /* discriminate lock@master and lock@client */
    struct {                                           /* for lock@master */
      kgglk        state_q_kjbl;             /* link to chain to resource */
      kjbopqi     *rqinfo_kjbl;                             /* target bid */
      struct kjbr *resp_kjbl;                   /* pointer to my resource */
    } kjbllam;                                 /* KJB Lock Lock At Master */
    struct {                                           /* for lock@client */
      ub4         disk_base_kjbl;        /* disk version(base) for replay */
      ub2         disk_wrap_kjbl;        /* disk version(wrap) for replay */
      ub1         master_node_kjbl;                   /* master instance# */
      ub1         client_flag_kjbl;     /* flags specific to client locks */
      ub2         update_seq_kjbl;               /* last update to master */
    } kjbllac;                                 /* KJB Lock Lock At Client */
  } kjblmcd;                        /* KJB Lock Master Client Discrimnant */
  void  *remote_lockp_kjbl;           /* pointer to client lock or shadow */
  ub2    remote_ver_kjbl;                         /* remote lock version# */
  ub2        ver_kjbl;                                     /* my version# */
  ub2        msg_seq_kjbl;                         /* client->master seq# */
  ub2        reqid_kjbl;                         /* requestid for convert */
  ub2        creqid_kjbl; /* requestid for convert that has been cancelled */
  ub2        pi_wrap_kjbl;                     /* scn(wrap) of highest pi */
  ub4        pi_base_kjbl;                     /* scn(base) of highest pi */
  ub1        mode_role_kjbl; /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1        state_kjbl;       /* _L|_R|_W|_S, notify, which q, lock type */
  ub1        node_kjbl;                       /* instance lock belongs to */
  ub1        flags_kjbl;                                /* lock flag bits */
  ub2        rreqid_kjbl;                               /* save the reqid */
  ub2         write_wrap_kjbl;        /* last write request version(wrap) */
  ub4         write_base_kjbl;        /* last write request version(base) */
  ub4         history_kjbl;                     /* lock operation history */
} kjbl;

PCM DLM locks that are owned by the local instance are allocated and embedded in an LE structure.
PCM DLM locks that are owned by remote instances and mastered by the local instance are allocated in SHARED_POOL.

PCM Locks and Resources
Fields of interest in the kclle structure: kcllerls or releasing; kcllelnm or name(id1,id2);
kcllemode or held-mode; kclleacq or acquiring; kcllelck or DLM lock.

Fields of interest in the kjbr structure: resname_kjbr[2] or resource name; grant_q_kjbr or grant queue;
convert_q_kjbr or convert queue; mode_role_kjbr, which is a bitwise merge of grant mode and
role-interpreted NULL(0x00), S(0x01), X(0x02), L0 Local (0x00), G0 Global without PI (0x08), G1 Global with PI (0x018).

The field mode_role_kjbl in kjbl is a bitwise merge of grant, request, and lock mode: 0x00 if grant NULL;
0x01 if grant S; 0x02 if grant X; 0x04 lock has been opened at master; 0x08 if global role (otherwise local);
0x10 has one or more PI; 0x20 if request CR; 0x40 if request S; 0x80 if request X.

Someone has to keep a list of all buffers and where they are mastered
This is called Global Resource Directory (GRD)
GRD is present on all the instances of the cluster
To find out the master:
select  b.dbablk, r.kjblmaster master_node
from x$le l, x$kjbl r, x$bh b
where b.obj =
and b.le_addr = l.le_addr
and l.le_kjbl = r.kjbllockp

Oracle Support宣称可以通过11.2.0.2 (Server Patch Set)11.2.0.1 Bundle Patch 4 for Exadata Database修复该bug,但是有迹象表明在11.2.0.2上仍可能发生该ORA-00600[kjbmprlst:shadow]内部错误,同时该bug更多地发生在超过2个节点的RAC系统中。

 


Posted

in

by

Tags:

Comments

2 responses to “Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例”

  1. admin Avatar
    admin
    Hdr: 11672191 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 ORA-600 10399109
    Abstract: ORA-600 [KJBMPRLST:SHADOW] ON LMS1 WITH BP7 INSTALLED
    
    PROBLEM:
    --------
    Database instance crashed after process lms1 generates 
    ORA-600:[kjbmprlst:shadow] error
    
    DIAGNOSTIC ANALYSIS:
    --------------------
    Dump continued from file: 
    /u01/app/oracle/diag/rdbms/edw/edw4/trace/edw4_lms1_13021.trc
    ORA-600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], 
    [], [], [], [], [], [], [], []
    
    ========= Dump for incident 426183 (ORA-600 [kjbmprlst:shadow]) ========
    ----- Beginning of Customized Incident Dump(s) -----
    FUSION MSG 0x2ab9365238b0,39 from 1 spnum 14 ver[26,29286] ln 136 sq[1,8]
    REPLAY 1 [0x77e5cf.2a, 1812994.0] c[0x23bfc8a90,2315] [0x1,x941]
    grant 1 convert 0 role x0
    pi [0x0.0x0] flags 0x0 state 0x100
    disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
    msgRM# 29286 bkt# 7353 drmbkt# 7353
    pkey 1812994.0 undo 0 stat 5 masters[32768, 1->32768] reminc 26 RM# 29174
    flg x4 type x0 afftime x3dc7b931
    nreplays by lms 0 = 3780
    nreplays by lms 1 = 3786
    hv 15 [stat 0x0, 4->4, wm 32768, RMno 0, reminc 26, dom 0]
    kjga st 0x4, step 0.36.0, cinc 26, rmno 29286, flags 0x20
    lb 4096, hb 8191, myb 7353, drmb 7353, apifrz 1
    FUSION MSG DUMP END
    GCS RESOURCE 0x42856ee70 hashq [0x34cab79a0,0x459eeb7b0] name[0x77e5cf.2a] 
    pkey 1742728.0
    grant 0x4249fb430 cvt (nil) send (nil),0 write (nil),0@65536
    flag 0x2 mdrole 0x1 mode 1 scan 0 role LOCAL
    disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
    xid 0x0000.000.00000000 sid 1 pkwait 0s rmacks 0
    refpcnt 0 weak: 0x0000.00000000
    pkey 1742728.0
    hv 15 [stat 0x0, 4->4, wm 32768, RMno 0, reminc 26, dom 0]
    kjga st 0x4, step 0.36.0, cinc 26, rmno 29286, flags 0x20
    lb 4096, hb 8191, myb 7353, drmb 7353, apifrz 1
    GCS SHADOW 0x4249fb430,1946 resp[0x42856ee70,0x77e5cf.2a] pkey 1742728.0
    grant 1 cvt 0 mdrole 0x1 st 0x100 lst 0x40 GRANTQ rl LOCAL
    master 4 owner 1 sid 1 remote[0x115fdddd8,1882] hist 0xc02186010c30086
    history 0x6.0x1.0xc.0x6.0x1.0xc.0x6.0x1.0xc.0x0.
    cflag 0x0 sender 0 flags 0x0 replay# 0 abast (nil).x0.1
    disk: 0x0000.00000000 write request: 0x0000.00000000
    pi scn: 0x0000.00000000 sq[0x42856ee98,0x42856ee98]
    msgseq 0x1 updseq 0x0 reqids[2142,0,0] infop (nil)
    GCS SHADOW END
    GCS RESOURCE END
    ----- End of Customized Incident Dump(s) -----
    
    
    This matches Bug:10230023: LMS HIT ORA-600 [KJBMPRLST:SHADOW] AND INSTANCE 
    CRASHED.
    Which was closed as a duplicate of Bug:10121589 - ORA-600 [kjbmprlst:shadow] 
    can occur in RAC (Doc ID 10121589.8)
    
    But fix for bug:10121589 is included on bundle patch 7 (BP7).
    
    WORKAROUND:
    -----------
    n/a
    
    RELATED BUGS:
    -------------
    Bug:10121589
    Bug:10200390
    
    REPRODUCIBILITY:
    ----------------
    It has reproduced just once at customer's site
    
    TEST CASE:
    ----------
    n/a 
    
    STACK TRACE:
    ------------
    Call stack is:
    
    ----- Incident Context Dump -----
    Address: 0x7fff4cb08928
    Incident ID: 426183
    Problem Key: ORA-600 [kjbmprlst:shadow]
    Error: ORA-600 [kjbmprlst:shadow] [] [] [] [] [] [] [] [] [] [] []
    [00]: dbgexExplicitEndInc [diag_dde]
    [01]: dbgeEndDDEInvocationImpl [diag_dde]
    [02]: dbgeEndDDEInvocation [diag_dde]
    [03]: kjbmprlst [GCS]<-- Signaling
    [04]: kjmxmpm [RAC_MLMDS]
    [05]: kjmpbmsg [RAC_MLMDS]
    [06]: kjmsm [RAC_MLMDS]
    [07]: ksbrdp [background_proc]
    [08]: opirip []
    [09]: opidrv []
    [10]: sou2o []
    [11]: opimai_real []
    [12]: ssthrdmain []
    [13]: main []
    [14]: __libc_start_main []
    [15]: _start []
    MD [00]: 'SID'='2717.1' (0x3)
    MD [01]: 'ProcId'='14.1' (0x3)
    MD [02]: 'Client 
    ProcId'='[email protected]_46974956490336' (0x0)
    Impact 0:
    Impact 1:
    Impact 2:
    Impact 3:
    
    Hdr: 11802096 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 ORA-600 10200390
    Abstract: LMS0 FAILED WITH ORA-600[KJBMPRLST:SHADOW]
    
    PROBLEM:
    --------
    2 node RAC (DDB011, DDB012)
    
    LMS0 process failed with ORA-600[kjbmprlst:shadow] on node1
    at Feb 23 02:58. Then the instance terminated.
    
    DIAGNOSTIC ANALYSIS:
    --------------------
    I'm not sure if the customer met one of the followings.
    bug 10121589 kcl_reconfig_remaster_chk is called w/ null key
    bug 9458781 lnx64-11202-rac: lms hit ORA-600[kjbmocvt:rid], inst crashed
    
    - alert_DDB011.log
    Wed Feb 23 02:58:01 2011
    Errors in file ..../DDB011_lms0_17048.trc  (incident=94642):
    ORA-600:  [kjbmprlst:shadow], [], [], [], 
    [], [], []], [], [] [[
    Incident details in: ..../DDB011_lms0_17048_i94642.trc
    Errors in file ..../DDB011_lms0_17048.trc:
    ORA-600:  [kjbmprlst:shadow], [], [], [], 
    [], [], []], [], [] [[
    Wed Feb 23 02:58:02 2011
    Trace dumping is performing id=[cdmp_20110223025802]
    Wed Feb 23 02:58:03 2011
    Sweep [inc][94642]: completed
    Sweep [inc2][94642]: completed
    Errors in file ..../DDB011_lms0_17048.trc:
    ORA-600:  [kjbmprlst:shadow], [], [], [], 
    [], [], []], [], [] [[
    LMS0 (ospid: 17048): terminating the instance due to error 484
    System state dump is made for local instance
    System State dumped to trace file ..../DDB011_diag_17016.trc
    Wed Fed Feb 23 02:58:06 2011
    ORA-1092 : opitsk aborting process
    Wed Feb 23 02:58:06 2011
    License high water mark = 419
    Errors in file ..../DDB011_diag_17016.trc:
    ORA-601: ??????????????????????
    Trace dumping is performing id=[cdmp_20110223025805]
    Instance terminated by LMS0, pid = 17048
    
    WORKAROUND:
    -----------
    none
    
    RELATED BUGS:
    -------------
    - similar bug reports
    bug 10121589 kcl_reconfig_remaster_chk is called w/ null key
    bug 9458781 lnx64-11202-rac: lms hit ORA-600[kjbmocvt:rid], inst crashed
    
    REPRODUCIBILITY:
    ----------------
    It occurred on the customer's box once.
    
    TEST CASE:
    ----------
    none
    
    STACK TRACE:
    ------------
    - stack trace of ORA-600[kjbmprlst:shadow]
    dbgeEndDDEInvocationImpl dbgeEndDDEInvocation kjbmprlst kjmxmpm kjmpbmsg
    kjmsm ksbrdp opirip opidrv sou2o opimai_real ssthrdmain main
    
    Hdr: 11780852 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 ORA-600 10247745
    Abstract: INSTANCES RESTARTED WITH ORA-600 ERROR - KJBMPRLST:SHADOW
    
      BUG TYPE CHOSEN
      ===============
      Code
      
      SubComponent: Real Application Clusters
      =======================================
      DETAILED PROBLEM DESCRIPTION
      ============================
      3 Node RAC Cluster, ASM Storage
      
      At least 4 distinct occurrences of the LMS related instance failure, the
      latest occurrence resulting in 2 of the 3 instances failing.
      
      alert_prdadx1.logd:Thu Feb 03 01:47:06 2011 ORA-600: internal error code,
      arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
      
      alert_prdadx1.logd:Mon Feb 14 15:42:14 2011 ORA-600: internal error code,
      arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
      
      alert_prdadx3.logd:Fri Jan 28 00:18:55 2011 ORA-600: internal error code,
      arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
      
      alert_prdadx3.logd:Mon Feb 14 15:42:14 2011 ORA-600: internal error code,
      arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
      
      DIAGNOSTIC ANALYSIS
      ===================
      Trace file /oracle/diag/rdbms/prdadx/prdadx3/trace/prdadx3_lms0_16071.trc
      Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
      With the Partitioning, Real Application Clusters, OLAP, Data Mining
      and Real Application Testing options
      ORACLE_HOME = /oracle/product/11.2
      System name:    Linux
      Node name:      sp2-adxdb-003.adx.sp2.yahoo.com
      Release:        2.6.18-128.el5
      Version:        #1 SMP Wed Dec 17 11:41:38 EST 2008
      Machine:        x86_64
      Instance name: prdadx3
      Redo thread mounted by this instance: 0 
      Oracle process number: 13
      Unix process pid: 16071, image: [email protected]
      (LMS0)
      ..
      .
      
      *** 2011-02-14 15:42:13.702
      DRM(13906) win(6) lms 0 finished replaying gcs resources
      lms 0 finished fixing gcs write protocol
      DRM(13906) win(7) lms 0 finished replaying gcs resources
      lms 0 finished fixing gcs write protocol
      DRM(13906) win(8) lms 0 finished replaying gcs resources
      lms 0 finished fixing gcs write protocol
      DRM(13907) win(1) lms 0 finished replaying gcs resources
      lms 0 finished fixing gcs write protocol
      DRM(13907) win(2) lms 0 finished replaying gcs resources
      Incident 96105 created, dump file:
      /oracle/diag/rdbms/prdadx/prdadx3/incident/incdir_96105/prdadx3_lms0_16071_i9
      6105.trc
      ORA-600: internal error code, arguments: [kjbmprlst:shadow], [], [], [],
      [], [], [], [], [], [], [], []
      
      2011-02-14 15:42:16.871280 : kjbmprlst: internal error on replaying
      0x169ff.a from 2
      2011-02-14 15:42:16.871312 : kjbmbassert [0x169ff.a]
      2011-02-14 15:42:16.871323 : kjbmsassert(0x169ff.a)(1)
      2011-02-14 15:42:16.871364 : kjbmsassert(0x169ff.a)(2)
      kjmpbmsg fatal error on 39
      
      MSG [39:KJX_B_REPLAY] R6i inc=26 len=136 sender=(2,2) seq=0
          fg=ip stat=KJUSERSTAT_DONE spnum=13 flg=x25
         flow ctrl: ver=0 flag=169 len=136 tkts=0 seq=0 wrp=0
                    sndr=2 dest=3 rcvr=1
        FUSION MSG 0x2ae41dca2900,39 from 2 spnum 13 ver[26,13906] ln 136 sq[2,8]
               REPLAY 1 [0x169ff.a, 638278.0] c[0x777fcaa30,20196] [0x1,x68]
               grant 1 convert 0 role x0
               pi [0x0.0x0] flags 0x0 state 0x100
               disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
               msgRM# 13906 bkt# 11583 drmbkt# 11583
           pkey 638278.0 undo 0 stat 5 masters[32768, 2->32768] reminc 26 RM#
      13906
      flg x6 type x0 afftime x98e8d30d
          hv 115 [stat 0x0, 3->3, wm 32768, RMno 0, reminc 22, dom 0]
          kjga st 0x4, step 0.36.0, cinc 26, rmno 13906, flags 0x20
          lb 8192, hb 16383, myb 11583, drmb 11583, apifrz 1
          kjga st 0x4, step 0.36.0, cinc 26, rmno 13906, flags 0x20
          lb 8192, hb 16383, myb 11583, drmb 11583, apifrz 1
        FUSION MSG DUMP END
      ..
      .
      MSG [39:KJX_B_REPLAY] R6i inc=26 len=136 sender=(2,2) seq=0
          fg=ip stat=KJUSERSTAT_DONE spnum=13 flg=x25
        FUSION MSG 0x2ae41dca2900,39 from 2 spnum 13 ver[26,13906] ln 136 sq[2,8]
               REPLAY 1 [0x169ff.a, 638278.0] c[0x777fcaa30,20196] [0x1,x68]
               grant 1 convert 0 role x0
               pi [0x0.0x0] flags 0x0 state 0x100
               disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
               msgRM# 13906 bkt# 11583 drmbkt# 11583
           pkey 638278.0 undo 0 stat 5 masters[32768, 2->32768] reminc 26 RM#
      13906
      flg x6 type x0 afftime x98e8d30d
          hv 115 [stat 0x0, 3->3, wm 32768, RMno 0, reminc 22, dom 0]
          kjga st 0x4, step 0.36.0, cinc 26, rmno 13906, flags 0x20
          lb 8192, hb 16383, myb 11583, drmb 11583, apifrz 1
        FUSION MSG DUMP END
      ..
      .
      *** 2011-02-14 15:42:16.876
      dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3,
      mask=0x0)
      ----- Error Stack Dump -----
      ORA-600: internal error code, arguments: [kjbmprlst:shadow], [], [], [],
      [], [], [], [], [], [], [], []
      ----- SQL Statement (None) -----
      Current SQL information unavailable - no cursor.
      
      ----- Call Stack Trace -----
      skdstdst <- ksedst1 <- ksedst <- dbkedDefDump <- ksedmp
             <- kjmpbmsg <- kjmsm <- ksbrdp <- opirip <- opidrv
              <- sou2o <- opimai_real <- ssthrdmain <- main <- libc_start_main
               <- start
      
      WORKAROUND?
      ===========
      No
      
      TECHNICAL IMPACT
      ================
      Multiple Instance Failures
      
      RELATED ISSUES (bugs, forums, RFAs)
      ===================================
      10415076, 10126085 , 10126084 , 10121589 , 10117402 , 10210683 , 9915080 ,
      10039421 , 10128180 , 10201917 , 10200283 , 10200390 , 10229587 , 10384635 ,
      11777089
      
      HOW OFTEN DOES THE ISSUE REPRODUCE AT CUSTOMER SITE?
      ====================================================
      Intermittent
      
      DOES THE ISSUE REPRODUCE INTERNALLY?
      ====================================
      Not attempted
      
      EXPLAIN WHY THE ISSUE WAS NOT TESTED INTERNALLY.
      ================================================
      Unable to reproduce the environment and the scenario resulting in the LMS
      failure
      
      IS A TESTCASE AVAILABLE?
      ========================
      No
      
      Link to IPS Package:
      ====================
      not available
      
    *** 02/16/11 01:19 pm *** (CHG: Sta->16)
    *** 02/16/11 01:19 pm ***
    Invoking OPatch 11.1.0.6.6
    
    Oracle Interim Patch Installer version 11.1.0.6.6
    Copyright (c) 2009, Oracle Corporation.  All rights reserved.
    
    
    Oracle Home       : /oracle/product/11.2
    Central Inventory : /oracle/oraInventory
       from           : /etc/oraInst.loc
    OPatch version    : 11.1.0.6.6
    OUI version       : 11.2.0.1.0
    OUI location      : /oracle/product/11.2/oui
    Log file location : 
    /oracle/product/11.2/cfgtoollogs/opatch/opatch2011-02-16_11-47-09AM.log
    
    Patch history file: 
    /oracle/product/11.2/cfgtoollogs/opatch/opatch_history.txt
    
    Lsinventory Output file location : 
    /oracle/product/11.2/cfgtoollogs/opatch/lsinv/lsinventory2011-02-16_11-47-09AM
    .txt
    
    ------------------------------------------------------------------------------
    --
    Installed Top-level Products (1): 
    
    Oracle Database 11g                                                  
    11.2.0.1.0
    There are 1 products installed in this Oracle Home.
    
    
    Interim patches (2) :
    
    Patch  9352237      : applied on Thu Aug 05 18:35:52 PDT 2010
    Unique Patch ID:  12677125
       Created on 25 Mar 2010, 00:05:17 hrs PST8PDT
       Bugs fixed:
         8661168, 8769239, 8898852, 8801119, 9054253, 8706590, 8725286, 8974548
         8778277, 8780372, 8769569, 9027691, 9454036, 9454037, 9454038, 8761974
         7705591, 8496830, 8702892, 8639114, 8723477, 8729793, 8919682, 8818983
         9001453, 8475069, 9328668, 8891929, 8798317, 8820324, 8733749, 8702535
         8565708, 9036013, 8735201, 8684517, 8870559, 8773383, 8933870, 8812705
         8405205, 8822365, 8813366, 8761260, 8790767, 8795418, 8913269, 8897784
         8760714, 8717461, 8671349, 8775569, 8898589, 8861700, 8607693, 8642202
         8780281, 9369797, 8780711, 8784929, 8834636, 9015983, 8891037, 8828328
         8570322, 8832205, 8665189, 8717031, 8685253, 8718952, 8799099, 8633358
         9032717, 9321701, 8588519, 8783738, 8796511, 8782971, 8756598, 9454385
         8856497, 8703064, 9066116, 9007102, 8721315, 8818175, 8674263, 9352237
         8753903, 8720447, 9057443, 8790561, 8733225, 9197917, 8928276, 8991997, 
    8837736
    
    Patch  9778840      : applied on Thu Aug 05 18:33:09 PDT 2010
    Unique Patch ID:  12677125
       Created on 8 Jun 2010, 22:14:20 hrs PST8PDT
       Bugs fixed:
         9778840, 9783609
    
    
    
    Rac system comprising of multiple nodes
      Local node = sp2-adxdb-001
      Remote node = sp2-adxdb-002
      Remote node = sp2-adxdb-003
    
    ------------------------------------------------------------------------------
    SUPPORTING INFORMATION:
    -----------------------
    no patches are applied on DB and GI home.
    
    Hdr: 12351439 11.2.0.1.0 RDBMS 11.2.0.1.0 RAC PRODID-5 PORTID-226 ORA-600
    Abstract: RAC INSTANCE CRASHED WITH ORA-600 [KJBMPRLST:SHADOW]
    
    *** 04/12/11 07:48 am ***
    
    PROBLEM:
    --------
    One of the instances in this RAC crashed with ORA-600: [kjbmprlst:shadow]
    
    DIAGNOSTIC ANALYSIS:
    --------------------
    It seems like 9979039 is a prime candidate as the source of the problem.
    
    But due to the fact that the stack trace in this case doesn't match, we 
    are filing this bug to get confirmation.
    
    WORKAROUND:
    -----------
    None
    
    RELATED BUGS:
    -------------
    9979039
    10121589
    
    REPRODUCIBILITY:
    ----------------
    Random at customer site
    
    TEST CASE:
    ----------
    Comcast has attempted to upload the ISP files several times, but what is
    available is not complete. 
    
    STACK TRACE:
    ------------
    Short stack dump: 
    ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-se
    0<-ksipgetctx()+1413<-ksqcmi()+20639<-ksqgtlctx()+3401<-ksqgelctx()+552<-ktcwi
    t1()+323<-kdddgb()+7099<-kdusru()+494<-kauupd()+406<-updrow()+1780<-qerupFetch
    ()+800<-updaul()+1179<-updThreePhaseExe()+328<-updexe()+663<-opiexe()+14377<-k
    poal8()+2275<-opiodr()+1149<-kpoodrc()+36<-rpiswu2()+1541<-kpoodr()+617<-upirt
    rc()+2357<-kpurcsc()+98<-kpuexec()+9784<-OCIStmtExecute()+39<-kewrose_oci_stmt
    _exec()+55<-kewrgwxf1_gwrsql_exft_1()+317<-kewrgwxf_gwrsql_exft()+469<-kewrfst
    xi_flush_sqltext_i()+557<-kewrfstx_flush_sqltext()+762<-kehTaskInsObject()+669
    <-kehWriteRec()+508<-kehWriteFdg()+518<-kehWriteResults()+353<-kehmain()+986<-
    kehExecute()+11<-spefcmpa()+127<-spefmccallstd()+235<-pextproc()+41<-__PGOSF47
    4_peftrusted()+150<-__PGOSF499_psdexsp()+255<-rpiswu2()+1541<-psdextp()+712<-p
    efccal()+726<-pefcal()+224<-pevm_FCAL()+162<-pfrinstr_FCAL()+75<-pfrrun_no_too
    l()+63<-pfrrun()+1030<-plsql_run()+774<-peicnt()+301<-kkxexe()+525<-opiexe()+1
    4801<-opiall0()+1317<-opikpr()+541<-opiodr()+1149<-__PGOSF141_rpidrus()+211<-s
    kgmstack()+148<-rpiswu2()+617<-kprball()+1190<-keasqlExecute()+103<-kehCreateE
    xecTask()+203<-keh_slave_main_cb()+136<-kebm_slave_main()+565<-ksvrdp()+1737<-
    opirip()+735<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-
    main()+201<-__libc_start_main()+244<-_start()+36
    
    Hdr: 12574449 11.2.0.2.0 RDBMS 11.2.0.2.0 RAC PRODID-5 PORTID-226
    Abstract: ORA-600 [KJBRASR:PKEY] & ORA-600 [KJBMPRLST:SHADOW] WITH PATCH 9458781 INSTALLED
    
    PROBLEM:
    --------
    This is a 28 nodes cluster running 11.2.0.2 on linux 5.5 64 bit
    
    which has being happening as early as April  2011.
    
    on April 18 customer applied patch 10121589 (in merge Patch  10387939) to fix 
    the error but still the customer has been hitting this. Recent occurance is 
    on
    April 17th.
    
    
    This to confirm patch 10121589 is installed.
    
     Patch  10387939     : applied on Fri Feb 18 21:59:02 PST 2011  include the 
    fix of 10121589
    
    Unique Patch ID:  13350217
      Created on 30 Dec 2010, 22:55:01 hrs PST8PDT
      Bugs fixed:
        10158965, 9940990, 10190642, 10031806, 10228635, 10018789, 9744252
        10010252, 9956713, 10204358, 9715581, 9770451, 10094635, 10121589 
    <-------------------------------------------------This one
    
    DIAGNOSTIC ANALYSIS:
    --------------------
    I am not quite sure to file two bugs for both of the different argument (they 
    happened on separate time) but the point to the same issue.
    
    The GCS RESOURCE  is
    ==========================
    0x6.0x1.0xc.0x6.0x1.0xc.0x6.0x1.0xc.0x0
    
    
    
    Dump continued from 
    ----- Beginning of Customized Incident Dump(s) -----
    GCS RESOURCE 0x992c92238 hashq [0x9c6ee0940,0x9a4ec5f48] name[0x3f011c6.12f] 
    pkey 4522197.0
      grant 0x99d8f18a0 cvt (nil) send (nil)@0,0 write (nil),0@65536
      flag 0x2 mdrole 0x1 mode 1 scan 0.0 role LOCAL
      disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
      xid 0x0000.000.00000000 sid 1 pkwait 0s rmacks 0
      refpcnt 0 weak: 0x0000.00000000
      pkey 4522197.0
      hv 64 [stat 0x0, 5->5, wm 32768, RMno 0, reminc 32, dom 0]
      kjga st 0x4, step 0.36.0, cinc 54, rmno 334, flags 0x20
      lb 0, hb 32767, myb 8760, drmb 8760, apifrz 1
      GCS SHADOW 0x99d8f18a0,102 resp[0x992c92238,0x3f011c6.12f] pkey 4522197.0
        grant 1 cvt 0 mdrole 0x1 st 0x100 lst 0x40 GRANTQ rl LOCAL
        master 5 owner 1 sid 1 remote[0x5cff7a570,37] hist 0xc02186010c30086
        history 0x6.0x1.0xc.0x6.0x1.0xc.0x6.0x1.0xc.0x0.
        cflag 0x0 sender 0 flags 0x0 replay# 0 abast (nil).x0.1 dbmap (nil)
        disk: 0x0000.00000000 write request: 0x0000.00000000
        pi scn: 0x0000.00000000 sq[0x992c92268,0x992c92268]
        msgseq 0x1 updseq 0x0 reqids[39,0,0] infop (nil) lockseq x5a7e
      GCS SHADOW END
    GCS RESOURCE END
    ----- End of Customized Incident Dump(s) -----
    
    ============================
    
     ORA-600: internal error code, arguments: [kjbmprlst:shadow], 
    
     file: /u01/app/oracle/diag/rdbms/south/south5/trace/south5_lms1_10196.trc
     
    
    ========= Dump for incident 60159 (ORA-600 [kjbmprlst:shadow]) ========
    ----- Beginning of Customized Incident Dump(s) -----
    FUSION MSG 0x2ae2ba712988,39 from 1 spnum 14 ver[66,617] ln 144 sq[1,8]
           REPLAY 1 [0x4276680.12f, 771080.0] c[0x437fcaa10,134] [0x229,x93]
           grant 1 convert 0 role x0
           pi [0x0.0x0] flags 0x0 state 0x100
           disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
           msgRM# 617 bkt# 19664 drmbkt# 19664
       pkey 771080.0 undo 0 stat 5 masters[1, 2->32768] reminc 66 RM# 600
    flg x6 type x0 afftime x15a5c44
    nreplays by lms 0 = 85 
    nreplays by lms 1 = 123 
      hv 21 [stat 0x0, 5->5, wm 32768, RMno 0, reminc 66, dom 0]
      kjga st 0x4, step 0.36.0, cinc 66, rmno 617, flags 0x20
      lb 0, hb 32767, myb 19664, drmb 19664, apifrz 1
    FUSION MSG DUMP END
    GCS RESOURCE 0x99b6d59e0 hashq [0x99b696918,0x9ffba2d40] name[0x4276680.12f] 
    pkey 4540841.0
      grant 0x98c2e90e0 cvt (nil) send (nil)@0,0 write (nil),0@65536
      flag 0x2 mdrole 0x1 mode 1 scan 0.0 role LOCAL
      disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
      xid 0x0000.000.00000000 sid 1 pkwait 0s rmacks 0
      refpcnt 0 weak: 0x0000.00000000
      pkey 4540841.0
      hv 21 [stat 0x0, 5->5, wm 32768, RMno 0, reminc 66, dom 0]
      kjga st 0x4, step 0.36.0, cinc 66, rmno 617, flags 0x20
      lb 0, hb 32767, myb 19664, drmb 19664, apifrz 1
      GCS SHADOW 0x98c2e90e0,12 resp[0x99b6d59e0,0x4276680.12f] pkey 4540841.0
        grant 1 cvt 0 mdrole 0x1 st 0x100 lst 0x40 GRANTQ rl LOCAL
        master 5 owner 1 sid 1 remote[0x6bfcbfbf0,145] hist 0xc02186010c30086
        history 0x6.0x1.0xc.0x6.0x1.0xc.0x6.0x1.0xc.0x0.
        cflag 0x0 sender 0 flags 0x0 replay# 0 abast (nil).x0.1 dbmap (nil)
        disk: 0x0000.00000000 write request: 0x0000.00000000
        pi scn: 0x0000.00000000 sq[0x99b6d5a10,0x99b6d5a10]
        msgseq 0x1 updseq 0x0 reqids[175,0,0] infop (nil) lockseq x7ae7
      GCS SHADOW END
    GCS RESOURCE END
     
    
    WORKAROUND:
    -----------
    None.
    
    RELATED BUGS:
    -------------
    With the fixed for bug 9458781 installed the only left bug which kind of 
    match this is
    
    
    
    Hit the internal error on a lock used to be belonged to a read-mostly object.
    
    this is fixed in 11.2.0.3.
    
    REPRODUCIBILITY:
    ----------------
    Intermittent
    
    TEST CASE:
    ----------
    n/a
    
    STACK TRACE:
    ------------
    ORA-600: internal error code, arguments: [kjbmprlst:shadow] stack shows as
    
    *** 04:15:25.695
    dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, 
    mask=0x0)
    ----- SQL Statement (None) -----
    Current SQL information unavailable - no cursor.
    
    ----- Call Stack Trace -----
    calling              call     entry                argument values in hex     
     
    location             type     point                (? means dubious value)    
     
    -------------------- -------- -------------------- 
    ----------------------------
    skdstdst()+36        call     kgdsdst()            000000000 ? 000000000 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000001 ? 000000002 ?
    ksedst1()+98         call     skdstdst()           000000000 ? 000000000 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    ksedst()+34          call     ksedst1()            000000000 ? 000000001 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbkedDefDump()+2741  call     ksedst()             000000000 ? 000000001 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    ksedmp()+36          call     dbkedDefDump()       000000003 ? 000000002 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    ksfdmp()+64          call     ksedmp()             000000003 ? 000000002 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgexPhaseII()+1764  call     ksfdmp()             000000003 ? 000000002 ?
                                                      7FFF4606D138 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgexExplicitEndInc  call     dbgexPhaseII()       2AE2B97D5710 ? 
    2AE2B9AAB3B0 ?
    ()+750                                             7FFF46078C70 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  2AE2B97D5710 ? 
    2AE2B9AAB3B0 ?
    nImpl()+767                   ()                   7FFF46078C70 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgeEndDDEInvocatio  call     dbgeEndDDEInvocatio  2AE2B97D5710 ? 
    2AE2B9AAB3B0 ?
    n()+47                        nImpl()              7FFF46078C70 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    kjbmprlst()+5577     call     dbgeEndDDEInvocatio  2AE2B97D5710 ? 
    2AE2B9AAB3B0 ?
                                 n()                  7FFF46078C70 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    kjmxmpm()+966        call     kjbmprlst()          2AE2BA712988 ? 
    2AE2B9AAB3B0 ?
                                                      000000001 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    kjmpbmsg()+4537      call     kjmxmpm()            2AE2BA712988 ? 000000000 ?
                                                      000000002 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    kjmsm()+7767         call     kjmpbmsg()           2AE2BA710D48 ? 9D4CDA2D8 ?
                                                      000000001 ? 000000001 ?
                                                      000000000 ? 000000000 ?
    ksbrdp()+971         call     kjmsm()              0600164D8 ? 9E9C77730 ?
    
     
    
    SUPPORTING INFORMATION:
    -----------------------
    
    *** 17:43:29.058
    dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, 
    mask=0x0)
    ----- SQL Statement (None) -----
    Current SQL information unavailable - no cursor.
    
    ----- Call Stack Trace -----
    calling              call     entry                argument values in hex     
     
    location             type     point                (? means dubious value)    
     
    -------------------- -------- -------------------- 
    ----------------------------
     
                                                      000000000 ? 000000002 ?
    dbgexPhaseII()+1764  call     ksfdmp()             000000003 ? 000000002 ?
                                                      7FFF5DC21838 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgexExplicitEndInc  call     dbgexPhaseII()       2B45FC020710 ? 
    2B45FC2F63B0 ?
    ()+750                                             7FFF5DC2D370 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgeEndDDEInvocatio  call     dbgexExplicitEndInc  2B45FC020710 ? 
    2B45FC2F63B0 ?
    nImpl()+767                   ()                   7FFF5DC2D370 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    dbgeEndDDEInvocatio  call     dbgeEndDDEInvocatio  2B45FC020710 ? 
    2B45FC2F63B0 ?
    n()+47                        nImpl()              7FFF5DC2D370 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    kjbrasr()+1749       call     dbgeEndDDEInvocatio  2B45FC020710 ? 
    2B45FC2F63B0 ?
                                 n()                  7FFF5DC2D370 ? 000000001 ?
                                                      000000000 ? 000000002 ?
    kjbmsalc()+128       call     kjbrasr()            992C92238 ? 000000000 ?
                                                      9F7F43BD8 ? 979F2A880 ?
                                                      000000001 ? 000002868 ?
    kjbmprlst()+11125    call     kjbmsalc()           003F011C6 ? 00000012F ?
                                                      000450531 ? 992C92238 ?
                                                      000000000 ? 000000001 ?
    kjmxmpm()+966        call     kjbmprlst()          2B45FFBFC208 ? 00000012F ?
                                                      000000001 ? 000000001 ?
                                                      000000000 ? 000000001 ?
    kjmpbmsg()+4537      call     kjmxmpm()            2B45FFBFC208 ? 000000000 ?
                                                      000000001 ? 000000001 ?
                                                      000000000 ? 000000001 ?
    kjmsm()+7767         call     kjmpbmsg()           2B45FFBFB588 ? A250266C8 ?
                                                      000000001 ? 000000001 ?
                                                      000000000 ? 000000000 ?
    ksbrdp()+971         call     kjmsm()              0600164D8 ? 9D9CE6888 ?
    
     
    IPS package with all the related incident files will be uploaded to the bug
    
  2. maclean Avatar

    SO: 0xc91b29550, type: 4, owner: 0xc814385d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
    proc=0xc814385d0, name=session, file=ksu.h LINE:11467, pg=0
    (session) sid: 1847 ser: 1 trans: (nil), creator: 0xc814385d0
    flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x408) -/-
    DID: , short-term DID:
    txn branch: (nil)
    oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
    Not in wait; last wait ended 0.328702 sec ago
    Wait State:
    fixed_waits=0 flags=0x21 boundary=(nil)/-1
    Session Wait History:
    elapsed time of 0.328731 sec since last wait
    0: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066846 seq_num=26443 snap_id=1
    wait times: snap=0.037705 sec, exc=0.037705 sec, total=0.037705 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000026 sec of elapsed time
    1: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066845 seq_num=26442 snap_id=1
    wait times: snap=0.003826 sec, exc=0.003826 sec, total=0.003826 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000033 sec of elapsed time
    2: waited for ‘ges message buffer allocation’
    pool=0x0, request=0x1, allocated=0x0
    wait_id=2105066844 seq_num=26441 snap_id=1
    wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    occurred after 0.000008 sec of elapsed time
    3: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066843 seq_num=26440 snap_id=1
    wait times: snap=0.004991 sec, exc=0.004991 sec, total=0.004991 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000019 sec of elapsed time
    4: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066842 seq_num=26439 snap_id=1
    wait times: snap=0.007574 sec, exc=0.007574 sec, total=0.007574 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000008 sec of elapsed time
    5: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066841 seq_num=26438 snap_id=1
    wait times: snap=0.000036 sec, exc=0.000036 sec, total=0.000036 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000019 sec of elapsed time
    6: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066840 seq_num=26437 snap_id=1
    wait times: snap=0.009855 sec, exc=0.009855 sec, total=0.009855 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000015 sec of elapsed time
    7: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066839 seq_num=26436 snap_id=1
    wait times: snap=0.000061 sec, exc=0.000061 sec, total=0.000061 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000014 sec of elapsed time
    8: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066838 seq_num=26435 snap_id=1
    wait times: snap=0.000071 sec, exc=0.000071 sec, total=0.000071 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000020 sec of elapsed time
    9: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=2105066837 seq_num=26434 snap_id=1
    wait times: snap=0.003144 sec, exc=0.003144 sec, total=0.003144 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000023 sec of elapsed time

    Hdr: 9491134 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 ORA-600
    Abstract: LMD CRASHES INSTANCE WITH ORA-600 [KJMSCNDSCQ:TIMEOUT]

    *** 03/19/10 07:55 am ***
    TAR:
    —-

    PROBLEM:
    ——–
    1. Clear description of the problem encountered:

    – frequent instance crashes due to LMD crashing with :

    ORA-600: internal error code, arguments: [kjmscndscq:timeout]

    – 4 instances, 11.2.0.1 on Linux X86_64
    – Oracle RDS/IP (generic)
    – rds-tools Package 1.4-1 11-SEP-09
    – Using Infiniband

    – For example : Mar 18 at 02.06 inst#2 crashed :

    Thu Mar 18 02:06:56 2010
    Errors in file /u01/app/oracle/diag/rdbms/pr/PR2/trace/PR2_lmd0_7935.trc
    (incident=124183):
    ORA-600: internal error code, arguments: [kjmscndscq:timeout], [], [], [],
    [], [], [], [], [], [], [], []
    Incident details in:
    /u01/app/oracle/diag/rdbms/pr/PR2/incident/incdir_124183/PR2_lmd0_7935_i124183
    .trc

    LMD0 (ospid: 7935): terminating the instance due to error 482
    System state dump is made for local instance
    System State dumped to trace file
    /u01/app/oracle/diag/rdbms/pr/PR2/trace/PR2_diag_7921.trc
    Thu Mar 18 02:07:04 2010
    ORA-1092 : opitsk aborting process
    Thu Mar 18 02:07:04 2010
    License high water mark = 70
    Thu Mar 18 02:07:09 2010
    Instance terminated by LMD0, pid = 7935

    2. Sequence of events leading to the problem

    – Has seen it happen when :

    1) When RMAN was backing up the FRA to tape (netbackup).
    2) running a select on a large table that has thousands of inserts per
    second.

    3. Technical impact on the customer. Include persistent after effects.

    – Instances restart.

    DIAGNOSTIC ANALYSIS:
    ——————–
    – “PR2_lmd0_7935_i124183.trc”

    ACTION NAME:() 2010-03-18 02:06:56.860
    Dump continued from file:
    /u01/app/oracle/diag/rdbms/pr/PR2/trace/PR2_lmd0_7935.trc
    ORA-600: internal error code, arguments: [kjmscndscq:timeout], [], [], [],
    [], [], [], [], [], [], [], []

    – “PR2_lmon_7933.trc”

    2010-03-18 02:04:04.149
    CGS recovery timeout = 85 sec
    Begin DRM(23) (swin 0)
    —> file ends here

    – “PR2_diag_7921.trc”

    PROCESS 12:

    last post sent-location: kji.h LINE:2816 ID:kjata: wake up enqueue owner
    ..
    OSD pid info: Unix process pid: 7935, image:
    [email protected] (LMD0)

    Current Wait Stack:
    Not in wait; last wait ended 0.302150 sec ago
    Wait State:
    fixed_waits=0 flags=0x21 boundary=(nil)/-1
    Session Wait History:
    elapsed time of 0.302169 sec since last wait
    0: waited for ‘ges LMD suspend for testing event’
    =0x32, =0x0, =0x0
    wait_id=944384 seq_num=26897 snap_id=1
    wait times: snap=0.066003 sec, exc=0.066003 sec, total=0.066003 sec
    wait times: max=0.060000 sec
    wait counts: calls=1 os=1
    occurred after 0.000021 sec of elapsed time

    ———-resource 0x197f9cda80———————-
    resname : [0x0][0x0],[RM]

    CONVERT_Q:
    lp 0x18254c5908 gl KJUSERNL rl KJUSEREX rp 0x197f9cda80 [0x0][0x0],[RM]
    master 2 pid 7935 bast 0 rseq 1 mseq 0 history 0x55a5555a

    – It looks like it’s in mid-DRM and LMD is trying to get an RM enqueue, but I
    can’t see the holder of this resource on the other instances though.

    WORKAROUND:
    ———–
    – Suggested ct. disable DRM for now to get some stability, waiting for
    feedback on that.

    RELATED BUGS:
    ————-
    – many bugs reported for this ORA-600 but didn’t see clear match with this
    case.

    REPRODUCIBILITY:
    —————-

    TEST CASE:
    ———-

    STACK TRACE:
    ————

    SUPPORTING INFORMATION:
    ———————–
    – uploaded ‘logs.tar.gz’ containing four tarballs, one for each node’s trace
    dump location.
    – Each one unzips into a folder ‘diag’ so rename them as they are extracted.

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————
    Hdr: 5624062 10.2.0.1.0 RDBMS 10.2.0.1.0 PRODID-5 PORTID-59 ORA-600
    Abstract: LMS PROCESS DOWN OWING TO ORA-600 [KJMVALIDATE:!PCM], [2], [32], [0], []

    *** 10/26/06 11:56 pm ***
    TAR:
    —-
    5910169.993

    PROBLEM:
    ——–
    NODE 2 >>

    Thu Oct 26 22:01:35 2006
    Thread 2 advanced to log sequence 6258
    Current log# 6 seq# 6258 mem# 0: /dev/vg02/rredolog61
    Current log# 6 seq# 6258 mem# 1: /dev/vg02/rredolog62
    Thu Oct 26 22:04:08 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms1_13420.trc:
    ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
    [0], [], [], [], []
    Thu Oct 26 22:04:10 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms1_13420.trc:
    ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
    [0], [], [], [], []
    Thu Oct 26 22:04:10 2006
    Trace dumping is performing id=[cdmp_20061026220410]
    Thu Oct 26 22:04:17 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms1_13420.trc:
    ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
    [0], [], [], [], []
    Thu Oct 26 22:04:17 2006
    LMS1: terminating instance due to error 484
    Thu Oct 26 22:04:17 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lmon_13387.trc:
    ORA-484: LMS* process terminated with error
    Thu Oct 26 22:04:17 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms0_13409.trc:
    ORA-484: LMS* process terminated with error
    Thu Oct 26 22:04:17 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lmd0_13389.trc:
    ORA-484: LMS* process terminated with error
    Thu Oct 26 22:04:17 2006
    Errors in file /oracle/app/oracle/admin/APOLLO/bdump/apollo2_lms2_13422.trc:
    ORA-484: LMS* process terminated with error
    Thu Oct 26 22:04:19 2006
    System state dump is made for local instance
    System State dumped to trace file
    /oracle/app/oracle/admin/APOLLO/bdump/apollo2_diag_13383.trc
    Thu Oct 26 22:04:22 2006
    Instance terminated by LMS1, pid = 13420
    Fri Oct 27 09:17:58 2006
    Starting ORACLE instance (normal)

    DIAGNOSTIC ANALYSIS:
    ——————–
    apollo2_lms1_13420.trc >>

    MSG [32:KJX_B_OCR] [0x25d54.c0000] inc=26 len=352 sender=(0,0) seq=2521440
    fg=p stat=KJUSERSTAT_DONE spid=10552 flg=xff
    *** 22:04:08.761
    ksedmp: internal or fatal error
    ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
    [0], [], [], [], []
    —– Call Stack Trace —–

    kgerinv kgerinv – KGE Record Internal (Named) error (with Va_list)
    */
    ksbrdp KSB: run a detached (background) process

    kgerinv kgeasnmierr kjmvalidate kjmpbmsg kjmsm ksbrdp

    apollo2_diag_13383.trc >>

    5: last wait for ‘rdbms ipc message’ (LMON)
    6: last wait for ‘ges remote message’ (LMD0)
    7: last wait for ‘gcs remote message’ (LMS0)

    8: last wait for ‘ges LMD suspend for testing event’ (LMS1)

    9: last wait for ‘gcs remote message’ (LMS2)

    WORKAROUND:
    ———–
    none

    RELATED BUGS:
    ————-

    REPRODUCIBILITY:
    —————-

    TEST CASE:
    ———-

    STACK TRACE:
    ————

    apollo2_lms1_13420.trc >>

    MSG [32:KJX_B_OCR] [0x25d54.c0000] inc=26 len=352 sender=(0,0) seq=2521440
    fg=p stat=KJUSERSTAT_DONE spid=10552 flg=xff
    *** 22:04:08.761
    ksedmp: internal or fatal error
    ORA-600: internal error code, arguments: [kjmvalidate:!pcm], [2], [32],
    [0], [], [], [], []
    —– Call Stack Trace —–

    kgerinv kgerinv – KGE Record Internal (Named) error (with Va_list)
    */
    ksbrdp KSB: run a detached (background) process

    kgerinv kgeasnmierr kjmvalidate kjmpbmsg kjmsm ksbrdp

    Hdr: 9677815 11.2.0.1 RDBMS 11.2.0.1 RAC PRODID-5 PORTID-226 9458781
    Abstract: INSERT INTO TABLE HANGS IN ‘ LOCK DEADLOCK RETRY’ – ‘GES MESSAGE BUFFER ALLOCATI

    *** 05/03/10 12:57 pm ***
    —-

    PROBLEM:
    ——–
    3 node rac system. The same insert statement executed only on instance 2
    hung 2 times. No other session in cluster was accessing the same table.
    The statement completed after the instance was restarted.

    DIAGNOSTIC ANALYSIS:
    ——————–
    Trace file /u01/app/oracle/diag/rdbms/gdhp/GDHP2/trace/GDHP2_ora_30308.trc
    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 – 64bit Production
    With the Partitioning, Real Application Clusters, Automatic Storage
    Management, OLAP,
    Data Mining and Real Application Testing options
    ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
    System name: Linux
    Release: 2.6.18-128.1.16.0.1.el5
    Version: #1 SMP Tue Jun 30 16:48:30 EDT 2009
    Machine: x86_64
    Instance name: GDHP2
    Redo thread mounted by this instance: 2
    Oracle process number: 91

    *** 11:33:55.997
    *** ID:(282.5434) 2010-05-02 11:33:55.997
    *** ID:() 2010-05-02 11:33:55.997
    *** NAME:(GDHPETL) 2010-05-02 11:33:55.997
    11:33:55.997
    *** NAME:() 2010-05-02 11:33:55.997

    Received ORADEBUG command (#1) ‘event ‘10046 trace name context forever,
    level 12” from process ‘Unix process pid: 19303, image: ‘

    *** 11:33:56.004
    Error encountered while processing ORADEBUG command (#1):
    ORA-49100: Failed to process event statement [‘10046 trace name context
    forever, level 12’]

    *** 11:36:23.584
    Received ORADEBUG command (#2) ‘event 10046 trace name context forever, level
    12’ from process ‘Unix process pid: 19303, image: ‘

    *** 11:36:23.584
    Finished processing ORADEBUG command (#2) ‘event 10046 trace name context
    forever, level 12′
    WAIT #2: nam=’lock deadlock retry’ ela= 2838 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183585014
    WAIT #2: nam=’ges message buffer allocation’ ela= 2 pool=0 request=1
    allocated=0 obj#=16540 tim=1272818183585051
    WAIT #2: nam=’lock deadlock retry’ ela= 10719 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183596011
    WAIT #2: nam=’ges message buffer allocation’ ela= 1 pool=0 request=1
    allocated=0 obj#=16540 tim=1272818183596068
    WAIT #2: nam=’lock deadlock retry’ ela= 10682 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183607055
    WAIT #2: nam=’ges message buffer allocation’ ela= 3 pool=0 request=1
    allocated=0 obj#=16540 tim=1272818183607105
    WAIT #2: nam=’lock deadlock retry’ ela= 10600 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183618017
    WAIT #2: nam=’ges message buffer allocation’ ela= 2 pool=0 request=1
    allocated=0 obj#=16540 tim=1272818183618069
    WAIT #2: nam=’lock deadlock retry’ ela= 10742 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183629049
    WAIT #2: nam=’ges message buffer allocation’ ela= 3 pool=0 request=1
    allocated=0 obj#=16540 tim=1272818183629100
    WAIT #2: nam=’lock deadlock retry’ ela= 10563 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183639989
    WAIT #2: nam=’ges message buffer allocation’ ela= 3 pool=0 request=1
    allocated=0 obj#=16540 tim=1272818183640043

    ————
    Wait event of “lock deadlock retry” is set in kcbgcur ( KCB: Get a block in
    current mode)

    and from 10046 there are many tried per waiting on ‘ges message buffer
    allocation’

    There seeme to be 2 issues here:

    1. The wait on GES message buffer allocation is set up even when there is
    no need to wait in kjcsmba.

    The root cause seems related to ‘ges message buffer allocation’

    bug 8630374 has same issue fixed in 12.1 ( Distributed Lock Manager (DLM)
    issue )

    REDISCOVERY INFORMATION:
    Check if there are a large number of wait event of “ges message buffer
    allocation” with very small elapse time when Event 10046 level 8 is set.

    2. WAIT #2: nam=’lock deadlock retry’ ela= 10600 p1=0 p2=0 p3=0 obj#=16540
    tim=1272818183618017

    This wait is more serious and has much higher impact.

    WORKAROUND:
    ———–
    Restart the instance

    RELATED BUGS:
    ————-
    8630374

    REPRODUCIBILITY:
    —————-
    2 times

    TEST CASE:
    ———-

    STACK TRACE:
    ————
    STG_T_RN_TRADE(RN_TRADE_ID,SRC_SUR_NK,HOSTTRAN_TP,HOSTTRAN_SERIAL_NBR,HOSTTRAN
    _ACCT_TP,HOSTTRAN_LEDGER,HOSTTRAN_GRP_ACCT_CD,HOSTTRAN_HOST_LEDGER,HOSTTRAN_GR
    P_REF_1,HOSTTRAN_GRP_REF_2,HOSTTRAN_GRP_REF_3,HOSTTRAN_GRP_REF_4,HOSTTRAN_GRP_
    REF_5,HOSTTRAN_CLEARING_CD,HOSTTRAN_OTHER_SIDE_LEDGER,HOSTTRAN_FAS_CURR,H
    short stack:
    ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-se
    1<-kcbgcur()+13746<-kcbnew()+18953<-ktspfmdb()+403<-ktspfmtrng()+1051<-ktspfsa
    ll()+1835<-ktspfsrch()+839<-ktspscan_bmb()+505<-ktspgsp_main()+875<-kdtgsp()+6
    04<-kdtgrs()+1692<-kdtInsRow()+1008<-qerltcNoKdtBufferedInsRowCBK()+797<-qerlt
    cLoadStateMachine()+811<-qerltcFetch()+3372<-insexe()+1581<-opiexe()+22572<-kp
    oal8()+2288<-opiodr()+1149<-ttcpip()+1251<-opitsk()
    wait history:
    * time between current wait and wait #1: 0.000009 sec
    1. event: 'gc current request'
    time waited: 0.000149 sec
    wait id: 1837759 p1: 'file#'=0xc
    p2: 'block#'=0x9b345
    p3: 'id#'=0x2000001
    * time between wait #1 and #2: 0.000014 sec
    2. event: 'ges message buffer allocation'
    time waited: 0.000002 sec
    wait id: 1837758 p1: 'pool'=0x0
    p2: 'request'=0x1
    p3: 'allocated'=0x0
    * time between wait #2 and #3: 0.000023 sec
    3. event: 'lock deadlock retry'
    time waited: 0.010681 sec
    wait id: 1837757

Leave a Reply

Your email address will not be published. Required fields are marked *