11g新特性:hangdiag.sql实例hang诊断脚本

11g中引入了hangdiag.sql 这个脚本来帮助Oracle Support 用来收集诊断实例hang住所需要的diagnostic信息,这个脚本直接存放在11g 以后的$ORACLE_HOME/rdbms/admin目录下,无需用户另外下载。

 

该脚本用于收集单实例的hang诊断信息还是很不错的,如果是RAC环境的话推荐使用racdiag.sql这个脚本。

 

值得一提的是 hangdiag.sql会做 11g中独有的(oradebug dump hangdiag_header 1)并将v$wait_chain的源信息x$ksdhng_chains文本形式输出, 之后会做ashdump和systemstate(level 267 可能耗时较长)的dump转储操作,这样获得的trace文件信息就很全面了。

 

hangdiag.sql:

 

[oracle@vrh2 ~]$ cat $ORACLE_HOME/rdbms/admin/hangdiag.sql
Rem
Rem $Header: hangdiag.sql 08-jun-2007.02:06:43 amysoren Exp $
Rem
Rem hangdiag.sql
Rem
Rem Copyright (c) 2007, Oracle. All rights reserved.  
Rem
Rem    NAME
Rem      hangdiag.sql - Hang Diagnosis script
Rem
Rem    DESCRIPTION
Rem      It is generally useful (but not necessary) to run this script from a
Rem      prelim mode connection.
Rem
Rem      This script dumps data required to diagnose hangs:
Rem      1. ASH dump
Rem      2. Systemstate dump with short callstacks
Rem      3. Hang analysis results (x$ksdhng_chains)
Rem
Rem    NOTES
Rem      It is required to set PID using oradebug setmypid/setospid/setorapid
Rem      before invoking this script. 
Rem
Rem      "oradebug tracefile_name" gives the file name including the path of
Rem      the trace file containing the dumps.
Rem
Rem
Rem    MODIFIED   (MM/DD/YY)
Rem    amysoren    06/08/07 - direct_access command syntax change
Rem    avaliani    05/17/07 - bug 5499564: add x$ksdhng_chains
Rem    amysoren    01/10/07 - add ashdump, systemstate dump
Rem    amysoren    01/10/07 - Created
Rem

-- begin hang diag dump
oradebug dump hangdiag_header 1

-- dump hang analysis chains
oradebug direct_access enable trace
oradebug direct_access disable reply
oradebug direct_access set content_type = 'text/plain'
oradebug direct_access select * from x$ksdhng_chains

-- dump ash data
oradebug dump ashdump 5

-- dump systemstate with short callstacks
oradebug dump systemstate 267

x$ksdhng_chains内部视图的字段如下

SQL> desc x$ksdhng_chains;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ADDR                                               RAW(8)
 INDX                                               NUMBER
 INST_ID                                            NUMBER
 CHAIN_ID                                           NUMBER
 CHAIN_IS_CYCLE                                     NUMBER
 CHAIN_SIGNATURE                                    VARCHAR2(801)
 CHAIN_SIGNATURE_HASH                               NUMBER
 INSTANCE                                           NUMBER
 OSID                                               VARCHAR2(25)
 PID                                                NUMBER
 SID                                                NUMBER
 SESS_SERIAL#                                       NUMBER
 BLOCKER_IS_VALID                                   NUMBER
 BLOCKER_INSTANCE                                   NUMBER
 BLOCKER_OSID                                       VARCHAR2(25)
 BLOCKER_PID                                        NUMBER
 BLOCKER_SID                                        NUMBER
 BLOCKER_SESS_SERIAL#                               NUMBER
 BLOCKER_CHAIN_ID                                   NUMBER
 IN_WAIT                                            NUMBER
 TIME_SINCE_LAST_WAIT_SECS                          NUMBER
 WAIT_ID                                            NUMBER
 WAIT_EVENT                                         NUMBER
 WAIT_EVENT_TEXT                                    VARCHAR2(64)
 P1                                                 NUMBER
 P1_TEXT                                            VARCHAR2(64)
 P2                                                 NUMBER
 P2_TEXT                                            VARCHAR2(64)
 P3                                                 NUMBER
 P3_TEXT                                            VARCHAR2(64)
 IN_WAIT_SECS                                       NUMBER
 TIME_REMAINING_SECS                                NUMBER
 NUM_WAITERS                                        NUMBER
 ROW_WAIT_OBJ#                                      NUMBER
 ROW_WAIT_FILE#                                     NUMBER
 ROW_WAIT_BLOCK#                                    NUMBER
 ROW_WAIT_ROW#                                      NUMBER

 

 

使用范例如下:

 

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com & www.askmaclean.com  

SQL> oradebug setmypid;
Statement processed.
SQL>
SQL>
SQL> @?/rdbms/admin/hangdiag    
Statement processed.
Statement processed.
Statement processed.
Statement processed.
Statement processed.
Statement processed.
Statement processed.
SQL> oradebug tracefile_name;
/s01/orabase/diag/rdbms/vprod/VPROD2/trace/VPROD2_ora_22612.trc

Sample output 示例输出如下:

Processing Oradebug command 'setmypid'

Oradebug command 'setmypid' console output: <none>

Processing Oradebug command 'dump hangdiag_header 1'
Hang diagnosis dumps

Oradebug command 'dump hangdiag_header 1' console output: <none>
Result of fixed table query: select * from x$ksdhng_chains
ADDR    = 7FA5333FDC08
INDX    = 0
INST_ID = 2
CHAIN_ID        = 1
CHAIN_IS_CYCLE  = 0
CHAIN_SIGNATURE = 'SQL*Net message from client'<='enq: TM - contention'
CHAIN_SIGNATURE_HASH    = 373050554
INSTANCE        = 2
OSID    = 26593
PID     = 47
SID     = 188
SESS_SERIAL#    = 1879
BLOCKER_IS_VALID        = 1
BLOCKER_INSTANCE        = 2
BLOCKER_OSID    = 25967
BLOCKER_PID     = 43
BLOCKER_SID     = 176
BLOCKER_SESS_SERIAL#    = 9631
BLOCKER_CHAIN_ID        = 0
IN_WAIT = 1
TIME_SINCE_LAST_WAIT_SECS       = 0
WAIT_ID = 8
WAIT_EVENT      = 234
WAIT_EVENT_TEXT = enq: TM - contention
P1      = 1414332422
P1_TEXT = name|mode
P2      = 79775
P2_TEXT = object #
P3      = 0
P3_TEXT = table/partition
IN_WAIT_SECS    = 692
TIME_REMAINING_SECS     = -1
NUM_WAITERS     = 0
ROW_WAIT_OBJ#   = 79775
ROW_WAIT_FILE#  = 0
ROW_WAIT_BLOCK# = 0
ROW_WAIT_ROW#   = 0

ADDR    = 7FA5333FDC08
INDX    = 1
INST_ID = 2
CHAIN_ID        = 1
CHAIN_IS_CYCLE  = 0
CHAIN_SIGNATURE = 'SQL*Net message from client'<='enq: TM - contention'
CHAIN_SIGNATURE_HASH    = 373050554
INSTANCE        = 2
OSID    = 25967
PID     = 43
SID     = 176
SESS_SERIAL#    = 9631
BLOCKER_IS_VALID        = 0
BLOCKER_INSTANCE        = 0
BLOCKER_OSID    =
BLOCKER_PID     = 0
BLOCKER_SID     = 0
BLOCKER_SESS_SERIAL#    = 0
BLOCKER_CHAIN_ID        = 0
IN_WAIT = 1
TIME_SINCE_LAST_WAIT_SECS       = 0
WAIT_ID = 148
WAIT_EVENT      = 352
WAIT_EVENT_TEXT = SQL*Net message from client
P1      = 1650815232
P1_TEXT = driver id
P2      = 1
P2_TEXT = #bytes
P3      = 0
P3_TEXT =
IN_WAIT_SECS    = 568
TIME_REMAINING_SECS     = -1
NUM_WAITERS     = 1
ROW_WAIT_OBJ#   = -1
ROW_WAIT_FILE#  = 0
ROW_WAIT_BLOCK# = 0
ROW_WAIT_ROW#   = 0

2 rows selected

Processing Oradebug command 'dump ashdump 5'
ASH dump
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER BEGIN>>>
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER BEGIN>>>
****************
SCRIPT TO IMPORT
****************
------------------------------------------
Step 1: Create destination table <ashdump>
------------------------------------------
CREATE TABLE ashdump AS
SELECT * FROM SYS.WRH$_ACTIVE_SESSION_HISTORY WHERE rownum < 0
----------------------------------------------------------------
Step 2: Create the SQL*Loader control file <ashldr.ctl> as below
----------------------------------------------------------------
load data
infile * "str '\n####\n'"
append
into table ashdump
fields terminated by ',' optionally enclosed by '"'
(
SNAP_ID  CONSTANT 0           ,
DBID                          ,
INSTANCE_NUMBER               ,
SAMPLE_ID                     ,
SAMPLE_TIME TIMESTAMP ENCLOSED BY '"' AND '"' "TO_TIMESTAMP(:SAMPLE_TIME   ,'MM-DD-YYYY HH24:MI:SSXFF')"   ,
SESSION_ID                    ,
SESSION_SERIAL#               ,
SESSION_TYPE                  ,
USER_ID                       ,
SQL_ID                        ,
SQL_CHILD_NUMBER              ,
SQL_OPCODE                    ,
FORCE_MATCHING_SIGNATURE      ,
TOP_LEVEL_SQL_ID              ,
TOP_LEVEL_SQL_OPCODE          ,
SQL_PLAN_HASH_VALUE           ,
SQL_PLAN_LINE_ID              ,
SQL_PLAN_OPERATION#           ,
SQL_PLAN_OPTIONS#             ,
SQL_EXEC_ID                   ,
SQL_EXEC_START DATE 'MM/DD/YYYY HH24:MI:SS' ENCLOSED BY '"' AND '"' ":SQL_EXEC_START"   ,
PLSQL_ENTRY_OBJECT_ID         ,
PLSQL_ENTRY_SUBPROGRAM_ID     ,
PLSQL_OBJECT_ID               ,
PLSQL_SUBPROGRAM_ID           ,
QC_INSTANCE_ID                ,
QC_SESSION_ID                 ,
QC_SESSION_SERIAL#            ,
EVENT_ID                      ,
SEQ#                          ,
P1                            ,
P2                            ,
P3                            ,
WAIT_TIME                     ,
TIME_WAITED                   ,
BLOCKING_SESSION              ,
BLOCKING_SESSION_SERIAL#      ,
BLOCKING_INST_ID              ,
CURRENT_OBJ#                  ,
CURRENT_FILE#                 ,
CURRENT_BLOCK#                ,
CURRENT_ROW#                  ,
TOP_LEVEL_CALL#               ,
CONSUMER_GROUP_ID             ,
XID                           ,
REMOTE_INSTANCE#              ,
TIME_MODEL                    ,
SERVICE_HASH                  ,
PROGRAM                       ,
MODULE                        ,
ACTION                        ,
CLIENT_ID                     ,
MACHINE                       ,
PORT                          ,
ECID
)
---------------------------------------------------
Step 3: Load the ash rows dumped in this trace file
---------------------------------------------------
sqlldr userid/password control=ashldr.ctl data=<this_trace_filename> errors=1000000
---------------------------------------------------
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER END>>>
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP BEGIN>>>
####
881465081,2,2082369,"01-10-2012 00:49:06.871520000",188,1879,1,0,"562cub3hk2tjy",
0,26,0,"",0,0,0,0,0,33554433,"01/10/2012 00:37:27",0,0,0,0,0,0,0,668627480,9,1414332422,
79775,0,0,0,176,9631,2,79775,0,0,0,94,12553,,0,1024,3427055676,"sqlplus@vrh2.oracle.com (TNS V1-V3)",
"sqlplus@vrh2.oracle.com (TNS V1-V3)","","","vrh2.oracle.com",0,""
.......................................

Processing Oradebug command 'dump systemstate 267'
===================================================
SYSTEM STATE (level=11, with short stacks)
------------
System global information:
.............................
call stack performance statistics:
total                  : 0.030000 sec              
setup                  : 0.000000 sec              
stack unwind           : 0.000000 sec              
symbol translation     : 0.030000 sec
printing the call stack: 0.000000 sec              
printing frame data    : 0.000000 sec              
printing argument data : 0.000000 sec              
----- End of Call Stack Trace -----

 

好了 , 就这么简单, 把trace文件打个zip包传到 My Oracle Support上去吧, 当然你也可以找我!我叫 Maclean.Liu!

解决sqlplus的segmentation fault或hang问题

sqlplus应当是DBA 1.0时代使用最为频繁的管理工具,经常有经验丰富的老DBA会提到自己敲过几万次的sqlplus:),但有的时候这个吃饭家伙也会不好用,偶尔还会出现Segmentation fault错误,亦或者彻底hang住。在这里我介绍几种应对sqlplus无法正常使用的应对方法:

1.出现Segmentation fault,这种情况下一般是sqlplus 2进制文件被损坏了,可以通过重新build一个sqlplus来解决问题

[oracle@rh2 bin]$ sqlplus
Segmentation fault

/* 使用$ORACLE_HOME/sqlplus/lib目录下的make文件,编译一个新的sqlplus */

[oracle@rh2 ~]$ make -f $ORACLE_HOME/sqlplus/lib/ins_sqlplus.mk  newsqlplus

Linking /s01/oracle/product/11.2.0/dbhome_1/sqlplus/bin/sqlplus
rm -f /s01/oracle/product/11.2.0/dbhome_1/sqlplus/bin/sqlplus
gcc -o /s01/oracle/product/11.2.0/dbhome_1/sqlplus/bin/sqlplus -m64
-L/s01/oracle/product/11.2.0/dbhome_1/sqlplus/lib/ -L/s01/oracle/product/11.2.0/dbhome_1/lib/
-L/s01/oracle/product/11.2.0/dbhome_1/lib/stubs/  /s01/oracle/product/11.2.0/dbhome_1/sqlplus/lib/s0afimai.o
-lsqlplus -lclntsh  `cat /s01/oracle/product/11.2.0/dbhome_1/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11
-ln11 -lnl11 -lnro11 `cat /s01/oracle/product/11.2.0/dbhome_1/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11
-ln11 -lnl11 -lnnz11 -lzt11 -lztkg11 -lztkg11 -lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11 -lmm
-lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11
-lcore11 -lnls11 `cat /s01/oracle/product/11.2.0/dbhome_1/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11
-lnl11 -lnro11 `cat /s01/oracle/product/11.2.0/dbhome_1/lib/ldflags`    -lncrypt11 -lnsgr11 -lnzjs11 -ln11 -lnl11
-lclient11 -lnnetd11  -lvsn11 -lcommon11 -lgeneric11   -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11
-lnls11 -lxml11 -lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11 -lclient11 -lnnetd11  -lvsn11
-lcommon11 -lgeneric11 -lsnls11 -lnls11  -lcore11 -lsnls11 -lnls11 -lcore11 -lsnls11 -lnls11 -lxml11
-lcore11 -lunls11 -lsnls11 -lnls11 -lcore11 -lnls11   `cat /s01/oracle/product/11.2.0/dbhome_1/lib/sysliblist`
-Wl,-rpath,/s01/oracle/product/11.2.0/dbhome_1/lib -lm -lpthread   `cat /s01/oracle/product/11.2.0/dbhome_1/lib/sysliblist`
-ldl -lm -lpthread  -L/s01/oracle/product/11.2.0/dbhome_1/lib
/bin/chmod 755 /s01/oracle/product/11.2.0/dbhome_1/sqlplus/bin/sqlplus
rm -f /s01/oracle/product/11.2.0/dbhome_1/bin/sqlplus
mv -f /s01/oracle/product/11.2.0/dbhome_1/sqlplus/bin/sqlplus /s01/oracle/product/11.2.0/dbhome_1/bin/sqlplus
/bin/chmod 751 /s01/oracle/product/11.2.0/dbhome_1/bin/sqlplus
rm -f /s01/oracle/product/11.2.0/dbhome_1/sqlplus/lib/libsqlplus.so
rm -rf /s01/oracle/product/11.2.0/dbhome_1/sqlplus/bin/sqlplus

[oracle@rh2 ~]$ sqlplus  / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Wed May 11 21:38:21 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

2.出现sqlplus之后hang住的现象,hang的原因存在多种可能:
1)instance hanging数据库实例hang住,这种情况下sqlplus无法正常登陆到正hang的实例,而登陆到其他实例是可以的;若在10g以后版本中可以使用-prelim选项登陆实例,使用该选项登陆后无法执行普通的SQL语句,但可以使用oradebug内部调试工具,通过oradebug收集必要的hanganalyze信息后,可以进一步判断hang住的原因并决定下一步的操作。

[oracle@rh2 ~]$ sqlplus  / as sysdba
.............................we suspend here!!!

[oracle@rh2 ~]$ sqlplus  -prelim / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Wed May 11 21:46:27 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

SQL> oradebug setmypid;
Statement processed.

SQL> oradebug dump hanganalyze 4;
Statement processed.

SQL> oradebug dump systemstate 266;
Statement processed.

SQL> oradebug tracefile_name

/s01/orabase/diag/rdbms/prod/PROD1/trace/PROD1_ora_23436.trc   -- where dump resides

将以上trc文件提交给Oracle Support或者资深的Oracle技术人员,以便他们分析出实例hang住的原因,通过调整参数或者修复bug可以避免再次出现类似的状况。

2)一执行sqlplus就出现挂起现象,甚至没有登陆任何数据库。一般这种情况是在读取sqlplus 2进制文件或其相关的共享库文件(.so文件)时遇到了问题,或者是在实际system call系统调用execve(“sqlplus”)时遇到了错误,一般我们可以使用系统跟踪工具strace(Linux)或truss(Unix)工具来分析这种挂起现象:

/* Unix */

truss -o sqlplus_hang.log sqlplus

/* Linux */

strace -o sqlplus_hang.log sqlplus

head -10 sqlplus_hang.log

execve("/s01/db_1/bin/sqlplus", ["sqlplus"], [/* 28 vars */]) = -1 ENOEXEC (Exec format error)

可以看到以上strace记录中发现了调用execve函数(execve() executes the program pointed to by filename)运行sqlplus程序时出现了ENOEXEC错误,该ENOEXEC错误代码说明我们正在执行一个格式无效的可执行文件,具体的解释如下:

This error indicates that a request has been made to execute a file which, although it has the
appropriate permissions, does not start with a valid magic number. A magic number is the first two
bytes in a file, used to determine what type of file it is.

You tried to execute a file that is not in a valid executable format. The most common format for
binary programs under linux is called ELF. Note that your shell will run ascii files that have the
executable bit set as a shell script (ie run it as shell commands).

You can reproduce this by doing
    $ dd if=/dev/random of=myfile bs=1k count=1 $ chmod +x myfile $ ./myfile zsh: exec format error: ./myfile

Note that there is a very slight possibility that you could create a valid program that does
something bad to your system!!
Note, you can have user defined ways of running programs using Linux's binfmt_misc. See
    /usr/src/linux/Documentation/binfmt_misc.txt

to be continued …………

ORA-00600:[qctcte1]内部错误一例

一套AIX上的4节点10.2.0.4 RAC系统在1月份出现实例hang住的现象,并伴随有ORA-00600:[qctcte1]内部错误,trace文件内容如下:

siposrc1_ora_102944.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name: AIX
Node name: jszydb1
Release: 3
Version: 5
Machine: 00CE31834C00
Instance name: siposrc1
Redo thread mounted by this instance: 1
Oracle process number: 34
Unix process pid: 102944, image: oracle@jszydb1

*** ACTION NAME:() 2010-01-18 15:53:11.530
*** MODULE NAME:(JDBC Thin Client) 2010-01-18 15:53:11.530
*** SERVICE NAME:(siposrc) 2010-01-18 15:53:11.530
*** SESSION ID:(2175.6953) 2010-01-18 15:53:11.530
*** 2010-01-18 15:53:11.530
ksedmp: internal or fatal error
ORA-00600: 内部错误代码, 参数: [qctcte1], [0], [], [], [], [], [], []
Current SQL statement for this session:
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),:"SYS_B_00000"), NVL(SUM(C2),:"SYS_B_00001") FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("SIPO_ECLA$_TEMP") FULL("SIPO_ECLA$_TEMP") NO_PARALLEL_INDEX("SIPO_ECLA$_TEMP") */ :"SYS_B_00002" AS C1, CASE WHEN "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00003" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00004" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00005" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00006" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00007" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00008" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00009" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00010" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00011" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00012" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00013" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00014" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00015" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00016" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00017" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_00018" OR "SIPO_ECLA
.

.
$_TEMP"."SEQ_ID"=:"SYS_B_40000" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_40001" OR "SIPO_ECLA$_TEMP"."SEQ_ID"=:"SYS_B_40002" THEN :"SYS_B_40003" ELSE :"SYS_B_40004" END AS C2 FROM "SIPO_ECLA$_TEMP" SAMPLE BLOCK (:"SYS_B_40005" , :"SYS_B_40006") SEED (:"SYS_B_40007") "SIPO_ECLA$_TEM

STACK
qctcte qctocssm qctcopn qctcopn xtyxcssr xtyopncb qctcopn qctcpqb <- qctcpqbl <- xtydrv <- opitca <- kksFullTypeCheck <- rpiswu2 <- kksSetBindType <- kksfbc <- opiexe <- opiall0 <- opikpr <- opiodr <- rpidrus <- skgmstack <- rpidru <- rpiswu2 <- kprball <- IPRA <- IPRA <- kkedsSel <- kkecdn <- kkotap <- kkoiqb <- kkooqb <- kkoqbc <- apakkoqb <- apaqbdDescendents <- 3d4 <- apaqbdListReverse <- 06c <- apaqbd <- apadrv <- opitca <- kksLoadChild <- kxsGetRuntimeLock <- 810 <- kksfbc <- kkspsc0 <- kksParseCursor <- opiosq0 <- opiall0 <- opial7 <- opiodr <- ttcpip <- opitsk <- opiino <- opiodr <- opidrv

提交SR,MOS认为可能是Bug 6666870,给出以下方案:
There are a large number of possible bugs but Bug 6666870 is the most likely culprit. There is no one off patch available. 10205 is not yet available.

ACTION PLAN:
=============
1.Please disable Cost Based Transformation as a workaround, this can be done in the init.ora/spfile or at the session level, for example:
SQL> alter session set "_optimizer_cost_based_transformation"=off ;

2.Apply the 10205 patch set when it becomes available.

3.If the optimiser change fails to resolve your issue, please advise us of any recent changes to your DB or server?

4.In particular, did you recently apply the CPUJAN2008 Patch?
If so, please see Note.558901.1 Ext/Mod ORA-00600 internal error code, arguments [qctcte1] After Applying CPUJAN2008 Patch

5.If there is a function based index involved, please see;
Note.788124.1 Ext/Pub ORA-00600 [qctcte1] With Function Based Index Access

6.Changing your code to eliminate the parallel clauses may also act as a workaround.

single-task message and cursor pin S on X

Recently i find many session are waiting for Event <cursor pin s on x> in my customer’s AIX 10.2.0.4 instance, it’s a new event in the 10gr2,after mutex take  place of latch .

When a session is going to parse or reparse one cursor, need to pin X this cursor first. Then the others who what to execute this child cursor  have to wait until pin X released.

I looked for the session who was pining that cursor on  X mode ,and saw it was waiting on event single-task messag:
the mutex idn is d322e445,and operated in EXCL mode; It correspond to the cursor whose hash is “ff2c2fd459ac5ee188586781d322e445” as after:

PROCESS 1753:
----------------------------------------
SO: 70000048e9d5e68, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=1753, calls cur/top: 70000043308cb08/70000043798a6e0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 70000048f8deaf0 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 70000048f8deaf0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000048ca877c0
O/S info: user: orauser, term: UNKNOWN, ospid: 3318124
OSD pid info: Unix process pid: 3318124, image: oracle@p595crm1
.........................
----------------------------------------
SO: 70000048eed0d30, type: 4, owner: 70000048e9d5e68, flag: INIT/-/-/0x00
(session) sid: 3311 trans: 70000046be86948, creator: 70000048e9d5e68, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-06D9-008347C7, short-term DID: 0001-06D9-008347C8
txn branch: 70000046a261488
oct: 3, prv: 0, sql: 700000414e4f570, psql: 700000415f47a90, user: 50/SHUCRM1C
service name: CRMDB1
O/S info: user: pausr12, term: unknown, ospid: 1234, machine: p570web2
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
waiting for 'single-task message' blocking sess=0x0 seq=22963 wait_time=0 seconds since wait started=6
=0, =0, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=325
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=383
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=265
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=677
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=0
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=237
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
temporary object counter: 0
SO:  70000043e695968, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043e695968 handle=700000410dd46c8 mode=N                           --the child cursor
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043e6959e8[70000041e9eb830,70000042411c178] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=CBK[0020]  savepoint=0x0
LIBRARY OBJECT HANDLE: handle=700000410dd46c8 mtx=700000410dd47f8(0)  cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001  lock=N pin=X latch#=23 hpc=fffc  hlc=fffc                                     --lock  in NULL mode,pin on X mode
lwt=700000410dd4770[700000410dd4770,700000410dd4770]  ltm=700000410dd4780[700000445169a08,700000453b69228]
pwt=700000410dd4738[700000410dd4738,700000410dd4738]  ptm=700000410dd4748[700000410dd4748,700000410dd4748]
ref=700000410dd47a0[70000043d1b5e58,70000043d1b5e58]  lnd=700000410dd47b8[700000410dd47b8,700000410dd47b8]
LIBRARY OBJECT: object=7000004354983b0
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
TRANSLATIONS: count=2 size=16
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004359e4768 7000004354984c8 I/P/A/-/-    0 NONE   00
6 700000414b0e4d0 700000450c0bc28 I/P/A/-/E    0 NONE   00

----------------------------------------

KGX Atomic  Operation Log 70000041853c068
Mutex 700000414b0e3d8(3311, 0) idn d322e445 oper EXCL
Cursor Pin uid 3311 efd 0 whr 1 slp  0      -- pin sid 3311
opr=3 pso=70000043e695968  flg=0            -- operate code is 3
pcs=700000414b0e3d8 nxt=0 flg=35 cld=0 hd=700000410dd46c8   par=70000041d78b1e0    -- child cursor handle address is 700000410dd46c8
ct=0 hsh=0 unp=0 unn=0 hvl=1d78b4b8 nhv=1  ses=70000048eed0d30                              -- heap 0 pointer address 70000041d78b1e0
hep=700000414b0e458 flg=80 ld=1 ob=7000004354983b0 ptr=700000450c0bc28  fex=700000450c0af38
----------------------------------------
SO: 70000041e9eb7b0, type: 53, owner: 70000048eed0d30, flag:  INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000041e9eb7b0 handle=700000414e4f570 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000041e9eb830[70000044590d030,70000043e6959e8] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=[0000]  savepoint=0x4c08a856
LIBRARY OBJECT HANDLE: handle=700000414e4f570 mtx=700000414e4f6a0(0)  cdp=1
name=
SELECT * FROM(SELECT A.*, rownum r FROM( select  account_id,billing_nbr,calling_nbr,called_nbr,to_date,call_duration,charge_item_name,to_char(rate  /100, '99999999999990.99') rate, charge,channel_id from ct_05
where  1=1 and account_id ='300187744'  and key_source_type ='52001'  order by to_date ) A WHERE rownum <= 15 ) B WHERE r > 0
hash=ff2c2fd459ac5ee188586781d322e445 timestamp=06-04-2010 15:14:33
namespace=CRSR flags=RON/KGHP/TIM/PN0/MED/DBN/[50010040]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=23 hpc=0018 hlc=0018
lwt=700000414e4f618[700000414e4f618,700000414e4f618]  ltm=700000414e4f628[700000414e4f628,700000414e4f628]
pwt=700000414e4f5e0[700000414e4f5e0,700000414e4f5e0]  ptm=700000414e4f5f0[700000414e4f5f0,700000414e4f5f0]
ref=700000414e4f648[700000414e4f648,700000414e4f648]  lnd=700000414e4f660[700000414e4f660,700000414e4f660]
LIBRARY OBJECT: object=70000041d78b0c8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child#    table reference   handle
------ -------- --------- --------
0 70000043d1b61e8 70000043d1b5e58 700000410dd46c8                                 -- the child cursor
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004467f6078 70000041d78b1e0  I/P/A/-/-    0 NONE   00                                      --  heap 0

The session is active,and its wait history indicates there are some dblink operation inside the running SQL, When this session waited to reparse the only child cursor for that SQL , it pin this child cursor in X mode first,and then it need more information from remote DB , but no response until SQL*NET MESSAGE from DBLINK timeout,it was waiting single-task message for more than 6 seconds when tracing.

In this scenario other sessions who wait to execute this child cursor were all waiting , so sadly.

The oracle document describes <single-task message> as When running single task, this event indicates that the session waits for the client side of the executable.

A little odd,the blocked also did some single task when it lost one dblink connection.

I searched the metalink with keyword: single-task message, there were few documents  found , but fortunately below note:

Hdr: 7757687 10.2.0.3.0 RDBMS 10.2.0.3.0 PRG INT/DISTR PRODID-5 PORTID-212
Abstract: CURSOR: PIN S WAIT ON X BLOCKER “SINGLE-TASK MESSAGE”

PROBLEM:
——–
The databse has multiple processes waiting on “CURSOR: PIN S WAIT ON X”

The holder of the mutes is waiitng on “single-task message” form almost
745687
seconds.
The query run by the holder  does conatin a dblink.

DIAGNOSTIC ANALYSIS:
——————–
All the  process waiting on the same mutex:-

waiting for ‘cursor: pin S wait on X’ blocking sess=0x0
idn=5d779f21, value=7af00000000, where|sleeps=57a0dfdb6

Holder
PROCESS 26:
KGX Atomic Operation Log 700000061490300
Mutex 70000006226e0f0(1967, 0) idn 5d779f21 oper EXCL
Cursor Pin uid 1967 efd 0 whr 1 slp 0
opr=3 pso=7000000670f93a0 flg=0
pcs=70000006226e0f0 nxt=0 flg=35 cld=0 hd=7000000606838f0
par=7000000629006a8
ct=0 hsh=0 unp=0 unn=0 hvl=62900980 nhv=1 ses=700000079543fe0
hep=70000006226e170 flg=80 ld=1 ob=700000062fe69a8
ptr=700000065f124f8 fex=700000065f11808

Session object for this process:-

SO: 700000079543fe0, type: 4, owner: 70000007a6ad538, flag: INIT/-/-/0x00
(session) sid: 1967 trans: 0, creator: 70000007a6ad538, flag: (41)
USR/- BSY/-/-/-/-/-
DID: 0001-001A-00000A38, short-term DID: 0001-001A-00000A39
txn branch: 0
oct: 3, prv: 0, sql: 700000062900e18, psql:
70000005d797bc8, user: 21/V500
O/S info: user: d_c11, term: , ospid: 6271140, machine: aruput5
program: cpm_srvscript@aruput5 (TNS V1-V3)
application name: cpm_srvscript@aruput5 (TNS V1-V3), hash value=0
waiting for ‘single-task message’ blocking sess=0x0 seq=7071
wait_time=0 seconds since wait
started=745687

This bug which occurs on Version 10.2.0.3 is so similar to my case on AIX version 10.2.0.4, so as the stack trace:

ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttrd+0120<-nsprecv+07a0<-
nscon+0218<-nsdo+157c<-nscall3+012c<-nscall+0778<-niotns+0888<-nigcall+0028<-osnco
n+0540<-kpkiadef+006c<-upiini+03fc<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+011
4<-kpnconn+02a4<-npicon0+042c<-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn
+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree
+0318<-qcsprfro_tree+01c8<-qcspafq+0068<-qcspqb+0470<-kkmdrv+003c<-opiSem+13b4<-opiDe
ferredSem+0234<-opitca+01e8<-kksFullTypeCheck+001c<-rpiswu2+034c<-kksLoadChild+30b0<-
kxsGetRuntimeLock+0810<-kksfbc+2930<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0b30<
-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1020<-opitsk+10b0<-opiino+0990<-opiod
r+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

My case stack trace:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttcni+01b8<-nttcon+04f4<-ntconn+0160<
-nsopen+0960<-nscall1+01b0<-nscall+049c<-niotns+0880<-nigcall+0028<-osncon+0540<-kpkiadef+006c<
-upiini+0408<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+0114<-kpnconn+02a4<-npicon0+042c<
-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<
-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<-qcspafq+0068<
-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-kkmevw+0a58<-kkmfcbvw+0178<
-kkmfcblo+0e38<-kkmpfcbk+0250<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-kkmdrv+003c<-opiSem+13c0<
-opiprs+01ac<-kksParseChildCursor+05e0<-rpiswu2+034c<-kksLoadChild+1d88<
-kxsGetRuntimeLock+0810<-kksfbc+28b0<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0ae0<
-kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<
-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

Both of them have called OCIKDBLinkConn->OCIKDBLinkConn2->kpndbcon->npicon0->kpnconn->OCIServerAttach->kpuatch and follow.
Till now this bug has no patch to apply or workaround way what’s awful.

As “Bug 5111335: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO” described:

Hdr: 5111335 10.2.0.2 RDBMS 10.2.0.2 STREAMS PRODID-5 PORTID-226
Abstract: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO
This is a three cluster environment. Two nodes per cluster and on that
cluster running RAC with 2 instances.
The databases involved are:

GCTGCMU is the “GCT” database in GCM = Greenwich USA, will be named USA from
now on
GCTHKGU is the “GCT” database in HKG = Hong Kong, will be named HKG from now
on
GCTLONU is the “GCT” database in LON = London, will be named LON from now on

On this environment there is a Streams bi-directional replication environment
defined between the 3 databases.

USA database is propagating changes captured to LON and HKG.
There are two propagations defined from this site to every site. 4
propagations as a total.

GCTHKGU_BATCH_PROPAGATE
GCTLONU_BATCH_PROPAGATE
GCTHKGU_DAILY_PROPAGATE
GCTLONU_DAILY_PROPAGATE

Daily propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_DAILY.
Batch propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_BATCH

This databases have been upgraded to 10.2.0.2 from 10.2.0.1 last 13-Mar-2006,
and no problems were reported.

Customer reported that he could see that propagations from USA to HKG were
stucked on library cache lock.
This situation was solved by customer by bouncing USA instances.

Analyzing the systemstate dumps we could see that there were propagations
locked on this situation and QMON slaves too.
The library cache lock was on the queue STRMADMIN.STREAMSOUT_DAILY and the
owner of the lock was process J000 that was stucked ‘single-task message’.

From the information on the systemstate dump we can see that this process has
been on that situation for more than 3 days and 20 hours.
Based on the call stack it seems that the process was about or in the middle
of a call to a listener.

DIAGNOSTIC ANALYSIS:
——————–
Files to be uploaded:

USA:
– RDA.zip , RDA node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 2
– gctgcmu2_ora_21349.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 2
– gctgcmu1_ora_858.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 1.

HKG:
– alert_GCTHKGU1.log, alert log file node 1
– alert_GCTHKGU2.log, alert log file node 2
– listener_hkg0223xus.log_20060320, listener log file node 1.
– listener_hkg0224xus.log_20060319, listener log file node 2.

LON:
– alert_GCTLONU1.log, alert log file node 1
– alert_GCTLONU2.log, alert log file node 2
– listener_lon0223xus.log_20060318, listener log file node 1.
– listener_lon3166xus.log_20060320, listener log file node 2.

Hang was identified at USA on node 1.

Analysis of gctgcmu1_ora_858.trc:

System State 1
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

System State 2
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

LOCK: handle=25ea4afc8 QUEU:STRMADMIN.STREAMSOUT_DAILY

Resource Holder State
LOCK: handle=25ea4afc8 50: waiting for ‘single-task message’

PROCESS 50:
~~~~~~~~~~~

OSD pid info: Unix process pid: 12949, image: oracle@slodb001 (J000)

From the call stack looks like process was doing an SQL*Net connection to a
listenerand doing this
connection we get hanged because the stack trace is the same for the two
process dumps shown on
both system states.

(session)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 92/STRMADMIN
waiting for ‘single-task message’ blocking sess=0x(nil) seq=5 wait_time=0
seconds since wait
started=332308
=0, =0, =0

It is running a propagation job: next_date :=
sys.dbms_aqadm.aq$_propaq(job);

Knowing that the seconds in wait are 332308 and the dump of the process
occurred on 2006-03-21 08:15:12.640,
this process is waiting on this event:

1. On USA the process is waiting since 17-MAR-2006 11:56:44
2. That time on London is 17-MAR-2006 16:56:44
3. That time on HONG KONG is 18-MAR-2006 01:56:44

At that time nothing relevant was found on alert log files on listener log
files.

Processes waiting on library cache lock
~~~~~~~~~~~~~~~

21 (PZ99) : Slave of J002
42 (q004) : Slave of process 44 (QMNC)
51 (J001) :
67 (q003) : Slave of process 44 (QMNC)
71 (q005) : Slave of process 44 (QMNC)
51 (TNS shadow) :
77 (q007) : Slave of process 44 (QMNC)

WORKAROUND:
———–
No workaround found so far.
Killing the process or bouncing instance should solve the problem.

RELATED BUGS:
————-
No similar bugs found.

This environment is also involved on bug 5089630

REPRODUCIBILITY:
—————-
Problem has reproduced just one time on customer site.

TEST CASE:
———-

STACK TRACE:
————
For processes waiting on library cache lock we can see different call stacks,
here is the one for process 21:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
kslwait kjusuc ksipgetctx kqlmli kgllkal kglget
kglgob kwqicgob kwqicrqo0 kwqmsnrmsg kwqsldqs kwqsif
qerfxFetch qervwFetch qertqoFetch qerpxSlaveFetch qerpxFetch
opiexe kpoal8 opiodr kpoodr upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute

kxfxsStmtExecute kxfxsExecute kxfxsp kxfxmai kxfprdp opirip opidrv sou2o
opimai_real main __libc_start_main _start

For process waiting on single-task message, we can see the following call
stack:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
clntudp_call do_ypcall yp_match internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr
ntacbnd2addr
nsc2addr nscall1 nscall niotns nigcall osncon
kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn
npicon0 kpndbcon OCIKDBLinkConn2 OCIKDBLinkConn kwqpcon kwqpgps
spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefca
pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe
opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv rpiexe kkjex1e kkjsexe kkjrdp

opirip opidrv sou2o opimai_real main libc_start_main start

I do not see any Oracle bug here but it is hard to be
certain without the SQLNET.ORA settings and multiple
snapshots of the problem process. It is quite valid
for an Oracle session to be holding a library cache lock
when it makes an outbound DBlink call. If another session
wants that lock in an incompatible mode then it has to wait.

In this case the blocking session making the outbound link
call appears from the one stack dump to be in OS code
under gethostbyname().
However it could have been spinning down the stack in
Oracle code but I cannot comment on the info we have.

The “single-task message” wait event is started / ended
in kpnconn() so if there was a “spin” it would have to be
above that point in the stack.

In 10.2 there is the option to configure NET to timeout
on outbound connect attempts by setting
SQLNET.OUTBOUND_CONNECT_TIMEOUT = N seconds.
The timeout occurs in niotns() in the stack so if this
is set you would expect a session in this state to then
timeout which may be useful if there are underlying issues with
TCP hostname resolution.
(Watch out for bug 4933023 if you use this SQLNET parameter).

If the issue recurs then get CPU usage info, multiple stacks
and you may want to attach with gdb and see if the code ever
returns above gethostbyname or not to get an idea if this

Sorry , my explanation is probably not clear .

Here is the stack you have but annotated:

Callback to get the short stack. Confirms we are not HUNG in a device
driver.
ksdxfstk
ksdxcb
sspuser
pthread_sighandler_rt
GI___libc_sigaction

Inside OS supplied code executing gethostbyname() C call.
GI___poll
clntudp_call
do_ypcall
yp_match
internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname

Oracle Net code:
snlpcgthstbynm
snlinGetAddrInfo
nttbnd2addr
ntacbbnd2addr
ntacbnd2addr
nsc2addr
nscall1
nscall
niotns <<< This is where SQLNET.OUTBOUND_CONNECT_TIMEOUT
is implemented IF it has been configured.
nigcall

Oracle RDBMS code:
osncon
kpkiadef
upiini
upiah0
kpuatch
OCIServerAttach
kpnconn <<< This is where “single-task message” starts/ends
npicon0

If we just get two stack snapshots of the process, as was the case, and
both have the above stack then we can say:
As the “seq=” of both snaps is the same (5) then we did not return
below kpnconn() in the call stack.

But as poll() is typically a blocking call, even if only for a short
time, then we cannot be sure if this process was blocked permanently
in the poll() waiting on some large timout or similar, or if the
code was executing and looping. If it was looping it could have done so
at any point on the above stack down to kpnconn() where we would have
changed the seq# on the wait event if we exited here.

We can say that as the short stack dump triggered a callout then the
Linux kernel was not stuck inside a device driver.

We can also say it looks like the gethostbyname is in YP code doing a UDP
call so at OS level you may want to check out the name resolution
configuration. You probably dont want to be falling all the way back to
UDP requests to get hostnames resolved.

So if the problem recurs it would help to get:
ps information of the process (to see if it uses CPU at all)

strace for a few minutes – this should help show if the
process is entering / leaving any system call (eg: poll)

A full call stack with arguments from either gdb or errorstack
and these should help show the arguments passed up the stack.

In gdb you can also do something like attach to the problem process
and :
break osncon
break nigcall
break niotns
etc.. up the stack to gethostbyname and then “cont”
and it should stop if it enters into those functions from below
helping indicate a spin/loop point below that point.

It may also be worth getting Linux “crash” information for the
process to see what the process looks like from the Linux side
but follow up with the OS team on that side.

SQLNET.OUTBOUND_CONNECT_TIMEOUT is a backup mechanism to help prevent

This note claimed that one session may hold library cache lock , and wait in single-task message
after it makes an outbound DBlink call.If another session wants that lock in an incompatible mode then it has to wait.
The “single-task message” wait event is started / ended in kpnconn() so if there was a “spin” it would have to be
above that point in the stack. Oracle support advise to set SQLNET.OUTBOUND_CONNECT_TIMEOUT so that
we can expect a session in this state to then timeout which may be useful if there are underlying issues with
TCP hostname resolution. The principle in these cases is identical.
Do deep digging ,you can see this Knowledge:

Hdr: 8427478 10.2.0.3 RDBMS 10.2.0.3 PI/DISTRIB PRODID-5 PORTID-23 ORA-12170
Abstract: WHEN USING DBLINK,IT CONNECT TWICE AND TAKE DOBULE-TIME FOR TIME-OUT( ORA-12170)
PROBLEM:
——–
When client connect to database using TCP/IP, if there are some problems
in the network and server , The client recevies ORA-12170 (ETIMEDOUT)
The timeout time is depend on OS TCP parameter.(e.g. tcp_syn_retries
parameter on Linux platform).

ERROR:
ORA-12170 : TNS:Connect timeout occurred

** sqlnet.log ***************************************************
VERSION INFORMATION:
TNS for Linux: Version 10.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for Linux: Version 10.2.0.3.0 – Production
Time: 10-APR-2009 21:38:41
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-505: Operation timed out
nt secondary err code: 110 <= ETIMEDOUT nt OS err code: 0 Client address:
** sqlnet.log ***************************************************

client server
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |when setting tcp_syn_retries=5(default)
|——-x–>| | It takes about 189 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

But, the connection via DBLINK, it sometimes try to connect twice. and
it takes double-time for time-out.

local site remote site
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
ORA-12170 occur ,but it not returning to client.
and server process of local site try to connect automatically.(2nd connect)
connect() | |
|——-x–>| |
| SYN | |
|——-x–>| | It takes about 378 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

The problem is that it takes dobule-time by this behavior using DBLINK.

DIAGNOSTIC ANALYSIS:
——————–
We got stack trace when trying connect.

1st connect
connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335 <- nscall1()+449<- nscall()+609<- niotns()+1678<- nigcall()+112 <- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463 <- npicon0()+1250<- kpndbcon()+909<- OCIKDBLinkConn2()+40 <- OCIKDBLinkConn()+36<- ddfnet3Share()+151<- kksarm()+480<- kksauc()+481 <- kksCheckCriteria()+1711<- kksCheckCursor()+226 <- kksSearchChildList()+1496 <- kksfbc()+10276<- kkspsc0()+1903<- kksParseCursor()+132<- opiosq0()+1349 <- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093<- opitsk()+1031 <- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161 2nd connect <- connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<- nscall1()+449<- nscall()+609 <- niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<- npicon0()+1250<- kpndbcon()+909 <- OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet2Normal()+204<- kkmfcbrm()+132 <- __PGOSF172_kkmpfcbk()+473<- qcsprfro()+503<- qcsprfro_tree()+302<- qcsprfro_tree()+118 <- qcspafq()+102<- qcspqb()+347<- kkmdrv()+60<- opiSem()+1044<- opiDeferredSem()+283 <- opitca()+275<- __PGOSF302_kksFullTypeCheck()+20<- rpiswu2()+334<- kksLoadChild()+5852 <- kxsGetRuntimeLock()+1314<- kksfbc()+12270<- kkspsc0()+1903<- kksParseCursor()+132 <- opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093 <- opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161<- WORKAROUND: ----------- execute following command before connect. SQL> alter system flush shared_pool;

You can find the kpnconn call in stack trace , so the single-task message may occur in two time dblink reconnect and take a long time.
The document provide one workaround way: flush shared_pool.which I suspected.
In summary , still no Mature program to resolve this problem , All we can do is pray.
In my opinion , Version 10.2.0.4 is stable enough within most layer , but still a lots of bug or unexpected behavior reside in K2(kernel Distributed Execution Layer). Distributed system is a really complicated subject in computer science.

沪ICP备14014813号-2

沪公网安备 31010802001379号