ksvcreate: Process creation failed故障一例

今天上午接到电话,用户反映一套10.2.0.4的数据库出现用户无法登录的症状,随即在家里连上VPN远程支持,登上主机以后尝试测试操作系统认证登录和远程登录,发现已经能够正常登录了;为了了解起因,去查看了告警日志alert.log,发现在11:00左右出现大量的ksvcreate:process creation failed错误,具体错误日志如下:

Wed Mar 23 10:00:17 2011
Process m000 died, see its trace file
Wed Mar 23 10:00:17 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:01:18 2011
Process m000 died, see its trace file
Wed Mar 23 10:01:18 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:02:19 2011
Process m000 died, see its trace file
Wed Mar 23 10:02:19 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 10:54:41 2011
Process P007 died, see its trace file
Wed Mar 23 10:56:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:57:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:58:40 2011
Process P007 died, see its trace file
Wed Mar 23 10:59:18 2011
Process J000 died, see its trace file
Wed Mar 23 10:59:18 2011
kkjcre1p: unable to spawn jobq slave process
Wed Mar 23 10:59:18 2011
Errors in file /oracle/app/oracle/admin/sdh/bdump/sdh_cjq0_717010.trc:

Wed Mar 23 11:01:00 2011
Process m000 died, see its trace file
Wed Mar 23 11:01:00 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:02:01 2011
Process m000 died, see its trace file
Wed Mar 23 11:02:01 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:02:40 2011
Process P007 died, see its trace file
Wed Mar 23 11:03:02 2011
Process m000 died, see its trace file
Wed Mar 23 11:03:02 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:04:03 2011
Process m000 died, see its trace file
Wed Mar 23 11:04:03 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 11:04:40 2011
Process P007 died, see its trace file

ksvcreate: Process creation failed错误信息一般在Oracle实例在创建一些辅助后台进程(如mmon的子进程m00x或者并行子进程p00x等)时出现进程启动失败时出现,而造成该错误的可能性有多种,包括Oracle实例资源不足、操作系统资源不足等等。其中较为常见的是实例instance的process使用达到上限,可以通过查询v$resource_limit视图来了解实例生命周期内是否发生过process总数暴满的情况:

SQL> select * from v$resource_limit;

RESOURCE_NAME                  CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION   LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- --------------------
processes                                       79             800        800                  800
sessions                                        80             813        885                  885
enqueue_locks                                   18             303      10910                10910
enqueue_resources                               18             166       4112            UNLIMITED
ges_procs                                        0               0          0                    0
ges_ress                                         0               0          0            UNLIMITED
ges_locks                                        0               0          0            UNLIMITED
ges_cache_ress                                   0               0          0            UNLIMITED
ges_reg_msgs                                     0               0          0            UNLIMITED
ges_big_msgs                                     0               0          0            UNLIMITED
ges_rsv_msgs                                     0               0          0                    0
gcs_resources                                    0               0          0                    0
gcs_shadows                                      0               0          0                    0
dml_locks                                        0             153       3892            UNLIMITED
temporary_table_locks                            0               3  UNLIMITED            UNLIMITED
transactions                                  3125            3156        973            UNLIMITED
branches                                         0               3        973            UNLIMITED
cmtcallbk                                        0               2        973            UNLIMITED
sort_segment_locks                               0               8  UNLIMITED            UNLIMITED
max_rollback_segments                           19              43        973                65535
max_shared_servers                               0               0  UNLIMITED            UNLIMITED
parallel_max_servers                             6             122        120                 3600

/* 可以看到processes的MAX_UTILIZATION最大使用数目曾到过LIMIT_VALUE限定的800,
   sessions也有类似的情况 */

从以上V$resource_limit视图的输出来看,极有可能是processes总数达到上限导致了新的后台辅助进程创建失败,其实我们可以很方便地验证这一点:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE	10.2.0.1.0	Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SQL> alter system set processes=20 scope=spfile;
System altered.

/* 将processes参数修改到一个较小值 */

SQL> startup force;
ORACLE instance started.

SQL> select resource_name,MAX_UTILIZATION,LIMIT_VALUE from v$resource_limit where resource_name in ('processes','sessions');

RESOURCE_NAME               MAX_UTILIZATION LIMIT_VALUE
------------------------------ --------------- ---------------
processes                    20           20
sessions                     19           27

 /* 从以上输出可以看到process总数曾经到达20的上限 */

/* 观察告警日志可以看到相关ksvcreate: Process(m000) creation failed的记录 */

[maclean@rh8 bdump]$ tail -20 alert_PROD1.log
ksvcreate: Process(m000) creation failed
Process m000 died, see its trace file
Wed Mar 23 22:01:52 2011
ksvcreate: Process(m000) creation failed
Wed Mar 23 22:01:55 2011
Process q000 died, see its trace file
Wed Mar 23 22:01:55 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:07 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:07 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:19 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:19 2011
ksvcreate: Process(q000) creation failed
Wed Mar 23 22:02:31 2011
Process q000 died, see its trace file
Wed Mar 23 22:02:31 2011
ksvcreate: Process(q000) creation failed

不同于10g,在11g中类似的错误出现后会在告警日志中说明process creation即进程创建失败的具体原因,如进程总数达到上限,那么就会出现ORA-00020错误(maximum number of processes (%s) exceeded. All process state objects are in use.Increase the value of the PROCESSES initialization parameter),类似以下日志:

ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
 the next minute. Please look at trace files to see all
 the ORA-20 errors.
Process m001 submission failed with error = 20
Process m002 submission failed with error = 20
Process m003 submission failed with error = 20
Process m003 submission failed with error = 20
Process m002 submission failed with error = 20
2011-03-23 22:10:07.037000 +08:00
Process q001 submission failed with error = 20

实际上ksvcreate: Process creation failed错误在能够了解其发生的root cause的情况下,并不难解决。在上例中我们可以清楚地了解到是因为数据库在实际运行中出现了processes进程总数达到参数设定上限从而导致问题出现,那么可以合理增加初始化参数processes或者通过修正异常频繁的程序客户端登录来解决该问题。

Fractured block found during backing up datafile

一套AIX上的10.2.0.3系统,在rman备份期间告警日志出现如下记录:

======================= alert log record ============================
Hex dump of (file 35, block 1087687) in trace file /oracle/product/10.2.0/admin/MS/udump/ms_ora_103548.trc
Corrupt block relative dba: 0x08d098c7 (file 35, block 1087687)
Fractured block found during backing up datafile
Data in bad block:
 type: 6 format: 2 rdba: 0x08d098c7
 last change scn: 0x0006.44443e06 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x6d910601
 check value in block header: 0xc0b0
 computed block checksum: 0x4286
Reread of blocknum=1087687, file=/dev/vx/rdsk/oradgMS/lv_ms_DB31. found valid data

=========== trace information for process 103548 ========================
Corrupt block relative dba: 0x08d098c7 (file 35, block 1087687)
Fractured block found during backing up datafile
Data in bad block:
 type: 6 format: 2 rdba: 0x08d098c7
 last change scn: 0x0006.44443e06 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x6d910601
 check value in block header: 0xc0b0
 computed block checksum: 0x4286
Reread of blocknum=1087687, file=/dev/vx/rdsk/oradgMS/lv_ms_DB31. found valid data

可以看到该块的type为6,下面列出了Oracle 9.2中已经存在的各种块类型:

Type Description
1 KTU UNDO HEADER
2 KTU UNDO BLOCK
3 KTT SAVE UNDO HEADER
4 KTT SAVE UNDO BLOCK
5 DATA SEGMENT HEADER
6 trans data
7 Unknown
8 Unknown
9 Unknown
10 DATA SEGMENT FREE LIST BLOCK
11 Unknown
12 DATA SEGMENT HEADER WITH FREE LIST BLOCKS
13 Compatibility segment
14 KTU UNDO HEADER W/UNLIMITED EXTENTS
15 KTT SAVE UNDO HEADER W/UNLIMITED EXTENTS
16 DATA SEGMENT HEADER – UNLIMITED
17 DATA SEGMENT HEADER WITH FREE LIST BLKS – UNLIMITED
18 EXTENT MAP BLOCK
19 Unknown
20 Unknown
21 Unknown
22 DATA SEGMENT FREE LIST BLOCK WITH FREE BLOCK COUNT
23 BITMAPPED DATA SEGMENT HEADER
24 BITMAPPED DATA SEGMENT FREELIST
25 BITMAP INDEX BLOCK
26 BITMAP BLOCK
27 LOB BLOCK
28 KTU BITMAP UNDO HEADER – LIMITED EXTENTS
29 KTFB Bitmapped File Space Header
30 KTFB Bitmapped File Space Bitmap
31 TEMP INDEX BLOCK
32 FIRST LEVEL BITMAP BLOCK
33 SECOND LEVEL BITMAP BLOCK
34 THIRD LEVEL BITMAP BLOCK
35 PAGETABLE SEGMENT HEADER
36 PAGETABLE EXTENT MAP BLOCK
37 EXTENT MAP BLOCK OF SYSTEM MANAGED UNDO SEGMENT
38 KTU SMU HEADER BLOCK
39 Unknown
40 PAGETABLE MANAGED LOB BLOCK
41 Unknown
42 Unknown
43 Unknown
44 Unknown
45 Unknown
46 Unknown
47 Unknown

表和索引的块均可能为type 6的trans data;也就是说rman在备份期间读取到该数据块,并在初次读取时发现该块断裂了(Fractured),但Fractured并不代表块就真的corrupted了;从告警日志看rman在初次读取发现该块Fractured后,又再次读取时发现数据块已经恢复正常(found valid data)。所以上述告警日志并代表所列出的数据块存在讹误,很有可能是该数据块所在数据文件在备份期间发生了剧烈的IO操作,当rman读取到该数据块时可能存储正在对其进行写的操作,所以rman在第一次读取时认为该快断裂了(Fractured);之后rman对该块进行reread发现”断裂”现象已不存在,而”Corrupt block”仅仅是一种假象;针对上述问题可以对表或索引进行进一步的analyze..validate操作以确保不存在坏块。
同时上述”Corrupt block误报”现象极有可能是因为在Rman备份期间个别数据文件的IO过于活跃所致(如频繁的dml操作),建议在磁盘活跃度低的时间段运行rman备份工作。

参考文档:

Fractured Block Messages in Alert.log During RMAN Backup of Datafile
    * fact: Oracle Server - Enterprise Edition 8
    * fact: Oracle Server - Enterprise Edition 9
    * fact: Recovery Manager (RMAN)
    * symptom: Fractured block found during backup up datafile
    * symptom: Reread of blocknum found some corrupt data
    * symptom: Analyze table validate structure cascade returns no errors
    * change: NOTE ROLE: The messages are of the form Reread of blocknum=36256, 
       file=/pdscdata/pdsclive/data1/dispatch_data_large2. dbf. 
found same corrupt data *** Corrupt block relative dba: 0xfc008dc0 (file 63, block 36288) 
Fractured block found during backing up datafile Data in bad block - 
type: 0 format: 0 rdba: 0x00000000 last change scn: 0x0000.00000000 seq: 0x0 flg: 0x00 consistency 
value in tail: 0x53494e53 check value in block header: 0x0, block checksum disabled 
spare1: 0x0, spare2: 0x0, spare3: 0x0
    * cause: RMAN backups of datafile are being performed while the datafile is involved in heavy I/O. 
RMAN reads Oracle blocks from disk. If it finds that the block is fractured, which means it is being actively used, 
it performs a reread of the block. If that fails again then the block is assumed to be corrupt. 
By identifying the object that these blocks belong to by following Handling Oracle Block Corruptions in 
Oracle7/8/8i and performing an analyze .. validate structure cascade on the object involved you can 
confirm that the object is not corrupt. 

fix:
Run the backups when the tablespace has less I/O activity.

single quote needed in expdp query?

如果在使用数据泵时不采用parfile参数文件的话,query参数指定的查询条件是需要使用单引号括起来的,而当使用parfile时则不需要加上单引号,加上后反而会出现LPX-314: an internal failure occurred错误:

[maclean@rh2 mesg]$ expdp maclean/maclean directory=dump  tables=estimate_me query=estimate_me:"where t1<2000"
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 02 November, 2010 22:32:33
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
FLASHBACK automatically enabled to preserve database integrity.
Starting "MACLEAN"."SYS_EXPORT_TABLE_01":  maclean/******** directory=dump tables=estimate_me query=estimate_me:where t1<2000
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 192 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
ORA-31693: Table data object "MACLEAN"."ESTIMATE_ME" failed to load/unload and is being skipped due to error:
ORA-00936: missing expression
Master table "MACLEAN"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_01 is:
/s01/dump/expdat.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_01" completed with 1 error(s) at 22:32:36
/* 不采用参数文件形式,没有加单引号的情况下出现ORA-00936: missing expression错误*/
[maclean@rh2 mesg]$ expdp maclean/maclean directory=dump  tables=estimate_me query=estimate_me:'"where t1<2000"'
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 02 November, 2010 22:33:39
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
FLASHBACK automatically enabled to preserve database integrity.
Starting "MACLEAN"."SYS_EXPORT_TABLE_01":  maclean/******** directory=dump tables=estimate_me query=estimate_me:"where t1<2000"
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 192 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "MACLEAN"."ESTIMATE_ME"                     32.42 KB    1999 rows
Master table "MACLEAN"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_01 is:
/s01/dump/expdat.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_01" successfully completed at 22:33:43
/* 不采用参数文件,在query参数中加上单引号则导出成功 */
[maclean@rh2 mesg]$ cat quote.par
tables=estimate_me
query=estimate_me:'"where t1<2000"'
directory=dump
[maclean@rh2 mesg]$ rm /s01/dump/expdat.dmp
[maclean@rh2 mesg]$ expdp maclean/maclean parfile=quote.par 
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 02 November, 2010 22:35:08
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
FLASHBACK automatically enabled to preserve database integrity.
Starting "MACLEAN"."SYS_EXPORT_TABLE_01":  maclean/******** parfile=quote.par
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 192 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
ORA-31693: Table data object "MACLEAN"."ESTIMATE_ME" failed to load/unload and is being skipped due to error:
ORA-06502: PL/SQL: numeric or value error
ORA-31605: the following was returned from LpxXSLSetTextVar in routine kuxslSetParam:
LPX-314: an internal failure occurred
Master table "MACLEAN"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_01 is:
/s01/dump/expdat.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_01" completed with 1 error(s) at 22:35:11
/* 采用parfile选项,参数文件中query参数加入单引号出现LPX-314: an internal failure occurred错误,导出失败*/
[maclean@rh2 mesg]$ cat quote.par
tables=estimate_me
query=estimate_me:"where t1<2000"
directory=dump
[maclean@rh2 mesg]$ rm /s01/dump/expdat.dmp
[maclean@rh2 mesg]$ expdp maclean/maclean parfile=quote.par           
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 02 November, 2010 22:36:27
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
FLASHBACK automatically enabled to preserve database integrity.
Starting "MACLEAN"."SYS_EXPORT_TABLE_01":  maclean/******** parfile=quote.par
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 192 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "MACLEAN"."ESTIMATE_ME"                     32.42 KB    1999 rows
Master table "MACLEAN"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_01 is:
/s01/dump/expdat.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_01" successfully completed at 22:36:30
/* 采用parfile,参数文件中query未加入单引号,导出成功 */

1. QUERY in Parameter file.

Using the QUERY parameter in a parameter file is the preferred method. Put double quotes around the text of the WHERE clause.

Example to export the following data with the Export Data Pump client:

* from table scott.emp all employees whose job is analyst or whose salary is 3000 or more; and
* from from table hr.departments all deparments of the employees whose job is analyst or whose salary is 3000 or more.

File: expdp_q.par
—————–
DIRECTORY = my_dir
DUMPFILE = exp_query.dmp
LOGFILE = exp_query.log
SCHEMAS = hr, scott
INCLUDE = TABLE:”IN (‘EMP’, ‘DEPARTMENTS’)”
QUERY = scott.emp:”WHERE job = ‘ANALYST’ OR sal >= 3000″
# place following 3 lines on one single line:
QUERY = hr.departments:”WHERE department_id IN (SELECT DISTINCT
department_id FROM hr.employees e, hr.jobs j WHERE e.job_id=j.job_id
AND UPPER(j.job_title) = ‘ANALYST’ OR e.salary >= 3000)”

— Run Export DataPump job:

%expdp system/manager parfile=expdp_q.par

Note that in this example the TABLES parameter cannot be used, because all table names that are specified at the TABLES parameter should reside in the same schema.
2. QUERY on Command line.

The QUERY parameter can also be used on the command line. Again, put double quotes around the text of the WHERE clause.

Example to export the following data with the Export Data Pump client:

* table scott.dept; and
* from table scott.emp all employees whose name starts with an ‘A’

— Example Windows platforms:
— Note that the double quote character needs to be ‘escaped’
— Place following statement on one single line:

D:\> expdp scott/tiger DIRECTORY=my_dir DUMPFILE=expdp_q.dmp
LOGFILE=expdp_q.log TABLES=emp,dept QUERY=emp:\”WHERE ename LIKE ‘A%’\”

— Example Unix platforms:
— Note that all special characters need to be ‘escaped’

% expdp scott/tiger DIRECTORY=my_dir \
DUMPFILE=expdp_q.dmp LOGFILE=expdp_q.log TABLES=emp,dept \
QUERY=emp:\”WHERE ename LIKE \’A\%\’\”

— Example VMS platform:
— Using three double-quote characters

$ expdp scott/tiger DIRECTORY=my_dir –
DUMPFILE=exp_cmd.dmp LOGFILE=exp_cmd.log TABLES=emp,dept –
QUERY=emp:”””WHERE ename LIKE ‘A%'”””

Note that with the original export client two jobs were required:
— Example Windows platforms:
— Place following statement on one single line:

D:\> exp scott/tiger FILE=exp_q1.dmp LOG=exp_q1.log TABLES=emp
QUERY=\”WHERE ename LIKE ‘A%’\”

D:\> exp scott/tiger FILE=exp_q2.dmp LOG=exp_q2.log TABLES=dept

— Example Unix platforms:

> exp scott/tiger FILE=exp_q1.dmp LOG=exp_q1.log TABLES=emp \
QUERY=\”WHERE ename LIKE \’A\%\’\”

> exp scott/tiger FILE=exp_q2.dmp LOG=exp_q2.log TABLES=dept

— Example VMS platform:

$ exp scott/tiger FILE=exp_q1.dmp LOG=exp_q1.log TABLES=emp –
QUERY=”””WHERE ename LIKE ‘A%'”””

$ exp scott/tiger FILE=exp_q2.dmp LOG=exp_q2.log TABLES=dept

Note that with original export client in Oracle8i on VMS, the syntax was different (also note the extra space that is needed between two single quotes):
… QUERY=”‘WHERE ename LIKE \’A%\’ ‘”
That is: [double_quote][single_quote]WHERE ename LIKE [backslash][single_quote]A%[backslash][single_quote][space][single_quote][double_quote]

No way to disable datapump estimate?

昨天晚上去客户现场配合一个导入导出作业;这是一套Solaris 10上的10.2.0.1系统,导出采用expdp数据泵工具,需要导出的数据源是一张大小在120G左右的单表,该表存放了最近一年的数据,实际导出时只需要最近三个月的数据,所以使用了QUERY参数,并指定了并行度为2。
该导出作业之前未经测试过,语句也是临时写的,实际执行导出工作时发现在评估阶段(estimate phase)耗费了大约十多分钟的时间,estimate默认使用blocks模式,即通过计算对象段占用的数据库块来预估dumpfile的大小;此外还有statistics模式通过对象的统计信息来推算导出文件的大小:

The block method of estimate is the least accurate. We investigate how many blocks were allocated for the storage segments for the tables specified. If the table had been recently analyzed, the estimate=statistics would be the most accurate. So remember, it is important to generate statistics before calculating the estimate using estimate=statistics method. Additionally, ESTIMATE=BLOCKS method could be inaccurate when:

a) The table was created with a much bigger initial extent size than was needed for the actual table data

b) Many rows have been deleted from the table, or a very small percentage of each block is used.

# When statistics have been explicitly set via the DBMS_STATS package in order to force certain optimizer plans, instead of being directly collected from the data, ESTIMATE=STATISTICS may not report correct results.

但实际上这2中方式在使用QUERY的情况下都无法正确预估导出文件的大小:

[maclean@rh2 udump]$ expdp maclean/maclean directory=dump  tables=estimate_me query=estimate_me:'"where t1<2000"'  estimate=statistics
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 02 November, 2010 21:28:26
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
FLASHBACK automatically enabled to preserve database integrity.
Starting "MACLEAN"."SYS_EXPORT_TABLE_01":  maclean/******** directory=dump tables=estimate_me query=estimate_me:"where t1<2000" estimate=statistics
Estimate in progress using STATISTICS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
.  estimated "MACLEAN"."ESTIMATE_ME"                     104.7 MB
Total estimation using STATISTICS method: 104.7 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "MACLEAN"."ESTIMATE_ME"                     32.42 KB    1999 rows
Master table "MACLEAN"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_01 is:
/s01/dump/expdat.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_01" successfully completed at 21:28:29
/* statistics模式使用统计来判断导出文件大小,比较容易控制 */
begin
dbms_stats.set_table_stats(ownname => 'MACLEAN',tabname => 'ESTIMATE_ME',numrows => 999999999999999,numblks => 99999999999999999999);
end;
/
PL/SQL procedure successfully completed.
Estimate in progress using STATISTICS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
.  estimated "MACLEAN"."ESTIMATE_ME"                     10244 GB
Total estimation using STATISTICS method: 10244 GB
/* 再次导出时数据泵采用了"作假的"统计信息*/
[maclean@rh2 udump]$ expdp maclean/maclean directory=dump  tables=estimate_me query=estimate_me:'"where t1<2000"'  estimate=blocks
Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 02 November, 2010 21:31:58
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
FLASHBACK automatically enabled to preserve database integrity.
Starting "MACLEAN"."SYS_EXPORT_TABLE_01":  maclean/******** directory=dump tables=estimate_me query=estimate_me:"where t1<2000" estimate=blocks
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
.  estimated "MACLEAN"."ESTIMATE_ME"                       192 MB
Total estimation using BLOCKS method: 192 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "MACLEAN"."ESTIMATE_ME"                     32.42 KB    1999 rows
Master table "MACLEAN"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for MACLEAN.SYS_EXPORT_TABLE_01 is:
/s01/dump/expdat.dmp
Job "MACLEAN"."SYS_EXPORT_TABLE_01" successfully completed at 21:32:01
/* 可以看到以上BLOCKS method预估的dumpfile大小为192MB,而STATISTICS方式预估为104.7MB;
然而实际的使用QUERY参数导出的1999行数据只占用32k的空间!
estimate在这里摆了一个大乌龙
*/

查了一下metalink发现10.2.0.1上存在expdp在estimate评估阶段耗时很久的bug:

EXPDP Slow and Estimate Takes Long Time [ID 822283.1]
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3
This problem can occur on any platform.
Symptoms
Data pump Export is very slow and long time is spent in Estimating the data size.
EXPDP is stuck for long at :
Starting "SYSTEM"."SYS_EXPORT_SCHEMA_01": system/******** parfile=dpexp.par
Estimate in progress using BLOCKS method...
AWR report taken when EXPDP is running slow show the below query took long time to complete .
SELECT /*+rule*/ SYS_XMLGEN(VALUE(KU$),
XMLFORMAT.createFormat2('TABLE_DATA_T', '7')), 0 ,KU$.BASE_OBJ.NAME
,KU$.BASE_OBJ.OWNER_NAME ,'TABLE' ,to_char(KU$.BYTES_ALLOC) ,
to_char(KU$.ET_PARALLEL) ,KU$.FGAC ,KU$.NONSCOPED_REF ,KU$.XMLSCHEMACOLS
,KU$.NAME , KU$.NAME ,'TABLE_DATA' ,KU$.PART_NAME ,KU$.SCHEMA_OBJ.OWNER_NAME
,KU$.TS_NAME , KU$.TRIGFLAG
,decode(KU$.SCHEMA_OBJ.TYPE_NUM,2,decode(bitand(KU$.PROPERTY,8192),8192,
'NESTED TABLE','T'),19,'PARTITION',20,'PARTITION','SUBPARTITION')
,to_char(KU$.UNLOAD_METHOD)
FROM SYS.KU$_TABLE_DATA_VIEW KU$
WHERE NOT BITAND(KU$.BASE_OBJ.FLAGS,128)!=0 AND NOT (BITAND
(KU$.BASE_OBJ.FLAGS,16)=16) AND KU$.BASE_OBJ.NAME IN
('PA_DRAFT_INVOICES_ALL') AND KU$.BASE_OBJ.OWNER_NAME IN ('PA') AND NOT
EXISTS (SELECT 1 FROM SYS.KU$NOEXP_TAB A WHERE A.OBJ_TYPE='TABLE' AND
A.NAME=KU$.BASE_OBJ.NAME AND A.SCHEMA=KU$.BASE_OBJ.OWNER_NAME) AND NOT EXISTS
(SELECT 1 FROM SYS.KU$NOEXP_TAB A WHERE A.OBJ_TYPE='SCHEMA' AND
A.NAME=KU$.BASE_OBJ.OWNER_NAME);
Cause
The issue here is with using RBO.
Data pump is not choosing Cost Based Optimizer.
This is because of the 'unpublished' bug 5929373 -DATA PUMP TAKES LONG TIME TO EXPORT DATA
This bug is closed as duplicate of another bug << 5464834 >>
Solution
1. Upgrade to 10.2.0.4 . The bug is fixed in 10.2.0.4
Or
2. Apply the << patch 5464834 >> if available for your platform .
* Login to metalink
* Go to 'Patches and Updates'
* Click on 'Simple Search'
* Enter the patch number 5464834
* Select your platform from the dropdown list .
* Patch,if available , will be displayed .
* Download and apply the patch .

该Bug可以通过升级版本解决,但远水解不了近火,最初想disable掉estimate功能;但查阅若干资料后发现目前似乎没有能够关掉estimate功能的办法,这个就是悲哀所在了!耐心等estimate阶段完成后,实际数据导出阶段倒也只花了十来分钟,就这一点看来datapump很有中国式的办事风格。

另外当指定的导出文件数小于指定的parallelism,且导出数据量较大时可能出现ORA-39095错误,会导致导出作业意外终止:

39095, 00000, "Dump file space has been exhausted: Unable to allocate %s bytes"
// *Cause:  The Export job ran out of dump file space before the job was
//          completed.
// *Action: Reattach to the job and add additional dump files to the job
//          restarting the job.
EXPDP generating ORA-39095 : " dump file space has been exhausted"
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.2.0.2 - Release: 10.2 to 11.2
Information in this document applies to any platform.
Symptoms
Checked for relevance on 10-22-2010
Errors while performing Datapump Export:
- when exporting a large number of tables then it completely stops
- the jobs may remain in the datapump export tables
- No trace generated at the time of the failure.
ERROR:
ORA - 39095 : " dump file space has been exhausted. Unable to allocate 4096 bytes".jobsystem.sys_export_full_02 stops due to fatal error.
Cause
Customer chose only limited number of files in "dumpfile" parameter with limited "filesize" , i.e.:
full=Y
directory=dpump_dir
dumpfile=full_1.dmp,full_2.dmp,full_3.dmp,full_4.dmp
filesize=3000000000
This means 4 dump files with filesize 3000000000 byte (2.79 G).
the dumpfile parameter is the issue here.
.
Solution
1- You can use the dynamic format (i.e dumpfile=full_%U.dmp) :
The 'wildcard' specification for the dump file can expand up to 99 files. If 99 files have been generated before the export has completed, it will again return the ORA-39095 error.
2- If this is yet not enough and more files are needed, a workaround would be to speficy a bigger 'filesize' parameter.
3- If this is inconvenient, another option is to use this syntax:
dumpfile=fullexp%U.dmp, fullexp2_%U.dmp, fullexp3_%U.dmp
which can expand up to 3*99 files.
If encountering problems containing the dump in a single directory using this solution, you may prefer this syntax:
dumpfile=dmpdir1:fullexp1_%U.dmp, dmpdir2:fullexp2_%U.dmp, dmpdir3:fullexp3_U.dmp
(assuming the 3 directory objects listed above had been already created first).

解决方法很简单就是去掉并行度,或者指定动态的导出文件名。
记以录之,回去补觉了!

  • The block method of estimate is the least accurate. We investigate how many blocks were allocated for the storage segments for the tables specified. If the table had been recently analyzed, the estimate=statistics would be the most accurate. So remember, it is important to generate statistics before calculating the estimate using estimate=statistics method. Additionally, ESTIMATE=BLOCKS method could be inaccurate when:a) The table was created with a much bigger initial extent size than was needed for the actual table data

    b) Many rows have been deleted from the table, or a very small percentage of each block is used.

  • When statistics have been explicitly set via the DBMS_STATS package in order to force certain optimizer plans, instead of being directly collected from the data, ESTIMATE=STATISTICS may not report correct results.

Poor Data Export Performance(revised)

这是一个十分简单的单表导出作业,表上大约有200万条数据;但实际exp导出时dump文件每秒增长不足1MB,速度十分缓慢。

SQL> show user;
USER is "SH"
SQL> select count(*) from orders;
COUNT(*)
----------
2319232
SQL> select bytes/1024/1024 from dba_segments where owner='SH' and segment_name='ORDERS';
BYTES/1024/1024
---------------
261
/*该表大小为261MB*/
/*这里我们使用直接路径导出*/
[maclean@rh2 ~]$ exp sh/sh file=/s01/orders.dmp direct=y tables=orders
Export: Release 11.2.0.1.0 - Production on Wed Aug 18 21:52:56 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses WE8MSWIN1252 character set (possible charset conversion)
About to export specified tables via Direct Path ...
. . exporting table                         ORDERS
/* 我们通过脚本观察导出文件大小增长速度,每秒大约0.4MB */
[maclean@rh2 s01]$ while (1>0) do ls -lh orders.dmp ; sleep 1; done
-rw-r--r-- 1 maclean oinstall 912K Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 1.4M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 1.8M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 2.2M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 2.6M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 3.0M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 3.5M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 3.9M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 4.3M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 4.7M Aug 18 21:54 orders.dmp
-rw-r--r-- 1 maclean oinstall 5.1M Aug 18 21:54 orders.dmp
/*是什么导致了exp如此缓慢呢,有必看一下该导出服务进程的trace文件*/
SQL> select spid from v$process where addr=(select paddr from v$session where module like 'exp@%');
SPID
------------------------
22624
SQL> oradebug setospid 22624;
Oracle pid: 37, Unix process pid: 22624, image: oracle@rh2 (TNS V1-V3)
SQL> oradebug tracefile_name;
/rdbms/sprod/SPROD/trace/SPROD_ora_22624.trc
SQL> host
[maclean@rh2 trace]$ ls -lh SPROD_ora_22624.trc
-rw-r----- 1 maclean oinstall 1.2G Aug 18 21:58 SPROD_ora_22624.trc
/* 这个trace文件居然有1.2G之巨,会存放了些什么呢?*/
[maclean@rh2 trace]$ tail -100 SPROD_ora_22624.trc
2010-08-18 21:59:14.748194 : nsbasic_bsd:41 20 43 4C 41 53 53 07  |A.CLASS.|
2010-08-18 21:59:14.748201 : nsbasic_bsd:00 78 6D 08 0F 01 1A 2E  |.xm.....|
2010-08-18 21:59:14.748212 : nsbasic_bsd:07 00 78 6D 08 0F 01 1A  |..xm....|
2010-08-18 21:59:14.748219 : nsbasic_bsd:2E 13 00 32 30 30 39 2D  |...2009-|
2010-08-18 21:59:14.748225 : nsbasic_bsd:30 38 2D 31 35 3A 30 30  |08-15:00|
2010-08-18 21:59:14.748232 : nsbasic_bsd:3A 32 35 3A 34 35 05 00  |:25:45..|
2010-08-18 21:59:14.748239 : nsbasic_bsd:43 48 45 43 4B 01 00 4E  |CHECK..N|
2010-08-18 21:59:14.748246 : nsbasic_bsd:01 00 4E 01 00 4E 02 00  |..N..N..|
2010-08-18 21:59:14.748253 : nsbasic_bsd:C1 02 FE FF 06 00 50 55  |......PU|
2010-08-18 21:59:14.748260 : nsbasic_bsd:42 4C 49 43 15 00 2F 39  |BLIC../9|
/*绝大多数是Oracle sqlnet trace的内容,是不是因为设置了Oracle SqlNet端的trace选项,从而导致了导出服务进程写出大量sqlnet trace,最终引发exp的缓慢?*/
[maclean@rh2 trace]$ cat $ORACLE_HOME/network/admin/sqlnet.ora
TRACE_LEVEL_SERVER = 16
TRACE_FILE_SERVER = SERVER
TRACE_TIMESTAMP_ SERVER = ON
TRACE_UNIQUE_SERVER = ON
/*的确启用了server端sqlnet trace*/
[maclean@rh2 trace]$ echo "" >  /s01/11gdb/network/admin/sqlnet.ora
/*清空sqlnet.ora配置文件,禁用server端sqlnet trace*/
/*再次尝试exp导出*/
[maclean@rh2 ~]$ exp sh/sh file=/s01/orders.dmp direct=y tables=orders
Export: Release 11.2.0.1.0 - Production on Wed Aug 18 22:00:35 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses WE8MSWIN1252 character set (possible charset conversion)
About to export specified tables via Direct Path ...
. . exporting table                         ORDERS
[maclean@rh2 s01]$ while (1>0) do ls -lh orders.dmp ; sleep 1; done
-rw-r--r-- 1 maclean oinstall 61M Aug 18 22:00 orders.dmp
-rw-r--r-- 1 maclean oinstall 94M Aug 18 22:00 orders.dmp
-rw-r--r-- 1 maclean oinstall 108M Aug 18 22:00 orders.dmp
-rw-r--r-- 1 maclean oinstall 140M Aug 18 22:00 orders.dmp
-rw-r--r-- 1 maclean oinstall 162M Aug 18 22:00 orders.dmp
-rw-r--r-- 1 maclean oinstall 181M Aug 18 22:00 orders.dmp
/*dump文件每秒增长20MB左右,exp导出速度恢复正常*/
/*验证了之前设置server端16级的sqlnet trace会引起exp性能下降的猜测*/
/*如果是expdp数据泵导出工具是否会受到该sqlnet trace的影响呢*/
[maclean@rh2 s01]$ echo "TRACE_LEVEL_SERVER = 16
> TRACE_FILE_SERVER = SERVER
> TRACE_TIMESTAMP_ SERVER = ON
> TRACE_UNIQUE_SERVER = ON" > $ORACLE_HOME/network/admin/sqlnet.ora
[maclean@rh2 dump]$ expdp sh/sh directory=ordump dumpfile=orders.dmp tables=orders
Export: Release 11.2.0.1.0 - Production on Wed Aug 18 22:05:17 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "SH"."SYS_EXPORT_TABLE_01":  sh/******** directory=ordump dumpfile=orders.dmp tables=orders
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 261 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "SH"."ORDERS"                               223.8 MB 2319232 rows
Master table "SH"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SH.SYS_EXPORT_TABLE_01 is:
/s01/dump/orders.dmp
Job "SH"."SYS_EXPORT_TABLE_01" successfully completed at 22:05:29
/* expdp数据泵在12s内完成了导出工作,可见其并不受到sqlnet trace的不良影响*/
/* 如果我们降低Sqlnet trace的级别,是否可以降低其对exp导出性能的影响呢?*/
/*尝试将trace level改成1*/
[maclean@rh2 s01]$ echo "TRACE_LEVEL_SERVER = 1
> TRACE_FILE_SERVER = SERVER
> TRACE_TIMESTAMP_ SERVER = ON
> TRACE_UNIQUE_SERVER = ON"   >  $ORACLE_HOME/network/admin/sqlnet.ora
maclean@rh2 ~]$ exp sh/sh file=/s01/orders.dmp direct=y tables=orders
Export: Release 11.2.0.1.0 - Production on Wed Aug 18 22:11:57 2010
Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses WE8MSWIN1252 character set (possible charset conversion)
About to export specified tables via Direct Path ...
. . exporting table                         ORDERS    2319232 rows exported
EXP-00091: Exporting questionable statistics.
Export terminated successfully with warnings.
[maclean@rh2 s01]$ while (1>0) do ls -lh orders.dmp ; sleep 1; done
-rw-r--r-- 1 maclean oinstall 0 Aug 18 22:11 orders.dmp
-rw-r--r-- 1 maclean oinstall 33M Aug 18 22:11 orders.dmp
-rw-r--r-- 1 maclean oinstall 74M Aug 18 22:11 orders.dmp
-rw-r--r-- 1 maclean oinstall 108M Aug 18 22:12 orders.dmp
-rw-r--r-- 1 maclean oinstall 146M Aug 18 22:12 orders.dmp
-rw-r--r-- 1 maclean oinstall 183M Aug 18 22:12 orders.dmp
/* 导出速度没有明显下降,较低级别的server端sqlnet trace不会对exp导出性能造成影响*/

这个例子告诉我们在使用高级别的sqlnet trace诊断Oracle网络问题后,一定要记得要还原现场的配置,否则可能造成”莫名的性能问题”。

single-task message and cursor pin S on X

Recently i find many session are waiting for Event <cursor pin s on x> in my customer’s AIX 10.2.0.4 instance, it’s a new event in the 10gr2,after mutex take  place of latch .

When a session is going to parse or reparse one cursor, need to pin X this cursor first. Then the others who what to execute this child cursor  have to wait until pin X released.

I looked for the session who was pining that cursor on  X mode ,and saw it was waiting on event single-task messag:
the mutex idn is d322e445,and operated in EXCL mode; It correspond to the cursor whose hash is “ff2c2fd459ac5ee188586781d322e445” as after:

PROCESS 1753:
----------------------------------------
SO: 70000048e9d5e68, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=1753, calls cur/top: 70000043308cb08/70000043798a6e0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 70000048f8deaf0 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 70000048f8deaf0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000048ca877c0
O/S info: user: orauser, term: UNKNOWN, ospid: 3318124
OSD pid info: Unix process pid: 3318124, image: oracle@p595crm1
.........................
----------------------------------------
SO: 70000048eed0d30, type: 4, owner: 70000048e9d5e68, flag: INIT/-/-/0x00
(session) sid: 3311 trans: 70000046be86948, creator: 70000048e9d5e68, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-06D9-008347C7, short-term DID: 0001-06D9-008347C8
txn branch: 70000046a261488
oct: 3, prv: 0, sql: 700000414e4f570, psql: 700000415f47a90, user: 50/SHUCRM1C
service name: CRMDB1
O/S info: user: pausr12, term: unknown, ospid: 1234, machine: p570web2
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
waiting for 'single-task message' blocking sess=0x0 seq=22963 wait_time=0 seconds since wait started=6
=0, =0, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=325
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=383
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=265
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=677
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=0
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=237
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
temporary object counter: 0
SO:  70000043e695968, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043e695968 handle=700000410dd46c8 mode=N                           --the child cursor
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043e6959e8[70000041e9eb830,70000042411c178] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=CBK[0020]  savepoint=0x0
LIBRARY OBJECT HANDLE: handle=700000410dd46c8 mtx=700000410dd47f8(0)  cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001  lock=N pin=X latch#=23 hpc=fffc  hlc=fffc                                     --lock  in NULL mode,pin on X mode
lwt=700000410dd4770[700000410dd4770,700000410dd4770]  ltm=700000410dd4780[700000445169a08,700000453b69228]
pwt=700000410dd4738[700000410dd4738,700000410dd4738]  ptm=700000410dd4748[700000410dd4748,700000410dd4748]
ref=700000410dd47a0[70000043d1b5e58,70000043d1b5e58]  lnd=700000410dd47b8[700000410dd47b8,700000410dd47b8]
LIBRARY OBJECT: object=7000004354983b0
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
TRANSLATIONS: count=2 size=16
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004359e4768 7000004354984c8 I/P/A/-/-    0 NONE   00
6 700000414b0e4d0 700000450c0bc28 I/P/A/-/E    0 NONE   00
----------------------------------------
KGX Atomic  Operation Log 70000041853c068
Mutex 700000414b0e3d8(3311, 0) idn d322e445 oper EXCL
Cursor Pin uid 3311 efd 0 whr 1 slp  0      -- pin sid 3311
opr=3 pso=70000043e695968  flg=0            -- operate code is 3
pcs=700000414b0e3d8 nxt=0 flg=35 cld=0 hd=700000410dd46c8   par=70000041d78b1e0    -- child cursor handle address is 700000410dd46c8
ct=0 hsh=0 unp=0 unn=0 hvl=1d78b4b8 nhv=1  ses=70000048eed0d30                              -- heap 0 pointer address 70000041d78b1e0
hep=700000414b0e458 flg=80 ld=1 ob=7000004354983b0 ptr=700000450c0bc28  fex=700000450c0af38
----------------------------------------
SO: 70000041e9eb7b0, type: 53, owner: 70000048eed0d30, flag:  INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000041e9eb7b0 handle=700000414e4f570 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000041e9eb830[70000044590d030,70000043e6959e8] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=[0000]  savepoint=0x4c08a856
LIBRARY OBJECT HANDLE: handle=700000414e4f570 mtx=700000414e4f6a0(0)  cdp=1
name=
SELECT * FROM(SELECT A.*, rownum r FROM( select  account_id,billing_nbr,calling_nbr,called_nbr,to_date,call_duration,charge_item_name,to_char(rate  /100, '99999999999990.99') rate, charge,channel_id from ct_05
where  1=1 and account_id ='300187744'  and key_source_type ='52001'  order by to_date ) A WHERE rownum <= 15 ) B WHERE r > 0
hash=ff2c2fd459ac5ee188586781d322e445 timestamp=06-04-2010 15:14:33
namespace=CRSR flags=RON/KGHP/TIM/PN0/MED/DBN/[50010040]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=23 hpc=0018 hlc=0018
lwt=700000414e4f618[700000414e4f618,700000414e4f618]  ltm=700000414e4f628[700000414e4f628,700000414e4f628]
pwt=700000414e4f5e0[700000414e4f5e0,700000414e4f5e0]  ptm=700000414e4f5f0[700000414e4f5f0,700000414e4f5f0]
ref=700000414e4f648[700000414e4f648,700000414e4f648]  lnd=700000414e4f660[700000414e4f660,700000414e4f660]
LIBRARY OBJECT: object=70000041d78b0c8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child#    table reference   handle
------ -------- --------- --------
0 70000043d1b61e8 70000043d1b5e58 700000410dd46c8                                 -- the child cursor
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004467f6078 70000041d78b1e0  I/P/A/-/-    0 NONE   00                                      --  heap 0

The session is active,and its wait history indicates there are some dblink operation inside the running SQL, When this session waited to reparse the only child cursor for that SQL , it pin this child cursor in X mode first,and then it need more information from remote DB , but no response until SQL*NET MESSAGE from DBLINK timeout,it was waiting single-task message for more than 6 seconds when tracing.

In this scenario other sessions who wait to execute this child cursor were all waiting , so sadly.

The oracle document describes <single-task message> as When running single task, this event indicates that the session waits for the client side of the executable.

A little odd,the blocked also did some single task when it lost one dblink connection.

I searched the metalink with keyword: single-task message, there were few documents  found , but fortunately below note:

Hdr: 7757687 10.2.0.3.0 RDBMS 10.2.0.3.0 PRG INT/DISTR PRODID-5 PORTID-212
Abstract: CURSOR: PIN S WAIT ON X BLOCKER “SINGLE-TASK MESSAGE”

PROBLEM:
——–
The databse has multiple processes waiting on “CURSOR: PIN S WAIT ON X”

The holder of the mutes is waiitng on “single-task message” form almost
745687
seconds.
The query run by the holder  does conatin a dblink.

DIAGNOSTIC ANALYSIS:
——————–
All the  process waiting on the same mutex:-

waiting for ‘cursor: pin S wait on X’ blocking sess=0x0
idn=5d779f21, value=7af00000000, where|sleeps=57a0dfdb6

Holder
PROCESS 26:
KGX Atomic Operation Log 700000061490300
Mutex 70000006226e0f0(1967, 0) idn 5d779f21 oper EXCL
Cursor Pin uid 1967 efd 0 whr 1 slp 0
opr=3 pso=7000000670f93a0 flg=0
pcs=70000006226e0f0 nxt=0 flg=35 cld=0 hd=7000000606838f0
par=7000000629006a8
ct=0 hsh=0 unp=0 unn=0 hvl=62900980 nhv=1 ses=700000079543fe0
hep=70000006226e170 flg=80 ld=1 ob=700000062fe69a8
ptr=700000065f124f8 fex=700000065f11808

Session object for this process:-

SO: 700000079543fe0, type: 4, owner: 70000007a6ad538, flag: INIT/-/-/0x00
(session) sid: 1967 trans: 0, creator: 70000007a6ad538, flag: (41)
USR/- BSY/-/-/-/-/-
DID: 0001-001A-00000A38, short-term DID: 0001-001A-00000A39
txn branch: 0
oct: 3, prv: 0, sql: 700000062900e18, psql:
70000005d797bc8, user: 21/V500
O/S info: user: d_c11, term: , ospid: 6271140, machine: aruput5
program: cpm_srvscript@aruput5 (TNS V1-V3)
application name: cpm_srvscript@aruput5 (TNS V1-V3), hash value=0
waiting for ‘single-task message’ blocking sess=0x0 seq=7071
wait_time=0 seconds since wait
started=745687

This bug which occurs on Version 10.2.0.3 is so similar to my case on AIX version 10.2.0.4, so as the stack trace:

ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttrd+0120<-nsprecv+07a0<-
nscon+0218<-nsdo+157c<-nscall3+012c<-nscall+0778<-niotns+0888<-nigcall+0028<-osnco
n+0540<-kpkiadef+006c<-upiini+03fc<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+011
4<-kpnconn+02a4<-npicon0+042c<-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn
+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree
+0318<-qcsprfro_tree+01c8<-qcspafq+0068<-qcspqb+0470<-kkmdrv+003c<-opiSem+13b4<-opiDe
ferredSem+0234<-opitca+01e8<-kksFullTypeCheck+001c<-rpiswu2+034c<-kksLoadChild+30b0<-
kxsGetRuntimeLock+0810<-kksfbc+2930<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0b30<
-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1020<-opitsk+10b0<-opiino+0990<-opiod
r+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

My case stack trace:
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttcni+01b8<-nttcon+04f4<-ntconn+0160<
-nsopen+0960<-nscall1+01b0<-nscall+049c<-niotns+0880<-nigcall+0028<-osncon+0540<-kpkiadef+006c<
-upiini+0408<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+0114<-kpnconn+02a4<-npicon0+042c<
-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<
-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<-qcspafq+0068<
-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-kkmevw+0a58<-kkmfcbvw+0178<
-kkmfcblo+0e38<-kkmpfcbk+0250<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c<
-kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<
-qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-kkmdrv+003c<-opiSem+13c0<
-opiprs+01ac<-kksParseChildCursor+05e0<-rpiswu2+034c<-kksLoadChild+1d88<
-kxsGetRuntimeLock+0810<-kksfbc+28b0<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0ae0<
-kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<
-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098

Both of them have called OCIKDBLinkConn->OCIKDBLinkConn2->kpndbcon->npicon0->kpnconn->OCIServerAttach->kpuatch and follow.
Till now this bug has no patch to apply or workaround way what’s awful.

As “Bug 5111335: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO” described:

Hdr: 5111335 10.2.0.2 RDBMS 10.2.0.2 STREAMS PRODID-5 PORTID-226
Abstract: STREAMS PROPAGATION STUCK ON “SINGLE-TASK MESSAGE” PRODUCE LIBRARY CACHE LOCK CO
This is a three cluster environment. Two nodes per cluster and on that
cluster running RAC with 2 instances.
The databases involved are:

GCTGCMU is the “GCT” database in GCM = Greenwich USA, will be named USA from
now on
GCTHKGU is the “GCT” database in HKG = Hong Kong, will be named HKG from now
on
GCTLONU is the “GCT” database in LON = London, will be named LON from now on

On this environment there is a Streams bi-directional replication environment
defined between the 3 databases.

USA database is propagating changes captured to LON and HKG.
There are two propagations defined from this site to every site. 4
propagations as a total.

GCTHKGU_BATCH_PROPAGATE
GCTLONU_BATCH_PROPAGATE
GCTHKGU_DAILY_PROPAGATE
GCTLONU_DAILY_PROPAGATE

Daily propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_DAILY.
Batch propagation propagates changes stored on streams queue
STRMADMIN.STREAMSOUT_BATCH

This databases have been upgraded to 10.2.0.2 from 10.2.0.1 last 13-Mar-2006,
and no problems were reported.

Customer reported that he could see that propagations from USA to HKG were
stucked on library cache lock.
This situation was solved by customer by bouncing USA instances.

Analyzing the systemstate dumps we could see that there were propagations
locked on this situation and QMON slaves too.
The library cache lock was on the queue STRMADMIN.STREAMSOUT_DAILY and the
owner of the lock was process J000 that was stucked ‘single-task message’.

From the information on the systemstate dump we can see that this process has
been on that situation for more than 3 days and 20 hours.
Based on the call stack it seems that the process was about or in the middle
of a call to a listener.

DIAGNOSTIC ANALYSIS:
——————–
Files to be uploaded:

USA:
– RDA.zip , RDA node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 1
– GCTGCMU1_healthchk.html, streams healthcheck node 2
– gctgcmu2_ora_21349.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 2
– gctgcmu1_ora_858.trc, hanganalyze and systemstate dumps when the hang is
ocurring. Node 1.

HKG:
– alert_GCTHKGU1.log, alert log file node 1
– alert_GCTHKGU2.log, alert log file node 2
– listener_hkg0223xus.log_20060320, listener log file node 1.
– listener_hkg0224xus.log_20060319, listener log file node 2.

LON:
– alert_GCTLONU1.log, alert log file node 1
– alert_GCTLONU2.log, alert log file node 2
– listener_lon0223xus.log_20060318, listener log file node 1.
– listener_lon3166xus.log_20060320, listener log file node 2.

Hang was identified at USA on node 1.

Analysis of gctgcmu1_ora_858.trc:

System State 1
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

System State 2
~~~~~~~~~~~~~~
21: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=21
42: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=404
51: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=17
67: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=90
71: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=56
76: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=344
Cmd: PL/SQL Execute
77: waiting for ‘library cache lock’ [LOCK: handle=25ea4afc8] seq=87

LOCK: handle=25ea4afc8 QUEU:STRMADMIN.STREAMSOUT_DAILY

Resource Holder State
LOCK: handle=25ea4afc8 50: waiting for ‘single-task message’

PROCESS 50:
~~~~~~~~~~~

OSD pid info: Unix process pid: 12949, image: oracle@slodb001 (J000)

From the call stack looks like process was doing an SQL*Net connection to a
listenerand doing this
connection we get hanged because the stack trace is the same for the two
process dumps shown on
both system states.

(session)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 92/STRMADMIN
waiting for ‘single-task message’ blocking sess=0x(nil) seq=5 wait_time=0
seconds since wait
started=332308
=0, =0, =0

It is running a propagation job: next_date :=
sys.dbms_aqadm.aq$_propaq(job);

Knowing that the seconds in wait are 332308 and the dump of the process
occurred on 2006-03-21 08:15:12.640,
this process is waiting on this event:

1. On USA the process is waiting since 17-MAR-2006 11:56:44
2. That time on London is 17-MAR-2006 16:56:44
3. That time on HONG KONG is 18-MAR-2006 01:56:44

At that time nothing relevant was found on alert log files on listener log
files.

Processes waiting on library cache lock
~~~~~~~~~~~~~~~

21 (PZ99) : Slave of J002
42 (q004) : Slave of process 44 (QMNC)
51 (J001) :
67 (q003) : Slave of process 44 (QMNC)
71 (q005) : Slave of process 44 (QMNC)
51 (TNS shadow) :
77 (q007) : Slave of process 44 (QMNC)

WORKAROUND:
———–
No workaround found so far.
Killing the process or bouncing instance should solve the problem.

RELATED BUGS:
————-
No similar bugs found.

This environment is also involved on bug 5089630

REPRODUCIBILITY:
—————-
Problem has reproduced just one time on customer site.

TEST CASE:
———-

STACK TRACE:
————
For processes waiting on library cache lock we can see different call stacks,
here is the one for process 21:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
kslwait kjusuc ksipgetctx kqlmli kgllkal kglget
kglgob kwqicgob kwqicrqo0 kwqmsnrmsg kwqsldqs kwqsif
qerfxFetch qervwFetch qertqoFetch qerpxSlaveFetch qerpxFetch
opiexe kpoal8 opiodr kpoodr upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute

kxfxsStmtExecute kxfxsExecute kxfxsp kxfxmai kxfprdp opirip opidrv sou2o
opimai_real main __libc_start_main _start

For process waiting on single-task message, we can see the following call
stack:

ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll
clntudp_call do_ypcall yp_match internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr
ntacbnd2addr
nsc2addr nscall1 nscall niotns nigcall osncon
kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn
npicon0 kpndbcon OCIKDBLinkConn2 OCIKDBLinkConn kwqpcon kwqpgps
spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefca
pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe
opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv rpiexe kkjex1e kkjsexe kkjrdp

opirip opidrv sou2o opimai_real main libc_start_main start

I do not see any Oracle bug here but it is hard to be
certain without the SQLNET.ORA settings and multiple
snapshots of the problem process. It is quite valid
for an Oracle session to be holding a library cache lock
when it makes an outbound DBlink call. If another session
wants that lock in an incompatible mode then it has to wait.

In this case the blocking session making the outbound link
call appears from the one stack dump to be in OS code
under gethostbyname().
However it could have been spinning down the stack in
Oracle code but I cannot comment on the info we have.

The “single-task message” wait event is started / ended
in kpnconn() so if there was a “spin” it would have to be
above that point in the stack.

In 10.2 there is the option to configure NET to timeout
on outbound connect attempts by setting
SQLNET.OUTBOUND_CONNECT_TIMEOUT = N seconds.
The timeout occurs in niotns() in the stack so if this
is set you would expect a session in this state to then
timeout which may be useful if there are underlying issues with
TCP hostname resolution.
(Watch out for bug 4933023 if you use this SQLNET parameter).

If the issue recurs then get CPU usage info, multiple stacks
and you may want to attach with gdb and see if the code ever
returns above gethostbyname or not to get an idea if this

Sorry , my explanation is probably not clear .

Here is the stack you have but annotated:

Callback to get the short stack. Confirms we are not HUNG in a device
driver.
ksdxfstk
ksdxcb
sspuser
pthread_sighandler_rt
GI___libc_sigaction

Inside OS supplied code executing gethostbyname() C call.
GI___poll
clntudp_call
do_ypcall
yp_match
internal_gethostbyname2_r
nss_nis_gethostbyname_r
gethostbyname_r
gethostbyname

Oracle Net code:
snlpcgthstbynm
snlinGetAddrInfo
nttbnd2addr
ntacbbnd2addr
ntacbnd2addr
nsc2addr
nscall1
nscall
niotns <<< This is where SQLNET.OUTBOUND_CONNECT_TIMEOUT
is implemented IF it has been configured.
nigcall

Oracle RDBMS code:
osncon
kpkiadef
upiini
upiah0
kpuatch
OCIServerAttach
kpnconn <<< This is where “single-task message” starts/ends
npicon0

If we just get two stack snapshots of the process, as was the case, and
both have the above stack then we can say:
As the “seq=” of both snaps is the same (5) then we did not return
below kpnconn() in the call stack.

But as poll() is typically a blocking call, even if only for a short
time, then we cannot be sure if this process was blocked permanently
in the poll() waiting on some large timout or similar, or if the
code was executing and looping. If it was looping it could have done so
at any point on the above stack down to kpnconn() where we would have
changed the seq# on the wait event if we exited here.

We can say that as the short stack dump triggered a callout then the
Linux kernel was not stuck inside a device driver.

We can also say it looks like the gethostbyname is in YP code doing a UDP
call so at OS level you may want to check out the name resolution
configuration. You probably dont want to be falling all the way back to
UDP requests to get hostnames resolved.

So if the problem recurs it would help to get:
ps information of the process (to see if it uses CPU at all)

strace for a few minutes – this should help show if the
process is entering / leaving any system call (eg: poll)

A full call stack with arguments from either gdb or errorstack
and these should help show the arguments passed up the stack.

In gdb you can also do something like attach to the problem process
and :
break osncon
break nigcall
break niotns
etc.. up the stack to gethostbyname and then “cont”
and it should stop if it enters into those functions from below
helping indicate a spin/loop point below that point.

It may also be worth getting Linux “crash” information for the
process to see what the process looks like from the Linux side
but follow up with the OS team on that side.

SQLNET.OUTBOUND_CONNECT_TIMEOUT is a backup mechanism to help prevent

This note claimed that one session may hold library cache lock , and wait in single-task message
after it makes an outbound DBlink call.If another session wants that lock in an incompatible mode then it has to wait.
The “single-task message” wait event is started / ended in kpnconn() so if there was a “spin” it would have to be
above that point in the stack. Oracle support advise to set SQLNET.OUTBOUND_CONNECT_TIMEOUT so that
we can expect a session in this state to then timeout which may be useful if there are underlying issues with
TCP hostname resolution. The principle in these cases is identical.
Do deep digging ,you can see this Knowledge:

Hdr: 8427478 10.2.0.3 RDBMS 10.2.0.3 PI/DISTRIB PRODID-5 PORTID-23 ORA-12170
Abstract: WHEN USING DBLINK,IT CONNECT TWICE AND TAKE DOBULE-TIME FOR TIME-OUT( ORA-12170)
PROBLEM:
——–
When client connect to database using TCP/IP, if there are some problems
in the network and server , The client recevies ORA-12170 (ETIMEDOUT)
The timeout time is depend on OS TCP parameter.(e.g. tcp_syn_retries
parameter on Linux platform).

ERROR:
ORA-12170 : TNS:Connect timeout occurred

** sqlnet.log ***************************************************
VERSION INFORMATION:
TNS for Linux: Version 10.2.0.3.0 – Production
TCP/IP NT Protocol Adapter for Linux: Version 10.2.0.3.0 – Production
Time: 10-APR-2009 21:38:41
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-505: Operation timed out
nt secondary err code: 110 <= ETIMEDOUT nt OS err code: 0 Client address:
** sqlnet.log ***************************************************

client server
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |when setting tcp_syn_retries=5(default)
|——-x–>| | It takes about 189 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

But, the connection via DBLINK, it sometimes try to connect twice. and
it takes double-time for time-out.

local site remote site
| |
connect() |
|——-x–>|——–
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
ORA-12170 occur ,but it not returning to client.
and server process of local site try to connect automatically.(2nd connect)
connect() | |
|——-x–>| |
| SYN | |
|——-x–>| | It takes about 378 seconds for time-out.
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
|——-x–>| |
| SYN | |
———————
ORA-12170 occur.

The problem is that it takes dobule-time by this behavior using DBLINK.

DIAGNOSTIC ANALYSIS:
——————–
We got stack trace when trying connect.

1st connect
connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335 <- nscall1()+449<- nscall()+609<- niotns()+1678<- nigcall()+112 <- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463 <- npicon0()+1250<- kpndbcon()+909<- OCIKDBLinkConn2()+40 <- OCIKDBLinkConn()+36<- ddfnet3Share()+151<- kksarm()+480<- kksauc()+481 <- kksCheckCriteria()+1711<- kksCheckCursor()+226 <- kksSearchChildList()+1496 <- kksfbc()+10276<- kkspsc0()+1903<- kksParseCursor()+132<- opiosq0()+1349 <- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093<- opitsk()+1031 <- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161 2nd connect <- connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<- nscall1()+449<- nscall()+609 <- niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<- npicon0()+1250<- kpndbcon()+909 <- OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet2Normal()+204<- kkmfcbrm()+132 <- __PGOSF172_kkmpfcbk()+473<- qcsprfro()+503<- qcsprfro_tree()+302<- qcsprfro_tree()+118 <- qcspafq()+102<- qcspqb()+347<- kkmdrv()+60<- opiSem()+1044<- opiDeferredSem()+283 <- opitca()+275<- __PGOSF302_kksFullTypeCheck()+20<- rpiswu2()+334<- kksLoadChild()+5852 <- kxsGetRuntimeLock()+1314<- kksfbc()+12270<- kkspsc0()+1903<- kksParseCursor()+132 <- opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093 <- opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161<- WORKAROUND: ----------- execute following command before connect. SQL> alter system flush shared_pool;

You can find the kpnconn call in stack trace , so the single-task message may occur in two time dblink reconnect and take a long time.
The document provide one workaround way: flush shared_pool.which I suspected.
In summary , still no Mature program to resolve this problem , All we can do is pray.
In my opinion , Version 10.2.0.4 is stable enough within most layer , but still a lots of bug or unexpected behavior reside in K2(kernel Distributed Execution Layer). Distributed system is a really complicated subject in computer science.

沪ICP备14014813号

沪公网安备 31010802001379号