single-task message and cursor pin S on X

Recently i find many session are waiting for Event <cursor pin s on x> in my customer's AIX 10.2.0.4 instance, it's a new event in the 10gr2,after mutex take place of latch . When a session is going to parse or reparse one cursor, need to pin X this cursor first. Then the others who what to execute this child cursor have to wait until pin X released. I looked for the session who was pining that cursor on X mode ,and saw it was waiting on event single-task messag: the mutex idn is d322e445,and operated in EXCL mode; It correspond to the cursor whose hash is "ff2c2fd459ac5ee188586781d322e445" as after:

PROCESS 1753:
----------------------------------------
SO: 70000048e9d5e68, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=1753, calls cur/top: 70000043308cb08/70000043798a6e0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 70000048f8deaf0 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 70000048f8deaf0 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000048ca877c0
O/S info: user: orauser, term: UNKNOWN, ospid: 3318124
OSD pid info: Unix process pid: 3318124, image: oracle@p595crm1
.........................
----------------------------------------
SO: 70000048eed0d30, type: 4, owner: 70000048e9d5e68, flag: INIT/-/-/0x00
(session) sid: 3311 trans: 70000046be86948, creator: 70000048e9d5e68, flag: (100041) USR/- BSY/-/-/-/-/-
DID: 0001-06D9-008347C7, short-term DID: 0001-06D9-008347C8
txn branch: 70000046a261488
oct: 3, prv: 0, sql: 700000414e4f570, psql: 700000415f47a90, user: 50/SHUCRM1C
service name: CRMDB1
O/S info: user: pausr12, term: unknown, ospid: 1234, machine: p570web2
program: JDBC Thin Client
application name: JDBC Thin Client, hash value=2546894660
waiting for 'single-task message' blocking sess=0x0 seq=22963 wait_time=0 seconds since wait started=6
=0, =0, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=325
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=383
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=265
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=2
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=677
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=0
driver id=54435000, #bytes=1, =0
for 'SQL*Net message from dblink' count=1 wait_time=237
driver id=54435000, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=54435000, #bytes=1, =0
temporary object counter: 0
SO:  70000043e695968, type: 53, owner: 70000048eed0d30, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000043e695968 handle=700000410dd46c8 mode=N                           --the child cursor
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000043e6959e8[70000041e9eb830,70000042411c178] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=CBK[0020]  savepoint=0x0
LIBRARY OBJECT HANDLE: handle=700000410dd46c8 mtx=700000410dd47f8(0)  cdp=0
namespace=CRSR flags=RON/KGHP/PN0/EXP/[10010100]
kkkk-dddd-llll=0000-0001-0001  lock=N pin=X latch#=23 hpc=fffc  hlc=fffc                                     --lock  in NULL mode,pin on X mode
lwt=700000410dd4770[700000410dd4770,700000410dd4770]  ltm=700000410dd4780[700000445169a08,700000453b69228]
pwt=700000410dd4738[700000410dd4738,700000410dd4738]  ptm=700000410dd4748[700000410dd4748,700000410dd4748]
ref=700000410dd47a0[70000043d1b5e58,70000043d1b5e58]  lnd=700000410dd47b8[700000410dd47b8,700000410dd47b8]
LIBRARY OBJECT: object=7000004354983b0
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
DEPENDENCIES: count=4 size=16
TRANSLATIONS: count=2 size=16
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004359e4768 7000004354984c8 I/P/A/-/-    0 NONE   00
6 700000414b0e4d0 700000450c0bc28 I/P/A/-/E    0 NONE   00

----------------------------------------

KGX Atomic  Operation Log 70000041853c068
Mutex 700000414b0e3d8(3311, 0) idn d322e445 oper EXCL
Cursor Pin uid 3311 efd 0 whr 1 slp  0      -- pin sid 3311
opr=3 pso=70000043e695968  flg=0            -- operate code is 3
pcs=700000414b0e3d8 nxt=0 flg=35 cld=0 hd=700000410dd46c8   par=70000041d78b1e0    -- child cursor handle address is 700000410dd46c8
ct=0 hsh=0 unp=0 unn=0 hvl=1d78b4b8 nhv=1  ses=70000048eed0d30                              -- heap 0 pointer address 70000041d78b1e0
hep=700000414b0e458 flg=80 ld=1 ob=7000004354983b0 ptr=700000450c0bc28  fex=700000450c0af38
----------------------------------------
SO: 70000041e9eb7b0, type: 53, owner: 70000048eed0d30, flag:  INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000041e9eb7b0 handle=700000414e4f570 mode=N
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000041e9eb830[70000044590d030,70000043e6959e8] htb=70000042411c178  ssga=70000042411bc10
user=70000048eed0d30 session=70000048eed0d30 count=1 flags=[0000]  savepoint=0x4c08a856
LIBRARY OBJECT HANDLE: handle=700000414e4f570 mtx=700000414e4f6a0(0)  cdp=1
name=
SELECT * FROM(SELECT A.*, rownum r FROM( select  account_id,billing_nbr,calling_nbr,called_nbr,to_date,call_duration,charge_item_name,to_char(rate  /100, '99999999999990.99') rate, charge,channel_id from ct_05
where  1=1 and account_id ='300187744'  and key_source_type ='52001'  order by to_date ) A WHERE rownum <= 15 ) B WHERE r > 0
hash=ff2c2fd459ac5ee188586781d322e445 timestamp=06-04-2010 15:14:33
namespace=CRSR flags=RON/KGHP/TIM/PN0/MED/DBN/[50010040]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=23 hpc=0018 hlc=0018
lwt=700000414e4f618[700000414e4f618,700000414e4f618]  ltm=700000414e4f628[700000414e4f628,700000414e4f628]
pwt=700000414e4f5e0[700000414e4f5e0,700000414e4f5e0]  ptm=700000414e4f5f0[700000414e4f5f0,700000414e4f5f0]
ref=700000414e4f648[700000414e4f648,700000414e4f648]  lnd=700000414e4f660[700000414e4f660,700000414e4f660]
LIBRARY OBJECT: object=70000041d78b0c8
type=CRSR flags=EXS[0001] pflags=[0000] status=VALD load=0
CHILDREN: size=16
child#    table reference   handle
------ -------- --------- --------
0 70000043d1b61e8 70000043d1b5e58 700000410dd46c8                                 -- the child cursor
DATA BLOCKS:
data#     heap  pointer    status pins change whr
----- -------- -------- --------- ---- ------ ---
0 7000004467f6078 70000041d78b1e0  I/P/A/-/-    0 NONE   00                                      --  heap 0

The session is active,and its wait history indicates there are some dblink operation inside the running SQL, When this session waited to reparse the only child cursor for that SQL , it pin this child cursor in X mode first,and then it need more information from remote DB , but no response until SQL*NET MESSAGE from DBLINK timeout,it was waiting single-task message for more than 6 seconds when tracing. In this scenario other sessions who wait to execute this child cursor were all waiting , so sadly. The oracle document describes <single-task message> as When running single task, this event indicates that the session waits for the client side of the executable. A little odd,the blocked also did some single task when it lost one dblink connection. I searched the metalink with keyword: single-task message, there were few documents found , but fortunately below note:

Hdr: 7757687 10.2.0.3.0 RDBMS 10.2.0.3.0 PRG INT/DISTR PRODID-5 PORTID-212 Abstract: CURSOR: PIN S WAIT ON X BLOCKER "SINGLE-TASK MESSAGE" PROBLEM: -------- The databse has multiple processes waiting on "CURSOR: PIN S WAIT ON X" The holder of the mutes is waiitng on "single-task message" form almost 745687 seconds. The query run by the holder does conatin a dblink. DIAGNOSTIC ANALYSIS: -------------------- All the process waiting on the same mutex:- waiting for 'cursor: pin S wait on X' blocking sess=0x0 idn=5d779f21, value=7af00000000, where|sleeps=57a0dfdb6 Holder PROCESS 26: KGX Atomic Operation Log 700000061490300 Mutex 70000006226e0f0(1967, 0) idn 5d779f21 oper EXCL Cursor Pin uid 1967 efd 0 whr 1 slp 0 opr=3 pso=7000000670f93a0 flg=0 pcs=70000006226e0f0 nxt=0 flg=35 cld=0 hd=7000000606838f0 par=7000000629006a8 ct=0 hsh=0 unp=0 unn=0 hvl=62900980 nhv=1 ses=700000079543fe0 hep=70000006226e170 flg=80 ld=1 ob=700000062fe69a8 ptr=700000065f124f8 fex=700000065f11808 Session object for this process:- SO: 700000079543fe0, type: 4, owner: 70000007a6ad538, flag: INIT/-/-/0x00 (session) sid: 1967 trans: 0, creator: 70000007a6ad538, flag: (41) USR/- BSY/-/-/-/-/- DID: 0001-001A-00000A38, short-term DID: 0001-001A-00000A39 txn branch: 0 oct: 3, prv: 0, sql: 700000062900e18, psql: 70000005d797bc8, user: 21/V500 O/S info: user: d_c11, term: , ospid: 6271140, machine: aruput5 program: cpm_srvscript@aruput5 (TNS V1-V3) application name: cpm_srvscript@aruput5 (TNS V1-V3), hash value=0 waiting for 'single-task message' blocking sess=0x0 seq=7071 wait_time=0 seconds since wait started=745687

This bug which occurs on Version 10.2.0.3 is so similar to my case on AIX version 10.2.0.4, so as the stack trace: ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttrd+0120<-nsprecv+07a0<- nscon+0218<-nsdo+157c<-nscall3+012c<-nscall+0778<-niotns+0888<-nigcall+0028<-osnco n+0540<-kpkiadef+006c<-upiini+03fc<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+011 4<-kpnconn+02a4<-npicon0+042c<-kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn +002c<-ddfnet2Normal+011c<-kkmfcbrm+009c<-kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree +0318<-qcsprfro_tree+01c8<-qcspafq+0068<-qcspqb+0470<-kkmdrv+003c<-opiSem+13b4<-opiDe ferredSem+0234<-opitca+01e8<-kksFullTypeCheck+001c<-rpiswu2+034c<-kksLoadChild+30b0<- kxsGetRuntimeLock+0810<-kksfbc+2930<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0b30< -kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1020<-opitsk+10b0<-opiino+0990<-opiod r+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098 My case stack trace: ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0074<-000044C0<-nttcni+01b8<-nttcon+04f4<-ntconn+0160< -nsopen+0960<-nscall1+01b0<-nscall+049c<-niotns+0880<-nigcall+0028<-osncon+0540<-kpkiadef+006c< -upiini+0408<-upiah0+00ac<-kpuatch+0808<-OCIServerAttach+0114<-kpnconn+02a4<-npicon0+042c< -kpndbcon+0630<-OCIKDBLinkConn2+0038<-OCIKDBLinkConn+002c<-ddfnet2Normal+011c<-kkmfcbrm+009c< -kkmpfcbk+02a0<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228<-qcspafq+0068< -qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-kkmevw+0a58<-kkmfcbvw+0178< -kkmfcblo+0e38<-kkmpfcbk+0250<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228< -qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c< -kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228< -qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-qcsevw+02e0<-qcsfpsq+003c< -kkmfcbsq+01e8<-kkmpfcbk+0260<-qcsprfro+0538<-qcsprfro_tree+03f0<-qcsprfro_tree+0228< -qcspafq+0068<-qcspqbDescendents+03a0<-qcspqb+00ac<-kkmdrv+003c<-opiSem+13c0< -opiprs+01ac<-kksParseChildCursor+05e0<-rpiswu2+034c<-kksLoadChild+1d88< -kxsGetRuntimeLock+0810<-kksfbc+28b0<-kkspsc0+0ffc<-kksParseCursor+00d4<-opiosq0+0ae0< -kpooprx+0168<-kpoal8+0400<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0< -opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0098 Both of them have called OCIKDBLinkConn->OCIKDBLinkConn2->kpndbcon->npicon0->kpnconn->OCIServerAttach->kpuatch and follow. Till now this bug has no patch to apply or workaround way what's awful. As "Bug 5111335: STREAMS PROPAGATION STUCK ON "SINGLE-TASK MESSAGE" PRODUCE LIBRARY CACHE LOCK CO" described:

Hdr: 5111335 10.2.0.2 RDBMS 10.2.0.2 STREAMS PRODID-5 PORTID-226 Abstract: STREAMS PROPAGATION STUCK ON "SINGLE-TASK MESSAGE" PRODUCE LIBRARY CACHE LOCK CO This is a three cluster environment. Two nodes per cluster and on that cluster running RAC with 2 instances. The databases involved are: GCTGCMU is the "GCT" database in GCM = Greenwich USA, will be named USA from now on GCTHKGU is the "GCT" database in HKG = Hong Kong, will be named HKG from now on GCTLONU is the "GCT" database in LON = London, will be named LON from now on On this environment there is a Streams bi-directional replication environment defined between the 3 databases. USA database is propagating changes captured to LON and HKG. There are two propagations defined from this site to every site. 4 propagations as a total. GCTHKGU_BATCH_PROPAGATE GCTLONU_BATCH_PROPAGATE GCTHKGU_DAILY_PROPAGATE GCTLONU_DAILY_PROPAGATE Daily propagation propagates changes stored on streams queue STRMADMIN.STREAMSOUT_DAILY. Batch propagation propagates changes stored on streams queue STRMADMIN.STREAMSOUT_BATCH This databases have been upgraded to 10.2.0.2 from 10.2.0.1 last 13-Mar-2006, and no problems were reported. Customer reported that he could see that propagations from USA to HKG were stucked on library cache lock. This situation was solved by customer by bouncing USA instances. Analyzing the systemstate dumps we could see that there were propagations locked on this situation and QMON slaves too. The library cache lock was on the queue STRMADMIN.STREAMSOUT_DAILY and the owner of the lock was process J000 that was stucked 'single-task message'. From the information on the systemstate dump we can see that this process has been on that situation for more than 3 days and 20 hours. Based on the call stack it seems that the process was about or in the middle of a call to a listener. DIAGNOSTIC ANALYSIS: -------------------- Files to be uploaded: USA: - RDA.zip , RDA node 1 - GCTGCMU1_healthchk.html, streams healthcheck node 1 - GCTGCMU1_healthchk.html, streams healthcheck node 2 - gctgcmu2_ora_21349.trc, hanganalyze and systemstate dumps when the hang is ocurring. Node 2 - gctgcmu1_ora_858.trc, hanganalyze and systemstate dumps when the hang is ocurring. Node 1. HKG: - alert_GCTHKGU1.log, alert log file node 1 - alert_GCTHKGU2.log, alert log file node 2 - listener_hkg0223xus.log_20060320, listener log file node 1. - listener_hkg0224xus.log_20060319, listener log file node 2. LON: - alert_GCTLONU1.log, alert log file node 1 - alert_GCTLONU2.log, alert log file node 2 - listener_lon0223xus.log_20060318, listener log file node 1. - listener_lon3166xus.log_20060320, listener log file node 2. Hang was identified at USA on node 1. Analysis of gctgcmu1_ora_858.trc: System State 1 ~~~~~~~~~~~~~~ 21: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=21 42: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=404 51: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=17 67: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=90 71: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=56 76: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=344 Cmd: PL/SQL Execute 77: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=87 System State 2 ~~~~~~~~~~~~~~ 21: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=21 42: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=404 51: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=17 67: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=90 71: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=56 76: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=344 Cmd: PL/SQL Execute 77: waiting for 'library cache lock' [LOCK: handle=25ea4afc8] seq=87 LOCK: handle=25ea4afc8 QUEU:STRMADMIN.STREAMSOUT_DAILY Resource Holder State LOCK: handle=25ea4afc8 50: waiting for 'single-task message' PROCESS 50: ~~~~~~~~~~~ OSD pid info: Unix process pid: 12949, image: oracle@slodb001 (J000) From the call stack looks like process was doing an SQL*Net connection to a listenerand doing this connection we get hanged because the stack trace is the same for the two process dumps shown on both system states. (session) oct: 0, prv: 0, sql: (nil), psql: (nil), user: 92/STRMADMIN waiting for 'single-task message' blocking sess=0x(nil) seq=5 wait_time=0 seconds since wait started=332308 =0, =0, =0 It is running a propagation job: next_date := sys.dbms_aqadm.aq$_propaq(job); Knowing that the seconds in wait are 332308 and the dump of the process occurred on 2006-03-21 08:15:12.640, this process is waiting on this event: 1. On USA the process is waiting since 17-MAR-2006 11:56:44 2. That time on London is 17-MAR-2006 16:56:44 3. That time on HONG KONG is 18-MAR-2006 01:56:44 At that time nothing relevant was found on alert log files on listener log files. Processes waiting on library cache lock ~~~~~~~~~~~~~~~ 21 (PZ99) : Slave of J002 42 (q004) : Slave of process 44 (QMNC) 51 (J001) : 67 (q003) : Slave of process 44 (QMNC) 71 (q005) : Slave of process 44 (QMNC) 51 (TNS shadow) : 77 (q007) : Slave of process 44 (QMNC) WORKAROUND: ----------- No workaround found so far. Killing the process or bouncing instance should solve the problem. RELATED BUGS: ------------- No similar bugs found. This environment is also involved on bug 5089630 REPRODUCIBILITY: ---------------- Problem has reproduced just one time on customer site. TEST CASE: ---------- STACK TRACE: ------------ For processes waiting on library cache lock we can see different call stacks, here is the one for process 21: ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll kslwait kjusuc ksipgetctx kqlmli kgllkal kglget kglgob kwqicgob kwqicrqo0 kwqmsnrmsg kwqsldqs kwqsif qerfxFetch qervwFetch qertqoFetch qerpxSlaveFetch qerpxFetch opiexe kpoal8 opiodr kpoodr upirtrc kpurcsc kpuexecv8 kpuexec OCIStmtExecute kxfxsStmtExecute kxfxsExecute kxfxsp kxfxmai kxfprdp opirip opidrv sou2o opimai_real main __libc_start_main _start For process waiting on single-task message, we can see the following call stack: ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction GI___poll clntudp_call do_ypcall yp_match internal_gethostbyname2_r nss_nis_gethostbyname_r gethostbyname_r gethostbyname snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr ntacbnd2addr nsc2addr nscall1 nscall niotns nigcall osncon kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn npicon0 kpndbcon OCIKDBLinkConn2 OCIKDBLinkConn kwqpcon kwqpgps spefmccallstd pextproc peftrusted psdexsp rpiswu2 psdextp pefccal pefca pevm_FCAL pfrinstr_FCAL pfrrun_no_tool pfrrun plsql_run peicnt kkxexe opiexe opiodr rpidrus skgmstack rpidru rpiswu2 rpidrv rpiexe kkjex1e kkjsexe kkjrdp opirip opidrv sou2o opimai_real main libc_start_main start I do not see any Oracle bug here but it is hard to be certain without the SQLNET.ORA settings and multiple snapshots of the problem process. It is quite valid for an Oracle session to be holding a library cache lock when it makes an outbound DBlink call. If another session wants that lock in an incompatible mode then it has to wait. In this case the blocking session making the outbound link call appears from the one stack dump to be in OS code under gethostbyname(). However it could have been spinning down the stack in Oracle code but I cannot comment on the info we have. The "single-task message" wait event is started / ended in kpnconn() so if there was a "spin" it would have to be above that point in the stack. In 10.2 there is the option to configure NET to timeout on outbound connect attempts by setting SQLNET.OUTBOUND_CONNECT_TIMEOUT = N seconds. The timeout occurs in niotns() in the stack so if this is set you would expect a session in this state to then timeout which may be useful if there are underlying issues with TCP hostname resolution. (Watch out for bug 4933023 if you use this SQLNET parameter). If the issue recurs then get CPU usage info, multiple stacks and you may want to attach with gdb and see if the code ever returns above gethostbyname or not to get an idea if this Sorry , my explanation is probably not clear . Here is the stack you have but annotated: Callback to get the short stack. Confirms we are not HUNG in a device driver. ksdxfstk ksdxcb sspuser pthread_sighandler_rt GI___libc_sigaction Inside OS supplied code executing gethostbyname() C call. GI___poll clntudp_call do_ypcall yp_match internal_gethostbyname2_r nss_nis_gethostbyname_r gethostbyname_r gethostbyname Oracle Net code: snlpcgthstbynm snlinGetAddrInfo nttbnd2addr ntacbbnd2addr ntacbnd2addr nsc2addr nscall1 nscall niotns <<< This is where SQLNET.OUTBOUND_CONNECT_TIMEOUT is implemented IF it has been configured. nigcall Oracle RDBMS code: osncon kpkiadef upiini upiah0 kpuatch OCIServerAttach kpnconn <<< This is where "single-task message" starts/ends npicon0 ... If we just get two stack snapshots of the process, as was the case, and both have the above stack then we can say: As the "seq=" of both snaps is the same (5) then we did not return below kpnconn() in the call stack. But as poll() is typically a blocking call, even if only for a short time, then we cannot be sure if this process was blocked permanently in the poll() waiting on some large timout or similar, or if the code was executing and looping. If it was looping it could have done so at any point on the above stack down to kpnconn() where we would have changed the seq# on the wait event if we exited here. We can say that as the short stack dump triggered a callout then the Linux kernel was not stuck inside a device driver. We can also say it looks like the gethostbyname is in YP code doing a UDP call so at OS level you may want to check out the name resolution configuration. You probably dont want to be falling all the way back to UDP requests to get hostnames resolved. So if the problem recurs it would help to get: ps information of the process (to see if it uses CPU at all) strace for a few minutes - this should help show if the process is entering / leaving any system call (eg: poll) A full call stack with arguments from either gdb or errorstack and these should help show the arguments passed up the stack. In gdb you can also do something like attach to the problem process and : break osncon break nigcall break niotns etc.. up the stack to gethostbyname and then "cont" and it should stop if it enters into those functions from below helping indicate a spin/loop point below that point. It may also be worth getting Linux "crash" information for the process to see what the process looks like from the Linux side but follow up with the OS team on that side. SQLNET.OUTBOUND_CONNECT_TIMEOUT is a backup mechanism to help prevent

This note claimed that one session may hold library cache lock , and wait in single-task message after it makes an outbound DBlink call.If another session wants that lock in an incompatible mode then it has to wait. The "single-task message" wait event is started / ended in kpnconn() so if there was a "spin" it would have to be above that point in the stack. Oracle support advise to set SQLNET.OUTBOUND_CONNECT_TIMEOUT so that we can expect a session in this state to then timeout which may be useful if there are underlying issues with TCP hostname resolution. The principle in these cases is identical. Do deep digging ,you can see this Knowledge:

Hdr: 8427478 10.2.0.3 RDBMS 10.2.0.3 PI/DISTRIB PRODID-5 PORTID-23 ORA-12170 Abstract: WHEN USING DBLINK,IT CONNECT TWICE AND TAKE DOBULE-TIME FOR TIME-OUT( ORA-12170) PROBLEM: -------- When client connect to database using TCP/IP, if there are some problems in the network and server , The client recevies ORA-12170 (ETIMEDOUT) The timeout time is depend on OS TCP parameter.(e.g. tcp_syn_retries parameter on Linux platform). ERROR: ORA-12170 : TNS:Connect timeout occurred ** sqlnet.log *************************************************** VERSION INFORMATION: TNS for Linux: Version 10.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 10.2.0.3.0 - Production Time: 10-APR-2009 21:38:41 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-505: Operation timed out nt secondary err code: 110 <= ETIMEDOUT nt OS err code: 0 Client address: ** sqlnet.log *************************************************** client server | | connect() | |-------x-->|-------- | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | |when setting tcp_syn_retries=5(default) |-------x-->| | It takes about 189 seconds for time-out. | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | --------------------- ORA-12170 occur. But, the connection via DBLINK, it sometimes try to connect twice. and it takes double-time for time-out. local site remote site | | connect() | |-------x-->|-------- | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | ORA-12170 occur ,but it not returning to client. and server process of local site try to connect automatically.(2nd connect) connect() | | |-------x-->| | | SYN | | |-------x-->| | It takes about 378 seconds for time-out. | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | |-------x-->| | | SYN | | --------------------- ORA-12170 occur. The problem is that it takes dobule-time by this behavior using DBLINK. DIAGNOSTIC ANALYSIS: -------------------- We got stack trace when trying connect. 1st connect connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335 <- nscall1()+449<- nscall()+609<- niotns()+1678<- nigcall()+112 <- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463 <- npicon0()+1250<- kpndbcon()+909<- OCIKDBLinkConn2()+40 <- OCIKDBLinkConn()+36<- ddfnet3Share()+151<- kksarm()+480<- kksauc()+481 <- kksCheckCriteria()+1711<- kksCheckCursor()+226 <- kksSearchChildList()+1496 <- kksfbc()+10276<- kkspsc0()+1903<- kksParseCursor()+132<- opiosq0()+1349 <- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093<- opitsk()+1031 <- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161 2nd connect <- connect()+36<- nttcon()+965<- ntconn()+265<- nsopen()+1335<- nscall1()+449<- nscall()+609 <- niotns()+1678<- nigcall()+112<- osncon()+692<- kpkiadef()+58<- upiini()+683<- upiah0()+104 <- kpuatch()+1064<- OCIServerAttach()+129<- kpnconn()+463<- npicon0()+1250<- kpndbcon()+909 <- OCIKDBLinkConn2()+40<- OCIKDBLinkConn()+36<- ddfnet2Normal()+204<- kkmfcbrm()+132 <- __PGOSF172_kkmpfcbk()+473<- qcsprfro()+503<- qcsprfro_tree()+302<- qcsprfro_tree()+118 <- qcspafq()+102<- qcspqb()+347<- kkmdrv()+60<- opiSem()+1044<- opiDeferredSem()+283 <- opitca()+275<- __PGOSF302_kksFullTypeCheck()+20<- rpiswu2()+334<- kksLoadChild()+5852 <- kxsGetRuntimeLock()+1314<- kksfbc()+12270<- kkspsc0()+1903<- kksParseCursor()+132 <- opiosq0()+1349<- kpooprx()+215<- kpoal8()+673<- opiodr()+985<- ttcpip()+1093 <- opitsk()+1031<- opiino()+821<- opiodr()+985<- opidrv()+466<- sou2o()+91<- opimai_real()+117 <- main()+111<- __libc_start_main()+161<- WORKAROUND: ----------- execute following command before connect. SQL> alter system flush shared_pool;

You can find the kpnconn call in stack trace , so the single-task message may occur in two time dblink reconnect and take a long time. The document provide one workaround way: flush shared_pool.which I suspected. In summary , still no Mature program to resolve this problem , All we can do is pray. In my opinion , Version 10.2.0.4 is stable enough within most layer , but still a lots of bug or unexpected behavior reside in K2(kernel Distributed Execution Layer). Distributed system is a really complicated subject in computer science.

评论 (1)