ORA-600:[ksnpost:ksnigb]错误一例

一套HP-UX Itanium上的10.2.0.3系统出现了ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [],错误,相关的日志如下:

 

ORA-00600: internal error code, arguments: [kolaslGetLength-1], [], [], [], [], [], [], []

ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [], [], []
ORA-00609: could not attach to incoming connection
ORA-12157: TNS:internal network communication error

ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3

ORA-00600: internal error code, arguments: [kolaslGetLength-1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT * FROM gv$sql

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [], [], []
ORA-00609: could not attach to incoming connection
ORA-12157: TNS:internal network communication error
Current SQL information unavailable - no session.

 

Mos上相关的Bug 信息:

 

Bug # 4634662 – Ora-600 [kolaslGetLength-1]
Bug # 7479468 – Ora-600 [ksnpost:ksnigb]

 

Issue 1)
This issue is described in the note 357016.1.Please apply the patch mentioned in the note 357016.1 to avoid this issue.

 

Issue 2)
This is due to OS resource limitation. You need to check with HP for this.We are running out of OS resources and hence
this issue is coming.
Also you can set the ulimit -a values to unlimited for oracle and root user to avoid this issue.

 

Bug 4634662 OERI:kolaslGetLength-1 from V$SQL in RAC
ORA-600[kolaslGetLength-1] selecting from V$SQL (of GV$SQL)
Range of versions believed to be affected Versions < 11
Versions confirmed as being affected

10.1.0.5
10.2.0.3

his issue is fixed in

10.2.0.3 Patch 5 on Windows Platforms
10.2.0.4 (Server Patch Set)
11.1.0.6 (Base Release)

on a RAC database with a varying width character set.

Applies to:
Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 10.2.0.3 – Release: 10.1 to 10.2
Information in this document applies to any platform.
Symptoms

select * from gv$sql on a RAC enviornment

errors with:
ORA-00600: internal error code, arguments: [kolaslGetLength-1], [], [], [], [], [], [], [].

with a multibye characterset (ie:al32utf8, utf8 )
Note: This error has also been reported when select from SQLT$_GV$SQL_PLAN while running Sqlt Execute diagnostic utility – SQLTXPLAIN.SQLT$*

Cause
This is Bug 4634662

Details: ora-600[kolaslGetLength-1] received selecting from v$sql over a RAC db which has a varying width character set

Solution

Bug 4634662 is fixed in 10.2.0.4 and Rel 11.

To check if a patch exists for your platform, go to Metalink:

a) Click on Patches & Updates Folder.
b) Click on Simple Search.
c) Enter patch#: Patch 4634662
d) Select your O/S
e) Click Go.

ORA-600 [Ksnpost:Ksnigb] ORA-12157

Applies to:
Oracle Server – Enterprise Edition – Version: 10.2.0.4 and later [Release: 10.2 and later ]
Information in this document applies to any platform.
Symptoms

In the alert log is reported the following:

ORA-00600: internal error code, arguments: [ksnpost:ksnigb], [], [], [], [], [], [], []
ORA-00609: could not attach to incoming connection
ORA-12157: TNS:internal network communication error

Cause

ORA-600 [ksnpost:ksnigb] along with ORA-600 [729] errors are creating problem while creating new sessions.
All these point to insufficient memory or OS limits.

Bug 7479468 was raised for similar issue that was closed as there was not enough swap swap space hence was the problem.
Solution

Check the swap space in the system and increase if required.
References
BUG:7479468 – ORA-600 [KSNPOST:KSNIGB] AND ORA-12157

 

分析发现在10.2.0.3等版本上存在RAC中查询gv$sql视图会引发ORA-600[kolaslGetLength-1]错误的bug,该bug 可以通过实施补丁4634662来解决。

 

此外当有效内存不足或OS参数上限过小时可能导致Oracle无法成功创建新的session ,引发ORA-600[ksnpost:ksnigb]错误。

建议检查ulimit -a 的输出, 在HP-UX上不妨将stack size设置为unlimited(set the stack size to unlimited.)

$ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 1048576
stack(kbytes) 131072
memory(kbytes) unlimited
coredump(blocks) 4194303
nofiles(descriptors) 2048

EM dbconsole显示数据文件使用量为负数?

有网友在itpub上反应11g中查看EM dbconsole显示数据文件的使用量为负数,如图:

 

 

实际上导致(EM ( EM Home page -> Administration Tab -> Storage -> Datafiles ))显示负的使用量的Bug在11g和10g中均存在,这里将这些bug列出来以供参考:

 

10g Enterprise Manager Shows Negative Utilisation For An Oracle Datafile.

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3
This problem can occur on any platform.
Symptoms

EM ( EM Home page -> Administration Tab -> Storage -> Datafiles ) shows negative 
utilization for some files. Value in Used (MB) column is negative.

Cause
This is because of Bug:5083393.

If the object is dropped and not purged, the value of FILE_ID / REL_FNO may be wrong in 
DBA_FREE_SPACE and this cause EM to display wrong value in Used (MB) column.

Solution
If available, download Patch:5083393 to resolve this issue.
Note that this patch is available on 10.2.0.2/10.2.0.3 for some platforms at the 
time of this writing this article (JULY 2007).

Workaround :
-------------
Purge all objects from the recycle bin.

$sqlplus "/as sysdba"
sql> purge dba_recyclebin 

Hdr: 5083393 10.1.0.4 RDBMS 10.1.0.4 SPACE PRODID-5 PORTID-912
Abstract: VALUE OF FILE_ID AND RELATIVE_FNO OF DBA_FREE_SPACE IS DIFFERENT

PROBLEM:
--------
When the table is dropped, the value of the FILE_ID column and
the RELATIVE_FNO column of dba_free_space might be different.

SQL> select * from dba_free_space where tablespace_name = 'TBS_TEST1';

TABLESPACE_NAME FILE_ID BLOCK_ID    BYTES BLOCKS RELATIVE_FNO
--------------- ------- -------- -------- ------ ------------
E
E
E
TBS_TEST1             6      105    65536      8            6
TBS_TEST1             6      113    65536      8            6
TBS_TEST1             6      121    65536      8            6
TBS_TEST1             6        9    65536      8            7   purge table tbs_test01;

Table purged.

SQL> select * from dba_free_space where tablespace_name = 'TBS_TEST1';

TABLESPACE_NAME FILE_ID BLOCK_ID    BYTES BLOCKS RELATIVE_FNO
--------------- ------- -------- -------- ------ ------------
TBS_TEST1             6        9   983040    120            6
TBS_TEST1             7        9   983040    120            7
TBS_TEST1             8        9 20905984   2552            8

In a word, there seems to be a problem in the display of an free space
concerning the segment that exists in RECYCLE BIN.

WORKAROUND:
-----------
None

RELATED BUGS:
-------------
None

REPRODUCIBILITY:
----------------
  100% (both In-house and Ct' site) 
  Rep?     Platform               RDBMS Ver. 
  -------  ---------------------  ---------- 
  Y(100%)  912 Intel Windows NT   10.1.0.4
  Y(100%)  912 Intel Windows NT   10.1.0.5
  Y(100%)  912 Intel Windows NT   10.2.0.2
  Y(100%)   46 Linux x86          10.1.0.4

 

11gR2 EMCONSOLE SHOWS NEGATIVE VALUES WHEN VIEWING DATAFILE USAGE SPACE IN 11.2.0.1

 

Hdr: 9868375 11.2.0.1 STORAGE 11.2.0.1 PRODID-1366 PORTID-226 9548105
Abstract: EMCONSOLE SHOWS NEGATIVE VALUES WHEN VIEWING DATAFILE USAGE SPACE IN 11.2.0.1

PROBLEM:
--------
The display of the usage of the data files (Tablespaces -> Datafiles) shows 
values in the range negative. The fill rate is not shown correctly.

This is the same exact issue as explained in note 559414.1 but on top of 
11.2.0.1 and also the issue does not seem fixed when trying the workaround by 
clearing the recycle bin or by turning it off.

DIAGNOSTIC ANALYSIS:
--------------------
Checked running the statements from sqlplus they provided the correct value 
and checked this Bug 5549540 seems it should have been fixed in 11G but still 
having the same issue.

WORKAROUND:
-----------
To run the statements and check the data file usage from sqlplus.

 

11gR1 Grid Control Shows Duplicate Datafile Names Multiple times (Some With Negative Values) When Drilling Onto Tablespace Link

Applies to:
Enterprise Manager Grid Control - Version: 10.2.0.5 to 11.1.0.1 - Release: 10.2 to 11.1
Information in this document applies to any platform.
Symptoms

The list of datafiles displayed for a tablespace in EM Grid Control is wrong, so that some 
datafiles are listed repeatedly multiple times, and some of the sizes are wrong and showing a negative value.

This is seen from database home page => server tab => tablespaces.
Click on tablespace name, a list of datafiles associated with that tablespace should be returned.

Cause
If the SQL statement which Grid Control uses is run manually at Sqlplus, the same output of the 
same datafile listed multiple times can also be seen. 

Here you have the SQL statement launched by the Grid Control to collect the list of datafiles :

SELECT d.file_name, TO_CHAR((d.bytes / 1024 / 1024), '99999990.000'), 
NVL(TO_CHAR(((d.bytes - s.bytes) / 1024 / 1024), '99999990.000'), TO_CHAR((d.bytes / 1024 / 1024), '99999990.000')), 
d.file_id, d.autoextensible, d.increment_by, d.maxblocks FROM sys.dba_data_files d, 
(SELECT file_id, SUM(bytes) bytes FROM sys.dba_free_space WHERE tablespace_name = '&tablespaceName' 
GROUP BY file_id) s WHERE (s.file_id (+)= d.file_id) AND (d.tablespace_name = '&tablespaceName')

This is the issue described in following bug :

Bug 8499373 GRID CONTROL SHOWS SAME DATAFILE NAME REPEATEDLY WHEN DRILL INTO A TABLESPACE

This bug is actually closed as a duplicate of a database bug Bug 8570838 WRONG RESULTS FOR QUERY AGAINST 
DBA_DATA_FILES IN 10.2.0.4
The root cause of the database bug is :
Bug 7155655 Intermittent Wrong Results/Plan Instability from Join Predicate Push Down (JPPD)

Solution

Apply the Patch 7155655. 

Note :
Depending on platform it may be necessary to apply a merge patch for this - eg. in some cases it was 
necessary to apply Patch 7278117.

Hdr: 9748363 11.2.0.1 STORAGE 11.1.0.1 PRODID-1366 PORTID-226 9548105
Abstract: NEGATIVE VALUES REPORTED BY GRIDCONTROLFOR TABLESPACE WITH AT LEAST TWO DATAFILE

*** 05/25/10 09:46 am ***
TAR
---
3-1772640611

Problem Description
-------------------
The Used (MB) column on the datafiles summary webpage in the 11.2 grid 
control 
shows negative values for a tablespace containing at least two datafiles

Environment Information
-----------------------

Test Case Step-by-Step Instructions
-----------------------------------

Test Case Location
------------------

Diagnostic Analysis
-------------------
This is the same as bug 9548105 except for this is for GridControl vs 
dbconsole.
SELECT d.file_name, TO_CHAR((d.bytes / 1024 / 1024), '99999990.000'),
NVL(TO_CHAR(((d.bytes - s.bytes) / 1024 / 1024), '99999990.000'),
TO_CHAR((d.bytes / 1024/ 1024), '99999990.000')), d.file_id, 
d.autoextensible,
d.increment_by, d.maxblocks FROM sys.dba_data_files d,  
(SELECT ts.name tablespace_name, SUM(e.length * ts.blocksize) bytes  
FROM sys.fet$ e, sys.ts$ ts  WHERE ts.ts# = e.ts# GROUP BY ts.name  
UNION ALL  SELECT ts.name tablespace_name, SUM(e.blocks * ts.blocksize) bytes 

FROM sys.dba_lmt_free_space e, sys.ts$ ts  WHERE ts.ts# = e.tablespace_id 
GROUP BY ts.name) s
WHERE (s.tablespace_name = d.tablespace_name) AND (d.tablespace_name = :1)

Performance
-----------

NLS Information
---------------

-------

Log Files Location
------------------

Reproducibility
---------------
yes

URL
---

Did you test with the latest version?
-------------------------------------

Available Workarounds
---------------------
none

Related Bugs
------------
9548105

Oracle内部错误:ORA-00600[13013][5001]故障诊断一例

周五被叫到客户现场解决一套10.2.0.4 RAC数据库上的ORA-00600[13013]内部错误问题,这个问题同事已经在上午通过远程拨号了解过情况,
初步判断是索引存在讹误corruption引起的600。

前期诊断

同事在我抵达现场之前已经做了初步的诊断,该10.2.0.4上的RAC系统主用节点的告警日志中多次出现ORA-00600:[13013], [5001]、ORA-00600:[qertbFetchByRowID]及ORA-00600: [25027] 等内部错误,具体的日志如下:

Fri Sep 16 01:16:54 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Fri Sep 16 01:16:55 2011
Trace dumping is performing id=[cdmp_20110916011655]
Fri Sep 16 01:17:06 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []

以下为13013 trace

ORA-00600: internal error code, arguments: [13013], [5001], [66209], [50730514], [23], [50730514], [3], []
Current SQL statement for this session:
UPDATE CUST_SUBSCRB_PERSONAL A SET a.cust_mobile='A04204441' WHERE a.subscrbid=71524739
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               1000D8FE4 ? 0785F7165 ?
10501ED28 ? FFFFFFFFFFF6748 ?
06C626E72 ?
ksesic6+0060         bl       kgesiv               110072D08 ? 7000000AC52B8F8 ?
7000000AC52B3A8 ? 07FFFFFFF ?
000000000 ?
updThreePhaseExe+0c  bl       01F9D7A8
4c
updexe+041c          bl       updThreePhaseExe     7000000AC52B8F8 ? 000000000 ?
110540128 ? FFFFFFFFFFF7C80 ?
opiexe+27d8          bl       updexe               7000000AC52B8F8 ?
FFFFFFFFFFF82B8 ?
kpoal8+0edc          bl       opiexe               FFFFFFFFFFFB434 ?
FFFFFFFFFFFB198 ?
FFFFFFFFFFF9608 ?
opiodr+0ae0          bl       _ptrgl
ttcpip+1020          bl       _ptrgl
opitsk+1124          bl       01F9F2A0
opiino+0990          bl       opitsk               000000000 ? 000000000 ?

============
Plan Table
============
------------------------------------------------------+-----------------------------------+
| Id  | Operation           | Name                    | Rows  | Bytes | Cost  | Time      |
------------------------------------------------------+-----------------------------------+
| 0   | UPDATE STATEMENT    |                         |       |       |     1 |           |
| 1   |  UPDATE             | CUST_SUBSCRB_PERSONAL   |       |       |       |           |
| 2   |   INDEX UNIQUE SCAN | CUST_SUBSCRB_PERSONAL_PK|     1 |    33 |     1 |  00:00:01 |
------------------------------------------------------+-----------------------------------+

===============================================================================================================

同时还伴随有ORA-00600:[qertbFetchByRowID]出现

Fri Sep 16 01:08:57 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_2138788.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT * from CUST_SUBSCRB_PERSONAL A WHERE a.subscrbid=307557025
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              578318D500000003 ?
FFFFFFFFFFF80D0 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgerinv+00dc         bl       _ptrgl
kgeasnmierr+004c     bl       kgerinv              11048AE68 ? 000000000 ?
000000001 ? 104BDEC18 ?
110489398 ?
qertbFetchByRowID+0  bl       03F2EF38
d34
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?

============
Plan Table
============
---------------------------------------------------------------+-----------------------------------+
| Id  | Operation                    | Name                    | Rows  | Bytes | Cost  | Time      |
---------------------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT             |                         |       |       |     1 |           |
| 1   |  TABLE ACCESS BY INDEX ROWID | CUST_SUBSCRB_PERSONAL   |     1 |   287 |     1 |  00:00:01 |
| 2   |   INDEX UNIQUE SCAN          | CUST_SUBSCRB_PERSONAL_PK|     1 |       |     1 |  00:00:01 |
---------------------------------------------------------------+-----------------------------------+

===============================================================================================================

此外还会出现ORA-00600: [25027], [6], [1443670797], [], [], [], [], []

Fri Sep 16 15:06:00 2011
Errors in file /oravl01/oracle/admin/CRMDB2/udump/crmdb22_ora_1872436.trc:
ORA-00600: internal error code, arguments: [25027], [6], [1443670797], [], [], [], [], []

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [25027], [6], [4278267905], [], [], [], [], []
Current SQL statement for this session:
select q.subscrbid,q.serv_lvl,q.cust_lvl
from cust_subscrb_personal q
where q.serv_lvl is null
or q.cust_lvl is null
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A54870 ?
ksfdmp+0018          bl       03F30204
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               000000000 ? 000000000 ?
000000000 ? 105673724 ?
00000000C ?
ksesic2+0060         bl       kgesiv               000000245 ? 0FFFFFFFF ?
700000010013BD0 ?
700000010018078 ? 110000998 ?
krtd2abh+040c        bl       ksesic2              61C3000061C3 ? 000000000 ?
000000006 ? 000000000 ?
0FF013001 ? 000000FA0 ?
105673FB8 ? 105673FA8 ?
kcbgtcr+24a0         bl       krtd2abh             700000471D95738 ?
7000001A9F0F050 ?
FFFFFFFFFFF78E0 ?
700000471D93500 ? 000000003 ?
ktrget+04ac          bl       kcbgtcr              1105A75A8 ? 000002000 ?
000000000 ? 000001940 ?
kdsgrp+00bc          bl       ktrget               000000064 ? 110000998 ?
700000010008000 ?
kdsgnp+0444          bl       kdsgrp               000000000 ? 000000000 ?
FFFFFFFFFFF80D0 ?
kafger+08fc          bl       kdsgnp               000000000 ? 000000000 ?
111119E38 ?
kdstf1100101km+0f58  bl       kafger               1105A74D8 ? 111119E38 ?
FFFFFFFFFFF8190 ? 000000000 ?
70000046DCA8488 ? 000000000 ?
0FFFF9608 ? 110471948 ?
kdsttgr+1a04         bl       kdstf1100101km       111119E38 ? 0000102A1 ?
70000042DF6DC50 ?
FFFFFFFFFFF8418 ? 0FFFFFFFF ?
000001FE8 ? 000000000 ?
000000000 ?
qertbFetch+09b8      bl       kdsttgr              111119E38 ? 000000418 ?
000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000418 ? BE4610195B2C ?
opifch2+141c         bl       03F2EB1C
opifch+003c          bl       opifch2              1100DD338 ? 000000000 ?
FFFFFFFFFFF9980 ?

============
Plan Table
============
--------------------------------------------------+-----------------------------------+
| Id  | Operation          | Name                 | Rows  | Bytes | Cost  | Time      |
--------------------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |                      |       |       |  4898 |           |
| 1   |  TABLE ACCESS FULL | CUST_SUBSCRB_PERSONAL|    15 |   165 |  4898 |  00:00:59 |
--------------------------------------------------+-----------------------------------+

我们先从ORA-00600:[13013], [5001]内部错误入手,Mos Note <How to resolve ORA-00600 [13013], [5001] [ID 816784.1]> 比较翔实地介绍了该600错误,其argument的具体含义如下:

Arg [a] Passcount
Arg [b] Data Object number
Arg [c] Tablespace Relative DBA of block containing the row to be updated
Arg [d] Row Slot number
Arg [e] Relative DBA of block being updated (should be same as [c])
Arg [f] Code

我们这里的[13013], [5001], [66209], [50730514], [23], [50730514], [3], [] 就是

data_object_id = 66209 = 0x102A1
RDBA = 50730514 即  3061612 fild id=12  block id =398866
Row Slot number = 23
Code = 3

翻译过来就是对data_object_id为66209 的对象在12号数据文件上的398866块的第23行做代码为3的检测时发现了异常,通过data_object_id可以定位到该对象就是正在执行的SQL语句中的CUST_SUBSCRB_PERSONAL表。

ORA-00600:[13013], [5001]可能由表上的数据行或索引中的记录逻辑讹误(logical Corruption)所造成,这里要强调一下是逻辑讹误,而非物理损坏。逻辑讹误一般是由于Oracle Bug或者memory Corruption 所引起的,该ORA-00600:[13013], [5001]已知可能由5085288和4549673等多个Bug 引发,在版本10.2.0.4 上这些bug 触发概率较高。

伴随ORA-00600:[13013], [5001]发生的ORA-600 [qertbFetchByRowID]常由索引损坏引起,介绍了该问题:

ORA-600 [qertbFetchByRowID] Select Queries on 10.2.0.4 Database [ID 755592.1]

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.4 and later   [Release: 10.2 and later ]

Symptoms

The following error was received during SELECT operation:

ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

The call stack contains:

qertbFetchByRowID <- qergiFetch <- rwsfcd <- qerhjFetch
< - qerjotFetch <- rwsfcd <- qertqoFetch <- qerpxSlaveFetch <- qerpxFetch

Cause
Index corruption.

Solution

1.
Analyze the affected object to look for logical corruption.

SQL> analyze table TABLENAME validate structure cascade;

This will place an exclusive lock on the table.

If an error is reported try dropping and recreating the index(es) first.

2.
If this problem is occurring on a version less than 10.2.0.4, ensure the patch for
Bug 4883635 has been applied if Materialized Views are used.

BUG:5866783 - ORA-00600: [QERTBFETCHBYROWID] ON SELECT FROM ONE TABLE

而ORA-00600: [25027]也是类似的问题:

ORA-600 [25027] [ID 284433.1]

PURPOSE:
  This article represents a partially published OERI note.

  It has been published because the ORA-600 error has been
  reported in at least one confirmed bug.

  Therefore, the SUGGESTIONS section of this article may help
  in terms of identifying the cause of the error.

  This specific ORA-600 error may be considered for full publication
  at a later date. If/when fully published, additional information
  will be available here on the nature of this error.

ERROR:

  Format: ORA-600 [25027] [a] [b]

VERSIONS:
  versions 9.2 and above

ARGUMENTS:
  Arg [a]  Tablespace Number (TSN)
  Arg [b]  Decimal Relative Data Block Address (RDBA)

SUGGESTIONS:

 1. If the Arg [b] (the RDBA) is 0 (zero), then this could be due to fake indexes.

  The following query will list fake indexes:

     select do.owner,do.object_name, do.object_type,sysind.flags
     from dba_objects do, sys.ind$ sysind
     where do.object_id = sysind.obj#
     and bitand(sysind.flags,4096)=4096;

If the above query returns any rows, check the objects involved and consider
dropping them as they can cause this error. 

2. Run analyze table validate structure on the table referenced in the Current SQL statement in
    the related trace file.

  If the Known Issues section below does not help in terms of identifying
  a solution, please submit the trace files and alert.log to Oracle
  Support Services for further analysis.

但是请注意在该故障示例中ORA-00600: [25027]给出的Arg [b] Decimal Relative Data Block Address (RDBA)是一个完全不相干的RDBA,具体原因尚不明确。

小插曲

这当中有一个小的插曲,原来客户这里的应用人员需要尽快在该CUST_SUBSCRB_PERSONAL表上执行一段查询语句,但是该语句一旦执行就会遇到ORA-600 [qertbFetchByRowID]错误,因为是周五小周末,所以如果今天无法运行的话,就要拖到下个礼拜了。

所以被要求优先解决该语句执行的问题,首先看了一下该语句的执行计划:

SQL> explain plan for select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------
Plan hash value: 3616548176

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |     1 |    31 |   231   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE               |                           |     1 |    31 |            |          |
|*  2 |   HASH JOIN                   |                           |   425 | 13175 |   231   (1)| 00:00:03 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |   425 |  8075 |   208   (0)| 00:00:03 |
|*  4 |     INDEX RANGE SCAN          | CUST_SUBSCRB_PERSONAL_2IX | 42462 |       |    20   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL          | WZY_BS20110916            | 39792 |   466K|    22   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("A"."SVCNUM"="B"."SVCNUM")
3 - filter(TO_NUMBER("A"."SERV_LVL")=0)
4 - access("A"."COUNTYID"='A00')

Note
-----
- dynamic sampling used for this statement

23 rows selected.

SQL> select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
  2  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;
select count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
                              *
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

以上执行计划中对表上的索引CUST_SUBSCRB_PERSONAL_2IX做了range scan后通过获得的rowid到表上去fetch记录(qertbFetchByRowID),在实际fetch by rowid 的时候引发了ORA-600错误。

因为需求较为紧急,所以我还是考虑能否使执行计划绕过该索引,投石问路给语句加上了RULE提示,再次执行:

SQL> select /*+ rule */
2  count(*) from SHUCRM2C.cust_subscrb_personal a,  SHUCRM2C.wzy_bs20110916 b
3  where a.svcnum=b.svcnum and a.countyid='A00' and a.serv_lvl=0;

COUNT(*)
----------
11559

Execution Plan
----------------------------------------------------------
Plan hash value: 2851452146

-------------------------------------------------------------------
| Id  | Operation                     | Name                      |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                           |
|   1 |  SORT AGGREGATE               |                           |
|   2 |   NESTED LOOPS                |                           |
|   3 |    TABLE ACCESS FULL          | WZY_BS20110916            |
|*  4 |    TABLE ACCESS BY INDEX ROWID| CUST_SUBSCRB_PERSONAL     |
|   5 |     AND-EQUAL                 |                           |
|*  6 |      INDEX RANGE SCAN         | CUST_SUBSCRB_SVCNUM_2006  |
|*  7 |      INDEX RANGE SCAN         | CUST_SUBSCRB_PERSONAL_2IX |
-------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - filter(TO_NUMBER("A"."SERV_LVL")=0)
6 - access("A"."SVCNUM"="B"."SVCNUM")
7 - access("A"."COUNTYID"='A00')

Note
-----
- rule based optimizer used (consider using cbo)

想不到居然执行成功了,但是执行计划当中仍有CUST_SUBSCRB_PERSONAL_2IX这个索引,这让我潜意识中认识到很可能是表而非索引存在逻辑讹误。

不管怎么说至少解了燃眉之急,先把这个方法告诉应用人员,然后回过头来继续诊断。

初步验证索引

为了确定到底是表还是索引存在逻辑讹误,一般需要使用analyze table validate structure cascade命令以验证表和索引的结构,但是该命令会要求以共享方式锁住表(TM mode=4),对于更新频繁的生产系统中的关键应用表,这是不可接受的。同事在接手这个故障时已经考虑过该问题了,所以他推荐用查询的方式来检验到底是那些索引存在corruption,具体方法如下:

通过ROWID_CREATE 构造出ROWID

SQL> select dbms_rowid.ROWID_CREATE(1,66209,12,398866,23) from dual;

DBMS_ROWID.ROWID_C
------------------
AAAQKhAAMAABhYSAAX

SQL> select * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> SELECT INDEX_NAME,COLUMN_NAME FROM DBA_IND_COLUMNS WHERE TABLE_NAME='CUST_SUBSCRB_PERSONAL' ORDER BY 1;

INDEX_NAME                     COLUMN_NAME
------------------------------ ----------------------------------------
CUST_SUBSCRB_CARD_NUM          VIP_CARD_NUM
CUST_SUBSCRB_MANAGERID_2006    MANAGERID
CUST_SUBSCRB_PERSONAL_2IX      COUNTYID
CUST_SUBSCRB_PERSONAL_3IX      CUST_LVL
CUST_SUBSCRB_PERSONAL_PK       SUBSCRBID
CUST_SUBSCRB_SERV_COUNTY       SERV_COUNTYID
CUST_SUBSCRB_SVCNUM_2006       SVCNUM
IDX_CUST_SUBSCRB_PERSONAL_01   SERV_LVL

之后强制使用index提示使用不同的索引

SQL> select /*+ INDEX(a CUST_SUBSCRB_CARD_NUM ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_MANAGERID_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_2IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_3IX) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_PERSONAL_PK ) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

SQL> select /*+ INDEX(a CUST_SUBSCRB_SERV_COUNTY) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

no rows selected

SQL> select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');
select /*+ INDEX(a CUST_SUBSCRB_SVCNUM_2006) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

SQL> select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX');

select /*+ INDEX(a IDX_CUST_SUBSCRB_PERSONAL_01) */ * from SHUCRM2O.CUST_SUBSCRB_PERSONAL a
where rowid=chartorowid('AAAQKhAAMAABhYSAAX')
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

可以看到使用以上方法在其中4个索引上遇到了ORA-00600:[qertbFetchByRowID]。 但是使用该方法还是无法彻底搞清楚逻辑到底存在于table还是index上?

 

必要的工具

 

这里我们要介绍一下validate structure还存在online在线使用的选项,在online模式下整个验证过程不会在表上加任何TM锁,这一点具体可以使用10704 事件来证明:

 

10704, 00000, "Print out information about what enqueues are being obtained"
// *Cause:  When enabled, prints out arguments to calls to ksqcmi and
//          ksqlrl and the return values.
// *Action: Level indicates details:
//   Level: 1-4: print out basic info for ksqlrl, ksqcmi
//          5-9: also print out stuff in callbacks:  ksqlac, ksqlop
//          10+: also print out time for each line

SQL> create table validate_me (t1 int);

Table created.

SQL> create index ind_validate_me on validate_me(t1);

Index created.

SQL> insert into validate_me select rownum  from dba_tables where rownum<201;

200 rows created.

SQL> commit;

Commit complete.

SQL> oradebug setmypid;
Statement processed.

SQL> oradebug event 10704 trace name context forever,level 10;
Statement processed.

SQL> analyze table validate_me validate structure cascade online;

Table analyzed.

SQL> oradebug tracefile_name;
c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_3020.trc

g11r2_ora_3020.trc=========================================================================

*** 2011-09-18 20:55:25.373
Oradebug command 'event 10704 trace name context forever,level 10' console output: <none>

*** 2011-09-18 20:55:49.765
ksqgtl *** TX-00060005-000006a9 mode=6 flags=0x401 timeout=0 ***
ksqgtl: xcb=0x31434194, ktcdix=2147483647, topxcb=0x31434194
ktcipt(topxcb)=0x0

*** 2011-09-18 20:55:49.766
ksucti: init txn DID from session DID
ksqgtl:
ksqlkdid: 0001-001B-00000006

*** 2011-09-18 20:55:49.766
*** ksudidTrace: ksqgtl
ktcmydid(): 0001-001B-00000006
ksusesdi:   0000-0000-00000000
ksusetxn:   0001-001B-00000006
ksqgtl: RETURNS 0

*** 2011-09-18 20:55:49.766
ksqrcl: TX,60005,6a9
ksqrcl: returns 0

但是validate strucutre online也有它的缺点,那就是在线模式下结构验证命令将不填充索引的状态信息到index_stats视图,如:

SQL> analyze index  ind_validate_me validate structure  ;

Index analyzed.

SQL> select count(*) from index_stats;

COUNT(*)
----------
1

SQL> conn / as sysdba
Connected.
SQL> analyze index  ind_validate_me validate structure  online;

Index analyzed.

SQL> select count(*) from index_stats;

COUNT(*)
----------
0

但是因为我们这里只要用到validate structure的结构验证功能,而对索引的详细状态没有兴趣,所以我们可以充分利用该online模式。

 

具体验证

 

使用validate structure online具体验证该问题表和表上的索引:

先仅对表进行验证,以区分到底是表还是索引存在逻辑讹误 

SQL> analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online;
analyze table SHUCRM2O.CUST_SUBSCRB_PERSONAL  validate structure online
*
ERROR at line 1:
ORA-01498: block check failure - see trace file

若验证发现问题会出现ORA-01498错误,并产生trace 文件

kdrchk:  row is marked as both a Key and Clustered
prow=0x7000001f241c45c flag=0xff
Block Checking: DBA = 50730514, Block Type = KTB-managed data block
data header at 0x7000001f241c07c
kdbchk: bad row tab 0, slot 23
Block header dump:  0x03061612
Object id on Block? Y
seg/obj: 0x102a1  csc: 0xb43.ecde68ca  itc: 3  flg: E  typ: 1 - DATA
brn: 0  bdba: 0x3061609 ver: 0x01 opc: 0
inc: 0  exflg: 0

............

可以看到问题发生在  23 号槽位上

tab 0, row 23, @0x3e0
tl: 4 fb: KCHDFLPN lb: 0x1  cc: 0 cki: 1
tab 0, row 24, @0x3de
tl: 2 fb: --HD---N lb: 0x30
tab 0, row 25, @0x3dc
tl: 2 fb: --HD---- lb: 0x30
tab 0, row 26, @0x3d8
tl: 4 fb: KCHDFLPN lb: 0xff  cc: 0 cki: 255

这里居然第23个row piece的 flag 是KCHDFLPN 即实际fb = 0xff,也就是该row piece同时被标记为key和clustered(row is marked as both a Key and Clustered),因此不管当服务进程尝试update该问题行记录或者通过ROWID访问该row时都出现了ORA-00600错误,虽然其错误代码不同,但都是由于该数据块中第23行记录的flag存在讹误引起的。

关于该ORA-00600:[13013], [5001]问题的成因和解决方法,更多内容可以参考<手工模拟Oracle数据块逻辑讹误引发ORA-00600:[13013], [5001]一例>一文。

 

待修订!

Upgrade GI/CRS 11.1.0.7 to 11.2.0.2. Rootupgrade.sh Hanging

Upgrade grid 11.1.0.7 to 11.2.0.2. Rootupgrade.sh Hanging

We installed 11gR2 GI software and applied PSU2 patches upon getting runupgrade.sh prompt.runupgrade.sh hang on the first node.

[root@vrh8 client]# uname -a
Linux vrh8 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64

x86_64 x86_64 GNU/Linux
cluvfy passed with 2 ignorable errors:

[root@vrh8 vrh8]# cd /tmp
[root@vrh8 tmp]# df -lh .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-tmp 992M 263M 679M 28% /tmp

[root@vrh8 grid]# grep fail cluvfy_during_inst.log
/tmp l118464lwap1049 /tmp 713MB 1GB failed
Result: Free disk space check failed for “l118464lwap1049:/tmp”
/tmp vrh8 /tmp 692.131MB 1GB failed
Result: Free disk space check failed for “vrh8:/tmp”
Result: Check for multiple users with UID value 0 failed

[root@vrh8 vrh8]# cd /tmp
[root@vrh8 tmp]# df -lh .
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-tmp 992M 263M 679M 28% /tmp

We installed 11gR2 GI software and applied PSU2 patches upon getting runupgrade.sh prompt.

runupgrade.sh hang on the first node. We followed “How to Proceed from Failed Upgrade to 11gR2

Grid Infrastructure on Linux/Unix [ID 969254.1]” 1A section, it didn’t help.

[root@vrh8 bin]# ./crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.1.0.7.0]

rootupgrade.sh output:

[root@vrh8 11.2.0.2]# ./rootupgrade.sh
Running Oracle 11g root script…

The following environment variables are set as:
ORACLE_OWNER= oracrs
ORACLE_HOME= /d22/oracrs/11.2.0.2

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of “dbhome” have not changed. No need to overwrite.
The contents of “oraenv” have not changed. No need to overwrite.
The contents of “coraenv” have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /d22/oracrs/11.2.0.2/crs/install/crsconfig_params
LOCAL ADD MODE
Creating OCR keys for user ‘root’, privgrp ‘root’..
Operation successful.
OLR initialization – successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9312: Existing ADVM/ACFS installation detected.
ACFS-9314: Removing previous ADVM/ACFS installation.
ACFS-9315: Previous ADVM/ACFS components successfully removed.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies – this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.

****hanging here for more than 2 hrs, so we cancelled it

INT at /d22/oracrs/11.2.0.2/crs/install/crsconfig_lib.pm line 1173.
/d22/oracrs/11.2.0.2/perl/bin/perl -I/d22/oracrs/11.2.0.2/perl/lib –

I/d22/oracrs/11.2.0.2/crs/install /d22/oracrs/11.2.0.2/crs/install/rootcrs.pl execution failed
Oracle root script execution aborted!

1. The below logs are required to analyze this issue.

NEW_GRID_HOME/cfgtoollogs/crsconfig/*.*
NEW_GRID_HOME/log/<nodename>/*.*

Please upload the logs under the above directories. Zip and upload the files including the subdirectories.

2. When the rootupgrade was handing, did you check the usage of /tmp. Was free space exhausting?

=== ODM Research ===

There has been multiple root script run for upgrade. I have taken the first incident from the file
rootcrs_vrh8.log:
—————————————–

2011-02-13 13:07:55: Successfully started requested Oracle stack daemons
2011-02-13 13:07:55: Upgrading the existing voting disks!
2011-02-13 13:07:55: Executing /d22/oracrs/11.2.0.2/bin/cssvfupgd
2011-02-13 13:07:55: Executing cmd: /d22/oracrs/11.2.0.2/bin/cssvfupgd <<<<<<<<<<<<<<< The root script seems to hang at this point.
2011-02-13 15:01:16: ###### Begin DIE Stack Trace ######
2011-02-13 15:01:16: Package File Line Calling
2011-02-13 15:01:16: ————— ——————– —- ———-
2011-02-13 15:01:16: 1: main rootcrs.pl 325 crsconfig_lib::dietrap
2011-02-13 15:01:16: 2: crsconfig_lib crsconfig_lib.pm 9301 main::__ANON__
2011-02-13 15:01:16: 3: crsconfig_lib crsconfig_lib.pm 9301 (eval)
2011-02-13 15:01:16: 4: crsconfig_lib crsconfig_lib.pm 9260 crsconfig_lib::system_cmd_capture1
2011-02-13 15:01:16: 5: crsconfig_lib crsconfig_lib.pm 9247 crsconfig_lib::system_cmd_capture
2011-02-13 15:01:16: 6: crsconfig_lib crsconfig_lib.pm 924 crsconfig_lib::system_cmd
2011-02-13 15:01:16: 7: oracss oracss.pm 275 crsconfig_lib::run_crs_cmd
2011-02-13 15:01:16: 8: crsconfig_lib crsconfig_lib.pm 1019 oracss::CSS_upgrade
2011-02-13 15:01:16: 9: crsconfig_lib crsconfig_lib.pm 1006 crsconfig_lib::start_cluster
2011-02-13 15:01:16: 10: main rootcrs.pl 697 crsconfig_lib::perform_start_cluster
2011-02-13 15:01:16: ####### End DIE Stack Trace #######

cssvfupgd.log:
——————–
Oracle Database 11g Clusterware Release 11.2.0.2.0 – Production Copyright 1996, 2010 Oracle. All rights reserved.
2011-02-13 13:07:55.356: [ OCRRAW][3605955376]prgval:buffer passed is too small
2011-02-13 13:07:55.361: [CSSVFUPG][3605955376]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
2011-02-13 13:07:55.365: [ OCRRAW][3605955376]prgval:buffer passed is too small
2011-02-13 13:07:55.369: [CSSVFUPG][3605955376]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk2.dat
2011-02-13 13:07:55.373: [ OCRRAW][3605955376]prgval:buffer passed is too small
2011-02-13 13:07:55.377: [CSSVFUPG][3605955376]cssvfupgd_GetVFList: found voting file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk3.dat
2011-02-13 13:07:55.402: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.misscount
2011-02-13 13:07:55.404: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.disktimeout
2011-02-13 13:07:55.406: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.reboottime
2011-02-13 13:07:55.408: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.diagwait
2011-02-13 13:07:55.414: [CSSVFUPG][3605955376]cssvfupgd_SetNum: Processing SYSTEM.css.pollinterval
2011-02-13 13:07:55.416: [CSSVFUPG][3605955376]cssvfupgd_GetGUID: Fetching GUID for /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
2011-02-13 13:07:55.419: [ SKGFD][3605955376]NOTE: No asm libraries found in the system

2011-02-13 13:07:55.419: [ CLSF][3605955376]Allocated CLSF context
2011-02-13 13:07:55.419: [ SKGFD][3605955376]Discovery with str:/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.419: [ SKGFD][3605955376]UFS discovery with :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.420: [ SKGFD][3605955376]Fetching UFS disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.420: [ SKGFD][3605955376]OSS discovery with :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 13:07:55.421: [ SKGFD][3605955376]Handle 0x124de360 from lib :UFS:: for disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 14:19:31.132: [ SKGFD][3605955376]WARNING:io_getevents timed out 2226 sec >>>>>>>>>>>>>>>>>>>> After about one hour it shows time out error.

2011-02-13 14:19:31.132: [ SKGFD][3605955376]WARNING:io_getevents timed out 2226 sec

The script has stalled at the voting disk upgrade phase. Please provide me the below details.

1. What cluster file system are you using for the voting files? provide its details and the mount options used.

for ocfs, get its mount options
mount | grep ocfs

3. Voting disks details
ls -l /s01/app/ocrvot/VOTEDISK/UAT2_vdisk*

4. Get the diagwait detail.
OLD_CRS_HOME/bin/crsctl get css diagwait

1. What cluster file system are you using for the voting files? provide its details and the mount options used
/dev/emcpowera1 on /s01/app/ocrvot type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)

2. Voting disks details

[root@vrh8 11.2.0.2]# ls -l /s01/app/ocrvot/VOTEDISK/UAT2_vdisk*
-rw-r—– 1 oracrs oinstall 21004288 Jun 11 07:31 /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
-rw-r—– 1 oracrs oinstall 21004288 Jun 11 07:31 /s01/app/ocrvot/VOTEDISK/UAT2_vdisk2.dat
-rw-r—– 1 oracrs oinstall 21004288 Jun 11 07:31 /s01/app/ocrvot/VOTEDISK/UAT2_vdisk3.dat

 

3. Get the diagwait detail

crsctl get css diagwait
Failure 33 in main Oracle Cluster Registry context initialization: PROC-33: Oracle Cluster Registry is not configured Operating System error [No such file or directory] [2]

owc may not be required now as the issue we face is clear.

The diagwait should not error out, as explained in the following note,
11gR2 rootupgrade.sh Fails as cssvfupgd Can not Upgrade Voting Disk (Doc ID 1102283.1)

Make sure you are running ‘crsctl get css diagwait’ from the old crs home. You can also check it in multiple node. If it errors out, this has to be fixed as explained in the above note.

according to that note ,When I ./oprocd stop ,get error:
[root@l118464lwap1049 bin]# ./oprocd stop
Jun 16 23:24:42.966 | ERR | failed to connect to daemon, errno(111)

ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies – this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.

cssvfupgd.log
2011-02-13 23:36:49.311: [ OCRRAW][3394941744]prgval:buffer passed is too small
2011-02-13 23:36:49.315: [CSSVFUPG][3394941744]cssvfupgd_GetVFList: found voting
file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk2.dat
2011-02-13 23:36:49.319: [ OCRRAW][3394941744]prgval:buffer passed is too small
2011-02-13 23:36:49.323: [CSSVFUPG][3394941744]cssvfupgd_GetVFList: found voting
file /s01/app/ocrvot/VOTEDISK/UAT2_vdisk3.dat
2011-02-13 23:36:49.351: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYST
EM.css.misscount
2011-02-13 23:36:49.354: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYST
EM.css.disktimeout
2011-02-13 23:36:49.356: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYST
EM.css.reboottime
2011-02-13 23:36:49.358: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYST
EM.css.diagwait
2011-02-13 23:36:49.367: [CSSVFUPG][3394941744]cssvfupgd_SetNum: Processing SYST
EM.css.pollinterval
2011-02-13 23:36:49.369: [CSSVFUPG][3394941744]cssvfupgd_GetGUID: Fetching GUID
for /s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat
2011-02-13 23:36:49.371: [ SKGFD][3394941744]NOTE: No asm libraries found in t
he system

2011-02-13 23:36:49.372: [ CLSF][3394941744]Allocated CLSF context
2011-02-13 23:36:49.372: [ SKGFD][3394941744]Discovery with str:/s01/app/ocrvo
t/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]UFS discovery with :/s01/app/ocrv
ot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]Fetching UFS disk :/s01/app/ocrvo
t/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]OSS discovery with :/s01/app/ocrv
ot/VOTEDISK/UAT2_vdisk1.dat:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]Handle 0x98c4360 from lib :UFS::
for disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:

Question:
in Your update about cssvfupgd.log You stated it was hanging there.
Is there an entry after about 70 minutes about a timeout in that log file like:

2011-02-13 23:36:49.372: [ SKGFD][3394941744]Handle 0x98c4360 from lib :UFS::
for disk :/s01/app/ocrvot/VOTEDISK/UAT2_vdisk1.dat:
2011-02-17 0:48:19.372: [ SKGFD][3394941744]WARNING:io_getevents timed out 4294 sec <<<< present ???

Please provide the following outputs:
rpm -qa|grep ocfs2
uname -a
cat /etc/redhat-release

[root@vrh8 ~]# rpm -qa|grep ocfs2
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5
[root@vrh8 ~]# uname -a
Linux vrh8 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@vrh8 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
[root@vrh8 ~]#

Combinations that install SUCCESSFUL:

OEL5.4+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
OEL5.6+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
OEL5.6+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
RHLE5.6+OEL kernel(redhat compatible kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
RHLE5.6+OEL kernel(redhat compatible kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
RHEL5.4

Combinations that failed:
RHLE5.6(redhat kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
RHLE5.6(redhat kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3

Problem reproduces with redhat kernel — RHEL 5.6 with 2.6.18-2xx kernels

Please review the following Note to change the location of your voting disk
Note 428681.1
Title: How to ADD/REMOVE/REPLACE/MOVE Oracle Cluster Registry (OCR) and Voting Disk

Pasting info from —
Oracle? Clusterware Administration and Deployment Guide
11g Release 2 (11.2)

3 Managing Oracle Cluster Registry and Voting Disks
Oracle Universal Installer for Oracle Clusterware 11g release 2 (11.2), does not support the use of raw or block devices. However, if you upgrade from a previous Oracle Clusterware release, then you can continue to use raw or block devices.

[oracrs@vrh8 grid]$ grep fail cluvfy_during_inst_061711.log
/tmp l118464lwap1049 /tmp 706MB 1GB failed
Result: Free disk space check failed for “l118464lwap1049:/tmp”
/tmp vrh8 /tmp 927.1312MB 1GB failed
Result: Free disk space check failed for “vrh8:/tmp”
Result: Check for multiple users with UID value 0 failed
PRVF-5431 : Oracle Cluster Voting Disk configuration check failed

[oracrs@vrh8 grid]$ ./runcluvfy.sh stage -pre crsinst -n vrh8,l118464lwap1049 -verbose|tee cluvfy_during_inst.log

Please upload the following Cluvfy trace log —
$ORA_CRS_HOME/cv/log/cvutrace.log.0

Please download the latest CVU from OTN:
http://www.oracle.com/technetwork/database/clustering/downloads/cvu-download-homepage-099973.html

Please upload
/s02/app/crs/11.2.0.2/log/vrh8/agent/ohasd/oraagent_oracrs/oraagent_oracrs.log

In addition pls upload
/s02/app/crs/11.2.0.2/log/vrh8/agent/ohasd/oracssdagent_root/oracssdagent_root.log

Please run this command on both the new setup and your existing production setup for a quick comparison —
rpm -qa|grep ocfs2

Server with issue:
[root@vrh8 ohasd]# rpm -qa|grep ocfs2
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5

Prod:

[root@vrh9  bin]# rpm -qa|grep ocfs2
ocfs2-2.6.18-194.el5-1.4.7-1.el5
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-194.8.1.el5-1.4.7-1.el5

[root@vrh8 ~]# uname -a
Linux vrh8 2.6.18-238.5.1.el5 #1 SMP Mon Feb 21 05:52:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

[root@vrh8 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)

rpm -qa|grep ocfs2
ocfs2console-1.4.4-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-238.5.1.el5-1.4.7-1.el5

@ . from Bug 11876815 (Doc ID 1321757.1)
@ combinations that install SUCCESSFUL:
@ .
@ OEL5.4+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ OEL5.6+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
@ OEL5.6+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ RHLE5.6+OEL kernel(redhat compatible kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
@ RHLE5.6+OEL kernel(redhat compatible kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ RHEL5.4
@ .
@ combinations that failed:
@ RHLE5.6(redhat kernel)+ocfs2-1.4.7-1+ocfs2-tools-1.4.4
@ RHLE5.6(redhat kernel)+ocfs2-1.4.8-1+ocfs2-tools-1.6.3
@ .
@ .
@ So that is clear that , it is redhat kernel’s problem.Since RHEL5.6 redhat
@ provided 2.6.18-2xx kernels, we can’t fix redhat kernels, please use Oracle
@ Enterprise kernel (redhat compatible) for installation.

As per last action plan (conveyed if any) you need to contact REDHAT support to know the cause of this issue. Workaround is to not use OCFS and go for raw device for upgrade to succeed.
A Oracle bug 11876815 was logged internally for this hang issue and few combinations of OEL, RHEL, OCFS2 were tried and tested and the combination you are using has not worked for us too (per bug internal updates given above)
The solution provided by Oracle bug developer is to use OEL and not RHEL or contact RHEL support for identifying the cause and solution (incase they have already tested this setup).
Let me know if RHEL support is already engaged and provide the case id so that I can open internal SR for Oracle/Red Hat Joint Escalation Team (JET) Engagement for both vendors to work together internally.

+ the SR issue of grid upgrade from 11.1 to 11.2.0.2.2 is resolved
– voting disk was moved from ocfs to raw device – as a workaround for Bug 11876815
– set TMP and TEMP env to new dir with availabe space before running the installer and prechecks to succeed
– applied GIPSU#2 before the rootupgrade.sh step
– rootupgrade.sh step was successful on all nodes
– verified post upgrade checks and logs to confirm GI upgrade was success !

+ DB upgrade to 11.2.0.2 Plus PSU#2 will be resumed shorlty

MMON Slave日志[KDL_TRIM]: NEWLEN: MESSAGES

有网友反映windows 2003 上的10.2.0.4数据库,MMON的Slave后台进程报带有NEWLEN: MESSAGES的相关日志,如:


*** ACTION NAME:(Auto ADDM Slave Action) 2011-09-09 18:00:12.053
*** MODULE NAME:(MMON_SLAVE) 2011-09-09 18:00:12.053
*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-09 18:00:12.053
*** SESSION ID:(797.606) 2011-09-09 18:00:12.053
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0

这一般是由于10.2.0.4上已经confirmed的bug:”Bug 6972843 – Unnecessary trace files with “[kdl_trim]: newlen” text in them [ID 6972843.8]”所引起的,在bug目前确认仅在10.2.0.4上发生,而到10.2.0.4.1中已经被修复了。

主要是MMON的辅助slave进程会不必要地输出如”[kdl_trim]: newlen: 0″的日志信息:


Unnecessary trace files may be generated with entries that read:'[kdl_trim]: newlen: '.
 
Hdr: 9600827 10.2.0.4 RDBMS 10.2.0.4 RAM LOBS PRODID-5 PORTID-212
Abstract: [KDL_TRIM]: NEWLEN: MESSAGES IN MMON SLAVES TRACES

PROBLEM:
--------
MMON slave processes traces contains the following entries despite patch 
6972843 applied

*** NAME:(Auto ADDM Slave Action) 2010-04-14 06:00:42.899
*** NAME:(MMON_SLAVE) 2010-04-14 06:00:42.899
*** NAME:(SYS$BACKGROUND) 2010-04-14 06:00:42.899
*** ID:(2472.45569) 2010-04-14 06:00:42.899
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0

DIAGNOSTIC ANALYSIS:
--------------------
- Opatch contains patch 6972843 applied

WORKAROUND:
-----------
none

RELATED BUGS:
-------------
Bug 6972843 - HOURLY TRACE FILE CREATED IN BDUMP FOR AUTO ADDM SLAVE ACTION

该bug可以被忽略,但是如果日志产生过于频繁的话可以考虑升级10.2.0.4的PSU或者直接升级到10.2.0.5。

为11.2.0.2 Grid Infrastructure添加节点

在之前的文章中我介绍了为10g RAC Cluster添加节点的具体步骤。在11gr2中Oracle CRS升级为Grid Infrastructure,通过GI我们可以更方便地控制CRS资源如:VIP、ASM等等,这也导致了在为11.2中的GI添加节点时,同10gr2相比有着较大的差异。

这里我们要简述在11.2中为GI ADD NODE的几个要点:

一、准备工作

准备工作是不可忽略的,在10g RAC Cluster添加节点中我列举了必须完成的先决条件,在11.2 GI中这些条件依然有效,但请注意以下2点:

1.不仅要为oracle用户配置用户等价性,也要为grid(GI安装用户)用户配置;除非你同时使用oracle安装GI和RDBMS,这是不推荐的

2.在11.2 GI中推出了octssd(Oracle Cluster Synchronization Service Daemon)时间同步服务,如果打算使用octssd的话那么建议禁用ntpd事件服务,具体方法如下:

# service ntpd stop
Shutting down ntpd:                                        [  OK  ]
# chkconfig ntpd off
# mv /etc/ntp.conf /etc/ntp.conf.orig
# rm /var/run/ntpd.pid

3.使用cluster verify工具验证新增节点是否满足cluster的要求:

cluvfy stage -pre nodeadd -n <NEW NODE>

具体用法如:

su - grid

[grid@vrh1 ~]$ cluvfy stage -pre nodeadd -n vrh3

Performing pre-checks for node addition 

Checking node reachability...
Node reachability check passed from node "vrh1"

Checking user equivalence...
User equivalence check passed for user "grid"

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity for interface "eth0"
Node connectivity passed for interface "eth0"

Node connectivity check passed

Checking CRS integrity...

CRS integrity check passed

Checking shared resources...

Checking CRS home location...
The location "/g01/11.2.0/grid" is not shared but is present/creatable on all nodes
Shared resources check for node addition passed

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity for interface "eth0"
Node connectivity passed for interface "eth0"

Check: Node connectivity for interface "eth1"
Node connectivity passed for interface "eth1"

Node connectivity check passed

Total memory check passed
Available memory check passed
Swap space check passed
Free disk space check passed for "vrh3:/tmp"
Free disk space check passed for "vrh1:/tmp"
Check for multiple users with UID value 54322 passed
User existence check passed for "grid"
Run level check passed
Hard limits check failed for "maximum open file descriptors"
Check failed on nodes:
        vrh3
Soft limits check passed for "maximum open file descriptors"
Hard limits check passed for "maximum user processes"
Soft limits check passed for "maximum user processes"
System architecture check passed
Kernel version check passed
Kernel parameter check passed for "semmsl"
Kernel parameter check passed for "semmns"
Kernel parameter check passed for "semopm"
Kernel parameter check passed for "semmni"
Kernel parameter check passed for "shmmax"
Kernel parameter check passed for "shmmni"
Kernel parameter check passed for "shmall"
Kernel parameter check passed for "file-max"
Kernel parameter check passed for "ip_local_port_range"
Kernel parameter check passed for "rmem_default"
Kernel parameter check passed for "rmem_max"
Kernel parameter check passed for "wmem_default"
Kernel parameter check passed for "wmem_max"
Kernel parameter check passed for "aio-max-nr"
Package existence check passed for "make-3.81( x86_64)"
Package existence check passed for "binutils-2.17.50.0.6( x86_64)"
Package existence check passed for "gcc-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libaio-0.3.106 (x86_64)( x86_64)"
Package existence check passed for "glibc-2.5-24 (x86_64)( x86_64)"
Package existence check passed for "compat-libstdc++-33-3.2.3 (x86_64)( x86_64)"
Package existence check passed for "elfutils-libelf-0.125 (x86_64)( x86_64)"
Package existence check passed for "elfutils-libelf-devel-0.125( x86_64)"
Package existence check passed for "glibc-common-2.5( x86_64)"
Package existence check passed for "glibc-devel-2.5 (x86_64)( x86_64)"
Package existence check passed for "glibc-headers-2.5( x86_64)"
Package existence check passed for "gcc-c++-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libaio-devel-0.3.106 (x86_64)( x86_64)"
Package existence check passed for "libgcc-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libstdc++-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libstdc++-devel-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "sysstat-7.0.2( x86_64)"
Package existence check passed for "ksh-20060214( x86_64)"
Check for multiple users with UID value 0 passed
Current group ID check passed

Checking OCR integrity...

OCR integrity check passed

Checking Oracle Cluster Voting Disk configuration...

Oracle Cluster Voting Disk configuration check passed
Time zone consistency check passed

Starting Clock synchronization checks using Network Time Protocol(NTP)...

NTP Configuration file check started...
No NTP Daemons or Services were found to be running

Clock synchronization check using Network Time Protocol(NTP) passed

User "grid" is not part of "root" group. Check passed
Checking consistency of file "/etc/resolv.conf" across nodes

File "/etc/resolv.conf" does not have both domain and search entries defined
domain entry in file "/etc/resolv.conf" is consistent across nodes
search entry in file "/etc/resolv.conf" is consistent across nodes
All nodes have one search entry defined in file "/etc/resolv.conf"
PRVF-5636 : The DNS response time for an unreachable node exceeded "15000" ms on following nodes: vrh3

File "/etc/resolv.conf" is not consistent across nodes

Pre-check for node addition was unsuccessful on all the nodes.

一般来说如果我们不使用DNS解析域名方式的话,那么resolv.conf不一直的问题可以忽略,但在slient安装模式下可能造成我们的操作无法完成,这个后面会介绍。

二、向GI中加入新的节点

注意11.2.0.2 GI添加节点的关键脚本addNode.sh可能存在Bug,如官方文档所述当希望使用Interactive Mode交互模式启动OUI界面添加节点时,只要运行addNode.sh脚本即可,实际情况则不是这样:

documentation said:
Go to CRS_home/oui/bin and run the addNode.sh script on one of the existing nodes.
Oracle Universal Installer runs in add node mode and the Welcome page displays.
Click Next and the Specify Cluster Nodes for Node Addition page displays.

we done:

运行addNode.sh要求以GI拥有者身份运行该脚本,一般为grid用户,要求在已有的正运行GI的节点上启动脚本

[grid@vrh1 ~]$ cd $ORA_CRS_HOME/oui/bin

[grid@vrh1 bin]$ ./addNode.sh
ERROR:
Value for CLUSTER_NEW_NODES not specified.

USAGE:
/g01/11.2.0/grid/cv/cvutl/check_nodeadd.pl  {-pre|-post} 

/g01/11.2.0/grid/cv/cvutl/check_nodeadd.pl -pre [-silent] CLUSTER_NEW_NODES={}
/g01/11.2.0/grid/cv/cvutl/check_nodeadd.pl -pre [-silent] CLUSTER_NEW_NODES={} 
CLUSTER_NEW_VIRTUAL_HOSTNAMES={}

/g01/11.2.0/grid/cv/cvutl/check_nodeadd.pl -pre [-silent] -responseFile
/g01/11.2.0/grid/cv/cvutl/check_nodeadd.pl -post [-silent]

我们的本意是期望使用图形化的交互界面的OUI(runInstaller -addnode)来新增节点,然而addNode.sh居然让我们输入一些参量,而且其调用的check_nodeadd.pl脚本使用的是silent模式。

在MOS和GOOGLE上搜了一圈,基本所有的文档都推荐使用silent模式来添加节点,无法只好转到静默添加上来。实际上静默添加所需要提供的参数并不多,这可能是这种方式得到推崇的原因之一,但是这里又碰到问题了:

语法SYNTAX:
./addNode.sh –silent 
"CLUSTER_NEW_NODES={node2}" 
"CLUSTER_NEW_PRIVATE_NODE_NAMES={node2-priv}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={node2-vip}"

在我们的例子中具体命令如下

./addNode.sh -silent
"CLUSTER_NEW_NODES={vrh3}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={vrh3-vip}"
"CLUSTER_NEW_PRIVATE_NODE_NAMES={vrh3-priv}" 

以上命令因为采用silent模式所以没有任何窗口输出(实际上会输出到 /tmp/silentInstall.log日志文件中),去掉-silent参数

./addNode.sh  "CLUSTER_NEW_NODES={vrh3}"
"CLUSTER_NEW_VIRTUAL_HOSTNAMES={vrh3-vip}" "CLUSTER_NEW_PRIVATE_NODE_NAMES={vrh3-priv}"

Performing pre-checks for node addition 

Checking node reachability...
Node reachability check passed from node "vrh1"

Checking user equivalence...
User equivalence check passed for user "grid"

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity for interface "eth0"
Node connectivity passed for interface "eth0"

Node connectivity check passed

Checking CRS integrity...

CRS integrity check passed

Checking shared resources...

Checking CRS home location...
The location "/g01/11.2.0/grid" is not shared but is present/creatable on all nodes
Shared resources check for node addition passed

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity for interface "eth0"
Node connectivity passed for interface "eth0"

Check: Node connectivity for interface "eth1"
Node connectivity passed for interface "eth1"

Node connectivity check passed

Total memory check passed
Available memory check passed
Swap space check passed
Free disk space check passed for "vrh3:/tmp"
Free disk space check passed for "vrh1:/tmp"
Check for multiple users with UID value 54322 passed
User existence check passed for "grid"
Run level check passed
Hard limits check failed for "maximum open file descriptors"
Check failed on nodes:
        vrh3
Soft limits check passed for "maximum open file descriptors"
Hard limits check passed for "maximum user processes"
Soft limits check passed for "maximum user processes"
System architecture check passed
Kernel version check passed
Kernel parameter check passed for "semmsl"
Kernel parameter check passed for "semmns"
Kernel parameter check passed for "semopm"
Kernel parameter check passed for "semmni"
Kernel parameter check passed for "shmmax"
Kernel parameter check passed for "shmmni"
Kernel parameter check passed for "shmall"
Kernel parameter check passed for "file-max"
Kernel parameter check passed for "ip_local_port_range"
Kernel parameter check passed for "rmem_default"
Kernel parameter check passed for "rmem_max"
Kernel parameter check passed for "wmem_default"
Kernel parameter check passed for "wmem_max"
Kernel parameter check passed for "aio-max-nr"
Package existence check passed for "make-3.81( x86_64)"
Package existence check passed for "binutils-2.17.50.0.6( x86_64)"
Package existence check passed for "gcc-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libaio-0.3.106 (x86_64)( x86_64)"
Package existence check passed for "glibc-2.5-24 (x86_64)( x86_64)"
Package existence check passed for "compat-libstdc++-33-3.2.3 (x86_64)( x86_64)"
Package existence check passed for "elfutils-libelf-0.125 (x86_64)( x86_64)"
Package existence check passed for "elfutils-libelf-devel-0.125( x86_64)"
Package existence check passed for "glibc-common-2.5( x86_64)"
Package existence check passed for "glibc-devel-2.5 (x86_64)( x86_64)"
Package existence check passed for "glibc-headers-2.5( x86_64)"
Package existence check passed for "gcc-c++-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libaio-devel-0.3.106 (x86_64)( x86_64)"
Package existence check passed for "libgcc-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libstdc++-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "libstdc++-devel-4.1.2 (x86_64)( x86_64)"
Package existence check passed for "sysstat-7.0.2( x86_64)"
Package existence check passed for "ksh-20060214( x86_64)"
Check for multiple users with UID value 0 passed
Current group ID check passed

Checking OCR integrity...

OCR integrity check passed

Checking Oracle Cluster Voting Disk configuration...

Oracle Cluster Voting Disk configuration check passed
Time zone consistency check passed

Starting Clock synchronization checks using Network Time Protocol(NTP)...

NTP Configuration file check started...
No NTP Daemons or Services were found to be running

Clock synchronization check using Network Time Protocol(NTP) passed

User "grid" is not part of "root" group. Check passed
Checking consistency of file "/etc/resolv.conf" across nodes

File "/etc/resolv.conf" does not have both domain and search entries defined
domain entry in file "/etc/resolv.conf" is consistent across nodes
search entry in file "/etc/resolv.conf" is consistent across nodes
All nodes have one search entry defined in file "/etc/resolv.conf"
PRVF-5636 : The DNS response time for an unreachable node exceeded "15000" ms on following nodes: vrh3

File "/etc/resolv.conf" is not consistent across nodes

Checking VIP configuration.
Checking VIP Subnet configuration.
Check for VIP Subnet configuration passed.
Checking VIP reachability
Check for VIP reachability passed.

Pre-check for node addition was unsuccessful on all the nodes.

在addNode.sh正式添加节点之前它也会调用cluvfy工具来验证新加入节点是否满足条件,如果不满足则拒绝下一步操作。因为我们在之前已经验证过了新节点的可用性,所以这里完全可以跳过addNode.sh的验证,具体来看一下addNode.sh脚本的内容:

[grid@vrh1 bin]$ cat addNode.sh 

#!/bin/sh
OHOME=/g01/11.2.0/grid
INVPTRLOC=$OHOME/oraInst.loc
ADDNODE="$OHOME/oui/bin/runInstaller -addNode -invPtrLoc $INVPTRLOC ORACLE_HOME=$OHOME $*"
if [ "$IGNORE_PREADDNODE_CHECKS" = "Y" -o ! -f "$OHOME/cv/cvutl/check_nodeadd.pl" ]
then
        $ADDNODE
else
        CHECK_NODEADD="$OHOME/perl/bin/perl $OHOME/cv/cvutl/check_nodeadd.pl -pre $*"
        $CHECK_NODEADD
        if [ $? -eq 0 ]
        then
        $ADDNODE
        fi
fi

可以看到存在一个IGNORE_PREADDNODE_CHECKS环境变量可以控制是否进行节点新增的预检查,我们手动设置该变量,之后再次运行addNode.sh脚本:

export IGNORE_PREADDNODE_CHECKS=Y

./addNode.sh  "CLUSTER_NEW_NODES={vrh3}"
"CLUSTER_NEW_VIRTUAL_HOSTNAMES={vrh3-vip}" "CLUSTER_NEW_PRIVATE_NODE_NAMES={vrh3-priv}"
> add_node.log  2>&1

另开一个窗口可以监控新增节点的过程日志

tail -f add_node.log 

Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 5951 MB    Passed
Checking monitor: must be configured to display at least 256 colors.    Actual 16777216    Passed
Oracle Universal Installer, Version 11.2.0.2.0 Production
Copyright (C) 1999, 2010, Oracle. All rights reserved.

Performing tests to see whether nodes vrh2,vrh3 are available
............................................................... 100% Done.

.
-----------------------------------------------------------------------------
Cluster Node Addition Summary
Global Settings
   Source: /g01/11.2.0/grid
   New Nodes
Space Requirements
   New Nodes
      vrh3
         /: Required 6.66GB : Available 32.40GB
Installed Products
   Product Names
      Oracle Grid Infrastructure 11.2.0.2.0
      Sun JDK 1.5.0.24.08
      Installer SDK Component 11.2.0.2.0
      Oracle One-Off Patch Installer 11.2.0.0.2
      Oracle Universal Installer 11.2.0.2.0
      Oracle USM Deconfiguration 11.2.0.2.0
      Oracle Configuration Manager Deconfiguration 10.3.1.0.0
      Enterprise Manager Common Core Files 10.2.0.4.3
      Oracle DBCA Deconfiguration 11.2.0.2.0
      Oracle RAC Deconfiguration 11.2.0.2.0
      Oracle Quality of Service Management (Server) 11.2.0.2.0
      Installation Plugin Files 11.2.0.2.0
      Universal Storage Manager Files 11.2.0.2.0
      Oracle Text Required Support Files 11.2.0.2.0
      Automatic Storage Management Assistant 11.2.0.2.0
      Oracle Database 11g Multimedia Files 11.2.0.2.0
      Oracle Multimedia Java Advanced Imaging 11.2.0.2.0
      Oracle Globalization Support 11.2.0.2.0
      Oracle Multimedia Locator RDBMS Files 11.2.0.2.0
      Oracle Core Required Support Files 11.2.0.2.0
      Bali Share 1.1.18.0.0
      Oracle Database Deconfiguration 11.2.0.2.0
      Oracle Quality of Service Management (Client) 11.2.0.2.0
      Expat libraries 2.0.1.0.1
      Oracle Containers for Java 11.2.0.2.0
      Perl Modules 5.10.0.0.1
      Secure Socket Layer 11.2.0.2.0
      Oracle JDBC/OCI Instant Client 11.2.0.2.0
      Oracle Multimedia Client Option 11.2.0.2.0
      LDAP Required Support Files 11.2.0.2.0
      Character Set Migration Utility 11.2.0.2.0
      Perl Interpreter 5.10.0.0.1
      PL/SQL Embedded Gateway 11.2.0.2.0
      OLAP SQL Scripts 11.2.0.2.0
      Database SQL Scripts 11.2.0.2.0
      Oracle Extended Windowing Toolkit 3.4.47.0.0
      SSL Required Support Files for InstantClient 11.2.0.2.0
      SQL*Plus Files for Instant Client 11.2.0.2.0
      Oracle Net Required Support Files 11.2.0.2.0
      Oracle Database User Interface 2.2.13.0.0
      RDBMS Required Support Files for Instant Client 11.2.0.2.0
      RDBMS Required Support Files Runtime 11.2.0.2.0
      XML Parser for Java 11.2.0.2.0
      Oracle Security Developer Tools 11.2.0.2.0
      Oracle Wallet Manager 11.2.0.2.0
      Enterprise Manager plugin Common Files 11.2.0.2.0
      Platform Required Support Files 11.2.0.2.0
      Oracle JFC Extended Windowing Toolkit 4.2.36.0.0
      RDBMS Required Support Files 11.2.0.2.0
      Oracle Ice Browser 5.2.3.6.0
      Oracle Help For Java 4.2.9.0.0
      Enterprise Manager Common Files 10.2.0.4.3
      Deinstallation Tool 11.2.0.2.0
      Oracle Java Client 11.2.0.2.0
      Cluster Verification Utility Files 11.2.0.2.0
      Oracle Notification Service (eONS) 11.2.0.2.0
      Oracle LDAP administration 11.2.0.2.0
      Cluster Verification Utility Common Files 11.2.0.2.0
      Oracle Clusterware RDBMS Files 11.2.0.2.0
      Oracle Locale Builder 11.2.0.2.0
      Oracle Globalization Support 11.2.0.2.0
      Buildtools Common Files 11.2.0.2.0
      Oracle RAC Required Support Files-HAS 11.2.0.2.0
      SQL*Plus Required Support Files 11.2.0.2.0
      XDK Required Support Files 11.2.0.2.0
      Agent Required Support Files 10.2.0.4.3
      Parser Generator Required Support Files 11.2.0.2.0
      Precompiler Required Support Files 11.2.0.2.0
      Installation Common Files 11.2.0.2.0
      Required Support Files 11.2.0.2.0
      Oracle JDBC/THIN Interfaces 11.2.0.2.0
      Oracle Multimedia Locator 11.2.0.2.0
      Oracle Multimedia 11.2.0.2.0
      HAS Common Files 11.2.0.2.0
      Assistant Common Files 11.2.0.2.0
      PL/SQL 11.2.0.2.0
      HAS Files for DB 11.2.0.2.0
      Oracle Recovery Manager 11.2.0.2.0
      Oracle Database Utilities 11.2.0.2.0
      Oracle Notification Service 11.2.0.2.0
      SQL*Plus 11.2.0.2.0
      Oracle Netca Client 11.2.0.2.0
      Oracle Net 11.2.0.2.0
      Oracle JVM 11.2.0.2.0
      Oracle Internet Directory Client 11.2.0.2.0
      Oracle Net Listener 11.2.0.2.0
      Cluster Ready Services Files 11.2.0.2.0
      Oracle Database 11g 11.2.0.2.0
-----------------------------------------------------------------------------

Instantiating scripts for add node (Monday, August 15, 2011 10:15:35 PM CST)
.                                                                 1% Done.
Instantiation of add node scripts complete

Copying to remote nodes (Monday, August 15, 2011 10:15:38 PM CST)
...............................................................................................                                 96% Done.
Home copied to new nodes

Saving inventory on nodes (Monday, August 15, 2011 10:21:02 PM CST)
.                                                               100% Done.
Save inventory complete
WARNING:A new inventory has been created on one or more nodes in this session.
However, it has not yet been registered as the central inventory of this system.
To register the new inventory please run the script at '/g01/oraInventory/orainstRoot.sh'
with root privileges on nodes 'vrh3'.
If you do not register the inventory, you may not be able to update or
patch the products you installed.
The following configuration scripts need to be executed as the "root" user in each cluster node.
/g01/oraInventory/orainstRoot.sh #On nodes vrh3
/g01/11.2.0/grid/root.sh #On nodes vrh3
To execute the configuration scripts:
    1. Open a terminal window
    2. Log in as "root"
    3. Run the scripts in each cluster node

The Cluster Node Addition of /g01/11.2.0/grid was successful.
Please check '/tmp/silentInstall.log' for more details.

以上GI软件的安装成功了,接下来我们还需要在新加入的节点上运行2个关键的脚本,千万不要忘记这一点!:

运行orainstRoot.sh 和 root.sh脚本要求以root身份
su - root 

[root@vrh3]# cat /etc/oraInst.loc
inventory_loc=/g01/oraInventory                     --这里是oraInventory的位置
inst_group=asmadmin

[root@vrh3 ~]# cd /g01/oraInventory

[root@vrh3 oraInventory]# ./orainstRoot.sh
Creating the Oracle inventory pointer file (/etc/oraInst.loc)
Changing permissions of /g01/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.

Changing groupname of /g01/oraInventory to asmadmin.
The execution of the script is complete.

运行CRS_HOME下的root.sh脚本,可能会有警告但不要紧

[root@vrh3 ~]# cd $ORA_CRS_HOME

[root@vrh3 g01]# /g01/11.2.0/grid/root.sh
Running Oracle 11g root script...

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /g01/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...

Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.

Using configuration parameter file: /g01/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node vrh1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 11g Release 2.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
/g01/11.2.0/grid/bin/srvctl start listener -n vrh3 ... failed
Failed to perform new node configuration at /g01/11.2.0/grid/crs/install/crsconfig_lib.pm line 8255.
/g01/11.2.0/grid/perl/bin/perl -I/g01/11.2.0/grid/perl/lib -I/g01/11.2.0/grid/crs/install 
/g01/11.2.0/grid/crs/install/rootcrs.pl execution failed

以上会出现了2个小错误:

1.新增节点上LISTENER启动失败的问题可以忽略,这是因为RDBMS_HOME仍未安装,但CRS尝试去启动相关的监听

[root@vrh3 g01]# /g01/11.2.0/grid/bin/srvctl start listener -n vrh3
PRCR-1013 : Failed to start resource ora.CRS_LISTENER.lsnr
PRCR-1064 : Failed to start resource ora.CRS_LISTENER.lsnr on node vrh3
CRS-5010: Update of configuration file "/s01/orabase/product/11.2.0/dbhome_1/network/admin/listener.ora" failed: details at "(:CLSN00014:)" in "/g01/11.2.0/grid/log/vrh3/agent/crsd/oraagent_oracle/oraagent_oracle.log"
CRS-5013: Agent "/g01/11.2.0/grid/bin/oraagent.bin" failed to start process "/s01/orabase/product/11.2.0/dbhome_1/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/g01/11.2.0/grid/log/vrh3/agent/crsd/oraagent_oracle/oraagent_oracle.log"
CRS-2674: Start of 'ora.CRS_LISTENER.lsnr' on 'vrh3' failed
CRS-5013: Agent "/g01/11.2.0/grid/bin/oraagent.bin" failed to start process "/s01/orabase/product/11.2.0/dbhome_1/bin/lsnrctl" for action "clean": details at "(:CLSN00008:)" in "/g01/11.2.0/grid/log/vrh3/agent/crsd/oraagent_oracle/oraagent_oracle.log"
CRS-5013: Agent "/g01/11.2.0/grid/bin/oraagent.bin" failed to start process "/s01/orabase/product/11.2.0/dbhome_1/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/g01/11.2.0/grid/log/vrh3/agent/crsd/oraagent_oracle/oraagent_oracle.log"
CRS-2678: 'ora.CRS_LISTENER.lsnr' on 'vrh3' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
PRCC-1015 : LISTENER was already running on vrh3
PRCR-1004 : Resource ora.LISTENER.lsnr is already running

2.rootcrs.pl脚本运行失败的话,一般重新运行一次即可:

[root@vrh3 bin]# /g01/11.2.0/grid/perl/bin/perl -I/g01/11.2.0/grid/perl/lib
-I/g01/11.2.0/grid/crs/install /g01/11.2.0/grid/crs/install/rootcrs.pl

Using configuration parameter file: /g01/11.2.0/grid/crs/install/crsconfig_params
PRKO-2190 : VIP exists for node vrh3, VIP name vrh3-vip
PRKO-2420 : VIP is already started on node(s): vrh3
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded

3.建议在新增节点上重启crs,并使用cluvfy验证nodeadd顺利完成 :

[root@vrh3 ~]# crsctl stop crs

[root@vrh3 ~]# crsctl start crs

[root@vrh3 ~]# su - grid

[grid@vrh3 ~]$ cluvfy stage -post nodeadd -n vrh1,vrh2,vrh3

Performing post-checks for node addition 

Checking node reachability...
Node reachability check passed from node "vrh1"

Checking user equivalence...
User equivalence check passed for user "grid"

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity for interface "eth0"
Node connectivity passed for interface "eth0"

Node connectivity check passed

Checking cluster integrity...

Cluster integrity check passed

Checking CRS integrity...

CRS integrity check passed

Checking shared resources...

Checking CRS home location...
The location "/g01/11.2.0/grid" is not shared but is present/creatable on all nodes
Shared resources check for node addition passed

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Check: Node connectivity for interface "eth0"
Node connectivity passed for interface "eth0"

Check: Node connectivity for interface "eth1"
Node connectivity passed for interface "eth1"

Node connectivity check passed

Checking node application existence...

Checking existence of VIP node application (required)
VIP node application check passed

Checking existence of NETWORK node application (required)
NETWORK node application check passed

Checking existence of GSD node application (optional)
GSD node application is offline on nodes "vrh3,vrh2,vrh1"

Checking existence of ONS node application (optional)
ONS node application check passed

Checking Single Client Access Name (SCAN)...

Checking TCP connectivity to SCAN Listeners...
TCP connectivity to SCAN Listeners exists on all cluster nodes

Checking name resolution setup for "vrh.cluster.oracle.com"...

ERROR:
PRVF-4664 : Found inconsistent name resolution entries for SCAN name "vrh.cluster.oracle.com"

ERROR:
PRVF-4657 : Name resolution setup check for "vrh.cluster.oracle.com" (IP address: 192.168.1.190) failed

ERROR:
PRVF-4664 : Found inconsistent name resolution entries for SCAN name "vrh.cluster.oracle.com"

Verification of SCAN VIP and Listener setup failed

User "grid" is not part of "root" group. Check passed

Checking if Clusterware is installed on all nodes...
Check of Clusterware install passed

Checking if CTSS Resource is running on all nodes...
CTSS resource check passed

Querying CTSS for time offset on all nodes...
Query of CTSS for time offset passed

Check CTSS state started...
CTSS is in Active state. Proceeding with check of clock time offsets on all nodes...
Check of clock time offsets passed

Oracle Cluster Time Synchronization Services check passed

Post-check for node addition was successful.

Oracle内部错误:ORA-07445[kcflfi()+466] [INT_DIVIDE_BY_ZERO]一例

一套Windows上的11.2.0.1单实例数据库在database open阶段出现了ORA-07445:core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []内部错误,具体的出错日志如下:

LOG CONTENT

=======================ALERT.LOG============================

Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
ARCH: Warning; less destinations available than specified
by LOG_ARCHIVE_MIN_SUCCEED_DEST init.ora parameter
Autotune of undo retention is turned on. 
IMODE=BR
ILAT =84
2011-08-01 13:13:47.068000 +08:00
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options.
Using parameter settings in server-side spfile C:\APP\PRODUCT\11.2.0\DBHOME_1\DATABASE\SPFILEG11R2.ORA
System parameters with non-default values:
  _spin_count              = 2000
  processes                = 500
  event                    = "10500 trace name context forever,level 8:10013 trace name context forever,level 10:
10015 trace name context forever,level 10"
  sga_max_size             = 600M
  shared_pool_size         = 152M
  large_pool_size          = 32M
  java_pool_size           = 4M
  streams_pool_size        = 0
  _db_file_direct_io_count = 12
  sga_target               = 0
  memory_target            = 0
  control_files            = "C:\APP\ORADATA\G11R2\CONTROLFILE\O1_MF_6VWCSH9J_.CTL"
  control_files            = "C:\APP\FLASH_RECOVERY_AREA\G11R2\CONTROLFILE\O1_MF_6VWCSHNF_.CTL"
  db_block_checksum        = "TRUE"
  db_block_size            = 8192
  db_cache_size            = 196M
  _shared_io_pool_size     = 0
  compatible               = "11.2.0.0.0"
  log_archive_dest_2       = "service=stdby optional lgwr sync affirm valid_for=(online_logfiles,all_roles)"
  log_buffer               = 10485760
  db_create_file_dest      = "C:\app\oradata"
  db_recovery_file_dest    = "C:\app\flash_recovery_area"
  db_recovery_file_dest_size= 500000M
  undo_tablespace          = "UNDOTBS1"
  _kgl_bucket_count        = 2
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain                = ""
  session_cached_cursors   = 300
  audit_file_dest          = "C:\APP\ADMIN\G11R2\ADUMP"
  optimizer_features_enable= "10.2.0.4"
  audit_trail              = "DB"
  cell_offload_plan_display= "ALWAYS"
  db_name                  = "G11R2"
  open_cursors             = 3000
  _optimizer_extended_cursor_sharing_rel= "NONE"
  pga_aggregate_target     = 300M
  diagnostic_dest          = "C:\APP"
2011-08-01 13:13:48.164000 +08:00
PMON started with pid=2, OS id=984 
VKTM started with pid=3, OS id=3656 at elevated priority
VKTM running at (10)millisec precision with DBRM quantum (100)ms
GEN0 started with pid=4, OS id=5824 
DIAG started with pid=5, OS id=5832 
DBRM started with pid=6, OS id=2784 
PSP0 started with pid=7, OS id=2500 
DIA0 started with pid=8, OS id=5320 
MMAN started with pid=9, OS id=4128 
DBW0 started with pid=10, OS id=5852 
LGWR started with pid=11, OS id=3960 
CKPT started with pid=12, OS id=4472 
SMON started with pid=13, OS id=5788 
RECO started with pid=14, OS id=6036 
MMON started with pid=15, OS id=5740 
MMNL started with pid=16, OS id=2112 
ORACLE_BASE from environment = C:\app
alter database mount exclusive
2011-08-01 13:13:52.390000 +08:00
Sweep [inc][135908]: completed
NSS2 started with pid=19, OS id=2728 
Sweep [inc][135901]: completed
Successful mount of redo thread 1, with mount id 2704081164
Database mounted in Exclusive Mode
2011-08-01 13:13:53.413000 +08:00
Lost write protection disabled
2011-08-01 13:13:54.578000 +08:00
Sweep [inc][135897]: completed
Sweep [inc2][135908]: completed
Sweep [inc2][135901]: completed
Sweep [inc2][135897]: completed
2011-08-01 13:13:55.788000 +08:00
Completed: alter database mount exclusive
alter database open
Beginning crash recovery of 1 threads
 parallel recovery started with 3 processes
2011-08-01 13:13:56.959000 +08:00
Started redo scan
Completed redo scan
 read 0 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 1: logseq 867, block 88140, scn 9122496
Recovery of Online Redo Log: Thread 1 Group 3 Seq 867 Reading mem 0
  Mem# 0: C:\APP\ORADATA\G11R2\ONLINELOG\O1_MF_3_6VWCSMPO_.LOG
  Mem# 1: C:\APP\FLASH_RECOVERY_AREA\G11R2\ONLINELOG\O1_MF_3_6VWCSNGX_.LOG
Completed redo application of 0.00MB
Completed crash recovery at
 Thread 1: logseq 867, block 88140, scn 9142497
 0 data blocks read, 0 data blocks written, 0 redo k-bytes read
2011-08-01 13:13:58.738000 +08:00
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=22, OS id=4784 
2011-08-01 13:13:59.765000 +08:00
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
ARC1 started with pid=24, OS id=2780 
ARC2 started with pid=25, OS id=1288 
ARC1: Archival started
LGWR: Primary database is in MAXIMUM AVAILABILITY mode
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
ARC3 started with pid=26, OS id=3876 
2011-08-01 13:14:00.828000 +08:00
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
NSS2 started with pid=19, OS id=5156 
2011-08-01 13:14:29.008000 +08:00
ORA-16198: LGWR received timedout error from KSR
2011-08-01 13:14:35.980000 +08:00
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_lgwr_3960.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Error 16198 verifying archivelog destination LOG_ARCHIVE_DEST_2
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Continuing...
ARCH: LGWR is scheduled to archive destination LOG_ARCHIVE_DEST_2 after log switch
2011-08-01 13:14:38.629000 +08:00
Trying to expand controlfile section 11 for Oracle Managed Files
Exception [type: INT_DIVIDE_BY_ZERO, ] [] [PC:0x500282E, __VInfreq__kcflfi()+466]
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_arc0_4784.trc  (incident=136091):
ORA-07445: exception encountered: core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []
Incident details in: c:\app\diag\rdbms\g11r2\g11r2\incident\incdir_136091\g11r2_arc0_4784_i136091.trc
2011-08-01 13:14:40.283000 +08:00
Trace dumping is performing id=[cdmp_20110801131440]
2011-08-01 13:14:52.417000 +08:00
Sweep [inc][136091]: completed
Sweep [inc2][136091]: completed
2011-08-01 13:14:59.805000 +08:00
ARC2: Detected ARCH process failure
ARC2: STARTING ARCH PROCESSES
ARC0 started with pid=19, OS id=5016 
2011-08-01 13:15:00.836000 +08:00
ARC0: Archival started
ARC2: STARTING ARCH PROCESSES COMPLETE
2011-08-01 13:15:36.689000 +08:00
Deleted Oracle managed file C:\APP\FLASH_RECOVERY_AREA\G11R2\ARCHIVELOG\2011_08_01\O1_MF_1_866_73DFKWRK_.ARC
2011-08-01 13:15:38.013000 +08:00
Error 12154 received logging on to the standby
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_4852.trc:
ORA-12154: TNS:could not resolve the connect identifier specified
ARCH: Error 12154 Creating archive log file to 'stdby'
Trying to expand controlfile section 11 for Oracle Managed Files
Exception [type: INT_DIVIDE_BY_ZERO, ] [] [PC:0x500282E, __VInfreq__kcflfi()+466]
Errors in file c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_4852.trc  (incident=136051):
ORA-07445: exception encountered: core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []
Incident details in: c:\app\diag\rdbms\g11r2\g11r2\incident\incdir_136051\g11r2_ora_4852_i136051.trc
2011-08-01 13:15:39.680000 +08:00
Trace dumping is performing id=[cdmp_20110801131539]
2011-08-01 13:15:42.782000 +08:00
PMON (ospid: 984): terminating the instance due to error 397
2011-08-01 13:15:50.520000 +08:00
Instance terminated by PMON, pid = 984

=============================g11r2_ora_4852_i136051.trc=============================

Dump file c:\app\diag\rdbms\g11r2\g11r2\incident\incdir_136051\g11r2_ora_4852_i136051.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Windows NT Version V6.1 Service Pack 1 
CPU                 : 4 - type 586, 2 Physical Cores
Process Affinity    : 0x0x00000000
Memory (Avail/Total): Ph:2122M/3566M, Ph+PgF:5413M/7130M, VA:1084M/2047M 
Instance name: g11r2
Redo thread mounted by this instance: 1
Oracle process number: 17
Windows thread id: 4852, image: ORACLE.EXE (SHAD)

*** 2011-08-01 13:15:38.527
*** SESSION ID:(197.1) 2011-08-01 13:15:38.527
*** CLIENT ID:() 2011-08-01 13:15:38.527
*** SERVICE NAME:() 2011-08-01 13:15:38.527
*** MODULE NAME:(oradim.exe) 2011-08-01 13:15:38.527
*** ACTION NAME:() 2011-08-01 13:15:38.527

Dump continued from file: c:\app\diag\rdbms\g11r2\g11r2\trace\g11r2_ora_4852.trc
ORA-07445: exception encountered: core dump [kcflfi()+466] [INT_DIVIDE_BY_ZERO] [] [PC:0x500282E] [] []

========= Dump for incident 136051 (ORA 7445 [kcflfi()+466]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: INT_DIVIDE_BY_ZERO, ] [] [PC:0x500282E, __VInfreq__kcflfi()+466]

Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
Process Id: 0x000010bc  Thread Id : 0x000012f4    Time : Mon Aug 01 13:15:38 
Excp. Code: 0xc0000094  Excp. Type: INT_DIVIDE    Flags: 0x00000000

------------------- Registers ----------------------------
eip = 0500282e esp = 0d9f525c ebp = 0d9f577c edi = 37eefe00 esi = 00000265
eax = 00000265 ebx = 00000000 ecx = 089ee234 edx = 00000000
ecs = 0000001b eds = 00000023 ees = 00000023 ess = 00000023
egs = 00000000 efs = 0000003b
eflags = 00010246
------------------- End of Registers ---------------------

*** 2011-08-01 13:15:38.536
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=a01hp0psv0rrh) -----
alter database open
----------- messages from pre-loading .sym files:
Symbol file C:\app\product\11.2.0\dbhome_1\RDBMS\ADMIN\oracommon11.SYM does not match binary.
 Symbol TimeStamp=4bb5eaac, Module TimeStamp=0 are different
Symbol file C:\app\product\11.2.0\dbhome_1\RDBMS\ADMIN\oraclsra11.SYM does not match binary.
 Symbol TimeStamp=4bb4cf99, Module TimeStamp=0 are different
----------- end of messages from pre-loading .sym files
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
Symbol file C:\app\product\11.2.0\dbhome_1\BIN\oracommon11.SYM does not match binary.
 Symbol TimeStamp=4bb5eaac, Module TimeStamp=0 are different
Symbol file C:\app\product\11.2.0\dbhome_1\BIN\oraclsra11.SYM does not match binary.
 Symbol TimeStamp=4bb4cf99, Module TimeStamp=0 are different
EnumerateLoadedModules64 failed with error -1073741819
Symbol file oraclsra11.SYM does not match binary.
 Symbol TimeStamp=4bb4cf99, Module TimeStamp=0 are different
Symbol file oracommon11.SYM does not match binary.
 Symbol TimeStamp=4bb5eaac, Module TimeStamp=0 are different
__VInfreq__kcflfi()           00000000             
+466                                               
_kccrszf()+287       CALLrel  _kcflfi()            0 318345B8 34 31C0DD40 4000
                                                   265 4 7FFFFFFF 1 0 0
_kccrsd_expd()+1418  CALLrel  _kccrszf()           D9F7CEC 268 264
_kccwnc_reuse_expan  CALLrel  _kccrsd_expd()       D9F7CEC B 38
d()+640                                            
__VInfreq__kccwnc()  CALLrel  _kccwnc_reuse_expan  D9F7CEC B 26
+235                          d()                  
_krse_arc_complete(  CALLrel  _kccwnc()            D9F7CEC D9F6D38 B
)+1615                                             
_krse_arc_driver_co  CALLrel  _krse_arc_complete(  D9F78AC
re()+1307                     )                    
_krse_arc_driver()+  CALLrel  _krse_arc_driver_co  D9F7CEC 1 D9F7C6C 0 0 D9F7CC8
274                           re()                 0 0 0 0 0 0 0
_krsq_arch_to_force  CALLrel  _krse_arc_driver()   D9F7CEC 1 D9F7C6C 0 0 D9F7CC8
_switch()+196                                      0 0 0 0 0 0 0
__VInfreq__kcttsc()  CALLrel  _krsq_arch_to_force  D9F7CEC 1
+129                          _switch()            
_kcfopd()+1504       CALLrel  _kcttsc()            2
_adbdrv()+16700      CALLrel  _kcfopd()            0 0 0 0 D9FBBF8
_opiexe()+13594      CALLrel  _adbdrv()            4A C0000094 33644518 D9FBD38
                                                   6D60697 2F3FC5F0
_opiosq0()+6248      CALLrel  _opiexe()            4 0 D9FC704
_kpooprx()+277       CALLrel  _opiosq0()           3 E D9FC970 A4 0
_kpoal8()+632        CALLrel  _kpooprx()           D9FF074 D9FD3F8 13 1 0 A4
_opiodr()+1248       CALLreg  00000000             5E 1C D9FF070
___dyn_tls_init_cal  CALLreg  00000000             5E 1C D9FF070 1
lback()+2935122                                    
_opitsk()+1404       CALL???  00000000             C9A10E8 5E D9FF070 0 D9FED00
                                                   D9FF19C 53E52E 0 D9FF1C8
_opiino()+980        CALLrel  _opitsk()            0 0
_opiodr()+1248       CALLreg  00000000             3C 4 D9FFBC4
_opidrv()+1201       CALLrel  _opiodr()            3C 4 D9FFBC4 0
_sou2o()+55          CALLrel  _opidrv()            3C 4 D9FFBC4
_opimai_real()+124   CALLrel  _sou2o()             D9FFBD4 3C 4 D9FFBC4
_opimai()+125        CALLrel  _opimai_real()       2 D9FFBFC
_OracleThreadStart@  CALLrel  _opimai()            2 D9FFF3C 0 70 FFFFFFFF
4()+830                                            FFFFFFFF
___dyn_tls_init_cal  CALLptr  00000000             901FF6C D9FFFD4 776437F5
lback()+366382316                                  901FF6C 765D34CB 0
___dyn_tls_init_cal  CALLreg  00000000             901FF6C 765D34CB 0 0 901FF6C
lback()+367384440                                  0
___dyn_tls_init_cal  CALLrel  ___dyn_tls_init_cal  401326 901FF6C 0 0 0 0
lback()+367384392             lback()+367384403    
00000000             CALL???  00000000             

--------------------- Binary Stack Dump ---------------------
..................

从以上日志中可以看到在”Trying to expand controlfile section 11 for Oracle Managed Files“扩扎控制文件过程中出现了
_kccwnc_reuse_expan->_kccrsd_expd->_kccrszf->_kcflfi->_VInfreq__kcflfi()
函数的7445错误,kcf意为(manages and coordinates operations on the control file(s),kcf.c),是在处理日志文件中引发了INT_DIVIDE_BY_ZERO除数为零的代码bug。

通过7445和kcflfi关键词在MOS上搜索没有太大的发现,说明该Bug的处罚几率非常低,正好让我碰到说明是某些特殊参数的设置引起了该问题。

目标锁定启动日志中的非默认隐藏参数”_db_file_direct_io_count”,该参数决定了直接路径读写的IO大小,从9i开始该参数的单位调整为bytes而非原先的blocks,之前因为对该参数进行一些测试所以设置了一个较小值。

Parameter: DB_FILE_DIRECT_IO_COUNT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Versions:	8.0 - 8.1
                This parameter is hidden in 9.0 onwards.

 Parameter type:        integer
 Parameter class:       dynamic, scope = ALTER SYSTEM DEFERRED
 Default value:         64
 Range of values:       operating system-dependent

Description:
~~~~~~~~~~~~
DB_FILE_DIRECT_IO_COUNT is used to specify the number of blocks to be used
for IO operations done by backup, restore or direct path read and write
functions. The IO buffer size is a product of DB_FILE_DIRECT_IO_COUNT and
DB_BLOCK_SIZE. The IO buffer size cannot exceed max_IO_size for your
platform.

Assigning a high value to this parameter results in greater use of PGA or
SGA memory.

o In Oracle8i, minimize the number of I/O requests by setting the
  DB_FILE_DIRECT_IO_COUNT instance parameter so that

  DB_BLOCK_SIZE x DB_FILE_DIRECT_IO_COUNT = max_io_size of system

  In Oracle8i the default for this is 64 blocks.

  (In Oracle9i, it is replaced by _DB_FILE_DIRECT_IO_COUNT which governs
   the size of direct I/Os in BYTES (not blocks). The default is 1Mb but
   will be sized down if the max_io_size of the system is smaller.)

ORA-19863 during RMAN duplicate

Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.3
This problem can occur on any platform.
Symptoms
-- Problem Statement:
Duplicate failed during the datafile restore stage:

Starting restore at 2008-Apr-09 09:28:24
using channel ORA_AUX_DISK_1

channel ORA_AUX_DISK_1: starting datafile backupset restore
channel ORA_AUX_DISK_1: specifying datafile(s) to restore from backup set
restoring datafile 00003 to /u06/oradata/hcmprdc/sysaux01.dbf
...
restoring datafile 00121 to /u06/oradata/hcmprdc/waapp.dbf
channel ORA_AUX_DISK_1: reading from backup piece
/u04/oradata/flash_recovery_area/HCMPRD/mdjd8s5v_1_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 04/09/2008 09:28:28
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error reading backup piece /u04/oradata/flash_recovery_area/HCMPRD/mdjd8s5v_1_1
ORA-19863: device block size 1040384 is larger than max allowed: 262144

Cause
The database parameter _db_file_direct_io_count in the target and auxiliary instance does not match.
Solution

-- To implement the solution:

Ensure that parameter _db_file_direct_io_count on the target and auxiliary database the same

_DB_FILE_DIRECT_IO_COUNT need to be set to the same value between the source database 
where the backup was taken and the target database where the backup is being restored.

2.0 Size of Input/Output Buffers
================================

a. input buffers
----------------

NOTE : DB_FILE_DIRECT_IO_COUNT is not available in Oracle9i onwards.
       In Oracle9i, it is replaced by a hidden _DB_FILE_DIRECT_IO_COUNT which 
       governs the size of direct I/Os in BYTES (not blocks). The default is 
       1Mb butwill be sized down if the max_io_size of the system is smaller.

The input buffer size is:
  buffersize = db_block_size * db_file_direct_io_count

As there are 4 input buffers, the total input buffer memory use per channel is:
 memory(input) = #buffers * #files * buffersize
               = 4 * #files * buffersize

For example, if 2 channels are used, and each of these channels backs up 3 
files, then for each channel

 memory(input) = 4 * 3 * db_block_size * db_file_direct_io_count

b. output buffers
-----------------

For disk channels, the output buffer size is:
  buffersize = db_block_size * db_file_direct_io_count

For SBT_TAPE channels, the output buffer size in Oracle8/8i is o/s dependant. (On Solaris,
this defaults to 64k) On 9i/10g it defaults to 256k for all platforms. The BLKSIZE argument to 'allocate channel...' can be
used to override the default value.

As there are 4 output buffers,
  memory(output) = #buffers * buffersize
                 = 4 * buffersize

一般来说使用该隐藏参数的默认值即可,通过重置该参数后修复启动问题:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
PL/SQL Release 11.2.0.1.0 - Production
CORE    11.2.0.1.0      Production
TNS for 32-bit Windows: Version 11.2.0.1.0 - Production
NLSRTL Version 11.2.0.1.0 - Production

SQL> alter system reset "_db_file_direct_io_count" scope=spfile;

System altered.

SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
 FROM SYS.x$ksppi x, SYS.x$ksppcv y
 WHERE x.inst_id = USERENV ('Instance')
 AND y.inst_id = USERENV ('Instance')
 AND x.indx = y.indx
AND x.ksppinm LIKE '%db_file_direct_io_count%'
/

NAME                           VALUE                DESCRIB
------------------------------ -------------------- ------------------------------
_db_file_direct_io_count       1048576              Sequential I/O buf size

windows上的11gr2默认该参数为1MB

Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例

一套Linux x86-64上的11.2.0.1 4节点RAC系统中LMS GCS服务进程遭遇到内部错误ORA-00600[kjbmprlst:shadow],导致节点实例意外终止,具体日志如下:

Fri Jul 08 02:04:43 2011
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc  (incident=1011732):
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/PROD/PROD1/incident/incdir_1011732/PROD1_lms1_536_i1011732.trc
Fri Jul 08 02:04:44 2011
Trace dumping is performing id=[cdmp_20110708020444]
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_lms1_536.trc:
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 536): terminating the instance due to error 484
Fri Jul 08 02:04:45 2011
opiodr aborting process unknown ospid (27387) as a result of ORA-1092
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/diag/rdbms/PROD/PROD1/trace/PROD1_diag_513.trc
Fri Jul 08 02:04:54 2011
Termination issued to instance processes. Waiting for the processes to exit
Fri Jul 08 02:04:58 2011
ORA-1092 : opitsk aborting process

该ORA-00600[kjbmprlst:shadow]错误定位为11.2.0.1上的Bug 10121589或Bug 9458781:

Bug 10121589  ORA-600 [kjbmprlst:shadow] can occur in RAC
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release)
        11.2.0.2 Bundle Patch 2 for Exadata Database
        11.2.0.1 Bundle Patch 7 for Exadata Database 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow] 

    RAC (Real Application Clusters) / OPS 

Description

    An ORA-600 [kjbmprlst:shadow] can occur if the fix for bug 9979039
    is present.

    Note:
     One off patches for 10200390 should also include this fix.

Bug 9458781  Missing close message to master leaves closed lock dangling crashing the instance with assorted Internal error

Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions >= 11.2.0.1 but BELOW 11.2.0.2
    Versions confirmed as being affected 	

        11.2.0.1 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        11.2.0.2 (Server Patch Set)
        11.2.0.1 Bundle Patch 4 for Exadata Database 

Symptoms:

Related To:

    Instance May Crash
    Internal Error May Occur (ORA-600)
    ORA-600 [KJBMPRLST:SHADOW]
    ORA-600 [KJBMOCVT:RID]
    ORA-600 [KJBRREF:PKEY]
    ORA-600 [KJBRASR:PKEY] 

    RAC (Real Application Clusters) / OPS 

Description

    A lock is closed without sending a message to the master.
    This causes closed lock dangling at the master crashing the instance with different internal errors.

    Reported internal errors so far are :
    - KJBMPRLST:SHADOW
    - KJBMOCVT:RID
    - KJBRREF:PKEY
    - KJBRASR:PKEY

该kjbmprlst:shadow内部函数用以管理kjbm shadow锁(/libserver10.a/kjbm.o )信息,存在某个已关闭的lock没有及时message给master node的代码漏洞,目前除了安装补丁外没有已验证的workaround办法(disable drm似乎是无效的):

oradebug lkdebug (track resources, take dumps)
KCL history
KJBL history
KJL history

PCM (GCS) and non-PCM (GES) resources are kept separate and use separate code paths.
GES:
Resource table: kjr and kjrt
Lock table: kjlt
Processes: kjpt
GCS:
Resource table: kjbr
Lock table: kjbl

DLM Structures (continued)
/* PCM resource structure */
typedef struct kjbr {                                /* 68 bytes on sun4u */
  kjsolk       hash_q_kjbr;                             /* hash list : hp */
  ub4          resname_kjbr[2];	                     /* the resource name */
  kjsolk       scan_q_kjbr; /* chain to lmd scan q of grantable resources */
  kjsolk       grant_q_kjbr;                 /* list of granted resources */
  kjsolk       convert_q_kjbr;       /* list of resources being converted */
  ub4          diskscn_bas_kjbr;         /* scn(base) known to be on disk */
  ub2          diskscn_wrap_kjbr;        /* scn(wrap) known to be on disk */
  ub2          writereqscn_wrap_kjbr;    /* scn(wrap) requested for write */
  ub4          writereqscn_bas_kjbr;     /* scn(base) requested for write */
  struct kjbl *sender_kjbr;                 /* lock elected to send block */
  ub2          senderver_kjbr;                  /* version# of above lock */
  ub2          writerver_kjbr;                  /* version# of lock below */
  struct kjbl *writer_kjbr;                /* lock elected to write block */
  ub1          mode_role_kjbr; /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1          flags_kjbr;                        /* ignorewip, free etc. */
  ub1          rfpcount_kjbr;                      /* refuse ping counter */
  ub1          history_kjbr;                /* resource operation history */
  kxid         xid_kjbr;                          /* split transaction ID */
} kjbr ;

/* kjbl - PCM lock structure
** Clients and most of the DLM will use the KJUSER* or KJ_* modes and kscns  */

typedef struct kjbl {                                /* 52 bytes on sun4u */
  union {                     /* discriminate lock@master and lock@client */
    struct {                                           /* for lock@master */
      kgglk        state_q_kjbl;             /* link to chain to resource */
      kjbopqi     *rqinfo_kjbl;                             /* target bid */
      struct kjbr *resp_kjbl;                   /* pointer to my resource */
    } kjbllam;                                 /* KJB Lock Lock At Master */
    struct {                                           /* for lock@client */
      ub4         disk_base_kjbl;        /* disk version(base) for replay */
      ub2         disk_wrap_kjbl;        /* disk version(wrap) for replay */
      ub1         master_node_kjbl;                   /* master instance# */
      ub1         client_flag_kjbl;     /* flags specific to client locks */
      ub2         update_seq_kjbl;               /* last update to master */
    } kjbllac;                                 /* KJB Lock Lock At Client */
  } kjblmcd;                        /* KJB Lock Master Client Discrimnant */
  void  *remote_lockp_kjbl;           /* pointer to client lock or shadow */
  ub2    remote_ver_kjbl;                         /* remote lock version# */
  ub2        ver_kjbl;                                     /* my version# */
  ub2        msg_seq_kjbl;                         /* client->master seq# */
  ub2        reqid_kjbl;                         /* requestid for convert */
  ub2        creqid_kjbl; /* requestid for convert that has been cancelled */
  ub2        pi_wrap_kjbl;                     /* scn(wrap) of highest pi */
  ub4        pi_base_kjbl;                     /* scn(base) of highest pi */
  ub1        mode_role_kjbl; /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1        state_kjbl;       /* _L|_R|_W|_S, notify, which q, lock type */
  ub1        node_kjbl;                       /* instance lock belongs to */
  ub1        flags_kjbl;                                /* lock flag bits */
  ub2        rreqid_kjbl;                               /* save the reqid */
  ub2         write_wrap_kjbl;        /* last write request version(wrap) */
  ub4         write_base_kjbl;        /* last write request version(base) */
  ub4         history_kjbl;                     /* lock operation history */
} kjbl;

PCM DLM locks that are owned by the local instance are allocated and embedded in an LE structure.
PCM DLM locks that are owned by remote instances and mastered by the local instance are allocated in SHARED_POOL.

PCM Locks and Resources
Fields of interest in the kclle structure: kcllerls or releasing; kcllelnm or name(id1,id2);
kcllemode or held-mode; kclleacq or acquiring; kcllelck or DLM lock.

Fields of interest in the kjbr structure: resname_kjbr[2] or resource name; grant_q_kjbr or grant queue;
convert_q_kjbr or convert queue; mode_role_kjbr, which is a bitwise merge of grant mode and
role-interpreted NULL(0x00), S(0x01), X(0x02), L0 Local (0x00), G0 Global without PI (0x08), G1 Global with PI (0x018).

The field mode_role_kjbl in kjbl is a bitwise merge of grant, request, and lock mode: 0x00 if grant NULL;
0x01 if grant S; 0x02 if grant X; 0x04 lock has been opened at master; 0x08 if global role (otherwise local);
0x10 has one or more PI; 0x20 if request CR; 0x40 if request S; 0x80 if request X.

Someone has to keep a list of all buffers and where they are mastered
This is called Global Resource Directory (GRD)
GRD is present on all the instances of the cluster
To find out the master:
select  b.dbablk, r.kjblmaster master_node
from x$le l, x$kjbl r, x$bh b
where b.obj =
and b.le_addr = l.le_addr
and l.le_kjbl = r.kjbllockp

Oracle Support宣称可以通过11.2.0.2 (Server Patch Set)11.2.0.1 Bundle Patch 4 for Exadata Database修复该bug,但是有迹象表明在11.2.0.2上仍可能发生该ORA-00600[kjbmprlst:shadow]内部错误,同时该bug更多地发生在超过2个节点的RAC系统中。

 

沪公网安备 31010802001379号

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569