Revisiting RAC Brain Split

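Over the past two days, while interviewing DBA candidates, I asked about how Oracle RAC resolves Brain Split (split-brain). Almost every candidate told me that "when there are only 2 nodes, the voting algorithm fails, so the 2 nodes race to grab the Quorum Disk, and whichever gets it first survives." Let us call this the "preemption theory".

The "preemption theory" is more or less what the following passage says: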

 

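"In a cluster, nodes learn each other's health through some mechanism (the heartbeat) so that they can coordinate their work. Suppose only the heartbeat fails while every node keeps running normally. Each node then believes the other nodes are down, that it is the "sole survivor" of the cluster, and that it should take "control" of the whole cluster. Because storage in a cluster is shared, this means data disaster. This situation is "split-brain".
The usual way to solve this problem is a voting algorithm (Quorum Algorithm). It works as follows: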

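Point 1:

Nodes in the cluster use the heartbeat to report their "health" to each other; assume each report received from a node counts as one vote. In a three-node cluster running normally, every node holds 3 votes. When node A's heartbeat fails but node A itself is still running, the cluster splits into 2 small partitions: node A is one, and the remaining 2 nodes are the other. One partition must then be evicted to keep the cluster healthy. In the three-node cluster, after A's heartbeat fails, B and C form a partition with 2 votes, while A has only 1 vote. By the voting algorithm, the cluster formed by B and C gains control and A is evicted.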

 

 

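Point 2:

With only 2 nodes, the voting algorithm fails, because each node holds just 1 vote. A third device has to be introduced: the Quorum Device. The Quorum Device is usually a shared disk, also called the Quorum Disk, and it too represents one vote. When the heartbeat between the 2 nodes fails, both nodes try to claim the Quorum Disk's vote at the same time, and the earliest request is served first. The node that gets the Quorum Disk first therefore holds 2 votes, and the other node is evicted.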

 

 

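Point 1 in the passage above is in fact close to the view I put forward in <Oracle RAC Brain Split Resolution>. Here is my description again:

In the split-brain check phase, the Reconfig Manager finds the nodes that have a Disk Heartbeat but no Network Heartbeat, and uses the Network Heartbeat (where possible) and Disk Heartbeat information to count the nodes in each competing sub-cluster. It then decides which sub-cluster should survive based on the following 2 factors:

  1. The sub-cluster with the largest number of nodes
  2. If the sub-clusters have the same number of nodes, the sub-cluster with the lowest node number. For example, in a 2-node RAC environment node 1 always wins.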
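
The two factors above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Oracle's implementation; the function name `surviving_subcluster` and the representation of a sub-cluster as a set of node numbers are assumptions made for the example.

```python
from typing import Iterable, List, Set


def surviving_subcluster(subclusters: Iterable[Iterable[int]]) -> Set[int]:
    """Return the sub-cluster that stays up (hypothetical helper).

    Factor 1: the sub-cluster with the most nodes wins.
    Factor 2: on a tie, the sub-cluster holding the lowest node number wins.
    """
    candidates: List[Set[int]] = [set(nodes) for nodes in subclusters]
    candidates = [c for c in candidates if c]  # ignore empty groups
    if not candidates:
        raise ValueError("at least one non-empty sub-cluster is required")
    # Sort key: larger size first (negated length), then lowest member number.
    return min(candidates, key=lambda c: (-len(c), min(c)))


if __name__ == "__main__":
    print(surviving_subcluster([{2}, {3}]))        # two single-node sub-clusters
    print(surviving_subcluster([{1, 2}, {3, 4}]))  # 4-node RAC split in half
    print(surviving_subcluster([{1}, {2, 3}]))     # one node cut off from two
```

Note that nothing in this rule depends on which side reaches the voting disk first; the outcome is fully determined by sub-cluster size and node numbers.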

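Addendum: an introduction to the sub-cluster concept I use:

"When resolving a split-brain, NM also monitors the voting disk to learn about the other competing sub-clusters. Sub-clusters need some explanation. Imagine an environment with a large number of nodes, say the 128-node environment that Oracle itself has built. When a network failure occurs there are several possibilities. One is a global network failure, where none of the 128 nodes can exchange network heartbeats with any other, producing as many as 128 isolated "island" sub-clusters. Another is a partial network failure, where the 128 nodes are split into several parts, each containing more than one node; these parts are what we call sub-clusters. When the network fails, the nodes inside a sub-cluster can still communicate and exchange vote messages (vote mesg), but sub-clusters or island nodes can no longer talk to each other over the regular Interconnect network. This is when NM Reconfiguration needs the voting disk."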

 

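The disagreement is mainly this: the "preemption theory" holds that with only 2 nodes, the surviving node is decided by the outcome of a race for the voting disk. It also says nothing about the case where several sub-clusters have the same number of nodes (for example a 4-node RAC where nodes 1 and 2 form one sub-cluster and nodes 3 and 4 form another). If the 2-node reasoning were followed, that case too would be decided by the sub-clusters racing for the voting disk.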

 

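I personally believe this view (the "preemption theory") is wrong. Both the log of css, the key CRS process, during an actual split-brain and Oracle's official internal documents show this.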

 

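Consider a scenario in 10.2 RAC. The cluster has 3 nodes in total, but instance 1 has not been started, so there are only 2 active nodes. Node 2 suffers a network failure. Because node 2 has the smaller member number, it initiates the eviction of node 3 through the voting disk. The logs are as follows:

Note the clssnmCheckDskInfo entries, which clearly show the NM (Node Monitor) service checking the voting disk information and computing the smaller cluster size.

The following is the ocssd.log of node 2: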

[    CSSD]2011-04-23 17:42:32.022 [3032460176] >
TRACE: clssnmCheckDskInfo: node 3, vrh3, state 5 with leader 3
has smaller cluster size 1; my cluster size 1 with leader 2

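After checking the voting disk, the sub-cluster led by node 3 is judged to be the smaller sub-cluster (both have size 1, but node 3's node number is higher than node 2's), and node 2's sub-cluster is treated as the largest.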

[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmEvict: Start
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:   clssnmEvict:
Evicting node 3, vrh3, birth 3, death 13, impendingrcfg 1, stateflags 0x40d
[    CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE:
clssnmSendShutdown: req to node 3, kill time 1643084

Node 2 initiates the eviction of node 3 and sends it a shutdown request.

The following is the ocssd.log of node 3:
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:   clssnmCheckDskInfo:
Aborting local node to avoid splitbrain.
[    CSSD]2011-04-23 17:43:15.913 [3032460176] >ERROR:                     :
my node(3), Leader(3), Size(1) VS Node(2), Leader(2), Size(1)

After reading the voting disk, node 3 finds the kill block and aborts itself to avoid split-brain.

 

 

In addition, some official Notes on Metalink explicitly support the view above. Some excerpts follow:

 

1.
When interconnect breaks – keeps the largest cluster possible up, other nodes will be evicted, in 2 node cluster lowest number node remains.
 

2.
Node eviction: pick a cluster node as victim to reboot. Always keep the largest cluster possible up, evicted other nodes; two nodes: keep the lowest number node up and evict other

 

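In fact, some vendor Unix clusterware products may indeed resolve split-brain by whoever obtains the "Quorum disk" first. But Oracle's own clusterware for Real Application Clusters (RAC), introduced in 10g and also known as CRS (Cluster Ready Services), does not work this way during Brain Split Resolution, and reasoning by analogy does not lead to the right conclusion here.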

 

 

How to Uninstall/Reinstall 10g CRS Clusterware?

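How do you reinstall the 10g Clusterware software?

Normally I would say: stop the DB and CRS, manually delete ORA_CRS_HOME and the ora* configuration files under /etc (or /opt), and use the dd command to wipe the headers of the LUNs. A reinstall after that usually works, but in practice problems often arise because the cleanup was not thorough enough.

Today a colleague asked me whether there is a document on completely uninstalling CRS. I searched Metalink and found nothing, and finally found a fairly authoritative document on Alejandro Vargas' Blog, which I share here.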

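On AIX 5.3 + HACMP 5.4 and later, rootpre.sh from Patch 6718715 must be run before installing 10gR2 10.2.0.1 RAC CRS Clusterware

On AIX 5.3 + HACMP 5.4 and later, rootpre.sh from Patch 6718715 must be run before installing 10gR2 10.2.0.1 RAC CRS Clusterware. If that rootpre.sh is not run, many problems follow, for example: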

 

 

1. The greyed-out Add button cannot be clicked in the "Cluster Node Information" or "Specify Cluster Configuration" window:

 

AIX: "Cluster Node Information" or "Specify Cluster Configuration" Window Does not Show Any Node and "Add" Button is Greyed Out

Oracle Server - Enterprise Edition - Version: 10.2.0.1 and later   [Release: 10.2 and later ]
IBM AIX on POWER Systems (64-bit)
Symptoms

Installing Oracle Clusterware (CRS or 11gR2 Grid Infrastructure), "Cluster Node Information" or "Specify Cluster Configuration" window does not show any node and "Add" button is greyed out:

Installation log

[main] [10:21:20:345] [sQueryCluster.isCluster:74]  LKMGR file =/usr/sbin/cluster/utilities/cldomain
[main] [10:21:20:346] [QueryCluster.:49]  Detected Cluster
[main] [10:21:20:346] [QueryCluster.isCluster:65]  Cluster existence check = true

Cause

The cause is IBM HACMP (or PowerHA) executable was not removed cleanly when HACMP was deinstalled. After HACMP was deinstalled, "ls" still shows one HACMP command:

$ ls -l /usr/sbin/cluster/utilities/cldomain
lrwxrwxrwx    1 root     system           29 Sep 21 13:54 /usr/sbin/cluster/utilities/cldomain -> /opt/VRTSvcs/rac/bin/cldomain

Oracle OUI depends on /usr/sbin/cluster/utilities/cldomain to determine if vendor clusterware exists, if so, it will install on top of HACMP and get list of nodes from HACMP.

In this case, since HACMP executable is detected, OUI will not allow user to manually enter node information, however, since HACMP was deinstalled, it does not have any node membership information.

Solution

The solution is to remove /usr/sbin/cluster/utilities/cldomain from all nodes and restart OUI.

 

 

 

2. When root.sh runs, it reports that ocrconfig.bin cannot load a required library, or root.sh simply fails with "Failed to start Oracle Clusterware stack".

 

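Because 10g CRS depends heavily on AIX HACMP, the rootpre.sh shipped with 10.2.0.1 cannot correctly configure the HA information for the oracle user. You must therefore first download Patch 6718715: SUPPORT FOR HACMP 5.4 IN ROOTPRE.SH SCRIPT FOR version 10.2.0.3; otherwise 10.2.0.1 CRS cannot be installed properly.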

 

 

 

 

Product : 	Oracle Database Server 10GR2 (10.2.0.x)

Bug     :       6718715 - Support HACMP 5.4

Platforms: AIX 5L (5.x), AIX 6.x

Steps to apply the patch
------------------------
1--> Login as root user 
2--> Unpack the files shipped in this patch in a temporary directory
3--> Run the rootpre.sh script
     ./rootpre.sh

Note: 
 This patch supercedes any rootpre.sh shipped with the aforementioned 
 Oracle products.

Contents of this distribution
-----------------------------
1--> README.txt
2--> loadext(32bit executable)
3--> pw-syscall32 (32-bit executable for 32-bit kernels in AIX 4.1 & AIX 4.2)
4--> pw-syscall (32-bit executable for 32-bit kernels in AIX 4.3 and AIX 5L)
5--> pw-syscall64 (64-bit executable for 64-bit kernels in AIX 5L)
6--> rootpre.sh(commands text)
7--> ORCLcluster/lib/libskgxnr.a (Oracle 64-bit cluster library)
8--> ORCLcluster/lib/libskgxnr.so (Oracle 64-bit cluster library)
9--> ORCLcluster/lib32/libskgxnr.a (Oracle 32-bit cluster library)
10--> ORCLcluster/lib32/libskgxnr.so (Oracle 32-bit cluster library)

Ignoring a gsd resource that fails to start on 10g and above

On: version 10.2.0.1, Real Application Clusters

When attempting to start the gsd resource,
the following error occurs.

ERROR
-----------------------
Auto-start failed for the CRS resource .

Trace the issue with note:
Tracing GSD, SRVCTL, GSDCTL, VIPCA and SRVCONFIG (Doc ID 178683.1)

Tracing GSD, SRVCTL, GSDCTL, VIPCA and SRVCONFIG
PURPOSE
-------
The Purpose of this document is to assist in debugging SRVCTL, GSD, GSDCTL, VIPCA,
and SRVCONFIG problems.
SCOPE & APPLICATION
-------------------
This document is for support analysts to troubleshoot SRVCTL, GSD, GSDCTL, VIPCA,
and SRVCONFIG issues.
TRACING GSD, SRVCTL, GSDCTL, VIPCA, and SRVCONFIG
------------------------------------------
To provide verbose output for SRVCTL, GSD, GSDCTL, VIPCA, or SRVCONFIG, tracing can
be enabled to provide additional screen output.
--------------------------------------------------------------------------
10g:
Just set the environment variable SRVM_TRACE to true to trace all of the
SRVM files like gsd, srvctl, vipca, and ocrconfig.
--------------------------------------------------------------------------
9i:
To Trace GSD:
-------------
1. vi the gsd.sh file in the $ORACLE_HOME/bin directory.
For Windows:  Right click on the OraHome\bin\gsd.bat file and choose Edit.
2. At the end of the file, look for the following line:
exec $JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME
3. Add the following just before the -classpath in the 'exec $JRE' line:
-DTRACING.ENABLED=true -DTRACING.LEVEL=2
4. At the end of the gsd.sh file, the string should now look like this:
exec $JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath.....
5. Test this by running gsd.sh:
[opcbsol1]/u01/home/usupport> gsd.sh
[main][9:31:8:860] Daemon: argument is /u01/32bit/app/oracle/product/9.0.1
[main][9:31:8:893] tracing is true; at level 2
[main][9:31:8:893] trace file is /u01/32bit/app/oracle/product/9.0.1/srvm/log/gsdaemon.log
cont...
To Trace SRVCTL:
---------------
1. vi the srvctl file in the $ORACLE_HOME/bin directory.
For Windows:  Right click on the OraHome\bin\srvctl.bat file and choose Edit.
2. At the end of the file, look for the following line:
$JRE -classpath $CLASSPATH oracle.ops.opsctl.OPSCTLDriver "$@"
3. Add the following just before the -classpath in the '$JRE' line:
-DTRACING.ENABLED=true -DTRACING.LEVEL=2
4. At the end of the srvctl file, the string should now look like this:
$JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath.....
5. Test this by running srvctl:
[opcbsol1]/u01/home/usupport> srvctl status -p V90321
[main][9:33:2:968] srvctl: tracing is true at level 2
[main][9:33:3:38] Going into GetActiveNodes constructor...
[main][9:33:3:59] Detected Cluster
[main][9:33:3:60] Cluster existence = true
[main][9:33:3:95] loaded library
[main][9:33:3:108] Inside GetActiveNodes.initializeCluster
[main][9:33:3:264] The status string is: 1
[main][9:33:3:265] The result string is: Everything ok So Far 1
cont...
To Trace GSDCTL:
---------------
1. vi the gsdctl file in the $ORACLE_HOME/bin directory.
For Windows:  Right click on the OraHome\bin\gsdctl.bat file and choose Edit.
2. At the end of the file, look for the following line:
$JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.GSDCTLDriver...
3. Add the following just before the -classpath in the '$JRE' line:
-DTRACING.ENABLED=true -DTRACING.LEVEL=2
4. At the end of the gsdctl file, the string should now look like this:
$JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath.....
5. Test this by running gsdctl:
[opcbsol1]/u02/32bit/app/oracle/product/9.2.0/bin> gsdctl stat
[main] [15:41:34:849] [GetActiveNodes.create:Compile]  Going into GetActiveNodes
[main] [15:41:34:918] [sQueryCluster.:Compile]  Detected Cluster
[main] [15:41:34:922] [sQueryCluster.isCluster:Compile]  Cluster existence = true
cont...
To Trace SRVCONFIG:
-------------------
1. vi the srvconfig file in the $ORACLE_HOME/bin directory.
For Windows:  Right click on the OraHome\bin\srvconfig.bat file and choose Edit.
2. At the end of the file, look for the following line:
$JRE -classpath $CLASSPATH oracle.ops.mgmt.rawdevice.RawDeviceUtil $*
3. Add the following just before the -classpath in the '$JRE' line:
-DTRACING.ENABLED=true -DTRACING.LEVEL=2
4. At the end of the srvconfig file, the string should now look like this:
$JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath.....
5. Test this by running srvconfig:
[opcbsol1]/u02/32bit/app/oracle/product/9.2.0/bin> srvconfig -version
[main] [16:0:58:395] [RawDeviceUtil.getDeviceName:Compile]
[main] [16:0:58:454] [sQueryCluster.:Compile]  Detected Cluster
[main] [16:0:58:457] [sQueryCluster.isCluster:Compile]  Cluster existence = true
cont...
Failed to start GSD on local node
PROBLEM
-------
AIX 5L cannot successfully start gsd on any node of the cluster.
Get error "Failed to start GSD on local node"
SOLUTION
--------
Ensure that the user (oracle) is added to the HAGSUSER UNIX group.
If the gsd still fails, turn on tracing of the GSD.
Simply turning on GSD tracing, allowed for the GSD to start successfully.
Look at note 178683.1 for how to enable GSD tracing.
LOG FILE
-----------------------
Filename =crsd.log
See the following error:
2009-01-02 08:08:27.838: [ CRSCOMM][12351]32Receive message header caa_clsrecv ret 11
2009-01-02 08:08:27.838: [ CRSCOMM][12351]32Error reading response IOException : Didn't receive header part of message
(File: caa_Message.cpp, line: 711
2009-01-02 08:08:27.838: [ CRSEVT][12351]32invokepeer ret 300
2009-01-02 08:08:27.838: [ CRSRES][12351]32Remote start failed to execute on ccdb_b: X_E2E_NoResponse :
(File: caa_CmdRTI.cpp, line: 507
2009-01-02 08:08:27.839: [ CRSRES][12351][ALERT]32Remote start for `ora.ccdb_b.gsd` failed on member `ccdb_b`
2009-01-02 08:08:27.914: [ OCRMAS][3611]th_master:13: I AM THE NEW OCR MASTER at incar 6. Node Number 1
2009-01-02 08:08:27.915: [ OCRRAW][3611]proprioo: for disk 0 (/dev/ro_ocr_raw), id match (1), my id set
(1731740172,1028247821) total id sets (1), 1st set (1731740172,1028247821), 2nd set (0,0) my votes (2), total votes (2)
2009-01-02 08:08:27.916: [ OCRRAW][3611]rrecovernumpage: numpage on device is not correct (0); recalculate (262075)
2009-01-02 08:08:27.922: [ OCRMAS][3611]th_master: Deleted ver keys from cache (master)
2009-01-02 08:08:30.996: [ CLSVER][527]32Returned from grpstat with event 1
2009-01-02 08:08:30.996: [ CLSVER][527]32Doing grpstat on crs_version group
2009-01-02 08:08:58.400: [ CRSCOMM][13127]32CLEANUP: Searching for connections to failed node ccdb_b
2009-01-02 08:08:58.400: [ CRSEVT][13127]32Processing member leave for ccdb_b, incarnation: 7
2009-01-02 08:08:58.402: [ CRSD][13127]32SM: recovery in process: 8
2009-01-02 08:08:58.402: [ CRSEVT][13127]32Do failover for: ccdb_b
2009-01-02 08:08:58.418: [ CRSRES][13127]32 startup = 0
2009-01-02 08:08:58.435: [ CRSRES][13127]32Not failing resource ora.ccdb_a.gsd because it was locked.
2009-01-02 08:08:58.435: [ CRSRES][13127]32X_RES_Unavailable : Resource ora.ccdb_a.gsd is locked
(File: rti.cpp, line: 976
2009-01-02 08:08:58.438: [ CRSRES][13127]32 startup = 0
2009-01-02 08:08:58.444: [ CRSRES][13127]32 startup = 0
2009-01-02 08:08:58.491: [ CRSRES][13898]32startRunnable: setting CLI values

Other AIX machines in the customer's environment showed the same issue as this one.
For this reason we concluded that the issue is caused by the setup, and that the gsd resource does not affect Oracle or other applications on version 10g and above.

Workarounds
Manually disable the gsd resource:
1. Use crs_unregister to delete the resource from CRS, so that CRS won't attempt to start the gsd resource.
2. Hard-code gsd.sh to return the status Online, so that the status check shows Online.

The GSD resource won't impact CRS or the database on version 10g and above.

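Would adjusting the OS time/clock affect RAC Clusterware and the database?

Question:

Keeping OS clocks consistent across nodes is one of the important factors for clusterware to run stably in a RAC environment. But if we really do need to adjust the OS time, will that actually affect the normal operation of RAC? What exactly is the impact, and what should we watch out for?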

Answer:

RAC: Frequently Asked Questions (Doc ID 220970.1)

Does Oracle RAC work with NTP (Network Time Protocol)?
YES! NTP and Oracle RAC are compatible, as a matter of fact, it is recommended to setup NTP in an Oracle RAC cluster, for Oracle 9i Database, Oracle Database 10g, and Oracle Database 11g Release 1.


Keep the following points in mind:

# Minor changes in time (in the seconds range) are harmless for Oracle RAC and the Oracle Clusterware. If you intend on making large time changes it is best to shutdown the instances and the entire Oracle Clusterware stack on that node to avoid a false eviction, especially if you are using the Oracle RAC 10g low-brownout patches, which allow really low misscount settings.

# Backup/recovery aspect of large time changes are documented in Note: 77370.1, basically you can’t use RECOVER DATABASE UNTIL TIME to reach the second recovery point, It is possible to overcome with RECOVER DATABASE UNTIL CANCEL or UNTIL CHANGE. If you are doing complete recovery (most of the times) then this is not an issue since the Oracle recovery code uses SCN (System Change Numbers) to advance in the redo/archive logs. The SCN numbers never go back in time (unless a reset-logs operation is performed), there is always an association of an SCN to a human readable timestamp (which may change forward or backwards), hence the issue with recovery until point in time vs. until SCN/Cancel.

# If DBMS_SCHEDULER is in usage it will be affected by time changes, as it’s using actual clock rather than SCN.

# On platforms with OPROCD get fix for <> “OPROCD REBOOTS NODE WHEN TIME IS SET BACK BY XNTPD”

# If NTP is not configured correctly (using -x flag), and diagwait not set to 13 Note: 559365.1 10.2/11.1 RAC systems can be rebooted due to OPROCD, during a leap second event, see Note: 759143.1.
# Daylight saving time adjustments do not affect the system clock, only the displayed time, hence have no impact on the Oracle software.

Apart from these issues, the Oracle RDBMS server is immuned to time changes, i.e. will not affect transaction/read consistency operations.

Also please refer to note:
Dates & Calendars – Frequently Asked Questions (Doc ID 227334.1)

So make time changes in small increments, using only the date command. Doing this precisely by hand is difficult, so running ntpd with the -x option could be a better solution in this case as well.

In general, stepping the time backward by no more than 3 seconds at a time should be fine.

Getting to Know the RAC Clusterware Process OPROCD

OPROCD runs on Linux from 10.2.0.4 onward and on other Unix platforms.

  • Fencing
    • Cluster handling of nodes that should not have access to shared resources
    • STONITH – Power cycle the node
    • PCW – nodes fence themselves through the reboot(8) command
    • Fabric Fencing from Polyserve
      • Healthy nodes send SNMP msgs to Fabric switch to disable SAN access from unhealthy nodes [ fence them out ]
      • Server is left in up state to view logs etc.
  • Oracle’s Cluster I/O Fencing solution
  • Only started on Unix platforms when vendor Clusterware is not running
  • Does not run on Windows, or on Linux before 10.2.0.4
  • Takes 2 parameters
    • Timeout value [ length of time between executions ]
    • Margin [ leeway for dispatches ]
    • Oproc.debug –t 1000 –m 500
  • In fatal mode node will get reboot’ed
  • In non-fatal mode error messages will be logged

OPROCD – This process is spawned in any non-vendor clusterware environment, except
on Windows where Oracle uses a kernel driver to perform the same actions and Linux
prior to version 10.2.0.4. If oprocd detects problems, it will kill a node via C
code. It is spawned in init.cssd and runs as root. This daemon is used to detect
hardware and driver freezes on the machine. If a machine were frozen for long enough
that the other nodes evicted it from the cluster, it needs to kill itself to prevent
any IO from getting reissued to the disk after the rest of the cluster has remastered
locks.

*** Oprocd log locations:
In /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.

Note that oprocd only runs when no vendor clusterware is running or on Linux > 10.2.0.4

COMMON CAUSES OF OPROCD REBOOTS

– A problem detected by the OPROCD process. This can be caused by 4 things:
1) An OS scheduler problem.
2) The OS is getting locked up in a driver or hardware.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.

OPROCD Bugs Known to Cause Reboots:

Bug 5015469 – OPROCD may reboot the node whenever the system date is moved
backwards.
Fixed in 10.2.0.3+

Bug 4206159 – Oprocd is prone to time regression due to current API used (AIX only)
Fixed in 10.1.0.3 + One off patch for Bug 4206159.

Diagnostic Fixes (VERY NECESSARY IN MOST CASES):

Bug 5137401 – Oprocd logfile is cleared after a reboot
Fixed in 10.2.0.4+

Bug 5037858 – Increase the warning levels if a reboot is approaching
Fixed in 10.2.0.3+

FILES TO REVIEW AND GATHER FOR OPROCD REBOOTS

If logging a service request, please provide ALL of the following files to Oracle
Support if possible:

– Oprocd logs in /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.

– All the files in the following directories from all nodes.

For 10.2 and above, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the
diagcollection.pl script.

For 10.1:

<CRS_HOME>/crs/log
<CRS_HOME>/crs/init
<CRS_HOME>/css/log
<CRS_HOME>/css/init
<CRS_HOME>/evm/log
<CRS_HOME>/evm/init
<CRS_HOME>/srvm/log

Recommended method for gathering these for each node:

cd <CRS_HOME>
tar cf crs.tar crs/init crs/log css/init css/log evm/init evm/log srvm/log

– Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

– ‘opatch lsinventory -detail’ output for the CRS home

– It would also be useful to get the following from each node leading up to the time
of the reboot:

– netstat -is (or equivalent)
– iostat -x (or equivalent)
– vmstat (or equivalent)

There is a tool called “OS Watcher” that helps gather this information. This tool
will dump netstat, vmstat, iostat, and other output at an interval and save x number
of hours of archived data. For more information about this tool see Note 301137.1.

 

The OPROCD executable sets a signal handler for the SIGALRM handler and sets the interval timer based on the to-millisec parameter provided. The alarm handler gets the current time and checks it against the time that the alarm handler was last entered. If the difference exceeds (to-millisec + margin-millisec), it will fail; the production version will cause a node reboot.
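
The check described above comes down to one comparison. The sketch below restates it in Python, assuming millisecond timestamps and the default timeout and margin values quoted elsewhere in this post (1000 and 500); the function name `exceeds_allowed_delay` is hypothetical, and the real daemon is native code driven by SIGALRM, not Python.

```python
def exceeds_allowed_delay(last_entry_ms: float,
                          now_ms: float,
                          timeout_ms: float = 1000,
                          margin_ms: float = 500) -> bool:
    """Return True when the gap between two alarm-handler entries is too long.

    Mirrors the rule: fail if (now - last entry) > (timeout + margin).
    All names and units here are assumptions made for illustration.
    """
    elapsed_ms = now_ms - last_entry_ms
    return elapsed_ms > (timeout_ms + margin_ms)


if __name__ == "__main__":
    # Handler re-entered after 1000 ms: within timeout + margin.
    print(exceeds_allowed_delay(0, 1000))
    # Handler re-entered after about 1740 ms: beyond 1000 + 500 ms.
    print(exceeds_allowed_delay(0, 1739.751316))
```

The sample log line later in this post, timeout(1739751316) exceeds interval(1000000000)+margin(500000000), looks like the same comparison expressed in nanoseconds.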

In fatal mode, OPROCD will reboot the node if it detects excessive wait. In Non Fatal mode, it will write an error message out to the file .oprocd.log in one of the following directories.

Oracle clusterware has the following three daemons which may be responsible for panicking the node. It is possible that some other external entity may have rebooted the node. In the context of this discussion, we will assume that the reboot/panic was done by an Oracle clusterware daemon.

* Oprocd – Cluster fencing module
* Cssd – Cluster synchronization module which manages node membership
* Oclsomon – Cssd monitor which will monitor for cssd hangs

OPROCD: This daemon only gets activated when there is no vendor clusterware present on the OS. It is also not activated on Windows or on Linux before 10.2.0.4. It runs a tight loop and, if it is not scheduled for 1.5 seconds, reboots the node.
CSSD: This daemon pings the other members of the cluster over the private network and the voting disk. If it does not get a response for Misscount seconds and Disktimeout seconds respectively, it will reboot the node.
Oclsomon: This daemon monitors CSSD to ensure that CSSD is scheduled by the OS; if it detects any problems it will reboot the node.

A sample log looks like
May 11 18:13:15.528 | INF | monitoring started with timeout(1000), margin(500)
May 11 18:13:15.548 | INF | normal startup, setting process to fatal mode
May 12 11:43:00.899 | INF | shutting down from client request
May 12 11:43:00.899 | INF | exiting current process in NORMAL mode
May 12 12:10:43.984 | INF | monitoring started with timeout(1000), margin(500)
May 13 11:29:37.528 | INF | shutting down from client request
May 13 11:29:37.528 | INF | exiting current process in NORMAL mode
When fatal mode is disabled, OPROCD will write the following to the log file and exit:
May 10 18:01:40.668 | INF | monitoring started with timeout(1000), margin(500)
May 10 18:23:02.490 | ERR | AlarmHandler:? timeout(1739751316) exceeds interval(1000000000)+margin(500000000)
[root@rh2 ~]# ps -ef|grep oprocd|grep -v grep
root     19763     1  0 Jun27 ?        00:00:00 oprocd start
[root@rh2 oprocd]# cd /etc/oracle/oprocd
[root@rh2 oprocd]# ls -l
total 20
drwxrwx--- 2 root oinstall 4096 Jun 27 23:52 check
drwxrwx--- 2 root oinstall 4096 Mar 29 22:37 fatal
-rwxr--r-- 1 root root      512 Jun 27 23:52 rh2.oprocd.lgl
-rw-r--r-- 1 root root      171 Jun 27 23:52 rh2.oprocd.log
drwxrwx--- 2 root oinstall 4096 Jun 27 23:52 stop
[root@rh2 oprocd]# cat rh2.oprocd.log
Jun 27 23:52:47.861 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)
Jun 27 23:52:47.864 | INF | normal startup, setting process to fatal mode
[root@rh2 oprocd]# oprocd
usage:  oprocd [start | startInstall | stop | check | enableFatal| help | -?]
run [ -t | -m | -g | -f  | -e]   foreground startup
-t           timeout in ms
-m            timout margin in ms
-e           clock skew epsilon in ms
-g         group name to enable fatal
-f                    fatal startup
start  [-t | -m  | -e]           starts the daemon
-t         timeout in ms
-m          timout margin in ms
-e         clock skew epsilon in ms
startInstall [ -t | -m | -g  | - e] start process in install mode
-t    timeout in ms
-m     timout margin in ms
-e    clock skew epsilon in ms
-g  group name to enable fatal
enableFatal  [ -t ]             force install mode process to fatal
-t    timeout for response in ms
stop         [ -t ]             stops running daemon
-t    timeout for response in ms
check        [ -t ]           checks status of daemon
-t    timeout for response in ms
help                          this help information
-?                            same as help above
[root@rh2 oprocd]# oprocd stop
Jun 28 00:17:36.604 | INF | daemon shutting down

Oracle Clusterware Process Monitor (OPROCD) From Julian Dyke

Process Monitor Daemon
  • Provides Cluster I/O Fencing
  • Implemented on Unix systems
    • Not required with third-party clusterware
    • Implemented in Linux in 10.2.0.4 and above
    • In 10.2.0.3 and below hangcheck timer module is used
  • Provides hangcheck timer functionality to maintain cluster integrity
    • Behaviour similar to hangcheck timer
  • Runs as root
    • Locked in memory
    • Failure causes reboot of system
    • See /etc/init.d/init.cssd for operating system reboot commands
  • OPROCD takes two parameters
    • -t - Timeout value
      • Length of time between executions (milliseconds)
      • Normally defaults to 1000
    • -m - Margin
      • Acceptable margin before rebooting (milliseconds)
      • Normally defaults to 500
  • Parameters are specified in /etc/init.d/init.cssd
    • OPROCD_DEFAULT_TIMEOUT=1000
    • OPROCD_DEFAULT_MARGIN=500
  • Contact Oracle Support before changing these values
  • /etc/init.d/init.cssd can increase OPROCD_DEFAULT_MARGIN based on two CSS variables
    • reboottime (mandatory)
    • diagwait (optional)
  • Values for these can be obtained using
    [root@server3]# crsctl get css reboottime
    [root@server3]# crsctl get css diagwait
  • Both values are reported in seconds
  • The algorithm is
    If diagwait > reboottime then
    OPROCD_DEFAULT_MARGIN := (diagwait - reboottime) * 1000
  • Therefore increasing diagwait will reduce frequency of reboots e.g
    [root@server3]# crsctl set css diagwait 13
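
As a worked example of the margin rule above, here is a small Python sketch. It only restates the arithmetic from the notes; the function name `oprocd_margin_ms` is hypothetical, and the reboottime of 3 seconds used in the demo is an assumed example value, so check the real one with crsctl get css reboottime.

```python
from typing import Optional


def oprocd_margin_ms(reboottime_s: int,
                     diagwait_s: Optional[int] = None,
                     default_margin_ms: int = 500) -> int:
    """Compute the OPROCD margin in milliseconds (illustrative only).

    If diagwait is set and larger than reboottime, the margin becomes
    (diagwait - reboottime) * 1000; otherwise the default margin is kept.
    """
    if diagwait_s is not None and diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    return default_margin_ms


if __name__ == "__main__":
    # diagwait not set: margin stays at the default.
    print(oprocd_margin_ms(reboottime_s=3))
    # diagwait 13 with an assumed reboottime of 3 seconds.
    print(oprocd_margin_ms(reboottime_s=3, diagwait_s=13))
```

With those assumed values the margin grows from 500 ms to 10000 ms, which is why raising diagwait makes OPROCD far more tolerant of scheduling delays.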
