Brain Split?

The odds of running into a genuine split brain are not high, but we did hit one. Two months ago I ran failure tests on a two-node 10.2.0.4 RAC on AIX 6.1; the main test was pulling the RAC interconnect cable to verify that CRS correctly handles the loss of the private network.

 

During the formal test we found that neither host rebooted, while the CSS on each side believed the other node was already down. Each of the two nodes therefore considered itself the sole survivor, which is what we call a split brain (brain split); in this situation the instances usually hang because of the LMON process.

 

Let's compare the logs from the two nodes at the time for further analysis:

 

STEP 1: At 20:41:19 the interconnect cable was physically pulled. The nodes could no longer communicate with each other, and the misscount countdown of 600 seconds began.
Node 1:
[    CSSD]2010-06-22 20:41:21.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:22.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(3) checkin(s)
.........
[    CSSD]2010-06-22 20:51:17.956 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:18.963 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:19.963 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis2 (2), flags 0x0001, state 3, wt4c 0

/* After the misscount countdown completed on node 1, it began evicting node 2 */

Node 2:
[    CSSD]2010-06-22 20:41:19.598 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:20.599 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(3) checkin(s)
......................
[    CSSD]2010-06-22 20:51:15.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:16.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:17.878 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis1 (1), flags 0x0001, state 3, wt4c 0

/* Likewise, node 2 began evicting node 1 at 20:51, ten minutes later */
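
The 600-second countdown is simply the CSS misscount in effect on this cluster (per the MOS note quoted in the comments below, 600s is the default when vendor clusterware is present). As a quick sketch, assuming a standard 10.2 CRS installation with $ORA_CRS_HOME set, the value actually in effect can be checked with:

$ORA_CRS_HOME/bin/crsctl get css misscount
# prints the configured misscount in seconds, e.g. 600 on a cluster running vendor clusterware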

STEP 2: Both nodes read the disk heartbeat information (clssnmReadDskHeartbeat) and each concluded that the other node was down.

Node 1:
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:20.965 [2057] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:21.714 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4185) LATS(1628594178) Disk lastSeqNo(4185)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2010-06-22 20:51:22.718 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4186) LATS(1628595183) Disk lastSeqNo(4186)
[    CSSD]2010-06-22 20:51:22.964 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:51:23.722 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4187) LATS(1628596186) Disk lastSeqNo(4187)
[    CSSD]2010-06-22 20:51:24.724 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4188) LATS(1628597189) Disk lastSeqNo(4188)
.............................
[    CSSD]2010-06-22 20:59:49.953 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4692) LATS(1629102418) Disk lastSeqNo(4692)
[    CSSD]2010-06-22 20:59:50.057 [3085] >TRACE:   clssgmPeerDeactivate: node 2 (gis2), death 0, state 0x80000001 connstate 0xf
[    CSSD]2010-06-22 20:59:50.104 [1029] >TRACE:   clssnm_skgxncheck: CSS daemon failed on node 2
[    CSSD]2010-06-22 20:59:50.382 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112a6c5b0) proc(112a5a190) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Start
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Evicting node 2, birth 1, death 3, killme 1
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(1629103696)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(2) down, LATS(1629102418),timeout(1278)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: Ack message type (15)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSendUpdate: syncSeqNo(3)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 0 () left cluster

[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 1, state (3/3) unique (1277207505/1277207505) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (1277206874/1277206874) prevConuni(1277206874) birth (1/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 2 (gis2) left cluster

[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: SYNC(3) from node(1) completed
[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: NODE 1 (gis1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmReconfigThread:  started for reconfig (3)
[    CSSD]2010-06-22 20:59:51.310 [4114] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_gisdb) birth(1/0)
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DBGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DGGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DAALL_DB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGGISDBALL type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRCSN_SRV_gisdb1 type 3
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 3
[    CSSD]2010-06-22 20:59:51.311 [3085] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishMasterNode: MASTER for 3 is node(1) birth(2)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmChangeMasterNode: requeued 1 RPCs
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2010-06-22 20:59:51.312 [4114] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes

[    CSSD]CLSS-3001: local node number 1, master node number 1

/* After roughly 8 minutes of disk heartbeats, node 1 decided the CSS daemon had failed on node 2, formally treated node 2 as having left the cluster, and completed the reconfiguration */

Node 2:
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:18.892 [2057] >TRACE:   clssnmSendVoteInfo: node(2) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:19.287 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3548) LATS(351788040) Disk lastSeqNo(3548)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[ CSSD]2010-06-22 20:51:20.288 [1543] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3549) LATS(351789041) Disk lastSeqNo(3549)
[    CSSD]2010-06-22 20:51:21.308 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3550) LATS(351790062) Disk lastSeqNo(3550)
...........................
[    CSSD]2010-06-22 20:59:46.122 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4051) LATS(352294875) Disk lastSeqNo(4051)
[    CSSD]2010-06-22 20:59:46.341 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112947c70) proc(112946f90) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:46.355 [2314] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50210) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50cd0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a536f0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a4eb90) proc (112a4eef0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a69250) proc (112a67e10)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112946050) proc (112945e50)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: waited 0 seconds on 6 IO capable clients
[    CSSD]2010-06-22 20:59:46.494 [2314] >WARNING: clssgmClientShutdown: graceful shutdown completed.
[    CSSD]2010-06-22 20:59:47.130 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4052) LATS(352295883) Disk lastSeqNo(4052)
[    CSSD]2010-06-22 21:34:40.167 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.

/* Node 2 likewise kept up its disk heartbeats and considered node(1) to be down; its CSS was eventually shut down manually. The network fault was then undone, and CSS was restarted at 21:34 */
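
Note that throughout the outage both nodes kept reading and writing the disk heartbeat, so the voting disk path was healthy the whole time; only the network heartbeat was lost. To see which voting disk(s) the clssnmReadDskHeartbeat messages refer to, the standard 10.2 query can be used (a sketch, again assuming $ORA_CRS_HOME is set):

$ORA_CRS_HOME/bin/crsctl query css votedisk
# lists the voting disk(s) that CSS uses for its disk heartbeat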

 

If you read the logs above carefully, you probably noticed the line "Oracle Database 10g CSS Release 10.2.0.1.0". Isn't this RAC supposed to be 10.2.0.4? Why is CSS still at 10.2.0.1? The follow-up investigation revealed that the engineer from the integrator (a certain domestic outfit, "X-code") who installed the system forgot to run the final root102.sh script when patching CRS. That script updates the OCR/voting disk and the actual CRS binaries; if it is not run after the patch installation, the upgrade has no effect at all and only the information in oraInventory gets updated.
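
A half-applied CRS patchset like this can also be spotted without digging through the logs, by comparing the installed CRS software version against the cluster's active version. Both queries exist in 10.2 crsctl; the comments below describe what one would expect to see, not actual output from this system:

$ORA_CRS_HOME/bin/crsctl query crs softwareversion
# version of the CRS software installed on this node
$ORA_CRS_HOME/bin/crsctl query crs activeversion
# version the cluster is actually running; here it would presumably still have reported 10.2.0.1.0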

 

At first that X-code engineer flatly refused to admit he had forgotten to run the script. In fact, patching CRS to 10.2.0.4 on AIX 6.1 requires granting the oracle user specific capabilities beforehand, something that is not needed on AIX 5.3, namely:

 

chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH oracle
/* further check */
lsuser -f oracle | grep capabilities
        capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH

 

If the oracle user has not been granted these capabilities, running the root102.sh script will fail. Another telltale sign is the pre10204/pre10205 directory: if root102.sh has been run, a directory named pre$VERSION will appear under $ORA_CRS_HOME/install; if it is missing, the script has most likely never been run (though it may also have been deleted afterwards, which is not recommended).
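
A minimal check along those lines (the path follows from the note above; the output wording is illustrative):

ls -d $ORA_CRS_HOME/install/pre*
# a pre10204 or pre10205 directory here indicates root102.sh has been run; "not found" suggests it never was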

 

With this information in hand, tracing the split brain back to its root cause became much simpler. The CRS in 10.2.0.1 is, frankly, riddled with bugs; between 10.2.0.1 and 10.2.0.4 quite a few new parameters and mechanisms were added to CRS. Searching MOS for the keywords "brain split CSS" turns up the following case:

Hdr: 8293652 10.2.0.3 PCW 10.2.0.3 OSD PRODID-5 PORTID-46
Abstract: CSS CANNOT HANDLE SPLIT-BRAIN AND DB INSTANCE RECEIVES ORA-29740
PROBLEM:
——–
config:
2-node RAC: Node1 (pdb01) and Node2 (pdb02)
There’s no time difference between two nodes.

pdb02 got ORA-29740 and terminated at “Tue Feb 17 12:13:06 2009”
ORA-29740 occurred with reason 1.
After ORA-29740 happened, the instance could not be started
until the OS was rebooted.
After rebooting the OS, everything was fine and the instances came up.

DIAGNOSTIC ANALYSIS:
——————–
clssnmReadDskHeartbeat: node(2) is down. rcfg(8) wrtcnt(2494425)
LATS(1205488794) Disk lastSeqNo(2494425)
nodes
clssgmMasterSendDBDone: group/lock status synchronization complete
nodes

WORKAROUND:
———–
none

REPRODUCIBILITY:
—————-
once at ct’s env.


Does ct apply any CRS(bundle) patch ?

When the problem happens, the cssd daemons can't connect to each other via the interconnect,
but both can still do heartbeats to the voting disk.
However, both cssds consider that "I'm survivor".
Looking into node 1 cssd.
* 12:02:30.856 – Initiated sync
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms

* Checking voting disk, and find node2 is still voting and living.
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo:
Checking disk info…
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo:
node(2) timeout(20) state_network(5) state_disk(3) misstime(0)
[ CSSD]2009-02-17 12:02:31.878 [1262557536] >TRACE: clssnmCheckDskInfo:
node(2) disk HB found, network state 5, disk state(3) misstime(1010)

* Compared cluster size and confirmed it can survive.
[ CSSD]2009-02-17 12:02:34.885 [1262557536] >TRACE: clssnmCheckDskInfo:
node 2, iz-pdb02, state 5 with leader 2 has smaller cluster size 1;
my cluster size 1 with leader 1

* Then finished
[ CSSD]2009-02-17 12:02:34.886 [1262557536] >TRACE: clssnmDoSyncUpdate:
Sync Complete!
*** 03/08/09 11:23 pm ***
Looking into node 2 cssd log.

* 12:02:20.647 – initiated sync protocol
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate:
diskTimeout set to (57000)ms

* Checking disk and finding node1 has not done a disk heartbeat for 59690 ms;
it would have waited for misscount and then considered node 1 dead.
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo:
Checking disk info…
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo:
node(1) timeout(59690) state_network(5) state_disk(3) misstime(61000)

* node2 is the only active member of cluster, finished.
[ CSSD]2009-02-17 12:02:22.723 [1262557536] >TRACE: clssnmDoSyncUpdate:
Sync Complete!
*** 03/08/09 11:45 pm ***
So, the strange point is that node 2's cssd says node 1's cssd didn't do a
disk heartbeat for 60 seconds.

Looking into the node1 cssd log just before initiating sync, we see an 87 sec gap.
———————————–
[ CSSD]2009-02-17 12:00:19.354 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x784d80) proc(0x7749b0) pid(14746) proto(10:2:1:1)
[ CSSD]2009-02-17 12:00:33.338 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x7d8620) proc(0x75cfb0) pid() proto(10:2:1:1)
[ CSSD]2009-02-17 12:01:03.688 [1199618400] >TRACE: clssgmClientConnectMsg:
Connect from con(0x76a390) proc(0x75cfb0) pid(13634) proto(10:2:1:1)
[ CSSD]2009-02-17 12:02:30.855 [1168148832] >WARNING: clssnmDiskPMT:
sltscvtimewait timeout (69200)
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >WARNING: clssnmeventhndlr:
Receive failure with node 2 (iz-pdb02), state 3, con(0x72b980),
probe((nil)), rc=10
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >TRACE: clssnmDiscHelper:
iz-pdb02, node(2) connection failed, con (0x72b980), probe((nil))
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate:
Initiating sync 7

As an interesting point, clssgmClientConnectMsg shows no messages during that gap,
yet the nm polling thread/disk ping thread never warns of a timeout.
(Usually it would write its first message at 50% of the timeout = 30 sec.)

And “sltscvtimewait timeout (69200)” message means, DiskPingMonitor thread
does not run for a 69200 ms whereas it just wants to sleep 1 second.

These suggest that cssd was not scheduled for about 70 seconds on node 1.
I don't see any log from DiskPingThread, but I assume it was suspended
at some point as well, and came back to work after 70 seconds.
Please check the OS message file to see if any interesting error is recorded.
To prevent this issue, keep watching OS performance to make sure no
extremely high load occurs.

Recommended solution is to go to 10.2.0.4 and use oprocd, so that
we can expect oprocd to kill node1 in such a case.

 

 

The case above shows the same "I'm survivor" split brain arising while "cssd can't connect each other via interconnect"; MOS's advice is to upgrade to 10.2.0.4, after which the oprocd process can prevent this kind of disaster.
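
After upgrading, one rough sanity check is to confirm that an oprocd daemon is actually present on each node. Whether oprocd runs at all depends on the platform and on whether vendor clusterware provides fencing, so treat this only as a hint, not proof:

ps -ef | grep oprocd | grep -v grep
# if an oprocd process shows up, the fencing daemon the MOS case relies on is running on this node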

 

The problem was ultimately resolved by upgrading to 10.2.0.5. This case tells us that, in China's IT environment, we sometimes have to attend to every detail ourselves. In this instance I did not notice at first that the upgrade had never been completed; it was a colleague who later reminded me. This was an extremely basic mistake: if the integrator's X-code engineer had followed their own delivery documentation step by step, or had simply read the notes shown in the installer's GUI during the upgrade, the problem would never have happened. In fact it was not just this system: two other systems installed and upgraded by the same engineer showed exactly the same issue during later failure tests.

The facts tell us: details determine success or failure!



3 responses to “Brain Split?”

  1. admin

    Hdr: 5577914 10.2.0.2 PCW 10.2.0.2 CSS PRODID-5 PORTID-226 4930431
    Abstract: CSS CANNOT RESOLVE SPLIT BRAIN, WHEN SOME NODES OCCUR INTERCONNECT FAILURE
    Customer tested the following 2 scenarios.

    – scenario 1

    Customer cut interconnect network interface on all nodes.
    (Test was started from 07:35:24)


    [CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE:clssnmCheckDskInfo:
    node(2) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR:
    : Node(1), Leader(1), Size(1) VS Node(2), Leader(2), Size(3)


    [CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE:clssnmCheckDskInfo:
    node(1) timeout(2050) state_network(0) state_disk(3) missCount(64)
    [CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR:
    : Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)


    [CSSD]2006-09-28 07:36:36.058 [2987334576] >TRACE:clssnmCheckDskInfo:
    node(1) timeout(6840) state_network(0) state_disk(3) missCount(63)
    [CSSD]2006-09-28 07:36:36.058 [2987334576] >TRACE:clssnmCheckDskInfo:
    node(2) timeout(4830) state_network(0) state_disk(3) missCount(63)
    [CSSD]2006-09-28 07:36:36.058 [2987334576] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 07:36:36.058 [2987334576] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 07:36:36.058 [2987334576] >ERROR:
    : Node(3), Leader(3), Size(1) VS Node(1), Leader(1), Size(1)


    [CSSD]2006-09-28 07:36:36.918 [2997820336] >TRACE:clssnmCheckDskInfo:
    node(1) timeout(6300) state_network(0) state_disk(3) missCount(64)
    [CSSD]2006-09-28 07:36:36.918 [2997820336] >TRACE:clssnmCheckDskInfo:
    node(2) timeout(4300) state_network(0) state_disk(3) missCount(64)
    [CSSD]2006-09-28 07:36:36.918 [2997820336] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 07:36:36.918 [2997820336] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 07:36:36.918 [2997820336] >ERROR:
    : Node(4), Leader(4), Size(1) VS Node(1), Leader(1), Size(1)

    --> The result was that ALL nodes were down.
    I think CSS should have survived NODE1.
    – scenario 2

    Customer cut interconnect network interface on NODE1 and NODE2.
    (Test was started from 08:02:00)


    [CSSD]2006-09-28 08:02:00.628 [3007810480] >TRACE:clssnmPollingThread:
    node b2 (2) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:00.628 [3007810480] >TRACE:clssnmPollingThread:
    node b3 (3) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:00.628 [3007810480] >TRACE:clssnmPollingThread:
    node b4 (4) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:01.630 [3007810480] >TRACE:clssnmPollingThread:
    node b2 (2) missed(3) checkin(s)
    [CSSD]2006-09-28 08:02:01.630 [3007810480] >TRACE:clssnmPollingThread:
    node b3 (3) missed(3) checkin(s)
    [CSSD]2006-09-28 08:02:01.630 [3007810480] >TRACE:clssnmPollingThread:
    node b4 (4) missed(3) checkin(s)
    [CSSD]2006-09-28 08:02:01.951 [2997320624] >TRACE:clssnmSendingThread:
    sending status msg to all nodes
    [CSSD]2006-09-28 08:02:01.951 [2997320624] >TRACE:clssnmSendingThread:
    sent status msg to all nodes
    [CSSD]2006-09-28 08:02:02.632 [3007810480] >TRACE:clssnmPollingThread:
    node b2 (2) missed(4) checkin(s)
    [CSSD]2006-09-28 08:02:02.632 [3007810480] >TRACE:clssnmPollingThread:
    node b3 (3) missed(4) checkin(s)
    [CSSD]2006-09-28 08:02:02.632 [3007810480] >TRACE:clssnmPollingThread:
    node b4 (4) missed(4) checkin(s)
    [CSSD]2006-09-28 08:02:03.634 [3007810480] >TRACE:clssnmPollingThread:
    node b2 (2) missed(5) checkin(s)
    [CSSD]2006-09-28 08:02:03.634 [3007810480] >TRACE:clssnmPollingThread:
    node b3 (3) missed(5) checkin(s)

    —> misscount was up to NODE2, NODE3, NODE4. (It looks like OK)

    [CSSD]2006-09-28 08:04:35.346 [2976340912] >TRACE:clssgmMasterSendDBDone:
    group/lock status synchronization complete
    [CSSD]2006-09-28 08:04:35.346 [2976340912]
    >TRACE:clssgmCompareSwapEventValue:
    changed CmInfo State val 9, from 8, changes 17
    [CSSD]2006-09-28 08:04:35.346 [2976340912]
    >TRACE:clssgmCompareSwapEventValue:
    changed CmInfo State val 10, from 9, changes 18
    [CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes
    [CSSD]CLSS-3001: local node number 1, master node number 1

    —> CSS survived NODE1.


    [CSSD]2006-09-28 08:02:02.758 [3008162736] >TRACE:clssnmPollingThread:
    node b1 (1) missed(4) checkin(s)
    [CSSD]2006-09-28 08:02:02.758 [3008162736] >TRACE:clssnmPollingThread:
    node b3 (3) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:02.758 [3008162736] >TRACE:clssnmPollingThread:
    node b4 (4) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:03.760 [3008162736] >TRACE:clssnmPollingThread:
    node b1 (1) missed(5) checkin(s)
    [CSSD]2006-09-28 08:02:03.760 [3008162736] >TRACE:clssnmPollingThread:
    node b3 (3) missed(3) checkin(s)
    [CSSD]2006-09-28 08:02:03.760 [3008162736] >TRACE:clssnmPollingThread:
    node b4 (4) missed(3) checkin(s)

    —> misscount was up to NODE1, NODE3, and NODE4. (It looks like OK)

    [CSSD]2006-09-28 08:04:04.994 [2987183024] >TRACE:clssnmCheckDskInfo:
    node(4) timeout(59690) state_network(0) state_disk(3) missCount(71)
    [CSSD]2006-09-28 08:04:05.306 [2987183024] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 08:04:05.306 [2987183024] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 08:04:05.306 [2987183024] >ERROR:
    : Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)

    —> CSS killed NODE2.
    *** 10/02/06 03:20 pm ***

    [CSSD]2006-09-28 08:02:02.241 [3008162736] >TRACE:clssnmPollingThread:
    node b1 (1) missed(4) checkin(s)
    [CSSD]2006-09-28 08:02:02.241 [3008162736] >TRACE:clssnmPollingThread:
    node b2 (2) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:03.243 [3008162736] >TRACE:clssnmPollingThread:
    node b1 (1) missed(5) checkin(s)
    [CSSD]2006-09-28 08:02:03.243 [3008162736] >TRACE:clssnmPollingThread:
    node b2 (2) missed(3) checkin(s)
    [CSSD]2006-09-28 08:02:04.245 [3008162736] >TRACE:clssnmPollingThread:
    node b1 (1) missed(6) checkin(s)
    [CSSD]2006-09-28 08:02:04.245 [3008162736] >TRACE:clssnmPollingThread:
    node b2 (2) missed(4) checkin(s)

    —> misscount was up to NODE1 and NODE2. (It looks like OK)

    [CSSD]2006-09-28 08:04:04.477 [2987183024] >TRACE:clssnmCheckDskInfo:
    node(4) timeout(59250) state_network(0) state_disk(3) missCount(119)
    [CSSD]2006-09-28 08:04:05.230 [2987183024] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 08:04:05.230 [2987183024] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 08:04:05.230 [2987183024] >ERROR:
    : Node(3), Leader(3), Size(1) VS Node(1), Leader(1), Size(1)

    —> CSS killed NODE3.


    [CSSD]2006-09-28 08:02:02.431 [3008158640] >TRACE:clssnmPollingThread:
    node b1 (1) missed(4) checkin(s)
    [CSSD]2006-09-28 08:02:02.431 [3008158640] >TRACE:clssnmPollingThread:
    node b2 (2) missed(2) checkin(s)
    [CSSD]2006-09-28 08:02:03.433 [3008158640] >TRACE:clssnmPollingThread:
    node b1 (1) missed(5) checkin(s)
    [CSSD]2006-09-28 08:02:03.433 [3008158640] >TRACE:clssnmPollingThread:
    node b2 (2) missed(3) checkin(s)
    [CSSD]2006-09-28 08:02:04.435 [3008158640] >TRACE:clssnmPollingThread:
    node b1 (1) missed(6) checkin(s)
    [CSSD]2006-09-28 08:02:04.435 [3008158640] >TRACE:clssnmPollingThread:
    node b2 (2) missed(4) checkin(s)

    —> misscount was up to NODE1 and NODE2. (It looks like OK)

    [CSSD]2006-09-28 08:03:04.552 [2987178928] >TRACE:clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR:clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR:
    : Node(4), Leader(3), Size(2) VS Node(1), Leader(1), Size(2)

    —> CSS killed NODE4.

    ---> The result was that only NODE1 survived.
    I think CSS should have survived NODE3 and NODE4.
    *** 10/02/06 03:46 pm ***
    *** 10/02/06 09:00 pm ***
    You say “Customer cut interconnect network interface”,
    does it mean logically ? i.e. ifconfig or something else.

    And, has the customer run the same test on a different release / PSR?
    I wonder if this is a new issue in 10.2.0.2 or not.
    In this case, we just disable the interconnect, so we expect we can
    use the disk heartbeat and find all the others are still alive.

    For 1st case, we split 1 | 2 | 3 | 4 so expect 1 to survive
    In 2nd case, we split 1 | 2 | 3 – 4 | so we expect 3 and 4 to survive.
    For case1,

    node1:
    Comparing with node2 first: node1 thought node2 could see node3 and
    node4, so their size is 3, and it killed itself.
    We can't find why node1 thinks node2 can see the other nodes.

    [ CSSD]2006-09-28 07:36:28.076 [2997689264] >TRACE: clssnmCheckDskInfo:
    node(2) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 07:36:28.076 [2997689264] >TRACE: clssnmCheckDskInfo:
    node(3) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 07:36:28.076 [2997689264] >TRACE: clssnmCheckDskInfo:
    node(4) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE: clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [ CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE: clssnmCheckDskInfo:
    node(2) mode(0) SYNC_MASTER
    [ CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR: clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [ CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR:
    : Node(1), Leader(1), Size(1) VS Node(2), Leader(2), Size(3) <----***

    node2:
    Since node1 goes down one sec before, it failed to read disk HB now.
    Anyway, they thought node1 would be still alive, and need to kill itself
    to avoid splitbrain. This seems ok.

    [ CSSD]2006-09-28 07:36:29.560 [2997824432] >TRACE: clssnmCheckDskInfo:
    node(1) timeout(1050) state_network(0) state_disk(3) missCount(63)
    [ CSSD]2006-09-28 07:36:29.560 [2997824432] >TRACE: clssnmCheckDskInfo:
    node(3) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 07:36:29.560 [2997824432] >TRACE: clssnmCheckDskInfo:
    node(4) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE: clssnmCheckDskInfo:
    node(1) timeout(2050) state_network(0) state_disk(3) missCount(64)
    [ CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE: clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [ CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR: clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [ CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR:
    : Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)

    We can see similar logs for node3 and node4.

    For case2,

    Node4:
    node4 could not see node1 and node2, but found disk HBs from them.
    node4 thought node1 and node2 could see each other, making it a 2-2 split
    brain situation, and decided to kill the node3 and node4 group.

    [ CSSD]2006-09-28 08:03:03.550 [2987178928] >TRACE: clssnmCheckDskInfo:
    node(1) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 08:03:03.550 [2987178928] >TRACE: clssnmCheckDskInfo:
    node(2) disk HB found, network state 0, disk state(3) missCount(63)
    [ CSSD]2006-09-28 08:03:04.552 [2987178928] >TRACE: clssnmCheckDskInfo:
    node(1) mode(0) SYNC_MASTER
    [ CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR: clssnmCheckDskInfo:
    Terminating local instance to avoid splitbrain.
    [ CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR:
    : Node(4), Leader(3), Size(2) VS Node(1), Leader(1), Size(2) <--***

    It's strange that node4 thought node1 and node2 can see each other.
    From node2 cssd log, we see that node2 couldn't see any other node.

    [ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
    node b1 (1) is impending reconfig
    [ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
    node b1 (1) missed(59) checkin(s)
    [ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
    node b3 (3) missed(5) checkin(s)
    [ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
    node b4 (4) missed(4) checkin(s)
    [ CSSD]2006-09-28 08:02:58.207 [2997672880] >TRACE: clssnmSendingThread:
    sending status msg to all nodes
    [ CSSD]2006-09-28 08:02:58.207 [2997672880] >TRACE: clssnmSendingThread:
    sent status msg to all nodes
    [ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
    node b1 (1) is impending reconfig
    [ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
    Eviction started for node b1 (1), flags 0x000f, state 3, wt4c 0
    [ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
    node b3 (3) missed(6) checkin(s)
    [ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
    node b4 (4) missed(5) checkin(s)

    Node1, 2 and 3 finish splitbrain resolution about one minute after node4
    goes away. Since node4 goes away, node1 could survive. This is expected
    behavior, except that node4 killed itself.
    It looks like cssd couldn't resolve the splitbrain as expected.
    But this is not the worst situation, because we don't actually face a splitbrain.

  2. admin

    CSS Timeout Computation in Oracle Clusterware

    Applies to:

    Oracle Server – Enterprise Edition – Version: 10.1.0.2 to 11.1.0.6
    Oracle Server – Standard Edition – Version: 10.1.0.2 to 11.1.0.6
    Information in this document applies to any platform.
    Oracle Clusterware

    Purpose

    The purpose of this Note is to document default CSS misscount timeout calculations in 10g  Release 1,  10g Release 2 , 11g and higher versions.

    Scope and Application

    • Define misscount parameter
    • Define the default calculations for the misscount parameter
    • Describe Cluster Synchronization Service (CSS) heartbeats and their interrelationship
    • Describe the cases where the default calculation may be too sensitive

    CSS Timeout Computation in Oracle Clusterware

    MISSCOUNT DEFINITION AND DEFAULT VALUES
    The CSS misscount parameter represents the maximum time, in seconds, that a network heartbeat can be missed before entering into a cluster reconfiguration to evict the node. The following are the default values for the misscount parameter and their respective versions when using Oracle Clusterware* in seconds:

    OS         10g (R1 & R2)   11g
    Linux      60              30
    Unix       30              30
    VMS        30              30
    Windows    30              30

    *CSS misscount default value when using vendor (non-Oracle) clusterware is 600 seconds. This is to allow the vendor clusterware ample time to resolve any possible split brain scenarios.

    On AIX platforms with HACMP starting with 10.2.0.3 BP#1, the misscount is 30. This is documented in Note 551658.1

    CSS HEARTBEAT MECHANISMS AND THEIR INTERRELATIONSHIP
    The synchronization services component (CSS) of the Oracle Clusterware maintains two heartbeat mechanisms 1.) the disk heartbeat to the voting device and 2.) the network heartbeat  across the interconnect which establish and confirm valid node membership in the cluster. Both of these heartbeat mechanisms have an associated timeout value. The disk heartbeat has an internal i/o timeout interval (DTO Disk TimeOut), in seconds, where an i/o to the voting disk must complete. The misscount parameter (MC), as stated above, is the maximum time, in seconds, that a network heartbeat  can be missed. The disk heartbeat i/o timeout interval is directly related to the misscount parameter setting. There has been some variation in this relationship 
    between versions as described below:

    9.x.x.x                                      NOTE: MISSCOUNT WAS A DIFFERENT ENTITY IN THIS RELEASE
    10.1.0.2                                     No one should be on this version
    10.1.0.3                                     DTO = MC - 15 seconds
    10.1.0.4                                     DTO = MC - 15 seconds
    10.1.0.4 + Unpublished Bug 3306964           DTO = MC - 3 seconds
    10.1.0.4 with CRS II Merge patch             DTO = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration
    10.1.0.5                                     IOT = MC - 3 seconds
    10.2.0.1 + Fix for unpublished Bug 4896338   IOT = Disktimeout (defaults to 200 seconds) normally, OR Misscount seconds only during initial cluster formation or slightly before reconfiguration
    10.2.0.2                                     Same as above (10.2.0.1 with the fix for Bug 4896338)
    10.1 - 11.1                                  During node join and leave (reconfiguration), the Short Disk TimeOut (SDTO) is used, which in all versions is SDTO = MC - reboottime (usually 3 seconds)

    Misscount drives cluster membership reconfigurations and directly affects the availability of the cluster. In most cases, the default settings for MC should be acceptable. Modifying the default value of misscount not only influences the timeout interval for the i/o to the voting disk, but also influences the tolerance for missed network heartbeats across the interconnect.

    LONG LATENCIES TO THE VOTING DISKS
    If I/O latencies to the voting disk are greater than the default DTO calculations noted above, the cluster may experience CSS node evictions depending on (a)the Oracle Clusterware (CRS) version, (b)whether merge patch has been applied and (c)the state of the Cluster. More details on this are covered in the section "Change in Behavior with CRS Merge PATCH (4896338 on 10.2.0.1)".

    These latencies can be attributed to any number of problems in the i/o subsystem or problems with any component in the i/o path. The following is a non-exhaustive list of reported problems which resulted in CSS node eviction due to latencies to the voting disk longer than the default Oracle Clusterware i/o timeout value (DTO):

    1. QLogic HBA cards with a Link Down Timeout greater than the default misscount.
    2. Bad cables to the SAN/storage array that affect i/o latencies.
    3. SAN switch (like Brocade) failover latency greater than the default misscount.
    4. EMC Clariion Array when trespassing the SP to the backup SP greater than default misscount.
    5. EMC PowerPath path error detection and I/O repost and redirect greater than default misscount.
    6. NetApp Cluster (CFO) failover latency greater than default misscount.
    7. Sustained high CPU load which affects the CSSD disk ping monitoring thread.
    8. Poor SAN network configuration that creates latencies in the I/O path.

    The most common problems relate to multi-path IO software drivers, and the reconfiguration times resulting from a failure in the IO path. Hardware and (re)configuration issues that introduce these latencies should be corrected. Incompatible failover times with underlying OS, network or storage hardware or software may be addressed given a complete understanding of the considerations listed below.

    Misscount should NOT be modified to work around the above-mentioned issues. Oracle Support recommends that you apply the latest patchset, which changes the CSS behaviour. More details are covered in the next section.

    Change in Behavior with Bug:4896338 applied on top of 10.2.0.1
    Starting with 10.2.0.1+Bug:4896338, CSS will not evict the node from the cluster due to (DTO) I/O to voting disk taking more than misscount seconds unless it is during the initial cluster formation or slightly before reconfiguration. 
    So if we have a N number of nodes in a cluster and one of the nodes takes more than misscount seconds to access the voting disk, the node will not be evicted as long as the access to the voting disk is completed within disktimeout seconds. Consequently with this patch, there is no need to increase the misscount at all.

    Additionally this merge patch introduces Disktimeout  which is the amount of time that a lack of disk ping to voting disk(s) will be tolerated.

    Note:  applying the patch will not change your value for Misscount.  

    The table below explains the conditions under which an eviction will occur:

    Network Ping                          Disk Ping                                                       Reboot
    Completes within misscount seconds    Completes within misscount seconds                              N
    Completes within misscount seconds    Takes more than misscount seconds but less than Disktimeout     N
    Completes within misscount seconds    Takes more than Disktimeout seconds                             Y
    Takes more than misscount seconds     Completes within misscount seconds                              Y

    * By default Misscount is less than Disktimeout seconds

    CONSIDERATIONS WHEN CHANGING MISSCOUNT FROM THE DEFAULT VALUE

    1. Customers drive SLA and cluster availability. The customer ultimately defines Service Levels and availability for the cluster. Before recommending any change to misscount, the full impact of that change should be described and the impact to cluster availability measured.
    2. Customers may have timeout and retry logic in their applications. The impact of delaying reconfiguration may cause ‘artificial’ timeouts of the application, reconnect failures and subsequent logon storms.
    3. Misscount timeout values are version dependent and are subject to change. As we have seen, misscount calculations are variable between releases and between versions within a release. Creating a false dependency on misscount calculation in one version may not be appropriate for later versions.
    4. Internal I/O timeout interval (DTO) algorithms may change in later releases as stated above, there exists a direct relationship between the internal I/O timeout interval and misscount. This relationship is subject to change in later releases.
    5. An increase in misscount to compensate for i/o latencies directly affects reconfiguration times for network failures. The network heartbeat is the primary indicator of connectivity within the cluster. Misscount is the tolerance level of missed 'check ins' that trigger cluster reconfiguration. Increasing misscount will prolong the time to take corrective action in the event of network failure or other anomalies affecting the availability of a node in the cluster. This directly affects cluster availability.
    6. Changing misscount to work around voting disk latencies must be undone once the underlying problem is fixed: the customer needs to document the change and set the parameter back to the default when the underlying storage I/O latency is resolved.
    7. Do not change default misscount values if you are running Vendor Clusterware along with Oracle Clusterware. The default values for misscount should not be changed when using vendor clusterware. Modifying misscount in this environment may cause clusterwide outages and potential corruptions.
    8. Changing the misscount parameter incurs a clusterwide outage. As noted below, the customer will need to schedule a clusterwide outage to make this change.
    9. Changing misscount should not be used to compensate for poor configurations or faulty hardware.
    10. Cluster and RDBMS availability are directly affected by high misscount settings.
    11. In the case of stretched clusters and stretched storage systems, a site failure where we lose one storage array and N nodes puts the cluster into a reconfiguration state, and we then revert to the ShortDiskTimeOut value as the internal I/O timeout for the voting disks. Several cases are known with stretched clusters where, when a site failure happens, the storage failover cannot complete within SDTO. If the I/O to the voting disks is blocked for more than SDTO, the result is node evictions on the surviving side.

    To change MISSCOUNT back to the default, please refer to Note:284752.1
    THIS IS THE ONLY SUPPORTED METHOD. NOT FOLLOWING THIS METHOD RISKS EVICTIONS AND/OR CORRUPTING THE OCR

    10g Release 2 MIRRORED VOTING DISKS AND VENDOR MULTIPATHING SOLUTIONS 
    Oracle RAC 10g Release 2 allows for multiple voting disks so that  the customer does not have to rely on a multipathing solution from a storage vendor. You can have n voting disks (up to 31) where n = m*2+1 where m is the number of disk failures you  want to survive. Oracle recommends each voting disk to be on a separate physical disk.
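
    As a worked example of that formula: tolerating m = 1 voting-disk failure requires n = 2*1 + 1 = 3 voting disks, and m = 2 requires 5. A sketch of adding the extra copies in 10.2 follows (the device paths are purely illustrative, and the clusterware stack is normally stopped first):

    crsctl add css votedisk /dev/rhdisk11 -force
    crsctl add css votedisk /dev/rhdisk12 -force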

     

  3. admin

    10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout

    PURPOSE
    ——-
    The purpose of this note is to document the steps needed to modify the CSS
    misscount, reboottime and disktimeout settings. Please review Note 294430.1
    to understand the implications before editing these settings.

    SCOPE & APPLICATION
    ——————-
    Customers should not modify CSS settings unless guided by either Oracle support or Oracle development to do so.

    Steps To Modify The CSS Misscount
    ——————————–

    1) Shut down CRS on all but one node. For exact steps use Note 309542.1
    2) Execute crsctl as root to modify the misscount:
    $ORA_CRS_HOME/bin/crsctl set css misscount <n>
    where <n> is the maximum i/o latency to the voting disk + 1 second
    3) Reboot the node where adjustment was made
    4) Start all other nodes shutdown in step 1

    With the Patch:4896338 for 10.2.0.1 there are two additional settings that can
    be tuned. This change is incorporated into the 10.2.0.2 and 10.1.0.6 patchsets.

    The following are only relevant on 10.2.0.1 with Patch:4896338:
    In addition to MissCount, CSS now has two more parameters:
    1) reboottime (default 3 seconds) – the amount of time allowed for a node
    to complete a reboot after the CSS daemon has been evicted. (I.E. how
    long does it take for the machine to completely shutdown when you do a
    reboot)
    2) disktimeout (default 200 seconds) – the maximum amount of time allowed
    for a voting file I/O to complete; if this time is exceeded the voting
    disk will be marked as offline. Note that this is also the amount of
    time that will be required for initial cluster formation, i.e. when no
    nodes have previously been up and in a cluster.

    $CRS_HOME/bin/crsctl set css reboottime <r> [-force] (<r> is seconds)
    $CRS_HOME/bin/crsctl set css disktimeout <d> [-force] (<d> is seconds)

    Confirm the new css misscount setting via ocrdump
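
    A hedged sketch of that confirmation step (the dump file name is arbitrary):

    $ORA_CRS_HOME/bin/ocrdump /tmp/ocr.dmp
    grep -i misscount /tmp/ocr.dmp
    # the misscount key in the OCR dump should now show the new value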

    RELATED DOCUMENTS
    —————–
    Note:294430.1 CSS Timeout Computation in 10g RAC (10gR1 and 10gR2)
    Note:309542.1 How to start/stop 10g CRS ClusterWare
