admin
2010-08-22
Hdr: 5577914 10.2.0.2 PCW 10.2.0.2 CSS PRODID-5 PORTID-226 4930431
Abstract: CSS CANNOT RESOLVE SPLIT BRAIN, WHEN SOME NODES OCCUR INTERCONNECT FAILURE
Customer tested the following 2 scenarios.
- scenario 1
Customer cut interconnect network interface on all nodes.
(Test was started from 07:35:24)
<NODE1>
[CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE:clssnmCheckDskInfo:
node(2) mode(0) SYNC_MASTER
[CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR:
: Node(1), Leader(1), Size(1) VS Node(2), Leader(2), Size(3)
<NODE2>
[CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE:clssnmCheckDskInfo:
node(1) timeout(2050) state_network(0) state_disk(3) missCount(64)
[CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR:
: Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)
<NODE3>
[CSSD]2006-09-28 07:36:36.058 [2987334576] >TRACE:clssnmCheckDskInfo:
node(1) timeout(6840) state_network(0) state_disk(3) missCount(63)
[CSSD]2006-09-28 07:36:36.058 [2987334576] >TRACE:clssnmCheckDskInfo:
node(2) timeout(4830) state_network(0) state_disk(3) missCount(63)
[CSSD]2006-09-28 07:36:36.058 [2987334576] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 07:36:36.058 [2987334576] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 07:36:36.058 [2987334576] >ERROR:
: Node(3), Leader(3), Size(1) VS Node(1), Leader(1), Size(1)
<NODE4>
[CSSD]2006-09-28 07:36:36.918 [2997820336] >TRACE:clssnmCheckDskInfo:
node(1) timeout(6300) state_network(0) state_disk(3) missCount(64)
[CSSD]2006-09-28 07:36:36.918 [2997820336] >TRACE:clssnmCheckDskInfo:
node(2) timeout(4300) state_network(0) state_disk(3) missCount(64)
[CSSD]2006-09-28 07:36:36.918 [2997820336] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 07:36:36.918 [2997820336] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 07:36:36.918 [2997820336] >ERROR:
: Node(4), Leader(4), Size(1) VS Node(1), Leader(1), Size(1)
--> Result: ALL nodes went down.
I think CSS should have kept NODE1 alive.
- scenario 2
Customer cut interconnect network interface on NODE1 and NODE2.
(Test was started from 08:02:00)
<NODE1>
[CSSD]2006-09-28 08:02:00.628 [3007810480] >TRACE:clssnmPollingThread:
node b2 (2) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:00.628 [3007810480] >TRACE:clssnmPollingThread:
node b3 (3) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:00.628 [3007810480] >TRACE:clssnmPollingThread:
node b4 (4) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:01.630 [3007810480] >TRACE:clssnmPollingThread:
node b2 (2) missed(3) checkin(s)
[CSSD]2006-09-28 08:02:01.630 [3007810480] >TRACE:clssnmPollingThread:
node b3 (3) missed(3) checkin(s)
[CSSD]2006-09-28 08:02:01.630 [3007810480] >TRACE:clssnmPollingThread:
node b4 (4) missed(3) checkin(s)
[CSSD]2006-09-28 08:02:01.951 [2997320624] >TRACE:clssnmSendingThread:
sending status msg to all nodes
[CSSD]2006-09-28 08:02:01.951 [2997320624] >TRACE:clssnmSendingThread:
sent status msg to all nodes
[CSSD]2006-09-28 08:02:02.632 [3007810480] >TRACE:clssnmPollingThread:
node b2 (2) missed(4) checkin(s)
[CSSD]2006-09-28 08:02:02.632 [3007810480] >TRACE:clssnmPollingThread:
node b3 (3) missed(4) checkin(s)
[CSSD]2006-09-28 08:02:02.632 [3007810480] >TRACE:clssnmPollingThread:
node b4 (4) missed(4) checkin(s)
[CSSD]2006-09-28 08:02:03.634 [3007810480] >TRACE:clssnmPollingThread:
node b2 (2) missed(5) checkin(s)
[CSSD]2006-09-28 08:02:03.634 [3007810480] >TRACE:clssnmPollingThread:
node b3 (3) missed(5) checkin(s)
---> The missed check-in count was increasing for NODE2, NODE3, and NODE4. (This looks OK.)
[CSSD]2006-09-28 08:04:35.346 [2976340912] >TRACE:clssgmMasterSendDBDone:
group/lock status synchronization complete
[CSSD]2006-09-28 08:04:35.346 [2976340912]
>TRACE:clssgmCompareSwapEventValue:
changed CmInfo State val 9, from 8, changes 17
[CSSD]2006-09-28 08:04:35.346 [2976340912]
>TRACE:clssgmCompareSwapEventValue:
changed CmInfo State val 10, from 9, changes 18
[CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes
[CSSD]CLSS-3001: local node number 1, master node number 1
---> NODE1 survived.
<NODE2>
[CSSD]2006-09-28 08:02:02.758 [3008162736] >TRACE:clssnmPollingThread:
node b1 (1) missed(4) checkin(s)
[CSSD]2006-09-28 08:02:02.758 [3008162736] >TRACE:clssnmPollingThread:
node b3 (3) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:02.758 [3008162736] >TRACE:clssnmPollingThread:
node b4 (4) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:03.760 [3008162736] >TRACE:clssnmPollingThread:
node b1 (1) missed(5) checkin(s)
[CSSD]2006-09-28 08:02:03.760 [3008162736] >TRACE:clssnmPollingThread:
node b3 (3) missed(3) checkin(s)
[CSSD]2006-09-28 08:02:03.760 [3008162736] >TRACE:clssnmPollingThread:
node b4 (4) missed(3) checkin(s)
---> The missed check-in count was increasing for NODE1, NODE3, and NODE4. (This looks OK.)
[CSSD]2006-09-28 08:04:04.994 [2987183024] >TRACE:clssnmCheckDskInfo:
node(4) timeout(59690) state_network(0) state_disk(3) missCount(71)
[CSSD]2006-09-28 08:04:05.306 [2987183024] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 08:04:05.306 [2987183024] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 08:04:05.306 [2987183024] >ERROR:
: Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)
---> CSS killed NODE2.
*** 10/02/06 03:20 pm ***
<NODE3>
[CSSD]2006-09-28 08:02:02.241 [3008162736] >TRACE:clssnmPollingThread:
node b1 (1) missed(4) checkin(s)
[CSSD]2006-09-28 08:02:02.241 [3008162736] >TRACE:clssnmPollingThread:
node b2 (2) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:03.243 [3008162736] >TRACE:clssnmPollingThread:
node b1 (1) missed(5) checkin(s)
[CSSD]2006-09-28 08:02:03.243 [3008162736] >TRACE:clssnmPollingThread:
node b2 (2) missed(3) checkin(s)
[CSSD]2006-09-28 08:02:04.245 [3008162736] >TRACE:clssnmPollingThread:
node b1 (1) missed(6) checkin(s)
[CSSD]2006-09-28 08:02:04.245 [3008162736] >TRACE:clssnmPollingThread:
node b2 (2) missed(4) checkin(s)
---> The missed check-in count was increasing for NODE1 and NODE2. (This looks OK.)
[CSSD]2006-09-28 08:04:04.477 [2987183024] >TRACE:clssnmCheckDskInfo:
node(4) timeout(59250) state_network(0) state_disk(3) missCount(119)
[CSSD]2006-09-28 08:04:05.230 [2987183024] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 08:04:05.230 [2987183024] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 08:04:05.230 [2987183024] >ERROR:
: Node(3), Leader(3), Size(1) VS Node(1), Leader(1), Size(1)
---> CSS killed NODE3.
<NODE4>
[CSSD]2006-09-28 08:02:02.431 [3008158640] >TRACE:clssnmPollingThread:
node b1 (1) missed(4) checkin(s)
[CSSD]2006-09-28 08:02:02.431 [3008158640] >TRACE:clssnmPollingThread:
node b2 (2) missed(2) checkin(s)
[CSSD]2006-09-28 08:02:03.433 [3008158640] >TRACE:clssnmPollingThread:
node b1 (1) missed(5) checkin(s)
[CSSD]2006-09-28 08:02:03.433 [3008158640] >TRACE:clssnmPollingThread:
node b2 (2) missed(3) checkin(s)
[CSSD]2006-09-28 08:02:04.435 [3008158640] >TRACE:clssnmPollingThread:
node b1 (1) missed(6) checkin(s)
[CSSD]2006-09-28 08:02:04.435 [3008158640] >TRACE:clssnmPollingThread:
node b2 (2) missed(4) checkin(s)
---> The missed check-in count was increasing for NODE1 and NODE2. (This looks OK.)
[CSSD]2006-09-28 08:03:04.552 [2987178928] >TRACE:clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR:clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR:
: Node(4), Leader(3), Size(2) VS Node(1), Leader(1), Size(2)
---> CSS killed NODE4.
---> Result: only NODE1 survived.
I think NODE3 and NODE4 should have survived instead.
*** 10/02/06 03:46 pm ***
*** 10/02/06 03:46 pm *** (CHG: Sta->16)
*** 10/02/06 04:24 pm *** (CHG: Asg->NEW OWNER OWNER)
*** 10/02/06 04:26 pm *** (CHG: Comp->PCW SubComp->CSS)
*** 10/02/06 09:00 pm *** (CHG: Sta->10)
*** 10/02/06 09:00 pm ***
You say "Customer cut interconnect network interface".
Does that mean it was disabled logically, i.e. with ifconfig or something similar?
Also, has the customer run the same test on a different release / PSR?
I wonder whether this is a new issue in 10.2.0.2 or not.
In this case we only disable the interconnect, so we expect the disk
heartbeat to keep working and every node to find that all the others are still alive.
In the 1st case the split is 1 | 2 | 3 | 4, so we expect node 1 to survive.
In the 2nd case the split is 1 | 2 | 3 - 4, so we expect nodes 3 and 4 to survive.
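The two expectations above are consistent with a simple rule: the largest sub-cluster survives, and a tie goes to the sub-cluster that contains the lowest node number. The following is a minimal Python sketch of that rule. The rule itself is inferred from the expected outcomes stated in this report, not taken from the CSS source, and `expected_survivors` is a hypothetical name used only for illustration.

```python
def expected_survivors(subclusters):
    """Return the sub-cluster expected to survive a network split.

    Assumed rule (inferred from this report, not from CSS source):
    the largest sub-cluster wins; on a tie, the sub-cluster that
    contains the lowest node number wins.
    """
    # Sort key: larger size first, then lower minimum node number.
    return min(subclusters, key=lambda c: (-len(c), min(c)))


# Case 1: 1 | 2 | 3 | 4
print(expected_survivors([{1}, {2}, {3}, {4}]))  # {1}
# Case 2: 1 | 2 | 3 - 4
print(expected_survivors([{1}, {2}, {3, 4}]))    # {3, 4}
```

Under this assumed rule, the observed results (all nodes down in case 1, only node 1 alive in case 2) both differ from the expected survivors.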
For case1,
node1:
node1 compared itself with node2 first. node1 thought that node2, node3 and
node4 could see each other, so that their size was 3, and then killed itself.
We can't find why node1 thought node2 could see the other nodes.
[ CSSD]2006-09-28 07:36:28.076 [2997689264] >TRACE: clssnmCheckDskInfo:
node(2) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 07:36:28.076 [2997689264] >TRACE: clssnmCheckDskInfo:
node(3) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 07:36:28.076 [2997689264] >TRACE: clssnmCheckDskInfo:
node(4) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE: clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[ CSSD]2006-09-28 07:36:29.078 [2997689264] >TRACE: clssnmCheckDskInfo:
node(2) mode(0) SYNC_MASTER
[ CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR: clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[ CSSD]2006-09-28 07:36:29.078 [2997689264] >ERROR:
: Node(1), Leader(1), Size(1) VS Node(2), Leader(2), Size(3) <----***
node2:
Since node1 went down one second earlier, node2 now fails to read node1's disk HB.
Still, node2 assumed node1 might be alive and needed to kill itself
to avoid split brain. This seems OK.
[ CSSD]2006-09-28 07:36:29.560 [2997824432] >TRACE: clssnmCheckDskInfo:
node(1) timeout(1050) state_network(0) state_disk(3) missCount(63)
[ CSSD]2006-09-28 07:36:29.560 [2997824432] >TRACE: clssnmCheckDskInfo:
node(3) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 07:36:29.560 [2997824432] >TRACE: clssnmCheckDskInfo:
node(4) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE: clssnmCheckDskInfo:
node(1) timeout(2050) state_network(0) state_disk(3) missCount(64)
[ CSSD]2006-09-28 07:36:30.562 [2997824432] >TRACE: clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[ CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR: clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[ CSSD]2006-09-28 07:36:30.562 [2997824432] >ERROR:
: Node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(1)
We can see similar logs for node3 and node4.
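To make the "Node(..), Leader(..), Size(..) VS ..." lines above easier to compare, here is a rough Python sketch that parses such a line and applies a decision rule to it. The rule (the local node terminates if the remote cohort is larger, or if sizes are equal and the remote leader number is lower) is an assumption that happens to match every ERROR line quoted in this report; it is not taken from the CSS source. `parse_cohorts` and `local_should_terminate` are hypothetical helper names.

```python
import re

# Matches the cohort comparison printed by clssnmCheckDskInfo in the logs above.
_COHORT = re.compile(
    r"Node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)"
    r"\s+VS\s+"
    r"Node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)"
)


def parse_cohorts(line):
    """Return (local, remote) dicts parsed from a comparison line, or None."""
    m = _COHORT.search(line)
    if m is None:
        return None
    n = [int(x) for x in m.groups()]
    local = {"node": n[0], "leader": n[1], "size": n[2]}
    remote = {"node": n[3], "leader": n[4], "size": n[5]}
    return local, remote


def local_should_terminate(local, remote):
    """Assumed rule: the bigger cohort wins; on a tie the lower leader wins."""
    if remote["size"] != local["size"]:
        return remote["size"] > local["size"]
    return remote["leader"] < local["leader"]


# Line logged by node1 in case 1: node2's cohort is reported with Size(3).
local, remote = parse_cohorts(
    ": Node(1), Leader(1), Size(1) VS Node(2), Leader(2), Size(3)"
)
print(local_should_terminate(local, remote))  # True

# Same comparison if node2's cohort had been reported with Size(1).
remote_fixed = dict(remote, size=1)
print(local_should_terminate(local, remote_fixed))  # False
```

Under this assumed rule, the unexplained Size(3) for node2 is the only reason node1 terminates in case 1; with Size(1) node1 would have stayed up.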
For case2,
Node4:
node4 could not see node1 and node2 over the network, but found disk HB from them.
node4 thought node1 and node2 could see each other, so it saw a 2-2 split-brain
situation and decided to kill the node3/node4 group.
[ CSSD]2006-09-28 08:03:03.550 [2987178928] >TRACE: clssnmCheckDskInfo:
node(1) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 08:03:03.550 [2987178928] >TRACE: clssnmCheckDskInfo:
node(2) disk HB found, network state 0, disk state(3) missCount(63)
[ CSSD]2006-09-28 08:03:04.552 [2987178928] >TRACE: clssnmCheckDskInfo:
node(1) mode(0) SYNC_MASTER
[ CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR: clssnmCheckDskInfo:
Terminating local instance to avoid splitbrain.
[ CSSD]2006-09-28 08:03:04.552 [2987178928] >ERROR:
: Node(4), Leader(3), Size(2) VS Node(1), Leader(1), Size(2) <--***
It's strange that node4 thought node1 and node2 could see each other.
In the node2 cssd log, we see that node2 couldn't see any other node.
[ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
node b1 (1) is impending reconfig
[ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
node b1 (1) missed(59) checkin(s)
[ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
node b3 (3) missed(5) checkin(s)
[ CSSD]2006-09-28 08:02:57.864 [3008162736] >TRACE: clssnmPollingThread:
node b4 (4) missed(4) checkin(s)
[ CSSD]2006-09-28 08:02:58.207 [2997672880] >TRACE: clssnmSendingThread:
sending status msg to all nodes
[ CSSD]2006-09-28 08:02:58.207 [2997672880] >TRACE: clssnmSendingThread:
sent status msg to all nodes
[ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
node b1 (1) is impending reconfig
[ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
Eviction started for node b1 (1), flags 0x000f, state 3, wt4c 0
[ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
node b3 (3) missed(6) checkin(s)
[ CSSD]2006-09-28 08:02:58.866 [3008162736] >TRACE: clssnmPollingThread:
node b4 (4) missed(5) checkin(s)
Nodes 1, 2 and 3 finish split-brain resolution about one minute after node4
goes away. Since node4 had gone away, node1 could survive. This is expected
behavior, except for node4 killing itself.
It looks like cssd couldn't resolve the split brain as expected.
But this is not the worst situation, because we don't end up with a split brain.
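As a quick check of the timing described above, the following Python sketch computes how long after node4's termination the other case 2 events were logged. The timestamps are copied from the log excerpts in this report; `gaps_after` and the event labels are hypothetical names chosen for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the case 2 log excerpts in this report.
events = {
    "node4 terminated": "2006-09-28 08:03:04.552",
    "node3 terminated": "2006-09-28 08:04:05.230",
    "node2 terminated": "2006-09-28 08:04:05.306",
    "node1 reconfig done": "2006-09-28 08:04:35.346",
}


def gaps_after(events, reference):
    """Return seconds elapsed from the reference event to every other event."""
    t0 = datetime.strptime(events[reference], FMT)
    return {
        name: (datetime.strptime(ts, FMT) - t0).total_seconds()
        for name, ts in events.items()
        if name != reference
    }


for name, seconds in gaps_after(events, "node4 terminated").items():
    print(f"{name}: +{seconds:.3f}s")
```

With these timestamps, node3 and node2 terminate roughly 60.7 seconds after node4, and node1 completes its reconfiguration roughly 90.8 seconds after node4.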