Oracle RAC 12.1+: improvements to the split-brain node eviction algorithm

Starting with Oracle 12.1, the algorithm RAC uses to decide which node to evict when a split brain occurs has changed. Previously, in a 2-node RAC, the node that was not the lower-numbered (master) node was always evicted, which in practice meant node 2 was the one removed. From 12.1 onward, the clusterware maintains a weight for each node, computed mainly from the services running on each node (or sub-cluster) and the workload connected to those services. When a split brain occurs, the node with the lower weight, i.e. the more lightly loaded one, is evicted, so that the node serving more users survives. For details, see the documents 12c: Which Node Will Survive when Split Brain Takes Place (Doc ID 1951726.1) and Split Brain: What's new in Oracle Database 12.1.0.2c?

 

Understanding the node-retention policy applied when a split brain occurs, starting with 12.1.0.2.

In 11.2 and earlier releases, the node with the smaller node number survives when a split brain occurs. Starting with 12.1.0.2, the concept of node weight was introduced: during split-brain resolution, the node with the higher weight survives.

 

 

The function responsible for computing the weight is clssnmrCheckNodeWeight. The clssnm prefix refers to Node Monitoring (clssnm.c): node monitoring (NM) is used to verify the health of all members of the cluster, and it maintains consistency with vendor clusterware (if it exists) via skgxn.
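
A quick way to see this weight calculation for yourself after a split-brain event is to search the ocssd trace on a surviving node for these messages. This is only an illustrative check: the trace path below follows the 12c ADR layout used in the demonstration later in this post and will differ on your system, and the grep pattern simply picks out the weight and eviction messages quoted throughout this article.

[root@host01 ~]# grep -E 'clssnmrCheckNodeWeight|clssnmCheckDskInfo|clssnmrRemoveNode' /u01/app/grid/diag/crs/host01/crs/trace/ocssd.trc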

 

12c: Which Node Will Survive when Split Brain Takes Place (Doc ID 1951726.1)

 

PURPOSE
To understand the new behavior, starting from 12.1.0.2, that determines which node will survive when a split brain takes place.

DETAILS
In 11.2 and older versions, the lowest-numbered node survives when a split brain takes place; however, this has changed in 12.1.0.2 with the introduction of node weight. Starting from 12.1.0.2, during split-brain resolution, the node with the higher weight will survive:

2014-11-24 14:25:41.140603 : CSSD:1117321536: clssnmrCheckNodeWeight: node(1) has weight stamp(0), pebble(0)
2014-11-24 14:25:41.140609 : CSSD:1117321536: clssnmrCheckNodeWeight: node(2) has weight stamp(311972654), pebble(3)
2014-11-24 14:25:41.140612 : CSSD:1117321536: clssnmrCheckNodeWeight: stamp(311972654), completed(1/2)
2014-11-24 14:25:41.140615 : CSSD:1117321536: clssnmrCheckSplit: Waiting for node weights, stamp(311972654)
2014-11-24 14:25:41.188880 : CSSD:1084811584: clssnmvDiskKillCheck: not evicted, file /dev/raw/raw2 flags 0x00000000, kill block unique 0, my unique 1416805718
2014-11-24 14:25:41.558921 : CSSD:1114167616: clssnmvDiskPing: Writing with status 0x3, timestamp 1416810341/1022717334
2014-11-24 14:25:41.731912 : CSSD:1086388544: clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 311972655, wrtcnt, 9527468, LATS 1022717514, lastSeqNo 9527467, uniqueness 1416808381, timestamp 1416810341/1022722074
2014-11-24 14:25:41.731928 : CSSD:1086388544: clssnmvReadDskHeartbeat: manual shutdown of nodename node1, nodenum 1 epoch 1416810341 msec 1022722074
2014-11-24 14:25:41.732266 : CSSD:1117321536: clssnmrCheckNodeWeight: node(2) has weight stamp(311972654), pebble(3)
2014-11-24 14:25:41.732273 : CSSD:1117321536: clssnmrCheckNodeWeight: stamp(311972654), completed(1/1)
2014-11-24 14:25:41.732294 : CSSD:1117321536: clssnmCheckDskInfo: My cohort: 2
2014-11-24 14:25:41.732299 : CSSD:1117321536: clssnmRemove: Start
2014-11-24 14:25:41.732306 : CSSD:1117321536: (:CSSNM00007:)clssnmrRemoveNode: Evicting node 1, node1, from the cluster in incarnation 311972655, node birth incarnation 311972654, death incarnation 311972655, stateflags 0x225000 uniqueness value 1416808381

The weight takes into account the number of resources executing on each node, among other factors.


    Split Brain: What’s new in Oracle Database 12.1.0.2c?

    In an Oracle cluster prior to version 12.1.0.2c, when a split brain problem occurs, the node with the lowest node number survives. However, starting from Oracle Database 12.1.0.2c, the node with the higher weight will survive during split-brain resolution.

    In this article I will explore this new feature for one of the possible factors contributing to the node weight, i.e. the number of database services executing on a node.

    What is Split Brain?
    In a cluster, a private interconnect is used by cluster nodes to monitor each node’s status and communicate with each other. When two or more nodes fail to ping or connect to each other via this private interconnect, the cluster gets partitioned into two or more smaller sub-clusters each of which cannot talk to others over the interconnect. Oblivious of the existence of other cluster fragments, each sub-cluster continues to operate independently of the others. This is called “Split Brain”. In such a scenario, integrity of the cluster and its data might be compromised due to uncoordinated writes to shared data by independently operating nodes. Hence, to protect the integrity of the cluster and its data, the split-brain must be resolved.

    How does the Oracle Grid Infrastructure Clusterware resolve a “split brain” situation?
    Voting disk is used by Oracle Cluster Synchronization Services Daemon (ocssd) on each node, to mark its own attendance and also to record the nodes it can communicate with. In a “split brain” situation, voting disk is used to determine which node(s) will survive and which node(s) will be evicted.
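
    For reference, the voting disks a given cluster is using can be listed with crsctl. This check is not part of the original article; the command below is standard, and its output (omitted here) is environment specific.

    [root@host01 ~]# crsctl query css votedisk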

    Prior to Oracle Database 12.1.0.2c, the algorithm to determine the node(s) to be retained / evicted is as follows:

    If the sub-clusters are of different sizes, the clusterware identifies the largest sub-cluster and aborts all the nodes which do not belong to that sub-cluster.
    If all the sub-clusters are of the same size, the sub-cluster having the lowest-numbered node survives, so that in a 2-node cluster the node with the lowest node number will survive.

    However, starting from 12.1.0.2c, the node eviction algorithm has been improved for the split-brain case. In order to keep the largest number of resources available to users, a node weight is computed for each node based on the number of resources executing on it, and the sub-cluster with the higher weight will survive.

    Starting in Oracle Database 12.1.0.2c, the new algorithm to determine the node(s) to be retained / evicted is as follows:

    If the sub-clusters are of different sizes, the functionality is the same as before, i.e. the clusterware identifies the largest sub-cluster and aborts all the nodes which do NOT belong to that sub-cluster.
    If all the sub-clusters are of the same size, the functionality has been modified as follows:
        If the sub-clusters have equal node weights, the sub-cluster with the lowest-numbered node in it survives, so that in a 2-node cluster the node with the lowest node number will survive.
        If the sub-clusters have unequal node weights, the sub-cluster having the higher weight survives, so that in a 2-node cluster the node with the lowest node number might be evicted if it has a lower weight.
    Now I will demonstrate this new feature in a standard 3-node Oracle 12.1.0.2c cluster, using a RAC database called admindb, for one of the possible factors contributing to the node weight, i.e. the number of database services executing on a node. Since I will only explore the scenarios for which the functionality has been modified, i.e. sub-clusters of equal size, I have shut down one of the nodes so that there are only 2 active nodes in the cluster.

    Current scenario:
    Name of the cluster: Cluster01.example.com

    Number of nodes: 3 (host01, host02, host03)

    Name of RAC database: admindb

    Instances of RAC database: admindb1 on host01

    admindb2 on host02

    Overview
    Check that only two nodes (host01 and host02) are active and host01 has the lower node number.
    Create two singleton services for the RAC database admindb:
        Serv1: Preferred instance admindb1
        Serv2: Preferred instance admindb2
    Case-I: Equal number of database services executing on both nodes
        Start both services for database admindb so that an equal number of database services executes on both nodes.
        Simulate loss of connectivity between the two nodes.
        Verify that:
            host01 is retained as it has a lower node number.
            host02 is evicted.
    Case-II: Unequal number of database services executing on the two nodes
        Stop the service serv1 so that host01 is not hosting any service and service serv2 executes on host02. As a result, an unequal number of database services executes on the two nodes.
        Simulate loss of connectivity between the two nodes.
        Verify that:
            host02 is retained as it has a higher number of database services executing.
            host01 is evicted although it has a lower node number.
    Demonstration
    Check that only two nodes (host01 and host02) are active and host01 has lower node number:
    [root@host02 ~]# olsnodes -s -n
    host01 1 Active
    host02 2 Active
    host03 3 Inactive
    Create two singleton services for the RAC database admindb:
    Serv1 : Preferred instance admindb1
    Serv2: Preferred instance admindb2
    [oracle@host02 root]$ srvctl add service -s serv1 -d admindb -preferred admindb1
    [oracle@host02 root]$ srvctl add service -s serv2 -d admindb -preferred admindb2
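
    Optionally, the configuration of the new services (preferred and available instances) can be confirmed before starting them. This verification step is not in the original demo; it assumes the same service and database names created above.

    [oracle@host02 root]$ srvctl config service -d admindb -s serv1
    [oracle@host02 root]$ srvctl config service -d admindb -s serv2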
    Case-I: Equal number of database services executing on both the nodes
    We will verify that when an equal number of database services are running on both nodes, the node with the lower node number (host01) survives.

    Verify that admindb is the only database in the cluster having its instances executing on host01 and host02.
    [root@host02 ~]# crsctl stat res -n host01 |grep NAME | grep .db
    NAME=ora.admindb.db

    [root@host02 ~]# crsctl stat res -n host02 |grep NAME | grep .db
    NAME=ora.admindb.db
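
    For a fuller picture of what is running where, the state of all clusterware resources can also be displayed in tabular form. This extra check is not part of the original demo, but the command is standard:

    [root@host02 ~]# crsctl stat res -t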
    Start both the services for database admindb so that serv1 executes on host01 and serv2 executes on host02. As a result, an equal number of database services executes on both nodes.
    [oracle@host02 root]$ srvctl start service -d admindb

    [oracle@host02 root]$ srvctl status service -d admindb
    Service serv1 is running on instance(s) admindb1
    Service serv2 is running on instance(s) admindb2
    Find out name of Private network
    [root@host01 ~]# oifcfg getif

    eth0 192.9.201.0 global public
    eth1 10.0.0.0 global cluster_interconnect
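
    If a node has many interfaces, the interconnect can also be filtered out directly; this option is standard oifcfg syntax, though it was not used in the original demo:

    [root@host01 ~]# oifcfg getif -type cluster_interconnect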
    To simulate loss of connectivity between two nodes, stop the private network service on one of the nodes:
    [root@host01 ~]# ifdown eth1
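
    An alternative way to simulate the same partition without taking the interface down, not used in the original demo, is to drop interconnect traffic with iptables on one node (remember to flush these rules once the test is over):

    [root@host01 ~]# iptables -A INPUT -i eth1 -j DROP
    [root@host01 ~]# iptables -A OUTPUT -o eth1 -j DROP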
    Verify that host01 is retained as it has a lower node number and host02 is evicted:
    –Ocssd log of host01

    [root@host01 ~]# vi /u01/app/grid/diag/crs/host01/crs/trace/ocssd.trc
    2015-12-29 15:20:44.374229 : CSSD:1126267200: clssnmrCheckSplit: Waiting for node weights, stamp(346867999)

    2015-12-29 15:20:44.499676 : CSSD:1079707968: clssnmvDiskPing: Writing with status 0x3, timestamp 1451382644/4294773370

    2015-12-29 15:20:44.502647 : CSSD:1076521280: clssnmvDiskPing: Writing with status 0x3, timestamp 1451382644/4294773370

    2015-12-29 15:20:44.502702 : CSSD:1116805440: clssnmvDiskPing: Writing with status 0x3, timestamp 1451382644/4294773370

    2015-12-29 15:20:44.868385 : CSSD:1121536320: clssnmvDHBValidateNCopy: node 2, host02, has a disk HB, but no network HB, DHB has rcfg 346868000, wrtcnt, 499999, LATS 4294773740, lastSeqNo 499995, uniqueness 1451382179, timestamp 1451382644/4294615110

    2015-12-29 15:20:44.868456 : CSSD:1126267200: clssnmCheckSplit: nodenum 3 curts_ms -193556 readts_ms -193556

    2015-12-29 15:20:44.868462 : CSSD:1126267200: clssnmCheckSplit: Node 3, host03 removed

    2015-12-29 15:20:44.868496 : CSSD:1126267200: clssnmrCheckNodeWeight: node(1) has weight stamp(346867999), pebble(2)

    2015-12-29 15:20:44.868499 : CSSD:1126267200: clssnmrCheckNodeWeight: node(2) has weight stamp(346867999), pebble(0)

    2015-12-29 15:20:44.868501 : CSSD:1126267200: clssnmrCheckNodeWeight: stamp(346867999), completed(2/2)

    2015-12-29 15:20:44.868517 : CSSD:1126267200: clssnmCheckDskInfo: My cohort: 1

    2015-12-29 15:20:44.868520 : CSSD:1126267200: clssnmRemove: Start

    2015-12-29 15:20:44.868525 : CSSD:1126267200: (:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, host02, from the cluster in incarnation 346868000, node birth incarnation 346867999, death incarnation 346868000, stateflags 0x224000 uniqueness value 1451382179
    [root@host01 ~]# olsnodes -s -n
    host01 1 Active
    host02 2 Inactive
    Hence, we observed that when an equal number of database services was running on both nodes, the node with the lower node number (host01) survived.

    Case-II: Unequal numbers of database services executing on both the nodes
    We will verify that when an unequal number of database services are running on the two nodes, the node hosting the higher number of database services survives even if it has a higher node number.

    Stop the service serv1 so that host01 is not hosting any service and service serv2 executes on host02. As a result, an unequal number of database services executes on the two nodes.
    [oracle@host02 root]$ srvctl stop service -s serv1 -d admindb

    [oracle@host02 root]$ srvctl status service -d admindb
    Service serv1 is not running.
    Service serv2 is running on instance(s) admindb2
    To simulate loss of connectivity between two nodes, stop private network service on one of the nodes:
    [root@host01 ~]# ifdown eth1
    Verify that host02 is retained as it has higher number of database services executing and host01 is evicted although it has a lower node number:
    [root@host02 ~]# olsnodes -s -n
    host01 1 Inactive
    host02 2 Active
    OCSSD Log of host02:
    [root@host02 ~]# vi /u01/app/grid/diag/crs/host02/crs/trace/ocssd.trc
    2015-11-30 15:39:39.666779 : CSSD:1124809024: clssnmrCheckSplit: Waiting for node weights, stamp(344360122)

    2015-11-30 15:39:40.058235 : CSSD:1124809024: clssnmrCheckNodeWeight: node(1) has weight stamp(0), pebble(0)

    2015-11-30 15:39:40.058243 : CSSD:1124809024: clssnmrCheckNodeWeight: node(2) has weight stamp(344360122), pebble(1)

    2015-11-30 15:39:40.058245 : CSSD:1124809024: clssnmrCheckNodeWeight: stamp(344360122), completed(1/2)

    2015-11-30 15:39:40.058247 : CSSD:1124809024: clssnmrCheckSplit: Waiting for node weights, stamp(344360122)

    2015-11-30 15:39:40.077691 : CSSD:1090804032: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK03 flags 0x00000000, kill block unique 0, my unique 1448874242

    2015-11-30 15:39:40.077791 : CSSD:1116924224: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK01 flags 0x00000000, kill block unique 0, my unique 1448874242

    2015-11-30 15:39:40.077923 : CSSD:1095665984: clssnmvDiskKillCheck: not evicted, file ORCL:ASMDISK02 flags 0x00000000, kill block unique 0, my unique 1448874242

    2015-11-30 15:39:40.092015 : CSSD:1083697472: clssnmvDiskPing: Writing with status 0x3, timestamp 1448878180/3431154

    2015-11-30 15:39:40.113021 : CSSD:1098819904: clssnmvDiskPing: Writing with status 0x3, timestamp 1448878180/3431174

    2015-11-30 15:39:40.114578 : CSSD:1088051520: clssnmvDiskPing: Writing with status 0x3, timestamp 1448878180/3431174

    2015-11-30 15:39:40.117006 : CSSD:1118501184: clssnmvDHBValidateNCopy: node 1, host01, has a disk HB, but no network HB, DHB has rcfg 344360123, wrtcnt, 743780, LATS 3431184, lastSeqNo 743777, uniqueness 1448877474, timestamp 1448878179/3142874

    2015-11-30 15:39:40.117016 : CSSD:1118501184: clssnmvReadDskHeartbeat: manual shutdown of nodename host01, nodenum 1 epoch 1448878179 msec 3142874

    2015-11-30 15:39:40.117357 : CSSD:1124809024: clssnmrCheckNodeWeight: node(2) has weight stamp(344360122), pebble(1)

    2015-11-30 15:39:40.117361 : CSSD:1124809024: clssnmrCheckNodeWeight: stamp(344360122), completed(1/1)

    2015-11-30 15:39:40.117376 : CSSD:1124809024: clssnmCheckDskInfo: My cohort: 2

    2015-11-30 15:39:40.117379 : CSSD:1124809024: clssnmRemove: Start

    2015-11-30 15:39:40.117383 : CSSD:1124809024: (:CSSNM00007:)clssnmrRemoveNode: Evicting node 1, host01, from the cluster in incarnation 344360123, node birth incarnation 344360122, death incarnation 344360123, stateflags 0x225000 uniqueness value 1448877474
    Thus, we observed that when an unequal number of database services was running on the two nodes, the node with the higher number of database services survived even though it has the higher node number.
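
    To restore the test cluster after each case (a cleanup step not shown above), bring the interconnect back up on the node where it was downed and, if the stack on the evicted node does not rejoin on its own, start it manually. The commands below assume the same host and interface names used in this demo:

    [root@host01 ~]# ifup eth1
    [root@host01 ~]# crsctl start crs
    [root@host02 ~]# crsctl check cluster -all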

    Summary:
    Starting from 12.1.0.2, during split brain resolution, the new algorithm followed to decide the nodes to be evicted/retained is as follows:

    If the sub-clusters are of different sizes, the functionality is the same as before, i.e. the clusterware identifies the largest sub-cluster and aborts all the nodes which do not belong to that sub-cluster.
    If all the sub-clusters are of the same size, the functionality has been modified as follows:
        If the sub-clusters have equal node weights, the sub-cluster with the lowest-numbered node in it survives, so that in a 2-node cluster the node with the lowest node number will survive.
        If the sub-clusters have unequal node weights, the sub-cluster having the higher weight survives, so that in a 2-node cluster the node with the lowest node number might be evicted if it has a lower weight.
