Applies to:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Information in this document applies to any platform.
Purpose
The purpose of this document is to summarize the top 5 issues that may prevent Grid Infrastructure (GI) from starting successfully.
Scope
This document applies only to 11gR2 Grid Infrastructure. To determine the status of GI, run the following commands:
1. $GRID_HOME/bin/crsctl check crs
2. $GRID_HOME/bin/crsctl stat res -t -init
3. $GRID_HOME/bin/crsctl stat res -t
4. ps -ef | egrep 'init|d.bin'
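As a rough aid for reading the first command's output, the sketch below maps the error codes discussed in this note to the issue they most likely point to. The function name classify_crs_check and its output strings are hypothetical, not part of any Oracle tool. It only inspects text on stdin, and it tests CRS-4530 before CRS-4535 because the Issue #2 output contains both codes.

```shell
#!/bin/sh
# Hypothetical helper (not an Oracle utility): reads "crsctl check crs"
# output on stdin and prints the issue in this note that it most likely
# matches. Order matters: output for Issue #2 also contains CRS-4535.
classify_crs_check() {
    crs_out=$(cat)
    case "$crs_out" in
        *CRS-4639*) echo "issue 1: ohasd not reachable" ;;
        *CRS-4530*) echo "issue 2: ocssd not reachable" ;;
        *CRS-4535*) echo "issue 3: crsd not reachable" ;;
        *)          echo "no matching error code" ;;
    esac
}
```

On a cluster node this could be fed with `$GRID_HOME/bin/crsctl check crs 2>&1 | classify_crs_check`. Treat the result only as a pointer to the relevant section, not as a diagnosis.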
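Details
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin is not running, or ohasd.bin is running but without init.ohasd or other processes
Symptoms:
1. Command "$GRID_HOME/bin/crsctl check crs" returns the error: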
CRS-4639: Could not contact Oracle High Availability Services
2. Command "ps -ef | grep init" does not show a line similar to:
root 4878 1 0 Sep12 ? 00:00:02 /bin/sh /etc/init.d/init.ohasd run
3. Command "ps -ef | grep d.bin" does not show a line similar to:
root 21350 1 6 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
Or it shows only the "ohasd.bin reboot" process and no other processes.
4. ohasd.log reports:
2013-11-04 09:09:15.541: [ default][2609911536] Created alert : (:OHAS00117:) : TIMED OUT WAITING FOR OHASD MONITOR
5. ohasOUT.log reports:
2013-11-04 08:59:14
Changing directory to /u01/app/11.2.0/grid/log/lc1n1/ohasd
OHASD starting
Timed out waiting for init.ohasd script to start; posting an alert
6. ohasd.bin stays in the starting state, and ohasd.log reports:
2014-08-31 15:00:25.132: [ CRSSEC][733177600]{0:0:2} Exception: PrimaryGroupEntry constructor failed to validate group name with error: 0 groupId: 0x7f8df8022450 acl_string: pgrp:spec:r-x
2014-08-31 15:00:25.132: [ CRSSEC][733177600]{0:0:2} Exception: ACL entry creation failed for: pgrp:spec:r-x
2014-08-31 15:00:25.132: [ INIT][733177600]{0:0:2} Dump State Starting ...
7. Only ohasd.bin is running, but nothing is written to ohasd.log. The OS log /var/log/messages shows:
2015-07-12 racnode1 logger: autorun file for ohasd is missing
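Symptoms 2 and 3 can be checked in one pass. The following is a minimal sketch that assumes the process names appear in "ps -ef" output exactly as in the sample lines above. The function name check_ohasd_procs and its output format are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "ps -ef" output on stdin and reports whether
# the init.ohasd script and the ohasd.bin daemon appear in it.
check_ohasd_procs() {
    ps_out=$(cat)
    if printf '%s\n' "$ps_out" | grep -q 'init\.ohasd run'; then
        echo "init.ohasd: present"
    else
        echo "init.ohasd: missing"
    fi
    if printf '%s\n' "$ps_out" | grep -q 'ohasd\.bin'; then
        echo "ohasd.bin: present"
    else
        echo "ohasd.bin: missing"
    fi
}
```

A typical call would be `ps -ef | check_ohasd_procs`. A "missing" line corresponds to symptom 2 or 3 above and does not by itself identify the cause.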
Possible Causes:
Solutions:
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
# crsctl enable crs
# crsctl start crs
# crsctl stop crs -f
# touch $GRID_HOME/cdata/<node>.olr
# chown root:oinstall $GRID_HOME/cdata/<node>.olr
# ocrconfig -local -restore $GRID_HOME/cdata/<node>/backup_<date>_<num>.olr
# crsctl start crs
If for some reason no OLR backup exists, rebuilding the OLR requires running deconfig and then rerunning root.sh as the root user:
# $GRID_HOME/crs/install/rootcrs.pl -deconfig -force
# $GRID_HOME/root.sh
6. The OLR needs to be reinitialized or recreated, using the same commands shown above for creating the OLR.
7. Restart the init.ohasd process, or add "sleep 30" to init.ohasd so that the hostname information is output before the clusterware starts; refer to Note 1427234.1.
8. If the steps above do not resolve the problem, check the OS messages for log entries about ohasd.bin and, following the hints in those messages, set LD_LIBRARY_PATH = <GRID_HOME>/lib and run the crswrapexece.pl command manually.
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running
Symptoms:
1. Command "$GRID_HOME/bin/crsctl check crs" returns the error:
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
2. Command "ps -ef | grep d.bin" does not show a line similar to:
oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin
3. ocssd.bin is running but aborts after the message "CLSGPNP_CALL_AGAIN" appears in ocssd.log
4. ocssd.log shows:
2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209, lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065
5. With three or more nodes, two nodes form a cluster without problems, but the failure occurs when the third node joins; ocssd.log shows:
2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than cohort of 2 nodes led by node 1, racnode1, based on map type 2
2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
6. ocssd.bin startup times out after 10 minutes:
2012-04-08 12:04:33.153: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
......
2012-04-08 12:14:31.994: [ CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
7. alert<node>.log shows:
2014-02-05 06:16:56.815
[cssd(3361)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bdprod2/cssd/ocssd.log
...
2014-02-05 06:27:01.707
[ohasd(2252)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'bdprod2'.
2014-02-05 06:27:02.075
[ohasd(2252)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.
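The alert log pattern in symptom 7 can be summarized with a short filter. The sketch below counts the CRS-1714 retries and flags whether CRS-2771 was raised for ora.cssd. The function name scan_cssd_alerts and its output wording are hypothetical, and it assumes the message codes appear in the log exactly as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: reads alert<node>.log content on stdin, counts
# CRS-1714 (voting file discovery) messages, and reports whether ohasd
# gave up restarting ora.cssd (CRS-2771).
scan_cssd_alerts() {
    awk '
        /CRS-1714:/ { retries++ }
        /CRS-2771:/ && /ora\.cssd/ { gave_up = 1 }
        END {
            printf "voting file discovery failures: %d\n", retries
            if (gave_up) print "ora.cssd: restart attempts exhausted"
        }
    '
}
```

It could be run as `scan_cssd_alerts < alert<node>.log`, with the real alert log path substituted.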
Possible Causes:
Solutions:
# crsctl start crs -excl
# crsctl replace votedisk <+OCRVOTE diskgroup>
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Symptoms:
1. Command "$GRID_HOME/bin/crsctl check crs" returns the error:
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
2. Command "ps -ef | grep d.bin" does not show a line similar to:
root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot
3. Even if the crsd.bin process exists, command "crsctl stat res -t -init" still shows:
ora.crsd 1 ONLINE INTERMEDIATE
Possible Causes:
Solutions:
# ocrconfig -repair -add +OCR2 (adds an entry)
# ocrconfig -repair -delete +OCR2 (removes an entry)
# crsctl start res ora.crsd -init
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin is not running
Symptoms:
1. orarootagent is not running. ohasd.log shows:
2012-12-21 02:14:05.071: [ AGFW][24] {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /grid/11.2.0/grid_2/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/grid/11.2.0/grid_2/bin/orarootagent]
2. mdnsd.bin, gpnpd.bin or gipcd.bin is not running. The following is an example from the mdnsd log:
2012-12-31 21:37:27.601: [ clsdmt][1088776512]Creating PID [4526] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:37:27.602: [ clsdmt][1088776512]Error3 -2 writing PID [4526] to the file []
2012-12-31 21:37:27.602: [ clsdmt][1088776512]Failed to record pid for MDNSD
or
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Creating PID [4645] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Writing PID [4645] to the file [/u01/app/11.2.0/grid/mdns/init/lc1n1.pid]
2012-12-31 21:39:52.656: [ clsdmt][1099217216]Failed to record pid for MDNSD
3. oraagent or appagent is not running. crsd.log shows:
2012-12-01 00:06:24.462: [ AGFW][1164069184] {0:2:27} Created alert : (:CRSAGF00130:) : Failed to start the agent /u01/app/grid/11.2.0/bin/appagent_oracle
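Possible Causes:
Solutions:
# cd <GRID_HOME>/crs/install
# ./rootcrs.pl -unlock
# ./rootcrs.pl -patch
This stops the clusterware, sets the owner and permissions of the required files to the root user, and restarts the clusterware.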
./ologgerd/init/<node>.pid
./osysmond/init/<node>.pid
./ctss/init/<node>.pid
./ohasd/init/<node>.pid
./crs/init/<node>.pid
Owned by <grid>:oinstall with permission 644:
./mdns/init/<node>.pid
./evm/init/<node>.pid
./gipc/init/<node>.pid
./gpnp/init/<node>.pid
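The permission part of this check can be scripted. The sketch below verifies mode 644 on the four pid files that should belong to <grid>:oinstall. It assumes GNU coreutils stat (Linux); other platforms use different stat options. The function names are hypothetical, and ownership is deliberately not checked because the grid user name differs per installation.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if $1 is a regular file with mode 644.
# Assumes GNU stat ("stat -c"), as found on Linux.
has_mode_644() {
    [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "644" ]
}

# Hypothetical wrapper: $1 is the Grid home, $2 is the node name.
# Prints one line per pid file in the <grid>:oinstall group listed above.
check_grid_pid_files() {
    for d in mdns evm gipc gpnp; do
        f="$1/$d/init/$2.pid"
        if has_mode_644 "$f"; then
            echo "ok   $f"
        else
            echo "BAD  $f"
        fi
    done
}
```

For example, `check_grid_pid_files /u01/app/11.2.0/grid lc1n1` would inspect the mdns pid file path seen in the log excerpt above. A "BAD" line means the file is missing or has a different mode.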
Issue #5: ASM instance does not start, ora.asm is not online
Symptoms:
1. Command "ps -ef | grep asm" shows no ASM processes
2. Command "crsctl stat res -t -init" shows:
ora.asm 1 ONLINE OFFLINE
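To spot resources in this condition, the "crsctl stat res -t -init" output can be filtered for any state other than ONLINE. The sketch below rests on an assumption about the layout: that each resource name starts with "ora." and that the instance number, TARGET and STATE follow either on the same line (as shown above) or on the next line. The function name list_not_online is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "crsctl stat res -t -init" output on stdin and
# prints "<resource> <state>" for every resource whose STATE is not ONLINE.
# Assumed layout: a line starting with "ora." names the resource; the
# instance number, TARGET and STATE follow on that line or the next one.
list_not_online() {
    awk '
        /^ora\./ {
            name = $1
            # Flattened form: name, instance, TARGET, STATE on one line
            if (NF >= 4 && $4 != "ONLINE") print name, $4
            next
        }
        # Two-line form: instance, TARGET, STATE on the following line
        name != "" && $1 ~ /^[0-9]+$/ && NF >= 3 {
            if ($3 != "ONLINE") print name, $3
        }
    '
}
```

Given the line shown above, this prints "ora.asm OFFLINE". The same filter would also surface the "ora.crsd ... INTERMEDIATE" state described under Issue #3.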
Possible Causes:
Solutions: