Environment: AIX 6.1 + Oracle 11.2.0.4 RAC (2 nodes)
1. Symptoms
Checking the status of the cluster resources with crsctl fails immediately on either node with CRS-4535 and CRS-4000, yet the database itself remains fully accessible.
The symptoms in detail:

# Query from node 1
grid@bjdb1:/home/grid>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

# Query from node 2
root@bjdb2:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

Likewise, crs_stat -t fails as well, this time with CRS-0184:
root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

Node 2 behaves exactly the same.
Meanwhile, the database is confirmed to be fully accessible:

# From node 2, simulate a client connection to the RAC cluster through the SCAN IP -- the database can be reached normally
oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb

SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL>

The relevant SCAN entry in /etc/hosts on the RAC nodes:
#scan
192.168.103.31    scan-ip

2. Diagnosis
Start with the clusterware logs on node 1.
Clusterware (GI) logs are stored under $GRID_HOME/log/<nodename>. The key GI background daemons (css, crs, evm) write their logs to the cssd, crsd and evmd subdirectories respectively; a quick way to pull these logs up is sketched below.
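A minimal sketch for locating and tailing the relevant logs, assuming the $GRID_HOME (/opt/u01/app/11.2.0/grid) and node name (bjdb1) used in this environment -- adjust both for your own installation:

GRID_HOME=/opt/u01/app/11.2.0/grid            # GI home in this environment
NODE=bjdb1                                    # local node name
tail -50 $GRID_HOME/log/$NODE/alert$NODE.log  # GI alert log, first place to look
tail -50 $GRID_HOME/log/$NODE/crsd/crsd.log   # CRSD (Cluster Ready Services)
tail -50 $GRID_HOME/log/$NODE/cssd/ocssd.log  # CSSD (Cluster Synchronization Services)
tail -50 $GRID_HOME/log/$NODE/evmd/evmd.log   # EVMD (Event Manager)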
Relevant logs on node 1:

# Check the GI alert log. The most recent entries only warn that the storage used by GI is getting full; there is still some space left, so this is clearly not the cause of the failure -- it just needs a cleanup later.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:28:36.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:33:41.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:38:46.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".

# Since crsctl is unusable, look at the crsd log next. It shows that errors began on October 3: the raw device could not be opened, so OCR could not be initialized. The follow-on messages show that access to shared storage was failing at that moment. This points to a storage problem at that time; the on-site staff need to confirm whether any storage work was taking place then.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [CRSOCR][1] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [CRSD][1] Created alert : (:CRSD00111:) :Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [CRSD][1] Done.

Relevant logs on node 2:
# The GI alert log on node 2 shows CRS-2409 messages from ctssd. Per MOS note 1135337.1, "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode." -- it only requires checking that the clocks of the two nodes agree, and in fact they do:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 12:59:38.799:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 13:34:11.402:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:12:44.168:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:44:04.824:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

# The crsd log on node 2 shows that, at almost the same time as node 1, access to shared storage failed and OCR could not be initialized:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
2016-10-03 18:04:31.077: [OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:31.077: [OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:31.081: [OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:31.081: [CRSOCR][1] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [CRSD][1] Created alert : (:CRSD00111:) :Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:31.082: [CRSD][1] Done.

Now log in as the grid user and confirm the state of the ASM disk groups.
Connect with sqlplus / as sysasm and query v$asm_diskgroup directly:
select name, state, total_mb, free_mb from v$asm_diskgroup;

The OCR_VOTE1 disk group turns out to be dismounted on both ASM instances:
SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
+ASM2

SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;

NAME                           STATE         TOTAL_MB    FREE_MB
------------------------------ ----------- ---------- ----------
DATA                           MOUNTED         737280      88152
FRA_ARCHIVE                    MOUNTED          10240       9287
OCR_VOTE1                      DISMOUNTED           0          0

The other node shows the same.
On the other node, the OCR disk group that needs to be mounted is likewise dismounted:

SQL> select name, state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED
FRA_ARCHIVE                    MOUNTED
OCR_VOTE1                      DISMOUNTED
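As a cross-check (not part of the original session), asmcmd normally lists only mounted disk groups, so OCR_VOTE1 being absent from its output points to the same conclusion. A sketch, assuming the grid user's profile sets the ASM environment (ORACLE_HOME and ORACLE_SID):

# hedged cross-check: a dismounted OCR_VOTE1 would typically not appear in this listing
su - grid -c "asmcmd lsdg"    # expect only DATA and FRA_ARCHIVE to be shown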
Next, confirm the key GI background processes:

# The crsd process is not running -- the query returns nothing
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep

The same check on node 2 also shows that crsd is not running.
To summarize the situation: the failure happened around 18:04 on October 3, when both nodes lost access to shared storage and OCR initialization failed as a result. The crsd process has not been running since then.
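Before making any changes, it can help to confirm exactly which layers of the stack are up. The following standard crsctl checks are not from the original session, but in this state they should show OHAS and CSS running while CRS is down:

crsctl check crs           # checks OHAS, CSS, CRS and EVM in one pass
crsctl check css           # CSS should report online, since the ASM instances are up
crsctl stat res -t -init   # per-daemon resources: expect ora.crsd offline, ora.cssd and ora.asm online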
3. Resolution
3.1 Mount the OCR disk group manually

SQL> alter diskgroup ocr_vote1 mount;

Diskgroup altered.

SQL> select name, state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED
FRA_ARCHIVE                    MOUNTED
OCR_VOTE1                      MOUNTED
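With the disk group mounted again, it would be reasonable to sanity-check the OCR and voting files before restarting the stack. This is a suggested verification step (not in the original session), run as root with the standard GI utilities:

ocrcheck                    # OCR integrity; should succeed now that OCR_VOTE1 is mounted
crsctl query css votedisk   # voting files should be listed as ONLINE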
3.2 Start CRS on node 1

At this point the crsd process is still not running:

# confirm that crsd.bin is not running
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep

Trying to start CRS normally on node 1 fails:
root@bjdb1:/>crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Trying to stop CRS normally on node 1 also fails:
root@bjdb1:/>crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

So what is the next step?
In the end, the choice was to force-stop CRS on node 1 and then start it again, which succeeded.
# Force-stop CRS on node 1
root@bjdb1:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on "bjdb1"
CRS-2673: Attempting to stop "ora.mdnsd" on "bjdb1"
CRS-2673: Attempting to stop "ora.crf" on "bjdb1"
CRS-2673: Attempting to stop "ora.ctssd" on "bjdb1"
CRS-2673: Attempting to stop "ora.evmd" on "bjdb1"
CRS-2673: Attempting to stop "ora.asm" on "bjdb1"
CRS-2673: Attempting to stop "ora.drivers.acfs" on "bjdb1"
CRS-2677: Stop of "ora.evmd" on "bjdb1" succeeded
CRS-2677: Stop of "ora.mdnsd" on "bjdb1" succeeded
CRS-2677: Stop of "ora.ctssd" on "bjdb1" succeeded
CRS-5017: The resource action "ora.crf stop" encountered the following error:
action for daemon aborted. For details refer to "(:CLSN00108:)" in "/opt/u01/app/11.2.0/grid/log/bjdb1/agent/ohasd/orarootagent_root/orarootagent_root.log".
CRS-2675: Stop of "ora.crf" on "bjdb1" failed
CRS-2679: Attempting to clean "ora.crf" on "bjdb1"
CRS-2681: Clean of "ora.crf" on "bjdb1" succeeded
CRS-2675: Stop of "ora.asm" on "bjdb1" failed
CRS-2679: Attempting to clean "ora.asm" on "bjdb1"
CRS-2681: Clean of "ora.asm" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.cluster_interconnect.haip" on "bjdb1"
CRS-2677: Stop of "ora.cluster_interconnect.haip" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.cssd" on "bjdb1"
CRS-2677: Stop of "ora.cssd" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.gipcd" on "bjdb1"
CRS-2677: Stop of "ora.gipcd" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.gpnpd" on "bjdb1"
CRS-2677: Stop of "ora.gpnpd" on "bjdb1" succeeded
CRS-2677: Stop of "ora.drivers.acfs" on "bjdb1" succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on "bjdb1" has completed
CRS-4133: Oracle High Availability Services has been stopped.
root@bjdb1:/>

# Check resource status with crsctl -- as expected, nothing is available now
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

# crsd.bin, cssd.bin and evmd.bin processes are all gone
root@bjdb1:/>ps -ef|grep crsd.bin
    root  8126466 25428158   0 15:52:50  pts/0  0:00 grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
    root  8126470 25428158   0 15:53:01  pts/0  0:00 grep cssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
    root 35520600 25428158   0 15:53:13  pts/0  0:00 grep evmd.bin

# No pmon processes either
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root@bjdb1:/>

# Try starting CRS again -- success!
root@bjdb1:/>crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

# Querying resources still fails, so the stack is not fully up yet
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

# Wait a while and watch the key GI background processes come up
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
root@bjdb1:/>ps -ef|grep cssd.bin|grep -v grep
    grid 10747994 26542192   0 15:55:03      -  0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep pmon
    root 39387390 25428158   0 15:57:23  pts/0  0:00 grep pmon
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
    grid 39911466        1   0 15:58:47      -  0:00 asm_pmon_+ASM2
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
    root 37814470 25428158   0 15:59:27  pts/0  0:00 grep pmon
    grid 39911466        1   0 15:58:47      -  0:00 asm_pmon_+ASM2
root@bjdb1:/>
root@bjdb1:/>ps -ef|grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
    grid 10747994 26542192   0 15:55:03      -  0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
    grid 40173760        1   0 15:57:10      -  0:00 /opt/u01/app/11.2.0/grid/bin/evmd.bin
root@bjdb1:/>ps -ef|grep crsd.bin
    root 37683238        1   0 15:59:54      -  0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
root@bjdb1:/>

# Once the core daemons are up, crsctl works again and the resources can be seen starting
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
ora.net1.network
               ONLINE  ONLINE       bjdb1
ora.ons
               ONLINE  ONLINE       bjdb1
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE
ora.bjdb.db
      1        ONLINE  OFFLINE
      2        ONLINE  OFFLINE
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  OFFLINE                               STARTING
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  OFFLINE                               STARTING

After waiting a while longer and querying again, all resources on node 1 are back to normal.
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
ora.net1.network
               ONLINE  ONLINE       bjdb1
ora.ons
               ONLINE  ONLINE       bjdb1
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE
ora.bjdb.db
      1        ONLINE  ONLINE       bjdb1                    Open
      2        ONLINE  OFFLINE
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  OFFLINE
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  OFFLINE

3.3 Deal with the high space usage of the GI storage location on node 1

Keep watching the node 1 alert log:
grid@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 16:03:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 16:08:30.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 16:09:50.796:
[ctssd(5046446)]CRS-2407:The new Cluster Time Synchronization Service reference node is host bjdb1.
2016-10-10 16:10:20.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 16:15:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".

This message had been seen earlier as well: the /opt/u01 filesystem simply needs to be cleaned up. Once the removable historical logs are identified and deleted, the warning stops appearing.
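A possible cleanup sketch; the retention window, file patterns, and the idea of resetting the CHM repository are assumptions to be verified against MOS guidance before deleting anything:

GRID_HOME=/opt/u01/app/11.2.0/grid
df -g /opt/u01                              # AIX: check filesystem usage first
du -sm $GRID_HOME/log $GRID_HOME/crf/db     # identify the biggest consumers (sizes in MB)
# rotated GI trace logs (e.g. crsd.l01 ... crsd.l10) older than 30 days are usually
# safe to remove -- review the list before deleting anything
find $GRID_HOME/log -name "*.l[0-9][0-9]" -mtime +30 -ls
# find $GRID_HOME/log -name "*.l[0-9][0-9]" -mtime +30 -exec rm {} \;
# if the CHM repository (*.bdb files under $GRID_HOME/crf/db/<node>) is the main consumer,
# a commonly cited workaround is to stop ora.crf, remove the .bdb files, then start it again:
# crsctl stop res ora.crf -init
# crsctl start res ora.crf -init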
3.4 Mount OCR manually on node 2 and restart CRS

Node 1 is now resolved; on node 2, mount the OCR disk group in the same way and then restart CRS.
The steps are identical, only executed on node 2, so they are not described again in detail.
# Force-stop CRS on node 2
root@bjdb2:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on "bjdb2"
CRS-2673: Attempting to stop "ora.mdnsd" on "bjdb2"
CRS-2673: Attempting to stop "ora.crf" on "bjdb2"
CRS-2673: Attempting to stop "ora.ctssd" on "bjdb2"
CRS-2673: Attempting to stop "ora.evmd" on "bjdb2"
CRS-2673: Attempting to stop "ora.asm" on "bjdb2"
CRS-2673: Attempting to stop "ora.drivers.acfs" on "bjdb2"
CRS-2677: Stop of "ora.crf" on "bjdb2" succeeded
CRS-2677: Stop of "ora.mdnsd" on "bjdb2" succeeded
CRS-2677: Stop of "ora.evmd" on "bjdb2" succeeded
CRS-2677: Stop of "ora.ctssd" on "bjdb2" succeeded
CRS-2675: Stop of "ora.asm" on "bjdb2" failed
CRS-2679: Attempting to clean "ora.asm" on "bjdb2"
CRS-2681: Clean of "ora.asm" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.cluster_interconnect.haip" on "bjdb2"
CRS-2677: Stop of "ora.cluster_interconnect.haip" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.cssd" on "bjdb2"
CRS-2677: Stop of "ora.cssd" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.gipcd" on "bjdb2"
CRS-2677: Stop of "ora.gipcd" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.gpnpd" on "bjdb2"
CRS-2677: Stop of "ora.gpnpd" on "bjdb2" succeeded
CRS-2677: Stop of "ora.drivers.acfs" on "bjdb2" succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on "bjdb2" has completed
CRS-4133: Oracle High Availability Services has been stopped.

Then start it again:
crsctl start crs

After waiting a while, check again:
# The crsd process is now running
root@bjdb2:/>ps -ef|grep crsd.bin|grep -v grep
    root 22610148        1   0 16:24:15      -  0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot

# Finally, crsctl shows that all resources are back to normal
root@bjdb2:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
               ONLINE  ONLINE       bjdb2                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
               OFFLINE OFFLINE      bjdb2
ora.net1.network
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.ons
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       bjdb2
ora.bjdb.db
      1        ONLINE  ONLINE       bjdb1                    Open
      2        ONLINE  ONLINE       bjdb2                    Open
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  ONLINE       bjdb2
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  ONLINE       bjdb2

# Check the listeners running on node 2 -- the SCAN listener was on node 2 throughout the failure
root@bjdb2:/>ps -ef|grep tns
    grid  5308430        1   0   Aug 17      -  5:05 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
    grid  5505240        1   1   Aug 17      - 27:23 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root@bjdb2:/>

This completes the handling of the CRS-4535 / CRS-4000 incident on this RAC cluster. It is worth noting that, from the moment the fault was spotted during a routine inspection through the entire troubleshooting process, the RAC database kept serving clients -- which says something about the resilience of RAC.