Environment: AIX 6.1 + Oracle 11.2.0.4 RAC (2 nodes)
1. Symptoms
Checking the status of the cluster resources with crsctl fails immediately on either node with CRS-4535 and CRS-4000, yet the database itself remains fully accessible.
The symptoms in detail:

# Query from node 1
grid@bjdb1:/home/grid>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

# Query from node 2
root@bjdb2:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

Likewise, crs_stat -t fails as well, this time with CRS-0184:
root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

Node 2 behaves exactly the same.
Meanwhile, the database is confirmed to be fully accessible:

# From node 2, simulate a client connection to the RAC cluster through the SCAN IP -- the database can be reached normally
oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb

SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL>

The relevant SCAN entry in /etc/hosts on the RAC nodes:
#scan
192.168.103.31    scan-ip

2. Diagnosis
Start with the clusterware logs on node 1.
Clusterware (GI) logs are stored under $GRID_HOME/log/<nodename>. The key GI background daemons (css, crs, evm) write their logs to the cssd, crsd and evmd subdirectories respectively; a quick way to pull these logs up is sketched below.
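A minimal sketch for locating and tailing the relevant logs, assuming the $GRID_HOME (/opt/u01/app/11.2.0/grid) and node name (bjdb1) used in this environment -- adjust both for your own installation:

GRID_HOME=/opt/u01/app/11.2.0/grid            # GI home in this environment
NODE=bjdb1                                    # local node name
tail -50 $GRID_HOME/log/$NODE/alert$NODE.log  # GI alert log, first place to look
tail -50 $GRID_HOME/log/$NODE/crsd/crsd.log   # CRSD (Cluster Ready Services)
tail -50 $GRID_HOME/log/$NODE/cssd/ocssd.log  # CSSD (Cluster Synchronization Services)
tail -50 $GRID_HOME/log/$NODE/evmd/evmd.log   # EVMD (Event Manager)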
Relevant logs on node 1:

# Check the GI alert log. The most recent entries only warn that the storage used by GI is getting full; there is still some space left, so this is clearly not the cause of the failure -- it just needs a cleanup later.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:28:36.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:33:41.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 14:38:46.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".

# Since crsctl is unusable, look at the crsd log next. It shows that errors began on October 3: the raw device could not be opened, so OCR could not be initialized. The follow-on messages show that access to shared storage was failing at that moment. This points to a storage problem at that time; the on-site staff need to confirm whether any storage work was taking place then.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [CRSOCR][1] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [CRSD][1] Created alert : (:CRSD00111:) :Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [CRSD][1] Done.

Relevant logs on node 2:
# The GI alert log on node 2 shows CRS-2409 messages from ctssd. Per MOS note 1135337.1, "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode." -- it only requires checking that the clocks of the two nodes agree, and in fact they do:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 12:59:38.799:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 13:34:11.402:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:12:44.168:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:44:04.824:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

# The crsd log on node 2 shows that, at almost the same time as node 1, access to shared storage failed and OCR could not be initialized:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
2016-10-03 18:04:31.077: [OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:31.077: [OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:31.081: [OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:31.081: [CRSOCR][1] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [CRSD][1] Created alert : (:CRSD00111:) :Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:31.082: [CRSD][1] Done.

Now log in as the grid user and confirm the state of the ASM disk groups.
Connect with sqlplus / as sysasm and query v$asm_diskgroup directly:
select name, state, total_mb, free_mb from v$asm_diskgroup;

The OCR_VOTE1 disk group turns out to be dismounted on both ASM instances:
SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
+ASM2

SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;

NAME                           STATE         TOTAL_MB    FREE_MB
------------------------------ ----------- ---------- ----------
DATA                           MOUNTED         737280      88152
FRA_ARCHIVE                    MOUNTED          10240       9287
OCR_VOTE1                      DISMOUNTED           0          0

The other node shows the same.
On the other node, the OCR disk group that needs to be mounted is likewise dismounted:

SQL> select name, state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED
FRA_ARCHIVE                    MOUNTED
OCR_VOTE1                      DISMOUNTED
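As a cross-check (not part of the original session), asmcmd normally lists only mounted disk groups, so OCR_VOTE1 being absent from its output points to the same conclusion. A sketch, assuming the grid user's profile sets the ASM environment (ORACLE_HOME and ORACLE_SID):

# hedged cross-check: a dismounted OCR_VOTE1 would typically not appear in this listing
su - grid -c "asmcmd lsdg"    # expect only DATA and FRA_ARCHIVE to be shown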
Next, confirm the key GI background processes:

# The crsd process is not running -- the query returns nothing
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep

The same check on node 2 also shows that crsd is not running.
To summarize the situation: the failure happened around 18:04 on October 3, when both nodes lost access to shared storage and OCR initialization failed as a result. The crsd process has not been running since then.
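Before making any changes, it can help to confirm exactly which layers of the stack are up. The following standard crsctl checks are not from the original session, but in this state they should show OHAS and CSS running while CRS is down:

crsctl check crs           # checks OHAS, CSS, CRS and EVM in one pass
crsctl check css           # CSS should report online, since the ASM instances are up
crsctl stat res -t -init   # per-daemon resources: expect ora.crsd offline, ora.cssd and ora.asm online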
3. Resolution
3.1 Mount the OCR disk group manually

SQL> alter diskgroup ocr_vote1 mount;

Diskgroup altered.

SQL> select name, state from v$asm_diskgroup;

NAME                           STATE
------------------------------ -----------
DATA                           MOUNTED
FRA_ARCHIVE                    MOUNTED
OCR_VOTE1                      MOUNTED
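With the disk group mounted again, it would be reasonable to sanity-check the OCR and voting files before restarting the stack. This is a suggested verification step (not in the original session), run as root with the standard GI utilities:

ocrcheck                    # OCR integrity; should succeed now that OCR_VOTE1 is mounted
crsctl query css votedisk   # voting files should be listed as ONLINE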
3.2 Start CRS on node 1

At this point the crsd process is still not running:

# confirm that crsd.bin is not running
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep

Trying to start CRS normally on node 1 fails:
root@bjdb1:/>crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Trying to stop CRS normally on node 1 also fails:
root@bjdb1:/>crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

So what is the next step?
In the end, the choice was to force-stop CRS on node 1 and then start it again, which succeeded.
# Force-stop CRS on node 1
root@bjdb1:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on "bjdb1"
CRS-2673: Attempting to stop "ora.mdnsd" on "bjdb1"
CRS-2673: Attempting to stop "ora.crf" on "bjdb1"
CRS-2673: Attempting to stop "ora.ctssd" on "bjdb1"
CRS-2673: Attempting to stop "ora.evmd" on "bjdb1"
CRS-2673: Attempting to stop "ora.asm" on "bjdb1"
CRS-2673: Attempting to stop "ora.drivers.acfs" on "bjdb1"
CRS-2677: Stop of "ora.evmd" on "bjdb1" succeeded
CRS-2677: Stop of "ora.mdnsd" on "bjdb1" succeeded
CRS-2677: Stop of "ora.ctssd" on "bjdb1" succeeded
CRS-5017: The resource action "ora.crf stop" encountered the following error:
action for daemon aborted. For details refer to "(:CLSN00108:)" in "/opt/u01/app/11.2.0/grid/log/bjdb1/agent/ohasd/orarootagent_root/orarootagent_root.log".
CRS-2675: Stop of "ora.crf" on "bjdb1" failed
CRS-2679: Attempting to clean "ora.crf" on "bjdb1"
CRS-2681: Clean of "ora.crf" on "bjdb1" succeeded
CRS-2675: Stop of "ora.asm" on "bjdb1" failed
CRS-2679: Attempting to clean "ora.asm" on "bjdb1"
CRS-2681: Clean of "ora.asm" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.cluster_interconnect.haip" on "bjdb1"
CRS-2677: Stop of "ora.cluster_interconnect.haip" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.cssd" on "bjdb1"
CRS-2677: Stop of "ora.cssd" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.gipcd" on "bjdb1"
CRS-2677: Stop of "ora.gipcd" on "bjdb1" succeeded
CRS-2673: Attempting to stop "ora.gpnpd" on "bjdb1"
CRS-2677: Stop of "ora.gpnpd" on "bjdb1" succeeded
CRS-2677: Stop of "ora.drivers.acfs" on "bjdb1" succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on "bjdb1" has completed
CRS-4133: Oracle High Availability Services has been stopped.
root@bjdb1:/>

# Check resource status with crsctl -- as expected, nothing is available now
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

# crsd.bin, cssd.bin and evmd.bin processes are all gone
root@bjdb1:/>ps -ef|grep crsd.bin
    root  8126466 25428158   0 15:52:50  pts/0  0:00 grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
    root  8126470 25428158   0 15:53:01  pts/0  0:00 grep cssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
    root 35520600 25428158   0 15:53:13  pts/0  0:00 grep evmd.bin

# No pmon processes either
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root@bjdb1:/>

# Try starting CRS again -- success!
root@bjdb1:/>crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

# Querying resources still fails, so the stack is not fully up yet
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

# Wait a while and watch the key GI background processes come up
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
root@bjdb1:/>ps -ef|grep cssd.bin|grep -v grep
    grid 10747994 26542192   0 15:55:03      -  0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep pmon
    root 39387390 25428158   0 15:57:23  pts/0  0:00 grep pmon
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
    grid 39911466        1   0 15:58:47      -  0:00 asm_pmon_+ASM2
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
    root 37814470 25428158   0 15:59:27  pts/0  0:00 grep pmon
    grid 39911466        1   0 15:58:47      -  0:00 asm_pmon_+ASM2
root@bjdb1:/>
root@bjdb1:/>ps -ef|grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
    grid 10747994 26542192   0 15:55:03      -  0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
    grid 40173760        1   0 15:57:10      -  0:00 /opt/u01/app/11.2.0/grid/bin/evmd.bin
root@bjdb1:/>ps -ef|grep crsd.bin
    root 37683238        1   0 15:59:54      -  0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
root@bjdb1:/>

# Once the core daemons are up, crsctl works again and the resources can be seen starting
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
ora.net1.network
               ONLINE  ONLINE       bjdb1
ora.ons
               ONLINE  ONLINE       bjdb1
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE
ora.bjdb.db
      1        ONLINE  OFFLINE
      2        ONLINE  OFFLINE
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  OFFLINE                               STARTING
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  OFFLINE                               STARTING

After waiting a while longer and querying again, all resources on node 1 are back to normal.
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
ora.net1.network
               ONLINE  ONLINE       bjdb1
ora.ons
               ONLINE  ONLINE       bjdb1
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE
ora.bjdb.db
      1        ONLINE  ONLINE       bjdb1                    Open
      2        ONLINE  OFFLINE
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  OFFLINE
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  OFFLINE

3.3 Deal with the high space usage of the GI storage location on node 1

Keep watching the node 1 alert log:
grid@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 16:03:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 16:08:30.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 16:09:50.796:
[ctssd(5046446)]CRS-2407:The new Cluster Time Synchronization Service reference node is host bjdb1.
2016-10-10 16:10:20.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".
2016-10-10 16:15:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is "/opt/u01/app/11.2.0/grid/crf/db/bjdb1".

This message had been seen earlier as well: the /opt/u01 filesystem simply needs to be cleaned up. Once the removable historical logs are identified and deleted, the warning stops appearing.
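A possible cleanup sketch; the retention window, file patterns, and the idea of resetting the CHM repository are assumptions to be verified against MOS guidance before deleting anything:

GRID_HOME=/opt/u01/app/11.2.0/grid
df -g /opt/u01                              # AIX: check filesystem usage first
du -sm $GRID_HOME/log $GRID_HOME/crf/db     # identify the biggest consumers (sizes in MB)
# rotated GI trace logs (e.g. crsd.l01 ... crsd.l10) older than 30 days are usually
# safe to remove -- review the list before deleting anything
find $GRID_HOME/log -name "*.l[0-9][0-9]" -mtime +30 -ls
# find $GRID_HOME/log -name "*.l[0-9][0-9]" -mtime +30 -exec rm {} \;
# if the CHM repository (*.bdb files under $GRID_HOME/crf/db/<node>) is the main consumer,
# a commonly cited workaround is to stop ora.crf, remove the .bdb files, then start it again:
# crsctl stop res ora.crf -init
# crsctl start res ora.crf -init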
3.4 Mount OCR manually on node 2 and restart CRS

Node 1 is now resolved; on node 2, mount the OCR disk group in the same way and then restart CRS.
The steps are identical, only executed on node 2, so they are not described again in detail.
# Force-stop CRS on node 2
root@bjdb2:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on "bjdb2"
CRS-2673: Attempting to stop "ora.mdnsd" on "bjdb2"
CRS-2673: Attempting to stop "ora.crf" on "bjdb2"
CRS-2673: Attempting to stop "ora.ctssd" on "bjdb2"
CRS-2673: Attempting to stop "ora.evmd" on "bjdb2"
CRS-2673: Attempting to stop "ora.asm" on "bjdb2"
CRS-2673: Attempting to stop "ora.drivers.acfs" on "bjdb2"
CRS-2677: Stop of "ora.crf" on "bjdb2" succeeded
CRS-2677: Stop of "ora.mdnsd" on "bjdb2" succeeded
CRS-2677: Stop of "ora.evmd" on "bjdb2" succeeded
CRS-2677: Stop of "ora.ctssd" on "bjdb2" succeeded
CRS-2675: Stop of "ora.asm" on "bjdb2" failed
CRS-2679: Attempting to clean "ora.asm" on "bjdb2"
CRS-2681: Clean of "ora.asm" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.cluster_interconnect.haip" on "bjdb2"
CRS-2677: Stop of "ora.cluster_interconnect.haip" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.cssd" on "bjdb2"
CRS-2677: Stop of "ora.cssd" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.gipcd" on "bjdb2"
CRS-2677: Stop of "ora.gipcd" on "bjdb2" succeeded
CRS-2673: Attempting to stop "ora.gpnpd" on "bjdb2"
CRS-2677: Stop of "ora.gpnpd" on "bjdb2" succeeded
CRS-2677: Stop of "ora.drivers.acfs" on "bjdb2" succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on "bjdb2" has completed
CRS-4133: Oracle High Availability Services has been stopped.

Then start it again:
crsctl start crs

After waiting a while, check again:
# The crsd process is now running
root@bjdb2:/>ps -ef|grep crsd.bin|grep -v grep
    root 22610148        1   0 16:24:15      -  0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot

# Finally, crsctl shows that all resources are back to normal
root@bjdb2:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
               ONLINE  ONLINE       bjdb2                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
               OFFLINE OFFLINE      bjdb2
ora.net1.network
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.ons
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       bjdb2
ora.bjdb.db
      1        ONLINE  ONLINE       bjdb1                    Open
      2        ONLINE  ONLINE       bjdb2                    Open
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  ONLINE       bjdb2
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  ONLINE       bjdb2

# Check the listeners running on node 2 -- the SCAN listener was on node 2 throughout the failure
root@bjdb2:/>ps -ef|grep tns
    grid  5308430        1   0   Aug 17      -  5:05 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
    grid  5505240        1   1   Aug 17      - 27:23 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root@bjdb2:/>

This completes the handling of the CRS-4535 / CRS-4000 incident on this RAC cluster. It is worth noting that, from the moment the fault was spotted during a routine inspection through the entire troubleshooting process, the RAC database kept serving clients -- which says something about the resilience of RAC.