由 BUG 引发 ORA-15064 进而导致数据库实例意外中止

今天早上刚到公司一会，就接到南京客户打来的电话，说他们的核心会员数据库宕机了，让我远程帮忙查看一下。还没来得急看报纸的我赶紧打开电脑，远程连接到客户的服务器进行诊断。客户的生产环境是AIX 6.1上的Oracle 11.2.0.3.0，在凌晨1:40分左右发生的故障。为了不涉及泄露客户隐私，把数据库实例名进行了替换。下面看具体的分析： 1. 数据库alert.log分析 Mon Jan 05 01:40:50 2015WARNING: ASM communication error: op 18 state 0x50 （3113）ERROR: slave communication error with ASMNOTE: Deferred communication with ASM instanceErrors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_ora_16581034.trc:ORA-03113: end-of-file on communication channelProcess ID: Session ID: 288 Serial number: 5649NOTE: deferred map free for map id 4422Mon Jan 05 01:40:55 2015NOTE: ASMB terminatingMon Jan 05 01:40:55 2015 *********************************************************************** Fatal NI connect error 12170. VERSION INFORMATION:TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionTCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionOracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Time: 05-JAN-2015 01:40:55 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12606 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Client address: （ADDRESS=（PROTOCOL=tcp）（HOST=test2）（PORT=64460））WARNING: inbound connection timed out （ORA-3136）Mon Jan 05 01:40:55 2015 *********************************************************************** Fatal NI connect error 12170. VERSION INFORMATION:TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionTCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionOracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Time: 05-JAN-2015 01:40:55 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12606 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0Mon Jan 05 01:40:55 2015 *********************************************************************** Client address: （ADDRESS=（PROTOCOL=tcp）（HOST=test2）（PORT=64530））Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:ORA-15064: communication failure with ASM instanceORA-03135: connection lost contactProcess ID: Session ID: 216 Serial number: 7 Fatal NI connect error 12170.WARNING: inbound connection timed out （ORA-3136）Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc:ORA-15064: communication failure with ASM instanceORA-03135: connection lost contactProcess ID: Session ID: 216 Serial number: 7 VERSION INFORMATION:TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionTCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionOracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Time: 05-JAN-2015 01:40:55 Tracing not turned on. Tns error struct:ASMB （ospid: 5898342）: terminating the instance due to error 15064 ns main err code: 12535 TNS-12535: TNS:operation timed outMon Jan 05 01:40:55 2015 *********************************************************************** ns secondary err code: 12606 Fatal NI connect error 12170. nt main err code: 0 VERSION INFORMATION:TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionTCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionOracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production nt secondary err code: 0 Time: 05-JAN-2015 01:40:55 Tracing not turned on. nt OS err code: 0Mon Jan 05 01:40:56 2015System state dump requested by （instance=1, osid=5898342 （ASMB））, summary=[abnormal instance termination].System State dumped to trace file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_diag_20578640.trcMon Jan 05 01:40:59 2015ORA-1092 : opitsk aborting processMon Jan 05 01:41:00 2015ORA-1092 : opitsk aborting processInstance terminated by ASMB, pid = 5898342Mon Jan 05 08:46:45 2015Starting ORACLE instance （normal）由于在故障点之前也会报一些ORA-3136的错误，故这里不做说明，与本次错误干系不大。从alert日志中发现，数据库实例在1:40:55时，被ASMB（进程号：5898342）进程终止，直到8点46分左右才开始启动（客户自己手动重启）。 2.相关trace文件分析 test5_ora_16581034.trc： *** 2015-01-05 01:40:50.900*** SESSION ID:（424.19913） 2015-01-05 01:40:50.900*** CLIENT ID:（） 2015-01-05 01:40:50.900*** SERVICE NAME:（SYS$USERS） 2015-01-05 01:40:50.900*** MODULE NAME:（backup incr datafile） 2015-01-05 01:40:50.900*** ACTION NAME:（0000097 STARTED16） 2015-01-05 01:40:50.900 WARNING: ASM communication error: op 18 state 0x50 （3113）ERROR: slave communication error with ASMNOTE: Deferred communication with ASM instanceORA-03113: end-of-file on communication channelProcess ID: Session ID: 288 Serial number: 5649 基本与alert日志中报的错误相同，就是获知在1：40分这个故障点，数据库与ASM实例的连接中断了 test5_asmb_5898342.trc： *** 2015-01-05 01:40:55.870NOTE: ASMB terminating *** 2015-01-05 01:40:55.922ORA-15064: communication failure with ASM instanceORA-03135: connection lost contactProcess ID: Session ID: 216 Serial number: 7error 15064 detected in background processORA-15064: communication failure with ASM instanceORA-03135: connection lost contactProcess ID: Session ID: 216 Serial number: 7kjzduptcctx: Notifying DIAG for crash event----- Abridged Call Stack Trace -----ksedsts（）+360<-kjzdssdmp（）+240<-kjzduptcctx（）+228<-kjzdicrshnfy（）+100<-ksuitm（）+5124<-ksbrdp（）+4508<-opirip（）+1624<-opidrv（）+608<-sou2o（）+136<-opimai_real（）+188<-ssthrdmain（）+268<-main（）+204<-__start（）+112 ----- End of Abridged Call Stack Trace ----- *** 2015-01-05 01:40:55.954ASMB （ospid: 5898342）: terminating the instance due to error 15064ksuitm: waiting up to [5] seconds before killing DIAG（20578640）从这个trace文件可以获知，由于遭遇ora-15064错误，ASMB进程终止了数据库实例。 test5_diag_20578640.trc： Session Wait History: elapsed time of 1.543354 sec since last wait 0: waited for "ASM background timer" =0x0, =0x0, =0x0 wait_id=881966 seq_num=30034 snap_id=1 wait times: snap=0.000422 sec, exc=0.000422 sec, total=0.000422 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000016 sec of elapsed time 1: waited for "ASM file metadata operation" msgop=0x2, locn=0x3, =0x0 wait_id=881964 seq_num=30033 snap_id=2 wait times: snap=0.000082 sec, exc=0.000127 sec, total=4 min 4 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 2: waited for "latch: shared pool" address=0x700000000107048, number=0x133, tries=0x0 wait_id=881965 seq_num=30032 snap_id=1 wait times: snap=4 min 4 sec, exc=4 min 4 sec, total=4 min 4 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 3: waited for "ASM file metadata operation" msgop=0x2, locn=0x3, =0x0 wait_id=881964 seq_num=30031 snap_id=1 wait times: snap=0.000045 sec, exc=0.000045 sec, total=0.000045 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000019 sec of elapsed time 4: waited for "ASM background timer" =0x0, =0x0, =0x0 wait_id=881963 seq_num=30030 snap_id=1 wait times: snap=3.285876 sec, exc=3.285876 sec, total=3.285876 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000011 sec of elapsed time 5: waited for "ASM file metadata operation" msgop=0xc, locn=0x3, =0x0 wait_id=881962 seq_num=30029 snap_id=1 wait times: snap=0.000034 sec, exc=0.000034 sec, total=0.000034 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000019 sec of elapsed time 6: waited for "ASM background timer" =0x0, =0x0, =0x0 wait_id=881961 seq_num=30028 snap_id=1 wait times: snap=5.000528 sec, exc=5.000528 sec, total=5.000528 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000009 sec of elapsed time 7: waited for "ASM file metadata operation" msgop=0x0, locn=0x3, =0x0 wait_id=881960 seq_num=30027 snap_id=1 wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000016 sec of elapsed time 8: waited for "ASM background timer" =0x0, =0x0, =0x0 wait_id=881959 seq_num=30026 snap_id=1 wait times: snap=5.000493 sec, exc=5.000493 sec, total=5.000493 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000011 sec of elapsed time 9: waited for "ASM file metadata operation" msgop=0x0, locn=0x3, =0x0 wait_id=881958 seq_num=30025 snap_id=1 wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000017 sec of elapsed time sample interval: 1 sec, max history 120 sec --------------------------------------------------- [120 samples, 01:38:56 - 01:40:55] waited for "latch: shared pool", seq_num: 30032 p1: "address"=0x700000000107048 p2: "number"=0x133 p3: "tries"=0x0 time_waited: >= 119 sec （still in wait） --------------------------------------------------- Sampled Session History Summary: longest_non_idle_wait: "latch: shared pool" [120 samples, 01:38:56 - 01:40:55] time_waited: >= 119 sec （still in wait）在ASMB进程的的dump文件中，发现有2个4分4秒的等待，waited for "ASM file metadata operation"和waited for "latch: shared pool",而其他的相关操作都是只有几微秒，甚至0秒的 3.ASM实例的alert日志： Mon Jan 05 01:40:56 2015 ***********************************************************************TNS-12537: TNS:connection closed Fatal NI connect error 12537, connecting to: （DESCRIPTION=（LOCAL=YES）（ADDRESS=（PROTOCOL=beq））） ns secondary err code: 12560 VERSION INFORMATION:TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - ProductionOracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production nt main err code: 0 Time: 05-JAN-2015 01:40:56 nt secondary err code: 0 Tracing not turned on. nt OS err code: 0 Tns error struct: ns main err code: 12537 TNS-12537: TNS:connection closed ns secondary err code: 12560 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0Mon Jan 05 04:42:23 2015NOTE: ASMB process exiting due to lack of ASM file activity for 305 secondsMon Jan 05 08:46:53 2015NOTE: client test5:test5 registered, osid 23593052, mbr 0x0Mon Jan 05 08:47:24 2015Starting background process ASMBMon Jan 05 08:47:24 2015ASMB started with pid=26, OS id=20185292 Mon Jan 05 08:47:25 2015NOTE: client +ASM:+ASM registered, osid 16384406, mbr 0x1 在故障点之前，ASM的alert日志中未报过TNS错误，但故障点的时候，ASM中也出现了TNS连接关闭的错误，因此，也不排除网络因素引起的ASM实例故障。4点多的时候有一个NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds，根据文档“NOTE ASMB process exiting due to lack of ASM file activity （文档 ID 754110.1）”，这个NOTE可以忽略。到了8:46的时候，ASM实例随着数据库的重启也正常启动了最终，找到了一篇MOS文档，是和这次遇到的故障相关的，应该就是Bug 13914613导致的这个问题：

Bug 13914613 Excessive time holding shared pool latch in kghfrunp with auto memory management

This note gives a brief overview of bug 13914613.
The content was last updated on: 17-DEC-2014
Click here for details of each of the sections below.

Affects:

Product （Component） Oracle Server （Rdbms）
Range of versions believed to be affected （Not specified）
Versions confirmed as being affected
11.2.0.3
11.2.0.2
10.2.0.5
Platforms affected Generic （all / most platforms affected）

Fixed:

The fix for 13914613 is first included in
12.1.0.1 （Base Release）
11.2.0.4 （Server Patch Set）
11.2.0.3.6 Database Patch Set Update
11.2.0.3 Bundle Patch 12 for Exadata Database
11.2.0.3 Patch 18 on Windows Platforms
11.2.0.2 Patch 25 on Windows Platforms

来源： <https://support.oracle.com/epmos/faces/DocumentDisplay？_afrLoop=182220034715943&id=13914613.8&_adf.ctrl-state=14dpmawvgr_260>

Description

A session may spend excessive time holding the shared pool latch underkghfrunp when auto memory management is used. This can ultimately result in an instance crash due to other sessions holding critical resources too long. eg: database crash due to ORA-240 and ORA-15064 Rediscovery Notes Session wait chains show "latch: shared pool" waits Wait times for shared pool latch over 10 seconds for the same holder Call stacks of shared pool latch holder show kghfrunp as a currently executing function. Auto memory management is in use such that "duration" subheaps are used eg: ASMM （sga_target） or AMM （memory_target） being used. Workaround Set init.ora parameter _enable_shared_pool_durations=false Getting a Fix Use one of the "Fixed" versions listed above （for Patch Sets / bundles use the latest version available ascontents are cumulative - the "Fixed" version listed above isthe first version where the fix is included）

分析了产生故障的原因，我给客户的解决方案就是以下两条： 1.设置隐含参数"_enable_shared_pool_durations"=false2.升级到11.2.0.4的patchset 在CentOS 5.5 i386 上安装 Oracle 10G XE http://www.linuxidc.com/Linux/2011-02/31928p2.htmLinux下Oracle 11G XE 安装笔记 http://www.linuxidc.com/Linux/2012-08/66974.htm在CentOS 6.4下安装Oracle 11gR2（x64） http://www.linuxidc.com/Linux/2014-02/97374.htmOracle 11gR2 在VMWare虚拟机中安装步骤 http://www.linuxidc.com/Linux/2013-09/89579p2.htmDebian 下安装 Oracle 11g XE R2 http://www.linuxidc.com/Linux/2014-03/98881.htm更多Oracle相关信息见Oracle 专题页面 http://www.linuxidc.com/topicnews.aspx？tid=12本文永久更新链接地址

*Product （Component）*	Oracle Server （Rdbms）
*Range of versions believed* to be affected**	（Not specified）
*Versions confirmed* as being affected**	11.2.0.3 11.2.0.2 10.2.0.5
Platforms affected	Generic （all / most platforms affected）