首页 / 数据库 / MySQL / 使用AWK分析Oracle系统锁定、Hang状态
在早期Oracle版本中,由于技术不成熟等原因,数据库锁定和僵死状态还是时有发生的。对待这些问题,老先生们的处理策略无外乎是“重启”和“考究”两种策略。所谓“重启”,通过强制的重启服务器或者数据库,将软硬件偶然出现的锁定僵死状态重置,是一种最简单的处理办法。但是,如果是应用程序层面潜在的缺陷、数据库的Bug或者诸如大规模事务回滚的情况,重启大法的作用是很有限的。况且,一些关键业务系统,重启数据库是需要很高级别的审批流程的。这个时候,就需要进行“考究”,仔细研究故障点进行精准分析。“考究”的过程是非常复杂的。传统方法是将数据库内存状态全部dump到Trace文件中,从其中观察到各个进程之间复杂的交互关系,最后定位到数据库锁定的原因。由于Oracle的负责性,Trace文本文件通常成千上万行,进行读取也是非常消耗精力的事情。数据库故障在任何企业都是大事故,每分每秒都意味着巨大的经济损失,早一分钟定位错误,解决问题,是有着重大意义的。Linux体系对于正则表达式、文本处理是有很强的积淀的。AWK/SED命令就可以处理比较负责的文本分析问题。于是,行业中推出了结合awk命令使用的ass109.awk命令程序,实现快速的系统trace文件分析。下面通过一个比较简单的会话间锁定测试,来演示awk命令处理思路。具体ass109.awk脚本在网络上下载比较方便,就不累列本文。1、测试环境准备实验环境选择Oracle 11gR2,版本为11.2.0.4。SQL> select * from v$version;BANNER--------------------------------------------------------------------------------Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit ProductionPL/SQL Release 11.2.0.4.0 - ProductionCORE 11.2.0.4.0 ProductionTNS for Linux: Version 11.2.0.4.0 - ProductionNLSRTL Version 11.2.0.4.0 – Production实验场景找一个比较简单的就好,双Session共同更新相同记录,构成锁定状态。--Session 1SQL> create table t as select object_id from dba_objects;Table createdSQL> select sid from v$mystat where rownum<2; SID---------- 319SQL> delete t where object_id=1000;1 row deleted--Session 2SQL> select sid from v$mystat where rownum<2; SID---------- 479SQL> delete t where object_id=1000;(Hang住)319和479会话处在锁定等待状态。2、数据库状态Trace下面就需要将系统状态System State,笔者选择使用oradebug工具来进行处理。[oracle@testlife trace]$ sqlplus /nologSQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 14 20:01:13 2016Copyright (c) 1982, 2013, Oracle. All rights reserved.SQL> conn / as sysdbaConnected.SQL> oradebug setmypidStatement processed.SQL> oradebug dump systemstate 10;Statement processed.SQL> oradebug tracefile_name;/u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_ora_8344.trc定位到Trace File所在的目录,确认文件生成。同时,确定ass109处理文件也在该目录下。[oracle@testlife trace]$ ls -l | grep testdb_ora_8344.trc-rw-r----- 1 oracle oinstall 1021451 Oct 14 20:01 testdb_ora_8344.trc[oracle@testlife trace]$ ls -l | grep ass109.awk -rw-r--r-- 1 root root 34549 Oct 14 19:25 ass109.awk3、分析处理使用awk分析trace文件极其简单,分析结果直接就输出在屏幕中。[oracle@testlife trace]$ awk -f ass109.awk testdb_ora_8344.trcStarting Systemstate 1............................Ass.Awk Version 1.0.9 - Processing testdb_ora_8344.trcSystem State 1~~~~~~~~~~~~~~~~1: 2: 0: waiting for "pmon timer" 3: 0: waiting for "rdbms ipc message" 4: 0: waiting for "VKTM Logical Idle Wait" 5: 0: waiting for "rdbms ipc message" 6: 0: waiting for "DIAG idle wait" 7: 0: waiting for "rdbms ipc message" 8: 0: waiting for "DIAG idle wait" 9: 0: waiting for "rdbms ipc message" 10: 0: waiting for "rdbms ipc message" 11: 0: waiting for "rdbms ipc message" 12: 0: waiting for "rdbms ipc message" 13: 0: waiting for "smon timer" 14: 0: waiting for "rdbms ipc message" 15: 0: waiting for "rdbms ipc message" 16: 0: waiting for "rdbms ipc message" 17: 18: 19: 0: waiting for "rdbms ipc message" 20: 0: waiting for "Streams AQ: qmn coordinator idle wait" 21: 0: waiting for "rdbms ipc message" 22: 0: waiting for "Streams AQ: qmn slave idle wait" 23: 0: waiting for "Streams AQ: waiting for time management or cleanup tasks" 24: 0: waiting for "Space Manager: slave idle wait" 25: 0: waiting for "SQL*Net message from client" 26: 0: waiting for "SQL*Net message from client" 27: 0: waiting for "enq: TX - row lock contention"[Enqueue TX-00060002-00000611] 28: Blockers~~~~~~~~ Above is a list of all the processes. If they are waiting for a resource then it will be given in square brackets. Below is a summary of the waited upon resources, together with the holder of that resource. Notes: ~~~~~ o A process id of "???" implies that the holder was not found in the systemstate. Resource Holder StateEnqueue TX-00060002-00000611 26: 0: waiting for "SQL*Net message from client"Object Names~~~~~~~~~~~~Enqueue TX-00060002-00000611 18642 Lines Processed.分析结果很简单,也很容易看懂。分析重点是从process入手,将当前系统中所有对应的Process信息逐个整理出来,将对应的等待事件wait event理出来。最后,将存在Block的情况整理列出来。从上面的结果看,可以发现资源Enqueue TX-00060002-00000611被26号会话持有,该会话却处在等待用户信息输入状态。而上面的Process列表中,对于这个资源27号进程是等待持有的。那么,我们直到了26和27直接有问题,如何定位究竟哪个会话呢?注意:这个26和27对应的就是v$process中数据库内部的进程编号PID。SQL> select * from v$process where pid in (26,27);ADDR PID SPID ---------------- ---------- ------------------------ 00000001339FDBD0 26 7505 00000001339FEC88 27 8317 SQL> select sid, serial#, paddr, event from v$session where paddr in ("00000001339FDBD0","00000001339FEC88"); SID SERIAL# PADDR EVENT---------- ---------- ---------------- ---------------------------------------------------------------- 319 21 00000001339FDBD0 SQL*Net message from client 479 43 00000001339FEC88 enq: TX - row lock contention正确定位到319和479会话。3、结论数据库资源相互锁定在应用领域非常常见,例如资源释放问题、会话操作失误和事务回滚等,都会引起或大或小的资源锁定。通过awk命令配合ass109.awk命令文件,可以比较方便的让我们快速定位问题,解决问题。更多Oracle相关信息见Oracle 专题页面 http://www.linuxidc.com/topicnews.aspx?tid=12本文永久更新链接地址