SMON Functions You May Not Know About (Part 1): Cleaning Up Temporary Segments

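SMON (system monitor process) is the system monitor background process. It is sometimes also called the system cleanup process, because it is responsible for many cleanup tasks. Anyone who has studied the basics of Oracle will know at least something about what this background process does.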

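There was a time when how well a DBA knew SMON's functions was an important measure of their theoretical knowledge, and many companies still ask "What does SMON do?" in DBA interviews. It is an open-ended question: nobody expects the candidate to give a complete answer (that is nearly impossible, even after reading this article). How many functions the candidate can list indicates breadth of knowledge (and if the candidate prepared for this question specifically, that is fine too, since it shows they can think about a topic systematically). The interviewer can then dig into one specific function to gauge depth of knowledge. But I digress.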

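The SMON we all know is a diligent worker responsible for a series of system-level tasks. Unlike the PMON (Process Monitor) background process, SMON handles work that relates more to the system as a whole, which means it ends up doing some obscure "grunt work". When the system generates these "garbage tasks" frequently, SMON may not be able to keep up. So in 10g SMON became a little lazy: if it receives too many work notifications in a short period (SMON: system monitor process posted), it may choose to slack off so that it does not become too busy (SMON: Posted too frequently, trans recovery disabled). This is covered in detail later.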

SMON's main functions include:

1. Cleaning up temporary segments (SMON cleanup temporary segments)

Trigger scenarios

Many people misunderstand the temporary segments meant here, assuming they are the sort segments in a temporary tablespace. In fact, the temporary segments here are mainly those in permanent tablespaces. Temporary segments in temporary tablespaces are also cleaned up by SMON, but that cleanup happens only at instance startup.

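Permanent tablespaces contain temporary segments too. For example, when we create a table or index in a permanent tablespace with DDL such as create table/index, the server process first allocates enough extents in the specified permanent tablespace. These extents remain temporary (Temporary Extents) until the command finishes; only when the table or index is fully built is the temporary segment converted into a permanent segment. Likewise, when a segment is dropped with the drop command, it is first converted into a temporary segment, and that temporary segment is then cleaned up (DROP object converts the segment to temporary and then cleans up the temporary segment).

Under normal circumstances cleanup follows the rule that whoever creates a temporary segment cleans it up. In other words, a temporary segment produced by a server process rebuilding an index should be cleaned up by that server process once the rebuild completes. If the server process terminates unexpectedly before it has cleaned up the temporary segment, or hits an ORA- error that makes the statement fail, SMON is posted to finish cleaning up the temporary segment.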

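For temporary segments in permanent tablespaces, SMON performs cleanup once every three minutes (provided it has been posted). If SMON is too busy, temporary segments may go uncleaned for a long time. A typical problem this causes: after a rebuild index online fails, a subsequent rebuild index command requires that the temporary segment left by the earlier attempt has already been cleaned up, and if the cleanup has not finished the command has to keep waiting. In 10gR2 we can use dbms_repair.online_index_clean to manually clean up what a failed online index rebuild leaves behind: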

The dbms_repair.online_index_clean function has been created to cleanup online index rebuilds.
Use the dbms_repair.online_index_clean function to resolve the issue.
Please note if you are unable to run the dbms_repair.online_index_clean function it is due to the fact
that you have not installed the patch for Bug 3805539 or are not running on a release that includes this fix.
The fix for this bug is a new function in the dbms_repair package called dbms_repair.online_index_clean,
which has been created to cleanup online index [[sub]partition] [re]builds.

New functionality is not allowed in patchsets;
therefore, this is not available in a patchset but is available in 10gR2.

Check your patch list to verify the database is patched for Bug 3805539
using the following command and patch for the bug if it is not listed:

opatch lsinventory -detail

Cleanup after a failed online index [re]build can be slow to occur, preventing subsequent such operations
until the cleanup has occurred.

Next, let's see in practice how smon cleans up temporary segments in a permanent tablespace:

Set event 10500 to trace the smon process; this diagnostic event is described later.

SQL> alter system set events '10500 trace name context forever,level 10';
System altered.

In the first session, run a create table command, which produces a number of Temporary Extents:

SQL> create table smon as select * from ymon;

In another session, query the DBA_EXTENTS view to see how many temporary extents have been created:

SQL> SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';

COUNT(*)
----------
117

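Terminate the session running the create table above, wait a while, and then look at the smon background process trace file (trc), which shows the following: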

*** 2011-06-07 21:18:39.817
SMON: system monitor process posted msgflag:0x0200 (-/-/-/-/TMPSDROP/-/-)

*** 2011-06-07 21:18:39.818
SMON: Posted, but not for trans recovery, so skip it.

*** 2011-06-07 21:18:39.818
SMON: clean up temp segments in slave

SQL> SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';

COUNT(*)
----------
0

As shown, smon cleaned up the temporary segment through a slave process.
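
When a trace file contains many such entries, it can help to pull out the post lines programmatically. The following is a minimal Python sketch that assumes the post line has exactly the layout shown in the excerpt above; the function names `parse_smon_post` and `count_temp_segment_posts` are hypothetical helpers, not part of any Oracle tooling, and the sketch does not try to interpret the individual msgflag bits.

```python
import re

# Pattern for the "posted" line as it appears in the trace excerpt above.
# Assumption: other Oracle versions may format this line differently.
POST_LINE = re.compile(
    r"SMON: system monitor process posted msgflag:(0x[0-9a-fA-F]+) \(([^)]*)\)"
)


def parse_smon_post(line):
    """Return (msgflag, reasons) for a post line, or None if it does not match."""
    match = POST_LINE.search(line)
    if match is None:
        return None
    msgflag = int(match.group(1), 16)
    # Slots that are not set are printed as "-"; keep only the named ones.
    reasons = [slot for slot in match.group(2).split("/") if slot != "-"]
    return msgflag, reasons


def count_temp_segment_posts(trace_lines):
    """Count post lines whose reason list includes TMPSDROP."""
    count = 0
    for line in trace_lines:
        parsed = parse_smon_post(line)
        if parsed is not None and "TMPSDROP" in parsed[1]:
            count += 1
    return count
```

Feeding the lines of the smon trace file to `count_temp_segment_posts` gives a rough count of how often SMON was posted for temporary segment cleanup during the traced period.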

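Unlike temporary segments in permanent tablespaces, extents in temporary tablespaces are, for performance reasons, not released and returned immediately after an operation completes. Instead, these Temporary Extents are marked as available for the next sort operation. SMON still cleans up these temporary segments, but only at instance startup: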

For performance issues, extents in TEMPORARY tablespaces are not released or deallocated
once the operation is complete. Instead, the extent is simply marked as available for the next sort operation.
SMON cleans up the segments at startup.

A sort segment is created by the first statement that used a TEMPORARY tablespace for sorting, after startup.
A sort segment created in a TEMPORARY tablespace is only released at shutdown.
The large number of EXTENTS is caused when the STORAGE clause has been incorrectly calculated.

Symptoms

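The following query shows the total number of Temporary Extents in the database. Compare the total over a period of time: if it is decreasing, SMON is cleaning up temporary segments.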

SELECT COUNT(*) FROM DBA_EXTENTS WHERE SEGMENT_TYPE='TEMPORARY';
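
The comparison over time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the samples are taken by running the query above at intervals and recording (elapsed seconds, count) pairs by hand or by script, `cleanup_progress` is a hypothetical helper name, and the average rate is a simplification, since SMON does not necessarily release extents at a steady pace.

```python
def cleanup_progress(samples):
    """Summarize temporary-extent counts sampled over time.

    samples: list of (elapsed_seconds, temp_extent_count) pairs in time order,
             each count taken from the DBA_EXTENTS query shown above.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    (t_first, c_first), (t_last, c_last) = samples[0], samples[-1]
    released = c_first - c_last          # extents that disappeared between samples
    elapsed = t_last - t_first
    # Average release rate; 0.0 when the samples share a timestamp.
    rate = released / elapsed if elapsed > 0 else 0.0
    return {
        "released": released,
        "rate_per_sec": rate,
        "shrinking": released > 0,       # True suggests cleanup is in progress
    }
```

A result with `shrinking` set to False over several minutes would suggest that SMON has not been posted, or is too busy to get to the cleanup.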

You can also check the "SMON posted for dropping temp segment" statistic in the v$sysstat view to see how often SMON has been asked to clean up:

SQL> select name,value from v$sysstat where name like '%SMON%';
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
total number of times SMON posted                                         8
SMON posted for undo segment recovery                                     0
SMON posted for txn recovery for other instances                          0
SMON posted for instance recovery                                         0
SMON posted for undo segment shrink                                       0
SMON posted for dropping temp segment                                     1
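
Because these statistics are cumulative since instance startup, a single reading says little; the difference between two readings is more informative. Below is a minimal Python sketch that parses SQL*Plus output in the column layout shown above and reports which SMON statistics changed. The helper names `parse_sysstat` and `sysstat_delta` are hypothetical, and the parser assumes every data row ends with an integer value.

```python
def parse_sysstat(text):
    """Parse 'NAME ... VALUE' rows of SQL*Plus output into a dict."""
    stats = {}
    for line in text.splitlines():
        # Split off the last whitespace-separated token as the value.
        parts = line.rsplit(None, 1)
        # Skip blank lines, the header row, the dashed separator and the prompt.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        stats[parts[0].strip()] = int(parts[1])
    return stats


def sysstat_delta(before, after):
    """Return only the statistics whose value changed between two snapshots."""
    delta = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change != 0:
            delta[name] = change
    return delta
```

A growing "SMON posted for dropping temp segment" delta between two snapshots indicates that sessions keep leaving temporary segments behind for SMON to clean up.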

In addition, SMON holds the Space Transaction (ST) enqueue for long periods during cleanup, so other sessions may time out waiting for the ST lock and hit ORA-01575:

01575, 00000, "timeout waiting for space management resource"
// *Cause: failed to acquire necessary resource to do space management.
// *Action: Retry the operation.

How to stop SMON from cleaning up temporary segments

You can disable SMON's cleanup of temporary segments by setting the diagnostic event event='10061 trace name context forever, level 10' (disable SMON from cleaning temp segments).

alter system set events '10061 trace name context forever, level 10';

Related diagnostic events

Besides event 10061, event 10500 can be used to trace smon's post information. For how to set it, see <EVENT: 10500 "turn on traces for SMON">

  1. Online index rebuilds which fail (for any reason) leave the
    dictionary marked that the rebuild was in progress and SMON should
    clean up the dictionary (kdicclean). This cleanup function is
    only executed every hour by SMON so you have to wait for SMON
    to clean IND$.

    It is understood that in some cases this hourly cleanup may
    be an issue for customers and this is the focus of the enhancement
    in bug 3805539 which introduces a PLSQL API in DBMS_REPAIR to
    allow a user to force the failed rebuild to be cleaned up.

  2. [oracle@rh2 ~]$ oerr ora 8105
    08105, 00000, "Oracle event to turn off smon cleanup for online index build"
    // *Cause: set this event only under the supervision of Oracle development
    // *Action: debugging only

  3. ORA-600 [Ktfbfget-1] on Database Startup

    Applies To
    Oracle Server – Enterprise Edition – Version: 9.2
    This problem can occur on any platform.
    Symptoms
    ORA-00600: internal error code, arguments: [ktfbfget-1] at database startup.

    SMON crashing the instance.
    Cause
    Cleaning up the TEMP Segment causes the problem.
    Fix
    1. Set EVENT="10061 trace name context forever, level 10" in the init.ora

    2. Open the database.

    3. If it is not opened, set all the following events, then open the database.
    EVENT="10511 trace name context forever"
    EVENT="10052 trace name context forever"
    EVENT="10061 trace name context forever, level 10"
    EVENT="10500 trace name context forever, level 8"

    4. Check for temporary segments:
    SQL> select * from dba_segments where segment_type='TEMPORARY';

    By default the TEMP segment is dropped at instance shutdown, hence there should be no TEMP segments at instance startup.

    Identify the tablespace name from the above query.

    IF ANY ROWS RETURNED PROCEED FURTHER.

    5. Drop the Temp segments using the following command:

    SQL> alter session set events 'immediate trace name DROP_SEGMENTS level TS#+1';

    Where TS# is the tablespace number obtained from the query above.

    6. If the above won’t work, drop that TEMPORARY tablespace.

    7. Shutdown the database.

    8. Remove all the events.

    9. Startup the database.

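  4. Liu, I learned a lot from your article; it corrected some of my misconceptions. Thank you very much.

    "You can disable SMON's cleanup of temporary segments by setting the diagnostic event event='10061 trace name context forever, level 10' (disable SMON from cleaning temp segments)."
    I have a question about this sentence: does "disable" here mean that SMON is disabled only temporarily?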

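  5. Maclean, will setting event 10061 to stop SMON from cleaning up temporary segments in permanent tablespaces have any impact on a production database?
    Thanks.
    Recently SMON's temporary segment cleanup has brought down our core production database; this has already happened three times.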

    (Fri Aug 10 13:58:26 2012
    Errors in file /home/oracle/admin/shestate/bdump/rac2_smon_229636.trc:
    ORA-00600: internal error code, arguments: [1153], [15], [], [], [], [], [], []
    Fri Aug 10 13:58:28 2012
    Non-fatal internal error happenned while SMON was doing temporary segment drop.
    SMON encountered 1 out of maximum 100 non-fatal internal errors.
    Fri Aug 10 13:58:28 2012
    Errors in file /home/oracle/admin/shestate/bdump/rac2_smon_229636.trc:
    ORA-00600: internal error code, arguments: [1153], [15], [], [], [], [], [], []
    Fri Aug 10 13:58:28 2012
    Trace dumping is performing id=[cdmp_20120810135828]
    Fri Aug 10 13:58:29 2012
    Non-fatal internal error happenned while SMON was doing temporary segment drop.
    SMON encountered 2 out of maximum 100 non-fatal internal errors.
    Fri Aug 10 13:58:29 2012
    Errors in file /home/oracle/admin/shestate/bdump/rac2_smon_229636.trc:
    ORA-00600: internal error code, arguments: [1153], [15], [], [], [], [], [], []
    Non-fatal internal error happenned while SMON was doing temporary segment drop.
    SMON encountered 3 out of maximum 100 non-fatal internal errors.

    ….
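    Once the count reached 100, the instance crashed.
    The database version is 9.2.0.5, and management will not approve an upgrade or a patch. Is there any other workaround?
    Thanks.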
