Advanced Diagnostic using oradebug dumpvar

oradebug工具是Oracle数据库调优和诊断的利器,合理运用oradebug可以大幅减少我们收集诊断信息所花费的时间。当然前提是合理运用,对于初级DBA有这样一个忠告,不要在生产环境中去接触或修改自己所不熟悉的领域的东西,这一点很重要。并不是说中高级DBA在知识和经验上能达到巨细靡遗的状态,实际上中高级Oracle DBA可能每天都在和新事物打交道。恰恰相反,中高级DBA是对以上忠告最忠实的践行者,在生产环境中绝不随意涉险于不确定、不明确、不熟悉的区域,很多时候这会让人觉得是一种”好奇心丧失”的表现。

言归正传,我们所要介绍的是oradebug的dumpvar命令,该命令用于打印转储固定的PGA/UGA/SGA的变量:,其语法如下:

oradebug dumpvar
Print/dump a fixed PGA/SGA/UGA variable.
Syntax                                                              Parameter
oradebug dumpvar <p|s|uga> <variable name> [level]                  <p|s|uga> PGA,SGA or UGA
                                                                    fixed variable name
                                                                    [level]
使用示例
SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com


SQL> show parameter db_files

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_files                             integer     200

SQL> oradebug setmypid;
Statement processed.

SQL> oradebug dumpvar sga kcfdpk
kfil kcfdpk_ [698B950, 698B954) = 000000C8

这里的KCFDPK变量保存了db_files参数的值,000000C8 == 200

利用oradebug工具我们不仅可以做到对这些内部变量的窥视,还可以做到修改,
但是这对我们实际的诊断没有益处,仅仅可以用来满足某些模拟实验

仅仅了解dumpvar命令的使用方法并没有意义,只有和有意义的内部变量结合起来才能真正起到高级诊断的作用,在这里抛砖引玉罗列出部分诊断变量:

sgauso --  该变量可以用于判断use_stored_outlines

SQL> oradebug setmypid;
Statement processed.

Default use_stored_outlines=false;

SQL> oradebug dumpvar sga sgauso
qolprm sgauso_ [060021418, 06002143C) =
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

SQL> alter system set use_stored_outlines=true;
System altered.

SQL> oradebug dumpvar sga sgauso
qolprm sgauso_ [060021418, 06002143C) =
00000001 45440007 4C554146 00000054 00000000 00000000 00000000 00000000 00000000

45440007 4C554146 ==>      DEFAULT,意为优化器使用DEFAULT category中存放的outline.


ksmvpa -- the size of the variable part of the pga

SQL> oradebug dumpvar pga ksmvpa
b4 ksmvpa_ [0068AA6B4, 0068AA6B8) = 0000E920          -- 59680 bytes

kkjsre        -- The SGA variable kkjsre must be 1 for jobs to execute automatically.

SQL> oradebug  dumpvar sga kkjsre
word kkjsre_ [060025760, 060025764) = 00000001

SQL> exec  dbms_ijob.set_enabled(false);
PL/SQL procedure successfully completed.

SQL> oradebug  dumpvar sga kkjsre
word kkjsre_ [060025760, 060025764) = 00000000

kcmsmx          --the MAX reasonable scn 

SQL> oradebug dumpvar pga kcmsmx
kscn kcmsmx_ [00B7E811C, 00B7E8124) = CA7F0000 00000C6C         -- 11.2.0.2

SQL> oradebug dumpvar pga kcmsmx
kscn kcmsmx_ [0068AB02C, 0068AB034) = A701C000 00000B3D         -- 10.2.0.4 

Notice the MAX scn allowed on 11.2 is far higher that that at earlier releases
(due to the fix from bug 9254170).
Use the fix for 9254170 on ALL DBs with an SCN rate set to match between versions.
Avoid excessive SCN generation rates.

kcvlcm - logging checkpoints to alert - FALSE

SQL> show parameter log_checkpoints_to_alert

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_checkpoints_to_alert             boolean     TRUE

SQL> oradebug dumpvar sga kcvlcm
word kcvlcm_ [060019E88, 060019E8C) = 00000001

SQL> alter system set log_checkpoints_to_alert=false;

System altered.

SQL> oradebug dumpvar sga kcvlcm
word kcvlcm_ [060019E88, 060019E8C) = 00000000

kcvcpr - false (a checkpoint is requested)

SQL> oradebug dumpvar sga  kcvcpr
word kcvcpr_ [060019E90, 060019E94) = 00000000

kcvcpf - false (checkpointing fast)

kcvgcw -false a global checkpointing is waiting

SQL> oradebug dumpvar sga kcvgcw
sword kcvgcw_ [060025748, 06002574C) = 00000000

kcfcpd - true if checkpoint writes done

kcvblg  kcvblg[0] is s 1 incidate some file/s in backup mode

SQL> oradebug dumpvar sga kcvblg
ub4 * kcvblg_ [060019E18, 060019E20) = 9A248508 00000000

SQL> oradebug peek 0x9A248508 100
[09A248508, 09A24856C) =
00000001 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000002 00000000
00000002 00000002 00000000 00000000 ...

alter database begin backup;

SQL>  oradebug peek 0x9A248508 100
[09A248508, 09A24856C) =
00000001 00000003 00000003 00000003 00000003
00000003 00000003 00000003 00000003 00000003
00000003 00000003 00000003 00000003 ...

kcsgscn_           -- current scn 

SQL> oradebug dumpvar sga kcsgscn_
kcslf kcsgscn_ [0600122B0, 0600122E0) =
01C7DB4B 00000000 00000000 00000000              01C7DB4B  -- 29875019
0001E907 00000000 00000000 00000000
00000000 00000000 60011F90 00000000              60011F90 is fixed

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
   29875044

kslwlst_  --  waiter list latch

SQL> oradebug dumpvar sga kslwlst
ksllt kslwlst_ [200040AC, 20004174) =
00000000 00000009 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 ...

kcfdfk -- 2 * db_files

SQL> oradebug dumpvar sga kcfdfk
kfil kcfdfk_ [060017EF0, 060017EF4) = 00000190            --400    when db_files=200 

SQL> alter system set db_files=2000 scope=spfile;
System altered.

SQL> oradebug dumpvar sga kcfdfk
kfil kcfdfk_ [060017EF0, 060017EF4) = 00000FA0             --4000

kcffsz  Address of Datafile Size 

So kcbzib() passes in information about how many blocks to read
( and where to start reading from ) The assertion is 

  if (bno < 2 || bno + nblks - 1 > (kcflsz[fno] ? kcflsz[fno] : kcffsz[fno]))

              ==============================================================

              .......NOTE: This section is removed after 9g..................

  {
     /* if the user gave us the block number then this is an external error,
      * but if the dba was internally generated then this is a corruption of
      * some kind. */
     ASSERTNM5(usrdba, OERINM("kcfrbd_2"),
               fno, bno, nblks, kcflsz[fno], kcffsz[fno]);

SQL> oradebug dumpvar sga kcffsz
ub4 * kcffsz_ [060017F20, 060017F28) = 9988E700 00000000

SQL> oradebug peek 0x9988E700 200
[09988E700, 09988E7C8) =
00000001
0007B200
00030980
0000AF00
00067CA0
00003200
00280000
00000500
00000000
00000A00
00000000
00000000
00000600
00040000 ...

SQL> select blocks,to_char(blocks,'XXXXXXXX') hex_blocks from v$datafile;

    BLOCKS HEX_BLOCK
---------- ---------
    504320     7B200
    199040     30980
     44800      AF00
    425120     67CA0
     12800      3200
   2621440    280000
      1280       500
         0         0
      2560       A00
     38400      9600
      6400      1900

    BLOCKS HEX_BLOCK
---------- ---------
      1536       600
    262144     40000

13 rows selected.

kcflsz -- like kcffsz

SQL> oradebug dumpvar  sga kcflsz
ub4 * kcflsz_ [060017F28, 060017F30) = 99892588 00000000

SQL> oradebug peek 0x99892588 200
[099892588, 099892650) =
00000000 0007B200 00030980 0000AF00 00067CA0
00003200 00280000 00000500 00000000 00000A00
00000000 00000000 00000600 00040000 ...

THe arguments are file number, block number, number of blocks, kcflsz[fno],
kcffsz[fno]
Error comes out of kcf.c.
Can they try a validate structure cascade on this table and see what happens?
Please update the customer field so that it does not read internal.

kcf.c kcf.c Kernel Cache Files component.
ksl.c Kernel Service layer Latching & Wait-post Implement.
ksm.c Kernel Service Memory component implementation.
kkj.c Kernel Kompiletime Job queue.

还原真实的cache recovery

我们在学习Oracle基础知识的时候会了解到实例恢复(Instance Recovery)或者说崩溃恢复(Crash recovery)的概念,有时候甚至于这2个名词在我们日常的语言中表达同样的意思。实际上Instance Recovery与Crash Recovery是存在区别的:针对单实例(single instance)或者RAC中所有节点全部崩溃后的恢复,我们称之为Crash Recovery。而对于RAC中的某一个节点失败,存活节点(surviving instance)试图对失败节点线程上redo做应用的情况,我们称之为Instance Recovery。

不管是Instance Recovery还是Crash Recovery,都由2个部分组成:cache recovery继以transaction recovery。

根据官方文档的介绍,Cache Recovery也叫Rolling Forward,就是我们常说的前滚;而Transaction Recovery也叫Rolling Back,就是我们常说的回滚。前滚和回滚贯穿Oracle恢复的基本概念,是我们入门必要学习的知识,在次不多介绍。

有文事者,必济之以武略。理论学得再好,不实践也无用。所幸Crash Recovery是很容易做成的实验,我们不妨来看一看:

SQL> shutdown abort;
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.

Total System Global Area 1065353216 bytes
Fixed Size                  2089336 bytes
Variable Size             486542984 bytes
Database Buffers          570425344 bytes
Redo Buffers                6295552 bytes
Database mounted.

SQL> alter database open;

Crash Recovery将从alter database open开始,我们来观察其日志

==================alert.log====================
alter database open
Tue Jun 14 18:19:53 2011
Beginning crash recovery of 1 threads
 parallel recovery started with 2 processes
Tue Jun 14 18:19:53 2011
Started redo scan
Tue Jun 14 18:19:53 2011
Completed redo scan
 0 redo blocks read, 0 data blocks need recovery
Tue Jun 14 18:19:53 2011
Started redo application at
 Thread 1: logseq 1004, block 1124, scn 17136185
Tue Jun 14 18:19:53 2011
Recovery of Online Redo Log: Thread 1 Group 2 Seq 1004 Reading mem 0
  Mem# 0: /flashcard/oradata/G10R2/onlinelog/o1_mf_2_6v34jokt_.log
  Mem# 1: /s01/flash_recovery_area/G10R2/onlinelog/o1_mf_2_6v34jotq_.log
Tue Jun 14 18:19:53 2011
Completed redo application
Tue Jun 14 18:19:53 2011
Completed crash recovery at
 Thread 1: logseq 1004, block 1124, scn 17156186
 0 data blocks read, 0 data blocks written, 0 redo blocks read
Tue Jun 14 18:19:53 2011
LGWR: STARTING ARCH PROCESSES
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0 started with pid=16, OS id=7829
Tue Jun 14 18:19:53 2011
Thread 1 advanced to log sequence 1005 (thread open)
Thread 1 opened at log sequence 1005
  Current log# 3 seq# 1005 mem# 0: /flashcard/oradata/G10R2/onlinelog/o1_mf_3_6v34jpmp_.log
  Current log# 3 seq# 1005 mem# 1: /s01/flash_recovery_area/G10R2/onlinelog/o1_mf_3_6v34jpyn_.log
Successful open of redo thread 1
Tue Jun 14 18:19:53 2011
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC0: Becoming the heartbeat ARCH
Tue Jun 14 18:19:53 2011
SMON: enabling cache recovery
Tue Jun 14 18:19:53 2011
db_recovery_file_dest_size of 204800 MB is 6.81% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Tue Jun 14 18:19:54 2011
Successfully onlined Undo Tablespace 1.
Tue Jun 14 18:19:54 2011
SMON: enabling tx recovery
Database Characterset is UTF8
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 2
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=17, OS id=7831
Tue Jun 14 18:19:55 2011
Completed: alter database open

注意上述单实例Crash Recovery到数据库打开的整个过程:

  • alter database open
  • Beginning crash recovery of 1 threads
  • Started redo scan
  • Completed redo scan
  • Started redo application at
  • Completed redo application
  • Completed crash recovery at
  • SMON: enabling cache recovery
  • Successfully onlined Undo Tablespace 1
  • SMON: enabling tx recovery
  • Completed: alter database open

从上述步骤中我们可以看到三种恢复名词,即:

  • crash recovery
  • cache recovery
  • tx recovery

这和官方文档所描述的Crash Recovery概念是不一致的,我们现在来理清这几种recovery。

crash recovery包含对redo的scan和application,显然其完成的是Rolling Forward前滚的工作,告警日志中出现的crash recovery等同于官方文档中介绍的”cache recovery”,我们可以将” Completed crash recovery”看做前滚完成的标志。而tx recovery从字面就可以看出实际上是Transaction Recovery,tx recovery发生在Undo Tablespace online之后(回滚事务的前提是Undo可用),数据完成打开操作之前(“Completed: alter database open”)。注意tx recovery并不要求数据库打开前完成,仅仅是在数据库打开之前由smon启动(“SMON: enabling tx recovery”)。

剩下的唯一的问题是,这里的cache recovery是什么?显然它不是官方文档中所描述的”cache recovery”,几乎没有任何文档介绍存在这样一个recovery操作,这也是本文重点要介绍的。

我们来看另一个演示,这个演示用以说明cache recovery还存在于最普通的不包含Crash Recovery的数据库打开过程中:

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.
Total System Global Area 1065353216 bytes
Fixed Size                  2089336 bytes
Variable Size             486542984 bytes
Database Buffers          570425344 bytes
Redo Buffers                6295552 bytes
Database mounted.

SQL> alter database open;
Database altered.

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com

==================alert.log====================
alter database open
Tue Jun 14 18:43:52 2011
LGWR: STARTING ARCH PROCESSES
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0 started with pid=14, OS id=8133
Tue Jun 14 18:43:52 2011
Thread 1 opened at log sequence 1005
Current log# 3 seq# 1005 mem# 0: /flashcard/oradata/G10R2/onlinelog/o1_mf_3_6v34jpmp_.log
Current log# 3 seq# 1005 mem# 1: /s01/flash_recovery_area/G10R2/onlinelog/o1_mf_3_6v34jpyn_.log
Successful open of redo thread 1
Tue Jun 14 18:43:52 2011
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC0: Becoming the heartbeat ARCH
Tue Jun 14 18:43:52 2011
SMON: enabling cache recovery
Tue Jun 14 18:43:53 2011
Successfully onlined Undo Tablespace 1.
Tue Jun 14 18:43:53 2011
SMON: enabling tx recovery
Database Characterset is UTF8
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 2
replication_dependency_tracking turned off (no async multimaster replication found)
Tue Jun 14 18:43:53 2011
Incremental checkpoint up to RBA [0x3ed.624.0], current log tail at RBA [0x3ed.944.0]
Tue Jun 14 18:43:53 2011
Starting background process QMNC
QMNC started with pid=15, OS id=8135
Tue Jun 14 18:43:53 2011
Completed: alter database open

因为是clean shutdown,所以这里不存在crash recovery。但这里同样出现了”SMON: enabling cache recovery”,可见cache recovery是每次实例启动instance startup必要执行的一种恢复操作。但问题是,这个恢复操作到底针对何种对象?

实际上cache recovery所要恢复的是rowcache,也就是我们常说的字典缓存(dictionary cache)。关于这个结论,肯定有很多人要问我这样说的依据是什么,对应于这个”cache recovery”的问题,我们很难从google中得到一些启示,因为它和官方文档所描述的”cache recovery-rolling forward”存在重名的关系。

为了证明cache recovery所恢复的是rowcache,我们需要一个实证,从正式的系统中得到验证。要做到这一点是比较困难的,我们需要Oracle愿意把整个database open的过程变成慢动作来供我们参考,验证要用到一些调试工具,例如gdb或者dbx。

我们首先将实例启动到mount状态,并对执行startup的LOCAL进程做gdb的breakpoint断点调试:

SQL> shutdown abort;
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.
Total System Global Area 1065353216 bytes
Fixed Size                  2089336 bytes
Variable Size             486542984 bytes
Database Buffers          570425344 bytes
Redo Buffers                6295552 bytes
Database mounted.

找出LOCAL进程的系统进程号SPID

SQL> select spid from v$process
2  where addr in (
3  select paddr from v$session
4  where sid=(select distinct sid from v$mystat))
5  /

SPID
------------
8326

在实例startup nomount/mount后共享池的library cache就是可用的

SQL> select namespace from v$librarycache where gets!=0;

NAMESPACE
---------------
SQL AREA
TABLE/PROCEDURE

而rowcache则尚未被填充,因为字典缓存来源于自举对象(bootstrap$)和字典基表

SQL> select parameter,count,gets from v$rowcache where count!=0;
no rows selected

另开一个terminal窗口,并执行对LOCAL进程8326的gdb breakpoint调试

[oracle@rh2 ~]$ gdb $ORACLE_HOME/bin/oracle 8326
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /s01/db_1/bin/oracle...(no debugging symbols found)...done.
Attaching to program: /s01/db_1/bin/oracle, process 8326
Reading symbols from /s01/db_1/lib/libskgxp10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libskgxp10.so
Reading symbols from /s01/db_1/lib/libhasgen10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libhasgen10.so
Reading symbols from /s01/db_1/lib/libskgxn2.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libskgxn2.so
Reading symbols from /s01/db_1/lib/libocr10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libocr10.so
Reading symbols from /s01/db_1/lib/libocrb10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libocrb10.so
Reading symbols from /s01/db_1/lib/libocrutl10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libocrutl10.so
Reading symbols from /s01/db_1/lib/libjox10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libjox10.so
Reading symbols from /s01/db_1/lib/libclsra10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libclsra10.so
Reading symbols from /s01/db_1/lib/libdbcfg10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libdbcfg10.so
Reading symbols from /s01/db_1/lib/libnnz10.so...(no debugging symbols found)...done.
Loaded symbols for /s01/db_1/lib/libnnz10.so
Reading symbols from /usr/lib64/libaio.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaio.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x0000003181a0d8e0 in __read_nocancel () from /lib64/libpthread.so.0

输入断点kcrf_commit_force和kqlobjlod
(gdb) break kcrf_commit_force
Breakpoint 1 at 0x2724b6c

(gdb) break kqlobjlod
Breakpoint 2 at 0x1ac5e8c

在之前的terminal中执行数据库打开操作,因为breakpoint的关系这个open操作会hang住,
这时我们通过观察告警日志来了解恢复进度

SQL> alter database open;                              --这里会hang住

在gdb下输入continue,

(gdb) c
Continuing.

Breakpoint 1, 0x0000000002724b6c in kcrf_commit_force ()

观察告警日志可以发现redo application已经完成,但还未进入cache recovery阶段

alter database open
Tue Jun 14 19:14:33 2011
Beginning crash recovery of 1 threads
parallel recovery started with 2 processes
Tue Jun 14 19:14:33 2011
Started redo scan
Tue Jun 14 19:14:33 2011
Completed redo scan
39 redo blocks read, 74 data blocks need recovery
Tue Jun 14 19:14:33 2011
Started redo application at
Thread 1: logseq 1006, block 1155
Tue Jun 14 19:14:33 2011
Recovery of Online Redo Log: Thread 1 Group 1 Seq 1006 Reading mem 0
Mem# 0: /flashcard/oradata/G10R2/onlinelog/o1_mf_1_6v34jnkn_.log
Mem# 1: /s01/flash_recovery_area/G10R2/onlinelog/o1_mf_1_6v34jnst_.log
Tue Jun 14 19:14:33 2011
Completed redo application
Tue Jun 14 19:14:33 2011
Completed crash recovery at
Thread 1: logseq 1006, block 1194, scn 17200193
74 data blocks read, 74 data blocks written, 39 redo blocks read
Tue Jun 14 19:14:33 2011
LGWR: STARTING ARCH PROCESSES
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0 started with pid=17, OS id=8656
Tue Jun 14 19:14:33 2011
Thread 1 advanced to log sequence 1007 (thread open)
Thread 1 opened at log sequence 1007
Current log# 2 seq# 1007 mem# 0: /flashcard/oradata/G10R2/onlinelog/o1_mf_2_6v34jokt_.log
Current log# 2 seq# 1007 mem# 1: /s01/flash_recovery_area/G10R2/onlinelog/o1_mf_2_6v34jotq_.log
Successful open of redo thread 1
Tue Jun 14 19:14:33 2011
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC0: Becoming the heartbeat ARCH
db_recovery_file_dest_size of 204800 MB is 6.81% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Tue Jun 14 19:14:37 2011
Incremental checkpoint up to RBA [0x3ef.3.0], current log tail at RBA [0x3ef.3.0]

且此时rowcache仍未被填充

SQL> select parameter,count,gets from v$rowcache where count!=0;
no rows selected

在gdb界面下再次执行continue 2次

(gdb) c
Continuing.

Breakpoint 1, 0x0000000002724b6c in kcrf_commit_force ()
(gdb) c
Continuing.

Breakpoint 2, 0x0000000001ac5e8c in kqlobjlod ()

观察告警日志可以发现已开始cache recovery,但也卡陷在cache recovery上,这保证我们的演示不受骚扰

Tue Jun 14 19:16:44 2011
SMON: enabling cache recovery

此时rowcache中出现唯一的一个dc_objects对象

select parameter,count,gets from v$rowcache where count!=0;

PARAMETER                             COUNT       GETS
-------------------------------- ---------- ----------
dc_objects                                1          1

这个对象是什么呢?也许你已经猜到了,我们做一个rowcache dump来看一下:

SQL> ALTER SESSION SET EVENTS 'immediate trace name row_cache level 10';

================row_cache trace===================

BUCKET 43170:
row cache parent object: address=0x92326060 cid=8(dc_objects)
hash=f3d1a8a1 typ=11 transaction=(nil) flags=00000001
own=0x92326130[0x9230f628,0x9230f628] wat=0x92326140[0x92326140,0x92326140] mode=S
status=EMPTY/-/-/-/-/-/-/-/-
set=0, complete=FALSE
data=
00000000 4f42000a 5453544f 24504152 00000000 00000000 00000000 00000000
00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000
BUCKET 43170 total object count=1

可以看到上述dc_objects尚未完成加载(status=EMPTY & complete=FALSE ),那么这是一个什么object呢?

4f42000a 5453544f 24504152 => 转换为文本即:OB TSTO$PAR也就是BOOTSTRAP$

换而言之在cache recovery时第一个恢复的字典缓存对象是BOOTSTRAP$,这并不出乎我们的意料。

启动实例的LOCAL进程的等待事件为instance state change,这是常规情况下我们观察不到得

SQL> select event,p1text,p1 from v$session where wait_class!='Idle';

EVENT                                    P1TEXT                                           P1
---------------------------------------- ---------------------------------------- ----------
instance state change                    layer                                             2

在gdb界面下再次continue,将载入更多的rowcache对象

(gdb) c
Continuing.

Breakpoint 2, 0x0000000001ac5e8c in kqlobjlod ()

BUCKET 37:
row cache parent object: address=0x916cd980 cid=3(dc_rollback_segments)
hash=5fed2a24 typ=9 transaction=(nil) flags=000000a6
own=0x916cda50[0x916cda50,0x916cda50] wat=0x916cda60[0x916cda60,0x916cda60] mode=N
status=VALID/INSERT/-/FIXED/-/-/-/-/-
data=
00000000 00000000 00000001 00000009 59530006 4d455453 00000000 00000000
00000000 00000000 00000000 00000000 00000003 00000000 00000000 00000000
00000000 00000000 00000000 00000000
BUCKET 37 total object count=1

595300064d455453 -> SYSTEM 属于dc_rollback_segments 也就是著名的system回滚段

BUCKET 55556:
row cache parent object: address=0x916d8cd0 cid=8(dc_objects)
hash=ce89d903 typ=11 transaction=(nil) flags=00000001
own=0x916d8da0[0x9230f628,0x9230f628] wat=0x916d8db0[0x916d8db0,0x916d8db0] mode=S
status=EMPTY/-/-/-/-/-/-/-/-
set=0, complete=FALSE
data=
00000000 5f430006 234a424f 00000000 00000000 00000000 00000000 00000000
00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000
BUCKET 55556 total object count=1

5f430006 234a424f -> C_OBJ# 是著名的bootstrap对象之一,可以从$ORACLE_HOME/rdbms/admin/sql.bsq中找到它

create rollback segment SYSTEM tablespace SYSTEM
storage (initial 50K next 50K)
/
create cluster c_obj# (obj# number)
pctfree 5 size 800                           /* don't waste too much space */
/* A table of 32 cols, 2 index, 2 col per index requires about 2K.
* A table of 10 cols, 2 index, 2 col per index requires about 750.
*/
storage (initial 130K next 200k maxextents unlimited pctincrease 0)
/* avoid space management during IOR I */
/

我们还可以通过v$rowcache_parent视图来了解dictionary cache的情况

SQL> col cache_name for a20
SQL> col keystr for a31
SQL> set linesize 200
SQL> select address,cache_name,existent,lock_mode,saddr,substr(key,1,30) keystr from v$rowcache_parent;

ADDRESS          CACHE_NAME           E  LOCK_MODE SADDR            KEYSTR
---------------- -------------------- - ---------- ---------------- -------------------------------
00000000916CCE20 dc_tablespaces       N          0 00               000000000000000000000000000000
00000000916CD980 dc_rollback_segments Y          0 00               000000000000000000000000000000
0000000092326060 dc_objects           Y          0 00               000000000A00424F4F545354524150
00000000916D8CD0 dc_objects           N          3 000000009BD91328 000000000600435F4F424A23000000
00000000916DA830 dc_object_ids        Y          0 00               380000000000000000000000000000

可以看到持有row cache lock的会话是'000000009BD91328',
且该dc_objects对象还处于non-existent状态,
换而言之真正装载rowcache的是启动实例的LOCAL服务进程

SQL> select sid,program,event,p1,p2,p3 from v$session where saddr='000000009BD91328';

SID PROGRAM                                          EVENT                                    P1   P2 P3
----- ------------------------------------------------ ---------------------------------------- -- ---- --
3294 sqlplus@rh2.oracle.com (TNS V1-V3)               db file scattered read                    1  378  3

该进程正在等待db file scattered read,fileid->1,block-378,这些块属于BOOTSTRAP$表

BOOTSTRAP$对象已从rowcache被载入到library cache中

SQL> select kglhdadr,kglnaobj from x$kglob where kglobtyp=2 and kglnaobj not like 'X$%';

KGLHDADR             KGLNAOBJ
-------------------- --------------------
0000000092326990     BOOTSTRAP$

SQL> select owner||'.'||Name from v$db_object_cache where type='TABLE' and name not like 'X$%';

OWNER||'.'||NAME
--------------------------------------------------------------------------------
SYS.BOOTSTRAP$

初步总结:

  1. 在数据库正式open前需要恢复字典缓存,这个步骤被称为cache recovery,其实是row cache recovery。与官方文档中描述的”cache recovery”不同,row cache recovery应当是Oracle Internal的叫法。
  2. 实际执行row cache recovery的不是SMON进程,而是启动实例的服务进程

【书籍推荐】Oracle 8i Internal Services

直到今天仍是经典,无可替代,绝不落伍。想了解Oracle internal 必读的一本书,每次翻开影印本都会有所收获,感谢作者的无私贡献!

GDE Error: Error retrieving file - if necessary turn off error checking (404:Not Found)

Know more about redo log buffer and latches

1.
The Total size of the log buffer is determined by LOG_BUFFER parameter.

2.
Only Server process may pin a data block in exclusive mode.

3.
LGWR writes to the redo log files when:

  • The redo log buffer is 1/3 full.
  • 1 MB of data have been written to the redo log buffer.
  • A 3-second time-out occurs.
  • commit!
  • before dbwr write out dirty buffers
  • A thread is closed

 

4.
the correct sequence for the LGWR algorithm is :

  1. Acquire the redo writing and redo allocation latches
  2. determine the buffer to write out
  3. Release the redo allocation latch
  4. determine how many writes are required

 

5.
In Oracle8i, the redo copy latch is always acquired regardless of the redo size.

Because the parameter LOG_SMALL_ENTRY_MAX_SIZE is obsolete, a redo copy latch is always acquired.

6.
Three most relevant redo events under normal operations:

  • Log file parallel write
  • Log buffer space
  • Log file sync

 

7.
relevant redo statistics and their purpose

  • REDO writes                                                 === Number of times the log buffer is written
  • redo blocks written                                === Number of times the log buffer is written
  • redo write time                                          === Total time required to write all the redos to disk
  • redo buffer allocation retires         === Total number of retries necessary to allocate space in the redo buffer

 

8.
Log buffer contention is typically indicated by Redo buffer allocation retries .

The buffer allocation retries indicates that a process has to continually try to allocate buffers that are not available

9.
Redo buffer allocation retries can be gathered by querying the name and value columns from V$ dynamic views:
V$sesstat
V$sysstat

10.
Stripe the redo log files across physical disks will help flush the redo buffer faster.

The redo log file write batch is 128K, so striping the redo logs across 8 disks with a stripe size of 16K could improve the write time by as much as 75%.

11.
The number of redo copy latches can be defined by setting _Log_simultaneous_copies .

The parameter begins with an underscore because in 8i it is a hidden parameter.

12.
The purpose of the redo writing latch is to: Prevent multiple processes from posting
LGWR when there is no space in the buffer .

The redo writing latch frees space in log buffer and does not want
processes writing to the buffer while space is being freed

13.
_log_io_size is set to reduce contention for the redo writing latch.

But notice that it is a hidden parameter. This is the number of used redo blocks that will automatically initiate a log write.

14.
A standby database must be refreshed following a nologging operation.

When nologging occurs, the redo logs of the standby database are not written to.

15.
log_small_entry_max_size is used to determine if a redo copy latch is required based upon the redo entry size,

however, this parameter is obsolete in Oracle8i, as a copy latch is always acquired.

_log_simultaneous_copies determines the number of redo copy latches to allocate.

Note that this is a hidden parameter in Oracle8i because it is not recommended to have more than one copy latch.

log_checkpoint_interval determines the need for a checkpoint.

EVENT 10235:"check memory manager internal structures"

Event:10235                     
~~~~~~~~~~~ 

Version/Use: 

  7.0 - 10.1.X   Check memory manager internal structures. 

  7.0 - 10.1.X "Check memory manager internal structures" 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

  NOTE: Events should NEVER be set by customers unless advised to do so by


      Oracle Support Services. Read [NOTE:75713.1] before setting any event. 

Summary Syntax: 

~~~~~~~~~~~~~~~ 

  EVENT="10235 trace name context forever,  level LL" 

  (Always comment exactly when and why this event is being set) 

   ** IMPORTANT: Do **NOT** use ALTER SESSION SET EVENTS or ORADEBUG 

                syntax to set this event in sessions. This can cause 

                lots of ORA-600 errors against SGA heaps as not all  

                sessions using the SGA heaps will be using the same event 

                level. This applies to ALL levels except level 65536. 

Levels: 

~~~~~~~ 

  The event being set at all causes certain heap checks to be performed. 

   ***  WARNING *********************************************************** 

    ***   This event should only EVER be set at the request of Oracle Support. 

    ***   It can impact performance on most types of system. 

    ***   Level 2 and above can impact latch contention. 

    ***   Level 3 and above can have a *SEVERE* impact on performance. 

  ************************************************************************ 

  The bottom 3 bits of the level cause the following checks to occur: 

     ~~~~~         ~~~~~~~~~~~~ 

     Level         Description 

     ~~~~~         ~~~~~~~~~~~~ 

        1             Fast check on heap free (kghfrh) 

        2             Do 1 AND fill memory with junk on alloc / free 

        3             Do 2 AND ensure the chunk belongs to given heap on free 

        4             Do 3 AND make permanent chunks freeable so they can  

                      also be checked 

                       This level can give rise to increased memory use 

                       and can trigger false ORA-4030 and false ORA-4031 

                       errors. 

 

  Oracle 9205 onwards only: 

    65536             This is introduced by the diagnostic enhancement in 

                      bug 3293155. It is a totally independent bit setting 

                      which has minimal impact on performance (unless ORed  

                      with other levels). When this is set Oracle tries to 

                      keep comments with "permanent" memory allocations 

                      which can be useful for memory leak problems if the 

                      leaked memory appears to be a leak of "perm" memory. 

                      This level can be set/unset dynamically but will only  

                      store comments in "perm" memory allocated when the  

                      event is set. 

 

  There are additional values which Oracle Support can use. 

 

Description/Steps: 

~~~~~~~~~~~~~~~~~~ 

  This event may be used to try to catch HEAP corruption problems closer  

  to when they occur.  Typically level 12 is required to get close to the 

  corruption but this can impact performance too much to be useful. 

 

Example Output / Interpreting Output: 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

  The event should cause an ORA-600 and heapdump to be produced if an 

  error is detected.  

Related: 

~~~~~~~~ 

 

Know more about Oracle Latches

Two purposes of latches were:Serialize Execution,Serialize Access

Match the characteristics with latches and enqueues.

Several modes of access                   Enqueue
FIFO queue acquisition                    Enqueue
Mostly exclusive access                   Latch
Non-deterministic acquisition             Latch
No atomicity                              Enqueue
Atomicity                                 Latch

Most latches have a level between 0 and 8.

Usually a process can request a latch X only:After obtaining latch Y and if level X > level Y.
Remember that the level of X, which is child latch, must be greater than level Y, which is most likely a parent latch.

The structure for parent and child latches can be seen by querying X$KSLLT .
There is one KSLLT structure for each parent and child latch in an instance.

The characteristic of latch acquisition in NO-WAIT mode is:
One attempt to get a latch, no spin or sleep .
This is typically for latches where there are many of the same type. The process will attempt to acquire another latch.

The sleep time on latches approximately doubles with each wait until upper limit is reached. This is an exponential backoff.

The statistic waiters_woken counts the number of times a posting for a latch has occurred.
It can be found in V$LATCH.This view contains aggregate statistics for latches

PMON invokes a clean up function for releasing latches if the holding process dies.PMON will initiate the clean up function.

_MAX_SLEEP_HOLDING_LATCH The maximum exponential backoff when waiting for a latch and already holding anothers
_MAX_EXPONENTIAL_SLEEP Controls the maximum time a process has to sleep before requesting the latch again
_SPIN_COUNT Designates the number of times a process will spin while trying to acquire a latchs

V$SESSION_WAIT dynamic performance view gives real time information regarding what sessions are currently waiting or have just waited for an event.

When diagnosing latch contention, we should typically look for Latches with the most impact.

We should use below formula  to calculate the total cost of latch spinning:

_spin_count * sleeps/misses

Statistics from V$LATCH with their associated description:

IMMEDIATE_MISSES Number of times the fast get failed when requested in “no-wait” mode

MISSES Number of times the fast get call failed when requested in “willing-to-wait” modes

SPIN_GETS Number of times where the fast get failed but the latch was acquired during the first spin before having to yield to the CPU

GETS Number of times the latch was obtained when requested in “willing-to-wait” mode

SLEEPi Number of times a “willing-to-wait” request had to sleep I times

IMMEDIATE_GETS Number of times the latch was obtained when requested in “no-wait” mode

We can set event 10005 to perform latch tracing, or oradebug dump latches 3 ;

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com

[oracle@rh2 ~]$ oerr ora 10005
10005, 00000, "trace latch operations for debugging"
// *Document: NO
// *Cause:
// *Action: Enable tracing for various latch operations
// *Comment:
//    level 1 - trace latch gets and frees
//    level 4 - trace multiple posts by processes when latch is freed
//

Wait and latch (KSL) layer (10005)

We can set events 10005 (trace latch gets and frees) and 10073 (have PMON
dump info before latch cleanup).  Level does not matter for these events.
We can also set event="600 trace name LATCHES level 1" to dump latch info
when PMON hits the error (I'm not so sure this one is going to work).

 event = "600 trace name latches level 10"
 event = "10005 trace name context forever, level 1"

10005 event has been removed above 9i , try dump KSTDUMPCURPROC 1

SQL> oradebug setmypid;
SQL> oradebug unlimit;
SQL> oradebug dump KSTDUMPCURPROC 1
SQL> oradebug tracefile_name;

KSTDUMP: In-memory trace dump
TIME(usecs):SEQ# ORAPID   SID EVENT  OP DATA
========================================================================
D9396B13:000044F8    14     0 10005   1 KSL WAIT BEG [PX Idle Wait] 200/0xc8 146/0x92 0/0x0
D9573C3D:00004568    14     0 10005   2 KSL WAIT END [PX Idle Wait] 200/0xc8 146/0x92 0/0x0 time=1954087
D9573C4F:00004569    14     0 10005   1 KSL WAIT BEG [PX Idle Wait] 200/0xc8 147/0x93 0/0x0
D9750D82:000045FC    14     0 10005   2 KSL WAIT END [PX Idle Wait] 200/0xc8 147/0x93 0/0x0 time=1954097
D9750D8F:000045FD    14     0 10005   1 KSL WAIT BEG [PX Idle Wait] 200/0xc8 148/0x94 0/0x0
D992DEA2:00004639    14     0 10005   2 KSL WAIT END [PX Idle Wait] 200/0xc8 148/0x94 0/0x0 time=1954065
D992DEB2:0000463A    14     0 10005   1 KSL WAIT BEG [PX Idle Wait] 200/0xc8 149/0x95 0/0x0
D9B0AFD8:000046A9    14     0 10005   2 KSL WAIT END [PX Idle Wait] 200/0xc8 149/0x95 0/0x0 time=1954084
D9B0AFEB:000046AA    14     0 10005   1 KSL WAIT BEG [PX Idle Wait] 200/0xc8 150/0x96 0/0x0
D9CE8110:00004742    14     0 10005   2 KSL WAIT END [PX Idle Wait] 200/0xc8 150/0x96 0/0x0 time=1954083
D9CE811E:00004743    14     0 10005   1 KSL WAIT BEG [PX Idle Wait] 200/0xc8 151/0x97 0/0x0
D9EC5246:00004790    14     0 10005   2 KSL WAIT END [PX Idle Wait] 200/0xc8 151/0x97 0/0x0 time=1954086
D9ECA80D:00004792    14     0 10280   8 kst: process state object about to be deleted
E308018E:00006780    14     0 10280   1 kst: process state object created on 06-30 22:11:24.211
E308018F:00006781    14     0 10280   3 kst: process info: ospid=10380 pso_num=14 pso_serial#=2
E30801D0:00006782    14     0 10420   1 kso: new process: pid=10380 (legacy spawn)
E4ACF279:00006D4A    14     0 10280   8 kst: process state object about to be deleted
E4B2CD42:00006D4B    14     0 10280   1 kst: process state object created on 06-30 22:11:52.853
E4B2CD43:00006D4C    14     0 10280   3 kst: process info: ospid=10382 pso_num=14 pso_serial#=3
E4B2CDA3:00006D4D    14     0 10420   1 kso: new process: pid=10382 (legacy spawn)
KSTDUMP: End of in-memory trace dump


**** LGWR trace
05A46022:0000DEB0     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[58] comment=[Redo writer log switch operations]
05A46025:0000DEB1     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[59] comment=[Redo writer generate offline immed marker]
05A46026:0000DEB2     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[170] comment=[kfr Poke LGWR]
05A46027:0000DEB3     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[171] comment=[kfr Incr Ckpt]
05A46029:0000DEB4     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[172] comment=[kfr ACD relocation]
05A4602A:0000DEB5     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
05A4602A:0000DEB6     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[63] comment=[Scumnt mount lock]
05A4602B:0000DEB7     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[64] comment=[Poll system events broadcast channel]
05A4602E:0000DEB8     6   221 10005   1 KSL WAIT BEG [rdbms ipc message] 300/0x12c 0/0x0 0/0x0
05D117DB:0000DF45     6   221 10005   2 KSL WAIT END [rdbms ipc message] 300/0x12c 0/0x0 0/0x0 time=2930603
05D117E6:0000DF46     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
05D117E7:0000DF47     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[63] comment=[Scumnt mount lock]
05D117E8:0000DF48     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[64] comment=[Poll system events broadcast channel]
05D117ED:0000DF49     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[0] comment=[Monitor Cleanup]
05D117EE:0000DF4A     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[56] comment=[Redo writer IO's]
05D117FE:0000DF4B     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[57] comment=[Redo writer post action]
05D117FF:0000DF4C     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[58] comment=[Redo writer log switch operations]
05D11803:0000DF4D     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[59] comment=[Redo writer generate offline immed marker]
05D11804:0000DF4E     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[170] comment=[kfr Poke LGWR]
05D11805:0000DF4F     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[171] comment=[kfr Incr Ckpt]
05D11807:0000DF50     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[172] comment=[kfr ACD relocation]
05D11808:0000DF51     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
05D11808:0000DF52     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[63] comment=[Scumnt mount lock]
05D11809:0000DF53     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[64] comment=[Poll system events broadcast channel]
05D1180C:0000DF54     6   221 10005   1 KSL WAIT BEG [rdbms ipc message] 300/0x12c 0/0x0 0/0x0
05FDCFC0:0000DFE1     6   221 10005   2 KSL WAIT END [rdbms ipc message] 300/0x12c 0/0x0 0/0x0 time=2930610
05FDCFCB:0000DFE2     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
05FDCFCC:0000DFE3     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[63] comment=[Scumnt mount lock]
05FDCFCD:0000DFE4     6   221 10254  71 KSBCTI: (LGWR) : (interrupt action) : acnum=[64] comment=[Poll system events broadcast channel]
05FDCFD1:0000DFE5     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[0] comment=[Monitor Cleanup]
05FDCFD2:0000DFE6     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[56] comment=[Redo writer IO's]
05FDCFEF:0000DFE7     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[57] comment=[Redo writer post action]
05FDCFF0:0000DFE8     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[58] comment=[Redo writer log switch operations]
05FDCFF3:0000DFE9     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[59] comment=[Redo writer generate offline immed marker]
05FDCFF4:0000DFEA     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[170] comment=[kfr Poke LGWR]
05FDCFF5:0000DFEB     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[171] comment=[kfr Incr Ckpt]
05FDCFF7:0000DFEC     6   221 10254  73 KSBCTI: (LGWR) : (timeout action)   : acnum=[172] comment=[kfr ACD relocation]

Know more about Buffer Cache and Latch

We can examine X$BH table to obtain buffer header information,The BH stands for buffer header.

Structures that maintain a list of data buffer headers are called:Hash buckets.

Hash buckets are grouped by relative DBA and class number.
Hash chains list data buffer headers in one hash bucket.

Cache Recycle Pool For randomly accessed large tables

Default Pool For normally accessed tables

Keep Pool for frequently updated small tables

Set DB_BLOCK_LRU_LATCHES parameter to create multiple LRU lists.This parameter sets the number of latches,
and it is best used on a multi-CPU system.

DB_BLOCK_BUFFERS sets the number of buffers to allocate in the buffer cache.
DB_BLOCK_LRU_EXTENDED_STATISTICS is obsolete in Oracle8i.
DB_WRITER_PROCESSES sets the number of database writer processes.

Cache buffers LRU chain protects LRU lists.

The total number of working sets per instance is defined by DB_BLOCK_LRU_LATCHES. There is one latch per set.

The number of DBWR processes cannot be greater than the number of working sets.

Oracle get buffers as below steps:

  1. get a buffer descriptor
  2. specify a mode in which a buffer should be obtained
  3. scan the appropriate hash chain
  4. find the appropriate buffer in the chain or read from disk
  5. attach a state object to the buffer header

Buffers are initially hashed to LRU_AUX.This contains buffers that are candidates for reuse.
LRU_MAIN List houses buffers that are in use (pinned or dirty).
LRU_W List is the write list for dirty buffers.
LRU_XR List is the reuse range list for buffers that are to be written for reuse.

DBWR processes operate on working sets that are assigned to them in a cyclical manner.

There are three reasons may cause DBWR write buffers to disk:

  • To provide checkpoints
  • To do free requests
  • To do ping writes(obsolete)

The write batch size controls the number of asynchronous writes slots allocated to each DBWR.The write batch size controls the number of asynchronous writes slots allocated to each DBWR.

Two Most important DBWR statistics: DBWR make free request,DBWR lru scans 。

Two circumstances where DBWR purges dirty buffers whose DBA falls between the lowest and highest DBA of the datafiles:

  • ALTER TABLESPACE…BEGIN BACKUP
  • Make a tablespace read only

Of the nine latches that protect the buffer cache,the three that are most important:

  • Cache buffers lru chain
  • Cache buffers chains
  • Checkpoint queue latch

each of the fixed tables below with its purpose:

  • X$KCBWAIT         Wait statistics by block class
  • X$KCBFWAIT      Wait statistics by file id
  • X$KCBBHS              DBWR histogram statistics
  • X$KCBWBPD         Buffer pool descriptors

A target buffer cache hit ratio is above 90%.
But 60% could be a valid hit ratio for a DSS or Data Warehouse application.

Two primary events that relate to the buffer cache:

  • Buffer busy waits
  • Free buffer waits

We can use the views and table:V$SYSTEM_EVENT, V$WAITSTAT, and X$KCBFWAIT to determine the reason and block class for buffer busy waits.

VIEW: X$KTUXE – Transaction Entry (table)

View:   X$KTUXE
         [K]ernel [T]ransaction [U]ndo
           Transa[x]tion [E]ntry (table)

  This view is very useful as it gives an indication of the state
  of the transaction tables in the rollback segment headers.
  The information here can be used to see the state of transactions
  requiring transaction recovery do not show in <View:V$TRANSACTION>
  <Event:10013> may be useful to trace transaction recovery.

 Column          Type               Description
 --------        ----               --------
 ADDR            RAW(4|8)           address of this row/entry in the array or SGA
 INDX            NUMBER             index number of this row in the fixed table array
 INST_ID         NUMBER       8.x   oracle instance number

Transaction ID
 KTUXEUSN        NUMBER             undo seg number
       KUSNOLTP UB2MAXVAL           is no-undo xac
 KTUXESLT        NUMBER             slot number
       KSLTINV  UB2MAXVAL           not a valid slot num
 KTUXESQN        NUMBER             wrap number, is savept # at start if no-undo xac
                                     position within transaction

 KTUXERDBF       NUMBER             relative File
 KTUXERDBB       NUMBER             relative Block
 KTUXESCNB       NUMBER             SCN base for prepare/commit
 KTUXESCNW       NUMBER             SCN wrap for prepare/commit
 KTUXESTA        VARCHAR2(16)       Transaction Status
 KTUXECFL        VARCHAR2(24)       Transaction flags
 KTUXEUEL        NUMBER             Used for extent of tx and link to commit list

Distributed tx: collecting dba and undo bk to start retrieving collecting info rec
 KTUXEDDBF       NUMBER             relative file
 KTUXEDDBB       NUMBER             relative dba

Parent transaction id
 KTUXEPUSN       NUMBER       8.x   undo seg number
       KUSNOLTP UB2MAXVAL           is no-undo xac
 KTUXEPSLT       NUMBER       8.x   slot number
       KSLTINV  UB2MAXVAL           not a valid slot num
 KTUXEPSQN       NUMBER       8.x   wrap number, is savept # at start if no-undo xac
                                      position within transaction

 KTUXESIZ        NUMBER       8.1  number of undo blocks used by the transaction

Notes:

To see any DEAD transactions for deferred transaction recovery
  after startup:

      select * from x$ktuxe where ktuxecfl='DEAD';

沪ICP备14014813号

沪公网安备 31010802001379号

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569