【Oracle数据恢复】ORA-00600 [6749]一例

From 论坛用户:

 

 

RAC 环境

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production


下午在告警日志中看到如下信息:

Thread 1 advanced to log sequence 16646 (LGWR switch)
  Current log# 102 seq# 16646 mem# 0: +ORADATA/portaldb/redo02.log
Thu Sep 04 16:48:52 2014
Errors in file /oracle/diag/rdbms/portaldb/portaldb1/trace/portaldb1_ora_40698472.trc  (incident=642694):
ORA-00600: internal error code, arguments: [6749], [3], [37923665], [163], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/portaldb/portaldb1/incident/incdir_642694/portaldb1_ora_40698472_i642694.trc
Thu Sep 04 16:48:56 2014
Dumping diagnostic data in directory=[cdmp_20140904164856], requested by (instance=1, osid=40698472), summary=[incident=642694].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Sep 04 16:48:58 2014
Sweep [inc][642694]: completed
Sweep [inc2][642694]: completed
Thu Sep 04 16:53:14 2014
Thread 1 advanced to log sequence 16647 (LGWR switch)
  Current log# 103 seq# 16647 mem# 0: +ORADATA/portaldb/redo103.log


找到了portaldb1_ora_40698472.trc (大约1.9G) 和portaldb1_ora_40698472_i642694.trc(11M左右) 两个文件 ;

oracle@ptdb1:/home/oracle/awrrpt_pack$du -sm portaldb1_ora_40698472.trc
1830.01 portaldb1_ora_40698472.trc                

oracle@ptdb1:/home/oracle/awrrpt_pack$du -sm portaldb1_ora_40698472_i642694.trc
11.46   portaldb1_ora_40698472_i642694.trc


portaldb1_ora_40698472.trc这个文件是去不来了我把前面重要的几行复制下来;

UMP REDO
Opcodes 11.*
DBAs (file#, block#):
(9, 174929) .
SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff
**NOTE: Only Dumping Redo less then 12 hours**
Times: 09/04/2014 04:41:29 thru eternity
Initial buffer sizes: read 1024K, overflow 832K, change 805K
Thread 1 low checkpoint scn: 0x0001.4c4c6be0
Thread 2 low checkpoint scn: 0x0001.4c4cb42f
SCN Start Scan Point: scn: 0x0001.4c4cb42f (5575062575)
Initial buffer sizes: read 1024K, overflow 832K, change 805K
Initial buffer sizes: read 1024K, overflow 832K, change 805K
INCARNATION:
  START: scn: 0x0000.00000001 (1) Timestamp:  06/01/2013 17:58:46
  END: scn: 0xffff.ffffffff
descrip:"Thread 0001, Seq# 0000016485, SCN 0x00014c4c6be0-0x00014c4f53f2"
descrip:"Thread 0002, Seq# 0000018304, SCN 0x00014c4cb42f-0x00014c4ea321"

*** 2014-09-04 16:41:30.442
*Error - Unable to open log for Thread 2 at SCN: scn: 0x0001.4c4ea321 (5575189281)
END OF DUMP REDO
Dumping current redo log in thread 1
Initial buffer sizes: read 1024K, overflow 832K, change 805K

DUMP OF REDO FROM FILE '+ORADATA/portaldb/redo02.log'
Opcodes 11.*
DBAs (file#, block#):
(9, 174929) .
RBAs: 0x000000.00000000.0000 thru 0xffffffff.ffffffff.ffff
SCNs: scn: 0x0001.4c4cb42f (5575062575) thru scn: 0xffff.ffffffff
Times: 09/04/2014 04:41:29 thru eternity
FILE HEADER:
        Compatibility Vsn = 186646528=0xb200000
        Db ID=663718102=0x278f88d6, Db Name='PORTALDB'
        Activation ID=663749334=0x279002d6
        Control Seq=1800459=0x1b790b, File size=1024000=0xfa000
        File Number=102, Blksiz=512, File Type=2 LOG
descrip:"Thread 0001, Seq# 0000016641, SCN 0x0001502f1901-0xffffffffffff"
thread: 1 nab: 0xffffffff seq: 0x00004101 hws: 0x1 eot: 1 dis: 0
resetlogs count: 0x30b21356 scn: 0x0000.00000001 (1)
prev resetlogs count: 0x0 scn: 0x0000.00000000
Low  scn: 0x0001.502f1901 (5640231169) 09/04/2014 16:39:02
Next scn: 0xffff.ffffffff 01/01/1988 00:00:00
Enabled scn: 0x0000.00000001 (1) 06/01/2013 17:58:50
Thread closed scn: 0x0001.502f1901 (5640231169) 09/04/2014 16:39:02


20:05:59 sys@PORTALDB> column SEGMENT_NAME format a60
20:07:10 sys@PORTALDB> column owner format a10
20:07:38 sys@PORTALDB> column TABLESPACE_NAME format a10
20:07:49 sys@PORTALDB> SELECT tablespace_name,segment_type,owner,segment_name FROM dba_extents  wHERE file_id =9   and 174929 between block_id AND block_id + blocks - 1;

TABLESPACE SEGMENT_TYPE       OWNER      SEGMENT_NAME
---------- ------------------ ---------- ------------------------------------------------------------
NEANDS     TABLE              NEANDS3    M_PRODUCT_ALIAS           //该表为条码表

Elapsed: 00:00:04.09


20:08:07 sys@PORTALDB>  select table_name, iot_name from all_tables where table_name='M_PRODUCT_ALIAS';

TABLE_NAME           IOT_NAME
-------------------- ------------------------------
M_PRODUCT_ALIAS

21:59:18 sys@PORTALDB> select do.owner,do.object_name, do.object_type,sysind.flags
22:10:33   2       from dba_objects do, sys.ind$ sysind
22:10:33   3       where do.object_id = sysind.obj#
22:10:33   4       and bitand(sysind.flags,4096)=4096;

no rows selected

其中portaldb1_ora_40698472_i642694.trc文件中的这个语句是不是造成这个报错的原因呢?已经把trc文件 和这个表给开发那边了,让开发那边也看看,

*** 2014-09-04 16:48:52.213
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=f40kjffsrvhvp) -----
UPDATE M_PRODUCT_ALIAS A SET A.IS_TOWMS = 'N' WHERE A.M_PRODUCT_ID = :B1 

我还摸不着头绪,把故障现象发到这里,希望大家帮我看看,谢谢

 

 
odm finding :
Format: ORA-600 [6749] [a]

VERSIONS:
versions 7.3 to 8.1

DESCRIPTION:

This internal occurs during the row change for update or delete.
Some reported problems are consistent read issue, some bugs are related to
row corruption on the block.

FUNCTIONALITY:
ROW DELETE

IMPACT:
PROCESS FAILURE
ROW CORRUPTION
[6749], [3], [37923665], [163], [], [], [], [], [], [], [], []
3 代表 row no longer exists 行不再存在
37923665 代表dba
C 是 行的slot number

由于11.2.0.3 上似乎没有对应的Bug,考虑对该表做ANALYZE TABLE <table_name> VALIDATE STRUCTURE CASCADE; 并flush buffer_cache; 如果错误仍存在,考虑重建该表

 
7705591 Corruption with self-referenced row in MSSM tablespace. Wrong Results / OERI[6749] / ORA-8102
UPDATE M_PRODUCT_ALIAS A SET A.IS_TOWMS = ‘N’ WHERE A.M_PRODUCT_ID = :B1

Dump continued from file: /oracle/diag/rdbms/portaldb/portaldb1/trace/portaldb1_ora_40698472.trc
ORA-00600: internal error code, arguments: [6749], [3], [37923665], [163], [], [], [], [], [], [], [], []
pin: ‘kduwh18: kdu_array_flush’ waiters: 1
addr: 0x7000001ff813088 obj: 161302 cls: DATA bscn: 0x1.50311147
buffer tsn: 15 rdba: 0x0242ab51 (9/174929)
scn: 0x0001.50311147 seq: 0x01 flg: 0x04 tail: 0x11470601
frmt: 0x02 chkval: 0x2018 type: 0x06=trans data
ock header dump: 0x0242ab51
Object id on Block? Y
seg/obj: 0x27616 csc: 0x01.503110d9 itc: 8 flg: – typ: 1 – DATA
fsl: 0 fnx: 0x0 ver: 0x01

0x158:pri[163] offs=0x85c
tab 0, row 163, @0x85c
tl: 2 fb: –HDFL– lb: 0x1

 

 

如果自己搞不定可以找诗檀软件专业ORACLE数据库修复团队成员帮您恢复!

诗檀软件专业数据库修复团队

服务热线 : 13764045638   QQ号:47079569    邮箱:service@parnassusdata.com

latch free:SQL memory manager latch

一套HP-UX上的10.2.0.4 RAC系统,其中一个节点出现大量的latch free:SQL memory manager latch等待事件, 相关ADDM性能信息:

 

SELECT KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ) SIZ,
To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')||'k' "AVG SIZE"
FROM X$KSMSP
GROUP BY KSMCHCLS;
CLASS NUM SIZ AVG SIZE
-------- ---------- ---------- ------------
R-freea 927 9481080 9.99k
freeabl 1037478 2001975328 1.88k
R-free 593 170107192 280.14k
recr 515080 595805056 1.13k
R-perm 498 35345056 69.31k
R-recr 2 2224 1.09k
perm 97848 1320357168 13.18k
free 251577 161871664 .63k

DETAILED ADDM REPORT FOR TASK 'TASK_21286' WITH ID 21286
--------------------------------------------------------

Analysis Period: 26-OCT-2009 from 10:00:52 to 11:00:12
Database ID/Instance: 2429423994/1
Database/Instance Names: VPROD/VPROD
Host Name: crmdb1
Database Version: 10.2.0.4.0
Snapshot Range: from 10253 to 10254
Database Time: 420176 seconds
Average Database Load: 118 active sessions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FINDING 1: 75% impact (313574 seconds)
--------------------------------------
Database latches in the "Other" wait class were consuming significant database
time.

RECOMMENDATION 1: DB Configuration, 75% benefit (313574 seconds)
RATIONALE: The latch "SQL memory manager latch" with latch id 289 was
one of the top latches contributing to high "latch free" waits.
RELEVANT OBJECT: database latch# 289

SYMPTOMS THAT LED TO THE FINDING:
SYMPTOM: Wait class "Other" was consuming significant database time.
(75% impact [313620 seconds])

FINDINGS
========
Load Profile

Per Second Per Transaction
Redo size: 534,863.07 8,586.10
Logical reads: 527,828.57 8,473.18
Block changes: 13,065.49 209.74
Physical reads: 3,999.96 64.21
Physical writes: 617.50 9.91
User calls: 9,869.10 158.43
Parses: 3,287.89 52.78
Hard parses: 27.90 0.45
Sorts: 1,564.66 25.12
Logons: 4.35 0.07
Executes: 5,423.19 87.06

COMMENTS
==========
Load in terms of GETS is tremendous and no particular SQL is responsible for a
large percentage of the load. It may simply be too much for the hardware/configuration.

HOWEVER
=========
I/O while moderate in volume is very slow.

Tablespace IO Stats

* ordered by IOs (Reads + Writes) desc

Tablespace Reads Av Reads/s Av Rd(ms) Av Blks/Rd Writes Av Writes/s Buffer Waits Av Buf Wt(ms)
------------------------------------------------------------------------------------------------------
TBS_CRM_TS_S1 8,340,219 772 4.48 2.95 356,810 33 17,284,247 2.15
TBS_CRM_IDX_S1 1,643,104 152 7.89 1.22 298,265 28 22,684 4.34
TEMP 762,343 71 1.09 6.67 411,652 38 0 0.00
TBS_BAKDATA 358,137 33 6.00 4.11 93,672 9 335,632 6.32
TBS_RM_TS_S1 400,528 37 9.57 1.09 39,000 4 1,546 10.49
TBS_RM_IDX_S1 399,888 37 8.71 1.01 33,342 3 1,577 9.74
TBS_PROD 422,348 39 3.68 1.00 941 0 5 12.00
TBS_SP_TS_S1 341,104 32 6.26 12.19 504 0 44 5.68
TBS_BFBAKDATA 187,013 17 2.34 7.49 909 0 104,797 2.75
TBS_PF 141,810 13 8.76 1.79 8,571 1 563 11.39
TBS_BSS_SYS 118,965 11 2.26 14.72 2,171 0 7 0.00

So about 1/3 typical performance in terms of Av Rd(ms).

SEGMENTS
==========
Segments by Logical Reads

* Total Logical Reads: 5,699,755,248
* Captured Segments account for 72.1% of Total

Owner Tablespace Name Object Name Subobject Name Obj. Type Logical Reads %Total
CRM TBS_CRM_TS_S1 PROD_2_TD TABLE 454,967,408 7.98
RM TBS_RM_TS_S1 PHONE_NUMBER TABLE 435,121,200 7.63
EVENT TBS_CRM_TS_S1 CUST_INTERACTION TABLE 386,838,512 6.79
CRM TBS_CRM_TS_S1 CO_2_TD TABLE 383,932,304 6.74
CRM TBS_CRM_TS_S1 PARTY_IDENTIFY_EXTEND TABLE 298,659,184 5.24

So 5 individual segments are the target of 72% of all gets.

.
ACTION PLAN
============
Please do:

(1) Disk I/O
-----------------
Please ask your SA why Disk I/O is so slow.

(2) Segements
---------------------
Please consider partitioning these 5 objects to spread out GETS and READS:

Owner Tablespace Name Object Name
---------------------------------------------
CRM TBS_CRM_TS_S1 PROD_2_TD
RM TBS_RM_TS_S1 PHONE_NUMBER
EVENT TBS_CRM_TS_S1 CUST_INTERACTION
CRM TBS_CRM_TS_S1 CO_2_TD
CRM TBS_CRM_TS_S1 PARTY_IDENTIFY_EXTEND

 

诊断发现shared pool共享池有较多的空闲内存,反倒是IO响应速度的Av Rd(ms)要慢于典型的1 ms to 5 ms。

 

需要SA系统管理员进一步确认存储IO是否存在性能问题。

 

文档《Note.457063.1 Slow I/O On HP Unix》介绍了HP-UX平台上IO性能瓶颈的一些解决路径,引用如下:

 

 Slow I/O On HP Unix

Applies to:
Oracle Server – Enterprise Edition – Version: 8.1.7.0 to 10.2.0.3 – Release: 8.1.7 to 10.2
HP OpenVMS VAX
HP-UX PA-RISC (64-bit)
HP-UX PA-RISC (32-bit)
HP OpenVMS Alpha
HP-UX Itanium
HP Tru64 UNIX
Symptoms

Disk utility output shows 100% usage for disk continously:

ProcList CPU Rpt Mem Rpt Disk Rpt NextKeys SlctProc Help Exit
GlancePlus C.04.50.00 19:15:39 bplita3 ia64 Current Avg High
———————————————————————-
CPU  Util S SN NU U      | 90% 90% 90%
Disk Util F       F      |100% 100% 100% <=== Disk Too heavily loaded

In Statspack or AWR report,  high ‘free buffer waits’ seen even after with considerably increasing db_cache_size and db writers.

Statspack Report

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.99       Redo NoWait %:     99.99
            Buffer  Hit   %:   58.92    In-memory Sort %:     99.82
            Library Hit   %:   89.19        Soft Parse %:     83.51
         Execute to Parse %:   50.76         Latch Hit %:    100.00
Parse CPU to Parse Elapsd %:   81.48     % Non-Parse CPU:     93.21

….

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
——————————————– ———— ———– ——–
free buffer waits                                  76,390      74,807    55.82
enqueue                                             4,071      11,625     8.67
…..

Cause

The problem is related to Disk I/O.
Solution

1. Use the O/S striping software

 Try to use the O/S striping software to distribute database files over as many disks as you can.

2. Use Direct IO
Mount the filesystem with direct IO option.
For example:
% Mount –F vxfs –o remount,nodatainlog,mincache=direct,convosync=direct /dev/vg00/lv_ora /soft/oracle

mincache and convosync
“mincache=direct” => bypass buffer cache on read
“convosync=direct” => force direct I/O for DB writers

Mincache=direct and convosync=direct allow data to be transferred directly from Oracle buffer cache to disk and disk to Oracle buffer cache. This avoids double buffering by bypassing the file system buffer cache and can improve physical read/write performance. However, cases where the disk read could have been avoided because a required block was in file system buffer cache may be negatively impacted.

If your filesystem is mounted with this option then FILESYSTEMIO_OPTIONS parameter needs to set to “SETALL”  in order to use DIO.
Parameters in Oracle influencing the use of Direct IO

    FILESYSTEMIO_OPTIONS defines the IO operations on filesystem files .This parameter should not normally be set by the user.
    The value may be any of the following:
        asynch – Set by default on HP. This allows asynchronous IO to be used where supported by the OS.
        directIO – This allows directIO to be used where supported by the OS. Direct IO bypasses any Unix buffer cache.
        setall – Enables both ASYNC and DIRECT IO.
        none – This disables ASYNC IO and DIRECT IO so that Oracle uses normal synchronous writes, without any direct io options.

See
Document 120697.1 Init.ora Parameter “FILESYSTEMIO_OPTIONS” Reference Note

Oracle recommends that you use the value SETALL instead of the value DIRECTIO, because the DIRECTIO value disables asynchronous I/O.

    DISK_ASYNCH_IO controls whether I/O to datafiles, control files, and logfiles is asynchronous (that is, whether parallel server processes can overlap I/O requests with CPU processing during table scans). If your platform supports asynchronous I/O to disk, Oracle recommends that you leave this parameter set to its default value. However, if the asynchronous I/O implementation is not stable, you can set this parameter to false to disable asynchronous I/O. If your platform does not support asynchronous I/O to disk, this parameter has no effect.

    If you set DISK_ASYNCH_IO to false, then you should also set DBWR_IO_SLAVES or DB_WRITER_PROCESSES to a value other than its default of zero in order to simulate asynchronous I/O.

    DB_WRITER_PROCESSES or DBWR_IO_SLAVES
    see comments in DISK_ASYNCH_IO

3. Concurrent I/O
An alternative solution to Direct I/O is to use Concurrent I/O. Concurrent I/O is available in OnlineJFS 5.0.1.

To enable Concurrent I/O, the filesystem must be mounted with “-o cio”.
Eg:

mount -F vxfs -o nodatainlog,cio /soevxfs/redos /oracle/mnt/redos

Please note that remount should not be used to enable Concurrent I/O on mounted filesystems.

“-o cio”
Concurrent I/O allows multiple processes to read from or write to the same file without blocking other read(2) or write(2) calls.With Concurrent I/O, the read and write operations are not serialized.  This advisory is generally used by applications that require high performance for accessing data and do not perform overlapping writes to the same file. It is the responsibility of the application or the running threads to coordinate the write activities to the same file. It also avoids double buffering by bypassing the filesystem buffer cache and thus improves physical read/write performance significantly. Concurrent I/O performs very close to that of raw logical volumes.

一次Exadata上的ORA-600[kjbmprlst:shadow]故障分析

 

这是一套Exadata 上的11.2.0.1   四节点RAC 系统,从今年初开始频繁地因为LMS后台进程出现内部错误ORA-600[kjbmprlst:shadow]而导致实例意外终止, 虽然是4节点RAC 保证了其高可用性, 但是仍因为 实例经常意外终止 导致应用程序交易失败。

实际我在7月份已经分析过该问题了, 详见<Oracle RAC内部错误:ORA-00600[kjbmprlst:shadow]一例>一文 , kjbmprlst:shadow内部函数用以管理kjbm shadow锁(/libserver10.a/kjbm.o )信息,存在某个已关闭的lock没有及时message给master node的代码漏洞。 实际上当时我已经给出了禁用DRM 可能可以避免该ORA-600[kjbmprlst:shadow] 的看法 ,但是 翻阅MOS 上所有现有的ORA-600[kjbmprlst:shadow] case均没有提及disable DRM可以是一种workaround的途径, 提交SR 也没有得到Oracle GCS关于该看法的 肯定回答, 此外还考虑到 核心的产品环境 使用11.2.0.1 这个bug 较多的版本确实不合适, 升级到 11.2.0.2 已修复该Bug 10121589 的bundle Patch 也算是一种不错的方案 ,所以也就没有深究下去。

之后就一直在做升级Exadata到11.2.0.2 的一些准备工作, 但是用户最后考虑到升级Exadata的步骤过于复杂, 很多步骤是不可回退的, 而且在国内升级的案例目前也不多, 推翻了 升级的方案, 于是有了下面的这一段故障分析

 

故障特征

 

如上面所说的这是一套Exadata上的4节点 11.2.0.1 RAC , 应用程序是 Oracle 自己的Billing and Revenue Management (BRM) 。从今年初开始多次出现ORA-600[kjbmprlst:shadow]内部错误, 引发该内部错误的是RAC 关键后台进程 LMS (  global cache server process) ,结果是导致实例意外终止。

 

该ORA-600[kjbmprlst:shadow]在10月份频发的发生(几乎一个礼拜2次),故障发生还存在以下2个特征:

  1. 主要集中在凌晨03:00 以后的20分钟内发生
  2. 1、2、4节点都陆续发生过故障 , 唯有 3号节点从未发生过 ( 这个后面会介绍)

 

初步分析

 

通过应用方拿到了某次故障发生当天每一个小时的AWR报告, 这次故障发生在凌晨 03:05 的 1号节点上, 于是开始分析当时的 1号实例的性能负载。

分析AWR 报告的一种方法就是 通过对比多个时间段的报告 来了解性能变化 ,当然前提是知道问题发生的时段, 这里我们来对比2个时段的信息,这三个时段分别是:

1. 问题发生前的几分钟, 因为1号实例上 3点以后的AWR 报告缺失,所以这里使用的是 02:00 ~ 03:00 的AWR
2. 问题发生前的1小时 , 01:00~02:00 的AWR

 

首先观测DB TIME:

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14225 12-Oct-11 02:00:05 284 152.2
End Snap: 14226 12-Oct-11 02:59:59 288 150.1
Elapsed: 59.91 (mins)
DB Time: 809.98 (mins)

 

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 14224 12-Oct-11 01:00:10 284 152.3
End Snap: 14225 12-Oct-11 02:00:05 284 152.2
Elapsed: 59.91 (mins)
DB Time: 257.62 (mins)

 

AWR中的DB TIME可以一定程度上反映数据库实例的繁忙程度。这里 01:00 到 02:00 时间段的AAS= 257/60 = 4.28,而问题发生的时间段的 AAS = 809 / 60 = 13 , 说明实例的工作负载明显上升了。

 

再来看主要等待事件:

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: TX – row lock contention 10,072 16,079 1596 33.08 Application
log file sync 166,546 10,579 64 21.77 Commit
gc buffer busy acquire 55,828 8,017 144 16.50 Cluster
gc buffer busy release 21,203 5,904 278 12.15 Cluster
gc current block busy 29,196 3,362 115 6.92 Cluster

 

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
log file sync 258,415 4,875 19 31.54 Commit
gc buffer busy acquire 44,471 3,349 75 21.66 Cluster
DB CPU 1,880 12.16
gc current block busy 42,722 1,748 41 11.31 Cluster
gc buffer busy release 10,730 1,127 105 7.29 Cluster

 

问题发生时段的主要等待事件是 enq: TX – row lock contention 这说明存在大量的行锁争用,此外gc buffer busy acquire/release 和 gc current block busy 全局缓存争用等待时长也明显较 之前的 01:00- 02:00 多了数倍,这说明问题发生时段 节点间对全局缓存资源的争用进一步被加剧了。01:00 – 02:00 时段的 enq: TX – row lock contention 不是前5的等待事件,但实际每秒处理的事务数更多为71.8个(问题发生时段为每秒46.3个事务),这说明造成大量行所争用的的原因不是事务变多, 而是应用程序特性造成的。

 

引起频繁 enq: TX – row lock contention等待事件 和 全局缓存资源争用 的最主要SQL语句均是1usg75g3cx4n6:

 

Cluster Wait Time (s) Executions %Total Elapsed Time(s) %Clu %CPU %IO SQL Id SQL Module SQL Text
4,047.30 20,470 22.75 17,790.11 21.01 0.04 0.00 1usg75g3cx4n6 update

 

SQL “1usg75g3cx4n6” 是一条Update更新语句,在此不列出。

 

问题发生时段该语句 1 号实例上运行了20470次,平均每次运行耗时 0.87s , cluster wait time 占该SQL总执行时间的22.75% ,而在01:00 – 02:00 期间该语句共运行了32845 次,平均每次运行耗时 0.02s , cluster wait time 占该SQL总执行时间的 67.44 %。

以上信息说明1号实例在问题发生时段 由SQL语句1usg75g3cx4n6 因为enq: TX – row lock contention(主要因素) 和 gc 全局缓存争用 (次要因素)相互作用引起了巨大的负载。引起enq: TX – row lock contention行锁争用的主要原因是 并发的DML更新相关的行数据。

 

深入分析

除了在问题发生时段数据库负载加剧,出现大量的行锁和全局缓存争用外,问题时段还出现了大规模的动态资源管理操作DRM( Dynamic Resource Management)。

引入DRM的目的是尽可能减少RAC 中CPU和 消息通信的消耗,DRM的资源管理决定依据相关对象的全局缓存操作历史, 该特性会定期自动找出集群中频繁访问某个缓存对象的节点实例,并将该缓存对象re-master到该实例上以优化RAC性能。  DRM 由 LMON、LMD 和 LMS 进程协同完成, LMS负责传输实际的 gcs resouce和 lock 锁信息 , 11g中受到_gc_policy_time隐藏参数的影响,默认最短每10分钟可能被触发。当设置_gc_policy_time=0 ,则object affinity 和 11g 中引入的read mostly policies新特性均将被禁用。

 

AWR报告中1号实例问题时段的 DRM 统计:

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 2 1.00
remastered objects 2 1.00
replayed locks received 215,517 107,758.50
replayed locks sent 339,349 169,674.50
resources cleaned 0 0.00
remaster time (s) 7.4 3.70
quiesce time (s) 4.3 2.13
freeze time (s) 0.1 0.04
cleanup time (s) 0.4 0.21
replay time (s) 1.0 0.48
fixwrite time (s) 0.9 0.47
sync time (s) 0.8 0.38
affinity objects 525 525

 

以下为01:00- 02:00 时段的DRM 统计:

Name Total per Remaster Op Begin Snap End Snap
remaster ops 5 1.00
remastered objects 8 1.60
replayed locks received 110,902 22,180.40
replayed locks sent 68,890 13,778.00
resources cleaned 0 0.00
remaster time (s) 12.9 2.57
quiesce time (s) 1.4 0.28
freeze time (s) 0.2 0.03
cleanup time (s) 0.9 0.17
replay time (s) 0.5 0.10
fixwrite time (s) 1.9 0.38
sync time (s) 8.0 1.60
affinity objects 526 525

 

可以看到在问题发生时段 DRM 发送(sent)了更多的replayed locks锁资源 ,这里的replay是指将本地锁信息传送到新的master 实例(REPLAY – Transfer of the local lock information to the new master.) , 而负责发送这些replayed locks锁资源的 正是LMS 进程。

如上文所述1、2、4节点均出现过该ORA-600[kjbmprlst:shadow]故障,但是唯有3号节点从没有出现过该问题, 打开3号节点在问题发生时段(03:00-04:00) 的AWR报告后就有重大的发现:

 

Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
3,590.52 0 17.30 34.71 63.63 59v4zh1ac3v2a DBMS_SCHEDULER DECLARE job BINARY_INTEGER := …
3,589.83 0 17.29 51.69 37.98 b6usrg82hwsa3 DBMS_SCHEDULER call dbms_stats.gather_databas…
2,173.79 1 2,173.79 10.47 62.16 28.75 b20w3p880j2gs DBMS_SCHEDULER /* SQL Analyze(1) */ select /*…

 

可以发现3号节点当时正在运行”call dbms_stats.gather_database_stats_job_proc ( )”  11g中自动收集统计信息的作业,该作业运行了较长时间(一个小时内尚未完成), 我们知道10g以后引入了默认的自动收集统计信息的作业,在11g中得到了加强, 这一作业会自动收集自上一次收集以来更新超过总数据量10%的 对象的统计信息, 因为该核心产品数据库是大型的OLTP应用,在03:00之前已经发生了大量的数据更改,所以这导致gather_database_stats_job_proc 要收集许多大对象上的统计信息,而这会引发该3号实例频繁Request之前在 1、2号实例中被更新的全局缓存资源,频繁地访问影响了DRM的decision, 导致大量的object 被 re-master 到 3号节点上。

 

Name Total per Remaster Op Begin Snap End Snap
remaster ops 14 1.00
remastered objects 25 1.79
replayed locks received 1,088,009 77,714.93
replayed locks sent 202,112 14,436.57
resources cleaned 0 0.00
remaster time (s) 30.8 2.20
quiesce time (s) 8.5 0.60
freeze time (s) 0.4 0.03
cleanup time (s) 3.3 0.24
replay time (s) 6.4 0.46
fixwrite time (s) 7.7 0.55
sync time (s) 4.5 0.32
affinity objects 4,294,967,289 4,294,967,289

 

可以看到3号实例接收了大量的replayed lock (replayed locks received) ,而发送的lock则很少。而实际发送(sent) 给 3号实例这些锁信息 资源的 正是 1号实例上的LMS进程。

翻阅alert.log 和视图 我发现每一次的gather_database_stats_job_proc均在凌晨3点的 3号实例发生, 因此每一次故障发生时3号实例只要接收这些replayed locks   资源即可,其LMS进程 仅发送少量的 replayed locks , 所以造成了上面所说的3号实例从不发生该ORA-600[kjbmprlst:shadow]故障 , 这一古怪的现象。

 

我们再来阅读上发生ORA-600[kjbmprlst:shadow]故障的LMS进程的trace文件中的部分内容:

*** 2011-10-12 02:25:22.770
DRM(22812) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(1) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:29.717
 lms 1 finished fixing gcs write protocol
DRM(22813) win(2) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:30.312
DRM(22813) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(4) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(5) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:31.280
 lms 1 finished fixing gcs write protocol
DRM(22813) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22813) win(7) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:32.269
DRM(22813) win(8) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(1) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(2) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:33.479
 lms 1 finished fixing gcs write protocol
DRM(22814) win(3) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(4) lms 1 finished replaying gcs resources

*** 2011-10-12 03:05:34.333
 lms 1 finished fixing gcs write protocol
DRM(22814) win(5) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol
DRM(22814) win(6) lms 1 finished replaying gcs resources
 lms 1 finished fixing gcs write protocol

*** 2011-10-12 03:05:35.315
DRM(22814) win(7) lms 1 finished replaying gcs resources
Incident 1035732 created, dump file:
/maclean1_lms1_20075_i1035732.trc
ORA-00600: internal error code, arguments: [kjbmprlst:shadow], [], [], [], [], [], [], [], [], [], [], []

2011-10-12 03:05:36.811503 : kjbmprlst: internal error on replaying 0x618cd.371 from 3
2011-10-12 03:05:36.811541 : kjbmbassert [0x618cd.371]
2011-10-12 03:05:36.811559 : kjbmsassert(0x618cd.371)(2)
2011-10-12 03:05:36.811636 : kjbmsassert(0x618cd.371)(3)
2011-10-12 03:05:36.811667 : kjbmsassert(0x618cd.371)(4)
kjmpbmsg fatal error on 39
MSG [39:KJX_B_REPLAY] .Yq inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
    flow ctrl: ver=60 flag=169 len=136 tkts=0 seq=0 wrp=0
               sndr=3 dest=1 rcvr=2
   FUSION MSG 0x2b178baa5120,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x618cd.371, 574166.0] c[0x59f904e08,-17597] [0x1,xc69]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99097 drmbkt# 99097
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 9 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99097, drmb 99097, apifrz 1
   FUSION MSG DUMP END
MSG [65521:PBATCH] inc=64 len=8376 sender=3 seq=279807301 fg=q
    flow ctrl: ver=1 flag=37 len=8376 tkts=0 seq=279807301 wrp=1
               sndr=3 dest=1 rcvr=2
MSG [39:KJX_B_REPLAY] .Y'z inc=64 len=136 sender=(3,3) seq=0
     fg=ip stat=KJUSERSTAT_DONE spnum=14 flg=x24
   FUSION MSG 0x2b178baa31c8,39 from 3 spnum 14 ver[64,22813] ln 136 sq[3,8]
          REPLAY 1 [0x527f5.27a, 574166.0] c[0xa2f7eddc0,2158] [0x1,xf33]
          grant 1 convert 0 role x0
          pi [0x0.0x0] flags 0x0 state 0x100
          disk scn 0x0.0 writereq scn 0x0.0 rreqid x0
          msgRM# 22813 bkt# 99582 drmbkt# 99582
      pkey 574166.0 undo 0 stat 5 masters[32768, 3->32768] reminc 64 RM# 22813
 flg x6 type x0 afftime x9776dffe
 nreplays by lms 0 = 0 
 nreplays by lms 1 = 0 
     hv 33 [stat 0x0, 1->1, wm 32768, RMno 0, reminc 34, dom 0]
     kjga st 0x4, step 0.36.0, cinc 64, rmno 22813, flags 0x20
     lb 98304, hb 114687, myb 99582, drmb 99582, apifrz 1
   FUSION MSG DUMP END

.........................

    ------------process 0xcf22a6d40--------------------
    proc version      : 0
    Local node        : 0
    pid               : 20075
    lkp_node          : 0
    svr_mode          : 0
    proc state        : KJP_NORMAL
    Last drm hb acked : 114687
    flags             : x80
    current lock op    : (nil)
    Total accesses    : 11763805
    Imm.  accesses    : 11763774
    Locks on ASTQ     : 0
    Locks Pending AST : 0
    Granted locks     : 0
    AST_Q: 
    PENDING_Q: 
    GRANTED_Q: 
    KJM HIST LMS1: 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 
      20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 20:39:0 1:0 
      13:65521:0 20:39:0 20:39:0 20:39:0 
    START DEFER MSG QUEUE 0 ON LMS1 flg x87:

MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262147
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca303f2f0 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=109225856
     fg=s stat=KJUSERSTAT_ATTACHFAILED spnum=11 flg=x0
MSG [61:KJX_SYNC_RCVCHANNEL] inc=64 len=424 sender=(3,3) seq=98262146
     fg=iq stat=KJUSERSTAT_DONE spnum=11 flg=x1
     SYNC_RCVCHANNEL 0xca405e758 from 3 spnum 11 ver[64,196610]
     receiver 0 last sent seq 0.98262028
     receiver 1 last sent seq 1.545148943
     receiver 2 last sent seq 1.279807365
MSG [25:KJX_FTDONE] inc=64 len=64 sender=(3,3) seq=6400108
     fg=s stat= spnum=11 flg=x20
    END DEFER MSG QUEUE 0 DUMP ON LMS1
     SEQUENCES: 
    [sidx 0]  0:0.0  1:711171046.0  2:279807300.1  3:43212975.0
     DEFER MSG QUEUE 1 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 1]  0:0.0  1:711171047.0  2:279807292.1  3:43212974.0
     DEFER MSG QUEUE 2 ON LMS1 IS EMPTY
     SEQUENCES: 
    [sidx 2]  0:0.0  1:711171047.0  2:279807300.1  3:43212975.0


可以看到在实际出发ORA-600[kjbmprlst:shadow]之前,LMS后台进程确实在进行频繁地DRM操作。

我们不难看出出发ORA-600[kjbmprlst:shadow]的原因是:数据库较高的性能负载, 1号实例上SQL 语句引发的大量行锁争用 和 3号节点上的 gather_database_stats_job_proc 自动统计信息作业 同时存在 , 导致1号实例上的LMS进程频发地执行DRM操作 , 最终在较高的工作负载情况下触发了管理kjbm shadow锁的代码bug 。

经过分析发现2、4号实例发生该ORA-600[kjbmprlst:shadow]故障与1号节点的情形相似,整个RAC中唯有3号节点例外,因为总是由它来执行gather_database_stats_job_proc, 导致它总是成为DRM的受益者 , 而其他3个节点成为受害者。

但是注意 可能你要问 那不就是简单的由自动统计作业引发的故障吗? 用得着分析这么多吗, 直接关掉该作业 或者换个时段就可以了。

回答:实际上如果仅仅是gather_database_stats_job_proc的话是不足以引发该故障的, 一方面由于节点之间的所跑得应用程序都是一样的,也没有采取分区或者分schema , 这导致节点之间的全局缓存资源争用加剧 , 另一方面 因为每个Buffer Cache都较大, 均存放了大量的全局缓存资源, 且数据的更新十分频繁 , 这导致gather_database_stats_job_proc作业开始执行后 大量的全局资源涌向3号实例 ,同时也触发了DRM的决定, 以至于1、2、4号节点上的LMS 在短期内有极大的工作压力 , 最终导致了 bug 被触发。仅仅禁用该gather_database_stats_job_proc 自动信息搜集作业 ,或者修改该作业的运行时间 都不足以彻底规避该ORA-600[kjbmprlst:shadow]故障, 因为只要是LMS后台进程处于较高的工作负载情况下都仍有可能触发bug。

 

虽然以上分析了一大堆但是我对解决该ORA-600[kjbmprlst:shadow]故障的看法没有丝毫改变, 而且更坚定了。Disable DRM 禁用DRM特性可以从根本上改变LMS进程的行为方式,彻底杜绝该ORA-600[kjbmprlst:shadow]故障发生的可能性。

但我仍需要一些佐证!

 

资料信息

 

翻阅MOS 一个Note 映入我的眼帘,<Bug 12834027 – ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking [ID 12834027.8]> 这个Note 我在上一次分析这个问题的时候已经阅读过了, 当时该Note的描述该问题会在11.2.0.1 上发生,不提供任何workaround 方法 ,唯一的解决途径是打上11.2.0.2 + Bp 的补丁。

但是我发现这个Note 在最近的 21-OCT-2011 被修改了,内容如下:

Bug 12834027  ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking
 This note gives a brief overview of bug 12834027.
 The content was last updated on: 21-OCT-2011
 Click here for details of each of the sections below.
Affects:

    Product (Component)	Oracle Server (Rdbms)
    Range of versions believed to be affected 	Versions BELOW 12.1
    Versions confirmed as being affected 	

        11.2.0.2 

    Platforms affected	Generic (all / most platforms affected)

Fixed:

    This issue is fixed in	

        12.1 (Future Release) 

Symptoms:

Related To:

    Internal Error May Occur (ORA-600)
    ORA-600 [kjbmprlst:shadow]
    ORA-600 [kjbrasr:pkey] 

    RAC (Real Application Clusters) / OPS
    _gc_read_mostly_locking 

Description

    ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] can occur with RAC
    read mostly locking.

    Workaround
      Disable DRM
     or
      Disable read-mostly object locking
      eg: Run with "_gc_read_mostly_locking"=FALSE

 

 

这一次翻到这个Note 是在10月28日 ,而这个Note是在10月21日更新的, 不得不说这个更新来的很及时, 提供了2种 workaround 的方式 , 包括彻底禁用DRM ,和之前所说的禁用_gc_read_mostly_locking 特性,相关的bug note 说明了 另一个Exadata case的经过:

 

Hdr: 12834027 N/A RDBMS 11.2.0.2.0 RAC PRODID-5 PORTID-226 12730844
Abstract: ORA-600 [KJBMPRLST:SHADOW] & [KJBRASR:PKEY] IN A READ MOSTLY & SKIP LOCK ENV

*** 08/04/11 04:22 am ***

PROBLEM:
--------
3 nodes got crashed and restarted in exadata cluster.

DIAGNOSTIC ANALYSIS:
--------------------
8 node Exadata RAC instance. Below instances got terminated and it rejoined
cluster around 11:46 Today (01-Aug).

Instance2 (gmfcdwp2) got ORA-600 [kjbmprlst:shadow] at 11:46 AM and
instance got terminated by LMS1 process.
LMS1 (ospid: 14922): terminating the instance due to error 484

instance got terminated by LMS1 process
LMS1 (ospid: 28176): terminating the instance due to error 484

instance got terminated by LMS1 process

which was seen in instance 4 and 8.

 

这个case和我们的惊人的相似, 也是几个Exadata上特定的instance number 会出现ORA-600 [kjbmprlst:shadow]。

另一方面需要评估禁用DRM 对于整体性能的影响,其实这一点我已经在<Important parameters For Oracle BRM Application in 11gR2>中介绍了, Oracle 官方推荐为 BRM 应用程序 禁用 DRM 特性 (真绕口) , 而因为这套系统使用的是默认的Exadata 出场的实例参数设置,所以。。。

 

解决方案

1. 禁用DRM 特性,可以彻底避免LMS进程因为DRM操作而引发bug , 具体做法是设置隐藏参数_gc_policy_time为0。

影响评估:

DRM 特性可以一定程度上优化数据库性能, 但是对于没有分区或分schema 的应用程序而言 DRM并不会带来好处。  Oracle 官方推荐在BRM应用程序 上 要禁用DRM特性, 见Metalink 文档<Questions About BRM Support For Oracle DB 11GR2 And Exadata [ID 1319678.1]>:

“_gc_policy_time=0 to disable DRM.” 修改该_gc_policy_time参数 需要重启实例

 

2. 部分禁用DRM 特性 , 设置 _gc_read_mostly_locking= False。 设置 _gc_read_mostly_locking= False可以避免replayed lock 大量传输 ,极大程度上减少LMS触发bug的可能性。  修改该参数同样需要重启实例

 

案例总结

写了这么久有点累了, 也有点不知所云。 可能会再写一篇 介绍11g 中新加入的Read-mostly locking DRM 新特性,本来想在这篇里一并介绍但是发现篇幅有点长了。

有时候我们解决一个问题只需要某个恰当的方案就足够了, 这个方案可以来源于Metalink  、Google 或者朋友, 我们并不知道这种做法、解决方式实际这么生效 , 但是总归或多或少 它生效了。 但是这不代表 真正意义上理解这个故障或者问题了, 譬如 我花了大约 3-4个小时的时间来分析这个问题, 虽然得到的 结论几乎和我在7月份的那一次是一样的 ,但是透过 分析的过程 我还是获益良多 , 这是唯有自己分析方能产生的结果。

同学们, 享受这分析过程带来的 乐趣吧!

Warning: lio_listio returned EAGAIN Performance degradation may be seen

Warning: lio_listio returned EAGAIN
Performance degradation may be seen.
WARNING:Oracle process running out of OS kernel I/O resources (1)

出现以上信息一般代表AIX 的AIO参数配置不是最佳,可能导致Oracle IO性能下降, 建议参考文档AIX: DBWR trace warning: LIO_LISTIO RETURNED EAGAIN [ID 265491.1]来调优AIX AIO参数,一般在AIX 版本4、5上才需要调优该AIO参数,在AIX 6上默认配置即可。

MMON Slave日志[KDL_TRIM]: NEWLEN: MESSAGES

有网友反映windows 2003 上的10.2.0.4数据库,MMON的Slave后台进程报带有NEWLEN: MESSAGES的相关日志,如:


*** ACTION NAME:(Auto ADDM Slave Action) 2011-09-09 18:00:12.053
*** MODULE NAME:(MMON_SLAVE) 2011-09-09 18:00:12.053
*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-09 18:00:12.053
*** SESSION ID:(797.606) 2011-09-09 18:00:12.053
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0

这一般是由于10.2.0.4上已经confirmed的bug:”Bug 6972843 – Unnecessary trace files with “[kdl_trim]: newlen” text in them [ID 6972843.8]”所引起的,在bug目前确认仅在10.2.0.4上发生,而到10.2.0.4.1中已经被修复了。

主要是MMON的辅助slave进程会不必要地输出如”[kdl_trim]: newlen: 0″的日志信息:


Unnecessary trace files may be generated with entries that read:'[kdl_trim]: newlen: '.
 
Hdr: 9600827 10.2.0.4 RDBMS 10.2.0.4 RAM LOBS PRODID-5 PORTID-212
Abstract: [KDL_TRIM]: NEWLEN: MESSAGES IN MMON SLAVES TRACES

PROBLEM:
--------
MMON slave processes traces contains the following entries despite patch 
6972843 applied

*** NAME:(Auto ADDM Slave Action) 2010-04-14 06:00:42.899
*** NAME:(MMON_SLAVE) 2010-04-14 06:00:42.899
*** NAME:(SYS$BACKGROUND) 2010-04-14 06:00:42.899
*** ID:(2472.45569) 2010-04-14 06:00:42.899
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0
[kdl_trim]: newlen: 0

DIAGNOSTIC ANALYSIS:
--------------------
- Opatch contains patch 6972843 applied

WORKAROUND:
-----------
none

RELATED BUGS:
-------------
Bug 6972843 - HOURLY TRACE FILE CREATED IN BDUMP FOR AUTO ADDM SLAVE ACTION

该bug可以被忽略,但是如果日志产生过于频繁的话可以考虑升级10.2.0.4的PSU或者直接升级到10.2.0.5。

LIBCLNTSH.SO: WRONG ELF CLASS: ELFCLASS32错误一例

某套数据库以dba组用户登录正常,但以非dba组用户登录时出现LIBCLNTSH.SO: WRONG ELF CLASS: ELFCLASS32错误,如下例子:


su  - maclean

sqlplus 

ld.so.1: sqlplus: fatal: /s01/orabase/product/10.2.0/dbhome_1/lib32/libclntsh.so.10.1 : wrong ELF class:  ELFCLASS32

这里可以看到引起wrong ELF class的库文件(libclntsh.so.10.1)位于lib32目录下,说明该库文件是32位的。

一般来说只有安装了64位oracle软件时会同时出现lib和lib32 这2个存放库文件的目录, 当安装了32位oracle软件时则只会显示lib一个目录。 这也是判断所安装的Oracle数据库软件是64 bit 还是 32 bit的一种方法。

以上问题一般是由于设置了不正确的LD_LIBRARY_PATH 环境变量 , 或者 lib 目录、libclntsh.so.10.1权限不正确引起的。

建议通过以下步骤排查问题:


1. 检查 sqlplus 命令相关的so共享库文件

[oracle@vrh1 ~]$ ldd `which sqlplus` 
        linux-vdso.so.1 =>  (0x00007fff0bbfd000)
        libsqlplus.so => /s01/orabase/product/10.2.0/dbhome_1/lib/libsqlplus.so (0x00002b450f6b4000)
        libclntsh.so.10.1 => /s01/orabase/product/10.2.0/dbhome_1/lib/libclntsh.so.10.1 (0x00002b450f89b000)
        libnnz10.so => /s01/orabase/product/10.2.0/dbhome_1/lib/libnnz10.so (0x00002b4510ec6000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003635c00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003635800000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003636000000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003639800000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003635400000)
        libaio.so.1 => /usr/lib64/libaio.so.1 (0x00002b45122ab000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003635000000)

2. 找出以上输出中与所报错误中文件名相符的记录,如错误中所报的so文件的路径名为

/s01/orabase/product/10.2.0/dbhome_1/lib32/libclntsh.so.10.1

而 64位 sqlplus 所需要的64位 so文件 的路径为

/s01/orabase/product/10.2.0/dbhome_1/lib/libclntsh.so.10.1

3. 以上说明可能是LD_LIBRARY_PATH 变量存在问题 , 检查该变量
 
echo $LD_LIBRARY_PATH 

/s01/orabase/product/10.2.0/dbhome_1/lib32:/s01/orabase/product/10.2.0/dbhome_1/lib

一般来说在64位的情况下 需要将 lib 目录 放置在 该LD_LIBRARY_PATH 变量的前面,如

export LD_LIBRARY_PATH= $ORACLE_HOME/lib:$ORACLE_HOME/lib32


4. 若检查发现LD_LIBRARY_PATH 变量没有问题, 那么可以进一步检查 lib 和lib32 这2个目录的权限 ,以及相关so 文件是否对 other组有必要的r-x权限。

也可以通过9.2.0.8 和 10g中自带的 changePerm.sh脚本来修复问题,该脚本一般位于$ORACLE_HOME/install目录下

su  - oracle

$ cd $ORACLE_HOME/install
$ ./changePerm.sh


Where is the script changePerm.sh?

Applies to:
Oracle Server - Enterprise Edition - Version: 9.2.0.8 to 10.2.0.5.0 - Release: 9.2 to 10.2
Information in this document applies to any platform.
Goal
Where is the script changePerm.sh?
Solution


During patch set installation, all new files and directories are created with restricted access, 
by default. Users or third party applications with a different group identifier from that of the database, 
which try to access client-side utilities or libraries in the database home, 
will see permission errors when trying to access these files or directories.

changePerm.sh is a script that allows users that do not belong to the database group to 
access and execute files in the ORACLE_HOME directory. Only run this script when absolutely 
required as it will reduce security. The script (borne shell) loops through a hard-coded set of 
files and directories, extracts the "group" permissions, and set the "other" permissions the 
same as the group's permissions. By doing so, "other" users are able to execute 
these files and change to these directories

The script is located in the ORACLE_HOME/install directory on Unix and Linux systems 
(there is no equivalent for Windows). 
This script ships with Oracle Database versions 9.2.0.8 and Oracle 10g releases. 

The changePerm.sh file is not available nor necessary for Oracle Database Server 11.1 
and above because in 11g, 
permissions under the Oracle Home (including "others") are set correctly/relaxed.

Output when the changePerm.sh script is executed:

$ cd $ORACLE_HOME/install
$ ./changePerm.sh
-------------------------------------------------------------------------------
Disclaimer: The purpose of this script is to relax permissions on some of the
files in the database Oracle Home so that all clients can access them.
Please note that Oracle Corporation recommends using the most restrictive file
permissions as possible for your given implementation. Running this script
should be done only after considering all security ramifications.
-------------------------------------------------------------------------------

Do you wish to continue (y/n) : y
Finished running the script successfully

Please see /tmp/changePerm_err.log for errors and /tmp/changePerm.log for the log of events

This script should be run by Oracle Software owner to relax the permission 
and can be run while database processes are active.

Note :
1. If you are patching Oracle RAC home, then you will need to run this script on all the nodes
2. Oracle recommends using the most restrictive file permissions 
possible for your given implementation. Run this script only after considering 
all security ramifications and only if you need to share this installation


dbms_logmnr Unsupported SQLREDO

Question:

Try Testing Oracle Logminer

 

SQL> create table maclean (t1 varchar2(100)) tablespace users;

Table created.

SQL> insert into maclean values (‘MACLEAN’);

1 row created.

SQL> commit;

Commit complete.

after start logmnr:
sql> …add log file ….
sql> exec dbms_logmnr.start_logmnr(options=>dbms_logmnr.dict_from_online_catalog + dbms_logmnr.committed_data_only);
sql> select sql_redo from v$logmnr_contents;
create table maclean (t1 varchar2(100)) tablespace users;
Unsupported
commit;
In the system, some statement will got sqltext from sqlredo column and other is ‘Unsupported’. The got sqltext have INSERT, DELETE, UPDATE; and ‘Unsupported’ sql have INSERT, DELETE, UPDATE too.

 

Answer:

Have you enabled supplemental logging prior to mining the redo / archive logs ? If not then please enable the supplemental logging.

To enable minimal supplemental logging execute the following SQL statement:
SQL>ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

If you had not enabled supplemental logging earlier then the redo logs will not contain sufficient information to mine the logs. Additionally logminer will not always populate all the fields of the v$logmnr_contents this is because the redo may/may not have all the information that we need for every column. Adding Supplemental Logging will help in more info being logged in the redo being generated, helping populate more values.

 

Exactly, supplemental logging is mandatory, Oracle recommends that you at least enable minimal supplemental logging for LogMiner. By default, Oracle Database does not provide any supplemental logging, which means that by default LogMiner is not usable. Therefore, you must enable at least minimal supplemental logging prior to generating log files which will be analyzed by LogMiner.
Once supplemental logging is enabled the log file generated later on will be elgible for log miner.

 

You must enable at least database minimal supplemental logging prior to generate log files which will be analyzed by LogMiner.

 

 

Solution
connect / as sysdba

-- enable minimal supplemental logging at database level
alter database add supplemental log data; 

select supplemental_log_data_min from v$database;

SUPPLEME
--------
YES

-- switch logfile to get a new fresh archived log
alter system switch logfile;

connect test/test

-- insert some rows into table
insert into emp values (3, 'Text 3');
insert into emp values (4, 'Text 4');
commit;

connect / as sysdba

-- switch again logfile to get a minimal redo activity
alter system switch logfile;

-- mine the last written archived log
exec dbms_logmnr.add_logfile ('D:\Databases\O102\arc\O102__1__24__676053397.ARC', options => dbms_logmnr.new);
exec dbms_logmnr.start_logmnr(options => dbms_logmnr.dict_from_online_catalog);
select operation, sql_redo from v$logmnr_contents where seg_name = 'EMP'; 

select operation, sql_redo from v$logmnr_contents where seg_name = 'EMP';

OPERATION
--------------------------------
SQL_REDO
------------------------------------------------------------------------------
INSERT
insert into "TEST"."EMP"("EMPNO","EMPNAME") values ('3','Text 3');

INSERT
insert into "TEST"."EMP"("EMPNO","EMPNAME") values ('4','Text 4');

2 rows selected.

了解SYSDATE函数

SYSDATE是我们在Oracle开发中经常要用到的一种单行函数(single row function),该函数用以返回当前的日期与时间,常和DUAL伪表一起合作。SYSDATE实际上指的是SYSDATE()函数,但是因为这个函数没有参量,所以这里可以省略()圆括号。另外SYSDATE函数是用于获取数据库所在的操作系统的当前时间值的,所以它和数据库或会话的时区没有关系。

我们可以使用NLS_DATE_FORMAT参数或者TO_CHAR函数来获得我们想要的SYSDATE日期格式,具体的格式代码如下:

Format Code Explanation
YEAR Year, spelled out
YYYY 4-digit year
MM Month (01-12; JAN = 01).
MON Abbreviated name of month.
MONTH Name of month, padded with blanks to length of 9 characters.
D Day of week (1-7).
DAY Name of day.
DD Day of month (1-31).
DDD Day of year (1-366).
DY Abbreviated name of day.
HH Hour of day (1-12).
HH12 Hour of day (1-12).
HH24 Hour of day (0-23).
MI Minute (0-59).
SS Second (0-59).
SSSSS Seconds past midnight (0-86399).

具体的用法如:

syntax:
select to_char(sysdate,'FORMAT CODES') from dual;

select to_char(sysdate,'YEAR-MON-DAY HH24:MI:SS') from dual;

TO_CHAR(SYSDATE,'YEAR-MON-DAY HH24:MI:SS')
--------------------------------------------------------------------------------
TWENTY ELEVEN-AUG-WEDNESDAY 21:34:43

与SYSDATE不同,CURRENT_DATE()函数会返回数据库会话所设置的本地时区的当前日期。

我们来具体看一下这2个函数所返回结果的差异:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com

SQL>  alter session set nls_date_format='YYYY-MM-DD HH24:mi:ss';

Session altered.

SQL> set linesize 200 pagesize 1400;
SQL>  col sessiontimezone for a20

SQL> select current_date, sysdate,dbtimezone,sessiontimezone from dual;

CURRENT_DATE        SYSDATE             DBTIME SESSIONTIMEZONE
------------------- ------------------- ------ --------------------
2011-08-24 19:53:09 2011-08-24 19:53:09 +08:00 +08:00

/* 我们调整session的所在时区到+09:00 */

SQL> alter session set time_zone='+09:00';

Session altered.

SQL>  select current_date, sysdate,dbtimezone,sessiontimezone from dual;

CURRENT_DATE        SYSDATE             DBTIME SESSIONTIMEZONE
------------------- ------------------- ------ --------------------
2011-08-24 20:56:33 2011-08-24 19:56:33 +08:00 +09:00

/* 可以看到current_date所返回的时间增长了一个小时,而SYSDATE不变 */

另外请注意SYSDATE函数的类型并非是DATE类型,而是其特有的类型:

SQL> create table mytime(t1 date);

Table created.

SQL> insert into mytime values (sysdate);

1 row created.

SQL> commit;

Commit complete.

SQL> select dump(t1), dump(sysdate), dump( current_date) from mytime;

DUMP(T1)
--------------------------------------------------------------------------------
DUMP(SYSDATE)
--------------------------------------------------------------------------------
DUMP(CURRENT_DATE)
--------------------------------------------------------------------------------
Typ=12 Len=7: 120,111,8,24,21,31,59
Typ=13 Len=8: 219,7,8,24,20,32,12,0
Typ=13 Len=8: 219,7,8,24,20,32,12,0

以上可以看到sysdate和current_date都属于TYPE=13,而普通DATE类型为TYPE=12

数据类型type=12代表DATE Valid date range from January 1, 4712 BC to December 31, 9999 AD.

因为sysdate的时钟取自操作系统,所以DBA同样需要关心操作系统本身时区的设置。不当的时区设置可能导致数据库内部的定时作业系统紊乱,主要会影响使用DBMS_JOB调用的一些JOB,DBMS_JOB包不会考虑到时区的因素,而仅仅考虑sydate的值(DBMS_JOB uses the “date” datatype to store the start date/time and does not store/use timezone related information. JOBS who are scheduled are always executed when sysdate => next execution time. )。如果使用DBMS_SCHEDULER则会考虑具体的时区。

此外一些使用夏令时的地区还会面临更复杂的DST问题,详细可以参考ORAganism的文章<Fake lack of TimeZones in DBMS_JOBs>和MOS文档<DBMS_SCHEDULER or DBMS_JOB And DST / Timezones Explained. [ID 467722.1]> &<Troubleshooting DBMS_SCHEDULER and DBMS_JOB [ID 783357.1] >:

DBMS_SCHEDULER or DBMS_JOB And DST / Timezones Explained. [ID 467722.1]

Applies to:
Oracle Server - Enterprise Edition - Version: 9.0.1.0 to 11.2.0.2 - Release: 9.0.1 to 11.2
Information in this document applies to any platform.
Purpose
This note gives a overview of how DBMS_SCHEDULER / DBMS_JOB react on DST transitions and what
to know when using timezone information with jobs.
Scope and Application
To be used when DBMS_SCHEDULER jobs run at a wrong time after a DST change or you have output
from jobs selecting from TIMESTAMP WITH LOCAL TIMEZONE columns that is not correct.
If things about timezones are not clear please have first a look at Note 340512.1 Timestamps
& time zones - Frequently Asked Questions.
DBMS_SCHEDULER or DBMS_JOB And DST / Timezones Explained.
When running trough DBMS_SCHEDULER or DBMS_JOB a job output using TIMESTAMP WITH LOCAL
TIMEZONE gives the wrong time.This happens when you use DBMS_SCHEDULER or DBMS_JOB to run jobs
who select information from TIMESTAMP WITH LOCAL TIMEZONE fields.
The time in a TIMESTAMP WITH LOCAL TIMEZONE that will be returned depends on the SESSIONTIMEZONE.

select sessiontimezone from dual;

If a TIMESTAMP WITH LOCAL TIMEZONE field contains for example "20-DEC-2007 15:15 -08:00" and
your session timezone is also -08:00 you will see 15:15.
If you change the session timezone to (for example) -07:00 then you will see 16:15,
witch is normal and intended.

Jobs using DBMS_SCHEDULER and DBMS_JOB however use a SESSIONTIMEZONE set to UTC (= +00:00 ).

Hence, above example will give 15 + 8 = 23:15 when selected trough a job.

If you use jobs that provide TIMESTAMP WITH LOCAL TIMEZONE then you may need to
included in the job an alter session to change the SESSIONTIMEZONE.

execute immediate 'alter session set time_zone = ''-08:00''';

If you want the job to follow DST rules then you need to use a named timezone that
follows the DST rules you want to use:
execute immediate 'alter session set time_zone = ''US/Pacific''';

There is Enhancement Request 5479680 - INCORPORATE SESSIONTIMEZONE INFORMATION INTO
JOB INFORMATION to a) use the session timezone of the submitting session as timezone and b)
have a way to see with what timezone a job is submitted.
Job runtime time after DST has changed.
When using DBMS_JOB:
DBMS_JOB uses the "date" datatype to store the start date/time and does not store/use timezone
 related information. JOBS who are scheduled are always executed when sysdate => next execution time.
This means for DST transitions that:

* Assuming here that during DST the time is going FORWARD from 1:59 am directly to 3:00 am (skip on hour)
- any job scheduled between 2:00 am and 2:59 am will only be started at 3:00 am seen the period between
2:00 am and 2:59 am does not exist. This means ALL jobs scheduled between 2:00 am and 2:59 am are started
at 3:00 am at once.

* Assuming here that during DST the time is going BACK from 2:59 am to 2:00 am (= add one hour) the
jobs will be executed in the "first" 2:00 am - 2:59 am timeslot, if the interval is more then
one hour then the next execution will simply the correct time - aldo the actual "absolute"
interval will be one hour more then normal. If the "interval" is less then one hour then they will
NOT be executed the "second time" time the 2:00 am- 2:59 am timeslot is there because the "next execution time"
will be > 2:59 am. So if you have for example a job that runs every 5 minutes this will NOT be executed
during the "second time" the 2:00 am - 2:59 am timeslot is there until "locale time" becomes 3:00 am (or later).

Please also note that the next execution time is calculated AFTER the job has finished.
A known issue: Note 1328496.1 Ora-01878: Specified Field Not Found In Datetime Or Interval in
Alert.log after DST change When using DBMS_SCHEDULER.
DBMS_SCHEDULER uses a TIMESTAMP WITH TIMEZONE field to store the starting time (and other time related
information) of a job, unlike DBMS_JOB who uses a "date" column who has no concept of timezones.

When checking the various DBMS_SCHEDULER time fields make sure you are using a NLS_TIMESTAMP_TZ_FORMAT
that display's the timezone information

ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';

or for US customers:
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='YYYY/MM/DD HH24:MI:SS TZR TZD';

By using the "TZR TZD" mask you can quickly see in what format the data is stored/defined:

select start_date, repeat_interval, last_start_date, next_run_date from dba_scheduler_jobs where job_name='';

* If it returns a offset (ex: -08:00) then the job time is defined with a offset.
Timezone offsets are by nature NOT "DST aware" seen they mean a fixed offset from UTC.
So any job using a offset as timezone will run after a DST change at a wrong time.

* If it returns a named timezone (ex: Japan ) then the time is defined with
that timezone name but that timezone has no DST rules.
Some named timezones are also not "DST aware" ( Japan for example) because that region simply does not use DST.

* If it returns a named timezone and a DST prefix (ex: US/Pacific PST )
then the time is defined with that timezone name and that timezone is DST aware.

To have a job that automatically adjusts for DST you need to make sure it's defined
with the proper named timezone name that actually uses DST rules.

The DST information in Oracle is sometimes updated, when using a named timezone please check
Note 412160.1 Updated Time Zones in Oracle Time Zone File patches.
that note has a list of named timezones that changed and so you can see if you need to
update the Oracle DST information.
Please note that it's a good idea to update the Oracle DST definitions when using named
timezones but that applying "a DST patch" will not always solve all scheduler problems.
Most "scheduler dst" problems are due the usage of a offset, not a named timezone, when creating the jobs.

Good to know:
* In oracle the session timezone (select sessiontimezone from dual;) defaults to a offset ( like +05:00),
even if the Operating system "TZ" variable is set to a named TZ,
unless the ORA_SDTZ is set in the client (!) environment (or registry) with a *oracle* TZ name.

* DBMS_SCHEDULER has no relation to or does not use the database timezone.

When you submit a job using DBMS_SCHEDULER Oracle will use a different source to populate the timezone depending on:

a) When start_date is NULL, DBMS_SCHEDULER will determine the time zone for the repeat interval as follows:

1. It will check whether the session time zone is a region name. The session time
zone can be set to a named timezone by either:
* Issuing an ALTER SESSION statement, for example:
SQL> ALTER SESSION SET time_zone = 'Asia/Shanghai';
* Setting the ORA_SDTZ environment variable on the client (!) side.

2. If the session time zone is an absolute offset instead of a region name, the Scheduler
will use the value of the DEFAULT_TIMEZONE Scheduler attribute.

SQL> exec DBMS_SCHEDULER.SET_SCHEDULER_ATTRIBUTE('default_timezone','US/Eastern');

PL/SQL procedure successfully completed.

To see the current defined DEFAULT_TIMEZONE Scheduler attribute you can issue:

SQL> ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';

Session altered.

SQL> select dbms_scheduler.stime from dual;

STIME
---------------------------------------------------------------------------
06/03/2008 11:58:06 US/EASTERN EST

3. If the DEFAULT_TIMEZONE attribute is NULL, the Scheduler will use the time zone of
systimestamp when the job or window is enabled .The timezone returned by systimestamp is always a OFFSET.

Conclusion: make sure your DEFAULT_TIMEZONE Scheduler attribute is set, and check the
sessiontimezone if needed seen this may take precedence.

Note: SELECT DBMS_SCHEDULER.GET_SYS_TIME_ZONE_NAME FROM DUAL;
is not related to the DEFAULT_TIMEZONE.
DBMS_SCHEDULER.GET_SYS_TIME_ZONE_NAME provides the TimeZone set on the OS level trough the
TZ variable, it will return the named timezone only if the OS TZ setting is also known in Oracle.
Only when no OS TZ variable is set it will report the DEFAULT_TIMEZONE.

b) When start_date is not NULL then DBMS_SCHEDULER will use:

1. the timezone specified in the start time
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';
begin
DBMS_SCHEDULER.DROP_JOB ('GIVETZ_TEST');
end;
/
begin
dbms_scheduler.create_job('GIVETZ_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>'19/11/2007 12:00:00 CET', enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=01');
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='GIVETZ_TEST';

begin
DBMS_SCHEDULER.DROP_JOB ('GIVETZ_TEST');
end;
/
begin
dbms_scheduler.create_job('GIVETZ_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>'19/11/2007 12:00:00 +01:00', enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=01');
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='GIVETZ_TEST';

2. The SESSION timezone when not specified in the start time.
ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';
ALTER SESSION SET TIME_ZONE = '+01:00';
begin
DBMS_SCHEDULER.DROP_JOB ('NODTZ_TEST');
end;
/
begin
dbms_scheduler.create_job('NODTZ_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>'19/11/2007 12:00:00', enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=01');
end;
/
select start_date, repeat_interval, last_start_date,next_run_date from
dba_scheduler_jobs where job_name='NODTZ_TEST';

ALTER SESSION SET TIME_ZONE = 'CET';
begin
DBMS_SCHEDULER.DROP_JOB ('NODTZ_TEST');
end;
/
begin
dbms_scheduler.create_job('NODTZ_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>'19/11/2007 12:00:00', enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=01');
end;
/
select start_date, repeat_interval, last_start_date,next_run_date from
dba_scheduler_jobs where job_name='NODTZ_TEST';

3. The SESSION timezone when using CURRENT_TIMESTAMP

ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';
ALTER SESSION SET TIME_ZONE = '+01:00';
begin
DBMS_SCHEDULER.DROP_JOB ('CURRTS_TEST');
end;
/
begin
dbms_scheduler.create_job('CURRTS_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>current_timestamp, enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=01');
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='CURRTS_TEST';

ALTER SESSION SET TIME_ZONE = 'CET';
begin
DBMS_SCHEDULER.DROP_JOB ('CURRTS_TEST');
end;
/
begin
dbms_scheduler.create_job('CURRTS_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>current_timestamp, enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=01');
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='CURRTS_TEST';

4. The SYSTIMESTAMP offset when using SYSTIMESTAMP

ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';
ALTER SESSION SET TIME_ZONE = 'CET';

begin
DBMS_SCHEDULER.DROP_JOB ('SYSTIME_TEST');
end;
/
begin
dbms_scheduler.create_job('SYSTIME_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>systimestamp, enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=00');
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='SYSTIME_TEST';
The timezone returned by systimestamp is always a OFFSET.

5. the SESSION timezone when using SYSDATE.

ALTER SESSION SET NLS_TIMESTAMP_TZ_FORMAT ='DD/MM/YYYY HH24:MI:SS TZR TZD';
ALTER SESSION SET TIME_ZONE = '+01:00';

begin
DBMS_SCHEDULER.DROP_JOB ('SYSDATE_TEST');
end;
/
begin
dbms_scheduler.create_job('SYSDATE_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>sysdate, enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=00');
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='SYSDATE_TEST';

ALTER SESSION SET TIME_ZONE = 'CET';
begin
DBMS_SCHEDULER.DROP_JOB ('SYSDATE_TEST');
end;
/
begin
dbms_scheduler.create_job('SYSDATE_TEST','PLSQL_BLOCK','begin null; end;',
start_date=>sysdate, enabled=>true,repeat_interval => 'freq=daily;byhour=12;byminute=00');
end;
/

select start_date, repeat_interval, last_start_date, next_run_date from
dba_scheduler_jobs where job_name='SYSDATE_TEST';

So the best option if to always specify a full start date with a timezone name that uses
the DST rules you expect or to use for example CURRENT_TIMESTAMP and set the session timezone
first to the timezone name that uses the DST rules you expect.

You can also redefine existing jobs if you notice they are defined with a offset.
A example using a explicit named timezone:
begin
dbms_scheduler.set_attribute('SYSDATE_TEST','start_date',to_timestamp_tz
('Europe/London 12:00 20-NOV-2007','TZR HH24:MI DD-MON-RRRR'));
end;
/
select start_date, repeat_interval, last_start_date, next_run_date from dba_scheduler_jobs
where job_name='SYSDATE_TEST';

沪公网安备 31010802001379号

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569