
Oracle raises ORA-1410: invalid ROWID when updating a table

If you cannot resolve this yourself, the ParnassusData professional Oracle database recovery team can help you recover your data!

ParnassusData professional database recovery team

Service hotline: 13764045638  QQ: 47079569  Email: service@parnassusdata.com

 

 

Applies to:

Oracle Database – Enterprise Edition – Version 7.3.4.0 to 10.2.0.5 [Release 7.3.4 to 10.2]

Information in this document applies to any platform.
*** Checked for relevance on 24-Sept-2014 ***

Symptoms

The following error may occur when attempting to update a table.

Example:
update acd_person set CONTACT_MAIL_FLAG='Y' where PRS_CD_ID='292177803'

ERROR at line 1:
ORA-01410: invalid ROWID

A select will work fine on the table.


The instance has not crashed.
The error is reproducible.


Changes

No changes were made to the environment.

Cause

A ROWID is a structure that allows a row to be accessed directly. It contains the object number, the location within the datafile, the block number, and the row slot number within the block.

ORA-01410 is raised when the information a query is based on changes while the query is running. Here it means the problem can occur when an index block delete has not completed, so stale ROWIDs are still found in the index leaf block.

Solution

1.  Run an explain plan on the update statement to find the problematic index.

(Note 199081.1)

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------
| Id  | Operation           | Name           | Rows  | Bytes | Cost  |
-----------------------------------------------------------------------
|   0 | UPDATE STATEMENT    |                |     1 |     9 |     3 |
|   1 |  UPDATE             | ACD_PERSON     |       |       |       |
|*  2 |   INDEX UNIQUE SCAN | PK_ACD_PERSON  |     1 |     9 |     2 |
-----------------------------------------------------------------------

2.  Run the following to extract the index DDL.

set long 100000
select dbms_metadata.get_ddl('INDEX','<index name>','<owner>') from dual;

3.  Drop the index.

DROP INDEX <index_name>;

4.  Recreate the index.

CREATE INDEX <index_name> ON table(column);

The update should now run without error.

The index must be dropped and recreated; an online rebuild cannot fix the bad ROWIDs.

References

NOTE:199081.1 – SQL_TRACE (10046), TKProf and Explain Plan – Overview Reference
NOTE:806195.1 – Understanding The ORA-1410 Error

A Roundup of Oracle CBO Terminology

I am preparing some articles on histograms and density; first, let me sort out the terminology for you:

cardinality (CDN)
Legend
CBQT – cost-based query transformation
JPPD – join predicate push-down
OJPPD – old-style (non-cost-based) JPPD
FPD – filter push-down
PM – predicate move-around
CVM – complex view merging
SPJ – select-project-join
SJC – set join conversion
SU – subquery unnesting
OBYE – order by elimination
OST – old style star transformation
ST – new (cbqt) star transformation
CNT – count(col) to count(*) transformation
JE – Join Elimination
JF – join factorization
SLP – select list pruning
DP – distinct placement
qb – query block
LB – leaf blocks
DK – distinct keys
LB/K – average number of leaf blocks per key
DB/K – average number of data blocks per key
CLUF – clustering factor
NDV – number of distinct values
Resp – response cost
Card – cardinality
Resc – resource cost
NL – nested loops (join)
SM – sort merge (join)
HA – hash (join)
CPUSPEED – CPU Speed
IOTFRSPEED – I/O transfer speed
IOSEEKTIM – I/O seek time
SREADTIM – average single block read time
MREADTIM – average multiblock read time
MBRC – average multiblock read count
MAXTHR – maximum I/O system throughput
SLAVETHR – average slave I/O throughput
dmeth – distribution method
1: no partitioning required
2: value partitioned
4: right is random (round-robin)
128: left is random (round-robin)
8: broadcast right and partition left
16: broadcast left and partition right
32: partition left using partitioning of right
64: partition right using partitioning of left
256: run the join in serial
0: invalid distribution method
sel – selectivity
ptn – partition
adop – automatic degree of parallelism

TABLE: Table Name
ALIAS: Table Alias
QBS: Query Block Signature
#ROWS: Number of Rows
#BLKS: Number of Blocks
ARL: Average Row Length
COR: Cardinality Original
CRD: Cardinality Rounded
CCM: Cardinality Computed
CNA: Cardinality Non Adjusted

AVGLEN: Average Column Length
NDV: Number of Distinct Values
NULLS: Number of Nulls in Column
DEN: Column Density
MIN: Minimum Column Value
MAX: Maximum Column Value
TYPE: Histogram Type
#BKTS: Histogram Buckets
UNCOMPBKTS: Histogram Uncompressed Buckets   
ENDPTVALS: Histogram End Point Values 
OOR: Out-of-Range Predicate

TABLE: Table Name
ALIAS: Table Alias
INDEX: Index Name
QBS: Query Block Signature
LVLS: Index Levels
#LB: Number of Leaf Blocks
#DK: Number of Distinct Keys
LB/K: Average Number of Leaf Blocks Per Key
DB/K: Average Number of Data Blocks Per Key
CLUF: Clustering Factor
INDEX_COLS: Index Column Numbers

COST: Cost of the Join
CARD: Cardinality of the Join
BC: Best Cost
LINE#: Line Number in the 10053 Trace File Where Cost Value is Located
JOIN#: Join Number in the 10053 Trace File Associated With Key
STATUS: If Permutation was Computed for all Table Joins the Status = COMPL. If Not, status = ABORT
*: In ANY Column Indicates Value Not Found in File

Freq – frequency histogram
HtBal – height-balanced histogram

 

 

About UNCOMPBKTS and ENDPTVALS

 

When the histogram type is frequency (Histogram: Freq), UncompBkts equals the total number of rows sampled for the statistics minus NULLs (Card: Original minus NULLs; since dbms_stats samples with auto_sample_size by default, this column is really the sampled original cardinality minus NULLs), while EndPtVals equals the total number of buckets, i.e. the NDV, because in a frequency histogram NDV = number of buckets.

When the histogram type is height balanced (Histogram: HtBal), UncompBkts equals the number of buckets (which also equals #Bkts in the 10053 trace), while EndPtVals equals the size of the compressed histogram, in effect the count returned by: select count(*) from dba_tab_histograms where table_name='YOUR_TABLE_NAME' and column_name='YOUR_COLUMN_NAME'. Comparing these two values shows how many popular values there are and how skewed the data is: whether there are many heavily repeated (popular) values or just one hugely repeated value.
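The frequency-histogram relationship above (UncompBkts = sampled rows minus NULLs; EndPtVals = NDV = number of buckets) can be sketched with a toy column; the column data here is invented purely for illustration:

```python
from collections import Counter

def freq_histogram(column):
    """Build a toy frequency histogram: one bucket per distinct value."""
    non_null = [v for v in column if v is not None]
    buckets = Counter(non_null)          # value -> row count
    return {
        "UncompBkts": len(non_null),     # sampled rows minus NULLs
        "EndPtVals":  len(buckets),      # NDV = number of buckets
        "buckets":    dict(buckets),
    }

col = [1, 1, 1, 2, 2, 3, None, None]     # 8 rows, 2 NULLs, NDV = 3
print(freq_histogram(col))               # UncompBkts=6, EndPtVals=3
```

Here a very skewed column would show one bucket holding most of UncompBkts, which is the kind of popular-value pattern the comparison in the text reveals.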

 

 

SQL Performance Analyzer (SPA): Common Scripts Roundup

A roundup of commonly used SPA scripts

Attached is a sample SPA report: spa_buffergets_summary

 

SQL Performance Analyzer (SPA)

Oracle Database 11g introduced SQL Performance Analyzer. With this tool you can accurately assess the impact of a change on the SQL statements that make up a workload. SQL Performance Analyzer helps predict the potential performance impact of a change on a SQL query workload. It gives the DBA detailed information about SQL statement performance, for example before-and-after execution statistics and which statements improved or regressed. This lets you, for instance, make a change in a test environment to determine whether a database upgrade will improve workload performance.

 

  1. New in 11g
  2. Target users: DBAs, QA, application developers
  3. Helps predict the impact of system changes on SQL workload response time
  4. Builds SQL workload performance (i.e. SQL execution plans and execution statistics) under different versions
  5. Executes SQL serially (concurrency is not considered)
  6. Analyzes performance differences
  7. Provides fine-grained performance analysis of individual SQL
  8. Integrates with SQL Tuning Advisor to tune regressions

SQL Performance Analyzer: use cases
SQL Performance Analyzer can be used to predict and prevent potential performance problems from any database environment change that affects the structure of SQL execution plans. Such changes include (but are not limited to) any of the following:

  1. Database upgrades
  2. Implementing tuning recommendations
  3. Schema changes
  4. Gathering statistics
  5. Database parameter changes
  6. Operating system and hardware changes

 

DBAs can even use SQL Performance Analyzer to predict SQL performance changes caused by upcoming changes in the most complex environments. For example, as an application evolves through its development cycle, database application developers can test changes to schemas, database objects, and rewritten application code to mitigate any potential performance impact.
SQL Performance Analyzer can also be used to compare SQL performance statistics.

SQL Performance Analyzer: overview

1.  Capture SQL: in this phase, the set of SQL statements representing the SQL workload on the production system is collected. You can use a SQL Tuning Set or the Automatic Workload Repository (AWR) to capture the information to transfer. Because AWR essentially captures high-load SQL, consider modifying the default AWR snapshot settings and the top-SQL capture so that AWR captures the maximum number of SQL statements. This ensures a more complete SQL workload is captured.

2.  Transport: in this phase, transfer the resulting workload to the test system. Export the STS from the production system, then import the STS into the test system.

3.  Compute "before" performance: before making any change, execute the SQL statements to collect the baseline information needed to assess the likely impact of the future change on workload performance. The information collected in this phase is a snapshot of the current state of the system workload. The performance data includes:

- execution plans (such as those generated by explain plan)
- execution statistics (such as elapsed time, buffer gets, disk reads, and rows processed)

4.  Make the change: once the "before" data is captured, implement the planned change and begin observing its impact on performance.

5.  Compute "after" performance: this step is executed only after the change has been made in the database environment. Each statement of the SQL workload runs in mock-execution (statistics-collection-only) mode, gathering the same information captured in step 3.

6.  Compare and analyze SQL performance: once both versions of the SQL workload performance data are available, run the performance analysis by comparing the "after" data against the "before" data. The comparison is based on execution statistics such as elapsed time, CPU time, and buffer gets.

7.  Tune regressed SQL: at this stage you have identified exactly which SQL statements may cause performance problems under the database change. Any database tool can be used to tune the system here. For example, you can run SQL Tuning Advisor or Access Advisor on the identified statements and implement their recommendations. You can also seed SQL Plan Management (SPM) with the plans captured in step 3 to ensure the plans stay the same. After implementing any tuning action, repeat the process to create a new "after" version and analyze the performance differences to ensure the new performance is acceptable.
By default, if SPA encounters DML statements, only the query part is executed. From 11.2 onward, however, full DML can apparently be executed by setting the EXECUTE_FULLDML parameter, although that parameter currently has some bugs:

Bug 10428438 : WITH EXECUTE_FULLDML ROWS IS ALWAYS SET TO 0 11.2.0.1

Bug 14635522 : SPA SHOULD CAPTURE AND REPLAY TRANSACTIONS 11.2.0.3

 

By default, only the query portion of DML is executed. Using the APIs, you can execute the full DML with the EXECUTE_FULLDML task parameter. When EXECUTE_FULLDML is set to TRUE, the DML statement is executed fully, including acquiring row locks and modifying rows; when it is set to FALSE (the default), only the query part of the DML is executed, without modifying data. When TRUE, SQL Performance Analyzer issues a rollback following the DML execution to prevent the DML from making persistent changes, so SPA does not make any change to the data in the tables.

 

Invoke it as follows:

 

execute DBMS_SQLPA.SET_ANALYSIS_TASK_PARAMETER(task_name   => 'TASK_21137', -
                                               parameter   => 'EXECUTE_FULLDML', -
                                               value       => 'TRUE');

 

 

 

 

Capture a tuning set from the cursor cache, polling for 12 minutes at 5-second intervals

 

 

begin
DBMS_SQLTUNE.CREATE_SQLSET (sqlset_name => 'MAC_SPA');
dbms_sqltune.capture_cursor_cache_sqlset(
sqlset_name => 'MAC_SPA' ,
time_limit => 12*60,
repeat_interval => 5);
end ;
/

basic_filter=> q'# module like 'DWH_TEST%' and sql_text not like '%applicat%' and parsing_schema_name in ('APPS') #'

basic_filter   => 'sql_text LIKE ''%my_objects%'' and parsing_schema_name = ''SPA_TEST_USER''',

==> example filter conditions

 

Fetch SQL set rows matching a filter from the current cursor cache

 

 

SELECT sql_id, sql_text 
FROM table(DBMS_SQLTUNE.SELECT_CURSOR_CACHE('buffer_gets > 500')) 
ORDER BY sql_id;

SELECT * 
FROM table(DBMS_SQLTUNE.SELECT_CURSOR_CACHE('sql_id = ''4rm4183czbs7j'''));

 DECLARE
  cur sys_refcursor;
BEGIN
  OPEN cur FOR
    SELECT value(P) 
    FROM table(DBMS_SQLTUNE.SELECT_CURSOR_CACHE) P;

  -- Process each statement (or pass cursor to load_sqlset).

  CLOSE cur;
END;
/

 -- create the tuning set
EXEC DBMS_SQLTUNE.CREATE_SQLSET('MAC_SPA');
-- populate the tuning set from the cursor cache
DECLARE
 cur DBMS_SQLTUNE.SQLSET_CURSOR;
BEGIN
 OPEN cur FOR
   SELECT VALUE(P)
     FROM table(
       DBMS_SQLTUNE.SELECT_CURSOR_CACHE(
         'parsing_schema_name <> ''SYS'' AND elapsed_time > 5000000',
          NULL, NULL, NULL, NULL, 1, NULL,
         'ALL')) P;

DBMS_SQLTUNE.LOAD_SQLSET(sqlset_name => 'MAC_SPA',
                        populate_cursor => cur);

END;
/

 

 

Load SQL set rows from AWR snapshots into a SQL Tuning Set

 

 

DECLARE
  cur sys_refcursor;
BEGIN
  OPEN cur FOR
    SELECT VALUE (P) 
    FROM table(dbms_sqltune.select_workload_repository(4146,4161)) P;

  -- Process each statement (or pass cursor to load_sqlset)
  DBMS_SQLTUNE.LOAD_SQLSET(sqlset_name => 'MAC_SPA',
                        populate_cursor => cur);
  CLOSE cur;
END;
/

 

 

 

Pack the SQL Tuning Set into a staging table:

 

 

set echo on
select name,statement_count from dba_sqlset;

drop table maclean.pack_sqlset purge;

exec DBMS_SQLTUNE.CREATE_STGTAB_SQLSET('PACK_SQLSET','MACLEAN');

exec DBMS_SQLTUNE.PACK_STGTAB_SQLSET('MAC_SPA','SYS','PACK_SQLSET','MACLEAN');

SQL> desc maclean.pack_sqlset;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 NAME                                               VARCHAR2(30)
 OWNER                                              VARCHAR2(30)
 DESCRIPTION                                        VARCHAR2(256)
 SQL_ID                                             VARCHAR2(13)
 FORCE_MATCHING_SIGNATURE                           NUMBER
 SQL_TEXT                                           CLOB
 PARSING_SCHEMA_NAME                                VARCHAR2(30)
 BIND_DATA                                          RAW(2000)
 BIND_LIST                                          SQL_BIND_SET
 MODULE                                             VARCHAR2(48)
 ACTION                                             VARCHAR2(32)
 ELAPSED_TIME                                       NUMBER
 CPU_TIME                                           NUMBER
 BUFFER_GETS                                        NUMBER
 DISK_READS                                         NUMBER
 DIRECT_WRITES                                      NUMBER
 ROWS_PROCESSED                                     NUMBER
 FETCHES                                            NUMBER
 EXECUTIONS                                         NUMBER
 END_OF_FETCH_COUNT                                 NUMBER
 OPTIMIZER_COST                                     NUMBER
 OPTIMIZER_ENV                                      RAW(1000)
 PRIORITY                                           NUMBER
 COMMAND_TYPE                                       NUMBER
 FIRST_LOAD_TIME                                    VARCHAR2(19)
 STAT_PERIOD                                        NUMBER
 ACTIVE_STAT_PERIOD                                 NUMBER
 OTHER                                              CLOB
 PLAN_HASH_VALUE                                    NUMBER
 PLAN                                               SQL_PLAN_TABLE_TYPE
 SPARE1                                             NUMBER
 SPARE2                                             NUMBER
 SPARE3                                             BLOB
 SPARE4                                             CLOB

 

 

 

Export the data for the schema under test together with the pack table above, and import them into the target test database:

 

set echo on
exec DBMS_SQLTUNE.UNPACK_STGTAB_SQLSET('MAC_SPA','SYS',TRUE,'PACK_SQLSET','MACLEAN');
alter system flush buffer_cache;
alter system flush shared_pool;

 

 

Create the SPA task and run it:

 

 

var sts_task varchar2(64);
exec :sts_task:= dbms_sqlpa.create_analysis_task(task_name => '10g_11g_spa',description => 'experiment for 10gR2 to 11gR2 upgrade',sqlset_name=> 'MAC_SPA');

PL/SQL procedure successfully completed.

var exe_task varchar2(64);
exec :exe_task:=dbms_sqlpa.execute_analysis_task(task_name=>'10g_11g_spa',execution_name=>'10g_trail',execution_type=>'CONVERT SQLSET',execution_desc=>'10g sql trail');

var exe_task varchar2(64);
exec :exe_task:=dbms_sqlpa.execute_analysis_task(task_name=>'10g_11g_spa',execution_name=>'11g_trail',execution_type=>'TEST EXECUTE',execution_desc=>'11g sql trail');

 

 

 

Run the comparison tasks

 

 

 

Compare CPU_TIME
EXEC dbms_sqlpa.execute_analysis_task( -
  task_name => '10g_11g_spa', -
  execution_name => 'compare_10g_112_cpu', -
  execution_type => 'COMPARE PERFORMANCE', -
  execution_params => dbms_advisor.arglist('COMPARISON_METRIC','CPU_TIME','EXECUTION_NAME1','10g_trail','EXECUTION_NAME2','11g_trail'), -
  execution_desc => 'Compare 10g SQL Trace Performance to 11g Test-Execute for CPU_TIME')
  /

Compare BUFFER_GETS
EXEC dbms_sqlpa.execute_analysis_task( -
  task_name => '10g_11g_spa', -
  execution_name => 'compare_10g_112_buffergets', -
  execution_type => 'COMPARE PERFORMANCE', -
  execution_params => dbms_advisor.arglist('COMPARISON_METRIC','BUFFER_GETS','EXECUTION_NAME1','10g_trail','EXECUTION_NAME2','11g_trail'), -
  execution_desc => 'Compare 10g SQL Trace Performance to 11g Test-Execute for BUFFER_GETS')
  /

Compare actual elapsed time

begin 
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK( 
task_name => 'SPA_TEST', 
execution_type => 'COMPARE PERFORMANCE', 
execution_name => 'Compare_elapsed_time', 
execution_params => dbms_advisor.arglist('execution_name1', '10g_trail', 'execution_name2', '11g_trail', 'comparison_metric', 'elapsed_time') ); 
end; 
/

Compare physical reads

begin 
DBMS_SQLPA.EXECUTE_ANALYSIS_TASK( 
task_name => '10g_11g_spa', 
execution_type => 'COMPARE PERFORMANCE', 
execution_name => 'Compare_physical_reads0', 
execution_params => dbms_advisor.arglist('execution_name1', '10g_trail', 'execution_name2', '11g_trail', 'comparison_metric', 'disk_reads') ); 
end; 
/

Set the comparison_metric parameter to specify an expression of execution 
statistics to use in the performance impact analysis. Possible values include 
the following metrics or any combination of them: elapsed_time (default), 
cpu_time, buffer_gets, disk_reads, direct_writes, and optimizer_cost.
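The comparison step boils down to simple per-statement arithmetic on the chosen metric: the change for each SQL, and its impact weighted by that statement's share of the workload. A rough sketch of that computation (the metric values are invented, and SPA's real report logic is more involved):

```python
def compare_metric(before, after):
    """Per-SQL change and workload-level impact for one comparison metric.

    before/after: dict of sql_id -> metric value (e.g. buffer_gets).
    """
    total_before = sum(before.values())
    report = {}
    for sql_id in before:
        b, a = before[sql_id], after.get(sql_id, 0)
        report[sql_id] = {
            # per-SQL change: negative means the statement improved
            "change_pct": round(100.0 * (a - b) / b, 1) if b else None,
            # workload impact: positive means the workload got cheaper
            "workload_impact_pct": round(100.0 * (b - a) / total_before, 1),
        }
    return report

before = {"sql1": 800, "sql2": 200}   # buffer_gets from the "before" trial
after  = {"sql1": 400, "sql2": 300}   # buffer_gets from the "after" trial
print(compare_metric(before, after))
```

The weighting explains why SPA can call a net change an improvement even when some statements regress: sql2 above regresses by 50% but only costs the workload 10%, while sql1's improvement saves 40%.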

 

 

 

Obtain the SPA report:

 

 

 

set long 100000 longchunksize 100000 linesize 200 head off feedback off echo off 
spool spa_report_elapsed_time.html 
SELECT dbms_sqlpa.report_analysis_task('SPA_TEST', 'HTML', 'ALL','ALL', execution_name=>'Compare_elapsed_time') FROM dual; 
spool off

Generate the buffer-gets comparison report

set heading off long 100000000 longchunksize 10000 echo off;
set linesize 1000 trimspool on;
spool buffergets_summary.html
select xmltype(dbms_sqlpa.report_analysis_task('10g_11g_spa',
                                                'html',
                                                'typical',
                                                'all',
                                                null,
                                                100,
                                                'compare_10g_112_buffergets')).getclobval(0,0)
from dual;
spool off

Generate the errors comparison report
spool errors_summary.html
select xmltype(dbms_sqlpa.report_analysis_task('10g_11g_spa',
                                                'html',
                                                'errors',
                                                'summary',
                                                null,
                                                100,
                                                '11g_trail')).getclobval(0,0)
from dual;
spool off

Generate the unsupported-SQL comparison report
spool unsuppor_all.html
select xmltype(dbms_sqlpa.report_analysis_task('10g_11g_spa',
                                                'html',
                                                'unsupported',
                                                'all',
                                                null,
                                                100,
                                                '11g_trail')).getclobval(0,0)
from dual;
spool off

 

 

 

 

 

execution_type
Type of the action to perform by the function. If NULL it will default to the value of the DEFAULT_EXECUTION_TYPE parameter. Possible values are:
[TEST] EXECUTE – test-execute every SQL statement and collect its execution plans and execution statistics. The resulting plans and statistics will be stored in the advisor framework. This is default.
EXPLAIN PLAN – generate explain plan for every statement in the SQL workload. This is similar to the EXPLAIN PLAN command. The resulting plans will be stored in the advisor framework in association with the task.
COMPARE [PERFORMANCE] – analyze and compare two versions of SQL performance data. The performance data is generated by test-executing or generating explain plan of the SQL statements. Use this option when two executions of type EXPLAIN_PLAN or TEST_EXECUTE already exist in the task
CONVERT SQLSET – used to read the statistics captured in a SQL Tuning Set and model them as a task execution. This can be used when you wish to avoid executing the SQL statements because valid data for the experiment already exists in the SQL Tuning Set.

 

 

For 9i Upgrade to 10g

 

 

exec dbms_stats.gather_system_stats(gathering_mode=>'NOWORKLOAD');

alter system set "_optim_peek_user_binds"=false;           ==> disable the bind peeking feature, which exists in 10g

exec DBMS_STATS.SET_PARAM( 'method_opt','FOR ALL COLUMNS SIZE 1' );
commit;

9i 

?/rdbms/admin/dbmssupp

exec dbms_support.start_trace(binds=>TRUE, waits=> FALSE);

exec dbms_support.stop_trace;

exec dbms_support.start_trace_in_session(sid=>sid,serial=>ser, binds=>TRUE, waits=>FALSE);

select sid,serial# from v$SESSION WHERE ... ;

exec dbms_support.stop_trace_in_session(sid=>SID,serial=>ser);

create table mapping_table tablespace USERS as
select object_id id, owner, substr(object_name, 1, 30) name
  from dba_objects
 where object_type not in ('CONSUMER GROUP',
                           'EVALUATION CONTEXT',
                           'FUNCTION',
                           'INDEXTYPE',
                           'JAVA CLASS',
                           'JAVA DATA',
                           'JAVA RESOURCE',
                           'LIBRARY',
                           'LOB',
                           'OPERATOR',
                           'PACKAGE',
                           'PACKAGE BODY',
                           'PROCEDURE',
                           'QUEUE',
                           'RESOURCE PLAN',
                           'SYNONYM',
                           'TRIGGER',
                           'TYPE',
                           'TYPE BODY')
union all
select user_id id, username owner, null name from dba_users;

declare
  mycur dbms_sqltune.sqlset_cursor;
begin
  dbms_sqltune.create_sqlset('9i_prod_wkld');
    open mycur for
      select value(p)
      from table(dbms_sqltune.select_sql_trace(
                   directory=>'SPADIR',
                   file_name=>'%trc',
                   mapping_table_name => 'MAPPING_TABLE',
                   select_mode => dbms_sqltune.single_execution)) p;
  dbms_sqltune.load_sqlset(
    sqlset_name => '9i_prod_wkld',
    populate_cursor => mycur,
    commit_rows => 1000);

  close mycur;
end;
/

create user spadba identified by oracle;
grant dba to spadba;
grant all on dbms_sqlpa to spadba;

create public database link to10g connect to spadba identified by oracle using 'STRINGS';

var sts_task varchar2(64);
exec :sts_task:= dbms_sqlpa.create_analysis_task(task_name => '9i_11g_spa1',description => 'experiment for 9i to 11gR2 upgrade',sqlset_name=> '9i_prod_wkld');

var exe_task varchar2(64);
exec :exe_task:=dbms_sqlpa.execute_analysis_task(task_name=>'9i_11g_spa1',execution_name=>'9i_trail1',execution_type=>'CONVERT SQLSET',execution_desc=>'9i sql trail generated from sts');

dbms_sqlpa.execute_analysis_task(task_name=>'9i_11g_spa1',execution_name=>'10g_trail1',execution_type=>'TEST EXECUTE',execution_desc=>'10g trail test',-
execution_params=>dbms_advisor.arglist('DATABASE_LINK','DBLINKNAME'));

select sofar,totalwork from V$ADVISOR_PROGRESS where task_id=<TID>;

Why are DB TIME / DB CPU / CPU TIME in AWR too large or too small? Why are they inaccurate?

This is because the Oracle database instance, running as an operating-system user, must always use OS-provided functions to time these metrics (DB TIME / DB CPU / CPU TIME), and as we have said, "self-observation is error-prone": like timing your own sprint, it can never be entirely accurate.

Second, different system functions behave differently on different operating systems, which also causes variation.

CPU TIME is generally timed with the getrusage() call, whose granularity is 10,000 (10*1000) µsec.

DB TIME is timed externally using the gettimeofday() call.

On AIX there is

Bug 13846587  AIX: Unrealistic CPU Time / DB time with 11.2.0.3 for IBM AIX

This problem is introduced in 11.2.0.3 on AIX systems by the change from 
 
This is an issue where the value of a CPU time related statistics such as 
the DB time in an AWR or the Execute CPU time in a TKPROF shows an 
enlarged value in the scale of E11 seconds.
 
Note:
 There is also an AIX problem in that thread_cputime() may return an incorrect value in some 
   cases, leading to incorrect/unexpected DB CPU figures in Oracle - see Note:1631769.1.


Following is the complete message the customer gave regarding this:
===================================================================
We had a ticket and a call with IBM on Friday. IBM confirms that vac.cfg.61 
file delivered in their C++ compiler which is a separate package from the 
runtime compiler that comes by default with the OS upgrade. Our OS runtime 
compiler is  11.1.0.2 that is the latest and the greatest delivered with IBM 
AIX 6.1 version.
 
Therefore I confirm that we will not have or not able to get vac.cfg.61 file 
because (1) having compiler on the production server is against Bank of 
America IT policies (2) We do not have license for the compiler.
 
What am I expecting from Oracle? I am expecting Oracle to resolve this issue 
where patch installs on our database w/o any installation issues. This is SEV 
1 issue and I will really appreciate if this issue is resolved ASAP.
 

RELEASE NOTES:
]] If the value of a CPU time related statistic such as the DB time in an AWR or the Execute CPU time in a TKPROF shows an enlarged value in the scale of E11 seconds, then apply this patch to correct the problem.
 
 
REDISCOVERY INFORMATION:
An AWR report shows enlarged (E11 secs) "DB time" values.
A TKPROF report shows enlarged (E11 secs) Execute cpu time.
WORKAROUND:

Bug 7445289 – DB time reported too high when there are a lot of direct path reads (Doc ID 7445289.8)

 

Affects:

Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected Versions BELOW 11.2
Versions confirmed as being affected
  • (None Specified)
Platforms affected Generic (all / most platforms affected)

Fixed:

This issue is fixed in

Description

DB time shown in AWR report's Time Model section can be higher than 
the actual value when there are a large number of direct path wait 
events with small (<1ms) avg wait time.
 
eg: When waiting on "direct path read" there may be some element 
    of CPU time included in the wait time. 
    Hence adding DB CPU time to time in non-idle wait time to get
    a DB time figure can give an over high figure due to double 
    counting of any CPU time used during the wait itself.
 
Note: 
 This is only a problem in the reporting of the DB time.
 
Workaround
 Infer DB time from ASH report. time_period * avg_active_sessions

 

In some AWR reports, the DB CPU is larger than the DB Time.

[Screenshots 1911984.1_a, 1911984.1_b, 1911984.1_c from the original note omitted]

  • The issue can also be verified by 10046 trace:
    SQL> alter session set events '10046 trace name context forever, level 12';
    DECLARE
     ABC DATE;
    BEGIN
     FOR I IN 1 .. 10000 LOOP
       SELECT SYSDATE INTO ABC FROM DUAL;
     END LOOP;
    END;
    /
    SQL> alter session set events '10046 trace name context off';
    

    On checking the trace:

    ********************************************************************************
    
    SQL ID: c749bc43qqfz3 Plan Hash: 1388734953
    
    SELECT SYSDATE
    FROM
    DUAL
    
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute  10000      0.38       0.37          0          0          0           0   <--- CPU is larger than elapsed.
    Fetch    10000      0.08       0.10          0          0          0       10000
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total    20001      0.46       0.48          0          0          0       10000
    
    Misses in library cache during parse: 1
    Optimizer mode: ALL_ROWS
    Parsing user id: SYS   (recursive depth: 1)
    Number of plan statistics captured: 1
    
    Rows (1st) Rows (avg) Rows (max)  Row Source Operation
    ---------- ---------- ----------  ---------------------------------------------------
            1          1          1  FAST DUAL  (cr=0 pr=0 pw=0 time=3 us cost=2 size=0 card=1)
    
    ********************************************************************************
    

 

CHANGES

None

CAUSE

The issue is that CPU time and DB time are measured by different C routines.

The CPU time uses the getrusage() call which has a granularity of 10,000(10*1000) msec.
The DB Time uses the gettimeofday() call using the external operating system clocks which has no such issue.
I.e. getrusage() can measure only 0 or N*1000 microsec (N = 1,2,…) for a FETCH call but the real elapsed time may be somewhere in between the 2 values.
This means that getrusage tends to overestimate the CPU time so you can see a slightly higher value of “DB CPU” than “DB Time” especially when thousands of roundings are involved.
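The over-count from quantization can be illustrated with a toy simulation: if each of thousands of short calls is charged a whole tick of CPU whenever it is sampled, the summed "CPU time" can far exceed the true total. This is only a sketch of the worst-case rounding effect described above, not a model of getrusage() itself:

```python
TICK_US = 1000  # measurement granularity: 0 or N*1000 microseconds

def charged_cpu_us(true_us, tick=TICK_US):
    """Round one call's true CPU usage up to a whole tick (worst case)."""
    return ((true_us + tick - 1) // tick) * tick

calls = 10000                    # e.g. 10,000 fetches of SELECT SYSDATE
true_per_call_us = 38            # real CPU burned per call (invented)
true_total = calls * true_per_call_us
measured_total = sum(charged_cpu_us(true_per_call_us) for _ in range(calls))

# 380,000 µs of real CPU can be reported as up to 10,000,000 µs
print(true_total, measured_total)
```

With one rounding per call the error is bounded by one tick, but over thousands of short calls those ticks accumulate, which is why "DB CPU" can slightly exceed "DB Time".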

This is not a bug but development commented upon this in the following bugs:

Bug 19266484 awr reports db cpu greater than db time
Bug 16241074 awr shows wrong cpu information for some sqls when hyperthreading is enabled
Bug 19234247 incorrect cpu reported in awr report for sqls

 

SOLUTION

This is a reporting issue as opposed to an actual problem. Since the variance is likely to be low, the differences can be ignored.

REFERENCES

BUG:16241074 – AWR SHOWS WRONG CPU INFORMATION FOR SOME SQLS WHEN HYPERTHREADING IS ENABLED
BUG:19234247 – INCORRECT CPU REPORTED IN AWR REPORT FOR SQLS
BUG:19266484 – AWR REPORT DB CPU GREATER THAN DB TIME

 

Countermeasures for Complete Oracle Database Corruption

The impact of block corruption
Direct impact on the business:

  1. Operations on the data stop
  2. Data recovery takes longer
  3. Mistakes during repair work cause secondary damage
  4. Root-cause analysis takes longer

Main causes of data corruption

Corruption risks exist at, and can be split across, multiple layers

Mechanisms for protecting data

§ Within a single system, mechanisms that account for risks which cannot otherwise be defended against.

–    Example: when data is deleted by human error, a RAID configuration offers no protection.

§ Mechanisms that protect the consistency and correctness of replicated data.

–    Example: a partial backup leaves data integrity incomplete.

§ Mechanisms designed for rapid switchover and reliable repair.

–    Example: due to insufficient disaster-recovery drills, switchover fails in practice and the backup cannot be restored.

What is needed to reliably improve business continuity?

Oracle Maximum Availability Architecture

§ Maximum Availability Architecture (MAA) is a set of best practices based on Oracle-validated high-availability technology and success stories.

§ Goals of MAA:

–    Provide best practices for repairing, detecting, and avoiding all kinds of outages.

–    Provide sample configurations of optimal high-availability architectures.

§ Independent of hardware and OS (no particular expensive products or technologies required)

§ High-availability solutions that can be deployed immediately (validated in advance by Oracle)

Best practices for highly available databases.

The overall picture of Oracle MAA

Low-Cost, Integrated, Fully Active, High ROI

Oracle MAA Data Protection

Oracle Database 11g Release 2

Oracle Database
Data Protection features

Type of Block Corruption

§ Data Block Corruption(Doc ID 840978.1)

–    Physical Block Corruption

–    Logical Block Corruption

§ Other Corruptions

–    Control file Corruption

§ Use control file mirror & Copy

–    Redo Corruption

§ ASM Mirroring / Use multiplexed log file

–    Dictionary Corruption(Doc ID 136697.1)

Doc ID 1088018.1 Master Note for Handling Oracle Database Corruption

KROWN#152523 [Master Notes] Corruption()

Data Block Corruption

§ 物理性的块内数据破损状态的块

–    破损例

•          Block Header不正确

•          Block Header与Footer信息不一致

•         数据缺失

•          Block配置地点不正确

•          0隐藏的Block

Physical Block Corruptions

Data Block Corruption

§ A block whose internal structure is logically corrupt

–    Physically correct (block header and footer information, checksum result)

–    Examples of corruption:

•          A row piece's start or end position lies outside the block

•          Row pieces overlap

•          The ITL number locking a row piece is invalid

•          The number of row pieces the ITL shows as locked does not match reality

•          The size of the free space in the block is incorrect

•          Lost write

Logical Block Corruptions

Optimal corruption detection in Oracle Database

§ An Oracle database block is not just a string of bits;
it has a clearly predefined structure

→ Precisely because Oracle Database understands the block structure, it can detect
both Physical Corruption and Logical Corruption.

→ Moreover, combining Oracle Database technologies raises the detection level further

–    The OS, file system, and storage merely perform I/O on a block as commanded; they cannot judge whether the block's structure is valid as a data block. Exadata Cell Storage Server and the H.A.R.D Initiative can detect this

Data Block Format

Oracle Database Data Protection

§ Initialization parameters controlling the scope, level, and type of corruption detected

–    DB_BLOCK_CHECKSUM

–    DB_BLOCK_CHECKING

–    DB_LOST_WRITE_PROTECT

§ Periodic checking features

–    Oracle Recovery Manager (CHECK LOGICAL clause / VALIDATE command)

–    SQL> ANALYZE TABLE statement (VALIDATE STRUCTURE CASCADE option)

Features that raise the detection level

DB_BLOCK_CHECKSUM

§ A mechanism for detecting Physical Corruption using a checksum stored in the data block

–    Before a block is written to disk:

§ DBWR (or, for direct loads, the server process) stores a checksum, a value computed from all the data in the block, in the block header.

–    After a block is read from disk:

§ The process reading the block recomputes the checksum and compares it against the checksum stored in the block header.

→ If the checksums do not match, the block data is judged to have possibly suffered Physical Block Corruption

Overview
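The write-time embed / read-time verify cycle can be sketched with a toy block and checksum. A simple 16-bit XOR fold stands in for the checksum here; Oracle's actual checksum algorithm is internal and not modeled:

```python
def checksum(payload: bytes) -> int:
    """Toy 16-bit XOR-fold checksum over the block payload."""
    s = 0
    for i in range(0, len(payload), 2):
        s ^= int.from_bytes(payload[i:i + 2], "big")
    return s

def write_block(payload: bytes) -> dict:
    """On write (DBWR): compute the checksum and embed it in the header."""
    return {"header_cksum": checksum(payload), "payload": payload}

def read_block(block: dict) -> bytes:
    """On read: recompute and compare; a mismatch means possible corruption."""
    if checksum(block["payload"]) != block["header_cksum"]:
        raise IOError("checksum mismatch: possible physical block corruption")
    return block["payload"]

blk = write_block(b"row data ...")
assert read_block(blk) == b"row data ..."   # clean read passes

blk["payload"] = b"row dAta ..."            # flip one byte "on disk"
try:
    read_block(blk)
except IOError as e:
    print(e)
```

Note what this sketch also shows about the limits of a checksum: it only proves the bytes read back are the bytes written, which is why a logically corrupt block can still pass the check.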

DB_BLOCK_CHECKSUM

Behavior at each setting value

§ Note that the behavior differs with the setting value

–    DB_BLOCK_CHECKSUM=TYPICAL

•          After a server process updates a block, the checksum is 0

•          DBWR computes and embeds the checksum when writing to disk

→ More efficient, because the checksum is not recomputed on every update

–    DB_BLOCK_CHECKSUM=FULL

•          After a server process updates a block, the checksum is recomputed and embedded

•          DBWR verifies the checksum when writing to disk

→ Block Corruption in memory can be detected

The checksum is embedded into data blocks updated in the buffer cache

DB_BLOCK_CHECKSUM

§ Note that redo handling differs with each release

–    From Release 11.1 onward, generation of the redo block checksum is handled by the foreground process that generated the redo, rather than by LGWR

§ However, in Release 11.1 through 11.2.0.1, when the parameter is set to FULL,
LGWR verifies the integrity of the checksum generated by the foreground process before writing the redo block to disk.

For redo block checksum generation and verification, see KROWN#155653

DB_BLOCK_CHECKSUM

Behavior when corruption is detected

§ Primary database's alert.log

Reference log entries showing repair by Automatic Block Media Recovery (ABMR)

DB_BLOCK_CHECKING

§ After a block is changed in the buffer cache,
its logical integrity is checked, so Logical Corruption can be detected

–    Even when the checksum is correct, a logically invalid state can be detected.

–    "After being changed" covers more than DML data updates; it includes other changes as well.

§ Example: block header information changed as DBWR writes the block out

–    Direct load operations that bypass the buffer cache are not covered by this check

–    The parameter value controls the scope and level of the checks

Overview

DB_BLOCK_CHECKING

§ A higher setting value includes all the checks of the lower values

–     For example, "LOW" = the checks done at OFF + block header checks for all blocks

 

 

 

 

 

 

–     Checks basically run after a block's contents are changed in the buffer cache, but
block header checks are also executed after a block is transferred between RAC instances

Behavior

DB_BLOCK_CHECKING

–     Using the redo, including uncommitted redo, the block is restored to its state before the error

→ The same transaction may be able to continue

Behavior after corruption is detected: when the block on disk is intact

DB_BLOCK_CHECKING

§ Oracle Client

§ Alert.log

Reference log entries on detection: when the block on disk is intact

DB_BLOCK_CHECKING

–     Note that even when automatic repair fails, the same error is returned as when it succeeds.

§  Afterwards, when the block is accessed again, ORA-1578 or ORA-600 is raised

→ Manual Block Media Recovery is required; ABMR is not triggered

Behavior after corruption is detected: when the block on disk is also corrupt

DB_BLOCK_CHECKING

§ Oracle Client

§ Alert.log

Reference log entries on detection: when the block on disk is also corrupt

Soft Corrupt

§ A marker added to a block that has been detected as corrupt (i.e. that Oracle recognizes as corrupt)

–    Accessing a marked block raises ORA-1578.

It is not a term on the same level as Physical Corruption / Logical Corruption

DB_LOST_WRITE_PROTECT

§ A lost write occurs when the storage reports that a block write has completed even though the block was never actually written to disk.

–    Because the data block's structure is intact,
accessing a block that suffered a lost write raises no error

→ Risk of serving incorrect data to customers or applications

→ The incorrect data may contaminate other data

§ Examples of lost-write impact:

–    An order is accepted as in stock regardless of actual inventory
–    An accepted order appears never to have been taken

What is a lost write?

DB_LOST_WRITE_PROTECT

§ The mechanism for detecting lost writes with Data Guard (physical standby database): when the primary database reads a block from disk, it generates verification redo containing

§ the data file number

§ the data block address (DBA)

§ the system change number (SCN)

•          The Data Guard machinery ships this redo to the physical standby database

•          The standby compares the SCN in the redo against its own copy of the target block

→ If the SCNs do not match, a lost write may have occurred

Overview

DB_LOST_WRITE_PROTECT

§ Must be configured on both the primary and the standby database

–    Setting NONE on either the primary or the standby disables the check

§ If the primary does not generate verification redo, the standby cannot verify

§ Even if the primary generates the redo, the standby will not verify it

Behavior for each setting

DB_LOST_WRITE_PROTECT

§ When is a lost write detected?

–    When the MRP on the standby database verifies the redo that was generated as the affected block was read from disk into the buffer cache on the primary

–         → Detection happens not at the instant of the lost (incomplete) disk write,
but when the block that suffered the lost write is later read from disk

 

–    Even if a lost write occurs on a particular block, as long as that block is not used:

§ Queries and update transactions remain correct

§ The incorrect data does not spread to other blocks

A lost write and its detection happen at different times
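The SCN comparison performed by the standby can be sketched as follows. The `BlockReadRecord` class and function names are hypothetical; the SCN values in the usage lines are taken from the PRxx trace excerpt shown later in this document (block read with SCN 3977658 where SCN 3982682 was expected).

```python
from dataclasses import dataclass

@dataclass
class BlockReadRecord:
    """Verification redo generated when the primary reads a block from disk."""
    file_no: int
    dba: int      # data block address
    scn: int      # SCN of the block version the primary saw on disk

def standby_check(brr: BlockReadRecord, standby_scn: int) -> str:
    # MRP on the standby compares the SCN recorded by the primary with the
    # SCN of its own copy of the same block.
    if standby_scn == brr.scn:
        return "OK"
    if standby_scn > brr.scn:
        # The standby is ahead of what the primary read back from disk:
        # the primary lost a write -> ORA-752, redo apply stops.
        return "ORA-752: primary lost write"
    # The standby is behind: its own copy is stale (standby lost write);
    # with Active Data Guard this is repairable via ABR from the primary.
    return "standby lost write"

brr = BlockReadRecord(file_no=7, dba=0x01C00083, scn=3977658)
assert standby_check(brr, standby_scn=3982682) == "ORA-752: primary lost write"
assert standby_check(brr, standby_scn=3977658) == "OK"
```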

DB_LOST_WRITE_PROTECT

§ Verification redo is generated only when a block is read into the Buffer Cache

–    Direct path reads, which bypass the Buffer Cache, generate no verification redo

 

§ In a non-Data Guard environment, a setting of TYPICAL or higher still generates verification redo

–    Media recovery can then verify lost writes

 

§ Lost writes that occur on the standby database can also be verified

–    Functionally the same: verification redo generated on the primary, verified on the standby

–    In this case only, automatic repair via ASM mirroring and ABR is also possible (described later)

Supplementary notes on behavior

DB_LOST_WRITE_PROTECT

Lost write detection flow (lost write on the primary)

DB_LOST_WRITE_PROTECT

§ A lost write on the primary is detected on the standby

–    ORA-752 is recorded in the standby database's alert.log

–    The MRP process stops automatically to protect the standby database's data

§ Subsequent redo apply is halted, preventing contamination by the incorrect data

–    The PRxx process trace file (xxx_xxx_prxx_xxx.trc) records block details:

§ A dump of the redo record generated when the block was accessed on the primary

§ A dump of the corresponding block held on the standby

–    From this information, the lost write can be confirmed afterwards

Behavior after a lost write is detected (lost write on the primary)

DB_LOST_WRITE_PROTECT

Wed Oct 23 19:18:08 2013

Hex dump of (file 7, block 131) in trace file /u01/app/oracle/diag/rdbms/orcls/orcls1/trace/orcls1_pr02_1401.trc

Reading datafile ‘+DATA/orcls/datafile/lw.281.829593935’ for corruption at rdba: 0x01c00083 (file 7, block 131)

Read datafile mirror ‘DATA_0004’ (file 7, block 131) found same corrupt data (logically corrupt)

Read datafile mirror ‘DATA_0006’ (file 7, block 131) found same corrupt data (logically corrupt)

STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE

LOST A DISK WRITE OF BLOCK 131, FILE 7

NO REDO AT OR AFTER SCN 3987667 CAN BE USED FOR RECOVERY.

Recovery of Online Redo Log: Thread 1 Group 7 Seq 103 Reading mem 0

Slave exiting with ORA-752 exception

Errors in file /u01/app/oracle/diag/rdbms/orcls/orcls1/trace/orcls1_pr02_1401.trc:

ORA-00752: recovery detected a lost write of a data block

ORA-10567: Redo is inconsistent with data block (file# 7, block# 131, file offset is 1073152 bytes)

ORA-10564: tablespace LW

ORA-01110: data file 7: ‘+DATA/orcls/datafile/lw.281.829593935’

ORA-10561: block type ‘TRANSACTION MANAGED DATA BLOCK’, data object# 87637

Wed Oct 23 19:18:12 2013

Recovery Slave PR02 previously exited with exception 752

Wed Oct 23 19:18:12 2013

MRP0: Background Media Recovery terminated with error 448

Errors in file /u01/app/oracle/diag/rdbms/orcls/orcls1/trace/orcls1_pr00_1395.trc:

ORA-00448: normal completion of background process

MRP0: Background Media Recovery process shutdown (orcls1)

Excerpt from the standby database's alert.log when a lost write is detected

(lost write on the primary)

 

DB_LOST_WRITE_PROTECT

Hex dump of (file 7, block 131)

Dump of memory from 0x00000000F03B0000 to 0x00000000F03B2000

0F03B0000 0000A206 01C00083 003CC55A 04010000  [……..Z.<…..]

STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE

LOST A DISK WRITE OF BLOCK 131, FILE 7

The block read on the primary had SCN 3977658 (0x0000.003cb1ba) seq 1 (0x01)

 while expected to have SCN 3982682 (0x0000.003cc55a) seq 1 (0x01)

The block was read at SCN 3987667 (0x0000.003cd8d3), BRR:

CHANGE #1 TYP:2 CLS:6 AFN:7 DBA:0x01c00083 OBJ:87637 SCN:0x0000.003cb1ba SEQ:1 OP:23.2 ENC:0 RBL:1

REDO RECORD – Thread:1 RBA: 0x000067.00000128.0010 LEN: 0x0034 VLD: 0x10

SCN: 0x0000.003cd8d3 SUBSCN:  1 10/23/2013 19:18:16

(LWN RBA: 0x000067.00000126.0010 LEN: 0003 NST: 0001 SCN: 0x0000.003cd8cf)

CHANGE #1 TYP:2 CLS:6 AFN:7 DBA:0x01c00083 OBJ:87637 SCN:0x0000.003cb1ba SEQ:1 OP:23.2 ENC:0 RBL:1

 Block Read – afn: 7 rdba: 0x01c00083 BFT:(1024,29360259) non-BFT:(7,131)

              scn: 0x0000.003cb1ba seq: 0x01

              flags: 0x00000006 ( dlog ckval )

Example PRxx trace file output (lost write on the primary)

DB_LOST_WRITE_PROTECT

Lost write detection flow (lost write on the standby)

DB_LOST_WRITE_PROTECT

§ A lost write is detected on the standby database

§ Unlike a lost write on the primary, it is repaired automatically

–    Because the primary database holds the latest intact block,

      it is repaired by Data Guard's Automatic Block Media Recovery

–    When repaired automatically, ORA-752 is not raised (nothing is written to the alert.log)

§ The MRP process continues running normally

Behavior on lost write detection (lost write on the standby)

DB_LOST_WRITE_PROTECT

Summary of behavior after a lost write is detected

DB_ULTRA_SAFE Parameter

§ Changing the DB_ULTRA_SAFE parameter changes three parameter values at once

–    The default value of this parameter is OFF

–    If any of the three parameters is set explicitly, the explicit individual setting takes precedence

Changing the three detection-level parameters together
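The mapping can be sketched as a lookup table. The derived values (MEDIUM / FULL / TYPICAL) follow the Oracle Database Reference description of DB_ULTRA_SAFE; the `effective` helper is a hypothetical illustration of the "individual settings win" rule, not an Oracle API.

```python
# Mapping of DB_ULTRA_SAFE to the three underlying parameters, as documented
# for Oracle Database 11g Release 2.
DB_ULTRA_SAFE = {
    "OFF": {},  # each parameter keeps its own default
    "DATA_ONLY": {
        "DB_BLOCK_CHECKING":     "MEDIUM",
        "DB_BLOCK_CHECKSUM":     "FULL",
        "DB_LOST_WRITE_PROTECT": "TYPICAL",
    },
    "DATA_AND_INDEX": {
        "DB_BLOCK_CHECKING":     "FULL",
        "DB_BLOCK_CHECKSUM":     "FULL",
        "DB_LOST_WRITE_PROTECT": "TYPICAL",
    },
}

def effective(ultra_safe: str, explicit: dict) -> dict:
    """Derived values, with explicitly set parameters taking precedence."""
    derived = dict(DB_ULTRA_SAFE[ultra_safe])
    derived.update(explicit)
    return derived

cfg = effective("DATA_ONLY", {"DB_BLOCK_CHECKING": "OFF"})
assert cfg["DB_BLOCK_CHECKING"] == "OFF"   # explicit setting wins
assert cfg["DB_BLOCK_CHECKSUM"] == "FULL"  # the rest follow DB_ULTRA_SAFE
```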

OLTP Workload Performance

§ Exadata X2-2 Quarter Rack( 2node RAC结构)

§ Oracle Database 11g Release 11.2.0.4

§ Oracle Grid Infrastructure 11g Release 11.2.0.4

§ Exadata Storage Server Software 11g Release 11.2.3.2.1

–    Write Through Mode

Verification environment

OLTP Workload Performance

§ The following user script repeatedly executes generic SQL statements

Workload (transaction) definition

OLTP Workload Performance

§ Three workloads with different transaction mixes were tested

–    As the test number increases, so does the proportion of update processing

Transaction mix patterns

OLTP Workload Performance

§ For each workload on the previous page, the following four parameter patterns were tested

Parameter setting patterns

OLTP Workload Performance

Results: throughput of all test patterns (transactions per second)

OLTP Workload Performance

Results: response time of all test patterns

OLTP Workload Performance

Results: CPU utilization of all test patterns

Oracle Database Data Protection

§ Oracle Recovery Manager(RMAN)

–    The VALIDATE command checks data files and backup files (image copies / backup pieces) for corruption

–    The CHECK LOGICAL clause

§ Detects logical corruption in addition to physical corruption

§ Can be combined with the BACKUP / VALIDATE / RECOVER commands

§ ANALYZE TABLE <TableName> VALIDATE STRUCTURE CASCADE ;

–    Can detect inconsistencies between table and index blocks

Features for periodic corruption checking

RMAN> Validate Check Logical

RMAN> validate check logical datafile 18 ;

Starting validate at 28-AUG-13

using target database control file instead of recovery catalog

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00018 name=+DATA/orcl/datafile/tt.316.824637849

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

—- —— ————– ———— ————— ———-

18   OK     1              277          320             224873829

  File Name: +DATA/orcl/datafile/tt.316.824637849

  Block Type Blocks Failing Blocks Processed

  ———- ————– —————-

  Data       0              4

  Index      0              1

  Other      0              38

Finished validate at 28-AUG-13

Sample output

SQL> ANALYZE TABLE VALIDATE STRUCTURE CASCADE ;

–    Physical block corruption raises ORA-1578

–    If ORA-1499 is raised as shown above, the table and its index are in an inconsistent state.

§ You then need to determine which block actually has the problem.

–    Example: force a full table scan with the FULL hint, or use the NO_INDEX hint

KROWN#68739  Sample output

Oracle Database
Data Protection

Impact of Block Corruption
Direct consequences:

✓ Business outage caused by damaged data

✓ Longer repair times

✓ Human error and secondary damage during repair work

✓ Longer root-cause analysis

 

→ Failures must be found quickly and repaired correctly

Overview of Oracle MAA data protection features

Oracle Database 11g Release 2

Oracle MAA

Main repair features

Oracle ASM / Mirroring

§ Requires a Normal (two-way mirrored) or High (three-way mirrored) redundancy configuration

§ When corruption is detected, the mirror copy is consulted and the block repaired automatically

–    Transparent to the Oracle client (no ORA error is returned)

–    Lost writes that occur on the primary are out of scope

§ For a lost write on the standby, the mirror is used to avoid false detection

§ For redo block corruption, the mirror copy is consulted

–    Place redo log files on a Normal/High redundancy ASM diskgroup

§ Doc ID 1274318.1

Automatic block repair that is transparent to the Oracle client

Automatic Block Media Recovery

Transparent repair via Active Data Guard

Automatic Block Media Recovery

§ When block corruption is detected on the standby, reverse-direction ABR repairs it automatically

–    Target block corruption:

§ Physical block corruption on the standby

–    Items detected by the DB_BLOCK_CHECKSUM mechanism
–    Accessing a block already marked as soft corrupt triggers no action
(ABR only attempts repair before the block is marked.)

§ Blocks that suffered a lost write on the standby

–    Blocks that suffered a lost write on the primary are out of scope
(the incorrect redo generated by updating an incorrect block cannot repair anything)

Additional notes on behavior

Automatic Block Media Recovery

§ Primary database's alert.log

Log output to check when a corrupt block is detected and repaired automatically by ABR

Oracle Recovery Manager

§ Can run only while the data file containing the target block is ONLINE

§ Block corruption caused by lost writes is out of scope

–    Command to repair a specified block

§ Typically you specify a block listed in V$DATABASE_BLOCK_CORRUPTION (one already detected as corrupt); the corruption is checked again at execution time

–    There is also a command that repairs all blocks listed in V$DATABASE_BLOCK_CORRUPTION at once.

Repairing blocks with the RECOVER command

Oracle Recovery Manager

§ Block-level recovery needs an intact copy of the block that can be restored

–    Sources are searched for an intact block in the following priority order:

•          A physical standby database with Active Data Guard

•          Flashback logs (within the database being recovered)

•          RMAN image backups (within the database being recovered)

–    The intact block found is restored, then brought up to date by automatic recovery

–    If block-level repair is not possible, datafile-level recovery is required

Intact blocks that can serve as the basis for block-level recovery

Oracle Enterprise Manager Cloud Control 12c

§ The most suitable recovery can be performed easily, based on the state of the database environment

–    Generated commands take into account whether block-level recovery is possible

–    Recovery can be executed easily from Oracle Enterprise Manager

§ Worked example

–    In the database environment below, we walk through repairing physical block corruption with
EM (Data Recovery Advisor)

§ No Active Data Guard

§ Flashback logs exist (but are past their retention period)

§ No RMAN image copy backup

Repair commands generated automatically by Data Recovery Advisor

Oracle Enterprise Manager Cloud Control 12c

Early detection of failures with Incident Manager

Oracle Enterprise Manager Cloud Control 12c

Data Recovery Advisor

Oracle Enterprise Manager Cloud Control 12c

§ Considering the state of the database environment (three pages back),
block-level recovery is judged to be impossible

Data Recovery Advisor

Oracle Enterprise Manager Cloud Control 12c

Recovery job (script generated from the Advisor's findings)

Flashback Technology

§ The Flashback Database command

–    Enables quick recovery from human error (deleted data, bad updates, etc.)

–    After a failover in Data Guard caused by a lost write on the primary,
lets you quickly rebuild the old primary as the new standby

§ Flashback logs

–    Used when running the Flashback Database command above

–    Also used for the RMAN-driven block-level media recovery described in the previous chapter

Usage examples of Flashback Database (flashback logs)

Oracle Data Guard

§ An Oracle Data Guard physical standby is a complete copy of the database

–    Redo generated on the primary database is applied identically on the standby

§ Block updates on the primary are applied to the same blocks on the standby via redo

→ A database that is always ready to fail over to

 

§ Especially effective when a lost write occurs on the primary database

–    Resume business quickly on intact data

–    Investigate the cause of the lost write without affecting production

§ If the cause is unknown, weigh the risk of continuing to run on the primary's hardware

 

Failover to the physical standby database

Oracle Data Guard

§ Transactions executed between lost-write detection and failover are lost

–    The primary keeps running even after the standby detects the lost write

→ New changes keep being made by business transactions

→ Some of those changes are correct, but some used blocks affected by the lost write

–    The standby stops applying redo after detection, to prevent data contamination.

–         → The key is how quickly the primary can be stopped and a failover performed

  → From Release 11.2.0.4 onward,
the Data Guard Broker PrimaryLostWriteAction property
allows the primary to be aborted automatically when a lost write is detected

Considerations for failover on lost-write detection (1)

Oracle Data Guard

§ After failover, the Data Guard configuration is in a broken state

–    The old primary and the new primary (the former standby) have diverged

→ A standby database must be re-created for the new primary

§ Re-creation methods

Considerations for failover on lost-write detection (2)

Oracle's Data Protection
Conclusion

Data Protection

Oracle Database 11g Release 2

Data Protection

Summary of the detection behavior of the three initialization parameters and the repair methods

Conclusion
Data Protection

Oracle Database provides highly available solutions for protecting your data.

l DB_ULTRA_SAFE Parameter

l Oracle Data Guard

l Oracle Recovery Manager

l Oracle Enterprise Manager

A hands-on example of artificially simulating a logic error in an Oracle data block: ORA-00600: [13013] [5001]

Solutions to common Oracle database errors

Professional Oracle Database recovery services
Phone: +86 13764045638  Email: service@parnassusdata.com

 

Last week a customer hit ORA-00600: [13013], [5001], caused by an Oracle bug that introduced a logic error into a table data block. To explain the case better, the idea came up of reproducing the error artificially.

Background

In Oracle, a table data block is built from several structures: block header, transaction slots, row directory, and row data. The row data consists of row pieces, and each row piece header carries three fields: flag, lock (lb), and column count (cols/cc).

The flag marks the type of the row piece. It occupies one byte, and each bit position has its own meaning, as follows:

 

 

ROW_CLUSTER_KEY = 0x80;              KDRHFK
ROW_CTABLE_NUMBER = 0x40;            KDRHFC
ROW_HEAD_PIECE = 0x20;               KDRHFH
ROW_DELETED_ROW = 0x10;              KDRHFD
ROW_FIRST_PIECE = 0x08;              KDRHFF
ROW_LAST_PIECE = 0x04;               KDRHFL
ROW_FROM_PREVIOUS = 0x02;            KDRHFP
ROW_CONTINUE_NEXT = 0x01;            KDRHFN

Typically, the most common row piece belongs to an ordinary heap table, is not deleted, and involves no row migration or chaining. Its flag is set as follows:

Flag of an ordinary row:

Single Row =
ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE= 0x20 + 0x08 + 0x04= 0x2c

===================================================================================

Flag of a cluster key:

Cluster Key =
ROW_CLUSTER_KEY + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE=
KDRHFL, KDRHFF, KDRHFH, KDRHFK =0x80 + 0x2c =  0xac

BBED> x /rn
rowdata[68]                                 @8166
-----------
flag@8166: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK)
lock@8167: 0x00
cols@8168:    1
kref@8169:    1
mref@8171:    1
hrid@8173:0x01800014.0
nrid@8179:0x01800014.0

col    0[2] @8185: 10 

===================================================================================

Cluster Row =
ROW_CTABLE_NUMBER + ROW_HEAD_PIECE + ROW_FIRST_PIECE + ROW_LAST_PIECE =
(KDRHFL, KDRHFF, KDRHFH, KDRHFC) = 0x6c 

BBED> x /rncc
rowdata[0]                                  @8098
----------
flag@8098: 0x6c (KDRHFL, KDRHFF, KDRHFH, KDRHFC)
lock@8099: 0x00
cols@8100:   10

col    0[2] @8102: 200
col    1[8] @8105: Jennifer
col    2[6] @8114: Whalen
col    3[7] @8121: JWHALEN
col   4[12] @8129: 515.123.4444
col    5[7] @8142: w....
col    6[7] @8150: AD_ASST
col    7[2] @8158: 

                    col    8[0] @8161: *NULL*
col    9[3] @8162: .
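The flag arithmetic above can be checked with a short script. The bit values come straight from the constant list earlier in this article; the `decode_flag` helper is just an illustration, not an Oracle function.

```python
# Bit meanings of the row-piece flag byte, as listed above.
FLAG_BITS = {
    0x80: "K (cluster key)",
    0x40: "C (cluster table member)",
    0x20: "H (head piece)",
    0x10: "D (deleted)",
    0x08: "F (first piece)",
    0x04: "L (last piece)",
    0x02: "P (continued from previous piece)",
    0x01: "N (continues to next piece)",
}

def decode_flag(flag: int) -> list:
    """Names of all bits set in a row-piece flag byte."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

ordinary_row = 0x20 | 0x08 | 0x04   # H + F + L
cluster_key  = 0x80 | ordinary_row  # K + H + F + L
cluster_row  = 0x40 | ordinary_row  # C + H + F + L
assert ordinary_row == 0x2C and cluster_key == 0xAC and cluster_row == 0x6C

# 0xff sets every bit, simultaneously "cluster key" and "cluster member":
# the contradiction that kdrchk reports with check code 6251.
assert len(decode_flag(0xFF)) == 8
```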

When ORA-00600: [13013], [5001] appears with Arg [f] Code = 3, it means the row piece's flag is > 0xc0,
that is, the row piece is simultaneously marked as a Key and as Clustered (row is marked as both a Key and Clustered). The corresponding validation is check code 6251.

When flag >= 0xc0: "kdrchk: row is marked as both a Key and Clustered / Block 12 failed with check code 6251"

When 0xac > flag >= 0xa0: "kdrchk: row is Key and Not only piece of key / Block 12 failed with check code 6255"

When flag = 0x43: "kdrchk: C and neither of H or F / Block 12 failed with check code 6263"

When flag = 0x83: "kdrchk: row is marked both as a Key and being continued / Block 12 failed with check code 6254"

When an Oracle process accesses a data block, it first computes the block's checksum and compares it with the CHECKSUM value stored in the block. If they match, the block has no physical corruption. That test alone, however, is not enough to guarantee the block is 100% correct, so Oracle also performs a series of logical checks, each identified by a check code, covering among other things the flag and cols (cc) fields of each row piece.

In practice, the functions performing these logical checks include kdbchk, kddummy_blkchk, kco_blkchk, kdBlkCheckError, and kdrchk.

In this case, when the server process accessed the problem data block, the check detected that the flag was 0xff (KCHDFLPN). Because that flag is logically self-contradictory, the check treated the row piece as corrupt; an update then raised ORA-00600: [13013], [5001], and a query raised the internal error ORA-600 [qertbFetchByRowID].

It is worth noting that many people believe the dbv tool cannot detect logical corruption. In fact dbv, RMAN, ANALYZE ... VALIDATE STRUCTURE, and BBED's verify can all detect logical corruption to some degree; the most reliable, though, is the VALIDATE STRUCTURE [ONLINE] check with db_block_checksum=true. Ordinary dbv runs only single-block checks and performs no cross-checks, so it cannot catch inconsistencies between a table and its index; VALIDATE STRUCTURE ONLINE can.


The simulation

Now that we understand how the ORA-00600: [13013], [5001] internal error arises, reproducing it artificially is not difficult. We will use the BBED tool.

First, create a tablespace, table, and index for the experiment:

 

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> select * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.askmaclean.com

/* 実験用テーブルスペースを作成する  */

SQL> create tablespace maclean datafile '/home/oracle/maclean.dbf' size 20M;

Tablespace created.

SQL> create table tv tablespace maclean as select rownum t1,'find me' t2 from
dba_tables where rownum<201;

Table created.

SQL> create index ind_tv on tv(t1) tablespace users;

Index created.

SQL> update tv set t2='corrption here' where t1=200;
update tv set t2='corrption here' where t1=200
*
ERROR at line 1:
ORA-12899: value too large for column "SYS"."TV"."T2" (actual: 14, maximum: 7)

SQL> alter table tv modify t2 varchar2(200);

Table altered.

SQL> update tv set t2='corruption here' where t1=200;

1 row updated.

SQL> commit;

Commit complete.

/* The test table is created; the row with t1=200 is the one we will modify by hand later. */

SQL> select dump(200,16) from dual;

DUMP(200,16)
-----------------
Typ=2 Len=2: c2,3

/* The hex representation makes it easy to locate the row with t1=200 */
SQL> alter system checkpoint;

System altered.

SQL> alter tablespace maclean read only;

Tablespace altered.

SQL> select dbms_rowid.rowid_block_number(rowid) bno ,dbms_rowid.rowid_relative_fno(rowid) fno from tv;

BNO FNO
---------- ----------
12 6

[oracle@rh2 ~]$ cp maclean.dbf maclean.dbf.bak


Next, use the BBED tool to locate the target and make the modification:

[oracle@rh2 ~]$ bbed filename=maclean.dbf mode=edit
Password:

BBED: Release 2.0.0.0.0 - Limited Production on Sun Sep 18 22:14:59 2011

Copyright (c) 1982, 2007, Oracle. All rights reserved.

BBED> set blocksize 8192
BLOCKSIZE 8192

BBED> set block 13
BLOCK# 13

BBED> map /v

File: maclean.dbf (0)
Block: 13 Dba:0x00000000
------------------------------------------------------------
KTB Data Block (Table/Cluster)

struct kcbh, 20 bytes @0
ub1 type_kcbh @0
ub1 frmt_kcbh @1
ub1 spare1_kcbh @2
ub1 spare2_kcbh @3
ub4 rdba_kcbh @4
ub4 bas_kcbh @8
ub2 wrp_kcbh @12
ub1 seq_kcbh @14
ub1 flg_kcbh @15
ub2 chkval_kcbh @16
ub2 spare3_kcbh @18

struct ktbbh, 96 bytes @20
ub1 ktbbhtyp @20
union ktbbhsid, 4 bytes @24
struct ktbbhcsc, 8 bytes @28
b2 ktbbhict @36
ub1 ktbbhflg @38
ub1 ktbbhfsl @39
ub4 ktbbhfnx @40
struct ktbbhitl[3], 72 bytes @44

struct kdbh, 14 bytes @124
ub1 kdbhflag @124
b1 kdbhntab @125
b2 kdbhnrow @126
sb2 kdbhfrre @128
sb2 kdbhfsbo @130
sb2 kdbhfseo @132
b2 kdbhavsp @134
b2 kdbhtosp @136

struct kdbt[1], 4 bytes @138
b2 kdbtoffs @138
b2 kdbtnrow @140

sb2 kdbr[200] @142

ub1 freespace[4725] @542

ub1 rowdata[2921] @5267

ub4 tailchk @8188

BBED> find /x c203

File: maclean.dbf (0)
Block: 13 Offsets: 5271 to 5782 Dba:0x00000000
------------------------------------------------------------------------
c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e 64206d65
2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420 6d652c00
0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65 2c000203
c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00 0203c202
5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203 c2025c07
66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202 5a076669
6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807 66696e64
206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669 6e64206d
652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64 206d652c
000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d 652c0002
03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c 000203c2
024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002 03c2024d
0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2 024b0766
696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249 0766696e
64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766 696e6420
6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e 64206d65

The offset of t1=200 is found at 5271.

So the offset of the flag byte (fb) is 5271 - 4 = 5267.

BBED> set offset 5267

OFFSET 5267

BBED> d
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
2c020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e

/* The row starts at offset 5267; the current flag is the normal 0x2c */

BBED> x /rnc

rowdata[0] @5267
----------
flag@5267: 0x2c (KDRHFL, KDRHFF, KDRHFH)
lock@5268: 0x02
cols@5269: 2

col 0[2] @5270: 200
col 1[15] @5273: corruption here

Now modify the flag to 0xff:

BBED> modify /x 0xff

Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
File: maclean.dbf (0)
Block: 13 Offsets: 5267 to 5778 Dba:0x00000000
------------------------------------------------------------------------
ff020202 c2030f63 6f727275 7074696f 6e206865 72652c00 0202c203 0766696e
64206d65 2c000203 c2026407 66696e64 206d652c 000203c2 02630766 696e6420
6d652c00 0203c202 62076669 6e64206d 652c0002 03c20261 0766696e 64206d65
2c000203 c2026007 66696e64 206d652c 000203c2 025f0766 696e6420 6d652c00
0203c202 5e076669 6e64206d 652c0002 03c2025d 0766696e 64206d65 2c000203
c2025c07 66696e64 206d652c 000203c2 025b0766 696e6420 6d652c00 0203c202
5a076669 6e64206d 652c0002 03c20259 0766696e 64206d65 2c000203 c2025807
66696e64 206d652c 000203c2 02570766 696e6420 6d652c00 0203c202 56076669
6e64206d 652c0002 03c20255 0766696e 64206d65 2c000203 c2025407 66696e64
206d652c 000203c2 02530766 696e6420 6d652c00 0203c202 52076669 6e64206d
652c0002 03c20251 0766696e 64206d65 2c000203 c2025007 66696e64 206d652c
000203c2 024f0766 696e6420 6d652c00 0203c202 4e076669 6e64206d 652c0002
03c2024d 0766696e 64206d65 2c000203 c2024c07 66696e64 206d652c 000203c2
024b0766 696e6420 6d652c00 0203c202 4a076669 6e64206d 652c0002 03c20249
0766696e 64206d65 2c000203 c2024807 66696e64 206d652c 000203c2 02470766
696e6420 6d652c00 0203c202 46076669 6e64206d 652c0002 03c20245 0766696e

BBED> x /rnc

rowdata[0] @5267
----------
flag@5267: 0xff (KDRHFN, KDRHFP, KDRHFL, KDRHFF, KDRHFD, KDRHFH, KDRHFC, KDRHFK)
lock@5268: 0x02
cols@5269: 0
ckix@5270: 2

BBED> sum apply

Check value for File 0, Block 13:
current = 0x0000, required = 0x0000

BBED's verify command checks the data block and detects the bad flag:

BBED> verify
DBVERIFY - Verification starting
FILE = maclean.dbf
BLOCK = 12

kdrchk: row is marked as both a Key and Clustered

prow=0x7f5335f05693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f5335f0427c
kdbchk: bad row tab 0, slot 199
Block 12 failed with check code 6251

DBVERIFY - Verification complete

Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 1
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0

The dbv tool also detects this kind of logical corruption:

[oracle@rh2 ~]$ dbv file=maclean.dbf

DBVERIFY: Release 10.2.0.4.0 - Production on Sun Sep 18 22:27:49 2011

Copyright (c) 1982, 2007, Oracle. All rights reserved.

DBVERIFY - Verification starting : FILE = maclean.dbf
kdrchk: row is marked as both a Key and Clustered
prow=0x7f9ef25f7693 flag=0xff
Block Checking: DBA = 25165836, Block Type = KTB-managed data block
data header at 0x7f9ef25f627c
kdbchk: bad row tab 0, slot 199
Page 12 failed with check code 6251

DBVERIFY - Verification complete

Total Pages Examined : 2560
Total Pages Processed (Data) : 1
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 11
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 2548
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 691691 (0.691691)


Back in SQL*Plus, accessing the modified row raises the ORA-600 [13013] [5001] error:


SQL> alter system flush buffer_cache;

System altered.

SQL> update tv set t2='correct here' where t1=200;
update tv set t2='correct here' where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [13013], [5001], [52937],
[25165836], [199], [25165836], [3], []

PLAN_TABLE_OUTPUT
---------------------------------------------------------
Plan hash value: 568795662

----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | UPDATE STATEMENT  |        |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  UPDATE           | TV     |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| IND_TV |     1 |   115 |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"=200)

SQL> select * from tv where t1=200;
select * from tv where t1=200
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [],
[], [], []

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1015724781

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |     1 |   115 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TV     |     1 |   115 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IND_TV |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"=200)

As expected, updating the problem row raises ORA-00600: [13013], [5001], while a query doing TABLE ACCESS BY INDEX ROWID raises ORA-00600: [qertbFetchByRowID].

Solutions

1. If a backup exists, blockrecover can fix the corrupt block online:

RMAN> blockrecover datafile 6 block 12;
Starting blockrecover at 18-SEP-11
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=144 devtype=DISK
starting media recovery
media recovery complete, elapsed time: 00:00:01
Finished blockrecover at 18-SEP-11

However, if the logical corruption really was introduced by an Oracle bug, blockrecover will not help. In that case, use the second method.

2. The second method applies when there is no backup, or when block recovery is not usable: set event 10231 and copy the table with CTAS. Note that the corrupted rows may be lost in the copy:

SQL> alter session set events '10231 trace name context forever, level 10';
SQL> create table TABLE_COPY as select * from TABLE;

More information on the kdrchk checks:


Add check for continued row piece pointing to itself with
corruption description:

"kdrchk: Row piece pointing to itself"

DB_BLOCK_CHECKING = MEDIUM will check for row pieces where the
next rowid (nrid) points to itself (chained row points to itself).
It produces error ORA-600 [kddummy_blkchk] or ORA-600 [kdBlkCheckError]
with check code [6266] (3rd ORA-600 argument).

DBVERIFY reports the same corruption description if the block is corrupt on disk.

RMAN when run with the CHECK LOGICAL option reports it as
     corruption_type=CORRUPT/LOGICAL in v$database_block_corruption.

"ANALYZE TABLE  VALIDATE STRUCTURE" produces error ORA-1498 and trace file
shows the same corruption description.

With this fix in place DBMS_REPAIR can be used to identify and mark the affected
block as Soft Corrupt producing error ORA-1578 and it can be skipped it for DML's
using DBMS_REPAIR.SKIP_CORRUPT_BLOCKS.

  kdrchk: column length 0 but not null
            prow=0x2a97f4d9d6 flag=0x2c column=57
  Block Checking: DBA = 29635651, Block Type = KTB-managed data block
  data header at 0x2a97f4be7c
  kdbchk: bad row tab 0, slot 2
  data_block_dump,data header at 0x2a97d113d8
  data_block_dump,data header at 0x2a97d113d8

 kdrchk: found invalid symbol reference 48
  reference to delete symbol
  valid symbol range [0,78)
  Block Checking: DBA = 411055291, Block Type = KTB-managed data block
  data header at 0x68a3f4
  kdbchk: bad row tab 0, slot 4
  Page 13499 failed with check code 6265

kdrchk: C and neither of H or F
          prow=0x4282803ae flag=0x41
Block Checking: DBA = 322963095, Block Type = KTB-managed data block
data header at 0x42828007c

kdrchk: column length 0 but not null
          prow=0x10021035e flag=0x2c column=40
Block Checking: DBA = 25189259, Block Type = KTB-managed data block
data header at 0x10020fe7c
kdbchk: bad row tab 0, slot 0
Page 23435 failed with check code 6264
kdrchk: column length 0 but not null
          prow=0x1002122e5 flag=0x2c column=40
Block Checking: DBA = 25189260, Block Type = KTB-managed data block

kdrchk:  row is marked as both a Key and Clustered
prow=0xd2bfa981 flag=0xff
File#67, Block#74754

kdbchk: bad row tab 0, slot 0
kdrchk:  no columns, but has one of P or N
          prow=0x934fbffa flag=0x31

DIAGNOSTIC ANALYSIS:
====================
A look at the block dump in the analyze trace file revealed two very
suspicious looking rows:

tab 0, row 0, @0x1ede
tl: 2 fb: --HD---N lb: 0x0
tab 0, row 1, @0x1edc
tl: 2 fb: --HD---N lb: 0x0

The flag bytes in these rows look incorrect.


V$SQLCOMMAND SQL opcodes and names


SQL> select command_type,command_name from V$SQLCOMMAND;

COMMAND_TYPE COMMAND_NAME
------------ ----------------------------------------------------------
           0
           1 CREATE TABLE
           2 INSERT
           3 SELECT
           4 CREATE CLUSTER
           5 ALTER CLUSTER
           6 UPDATE
           7 DELETE
           8 DROP CLUSTER
           9 CREATE INDEX
          10 DROP INDEX
          11 ALTER INDEX
          12 DROP TABLE
          13 CREATE SEQUENCE
          14 ALTER SEQUENCE
          15 ALTER TABLE
          16 DROP SEQUENCE
          17 GRANT OBJECT
          18 REVOKE OBJECT
          19 CREATE SYNONYM
          20 DROP SYNONYM
          21 CREATE VIEW
          22 DROP VIEW
          23 VALIDATE INDEX
          24 CREATE PROCEDURE
          25 ALTER PROCEDURE
          26 LOCK TABLE
          27 NO-OP
          28 RENAME
          29 COMMENT
          30 AUDIT OBJECT
          31 NOAUDIT OBJECT
          32 CREATE DATABASE LINK
          33 DROP DATABASE LINK
          34 CREATE DATABASE
          35 ALTER DATABASE
          36 CREATE ROLLBACK SEG
          37 ALTER ROLLBACK SEG
          38 DROP ROLLBACK SEG
          39 CREATE TABLESPACE
          40 ALTER TABLESPACE
          41 DROP TABLESPACE
          42 ALTER SESSION
          43 ALTER USER
          44 COMMIT
          45 ROLLBACK
          46 SAVEPOINT
          47 PL/SQL EXECUTE
          48 SET TRANSACTION
          49 ALTER SYSTEM
          50 EXPLAIN
          51 CREATE USER
          52 CREATE ROLE
          53 DROP USER
          54 DROP ROLE
          55 SET ROLE
          56 CREATE SCHEMA
          57 CREATE CONTROL FILE
          58 ALTER TRACING
          59 CREATE TRIGGER
          60 ALTER TRIGGER
          61 DROP TRIGGER
          62 ANALYZE TABLE
          63 ANALYZE INDEX
          64 ANALYZE CLUSTER
          65 CREATE PROFILE
          66 DROP PROFILE
          67 ALTER PROFILE
          68 DROP PROCEDURE
          70 ALTER RESOURCE COST
          71 CREATE MATERIALIZED VIEW LOG
          72 ALTER MATERIALIZED VIEW LOG
          73 DROP MATERIALIZED VIEW  LOG
          74 CREATE MATERIALIZED VIEW
          75 ALTER MATERIALIZED VIEW
          76 DROP MATERIALIZED VIEW
          77 CREATE TYPE
          78 DROP TYPE
          79 ALTER ROLE
          80 ALTER TYPE
          81 CREATE TYPE BODY
          82 ALTER TYPE BODY
          83 DROP TYPE BODY
          84 DROP LIBRARY
          85 TRUNCATE TABLE
          86 TRUNCATE CLUSTER
          87 CREATE BITMAPFILE
          88 ALTER VIEW
          89 DROP BITMAPFILE
          90 SET CONSTRAINTS
          91 CREATE FUNCTION
          92 ALTER FUNCTION
          93 DROP FUNCTION
          94 CREATE PACKAGE
          95 ALTER PACKAGE
          96 DROP PACKAGE
          97 CREATE PACKAGE BODY
          98 ALTER PACKAGE BODY
          99 DROP PACKAGE BODY
         157 CREATE DIRECTORY
         158 DROP DIRECTORY
         159 CREATE LIBRARY
         160 CREATE JAVA
         161 ALTER JAVA
         162 DROP JAVA
         163 CREATE OPERATOR
         164 CREATE INDEXTYPE
         165 DROP INDEXTYPE
         166 ALTER INDEXTYPE
         167 DROP OPERATOR
         168 ASSOCIATE STATISTICS
         169 DISASSOCIATE STATISTICS
         170 CALL METHOD
         171 CREATE SUMMARY
         172 ALTER SUMMARY
         173 DROP SUMMARY
         174 CREATE DIMENSION
         175 ALTER DIMENSION
         176 DROP DIMENSION
         177 CREATE CONTEXT
         178 DROP CONTEXT
         179 ALTER OUTLINE
         180 CREATE OUTLINE
         181 DROP OUTLINE
         182 UPDATE INDEXES
         183 ALTER OPERATOR
         184 Do not use 184
         185 Do not use 185
         186 Do not use 186
         187 CREATE SPFILE
         188 CREATE PFILE
         189 UPSERT
         190 CHANGE PASSWORD
         191 UPDATE JOIN INDEX
         192 ALTER SYNONYM
         193 ALTER DISK GROUP
         194 CREATE DISK GROUP
         195 DROP DISK GROUP
         196 ALTER LIBRARY
         197 PURGE USER RECYCLEBIN
         198 PURGE DBA RECYCLEBIN
         199 PURGE TABLESPACE
         200 PURGE TABLE
         201 PURGE INDEX
         202 UNDROP OBJECT
         203 DROP DATABASE
         204 FLASHBACK DATABASE
         205 FLASHBACK TABLE
         206 CREATE RESTORE POINT
         207 DROP RESTORE POINT
         209 DECLARE REWRITE EQUIVALENCE
         210 ALTER REWRITE EQUIVALENCE
         211 DROP REWRITE EQUIVALENCE
         212 CREATE EDITION
         213 ALTER EDITION
         214 DROP EDITION
         215 DROP ASSEMBLY
         216 CREATE ASSEMBLY
         217 ALTER ASSEMBLY
         218 CREATE FLASHBACK ARCHIVE
         219 ALTER FLASHBACK ARCHIVE
         220 DROP FLASHBACK ARCHIVE
         222 CREATE SCHEMA SYNONYM
         224 DROP SCHEMA SYNONYM
         225 ALTER DATABASE LINK
         226 CREATE PLUGGABLE DATABASE
         227 ALTER PLUGGABLE DATABASE
         228 DROP PLUGGABLE DATABASE
         229 CREATE AUDIT POLICY
         230 ALTER AUDIT POLICY
         231 DROP AUDIT POLICY
         238 ADMINISTER KEY MANAGEMENT
         239 CREATE MATERIALIZED ZONEMAP
         240 ALTER MATERIALIZED ZONEMAP
         241 DROP MATERIALIZED ZONEMAP
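For reference, a lookup like this can also be scripted. The sketch below is a hypothetical shell helper covering only a handful of the codes above (the codes appear to be the command type codes exposed, for example, in the COMMAND column of V$SESSION); the full mapping is the table itself.

```shell
# Hypothetical decoder for a few of the numeric command codes listed
# above; extend the case statement from the full table as needed.
decode_command() {
  case "$1" in
    44) echo "COMMIT" ;;
    45) echo "ROLLBACK" ;;
    62) echo "ANALYZE TABLE" ;;
    85) echo "TRUNCATE TABLE" ;;
    *)  echo "UNKNOWN ($1)" ;;
  esac
}
decode_command 85   # prints: TRUNCATE TABLE
```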

Introduction to ORA-600/ORA-7445 Internal Error Analysis

APPLIES TO:

Oracle Database – Enterprise Edition – Version 8.1.7.4 and later
Information in this document applies to any platform.
***Checked for relevance on 30-Sep-2014***

PURPOSE

The purpose of this troubleshooting article is to provide an insight into key areas of internal
600/7445 trace files, to assist in matching an issue to a known bug or to highlight what might be needed
by ORACLE Support to progress an issue further.  Whilst the article can be used to some
extent on 11g, it is primarily written for versions <V10204 due to :-

a) The worked example is based on a bug fixed in V10204 and higher releases
b) The alert/trace file structure in 11g has changed

The article will make reference to other notes so that the traces for 11g can also
be proactively reviewed if wished.

The nature of Internal errors means no one approach can guarantee getting to a solution, but
the intention is to provide an indication of the notes that Support makes available to customers and
a simple workflow process that assists with the vast majority of SRs. A simple example is illustrated
to generate a known bug that is harmless to the database when generated; that said, it
should ONLY be used on a TEST database, as it is not encouraged to willingly raise such errors in
‘LIVE’ environments. This example will then be used to highlight various sections, how each
section can be useful for general 600/7445 analysis, and how to provide the best chance of
identifying a known bugfix should one be available.

The article also links to a number of other notes for completeness, including Note:232963.1
should a testcase be required. A testcase is the ideal for all internal errors but understandably
is not always possible, depending on the problem area.  Tests were made on SUN but the
bug is not platform specific; it may however show slightly different messages within the alert
log, and for Windows based platforms some tools/notes might not be applicable.  However this
does not detract from the primary aim of how a trace file can be analyzed.

An Internal Error whether ORA-00600 or ORA-07445 can fall into many categories :-

Issue reproducible and testcase can be provided
Issue reproducible but no simple testcase perhaps due to code being oracle SQL, 3rd party SQL
Issue not reproducible but persistent and formulates to some pattern e.g. once a day
Issue not reproducible and random pattern to occurrences

By definition if an issue is not reproducible at will in a customer environment a testcase may
be very difficult to obtain but should always be attempted where possible.

These are a simplified subset; depending on root cause there can be many 600/7445 errors.
Typically, when the argument(s) of the error are unique, each should be a different problem,
but if errors occur on the same timestamp, or one always soon follows another, there might be
some relationship, and this will be for Support to determine.

TROUBLESHOOTING STEPS

Worked Example
——————
A real life example will follow and a working methodology will then be provided. Conclusions
will then be made at the end of the analysis and some standard questions will be commented
on that should always be appropriate to analysis of Internal errors and perhaps any SR raised
into support.  The bug number identified will be reported in a later section so as to show the
workflow/analysis methodology used without knowing the solution in advance.  This article
is only suitable for <V10204 databases in terms of the testcase, as the bug it relates to
is fixed in the V10204 PSR and higher RDBMS versions.  The article is still appropriate as an
introduction to analysis for 11g RDBMS and we will mention the main differences for when an internal error is seen in 11g.  Due to the nature of Internal errors it is possible that the error
is reported differently between versions/PSRs/platforms making analysis more complex.  The
example and trace results for this article came from V10201/64bitSUN but the same principles
apply to all.

sqlplus scott/tiger
drop table a;
drop table b;

create table A(col11 number, col12 number);
create table B(col21 number, col22 number);

insert into a values (-3,-7);
insert into a values (null,-1);
insert into b values ( -7,-3);

update a sET col11 =
(select avg(b.col22) keep (dense_rank first order by (col22)) FROM b where
b.col21= a.col12);

The UPDATE will fail at session level with ORA-03113 error. A trace file will also be written
to the UDUMP destination of the database and will now be highlighted, where appropriate
if a section can be found using keyword searches in the trace it will be mentioned.

Before any analysis of UDUMP/BDUMP traces take place there should be an understanding of
how the error(s) are reported in the alert log.

ALERT LOG
—————
As this is a worked example we know what to expect in advance but the alert and an understanding
of what actions take place around an internal error can be fundamental to resolution. It is
certainly possible for internal errors to be a consequence of a problem rather than the cause.

Errors in file xxxx/xxxx_ora_24779.trc:
ORA-07445: exception encountered: core dump [_memcpy()+592] [SIGSEGV]
[Address not mapped to object] [0xFFFFFFFF7B180000] [] []
ORA-00600: internal error code, arguments: [kgh_heap_sizes:ds],
[0xFFFFFFFF7B179E98], [], [], [], [], [], []
ORA-07445: exception encountered: core dump [_memcpy()+592] [SIGSEGV]
[Address not mapped to object] [0xFFFFFFFF7B180000] [] []
ORA-00600: internal error code, arguments: [kghGetHpSz1], [0xFFFFFFFF7B179EA8]

The errors continue in the alert and as we can see >1 error is raised but all are reported
to be within the same tracefile.
There will be cases where >1 error and >1 trace exist but this is
beyond the scope of this ‘Introduction’.
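Scanning an alert log for internal errors is a simple grep. The sketch below runs against a small sample fragment modelled on the alert text above rather than a real alert_<SID>.log:

```shell
# Scan an alert log for internal errors; a small sample fragment stands
# in here for a real alert_<SID>.log under BACKGROUND_DUMP_DEST.
cat > alert_sample.log <<'EOF'
Errors in file xxxx/xxxx_ora_24779.trc:
ORA-07445: exception encountered: core dump [_memcpy()+592] [SIGSEGV]
ORA-00600: internal error code, arguments: [kgh_heap_sizes:ds]
EOF
grep -n -E 'ORA-00600|ORA-07445' alert_sample.log
```

On a real system the same grep would simply be pointed at the alert log in the background dump destination.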

Various sections of the trace are now reported based on this example, not all 600/7445 traces
will allow for each section to be reviewed. Again this is a very simple example and often a
combination of traces will need to be understood which is beyond the scope of this article.

Section 1 : Trace header information
-------------------------------------

xxxx/xxxx_ora_24779.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = xxxxxx
System name: SunOS
Node name: xxxxxx
Release: 5.8
Version: Generic_117350-38
Machine: sun4u
Instance name: xxxxx
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 24779, image: oracle@xxxxx (TNS V1-V3)

This section gives general information on the machine/instance and RDBMS version where the internal error has been seen. Whilst important, it does not really contain much to narrow bug searches, as this information is customer/machine specific, whereas a bug itself should hold information generic to all customers.

 

Section 2 : The 600 or 7445 internal error
------------------------------------------

*** SESSION ID:(143.5) 2006-07-18 10:45:03.004
Exception signal: 11 (SIGSEGV), code: 1 (Address not mapped to object), addr: 0xffffffff7b280000,
PC: [0xffffffff7b700b58, memcpy()+1196]

*** 2006-07-18 10:45:03.008
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [memcpy()+1196] [SIGSEGV] [Address not mapped to object]
[0xFFFFFFFF7B280000] [] []

This article is not written to show the differences between ORA-00600/ORA-07445 errors in any detail.

The basic difference between these two errors is that
an ORA-600 is a trapped error condition in the Oracle code that should not occur, whereas an ORA-7445
is an untrapped error condition detected by the operating system.

 

Section 3 : Current SQL statement
---------------------------------

Current SQL statement for this session:
update a sET col11 =
(select avg(b.col22) keep (dense_rank first order by (col22)) FROM b where b.col21= a.col12);

Not all internal issues will show a current SQL statement, and there is no simple reason why this is the case. If this happens, it is recommended to use all other sections to narrow the search space down. When present in an internal (600/7445) trace file, this section should always be locatable by searching the trace file for 'Current SQL'; it should be the first hit found.
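The 'Current SQL' keyword search can also be scripted; the sketch below uses a trimmed sample trace in place of a real UDUMP trace file:

```shell
# Find the 'Current SQL' section in a 600/7445 trace; trimmed sample
# text stands in for a real UDUMP trace file.
cat > ora_trace_sample.trc <<'EOF'
ksedmp: internal or fatal error
Current SQL statement for this session:
update a sET col11 =
(select avg(b.col22) keep (dense_rank first order by (col22)) FROM b where b.col21= a.col12)
EOF
grep -n -A 2 'Current SQL' ora_trace_sample.trc
```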

In addition there may be a 'PLSQL Call Stack' that pinpoints the schema.object and line number for an internal issue.

 

Section 4 : Call Stack Trace
-----------------------------

The call stack within the trace file is seen
as follows :-

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp()+744         CALL     ksedst()             000000840 ?
FFFFFFFF7FFE7D5C ?
000000000 ?
FFFFFFFF7FFE4850 ?
FFFFFFFF7FFE35B8 ?
FFFFFFFF7FFE3FB8 ?
ssexhd()+1000        CALL     ksedmp()             000106000 ? 106324A04 ?
106324000 ? 000106324 ?
000106000 ? 106324A04 ?
sigacthandler()+44   PTR_CALL 0000000000000000     000380007 ?
FFFFFFFF7FFEB9A0 ?
000000067 ? 000380000 ?
00000000B ? 106324A00 ?
_memcpy()+592        PTR_CALL 0000000000000000     00000000B ?
FFFFFFFF7FFEB9A0 ?
FFFFFFFF7FFEB6C0 ?
FFFFFFFFFFFFFFFF ?
000000004 ? 000000000 ?
ksxb1bqb()+36        FRM_LESS __1cY__nis_writeCol  FFFFFFFF7B2F0000 ?
dStartFile6FpcpnNdi  FFFFFFFF7B35EE90 ?
rectory_obj__i_()+3  FFFFFFFFFFFE6DA0 ?
75                   FFFFFFFFFFFFFFFF ?
000000004 ? 000000000 ?
kxhrPack()+1088      CALL     ksxb1bqb()           FFFFFFFF7B0F7220 ?
FFFFFFFF7B354BBF ?
FFFFFFFFFFFF1070 ?
FFFFFFFF7B2E5D60 ?
FFFFFFFF7B0F7220 ?
000001FF0 ?
qescaInsert()+280    CALL     kxhrPack()           000000080 ?
FFFFFFFF7B34A5E8 ?

The stack function is the first column on each line, and so reads:

ksedmp ssexhd sigacthandler memcpy ksxb1bqb  .....

For clarity the full stack trace is summarised to :-

Function List (to Full stack) (to Summary stack)
ksedmp ssexhd sigacthandler memcpy ksxb1bqb kxhrPack
qescaInsert subsr3 evaopn2 upderh upduaw kdusru
kauupd updrow qerupRowProcedure qerupFetch updaul updThreePhaseExe
updexe opiexe kpoal8 opiodr ttcpip opitsk opiino opiodr opidrv
sou2o opimai_real main start

There is no automated tool for customers to produce this
summarised form.  Note:211909.1 reports a further example
of ORA-7445 stack analysis.
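An ad-hoc approximation of the summarisation can nevertheless be scripted: keep the first column of each stack line and strip the '()+offset' suffix. The stack text below is an abridged copy of the example trace, not a complete one:

```shell
# Rough approximation of the manual stack summarisation above: select
# lines whose first field looks like 'func()+offset', then strip the
# offset. Abridged sample stack text stands in for the real section.
cat > stack_sample.txt <<'EOF'
ksedmp()+744         CALL     ksedst()             000000840 ?
FFFFFFFF7FFE7D5C ?
ssexhd()+1000        CALL     ksedmp()             000106000 ? 106324A04 ?
_memcpy()+592        PTR_CALL 0000000000000000     00000000B ?
EOF
awk '$1 ~ /\(\)\+/ { sub(/\(\)\+.*/, "", $1); print $1 }' stack_sample.txt
```

This prints one function name per line (ksedmp, ssexhd, _memcpy for the sample), which can then be joined into a search string.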

The call stack can be very powerful to narrow a problem down to known issues
but if used incorrectly will generate many hits totally unrelated to the
problem being encountered.

As the functions and their purpose are ORACLE proprietary, this article can only give pointers towards good practice, and these include :-

a) Ignore function names that are before the 600/7445 error so for this worked example searches on 'ksedmp','ssexhd' or 'sigacthandler' will not benefit.

The top most routines are for error handling, so this is
why the failing function 'memcpy()' is not at the top of
the stack, and why the top most functions can be ignored.

b) Ignore calls towards the end of the call stack

c) Try a number of different searches based on the failing function from the 600/7445 and 4-5 functions after

In this case a useful 'My ORACLE Support' (MOS) search would be : memcpy ksxb1bqb kxhrPack qescaInsert subsr3

 

Section 5 : Session Information
-------------------------------

SO: 392b5a188, type: 4, owner: 392a5a5b8, flag: INIT/-/-/0x00
(session) sid: 143 trans: 3912d2f78, creator: 392a5a5b8, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0001-000F-0000000F, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 6, prv: 0, sql: 38d8c05b0, psql: 38d8c0b20, user: 173/SCOTT
O/S info: user: xxxxxx, term: pts/21, ospid: 24482, machine: xxxxxx
program: sqlplus@xxxxxx (TNS V1-V3)
application name: SQL*Plus, hash value=3669949024
last wait for 'SQL*Net message from client' blocking sess=0x0 seq=282 wait_time=830 seconds since wait started=1

A 600/7445 trace should always have a 'Call stack Trace' whether it contains function names or not, after this follows 'PROCESS STATE' information and in this we can search on keyword 'session'.

The first hit should take us to the section of the trace file that shows where the error came from and the session/user affected (above they have been blanked to xxxxxx). In many cases we will see user, terminal and machine information that can often be useful for any issue that needs to be reproduced for further investigation. On many occasions an internal error will be reported in alert logs and the DBA will not necessarily have been informed by any users of the database/application that they encountered a problem.

This section can often also show 3rd party sessions e.g. TOAD/PLSQL or even where the client is another ORACLE product e.g. FORMS/APPS and even if just a client process e.g. EXP/RMAN. This can be another powerful method of narrowing a problem down, in the case of a problem coming from 3rd party software it is important to determine wherever possible if an issue can be reproduced in SQLPLUS.

 

Section 6 : Explain Plan Information
-------------------------------------

============
Plan Table
============
---------------------------------------+-----------------------------------+
| Id | Operation | Name | Rows | Bytes | Cost | Time |
---------------------------------------+-----------------------------------+
| 0 | UPDATE STATEMENT | | | | 3 | |
| 1 | UPDATE | A | | | | |
| 2 | TABLE ACCESS FULL | A | 2 | 52 | 3 | 00:00:01 |
| 3 | SORT AGGREGATE | | 1 | 26 | | |
| 4 | TABLE ACCESS FULL | B | 1 | 26 | 3 | 00:00:01 |
---------------------------------------+-----------------------------------+

This section can be found in the 600/7445 trace (when present) using the case
sensitive search 'Explain plan' and will be the first hit unless the SQL for
example contained the same string.

In this section the 'Plan Table' for the failing SQL will be reported and
especially for CBO related internal errors might give a pointer to finding
a workaround. For this worked example nothing really can be used but we might
for example see the plan using a 'hash join' and if so a possible w/a to
try would be hash_join_enabled=false. Caution should be used here, permanent
workarounds should only be set in place if matching to a known bug and the
implications are understood.

 

Section 7 : Current Cursor Information
--------------------------------------

Current cursor: 4, pgadep: 0
Open cursors(pls, sys, hwm, max): 11(3, 7, 50, 50)
NULL 6 SYNTAX 0 PARSE 0 BOUND 5 FETCH 0 ROW 0
Cached frame pages(total, free):
4k(61, 58), 8k(10, 10), 16k(1, 1), 32k(0, 0)
pgactx: 38d2c9e98 ctxcbk: 38d2c9b30 ctxqbc: 0 ctxrws: 38e751110

This can be found using the case sensitive search in the trace file for 'Current cursor:' [do not include the quotes], this section will either tie up with the current SQL statement reported in the 600/7445 or when there is no SQL statement reported it 'might' assist in further analysis.

Once the current cursor is known the information on this can be found using the search 'Cursor#X' where X is the value shown e.g in this case 4.

However this format is >=10.2 specific,
so it should be noted that prior to 10.2
the search 'Cursor X ' should be used instead.

Cursor#4(ffffffff7b231828) state=BOUND curiob=ffffffff7b2456f8
curflg=4c fl2=0 par=0 ses=392b5a188
sqltxt(38d8c05b0)=update a sET col11 =
(select avg(b.col22) keep (dense_rank first order by (col22)) FROM b where b.col21= a.col12)
hash=e2ef469857c74a273b881c20276493b5
parent=38eabc4b8 maxchild=01 plk=38f664f50 ppn=nll

If a SQL statement is using BIND variables it is the cursor information that often proves useful at isolating the values used in the SQL which again can be the key to reproducing a problem if data specific.
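The two searches described above ('Current cursor:' first, then 'Cursor#X') can be chained in a script; the sample below is a trimmed stand-in for a real trace file:

```shell
# Follow the 'Current cursor' pointer to its cursor section (10.2+
# trace format); trimmed sample text stands in for a real trace.
cat > cursor_sample.trc <<'EOF'
Current cursor: 4, pgadep: 0
Cursor#1(...) state=NULL
Cursor#4(ffffffff7b231828) state=BOUND curiob=ffffffff7b2456f8
sqltxt(38d8c05b0)=update a sET col11 =
EOF
n=$(sed -n 's/^Current cursor: \([0-9]*\),.*/\1/p' cursor_sample.trc)
grep -A 1 "Cursor#$n(" cursor_sample.trc
```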

 

ORA-00600/ORA-07445 Analysis : General Guidelines
--------------------------------------------------

Now that the sections have been explained we need to ensure we use as systematic a procedure as possible to determine if a known issue/bug exists. It should be clear that a 600/7445 trace contains far more information but if the above sections together with what follows cannot conclude to a known issue it is recommended that a SR is raised to support.

Three useful documents types can assist here :-

OERI notes
----------
An 'OERI' article is a document made available to customers where possible, giving a brief meaning of the error, its impact, and known bugs relating to the 600/7445. The quickest way to find this information from MOS is to
search as :

a) 7445 <1st argument> oeri
or
b) 600 <1st argument> oeri

So if the internal error was 600[12333] the search is: 600 12333 oeri

So for this article and worked example the search would be : 7445 memcpy oeri

Equally the 600/7445 tool as per Note:153788.1 can be used and if the trace file is submitted into this tool an OERI note will be reported where published.  If the testcase is attempted on a WINDOWS platform the error reported in the alert and trace may not be the same and the tool might not return any hits.  Finding a suitable example for all platforms is beyond the scope of this document.

So for the latter search we should find Note:310531.1. It should be noted that not all bugs associated with an error will be tagged, but ideally they would be.

It should also be clear from the notes seen in this type of search that ORA-600[x]/7445[y] does not mean only one bug can attribute to the Internal error.

BUG.8 notes
-----------
If the OERI note or another search gets a hit on a bug, there may be a link to a note whose MOS Noteid is <BUGNO>.8

For this worked example Note:5162852.8 exists, also to illustrate their purpose we can take for example Note:4451759.8

These articles are automatically generated when a bug becomes fixed and will typically include :-

--Product affected
--Range of releases the bug/error is believed appropriate to
--Known affected version(s)
--Fixed version

Together with the 'symptoms' of the bug and, where possible, workarounds. This information is also in the bug itself, but the note presents it in a much clearer form.

Patch Set : List of Bugfixes by Problem Type Articles
-----------------------------------------------------

When a PSR is released, ORACLE ensures there is an article that summarises all the bugfixes made into that PSR. This article allows customers a quick search to see if an ORA-600/ORA-7445 error of the same type is fixed in the PSR. As mentioned earlier in the OERI section, just because there is a fix for the same error type, it does not guarantee that the problem is the same.

For example, if one of these PSR notes mentions ORA-00600[729], it means that the fix relates to one specific way this error can be encountered. It does not stop 600[729] occurring for other reasons, and even with the latest PSR applied to a given RDBMS release the internal error could still occur due to a new, unknown bug.

These documents can be found in MOS using a search with keywords like :-

<version> patch set bug fixes problem type

So if we are looking to see what fixes are made into the V92070 PSR we would use
the six keywords : 9.2.0.7 patch set bug fixes problem type

MOS will produce many hits, but the internet browser's search function can then be used to search on the string '9.2.0.7' to quickly isolate the note, which in this case is Note:308894.1

It is recommended to search always on the highest PSR available for the version you are running on since the note will then also contain links to the previous PSRs.

ORA-00600/ORA-07445 Analysis : Keyword Searches
-----------------------------------------------

This is the stage of the article where we need to use the research gathered from trace files and known MOS notes to see if we can use the MOS knowledge base to determine if the internal error is a known bug. It was commented that Section 4 of a trace file can be very powerful here, but again it is impossible to provide one unique way to search the knowledge base. Instead we will use the worked example and give some recommendations on 'bad' and 'good' search practices.

Bad Searches Practices
----------------------

a) Using first few calls of stack trace e.g ksedmp ssexhd sigacthandler memcpy
b) Use of MOS 'Advanced Search' and using the 'using any of the words' option

The reason this is a poor search method is that it is likely too many hits will be returned especially if keywords like 600/7445 are used. The default search functionality is an AND base search where all keywords must be present and this provides for better searches and fewer unrelated hits.

c) Using all of the stack trace. As explained above a default MOS search will find hits that contain all the keywords and using the whole stack allows a real risk of getting zero hits returned.

Good Searches Practices
------------------------

It should be noted that more than one search might actually be needed to get a feeling for which bugs/notes are relevant to the customer's issue. The scope of what can be searched will depend on whether a problem is reproducible.

a) Isolating a good section of the 'Call Stack' from Section4
b) If SQL is known then look at SQL to see if there is something special on the SQL e.g for this worked example we know dense_rank is being used and this is not common SQL
c) If a problem is reproducible and an explain plan is available look at the plan to see if there is something that can be changed at session level. An example was reported in Section6 and if a parameter is seen to make a query work this parameter can then become a very useful keyword for determining if a known bug exists.
d) Using the RDBMS version as keyword can sometimes show if any bugs exist that contain a oneoff fix for the same internal error argument but caution is needed as it is not a guarantee that the bug will cover the same cause.

With this information in mind the two searches :-

memcpy ksxb1bqb kxhrPack qescaInsert subsr3
memcpy dense_rank

both get very few bug hits and BUG:5162852 is quickly identified.

Worked Example for 11g
———————————

The general principles of internal 600/7445 error analysis holds
true for 11g but there are some differences in how traces get reported.

To illustrate Note.808071.1 provides an excellent testcase to reproduce
the error ORA-00600[16613]. As stated already if there is a desire to
reproduce this error so that the alert/trace can be looked at in a proactive
manner then it should ONLY be used on a TEST database.

The example will not be reported in the same detail but instead we can use
it to highlight the key differences.

After the error has been reproduced we need to look at the alert log which
can be located via :-

show parameter dump_dest

background_dump_dest string c:\app\diag\rdbms\orcl\orcl\trace
core_dump_dest               string c:\app\diag\rdbms\orcl\orcl\cdump
user_dump_dest               string c:\app\diag\rdbms\orcl\orcl\trace

It is the background and user destinations for the ‘trace’ directory that we will
need to look in.

Tue Nov 15 09:33:59 2011
Errors in file c:\app\diag\rdbms\orcl\orcl\trace\orcl_ora_3896.trc (incident=74841):
ORA-00600: internal error code, arguments: [16613], [0x37D727F8], [2], [], [], [], [], []
Incident details in: c:\app\diag\rdbms\orcl\orcl\incident\incdir_74841\orcl_ora_3896_i74841.trc

For more details on how critical/internal errors are reported within 11g Note.443536.1
can be used. For further analysis of the ORA-00600 we are interested in the ‘Incident Trace’
and the full path to this file is still provided by the alert log. Such traces can also
be located via ‘ADRCI’ but is beyond the scope of this article.
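Since the full incident trace path is printed directly in the alert log, it can be extracted with a one-line sed; the sketch below uses a sample alert fragment modelled on the lines above:

```shell
# Extract the 11g incident trace path printed in the alert log; a small
# sample fragment modelled on the alert text above is used here.
cat > alert11g_sample.log <<'EOF'
Errors in file c:\app\diag\rdbms\orcl\orcl\trace\orcl_ora_3896.trc (incident=74841):
ORA-00600: internal error code, arguments: [16613], [0x37D727F8], [2], [], [], [], [], []
Incident details in: c:\app\diag\rdbms\orcl\orcl\incident\incdir_74841\orcl_ora_3896_i74841.trc
EOF
sed -n 's/^Incident details in: //p' alert11g_sample.log
```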

If we look at the appropriate incident trace file we will see very similar traces to
the 10.2 example and the same analytical approach to these traces can be followed.

To provide 11g files to support it is vital to provide the appropriate incident traces
and Note.443529.1/Note.411.1 can be reviewed but again is outside of the scope
of how to try and diagnose the internal error itself.

ORA-00600/ORA-07445 Analysis : Key Questions
——————————————————————-
If trying to research an Internal error before raising a SR to ORACLE Support the following
questions will help avoid further questions from analysts.  These questions are asked in the
MOS template when raising a SR and if answered as clearly as possible this can greatly assist
in getting to a solution.

Is the error reproducible?

If an issue is reproducible where possible it should also be tested from SQLPLUS
(ref Section 5).  It is possible that a known bug/new bug might only occur on JDBC
based clients for example and time needs to be spent understanding the minimum
configuration needed to reproduce a problem.

If an issue is indeed reproducible in a controlled manner a testcase for the majority
of issues should be possible and if a SR or pre-analysis is not matched to a known bug
very rarely can a problem be fixed from a trace file alone.  For this reason please be
aware of Note:232963.1

Has this worked before?

If functionality that was working has just stopped it would tend to suggest either some
issue of data within one or more objects or some more significant change.  If a 600/7445
is seen within Development particularly on higher PSRs/releases it could well be that
you have found a new bug.  Again depending on reproducibility a testcase ideally would
be required.

Has any new change been made recently e.g new code/new PSR or DB version/higher DB load?

Rarely do 600/7445 errors just start to occur due to a database being open for a certain
amount of time.  Even a small change of init/spfile parameters can influence how the database operates
so if any change is known it is better to mention asap no matter how small.

What is frequency of the error, when did it start?

On many occasions a 600/7445 will be seen as a oneoff and not to be seen again.  If pre-analysis
cannot be matched to a known issue and if in any doubt about an error a SR should always
be raised.  It is however necessary to be aware that if the error is a oneoff or sporadic there is a
tradeoff between impact of error and further analysis, as covered below.

On other occasions we might see the error on the hour or some consistent period of time, in
these cases root cause might be some job that runs or linked to an ORACLE process that
gets invoked at certain times.  Finding some pattern can sometimes be vital to solving an issue
and knowing when an issue truly started is very important to analysis.

What is the known visible impact?

Of course any 600/7445 [or any issue for all product support] should be raised as a SR to
ORACLE if there is any concern.  Sometimes an Internal error's impact is very clear, in that
we might see an error in a controlled manner only when a session logs off the database or the
database closes down.  It is equally possible that a database may seem to be running stable
with no users reporting problems and only by proactively looking at the alert can errors be seen.

In all cases if a SR is raised to Support we will need to try and assess the impact of a given
error and to make a confirmation asap if it is linked to something serious within the database, e.g. corruption.

In many cases of analysis there soon becomes a tradeoff between seeing the error in the alert
and the further diagnostics needed to try and get to the root cause.  For this reason if an error
is a oneoff or very infrequent and determined not to visibly affect users a decision may need to
be made on how far to go especially if running on a desupported release or not on the latest PSR.

The above questions relate to Support asking this information from the customer, one of the key questions asked by customers to Support is :-

‘Why am I being asked to go to the latest PSR when a bug has not been identified?’

The primary answers for this are twofold :-

a) Note:209768.1 explains the backport policy for new bugs. So if Support cannot match analysis to a known issue, a new bug can be filed. Sometimes Development can release ‘diagnostic patches’ to better analyze new issues, but it is recommended to apply these on the latest patchset. Also, on older PSRs it is far more likely for the customer to be running with >1 ‘oneoff’ fix, placing the customer into a configuration that is probably not seen with many other customers.

b) Perhaps more importantly each PSR contains many fixes and can solve many issues without
Support being able to determine why.  For this reason the latest PSR is seen as a statistically faster
option to solve a problem when an error cannot be matched to a known bug.

Of course if considering the latest PSR appropriate backups and testing need to be made in advance.

Summary and Conclusions
—————————

The article can be summarised into key sections :-

1. Most importantly always raise a SR if anything is unclear
2. Consider the ORA600/7445 tool : Note:153788.1
3. Make use of the key articles surrounding internal errors i.e. OERI notes and BUG.8 notes
4. Understand basic trace information to be able to use MOS better for ‘efficient’ searching
5. If a SR is to be raised, provide as much information as possible based on the MOS
template questions, as these will often be vital to solving the SR.

In conclusion, 600/7445 errors are hopefully not a common event for DBAs to see, and
analysis of these errors is not something that would be expected of them.  However it is often of
interest to many customers what is looked at within Support and how we make a formal analysis.
Therefore the article’s purpose is to highlight some of these processes, with the hope that some
self analysis can be made, and to assist customer/support interaction on these issues.
