Oracle中RMAN恢复目录性能问题研究

下列文本摘自metalink:

Applies to:

Oracle Server – Enterprise Edition – Version: 9.0.1.0 to 11.1.0.8
Information in this document applies to any platform.

Purpose

This note discusses what to look for when queries against the RMAN repository (whether in the controlfile or a catalog database) are performing badly and what diagnostics needs to be gathered when raising a Service Request with Oracle Support Services.

This note is intended for use by Database Administrators and Support personnel investigating RMAN performance problems relating specifically to catalog resyncs or queries against the controlfile or catalog database whilst running RMAN.

Last Review Date

November 7, 2008

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Identifying a problem with the RMAN repository

The RMAN repository is updated whenever new backups are taken or housekeeping is done to maintain the backup metadata; it is queried when RMAN reports are run and during restore to check backup history and determine the best and most efficient way to carry out the restore. If a catalog is used, a number of resyncs are done first to update the catalog with any new information in the controlfile but once the resyncs are complete, ALL subsequent queries of RMAN metadata (during crosscheck, report, restore etc) are run against the catalog database only.

You know there is a problem if:

  • An explicit catalog resync takes a long time to complete
  • Statspack or AWR reports for a nocatalog run shows top wait events in the target are for CPU or control file IO:
    • “control file parallel write”
    • “control file sequential read”
  • Statspack or AWR reports for catalog run shows top wait events in the catalog are for CPU or the catalog datafile IO:
    • “db file scattered read”
    • “db file sequential read”
  • Statspack or AWR reports shows most expensive queries in terms of CPU or elapsed time are those issued against the controlfile or rman catalog tables
  • You have worked through Note 360443.1: RMAN Backup Performance, and identified the bottleneck to be in the Oracle Layer before the backup actually starts
  • You have worked through Note 740911.1: RMAN Restore Performance, and identified the bottleneck to be in the Oracle Layer before the restore actually starts

Control_file_record_keep_time and recovery window

Control_file_record_keep_time specifies the minimum amount of RMAN metadata guarranteed to be retained in the controlfile; the default is 7 days.  In practice the rate at which records age out of the controlfile depends on how long it takes for a recyclable section to fill up which may well be more than 7 days. It is important to note that control_file_record_keep_time and recovery window are NOT the same; the former governs when records can be overwritten in the controlfile, primarily to prevent unlimited controlfile growth whilst the latter is concerned with retention of RMAN metadata according to business requirements (see Bug 6448254: RAISE WARNING IF RECOVERY WINDOW IS > CONTROL_FILE_RECORD_KEEP_TIME IN NOCAT MODE). If a catalog is used and RMAN is run daily or at a frequency LESS than control_file_record_keep_time days, RMAN metadata will never be lost as implicit catalog resyncs ensure data is regularly propogated into the catalog.

Integrity of RMAN metadata

Constraints in the catalog ensure the integrity of the rman metadata. For example, a backup_set must have a corresponding backup_piece which must have corresponding backup_X (datafile, redolog ) entries. There are no such constraints in a controlfile; records in different sections can age out at different rates. For example, the one:many relationship between backup_set and backup_redolog entries means the backup_redologs section will fill up faster than the backup_set section and over time, a backup_redolog entry may age out of the controlfile BEFORE its corresponding backup_set entry, leaving a backup_set entry with missing backup_redolog entries (mis-matched). Running housekeeping operations (report/delete obsolete, crosscheck etc) against such a CONTROLFILE with mismatched entries will produce unpredictable results (see Bug 7357779: RMAN REPORT/DELETE OBSOLETE RETURNS ARCHIVELOG BACKUPS INSIDE RETENTION POLICY).

To avoid any issues with mis-matched entries, set control_file_record_keep_time equal to the recovery window +1 which ensures that RMAN metadata INSIDE the retention policy will never be affected. In practice this may not be possible if a very large recovery window is needed so typically a catalog is used, a retention policy configured and control_file_record_keep_time left to its default of 7 days. But there may still be times when the catalog database is unavailable (it is being upgraded perhaps or is under investigation) and an RMAN backup must still proceed in which case:

  • set control_file_record_keep_time days to the recovery window plus 1 to ensure no loss of metadata whilst the catalog is unavailable
  • DO NOT RUN ANY MAINTENANCE OPERATIONS until the catalog is again available expecially if your recovery window is greater than the original setting of control_file_record_keep_time days

Tackling a performance issue

AWR snapshot reports from the target (nocatalog) or catalog database should identify the specific SQL that is causing a problem in terms of CPU, high IO or elapsed times.

Refer to the following notes to see if the problematic SQL is a known issue:

Note 247611.1:Known RMAN Performance Problems
Note 463227.1: Oracle10g RMAN Recovery Catalog Known Performance Issues

If none of the issues in the above notes are relevant, then further investigation is needed and the following diagnostics will be useful and should be uploaded if a Service Request is raised with Oracle Support Services. There are three processes to consider:

  • The RMAN client – this should always be traced
  • The 1st default channel (for nocatalog operations only, this process runs all the rman queries against the controlfile)
  • The RMAN session in the catalog instance

All the traces will generate a lot of information so consider FIRST how to minimise the output by stripping the job down to a SINGLE rman command. Execute each command in a script in isolation to see which command produces the problem and trace just that one command. It is always worth running an explicit catalog resync first to see if the real problem lies in the implicit resync done for virtually all commands.

Tracing the RMAN client

Ideally, set debug trace on the RMAN command line and execute a simplied script to illustrate the problem eg

%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc debug
run {
allocate channel t1 type sbt parms=(…………);
backup database plus archivelog all;
}

Otherwise trace only the command that you are interested in. The following example traces only the crosscheck command:

%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc
run {
allocate channel t1 type sbt parms=(….);
allocate channel t2 type sbt parms=(….);
backup database;
backup archivelog all until time ' sysdate –7' delete all input;
debug on;
crosscheck backup;
debug off;
}

Tracing queries against the RMAN repository

Make sure before you invoke rman that the following environmental variables are set:

NLS_LANG='american_america.<charset>'
NLS_DATE_FORMAT='dd-mon-rr hh24:mi:ss'

And for the target database (nocatalog operations) or catalog database (catalog operations):

TIME_STATISTICS=TRUE
MAX_DUMP_FILE_SIZE=UNLIMITED

If a recovery catalog is normally used, try again WITHOUT the catalog; if the problem persists then for diagnostic purposes, get the trace against the CONTROLFILE only. Tracing the RMAN catalog session should only be done if the problem only occurs when a catalog is used.

The Controlfile (nocatalog)

%rman target / log rman.log trace rman.trc debug
run {
sql "alter session set events ''10046 trace name context forever, level 12''";
set command id to 'rman';
allocate channel for maintenance type sbt…………..;
backup database;
}

Note: Two single quotes before : 10046
Two single quotes and a double quote after : level 12

This traces the first default channel; trace file will be generated in the target udump directory using the format <SID>_ora_<spid>.trc. To find the <spid> of the 1st default channel:

col program format a20
col action format a20
select s.sid, p.spid, s.program, s.client_info,s.action, seq#, event, wait_time,
seconds_in_wait AS sec_wait
from v$session s, v$process p
where s.paddr = p.addr and s.program like '%rman%';

Typical output:

SID SPID PROGRAM                          CLIENT_INFO
---------------------------------------------------------
ACTION              SEQ# EVENT                      WAIT_TIME SEC_WAIT
-----------------------------------------------------------------------------
139 24822 rman@celcsol4 (TNS V1-V3)
0000012 FINISHED129    853  SQL*Net message from client 0        1794
147  4574 rman@celcsol4 (TNS V1-V3)
115  SQL*Net message from client 0        1797
135  5056 rman@celcsol4 (TNS V1-V3)       rman channel=ORA_DISK_1
0000046 FINISHED129   2219  SQL*Net message from client 0        1794

Note:
– sid 147 is the polling channel; it never has a value in ACTION
– sid 139 is the 1st default Channel (CLIENT_INFO is null)
– sid 135 is the allocated channel ORA_DISK_1 (per CLIENT_INFO)

The Catalog

Allow the rman job to make the catalog connection first – it may help to use the host command to allow time to find the catalogn session:

%rman target / catalog usr/pw@<alias> log rman.log trace rman.trc debug
run {
set command id to 'rman';
allocate channel for maintenance type sbt…………..;
host;
backup database;
}

The host command will halt execution and return control to the OS prompt.
To identify the RMAN session in the catalog database:

SELECT s.sid, s.serial#, p.spid, s.program, s.action
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND s.program like '%rman%';

Note the spid value eg 3164 then do:

oradebug setospid 3164
oradebug unlimit
oradebug event 10046 trace name context forever, level 12
exit

Once this is done, to return to the RMAN session and to continue execution of the script type: exit
A trace file will be generated in the CATALOG udump directory using the format <SID>_ora_<spid>.trc.

Upload Diagnostics

1. RMAN client trace file
2. 1st default channel 10046 trace file (nocatalog only)
3. RMAN catalog 10046 trace file (catalog issues only)
4. Related AWR/Statspack reports


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *