ORA-12500内存耗尽一例

3月8日下午发现主机130.31.1.234无法登录，尝试登录Oracle，系统返回ORA-12500错误（TNS:listener failed to start a dedicated server process）。可能引起该错误的原因有多种，包括以下：

Oracle服务进程使用的session或process数达到了参数设置的上限，导致无法再分配新的服务进程。

系统资源耗尽，Oracle在启动新进程时调用的系统调用fork函数因资源不足而出错。

AIX下sys0对象上的属性maxuproc代表用户可以使用的最大进程数，若用户进程数接近该设定值可能导致Oracle无法启动新进程。

因主机无法远程登录，故在晚间进行了重启。重启后查看Oracle告警日志发现记录：“skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3”，该段信息表示模块skgpspawn在fork一个新进程是出现了错误；由此可以判断不是由于session或process数达到参数设置的上限，因为若是session或process数不足，Oracle应当显示地返回ORA-00018（maximum number of sessions exceeded）或ORA-00020（maximum number of processes (%s) exceeded）错误。

通过在主机上查看sys0对象属性，发现maxuproc参数的设置值达到了4096，故基本可以排除因该参数不当引起连接问题的可能性。

系统资源耗尽将导致Oracle监听器无法为新的连接分配新的服务进程，而老的服务进程上的内存资源等可能一直没有得到释放；statspack是Oracle9i中反映Oracle运行性能的工具，以JOB形式在后台运行，目前设置为每两小时运行一次快照。分析快照发现，在主机重启前最后一次快照为下午14:15分开始的，之后系统进入资源紧张阶段，Oracle无法分配新的JOB(j00n)进程,故最后的快照发生在系统出现问题之前。

分析快照内容，在12:06到14：15之间，数据库参数没有改动的记录，sga_max_size设置为10GB, pga_aggregate_target设置为4GB,考虑到Oracle在启用RAC特性后SGA的实际内存使用量将会超过sga_max_size的设置，故Oracle总的内存最大使用量应控制在20GB内。而目标主机的实际物理内存达到了64GB,且专业计费系统一直以来运行良好，故可以排除因Oracle内存参数设置不当，而造成了本次问题的出现。

进一步分析快照发现这一阶段内Oracle数据库高速缓冲的命中率buffer hit为55.90,这个值要较平时水平低很多，可以判断该阶段内数据库可能执行了一些不同于日常业务的操作，这些操作引起较大的物理读表现为缓冲池的命中率明显降低。分析等待事件可以发现，db file scattered read事件即数据库多块物理读是这一阶段的主要等待事件，进一步印证了上述的判断。

通过对数据库快照的分析，证实在连接问题发生之前的短暂时间内,在P6702实例上的确有过引起较大物理读的操作，但实际Oracle使用的内存受到sga_max_size与pga_aggregate_target参数的限制应控制在20GB的范围内，且专业计费系统数据库使用裸设备数据文件，不存在过度使用文件系统缓存的可能，故可以排除由Oracle数据库导致系统资源耗尽的可能性。

因为没有该阶段内系统内存使用量的日志文件，故无法了解到目标主机上当时其他服务的实际内存使用量，但可以排除是问题因Oracle引起的。

Posted

August 9, 2009

mac

Tags:

Comments

8 responses to “ORA-12500内存耗尽一例”

admin

August 2, 2010

Applies to:
Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 10.2.0.5 – Release: 9.2 to 10.2
Information in this document applies to any platform.
***Checked for relevance on 25-Jul-2010***
Symptoms
New connections not allowed and the following error appears in the alerlog

ERROR
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn5
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

Occurs intermittently.
Cause
Insufficient SWAP memory allocated.

This problem has been investigated in the following bug which is closed as not bug

bug 4189338 INSTANCES FAILED WITH MULTIPLE ERROR MESSAGES
Solution
SET swap memory to ( 1 – 2 ) the size of the physical memory installed depending on the Installation and Configuration Requirements as clarified in the following note

Note 169706.1 Oracle Database on AIX,HP-UX,Linux,MacOSX,Solaris,Tru64 Operating Systems Installation and Configuration Requirements Quick Reference (8.0.5 to 10.2)

Resolution for this same issue was seen by increasing the OS parameter for maximum number of processes per user from its default value. (EX: default on AIX is 128; increased to 1280 resolved issue)

Reply
admin

August 2, 2010

Hdr: 4189338 9.2.0.6.0 RDBMS 9.2.0.6.0 PERFORMANCE PRODID-5 PORTID-212
Abstract: INSTANCES FAILED WITH MULTIPLE ERROR MESSAGES

*** 02/16/05 12:02 pm ***
TAR:
—-
4180379.994

PROBLEM:
——–
2 instances on server (DDWP and MSRP) failed after background processes
terminated. Alert log segments are below.
DDWP alert log segment
Wed Jan 19 05:38:16 2005

skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

Wed Jan 19 05:38:38 2005

LGWR: terminating instance due to error 472

Instance terminated by LGWR, pid = 21221448

MSRP alert log segment
Wed Jan 19 05:38:18 2005

found dead shared server ‘S000’, pid = (11, 1)

skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

failed to start shared server ‘S000’, oer=27142

found dead dispatcher ‘D000’, pid = (12, 1)

skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

failed to start dispatcher ‘D000’ for network
‘(ADDRESS=(PARTIAL=YES)(PROTO =T
CP))’, oer=27142

Restarting dead background process CJQ0

Wed Jan 19 05:38:20 2005

Errors in file /ora01/app/oracle/admin/msrp/bdump/msrp_pmon_20840532.trc:

ORA-443: background process “CJQ0” did not start

ORA-27300: OS system dependent operation:fork failed with status: 12

ORA-27301: OS failure message: Not enough space

ORA-27302: failure occurred at: skgpspawn3

Wed Jan 19 05:39:18 2005

DBW1: terminating instance due to error 472

Instance terminated by DBW1, pid = 20951098

DIAGNOSTIC ANALYSIS:
——————–
RFA 14438 filed – issue felt to be a bug
IBM ticket logged to verify that this is not a hardware issue.
– IBM verified the hardware is okay and there are no hardware problem

Swap: {
System Configuration: lcpu=16 mem=128000MB
kthr memory page faults cpu
—– ———– ———————— ———— ———–
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 9 13474399 15601 0 0 0 132 638 0 2842 39907 5010 14 2 63 21

ALERT.LOG:
——————

Wed Jan 19 05:38:16 2005
skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

The following errors were not shown in the RDA, not sure if they are not in
the other database Alert.log?
ORA-443: background process “CJQ0” did not start

ORA-27300: OS system dependent operation:fork failed with status: 12

ORA-27301: OS failure message: Not enough space

ORA-27302: failure occurred at: skgpspawn3

– currently 128 GB of RAM on the server, and another 30 GB of swap space.

WORKAROUND:
———–
None

RELATED BUGS:
————-

REPRODUCIBILITY:
—————-

TEST CASE:
———-

STACK TRACE:
————

SUPPORTING INFORMATION:
———————–

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
—————————————-

DIAL-IN INFORMATION:
——————–

IMPACT DATE:
————
Effecting production att
– I highly suggest reviewing the uploaded files on ESS30
– Also highly suggest reviewing the SR for all of the relevent information
about this issue. There have been many steps implemented to try and resolve
this issue that have failed.

Reply
teapot

September 12, 2011

考虑到Oracle在启用RAC特性后SGA的实际内存使用量将会超过sga_max_size的设置
这是什么意思？ metalink有相关说明？
TKS

Reply
1. admin
  
  September 12, 2011
  
  这个文档写的不是最清楚，数据库版本是9.2.0.6
  
  之前诊断该9i RAC系统存在SGA内存泄露的问题。 Mos上 9i RAC Memory Leak 相关的Bug:
  
  Bug 3910149 Very small SGA memory leak in RAC environments:
  Slow SGA memory leak in an internal KGL descriptor (indicated by the
  “Permanent space allocated for KGL handle dependents” entry in library
  cache dump).
  
  Bug 3420610 RAC “BAST” leak can lead to instance hang:
  There is a rare case where Oracle does not release the BAST context after
  rejecting a ping, and this can lead to a BAST context leak. If
  this situation happens frequently enough and over time it is possible to
  run out of BAST context entries in which case LMS cannot process
  any incoming pings leading to a hang scenario.
  
  Bug 5043858 Server accept handle leak can crash RAC instance:
  RAC instance may crash due to server accept handle leak as
  a close message may not be sent on disconnect.
  
  Reply
admin

September 12, 2011
What Is GCS_RESOURCES And How Do We Control Them?

Applies to:
Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 10.2.0.1
Information in this document applies to any platform.
Goal
Question 1: What is gcs_resources ?

Question 2: What happens if it exceeds its limit?

Question 3: How do we adjust the limit?
Solution

Answer 1: gcs_resources is global cache resources which are essentially structures we use to protect cache structures (buffers). Each instance “masters” cache resources and the implementation of this mastership is done via a gc resource structure

Answer 2 :Nothing – this is simply stating how high the value has reached for a particular resource. If we need more gcs_resources for some reason, they are obtained from the shared pool.

Answer 3 : RAC specific memory is allocated at the time of SGA creation, mostly in the shared pool. However, the memory does not count in the SHARED_POOL_SIZE as defined in the init.ora parameter file.Therefore, when migrating Oracle from single instance to RAC, you do not need to adjust the SHARED_POOL_SIZE parameter to accommodate the additional memory that RAC uses; Oracle does that automatically. Note that the memory for the KCL global cache lock elements will be allocated in the buffer cache, and not in the shared pool.

You can monitor the exact RAC memory resource usage of the shared pool by querying V$SGASTAT for GCS, GES and KCL related entries. Also, the current and maximum number of GCS resources / shadows (resource names gcs_resources and gcs_shadows) and GES resources / enqueues (resource names ges_ress and ges_locks) can be obtained from V$RESOURCE_LIMIT. In case we are running out of reserved memory for GCS / GES related components, Oracle will dynamically allocate memory from the free memory pool in the shared pool. However, it is recommended to carefully calculate the resources and enqueues as needed since dynamic memory allocation from the shared pool is an expensive operation.

We recommend you monitor for more time to see if this limit is reach and it is a constant situation, then you could considered to increase the value of gcs_resources by instance.
```
SQL> alter system set "_gcs_resources"=[value] scope=spfile sid='[sid]';
```
Reply
admin

September 12, 2011

LCK temporarily stuck waiting for latch ‘Child row cache objects’

Applies to:
Oracle Server – Enterprise Edition – Version: 10.2.0.1 to 11.1.0.7
Information in this document applies to any platform.
Symptoms

One RAC instance hangs from time to time. The hang resolves itself after a while until it hangs again after some time.

1. alert.log info

=> alert log of the hanging instance show on one node that the LCK process is
blocked on different enqueues and hanging messages are reported afterwards
(e.g. ORA-3136 and WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK)

LCK0 started with pid=10, OS id=7217
Wed Feb 11 12:25:36 2009
…
Mon Feb 16 19:01:24 2009
GES: Potential blocker (pid=7217) on resource IV-0051D934-1013071D;
enqueue info in file /u01/oracle/admin/SNM/bdump/snm2_lmd0_7006.trc and
DIAG trace file
Mon Feb 16 19:15:59 2009
GES: Potential blocker (pid=7217) on resource CI-00000046-00000002;
enqueue info in file /u01/oracle/admin/SNM/bdump/snm2_lmd0_7006.trc and
DIAG trace file
…
Mon Feb 16 19:26:16 2009
> />> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=25
System State dumped to trace file
/u01/oracle/admin/SNM/bdump/snm2_mmon_7032.trc
…
Tue Feb 16 19:35:22 2009
WARNING: inbound connection timed out (ORA-3136)

=> alert.log on the other nodes show CI crossinstance messages locks

Mon Feb 16 19:27:53 2009
GES: Potential blocker (pid=13607) on resource CI-00000046-00000005;
enqueue info in file /u01/oracle/admin/SNM/udump/snm1_ora_8418.trc and
DIAG trace file

2. Systemstate dumps show the LCK process is blocked in the ‘Child row cache objects’ latch
and plenty of processes are waiting on it (waiter count high). The LCK stack contains
terms like ‘sskgpwwait kslges kqrbip kqrbfr’.

(latch info) wait_event=0 bits=0
Location from where call was made: kqrbip:
waiting for 5795df1a0 Child row cache objects level=4 child#=3
Location from where latch is held: kghfrunp: clatch: wait:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
249 (1, 1234808779, 1)
…
161 (1, 1234808779, 1)
waiter count=34
gotten 108711869 times wait, failed first 40623351 sleeps 1869257
gotten 14091082 times nowait, failed: 43699893
possible holder pid = 255 ospid=3663
on wait list for 5795df1a0

3. awrrpt info

=> After the hang, the awrrpt show plenty of ‘KQR L PO’ memory is freed during the hang

Pool Name Begin MB End MB % Diff
—— —————————— ————– ————– ——-
shared KQR L PO 2,118.6 1,005.9 -52.52
shared free memory 932.9 2,054.4 120.21

=> The rowcache (dictionary cache, the dc_*) usage decrease, too

dc_histogram_defs 332,935
dc_object_ids 57,894
dc_objects 160,639
dc_segments 382,909

=> big cursors are entering the shared pool regularly
Cause
The shared pool is stressed and memory need to be freed for the new cursors. As a consequence, the dictionary cache is reduced in size by the LCK process causing a temporal hang of the instance since the LCK can’t do other activity during that time. Since the dictionary cache is a memory area protected clusterwide in RAC, the LCK is responsible to free it in collaboration with the dictionary cache users (the sessions using cursors referenced in the dictionary cache). This process can be time consuming when the dictionary cache is big.
Solution

a. reduce the stress on the shared pool

=> by increasing it above the automatically reached value with dynamic sga, e.g.
when sga_target is set to 16G and the shared_pool_size was 6G during the hang time, set it to e.g. 8G.

=> by reducing the number of big cursors entering the shared pool, e.g. cursors using more than 1M sharable_mem e.g. via binding
select sql_text from v$sqlarea where sharable_mem > 1M;

b. reduce the dictionary cache usage in order to reduce the size of the dictionary cache, e.g.

=> when dc_histogram_defs is too high, it can point towards histograms calculations on all columns
of the tables. histograms should only be calculated on indexed columns
=> when dc_segments is high compared to dc_object_ids, it can point towards excessive partitioning usage. Reducing the partitions/subpartition usage will help reduce the dictionary cache usage to manage it.

c. set _enable_shared_pool_durations = false to avoid that one duration (a memory area in the shared pool used for a specific usage) need to give all space required for that usage, i.e. in case the duration containing the dictionary cache need to free memory, then that duration is extra stressed since no other type of memory from other durations can be used. Setting it to false make that any type of memory can be used to free space (i.e. any type of memory in the subpool). As a consequence, the number of subpools will be reduced by the factor of the number of durations (4 in 10gR2). Hence tuning the _kghdsidx_count is advisable, e.g. increasing it to have manageable subpool sizes (see note:396940.1).
d. check patch:8666117 has been applied. This patch speedup the processing to free memory.

Reply
admin

September 12, 2011

High memory allocation in GES resource Cache and ORA-4031 errors

Applies to:
Oracle Server – Enterprise Edition – Version: 11.1.0.6 to 11.2.0.2 – Release: 11.1 to 11.2
Information in this document applies to any platform.
Symptoms

Under certain workload conditions, with high enqueue activites, the GES resource cache memory allocation in the shared pool can grow excessive.

The usual resource cleanup mechanisms for GES cache by the LMON background process would not be sufficient enough in this case and we might see ORA-4031 errors and possible instance crashes due to this.

Cause

This issue is corrected in the following bug fix.

Bug 10042937 — HIGH MEMORY GROUP IN GES_CACHE_RESS AND ORA-4031 ERRORS

You may want to consider implement the solutions given, if you are seeing the following symptoms

o The memory of the ges_cache_ress could show high growth over a period of time. The ges_cache_ress captured in v$resource_limit would look as follows

select resource_name, current_utilization, max_utilization, initial_allocation from v$resource_limit where resource_name like ‘%ges_ress%’;

———————————————————————————————————————————————————–
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION
———————————————————————————————————————————————————–
ges_ress 17717 19354 29912
ges_ress 416646 416647 29912
ges_ress 1357968 1358021 29912
ges_ress 2171240 2171240 29912
ges_ress 2954514 2954518 29912
ges_ress 3698162 3698166 29912
ges_ress 4369513 4369535 29912
———————————————————————————————————————————————————–

o At the time same time, we can see the sga growth in v$sgatat

select pool, name,bytes from v$sgastat where name like ‘%ges resource%’;
—————————————————————————–
POOL NAME BYTES
—————————————————————————–
shared pool ges resource 12585776
shared pool ges resource 120158976
shared pool ges resource 389778368
shared pool ges resource 624732080
shared pool ges resource 850151288
shared pool ges resource 1067068464
shared pool ges resource 1257971960
—————————————————————————–

o ORA-4031 errors in the ges resource cache.

ORA-4031: unable to allocate 5792 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(1,1)”,”ges resource “)
Thu Jul 08 11:18:53 2010
Errors in file /home/orabase/diag/rdbms/rac111/rac1111/trace/rac1111_ora_7828.trc (incident=1515626)

Solution
bug 10042937 is fixed in 11.2.0.3, the solution is to apply patch 10042937. This bug fix has made the ges resource cleanup by the LMON process more aggressive, to handle workload conditions with high enqueue activities. With this fix, LMON would cleans up more resources in each iteration.

Please open a service request with support, if you are seeing this issue on other platform, to get the fix for this.

The workaround is to increase the “_lm_cache_res_cleanup” parameter value to a higher value (Default is 25) and check whether the problem reproduces again.

References
NOTE:10042937.8 – Bug 10042937 – High memory group in ges_cache_ress and ora-4031 errors.

Reply
admin

September 12, 2011

Bug 9026008 “GES_RESOURCES” may increase over time with high enqueue activity

Affects:

Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected Versions BELOW 11.2.0.2
Versions confirmed as being affected

11.2.0.1
11.1.0.7

Platforms affected Generic (all / most platforms affected)

Fixed:

This issue is fixed in

12.1 (Future Release)
11.2.0.2 (Server Patch Set)
11.2.0.1 Bundle Patch 2 for Exadata Database

Symptoms:

Related To:

Error May Occur
Leak (Memory Leak / Growth)
ORA-4031

RAC (Real Application Clusters) / OPS
_lm_cache_allocated_res_ratio
_lm_cache_res_cleanup

Description

Memory used for “ges resource” may increase over time.

Rediscovery Notes:
From v$resource_limit, check ges_resources and ges_cached_resources.
If these values continually increase when there is high enqueue activity
you may be hitting this issue.

Workaround
Possibly increasing the _lm_cache_res_cleanup parameter (default is 25 on 11.2).
If that does not help this fix may be required.

NOTES:
Most “ges resource” memory increases do not require this fix but in
extreme cases where enqueue activity is very high this fix may be necessary.
The fix introduces an underscore parameter _lm_cache_allocated_res_ratio
(or a spare parameter if a one-off patch was applied). This parameter defines
the percentage of cached resources being cleanup. The default value of this
parameter is 50 but if the fix is applied this value can be decreased to speed
up the frequency of cached resource cleanup. If this fix (and parameter) does
not improve the situation, the fix for bug 10042937 may be needed.

Also see bug 10042937 which may be useful alongside this fix to help
limit memory usage.

Reply

ORA-12500内存耗尽一例

Comments

8 responses to “ORA-12500内存耗尽一例”

Leave a Reply Cancel reply