cellsrvstat: the Swiss Army knife of Exadata performance monitoring

http://www.dbaleet.org/exadata_performance_monitoring_swiss_army_knife_cellsrvstat/

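If you need information about offloading, smart scan, or storage indexes on an Exadata cell (storage server), you can usually get it on the database side by filtering dynamic performance views such as v$sql and v$sysstat. Is there a simpler way?

Starting with some release, every Exadata storage server ships with a small utility called cellsrvstat. It collects statistics only for the cell it runs on, but what it collects is very comprehensive, enough to earn it the title of Exadata's "ancient divine artifact".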

[root@slca04cel01 ~]# cellsrvstat
===Current Time=== Fri Aug 23 08:12:19 2013

== Input/Output related stats ==
Number of hard disk block IO read requests 0 1823855
Number of hard disk block IO write requests 0 849658
Hard disk block IO reads (KB) 0 1390317
Hard disk block IO writes (KB) 0 424990
Number of flash disk block IO read requests 0 0
Number of flash disk block IO write requests 0 0
Flash disk block IO reads (KB) 0 0
Flash disk block IO writes (KB) 0 0
Number of disk IO errors 0 0
Number of reads from flash cache 0 0
Number of writes to flash cache 0 0
Flash cache reads (KB) 0 0
Flash cache writes (KB) 0 0
Number of flash cache IO errors 0 0
Size of eviction from flash cache (KB) 0 0
Number of outstanding large flash IOs 0 0
Number of latency threshold warnings during job 0 33
Number of latency threshold warnings by checker 0 0
Number of latency threshold warnings for smart IO 0 0
Number of latency threshold warnings for redo log writes 0 0
Current read block IO to be issued (KB) 0 0
Total read block IO to be issued (KB) 0 1446974
Current write block IO to be issued (KB) 0 0
Total write block IO to be issued (KB) 0 424990
Current read blocks in IO (KB) 0 0
Total read block IO issued (KB) 0 1446974
Current write blocks in IO (KB) 0 0
Total write block IO issued (KB) 0 424990
Current read block IO in network send (KB) 0 0
Total read block IO in network send (KB) 0 1446974
Current write block IO in network send (KB) 0 0
Total write block IO in network send (KB) 0 424990
Current block IO being populated in flash (KB) 0 0
Total block IO KB populated in flash (KB) 0 0

== Memory related stats ==
SGA heap used - kgh statistics (KB) 0 438098
SGA heap free - cellsrv statistics (KB) 0 20655
OS memory allocated to SGA (KB) 0 458754
SGA heap used - cellsrv statistics - KB 0 438099
OS memory allocated to PGA (KB) 0 898
PGA heap used - cellsrv statistics (KB) 0 376
OS memory allocated to cellsrv (KB) 0 5754818
Top 5 SGA consumers (KB)
storidx::arraySeqRIDX 0 88719
SUBHEAP Networ 0 81937
storidx:arrayRIDX 0 73816
Thread IO Lat Stats 0 35158
RemoteSendPort Fixed Size 0 33935
Top 5 SGA subheap consumers (KB)
Network mem 0 81925
Network heap chunk 0 2462
Number of allocation failures in 512 bytes pool 0 0
Number of allocation failures in 2KB pool 0 0
Number of allocation failures in 4KB pool 0 0
Number of allocation failures in 8KB pool 0 0
Number of allocation failures in 16KB pool 0 0
Number of allocation failures in 32KB pool 0 0
Number of allocation failures in 64KB pool 0 0
Number of allocation failures in 1MB pool 0 0
Allocation hwm in 512 bytes pool 0 620
Allocation hwm in 2KB pool 0 602
Allocation hwm in 4KB pool 0 620
Allocation hwm in 8KB pool 0 1002
Allocation hwm in 16KB pool 0 602
Allocation hwm in 32KB pool 0 601
Allocation hwm in 64KB pool 0 601
Allocation hwm in 1MB pool 0 55
Number of low memory threshold failures 0 0
Number of no memory threshold failures 0 0
Dynamic buffer allocation requests 0 0
Dynamic buffer allocation failures 0 0
Dynamic buffer allocation failures due to low mem 0 0
Dynamic buffer allocated size (KB) 0 0
Dynamic buffer allocation hwm (KB) 0 0

== Execution related stats ==
Incarnation number 0 5
Number of module version failures 0 0
Number of threads working 0 1
Number of threads waiting for network 0 19
Number of threads waiting for resource 0 0
Number of threads waiting for a mutex 0 0
Number of Jobs executed for each job type
CacheGet 0 1838056
CachePut 0 849658
CloseDisk 0 711757
OpenDisk 0 712141
ProcessIoctl 0 14062328
PredicateDiskRead 0 0
PredicateDiskWrite 0 0
PredicateFilter 0 0
PredicateCacheGet 0 0
PredicateCachePut 0 0
FlashCacheMetadataWrite 0 0
RemoteListenerJob 0 0
FlashCacheResilveringTableUpdate 0 0
CellDiskMetadataPrepare 0 0

SQL ids consuming the most CPU
other 0000000000000 2
END SQL ids consuming the most CPU

== Network related stats ==
Total bytes received from the network 0 804684378
Total bytes transmitted to the network 0 7721296
Total bytes retransmitted to the network 0 0
Number of active sendports 0 7
Hwm of active sendports 0 15
Number of active remote open infos 0 6
HWM of remote open infos 0 65

== SmartIO related stats ==
Number of active smart IO sessions 0 0
High water mark of smart IO sessions 0 0
Number of completed smart IO sessions 0 0
Smart IO offload efficiency (percentage) 0 0
Size of IO avoided due to storage index (KB) 0 0
Current smart IO to be issued (KB) 0 0
Total smart IO to be issued (KB) 0 0
Current smart IO in IO (KB) 0 0
Total smart IO in IO (KB) 0 0
Current smart IO being cached in flash (KB) 0 0
Total smart IO being cached in flash (KB) 0 0
Current smart IO with IO completed (KB) 0 0
Total smart IO with IO completed (KB) 0 0
Current smart IO being filtered (KB) 0 0
Total smart IO being filtered (KB) 0 0
Current smart IO filtering completed (KB) 0 0
Total smart IO filtering completed (KB) 0 0
Current smart IO filtered size (KB) 0 0
Total smart IO filtered (KB) 0 0
Total cpu passthru output IO size (KB) 0 0
Total passthru output IO size (KB) 0 0
Current smart IO with results in send (KB) 0 0
Total smart IO with results in send (KB) 0 0
Current smart IO filtered in send (KB) 0 0
Total smart IO filtered in send (KB) 0 0
Total smart IO read from flash (KB) 0 0
Total smart IO initiated flash population (KB) 0 0
Total smart IO read from hard disk (KB) 0 0
Total smart IO writes (fcre) to hard disk (KB) 0 0
Number of smart IO requests < 512KB 0 0
Number of smart IO requests >= 512KB and < 1MB 0 0
Number of smart IO requests >= 1MB and < 2MB 0 0
Number of smart IO requests >= 2MB and < 4MB 0 0
Number of smart IO requests >= 4MB and < 8MB 0 0
Number of smart IO requests >= 8MB 0 0
Number of times smart IO buffer reserve failures 0 0
Number of times smart IO request misses 0 0
Number of times IO for smart IO not allowed to be issued 0 0
Number of times smart IO prefetch limit was reached 0 0
Number of times smart scan used unoptimized mode 0 0
Number of times smart fcre used unoptimized mode 0 0
Number of times smart backup used unoptimized mode 0 0

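As you can see, cellsrvstat collects the following categories of information:

  • I/O related statistics;
  • memory related statistics;
  • execution related statistics;
  • network related statistics;
  • smart I/O related statistics.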
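Every statistic line in the default output ends with two numeric columns (per the tool's own help text, the delta since the previous sample and the absolute value). A minimal sketch for pulling one metric out of saved output follows; the function name cellsrvstat_metric is hypothetical, and the parsing assumes the line layout shown in the sample above.

```shell
#!/bin/bash
# Hypothetical helper: read non-tabular cellsrvstat output on stdin and
# print "delta absolute" for the metric whose descriptive name equals $1.
cellsrvstat_metric() {
    awk -v want="$1" '
        # Only consider lines whose last two fields are integers.
        NF >= 3 && $(NF-1) ~ /^-?[0-9]+$/ && $NF ~ /^-?[0-9]+$/ {
            name = $0
            # Drop the two trailing numeric columns to recover the name.
            sub(/[ \t]+-?[0-9]+[ \t]+-?[0-9]+[ \t]*$/, "", name)
            sub(/^[ \t]+/, "", name)
            if (name == want) { print $(NF-1), $NF; exit }
        }'
}
```

On a cell this could be used as `cellsrvstat | cellsrvstat_metric "Number of disk IO errors"`.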

Running cellsrvstat with no arguments shows the current values. Adding the -list option shows which metrics are available:

[root@dm01cel01 ~]# cellsrvstat -list
Statistic Groups:
io Input/Output related stats
mem Memory related stats
exec Execution related stats
net Network related stats
smartio SmartIO related stats

Statistics:
[ * - Absolute values. Indicates no delta computation in tabular format]

io_nbiorr_hdd Number of hard disk block IO read requests
io_nbiowr_hdd Number of hard disk block IO write requests
io_nbiorb_hdd Hard disk block IO reads (KB)
io_nbiowb_hdd Hard disk block IO writes (KB)
io_nbiorr_flash Number of flash disk block IO read requests
io_nbiowr_flash Number of flash disk block IO write requests
io_nbiorb_flash Flash disk block IO reads (KB)
io_nbiowb_flash Flash disk block IO writes (KB)
io_ndioerr Number of disk IO errors
io_nrfc Number of reads from flash cache
io_nwfc Number of writes to flash cache
io_fcrb Flash cache reads (KB)
io_fcwb Flash cache writes (KB)
io_nfioerr Number of flash cache IO errors
io_nbpfce Size of eviction from flash cache (KB)
io_nolfio Number of outstanding large flash IOs
io_ltow Number of latency threshold warnings during job
io_ltcw Number of latency threshold warnings by checker
io_ltsiow Number of latency threshold warnings for smart IO
io_ltrlw Number of latency threshold warnings for redo log writes
io_bcrti Current read block IO to be issued (KB) *
io_btrti Total read block IO to be issued (KB)
io_bcwti Current write block IO to be issued (KB) *
io_btwti Total write block IO to be issued (KB)
io_bcrii Current read blocks in IO (KB) *
io_btrii Total read block IO issued (KB)
io_bcwii Current write blocks in IO (KB) *
io_btwii Total write block IO issued (KB)
io_bcrsi Current read block IO in network send (KB) *
io_btrsi Total read block IO in network send (KB)
io_bcwsi Current write block IO in network send (KB) *
io_btwsi Total write block IO in network send (KB)
io_bcfp Current block IO being populated in flash (KB) *
io_btfp Total block IO KB populated in flash (KB)
mem_sgahu SGA heap used - kgh statistics (KB)
mem_sgahf SGA heap free - cellsrv statistics (KB)
mem_sgaos OS memory allocated to SGA (KB)
mem_sgahuc SGA heap used - cellsrv statistics - KB
mem_pgaos OS memory allocated to PGA (KB)
mem_pgahuc PGA heap used - cellsrv statistics (KB)
mem_allos OS memory allocated to cellsrv (KB)
mem_sgatop Top 5 SGA consumers (KB) *
mem_sgasubtop Top 5 SGA subheap consumers (KB) *
mem_halfkaf Number of allocation failures in 512 bytes pool
mem_2kaf Number of allocation failures in 2KB pool
mem_4kaf Number of allocation failures in 4KB pool
mem_8kaf Number of allocation failures in 8KB pool
mem_16kaf Number of allocation failures in 16KB pool
mem_32kaf Number of allocation failures in 32KB pool
mem_64kaf Number of allocation failures in 64KB pool
mem_1maf Number of allocation failures in 1MB pool
mem_halfkhwm Allocation hwm in 512 bytes pool
mem_2khwm Allocation hwm in 2KB pool
mem_4khwm Allocation hwm in 4KB pool
mem_8khwm Allocation hwm in 8KB pool
mem_16khwm Allocation hwm in 16KB pool
mem_32khwm Allocation hwm in 32KB pool
mem_64khwm Allocation hwm in 64KB pool
mem_1mhwm Allocation hwm in 1MB pool
mem_lmtf Number of low memory threshold failures
mem_nmtf Number of no memory threshold failures
mem_dynar Dynamic buffer allocation requests
mem_dynaf Dynamic buffer allocation failures
mem_dynafl Dynamic buffer allocation failures due to low mem
mem_dynam Dynamic buffer allocated size (KB)
mem_dynamh Dynamic buffer allocation hwm (KB)
exec_incno Incarnation number *
exec_versf Number of module version failures *
exec_ntwork Number of threads working *
exec_ntnetwait Number of threads waiting for network *
exec_ntreswait Number of threads waiting for resource *
exec_ntmutexwait Number of threads waiting for a mutex *
exec_njx Number of Jobs executed for each job type
exec_topcpusqlid SQL ids consuming the most CPU
net_rxb Total bytes received from the network
net_txb Total bytes transmitted to the network
net_rtxb Total bytes retransmitted to the network
net_sps Number of active sendports
net_sph Hwm of active sendports
net_rois Number of active remote open infos
net_roih HWM of remote open infos
sio_ns Number of active smart IO sessions *
sio_hs High water mark of smart IO sessions *
sio_ncs Number of completed smart IO sessions
sio_oe Smart IO offload efficiency (percentage) *
sio_sis Size of IO avoided due to storage index (KB)
sio_ctb Current smart IO to be issued (KB) *
sio_ttb Total smart IO to be issued (KB)
sio_cii Current smart IO in IO (KB) *
sio_tii Total smart IO in IO (KB)
sio_cfp Current smart IO being cached in flash (KB) *
sio_tfp Total smart IO being cached in flash (KB)
sio_cic Current smart IO with IO completed (KB) *
sio_tic Total smart IO with IO completed (KB)
sio_cif Current smart IO being filtered (KB) *
sio_tif Total smart IO being filtered (KB)
sio_cfc Current smart IO filtering completed (KB) *
sio_tfc Total smart IO filtering completed (KB)
sio_cfo Current smart IO filtered size (KB) *
sio_tfo Total smart IO filtered (KB)
sio_tcpo Total cpu passthru output IO size (KB)
sio_tpo Total passthru output IO size (KB)
sio_cis Current smart IO with results in send (KB) *
sio_tis Total smart IO with results in send (KB)
sio_ciso Current smart IO filtered in send (KB) *
sio_tiso Total smart IO filtered in send (KB)
sio_fcr Total smart IO read from flash (KB)
sio_fcw Total smart IO initiated flash population (KB)
sio_hdr Total smart IO read from hard disk (KB)
sio_hdw Total smart IO writes (fcre) to hard disk (KB)
sio_n512kb Number of smart IO requests < 512KB
sio_n1mb Number of smart IO requests >= 512KB and < 1MB
sio_n2mb Number of smart IO requests >= 1MB and < 2MB
sio_n4mb Number of smart IO requests >= 2MB and < 4MB
sio_n8mb Number of smart IO requests >= 4MB and < 8MB
sio_ngt8mb Number of smart IO requests >= 8MB
sio_nbrf Number of times smart IO buffer reserve failures
sio_nrm Number of times smart IO request misses
sio_ncio Number of times IO for smart IO not allowed to be issued
sio_nplr Number of times smart IO prefetch limit was reached
sio_nssuo Number of times smart scan used unoptimized mode
sio_nfcuo Number of times smart fcre used unoptimized mode
sio_nsbuo Number of times smart backup used unoptimized mode
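In the listing above, a trailing * marks statistics that are absolute values (no delta is computed for them in tabular format). A small sketch that picks out their short names from saved -list output is below; the function name absolute_stats is hypothetical, and the prefix filter simply mirrors the prefixes visible in the listing.

```shell
#!/bin/bash
# Hypothetical helper: read `cellsrvstat -list` output on stdin and print
# the short names of the statistics flagged with a trailing "*".
absolute_stats() {
    awk '$NF == "*" && $1 ~ /^(io|mem|exec|net|sio)_/ { print $1 }'
}
```

For example, `cellsrvstat -list | absolute_stats` would print one short name per line.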

The -h option prints the help text:

[root@dm01cel01 ~]# cellsrvstat -h
Usage:
cellsrvstat [-stat_group=<group name>,<group name>,]
[-stat=<stat name>,<stat name>,] [-interval=<interval>]
[-count=<count>] [-table] [-short] [-list]

stat A comma separated list of short strings representing
the stats.
Default is all. (unless - stat_group is specified.
The -list option displays all stats.
Example: -stat=io_nbiorr_hdd,io_nbiowr_hdd
stat_group A comma separated list of short strings representing
groups of stats.
Default: all (unless -stat is specified).
Currently valid options are: io, mem, exec, net.
Example: -stat_group=io,mem
interval At what interval the stats should be obtained and
printed (in seconds). Default is 1 second.
count How many times the stats should be printed.
Default is once.
list List all metric abbreviations and their descriptions.
All other options are ignored.
table Use a tabular format for output. This option will be
ignored if all metrics specified are not integer
based metrics.
short Use abbreviated metric name instead of
descriptive ones.
error_out An output file to print error messages to, mostly for
debugging.

In non-tabular mode, The output has three columns. The first column
is the name of the metric, the second one is the difference between the
last and the current value(delta), and the third column is the absolute value.
In Tabular mode absolute values are printed as is without delta.
cellsrvstat -list command points out the statistics that are absolute values

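-stat_group= takes the names of statistic groups, such as the io, mem, exec and net groups mentioned above (the -list output also shows a smartio group).

-stat= takes the short names of statistics as reported by -list, for example io_nbiorr_hdd,io_nbiowr_hdd.

-interval= takes the sampling interval in seconds.

-count= takes the number of samples to print.

-table selects a tabular output format. (Replacing the descriptive names with the abbreviated ones is what -short does.)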
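To make the way these options combine concrete, here is a small sketch that assembles a command line from a stat list, an interval and a count. The function name build_cellsrvstat_cmd is hypothetical; it only prints the command, and it uses only options that appear in the help text above.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) a tabular cellsrvstat command line.
#   $1 = comma separated stat names, $2 = interval in seconds, $3 = count
build_cellsrvstat_cmd() {
    local stats=$1 interval=$2 count=$3 n
    for n in "$interval" "$count"; do
        case $n in
            # Reject empty, non-numeric, or zero values.
            ''|*[!0-9]*|0)
                echo "interval and count must be positive integers" >&2
                return 1 ;;
        esac
    done
    printf 'cellsrvstat -table -interval=%s -count=%s -stat=%s\n' \
        "$interval" "$count" "$stats"
}
```

Calling `build_cellsrvstat_cmd sio_ttb,sio_tii 1 10` prints a command that samples the two statistics once per second, ten times.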

For example, suppose we want to collect the two statistics sio_ttb and sio_tii, sampling once per second for a total of ten samples:

[root@dm01cel01 ~]# cellsrvstat -table -interval=1 -count=10 -stat=sio_ttb,sio_tii
===Current Time=== sio_ttb sio_tii
Fri Aug 23 08:29:46 2013 0 0
Fri Aug 23 08:29:47 2013 0 0
Fri Aug 23 08:29:48 2013 0 0
Fri Aug 23 08:29:49 2013 0 0
Fri Aug 23 08:29:50 2013 0 0
Fri Aug 23 08:29:51 2013 0 0
Fri Aug 23 08:29:52 2013 0 0
Fri Aug 23 08:29:53 2013 0 0
Fri Aug 23 08:29:54 2013 0 0
Fri Aug 23 08:29:55 2013 0 0
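Tabular output like this is easy to post-process. The sketch below converts it to CSV; it assumes, as in the sample above, that the header starts with "===Current Time===" and that every timestamp consists of exactly five whitespace-separated words. The function name table_to_csv is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert `cellsrvstat -table` output (stdin) to CSV.
table_to_csv() {
    awk '
        $1 == "===Current" {        # header: "===Current Time=== stat1 stat2 ..."
            line = "time"
            for (i = 3; i <= NF; i++) line = line "," $i
            print line
            next
        }
        NF > 5 {                    # data row: 5 timestamp words, then values
            line = $1 " " $2 " " $3 " " $4 " " $5
            for (i = 6; i <= NF; i++) line = line "," $i
            print line
        }'
}
```

Usage would look like `cellsrvstat -table -interval=1 -count=10 -stat=sio_ttb,sio_tii | table_to_csv`.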

Without the -table option, the full descriptive output is printed for each sample:

[root@dm01cel01 ~]# cellsrvstat -interval=1 -count=10 -stat=sio_ttb,sio_tii
===Current Time=== Fri Aug 23 08:30:25 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:26 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:27 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:28 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:29 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:30 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:31 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:32 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:33 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0

===Current Time=== Fri Aug 23 08:30:34 2013

== SmartIO related stats ==
Total smart IO to be issued (KB) 0 0
Total smart IO in IO (KB) 0 0
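When the descriptive format has already been written to a file, one metric can still be flattened into a line per sample. The following is a sketch under the assumption that each sample starts with a "===Current Time===" line and that metric lines are not indented, as in the output above; the function name samples_for_metric is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read multi-sample non-tabular output on stdin and
# print "timestamp<TAB>delta<TAB>absolute" for the metric named in $1.
samples_for_metric() {
    awk -v want="$1" '
        /^===Current Time===/ {
            ts = $0
            sub(/^===Current Time===[ \t]*/, "", ts)   # keep only the timestamp
            next
        }
        substr($0, 1, length(want)) == want {
            rest = substr($0, length(want) + 1)
            # Require exactly two numeric columns after the name, so that
            # a metric whose name merely starts with $1 is not matched.
            if (rest ~ /^[ \t]+-?[0-9]+[ \t]+-?[0-9]+[ \t]*$/)
                print ts "\t" $(NF-1) "\t" $NF
        }'
}
```

For instance, `samples_for_metric "Total smart IO in IO (KB)" < saved_output.txt`, where saved_output.txt is a placeholder file name.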

In practice, OSWatcher invokes cellsrvstat by default through its Exadata_cellsrvstat.sh wrapper script:

[root@dm01cel01 ~]# ps -ef | grep osw
root 5219 17360 0 08:38 pts/0 00:00:00 grep osw
root 12914 23131 0 08:00 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_cellsrvstat.sh
root 31625 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_vmstat.sh
root 31626 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_mpstat.sh
root 31627 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_netstat.sh
root 31628 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_iostat.sh
root 31629 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_diskstats.sh
root 31633 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq ./Exadata_top.sh
root 31643 23131 0 04:02 ? 00:00:00 /bin/ksh ./oswsub.sh HighFreq /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh
root 31656 31643 0 04:02 ? 00:00:03 /bin/bash /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh HighFreq

 

 

[root@slca04cel01 osw]# cat /opt/oracle.oswatcher/osw/Exadata_cellsrvstat.sh
#!/bin/bash
# Copyright (c) 2009, 2011, Oracle and/or its affiliates. All rights reserved.

out_file=
zip_prog=
declare -i self_count=1
declare -i sample_interval=1
declare -i sample_duration=3
declare -i sample_count=1

/bin/touch /opt/oracle.oswatcher/osw/Exadata_cellsrvstat.lock
echo $$ > /opt/oracle.oswatcher/osw/Exadata_cellsrvstat.lock
while [ -e /opt/oracle.oswatcher/osw/Exadata_cellsrvstat.lock ];
do
    if [ -f "archive/oswcellsrvstat/$1" ]; then
        if [ ! -z "$out_file" ] && [ ! -z "$zip_prog" ]; then
            $zip_prog $out_file &
        fi
        out_file=`/bin/cat archive/oswcellsrvstat/$1 | /bin/cut -d ' ' -f 1`
        if [ $? -ne 0 ]; then
            /bin/echo "[ERROR] archive/oswcellsrvstat/$1 not found or it is empty"
            exit 1
        fi
        zip_prog=`/bin/cat archive/oswcellsrvstat/$1 | /bin/cut -d ' ' -f 2`
        if [ $? -ne 0 ]; then
            /bin/echo "[ERROR] archive/oswcellsrvstat/$1 not found or it is empty"
            exit 1
        fi
        sample_interval=`/bin/cat archive/oswcellsrvstat/$1 | /bin/cut -d ' ' -f 3`
        if [ $? -ne 0 ]; then
            /bin/echo "[ERROR] archive/oswcellsrvstat/$1 not found or it is empty"
            exit 1
        fi
        sample_duration=`/bin/cat archive/oswcellsrvstat/$1 | /bin/cut -d ' ' -f 4`
        if [ $? -ne 0 ]; then
            /bin/echo "[ERROR] archive/oswcellsrvstat/$1 not found or it is empty"
            exit 1
        fi
        /bin/rm -f "archive/oswcellsrvstat/$1"
    else
        break
    fi
    if [ ! -z "$out_file" ]; then
        if [ $sample_interval -gt 0 ] && [ $sample_duration -gt 0 ] && [ $sample_duration -gt $sample_interval ]; then
            sample_count=$((sample_duration / sample_interval))
            /bin/echo "zzz ***"`date`" Sample interval: $sample_interval secconds" >> ${out_file}

            $OSS_BIN/cellsrvstat -interval=$sample_interval -count=$sample_count >> ${out_file}

            bzip2 ${out_file}
            /bin/rm -f ${out_file}
        else
            /bin/echo "[ERROR] Invalid arguments for sample_duration and sample_interval"
            break
        fi
    fi
done

/bin/rm -f /opt/oracle.oswatcher/osw/Exadata_cellsrvstat.lock
exit 0
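The core of the script is small: it reads four space-separated fields (output file, zip program, sample interval, sample duration) from a control file and derives the -count value as duration divided by interval. The sketch below isolates that step. It is not Oracle's code; the function name sample_count_from_control and the example values are hypothetical, and it parses the four fields with a single read instead of four cat | cut pipelines.

```shell
#!/bin/bash
# Hypothetical helper: given one control line of the form
#   "<out_file> <zip_prog> <interval> <duration>"
# print the sample count the wrapper script would pass to -count.
sample_count_from_control() {
    local out_file zip_prog interval duration n
    read -r out_file zip_prog interval duration <<< "$1"
    for n in "$interval" "$duration"; do
        case $n in
            ''|*[!0-9]*) echo "[ERROR] non-numeric interval or duration" >&2
                         return 1 ;;
        esac
    done
    # Same validity rule as the script: both positive, duration > interval.
    if [ "$interval" -gt 0 ] && [ "$duration" -gt "$interval" ]; then
        # 10# forces decimal so that a value such as 08 is not read as octal.
        echo $(( 10#$duration / 10#$interval ))
    else
        echo "[ERROR] Invalid arguments for sample_duration and sample_interval" >&2
        return 1
    fi
}
```

With a made-up control line such as "/tmp/cellsrvstat.dat bzip2 5 3600", the function prints 720, that is, one hour of samples taken every five seconds.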

 
