ASM is a critical component of the Exadata software stack, and it is a bit different compared to non-Exadata environments. It still manages your disk groups, but builds them with grid disks. It still takes care of disk errors, but also handles predictive disk failures. It doesn't support external redundancy, but it makes the disk group smart scan capable. Let's have a closer look.
Grid disks
In Exadata the ASM disks live on storage cells and are presented to compute nodes (where ASM instances run) via the Oracle proprietary iDB protocol. Each storage cell has 12 hard disks and 16 flash disks. During Exadata deployment, grid disks are created on those 12 hard disks. Flash disks are used for the flash cache and the flash log, so grid disks are normally not created on flash disks.
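For the cell side view of the same disks, CellCLI can be used on the storage cell itself. A minimal sketch (run as root or celladmin on a cell; the attribute list is just an example):
# cellcli -e list griddisk attributes name, size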
Grid disks are not exposed to the operating system, so only database instances, ASM and related utilities that speak iDB can see them. One such utility is kfod, the ASM disk discovery tool. Here is an example of kfod discovering grid disks in an Exadata environment:
$ kfod disks=all
-----------------------------------------------------------------
Disk Size Path User Group
=================================================================
1: 433152 Mb o/192.168.10.9/DATA_CD_00_exacell01
2: 433152 Mb o/192.168.10.9/DATA_CD_01_exacell01
3: 433152 Mb o/192.168.10.9/DATA_CD_02_exacell01
...
13: 29824 Mb o/192.168.10.9/DBFS_DG_CD_02_exacell01
14: 29824 Mb o/192.168.10.9/DBFS_DG_CD_03_exacell01
15: 29824 Mb o/192.168.10.9/DBFS_DG_CD_04_exacell01
...
23: 108224 Mb o/192.168.10.9/RECO_CD_00_exacell01
24: 108224 Mb o/192.168.10.9/RECO_CD_01_exacell01
25: 108224 Mb o/192.168.10.9/RECO_CD_02_exacell01
...
474: 108224 Mb o/192.168.10.22/RECO_CD_09_exacell14
475: 108224 Mb o/192.168.10.22/RECO_CD_10_exacell14
476: 108224 Mb o/192.168.10.22/RECO_CD_11_exacell14
-----------------------------------------------------------------
ORACLE_SID ORACLE_HOME
=================================================================
+ASM1 /u01/app/11.2.0.3/grid
+ASM2 /u01/app/11.2.0.3/grid
+ASM3 /u01/app/11.2.0.3/grid
...
+ASM8 /u01/app/11.2.0.3/grid
$
Note that grid disks are prefixed with either DATA, RECO or DBFS_DG. Those are ASM disk group names in this environment. Each grid disk name ends with the storage cell name. It is also important to note that disks with the same prefix have the same size. The above example is from a full rack - hence 14 storage cells and 8 ASM instances.
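To confirm that grid disks with the same prefix really are the same size, we can group them by the prefix taken from the path. This is my own quick check, not part of the original output; it assumes the ASM instance can already see the grid disks:
SQL> select substr(gd, 1, instr(gd, '_CD_') - 1) prefix,
            count(*) disks,
            min(os_mb) min_mb,
            max(os_mb) max_mb
     from (select substr(path, instr(path, '/', -1) + 1) gd, os_mb
           from v$asm_disk)
     group by substr(gd, 1, instr(gd, '_CD_') - 1);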
ASM_DISKSTRING
In Exadata, ASM_DISKSTRING='o/*/*'. This tells ASM that it is running on an Exadata compute node and that it should expect grid disks.
$ sqlplus / as sysasm
SQL> show parameter asm_diskstring
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      o/*/*
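If the parameter ever needs to be set by hand (it is normally configured at deployment time), a minimal sketch:
SQL> alter system set asm_diskstring = 'o/*/*' scope=both sid='*';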
Automatic failgroups
There are no external redundancy disk groups in Exadata - you have a choice of either normal or high redundancy. When creating disk groups, ASM automatically puts all grid disks from the same storage cell into the same failgroup, and the failgroup is named after the storage cell.
This would be an example of creating a disk group in an Exadata environment (note how the grid disk prefix comes in handy):
SQL> create diskgroup RECO
disk 'o/*/RECO*'
attribute
'COMPATIBLE.ASM'='11.2.0.0.0',
'COMPATIBLE.RDBMS'='11.2.0.0.0',
'CELL.SMART_SCAN_CAPABLE'='TRUE';
Once the disk group is created we can check the disk and failgroup names:
SQL> select name, failgroup, path from v$asm_disk_stat where name like 'RECO%';
NAME FAILGROUP PATH
-------------------- --------- -----------------------------------
RECO_CD_08_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_08_exacell01
RECO_CD_07_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_07_exacell01
RECO_CD_01_EXACELL01 EXACELL01 o/192.168.10.3/RECO_CD_01_exacell01
...
RECO_CD_00_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_00_exacell02
RECO_CD_05_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_05_exacell02
RECO_CD_04_EXACELL02 EXACELL02 o/192.168.10.4/RECO_CD_04_exacell02
...
SQL>
Note that we did not specify the failgroup names in the CREATE DISKGROUP statement. ASM has automatically put grid disks from the same storage cell in the same failgroup.
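To confirm the distribution, we can count the disks per failgroup; this is my own quick check, not part of the original example:
SQL> select failgroup, count(*) disks
     from v$asm_disk_stat
     where name like 'RECO%'
     group by failgroup
     order by failgroup;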
cellip.ora
The cellip.ora is the configuration file, present on every database server, that tells ASM instances which storage cells are available to the cluster. Here is the content of a typical cellip.ora file for a quarter rack system:
$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.10.3"
cell="192.168.10.4"
cell="192.168.10.5"
Now that we have seen what is in cellip.ora, the grid disk paths in the examples above should make more sense.
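Each IP address from cellip.ora should also be visible to the instances as a cell path; a minimal sketch, assuming the V$CELL view is available on the compute nodes:
SQL> select cell_path from v$cell;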
Disk group attributes
The following attributes and their values are recommended in Exadata environments:
- COMPATIBLE.ASM - Should be set to the ASM software version in use.
- COMPATIBLE.RDBMS - Should be set to the database software version in use.
- CELL.SMART_SCAN_CAPABLE - Has to be set to TRUE. This attribute/value is actually mandatory in Exadata.
- AU_SIZE - Should be set to 4M. This is the default value in recent ASM versions for Exadata environments.
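To verify these attributes on an existing disk group, we can query V$ASM_ATTRIBUTE; a minimal sketch (the disk group name RECO is just taken from the earlier example):
SQL> select a.name, a.value
     from v$asm_attribute a, v$asm_diskgroup d
     where a.group_number = d.group_number
     and d.name = 'RECO'
     and a.name in ('compatible.asm', 'compatible.rdbms', 'cell.smart_scan_capable', 'au_size');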
ASM initialization parameters
The following ASM initialization parameters and their values are recommended in Exadata environments:
| Parameter | Value |
|---|---|
| CLUSTER_INTERCONNECTS | bondib0 IP address for X2-2. Colon-delimited bondib* IP addresses for X2-8. |
| ASM_POWER_LIMIT | 1 for a quarter rack, 2 for all other racks. |
| SGA_TARGET | 1250 MB |
| PGA_AGGREGATE_TARGET | 400 MB |
| MEMORY_TARGET | 0 |
| MEMORY_MAX_TARGET | 0 |
| PROCESSES | For fewer than 10 database instances per node: 50*(#db instances per node + 1). For 10 or more database instances per node: [50*MIN(#db instances per node + 1, 11)] + [10*MAX(#db instances per node - 10, 0)] |
| USE_LARGE_PAGES | ONLY |
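A quick way to compare an existing ASM instance against these recommendations (my own sketch, run in the ASM instance):
SQL> select name, value
     from v$parameter
     where name in ('cluster_interconnects', 'asm_power_limit',
                    'sga_target', 'pga_aggregate_target',
                    'memory_target', 'memory_max_target',
                    'processes', 'use_large_pages');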
Diskmon
On every database server we find the master diskmon process and one dskm slave process per ASM and database instance:
# ps -ef | egrep "diskmon|dskm" | grep -v grep
oracle 3205 1 0 Mar16 ? 00:01:18 ora_dskm_ONE2
oracle 10755 1 0 Mar16 ? 00:32:19 /u01/app/11.2.0.3/grid/bin/diskmon.bin -d -f
oracle 17292 1 0 Mar16 ? 00:01:17 asm_dskm_+ASM2
oracle 24388 1 0 Mar28 ? 00:00:21 ora_dskm_TWO2
oracle 27962 1 0 Mar27 ? 00:00:24 ora_dskm_THREE2
#
In Exadata, the diskmon is responsible for:
- Handling of storage cell failures and I/O fencing
- Monitoring of Exadata Server state on all storage cells in the cluster (heartbeat)
- Broadcasting intra-database IORM (I/O Resource Manager) plans from databases to storage cells
- Monitoring of the control messages from database and ASM instances to storage cells
- Communicating with other diskmons in the cluster
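If diskmon misbehaves, its log on the database server is a good place to start. A hedged sketch, assuming the standard 11.2 Grid Infrastructure log layout and the Grid home from the examples above (adjust both to your environment):
# tail -f /u01/app/11.2.0.3/grid/log/`hostname -s`/diskmon/diskmon.log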