Testing ASM Disk Failure Scenario and disk_repair_time

When an ASM disk fails, ASM's behavior depends on the redundancy type of the diskgroup. With EXTERNAL REDUNDANCY, ASM does no mirroring of its own, so the diskgroup keeps working only if the failure is absorbed by redundancy at the external RAID level. If there is no RAID protection at the external level, the diskgroup is dismounted immediately; the disk must be repaired or replaced, the diskgroup may have to be dropped and re-created, and the data on it must be recovered.

For NORMAL and HIGH redundancy diskgroups, the behavior is a little different. When a disk in such a diskgroup becomes corrupted or goes missing, an error is reported in the alert log file and the disk goes OFFLINE, as the output of the query below shows. To test an ASM disk failure, I simply unplugged from the storage a disk that belonged to an ASM diskgroup with NORMAL redundancy, and then ran the query.

col name format a8

col header_status format a7

set lines 2000

col path format a10

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING

Here we see the value 1200 in the REPAIR_TIMER column; this is the time in seconds after which the disk will be dropped automatically. It is derived from a diskgroup attribute called DISK_REPAIR_TIME, which I discuss below.
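As a small sketch of how the countdown could be read more comfortably, the query below lists only the offline disks and converts REPAIR_TIMER from seconds to minutes. The MINUTES_LEFT alias is my own naming and not part of the view; for the 1200 seconds shown above it works out to 20 minutes.

```sql
-- List offline disks with the time left before ASM drops them.
-- REPAIR_TIMER is reported in seconds; dividing by 60 gives minutes.
-- MINUTES_LEFT is an illustrative alias, not a column of V$ASM_DISK.
select group_number,
       name,
       mode_status,
       repair_timer,
       round(repair_timer / 60, 1) as minutes_left
from   v$asm_disk
where  mode_status = 'OFFLINE'
order  by group_number, name;
```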

In 10g, a disk that went missing was dropped immediately, and a REBALANCE operation kicked in right away: ASM redistributed the extents across the remaining disks in the diskgroup to restore redundancy.

DISK_REPAIR_TIME

Starting with 11g, Oracle provides a diskgroup attribute called DISK_REPAIR_TIME, with a default value of 3.6 hours. It means that when a disk goes missing, ASM does not drop it immediately but waits for the disk to come back online or be replaced. This helps in scenarios where a disk is unplugged accidentally, or a storage server/SAN is disconnected or rebooted and leaves an ASM diskgroup without one or more of its disks. While the disk(s) remain unavailable, ASM keeps track of the extents that should have been written to the missing disk(s), and starts writing them as soon as the disk(s) come back online (this feature is called fast mirror resync). If the disk(s) do not come back online within the DISK_REPAIR_TIME threshold, they are dropped and a rebalance starts.
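The diskgroup attribute is not the only place to set the waiting period. As a sketch, when a disk is taken offline manually (for planned maintenance, for example), the OFFLINE DISK statement accepts a DROP AFTER clause that overrides DISK_REPAIR_TIME for that operation. The disk name data4 and the 30-minute value are only illustrative.

```sql
-- Take one disk offline by hand and give it 30 minutes to return
-- before ASM drops it. DROP AFTER overrides the diskgroup's
-- DISK_REPAIR_TIME for this offline operation only.
-- "data4" and "30m" are example values.
alter diskgroup data offline disk data4 drop after 30m;
```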

FAILGROUP_REPAIR_TIME

Starting with 12c, another attribute can be set for the diskgroup: FAILGROUP_REPAIR_TIME, with a default value of 24 hours. It is similar to DISK_REPAIR_TIME, but applies when a whole failgroup goes offline. In Exadata, all disks of a storage server can belong to one failgroup (so that the mirror copy of an extent is never written to a disk on the same storage server), and this attribute is quite handy when a complete storage server is taken down for maintenance or for some other reason.

The following session shows how to check and set the diskgroup attributes explained above.

SQL> col name format a30

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME VALUE

------------------------------ --------------------

disk_repair_time 3.6h

failgroup_repair_time 24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';

Diskgroup altered.

SQL> alter diskgroup data set attribute 'failgroup_repair_time'='10h';

Diskgroup altered.

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME VALUE

------------------------------ --------------------

disk_repair_time 1h

failgroup_repair_time 10h
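To see which disks a whole-failgroup outage would affect, a query along the following lines could group the disks of a diskgroup by failgroup. This is only a sketch: the diskgroup name DATA is taken from the session above, and the column aliases are my own.

```sql
-- Summarize the disks of diskgroup DATA per failgroup.
-- OFFLINE_DISKS counts members currently offline; MAX_REPAIR_TIMER is the
-- largest remaining countdown (in seconds) within that failgroup.
-- The aliases are illustrative and not columns of V$ASM_DISK.
select d.failgroup,
       count(*)                                                   as disks,
       sum(case when d.mode_status = 'OFFLINE' then 1 else 0 end) as offline_disks,
       max(d.repair_timer)                                        as max_repair_timer
from   v$asm_disk d
       join v$asm_diskgroup g on g.group_number = d.group_number
where  g.name = 'DATA'
group  by d.failgroup
order  by d.failgroup;
```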

ORA-15042

If a disk is offline or missing from an ASM diskgroup, ASM may not mount the diskgroup automatically when the instance restarts. In that case we may need to mount the diskgroup manually with the FORCE option.

SQL> alter diskgroup data mount;

alter diskgroup data mount

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15040: diskgroup is incomplete

ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;

Diskgroup altered.
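After a forced mount it may be worth confirming that the diskgroup really is mounted and how many of its disks ASM considers offline. A minimal check, assuming the diskgroup is named DATA as in the session above:

```sql
-- Confirm the diskgroup state after MOUNT FORCE.
-- OFFLINE_DISKS shows how many member disks are currently offline.
select group_number,
       name,
       type,
       state,
       offline_disks
from   v$asm_diskgroup
where  name = 'DATA';
```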

Monitoring the REPAIR_TIMER

After a disk goes offline, the clock starts ticking. The REPAIR_TIMER value can be monitored to see how much time remains to make the disk available again before it is dropped automatically.

SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING

-- We can confirm that no rebalance has started yet by using the following query

SQL> select * from v$asm_operation;

no rows selected

If we can make this disk available again before DISK_REPAIR_TIME lapses, we can bring it back online. Please note that it has to be brought ONLINE manually.

SQL> alter diskgroup data online disk data4;

Diskgroup altered.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED

--Syncing is in progress, and hence no rebalance would occur.

SQL> select * from v$asm_operation;

no rows selected

-- After some time, everything would become normal.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED
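When several disks have gone offline together (after a storage outage, for instance), naming each disk in its own ONLINE statement gets tedious. As a sketch, the ONLINE ALL form brings every offline disk of the diskgroup back in one statement, and the follow-up query watches the resync. The diskgroup name DATA comes from the session above.

```sql
-- Bring all offline disks of the diskgroup back online at once.
alter diskgroup data online all;

-- Watch the disks move from SYNCING to ONLINE as the resync completes.
select d.name,
       d.mode_status,
       d.repair_timer
from   v$asm_disk d
       join v$asm_diskgroup g on g.group_number = d.group_number
where  g.name = 'DATA'
order  by d.name;
```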

If the same disk cannot be made available again or replaced in time, either ASM drops the disk automatically once DISK_REPAIR_TIME has lapsed, or we drop it manually. A rebalance follows the disk drop.
Since the disk status is OFFLINE, we need to use the FORCE option to drop the disk. After the drop, the rebalance starts and can be monitored from the V$ASM_OPERATION view.

SQL> alter diskgroup data drop disk data4;

alter diskgroup data drop disk data4

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.

SQL> alter diskgroup data drop disk data4 force;

Diskgroup altered.

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE DONE           9         42         42
           2 REBAL COMPACT   RUN            9          1          0
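For a longer rebalance, the same view also gives a rough time estimate, and the rebalance power can be changed while the operation runs. The following is a sketch only; the power value 8 is an arbitrary example, and a suitable value depends on how much I/O the system can spare.

```sql
-- Monitor the rebalance, including the estimated rate and minutes remaining.
select group_number,
       operation,
       pass,
       state,
       power,
       sofar,
       est_work,
       est_rate,
       est_minutes
from   v$asm_operation
where  operation = 'REBAL';

-- Change the rebalance power for the running operation (example value).
alter diskgroup data rebalance power 8;
```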

Later we can replace the faulty disk and add the new disk back into this diskgroup. Adding the disk back initiates a rebalance once again.

SQL> alter diskgroup data add disk 'ORCL:DATA4';

Diskgroup altered.

SQL> select * from v$asm_operation;

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE RUN            9         37       2787
           2 REBAL COMPACT   WAIT           9          1          0

OracleNext - Solution to your Oracle problems

Monday, January 29, 2018
