Monday, January 29, 2018

Testing ASM Disk Failure Scenario and disk_repair_time


When an ASM disk fails, ASM behaves differently depending on the redundancy of the diskgroup. If the diskgroup uses EXTERNAL REDUNDANCY, it keeps working as long as redundancy is provided at the external RAID level. If there is no RAID at the external level, the diskgroup is dismounted immediately, the disk needs to be repaired or replaced, the diskgroup might then need to be dropped and re-created, and the data on this diskgroup would require recovery.

For NORMAL and HIGH redundancy diskgroups, the behavior is a little different. When a disk becomes corrupted or goes missing in a NORMAL/HIGH redundancy diskgroup, an error is reported in the alert log file and the disk goes OFFLINE, as we can see in the output of the query below, which I ran after starting my test of an ASM disk failure. To simulate the failure, I simply unplugged from the storage a disk that belonged to an ASM diskgroup with NORMAL redundancy.
col name format a8
col header_status format a7
set lines 2000
col path format a10
select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING
  
Here we see the value “1200” under the REPAIR_TIMER column; this is the time in seconds after which this disk will be dropped automatically. It is calculated from the value of a diskgroup attribute called DISK_REPAIR_TIME, which I will discuss below.
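
The remaining time is easier to read when converted to minutes. The following is a minimal sketch that uses only v$asm_disk columns already shown in the query above; the alias minutes_left is just an illustrative name.

```sql
-- List only the offline disks and show the remaining repair time in minutes.
-- REPAIR_TIMER is reported in seconds.
select name,
       mode_status,
       repair_timer,
       round(repair_timer/60, 1) as minutes_left
from   v$asm_disk
where  mode_status = 'OFFLINE';
```

For the DATA4 disk above, 1200 seconds corresponds to 20 minutes.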

In 10g, a missing disk was dropped immediately and a REBALANCE operation kicked in right away, with ASM redistributing the extents across the remaining disks in the diskgroup to restore redundancy.

DISK_REPAIR_TIME

Starting with 11g, Oracle provides a diskgroup attribute called “DISK_REPAIR_TIME”, with a default value of 3.6 hours. It means that when a disk goes missing, the disk is not dropped immediately; ASM waits for it to come back online or be replaced. This feature helps in scenarios where a disk is unplugged accidentally, or a storage server/SAN gets disconnected or rebooted, leaving an ASM diskgroup without one or more disks. While the disk(s) remain unavailable, ASM keeps track of the extents that should have been written to the missing disk(s), and writes them as soon as the disk(s) come back online (this feature is called fast mirror resync). If the disk(s) do not come back online within the DISK_REPAIR_TIME threshold, they are dropped and a rebalance starts.
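
As far as I know, DISK_REPAIR_TIME can only be set when the diskgroup's compatible.asm and compatible.rdbms attributes are at 11.1 or higher, so it can be worth checking them together. This is a sketch under that assumption; the aliases dg_name and attr_name are illustrative.

```sql
-- Show the compatibility settings next to disk_repair_time for every
-- mounted diskgroup that exposes attributes.
select g.name  as dg_name,
       a.name  as attr_name,
       a.value
from   v$asm_diskgroup g
join   v$asm_attribute a on a.group_number = g.group_number
where  a.name in ('compatible.asm', 'compatible.rdbms', 'disk_repair_time')
order  by g.name, a.name;
```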

FAILGROUP_REPAIR_TIME

Starting with 12c, another attribute can be set for the diskgroup: FAILGROUP_REPAIR_TIME, which has a default value of 24 hours. It is similar to DISK_REPAIR_TIME, but applies to a whole failgroup. In Exadata, all disks belonging to a storage server can form one failgroup (so that a mirror copy of an extent is never written to a disk in the same storage server), and this attribute is quite handy in an Exadata environment when a complete storage server is taken down for maintenance or for some other reason.
The following shows how to check and set the values of the diskgroup attributes explained above.
SQL> col name format a30
SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';

Diskgroup altered.

SQL>  alter diskgroup data set attribute  'failgroup_repair_time'='10h';

Diskgroup altered.

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h
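
The attribute sets the default for the diskgroup, but for planned maintenance the timer can also be given per statement with the DROP AFTER clause of OFFLINE DISK. The sketch below assumes the same diskgroup and disk names used in this post; the 4h value is only an example.

```sql
-- Take a disk offline deliberately and give it 4 hours before ASM drops it,
-- overriding the diskgroup's disk_repair_time for this disk only.
-- The time can be given in minutes (m) or hours (h).
alter diskgroup data offline disk data4 drop after 4h;
```

To my understanding, this clause is also the way to change the timer of a disk that is already offline, since a new attribute value does not affect disks that went offline before the change.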

ORA-15042

If a disk is offline/missing from an ASM diskgroup, ASM may not mount the diskgroup automatically during an instance restart. In this case, we might need to mount the diskgroup manually with the FORCE option.
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;

Diskgroup altered.

Monitoring the REPAIR_TIMER

After a disk goes offline, the clock starts ticking, and the value of REPAIR_TIMER can be monitored to see how much time remains to make the disk available again before it is dropped automatically.
SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING

--We can confirm that no rebalance has started yet by using the following query
SQL> select * from v$asm_operation;

no rows selected

If we are able to make this disk available again, or replace it, before DISK_REPAIR_TIME lapses, we can bring the disk back online. Please note that we need to bring it ONLINE manually.
SQL> alter diskgroup data online disk data4;

Diskgroup altered.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED

--Syncing is in progress, and hence no rebalance would occur.

SQL> select * from v$asm_operation;

no rows selected
-- After some time, everything would become normal.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED
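
When more than one disk has gone offline, for example after a storage outage, naming each disk is tedious. The ONLINE clause also accepts ALL and a failgroup form. This is a sketch only; the failgroup name fg1 is hypothetical and not taken from this test setup.

```sql
-- Bring every offline disk in the diskgroup back online in one statement.
alter diskgroup data online all;

-- Or bring back all disks of a single failgroup (fg1 is a hypothetical name).
alter diskgroup data online disks in failgroup fg1;
```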


If the same disk cannot be made available or replaced, either ASM drops the disk automatically after DISK_REPAIR_TIME has lapsed, or we drop the ASM disk manually. A rebalance occurs after the disk is dropped.
Since the disk status is OFFLINE, we need to use the FORCE option to drop it. After the disk is dropped, the rebalance starts and can be monitored from the v$ASM_OPERATION view.
SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.


SQL> alter diskgroup data drop disk data4 force;

Diskgroup altered.

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE DONE           9         42         42
           2 REBAL COMPACT   RUN            9          1          0
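
For a longer rebalance it helps to see the estimated time left as well. The sketch below extends the query above with the EST_RATE and EST_MINUTES columns of v$asm_operation and shows how the rebalance power could be changed; the power value 11 is only an example, and the right value depends on how much I/O the system can spare.

```sql
-- Same view as above, with the estimated rate and minutes remaining.
select group_number, operation, pass, state, power,
       sofar, est_work, est_rate, est_minutes
from   v$asm_operation;

-- Optionally change the power of the running rebalance (11 is an example).
alter diskgroup data rebalance power 11;
```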

Later we can replace the faulty disk and add the new disk back into this diskgroup. Adding the disk back initiates a rebalance once again.
SQL> alter diskgroup data add disk 'ORCL:DATA4';

Diskgroup altered.

SQL> select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE RUN            9         37       2787
           2 REBAL COMPACT   WAIT           9          1          0
