HOW TO: Take a storage cell offline for maintenance.
A few days ago, I had to take a storage cell offline so an engineer
could replace a bad flash card. Following MetaLink article [1188080.1],
here’s what I did.
- By default, ASM drops a disk shortly after it is taken offline;
however, you can set the DISK_REPAIR_TIME attribute to prevent this
operation by specifying a time interval to repair the disk and bring it
back online. The default DISK_REPAIR_TIME attribute value of 3.6h should
be adequate for most environments.
- To check repair times for all mounted disk groups, log into the ASM instance and run the following query:
SQL> select dg.name, a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number = a.group_number and a.name = 'disk_repair_time';
- If you need to keep the ASM disks offline for longer than the default
3.6 hours, adjust the attribute first, for example:
SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'DISK_REPAIR_TIME' = '8.5H';
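Before picking a value, it can help to compare the planned outage against the repair window. The sketch below is only an illustration: the function names are hypothetical, and it assumes the value uses an h or m suffix (case-insensitive), treating a bare number as hours.

```shell
#!/bin/sh
# Convert a DISK_REPAIR_TIME value such as '3.6h', '8.5H' or '90m'
# to whole minutes. Assumption: a value without a unit means hours.
repair_time_minutes() {
    printf '%s\n' "$1" | awk '{
        v = tolower($0)
        unit = "h"
        if (v ~ /[hm]$/) {
            unit = substr(v, length(v))
            v = substr(v, 1, length(v) - 1)
        }
        mins = (unit == "h") ? v * 60 : v
        # Add 0.5 so floating-point noise does not round the result down.
        printf "%d\n", mins + 0.5
    }'
}

# Succeeds when the planned outage ($1, in minutes) fits inside the
# repair window ($2, a DISK_REPAIR_TIME value).
fits_repair_window() {
    [ "$1" -le "$(repair_time_minutes "$2")" ]
}
```

For a four-hour job, `fits_repair_window 240 3.6h` fails while `fits_repair_window 240 8.5H` succeeds, which is the case where raising the attribute first is worthwhile.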
- Next I checked if ASM will be OK if the grid disks go OFFLINE. The
following command should return ‘Yes’ for the grid disks being listed:
cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_cel03 ONLINE Yes
DATA_CD_01_cel03 ONLINE Yes
DATA_CD_02_cel03 ONLINE Yes
DATA_CD_03_cel03 ONLINE Yes
- If one or more disks had returned asmdeactivationoutcome=‘No’, I would
have needed to wait for some time and repeat the previous step. Since all
disks returned asmdeactivationoutcome=‘Yes’, I proceeded with taking
the grid disks offline in the next step.
Note: Taking the storage server
offline when one or more grid disks return asmdeactivationoutcome=’No’
will cause Oracle ASM to dismount the affected disk group, causing the
databases to shut down abruptly.
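With many grid disks, the check is easy to misread by eye. As a minimal sketch, assuming the three-column listing shown above is piped in on stdin, a small filter can flag any disk that is not safe to deactivate. The function name is hypothetical.

```shell
#!/bin/sh
# Reads "name asmmodestatus asmdeactivationoutcome" rows on stdin.
# Prints the name of every grid disk whose outcome is not "Yes" and
# returns non-zero if any such disk exists or if no rows were read.
unsafe_griddisks() {
    awk 'NF >= 3 {
             seen = 1
             if ($3 != "Yes") { print $1; bad = 1 }
         }
         END { exit (bad || !seen) }'
}
```

On the cell it would be fed from the same listing, for example `cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome | unsafe_griddisks && echo "safe to proceed"`. Treating empty input as a failure is a deliberate choice, so that a failed cellcli call is never mistaken for a clean result.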
- The next step was to run the cellcli command to inactivate all grid disks on cel03, the cell I wanted to shut down.
CellCLI> ALTER GRIDDISK ALL INACTIVE
This action can take 10 minutes or longer depending on activity.
Luckily for me, it didn’t. It is very important to make sure all the
disks were taken offline successfully before shutting down the cell
services. Inactivating the grid disks automatically takes the disks
OFFLINE in the ASM instance.
- Next, I confirmed that the grid disks were offline:
- The command below should show asmmodestatus=UNUSED or OFFLINE and
asmdeactivationoutcome=Yes for all grid disks once the disks are offline
in ASM. Only then is it safe to proceed with shutting down or restarting
the cell:
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_cel03 OFFLINE Yes
DATA_CD_01_cel03 OFFLINE Yes
DATA_CD_02_cel03 OFFLINE Yes
DATA_CD_03_cel03 OFFLINE Yes
- Listing the grid disks once more should confirm that all of them show offline.
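The offline check can also be scripted as a go/no-go gate before the shutdown. This is a sketch under the same assumption as the listing above (name, asmmodestatus and asmdeactivationoutcome as whitespace-separated columns on stdin). The function name is hypothetical.

```shell
#!/bin/sh
# Succeeds only when every row shows asmmodestatus OFFLINE or UNUSED
# together with asmdeactivationoutcome Yes. Empty input counts as failure.
all_griddisks_offline() {
    awk 'NF >= 3 {
             seen = 1
             if (($2 != "OFFLINE" && $2 != "UNUSED") || $3 != "Yes") bad = 1
         }
         END { exit (bad || !seen) }'
}
```

A single disk still reporting ONLINE makes the function fail, which matches the rule that the cell should only be shut down once every grid disk is offline in ASM.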
- I could now reboot the cell. Oracle Exadata Storage Servers are powered off and rebooted using the Linux shutdown command.
- To shut down the Oracle Exadata Storage Server immediately, run the shutdown command as root.
(When powering off Oracle Exadata Storage Servers, all storage services are automatically stopped.)
- Had I needed to reboot instead, I would have used the reboot form of the same shutdown command.
- Once the cell came back online, I had to reactivate the grid disks:
cellcli -e alter griddisk all active
- To verify that all disks were active again, I listed the grid disks and checked asmmodestatus.
Oracle ASM synchronization is only complete when all grid disks show asmmodestatus=ONLINE.
Reactivation uses Fast Mirror Resync, which does not trigger an ASM
rebalance. The resync restores only the extents that were written while
the disk was offline.
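Waiting for the resync can be automated with a simple polling loop. The sketch below is an assumption-laden illustration, not part of the referenced note: the function names are hypothetical, and the listing command is passed in as arguments so the loop can be tried without a cell.

```shell
#!/bin/sh
# Succeeds when every "name asmmodestatus ..." row on stdin is ONLINE.
all_griddisks_online() {
    awk 'NF >= 2 { seen = 1; if ($2 != "ONLINE") bad = 1 }
         END { exit (bad || !seen) }'
}

# Polls a listing command until all grid disks are ONLINE.
# $1 = maximum attempts, $2 = seconds between attempts,
# remaining arguments = the command that prints the listing.
wait_for_resync() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@" | all_griddisks_online; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Sleep only if another attempt is still to come.
        if [ "$attempts" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

On the cell this might be called as `wait_for_resync 60 30 cellcli -e list griddisk attributes name,asmmodestatus`, which checks every 30 seconds for up to half an hour. The attempt count and delay are arbitrary and should be sized to the amount of data written while the cell was down.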
- Before taking another storage server offline, Oracle ASM
synchronization must complete on the restarted Oracle Exadata Storage
Server. If synchronization is not complete, then the check performed on
another storage server will fail. The following is an example of the
output:
CellCLI> list griddisk attributes name where asmdeactivationoutcome != 'Yes'
DATA_CD_00_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
DATA_CD_01_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
DATA_CD_02_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
DATA_CD_03_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
I could then go into ASM and check on rebalance operations:
SQL> select * from gv$asm_operation;