Friday, May 4, 2012

HOW TO: Take a storage cell offline for maintenance.

 

A few days ago, I had to take a storage cell offline so an engineer could replace a bad flash card. Following MetaLink article [1188080.1], here’s what I did.
  • By default, ASM drops a disk shortly after it is taken offline; however, you can set the DISK_REPAIR_TIME attribute to prevent this operation by specifying a time interval to repair the disk and bring it back online. The default DISK_REPAIR_TIME attribute value of 3.6h should be adequate for most environments.
    • To check repair times for all mounted disk groups, log in to the ASM instance and run the following query:
SQL> select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
NAME     |VALUE
---------|-------
DATA     |3.6h
DBFS     |3.6h
RECO     |3.6h
    • If the ASM disks need to stay offline for longer than the default 3.6 hours, raise the attribute first. For example:
SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'DISK_REPAIR_TIME'='8.5H';
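To decide whether the attribute needs raising before the maintenance window, the values returned by the query can be compared against the planned outage. The following is only a minimal sketch: the function names and the 6-hour window are hypothetical, and it assumes a value without a unit means hours and that `m` means minutes, as the ASM documentation describes for DISK_REPAIR_TIME.

```python
import re


def repair_time_hours(value):
    """Convert a DISK_REPAIR_TIME value such as '3.6h' or '90m' to hours."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([hHmM]?)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized DISK_REPAIR_TIME value: {value!r}")
    amount = float(match.group(1))
    unit = match.group(2).lower()
    # Assumption: a value with no unit is interpreted as hours.
    return amount / 60 if unit == "m" else amount


def groups_needing_longer_repair_time(repair_times, planned_hours):
    """Return the disk groups whose repair time is shorter than the outage."""
    return sorted(
        name
        for name, value in repair_times.items()
        if repair_time_hours(value) < planned_hours
    )


# Values as returned by the query above; the 6-hour window is made up.
current = {"DATA": "3.6h", "DBFS": "3.6h", "RECO": "3.6h"}
print(groups_needing_longer_repair_time(current, planned_hours=6))
```

Any disk group the helper reports would need an ALTER DISKGROUP ... SET ATTRIBUTE like the one shown above before the cell goes down.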
  • Next, I checked whether ASM would be OK if the grid disks went OFFLINE. The following command should return ‘Yes’ in the asmdeactivationoutcome column for every grid disk listed:
cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_cel03 ONLINE Yes
DATA_CD_01_cel03 ONLINE Yes
DATA_CD_02_cel03 ONLINE Yes
DATA_CD_03_cel03 ONLINE Yes
  • If one or more disks had returned asmdeactivationoutcome=’No’, I would have needed to wait for some time and repeat the previous step. Since all disks returned asmdeactivationoutcome=’Yes’, I proceeded with taking the grid disks offline in the next step.
Note: Taking the storage server offline when one or more grid disks return asmdeactivationoutcome=’No’ will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.
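With dozens of grid disks on a cell, scanning that output by eye gets tedious. Here is a rough sketch of a check that picks out any disk whose asmdeactivationoutcome is not Yes. The function names are my own, and it assumes the output has the three whitespace-separated columns shown above, with longer messages wrapped in double quotes.

```python
def parse_griddisk_rows(output):
    """Split 'name asmmodestatus asmdeactivationoutcome' rows into tuples."""
    rows = []
    for line in output.splitlines():
        # Split into at most three fields so a quoted message stays intact.
        parts = line.split(None, 2)
        if len(parts) == 3:
            name, status, outcome = parts
            rows.append((name, status, outcome.strip().strip('"')))
    return rows


def disks_blocking_deactivation(output):
    """Return the names of grid disks that ASM says cannot be deactivated."""
    return [
        name
        for name, _status, outcome in parse_griddisk_rows(output)
        if outcome != "Yes"
    ]


sample = """\
DATA_CD_00_cel03 ONLINE Yes
DATA_CD_01_cel03 ONLINE Yes
DATA_CD_02_cel03 ONLINE Yes
DATA_CD_03_cel03 ONLINE Yes
"""
blocking = disks_blocking_deactivation(sample)
print("safe to proceed" if not blocking else f"wait, blocked by: {blocking}")
```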
  • The next step was to run the cellcli command to inactivate all grid disks on cel03, the cell I wanted to shut down.
CellCLI> ALTER GRIDDISK ALL INACTIVE
This action could have taken 10 minutes or longer depending on activity. Luckily for me, it didn’t. It is very important to make sure all the disks were taken offline successfully before shutting down the cell services. Inactivating the grid disks automatically takes the disks OFFLINE in the ASM instance.
  • Next, I confirmed that the grid disks were offline by performing the following actions:
    • Run the command below. Once the disks are offline in ASM, the output should show asmmodestatus=UNUSED or OFFLINE and asmdeactivationoutcome=Yes for all grid disks. Only then is it safe to proceed with shutting down or restarting the cell:
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
DATA_CD_00_cel03 OFFLINE Yes
DATA_CD_01_cel03 OFFLINE Yes
DATA_CD_02_cel03 OFFLINE Yes
DATA_CD_03_cel03 OFFLINE Yes
    • List the griddisk to confirm that all show offline:
CellCLI> LIST GRIDDISK
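The "only then is it safe" condition can be written down as a small gate to run before touching the power. This is a sketch under my own assumptions: `safe_to_shut_down` is a made-up name, the input is the three-column output shown above, and an empty listing is treated as not safe because it more likely means the command failed than that there are no grid disks.

```python
SAFE_STATES = {"OFFLINE", "UNUSED"}


def safe_to_shut_down(output):
    """True only if every grid disk is OFFLINE/UNUSED with outcome Yes."""
    rows = [line.split(None, 2) for line in output.splitlines() if line.strip()]
    if not rows:
        # Assumption: no rows means the listing failed, so refuse to proceed.
        return False
    return all(
        len(row) == 3
        and row[1].upper() in SAFE_STATES
        and row[2].strip().strip('"') == "Yes"
        for row in rows
    )


after_inactivate = """\
DATA_CD_00_cel03 OFFLINE Yes
DATA_CD_01_cel03 OFFLINE Yes
DATA_CD_02_cel03 OFFLINE Yes
DATA_CD_03_cel03 OFFLINE Yes
"""
print(safe_to_shut_down(after_inactivate))
```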
  • I could now reboot the cell. Oracle Exadata Storage Servers are powered off and rebooted using the Linux shutdown command.
    • The following command, run as root, shuts down Oracle Exadata Storage Server immediately:
# shutdown -h -y now
(When powering off Oracle Exadata Storage Servers, all storage services are automatically stopped.)
    • If I had to reboot, I would have used this command:
# shutdown -r -y now
  • Once the cell came back online, I had to reactivate the grid disks:
cellcli -e alter griddisk all active
  • To verify that all disks are ‘active’, I used the following command:
CellCLI> list griddisk
  • Verify grid disk status:
DATA_CD_00_cel03 SYNCING
DATA_CD_01_cel03 ONLINE
DATA_CD_02_cel03 ONLINE
DATA_CD_03_cel03 SYNCING
Oracle ASM synchronization is only complete when all grid disks show asmmodestatus=ONLINE.
This operation uses Fast Mirror Resync, which does not trigger an ASM rebalance. The resync restores only the extents that were written while the disk was offline.
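Waiting for every disk to reach ONLINE is a natural thing to poll for instead of re-running the listing by hand. Below is a minimal sketch of such a loop. The names, the 60-second interval, and the retry limit are all arbitrary choices of mine, and the function that fetches the listing is passed in, so nothing here calls cellcli directly.

```python
import time


def resync_complete(output):
    """True if the 'name asmmodestatus' listing shows every disk ONLINE."""
    rows = [line.split() for line in output.splitlines() if line.strip()]
    # An empty listing is treated as "not complete" rather than success.
    return bool(rows) and all(
        len(row) >= 2 and row[1].upper() == "ONLINE" for row in rows
    )


def wait_until_online(fetch_output, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll fetch_output() until all disks are ONLINE or polls run out."""
    for attempt in range(max_polls):
        if resync_complete(fetch_output()):
            return True
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False


# Demo with canned listings instead of a live cell.
listings = iter([
    "DATA_CD_00_cel03 SYNCING\nDATA_CD_01_cel03 ONLINE\n",
    "DATA_CD_00_cel03 ONLINE\nDATA_CD_01_cel03 ONLINE\n",
])
print(wait_until_online(lambda: next(listings), poll_seconds=0))
```

On a real cell, `fetch_output` would be whatever returns the name and asmmodestatus listing, for example a small wrapper that runs the cellcli listing and returns its standard output.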
  • Before taking another storage server offline, Oracle ASM synchronization must complete on the restarted Oracle Exadata Storage Server. If synchronization is not complete, then the check performed on another storage server will fail. The following is an example of the output:
CellCLI> list griddisk attributes name where asmdeactivationoutcome != 'Yes'
DATA_CD_00_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
DATA_CD_01_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
DATA_CD_02_cel02 "Cannot de-activate due to other offline disks in the diskgroup"
DATA_CD_03_cel02 "Cannot de-activate due to other offline disks in the diskgroup"

I could then go into ASM and check for any running operations:


SQL> select * from gv$asm_operation;

 
