Handling a Storage Fault Caused by an Equipment Room Power Outage

Published: 2015-09-26
Problem Description

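A brief overview of the customer environment: a desktop cloud deployed at the beginning of the year. The hardware consists of one S5500T storage array with two expansion enclosures, populated with SAS and NL-SAS disks, and two RH5885V3 servers. There is no FC switch. Software versions: FusionCompute C1R5C00SPC100, FusionAccess V1R5C20SPC100. About 40 desktops were deployed at the beginning of the year, and I later learned that the customer had never actually used them, which made it possible to work on the system boldly and without much worry.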

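Symptom: according to the integrator, the customer's equipment room recently lost power unexpectedly, and the UPS batteries were also drained. After power was restored, some desktops could be logged in to and others could not. I tried several accounts and confirmed this.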

Alarm Information

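I then logged in to FA (FusionAccess) to check and found that some desktops were in the Stopped state. Manually waking them had no effect, and neither did shutting them down. I then logged in to FC (FusionCompute) and saw more than ten VM startup failure alarms. The task center also showed VM startup failures, and several VMs were stuck in the HA process.

A LUN fault alarm was also present. No screenshot was taken.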

Handling Process

On inspection, the system disks of all the faulty VMs were on this LUN. At this point the fault was localized to the storage system.
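The localization step above amounts to grouping VMs by the LUN that holds their system disk and checking where the failures concentrate. The following is a minimal sketch of that reasoning, not something taken from FusionCompute itself: the VM names and the placement table are invented for illustration, and only the LUN names `desktop_oslun01` and `desktop_oslun02` come from this case.

```python
def failure_ratio_by_lun(vm_to_lun, failed_vms):
    """Return {lun: fraction of VMs on that LUN that failed to start}."""
    totals = {}
    failed = {}
    for vm, lun in vm_to_lun.items():
        totals[lun] = totals.get(lun, 0) + 1
        if vm in failed_vms:
            failed[lun] = failed.get(lun, 0) + 1
    return {lun: failed.get(lun, 0) / totals[lun] for lun in totals}


# Hypothetical inventory: which LUN holds each VM's system disk.
vm_to_lun = {
    "desktop-01": "desktop_oslun01",
    "desktop-02": "desktop_oslun01",
    "desktop-03": "desktop_oslun02",
    "desktop-04": "desktop_oslun02",
}
# Hypothetical list of VMs that raised startup failure alarms.
failed_vms = {"desktop-03", "desktop-04"}

ratios = failure_ratio_by_lun(vm_to_lun, failed_vms)
for lun, ratio in sorted(ratios.items()):
    print(f"{lun}: {ratio:.0%} of VMs failed")
```

A LUN on which every VM failed, next to a LUN on which none failed, points at the storage layer rather than at the hosts or the desktop software.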

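Connecting remotely to the storage array showed that the LUN reported in the FC alarm was faulty, and so was the RAID group that the LUN belongs to. Checking the disks showed that disks 11, 12, and 14 in disk enclosure 1 were offline. The fault had been found. I then looked at the event log.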

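The event log recorded that the equipment ran at high temperature after the power failure, which took some disks offline. Some of them recovered, but the three disks above were unlucky and did not return to normal after power was reapplied. Neither the event log nor the system log showed the order in which the disks failed, and the RAID group they belong to was degraded and faulted at the same moment, so it is reasonable to conclude that the three disks failed simultaneously. In addition, the customer had no live services, so even lost data would not affect them (if your situation is different, do not be this aggressive). I therefore revived the disks. The steps were as follows:
    1. Remove the configured hot spare disks.
    2. Enter the command line, check the disk status, and perform the revive operations.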

--------------------- Welcome -----------------------
-----------------System Information------------------
|  System Name           | Desktop_S5500T           |
|  Device Type           | OceanStor S5500T         |
|  Current System Mode   | Double Controllers Normal|
|  Mirroring Link Status | Link Up                  |
|  Location              | dongying                 |
|  Time                  | 2015-09-22 16:56:53      |
|  Device Serial Number  | 210235G7KA10EC000011     |
|  Product Version       | V100R005C02              |
-----------------------------------------------------

admin:/>showdisk -physic
-physic
    Displays all physical disks in the storage system.

admin:/>showdisk -physic
===============================================================================
                               Disk Information
-------------------------------------------------------------------------------
  Disk Location    Status    Type      Vendor     Model              Serial Number           FW Version    Speed(RPM)    Rate(Gbps)    Raw Capacity(GB)    Bar Code
-------------------------------------------------------------------------------
  (1,11)           Normal    SAS       Seagate    ST3600057SS        6SL8KWBY0000N443028L    0008          15000         6.0           558                 210235908810E6000751
  (1,12)           Normal    SAS       Seagate    ST3600057SS        6SL8NYWK0000N4450SXM    0008          15000         6.0           558                 210235908810E6001081
  (1,13)           Normal    SAS       Seagate    ST3600057SS        6SL8KJTM0000N4423DJ0    0008          15000         6.0           558                 210235908810E6001082
  (1,14)           Normal    SAS       Seagate    ST3600057SS        6SL8NY330000N4445S7W    0008          15000         6.0           558                 210235908810E6001073

===============================================================================


admin:/>showdisk -logic
-logic
    Displays all logical disks in the storage system.

admin:/>showdisk -logic
======================================================================
                           Disk Information
----------------------------------------------------------------------
  Disk Location    Logic Status    Logic Type    Usable Capacity(GB) 
----------------------------------------------------------------------                              
  (1,10)           Normal          Member        558                 
  (1,11)           Fault           Member        558                 
  (1,12)           Fault           Member        558                 
  (1,13)           Normal          Member        558                 
  (1,14)           Fault           Member        558                                 
======================================================================

admin:/>
admin:/>developer
IMPORTANT:
Only technical support engineers are allowed to use the commands in this mode. The misuse of any command may interrupt your services and cause data loss. Our company is not responsible for any loss or damage caused by any person not with our company. Before you start, make sure that you have fully understood the function and impact of each command.
Are you sure you want to enter this mode? (y/n)
y
Enter Password:
developer: admin:/>showrg
===========================================================================================================================
                                                  RAID Group Information
---------------------------------------------------------------------------------------------------------------------------
  ID    Level    Status    Free Capacity(MB)    Disk List                                                       Name      
---------------------------------------------------------------------------------------------------------------------------
  0     RAID5    Normal    6144                 0,0;0,1;0,2;0,3;                                                C_Raid01  
  1     RAID5    Normal    0                    0,4;0,5;0,6;0,7;0,8;0,9;0,10;                                   C_Raid02  
  2     RAID5    Normal    3618560              1,0;1,1;1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9;1,10;                   E1_Raid01 
  3     RAID5    Fault     5240576              1,11;1,12;1,13;1,14;1,15;1,16;1,17;1,18;1,19;1,20;1,21;1,22;    E1_Raid02 
  4     RAID5    Normal    8966144              2,0;2,1;2,2;2,3;2,4;2,5;2,6;2,7;2,8;                            E2_Raid01 
  5     RAID5    Normal    15257600             2,9;2,10;2,11;2,12;2,13;2,14;2,15;2,16;2,17;                    E2_Raid02 
===========================================================================================================================

developer: admin:/>showlun
=======================================================================================================================================================================
                                                                            LUN Information
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  ID      RAID Group ID    Disk Pool ID    Status          Controller    Visible Capacity(MB)    LUN Name                            Stripe Unit Size(KB)    Lun Type 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0       0                --              Normal          A             1638400.0               OSlun01                             256                     THICK    
  1       1                --              Normal          B             3430400.0               OSlun02                             128                     THICK    
  2       2                --              Normal          A             1048576.0               VRMlun01                            128                     THICK    
  3       2                --              Normal          A             1048576.0               desktop_oslun01                     64                      THICK    
  4       3                --              Fault           B             1048576.0               desktop_oslun02                     64                      THICK    
  5       4                --              Normal          A             2097152.0               desktop_datalun01                   64                      THICK    
  6       4                --              Normal          A             4194304.0               datalun                             64                      THICK    
=======================================================================================================================================================================

developer: admin:/>revivedisklun -e 1 -s 11
DANGER: This command will revive the status of specific LUN/RAID group/disk to normal, it may bring data losing in the device.
Have you read danger alert message carefully?(y/n)
y
Are you sure to perform this operation?(y/n)
y
command operates successfully.
developer: admin:/>revivedisklun -e 1 -s 12
DANGER: This command will revive the status of specific LUN/RAID group/disk to normal, it may bring data losing in the device.
Have you read danger alert message carefully?(y/n)
y
Are you sure to perform this operation?(y/n)
y
command operates successfully.
developer: admin:/>revivedisklun -e 1 -s 14
DANGER: This command will revive the status of specific LUN/RAID group/disk to normal, it may bring data losing in the device.
Have you read danger alert message carefully?(y/n)
y
Are you sure to perform this operation?(y/n)
y
command operates successfully.
developer: admin:/>showrg
===========================================================================================================================
                                                  RAID Group Information
---------------------------------------------------------------------------------------------------------------------------
  ID    Level    Status    Free Capacity(MB)    Disk List                                                       Name      
---------------------------------------------------------------------------------------------------------------------------
  0     RAID5    Normal    6144                 0,0;0,1;0,2;0,3;                                                C_Raid01  
  1     RAID5    Normal    0                    0,4;0,5;0,6;0,7;0,8;0,9;0,10;                                   C_Raid02  
  2     RAID5    Normal    3618560              1,0;1,1;1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9;1,10;                   E1_Raid01 
  3     RAID5    Normal    5240576              1,11;1,12;1,13;1,14;1,15;1,16;1,17;1,18;1,19;1,20;1,21;1,22;    E1_Raid02 
  4     RAID5    Normal    8966144              2,0;2,1;2,2;2,3;2,4;2,5;2,6;2,7;2,8;                            E2_Raid01 
  5     RAID5    Normal    15257600             2,9;2,10;2,11;2,12;2,13;2,14;2,15;2,16;2,17;                    E2_Raid02 
===========================================================================================================================

developer: admin:/>showlun
=======================================================================================================================================================================
                                                                            LUN Information
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  ID      RAID Group ID    Disk Pool ID    Status          Controller    Visible Capacity(MB)    LUN Name                            Stripe Unit Size(KB)    Lun Type 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0       0                --              Normal          A             1638400.0               OSlun01                             256                     THICK    
  1       1                --              Normal          B             3430400.0               OSlun02                             128                     THICK    
  2       2                --              Normal          A             1048576.0               VRMlun01                            128                     THICK    
  3       2                --              Normal          A             1048576.0               desktop_oslun01                     64                      THICK    
  4       3                --              Fault           B             1048576.0               desktop_oslun02                     64                      THICK    
  5       4                --              Normal          A             2097152.0               desktop_datalun01                   64                      THICK    
  6       4                --              Normal          A             4194304.0               datalun                             64                      THICK    
=======================================================================================================================================================================

developer: admin:/>revivedisklun -lun 4
DANGER: This command will revive the status of specific LUN/RAID group/disk to normal, it may bring data losing in the device.
Have you read danger alert message carefully?(y/n)
y
Are you sure to perform this operation?(y/n)
y
command operates successfully.
developer: admin:/>showlun
=======================================================================================================================================================================
                                                                            LUN Information
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  ID      RAID Group ID    Disk Pool ID    Status          Controller    Visible Capacity(MB)    LUN Name                            Stripe Unit Size(KB)    Lun Type 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0       0                --              Normal          A             1638400.0               OSlun01                             256                     THICK    
  1       1                --              Normal          B             3430400.0               OSlun02                             128                     THICK    
  2       2                --              Normal          A             1048576.0               VRMlun01                            128                     THICK    
  3       2                --              Normal          A             1048576.0               desktop_oslun01                     64                      THICK    
  4       3                --              Normal          B             1048576.0               desktop_oslun02                     64                      THICK    
  5       4                --              Normal          A             2097152.0               desktop_datalun01                   64                      THICK    
  6       4                --              Normal          A             4194304.0               datalun                             64                      THICK    
=======================================================================================================================================================================

developer: admin:/>showrg
===========================================================================================================================
                                                  RAID Group Information
---------------------------------------------------------------------------------------------------------------------------
  ID    Level    Status    Free Capacity(MB)    Disk List                                                       Name      
---------------------------------------------------------------------------------------------------------------------------
  0     RAID5    Normal    6144                 0,0;0,1;0,2;0,3;                                                C_Raid01  
  1     RAID5    Normal    0                    0,4;0,5;0,6;0,7;0,8;0,9;0,10;                                   C_Raid02  
  2     RAID5    Normal    3618560              1,0;1,1;1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9;1,10;                   E1_Raid01 
  3     RAID5    Normal    5240576              1,11;1,12;1,13;1,14;1,15;1,16;1,17;1,18;1,19;1,20;1,21;1,22;    E1_Raid02 
  4     RAID5    Normal    8966144              2,0;2,1;2,2;2,3;2,4;2,5;2,6;2,7;2,8;                            E2_Raid01 
  5     RAID5    Normal    15257600             2,9;2,10;2,11;2,12;2,13;2,14;2,15;2,16;2,17;                    E2_Raid02 
===========================================================================================================================

developer: admin:/>exit
admin:/> 
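
The transcript can also be cross-checked mechanically: RAID 5 tolerates the loss of only one member, so three logically faulted members in one group are enough to explain the Fault status. The following is a rough sketch that parses rows copied from the `showdisk -logic` and `showrg` output above. The function names are invented for this illustration, and the parsing assumes the column layout shown in this transcript, which may differ on other firmware versions.

```python
import re

# Rows copied from the "showdisk -logic" output in this case.
SHOWDISK_LOGIC = """\
  (1,10)           Normal          Member        558
  (1,11)           Fault           Member        558
  (1,12)           Fault           Member        558
  (1,13)           Normal          Member        558
  (1,14)           Fault           Member        558
"""

# Rows copied from the first "showrg" output in this case.
SHOWRG = """\
  2     RAID5    Normal    3618560              1,0;1,1;1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9;1,10;                   E1_Raid01
  3     RAID5    Fault     5240576              1,11;1,12;1,13;1,14;1,15;1,16;1,17;1,18;1,19;1,20;1,21;1,22;    E1_Raid02
"""


def fault_disks(text):
    """Return the set of (enclosure, slot) whose logic status is Fault."""
    disks = set()
    for m in re.finditer(r"\((\d+),(\d+)\)\s+(\w+)", text):
        if m.group(3) == "Fault":
            disks.add((int(m.group(1)), int(m.group(2))))
    return disks


def raid_groups(text):
    """Return {rg_id: (level, status, set of member disks)}."""
    groups = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6 or not parts[0].isdigit():
            continue  # skip headers, separators, and blank lines
        rg_id, level, status, _free, disk_list = parts[:5]
        members = {
            tuple(int(n) for n in d.split(","))
            for d in disk_list.split(";") if d
        }
        groups[int(rg_id)] = (level, status, members)
    return groups


def failed_members_per_group(groups, faults):
    """Return {rg_id: sorted faulted members} for groups that contain any."""
    return {
        rg_id: sorted(members & faults)
        for rg_id, (_level, _status, members) in groups.items()
        if members & faults
    }


faults = fault_disks(SHOWDISK_LOGIC)
groups = raid_groups(SHOWRG)
for rg_id, lost in failed_members_per_group(groups, faults).items():
    level, status, _ = groups[rg_id]
    print(f"RAID group {rg_id} ({level}, {status}): faulted members {lost}")
```

Here all three faulted disks fall in RAID group 3 (E1_Raid02), which is the group backing the faulted LUN `desktop_oslun02`.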

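Root Cause

After the customer's equipment room lost mains power, the air conditioning shut down, but the UPS continued to power the equipment. From the start of UPS supply until the batteries were drained, the equipment ran at high temperature, which caused some disks to go offline and the RAID group and LUN to fault.

After power was restored, some disks remained offline, so the fault persisted.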

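Solution

1. Determine the service type and its importance, in order to choose an appropriate time and method for the work.

2. Check the logs to determine the order in which the disks failed. (As a rule, revive starts with the last disk that failed and proceeds in reverse order of failure.)

3. Check both the physical status and the logical status of the disks. A disk whose physical status is Normal and whose logical status is Fault has a relatively good chance of being revived.

4. Remove the hot spares to avoid unnecessary parity or copy operations.

5. Revive the disks, then revive the LUN.

6. Check that services are running properly.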
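
Steps 2, 3, and 5 can be sketched as a small planning helper. This is only an illustration: the function names are invented, and the failure timestamps are made up, because in this case the logs did not reveal any order. The `revivedisklun -e <enclosure> -s <slot>` form is the one used in the transcript above. The script only prints the commands and does not connect to any device.

```python
from datetime import datetime


def revive_candidates(physical, logical):
    """Disks worth attempting: physical status Normal, logical status Fault."""
    return sorted(
        disk for disk, status in logical.items()
        if status == "Fault" and physical.get(disk) == "Normal"
    )


def revive_order(failure_times):
    """Order disks last-failed first, i.e. in reverse order of failure."""
    return sorted(failure_times, key=failure_times.get, reverse=True)


def revive_commands(order):
    """Build the developer-mode command for each (enclosure, slot)."""
    return [f"revivedisklun -e {enc} -s {slot}" for enc, slot in order]


# Statuses as shown by showdisk -physic and showdisk -logic in this case.
physical = {(1, 11): "Normal", (1, 12): "Normal",
            (1, 13): "Normal", (1, 14): "Normal"}
logical = {(1, 11): "Fault", (1, 12): "Fault",
           (1, 13): "Normal", (1, 14): "Fault"}

# Hypothetical failure times, for illustration only.
failure_times = {
    (1, 11): datetime(2015, 9, 1, 10, 0),
    (1, 12): datetime(2015, 9, 1, 10, 5),
    (1, 14): datetime(2015, 9, 1, 10, 2),
}

candidates = revive_candidates(physical, logical)
order = revive_order({d: failure_times[d] for d in candidates})
for cmd in revive_commands(order):
    print(cmd)
```

With these invented timestamps the disk in slot 12 failed last, so it is revived first, followed by slots 14 and 11. The LUN revive (`revivedisklun -lun 4` in this case) comes only after the RAID group reports Normal.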

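Recommendations and Summary

1. The equipment room should have multiple power feeds and a UPS matched to the equipment load that can supply power for about 3 hours.

2. When the equipment room loses power, stop services while the UPS is still supplying power and power off the equipment in the correct order. After power is restored, power on in the correct order.

3. Unexpected power loss readily causes hardware faults, data loss, file system corruption, and other serious consequences; administrators should take it seriously.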
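
Recommendations 1 and 2 involve some simple arithmetic: how much battery energy a given runtime requires, and how late an orderly shutdown can start. The following is a back-of-the-envelope sketch only. The load, efficiency, usable battery fraction, shutdown duration, and safety margin are all assumed figures, and real battery runtime is not linear in load, so the UPS vendor's runtime tables should take precedence.

```python
def required_battery_wh(load_w, hours, inverter_eff=0.9, usable_fraction=0.8):
    """Rough battery energy (Wh) needed to carry load_w watts for `hours`."""
    if not (0 < inverter_eff <= 1 and 0 < usable_fraction <= 1):
        raise ValueError("efficiency and usable fraction must be in (0, 1]")
    return load_w * hours / (inverter_eff * usable_fraction)


def latest_shutdown_start_min(runtime_min, shutdown_min, margin_min=15):
    """Minutes into the outage by which an orderly shutdown must begin."""
    return max(0, runtime_min - shutdown_min - margin_min)


# Assumed figures: 5 kW load, 3 hours of runtime, 40 minutes to shut down.
energy = required_battery_wh(5000, 3)
deadline = latest_shutdown_start_min(180, 40)
print(f"battery energy needed: about {energy:.0f} Wh")
print(f"start shutdown no later than {deadline} minutes into the outage")
```

In this case the air conditioning was not on the UPS, so the equipment overheated while it was still powered. Under those conditions the shutdown should start well before the deadline that battery capacity alone would suggest.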

END