Hard disk failure analysis for an S5600 storage system in a project

Publication Date: 2012-10-18
Issue Description
One disk in the S5600 (IP SAN storage) went offline: its indicator on the panel turned red and the event was recorded in the log. By the time a replacement hard disk was requested, the indicator of this disk had turned green again, and the management software (ISM) showed the disk status as normal.
Alarm Information
The hard disk indicator is red.
Handling Process
According to the analysis, the system kicked this disk out because it was faulty. The disk is not completely broken, however, which is why it alternates between online and offline. The hard disk therefore needs to be replaced.
Root Cause
Log analysis
1. A timer is created for every I/O that is issued. An I/O timeout triggers the timeout handler function.
Apr 21 18:10:29 OceanStor kernel: [2874747527]The 17th long time command record, the time is 18063ms @ [jif=2874747527] FCINI_osIOCompleted : 571
2. The FC initiator enters timeout error handling and wakes up the SCSI mid-layer error handling thread.
Apr 21 18:41:31 OceanStor kernel: [2876609803] @ [jif=2876609803] FCINI_eh_timed_out : 2238
Apr 21 18:41:31 OceanStor kernel: [2876609803]FCINI_eh_timed_out loop status is 1, cmnd=000001006c812300 disk id=46 cdb=0x4d
Apr 21 18:41:31 OceanStor kernel: [2876609803] @ [jif=2876609803] FCINI_eh_timed_out : 2253
Apr 21 18:41:31 OceanStor kernel: [2876609803]scsi_error_handler:1782 : scsi_eh_1 waking up, failed cmd=1(1)
3. The FC initiator is notified, and the SCSI mid-layer aborts the command.
Apr 21 18:41:31 OceanStor kernel: [2876609803]agtiapi_eh_AbortCmnd 000001006c812300 disk id=46 cmd timer=10000 ticks
Apr 21 18:41:31 OceanStor kernel: [2876609803] @ [jif=2876609803] agtiapi_eh_AbortCmnd : 2380
Apr 21 18:41:31 OceanStor kernel: [2876609803]Abort : found in busy queue
Apr 21 18:41:31 OceanStor kernel: [2876609803] @ [jif=2876609803] agtiapi_eh_AbortCmnd : 2391
4. After the SCSI mid-layer aborts the command successfully, it issues a TUR (Test Unit Ready) command to the corresponding disk to check whether the disk status is normal. If a TUR command fails, the disk is taken offline. In this failure, one TUR command was issued, and it failed.
Apr 21 18:42:01 OceanStor kernel: [2876639803]scsi_eh_tur:826 : Dev 1:46 Test Unit Ready Fail (Cmd 000001006c812300(4d), retries=0)
5. The SCSI mid-layer sets disk (1,5) offline.
Apr 21 18:42:01 OceanStor kernel: [2876639803]scsi: Device offlined - not ready after error recovery: host 1 channel 0 id 46 lun 0 .
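The sequence in the log excerpts can be reconstructed with a small script. The following is a minimal sketch in Python, not a vendor tool: the regular expression is derived only from the log lines quoted in this case, the stage labels in STAGES are hypothetical names chosen for illustration, and the year passed to the parser is an arbitrary placeholder because the log lines carry no year. The sketch also compares the jiffies counter in square brackets with the wall-clock timestamps, which gives a rough idea of the tick rate on this controller.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Log lines copied from the excerpts in this case (one entry per line).
LOG = """\
Apr 21 18:10:29 OceanStor kernel: [2874747527]The 17th long time command record, the time is 18063ms @ [jif=2874747527] FCINI_osIOCompleted : 571
Apr 21 18:41:31 OceanStor kernel: [2876609803]FCINI_eh_timed_out loop status is 1, cmnd=000001006c812300 disk id=46 cdb=0x4d
Apr 21 18:41:31 OceanStor kernel: [2876609803]agtiapi_eh_AbortCmnd 000001006c812300 disk id=46 cmd timer=10000 ticks
Apr 21 18:42:01 OceanStor kernel: [2876639803]scsi_eh_tur:826 : Dev 1:46 Test Unit Ready Fail (Cmd 000001006c812300(4d), retries=0)
Apr 21 18:42:01 OceanStor kernel: [2876639803]scsi: Device offlined - not ready after error recovery: host 1 channel 0 id 46 lun 0 .
"""

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <host> kernel: [<jiffies>]<message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: \[(?P<jif>\d+)\]\s*(?P<msg>.*)$"
)

# Hypothetical stage labels, keyed by a substring seen in the quoted log lines.
STAGES = [
    ("long time command", "slow I/O recorded"),
    ("FCINI_eh_timed_out", "initiator timeout handler"),
    ("agtiapi_eh_AbortCmnd", "command abort"),
    ("Test Unit Ready Fail", "TUR failed"),
    ("Device offlined", "disk taken offline"),
]


class Event(NamedTuple):
    when: datetime
    jiffies: int
    stage: str
    message: str


def classify(message):
    """Return the label of the first stage whose substring occurs in message."""
    for needle, label in STAGES:
        if needle in message:
            return label
    return "other"


def parse(text, year=2000):
    """Parse log text into events; `year` is a placeholder, the log has none."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append(Event(when, int(m["jif"]), classify(m["msg"]), m["msg"]))
    return events


def jiffies_per_second(a, b):
    """Estimate the tick rate from two events logged at different seconds."""
    seconds = (b.when - a.when).total_seconds()
    if seconds == 0:
        raise ValueError("events share the same timestamp")
    return (b.jiffies - a.jiffies) / seconds


if __name__ == "__main__":
    events = parse(LOG)
    for e in events:
        print(e.when.strftime("%H:%M:%S"), e.jiffies, e.stage)
    timeout = next(e for e in events if e.stage == "initiator timeout handler")
    tur = next(e for e in events if e.stage == "TUR failed")
    print("jiffies per second:", jiffies_per_second(timeout, tur))
```

For the lines above, the timeout handler entry and the TUR failure are 30000 jiffies and 30 seconds apart, which suggests a tick of about 1 ms. This is an inference from two log entries only, not a documented property of the system.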

Suggestions
Hard disks are fragile components. If a hard disk raises an alarm, it usually needs to be replaced. Even if the disk still appears usable, continuing to use it carries potential risk.
