S5900 disks isolated issue

Publication Date: 2014-12-23
Issue Description
Product: S5900
Version: V100R002C00
Symptom:
The customer reported receiving many disk isolated alarms on ISM.
Alarm Information
ISM shows the disk isolated alarms and an FC expansion down alarm.


Handling Process
1: We asked the customer to collect logs from the storage system.

2: In the event log, we found many bit error events for the FC expansion port (B0, P2), as shown below:
2014-09-19 17:05:56    0x100e02020007    Infor    None    The bit errors of the FC expansion port (controller 0, FC expansion module B0, port P2) are excessive. Therefore, the device performance may be weakened.    None.
2014-09-19 17:20:21    0x100e02020007    Infor    None    The bit errors of the FC expansion port (controller 0, FC expansion module B0, port P2) are excessive. Therefore, the device performance may be weakened.    None.
2014-09-19 21:26:42    0x100e02020007    Infor    None    The bit errors of the FC expansion port (controller 0, FC expansion module B0, port P2) are excessive. Therefore, the device performance may be weakened.    None.
2014-10-26 03:04:03    0x100e02020007    Infor    None    The bit errors of the FC expansion port (controller 0, FC expansion module B0, port P2) are excessive. Therefore, the device performance may be weakened.    None.
2014-11-21 22:38:37    0x100e02020007    Infor    None    The bit errors of the FC expansion port (controller 0, FC expansion module B0, port P2) are excessive. Therefore, the device performance may be weakened.    None.
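
When the event log is long, counting these events per port makes the noisy port obvious. The following is a minimal Python sketch, assuming the event log has been exported as plain text with the same message wording as the lines above. The function name `count_bit_errors` and the regular expression are illustrative and not part of any product tool.

```python
import re
from collections import Counter

# Matches only the description text quoted in the events above.
# The column layout of a real export may differ, so it is not parsed here.
BIT_ERROR_RE = re.compile(
    r"bit errors of the FC expansion port \(controller (\d+), "
    r"FC expansion module (\w+), port (\w+)\)"
)

def count_bit_errors(lines):
    """Count bit error events per (controller, module, port)."""
    counts = Counter()
    for line in lines:
        match = BIT_ERROR_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts

sample_events = [
    "2014-09-19 17:05:56    0x100e02020007    Infor    None    "
    "The bit errors of the FC expansion port (controller 0, FC expansion "
    "module B0, port P2) are excessive. Therefore, the device performance "
    "may be weakened.    None.",
    "2014-09-19 17:20:21    0x100e02020007    Infor    None    "
    "The bit errors of the FC expansion port (controller 0, FC expansion "
    "module B0, port P2) are excessive. Therefore, the device performance "
    "may be weakened.    None.",
]

bit_error_counts = count_bit_errors(sample_events)
for (controller, module, port), count in bit_error_counts.items():
    print(f"controller {controller}, module {module}, port {port}: {count}")
```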

3: The event log also contains many disk isolated alarms, such as:
2014-09-19 18:22:11    0xe0209000f    Major    None    The hard disk ([1] [3], controller [1], slot [12], serial number [--]) is isolated.   
2014-09-19 21:20:43    0xe0209000f    Major    None    The hard disk ([1] [3], controller [1], slot [16], serial number [--]) is isolated.
2014-09-19 21:26:30    0xe0209000f    Major    None    The hard disk ([1] [3], controller [1], slot [10], serial number [--]) is isolated.
2014-09-19 21:26:38    0xe0209000f    Major    None    The hard disk ([1] [3], controller [1], slot [4], serial number [--]) is isolated.
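
To list the affected slots quickly, the alarms can be extracted with a short script. This is a sketch under the assumption that the alarm text matches the wording above. `isolated_disks` is a hypothetical helper name, and the first two bracketed values are skipped because the log does not say what they mean. `re.findall` is run over the whole text so that several alarms on one physical line are still picked up.

```python
import re

# Timestamp, event ID, level, one more column, then the alarm description.
# The two bracketed values before "controller" are matched but not captured.
ALARM_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+0x[0-9a-fA-F]+\s+\w+\s+\w+\s+"
    r"The hard disk \(\[\d+\] \[\d+\], controller \[(\d+)\], slot \[(\d+)\]"
)

def isolated_disks(text):
    """Return (timestamp, controller, slot) for each disk isolated alarm."""
    return [
        (timestamp, int(controller), int(slot))
        for timestamp, controller, slot in ALARM_RE.findall(text)
    ]

sample_alarms = (
    "2014-09-19 18:22:11    0xe0209000f    Major    None    The hard disk "
    "([1] [3], controller [1], slot [12], serial number [--]) is isolated.    "
    "2014-09-19 21:20:43    0xe0209000f    Major    None    The hard disk "
    "([1] [3], controller [1], slot [16], serial number [--]) is isolated."
)

alarms = isolated_disks(sample_alarms)
for timestamp, controller, slot in alarms:
    print(timestamp, "controller", controller, "slot", slot)
```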

4: The message log contains many entries like the ones below. link[0:1] means the disk is working on a single path through controller A; its path to controller B has been isolated.

4) (3,5) state 0, link[0:1], wwn 2000b452536fd98e, fwwn 23000022a109f7eb, loop id 2
6) (3,8) state 0, link[0:1], wwn 2000b452536fd9d3, fwwn 23000022a109f7eb, loop id 2
8) (3,12) state 0, link[0:1], wwn 2000b452536fd9fb, fwwn 23000022a109f7eb, loop id 2
9) (3,16) state 0, link[0:1], wwn 2000b452536fda3c, fwwn 23000022a109f7eb, loop id 2
10) (3,9) state 0, link[0:1], wwn 2000b452536fda40, fwwn 23000022a109f7eb, loop id 2
16) (3,17) state 0, link[0:1], wwn 2000b452536fdb88, fwwn 23000022a109f7eb, loop id 2
19) (3,0) state 0, link[0:1], wwn 2000b452536fdc75, fwwn 23000022a109f7eb, loop id 2
22) (3,13) state 0, link[0:1], wwn 2000b452536fdf5f, fwwn 23000022a109f7eb, loop id 2
23) (3,6) state 0, link[0:1], wwn 2000b452536fdf97, fwwn 23000022a109f7eb, loop id 2
31) (3,1) state 0, link[0:1], wwn 2000b452536fe36c, fwwn 23000022a109f7eb, loop id 2
33) (3,14) state 0, link[0:1], wwn 2000b452536fe76f, fwwn 23000022a109f7eb, loop id 2
35) (3,2) state 0, link[0:1], wwn 2000b452536fe7ca, fwwn 23000022a109f7eb, loop id 2
36) (3,4) state 0, link[0:1], wwn 2000b452536fe7e2, fwwn 23000022a109f7eb, loop id 2
38) (3,10) state 0, link[0:1], wwn 2000b45253728266, fwwn 23000022a109f7eb, loop id 2
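
Grouping these entries shows at a glance which disks are on a single path. The sketch below assumes that the pair in parentheses is (enclosure, slot). This is inferred from the analysis in this case, where the slots match the isolated disks in enclosure 3, and is not documented in the log itself. `single_path_disks` is an illustrative name, and only the link[0:1] state described above is flagged.

```python
import re
from collections import defaultdict

# Assumed layout: "(enclosure,slot) state N, link[x:y], wwn ..."
LINK_RE = re.compile(r"\((\d+),(\d+)\) state \d+, link\[(\d+):(\d+)\]")

def single_path_disks(lines):
    """Map enclosure -> sorted slots whose entry shows link[0:1]."""
    by_enclosure = defaultdict(list)
    for line in lines:
        match = LINK_RE.search(line)
        if match is None:
            continue
        enclosure, slot, link_first, link_second = match.groups()
        if (link_first, link_second) == ("0", "1"):
            by_enclosure[int(enclosure)].append(int(slot))
    return {enc: sorted(slots) for enc, slots in by_enclosure.items()}

sample_links = [
    "4) (3,5) state 0, link[0:1], wwn 2000b452536fd98e, "
    "fwwn 23000022a109f7eb, loop id 2",
    "6) (3,8) state 0, link[0:1], wwn 2000b452536fd9d3, "
    "fwwn 23000022a109f7eb, loop id 2",
    "8) (3,12) state 0, link[0:1], wwn 2000b452536fd9fb, "
    "fwwn 23000022a109f7eb, loop id 2",
]

single_path = single_path_disks(sample_links)
print(single_path)
```

If every flagged slot falls in the same enclosure, as it does here, the shared link to that enclosure is a more likely suspect than the individual disks.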
 
Root Cause
A poor-quality FC optical module or fiber cable can cause bit errors.
Each disk enclosure has two FC links to the controller enclosure: one link to controller A and one link to controller B.
If an FC module produces bit errors, the excessive bit errors spread through the FC loop and the link is eventually isolated, so all disks in the enclosure lose their path to the controller on that link.

The logs show that port B0 P2 is connected to disk enclosure 3 and reported many bit errors. The errors spread over the link between B0 P2 and disk enclosure 3, so all disks in disk enclosure 3 were isolated from controller B.
Solution
Replace both optical modules on the affected fiber link, as well as the fiber cable between the controller and the disk enclosure.
Note: The replacement can be done online, but wait at least 30 seconds after each step before starting the next operation. For example, unplug the optical module, wait at least 30 seconds, and then insert the new one.
Suggestions
None.

END