Product model OceanStor Dorado2100
Product version V100R001C00SPC003
Service mode FalconStor NSS
Symptom Controller A of the OceanStor Dorado2100 at Spiegel reset by itself. The root cause of the reboot was a memory-related Machine Check Exception (MCE). The system recovered automatically when Controller A powered back on successfully about 4 minutes later.
The Huawei storage logs were checked, with the following findings.
1. Sequence of events during the reboot
17:06:43 Controller B failed to send a heartbeat message.
17:06:58 Controller A was confirmed lost.
17:07:06 Controller B finished handling the Controller A offline fault, and the system entered single-controller mode. Handling the Controller A offline fault took 19 s in total.
17:10:26 The system entered dual-controller mode again, and Controller B began to handle the re-powering of Controller A.
17:10:28 Controller A powered on successfully and the system entered Active-Active mode again.
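The intervals between the events above can be checked with a short script. This is a minimal sketch that assumes all five timestamps come from the same controller clock on the same day; the names EVENTS, seconds_between and intervals are illustrative only and are not part of any Huawei tool.

```python
from datetime import datetime

# Timestamps copied from the event list above (same clock and same day assumed).
EVENTS = [
    ("17:06:43", "Controller B fails to send heartbeat"),
    ("17:06:58", "Controller A confirmed lost"),
    ("17:07:06", "Controller A offline handled, single-controller mode"),
    ("17:10:26", "Dual-controller mode, re-power handling begins"),
    ("17:10:28", "Controller A back, Active-Active mode"),
]


def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds from start to end (HH:MM:SS, same day)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def intervals(events):
    """Return (from_event, to_event, seconds) for each consecutive pair."""
    return [
        (prev[1], cur[1], seconds_between(prev[0], cur[0]))
        for prev, cur in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for src, dst, secs in intervals(EVENTS):
        print(f"{secs:>4d} s  {src} -> {dst}")
    print("total:", seconds_between(EVENTS[0][0], EVENTS[-1][0]), "s")
```

Under these assumptions the system ran in single-controller mode for 200 s, and 225 s passed between the first heartbeat failure and the return to Active-Active mode, which is consistent with the recovery time of about 4 minutes. The 19 s handling time cannot be derived from these five timestamps alone, so it is presumably measured from an event not listed here.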
2. Controller B began to handle the Controller A offline fault:
Controller B handled the Controller A offline fault successfully.
Controller B began to handle the re-powering of Controller A.
Controller A powered on successfully, and the system recovered to Double Controllers Normal mode.
3. Analysis of the Service Interruption
a) On FalconStor NSS, each LUN has two paths:
b) When Controller A was reset, a connect timeout occurred and NSS did not retry the command:
21:46:55 ipstor7-rz08 kernel: IOCORE1 SCSI_ERROR: [COMMAND = Write10] [SCSI_ADDR = 102 0 10 1] [HOST_STAT = Connect timeout] [DRIVER_STAT = 00] [TARGET_STAT = No error] [SENSE = No data] [ASC.ASCQ = No code]
The I/O was paused immediately:
21:46:55 ipstor7-rz08 iocore: [fsnalias.c:2896:alias_check()] alias_check:pause I/O to physical device guid ac190fb7-0000-2e75-2b24-bc394ec40561 because of failed path 102:0:10:1
c) In fact, the other path was healthy, so the I/O should have been retried on it. Instead, NSS only attempted a failover (changing the LUN’s working controller). The failover failed because one controller had just faulted and the storage system needed some time to handle the exception. As a result, the healthy path 100-0-5-1 was set to disabled:
d) On Dorado2100, we could see that the failover failed:
21:47:28 ipstor7-rz08 iocore: [fsnalias.c:3082:SwapGroup()] Failed to swap Group ,mark this group failed
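When reviewing many NSS log lines of this kind, it can help to split the SCSI_ERROR message into its fields and match the failed address against the path named in the pause message. The following is a minimal sketch based only on the log line quoted in b). The field layout is inferred from that single line, and the helper names (parse_scsi_error, path_id, is_path_timeout) are hypothetical and not part of any FalconStor tool.

```python
import re

# Log line quoted in b) above, copied verbatim.
SCSI_ERROR_LINE = (
    "21:46:55 ipstor7-rz08 kernel: IOCORE1 SCSI_ERROR: [COMMAND = Write10] "
    "[SCSI_ADDR = 102 0 10 1] [HOST_STAT = Connect timeout] [DRIVER_STAT = 00] "
    "[TARGET_STAT = No error] [SENSE = No data] [ASC.ASCQ = No code]"
)

# Matches one "[KEY = value]" group. Format inferred from the sample line only.
FIELD_RE = re.compile(r"\[([A-Z_.]+) = ([^\]]*)\]")


def parse_scsi_error(line: str) -> dict:
    """Return the timestamp, host and bracketed KEY = value fields of one line."""
    timestamp, host, _rest = line.split(" ", 2)
    fields = dict(FIELD_RE.findall(line))
    fields["timestamp"] = timestamp
    fields["host"] = host
    return fields


def path_id(fields: dict) -> str:
    """Convert SCSI_ADDR '102 0 10 1' to the '102:0:10:1' form of the pause message."""
    return ":".join(fields["SCSI_ADDR"].split())


def is_path_timeout(fields: dict) -> bool:
    """Assumed heuristic: host-side connect timeout while the target reports no error."""
    return (
        fields.get("HOST_STAT") == "Connect timeout"
        and fields.get("TARGET_STAT") == "No error"
    )


if __name__ == "__main__":
    parsed = parse_scsi_error(SCSI_ERROR_LINE)
    print(parsed["timestamp"], parsed["COMMAND"], path_id(parsed), is_path_timeout(parsed))
```

For the quoted line, the parsed address corresponds to the failed path 102:0:10:1 reported by alias_check(), which ties the Write10 timeout to the paused I/O.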
4. So the failure sequence should be
Upgrade the firmware once a year.
Check Huawei bulletins periodically.