Product model OceanStor Dorado2100
Product version V100R001C00SPC003
Service mode FalconStor NSS
Symptom Controller A of the OceanStor Dorado2100 in Spiegel reset by itself. The root cause of the reboot was a memory-related Machine Check Exception (MCE). The system recovered automatically once Controller A powered on again successfully, about 4 minutes later.
The database detected this reboot.
Checking the Huawei storage logs:
1. Sequence of events during the reboot
17:06:43 Controller B failed to send a heartbeat message.
17:06:58 Controller A was confirmed lost.
17:07:06 Controller B successfully handled the Controller A offline fault, and the system entered single-controller mode. Handling the Controller A offline event took 19 s in total.
17:10:26 The system entered dual-controller mode again, and Controller B began to handle the re-powering of Controller A.
17:10:28 Controller A was re-powered on successfully, and the system entered Active-Active mode again.
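The gaps between the events in the timeline above can be checked with a short script. This is only a sketch: it uses the second-resolution times exactly as listed, assumes all five events fall on the same day, and the function names are illustrative. It does not try to reproduce the 19 s handling time, which comes from the storage logs rather than from these five timestamps.

```python
from datetime import datetime

# Event times copied from the timeline above (storage-side clock, same day assumed).
EVENTS = [
    ("17:06:43", "Controller B heartbeat message failed"),
    ("17:06:58", "Controller A confirmed lost"),
    ("17:07:06", "Controller A offline handled; single-controller mode"),
    ("17:10:26", "Dual-controller mode again; re-power handling began"),
    ("17:10:28", "Controller A re-powered; Active-Active mode again"),
]


def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds from start to end (both HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def intervals(events):
    """Return (label, seconds since previous event) for every event but the first."""
    return [
        (events[i][1], seconds_between(events[i - 1][0], events[i][0]))
        for i in range(1, len(events))
    ]


if __name__ == "__main__":
    for label, secs in intervals(EVENTS):
        print(f"+{secs:>3}s  {label}")
    total = seconds_between(EVENTS[0][0], EVENTS[-1][0])
    print(f"total: {total}s ({total // 60}m{total % 60:02d}s)")
```

The total from the first heartbeat failure to the return to Active-Active mode is 225 s (3 min 45 s), which matches the "about 4 minutes" stated in the symptom.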
2. Controller B began to handle the Controller A offline fault:
Controller B handled the Controller A offline fault successfully.
Controller B began to handle the re-powering of Controller A.
Controller A was re-powered on successfully, and the system recovered to normal dual-controller mode.
3. Analysis of the Service Interruption
a) On FalconStor NSS, each LUN has two paths:
b) When Controller A was reset, a connect timeout occurred and NSS did not retry the command:
21:46:55 ipstor7-rz08 kernel: IOCORE1 SCSI_ERROR: [COMMAND = Write10] [SCSI_ADDR = 102 0 10 1] [HOST_STAT = Connect timeout] [DRIVER_STAT = 00] [TARGET_STAT = No error] [SENSE = No data] [ASC.ASCQ = No code]
The I/O was paused immediately:
21:46:55 ipstor7-rz08 iocore: [fsnalias.c:2896:alias_check()] alias_check:pause I/O to physical device guid ac190fb7-0000-2e75-2b24-bc394ec40561 because of failed path 102:0:10:1
c) In fact, the other path was healthy and the I/O should have been retried on it. Instead, NSS only attempted a failover (changing the LUN's working controller), but the failover failed because one controller had just faulted and the storage system needed some time to handle the exception. The healthy path 100-0-5-1 was then set to disabled:
d) On the Dorado2100, we could see that the failover failed:
21:47:28 ipstor7-rz08 iocore: [fsnalias.c:3082:SwapGroup()] Failed to swap Group ,mark this group failed
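When reviewing many such NSS log lines, it can help to split the bracketed fields of a SCSI_ERROR entry into key/value pairs. The following is a minimal sketch based only on the single sample line quoted in step b); the regular expression and the helper names are assumptions for illustration, not a FalconStor tool or a documented log format.

```python
import re

# Matches "[KEY = value]" pairs as they appear in the SCSI_ERROR line quoted above.
FIELD_RE = re.compile(r"\[([A-Z_.]+) = ([^\]]*)\]")

# Sample line copied from the analysis above.
SAMPLE = (
    "21:46:55 ipstor7-rz08 kernel: IOCORE1 SCSI_ERROR: "
    "[COMMAND = Write10] [SCSI_ADDR = 102 0 10 1] "
    "[HOST_STAT = Connect timeout] [DRIVER_STAT = 00] "
    "[TARGET_STAT = No error] [SENSE = No data] [ASC.ASCQ = No code]"
)


def parse_scsi_error(line: str) -> dict:
    """Return the bracketed fields of one SCSI_ERROR line as a dict."""
    return dict(FIELD_RE.findall(line))


def path_of(fields: dict) -> str:
    """Convert 'SCSI_ADDR = 102 0 10 1' into the '102:0:10:1' form used by alias_check."""
    return ":".join(fields["SCSI_ADDR"].split())


def is_connect_timeout(fields: dict) -> bool:
    """True when the host side reported a connect timeout for this command."""
    return fields.get("HOST_STAT") == "Connect timeout"


if __name__ == "__main__":
    fields = parse_scsi_error(SAMPLE)
    print(path_of(fields), fields["COMMAND"], is_connect_timeout(fields))
```

For the sample line, the parsed path is 102:0:10:1, the same path that alias_check reports as failed in the following log entry.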
4. So the failure sequence should be:
In log_debug_201XXXXX171007_poweron.txt, we can see that the cause of the reboot was an MCE.
A Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit detects a hardware problem (reference: http://en.wikipedia.org/wiki/Machine-check_exception).
1. The solution to the Machine Check Exception (MCE) is to replace Controller A with a new one.
2. We suggest that FalconStor NSS retry I/O via the healthy paths when the active path hits a connect timeout, and that this retry be handled separately from the failover command; it is not necessary to pause I/O after a failover fails.
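The behaviour suggested in item 2 can be sketched as follows. This is a hypothetical illustration of the retry policy only, not FalconStor NSS code: the Path class, the ConnectTimeout exception, and the send callback are all invented for the example, and the two path addresses are the ones from the analysis above (the healthy path is written 100-0-5-1 there).

```python
from dataclasses import dataclass
from typing import Callable, List


class ConnectTimeout(Exception):
    """Hypothetical error raised when a path does not answer in time."""


@dataclass
class Path:
    addr: str             # e.g. "102:0:10:1"
    healthy: bool = True  # cleared when the path times out


def submit_with_path_retry(paths: List[Path], send: Callable[[Path], str]) -> str:
    """Try the I/O on each healthy path in order, independently of any failover.

    A path that times out is marked unhealthy and the same I/O is retried on
    the next healthy path. I/O is given up (paused) only when no path is left.
    """
    last_error = None
    for path in paths:
        if not path.healthy:
            continue
        try:
            return send(path)
        except ConnectTimeout as exc:
            path.healthy = False
            last_error = exc
    raise RuntimeError("no healthy path left; pause I/O") from last_error


def demo_send(path: Path) -> str:
    """Stand-in for the real I/O: the path through the reset controller times out."""
    if path.addr == "102:0:10:1":
        raise ConnectTimeout(path.addr)
    return f"ok via {path.addr}"


if __name__ == "__main__":
    demo_paths = [Path("102:0:10:1"), Path("100:0:5:1")]
    print(submit_with_path_retry(demo_paths, demo_send))
```

In this sketch the timeout on 102:0:10:1 only disables that path, and the write completes on the second path, so the outcome of a later failover attempt does not decide whether I/O continues.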
Upgrade the firmware every year.
Check Huawei bulletins periodically.