Faulty symptom: VMware VMs crashed after the customer pulled out one of the disks on a 5500V3 HyperMetro array. Windows VMs showed a blue screen; meanwhile, filesystems on Linux VMs became read-only.
Fault analysis:
1. Analyzing the vm-support log, we found error entries in /var/run/log/vmkernel.log.
2017-07-27T02:49:17.505Z cpu30:32859)NMP: nmp_ThrottleLogForDevice:2458: Cmd 0x1a (0x4136cb26b040, 0) to dev "mpx.vmhba39:C0:T0:L0" on path "vmhba39:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2. According to VMware KB 289902, device status D:0x2 is CHECK CONDITION, and the sense data 0x5 0x20 0x0 decodes to ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE. So the storage rejected the IO request instead of servicing it.
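The vmkernel line packs the result into fixed fields (H: host status, D: device/SCSI status, P: plugin status, then sense key/ASC/ASCQ). A minimal sketch of pulling those fields out of such a line for decoding; the lookup here covers only the values seen in this case:

```shell
# Decode the failing command's status fields from the vmkernel line above.
# D:0x2 is CHECK CONDITION; sense 0x5/0x20/0x0 is ILLEGAL REQUEST /
# INVALID COMMAND OPERATION CODE, i.e. the array rejected the command.
line='Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE'
sense=$(printf '%s\n' "$line" |
  sed -n 's/.*Valid sense data: \(0x[0-9a-f]*\) \(0x[0-9a-f]*\) \(0x[0-9a-f]*\).*/\1 \2 \3/p')
set -- $sense
key=$1 asc=$2 ascq=$3
case $key in
  0x5) echo "sense key ILLEGAL REQUEST, asc=$asc ascq=$ascq" ;;
  *)   echo "sense key $key, asc=$asc ascq=$ascq" ;;
esac
```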
3. Checking the storage log, we found that one of the disks had been removed and the disk domain had degraded before the issue happened. The customer confirmed that they had performed a disk pull-out test.
206747 2017-07-27 15:03:07 DST 0x100F00A000F Event Informational -- None The disk (controller enclosure CTE0, slot ID 7) is inserted.
206425 2017-07-27 14:39:11 DST 0x1000010A0009 Event Informational -- None The reconstruction of the disk (disk enclosure CTE0, slot ID 7, type SSD) in the disk domain (ID 0, name DD001.Sala1) started. Do not remove other member disks in the disk domain.
206404 2017-07-27 14:37:11 DST 0x10A0001 Fault Major Recovered 2017-07-27 15:03:07 DST Disk domain (name DD001.Sala1, ID 0) degrades.
206403 2017-07-27 14:37:11 DST 0x10A0008 Fault Major Recovered 2017-07-27 15:03:07 DST Disk (disk enclosure CTE0, slot ID 7, type SSD, serial number 2102350VVM10H4000061) is removed from disk domain (ID 0, name DD001.Sala1).
4. At the same time, the HyperMetro (REPSVC) logs show that LUN IO was blocked and the pairs entered a double split:
[2017-07-27 15:03:08][1823877.351129] [15000027c0006][INFO][Hang or blk relate LUN io, (LUN ID 34, action name SVC_DS_OPERATION_BLOCK, ret 0).][REPSVC][hangLunCallback,4685][TP_SysCtrlTPool]
[2017-07-27 15:03:08][1823877.351241] [15000027c0006][INFO][Hang or blk relate LUN io, (LUN ID 36, action name SVC_DS_OPERATION_BLOCK, ret 0).][REPSVC][hangLunCallback,4685][TP_SysCtrlTPool]
[2017-07-27 15:03:08][1823877.352270] [15000027c0006][INFO][Update DB for double split, taskcode(REP_SVC_OPCODE_LUN_DOUBLE_SPLIT), action(2).][REPSVC][chgRepLunSvcSplit,11320][CSD_1]
5. However, we could not find any path failover event in the storage log. The storage only reported Unit Attention sense to host commands, which prompts the host to re-check LUN status.
[2017-07-27 15:03:08][1823877.363159] [15000000e0019][INFO][CHECK(Opcode:0x2a Host:0xc0bfc082aa1a0003 Hostlun4 Devlun23):LUN Unit Attention, SenseCode 0x3f0e.][SCSI][scsiGetSessAluUACode,895][CSD_12]
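The storage log reports the Unit Attention as one combined sense code, 0x3f0e. Splitting it into ASC/ASCQ shows what the array was telling the host:

```shell
# 0x3f0e splits into ASC 0x3f / ASCQ 0x0e, which is
# "REPORTED LUNS DATA HAS CHANGED" in SPC: the array only asks the
# host to re-check LUN state; it does not fail paths over itself.
code=$((0x3f0e))
asc=$(printf '0x%02x' $(( (code >> 8) & 0xff )))
ascq=$(printf '0x%02x' $(( code & 0xff )))
echo "ASC=$asc ASCQ=$ascq"
```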
6. Next, we checked the event log again and found that the customer did not change the ALUA mode when creating the iSCSI initiator, even though Huawei UltraPath was not installed on the VMware ESXi host.
2017-07-06 19:04:45 DST 0x200F00150025 Event Informational -- None admin:220.127.116.11 succeeded in adding the initiator (type iSCSI, identifier iqn.1998-01.com.vmware:esxlev151-1cad6947) to host (host ID 0).
Root cause:
1. The customer uses VMware native NMP as the multipathing software but did not configure the ALUA mode as required in the product documentation (HUAWEI SAN Storage Host Connectivity Guide for VMware ESXi Servers).
2. Because the customer did not select "Uses third-party multipath software" when creating the iSCSI initiator, the storage assumes that Huawei UltraPath is installed on the host. During a HyperMetro switchover, the storage sends a private command to the host multipathing software that can be recognized only by Huawei UltraPath, not by VMware NMP. As a result, NMP could not fail over to paths on the remote site, considered all paths down, and service was interrupted.
Solution:
Change the iSCSI initiator settings in DeviceManager according to the current HyperMetro configuration and storage software version. The customer needs to enable "Uses third-party multipath software", set "Switchover Mode" to "common ALUA", and set "Special mode" to "Mode 2".
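On the host side, the connectivity guide's recommendation when UltraPath is not used amounts to claiming Huawei LUNs with VMware's ALUA SATP. A sketch of the kind of claim rule involved, assuming the vendor string HUAWEI and round-robin path selection; the exact rule for a given array model and ESXi version must be taken from the guide itself:

```shell
# Add an NMP claim rule so Huawei devices are handled by the ALUA SATP.
# Vendor string and PSP choice here are assumptions; consult the
# HUAWEI SAN Storage Host Connectivity Guide for the exact rule.
esxcli storage nmp satp rule add \
  --satp=VMW_SATP_ALUA \
  --vendor=HUAWEI \
  --claim-option=tpgs_on \
  --psp=VMW_PSP_RR

# After reclaiming the devices (or rebooting), verify each LUN is
# claimed by VMW_SATP_ALUA and shows paths to both HyperMetro sites.
esxcli storage nmp device list
```

The rule only affects devices claimed after it is added, so existing devices need to be reclaimed or the host rebooted for it to take effect.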
Suggestion:
Follow the HyperMetro documentation or the host connectivity guide to configure the iSCSI initiator mode when using third-party multipathing software.