VMware VMs crashed after a disk failure on a 5500V3 HyperMetro solution

Publication Date: 2017-07-31
Issue Description

Fault symptom: VMware VMs crashed after the customer pulled out one of the disks on a 5500V3 HyperMetro system. Windows VMs showed a blue screen, and the file systems on Linux VMs became read-only.

Version information: V300R003C20SPC200+SPH203

 

Handling Process

1. Analyzing the vm-support logs, we found the following error in /var/run/log/vmkernel.log.

2017-07-27T02:49:17.505Z cpu30:32859)NMP: nmp_ThrottleLogForDevice:2458: Cmd 0x1a (0x4136cb26b040, 0) to dev "mpx.vmhba39:C0:T0:L0" on path "vmhba39:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
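To read the status fields out of such a line without counting characters by hand, the entry can be split with a regular expression. The following is a minimal sketch that assumes the message layout shown above; the function name `parse_nmp_failure` and the field names are illustrative and not part of any VMware tool. It only extracts the values and does not interpret them.

```python
import re

# Matches only the layout of the NMP message shown above; other vmkernel
# message formats are not handled by this sketch.
NMP_RE = re.compile(
    r'Cmd (?P<cmd>0x[0-9a-fA-F]+) .*?'
    r'to dev "(?P<dev>[^"]+)" on path "(?P<path>[^"]+)" Failed: '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<device>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+) '
    r'Valid sense data: (?P<key>0x[0-9a-fA-F]+) '
    r'(?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)

HEX_FIELDS = ("cmd", "host", "device", "plugin", "key", "asc", "ascq")


def parse_nmp_failure(line):
    """Return the command, device, path, H/D/P status and sense data, or None."""
    match = NMP_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in HEX_FIELDS:
        fields[name] = int(fields[name], 16)  # hex string -> integer
    return fields


SAMPLE = (
    '2017-07-27T02:49:17.505Z cpu30:32859)NMP: nmp_ThrottleLogForDevice:2458: '
    'Cmd 0x1a (0x4136cb26b040, 0) to dev "mpx.vmhba39:C0:T0:L0" on path '
    '"vmhba39:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 '
    'Valid sense data: 0x5 0x20 0x0. Act:NONE'
)

if __name__ == "__main__":
    print(parse_nmp_failure(SAMPLE))
```

Running the parser over the whole vmkernel.log and grouping the results by `dev` or `path` makes it easier to see which devices and paths reported failures.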

2. According to VMware KB 289902, device status 0x2 means "BUS stayed busy through time out period". We therefore concluded that the storage did not respond to the I/O request.

https://kb.vmware.com/s/article/289902?r=2&Quarterback.validateRoute=1&KM_Utility.getArticleData=1&KM_Utility.getGUser=1&KM_Utility.getArticleLanguage=2&KM_Utility.getArticle=1

3. Checking the storage log, we found that one of the disks had been removed and the disk domain had degraded before the issue occurred. The customer confirmed that they had performed a disk pull-out test.

206747    2017-07-27 15:03:07 DST    0x100F00A000F    Event    Informational    --    None    The disk (controller enclosure CTE0, slot ID 7) is inserted.
206425    2017-07-27 14:39:11 DST    0x1000010A0009    Event    Informational    --    None    The reconstruction of the disk (disk enclosure CTE0, slot ID 7, type SSD) in the disk domain (ID 0, name DD001.Sala1) started. Do not remove other member disks in the disk domain.
206404    2017-07-27 14:37:11 DST    0x10A0001    Fault    Major    Recovered    2017-07-27 15:03:07 DST    Disk domain (name DD001.Sala1, ID 0) degrades.
206403    2017-07-27 14:37:11 DST    0x10A0008    Fault    Major    Recovered    2017-07-27 15:03:07 DST    Disk (disk enclosure CTE0, slot ID 7, type SSD, serial number 2102350VVM10H4000061) is removed from disk domain (ID 0, name DD001.Sala1).
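The export lists the events newest first, so it helps to sort them chronologically before reading them. The sketch below assumes that columns are separated by two or more whitespace characters, as in the excerpt above, and that the eighth column is the free-text description; the function names are illustrative only.

```python
import re
from datetime import datetime

EVENTS = """\
206747    2017-07-27 15:03:07 DST    0x100F00A000F    Event    Informational    --    None    The disk (controller enclosure CTE0, slot ID 7) is inserted.
206425    2017-07-27 14:39:11 DST    0x1000010A0009    Event    Informational    --    None    The reconstruction of the disk (disk enclosure CTE0, slot ID 7, type SSD) in the disk domain (ID 0, name DD001.Sala1) started. Do not remove other member disks in the disk domain.
206404    2017-07-27 14:37:11 DST    0x10A0001    Fault    Major    Recovered    2017-07-27 15:03:07 DST    Disk domain (name DD001.Sala1, ID 0) degrades.
206403    2017-07-27 14:37:11 DST    0x10A0008    Fault    Major    Recovered    2017-07-27 15:03:07 DST    Disk (disk enclosure CTE0, slot ID 7, type SSD, serial number 2102350VVM10H4000061) is removed from disk domain (ID 0, name DD001.Sala1).
"""


def to_time(stamp):
    """Parse '2017-07-27 14:37:11 DST', ignoring the trailing zone label."""
    return datetime.strptime(stamp.rsplit(" ", 1)[0], "%Y-%m-%d %H:%M:%S")


def parse_event(line):
    """Split one event line into its eight columns."""
    seq, when, event_id, kind, level, status, recovered, text = re.split(
        r"\s{2,}", line.strip(), maxsplit=7
    )
    return {
        "seq": int(seq),
        "time": to_time(when),
        "id": event_id,
        "type": kind,
        "level": level,
        # Only recovered faults carry a recovery timestamp.
        "recovered": to_time(recovered) if status == "Recovered" else None,
        "text": text,
    }


def timeline(raw):
    """Return the events oldest first (ties broken by sequence number)."""
    events = [parse_event(line) for line in raw.splitlines() if line.strip()]
    return sorted(events, key=lambda e: (e["time"], e["seq"]))


if __name__ == "__main__":
    for event in timeline(EVENTS):
        print(event["time"], event["seq"], event["text"])
```

Sorted this way, the sequence reads: disk removed, disk domain degraded, reconstruction started, disk reinserted.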

4. At the same time, the storage log shows that the storage system hung I/O on the LUNs and switched HyperMetro over to the remote site.

[2017-07-27 15:03:08][1823877.351129] [][15000027c0006][INFO][Hang or blk relate LUN io, (LUN ID 34, action name SVC_DS_OPERATION_BLOCK, ret 0).][REPSVC][hangLunCallback,4685][TP_SysCtrlTPool]
[2017-07-27 15:03:08][1823877.351241] [][15000027c0006][INFO][Hang or blk relate LUN io, (LUN ID 36, action name SVC_DS_OPERATION_BLOCK, ret 0).][REPSVC][hangLunCallback,4685][TP_SysCtrlTPool]
[2017-07-27 15:03:08][1823877.352270] [][15000027c0006][INFO][Update DB for double split, taskcode(REP_SVC_OPCODE_LUN_DOUBLE_SPLIT), action(2).][REPSVC][chgRepLunSvcSplit,11320][CSD_1]
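When the log contains many such entries, the affected LUN IDs can be collected with a short script. This is a minimal sketch that assumes the "Hang or blk relate LUN io" message format shown above; `hung_luns` is an illustrative name, not a Huawei tool.

```python
import re

# Matches the parenthesised detail of the REPSVC "hang LUN" message above.
HANG_RE = re.compile(
    r"Hang or blk relate LUN io, "
    r"\(LUN ID (?P<lun>\d+), action name (?P<action>\w+), ret (?P<ret>\d+)\)"
)

LOG = """\
[2017-07-27 15:03:08][1823877.351129] [][15000027c0006][INFO][Hang or blk relate LUN io, (LUN ID 34, action name SVC_DS_OPERATION_BLOCK, ret 0).][REPSVC][hangLunCallback,4685][TP_SysCtrlTPool]
[2017-07-27 15:03:08][1823877.351241] [][15000027c0006][INFO][Hang or blk relate LUN io, (LUN ID 36, action name SVC_DS_OPERATION_BLOCK, ret 0).][REPSVC][hangLunCallback,4685][TP_SysCtrlTPool]
[2017-07-27 15:03:08][1823877.352270] [][15000027c0006][INFO][Update DB for double split, taskcode(REP_SVC_OPCODE_LUN_DOUBLE_SPLIT), action(2).][REPSVC][chgRepLunSvcSplit,11320][CSD_1]
"""


def hung_luns(log_text):
    """Return (LUN ID, action name, return code) for every hang/block entry."""
    return [
        (int(m["lun"]), m["action"], int(m["ret"]))
        for m in HANG_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    for lun, action, ret in hung_luns(LOG):
        print(f"LUN {lun}: {action} (ret {ret})")
```

The resulting LUN IDs can then be matched against the LUNs mapped to the ESXi hosts to confirm that the affected datastores are the ones hosting the crashed VMs.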

5. However, we could not find any path failover event in the storage log. The log only shows a Unit Attention (UA) entry for a command from VMware that is used to check the LUN status.

[2017-07-27 15:03:08][1823877.363159] [][15000000e0019][INFO][CHECK(Opcode:0x2a Host:0xc0bfc082aa1a0003 Hostlun4 Devlun23):LUN Unit Attention, SenseCode 0x3f0e.][SCSI][scsiGetSessAluUACode,895][CSD_12]

6. Next, we checked the event log again and found that the customer had not changed the ALUA mode when creating the iSCSI initiator. However, the customer had not installed Huawei UltraPath on VMware ESXi.

2017-07-06 19:04:45 DST    0x200F00150025    Event    Informational    --    None    admin:200.0.1.70 succeeded in adding the initiator (type iSCSI, identifier iqn.1998-01.com.vmware:esxlev151-1cad6947) to host (host ID 0).

 

Root Cause

1. The customer used VMware's native NMP as the multipathing software but did not configure ALUA mode as required in the product documentation (HUAWEI SAN Storage Host Connectivity Guide for VMware ESXi Servers).

2. Because the customer did not select "Uses third-party multipath software" when creating the iSCSI initiator, the storage system assumed that Huawei UltraPath was running on the host. During a HyperMetro switchover, the storage system sends a special command to the host multipathing software that only Huawei UltraPath can recognize, not VMware NMP. As a result, NMP could not fail the paths over to the remote site and considered all paths down, so the service was interrupted.

 

Solution

Change the iSCSI initiator settings on DeviceManager according to the current HyperMetro configuration and storage software version. The customer needs to enable "Uses third-party multipath software", set "Switchover Mode" to "common ALUA", and set "Special mode" to "Mode 2".
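The three settings can be captured as a small checklist and compared against what is configured for each initiator. The sketch below is illustrative only: the dictionary keys reuse the setting names from this case, the values are the ones recommended here for this software version and HyperMetro configuration, and how the current values are obtained from DeviceManager is left out. The function name `find_mismatches` and the "Enabled"/"Disabled" value labels are assumptions.

```python
# Values recommended in this case for hosts that use VMware NMP.
# Other versions or configurations may need different values; check the
# host connectivity guide before reusing them.
RECOMMENDED = {
    "Uses third-party multipath software": "Enabled",
    "Switchover Mode": "common ALUA",
    "Special mode": "Mode 2",
}


def find_mismatches(initiator, uses_ultrapath):
    """List the settings that differ from the recommendation.

    `initiator` maps setting names to their current values (hypothetical
    representation). Hosts running Huawei UltraPath are skipped because
    the recommendation applies to third-party multipathing only.
    """
    if uses_ultrapath:
        return []
    return [
        f"{name}: expected {want!r}, found {initiator.get(name)!r}"
        for name, want in RECOMMENDED.items()
        if initiator.get(name) != want
    ]


if __name__ == "__main__":
    # State described in this case: third-party option left disabled on NMP hosts.
    before = {"Uses third-party multipath software": "Disabled"}
    for problem in find_mismatches(before, uses_ultrapath=False):
        print(problem)
```

An empty result means the initiator matches the recommendation; any returned entry names a setting that still needs to be changed.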

Suggestions

Follow the HyperMetro documentation or the host connectivity guide to configure the iSCSI initiator mode when using third-party multipathing software.

END