Fault symptom: The customer deployed a two-node Microsoft Failover Cluster (MSFC) on 5500V3 storage over iSCSI links. Each node could scan the LUNs successfully, but the scan was very slow. The customer then tested component failover, and every test passed except disk failover.
Version information: 5500V3 V300R003C00SPC100, SmartIO card in 10GE ETH mode; host OS version: Microsoft Windows 2012 Standard.
Networking topology: Two Huawei 10GE Switches.
1. Because the customer had already run the Microsoft cluster validation tool against the network and it reported no errors in the networking configuration, we checked the SCSI-3 persistent reservation first. Failover fails if the new node cannot obtain the reservation.
Log in to the storage CLI, switch to diagnose mode, and run the command "scsi show reservation lun [ -l LUN ID]" to query the LUN reservation. Check the SCSI reservation state and the InitiatorWWN. In this case the reservation had moved from the old host node to the new one, so the reservation itself was working.
2. To rule out the multipathing software, we tried installing and uninstalling Huawei UltraPath, but this made no difference.
3. We used the hostinfo_tool of UltraPath to collect all Windows host logs. We checked the Windows system events in systemeventlog\System.evtx and found an alarm as below:
We searched Microsoft TechNet for the suggested resolution and tried it, but that did not work either.
4. We collected the storage system logs to check for anything abnormal on the storage side. Both controllers showed many ping timeouts on the iSCSI links (search for "[ERR][Ping") as below:
This means that two of the iSCSI links had a problem.
We then pinged the storage iSCSI service IP addresses from the hosts, and all pings succeeded.
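A default Windows ping carries only a 32-byte payload, so it can succeed even when jumbo-sized packets cannot pass end to end. The following is a minimal sketch of how a don't-fragment ping could be sized to exercise a given MTU. It assumes IPv4 with a 20-byte header (no options) and an 8-byte ICMP echo header; the IP address is a documentation placeholder and the function name is hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def df_ping_command(target_ip, mtu):
    """Build a Windows ping command whose IP packet exactly fills `mtu` bytes."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU is smaller than the IPv4 + ICMP headers")
    # Windows ping: -f sets the Don't Fragment flag, -l sets the payload size.
    return f"ping -f -l {payload} {target_ip}"


# Placeholder address; substitute the storage iSCSI service IP.
print(df_ping_command("192.0.2.10", 9000))  # ping -f -l 8972 192.0.2.10
```

If such a ping fails while a default ping succeeds, some hop on the path is likely using a smaller MTU than the one being tested.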
5. We found that the customer had changed the MTU of the storage ports from the default 1500 to 9000, so we asked the customer to check whether a mismatched MTU value was configured on the hosts or the switches.
The customer reported that all switch ports were set to 9216 (the maximum), but one host was set to 1500 and the other to 9000. After the customer changed the 1500 setting to 9000, the failover issue was resolved.
6. However, the customer found that the storage was still very slow. For example, it took minutes to scan disks or fail over a disk on MSFC. We then checked the host configuration through a remote session and found that the customer had enabled "Jumbo Packet" and set the value to 9000. Finally, we changed it to 9014 and the problem was resolved.
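The check performed in this step can be sketched as a comparison of the configured MTU at each point on the path. The values below are the ones reported in this case. The labels "host A" and "host B", the function name, and the rule that the largest endpoint value is the intended one are assumptions made for illustration.

```python
def check_path_mtu(endpoint_mtus, switch_mtu):
    """Return a list of MTU problems found on a host-switch-storage path.

    endpoint_mtus: mapping of endpoint name -> configured MTU.
    switch_mtu:    MTU configured on the switch ports.
    """
    problems = []
    # Assumption: the largest endpoint MTU is the value everyone should use.
    expected = max(endpoint_mtus.values())
    for name, mtu in endpoint_mtus.items():
        if mtu != expected:
            problems.append(f"{name}: MTU {mtu}, expected {expected}")
    # The switch only needs to be at least as large as the endpoints.
    if switch_mtu < expected:
        problems.append(f"switch: MTU {switch_mtu} is below {expected}")
    return problems


endpoints = {"storage port": 9000, "host A": 1500, "host B": 9000}
for problem in check_path_mtu(endpoints, switch_mtu=9216):
    print(problem)  # host A: MTU 1500, expected 9000
```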
1. The MTU must be set consistently between the host and storage ports. Otherwise, many packets have to be fragmented and reassembled on the network. For example, the storage service port and its switch port settle on the smaller of their two MTU values, 9000, while the host port and its switch port settle on 1500. In this case the storage may send a 9000-byte jumbo packet to the host, but the host port cannot receive it, so the packet has to be fragmented. This takes a long time and causes very high I/O latency or even timeouts.
2. The host "Jumbo Packet" value includes an extra 14-byte packet header (the Ethernet header). When the storage side sets the MTU to 9000, the host must be set to 9014 when Jumbo Packet is enabled.
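The two points above reduce to simple arithmetic. The following is a minimal sketch, assuming a 14-byte Ethernet header (no VLAN tag) and a 20-byte IPv4 header with no options. The function names are hypothetical, and the fragment count follows standard IPv4 fragmentation rules rather than anything measured in this case.

```python
import math

ETHERNET_HEADER = 14  # bytes: destination MAC + source MAC + EtherType
IPV4_HEADER = 20      # bytes, assuming no IP options


def jumbo_packet_value(ip_mtu):
    """Host "Jumbo Packet" value that corresponds to a given IP MTU."""
    return ip_mtu + ETHERNET_HEADER


def fragments_needed(packet_size, link_mtu):
    """Number of IPv4 fragments needed to carry one packet over a link.

    packet_size includes the IPv4 header of the original packet.
    """
    if packet_size <= link_mtu:
        return 1
    payload = packet_size - IPV4_HEADER
    # Every fragment carries its own IPv4 header, and every fragment except
    # the last must carry a payload that is a multiple of 8 bytes.
    per_fragment = (link_mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(jumbo_packet_value(9000))                         # 9014
print(fragments_needed(9000, 1500))                     # 7
print(fragments_needed(9000, 9000 - ETHERNET_HEADER))   # 2
print(fragments_needed(9000, 9000))                     # 1
```

Under these assumptions, a host left at 1500 splits each full-size packet from the storage into seven fragments. A host with Jumbo Packet set to 9000 (an effective IP MTU of 8986) still needs two, which is consistent with the slowness that remained in step 6 until the value was changed to 9014.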
Here is the MSFC installation and failover test procedure:
2. Disk Manager:
3. Public network:
4. Private network:
5. Create cluster:
6. Test steps:
7. Critical event: no event
8. Ping domain:
9. Ping the cluster node: