iSCSI connection frequently disconnects between 5500V3 storage and Windows Server

Publication Date: 2017-11-07
Issue Description

Fault symptom: The customer found many events in the Windows event log indicating that the iSCSI connection frequently disconnects. Each time, the iSCSI connection is automatically restored after a few seconds.

Error 5/29/2017 07:17 iScsiPrt 20 N/A Connection to the target was lost. The initiator will attempt to retry the connection.
Error 5/27/2017 13:53 iScsiPrt 20 N/A Connection to the target was lost. The initiator will attempt to retry the connection.
Error 5/26/2017 18:38 iScsiPrt 20 N/A Connection to the target was lost. The initiator will attempt to retry the connection.
Error 5/24/2017 05:48 iScsiPrt 20 N/A Connection to the target was lost. The initiator will attempt to retry the connection.
Error 5/21/2017 17:06 iScsiPrt 20 N/A Connection to the target was lost. The initiator will attempt to retry the connection.
Error 5/21/2017 08:39 iScsiPrt 20 N/A Connection to the target was lost. The initiator will attempt to retry the connection.

Version information:

OceanStor 5500V3 V300R002C10SPC100

Windows Server 2012 R2 64-bit

Networking topology:


Handling Process

1. Analysis of the storage log shows the error messages below. They indicate that the Windows initiator did not reply to the iSCSI ping (called NOP in the iSCSI protocol), which caused a NOP timeout. However, the storage log contains no physical link-down events, so the issue may be caused by a link problem between the server and the switch, or by a software problem.

[2017-05-11 16:39:59][Ping:conn(ffffc9001781eed0), ip(10.110.5.2) no reply(1) nopIns.][ISCSI_TGT][IST_NopInCheck,3333]
[2017-05-11 16:40:02][Ping:conn(ffffc9001781eed0), ip(10.110.5.2) no reply(2) nopIns.][ISCSI_TGT][IST_NopInCheck,3333]
[2017-05-11 16:40:05][Ping:conn(ffffc9001781eed0), ip(10.110.5.2) no reply(3) nopIns.][ISCSI_TGT][IST_NopInCheck,3333]
[2017-05-11 16:40:08][ERR][Ping:conn(ffffc9001781eed0), ip(10.110.5.2) max outstand.][ISCSI_TGT][IST_NopInCheck,3324]

2. We checked the CE switches and found no link-down events on the switch side, so a software problem is more likely.

3. We used Microsoft Message Analyzer to capture the iSCSI packets on the server side and found the following behavior:


4. Each time the issue occurs, the storage sends a NOP-In with the TTT set to 0xFFFFFFFF, and the Windows initiator does not respond to it. This continues until the NOP times out on the storage side and the storage resets the connection, after which the iSCSI session is restored.

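5. According to the iSCSI specification (RFC 3720), a NOP-In that requires a response must not carry a TTT of 0xFFFFFFFF. That reserved value tells the initiator that no reply is expected. So this is a protocol violation on the storage side.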
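
The RFC 3720 rule referred to in step 5 can be illustrated with a minimal Python sketch. It builds a bare 48-byte NOP-In header (Basic Header Segment) and checks whether the TTT asks the initiator for a reply. The function names `build_nop_in` and `nop_in_expects_reply` are hypothetical and exist only for this illustration; this is not the storage or Windows code. It only shows the header layout (opcode 0x20 in byte 0, TTT in bytes 20-23) and the reserved value.

```python
import struct

NOP_IN_OPCODE = 0x20        # iSCSI NOP-In opcode (RFC 3720)
RESERVED_TAG = 0xFFFFFFFF   # reserved tag value: no reply is expected
BHS_LENGTH = 48             # a Basic Header Segment is always 48 bytes


def build_nop_in(ttt: int, itt: int = RESERVED_TAG) -> bytes:
    """Build a bare NOP-In header (hypothetical helper, illustration only)."""
    bhs = bytearray(BHS_LENGTH)
    bhs[0] = NOP_IN_OPCODE
    bhs[1] = 0x80                         # final bit
    struct.pack_into(">I", bhs, 16, itt)  # Initiator Task Tag, bytes 16-19
    struct.pack_into(">I", bhs, 20, ttt)  # Target Transfer Tag, bytes 20-23
    return bytes(bhs)


def nop_in_expects_reply(bhs: bytes) -> bool:
    """Return True if this NOP-In is a ping that the initiator must answer."""
    if len(bhs) < BHS_LENGTH:
        raise ValueError("header shorter than 48 bytes")
    if bhs[0] & 0x3F != NOP_IN_OPCODE:    # low 6 bits of byte 0 hold the opcode
        raise ValueError("not a NOP-In PDU")
    (ttt,) = struct.unpack_from(">I", bhs, 20)
    return ttt != RESERVED_TAG


if __name__ == "__main__":
    print(nop_in_expects_reply(build_nop_in(ttt=0x00000001)))  # valid TTT
    print(nop_in_expects_reply(build_nop_in(ttt=RESERVED_TAG)))  # reserved TTT
```

An initiator that follows this rule strictly stays silent for the second header, which matches the Windows behavior seen in the capture.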

Root Cause

1. The iSCSI protocol has an initiator and a target. The initiator is similar to a client, and the target is similar to a server. When the iSCSI target (storage) sends a packet that requires a reply from the initiator, the packet carries a TTT (Target Transfer Tag) that identifies the request, so the reply can be matched to it. The TTT is increased by 1 for each packet the storage sends.

2. When the iSCSI initiator or target receives no iSCSI packets on a session for some time, it sends a NOP packet to the remote side to confirm that it is still alive. If the remote side does not respond, the local side disconnects the iSCSI session and releases its resources.

3. Because multipathing on the Windows OS works in Active-Standby mode, some iSCSI paths (sessions) carry no I/O for long periods. On these idle paths, the storage (iSCSI target) and the Windows server (iSCSI initiator) frequently send NOPs to each other. When the storage sends a NOP packet with the TTT set to 0xFFFFFFFF, the issue occurs.
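
The interaction described in the three points above can be sketched as a small Python simulation of an idle path. It is a simplified model under stated assumptions, not the storage firmware or the Windows initiator: the limit of 3 missed replies is taken from the "no reply(1)" to "no reply(3)" entries in the storage log, the target is assumed to keep sending the reserved TTT once it has done so, and all function names are hypothetical.

```python
RESERVED_TAG = 0xFFFFFFFF
MAX_MISSED = 3  # assumption: the storage log shows 3 "no reply" entries before the reset


def strict_initiator(ttt: int) -> bool:
    """Answers only a NOP-In with a valid TTT (the behavior seen on Windows)."""
    return ttt != RESERVED_TAG


def lenient_initiator(ttt: int) -> bool:
    """Answers every NOP-In, even with the reserved TTT (the behavior reported for Linux)."""
    return True


def run_idle_path(ttts, initiator):
    """Return the index of the ping at which the target resets the connection, or None."""
    missed = 0
    for index, ttt in enumerate(ttts):
        if initiator(ttt):
            missed = 0            # a NOP-Out arrived, the path is alive
        else:
            missed += 1           # the target logs another missing reply
            if missed >= MAX_MISSED:
                return index      # NOP timeout: the target resets the connection
    return None


if __name__ == "__main__":
    pings = [1, 2, RESERVED_TAG, RESERVED_TAG, RESERVED_TAG]
    print("strict initiator reset at ping:", run_idle_path(pings, strict_initiator))
    print("lenient initiator reset at ping:", run_idle_path(pings, lenient_initiator))
```

In this model, only the strict initiator is disconnected, and only on a path where the pings are the sole traffic.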

Solution

The issue is fixed in V300R003C20SPH205. Installing this hot patch resolves the issue.

Suggestions

1. If the multipath software works in load-balancing mode, this issue does not occur, or occurs less frequently.

2. So far, this issue has been found only between V3 storage and the Windows OS. Other operating systems such as Linux do not show this behavior: they respond to a NOP packet even when its TTT is 0xFFFFFFFF.