Remote replication link is invalid and can't be deleted on 5300V3

Publication Date: 2017-05-28
Issue Description

Fault symptom:

The customer has two 5300V3 arrays that were replicating over two iSCSI links. One of the links suddenly stopped working. The customer tried to delete the affected link in order to recreate it, but the link could not be deleted on the primary storage. The remote replication status is shown below:

  ID                Type          Health Status  Running Status  Is Primary  Remote Device SN      Replication Mode  User Type  Compress Enable  Compress Valid
  ----------------  ------------  -------------  --------------  ----------  --------------------  ----------------  ---------  ---------------  --------------
  ec4d47683d050000  File System   Normal         Normal          Yes         xxxxxxxxxxxxxxxxxxxxx  Asynchronous      --         Yes              Yes


Primary 5300v3 (named CA5-HOS-01)

iSCSI link:

  ID   Health Status  Running Status  Remote Device Type  In Remote Device  Remote Device ID  Local Controller  Remote Controller
  ---  -------------  --------------  ------------------  ----------------  ----------------  ----------------  -----------------
  0    Normal         Link Up         Replication         Yes               0                 0A                0A
  256  Invalid        Disabled        Replication         Yes               0                 0B                0B

 

Secondary 5300v3 (named CA4-HOS-01)

iSCSI link:

  ID  Health Status  Running Status  Remote Device Type  In Remote Device  Remote Device ID  Local Controller  Remote Controller
  --  -------------  --------------  ------------------  ----------------  ----------------  ----------------  -----------------
  0   Normal         Link Up         Replication         Yes               0                 0A                0A

Version information: V300R003C10SPC100

Alarm Information

The following alarm is active on the primary storage:

2017-05-12 07:00:35 DST    0xF0E10001    Major    None    The replication link (link ID 256,  local controller 0B,  local port CTE0.B.H0,  remote controller 0B,  remote port CTE0.B.H0,  remote device name CA4-HOS-01,  serial number xxxxxxxxxxxxxxxxxxxxx) was disconnected. Therefore,  the remote devices cannot be accessed.

Handling Process

1. From the event log on the secondary storage, we found that the iSCSI links went down at 06:59 on 12 May 2017, and that the alarm for link 257 was cleared at 10:24 the same day.

2017-05-12 06:59:40 DST    0xF0E10001    Major    2017-05-12 11:30:47 DST    The replication link (link ID 0,  local controller 0A,  local port CTE0.A.H0,  remote controller 0A,  remote port CTE0.A.H0,  remote device name CA5-HOS-01,  serial number xxxxxxxxxxxxxxxxxxxxx) was disconnected. Therefore,  the remote devices cannot be accessed.

2017-05-12 06:59:40 DST    0xF0E10001    Major    2017-05-12 10:24:42 DST    The replication link (link ID 257,  local controller 0B,  local port CTE0.B.H0,  remote controller 0B,  remote port CTE0.B.H0,  remote device name CA5-HOS-01,  serial number xxxxxxxxxxxxxxxxxxxxx) was disconnected. Therefore,  the remote devices cannot be accessed.

2. One second later, the customer successfully removed the iSCSI link (ID 257) on the secondary storage.

 2017-05-12 10:24:43 DST    0x200F00E0002C    Informational    None    admin:xxx.xxx.xxx.xxx succeeded in removing the iSCSI link (ID 257).

3. Meanwhile, we found that the remote replication link on controller B of the primary storage never recovered.

2017-05-12 07:00:34 DST    0xF0E10001    Major    2017-05-12 11:31:40 DST    The replication link (link ID 0,  local controller 0A,  local port CTE0.A.H0,  remote controller 0A,  remote port CTE0.A.H0,  remote device name CA4-HOS-01,  serial number xxxxxxxxxxxxxxxxxxxxx) was disconnected. Therefore,  the remote devices cannot be accessed.

2017-05-12 07:00:35 DST    0xF0E10001    Major    None    The replication link (link ID 256,  local controller 0B,  local port CTE0.B.H0,  remote controller 0B,  remote port CTE0.B.H0,  remote device name CA4-HOS-01,  serial number xxxxxxxxxxxxxxxxxxxxx) was disconnected. Therefore,  the remote devices cannot be accessed.

4. After 10:24, the replication link on the primary storage changed to Invalid (health status) and Disabled (running status), because the iSCSI link had already been deleted from the secondary storage while the EPL link still remained on the primary storage.

5. We checked the network configuration on both storage systems, as shown below:

Management port IP of primary storage:

    Management Ethernet port--------------------------------
        Controller ID: 0A
        MAC Address: 20:f1:7c:be:2a:f2
        Network configuration: Static
        IPv4 Address: 172.23.40.35
        IPv4 Mask: 255.255.255.128
        IPv4 Gateway: 172.23.40.1
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

    Management Ethernet port--------------------------------
        Controller ID: 0B
        MAC Address: 20:f1:7c:be:2c:96
        Network configuration: Static
        IPv4 Address: 172.23.40.36
        IPv4 Mask: 255.255.255.128
        IPv4 Gateway: 172.23.40.1
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

Remote replication port IP of primary storage:

        Health Status: Normal
        Running Status: Link Up
        Type: Host Port
        IPv4 Address: 172.23.40.194
        IPv4 Mask: 255.255.255.224
        IPv4 Gateway: --
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

        ID: CTE0.B.H0
        Health Status: Normal
        Running Status: Link Up
        Type: Host Port
        IPv4 Address: 172.23.40.205
        IPv4 Mask: 255.255.255.224
        IPv4 Gateway: --
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

Management port IP of secondary storage:

    Management Ethernet port--------------------------------
        Controller ID: 0A
        MAC Address: 20:f1:7c:be:2d:0d
        Network configuration: Static
        IPv4 Address: 172.23.42.195
        IPv4 Mask: 255.255.255.224
        IPv4 Gateway: 172.23.42.193
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

    Management Ethernet port--------------------------------
        Controller ID: 0B
        MAC Address: 20:f1:7c:be:2b:e7
        Network configuration: Static
        IPv4 Address: 172.23.42.206
        IPv4 Mask: 255.255.255.224
        IPv4 Gateway: 172.23.42.193
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

Remote replication port IP of secondary storage:

        ID: CTE0.A.H0
        Health Status: Normal
        Running Status: Link Up
        Type: Host Port
        IPv4 Address: 172.23.42.194
        IPv4 Mask: 255.255.255.224
        IPv4 Gateway: --
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

        ID: CTE0.B.H0
        Health Status: Normal
        Running Status: Link Up
        Type: Host Port
        IPv4 Address: 172.23.42.205
        IPv4 Mask: 255.255.255.224
        IPv4 Gateway: --
        IPv6 Address: --
        IPv6 Mask: --
        IPv6 Gateway: --

According to the IP configuration, the management IP and the remote replication IP on the primary storage are in different network segments (VLANs), whereas on the secondary storage they are in the same network segment.

6. We checked the system message log of controller B and found that the ETH port status was up at the time of the issue. Only the iSCSI link reached the maximum number of NOP timeouts (NOP-In/NOP-Out are the iSCSI keepalive "ping" PDUs). This suggests a network problem between the switches rather than a hardware fault on the storage side.


[2017-05-12 06:59:39][4898155.406628] [1226455858][1500000291476][INFO][Ping:conn(ffffc90014ec1250), ip(172.23.40.205) no reply(1) nopIns.][ISCSI_TGT][IST_NopInCheck,3345][swapper/1]
[2017-05-12 06:59:39][4898155.406639] [1226455858][15000002914dd][INFO][Ping(1):Tcp snd next (979382880), rcv next(1826917543), copied seq(1826917543),write seq(979382880),snd una(979382832).][ISCSI_TGT][IST_NopInCheck,3348][swapper/1]
[2017-05-12 06:59:40][4898156.856406] [1226456221][15000004613aa][INFO][IMP_Thread:Nop out number(4), Tcp snd next (1703303888), rcv next(311255712), copied seq(311255712), write seq(1703304040), snd una(1703303812).][ISCSI_INI][iscsi_initiator_tx_thread,2548][tx-0-1]
[2017-05-12 06:59:40][4898156.856414] [1226456221][15000004605ea][WARN][IMP_Thread:Nop out reach max number(4).][ISCSI_INI][iscsi_initiator_tx_thread,2564][tx-0-1]
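
When the system message log is long, the NOP timeout warnings can be picked out with a short script. The following Python sketch parses two of the lines quoted above. The field layout is inferred from these sample lines only, and the group names (uptime, seq, code) are hypothetical labels, not documented field names.

```python
import re
from datetime import datetime

# One bracketed field per group. The layout is inferred from the sample
# lines only; "uptime", "seq" and "code" are hypothetical labels.
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\[(?P<uptime>[^\]]+)\]\s*"
    r"\[(?P<seq>[^\]]+)\]\[(?P<code>[^\]]+)\]"
    r"\[(?P<level>[^\]]+)\]\[(?P<msg>[^\]]*)\]"
    r"\[(?P<module>[^\]]+)\]\[(?P<func>[^\]]+)\]\[(?P<thread>[^\]]+)\]$"
)

# Two of the log lines quoted in step 6.
SAMPLE = [
    "[2017-05-12 06:59:39][4898155.406628] [1226455858][1500000291476][INFO]"
    "[Ping:conn(ffffc90014ec1250), ip(172.23.40.205) no reply(1) nopIns.]"
    "[ISCSI_TGT][IST_NopInCheck,3345][swapper/1]",
    "[2017-05-12 06:59:40][4898156.856414] [1226456221][15000004605ea][WARN]"
    "[IMP_Thread:Nop out reach max number(4).]"
    "[ISCSI_INI][iscsi_initiator_tx_thread,2564][tx-0-1]",
]


def parse(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%Y-%m-%d %H:%M:%S")
    return rec


records = [r for r in map(parse, SAMPLE) if r is not None]

# Keep only the warnings that report the NOP-Out retry limit being reached.
nop_timeouts = [
    r for r in records
    if r["level"] == "WARN" and "Nop out reach max" in r["msg"]
]

for r in nop_timeouts:
    print(r["time"], r["module"], r["msg"])
```

Comparing the timestamps of these warnings with the alarm times in the event log helps confirm that the link loss was detected at the iSCSI layer while the Ethernet port itself stayed up.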

Root Cause

1. A remote replication link consists of two software layers: the upper layer is the EPL link, and the lower layer is the physical link (iSCSI, FC, IB, etc.). Normally, both must be deleted when a remote replication link is deleted.

2. Because the EPL link is the upper layer, its state should change only after the state of the physical link changes. For this reason, a silence time of 2 minutes is designed for the EPL link: after the physical link comes up, the EPL link stays in the down state and blocks all event notifications for 2 minutes.

3. As a result, a problem occurs if the customer deletes the replication link within 2 minutes after the physical link is restored. On the secondary storage, where the customer issues the delete instruction, the EPL link and physical link information is deleted directly. At the same time, the secondary storage sends physical link and EPL link destroy instructions to the primary storage. On the primary storage, however, the EPL link destroy event is blocked during the silence time, so only the physical link information is deleted and the EPL link information remains in the system.

After the silence time, the primary storage finds that the EPL link still exists but the physical link is gone, so it sets the health status of the replication link to Invalid and the running status to Disabled.

4. In versions earlier than V300R003C20SPC200, a remote replication link cannot be deleted while it is in an abnormal state such as Invalid.
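
The sequence described above can be illustrated with a small toy model in Python. This is only a sketch of the behavior as described in this case, not the array's actual implementation: the class and method names are hypothetical, the "Deleted" state is a label invented for the model, and it assumes the physical link came back on the primary at about the time the alarm for link 257 was cleared on the secondary (10:24:42).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

SILENCE = timedelta(minutes=2)  # silence time stated in the root cause


@dataclass
class PrimaryLinkState:
    """Toy model of the primary storage's view of one replication link."""
    physical_restored_at: datetime
    has_physical: bool = True
    has_epl: bool = True

    def on_remote_delete(self, now: datetime) -> None:
        """Handle the destroy instructions sent by the secondary storage."""
        self.has_physical = False  # physical link information is removed
        in_silence = now - self.physical_restored_at < SILENCE
        if not in_silence:
            self.has_epl = False   # EPL destroy event is processed
        # Inside the silence time the EPL destroy event is blocked,
        # so the EPL link information is left behind.

    def status(self) -> Tuple[str, str]:
        """Return (health status, running status) after the silence time."""
        if self.has_epl and not self.has_physical:
            return ("Invalid", "Disabled")   # leftover EPL link
        if not self.has_epl and not self.has_physical:
            return ("Deleted", "Deleted")    # model-only label
        return ("Normal", "Link Up")


restored = datetime(2017, 5, 12, 10, 24, 42)  # assumed restore time
deleted = datetime(2017, 5, 12, 10, 24, 43)   # link 257 removed on secondary

early = PrimaryLinkState(restored)
early.on_remote_delete(deleted)               # 1 second after restore

late = PrimaryLinkState(restored)
late.on_remote_delete(restored + timedelta(minutes=3))

print(early.status())
print(late.status())
```

In this model, a delete issued one second after the restore leaves the link Invalid and Disabled, while the same delete issued after the silence time removes both layers cleanly.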

Solution

1. Upgrade storage version to V300R003C20SPC200.

2. Force-delete the remote replication link from the CLI on the primary storage:

developer:/>remove remote_device link iscsi iscsi_link_id=256

3. Create the iSCSI link again.

 

Suggestions

1. Place the management IP and the remote replication IP in different VLANs/network segments.
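
Whether two ports share a network segment can be checked from their addresses and masks. The following Python sketch uses the standard ipaddress module with the controller A addresses listed in the handling process; the function name same_segment is hypothetical. Note that it only compares IP subnets and cannot tell which VLAN a switch port belongs to.

```python
import ipaddress


def same_segment(a: str, b: str) -> bool:
    """True if two 'address/mask' interfaces fall in overlapping IPv4 networks."""
    net_a = ipaddress.ip_interface(a).network
    net_b = ipaddress.ip_interface(b).network
    return net_a.overlaps(net_b)


# Controller A addresses taken from the configuration output in this case.
primary_mgmt = "172.23.40.35/255.255.255.128"
primary_repl = "172.23.40.194/255.255.255.224"
secondary_mgmt = "172.23.42.195/255.255.255.224"
secondary_repl = "172.23.42.194/255.255.255.224"

print("primary shares segment:  ", same_segment(primary_mgmt, primary_repl))
print("secondary shares segment:", same_segment(secondary_mgmt, secondary_repl))
```

For these addresses the check reports that the primary storage keeps the two ports apart while the secondary storage does not, which matches the observation made in the handling process.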

END