Broken slave SRU can affect forwarding on an S7700 switch

Publication Date: 2016-12-31
Issue Description

We have a couple of S12704 switches as core devices and several S7712 switches as access-level devices. We noticed that some hosts connected to one of the S7712 switches could not be reached from the WAN.



Handling Process

1. The core switch S12704-1 and the S7712 pinged each other successfully.

2. The core switch S12704-2 pinged the problematic user host address 10.81.19.20 successfully.

3. The problem persisted after the latest V2R8SPH009 patch was loaded, so it can be determined that this problem is not caused by a known issue of the software version.

4. After the problematic user host was moved from slot 8 of the S7712 switch to another slot, some services recovered.

5. After the slave SRU control unit of the S7712 switch was replaced, all services recovered.

 Traffic Statistics Analysis

By analyzing traffic statistics on these switches, we found that some service packets failed to be forwarded on the link between Xe4/0/7 of S12704-1 and Xe2/0/0 of S7712. Ping packets sent from Xe4/0/7 of S12704-1 failed to be forwarded through Xe2/0/0 of S7712.

Before starting the traffic statistics collection, we checked the ARP entry of the problematic host and matching route on the S7712. The ARP entry and routing entry were both normal.

 [S7712]disp arp | i 19.16

IP ADDRESS      MAC ADDRESS     EXPIRE(M) TYPE        INTERFACE   VPN-INSTANCE

                                          VLAN/CEVLAN

------------------------------------------------------------------------------

10.81.19.16     c03f-d551-9f80  20        D-0         GE8/0/47

The default route on S7712 shows that traffic from the host is forwarded upstream from the card in slot 2.

 0.0.0.0/0   OSPF    10   101         D   10.125.72.210   XGigabitEthernet1/0/0.2120

               OSPF    10   101         D   10.125.72.208   XGigabitEthernet2/0/0.2119

 

Then we performed a ping test from S12704-1 to the host, with the source address specified. ICMP reply packets were not received on the line card in slot 2.

On the user-side interface GE8/0/47, the number of packets sent was the same as the number of packets received.

<S7712>displ traffic policy statistics all

 

 Interface: GigabitEthernet8/0/47

 Traffic policy outbound: p1

 Rule number: 1

 Current status: success

 Statistics interval: 300

---------------------------------------------------------------------

 Board : 8

---------------------------------------------------------------------

 Matched          |      Packets:                            10

                  |      Bytes:                           1,020

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Passed         |      Packets:                            10

                  |      Bytes:                           1,020

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Dropped        |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

     Filter       |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

     Car          |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

                                         

 Interface: GigabitEthernet8/0/47        

 Traffic policy inbound: p2               

 Rule number: 1                          

 Current status: success                 

 Statistics interval: 300                

---------------------------------------------------------------------

 Board : 8                               

---------------------------------------------------------------------

 Matched        |      Packets:                            10

                  |      Bytes:                           1,020

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Passed        |      Packets:                            10

                  |      Bytes:                           1,020

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Dropped       |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

     Filter      |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

     Car         |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

      

However, the uplink interface in slot 2 did not receive any ICMP reply packets. This indicates that the packets were lost during inter-card forwarding on the S7712.

<S7712>displ traffic policy  statistics all

 

 Interface: XGigabitEthernet2/0/0.2119

 Traffic policy inbound: p1

 Rule number: 1

 Current status: success

 Statistics interval: 300

---------------------------------------------------------------------

 Board : 2

---------------------------------------------------------------------

 Matched          |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Passed         |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Dropped        |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

     Filter       |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

     Car          |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

                                         

 Interface: GigabitEthernet8/0/47        

 Traffic policy outbound: p1             

 Rule number: 1                          

 Current status: success                 

 Statistics interval: 300                

---------------------------------------------------------------------

 Board : 8                               

---------------------------------------------------------------------

 Matched          |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Passed         |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

   Dropped        |      Packets:                             0

                  |      Bytes:                               0

                  |      Rate(pps):                           0

                  |      Rate(bps):                           0

---------------------------------------------------------------------

     Filter       |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------

     Car          |      Packets:                             0

                  |      Bytes:                               0

---------------------------------------------------------------------
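The counter comparison above can be automated when many interfaces have to be checked. The following is a minimal Python sketch, assuming the output format shown in this case; the function name `matched_packets` and the abridged `SAMPLE` text are illustrative and not part of any Huawei tool.

```python
import re

def matched_packets(output: str) -> dict:
    """Map (interface, direction) to the 'Matched' packet counter."""
    counts = {}
    interface = direction = None
    for raw in output.splitlines():
        line = raw.strip()
        m = re.match(r"Interface:\s*(\S+)", line)
        if m:
            interface = m.group(1)
            continue
        m = re.match(r"Traffic policy (inbound|outbound):", line)
        if m:
            direction = m.group(1)
            continue
        m = re.match(r"Matched\s*\|\s*Packets:\s*([\d,]+)", line)
        if m and interface and direction:
            # Counters may contain thousands separators, e.g. "1,020".
            counts[(interface, direction)] = int(m.group(1).replace(",", ""))
    return counts

# Abridged excerpt of the counters shown in this case.
SAMPLE = """
 Interface: GigabitEthernet8/0/47
 Traffic policy inbound: p2
 Matched          |      Packets:                            10
                  |      Bytes:                           1,020
 Interface: XGigabitEthernet2/0/0.2119
 Traffic policy inbound: p1
 Matched          |      Packets:                             0
"""

counts = matched_packets(SAMPLE)
ingress = counts[("GigabitEthernet8/0/47", "inbound")]
uplink = counts[("XGigabitEthernet2/0/0.2119", "inbound")]
print(f"user side in: {ingress}, uplink: {uplink}, missing: {ingress - uplink}")
# prints: user side in: 10, uplink: 0, missing: 10
```

A non-zero difference between the user-side and uplink counters points to loss inside the chassis rather than on the external links.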

 Internal Forwarding Process Analysis

We collected traffic statistics on HG links in the S7712 switch and found that ICMP reply packets were lost on the forwarding chip of the SRU in slot 14. The traffic statistics show that the SRU did not receive any packets, indicating that traffic had been lost on the link between the line card in slot 8 and the SRU.

[S7712-diagnose]catch slot 2 pkt-stat-analys hg-port all-hg srcip 10.81.19.16 dstip 10.125.72.208 delay-time 500

[S7712-diagnose]catch slot 13 pkt-stat-analys hg-port all-hg srcip 10.81.19.16 dstip 10.125.72.208 delay-time 500

[S7712-diagnose]catch slot 14 pkt-stat-analys hg-port all-hg srcip 10.81.19.16 dstip 10.125.72.208 delay-time 500

                                                                               

[S7712-diagnose]display gfpi catch slot 13 pkt-stat-analys

-------------------------------------------------------------------

 set time: 0x01f4 remain time: 0x005a mirror stat:0

 hig: Enable hig-pbm: 0xffffffff pdtindex: 0xffffffff Src-Ip  : 10.81.19.16         

 dst-Ip  : 10.125.72.208         

-------------------------------------------------------------------

 higig received packet stat-info: 0 packets

                                       0 bytes

-------------------------------------------------------------------

[S7712-diagnose]display gfpi catch slot 14 pkt-stat-analys

-------------------------------------------------------------------

 set time: 0x01f4 remain time: 0x0064 mirror stat:0

 hig: Enable hig-pbm: 0xffffffff pdtindex: 0xffffffff Src-Ip  : 10.81.19.16         

 dst-Ip  : 10.125.72.208         

-------------------------------------------------------------------

 higig received packet stat-info: 0 packets

                                         0 bytes

-------------------------------------------------------------------

 [S7712-diagnose]display gfpi catch slot 2 pkt-stat-analys

-------------------------------------------------------------------

 set time: 0x01f4 remain time: 0x01b0 mirror stat:0

 hig: Enable hig-pbm: 0xffffffff pdtindex: 0xffffffff Src-Ip  : 10.81.19.16         

 dst-Ip  : 10.125.72.208         

-------------------------------------------------------------------

 higig received packet stat-info: 0 packets

                                        0 bytes

-------------------------------------------------------------------
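The timer fields in these outputs are hexadecimal. As a reading aid, the short Python sketch below decodes them together with the packet counter. It assumes the field layout shown above; the interpretation that `set time` echoes the `delay-time 500` argument is an inference from the values (0x01f4 equals 500), and the unit is not stated in the output. The function name `parse_catch` is hypothetical.

```python
import re

def parse_catch(output: str) -> dict:
    """Extract the timer fields and the received-packet counter."""
    timers = re.search(
        r"set time:\s*(0x[0-9a-fA-F]+)\s+remain time:\s*(0x[0-9a-fA-F]+)", output
    )
    packets = re.search(r"received packet stat-info:\s*(\d+)\s+packets", output)
    if not timers or not packets:
        raise ValueError("unexpected output format")
    return {
        "set_time": int(timers.group(1), 16),     # hex string to integer
        "remain_time": int(timers.group(2), 16),
        "packets": int(packets.group(1)),
    }

# Two lines taken from the slot 13 output in this case.
SLOT13 = """
 set time: 0x01f4 remain time: 0x005a mirror stat:0
 higig received packet stat-info: 0 packets
"""

result = parse_catch(SLOT13)
print(result)  # {'set_time': 500, 'remain_time': 90, 'packets': 0}
```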

When checking the fabric chip link status, we found that an HG link between the SRU and line card experienced a large number of FCS errors, and the number of FCS errors was still increasing. So, the HG link might be faulty.

[S7712-diagnose]dis fabric status slot 14

 

  HG  UPDOWN  DISCARD  IRFCS  ITFCS  AllFCS   STAT       CHECK       LASTFAIL

----------------------------------------------------------------------------------

  0      0       0       0      0      0      Fail  09:17 16 Oct04   09:17 16 Oct04

  1      0       0       0      0      0      Fail  09:17 16 Oct04   09:17 16 Oct04

  2      7       0       0      0      0      OK    09:18 16 Oct04   09:17 16 Oct04

…………………….

  3      7       0       0      0      0      OK    09:18 16 Oct04   09:17 16 Oct04

23      7       0       0      0      0      OK    09:18 16 Oct04   09:17 16 Oct04

 24      7       0       0      0      0      OK    09:18 16 Oct04   09:17 16 Oct04

 25      0       0     186509      0    186509      Fail  09:17 16 Oct04   09:17 16 Oct04

 26      0       0       0      0      0      Fail  09:17 16 Oct04   09:17 16 Oct04

 After the SRU in slot 14 was removed and reinstalled, the HG link failed the check again, and the number of FCS errors still kept increasing.

[S7712-diagnose]display fabric status slot 14

 

  HG  UPDOWN  DISCARD  IRFCS  ITFCS  AllFCS   STAT       CHECK       LASTFAIL

----------------------------------------------------------------------------------

  0      0       0       0      0      0      Fail  00:01 16 Nov30   00:01 16 Nov30

  1      0       0       0      0      0      Fail  00:01 16 Nov30   00:01 16 Nov30

  2      1       0       0      0      0      OK    00:02 16 Nov30   00:01 16 Nov30

  3      1       0       0      0      0      OK    00:02 16 Nov30   00:01 16 Nov30

-----------------------------------------------

 24      1       0       0      0      0      OK    00:02 16 Nov30   00:01 16 Nov30

 25      0       0      84      0     84      Fail  00:01 16 Nov30   00:01 16 Nov30

 26      0       0       0      0      0      Fail  00:01 16 Nov30   00:01 16 Nov30

 27      1       0       0      0      0      OK    00:02 16 Nov30   00:01 16 Nov30
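When the table is long, the rows with FCS errors can be picked out with a small script. This is a sketch in Python that assumes the column order shown in the header above (HG, UPDOWN, DISCARD, IRFCS, ITFCS, AllFCS, STAT); the function name `fcs_suspects` is illustrative.

```python
def fcs_suspects(output: str) -> list:
    """Return (hg, all_fcs, stat) for every row whose AllFCS counter is non-zero."""
    suspects = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with six integer columns followed by the STAT column.
        if len(fields) < 7 or not all(f.isdigit() for f in fields[:6]):
            continue
        hg, _updown, _discard, _irfcs, _itfcs, all_fcs = map(int, fields[:6])
        if all_fcs > 0:
            suspects.append((hg, all_fcs, fields[6]))
    return suspects

# Rows copied from the two snapshots in this case (before and after reseating the SRU).
BEFORE = """
 24      7       0       0      0      0      OK    09:18 16 Oct04   09:17 16 Oct04
 25      0       0     186509      0    186509      Fail  09:17 16 Oct04   09:17 16 Oct04
 26      0       0       0      0      0      Fail  09:17 16 Oct04   09:17 16 Oct04
"""
AFTER = """
 24      1       0       0      0      0      OK    00:02 16 Nov30   00:01 16 Nov30
 25      0       0      84      0     84      Fail  00:01 16 Nov30   00:01 16 Nov30
 26      0       0       0      0      0      Fail  00:01 16 Nov30   00:01 16 Nov30
"""

before = fcs_suspects(BEFORE)
after = fcs_suspects(AFTER)
print(before)  # [(25, 186509, 'Fail')]
print(after)   # [(25, 84, 'Fail')]
```

The same row reappearing after the SRU was reseated, with a counter that restarted and kept growing, is what suggests a persistent fault rather than a transient one.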

 Forwarding Mechanism Analysis

On an S7700 switch, inter-card forwarding is implemented by the fabric chips of the SRUs. Each line card used in this S7712 switch has four HG high-speed links for inter-card forwarding. Two of the HG links connect to the master SRU, and the other two connect to the slave SRU. You can see the connections using the display hg-connection slot x command.

Taking the line card in slot 8 as an example:

[S7712-diagnose]display hg-connection slot 8

 

    HGID (Unit, Port) --> (SruSlot, Chip, Phy HG)

 -------------------------------------------------

     hg0   ( 0,  28)  -->   ( 13,  0,   HG6  )

     hg1   ( 0,  29)  -->   ( 13,  1,   HG9  )

     hg2   ( 0,  30)  -->   ( 14,  0,   HG6  )

     hg3   ( 0,  31)  -->   ( 14,  1,   HG9  )

 

     Logic HG  -->  Phy HG  -->  (Unit, Port)

 ------------------------------------------------

        0      -->    hg0   -->   ( 0,  28 )

        1      -->    hg1   -->   ( 0,  29 )

        2      -->    hg2   -->   ( 0,  30 )

        3      -->    hg3   -->   ( 0,  31 )

[S7712-diagnose]display hg-connection slot 14

 

    (SruSlot, Chip, Phy HG)  -->  (LpuSlot, Logic HG)

 --------------------------------------------------------

       (14,   1,  HG2 )      -->    (  1,      2 )

       (14,   0,  HG13)      -->    (  1,      3 )

 

       (14,   0,  HG3 )      -->    (  2,      2 )

       (14,   1,  HG12)      -->    (  2,      3 )

 

       (14,   1,  HG4 )      -->    (  3,      2 )

       (14,   0,  HG11)      -->    (  3,      3 )

 

       (14,   0,  HG5 )      -->    (  4,      2 )

       (14,   1,  HG10)      -->    (  4,      3 )

 

       (14,   1,  HG6 )      -->    (  5,      2 )

       (14,   0,  HG9 )      -->    (  5,      3 )

 

       (14,   0,  HG7 )      -->    (  6,      2 )

       (14,   1,  HG8 )      -->    (  6,      3 )

       (14,   1,  HG0 )      -->    (  6,      6 )

       (14,   0,  HG1 )      -->    (  6,      7 )

 

       (14,   1,  HG7 )      -->    (  7,      2 )

       (14,   0,  HG8 )      -->    (  7,      3 )

       (14,   0,  HG0 )      -->    (  7,      6 )

       (14,   1,  HG1 )      -->    (  7,      7 )

 

       (14,   0,  HG6 )      -->    (  8,      2 )

       (14,   1,  HG9 )      -->    (  8,      3 )

 

       (14,   1,  HG5 )      -->    (  9,      2 )

       (14,   0,  HG10)      -->    (  9,      3 )

 

       (14,   0,  HG4 )      -->    ( 10,      2 )

       (14,   1,  HG11)      -->    ( 10,      3 )

 

       (14,   1,  HG3 )      -->    ( 11,      2 )

       (14,   0,  HG12)      -->    ( 11,      3 )

 

       (14,   0,  HG2 )      -->    ( 12,      2 )

       (14,   1,  HG13)      -->    ( 12,      3 )

The four HG links of the line card in slot 8 connect to two chips on each of the SRUs in slot 13 and slot 14. The faulty HG link is the HG9 link between chip 1 of the SRU in slot 14 and the line card in slot 8. When checking the register states in the diagnostic mode, we found that the number of CRC errors on the HG9 link increased continuously. This indicates that packets forwarded on this link were dropped by the SRU.
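The lookup from a line card slot to its SRU-side HG ports can be sketched in Python as follows. The parser assumes the row layout of the `display hg-connection slot 14` output shown above; the names `ROW` and `sru_ports_for_lpu` are illustrative.

```python
import re

# One row looks like: (14,   1,  HG9 )      -->    (  8,      3 )
ROW = re.compile(
    r"\(\s*(\d+),\s*(\d+),\s*(HG\d+)\s*\)\s*-->\s*\(\s*(\d+),\s*(\d+)\s*\)"
)

def sru_ports_for_lpu(output: str, lpu_slot: int) -> list:
    """Return (sru_slot, chip, phy_hg, logic_hg) for every link to the given line card."""
    ports = []
    for m in ROW.finditer(output):
        sru, chip, phy, lpu, logic = m.groups()
        if int(lpu) == lpu_slot:
            ports.append((int(sru), int(chip), phy, int(logic)))
    return ports

# Rows for slots 7, 8 and 9 copied from the output in this case.
SAMPLE = """
       (14,   1,  HG7 )      -->    (  7,      2 )
       (14,   0,  HG8 )      -->    (  7,      3 )
       (14,   0,  HG6 )      -->    (  8,      2 )
       (14,   1,  HG9 )      -->    (  8,      3 )
       (14,   1,  HG5 )      -->    (  9,      2 )
       (14,   0,  HG10)      -->    (  9,      3 )
"""

slot8 = sru_ports_for_lpu(SAMPLE, 8)
print(slot8)  # [(14, 0, 'HG6', 2), (14, 1, 'HG9', 3)]
```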

After the SRU in slot 14 was removed from the switch, services recovered. After this SRU was reinstalled in the slot and the switch was rebooted, the problem occurred again. It can be determined that the fault lies in the SRU and cannot be rectified by rebooting the switch. We suspect that the HG9 port on the SRU has a hardware failure.

Root Cause

The slave SRU in slot 14 of the S7712 is malfunctioning because of a hardware failure on its HG9 port. Inter-card packets forwarded over the HG9 link are dropped, which affects user services.

The key point is that traffic between slot 2 and slot 8 must pass through an SRU (MPU), because the two line cards are not connected to each other directly. Refer to the illustration below:


In more detail, the line card in slot 8 connects to the slave SRU (slot 14) through two HG interfaces (hg2 to HG6, and hg3 to HG9). These two interfaces are bundled as an Eth-Trunk, and traffic is distributed across them according to the load balancing of this HG Eth-Trunk. As a result, some packets are carried over HG9 of the slave SRU and others over HG6 of the slave SRU:

 

[S7712-diagnose]display hg-connection slot 8

 

    HGID (Unit, Port) --> (SruSlot, Chip, Phy HG)

-------------------------------------------------

     hg0   ( 0,  28)  -->   ( 13,  0,   HG6  )

     hg1   ( 0,  29)  -->   ( 13,  1,   HG9  )

     hg2   ( 0,  30)  -->   ( 14,  0,   HG6  )

     hg3   ( 0,  31)  -->   ( 14,  1,   HG9  )

Traffic is distributed according to the load balancing of this HG Eth-Trunk, and the trunk did not go down; the faulty link only dropped packets with FCS errors. Traffic therefore continued to be forwarded through slot 14, and the packets that were placed on the damaged HG9 link were dropped.

The slave SRU is a true standby for the control plane, but in the data plane it acts as a load-balancing member together with the master SRU.
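The effect of one bad member in a load-balanced trunk can be illustrated with a toy model in Python. This is only a sketch: the real hash algorithm of the HG Eth-Trunk is not described in this case, so the selection rule below (last octet of the source address modulo the member count) is a hypothetical stand-in, and the host addresses come from a documentation range rather than from the affected network.

```python
import ipaddress

def pick_member(src_ip: str, members: list) -> str:
    """Toy selection rule (assumption): last octet of the source address modulo member count."""
    last_octet = int(ipaddress.ip_address(src_ip)) & 0xFF
    return members[last_octet % len(members)]

def lost_hosts(hosts: list, members: list, faulty: set) -> list:
    """Hosts whose traffic is placed on a faulty member and therefore dropped."""
    return [h for h in hosts if pick_member(h, members) in faulty]

MEMBERS = ["hg2", "hg3"]   # the two links from slot 8 to the slave SRU
FAULTY = {"hg3"}           # hg3 connects to HG9 on the slave SRU
HOSTS = [f"192.0.2.{n}" for n in range(1, 7)]  # hypothetical hosts

lost = lost_hosts(HOSTS, MEMBERS, FAULTY)
print(lost)  # ['192.0.2.1', '192.0.2.3', '192.0.2.5']
```

Because the faulty member stays in the trunk, only the flows mapped to it fail while the others keep working, which is consistent with only some hosts being unreachable and with partial recovery after a host was moved to another slot.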

Suggestions

Replace the SRU in slot 14 with a new one and send the faulty SRU to Huawei for fault analysis.
