Voice and Data Services on a Core Network Were Interrupted Because Eth-trunk Members Were Faulty

Publication Date: 2013-10-30
Issue Description

Network topology: [figure not available]

Three GE links between the two NE80Es (NE40E&80E V300R006C01SPC003) were bound as Eth-Trunk9. Eth-Trunk9 was configured as a Layer 2 interface by using the portswitch command, and multiple VLANIF interfaces were configured.

VLANIF99 was the public-network interface running OSPF, MPLS LDP, and MP-IBGP. The other VLANIF interfaces were configured with VRRP (heartbeat packets were transmitted over Eth-Trunk9) and bound to different VPN instances. One day, a fault occurred on the core network in the core equipment room: the success rate of signaling setup between MSCs was low, voice calls were temporarily interrupted, and 2G/3G PS services were abnormal. Network-wide KPIs degraded and services were interrupted. Frontline personnel found the following:

1) Two VRRP masters sometimes existed at the same time.

2) According to the logs, the VRRP status flapped frequently.

3) The OSPF, MPLS LDP, and MP-IBGP peer relationships between the two NE80Es were abnormal.

As a result, network-wide signaling, voice services, and data services were abnormal. However, the GE interfaces between the two NE80Es remained up.

R&D engineers verified that the GE interfaces on the board in slot 9 were normal. Ping tests between the VLANIF1000 interfaces on the two NE80Es succeeded only occasionally. It was therefore suspected that the links between the two NE80Es had problems.

Services were restored after the board in slot 9 on NE80E-2 was reset.

The following logs show that the LDP and OSPF peer relationships between the two NE80Es went down and that VRRP switchovers occurred frequently:

Sep 22 2010 15:58:54 QOA-NE80E-02 %%01LSPM/6/SLOTOTHEREVENT(l)[101330]:Got interface event SEC ADD and address 10.15.20.241 in interface Vlanif1050(0x1587).
Sep 22 2010 15:58:54 QOA-NE80E-02 %%01VRRP/4/STATEWARNING(l)[101331]:Virtual Router state BACKUP changed to MASTER, because of protocol timer expired. (Interface=Vlanif1050, VrId=21)
Sep 22 2010 15:58:54 QOA-NE80E-02 %%01RM/6/HANDLE_ADD_IPMSG(l)[101332]:RM IM received the event of adding an IP address. (IpAddress=10.15.20.241, Mask=255.255.255.255, Interface=Vlanif1050)
Sep 22 2010 15:58:54 QOA-NE80E-02 %%01OSPF/6/LOGIC_IF_INFO(l)[101333]:OSPF logical interface information (InterfaceName=Vlanif1050, LogicalInterfaceIndex=241, PhysicalInterfaceIndex=230, RmInterfaceIndex=5511, RmInterfaceType=38, RM interface bandwidth=1000000000, RmInterfaceMtu=1500, ChangeType=0x1)
Sep 22 2010 15:58:54 QOA-NE80E-02 %%01RM/6/HANDLE_ADDED_IPMSG(l)[101334]:RM IM processed the event of adding an IP adress successfully. (IpAddress=10.15.20.241, Mask=255.255.255.255, Interface=Vlanif1050)
Sep 22 2010 15:58:54 QOA-NE80E-02 %%01OSPF/3/NBR_CHG_DOWN(l)[101335]:Neighbor event:neighbor state changed to Down. (ProcessId=1, NeighborAddress=192.168.10.193, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)
Sep 22 2010 15:58:54 QOA-NE80E-02 %%01OSPF/6/NBR_DOWN_REASON(l)[101336]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborRouterId=192.168.10.193, NeighborAreaId=0, NeighborInterface=Vlanif99,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=[2010/09/22] 15:58:54)
Sep 22 2010 15:59:09 QOA-NE80E-02 %%01LDP/6/PRONOTI(l)[101337]:The session was deleted and the notification sent by the peer was handled.(Notification=HOLD_TIMER_EXPIRED, PeerId=192.168.10.193 )
Sep 22 2010 15:59:09 QOA-NE80E-02 %%01RM/3/LDP_SESSION_STATE(l)[101338]:RM received the status DOWN  of the LDP session on the Vlanif99.
Sep 22 2010 15:59:10 QOA-NE80E-02 %%01LDP/4/HOLDTMREXP(l)[101339]:Sessions were deleted because the hello hold timer expired. (PeerId=192.168.10.193)
Sep 22 2010 15:59:10 QOA-NE80E-02 %%01LDP/4/ADJHOLDEXPSSNDOWN(l)[101340]:LDP sessions were Down because the Hello hold timer expired. (MaxHelloSendDelayTime=2s)
Sep 22 2010 15:59:32 QOA-NE80E-02 %%01OSPF/3/NBR_CHG_DOWN(l)[101341]:Neighbor event:neighbor state changed to Down. (ProcessId=200, NeighborAddress=10.15.30.248, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)
Sep 22 2010 15:59:32 QOA-NE80E-02 %%01OSPF/6/NBR_DOWN_REASON(l)[101342]:Neighbor state leaves full or changed to Down. (ProcessId=200, NeighborRouterId=10.15.30.248, NeighborAreaId=0, NeighborInterface=Vlanif2102,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=[2010/09/22] 15:59:32)
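
To quantify how often the protocols flapped, log lines like the ones above can be tallied by module and mnemonic. The following is a minimal Python sketch, not a Huawei tool. The names count_flaps, LOG_RE, and FLAP_EVENTS are assumptions made for illustration, and the regular expression is derived only from the log format shown in this case.

```python
import re
from collections import Counter

# Header of a log line as it appears in this case:
# "<Mon DD YYYY HH:MM:SS> <hostname> %%<ver><MODULE>/<severity>/<MNEMONIC>(l)[seq]:..."
LOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{4} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) %%\d+(?P<module>[A-Z]+)/(?P<sev>\d)/(?P<mnemonic>\w+)\(l\)"
)

# Mnemonics from the excerpt that indicate a protocol state change.
# This selection is an assumption; extend it as needed.
FLAP_EVENTS = {"STATEWARNING", "NBR_CHG_DOWN", "HOLDTMREXP"}


def count_flaps(lines):
    """Count flap-related events per (module, mnemonic) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group("mnemonic") in FLAP_EVENTS:
            counts[(match.group("module"), match.group("mnemonic"))] += 1
    return counts


# Four lines copied from the log excerpt in this case.
SAMPLE = [
    "Sep 22 2010 15:58:54 QOA-NE80E-02 %%01VRRP/4/STATEWARNING(l)[101331]:"
    "Virtual Router state BACKUP changed to MASTER, because of protocol timer "
    "expired. (Interface=Vlanif1050, VrId=21)",
    "Sep 22 2010 15:58:54 QOA-NE80E-02 %%01OSPF/3/NBR_CHG_DOWN(l)[101335]:"
    "Neighbor event:neighbor state changed to Down. (ProcessId=1, "
    "NeighborAddress=192.168.10.193, NeighborEvent=InactivityTimer, "
    "NeighborPreviousState=Full, NeighborCurrentState=Down)",
    "Sep 22 2010 15:59:09 QOA-NE80E-02 %%01RM/3/LDP_SESSION_STATE(l)[101338]:"
    "RM received the status DOWN  of the LDP session on the Vlanif99.",
    "Sep 22 2010 15:59:10 QOA-NE80E-02 %%01LDP/4/HOLDTMREXP(l)[101339]:"
    "Sessions were deleted because the hello hold timer expired. "
    "(PeerId=192.168.10.193)",
]

if __name__ == "__main__":
    for (module, mnemonic), n in sorted(count_flaps(SAMPLE).items()):
        print(f"{module}/{mnemonic}: {n}")
```

A high count of VRRP, OSPF, and LDP events within a short interval while the physical interfaces stay up points toward a forwarding problem on the shared Eth-Trunk instead of a fault in any single protocol.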
Handling Process

R&D engineers analyzed log information and concluded as follows:

Network cables to one or two GE interfaces in Eth-Trunk9 between the NE80Es were not properly connected. The NE80Es did not detect the exception and continued to transmit packets over the faulty links, resulting in service interruption. After the board in slot 9 was reset, the links were re-activated and services were restored temporarily. However, the risk remained and had to be eliminated immediately.

Because VLANIF99 ran the public-network MPLS, the public-network control plane was also abnormal.

Suggestions:

1) Configure BFD for default IP on the Eth-Trunk member interfaces, together with the process-interface-status command. In this way, once a member link in the Eth-Trunk becomes faulty, BFD quickly detects the fault and sets the associated interface down. The interface then no longer forwards packets, so services are not affected.

Configuration example (the local discriminator on each end must equal the remote discriminator on the other end):

NE80E-1:

bfd 9/0/37 bind peer-ip default-ip interface GigabitEthernet9/0/37
 discriminator local 100
 discriminator remote 200
 process-interface-status
 commit

NE80E-2:

bfd 9/0/37 bind peer-ip default-ip interface GigabitEthernet9/0/37
 discriminator local 200
 discriminator remote 100
 process-interface-status
 commit

2) Configure inter-board Eth-trunks to prevent a fault on a board from affecting services.
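
For statically configured BFD sessions such as the one in suggestion 1, the local discriminator on each end has to equal the remote discriminator on the other end, otherwise the session does not come up. The check can be sketched in Python as follows. The BfdEnd class, the discriminators_match function, and the discriminator values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BfdEnd:
    """One end of a static BFD session (hypothetical helper type)."""
    device: str
    local: int   # value of "discriminator local"
    remote: int  # value of "discriminator remote"


def discriminators_match(a: BfdEnd, b: BfdEnd) -> bool:
    """Return True if the two ends mirror each other's discriminators."""
    return a.local == b.remote and a.remote == b.local


if __name__ == "__main__":
    # Illustrative values: the ends mirror each other, so the check passes.
    end1 = BfdEnd("NE80E-1", local=100, remote=200)
    end2 = BfdEnd("NE80E-2", local=200, remote=100)
    print(discriminators_match(end1, end2))

    # Identical values on both ends do not mirror each other.
    wrong = BfdEnd("NE80E-2", local=100, remote=200)
    print(discriminators_match(end1, wrong))
```

Running such a check against the planned values for every Eth-Trunk member before committing the configuration helps avoid BFD sessions that stay down because of mismatched discriminators.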
Root Cause
Eth-trunk members were faulty.
Solution
BFD for default IP was configured on the Eth-Trunk member interfaces, together with the process-interface-status command. In this way, once a member link in the Eth-Trunk becomes faulty, BFD quickly detects the fault and sets the associated interface down. The interface then no longer forwards packets, so services are not affected.
Suggestions

1. Deploy BFD for default IP for Eth-trunks. 

2. Deploy 1588v2 if it is supported by all boards.
