GPRS Services Carried by an NE40E on an IP Bearer Network Were Interrupted Due to a Board Fault

Publication Date: 2013-10-08
Issue Description

The LPU board in slot 1 on an NE40E (AR1) was faulty, and the GPRS services carried by AR1 were interrupted. The BGP neighbor relationship between AR1 and its subtended DC1 went down, as did the LDP neighbor relationship between AR1 and AR2. The services recovered after being switched to a backup plane.

Alarm information:

 

 ===============display logbuffer===============

Sep 1 2011 07:52:19 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7

===============display alarm all history===============

…….

Error  11-09-01 07:55:59    LPU1 is failed, intelligent loopback heartbeat  has detected error  

Handling Process

Huawei queried NE logs to diagnose the problem.

A forwarding error caused by the chip failure of the LPU board in slot 1 was detected.

Sep  1 2011 07:52:19 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7

Sep  1 2011 07:52:21 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7

Sep  1 2011 07:52:21 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7

…….

#Sep  1 07:55:59 2011 NE40E SRM_BASE/1/ENTITYINVALID:OID 1.3.6.1.4.1.2011.5.25.129.2.1.9 Physical entity failed. (EntityPhysicalIndex=65537, BaseTrapSeverity=6, BaseTrapProbableCause=67843, BaseTrapEventType=5, EntPhysicalContainedIn=65536, EntPhysicalName="LPU Board 1", RelativeResource="", ReasonDescription="LPU1 is failed, intelligent loopback heartbeat has detected error")
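
The logbuffer entries above share a fixed prefix (timestamp, host name, module/severity/brief, slot), so repeated warnings per slot can be counted with a short script. The following is a minimal Python sketch, not a Huawei tool: the regular expression is inferred only from the sample lines in this case, and the names `LOG_RE` and `count_by_module_and_slot` are hypothetical. Trap lines that start with `#` use a different layout and are deliberately skipped.

```python
import re
from collections import Counter

# Layout inferred from the sample lines in this case, for example:
#   Sep  1 2011 07:52:19 NE40E %%01MEM/4/WARNING(l):-Slot=1; ...
LOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<year>\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"%%\d{2}(?P<module>\w+)/(?P<severity>\d)/(?P<brief>\w+)\((?P<type>\w)\):"
    r"-Slot=(?P<slot>\d+);"
)

def count_by_module_and_slot(lines):
    """Count matching log lines per (module, slot); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            counts[(match["module"], int(match["slot"]))] += 1
    return counts

# Sample input copied from the log excerpt in this case.
SAMPLE_LINES = [
    "Sep  1 2011 07:52:19 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7",
    "Sep  1 2011 07:52:21 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7",
    "Sep  1 2011 07:52:21 NE40E %%01MEM/4/WARNING(l):-Slot=1; NPS_... event_code = 7",
]

if __name__ == "__main__":
    for (module, slot), count in count_by_module_and_slot(SAMPLE_LINES).items():
        print(f"module={module} slot={slot} count={count}")
```

A burst of warnings from one module on a single slot, as seen here for slot 1, is a hint to check that board first.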

Loopback heartbeat and intelligent loopback heartbeat detection results indicated exceptions.

The problem persisted despite automatic recovery operations performed by the intelligent loopback heartbeat mechanism.

Action for recovery: RST_INGRESS_ME    //Resetting the micro engine in the ingress direction

Action for recovery: RST_EGRESS_ME     //Resetting the micro engine in the egress direction

Action for recovery: RST_FIC           //Resetting the FIC chip

Action for recovery: ISOLATE_INGRESS_ME    //Isolating the micro engine in the ingress direction

Action for recovery: ISOLATE_EGRESS_ME     //Isolating the micro engine in the egress direction

Because the BGP neighbor relationship between AR1 and its subtended DC1 and the LDP neighbor relationship between AR1 and AR2 were both down, GPRS traffic from DC1 destined for AR1 was rerouted through DC2 and AR2. The IS-IS adjacency between AR1 and AR2 remained up, so AR2 still forwarded the GPRS traffic to the LPU board in slot 1 on AR1. That board had a chip forwarding failure and could not forward the GPRS packets.

 

Why did the BGP neighbor relationship go down while the IS-IS adjacency stayed up?

The cause parameter recorded when the BGP neighbor went down was 4/0, which indicates a timeout in receiving BGP protocol packets. In this failure mode, IS-IS hello packets could still be transmitted, so the IS-IS adjacency was not affected.
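
As an illustration, the 4/0 value can be read as an error code and subcode pair. The sketch below assumes that the pair follows the NOTIFICATION error codes defined in RFC 4271, where code 4 is Hold Timer Expired; this matches the timeout described above, but the mapping to the device's cause parameter is an assumption, and the function name `describe_bgp_cause` is hypothetical.

```python
# Error codes for BGP NOTIFICATION messages as defined in RFC 4271.
BGP_ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

def describe_bgp_cause(cause: str) -> str:
    """Translate a 'code/subcode' string such as '4/0' into readable text.

    Assumption: the cause parameter uses the RFC 4271 error code/subcode pair.
    """
    code_text, subcode_text = cause.split("/")
    code, subcode = int(code_text), int(subcode_text)
    name = BGP_ERROR_CODES.get(code, "Unknown error code")
    return f"{name} (code {code}, subcode {subcode})"

if __name__ == "__main__":
    print(describe_bgp_cause("4/0"))
```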

 

Introduction to loopback heartbeat mechanism and intelligent loopback heartbeat mechanism

The two mechanisms are used for timely fault detection on LPUA/G and LPUF-10 boards (LPUF-21 uses another heartbeat mechanism).

Intelligent heartbeat detection

This mechanism emulates eight types of Layer 2 forwarding services (such as IPv4 and MPLS) in real time. If the number of transmitted packets equals the number of received packets, the mechanism regards the services as normal. Working principle:

Initially, a primary timer is started. For each service type, 10 heartbeat packets of the same type are sent every second for 10 consecutive seconds, so 100 packets are sent per service type. The status (normal, discarded, or modified) of each of the 100 packets is monitored. If more than 10% of them are discarded or abnormal, an alarm is reported indicating that the forwarding layer is faulty, and the control layer suspends the primary timer and starts a secondary timer.

Sep  1 2011 07:52:25 HAPDS-PC-SXNET-RT01-JianSheLu %%01FPI_DBG/4/HEARTBEATTIMER(D):-Slot=1; Proc HeartBeat Timer Switch: Forwarding of random packet is abnormal, Primary to Secondary.

 

While the secondary timer is running, one packet is sent every 200 ms. The status of each packet is monitored, and statistics about each micro engine (ME) and about packet drops are collected. If necessary, the troubleshooting mode is activated to reset or isolate MEs, or to reset the FIC or TM.
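
The switch from the primary timer to the secondary timer can be sketched as a simple threshold check. The Python below is only an illustration of the rule described above, not the board's firmware logic: the constants come from the text, while the names `ProbeResult`, `forwarding_faulty`, and `select_timer` are hypothetical.

```python
from dataclasses import dataclass

PACKETS_PER_SECOND = 10      # heartbeat packets sent each second (from the text)
SECONDS_PER_ROUND = 10       # number of one-second bursts (from the text)
FAULT_PERCENT = 10           # more than 10% bad packets means a fault (from the text)
SECONDARY_INTERVAL_MS = 200  # send interval under the secondary timer (from the text)

@dataclass
class ProbeResult:
    """Outcome of one primary-timer round for a single emulated service type."""
    service_type: str
    bad: int  # packets that were discarded or came back modified
    sent: int = PACKETS_PER_SECOND * SECONDS_PER_ROUND  # 100 packets per round

def forwarding_faulty(result: ProbeResult) -> bool:
    """True if strictly more than FAULT_PERCENT of the packets were bad."""
    # Integer arithmetic avoids floating-point rounding at the 10% boundary.
    return result.bad * 100 > result.sent * FAULT_PERCENT

def select_timer(results) -> str:
    """Return 'secondary' if any service type is faulty, otherwise 'primary'."""
    return "secondary" if any(forwarding_faulty(r) for r in results) else "primary"

if __name__ == "__main__":
    round_results = [ProbeResult("IPv4", bad=0), ProbeResult("MPLS", bad=11)]
    print(select_timer(round_results))
```

With exactly 10 bad packets out of 100 the check stays on the primary timer, because the text requires more than 10%.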

Loopback heartbeat detection

The heartbeat packets used by this mechanism have a simpler structure. One packet is sent every second. If a packet is not received within 30s, the fault threshold is crossed: the forwarding layer is considered faulty and the LPU board is reset.
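
The loopback heartbeat rule amounts to a watchdog on the time since the last heartbeat packet returned. The following Python sketch illustrates that idea under the 1-second and 30-second values given above. The class `LoopbackWatchdog` and its methods are hypothetical names, and treating "not received within 30s" as "more than 30 seconds since the last received packet" is an assumption.

```python
SEND_INTERVAL_S = 1  # one heartbeat packet per second (from the text)
TIMEOUT_S = 30       # reset if nothing loops back within 30 s (from the text)

class LoopbackWatchdog:
    """Tracks when the last heartbeat packet looped back successfully."""

    def __init__(self, start_s: float = 0.0):
        self.last_received_s = start_s

    def packet_received(self, now_s: float) -> None:
        """Record that a heartbeat packet returned at time now_s."""
        self.last_received_s = now_s

    def should_reset_board(self, now_s: float) -> bool:
        """True once more than TIMEOUT_S seconds passed without a packet."""
        return now_s - self.last_received_s > TIMEOUT_S

if __name__ == "__main__":
    watchdog = LoopbackWatchdog()
    watchdog.packet_received(now_s=5.0)
    # Simulate the forwarding plane going silent after second 5.
    for now in range(6, 40, SEND_INTERVAL_S):
        if watchdog.should_reset_board(now):
            print(f"reset LPU board at t={now}s")
            break
```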
Root Cause
The forwarding plane of the LPU board was faulty.
Solution
Power off the faulty LPU board and switch services to the backup plane.
Suggestions
To mitigate the impact of similar problems on services, it is essential to have a solution for quick troubleshooting and service recovery.
