Links in an Eth-Trunk Are Transiently Interrupted Because LACP Packets Are Lost Due to High CPU Usage

Publication Date: 2013-09-27
Issue Description

Device: NE80E

Version: V600R001C00SPC900

Symptom: Links in Eth-Trunk1 were transiently interrupted.

Log information recorded during the transient interruption was as follows:

RT19182P1:

Apr 30 2013 19:16:08 RT19182P1 %%01ISIS/4/ADJ_CHANGE(l)[57]:The neighbor of ISIS was changed. (IsisProcessId=21243, Neighbor=2120.0210.5042, InterfaceName=Eth-Trunk1, CurrentState=down, ChangeType=P2P_HOLDTIMER_EXPIRED)

Apr 30 2013 19:16:08 RT19182P1 %%01ISIS/4/PEER_DOWN_BFDDOWN(l)[58]:IS-IS 21243 neighbor 2120.0210.5042 was Down on interface Eth-Trunk1 because the BFD node was Down. The Hello packet was received at 19:12:12 last time; the maximum interval for sending Hello packets was 10001; the local router sent 16052 Hello packets and received 158 packets; the type of the Hello packet was P2P.

RT19182PE1:

Apr 30 2013 19:16:08 RT19182PE1 %%01ISIS/4/ADJ_CHANGE(l)[356]:The neighbor of ISIS was changed. (IsisProcessId=21243, Neighbor=2120.0210.5040, InterfaceName=Eth-Trunk1, CurrentState=down, ChangeType=CIRCUIT_DOWN)

Apr 30 2013 19:16:08 RT19182PE1 %%01IFNET/4/LINKNO_STATE(l)[357]:The line protocol on the interface Eth-Trunk1 has entered the DOWN state.

Apr 30 2013 19:16:08 RT19182PE1 %%01IFNET/4/IF_INFO_CHANGE(l)[358]:The interface Eth-Trunk1 changed the BaudHigh from 4 to 2.

Apr 30 2013 19:16:08 RT19182PE1 %%01IFNET/4/IF_INFO_CHANGE(l)[359]:The interface Eth-Trunk1 changed the Baud from 2820130816 to 1410065408.
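
The two IF_INFO_CHANGE entries can be read together as one bandwidth change. The following is a minimal sketch, assuming that BaudHigh and Baud are the high and low 32-bit words of a single bits-per-second value. The log text does not state this encoding, although the logged numbers fit it exactly. The function name `trunk_bandwidth_bps` is illustrative only.

```python
def trunk_bandwidth_bps(baud_high: int, baud_low: int) -> int:
    """Combine two 32-bit words into one value (assumed encoding, not confirmed by the log)."""
    return (baud_high << 32) | baud_low


# Values taken from the two IF_INFO_CHANGE log entries.
before = trunk_bandwidth_bps(4, 2820130816)
after = trunk_bandwidth_bps(2, 1410065408)

print(before)  # 20000000000
print(after)   # 10000000000
```

Under this assumption, the Eth-Trunk bandwidth fell from 20 Gbit/s to 10 Gbit/s, which is consistent with the trunk losing half of its member bandwidth at the moment of the interruption.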
Handling Process

Possible causes were as follows:

1. Physical links were faulty, which might be caused by device hardware faults or transmission device faults.

2. Configurations were inconsistent on the devices at the two ends of the Eth-Trunk.

3. A traffic burst occurred, which led to LACP packet loss.

4. The LACP protocol processing on the device had an error.

Huawei performed the following operations to diagnose the fault:

1. Checked the device logs and found BFD UP/DOWN records. Checked the status information of the related physical links, but found no UP/DOWN alarms about the physical links. Checked hardware information and determined that the related boards and physical links were functioning properly. Therefore, this issue was not caused by hardware faults.

Apr 30 2013 19:16:24 RT19182PE1 %%01BFD/6/STACHG_TODWN(l)[6914919]:Slot=2;BFD session changed to Down. (SlotNumber=2, Discriminator=4251, Diagnostic=NeighborDown, Applications=None, ProcessPST=False, BindInterfaceName=None, InterfacePhysicalState=None, InterfaceProtocolState=None)

Apr 30 2013 19:16:24 RT19182PE1 %%01BFD/6/STACHG_TODWN(l)[6914920]:Slot=2;BFD session changed to Down. (SlotNumber=2, Discriminator=4253, Diagnostic=NeighborDown, Applications=None, ProcessPST=False, BindInterfaceName=None, InterfacePhysicalState=None, InterfaceProtocolState=None)

Apr 30 2013 19:16:24 RT19182PE1 %%01BFD/6/STACHG_TODWN(l)[6914921]:Slot=2;BFD session changed to Down. (SlotNumber=2, Discriminator=4270, Diagnostic=NeighborDown, Applications=None, ProcessPST=False, BindInterfaceName=None, InterfacePhysicalState=None, InterfaceProtocolState=None)

Apr 30 2013 19:16:24 RT19182PE1 %%01BFD/6/STACHG_TODWN(l)[6914922]:Slot=2;BFD session changed to Down. (SlotNumber=2, Discriminator=4247, Diagnostic=NeighborDown, Applications=None, ProcessPST=False, BindInterfaceName=None, InterfacePhysicalState=None, InterfaceProtocolState=None)

2. Checked the configuration and found that the Eth-Trunk worked in static LACP mode. In this version, LACP uses the fast timeout mode by default. In fast mode, detection packets are sent at an interval of 1 second. If no packets are received from the peer end for 3 consecutive seconds, detection fails and the port status goes DOWN.

#

interface Eth-Trunk1

 mtu 9100

 description SITE-RT19182P1_ETHTRUNK1-20E9

 ip address 212.2.104.74 255.255.255.252

 isis enable 21243

 isis circuit-type p2p

 isis cost 50

 isis bfd static

 mpls

 mpls ldp

 mode lacp-static

 least active-linknumber 2

 trust upstream default

#
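
The configuration check in this step can be sketched as a small script. This is an illustration only: the helper `lacp_timeout_mode` is hypothetical, and it assumes that an explicit timeout setting appears in the interface block as a `lacp timeout fast` or `lacp timeout slow` line, with the version default applying otherwise.

```python
import re


def lacp_timeout_mode(interface_config: str, version_default: str = "fast") -> str:
    """Return the LACP timeout mode in effect for one interface block.

    Falls back to version_default when no explicit 'lacp timeout' line exists.
    The default of 'fast' matches what this case reports for V600R001.
    """
    match = re.search(r"^\s*lacp timeout (fast|slow)\b", interface_config, re.MULTILINE)
    return match.group(1) if match else version_default


# Abridged copy of the Eth-Trunk1 configuration shown in this case.
config = """
interface Eth-Trunk1
 mtu 9100
 isis bfd static
 mode lacp-static
 least active-linknumber 2
"""

print(lacp_timeout_mode(config))  # fast
```

Because the block contains no explicit timeout line, the sketch reports the version default, which is the fast mode in this case.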

3. Analyzed logs and found that the LACP status went DOWN due to a timeout and that the CPU usage on slot 5 was high during the same period (81%, above the 80% alarm threshold). It was therefore concluded that a burst of traffic sent to the CPU caused LACP packets to be lost, and as a result links in the Eth-Trunk were transiently interrupted.

%2013-Apr-30 19:16:08.810.4 RT19182PE1 01LACP/7/LACP_DEBUG_STRING(D)[63595047]:Disable CollectingDistributing before UpdateFwd  10 : 1774658854

%2013-Apr-30 19:16:08.950.125 RT19182PE1 01LACP/7/LACP_DEBUG_STRING(D)[63595784]:Disable CollectingDistributing after UpdateFwd  10 : 1774658993

%2013-Apr-30 19:16:08.950.126 RT19182PE1 01LACP/6/MUX_STE_CHANGE(D)[63595785]:The state in the MUX state machine changes.(TrunkName=Eth-Trunk1,PortName=GigabitEthernet5/0/0,MuxOldStatus=4,MuxNewStatus=3)

 

%2013-Apr-30 19:16:08.950.127 RT19182PE1 01LACP/6/RX_STE_CHANGE(D)[63595786]:The state in the RX state machine changes.(TrunkName=Eth-Trunk1,PortName=GigabitEthernet5/0/0,RxOldStatus=7,RxNewStatus=4)

%2013-Apr-30 19:16:10.250.1 RT19182PE1 01LACP/6/MUX_STE_CHANGE(D)[63595875]:The state in the MUX state machine changes.(TrunkName=Eth-Trunk1,PortName=GigabitEthernet5/0/0,MuxOldStatus=3,MuxNewStatus=4)

%2013-Apr-30 19:16:10.250.2 RT19182PE1 01LACP/6/RX_STE_CHANGE(D)[63595876]:The state in the RX state machine changes.(TrunkName=Eth-Trunk1,PortName=GigabitEthernet5/0/0,RxOldStatus=4,RxNewStatus=7)

 

Apr 30 2013 19:16:17 RT19182PE1 %%01VOSCPU/4/CPU_USAGE_HIGH(l)[6915064]:Slot=5;The CPU is overloaded, and the tasks with top three CPU occupancy are SOCK(19%), PES(12%), BMON(9%). (CpuUsage=81%, Threshold=80%)
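
The length of the interruption can be estimated from the two MUX_STE_CHANGE entries above. The sketch below is one possible way to do this. It assumes that the third dot-separated field of the timestamp is a sequence counter that can be ignored, and it makes no claim about what the numeric MUX states 3 and 4 mean. The names `PATTERN` and `parse` are illustrative.

```python
import re
from datetime import datetime

# The two MUX_STE_CHANGE entries for GigabitEthernet5/0/0, copied from the log.
LOG_LINES = [
    "%2013-Apr-30 19:16:08.950.126 RT19182PE1 01LACP/6/MUX_STE_CHANGE(D)[63595785]:"
    "The state in the MUX state machine changes.(TrunkName=Eth-Trunk1,"
    "PortName=GigabitEthernet5/0/0,MuxOldStatus=4,MuxNewStatus=3)",
    "%2013-Apr-30 19:16:10.250.1 RT19182PE1 01LACP/6/MUX_STE_CHANGE(D)[63595875]:"
    "The state in the MUX state machine changes.(TrunkName=Eth-Trunk1,"
    "PortName=GigabitEthernet5/0/0,MuxOldStatus=3,MuxNewStatus=4)",
]

PATTERN = re.compile(
    r"^%(\d{4}-\w{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\.\d+ "
    r".*MuxOldStatus=(\d+),MuxNewStatus=(\d+)"
)


def parse(line: str):
    """Return (timestamp, old_state, new_state) for one MUX_STE_CHANGE entry."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError("not a MUX_STE_CHANGE entry")
    ts = datetime.strptime(m.group(1), "%Y-%b-%d %H:%M:%S.%f")
    return ts, int(m.group(2)), int(m.group(3))


(left_at, old1, new1), (back_at, old2, new2) = (parse(line) for line in LOG_LINES)
duration = (back_at - left_at).total_seconds()
print(f"MUX state {old1}->{new1}, then {old2}->{new2} after {duration:.3f} s")
```

For these two entries the port left MUX state 4 and returned to it about 1.3 seconds later, which matches the description of a transient interruption.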

4. Changed the LACP timeout mode to slow. In slow mode, detection packets are sent at an interval of 30 seconds, and detection fails only if no packets are received from the peer end for 90 consecutive seconds. After the change, the Eth-Trunk was observed for a period of time, and no links were transiently interrupted when the CPU usage rose due to traffic bursts.

<HUAWEI> system-view

[HUAWEI] interface eth-trunk 1

[HUAWEI-Eth-Trunk1] lacp timeout slow
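
The difference between the two modes comes down to simple arithmetic on the intervals stated in this case. The sketch below models it with an illustrative `LacpTimeoutMode` type. The 5-second loss period is a hypothetical figure chosen for the example, not a measurement from this case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LacpTimeoutMode:
    name: str
    interval_s: int  # interval between detection packets
    timeout_s: int   # silence tolerated before detection fails


# Intervals and timeouts as described in this case.
FAST = LacpTimeoutMode("fast", interval_s=1, timeout_s=3)
SLOW = LacpTimeoutMode("slow", interval_s=30, timeout_s=90)


def detection_fails(mode: LacpTimeoutMode, silence_s: float) -> bool:
    """True if no packet arriving for silence_s seconds makes detection fail."""
    return silence_s >= mode.timeout_s


loss_period_s = 5  # hypothetical: LACP packets dropped for 5 seconds during a burst
for mode in (FAST, SLOW):
    print(mode.name, detection_fails(mode, loss_period_s))
```

With a 5-second loss period, detection fails in fast mode but not in slow mode. The trade-off is that in slow mode LACP alone needs up to 90 seconds to detect a genuine peer failure.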
Root Cause
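A burst of traffic sent to the CPU caused LACP packets to be lost. Because LACP worked in fast mode, detection timed out after 3 seconds without packets from the peer end, and links in the Eth-Trunk were transiently interrupted.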
Solution
The LACP working mode was changed to the slow mode.
Suggestions

1. If an Eth-Trunk needs to work in LACP mode, the slow mode is recommended. The slow mode can prevent transient Eth-Trunk interruption caused by LACP packet loss resulting from traffic bursts. In V300R003 and V600R001, LACP works in fast mode by default. In V600R003 and later versions, LACP works in slow mode by default.

2. If the priority of LACP packets sent to the CPU is low, manually raise it. LACP packets are then sent to the CPU ahead of lower-priority packets during traffic bursts, which keeps the Eth-Trunk operating normally.

The method for changing the priority of LACP packets sent to the CPU is as follows:

<HUAWEI> system-view

[HUAWEI] cpu-defend policy lacp

[HUAWEI-cpu-defend-policy-lacp] queue packet-type lacp 7

[HUAWEI-cpu-defend-policy-lacp] quit

[HUAWEI] cpu-defend-policy lacp global

The change does not affect the sending of other packets to the CPU.
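
The effect of a higher priority can be illustrated with a toy strict-priority queue. This is a simplified model, not the device's actual CPU queueing implementation. It assumes that a larger queue number means higher priority, and the priorities 3 and 1 as well as the service budget of 4 packets are hypothetical values.

```python
import heapq
from itertools import count


def drain(packets, budget):
    """Serve up to `budget` packets, highest priority first, FIFO within a priority."""
    seq = count()  # arrival order, used as a tie-breaker
    heap = [(-prio, next(seq), kind) for kind, prio in packets]
    heapq.heapify(heap)
    served = []
    while heap and len(served) < budget:
        _, _, kind = heapq.heappop(heap)
        served.append(kind)
    return served


# A burst of 8 other packets arrives before a single LACP packet.
burst_high = [("other", 3)] * 8 + [("lacp", 7)]
burst_low = [("other", 3)] * 8 + [("lacp", 1)]

print(drain(burst_high, 4))  # ['lacp', 'other', 'other', 'other']
print(drain(burst_low, 4))   # ['other', 'other', 'other', 'other']
```

In this model, the LACP packet in queue 7 is served even though it arrived last, whereas the same packet in a low-priority queue is left unserved once the budget is exhausted.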
