BGP Connections Were Interrupted Due to an Over-sized LDP over TE Table

Publication Date: 2013-09-13
Issue Description

Versions:
RR: NE40E V00R001C00SPS800
BR: NE80E V300R003C02B697
CR: NE5000E V300R005C01

Network topology:

(Topology diagram not included.)

Symptom:

BR1 and BR2 had established BGP VPNv4 connections with VPN RR1 and RR2. On 17 November 2011, the BGP VPNv4 connections between BR2 and RR2 were interrupted. Ping test results were as follows:

Ping Test Initiator                   Pinged Target              Ping Test Result
RR2                                   Loopback address of BR2    Failed
CR3                                   Loopback address of BR2    Successful
BR2                                   Loopback address of RR2    Successful
BR2 (with source loopback address)    Loopback address of RR2    Failed

All links between BR2 and RR2 were normal and no packet loss occurred.
At 02:00 the next morning, the BGP connections between BR2 and RR2 were restored automatically.

ping 210.78.0.246 (BR2 pinging RR2, succeeded)

ping 210.78.0.246: 56 data bytes, press ctrl_c to break
reply from 210.78.0.246: bytes=56 sequence=1 ttl=253 time=30 ms
reply from 210.78.0.246: bytes=56 sequence=2 ttl=253 time=30 ms
reply from 210.78.0.246: bytes=56 sequence=3 ttl=253 time=30 ms
reply from 210.78.0.246: bytes=56 sequence=4 ttl=253 time=29 ms
reply from 210.78.0.246: bytes=56 sequence=5 ttl=253 time=30 ms

ping -a 210.78.0.64 210.78.0.246 (BR2 pinging RR2 with the source IP, failed)
ping 210.78.0.246: 56 data bytes, press ctrl_c to break
request time out
request time out
request time out
request time out
request time out

tracert -a 210.78.0.64 210.78.0.246 (BR2 tracert to RR2 with the source IP, terminated at CR3)
traceroute to 210.78.0.246 (210.78.0.246), max hops: 30 ,packet length: 40
1 210.78.7.185 17 ms 16 ms 18 ms
2 210.78.5.182 32 ms 32 ms 31 ms
3 * * *

tracert 210.78.0.246 (BR2 tracert to RR2 without the source IP, succeeded)
traceroute to 210.78.0.246(210.78.0.246), max hops: 30 ,packet length: 40
1 210.78.7.185 17 ms 16 ms 16 ms
2 210.78.5.182 31 ms 35 ms 32 ms
3 210.78.8.106 36 ms 37 ms 36 ms

ping 210.78.0.64 (RR2 pinging BR2, failed)
ping 210.78.0.64: 56 data bytes, press ctrl_c to break
request time out
request time out
request time out
request time out
request time out

ping -a 210.78.0.246 210.78.0.64 (RR2 pinging BR2 with the source IP, failed)
ping 210.78.0.64: 56 data bytes, press ctrl_c to break
request time out
request time out
request time out
request time out
request time out

tracert -a 210.78.0.246 210.78.0.64 (RR2 tracert to BR2 with the source IP, terminated at CR3)
traceroute to 210.78.0.64(210.78.0.64), max hops: 30 ,packet length: 40
1 210.78.8.105 7 ms 5 ms 14 ms
2 * * *
3 * * *

tracert 210.78.0.64 (RR2 tracert to BR2 without the source IP, terminated at CR3)
traceroute to 210.78.0.64(210.78.0.64), max hops: 30 ,packet length: 40
1 210.78.8.105 1 ms 11 ms 6 ms
2 * * *
3 * * *
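
The point at which each trace stops can be read off mechanically. The following is a minimal Python sketch, assuming the tracert output format shown above (a hop number followed by either an address or an asterisk). The function name last_responding_hop is hypothetical and is not part of any Huawei tool.

```python
def last_responding_hop(tracert_output):
    """Return the address of the last hop that answered, or None if none did."""
    last = None
    for line in tracert_output.splitlines():
        fields = line.split()
        # Hop lines start with a hop number; "*" marks a probe that timed out.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] != "*":
            last = fields[1]
    return last

# Output copied from the two failed traces in this case.
br2_to_rr2 = """traceroute to 210.78.0.246 (210.78.0.246), max hops: 30 ,packet length: 40
1 210.78.7.185 17 ms 16 ms 18 ms
2 210.78.5.182 32 ms 32 ms 31 ms
3 * * *"""

rr2_to_br2 = """traceroute to 210.78.0.64(210.78.0.64), max hops: 30 ,packet length: 40
1 210.78.8.105 7 ms 5 ms 14 ms
2 * * *
3 * * *"""

print(last_responding_hop(br2_to_rr2))
print(last_responding_hop(rr2_to_br2))
```

For the traces above, the sketch reports 210.78.5.182 and 210.78.8.105, the last addresses that answered before the probes timed out.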

 

According to information on the NMS, the BGP connections between BR2 and RR2 were interrupted at 14:03 on 17 November and restored at 02:00 on 18 November.

BR2's log about BGP interruption:

nov 17 2011 14:03:41 cqcq-jb-b-2 %%01bgp/6/send_notify(l): the router sent a notification message to peer 210.78.0.246. (errorcode=4, suberrorcode=0, bgpaddressfamily=public, errordata=null) (BR2 did not receive Keepalive messages within 180s and sent a Notification message to RR2.)

nov 17 2011 14:03:41 cqcq-jb-b-2 %%01bgp/3/state_chg_updown(l): the status of the peer 210.78.0.246 changed from established to idle. (bgpaddressfamily=public)

RR2's log about BGP interruption:

nov 17 2011 14:03:41 sxxa-dwl-irr-2 %%01bgp/6/recv_notify(l)[17670]:the router received notification message from peer 210.78.0.64. (errorcode=4, suberrorcode=0, bgpaddressfamily=public, errordata=null) (RR2 received the notification sent from BR2.)

nov 17 2011 14:03:41 sxxa-dwl-irr-2 %%01bgp/3/state_chg_updown(l)[17671]:the status of the peer 210.78.0.64 changed from established to idle. (bgpaddressfamily=public)
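
The error code in these logs can be decoded against the NOTIFICATION error codes defined in RFC 4271, where code 4 means Hold Timer Expired. The following is a minimal Python sketch. The regular expression and the function name decode_notification are illustrative assumptions based only on the log format shown above, not a Huawei utility.

```python
import re

# NOTIFICATION error codes from RFC 4271, section 4.5.
BGP_ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

# Pattern assumed from the send_notify/recv_notify lines in this case.
_NOTIFY_RE = re.compile(
    r"peer (?P<peer>\d+\.\d+\.\d+\.\d+)\.\s*"
    r"\(errorcode=(?P<code>\d+), suberrorcode=(?P<sub>\d+)"
)

def decode_notification(log_line):
    """Return (peer, error name) for a notification log line, or None."""
    m = _NOTIFY_RE.search(log_line)
    if m is None:
        return None
    code = int(m.group("code"))
    return m.group("peer"), BGP_ERROR_CODES.get(code, "Unknown (%d)" % code)

line = ("nov 17 2011 14:03:41 cqcq-jb-b-2 %%01bgp/6/send_notify(l): the router sent "
        "a notification message to peer 210.78.0.246. (errorcode=4, suberrorcode=0, "
        "bgpaddressfamily=public, errordata=null)")
print(decode_notification(line))
```

For the BR2 log line, the sketch returns the peer 210.78.0.246 and the error name Hold Timer Expired, which matches the 180s Keepalive timeout noted above.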
Handling Process

To address the issue, Huawei performed the following operations and observed the following information:

1. Checked links between BR2 and RR2.
The links were normal.

2. Checked routing information about RR2, CR3, CR4, and BR2.
The routing information was correct and no route flapping occurred.

3. Checked the forwarding plane.
LDP was enabled between CRs, and between CRs and BRs. LDP over TE was deployed between CR1 and CR2, and between CR3 and CR4. LDP was not enabled between RRs and CRs. Label-based forwarding was triggered for the loopback addresses of the BRs and CRs, but not for those of the RRs. Therefore, BGP packets that RR2 sent to BR2 were forwarded based on MPLS labels after arriving at CR3, whereas BGP packets that BR2 sent to RR2 were forwarded based on IP addresses.

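According to the preceding ping test results, IP-based packet forwarding between BR2 and RR2 was normal and MPLS-based forwarding was abnormal. Every failed test involved a packet that RR2 sent toward the loopback address of BR2, either as the ping request or as the reply to a request that BR2 sourced from its loopback address.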
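
The reasoning behind this conclusion can be sketched as a small model of the four ping tests. The following Python sketch rests on two assumptions that the case does not state explicitly: a ping without a specified source uses an interface address, so its reply is destined for that interface address, and only packets that RR2 sends to the loopback address of BR2 are label-forwarded at CR3. The names Leg, label_forwarded_at_cr3, and predict are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Leg:
    sender: str   # device that originates this leg of the ping
    dest: str     # address the leg is destined for

def label_forwarded_at_cr3(leg):
    # From the case: packets RR2 sends to BR2's loopback are label-forwarded at CR3.
    return leg.sender == "RR2" and leg.dest == "BR2 loopback"

def predict(legs):
    # A test fails if either its request or its reply takes the faulty MPLS path.
    return "Failed" if any(label_forwarded_at_cr3(l) for l in legs) else "Successful"

# (request leg, reply leg) and the observed result from the ping table.
TESTS = {
    "RR2 -> BR2 loopback":
        ((Leg("RR2", "BR2 loopback"), Leg("BR2", "RR2 interface")), "Failed"),
    "CR3 -> BR2 loopback":
        ((Leg("CR3", "BR2 loopback"), Leg("BR2", "CR3 interface")), "Successful"),
    "BR2 -> RR2 loopback":
        ((Leg("BR2", "RR2 loopback"), Leg("RR2", "BR2 interface")), "Successful"),
    "BR2 (source loopback) -> RR2 loopback":
        ((Leg("BR2", "RR2 loopback"), Leg("RR2", "BR2 loopback")), "Failed"),
}

for name, (legs, observed) in TESTS.items():
    print("%-40s predicted=%-10s observed=%s" % (name, predict(legs), observed))
```

Under these assumptions, the predicted result matches the observed result for all four tests, which is consistent with a fault limited to MPLS forwarding at CR3.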

4. Concluded from the preceding ping and tracert results that the problem was on CR3. Analyzed traffic statistics on RR2, CR3, CR4, and BR2 and found that MPLS packets were lost on CR3.

5. Analyzed how MPLS packets were forwarded from RR2 to BR2.

a. Queried packet forwarding information on CR3.
After arriving at CR3, BGP packets that RR2 sent to BR2 were forwarded over an LDP over TE tunnel, as the FIB entry shows:
dis fib 210.78.0.64
route entry count: 1
destination/mask   nexthop       flag   timestamp    interface   tunnelid
210.78.0.64/32     210.78.0.12   dghu   t[1168570]   tun8/0/2    0x2013adc

b. Queried the FIBv4 entry for 210.78.0.64 on LPU7, the uplink board of CR3, to obtain the LBTv4 index (16114), and found that the tunnel table entry at that index was null.
Because the tunnel table lookup returned no entry, BGP packets destined for BR2 could not be forwarded, which interrupted the BGP connections.

[snxa-dwl-c-2-hidecmd]dis pe-entry 7 0 fibv4 210.78.0.64 0

#tcam addr: 5640 (0x1608)
20 00 69 27 00 20 00 00 00 --------------------------------- key-tcam
tid_fib_l = 1 vrid = 0 ipv4 = 210.78.0.64
00 00 00 00 00 00 7f ff ff --------------------------------- mask-tcam
ipv4mask = 255.255.255.255
#ddr ram addr: 5640 (0x1608)
15 00 00 0f bc 84 00 00 00 00 00 00 00 00 00 00 ---------- re-dram
opcode = 5 ag = 0 enttldec = 1
qppbaction = 0 ( no qppb param ) ce2home = 0
nhp_type = 0 nhp_ptr1 = 16114 cnt1 = 1
nhp_ptr2 = 0 cnt2 = 0 pushlabel1 = 0
pushlabel2 = 0 outcnt1(4) = 0 outcnt2(4) = 0

[snxa-dwl-c-2-hidecmd]dis pe-entry 7 0 lbtv4 16114

nhpindex = 16114
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ---------- nhp-dram (null)
master: m/s = 0 opcode = 0 isvlanif = 0 oportinfo = 0
aid = 0 rtalert = 0 setqidx = 0 qindex = 0
slave : enfrr = 0 opcode = 0 isvlanif = 0 oportinfo = 0
aid = 0 rtalert = 0 setqidx = 0 qindex = 0
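
The check performed in step b can be sketched in Python: read the next-hop pointer from the FIB entry, then test whether the entry it points to contains only zero bytes. This is a minimal sketch that assumes the dump layout shown above (a run of two-digit hex bytes followed by dashes, and "name = value" fields). The helper names hex_bytes and is_null_entry are hypothetical.

```python
import re

def hex_bytes(dump_line):
    """Extract the leading run of two-digit hex bytes from a dump line."""
    m = re.match(r"\s*((?:[0-9a-fA-F]{2}\s+)+)-", dump_line)
    if m is None:
        return b""
    return bytes.fromhex("".join(m.group(1).split()))

def is_null_entry(dump_line):
    """True if the line holds at least one byte and every byte is zero."""
    data = hex_bytes(dump_line)
    return len(data) > 0 and not any(data)

# Lines copied from the output in this case.
fib_fields_line = "nhp_type = 0 nhp_ptr1 = 16114 cnt1 = 1"
re_dram_line = "15 00 00 0f bc 84 00 00 00 00 00 00 00 00 00 00 ---------- re-dram"
nhp_dram_line = "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ---------- nhp-dram (null)"

# Collect "name = number" pairs to find the next-hop pointer.
fields = dict(re.findall(r"(\w+)\s*=\s*(\d+)", fib_fields_line))
print("FIB entry points to index", fields["nhp_ptr1"])
print("FIB entry is null:", is_null_entry(re_dram_line))
print("Tunnel table entry is null:", is_null_entry(nhp_dram_line))
```

For the output above, the FIB entry is populated and points to index 16114, while the entry at that index is all zeros. That is the inconsistency described in the root cause.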
Root Cause

When routes were generated on CR3, the FIB module refreshed the FIB and the MPLS module delivered the tunnel table. CR3 held 1209 LDP over TE entries, exceeding the specification of 1000. When route flapping occurred, a timing problem caused the FIB and the tunnel table to become inconsistent, resulting in the BGP interruption.

Why were the BGP connections restored automatically at 02:00 the next morning?

Huawei equipment automatically refreshed its table entries at 02:00 the next morning, which made the data in the FIB and the tunnel table consistent again.
Solution
LDP was enabled between RRs and CRs so that label-based forwarding could be triggered on RRs. As a result, the CRs act as transit nodes when forwarding the BGP packets that RRs send to BRs over LDP over TE. A transit node looks up the tunnel table based on the in-segment table instead of the FIB, and this tunnel table can contain a maximum of 4000 entries, which avoids the preceding problem.
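
The effect of the two limits can be checked with simple arithmetic. The following Python sketch uses the figures from this case (1209 entries, a specification of 1000, and a maximum of 4000 for the tunnel table used by transit nodes). It assumes that the number of entries on CR3 stays the same after the change, which the case does not state. The function name utilization is hypothetical.

```python
def utilization(entries, limit):
    """Return (fraction of the limit in use, number of entries beyond the limit)."""
    return entries / limit, max(0, entries - limit)

LDP_OVER_TE_ENTRIES = 1209   # observed on CR3
FIB_PATH_LIMIT = 1000        # specification cited in the root cause
TRANSIT_PATH_LIMIT = 4000    # tunnel table maximum cited in the solution

for name, limit in (("FIB-based lookup", FIB_PATH_LIMIT),
                    ("in-segment lookup", TRANSIT_PATH_LIMIT)):
    used, excess = utilization(LDP_OVER_TE_ENTRIES, limit)
    print("%s: %.0f%% of limit, %d entries over" % (name, used * 100, excess))
```

With these figures, the FIB-based lookup is 209 entries over its specification, while the same number of entries stays within the 4000-entry maximum.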
Suggestions

  • If ping tests are abnormal, check whether the ping packets in each direction are forwarded along the same path.
  • If a specification-related problem cannot be resolved in a timely manner, contact GTAC or R&D personnel.
