NE40E RR routers LPU high CPU usage caused by conflicted router ID

Publication Date: 2014-07-08
Issue Description

【Problem Description】
On 20 June 2014 (local time in country X), two NE40E-X3 route reflector (RR) routers reported high CPU usage alarms on their LPU boards. CPU usage on the affected LPUs stayed at about 80%.

 

Jun 20 2014 08:25:09 DST NE40EX3-1 %%01VOSCPU/4/CPU_USAGE_HIGH(l)[326]:Slot=1;The CPU is overloaded(CpuUsage=82%, Threshold=80%), and the tasks with top three CPU occupancy are:

VPR  total      : 43%

SOCK  total      : 10%

NonDopraTask(k)  total      : 9%

 

Jun 20 2014 08:27:43 DST NE40EX3-2 %%01VOSCPU/4/CPU_USAGE_HIGH(l)[232]:Slot=1;The CPU is overloaded(CpuUsage=83%, Threshold=80%), and the tasks with top three CPU occupancy are:

VPR  total      : 48%

SOCK  total      : 11%

NonDopraTask(k)  total      : 10%


【Topology】

Handling Process

1. In the high CPU usage alarms on NE40EX3-1 and NE40EX3-2, the top tasks were always VPR and SOCK. The VPR task sends protocol packets out from the CPU, while the SOCK task receives protocol packets and delivers them to the CPU. This indicated that the LPU CPU was busy handling protocol packets at that time.
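When many such alarms have to be reviewed, the same reading can be done with a small script. The following is a minimal sketch, not a vendor tool: it assumes the alarm text has the layout of the NE40EX3-1 alarm in this case, and the function name parse_cpu_alarm is hypothetical.

```python
import re

# Alarm text copied from the NE40EX3-1 alarm in this case.
ALARM = """\
Jun 20 2014 08:25:09 DST NE40EX3-1 %%01VOSCPU/4/CPU_USAGE_HIGH(l)[326]:Slot=1;The CPU is overloaded(CpuUsage=82%, Threshold=80%), and the tasks with top three CPU occupancy are:
VPR  total      : 43%
SOCK  total      : 10%
NonDopraTask(k)  total      : 9%
"""


def parse_cpu_alarm(text):
    """Return (slot, cpu_usage, threshold, {task: percent}) for one alarm."""
    head = re.search(r"Slot=(\d+);.*?CpuUsage=(\d+)%, Threshold=(\d+)%", text)
    if head is None:
        raise ValueError("not a CPU_USAGE_HIGH alarm")
    slot, usage, threshold = (int(g) for g in head.groups())
    # Each task line looks like "<name>  total      : <n>%".
    tasks = {
        name: int(pct)
        for name, pct in re.findall(r"^(\S+)\s+total\s*:\s*(\d+)%", text, re.M)
    }
    return slot, usage, threshold, tasks


slot, usage, threshold, tasks = parse_cpu_alarm(ALARM)
# VPR and SOCK are the protocol packet send/receive tasks described above.
protocol_share = tasks.get("VPR", 0) + tasks.get("SOCK", 0)
print(f"slot {slot}: {usage}% (threshold {threshold}%), VPR+SOCK = {protocol_share}%")
```

For this alarm, VPR and SOCK together account for 53 of the 82 percentage points, which matches the conclusion that protocol packet handling dominated the load.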

 


 

2. The CPU-defend statistics for LPU 1 on NE40EX3-1 showed that the passed-packets count for the white list was increasing rapidly.

The white list is a channel that delivers trusted protocol packets to the LPU CPU. For example, protocol packets from established BGP peers and OSPF neighbors pass through the white list.

 

[NE40EX3-1-diagnose]display cpu-defend statistics-all slot 1

CarID Index Packet-Info                 Passed-Packets      Dropped-Packets

--------------------------------------------------------------------------------

   25   158 PES_EXCP_CAUSE_DEFEND_WHITELIST           240414114     0

   25   158 PES_EXCP_CAUSE_DEFEND_WHITELIST           240419645     0

   25   158 PES_EXCP_CAUSE_DEFEND_WHITELIST           240484382     0

   25   158 PES_EXCP_CAUSE_DEFEND_WHITELIST           241368652     0
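The growth between readings can be quantified with a few lines of code. This sketch assumes the four rows above are successive readings of the same counter. The interval between readings is not recorded in the case, so only the increase per reading is computed, not a packets-per-second rate.

```python
# Passed-Packets values from the four rows above, in the order shown.
samples = [240414114, 240419645, 240484382, 241368652]

# Increase between each pair of consecutive readings.
deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
total_increase = samples[-1] - samples[0]

print(deltas)          # [5531, 64737, 884270]
print(total_increase)  # 954538
```

With zero dropped packets, all of this traffic was delivered to the LPU CPU.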

 

3. The BGP peer statistics on NE40EX3-1 showed that the number of BGP messages received from 10.33.76.1 and 10.33.76.2 was increasing rapidly and was far higher than from other peers. No BGP peer went down during the issue. This indicated that these two routers were sending an excessive number of BGP route updates to the NE40E-X3 RR routers.

 

[NE40EX3-1]DIS BGP VPNV4 ALL PEER

 BGP local router ID : 10.33.76.3

 Local AS number : 28469

 Total number of peers : 334          Peers in established state : 286

  Peer          V       AS  MsgRcvd  MsgSent  OutQ  Up/Down  State PrefRcv

  10.33.76.1      4       28469  1042810  2140470    0 0467h01m Established 865

  10.33.76.2      4       28469  1408258  2140436    0 0467h01m Established 792

  10.33.76.19     4       28469    28405  2140436    0 0467h01m Established 534

  10.33.76.20     4       28469    28362  2140436    0 0467h01m Established 394
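With 334 peers, picking out the noisy ones by eye is tedious. The sketch below parses the peer rows shown above and flags peers whose MsgRcvd is much larger than the quietest peer. The factor of 10 and the function names are assumptions chosen for illustration. Because all four peers here have the same uptime, comparing the raw counters is reasonable; peers with different uptimes would need a per-hour rate instead.

```python
# Peer rows copied from the output above.
PEER_TABLE = """\
  10.33.76.1      4       28469  1042810  2140470    0 0467h01m Established 865
  10.33.76.2      4       28469  1408258  2140436    0 0467h01m Established 792
  10.33.76.19     4       28469    28405  2140436    0 0467h01m Established 534
  10.33.76.20     4       28469    28362  2140436    0 0467h01m Established 394
"""


def parse_peers(text):
    """Map peer address -> MsgRcvd for rows in Established state."""
    peers = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: Peer V AS MsgRcvd MsgSent OutQ Up/Down State PrefRcv
        if len(fields) == 9 and fields[7] == "Established":
            peers[fields[0]] = int(fields[3])
    return peers


def noisy_peers(msg_rcvd, factor=10):
    """Peers that sent more than `factor` times the quietest peer's count."""
    baseline = min(msg_rcvd.values())
    return sorted(p for p, n in msg_rcvd.items() if n > factor * baseline)


msg_rcvd = parse_peers(PEER_TABLE)
print(noisy_peers(msg_rcvd))  # ['10.33.76.1', '10.33.76.2']
```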

 

4. We logged in to 10.33.76.2 (CX6X16-2) and enabled BGP update debugging (debugging bgp update). BGP routes were being updated frequently, originating from 10.33.80.29 and 10.33.81.158.

For example, the route 10.10.200.8/29 was flapping frequently between next hops 10.33.80.29 and 10.33.81.158.

 

Jun 20 2014 11:58:45.880.16-06:00 DST CX6X16-2 RM/6/RMDEBUG:

 BGP_L3VPN.VPNV4: Format UPDATEs for Group 0 with following destinations :

 

       MP_reach  : AFI/SAFI  1/128

       Origin    : Incomplete

       AS Path   : 

       Next Hop  : 10.33.80.29

       Local Pref: 100

       MED       : 65901

       ExtCommunity : RT <28469 : 600>, OSPF DOMAIN ID <0.0.0.0 : 0>,

              OSPF ROUTER ID <10.33.80.30 : 0>, OSPF RT <0.0.0.0 : 5 : 0>

       RD        : <28469:4600>

       10.10.200.8/29 (4722)

      

Jun 20 2014 11:58:50.810.10-06:00 DST CX6X16-2 RM/6/RMDEBUG:

 BGP_L3VPN.VPNV4: Format UPDATEs for Group 0 with following destinations :

 

       MP_reach  : AFI/SAFI  1/128

       Origin    : Incomplete

       AS Path   : 

       Next Hop  : 10.33.81.158

       Local Pref: 100

       MED       : 65601

       ExtCommunity : RT <28469 : 600>, OSPF DOMAIN ID <0.0.0.0 : 0>,

              OSPF ROUTER ID <10.33.80.30 : 0>, OSPF RT <0.0.0.0 : 5 : 0>

       RD        : <28469:4600>

       10.10.200.8/29 (4986)
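Flapping prefixes can be pulled out of a longer debug capture automatically. The following sketch assumes each UPDATE record lists Next Hop, then RD, then the prefix, as in the two records above. The sample text keeps only those lines, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Relevant lines from the two debug records above.
DEBUG = """\
Jun 20 2014 11:58:45.880.16-06:00 DST CX6X16-2 RM/6/RMDEBUG:
       Next Hop  : 10.33.80.29
       RD        : <28469:4600>
       10.10.200.8/29 (4722)
Jun 20 2014 11:58:50.810.10-06:00 DST CX6X16-2 RM/6/RMDEBUG:
       Next Hop  : 10.33.81.158
       RD        : <28469:4600>
       10.10.200.8/29 (4986)
"""


def next_hop_history(text):
    """Map (RD, prefix) -> list of next hops in the order they were seen."""
    history = defaultdict(list)
    next_hop = rd = None
    for line in text.splitlines():
        line = line.strip()
        if m := re.match(r"Next Hop\s*:\s*(\S+)", line):
            next_hop = m.group(1)
        elif m := re.match(r"RD\s*:\s*<(\S+)>", line):
            rd = m.group(1)
        elif m := re.match(r"(\d+\.\d+\.\d+\.\d+/\d+)\s", line):
            history[(rd, m.group(1))].append(next_hop)
    return history


def flap_count(hops):
    """Number of times the next hop changed between consecutive updates."""
    return sum(1 for a, b in zip(hops, hops[1:]) if a != b)


history = next_hop_history(DEBUG)
for (rd, prefix), hops in history.items():
    print(rd, prefix, hops, "changes:", flap_count(hops))
```

A prefix whose change count keeps growing over a capture is a flapping candidate; here 10.10.200.8/29 changes next hop within about five seconds.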

 

5. We logged in to 10.33.80.29 and 10.33.81.158, which were two E8000E firewalls. Their configurations showed that the same OSPF router IDs were configured on both devices, which was causing the BGP route flapping.

 

E8160E-1

ospf 15 router-id 10.33.80.158

ospf 113 router-id 10.33.80.65

 

E8160E-2

ospf 15 router-id 10.33.80.158

ospf 113 router-id 10.33.80.65
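A check like this can be run across saved configurations before deployment. The sketch below assumes the compared devices take part in the same OSPF domain and that router IDs appear on lines of the form "ospf <process-id> router-id <id>", as in the excerpts above. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Configuration excerpts from the two firewalls above.
CONFIGS = {
    "E8160E-1": "ospf 15 router-id 10.33.80.158\nospf 113 router-id 10.33.80.65\n",
    "E8160E-2": "ospf 15 router-id 10.33.80.158\nospf 113 router-id 10.33.80.65\n",
}


def find_router_id_conflicts(configs):
    """Map router ID -> [(device, process)] when several devices share it."""
    users = defaultdict(list)
    for device, text in configs.items():
        for proc, rid in re.findall(r"^ospf (\d+) router-id (\S+)", text, re.M):
            users[rid].append((device, int(proc)))
    # A router ID is a conflict only if more than one distinct device uses it.
    return {
        rid: uses
        for rid, uses in users.items()
        if len({device for device, _ in uses}) > 1
    }


conflicts = find_router_id_conflicts(CONFIGS)
for rid, uses in conflicts.items():
    print(rid, "is used by", uses)
```

Both router IDs are reported, since each is configured on E8160E-1 and E8160E-2.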

Root Cause

The same OSPF router IDs were configured on E8160E-1 and E8160E-2, which caused route flapping that reached the two NE40E-X3 RR routers. The two RRs were kept busy receiving and re-advertising the updated BGP routes, which eventually drove the CPU usage of their LPUs high.

 


Solution

Change the conflicting router ID on one of the two firewalls so that each device has a unique router ID, then reset the corresponding OSPF processes so that the new router ID takes effect.

Suggestions

Pay attention to OSPF router ID uniqueness when configuring devices.
