Layer 2 Loop Causes OSPF Neighbors Between the NE40E and Its Upstream Routers to Go Down

Published: 2015-12-18
Problem Description

 

http://support.huawei.com/enterprise/product/images/9936c1d175fd40f49d828ce818543940

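As shown in the figure:

Network description:

The figure is a simplified topology of an enterprise network. The NE40E acts as the service router and is dual-homed upstream to the two core NE80E routers, with which it establishes OSPF neighbor relationships to exchange routes.

Downstream, the NE40E connects to the S9300 aggregation switch through an Eth-Trunk. The NE40E uses multiple Eth-Trunk dot1q sub-interfaces, each terminating a VLAN, as the gateways for the service subnets behind the S9300, and advertises these subnets into OSPF area 0.

Normal service traffic flows as follows: clients in the networks behind the S9300 -> S9300 -> NE40E -> NE80E -> Internet.

Problem: Clients throughout the network downstream of the NE40E reported that they could not access the Internet, and the NE80E routers had no routes to the service subnets downstream of the NE40E.

Key configuration: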

NE80E-1:

#

interface GigabitEthernet2/0/0    (downlink to NE40E GigabitEthernet1/0/0)

description to NE40E-1 gi 1/0/0

undo shutdown

ip address 172.19.131.13 255.255.255.252

traffic-policy ME60-1 inbound           

ospf cost 100

mpls

mpls ldp

dhcp snooping enable

dhcp snooping trusted



NE80E-2:

#

interface GigabitEthernet2/0/0    (downlink to NE40E GigabitEthernet3/0/0)

description to NE40E-1 gi-2/0/0

undo shutdown

ip address 172.19.131.41 255.255.255.252

traffic-policy ME60-2 inbound

ospf cost 300

mpls

mpls ldp



NE40E:

#

interface GigabitEthernet1/0/0    (uplink to NE80E-1)

undo shutdown

ip address 172.19.131.14 255.255.255.252

ospf cost 100

mpls

mpls ldp

#

interface GigabitEthernet3/0/0    (uplink to NE80E-2)

ip address 172.19.131.42 255.255.255.252

ospf cost 100

mpls

mpls ldp

#

interface Eth-Trunk5.12

control-vid 12 dot1q-termination

dot1q termination vid 12

ip address 172.16.227.254 255.255.254.0

arp broadcast enable

#

ospf 10 router-id 172.19.130.8
area 0.0.0.0
  network 172.19.130.8 0.0.0.0
  network 172.19.131.12 0.0.0.3
  network 172.19.131.40 0.0.0.3
  network 172.16.226.0 0.0.1.255    (covers interface Eth-Trunk5.12)

#
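The following is a minimal sketch, not part of the original case, that checks which OSPF network statement covers the gateway address 172.16.227.254. The helper names are hypothetical, and the sketch assumes contiguous wildcard masks, which holds for all four statements above.

```python
import ipaddress

# The network statements from "ospf 10" above, as (address, wildcard) pairs.
NETWORK_STATEMENTS = [
    ("172.19.130.8", "0.0.0.0"),
    ("172.19.131.12", "0.0.0.3"),
    ("172.19.131.40", "0.0.0.3"),
    ("172.16.226.0", "0.0.1.255"),
]


def wildcard_to_network(address: str, wildcard: str) -> ipaddress.IPv4Network:
    """Convert an address/wildcard pair into an IPv4Network.

    A wildcard mask is the bitwise inverse of a subnet mask, so
    0.0.1.255 corresponds to 255.255.254.0 (a /23).
    """
    wildcard_int = int(ipaddress.IPv4Address(wildcard))
    netmask = ipaddress.IPv4Address(wildcard_int ^ 0xFFFFFFFF)
    return ipaddress.IPv4Network(f"{address}/{netmask}")


def covering_statement(interface_ip: str, statements):
    """Return the first (address, wildcard) pair that covers interface_ip, or None."""
    ip = ipaddress.IPv4Address(interface_ip)
    for address, wildcard in statements:
        if ip in wildcard_to_network(address, wildcard):
            return (address, wildcard)
    return None


if __name__ == "__main__":
    # Address of Eth-Trunk5.12 on the NE40E.
    print(covering_statement("172.16.227.254", NETWORK_STATEMENTS))
```

Because 172.16.226.0 with wildcard 0.0.1.255 spans 172.16.226.0 through 172.16.227.255, the statement matches the Eth-Trunk5.12 address and OSPF is enabled on that sub-interface.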

Alarm Information

The NE40E reported an OSPF router ID conflict alarm and OSPF neighbor-down alarms. Eth-Trunk5 connects to the downstream S9300 switch.


Dec 10 2015 17:16:32 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/4/CONFLICT_RTID_INTF(l)[116]:OSPF router ID conflict was detected on the interface. (ProcessId=10, RouterId=172.19.130.8, AreaId=0.0.0.0, InterfaceName=Eth-Trunk5.12, IpAddr=172.16.227.254, PacketSrcIp=172.16.227.254)

Dec 10 2015 17:17:10 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/3/NBR_CHG_DOWN(l)[442870]:Neighbor event:neighbor state changed to Down. (ProcessId=10, NeighborAddress=172.19.130.1, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Dec 10 2015 17:17:10 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/6/NBR_DOWN_REASON(l)[442871]:Neighbor state leaves full or changed to Down. (ProcessId=10, NeighborRouterId=172.19.130.1, NeighborAreaId=0, NeighborInterface=GigabitEthernet1/0/0,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=[2015/12/10] 17:17:10)
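As a small worked example, the interval between the router ID conflict alarm and the neighbor-down alarm on GigabitEthernet1/0/0 can be computed from the two timestamps. The function name is hypothetical. The conflict timestamp marks when the conflict was detected, not necessarily when the loop formed, so the result is only a rough indication.

```python
from datetime import datetime

# Timestamp layout at the start of each log line in this case.
LOG_TIME_FORMAT = "%b %d %Y %H:%M:%S"


def seconds_between(earlier: str, later: str) -> int:
    """Return the number of whole seconds from one log timestamp to another."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return int((end - start).total_seconds())


# Timestamps copied from the CONFLICT_RTID_INTF and NBR_CHG_DOWN alarms.
CONFLICT_DETECTED = "Dec 10 2015 17:16:32"
NEIGHBOR_DOWN = "Dec 10 2015 17:17:10"

if __name__ == "__main__":
    print(seconds_between(CONFLICT_DETECTED, NEIGHBOR_DOWN), "seconds")
```

The gap is a little under 40 seconds. Together with NeighborEvent=InactivityTimer and "Hello Not Seen", this is consistent with the neighbor's Hello packets no longer being received from about the time the conflict appeared, assuming the default 10-second Hello and 40-second dead intervals (no ospf timer command appears in the configuration shown).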

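Troubleshooting Process

1. The fault was widespread and concentrated in area XX, whose traffic leaves the network through the NE40E shown in the figure. In such cases the problem usually lies in the gateway device or upstream of it. Pinging the gateway failed, so the gateway NE40E was suspected. The following alarms were found on the NE40E: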

Dec 10 2015 17:16:32 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/4/CONFLICT_RTID_INTF(l)[116]:OSPF router ID conflict was detected on the interface. (ProcessId=10, RouterId=172.19.130.8, AreaId=0.0.0.0, InterfaceName=Eth-Trunk5.12, IpAddr=172.16.227.254, PacketSrcIp=172.16.227.254)

Dec 10 2015 17:17:10 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/3/NBR_CHG_DOWN(l)[442870]:Neighbor event:neighbor state changed to Down. (ProcessId=10, NeighborAddress=172.19.130.1, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Dec 10 2015 17:17:10 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/6/NBR_DOWN_REASON(l)[442871]:Neighbor state leaves full or changed to Down. (ProcessId=10, NeighborRouterId=172.19.130.1, NeighborAreaId=0, NeighborInterface=GigabitEthernet1/0/0,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=[2015/12/10] 17:17:10)

Dec 10 2015 17:17:12 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/3/NBR_CHG_DOWN(l)[442946]:Neighbor event:neighbor state changed to Down. (ProcessId=10, NeighborAddress=172.19.130.2, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Dec 10 2015 17:17:12 JSYC-DHJ-XDL-NE40E-1 %%01OSPF/6/NBR_DOWN_REASON(l)[442947]:Neighbor state leaves full or changed to Down. (ProcessId=10, NeighborRouterId=172.19.130.2, NeighborAreaId=0, NeighborInterface=GigabitEthernet3/0/0,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=[2015/12/10] 17:17:12)

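2. With the neighbor relationships down, the routes were withdrawn. The NE80E routers no longer had routes to the service subnets downstream of the NE40E, so intranet traffic could not reach the public network through them.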

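3. Analyze the alarm logs above.

The neighbor relationships between the NE40E and the upstream NE80E routers went down right after the OSPF router ID conflict. Core devices are normally not reconfigured once they are running stably, so the router ID conflict was suspected of causing the OSPF neighbors to go down.

A closer look at the router ID conflict alarm shows that the router ID in the received packet, 172.19.130.8, is the NE40E's own router ID. The packet arrived on interface Eth-Trunk5.12, whose IP address is 172.16.227.254. The key point is that the source IP address of the received OSPF packet is also 172.16.227.254, that is, the address of the NE40E's own Eth-Trunk5.12 interface. This indicates a loop in the downstream network, which sent the OSPF Hello packets transmitted by the NE40E back to it.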
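The check described in this step can be sketched in a few lines: a router ID conflict whose PacketSrcIp equals the receiving interface's own IpAddr points to the device hearing its own packet, not to a second device using the same router ID. The function names and the regular expression are assumptions made for illustration, and the sample line is the conflict alarm from this case abridged to the fields the check uses.

```python
import re

# Matches "Key=value" pairs inside the parenthesised part of a log line.
FIELD_PATTERN = re.compile(r"(\w+)=([^,\s)]+)")


def parse_fields(log_line: str) -> dict:
    """Extract the Key=value pairs of a log line into a dictionary."""
    return dict(FIELD_PATTERN.findall(log_line))


def looks_like_looped_hello(log_line: str) -> bool:
    """Return True if a router ID conflict alarm carries the local interface
    address as packet source, which suggests a downstream Layer 2 loop."""
    if "CONFLICT_RTID_INTF" not in log_line:
        return False
    fields = parse_fields(log_line)
    source_ip = fields.get("PacketSrcIp")
    local_ip = fields.get("IpAddr")
    return source_ip is not None and source_ip == local_ip


# Conflict alarm from this case, abridged to the relevant fields.
SAMPLE = (
    "%%01OSPF/4/CONFLICT_RTID_INTF(l)[116]:OSPF router ID conflict was "
    "detected on the interface. (RouterId=172.19.130.8, "
    "InterfaceName=Eth-Trunk5.12, IpAddr=172.16.227.254, "
    "PacketSrcIp=172.16.227.254)"
)

if __name__ == "__main__":
    print(parse_fields(SAMPLE)["InterfaceName"], looks_like_looped_hello(SAMPLE))
```

If PacketSrcIp differed from IpAddr, the alarm would instead suggest a genuinely different device configured with the same router ID, and the investigation would start from that device.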

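4. The S9300 was checked for loops. It turned out that someone had changed the network below the S9300 that day and plugged in a network cable that created a loop in VLAN 12. Unplugging the cable resolved the problem.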

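5. Summary

Analysis of the OSPF and interface configuration shows that the OSPF network statements cover the IP address of Eth-Trunk5.12, so this interface continuously sends OSPF Hello packets. When the downstream network looped, these packets were returned to the NE40E and triggered the router ID conflict alarm. The conflict itself, however, was not what broke OSPF. The large volume of looped OSPF packets sent up to the CPU congested the interface boards' CPU-bound channel, packets were dropped there, and the OSPF neighbor relationships went down. Eth-Trunk5 has two member interfaces, one in slot 1 and one in slot 3, so the loop affected exactly the two interface boards that also host the uplinks GigabitEthernet1/0/0 and GigabitEthernet3/0/0.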

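Root Cause

A VLAN on the switch downstream of the NE40E formed a loop, a large number of OSPF packets were sent up to the CPU, and CP-CAR packet drops on the NE40E interrupted OSPF. The NE40E acts as the gateway for the downstream subnets and advertises them into the OSPF process with network statements, which also enables the gateway interfaces to send and receive OSPF packets. Once the downstream S9300 network loops, the OSPF packets sent from such an interface are looped back in large numbers, congest the interface board's CPU-bound channel, and cause packet loss that brings the OSPF neighbors down.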
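To make the mechanism more concrete, here is a rough back-of-the-envelope model. It is a deliberate simplification and not a description of how CP-CAR is actually implemented: it assumes that, once the CPU-bound channel is oversubscribed, each packet is admitted independently with probability capacity/offered load, and that a neighbor is lost when every Hello in one dead interval is dropped (four Hellos with the default 10-second Hello and 40-second dead timers). The function names and the packet rates in the usage example are hypothetical.

```python
def hello_pass_probability(capacity_pps: float, offered_pps: float) -> float:
    """Probability that a single packet is admitted to the CPU channel,
    assuming uniform random drops once the channel is oversubscribed."""
    if offered_pps <= capacity_pps:
        return 1.0
    return capacity_pps / offered_pps


def neighbor_loss_probability(
    capacity_pps: float,
    offered_pps: float,
    hellos_per_dead_interval: int = 4,
) -> float:
    """Probability that all Hellos within one dead interval are dropped."""
    drop = 1.0 - hello_pass_probability(capacity_pps, offered_pps)
    return drop ** hellos_per_dead_interval


if __name__ == "__main__":
    # Hypothetical figures: a channel that admits 1,000 pps, first under
    # normal load and then flooded with 100,000 pps of looped packets.
    print(neighbor_loss_probability(1000, 500))
    print(neighbor_loss_probability(1000, 100000))
```

Under these assumptions, a channel that is not oversubscribed never loses a neighbor this way, while one that is oversubscribed a hundredfold drops all four Hellos in a dead interval most of the time. That matches the qualitative picture in this case: the uplink Hellos were not wrong, they simply did not reach the CPU.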

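Solution

1. The fundamental fix is to deploy a spanning tree protocol in the switching network downstream of the NE40E to prevent loops.

 

2. Disable OSPF packet sending and receiving on the NE40E interfaces that serve as gateways for the downstream subnets, to prevent OSPF flapping.

 

The configuration is as follows: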

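ospf 10 router-id 172.19.130.8

 silent-interface Eth-Trunk5.12    (configured in the OSPF process view, not in the area view)

 area 0.0.0.0

It is recommended to disable OSPF packet sending and receiving on all OSPF interfaces that do not form neighbor relationships.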

 

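Recommendations and Summary

1. When deploying OSPF, configure silent-interface on every OSPF-enabled interface that does not need to form a neighbor relationship, so that it neither sends nor receives OSPF packets.

2. If an OSPF router ID conflict is reported and the packet source address in the alarm is the address of one of the device's own interfaces, suspect a Layer 2 loop on the peer side and check for loops.

3. Deploy a spanning tree protocol in switching networks to prevent loops caused by operational mistakes.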
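Recommendation 1 can be turned into a small helper that lists the silent-interface commands to apply. This is only a sketch: the function name is hypothetical, and the lists of OSPF-enabled interfaces and of interfaces expected to form neighbor relationships have to be supplied by the operator from the real configuration.

```python
def silent_interface_commands(process_id, ospf_interfaces, neighbor_interfaces):
    """Build the command lines that silence every OSPF-enabled interface
    on which no neighbor relationship is expected.

    process_id          -- OSPF process number, e.g. 10
    ospf_interfaces     -- interfaces covered by the OSPF network statements
    neighbor_interfaces -- interfaces that must keep exchanging OSPF packets
    """
    keep_active = set(neighbor_interfaces)
    lines = [f"ospf {process_id}"]
    for name in ospf_interfaces:
        if name not in keep_active:
            # silent-interface is entered in the OSPF process view.
            lines.append(f" silent-interface {name}")
    return lines


if __name__ == "__main__":
    # Interfaces of the NE40E named in this case.
    commands = silent_interface_commands(
        10,
        ["GigabitEthernet1/0/0", "GigabitEthernet3/0/0", "Eth-Trunk5.12"],
        ["GigabitEthernet1/0/0", "GigabitEthernet3/0/0"],
    )
    print("\n".join(commands))
```

For the interfaces named in this case, only Eth-Trunk5.12 is silenced, while the two uplinks to the NE80E routers keep forming neighbor relationships. The subnet of a silent interface is still advertised into OSPF, so the NE80E routers keep their routes to the downstream service subnets.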

END