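S7712 main control board receives a large number of TC packets, and frequent ARP entry refreshes cause the device to restart abnormally.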

Published: 2018-01-02  |  Views: 511  |  Downloads: 31  |  Author: sunzhongbao  |  Document ID: EKB1001185017

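Problem Description

I. Fault Symptoms

  A customer reported that their LAN core switch, an S7712, had failed: the system RUN indicators on all service cards were blinking fast, all port indicators were off, and downstream services were interrupted. Service was restored as an emergency measure by power-cycling the device.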

II. Device Version/Patch Information

  V200R003C00SPC500+Null

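III. Service Impact

  This switch is the customer's LAN core switch. It serves nearly 2,000 downstream users and carries important services such as the customer's portal website, so the service impact was significant.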

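Handling Process

1. Confirmed with the on-site engineer that when the problem occurred, the port indicators on all service boards were off and the system indicators on all interface boards were blinking green fast.

2. The logs show that slot 13 held the active main control board and slot 14 held the standby main control board. At about 14:57:28, the board in slot 14 became active and reported that its interconnect channel to the board in slot 13 was faulty, so the slot 13 board is suspected to have reset abnormally at that moment. Because the device was powered off, some of the logs from before the power-off were not recorded.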

Aug 31 2017 14:57:28+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[637]:No.0 channel from slot 1/14 to slot 1/13 is faulty.

Aug 31 2017 14:57:28+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[638]:All channels from slot 1/14 to slot 1/13 are faulty.

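3. After the board in slot 14 became the active main control board, at 14:58:04 all HG ports interconnecting it with the interface boards went DOWN.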

%2017-Aug-31 14:58:04.900.1+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1415]:Slot 14 layer DRV module AV level INFO: unit 0 hg5 change to down.
%2017-Aug-31 14:58:04.900.2+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1416]:Slot 14 layer DRV module AV level INFO: unit 0 hg8 change to down.
%2017-Aug-31 14:58:04.900.3+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1417]:Slot 14 layer DRV module AV level INFO: unit 0 hg12 change to down.
%2017-Aug-31 14:58:05.910.1+00:00 2_L_YCS_C_S7712-0001 01SDKE/6/INFO(D)[1418]:Slot 14 layer DRV module AV level INFO: unit 0 hg13 change to down.

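4. The logs show that the communication channels between all interface boards and the main control board were faulty. The interface boards received no heartbeat packets from the main control board and reset, which interrupted services.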

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[639]:No.0 channel from slot 1/14 to slot 1/1 is faulty.

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[640]:All channels from slot 1/14 to slot 1/1 are faulty.

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[641]:No.0 channel from slot 1/14 to slot 1/8 is faulty.

Aug 31 2017 14:58:10+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[642]:All channels from slot 1/14 to slot 1/8 are faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[643]:No.0 channel from slot 1/14 to slot 1/3 is faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[644]:All channels from slot 1/14 to slot 1/3 are faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/3/CHANNEL_FAULTY(l)[645]:No.0 channel from slot 1/14 to slot 1/12 is faulty.

Aug 31 2017 14:58:12+08:00 2_L_YCS_C_S7712-0001 %%01ALML/0/ALL_CHANNEL_FAULTY(l)[646]:All channels from slot 1/14 to slot 1/12 are faulty.

5. Both before and after the problem occurred, the device frequently received a large number of TC packets.

Aug 31 2017 13:31:53+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12968]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:33:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12973]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:33:56+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12979]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:35:09+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12986]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:35:14+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12989]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:35:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12993]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:36:08+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12996]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:36:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13002]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:36:43+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13005]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:37:24+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13009]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:37:50+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13014]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:38:27+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13017]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:39:51+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13024]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:39:55+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13027]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:39:59+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13030]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

Aug 31 2017 13:40:31+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[13035]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.

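6. Before this incident, while the board in slot 14 was acting as the active main control board, it had already restarted abnormally because it frequently received a large number of TC packets.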

Aug 30 2017 07:50:46+08:00 2_L_YCS_C_S7712-0001 %%01ALML/4/ENTRESET(l)[4973]:MPU frame[1] board[14] is reset. The reason is: VRP reset selfboard because of find deadloop.

7. The main control board recorded multiple infinite-loop (deadloop) exceptions caused by TC-triggered ARP refreshes.

============ Task Infinite Loop Information Begin ============

 Dopra Version                    = DOPRA V100R006C09CP0671

 Application Version              = VRPV500R013C00SPC295-GR

 Task Infinite Loop Type          = Task overrun

 Task Infinite Loop Handle        = Suspend Task

 Task Infinite Loop CpuId         = 13

 Overrun Task Name                = L2IF

 Overrun Task VOS ID              = 214

 Overrun Task Osal ID             = 0x0883fba0

 Task Overrun Threshold           = 20000 (ms)

 Task Has-run Time                = 20000 (ms)

 Task Infinite Loop Occur Time    = [2017.08.29  07:31:08]

 Task Infinite Loop Occur Cputick = [0x0000187c, 0xd11773bb]

 

 Task switch trace info before task infinite loop:

 From cputick [0,0] to cputick [0x187c,0xd11773bb]

 -------------------------------------------------------------

 No. TaskName        VosTID  OsalTID     Prio  RunTime[s, ns]

 No Task switch trace info!!!

 

 ============ Task Infinite Loop Information Begin ============

 Dopra Version                    = DOPRA V100R006C09CP0671

 Application Version              = VRPV500R013C00SPC295-GR

 Task Infinite Loop Type          = Task overrun

 Task Infinite Loop Handle        = Suspend Task

 Task Infinite Loop CpuId         = 13

 Overrun Task Name                = L2IF

 Overrun Task VOS ID              = 214

 Overrun Task Osal ID             = 0x08841300

 Task Overrun Threshold           = 20000 (ms)

 Task Has-run Time                = 20000 (ms)

 Task Infinite Loop Occur Time    = [2017.08.29  17:57:23]

 Task Infinite Loop Occur Cputick = [0x000017e5, 0x932a814c]

 Task switch trace info before task infinite loop:

 From cputick [0,0] to cputick [0x17e5,0x932a814c]

 -------------------------------------------------------------

 No. TaskName        VosTID  OsalTID     Prio  RunTime[s, ns]

 No Task switch trace info!!!

 

 Corresponding task call stack info:

 -------------------------------------------------------------

 

============ Task Infinite Loop Information Begin ============

 Dopra Version                    = DOPRA V100R006C09CP0671

 Application Version              = VRPV500R013C00SPC295-GR

 Task Infinite Loop Type          = Task overrun

 Task Infinite Loop Handle        = Suspend Task

 Task Infinite Loop CpuId         = 13

 Overrun Task Name                = L2IF

 Overrun Task VOS ID              = 214

 Overrun Task Osal ID             = 0x088444c0

 Task Overrun Threshold           = 20000 (ms)

 Task Has-run Time                = 20000 (ms)

 Task Infinite Loop Occur Time    = [2017.08.29  20:18:33]

 Task Infinite Loop Occur Cputick = [0x0000055e, 0x196312b2]

 Task switch trace info before task infinite loop:

 From cputick [0,0] to cputick [0x55e,0x196312b2]

 -------------------------------------------------------------

 No. TaskName        VosTID  OsalTID     Prio  RunTime[s, ns]

 No Task switch trace info!!!
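
The TC log entries shown in step 5 can be summarized per port to see which interface is the source of the topology changes. The following is a minimal sketch, assuming the log has been saved as plain text in the format shown above; the function name `summarize_tc` and the three-line sample are illustrative and not part of any Huawei tool. Month names are parsed with `strptime`, which assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Three sample lines copied from the step 5 log excerpt.
LOG = """\
Aug 31 2017 13:31:53+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12968]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:33:22+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12973]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
Aug 31 2017 13:33:56+08:00 2_L_YCS_C_S7712-0001 %%01MSTP/6/RECEIVE_MSTITC(l)[12979]:MSTP received BPDU with TC, MSTP process 0 instance 0, port name is GigabitEthernet1/1/10.
"""

# Capture the timestamp (without the UTC offset) and the port name.
TC_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2})[+-]\d{2}:\d{2} .*"
    r"MSTP/6/RECEIVE_MSTITC.*port name is (?P<port>\S+?)\.$"
)


def summarize_tc(log_text):
    """Return (per-port TC counts, per-port (first, last) timestamps)."""
    counts = Counter()
    spans = {}
    for line in log_text.splitlines():
        m = TC_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and non-TC log entries
        ts = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S")
        port = m["port"]
        counts[port] += 1
        first, last = spans.get(port, (ts, ts))
        spans[port] = (min(first, ts), max(last, ts))
    return counts, spans


if __name__ == "__main__":
    counts, spans = summarize_tc(LOG)
    for port, n in counts.most_common():
        first, last = spans[port]
        seconds = (last - first).total_seconds()
        print(f"{port}: {n} TC BPDUs over {seconds:.0f} s")
```

A port that dominates the counts, as GigabitEthernet1/1/10 does in the excerpt, is the natural place to start looking for the downstream device that keeps generating topology changes.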


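Root Cause

 1. The active main control board in slot 13 received a large number of TC packets, and the resulting frequent ARP entry refreshes triggered a known issue that caused an abnormal restart. After the board in slot 14 became the active main control board, a hardware fault on that board caused its communication channels to all interface boards to fail. All interface boards then reset for lack of heartbeats, which in turn interrupted all downstream services.

2. The known issue is: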


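Solution

1. The device has no patch loaded. Load patch V200R003SPH022 to prevent the problem from recurring.

2. Configure the following optimization commands on the device:

  1) stp tc-protection. This ensures that when the device frequently receives TC packets, it processes an entry refresh at most once in each 2-second period, which reduces the excessive CPU load caused by frequent MAC and ARP entry refreshes.

  2) arp topology-change disable and mac-address update arp. By default, after receiving a TC packet the device clears MAC entries and ages out ARP entries. When the device holds many ARP entries, relearning them generates an excessive number of ARP packets on the network. With these two commands configured, when the network topology changes the device updates the outbound interface of each ARP entry according to the change in the outbound interface of the corresponding MAC address entry, which avoids a large number of unnecessary ARP entry refreshes.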

3. Replace the main control board in slot 14. The board information is as follows:

 [Board Properties]

 BoardType=ES02SRUA

 BarCode=030MQS10E3000365

 Item=03030MQS

 Description=Quidway S7700,ES02SRUA,Quidway S7706/S7712,Main Control Unit A

 Manufactured=2014-03-16

 VendorName=Huawei

 IssueNumber=00

 CLEICode=

 BOM=
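
To illustrate why the stp tc-protection setting in step 2 of the solution reduces load, the sketch below models the behavior described there: at most one entry refresh is handled in each 2-second period, and further TC packets in the same period trigger no extra refresh. This is a simplified model for intuition only, not Huawei's implementation; the function name `flushes_with_protection`, its parameters, and the sample arrival times are all assumptions.

```python
def flushes_with_protection(tc_times, interval=2.0, threshold=1):
    """Count entry refreshes when at most `threshold` are handled per `interval` seconds.

    tc_times: arrival times of TC packets, in seconds.
    """
    flushes = 0
    window_start = None  # start time of the current protection period
    handled = 0          # refreshes already handled in this period
    for t in sorted(tc_times):
        if window_start is None or t - window_start >= interval:
            # A new protection period begins with this TC packet.
            window_start = t
            handled = 0
        if handled < threshold:
            handled += 1
            flushes += 1
        # Otherwise the TC packet arrives inside a saturated period
        # and does not trigger another refresh in this model.
    return flushes


if __name__ == "__main__":
    # Hypothetical burst of seven TC packets within about five seconds.
    arrivals = [0.0, 0.1, 0.5, 1.9, 2.0, 2.3, 4.5]
    print("refreshes without protection:", len(arrivals))
    print("refreshes with protection:   ", flushes_with_protection(arrivals))
```

In this model the seven TC packets cause three refreshes instead of seven, and the saving grows as bursts become denser.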

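Suggestions and Summary

1. Update device versions and patches regularly to prevent known issues from recurring.

2. Configure optimization settings that suit the characteristics of the user's network so that unexpected events do not cause device faults.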