Network Outage in a USG6000 Agile Controller Deployment

Published: 2016-11-24
Problem Description

Network topology: (figure not included)


Fault symptom:

At 16:29 on xx/xx (month/day), the network went down.


Alarm Information
The log information is as follows:

% 16:29:33 USG6650-A %%01RIGHTM/4/CHANNELON(l): The emergency channel is enabled. Current active TSM servers: 0.

% 16:29:33 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <xx.x.x.xxx> turns inactive. Current active TSM servers: 0.

% 16:29:23 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <xx.x.x.xxx> turns inactive. Current active TSM servers: 1.

% 16:29:15 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <xx.x.x.xxx> turns inactive. Current active TSM servers: 2.

Troubleshooting Procedure

1. Check the USG6000 device logs:

%2015-11-09 16:29:33 USG6650-A %%01RIGHTM/4/CHANNELON(l): The emergency channel is enabled. Current active TSM servers: 0.

%2015-11-09 16:29:33 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <10.8.8.102> turns inactive. Current active TSM servers: 0.

%2015-11-09 16:29:23 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <10.8.8.101> turns inactive. Current active TSM servers: 1.

%2015-11-09 16:29:15 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <10.8.8.100> turns inactive. Current active TSM servers: 2.

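The logs show that at 16:29 on November 9 the USG6000 lost its connections to all three Agile Controller servers, and the firewall's emergency channel took effect (the CHANNELON entry, marked in red in the original log). The network nevertheless remained down.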
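The log reading above can be reproduced with a short script. This is a minimal sketch, assuming the log line format is exactly the one quoted in this case and that the device prints the newest entry first; the function name `summarize` and the regular expressions are illustrative, not part of any Huawei tool.

```python
import re

# Line format inferred only from the four log lines quoted in this case.
LOG_RE = re.compile(
    r"%(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"%%01RIGHTM/4/(?P<event>\w+)\(l\): (?P<msg>.*)"
)
ACTIVE_RE = re.compile(r"Current active TSM servers: (\d+)")
SERVER_RE = re.compile(r"<([^>]+)>")


def summarize(lines):
    """Return (events in chronological order, time the emergency channel opened)."""
    events = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a RIGHTM log line
        active = ACTIVE_RE.search(m["msg"])
        server = SERVER_RE.search(m["msg"])
        events.append({
            "time": m["ts"],
            "event": m["event"],
            "server": server.group(1) if server else None,
            "active": int(active.group(1)) if active else None,
        })
    # Assumption: input is newest-first, so reverse it before a stable sort
    # to keep same-second entries in the order they were logged.
    events = sorted(reversed(events), key=lambda e: e["time"])
    channel_on = next((e["time"] for e in events if e["event"] == "CHANNELON"), None)
    return events, channel_on


SAMPLE = [
    "%2015-11-09 16:29:33 USG6650-A %%01RIGHTM/4/CHANNELON(l): The emergency channel is enabled. Current active TSM servers: 0.",
    "%2015-11-09 16:29:33 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <10.8.8.102> turns inactive. Current active TSM servers: 0.",
    "%2015-11-09 16:29:23 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <10.8.8.101> turns inactive. Current active TSM servers: 1.",
    "%2015-11-09 16:29:15 USG6650-A %%01RIGHTM/4/SERVERDOWN(l): The TSM server <10.8.8.100> turns inactive. Current active TSM servers: 2.",
]

events, channel_on = summarize(SAMPLE)
for e in events:
    print(e["time"], e["event"], e["server"], e["active"])
```

Run against the excerpt, the script lists the three servers going inactive within 18 seconds, followed by the emergency channel being enabled once the active count reaches zero.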

2. Check the device interface status and HRP status (both the interfaces and HRP were normal):

GigabitEthernet1/0/8        up    up       0.01%  0.01%                 0                 0

GigabitEthernet1/0/9        up    up       0.01%  0.01%                 0

 The firewall's config state is: ACTIVE

 Backup channel usage: 0.01%

 Time elapsed after the last switchover: 0 days, 0 hours, 17 minutes

 Current state of virtual routers configured as active:

                       Eth-Trunk1    vrid   3 : active

           (GigabitEthernet1/0/0)             : up

           (GigabitEthernet3/0/0)             : up

             GigabitEthernet1/0/8    vrid   1 : active

3. Check the access switch port status:

<VIP>dis int Ten-GigabitEthernet 1/9/0/9

Ten-GigabitEthernet1/9/0/9 current state: DOWN

Line protocol current state: DOWN

Description: link-to-USG6000-1

The Maximum Transmit Unit is 1500

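The access switch port was DOWN, which pointed to a single-fiber fault. An on-site inspection of the pigtail fiber in the equipment room found that it had been pinched. After the fiber was replaced, the port returned to normal.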

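Root Cause

After port 1/9/0/9 on the H3C switch went down, the firewall lost its connections to the three Agile Controller servers and enabled the emergency channel, so traffic received on interface 1/0/8 was forwarded directly out of interface 1/0/9. Because the H3C port 1/9/0/9 connected to interface 1/0/9 was already down, this traffic was lost and the network was interrupted. The direct cause was the pinched fiber: H3C port 1/9/0/9 received no light and went down. The root cause was that, in this multi-vendor interconnection, auto-negotiation was not enabled on the ports, so the USG6000 port could not detect the single-fiber fault on the peer device.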
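The mismatch at the heart of this case (firewall interface up, peer switch port DOWN) can be stated as a simple check. The sketch below is illustrative only: `classify_link` is a hypothetical helper, and the two state strings would have to be collected from each device by whatever means is available.

```python
def classify_link(local_state: str, peer_state: str) -> str:
    """Compare the link state reported by both ends of one physical link."""
    local_up = local_state.strip().lower() == "up"
    peer_up = peer_state.strip().lower() == "up"
    if local_up and peer_up:
        return "healthy"
    if not local_up and not peer_up:
        return "down on both ends"
    # One end still believes the link is usable: the one-way condition in this case.
    return "one-way fault"


# States taken from the outputs in this case:
# USG6000 GigabitEthernet1/0/9 reported up, H3C Ten-GigabitEthernet1/9/0/9 reported DOWN.
result = classify_link("up", "DOWN")
print(result)
```

Because the firewall only sees its own end, it keeps forwarding into a link that the peer has already declared down, which is why a reachability probe is needed in addition to the physical interface state.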

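Solution

Emergency measure:

At about 17:00, policy-based routing was removed on the access switch, and the network recovered.


Permanent fix:

At 9:30 the next day, inspection identified the pigtail fiber as faulty. After it was replaced, the network returned to normal.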

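Recommendations and Summary

Interconnections between devices from different vendors are prone to problems such as one-way links. Technical measures can prevent similar failures. For example, ip-link can monitor the reachability of the directly connected IP address so that the HRP state follows actual link reachability. On the USG6650, the following configuration helps guard against one-way link faults: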

 ip-link check enable                     

 ip-link 1 destination xx.xx.xx.xxx interface GigabitEthernet1/0/9 mode icmp

 hrp track ip-link 1 active

 ip-link 2 destination xx.xx.xxx.xxx interface GigabitEthernet1/0/8 mode icmp

 hrp track ip-link 2 active
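The idea behind probe-based tracking can be sketched in a few lines. This is a conceptual model only, not Huawei's implementation: the class `LinkMonitor`, the injected `probe` callable, and the threshold of three consecutive failures are all assumptions chosen for illustration, and the device's actual probe interval and failure count should be taken from the product documentation.

```python
from typing import Callable


class LinkMonitor:
    """Mark a link faulty after several consecutive failed reachability probes."""

    def __init__(self, probe: Callable[[], bool], fail_threshold: int = 3):
        self.probe = probe                    # returns True when the peer answered
        self.fail_threshold = fail_threshold  # hypothetical value, not a device default
        self.failures = 0
        self.link_ok = True

    def tick(self) -> bool:
        """Run one probe cycle and return the current link verdict."""
        if self.probe():
            self.failures = 0
            self.link_ok = True
        else:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self.link_ok = False
        return self.link_ok


# Simulated probe results: one reply, three losses, then a reply again.
replies = iter([True, False, False, False, True])
monitor = LinkMonitor(lambda: next(replies))
history = [monitor.tick() for _ in range(5)]
print(history)
```

The point of the design is that the verdict depends on end-to-end reachability rather than on the local interface state, so a link that is physically up on one side but unusable is still detected and can trigger a state change.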

