E9000 Blade Server: Handling a Network Outage Caused by Switch Board Reboots

Published: 2016-01-12
Problem Description

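At 17:30 one day, a financial-sector site called to report that the service network of its E9000 blade server had been interrupted and had then recovered on its own. The interruption affected several online transaction systems running on blades in this chassis. Real-time transactions could not be processed, so the impact was severe.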

Handling Process

1. Analysis of screenshots of the OS log (messages):

One pair of NICs went down at 17:10 and recovered at 17:14. The other pair went down at 17:50 and came back up at 17:53. The NICs themselves reported no errors.


2. Analysis of the collected switch board logs:

1) Diagnostic log analysis:

CX110 switch board in slot 2X:

The 2X switch board rebooted (board reset) at 2015-12-02 17:10:37.



CX110 switch board in slot 3X:

The 3X switch board rebooted (board reset) at 2015-12-02 17:50:06.





2) Log file analysis:

2X switch board:

An SNMP operation from IP 10.6.2.247 failed, and the 2X switch board began to reset at 17:10:29.


3X switch board:

An SNMP operation from IP 10.6.2.247 failed, and the 3X switch board reset at 17:50:02.



Detailed log entries:

2X:

Dec  2 2015 17:10:29+08:00 E9000-1-CX110-2 %%01SNMP/3/SNMP_OPERATION_FAILED(l):CID=0x80d503fc;SNMP operation failed. (IPAddress=10.6.2.247, NodeName=sysUpTime.0, OperationType=160, RequestId=1904733280, ErrorStatus=5, ErrorIndex=28, ReasonInfo=Fail to perform get/set/get-next/get-bulk operation.)

3X:

Dec  2 2015 17:50:02+08:00 E9000-1-CX110-3 %%01SNMP/3/SNMP_OPERATION_FAILED(l):CID=0x80d503fc;SNMP operation failed. (IPAddress=10.6.2.247, NodeName=sysUpTime.0, OperationType=160, RequestId=1905061496, ErrorStatus=5, ErrorIndex=28, ReasonInfo=Fail to perform get/set/get-next/get-bulk operation.)
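The two entries above share one layout, so their fields can be pulled out mechanically when many such lines need checking. The following is a minimal sketch, not a vendor tool: the regular expressions are derived only from the two lines shown here, and `parse_entry` is a hypothetical helper name. The error-status names are the standard SNMP ones (5 is `genErr`). Reading `OperationType=160` as the BER tag of a GetRequest PDU (0xA0) is an assumption that the log itself does not confirm.

```python
import re

# Entry copied verbatim from the 2X switch board log.
LOG_2X = (
    "Dec  2 2015 17:10:29+08:00 E9000-1-CX110-2 "
    "%%01SNMP/3/SNMP_OPERATION_FAILED(l):CID=0x80d503fc;SNMP operation failed. "
    "(IPAddress=10.6.2.247, NodeName=sysUpTime.0, OperationType=160, "
    "RequestId=1904733280, ErrorStatus=5, ErrorIndex=28, "
    "ReasonInfo=Fail to perform get/set/get-next/get-bulk operation.)"
)

# Timestamp, host name, and the %%<ver><module>/<severity>/<event> prefix.
HEADER = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d{4} \d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}) "
    r"(?P<host>\S+) %%\d+(?P<module>\w+)/(?P<severity>\d)/(?P<event>\w+)"
)
# key=value pairs inside the trailing parentheses.
FIELD = re.compile(r"(\w+)=([^,)]+)")

# Standard SNMP error-status codes.
ERROR_STATUS = {0: "noError", 1: "tooBig", 2: "noSuchName",
                3: "badValue", 4: "readOnly", 5: "genErr"}

# Standard BER tags of SNMP request PDUs. Mapping OperationType onto
# these tags is an assumption made for this sketch.
PDU_TAGS = {0xA0: "GetRequest", 0xA1: "GetNextRequest",
            0xA3: "SetRequest", 0xA5: "GetBulkRequest"}


def parse_entry(line):
    """Split one SNMP_OPERATION_FAILED entry into a dict of fields."""
    header = HEADER.match(line)
    if header is None:
        raise ValueError("unrecognized log header")
    entry = header.groupdict()
    # Only look at the parenthesized detail block after "failed. (".
    body = line[line.index(". (") + 3:line.rindex(")")]
    entry.update(FIELD.findall(body))
    entry["error_status_name"] = ERROR_STATUS.get(int(entry["ErrorStatus"]), "unknown")
    entry["pdu_type_guess"] = PDU_TAGS.get(int(entry["OperationType"]), "unknown")
    return entry


if __name__ == "__main__":
    e = parse_entry(LOG_2X)
    print(e["time"], e["host"], e["IPAddress"], e["NodeName"],
          e["error_status_name"], e["pdu_type_guess"])
```

Grouping parsed entries by `IPAddress` and `host` is a quick way to confirm that both failures came from the same poller.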

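10.6.2.247 is the IP address of the third-party network management system (NMS). When the NMS polled the switch boards, an abnormal call stack was raised in the snmpa component.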

2X call stack: Show CallStack:

Instruction Address: 0x0e711fc4, Func:libsnmpa.so(UTIL_IsSameTypeNodeAndData+0x60) [0xe711fc4]

Instruction Address: 0x0e73cf44, Func:libsnmpa.so(PduprocCheckAsnValue+0x70) [0xe73cf44]

Instruction Address: 0x0e6f9204, Func:libsnmpa.so(PduprocFindVBErrorIndex+0x148) [0xe6f9204]

Instruction Address: 0x0e6fa328, Func:libsnmpa.so(SnmpAsynchResponseSendEx+0x224) [0xe6fa328]

Instruction Address: 0x0e6fa834, Func:libsnmpa.so(SNMP_AsynchResponseSend+0x1a0) [0xe6fa834]

Instruction Address: 0x0e7d5700, Func:libsnmpa.so(SNMPA_MultiVBQueryProcFinish+0x310) [0xe7d5700]

Instruction Address: 0x0e7d6b68, Func:libsnmpa.so(SNMPA_ProcMultiVBQueryDataMsg+0x278) [0xe7d6b68]

Instruction Address: 0x0e7d6e1c, Func:libsnmpa.so(SNMPA_ProcQueryData+0x198) [0xe7d6e1c]

Instruction Address: 0x0e7d702c, Func:libsnmpa.so(SNMPA_ProcFinalQueryData+0x94) [0xe7d702c]

Instruction Address: 0x0e7d7458, Func:libsnmpa.so(SNMPA_OnMsgCfgiMsgTypeIsQueryData+0x284) [0xe7d7458]

Instruction Address: 0x0e7cedd8, Func:libsnmpa.so(SNMPA_OnMsgSmpSubIfIsCfgi+0x98) [0xe7cedd8]

Instruction Address: 0x0e746e04, Func:libsnmpa.so(SNMPA_CocketNotify+0x2fc) [0xe746e04]

Instruction Address: 0x0f2eda58, Func:libappcfgi.so(CO_MsgProc+0x398) [0xf2eda58]

Instruction Address: 0x0e7cf030, Func:libsnmpa.so(SNMPA_ComponentMsgProc+0x54) [0xe7cf030]

Instruction Address: 0x0f7423a0, Func:libdefault.so(rtfScmMessageSchedule+0x354) [0xf7423a0]

Instruction Address: 0x0f74260c, Func:libdefault.so(rtfScmCompScheKernelEntry+0x1c0) [0xf74260c]

Instruction Address: 0x0f7427a4, Func:libdefault.so(rtfScmCompScheDefaultEntry+0x194) [0xf7427a4]

Instruction Address: 0x0f742890, Func:libdefault.so(rtfScmTaskDeployDefaultCompEntry+0x24) [0xf742890]

Instruction Address: 0x0f5c0f90, Func:libdefault.so(tskAllTaskEntry+0xd8) [0xf5c0f90]

Instruction Address: 0x0ff792c4, Func:libpthread.so.0(+0x62c4) [0xff792c4]

3X call stack: Show CallStack:

Instruction Address: 0x0e711fc4, Func:libsnmpa.so(UTIL_IsSameTypeNodeAndData+0x60) [0xe711fc4]

Instruction Address: 0x0e73cf44, Func:libsnmpa.so(PduprocCheckAsnValue+0x70) [0xe73cf44]

Instruction Address: 0x0e6f9204, Func:libsnmpa.so(PduprocFindVBErrorIndex+0x148) [0xe6f9204]

Instruction Address: 0x0e6fa328, Func:libsnmpa.so(SnmpAsynchResponseSendEx+0x224) [0xe6fa328]

Instruction Address: 0x0e6fa834, Func:libsnmpa.so(SNMP_AsynchResponseSend+0x1a0) [0xe6fa834]

Instruction Address: 0x0e7d5700, Func:libsnmpa.so(SNMPA_MultiVBQueryProcFinish+0x310) [0xe7d5700]

Instruction Address: 0x0e7d6b68, Func:libsnmpa.so(SNMPA_ProcMultiVBQueryDataMsg+0x278) [0xe7d6b68]

Instruction Address: 0x0e7d6e1c, Func:libsnmpa.so(SNMPA_ProcQueryData+0x198) [0xe7d6e1c]

Instruction Address: 0x0e7d702c, Func:libsnmpa.so(SNMPA_ProcFinalQueryData+0x94) [0xe7d702c]

Instruction Address: 0x0e7d7458, Func:libsnmpa.so(SNMPA_OnMsgCfgiMsgTypeIsQueryData+0x284) [0xe7d7458]

Instruction Address: 0x0e7cedd8, Func:libsnmpa.so(SNMPA_OnMsgSmpSubIfIsCfgi+0x98) [0xe7cedd8]

Instruction Address: 0x0e746e04, Func:libsnmpa.so(SNMPA_CocketNotify+0x2fc) [0xe746e04]

Instruction Address: 0x0f2eda58, Func:libappcfgi.so(CO_MsgProc+0x398) [0xf2eda58]

Instruction Address: 0x0e7cf030, Func:libsnmpa.so(SNMPA_ComponentMsgProc+0x54) [0xe7cf030]

Instruction Address: 0x0f7423a0, Func:libdefault.so(rtfScmMessageSchedule+0x354) [0xf7423a0]

Instruction Address: 0x0f74260c, Func:libdefault.so(rtfScmCompScheKernelEntry+0x1c0) [0xf74260c]

Instruction Address: 0x0f7427a4, Func:libdefault.so(rtfScmCompScheDefaultEntry+0x194) [0xf7427a4]

Instruction Address: 0x0f742890, Func:libdefault.so(rtfScmTaskDeployDefaultCompEntry+0x24) [0xf742890]

Instruction Address: 0x0f5c0f90, Func:libdefault.so(tskAllTaskEntry+0xd8) [0xf5c0f90]

Instruction Address: 0x0ff792c4, Func:libpthread.so.0(+0x62c4) [0xff792c4]
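Comparing two long call stacks by eye is error-prone. The sketch below parses frames in the format shown above and checks whether two stacks are identical. It is only an illustration: `parse_frames` is a hypothetical helper, the pattern is inferred from these dumps alone, and the sample strings hold just the top three frames of each stack to keep the example short. The trailing `[0x...]` field is deliberately not matched, so a frame with a missing closing bracket still parses.

```python
import re

# One frame: "Instruction Address: <addr>, Func:<lib>(<func>+<offset>) [...]"
# The function name may be empty, as in libpthread.so.0(+0x62c4).
FRAME = re.compile(
    r"Instruction Address: (?P<addr>0x[0-9a-f]+), "
    r"Func:(?P<lib>[\w.]+)\((?P<func>\w*)\+(?P<off>0x[0-9a-f]+)\)"
)

# Top three frames of each stack, copied from the dumps above.
STACK_2X = """
Instruction Address: 0x0e711fc4, Func:libsnmpa.so(UTIL_IsSameTypeNodeAndData+0x60) [0xe711fc4]
Instruction Address: 0x0e73cf44, Func:libsnmpa.so(PduprocCheckAsnValue+0x70) [0xe73cf44]
Instruction Address: 0x0e6f9204, Func:libsnmpa.so(PduprocFindVBErrorIndex+0x148) [0xe6f9204]
"""
STACK_3X = STACK_2X  # the 3X dump lists the same frames


def parse_frames(text):
    """Return a list of (library, function, offset) tuples, top frame first."""
    return [(m["lib"], m["func"], m["off"]) for m in FRAME.finditer(text)]


if __name__ == "__main__":
    frames_2x = parse_frames(STACK_2X)
    frames_3x = parse_frames(STACK_3X)
    print("identical stacks:", frames_2x == frames_3x)
    print("faulting frame:", frames_2x[0])
```

Identical stacks on both boards point to a single software path rather than two unrelated faults, which matches the conclusion drawn from the logs.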

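Analysis of the abnormal call stack against the source code (code omitted) shows that when the switch board processed an SNMP packet sent by the customer's third-party NMS, the syntax type did not match, which led to an invalid access and the abnormal call stack.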
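To make "syntax type mismatch" concrete, the sketch below shows the general idea of validating each variable binding against the type its MIB node declares, and reporting an error index instead of touching mismatched data. It is not the switch board's code, which the case omits. `NODE_SYNTAX` and `first_mismatch` are hypothetical names, and the node table is a two-entry stand-in. The tag values are the standard SNMP ones, and `sysUpTime` is a `TimeTicks` object in the standard MIB.

```python
# Standard BER tags for common SNMP value types.
TAGS = {
    "INTEGER": 0x02,
    "OCTET STRING": 0x04,
    "OBJECT IDENTIFIER": 0x06,
    "IpAddress": 0x40,
    "Counter32": 0x41,
    "Gauge32": 0x42,
    "TimeTicks": 0x43,
}

# Hypothetical table of declared syntax per node (illustration only).
NODE_SYNTAX = {
    "sysUpTime.0": "TimeTicks",
    "sysName.0": "OCTET STRING",
}


def first_mismatch(varbinds):
    """Return 0 if every varbind matches its declared syntax,
    otherwise the 1-based index of the first offending varbind.

    varbinds is a list of (node_name, value_tag) pairs.
    """
    for index, (node, tag) in enumerate(varbinds, start=1):
        declared = NODE_SYNTAX.get(node)
        # Unknown nodes and wrong tags are both rejected up front,
        # so later code never interprets data as the wrong type.
        if declared is None or TAGS[declared] != tag:
            return index
    return 0


if __name__ == "__main__":
    good = [("sysUpTime.0", TAGS["TimeTicks"]), ("sysName.0", TAGS["OCTET STRING"])]
    bad = [("sysUpTime.0", TAGS["TimeTicks"]), ("sysName.0", TAGS["INTEGER"])]
    print(first_mismatch(good), first_mismatch(bad))
```

A check of this kind lets an agent answer with an error status and index for the bad binding while continuing to run.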

 

Summary:

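When the switch board software processed SNMP packets from the site's third-party NMS, a type mismatch caused an invalid access, and the CPU reset.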
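The causal link rests on how closely the SNMP failures and the board resets line up in time. A small sketch of that arithmetic follows, using only the timestamps quoted in the logs above. The dictionary layout and the helper name `seconds_between` are choices made for this illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the switch board logs and diagnostic logs.
EVENTS = {
    "2X": {"snmp_failed": "2015-12-02 17:10:29", "board_reset": "2015-12-02 17:10:37"},
    "3X": {"snmp_failed": "2015-12-02 17:50:02", "board_reset": "2015-12-02 17:50:06"},
}


def seconds_between(start, end):
    """Whole seconds elapsed from start to end (both in FMT)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


if __name__ == "__main__":
    for slot, ev in EVENTS.items():
        gap = seconds_between(ev["snmp_failed"], ev["board_reset"])
        print(f"{slot}: reset logged {gap} s after the SNMP failure")
    between = seconds_between(EVENTS["2X"]["board_reset"], EVENTS["3X"]["board_reset"])
    print(f"2X and 3X resets are {between} s apart")
```

On each board the reset was logged within seconds of the failed SNMP operation (8 s on 2X, 4 s on 3X), and the NIC outages in the OS log start in the same minutes, 17:10 and 17:50.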




Root Cause

The E9000 server hardware is not at fault.

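At this site, the E9000 received SNMP packets from a third-party NMS. A type mismatch while the switch board software processed them caused an invalid access and a CPU reset. The switch board then rebooted automatically, and the network was interrupted. The switch board was running version 1.1.3.330.13, with BMC version 5.11 and CPLD version U1042004.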

Solution

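To avoid this SNMP processing exception with third-party NMSs, Huawei released a patch that improves the processing. The issue is resolved by upgrading the switch board software.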

Fix: upgrade the switch board to the latest version, V3.13, with BMC version 6.09 and CPLD version U1042005.

    Version: CX110-Switch-V3.13.cc

    Patch: CX110-Switch-V3.13-SPH005.PAT

 

Upgrade impact: the upgrade requires a switch board reboot.

Recommendations and Summary

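E9000 blade servers delivered early run older software versions, which carry potential risks when interworking with the operating system above them, third-party NMS software, and other applications. During routine maintenance and inspection, promptly check which versions the E9000 blade servers on the live network are running. If a potential issue exists, update the version as soon as conditions permit to keep services running normally.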
