2 Blade in E9000 Out of Network for 2 Minutes 30 Seconds

Publication Date: 2017-09-18
Issue Description


According to the customer, 2 of the 16 blades in the E9000 chassis lost network connectivity for 2 minutes 30 seconds. The network drop was observed from the customer's switch. The affected blades are Blade 2 and Blade 16.

 



Alarm Information

No alarm was found in HMM Web.

The HMM logs, however, record a Major fault event under switch 3X.

 



Handling Process

Ask the customer to collect the E9000 logs, the switch logs (4 switches), and the OS logs of 3 blades (the 2 affected blades and 1 unaffected blade for comparison).

1) E9000 Logs Collection Method

 

2) Switch Logs Collection Method  (4.4.2  Collecting Switch Module Logs)

http://support.huawei.com/enterprise/docinforeader.action?contentId=DOC1000087836&idPath=7919749|9856522|21782478|19955022|19961380

 

3) Blade OS Logs Collection Method

Use InfoCollect Tool :- http://support.huawei.com/enterprise/en/software/22400692-SW1000258489

Guide :- http://support.huawei.com/enterprise/en/doc/DOC1000093346?idPath=7919749%7C9856522%7C9856629%7C21015513



Root Cause

The switch log shows that the 3X switch restarted at 2017-08-17 10:51:39. All ports on the 3X switch went down at that time, and the switch did not finish booting until 10:55:06.


CX912_10GE(Standby) 3 : uptime is  5 days, 1 hour, 2 minutes
StartupTime 2017/08/17   10:55:06
StartTime   : 2017-08-17 10:52:45                    
Description : The interface status changes. (ifName=Stack-Port2/1, AdminStatus=UP, OperStatus=DOWN, Reason=Interface physical link is down, mainName=Stack-Port2/1)

Based on time stamp information:-
Switch Module >>  2017-08-21 18:52:05.
Blade  OS>>Mon Aug 21 16:25:06 IST 2017
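The switch clock is about 2 hours 27 minutes ahead of the blade OS clock. Blade OS log entries at about Aug 17 08:23 therefore correspond to the time of the network outage (switch timestamps must be converted to blade time before the logs are compared).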
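The clock alignment above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the two "current time" readings are the ones quoted above and are assumed to have been captured at roughly the same moment, and the names `clock_offset` and `switch_to_blade_time` are hypothetical helpers, not part of any Huawei tool.

```python
from datetime import datetime

# Clock readings quoted in the case notes (assumed to be taken at about the same moment).
switch_now = datetime(2017, 8, 21, 18, 52, 5)   # switch module clock
blade_now = datetime(2017, 8, 21, 16, 25, 6)    # blade OS clock (IST)

# How far the switch clock runs ahead of the blade OS clock.
clock_offset = switch_now - blade_now


def switch_to_blade_time(switch_time: datetime) -> datetime:
    """Convert a switch-log timestamp to the blade OS clock."""
    return switch_time - clock_offset


# Restart time of the 3X switch as recorded in the switch log.
restart_on_switch = datetime(2017, 8, 17, 10, 51, 39)
restart_on_blade = switch_to_blade_time(restart_on_switch)

print("offset:", clock_offset)
print("restart in blade time:", restart_on_blade)
```

Because the two readings were probably not taken at exactly the same second, the converted time should be treated as accurate to within a minute or so, which is enough to line it up with the eth4 link-down window in the blade logs.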

Blade 2:
Between Aug 17 08:23 and 08:25 (blade time), eth4 was down while eth5 remained up. eth4 and eth5 are configured as bond0, but no log entry shows eth5 becoming active while eth4 was down. As a result, the blade lost its connection to the switch.

eth4 is physically connected to the 3X switch board and eth5 to the 2X switch board, so eth4 went down when the 3X switch restarted.

Blade 16:
According to the OS logs, eth4 and eth5 are not configured as a bond (active-backup mode), so the service on eth4 stopped when the 3X switch board restarted.

Blade 3 (unaffected blade):
According to the OS log, when port enp3s0f0 went down, enp3s0f1 became active, because these two ports are configured as bond0. The OS log contains messages recording this failover. Because the other port stayed up, there was no business impact on this blade.

Further analysis compared the bond0 configuration of Blade 2 and Blade 3 and found that they differ. The Blade 3 bond0 configuration is correct, which is why its second port took over when the first port went down. The customer needs to check the Blade 2 bond0 configuration and configure bonding the same way as on Blade 3.
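One way to run such a check on a Linux blade is to read the bonding driver's status file, `/proc/net/bonding/bond0`, and verify the settings that failover depends on. The sketch below is an assumption-laden illustration, not the procedure used in this case: `SAMPLE_BOND0` is a made-up example in the usual layout of that file (it is not the customer's configuration, which is not reproduced here), and `summarize_bond` and `failover_problems` are hypothetical helper names.

```python
# Hypothetical sample in the layout of /proc/net/bonding/bond0 (not customer data).
SAMPLE_BOND0 = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth5
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth4
MII Status: down

Slave Interface: eth5
MII Status: up
"""


def summarize_bond(text):
    """Extract mode, active slave, MII polling interval and per-slave link state."""
    summary = {"mode": None, "active_slave": None, "miimon_ms": None, "slaves": {}}
    current = None  # slave section currently being read, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Currently Active Slave":
            summary["active_slave"] = value
        elif key == "MII Polling Interval (ms)":
            summary["miimon_ms"] = int(value)
        elif key == "Slave Interface":
            current = value
            summary["slaves"][current] = None
        elif key == "MII Status" and current is not None:
            summary["slaves"][current] = value
    return summary


def failover_problems(summary):
    """List reasons why active-backup failover may not work (ARP monitoring is not considered)."""
    problems = []
    if summary["mode"] is None or "active-backup" not in summary["mode"]:
        problems.append("bonding mode is not active-backup")
    if not summary["miimon_ms"]:
        problems.append("MII link monitoring is disabled (polling interval 0)")
    if len(summary["slaves"]) < 2:
        problems.append("fewer than two slave interfaces")
    if not any(state == "up" for state in summary["slaves"].values()):
        problems.append("no slave interface has link")
    return problems


summary = summarize_bond(SAMPLE_BOND0)
print(summary)
print(failover_problems(summary))
```

On a real blade, the text would come from reading `/proc/net/bonding/bond0` on each host. Running the same summary on Blade 2 and Blade 3 and comparing the two dictionaries would show which settings differ.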

[Figures: Blade 3 bond0 configuration; Blade 2 bond0 configuration]

                     

Analysis of the abnormal restart of the 3X switch (CX912) fabric:


1. The switch fabric CPU has 4 cores (CPU0 to CPU3). CPU3 usage on the 3X switch stayed at 100% or nan from 2017-08-16 20:13:27 to 2017-08-17 10:51 (switch board time).


2017-08-16 20:13:27.133 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:  18.75  81.25
CPU 0    :  16.67  33.33
CPU 1    :  60.00  20.00
CPU 2    :   0.00 100.00
CPU 3    :   0.00 100.00
--Record Current CPU Occupy Info end--

2017-08-16 20:13:30.016 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:  40.00  60.00
CPU 0    :  40.00   0.00
CPU 1    :   0.00   0.00
CPU 2    :  16.67   0.00
CPU 3    :   0.00 100.00
--Record Current CPU Occupy Info end--

2017-08-17 10:51:37.528 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:   0.00 100.00
CPU 0    :   0.00 100.00
CPU 1    :   0.00   0.00
CPU 2    :   0.00   0.00
CPU 3    :    nan    nan
--Record Current CPU Occupy Info end--

2017-08-17 10:51:40.342 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:   0.00 100.00
CPU 0    :   0.00   0.00
CPU 1    :   0.00  16.67
CPU 2    :   0.00 100.00
CPU 3    :   0.00 100.00
--Record Current CPU Occupy Info end--

2. The stack thread runs on CPU3 of the switch. When CPU3 usage stays at 100% for a long time, stack communication between the two switches becomes abnormal, which triggers an automatic restart of the 3X switch. (If CPU3 usage rises to 100% only briefly and then drops, the switch board does not restart.)


3. The current switch software version is 3.10. This old version has an issue in which logs are written very frequently, which consumes a large amount of CPU resources and can trigger a switch restart. The switch log shows entries being recorded as often as once per second.
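The CPU occupancy records quoted under item 1 can also be checked mechanically. The following is a minimal sketch, assuming the record layout shown in that excerpt. The helper names (`parse_records`, `core_samples`, `is_saturated`) are hypothetical, and a `nan` reading is treated as saturated, as in the analysis above.

```python
from datetime import datetime
import math
import re

# The four records quoted from the 3X switch log.
CPU_LOG = """\
2017-08-16 20:13:27.133 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:  18.75  81.25
CPU 0    :  16.67  33.33
CPU 1    :  60.00  20.00
CPU 2    :   0.00 100.00
CPU 3    :   0.00 100.00
--Record Current CPU Occupy Info end--
2017-08-16 20:13:30.016 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:  40.00  60.00
CPU 0    :  40.00   0.00
CPU 1    :   0.00   0.00
CPU 2    :  16.67   0.00
CPU 3    :   0.00 100.00
--Record Current CPU Occupy Info end--
2017-08-17 10:51:37.528 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:   0.00 100.00
CPU 0    :   0.00 100.00
CPU 1    :   0.00   0.00
CPU 2    :   0.00   0.00
CPU 3    :    nan    nan
--Record Current CPU Occupy Info end--
2017-08-17 10:51:40.342 --Record Current CPU Occupy Info here--
           User(%) Kernel(%)
CPU Total:   0.00 100.00
CPU 0    :   0.00   0.00
CPU 1    :   0.00  16.67
CPU 2    :   0.00 100.00
CPU 3    :   0.00 100.00
--Record Current CPU Occupy Info end--
"""

STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) --Record Current CPU Occupy Info here--")
CORE = re.compile(r"^CPU (\d+)\s*:\s*(\S+)\s+(\S+)")


def parse_records(text):
    """Return a list of (timestamp, {core: (user, kernel)}) tuples."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        stamp = STAMP.match(line)
        if stamp:
            when = datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S.%f")
            records.append((when, {}))
            continue
        core = CORE.match(line)
        if core and records:
            # float() accepts "nan", so unreadable samples survive parsing.
            records[-1][1][int(core.group(1))] = (float(core.group(2)), float(core.group(3)))
    return records


def core_samples(records, core):
    """Select the (timestamp, (user, kernel)) samples of one core."""
    return [(when, cores[core]) for when, cores in records if core in cores]


def is_saturated(user, kernel):
    """A core counts as saturated at 100% total usage or when the reading is nan."""
    return math.isnan(user) or math.isnan(kernel) or user + kernel >= 100.0


records = parse_records(CPU_LOG)
cpu3 = core_samples(records, 3)
always_busy = all(is_saturated(user, kernel) for _, (user, kernel) in cpu3)
span = cpu3[-1][0] - cpu3[0][0]
print(f"CPU 3 saturated in all {len(cpu3)} samples: {always_busy}, spanning {span}")
```

The excerpt contains only four samples, so this confirms the pattern at both ends of the window (about 14 hours 38 minutes apart) rather than every second in between. Running the same scan over the full switch log would show whether CPU3 ever dropped below 100% during that period.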

Conclusion:
1. The 3X switch currently runs software version 3.10. This old version has an issue in which logs are written very frequently, which consumes a large amount of CPU resources and can trigger a switch restart. The 3X switch log (CPU3 at 100% for a long time) matches this issue. The issue is fixed in switch software versions later than 5.16.


2. When the 3X switch restarts automatically, the physical link between the blade NIC port and the 3X switch goes down. If the blade's NIC ports are not configured as a bond (active-backup), or the bond does not fail over when one NIC port goes down, the service on that NIC port is affected.

 

Solution

1. Upgrade the 2X and 3X switch software to version 5.35. We recommend raising an RFC for the switch upgrade.

V5.35 download link: http://support.huawei.com/enterprise/en/software/22468860-SW1000235832
Upgrade guide:- http://support.huawei.com/enterprise/en/doc/DOC1000146339/?idPath=7919749|9856522|21782478|19955021|19961380

 

2. Upgrade the MZ910 firmware and driver to the latest version. The OS logs show that the driver and firmware on Blade 3 and Blade 16 are very old.


 

Suggestions

Follow the rectification notice and arrange downtime to upgrade the firmware, so that this issue does not occur on other chassis.

Link :- http://support.huawei.com/enterprise/en/bulletins-product/NEWS1000008216
