The 3X CX912 switchboard restarted abnormally. The customer manually switched services to the 2X switchboard to restore service.
The switchboard CPU has four cores (CPU0 to CPU3). CPU3 usage on the 2X switchboard stayed at 100% from 2018-03-22 07:46 (CPU usage % = User % + Kernel %). Here is the log screenshot:
The stacking module thread is bound to CPU3. Because CPU3 usage on the 2X switchboard frequently reached 100%, stack protocol communication between the active and standby stack members became abnormal. As a result, the stack split, which triggered a restart of the lower-priority switchboard (the 3X CX912 has a lower priority than the 2X, so the 3X switchboard restarted).
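As a rough illustration of the usage formula quoted above (CPU usage % = User % + Kernel %), the sketch below flags cores that stay at 100% across a set of samples. The `CoreSample` type, the function name, and all sample values are hypothetical and are not taken from the switchboard logs.

```python
from dataclasses import dataclass


@dataclass
class CoreSample:
    core: str          # e.g. "CPU3"
    user_pct: float    # User % for this sample
    kernel_pct: float  # Kernel % for this sample

    @property
    def usage_pct(self) -> float:
        # CPU usage % = User % + Kernel %, as defined in this report
        return self.user_pct + self.kernel_pct


def saturated_cores(samples, threshold=100.0):
    """Return the cores whose usage reaches the threshold in every sample."""
    by_core = {}
    for s in samples:
        by_core.setdefault(s.core, []).append(s.usage_pct)
    return sorted(c for c, vals in by_core.items() if all(v >= threshold for v in vals))


# Hypothetical samples, for illustration only
samples = [
    CoreSample("CPU0", 5.0, 3.0),
    CoreSample("CPU3", 62.0, 38.0),
    CoreSample("CPU0", 4.0, 2.0),
    CoreSample("CPU3", 70.0, 30.0),
]
print(saturated_cores(samples))
```

A core that is pinned like this matters here because the stacking thread is bound to a single core and cannot be scheduled elsewhere.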
The priority of the 2X switchboard is 150, while the 3X switchboard uses the default priority of 100, so the 2X switchboard has the higher priority (the default priority of 100 on the 3X switchboard is not shown in the log), as shown in the following figure:
The current switchboard software version is 3.10, which is too old and has an issue that causes logs to be written too frequently. When logs are written frequently, the log files are repeatedly compressed, replaced, and deleted. These operations consume a large amount of CPU and hold it for long periods, so stack protocol packets cannot be processed. As a result, the mechanism that restarts the lower-priority switchboard is triggered.
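The choice of which switchboard restarts after a stack split can be sketched as follows. This is a simplified model that only compares priorities, with an unconfigured priority falling back to the default of 100 reported in this analysis. The function name is hypothetical, and the real stack protocol may apply further tie-breakers that are not modeled here.

```python
DEFAULT_PRIORITY = 100  # default stack priority reported in this analysis


def member_to_restart(configured):
    """Return the member with the lowest effective priority.

    `configured` maps a member name to its configured priority,
    or to None when the default applies.
    """
    effective = {
        member: (prio if prio is not None else DEFAULT_PRIORITY)
        for member, prio in configured.items()
    }
    return min(effective, key=effective.get)


# 2X is configured with 150; 3X has no explicit priority and uses the default
print(member_to_restart({"2X": 150, "3X": None}))
```

With the priorities from this case, the sketch selects the 3X switchboard, which matches the observed restart.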
Why the Stack Failed to Be Established After the 3X Switchboard Restarted, and Why Services Were Affected:
Because the 2X switchboard was still writing logs frequently, stack communication remained abnormal after the 3X switchboard restarted. When the 3X switchboard tried to establish the stack, it wrongly concluded that the stack configurations conflicted, so the stack protocol stopped working and the stack could not be set up. With no stack established between 2X and 3X, both switchboards operated as master, and services were affected.
The log below shows the configuration conflict that prevented the stack from being set up.
The switchboard software version is too old and needs to be upgraded.
Upgrade the switch software to the latest version, 5.52 (and upgrade the CPLD to the compatible version).
Configure dual-active detection for the stacked switchboards. (Make sure the stack domain ID of each E9000 is different. If two or more E9000 chassis share the same stack domain ID, services will be affected, so we do not suggest doing this as part of today's plan.)
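Before enabling dual-active detection, the uniqueness requirement for stack domain IDs could be checked with something like the sketch below. The chassis names and domain IDs are made-up placeholders, and the inventory is assumed to have been collected by hand from each E9000.

```python
from collections import defaultdict


def duplicate_domain_ids(chassis_to_domain):
    """Return {domain_id: [chassis, ...]} for every domain ID used more than once."""
    groups = defaultdict(list)
    for chassis, domain_id in chassis_to_domain.items():
        groups[domain_id].append(chassis)
    return {d: sorted(names) for d, names in groups.items() if len(names) > 1}


# Hypothetical inventory, for illustration only
inventory = {"E9000-A": 10, "E9000-B": 20, "E9000-C": 10}
print(duplicate_domain_ids(inventory))
```

An empty result means every chassis has its own stack domain ID. Any entry in the result lists chassis whose IDs need to be changed first.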