High CPU usage on a stack of nine S5700LI switches

Publication Date: 2016-06-30
Issue Description

On the customer's live network, 9 S5700-X-LI switches are set up as a stack. The customer reported that the master switch had high CPU usage (above 95%) and responded slowly to NMS operations.

[sw-suffren]display cpu-usage slot 1

CPU Usage Stat. Cycle: 60 (Second)

CPU Usage            : 97% Max: 100%

CPU Usage Stat. Time : 2016-05-19  15:59:52

CPU utilization for five seconds: 97%: one minute: 95%: five minutes: 95%

Max CPU Usage Stat. Time : 2016-04-23 18:28:39.

                       TaskName             CPU  Runtime(CPU Tick High/Tick Low)  Task Explanation

mv_rx6                2%         0/1b945622       mv_rx6                      

mv_rx7                2%         0/1a2d554d       mv_rx7                      

VIDL                  3%         0/157529ff       DOPRA IDLE                             

IPCR                  1%         0/14ce7a0e       IPCR InterPro Communicate  Receive                                                                       

VPR                  23%         0/ad76ceb5       VPR VP Receive                    

RPCQ                  1%         0/ f27b209       RPCQRemote procedure call   

l2sync                9%         0/4dc7fed1       tS0a                                    

linkscan              7%         0/3d21c624       tS0e                        

IFPD                  4%         0/25c5b6ed       IFPD Ifnet Product Adapter  

AGNT                  2%         0/1c0ed172       AGNTSNMP agent task         

HS2M                  1%         0/124bf05a       HS2MHigh available task     

LLDP                  1%         0/1293c12f       LLDP Protocol               

SRMT                  2%         0/1a7efc17       SRMT System Resource Manage

OS                   39%         1/16f321ff       Operation System      

Handling Process

The collected switch information shows that the stack consists of 9 switches with more than 300 interfaces in Up state. Connected users frequently go online and offline, which triggers MAC address learning and synchronization to the other slots. The switch holds 256 ARP entries, which is the maximum number it supports. The CPU-defend statistics show that a large number of ARP request packets have been sent to the CPU of the MPU, overloading it.

Collect the statistics on packets sent to the CPU at two points in time, and calculate the packet rate per second from the difference. Take ARP request packets as an example.

Information collected the first time:

[sw-suffren-diagnose]display cpu-defend packet  statistics packet-type arp-request 

-------------------------------------------------------------------------------

Packet Type :arp-request

Top N : N/A

-------------------------------------------------------------------------------

Interface                       Packets     

-------------------------------------------------------------------------------

                         ……

GigabitEthernet0/0/2            1655296     

XGigabitEthernet0/0/1           119382050   

XGigabitEthernet0/0/2           428613      

GigabitEthernet8/0/23           4815794   

 

Information collected three hours later:

[sw-suffren-diagnose]display cpu-defend packet  statistics packet-type arp-request

-------------------------------------------------------------------------------

Packet Type :arp-request

Top N : N/A

-------------------------------------------------------------------------------

Interface                       Packets     

……

GigabitEthernet0/0/2            1661667     

XGigabitEthernet0/0/1           120048107   

XGigabitEthernet0/0/2           430263      

GigabitEthernet8/0/23           4892115     

 

Port with Sharp Traffic Increase  First Collection  Three Hours Later  Increase (Packets)
GigabitEthernet0/0/2              1655296           1661667            6371
XGigabitEthernet0/0/1             119382050         120048107          666057
XGigabitEthernet0/0/2             428613            430263             1650
GigabitEthernet8/0/23             4815794           4892115            76321

In total, about 760000 more packets were counted over the three hours, most of them ARP request packets. This corresponds to a packet rate of about 70 pps (760000 packets / 10800 seconds).
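The rate calculation can be reproduced with a short script. The following is a minimal sketch in Python. The counter values are copied from the two command outputs above. The function names and the assumption that exactly three hours (10800 seconds) passed between the two collections are illustrative.

```python
# Counter snapshots copied from the two
# "display cpu-defend packet statistics packet-type arp-request" outputs.
FIRST = {
    "GigabitEthernet0/0/2": 1655296,
    "XGigabitEthernet0/0/1": 119382050,
    "XGigabitEthernet0/0/2": 428613,
    "GigabitEthernet8/0/23": 4815794,
}
LATER = {
    "GigabitEthernet0/0/2": 1661667,
    "XGigabitEthernet0/0/1": 120048107,
    "XGigabitEthernet0/0/2": 430263,
    "GigabitEthernet8/0/23": 4892115,
}
INTERVAL_SECONDS = 3 * 60 * 60  # assumption: exactly three hours between collections


def counter_deltas(first, later):
    """Return the per-interface increase between two counter snapshots."""
    return {port: later[port] - first[port] for port in first}


def packets_per_second(deltas, interval_seconds):
    """Return the average packet rate for the summed increases."""
    return sum(deltas.values()) / interval_seconds


deltas = counter_deltas(FIRST, LATER)
rate = packets_per_second(deltas, INTERVAL_SECONDS)

# List the interfaces with the largest increase first.
for port, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{port:<24} +{delta}")
print(f"average rate: {rate:.1f} pps")
```

For the four interfaces listed, the increases sum to 750399 packets, or a little under 70 pps. Interfaces elided from the command output are not included in this sketch.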

Root Cause

As shown in the preceding figure, ARP request, SNMP, and Telnet packets are sent to the CPU of the active MPU, received by the mv_rx6 task on the active MPU, and passed to the VPR task. Therefore, a large number of ARP request packets or other inter-board protocol packets causes high CPU usage in the VPR task on the active MPU.

Tasks with a CPU usage of 2% or higher:

Task Name  CPU Usage  Function                                         Cause
OS         39%        Operating system task                            The CPU usage of this task rises when the CPU usage of other tasks increases.
VPR        23%        VP packet receiving                              The CPU usage of this task rises when a large number of protocol packets are sent to the master switch. The collected information shows an ARP packet rate of 70 pps, plus other protocol packets (SNMP: 7 pps, STP: 2 pps, Telnet: 2 pps).
l2sync     9%         MAC software and hardware entry synchronization  There are 1700 MAC address entries.
linkscan   7%         Port link status detection                       There are more than 300 ports in Up state.
IFPD       4%         Packet statistics on ports
FSP        3%         Stack management protocol                        The stack consists of many (nine) switches, so the stack management protocol consumes a large amount of CPU.
mv_rx6     2%         CPU queue 6 processing                           Many inter-board ARP packets are sent to the CPU.
mv_rx7     2%         CPU queue 7 processing                           Inter-board management packets, such as IPC and FSP.
SRMT       2%         Device management timer task                     Optical module detection and hardware component detection on stack ports.
AGNT       2%         IPv4 SNMP protocol stack                         NMS polling at an interval of five minutes.
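A task listing like the one above can be extracted from the command output by filtering on the CPU column. The following is a minimal sketch in Python, assuming the line layout shown in the display cpu-usage output earlier. The function name busy_tasks and the regular expression are illustrative and are not part of any Huawei tool.

```python
import re

# A few task lines copied from the "display cpu-usage slot 1" output.
SAMPLE = """\
mv_rx6                2%         0/1b945622       mv_rx6
IPCR                  1%         0/14ce7a0e       IPCR InterPro Communicate  Receive
VPR                  23%         0/ad76ceb5       VPR VP Receive
l2sync                9%         0/4dc7fed1       tS0a
linkscan              7%         0/3d21c624       tS0e
LLDP                  1%         0/1293c12f       LLDP Protocol
OS                   39%         1/16f321ff       Operation System
"""

# Assumed layout: task name, whitespace, integer percentage, whitespace, rest.
TASK_LINE = re.compile(r"^\s*(\S+)\s+(\d+)%\s")


def busy_tasks(output, threshold=2):
    """Return (task, cpu_percent) pairs at or above the threshold, busiest first."""
    tasks = []
    for line in output.splitlines():
        match = TASK_LINE.match(line)
        if match and int(match.group(2)) >= threshold:
            tasks.append((match.group(1), int(match.group(2))))
    return sorted(tasks, key=lambda task: task[1], reverse=True)


for name, cpu in busy_tasks(SAMPLE):
    print(f"{name:<10} {cpu}%")
```

Raising the threshold, for example busy_tasks(SAMPLE, threshold=10), narrows the list to the tasks that dominate the CPU, here OS and VPR.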

In addition:

Huawei switches handle data packets and protocol packets separately: protocol packets are sent to the CPU, while data packets (service packets) are forwarded by the LAN switch chip. Therefore, high CPU usage does not affect data packet forwarding.

When the CPU works stably, data packets are forwarded normally even if the CPU usage reaches 90%. When the CPU usage stays at 98-99% and one task occupies the CPU resources for a long time, protocol flapping or slow response may occur. This affects services and can cause the switch to fail to respond to management requests, such as Telnet and SSH sessions. In this situation, the switch may become unmanageable, respond slowly, or time out when executing commands or SNMP operations.

Solution

The issue was caused by the large stack size: the stack itself consumed a large share of the CPU resources, so the way to solve the issue was to reduce the stack size. The final solution was to split the 9-switch stack into two stacks, one with 5 switches and the other with 4 switches. After the split, the CPU usage returned to normal.

Suggestions

For the S5700LI, we recommend that a stack contain no more than 5 switches. A larger stack may cause performance issues because the stack feature itself may consume too many CPU resources.
