Troubleshooting High CPU Usage
- Introduction
- Common Causes of High CPU Usage
- How to Locate the High CPU Usage Problem
- How to Fix the High CPU Usage Problem
- Checking Whether the Problem Is Caused by a Network Attack
- Checking Whether the Problem Is Caused by Network Flapping
- Checking Whether the Problem Is Caused by Network Loop
- Checking Whether the Problem Is Caused by the Flow Sampling Function
- Checking Whether the Problem Is Caused by a Large Number of Logs
- High CPU Usage Typical Cases
- How to Relieve CPU Load
- Related Information
Introduction
The CPU on a switch will be overloaded if the forwarding plane sends packets to the CPU at high speeds or a task consumes CPU resources for a long time. When this occurs, the CPU may be unable to process other tasks in a timely manner, which may cause exceptions in services.
This document describes the procedure for locating high CPU usage, the solution to high CPU usage, and the typical cases of high CPU usage.
Common Causes of High CPU Usage
High CPU usage may be caused by:
- Attacks
- Network flapping (including STP and routing flapping)
- Network loop
- Flow sampling configuration on the device, consuming a large number of CPU resources
- A large number of logs generated on the device, consuming a lot of CPU resources
How to Locate the High CPU Usage Problem
Checking the Switch and Version Information
Run the display version and display device commands to check the switch version and component types. Record the information for follow-up operations.
- Run the display version command to view the switch software version.
<HUAWEI> display version
Huawei Versatile Routing Platform Software
VRP (R) software, Version 8.160 (CE12800 V200R003C00SPC200)
Copyright (C) 2012-2017 Huawei Technologies Co., Ltd.
HUAWEI CE12804 uptime is 11 days, 1 hour, 27 minutes
BKP version information:
1.PCB Version : DE01BAK04A VER C
2.Board Type : CE-BAK04A
3.MPU Slot Quantity : 2
4.LPU Slot Quantity : 4
5.SFU Slot Quantity : 6
……
The VRP (R) software, Version 8.160 field indicates that this is a CE12800 switch running V200R003C00.
- Run the display device command to check the switch model, whether the switch is in a stack, and service board types (only modular switches).
<HUAWEI> display device CE12804's Device status: ------------------------------------------------------------------------------------------- Slot Card Type Online Power Register Alarm Primary ------------------------------------------------------------------------------------------- 2 - CE-L12LQ-EA Present On Registered Normal NA 4 - CE-L48XS-EF Present On Registered Normal NA 6 - CE-MPUA Present On Registered Normal Master 7 - CE-CMUA Present On Registered Normal Slave 8 - CE-CMUA Present On Registered Normal Master 9 - CE-SFU04C Present On Registered Normal NA 10 - CE-SFU04C Present On Registered Normal NA 11 - CE-SFU04B Present On Registered Normal NA PWR1 - PAC-2700WA Present On Registered Normal NA FAN1 - FAN-12C Present On Registered Normal NA FAN2 - FAN-12C Present On Registered Normal NA FAN3 - FAN-12C Present On Registered Normal NA FAN4 - FAN-12C Present On Registered Normal NA FAN5 - FAN-12C Present On Registered Normal NA FAN6 - FAN-12C Present On Registered Normal NA FAN7 - FAN-12C Present On Registered Normal NA FAN8 - FAN-12C Present On Registered Normal NA FAN9 - FAN-12C Present On Registered Normal NA -------------------------------------------------------------------------------------------
Checking the CPU Usage
Check the CPU usage as follows:
- Run the display cpu [ slot slot-id ] command to view the CPU usage.
After several seconds, run the display cpu [ slot slot-id ] command again and check whether the System CPU Using Percentage field still shows a high value.
If the average CPU usage (indicated by the System CPU Using Percentage field) or the usage of a single CPU core (indicated by the Current field) remains higher than 75%, the CPU usage of the switch is persistently high.
<HUAWEI> display cpu
CPU utilization statistics at 2017-12-01 11:17:44 945 ms
System CPU Using Percentage : 12%
CPU utilization for five seconds: 12%, one minute: 12%, five minutes: 11%.
Max CPU Usage : 37%
Max CPU Usage Stat. Time : 2017-11-28 16:55:21 599 ms
State: Non-overload
Overload threshold: 90%, Overload clear threshold: 75%, Duration: 480s
---------------------------
ServiceName  UseRate
---------------------------
SYSTEM           12%
AAA               0%
...
...
---------------------------
CPU Usage Details
----------------------------------------------------------------
CPU   Current  FiveSec  OneMin  FiveMin  Max  MaxTime
----------------------------------------------------------------
cpu0  21%      22%      21%     19%      59%  2017-11-20 09:43:19
cpu1  12%      12%      13%     12%      64%  2017-11-20 09:43:19
cpu2  12%      11%      11%     11%      69%  2017-11-20 09:43:09
cpu3   3%       3%       3%      3%       8%  2017-11-20 09:43:09
----------------------------------------------------------------
Identify the tasks with the highest CPU usage and focus on the top three. For details, see Determining Fault Causes According to CPU Usages of Tasks.
- Check whether related alarms have been reported on the NMS.
If the switch is managed by an NMS, check whether a high CPU usage alarm has been reported on the NMS.
When the CPU usage exceeds the alarm threshold, the switch reports the alarm SYSTEM_1.3.6.1.4.1.2011.5.25.129.2.4.1 hwCPUUtilizationRisingAlarm to the NMS. Administrators can obtain high CPU usage information according to the alarm information. The CPU usage alarm threshold can be configured using the set cpu threshold command in the system view, and the default value is 95% in V100R005C00 and earlier versions or 90% in V100R005C10 and later versions.
- Check whether the log records a high CPU usage.
View the system log files or run the display logbuffer command to check whether the system has recorded logs about high CPU usage.
The system log may include the current or historical high CPU usage records.
Related log: SYSTEM/1/hwCPUUtilizationRisingAlarm_active.
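The 75% check described above for the display cpu output can be scripted when the output is collected automatically. The following Python sketch is illustrative only: it assumes the output has been captured as plain text with one field per line, and the function name cpu_usage_is_high is hypothetical. It evaluates a single sample, so run it on two or more samples taken several seconds apart to confirm that the usage remains high.

```python
import re

THRESHOLD = 75  # percent, the figure used in this document


def cpu_usage_is_high(output, threshold=THRESHOLD):
    """Return True if the average or any single-core usage exceeds threshold."""
    system = re.search(r"System CPU Using Percentage\s*:\s*(\d+)%", output)
    # Per-core rows look like "cpu0 21% 22% 21% 19% 59% 2017-11-20 09:43:19";
    # the first percentage after the core name is the Current column.
    cores = [int(value) for value in re.findall(r"\bcpu\d+\s+(\d+)%", output)]
    average = int(system.group(1)) if system else 0
    return average > threshold or any(core > threshold for core in cores)


# Sample lines taken from the display cpu output shown in this section.
sample = (
    "System CPU Using Percentage : 12%\n"
    "cpu0 21% 22% 21% 19% 59% 2017-11-20 09:43:19\n"
    "cpu1 12% 12% 13% 12% 64% 2017-11-20 09:43:19\n"
)
print(cpu_usage_is_high(sample))  # False
```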
Determining Fault Causes According to CPU Usages of Tasks
Run the display cpu [ slot slot-id ] command to view the top 3 tasks occupying high CPU usage (in V200R005 and later versions, the tasks are listed in a descending order of CPU usage).
Use Table 1-1 to identify why the CPU usage is high and which solution applies.
Task Name | Description | Reason for High CPU Usage | Solution |
---|---|---|---|
SYSTEM | System management | A large number of protocol packets are processed. | Check whether a network attack occurs. |
DEVICE | Device management | Interfaces are disconnected intermittently. | Check whether network flapping occurs. |
CMF | Configuration management framework | Configurations are delivered in a batch or information is queried using SNMP. | Check whether a large number of SNMP packets are sent to the CPU or whether a large number of configurations are delivered. |
NETSTREAM | Flow sampling | A large number of packets are sampled. | Check whether flow sampling is configured. |
SFLOW | Flow sampling | A large number of packets are sampled. | Check whether flow sampling is configured. |
FEA | Service adaptation layer | A batch configuration backup or task is performed. | Check whether batch tasks are performed. |
IP STACK | Protocol stack | Routing protocols flap. | Check whether routing protocols flap. |
LOCAL PKT | Host packet transmission | A large number of protocol packets are sent to the CPU. | Check whether a network attack occurs. |
If the top tasks on your switch are not included in the preceding table, see What Are CPU and CPU Usage? to find out which services caused the high CPU usage.
If the top tasks on your switch are listed in neither the preceding table nor What Are CPU and CPU Usage?, contact technical support personnel.
The preceding table is only a reference for you to locate a high CPU usage problem. To fix the problem, see How to Fix the High CPU Usage Problem.
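The lookup in Table 1-1 can also be expressed as a small script. The sketch below is a minimal illustration: the CHECKS dictionary copies the Solution column of the table, while the function name suggest_checks and the usage figures in the example are hypothetical.

```python
# Solution column of Table 1-1, keyed by task name.
CHECKS = {
    "SYSTEM": "Check whether a network attack occurs.",
    "DEVICE": "Check whether network flapping occurs.",
    "CMF": "Check whether a large number of SNMP packets are sent to the CPU "
           "or whether a large number of configurations are delivered.",
    "NETSTREAM": "Check whether flow sampling is configured.",
    "SFLOW": "Check whether flow sampling is configured.",
    "FEA": "Check whether batch tasks are performed.",
    "IP STACK": "Check whether routing protocols flap.",
    "LOCAL PKT": "Check whether a network attack occurs.",
}


def suggest_checks(usage_by_task, top_n=3):
    """Return (task, suggested check) pairs for the top_n busiest tasks."""
    ranked = sorted(usage_by_task.items(), key=lambda item: item[1], reverse=True)
    fallback = "Not listed in Table 1-1."
    return [(name, CHECKS.get(name.upper(), fallback)) for name, _ in ranked[:top_n]]


# Hypothetical per-task usage percentages read from the display cpu output.
usage = {"SYSTEM": 12, "AAA": 0, "LOCAL PKT": 40, "FEA": 5, "CMF": 1}
for task, check in suggest_checks(usage):
    print(task, "->", check)
```

Tasks that are not in the table return the fallback text, which corresponds to the guidance above on consulting What Are CPU and CPU Usage? or technical support.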
How to Fix the High CPU Usage Problem
Checking Whether the Problem Is Caused by a Network Attack
In some situations, network attacks may cause high CPU usage. In a network attack, hosts or network devices send a large number of forged packets to switches, affecting security and services on the target switches. While under attack, the switch is busy handling requests from the attack source, so some tasks consume a large amount of CPU resources, causing high CPU usage on the switch.
Common Network Attacks
Common network attacks, such as ARP, ARP Miss, and DHCP attacks, can cause a high CPU usage on a switch. These attacks are all initiated by sending a large number of protocol packets; therefore, packet statistics on the switch show a large number of packets sent to the CPU.
- ARP and ARP-Miss attack
  - ARP and ARP Miss flood
  - ARP spoofing
- DHCP protocol packet attack
- Other attack
  - ICMP attack
  - DDoS
  - Broadcast attack
  - TTL-expired attack
  - Initiating IP packets with the switch's IP address as the destination address
  - SSH/FTP/Telnet attacks
Network Attack Locating
- Run the display version and display device commands to check the switch version and component types. Record the information for follow-up operations.
- Run the display cpu-defend statistics all command to view statistics about the packets sent to the CPU, determining whether too many protocol packets are discarded due to timeout.
- Run the reset cpu-defend statistics all command to clear statistics about the packets sent to the CPU.
- After several seconds, run the display cpu-defend statistics all command to view statistics about the packets sent to the CPU.
If the packet count of a protocol is unusually high, determine from the networking whether this is expected. If not, the switch is probably undergoing a protocol packet attack.
<HUAWEI> reset cpu-defend statistics all
<HUAWEI> display cpu-defend statistics all
Statistics(packets) on slot 1 :
--------------------------------------------------------------------------------
PacketType  Total Passed  Total Dropped  Last Dropping Time  Last 5 Min Passed  Last 5 Min Dropped
--------------------------------------------------------------------------------
arp               784824              0  -                                   8                   0
arp-miss               0              0  -                                   0                   0
fib-hit            25993              0  -                                   0                   0
snmp             4922372              0  -                                 599                   0
telnet               425              0  -                                   0                   0
......
--------------------------------------------------------------------------------
- Configure the attack source tracing function to find out the attack source.
The switch provides the local attack defense function to protect the CPU. This function prevents the service interruptions that occur when the CPU has to process a large number of packets sent to it.
- Create the local attack defense policy based on attack source tracing.
<HUAWEI> system-view
[~HUAWEI] cpu-defend policy policy1 //Create the local attack defense policy.
[*HUAWEI-cpu-defend-policy-policy1] auto-defend enable //Enable attack source tracing.
[*HUAWEI-cpu-defend-policy-policy1] auto-defend trace-type source-ip source-mac //Configure attack source tracing based on source MAC addresses and source IP addresses.
[*HUAWEI-cpu-defend-policy-policy1] auto-defend protocol all //Match all protocol packets.
[*HUAWEI-cpu-defend-policy-policy1] quit
[*HUAWEI] cpu-defend-policy policy1 //Apply the CPU attack defense policy globally.
[*HUAWEI] commit
After configuring the local attack defense function based on attack source tracing, run the display auto-defend attack-source command to check attack source information (IP address and MAC address).
<HUAWEI> display auto-defend attack-source Attack Source User Table on Slot 1 : ------------------------------------------------------------------------- MAC Address Interface PacketType VLAN:Outer/Inner Total ------------------------------------------------------------------------- 0000-c102-0102 10GE1/0/1 ICMP 1000/ 4832 ------------------------------------------------------------------------- Total: 1 Attack Source IP Table on Slot 1 : ------------------------------------------------------------------------- IP Address PacketType Total ------------------------------------------------------------------------- 10.1.1.2 ICMP 1144 ------------------------------------------------------------------------- Total: 1 Attack Source Port Table on Slot 1 : ------------------------------------------------------------------------- Interface VLAN:Outer/Inner PacketType Total ------------------------------------------------------------------------- 10GE1/0/1 1000/-- ICMP 4832 ------------------------------------------------------------------------- Total: 1
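When the display cpu-defend statistics all output is long, ranking the protocols by their recent packet counts makes an abnormal protocol easier to spot. The following Python sketch is an illustration under stated assumptions: it expects the output as plain text with one table row per line, and the function names parse_cpu_defend and busiest are hypothetical. Whether a high count is normal still depends on the networking, as described above.

```python
def parse_cpu_defend(output):
    """Parse rows such as 'arp 784824 0 - 8 0' into dictionaries."""
    rows = []
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 6:
            continue  # separators, titles, and elided rows
        # The Last Dropping Time column may be "-" or a timestamp with spaces,
        # so the last two counters are taken from the end of the row.
        numeric = [tokens[1], tokens[2], tokens[-2], tokens[-1]]
        if not all(value.isdigit() for value in numeric):
            continue  # header row
        rows.append({
            "type": tokens[0],
            "passed": int(tokens[1]),
            "dropped": int(tokens[2]),
            "recent_passed": int(tokens[-2]),
            "recent_dropped": int(tokens[-1]),
        })
    return rows


def busiest(rows, top_n=3):
    """Rank packet types by packets seen in the last five minutes."""
    key = lambda row: row["recent_passed"] + row["recent_dropped"]
    return sorted(rows, key=key, reverse=True)[:top_n]


# Rows taken from the sample output in this section.
sample = """PacketType Total Passed Total Dropped Last Dropping Time Last 5 Min Passed Last 5 Min Dropped
arp 784824 0 - 8 0
arp-miss 0 0 - 0 0
fib-hit 25993 0 - 0 0
snmp 4922372 0 - 599 0
telnet 425 0 - 0 0
"""
for row in busiest(parse_cpu_defend(sample), top_n=2):
    print(row["type"], row["recent_passed"], row["recent_dropped"])
```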
Handling Suggestion
Select an appropriate method based on the attack source information and networking.
- Configure ARP security to prevent ARP attacks.
The switch provides ARP security to prevent ARP and ARP Miss packet attacks.
For details about ARP security, see ARP Security Solutions in the Configuration > Security Configuration Guide > ARP Security Configuration.
- Configure an attack source tracing action: discard attack packets within the specified period.
# Configure an attack source tracing action to discard attack packets within 300s.
<HUAWEI> system-view
[~HUAWEI] cpu-defend policy policy1
[*HUAWEI-cpu-defend-policy-policy1] auto-defend enable
[*HUAWEI-cpu-defend-policy-policy1] auto-defend action deny timeout 300 //By default, attack source tracing action is not enabled.
[*HUAWEI-cpu-defend-policy-policy1] commit
- Configure a blacklist to discard the packets sent from the blacklisted users.
If a user sends attack packets to the switch, you can specify the characteristics of these packets in an ACL and apply the ACL to a blacklist. When the packets from this user are sent to the switch's CPU, the switch discards the packets.
In the following example, the attack packets come from the network segment 10.1.1.0/24. An ACL is configured to match this segment and is applied to a blacklist so that the attack packets are discarded.
<HUAWEI> system-view
[~HUAWEI] acl number 2001
[*HUAWEI-acl4-basic-2001] rule permit source 10.1.1.0 0.0.0.255
[*HUAWEI-acl4-basic-2001] quit
[*HUAWEI] cpu-defend policy policy1
[*HUAWEI-cpu-defend-policy-policy1] blacklist 1 acl 2001
[*HUAWEI-cpu-defend-policy-policy1] commit
- Configure a punishment action for the attack source tracing function: Set the interface that receives attack packets to the Error-down state to prevent the attack source from continuing to attack the switch.
If attack packets are sent from a specified interface, and setting this interface to the Error-down state does not affect services, use this method.
If services of authorized users on the interface that receives attack packets may be interrupted after this interface is set to the Error-down state, exercise caution when using this method.
# Set the interface that receives attack packets to the Error-down state.
<HUAWEI> system-view
[~HUAWEI] cpu-defend policy policy1
[*HUAWEI-cpu-defend-policy-policy1] auto-defend enable
[*HUAWEI-cpu-defend-policy-policy1] auto-defend action error-down
[*HUAWEI-cpu-defend-policy-policy1] commit
Checking Whether the Problem Is Caused by Network Flapping
When network flapping occurs, the network topology frequently changes. The switch is busy with network switching events, causing a high CPU usage. Network flapping includes STP flapping and OSPF route flapping.
STP Flapping
When STP flapping occurs, the switch frequently recalculates the STP topology and updates the MAC address table and ARP table, causing high CPU usage.
- Fault Location
If you consider that STP flapping may occur, run the display stp topology-change command multiple times at an interval of several seconds to view STP topology information. Alternatively, you can check the trap and log information on the switch to determine whether STP topology has changed.
Run the command multiple times. Check whether the value of Number of topology changes increases.
<HUAWEI> display stp topology-change
CIST topology change information
Number of topology changes :5
Time since last topology change :0 days 0h:23m:19s
Topology change initiator(detected) :10GE1/0/1
Number of generated topologychange traps : 5
Number of suppressed topologychange traps: 3
After you confirm that the network topology is changing frequently, run the display stp tc-bpdu statistics command several times at an interval of several seconds. Check whether interfaces on the switch have received Topology Change (TC) BPDUs. If so, find out the source of the TC BPDUs, that is, the device causing the topology change.
<HUAWEI> display stp tc-bpdu statistics
-------------------------- STP TC/TCN information --------------------------
MSTID  Port       TC(Send/Receive)  TCN(Send/Receive)
0      10GE1/0/3  2/3               0/0
1      10GE1/0/5  1/0               -/-
- If only the TC(Send) value increases, the topology change is caused by the local switch.
  - If only the TC(Send) value of a single interface increases, the topology change is caused by this interface.
  - If the TC(Send) values of multiple interfaces increase, check the events and logs on the NMS to analyze the STP topology change reason. Find out the interface causing the flapping.
- If multiple values in the TC(Send/Receive) column increase, check the event and log information on the NMS to determine whether the local switch causes the topology change, and check whether STP flapping occurs on the device connected to the problematic interface.
- Suggestion
- Enable the TC protection alarm function to help administrators learn about TC BPDU processing on the switch.
To enable the TC protection and TC protection alarm functions, run the snmp-agent trap enable feature-name mstp and stp tc-protection commands in the system view.
- After the TC protection alarm function is enabled, the switch will generate two alarms: MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.15 hwMstpiTcGuarded and MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.16 hwMstpProTcGuarded.
- After the TC protection function is enabled, within each period specified using the stp tc-protection interval interval-value command (2s by default), the switch processes at most the number of TC BPDUs configured using the stp tc-protection threshold threshold command (one TC BPDU by default).
- Rectify the fault according to topology changes.
If STP topology changes are caused by access-side port status transition, run the following commands to configure this port as an STP edge port or enable the STP BPDU protection function to reduce the impact of BPDUs on the CPU.
<HUAWEI> system-view
[~HUAWEI] interface 10ge 1/0/1
[~HUAWEI-10GE1/0/1] stp edged-port enable //Configure the port as an edge port.
[*HUAWEI-10GE1/0/1] quit
[*HUAWEI] stp bpdu-protection //Enable the BPDU protection function.
[*HUAWEI] commit
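The fault location rules for the TC(Send/Receive) counters can be applied by comparing two samples of the display stp tc-bpdu statistics output taken several seconds apart. The Python sketch below is a simplified illustration: it assumes one table row per line, the function names parse_tc and classify are hypothetical, and the counter values in the second sample are made up for the example.

```python
import re

# One data row, for example "0 10GE1/0/3 2/3 0/0": MSTID, port, TC sent/received.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+)/(\d+)\s+\S+\s*$")


def parse_tc(output):
    """Map (MSTID, port) to (TC sent, TC received)."""
    counters = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            mstid, port, sent, received = match.groups()
            counters[(int(mstid), port)] = (int(sent), int(received))
    return counters


def classify(before, after):
    """Apply the rules in this section to two counter samples."""
    zero = (0, 0)
    sent_up = sorted({k[1] for k, v in after.items() if v[0] > before.get(k, zero)[0]})
    recv_up = sorted({k[1] for k, v in after.items() if v[1] > before.get(k, zero)[1]})
    if not sent_up and not recv_up:
        return "no topology change between samples"
    if not recv_up:
        if len(sent_up) == 1:
            return "local switch, interface " + sent_up[0]
        return "local switch, several interfaces: check events and logs"
    return "TC BPDUs received on " + ", ".join(recv_up) + ": check the connected device"


first = "0 10GE1/0/3 2/3 0/0\n1 10GE1/0/5 1/0 -/-\n"   # from the sample output
second = "0 10GE1/0/3 6/3 0/0\n1 10GE1/0/5 1/0 -/-\n"  # hypothetical later sample
print(classify(parse_tc(first), parse_tc(second)))
```

The result is only a starting point. As the section notes, events and logs on the NMS are still needed when several counters increase.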
OSPF Routing Protocol
Routing protocol flapping causes route re-advertisement and recalculation, which increases the load of the CPU. Generally, OSPF is configured to manage dynamic routing information. Therefore, OSPF route flapping is described here.
- Fault Location
- Run the display ospf peer last-nbr-down command to check the reason why the OSPF neighbor relationship goes Down.
The reason is displayed in the Immediate Reason and Primary Reason fields.
- Check logs on the switch to determine why the OSPF neighbor becomes Down.
Run the display logbuffer command, and you can find the following log information:
OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down. (ProcessId=[ProcessId], NeighborRouterId=[NbrRouterId], NeighborIp=[NbrIp], NeighborAreaId=[NbrAreaId], NeighborInterface=[IfName], NeighborDownImmediate reason=[NbrImmReason], NeighborDownPrimeReason=[NbrPriReason], CpuUsage=[CpuUsage])
The NeighborDownImmediate reason field indicates the cause for the OSPF neighbor Down event.
- Suggestion
Determine the reason depending on the key fields and take measures.
Possible causes of the fault are as follows:
- Neighbor Down Due to Inactivity
The Hello packet is not received within the deadtime (set by the ospf timer dead interval command in the interface view).
If the neighbor goes Down for this reason, either the OSPF neighbor relationship is flapping or it cannot be set up. Run the display ospf peer brief command to determine which of the two is occurring.
- OSPF neighbor relationship flaps.
OSPF neighbor flapping may be caused by a small CPCAR value for OSPF, link flapping or congestion on interfaces, and LSA flooding.
- Run the display cpu-defend statistics packet-type ospf command to view statistics about the OSPF packets sent to the CPU. If too many OSPF packets are discarded, check whether the switch undergoes an OSPF attack or the CPCAR value for OSPF is too small.
- View the log to check whether interfaces alternate between Up and Down. If link flapping or congestion occurs, check the link on the interface.
- If the holdtime of the OSPF neighbor relationship is smaller than 20s, run the ospf timer dead interval command to change the holdtime to be larger than 20s.
- If the fault persists after the preceding operations are performed, contact technical support personnel.
- OSPF neighbor relationship cannot be set up.
Check whether the configurations in the OSPF view of devices on both ends are the same. If the configurations such as the OSPF area ID or area type (NSSA, stub area, or common area) are different, the two devices cannot establish an OSPF neighbor relationship.
Run the display ospf [ process-id ] interface command to check whether OSPF is successfully enabled on the interfaces.
<HUAWEI> display ospf 1 interface OSPF Process 1 with Router ID 192.168.5.5 Area: 0.0.0.0 MPLS TE not enabled Interface IP Address Type State Cost Pri Vlanif200 192.168.3.1 Broadcast DR 1 1
- If OSPF is not enabled on interfaces, run the ospf enable [ process-id ] area area-id command in the interface view to enable OSPF.
- If the OSPF process has been enabled on the related interface, run the display ospf error command multiple times at an interval of several seconds to check whether OSPF authentication information on the two devices is the same according to the Bad authentication type and Bad authentication key fields.
<HUAWEI> display ospf error
OSPF Process 1 with Router ID 10.1.1.1
OSPF error statistics
General packet errors:
0 : IP: received my own packet
0 : Bad packet
0 : Bad version
0 : Bad checksum
0 : Bad area id
0 : Drop on unnumbered interface
0 : Bad virtual link
0 : Bad authentication type
0 : Bad authentication key
0 : Packet too small
0 : Packet size > ip length
0 : Transmit error
0 : Interface down
0 : Unknown neighbor
If the value of the Bad authentication type or Bad authentication key field keeps increasing, the OSPF authentication information on the two devices is different. To configure the same authentication information for the two devices, run the ospf authentication-mode command in the interface views or run the authentication-mode command in the OSPF process view.
- Neighbor Down Due to Kill Neighbor
If the interface is Down, BFD is Down, or the reset ospf process command is executed, the OSPF neighbor relationship goes Down.
View the NeighborDownPrimeReason field to determine the reason.
- Neighbor Down Due to 1-Wayhello Received or Neighbor Down Due to SequenceNum Mismatch
When the OSPF status of the peer device goes Down first, the peer device sends a 1-Way Hello packet to the local device, causing OSPF on the local device to go Down.
Determine why OSPF status of the peer device becomes Down.
For other reasons, see OSPF/3/NBR_DOWN_REASON.
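When many OSPF/3/NBR_DOWN_REASON logs have to be reviewed, the key fields can be extracted programmatically and matched against the causes listed above. The Python sketch below is illustrative only: the function name parse_nbr_down is hypothetical, the address values in the sample log are made up, and the NeighborDownPrimeReason value is left as the placeholder from the log template because its possible values are not listed in this document.

```python
import re

# Guidance from this section, keyed by the immediate reason.
ADVICE = {
    "Neighbor Down Due to Inactivity":
        "No Hello packet within the dead time; check for neighbor flapping "
        "or a neighbor relationship that cannot be set up.",
    "Neighbor Down Due to Kill Neighbor":
        "Interface Down, BFD Down, or reset ospf process; "
        "check the NeighborDownPrimeReason field.",
    "Neighbor Down Due to 1-Wayhello Received":
        "The peer went Down first; check why OSPF on the peer became Down.",
    "Neighbor Down Due to SequenceNum Mismatch":
        "The peer went Down first; check why OSPF on the peer became Down.",
}


def parse_nbr_down(log):
    """Return the key=value fields from the trailing parenthesized group."""
    match = re.search(r"\(([^()]*)\)\s*$", log)
    if not match:
        return {}
    fields = {}
    for part in match.group(1).split(", "):
        key, separator, value = part.partition("=")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical log built from the template in this section.
log = (
    "OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down. "
    "(ProcessId=1, NeighborRouterId=10.1.1.2, NeighborIp=192.168.3.2, "
    "NeighborAreaId=0.0.0.0, NeighborInterface=Vlanif200, "
    "NeighborDownImmediate reason=Neighbor Down Due to Inactivity, "
    "NeighborDownPrimeReason=[NbrPriReason], CpuUsage=3%)"
)
fields = parse_nbr_down(log)
reason = fields["NeighborDownImmediate reason"]
print(reason, "->", ADVICE.get(reason, "See OSPF/3/NBR_DOWN_REASON."))
```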
Checking Whether the Problem Is Caused by Network Loop
A network loop will cause MAC address flapping on a switch and a broadcast storm on the network. When this occurs, a large number of protocol packets are sent to the CPU, resulting in high CPU usage of the switch.
- Fault Location
After a network loop occurs, the following situations often occur:
- A MAC address flapping alarm is generated on the switch. To check MAC address flapping records, run the display mac-address flapping command.
- The switch's CPU usage is high.
- Indicators of interfaces in the VLAN where the loop has occurred blink faster than usual.
- Administrators cannot remotely log in to the switch, and the switch responds slowly to operations performed through the console interface.
- Packets are lost or cannot be forwarded in ping tests.
- The display interface command output shows that a large number of broadcast or multicast packets exist on interfaces.
- PCs connected to the switch receive a large number of broadcast packets or unknown unicast packets.
- Suggestion
If configurations or connections are changed before the fault occurs, you are advised to roll back the changes. Otherwise, rectify the fault according to the following procedure:
- Confirm the interface where a broadcast storm has occurred according to the interface indicator status and traffic.
- Locate the device where the network loop has occurred according to the link topology.
- Locate the interfaces where the loop has occurred and remove the loop.
- If the fault persists after the preceding measures are taken, collect the networking information, device configuration file, log information, and alarm information, and contact Huawei technical support personnel.
Checking Whether the Problem Is Caused by the Flow Sampling Function
When the flow sampling function is configured on the device, the CPU usage may be high due to high traffic volume and high sampling rate.
- Fault Location
If the display cpu [ slot slot-id ] command output shows that the FEA and NETSTREAM tasks (or the FEA and SFLOW tasks) have high CPU usage, flow sampling is configured on the device and either the traffic volume or the sampling rate is high.
- Suggestion
Check the flow sampling configuration, reduce the sampling rate based on the traffic on interfaces, and then check whether CPU usage is reduced to the normal range.
- Adjust the NetStream sampling rate.
- Adjust the flow sampling function of all interfaces in the system view.
<HUAWEI> system-view
[~HUAWEI] netstream sampler random-packets 32768 inbound
[*HUAWEI] netstream sampler random-packets 32768 outbound
[*HUAWEI] commit
- Adjust the flow sampling function of a specified interface in the interface view.
<HUAWEI> system-view
[~HUAWEI] interface 10ge 1/0/1
[~HUAWEI-10GE1/0/1] netstream sampler random-packets 32768 inbound
[*HUAWEI-10GE1/0/1] netstream sampler random-packets 32768 outbound
[*HUAWEI-10GE1/0/1] commit
- Adjust the sFlow sampling rate.
<HUAWEI> system-view
[~HUAWEI] interface 10ge 1/0/1
[~HUAWEI-10GE1/0/1] sflow sampling rate 32768
[*HUAWEI-10GE1/0/1] commit
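To choose a sampling value for the commands above, it can help to estimate how many sampled packets per second a given value produces. The Python sketch below assumes that a value of N means roughly one packet sampled out of every N. The traffic rate and the sample budget in the example are hypothetical, and the function names are illustrative. Check the command reference for the values that your switch model actually accepts.

```python
import math


def sampled_pps(packet_rate_pps, sampling_ratio):
    """Estimate sampled packets per second for 1-in-N sampling (assumed model)."""
    if sampling_ratio < 1:
        raise ValueError("sampling_ratio must be at least 1")
    return packet_rate_pps / sampling_ratio


def min_ratio(packet_rate_pps, max_samples_pps):
    """Smallest N that keeps the estimated sample rate within the budget."""
    return math.ceil(packet_rate_pps / max_samples_pps)


# Hypothetical interface carrying 10 million packets per second.
rate = 10_000_000
print(sampled_pps(rate, 1024))    # estimated samples per second with a small N
print(sampled_pps(rate, 32768))   # estimated samples per second with the N used above
print(min_ratio(rate, 500))       # N needed for a hypothetical budget of 500 samples/s
```

A larger N lowers the number of samples the CPU has to process at the cost of coarser traffic statistics, which is the trade-off behind the suggestion to reduce the sampling rate on busy interfaces.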
Checking Whether the Problem Is Caused by a Large Number of Logs
In some conditions, for example during attacks, errors, or frequent interface status transitions, the device generates diagnostic information or logs continuously. The system then frequently reads from and writes to the storage device, causing high CPU usage.
- Fault Location
Run the display logbuffer command to check whether a large number of abnormal logs are displayed. For example, a large number of the same logs are generated continuously.
- Suggestion
Check the log reference manual of the corresponding product according to the log name and solve the problem according to the troubleshooting procedure.
If the fault persists after the preceding measures are taken, collect the networking information, device configuration file, log information, and alarm information, and contact Huawei technical support personnel.
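Repeated logs are easier to spot when the display logbuffer output is counted by log name. The Python sketch below is a minimal illustration: it assumes that each log line contains an identifier of the form %%01MODULE/severity/LogName, as in the log samples in this document. The function name count_logs is hypothetical, and the sample lines are abbreviated with invented timestamps.

```python
import re
from collections import Counter

# Matches identifiers such as "%%01DEFEND/4/hwCpcarDropPacketAlarm".
LOG_ID = re.compile(r"%%\d{2}([A-Za-z0-9_]+)/(\d)/([A-Za-z0-9_]+)")


def count_logs(text):
    """Count log lines by MODULE/severity/LogName."""
    counts = Counter()
    for line in text.splitlines():
        match = LOG_ID.search(line)
        if match:
            counts["/".join(match.groups())] += 1
    return counts


# Abbreviated sample lines; timestamps are hypothetical.
sample = """Dec 4 2016 11:37:34 HUAWEI %%01SYSTEM/1/hwCPUUtilizationRisingAlarm(t):...
Dec 4 2016 11:45:47 HUAWEI %%01DEFEND/4/hwCpcarDropPacketAlarm(t):...
Dec 4 2016 11:45:52 HUAWEI %%01DEFEND/4/hwCpcarDropPacketAlarm(t):...
Dec 4 2016 11:45:57 HUAWEI %%01DEFEND/4/hwCpcarDropPacketAlarm(t):...
"""
for name, total in count_logs(sample).most_common(3):
    print(total, name)
```

The log names with the highest counts are the ones to look up first in the log reference manual.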
High CPU Usage Typical Cases
A Switch Suffers an ARP Packet Attack
Symptom
In Figure 1-1, Switch functions as a gateway. Switch_1 frequently becomes unmanageable, and users on Switch_1 are frequently disconnected. Pings from Switch_1 to Switch are delayed or fail. Services on Switch_2 are normal, and Switch_2 can successfully ping the gateway.
Root Cause
Switch_1 receives a flood of ARP packets with a fixed source MAC address. As a result, user devices cannot send or receive ARP packets normally.
Identification Method
Perform the following operations on Switch_1:
- Check whether the CPU usage is high.
<HUAWEI> display cpu
CPU utilization statistics at 2015-12-04 11:04:40 820 ms
System CPU Using Percentage : 82%
CPU utilization for five seconds: 82%, one minute: 82%, five minutes: 82%.
Max CPU Usage : 87%
Max CPU Usage Stat. Time : 2015-11-28 16:55:21 599 ms
The CPU usage reaches 82%.
- View temporary ARP entries to check whether ARP learning is normal.
<HUAWEI> display arp ARP Entry Types: D - Dynamic, S - Static, I - Interface, O - OpenFlow EXP: Expire-time VLAN:VLAN or Bridge Domain IP ADDRESS MAC ADDRESS EXP(M) TYPE/VLAN INTERFACE VPN-INSTANCE ------------------------------------------------------------------------------ 10.137.222.139 00e0-fc01-4422 I - MEth0/0/0 10.1.1.1 200b-c739-130c I Vlanif10 10.2.3.4 200b-c739-1316 I Vlanif200 12.1.1.1 200b-c739-1302 I 10GE4/0/8 12.1.1.2 f84a-bff0-cac2 12 D 10GE4/0/8 50.1.1.2 Incomplete 1 D 10GE4/0/22 50.1.1.3 Incomplete 1 D 10GE4/0/22 ...... ------------------------------------------------------------------------------
The MAC ADDRESS fields of two ARP entries are Incomplete, indicating temporary entries. Some ARP entries cannot be learned.
- Check whether the switch is suffering an ARP attack.
- View statistics about ARP request packets sent to the CPU.
<HUAWEI> display cpu-defend statistics packet-type arp all Statistics(packets) on slot 2 : -------------------------------------------------------------------------------- PacketType Total Passed Total Dropped Last Dropping Time Last 5 Min Passed Last 5 Min Dropped -------------------------------------------------------------------------------- arp 0 0 - 0 0 -------------------------------------------------------------------------------- Statistics(packets) on slot 4 : -------------------------------------------------------------------------------- PacketType Total Passed Total Dropped Last Dropping Time Last 5 Min Passed Last 5 Min Dropped -------------------------------------------------------------------------------- arp 106549 44380928 - 3 0 --------------------------------------------------------------------------------
The board in slot 4 has dropped a large number of ARP packets (Total Dropped: 44380928), which indicates that ARP packets are being sent to the CPU at an excessive rate.
- Configure attack source tracing to identify the attack source.
<HUAWEI> system-view
[~HUAWEI] cpu-defend policy policy1
[*HUAWEI-cpu-defend-policy-policy1] auto-defend enable
[*HUAWEI-cpu-defend-policy-policy1] auto-defend attack-packet sample 5 //One packet is sampled out of every five packets. A small sampling value consumes more CPU resources.
[*HUAWEI-cpu-defend-policy-policy1] auto-defend threshold 30 //Packets whose rate reaches 30 pps are considered attack packets. If there are many attack sources, reduce this value.
[*HUAWEI-cpu-defend-policy-policy1] auto-defend trace-type source-mac //Identify the attack source based on source MAC address.
[*HUAWEI-cpu-defend-policy-policy1] auto-defend protocol arp //Identify the attack source of the ARP attack.
[*HUAWEI-cpu-defend-policy-policy1] quit
[*HUAWEI] cpu-defend-policy policy1
[*HUAWEI] commit
- View attack source information.
[~HUAWEI] display auto-defend attack-source Attack Source User Table on Slot 4 : ------------------------------------------------------------------------- MAC Address Interface PacketType VLAN:Outer/Inner Total ------------------------------------------------------------------------- 0000-c102-0102 10GE4/0/22 ARP 1000/ 4832 -------------------------------------------------------------------------
The attack source has the MAC address 0000-c102-0102 and is connected to 10GE4/0/22.
Solution
- Configure a blacklist.
#
acl number 4000
 rule 10 permit type arp source-mac 0000-c102-0102
#
cpu-defend policy 1
 blacklist 1 acl 4000  //Add the users with specified characteristics to the blacklist through an ACL. The switch discards the packets from the users in the blacklist.
#
cpu-defend-policy 1
#
- Configure the attack source tracing action.
#
cpu-defend policy policy1
 auto-defend enable
 auto-defend action deny  //Set the attack source tracing action. The switch discards all attack packets within the default interval, 300s.
 auto-defend alarm enable
 auto-defend threshold 30
 auto-defend trace-type source-mac
 auto-defend protocol arp
#
cpu-defend-policy policy1
#
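The threshold used by attack source tracing can be pictured as a per-source packet rate check. The Python sketch below is a simplified model of that idea and is not the switch's implementation: it counts packets per source MAC address in one-second windows and flags any source that reaches 30 pps. Packet sampling is not modeled, and the function name and the second MAC address in the example are hypothetical.

```python
from collections import Counter


def find_attack_sources(packets, threshold_pps=30, window_s=1):
    """Return source MACs whose packet count reaches the threshold in any window.

    packets: iterable of (timestamp_seconds, source_mac) tuples.
    This is an illustrative model only; sampling is not simulated.
    """
    counts = Counter()
    for ts, mac in packets:
        window = int(ts // window_s)  # index of the window this packet falls in
        counts[(window, mac)] += 1
    limit = threshold_pps * window_s
    return sorted({mac for (_, mac), n in counts.items() if n >= limit})


# 40 packets in one second from the attacker, 5 from a hypothetical normal host.
packets = [(i / 40, "0000-c102-0102") for i in range(40)]
packets += [(i / 5, "0000-aaaa-0001") for i in range(5)]
print(find_attack_sources(packets))
```

In this model, lowering threshold_pps flags more sources, which corresponds to the advice to reduce the threshold when there are many attack sources.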
STP Flapping Causes a High CPU Usage
Symptom
A fixed switch has a high CPU usage, and generates many logs about ARP packets that are discarded because their rate exceeds the CPCAR value. The interface information shows that the number of TC BPDUs received by STP-enabled interfaces keeps increasing.
Root Cause
An interface has received a large number of TC BPDUs, causing STP flapping. Many MAC entries are deleted and ARP entries are updated. Therefore, the switch needs to process many ARP Miss, ARP request, and ARP reply packets, causing a high CPU usage.
Identification Method
- View the logs. The switch has generated logs indicating a high CPU usage.
Dec 4 2016 11:37:34 HUAWEI %%01SYSTEM/1/hwCPUUtilizationRisingAlarm(t):CID=0x80020106-OID=1.3.6.1.4.1.2011.5.25.129.2.4.1;The CPU usage exceeded the pre-set overload threshold.(TrapSeverity=3, ProbableCause=74240, EventType=3, PhysicalIndex=17170433, PhysicalName=MPU slot 6, RelativeResource=CPU, UsageType=1, SubIndex=0, CpuUsage=92, Unit=1, CpuUsageThreshold=90)
- Check the alarms. The switch has also generated an alarm indicating that packets were dropped because their rate exceeded the CPCAR limit.
Dec 4 2016 11:45:47 HUAWEI %%01DEFEND/4/hwCpcarDropPacketAlarm(t):CID=0x80e70402-OID=1.3.6.1.4.1.2011.5.25.165.2.2.7.1;Rate of packets to cpu exceeded the CPCAR limit in slot 4. (Protocol=ARP, PPS/CBS=0/0, ExceededPacketCount=20699)
- Collect statistics about transmitted and received TC BPDUs on interfaces.
Run the display stp tc-bpdu statistics command at an interval of several seconds. Check the statistics about sent and received TC/TCN BPDUs. It is found that the number of TC BPDUs on all STP-enabled interfaces keeps increasing.
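The comparison of repeated samples can be automated once the counters have been collected. The following Python sketch assumes that the received TC BPDU counters have already been extracted into one dictionary per sample, keyed by interface name. The function name, the dictionary layout, and the interface names are assumptions made for illustration; the output format of display stp tc-bpdu statistics is not parsed here.

```python
def interfaces_with_rising_tc(samples):
    """Return interfaces whose received TC BPDU count grows in every sample.

    samples: list of {interface_name: received_tc_count} dictionaries,
    collected a few seconds apart, oldest first.
    """
    if len(samples) < 2:
        return []
    # Only compare interfaces that appear in every sample.
    common = set(samples[0]).intersection(*samples[1:])
    rising = []
    for ifname in sorted(common):
        values = [s[ifname] for s in samples]
        if all(later > earlier for earlier, later in zip(values, values[1:])):
            rising.append(ifname)
    return rising


samples = [
    {"10GE1/0/1": 100, "10GE1/0/2": 7},
    {"10GE1/0/1": 180, "10GE1/0/2": 7},
    {"10GE1/0/1": 260, "10GE1/0/2": 7},
]
print(interfaces_with_rising_tc(samples))
```

An interface that appears in the result on every run is a likely entry point of the TC BPDUs and is a good place to start tracing the flapping link.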
Solution
- Run the stp tc-protection command in the system view to enable TC protection trap. By default, TC protection trap is disabled.
After TC protection trap is enabled, the switch updates entries at most once within 2 seconds if it frequently receives TC BPDUs. This reduces the number of tasks to be processed by the CPU in frequently updating MAC and ARP entries.
The switch will trigger the MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.15 hwMstpiTcGuarded and MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.16 hwMstpProTcGuarded traps.
- Run the arp topology-change disable command in the system view to disable the switch from responding to TC BPDUs. By default, the switch responds to received TC BPDUs.
After receiving TC BPDUs, the switch ages out ARP entries by default. After this command is executed, the switch does not age out or delete ARP entries when receiving TC BPDUs. When the network topology changes frequently, this prevents excessive ARP packets caused by ARP relearning and high CPU usage.
- Run the mac-address update arp enable command in the system view to enable ARP entry update upon MAC address change. By default, ARP entry update upon MAC address change is enabled.
By default, the switch deletes the MAC address entries after receiving TC BPDUs. After this command is executed, the switch updates the outbound interfaces in ARP entries when the outbound interfaces in MAC entries are changed. This reduces the number of ARP entry updates.
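The effect of limiting entry updates to at most one every 2 seconds, as described for TC protection above, can be estimated with a small model. The Python sketch below is a simplification for illustration and not the switch's algorithm: it counts how many entry updates are triggered for a series of TC BPDU arrival times, with and without a minimum interval between updates. The function name is hypothetical, times are in milliseconds, and any deferred processing of suppressed TC BPDUs is not modeled.

```python
def count_entry_updates(tc_times_ms, protect_interval_ms=2000):
    """Count entry updates when at most one is allowed per protection interval.

    tc_times_ms: arrival times of TC BPDUs in milliseconds.
    A protect_interval_ms of 0 models a switch without TC protection.
    """
    updates = 0
    next_allowed = None  # earliest time at which the next update may run
    for t in sorted(tc_times_ms):
        if next_allowed is None or t >= next_allowed:
            updates += 1
            next_allowed = t + protect_interval_ms
    return updates


# One TC BPDU every 100 ms for 10 seconds.
tc_times = list(range(0, 10000, 100))
print(count_entry_updates(tc_times, protect_interval_ms=0))
print(count_entry_updates(tc_times))
```

In this model, 100 TC BPDUs cause 100 updates without protection and 5 updates with a 2-second interval, which shows why the CPU load from MAC and ARP entry updates drops.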
Conclusion
When this problem occurs, check whether packets are being dropped because their rate exceeds the CPCAR value.
When deploying STP, you are advised to enable TC protection and configure all ports connected to terminals as edge ports. These measures prevent status change of an interface from causing flapping and re-convergence of the entire STP network.
OSPF Flapping Causes a High CPU Usage
Symptom
In Figure 1-2, OSPF runs on Switch_1, Switch_2, Switch_3, and Switch_4. Switch_1 has a high CPU usage. The CPU usage of the ROUT task is higher than that of other tasks, and route flapping occurs.
Root Cause
An IP address conflict on the network causes route flapping.
Identification Method
- Run the display ospf lsdb command on each switch at an interval of one second to check information about the OSPF link state database (LSDB) on the switches.
- Locate the fault based on the collected command output of each switch.
- If both the following situations occur, LSA aging is abnormal.
- The Age value that indicates the aging time of a network LSA is 3600 on a switch or the switch does not have the network LSA, and the Sequence value increases quickly.
- The Age value of the same network LSA on different switches frequently alternates between 3600 and smaller values, and the Sequence value increases quickly.
<HUAWEI> display ospf lsdb
OSPF Process 100 with Router ID 3.3.3.3
Link State Database
Area: 0.0.0.0
----------------------------------------------------------------------------
Type      LinkState ID    AdvRouter       Age   Len  Sequence  Metric
Router    4.4.4.4         4.4.4.4           2    48  8000000D       1
Router    3.3.3.3         3.3.3.3           6    72  80000016       1
Router    2.2.2.2         2.2.2.2         228    60  8000000D       1
Router    1.1.1.1         1.1.1.1         258    60  80000009       1
Network   112.1.1.4       4.4.4.4         121    32  80000001       0
Network   112.1.1.2       1.1.1.1        3600    32  80000015       0
Network   222.1.1.3       3.3.3.3         227    32  80000003       0
Network   111.1.1.1       1.1.1.1         259    32  80000002       0
- Run the display ospf routing command on each switch at an interval of one second. If route flapping occurs but the OSPF neighbor relationship does not flap, an IP address conflict or a router ID conflict exists. Based on the display ospf lsdb command output, the IP address of the designated router (DR) or backup designated router (BDR) conflicts with that of a non-DR device.
- Locate one conflicting interface on a switch based on the AdvRouter value, and locate the other conflicting device based on the IP address plan. It is difficult to locate the other conflicting device based only on OSPF information.
In this example, first determine that the conflicting IP address is 112.1.1.2, and the router ID of a conflicting device is 1.1.1.1. However, the other conflicting device (3.3.3.3) cannot be located through OSPF information.
- If the LinkState ID values of two network LSAs are both 112.1.1.2 on a switch, the aging time of the two network LSAs is short, and the Sequence value increases quickly, an IP address conflict occurs on the DR and BDR.
<HUAWEI> display ospf lsdb
OSPF Process 100 with Router ID 3.3.3.3
Link State Database
Area: 0.0.0.0
----------------------------------------------------------------------------
Type      LinkState ID    AdvRouter       Age   Len  Sequence  Metric
Router    4.4.4.4         4.4.4.4          17    48  8000011D       1
Router    3.3.3.3         3.3.3.3          21    72  8000015A       1
Router    2.2.2.2         2.2.2.2         151    60  80000089       1
Router    1.1.1.1         1.1.1.1        1180    60  8000002A       1
Network   112.1.1.2       3.3.3.3           3    32  8000016A       0
Network   112.1.1.2       1.1.1.1           5    32  80000179       0
Network   222.1.1.3       3.3.3.3         145    32  8000002D       0
Network   212.1.1.4       4.4.4.4          10    32  80000005       0
Network   111.1.1.2       2.2.2.2         459    32  80000003       0
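The two LSDB patterns described above can be checked with a short script when output is collected from several switches. The Python sketch below is an illustration under stated assumptions: it expects one LSA per line with the columns Type, LinkState ID, AdvRouter, and Age in that order, and the function names are hypothetical. It flags network LSAs whose Age is 3600 and network LSAs that share a LinkState ID but have different advertising routers. It does not track how fast the Sequence value grows, which requires comparing several samples.

```python
from collections import defaultdict


def parse_network_lsas(output):
    """Return (link_state_id, adv_router, age) for each Network LSA line."""
    lsas = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "Network" and fields[3].isdigit():
            lsas.append((fields[1], fields[2], int(fields[3])))
    return lsas


def find_lsa_anomalies(lsas, max_age=3600):
    """Report LinkState IDs advertised by several routers or aged to max_age."""
    advertisers = defaultdict(set)
    for ls_id, adv_router, _ in lsas:
        advertisers[ls_id].add(adv_router)
    duplicated = sorted(i for i, advs in advertisers.items() if len(advs) > 1)
    aged_out = sorted({ls_id for ls_id, _, age in lsas if age >= max_age})
    return {"duplicate_link_state_id": duplicated, "max_age": aged_out}


sample = (
    "Network 112.1.1.2 3.3.3.3 3 32 8000016A 0\n"
    "Network 112.1.1.2 1.1.1.1 5 32 80000179 0\n"
    "Network 222.1.1.3 3.3.3.3 145 32 8000002D 0\n"
)
print(find_lsa_anomalies(parse_network_lsas(sample)))
```

For the rows taken from the second example output, the script reports 112.1.1.2 as a duplicated LinkState ID, which matches the conflicting address identified manually.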
Solution
Change the IP address of a conflicting device based on the IP address plan.
Conclusion
- The following problems may occur due to IP address conflicts on networks.
- The CPU usage is high.
- Route flapping occurs.
- On an OSPF network, IP address conflicts between interfaces may cause frequent aging and generation of LSAs. This results in network instability, route flapping, and high CPU usage.
Configure IP addresses for interfaces according to the network plan, and do not modify planned network parameters.
How to Relieve CPU Load
- Configure ARP security to protect the device against ARP or ARP Miss attacks.
For details about ARP security, see ARP Security Solutions in the Configuration > Security Configuration Guide > ARP Security Configuration.
- On a network prone to DHCP and ARP attacks, configure local attack defense policies for DHCP and ARP protocol packets.
This section provides suggestions on local attack defense policies in general situations. The requirements on different protocol packets sent to the CPU may vary according to the model and version. In practice, configure CPU attack defense based on service requirements; otherwise, the configuration may fail or services may be affected.
#
cpu-defend policy policy1
 auto-defend enable
 auto-defend action deny
 auto-defend trace-type source-mac source-ip
 auto-defend protocol arp dhcp
 auto-defend whitelist 1 interface 10GEx/x/x  //Add interconnected interfaces to the whitelist.
 auto-defend whitelist 2 interface 10GEx/x/x  //Add uplink interfaces to the whitelist.
#
cpu-defend-policy policy1
#
- Administrators log in to the switch through SSH, Telnet, or SNMP. Configure an ACL to allow only the administrator to log in.
# In VTY 0-14, configure the ACL to allow only the user with source IP address 10.1.1.1/32 to log in to the switch.
<HUAWEI> system-view
[~HUAWEI] acl 2001
[*HUAWEI-acl4-basic-2001] rule 5 permit source 10.1.1.1 0
[*HUAWEI-acl4-basic-2001] quit
[*HUAWEI] user-interface vty 0 14
[*HUAWEI-ui-vty0-14] acl 2001 inbound
[*HUAWEI-ui-vty0-14] commit
- Frequent MAC address flapping may result in a high CPU usage. If MAC address flapping is likely to occur frequently on an interface, run the mac-address flapping trigger error-down command in the interface view so that the system sets the interface to the error-down state after detecting MAC address flapping.
- Load and activate the patch files of the corresponding software version.
Visit http://support.huawei.com/enterprise/ to obtain the corresponding patch file and documents (patch release notes and installation guide).
- The switch provides CPCAR values for each protocol. Generally, the default CPCAR values can meet requirements. If service traffic volume is too high, contact technical support personnel to adjust the CPCAR values.