Troubleshooting High CPU Usage

Introduction

The CPU on a switch will be overloaded if the forwarding plane sends packets to the CPU at high speeds or a task consumes CPU resources for a long time. When this occurs, the CPU may be unable to process other tasks in a timely manner, which may cause exceptions in services.

This document describes how to locate the cause of high CPU usage, how to fix the problem, and typical cases of high CPU usage.

Common Causes of High CPU Usage

High CPU usage may be caused by:

  • Attacks
  • Network flapping (including STP and routing flapping)
  • Network loops
  • Flow sampling configured on the device, which consumes a large amount of CPU resources
  • A large number of logs generated on the device, which consumes a large amount of CPU resources

How to Locate the High CPU Usage Problem

Checking the Switch and Version Information

Run the display version and display device commands to check the switch version and component types. Record the information for follow-up operations.

  1. Run the display version command to view the switch software version.

    <HUAWEI> display version  
    Huawei Versatile Routing Platform Software 
    VRP (R) software, Version 8.160 (CE12800 V200R003C00SPC200) 
    Copyright (C) 2012-2017 Huawei Technologies Co., Ltd. 
    HUAWEI CE12804 uptime is 11 days, 1 hour, 27 minutes 
     
    BKP  version information: 
    1.PCB      Version  : DE01BAK04A VER C 
    2.Board    Type     : CE-BAK04A 
    3.MPU Slot Quantity : 2 
    4.LPU Slot Quantity : 4 
    5.SFU Slot Quantity : 6 
    ……

    The VRP (R) software, Version 8.160 field indicates that this is a CE12800 switch running V200R003C00.

  2. Run the display device command to check the switch model, whether the switch is in a stack, and the service board types (modular switches only).

    <HUAWEI> display device  
    CE12804's Device status: 
    ------------------------------------------------------------------------------------------- 
    Slot  Card Type                     Online Power Register     Alarm     Primary 
    ------------------------------------------------------------------------------------------- 
    2     -      CE-L12LQ-EA              Present  On    Registered Normal    NA 
    4     -      CE-L48XS-EF              Present  On    Registered Normal    NA 
    6     -      CE-MPUA                  Present  On    Registered Normal    Master 
    7     -      CE-CMUA                  Present  On    Registered Normal    Slave 
    8     -      CE-CMUA                  Present  On    Registered Normal    Master 
    9     -      CE-SFU04C                Present  On    Registered Normal    NA 
    10    -      CE-SFU04C                Present  On    Registered Normal    NA 
    11    -      CE-SFU04B                Present  On    Registered Normal    NA 
    PWR1  -      PAC-2700WA               Present  On    Registered Normal    NA 
    FAN1  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN2  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN3  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN4  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN5  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN6  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN7  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN8  -      FAN-12C                  Present  On    Registered Normal    NA 
    FAN9  -      FAN-12C                  Present  On    Registered Normal    NA 
    -------------------------------------------------------------------------------------------

Checking the CPU Usage

Check the CPU usage as follows:

  • Run the display cpu [ slot slot-id ] command to view the CPU usage.

    After several seconds, run the display cpu [ slot slot-id ] command again and check whether the System CPU Using Percentage field still shows a high value.

    NOTE:

    CPU usage is considered persistently high if the average CPU usage (System CPU Using Percentage field) or the usage of any single CPU (Current field) stays above 75%.

    <HUAWEI> display cpu
    CPU utilization statistics at 2017-12-01 11:17:44 945 ms 
    System CPU Using Percentage :  12% 
    CPU utilization for five seconds: 12%, one minute: 12%, five minutes: 11%. 
    Max CPU Usage :                37% 
    Max CPU Usage Stat. Time : 2017-11-28 16:55:21 599 ms 
    State: Non-overload 
    Overload threshold:  90%, Overload clear threshold:  75%, Duration:  480s 
    --------------------------- 
    ServiceName  UseRate 
    --------------------------- 
    SYSTEM           12% 
    AAA               0% 
    ... ... 
    --------------------------- 
    CPU Usage Details 
    ---------------------------------------------------------------- 
    CPU     Current  FiveSec OneMin  FiveMin  Max MaxTime 
    ---------------------------------------------------------------- 
    cpu0        21%      22%      21%      19%  59% 2017-11-20 09:43:19 
    cpu1        12%      12%      13%      12%  64% 2017-11-20 09:43:19 
    cpu2        12%      11%      11%      11%  69% 2017-11-20 09:43:09 
    cpu3         3%       3%       3%       3%   8% 2017-11-20 09:43:09 
    ----------------------------------------------------------------

    Identify the tasks that consume the most CPU resources and focus on the top three. For details, see Determining Fault Causes According to CPU Usages of Tasks.

  • Check whether related alarms have been reported on the NMS.

    If the switch is managed by an NMS, check whether the NMS has received a high CPU usage alarm.

    When the CPU usage exceeds the alarm threshold, the switch reports the alarm SYSTEM_1.3.6.1.4.1.2011.5.25.129.2.4.1 hwCPUUtilizationRisingAlarm to the NMS. Administrators can obtain high CPU usage information according to the alarm information. The CPU usage alarm threshold can be configured using the set cpu threshold command in the system view, and the default value is 95% in V100R005C00 and earlier versions or 90% in V100R005C10 and later versions.

  • Check whether the log records a high CPU usage.

    View the system log files or run the display logbuffer command to check whether the system has recorded logs about high CPU usage.

    The system log may include the current or historical high CPU usage records.

    Related log: SYSTEM/1/hwCPUUtilizationRisingAlarm_active.
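
The 75% check from the NOTE above can be scripted against saved display cpu output. The following Python sketch is only an illustration: the function names are hypothetical, the sample text is a trimmed copy of the output shown earlier, and the parsing assumes the field layout in that sample, which may differ between software versions.

```python
import re

HIGH_CPU_PERCENT = 75  # threshold quoted in the NOTE above

# Trimmed copy of the sample "display cpu" output shown earlier.
SAMPLE = """\
System CPU Using Percentage :  12%
CPU utilization for five seconds: 12%, one minute: 12%, five minutes: 11%.
CPU     Current  FiveSec OneMin  FiveMin  Max MaxTime
cpu0        21%      22%      21%      19%  59% 2017-11-20 09:43:19
cpu1        12%      12%      13%      12%  64% 2017-11-20 09:43:19
"""


def parse_display_cpu(text):
    """Return (system_percent, {cpu_name: current_percent})."""
    system = None
    cores = {}
    for line in text.splitlines():
        m = re.match(r"\s*System CPU Using Percentage\s*:\s*(\d+)%", line)
        if m:
            system = int(m.group(1))
            continue
        # Per-CPU rows start with a lowercase "cpuN"; the first
        # percentage on the row is the Current column.
        m = re.match(r"\s*(cpu\d+)\s+(\d+)%", line)
        if m:
            cores[m.group(1)] = int(m.group(2))
    return system, cores


def exceeds_threshold(text, threshold=HIGH_CPU_PERCENT):
    """True if the average or any single CPU is above the threshold."""
    system, cores = parse_display_cpu(text)
    system_high = system is not None and system > threshold
    return system_high or any(v > threshold for v in cores.values())


def remains_high(samples, threshold=HIGH_CPU_PERCENT):
    """True only if every sample taken a few seconds apart is high."""
    return bool(samples) and all(exceeds_threshold(s, threshold) for s in samples)


print(parse_display_cpu(SAMPLE))
print(remains_high([SAMPLE, SAMPLE]))
```

Requiring every sample to exceed the threshold mirrors the "remains higher than 75%" wording, so a single short spike is not reported.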

Determining Fault Causes According to CPU Usages of Tasks

Run the display cpu [ slot slot-id ] command to view the top 3 tasks occupying high CPU usage (in V200R005 and later versions, the tasks are listed in a descending order of CPU usage).

Use Table 1-1 to identify the likely reason for the high CPU usage and the corresponding solution.

Table 1-1 Common tasks with high CPU usages and solutions

| Task Name | Description | Reason for High CPU Usage | Solution |
|-----------|-------------|---------------------------|----------|
| SYSTEM | System management | A large number of protocol packets are processed. | Check whether a network attack occurs. |
| DEVICE | Device management | Interfaces are disconnected intermittently. | Check whether network flapping occurs. |
| CMF | Configuration management framework | Configurations are delivered in a batch or information is queried using SNMP. | Check whether a large number of SNMP packets are sent to the CPU or whether a large number of configurations are delivered. |
| NETSTREAM | Flow sampling | A large number of packets are sampled. | Check whether flow sampling is configured. |
| SFLOW | Flow sampling | A large number of packets are sampled. | Check whether flow sampling is configured. |
| FEA | Service adaptation layer | A batch configuration backup or task is performed. | Check whether batch tasks are performed. |
| IP STACK | Protocol stack | Routing protocols flap. | Check whether routing protocols flap. |
| LOCAL PKT | Host packet transmission | A large number of protocol packets are sent to the CPU. | Check whether a network attack occurs. |

If the top tasks on your switch are not included in the preceding table, see What Are CPU and CPU Usage? to find out which services caused the high CPU usage. If they are not listed there either, contact technical support personnel.

The preceding table is only a reference for you to locate a high CPU usage problem. To fix the problem, see How to Fix the High CPU Usage Problem.
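
As an illustration of this lookup, the Python sketch below picks the three busiest tasks from the ServiceName/UseRate rows of display cpu and maps them to the checks in Table 1-1. The function names are hypothetical, and the LOCAL PKT and SFLOW percentages in the sample are invented for the example; only the SYSTEM and AAA rows come from the output shown earlier.

```python
import re

# Checks from Table 1-1, keyed by task name.
TABLE_1_1 = {
    "SYSTEM": "Check whether a network attack occurs.",
    "DEVICE": "Check whether network flapping occurs.",
    "CMF": ("Check whether a large number of SNMP packets are sent to the CPU "
            "or whether a large number of configurations are delivered."),
    "NETSTREAM": "Check whether flow sampling is configured.",
    "SFLOW": "Check whether flow sampling is configured.",
    "FEA": "Check whether batch tasks are performed.",
    "IP STACK": "Check whether routing protocols flap.",
    "LOCAL PKT": "Check whether a network attack occurs.",
}

# SYSTEM and AAA are taken from the sample output; the other two rows
# are hypothetical values added for illustration.
SAMPLE = """\
ServiceName  UseRate
SYSTEM           12%
AAA               0%
LOCAL PKT        41%
SFLOW            23%
"""


def top_tasks(text, count=3):
    """Return the `count` busiest (task, percent) pairs."""
    rows = []
    for line in text.splitlines():
        # Task names are assumed to be upper case and may contain spaces.
        m = re.match(r"\s*([A-Z][A-Z0-9_ ]*?)\s+(\d+)%\s*$", line)
        if m:
            rows.append((m.group(1), int(m.group(2))))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:count]


def advice(task):
    """Look up the Table 1-1 check for a task name."""
    return TABLE_1_1.get(
        task,
        "Not listed in Table 1-1; see What Are CPU and CPU Usage? "
        "or contact technical support personnel.",
    )


for name, percent in top_tasks(SAMPLE):
    print(f"{name} ({percent}%): {advice(name)}")
```

Sorting explicitly keeps the sketch usable on versions earlier than V200R005, where the tasks are not necessarily listed in descending order.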

How to Fix the High CPU Usage Problem

Checking Whether the Problem Is Caused by a Network Attack

In some situations, network attacks cause high CPU usage. In a network attack, hosts or network devices send a large number of forged packets to a switch, affecting its security and services. While the attack lasts, the switch is busy processing requests from the attack source, so some tasks consume excessive CPU resources and the overall CPU usage becomes high.

Common Network Attacks

Common network attacks, such as ARP, ARP Miss, and DHCP attacks, can cause a high CPU usage on a switch. These attacks are all initiated by sending a large number of protocol packets; therefore, packet statistics on the switch show a large number of packets sent to the CPU.

  • ARP and ARP Miss attacks
    • ARP and ARP Miss flood
    • ARP spoofing
  • DHCP protocol packet attacks
  • Other attacks
    • ICMP attacks
    • DDoS attacks
    • Broadcast attacks
    • TTL-expired attacks
    • Attacks using IP packets whose destination address is the switch's IP address
    • SSH/FTP/Telnet attacks

Network Attack Locating

  1. Run the display version and display device commands to check the switch version and component types. Record the information for follow-up operations.
  2. Run the display cpu-defend statistics all command to view statistics about the packets sent to the CPU and determine whether too many protocol packets are discarded due to timeout.

    1. Run the reset cpu-defend statistics all command to clear statistics about the packets sent to the CPU.
    2. After several seconds, run the display cpu-defend statistics all command to view statistics about the packets sent to the CPU.

      If the number of packets of a protocol is unusually large, determine whether this is normal for your networking. If it is not, the switch is very likely undergoing an attack that uses packets of this protocol.

      <HUAWEI> reset cpu-defend statistics all 
      <HUAWEI> display cpu-defend statistics all 
      Statistics(packets) on slot 1 : 
      -------------------------------------------------------------------------------- 
      PacketType               Total Passed        Total Dropped Last Dropping Time 
                          Last 5 Min Passed Last 5 Min Dropped 
      --------------------------------------------------------------------------------  
      arp                            784824                    0 - 
                                          8                    0 
      arp-miss                            0                    0 - 
                                          0                    0 
      fib-hit                         25993                    0 - 
                                          0                    0 
      snmp                          4922372                    0 - 
                                        599                    0 
      telnet                            425                    0 - 
                                          0                    0 
      ...... 
      --------------------------------------------------------------------------------

  3. Configure the attack source tracing function to find out the attack source.

    The switch provides the local attack defense function to protect the CPU. This function prevents the service interruptions that occur when the CPU has to process a large number of packets sent to it.

    1. Create the local attack defense policy based on attack source tracing.
      <HUAWEI> system-view 
      [~HUAWEI] cpu-defend policy policy1 //Create the local attack defense policy. 
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend enable //Enable attack source tracing. 
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend trace-type source-ip source-mac //Configure attack source tracing based on source MAC addresses and source IP addresses.
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend protocol all //Match all protocol packets. 
      [*HUAWEI-cpu-defend-policy-policy1] quit 
      [*HUAWEI] cpu-defend-policy policy1 //Apply the CPU attack defense policy globally. 
      [*HUAWEI] commit

      After configuring the local attack defense function based on attack source tracing, run the display auto-defend attack-source command to check attack source information (IP address and MAC address).

      <HUAWEI> display auto-defend attack-source 
        Attack Source User Table on Slot 1 : 
        ------------------------------------------------------------------------- 
        MAC Address      Interface       PacketType    VLAN:Outer/Inner     Total 
        ------------------------------------------------------------------------- 
        0000-c102-0102 10GE1/0/1       ICMP          1000/                 4832 
        ------------------------------------------------------------------------- 
        Total: 1 
        Attack Source IP Table on Slot 1 : 
        ------------------------------------------------------------------------- 
        IP Address      PacketType    Total 
        ------------------------------------------------------------------------- 
        10.1.1.2        ICMP          1144 
        ------------------------------------------------------------------------- 
        Total: 1 
        Attack Source Port Table on Slot 1 : 
        ------------------------------------------------------------------------- 
        Interface       VLAN:Outer/Inner     PacketType     Total 
        ------------------------------------------------------------------------- 
        10GE1/0/1       1000/--              ICMP            4832 
        ------------------------------------------------------------------------- 
        Total: 1
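
To compare protocols quickly in step 2, the two-line records of display cpu-defend statistics all can be parsed and ranked. The Python sketch below is a hedged illustration: the function names are hypothetical, and it assumes each record consists of a line with the protocol name and total counters followed by a line with the last-5-minute counters, as in the sample above.

```python
import re

# Records copied from the sample output above.
SAMPLE = """\
arp                            784824                    0 -
                                    8                    0
arp-miss                            0                    0 -
                                    0                    0
snmp                          4922372                    0 -
                                  599                    0
telnet                            425                    0 -
                                    0                    0
"""


def parse_cpu_defend(text):
    """Return {packet_type: counters} from two-line statistics records."""
    stats = {}
    current = None
    for line in text.splitlines():
        first = re.match(r"\s*([a-z][\w-]*)\s+(\d+)\s+(\d+)\s+\S", line)
        second = re.match(r"\s*(\d+)\s+(\d+)\s*$", line)
        if first:
            current = first.group(1)
            stats[current] = {
                "total_passed": int(first.group(2)),
                "total_dropped": int(first.group(3)),
            }
        elif second and current:
            stats[current]["recent_passed"] = int(second.group(1))
            stats[current]["recent_dropped"] = int(second.group(2))
            current = None
    return stats


def busiest(stats, count=3):
    """Rank packet types by packets passed in the last 5 minutes."""
    ranked = sorted(
        stats.items(),
        key=lambda item: item[1].get("recent_passed", 0),
        reverse=True,
    )
    return [(name, c.get("recent_passed", 0)) for name, c in ranked[:count]]


def recently_dropped(stats):
    """Packet types with drops in the last 5 minutes."""
    return sorted(n for n, c in stats.items() if c.get("recent_dropped", 0) > 0)


stats = parse_cpu_defend(SAMPLE)
print(busiest(stats, 2))
print(recently_dropped(stats))
```

Whether a high count is abnormal still depends on the networking; the ranking only shows where to look first.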

Handling Suggestion

Select an appropriate method based on the attack source information and networking.

  • Configure ARP security to prevent ARP attacks.

    The switch provides ARP security to prevent ARP and ARP Miss packet attacks.

    For details about ARP security, see ARP Security Solutions in the Configuration > Security Configuration Guide > ARP Security Configuration.

  • Configure an attack source tracing action: discard attack packets within the specified period.

    # Configure an attack source tracing action to discard attack packets within 300s.

    <HUAWEI> system-view  
    [~HUAWEI] cpu-defend policy policy1  
    [*HUAWEI-cpu-defend-policy-policy1] auto-defend enable 
    [*HUAWEI-cpu-defend-policy-policy1] auto-defend action deny timeout 300  //By default, attack source tracing action is not enabled.
    [*HUAWEI-cpu-defend-policy-policy1] commit
  • Configure a blacklist to discard the packets sent from the blacklisted users.

    If a user sends attack packets to the switch, you can specify the characteristics of these packets in an ACL and apply the ACL to a blacklist. When the packets from this user are sent to the switch's CPU, the switch discards the packets.

    In the following example, the attack packets come from network segment 10.1.1.0/24. Configure an ACL that matches this segment and apply it to a blacklist so that the attack packets are discarded.

    <HUAWEI> system-view 
    [~HUAWEI] acl number 2001 
    [*HUAWEI-acl4-basic-2001] rule permit source 10.1.1.0 0.0.0.255 
    [*HUAWEI-acl4-basic-2001] quit 
    [*HUAWEI] cpu-defend policy policy1 
    [*HUAWEI-cpu-defend-policy-policy1] blacklist 1 acl 2001 
    [*HUAWEI-cpu-defend-policy-policy1] commit
  • Configure a punishment action for the attack source tracing function: Set the interface that receives attack packets to the Error-down state to prevent the attack source from continuing to attack the switch.

    If attack packets are sent from a specified interface, and setting this interface to the Error-down state does not affect services, use this method.

    If services of authorized users on the interface that receives attack packets may be interrupted after this interface is set to the Error-down state, exercise caution when using this method.

    # Set the interface that receives attack packets to the Error-down state.

    <HUAWEI> system-view
    [~HUAWEI] cpu-defend policy policy1
    [*HUAWEI-cpu-defend-policy-policy1] auto-defend enable
    [*HUAWEI-cpu-defend-policy-policy1] auto-defend action error-down
    [*HUAWEI-cpu-defend-policy-policy1] commit

Checking Whether the Problem Is Caused by Network Flapping

When network flapping occurs, the network topology changes frequently. The switch is busy processing the resulting switchover events, which causes high CPU usage. Network flapping includes STP flapping and OSPF route flapping.

STP Flapping

When STP flapping occurs, the switch frequently recalculates the STP topology and updates the MAC address table and ARP table, which causes high CPU usage.

  1. Fault Location

    If you suspect STP flapping, run the display stp topology-change command several times at intervals of a few seconds and check whether the value of Number of topology changes increases. Alternatively, check the trap and log information on the switch to determine whether the STP topology has changed.

    <HUAWEI> display stp topology-change
     CIST topology change information 
     Number of topology changes             :5 
     Time since last topology change        :0 days 0h:23m:19s 
     Topology change initiator(detected)    :10GE1/0/1 
     Number of generated topologychange traps : 5 
     Number of suppressed topologychange traps: 3

    After you confirm that the network topology is changing frequently, run the display stp tc-bpdu statistics command several times at intervals of a few seconds. Check whether interfaces on the switch have received Topology Change (TC) BPDUs. If so, find the source of the TC BPDUs, that is, the device causing the topology change.

    <HUAWEI> display stp tc-bpdu statistics
     -------------------------- STP TC/TCN information --------------------------
     MSTID Port                        TC(Send/Receive)      TCN(Send/Receive)
     0     10GE1/0/3                   2/3                   0/0
     1     10GE1/0/5                   1/0                   -/-
    • If only the TC(Send) value increases, the topology change is caused by the local switch.
      • If only the TC(Send) value of a single interface increases, the topology change is caused by this interface.
      • If the TC(Send) values of multiple interfaces increase, check the events and logs on the NMS to analyze the STP topology change reason. Find out the interface causing the flapping.
    • If multiple values in the TC(Send/Receive) column increase, check the event and log information on the NMS to determine whether the local switch causes the topology change, and check whether STP flapping occurs on the device connected to the problematic interface.
  2. Suggestion
    1. Enable the TC protection alarm function to help administrators learn about TC BPDU processing on the switch.

      To enable the TC protection and TC protection alarm functions, run the snmp-agent trap enable feature-name mstp and stp tc-protection commands in the system view.

      NOTE:
      • After the TC protection alarm function is enabled, the switch will generate two alarms: MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.15 hwMstpiTcGuarded and MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.16 hwMstpProTcGuarded.
      • After the TC protection function is enabled, within each period specified by the stp tc-protection interval interval-value command (2s by default), the switch processes at most the number of TC BPDUs specified by the stp tc-protection threshold threshold command (one TC BPDU by default).
    2. Rectify the fault according to topology changes.

      If STP topology changes are caused by access-side port status transition, run the following commands to configure this port as an STP edge port or enable the STP BPDU protection function to reduce the impact of BPDUs on the CPU.

      <HUAWEI> system-view 
      [~HUAWEI] interface 10ge 1/0/1 
      [~HUAWEI-10GE1/0/1] stp edged-port enable   //Configure the port as an edge port. 
      [*HUAWEI-10GE1/0/1] quit 
      [*HUAWEI] stp bpdu-protection   //Enable the BPDU protection function. 
      [*HUAWEI] commit
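
The rules in the Fault Location step for interpreting the TC(Send/Receive) column can be expressed as a comparison of two samples. The Python sketch below is illustrative only: the function names and verdict strings are hypothetical, and the second sample is derived from the first by changing one counter for the example.

```python
import re

# First sample, copied from the output above.
BEFORE = """\
 MSTID Port                        TC(Send/Receive)      TCN(Send/Receive)
 0     10GE1/0/3                   2/3                   0/0
 1     10GE1/0/5                   1/0                   -/-
"""
# Hypothetical second sample: only TC(Send) on 10GE1/0/3 has increased.
AFTER = BEFORE.replace("2/3", "9/3")


def parse_tc_stats(text):
    """Return {(mstid, port): (tc_sent, tc_received)}."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S+)\s+(\d+)/(\d+)(?:\s|$)", line)
        if m:
            key = (int(m.group(1)), m.group(2))
            stats[key] = (int(m.group(3)), int(m.group(4)))
    return stats


def classify(before, after):
    """Apply the TC(Send/Receive) rules to two consecutive samples."""
    send_up = sorted({k[1] for k in after if after[k][0] > before.get(k, (0, 0))[0]})
    recv_up = sorted({k[1] for k in after if after[k][1] > before.get(k, (0, 0))[1]})
    if not send_up and not recv_up:
        return "no new TC BPDUs between the samples"
    if not recv_up:
        if len(send_up) == 1:
            return "local switch, interface " + send_up[0]
        return "local switch, several interfaces: check NMS events and logs"
    return "TC BPDUs received: check NMS events and logs and the connected device"


print(classify(parse_tc_stats(BEFORE), parse_tc_stats(AFTER)))
```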

OSPF Route Flapping

Routing protocol flapping causes route re-advertisement and recalculation, which increases the load of the CPU. Generally, OSPF is configured to manage dynamic routing information. Therefore, OSPF route flapping is described here.

  1. Fault Location
    • Run the display ospf peer last-nbr-down command to check the reason why the OSPF neighbor relationship goes Down.

      The reason is displayed in the Immediate Reason and Primary Reason fields.

    • Check logs on the switch to determine why the OSPF neighbor becomes Down.

      Run the display logbuffer command, and you can find the following log information:

      OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down. (ProcessId=[ProcessId], NeighborRouterId=[NbrRouterId], NeighborIp=[NbrIp], NeighborAreaId=[NbrAreaId], NeighborInterface=[IfName], NeighborDownImmediate reason=[NbrImmReason], NeighborDownPrimeReason=[NbrPriReason], CpuUsage=[CpuUsage])

      The NeighborDownImmediate reason field indicates the cause for the OSPF neighbor Down event.

  2. Suggestion

    Determine the reason depending on the key fields and take measures.

    Possible causes of the fault are as follows:

    • Neighbor Down Due to Inactivity

      The Hello packet is not received within the deadtime (set by the ospf timer dead interval command in the interface view).

      When this reason is reported, either the OSPF neighbor relationship is flapping or it cannot be set up. Run the display ospf peer brief command to determine which of the two is the case.

      • OSPF neighbor relationship flaps.

        OSPF neighbor flapping may be caused by a small CPCAR value for OSPF, link flapping or congestion on interfaces, and LSA flooding.

        1. Run the display cpu-defend statistics packet-type ospf command to view statistics about the OSPF packets sent to the CPU. If too many OSPF packets are discarded, check whether the switch undergoes an OSPF attack or the CPCAR value for OSPF is too small.
        2. View the log to check whether interfaces alternate between Up and Down. If link flapping or congestion occurs, check the link on the interface.
        3. If the holdtime of the OSPF neighbor relationship is smaller than 20s, run the ospf timer dead interval command to change the holdtime to be larger than 20s.
        4. If the fault persists after the preceding operations are performed, contact technical support personnel.
      • OSPF neighbor relationship cannot be set up.

        Check whether the configurations in the OSPF view of devices on both ends are the same. If the configurations such as the OSPF area ID or area type (NSSA, stub area, or common area) are different, the two devices cannot establish an OSPF neighbor relationship.

        Run the display ospf [ process-id ] interface command to check whether OSPF is successfully enabled on the interfaces.

        <HUAWEI> display ospf 1 interface
        
        OSPF Process 1 with Router ID 192.168.5.5
        
         Area: 0.0.0.0          MPLS TE not enabled
        
         Interface             IP Address      Type         State    Cost    Pri
         Vlanif200             192.168.3.1     Broadcast    DR       1       1
        • If OSPF is not enabled on interfaces, run the ospf enable [ process-id ] area area-id command in the interface view to enable OSPF.
        • If the OSPF process has been enabled on the related interface, run the display ospf error command multiple times at an interval of several seconds to check whether OSPF authentication information on the two devices is the same according to the Bad authentication type and Bad authentication key fields.
          <HUAWEI> display ospf error
                    OSPF Process 1 with Router ID 10.1.1.1
                            OSPF error statistics
          
          General packet errors:
           0       : IP: received my own packet     0       : Bad packet
           0       : Bad version                    0       : Bad checksum
           0       : Bad area id                    0       : Drop on unnumbered interface
           0       : Bad virtual link               0       : Bad authentication type
           0       : Bad authentication key         0       : Packet too small
           0       : Packet size > ip length        0       : Transmit error
           0       : Interface down                 0       : Unknown neighbor

          If the value of the Bad authentication type or Bad authentication key field keeps increasing, the OSPF authentication information on the two devices is different. To configure the same authentication information on both devices, run the ospf authentication-mode command in the interface view or the authentication-mode command in the OSPF process view.

    • Neighbor Down Due to Kill Neighbor

      If the interface is Down, BFD is Down, or the reset ospf process command is executed, the OSPF neighbor relationship goes Down.

      View the NeighborDownPrimeReason field to determine the reason.

    • Neighbor Down Due to 1-Wayhello Received or Neighbor Down Due to SequenceNum Mismatch

      When the OSPF status of the peer device goes Down first, the peer device sends a 1-Way Hello packet to the local device, causing OSPF on the local device to go Down.

      Determine why OSPF status of the peer device becomes Down.

    For other reasons, see OSPF/3/NBR_DOWN_REASON.
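
Checking whether the Bad authentication counters keep increasing means comparing display ospf error output taken at different times. The Python sketch below shows one way to do that. The function names are hypothetical, the first sample is an excerpt of the output above, and the second sample is an invented variant in which one counter has grown.

```python
import re

# Each output line carries two "<count> : <name>" pairs.
COUNTER = re.compile(r"(\d+)\s+:\s+(.*?)(?=\s{2,}\d+\s+:|\s*$)")

# Excerpt of the sample output above.
BEFORE = """\
 0       : Bad virtual link               0       : Bad authentication type
 0       : Bad authentication key         0       : Packet too small
"""
# Hypothetical later sample: Bad authentication key has risen to 4.
AFTER = re.sub(r"\d+(\s+: Bad authentication key)", r"4\1", BEFORE)

AUTH_FIELDS = ("Bad authentication type", "Bad authentication key")


def parse_ospf_errors(text):
    """Return {counter_name: value} from display ospf error output."""
    counters = {}
    for line in text.splitlines():
        for value, name in COUNTER.findall(line):
            counters[name] = int(value)
    return counters


def growing_auth_errors(before, after):
    """Authentication counters that increased between two samples."""
    return [f for f in AUTH_FIELDS if after.get(f, 0) > before.get(f, 0)]


print(growing_auth_errors(parse_ospf_errors(BEFORE), parse_ospf_errors(AFTER)))
```

A non-empty result suggests that the authentication settings on the two devices differ, which is the condition described in the last step above.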

Checking Whether the Problem Is Caused by a Network Loop

A network loop will cause MAC address flapping on a switch and a broadcast storm on the network. When this occurs, a large number of protocol packets are sent to the CPU, resulting in high CPU usage of the switch.

  1. Fault Location

    When a network loop occurs, the following symptoms are common:

    • A MAC address flapping alarm is generated on the switch. To check MAC address flapping records, run the display mac-address flapping command.
    • The switch's CPU usage is high.
    • Indicators of interfaces in the VLAN where the loop has occurred blink faster than usual.
    • Administrators cannot remotely log in to the switch, and the switch responds slowly to operations performed through the console interface.
    • Packets are lost or cannot be forwarded in ping tests.
    • The display interface command output shows that a large number of broadcast or multicast packets exist on interfaces.
    • PCs connected to the switch receive a large number of broadcast packets or unknown unicast packets.
  2. Suggestion

    If configurations or connections are changed before the fault occurs, you are advised to roll back the changes. Otherwise, rectify the fault according to the following procedure:

    1. Confirm the interface where a broadcast storm has occurred according to the interface indicator status and traffic.
    2. Locate the device where the network loop has occurred according to the link topology.
    3. Locate the interfaces where the loop has occurred and remove the loop.
    4. If the fault persists after the preceding measures are taken, collect the networking information, device configuration file, log information, and alarm information, and contact Huawei technical support personnel.

Checking Whether the Problem Is Caused by the Flow Sampling Function

When the flow sampling function is configured on the device, the CPU usage may be high due to high traffic volume and high sampling rate.

  1. Fault Location

    If the display cpu [ slot slot-id ] command output shows that the FEA and NETSTREAM tasks (or the FEA and SFLOW tasks) have high CPU usage, flow sampling is probably configured on the device and either the traffic volume or the sampling rate is high.

  2. Suggestion

    Check the flow sampling configuration, reduce the sampling rate based on the traffic on interfaces, and then check whether CPU usage is reduced to the normal range.

    • Adjust the NetStream sampling rate.
      • Adjust the sampling rate for all interfaces in the system view.
        <HUAWEI> system-view
        [~HUAWEI] netstream sampler random-packets 32768 inbound
        [*HUAWEI] netstream sampler random-packets 32768 outbound
        [*HUAWEI] commit
      • Adjust the sampling rate for a specified interface in the interface view.
        <HUAWEI> system-view
        [~HUAWEI] interface 10ge 1/0/1
        [~HUAWEI-10GE1/0/1] netstream sampler random-packets 32768 inbound
        [*HUAWEI-10GE1/0/1] netstream sampler random-packets 32768 outbound
        [*HUAWEI-10GE1/0/1] commit
    • Adjust the sFlow sampling rate.
      <HUAWEI> system-view 
      [~HUAWEI] interface 10ge 1/0/1 
      [~HUAWEI-10GE1/0/1] sflow sampling rate 32768 
      [*HUAWEI-10GE1/0/1] commit
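
The effect of the value configured in the commands above can be estimated with simple arithmetic. The Python sketch below assumes that a value of N means roughly one packet in every N is sampled, so a larger value means fewer sampled packets; the function name and the traffic figure are hypothetical.

```python
def sampled_pps(interface_pps, sampling_value):
    """Approximate sampled packets per second.

    Assumes one packet is sampled out of every `sampling_value` packets.
    """
    if sampling_value <= 0:
        raise ValueError("sampling_value must be positive")
    return interface_pps / sampling_value


# Hypothetical interface carrying 10 million packets per second.
TRAFFIC_PPS = 10_000_000
for value in (1024, 32768):
    print(value, round(sampled_pps(TRAFFIC_PPS, value), 1))
```

Under this assumption, raising the value from 1024 to 32768 cuts the number of sampled packets by a factor of 32, which is why the examples above use a large value to lower CPU load.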

Checking Whether the Problem Is Caused by a Large Number of Logs

Under some conditions, such as attacks, errors, or frequent interface status transitions, the device generates diagnostic information or logs continuously. The system then frequently reads from and writes to the storage device, which causes high CPU usage.

  1. Fault Location

    Run the display logbuffer command to check whether a large number of abnormal logs are displayed. For example, a large number of the same logs are generated continuously.

  2. Suggestion

    Check the log reference manual of the corresponding product according to the log name and solve the problem according to the troubleshooting procedure.

    If the fault persists after the preceding measures are taken, collect the networking information, device configuration file, log information, and alarm information, and contact Huawei technical support personnel.
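
Spotting logs that repeat continuously is easier when the log buffer is summarized by log name. The Python sketch below counts names of the form MODULE/severity/mnemonic, which is the form of the log names quoted in this document. The function names are hypothetical and the sample lines are invented for illustration; real log lines carry additional fields such as timestamps.

```python
import re
from collections import Counter

# Matches names such as OSPF/3/NBR_DOWN_REASON.
LOG_NAME = re.compile(r"([A-Z][A-Z0-9_]*/\d/[A-Za-z0-9_]+)")

# Hypothetical log buffer contents (message bodies shortened).
SAMPLE_LINES = [
    "OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down.",
    "OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down.",
    "SYSTEM/1/hwCPUUtilizationRisingAlarm_active: (details omitted)",
    "OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down.",
    "OSPF/3/NBR_DOWN_REASON: Neighbor state left full or changed to Down.",
]


def count_log_names(lines):
    """Count how often each log name appears."""
    counts = Counter()
    for line in lines:
        m = LOG_NAME.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


def repeated_logs(lines, min_count=3):
    """Log names seen at least `min_count` times, most frequent first."""
    return [(n, c) for n, c in count_log_names(lines).most_common() if c >= min_count]


print(repeated_logs(SAMPLE_LINES))
```

The names returned can then be looked up in the log reference manual as described in the Suggestion step.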

Typical Cases of High CPU Usage

A Switch Suffers an ARP Packet Attack

Symptom

In Figure 1-1, Switch functions as a gateway. Switch_1 frequently becomes unmanageable, and users connected to Switch_1 are frequently disconnected. When Switch_1 pings Switch, the response is delayed or the ping fails. Services on Switch_2 are normal, and Switch_2 can successfully ping the gateway.

Figure 1-1 Networking diagram

Root Cause

Switch_1 receives ARP packets with a fixed source MAC address. As a result, user devices cannot send or receive ARP packets.

Identification Method

Perform the following operations on Switch_1:

  1. Check whether the CPU usage is high.

    <HUAWEI> display cpu  
    CPU utilization statistics at 2015-12-04 11:04:40 820 ms 
    System CPU Using Percentage :  82% 
    CPU utilization for five seconds: 82%, one minute: 82%, five minutes: 82%. 
    Max CPU Usage :                87% 
    Max CPU Usage Stat. Time : 2015-11-28 16:55:21 599 ms

    The CPU usage reaches 82%.

  2. View temporary ARP entries to check whether ARP learning is normal.

    <HUAWEI> display arp 
    ARP Entry Types: D - Dynamic, S - Static, I - Interface, O - OpenFlow
    EXP: Expire-time VLAN:VLAN or Bridge Domain
    
    IP ADDRESS      MAC ADDRESS    EXP(M) TYPE/VLAN       INTERFACE        VPN-INSTANCE
    ------------------------------------------------------------------------------
    10.137.222.139  00e0-fc01-4422        I -             MEth0/0/0 
    10.1.1.1        200b-c739-130c        I               Vlanif10
    10.2.3.4        200b-c739-1316        I               Vlanif200
    12.1.1.1        200b-c739-1302        I               10GE4/0/8
    12.1.1.2        f84a-bff0-cac2   12   D               10GE4/0/8
    50.1.1.2        Incomplete        1   D               10GE4/0/22
    50.1.1.3        Incomplete        1   D               10GE4/0/22
    ......
    ------------------------------------------------------------------------------

    The MAC ADDRESS fields of two ARP entries are Incomplete, which indicates that these are temporary entries and that some ARP entries cannot be learned.

  3. Check whether the switch is suffering an ARP attack.

    1. View statistics about ARP request packets sent to the CPU.
      <HUAWEI> display cpu-defend statistics packet-type arp all 
      Statistics(packets) on slot 2 :
      --------------------------------------------------------------------------------
      PacketType               Total Passed        Total Dropped   Last Dropping Time
                          Last 5 Min Passed   Last 5 Min Dropped
      --------------------------------------------------------------------------------
      arp                                 0                    0   -
                                          0                    0
      --------------------------------------------------------------------------------
      Statistics(packets) on slot 4 :
      --------------------------------------------------------------------------------
      PacketType               Total Passed        Total Dropped   Last Dropping Time
                          Last 5 Min Passed   Last 5 Min Dropped
      --------------------------------------------------------------------------------
      arp                            106549             44380928   -
                                          3                    0
      --------------------------------------------------------------------------------

      The board in slot 4 has received a large number of ARP packets, and 44380928 of them have been dropped.

    2. Configure attack source tracing to identify the attack source.
      <HUAWEI> system-view  
      [~HUAWEI] cpu-defend policy policy1  
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend enable  
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend attack-packet sample 5  //One out of every five packets is sampled. A smaller value samples more packets and consumes more CPU resources.  
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend threshold 30  //Packets whose rate reaches 30 pps are considered attack packets. If there are many attack sources, reduce this value.  
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend trace-type source-mac  //Identify the attack source based on source MAC address.  
      [*HUAWEI-cpu-defend-policy-policy1] auto-defend protocol arp  //Identify the attack source of the ARP attack.  
      [*HUAWEI-cpu-defend-policy-policy1] quit  
      [*HUAWEI] cpu-defend-policy policy1  
      [*HUAWEI] commit
    3. View attack source information.
      [~HUAWEI] display auto-defend attack-source  
      Attack Source User Table on Slot 4 : 
        ------------------------------------------------------------------------- 
        MAC Address      Interface       PacketType    VLAN:Outer/Inner      Total 
        ------------------------------------------------------------------------- 
        0000-c102-0102 10GE4/0/22       ARP          1000/                 4832 
        -------------------------------------------------------------------------

      The MAC address of the attack source is 0000-c102-0102, and the attack source is connected to 10GE4/0/22.
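
When the device has many slots, the statistics in step 3 can be scanned with a script. The following is a minimal sketch that parses saved display cpu-defend statistics output and reports the passed and dropped totals per slot. It assumes the text layout shown above. The function name is hypothetical, and the sample is an abbreviated copy of that output.

```python
import re

SLOT = re.compile(r"Statistics\(packets\) on slot (\d+)")
# A data row: packet type, total passed, total dropped, last dropping time.
ROW = re.compile(r"^\s*([A-Za-z][\w-]*)\s+(\d+)\s+(\d+)\s+\S+")


def totals_by_slot(output: str) -> dict:
    """Return {slot: {packet_type: (total_passed, total_dropped)}}."""
    result = {}
    slot = None
    for line in output.splitlines():
        slot_match = SLOT.search(line)
        if slot_match:
            slot = int(slot_match.group(1))
            result[slot] = {}
            continue
        row_match = ROW.match(line)
        if row_match and slot is not None:
            passed, dropped = int(row_match.group(2)), int(row_match.group(3))
            result[slot][row_match.group(1)] = (passed, dropped)
    return result


# Abbreviated copy of the output shown in this document.
SAMPLE = """\
Statistics(packets) on slot 2 :
PacketType               Total Passed        Total Dropped   Last Dropping Time
                    Last 5 Min Passed   Last 5 Min Dropped
arp                                 0                    0   -
                                    0                    0
Statistics(packets) on slot 4 :
PacketType               Total Passed        Total Dropped   Last Dropping Time
                    Last 5 Min Passed   Last 5 Min Dropped
arp                            106549             44380928   -
                                    3                    0
"""

if __name__ == "__main__":
    for slot, types in totals_by_slot(SAMPLE).items():
        for packet_type, (passed, dropped) in types.items():
            print(f"slot {slot} {packet_type}: passed={passed} dropped={dropped}")
```

A slot whose dropped total is far larger than that of the other slots is the place to start attack source tracing.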

Solution

  • Configure a blacklist.
    #  
    acl number 4000  
     rule 10 permit type arp source-mac 0000-c102-0102  
    #  
    cpu-defend policy 1  
     blacklist 1 acl 4000  //Add the users with specified characteristics to the blacklist through an ACL. The switch discards the packets from the users in the blacklist.  
    #  
    cpu-defend-policy 1 
    #
  • Configure the attack source tracing action.
    # 
    cpu-defend policy policy1 
     auto-defend enable 
     auto-defend action deny  //Set the attack source tracing action. The switch discards all attack packets within the default interval, 300s. 
     auto-defend alarm enable
     auto-defend threshold 30  
     auto-defend trace-type source-mac  
     auto-defend protocol arp  
    #  
    cpu-defend-policy policy1 
    # 

STP Flapping Causes a High CPU Usage

Symptom

A fixed switch has a high CPU usage, and generates many logs about ARP packets that are discarded because their rate exceeds the CPCAR value. The interface information shows that the number of TC BPDUs received by STP-enabled interfaces keeps increasing.

Root Cause

An interface has received a large number of TC BPDUs, causing STP flapping. Many MAC entries are deleted and ARP entries are updated. Therefore, the switch needs to process many ARP Miss, ARP request, and ARP reply packets, causing a high CPU usage.

Identification Method

  1. View the logs. The switch has generated logs indicating high CPU usage.
    Dec  4 2016 11:37:34 HUAWEI %%01SYSTEM/1/hwCPUUtilizationRisingAlarm(t):CID=0x80020106-OID=1.3.6.1.4.1.2011.5.25.129.2.4.1;The CPU usage exceeded the pre-set overload threshold.(TrapSeverity=3, ProbableCause=74240, EventType=3, PhysicalIndex=17170433, PhysicalName=MPU slot 6, RelativeResource=CPU, UsageType=1, SubIndex=0, CpuUsage=92, Unit=1, CpuUsageThreshold=90) 
  2. Check the alarms. The switch has also generated an alarm indicating that packets were dropped because their rate exceeded the CPCAR limit.
    Dec  4 2016 11:45:47 HUAWEI %%01DEFEND/4/hwCpcarDropPacketAlarm(t):CID=0x80e70402-OID=1.3.6.1.4.1.2011.5.25.165.2.2.7.1;Rate of packets to cpu exceeded the CPCAR limit in slot 4. (Protocol=ARP, PPS/CBS=0/0, ExceededPacketCount=20699) 
  3. Collect statistics about transmitted and received TC BPDUs on interfaces.

    Run the display stp tc-bpdu statistics command several times at an interval of several seconds, and compare the statistics about sent and received TC/TCN BPDUs. In this case, the number of TC BPDUs on all STP-enabled interfaces keeps increasing.
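
The comparison in step 3 can be sketched as follows. This example assumes that the received TC BPDU counters have already been extracted from two runs of the command into dictionaries keyed by interface name. Parsing the command output is not shown because the output format is not reproduced in this document. The function name, interface names, and counter values are hypothetical.

```python
def increasing_interfaces(before: dict, after: dict) -> dict:
    """Return {interface: delta} for interfaces whose TC counter grew."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical counters taken a few seconds apart.
FIRST_RUN = {"10GE1/0/1": 120, "10GE1/0/2": 45, "10GE1/0/3": 7}
SECOND_RUN = {"10GE1/0/1": 180, "10GE1/0/2": 45, "10GE1/0/3": 19}

if __name__ == "__main__":
    for name, delta in increasing_interfaces(FIRST_RUN, SECOND_RUN).items():
        print(f"{name}: +{delta} TC BPDUs received")
```

Interfaces that keep appearing in the result across several comparisons are the ones that receive TC BPDUs continuously.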

Solution

  1. Run the stp tc-protection command in the system view to enable TC protection trap. By default, TC protection trap is disabled.

    After TC protection trap is enabled, the switch updates entries at most once within 2 seconds if it frequently receives TC BPDUs. This reduces the number of tasks to be processed by the CPU in frequently updating MAC and ARP entries.

    The switch will trigger the MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.15 hwMstpiTcGuarded and MSTP_1.3.6.1.4.1.2011.5.25.42.4.2.16 hwMstpProTcGuarded traps.

  2. Run the arp topology-change disable command in the system view to disable the switch from responding to TC BPDUs. By default, the switch responds to received TC BPDUs.

    After receiving TC BPDUs, the switch ages out ARP entries by default. After this command is executed, the switch does not age out or delete ARP entries when receiving TC BPDUs. When the network topology changes frequently, this prevents the flood of ARP packets that ARP relearning would generate, and therefore the resulting high CPU usage.

  3. Run the mac-address update arp enable command in the system view to enable ARP entry update upon MAC address change. By default, ARP entry update upon MAC address change is enabled.

    By default, the switch deletes MAC address entries after receiving TC BPDUs. After this command is executed, the switch updates the outbound interfaces in ARP entries when the outbound interfaces in the corresponding MAC address entries change. This reduces the number of ARP entry updates.
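
To illustrate the effect of limiting entry updates to at most once within 2 seconds (step 1), the sketch below counts the updates triggered by a burst of TC BPDUs with and without such a limit. This is a simplified model, not the switch's actual algorithm: it assumes that an update is performed only if the minimum interval has elapsed since the previous update. The function name and the arrival pattern are hypothetical.

```python
def count_updates(arrivals, min_interval=None):
    """Count entry updates triggered by TC BPDU arrival times (in seconds).

    With min_interval=None, every TC BPDU triggers an update. Otherwise an
    update is performed only if at least min_interval seconds have passed
    since the previous update (simplified model of TC protection).
    """
    updates = 0
    last_update = None
    for arrival in sorted(arrivals):
        allowed = (
            min_interval is None
            or last_update is None
            or arrival - last_update >= min_interval
        )
        if allowed:
            updates += 1
            last_update = arrival
    return updates


# Hypothetical burst: one TC BPDU every 0.1 s for 10 s.
ARRIVALS = [i / 10 for i in range(100)]

if __name__ == "__main__":
    print("without limit:", count_updates(ARRIVALS))
    print("with 2 s limit:", count_updates(ARRIVALS, min_interval=2))
```

In this model the burst triggers 100 updates without the limit and 5 updates with it, which is why the CPU has far fewer MAC and ARP entry updates to process.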

Conclusion

When this problem occurs, check whether packets are being dropped because their rate exceeds the CPCAR limit.

When deploying STP, you are advised to enable TC protection and configure all ports connected to terminals as edge ports. These measures prevent status change of an interface from causing flapping and re-convergence of the entire STP network.

OSPF Flapping Causes a High CPU Usage

Symptom

In Figure 1-2, OSPF runs on Switch_1, Switch_2, Switch_3, and Switch_4. On Switch_1, the CPU usage is high, the ROUT task consumes more CPU resources than any other task, and route flapping occurs.

Figure 1-2 Networking diagram

Root Cause

An IP address conflict on the network causes route flapping.

Identification Method

  1. Run the display ospf lsdb command on each switch at an interval of one second to check information about the OSPF link state database (LSDB) on the switches.
  2. Locate the fault based on the collected command output of each switch.

    • If both of the following situations occur, LSA aging is abnormal.
      • The Age value that indicates the aging time of a network LSA is 3600 on a switch or the switch does not have the network LSA, and the Sequence value increases quickly.
      • The Age value of the same network LSA on different switches frequently alternates between 3600 and smaller values, and the Sequence value increases quickly.
        <HUAWEI> display ospf lsdb 
        
        OSPF Process 100 with Router ID 3.3.3.3
        Link State Database
        
        Area: 0.0.0.0
        ----------------------------------------------------------------------------
         Type      LinkState ID    AdvRouter        Age    Len Sequence       Metric
         Router    4.4.4.4         4.4.4.4            2     48 8000000D            1
         Router    3.3.3.3         3.3.3.3            6     72 80000016            1
         Router    2.2.2.2         2.2.2.2          228     60 8000000D            1
         Router    1.1.1.1         1.1.1.1          258     60 80000009            1
         Network   112.1.1.4       4.4.4.4          121     32 80000001            0
         Network   112.1.1.2       1.1.1.1         3600     32 80000015            0
         Network   222.1.1.3       3.3.3.3          227     32 80000003            0
         Network   111.1.1.1       1.1.1.1          259     32 80000002            0
      1. Run the display ospf routing command on each switch at an interval of one second. If routes flap but the OSPF neighbor relationship does not, an IP address conflict or router ID conflict exists. In this case, the display ospf lsdb command output indicates that the IP address of the designated router (DR) or backup designated router (BDR) conflicts with that of a non-DR device.
      2. Locate one conflicting interface on a switch based on the AdvRouter value, and locate the other conflicting device based on the IP address plan. The other conflicting device is difficult to locate from OSPF information alone.

      In this example, the conflicting IP address is 112.1.1.2 and the router ID of one conflicting device is 1.1.1.1. However, the other conflicting device (3.3.3.3) cannot be located through OSPF information.

    • If two network LSAs on a switch both have the LinkState ID 112.1.1.2, the Age values of both LSAs are small, and the Sequence value increases quickly, an IP address conflict occurs on the DR and BDR.
      <HUAWEI> display ospf lsdb 
      
      OSPF Process 100 with Router ID 3.3.3.3
      Link State Database
      
      Area: 0.0.0.0
      ----------------------------------------------------------------------------
       Type      LinkState ID    AdvRouter        Age    Len Sequence       Metric
       Router    4.4.4.4         4.4.4.4           17     48 8000011D            1
       Router    3.3.3.3         3.3.3.3           21     72 8000015A            1
       Router    2.2.2.2         2.2.2.2          151     60 80000089            1
       Router    1.1.1.1         1.1.1.1         1180     60 8000002A            1
       Network   112.1.1.2       3.3.3.3            3     32 8000016A            0
       Network   112.1.1.2       1.1.1.1            5     32 80000179            0
       Network   222.1.1.3       3.3.3.3          145     32 8000002D            0
       Network   212.1.1.4       4.4.4.4           10     32 80000005            0
       Network   111.1.1.2       2.2.2.2          459     32 80000003            0
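
Both symptoms described above can be checked in saved display ospf lsdb output with a short script. The following is a minimal sketch that reports network LSAs advertised by more than one router under the same LinkState ID, and network LSAs whose Age value has reached 3600. It assumes the seven-column row layout shown above. The function name is hypothetical, and the samples are copied from the network LSA rows of the two outputs in this section.

```python
from collections import defaultdict

MAX_AGE = 3600  # Age value at which an LSA is aged out


def network_lsa_findings(output: str):
    """Return (duplicates, max_age_ids) for network LSAs in the output.

    duplicates maps a LinkState ID to the sorted list of advertising
    routers when more than one router advertises that ID.
    """
    advertisers = defaultdict(set)
    max_age_ids = set()
    for line in output.splitlines():
        fields = line.split()
        # Expected columns: Type, LinkState ID, AdvRouter, Age, Len,
        # Sequence, Metric.
        if len(fields) != 7 or fields[0] != "Network":
            continue
        ls_id, adv_router, age = fields[1], fields[2], int(fields[3])
        advertisers[ls_id].add(adv_router)
        if age >= MAX_AGE:
            max_age_ids.add(ls_id)
    duplicates = {
        ls_id: sorted(routers)
        for ls_id, routers in advertisers.items()
        if len(routers) > 1
    }
    return duplicates, max_age_ids


FIRST_OUTPUT = """\
 Network   112.1.1.4       4.4.4.4          121     32 80000001            0
 Network   112.1.1.2       1.1.1.1         3600     32 80000015            0
 Network   222.1.1.3       3.3.3.3          227     32 80000003            0
 Network   111.1.1.1       1.1.1.1          259     32 80000002            0
"""

SECOND_OUTPUT = """\
 Network   112.1.1.2       3.3.3.3            3     32 8000016A            0
 Network   112.1.1.2       1.1.1.1            5     32 80000179            0
 Network   222.1.1.3       3.3.3.3          145     32 8000002D            0
 Network   212.1.1.4       4.4.4.4           10     32 80000005            0
 Network   111.1.1.2       2.2.2.2          459     32 80000003            0
"""

if __name__ == "__main__":
    print(network_lsa_findings(FIRST_OUTPUT))
    print(network_lsa_findings(SECOND_OUTPUT))
```

A single snapshot is only a hint. Confirm the finding by checking that the Sequence value of the same LSA keeps increasing across repeated runs of the command.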

Solution

Change the IP address of a conflicting device based on the IP address plan.

Conclusion

  • The following problems may occur due to IP address conflicts on networks.
    • The CPU usage is high.
    • Route flapping occurs.
  • On an OSPF network, IP address conflicts between interfaces may cause frequent aging and generation of LSAs. This results in network instability, route flapping, and high CPU usage.

Configure IP addresses for interfaces according to the network plan, and do not modify planned network parameters.

How to Relieve CPU Load

  1. Configure ARP security to protect the device against ARP or ARP Miss attacks.

    For details about ARP security, see ARP Security Solutions in the Configuration > Security Configuration Guide > ARP Security Configuration.

  2. On the network prone to DHCP and ARP attacks, configure local attack defense policies for DHCP and ARP protocol packets.
    This section provides suggestions on local attack defense policies in general situations. The requirements on different protocol packets sent to the CPU may vary according to the model and version. In practice, configure CPU attack defense based on service requirements; otherwise, the configuration may fail or services may be affected.
    # 
    cpu-defend policy policy1  
     auto-defend enable  
     auto-defend action deny
     auto-defend trace-type source-mac source-ip  
     auto-defend protocol arp dhcp         
     auto-defend whitelist 1 interface 10GEx/x/x  //Add interconnected interfaces to the whitelist.  
     auto-defend whitelist 2 interface 10GEx/x/x  //Add uplink interfaces to the whitelist.  
    #  
    cpu-defend-policy policy1  
    #
  3. Administrators log in to the switch through SSH, Telnet, or SNMP. Configure an ACL so that only the administrator can log in.

    # In VTY 0-14, configure the ACL to allow only the user with source IP address 10.1.1.1/32 to log in to the switch.

    <HUAWEI> system-view  
    [~HUAWEI] acl 2001  
    [*HUAWEI-acl4-basic-2001] rule 5 permit source 10.1.1.1 0  
    [*HUAWEI-acl4-basic-2001] quit  
    [*HUAWEI] user-interface vty 0 14  
    [*HUAWEI-ui-vty0-14] acl 2001 inbound
    [*HUAWEI-ui-vty0-14] commit
  4. Frequent MAC address flapping may result in high CPU usage. If MAC address flapping is likely to occur frequently on an interface, run the mac-address flapping trigger error-down command in the interface view so that the system sets the interface to the error-down state after detecting MAC address flapping.
  5. Load and activate the patch files of the corresponding software version.

    Visit http://support.huawei.com/enterprise/ to obtain the corresponding patch file and documents (patch release notes and installation guide).

  6. The switch provides CPCAR values for each protocol. Generally, the default CPCAR values can meet requirements. If service traffic volume is too high, contact technical support personnel to adjust the CPCAR values.
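
In the ACL rule in step 3, the trailing 0 in rule 5 permit source 10.1.1.1 0 is a wildcard mask of 0.0.0.0, so only the host 10.1.1.1 matches. The sketch below illustrates wildcard matching in general: a 0 bit in the mask must match, and a 1 bit is ignored. The function name is hypothetical, and the code only illustrates the matching logic. It does not model how the switch evaluates ACLs.

```python
import ipaddress


def acl_matches(address: str, rule_source: str, wildcard: str = "0.0.0.0") -> bool:
    """Return True if address matches rule_source under a wildcard mask.

    In a wildcard mask, a 0 bit must match and a 1 bit is ignored.
    """
    addr = int(ipaddress.IPv4Address(address))
    source = int(ipaddress.IPv4Address(rule_source))
    # Invert the wildcard to get the bits that must be compared.
    care_bits = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr & care_bits) == (source & care_bits)


if __name__ == "__main__":
    # Wildcard 0 (0.0.0.0): only the exact host matches.
    print(acl_matches("10.1.1.1", "10.1.1.1"))
    print(acl_matches("10.1.1.2", "10.1.1.1"))
    # Wildcard 0.0.0.255: any host in 10.1.1.0/24 matches.
    print(acl_matches("10.1.1.200", "10.1.1.0", "0.0.0.255"))
```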
Updated: 2019-07-01

Document ID: EDOC1100086953
