NetEngine A821 E Troubleshooting Guide
Troubleshooting High CPU Usage
- Troubleshooting Flowchart
- Troubleshooting Procedure
- Checking the Software Version and Board Status
- Checking the CPU Usage of Each Board
- Checking the CPU Usage of Each Service Type
- Checking the CPU Usage of Each Process
- Checking the CPU Usage of Each Component
- Checking the CPU Usage of Each Thread
- Checking Thread Call Stacks
- Checking the Message Processing Statistics of Components
- Checking Alarms and Logs As Well As Statistics About Historical CPU Usage and Packets Sent to the CPU
- Collecting Information About High CPU Usage of the CMF Service
Troubleshooting Procedure
Save the results of each troubleshooting step. If the fault persists after following this procedure, these results are needed for further troubleshooting.
Checking the Software Version and Board Status
Run the display version command to check the software version of the device, and run the display device command to check the board status. Record the obtained information for subsequent fault locating.
- Run the display version command to check the software version of the device.
<HUAWEI> display version
Huawei Versatile Routing Platform Software
VRP (R) software, Version 8.24 (NetEngine A800 series V800R024C00SPC500)
Copyright (C) 2012-2020 Huawei Technologies Co., Ltd.
HUAWEI NetEngine A811 uptime is 7 days, 1 hour, 18 minutes
NetEngine A811 version information:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
BKP version information:
PCB Version       : xxxx REV C
MPU Slot Quantity : 2
...
- Run the display device command to check the types of boards on the device and their status.
<HUAWEI> display device
NetEngine A811's Device status:
-----------------------------------------------------------------------------
Slot #  Type  Online   Register      Status    Role   LsId  Primary
-----------------------------------------------------------------------------
17      MPU   Present  Registered    Normal    MMB    0     Master
18      MPU   Present  Unregistered  Abnormal  MMB    0     Slave
19      SFU   Present  Registered    Normal    OTHER  0     NA
20      SFU   Present  Registered    Normal    OTHER  0     NA
21      SFU   Present  Registered    Normal    OTHER  0     NA
22      SFU   Present  Registered    Normal    OTHER  0     NA
23      CLK   Present  Registered    Normal    OTHER  0     Master
24      CLK   Present  Unregistered  Abnormal  OTHER  0     NA
25      PWR   Present  Registered    Abnormal  OTHER  0     NA
27      FAN   Present  Registered    Normal    OTHER  0     NA
28      FAN   Present  Registered    Normal    OTHER  0     NA
-------------------------------------------------------------------------------
Checking the CPU Usage of Each Board
- Run the display health command to check the CPU usage of each board and learn about the CPU running status. If the CPU usage of a board exceeds the specified alarm threshold, the device generates an alarm and a log. For details about how to check alarm or log information, see Checking Alarms and Logs. Then, check the CPU usage of each service type to further locate the fault.
<HUAWEI> display health
----------------------------------------------------------------
Slot            CPU Usage  Memory Usage(Used/Total)
----------------------------------------------------------------
17 MPU(Master)  20%        12%  **06MB/**724MB
19 SFU           3%        14%    13MB/ 28MB
20 SFU           3%        14%    13MB/ 28MB
...
Checking the CPU Usage of Each Service Type
- Run the display cpu-usage service [ slot slot-id ] command to check the CPU usage of each service type. Then locate and rectify the fault according to Table 8-2.
<HUAWEI> display cpu-usage service
Cpu utilization statistics at 2020-07-23 15:51:48 381 ms
System cpu use rate is : 16%
---------------------------
ServiceName      UseRate
---------------------------
SYSTEM           11%
FEA               5%
ARP               0%
CMF               0%
CSP               0%
DEVICE            0%
DHCP              0%
FEC               0%
IP STACK          0%
LINK              0%
LLDP              0%
LOCAL PKT         0%
ND                0%
---------------------------
| Service Name | Description | Common Causes of High CPU Usage | Handling Suggestion |
| --- | --- | --- | --- |
| BRAS | BRAS-related services | A large number of users go online and offline. | Run the display aaa online-fail-record statistics and display aaa offline-record statistics commands to check whether a large number of users go online and offline. If this is the case, see the related section in "Troubleshooting Guide" for detailed instructions. For details, see A Large Number of Users Dial Up, Causing High CPU Usage. |
| AM | Address management in a DHCP server scenario | In a DHCP server scenario, when a large number of users go online, the CPU usage of the Ethernet user management (EUM) and DHCP services is high. | Decrease the CPCAR value to reduce the number of packets sent to the CPU. For details, see A Large Number of Users Go Online at the Same Time in a DHCP Relay or DHCP Server Scenario, Causing High CPU Usage. |
| DHCP | DHCP relay and DHCP server scenarios | A large number of DHCP packets are sent to the CPU. | Decrease the CPCAR value to reduce the number of packets sent to the CPU. For details, see A Large Number of Users Go Online at the Same Time in a DHCP Relay or DHCP Server Scenario, Causing High CPU Usage. |
| EUM | Ethernet user management in DHCP relay and DHCP server scenarios | A large number of users go online concurrently. Generally, DHCP services also encounter high CPU usage in this case. | Decrease the CPCAR value to reduce the number of packets sent to the CPU. For details, see A Large Number of Users Go Online at the Same Time in a DHCP Relay or DHCP Server Scenario, Causing High CPU Usage. |
| AAA | Authentication, authorization, and accounting | A large number of authentication, authorization, or accounting packets are sent or received. | Check the increase in the number of authentication, authorization, and accounting packets on the device, determine the characteristics of the packet type whose number increases obviously, locate their sources and destinations, and reduce the number of packets of this type sent and received. For details, see A Large Number of Packets Are Sent to the CPU, Causing High CPU Usage. |
| ARP | Layer 3 interfaces learning ARP entries | An ARP packet attack occurs. | Check CPCAR statistics and attack source tracing information on the involved board. For details, see A Large Number of Packets Are Sent to the CPU, Causing High CPU Usage. |
| IP stack | Hosts sending and receiving packets | Generally, a large number of packets, such as ICMP packets and TTL Exceeded packets, are sent to the CPU. | Check the number of packets sent to the CPU on each interface board. If a large number of packets are sent to the CPU or some packets are discarded, see A Large Number of Packets Are Sent to the CPU, Causing High CPU Usage. |
| ND | ND packets | A large number of ND packets are sent or received. | Check the increase in the number of ND packets, determine the characteristics of the packets whose number increases obviously, locate their sources and destinations, and reduce the number of such packets sent and received. For details, see A Large Number of Packets Are Sent to the CPU, Causing High CPU Usage. |
| CMF | Data collection and configuration delivery by an NMS | The NMS frequently collects data. | Reduce the data collection frequency of the NMS. For details, see An NMS Frequently Collects Data, Causing High CPU Usage. For details about data collection, see Collecting Information About High CPU Usage of the CMF Service. |
| BGP | BGP | BGP route or peer flapping occurs. | Troubleshoot the route or peer flapping. |
| IS-IS | IS-IS (IGP) | IS-IS route or neighbor flapping occurs. | Troubleshoot the route or neighbor flapping. |
| OSPF | OSPF (IGP) | OSPF route or neighbor flapping occurs. | Troubleshoot the route or neighbor flapping. |
| PIM | PIM (multicast) | PIM routing entry or neighbor flapping occurs. | Troubleshoot the PIM routing entry or neighbor flapping. |
| Device | Device management | | If the high CPU usage is caused by a device hardware exception or fault, rectify the exception or fault. |
| IFM | Interface status management | The interface status changes frequently. | Analyze the cause of the interface status change. |
| FEA | Forwarding engine adaptation component | This problem is related to the common process of the FEI framework. The other service types with high CPU usage also need to be checked. | Rectify the fault based on the other service types with high CPU usage. |
| FEC | Forwarding engine general processing component | The problem is related to the database of the FES module. The other service types with high CPU usage also need to be checked. | Rectify the fault based on the other service types with high CPU usage. |
| System | Both system components and services consuming CPU resources | The CPU usage of a specific service is high. | Troubleshoot the high CPU usage of that service. |
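To spot the dominant service type across repeated samples of this output, the UseRate column can be extracted with a short script. A minimal Python sketch, assuming output in the format shown earlier; the sample text and parsing logic are illustrative only.

```python
import re

# Hypothetical snippet of "display cpu-usage service" output.
SAMPLE = """\
ServiceName      UseRate
---------------------------
SYSTEM           11%
FEA               5%
ARP               0%
"""


def top_services(text, n=3):
    """Return the n (service, usage%) pairs with the highest usage, descending.

    Service names in this output are upper-case tokens (possibly with spaces,
    e.g. "IP STACK"), which the character class below relies on.
    """
    rows = re.findall(r"^\s*([A-Z][A-Z0-9 _-]*?)\s+(\d+)%\s*$", text, re.M)
    rows = [(name.strip(), int(pct)) for name, pct in rows]
    return sorted(rows, key=lambda r: -r[1])[:n]


print(top_services(SAMPLE))
```

The service that tops this list is the row to look up in the table above.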
Checking the CPU Usage of Each Process
- Run the display cpu-usage process [ slot slot-id ] command in the diagnostic view to check the CPU usage of each process.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cpu-usage process
2020-07-22 18:03:56.389
Cpu utilization statistics at 2020-07-22 18:03:56 431 ms
System cpu use rate is : 14%
Cpu utilization for five seconds: 14% ; one minute: 10% ; five minutes: 14%.
Max CPU Usage : 94%
Max CPU Usage Stat. Time : 2020-07-22 15:08:17 653 ms
Cpu use top process:
------------------------------------------------------------------------------
ProcessId  ProcessName  UseRate  ProcessDescription
------------------------------------------------------------------------------
6          CFG          7%       Configuration Management
0          OS           7%       OS Kernel
1004       PSM2         0%       Protocol Stack Manager 2
8          LM           0%       Board Management
1185       LOGSERVER    0%       Log Server
1003       FESMB        0%       FES MB
1001       RESP         0%       RESP distribution
1002       PSM1         0%       Protocol Stack Manager 1
1022       PROTO4       0%       PROTO4
1005       OPS          0%       OPS
1000       PROTO5       0%       IP Stack 5
------------------------------------------------------------------------------
Checking the CPU Usage of Each Component
- Run the display cpu-usage process process-id command in the diagnostic view to check the CPU usage of each component.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cpu-usage process 3
CPU utilization statistics at 2020-04-09 22:43:08 936 ms
Process ID : 3
Process name : CFG
Process CPU use rate is : 8%
Component CPU use rate :
-------------------------------------------
ComponentCid  ComponentName  UseRate
-------------------------------------------
0x802B272F    LCS            0%
0x80CF0009    APPCFG6        0%
0x80CE000A    DBMS7          0%
0x80D02739    PM             0%
0x80D22737    LAM            0%
0x80D12735    AAA            3%
0x80E72715    FMSERVER       0%
0x80CD000B    ADM8           0%
0x80CB000C    CFG9           0%
0x802C000E    SSPRPM11       0%
0x80CC000D    ECM10          5%
0x80332713    DBG_AGENT      0%
0x80CA2713    CLI10002       0%
0x80602717    LOGSERVER      0%
0x0           OS             0%
-------------------------------------------
Total = 15
Checking the CPU Usage of Each Thread
- Run the display system thread process process-id command in the diagnostic view to check the CPU usage of each thread and locate the thread with high CPU usage.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display system thread process 3
Info: Operating, please wait for a moment.....done.
--------------------------------------------------------------------------------
Process ID  Thread ID   Thread Type    Bind Comp  Bind Cpu  Bind Flag  Usage
--------------------------------------------------------------------------------
3           3075348160  main thread    Bind       all       Bind       0%
3           2887215984  DefSch0700     Free       all       Free       0%
3           2896264048  DefSch0800     Free       all       Free       0%
3           2905373552  DefSch0302     Free       all       Free       0%
3           2913766256  DefSch0301     Free       all       Free       0%
3           2877770608  RPM_Listen     Bind       all       Free       0%
3           2814995312  DefSch0a00     Free       all       Free       0%
3           2800188272  AppThread      Free       all       Free       0%
3           2747112304  DefSch0500     Free       all       Free       0%
3           2756094832  DefSch0400     Free       all       Free       0%
3           2737068912  DefSch0600     Free       all       Free       0%
3           2922158960  DefSch0300     Free       all       Bind       0%
3           2946890608  DefSch0100     Free       all       Free       0%
3           3034577776  TICK           Free       all       Free       0%
3           3044182896  VCLK           Free       0         Free       0%
3           3052575600  AppThread      Free       all       Free       0%
3           2938497904  DefSch0101     Free       all       Free       0%
3           3014716272  RtSchedTask    Free       all       Free       0%
3           2994244464  IPC0000        Free       all       Free       0%
3           3006323568  DMS_TIPC_SEND  Free       all       Free       0%
3           2976168816  DefSch0900     Free       all       Free       0%
3           2984790896  DefSch0200     Free       all       Free       0%
3           2707618672  AppThread      Free       all       Free       0%
--------------------------------------------------------------------------------
Total = 23
Checking Thread Call Stacks
- Run the display thread schedule process process-id command in the diagnostic view to check the thread in which a specified component runs, as well as the scheduling distribution relationship among the scheduling container of a specified process, its scheduling threads, and its components.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display thread schedule process 2
Schedule container: 1
Type: nobind
Schedule threads list:
Task-id  Thread-id
9        1269589168
10       1270637744
Schedule components list:
Bind-component  Component-type  Component-cid
PDEVM2          250             0x80FA0003
LDEVM5          251             0x80FB0007
SOCK            101             0x806503F6
LCSS            77              0x804D2711
MPE             79              0x804F2712
TM_SVR          38              0x80262713
----------------------------------------------------------------------------
Schedule container: 2
Type: bind
Task-id  Thread-id   Bind-component  Component-type  Component-cid
6        1260139696  SEM_Agent4      3               0x80030004
- Run the display thread callstack process process-id thread-id command in the diagnostic view to check the call stack of a specified thread. (You can run the command multiple times to find the corresponding call stack.) Alternatively, run the command without the thread-id parameter multiple times while the CPU usage is high to check the call stack of each thread of the current process.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display thread callstack process 3
Thread 0 (Thread MainThread):
#00 0xb7aa8496 libsspbase.so(SSP_ProcCallStackSigFunc+0x116) [0xb7aa8496]
#01 0xffffe410 [0xffffe410]
#02 0xb7ab294d libsspbase.so(SSP_sleep+0x1d) [0xb7ab294d]
#03 0x0804e39c location(LOC_BlockMainThread+0xac) [0x804e39c]
#04 0x08051158 location(LOC_LocationMain+0x3b8) [0x8051158]
#05 0x08051442 location(main+0x32) [0x8051442]
#06 0xb783087c libc.so.6(__libc_start_main+0xdc) [0xb783087c]
#07 0x0804cbf1 location [0x804cbf1]
Thread 1 (Thread BOX_Out):
#00 0xb7aa8496 libsspbase.so(SSP_ProcCallStackSigFunc+0x116) [0xb7aa8496]
#01 0xffffe410 [0xffffe410]
Thread 2 (Thread VCLK):
#00 0xb7aa8496 libsspbase.so(SSP_ProcCallStackSigFunc+0x116) [0xb7aa8496]
#01 0xffffe410 [0xffffe410]
Thread 3 (Thread TICK):
#00 0xb7aa8496 libsspbase.so(SSP_ProcCallStackSigFunc+0x116) [0xb7aa8496]
#01 0xffffe410 [0xffffe410]
- Run the display cpu-usage [ slot slot-id | all ] top [ show-num ] command in the diagnostic view to check detailed information about process CPU usage, thread CPU usage, and thread call stacks. You are advised to run this command multiple times.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cpu-usage slot 2 top 3
--------------------------------------------------------
Slot 2 cpu information
--------------------------------------------------------
Slot 2 cpu use top 3 Process:
--------------------------------------------------------
OsProcessID  OsProcessname  Usage
--------------------------------------------------------
1            monitor        0%
2            kthreadd       0%
3            migration/0    0%
--------------------------------------------------------
Slot 2 cpu use top 3 thread:
------------------------------------------------------------------------------------------
ProcessID  ProcessName  ThreadID    ThreadType   Bind Comp  Bind Cpu  Usage
------------------------------------------------------------------------------------------
3          LM           1208090752  main thread  Bind       all       15%
3          LM           1259918512  DefSch0800   Free       all       12%
3          LM           1260049584  ADMSESS0     Free       all       0%
------------------------------------------------------------------------------------------
Thread call stack information:
-----------------------------
Thread 1208090752 (Thread main thread):
#00 libc.so.6(epoll_wait)
#01 location(Frame_FdMainThread)
#02 location(Frame_Main)
#03 location(main)
#04 kernel(System symbol)
#05 kernel(System symbol)
Thread 1222050992 (Thread DefSch0800):
#00 libc.so.6(__select)
#01 libdefault.so(VOS_TaskDelay)
#02 libdefault.so(vosBoxextOutputTaskEntry)
#03 libdefault.so(tskAllTaskEntry)
#04 kernel(System symbol)
#05 libc.so.6(clone)
Thread 1222182064 (Thread ADMSESS0):
#00 libc.so.6(__select)
#01 libdefault.so(tskAllTaskEntry)
#02 kernel(System symbol)
#03 libc.so.6(clone)
--------------------------------------------------------
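A single call stack sample can catch a thread in an unrepresentative state, which is why the command should be run multiple times. The samples can then be aggregated offline to see which frames recur. A minimal Python sketch; the sample stacks are hypothetical, and the aggregation approach is a generic sampling-profiler idea rather than a device feature.

```python
from collections import Counter

# Hypothetical call-stack samples captured by running the command several times.
samples = [
    ["libc.so.6(epoll_wait)", "location(Frame_FdMainThread)", "location(Frame_Main)"],
    ["libecm.so(ECM_SPT_GetValue)", "libecm.so(ECM_SPT_GetOneRecord)"],
    ["libecm.so(ECM_SPT_GetValue)", "libecm.so(ECM_SPT_OutRecord)"],
]


def hottest_frames(samples, n=2):
    """Count how often each frame appears across samples; frequently recurring
    frames hint at where the busy thread is spending its time."""
    counts = Counter(frame for stack in samples for frame in stack)
    return counts.most_common(n)


print(hottest_frames(samples))
```

Frames that appear in most samples point to the module to investigate.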
Checking the Message Processing Statistics of Components
All VRP8 components are driven by message communication; that is, all processing that consumes CPU resources is related to message processing. If the CPU usage of a component is high, analyze the number of messages processed by the component and the processing time to locate the cause of the high CPU usage.
- Run the display message-process statistic process process-id command in the diagnostic view to check the number of messages processed by all components. In the command output, CompName indicates a component name, SubIntName indicates a message name, Count indicates the number of messages processed, and RunTime indicates the total processing time.
<HUAWEI> system-view [~HUAWEI] diagnose [~HUAWEI-diagnose] display message-process statistic process 3 The message process statistics -------------------------------------------------------------------------------- CompType CompName CompCID Intf IntfName SubIntf SubIntfName Count RunTime(us) AvgRunTime(us) MaxRunTime(us) CpuRunTime(us) AvgCpuRunTime(us) MaxCpuRunTime(us) MaxStarveTime(ms) TotalStarveTime(ms) SendErrNum RecvErrNum -------------------------------------------------------------------------------- 3 SEM_Agent 0x800303F7 0 INTF_SSP 3 VOS_IID_TMR_TIMEOUT 1904529 170309075 89 4913 154479701 81 4907 0 74609 0 0 3 SEM_Agent 0x800303F7 0 INTF_SSP 32 SSP_SUB_INTF_HA 90542 12239643 135 28202 11570968 127 28200 0 36869 0 0 3 SEM_Agent 0x800303F7 0 INTF_SSP 36 SSP_SUB_INTF_DBG 58 4735 81 118 4255 73 105 0 51 0 0 3 SEM_Agent 0x800303F7 0 INTF_SSP 40 SSP_SUB_INTF_IMI 27399 421632 15 98 324716 11 97 0 1946 0 0 --------------------------------------------------------------------------------
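The per-message columns are related in a simple way: since Count is the number of messages processed and RunTime is the total processing time, the AvgRunTime column is just their quotient. A minimal Python sketch using figures from the sample output above to illustrate the relationship (the helper function is illustrative, not a device command):

```python
# (comp, message, count, total_runtime_us) taken from the sample output above.
rows = [
    ("SEM_Agent", "VOS_IID_TMR_TIMEOUT", 1904529, 170309075),
    ("SEM_Agent", "SSP_SUB_INTF_HA", 90542, 12239643),
]


def avg_runtime_us(count, runtime_us):
    """AvgRunTime(us) in the command output is RunTime divided by Count."""
    return runtime_us // count


for comp, msg, count, runtime in rows:
    print(comp, msg, avg_runtime_us(count, runtime))
```

A component with a large Count but small AvgRunTime is busy because of message volume; a small Count with a large AvgRunTime points to expensive individual messages.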
- Run the display message-process statistic process process-id | include CompName command in the diagnostic view to check statistics about messages processed by a specified component.
<HUAWEI> system-view [~HUAWEI] diagnose [~HUAWEI-diagnose] display message-process statistic process 3 | include BR_UM Info: It will take a long time if the content you search is too much or the string you input is too long, you can press CTRL_C to break. -------------------------------------------------------------------------------- CompType CompName CompCID Intf IntfName SubIntf SubIntfName Count RunTime(us) AvgRunTime(us) MaxRunTime(us) CpuRunTime(us) AvgCpuRunTime(us) MaxCpuRunTime(us) MaxStarveTime(ms) TotalStarveTime(ms) SendErrNum RecvErrNum -------------------------------------------------------------------------------- 3 BR_UM 0x800303F7 0 INTF_SSP 3 VOS_IID_TMR_TIMEOUT 1904529 170309075 89 4913 154479701 81 4907 0 74609 0 0 3 BR_UM 0x800303F7 0 INTF_SSP 32 SSP_SUB_INTF_HA 90542 12239643 135 28202 11570968 127 28200 0 36869 0 0 3 BR_UM 0x800303F7 0 INTF_SSP 36 SSP_SUB_INTF_DBG 58 4735 81 118 4255 73 105 0 51 0 0 3 BR_UM 0x800303F7 0 INTF_SSP 40 SSP_SUB_INTF_IMI 27399 421632 15 98 324716 11 97 0 1946 0 0 --------------------------------------------------------------------------------
Checking Alarms and Logs As Well As Statistics About Historical CPU Usage and Packets Sent to the CPU
Checking Alarms and Logs
If the CPU usage exceeds the specified threshold, both an alarm and a log are generated. You can view the alarm or log to obtain the record of high CPU usage.
- Run the display alarm all or display alarm history command to check alarm information on the device.
The set cpu-usage threshold command sets an overload alarm threshold and an alarm recovery threshold for CPU usage. By default, the overload alarm threshold of CPU usage is 90%, and the alarm recovery threshold of CPU usage is 75%. You can set proper thresholds for CPU usage based on service deployment on the device. If the overload alarm threshold is too low, the device may frequently report alarms. If the overload alarm threshold is too high, you cannot obtain information about high CPU usage in time.
When the CPU usage reaches the overload alarm threshold, the SYSTEM_1.3.6.1.4.1.2011.5.25.129.2.4.1 hwCPUUtilizationRisingAlarm alarm is generated.
The alarm information is as follows:
The CPU usage exceeded the pre-set overload threshold.(TrapSeverity=[TrapSeverity], ProbableCause=[ProbableCause], EventType=[EventType], PhysicalIndex=[PhysicalIndex], PhysicalName=[PhysicalName], RelativeResource=[RelativeResource], UsageType=[UsageType], SubIndex=[SubIndex], CpuUsage=[CpuUsage], Unit=[Unit], CpuUsageThreshold=[CpuUsageThreshold])
When the CPU usage decreases to the alarm recovery threshold, the SYSTEM_1.3.6.1.4.1.2011.5.25.129.2.4.2 hwCPUUtilizationResume alarm is generated.
The alarm information is as follows:
The CPU usage falls below the pre-set clear threshold.(TrapSeverity=[TrapSeverity], ProbableCause=[ProbableCause], EventType=[EventType], PhysicalIndex=[PhysicalIndex], PhysicalName=[PhysicalName], RelativeResource=[RelativeResource], UsageType=[UsageType], SubIndex=[SubIndex], CpuUsage=[CpuUsage], Unit=[Unit], CpuUsageThreshold=[CpuUsageThreshold])
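The two thresholds form a hysteresis band: the alarm is raised at the overload threshold and cleared only once usage drops to the lower recovery threshold, which prevents alarm flapping when usage hovers around a single value. A minimal Python sketch of this logic, assuming the default 90%/75% thresholds (the function is an illustration of the behavior, not device code):

```python
def alarm_state(prev_active, usage, raise_at=90, clear_at=75):
    """Return whether the CPU overload alarm is active after observing `usage`.

    raise_at / clear_at default to the documented default thresholds (90%/75%).
    """
    if usage >= raise_at:
        return True          # overload threshold reached: alarm raised
    if usage <= clear_at:
        return False         # recovery threshold reached: alarm cleared
    return prev_active       # between thresholds: state unchanged (hysteresis)


# Usage hovering at 80% does not clear an active alarm.
print(alarm_state(True, 80))
```

This is why setting the two thresholds too close together can still cause frequent alarm reporting.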
- Run the dir cfcard:/logfile/ command to download files in the logfile directory. Then, you can view log information. When the CPU usage reaches the overload alarm threshold, the SYSTEM/1/hwCPUUtilizationRisingAlarm_active log is generated.
SYSTEM/1/hwCPUUtilizationRisingAlarm_active: The CPU usage exceeded the pre-set overload threshold.(TrapSeverity=[TrapSeverity], ProbableCause=[ProbableCause], EventType=[EventType], PhysicalIndex=[PhysicalIndex], PhysicalName=[PhysicalName], RelativeResource=[RelativeResource], UsageType=[UsageType], SubIndex=[SubIndex], CpuUsage=[CpuUsage], Unit=[Unit], CpuUsageThreshold=[CpuUsageThreshold])
When the CPU usage decreases to the alarm recovery threshold, the hwCPUUtilizationRisingAlarm_clear log is generated.
SYSTEM/1/hwCPUUtilizationRisingAlarm_clear: The CPU usage falls below the pre-set clear threshold.(TrapSeverity=[TrapSeverity], ProbableCause=[ProbableCause], EventType=[EventType], PhysicalIndex=[PhysicalIndex], PhysicalName=[PhysicalName], RelativeResource=[RelativeResource], UsageType=[UsageType], SubIndex=[SubIndex], CpuUsage=[CpuUsage], Unit=[Unit], CpuUsageThreshold=[CpuUsageThreshold])
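When reviewing downloaded log files, the bracketed fields in these log lines can be pulled out programmatically. A minimal Python sketch; the field values in the sample line are invented placeholders for illustration, not real device output.

```python
import re

# Hypothetical log line in the hwCPUUtilizationRisingAlarm format, with
# made-up field values standing in for the [CpuUsage]-style placeholders.
LOG = ("SYSTEM/1/hwCPUUtilizationRisingAlarm_active: The CPU usage exceeded the "
       "pre-set overload threshold.(PhysicalName=[slot 17], CpuUsage=[95], "
       "Unit=[%], CpuUsageThreshold=[90])")


def parse_cpu_alarm(line):
    """Extract the location, usage, and threshold from a CPU overload log line."""
    fields = dict(re.findall(r"(\w+)=\[([^\]]*)\]", line))
    return (fields.get("PhysicalName"),
            int(fields["CpuUsage"]),
            int(fields["CpuUsageThreshold"]))


print(parse_cpu_alarm(LOG))
```

Collecting these tuples over time gives a quick history of which board exceeded which threshold and when.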
- When the CPU usage exceeds 80%, information about the three processes with the highest CPU usage is recorded in the diagnostic log in the following format. You can download the diagnostic log file to view the detailed information.
CID=[CID];Cpu overload warning in process [ProcessID], the info is: The current out of cpu overload warning in process [ProcessID]
Checking Historical CPU Usage Statistics
The system collects and saves CPU usage statistics at a specified interval (usually 60s). You can run the display cpu-usage history command to view the CPU usage recorded in a recent period and determine whether the CPU is working properly.
- Run the display cpu-usage history [ 1hour | 24hour | 72hour ] [ slot slot-id ] command to check historical CPU usage statistics.
The CPU usage is displayed in a two-dimensional time-usage chart, as shown in the following example:
<HUAWEI> display cpu-usage history 72 100%| 95%| * 90%| ** * 85%| ** * 80%| ** * 75%| * ** * 70%| ** * * * ************ * ****************************** 65%|************************************************************************ 60%|*****************************HH***************************************** 55%|*****************************HHHHH*********HHHHHHHHHHHHHHHH********HHHHH 50%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 45%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 40%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 35%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 30%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 25%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 20%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 15%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 10%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 5%|HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH -----+----+----+----+----+----+----+----+----+----+----+----+----+----+--> 10 20 30 40 50 60 70 System cpu-usage last 72 hours(Per Hour) * = maximum cpu-usage H = average cpu-usage
Checking Statistics About the Packets Sent to the CPU on a Board
Most high CPU usage problems on the live network occur because a large number of packets are sent to the CPU of an interface board. If this is the case, you are advised to check statistics about the packets sent to the CPU to determine which types of packets are sent to the CPU or discarded.
- Run the display cpu-defend statistics-all slot slot-id command in the diagnostic view to check statistics about the packets sent to the CPU on the interface board in a specified slot.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cpu-defend statistics-all slot 2
Index  CarID  Packet-Info        Passed Packets  Dropped Packets
==============================================================================================
15     190    IPV4_ARP_REQUEST   5               0
563    215    PST_BROADCAST      1               0
684    645    IPV6_MC_NS         1               0
NA     NA     UDF1_DENYv4        NA              4411185
NA     NA     UDF1_DENYv6        NA              4411185
NA     NA     BLACKLIST_DENYv4   NA              4411185
NA     NA     BLACKLIST_DENYv6   NA              4411185
NA     NA     WHITELIST_DENYv4   NA              4411185
NA     NA     WHITELIST_DENYv6   NA              4411185
- You can also run the display cpu-defend statistics-all slot slot-id clear command in the diagnostic view multiple times at an interval (5 seconds recommended) to check the new statistics recorded in each interval.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cpu-defend statistics-all slot 2 clear
Index  CarID  Packet-Info        Passed Packets  Dropped Packets
==================================================================================================
8      223    OSPF               10111           0
109    160    LLDP               9951            0
338    179    CROSSBOARD_MACDEL  340             0
563    215    PST_BROADCAST      9975            0
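When comparing snapshots taken a few seconds apart, the interesting packet types are those whose counters keep growing. A minimal Python sketch, assuming the counters have already been parsed into dictionaries; the snapshot data is hypothetical.

```python
# Hypothetical dropped-packet counters from two "display cpu-defend
# statistics-all" snapshots taken about 5 seconds apart.
snap_before = {"IPV4_ARP_REQUEST": 100, "UDF1_DENYv4": 4411185, "OSPF": 0}
snap_after = {"IPV4_ARP_REQUEST": 5100, "UDF1_DENYv4": 4411185, "OSPF": 0}


def growing_drops(before, after):
    """Return {packet_type: delta} for counters that grew between snapshots."""
    return {k: after[k] - before.get(k, 0)
            for k in after if after[k] > before.get(k, 0)}


print(growing_drops(snap_before, snap_after))
```

A large delta in one packet type (here, ARP requests) identifies the traffic to trace back to its source.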
Checking Diagnostic Logs About CPU Usage Fluctuation
The system checks the CPU usage every 10 seconds. If the CPU usage increases by more than 10% within 10 seconds, the CPU usage is considered to be fluctuating, and CPU usage information is recorded in diagnostic logs. You can download the diagnostic log files to view the related information.
Save the results of each troubleshooting step. If the fault persists after following this procedure, these results are needed for further troubleshooting.
- Determine the time when the CPU usage fluctuates. Search for the keyword hopping in the diagnostic log. The following log indicates that the CPU usage fluctuated at 2020-09-28 20:01:07.496-03:00; now is 99 indicates that the CPU usage was 99%.
2020-09-28 20:01:07.496-03:00 HUAWEI %%01DEBUG/4/DEBUG_CPUOVLOAD(D):CID=0x0;Cpu overload warning in process 11, the info is: The current out of cpu is hopping , lastUsage is 83, now is 99, old is 91:
- Check the threads with high CPU usage.
When the CPU usage fluctuates, the system records the CPU usage of the top 3 threads of all processes. Search for the keyword The current out of cpu info in location in the diagnostic log to find the thread with high CPU usage at the corresponding time point. The following log indicates that the CPU usage of the thread whose ID is 3006076080 in process 3 is 24%. You can compare the CPU usage of the top 3 threads of all processes at the corresponding time point to determine the ID of the thread with the highest CPU usage.
2020-09-28 20:01:07.455-03:00 HUAWEI %%01DEBUG/4/DEBUG_CPUOVLOAD(D):CID=0x80030008;Cpu overload warning in process 3, the info is: The current out of cpu info in location 3 is ThreadId: 3006076080, CpuUsage: 24; ThreadId: 3064108208, CpuUsage: 11; ThreadId: 3063977136, CpuUsage: 1
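To compare the top 3 threads across all processes at a given time point, the ThreadId/CpuUsage pairs can be extracted from each such log line. A minimal Python sketch; the shortened sample line follows the format shown above, and the parsing regex is illustrative.

```python
import re

# Hypothetical fragment of a DEBUG_CPUOVLOAD diagnostic log line.
LOG = ("The current out of cpu info in location 3 is ThreadId: 3006076080, "
       "CpuUsage: 24; ThreadId: 3064108208, CpuUsage: 11")


def thread_usages(line):
    """Extract (thread_id, usage%) pairs from a diagnostic log line."""
    return [(tid, int(usage)) for tid, usage in
            re.findall(r"ThreadId:\s*(\d+),\s*CpuUsage:\s*(\d+)", line)]


print(thread_usages(LOG))
```

Collecting these pairs from every process's log line at the same timestamp makes the overall busiest thread easy to pick out.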
- Check call stack information based on the thread ID.
When the CPU usage fluctuates, the system records the call stack information of the top 3 threads of all processes. Search for the thread ID determined in Step 2 in the diagnostic log. Find the corresponding module based on the call stack.
2020-09-28 20:01:08.622-03:00 HUAWEI %%01DEBUG/4/DBG_THREAD_CALLBACK(D):CID=0x80030008;The call stack of top thread CPU on slot 11: Process 3 Thread 3006076080(DefSch0300):#01 libecm.so(ECM_SPTI_GetField) #02 libecm.so(ECM_SPT_GetValue) #03 libecm.so(ECM_SPT_GetOneRecord) #04 libecm.so(ECM_SPT_GetAllRecords) #05 libecm.so(ECM_SPT_OutSomeRecords) #06 libecm.so(ECM_SPT_OutRecord) #07 libecm.so(ECM_SPT_DoOutCtrl4Para) #08 libecm.so(ECM_SPT_DoOutCtrl) #09 libecm.so(ECM_SPT_OutCtrl) #10 libscriptlib.so [0x1e23e458] #11 libscriptlib.so [0x1e25a6c8] #12 libscriptlib.so [0x1e23e18c] #13 libscriptlib.so [0x1e23d348] #14 libscriptlib.so(lua_resume) #15 libscriptlib.so(VSF_lInnerResumeCo) #16 libscriptlib.so(VSF_lResumeCo) #17 libscriptlib.so(VSF_DirectResumeCo) #18 libscriptlib.so(Script_ResumeCoDirectly) #19 libscriptlib.so(Script_ResumePreparedCo)
Not all call stacks displayed are valid call stacks. Skip the call stack where the DOPRA is waiting for scheduling, and repeat this step to search for the call stack of the specific service based on the thread ID until this call stack is found.
The following is an example of an invalid call stack, in which the DOPRA is merely waiting for scheduling.
2020-09-28 20:01:08.622-03:00 HUAWEI %%01DEBUG/4/DBG_THREAD_CALLBACK(D):CID=0x80030008;The call stack of top thread CPU on slot 11: Process 3 Thread 3006076080(DefSch0300):#00 libpthread.so.0(pthread_cond_timedwait) #01 libdefault.so(vosSemaP) #02 libdefault.so(VOS_SemaP) #03 libdefault.so(rtfScmCompSchTaskEntry) #04 libdefault.so(rtfScmCompScheDefaultEntry) #05 libdefault.so(rtfScmTaskDeployDefaultCompEntry) #06 libdefault.so(tskAllTaskEntry) #07 libpthread.so.0 [0x2062cedc] #08 libc.so.6(clone)
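When many sampled stacks must be sifted, the DOPRA scheduling-wait stacks can be filtered out automatically before looking for the service call stack. A minimal Python sketch; the wait-frame list is drawn from the example above and is illustrative, not exhaustive.

```python
# Frames that suggest the DOPRA framework is merely waiting for scheduling
# (an invalid sample for locating the busy service). Illustrative list only.
WAIT_FRAMES = ("vosSemaP", "VOS_SemaP", "pthread_cond_timedwait")


def is_wait_stack(stack):
    """True if the sampled stack looks like DOPRA waiting for scheduling."""
    return any(w in frame for frame in stack for w in WAIT_FRAMES)


busy = ["libecm.so(ECM_SPTI_GetField)", "libecm.so(ECM_SPT_GetValue)"]
idle = ["libpthread.so.0(pthread_cond_timedwait)", "libdefault.so(vosSemaP)"]

print(is_wait_stack(busy), is_wait_stack(idle))
```

Stacks that survive this filter are the ones worth mapping back to a specific service module.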
Collecting Information About High CPU Usage of the CMF Service
The CMF service processes configuration delivery and query requests from the CLI, SNMP, and NETCONF services. If the CPU usage of the CMF service is high, the possible cause is that the NMS frequently queries service MIB objects through SNMP, the NMS periodically performs full or incremental synchronization through NETCONF, or related services are time-consuming.
Save the results of each troubleshooting step. If the fault persists after following this procedure, these results are needed for further troubleshooting.
- Check the services queried by the NMS through SNMP/NETCONF. When the CPU usage is high, run the query commands as many times as possible: the information that the NMS collects frequently appears more often in the command output.
Run the display cmf-info luascript history process process-id command in the diagnostic view to collect the executed service script.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cmf-info luascript history process 6
*************************************BEGIN*************************************
The history script from ecm:
TransNo SsnID Optype MsgType Ecm RunNum RunRetCode SptRetCode TotalMem(K) CurSptMem(K) LoadTime(ms) MicroRunTime(ms) SptFileName StartRunTime FinishRunTime CancelTime TimeOut LuaLastErrInfo
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49856 1 0 0 RPKIOM_SyncStaticAlm_after.lua 2024-07-18 14:55:46.006 2024-07-18 14:55:46.006 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49857 1 0 0 SEGROM_SyncStaticAlm_after.lua 2024-07-18 14:55:46.006 2024-07-18 14:55:46.006 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49859 2 0 0 SRPOLICYOM_SyncStaticAlm_after 2024-07-18 14:55:46.006 2024-07-18 14:55:46.006 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49859 0 0 0 TWAMPCLIENTOM_SyncStaticAlm_af 2024-07-18 14:55:46.006 2024-07-18 14:55:46.006 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49860 1 0 0 TWAMPOM_SyncStaticAlm_after.lu 2024-07-18 14:55:46.006 2024-07-18 14:55:46.006 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49862 2 0 0 TELEMETRYOM_SyncStaticAlm_afte 2024-07-18 14:55:46.006 2024-07-18 14:55:46.007 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49862 0 0 0 NVO3_SyncStaticAlm_after.lua 2024-07-18 14:55:46.007 2024-07-18 14:55:46.007 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49863 1 0 0 Y1731OM_SyncStaticAlm_after.lu 2024-07-18 14:55:46.007 2024-07-18 14:55:46.007 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49863 0 0 0 NSMOM_SyncStaticAlm_after.lua 2024-07-18 14:55:46.007 2024-07-18 14:55:46.007 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49881 18 0 0 IKEIPSECOM_SyncStaticAlm_after 2024-07-18 14:55:46.007 2024-07-18 14:55:46.009 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49882 1 0 0 MCNAT_SyncStaticAlm_after.lua 2024-07-18 14:55:46.009 2024-07-18 14:55:46.010 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49886 4 0 0 FEI_MC_MVPN_SyncStaticAlm_afte 2024-07-18 14:55:46.010 2024-07-18 14:55:46.010 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x00000000 0x00000d90 0x0000000b 0x00000006 0x80cc001d 1 0x00800910 0x00000000 49887 1 0 0 FEIFRAMEPDT_OM_LCS_syncstatica 2024-07-18 14:55:46.010 2024-07-18 14:55:46.010 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
......
......
*************************************END***************************************
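When the history listing spans thousands of records, it is often easier to save the output to a text file and post-process it offline. The following Python sketch is an illustration only: it assumes (based on the sample output above) that in each record the script file name immediately precedes the StartRunTime timestamp, and it simply counts how many history records each script produced.

```python
import re
from collections import Counter

# In each history record, the SptFileName column immediately precedes the
# StartRunTime timestamp (YYYY-MM-DD hh:mm:ss.mmm). This pairing is an
# assumption based on the sample output shown above.
RECORD_RE = re.compile(r"(\S+)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

def count_scripts(history_text: str) -> Counter:
    """Count history records per script name in saved
    'display cmf-info luascript history' output."""
    counts = Counter()
    for name, _start_time in RECORD_RE.findall(history_text):
        counts[name] += 1
    return counts
```

A script that appears far more often than its peers in the same time window is a natural first candidate for deeper analysis.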
Run the display cmf-info luascript longtime process process-id command in the diagnostic view to collect information about time-consuming service scripts.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display cmf-info luascript longtime process 6
*************************************BEGIN*************************************
The run time long script from ecm:
Ecm TransNo SessionID Optype MsgType RunNum RunRetCode SptRetCode TotalMem(K) CurSptMem(K) RunTime(s) SptFileName StartRunTime FinishRunTime CancelTime TimeOut
0x80cc001d 0xf0000020 0x00000079 0x0000000b 0x00000006 0x00000015 0x00800910 0x00000000 23575 209 21 DevmInnerAction.lua 2024-07-18 08:11:46.825 2024-07-18 08:12:07.675 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf0000029 0x00000084 0x0000000b 0x00000006 0x00000016 0x00800910 0x00000000 24805 211 50 DevmInnerAction.lua 2024-07-18 08:12:31.312 2024-07-18 08:13:21.507 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf000002e 0x00000089 0x0000000b 0x00000006 0x00000016 0x00800910 0x00000000 24871 207 50 DevmInnerAction.lua 2024-07-18 08:12:31.316 2024-07-18 08:13:21.510 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf000002f 0x0000008a 0x0000000b 0x00000006 0x00000004 0x00800910 0x00000000 24942 32 44 DevmInnerAction.lua 2024-07-18 08:12:37.809 2024-07-18 08:13:21.516 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf0000013 0x00000205 0x0000000b 0x00000006 0x0000000f 0x00800910 0x00000000 44725 93 26 DevmInnerAction.lua 2024-07-18 08:16:35.229 2024-07-18 08:17:01.323 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf0000012 0x00000204 0x0000000b 0x00000006 0x0000000a 0x00800910 0x00000000 47608 2719 26 DevmInnerAction.lua 2024-07-18 08:16:35.178 2024-07-18 08:17:01.441 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0x00000133 0x0000013b 0x0000000b 0x00000006 0x00000003 0x00800910 0x00000001 49712 11 30 opsSubscribe.lua 2024-07-18 08:18:04.619 2024-07-18 08:18:34.627 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0x0001ce45 0x000008e7 0x00000008 0x00000005 0x0000002b 0x00800910 0x00000000 48062 -320 31 BoardExTestAct.lua 2024-07-18 12:07:45.452 2024-07-18 12:08:16.205 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf3000005 0x00000909 0x0000000b 0x00000006 0x000000d2 0x00800910 0x00000000 66354 -1156 30 DevmBatchInAct.lua 2024-07-18 12:11:22.576 2024-07-18 12:11:52.449 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf3000006 0x00000ab7 0x0000000b 0x00000006 0x00000085 0x00800910 0x00000000 44771 777 30 DevmBatchInAct.lua 2024-07-18 13:11:24.978 2024-07-18 13:11:54.727 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf3000007 0x00000c55 0x0000000b 0x00000006 0x0000008d 0x00800910 0x00000000 48770 847 30 DevmBatchInAct.lua 2024-07-18 14:11:27.306 2024-07-18 14:11:57.141 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
0x80cc001d 0xf3000008 0x00000dfb 0x0000000b 0x00000006 0x000000b2 0x00800910 0x00000000 46219 1207 30 DevmBatchInAct.lua 2024-07-18 15:11:29.631 2024-07-18 15:11:59.364 0000-00-00 00:00:00:000 0000-00-00 00:00:00:000
*************************************END***************************************
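To pick out the worst offenders from a saved copy of this output, the records can be ranked by the RunTime(s) column. The following is a minimal sketch, assuming (as in the sample above) that RunTime(s) is the integer immediately preceding the .lua file name in each record:

```python
import re

# RunTime(s) is the last integer before the SptFileName column in each
# record of 'display cmf-info luascript longtime' output (an assumption
# based on the sample output above).
ROW_RE = re.compile(r"(\d+)\s+([A-Za-z]\w*\.lua)")

def slowest_scripts(longtime_text: str, top: int = 5):
    """Return up to `top` (runtime_seconds, script_name) pairs, longest first."""
    rows = [(int(seconds), name) for seconds, name in ROW_RE.findall(longtime_text)]
    return sorted(rows, reverse=True)[:top]
```

Sorting by run time directly surfaces scripts such as the repeated DevmInnerAction.lua runs in the sample, which are the ones worth correlating with CPU usage spikes.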
- Based on SNMP packet statistics, check which SNMP NMSs have performed collection and which service scripts are time-consuming.
Run the display snmp-agent statistics nms command in the diagnostic view to check statistics about the NMS collection behavior.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display snmp-agent statistics nms
statistics of nms:
-----------------------------------------------------------
NMS-IP/VPN-ID (10.58.250.21/0):
    5 Messages delivered to the SNMP entity
    0 Messages which were for an unsupported version
    0 Messages which used an SNMP community name not known
    0 Messages which represented an illegal operation for the community supplied
    0 ASN.1 or BER errors in the process of decoding
    5 Messages passed from the SNMP entity
    0 SNMP PDUs which had badValue error-status
    0 SNMP PDUs which had genErr error-status
    0 SNMP PDUs which had noSuchName error-status
    0 SNMP PDUs which had tooBig error-status
    5 MIB objects retrieved successfully
    0 MIB objects altered successfully
    0 GetRequest-PDU accepted and processed
    5 GetNextRequest-PDU accepted and processed
    0 GetBulkRequest-PDU accepted and processed
    5 GetResponse-PDU sent
    0 SetRequest-PDU accepted and processed
    0 Trap-PDU sent
    0 Inform-PDU sent
    0 Inform-PDU received with no acknowledgement
    0 Inform-PDU received with acknowledgement
-----------------------------------------------------------
Total nms counts : 1
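When several NMSs poll the device, a saved copy of this output can be reduced to a per-NMS summary to see at a glance which manager generates the most requests. A minimal sketch follows; it assumes the block layout shown above and extracts only the "Messages delivered" counter:

```python
import re

# Each NMS block starts with "NMS-IP/VPN-ID (ip/vpn):" and contains a
# "Messages delivered to the SNMP entity" counter (layout as shown above).
NMS_RE = re.compile(r"NMS-IP/VPN-ID \(([^/]+)/(\d+)\):")
DELIVERED_RE = re.compile(r"(\d+)\s+Messages delivered to the SNMP entity")

def delivered_per_nms(stats_text: str) -> dict:
    """Map each NMS IP address to its 'Messages delivered' counter."""
    result = {}
    parts = NMS_RE.split(stats_text)
    # re.split with two capture groups yields [prefix, ip, vpn, body, ip, vpn, body, ...]
    for i in range(1, len(parts), 3):
        ip, body = parts[i], parts[i + 2]
        match = DELIVERED_RE.search(body)
        result[ip] = int(match.group(1)) if match else 0
    return result
```

Comparing the resulting counters across two snapshots taken a few minutes apart shows which NMS is actively polling during the high-CPU period.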
Run the display snmp-agent statistics mib command in the diagnostic view to check statistics about the NMSs' access to MIB objects yesterday and today.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display snmp-agent statistics mib
Statistics of SNMP packet:
----------------------------------------------------------------------
Date        TotalPacket   PeakPacket   TotalVB   PeakVB
----------------------------------------------------------------------
Yesterday   0             0            0         0
Today       2             0            2         0
Statistics of MIB node:
Total MIB number: 1
MIB Node: sysName
Last Operation Information:
SourceIP/Port/VPNId   : 192.168.1.1/3607/0
Operation Information : SET/1/0/0
Total Operation Information:
--------------------------------------------------------------------------------------------------------------------------------------
Operation  Total  YTDPeak  YTDAVG  TDPeak  TDAVG  MaxSnmpT(ms)  MinSnmpT(ms)  AVGSnmpT(ms)  MaxAppT(ms)  MinAppT(ms)  AVGAppT(ms)  CacheHit
--------------------------------------------------------------------------------------------------------------------------------------
SET        1      0        0       0       0      0             0             0             0            0            0            0
GET        1      0        0       0       0      10            10            10            10           10           10           0
GETNEXT    0      0        0       0       0      0             0             0             0            0            0            0
GETBULK    0      0        0       0       0      0             0             0             0            0            0            0
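A saved copy of this output can be summarized per operation type to see whether GETNEXT/GETBULK table walks dominate the load. A minimal sketch, assuming the operation rows start at the beginning of a line as in the table above:

```python
import re

# Operation rows ("SET", "GET", "GETNEXT", "GETBULK") begin a line, and
# the first number after the operation name is the Total column
# (layout as in the sample output above).
OP_ROW_RE = re.compile(r"^(SET|GETNEXT|GETBULK|GET)\s+(\d+)", re.MULTILINE)

def operation_totals(mib_stats_text: str) -> dict:
    """Map each SNMP operation type to its Total request count."""
    return {op: int(total) for op, total in OP_ROW_RE.findall(mib_stats_text)}
```

High GETNEXT or GETBULK totals typically indicate full MIB walks by the NMS, which are far more expensive than targeted GET requests.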
Run the display snmp-agent statistics mib timeout command in the diagnostic view to check statistics about MIB object request timeouts. Based on the OID of the timeout object, find the corresponding service module to analyze the cause.
<HUAWEI> system-view
[~HUAWEI] diagnose
[~HUAWEI-diagnose] display snmp-agent statistics mib timeout
SNMP timeout statistics
-----------------------------------------------------------
Current timeout config: 1 s
-----------------------------------------------------------
Timeout config: 5 s
VB number : 1
VB[0]    : 1.3.6.1.6.3.12.1.2.1
Sys time : 2017-12-14, 14:48:12:543
E2E time : 20630 ms
App time : 4294 ms
Wait time : 0 ms
NMS IP   : 0.0.0.0
REQ ID   : 113
-----------------------------------------------------------
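When many timeout records accumulate, extracting just the OID and the application processing time per record helps map timeouts to service modules. A minimal sketch over a saved copy of this output, using the field labels shown above and assuming one VB per record as in the sample:

```python
import re

# Each timeout record lists the timed-out variable binding ("VB[n] : oid")
# and an "App time" in milliseconds (labels as in the sample output above).
# Assumes one VB per record, as in the sample.
VB_RE = re.compile(r"VB\[\d+\]\s*:\s*(\S+)")
APP_TIME_RE = re.compile(r"App time\s*:\s*(\d+)\s*ms")

def timed_out_oids(timeout_text: str):
    """Pair each timed-out OID with its reported App time in milliseconds."""
    oids = VB_RE.findall(timeout_text)
    app_times = [int(t) for t in APP_TIME_RE.findall(timeout_text)]
    return list(zip(oids, app_times))
```

The resulting OID list can then be looked up against the MIB reference to find the owning service module, as described above.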
- Check the NETCONF NMSs that are performing full or incremental synchronization based on user login records.
Method 1: If the NETCONF connection is an SSH-based TCP connection, check the user login logs to determine which users have logged in and to obtain their IP addresses.
SSH/5/SSH_USER_LOGIN: The SSH user succeeded in logging in. (ServiceType=[ServiceType], UserName=[UserName], UserAddress=[UserAddress], LocalAddress=[LocalAddress], VPNInstanceName=[VPNInstanceName])
SSH/5/SSH_USER_LOGOUT: The SSH user logged out. (ServiceType=[ServiceType], LogoutReason=[LogoutReason], UserName=[UserName], UserAddress=[UserAddress], LocalAddress=[LocalAddress], VPNInstanceName=[VPNInstanceName])
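If the logs have been exported to a file, the login records can be tallied per user and source address to spot an NMS that reconnects frequently. A minimal sketch; it assumes that actual values replace the bracketed placeholders in the log template shown above:

```python
import re
from collections import Counter

# SSH_USER_LOGIN records carry UserName= and UserAddress= fields; actual
# values are assumed to replace the bracketed placeholders of the template.
LOGIN_RE = re.compile(
    r"SSH_USER_LOGIN:.*?UserName=(?P<user>[^,)]+).*?UserAddress=(?P<addr>[^,)]+)"
)

def logins_per_source(log_text: str) -> Counter:
    """Count SSH login records per (user name, source address) pair."""
    counts = Counter()
    for m in LOGIN_RE.finditer(log_text):
        counts[(m["user"].strip("[]"), m["addr"].strip("[]"))] += 1
    return counts
```

A source address with an unusually high login count during the high-CPU window points to the NMS performing the synchronization.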
Method 2: If an NMS user is performing a synchronization operation, run the display users command to view the IP address of the logged-in user. Note that NETCONF uses the NCA channel, which is different from the common VTY channel.
<HUAWEI> display users
  User-Intf    Delay    Type  Network Address     AuthenStatus    AuthorcmdFlag
+ 34  VTY 0   00:00:00  TEL   10.134.146.150          pass            yes
  Username : huawei123
  35  VTY 1   00:12:17  TEL   10.179.179.127          pass            yes
  Username : huawei123
  36  VTY 2   00:00:00  TEL   10.136.138.221        not pass           no
  Username : Unspecified
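When many users are logged in, the network addresses in a saved copy of this output can be extracted programmatically. A minimal sketch; it assumes each user row contains a VTY column, a delay in hh:mm:ss form, a connection type, and then the IPv4 address, as shown above:

```python
import re

# Each user row: "<intf>  VTY <n>  hh:mm:ss  <type>  <ipv4-address> ..."
# (column order taken from the sample output above).
USER_ROW_RE = re.compile(
    r"VTY\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\S+\s+(\d{1,3}(?:\.\d{1,3}){3})"
)

def login_addresses(users_text: str):
    """List the network addresses of logged-in users, in display order."""
    return USER_ROW_RE.findall(users_text)
```

The extracted addresses can then be matched against the known NMS addresses to identify which manager holds the active session.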
To prevent frequent data collection by the NMS from driving the CPU usage of the CMF service too high, you are advised to run the set configuration operation cpu-limit { percent-value access-type snmp | ncf-percent-value access-type netconf } command to configure a CPU usage threshold for rate limiting. When the CPU usage reaches the configured threshold, the device reduces the CPU resources allocated to NMS data collection. Note that even after CPU rate limiting is configured, the actual CPU usage can exceed the configured threshold by up to 10 percentage points. For example, if the configured threshold is 70%, the actual CPU usage ranges from 70% to 80%.
<HUAWEI> system-view
[~HUAWEI] set configuration operation cpu-limit 70 access-type snmp
Info: After the configuration is complete, relevant service process is slowed down.
[~HUAWEI] set configuration operation cpu-limit 70 access-type netconf
Info: After the configuration is complete, relevant service process is slowed down.
[*HUAWEI] commit