Troubleshooting System Crashes for Modular Switches

Introduction

An unexpected reboot of a card interrupts running services. This document helps you quickly understand how to handle such problems. Once you know the possible causes, you can take preventive measures against some of them to protect running services.

All Cards Reset

Troubleshooting Flowchart

A modular switch uses a distributed system architecture, in which each card has an independent system. The LPUs run independently and are managed by the active MPU. A failure of the active MPU in a switch will cause all LPUs to reset. If a switch has two MPUs, the standby MPU will change to the active state and take over services once the original active MPU fails. The original active MPU will become the standby MPU after an automatic reset. Therefore, reset of the active MPU will not cause all cards to reset in this case.

Figure 1-1 Troubleshooting flowchart for resets of all cards

Diagnosis and Troubleshooting Procedures

  1. Run the display device command to check the number of MPUs on the switch where a card is reset.

    If the switch has only one MPU, all LPUs of the switch are reset after the MPU is reset. To diagnose the MPU reset problem, see A Single Card Resets.

    If the switch has two MPUs, a reset of the active MPU alone does not reset all cards, so the switch may have rebooted due to a power failure.

    Check whether the external power supply system is working normally.

    Run the display logbuffer command to check the reboot records of the switch and confirm whether a power failure occurred around the reboot time displayed in the command output. Check the following:
    • Whether any manual operations caused the switch to be powered off
    • Whether any exceptions were recorded in logs of the UPS (if the switch was powered by a UPS)
    • Whether other devices in the same rack or powered by the same power supply system were powered off
    • Whether any high-power device was connected to the network at that time
    • Whether any power lines were aged or loose
    • Whether the input voltage measured using a multimeter was in the normal range

    If any of the preceding situations exists, take measures to fix the problem of the external power supply system.

  2. If the external power supply system works properly, run the display alarm all command to check whether there are alarms about power modules of the switch.

    Common power module alarms include:

    • Power is invalid for not support: An incompatible power module is installed on the switch.
    • PWR_LACK and SWITCH_STAT sensor alarms for the same power module: The power module is present but has no power cable connected or its power switch is in OFF position.
    • PWR_FAULT sensor alarm for a power module: The power module is experiencing a fan failure, output overvoltage, external short circuit, output failure, or input failure.

  3. If possible, move the problematic power module to another power slot or replace it with another power module to check whether the power module is faulty.
  4. If the power module is not faulty, go to section Contacting Technical Support Personnel.
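
The procedure above can be sketched as a small decision function. This is only an illustrative sketch: the function name and its boolean inputs are hypothetical, and their values have to come from the manual checks and command output described in the steps.

```python
def next_step_all_cards_reset(mpu_count, external_power_ok,
                              power_alarm_present, module_faulty_after_swap):
    """Return the next troubleshooting step when all cards were reset.

    All parameters are hypothetical inputs gathered by the operator:
    mpu_count comes from "display device", power_alarm_present from
    "display alarm all", and module_faulty_after_swap from moving or
    replacing the suspect power module.
    """
    if mpu_count == 1:
        # With a single MPU, an MPU reset resets every LPU.
        return "Diagnose the MPU reset (see A Single Card Resets)"
    if not external_power_ok:
        return "Fix the external power supply system"
    if power_alarm_present and module_faulty_after_swap:
        return "Replace the faulty power module"
    return "Contact technical support personnel"


print(next_step_all_cards_reset(2, True, False, False))
```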

A Single Card Resets

Troubleshooting Flowchart

Figure 1-2 Troubleshooting flowchart for reset of a single card

Diagnosis and Troubleshooting Procedures

Checking the Switch Model and Version

  1. Run the display device command to check the switch model and status of each module on the switch.

    <HUAWEI> display device
    S9706's Device status:                                                          
    Slot  Sub Type         Online    Power      Register       Status     Role      
    ------------------------------------------------------------------------------- 
    1     -   EH1D2X12SSA0 Present   PowerOn    Registered     Normal     NA        
    4     -   -            Present   PowerOn    Unregistered   -          NA        
    7     -   EH1D2SRUDC00 Present   PowerOn    Registered     Normal     Master    
    PWR1  -   -            Present   -          Unregistered   -          NA        
    PWR2  -   -            Present   PowerOn    Registered     Normal     NA        
    CMU1  -   EH1D200CMU00 Present   PowerOn    Registered     Normal     Master    
    FAN1  -   -            Present   PowerOn    Registered     Normal     NA        
    FAN2  -   -            Present   PowerOn    Registered     Normal     NA        

    The command output shows that the switch model is S9706 and lists the status of each card, power module, and fan module. A module that is working properly displays Normal in the Status field.

  2. Run the display version command to check the software version running on the switch.

    <HUAWEI> display version 
    Huawei Versatile Routing Platform Software                                      
    VRP (R) software, Version 5.160 (S9700 V200R008C00SPC300)                       
    Copyright (C) 2000-2016 HUAWEI TECH CO., LTD                                    
    Quidway S9706 Terabit Routing Switch uptime is 0 week, 3 days, 18 hours, 31 minu
    tes                                                                

    The command output shows that the software version is V200R008C00.
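
If the command output has been saved as text, the two checks above can be automated. The following Python sketch assumes the column layout shown in the sample display device output (eight whitespace-separated fields per row) and the version string format shown in the sample display version output. The function names are hypothetical, and other models or software versions may print different layouts.

```python
import re

SAMPLE_DEVICE = """\
S9706's Device status:
Slot  Sub Type         Online    Power      Register       Status     Role
-------------------------------------------------------------------------------
1     -   EH1D2X12SSA0 Present   PowerOn    Registered     Normal     NA
4     -   -            Present   PowerOn    Unregistered   -          NA
7     -   EH1D2SRUDC00 Present   PowerOn    Registered     Normal     Master
PWR1  -   -            Present   -          Unregistered   -          NA
PWR2  -   -            Present   PowerOn    Registered     Normal     NA
"""

SAMPLE_VERSION = "VRP (R) software, Version 5.160 (S9700 V200R008C00SPC300)"

KEYS = ("slot", "sub", "type", "online", "power", "register", "status", "role")


def parse_display_device(output):
    """Parse the rows that follow the dashed separator line."""
    rows, in_table = [], False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            in_table = True
            continue
        fields = stripped.split()
        if in_table and len(fields) == len(KEYS):
            rows.append(dict(zip(KEYS, fields)))
    return rows


def abnormal_slots(rows):
    """Return slots that are not registered or whose Status is not Normal."""
    return [r["slot"] for r in rows
            if r["register"] != "Registered" or r["status"] != "Normal"]


def software_version(output):
    """Extract the product series and V/R/C version, or None if not found."""
    match = re.search(r"\((\S+) (V\d+R\d+C\d+)(SPC\d+)?\)", output)
    return (match.group(1), match.group(2)) if match else None


print(abnormal_slots(parse_display_device(SAMPLE_DEVICE)))
print(software_version(SAMPLE_VERSION))
```

Slots reported by abnormal_slots are only candidates for further checks; an unregistered slot may simply hold a card that is still starting up.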

Checking the Cause of a Card Reset

  1. Run the display reset-reason command to view reset information about all cards.

    <HUAWEI> display reset-reason
    The LPU frame[1] board[1] has no reset records.
    The LPU frame[1] board[2] has no reset records.
    The LPU frame[1] board[3]'s reset total 1, detailed information:
    --  1. 2012/03/13   19:58:15, Reset No.: 1
           Reason: Check mod information fail
    The MPU frame[1] board[4] has no reset records.
    The MPU frame[1] board[5]'s reset total 967, detailed information:
    --  1. 2012/03/20   13:07:52, Reset No.: 967
           Reason: Warm reset board for no receiving message in a long time
    --  2. 2012/03/20   12:57:52, Reset No.: 966
           Reason: Warm reset board for no receiving message in a long time
    --  3. 2012/03/20   12:47:52, Reset No.: 965
           Reason: Warm reset board for no receiving message in a long time
    --  4. 2012/03/20   12:37:52, Reset No.: 964
           Reason: Warm reset board for no receiving message in a long time
    --  5. 2012/03/20   12:27:52, Reset No.: 963
           Reason: Warm reset board for no receiving message in a long time

    Alternatively, run the display reset-reason slot ID command to view reset information of the card in a specified slot. ID specifies the slot ID.

    Table 1-1 Description of the display reset-reason command output

    Item                  Description
    --------------------  ----------------------------------
    LPU/MPU               LPU or MPU.
    frame                 Chassis ID of the card.
    board                 Slot ID of the card.
    reset total           Number of times the card is reset.
    detailed information  Detailed reset information.
    Reset No.             Reset sequence number.
    Reason                Reset cause.

  2. Analyze the cause of the reset and take corresponding measures. Table 1-2 describes the causes that may be displayed in the display reset-reason command output and provides the handling methods.

    Table 1-2 Reset causes and handling methods

    Reset causes that share a handling method are grouped. The handling method at the end of a group applies to every reset cause in that group.

    User operations

      Reset cause:     Reset by user command
                       Power off by user command
                       VRP reset selfboard because of command
                       Reset board by vrp cmd
                       Reset board by snmp
                       Reset for rollback
      Description:     A user has reset the card using the command line interface (CLI) or network management system (NMS).
      Handling method: Check whether any user with the reset privilege has reset the card.

      Reset cause:     The demo time of license is overtime
      Description:     The temporary license loaded on the card has expired.
      Handling method: Obtain a license from Huawei.

    System loading

      Reset cause:     Reset for load
      Description:     During a software upgrade, an LPU is reset after the system software is loaded.
      Handling method: This is a normal reset and requires no action.

      Reset cause:     Reset for lpu resource-mode disaccord with mpu
      Description:     The resource mode configured on an LPU does not match that of the MPU.
      Handling method: This is a normal reset and requires no action.

      Reset cause:     Reset for the LPU patch file or module does not match that on the MPU
      Description:     The patch package or plug-in specified for an LPU is different from that of the MPU.
      Handling method: After the LPU is registered, delete the patch package or plug-in, and then load the correct one.

      Reset cause:     Reset for initializing the board's status by IFNET
      Description:     An LPU's status is initialized after an active/standby switchover. If the LPU configuration is not recovered after a switchover, it cannot communicate with other cards.
      Handling method: It is a normal condition if the LPU works normally after a switchover.

      Reset cause:     Reset slave board for memsize too little
      Description:     The memory size of the standby MPU is smaller than that of the active MPU.
      Handling method: Check the memory size of the standby MPU. If its memory size is smaller than that of the active MPU, replace the standby MPU.

      Reset cause:     Reset for slave board's card statement disaccord with master's
      Description:     Only one of the MPUs has a subcard (such as an FSU) installed.
      Handling method: Install the same subcard on the other MPU or remove the current subcard to ensure that the two MPUs have the same subcard configuration.

      Reset cause:     Reset for patch load
      Description:     An LPU is reset after a patch is loaded.
      Handling method: It is a normal condition if a patch is loaded during an LPU startup.

      Reset cause:     Reset for patch get state fail
                       Reset for patch load file fail
                       Reset for patch synchronize file fail
                       Reset for patch state compare fail
      Description:     The patch fails to be loaded to a card.
      Handling method: It is normal for such resets to occur one or two times during a system startup. If such resets occur multiple times, go to section Contacting Technical Support.

    Software exceptions

      Reset cause:     VRP reset selfboard because of find deadloop
      Description:     A deadloop is detected.
      Handling method: Check alarms and logs on the switch to locate the problem.

      Reset cause:     VRP reset selfboard because of find exception
      Description:     A software exception is detected.
      Handling method: Contact technical support personnel.

      Reset cause:     Board reset by VRP for schedule
      Description:     A congestion occurs.
      Handling method: Check alarms and logs on the switch to locate the problem.

      Reset cause:     VRP reset selfboard because of no memory
                       Reset for memory use out
      Description:     The memory has been used up.
      Handling method: Check whether the memory usage is high. Check alarms and logs on the switch to locate the problem.

    Device management

      Reset cause:     Reset for no receiving mpu's heart
      Description:     An LPU does not receive heartbeat packets from the MPU in 40 seconds.
      Reset cause:     Reset for no heart
      Description:     The MPU did not receive heartbeat packets from an LPU in 30 seconds.
      Handling method: See Checking Whether the Card Reset Is Caused by Insecure Installation.

      Reset cause:     Reset for not receiving register ack from mpu
      Description:     An LPU is registered 20 times but does not receive registration response packets from the MPU.
      Reset cause:     Reset for state not stable
      Description:     The MPU's communication with an LPU is interrupted intermittently.
      Reset cause:     Warm reset board for no register in a long time
      Description:     An LPU fails to be registered in 30 minutes.
      Reset cause:     Warm reset board for no receiving message in a long time
      Description:     The MPU does not receive any packet from an LPU in 10 minutes.
      Reset cause:     Cold reset board for no receiving message in a long time
      Description:     The MPU does not receive any packet from an LPU in 20 minutes.
      Reset cause:     Cold reset board for CPU is not active
      Description:     The MPU detects that the CPU of an LPU does not work.
      Reset cause:     Power off the board because of reset three times continuously
      Description:     A card is reset three times during a startup. A card will be power cycled after three warm start failures.
      Handling method: The inter-card communication fails. See Checking Whether the Card Reset Is Caused by Insecure Installation.

      Reset cause:     Reset for unregister but receive heartbeat info
      Description:     The MPU receives heartbeat packets from an unregistered card.
      Handling method: Check alarms and logs on the switch to locate the problem.

      Reset cause:     Reset for slave board class disaccord with mpu
      Description:     The active and standby MPUs are of different types.
      Handling method: Check the types of the active and standby MPUs and replace one of them to ensure that the switch uses MPUs of the same type.

      Reset cause:     Reset for lpu or slave version disaccord with mpu
      Description:     The startup version of a card differs from that of the MPU.
      Handling method: If the reset card is the standby MPU, check the versions of the active and standby MPUs. If the two MPUs run V100R002 and V100R003 respectively, the standby MPU will be reset because the two versions do not support automatic synchronization. If the reset card is an LPU, go to section Contacting Technical Support.

      Reset cause:     Reset for no receiving master cpu's heart
      Description:     A VASP card is reset because the main core in its CPU does not receive heartbeat packets from the sub-core in 60 seconds.
      Handling method: Contact technical support personnel.

    Hardware components

      Reset cause:     Reset for selftest fail
      Description:     A card's self-check fails.
      Reset cause:     Reset for CPLD self-test fail
      Description:     The CPLD self-check fails.
      Reset cause:     Reset selfboard because of initialize fsu fail
      Description:     The FSU fails to be initialized.
      Reset cause:     reset for fpga load failed
      Description:     The FPGA fails to be loaded.
      Reset cause:     Reset for fpga in abnormal state
      Description:     The FPGA status is abnormal.
      Reset cause:     Reset for lanswitch chip parity error
      Description:     An error occurs during an LSW circuit parity check.
      Handling method: Reinstall the card or install it into another slot, and then check whether it works normally. If the fault persists, the card is faulty.

      Reset cause:     Reset for FSU card type mismatch
      Description:     The FSU does not match the chassis.
      Handling method: Replace the FSU with a matching one. If the problem cannot be fixed, go to section Contacting Technical Support.

      Reset cause:     Board reset by ISIS for purging LSP error
      Description:     An error occurs when link state packets (LSPs) are cleared.
      Handling method: It is normal for such resets to occur one or two times during a system startup. If such resets occur multiple times, go to section Contacting Technical Support.

    CSS

      Reset cause:     Reset for frame combine
      Description:     Two chassis are merged.
      Reset cause:     Reset for frame split
      Description:     The CSS is split.
      Reset cause:     Reset for fsp
      Description:     The CSS is reset.
      Reset cause:     Reset for one frame register, but the board is not register
      Description:     A card is not registered during a chassis registration.
      Reset cause:     Reset for slave to master in slave frame, but self is not register
      Description:     On the standby switch, the standby MPU becomes the active MPU before the reset card is registered.
      Reset cause:     Reset for slave to master in master frame, but self is not register
      Description:     On the master switch, the standby MPU becomes the active MPU before the reset card is registered.
      Reset cause:     Reset by switchover command from system master chassis
      Description:     The switchover command is executed in the CSS.
      Reset cause:     Reset by command from other chassis
      Description:     The reset command is executed on the other member switch of the CSS.
      Reset cause:     Reset board after syn version
      Description:     A card is reset after version synchronization.
      Reset cause:     Reset board for Peer frame is in CSS force master status
      Description:     The other switch is forcibly specified as the master switch.
      Handling method: These are normal conditions and require no action.

      Reset cause:     Reset for fpga state disaccord with system master
      Description:     The SRU hardware engine function is enabled on a switch using SRUD that sets up a CSS with a switch using SRUC.
      Handling method: Run the undo detect-engine enable command to disable the SRU hardware engine function, reboot the switch for the configuration to take effect, and then perform the CSS configuration again.

    Device self-healing

      Reset cause:     Reset selfboard for ecm channel switch
      Description:     The ECM channel is faulty.
      Reset cause:     Reset for an entry check error in chip
      Description:     A major fault occurs in chip entries.
      Reset cause:     Reset for CSS chip failure
      Description:     The CSS chip on the MPU is faulty.
      Reset cause:     Reset for all HG down
      Description:     All internal interconnection ports on the MPU are faulty.
      Reset cause:     Reset for critical task has not been scheduled for long time
      Description:     A key task on the device cannot be scheduled for a long time.
      Handling method: Contact technical support personnel.
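
When a card has many reset records, it can help to parse the display reset-reason output and look at how the resets are spaced. The Python sketch below assumes the output format shown in the sample above; the function names are hypothetical. In the sample, the MPU in slot 5 is reset every 10 minutes, which is consistent with the 10-minute timeout described for "Warm reset board for no receiving message in a long time".

```python
import re
from datetime import datetime

SAMPLE_RESET = """\
The LPU frame[1] board[1] has no reset records.
The LPU frame[1] board[3]'s reset total 1, detailed information:
--  1. 2012/03/13   19:58:15, Reset No.: 1
       Reason: Check mod information fail
The MPU frame[1] board[5]'s reset total 967, detailed information:
--  1. 2012/03/20   13:07:52, Reset No.: 967
       Reason: Warm reset board for no receiving message in a long time
--  2. 2012/03/20   12:57:52, Reset No.: 966
       Reason: Warm reset board for no receiving message in a long time
--  3. 2012/03/20   12:47:52, Reset No.: 965
       Reason: Warm reset board for no receiving message in a long time
"""

CARD_RE = re.compile(r"The (LPU|MPU) frame\[(\d+)\] board\[(\d+)\]")
ENTRY_RE = re.compile(
    r"--\s+\d+\.\s+(\d{4}/\d{2}/\d{2})\s+(\d{2}:\d{2}:\d{2}), Reset No\.: (\d+)")
REASON_RE = re.compile(r"Reason:\s*(.+)")


def parse_reset_reason(output):
    """Map (card type, frame, board) to a list of reset records."""
    cards, current = {}, None
    for line in output.splitlines():
        card = CARD_RE.search(line)
        if card:
            current = (card.group(1), int(card.group(2)), int(card.group(3)))
            cards[current] = []
            continue
        entry = ENTRY_RE.search(line)
        if entry and current is not None:
            when = datetime.strptime(
                entry.group(1) + " " + entry.group(2), "%Y/%m/%d %H:%M:%S")
            cards[current].append(
                {"time": when, "number": int(entry.group(3)), "reason": None})
            continue
        reason = REASON_RE.search(line)
        if reason and current is not None and cards[current]:
            cards[current][-1]["reason"] = reason.group(1).strip()
    return cards


def reset_intervals_minutes(records):
    """Return the gaps, in minutes, between consecutive resets."""
    times = sorted(r["time"] for r in records)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


cards = parse_reset_reason(SAMPLE_RESET)
print(reset_intervals_minutes(cards[("MPU", 1, 5)]))
```

A constant interval only hints at a timeout-driven reset loop; the reset cause still has to be looked up in Table 1-2.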

Checking Alarms

Methods of Checking Alarms on a Switch

If a switch fails or cannot operate normally because the environmental conditions do not meet operating requirements, it will generate alarm messages depending on the type of the problem.

You can use either of the following methods to view alarm messages:

  • Log in to the network management system (for example, eSight) to view alarm messages.
  • Run the display trapbuffer command on the CLI of the switch to view alarm messages in the trap buffer.

If you specify the optional size value parameter (display trapbuffer size value), value sets the maximum number of alarm messages that can be displayed in the command output. If the actual number of alarm messages is smaller than the specified value, all the available alarm messages are displayed.

<HUAWEI> display trapbuffer
Trapping buffer configuration and contents : enabled                            
Allowed max buffer size : 1024                                                  
Actual buffer size : 256                                                        
Channel number : 3 , Channel name : trapbuffer                                  
Dropped messages : 0                                                            
Overwritten messages : 6248                                                     
Current messages : 256                                              
#Sep 19 2012 04:38:03+08:00 HUAWEI DS/4/DATASYNC_CFGCHANGE:OID 1.3.6.1.4.1.2011
.5.25.191.3.1 configurations have been changed. The current change number is 8, 
the change loop count is 0, and the maximum number of records is 4095.          
#Sep 19 2012 04:37:39+08:00 HUAWEI LINE/5/VTYUSERLOGIN:OID 1.3.6.1.4.1.2011.5.2
5.207.2.2 A user login. (UserIndex=34, UserName=VTY, UserIP=10.135.18.114, UserC
hannel=VTY0)                                                         

You can also use the following commands to check specific types of alarm messages:
  • display alarm all: displays all alarms on the switch.
  • display alarm active: displays alarms that have not been cleared after the switch starts.
  • display alarm history: displays historical alarms recorded after the switch starts.
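
Trap messages in the display trapbuffer output are wrapped at the terminal width, which can split an alarm ID (OID) across two lines. The Python sketch below rejoins wrapped lines and extracts the alarm name, severity, and OID. It assumes that every trap starts with "#" as in the sample above; the function names are hypothetical. Because trailing spaces are stripped before joining, a space that fell exactly on a wrap point is lost in the message text, although the OID is unaffected.

```python
import re

SAMPLE_TRAPS = """\
Current messages : 256
#Sep 19 2012 04:38:03+08:00 HUAWEI DS/4/DATASYNC_CFGCHANGE:OID 1.3.6.1.4.1.2011
.5.25.191.3.1 configurations have been changed. The current change number is 8,
the change loop count is 0, and the maximum number of records is 4095.
#Sep 19 2012 04:37:39+08:00 HUAWEI LINE/5/VTYUSERLOGIN:OID 1.3.6.1.4.1.2011.5.2
5.207.2.2 A user login. (UserIndex=34, UserName=VTY, UserIP=10.135.18.114, UserC
hannel=VTY0)
"""

HEADER_RE = re.compile(
    r"^#(?P<time>\w{3}\s+\d{1,2} \d{4} \d{2}:\d{2}:\d{2})(?P<tz>[+-]\d{2}:\d{2})? "
    r"(?P<host>\S+) (?P<module>\w+)/(?P<severity>\d)/(?P<name>\w+):OID (?P<oid>[\d.]+)"
)


def join_wrapped_traps(output):
    """Rejoin wrapped lines so that each trap is one string."""
    entries = []
    for line in output.splitlines():
        text = line.rstrip()
        if text.startswith("#"):
            entries.append(text)
        elif entries and text:
            entries[-1] += text  # continuation of the previous trap
    return entries


def parse_traps(output):
    """Return one dict per trap with time, host, module, severity, name, oid."""
    traps = []
    for entry in join_wrapped_traps(output):
        match = HEADER_RE.match(entry)
        if match:
            traps.append(match.groupdict())
    return traps


for trap in parse_traps(SAMPLE_TRAPS):
    print(trap["module"], trap["severity"], trap["name"], trap["oid"])
```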

Common Alarms About Card Resets and Handling Methods

Table 1-3 Common alarms about card resets and handling methods

Alarms that share possible causes or a handling method are grouped. The fields at the end of a group apply to every alarm in that group.

  Alarm:           BASETRAP/4/ENTITYREMOVE
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.1.1
  Message:         Physical entity is removed
  Description:     A physical entity (such as a card, subcard, power module, fan module, or optical module) was removed.
  Possible causes: The physical entity (such as a card, subcard, fan module, or optical module) was removed.
  Handling method: Check whether the physical entity removal is a normal operation.

  Alarm:           ENTITYTRAP/4/BOARDREMOVE
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.1
  Message:         Board has been removed
  Description:     A card was removed.
  Possible causes: The card was removed.
  Handling method:
    • Check whether the card was removed manually.
    • Check whether the card was properly installed.

  Alarm:           ENTITYTRAP/4/POWERREMOVE
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.5.1
  Message:         Power is absent
  Description:     The power module was removed.
  Possible causes: The power module was removed.
  Handling method: Check whether the power module was manually removed. If the alarm persists with the power module securely installed, replace the power module.

  Alarm:           Entitytrap/1/POWERINVALID
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.5.5
  Message:         Power supply is unavailable for some reason
  Description:     The switch experienced a complete failure of power supply.
  Alarm:           BASETRAP/1/POWEROFF
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.3.1
  Message:         The power supply is off
  Description:     The power supply was cut off.
  Possible causes:
    • The power was switched off.
    • The switch was not connected to an external power source.
    • The input voltage was out of the range required for operation of the switch.
  Handling method: See Checking Whether the Card Reset Is Caused by a Power Exception.

  Alarm:           BASETRAP/1/VOLTRISING
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.2.9
  Message:         Voltage exceeded the upper pre-alarm limit
  Description:     The voltage exceeded the upper threshold.
  Alarm:           ENTITYTRAP/1/ENTITYVOLTALARM (error code: 141056)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.10.5
  Message:         Voltage of power rise over or fall below the alarm threshold
  Description:     The voltage of power modules was too high.
  Possible causes:
    • The external power supply was unstable.
    • The power modules failed.
    • The card was faulty.
  Handling method: See Checking Whether the Card Reset Is Caused by a Power Exception.

  Alarm:           BASETRAP/1/VOLTFALLING
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.2.11
  Message:         Voltage has fallen below the lower pre-alarm limit
  Description:     The voltage fell below the lower threshold.
  Alarm:           ENTITYTRAP/1/ENTITYVOLTALARM (error code: 141057)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.10.5
  Message:         Voltage of power rise over or fall below the alarm threshold
  Description:     The voltage of power modules was too low.
  Possible causes:
    • The external power supply was unstable.
    • The power modules failed.
    • The power provided to the switch was insufficient because there were not enough power modules on the switch.
    • The card was faulty.
  Handling method: See Checking Whether the Card Reset Is Caused by a Power Exception.

  Alarm:           ENTITYTRAP/1/ENTITYBRDTEMPALARM (error code: 140544)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.10.13
  Message:         Temperature rise over or fall below the warning alarm threshold
  Description:     The equipment temperature exceeded the alarm threshold.
  Alarm:           BASETRAP/3/TEMRISING
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.2.1
  Message:         Temperature exceeded the upper pre-alarm limit
  Description:     The temperature sensor of an entity (card or subcard) detected that the entity temperature exceeded the upper threshold.
  Possible causes:
    • Heat could not be exhausted from the switch quickly.
    • The air filter was blocked by dust.
    • Vacant slots were not covered with filler panels.
    • The environment temperature was too high.
    • There were not enough fan modules on the switch.
    • Fan modules on the switch were faulty.
  Handling method: See Checking Whether the Card Reset Is Caused by High Temperature or Failure of Fans.

  Alarm:           ENTITYTRAP/4/FANREMOVE
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.6.1
  Message:         Fan has been removed
  Description:     No fan module was detected.
  Possible causes: The fan module was removed or was not securely installed.
  Handling method: See Checking Whether the Card Reset Is Caused by High Temperature or Failure of Fans.

  Alarm:           ENTITYTRAP/1/FANINVALID (error code: 139264)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.6.5
  Message:         Fan is invalid
  Description:     A fan module experienced a complete failure.
  Possible causes: The fan module experienced a hardware failure.
  Handling method: Remove and reinstall the fan module properly. If the fan module is securely installed, its indicator will blink green slowly. If the alarm persists, replace the fan module.

  Alarm:           ENTITYTRAP/1/FANINVALID (error code: 139266)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.6.5
  Message:         Fan is invalid
  Description:     A fan module experienced a complete failure.
  Possible causes: The fan module did not match the chassis model.
  Handling method: Identify the fan module that triggered the alarm according to the alarm information. Run the display elabel command to check the electronic label of the fan module and determine whether the fan module matches the chassis model. If not, replace it.

  Alarm:           ENTITYTRAP/4/ENTITYCPUALARM
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.14.1
  Message:         CPU utilization exceeded the pre-alarm threshold
  Description:     The CPU usage of the switch exceeded the alarm threshold.
  Alarm:           BASETRAP/2/CPUUSAGERISING
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.4.1
  Message:         CPU utilization exceeded the pre-alarm threshold
  Description:     The CPU usage exceeded the threshold.
  Possible causes:
    • The CPU usage alarm threshold was too low.
    • The switch was providing too many services.
    • The switch was undergoing an attack.
  Handling method: For troubleshooting methods, see Troubleshooting: High CPU Usage.

  Alarm:           ENTITYTRAP/4/ENTITYMEMORYALARM
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.15.1
  Message:         Memory usage exceeded the threshold, and it may cause the system to reboot
  Description:     The memory usage of the switch exceeded the threshold.
  Possible causes: The memory usage of the switch exceeded the threshold.
  Handling method: Processing some services may cause high memory usage for a period of time. Typically, the memory usage returns to the normal range some time later.

  Alarm:           Entitytrap/1/BOARDINVALID (error code: 132625)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.5
  Message:         Board is invalid for some reason
  Description:     A card experienced a complete failure.
  Possible causes: The LSW chip of the card was faulty.
  Alarm:           Entitytrap/1/BOARDINVALID (error code: 132632)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.5
  Message:         Board is invalid for some reason
  Description:     A card experienced a complete failure.
  Possible causes: The Peripheral Component Interconnect (PCI) bus failed.
  Alarm:           ENTITYTRAP/2/BOARDFAIL (error code: 132124)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.3
  Message:         Board become failure for some reason
  Description:     The card experienced a partial failure.
  Possible causes: The I2C bus was faulty.
  Alarm:           ENTITYTRAP/2/BOARDFAIL (error code: 132127)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.3
  Message:         Board become failure for some reason
  Description:     The card experienced a partial failure.
  Possible causes: The clock of the card was faulty.
  Alarm:           ENTITYTRAP/2/BOARDFAIL (error code: 132128)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.3
  Message:         Board become failure for some reason
  Description:     The card experienced a partial failure.
  Possible causes: The Phase-Locked Loop (PLL) on the card was faulty.
  Alarm:           ENTITYTRAP/2/BOARDFAIL (error code: 132131)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.3
  Message:         Board become failure for some reason
  Description:     The card experienced a partial failure.
  Possible causes: The Digital Signal Processor (DSP) was faulty.
  Alarm:           ENTITYTRAP/2/BOARDFAIL (error code: 132137)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.3
  Message:         Board become failure for some reason
  Description:     The card experienced a partial failure.
  Possible causes: A chip (TCAM, PIC, CPLD, RTC, EEPROM, or temperature chip) was faulty.
  Handling method: If the alarm persists after the card is reset, replace the card. If the alarm is still not cleared, go to section Contacting Technical Support.

  Alarm:           ENTITYTRAP/2/BOARDFAIL (error code: 132171)
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.2.3
  Message:         Board become failure for some reason
  Description:     The card experienced a partial failure.
  Possible causes: The ambient temperature exceeded 45°C.
  Handling method: Lower the ambient temperature.

  Alarm:           ASMNG/4/ASSLOTIDINVALID
  Alarm ID:        1.3.6.1.4.1.2011.5.25.327.31.2.2.21
  Message:         The new member of the AS has an invalid slot ID
  Description:     The new member switch in an AS stack system had a stack ID greater than 4.
  Possible causes: In an SVF system, the new member of an AS stack system had a stack ID greater than 4.
  Handling method: Set the stack ID of this member switch to 4 or smaller.

  Alarm:           BASETRAP/4/ENTITYRESET
  Alarm ID:        1.3.6.1.4.1.2011.5.25.129.2.1.5
  Message:         Physical entity is reset
  Description:     A card was reset.
  Possible causes: The card did not work normally.
  Handling method: Check whether another alarm was generated indicating that the card was installed or removed.

  Alarm:           ENTITYTRAP/3/OPTICALINVALID
  Alarm ID:        1.3.6.1.4.1.2011.5.25.219.2.4.5
  Message:         Optical Module is invalid
  Description:     A non-Huawei-certified optical module was installed.
  Possible causes: The optical module was not a Huawei-certified one. A non-Huawei-certified optical module might have high current or power, which might cause a reset of the card. This alarm might also be reported for Huawei optical modules delivered before July 1, 2013, because vendor information of these optical modules was not recorded.
  Handling method: If the optical module is not a Huawei-certified one, replace it with a Huawei-certified optical module. If the optical module is an early Huawei module (delivered before July 1, 2013), run the transceiver phony-alarm-disable command to disable the alarm function for non-Huawei-certified optical modules.

  Alarm:           ENTITYEXTTRAP/2/VERSIONINCOMPATIBLE
  Alarm ID:        1.3.6.1.4.1.2011.5.25.31.2.2.1
  Message:         The board software version is incompatible with MPU
  Description:     The version running on an SPU was incompatible with that on the MPU.
  Possible causes: The startup software version on the SPU was incompatible with the software version of the MPU.
  Handling method: Load a compatible software package to the SPU.

NOTE:

The following tips will help you quickly find the reference information for a specific alarm:

  • An alarm ID uniquely identifies an alarm. You can search for the ID of an alarm in the Alarm Reference to find the description of the alarm and the handling methods.
  • Alarms with the same ID but triggered by different causes are identified by different error codes (for example, BaseTrapProbableCause). You can search for the error code in the Alarm Reference.
  • You can also use the information query tool to query alarm information.

Do not search for alarms based on variables, such as alarm generation time, interface number, process ID, and device name.
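
Because alarms with the same ID are distinguished by error code, a lookup keyed by alarm name and error code avoids confusing one cause with another. The Python sketch below uses a few entries taken from Table 1-3; the dictionary and function names are hypothetical, and the table is not a complete list of error codes.

```python
# (alarm name, error code) -> possible cause, copied from Table 1-3.
CAUSES_BY_ERROR_CODE = {
    ("ENTITYTRAP/1/FANINVALID", 139264): "The fan module experienced a hardware failure.",
    ("ENTITYTRAP/1/FANINVALID", 139266): "The fan module did not match the chassis model.",
    ("ENTITYTRAP/1/BOARDINVALID", 132625): "The LSW chip of the card was faulty.",
    ("ENTITYTRAP/1/BOARDINVALID", 132632): "The Peripheral Component Interconnect (PCI) bus failed.",
    ("ENTITYTRAP/2/BOARDFAIL", 132124): "The I2C bus was faulty.",
    ("ENTITYTRAP/2/BOARDFAIL", 132127): "The clock of the card was faulty.",
    ("ENTITYTRAP/2/BOARDFAIL", 132128): "The Phase-Locked Loop (PLL) on the card was faulty.",
    ("ENTITYTRAP/2/BOARDFAIL", 132131): "The Digital Signal Processor (DSP) was faulty.",
    ("ENTITYTRAP/2/BOARDFAIL", 132171): "The ambient temperature exceeded 45°C.",
}


def lookup_cause(alarm, error_code):
    """Return the possible cause for an alarm and error code, or None.

    The alarm name is upper-cased so that "Entitytrap/1/BOARDINVALID"
    and "ENTITYTRAP/1/BOARDINVALID" resolve to the same entry.
    """
    return CAUSES_BY_ERROR_CODE.get((alarm.upper(), error_code))


print(lookup_cause("Entitytrap/1/BOARDINVALID", 132632))
```

A result of None means only that the pair is not in this small table; the Alarm Reference remains the authoritative source.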

Checking Switch Appearance and Environment

If a card is reset because its communication with the MPU is interrupted or power, fan, or temperature alarms are generated, check the switch appearance and environment to locate the fault.

Checking Whether the Card Reset Is Caused by Insecure Installation

If the cause of a card reset is heartbeat loss or communication failure with the MPU, the card may not be securely installed in the slot.

  1. Check whether the reset card and the MPU are securely installed.
  2. Remove the reset card and check whether any pins on its connector are bent.
  3. Install the card in another slot or replace it with a new card to determine whether the card or chassis is faulty.
  4. If the fault cannot be located, go to section Contacting Technical Support.

Checking Whether the Card Reset Is Caused by a Power Exception

  1. Determine whether a power failure has occurred around the reboot time according to logs. Check the following:

    • Whether any manual operations caused the switch to be powered off
    • Whether any exceptions were recorded in logs of the UPS (if the switch was powered by a UPS)
    • Whether other devices in the same rack or powered by the same power supply system were powered off at that time
    • Whether any high-power device was connected to the network at that time
    • Whether any power lines were aged or loose
    • Whether the input voltage measured using a multimeter was in the normal range

    If any of the preceding situations exists, take measures to fix the problem of the external power supply system.

  2. If the external power supply is operational, check whether power modules of the switch are faulty. Check whether any power module is removed or loose. If possible, move the problematic power module to another power slot or replace it with another power module to check whether the power module is faulty.
  3. If the switch is faulty, go to section Contacting Technical Support.

Checking Whether the Card Reset Is Caused by High Temperature or Failure of Fans

  1. Check whether the operating temperature is in the normal range (typically 0°C to 45°C). If the temperature is too high, lower the temperature in the equipment room.
  2. Check whether the heat dissipation system of the device is normal. Check the air intake vents, air exhaust vents, fan modules, and air filter to ensure the following points:

    • The air intake vents (on the front and left sides of the chassis) and air exhaust vents (on the rear side of the chassis) are not blocked. The cabinet must have side panels to isolate the chassis from devices in other cabinets. If there are obstacles nearby affecting cooling of the switch, remove the obstacles and check whether the device temperature drops to the normal range.
    • Fan modules are running properly. Check whether any fan trays are removed or loose, and whether air is exhausted from fan modules.
    • The air filter is clean and not blocked so that air can enter the chassis. If the air filter is blocked, clean or replace it.

  3. If any fan modules are faulty, replace them.
  4. If the fault cannot be located, go to section Contacting Technical Support.

Checking Logs

If the procedures described in the preceding sections cannot locate the cause of the card reset, check logs on the switch for further analysis.

How to Check Logs on a Switch

The log module of the system software logs events that occur during system operation. Logs provide reference information for system diagnosis and maintenance, and help you check the equipment running status, analyze network conditions, and locate faults.

To check logs on a switch, log in to the switch through the console port or using Telnet, and then run the display logbuffer command. You can also save log information on the switch and use the syslog protocol to export logs to a log server.

# Run the display logbuffer command to check all logs in the log buffer.

<HUAWEI> display logbuffer
Logging buffer configuration and contents : enabled                  
Allowed max buffer size : 1024                                       
Actual buffer size : 512                                             
Channel number : 4 , Channel name : logbuffer                        
Dropped messages : 0                                                 
Overwritten messages : 0                                             
Current messages : 43                                                

Oct 16 2013 06:06:48 HUAWEI %%01VFS/4/DISKSPACE_NOT_ENOUGH(l)[3]:Disk space is insufficient. The system begins to delete unused log files. 
Oct 10 2013 19:06:48 HUAWEI %%01VFS/4/DISKSPACE_NOT_ENOUGH(l)[4]:Disk space is insufficient. The system begins to delete unused log files
  ---- More----
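
If the output above is captured to a text file (for example, through the logging feature of a terminal program), the counters in the header can be read with a short script. The following is a minimal Python sketch, not a Huawei tool. The function name logbuffer_counters is made up for illustration, and the sketch assumes the header lines look exactly like the sample above. A non-zero Overwritten messages value would suggest that older entries are no longer in the buffer, so saved log files may be needed instead.

```python
import re

# Matches header lines such as "Dropped messages : 0" in captured
# "display logbuffer" output (layout assumed from the sample above).
COUNTER_RE = re.compile(r"^(Dropped|Overwritten|Current) messages\s*:\s*(\d+)\s*$")


def logbuffer_counters(captured_text):
    """Return the message counters found in the captured header as a dict."""
    counters = {}
    for line in captured_text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1).lower()] = int(match.group(2))
    return counters


# Header copied from the sample output in this document.
sample = """\
Logging buffer configuration and contents : enabled
Allowed max buffer size : 1024
Actual buffer size : 512
Channel number : 4 , Channel name : logbuffer
Dropped messages : 0
Overwritten messages : 0
Current messages : 43
"""

print(logbuffer_counters(sample))
```

For the sample header, the function returns the three counters with the values 0, 0, and 43.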
Common Alarms About Card Resets and Handling Methods
Table 1-4 Common alarms about card resets and handling methods

Digest: ALML/4/48V_CHECK_FAULT
Log Description: The sensor of a card detected alarms on two 48 V power lines.
Possible Cause: The power supply circuit of the card was faulty, and the card could not be powered on.
Handling Method:
  1. Check whether power modules are present.
  2. If power modules are present but cannot be powered on, go to section Contacting Technical Support.

Digest: ALML/0/BRD_PWOFF
Log Description: A card overheated and was powered off because of a fan failure.
Possible Cause: The fans used to cool the card were removed or stopped running.
Handling Method:
  1. Run the display temperature all command. In the command output, the Status field shows whether the chassis temperature is within the normal range, and the Temperature.(C) field displays the temperature of each component. If the Status field displays minor, go to the next step.
  2. Fix the problem of the cooling system. See Checking Whether the Card Reset Is Caused by High Temperature or Failure of Fans.
  3. If the card temperature is still high, remove and reinstall the card, and then check whether it can be registered successfully. If not, go to section Contacting Technical Support.

Digest: ALML/4/ENTPOWEROFF
Log Description: A card was powered off.
Possible Cause:
  • The power off slot slot-id command was executed.
  • The system detected power insufficiency and powered off the card.
Handling Method: If the power is insufficient, rectify the fault according to Checking Whether the Card Reset Is Caused by a Power Exception.

Digest: ALML/4/ENTRESET
Log Description: A card was reset.
Possible Cause:
  • The card reset command was executed.
  • The system did not run properly. Check the reason field in the log for the specific cause of the reset.
Handling Method: If the card reset command was not executed, check the reset cause in the log and go to section Contacting Technical Support.

Digest: ALML/4/ENT_PULL_OUT
Log Description: A card or subcard was removed.
Possible Cause:
  • The card or subcard was removed by a user.
  • The card or subcard was not securely installed.
Handling Method:
  • If the card or subcard was removed by a user, ignore the log.
  • If it is not securely installed, reinstall it securely in the slot.

Digest: ALML/4/HSB_SWITCH_CAUSE
Log Description: The active MPU was reset.
Possible Cause: The active MPU might be reset for any of the following reasons:
  • Unknown switch reason: The reason is unknown.
  • VRP command force: The MPU is forcibly reset using a command.
  • Master MPU is no memory: The active MPU does not have sufficient memory.
  • VRP find task deadloop: A task deadloop occurs.
  • Batch was not over: A task execution exception occurs.
  • Master switch to slave Interrupt: An active/standby switchover occurs.
  • Ecm Channel was faulty: The Ethernet channel management (ECM) channel is faulty.
  • Monitor bus communication Interrupt: The CANbus communication is interrupted.
  • MPU board was pulled out: The MPU is removed.
Handling Method:
  1. Check whether the MPU has been removed by a user.
  2. Run the display current-configuration command to check whether an active/standby switchover was forcibly triggered using the slave switchover command.
  3. Contact technical support personnel.

Digest: ALML/4/MASTER_TO_SLAVE
Log Description: The active MPU became the standby MPU.
Possible Cause: An active/standby switchover was forcibly triggered using the slave switchover command. (This log will not be recorded if the active MPU becomes the standby one due to an exception.)
Handling Method: If the active/standby switchover was performed using the command, ignore the log.

Digest: ALML/4/POWERSUPPLY_OFF
Log Description: The power supply was cut off.
Possible Cause:
  • The power supply was cut off manually.
  • The power supply system was not operational.
Handling Method: See Checking Whether the Card Reset Is Caused by a Power Exception.

Digest: ALML/4/PWRFANABSENT
Log Description: A fan module was not present.
Possible Cause: The fan module was not present.
Handling Method: See Checking Whether the Card Reset Is Caused by High Temperature or Failure of Fans.

Digest: ALML/4/TEMP_UPPER
Log Description: The temperature sensor detected that the temperature exceeded the upper limit. The cooling efficiency was low, for example, because the air filter was blocked, some fan modules were not running, or vacant slots were not covered by filler panels.
Possible Cause:
  • Heat could not be exhausted from the switch quickly.
  • The air filter was blocked by dust.
  • Vacant slots were not covered with filler panels.
  • The environment temperature was too high.
  • There were not enough fan modules on the switch.
  • Fan modules on the switch were faulty.
Handling Method: See Checking Whether the Card Reset Is Caused by High Temperature or Failure of Fans.

Digest: FMEA/6/AVS_ABNORMAL
Log Description: The adaptive voltage scaling (AVS) module on a card did not work normally.
Possible Cause: The card experienced a hardware failure.
Handling Method: Replace the card.

Digest: MAD/4/CONFLICT_DETECT
Log Description: A multi-active scenario was detected.
Possible Cause: More than one master switch existed due to a cluster link failure.
Handling Method: Rectify the cluster link failure.

Digest: MAD/4/MEMBER_LOST
Log Description: The cluster split due to loss of the cluster neighbor.
Possible Cause:
  • The cluster link failed.
  • The cluster member switch failed.
Handling Method:
  • Rectify the fault of the member switch.
  • Rectify the cluster link failure.
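
When many log lines have been exported, the digests in Table 1-4 can serve as a filter. The following is a minimal Python sketch, not part of the switch software. The function name find_reset_logs is hypothetical, and the sketch assumes that each digest is immediately followed by "(" in a log line, as in the sample lines in this document. The second sample line is invented for illustration, with placeholder message text.

```python
# Digests copied from Table 1-4.
RESET_DIGESTS = (
    "ALML/4/48V_CHECK_FAULT",
    "ALML/0/BRD_PWOFF",
    "ALML/4/ENTPOWEROFF",
    "ALML/4/ENTRESET",
    "ALML/4/ENT_PULL_OUT",
    "ALML/4/HSB_SWITCH_CAUSE",
    "ALML/4/MASTER_TO_SLAVE",
    "ALML/4/POWERSUPPLY_OFF",
    "ALML/4/PWRFANABSENT",
    "ALML/4/TEMP_UPPER",
    "FMEA/6/AVS_ABNORMAL",
    "MAD/4/CONFLICT_DETECT",
    "MAD/4/MEMBER_LOST",
)


def find_reset_logs(log_text):
    """Return (digest, line) pairs for lines that mention a Table 1-4 digest."""
    hits = []
    for line in log_text.splitlines():
        for digest in RESET_DIGESTS:
            # Requiring "(" after the digest avoids matching a longer
            # digest that merely starts with the same characters.
            if digest + "(" in line:
                hits.append((digest, line.strip()))
                break
    return hits


sample = "\n".join([
    # Real line from the sample output in this document (not a reset log).
    "Oct 16 2013 06:06:48 HUAWEI %%01VFS/4/DISKSPACE_NOT_ENOUGH(l)[3]:"
    "Disk space is insufficient. The system begins to delete unused log files.",
    # Invented line for illustration only; the message text is a placeholder.
    "Oct 16 2013 06:10:02 HUAWEI %%01ALML/4/ENTRESET(l)[5]:(message text omitted)",
])

for digest, line in find_reset_logs(sample):
    print(digest, "->", line)
```

Only the second sample line is reported, because the VFS digest is not listed in Table 1-4.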

NOTE:

The following tips will help you quickly find the reference information for a specific log:

  • A digest uniquely identifies a log. You can search for the digest of a log in the Log Reference to find the description of the log and the handling methods.
  • Do not search for logs based on variables, such as log generation time, interface number, process ID, and device name.

Example:

To find reference information for the log: Apr 27 2014 07:45:35 HUAWEI %%01SHELL/4/LOGIN_FAIL_FOR_INPUT_TIMEOUT(s)[6]:Failed to log in due to timeout.(Ip=10.135.19.157, UserName=**, Times=1, AccessType=TELNET, VpnName=), search for the digest LOGIN_FAIL_FOR_INPUT_TIMEOUT in the Log Reference. Then you will find the description of the log: After entering a user name or password, a user failed to log in because of a timeout.
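
The lookup key in the example above can also be extracted automatically. The following is a minimal Python sketch. The function name search_key and the regular expression are illustrative assumptions, and the line layout is inferred only from the sample log lines in this document, so other log formats may not match.

```python
import re

# Assumed layout, inferred from the sample lines in this document:
#   <timestamp> <hostname> %%<2 digits><module>/<severity>/<digest>(<type>)[<seq>]:<text>
LOG_RE = re.compile(
    r"%%\d{2}(?P<module>[A-Za-z0-9_]+)/(?P<severity>\d)/(?P<digest>[A-Za-z0-9_]+)\("
)


def search_key(log_line):
    """Return the digest to look up in the Log Reference, or None if not found."""
    match = LOG_RE.search(log_line)
    return match.group("digest") if match else None


line = (
    "Apr 27 2014 07:45:35 HUAWEI %%01SHELL/4/LOGIN_FAIL_FOR_INPUT_TIMEOUT(s)[6]:"
    "Failed to log in due to timeout.(Ip=10.135.19.157, UserName=**, Times=1, "
    "AccessType=TELNET, VpnName=)"
)
print(search_key(line))  # LOGIN_FAIL_FOR_INPUT_TIMEOUT
```

The variable parts of the line (time, IP address, user name) are ignored, which matches the tip not to search for logs based on variables.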

Contacting Technical Support Personnel

If you have trouble locating a card reset problem, collect related information and send it to your Huawei agent or to Huawei for fault locating and handling.

Collect the following information:

• Fault occurrence time, network topology of the failure point (for example, the upstream and downstream devices connected to the failure point, and location of the failure point), operations performed before the fault occurred, measures taken to handle the fault and their results, fault symptoms, and impact on services.

• Name, version, and current configuration of the faulty device, as well as related interface information. For details, see Collecting Diagnostic Information Using One Command.

• Logs recorded when the fault occurs.

• If a card cannot be properly registered after a reset, collect the console port information displayed during the card startup process.

Collecting Diagnostic Information Using One Command

The display diagnostic-information command provides outputs of multiple commonly used display commands. You can use this command to view diagnostic information about a switch, including the startup configuration, current configuration, interface information, time, and system software version. It is an effective information collection tool.

The display diagnostic-information [ file-name ] command can display running diagnostic information on screen or export the information to a .txt file. If you do not specify the file-name parameter, the command displays diagnostic information on screen. If you specify the file-name parameter, diagnostic information will be saved in the .txt file with the specified file name. You are advised to export diagnostic information to a .txt file. The following is an example:

<HUAWEI> display diagnostic-information dia-info.txt
  This operation will take several minutes, please wait.........................
Info: The diagnostic information was saved to the device successfully.

The .txt file is saved in cfcard:/. You can run the dir command in the user view to check whether the .txt file exists.

If diagnostic information is displayed on screen, you can press Ctrl+C to stop the display.

This command is used to collect diagnostic information for fault locating. Executing this command may affect the system performance. For example, it may cause a high CPU usage. Therefore, do not run this command when the switch is running properly. Do not run this command on multiple terminals connected to the switch at the same time. Otherwise, the CPU usage of the switch will increase sharply, deteriorating the system performance.

Commonly used terminal software supports information output to a specified file. For example, if you are using the HyperTerminal software of a Windows operating system, choose Transfer > Capture Text, enter the file name, and click Start. After that, run the display diagnostic-information command. Then all diagnostic information is displayed on the terminal screen and automatically saved in a file in the specified path.

Obtaining Log Files

Logs and alarms of a switch can be saved in log files. Perform the following steps to obtain log files:

  1. Run the save logfile command to save information in the log buffer to log files.
  2. Upload files in cfcard:/logfile/ to your computer using FTP or TFTP. If the log files cannot be transferred using FTP or TFTP, run the more command in the user view to display the logs. For example, run the more logfile/log.log command to display logs saved in the log.log file.
NOTE:
  • There may be a large number of log files in the logfile folder. You only need to collect the log files generated around the fault occurrence time.
  • If the standby MPU is involved, you also need to collect log files saved on the standby MPU. These log files are saved in slave#cfcard:/logfile/.
  • For a cluster split or reset problem, collect log files on all the member switches.
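
To decide which uploaded log files are worth collecting, the entries in each file can be checked against the fault occurrence time. The following is a minimal Python sketch. The function names entry_time and entries_near_fault and the one-hour default window are illustrative choices, and the sketch assumes that entries in saved log files start with the same timestamp layout as the display logbuffer sample in this document.

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Leading timestamp as in "Oct 16 2013 06:06:48 HUAWEI %%01VFS/4/...".
STAMP_RE = re.compile(
    r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{4})\s+(\d{2}):(\d{2}):(\d{2})\b"
)


def entry_time(line):
    """Parse the leading timestamp of a log line; return None if absent."""
    match = STAMP_RE.match(line)
    if not match or match.group(1) not in MONTHS:
        return None
    month, day, year, hour, minute, second = match.groups()
    try:
        return datetime(int(year), MONTHS[month], int(day),
                        int(hour), int(minute), int(second))
    except ValueError:
        # Out-of-range values such as day 32 are treated as "no timestamp".
        return None


def entries_near_fault(lines, fault_time, window=timedelta(hours=1)):
    """Return the log lines stamped within +/- window of fault_time."""
    result = []
    for line in lines:
        stamp = entry_time(line)
        if stamp is not None and abs(stamp - fault_time) <= window:
            result.append(line)
    return result


# Lines copied from the sample output in this document.
sample_lines = [
    "Oct 16 2013 06:06:48 HUAWEI %%01VFS/4/DISKSPACE_NOT_ENOUGH(l)[3]:"
    "Disk space is insufficient. The system begins to delete unused log files.",
    "Oct 10 2013 19:06:48 HUAWEI %%01VFS/4/DISKSPACE_NOT_ENOUGH(l)[4]:"
    "Disk space is insufficient. The system begins to delete unused log files",
]

fault = datetime(2013, 10, 16, 6, 0, 0)  # example fault occurrence time
for line in entries_near_fault(sample_lines, fault):
    print(line)
```

With the example fault time, only the entry from Oct 16 is printed. A file that yields at least one such entry is a candidate for collection.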

Updated: 2019-07-23

Document ID: EDOC1100088111
