Huawei Rack Server iBMC Alarm Handling 27

This document describes iBMC alarms in terms of the meaning, impact on the system, possible causes, and handling suggestions.

Other Alarms

This topic describes other alarms for servers.

ALM-0x2900FFFF Battery Low (RTC Battery/RAID controller card BBU/PCIeN Card BBU)

Description

Alarm message:

Battery low

This alarm is generated when any of the following faults occurs:
  • The real-time clock (RTC) battery on the mainboard fails or the RTC battery voltage is low.
  • The BBU (iBBU or supercapacitor) of the RAID controller card is faulty or the BBU voltage is low.

Sensor triggering the alarm:

  • RTC Battery
  • RAID controller card BBU
  • PCIeN Card BBU

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x2900FFFF    Major             Yes

Parameters

Name Meaning
N Indicates a PCIe slot number.
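
The N placeholder can be recovered mechanically from a concrete sensor name. The following is a minimal sketch, assuming the sensor name has exactly the form "PCIeN Card BBU"; the function name `slot_number` and the sample name `PCIe3 Card BBU` are illustrative assumptions, not iBMC output.

```python
import re
from typing import Optional

# Matches "PCIe<N> Card BBU", where <N> is the PCIe slot number.
_PCIE_BBU = re.compile(r"^PCIe(\d+) Card BBU$")


def slot_number(sensor_name: str) -> Optional[int]:
    """Return N for a 'PCIeN Card BBU' sensor name, or None for any other sensor."""
    match = _PCIE_BBU.match(sensor_name.strip())
    return int(match.group(1)) if match else None
```

For the hypothetical name `PCIe3 Card BBU` this yields slot 3. Sensors without an N parameter, such as `RTC Battery`, yield `None`.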

Impact on the System

  • An RTC battery failure causes incorrect system time, loss of the data stored in the complementary metal-oxide-semiconductor (CMOS), and system configuration errors.

  • If the BBU of the RAID controller card is faulty, the power-off protection for cache data is affected.

Possible Causes

  • The battery is not installed on the mainboard or is low in power.

  • The BBU of the RAID controller card is faulty.

Procedure

  1. Replace the battery or the BBU of the RAID controller card and check whether the alarm is cleared.

    • Replace the iBBU or supercapacitor of the RAID controller card when the BBU is faulty.
    • Replace the mainboard when the RTC battery is faulty.

    For details, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Contact Huawei technical support.

ALM-0x2901FFFF Battery Failed (RAID Card BBU/PCIeN Card BBU)

Description

Alarm message:

Battery Failed

This alarm is generated when the BBU (iBBU or supercapacitor) of the RAID controller card is faulty.

Sensor triggering the alarm:

  • RAID Card BBU
  • PCIeN Card BBU

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x2901FFFF    Major             Yes

Parameters

Name Meaning
N Indicates a PCIe slot number.

Impact on the System

If the iBBU or supercapacitor of the RAID controller card is faulty, the power-off protection for cache data is affected.

Possible Causes

  • The BBU triggers the replacement alarm.
  • The BBU learn cycle fails.
  • The BBU learn cycle times out.
  • The BBU triggers a pre-alarm.
  • The capacity of the BBU (supercapacitor only) is low.
  • The BBU has no capacity for cache-offload.

Procedure

  1. Power off the server, and remove the power cables from the server.
  2. Replace the BBU for the RAID controller card.
  3. After the server is powered on again and the OS startup is complete, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0700FFFF CAT Error Detected in the x86 OS (CPUN Status)

Description

Alarm message:

CAT error detected in the x86 OS

This alarm is generated when a CPU internal error is detected.

Sensor triggering the alarm: CPUN Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0700FFFF    Critical          Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

The CPU internal error affects the mainboard performance and OS operation, and the mainboard may restart or stop responding.

Possible Causes

  • A hardware fault occurs, such as a CPU, DIMM, or mainboard fault.
  • A software fault occurs, such as an OS incompatibility or an abnormal logical status.

Procedure

  1. Use the Huawei Server Compatibility Checker to check whether the OS type and version are supported by the server.

    • If yes, go to 3.
    • If no, go to 2.

  2. Install the correct OS version. Then, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

  3. On the iBMC, check whether there is a hardware fault alarm generated for the CPU, DIMM, or mainboard.

    • If yes, go to 4.
    • If no, go to 5.

  4. Rectify the hardware fault. Then, check whether the CAT ERROR alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 5.

  5. On the FDM page of the iBMC WebUI, collect fault diagnosis information and rectify the fault. Then, check whether the CAT ERROR alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 6.

  6. Collect the iBMC and OS log information, and contact Huawei technical support.
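
The branching in the steps above can be written down as a small transition table, which makes it easy to check that every path ends either in "no further action" or in step 6. This is a sketch only: the names `CAT_ERROR_STEPS` and `walk` are hypothetical, and the yes/no answers must still come from an operator.

```python
# Each step maps an answer ("yes"/"no") to the next step number, or to None
# when no further action is required. Step 6 has no branches: it always ends
# with collecting logs and contacting technical support.
CAT_ERROR_STEPS = {
    1: {"yes": 3, "no": 2},     # OS type and version supported by the server?
    2: {"yes": None, "no": 3},  # alarm cleared after installing the correct OS?
    3: {"yes": 4, "no": 5},     # hardware fault alarm for CPU, DIMM, or mainboard?
    4: {"yes": None, "no": 5},  # alarm cleared after rectifying the hardware fault?
    5: {"yes": None, "no": 6},  # alarm cleared after FDM diagnosis?
}
FINAL_STEP = 6


def walk(answers):
    """Follow the procedure from step 1.

    `answers` maps a step number to "yes" or "no".
    Returns the list of steps visited, in order.
    """
    visited = []
    step = 1
    while step is not None:
        visited.append(step)
        if step == FINAL_STEP:
            break
        step = CAT_ERROR_STEPS[step][answers[step]]
    return visited
```

For example, a supported OS with no hardware alarm and a successful FDM diagnosis visits steps 1, 3, and 5 only.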

ALM-0x0702FFFF CPU Initialization Failed (FRB1/BIST) (CPUN Status)

Description

Alarm message:

CPU initialization failed (FRB1/BIST)

This alarm is generated when an error occurs in the CPU self-check during startup.

Sensor triggering the alarm: CPUN Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0702FFFF    Critical          Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

When a CPU self-check error occurs, the system cannot be started properly, or services are not properly running on the system.

Possible Causes

  • The CPU is faulty.

  • The mainboard is faulty.

Procedure

  1. Power off the server, remove and reconnect the power cables, and power on the server. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the mainboard. Check whether the alarm is cleared.

    For details about how to replace the mainboard, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the faulty CPU. Then, check whether the alarm is cleared.

    For details about how to replace the CPU, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0705FFFF Configuration Error (CPUN Status)

Description

Alarm message:

Configuration error

This alarm applies only to the RH8100 V3.

This alarm is generated if any of the following is detected during system startup.

  • A PCIe card that requires I/O resources is installed in an incorrect slot.
  • An unsupported CPU is installed.
NOTE:
For details about the CPUs supported by the server, see the Server Compatibility Checker.

This alarm is generated by the following sensor:

  • CPUN Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0705FFFF    Critical          Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

The system fails to start.

Possible Causes

  • A PCIe card that requires I/O resources is installed in an incorrect slot.
  • An unsupported CPU is installed.

Procedure

  1. Check whether a PCIe card requiring I/O resources is installed in an incorrect slot.

    • If yes, go to 2.

    • If no, go to 3.

  2. Install the PCIe card in a correct slot to conform to I/O resource allocation rules. Then check whether the alarm is cleared.

    For details about I/O resource allocation rules, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Check whether an unsupported CPU is installed.

    • If yes, go to 4.

    • If no, go to 5.

  4. Replace the unsupported CPU with a compatible one. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Contact Huawei technical support.

ALM-0x0705FFFF Configuration Error (CPUN Status)

Description

Alarm message:

Configuration error

The RH8100 V3 does not generate this alarm.

This alarm is generated when CPUs of different types, or a CPU incompatible with the server, are detected during server startup.

Sensor triggering the alarm: CPUN Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0705FFFF    Critical          Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

The system cannot operate properly, and services are interrupted.

Possible Causes

  • The server is configured with different models of CPUs.
  • The CPU is faulty.

  • The mainboard is faulty.

Procedure

  1. Check whether the server is configured with different models of CPUs.

    • If yes, go to 2.

    • If no, go to 3.

  2. Use the same model of CPUs. Then, check whether the alarm is cleared.

    For details about the CPUs supported by the server, see the Server Compatibility Checker.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Switch the CPU with a functioning CPU in the same chassis, and check whether the alarm is still generated for this CPU.

    • If yes, go to 4.

    • If no, go to 5.

  4. Replace the faulty CPU. Then, check whether the alarm is cleared.

    For details about how to replace the CPU, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Replace the mainboard. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 6.

  6. Contact Huawei technical support.
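
The check in step 1 (whether the server is configured with different CPU models) can be sketched as a grouping of sockets by model. The socket names and model strings below are placeholders, and how the inventory is obtained (for example, read from the iBMC WebUI by hand) is left open.

```python
from collections import defaultdict


def group_cpus_by_model(installed):
    """Group socket names by CPU model.

    `installed` maps a socket name to the model string of the CPU in it;
    empty sockets are expected to be left out.
    """
    groups = defaultdict(list)
    for socket, model in installed.items():
        groups[model].append(socket)
    return dict(groups)


def has_mixed_cpu_models(installed):
    """Return True if more than one CPU model is installed."""
    return len(group_cpus_by_model(installed)) > 1
```

When `has_mixed_cpu_models` returns True, the grouping shows which sockets hold the odd model and therefore which CPUs to replace in step 2.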

ALM-0x070BFFFF Uncorrectable CPU Error (CPUN Status)

Description

Alarm message:

Uncorrectable CPU error

This alarm is generated when one of the following errors occurs:

  • The SMI2 link fails in non-memory mirroring mode.
  • The CPU executes a faulty program.
  • A parity error occurs on the voltage mode single ended (VMSE) link.
  • The memory controller receives data marked with the poison tag.

Sensor triggering the alarm: CPUN Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x070BFFFF    Critical          Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

Services are interrupted, or the system restarts.

Possible Causes

  • The CPU is faulty.

  • The mainboard is faulty.

Procedure

  1. Power off the server, remove and reconnect the power cables, and power on the server. Check whether the alarm is cleared.

    • If yes, no further operation is required.

    • If no, go to 2.

  2. Remove and reinstall the CPU. Then, check whether the alarm is cleared.

    • If yes, no further operation is required.

    • If no, go to 3.

  3. Switch the CPU with a functioning CPU in the same chassis, and check whether the alarm is still generated for this CPU.

    • If yes, go to 4.

    • If no, go to 5.

  4. Replace the faulty CPU. Then, check whether the alarm is cleared.

    For details about how to replace the CPU, see "Replacing Parts" in the server user guide.

    • If yes, no further operation is required.

    • If no, go to 6.

  5. Replace the mainboard. Check whether the alarm is cleared.

    For details about how to replace the mainboard, see "Replacing Parts" in the server user guide.

    • If yes, no further operation is required.

    • If no, go to 6.

  6. Contact Huawei technical support.

ALM-0x070CFFFF Correctable Machine Check Error (CPUN Status)

Description

Alarm message:

Correctable Machine Check Error

This alarm is generated when the sensor detects that a self-check exception has occurred in a CPU.

This alarm applies only to the RH5885 V3, RH5885H V3, and RH8100 V3.

Sensor triggering the alarm: CPUN Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x070CFFFF    Minor             Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

The DIMMs corresponding to the CPU cannot be used. As a result, server performance may deteriorate.

Possible Causes

  • The SMI2 link has failed in memory mirroring mode.
  • An internal error has occurred in Jordan Creek.
  • The number of errors that occur during data transmission between Jordan Creek and the memory controller has reached the alarm threshold.

Procedure

  1. Check whether there is any alarm generated for the memory board or DIMM corresponding to the CPU.

    • For RH5885 V3, check the DIMMs corresponding to the CPU.
    • For RH5885H V3 and RH8100 V3, check the memory boards corresponding to the CPU.
    • If yes, go to 2.
    • If no, go to 3.

  2. Replace the memory board or DIMM. Then check whether the alarm is cleared.

    • For RH5885 V3, replace the DIMM.
    • For RH5885H V3 and RH8100 V3, replace the memory board.
    • If yes, no further action is required.
    • If no, go to 3.

  3. Replace the mainboard or the system compute module (SCM). Then, check whether the alarm is cleared.

    • For RH5885 V3 and RH5885H V3, replace the mainboard.
    • For RH8100 V3, replace the SCM.
    • If yes, no further action is required.
    • If no, go to 4.

  4. Contact Huawei technical support.
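
The part to replace in steps 2 and 3 depends on the server model. The lookup below restates those model-specific bullets as data; the names `REPLACEMENT_PARTS` and `part_to_replace` and the step keys are illustrative assumptions.

```python
# Part to replace at steps 2 and 3 of the procedure above, per server model.
REPLACEMENT_PARTS = {
    "RH5885 V3": {"step2": "DIMM", "step3": "mainboard"},
    "RH5885H V3": {"step2": "memory board", "step3": "mainboard"},
    "RH8100 V3": {"step2": "memory board", "step3": "SCM"},
}


def part_to_replace(model, step):
    """Return the part named by the procedure for this model and step.

    Raises ValueError for models or steps the procedure does not cover.
    """
    try:
        return REPLACEMENT_PARTS[model][step]
    except KeyError:
        raise ValueError(f"no entry for {model!r} / {step!r}") from None
```

Raising on an unknown model is deliberate: this alarm applies only to the three models listed, so any other value points to a data entry error.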

ALM-0x1B01FFFF Incorrect Cable Connected/Incorrect Interconnection (CPUN QPI Link)

Description

Alarm message:

Incorrect cable connected/Incorrect interconnection

This alarm is generated when the QuickPath Interconnect (QPI) bus is faulty.

Sensor triggering the alarm: CPUN QPI Link

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x1B01FFFF    Major             Yes

Parameters

Name Meaning
N Indicates a CPU number.

Impact on the System

Services may be interrupted, and the system may crash or restart.

Possible Causes

  • The QPI link is faulty.
  • The CPU is faulty.

Procedure

  1. Gracefully power off the server.
  2. Remove the CPU indicated in the alarm message and check whether the CPU socket has twisted pins.

    • If yes, go to 5.

    • If no, go to 3.

  3. Check whether the CPU is faulty.

    • If yes, go to 4.

    • If no, go to 5.

    The following is an example of clearing the alarm.

    Incorrect cable connected/Incorrect interconnection (CPU1 QPI Link)
    1. Switch positions of CPU1 and a normal CPU.
    2. Power on the server. If the alarm message now indicates a different CPU, CPU1 is faulty. Otherwise, the QPI link on the mainboard is faulty.

  4. Gracefully power off the server, replace the faulty CPU, and power on the server again. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 6.

  5. Gracefully power off the server, replace the mainboard, and power on the server again. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 6.

  6. Contact Huawei technical support.
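
The swap test in the example under step 3 reduces to one comparison: did the alarm follow the CPU, or stay with the socket? The function below is a sketch of that rule only. Its name and return strings are hypothetical, and it does not cover the case where no alarm is reported after the swap, which the procedure does not describe.

```python
def diagnose_qpi_swap(alarmed_before, alarmed_after):
    """Interpret the CPU swap test for a QPI Link alarm.

    alarmed_before: CPU number named in the alarm before the swap (1 for CPU1).
    alarmed_after:  CPU number named in the alarm after the suspect CPU was
                    swapped with a known-good CPU and the server was powered on.
    """
    if alarmed_after != alarmed_before:
        # The alarm moved with the CPU to its new socket.
        return "replace CPU"
    # The alarm stayed with the socket: the QPI link on the mainboard is suspect.
    return "replace mainboard"
```

A result of "replace CPU" corresponds to step 4, and "replace mainboard" to step 5.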

ALM-0x0F0001FF System Error (SysFWProgress)

Description

Alarm message:

System error. Please check the SEL for root cause

During BIOS startup, this alarm is generated if any of the following situations occurs: no DIMM is detected; the only DIMM is faulty; or the only DIMM is installed in an incorrect position.

NOTE:

For details about DIMM layout, see the server user guide or use the Huawei Server Product Memory Configuration Assistant.

Sensor triggering the alarm: SysFWProgress

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0F0001FF    Major             Yes

Impact on the System

The system fails to start because no DIMM is available.

Possible Causes

  • No DIMM is installed.
  • The only DIMM is faulty and therefore isolated by BIOS.
  • The only DIMM is installed in an incorrect position.

Procedure

  1. Power off the server and check whether DIMMs are installed.

    • If yes, go to 2.

    • If no, go to 3.

  2. Check whether the DIMMs are installed in correct positions.

    • If yes, go to 4.

    • If no, go to 3.

  3. Install the required DIMMs in the correct positions. Power on the server. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Replace the DIMMs. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Contact Huawei technical support.

ALM-0x0F0007FF Unrecoverable PS/2 or USB Keyboard Failure (SysFWProgress)

Description

Alarm message:

Unrecoverable PS/2 or USB keyboard failure

This alarm is generated when the PS/2 or USB device is unavailable or fails.

Sensor triggering the alarm: SysFWProgress

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0F0007FF    Major             Yes

Impact on the System

The PS/2 or USB device is unavailable.

Possible Causes

  • The PS/2 or USB device is not connected.

  • The PS/2 or USB device is faulty.

  • The mainboard is faulty.

Procedure

  1. Power off the server, remove and reconnect the power cables, and power on the server. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the PS/2 or USB device. Check whether the alarm is cleared.

    For details about how to replace the PS/2 or USB device, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the mainboard. Check whether the alarm is cleared.

    For details about how to replace the mainboard, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0F0009FF Unrecoverable Video Controller Failure (SysFWProgress)

Description

Alarm message:

Unrecoverable video controller failure

This alarm is generated when the BIOS cannot detect the display device.

This alarm is generated by the following sensor:

  • SysFWProgress

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0F0009FF    Major             Yes

Impact on the System

The video device connected to the server does not work.

Possible Causes

The display adapter is faulty.

Procedure

  1. Check whether the server is configured with an external display adapter.

    • If yes, go to 2.

    • If no, go to 3.

  2. Replace the display adapter. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the mainboard. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0F000CFF CPU Voltage Mismatch (SysFWProgress)

Description

Alarm message:

CPU voltage mismatch

This alarm is generated when the sensor detects that CPUs of different models are installed on the server.

Sensor triggering the alarm: SysFWProgress

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0F000CFF    Major             Yes

Impact on the System

The server may not be used properly.

Possible Causes

CPUs of different models are installed on the server.

Procedure

  1. Contact Huawei technical support.

ALM-0x0F01FFFF System Firmware Hang (SysFWProgress)

Description

Alarm message:

System firmware hang

This alarm is generated if the CPU does not match the BIOS version, or if the CPU matches the BIOS version but the CPU microcode fails to be loaded.

Sensor triggering the alarm: SysFWProgress

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0F01FFFF    Critical          Yes

Impact on the System

The system cannot be started, or the CPU cannot operate normally.

Possible Causes

  • The CPU does not match the BIOS version.

  • The CPU microcode fails to be loaded.

Procedure

  1. Log in to the iBMC WebUI or CLI.

    For details, see the iBMC user guide of the server.

  2. Upgrade the iBMC or BIOS to the latest version. Then check whether the alarm is cleared.

    • To upgrade the iBMC or BIOS on the WebUI, go to the Upgrade Firmware page.
    • To upgrade the iBMC or BIOS on the CLI, run the ipmcset -d upgrade command.
    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x1B01FFFF Incorrect Cable Connected/Incorrect Interconnection (SAS Cable)

Description

Alarm message:

Incorrect cable connected/Incorrect interconnection

This alarm is generated when the SAS cable connection is incorrect.

Sensor triggering the alarm: SAS Cable

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x1B01FFFF    Major             Yes

Impact on the System

If the SAS cables are incorrectly connected, the system cannot detect all disks.

Possible Causes

  • The SAS cables are connected incorrectly.

  • The SAS cables are faulty.

  • The component connected to the SAS cable is faulty.

Procedure

  1. Check whether the SAS cables are connected correctly.

    For details about the SAS cable connection sequence, see "Internal Cabling" in the server user guide.

    • If yes, go to 2.

    • If no, go to 3.

  2. Remove and reinstall the SAS cables. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the SAS cables. Then, check whether the alarm is cleared.

    For details about how to replace the SAS cables, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Replace the RAID card or the RAID riser card. Then, check whether the alarm is cleared.

    For details, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Replace the hard disk backplane. Then, check whether the alarm is cleared.

    For details about how to replace the hard disk backplane, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 6.

  6. Contact Huawei technical support.

ALM-0x1B01FFFF Incorrect Cable Connected/Incorrect Interconnection (HDD Backplane)

Description

Alarm message:

Incorrect cable connected/Incorrect interconnection

This alarm is generated when the hard disk drive (HDD) backplane cable is loose or disconnected.

Sensor triggering the alarm: HDD Backplane

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x1B01FFFF    Major             Yes

Impact on the System

The hard disk cannot work properly if the backplane fails.

Possible Causes

  • The hard disk backplane cables are connected incorrectly.

  • The hard disk backplane cables are faulty.

  • The hard disk backplane is faulty.

Procedure

  1. Check whether the hard disk backplane cables are connected correctly.

    For details about the cable connection sequence, see "Internal Cabling" in the server user guide.

    • If yes, go to 3.

    • If no, go to 2.

  2. Connect the hard disk backplane cables in correct sequence. Then, check whether the alarm is cleared.

    • If yes, no further operation is required.

    • If no, go to 3.

  3. Replace the hard disk backplane cables. Then, check whether the alarm is cleared.

    For details about how to replace the hard disk backplane cables, see "Internal Cabling" in the server user guide.

    • If yes, no further operation is required.

    • If no, go to 4.

  4. Replace the hard disk backplane. Then, check whether the alarm is cleared.

    • If yes, no further operation is required.

    • If no, go to 5.

  5. Contact Huawei technical support.
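
Alarm ID 0x1B01FFFF appears in this topic for three different sensors (CPUN QPI Link, SAS Cable, and HDD Backplane), so the ID alone does not select a handling procedure. A lookup therefore needs both the ID and the sensor as its key. The sketch below shows one way to model this. The table holds only the three entries from this topic, and the names `ALARMS`, `parse_alarm_id`, and `sensors_for` are assumptions.

```python
# (alarm ID, sensor) -> (alarm message, severity); entries copied from this topic.
_MESSAGE = "Incorrect cable connected/Incorrect interconnection"
ALARMS = {
    (0x1B01FFFF, "CPUN QPI Link"): (_MESSAGE, "Major"),
    (0x1B01FFFF, "SAS Cable"): (_MESSAGE, "Major"),
    (0x1B01FFFF, "HDD Backplane"): (_MESSAGE, "Major"),
}


def parse_alarm_id(text):
    """Convert 'ALM-0x1B01FFFF' or '0x1B01FFFF' to an integer."""
    if text.startswith("ALM-"):
        text = text[len("ALM-"):]
    return int(text, 16)


def sensors_for(alarm_id):
    """Return, sorted by name, every sensor that shares the given alarm ID."""
    return sorted(sensor for (aid, sensor) in ALARMS if aid == alarm_id)
```

If `sensors_for` returns more than one name, check the sensor shown with the alarm before choosing a procedure.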

ALM-0x0742FFFF Transition to Critical From Less Severe (HDD BP status)

Description

Alarm message:

Transition to critical from less severe.

This alarm is generated when the 12-disk backplane is abnormal or fails.

Sensor triggering the alarm: HDD BP status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0742FFFF    Major             Yes

Impact on the System

The hard disk cannot work properly if the backplane fails.

Possible Causes

The hard disk backplane is faulty.

Procedure

  1. Connect the hard disk backplane cables in correct sequence. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the hard disk backplane. Then, check whether the alarm is cleared.

    For details about how to replace the hard disk backplane, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x0441FFFF Predictive Failure Detected (RAID Status)

Description

Alarm message:

Predictive failure detected

This alarm is generated when the RAID controller card is faulty.

Sensor triggering the alarm: RAID Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0441FFFF    Minor             Yes

Impact on the System

If the RAID controller card is faulty, data on the disks cannot be accessed, which affects system startup.

Possible Causes

  • The chip of the RAID controller card is faulty.
  • The RAID controller card supports out-of-band management, but the communication fails.
  • The RAID controller card has an uncorrectable error.
  • The number of memory ECC errors of the RAID controller card reaches the maximum.
  • The nonvolatile random access memory (NVRAM) of the RAID controller card is faulty.

Procedure

  1. Remove and reinstall the RAID controller card. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the RAID controller card. Then check whether the alarm is cleared.

    For details, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x2100FFFF PCIe Error (PCIE Status)

Description

Alarm message:

PCIe Error, PCIe Slot N

This alarm is generated when a standard PCIe device error is detected.

Sensor triggering the alarm: PCIE Status

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x2100FFFF    Major             Yes

Parameters

Name Meaning
N PCIe device slot number.

Impact on the System

The standard PCIe device cannot be used.

Possible Causes

The standard PCIe device is faulty.

Procedure

  1. Gracefully power off the server and check whether the standard PCIe device or slot is faulty or poorly connected.

    • If yes, go to 4.

    • If no, go to 2.

  2. Power on the server to start the power-on self-test (POST), and then run test software (such as FusionServer Toolkit). Check whether the POST succeeds and the test software finds no fault.

    For details about how to download and use FusionServer Toolkit, see the FusionServer Tools V2R2 Toolkit User Guide.

    • If yes, go to 3.

    • If no, go to 4.

  3. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Replace the part that may be faulty. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (CPU Brd Config)

Description

Alarm message:

Fault status

This alarm is generated if different types of compute nodes are detected.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: CPU Brd Config

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x2100FFFF    Major             Yes

Impact on the System

The system performance deteriorates.

Possible Causes

Different types of compute nodes are used together.

Procedure

  1. Power off the server and disconnect power cables from the server.
  2. Check whether different types of compute nodes are used together.

    • If yes, go to 3.

    • If no, go to 4.

  3. Replace the inconsistent compute nodes so that all compute nodes are of the same model. Then connect the power cables and power on the server, and check whether the alarm is cleared.

    For details about how to replace a compute node, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (MEM Brd Config)

Description

Alarm message:

Fault status

This alarm can only be generated by the RH8100 V3.

This alarm is generated if different types of memory risers are detected.

Sensor triggering the alarm: MEM Brd Config

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x2100FFFF    Major             Yes

Impact on the System

The system performance deteriorates.

Possible Causes

Different types of memory risers are used together.

Procedure

  1. Power off the server and disconnect power cables from the server.
  2. Check whether different types of memory risers are used together.

    • If yes, go to 3.

    • If no, go to 4.

  3. Replace the inconsistent memory risers so that all memory risers are of the same model. Then connect the power cables and power on the server, and check whether the alarm is cleared.

    For details about how to replace a memory riser, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (MEMRiser Config)

Description

Alarm message:

Fault status

This alarm can only be generated by the RH5885H V3.

This alarm is generated if different types of memory risers are detected.

Sensor triggering the alarm: MEMRiser Config

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x2100FFFF    Major             Yes

Impact on the System

The system performance deteriorates.

Possible Causes

Different types of memory risers are used together.

Procedure

  1. Power off the server and disconnect power cables from the server.
  2. Check whether different types of memory risers are used together.

    • If yes, go to 3.

    • If no, go to 4.

  3. Replace the inconsistent memory risers so that all memory risers are of the same model. Then connect the power cables and power on the server, and check whether the alarm is cleared.

    For details about how to replace a memory riser, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0541FFFF Limit Exceeded (CPU Usage)

Description

Alarm message:

Limit Exceeded

This alarm is generated when the CPU usage is higher than the upper limit. This alarm is cleared when the system detects that the CPU usage is restored to the acceptable range.

Sensor triggering the alarm: CPU Usage
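
The generate-and-clear behavior described above can be modeled as a two-state threshold check. The sketch below is illustrative only: the class name `LimitAlarm` and the limit of 80 used in the usage note are assumptions, and the real iBMC may apply sampling windows or hysteresis that this topic does not describe.

```python
class LimitAlarm:
    """Raise when a reading is above `upper_limit`; clear when it is back at or below it."""

    def __init__(self, upper_limit):
        self.upper_limit = upper_limit
        self.active = False

    def update(self, reading):
        """Feed one reading.

        Returns "raised" or "cleared" when the alarm state changes,
        and None when it stays the same.
        """
        if reading > self.upper_limit and not self.active:
            self.active = True
            return "raised"
        if reading <= self.upper_limit and self.active:
            self.active = False
            return "cleared"
        return None
```

With a hypothetical limit of 80, the readings 50, 95, 97, 60 produce no event, then "raised", then no event, then "cleared". This matches an alarm that is reported once and clears automatically.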

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0541FFFF    Minor             Yes

Impact on the System

The system cannot call some processes, and system performance is affected.

Possible Causes

  • The number of ongoing processes in the system is large.

  • The CPU usage of a process is high.

Procedure

  1. Start the task manager on the server, and check whether any processes can be stopped.

    • If yes, go to 2.

    • If no, go to 3.

  2. Stop the processes as required, and check whether the alarm is cleared.

    • If yes, no further operation is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x0541FFFF Limit Exceeded (MEM Bandwidth)

Description

Alarm message:

Limit Exceeded

This alarm is generated when the memory bandwidth is higher than the upper limit. This alarm is cleared when the system detects that the memory bandwidth is restored to the acceptable range.

Sensor triggering the alarm: MEM Bandwidth

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0541FFFF    Minor             Yes

Impact on the System

Some processes are blocked, the system cannot run new processes, or the system performance deteriorates.

Possible Causes

  • The number of ongoing processes in the system is large.

  • The memory bandwidth consumed by a process is high.

Procedure

  1. Start the task manager on the server, and check whether any processes can be stopped.

    • If yes, go to 2.

    • If no, go to 3.

  2. Stop the processes as required, and check whether the alarm is cleared.

    • If yes, no further operation is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x0441FFFF Predictive Failure Detected (LCD Status)

Description

Alarm message:

Predictive Failure Detected

This alarm is generated when the communication between the LCD and the PME is abnormal.

Sensor triggering the alarm: LCD Status

Attribute

Alarm ID Alarm Severity Auto Clear
0x0441FFFF Minor Yes

Impact on the System

The login from the LCD fails. As a result, information query and simple configuration cannot be performed using the LCD.

Possible Causes

The serial cable to the LCD is faulty or the LCD is removed.

Procedure

  1. Check whether the LCD is removed.

    • If yes, go to 2.

    • If no, go to 3.

  2. Install the LCD. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (Heartbeat)

Description

Alarm message:

Fault status

This alarm is generated when the sensor detects that High-performance Fusion Consoles (HFCs) fail to communicate with each other.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: Heartbeat

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The HFCs cannot obtain all server information, which reduces system management capability.

Possible Causes

  • The network is faulty.
  • The iBMC board is faulty.

Procedure

  1. Restart the server. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the HFCs and check whether the alarm is cleared.

    For details about how to replace an HFC, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (BkpB MISC Cable)

Description

Alarm message:

Fault status

This alarm is generated when the sensor detects that the MISC signal cable to hard disk backplane B on a server with 24 hard disks is faulty.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: BkpB MISC Cable

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The hard disks cannot be used properly.

Possible Causes

  • The connection is abnormal between the hard disk backplane and the mainboard.
  • The hard disk backplane fails.

Procedure

  1. Restart the server. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Remove the cables from the hard disk backplane, and then reconnect the cables. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (BkpC MISC Cable)

Description

Alarm message:

Fault status

This alarm is generated when the sensor detects that the MISC signal cable to hard disk backplane C on a server with 24 hard disks is faulty.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: BkpC MISC Cable

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The hard disks cannot be used properly.

Possible Causes

  • The connection is abnormal between the hard disk backplane and the mainboard.
  • The hard disk backplane fails.

Procedure

  1. Restart the server. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Remove the cables from the hard disk backplane, and then reconnect the cables. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (Bkp MISC Cable)

Description

Alarm message:

Fault status

This alarm is generated when the sensor detects that the MISC signal cable to the hard disk backplane on a server with 8 hard disks is faulty.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: Bkp MISC Cable

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The hard disks cannot be used properly.

Possible Causes

  • The connection is abnormal between the hard disk backplane and the mainboard.
  • The hard disk backplane fails.

Procedure

  1. Restart the server. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Remove the cables from the hard disk backplane, and then reconnect the cables. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (ExpN Status)

Description

Alarm message:

Fault status

This alarm is generated when the sensor detects that the Expander heartbeat is lost on a server with 12 or 24 hard disks.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: ExpN Status

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Parameters

Name Meaning
N Indicates the number of the Expander board.

Impact on the System

The hard disk cannot be used properly.

Possible Causes

  • The front I/O board fails.
  • The hard disk backplanes fail.

Procedure

  1. Restart the server and check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Remove and reinstall the front I/O board. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Remove and reinstall the hard disk backplane. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.
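
The procedure above follows a pattern shared by several alarms in this topic: apply one remedy, check whether the alarm is cleared, and move to the next remedy only if it is not. The following is a minimal sketch of that escalation loop. The names `run_escalation`, `reseat`, and the simulated `state` dictionary are hypothetical stand-ins for manual operations and are not an iBMC interface.

```python
def run_escalation(steps, alarm_cleared):
    """Try each remedy in order and stop as soon as the alarm is cleared.

    steps: list of (description, action) pairs; action is a callable.
    alarm_cleared: callable that returns True when the alarm is no longer active.
    Returns the description of the step that cleared the alarm, or the
    final escalation step if no remedy helped.
    """
    for description, action in steps:
        action()
        if alarm_cleared():
            return description
    return "Contact Huawei technical support."


# Simulated server: the alarm stays active until the loose part is reseated.
state = {"loose_part": "front I/O board"}


def reseat(part):
    def action():
        if state["loose_part"] == part:
            state["loose_part"] = None
    return action


EXPN_STEPS = [
    ("Restart the server.", lambda: None),
    ("Remove and reinstall the front I/O board.", reseat("front I/O board")),
    ("Remove and reinstall the hard disk backplane.", reseat("hard disk backplane")),
]

result = run_escalation(EXPN_STEPS, lambda: state["loose_part"] is None)
print(result)
```

The order of the list carries the procedure: the least disruptive remedy comes first, and contacting technical support is the fallback when every remedy fails.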

ALM-0x1202FFFF System Error (System Error)

Description

Alarm message:

System error. Please check the SEL for root cause.

This alarm is generated when the management software detects an error that may cause the system to restart or stop responding.

Sensor triggering the alarm: System Error

Attribute

Alarm ID Alarm Severity Auto Clear
0x1202FFFF Critical Yes

Impact on the System

The system may restart or stop responding, which reduces system stability.

Possible Causes

A hardware fault occurs.

Procedure

  1. Collect iBMC and OS logs.
  2. Send the iBMC log to Huawei technical support for further analysis.
  3. Send the OS log to the OS vendor for further analysis.

ALM-0x0341FFFF PCIe Error (RAIDN PCIE ERR/NICN Status/MezzN Status)

Description

Alarm message:

PCIe Error

This alarm is generated when the management software detects a critical alarm of a RAID controller card, LOM (LAN on Motherboard), or mezzanine card.

Sensor triggering the alarm:

  • RAIDN PCIE ERR
  • NICN Status
  • MezzN Status

Attribute

Alarm ID Alarm Severity Auto Clear
0x0341FFFF Critical Yes

Parameters

Name Meaning
N Indicates the slot number of a RAID controller card, a LOM, or a mezzanine card.

If the server supports only one RAID controller card or mezzanine card, the corresponding sensor names do not include N.
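
The naming rule above can be sketched as a small matcher that maps a concrete sensor name back to its template and slot number. This is an illustration only: it assumes that the slot number replaces N directly as decimal digits (for example, `RAID1 PCIE ERR`), which this document does not state explicitly, and the function names are hypothetical.

```python
import re

# Templates from the sensor list above; the "N" before the space is the slot number.
# The flag records whether N may be omitted (single-card servers, per the note above).
TEMPLATES = {
    "RAIDN PCIE ERR": True,
    "NICN Status": False,
    "MezzN Status": True,
}


def build_pattern(template, n_optional):
    """Build a regex in which N matches the slot digits (optional if allowed)."""
    prefix, suffix = template.split("N ", 1)
    digits = r"(\d*)" if n_optional else r"(\d+)"
    return re.compile(re.escape(prefix) + digits + " " + re.escape(suffix))


PATTERNS = {t: build_pattern(t, opt) for t, opt in TEMPLATES.items()}


def match_sensor(name):
    """Return (template, slot) for a sensor name, or None if no template fits."""
    for template, pattern in PATTERNS.items():
        m = pattern.fullmatch(name)
        if m:
            slot = int(m.group(1)) if m.group(1) else None
            return template, slot
    return None


print(match_sensor("RAID1 PCIE ERR"))
print(match_sensor("Mezz Status"))
print(match_sensor("NIC Status"))
```

A slot of `None` means that the sensor name carries no N, which corresponds to a server that supports only one card of that type.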

Impact on the System

The system may restart or stop responding.

Possible Causes

The PCIe device is faulty.

Procedure

  1. Gracefully power off the server and check whether the PCIe device or slot is faulty or poorly connected.

    • If yes, go to 4.

    • If no, go to 2.

  2. Power on the server to start the power-on self-test (POST) and then run test software (such as FusionServer Toolkit). Check whether the POST succeeds and the test software finds no fault.

    For details about how to download and use FusionServer Toolkit, see the FusionServer Tools V2R2 Toolkit User Guide.

    • If yes, go to 3.

    • If no, go to 4.

  3. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Replace the part that may be faulty. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Contact Huawei technical support.
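
Unlike a purely sequential procedure, the steps above branch: a visible fault in step 1 jumps directly to the replacement in step 4. The following is a minimal sketch that records the yes/no branches as a table and walks them. The names `FLOW` and `walk` are illustrative; the answers to each check still come from the engineer on site.

```python
DONE = "no further action is required"
SUPPORT = "contact Huawei technical support"  # step 5

# step -> (next step if yes, next step if no), transcribed from the procedure above
FLOW = {
    1: (4, 2),           # device or slot faulty or poorly connected?
    2: (3, 4),           # POST succeeds and test software finds no fault?
    3: (DONE, 4),        # alarm cleared?
    4: (DONE, SUPPORT),  # alarm cleared after the part is replaced?
}


def walk(answers, start=1):
    """Follow the branches; answers maps step number -> True (yes) or False (no).

    Returns (list of visited steps, outcome).
    """
    path, step = [], start
    while step in FLOW:
        path.append(step)
        yes_next, no_next = FLOW[step]
        step = yes_next if answers[step] else no_next
    return path, step


print(walk({1: False, 2: True, 3: True}))
print(walk({1: True, 4: False}))
```

Writing the branches as data makes it easy to confirm that every path ends either with the alarm cleared or with a call to technical support.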

ALM-0x0341FFFF Uncorrectable PCH Error (PCH Status)

Description

Alarm message:

Uncorrectable PCH error

This alarm is generated when a PCH chip error is detected.

Sensor triggering the alarm: PCH Status

Attribute

Alarm ID Alarm Severity Auto Clear
0x0341FFFF Critical Yes

Impact on the System

The system may restart or stop responding.

Possible Causes

The PCH chip is faulty.

Procedure

  1. Gracefully power off the server and check whether the PCH chip or mainboard has any damage.

    • If yes, go to 4.

    • If no, go to 2.

  2. Power on the server to start the power-on self-test (POST) and then run test software (such as FusionServer Toolkit). Check whether the POST succeeds and the test software finds no fault.

    For details about how to download and use FusionServer Toolkit, see the FusionServer Tools V2R2 Toolkit User Guide.

    • If yes, go to 3.

    • If no, go to 4.

  3. Check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Replace the mainboard. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (CPU NUM Config)

Description

Alarm message:

Fault status

This alarm is generated when the CPU quantity or the CPU installation positions are incorrect.

This alarm applies only to the RH5885 V3, RH5885H V3, and RH8100 V3.

Sensor triggering the alarm: CPU NUM Config

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The system may fail to power on.

Possible Causes

  • The CPU quantity is incorrect.
  • The CPU installation positions are incorrect.

Procedure

  1. Check whether the CPU quantity is correct.

    For details about the CPU quantity supported by a server, see the user guide.

    • If yes, go to 3.
    • If no, go to 2.

  2. Install the correct number of CPUs. Then check whether the alarm is cleared.

    For details, see "Replacing Parts" in the user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Reinstall CPUs in the correct positions. Then check whether the alarm is cleared.

    For details about the CPU position, see the user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (CPU Match)

Description

Alarm message:

Fault status

This alarm is generated when an unsupported CPU model or multiple CPU models are installed in the same server.

This alarm applies only to the RH5885 V3, RH5885H V3, and RH8100 V3.

Sensor triggering the alarm: CPU Match

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The system may fail to power on.

Possible Causes

  • An incompatible CPU is installed.
  • The CPUs installed in the server are not of the same model.

Procedure

  1. Power off the server, and remove the power cables from the server.
  2. Check the number of CPUs installed in the server.

    • If only one CPU is installed, go to 4.

    • If multiple CPUs are installed, go to 3.

  3. Check whether the models of the installed CPUs are the same.

    • If yes, go to 4.

    • If no, go to 6.

  4. Check whether the CPU(s) are compatible with the server.

    You can check the CPU compatibility by using the server compatibility checker.

    • If yes, go to 9.

    • If no, go to 5.

  5. Replace the incompatible CPU(s). Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 9.

  6. Check whether any CPU is incompatible with the server.

    • If yes, go to 8.

    • If no, go to 7.

  7. Replace certain CPUs so that the CPU models are the same. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 9.

  8. Replace the incompatible CPUs so that all CPUs are of the same model and compatible with the server. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 9.

  9. Contact Huawei technical support.
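
The branching in steps 2 to 8 depends on two facts only: whether all installed CPUs are of the same model, and whether any installed CPU is incompatible with the server. The following is a minimal sketch of that decision, assuming the installed models and the set of compatible models are already known. The function name, the `compatible_models` input, and the model names are hypothetical; in practice the compatibility information comes from the server compatibility checker.

```python
def cpu_match_action(installed_models, compatible_models):
    """Return the first remedy that the procedure above leads to.

    installed_models: model names of the CPUs found in the server (step 2).
    compatible_models: set of models the server supports (hypothetical input).
    """
    if not installed_models:
        raise ValueError("at least one CPU must be installed")
    same_model = len(set(installed_models)) == 1  # steps 2 and 3
    incompatible = [m for m in installed_models if m not in compatible_models]
    if same_model:
        if incompatible:  # step 4: no
            return "replace the incompatible CPU(s)"  # step 5
        return "contact Huawei technical support"  # step 9
    if incompatible:  # step 6: yes
        return "replace the incompatible CPUs and use one model"  # step 8
    return "replace CPUs so that the models are the same"  # step 7


SUPPORTED = {"model-A", "model-B"}  # placeholder model names
print(cpu_match_action(["model-A", "model-B"], SUPPORTED))
print(cpu_match_action(["model-X"], SUPPORTED))
```

If the alarm persists after the returned remedy is applied, every branch of the procedure ends at step 9.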

ALM-0x2100FFFF Fault Status (HPC Match)

Description

Alarm message:

Fault status

This alarm applies only to the RH8100 V3.

This alarm is generated when HFCs of different models are installed in the same server.

Sensor triggering the alarm: HPC Match

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Impact on the System

The system may fail to power on.

Possible Causes

The HFCs installed in the server are of different models.

Procedure

  1. Power off the server, and remove the power cables from the server.
  2. Check whether the models of the two HFCs are the same.

    • If yes, go to 4.

    • If no, go to 3.

  3. Replace the mismatched HFC. Then power on the server and check whether the alarm is cleared.

    For details about how to replace an HFC, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x2100FFFF Fault Status (HPCN Type)

Description

Alarm message:

Fault status

This alarm applies only to the RH8100 V3.

This alarm is generated when a compute module is incompatible with an HFC.

Sensor triggering the alarm: HPCN Type

Attribute

Alarm ID Alarm Severity Auto Clear
0x2100FFFF Major Yes

Parameters

Name Meaning
N Indicates an HFC number.

Impact on the System

The system may fail to power on.

Possible Causes

The compute module is incompatible with the HFC.

Procedure

  1. Power off the server, and remove the power cables from the server.
  2. Check whether the HFC is compatible with the compute module.

    For details about the relationship between them, see the server user guide.

    • If yes, go to 4.

    • If no, go to 3.

  3. Replace the HFC or compute module with a compatible one. Then power on the server and check whether the alarm is cleared.

    For details about how to replace an HFC or compute module, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0341FFFF State Asserted (Board Mismatch)

Description

Alarm message:

State Asserted

This alarm is generated when the current mainboard does not match the 12/24-bay NVMe PCIe SSD backplane.

Sensor triggering the alarm: Board Mismatch

Attribute

Alarm ID Alarm Severity Auto Clear
0x0341FFFF Major Yes

Impact on the System

NVMe PCIe SSDs cannot be identified.

Possible Causes

The mainboard does not support the NVMe PCIe SSD backplane.

Procedure

  1. Contact Huawei technical support.
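
Several alarms in this topic share one alarm ID. For example, 0x2100FFFF is reported by both the Heartbeat and CPU Match sensors. The alarm ID alone therefore does not select a handling procedure. The following is a minimal sketch of a lookup keyed by alarm ID and sensor name. The table holds only a few entries taken from this topic, and the names `ALARMS`, `lookup`, and `sensors_for` are illustrative.

```python
# (alarm ID, sensor) -> (alarm name, severity), transcribed from this topic
ALARMS = {
    (0x0541FFFF, "MEM Bandwidth"): ("Limit Exceeded", "Minor"),
    (0x0441FFFF, "LCD Status"): ("Predictive Failure Detected", "Minor"),
    (0x2100FFFF, "Heartbeat"): ("Fault Status", "Major"),
    (0x2100FFFF, "CPU Match"): ("Fault Status", "Major"),
    (0x0341FFFF, "PCH Status"): ("Uncorrectable PCH Error", "Critical"),
    (0x0341FFFF, "Board Mismatch"): ("State Asserted", "Major"),
}


def lookup(alarm_id, sensor):
    """Return (alarm name, severity), or None if the pair is not in the table."""
    return ALARMS.get((alarm_id, sensor))


def sensors_for(alarm_id):
    """Return the sorted sensor names in the table that report this alarm ID."""
    return sorted(sensor for (aid, sensor) in ALARMS if aid == alarm_id)


print(lookup(0x0341FFFF, "Board Mismatch"))
print(sensors_for(0x2100FFFF))
```

The two 0x0341FFFF entries also show that alarms with the same ID can have different severities, so the sensor name is needed to determine the severity as well.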
Updated: 2019-02-28

Document ID: EDOC1000054724
