Huawei Rack Server iBMC Alarm Handling 27

This document describes iBMC alarms in terms of the meaning, impact on the system, possible causes, and handling suggestions.

Memory Alarms

This topic describes the memory alarms for servers.

ALM-0x0C01FFFF Uncorrectable Memory Error (DIMMN)

Description

Alarm message:

Uncorrectable memory error, dimm is N

This alarm is generated when an uncorrectable error occurs in a DIMM.

Sensor triggering the alarm: DIMMN

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C01FFFF Critical Yes

Parameters

Name Meaning
N Indicates the silkscreen of a DIMM.
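
To illustrate how the parameter N appears in the alarm message, the following minimal Python sketch extracts the DIMM silkscreen from a message of the form shown above. The function name parse_dimm_from_message and the sample silkscreen value DIMM020 are assumptions made for illustration only; they are not taken from the iBMC.

```python
import re
from typing import Optional

# Message format documented for this alarm:
#   "Uncorrectable memory error, dimm is N"
_MESSAGE_RE = re.compile(r"^Uncorrectable memory error, dimm is (?P<dimm>\S+)$")


def parse_dimm_from_message(message: str) -> Optional[str]:
    """Return the DIMM silkscreen (N) from the alarm message, or None if the
    message does not match the documented format."""
    match = _MESSAGE_RE.match(message.strip())
    return match.group("dimm") if match else None


# "DIMM020" is a hypothetical silkscreen value used only as sample input.
print(parse_dimm_from_message("Uncorrectable memory error, dimm is DIMM020"))
```

The sketch returns None for any other message, so it can be applied to a mixed list of alarm messages without raising an error.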

Impact on the System

The DIMM cannot work properly, which affects the server performance.

Possible Causes

  • The DIMM is faulty.

  • The mainboard is faulty.

Procedure

  1. Remove and reinstall the DIMM. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Swap the DIMM with a functioning DIMM from another slot, and check whether the alarm is still generated for the original DIMM.

    • If yes, go to 3.

    • If no, go to 4.

  3. Replace the DIMM. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Replace the mainboard or memory board. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Contact Huawei technical support.
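
The procedure above can be read as a small decision flow. The following Python sketch encodes each step with its "yes" and "no" branches and returns the steps that would be visited for a given set of answers. The names STEPS and walk are hypothetical, and a missing answer is treated as "no"; this is an illustration of the flow, not an iBMC tool.

```python
from typing import Dict, List, Optional, Tuple

# step -> (check, next step on "yes", next step on "no"); None ends the flow.
STEPS: Dict[int, Tuple[str, Optional[int], Optional[int]]] = {
    1: ("Remove and reinstall the DIMM. Is the alarm cleared?", None, 2),
    2: ("Swap with a functioning DIMM. Is the alarm still generated "
        "for this DIMM?", 3, 4),
    3: ("Replace the DIMM. Is the alarm cleared?", None, 4),
    4: ("Replace the mainboard or memory board. Is the alarm cleared?", None, 5),
    5: ("Contact Huawei technical support.", None, None),
}


def walk(answers: Dict[int, bool]) -> List[int]:
    """Return the step numbers visited for the given yes/no answers.
    Steps without an answer are treated as "no" (an assumption)."""
    visited: List[int] = []
    step: Optional[int] = 1
    while step is not None:
        visited.append(step)
        _, on_yes, on_no = STEPS[step]
        step = on_yes if answers.get(step, False) else on_no
    return visited


# Reinstalling did not help, the fault followed the DIMM, replacement fixed it.
print(walk({1: False, 2: True, 3: True}))
```

Keeping the branches in a table rather than in nested conditions makes it easy to compare the sketch line by line with the numbered steps.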

ALM-0x0C01FFFF Uncorrectable Memory Error (MEMBrdN DIMM)

Description

Alarm message:

Uncorrectable memory error, dimm is M

This alarm is generated when an uncorrectable error occurs in a DIMM.

This alarm applies only to the RH8100 V3.

Sensor triggering the alarm: MEMBrdN DIMM

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C01FFFF Critical Yes

Parameters

Name Meaning
M Indicates the silkscreen of a DIMM.
N Indicates a memory riser number.

Impact on the System

Services may be interrupted, and the system may stop responding or restart.

Possible Causes

The DIMM is faulty.

Procedure

  1. Remove and reinstall the DIMM. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the DIMM. Then, check whether the alarm is cleared.

    For details about how to replace the DIMM, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x0C01FFFF Uncorrectable Memory Error (MEMRiserN DIMM)

Description

Alarm message:

Uncorrectable memory error, dimm is M

This alarm is generated when an uncorrectable error occurs in a DIMM.

This alarm can only be generated by the RH5885H V3.

This alarm is generated by the following sensor:

  • MEMRiserN DIMM (N indicates a memory riser number.)

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C01FFFF Critical Yes

Parameters

Name Meaning
M Indicates the silkscreen of a DIMM.
N Indicates a memory riser number.
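
The sensor names MEMBrdN DIMM and MEMRiserN DIMM both carry the memory riser number N. The following Python sketch shows one possible way to split such a sensor name into its prefix and riser number and to map the prefix to the server model named in this topic. The function name parse_riser_sensor is hypothetical, and the sketch assumes that N is written as decimal digits, which this document does not state.

```python
import re
from typing import Optional, Tuple

# Sensor names documented in this topic: "MEMBrdN DIMM" and "MEMRiserN DIMM".
# Assumption: N is a run of decimal digits.
_SENSOR_RE = re.compile(r"^(MEMBrd|MEMRiser)(\d+) DIMM$")

# Server model to which each sensor type applies, as stated in this topic.
MODEL_BY_PREFIX = {
    "MEMBrd": "RH8100 V3",
    "MEMRiser": "RH5885H V3",
}


def parse_riser_sensor(sensor: str) -> Optional[Tuple[str, int]]:
    """Return (prefix, riser number) for a riser sensor name, or None."""
    match = _SENSOR_RE.match(sensor.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


parsed = parse_riser_sensor("MEMRiser3 DIMM")
if parsed is not None:
    prefix, riser = parsed
    print(MODEL_BY_PREFIX[prefix], riser)
```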

Impact on the System

Services may be interrupted and the system may stop responding or restart.

Possible Causes

The DIMM is faulty.

Procedure

  1. Remove and reinstall the DIMM. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the DIMM. Then check whether the alarm is cleared.

    For details about how to replace a DIMM, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Contact Huawei technical support.

ALM-0x0C07FFFF Configuration Error (DIMMN)

Description

Alarm message:

Configuration error, dimm is N

This alarm is generated during startup of the basic input/output system (BIOS) when a DIMM is not installed properly or is faulty.

NOTE:

For details about DIMM layout, see the server user guide or use the Huawei Server Product Memory Configuration Assistant.

Sensor triggering the alarm: DIMMN

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C07FFFF Critical Yes

Parameters

Name Meaning
N Indicates the silkscreen of a DIMM.

Impact on the System

If the alarm is generated for the DIMM in slot 1, the system cannot start. If the alarm is generated for a DIMM in another slot, the system starts properly but the DIMM is unavailable.
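
The rule above can be written as a short Python sketch. It assumes that the affected slot is already known as a positive integer; how the silkscreen N maps to a slot number is not covered here, and the function name config_error_impact is hypothetical.

```python
def config_error_impact(slot: int) -> str:
    """Return the documented impact of a configuration error for a DIMM slot.
    Assumption: slots are numbered from 1."""
    if slot < 1:
        raise ValueError("slot numbers start at 1")
    if slot == 1:
        return "The system cannot start."
    return "The system starts properly but the DIMM is unavailable."


print(config_error_impact(1))
print(config_error_impact(4))
```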

Possible Causes

  • The DIMM is not installed in the correct slot.

  • The DIMM is faulty.

Procedure

  1. Power off the server, and check whether the DIMM is installed in the correct slot.

    For details, see the troubleshooting manual.

    • If yes, go to 2.

    • If no, go to 3.

  2. Remove and reinstall the DIMM, and check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the DIMM. Then, check whether the alarm is cleared.

    For details about how to replace the DIMM, see "Replacing Parts" in the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0C07FFFF Configuration Error (MEMBrdN DIMM)

Description

Alarm message:

Configuration error, dimm is M

This alarm is generated when the sensor detects, during basic input/output system (BIOS) startup, that a DIMM on a memory riser is installed in the incorrect slot or is faulty.

This alarm can only be generated by the RH8100 V3.

NOTE:
For details about DIMM installation positions, see "Installing DIMMs" in the server user guide.

Sensor triggering the alarm: MEMBrdN DIMM

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C07FFFF Critical Yes

Parameters

Name Meaning
M Indicates the silkscreen of a DIMM.
N Indicates a memory riser number.

Impact on the System

If the error occurs in the DIMM in the first slot corresponding to a CPU, the system cannot start. If the error occurs in the DIMM in another slot, the system can start but the DIMM cannot be used.

Possible Causes

  • The DIMM is installed in an incorrect slot.
  • The DIMM is faulty.

Procedure

  1. Power off the server, and check whether the DIMM is installed in an incorrect slot.

    For details about DIMM installation positions, see the server user guide.

    • If yes, go to 2.

    • If no, go to 3.

  2. Install the DIMM in the correct slot. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the DIMM. Then check whether the alarm is cleared.

    For details about how to replace a DIMM, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0C07FFFF Configuration Error (MEMRiserN DIMM)

Description

Alarm message:

Configuration error, dimm is M

This alarm is generated when the sensor detects, during basic input/output system (BIOS) startup, that a DIMM on a memory riser is installed in the incorrect slot or is faulty.

NOTE:
For details about DIMM installation positions, see "Installing DIMMs" in the server user guide.

This alarm is for the RH5885H V3 only.

Sensor triggering the alarm: MEMRiserN DIMM

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C07FFFF Critical Yes

Parameters

Name Meaning
M Indicates the silkscreen of a DIMM.
N Indicates a memory riser number.

Impact on the System

If the error occurs in the DIMM in the first slot corresponding to a CPU, the system cannot start. If the error occurs in the DIMM in another slot, the system can start but the DIMM cannot be used.

Possible Causes

  • The DIMM is installed in an incorrect slot.
  • The DIMM is faulty.

Procedure

  1. Power off the server, and check whether the DIMM is installed in an incorrect slot.

    For details about DIMM installation positions, see the server user guide.

    • If yes, go to 2.

    • If no, go to 3.

  2. Install the DIMM in the correct slot. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Replace the DIMM. Then check whether the alarm is cleared.

    For details about how to replace a DIMM, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 4.

  4. Contact Huawei technical support.

ALM-0x0C0AFFFF Critical Overtemperature (CPUN Memory)

Description

Alarm message:

Critical overtemperature

This alarm is generated when the sensor detects that the state of a dual in-line memory module (DIMM) is abnormal.

Sensor triggering the alarm: CPUN Memory

Attribute

Alarm ID Alarm Severity Auto Clear
0x0C0AFFFF Major Yes
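
The attribute tables in this topic can be collected into a single lookup keyed by alarm ID. The following Python sketch is one way to do so; the names AlarmAttr, ALARMS, and lookup are hypothetical, and the table contains only the three alarm IDs described in this topic.

```python
from typing import Dict, NamedTuple


class AlarmAttr(NamedTuple):
    name: str
    severity: str
    auto_clear: bool


# Values copied from the attribute tables in this topic.
ALARMS: Dict[int, AlarmAttr] = {
    0x0C01FFFF: AlarmAttr("Uncorrectable Memory Error", "Critical", True),
    0x0C07FFFF: AlarmAttr("Configuration Error", "Critical", True),
    0x0C0AFFFF: AlarmAttr("Critical Overtemperature", "Major", True),
}


def lookup(alarm_id: str) -> AlarmAttr:
    """Look up an alarm by its hexadecimal ID string, for example "0x0C07FFFF".
    Raises KeyError for IDs that are not described in this topic."""
    return ALARMS[int(alarm_id, 16)]


attr = lookup("0x0C0AFFFF")
print(attr.name, attr.severity, attr.auto_clear)
```

Because each alarm ID is shared by several sensors (for example DIMMN, MEMBrdN DIMM, and MEMRiserN DIMM), the sensor name is still needed to select the matching procedure.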

Parameters

Name Meaning
N Serial number of the CPU.

Impact on the System

When this alarm is generated, the operating system (OS) fails, and the mainboard restarts or stops responding.

Possible Causes

  • The fan module is faulty.

  • The ambient temperature exceeds the normal range.

  • The air inlet or outlet is blocked.

  • Hard disk fillers are not installed in idle disk bays.

  • Air ducts are not installed properly.

  • The DIMM is faulty.

Procedure

  1. Check whether a low fan speed alarm is generated for the fan module.

    You can obtain alarm information in either of the following ways:
    • View alarm information on the Current Alarms page of the iBMC WebUI.
    • Run the ipmcget -d healthevents command on the iBMC CLI.

    • If yes, go to 2.

    • If no, go to 5.

  2. Remove and then reinstall the fan module. After 5 minutes, check whether the fan module alarm is cleared.

    • If yes, go to 4.

    • If no, go to 3.

  3. Replace the fan module. After 5 minutes, check whether the fan module alarm is cleared.

    For details about how to replace the fan module, see "Replacing Parts" in the server user guide.

    • If yes, go to 4.

    • If no, go to 14.

  4. Check whether the DIMM overheating alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Check whether the ambient temperature exceeds the normal range.

    • If yes, go to 6.

    • If no, go to 7.

  6. Lower the ambient temperature to the normal range. After 5 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 7.

  7. Check whether the air inlet or outlet is blocked.

    • If yes, go to 8.

    • If no, go to 9.

  8. Clear the blockage. Then, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 9.

  9. Check whether any idle disk bays lack hard disk fillers.

    • If yes, go to 10.

    • If no, go to 11.

  10. Install hard disk fillers in idle disk bays. After 5 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 11.

  11. Check whether air ducts are installed properly.

    • If yes, go to 13.

    • If no, go to 12.

  12. Install air ducts properly. After 5 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 13.

  13. Replace the DIMMs connected to the CPU. Then, check whether the alarm is cleared.

    For details about DIMM layout, see the server user guide or use the Huawei Server Product Memory Configuration Assistant.

    • If yes, no further action is required.

    • If no, go to 14.

  14. Contact Huawei technical support.
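
Because this procedure has many branches, it can help to trace it as a table of transitions. The following Python sketch lists, for each step, the next step on "yes" and on "no" as given above, and returns the path for a set of answers. The names FLOW, DONE, SUPPORT, and trace are hypothetical, and a step without an answer is treated as "no"; the sketch only mirrors the numbered steps and does not query the iBMC.

```python
from typing import Dict, List, Tuple

DONE = 0      # sentinel: no further action is required
SUPPORT = 14  # step 14: contact Huawei technical support

# step -> (check, next step on "yes", next step on "no")
FLOW: Dict[int, Tuple[str, int, int]] = {
    1: ("Low fan speed alarm generated for the fan module?", 2, 5),
    2: ("Fan module reinstalled. Fan module alarm cleared?", 4, 3),
    3: ("Fan module replaced. Fan module alarm cleared?", 4, SUPPORT),
    4: ("DIMM overheating alarm cleared?", DONE, 5),
    5: ("Ambient temperature exceeds the normal range?", 6, 7),
    6: ("Ambient temperature lowered. Alarm cleared?", DONE, 7),
    7: ("Air inlet or outlet blocked?", 8, 9),
    8: ("Blockage cleared. Alarm cleared?", DONE, 9),
    9: ("Hard disk fillers missing from idle disk bays?", 10, 11),
    10: ("Hard disk fillers installed. Alarm cleared?", DONE, 11),
    11: ("Air ducts installed properly?", 13, 12),
    12: ("Air ducts installed properly now. Alarm cleared?", DONE, 13),
    13: ("DIMMs connected to the CPU replaced. Alarm cleared?", DONE, SUPPORT),
}


def trace(answers: Dict[int, bool]) -> List[int]:
    """Return the steps visited; unanswered steps count as "no"."""
    path: List[int] = []
    step = 1
    while step not in (DONE, SUPPORT):
        path.append(step)
        _, on_yes, on_no = FLOW[step]
        step = on_yes if answers.get(step, False) else on_no
    if step == SUPPORT:
        path.append(SUPPORT)
    return path


# A fan alarm is present, reinstalling the fan clears both alarms.
print(trace({1: True, 2: True, 4: True}))
```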
Updated: 2019-02-28

Document ID: EDOC1000054724
