Huawei Rack Server iBMC Alarm Handling 28

This document describes iBMC alarms in terms of the meaning, impact on the system, possible causes, and handling suggestions.
ALM-0x0147FFFF Above Upper Minor Threshold (XXGPUN Temp/GPUN Temp)

Description

Alarm message:

Above upper minor threshold

This alarm is generated when a temperature sensor detects that the temperature of a graphics processing unit (GPU) is higher than the minor alarm upper threshold. This alarm is cleared when the system detects that the temperature is restored to the acceptable range.

This alarm is generated by the following sensors:

  • XXGPUN Temp
  • GPUN Temp
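
The generate-and-clear behavior described above can be sketched as a scan over temperature readings. This is an illustration only, not iBMC firmware logic: the 85 °C threshold and the names `MINOR_UPPER_THRESHOLD_C` and `alarm_events` are assumptions made for the example, and any hysteresis the real sensor may apply is not modeled.

```python
# Illustrative sketch only. The threshold value and all names below are
# assumptions for this example, not values or APIs taken from iBMC.
MINOR_UPPER_THRESHOLD_C = 85.0  # hypothetical minor alarm upper threshold


def alarm_events(readings, threshold_c=MINOR_UPPER_THRESHOLD_C):
    """Return (index, event) pairs, where event is 'generated' or 'cleared'.

    The alarm is generated when a reading rises above the threshold and is
    cleared automatically when a later reading is back at or below it.
    """
    active = False
    events = []
    for i, temp_c in enumerate(readings):
        above = temp_c > threshold_c
        if above and not active:
            events.append((i, "generated"))
        elif not above and active:
            events.append((i, "cleared"))
        active = above
    return events


events = alarm_events([70.0, 86.5, 90.0, 84.0])
```

With these sample readings, the alarm is generated at the second reading and cleared at the fourth; the third reading produces no new event because the alarm is already active.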

Attribute

Alarm ID      Alarm Severity    Auto Clear
0x0147FFFF    Minor             Yes

Parameters

Name    Meaning
N       Indicates a GPU slot number.
XX      Indicates the type of the GPU card, for example, K1, K2, K10, K20X, K20C, K20M, K40M, M40, P100, P4, or P40.
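
When filtering sensor or event lists, the XX and N parts of a sensor name can be pulled out with a regular expression. The sketch below assumes names follow the `XXGPUN Temp` and `GPUN Temp` patterns literally (for example `P100GPU1 Temp`); the exact strings reported by a given server may differ, and the function name `parse_gpu_sensor` is hypothetical.

```python
import re

# Assumed name layout: optional card type (XX), the literal "GPU",
# the slot number (N), then " Temp". Real sensor strings may differ.
_SENSOR_RE = re.compile(r"(?P<card>[A-Z][A-Z0-9]*?)?GPU(?P<slot>\d+) Temp")


def parse_gpu_sensor(name):
    """Return (card_type, slot) for a GPU temperature sensor name.

    card_type is None when the name has no type prefix. Returns None
    when the name does not match the assumed pattern at all.
    """
    match = _SENSOR_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("card"), int(match.group("slot"))


with_type = parse_gpu_sensor("P100GPU1 Temp")
without_type = parse_gpu_sensor("GPU2 Temp")
```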

Impact on the System

The GPU and components on the mainboard cannot operate stably, which shortens the service life of the server and increases power consumption. If the alarm persists, the server powers off or restarts, which interrupts services and causes data loss.

Possible Causes

  • A fan module is faulty.

  • The service load is excessively high.

  • The ambient temperature is excessively high.

  • The air intake is blocked.

  • The air exhaust vent is blocked.

  • The heat sink is not properly connected to the mainboard.

  • The GPU is faulty.

Procedure

  1. Log in to the iBMC command-line interface (CLI) or WebUI, and check whether an alarm has been generated for the fan module. If a critical alarm is reported for low fan speed, power off the server, and remove and reinstall the fan module. Then check whether the critical alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 2.

  2. Replace the fan module. Then check whether the alarm is cleared. For details about how to replace a fan module, see the server user guide.

    • If yes, no further action is required.

    • If no, go to 3.

  3. Check whether the service load on the server is excessively high.

    • If yes, go to 4.

    • If no, go to 5.

  4. Stop non-critical services to reduce the service load on the server. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 5.

  5. Check whether the ambient temperature is excessively high.

    • If yes, go to 6.

    • If no, go to 7.

  6. Reduce the ambient temperature. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 7.

  7. Ensure that neither the air intake nor the air exhaust vent is blocked. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 8.

  8. Power off the server and open the chassis. Then check whether the heat sink is improperly connected to the mainboard.

    • If yes, go to 9.

    • If no, go to 10.

  9. Remove and reinstall the heat sink, and power on the server. After 5 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 10.

  10. Replace the GPU. Then check whether the alarm is cleared.

    • If yes, no further action is required.

    • If no, go to 11.

  11. Contact Huawei technical support.
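
The branching in the procedure above can be summarized as a small decision table, which may help when turning it into a runbook or checklist. This is a sketch only: the step summaries are paraphrased, and the names `STEPS`, `DONE`, `SUPPORT`, and `walk` are hypothetical. It does not query iBMC; the yes/no answers are supplied by the operator.

```python
DONE = 0      # sentinel: no further action is required
SUPPORT = 11  # step 11: contact Huawei technical support

# step -> (paraphrased question, next step on "yes", next step on "no")
STEPS = {
    1: ("Fan module reseated; critical alarm cleared?", DONE, 2),
    2: ("Fan module replaced; alarm cleared?", DONE, 3),
    3: ("Is the service load excessively high?", 4, 5),
    4: ("Non-critical services stopped; alarm cleared?", DONE, 5),
    5: ("Is the ambient temperature excessively high?", 6, 7),
    6: ("Ambient temperature reduced; alarm cleared?", DONE, 7),
    7: ("Air intake and exhaust vent unblocked; alarm cleared?", DONE, 8),
    8: ("Is the heat sink improperly connected?", 9, 10),
    9: ("Heat sink reseated, waited 5 minutes; alarm cleared?", DONE, 10),
    10: ("GPU replaced; alarm cleared?", DONE, SUPPORT),
}


def walk(answers):
    """Return the list of steps visited for a mapping of step -> bool.

    Steps missing from `answers` are treated as answered "no".
    """
    step = 1
    path = []
    while step in STEPS:
        path.append(step)
        _, on_yes, on_no = STEPS[step]
        step = on_yes if answers.get(step, False) else on_no
    if step == SUPPORT:
        path.append(SUPPORT)
    return path


all_no = walk({})                         # nothing helps: ends at support
load_fixed = walk({3: True, 4: True})     # reducing the load clears the alarm
```

Keeping the transitions in a table rather than nested conditionals makes it easy to compare each row against the numbered steps.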
Updated: 2019-06-04

Document ID: EDOC1000054724
