Huawei Rack Server iBMC Alarm Handling 28

This document describes iBMC alarms in terms of the meaning, impact on the system, possible causes, and handling suggestions.
Event Alarms by Component

An event alarm indicates that an event occurred while a server was running. Generally, alarms of this type do not affect services and do not need to be handled immediately; users can handle them during off-peak hours. Table 2-3 lists the server events.
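For scripts that sort or filter exported events, the following minimal sketch groups event codes by component. It assumes that the most significant byte of each code identifies the component, which is only a pattern observed in the codes of Table 2-3 and not a documented encoding. The names `COMPONENT_BY_TOP_BYTE`, `parse_event_code`, and `component_of` are hypothetical, and the mapping covers only some of the listed components.

```python
# Mapping inferred from the codes listed in Table 2-3 (assumption, not an
# official encoding). Only a subset of the listed components is included.
COMPONENT_BY_TOP_BYTE = {
    0x00: "CPU",
    0x01: "Memory",
    0x02: "Disk",
    0x03: "PSU",
    0x04: "Fan",
    0x06: "RAID controller card",
    0x08: "PCIe card",
    0x0D: "NIC",
    0x11: "LCD",
    0x12: "Chassis",
    0x1A: "iBMC",
    0x21: "SD card",
    0x2C: "System",
    0x31: "Button",
}


def parse_event_code(text: str) -> int:
    """Convert a code written as in the table, e.g. '0x0200000D', to an int."""
    return int(text, 16)


def component_of(event_code: int) -> str:
    """Return the component name guessed from the top byte of the code."""
    top_byte = (event_code >> 24) & 0xFF
    return COMPONENT_BY_TOP_BYTE.get(top_byte, "Unknown")
```

For example, `component_of(parse_event_code("0x0200000D"))` falls into the "Disk" group under this assumption.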

Table 2-3 Event list

Event Code

Event Description

Impact/Suggestions

0x00000015

CPU arg1 installed.

NOTE:

arg1 indicates the CPU No.

0x00000017

CPU arg1 removed.

NOTE:

arg1 indicates the CPU No.

Impact:

  • If this event is generated for CPU 1, the server OS fails to start.

  • If this event is generated for other CPUs, the server performance deteriorates.

Handling suggestions: Install the CPU at the appropriate time.

0x0000001F

CPU arg1 Core arg2 isolated.

NOTE:
  • arg1 indicates the CPU No.

  • arg2 indicates the CPU core No.

Impact: The CPU performance deteriorates.

Handling suggestions: Replace the CPU at the appropriate time.
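The event descriptions in this table are templates in which arg1, arg2, and so on stand for runtime values. As an illustration only, the sketch below fills such a template from a list of values. The function name `render_event` is hypothetical, and this is not how the iBMC itself formats messages.

```python
import re

# Matches placeholders such as arg1 or arg2 as whole words.
_ARG = re.compile(r"\barg(\d+)\b")


def render_event(template: str, args: list[str]) -> str:
    """Replace argN placeholders with the N-th value (1-based).

    Placeholders without a matching value are left unchanged.
    """
    def substitute(match: re.Match) -> str:
        index = int(match.group(1)) - 1
        if 0 <= index < len(args):
            return str(args[index])
        return match.group(0)

    return _ARG.sub(substitute, template)
```

Calling `render_event("CPU arg1 Core arg2 isolated.", ["2", "5"])` fills in the CPU No. and the core No. from the NOTE above.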

0x00000021

Faulty CPU arg1 isolated.

NOTE:

arg1 indicates the CPU No.

Impact: The available CPUs are reduced.

Handling suggestions: Replace the CPU at the appropriate time.

0x00000079

CPU arg1 health status degradation detected by PFAE.

NOTE:

arg1 indicates the CPU No.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the CPU or its socket is damaged or whether the CPU has poor contact with its socket.
  2. Replace the faulty part and check whether the alarm is cleared.

0x0100000D

[Memory board arg1] arg2 memory correctable ECC.

NOTE:
  • arg1 indicates the slot No. of the memory board.
  • arg2 indicates the DIMM silkscreen or CPU socket number and memory channel No.
    • DIMM silkscreen, for example, DIMM020(A) or DIMM010(B).
    • CPU socket number and memory channel No. For example, in a 2488 V5 server, CPU 1 channel 2 indicates memory channel No.2 of CPU 1, that is, the DIMMs corresponding to DIMM020 and DIMM021.

      The number of DIMMs corresponding to a channel varies depending on the server model.

Impact: The system performance is affected.
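The NOTE above gives one example of how a CPU and memory channel map to DIMM silkscreens (CPU 1 channel 2 corresponds to DIMM020 and DIMM021 on a 2488 V5). The sketch below generalizes that single example under a loud assumption: the three digits are taken to be the zero-based CPU index, the channel No., and the slot index within the channel. The function name and the default of two DIMMs per channel are hypothetical; check the silkscreen on the actual server model before relying on it.

```python
def dimms_for_channel(cpu_no: int, channel_no: int,
                      dimms_per_channel: int = 2) -> list[str]:
    """Guess the DIMM silkscreens behind 'CPU <cpu_no> channel <channel_no>'.

    Assumed scheme (inferred from one example in the table):
    DIMM<cpu_no - 1><channel_no><slot index>.
    """
    if cpu_no < 1 or channel_no < 0 or dimms_per_channel < 1:
        raise ValueError("invalid CPU, channel, or DIMM count")
    return [
        f"DIMM{cpu_no - 1}{channel_no}{slot}"
        for slot in range(dimms_per_channel)
    ]
```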

0x0100000F

[Memory board arg1] arg2 installed.

NOTE:
  • arg1 indicates the slot No. of the memory board.
  • arg2 indicates the DIMM silkscreen or CPU socket number and memory channel No.
    • DIMM silkscreen, for example, DIMM020(A) or DIMM010(B).
    • CPU socket number and memory channel No. For example, in a 2488 V5 server, CPU 1 channel 2 indicates memory channel No.2 of CPU 1, that is, the DIMMs corresponding to DIMM020 and DIMM021.

      The number of DIMMs corresponding to a channel varies depending on the server model.

0x01000011

[Memory board arg1] arg2 removed.

NOTE:
  • arg1 indicates the slot No. of the memory board.
  • arg2 indicates the DIMM silkscreen or CPU socket number and memory channel No.
    • DIMM silkscreen, for example, DIMM020(A) or DIMM010(B).
    • CPU socket number and memory channel No. For example, in a 2488 V5 server, CPU 1 channel 2 indicates memory channel No.2 of CPU 1, that is, the DIMMs corresponding to DIMM020 and DIMM021.

      The number of DIMMs corresponding to a channel varies depending on the server model.

Impact: The system performance is affected.

Handling suggestions:

  1. Install a DIMM in the slot indicated by the alarm.
  2. Reseat the DIMM.
  3. Replace the DIMM.
  4. Replace the mainboard or the memory board.

0x0100001D

All DIMMs on memory board arg1 have been switched to the standby board.

NOTE:

arg1 indicates the slot No. of the memory board.

0x01000041

arg1 arg2 is replaced from SN(arg3) to SN(arg4).

NOTE:
  • arg1 indicates the slot No. of the memory board.
  • arg2 indicates the DIMM silkscreen or CPU socket number and memory channel No.
    • DIMM silkscreen, for example, DIMM020(A) or DIMM010(B).
    • CPU socket number and memory channel No. For example, in a 2488 V5 server, CPU 1 channel 2 indicates memory channel No.2 of CPU 1, that is, the DIMMs corresponding to DIMM020 and DIMM021.

      The number of DIMMs corresponding to a channel varies depending on the server model.

  • arg3 indicates the SN of the DIMM to be replaced.
  • arg4 indicates the SN of the new DIMM.

-

0x02000003

The [arg1] disk arg2 installed.

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.

0x02000005

The [arg1] disk arg2 removed.

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.

0x0200000D

RAID rebuild starts at the [arg1] disk arg2.

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.

0x0200000F

RAID rebuild at the [arg1] disk arg2 stopped.

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.

0x0200001F

The [arg1] disk arg2 health status degradation detected by PFAE.

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the disk or its slot is damaged or whether the disk has poor contact with its slot.
  2. Replace the faulty part and check whether the alarm is cleared.

0x02000023

The arg1 disk arg2 is replaced from SN(arg3) to SN(arg4).

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.
  • arg3 indicates the SN of the disk to be replaced.
  • arg4 indicates the SN of the new disk.

-
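When replacement events are collected from exported logs, the old and new serial numbers can be pulled out of the message text. The sketch below assumes the message follows the template shown for 0x02000023 exactly; the function name and the sample values are hypothetical.

```python
import re

# Pattern built from the event description template for 0x02000023.
_REPLACED = re.compile(
    r"^The (?P<location>\S+) disk (?P<slot>\S+) is replaced from "
    r"SN\((?P<old_sn>[^)]*)\) to SN\((?P<new_sn>[^)]*)\)\.$"
)


def parse_disk_replacement(message: str):
    """Return location, slot, old SN, and new SN, or None if no match."""
    match = _REPLACED.match(message)
    return match.groupdict() if match else None
```

A message that does not match the template yields `None`, so the caller can skip unrelated events.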

0x02000033

The [arg1] disk arg2 disconnected temporarily.

NOTE:
  • arg1 indicates the disk location, for example, rear or front.
  • arg2 indicates the disk slot No.

0x03000003

PSU arg1 installed.

NOTE:

arg1 indicates the PSU slot No.

0x03000005

PSU arg1 removed.

NOTE:

arg1 indicates the PSU slot No.

Impact: The power supply redundancy is affected.

0x04000001

Fan arg1 [arg2] installed.

NOTE:
  • arg1 indicates the slot No. of the fan module.
  • arg2 indicates the fan module location, for example, rear or front.

0x04000003

Fan arg1 [arg2] removed.

NOTE:
  • arg1 indicates the slot No. of the fan module.
  • arg2 indicates the fan module location, for example, rear or front.

Impact: The fan redundancy is affected.

0x06000001

The RAID controller card arg1 installed.

NOTE:

arg1 indicates the slot No. of the RAID controller card.

0x06000003

The RAID controller card arg1 removed.

NOTE:

arg1 indicates the slot No. of the RAID controller card.

Impact: The services related to the RAID controller card will be interrupted.

0x06000013

arg1 RAID card arg2 BBU is absent.

NOTE:
  • arg1 indicates the location of the RAID controller card.
    • FM: The RAID controller card is located in the front I/O module.
    • CMN: The RAID controller card is located in the compute node in slot N.
  • arg2 indicates the slot No. of the RAID controller card.

Impact: The cache function of the RAID controller card fails.

0x06000015

arg1 RAID card arg2 BBU is present.

NOTE:
  • arg1 indicates the location of the RAID controller card.
    • FM: The RAID controller card is located in the front I/O module.
    • CMN: The RAID controller card is located in the compute node in slot N.
  • arg2 indicates the slot No. of the RAID controller card.

-

0x06000023

The arg1 RAID controller card arg2 health status degradation detected by PFAE.

NOTE:
  • arg1 indicates the location of the RAID controller card.
    • FM: The RAID controller card is located in the front I/O module.
    • CMN: The RAID controller card is located in the compute node in slot N.
  • arg2 indicates the slot No. of the RAID controller card.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the RAID controller card or its slot is damaged or whether the card has poor contact with its slot.
  2. Replace the faulty part and check whether the alarm is cleared.

0x08000019

The [arg1] PCIe card arg2 (arg3) starting arg4.

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

  • arg3 indicates the PCIe card type, for example, M60 GPU.
  • arg4 indicates the system startup phase, for example, BIOS POST successful or OS load successful.

0x0800003D

The [arg1] PCIe card arg2 (RAID) BBU is absent.

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

Impact: The cache function of the PCIe card fails.

0x0800003F

The [arg1] PCIe card arg2 (RAID) BBU is present.

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

0x08000047

The [arg1] PCIe card arg2 installed.

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

0x08000049

The [arg1] PCIe card arg2 removed.

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

0x0800005F

Recoverable errors are detected on arg1 PCIe card arg2 (arg3). Error code: arg4

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

  • arg3 indicates the PCIe card type, for example, M60 GPU.
  • arg4 indicates the error code.

0x08000065

arg1 PCIe card arg2 (arg3) health status degradation detected by PFAE.

NOTE:
  • arg1 indicates the PCIe card location, for example, front, inner, or rear.

  • arg2 indicates the PCIe card slot No.

  • arg3 indicates the PCIe card type, for example, PCIe Card or SDI Card.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the PCIe card or its slot is damaged or whether the card has poor contact with its slot.
  2. Replace the faulty part and check whether the alarm is cleared.

0x08000071

The arg1 PCIe card arg2 (arg3) arg4 is absent.

NOTE:
  • arg1 indicates the device holding the PCIe card, for example, GpuBoard or Riser.
  • arg2 indicates the PCIe card slot No.

  • arg3 indicates the PCIe card type, for example, NIC or SDI.
  • arg4 indicates the device name, for example, NetCard or TransformCard.

0x0D000007

The NIC arg1 health status degradation detected by PFAE.

NOTE:

arg1 indicates the NIC slot No.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the NIC or its slot is damaged or whether the NIC has poor contact with its slot.
  2. Replace the faulty part and check whether the alarm is cleared.

0x100000CD

The LOM [arg1] health status degradation detected by PFAE.

NOTE:

arg1 indicates the LOM slot No.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the LOM or the mainboard is damaged.
  2. Replace the faulty part and check whether the alarm is cleared.

0x11000001

LCD installed.

0x11000003

LCD removed.

0x12000005

Chassis cover opened.

Impact: Heat dissipation and component protection will be affected.

Handling suggestions: Close the chassis cover.

0x12000007

Chassis cover closed.

0x1A00000D

iBMC is restarted after AC power supply is restored.

0x1A00000F

iBMC event records are cleared.

0x1A000011

iBMC event record has reached 90% space capacity.

Impact: If this alarm is not handled in time, the event records will overflow.

Handling suggestions: Clear event records.

0x1A00001B

iBMC operation log has reached 90% space capacity.

Impact: If this alarm is not handled in time, the operation logs will overflow and some historical operation logs may be lost.

Handling suggestions:

  1. Export the operation logs.
  2. Enable the remote syslog dumping function.

0x1A00001D

iBMC security log has reached 90% space capacity.

Impact: If this alarm is not handled in time, the security logs will overflow and some historical security logs may be lost.

Handling suggestions:

  1. Export the security logs.
  2. Enable the remote syslog dumping function.
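The three capacity events above (0x1A000011, 0x1A00001B, and 0x1A00001D) share the same 90% threshold. The following sketch shows that threshold logic for a monitoring script. The function name and the shape of the `usage` input are hypothetical, and the code does not query the iBMC; it only evaluates numbers that the caller supplies.

```python
# Event codes taken from the table; the keys are hypothetical labels.
LOG_EVENT_CODES = {
    "event": 0x1A000011,
    "operation": 0x1A00001B,
    "security": 0x1A00001D,
}
THRESHOLD_PERCENT = 90


def capacity_events(usage: dict[str, tuple[int, int]]) -> list[int]:
    """Return the codes of logs whose usage has reached 90% of capacity.

    usage maps a log label to (used, capacity) in any consistent unit.
    """
    raised = []
    for log_name, (used, capacity) in usage.items():
        # Integer comparison avoids floating-point rounding at exactly 90%.
        if capacity > 0 and used * 100 >= capacity * THRESHOLD_PERCENT:
            raised.append(LOG_EVENT_CODES[log_name])
    return raised
```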

0x1A000021

iBMC is reset and started.

0x1A000023

arg1 certificate is about to expire or has expired.

NOTE:

arg1 indicates the certificate type.

Handling suggestions: Import a new certificate.

0x1A000025

Heartbeat signals between the iBMC and the system management software (iBMA) are lost.

Impact: The in-band management and monitoring information cannot be obtained or updated in real time.

Handling suggestions: Reinstall the iBMA.

0x1A000029

iBMC time is stepped by more than arg1 minutes.

NOTE:

arg1 indicates the time stepped.

Impact: The iBMC log time is inaccurate.

Handling suggestions: Restart the iBMC.

0x1A00002B

iBMC failed to synchronize time with the NTP server.

Impact: The iBMC system time is inaccurate.

Handling suggestions:

  1. Check whether the NTP server is configured correctly.
  2. Check whether the communication between the iBMC and the NTP server is normal.
  3. Restart the NTP service of the iBMC.
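For step 2, one way to confirm that an NTP server answers at all is to send it a basic SNTP client request from a host on the same management network. The sketch below is an assumption-laden probe, not an iBMC feature: it runs on an administrator's machine, the function names are hypothetical, and a reply from that machine does not prove that the iBMC itself can reach the server.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request() -> bytes:
    """Build a 48-byte client request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(response: bytes) -> float:
    """Extract the server transmit timestamp as Unix time in seconds."""
    if len(response) < 48:
        raise ValueError("NTP response shorter than 48 bytes")
    seconds, fraction = struct.unpack("!II", response[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_ntp(server: str, timeout: float = 3.0) -> float:
    """Send one request over UDP port 123 and return the server time.

    Raises socket.timeout (an OSError) if the server does not answer.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (server, 123))
        response, _ = sock.recvfrom(512)
    return parse_transmit_time(response)
```

A timeout from `query_ntp` suggests a network or server problem rather than an iBMC configuration problem, which helps decide between steps 1 and 2.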

0x1A000039

The iBMC license enters the grace period and can still be used. It will expire in arg1 days.

NOTE:

arg1 indicates the remaining days in the grace period.

Impact: Advanced features of the iBMC will become unavailable when the grace period ends.

Handling suggestions: Install a valid license or delete the current license.

0x1A00003B

The iBMC license has expired.

Impact: Advanced features of the iBMC cannot be implemented.

Handling suggestions: Install a valid license or delete the current license.

0x21000001

SD card arg1 installed.

NOTE:

arg1 indicates the SD card slot No.

0x21000003

SD card arg1 removed.

NOTE:

arg1 indicates the SD card slot No.

Impact: The server storage capacity is reduced.

0x21000007

Data rebuild starts at SD card arg1.

NOTE:

arg1 indicates the SD card slot No.

0x21000009

Data rebuild at SD card arg1 is complete.

NOTE:

arg1 indicates the SD card slot No.

0x27000033

PCH health status degradation detected by PFAE.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the mainboard is damaged.
  2. Replace the mainboard and check whether the alarm is cleared.

0x28000015

CPU arg1 QPI/UPI arg2 link health status degradation detected by PFAE.

NOTE:
  • arg1 indicates the CPU No.

  • arg2 indicates the QPI/UPI channel No.

Impact: The system reliability is affected.

Handling suggestions:

  1. Arrange planned maintenance. After the server is powered off, check whether the CPU or its socket is damaged or whether the CPU has poor contact with its socket.
  2. Replace the faulty part and check whether the alarm is cleared.

0x2B000003

arg1

NOTE:

arg1 indicates the alarm message, for example:

  • RAID card (RAID Card1) PHY0 bit error increased too fast.
  • RAID card (RAID Card1) expander 1 PHY0 bit error increased too fast.

0x2C000009

ACPI is in the working state.

0x2C00000B

ACPI is in the soft-off state.

Impact: The server fails to power on.

0x2C00002F

The server system crashes or is abnormally reset.

Impact: The server OS is abnormal, and related services are interrupted.

0x2C000063

The host was restarted by BMC arg1.

NOTE:

arg1 indicates the cause of the restart, for example, "due to an IERR diagnosis failure" or "due to PCIe switch or retimer upgrade".

Impact: The system restarts, which interrupts services.

0x31000001

The power button on the panel is pressed.

Impact: The server OS is shut down.

0x2C00000F

The host was restarted due to unrecognized reason.

Impact: Services running on the server will be interrupted.

0x2C000011

The host was restarted by command.

Impact: Services running on the server will be interrupted.

0x2C000013

The host was restarted by power button.

Impact: Services running on the server will be interrupted.

0x2C000015

The host was restarted due to watchdog timeout.

Impact: Services running on the server will be interrupted.

0x2C000017

The host is restarted after being powered on (Power strategy is "Turn On").

Impact: Services running on the server will be interrupted.

0x2C000019

The host is restarted after being powered on (Power strategy is "Restore Previous State").

Impact: Services running on the server will be interrupted.

0x2C00001B

The OS cannot start without a boot device.

Impact: The server OS fails to start.

0x2C00001D

The OS cannot start without a bootable disk.

Impact: The server OS fails to start.

0x2C00001F

The OS cannot start because the PXE service is unavailable.

Impact: The server OS fails to start.

0x2C000021

The OS cannot start due to the invalid boot partition.

Impact: The server OS fails to start.

0x2C000023

The watchdog(arg1) timed out.

NOTE:

arg1 indicates the watchdog type, which can be BIOS FRB2, BIOS/POST, OS Load, SMS/OS, or OEM.

0x2C00002D

Power capping failed.

Impact: The server automatically powers off, which interrupts services.

Handling suggestions:

  1. Check whether the mains supply meets power consumption requirements of the server. If no, adjust the power supply.
  2. Increase the power cap value.

0x2C000053

The hard disk partition (arg1) usage (arg2%) exceeds the threshold (arg3%).

NOTE:
  • arg1 indicates the disk partition No.
  • arg2 indicates the current disk partition usage.
  • arg3 indicates the disk partition usage threshold.

Impact: The system performance is affected.

Handling suggestions:

  1. Check whether the disk partition usage threshold is set improperly (too low).
  2. Clean up the disk partition to free space.
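To verify the usage reported by event 0x2C000053 from within the host OS, the partition usage can be recomputed and compared with the threshold. The sketch below uses Python's standard `shutil.disk_usage`. The function names are hypothetical, and the strict "greater than" comparison is an assumption based on the word "exceeds" in the event description.

```python
import shutil


def usage_percent(used: int, total: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def exceeds_threshold(used: int, total: int, threshold_percent: float) -> bool:
    """True if usage is strictly above the threshold (assumed semantics)."""
    return usage_percent(used, total) > threshold_percent


def check_path(path: str, threshold_percent: float) -> bool:
    """Check the partition that contains path on the local host."""
    usage = shutil.disk_usage(path)
    return exceeds_threshold(usage.used, usage.total, threshold_percent)
```

If `check_path` returns False for a partition that keeps raising the event, the threshold setting in step 1 is the more likely cause.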

0x31000003

The UID button on the panel is pressed.

0x31000005

The PCIe card arg1 hot swap button is pressed.

NOTE:

arg1 indicates the PCIe card No.

Updated: 2019-06-04

Document ID: EDOC1000054724
