Huawei Server Maintenance Manual 09

V2


Common Problems During Startup and Shutdown

An RH2285 Must Be Powered On by Pressing the Power Button After Being Connected to AC Power
Problem Description
Table 5-56 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

RH2285, RH1285, E6000, X6000, and T6000

Release Date

2010-08-02

Keyword

Restore on AC Power Loss

Author

Xu Changming (employee ID: 141981)

Symptom

After an RH2285 is connected to alternating current (AC) power, it does not start automatically within one minute. You must press the power button on the front panel to start it.

Key Process and Cause Analysis

Cause analysis

Restore on AC Power Loss in the basic input/output system (BIOS) is set to power off.

The following describes the three options for Restore on AC Power Loss:

Setting Item

Meaning

power on

The RH2285 starts automatically when AC power is applied.

power off

The RH2285 does not start when AC power is applied. You must press the power button on the front panel to start it.

Last State

The previous shutdown method determines whether the RH2285 starts automatically when AC power is applied. If the RH2285 was shut down by pressing the power button, it does not start automatically. If the RH2285 was shut down by removing the AC power cables, it starts automatically when the cables are reinstalled.
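The three options can be summarized as a small decision function. The following is a minimal sketch for illustration only, not the BIOS implementation; the names RestorePolicy and starts_automatically are hypothetical.

```python
from enum import Enum


class RestorePolicy(Enum):
    """The three Restore on AC Power Loss options described in the table."""
    POWER_ON = "power on"
    POWER_OFF = "power off"
    LAST_STATE = "Last State"


def starts_automatically(policy: RestorePolicy, was_running_at_power_loss: bool) -> bool:
    """Return True if the server starts by itself when AC power returns.

    was_running_at_power_loss is True when the AC cables were pulled while
    the server was running, and False when the server had already been shut
    down with the power button.
    """
    if policy is RestorePolicy.POWER_ON:
        return True
    if policy is RestorePolicy.POWER_OFF:
        return False
    # Last State: the server returns to the state it was in when AC was lost.
    return was_running_at_power_loss
```

With the policy set to power off, the function returns False for both shutdown methods, which matches the symptom described in this case.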

Conclusion and Solution

Solution

  1. During startup, press Delete to open the BIOS setup, and select Advanced.

  2. Select IPMI 2.0 Configuration, and press Enter.

  3. Select Restore on AC Power Loss, and press Enter. Select Power on from the three displayed options, and press F10 to save the settings and exit.

  4. Power off the server, and then remove and reinstall the AC power cables. The server now starts automatically when AC power is applied.
Experience

None

Note

None

An RH2285 Automatically Displays the BIOS Setup Interface
Problem Description
Table 5-57 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

RH2285 and RH1285

Release Date

2011-03-05

Keyword

BIOS, Remote access

Author

Xu Changming (employee ID: 141981)

Symptom

Each time the RH2285 is started, the basic input/output system (BIOS) setup interface is displayed automatically, and the OS does not boot.

Key Process and Cause Analysis
  1. Check that the keyboard works properly and that neither ~ nor Delete is being pressed.
  2. Check that no jumper cap is installed on the third pair of jumpers, counted from right to left, at the position shown in Figure 5-67.
    Figure 5-67 Jumper position

    Jumper caps are used to switch between the system serial port and the management serial port. When jumper caps are installed, the DB9 serial port connector serves as a BMC management serial port. When jumper caps are removed, the DB9 serial port connector serves as a system serial port. In this case, a peripheral connected to the connector sends serial port data to the RH2285, as shown in Figure 5-68.

    Figure 5-68 Serial port position

  3. Check whether "Remote Access" (serial port redirection) is enabled in the BIOS while the peripheral connected to the DB9 serial port connector sends serial port data to the RH2285. The RH2285/E6000 serial port program only sends and receives data, so an external serial port device can send data continuously. After the BIOS initializes the serial port, the port can capture the ASCII code of the Delete key in this data. The system then determines that someone has pressed Delete and, when the power-on self-test (POST) is complete, automatically displays the BIOS setup interface.
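The mechanism in step 3 can be illustrated with a short sketch. It assumes that the Delete key arrives as the single ASCII DEL byte (0x7F); the byte sequence that the BIOS actually matches is not stated in this case, and the function name would_enter_setup is hypothetical.

```python
ASCII_DEL = 0x7F  # assumption: the Delete key is seen as the ASCII DEL byte


def would_enter_setup(serial_bytes: bytes, remote_access_enabled: bool) -> bool:
    """Model whether unsolicited serial input looks like a Delete keystroke.

    With redirection disabled, serial input is ignored during POST.
    With redirection enabled, any DEL byte in the stream is treated as if
    someone pressed Delete on a console.
    """
    if not remote_access_enabled:
        return False
    return ASCII_DEL in serial_bytes
```

In this model, a peripheral that streams arbitrary binary data only has to emit one matching byte during POST to trigger the setup interface, which is why both solutions below either stop the data or disable redirection.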
Conclusion and Solution

Conclusion

When "Remote Access" for serial port redirection is enabled in the BIOS and the peripheral connected to the DB9 serial port connector sends serial port data to the RH2285, the BIOS interprets the data as a Delete keystroke and displays the setup interface.

Solution

Solution 1: Before the RH2285 starts the operating system (OS), disconnect or shut down the peripheral connected to the DB9 serial port connector so that it stops sending data to the RH2285. After the OS starts, connect and start the peripheral.

Solution 2: Open the BIOS setup interface, and set Remote Access to Disable. This parameter controls serial port redirection only in the POST phase and does not affect redirection in the GRUB phase or in the OS kernel. To set Remote Access, perform the following steps:

  1. Display the BIOS interface, and select Remote Access Configuration on the Advanced tab page, as shown in Figure 5-69.
    Figure 5-69 Advanced

  2. Select Remote Access, and press Enter. In the displayed check box, select Disable, as shown in Figure 5-70 and Figure 5-71.
    Figure 5-70 Remote Access
    Figure 5-71 Disabled Remote Access
Experience

None

Note

Figure 5-72 shows the Remote Access Configuration screen.

Figure 5-72 Remote Access Configuration

Parameter description

Name

Description

Remote Access

Sets serial port redirection in the POST phase.

Serial port number

Displays the IDs of serial ports that support serial port redirection.

Serial Port Mode

Sets the serial port display mode and operation mode.

Flow Control

Sets the flow control mode of the serial port. The default value is None.

Redirection After BIOS POST

Specifies whether serial port redirection remains active in the OS loader phase and after the OS starts.

Terminal Type

Sets the serial port terminal type.

Handling BRD Alarms for the Optical Channel Diagnosis Panel in an RH5485
Problem Description
Table 5-58 Basic information

Item

Information

Source of the Problem

RH5485

Intended Product

RH5485

Release Date

2011-11-11

Keyword

RH5485, BRD, PCIe, BR10i, main board, I/O board

Author

Han Yao (employee ID: 171887)

Symptom

Figure 5-73 and Figure 5-74 show an optical channel diagnosis panel. The BRD alarm indicator is in the red box. If the BRD alarm indicator is on, the input/output (I/O) board or main board may be faulty.

Figure 5-73 Optical channel diagnosis panel

Figure 5-74 Diagram of the optical channel diagnosis panel

Key Process and Cause Analysis

Key process

  1. Log in to the integrated management module (IMM) to view the IMM event log, and find the following alarm logs:
    1499. E --  -- 7/24/2011:16:19:58 -- Fault in slot "No Op ROM Space" on system "SN# 99B5585"
  2. Check whether a Peripheral Component Interconnect Express (PCIe) device is installed in PCIe slot 5. If yes, remove the device and reinstall it in another slot.
  3. After you perform the preceding operations, the BRD alarm is cleared. If the IMM continues to report errors and the network interface cards (NICs) integrated on the main board do not need the preboot execution environment (PXE) function, disable PXE for the on-board NICs in the Unified Extensible Firmware Interface (UEFI) so that their option read-only memory (ROM) is not loaded:
    1. Choose F1 setup > System Settings > Network > PXE Configuration.
    2. Select the media access control (MAC) address of the on-board NIC 1.
    3. Set PXE Mode to Disabled.
    4. Select Save Changes.
    5. Select the MAC address of the on-board NIC 2.
    6. Set PXE Mode to Disabled.
    7. Select Save Changes.
  4. If the BRD alarm indicator is still lit, determine whether the I/O board or the main board is faulty based on the actual condition, and replace the faulty component.
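When many IMM log entries must be checked, the search in step 1 can be scripted. The following is a minimal sketch that assumes every entry has the same layout as the sample line in step 1 (index, severity letter, two "--" separators, timestamp, message); the names ImmEvent, parse_imm_event, and reports_no_op_rom_space are hypothetical and are not part of any IMM tool.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class ImmEvent(NamedTuple):
    index: int
    severity: str
    timestamp: datetime
    message: str


# Layout assumed from the sample entry: "<index>. <severity> --  -- <time> -- <message>"
_ENTRY = re.compile(r'^\s*(\d+)\.\s+([A-Z])\s+--\s+--\s+(\S+)\s+--\s+(.*?)\s*$')


def parse_imm_event(line: str) -> Optional[ImmEvent]:
    """Parse one exported IMM log line, or return None if it does not match."""
    match = _ENTRY.match(line)
    if match is None:
        return None
    index, severity, stamp, message = match.groups()
    when = datetime.strptime(stamp, "%m/%d/%Y:%H:%M:%S")
    return ImmEvent(int(index), severity, when, message)


def reports_no_op_rom_space(event: ImmEvent) -> bool:
    """True for error entries that mention the No Op ROM Space fault."""
    return event.severity == "E" and "No Op ROM Space" in event.message


SAMPLE = '1499. E --  -- 7/24/2011:16:19:58 -- Fault in slot "No Op ROM Space" on system "SN# 99B5585"'
```

Calling parse_imm_event(SAMPLE) and passing the result to reports_no_op_rom_space identifies the entry from step 1 as a No Op ROM Space error.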
Conclusion and Solution

Conclusion

  • If a PCIe device is installed in PCIe slot 5 and the server is configured with the BR10i RAID controller card, BRD alarms are generated.
  • If PCIe devices are installed in all seven PCIe slots on the server, BRD alarms are generated.
  • If no PCIe device is installed in PCIe slot 5, BRD alarms are caused by a faulty main board or I/O board.

Solution

  • For the first case, move the PCIe device from slot 5 to another PCIe slot.
  • For the second case, remove any unnecessary PCIe devices.
  • For the last case, replace the faulty board.
Experience

Experience: If BRD alarms are generated for the optical channel diagnosis panel, check whether PCIe devices are installed in all seven PCIe slots, whether a PCIe device is installed in PCIe slot 5, and whether the RAID controller card is the BR10i card. In addition, check whether the No Op ROM Space alarm is in the IMM log.

Precautions: Do not install PCIe devices in PCIe slot 5.

Note

None

Handling LINK Alarms for the Optical Channel Diagnosis Panel in an RH5485
Problem Description
Table 5-59 Basic information

Item

Information

Source of the Problem

RH5485

Intended Product

RH5485

Release Date

2011-11-11

Keyword

RH5485, LINK, QPI

Author

Han Yao (employee ID: 171887)

Symptom

Hardware configuration

RH5485 server, four CPUs, and two QuickPath Interconnect (QPI) wrap cards.

Symptom

The LINK alarm indicator is lit on the optical channel diagnosis panel in the RH5485.

Figure 5-75 and Figure 5-76 show an optical channel diagnosis panel. The LINK alarm indicator is in the red box.

Figure 5-75 Optical channel diagnosis panel

Figure 5-76 Diagram of the optical channel diagnosis panel

Key Process and Cause Analysis

Key process

The LINK indicator is lit, indicating that server QPI links are abnormal. Figure 5-77 shows the RH5485 architecture. If more than two microprocessors are installed in the server, two QPI wrap cards are required.

Figure 5-77 RH5485 architecture

  1. If QPI links are abnormal, check the QPI wrap cards first. Exchange the left and right QPI cards. If the fault symptom changes, a QPI card is faulty. If the fault symptom persists, check other components. Check the microprocessors, and in particular check whether any CPU socket on the main board has bent pins. If a CPU socket has bent pins, the main board is faulty. Replace the main board.
  2. Exchange CPU 1 or 2 with CPU 3 or 4. If the fault stays on the same QPI port, the main board is faulty. If the fault moves to a different QPI port, the CPU is faulty.
NOTE:

In the main board structure, CPUs 1 and 4 are in the same channel, and CPUs 2 and 3 are in the same channel. The QPI cards on the rear of the chassis handle data exchange between CPU 1 or 2 and CPU 3 or 4. The QPI links map to the CPUs in reverse order, that is, QPI links 1, 2, 3, and 4 correspond to CPUs 4, 3, 2, and 1 respectively.
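Because the link numbering is reversed, it is easy to suspect the wrong CPU when reading an alarm. The following sketch encodes the mapping stated in the note; the function name cpu_for_qpi_link is hypothetical.

```python
CPU_COUNT = 4  # four-CPU RH5485 configuration described in this case


def cpu_for_qpi_link(link: int) -> int:
    """Return the CPU number that a QPI link number corresponds to.

    The mapping is reversed: links 1, 2, 3, 4 map to CPUs 4, 3, 2, 1.
    """
    if not 1 <= link <= CPU_COUNT:
        raise ValueError(f"QPI link must be between 1 and {CPU_COUNT}, got {link}")
    return CPU_COUNT + 1 - link
```

For example, an error reported on QPI link 1 points to CPU 4, not CPU 1.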

Conclusion and Solution

Conclusion

LINK alarms occur on the optical channel diagnosis panel, indicating that server QPI links are abnormal.

Solution

Check the hardware related to the QPI links: the CPUs, main board, I/O board, and QPI wrap cards. After identifying the faulty hardware, replace it.

Experience

None

Note

None

Handling MEM Alarms for the Optical Channel Diagnosis Panel in an RH5485
Problem Description
Table 5-60 Basic information

Item

Information

Source of the Problem

RH5485

Intended Product

RH5485

Release Date

2011-10-22

Keyword

RH5485, memory, optical channel diagnosis panel, MEM alarms

Author

Han Yao (employee ID: 171887)

Symptom

Hardware configuration

RH5485 server

Symptom

  1. After the RH5485 is started, the power-on self-test (POST) enters the memory initialization phase, as shown in Figure 5-78.
    Figure 5-78 Memory initialization

  2. The system prompts that memory initialization fails. The MEM indicator on the optical channel diagnosis panel is on, that is, the indicator in the red box shown in Figure 5-79 is lit. On the operator information panel shown in Figure 5-80, the system error indicator is on.
    Figure 5-79 MEM indicator on the optical channel diagnosis panel

    Figure 5-80 Operator information panel

  3. On the memory expansion card, the memory expansion card/dual in-line memory module (DIMM) error indicator and a DIMMx error indicator are lit, as shown in Figure 5-81.
NOTE:

x is an integer ranging from 1 to 8.

Figure 5-81 Indicators and button of the memory expansion card

Key Process and Cause Analysis

Key process

  1. Check the fault type for the memory system.

    If alarms occur on a memory expansion card, install the card in another slot. If the fault symptom changes, the card or its DIMMs have a physical fault. If the fault symptom persists, the card and its DIMMs have no physical fault.

  2. If no physical fault occurs on the memory expansion card or its DIMMs, modify the memory configuration.

    For example, if the memory expansion card that reports alarms is configured with four DIMMs, remove or add DIMMs. If the server is configured with more than two memory expansion cards, remove or add memory expansion cards. After modifying the memory configuration, restart the server. The system then recalculates the power consumption of the server, verifies the DIMMs or memory expansion cards, and activates disabled memory slots to clear the alarms. Then restore the original DIMM configuration.

  3. If a physical fault occurs on the memory expansion card or its DIMMs, continue to locate the fault.
    • If DIMMx error indicators are lit in pairs (for example, the DIMM 1 and DIMM 8 error indicators), exchange DIMM 1 and DIMM 3.
      • If only the DIMM 3 error indicator is lit, the original DIMM 1 (now in slot DIMM 3) is faulty.
      • If the DIMM 1 and DIMM 8 error indicators are still lit, exchange DIMM 6 and DIMM 8. If only the DIMM 6 error indicator is lit, the original DIMM 8 (now in slot DIMM 6) is faulty.
      • If the DIMM 1 and DIMM 8 error indicators are still lit after DIMM 1 and DIMM 3 are exchanged back, replace the memory expansion card.
    • If only one DIMMx error indicator is lit (for example, the DIMM 1 error indicator), exchange DIMM 1 and DIMM 3.
      • If only the DIMM 3 error indicator is lit, the original DIMM 1 (now in slot DIMM 3) is faulty.
      • If the DIMM 1 error indicator is still lit, replace the memory expansion card.
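The single-indicator case above follows a simple rule: if the alarm moves with the DIMM, the DIMM is faulty, and if the alarm stays on the slot, the card is faulty. The following is a minimal sketch of that rule; the function name and return strings are hypothetical.

```python
def diagnose_after_swap(lit_before: int, swapped_with: int, lit_after: set) -> str:
    """Interpret DIMM error indicators after exchanging two DIMMs.

    lit_before   -- slot number whose error indicator was lit before the swap
    swapped_with -- slot number the suspect DIMM was exchanged with
    lit_after    -- set of slot numbers whose indicators are lit after the swap
    """
    if lit_after == {swapped_with}:
        # The alarm followed the module to its new slot.
        return "replace DIMM"
    if lit_before in lit_after:
        # The alarm stayed on the original slot.
        return "replace memory expansion card"
    return "inconclusive"
```

For the example in the text, diagnose_after_swap(1, 3, {3}) indicates a faulty DIMM, and diagnose_after_swap(1, 3, {1}) indicates a faulty memory expansion card.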
Conclusion and Solution

Solution

  • For a memory expansion card that is physically damaged, replace the faulty parts with spare parts.
  • For a memory expansion card that is not physically damaged, common solutions are as follows:
    • Modify the memory configuration (that is, change the number of memory expansion cards or DIMMs). When the server restarts, the system recalculates the power consumption, verifies the DIMMs or memory expansion cards, and enables disabled memory slots.
    • Remove the complementary metal oxide semiconductor (CMOS) battery from the main board, reset the system clock, and reinstall the battery. When the server restarts, the system verifies the DIMMs or memory expansion cards again, and enables disabled memory slots.
    • In its Q2 2011 firmware, IBM resolved the problem that memory slots could not be enabled in the Unified Extensible Firmware Interface (UEFI). With the latest firmware, the UEFI can enable a memory slot without the CMOS battery being removed from the main board.
  • In rare cases, a memory alarm is caused by a DIMM compatibility bug. The memory system then reports alarms without any physical fault, and the alarm symptom changes with the memory expansion card on which the DIMMs are installed. In this case, rearrange the DIMMs on the memory expansion card, and remove and reinsert the card to clear the alarms.
Experience

Physical faults on a memory expansion card or its DIMMs are rare. Most memory alarms are caused by poor contact: a verification error occurs during the memory self-test, and the system automatically disables the DIMMs in the affected slots.

Note

None

Handling NMI Alarms for the Optical Channel Diagnosis Panel in an RH5485
Problem Description
Table 5-61 Basic information

Item

Information

Source of the Problem

RH5485

Intended Product

RH5485

Release Date

2011-12-01

Keyword

RH5485, NMI, PCIe

Author

Han Yao (employee ID: 171887)

Symptom

Figure 5-82 and Figure 5-83 show an optical channel diagnosis panel. The NMI alarm indicator is in the red box. If it is on, a hardware error is recorded by the system.

Figure 5-82 Optical channel diagnosis panel

Figure 5-83 Diagram of the optical channel diagnosis panel

Key Process and Cause Analysis

Key process

  1. Log in to the integrated management module (IMM) to view the IMM event log, and find the following alarm logs:
    18. I --  -- 8/12/2011:10:8:50 -- System "SN# 99C8637" has recovered from an Uncorrectable Bus Error 
    19. E --  -- 8/12/2011:10:8:38 -- A Uncorrectable Bus Error has occurred on system "SN# 99C8637"     
  2. Disconnect alternating current (AC) power from the server, wait 15 minutes, and then power on the server. The alarms are cleared.

Cause analysis

If a hardware error has occurred in the system and the error is recorded in a system log, the NMI alarm indicator is illuminated even after the system recovers from the error. When the NMI alarm indicator is illuminated, the PCI or MEM indicator may also be illuminated. If the PCI or MEM indicator is not illuminated, the system may have recovered from the error, but the error remains recorded in the system log. In this case, restart the server.

In this case, Peripheral Component Interconnect Express (PCIe) devices are installed in the server PCIe slots. A transient PCIe bus error occurred due to poor contact and was recorded in the system log. Therefore, the NMI indicator is on.
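To decide whether a logged error has already recovered, the "has occurred" and "has recovered" entries can be paired. The following is a minimal sketch that assumes the message wording of the two sample entries in the key process; the function name unrecovered_errors is hypothetical.

```python
import re

# Wording assumed from the two sample IMM entries shown in the key process.
OCCURRED = re.compile(r'An? (?P<error>.+?) has occurred on system "(?P<system>[^"]+)"')
RECOVERED = re.compile(r'System "(?P<system>[^"]+)" has recovered from an? (?P<error>.+?)\s*$')


def unrecovered_errors(log_lines):
    """Return the (system, error) pairs that occurred and never recovered.

    log_lines must be ordered oldest first.
    """
    open_errors = set()
    for line in log_lines:
        match = OCCURRED.search(line)
        if match:
            open_errors.add((match["system"], match["error"]))
            continue
        match = RECOVERED.search(line)
        if match:
            open_errors.discard((match["system"], match["error"]))
    return open_errors


SAMPLE_LOG = [
    '19. E --  -- 8/12/2011:10:8:38 -- A Uncorrectable Bus Error has occurred on system "SN# 99C8637"',
    '18. I --  -- 8/12/2011:10:8:50 -- System "SN# 99C8637" has recovered from an Uncorrectable Bus Error',
]
```

For SAMPLE_LOG the result is an empty set, which corresponds to the situation where the NMI indicator stays lit although the error has recovered.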

Conclusion and Solution

Conclusion

If a hardware error is recorded in the system log, the NMI indicator is illuminated.

NOTE:

Poor contact of DIMMs or PCIe devices is likely to cause a hardware error. The system may recover from the error automatically, or the error is cleared after maintenance personnel handle it.

Solution

  • If the NMI indicator is illuminated, and the system has recovered from the hardware error based on the system log, clear alarms by using the following methods:
    • Shut down the system, disconnect AC power from the server, and wait 15 minutes. Then restart the server.
    • Reinstall the devices (including PCIe devices and DIMMs) that are prone to cause the hardware error.
    • Delete the alarm logs that cause NMI alarms from the system log.
  • If the NMI alarm indicator is illuminated, and the MEM or PCI alarm indicator is also illuminated, resolve the problem based on the MEM or PCI alarm.
Experience

When the RH5485 uses PCIe devices, NMI alarms can be generated because the PCIe slots are fastened by latches, which are prone to contact problems. Reinstall the PCIe devices, or disconnect AC power and restart the server to clear the alarms.

Note

None

Exceptions Occur in the System When an AC Power Failure Occurs During a BIOS Upgrade
Problem Description
Table 5-62 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

V2 servers

Release Date

2013-01-23

Keyword

BIOS upgrade, AC power failure, system abnormality

Symptom

Hardware configuration:

V2 server (RH/E/X series servers developed on the Romley platform)

Symptom:

AC power is disconnected while the basic input/output system (BIOS) is being upgraded through the firmware upgrade web user interface (WebUI) of the iMana. When the BIOS upgrade is attempted again, the upgrade may fail, the operating system (OS) may be powered off, or the server may restart.

Key Process and Cause Analysis

Cause analysis:

  1. The Romley platform of V2 servers requires the ME module. If the ME module fails to start, the server cannot be powered on or starts improperly.
  2. The BIOS of V2 servers is upgraded in the following sequence: a. The ME module enters the recovery state before the serial peripheral interface (SPI) bus is switched. b. The BIOS is upgraded. c. The ME module returns to the normal state.
  3. If an AC power failure occurs during the BIOS upgrade, the BIOS software may not be completely loaded, and the ME module cannot exit the recovery state and return to the normal state. As a result, the BIOS upgrade fails. When the BIOS is upgraded again, the upgrade may fail, the OS may be powered off, or the server may restart.
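The upgrade sequence and the effect of an interruption can be modeled as a short sketch. This is a simplified illustration of the three steps above, not the iMana or ME firmware logic; MeState and run_bios_upgrade are hypothetical names.

```python
from enum import Enum, auto


class MeState(Enum):
    NORMAL = auto()
    RECOVERY = auto()


def run_bios_upgrade(ac_fails_during_flash: bool):
    """Walk through steps a to c and return (bios_image_complete, me_state)."""
    # Step a: the ME module enters the recovery state before the SPI bus is switched.
    me_state = MeState.RECOVERY
    if ac_fails_during_flash:
        # Steps b and c never finish: the image is incomplete and the
        # ME module is left in the recovery state.
        return False, me_state
    # Step b: the BIOS image is written completely.
    bios_image_complete = True
    # Step c: the ME module returns to the normal state.
    me_state = MeState.NORMAL
    return bios_image_complete, me_state
```

An interrupted run ends with the ME module still in the recovery state, which is the condition that makes the next upgrade attempt unreliable.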
Conclusion and Solution

Conclusion:

If an AC power failure occurs during the BIOS upgrade, the BIOS software may not be completely loaded, and the ME module cannot exit the recovery state and return to the normal state. As a result, the BIOS upgrade fails. When the BIOS is upgraded again, the upgrade may fail, the OS may be powered off, or the server may restart.

Solution:

  1. Check the operations that the customer performed when the fault occurred, and check the iMana logs, to determine whether the fault followed an AC power failure during a BIOS upgrade.
  2. Retry the BIOS upgrade two or three times.
  3. If the BIOS still fails to be upgraded, disconnect the AC power supply from the server, wait 1 to 3 minutes until the mainboard is completely powered off, and then power on the server and upgrade the BIOS again. If the fault persists, replace the mainboard.
Experience

Do not restart the OS or disconnect the AC power supply from the server during a BIOS upgrade.

Note

ME is an Intel-developed server management engine. The ME manages server hardware (in terms of power capping, CPU temperature, and fan status) by executing commands delivered by the iMana.

The SPI is a bus interface between the Platform Controller Hub (PCH) and the BIOS flash memory.

CPU Fails to Start When the Long Serial Cable Hangs in the Air
Problem Description
Table 5-63 Basic information

Item

Information

Source of the Problem

Management module of the E9000

Intended Product

V2 servers

Release Date

2013-09-29

Keyword

Long serial cable, start failure

Symptom

Hardware configuration:

The serial cable between the server and the PC is 19 m long.

Symptom:

The server cannot be powered on.

Key Process and Cause Analysis

Cause analysis:

The serial cable hanging in the air acts as an antenna and interferes with signal reception by the CPU. The CPU interprets the interference as the start of data transmission and receives incorrect data.

Conclusion and Solution

Conclusion:

If the serial cable is not connected to the PC and hangs in the air, the TXD signal is coupled into the RXD end of the serial port. In this case, an incorrect command is delivered to the CPU through the serial port during the CPU startup. As a result, the CPU cannot start up. The MM910 cannot deliver a command for powering on the server.

Solution:

Do not leave an unused long serial cable floating. If the serial cable is not connected to the PC, remove it from the server.

Experience

None

Note

None

A Blank Screen Is Displayed After Power-On

This topic describes how to rectify the fault that the monitor screen is blank after the RH5885 V3 is powered on.

Problem Description
Table 5-64 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

Rack Server

Release Date

2015-03-17

Keyword

Blank screen

Symptom

After the RH5885 V3 is powered on, a blank screen is displayed.

Key Process and Cause Analysis

Possible Causes:

  • The power supply to the chassis is abnormal.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Narrow down the scope of the causes by checking the preceding possible problems one by one.

Conclusion and Solution

Procedure:

  1. Check whether the cause is the same as that for All Indicators Are Off.

    • If yes, no further action is required.
    • If no, go to Step 2.

  2. Check whether the alarm indicator on the panel is on.

  3. Log in to the iMana 200 to acknowledge and clear the alarm. For details, see the RH5885 V3 Server V100R003 Alarm Handling.
  4. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

  5. Check that the dual in-line memory modules (DIMMs) are installed in appropriate slots. For details about DIMM configuration rules, see Installing a DIMM in RH5885 V3 Server V100R003 User Guide.
  6. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

DIMM Configuration Error
Symptom

The iMana 200 reports a server error "DIMMxxx Configuration Error."

Key Process and Cause Analysis

Key process:

  1. Check that the DIMMs are installed in correct slots. For details, check on the Huawei Server Product Memory Configuration Assistant or see the related user guide.

    For example, if DIMMs are installed in DIMM000 and DIMM002, but no DIMM is installed in DIMM001, the iMana 200 reports the server error "DIMM000/001/002 Configuration Error."

  2. Check that the DIMMs are properly installed. If one end of a DIMM is inserted but the other end tilts, the DIMM is not properly installed. Remove and reinsert the DIMMs to confirm that they are seated, and check that the slots are free of dust and foreign matter.
  3. Switch the positions of a functioning DIMM and an abnormal DIMM. If the alarm follows the abnormal DIMM, the DIMM is faulty. Otherwise, the DIMM slot is faulty.
  4. If the DIMM slot is faulty, check whether the CPU socket has bent pins. If it does not, switch CPUs to check whether the current CPU is faulty. If the CPU is faulty, replace the CPU.
  5. If a DIMM slot on the mainboard is faulty, replace the mainboard.
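The slot check in step 1 can be sketched in code. The sketch assumes that the last digit of a slot name such as DIMM002 is the position within a slot group and that each group must be populated from position 0 upward without gaps. This rule is inferred only from the DIMM000/001/002 example above; the authoritative rules are in the Huawei Server Product Memory Configuration Assistant or the user guide. The function name groups_with_gaps is hypothetical.

```python
from collections import defaultdict


def groups_with_gaps(populated):
    """Return the slot groups (for example 'DIMM00') whose positions have gaps.

    populated -- iterable of populated slot names such as 'DIMM000'
    Assumption: the last digit is the position inside the group, and
    positions must be filled in order starting from 0.
    """
    positions_by_group = defaultdict(set)
    for name in populated:
        positions_by_group[name[:-1]].add(int(name[-1]))

    bad_groups = []
    for group, positions in sorted(positions_by_group.items()):
        # A gap-free group with n DIMMs occupies exactly positions 0..n-1.
        if positions != set(range(len(positions))):
            bad_groups.append(group)
    return bad_groups
```

For the example in step 1, groups_with_gaps({"DIMM000", "DIMM002"}) returns ["DIMM00"], which corresponds to the "DIMM000/001/002 Configuration Error" report.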
Conclusion and Solution

None

Experience

None

Note

None

Server Black Screen Caused by Mismatch Between v2 CPUs and BIOSs
Problem Description
Table 5-65 Basic information

Item

Information

Source of the Problem

XH320 V2

Intended Product

All V2 servers

Release Date

2014-09-15

Keyword

E5-2430 v2, black screen

Symptom

Hardware configuration:

XH320 V2 server + 2 x E5-2430 v2

Software configuration:

BIOS V056 and iBMC 3.95

Symptom:

The server HLY indicator is steady green, but the screen is black. No information is displayed on the screen, and no alarm is displayed on the iBMC.

Key Process and Cause Analysis

Cause analysis:

The server uses E5-2430 v2 CPUs, which are supported only by BIOS V3xx or later. However, the onsite BIOS is V056, which does not support v2 CPUs. Upgrade the BIOS to the latest version to resolve the problem.
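The version check can be expressed as a short sketch. It assumes that BIOS versions are written as "V" followed by digits and compare numerically, and it reads "V3xx or later" as V300 or later; both are assumptions for illustration, and the names below are hypothetical.

```python
MIN_BIOS_FOR_V2_CPU = 300  # assumption: numeric reading of "V3xx or later"


def bios_number(version: str) -> int:
    """Convert a version string such as 'V056' to the integer 56."""
    return int(version.strip().lstrip("Vv"))


def supports_v2_cpu(version: str) -> bool:
    """True if the BIOS version is new enough for v2 CPUs under the assumption above."""
    return bios_number(version) >= MIN_BIOS_FOR_V2_CPU
```

Under these assumptions, supports_v2_cpu("V056") is False for the onsite BIOS, and supports_v2_cpu("V386") is True for the version named in the solution.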

Conclusion and Solution

Solution

Upgrade the iBMC to 588 or later and upgrade the BIOS to V386 or later.

Experience

None

Note

None

Server Fails to Automatically Power On
Problem Description
Table 5-66 Basic information

Item

Information

Source of the Problem

RH1288 V2

Intended Product

2-socket rack server/X6000

Release Date

2014-11-25

Keyword

Automatic power-on

Symptom

Hardware configuration:

RH1288 V2 server

Software configuration:

BIOS V372 and BMC 5.88

Symptom:

The BMC power-on policy is set to automatic power-on, but after recovery from a power failure that occurs in the equipment room, the server fails to automatically power on.

Key Process and Cause Analysis

Key process:

  1. Collect BMC logs, and find that a power failure occurred on the server at 20:17:52 on November 22, 2014 (Saturday), as shown in Figure 5-84.
    Figure 5-84 BMC logs

  2. Check the BMC uptime information, and find that the BMC has been running for 39 days. The log information was collected on November 23, 2014, indicating that the BMC did not restart at 20:17:52 on November 22, 2014 (Saturday). There is only one possible explanation for this: the AC power supply is intermittently interrupted, that is, interrupted for one or two seconds before recovery. When the power is interrupted, the mainboard is powered off. However, as the mainboard is equipped with capacitors, the capacitors can still supply power to the BMC for a short time. As the AC power supply recovers very soon, the BMC does not restart during the whole process. Figure 5-85 shows the uptime information.
    Figure 5-85 uptime

  3. The power-on policy set for the server is automatic power-on. This policy is supposed to be implemented after the BMC restarts, but as the BMC does not restart during the whole process, the power-on policy is not implemented. The BMC regards the whole incident as a normal power-off incident.
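The reasoning in step 2 is a date calculation: subtract the BMC uptime from the log collection time and compare the result with the time of the power failure. The following is a minimal sketch; the function name is hypothetical, and the collection time of day (12:00) is an assumed value because the case states only the date.

```python
from datetime import datetime, timedelta


def bmc_restarted_after(event_time: datetime, collected_at: datetime, uptime: timedelta) -> bool:
    """Return True if the BMC booted after event_time.

    The boot time is derived as collection time minus uptime.
    """
    boot_time = collected_at - uptime
    return boot_time > event_time


power_failure = datetime(2014, 11, 22, 20, 17, 52)
log_collected = datetime(2014, 11, 23, 12, 0, 0)  # time of day is an assumption
uptime = timedelta(days=39)
```

With these values, bmc_restarted_after(power_failure, log_collected, uptime) is False: the BMC booted weeks before the power failure, so it did not restart during the incident.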
Conclusion and Solution

Conclusion:

The AC power supply is briefly interrupted, and the BMC does not restart during the interruption. As a result, the power-on policy is not implemented.

Experience

In BMC versions later than 6.0, the power-on policy is optimized so that it is still implemented after this kind of brief AC interruption.

Note

None

Faulty Power Supply in an Equipment Room Causes Servers to Trigger the AC Lost Alarm Repeatedly
Problem Description
Table 5-67 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

RH2288 V2

Release Date

2014-11-25

Keyword

Power supply AC lost alarm

Symptom

Symptom:

At a certain site, the power supply AC lost alarm is repeatedly triggered and cleared for multiple RH2288 V2 servers (equipped with 750 W golden power supply), as shown in Figure 5-86. Services are still running properly, though.

Figure 5-86 Power supply AC lost alarm is repeatedly triggered and cleared

Key Process and Cause Analysis

Key process:

The management tool iMana reads the information of the power supply register 0X04 as input loss, as shown in Figure 5-87, and generates the AC lost alarm to remind users to check whether the power supply environment in the equipment room is normal.

Figure 5-87 Information of the power supply register 0X04

Use an oscilloscope to access the customer's power supply environment on site for checks.

Analyze the captured power supply waveforms, as shown in Figure 5-88. The following two problems are discovered:

  • The waveform distortion of the L-N input voltage is severe, exceeding the specification limit of the power supply by 10%.
  • The peak value of the N-PE inter-wire voltage is 20.6 V, indicating that the N-PE dropout voltage distortion is extremely large (meaning that the grounding condition in the equipment room is poor). Consequently, the 0x04 register of the power supply frequently detects input loss and triggers the iMana AC lost alarm. In a good equipment room environment, the N-PE dropout voltage does not exceed 2 V.
    Figure 5-88 Power supply waveform
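The N-PE check can be applied to sampled oscilloscope data with a short sketch. It uses the 2 V figure stated above as the limit; the function names and the sample values other than 20.6 V are hypothetical.

```python
N_PE_LIMIT_V = 2.0  # limit stated for a good equipment room environment


def n_pe_peak(samples_v):
    """Return the peak absolute N-PE voltage in a list of samples (volts)."""
    return max(abs(v) for v in samples_v)


def grounding_ok(samples_v) -> bool:
    """True if the peak N-PE voltage stays within the 2 V limit."""
    return n_pe_peak(samples_v) <= N_PE_LIMIT_V
```

A capture that contains a 20.6 V peak, as in this case, fails the check, while a capture that stays below 2 V passes.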
Conclusion and Solution

Conclusion:

The power supply AC lost alarm is triggered because the N-PE dropout voltage distortion on the input end of the power supply in the equipment room is too large.

Solution:

Optimize the grounding condition in the equipment room to ensure that the N-PE dropout voltage stays within 2 V.

Experience

If the iMana alarm is not a false alarm, you need to check the power supply environment, including but not limited to the power supply cables, PDU sockets, UPS, and other power supply components.

Note

None

LED Indicators of the RH22XX(H) V2 IB Identification Card Do Not Light Up
Problem Description
Table 5-68 Basic information

Item

Information

Source of the Problem

RH22XX(H) V2

Intended Product

RH22XX(H) V2

Release Date

2014-07-28

Keyword

IB indicator

Symptom

Hardware configuration:

RH22XX(H) V2 equipped with the IB identification card

Software configuration:

Windows Server 2008 R2

Symptom:

After the system and IB identification card driver are installed, the LED indicators (including the green physical link indicator and yellow logical link indicator) of the IB identification card do not light up. Figure 5-89 shows the indicators.

Figure 5-89 IB identification card indicators

Key Process and Cause Analysis

Key process:

  1. Ensure that the system and driver have been properly installed. Change the IB cables and the port interworking with the IB switch. If the problem persists, cable faults and IB switch faults can be excluded.
  2. Exchange the faulty IB identification card with a normal one. The problem persists in the original device, indicating that the IB identification card itself is normal. Check the OS and mainboard.
  3. Install the faulty card in another PCIe slot (any x8 half-height slot). The problem persists, indicating that the problem is not caused by any single slot.
  4. Exchange the system drive with that of a functioning device. The problem persists with the original hard drive after the RAID configuration is imported. Reinstall the system and the driver, and the problem is resolved.

Cause analysis:

This problem occurs because the OS/driver installation is faulty.

Conclusion and Solution

Conclusion:

This problem occurs because the OS/driver installation is faulty.

Solution:

Reinstall the OS and driver.

Experience

If a similar problem occurs, perform the preceding procedure to identify the cause.

Mellanox IB identification card manual:

http://www.mellanox.com/related-docs/user_manuals/ConnectX-3_VPI_Single_and_Dual_QSFP+_Port_Adapter_Card_User_Manual.pdf

Link for downloading the Mellanox IB identification card driver:

http://www.mellanox.com/page/software_overview_ib

Note

None

A V2 Server Enters the S3 Resume State and Fails to Boot
Problem Description
Table 5-69 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V2 rack servers

Release Date

2016-01-20

Keyword

S3 Resume, boot failure

Symptom

Symptom:

The server is powered on, and the BMC completes its initialization. However, the power switch indicator is steady yellow (standby), and attempts to boot the server by pressing the Power button fail.

Key Process and Cause Analysis

Key process:

  1. The BMC SEL logs show no hardware alarm (in rare cases, a CPU CAT ERROR is recorded).
  2. The BMC SOL logs show that the server has entered the S3 Resume state, because they contain the following information: "bootMode = S3Resume. Taking the S3 Resume boot path through MRC." However, V2 rack servers do not support the S3 sleep state. Figure 5-90 shows snippets of the SOL logs.
    Figure 5-90 Snippets of the SOL logs
Conclusion and Solution

Solution:

Upgrade the BIOS to V379 or later. You are advised to upgrade to the latest version.

Experience

If the server fails to boot and no obvious fault information is recorded in the BMC SEL logs, you need to analyze the SOL logs.

Note

None

Error Message "No memory found" Is Displayed, and the Server Fails to Boot
Problem Description
Table 5-70 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V1, V2, and V3 servers

Release Date

2016-01-20

Keyword

No memory found, boot failure

Symptom

Symptom:

The server is powered on, and the BMC completes its initialization. However, the BMC KVM displays the message "No Signal", and attempts to boot the server by pressing the Power button fail.

Key Process and Cause Analysis

Key process:

  1. The BMC SEL logs show no hardware alarm.
  2. The BMC SOL logs show that no memory is detected: Fatal Error! No memory found! Figure 5-91 shows snippets of the SOL logs.
    Figure 5-91 Snippets of the SOL logs
Conclusion and Solution

Solution:

Use memory that has been approved on the server compatibility list, and install the memory according to the memory installation rules of server products.

Experience

If the server fails to boot and no obvious fault information is recorded in the BMC SEL logs, you need to analyze the SOL logs.
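
When SOL logs are saved to a file, the two signatures quoted in this case and in the S3 Resume case can be searched for automatically. The following is a minimal sketch under that assumption. The function name and the hint texts are illustrative, and a real SOL log may contain other fatal messages that this lookup does not cover.

```python
# Signatures quoted from the SOL logs in this case and in the S3 Resume case.
SOL_SIGNATURES = {
    "bootMode = S3Resume": "S3 Resume boot path taken; upgrade the BIOS to V379 or later",
    "No memory found": "no memory detected; check the compatibility list and installation rules",
}


def scan_sol_log(text):
    """Return (signature, hint) pairs for every known signature that appears in text."""
    return [(sig, hint) for sig, hint in SOL_SIGNATURES.items() if sig in text]


sample = "Fatal Error! No memory found!\n"
for sig, hint in scan_sol_log(sample):
    print(f"{sig!r}: {hint}")
```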

Note

None

"POST Error Unrecoverable video controller failure" Is Reported to the iMana When the RH2288H V2 Is Configured with a K1 Video Card
Problem Description
Table 5-71 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

Servers configured with a K1 video card

Release Date

2015-09-11

Keyword

RH2288H V2, K1 video card

Symptom

When an RH2288H V2 server configured with a K1 video card is powered on, the indicator is red, and "POST Error, Unrecoverable video controller failure" is reported to the iMana.

Key Process and Cause Analysis

pci 64-bit decode is set to Disabled in the BIOS.

Conclusion and Solution

In the POST phase of the server, press Delete to enter the BIOS. Choose Advanced > Misc Configuration > pci 64-bit decode, set pci 64-bit decode to Enabled, and press F10 to save the settings and exit.

Fault locating:

  1. Log in to the Huawei Server Compatibility Checker. Check the BOM number of the riser card corresponding to the high-consumption GPU card and check that the power cable is supported.
  2. Check that the GPU cable is correctly connected, as shown in Table 5-72 and Figure 5-92.
    Table 5-72 Cable connection (1)

    BOM Number | Function | Cable Connector | Cable Connection
    04150626 | Mainboard to riser card | 2 x 4 to 2 x 4 pin power link | Connects the 2 x 4 pin connector on the mainboard to the 2 x 4 pin connector in the middle of the riser card.
    04150606 | Riser card to GPU card | 2 x 4 to 2 x 3 pin power link | Connects a connector other than the middle one on the riser card to the connector on the GPU card.
    04150627 | Riser card to GPU card | 2 x 4 to 2 x 4 pin power link | Connects a connector other than the middle one on the riser card to the connector on the GPU card.

    Figure 5-92 Cable connection (2)

  3. Check that the BIOS version is the latest.
  4. Set pci 64-bit decode to Enabled in the BIOS.
  5. Check that the riser card, GPU card (including the cable), and mainboard do not have hardware faults.
Experience

None

Note

If the server (rack, blade, or high-density server) is not configured with a GPU card, but "POST Error, Unrecoverable video controller failure" is still reported to the iMana, perform the following steps:

  1. Run the ipmcset -d clearcmos command on the iMana and restart the server to restore the default BIOS settings.
  2. If other PCIe devices are installed, remove and reinstall them, and check that each slot in use corresponds to the correct CPU (different CPUs manage different PCIe slots).
  3. Upgrade the iBMC or BIOS to the latest version.
  4. Replace the mainboard.
RH5885 V2 QPI Alarms
Problem Description
Table 5-73 Basic information

Item

Information

Source of the Problem

RH5885 V2

Intended Product

RH5885 V2

Release Date

2014-08-06

Keyword

QPI alarms

Symptom

Hardware configuration:

RH5885 V2

Symptom:

The RH5885 V2 fans keep running at about 7000 revolutions per minute.

On an RH5885 V2, CPU and QPI alarms are generated on the iBMC WebUI, as shown in Figure 5-93.

Figure 5-93 CPU and QPI alarms

Key Process and Cause Analysis

Cause analysis:

CPU and QPI alarms are generated due to abnormal communication between CPUs on the RH5885 V2. The possible faulty parts include:

  1. CPUs
  2. CPU pins
  3. IOH
  4. BMC board
  5. QPI cables (8-socket servers)

The following figures show the CPU links of 4-socket and 8-socket RH5885 V2 servers.

Figure 5-94 shows the link connection of a 4-socket server.

  1. CPU1 and CPU3 connect to IOH1, and CPU2 and CPU4 connect to IOH2.
  2. CPU1 and CPU2 connect to CPU3 and CPU4 through the mainboard.
  3. CPU1 connects to CPU2 and CPU3 connects to CPU4 through BMC connectors.
Figure 5-94 Link connection of a 4-socket server

Figure 5-95 shows the link connection of an 8-socket server.

  1. On each node, CPU1 and CPU3 connect to IOH1, and CPU2 and CPU4 connect to IOH2.
  2. On each node, CPU1 and CPU2 each connect to CPU3 and CPU4 through the mainboard.
  3. CPUs on two nodes connect to each other through QPI cables.
Figure 5-95 Link connection of an 8-socket server

Conclusion and Solution

Solution:

There are four common types of RH5885 V2 CPU and QPI alarms. The CPU and port numbers correspond to those in the link diagram. The following describes how to locate the fault for each type of alarm.

Alarm 1: CPU-IOH link alarm (4-socket and 8-socket servers)

Figure 5-96 Event description

Locating method:

  1. Replace one IOH at a time to check IOH faults.
  2. Replace one CPU at a time to check CPU or socket faults.

Alarm 2: CPU-CPU link alarm (4-socket and 8-socket servers)

Figure 5-97 Event description

Locating method:

  1. Replace one CPU at a time to check CPU or socket faults.

Alarm 3: CPU-CPU link alarm (4-socket servers)

Figure 5-98 Event description

Locating method:

  1. Replace the BMC board to check BMC faults.
  2. Replace one CPU at a time to check CPU or socket faults.

Alarm 4: link alarm between primary and secondary nodes (8-socket servers)

Figure 5-99 Event description

Locating method:

  1. Locate QPI cables according to the 8-socket server link diagram and replace one cable at a time to check cable faults.
  2. Replace one CPU at a time to check CPU or socket faults.
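
The four locating methods can be kept as a small lookup table for field use. The following is a sketch that only transcribes the replacement order listed above. The numeric keys and the function name are illustrative and are not identifiers used by the iBMC.

```python
# Replacement order per alarm type, transcribed from the locating methods above.
LOCATING_STEPS = {
    1: ["IOH", "CPU"],        # CPU-IOH link alarm (4-socket and 8-socket servers)
    2: ["CPU"],               # CPU-CPU link alarm (4-socket and 8-socket servers)
    3: ["BMC board", "CPU"],  # CPU-CPU link alarm (4-socket servers)
    4: ["QPI cable", "CPU"],  # link alarm between primary and secondary nodes (8-socket servers)
}


def locating_steps(alarm_type):
    """Return the parts to swap, in order, one at a time, for the given alarm type."""
    try:
        return list(LOCATING_STEPS[alarm_type])
    except KeyError:
        raise ValueError(f"unknown alarm type: {alarm_type}") from None


for alarm in sorted(LOCATING_STEPS):
    print(alarm, "->", ", then ".join(locating_steps(alarm)))
```
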
Experience

None

Note

None

RH2288H V2 BIOS Does Not Contain the Hyper-Threading Parameter
Problem Description
Table 5-74 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

FusionServer

Release Date

2016-06-13

Keyword

BIOS, hyper-threading, HT

Symptom

After an E5-2609 v2 CPU is installed in a server, the hyper-threading (HT) parameter (Intel HT Technology) is missing in the BIOS.

Figure 5-100 Processor type
Figure 5-101 No HT parameter (E5-2609 v2)
Figure 5-102 HT parameter (other CPUs)
Key Process and Cause Analysis

Cause analysis:

The E5-2609 v2 CPU does not support HT. Therefore, the BIOS does not have the HT parameter Intel HT Technology.

https://ark.intel.com/products/75787/Intel-Xeon-Processor-E5-2609-v2-10M-Cache-2_50-GHz

Figure 5-103 CPU parameters on the Intel website
Conclusion and Solution

Conclusion:

The E5-2609 v2 CPU does not support HT. Therefore, the BIOS does not have the HT parameter Intel HT Technology.

Solution:

You can use either of the following methods:

1. Purchase a CPU that supports HT. For details about the CPUs that support HT, visit the official Intel website at:

https://ark.intel.com/

2. If the CPU HT function is not required, no action is required.
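
If a Linux OS is already running, the HT state can also be cross-checked from the OS. The following is a minimal sketch, assuming an x86 /proc/cpuinfo that reports the "siblings" and "cpu cores" fields: HT is active when one physical package exposes more logical CPUs than cores. This check is not part of the original case, and the sample text is illustrative.

```python
def hyper_threading_active(cpuinfo_text):
    """Return True if the first CPU package reports more logical CPUs (siblings) than cores."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "siblings" and siblings is None:
            siblings = int(value)
        elif key == "cpu cores" and cores is None:
            cores = int(value)
    if siblings is None or cores is None:
        raise ValueError("siblings / cpu cores fields not found")
    return siblings > cores


# Illustrative excerpt for a 4-core CPU without HT.
sample = "model name\t: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz\nsiblings\t: 4\ncpu cores\t: 4\n"
print("HT active:", hyper_threading_active(sample))
```

On a live system, the text would come from reading /proc/cpuinfo.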

Experience

None

Note

None

Server Does Not Restart After ipmitool Delivers the power cycle Command
Problem Description
Table 5-75 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

FusionServer

Release Date

2016-07-18

Keyword

Power cycle, not restarted

Symptom

The client is directly connected to the MGMT port on the server. After the power cycle command is delivered to the server, the server does not restart.

ipmitool -I lanplus -H 192.168.2.100 -U root -P Huawei12#$ power cycle

192.168.2.100 indicates the iBMC IP address.

root indicates the user name required for logging in to the iBMC.

Huawei12#$ indicates the password of the iBMC root user.
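
When the command is issued from a script, it is safer to keep the password off the command line. The following is a minimal sketch, assuming ipmitool is installed on the client. It uses the ipmitool -E option, which reads the password from the IPMI_PASSWORD environment variable. The helper names are hypothetical.

```python
import os
import subprocess


def ipmitool_power_argv(host, user, action):
    """Build the ipmitool argument list for a power action over the lanplus interface."""
    if action not in {"status", "cycle"}:
        raise ValueError(f"unsupported action: {action}")
    # -E tells ipmitool to take the password from the IPMI_PASSWORD environment variable.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", "power", action]


def run_power_action(host, user, password, action):
    """Run the command with the password passed through the environment only."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    return subprocess.run(
        ipmitool_power_argv(host, user, action),
        env=env, capture_output=True, text=True, check=False,
    )


# Show the command that would be run; nothing is executed here.
print(" ".join(ipmitool_power_argv("192.168.2.100", "root", "cycle")))
```

Querying "status" before and after "cycle" gives a simple way to confirm whether the server actually changed power state.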

Key Process and Cause Analysis

Cause analysis:

The following conditions must be met for the server to respond to the power cycle command delivered by ipmitool:

The OS is running properly, and the power management service is enabled.

In Linux, run the service acpid status command to check that the acpid service is running.

In Windows, choose Control Panel > Hardware and Sound > Power Options > System Settings. On the page displayed, set When I press the power button to Shut down.

Conclusion and Solution

Solution:

  1. Check whether Graceful Power-off Timeout Period of the server is set to 0.

    On the iBMC CLI, run ipmcget -d shutdowntimeout.

    Alternatively, check the setting on the iBMC WebUI.

    If yes, go to 2.

    If no, go to 3.

  2. Check whether the OS is running properly (that is, the OS can respond to the keyboard).

    If yes, go to 4.

    If no, forcibly restart the OS.

  3. If the OS still does not restart after the graceful power-off timeout period expires, go to 5.
  4. Check that the power management of the OS is enabled.

    In Linux, run the service acpid status command to check whether the acpid service is running. If not, run the service acpid start command to enable the acpid service.

    In Windows, set When I press the power button to Shut down.

  5. Contact Huawei technical support.
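
The branching in the steps above can be summarized as a small decision function. The following is a sketch that mirrors the numbered steps. The parameter names are illustrative, and the "no further action" result for a server that restarts once the timeout expires is an assumption, because the steps do not state that outcome explicitly.

```python
def next_action(timeout_is_zero, os_responsive, restarted_after_timeout):
    """Map the checks from steps 1 to 3 to the action to take next."""
    if timeout_is_zero:
        # Step 2: with a zero timeout, the result depends on whether the OS responds.
        if os_responsive:
            return "check OS power management"   # step 4
        return "forcibly restart the OS"
    # Step 3: a non-zero timeout is configured, so wait for it to expire first.
    if restarted_after_timeout:
        return "no further action"                # assumed outcome, see lead-in
    return "contact Huawei technical support"     # step 5


print(next_action(timeout_is_zero=True, os_responsive=True, restarted_after_timeout=False))
```
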
Experience

None

Note

None

Exceptional Breakdown and CAT Error of the RH2288 V2
Problem Description
Table 5-76 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

RH2288 V2

Release Date

2018-05

Keyword

RH2288 V2, breakdown, CAT error

Symptom

The RH2288 V2 server breaks down twice. The alarm information shows that CAT errors occur on CPU1 and CPU2.

Key Process and Cause Analysis

Problem analysis:

No exception is found in the FDM logs. However, ECC error alarms about DIMM010 and DIMM011 are generated in the SEL logs. The ECC and CAT alarms occur at the same time. In addition, the number of ECC errors has exceeded the threshold. Exceptions in the CPU, DIMM, and mainboard may cause the CPU CAT error. The logs are insufficient for locating the cause. Therefore, the on-site engineers replace DIMM010 and DIMM011, and continue monitoring the server to see whether any exception persists.

Conclusion and Solution

Solution:

Replace DIMM010 and DIMM011, and monitor the server. No exception occurred during the subsequent monitoring period.

Common Problems of RAID Controller Cards and Hard Drives

The BMC Cannot Identify Hard Disks When an RH2285 Is Configured With LSI SAS1068E Cards
Problem Description
Table 5-77 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

RH2285

Release Date

2010-09-26

Keyword

LSI SAS1068E, BMC, disk monitoring

Symptom
  1. Log in to the BMC WebUI of the RH2285, and choose System Information > System Status. The displayed page shows that the disks are not detected.
    Figure 5-104 No hard disk detected

  2. Choose System Info > Hard Disk Monitoring. The displayed page shows that the disks are not detected.
    Figure 5-105 No hard disk detected
  3. Check the hard disk information in the BIOS. The hard disks are displayed there.
    Figure 5-106 Hard disk information in the BIOS
Key Process and Cause Analysis

Key process:

Mini-SAS cables are not connected properly after the chassis cover is removed. The correct cable connections are as follows: When you view the RH2285 from the front to the rear, cable connections are "left top and right bottom". That is, ports (port 0 to port 3) on the left of the RAID controller card correspond to the ports on the top of the backplane; ports (port 4 to port 7) on the right of the RAID controller card correspond to the ports at the bottom of the backplane. However, the onsite RH2285 adopts the "left bottom and right top" cable connections.

Figure 5-107 Correct connection

Cause analysis:

The RH2285 management system uses a low-speed signal cable of the master Mini-SAS cable to obtain the backplane information, including the backplane logic version and the presence status of hard disks. If Mini-SAS cables are not connected properly, the corresponding management cable channels are incorrectly connected, and the BMC cannot identify the hard disk status.
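
The "left top and right bottom" rule can be written down as a mapping and compared with what is found on site. The following is a minimal sketch. The dictionary keys and the function name are illustrative labels, not identifiers used by the BMC or the RAID controller card.

```python
# Correct cabling for the RH2285, viewed from the front to the rear.
EXPECTED_CABLING = {
    "left (ports 0-3)": "top",
    "right (ports 4-7)": "bottom",
}


def cabling_errors(actual):
    """Return {controller connector: (found, expected)} for every wrongly cabled connector."""
    return {
        connector: (actual.get(connector), expected)
        for connector, expected in EXPECTED_CABLING.items()
        if actual.get(connector) != expected
    }


# The onsite RH2285 in this case used "left bottom and right top".
onsite = {"left (ports 0-3)": "bottom", "right (ports 4-7)": "top"}
for connector, (found, expected) in cabling_errors(onsite).items():
    print(f"{connector}: connected to {found} backplane port, expected {expected}")
```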

Conclusion and Solution

Conclusion:

The BMC cannot read the hard disk information because Mini-SAS cables are connected reversely.

Solution:

Exchange the two Mini-SAS cables connected to the 1068E card or the hard disk backplane. In this way, the BMC can display the hard disk information properly.

Experience

Open the BIOS or 1068E configuration interface, and check whether hard disks are normal.

  • If yes, exchange Mini-SAS cable connections.
  • If no, remove and reinstall the 1068E card, Mini-SAS cables, and hard disks.
Note

The same method applies when the RH2285 is configured with one LSI SAS1078 card, one expander backplane, and two SATA drives: after the Mini-SAS cable connections are exchanged, the BMC can display the hard disk information properly.

An Unconfig Good Hard Drive of the LSI SAS2208 RAID Controller Card Automatically Enters the Rebuild Status
Problem Description
Table 5-78 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

Servers equipped with an LSI SAS2208 RAID controller card

Release Date

2015-10-11

Keyword

LSI SAS2208 card, Rebuild

Symptom

Symptom:

An RH2288H V2 server is equipped with LSI SAS2208 cards and four 600 GB SAS hard drives (slot 0 to slot 3), of which slot 0 and slot 1 are configured for RAID1 and slot 2 and slot 3 are not configured. The hard drive in slot 0 is removed while the server is still running, and the hard drive in slot 3 then enters the Rebuild status. After synchronization is complete, the hard drives in slot 3 and slot 1 form a normal RAID1.

Key Process and Cause Analysis

Cause analysis:

By default, the LSI SAS2208 card allows an Unconfig Good hard drive to replace a faulty drive in the RAID array. If there are multiple Unconfig Good hard drives, the hard drive with the quickest response is chosen to replace the faulty drive.

Conclusion and Solution

Solution:

To disable the hot spare drive function of Unconfig Good hard drives, open the WebBIOS, click Controller Properties, and set Emergency Spare to GHS (Global Hot Spare). Emergency Spare indicates the emergency hot spare function, which means that if there is no hot spare drive, the RAID controller card will replace faulty drives based on the emergency hot spare settings. The default value of this parameter is UG and GHS (Unconfig Good and Global Hot Spare), and other values are none, UG (Unconfig Good), and GHS (Global Hot Spare).
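
The effect of the four Emergency Spare values can be illustrated with a simplified model. The following sketch only encodes which drive states each value makes eligible, as described above. It is not the controller firmware logic, and the function name and drive map are hypothetical.

```python
# Drive states that each Emergency Spare value allows as a replacement for a faulty drive.
EMERGENCY_SPARE_MODES = {
    "None": set(),
    "UG": {"Unconfig Good"},
    "GHS": {"Global Hot Spare"},
    "UG and GHS": {"Unconfig Good", "Global Hot Spare"},  # default value
}


def emergency_spare_candidates(mode, drives):
    """Return the slots whose drive state is eligible under the given Emergency Spare value."""
    allowed = EMERGENCY_SPARE_MODES[mode]
    return sorted(slot for slot, state in drives.items() if state in allowed)


# Configuration from this case after the drive in slot 0 is removed.
drives = {1: "Online", 2: "Unconfig Good", 3: "Unconfig Good"}
print("default:", emergency_spare_candidates("UG and GHS", drives))
print("GHS:", emergency_spare_candidates("GHS", drives))
```

With the default value, both unconfigured drives are candidates, which matches the rebuild observed on slot 3. With GHS, neither is eligible.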

The procedure for changing the value is as follows:

  1. During the POST phase of the server, press Ctrl+H as prompted on the RAID controller card self-check screen to go to the RAID controller card screen, as shown in Figure 5-108.
    Figure 5-108 RAID controller card screen

  2. Click Start, as shown in Figure 5-109.
    Figure 5-109 Adapter Selection

  3. Click Controller Properties, as shown in Figure 5-110.
    Figure 5-110 Controller Properties

  4. Click Next, as shown in Figure 5-111.
    Figure 5-111 Next

  5. Click Next, as shown in Figure 5-112.
    Figure 5-112 Next

  6. Click Next, as shown in Figure 5-113.
    Figure 5-113 Next

  7. Set Emergency Spare to GHS, and click Submit, as shown in Figure 5-114.
    Figure 5-114 Setting Emergency Spare
Experience

None

Note

None

Handling DASD Alarms for the Optical Channel Diagnosis Panel in an RH5485
Problem Description
Table 5-79 Basic information

Item

Information

Source of the Problem

RH5485

Intended Product

RH5485

Release Date

2011-10-22

Keyword

RH5485, DASD, Raid

Author

Han Yao (employee ID: 171887)

Symptom

Hardware configuration

RH5485 server with two hard drives.

Figure 5-115 and Figure 5-116 show an optical channel diagnosis panel. The DASD alarm indicator is in the red box. It indicates the hard disk drive status.

Figure 5-115 Optical channel diagnosis panel

Figure 5-116 Diagram of the optical channel diagnosis panel

Symptoms

DASD alarms are generated on the optical channel diagnosis panel; the system error indicator is steady on; "00" is displayed in the checkpoint code display. If two hard drives are inserted in slots 6 and 7, the server cannot detect them. The same applies to slots 0 and 1. If the two hard drives are inserted in slots 1 and 2, the server can detect them. The integrated management module (IMM) log includes the following alarms:

I --  -- 9/23/2011:8:45:19 -- Remote Login Successful. Login ID: USERID from Web at IP address 10.142.67.199
I --  -- 9/22/2011:13:55:40 -- Rebuild completed for Array in system "Host"
I --  -- 9/22/2011:13:55:40 -- Critical Array "Host" has deasserted
I --  -- 9/22/2011:12:36:22 -- Rebuild in progress for Array in system "Host"
E --  -- 9/22/2011:12:36:22 -- Array in system "Host" is in critical condition
I --  -- 9/22/2011:12:33:54 -- Rebuild completed for Array in system "Host"
I --  -- 9/22/2011:12:33:54 -- Critical Array "Host" has deasserted
I --  -- 9/22/2011:12:33:43 -- "Host Power" has been turned on
I --  -- 9/22/2011:12:33:31 -- "Host Power" has been turned off
I --  -- 9/22/2011:12:17:33 -- Redundancy "Power Group 1" has been restored
I --  -- 9/22/2011:12:17:27 -- "Power Supply 1" has returned to a Normal Input State
I --  -- 9/22/2011:12:8:51 -- Rebuild in progress for Array in system "Host"
E --  -- 9/22/2011:12:8:51 -- Array in system "Host" is in critical condition
Key Process and Cause Analysis

Key process

  • The DASD alarm indicator is lit on the optical channel diagnosis panel, indicating that the hard disk drive is faulty or removed.
  • If the checkpoint code indicator is blinking Cx and 00 alternately, the bus cannot be repaired.
  • According to the IMM log, the array is abnormal, and the RAID controller card data is being repeatedly reconstructed.
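
Repeated reconstruction can be spotted by counting the "Rebuild in progress" entries in an exported IMM log. The following is a minimal sketch, assuming the log is saved as text in the format shown in the symptom description. The regular expression and the function name are illustrative and are not an official IMM log parser.

```python
import re

# Severity letter, empty source field, timestamp, then the message text.
# An optional leading "N." line number is tolerated.
LINE_RE = re.compile(
    r"^\s*(?:\d+\.\s*)?(?P<sev>[IWE]) --\s+-- (?P<ts>\S+) -- (?P<msg>.*?)\s*$"
)


def rebuild_starts(log_text):
    """Return the timestamps of all 'Rebuild in progress' entries, in log order."""
    starts = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("msg").startswith("Rebuild in progress"):
            starts.append(match.group("ts"))
    return starts


SAMPLE = """
I --  -- 9/22/2011:13:55:40 -- Rebuild completed for Array in system "Host"
I --  -- 9/22/2011:12:36:22 -- Rebuild in progress for Array in system "Host"
E --  -- 9/22/2011:12:36:22 -- Array in system "Host" is in critical condition
I --  -- 9/22/2011:12:8:51 -- Rebuild in progress for Array in system "Host"
"""

starts = rebuild_starts(SAMPLE)
print(f"{len(starts)} rebuilds started at: {starts}")
```

More than one rebuild start within a short period, as in this excerpt, suggests that the array keeps dropping back to the critical state.
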

The preceding information indicates that the server array is abnormal. In this case, perform the following operations:

  1. Reinstall the following components:
    • Faulty hard disk drive
    • Backplane of the SAS hard disk drive
    • SAS cables
    • RAID controller card
    • Input/output (I/O) board components

      Then restart the server.

  2. If the problem persists after the previous steps are performed, delete the current RAID controller card array, and reconfigure its array.
Conclusion and Solution

Conclusion

DASD alarms indicate a fault in the hard drive subsystem, that is, in the hard drives, hard drive backplane, SAS cables, or RAID controller card. Both faulty hardware and poor connections can cause DASD alarms.

In this case, the DASD alarms and the abnormal array are caused by poor contact in the storage subsystem.

Solution

When DASD alarms occur, check the hard drives, power off the server, and then reinstall the storage subsystem. Figure 5-117 shows a hard drive subsystem. In the figure, red cables indicate SAS cables used for connecting the hard drive backplane to the RAID controller card.

Figure 5-117 RAID controller card and hard drive subsystem

Experience

None

Note

None

A RAID Cannot Be Rebuilt After "PD Missing" Is Displayed
Problem Description
Table 5-80 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

RH2285, T6000, X6000, and E6000

Release Date

2012-03-05

Keyword

1078, PD Missing, rebuild

Symptom

Hardware configuration

An E6000 configured with one redundant array of independent disks (RAID) controller card (1078)

Symptom

The E6000 uses the LSI SAS1078 controller card to configure two hard drives with RAID 1 properties. After one hard drive fails and is replaced with a new one, data cannot be synchronized to the new hard drive, and the new hard drive cannot be set to be the hot spare drive, as shown in Figure 5-118.

Figure 5-118 Hard drive failed to be set to a hot spare drive

Key Process and Cause Analysis

Key process

  1. In the Logical View window, FOREIGN is displayed for the new hard drive, as shown in Figure 5-119. If a hard drive already has RAID configuration, the hard drive is identified as FOREIGN by the LSI SAS1078 controller card.
    Figure 5-119 Logical View

  2. After the existing RAID configuration is deleted from the new hard drive, the hard drive can be set to be the global hot spare drive and enters the synchronization state.

Cause analysis

RAID configuration exists on the new hard drive before replacement. As a result, data cannot be synchronized to the new hard drive.

Conclusion and Solution

Conclusion

RAID configuration exists on the new hard drive before replacement. As a result, data cannot be synchronized to the new hard drive, and the hard drive cannot be set to be the hot spare drive.

Solution

If "PD Missing" is displayed and a hard drive needs to be replaced, check whether the new hard drive already has RAID configuration. If yes, delete the RAID configuration. Otherwise, data cannot be synchronized to the new hard drive. For details about how to view and delete old RAID configuration, see "Clearing Foreign Configurations" in the HUAWEI Server RAID Controller Card User Guide.

Experience

If "PD Missing" is displayed and a hard drive needs to be replaced, check whether the new hard drive already has RAID configuration. If yes, delete the RAID configuration. Otherwise, data cannot be synchronized to the new hard drive.

Note

None

LSI SAS2208 RAID Controller Card Firmware Does Not Support Local Setting
Problem Description
Table 5-81 Basic information

Item

Information

Source of the Problem

LSI SAS2208 redundant array of independent disks (RAID) controller card

Intended Product

V2 servers

Release Date

2012-09-01

Keyword

2208, no longer supported, license key

Symptom

Hardware configuration

RAID controller card: SR220, SR320, SR620, RU220, or RU620

Symptom

A server is configured with an LSI SAS2208 controller card. During the RAID controller card initialization, "The native configuration is no longer supported by the current controller and firmware" is displayed, as shown in Figure 5-120. The RAID controller card initialization is not complete, and no hard drive is detected.

Figure 5-120 No hard drive detected

Key Process and Cause Analysis

Cause analysis

When the LSI SAS2208 controller card is powered on for the first time, the license subcard is checked, and information about the license subcard is saved in the nonvolatile random access memory (NVRAM) of the RAID controller card. Every time the RAID controller card is powered on, the system checks whether the current license subcard information is the same as that in the NVRAM. If not, the message "The native configuration is no longer supported by the current controller and firmware" is displayed. The message indicates that the verification of RAID license information fails. The license subcard information is highlighted in the red circle shown in Figure 5-121.

Figure 5-121 License subcard

Conclusion and Solution

Conclusion

If the current license subcard information is not the same as that in the NVRAM of the RAID controller card, the message "The native configuration is no longer supported by the current controller and firmware" is displayed.

Solution

Perform the following steps to delete data in the NVRAM to match the new license subcard:

  1. Check that no RAID information exists in the RAID controller card.
    • Check that no RAID properties are configured for hard drives.
    • Remove a hard drive if the hard drive has been configured with RAID properties.
  2. Power on the server and press Ctrl+Y during the LSI SAS2208 controller card initialization, as shown in Figure 5-122.
    Figure 5-122 LSI SAS2208 controller card initialization

  3. The Megacli screen is displayed, as shown in Figure 5-123.
    Figure 5-123 Megacli screen

  4. Run the -adpnvram -clear -a0 command to delete data in the NVRAM, as shown in Figure 5-124.
    Figure 5-124 Running the -adpnvram -clear -a0 command

  5. Data in the NVRAM is successfully deleted. Restart the system, as shown in Figure 5-125.
    Figure 5-125 Data successfully deleted

  6. The new RAID works properly. You can also import the previous RAID configuration for hard drives.
Experience

The SR220, SR320, SR620, RU220, or RU620 controller card configured on a V2 server carries a license subcard. If the license subcard is replaced with a new one, or the LSI SAS2208 controller card is not tested before internal use, this problem occurs.

Note
  1. If an LSI SAS2208 controller card has no license subcard, the registration requiring a license will fail, and the RAID controller card cannot be used. The information shown in Figure 5-126 is displayed.
    Figure 5-126 Registration failed

  2. If the hard drives are configured with RAID properties and inserted in the backplane, the NVRAM data deletion fails. You need to remove the hard drives that have been configured with RAID properties and then delete data in the NVRAM.
The OS Cannot Be Accessed Due to an Exception in the Hard Drive Partition Table
Problem Description
Table 5-82 Basic information

Item

Information

Source of the Problem

Problem in Online Devices

Intended Product

RH2285, E6000, and X6000

Release Date

2012-01-28

Keyword

RH2285, damaged Pre1

Symptom

Hardware configuration

An RH2285 configured with an LSI SAS1068E controller card, and two 1 TB hard drives configured with RAID 1 properties for installing an operating system (OS)

Software configuration

SUSE11 SP1 64-bit

Symptom

A fault occurs during the RAID controller card self-check. As a result, the OS cannot be accessed. Figure 5-127 shows the fault information, which is highlighted in the red square.

Figure 5-127 Pre1 displayed for OS access failure

Key Process and Cause Analysis

Key process

  1. Restart the server five times. The problem persists.
  2. Use the hard drive test tool Toolkit to check the hard drive health status. The hard drives are normal.
  3. Reinstall the OS. The OS works properly, indicating that the problem is resolved.

Cause analysis

An exception occurs in the hard drive partition table. As a result, the OS cannot start.

Conclusion and Solution

Conclusion

The OS cannot be accessed due to an exception in the hard drive partition table.

Solution

Reinstall the OS.

Experience

The fault shown in Figure 5-127 is caused by an exception in the hard drive partition table. You are advised to check the hard drive health status by using the hard drive test tool. If the hard drives are normal, reinstall the OS.

Note

For a SATA hard drive, if the health status value is less than 40, replace the hard drive with a new one.

Slot IDs Start from a Non-Zero Number During LSI SAS1078 RAID Controller Card Self-Check
Problem Description
Table 5-83 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

E6000, RH2285/RH1285, and T6000

Release Date

2011-06-11

Keyword

LSI SAS1078, SLOT ID, megacli

Author

Qi Yanjun (employee ID: 00171744)

Symptom

When server installation is complete, slot IDs of hard drives start from a non-zero number during LSI SAS1078 RAID controller card (1078 RAID controller card for short) test, as shown in Figure 5-128.

Figure 5-128 Incorrect slot ID sequence

Generally, hard drive slot IDs start from zero, as shown in Figure 5-129.

Figure 5-129 Correct slot ID sequence

Key Process and Cause Analysis

Cause analysis

For a newly installed server, the 1078 RAID controller card identifies hard drive slot IDs from zero by default. If the 1078 RAID controller card was once connected to a hard drive, certain slot IDs are used. Therefore, slot IDs start from a non-zero number.

Slot IDs are used to label hard drives and have no impact on hard drive applications. If customers require that slot IDs start from zero, use megacli to modify slot IDs.

Conclusion and Solution

Conclusion

Slot IDs are saved in the non-volatile random access memory (NVRAM) of the 1078 RAID controller card. You can use megacli to clear the NVRAM and initialize slot IDs.

NOTE:

In Windows and Linux, megacli cannot clear the NVRAM. megacli can clear the NVRAM only in disk operating system (DOS).

megacli can be used only when the server memory capacity does not exceed 2 GB; however, servers generally have more than 4 GB of memory. In this case, temporarily replace the server memory with a 2 GB memory module, clear the NVRAM, and then reinstall the original memory. Otherwise, an error message is reported when megacli is used to clear the NVRAM.

Solution

  1. Run the megacli -adpnvram -clear -a0 command to clear the NVRAM (LSI provides a DOS version of megacli).

  2. Restart the server, and check whether the NVRAM is cleared.

Experience

None

KVM Does Not Respond and a GRUB Error Occurs After SLES 11 SP3 Is Installed on Multiple Hard Drives on an RH2288 V2
Problem Description
Table 5-84 Basic information

Item

Information

Source of the Problem

RH2288 V2-12L

Intended Product

RH2288 V2-12L

Release Date

2015-01-19

Keyword

No response from KVM, GRUB error

Symptom

Hardware configuration

RH2288H V2-12L running SUSE Linux Enterprise Server (SLES) 11 SP3 x86_64

Symptom

  • Four SATA drives are configured as a RAID 0 array, and two SAS drives are configured as a RAID 1 array.
  • SLES 11 SP3 is installed.

GRUB error 21 is reported. See Figure 5-130.

Figure 5-130 GRUB error 21

Key Process and Cause Analysis
  1. No hardware alarm is generated on the BMC. It is concluded that the error is not caused by a hardware fault.
  2. The error message indicates that the GRUB file may have been damaged due to a power failure or a GRUB boot error may have occurred.

    GRUB error code description:

    21: Selected drive does not exist

  3. After an attempt to power on the server, the BIOS screen is not displayed during the boot process. The KVM does not respond. It is concluded that the KVM is disabled in the BIOS.
Conclusion and Solution
  1. Take action to resolve the problem that the KVM does not respond.

    After an attempt to restart the BMC, the problem persists. It is concluded that the KVM is disabled in the BIOS. To resolve the problem, restore the default CMOS values as follows:

    Log in to the BMC over Secure Shell (SSH) and run the ipmcset -d clearcmos command.

  2. Reinstall the OS. For details, see Huawei Server OS Installation Guide.
Experience

None

Note

None

The Message "Software RAID can not be Configured" Is Displayed During SoftRAID Initialization
Problem Description
Table 5-85 Basic information

Item

Information

Source of the Problem

RH2285H V2

Intended Product

RH2285H V2/RH2288H V2

Release Date

2014-12-04

Keyword

SoftRAID

Symptom

Hardware configuration

RH2285H V2 server

Symptom

During SoftRAID initialization, the system displays "LSI Software RAID can not be Configured" but does not display "Ctrl+M", as shown in Figure 5-131. As a result, the user cannot open the SoftRAID configuration screen. The sensor Cable/Interconnect (SAS Cable) generates the alarm "Config error".

Figure 5-131 "Ctrl+M" is not displayed

Key Process and Cause Analysis

Key process

  1. The BIOS version is V396, which meets SoftRAID requirements.
  2. After the user removes and then reconnects the mini-SAS cable, the problem persists. The alarm "config error" still exists.
  3. A SAS drive is inserted in slot 0 onsite. After the SAS drive is replaced with a SATA drive, the problem persists.
  4. When a SAS drive or no hard drive is inserted in slot 0 in a lab, the message "LSI Software RAID can not be Configured" is displayed.

Cause analysis

A 12-bay expander backplane is used onsite. SoftRAID can connect to a pass-through backplane but not to an expander backplane. Therefore, the hard drive backplane on the server does not support SoftRAID.

Conclusion and Solution

Conclusion

A 12-bay expander backplane is used onsite. SoftRAID can connect to a pass-through backplane but not to an expander backplane. Therefore, the hard drive backplane on the server does not support SoftRAID.

Solution

Use the LSI SAS2308 or LSI SAS2208 controller card.

Experience

SoftRAID can connect to a pass-through backplane but not to an expander backplane.

Note

SoftRAID is a simplified RAID solution. It uses the Intel Platform Controller Hub (PCH) and LSI MegaRAID solution. The Intel PCH provides six Serial Advanced Technology Attachment (SATA) ports for signal output, and the LSI MegaRAID solution supports RAID 0, 1, 5, and 10.

Table 5-86 lists the SoftRAID specifications.

Table 5-86 SoftRAID specifications

Item

Description

Chip

Intel PCH

Supported RAID levels

RAID 0, 1, 5, and 10

Number of supported RAID arrays

8

Hard drive type

SATA HDDs, SATA SSDs, or a mix of both

Main device port

Six 3 Gbit/s SATA ports

Other features

  • A global hot spare drive is supported.
  • Only 64 KB stripes are supported.

Table 5-87 lists the RAID levels and drive quantities supported by SoftRAID.

Table 5-87 Supported RAID levels and drive quantities

RAID Level

Number of Supported Hard Drives

Maximum Number of Failed Drives Allowed

RAID 0

1–6

0

RAID 1

2

1

RAID 5

3–6

1

RAID 10

4–6

Depending on the number of subgroups

Each subgroup allows only one failed drive.

Note: SoftRAID does not allow the failure of adjacent hard drives in a RAID array.
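
The limits in Table 5-87 can be checked before an array is created. The following is a minimal sketch, not a Huawei or LSI tool: the function names are hypothetical, and the code assumes that each RAID 10 subgroup is a two-drive mirror, which the table does not state explicitly.

```python
# Drive-count limits per RAID level, taken from Table 5-87.
SOFTRAID_DRIVE_LIMITS = {
    "RAID 0": (1, 6),
    "RAID 1": (2, 2),
    "RAID 5": (3, 6),
    "RAID 10": (4, 6),
}


def check_softraid_config(level, drive_count):
    """Return (ok, reason) for a proposed SoftRAID array."""
    limits = SOFTRAID_DRIVE_LIMITS.get(level)
    if limits is None:
        return False, f"{level} is not a RAID level supported by SoftRAID"
    low, high = limits
    if not low <= drive_count <= high:
        return False, f"{level} requires {low} to {high} drives, got {drive_count}"
    # Assumption: RAID 10 subgroups are two-drive mirrors, so the count must be even.
    if level == "RAID 10" and drive_count % 2 != 0:
        return False, "RAID 10 requires an even number of drives"
    return True, "ok"


def max_failed_drives(level, drive_count):
    """Return the maximum number of failed drives allowed (Table 5-87)."""
    ok, reason = check_softraid_config(level, drive_count)
    if not ok:
        raise ValueError(reason)
    if level == "RAID 0":
        return 0
    if level in ("RAID 1", "RAID 5"):
        return 1
    # RAID 10: one failed drive per subgroup (assumed to be a two-drive mirror).
    return drive_count // 2
```

For RAID 10, the returned value is an upper bound only: it holds when each failed drive is in a different subgroup.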

NOTE:
  • When using SoftRAID, upgrade the basic input/output system (BIOS) of the server to the SoftRAID version. For details, see the upgrade guide of the server. To download the BIOS upgrade package, visit the Huawei Support Enterprise website, choose Support > Downloads > IT > FusionServer, and select the server model and version.
  • When using SoftRAID to configure RAID 5, install a RAID key. For details, see the user guide of the server.
  • SoftRAID supports a maximum of six hard drives. To support eight hard drives, install a RAID controller card and Serial Attached SCSI (SAS) cables. For details, see "Replacing the RAID Controller Card" in the user guide of the server.

Method for distinguishing a pass-through backplane and an expander backplane:

An 8-bay backplane can only be a pass-through backplane.

A 12-bay or 24-bay backplane can be either a pass-through backplane or an expander backplane.

There are two methods to distinguish them:

Method 1: Observe the appearance.

A mini-SAS port on a pass-through backplane supports four PHYs.

A 12-bay pass-through backplane provides three mini-SAS ports and a 24-bay pass-through backplane provides six mini-SAS ports whereas an expander backplane provides only two mini-SAS ports.

Method 2: Check the BOM number.

Pass-through backplane:

03021EGK: Manufactured Board,Tecal RH2285,BC11THBB,12HDD direct backplane,Board ID 0x49

03021WNL: Manufactured Board,Tecal RH2285 V2,BC11THBD,24HDD direct backplane,Board ID 0x45,Server

Expander backplane:

03021EGM: Manufactured Board,Tecal RH2285,BC11THBA,12HDD backplane,Board ID 0x48,1*2

03021UWF: Manufactured Board,Tecal RH2285 V2,BC11THBC,24HDD backplane,Board ID 0x4d,Server,1*2
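
Method 2 can be expressed as a small lookup. The sketch below covers only the four BOM numbers listed above; the function name and the None return value for unlisted BOM numbers are illustrative choices, not part of any Huawei tool.

```python
# BOM number -> (backplane type, number of drive bays), from the list above.
BACKPLANES = {
    "03021EGK": ("pass-through", 12),
    "03021WNL": ("pass-through", 24),
    "03021EGM": ("expander", 12),
    "03021UWF": ("expander", 24),
}


def softraid_supported(bom_number):
    """Return True or False for a listed BOM number, or None if it is not listed."""
    entry = BACKPLANES.get(bom_number.strip().upper())
    if entry is None:
        return None
    backplane_type, _bays = entry
    # SoftRAID can connect to a pass-through backplane but not to an expander backplane.
    return backplane_type == "pass-through"
```

A None result means the backplane must be identified by another method, for example by counting the mini-SAS ports as described in Method 1.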

Two Physical Drives Configured as a RAID 1 Array by Using a SoftRAID Controller Are Displayed in Device Manager After Windows Server 2008 R2 Is Installed
Problem Description
Table 5-88 Basic information

Item

Information

Source of the Problem

RH2285H V2

Intended Product

RH2285H V2/RH2288H V2

Release Date

2014-11-18

Keyword

LSI SoftRAID, Windows Server 2008 R2

Symptom

Hardware configuration

RH2285H V2 server where two hard drives are configured as a RAID 1 array by using the LSI SoftRAID controller

Symptom

After Windows Server 2008 R2 is installed by using the installation DVD on an RH2285H V2 server, the two hard drives configured as a RAID 1 array are displayed in Device Manager.

Key Process and Cause Analysis

Key process

  1. The following storage controllers are displayed in Device Manager:

    Intel® C600 Series Chipset SAS RAID (SATA mode)

    Intel® C600 Series Chipset SAS RAID Controller

  2. Reinstall Windows Server 2008 R2. After the OS is installed, you can see that only one drive is displayed in Device Manager. The drive is the configured RAID 1 array.

Cause analysis

The installation DVD for Windows Server 2008 R2 does not contain the LSI SoftRAID controller driver. Therefore, the Intel chipset driver is loaded during DVD-based OS installation. As a result, you see two physical drives instead of a virtual drive after the installation.

Conclusion and Solution

Conclusion

The LSI SoftRAID controller driver is not loaded during the installation of Windows Server 2008 R2. As a result, a drive exception occurs after the installation.

Solution

To resolve the problem, use one of the following methods:

  1. Reinstall the OS.
  2. Mount the LSI SoftRAID controller driver by using virtual media, and then install Windows Server 2008 R2 by using the installation DVD. (You can download the LSI SoftRAID controller driver from the Huawei Enterprise support website.)
Experience

If an LSI SoftRAID controller is configured on a server, ensure that the LSI SoftRAID controller driver is loaded during OS installation.

Note

None

"Wrg Type" Displayed for a Hard Drive on the BIOS Screen for the LSI SAS2308 Controller Card
Problem Description
Table 5-89 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

LSI SAS2308 controller card

Release Date

2014-11-13

Keyword

LSI SAS2308, hard drive, Wrg Type

Symptom

Hardware configuration

RH2288H V2 server with the LSI SAS2308 controller card and 26 raw drives

Symptom

In the WebBIOS of the LSI SAS2308 controller card, "Wrg Type" is displayed in the Drive Status column for a hard drive on an RH2288H V2 server. See Figure 5-132.

Figure 5-132 "Wrg Type" displayed for a hard drive

Key Process and Cause Analysis

Key process

Remove the faulty drive in slot 24 and replace the drive with a new one. "Wrg Type" is not displayed for the new drive.

Cause analysis

"Wrg Type" means that the device is not compatible for use as part of the RAID array. See Figure 5-133.

Figure 5-133 Meaning of "Wrg Type"

Conclusion and Solution

Conclusion

"Wrg Type" is displayed for a hard drive because the drive is faulty.

Solution

If "Wrg Type" is displayed for a hard drive, perform the following operations:

  1. Remove and reinstall the hard drive.
  2. If the hard drive still works abnormally or "Wrg Type" is displayed again after a while, replace the hard drive with a new one.

If the original RAID array cannot be rebuilt after a hard drive is replaced, perform the following operations:

  1. Back up the data in the original RAID array.
  2. Reconfigure and initialize the RAID array.
  3. Reinstall the OS and applications.
  4. Import the original data to the RAID array.
Experience

In some scenarios, "Wrg Type" is displayed for a hard drive on the BIOS screen for the LSI SAS1064E, LSI SAS1068E, or LSI SAS2308 controller card. To resolve the problem, remove and reinstall the faulty hard drive. If the problem persists, replace the faulty hard drive.

If data in the original RAID array is damaged, the RAID array may fail to be rebuilt after a hard drive is replaced.

Note

None

Error Message "single-bit ecc errors" Is Displayed During Self-Check of the LSI SAS2208 RAID Controller Card
Problem Description
Table 5-90 Basic information

Item

Information

Source of the Problem

Servers equipped with LSI SAS2208 RAID controller cards

Intended Product

Servers equipped with LSI SAS2208 RAID controller cards

Release Date

2015-04-09

Keyword

LSI SAS2208, single-bit ecc errors

Symptom

Symptom:

During the self-check process of the server, the following error information is reported on the self-check interface of the RAID controller card:

single-bit ecc errors were detected during the previous boot of the raid controller. the dimm on the controller needs replacement.
please contact technical support to resolve this issue.
press 'x' to continue or else power off the system and replace the dimm module and reboot. if you have replaced the dimm press 'x' to continue

Figure 5-134 shows the error information.

Figure 5-134 Error information reported during self-check of the RAID controller card

Key Process and Cause Analysis

Cause analysis:

The DDR memory chips on the RAID controller card are faulty.

Conclusion and Solution

Solution:

Replace the RAID controller card.

Experience

None

Note

None

Error Message "No MegaRAID Adapter Installed" Is Displayed During Self-Check of the LSI SAS2208 RAID Controller Card
Problem Description
Table 5-91 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V2 servers

Release Date

2015-04-13

Keyword

SAS2208, responding, MegaRAID

Symptom

Hardware configuration:

RH2288 V3 servers equipped with LSI SAS2208 RAID controller cards

Symptom:

During self-check, the following information is displayed:

Adapter at Baseport is not responding 
No MegaRAID Adapter Installed

Figure 5-135 shows the error information.

Figure 5-135 Error information reported during self-check of the RAID controller card

Key Process and Cause Analysis

Key process:

Disconnect the mini-SAS cable from the hard drive backplane. If the self-check still fails, replace the RAID controller card.

Cause analysis:

The RAID controller card is faulty, and consequently, the message "No MegaRAID Adapter Installed" is displayed during self-check.

Conclusion and Solution

Conclusion:

The RAID controller card is faulty, or the RAID controller card key is not inserted. Consequently, the message "No MegaRAID Adapter Installed" is displayed during self-check.

Solution:

Replace the RAID controller card.

Experience

None

Note

None

Methods to Control the Buzzer of the LSI SAS2208 RAID Controller Card
Problem Description
Table 5-92 Basic information

Item

Information

Source of the Problem

RH5885 V2

Intended Product

Servers equipped with LSI SAS2208 cards

Release Date

2014-01-21

Keyword

Linux, LSI SAS2208, buzzer

Symptom

Hardware configuration:

RH5885 V2 servers, equipped with LSI SAS2208 RAID controller cards

Symptom:

On RH5885 V2 servers, if a hard drive that belongs to a RAID array is damaged, removed, or replaced, the buzzer of the LSI SAS2208 RAID controller card sounds an alarm. Customers want to be able to turn the buzzer off.

Key Process and Cause Analysis

Cause analysis:

On RH5885 V2 servers, if a hard drive that belongs to a RAID array is damaged, removed, or replaced, the buzzer of the LSI SAS2208 RAID controller card sounds an alarm. To turn the buzzer on or off, use the configuration page of the LSI SAS2208 RAID controller card, or install the management tool supplied with the card in the OS.

Conclusion and Solution

Conclusion:

To turn the buzzer on or off, use the configuration page of the LSI SAS2208 RAID controller card, or install the management tool supplied with the card in the OS.

Solution:

Solution 1: Modify the settings on the configuration page of the RAID controller card.

  1. During the boot process of RH5885, press Ctrl+H to enter the configuration page of the LSI SAS2208 RAID controller card. Click Start, as shown in Figure 5-136.
    Figure 5-136 Adapter Selection

  2. On the control page of the RAID controller card shown in Figure 5-137, click Controller Properties.
    Figure 5-137 MegaRAID BIOS Config Utility Virtual Configuration

  3. On the setting page of the RAID controller card shown in Figure 5-138, click Next.
    Figure 5-138 MegaRAID BIOS Config Utility Controller Information (1)

  4. On the setting page of the RAID controller card shown in Figure 5-139, click Next.
    Figure 5-139 MegaRAID BIOS Config Utility Controller Information (2)

  5. On the setting page of the RAID controller card shown in Figure 5-140, choose an option in Alarm Control:
    • Enable: to enable the buzzer
    • Disable: to disable the buzzer
    • Silence: to mute the buzzer

      After completing the settings, click Submit.

      Figure 5-140 MegaRAID BIOS Config Utility Controller Properties

Solution 2: Use the supplied tool to modify the settings.

  1. In Linux, make the StorCli management tool supplied with the LSI SAS2208 RAID controller card available to the system by mounting its image (or by another method), as shown in Figure 5-141.
    Figure 5-141 Mounting the image
  2. Copy the StorCli tool to the system, as shown in Figure 5-142.
    Figure 5-142 Copying the tool
  3. Enter the folder where the tool StorCli is stored. Right-click to open a terminal, as shown in Figure 5-143.
    Figure 5-143 Opening a terminal

  4. Run the following commands to install the StorCli tool package:

    Install the rpm package:

    rpm -ivh storcli-1.01.75-1.noarch.rpm

    Create the symbolic link that matches the OS version (this example uses the 64-bit binary):

    ln -s /opt/MegaRAID/storcli/storcli64 /usr/bin/storcli

    as shown in Figure 5-144.

    Figure 5-144 Terminal page
  5. Run a command to modify the buzzer settings of the RAID controller card:

    storcli /c0 set alarm=silence

    If the settings are modified successfully, the result shown in Figure 5-144 is displayed.

    In this command:

    • c0: Indicates the target controller. (When there are multiple RAID controller cards, you need to select different controllers. For example, an 8-socket RH5885 V2 server can be equipped with two RAID controller cards.)
    • set: Indicates that this is a setting command.
    • alarm: Indicates the buzzer settings.
    • silence: Mutes the buzzer. The other values are on (enable the buzzer) and off (disable the buzzer).
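
When a server has more than one RAID controller card, the same command must be issued once per controller. The following sketch only builds the argument list described above; the helper name is hypothetical, and it assumes that the storcli symbolic link created in step 4 is on the PATH.

```python
import shlex

# Alarm states described above: on (enabled), off (disabled), silence (muted).
VALID_ALARM_STATES = ("on", "off", "silence")


def build_alarm_command(controller_index, state, storcli="storcli"):
    """Build the argument list for 'storcli /cN set alarm=STATE'."""
    if state not in VALID_ALARM_STATES:
        raise ValueError(f"alarm state must be one of {VALID_ALARM_STATES}")
    if controller_index < 0:
        raise ValueError("controller index must not be negative")
    return [storcli, f"/c{controller_index}", "set", f"alarm={state}"]


if __name__ == "__main__":
    # Print the command for two controllers, for example on an 8-socket RH5885 V2.
    for index in (0, 1):
        print(shlex.join(build_alarm_command(index, "silence")))
```

The returned list can be passed to subprocess.run on a server where the tool is installed; the sketch itself does not execute anything.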
Experience

None

Note

None

Hard Drive Formatting in the Background of an LSI SAS2208 Card Causes a RAID Array Creation Failure
Problem Description
Table 5-93 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

Servers equipped with LSI SAS2208 cards

Release Date

2015-10-22

Keyword

LSI SAS2208 card, RAID array

Symptom

Symptom:

An RH2288H V2 server is equipped with one LSI SAS2208 card and three 600 GB hard drives. During the creation of RAID5, the message "Failed to Save Configuration" is displayed, as shown in Figure 5-145.

Figure 5-145 Failed to Save Configuration

Key Process and Cause Analysis

Cause analysis:

The WebBIOS screen shows a background task (PD Progress Info) in progress. Clicking it shows that the hard drives are in the Drive Erase Progress state, as shown in Figure 5-146. After the drive erase completes, the RAID array can be created successfully.

Figure 5-146 Hard drives in Drive Erase Progress

Conclusion and Solution

Solution:

Create the RAID array after the hard drive formatting is complete.

Experience

None

Note

None

Attempts to Add New Hard Drives to an Existing RAID Array of an LSI SAS2208 RAID Controller Card Fail
Problem Description
Table 5-94 Basic information

Item

Information

Source of the Problem

RH5885H V3

Intended Product

V1, V2, and V3 servers

Release Date

2015-11-08

Keyword

Hard drive, capacity expansion

Symptom

Hardware configuration:

RH5885H V3 server, equipped with an LSI SAS2208 card (one RAID5 group is created by four hard drives and is further divided into two VDs)

Symptom:

On-site personnel perform the operation described in section 2.4.4 "Adding New Hard Disks to RAID" in the RAID Controller Card User Guide but cannot proceed past step 3, as shown in Figure 5-147.

Figure 5-147 WebBIOS interface of the LSI SAS2208 card after the on-site capacity expansion operation fails

Figure 5-148 shows the normal interface when step 3 of "Adding New Hard Disks to RAID" is successfully performed.

Figure 5-148 Normal WebBIOS interface when step 3 of "Adding New Hard Disks to RAID" is successfully performed

Key Process and Cause Analysis

Key process:

  1. The RAID array on site is RAID5, created by four hard drives and further divided into two VDs, as shown in Figure 5-149.
    Figure 5-149 Logical View of the WebBIOS interface of the LSI SAS2208 card on site

  2. LSI SAS2208 RAID controller cards do not allow new hard drives to be added to RAID arrays that have been divided into two or more VDs. Therefore, the symptom is a normal occurrence.

Cause analysis:

LSI SAS2208 RAID controller cards do not allow new hard drives to be added to RAID arrays that have been divided into two or more VDs. Therefore, the symptom is a normal occurrence.

Conclusion and Solution

Conclusion:

LSI SAS2208 RAID controller cards do not allow new hard drives to be added to RAID arrays that have been divided into two or more VDs.

Solution:

LSI SAS2208 RAID controller cards do not allow new hard drives to be added to RAID arrays that have been divided into two or more VDs. Therefore, do not add new hard drives.

Experience
  1. Descriptions about the cause of this issue should be added to the RAID Controller Card User Guide.
  2. When you use the storcli tool to perform this operation, the system also indicates the same failure, as shown in Figure 5-150.
    Figure 5-150 Failure alert
Note

None

Preserved Cache Causes LSI SAS2208 Card Configuration Errors
Problem Description
Table 5-95 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V2 servers

Release Date

2015-12-16

Keyword

SAS2208, preserved cache

Symptom

Hardware configuration:

RH2288H V2 server equipped with one LSI SAS2208 RAID controller card

Symptom:

Symptom 1: The RAID array is offline, and the following messages are displayed during self-check:

There are offline or missing virtual drives with preserved cache. 
Please check the cables and ensure that all drives are present. 
Press any key to enter the configuration utility.

as shown in Figure 5-151.

Figure 5-151 Messages displayed

The same messages are still displayed after restart.

Symptom 2: Hard drive faults cause RAID0 (one hard drive) to fail. A new hard drive is installed, but during reconfiguration of RAID0, the following error is displayed:

sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:8] WB Direct -a0


Adapter 0: Configure Adapter Failed 

FW error description:  
 The current operation is not allowed because the controller has data in cache for offline or missing virtual drives. 
Key Process and Cause Analysis

Key process:

  1. For symptom 1, press any key as prompted to enter the WebBIOS of the RAID controller card. When the prompt shown in Figure 5-152 is displayed, click Discard Cache.
    Figure 5-152 Clicking Discard Cache

    Save the operation and exit. The preserved cache is then cleared. The error messages are not displayed any more after restart.

  2. For symptom 2, run the following command under the OS to clear the preserved cache:

    ./storcli64 /c0/vall delete preservedcache

    Then run commands to configure RAID0.
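
The check for symptom 2 can be scripted. The sketch below looks for the firmware error text quoted above in captured MegaCli output and, if it is present, returns the storcli command from step 2. The function names are hypothetical, and the code only builds the command; it does not run it.

```python
# Firmware error text from symptom 2, lowercased for case-insensitive matching.
PRESERVED_CACHE_MARKER = "data in cache for offline or missing virtual drives"


def needs_preserved_cache_clear(cli_output):
    """Return True if the output contains the preserved cache error."""
    return PRESERVED_CACHE_MARKER in cli_output.lower()


def clear_command(controller_index=0):
    """Return the argument list for the storcli command shown in step 2."""
    return ["./storcli64", f"/c{controller_index}/vall", "delete", "preservedcache"]


def next_action(cli_output, controller_index=0):
    """Return the command to run before retrying RAID creation, or None."""
    if needs_preserved_cache_clear(cli_output):
        return clear_command(controller_index)
    return None
```

Clearing the preserved cache discards the data it holds, so run the returned command only after confirming that the data is no longer needed.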

Cause analysis:

There is data in the cache of the RAID controller card. When you restart the server or replace hard drives, the data in the cache cannot be written into the new hard drives, which results in the problem mentioned above.

Conclusion and Solution

Conclusion:

The RAID controller card holds preserved cache data. When the server is restarted or hard drives are replaced, the data in the cache cannot be written to the new hard drives, and an error is reported.

Solution:

Clear the preserved cache as described in "Key Process and Cause Analysis."

Experience

None

Note

None

Member Drive in an LSI SAS2208 Card RAID Array Changes Its Slot Position Online, Triggering an Alarm That Cannot Be Cleared
Problem Description
Table 5-96 Basic information

Item

Information

Source of the Problem

RH2288A V2

Intended Product

V1, V2, and V3 servers

Release Date

2015-12-29

Keyword

Drive in failed array

Symptom

Hardware configuration:

RH2288A V2, equipped with an LSI SAS2208 card and five hard drives (slot 0 to slot 4) as RAID5

Symptom:

The hard drive in slot 3 is removed while the service system is running, and the iBMC reports the Disk3 In Failed Array alarm. The removed hard drive is then installed in slot 5, the foreign RAID configuration on the drive is cleared, and the drive is set as a hot spare drive. RAID5 returns to the normal state after the rebuild completes, as shown in Figure 5-153.

Figure 5-153 RAID5 restored

However, the iBMC still reports the Disk3 In Failed Array alarm.

Key Process and Cause Analysis

Cause analysis:

The RAID controller card records the slot of each member drive in the RAID array. When a member drive is disconnected, a hot spare drive or emergency spare drive in another slot can take part in the rebuild process to restore the RAID array. However, the indicator for the original member drive's slot stays lit as an alarm, and the iBMC reports the DiskN In Failed Array alarm (N indicates the physical slot number of the hard drive). This mechanism reminds users that slot N once held a member drive that has not been restored, so the current slot configuration differs from the original one.

Conclusion and Solution

Solution:

  1. Shut down the server, and remove all hard drives (that is, remove them from the hard drive backplane).
  2. Boot the server, and during the POST phase, press Ctrl+H to enter the RAID controller card configuration interface. Then choose Configuration Wizard > Clear Configuration to clear the original RAID configuration information.
  3. Shut down the server, and reinsert all hard drives.
  4. Boot the server, and during the POST phase, press Ctrl+H to enter the RAID controller card configuration interface. Then choose Scan Drives > Preview > Import to import the hard drive RAID configuration information.
Experience

If the hot spare drive is removed online, the BMC also generates a record of In failed array. The same symptom occurs when the hot spare drive changes its slot position online.

Note

None

Hard Drives of the LSI SAS2208 Card Are Removed and Inserted Too Quickly, Leading to Disorder in Slot Numbers
Problem Description
Table 5-97 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V1, V2, and V3 servers

Release Date

2015-12-29

Keyword

Disorderly slot numbers of hard drives

Symptom

Hardware configuration:

As shipped from the factory, the RH2288H V2 server has 300 GB SAS hard drives in rear slots 12 and 13 and 4 TB SATA hard drives in front slots 0 to 7. On site, four Intel 480 GB SSDs are added to the server and installed in slots 0 to 3, and the 4 TB SATA hard drives originally in those slots are moved to slots 8 to 11. The OS is installed on a RAID1 array created from the hard drives in slots 12 and 13.

Symptom:

MegaCli64 -pdlist -a0 is run under the OS to display hard drive information. The output is inconsistent with the physical slots: two hard drives are reported in slot 8, and no hard drive is reported in slot 1, as shown in Figure 5-154 and Figure 5-155.

Figure 5-154 MegaCli command output

Figure 5-155 Physical View of the RAID controller card
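
The comparison between the command output and the physical slots can be automated. The following sketch assumes that the captured MegaCli64 -pdlist -a0 output contains one "Slot Number: N" line per physical drive; the function name and the sample text are illustrative only.

```python
import re
from collections import Counter

# Assumption: each physical drive appears with a line such as "Slot Number: 8".
SLOT_LINE = re.compile(r"^\s*Slot Number:\s*(\d+)\s*$", re.MULTILINE)


def slot_anomalies(pdlist_output, expected_slots):
    """Return (duplicated, missing) slot numbers found in the pdlist output."""
    counts = Counter(int(slot) for slot in SLOT_LINE.findall(pdlist_output))
    duplicated = sorted(slot for slot, seen in counts.items() if seen > 1)
    missing = sorted(set(expected_slots) - set(counts))
    return duplicated, missing


if __name__ == "__main__":
    # Illustrative sample: slot 8 is reported twice and slot 1 is absent.
    reported = [0, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13]
    sample = "\n".join(
        f"Enclosure Device ID: 32\nSlot Number: {slot}" for slot in reported
    )
    print(slot_anomalies(sample, range(14)))
```

Any duplicated or missing slot number indicates that the slot information recorded by the RAID controller card no longer matches the physical layout.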

Key Process and Cause Analysis

Cause analysis:

The Intel 480 GB SSDs and 4 TB SATA hard drives are removed and inserted too quickly during slot change (according to standard requirements, the interval for hard drive removal and insertion should be at least 30 seconds). Consequently, the RAID controller card confuses the slot information of the hard drives, as shown in Figure 5-156.

Figure 5-156 Log information showing that the hard drives are removed and inserted within 10 seconds
Conclusion and Solution

Solution:

  1. Shut down the server, and detach all hard drives from the hard drive backplane.
  2. Boot the server, and during the POST phase, press Ctrl+Y to enter the RAID controller card command line interface. Then run -adpnvram -clear -a0 to clear the NVRAM, and press Ctrl+Alt+Del to restart.
  3. After the RAID controller card completes its self-check, shut down the server, and reinsert all hard drives.
  4. Boot the server, and during the POST phase, press Ctrl+H to enter the RAID controller card configuration interface. Then choose Scan Drives > Preview > Import to import the hard drive RAID configuration information.
Experience

None

Note

According to standard requirements, the interval for hard drive removal and insertion should be at least 30 seconds, and hard drives must not be tilted. Otherwise, the hard drives may be scratched during high-speed operation of the HDD magnetic head.

Bad Blocks in the RAID 1 Active Drive Lead to Rebuild Failure
Problem Description
Table 5-98 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

V1 servers

Release Date

2014-07-24

Keyword

LSI SAS1078 RAID controller card, rebuild failure

Symptom

Hardware configuration:

RH2285 server + LSI SAS1078 card + 300 GB SAS x 2

Symptom:

Two hard drives form RAID 1, but one of them becomes faulty and is replaced. However, the rebuild process fails, and the yellow light is on for the newly installed hard drive.

Key Process and Cause Analysis

Key process:

Use the MegaCLI tool to collect the RAID controller card log information, and find that the following error is reported during the rebuild process:

2218: 14-07-02,02:08:12 Info:State change on PD 03(e0xfc/s3) from HOT SPARE(2) to REBUILD(14) 
2219: 14-07-02,02:09:36 Info:Unexpected sense: PD 00(e0xfc/s0) Path 5000cca043200d71, CDB: 28 00 01 c9 3f 00 00 00 80 00, Sense: F0 00 03 01 C9 3F 0C 18 00 00 00 00 11 14 00 80 00 8A 00 00 F7 CC 00 00 00 19 AA 01 08 B6 00 00  
2220: 14-07-02,02:09:36 FATAL:Unrecoverable medium error during rebuild on PD 00(e0xfc/s0) at 1c93f0c 
2221: 14-07-02,02:09:36 FATAL:Puncturing bad block on PD 03(e0xfc/s3) at 1c93f0c 
2257: 14-07-02,02:10:41 WARNING:Error on PD 03(e0xfc/s3) (Error 02) 
2258: 14-07-02,02:10:41 Info:State change on PD 03(e0xfc/s3) from REBUILD(14) to FAILED(11) 
2259: 14-07-02,02:10:41 CRITICAL:Rebuild failed on PD 03(e0xfc/s3) due to target drive error

Cause analysis:

Log analysis shows that the rebuild failed because of a bad block in the active drive: PD 00 (slot 0) reported an unrecoverable medium error at 1c93f0c while the newly installed drive, PD 03 (slot 3), was being rebuilt.
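
The same analysis can be done with a short script when the log is long. The sketch below assumes that every log line follows the layout of the excerpt above (sequence number, timestamp, severity, message) and that drives are written as PD nn(e0x../sN); the function names are hypothetical.

```python
import re

# Layout assumed from the excerpt above: "<seq>: <timestamp> <SEVERITY>:<message>".
LOG_LINE = re.compile(r"^(\d+): (\S+) (\w+):(.*)$")
# Physical drive reference, for example "PD 00(e0xfc/s0)"; group 3 is the slot.
PD_REF = re.compile(r"PD ([0-9a-fA-F]+)\(e(0x[0-9a-fA-F]+)/s(\d+)\)")


def severe_events(log_text):
    """Return (timestamp, severity, slot, message) for FATAL and CRITICAL lines."""
    events = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue
        _seq, timestamp, severity, message = match.groups()
        if severity.upper() not in ("FATAL", "CRITICAL"):
            continue
        drive = PD_REF.search(message)
        slot = int(drive.group(3)) if drive else None
        events.append((timestamp, severity.upper(), slot, message.strip()))
    return events


def medium_error_slots(log_text):
    """Return the slots of drives that reported an unrecoverable medium error."""
    return sorted(
        {
            slot
            for _ts, _sev, slot, message in severe_events(log_text)
            if slot is not None and "medium error" in message.lower()
        }
    )
```

Applied to the excerpt above, medium_error_slots points at the surviving member drive rather than the replacement drive, which is why replacing the new drive again does not help.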

Conclusion and Solution

Solution:

There are two solutions to resolve this problem:

  • Back up data, use two new hard drives to form RAID 1, and copy data to RAID 1. The drawback is that the OS and application software need to be reinstalled.
  • Use dd, DiskGenius, or other tools to copy the data of the entire drive to the new RAID array. This saves the trouble of reinstalling the OS and software.
Experience

For problems related to LSI SAS1078 and LSI SAS2208 RAID controller cards, you must use the megacli tool to collect RAID controller card information and log information.

Note

None

Hard Drive Formatting on a Windows System Fails
Problem Description
Table 5-99 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

All servers

Release Date

2016-03-02

Keyword

RAID, formatting, dynamic drive

Symptom

Symptom:

  1. Exchange reports an error, as shown in Figure 5-157.
    Figure 5-157 Error reported by the Exchange

    This problem is related to drive sector size: Exchange 2010 does not support 4096-byte sectors.

  2. Figure 5-158 shows the printed result.
    Figure 5-158 Printed result
    Figure 5-159 Symptom (1)
    Figure 5-160 Symptom (2)
Key Process and Cause Analysis

Key process:

Figure 5-161 shows a normal server that is running properly.

Figure 5-161 Normal server

Figure 5-162 shows a faulty server.

Figure 5-162 Faulty server

Difference: For a normal server, the size of its physical sector is displayed as Not Supported.

If Bytes Per Physical Sector is displayed as Not Supported, it is likely that the drive is not in 512e compatibility mode. On the destination end, Bytes Per Physical Sector is not displayed at all (the KB 982018 update is required). Some third-party tools, such as DiskGenius, can also display this value, as shown in Figure 5-163.

Figure 5-163 Bytes Per Physical Sector

Cause analysis:

The physical sector size of the RAID drive is 4096 bytes, and a 512-byte physical sector size is not supported.
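
The sector sizes can also be read programmatically from saved command output. The sketch below assumes the text comes from the Windows command fsutil fsinfo ntfsinfo and contains "Bytes Per Sector" and "Bytes Per Physical Sector" lines; the source figures do not name the command used, so treat both the command and the function names as assumptions.

```python
import re

# Matches "Bytes Per Sector : 512" and "Bytes Per Physical Sector : 4096".
SECTOR_FIELD = re.compile(
    r"^\s*(Bytes Per (?:Physical )?Sector)\s*:\s*(.+?)\s*$", re.MULTILINE
)


def sector_sizes(ntfsinfo_output):
    """Return the sector fields as integers, or None when no number is reported."""
    sizes = {}
    for name, value in SECTOR_FIELD.findall(ntfsinfo_output):
        sizes[name] = int(value) if value.isdigit() else None
    return sizes


def reports_4k_physical(ntfsinfo_output):
    """Return True or False, or None when the physical sector size is not reported."""
    physical = sector_sizes(ntfsinfo_output).get("Bytes Per Physical Sector")
    if physical is None:
        return None
    return physical == 4096
```

A True result corresponds to the faulty server in this case; a None result corresponds to a server where the value is absent or displayed as Not Supported.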

Conclusion and Solution

Conclusion:

According to the Exchange 2010 log records, the drive must have a 512-byte physical sector size.

After a RAID array is configured on the LSI SAS2208, the virtual drive (VD) reports a 4096-byte physical sector size, and 512 bytes is not supported.

Solution:

  1. Convert the hard drive from a basic drive into a dynamic drive, as shown in Figure 5-164.
    Figure 5-164 Converting a basic drive into a dynamic drive

  2. Format the hard drive, as shown in Figure 5-165.
    Figure 5-165 Formatting the hard drive

Verification record:

  1. By default, the physical sector of a drive supports 4K, as shown in Figure 5-166.
    Figure 5-166 4K supported

  2. In that case, 4K is supported regardless of how the drive is formatted and which partition table (GPT or MBR) is used.
  3. All you need to do is convert the drive into a dynamic drive.

    Figure 5-167 shows the result of the formatting operation.

    Figure 5-167 Result of the formatting operation
Experience
  1. If the problem is complicated and you are not sure whether you can resolve it, do not perform debugging operations in customers' environment.
  2. Establish a test environment, and find out the root cause by combining theory with practice.
  3. Track subsequent results after providing a solution to customers.
Note

None

Failed to Clear Preserved Cache
Problem Description
Table 5-100 Basic information

Item

Information

Source of the Problem

LSI SAS2208

Intended Product

All servers

Release Date

2016-04-04

Keyword

LSI SAS2208, Preserved Cache

Symptom

Symptom

  1. A hard drive in a RAID 0 array was faulty. The customer replaced the drive, and attempted to clear the preserved cache for the LSI SAS2208 controller card, as shown in Figure 5-168. After the operation was complete, the preserved cache still existed on the screen.
    Figure 5-168 Clearing preserved cache
Key Process and Cause Analysis

Cause analysis

This is an inherent problem of LSI SAS2208. The preserved cache occasionally fails to be cleared on the CU screen, resulting in RAID array configuration failures.

Conclusion and Solution

Solution 1: Use StorCLI to clear preserved cache. (The following uses Linux as an example and assumes that the user can log in to the OS.)

  1. On the RAID configuration screen, set Boot Error Handling to Ignore errors. See Figure 5-169.
    Figure 5-169 Setting Boot Error Handling

  2. Download and decompress the StorCLI package.
    1. Log in to the official Avago website and download the StorCLI package. See Figure 5-170.
      Figure 5-170 Downloading the StorCLI package

    2. Decompress the StorCLI package.
    NOTE:

    For details about how to install StorCLI on each OS, see Readme.txt in the folder for that OS in the decompressed package.

  3. Upload the StorCLI package (for example, storcli-1.17.08-1.noarch.rpm for Linux OSs) to /tmp of the server OS.
  4. Run rpm -ivh storcli-1.17.08-1.noarch.rpm to install StorCLI.

  5. After the installation is complete, run find / -name "*storcli*" to query the installation directory.
  6. Go to the directory and run the following command to clear preserved cache:
    [linux~host]# cd /opt/MegaRAID/storcli
    [linux~host]# ./storcli64 /c0/vall delete preservedcache
  7. Restart the server.
  8. Go to the RAID configuration screen and configure a new RAID array.
  9. Set Boot Error Handling to Stop on Errors.

Solution 2: Use the toolkit to clear preserved cache. (Use this solution when the OS login fails.)

  1. Log in to the Huawei Enterprise support website.
  2. Choose Support > Documents and Software > IT > Server > TaiShan > FusionServer Tools, click the Downloads tab, and download the latest toolkit.
  3. Mount the toolkit ISO file to the virtual drive of the server. See Figure 5-171.
    Figure 5-171 Mounting the toolkit

  4. Restart the server from the virtual drive and load the toolkit as prompted.
  5. After the loading is complete, press C to go to the CLI, and enter the user name and password. See Figure 5-172.
    Figure 5-172 Logging in to the toolkit CLI

  6. Run the following command to clear preserved cache:
    linux:/ #cd /opt/MegaRAID/storcli 
    linux:/opt/MegaRAID/storcli #./storcli64 /c0/vall delete preservedcache
  7. Restart the server.
  8. Go to the RAID configuration screen and configure a new RAID array.
Experience

None

Note

None

Supercapacitor Faults on a Server Configured with the LSI SAS2208 or LSI SAS3108
Problem Description
Table 5-101 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

All servers configured with the LSI SAS2208 or LSI SAS3108

Release Date

2015-03-31

Keyword

Supercapacitor, fault

Symptom

Hardware configuration:

RH2288 V2 + LSI SAS2208 + supercapacitor

Software version:

Firmware of any version

Symptom:

After the server is started, the supercapacitor is faulty, and Required is displayed for Battery Replacement.

Key Process and Cause Analysis

Key process:

The logs show that the voltage of the supercapacitor is abnormal: it reaches about 12000 mV (12169 mV in the following log), which exceeds the configured maximum charging voltage of 11000 mV.

03/24/15 10:43:43: ____________________________________________________________

03/24/15 10:43:43: Temperature : 33 C Voltage : 12169 mV

03/24/15 10:43:43: Current : 85 mA Charging Current : 0 mA

03/24/15 10:43:43: ChargingVoltage : 10888 mV RSOC : 100

03/24/15 10:43:43: ESR : 210 mOhm Capacitance : 6400 mFarad

03/24/15 10:43:43: Cap1Voltage : 0 mV Cap2Voltage : 0 mV

03/24/15 10:43:43: Cap3Voltage : 0 mV Cap4Voltage : 0 mV

03/24/15 10:43:43: Cap5Voltage : 0 mV

03/24/15 10:43:43: BQ33100 Alert Status : 0x0

03/24/15 10:43:43: BQ33100 Safety Status : 0x100

03/24/15 10:43:43: BQ33100 Operation Status : 0x1814

03/24/15 10:43:43: packEnergy : 473 joules remainingReserveSpace : 96

03/24/15 10:43:43: tfmState = 3

03/24/15 10:43:43: FW Status = 2060

03/24/15 10:43:43: BBUGood = 0

03/24/15 10:43:43: GPIO Status:

03/24/15 10:43:43: GPIO Expander Status = bf

03/24/15 10:43:43: BBE = 0

03/24/15 10:43:43: BBSTATUS = 0

03/24/15 10:43:43: PE_PWRGD = 1

03/24/15 10:43:43: PE_PWRLOSSN = 1

03/24/15 10:43:43: PWRDN_OK = 1

03/24/15 10:43:43: SCAP_TFM_FAULT = 1

03/24/15 10:43:43: Misc:

03/24/15 10:43:43: CCR_SDRAM_CTRL= 00003808

03/24/15 10:43:43: CCR_MISC_CFG = c003ff0e

03/24/15 10:43:43: ____________________________________________________________

The supercapacitor is faulty.

Cause analysis: The voltage of the supercapacitor is abnormal, and logs show that the supercapacitor is in the fault state. Therefore, the supercapacitor is faulty.
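
The voltage check can be scripted against a collected log. The following is a minimal Python sketch, assuming the log contains a "Voltage : N mV" reading in the layout shown above and using the 11000 mV maximum stated in this case. The function names are hypothetical.

```python
import re

MAX_CHARGING_VOLTAGE_MV = 11000  # configured maximum stated in this case

def supercap_voltage_mv(log_text):
    """Return the first 'Voltage : N mV' reading in the log, or None."""
    # The lookbehind skips fields such as ChargingVoltage and Cap1Voltage.
    m = re.search(r"(?<![A-Za-z0-9])Voltage\s*:\s*(\d+)\s*mV", log_text)
    return int(m.group(1)) if m else None

def is_overvoltage(log_text, limit_mv=MAX_CHARGING_VOLTAGE_MV):
    """Return True when a voltage reading exists and exceeds the limit."""
    voltage = supercap_voltage_mv(log_text)
    return voltage is not None and voltage > limit_mv
```

Applied to the log above, the reading is 12169 mV, which is over the limit. The other fault indicators in the log, such as SCAP_TFM_FAULT, still need to be checked separately.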

Conclusion and Solution

Conclusion:

The supercapacitor is faulty.

Solution:

Replace the supercapacitor.

Appendix 1: The method for determining the supercapacitor status is as follows:

The status of the battery or supercapacitor is logged. Determine the status of the supercapacitor based on the following information:

Charging Status : Charging

Voltage : OK (If OK is not displayed, an exception occurs.)

Temperature : OK

Learn Cycle Requested : Yes

Learn Cycle Active : No

Learn Cycle Status : OK (If the value is Failed, replace the supercapacitor.)

Learn Cycle Timeout : No (If Yes is displayed, replace the supercapacitor.)

I2C Errors Detected : No

Battery Pack Missing : No (If Yes is displayed, replace the supercapacitor.)

Replacement required : No (If Yes is displayed, replace the supercapacitor.)

Remaining Capacity Low : No (If Yes is displayed, replace the supercapacitor.)

Periodic Learn Required : No

Transparent Learn : No

No space to cache offload : No (If Yes is displayed, replace the supercapacitor.)

Pack is about to fail & should be replaced : No (If Yes is displayed, replace the supercapacitor.)

Cache Offload premium feature required : No

Module microcode update required : No
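
The replacement criteria in Appendix 1 can be expressed as a lookup. The following is a minimal Python sketch, assuming the status has been collected as "Field : Value" lines without the explanatory remarks in parentheses. The function names are hypothetical, and the field names are taken from the list above.

```python
# Field -> value that, per Appendix 1, calls for replacement.
REPLACE_IF = {
    "Learn Cycle Status": "Failed",
    "Learn Cycle Timeout": "Yes",
    "Battery Pack Missing": "Yes",
    "Replacement required": "Yes",
    "Remaining Capacity Low": "Yes",
    "No space to cache offload": "Yes",
    "Pack is about to fail & should be replaced": "Yes",
}

def parse_status(text):
    """Parse 'Field : Value' lines into a dict."""
    status = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            status[name.strip()] = value.strip()
    return status

def replacement_reasons(status):
    """Return the reported fields whose value calls for replacement."""
    reasons = []
    for name, value in status.items():
        for field, bad in REPLACE_IF.items():
            # endswith() also accepts the longer label
            # "Battery Replacement required" seen in some outputs.
            if name.lower().endswith(field.lower()) and value.lower() == bad.lower():
                reasons.append(name)
    return reasons
```

An empty result means that none of the replacement criteria is met. The Voltage field is not covered by the lookup and still needs to be checked for OK.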

Experience

Even though all supercapacitors are tested during the production phase, some supercapacitors may become faulty when working in the live network. Check supercapacitor errors based on the collected logs.

Note

None

Cache Data Is Lost Because the Server Configured with LSI SAS2208 and the Dedicated iBBU Is Powered Off for a Long Time After a Power Failure
Problem Description
Table 5-102 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

All servers configured with the LSI SAS2208 and the dedicated intelligent backup battery unit (iBBU)

Release Date

2016-04-30

Keyword

2208, battery, power failure, cache data loss

Symptom

Hardware configuration:

RH2288H V2 configured with the LSI SAS2208 and the dedicated iBBU

Software version:

Firmware of any version

Symptom:

When the server is powered on after a power failure, a message is displayed indicating that cache data is lost during the startup.

Key Process and Cause Analysis

The error information is logged. The first record is shown in the following figure.

Check the timestamp of the OS startup in the logs, which is April 14, 2016.

Check the timestamp of the latest OS running before this startup, which is January 16, 2016.

The interval between two startups is three months.

The dedicated iBBU of the LSI SAS2208 supplies power to the DDR memory of the LSI SAS2208 to protect cache data when the server is powered off abnormally. The Huawei iBBU dedicated to the LSI SAS2208 can supply power for 48 hours, after which it can no longer power the DDR memory. As a result, the cache data is lost.
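
The comparison between the power-off interval and the retention time is simple date arithmetic. The following is a minimal Python sketch using the two dates from the logs and the 48-hour figure stated above. The function name is hypothetical, and because the inputs are dates, the interval is accurate only to the day.

```python
from datetime import date, timedelta

IBBU_RETENTION = timedelta(hours=48)  # retention time stated in this case

def exceeded_retention(last_running, next_startup, retention=IBBU_RETENTION):
    """Return True if the power-off interval is longer than the retention time."""
    return (next_startup - last_running) > retention

# Dates taken from the logs in this case.
interval = date(2016, 4, 14) - date(2016, 1, 16)
print(interval.days)  # 89
```

An interval of 89 days is far longer than 48 hours, which is consistent with the cache data loss reported at startup.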

Cause analysis:

The server is powered off abnormally and is not powered on for a long time, which exceeds the iBBU power protection time.

Conclusion and Solution

Conclusion:

The server is powered off abnormally and is not powered on for a long time, which exceeds the iBBU power protection time.

Solution:

If the server configured with the iBBU dedicated for the LSI SAS2208 is powered off abnormally, restore the power supply within 48 hours.

Experience

If the server configured with the iBBU dedicated for the LSI SAS2208 is powered off abnormally, restore the power supply within 48 hours.

Note

Battery fault identification standard (iBBU09):

1. If Yes is displayed for Battery Replacement required, replace the iBBU.

2. If the value of Full Charge Capacity is lower than 960 mAh, the iBBU is aged.

3. If the value of Design Capacity is not 1500 mAh, the iBBU is faulty.
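
The three criteria can be applied mechanically to the values read from the battery information. The following is a minimal Python sketch. The function name and the finding labels are hypothetical, and the thresholds are the ones given in this note.

```python
AGED_BELOW_MAH = 960        # Full Charge Capacity threshold from this note
EXPECTED_DESIGN_MAH = 1500  # Design Capacity expected for iBBU09

def assess_ibbu09(replacement_required, full_charge_mah, design_mah):
    """Return a list of findings based on the three criteria above."""
    findings = []
    if replacement_required:
        findings.append("replace")
    if full_charge_mah < AGED_BELOW_MAH:
        findings.append("aged")
    if design_mah != EXPECTED_DESIGN_MAH:
        findings.append("faulty")
    return findings
```

For example, an iBBU that reports no replacement requirement, a Full Charge Capacity of 1262 mAh, and a Design Capacity of 1500 mAh yields no findings.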

LSI SAS2208 RAID Controller Card Lights up the Hard Drive Yellow Indicator Due to Drive Failure
Problem Description
Table 5-103 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

All servers configured with RAID controller cards

Release Date

2016-04-27

Keyword

Hard drive failure, sense key, hardware error

Symptom

Hardware configuration:

RH2288H V2 configured with the LSI SAS2208

Symptom:

A hard drive turns to the FAILED state, and the hard drive indicator is steady yellow.

Key Process and Cause Analysis

Key process:

Check the RAID controller card log.

The following log indicates that the hard drive in slot 3 is faulty. As a result, the RAID controller card sets the hard drive to the FAILED state.

Check the content about slot 3 in the log. The hard drive in slot 3 has reported a sense key 4/03/00.

The sense key 4/03/00 indicates that the hard drive has a hardware fault and needs to be replaced.

Cause analysis:

The hard drive has a hardware fault.

Conclusion and Solution

Conclusion:

The hard drive has a hardware fault.

Solution:

Replace the hard drive.

Experience

When the hard drive indicator turns yellow, you can check the RAID controller card log, system log, and hard drive SMART information to determine whether the hard drive is faulty.

The RAID controller card log (2208/3108) or system log (2308/3008) can be used to determine whether a hard drive is faulty if the sense key returned by the hard drive is displayed.

In the log, the keyword of sense key is Sense, as shown in the following figure.

Alternatively, check bytes 2, 12, and 13 (starting from 0) of the log shown in the following figure.

You can query the meaning of sense key code in Wikipedia.

https://en.wikipedia.org/wiki/Key_Code_Qualifier

Generally, hardware error, medium error, and self-initiated-reset occurred indicate that the hard drive is faulty.

In other cases, determine the cause based on other logs.
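
Reading bytes 2, 12, and 13 can be automated once the raw sense data has been extracted from the log. The following is a minimal Python sketch, assuming fixed-format sense data in which the low four bits of byte 2 hold the sense key and bytes 12 and 13 hold the additional sense code and its qualifier. The function names are hypothetical, and only a few sense keys are named.

```python
SENSE_KEYS = {
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

def decode_sense(sense_bytes):
    """Return (sense key, ASC, ASCQ) from fixed-format sense data."""
    if len(sense_bytes) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    # Byte 2: sense key in the low nibble. Bytes 12 and 13: ASC and ASCQ.
    return sense_bytes[2] & 0x0F, sense_bytes[12], sense_bytes[13]

def format_kcq(sense_bytes):
    """Format the triple the way it is written in this document."""
    key, asc, ascq = decode_sense(sense_bytes)
    name = SENSE_KEYS.get(key, "other")
    return f"{key:X}/{asc:02X}/{ascq:02X} ({name})"
```

A buffer whose byte 2 is 0x04, byte 12 is 0x03, and byte 13 is 0x00 is formatted as 4/03/00 (HARDWARE ERROR), which matches the sense key reported for slot 3 in this case.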

Note

None

LSI SAS2208 Dedicated iBBU Relearn Failure
Problem Description
Table 5-104 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

All servers configured with the LSI SAS2208 and the dedicated intelligent backup battery unit (iBBU)

Release Date

2015-08-13

Keyword

2208, battery, relearn failure

Symptom

Hardware configuration:

RH2288H V2 configured with the LSI SAS2208 and the dedicated iBBU

Software version:

3.190.05-1669

Symptom:

The relearn fails to be executed on the LSI SAS2208 dedicated iBBU.

$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -BbuLearn -a0

Adapter 0: BBU Learn Failed

Exit Code: 0x01

Key Process and Cause Analysis

When the relearn fails to be executed, check the error information in the RAID controller card log.

The preceding information indicates that the command fails because the iBBU is being charged.

Run the MegaCli64 -AdpBbuCmd -a0 command to check the iBBU status.

The iBBU is not being charged.

It is determined by LSI that the problem is caused by the 3.190.05-1669 software version. If the iBBU receives the relearn command when its remaining battery power is 95%, the iBBU state enters an infinite loop, in which the iBBU cannot execute the relearn command successfully, or execute charging or discharging operations.

This problem no longer exists in later firmware versions. Upgrade the firmware.

Alternatively, you can use the management tool to run the battery hibernation command.

MegaCli -AdpBbuCmd -BbuMfgSleep -a0

The preceding command can temporarily set the iBBU to the hibernation state. The iBBU automatically recovers from the hibernation state and resets the iBBU state to restore the battery management.

Cause analysis:

If the LSI SAS2208 dedicated iBBU receives the relearn command at a specific power level, the iBBU enters an abnormal state in which it can neither relearn nor charge or discharge.

Conclusion and Solution

Conclusion:

The firmware version 3.190.05-1669 has a bug. As a result, the iBBU enters an abnormal state when it receives the relearn command at a specific power level.

Solution:

Workaround: Run the MegaCli -AdpBbuCmd -BbuMfgSleep -a0 command to reset the iBBU state.

Solution: Upgrade the firmware to a later version.

Experience

None

Note

None

LSI SAS2208 Is Reset Due to a Chip Fault
Problem Description
Table 5-105 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

All servers configured with the LSI SAS2208

Release Date

2014-03-20

Keyword

2208, reset, PMU

Symptom

Hardware configuration:

RH2288 V2 configured with the LSI SAS2208

Software version:

Firmware of any version

Symptom:

During the stress test, the RAID controller card restarts unexpectedly. As a result, the OS does not respond and crashes.

Key Process and Cause Analysis

"Pmu Msg Fault" is displayed in the logs.

The PMU error is caused by the chip slow silicon. You need to replace the RAID controller card.

Cause analysis:

A PMU fault is reported when the LSI SAS2208 is reset, which is caused by the chip slow silicon.

Conclusion and Solution

Conclusion:

A PMU fault is reported when the LSI SAS2208 is reset, which is caused by the fault of the control chip on the RAID controller card.

Solution:

Replace the RAID controller card.

Experience

None

Note

None

Read Performance of Dual-Drive RAID 1 and RAID 10 Is Less Than Twice That of a Single Drive on a Server Configured with the LSI SAS2208
Problem Description
Table 5-106 Basic information

Item

Information

Source of the Problem

Tecal RH2285

Intended Product

Tecal RH2285

Release Date

2016-06-15

Keyword

2208, RAID 1, RAID 10, performance

Symptom

Hardware configuration:

RH2285 (eight or twelve drives) + LSI SAS2208

Two SATA drives form a RAID 1 array. The read performance of a single drive is about 150 MB/s, but the read performance of RAID 1 is 230 MB/s, not twice that of a single drive as it should be.

Figure 5-173 RAID 1 performance
Key Process and Cause Analysis

Key process:

  1. Check whether the RAID array is in the reconstruction or Patrol Read phase based on the LSI SAS2208 RAID controller card log. If yes, the problem may occur.
  2. Check whether other services are running in the OS. If yes, the FIO test result may be lower than expected.

    You can run iostat to view the service I/O status when no test is performed.

  3. Upgrade the driver and firmware to the latest version released.

    The firmware of later versions optimizes the performance. However, the system-native driver is not compatible with later firmware versions and needs to be upgraded.

    For details about the mapping, visit http://support.huawei.com/enterprise/SoftwareVersionActionNew!showVDetailNew?lang=en&idAbsPath=fixnode01|7919749|9856522|21782478|21463589|21588909&pid=21588909&vrc=21588912|21588913|21630292|21852491&from=soft&tab=bz&bz_vr=21588913&bz_vrc=&nbz_vr=null.

  4. If no exception is found after performing the preceding steps, run modinfo megaraid_sas to check the driver parameters.

    The lb_pending_cmds parameter is used to control the I/O queues of a RAID array. If the performance does not meet requirements, you can set this parameter to a larger value to increase I/O queues to optimize the performance.

Cause analysis:

To ensure the consistency between the system driver and the kernel, the number of I/O queues of the driver is reduced. As a result, the I/O combination effect of the driver is not obvious, and the read performance of the RAID array is affected.

Conclusion and Solution

Conclusion:

Change the value of lb_pending_cmds in the new driver to 16 so that the I/O combination effect of the driver is more obvious.

Solution:

You can either reload the driver or add the driver parameter to the kernel configuration to modify this parameter. The details are as follows:

  1. Reload the driver and add the driver parameter.

    Note: Because the RAID driver is related to the running of the controller and OS, this method applies only to the PXE or memory OS.

    Run the rmmod megaraid_sas command to unload the current driver.

    Run the modprobe megaraid_sas lb_pending_cmds=16 command to load the driver with the modified parameter value.

  2. Add the parameter to the kernel configuration.

    Run vi /boot/grub/menu.lst and add the content in the red box shown in the following figure to the kernel line.

  3. Restart the OS for the settings to take effect.

    After the preceding operations are performed, run the cat /sys/module/megaraid_sas/parameters/lb_pending_cmds command to check that the configuration takes effect.

    The following figure shows the optimized performance.
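
The check of the parameter value can also be scripted, for example when many servers need to be verified. The following is a minimal Python sketch that reads the same sysfs file as the cat command above. The function names are hypothetical, and the expected value of 16 is the one used in this case.

```python
from pathlib import Path

PARAM_PATH = "/sys/module/megaraid_sas/parameters/lb_pending_cmds"

def read_lb_pending_cmds(path=PARAM_PATH):
    """Return the current parameter value as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # The file is missing when the driver is not loaded.
        return None

def setting_applied(expected=16, path=PARAM_PATH):
    """Return True when the loaded driver reports the expected value."""
    return read_lb_pending_cmds(path) == expected
```

A return value of None from read_lb_pending_cmds usually means that the megaraid_sas driver is not loaded or does not expose this parameter.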

Experience

Check whether the status of the RAID controller card and hard drives is normal. If yes, modify the driver parameter.

Note

None

iBBU09 Log Records EOL and Deep Discharge on a Server Configured with the LSI SAS2208
Problem Description
Table 5-107 Basic information

Item

Information

Source of the Problem

RH2285 V2

Intended Product

LSI SAS2208

iBBU09

Release Date

2016-11-02

Keyword

2208, iBBU09, battery, End of Life (EOL), deep discharge, learn cycle requested

Symptom

Hardware configuration:

RH2285 V2 configured with the LSI SAS2208 and iBBU09

Software version:

The firmware version of the LSI SAS2208 is 103 (3.190.05-1669, 23.7.0-0029).

Symptom:

The Learn Cycle Active alarm is reported to the customer's system.

Key Process and Cause Analysis

The system checks the iBBU status by running the command delivered by the iBMA.

/opt/huawei/bma/bin/hwdiag -t raid -b

The RAID controller card log and battery information are as follows:

BatteryType: iBBU-09

Voltage: 4084 mV

Current: 0 mA

Temperature: 37 C

Battery State: Optimal

Design Mode : 48+ Hrs retention with a non-transparent learn cycle and moderate service life.

BBU Firmware Status:

Charging Status : None

Voltage : OK

Temperature : OK

Learn Cycle Requested : Yes

Learn Cycle Active : No

Learn Cycle Status : OK

Learn Cycle Timeout : No

I2c Errors Detected : No

Battery Pack Missing : No

Battery Replacement required : No

Remaining Capacity Low : No

Periodic Learn Required : No

Transparent Learn : No

No space to cache offload : No

Pack is about to fail & should be replaced : No

Cache Offload premium feature required : No

Module microcode update required : No

BBU GasGauge Status: 0x32a8

Relative State of Charge: 99 %

Charger System State: 1

Charger System Ctrl: 0

Charging current: 0 mA

Absolute state of charge: 82 %

Max Error: 0 %

Battery backup charge time : 48 hours +

BBU Capacity Info for Adapter: 0

Relative State of Charge: 99 %

Absolute State of charge: 82 %

Remaining Capacity: 1237 mAh

Full Charge Capacity: 1262 mAh

Run time to empty: Battery is not being charged.

Average time to empty: 3 Hour, 53 Min.

Estimated Time to full recharge: Battery is not being charged.

Cycle Count: 6

BBU Design Info for Adapter: 0

Date of Manufacture: 09/15, 2013

Design Capacity: 1500 mAh

Design Voltage: 4100 mV

Specification Info: 0

The preceding information shows the battery status. The values of Full Charge Capacity and Design Capacity meet the requirements, and the iBBU is normal.

The value of Learn Cycle Requested is Yes.

The following information is displayed in the RAID controller card log.

EOL indicates that the iBBU has been used for a long time and has reached the end of its life cycle. It does not indicate that the iBBU is faulty and needs to be replaced. For details about the battery replacement standard, see case KB0601000002.

The following figure shows Broadcom's reply.

The reason that "Battery Yearly Deep Discharge Relearn Pending. Please Initiate a manual Relearn" is displayed is as follows:

By design, the periodic relearns of iBBU09 are not guaranteed to update the chemical capacity of the battery, which affects the accuracy of the battery parameters. Therefore, the manufacturer recommends that a deep discharge learn be executed once a year. The deep discharge learn takes a long time and keeps the virtual drive (VD) in the write-through (WT) state while it runs. Therefore, you need to manually execute a relearn.

The following figure shows Broadcom's original reply.

Cause analysis:

The iBBU works properly. Manually performing a relearn is a normal maintenance mechanism.

Conclusion and Solution

Conclusion:

1. No message indicating that the iBBU requires replacement is displayed.

2. If "Learn Cycle Requested : Yes" is displayed, a relearn is pending. If this state lasts for a long time, you are advised to manually perform a relearn.

3. The message "Battery Reached EOL" does not indicate that the iBBU needs to be replaced.

4. If "Battery Yearly Deep Discharge Relearn Pending. Please Initiate a manual Relearn" is displayed in the logs, you need to manually perform a relearn for the iBBU.

Solution:

1. Run the relearn command on the CLI.

The StorCLI command is as follows:

./storcli64 /c0/bbu start learn

c0 indicates the first LSI SAS2208.

Experience

None

Note

None

Multiple Hard Drives Are Offline on an RH2288H V2 Configured with the LSI SAS2208
Problem Description
Table 5-108 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

LSI SAS2208

Release Date

2016-11-10

Keyword

2208, offline, link

Symptom

Hardware configuration:

RH2288H V2 configured with the LSI SAS2208

Symptom:

Multiple hard drives are offline, and the iBMC reports alarms about the offline hard drives.

Key Process and Cause Analysis

The RAID controller card log shows that transmission errors occur on multiple hard drives. In the following information, sense key 0B/47/03 indicates command abort.

Search the entire log for the keyword bad. Almost all drives have been in the bad state at some point.

Check the counts of media errors, other errors, and predictive failures of each drive. The counts are all 0.

According to the preceding conditions, the hard drives are normal. You are advised to check the SAS links and the link errors.

PHY 4 of the RAID controller card has been reset a large number of times and cannot establish a link.

These links connect the RAID controller card to the expander hard drive backplane. A single connection is responsible for the communication of all hard drives. When a PHY connection is abnormal, the communication of all hard drives is affected, and the drives may even go offline.

Cause analysis:

The links between the RAID controller card and the expander hard drive backplane are abnormal. As a result, hard drives turn offline.

Conclusion and Solution

Conclusion:

The links between the RAID controller card and the expander hard drive backplane are abnormal. As a result, hard drives turn offline.

Solution:

The fault source cannot be determined (a RAID controller card, hard drive backplane, or cable), but cable A is not faulty because port A is responsible for PHYs 0–3 and port B is responsible for PHYs 4–7.

The recommended handling methods are as follows:

1. Replace the port B cable, hard drive backplane, and RAID controller card one at a time, in that order, to locate the fault.

2. If method 1 is not feasible onsite, replace the port B cable, hard drive backplane, and RAID controller card at the same time.
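
The reasoning that narrows the fault to port B follows from the PHY-to-port split described above. The following is a minimal Python sketch of that mapping, assuming the split stated in this case (port A for PHYs 0 to 3, port B for PHYs 4 to 7). The function names are hypothetical.

```python
def port_for_phy(phy):
    """Map a controller PHY number to its port, per the split stated above."""
    if 0 <= phy <= 3:
        return "A"
    if 4 <= phy <= 7:
        return "B"
    raise ValueError(f"PHY {phy} is outside the 0-7 range described here")

def suspect_ports(faulty_phys):
    """Return the sorted list of ports that carry any of the faulty PHYs."""
    return sorted({port_for_phy(p) for p in faulty_phys})
```

With PHY 4 as the only abnormal PHY, the result contains only port B, so the port A cable does not need to be replaced.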

Experience

None

Note

None

RAID 1 Member Drive Faults Cause Linux OS I/O Error on a Server Configured with the LSI SAS2308
Problem Description
Table 5-109 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

RH2285, E6000, X6000, and RH2288

Release Date

2015-05-20

Keyword

RH2285, LSI SAS2308, RAID 1, I/O error

Symptom

Hardware configuration:

RH2285 configured with the LSI SAS2308 (Two hard drives form a RAID 1.)

Software version:

Linux

Symptom:

After the OS is installed and started, if a RAID 1 member drive is offline due to a hardware fault, an I/O error may occur on the drive letter corresponding to RAID 1. If a file system exists, the file system becomes read-only.

Figure 5-174 Error information
Key Process and Cause Analysis

Key process:

Create a RAID 1 on the LSI SAS2308 using two hard drives.

This problem occurs only when a member drive is suddenly faulty or removed during frequent data write on RAID 1.

There is a low probability that this problem occurs. A read-only RAID 1 file system may also be caused by link faults or other faults. However, before an I/O error occurs, hostbyte will turn to the DID_SOFT_ERROR state.

Cause analysis:

According to the RAID 1 specifications, data must be written to both member drives of RAID 1 at the same time. The firmware handles the removal of a hard drive. When data is being written to a drive that is being removed (or that is faulty or has a bad sector), the firmware waits for some time before completing the removal to ensure reliability. The waiting time makes the system robust against intermittent drive disconnection.

During the waiting time, the removal action sent by the host to the drive is repeatedly aborted and retried. When the firmware starts to remove the hard drive, pending I/Os are detected. After the retry fails, the firmware clears the I/Os and reports an IOC error to the upper layer. The driver then reports a DID_SOFT_ERROR error to the upper layer. As a result, the upper layer considers that an I/O error occurs.

When an I/O error occurs, the file system becomes read-only due to data synchronization errors. The following figure shows Avago's description and the problem process.

If the file system does not exist when this problem occurs, perform the write operation again. If a file system exists and is not in the system partition, unmount the file system and then mount it again. If the file system is in the system partition, restart the OS. This problem does not cause data loss.

Conclusion and Solution

Conclusion:

Upgrade the firmware and driver at the same time to solve this problem.

Impact scope: If RAID 1 and RAID 10 are created on the LSI SAS2308, there is a low probability that the file system becomes read-only when a member drive of a RAID array is faulty while data is being written to the drives. The data will not be lost.

Solution:

Modify the mechanism for removing member hard drives. After the mechanism is modified, when the driver detects that this type of hard drive removal event occurs, the driver does not report the DID_SOFT_ERROR error. Instead, the driver re-reports that hostbyte is in the DID_RESET state. In this way, the upper layer retries I/O instead of directly aborting I/Os and reporting an error. When the system retries I/Os, the firmware removes the hard drive or troubleshoots the bad sector problems. When I/Os occur again, the system does not block the I/Os. Instead, it directly processes the I/Os, avoiding I/O errors or I/O blockage.

The following figure shows the key modifications to the driver.

The information printed by the driver shows that when the write operation fails, the system does not return a failure message. Instead, it executes a device reset operation and attempts to write data again. According to the lab test, no I/O errors occur after the modification.
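
The difference between the two host byte states can be illustrated with the SCSI result word that the driver returns. The following is a minimal Python sketch, assuming the Linux convention that the host byte occupies bits 16 to 23 of the result and using the kernel's values for DID_RESET and DID_SOFT_ERROR. The function names and the action strings are hypothetical and only summarize the behavior described in this case.

```python
# Host byte values from the Linux kernel SCSI headers
# (only the ones relevant to this case are listed).
HOST_BYTES = {
    0x00: "DID_OK",
    0x08: "DID_RESET",
    0x0B: "DID_SOFT_ERROR",
}

def host_byte(result):
    """Extract the host byte (bits 16-23) from a 32-bit SCSI result word."""
    return (result >> 16) & 0xFF

def upper_layer_action(result):
    """Summarize how this case describes the upper layer's reaction."""
    name = HOST_BYTES.get(host_byte(result), "other")
    if name == "DID_SOFT_ERROR":
        return "report I/O error"  # behavior before the driver change
    if name == "DID_RESET":
        return "retry I/O"         # behavior after the driver change
    return "not covered here"
```

In kernel messages, the same information usually appears in text form, so searching the message log for hostbyte=DID_SOFT_ERROR is often enough to tell this problem apart from other causes of a read-only file system.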

Use firmware version 18.00.02.00 or later, which has been released by Huawei. The address is as follows (the version is optional):

http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|9856786|19955021|8576254|9314948|9314950|9314955|21441662&idAbsPath=fixnode01|7919749|9856522|9856786|19955021|8576254&version=BH622+V2+V100R002C00SPC300&hidExpired=0&contentId=SW1000112907

Download and decompress the package, and mount the decompressed ISO file to the DVD drive. For details, see section 9.2.2 in the FusionServer Tools V100R001 Toolkit User Guide.

To upgrade the firmware in the OS, you need to obtain the software package in the level-S directory and run the ./update.sh script in Linux. The script automatically upgrades the firmware and the upgrade takes effect after the OS restarts.

The driver version is 20.00.00.00 or later. You can download the package on the official Huawei website.

http://support.huawei.com/enterprise/en/server/fusionserver-idriver-pid-21588909/software

If compilation is required, obtain source code from the official Avago website.

http://www.lsi.com/products/host-bus-adapters/pages/lsi-sas-9217-8i.aspx#tab/tab4

Download the driver package that matches the OS version, install the driver, and restart the OS.

Experience

None

Note

Driver compilation is required for customized kernels. The driver compilation procedure is as follows:

1. Copy the driver source file mpt2sas-20.00.00.00-src.tar.gz to Linux and run tar -vxf mpt2sas-20.00.00.00-src.tar.gz to decompress the driver source file.

2. Run ./compile.sh to start compilation. Error information shown in the following figures may be displayed.

This indicates that ctags is not installed.

This indicates that kernel source code is not installed.

If the compilation is successful, the following information is displayed.

3. After the compilation, the driver file is mpt2sas.ko.

"HDIO_GET_IDENTIFY_FAIL" Is Displayed in the SATA OS on a Server Configured with the LSI SAS3008
Problem Description
Table 5-110 Basic information

Item

Information

Source of the Problem

RH228X

Intended Product

RH228X

Release Date

2015-10-20

Keyword

3008, Linux, HDIO_GET_IDENTIFY_FAIL

Symptom

Hardware configuration:

RH228X (8 or 12 drives) configured with the LSI SAS3008 (firmware version: 02.00.00.00)

When the LSI SAS3008 and SATA drives are used, information similar to Figure 5-175 is displayed in the OS message log after each restart, or insertion or removal of a hard drive.

Figure 5-175 Error screen in Linux
Key Process and Cause Analysis

Analyze the meaning of the printed information and use the following program to simulate the scenario. The same error information is displayed in the OS message log.

The ioctl interface is invoked to deliver ATA Passthrough to the hard drive where the OS is deployed. However, the hard drive or controller does not support ATA Passthrough. As a result, an error is reported.

The error information is reported only when the OS is started or shut down, the hard drive is hot swapped, or the drive letter is lost. The OS automatically invokes the ata_id program when a new drive is inserted, which needs to obtain certain information (such as SMART information and ATA Identify return information) of the hard drive through the ATA Passthrough command. An error is reported when the program cannot obtain return values successfully.

The hard drive supplier confirms that this type of hard drive supports the ATA Passthrough command. Therefore, the controller either does not support the command or fails to parse it. The bug list provided by the supplier, Avago, shows firmware improvements related to this problem, and LSI engineers confirm that a later firmware version solves it.

Huawei released a new firmware version for the LSI SAS3008 in early September and verified that it solves the problem. Download the firmware at:

http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|9856792|9768163|21011086|21011087|21011089|21632955&idAbsPath=fixnode01|7919749|9856522|9856792|9768163&version=RH5885H+V3+V100R003C00SPC110&hidExpired=0&contentId=SW1000132016

Conclusion and Solution

Conclusion:

The LSI SAS3008 firmware of the old version fails to support the ATA Passthrough command. As a result, hdparm cannot read the hard drive information.

Solution:

Solution: Upgrade the LSI SAS3008 firmware to the latest version. For details, see the firmware upgrade guide.

Workaround: The soft links created by ata_id are not used in actual service scenarios. You can perform the following operations in the system:

Add # at the beginning of lines 37 and 39 in the /lib/udev/rules.d/60-persistent-storage.rules file to comment out the two rules, save the settings, and exit.

When a new hard drive is added, the OS invokes ata_id to create soft links based on the rules in lines 37 and 39 in the /lib/udev/rules.d/60-persistent-storage.rules file. If ata_id fails to create soft links, error information is recorded in the OS message log. After lines 37 and 39 in the /lib/udev/rules.d/60-persistent-storage.rules file are commented out, when a new drive is added, the OS does not invoke ata_id, affecting only the generation of soft links.
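The workaround above (adding # to specific lines of a rules file) can be sketched as a small Python helper. This is an illustration only: the function name is hypothetical, and lines 37 and 39 apply to the rules file described in this case. On other OS versions, confirm that the target lines are the ones that invoke ata_id before changing them, and back up the file first.

```python
def comment_out_lines(text, line_numbers):
    """Prefix the given 1-based lines with '#' unless already commented."""
    targets = set(line_numbers)
    result = []
    for number, line in enumerate(text.splitlines(keepends=True), start=1):
        if number in targets and not line.lstrip().startswith("#"):
            line = "#" + line
        result.append(line)
    return "".join(result)


if __name__ == "__main__":
    rules_path = "/lib/udev/rules.d/60-persistent-storage.rules"
    with open(rules_path) as handle:
        original = handle.read()
    # Print the result for review instead of overwriting the file.
    print(comment_out_lines(original, [37, 39]))
```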

Experience

ata_id is used to create soft links in the /dev/disk/by-id directory when a new hard drive is added. Visit http://linux.die.net/man/8/ata_id to see the official description.

The following figure shows the /dev/disk/by-id directory when ata_id creates ata-xxx soft links successfully.

The following figure shows the /dev/disk/by-id directory when ata_id fails to create soft links.

The soft links created by ata_id are not used in actual scenarios. Therefore, the solution and workaround do not affect actual services.

Note

None

Firmware Does Not Respond to the XFS File System of SSD RAID 10 on a Server Configured with the LSI SAS3008
Problem Description
Table 5-111 Basic information

Item

Information

Source of the Problem

RH228X

Intended Product

RH228X

Release Date

2016-04-15

Keyword

3008, Ubuntu, XFS fault, 0501

Symptom

Hardware configuration:

RH228X (8 or 12 drives) configured with the LSI SAS3008

Six hard drives are installed on the LSI SAS3008. Two SAS drives form a RAID 1 array for OS installation. The root file system is in XFS format. Four SSD hard drives form a RAID 10 array. Partitions are created and mounted. FIO is used for read and write stress tests. There is a possibility that the firmware reports an error and does not respond, as shown in Figure 5-176.

Figure 5-176 Error message screen
Key Process and Cause Analysis

To reproduce the problem, perform the following steps:

1) Install two SAS drives and four SATA SSD drives on the LSI SAS3008.

2) Create a RAID 1 using SAS drives and set it as a boot drive for installing the OS. Create RAID 10 arrays using SSD drives and use them as data drives.

3) Install Ubuntu 14.04.03 and set the default root file system format to XFS.

4) Format and mount RAID 10 member drives, and use FIO to execute read and write stress tests.

The problem occurs only when the LSI SAS3008 + RAID 10 combination and the XFS file system are used.

The LSI SAS3008 currently uses the IR mode (that is, the firmware supports simple RAID). For RAID solutions of the IR mode provided by Broadcom, the maximum processing capability of a RAID array is 8196 KB. If the actual capacity exceeds 8196 KB, the RAID controller card firmware will be overloaded. As a result, a firmware fault occurs.

In addition, Linux provides the max_sectors_kb parameter, which sets the maximum size (in KB) of a single I/O request that the OS sends to a device. Kernels of 3.18.22 and later versions set this parameter to 16383 by default.

For kernels earlier than 3.18.22, the value of max_sectors_kb is fixed to 512.

Cause analysis:

For LSI SAS3008 RAID solutions, the maximum processing capability of a RAID array is 8196. If the number of I/O request queues of a specific kernel version exceeds 8196, a firmware fault occurs.

Conclusion and Solution

Conclusion:

For LSI SAS3008 RAID solutions, the maximum processing capability of a RAID array is 8196. If the number of I/O request queues of a specific kernel version exceeds 8196, a firmware fault occurs.

Solution:

The default value of max_sectors_kb is fixed in the kernel and cannot be modified by changing the kernel configuration. Add echo 512 > /sys/block/sdX/queue/max_sectors_kb to a startup item so that the setting is reapplied after each OS restart.
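The echo command in the solution can also be expressed as a Python sketch that lowers the value only when it exceeds the limit. The function name and the sysfs_root parameter are hypothetical (the parameter exists only so that the logic can be tried against a test directory). Writing to /sys requires root permission, and sdX must be replaced with the actual device name.

```python
from pathlib import Path


def cap_max_sectors_kb(device, limit_kb=512, sysfs_root="/sys/block"):
    """Lower max_sectors_kb for one block device if it exceeds limit_kb.

    Returns the value in effect after the call.
    """
    path = Path(sysfs_root) / device / "queue" / "max_sectors_kb"
    current = int(path.read_text().strip())
    if current > limit_kb:
        # Same effect as: echo 512 > /sys/block/sdX/queue/max_sectors_kb
        path.write_text(str(limit_kb) + "\n")
        return limit_kb
    return current
```

Like the echo command, this change is lost after a restart unless it is run from a startup item.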

The location of Ubuntu startup items is as follows.

Experience

This problem can be triggered only in the latest kernel. Currently, no officially released driver version can solve this problem.

Note

None

iBBU09 Missing Is Reported on a Server Configured with the LSI SAS2208
Problem Description
Table 5-112 Basic information

Item

Information

Source of the Problem

RH2285 V2

Intended Product

LSI SAS2208

iBBU09

Firmware version: 103 (3.190.05-1669)

Release Date

2016-10-31

Keyword

2208, iBBU09, battery missing

Symptom

Hardware configuration:

RH2285 V2 configured with the LSI SAS2208 and iBBU09

Software version:

The firmware version of the LSI SAS2208 is 103 (3.190.05-1669, 23.7.0-0029).

Symptom:

The customer system detects an alarm indicating that the iBBU pack is missing.

Key Process and Cause Analysis

The system checks the iBBU status by running the command delivered by the iBMA.

/opt/huawei/bma/bin/hwdiag -t raid -b

The RAID controller card log and battery information are as follows:

BatteryType: iBBU-09

Battery State: Missing

Battery backup charge time : 48 hours +

BBU Capacity Info for Adapter: 0

Relative State of Charge: 96 %

Absolute State of charge: 80 %

Remaining Capacity: 1209 mAh

Full Charge Capacity: 1272 mAh

Run time to empty: Battery is not being charged.

Average time to empty: 2 Hour, 25 Min.

Estimated Time to full recharge: Battery is not being charged.

Cycle Count: 6

BBU Design Info for Adapter: 0

Date of Manufacture: 09/15, 2013

Design Capacity: 1500 mAh

Design Voltage: 4100 mV

Specification Info: 0

Serial Number: 1789

Pack Stat Configuration: 0x0000

Manufacture Name: LS36691

Firmware Version :

Device Name: iBBU-09

Device Chemistry: LION

Battery FRU: N/A

Transparent Learn = 0

App Data = 0

The preceding information shows that the values of Full Charge Capacity and Design Capacity meet the requirements. The iBBU is normal.
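The comparison of Full Charge Capacity with Design Capacity can be extracted from the battery information with a short Python sketch. It assumes that the fields appear in the "Name: value mAh" form shown above. This case does not state a pass/fail threshold, so the sketch only computes the ratio.

```python
import re


def capacity_ratio(log_text):
    """Return Full Charge Capacity / Design Capacity from battery log text."""
    def read_mah(field):
        match = re.search(field + r"\s*:\s*(\d+)\s*mAh", log_text)
        if match is None:
            raise ValueError("field not found: " + field)
        return int(match.group(1))

    return read_mah("Full Charge Capacity") / read_mah("Design Capacity")


sample = """Full Charge Capacity: 1272 mAh
Design Capacity: 1500 mAh"""
print(round(capacity_ratio(sample) * 100, 1))  # 84.8
```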

According to the log, the battery relearn is executed successfully.

Broadcom confirms that earlier firmware versions of the LSI SAS2208 have bugs. When the communication between the iBBU and a RAID controller card is interrupted, the firmware records the status of the iBBU as missing. After the communication recovers, the iBBU remains in the missing state, but other battery functions can be used properly.

Power cycle the server to rectify the fault.

The SCGCQ00342963/SCGCQ00427022 problem has been incorporated into the LSI SAS2208 MR5.10. Upgrade the firmware to version 107.

Cause analysis:

The problem is caused by bugs of the LSI SAS2208 firmware 3.190.05-1669. When the communication between the iBBU and a RAID controller card is interrupted, the firmware records the status of the iBBU as missing. After the communication recovers, the iBBU remains in the missing state, but other battery functions can be used properly.

Conclusion and Solution

Conclusion:

The iBBU status of the LSI SAS2208 is missing, but other parameters are normal. The firmware (3.190.05-1669) does not automatically restore the recorded iBBU status after the communication recovers from an interruption.

The iBBU works properly even if the status is displayed as missing.

Solution:

1. Power cycle the server to restore the iBBU status.

2. Upgrade the LSI SAS2208 firmware to 107 or later.

Firmware version number (107): 3.400.45-3507

http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|21782478|21782482|9581539|19888581|19888582|19888584|22057110&idAbsPath=fixnode01|7919749|9856522|21782478|21782482|9581539&version=RH2288H+V2+V100R002C00SPC610&hidExpired=0&contentId=SW1000191029

Firmware version number (108): 3.400.95-4061

http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|21782478|21782482|9581539|19888581|19888582|19888584|22057110&idAbsPath=fixnode01|7919749|9856522|21782478|21782482|9581539&version=RH2288H+V2+V100R002C00SPC610&hidExpired=0&contentId=SW1000191030

Experience

None

Note

None

RH2288H V2 Hard Disk Cable Misconnect
Problem Description
Table 5-113 Basic information

Item

Information

Source of the Problem

2017/12

Intended Product

RH2288H V2 rear hard disk backplane

Release Date

2018/1/2

Keyword

Cable

Symptom

Hardware configuration: RH2288H V2

OS configuration: N/A

Symptom: The indicators of the RAID array are incorrect, and the indicated hard disk is normal.

Key Process and Cause Analysis

Key process:

Log analysis:

  1. The score of the hard disk in slot 24 is 40, indicating that the hard disk in slot 24 is faulty.
    Figure 5-177 Hard disk scores
  2. According to the product user guide, slot 24 corresponds to HDDA on the rear panel.

  3. HDDA and HDDB are located on the rear hard disk backplane. HDDA corresponds to slot 24 and HDDB corresponds to slot 25. Figure 5-178 shows how to connect the rear hard disk backplane to the mainboard and front hard disk backplane.
    1. Cables 1 and 2 are SATA cables, which are connected to the front hard disk backplane and rear hard disk backplane.
    2. Cable 3 is used to connect the indicator board.
    3. Cable 4 is the power cable.
    Figure 5-178 Backplane cable connections

The mapping between the logical slots and physical slots of the hard disks is incorrect.

  1. Cable 1 is used by HDDA exclusively, and cable 2 is used by HDDB exclusively, as shown in Figure 5-178. However, indicator board cable 3 is shared by HDDA and HDDB.
  2. If cable 1 and cable 2 are swapped:
    • When the indicator of slot 24 is turned on, the HDDA yellow indicator blinks.
    • When the indicator of slot 25 is turned on, the HDDB yellow indicator blinks.
    • When the HDDA hard disk is removed, the system reports that the hard disk in slot 25 is not in position.
    • When the HDDB hard disk is removed, the system reports that the hard disk in slot 24 is not in position.
  3. According to the preceding analysis, if cables 1 and 2 are swapped:
    • The paths of SATA data signals are slot 24<->HDDB and slot 25<->HDDA.
    • The paths of indicator board signals are slot 24<->HDDA and slot 25<->HDDB.

After the HDDA hard disk is removed, the HDDB yellow indicator is on.

If the HDDA hard disk is removed, the system considers that the hard disk in slot 25 is removed, and turns on the indicator corresponding to slot 25. However, on the indicator board slot 25 corresponds to HDDB. Therefore, the HDDB yellow indicator is on.
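The two signal paths described above can be modeled in a few lines of Python to show why the wrong indicator turns on. The names are hypothetical and the model covers only the two rear bays in this case.

```python
# Indicator board wiring (cable 3) is fixed: slot -> bay whose indicator lights.
INDICATOR_PATH = {24: "HDDA", 25: "HDDB"}


def data_path(cables_swapped):
    """Return the bay -> logical slot mapping seen over the SATA cables."""
    if cables_swapped:
        return {"HDDA": 25, "HDDB": 24}
    return {"HDDA": 24, "HDDB": 25}


def indicator_lit_when_removed(bay, cables_swapped):
    """Return the bay whose yellow indicator lights when `bay` is removed."""
    slot = data_path(cables_swapped)[bay]  # slot the system believes is empty
    return INDICATOR_PATH[slot]            # indicator the system turns on


print(indicator_lit_when_removed("HDDA", cables_swapped=True))   # HDDB
print(indicator_lit_when_removed("HDDA", cables_swapped=False))  # HDDA
```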

The file system is damaged due to a faulty hard disk.

  1. Symptom: The hard disk is faulty, and the file system is damaged after the system is restarted.
  2. Problem analysis: The hwdiag score of slot 24 is 40, indicating that the hard disk in slot 24 is close to failure and has a large number of bad blocks. Because the hard disk has not failed completely, the RAID controller card does not remove it from the array and continues to read from and write to it, which slows down I/O. When I/O operations are performed on the bad blocks, the file system is damaged.
Conclusion and Solution

Power off the system, swap cable 1 and cable 2, and replace the HDDB hard disk, referring to Figure 5-178.

Common Problems of the Management Software

An Error Message Is Displayed When the OS Boots from the Virtual BMC DVD-ROM Drive
Problem Description
Table 5-114 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

RH2285, E6000, and X6000

Release Date

2012-05-07

Keyword

BMC, virtual DVD-ROM drive

Symptom

Hardware configuration

  • An RH2285 configured with 12 hard drives
  • An LSI SAS1068E controller card

Symptom

The RH2285 baseboard management controller (BMC) mounts the ISO file through the virtual DVD-ROM drive to install an operating system (OS). An error message is displayed when the OS boots from the virtual DVD-ROM drive, as shown in Figure 5-179.

Figure 5-179 Error message displayed in the virtual DVD-ROM drive window
Key Process and Cause Analysis

Key process

  1. Determine that the OS cannot boot from the virtual DVD-ROM drive according to the displayed message "Loading stage2 Read error 0x20". Replace the ISO image file. This problem persists.
  2. Assume that the problem occurs on server A. Use server B to install the image file. The problem does not occur. Therefore, the ISO file is not the cause.
  3. Remove and insert the network cable that connects to the BMC network port on server A. This problem persists.
  4. Use the network cable on server B to replace the network cable on server A. This problem is resolved.

Cause analysis

The faulty network cable causes the failure in OS booting from the virtual DVD-ROM drive.

Conclusion and Solution

Conclusion

The faulty network cable causes the failure in OS booting from the virtual DVD-ROM drive.

Solution

Replace the network cable.

Experience

Check whether the ISO file can be used properly. If yes, locate faults in the BMC network.

Note

None

The Remote Control Function of the RH2285 Cannot Be Used
Problem Description
Table 5-115 Basic information

Item

Information

Source of the Problem

Problem in Online Devices

Intended Product

RH2285

Release Date

2012-01-10

Keyword

RH2285, remote control, BMC

Symptom

Software configuration

  • Baseboard management controller (BMC) 2.06
  • Basic input/output system (BIOS) V036

Symptom

Log in to the RH2285 iMana web user interface (WebUI) and click Remote Control. The Invalid User dialog box is displayed, and the remote control function cannot be used.

Key Process and Cause Analysis

Key process

  1. Choose Tools > Internet options.
  2. Delete Internet Explorer temporary files.
  3. Open the control panel, and click Other options.
  4. Click Java.
  5. Choose General > Settings.
  6. Go to the Java cache, for example, C:\Documents and Settings\h00179477\Application Data\Sun\Java\Deployment\cache.
  7. Delete the 6.0 folder.

Cause analysis

The Internet Explorer and Java caches are fully occupied. As a result, the remote control function cannot be used on the iMana WebUI.

Conclusion and Solution

Conclusion

The Internet Explorer and Java caches are fully occupied. Files cannot be loaded by using the remote control function on the WebUI.

Solution

  • Delete Internet Explorer temporary files.
  • Delete files in the Java application cache.
Experience

The method described in this case can be used for Java 5 and Java 6.

Note

None

Failed to Upgrade BMCs Using uMate
Problem Description
Table 5-116 Basic information

Item

Information

Source of the Problem

Problem on the live network

Intended Product

Rack servers and E9000 blade servers

Release Date

2016-02-06

Keyword

uMate, upgrade, BMC

Symptom

Symptom:

uMate is used to upgrade BMCs in batches and most upgrades fail, as shown in Figure 5-180 and Figure 5-181.

Figure 5-180 Errors

Figure 5-181 Errors

Key Process and Cause Analysis

Cause analysis:

According to tool developers, the error messages do not reflect true causes. The true cause is that the client running uMate does not have enough hard drive capacity, as shown in Figure 5-182. After the client is replaced, the upgrade success rate is nearly 100%, as shown in Figure 5-183.

Figure 5-182 Insufficient hard drive capacity of the original client
Figure 5-183 Sufficient hard drive capacity of the new client

Conclusion and Solution

Solution:

Reserve sufficient space in client hard drives, and use uMate 116 or later.
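Free space on the client can be checked before a batch upgrade starts. The Python sketch below uses shutil.disk_usage from the standard library. The required size is a placeholder chosen for illustration, because this case does not state how much space uMate needs.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the drive holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # Placeholder value for illustration only; not a documented uMate requirement.
    required = 10 * 1024 ** 3
    print(has_free_space(".", required))
```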

Experience

None

Note

None

NA Is Displayed for P-State and T-State on the Power Adjustment Page
Problem Description
Table 5-117 Basic information

Item

Information

Source of the Problem

RH2285 V2

Intended Product

RH2285 V2 servers

Release Date

2012-12-28

Keyword

P-State, T-State, NA, iMana, power adjustment

Symptom

Hardware configuration

RH2285 V2

Symptom

Log in to the iMana web user interface (WebUI) of a server and choose PS Management > Power Adjustment. NA is displayed for P-State and T-State, as shown in Figure 5-184.

Figure 5-184 Power adjustment page

Key Process and Cause Analysis

NA is displayed for P-State and T-State in the following scenarios:

  1. The server operating system (OS) is unexpectedly powered off during startup before the basic input/output system (BIOS) initialization page is accessed, as shown in Figure 5-185.
    Figure 5-185 Power-off during OS startup

  2. The BIOS is being upgraded.
  3. The management engine (ME) is in the recovery state due to hardware issues.

Cause analysis

  1. The ME obtains initialization information from the BIOS during OS startup. The information saved in the last power-on is cleared first. If the OS is powered off at the moment, the ME cannot obtain the CPU information, and therefore the iMana displays NA for the CPU P-State and T-State.
  2. The ME stays in the recovery state during a BIOS upgrade, and the ME does not respond to the commands issued by the iMana for querying the CPU P-State and T-State. After the upgrade, the ME enters the normal state.
  3. The ME enters the recovery state because of a platform controller hub (PCH) fault.
Conclusion and Solution

Conclusion

The ME is in the recovery state, or the OS is powered off when BIOS initialization is not complete. As a result, the iMana cannot query the CPU P-State and T-State.

Solution

  1. If the BIOS is being upgraded, wait until the upgrade is complete and check whether the fault still occurs. If the BIOS is not being upgraded, restart the server OS.
  2. If the fault persists when the OS is powered on after the BIOS is upgraded, replace the mainboard.
  3. If the OS is unexpectedly powered off during startup, restart the server OS.
Experience

None

Note

The ME is an Intel-developed server management engine. The ME can manage server hardware (in terms of power capping, CPU temperature, and fan status) by responding to commands issued by the iMana.

Error Message "invalid role" Is Displayed When an ipmitool Command Is Executed
Problem Description
Table 5-118 Basic information

Item

Information

Source of the Problem

RH2285 V2

Intended Product

servers

Release Date

2013-03-08

Keyword

ipmitool, error

Symptom

Hardware configuration

RH2285 V2

Symptom

The administrator user name galaxbmc and password A3UjChOi3b are created on the iMana.

When ipmitool -I lanplus -H 192.168.60.200 -U galaxbmc -P A3UjChOi3b chassis power status is run to query the power status of servers, an error message "invalid role" is displayed, as shown in Figure 5-186.

Figure 5-186 invalid role

Key Process and Cause Analysis

Key process

  1. Determine that an incorrect user name or password is used based on error message "invalid role".
  2. Use user name galaxbmc and password A3UjChOi3b in the ipmitool command to log in to the iMana web user interface (WebUI). The iMana prompts that user galaxbmc has been locked, as shown in Figure 5-187.
    Figure 5-187 Logging in to the iMana WebUI

  3. Use the default user name root and password root of the iMana administrator to log in to the iMana. The login succeeds.
  4. A new feature is added to the iMana of V2 servers and the baseboard management controller (BMC) of R1 servers (for details about BMC versions, see the note). When a user accesses the iMana (through the CLI, WebUI, or ipmitool) and enters an incorrect password five consecutive times, an error message is displayed each time indicating that the user name or password is incorrect. When the user enters an incorrect password for the sixth time, the iMana regards the user as invalid and locks the user. The user is automatically unlocked 5 minutes after the last incorrect password is entered.
  5. Stop sending the ipmitool command. Change the password of user galaxbmc on the iMana, wait 5 minutes, and then use the new password in the ipmitool command. The ipmitool command then displays the power status.
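The lockout behavior described in step 4 can be modeled with the following Python sketch. It is a simplified model, not the iMana implementation. In particular, it assumes that an incorrect password entered while the user is locked restarts the 5-minute timer, which is consistent with the instruction to stop sending the ipmitool command. All names are hypothetical.

```python
LOCK_AFTER_FAILURES = 6          # the sixth consecutive failure locks the user
UNLOCK_DELAY_SECONDS = 5 * 60    # counted from the last incorrect password


class LoginTracker:
    def __init__(self):
        self.failures = 0
        self.last_failure = None

    def is_locked(self, now):
        if self.failures < LOCK_AFTER_FAILURES:
            return False
        if now - self.last_failure >= UNLOCK_DELAY_SECONDS:
            self.failures = 0    # automatic unlock
            return False
        return True

    def attempt(self, password_ok, now):
        """Return 'ok', 'wrong password', or 'locked' (now is in seconds)."""
        if self.is_locked(now):
            if not password_ok:
                self.last_failure = now  # assumption: restarts the timer
            return "locked"
        if password_ok:
            self.failures = 0
            return "ok"
        self.failures += 1
        self.last_failure = now
        if self.failures >= LOCK_AFTER_FAILURES:
            return "locked"
        return "wrong password"
```

In this model, a script that keeps retrying with an outdated password keeps the user locked indefinitely, which matches the symptom in this case.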
Conclusion and Solution

Conclusion

An incorrect password is used in the ipmitool command, so error message "invalid role" is displayed.

Solution

Access the iMana using a correct user name and password.

Experience

None

Note

In the following BMC versions of R1 servers, users are locked when an incorrect password is used:

RH2285 BMC 2.16 or later

E6000 BMC 2.17 or later

T6000 BMC 2.13 or later

Time Cannot Be Synchronized from the BIOS to the iMana
Problem Description
Table 5-119 Basic information

Item

Information

Source of the Problem

X6000 V2

Intended Product

All V2 servers

Release Date

2012-09-26

Keyword

iMana, BIOS, time synchronization

Symptom

Hardware configuration

X6000 V2

Symptom

  1. Time cannot be synchronized from the basic input/output system (BIOS) to the iMana.
  2. On the BIOS screen, the year is displayed as "2002", as shown in Figure 5-188.
    Figure 5-188 BIOS time

  3. On the iMana command-line interface (CLI), the time is displayed as "1970-01-01", as shown in Figure 5-189.
    Figure 5-189 iMana time
Key Process and Cause Analysis

Cause analysis

If the BIOS time is earlier than 2010 or later than 2080, time cannot be synchronized from the BIOS to the iMana.
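The condition above can be written as a one-line check. The Python sketch below assumes that the boundary years 2010 and 2080 themselves are accepted, which this case does not state explicitly. The function name is hypothetical.

```python
from datetime import datetime


def bios_time_can_sync(bios_time):
    """Return True if the BIOS year is in the range accepted for synchronization."""
    # Assumption: the boundary years 2010 and 2080 are inclusive.
    return 2010 <= bios_time.year <= 2080


print(bios_time_can_sync(datetime(2002, 1, 1)))  # False
print(bios_time_can_sync(datetime(2015, 6, 1)))  # True
```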

Conclusion and Solution

Solution

Restart the server, press Delete to enter the BIOS screen during the power on self-test (POST), and change the BIOS time to the current time, as shown in Figure 5-190. (When the BIOS time is between 2010 and 2080, time is automatically synchronized from the BIOS to the iMana.)

Figure 5-190 BIOS time description

Experience

None

Note

None

The iMana Reports DIMM Alarms
Problem Description
Table 5-120 Basic information

Item

Information

Source of the Problem

X6000 V2

Intended Product

All V2 servers

Release Date

2012-09-26

Keyword

iMana, BIOS, time synchronization

Symptom

Hardware configuration

Three DIMMs are installed in slots DIMM010, DIMM012, and DIMM030 respectively.

Symptom

  1. The iMana reports that configurations in slots DIMM010 and DIMM012 are incorrect, as shown in Figure 5-191.
    Figure 5-191 Configuration error

  2. On the iMana command-line interface (CLI), run ipmcget -t sensor -d list | grep -i dimm and check the DIMM status, as shown in Figure 5-192.
    Figure 5-192 DIMM status

  3. Figure 5-192 shows that the DIMM in slot DIMM010 is detected but is faulty, the DIMM in slot DIMM030 is detected and operating properly, and the DIMM in slot DIMM012 is not detected and is faulty.
Key Process and Cause Analysis

Cause analysis

If a DIMM is installed in an incorrect slot, the basic input/output system (BIOS) and management engine (ME) determine that the DIMM is not properly installed. In this case, DIMMs are installed in slots DIMM010 and DIMM012, but no DIMM is installed in slot DIMM011. Therefore, the BIOS and management engine determine that the DIMM in slot DIMM012 is not properly installed and report the error to the iMana. Then the iMana displays the configuration errors of slots DIMM010 and DIMM012.

Conclusion and Solution

Conclusion

DIMMs are installed in incorrect slots.

Solution

When installing DIMMs, ensure that DIMMs in the same memory channel are installed from the end far away from the CPU to the end near the CPU and no intermediate slot can be skipped. For example, DIMMs must be installed in slots DIMM010, DIMM011, and DIMM012 in sequence, and slot DIMM011 cannot be skipped.
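The installation rule can be checked with a short Python sketch. It assumes that the last digit of the slot name is the position within a memory channel and that the preceding part identifies the channel, as the DIMM010, DIMM011, DIMM012 example suggests. It also assumes that position 0 is installed first. The function name is hypothetical.

```python
from collections import defaultdict


def skipped_slots(installed):
    """Return slots that are empty although a later slot in the channel is used."""
    by_channel = defaultdict(set)
    for name in installed:
        # Assumption: "DIMM01" is the channel and the last digit is the position.
        by_channel[name[:-1]].add(int(name[-1]))
    missing = []
    for channel, positions in sorted(by_channel.items()):
        for position in range(max(positions)):
            if position not in positions:
                missing.append(channel + str(position))
    return missing


print(skipped_slots(["DIMM010", "DIMM012", "DIMM030"]))  # ['DIMM011']
```

For the configuration in this case, the check reports DIMM011 as the skipped slot.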

Experience

When the iMana reports DIMM faults, query DIMM status and check whether DIMMs are installed in incorrect slots.

Note

DIMM status:

0x8000: indicates that no DIMM is installed.

0x8040: indicates that a DIMM is detected and operating properly.

0x8080: indicates that a DIMM is not detected and is faulty.

0x80C0: indicates that a DIMM is detected but is faulty.
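The status codes listed above can be kept in a lookup table when sensor output is processed by a script. The Python sketch below only covers the four codes in this note. The function name is hypothetical.

```python
DIMM_STATUS = {
    0x8000: "no DIMM installed",
    0x8040: "detected, operating properly",
    0x8080: "not detected, faulty",
    0x80C0: "detected, faulty",
}


def describe_dimm_status(code):
    """Translate a DIMM status code (int or hex string) into a description."""
    if isinstance(code, str):
        code = int(code, 16)
    return DIMM_STATUS.get(code, "unknown status 0x%04X" % code)


print(describe_dimm_status("0x80C0"))  # detected, faulty
```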

Garbled Characters Are Displayed When the Client Accesses SOL Through the iMana
Problem Description
Table 5-121 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

series servers

Release Date

2013-01-29

Keyword

Baud rate, SOL, BMC, garbled character

Symptom

Hardware configuration

RH2288 V2

Symptom

  1. On the basic input/output system (BIOS) screen, set the serial port baud rate to 38400, as shown in Figure 5-193.
    Figure 5-193 Serial port baud rate

  2. On the client, run the following command to connect to the iMana and activate Serial Over LAN (SOL) by using the IPMItool:

    ipmitool -I lanplus -H iManaIP -U root -P root sol activate (iManaIP indicates the iMana IP address, -U root indicates the iMana user name, and -P root indicates the password of iMana user root.)

  3. Garbled characters are displayed during system startup, as shown in Figure 5-194.
    Figure 5-194 Garbled characters
Key Process and Cause Analysis

Cause analysis

  1. The BIOS sends data to the serial port during startup. After the BIOS starts up, the service system also sends data to the serial port.
  2. If the baud rate that the BIOS uses on the serial port differs from the one that the service system uses (the service system uses a fixed baud rate of 115200), the iMana must detect which baud rate is in use.
  3. The iMana detects the baud rate by sampling, which takes a certain amount of time.
  4. The baud rate is not 115200 during BIOS startup, so the iMana continues to receive data at the BIOS baud rate for a period of time after the BIOS starts up. During this period, the iMana receives data at a baud rate different from that of the service system. Therefore, garbled characters are displayed.
Conclusion and Solution

Conclusion

The serial port baud rate during BIOS startup is inconsistent with that of the service system.

Solution

Press Delete during the power on self-test (POST) of the server to enter the BIOS screen, and set the serial port baud rate to 115200 (choose Advanced > Console Redirection Setup > Baud Rate), as shown in Figure 5-195. Then press F10 to save and exit.

Figure 5-195 Setting the serial port baud rate

Experience

Check whether the serial port baud rate of the BIOS is 115200 (default) first. If not, set it to 115200.

Note

None

Watchdog on a V2 Server Times Out But the Server Does Not Reset
Problem Description
Table 5-122 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

V2 servers

Release Date

2015-01-20

Keyword

watchdog overflow, power cycle

Symptom

Hardware configuration

V2 server with iMana 200

Symptom

iMana 200 on a V2 server reports the event "Watchdog overflow, power cycle", but the server does not reset. See Figure 5-196.

Figure 5-196 Event information

Key Process and Cause Analysis

Cause analysis

Graceful Power-off Timeout Period is added to the page displayed after you choose Configuration > System Configuration on iMana 200. The value of Graceful Power-off Timeout Period is 0 (default value) or ranges from 10 seconds to 6540 seconds.

The value 0 indicates that the operating system (OS) controls server power-off. A value from 10 to 6540 indicates that iMana resets the southbridge chip to forcibly power off the server if the OS fails to power off within the specified graceful power-off timeout period. If Graceful Power-off Timeout Period is set to 0 and the OS crashes, the server cannot power off.
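The meaning of the Graceful Power-off Timeout Period values can be summarized in a small Python sketch. It only restates the value ranges given above. The function name is hypothetical.

```python
def describe_graceful_timeout(seconds):
    """Describe the effect of a Graceful Power-off Timeout Period value."""
    if seconds == 0:
        return "The OS controls power-off; the server cannot power off if the OS crashes."
    if 10 <= seconds <= 6540:
        return "iMana forcibly powers off the server after %d seconds." % seconds
    raise ValueError("value must be 0 or from 10 to 6540 seconds")


print(describe_graceful_timeout(0))
print(describe_graceful_timeout(600))
```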

Conclusion and Solution

Solution

  1. Restart the system by using iMana 200 during off-peak hours to minimize impact on services.
  2. Set Graceful Power-off Timeout Period to a value other than 0 so that iMana can automatically reset the server when the watchdog overflows. (There is a low probability that data loss may occur during reset.)
Experience

None

Note

None

"Illegal User" Is Displayed in an Attempt to Open the Remote Control Page on the iMana WebUI
Problem Description
Table 5-123 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

V2 servers

Release Date

2013-05-28

Keyword

iMana, remote control, illegal user

Symptom

Hardware configuration

RH2288 V2

Symptom

The message "Illegal User" is displayed when a user attempts to open the Remote Control page on the iMana WebUI.

Key Process and Cause Analysis

Key process

  1. The user can successfully log in to the iMana WebUI from another computer and open the Remote Control page. This indicates that iMana is operating properly.
  2. To resolve the problem, the user has performed the following steps:
    1. The user has logged in to iMana over Telnet and restarted iMana. The problem persists.
    2. The user has opened Internet Explorer and chosen Tools > Internet Options. On the General tab page, the user has clicked Delete in the Browsing History area. The problem persists.
    3. The user has restarted the PC that functions as a client. The problem is resolved.
Conclusion and Solution

Conclusion

The problem is resolved after the PC that functions as a client is restarted.

Solution

Restart the PC.

Experience
  1. To resolve the problem, perform the following steps:
    1. Log in to iMana over Telnet and restart iMana.
    2. Open Internet Explorer and choose Tools > Internet Options. On the General tab page, click Delete in the Browsing History area.
    3. Choose Start > Control Panel > Java. On the General tab page, click Settings in the Temporary Files Settings area. In the displayed dialog box, click Delete Files.
    4. Open Internet Explorer and choose Tools > Internet Options. On the Connections tab page, click LAN Settings. In the displayed dialog box, deselect the Use a proxy server for your LAN check box in the Proxy server area.
    5. Restart the PC that functions as a client.
Note

None

BMC WebUI Access Failed Because Port 80 Is Disabled
Problem Description
Table 5-124 Basic information

Item

Information

Source of the Problem

RH2288 V2

Intended Product

All servers

Release Date

2013-07-26

Keyword

Port 80, BMC WebUI, HTTPS

Symptom

Hardware configuration

RH2288 V2 server

Symptom

Users fail to open the BMC WebUI after entering the BMC IP address of the RH2288 V2 in the address box of Internet Explorer.

Key Process and Cause Analysis
  1. Check that Use a proxy server for your LAN is deselected in Internet Explorer.
  2. Log in to the BMC over Telnet, and run a command to check that port 80 on the server is enabled. See Figure 5-197.
    Figure 5-197 Checking the status of port 80 on the server

  3. Check that the BMC WebUI is successfully accessed over Hypertext Transfer Protocol Secure (HTTPS), that is, port 443.
  4. Determine that port 80 is disabled on the switch.
Conclusion and Solution

Conclusion

By default, users access the BMC WebUI over HTTP (using port 80). When port 80 on the switch is disabled, the access fails.

Solution

Access the BMC WebUI over HTTPS (using port 443).

Experience

Use port 80 when accessing the BMC WebUI over HTTP.

Use port 443 when accessing the BMC WebUI over HTTPS.

See Figure 5-198.

Figure 5-198 Ports for accessing the BMC WebUI

Note

Ask the customer to enable port 80 on the switch, or connect directly to the server in the equipment room (bypassing the switch) to access the BMC WebUI.
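The port checks in this case can also be scripted from the client PC. The following is a minimal sketch, not a Huawei tool: the function names check_ports and suggest_scheme are hypothetical, and the address 192.0.2.10 in the usage note is a placeholder. It only tests whether a TCP connection to ports 80 and 443 succeeds from where it is run, so a port blocked on the switch and a port disabled on the BMC look the same.

```python
import socket


def check_ports(host, ports=(80, 443), timeout=3.0,
                connect=socket.create_connection):
    """Return {port: bool}, True when a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Refused, filtered (timed out), or unreachable.
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results


def suggest_scheme(results):
    """Pick the URL scheme to try first, based on the reachable ports."""
    if results.get(80):
        return "http"
    if results.get(443):
        return "https"
    return None
```

For example, suggest_scheme(check_ports("192.0.2.10")) returns "https" when only port 443 is reachable, which matches the workaround used in this case.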

Error Message "Set Session Privilege Level to ADMINISTRATOR failed" Is Displayed When a Non-Administrator User Fails to Connect to the BMC Through ipmitool Ports
Problem Description
Table 5-125 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V1, V2, and V3 servers

Release Date

2016-02-02

Keyword

ipmitool lanplus, Set Session Privilege Level to ADMINISTRATOR failed

Symptom

Symptom:

A non-administrator user (operator or common user) fails to connect to the BMC using the ipmitool lan or lanplus port, and the following information is displayed:

Set Session Privilege Level to ADMINISTRATOR failed 
ipmitool -I lanplus -H **.**.**.** -U **** -P ******** mc info 
Set Session Privilege Level to ADMINISTRATOR failed: Unknown (0x81) 
Error: Unable to establish IPMI v2 / RMCP+ session     

**.**.**.** indicates the BMC IP address.

**** indicates the BMC user name.

******** indicates the BMC password.

Key Process and Cause Analysis

Cause analysis:

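A non-administrator user must add the -L option when using the ipmitool lan or lanplus port. Otherwise, ipmitool requests the ADMINISTRATOR privilege level by default, and the BMC rejects the request because the user does not have that privilege.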

-L level Remote session privilege level [default=ADMINISTRATOR]

Conclusion and Solution

Solution:

Specify -L as follows in the commands (operator: operator; user: common user):

ipmitool -I lanplus -H **.**.**.** -U **** -P ******** -L operator mc info

ipmitool -I lanplus -H **.**.**.** -U **** -P ******** -L user mc info
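If these commands are issued from a script, the -L option can be added only when it is needed. The sketch below is illustrative: build_ipmitool_cmd is a hypothetical helper name, and it only assembles the argument list using the options that appear in this case (-I, -H, -U, -P, -L). It does not contact a BMC.

```python
ALLOWED_LEVELS = ("user", "operator", "administrator")


def build_ipmitool_cmd(host, user, password, command,
                       level=None, interface="lanplus"):
    """Assemble an ipmitool argument list; level adds '-L <level>' when set."""
    args = ["ipmitool", "-I", interface, "-H", host, "-U", user, "-P", password]
    if level is not None:
        if level not in ALLOWED_LEVELS:
            raise ValueError("unsupported privilege level: %r" % level)
        # Without -L, ipmitool requests ADMINISTRATOR by default.
        args += ["-L", level]
    return args + list(command)
```

The returned list can be passed to subprocess.run() on a host where ipmitool is installed. For an operator account, call it with level="operator" and command=["mc", "info"].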

Experience

None

Note

None

The Mainboard UUIDs of Two FusionServer Servers Are the Same
Problem Description
Table 5-126 Basic information

Item

Information

Source of the Problem

RH1288 V2

Intended Product

V1, V2, and V3 servers

Release Date

2016-02-16

Keyword

Same UUIDs

Symptom

Symptom:

A user runs dmidecode -s system-uuid or dmidecode -t system | grep -i uuid to query the UUIDs of two servers running Linux, and the UUIDs are the same, as shown in Figure 5-199.

Figure 5-199 Same UUIDs
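When more than two servers are involved, the UUIDs collected with dmidecode can be compared in a script. The following is a minimal sketch that assumes the UUID strings have already been gathered into a dictionary keyed by host name. The function name find_duplicate_uuids and the host names in the usage note are hypothetical.

```python
from collections import defaultdict


def find_duplicate_uuids(uuid_by_host):
    """Return {uuid: [hosts]} for every UUID reported by more than one host."""
    hosts_by_uuid = defaultdict(list)
    for host, uuid in uuid_by_host.items():
        # Normalize case and whitespace so formatting differences do not hide a match.
        hosts_by_uuid[uuid.strip().lower()].append(host)
    return {u: sorted(h) for u, h in hosts_by_uuid.items() if len(h) > 1}
```

An empty result means that all collected UUIDs are unique. Any entry in the result lists the servers whose mainboard UUIDs need to be updated.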

Key Process and Cause Analysis

Cause analysis:

The mainboard UUIDs are not updated before server delivery.

The UUIDs of V2 servers are in the format of UUID generation time-BMC MAC address in reverse order, as shown in the red box in Figure 5-200.

Figure 5-200 Relationship between the UUID and BMC MAC address of a V2 server

The UUIDs of V3 servers are in the format of UUID generation time-BMC MAC address, as shown in the red box in Figure 5-201.

Figure 5-201 Relationship between the UUID and BMC MAC address of a V3 server

Conclusion and Solution

Solution:

Run ipmitool commands to update the mainboard UUIDs and restart the OS for the update to take effect.

  1. Out-of-band:

    ipmitool -I lanplus -H **.**.**.** -U **** -P ******** raw 0x30 0x90 0x27 0x0 0x47 0x55 0x49 0x44 0xaa

    **.**.**.** indicates the BMC IP address.

    **** indicates the BMC user name.

    ******** indicates the BMC password.

  2. In-band:

    ipmitool raw 0x30 0x90 0x27 0x0 0x47 0x55 0x49 0x44 0xaa

Experience

None

Note

None

Error Message "Authentication type NONE not supported" Is Displayed During Access to the BMC Through the ipmitool lan Port
Problem Description
Table 5-127 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V1, V2, and V3 servers

Release Date

2016-02-02

Keyword

ipmitool lan, Authentication type NONE not supported

Symptom

Symptom:

The client fails to connect to the BMC using the ipmitool lan port, and the following information is displayed:

Authentication type NONE not supported 
ipmitool -H **.**.**.** -U **** -P ******** mc info 
Authentication type NONE not supported 
Error: Unable to establish LAN session     

**.**.**.** indicates the BMC IP address.

**** indicates the BMC user name.

******** indicates the BMC password.

Key Process and Cause Analysis

Cause analysis:

The ipmitool lan port is disabled by default for the BMC.

Figure 5-202 BMC port information
Conclusion and Solution

Solution:

Use the lanplus port or enable the ipmitool lan port.

ipmitool -I lanplus -H **.**.**.** -U **** -P ******** mc info
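The error text printed by ipmitool is enough to tell this case apart from the neighboring ipmitool cases in this section. The following sketch maps the three documented messages to the fixes described in those cases. The name suggest_fix and the wording of the hints are illustrative, and the list is not exhaustive.

```python
# (message fragment, hint) pairs taken from the ipmitool cases in this section.
HINTS = (
    ("Set Session Privilege Level to ADMINISTRATOR failed",
     "add -L operator or -L user for a non-administrator account"),
    ("Authentication type NONE not supported",
     "use -I lanplus, or enable the ipmitool lan port on the BMC"),
    ("no matching cipher suite",
     "restore the lanplus channel permission to ADMINISTRATOR"),
)


def suggest_fix(error_text):
    """Return the hint for the first known message found in error_text, else None."""
    for fragment, hint in HINTS:
        if fragment in error_text:
            return hint
    return None
```

The function returns None for any message it does not recognize, so unknown errors still need manual analysis.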
Experience

None

Note

None

Error Message "no matching cipher suite" Is Displayed During Access to the BMC Through the ipmitool lanplus Port
Problem Description
Table 5-128 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

V1, V2, and V3 servers

Release Date

2016-02-02

Keyword

ipmitool lanplus, error in open session response message: no matching cipher suite

Symptom

Symptom:

The client fails to connect to the BMC using the ipmitool lanplus port, and the following information is displayed:

Error in open session response message: no matching cipher suite 
ipmitool -I lanplus -H **.**.**.** -U **** -P ******** mc info 
Error in open session response message: no matching cipher suite 
Error: Unable to establish IPMI v2 / RMCP+ session     

**.**.**.** indicates the BMC IP address.

**** indicates the BMC user name.

******** indicates the BMC password.

Key Process and Cause Analysis

Cause analysis:

The permission on the lanplus port channel has been changed to a level lower than ADMINISTRATOR.

Conclusion and Solution

Solution:

Run the following in-band command to restore the permission on the lanplus port channel to ADMINISTRATOR (the setting will be lost upon BMC reset):

ipmitool raw 0x6 0x40 0x01 0x22 0x84

Run the following in-band command to restore the permission on the lanplus port channel to ADMINISTRATOR (the setting will not be lost upon BMC reset):

ipmitool raw 0x6 0x40 0x01 0x22 0x44

Experience

Run the following command to set the permission on the lanplus port channel:

ipmitool -I lanplus -H 192.168.23.89 -U root -P root raw 0x6 0x40 0x01 0x22 0x82

The last byte of the command sets the permission: 0x81 indicates CALLBACK (no permission), 0x82 indicates USER (common user), 0x83 indicates OPERATOR (operator), and 0x84 indicates ADMINISTRATOR (administrator).

Run the following command to restore the permission on the lanplus port channel to ADMINISTRATOR (the setting will be lost upon BMC reset):

ipmitool raw 0x6 0x40 0x01 0x22 0x84

Run the following command to restore the permission on the lanplus port channel to ADMINISTRATOR (the setting will not be lost upon BMC reset):

ipmitool raw 0x6 0x40 0x01 0x22 0x44

Run the following command to query the permission on the lanplus port channel:

ipmitool raw 0x6 0x41 0x01 0x80

22 01 (CALLBACK)/22 02 (USER)/22 03 (OPERATOR)/22 04 (ADMINISTRATOR)
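The byte values used in the commands above follow a simple pattern: the low four bits carry the permission level, and in the set command the high bits distinguish 0x8x (setting lost upon BMC reset) from 0x4x (setting kept). The sketch below decodes both forms. The function names are hypothetical, and the reading of the high bits is inferred from the 0x84 and 0x44 examples in this case rather than from a full bit-field definition.

```python
LEVELS = {1: "CALLBACK", 2: "USER", 3: "OPERATOR", 4: "ADMINISTRATOR"}


def decode_limit_byte(value):
    """Decode the last byte of the set command, e.g. 0x84 or 0x44."""
    level = LEVELS.get(value & 0x0F)
    if level is None:
        raise ValueError("unknown permission level in 0x%02x" % value)
    selector = value & 0xC0
    if selector == 0x80:
        return level, "lost upon BMC reset"
    if selector == 0x40:
        return level, "kept after BMC reset"
    raise ValueError("unexpected high bits in 0x%02x" % value)


def decode_query_response(text):
    """Decode the query output, e.g. '22 04', into a permission level name."""
    second = int(text.split()[1], 16)
    return LEVELS.get(second & 0x0F)
```

For example, decode_limit_byte(0x82) describes the value used in the first command of this Experience section: USER permission that is lost upon BMC reset.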

Note

None

Watchdog on a V2 Server Times Out but the Server Does Not Reset
Problem Description
Table 5-129 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

V2 servers

Release Date

2015-01-20

Keyword

Watchdog

Symptom

Symptom:

The iMana on a V2 server reports the "watchdog overflow, power cycle" event, but the server does not reset, as shown in Figure 5-203.

Figure 5-203 Reported event

Key Process and Cause Analysis

Cause analysis:

The Graceful Power-off Timeout Period parameter is added under Configuration > System Configuration in the iMana on a V2 server. The value of this parameter is 0 (default value) or ranges from 10 to 6540 seconds. The value 0 indicates that the service system controls server power-off. A value from 10 to 6540 indicates that the iMana resets the southbridge chip to forcibly power off the server if the OS fails to power off within the specified graceful power-off timeout interval. If the value of this parameter is set to 0 and the service system crashes, the server cannot power off.
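The rule for this parameter can be summarized in a few lines of code. The following is a minimal sketch of the value ranges stated above. The function name describe_graceful_timeout is hypothetical and is not part of iMana.

```python
def describe_graceful_timeout(seconds):
    """Explain what a Graceful Power-off Timeout Period value means."""
    if seconds == 0:
        # Default: only the service system can power off the server.
        return "service system controls power-off; no forced power-off"
    if 10 <= seconds <= 6540:
        return "iMana forces power-off if the OS is still up after %d seconds" % seconds
    raise ValueError("value must be 0 or in the range 10 to 6540 seconds")
```

A server configured with 0 is the one that stays up after a watchdog timeout when the service system has crashed, which is the symptom in this case.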

Conclusion and Solution

Solution:

  1. Restart the system by using the iMana during off-peak hours to minimize impact on services.
  2. Set Graceful Power-off Timeout Period to a value other than 0 so that the server can reset automatically when the service system crashes. (There is a low probability that data loss may occur during reset.)
NOTE:

The graceful power-off timeout interval in the iMana on an R1 server ranges from 5 to 480 minutes. Other possible causes are as follows:

  • In the iMana earlier than 2.15 on an R1 server, the IPMI channel is suspended at a low probability because the OS sends IPMItool instructions to interact with the iMana. As a result, the watchdog cannot work properly. Optimization for this scenario is integrated into the iMana later than 2.15.
  • In the iMana 5.97 or earlier on a V2 server, the watchdog powers off the service system at a low probability and then the service system cannot be powered on. When this problem occurs, the iMana needs to be upgraded to 6.05 or later.
  • The cooperation between the OS and the iMana is not in the normal state.
Experience

None

Note

None

A User Forgets the Password of BMC User ADMIN for an RH2488 V2
Problem Description
Table 5-130 Basic information

Item

Information

Source of the Problem

RH2488 V2

Intended Product

RH2488 V2

Release Date

2013-11-01

Keyword

RH2488 V2, ADMIN, IPMICFG

Symptom

Hardware configuration

RH2488 V2 server

Symptom

A user cannot log in to the BMC of an RH2488 V2 after the user has changed the password of the default BMC user ADMIN but forgotten the new password.

Key Process and Cause Analysis

Key process

  1. Download the IPMICFG tool from ftp://ftp.supermicro.com/utility/IPMICFG/.
  2. Run either of the following commands in the server OS to restore the factory BMC settings:

    IPMICFG -fd

    IPMICFG -fde

  3. Log in to the BMC using the default user name ADMIN and default password ADMIN.

Cause analysis

The user forgets the new password of BMC user ADMIN after changing the password.

Conclusion and Solution

Conclusion

The user forgets the password of BMC user ADMIN and therefore cannot log in to the BMC.

Solution

Use the IPMICFG tool in the OS to restore the factory BMC settings.

Experience

None

Note

None

Message "undefined" Is Displayed on the BMC WebUI of an RH2488 V2
Problem Description
Table 5-131 Basic information

Item

Information

Source of the Problem

RH2488 V2

Intended Product

RH2488 V2

Release Date

2013-11-25

Keyword

RH2488 V2, BMC, undefined

Symptom

Hardware configuration

RH2488 V2 server

Symptom

The BMC WebUI of an RH2488 V2 displays "undefined." See Figure 5-204.

Figure 5-204 "undefined" displayed on the BMC WebUI

Key Process and Cause Analysis

Key process

  1. Download the IPMICFG tool from ftp://ftp.supermicro.com/utility/IPMICFG/.
  2. Run either of the following commands in the OS to restore the factory BMC settings:

    IPMICFG -fd

    IPMICFG -fde

  3. Log in to the BMC using the default user name and password.

    Default user name: ADMIN

    Default password: ADMIN

Cause analysis

The BMC is operating abnormally due to a power failure.

Conclusion and Solution

Conclusion

The BMC is operating abnormally due to a power failure.

Solution

Use the IPMICFG tool in the OS to restore the factory BMC settings.

Disconnect the AC power supply from the server for about 5 minutes if allowed, and then power on the server.

Experience

None

Note

None

iMana IP Address Is Lost After iMana Resets Upon the IP Address Change
Problem Description
Table 5-132 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

V2 servers that use iMana

Release Date

2013-11-25

Keyword

iMana IP address, lost

Symptom

Symptom

The old IP address information of iMana is as follows:

  • IP address: 3.1.95.254
  • Subnet mask: 255.255.255.0
  • Gateway address: 3.1.95.205

The iMana IP address information is changed in the BIOS as follows:

  • IP address: 10.175.1.42
  • Subnet mask: 255.255.252.0
  • Gateway address: 10.175.0.1

After the server is powered off and then powered on, the IP address 10.175.1.42 is inaccessible and the IP address displayed in the BIOS is still 3.1.95.254.

Key Process and Cause Analysis

Key process

Check the iMana IP address in /data/backupip. It is found that the IP address is the old one. See Figure 5-205.

Figure 5-205 iMana IP address in /data/backupip

Possible causes:

  • The new IP address information cannot be written to /data/backupip because of high flash memory usage.
  • iMana is operating abnormally.

Cause analysis

Run the du -sh * command to view the file size. It is found that the collector_linux.zip file of a large size is in /home on iMana. As a result, the flash memory usage reaches 97% (the flash memory usage of /data/ should be less than 70%). See Figure 5-206.

Figure 5-206 Viewing the file size

Conclusion and Solution

Solution

Delete the collector_linux.zip file from /home on iMana, and change the iMana IP address information again. See Figure 5-207.

Figure 5-207 Deleting the collector_linux.zip file

Experience

Do not upload any large file to iMana. A large file may cause iMana to operate abnormally because the iMana flash memory capacity is limited.
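The 70% guideline mentioned in the cause analysis can be expressed as a simple check. The sketch below is illustrative only: this case uses du -sh * on iMana, and whether a Python interpreter is available there is not stated, so the code assumes an environment where Python can see the file system in question. The names usage_percent and flash_too_full are hypothetical.

```python
import shutil


def usage_percent(path, disk_usage=shutil.disk_usage):
    """Return the used share (in percent) of the file system holding path."""
    usage = disk_usage(path)
    return 100.0 * usage.used / usage.total


def flash_too_full(path="/data", limit=70.0, disk_usage=shutil.disk_usage):
    """True when usage has reached the limit (usage should stay below it)."""
    return usage_percent(path, disk_usage) >= limit
```

With the 97% usage observed in this case, flash_too_full() would return True, indicating that files such as collector_linux.zip need to be removed before the IP address is changed again.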

Note

None

Login to iMana Is Restricted
Problem Description
Table 5-133 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

V2 servers that use iMana

Release Date

2013-11-25

Keyword

iMana WebUI, limited login

Symptom

Symptom

The message "Limited login" is displayed when a user attempts to log in to the iMana WebUI from a client.

Key Process and Cause Analysis

Cause analysis

A login rule has been added on the iMana WebUI. When a user attempts to log in to the iMana WebUI from a client, the system verifies the login request based on the login rule. The message "Limited login" is displayed when the login request does not conform to the login rule.

Conclusion and Solution

Solution

  1. Remember or record the configured login rule.
  2. If you forget the login rule, view the rule as follows:
    1. Log in to the iMana CLI and run the following commands in sequence to add a temporary administrator.

      ipmcset -d adduser -v newuser //After running the command, enter your password and set a password for the new administrator as prompted.

      ipmcset -d privilege -v newuser 4 //In the command, 4 indicates an administrator.

      See Figure 5-208.

      Figure 5-208 Adding a temporary administrator

    2. Log in to the iMana WebUI as the temporary administrator, choose Configuration > User in the navigation tree, and view the login rule.
    3. Modify the client information to meet the login rule. Attempt to log in to the iMana WebUI again.
      NOTE:

      You can set the root user as the emergency login user, who is not restricted by the password validity period or login rule. (Only an administrator can be set as an emergency login user.)

    4. Log in to the iMana WebUI as the root user, choose Configuration > User in the navigation tree, and delete the temporary administrator.
Experience

None

Note

None

Alarm "Memory Usage Limit Exceeded"
Problem Description
Table 5-134 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

V2 servers

Release Date

2014-09-16

Keyword

Memory Usage Limit Exceeded, BMA

Symptom

Symptom

The alarm "Memory Usage Limit Exceeded" is displayed after a user logs in to the BMC WebUI and chooses Events and Logs > System Events.

Key Process and Cause Analysis

Key process

Handle the alarm. For details, see "ALM-0541FFFF Limit Exceeded (Memory Usage)" in HUAWEI Rack Server Alarm Handling (iMana 200) 01.

Conclusion and Solution

Conclusion

The in-band management software BMA is installed in the OS. The BMA reports the resource usage (including the CPU usage, memory usage, and drive usage) to the BMC. Then the BMC compares the reported resource usage with the alarm thresholds, and reports alarms if the resource usage exceeds the alarm thresholds.
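The comparison described above amounts to checking each reported value against its threshold. The following sketch illustrates that logic under stated assumptions: the resource names, the sample figures, and the function name exceeded_thresholds are hypothetical, and the actual BMA and BMC data formats are not documented here.

```python
def exceeded_thresholds(usage, thresholds):
    """Return the names of resources whose usage is above the alarm threshold."""
    return sorted(
        name for name, value in usage.items()
        if name in thresholds and value > thresholds[name]
    )
```

Because the comparison is "exceeds", a threshold of 100% can never be crossed by a percentage value. This is presumably why allowing the memory usage threshold to be set to 100% stops the alarm from being reported.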

Solution

  1. Expand the memory capacity based on the actual memory usage.
  2. Check whether the Memory Usage alarm threshold is set too low.
  3. Upgrade the BMA to V5.10 or later, in which the alarm threshold for the memory usage can be set to 100%.
Experience

For details, see HUAWEI Rack Server Alarm Handling (iMana 200) 01.

Note

None

Message "connect manage system fail" Is Displayed During the Access to the Remote Control Window
Problem Description
Table 5-135 Basic information

Item

Information

Source of the Problem

RH2285H V2

Intended Product

V2 servers

Release Date

2016-02-26

Keyword

Remote control, connect manage system fail

Symptom

The remote control window of an RH2285H V2 server cannot be opened properly, and the following message is displayed:

connect manage system fail, the manage system IP is 192.168.xx.xx

See Figure 5-209.

Figure 5-209 Error message displayed during the access to the remote control window

Key Process and Cause Analysis

On the Windows client, run the telnet BMC_IP Port number command in the CLI. It is found that port 2198, which is used by the remote control service, is disabled. After the port is enabled, the remote control window can be accessed properly, as shown in Figure 5-210 and Figure 5-211.

Figure 5-210 Checking that port 2198 is disabled using Telnet

Figure 5-211 Port 2198 is used by the remote control service by default

Conclusion and Solution

Conclusion:

The port used by the remote control service is disabled.

Solution:

Enable the port used by the remote control service on the live network.

Experience

None

Note

None

An Error Occurs When an ISO File Is Mounted to the Virtual DVD-ROM Drive of an RH2488 V2 Server
Problem Description
Table 5-136 Basic information

Item

Information

Source of the Problem

RH2488 V2

Intended Product

RH2488 V2

Release Date

2014-09-09

Keyword

ISO file mounting

Symptom

Hardware configuration: RH2488 V2

Symptom: An error occurs when an ISO file is mounted to the virtual DVD-ROM drive of an RH2488 V2 server, as shown in Figure 5-212.

Figure 5-212 Error that occurs when an ISO file is mounted to the virtual DVD-ROM drive of an RH2488 V2 server
Key Process and Cause Analysis

Cause Analysis:

The path of an ISO file cannot contain Chinese characters. However, the path that stores the ISO file on the site contains Chinese characters. The problem can be resolved after the path is modified.

Conclusion and Solution

The path of an ISO file cannot contain Chinese characters or spaces. Otherwise, the ISO file information cannot be obtained and an error occurs.
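A path can be checked against this restriction before the ISO file is mounted. The sketch below is a conservative approximation: it flags spaces and any non-ASCII character, which covers Chinese characters but also rejects other characters that this case does not mention. The function name iso_path_problems and the sample paths are hypothetical.

```python
def iso_path_problems(path):
    """Return a list of reasons why the path may fail to mount (empty if none)."""
    problems = []
    if " " in path:
        problems.append("contains a space")
    if any(ord(ch) > 127 for ch in path):
        # Non-ASCII is used as a stand-in for "Chinese characters".
        problems.append("contains a non-ASCII character")
    return problems
```

An empty list means that the path passes both checks. Otherwise, move or rename the ISO file before mounting it.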

Experience

None

Note

None

iMana V7xx Cannot Be Accessed on Windows XP or Windows Server 2003
Problem Description
Table 5-137 Basic information

Item

Information

Source of the Problem

Live network

Intended Product

iMana V7xx

Release Date

2015-08-06

Keyword

Windows XP, Windows Server 2003, Internet Explorer 8, iMana

Symptom

System configuration:

Client system configuration: Windows XP and Internet Explorer 8.0

iMana version: V715

Symptom:

The iMana WebUI cannot be accessed on an RH2288H V2 server.

  • The IP address can be pinged.
  • The server can be connected using SSH.
  • Port 443 can be connected using Telnet.

The iMana WebUI cannot be accessed on any of the three RH2288H V2 servers.

Key Process and Cause Analysis

Cause analysis:

The WebUI cannot be accessed in an environment configured with Windows Server 2003 or Windows XP and Internet Explorer 8. Internet Explorer 8 on Windows Server 2003 or Windows XP does not support the certificate encryption algorithms required by security policies: TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA and TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA.

On a 64-bit OS, the WebUI cannot be accessed using Internet Explorer 10, and the Java applet is disabled by default in this scenario. This problem cannot be resolved currently. The KVM screens of other vendors cannot be accessed in this scenario.

Conclusion and Solution

Solution:

The WebUI cannot be accessed because Internet Explorer 8 on Windows Server 2003 or Windows XP does not support the certificate encryption algorithms required by security policies. To resolve the problem, change the Windows management client to another version (Windows 7, Windows 8, Windows Server 2008, or Windows Server 2012).

Experience

None

Note

None

Abnormal Memory Status of V2 Servers
Problem Description
Table 5-138 Basic information

Item

Information

Source of the Problem

Live network

Intended Product

V2 servers

Release Date

2015-10-15

Keyword

V2 server, abnormal memory status

Symptom

A customer reported RH2288H V2 memory faults. The memory information on the System Information > System Hardware page was inconsistent with that on the Real-time Monitoring > Component page.

  • On the System Information > System Hardware page, the status of DIMM001 is "No Module Installed".
  • On the Real-time Monitoring > Component page, the status of DIMM001 is "Installed", a normal state.
Key Process and Cause Analysis

Key process

  • The System Information > System Hardware page displays parsed SMBIOS information, mainly component installation information.
  • The Real-time Monitoring > Component page displays parsed SMBIOS and ME information, mainly alarm information.
  • The SMBIOS information is updated only after the system detects the boot device. Otherwise, the information reported last time is still displayed.

Cause analysis

The system failed to start due to a memory fault. As a result, the SMBIOS information was not refreshed.

Conclusion and Solution

Conclusion:

The system failed to start due to a memory fault. As a result, the SMBIOS information was not refreshed.

Solution:

Clear the alarm reported for DIMM001. The SMBIOS information is updated after the server starts from the boot device.

Experience

None

Note

None

"Invalid User" Is Displayed When an ISO Image Is Mounted to the RH2288 V2 KVM
Problem Description
Table 5-139 Basic information

Item

Information

Source of the Problem

Live network

Intended Product

RH2288 V2

Release Date

2018-05

Keyword

KVM, invalid user

Symptom

When an ISO image is mounted to the RH2288 V2 KVM, "Invalid User" is displayed.

Key Process and Cause Analysis

Problem analysis:

  1. Check the OS image file.

    On the live network, "Invalid User" is displayed when different types of image files are used. Therefore, the problem is not caused by the ISO image file.

  2. Check the iMana version.

    Mount the image file after iMana is restarted. "Invalid User" is not displayed. The BMC R&D engineers confirm that the iMana version 3.91 is outdated and that the problem is caused by the outdated iMana version.

Conclusion and Solution

Solution:

Temporary solution: Restart iMana.

Long-term solution: Upgrade iMana to the latest version. The recommended version is 7.01 or later.
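Whether an installed iMana version is older than the recommended one can be checked with a tuple comparison. The following is a minimal sketch that assumes version strings in the dotted "major.minor" form used in this case (for example 3.91 and 7.01). Undotted forms such as V715 are not handled, and the function names are hypothetical.

```python
def parse_version(text):
    """Split a 'major.minor' version string into a tuple of integers."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def needs_upgrade(current, recommended="7.01"):
    """True when the current version is older than the recommended version."""
    return parse_version(current) < parse_version(recommended)
```

Comparing integer tuples avoids the pitfalls of comparing version strings as text or as floating-point numbers.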

Common Problems of Fan Modules and Power Supplies

Fans Keep Running Rapidly Due to Incorrect Connection of Front VGA Cables
Problem Description
Table 5-140 Basic information

Item

Information

Source of the Problem

RH2285 V2

Intended Product

RH1288 V2, RH2268 V2, RH2265 V2, RH2288 V2, RH2285 V2, RH2288H V2, and RH2285H V2

Release Date

2014-06-17

Keyword

Fan module

Symptom

Hardware configuration

RH2285 V2 server, configured with a backplane housing eight hard drives

Symptom

RH2285 V2 fans keep running at about 7000 rotations per minute. The BMC displays alarm information shown in Figure 5-213.

Figure 5-213 Alarm information

Key Process and Cause Analysis

Key process

  1. Power off and restart the server, but the problem persists.
  2. Upgrade the BMC to version 597, but the problem persists.
  3. Replace the mainboard, but the problem persists.

The BMC reports an alarm about the rear hard drive temperature, but neither a rear hard drive nor a rear hard drive backplane is installed on the server. The server is configured with eight hard drives and a front VGA port. The ports on the server mainboard are shown in Figure 5-214.

Figure 5-214 Mainboard ports

A cable connects the REAR_HDD port (numbered 45 in Figure 5-214) to the front VGA board, as shown in Figure 5-215.

Figure 5-215 Server cable (1)

The front VGA cable of the faulty server is incorrectly connected. Move the other end of the VGA cable from port 45 (J141 REAR_HDD) to port 47 (J157 VGA_CARD), as shown in Figure 5-216. Then, the fans run properly, and the problem is resolved.

Figure 5-216 Server cable (2)

Cause analysis

The front VGA cable is connected to the rear hard drive port on the mainboard, which is incorrect.

Conclusion and Solution

Conclusion

The front VGA cable is connected to the rear hard drive port on the mainboard, which is incorrect.

Solution

Connect the front VGA cable to the correct port on the mainboard.

Experience
  • Only an 8-hard-drive server contains a front VGA port.
  • The incorrect connection of a VGA port on the mainboard may cause a mainboard fault.
  • The mis-connection of a rear hard drive port on the mainboard may cause an abnormal fan speed.
  • If the cable in the front VGA port is mis-connected to the rear hard drive backplane port on the mainboard, Select OutTime may be displayed after a BMC command is run, BMC FRU information burning may fail, or fans may run rapidly.
  • If the cable in the front VGA port is mis-connected to the rear hard drive backplane port on the mainboard, the change of the BMC IP address on the BIOS setup menu may not take effect.
Note

None

The Fan Speed Reaches the Maximum Value After an RH2285 Is Powered On
Problem Description
Table 5-141 Basic information

Item

Information

Source of the Problem

RH2285

Intended Product

RH2285 and RH1285

Release Date

2010-09-26

Keyword

Fan speed, maximum value, BMC

Author

Li Zhibing (employee ID: 62067)

Symptom

Hardware configuration:

RH2285 server

Symptom

After the onsite RH2285 is powered on, the fan speed reaches the maximum value.

(1) Run the ipmctool 2 command on the baseboard management controller (BMC) CLI to check that the fan rate is 100%.

(2) Run the top command on the BMC CLI to check that the sum of the CPU usages for the bmcipmi.out and webs processes is over 50%.

Key Process and Cause Analysis

Key process

Based on the top command output, the webs and bmcipmi.out processes are communicating continuously, which indicates an exception. Based on experience with similar problems, the webs process failed to obtain the sdr information.

Run the ls -al /data/ command on the BMC CLI. The size of RH2285sdr.bin or RH1285sdr.bin is 0. See Figure 5-217.

NOTE:

For versions later than B031, the sdr file is renamed as sdr0.bin.

On the RH2285, the output of the ls -al /data/ command shows that the size of RH2285sdr.bin is 0.

Figure 5-217 RH2285sdr.bin size

On the RH1285, the output of the ls -al /data/ command shows that the size of RH1285sdr.bin is 0. See Figure 5-218.

Figure 5-218 RH1285sdr.bin size

Conclusion and Solution

Conclusion

Multiple process modules cannot be initialized because the sdr file is empty. These include the fan module, so fan speed adjustment does not take effect.

Solution: Two options are available, and solution 1 is recommended.

  1. Manually copy the RH2285_*disk_sdr.bin file that matches the hard drive configuration from the /data/mgnt directory to the /data directory. For example, copy RH2285_8disk_sdr.bin for an 8-drive server, or RH2285_12disk_sdr.bin for a 12-drive server. Rename the file based on the server type, and run the reboot command on the BMC CLI to reset the BMC. This solution involves the least work and is suitable for onsite technical support engineers and maintenance personnel.
  2. Upgrade the BMC software again to restore the RH2285. For details, see the upgrade guide. This solution requires only following the upgrade guide, not knowledge of the server, so it is suitable for customers and others who are not familiar with the server.

Assume that a 12-drive RH2285 is faulty:

  1. The sdr files of the rack server (for the 4-drive, 8-drive, and 12-drive configurations) are in the /data/mgnt directory on the BMC.
  2. Copy RH2285_12disk_sdr.bin to the /data directory.
  3. Rename RH2285_12disk_sdr.bin as RH2285sdr.bin.
  4. Reset the BMC to make fan speed adjusting take effect and recover fans.
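The copy-and-rename steps above can be sketched as file operations. This is an illustration only: on the BMC CLI the work is done with shell commands, and whether Python is available there is not stated. The function names empty_sdr_files and restore_sdr are hypothetical, and the BMC reset in the last step is not performed by the code.

```python
import shutil
from pathlib import Path


def empty_sdr_files(data_dir):
    """List sdr .bin files directly under data_dir whose size is 0."""
    return sorted(
        p.name for p in Path(data_dir).glob("*sdr*.bin")
        if p.is_file() and p.stat().st_size == 0
    )


def restore_sdr(data_dir, drive_count, model="RH2285", target_name=None):
    """Copy <model>_<n>disk_sdr.bin from data_dir/mgnt over the active sdr file."""
    data_dir = Path(data_dir)
    source = data_dir / "mgnt" / f"{model}_{drive_count}disk_sdr.bin"
    # Default target follows the naming in this case, e.g. RH2285sdr.bin.
    target = data_dir / (target_name or f"{model}sdr.bin")
    if not source.is_file():
        raise FileNotFoundError(str(source))
    shutil.copyfile(source, target)
    return target
```

For the 12-drive example, restore_sdr("/data", 12) covers steps 2 and 3. For versions later than B031, where the sdr file is named sdr0.bin, pass target_name="sdr0.bin". The BMC still has to be reset afterward.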

Experience

If fans run at high speed, run the ipmctool 2 and ipmctool 3 commands on the BMC CLI to check the fan speed. Then run the ipmcget -d healthevent command to check the health status of the BMC system and whether alarms are generated on other components.

Note

If the preceding method works, the fans return to normal speed shortly after the BMC resets. The change in fan speed is audible.

Fans Keep Running Rapidly Due to Incorrect Front VGA Cable Connection on an RH2285 V2
Problem Description
Table 5-142 Basic information

Item

Information

Source of the Problem

RH2285 V2

Intended Product

RH1288 V2, RH2268 V2, RH2265 V2, RH2288 V2, RH2285 V2, RH2288H V2, and RH2285H V2

Release Date

2014-06-17

Keyword

Fan module

Symptom

Hardware configuration:

RH2285 V2 server, equipped with a backplane housing eight hard drives

Symptom:

RH2285 V2 fans keep running at about 7000 rotations per minute.

Key Process and Cause Analysis

Key process:

  1. Power off and restart the server, but the problem persists.
  2. Upgrade the BMC to version 597, but the problem persists.
  3. Replace the mainboard, but the problem persists.

The BMC reports an alarm about the temperature of rear hard drives, but neither a rear hard drive nor a rear hard drive backplane is installed on the server. The server is configured with eight hard drives and a front VGA port. The ports on the server mainboard are shown in Figure 5-219.

Figure 5-219 Ports

A cable is connected to the REAR_HDD port (numbered 45 in Figure 5-219) and the front VGA board, as shown in Figure 5-220.

Figure 5-220 REAR_HDD port

The front VGA cable of the faulty server is incorrectly connected. Move the end of the VGA cable connected to port 45 (J141 REAR_HDD) to port 47 (J157 VGA_CARD), as shown in Figure 5-221.

Figure 5-221 Front VGA cable

Cause analysis:

The front VGA cable is connected to the rear hard drive port on the mainboard.

Conclusion and Solution

Conclusion:

The front VGA cable is connected to the rear hard drive port on the mainboard.

Solution:

Connect the front VGA cable to the correct port on the mainboard.

Experience

Only an 8-drive server contains a front VGA port.

Incorrect connection of a VGA port on the mainboard may cause a mainboard fault.

The incorrect connection of a rear hard drive port on the mainboard may cause an abnormal fan speed.

If the cable from the front VGA port is connected to the rear hard drive backplane port on the mainboard, Select OutTime may be displayed after a BMC command is run, BMC FRU information burning may fail, or fans may run rapidly.

If the cable from the front VGA port is connected to the rear hard drive backplane port on the mainboard, the change of the BMC IP address on the BIOS setup menu may not take effect.

If the cable from the front VGA port is connected to the rear hard drive backplane port on the mainboard, the BMC may fail to identify all fan modules.

Note

None

Failed to Identify RH2485 V2 Fan Modules Due to an Unclosed DIMM Slot Latch
Problem Description
Table 5-143 Basic information

Item

Information

Source of the Problem

Tecal RH2485

Intended Product

Tecal RH2485

Release Date

2015-12-24

Keyword

Fan module

Symptom

Hardware configuration:

RH2485 server

Symptom:

After an RH2485 is powered on, fan modules and the fan speed are not detected, and the fans are running at full speed.

Key Process and Cause Analysis

Cause analysis:

  1. The customer removed fan modules before removing a DIMM.
  2. After the DIMM was removed, the DIMM slot latch was not closed, as shown in the red box in Figure 5-222.
  3. The DIMM slot latch prevented the fan modules from being fully inserted to the connectors.
Figure 5-222 DIMM slot latch

Conclusion and Solution

Solution:

Close the DIMM slot latch, and reinstall the fan modules.

Experience

None

Note

None

Updated: 2019-02-25

Document ID: EDOC1000041338
