Huawei Server Maintenance Manual 09

V3

Common Problems During Startup and Shutdown

RH5885 V3 Automatically Shuts Down

This topic describes how to rectify the fault that the RH5885 V3 automatically shuts down.

Problem Description
Table 5-144 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Automatically shuts down

Symptom

The RH5885 V3 automatically shuts down.

Key Process and Cause Analysis
  • The ambient operating temperature is excessively high.
  • The RH5885 V3 is faulty.
Conclusion and Solution
  1. Check whether the cause is the same as that for All Indicators Are Off.

    • If yes, no further action is required.
    • If no, go to Step 2.

  2. Check whether the ambient temperature in the equipment room is higher than the upper alarm threshold, or log in to the iMana 200 WebUI and check whether a high-temperature alarm has been generated for a component.

    • If yes, go to Step 3.
    • If no, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.

  3. Decrease the temperature in the equipment room.
  4. Start the RH5885 V3, and check whether it automatically shuts down.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

RH1288 V3 Cannot Be Powered On After an Unexpected Mainboard Power-off and iBMC Logs Record the PwrOK Sig.Drop and PwrOn TimeOut Alarms
Symptom

An RH1288 V3 server was powered off for protection after a power alarm was generated, and could not be powered on. iBMC logs recorded the PwrOk Sig.Drop alarm. When the iBMC was used to power on the server, the iBMC printed "PwrOn TimeOut".

Key Process and Cause Analysis

Key process:

  1. iBMC logs of the server record a major alarm PwrOk Sig.Drop. This indicates that a power supply unit (PSU) on the mainboard is faulty.

    The following figure shows how a PwrOk Sig.Drop alarm is generated.

    1. Normal power-on process

      The mainboard CPLD sends an ENABLE=1 signal to each PSU on the mainboard. After a PSU is powered on successfully, it returns a POWER GOOD=1 signal to the mainboard.

    2. Power supply failure (PwrOk Sig.Drop) process

      The mainboard CPLD sends an ENABLE=1 signal to each PSU on the mainboard. If a PSU of the mainboard experiences a power failure, it returns a POWER GOOD=0 signal to the mainboard. Upon receiving the signal, the CPLD reports it to the iBMC. The iBMC then generates a PwrOk Sig.Drop alarm.

  2. According to the SEL log of the iBMC on the server, the mainboard is powered off for protection after a power alarm is generated. You are advised to apply for a spare mainboard for replacement. When the PwrOk Sig.Drop alarm is generated, do not power off and then power on the server. Otherwise, secondary damage to the mainboard may cause a short circuit.
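The signal exchange described in step 1 can be sketched as a small simulation. This is an illustrative model only: the function name `check_power_good`, the PSU labels, and the data layout are assumptions made for this example and do not represent the actual CPLD or iBMC firmware.

```python
from typing import Dict, List


def check_power_good(power_good: Dict[str, int]) -> List[str]:
    """Return alarm strings for PSUs that did not answer ENABLE=1 with POWER GOOD=1.

    power_good maps a hypothetical PSU label to the POWER GOOD level
    it returned after the CPLD asserted ENABLE=1.
    """
    alarms = []
    for psu, level in sorted(power_good.items()):
        if level != 1:
            # The CPLD reports the dropped signal to the iBMC, which raises the alarm.
            alarms.append(f"PwrOk Sig.Drop ({psu})")
    return alarms


# Normal power-on: every PSU returns POWER GOOD=1, so no alarm is raised.
print(check_power_good({"psu_1": 1, "psu_2": 1}))
# Power supply failure: psu_2 returns POWER GOOD=0.
print(check_power_good({"psu_1": 1, "psu_2": 0}))
```

In this model, any PSU that returns POWER GOOD=0 produces an alarm entry, which mirrors the failure path described above.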
Conclusion and Solution

Conclusion:

This problem is caused by a mainboard fault. You are advised to apply for a spare mainboard for replacement.

If the iBMC logs contain the PwrOk Sig.Drop alarm, do not power off and then power on the server. Otherwise, secondary damage to the mainboard may cause a short circuit.

Solution:

Replace the mainboard.

Experience

None

Note

None

X710 (2 x 10GE) PXE Error Is Reported on the RH5885 V3
Problem Description
Table 5-145 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack servers
Release Date | 2018-01-29
Keyword | X710, !PXE structure, legacy PXE

Symptom

An error is reported when the X710 (2 x 10GE) is used on the RH5885 V3 for PXE boot. The same symptom appears on other servers with the same configuration.

When the X710 NIC is used to install Debian 9 over PXE, a PXE-EC8 error is reported, as shown in the following figure.

Key Process and Cause Analysis

PXE-EC8 error:

PXE-EC8: !PXE structure was not found in UNDI driver code segment UNDI ROM

According to the PXE-EC8 error message, this problem is related to the allocation of the BIOS address space. Legacy PXE cannot access the high-order address space, so the X710 NIC must be allocated resources below 4 GB. To ensure this, disable the Above 4G decoding function.

Conclusion and Solution

Conclusion:

Legacy PXE cannot access the high-order address space, so the X710 NIC must be allocated resources below 4 GB. To ensure this, disable the Above 4G decoding function.

Solution:

In the BIOS setup menu, choose Advanced > PCI Settings. On the displayed screen, set Above 4G to Disable. In this way, PXE is available in legacy mode.
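As a rough illustration of why this setting matters, the check below tests whether a memory-mapped resource lies entirely below the 4 GB boundary that legacy PXE code can address. The function name and the example addresses are hypothetical. Real resource assignments are made by the BIOS and are not shown in this case.

```python
FOUR_GB = 1 << 32  # 4 GB boundary (2**32 bytes)


def reachable_by_legacy_pxe(base: int, size: int) -> bool:
    """Return True if the whole region [base, base + size) lies below 4 GB."""
    return base >= 0 and size > 0 and base + size <= FOUR_GB


# Hypothetical example regions, not taken from a real server.
print(reachable_by_legacy_pxe(0xFB000000, 0x00800000))      # region below 4 GB
print(reachable_by_legacy_pxe(0x380000000000, 0x00800000))  # region above 4 GB
```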

Experience

None

Note

None

Power-on Failure Caused by the CPU_voltage_mismatch Alarm Reported by the iBMC
Problem Description
Table 5-146 Basic information

Item | Information
Source of the Problem | RH2288 V3
Intended Product | FusionServer
Release Date | 2016-09-07
Keyword | SysFWProgress, CPU voltage mismatch

Symptom

The iBMC WebUI reports the Mainboard SysFWProgress CPU voltage mismatch event.

Key Process and Cause Analysis

Cause analysis:

The voltages of the CPUs in the same server are inconsistent.

Conclusion and Solution

Conclusion:

The voltages of the CPUs in the same server are inconsistent. As a result, the server fails to initialize the CPUs.

Solution:

Use the replacement method to locate the faulty CPU, and then replace it.

Experience

None

Note

None

Memory UCE Alarm Causes RH2288H V3 OS Breakdown
Problem Description
Table 5-147 Basic information

Item | Information
Source of the Problem | Live network
Intended Product | Scenarios in which the memory UCE alarm causes OS breakdown
Release Date | 2017-11
Keyword | OS breakdown, uncorrectable error

Symptom
  • Hardware configuration:

    RH2288H V3

  • Symptom:

    The RH2288H V3 reports the "DIMM131 triggered an uncorrectable error" and "Critical OS error" alarms. Analyze these alarms together with the other event alarms.

Key Process and Cause Analysis

Key process

The following figure shows the DIMM layout of the two CPUs on an RH2288H V3 server.

Fault locating:

  1. Replace DIMM 131 and check whether the problem is caused by the DIMM. If yes, replace the faulty DIMM.
  2. Swap CPU 1 and CPU 2 to check whether the problem is caused by CPU 2. If yes, replace the faulty CPU.
  3. If the fault is not caused by CPU 2, apply for a mainboard for field replacement.

Possible Causes

  1. DIMM 131 is faulty.
  2. CPU 2 is faulty.
  3. The mainboard is faulty.
Conclusion and Solution

Conclusion: CPU 2 is faulty.

Solution: Replace the CPU.

Verification: The UCE alarm is cleared and the server works properly.

Experience

N/A

Note

N/A

PwrOk sig.Drop Is Reported After the RH1288 V3 Reports CPU thermal trip
Problem Description
Table 5-148 Basic information

Item | Information
Source of the Problem | Live network
Intended Product | RH1288 V3
Release Date | 2017-11
Keyword | thermal trip, PwrOk sig.Drop

Symptom
  • Hardware configuration:

    RH1288 V3

  • Symptom:

    PwrOk sig.Drop is reported when the RH1288 V3 is powered on again after it reports the CPU overtemperature alarm (thermal trip) during a power-on.

    Alarm information in the SEL log shows that the CPU temperature is too high, and then a power failure occurs.

    The maintenance log shows that the cpu2_vrd voltage is abnormal.

Key Process and Cause Analysis

Key process

The CPU temperature increases continuously. If the heat dissipation is poor, the CPU DTS alarm and the thermal trip critical alarm are generated in sequence. According to the alarm logs, CPU 2 reports a thermal trip alarm without generating the CPU DTS alarm. In addition, after the server is powered on again, a voltage failure occurs on cpu2_vrd. The voltage is controlled by the CPU. If the CPU is abnormal, the vrd voltage may be abnormal. Therefore, the CPU may encounter an internal working exception (for example, the power supply is short-circuited), causing CPU overtemperature.
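The reasoning above (a thermal trip that is not preceded by a CPU DTS alarm points to an internal CPU fault rather than poor heat dissipation) can be sketched as a simple check over a time-ordered event list. The event representation and the labels `"dts"` and `"thermal_trip"` are placeholders chosen for this sketch. They are not the actual SEL record format.

```python
from typing import Iterable, List, Tuple


def thermal_trip_without_dts(events: Iterable[Tuple[str, str]]) -> List[str]:
    """Return CPUs that reported a thermal trip with no earlier DTS alarm.

    events is a time-ordered sequence of (cpu, alarm) pairs.
    """
    seen_dts = set()
    suspects = []
    for cpu, alarm in events:
        if alarm == "dts":
            seen_dts.add(cpu)
        elif alarm == "thermal_trip" and cpu not in seen_dts and cpu not in suspects:
            suspects.append(cpu)
    return suspects


# Gradual overheating: the DTS alarm comes first, then the thermal trip.
print(thermal_trip_without_dts([("CPU1", "dts"), ("CPU1", "thermal_trip")]))
# Pattern seen in this case: CPU 2 trips without a preceding DTS alarm.
print(thermal_trip_without_dts([("CPU2", "thermal_trip")]))
```

A CPU returned by this check is only a suspect. The swap test is still needed to tell a CPU fault from a mainboard fault.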

Fault locating:

  1. Leave only socket CPU 1 populated on the mainboard and check the CPUs.

    (1) Remove CPU 2 and check whether the server can be powered on successfully. If yes, go to the next step. If no, replace the mainboard.

    (2) Remove CPU 1, install CPU 2 in socket CPU 1, and check whether the server can be powered on successfully.

    If the fault is caused by a CPU, replace the CPU. Otherwise, replace the mainboard.

  2. CPU 2 is faulty. Replace CPU 2.

Possible Causes

  1. CPU 2 is faulty.
  2. The mainboard voltage chip is faulty.
Conclusion and Solution

Conclusion: CPU 2 is faulty. As a result, the server cannot be powered on.

Solution: Replace CPU 2.

Experience

When a server is faulty, check whether other exception information exists before the fault occurs. If yes, determine the cause based on the exception information. Comprehensive analysis helps locate the fault quickly and accurately.

Note

The fault locating method in this case also applies to other servers, such as CH121 V3, RH2288 V3, and more.

All Indicators on an LED Diagnosis Panel Are Off
Problem Description
Table 5-149 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | RH5885 V3 and other servers equipped with an LED diagnosis panel
Release Date | 2015-03-19
Keyword | LED, diagnosis panel, indicator

Symptom

All indicators on an LED diagnosis panel are off, as shown in Figure 5-223.

Figure 5-223 LED diagnosis panel

Key Process and Cause Analysis

The indicators on an LED diagnosis panel are alarm indicators, and these indicators are off when there is no alarm. Figure 5-224 shows the LED diagnosis panel.

Figure 5-224 LED diagnosis panel

To find out whether the indicators are off because LED cables are not properly connected or because there is no alarm, use the following methods:

  • Check the health indicator on the mounting ear. If the indicator is red but the LED diagnosis panel indicators are off, the indicators on the panel are faulty.
  • If the cables between the LED diagnosis panel and mainboard are not properly connected, the BMC will generate an alarm: LED Panel configure err.
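The checks above can be combined into a small decision function. This is a minimal sketch: the function name, parameter names, and returned strings are illustrative assumptions and are not part of any BMC interface.

```python
def diagnose_led_panel(health_indicator_red: bool,
                       panel_leds_on: bool,
                       led_config_alarm: bool) -> str:
    """Classify the state of the LED diagnosis panel from three observations."""
    if led_config_alarm:
        # The BMC raised "LED Panel configure err".
        return "check the cables between the LED diagnosis panel and the mainboard"
    if panel_leds_on:
        return "alarm present: read the silkscreen next to the lit indicator"
    if health_indicator_red:
        # An alarm exists, but the panel shows nothing.
        return "the indicators on the panel are faulty"
    return "no alarm: the indicators are off as expected"


print(diagnose_led_panel(health_indicator_red=False, panel_leds_on=False, led_config_alarm=False))
print(diagnose_led_panel(health_indicator_red=True, panel_leds_on=False, led_config_alarm=False))
```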
Conclusion and Solution

Conclusion:

The indicators on an LED diagnosis panel are alarm indicators, and these indicators are off when there is no alarm.

Solution:

Check whether the health indicator on the right mounting ear of the device is red. If yes, there is an alarm. In that case, pull out the LED diagnosis panel, and check the silkscreen of the indicator to find out which component triggered the alarm. Then log in to the management software to view the detailed alarm cause.

Experience

None

Note

None

A Blank Screen Is Displayed After Power-On

This topic describes how to rectify the fault that the monitor screen is blank after the RH5885 V3 is powered on.

Problem Description
Table 5-150 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Blank screen

Symptom

After the RH5885 V3 is powered on, a blank screen is displayed.

Key Process and Cause Analysis

Possible Causes:

  • The power supply to the chassis is abnormal.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Narrow down the scope of the causes by checking the preceding possible problems one by one.

Conclusion and Solution

Procedure:

  1. Check whether the cause is the same as that for All Indicators Are Off.

    • If yes, no further action is required.
    • If no, go to Step 2.

  2. Check whether the alarm indicator on the panel is on.

  3. Log in to the iMana 200 to acknowledge and clear the alarm. For details, see the RH5885 V3 Server V100R003 Alarm Handling.
  4. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

  5. Check that the dual in-line memory modules (DIMMs) are installed in appropriate slots. For details about DIMM configuration rules, see Installing a DIMM in RH5885 V3 Server V100R003 User Guide.
  6. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

All Indicators Are Off

This topic describes how to rectify the fault that all indicators on the RH5885 V3 panel are off.

Problem Description
Table 5-151 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | All Indicators, off

Symptom

All the indicators on the RH5885 V3 panel are off, including the PWR indicator, UID indicator, HEALTH indicator, and hard drive status indicator. At the same time, no information is displayed on the keyboard, video, and mouse (KVM).

Key Process and Cause Analysis

Possible Causes:

  • The power supply to the chassis is abnormal.
  • The iMana 200 software is abnormal.

Fault Diagnosis:

Check the installation status of the power supply units (PSUs) and the power supply to the rack.

Conclusion and Solution

Procedure:

  1. Check whether the power supply unit (PSU) indicator on the chassis is steady green.

  2. Check whether the PSU indicators on other chassis in the same rack are steady green.

  3. Replace the rack.
  4. Check whether the indicators on the RH5885 V3 panel are in the normal states.

    • If yes, no further action is required.
    • If no, go to Step 5.

  5. Replace the PSU. For details, see the RH5885 V3 Server V100R003 User Guide.
  6. Check whether the indicators on the RH5885 V3 panel are in the normal states.

Experience

None

Note

None

Status Indicator Is Blinking
Problem Description
Table 5-152 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Status indicator, blinking

Symptom

The status indicator on the RH5885 V3 panel is blinking red.

Key Process and Cause Analysis

An alarm is generated for the RH5885 V3.

Fault Diagnosis:

View the alarm status and logs by using the iMana 200, and process the alarm based on the alarm information. For details, see the RH5885 V3 Server V100R003 Alarm Handling.

Conclusion and Solution

Procedure:

  1. The alarm severity depends on the indicator blinking frequency. A higher blinking frequency indicates a more severe alarm.

    • For a major alarm, the indicator blinks at 1 Hz (once per second).
    • For a critical alarm, the indicator blinks at 2 Hz (twice per second).

  2. You can view the alarm status and logs by using any of the following methods:

    • Use the iMana 200 CLI.
      1. Connect the client to the iMana 200 management network port on the RH5885 V3 by using a network cable.
      2. Access the iMana 200 CLI.
      3. Run the ipmcget -d healthevents and ipmcget -d sel -v list commands.
    NOTE:

    For details, see the HUAWEI Server iMana 200 V100R002 User Guide. For details about how to clear alarms, see the RH5885 V3 Server V100R003 Alarm Handling.

    • Use the iMana 200 WebUI.
      1. Connect the client to the iMana 200 management network port on the RH5885 V3 by using a network cable.
      2. Log in to the iMana 200 WebUI.
      3. View alarms and logs on the System Event Log page.
    NOTE:

    For details, see the iMana 200 Help. For details about how to clear alarms, see the RH5885 V3 Server V100R003 Alarm Handling.
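The blinking-frequency rule in step 1 can be expressed as a short helper. This is a sketch for illustration: the function name and the idea of counting blinks over an observation window are assumptions, and only the 1 Hz (major) and 2 Hz (critical) values come from the procedure above.

```python
def severity_from_blinks(blink_count: int, seconds: float) -> str:
    """Map an observed blink count over a time window to an alarm severity."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    frequency_hz = round(blink_count / seconds)
    # 1 Hz indicates a major alarm and 2 Hz a critical alarm.
    return {1: "major", 2: "critical"}.get(frequency_hz, "unknown")


print(severity_from_blinks(10, 10))  # 1 Hz
print(severity_from_blinks(20, 10))  # 2 Hz
```

Counting blinks over several seconds, rather than timing a single blink, makes the estimate less sensitive to observation error.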

Experience

None

Note

None

Replaced Component Does Not Work
Problem Description
Table 5-153 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Replaced component

Symptom

The newly replaced component does not work.

Key Process and Cause Analysis

Possible Causes:

  • The component does not apply to the RH5885 V3.
  • The component replacement operation is incorrect.
  • The component is faulty.
  • The port is abnormal.

Fault Diagnosis:

Narrow down the scope of the causes by checking the preceding possible problems one by one.

Conclusion and Solution

Procedure:

  1. Check that the component applies to the RH5885 V3. For details, see the RH5885 V3 Server V100R003 User Guide.
  2. Check that the component is properly replaced. For details, see RH5885 V3 Server V100R003 User Guide.
  3. Check that all devices and cables are properly installed.
  4. Power on the RH5885 V3. Check whether the component runs properly. For details about how to power on the RH5885 V3, see Powering On in RH5885 V3 Server V100R003 User Guide.

    • If yes, no further action is required.
    • If no, go to Step 5.

  5. Replace the component with a new one. For details, see RH5885 V3 Server V100R003 User Guide.
  6. Check whether the new component is operating properly.

    • If yes, no further action is required.
    • If no, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.

Experience

None

Note

None

Displayed Memory Capacity Is Inconsistent with the Physical Memory Capacity

This topic describes how to rectify the fault that the memory capacity displayed in the operating system (OS) is inconsistent with the physical memory capacity.

Problem Description
Table 5-154 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Physical Memory Capacity

Symptom

The memory capacity displayed on the OS is inconsistent with the physical memory capacity.

NOTE:

In normal cases, the displayed memory capacity is slightly less than the physical memory capacity. A large difference indicates a fault.
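The comparison described in this note can be sketched as follows. The 5% tolerance is an assumed cutoff chosen only for illustration, because this document does not state how large a difference counts as abnormal. The function name and example capacities are also hypothetical.

```python
def capacity_gap_is_abnormal(displayed_gb: float, physical_gb: float,
                             tolerance: float = 0.05) -> bool:
    """Return True if the OS shows noticeably less memory than is installed.

    tolerance is an assumed cutoff (5% by default), not a documented value.
    """
    if physical_gb <= 0:
        raise ValueError("physical_gb must be positive")
    shortfall = (physical_gb - displayed_gb) / physical_gb
    return shortfall > tolerance


print(capacity_gap_is_abnormal(251.5, 256))  # small gap
print(capacity_gap_is_abnormal(192.0, 256))  # large gap
```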

Key Process and Cause Analysis

Possible Causes:

  • Dual in-line memory modules (DIMMs) are installed incorrectly.
  • A memory fault occurs.

Fault Diagnosis:

Check the DIMM installation status and memory operating status.

Conclusion and Solution

Procedure:

  1. Check that all the DIMMs of the correct type are properly installed and operating normally. For details, see Physical Structure and Installing a DIMM in the RH5885 V3 Server V100R003 User Guide.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, go to Step 2.

  2. Replace the DIMM with a correct model. For details, see Installing a DIMM.
  3. Check whether the memory capacity displayed in the OS is consistent with the physical memory capacity.

    • If yes, no further action is required.
    • If no, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.

Experience

None

Note

None

Keyboard and Mouse Do Not Work
Problem Description
Table 5-155 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Keyboard and mouse

Symptom

All or some of the keys on the keyboard do not work, or the mouse does not work.

Key Process and Cause Analysis

Possible Causes:

  • The cable connection to the keyboard or mouse is abnormal.
  • The keyboard or mouse is faulty.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Check the cable connection to the keyboard or mouse, and check the keyboard or mouse status.

Conclusion and Solution

Procedure:

  1. Check that the cable is properly connected to the keyboard or mouse. For details about cable connection, see the RH5885 V3 Server V100R003 User Guide.
  2. Check that the RH5885 V3 is properly powered on. For details, see Powering On in RH5885 V3 Server V100R003 User Guide.
  3. Check whether the keyboard or mouse runs properly when it is connected to another RH5885 V3.

    • If the keyboard or mouse runs properly, the RH5885 V3 is faulty. Please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If the keyboard or mouse still does not work, replace it. For details, see the related keyboard or mouse documents.

Experience

None

Note

None

Screen Is Blank
Problem Description
Table 5-156 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Screen, blank

Symptom

The screen goes blank when you connect the monitor to the operating RH5885 V3 or when you start some applications on the RH5885 V3.

Key Process and Cause Analysis

Possible Causes:

  • The power supply or cable connection to the monitor is abnormal.
  • The monitor is faulty.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Check the power supply and cable connection to the monitor, and check the monitor status.

Conclusion and Solution

Procedure:

  1. Check that the power cable is properly connected to the monitor. If the indicator on the monitor is on, the power cable is properly connected.
  2. Check that the monitor is properly connected to the RH5885 V3.
  3. Check that the monitor is started and the brightness and contrast controls are adjusted correctly.
  4. Power off the RH5885 V3 and then power it on. For details, see Powering On and Powering Off in RH5885 V3 Server V100R003 User Guide.
  5. Check whether the fault persists.

    • If yes, go to Step 6.
    • If no, no further action is required.

  6. Check whether the monitor runs properly when it is connected to another RH5885 V3.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, go to Step 7.

  7. Replace the monitor. For details, see the document shipped with your monitor.
  8. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

Monitor Screen Is Wavy
Problem Description
Table 5-157 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Monitor Screen

Symptom

The screen is rolling, distorted, or has jitters.

Key Process and Cause Analysis

Possible Causes:

  • The cable between the monitor and RH5885 V3 is not connected securely.
  • The monitor is affected by magnetic fields of peripheral devices.
  • The monitor is faulty.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Narrow down the scope of the causes by checking the preceding possible problems one by one.

Conclusion and Solution

Procedure:

  1. Check that the power cable is properly connected to the monitor. If the indicator on the monitor is on, the power cable is properly connected.
  2. Check that the monitor is properly connected to the RH5885 V3.
  3. Check that the monitor is placed properly.
  4. Check that the monitor is more than 305 mm (12.01 in.) away from other devices, such as transformers, electrical appliances, fluorescent lights, and other monitors.

    Magnetic fields around these devices can cause screen jitters, rolling, or distorted screen images.

  5. Power off the RH5885 V3 and then power it on. For details, see Powering On and Powering Off in RH5885 V3 Server V100R003 User Guide.
  6. Check whether the fault persists.

    • If yes, go to Step 7.
    • If no, no further action is required.

  7. Check whether the monitor runs properly when it is connected to another RH5885 V3.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, go to Step 8.

  8. Replace the monitor. For details, see the document shipped with your monitor.
  9. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

No Information Is Displayed on the Monitor Screen
Problem Description
Table 5-158 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Monitor Screen, no information

Symptom

No information is displayed on the monitor screen.

Key Process and Cause Analysis

Possible Causes:

  • The power supply or cable connection of the monitor is abnormal, or the monitor is faulty.
  • No dual in-line memory modules (DIMMs) are installed, or the DIMMs are not installed securely in the proper slots.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

  • Check that the power cable is properly connected.
  • Check that the monitor is properly connected to the RH5885 V3.
  • Check that DIMMs are installed for the RH5885 V3.
Conclusion and Solution

Procedure:

  1. Check that the power cable is properly connected to the monitor. If the indicator on the monitor is on, the power cable is properly connected.
  2. Check that the monitor is properly connected to the RH5885 V3.
  3. Check that the monitor is started and the brightness and contrast controls are adjusted correctly.
  4. Check that all the DIMMs of the correct type are properly installed and operating normally. For details, see Physical Structure and Installing a DIMM in the RH5885 V3 Server V100R003 User Guide.

  5. Replace the DIMM with a correct model. For details, see Installing a DIMM.
  6. Check whether the fault persists.

    • If yes, go to Step 7.
    • If no, no further action is required.

  7. Check whether the monitor runs properly when it is connected to another RH5885 V3.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, go to Step 8.

  8. Replace the monitor. For details, see the document shipped with your monitor.
  9. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

USB Port Does Not Work
Problem Description
Table 5-159 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | USB port

Symptom

A USB port does not work.

Key Process and Cause Analysis

Possible Causes:

  • The operating system (OS) does not support the connected USB device, or no USB device driver is installed.
  • The USB device is faulty.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Narrow down the scope of the causes by checking the preceding possible problems one by one.

Conclusion and Solution

Procedure:

  1. Check that the RH5885 V3 OS supports the connected USB device.
  2. Check that the USB device driver is properly installed.
  3. Power off the RH5885 V3 and then power it on. For details, see Powering On and Powering Off in RH5885 V3 Server V100R003 User Guide.
  4. Check whether the fault persists.

    • If yes, go to Step 5.
    • If no, no further action is required.

  5. Check whether the USB device runs properly when it is connected to another RH5885 V3.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, go to Step 6.

  6. Replace the USB device. For details, see the document shipped with your USB device.
  7. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

Server RTC Battery Low Voltage Alarm
Problem Description
Table 5-160 Basic information

Item | Information
Source of the Problem | RH2288 V3
Intended Product | V2, V3, and V5 servers
Release Date | 2017-11-21
Keyword | Coin battery, RTC, battery low

Symptom

The RTC battery alarm "Battery low" is displayed on the RH2288 V3 iBMC.

Key Process and Cause Analysis

Principle description:

The coin battery supplies power to the CMOS RAM when the mainboard is powered off. On the mainboard of a server, the positive side of the coin battery is connected to the southbridge pin, and the BMC continuously checks the battery voltage.

Possible causes:

1. If the coin battery is faulty or the voltage is too low, replace the coin battery.

2. If the voltage detection link is faulty, replace the mainboard.

Conclusion and Solution

Solution:

  1. Apply for a mainboard and coin battery.
  2. Replace the coin battery and check whether the fault is rectified. If yes, return the mainboard intact. If no, replace the mainboard.
Experience

None

Note

None

RH5885 V3 Fails to Start
Problem Description
Table 5-161 Basic information

Item | Information
Source of the Problem | RH5885 V3
Intended Product | Rack Server
Release Date | 2015-03-17
Keyword | Start fail

Symptom

After you replace a CPU or DIMM, the RH5885 V3 fails to start.

Key Process and Cause Analysis

Possible Causes:

  • The replacement operation is incorrect.
  • The replaced component is faulty.
  • The RH5885 V3 is faulty.

Fault Diagnosis:

Verify the replacement and check the component status.

Conclusion and Solution

Procedure:

  1. Check whether the cause is the same as that for All Indicators Are Off.

    • If yes, no further action is required.
    • If no, go to Step 2.

  2. View operation logs to check whether a CPU or DIMM has been replaced before the fault.

    • If yes, go to Step 3.
    • If no, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.

  3. Replace the CPU or DIMM. For details, see RH5885 V3 Server V100R003 User Guide.
  4. Check whether the fault persists.

    • If yes, please refer to 02 About Huawei Server R&D and Maintenance Team. Contact Huawei technical support for help.
    • If no, no further action is required.

Experience

None

Note

None

PSOD Caused by the lsi_mr3 Driver in ESXi 6.0 U2
Problem Description
Table 5-162 Basic information

Item | Information
Source of the Problem | RH5885H V3
Intended Product | FusionServer
Release Date | 2017-11-15
Keyword | ESXi, 3108, lsi_mr3, PSOD

Symptom

On an RH5885H V3 server configured with the LSI SAS3108, the purple screen of death (PSOD) occurs when ESXi 6.0 U2 is running. The RAID controller card uses the OS-native lsi_mr3 driver rather than the megaraid_sas driver released by Huawei.

Figure 5-225 ESXi PSOD
Figure 5-226 lsi_mr3 driver information
Key Process and Cause Analysis

Cause analysis:

The ESXi-native lsi_mr3 driver does not reserve all of the memory it requires from the low-memory area during initial boot. As a result, an exception occurs when the RAID controller card is running.

Figure 5-227 VMware analysis result
Conclusion and Solution

Conclusion:

The ESXi-native lsi_mr3 driver does not reserve all of the memory it requires from the low-memory area during initial boot. As a result, an exception occurs when the RAID controller card is running.

Solution:

Select either of the following solutions:

  • Upgrade the host OS to ESXi 6.0 P04.
  • If the host OS cannot be upgraded, install a traditional driver (for example, megaraid_sas) instead of using the native driver (for example, lsi_mr3). You are advised to use the megaraid_sas driver released by Huawei.
Experience

When installing an ESXi OS on the server, you are advised to use the firmware and driver released on the Huawei enterprise website.

Note

None

Common Problems of RAID Controller Cards and Hard Drives

The Message "Firmware version inconsistency" Is Displayed and Startup Fails After an LSI SAS3108 Controller Card Is Replaced
Problem Description
Table 5-163 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers whose LSI SAS3108 controller cards need to be replaced

Release Date

2015-11-30

Keyword

LSI SAS3108, RAID controller card replacement, version inconsistency, startup failure

Symptom

Hardware configuration

RH2288H V3 server with an LSI SAS3108 controller card and a supercapacitor, and a spare LSI SAS3108 controller card

RAID configuration: Hard drives are configured as a RAID array by using the LSI SAS3108 controller card and the write cache is enabled for the RAID array.

Firmware configuration: The old LSI SAS3108 controller card uses firmware 4.210.90-3396, and the spare LSI SAS3108 controller card uses firmware 4.270.00-4382.

Symptom

After a power failure occurs and the user restarts the server, an error message is displayed and the system fails to start. After the user replaces the LSI SAS3108 controller card and restarts the server, the screen freezes as shown in Figure 5-228.

Figure 5-228 Frozen screen

Key Process and Cause Analysis

Key process

  1. The server encounters a power failure when drive input/output (I/O) operations are in progress.
  2. The RAID controller card is replaced, but its supercapacitor is not replaced.
  3. The fault occurs after the server powers on.

Cause analysis

The server encounters a power failure when drive I/O operations are in progress. The cache data protection mechanism is triggered for the supercapacitor of the LSI SAS3108 controller card, and preserved cache data is stored in the supercapacitor flash. The old LSI SAS3108 controller card uses firmware 4.210.90-3396, and the new LSI SAS3108 controller card uses firmware 4.270.00-4382. The firmware defines the format of preserved cache data. After the RAID controller card is replaced, the cache data format is incompatible with the firmware of the new LSI SAS3108 controller card. As a result, the firmware fails to process preserved cache data, the system fails to start, and many exception dump records are printed over the serial port. See Figure 5-229.

Figure 5-229 Many exception dump records printed over the serial port

Conclusion and Solution

Conclusion

The RAID controller card firmware defines the cache data format. After the RAID controller card is replaced, the cache data format is incompatible with the firmware of the new RAID controller card. As a result, the RAID controller card fails to start.

No RAID controller card vendor can guarantee the compatibility of the cache data format between different firmware versions.

The fault occurs when the following conditions are all met:

  • The RAID controller card has a supercapacitor.
  • The server encounters a power failure when drive I/O operations are in progress.
  • After a power failure, preserved cache data is not deleted before the RAID controller card is replaced.
  • The old LSI SAS3108 controller card uses firmware 4.210.90-3396, and the new LSI SAS3108 controller card uses firmware 4.270.00-4382.
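
The conditions above can be read as a checklist to run before swapping a controller card. The following Python sketch is illustrative only: the function and parameter names are hypothetical, and it treats any difference between the old and new firmware versions as a risk. That generalization is an assumption based on the statement that cache format compatibility is not guaranteed between firmware versions; this case confirms only the 4.210.90-3396 and 4.270.00-4382 pair.

```python
def preserved_cache_at_risk(has_supercap, power_lost_during_io,
                            cache_discarded, old_fw, new_fw):
    """Return True when all the listed conditions hold (hypothetical helper)."""
    return (
        has_supercap                 # the card has a supercapacitor
        and power_lost_during_io     # power failed while drive I/O was in progress
        and not cache_discarded      # preserved cache was not deleted beforehand
        and old_fw != new_fw         # assumption: any version mismatch is a risk
    )


# The situation described in this case.
print(preserved_cache_at_risk(True, True, False,
                              "4.210.90-3396", "4.270.00-4382"))
```

If the check returns True, discard the preserved cache with the old card (solution 2) before installing the new card.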

If the fault occurs, use one of the following solutions to rectify the fault:

Solutions

Solution 1: Replace the supercapacitor.

Solution 2: Use the firmware of the old LSI SAS3108 controller card to delete preserved cache data. The procedure is as follows:

  1. Remove the new LSI SAS3108 controller card, reinstall the old one, and connect the old card to the supercapacitor.
  2. Power on the server. Upon server startup, press Ctrl+R when prompted to open the WebBIOS screen of the LSI SAS3108 controller card.
  3. View the status of the preserved cache.

    On the VD Mgmt screen, select SAS3108 and press F2. The options shown in Figure 5-230 are displayed.

    Figure 5-230 VD Mgmt screen

    If Manage Preserved Cache is available (white), preserved cache data needs to be deleted. Go to step 4. If Manage Preserved Cache is unavailable (dimmed), preserved cache data has been deleted automatically. Go to step 5.

  4. Select Manage Preserved Cache and press Enter. The screen shown in Figure 5-231 is displayed.
    Figure 5-231 Manage Preserved Cache

    Select DISCARD CACHE and press Enter. On the confirmation screen, select YES and press Enter. Preserved cache data is deleted.

  5. Power off the server, and replace the RAID controller card with the new one.

Solution 3: Roll back the firmware of the new RAID controller card to 4.210.90-3396, delete preserved cache data by using solution 2, and upgrade the firmware to 4.270.00-4382.

Experience

None

Note

None

"Native Configuration no longer supported" Is Reported on the LSI SAS3108 RAID Controller Card During Server Startup
Problem Description
Table 5-164 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

V3 servers

Release Date

2015-01-30

Keyword

LSI SAS3108, Native Configuration, firmware

Symptom

Hardware configuration:

RH2288 V3 server equipped with the LSI SAS3108 RAID controller card

Symptom:

During the startup self-check of an RH2288 V3 server, the startup stops at the screen shown in Figure 5-232.

Figure 5-232 Frozen screen

Note:

The native configuration is not supported by the current firmware.

You need to press any key to continue the server startup.

Key Process and Cause Analysis

Key process:

This problem occurs because the RAID controller card detects information that does not match the existing RAID information.

This error message does not affect services. However, when this error message is displayed, the startup process can be completed only after user confirmation.

This error message can be cleared by clearing NVRAM. Clearing NVRAM does not affect the configured RAID relationship or OS startup.

You can run commands in EFI mode to clear NVRAM. However, the server does not support EFI shell due to security requirements. Therefore, this method for clearing NVRAM is unavailable currently.

You can use the following method to eliminate the impact on server startup brought by this error message:

On the RAID controller card configuration screen, set BIOS Mode to Pause on Error. In this mode, the error message remains for only a few seconds and then the startup continues. Figure 5-233 shows the configuration screen.

Figure 5-233 Setting BIOS Mode

Cause analysis:

If the LSI SAS3108 RAID controller card was used on another platform before it was installed on the current platform, the RAID controller card detects information that does not match the existing RAID information. An error message is then reported during startup and the startup process stops. The error message is as follows:

The native configuration is no longer supported by the current controller and firmware.
Conclusion and Solution

Conclusion:

The RAID controller card detects that the RAID information on the hard drive does not match the existing RAID information. Then an error message is reported during the startup and the startup process stops. The error message is as follows:

The native configuration is no longer supported by the current controller and firmware.

Solution:

On the RAID controller card configuration screen, set BIOS Mode to Pause on Error. In this mode, the error message remains for only a few seconds and then the startup continues.

Experience

None

Note

None

RAID Configuration Information Is Lost After a Server Configured with an LSI SAS3108 RAID Controller Card Restarts
Problem Description
Table 5-165 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

Servers equipped with LSI SAS3108 RAID controller cards

Release Date

2015-08-25

Keyword

LSI SAS3108 card, Erase

Symptom

Hardware configuration:

An RH2288H V3 server equipped with one LSI SAS3108 card and eight hard drives, of which the first two hard drives are configured into RAID1 and the other six hard drives are configured into RAID10

Symptom:

After the configuration is complete, the server is restarted, and during self-check of the RAID controller card, a message is displayed indicating that the configuration information has been lost, as shown in Figure 5-234.

Figure 5-234 RAID information lost after the server is restarted

Key Process and Cause Analysis

Key process:

The RAID controller card log shows that two RAID arrays were created recently and that no problem occurred during several subsequent restarts. This indicates that the RAID controller card and hard drives are normal, as shown in Figure 5-235.

Figure 5-235 RAID array creation record

However, the problem arises after a series of operations are performed on the two RAID arrays: Erase VD, Initialize VD, CC, and Delete VD, as shown in Figure 5-236.

Figure 5-236 Operations performed on RAID arrays

Cause analysis:

The problem can be reproduced by repeating the operations recorded in the log. Analysis shows that the problem is related to the Delete VD after Erase operation, as shown in Figure 5-237. The Delete VD after Erase operation is optional and is intended to erase all data on the VD and clear RAID information. Its purpose is to delete data completely to ensure data security and to prevent RAID relationships and data from being restored. To create RAID arrays again, you need to format hard drives first.

Figure 5-237 EraseVD settings

Conclusion and Solution

Conclusion:

This problem is the expected result of selecting the optional Delete Virtual Drive after Erase operation.

Solution:

Format the hard drives and re-create RAID arrays. If you want to erase RAID arrays, you are advised not to select Delete Virtual Drive after Erase operation.

Experience

None

Note

Formatting hard drives of the LSI SAS3108 RAID controller card:

  1. Restart the system, and when the system displays the SAS BIOS screen of the LSI SAS3108 RAID controller card, press Ctrl+R to go to the SAS BIOS screen. The procedure is shown in Figure 5-238.
    Figure 5-238 SAS BIOS screen

  2. On the SAS BIOS screen, press Ctrl+N to go to the PD Mgmt screen. Select a hard drive to be formatted, as shown in Figure 5-239.
    Figure 5-239 Selecting a hard drive to be formatted

  3. Press F2, select a hard drive, choose Drive Erase and Normal, and press Enter. Then choose Yes in the displayed dialog box. The Erase operation will automatically start, as shown in Figure 5-240. This operation can be performed simultaneously on multiple hard drives. For a 900 GB hard drive, the operation takes about 8 hours.
    Figure 5-240 Drive Erase
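
The figure of about 8 hours for a 900 GB hard drive can be used for a rough estimate for other capacities. The following Python sketch assumes that erase time scales linearly with capacity and that the drives have similar throughput. Both are assumptions, not statements from this case, and the function name is hypothetical.

```python
REFERENCE_CAPACITY_GB = 900   # capacity quoted in step 3
REFERENCE_HOURS = 8           # approximate erase time quoted in step 3


def estimate_erase_hours(capacity_gb):
    """Linearly scale the 900 GB / 8 hour reference (rough estimate only)."""
    return capacity_gb * REFERENCE_HOURS / REFERENCE_CAPACITY_GB


for capacity in (300, 900, 1800):
    print(capacity, "GB:", round(estimate_erase_hours(capacity), 1), "hours")
```

Because the operation can run on multiple hard drives at the same time, the total time is governed by the largest drive rather than by the sum of all drives, provided that the drives do not slow each other down.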

Checking the hard drive formatting progress:

  1. On the SAS BIOS screen, press Ctrl+N to go to the PD Mgmt screen. Select a hard drive, as shown in Figure 5-241.
    Figure 5-241 Selecting a hard drive

  2. Select GoToPage:2 and press Enter, as shown in Figure 5-242.
    Figure 5-242 Checking the progress
Hard Drive Recovery During Hot Spare Drive Rebuilding on the Avago LSI SAS3108 RAID Controller Card
Problem Description
Table 5-166 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

V3 servers

Release Date

2015-12-29

Keyword

LSI SAS3108, Frn-Bad, rebuild

Symptom

Symptom:

On a new RH2288H V3 server, after the customer removes and then reinstalls a hard drive (Slot0) of the RAID array in hot-swap mode, the hard drive turns into the Foreign-Unconfig Bad (Frn-Bad) state on the LSI SAS3108 RAID controller card screen, and the hot spare drive (Slot11) starts rebuilding to replace the faulty hard drive (Slot0), as shown in Figure 5-243.

Figure 5-243 Hard drive Slot0 in the Frn-Bad state

Key Process and Cause Analysis

Key process:

  1. Select the hard drive in the Frn-Bad state and choose make unconfigured good from the shortcut menu to convert the hard drive from the Frn-Bad state to the Foreign-Unconfig Good (Frn-Good) state, as shown in Figure 5-244.
    Figure 5-244 Make unconfigured good

  2. Choose Foreign View > Foreign Config > Import to import the external RAID configuration, as shown in Figure 5-245.
    Figure 5-245 Importing the external RAID configuration

  3. After the preceding operations, the hard drive turns into the Foreign state. If no hot spare drive had been available to replace the faulty hard drive, importing the external RAID configuration would have recovered the hard drive to the Online state instead of the Foreign state. In this case, however, a hot spare drive (Slot11) replaced the faulty hard drive (Slot0) and started data rebuild as soon as the hard drive (Slot0) was removed, so the RAID array is already complete. Therefore, the import operation cannot recover the hard drive (Slot0) to the original RAID array, and the hard drive is still marked Foreign, as shown in Figure 5-246.
    Figure 5-246 Hard drive Slot0 in the Foreign state
Conclusion and Solution

Solution:

To recover the hard drive, you need to clear the marked external RAID configuration information and remove and reinstall the faulty hard drive in hot-swap mode (to trigger the hard drive replacement operation). The operations are as follows:

  1. On the Foreign View screen, select the LSI SAS3108 RAID controller card, press F2, and choose Foreign Config > Clear to clear the external RAID configuration information, as shown in Figure 5-247. After the external RAID configuration of hard drive Slot0 is cleared, the hard drive turns into the Unconfig Good (UG) state on the PD Mgmt screen, as shown in Figure 5-248.
    Figure 5-247 Clearing the external RAID configuration

    Figure 5-248 Hard drive Slot0 in the UG state

  2. Remove and reinstall the hard drive (Slot0) again (to trigger the hard drive replacement operation). After the hot spare drive (Slot11) completes data rebuild, data is copied back to hard drive Slot0. After the copyback is complete, hard drive Slot0 returns to the Online state and hard drive Slot11 becomes a hot spare drive again.
Experience

None

Note

None

LSI SAS3108 Card Reports Multibit ECC Errors
Problem Description
Table 5-167 Basic information

Item

Information

Source of the Problem

V3 servers

Intended Product

V3 servers

Release Date

2014-07-24

Keyword

LSI SAS3108 card

Symptom

Hardware configuration:

LSI SAS3108 card

Symptom

A server performs a power-on self-check, but the following information is displayed during the RAID self-check phase:

Multibit ECC errors were detected on the RAID controller. 
The DIMM on the controller needs replacement. 
Please contact technical support to resolve this issue. 
If you continue, data corruption can occur. 
Press 'X' to continue or else power off the system and replace the 
DIMM module and reboot. If you have replaced the DIMM press 'X' to continue.

See Figure 5-249.

Figure 5-249 Information
Key Process and Cause Analysis

Cause analysis:

Multibit ECC errors occur in the DDR memory chips of the LSI SAS3108 card.

Conclusion and Solution

Solution:

Replace the RAID controller card.

Experience

None

Note

None

SNM0034 Hard Drive Fault Occurs Under Large I/O Pressure on the RH2288H V3 and SMART Reports HARD DRIVE FAILURE
Problem Description
Table 5-168 Basic information

Item

Information

Source of the Problem

Tecal RH2288H V3

Intended Product

Full series of servers compatible with ST6000NM0034

Release Date

2016-04-25

Keyword

ST6000NM0034, SMART, hard drive failure

Symptom

Hardware configuration:

RH2288H V3 configured with the LSI SAS2208 and twelve 3.5-inch ST6000NM0034 hard drives

The hard drive fault rate is high, and the iBMC reports a hard drive fault. The SMART information shows that the health status of the hard drive is "DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE".

Key Process and Cause Analysis

After a hard drive becomes faulty, collect SMART information and query the health status of the hard drive.

"SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE" is displayed. The vendor's fault analysis shows weak writes by the drive head. The underlying logs of the hard drive show that the fault is caused by writes at an incorrect frequency.

Cause analysis: When a hard drive is under heavy I/O pressure and a full seek is triggered, the drive head moves from zone X to zone Y and then back to zone X. Due to a bug in firmware E001, the drive head may use the zone Y parameters to write to zone X after it moves back, causing a write frequency error. As a result, weak writes occur, and the hard drive failure and SMART alarms are generated.

Conclusion and Solution

Conclusion:

When a hard drive is under heavy I/O pressure, weak writes may occur on the drive head. As a result, a SMART alarm is generated, and the hard drive is identified as faulty.

Solution:

After a hard drive becomes faulty, collect SMART information and query the health status of the hard drive.

If "SMART Health Status: DATA CHANNEL IMPENDING FAILURE GENERAL HARD DRIVE FAILURE" is displayed, query the hard drive firmware version. If the firmware version is earlier than E005, writes at an incorrect frequency may occur. Upgrade the hard drive firmware to E005.
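
The firmware check in this solution can be scripted once the version string has been read from the drive. The following Python sketch assumes that drive firmware versions have the form of a letter prefix followed by digits (as in E001 and E005) and that a larger number means a later release. The function names are hypothetical.

```python
import re

FIXED_VERSION = "E005"  # version named in this solution


def parse_drive_fw(version):
    """Split a version such as 'E001' into ('E', 1); assumed format."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", version.strip().upper())
    if match is None:
        raise ValueError(f"unexpected firmware version format: {version!r}")
    return match.group(1), int(match.group(2))


def needs_fw_upgrade(version, fixed=FIXED_VERSION):
    """Return True if the version is earlier than the fixed version."""
    return parse_drive_fw(version) < parse_drive_fw(fixed)


for fw in ("E001", "E005"):
    print(fw, "upgrade needed:", needs_fw_upgrade(fw))
```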

Experience

If an STX000NM0034 hard drive is faulty, query the SMART information and firmware version of the drive to assess the risk of writes at an incorrect frequency. If the hard drive firmware version is earlier than E005, you are advised to upgrade the firmware to the latest version.

Note

Involved hard drive models:

ST2000NM0034, ST4000NM0034, and ST6000NM0034

Automatic Rebuild for New Hard Drives Fails After RAID Member Drives Fail Because Auto Rebuild Is Disabled on a Server Configured with the LSI SAS2208
Problem Description
Table 5-169 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers configured with the LSI SAS2208

Release Date

2016-09-28

Keyword

2208, rebuild failure, faulty hard drive, replacement

Symptom

Hardware configuration: RH2288H V3 configured with the LSI SAS2208

Software version: Firmware of any version

Symptom: After faulty member drives of a RAID array are replaced with new ones, the automatic drive rebuild fails.

Key Process and Cause Analysis

The configuration information in the following figure shows that the Auto Rebuild function is disabled.

Cause analysis:

If member drives of a RAID array become faulty while Auto Rebuild is disabled, the replacement drives are not automatically rebuilt and added to the RAID array.

Conclusion and Solution

Conclusion:

If member drives of a RAID array become faulty while Auto Rebuild is disabled, the replacement drives are not automatically rebuilt and added to the RAID array.

Solution:

Manually rebuild the newly inserted hard drives.

Step 1: Run ./storcli64 /c0 show to obtain the drive group (DG) to be manually rebuilt, the RAID array (Arr) where the hard drives reside, and the sequence number (ROW) of the RAID array from the TOPOLOGY list, as shown in the following figure.

Step 2: Run the ./storcli64 /cx[/ex]/sx insert dg=A array=B row=C command on the hard drives that require manual rebuild based on the queried parameters. After the command is executed successfully, the drive status changes from UG to Offline.

Step 3: Run ./storcli64 /cx[/ex]/sx start rebuild to start rebuild.
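
The order of steps 2 and 3 matters: the drive must be inserted into the array before the rebuild is started. The following Python sketch only assembles the two command lines for one drive from the values read in step 1 and does not execute anything. The function name and the example controller, enclosure, and slot numbers are hypothetical. Note that storcli expects a space between the executable name and the /cx object path.

```python
def manual_rebuild_commands(controller, slot, dg, array, row, enclosure=None):
    """Return the insert and start-rebuild commands, in the required order."""
    # The enclosure part (/ex) is optional, matching the [/ex] in the syntax.
    enc = f"/e{enclosure}" if enclosure is not None else ""
    drive = f"/c{controller}{enc}/s{slot}"
    return [
        f"./storcli64 {drive} insert dg={dg} array={array} row={row}",
        f"./storcli64 {drive} start rebuild",
    ]


# Example values only; take dg, array, and row from the TOPOLOGY list.
for command in manual_rebuild_commands(0, 3, dg=0, array=0, row=1, enclosure=252):
    print(command)
```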

Experience

If the newly inserted hard drives cannot be rebuilt, query the RAID controller card configuration information and check the enablement status of Auto Rebuild.

Note

The storcli /cx[/ex]/sx start rebuild command cannot be used to rebuild a new drive in the UG state. Before rebuilding new hard drives, run the storcli /cx[/ex]/sx insert dg=A array=B row=C command. In this command, A specifies a drive group, B specifies a disk array ID, and C specifies a row in the disk array.

A Large Number of SuperCap Errors Are Recorded in Logs on a Server Configured with the LSI SAS2208
Problem Description
Table 5-170 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers configured with the LSI SAS2208

Release Date

2016-12-01

Keyword

2208, SuperCap, error

Symptom

Hardware configuration: RH2288H V3 configured with the LSI SAS2208

Software version: 3.400.45-3507

Symptom: A large number of SuperCap errors are recorded in logs.

Key Process and Cause Analysis

Cause analysis:

This problem is caused by hardware faults. According to the logs, the communication between the trans flash module (TFM) and supercapacitor is faulty.

Conclusion and Solution

Conclusion:

The supercapacitor or TFM is faulty.

Solution:

1. Check the supercapacitor connection.

2. Replace the supercapacitor.

3. Replace the TFM.

Experience

None

Note

None

Hard Drive Backplane Fault Indicator Is Not On When a Single-Drive LSI SAS2208 RAID 0 Is Faulty
Problem Description
Table 5-171 Basic information

Item

Information

Source of the Problem

Tecal RH2288 V3

Intended Product

FusionServer

Release Date

2017-10-12

Keyword

2208, single-drive RAID 0, fault indicator

Symptom

Hardware configuration: RH2288 V3 configured with the LSI SAS2208

RAID firmware version: 3.400.95-4061

Symptom: An RH2288 V3 is configured with the LSI SAS2208 and multiple single-drive RAID 0 arrays are created. When a RAID 0 is faulty, the hard drive backplane fault indicator is still off, as shown in Figure 5-250.

Figure 5-250 Single-drive RAID 0
Key Process and Cause Analysis

Key process:

According to the RAID controller card logs, the drive in slot 11 was removed by the firmware after it was faulty.

Figure 5-251 Drive (slot 11) failure due to command timeout
Figure 5-252 iBMC logs

The iBMC logs record only the drive replacement without the drive absence information. The server uses the expander backplane, and the firmware uses the SES service to manage the drive indicators. After removing a single-drive RAID 0, the firmware does not turn on the fault indicator on the backplane through SES. As a result, the iBMC cannot read the absence information about the hard drive from the CPLD register.

Cause analysis:

The LSI SAS2208 firmware 3.400.95-4061 has a bug. When the hard drive of a single-drive RAID 0 is removed from the RAID controller card, the firmware does not turn on the fault indicator of the corresponding slot on the backplane through SES. As a result, the iBMC cannot read the drive fault status from the CPLD register.

Conclusion and Solution

Conclusion:

The LSI SAS2208 firmware 3.400.95-4061 has a bug. When the hard drive of a single-drive RAID 0 is removed from the RAID controller card, the firmware does not turn on the fault indicator of the corresponding slot on the backplane through SES. As a result, the iBMC cannot read the drive fault status from the CPLD register.

Solution:

  1. Upgrade the firmware to 3.460.165-8277.

Visit http://support.huawei.com/enterprisesoftware/SoftwareVersionActionNew%21getSoftwareInfo.action?lang=en&pid=22625637&contentId=SW1000261600 to download the package.

Experience

None

Note

None

OS Installation Fails in EFI Mode on a V3 Server Configured with the LSI SAS2308
Problem Description
Table 5-172 Basic information

Item

Information

Source of the Problem

RH228X V3

Intended Product

RH228XV3, E9000, and X6800

Release Date

2016-03-20

Keyword

RH228X V3, LSI SAS2308, EFI

Symptom

Hardware configuration: A V3 server configured with the LSI SAS2308

Software version: EFI BIOS in Linux or Windows

Symptom: If the capacity of a RAID array or hard drive is greater than 2 TB, you are advised to install the OS in EFI mode. After the BIOS is switched to the EFI mode, the OS can be successfully installed. However, the OS cannot be accessed after a restart, and the BIOS displays a message indicating that the boot device is missing.

Figure 5-253 Boot device missing in the EFI BIOS
Key Process and Cause Analysis

Key process:

An RH228X V3 server is configured with the LSI SAS2308.

This problem occurs only during or after OS installation in EFI mode.

When the problem occurs, the boot drive is missing during the installation of a Windows OS, and a Linux OS fails to start after the installation because the boot drive is missing.

Cause analysis:

The EFI mode is used when the system has many PCIe devices or the target hard drive for installing the OS is greater than 2 TB. In this case, the RAID controller card must support the EFI mode to be identified and started by the BIOS.
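
The 2 TB threshold matches the addressing limit of legacy (MBR) partitioning, which stores sector addresses in 32 bits. The following Python sketch works through the arithmetic, assuming 512-byte logical sectors. The function name is hypothetical, and the sketch illustrates only the size criterion, not the PCIe device criterion.

```python
SECTOR_BYTES = 512                          # assumed logical sector size
MBR_LIMIT_BYTES = (2 ** 32) * SECTOR_BYTES  # 32-bit sector addresses


def exceeds_legacy_limit(capacity_bytes):
    """Return True if the drive is larger than legacy partitioning can address."""
    return capacity_bytes > MBR_LIMIT_BYTES


print(MBR_LIMIT_BYTES)                      # limit in bytes (2 TiB)
print(exceeds_legacy_limit(6 * 10 ** 12))   # a 6 TB drive
print(exceeds_legacy_limit(900 * 10 ** 9))  # a 900 GB drive
```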

The LSI SAS2308 firmware does not contain an EFI driver by default; the driver needs to be loaded manually. The LSI SAS2308 was introduced when V2 servers were developed, and the BIOS of those servers contains the LSI SAS2308 EFI driver by default. If the driver is loaded again, an unknown error may occur (the EFI driver loading fails, and the BIOS-native EFI driver is used). Therefore, the LSI SAS2308 does not load an EFI driver by default.

However, the BIOS of V3 servers does not contain the LSI SAS2308 EFI driver by default. As a result, the boot device cannot be found when a V3 server is configured with the LSI SAS2308.

Conclusion and Solution

Conclusion:

The BIOS of V3 servers does not contain the LSI SAS2308 EFI driver by default. As a result, the OS fails to be installed or fails to start after the OS installation in EFI mode.

Impact scope: This problem occurs when you install an OS in EFI mode on a V3 server configured with the LSI SAS2308.

Solution:

The EFI driver is integrated into the LSI SAS2308 firmware 20.00.04.00 and later. Use Toolkit to upgrade the firmware. You can download Toolkit-V119 at:

http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|9856629|21015513|21802824|21802825|22149439&idAbsPath=fixnode01|7919749|9856522|9856629|21015513&version=FusionServer+Tools+V2R2C00RC5SPC1&hidExpired=0&contentId=SW1000210091
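
Before upgrading, you can compare the installed LSI SAS2308 firmware version with 20.00.04.00. The following Python sketch assumes that the version is a dotted string of numeric fields that are compared field by field from left to right. The function names are hypothetical, and reading the version from the card is outside the scope of the sketch.

```python
EFI_MIN_VERSION = "20.00.04.00"  # first version stated to integrate the EFI driver


def parse_fw(version):
    """Convert '20.00.04.00' into (20, 0, 4, 0) for field-by-field comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def has_integrated_efi_driver(version, minimum=EFI_MIN_VERSION):
    """Return True if the firmware is at or above the minimum version."""
    return parse_fw(version) >= parse_fw(minimum)


for fw in ("19.00.00.00", "20.00.04.00"):
    print(fw, "EFI driver integrated:", has_integrated_efi_driver(fw))
```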

Experience

None

Note

None

Failed to Add a Hot Spare Drive to a Degraded RAID 1 on the LSI SAS2308
Problem Description
Table 5-173 Basic information

Item

Information

Source of the Problem

RH228X V3

Intended Product

RH228XV3, E9000, and X6800

Release Date

2016-07-27

Keyword

RH228X V3, LSI SAS2308, RAID 1

Symptom

Hardware configuration: A V3 server configured with the LSI SAS2308

Software version: BIOS in Linux or Windows

Symptom: After a RAID 1 is degraded, hot spare drives may fail to be added when the RAID array is created again. As a result, the RAID 1 fails to be created. That is, hot spare drives fail to be set in the OS.

Figure 5-254 Error information
Key Process and Cause Analysis

Key process:

An RH228X V3 server is configured with the LSI SAS2308.

After a RAID 1 is degraded, hot spare drives fail to be added to the RAID 1 in the OS or Option ROM.

Cause analysis:

The cause of the problem is as follows:

The logs show that dirty data exists on the RAID controller card. Run sas2flash -o -e 3 to clear the NVRAM data of the RAID controller card to solve this problem.

Conclusion and Solution

Conclusion:

Clear the NVRAM data of the LSI SAS2308 so that a hot spare drive can be added to the degraded RAID 1.

Solution:

If OS installation is required for a V3 server, use Toolkit to clear NVRAM data. Toolkit is integrated into the tool package and does not need to be downloaded.

  1. Mount the ISO image of Toolkit.

  2. Run Toolkit and press C to enter the command-line interface (CLI).

  3. Go to the tool directory and clear NVRAM data.

    The tool directory is /home/Project/tools/lsi2308/linux.

    Note: lsi2308 is the current RAID controller card. The following commands apply to the LSI SAS2308 and LSI SAS3008.

    # cd /home/Project/tools/lsi2308/linux

    # ./sas2flash -o -e 3

    The following figure shows the command output after the commands are executed successfully.

  4. Restart the OS.
  5. After the restart, press Ctrl+C to access the option ROM screen of the LSI SAS2308.

  6. Go to the management screen and activate the RAID array.

  7. Set a hot spare drive again.

  8. Press C to save the settings and exit.

Experience

None

Note

None

SMP Command Failure Triggers a False Alarm and Toolkit Reports an Error on a Server Configured with the LSI SAS2308
Problem Description
Table 5-174 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

RH2288H V3

Release Date

2016-12-29

Keyword

2308, 311c0030, Toolkit

Symptom

Hardware configuration: RH2288H V3 (8 directly connected hard drives) configured with the LSI SAS2308 (firmware version: 20.00.04.00)

Check results of Toolkit:

|Expander link |DriveID:0,PhyNo:23; error. |FAIL|

A driver alarm is recorded in OS logs.

Key Process and Cause Analysis

According to the MPT2SAS standard stipulated by LSI, log_info 0x31 indicates that the problem occurs at the PL layer (the link layer interacting with the target), and PL code 0x1c0030 indicates that the SMP command fails at the PL layer.

This indicates that when the controller delivers an SMP command, the connection requested by the previous SMP command still exists. After the SMP command transfer is complete, the disconnection is interrupted. In this case, the firmware retries the SMP command and records the log information. This has little impact on data integrity and service functions.

To check whether the problem causes the OS to run abnormally, monitor link errors when there are service I/Os and ensure that the link works properly.
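
The value 311c0030 listed in the keywords can be split into the two parts discussed above: the top byte (0x31) and the 24-bit PL code (0x1c0030). The following Python sketch performs the split. The further division of the top byte into a bus type nibble (3) and an originator nibble (1, read as PL) follows the field layout used by the Linux mpt2sas/mpt3sas drivers and is an assumption that does not come from this case. The function name is hypothetical.

```python
def decode_log_info(log_info):
    """Split a 32-bit log_info value into the fields discussed above."""
    top_byte = (log_info >> 24) & 0xFF   # 0x31 for the value in this case
    pl_code = log_info & 0xFFFFFF        # 0x1c0030 for the value in this case
    return {
        "top_byte": top_byte,
        "bus_type": top_byte >> 4,       # assumed layout: upper nibble
        "originator": top_byte & 0xF,    # assumed layout: lower nibble
        "pl_code": pl_code,
    }


fields = decode_log_info(0x311C0030)
print({name: hex(value) for name, value in fields.items()})
```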

Conclusion and Solution

For LSI SAS2308 firmware versions earlier than 20.100.04.00, this alarm may be falsely reported when a management tool such as hwdiag or hdmon is used.

Solution:

You are advised to upgrade the firmware. Download the latest firmware at:

http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|21782478|21782482|8576237|9314996|9314998|9315003|22088728&idAbsPath=fixnode01|7919749|9856522|21782478|21782482|8576237&version=RH2288 V2 V100R002C00SPC603&hidExpired=0&contentId=SW1000192382

Experience

None

Reboot Fails in Linux on a Server Configured with the LSI SAS3008
Problem Description
Table 5-175 Basic information

Item

Information

Source of the Problem

RH228X V3

Intended Product

RH228X V3

Release Date

2015-11-26

Keyword

3008, Linux OS, no response, _base_event_notification

Symptom

Hardware configuration: RH228X V3 server (8 or 12 drives) configured with the LSI SAS3008

If Linux is installed on a server configured with an LSI SAS3008 RAID controller card (especially when the OS uses a Xen kernel), the OS startup may fail after the server is restarted repeatedly. The driver reports an error message indicating that the initialization of the RAID controller card fails and the OS boot device is missing.

Figure 5-255 Error information
Key Process and Cause Analysis

1. Analyze OS logs.

The driver prints "_base_event_notification: Timeout", indicating that the RAID controller card does not respond to the command delivered by the driver. As a result, the driver fails to initialize the RAID controller card.

2. Enable the Xen kernel printing function to determine the cause.

Set Xen.gz to print all initialization information of the PCIe devices during startup. According to the initialization information, when the LSI SAS3008 is being initialized, the value of :table_offset read from the PCIe configuration space is not 0xe001.

This indicates that when the PCIe device corresponding to the LSI SAS3008 is being initialized, an error occurs when the OS reads the PCIe configuration space information. As a result, the driver fails to initialize the LSI SAS3008 RAID controller card.

Cause analysis:

Based on the analysis of Broadcom, the problem is caused by the firmware defects of the LSI SAS3008. During OS startup, the OS may read information about the PCIe configuration space at the same time when the LSI SAS3008 firmware writes data to the space. This triggers hardware timing errors, and the OS fails to read correct information about the PCIe configuration space.

Conclusion and Solution

Conclusion:

When the OS obtains information about the PCIe configuration space, the firmware is writing data to the space. As a result, the OS fails to read correct information about the PCIe configuration space. The following figure shows the description of Broadcom.

Solution:

Workaround: Disable the MSIX interrupt function.

Run echo "options mpt3sas msix_disable=1" > /etc/modprobe.d/mpt3sas.conf to disable the MSIX interrupt function of mpt3sas.

Run mkinitrd to rebuild the initrd image so that the setting takes effect at boot, and then restart the OS.

Solution: Upgrade the firmware to the latest version.
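
Before rebuilding the initrd, it can help to confirm that the configuration file really contains the expected directive. The modprobe.d format is `options <module> <parameter>=<value>`, one directive per line. The sketch below checks a file's text for that line; the function name `msix_disabled` is an illustrative assumption, and the snippet only inspects text without changing the system.

```python
def msix_disabled(conf_text, module="mpt3sas"):
    """Return True if conf_text has an 'options <module>' line with msix_disable=1."""
    for raw in conf_text.splitlines():
        if raw.lstrip().startswith("#"):
            continue  # comment line
        fields = raw.split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            if "msix_disable=1" in fields[2:]:
                return True
    return False


print(msix_disabled("options mpt3sas msix_disable=1\n"))  # True
print(msix_disabled("# nothing configured yet\n"))        # False
```

To check the real file, pass the contents of /etc/modprobe.d/mpt3sas.conf to the function.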

Experience

After the problem occurs, the OS may fail to start. Restarting the OS again allows it to start successfully. To prevent recurrence, disable the MSIX interrupt function as a workaround, or upgrade the firmware to solve the problem.

Note

None

"MPTLib2 Error" Is Reported When sas3ircu Executes Commands on a Server Configured with the LSI SAS3008
Problem Description
Table 5-176 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

RH2288H V3

Release Date

2016-07-22

Keyword

sas3ircu, MPTLib2 Error

Symptom

Hardware configuration: RH2288 V3 configured with the LSI SAS3008 and multiple hard drives

The LSI SAS3008 is configured on an RH2288 V3, and the dedicated management tool sas3ircu is used in the OS. When the sas3ircu list and sas3ircu 0 display commands are run, the error message "MPTLib2 Error" is displayed.

Key Process and Cause Analysis
  1. Check the rights of the user that uses sas3ircu.

    Run the groups command in Linux to check the user group.

    The user group must have root rights.

    If the group to which the current user belongs does not have root rights, switch the user to the root user group.

  2. Upgrade the driver.

    If the user has the permission to run commands on sas3ircu, check the RAID controller card driver.

    Run lsmod to check that the mpt3sas driver module exists.

    Run modinfo mpt3sas to check the driver version.

  3. Collect OS logs.

    If the fault persists after steps 1 and 2 are performed, the OS or driver may be incompatible with the tool. Run sas3ircu multiple times and use InfoCollect to collect system logs for analysis.
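
The driver checks in step 2 can be scripted once the output of lsmod and modinfo mpt3sas has been captured as text. The following is a minimal sketch under stated assumptions: the helper names and the sample output (module sizes, file path, version string) are hypothetical, while the column layout of lsmod and the `version:` field of modinfo are the standard formats of those tools.

```python
def module_loaded(lsmod_text, module="mpt3sas"):
    """Return True if `module` appears in the first column of lsmod output."""
    for line in lsmod_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == module:
            return True
    return False


def modinfo_version(modinfo_text):
    """Return the value of the 'version:' field from modinfo output, or None."""
    for line in modinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "version":
            return value.strip()
    return None


# Hypothetical captured output; real values differ per system.
lsmod_sample = (
    "Module                  Size  Used by\n"
    "mpt3sas               200000  4\n"
    "raid_class             16384  1 mpt3sas\n"
)
modinfo_sample = (
    "filename:       /lib/modules/example/mpt3sas.ko\n"
    "version:        12.100.00.00\n"
    "srcversion:     0123456789ABCDEF\n"
)

print(module_loaded(lsmod_sample))
print(modinfo_version(modinfo_sample))
```

The version returned here can then be compared against the driver version listed in the firmware and driver version mapping.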

Conclusion and Solution

Conclusion:

An error is reported when sas3ircu executes commands. As a result, the LSI SAS3008 cannot be found. The possible causes are as follows:

  1. The user does not have the permission to run commands on the tool.
  2. The driver version is too old.
  3. The OS kernel or driver is incompatible with the tool of a specific version.

Solution:

  1. Check that the user has root rights.
  2. Upgrade the driver to the latest version and use the tool of the latest version.
Experience

You are advised to switch to the root user when using the RAID controller card management tool. In addition, ensure that the version mapping between the firmware and driver is correct.

If the fault persists, collect sufficient OS information and logs for quick troubleshooting.

Note

None

OS Startup Is Suspended in the Initialization Phase When the LSI SAS3008 PH2 Is Connected to SATA Drives
Problem Description
Table 5-177 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

RH2288H V3

Release Date

2016-09-05

Keyword

3008, initialization, timeout

Symptom

Hardware configuration:

RH2288H V3 (8 or 12 drives) configured with the LSI SAS3008 and SATA drives. The RAID controller card firmware version is 2.00, and the expander backplane firmware version is 120.

The OS startup is suspended in the initialization phase for a long time, and then an error message is displayed.

Figure 5-256 OS startup
Key Process and Cause Analysis

Cause analysis:

The RAID controller card firmware version is too old.

Conclusion and Solution

Solution:

Upgrade the RAID controller card firmware to the latest version.

Experience

Check the RAID controller card, hard drives, and backplane. In this case, the fault is caused by the RAID controller card firmware.

Failed to Upgrade the BIOS Online on a Server Configured with the LSI SAS3008
Problem Description
Table 5-178 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

RH2288H V3

Release Date

2016-11-21

Keyword

3008, FTK, BIOS, upgrade

Symptom

The BIOS fails to be upgraded using Toolkit V108, and information shown in the following figure is displayed.

Key Process and Cause Analysis

Key process:

  1. Run the ./FwUpgrade.py FwUpgrade.xml script on Toolkit V108. The upgrade is successful.

    Analysis: The problem is not caused by the upgrade script.

  2. An error message is displayed when the sas3flash -b mptsas3.rom command is executed. Check the Toolkit V108 files. Two sas3flash tools exist. The first one is located in /Project/tools/upgrade/raid/tool/lsi/3008, and its version is 13.00.00.00.

    The second one is located in /Project/tools/lsi3008/linux, and its version is 07.00.00.

    The customer ran the second tool.

  3. Use sas3flash of version 07.00.00 to upgrade the BIOS. The upgrade fails.

    Analysis: The problem is caused by the tool of the earlier version.

Cause analysis:

The problem is caused by sas3flash of the earlier version.

Conclusion and Solution

Use sas3flash of the later version to perform the upgrade.
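
When a toolkit ships more than one copy of a tool, comparing the version strings numerically avoids picking the older one by accident. A minimal sketch, using the two paths and versions recorded in this case (the helper name `parse_version` is an illustrative assumption):

```python
def parse_version(text):
    """Turn a dotted version such as '13.00.00.00' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Paths and versions as listed in this case.
tools = {
    "/Project/tools/upgrade/raid/tool/lsi/3008": "13.00.00.00",
    "/Project/tools/lsi3008/linux": "07.00.00",
}

newest_path = max(tools, key=lambda path: parse_version(tools[path]))
print(newest_path)  # /Project/tools/upgrade/raid/tool/lsi/3008
```

Tuples compare element by element, so versions with a different number of fields (four versus three here) are still ordered correctly by their leading fields.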

Experience

None

RAID Array Member Drives Fail to Be Automatically Rebuilt After the LSI SAS3008 Is Replaced
Problem Description
Table 5-179 Basic information

Item

Information

Source of the Problem

RH2288 V3

Intended Product

RH2288 V3

Release Date

2016-11-29

Keyword

3008, automatic rebuild

Symptom

An RH2288 V3 is configured with the LSI SAS3008. The OS is SUSE Linux Enterprise Server (SLES) 11.3, which contains the RAID controller card driver of version 1.10. After a RAID 1 member drive is faulty, the drive, hard drive backplane, RAID controller card, and SAS cable are replaced onsite.

The new hard drive in slot 9 (RAID 1 member drive) fails to be automatically rebuilt and is in the missing state in the virtual drive (VD). However, slot 9 is visible on the SAS Topology screen, and the hardware is normal according to the check result of Toolkit. The rebuild starts after the drive in slot 9 is set as a hot spare drive.

Key Process and Cause Analysis

Key process:

Install SLES 11.3, create a four-drive RAID 10, and replace a member drive of RAID 10 with a new hard drive. The new drive can be automatically rebuilt.

Analysis:

The driver works properly. The new RAID controller card may not support automatic rebuild for the original RAID array.

  1. Create a four-drive RAID 10 in the OS, remove one member drive (slot 9), power off the server, and replace the RAID controller card.
  2. Log in to the OS again. RAID 10 is in the inactive state. Insert a new hard drive in slot 9 and check the drive status. The drive is not added to RAID 10. Instead, a new drive letter sdb is allocated to the new drive.

  3. Reactivate RAID 10. The new drive in slot 9 is not added to RAID 10. As a result, the new drive is not automatically rebuilt.

  4. Set the drive in slot 9 as a hot spare drive. The automatic rebuild starts.

Cause analysis:

After the RAID controller card is replaced, RAID 10 becomes inactive. If you install a new hard drive in slot 9, the new RAID controller card allocates a new drive letter to the new drive and regards it as a common single drive instead of adding it to RAID 10. This is because the new hard drive does not have RAID information and the new RAID controller card does not have the original RAID 10 information.

Conclusion and Solution

After replacing a RAID controller card, you are advised to activate the RAID array (the new RAID controller card will record the slot information of the RAID array member drives) and then replace a member drive. In this way, the new member drive can be automatically rebuilt.

Experience

None

Task Management Timeout Causes Drive Rejection and "fault_state (0x0d03)" Is Reported on a Server Configured with the LSI SAS3008
Problem Description
Table 5-180 Basic information

Item

Information

Source of the Problem

RH2288 V3

Intended Product

All servers configured with RAID controller cards

Release Date

2016-03-21

Keyword

TM timeout, fault_state(0x0d03), task abort: FAILED scmd

Symptom

FusionStorage rejects all hard drives on the server, and task abort, TM timeout, and fault state 0d03 are recorded in the logs.

Key Process and Cause Analysis

The driver fails to deliver task management (TM) commands and, following its recovery mechanism, attempts a diagnostic reset (diag reset) to restore the controller.

Cause analysis:

The TM timeout is caused by defects of the firmware of earlier versions.

Based on the logs, the LSI internal R&D team found that the RAID controller card firmware version used in the system is 02.00.00.00, which is a 2013 version. The task abort and timeout errors are handled in later firmware versions.

A system processing exception may occur because the TM incorrectly reports aborted I/Os. This problem has been solved in firmware version 05.00.00.00.

Multiple Drives Become Offline Simultaneously Due to a Hard Drive Expander Backplane Fault on a Server Configured with the LSI SAS3108
Problem Description
Table 5-181 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers configured with RAID controller cards

Release Date

2016-03-21

Keyword

Hard drive, failure, intermittent disconnection, presence

Symptom

Hardware configuration:

RH2288H V3 configured with the LSI SAS3108

Symptom:

All hard drive indicators turn yellow.

Key Process and Cause Analysis

Key process:

Check RAID controller card logs.

The logs show that all drives are reset. This indicates that the SAS uplink is faulty, that is, a fault occurs between the RAID controller card and the expander chip on the hard drive backplane.

In addition, "wide port lost link" is recorded in the logs. A wide port is formed when multiple PHYs connect two SAS devices: the PHYs are grouped into one port to transmit data more efficiently.

According to the logs, the wide port connecting between the RAID controller card and the backplane expander is interrupted, and links are lost on PHYs 2, 3, and 4. Two SAS cables are connected to PHYs 0–3 and PHYs 4–7 respectively. There is a low probability that the two cables are faulty at the same time. Therefore, the expander chip on the hard drive backplane may be faulty. Replace the backplane.

You can also replace the RAID controller card, SAS cables, and backplane at the same time.
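
The reasoning above (lost links on PHYs 2, 3, and 4 span both cables, so a shared component is suspected) can be written down as a small lookup. This is a sketch under stated assumptions: the labels `cable 1` and `cable 2` and the function name are hypothetical, and the PHY ranges are the ones described in this case.

```python
# Cable-to-PHY wiring as described in this case; cable labels are illustrative.
CABLE_PHYS = {
    "cable 1": range(0, 4),  # PHYs 0-3
    "cable 2": range(4, 8),  # PHYs 4-7
}


def cables_with_lost_links(lost_phys):
    """Return the sorted names of cables that carry at least one lost PHY."""
    return sorted(
        name for name, phys in CABLE_PHYS.items()
        if any(phy in phys for phy in lost_phys)
    )


affected = cables_with_lost_links([2, 3, 4])
print(affected)  # ['cable 1', 'cable 2']
if len(affected) == len(CABLE_PHYS):
    print("Lost links span every cable; suspect the shared backplane expander.")
```

If the lost PHYs all belonged to one cable, that cable or its connectors would be the more likely suspect.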

Cause analysis:

The connection between the RAID controller card and the backplane is abnormal.

Conclusion and Solution

Conclusion:

The connection between the RAID controller card and the backplane is abnormal.

Solution:

Replace the backplane. You can also replace the RAID controller card, SAS cables, and backplane at the same time.

Experience

If the RAID controller card logs show that all hard drives are reset simultaneously and links are lost on the wide port, replace the backplane, RAID controller card, and SAS cables.

Note

None

Intermittent Disconnected Hard Drive Is Identified as Faulty on a Server Configured with the LSI SAS3108
Problem Description
Table 5-182 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers configured with RAID controller cards

Release Date

2015-10-29

Keyword

Hard drive, failure, intermittent disconnection, presence

Symptom

Hardware configuration:

RH2288H V3 configured with the LSI SAS3108

Symptom:

After a hard drive becomes offline, its status turns to FAILED and then foreign.

Key Process and Cause Analysis

Key process:

  1. Check RAID controller card logs.

    According to the preceding figure:

    No log is generated between 2015-09-26 10:20:45 and 2015-09-28 15:54:35, that is, no abnormal event occurs for a long time.

    At 15:54:35 on 2015-09-28, the hard drive in slot 11 is reset. Then, the RAID controller card removes the drive from slot 11 and sets it to the FAILED state.

    A hard drive is reset and removed suddenly when no other I/O exception information is recorded. Therefore, a drive removal operation may be performed.

  2. Check the iBMC system event log (SEL).

    According to the SEL, the presence status of hard drive in slot 11 became Deasserted at 15:54:36 on 2015-09-28, which indicated that the connection between the drive and the backplane was interrupted. Twelve seconds later, the drive status became Asserted.

    Confirm with the customer and frontline service personnel whether the drive has been reseated. If no, status change at an interval of about 10s indicates that the connection between the hard drive and the backplane is faulty. As a result, the electrical connection is interrupted intermittently.

The hard drive has not been reseated. The intermittent disconnection is caused by a loose connection between the drive and backplane.
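
The SEL check in step 2 amounts to pairing each Deasserted presence event with the next Asserted event and measuring the gap. A minimal sketch, assuming the SEL entries have already been reduced to (timestamp, state) pairs; the function name and the tuple layout are illustrative, and the two timestamps are the ones reported in this case:

```python
from datetime import datetime


def presence_gaps(events):
    """Return seconds between each Deasserted event and the next Asserted event."""
    gaps = []
    went_absent_at = None
    for timestamp, state in events:
        if state == "Deasserted":
            went_absent_at = timestamp
        elif state == "Asserted" and went_absent_at is not None:
            gaps.append((timestamp - went_absent_at).total_seconds())
            went_absent_at = None
    return gaps


# Presence events for slot 11 as described in this case.
events = [
    (datetime(2015, 9, 28, 15, 54, 36), "Deasserted"),
    (datetime(2015, 9, 28, 15, 54, 48), "Asserted"),
]
print(presence_gaps(events))  # [12.0]
```

A gap of about 10 seconds with no reseating by personnel points to an intermittent electrical connection rather than a maintenance operation.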

Cause analysis:

The hard drive is intermittently disconnected because the connection between the hard drive and backplane is loose.

Conclusion and Solution

Conclusion:

The hard drive is intermittently disconnected because the connection between the hard drive and backplane is loose.

Solution:

Reinstall the hard drive securely. If the fault persists, replace the hard drive or backplane.

Experience

If a drive intermittent disconnection occurs, view the iBMC SEL to check whether the drive presence status has changed. If yes, the fault may be caused by manual operations or a loose connection.

Note

None

"Firmware Version inconsistency" Is Displayed and System Startup Fails After the LSI SAS3108 Is Replaced
Problem Description
Table 5-183 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers whose LSI SAS3108 RAID controller cards need to be replaced

Release Date

2015-11-30

Keyword

Startup failure, inconsistency

Symptom

Hardware configuration: RH2288H V3 configured with the LSI SAS3108 and a supercapacitor, with a spare LSI SAS3108

RAID configuration: RAID arrays are created on the hard drives attached to the LSI SAS3108, and write cache is enabled for the RAID arrays.

Software version:

Original LSI SAS3108 firmware version: 4.210.90-3396

Spare LSI SAS3108 firmware version: 4.270.00-4382

Symptom: After a power failure occurs, the customer restarts the server. An error message is displayed and the system fails to start. After the customer replaces the LSI SAS3108 RAID controller card and restarts the server, the startup is suspended on the screen shown in the following figure.

Key Process and Cause Analysis

Key process:

The server encounters a power failure when drive input/output (I/O) operations are in progress.

1) Replace the RAID controller card, but do not replace the supercapacitor.

2) Power on the server. The fault recurs.

Cause analysis:

The server encounters a power failure when drive I/O operations are in progress. The cache data protection mechanism is triggered by the supercapacitor of the LSI SAS3108, and preserved cache data is stored in the supercapacitor flash. The original LSI SAS3108 uses firmware 4.210.90-3396, and the new LSI SAS3108 uses firmware 4.270.00-4382. The firmware defines the format of preserved cache data. After the RAID controller card is replaced, the cache data format is incompatible with the firmware of the new LSI SAS3108. As a result, the firmware fails to process preserved cache data, the system fails to start, and many exception dump records are printed over the serial port.

Conclusion and Solution

Conclusion:

The RAID controller card firmware defines the cache data format. After the RAID controller card is replaced, the cache data format is incompatible with the firmware of the new RAID controller card. As a result, the RAID controller card fails to start.

The fault occurs when the following conditions are all met:

1. The RAID controller card has a supercapacitor.

2. The server encounters a power failure when drive I/O operations are in progress.

3. After a power failure, the RAID controller card is replaced before the preserved cache data is deleted.

4. The original LSI SAS3108 uses firmware 4.210.90-3396, and the new LSI SAS3108 uses firmware 4.270.00-4382.

If the preceding conditions are met at the same time, you are advised to use a workaround.

Solution:

Workaround 1: Replace the supercapacitor.

Workaround 2: Use the LSI SAS3108 whose firmware version is 4.210.90-3396 to delete preserved cache data.

(1) Use the original LSI SAS3108 delivered with the server to connect to the supercapacitor and delete preserved cache data. Alternatively, roll back the firmware of the new LSI SAS3108 to 4.210.90-3396. Before rolling back the firmware, remove the supercapacitor. After the rollback is complete, connect the RAID controller card to the supercapacitor.

(2) Power on the server. Upon server startup, press Ctrl+R when prompted to open the WebBIOS screen of the LSI SAS3108 controller card.

(3) View the status of the preserved cache.

On the VD Mgmt screen, select SAS3108 and press F2. The options shown in the following figure are displayed.

If Manage Preserved Cache is available (white), preserved cache needs to be deleted. Go to (4). If Manage Preserved Cache is unavailable (dimmed), preserved cache has been deleted automatically. Go to (5).

(4) Select Manage Preserved Cache and press Enter. The screen shown in the following figure is displayed.

Select DISCARD CACHE and press Enter. On the confirmation screen, select YES and press Enter. Preserved cache data is deleted.

(5) Restore the RAID controller card and firmware.

If you choose to use the original RAID controller card in (1), replace the RAID controller card with the spare one.

If you choose to roll back the firmware version in (1), upgrade the firmware to 4.270.00-4382.

Experience

None

Note

None

No Alarm Is Found After a Member Drive of a Single-Drive RAID 0 Is Faulty on a Server Configured with the LSI SAS3108
Problem Description
Table 5-184 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers configured with the LSI SAS3108

Release Date

2016-11-06

Keyword

Single-drive RAID 0, drive fault, iBMC, no alarm, OCR

Symptom

Hardware configuration:

RH2288H V3 configured with the LSI SAS3108

RAID configuration

Data drives of the LSI SAS3108 are configured as single-drive RAID 0 arrays.

Symptom:

After the customer finds that the hard drive in slot 21 is faulty, the iBMC does not report alarms and the fault indicator on the panel is off.

Key Process and Cause Analysis

The following figure shows the iBMC system event log (SEL).

The iBMC has reported an alarm when the hard drive in slot 21 is faulty, but the alarm is cleared later.

The following figure shows the RAID controller card log.

On 2016-07-27, the hard drive in slot 21 was faulty. After a command timeout, the hard drive was reset and removed. The hard drive then became FAILED, and an iBMC alarm was generated.

At about 16:30 on 2016-07-28, the RAID controller card was reset abnormally.

After the reset, no records exist indicating that a drive is inserted into slot 21 during drive discovery in the initialization phase, which means that the drive in slot 21 is missing.

1716: 16-07-28,16:34:27 Info:Inserted: PD 01(e0x00/s38)

1717: 16-07-28,16:34:27 Info:Inserted: PD 01(e0x00/s38) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=5000c500964774ad,0000000000000000

1718: 16-07-28,16:34:27 Info:Inserted: PD 02(e0x00/s39)

1719: 16-07-28,16:34:27 Info:Inserted: PD 02(e0x00/s39) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=5000c500964c028d,0000000000000000

1720: 16-07-28,16:34:27 Info:Inserted: PD 03(e0x00/s8)

1721: 16-07-28,16:34:27 Info:Inserted: PD 03(e0x00/s8) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027388,0000000000000000

1722: 16-07-28,16:34:27 Info:Inserted: PD 04(e0x00/s17)

1723: 16-07-28,16:34:27 Info:Inserted: PD 04(e0x00/s17) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027391,0000000000000000

1724: 16-07-28,16:34:27 Info:Inserted: PD 05(e0x00/s10)

1725: 16-07-28,16:34:27 Info:Inserted: PD 05(e0x00/s10) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738a,0000000000000000

1726: 16-07-28,16:34:27 Info:Inserted: PD 06(e0x00/s14)

1727: 16-07-28,16:34:27 Info:Inserted: PD 06(e0x00/s14) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738e,0000000000000000

1728: 16-07-28,16:34:27 Info:Inserted: PD 07(e0x00/s13)

1729: 16-07-28,16:34:27 Info:Inserted: PD 07(e0x00/s13) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738d,0000000000000000

1730: 16-07-28,16:34:27 Info:Inserted: PD 08(e0x00/s4)

1731: 16-07-28,16:34:27 Info:Inserted: PD 08(e0x00/s4) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027384,0000000000000000

1732: 16-07-28,16:34:27 Info:Inserted: PD 0a(e0x00/s9)

1733: 16-07-28,16:34:27 Info:Inserted: PD 0a(e0x00/s9) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027389,0000000000000000

1734: 16-07-28,16:34:27 Info:Inserted: PD 0b(e0x00/s0)

1735: 16-07-28,16:34:27 Info:Inserted: PD 0b(e0x00/s0) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027380,0000000000000000

1736: 16-07-28,16:34:27 Info:Inserted: PD 0c(e0x00/s6)

1737: 16-07-28,16:34:27 Info:Inserted: PD 0c(e0x00/s6) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027386,0000000000000000

1738: 16-07-28,16:34:27 Info:Inserted: PD 0d(e0x00/s2)

1739: 16-07-28,16:34:27 Info:Inserted: PD 0d(e0x00/s2) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027382,0000000000000000

1740: 16-07-28,16:34:27 Info:Inserted: PD 0e(e0x00/s19)

1741: 16-07-28,16:34:27 Info:Inserted: PD 0e(e0x00/s19) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027393,0000000000000000

1742: 16-07-28,16:34:27 Info:Inserted: PD 0f(e0x00/s11)

1743: 16-07-28,16:34:27 Info:Inserted: PD 0f(e0x00/s11) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738b,0000000000000000

1744: 16-07-28,16:34:27 Info:Inserted: PD 10(e0x00/s7)

1745: 16-07-28,16:34:27 Info:Inserted: PD 10(e0x00/s7) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027387,0000000000000000

1746: 16-07-28,16:34:27 Info:Inserted: PD 11(e0x00/s12)

1747: 16-07-28,16:34:27 Info:Inserted: PD 11(e0x00/s12) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738c,0000000000000000

1748: 16-07-28,16:34:27 Info:Inserted: PD 12(e0x00/s1)

1749: 16-07-28,16:34:27 Info:Inserted: PD 12(e0x00/s1) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027381,0000000000000000

1750: 16-07-28,16:34:27 Info:Inserted: PD 13(e0x00/s3)

1751: 16-07-28,16:34:27 Info:Inserted: PD 13(e0x00/s3) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027383,0000000000000000

1752: 16-07-28,16:34:27 Info:Inserted: PD 14(e0x00/s5)

1753: 16-07-28,16:34:27 Info:Inserted: PD 14(e0x00/s5) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027385,0000000000000000

1754: 16-07-28,16:34:27 Info:Inserted: PD 15(e0x00/s16)

1755: 16-07-28,16:34:27 Info:Inserted: PD 15(e0x00/s16) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027390,0000000000000000

1756: 16-07-28,16:34:27 Info:Inserted: PD 16(e0x00/s15)

1757: 16-07-28,16:34:27 Info:Inserted: PD 16(e0x00/s15) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b00002738f,0000000000000000

1758: 16-07-28,16:34:27 Info:Inserted: PD 17(e0x00/s18)

1759: 16-07-28,16:34:27 Info:Inserted: PD 17(e0x00/s18) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027392,0000000000000000

1760: 16-07-28,16:34:27 Info:Inserted: PD 18(e0x00/s20)

1761: 16-07-28,16:34:27 Info:Inserted: PD 18(e0x00/s20) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027394,0000000000000000

1762: 16-07-28,16:34:27 Info:Inserted: PD 19(e0x00/s23)

1763: 16-07-28,16:34:27 Info:Inserted: PD 19(e0x00/s23) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027397,0000000000000000

1764: 16-07-28,16:34:27 Info:Inserted: PD 1a(e0x00/s22)

1765: 16-07-28,16:34:27 Info:Inserted: PD 1a(e0x00/s22) Info: enclPd=00, scsiType=0, portMap=00, sasAddr=500605b000027396,000000000000000

Slot 21 is configured with a single-drive RAID 0. When the member drive cannot be found during initialization, the RAID controller card cannot tell whether a RAID array existed on that drive. In this case, the RAID controller card cannot determine whether the drive was manually removed during maintenance or is faulty and cannot be identified. As a result, the iBMC does not report new alarms.
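
Finding the slot that has no "Inserted" record is easier with a script than by reading the log by eye. The sketch below parses records in the format shown in the log above and reports which expected slots are absent. The function name and the choice of expected slots are assumptions for illustration; the three sample lines are copied from the log in this case.

```python
import re

# Matches records such as "Inserted: PD 19(e0x00/s23)" and captures the slot number.
INSERTED = re.compile(r"Inserted: PD [0-9a-f]+\(e0x[0-9a-f]+/s(\d+)\)")


def missing_slots(log_text, expected_slots):
    """Return expected slot numbers that have no 'Inserted' record in log_text."""
    seen = {int(match.group(1)) for match in INSERTED.finditer(log_text)}
    return sorted(set(expected_slots) - seen)


sample_log = "\n".join([
    "1760: 16-07-28,16:34:27 Info:Inserted: PD 18(e0x00/s20)",
    "1762: 16-07-28,16:34:27 Info:Inserted: PD 19(e0x00/s23)",
    "1764: 16-07-28,16:34:27 Info:Inserted: PD 1a(e0x00/s22)",
])

# Checking only slots 20 to 23 here; adjust the range to the server's drive bays.
print(missing_slots(sample_log, range(20, 24)))  # [21]
```

Run against the full log above with slots 0 to 23 as the expected set, the same approach would single out slot 21.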

Conclusion and Solution

Conclusion:

The iBMC has reported an alarm when the hard drive is faulty. After the RAID controller card is reset, the faulty drive cannot be identified. In this case, the RAID controller card cannot determine whether the drive is manually removed during maintenance or the hard drive is faulty and cannot be identified. As a result, the iBMC does not report new alarms and the original alarm is cleared.

Solution:

You are advised to monitor iBMC alarms and handle hard drive fault alarms in a timely manner.

Experience

None

Note

None

Firmware Cannot Be Initialized Because "Bad or missing RAID controller memory module" Is Displayed on a Server Configured with the LSI SAS3108
Problem Description
Table 5-185 Basic information

Item

Information

Source of the Problem

LSI SAS3108

Intended Product

All servers configured with the LSI SAS3108

Release Date

2016-09-30

Keyword

3108, memory, bad or missing

Symptom

Hardware configuration: LSI SAS3108

Software version: 4.270.00-4382

Symptom: "Bad or missing RAID controller memory module" is displayed during system startup.

Key Process and Cause Analysis

Information shown in the following figure is displayed in the serial port log of the RAID controller card.

Cause analysis:

The memory chip on the RAID controller card is faulty. As a result, the firmware stops responding and cannot initialize the RAID controller card.

Conclusion and Solution

Conclusion:

"Bad or missing RAID controller memory module" is displayed during system startup. As a result, the RAID controller card fails to be initialized.

Solution:

Replace the RAID controller card.

Experience

None

Note

None

LSI SAS3108 Is Reset Due to a Chip SRAM Correctable Error
Problem Description
Table 5-186 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

LSI SAS3108

Firmware version: not later than 107 (4.270.00-4382)

Release Date

2017-02-08

Keyword

3108, reset, SRAM

Symptom

Hardware configuration: LSI SAS3108

Software version: 4.270.00-4382

Symptom: The service is abnormal. The logs show that the LSI SAS3108 was reset.

Key Process and Cause Analysis

The RAID controller card log shows that the RAID controller card was reset on December 17, 2016.

Logs recorded before the RAID controller card reset show that the RAID controller card reported a large number of SRAM correctable errors. No problem will occur if the RAID controller card firmware rectifies a correctable error. However, repeatedly rectifying a large number of such errors will use up all resources and finally lead to a reset.

The SRAM is the RAM memory inside the 3108 chip and is used to run the 3108 firmware.

"SRAM correctable error" indicates that the RAM encountered an ECC error. All SRAM error addr parameters show 0xc038e9b0 in the logs. The supplier confirmed that the firmware 4.270.00-4382 had a bug. When the firmware corrects the SRAM correctable errors, the firmware will enter an infinite loop and finally stop responding. As a result, the reset mechanism of the driver and firmware is triggered.

The problem has been solved in MR6.10.

Cause analysis:

Errors occur in the SRAM of the 3108 chip. When the firmware 4.270.00-4382 (107) corrects the errors, an infinite loop occurs. As a result, the RAID controller card is reset.

Conclusion and Solution

Conclusion:

Errors occur in the SRAM of the 3108 chip. When the firmware 4.270.00-4382 (107) corrects the errors, an infinite loop occurs. As a result, the RAID controller card is reset.

Solution:

Upgrade the RAID controller card firmware to a version later than 4.650.00-6121 (108) and upgrade the driver to the matching version.

Experience

None

Note

None

I/O Faults Occur and "2108vI2o.c" Is Recorded in the Logs on a Server Configured with the LSI SAS3108
Problem Description
Table 5-187 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

LSI SAS3108

Firmware version: 4.270.00-4382

The driver version does not match the firmware version.

Release Date

2017-02-08

Keyword

3108, reset, 2108vI2o.c

Symptom

Hardware configuration:

LSI SAS3108

Software version:

LSI SAS3108 firmware: 4.270.00-4382

LSI SAS3108 driver:

CentOS 6.7 built-in driver version 06.806.08.00-rh3

Symptom:

I/Os fail to be delivered and the OS becomes read-only. The message log records that the RAID controller card has been reset.

Key Process and Cause Analysis

The RAID controller card log records fatal errors of 2108vI2o.c.

3508: 17-01-25,21:19:48 CRITICAL:Controller encountered a fatal error and was reset

3509: 17-01-25,21:19:50 Info:Battery Present

3510: 17-01-25,21:19:50 Info:Package version 24.7.0-0057

3511: 17-01-25,21:19:50 Info:Board Revision

3512: 17-01-25,21:19:54 Info:Battery charge complete

3513: 17-01-25,21:19:54 Info:Battery temperature is normal

3514: 17-01-25,21:19:54 Info:Battery relearn will start in 4 days

3515: 17-01-25,21:19:59 DEAD:Fatal firmware error: Line 1518 in ../../raid/2108vI2o.c

3516: 17-01-25,21:19:59 DEAD:Fatal firmware error: Line 1518 in ../../raid/2108vI2o.c

Generally, 2108vI2o.c fatal errors are related to the driver. The driver version recorded in the log is 06.806.08.00-rh3, which is the OS built-in version. You are advised to upgrade the driver to 06.807.10.00 that matches the 4.270.00-4382 firmware.
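
To see at a glance which firmware source location is failing and how often, the fatal error records can be extracted and counted. A minimal sketch: the function name is an illustrative assumption, and the sample lines are copied from the RAID controller card log above.

```python
import re
from collections import Counter

# Captures the line number and source file from a fatal firmware error record.
FATAL = re.compile(r"Fatal firmware error: Line (\d+) in (\S+)")


def fatal_firmware_errors(log_text):
    """Return (source_file, line_number) pairs, one per fatal firmware error record."""
    return [(m.group(2), int(m.group(1))) for m in FATAL.finditer(log_text)]


sample_log = "\n".join([
    "3508: 17-01-25,21:19:48 CRITICAL:Controller encountered a fatal error and was reset",
    "3515: 17-01-25,21:19:59 DEAD:Fatal firmware error: Line 1518 in ../../raid/2108vI2o.c",
    "3516: 17-01-25,21:19:59 DEAD:Fatal firmware error: Line 1518 in ../../raid/2108vI2o.c",
])

for (source, line), count in Counter(fatal_firmware_errors(sample_log)).items():
    print(source, line, count)  # ../../raid/2108vI2o.c 1518 2
```

Repeated hits on 2108vI2o.c are the cue, per the analysis above, to compare the driver version with the version that matches the firmware.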

Cause analysis:

The driver version does not match the LSI SAS3108 firmware. As a result, the RAID controller card may be reset.

Conclusion and Solution

Conclusion:

The driver version does not match the LSI SAS3108 firmware. As a result, the RAID controller card may be reset.

Solution:

Upgrade the LSI SAS3108 driver to the matching version of the firmware.

The version mapping table and driver compilation guide (the customer needs to compile the driver when using non-standard kernels) can be downloaded on iDriver.

Experience

None

Note

None

Sequential Write Performance Deteriorates When Two RAID Arrays Trigger Rebuild Simultaneously on a Server Configured with the LSI SAS3108
Problem Description
Table 5-188 Basic information

Item

Information

Source of the Problem

Tecal RH2288H V3

Intended Product

Tecal RH2288H V3

Release Date

2017-03-31

Keyword

3108, rebuild, performance

Symptom

Hardware configuration:

RH2288H V3 (25 or 12 drives) configured with the LSI SAS3108

Symptom

Two 3-HDD RAID 6 arrays are created on the LSI SAS3108. After the rebuild process is triggered for the two RAID arrays simultaneously, a drive not in these RAID arrays is hot swapped. As a result, the rebuild performance is reduced by 90%.

Figure 5-257 Performance data in the RAID 6 rebuild process
Key Process and Cause Analysis

Key process:

1) Use storcli64 to check the firmware version and the driver version of the RAID controller card.

The check result shows that the firmware version and driver version are the latest.

Firmware: 4.650.00-6121

Driver: 06.811.02.00

2) Ensure that the fio test parameters are set correctly.

./fio --name=Stress_64k_randrw_$disk --ioengine="libaio" --size=20GB --direct=1 --bs=64k --rw=write --time_based --runtime=259200 --filename=/dev/$part --output=/root/stresslog/result_64k_randrw_$disk.txt &

3) Create two RAID 5 arrays and two RAID 1 arrays using HDDs and check the sequential write performance during rebuild. When the rebuild is triggered simultaneously on two RAID arrays, the sequential write performance is reduced by over 90%.

4) Check whether the sequential read and random read performance of the two RAID 5 arrays drops.

When the rebuild is triggered simultaneously on the two RAID 5 arrays, there is no obvious drop in sequential read performance.

When the rebuild is triggered simultaneously on the two RAID 5 arrays, there is no obvious drop in random read performance.

5) Check whether the random write performance of the two RAID 5 arrays drops.

When the rebuild is triggered simultaneously on the two RAID 5 arrays, there is no obvious drop in random write performance.

6) Create two RAID 5 arrays using SSDs and check whether the sequential write performance of the two RAID 5 arrays drops.

When the rebuild is triggered simultaneously on the two RAID 5 arrays, there is no obvious drop in sequential write performance.

If SSDs are used to create two RAID 5 arrays and the rebuild is triggered simultaneously on the two RAID arrays, the sequential write performance does not drop significantly. If HDDs are used to create two RAID 5 arrays and the rebuild is triggered simultaneously on the two RAID arrays, the sequential write performance is reduced by over 90%, but the performance of sequential read, random read, and random write does not drop significantly.

Cause analysis:

Traditional HDDs incur seek time because of their mechanical architecture. When service I/Os and rebuild I/Os access the drives at the same time, switching between the logical blocks being accessed requires a large amount of seek time, and I/O performance is greatly affected. The performance is not affected during rebuild when SSDs are used to create the two RAID arrays. This indicates that the RAID controller card firmware meets the processing capability requirements.
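The effect of seek time can be illustrated with a simple model in which every data chunk written is followed by one head seek between the service I/O area and the rebuild area. The following is a minimal sketch only. The bandwidth and seek time values are illustrative assumptions, not measurements from this case, and the function names are hypothetical.

```python
def effective_throughput(bandwidth_bps, chunk_bytes, seek_s):
    """Throughput when each chunk transfer is followed by one head seek."""
    transfer_s = chunk_bytes / bandwidth_bps
    return chunk_bytes / (transfer_s + seek_s)


def relative_drop(bandwidth_bps, chunk_bytes, seek_s):
    """Fraction of the streaming bandwidth lost to seeking (0.0 = no loss)."""
    ratio = effective_throughput(bandwidth_bps, chunk_bytes, seek_s) / bandwidth_bps
    return max(0.0, 1.0 - ratio)


# Illustrative assumptions only, not measured values:
HDD_BANDWIDTH = 150e6  # bytes/s of pure sequential streaming
HDD_SEEK = 0.008       # seconds per seek between service and rebuild areas
CHUNK = 64 * 1024      # 64 KiB, matching --bs=64k in the fio job

print(f"HDD drop with seeks: {relative_drop(HDD_BANDWIDTH, CHUNK, HDD_SEEK):.1%}")
print(f"Drop without seeks:  {relative_drop(HDD_BANDWIDTH, CHUNK, 0.0):.1%}")
```

With these assumed values, the seek time dominates the transfer time of each 64 KiB chunk, which is consistent with a drop of over 90% on HDDs and no significant drop on drives without mechanical seeks.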

Conclusion and Solution

Conclusion:

If HDDs are used to create two RAID 1, 5, or 6 arrays and the rebuild is triggered simultaneously on two RAID arrays, the sequential write performance deteriorates by over 90%.

Solution:

If two RAID arrays are rebuilt at the same time and high write performance is required, you are advised to configure SSDs.

If HDDs are used, a performance drop during the rebuild is normal. The performance will be restored after the rebuild is complete.

Experience

None

Note

None

LSI SAS3108 Does Not Respond Due to Chip PPC47612 Error and the OS Is Read-Only
Problem Description
Table 5-189 Basic information

Item | Information
Source of the Problem | Tecal RH2288H V3
Intended Product | FusionServer
Release Date | 2017-09-09
Keyword | 3108, ppc47612, firmware, hang

Symptom

Hardware configuration: RH2288H V3 (25 or 12 drives) configured with the LSI SAS3108

Symptom: An RH2288H V3 is configured with the LSI SAS3108. When the OS is running, the RAID controller card turns to the FAULT state. The Megaraid_sas driver attempts to reset the RAID controller card. However, a message is displayed indicating that the reset operation is not supported. As a result, the RAID controller card does not respond, hard drives cannot be accessed, and the OS becomes read-only.

The key log content is as follows:

megaraid_sas: Found FW in FAULT state, will reset adapter scsi0.

megaraid_sas: resetting fusion adapter scsi0.

megaraid_sas: Reset not supported, killing adapter scsi0
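The key log sequence above can be searched for automatically in collected OS logs. The following is a minimal sketch, assuming the message wording matches the lines quoted above (the wording may differ between megaraid_sas driver versions). The function name is hypothetical.

```python
import re

# Patterns based on the log lines quoted in this case; treat the exact
# wording as an assumption for other driver versions.
FAULT_RE = re.compile(r"megaraid_sas: Found FW in FAULT state")
KILL_RE = re.compile(r"megaraid_sas: Reset not supported, killing adapter (scsi\d+)")


def killed_adapters(log_lines):
    """Return adapters the driver gave up on after a firmware FAULT was logged."""
    fault_seen = False
    killed = []
    for line in log_lines:
        if FAULT_RE.search(line):
            fault_seen = True
        match = KILL_RE.search(line)
        if match and fault_seen:
            killed.append(match.group(1))
    return killed


sample_log = [
    "megaraid_sas: Found FW in FAULT state, will reset adapter scsi0.",
    "megaraid_sas: resetting fusion adapter scsi0.",
    "megaraid_sas: Reset not supported, killing adapter scsi0",
]
print(killed_adapters(sample_log))
```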

Figure 5-258 System event logs
Key Process and Cause Analysis

Key process:

(1) Check the value of Disable Online Controller Reset, which is No, indicating that the RAID controller card allows online controller reset (OCR).

Figure 5-259 Disable Online Controller Reset

(2) Check the RAID controller card log, which contains error messages "DEAD:Fatal firmware error: Line 782 in ../../raid/ppc47612.c" and "DEAD: Fatal firmware error: Driver detected possible FW hang, halting FW". DEAD:Fatal firmware error: Line 782 in ../../raid/ppc47612.c indicates that an L2 cache error occurs at the hardware layer of the CPU (PowerPC 476FP Module) of the RAID controller card.

Figure 5-260 Fatal firmware error ppc47612

The integer computing capability of the PowerPC 476FP module scales with its running frequency: 2000 Dhrystone million instructions per second (DMIPS) at 1 GHz and 2400 DMIPS at 1.2 GHz.

(3) Check the PowerPC 476FP module. When the module is faulty, the RAID controller card becomes faulty and the automatic OCR (enabled by default) fails. As a result, hard drives under the RAID controller card cannot be accessed, and the OS becomes read-only.

Cause analysis:

An L2 cache error occurs on the PowerPC 476FP module and the OCR function fails. Therefore, the RAID controller card turns to the FAULT state and cannot be reset online. As a result, hard drives under the RAID controller card cannot be accessed and the OS becomes read-only.

Conclusion and Solution

Conclusion:

An L2 cache error occurs on the PowerPC 476FP module, and the RAID controller card turns to the FAULT state and cannot be reset online. As a result, hard drives under the RAID controller card cannot be accessed and the OS becomes read-only.

Solution:

Replace the RAID controller card.

Experience

None

Note

None

Data on a Hot Spare Drive Cannot Be Automatically Copied to the New Drive After the Faulty Drive Is Replaced on a Server Configured with the LSI SAS3108
Problem Description
Table 5-190 Basic information

Item | Information
Source of the Problem | Tecal RH2288H V3
Intended Product | FusionServer
Release Date | 2017-09-09
Keyword | 3108, copyback failure, hot spare drive

Symptom

Hardware configuration:

RH2288H V3 configured with the LSI SAS3108

RAID controller card firmware: 4.270.00-4382

Symptom:

An RH2288H V3 is configured with the LSI SAS3108. When the hard drive in slot 0 is faulty, the hot spare drive in slot 7 is used for rebuild, as shown in Figure 5-261. When the faulty drive in slot 0 is replaced, data on the hot spare drive is not automatically copied back to the new drive, as shown in Figure 5-262.

Figure 5-261 Rebuilding
Figure 5-262 Copyback not triggered
Key Process and Cause Analysis

Key process:

(1) Check the value of Disable Copyback, which is No, indicating that the copyback function is enabled.

Figure 5-263 RAID controller card attributes

(2) Check the RAID controller card log. According to the log, when the hard drive in slot 0 is faulty, the hot spare drive in slot 7 is in the POWERSAVE state.

Figure 5-264 Status of the drive in slot 7

(3) After the rebuild is complete using the hot spare drive in slot 7, the copyback is not triggered when the faulty drive in slot 0 is replaced.

Figure 5-265 Copyback not triggered

Cause analysis:

If a hot spare drive is in the POWERSAVE state when the rebuild is triggered, the firmware cannot write the COPYBACK flag to the hot spare drive. As a result, the copyback is not triggered for the hot spare drive when the faulty drive is replaced.

Conclusion and Solution

Conclusion:

If a hot spare drive is in the POWERSAVE state when the rebuild is triggered, the firmware cannot write the COPYBACK flag to the hot spare drive. As a result, the copyback is not triggered for the hot spare drive when the faulty drive is replaced.

Solution:

1. Upgrade the 3108 firmware version to 4.650.00-8102.

http://support.huawei.com/enterprise/en/software/22577898-SW1000259768

2. Run ./storcli64 /c0 set ds=off type=2 to disable the power save mode of hot spare drives. In this command, type=2 specifies hot spare drives as the target devices.
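Whether a controller needs the firmware upgrade can be checked by comparing its firmware version with 4.650.00-8102. The following is a minimal sketch, assuming that version strings can be compared field by field as integers. This comparison rule and the function names are assumptions, not a documented vendor rule.

```python
import re

FIXED_VERSION = "4.650.00-8102"  # version named in the solution above


def parse_version(text):
    """Split a version such as '4.270.00-4382' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", text.strip()))


def needs_upgrade(current, fixed=FIXED_VERSION):
    """True if the installed firmware is older than the fixed version."""
    return parse_version(current) < parse_version(fixed)


# Firmware version reported in the symptom of this case
print(needs_upgrade("4.270.00-4382"))
```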

Experience

None

Note

None

RAID Array Creation Fails After a Faulty Hard Drive Is Replaced and Dirty Data Is Deleted on a Server Configured with the LSI SAS3108
Problem Description
Table 5-191 Basic information

Item | Information
Source of the Problem | RH2288 V3
Intended Product | FusionServer
Release Date | 2017-09-09
Keyword | 3108, preserved cache, failed to create a RAID array

Symptom

Hardware configuration: RH2288 V3 (25 or 12 drives) configured with the LSI SAS3108

Firmware version: 4.270.00-4382

Symptom: Onsite, a single hard drive under the LSI SAS3108 was configured as a RAID 0 array. After the drive became faulty and was replaced, a RAID array failed to be created using commands. After the command for deleting the pinned cache was executed on the command-line interface (CLI), the RAID array still failed to be created.

Figure 5-266 Failing to create a RAID array
Key Process and Cause Analysis

Key process:

1. Check the RAID controller card log, which shows that after a hard drive was removed by the firmware on March 31, the RAID controller card reported a preserved cache error.

2. On April 1, a RAID array failed to be created using the hard drive in slot 2 because preserved cache existed. The log did not contain a record indicating that preserved cache was deleted.

3. Reproduce the fault in the lab.

The firmware version is the same as that onsite.

(1) Produce preserved cache and create a RAID array. The RAID array fails to be created.

[root@localhost ~]# ./storcli64 /c0 add vd type=r0 drive=11:4

Controller = 0

Status = Failure

Description = controller has data in cache for offline or missing virtual disks

(2) Delete preserved cache.

[root@localhost ~]# ./storcli64 /c0/vall delete preservedcache

Controller = 0

Status = Success

Description = Virtual Drive preserved Cache Data Cleared.

(3) Check that the preserved cache is deleted.

[root@localhost ~]# ./storcli64 /c0 show preservedcache

Controller = 0

Status = Success

Description = No Virtual Drive has Preserved Cache Data.

(4) Create a RAID array again. The creation is successful.

[root@localhost ~]# ./storcli64 /c0 add vd type=r0 drive=11:4

Controller = 0

Status = Success

Description = Add VD Succeeded

The log shows that "discard pinned cache for targetId 1 complete" is displayed when the preserved cache is deleted.
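When the steps above are scripted, the storcli64 text output can be checked before a RAID array is created again. The following is a minimal sketch that parses the "Key = Value" lines shown in the command outputs above. The function names are hypothetical, and the matching on the Description text is an assumption based on the output in this case.

```python
def parse_storcli_result(output):
    """Collect 'Key = Value' lines from storcli64 text output into a dict."""
    result = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            result[key.strip()] = value.strip()
    return result


def blocked_by_preserved_cache(output):
    """True if a command failed because preserved cache still exists."""
    result = parse_storcli_result(output)
    return (result.get("Status") == "Failure"
            and "data in cache" in result.get("Description", ""))


failed_output = """Controller = 0
Status = Failure
Description = controller has data in cache for offline or missing virtual disks
"""
print(blocked_by_preserved_cache(failed_output))
```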

Cause analysis:

StorCLI (storcli64) 1.12.13 was used onsite. In this version, the command for deleting dirty data (preserved cache) does not take effect. Therefore, RAID 0 still failed to be created after the command was executed onsite.

Conclusion and Solution

Conclusion:

StorCLI (storcli64) 1.12.13 was used onsite. In this version, the command for deleting dirty data (preserved cache) does not take effect. Therefore, RAID 0 still failed to be created after the command was executed onsite.

Solution:

Use the storage command line tool (version 1.19.04 released on Feb 1, 2016) to delete preserved cache before creating a RAID array.

https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361-8i#tab-archive-drivers4-abc

Experience

None

Note

None

LSI SAS3108 Fails to Respond to Cold Reset
Problem Description
Table 5-192 Basic information

Item | Information
Source of the Problem | -
Intended Product | FusionServer
Release Date | 2018-02-03
Keyword | 1078int.c, firmware, no response

Symptom

Hardware configuration: LSI SAS3108 and 2 GB cache

Firmware version: 4.660.8102

Backplane: Pass-through backplane and expander

Symptom:

During the server stability test, the boot device may fail to be found.

Figure 5-267 No boot device
Key Process and Cause Analysis

Key process:

  1. Check the RAID controller card log, which shows that the firmware initialization fails and the following error message is recorded: "FW report Fatal firmware error: Line 245 in ../../raid/1078int.c".
    Figure 5-268 Error information
  2. Check the serial port log, which shows that the firmware returns 0 devices after receiving "SAS UNIT page 0 (which has number of phys)" and reports "Exception in Core0".
    Figure 5-269 Serial port log

Cause analysis:

If the RAID controller card firmware reports "FW report Fatal firmware error: Line 245 in ../../raid/1078int.c" during the system startup and does not respond, the BIOS cannot find the hard drives under the RAID controller card. As a result, the boot device fails to be found and the error message "No bootable device" is displayed.

The following figure shows possible causes of the problem.

During automatic port configuration, the PL layer of the RAID controller card receives a Broadcast change primitive event from the expander before the discovery is complete, and fails to handle the resulting null pointer. As a result, the firmware does not respond.

Conclusion and Solution

Conclusion:

The RAID controller card exception is caused by a firmware bug. The fix is to make the PL layer of the RAID controller card respond to Broadcast change primitive events only after all drives have been discovered.

Solution:

  1. Upgrade the RAID controller card firmware to the latest version.

Workaround:

Restart the server to recover the RAID controller card.

Experience

None

Note

None

Hitachi Drives May Become Offline After They Enter the POWERSAVE State and the System Is Restarted on a Server Configured with the LSI SAS3108
Problem Description
Table 5-193 Basic information

Item | Information
Source of the Problem | RH5288H V3
Intended Product | FusionServer
Release Date | 2018-01-20
Keyword | Energy saving, POWERSAVE, Hitachi hard drive, offline

Symptom

Server: RH5288H V3

RAID controller card: LSI SAS3108

Firmware version: 4.660.8102

Backplane: Cobra 24-bay expander

Drive vendor: Hitachi

Symptom: When hard drives in the Unconfigured Good (UG) state or hot spare drives have entered the POWERSAVE state for energy saving, power cycling the server takes the drives offline. As a result, after the firmware initialization is complete during system startup, the system stops at the configuration loss screen.

Key Process and Cause Analysis

Key process:

  1. On December 30, 2017, the hot spare drive in slot 37 automatically entered the POWERSAVE state 30 minutes after power-on.
    Figure 5-270 Hard drive status
  2. The system was restarted on January 4, 2018.
    Figure 5-271 System restart
  3. After the system restarted, the firmware failed to identify the hot spare drive in slot 37.
    Figure 5-272 RAID controller card log

Cause analysis:

When the hard drive status changes from standby power save to spin up, the firmware delivers the Read Log Ext command, which the drive may not answer for more than 3s. When End Device Frame Buffering (EDFB, a recent Cobra feature that improves bandwidth for 12G SAS) is enabled on the expander backplane, the firmware delivers the non-NCQ command again if the first non-NCQ command is not answered within 3s. The expander then receives a new command while a Read Log Ext command is still pending. The SXP considers this abnormal and delivers a link reset command, and the Read Log Ext commands delivered by the firmware are discarded. As a result, the hard drive in the POWERSAVE state becomes offline when the system is reset.

According to the trace analysis, all Read Log Ext commands delivered by the firmware fail to be executed (Incomplete).

Figure 5-273 SAS traces from the backplane to the hard drive
Conclusion and Solution

Conclusion:

The identify command can be executed successfully to query the SATA drive information. Use the identify command instead of the Read Log Ext command to query drive information.

Solution:

  1. Disable the energy saving function of hard drives.
  2. Upgrade the RAID controller card firmware to the latest version.
Experience

None

Note

None

Data Drives Are Blocked After CacheCade Is Disassociated on a Server Configured with the LSI SAS3108
Problem Description
Table 5-194 Basic information

Item | Information
Source of the Problem | RH2288H V3
Intended Product | FusionServer
Release Date | 2018-03-15
Keyword | 3108, CacheCade, I/O block

Symptom

Hardware configuration: RH2288H V3 configured with the LSI SAS3108

Firmware version: 4.660.8102

After CacheCade was disassociated from a data virtual drive (VD) onsite, the data VD became inaccessible to I/O reads and writes.

Figure 5-274 Device not ready
Key Process and Cause Analysis

Key process:

  1. Query the logical drive attributes of the sdb in the RAID controller card log.
    Figure 5-275 Logical drive status
  2. Check the logical drive status at the operation time of the customer. The logical drive was blocked.
    Figure 5-276 VD 02 status
  3. Analyze the onsite operation logs. CacheCade was disassociated from the VD successfully.
    Figure 5-277 Disassociation progress
  4. Check the customer's operations. During the disassociation process, the customer set the CacheCade VD drives in slots 8 and 9 to OFFLINE.
    Figure 5-278 Drive status
  5. The CacheCade VD was offline after disassociation. As a result, the associated data drives were blocked.
    Figure 5-279 VD blocked

Cause analysis:

During disassociation, the customer set physical drives in the CacheCade VD to OFFLINE. As a result, the CacheCade VD status changed from OPTIMAL to OFFLINE, and the disassociation was interrupted. When the disassociation is interrupted, the firmware blocks access to the corresponding data VD to ensure that the original data can be successfully updated to the data VD after the CacheCade VD is recovered. However, the customer then created a new CacheCade, and the firmware could not undo the ongoing disassociation. Therefore, the data VD remained blocked, could not be restored automatically, and was inaccessible.

Conclusion and Solution

Conclusion:

CacheCade fails to be disassociated from the VD due to abnormal operations of the customer, and the associated data VD is blocked.

Solution:

  1. Run ./storcli64 /c0/vx set accesspolicy=rmvblkd to unblock the data VD, where x is the number of the blocked VD.
Experience

Disassociate a CacheCade from a VD before deleting the CacheCade VD. Do not set CacheCade VD member drives to OFFLINE during the disassociation process.

Note

The following figure shows that the data VD is unblocked onsite.

L01 Alarm on the RH1288 V3
Problem Description
Table 5-195 Basic information

Item | Information
Source of the Problem | RH1288 V3
Intended Product | V3 and V5 servers
Release Date | 2018-04-11
Keyword | L01 alarm

Symptom

The L01 alarm "The SAS or PCIe cable to disk backplane is incorrectly connected." is generated on the server.

After the SAS cables, RAID controller card, and backplane are replaced, the L01 alarm still exists.

Key Process and Cause Analysis
  1. Analyze L01 alarm information.

    The alarm information is as follows:

    The SAS or PCIe cable to [arg1] disk backplane arg2 is incorrectly connected.

    This alarm is generated when the SAS or PCIe cables of the hard drive backplane are incorrectly connected.

    The alarm object is cable.

    Impact on the system:

    Data read and write operations may be abnormal due to unstable SAS links.

    Possible causes:

    • The SAS cables are connected incorrectly.
    • The SAS cables are faulty.
    • The RAID controller card is faulty.
    • The hard drive backplane is faulty.
  2. Check the L01 alarm mechanism.

    SAS link: CPU1 > mainboard > LSI SAS3008 > SAS cables > hard drive backplane

    The BMC receives and parses the CPLD logic register parameter that defines the SAS cable interface. If the parameter is inconsistent with the value in the BMC interface definition white paper, the L01 alarm is generated.

    According to the CPLD register definition, the correct value of register 0x56 is 0x32.

  3. Analyze the logs.

    The SAS cables, RAID controller card, and backplane are replaced on the faulty server, but the L01 alarm still exists.

    The L01 alarm link is: CPU1 > mainboard > LSI SAS3008 > SAS cables > hard drive backplane. All backend parts have been replaced. According to the FDM logs, the frontend parts CPU and mainboard have no hardware faults.

    According to the CPLD logic register, the value of register 0x56 corresponding to the SAS cables is 0x23 instead of 0x32.

    Port A and port B of SAS cables are reversely connected, resulting in SAS cable alarms.
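The register comparison can be sketched as follows. This is an illustration only. It assumes that the value read onsite is hexadecimal (0x23) and that each 4-bit half of register 0x56 identifies one SAS port, so that reversed cables appear as swapped halves. Neither assumption is confirmed by the CPLD definition quoted above, and the function names are hypothetical.

```python
EXPECTED_0X56 = 0x32  # correct value of register 0x56 stated in this case


def swap_nibbles(value):
    """Exchange the high and low 4-bit halves of an 8-bit value."""
    return ((value & 0x0F) << 4) | ((value >> 4) & 0x0F)


def diagnose(read_value, expected=EXPECTED_0X56):
    """Classify a register reading against the expected value."""
    if read_value == expected:
        return "ok"
    if swap_nibbles(read_value) == expected:
        return "ports reversed"  # assumption: one nibble per SAS port
    return "unknown mismatch"


print(diagnose(0x23))
```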

Conclusion and Solution

Conclusion:

SAS port A and port B are reversely connected to the RAID controller card on the server. As a result, an SAS cable alarm is generated.

Solution:

Power off the server and disconnect the AC power supply. Then connect SAS port A to port A on the RAID controller card and port A on the hard drive backplane, and connect SAS port B to port B on the RAID controller card and port B on the hard drive backplane.

Experience

None

Note

None

RAID Array Information on the iBMC WebUI Is Inconsistent with Physical Drive Information
Problem Description
Table 5-196 Basic information

Item | Information
Source of the Problem | RH2288 V3
Intended Product | iBMC
Release Date | 2016-09-07
Keyword | iBMC WebUI, RAID, physical hard drive, slot, inconsistency

Symptom

Hardware configuration:

LSI SAS3108 configured with 12 front hard drives (each drive forms a RAID 0) and 2 rear hard drives (RAID 1). The iBMC version is 2.10.

Symptom:

The RAID array information on the iBMC WebUI is inconsistent with the physical drive information. Physical drives Disk9 and Disk10 correspond to logical RAID arrays Logical Drive 10 and Logical Drive 11 respectively. However, on the iBMC WebUI, the two physical drives are associated with DiskA and DiskB respectively.

Figure 5-280 Inconsistent information
Key Process and Cause Analysis

Cause analysis:

The WebUI layout of iBMC 2.10 is abnormal. As a result, the relationship between physical hard drives and RAID arrays is incorrectly displayed, but the RAID arrays work properly.

Conclusion and Solution

Conclusion:

The WebUI layout of iBMC 2.10 is abnormal. As a result, the relationship between physical hard drives and RAID arrays is incorrectly displayed, but the RAID arrays work properly.

Solution:

Upgrade the iBMC to 2.12 or later.

Experience

The iBMC supports out-of-band monitoring and information query only for the LSI SAS3108 and LSI SAS3008. For other RAID controller cards, Out-of-Band Management Supported is displayed as No on the iBMC WebUI.

Figure 5-281 Storage
Note

None

Boot Device Is Not Found After the 5288 V3 Is Restarted
Problem Description
Table 5-197 Basic information

Item | Information
Source of the Problem | 5288 V3
Intended Product | All servers
Release Date | 2018-05-18
Keyword | LSI SAS3008, boot device not found

Symptom

Hardware configuration: 5288 V3

After the 5288 V3 is restarted, a message is displayed indicating that no boot device is found. The BMC reports no alarm, and displays a message indicating that the system is booted from PXE and the cables need to be checked.

Key Process and Cause Analysis

Key process:

  1. Access the management screen of the RAID controller card, and check the RAID group status.

    On the management screen of the RAID controller card, no RAID group information is found.

  2. Check the RAID SAS topology.

    Access the SAS Topology screen of the RAID controller card, and check the hard disk topology. The information shows that the RAID controller card fails to detect the hard disk backplane.

Cause analysis:

The RAID controller card fails to detect the hard disk backplane.

Conclusion and Solution

Conclusion:

The RAID controller card fails to detect the hard disk backplane. Therefore, the boot device is not found. Replace the hard disk backplane and SAS cables to resolve the problem.

Solution:

Replace the hard disk backplane and SAS cables.

Experience

If no boot device is found during the server startup, check the RAID controller card to determine whether the RAID information is damaged. If the RAID information is damaged, check whether the hard disks, hard disk backplane, and storage link are normal.

Note

None

Abnormal NVMe Hard Disk Link of the RH2288 V3
Problem Description
Table 5-198 Basic information

Item | Information
Source of the Problem | RH2288 V3
Intended Product | All servers
Release Date | 2018-05-18
Keyword | Faulty hard disk, green indicator is on, yellow indicator blinks

Symptom

Device SN: 2102311LBD10H6003085

On the RH2288 V3, the system reports an alarm indicating that the slot 8 hard disk is abnormal. The green indicator is on, and the yellow indicator blinks. The service is interrupted.

Key Process and Cause Analysis
Problem analysis:
  1. BMC log analysis:
    1. No exception is recorded in the SEL logs on January 24.

    2. Hot swap records exist in the FDM logs.

    3. The OS logs show the same hot swap records of the slot 8 hard disk as the FDM logs.

    4. However, no hard disk was removed or inserted. The most probable cause is interference on the hard disk status signal carried by the slot 8 cable, which made the I2C signal change abnormally. The fault was rectified after the cable was replaced.

  2. Conclusion:

    The I2C signal for detecting hard disk hot swap is subject to interference. As a result, the system detects false exception signals and changes the hot swap status. The hard disk is disconnected, and the processes running on the hard disk are affected.

Conclusion and Solution

Solution:

Replace the shielded cable with a new one to avoid interference.

RH2288H V3 Cable Misconnect Alarm
Problem Description
Table 5-199 Basic information

Item | Information
Source of the Problem | RH2288H V3
Intended Product | RH2288H V3 configured with 8 SAS/SATA hard disks and 4 NVMe hard disks
Release Date | 2018-01-20
Keyword | PCIe riser card, the SAS or PCIe cable to disk backplane is incorrectly connected

Symptom

Hardware configuration: RH2288H V3 configured with 12 hard disks on the backplane

OS configuration: N/A

Symptom: The BMC reports an error indicating that the SAS or PCIe cable to the disk backplane is incorrectly connected.

Key Process and Cause Analysis

Key process:

  1. Check whether the cables are correctly connected. The ASP engineers confirm that the cables are correctly connected.
  2. After the entire RAID card<->backplane<->PCIe riser card link is replaced, the error persists.
  3. The preceding diagnosis shows that the alarm is not caused by a single abnormal hardware component. The possible cause is that the hardware has been modified. Further diagnosis proves that the hardware has been modified on-site.
    Figure 5-282 Comparison between the normal configuration and on-site configuration
  4. According to the user guide, the PCIe riser card must be inserted into slot 7, but the customer inserts the PCIe riser card into slot 6. As a result, the BMC keeps reporting the alarm.

Cause analysis:

  1. According to the user guide, if the 8+4 hard disk configuration is used, the PCIe riser card must be inserted into slot 7. However, the customer inserts the PCIe riser card into slot 6.
    Figure 5-283 Installation requirements of the PCIe riser card
  2. The BMC keeps reporting the cable misconnect alarm, but the cables are correctly connected on-site. The error is caused by incorrect installation of the PCIe riser card. The CPLD does not distinguish the two types of errors and reports the same error to the BMC.
Conclusion and Solution

Conclusion:

  1. The problem is caused by hardware modification.
  2. The CPLD provides incorrect alarm information, misleading the troubleshooting engineers.
  3. The riser card is not installed according to the user guide, and the customer's problem is not resolved in time.

Solution: Advise the customer to use the Huawei recommended hardware configuration.

Effect: The problem is resolved.

Locating the Slot of a Slow Hard Disk in a Big Data Service Scenario
Problem Description
Table 5-200 Basic information

Item | Information
Source of the Problem | Servers
Intended Product | Servers
Release Date | 2018-01-30
Keyword | RAID, lsscsi, storcli

Symptom

The V3 servers are equipped with the LSI SAS2208, LSI SAS2308, LSI SAS3008, or LSI SAS3108 RAID controller card. This section describes how to use the drive letter to locate the hard disk slot on Linux in a big data service scenario.

Key Process and Cause Analysis
  • LSI SAS2308/LSI SAS3008+Linux:

    Background: Locate the slot of a slow hard disk in a big data service scenario.

    On the OS:

    Run the df command to query the drive letter corresponding to the abnormal file system.

    Query the serial number of the hard disk.
    1. Use the SMART information to query the device serial number.

      On the OS:

      Run the smartctl -a /dev/sdb command. (The smartctl tool is required in the system. Generally, it is installed by default during the Linux OS installation.)

      The serial number of the hard disk corresponding to the sdb drive letter is 9XG50X1F.

      NOTE:

      You can use the drive letter to query the hard disk serial number only in single-disk RAID 0 and hard disk pass-through scenarios. When multiple hard disks exist under one VD, multiple hard disks correspond to one drive letter. In such scenarios, do not use the drive letter to query the device serial number.

    2. Use the serial number to query the slot number.

      On the OS:

      1. Go to the \InfoCollect_Linux\modules\raid\RAIDtool\3008 directory where the tool is located, and run the chmod +x sas* command to grant execute permission.
      2. Run the ./sas3ircu 0 display command.
      3. In the command output, query the slot number by using the serial number obtained in step 1.

        NOTE:

        You can also query the slot number by searching the raid folder in the collected log package. However, for servers (such as the X6800) that are configured with SoftRAID and a RAID controller card, the information of the RAID controller card may not exist in the log files. Therefore, using the preceding commands is more accurate.

    Special situation:

    1. Failed to obtain the SMART information.

      On the OS:

      Check the messages logs. The messages logs show that the sdm disk in slot 11 has task abort records.

      On the OS, run the lsscsi command to view the hard disk information. In the displayed information, the first column shows the [H:C:D:L] numbers of the hard disks. Use the [H:C:D:L] number to query the disk slot number.

      Query method:

      For example, if the [H:C:D:L] number is [0:0:11:0], the meanings of the numbers are as follows:

      H: indicates the HBA number. For RAID controller cards, the number 0 indicates an onboard RAID controller card. If the system has only one RAID controller card, the H value is 0.

      C: indicates the channel number. The default value is 0. You can ignore this number.

      D: indicates the device number. If a RAID controller card is used, the value indicates the VD number. For [0:0:11:0], view the VD 11 information of the RAID controller card. The VD 11 information shows that the slot number is 11.

      The sda disk is a RAID 1 array using slot 0 and slot 1, and the [H:C:D:L] number is [0:2:0:0]. The sda disk is the boot partition. The [H:C:D:L] number of the sdb disk is [0:2:2:0]. The sdb disk is in slot 2.

      In single-disk RAID 0 scenarios, one slot is used by one VD. Therefore, you can use this method to query the slot number.

      In the hard disk pass-through scenario where an LSI SAS3008 RAID controller card is used, the value of the device number is the slot number. For example, if the [H:C:D:L] value is [0:0:11:0], the hard disk is in slot 11.

      L: indicates the LUN number, which is the number of the channel between the local host and the SCSI device. LUN is not used in the local storage, and the default value is 0.

  • LSI SAS2208/LSI SAS3108+Linux

    Query the serial number of the hard disk.

    1. Use the SMART information to query the device serial number.

      On the OS:

      Run the smartctl -a /dev/sdb command. (The smartctl tool is required in the system. Generally, it is installed by default during the Linux OS installation.)

      The serial number of the hard disk corresponding to the sdb drive letter is 9XG50X1F.

      NOTE:

      You can use the drive letter to query the hard disk serial number only in single-disk RAID 0 and JBOD scenarios. When multiple hard disks exist under one VD, multiple hard disks correspond to one drive letter. In such scenarios, do not use the drive letter to query the device serial number.

    2. Use the serial number to query the slot number.
      1. Use the StorCLI tool. If the tool cannot be executed, run the chmod +x storcli command to grant the execute permission.
      2. Run the storcli64 -PDList -aALL command.
      3. In the command output, query the slot number by using the serial number obtained in step 1.

        NOTE:

        You can also query the slot number by searching the raid folder in the collected log package. However, for servers (such as the X6800) that are configured with SoftRAID and a RAID controller card, the information of the RAID controller card may not exist in the log files. Therefore, using the preceding commands is more accurate.

    3. Special situation:

      Failed to obtain the SMART information.

      Use the lsscsi command to locate the hard disk slot as described in the preceding method.

    4. Use the logs to query the slot number of the disconnected hard disk.

      Example 1

      a. Collect OS logs from the customer, and check the messages and dmesg logs. For example, the drive letter of the faulty disk reported by the customer is sdu.

      Search for sdu in the dmesg logs. In the dmesg logs, the [H:C:D:L] information of the sdu disk is [0:2:21:0]. Search for sdu in the messages logs. No record about the sdu disk is found.

      b. View the VD information in the RAID controller card logs.

      The VD information is as follows:

      VD 0

      VD 2

      VD 3

      VD 21

      VD 23

      The preceding information shows that the sdu disk is in slot 21.
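The lookup described above (drive letter, then [H:C:D:L], then VD or slot number) can be sketched in Python as follows. This is only an illustration: the sample lsscsi-style lines, including the vendor and model columns, are made up, and the helper names parse_lsscsi and parse_serial are hypothetical.

```python
import re

# Illustrative lsscsi-style output. The vendor, model, and revision
# columns are placeholders, not values taken from a real server.
SAMPLE_LSSCSI = """\
[0:2:0:0]    disk    LSI      RAID-CARD    3.40  /dev/sda
[0:2:2:0]    disk    LSI      RAID-CARD    3.40  /dev/sdb
[0:2:21:0]   disk    LSI      RAID-CARD    3.40  /dev/sdu
"""

# [H:C:D:L] at the start of the line, device node at the end.
HCDL_LINE = re.compile(r"^\[(\d+):(\d+):(\d+):(\d+)\]\s+.*?(/dev/\S+)\s*$")


def parse_lsscsi(text):
    """Map each device node to its (H, C, D, L) tuple."""
    mapping = {}
    for line in text.splitlines():
        match = HCDL_LINE.match(line)
        if match:
            h, c, d, l = (int(group) for group in match.groups()[:4])
            mapping[match.group(5)] = (h, c, d, l)
    return mapping


def parse_serial(smartctl_text):
    """Return the serial number found in `smartctl -a` output, or None."""
    match = re.search(r"^Serial [Nn]umber:\s*(\S+)", smartctl_text, re.MULTILINE)
    return match.group(1) if match else None


devices = parse_lsscsi(SAMPLE_LSSCSI)
# D is the VD number on a RAID controller card; in single-disk RAID 0 or
# pass-through scenarios it can then be matched to a slot as described above.
vd_number = devices["/dev/sdu"][2]
print(vd_number)
```

The serial number returned by parse_serial would then be searched for in the StorCLI output to find the slot, as in step 2 above.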

FAQs
  1. Obtain the tools from the following websites:

    http://support.huawei.com/enterprise/en/software/22747368-SW1000282789

    The directory of the tool is \home\Project\tools\lsi3008\linux\sas3irc.

    http://support.huawei.com/enterprise/en/software/22400698-SW1000265416

    The directory of the tool is \InfoCollect_Linux\modules\raid\RAIDtool\3008.

Conclusion and Solution

None

Experience

None

Note

None

Drive Letter Offset on the RH2288 V3 Caused by a Faulty Hard Disk
Problem Description
Table 5-201 Basic information

Item

Information

Source of the Problem

RH2288 V3 configured with the LSI SAS3008 in hard disk pass-through mode

Intended Product

All Huawei servers

Release Date

2018-04-03

Keyword

Hard disk fault, drive letter change

Symptom

Background: The RH2288 V3 is configured with an LSI SAS3008 RAID controller card. A RAID 1 array of two Intel 240 GB SSD disks is used as the system disk, and 12 Hitachi 4 TB pass-through hard disks are used as the data disks.

Fault symptom: A customer reports that the /hdfsdata/5 file system has an error, and the corresponding drive letter is /dev/sdf. Before the R&D engineers analyze the problem, the customer replaces the SAS cable to resolve the problem. Later the customer contacts the R&D engineers for support, and provides the SAS cable and the logs collected after the system is restarted.

Figure 5-284 Error of the /hdfsdata/5 file system

Preliminary analysis: In the hard disk logs, the SMART information of the /dev/sdf hard disk is normal.

Figure 5-285 Hard disk logs

A large number of fs errors about the /dev/sdf disk exist in the system logs.

Figure 5-286 System logs

However, the preceding analysis is incorrect.

Key Process and Cause Analysis
  1. The underlying hard disk logs show that the K4JH7MB hard disk is faulty.
    Figure 5-287 Analysis result of the underlying hard disk logs
  2. Figure 5-288 shows that the serial number of the faulty hard disk does not exist in the hard disk logs.
    Figure 5-288 Hard disk serial numbers
  3. Analyze the logs. On January 10, a medium error and an unrecovered read error of the sd 10:0:4:0: [sdf] disk are recorded in the system logs.
    Figure 5-289 Hard disk error records
  4. In Figure 5-290, the information about the 10:0:4:0 hard disk is missing. Figure 5-289 shows that the original drive letter of the 10:0:4:0 hard disk is sdf. Therefore, the 10:0:4:0 hard disk is faulty. As a result, the drive letters of the 10:0:5:0 to 10:0:11:0 hard disks are offset in sequence after the system is restarted.
    Figure 5-290 Hard disk block logs
  5. Figure 5-291 shows the hard disk scores. The SMART score of the K4JHT7MB hard disk is 56.05, indicating that the hard disk is faulty.
    Figure 5-291 Hard disk scores
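The drive letter offset described in step 4 can be illustrated with a small Python model. It assumes, as a simplification, that the system disk takes sda and that the pass-through disks are named in ascending SCSI target order at each boot; the function name assign_drive_letters is hypothetical.

```python
import string


def assign_drive_letters(scsi_targets, first_letter="b"):
    """Name disks sdb, sdc, ... in ascending target order (simplified model)."""
    start = string.ascii_lowercase.index(first_letter)
    return {
        target: "sd" + string.ascii_lowercase[start + index]
        for index, target in enumerate(sorted(scsi_targets))
    }


all_targets = list(range(12))  # 10:0:0:0 to 10:0:11:0, the 12 data disks

# Before the fault, every disk is detected.
before = assign_drive_letters(all_targets)

# After the restart, the faulty 10:0:4:0 disk is no longer detected.
after = assign_drive_letters([t for t in all_targets if t != 4])

print(before[4], after[5])
```

In this model, the disk at target 4 is sdf before the restart, and the disk at target 5 takes over sdf afterwards, which is why the SMART information of /dev/sdf looked normal in the logs collected after the restart.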
Conclusion and Solution

Conclusion:

The hard disk in slot 4 is faulty. As a result, the /hdfsdata/5 file system fails to be read and written properly. After the system is restarted, the drive letters are offset, and /dev/sdf points to a different, healthy hard disk. Therefore, the /hdfsdata/5 file system appears to be recovered.

Solution:

Replace the faulty hard disk in slot 4.

Experience
  1. The hard disks are not scored on-site. As a result, the subsequent analysis is incorrect. Therefore, when resolving similar problems related to hard disks, urge the customer to collect the logs and score the hard disks.
  2. When analyzing logs, pinpoint the time when the exception logs are recorded to make the fault symptom clear.
  3. In scenarios where drive letter offset is prone to occur (such as in hard disk pass-through mode), advise the customer to associate the hard disk UUID with the file system mount point.
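As a sketch of suggestion 3, the following Python helper builds an /etc/fstab line that mounts a file system by UUID instead of by drive letter. The UUID value, the ext4 file system type, and the mount options are assumptions for illustration; the real UUID can be read with the blkid command.

```python
def fstab_entry(uuid, mount_point, fs_type="ext4", options="defaults"):
    """Build an /etc/fstab line that identifies the file system by UUID."""
    # Fields: device, mount point, file system type, options, dump, fsck pass.
    return f"UUID={uuid}  {mount_point}  {fs_type}  {options}  0  2"


# Hypothetical UUID, for illustration only.
line = fstab_entry("0a1b2c3d-1111-2222-3333-444455556666", "/hdfsdata/5")
print(line)
```

With an entry of this form, /hdfsdata/5 stays bound to the same file system even if the kernel assigns a different drive letter after a restart.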
Note

Normal block logs:

PCIe SAS Cable Alarm on the RH2288H V3
Problem Description
Table 5-202 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

RH2288 V3/RH2288H V3 PCIe NVMe SSD

Release Date

2018-04-10

Keyword

PCIE NVME SSD, cable, hard disk backplane, PCIe SAS cable alarm

Symptom

After the PCIe riser card is replaced, the system reports an alarm indicating that the SAS or PCIe cable to the disk backplane is incorrectly connected.

Key Process and Cause Analysis

Analyze the feedback information. The procedure is as follows:

  1. In the beginning, the system frequently reports correctable errors indicating that the riser card corresponding to slot 2 needs to be replaced.

  2. The customer reports that after the riser card is replaced, the system reports an alarm indicating that the SAS or PCIe cable to the disk backplane is incorrectly connected. The hard disk backplane information shows that the hard disk backplane supports PCIe NVMe SSD. Therefore, the cause of the problem is that the PCIe riser card is incorrectly connected to the hard disk backplane, or the cables are incorrectly connected.

    On-site engineers report that the cable connections are correct. The customer is an NA customer outside China. Therefore, the cost of on-site troubleshooting is high, and the cause of the problem needs to be identified remotely.

    1. Analyze the log registers.

      HD_REG15 | 0X54 | 7 | R | 0 | 1 | backplane_prsent_n | Optional
        Presence status signal of the hard disk backplane
        1: The hard disk backplane is in position.
        0: No hard disk backplane is in position.

      HD_REG15 | 0X54 | [3:0] | R | 0 | NA | hdd_logic_ver | Optional
        Logical version number of the hard disk backplane

      HD_REG16 | 0X55 | [7:4] | R | 1111 | NA | hdd_board_id | Optional
        BOARD ID of the hard disk backplane

      HD_REG16 | 0X55 | [3:0] | R | 0000 | NA | hdd_pcb_id | Optional
        PCB VER of the hard disk backplane
        Note: 001 indicates the VER.A version.

      HD_REG17 | 0X56 | [7:4] | R | 0 | 0 | sas_port_b | Recommended
        Filter the signal when reading sas_port_X. The result is valid only when the two read results are consistent. You can use logic methods to perform filtering, but a large amount of code is required.
        0000: The mini-SAS cable is not connected.
        0001: The mini-SAS cable is connected to a PCIe plug-in RAID controller card.
        0010: The mini-SAS cable is connected to PORTA of a RAID controller card.
        0011: The mini-SAS cable is connected to PORTB of a RAID controller card.
        0100: The mini-SAS cable is connected to PORTA on the PCH.
        0101: The mini-SAS cable is connected to PORTB on the PCH.

      HD_REG17 | 0X56 | [3:0] | R | 0 | 0 | sas_port_a | Recommended

      HD_REG18 | 0X57 | [7:4] | R | 0 | 0 | sas_port_d | Recommended

      HD_REG18 | 0X57 | [3:0] | R | 0 | 0 | sas_port_c | Recommended

      HD_REG18 | 0X58 | [7:4] | R | 0 | 0 | sas_port_f | Recommended

      HD_REG18 | 0X58 | [3:0] | R | 0 | 0 | sas_port_e | Recommended

      HD_REG19 | 0x59 | [7:0] | R | 0x03 | NA | Type 1 of built-in hard disks (left in the rear view) | Recommended
        0x03: No hard disk backplane is in position.
        0x00: The slot is reserved.
        0x01: A backplane for 2.5-inch hard disks is in position.
        0x02: A backplane for 3.5-inch hard disks is in position.

      HD_REG20 | 0x5a | [7:0] | R | 0x03 | NA | Type 2 of built-in hard disks (right in the rear view) | Recommended
        0x03: No hard disk backplane is in position.
        0x00: The slot is reserved.
        0x01: A backplane for 2.5-inch hard disks is in position.
        0x02: A backplane for 3.5-inch hard disks is in position.

    The information of the registers is stored in a two-dimensional table with a vertical coordinate and a horizontal coordinate.

    0X56: 32 (hexadecimal) = 0011 0010 (binary). The value 0010 indicates that the mini-SAS cable is connected to PORTA of the RAID controller card, and 0011 indicates that the mini-SAS cable is connected to PORTB of the RAID controller card. For details, see the preceding description table.

    0X57: 11 = 0001 0001, 0X58: 00 = 0000 0000.

    According to the preceding description table, two cables (sas_port_a and sas_port_b) are connected to a screw-in RAID controller card, and two cables (sas_port_c and sas_port_d) are connected to a PCIe plug-in RAID controller card. However, the value 00 at 0X58 shows that two further cables of a PCIe plug-in RAID controller card (sas_port_e and sas_port_f) are not connected.

    The cause of the SAS cable alarm is that the cables of a PCIe plug-in RAID controller card are not connected. The riser card corresponding to slots 6, 7, and 8 is incorrectly replaced.
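The register decoding above can be reproduced with a short Python sketch. The code table and the bit layout are taken from the register description in this case; the function name decode and the dictionary names are hypothetical.

```python
# 4-bit sas_port_X codes from the register description table.
SAS_PORT_CODES = {
    0b0000: "not connected",
    0b0001: "PCIe plug-in RAID controller card",
    0b0010: "PORTA of a RAID controller card",
    0b0011: "PORTB of a RAID controller card",
    0b0100: "PORTA on the PCH",
    0b0101: "PORTB on the PCH",
}

# Register address -> (signal in bits [7:4], signal in bits [3:0]).
REGISTER_LAYOUT = {
    0x56: ("sas_port_b", "sas_port_a"),
    0x57: ("sas_port_d", "sas_port_c"),
    0x58: ("sas_port_f", "sas_port_e"),
}


def decode(address, value):
    """Split a register byte into its two nibbles and look up each code."""
    high_name, low_name = REGISTER_LAYOUT[address]
    high = (value >> 4) & 0xF
    low = value & 0xF
    return {
        high_name: SAS_PORT_CODES.get(high, "unknown"),
        low_name: SAS_PORT_CODES.get(low, "unknown"),
    }


# Register values read from the logs in this case.
readings = {0x56: 0x32, 0x57: 0x11, 0x58: 0x00}
for address, value in readings.items():
    print(hex(address), decode(address, value))
```

The output for 0x58 reports both sas_port_e and sas_port_f as not connected, which matches the conclusion drawn from the logs.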

Conclusion and Solution

The cause of the problem is that the riser card corresponding to slots 6, 7, and 8 is incorrectly replaced.

Experience
  1. For problems caused by on-site misoperations, the log information is more accurate. Therefore, use the log information to specify the cause.
  2. The CPLD registers can be used to check the status of the hardware signals, such as LED lighting and device presence status.
Note

None

Fatal Firmware Error of a RAID Controller Card
Problem Description
Table 5-203 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

LSI SAS3108

Release Date

2018-05-29

Keyword

Fatal, onfi.c, raidpci.c

Author

Shi Tinglei 00383583

Symptom
  1. The RAID controller card is reset. The alarm information shows that a fatal firmware error occurs in onfi.c.
  2. The following records exist in the logs of the RAID controller card:

Key Process and Cause Analysis

Key process:

Problem 1:

  1. The server model is RH2288 V3, and the RAID controller card model is LSI SAS3108.

  2. Check the logs of the RAID controller card where the reset problem occurs. The following information is found:

  3. In the OS logs, the firmware and driver versions of the RAID controller card are as follows:

  4. The bug list released by Avago shows that the problem is caused by the current firmware version, and is resolved in the 4.660.00-8102 firmware.

    Bug list:

    https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361-8i#downloads

  5. The 4.660.00-8313 firmware is an optimized branch version based on the 4.660.00-8102 firmware. Use the 4.660.00-8313 firmware and the compatible driver.
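A firmware version check of this kind can be sketched in Python as follows. The sketch assumes that version strings of the form 4.660.00-8102 can be compared field by field as integers, which is an assumption about the numbering scheme rather than something stated by the supplier; the older version used in the example is hypothetical.

```python
import re


def parse_fw_version(text):
    """Turn '4.660.00-8102' into the tuple (4, 660, 0, 8102)."""
    return tuple(int(part) for part in re.split(r"[.-]", text))


# Firmware version in which the problem is resolved, per the bug list.
FIXED_IN = parse_fw_version("4.660.00-8102")


def needs_upgrade(current):
    """Return True if the current firmware is older than the fixed version."""
    return parse_fw_version(current) < FIXED_IN


print(needs_upgrade("4.650.00-6121"))  # hypothetical older version
print(needs_upgrade("4.660.00-8313"))  # recommended version from this case
```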

Problem 2:

  1. The server model is RH2288 V3, and the RAID controller card model is LSI SAS3108.

  2. The following figure shows the times when the records are generated.

  3. Query the production test records based on the serial number of the RAID controller card. The production test is completed on April 1, 2018, and all tests are passed.

    SN:

    Production testing records:

    During the production test of the RAID controller card, the product is tested and verified to ensure that the product is intact when shipped out of the factory. Some records are generated during the production test. You can ignore these records.

Conclusion and Solution

Problem 1: The RAID controller card reset problem is caused by the outdated firmware. You are advised to upgrade the firmware and driver to the recommended version.

Problem 2: The records in the logs are generated during the production test. You can ignore these records.

Upgrading the PMC8060 Firmware on the RH2288H V3
Problem Description
Table 5-204 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All Huawei servers

Release Date

2018-03-05

Keyword

PMC8060, firmware upgrade

Symptom

Scenario:

The current firmware version of the RH2288H V3 PMC8060 is 32971. The supplier suggests that the firmware be upgraded to the latest version 33083.

Figure 5-292 Current PMC8060 firmware version

Prepare the PMC8060.iso file.

  1. Decompress the firmware package and change the name of the P83T0112.ufi file to PM806X01.ufi.
    Figure 5-293 Changing the firmware file name
  2. Start UltraISO and open the ISO boot file.
  3. Create a PMC8060 folder, and copy the AFU tool and firmware file to the folder.

  4. Save the files as PMC8060.iso. The PMC8060.iso file is created.

Use the PMC8060.iso file to upgrade the firmware.

  1. Mount the PMC8060.iso file.

  2. Restart the server. The services must be migrated to other servers in advance. During the startup, the firmware version of the RAID controller card is displayed.

    Press F11 to access the Boot Manager screen.

  3. Choose Virtual DVD-ROM VM 1.1.0.

    Press Enter to access the CD-ROM directory.

    Run the cd PMC8060 command to access the directory where the AFU tool and PM806X01.ufi file are located, and run the AFU command to open the AFU tool.

  4. Press Enter.

  5. Choose OK and press Enter.

  6. The following message is displayed.

  7. Choose OK and press Enter.

  8. The following message is displayed.

    The following message is displayed.

  9. A message is displayed indicating upgrade success.

  10. Press any key. A message is displayed prompting you to restart the server.

  11. Press any key, and choose Exit.

  12. Enter Y to exit the AFU tool.

  13. If the new firmware version is displayed, the upgrade is successful.

Conclusion and Solution

Conclusion:

The PMC8060 firmware needs to be upgraded to 33083 or the latest version.

Experience

None

Note

None

RAID Controller Card Alarm
Problem Description
Table 5-205 Basic information

Item

Information

Source of the Problem

RH2288H V3

Intended Product

All servers

Release Date

2018-05-15

Keyword

RAID, driver, firmware

Symptom

A RAID error is displayed on the BMC screen, and the server restarts. In some cases the screen turns purple.

Key Process and Cause Analysis

Analysis process:

  1. BMC log analysis:

    An error is generated in the SEL logs indicating RAID controller card 1 failure.

  2. FDM log analysis:

    The FDM logs show that the error is in device 0x00:0x01.0x00 (the LSI SAS3108 RAID controller card), and the error is a Completion Time-out error.

  3. RAID controller card log analysis:

    "DEAD:Fatal firmware error: Line 2397 in ../../raid/2108vI2o.c" is generated in the RAID controller card logs. This error is generated by the driver. Therefore, an exception occurs in the driver.

    Check the driver version of the RAID controller card. The driver version is 0.255.03.01-1vmw.550.0.0.1331820, and the driver is provided by the VMware ESXi system.

    Avago, the chip supplier of the RAID controller card, confirms the following information: The version of the driver provided by the ESXi system is outdated, and an exception may occur when the driver works with the firmware on the live network. If the RAID controller card does not respond to upper-layer commands for a long time, an error, such as the VMware purple screen, may occur in the upper-layer software. The bug related to the Avago driver is described as follows:

Conclusion and Solution

Cause: The driver provided by the VMware ESXi system does not match the firmware of the RAID controller card. As a result, an exception may occur. You are advised to upgrade the driver and firmware by referring to the Huawei driver mapping table. Visit the following website to download the mapping table:

http://support.huawei.com/enterprise/en/doc/EDOC1100017577

Solution:

Upgrade the RAID controller card driver and firmware.

Visit the following website to download the firmware:

http://support.huawei.com/enterprise/en/software/23029668-SW2000014677

Visit the following website to download the driver:

https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI55-AVAGO-LSI-MR3-69121200-1OEM&productId=353

Experience

Experience:

  1. In the delivery stage: The driver and firmware are not installed and checked according to the mapping table released on the official Huawei website.
  2. In the troubleshooting stage: The logs are not analyzed in detail, and an incorrect conclusion that the RAID controller card has a hardware fault is made based on the BMC alarm.

Suggestions:

  1. Standardize the on-site delivery process.
  2. Develop case studies and apply the cases to on-site practice.
Note

None

Common NIC Problems

"Abort command issued nexus" Error of the Qlogic HBA
Problem Description
Table 5-206 Basic information

Item

Information

Source of the Problem

Huawei servers

Intended Product

Huawei servers

Release Date

2018-01-30

Keyword

Abort command issued nexus

Symptom

Two Qlogic ISP2532 FC HBAs (single-port) are installed on the RH5885 V3. When the system is running, the customer detects I/O timeouts, stops the service, and requests technical support.

Key Process and Cause Analysis
  1. Analyze the BMC logs of the server. No hardware exception is found, and the HBA link is normal.
  2. In the system logs, "Abort command issued nexus" is generated at 10:56:43 December 25 for both of the two HBAs. (Two identical log records are generated at the same moment, which strongly suggests that the hardware is not faulty.)

    After the system logs are analyzed, the cloud computing R&D engineers consider that the I/O timeout is caused by HBA exceptions.

  3. The system logs are sent to the HBA manufacturer for analysis. The analysis result indicates that the storage LUNs need to be examined.
  4. The storage logs show that the customer deletes two LUNs (one 20 TB and one 40 TB) at 10:47:40 December 25. As a result, the system fails to perform timely I/O scheduling, and the I/O timeout causes the upper-layer host link to be disconnected.
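The correlation between the two log sources in steps 2 and 4 rests on the timestamps. A minimal Python sketch of that comparison is shown below. The year is an assumption (the logs quoted here give only the month and day), and the sketch also assumes that the host and storage clocks are synchronized.

```python
from datetime import datetime

YEAR = 2017  # assumption: the quoted logs give only "December 25"

lun_deleted = datetime(YEAR, 12, 25, 10, 47, 40)   # from the storage logs
abort_logged = datetime(YEAR, 12, 25, 10, 56, 43)  # from the host system logs

gap = abort_logged - lun_deleted
print(gap.total_seconds())
```

The gap shows that the LUN deletion precedes the "Abort command issued nexus" records by about nine minutes, which is consistent with the deletion being the trigger rather than a consequence.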
Conclusion and Solution

Conclusion:

Large capacity LUNs are deleted. As a result, the I/O is blocked and an I/O exception occurs, and the system service is interrupted.

Solution:

Install a patch package provided by the storage R&D department. The I/O exception caused by LUN deletion is resolved.

Experience

For details, see the description about "qla2xxx [0000:45:00.0]-801c:7: Abort command issued" on the official Red Hat website:

https://access.redhat.com/solutions/27624

The article points out that this error is caused by the exception of a component on the FC link from the HBA to the storage. Therefore, you need to check the HBA FC link (modules including the optical modules, optical cables, ports of the uplink FC switch, and storage logs). The description is too general and unpersuasive.

However, the description shows that "Abort command issued" is irrelevant to the server hardware.

Note

None

Common Problems of the Management Software

Failed to Access the iBMC
Problem Description
Table 5-207 Basic information

Item

Information

Source of the Problem

RH8100 V3

Intended Product

Rack Server

Release Date

2015-03-17

Keyword

Access the iBMC

Symptom

iBMC cannot be accessed from a PC.

Key Process and Cause Analysis

Possible Causes:

  • The IP addresses of the PC and iBMC are not on the same network segment.
  • The network cable between the PC and the management network port is loose or damaged.
  • The PC is connected to the management network port only on HFC-1.
Conclusion and Solution

Procedure:

  1. Check whether the IP addresses of the PC and iBMC are on the same network segment.

  2. If they are not, change the PC IP address to ensure that it is on the same network segment as the iBMC IP address.
  3. Check whether the network cable between the PC and the management network port is loose.

  4. If it is loose, securely connect the network cable.
  5. Check whether the network cable is damaged.

  6. If it is damaged, replace the network cable.
  7. Check whether the network cable is connected to the management network port on HFC-1.

  8. If it is, connect the PC to the management network port on HFC-2 using a network cable, and log in to iBMC over the management network port on HFC-2.
  9. If the problem persists, contact Huawei technical support for help. For details, see 02 About Huawei Server R&D and Maintenance Team.
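The network segment check in step 1 can be sketched with the Python standard ipaddress module. The addresses and the prefix length below are placeholders from documentation ranges, not the actual iBMC settings, and the function name same_segment is hypothetical.

```python
import ipaddress


def same_segment(pc_ip, bmc_ip, prefix_len):
    """Return True if both addresses fall in the same network for this prefix."""
    pc_net = ipaddress.ip_interface(f"{pc_ip}/{prefix_len}").network
    bmc_net = ipaddress.ip_interface(f"{bmc_ip}/{prefix_len}").network
    return pc_net == bmc_net


# Placeholder addresses, for illustration only.
print(same_segment("192.0.2.10", "192.0.2.200", 24))
print(same_segment("192.0.2.10", "198.51.100.5", 24))
```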
Experience

None

Notes

None

iBMC WebUI Cannot Be Refreshed After an Upgrade
Problem Description
Table 5-208 Basic information

Item

Information

Source of the Problem

RH1288 V3

Intended Product

V3 servers

Release Date

2015-08-04

Keyword

iBMC

Symptom

Hardware configuration: RH1288 V3 server

Symptom: After the iBMC is upgraded, the WebUI cannot be refreshed.

Key Process and Cause Analysis

Cause analysis:

When the browser loads iBMC information, information cached in the browser is used. Due to the upgrade, some information becomes inconsistent. The WebUI freezes after the browser loads iBMC information for a long period. To resolve the problem, you need to clear the cache in the browser and reload iBMC information.

Conclusion and Solution

Conclusion:

Clear the cache in the browser to resolve the problem.

Experience

None

Note

None

Upgrade File Path Is Always Displayed in the C:\fakepath\File Name Format
Problem Description
Table 5-209 Basic information

Item

Information

Source of the Problem

iBMC

Intended Product

Servers using the iBMC

Release Date

2015-08-12

Keyword

iBMC, upgrade file

Symptom

Symptom: See Figure 5-294.

Figure 5-294 Upgrade File
Key Process and Cause Analysis

Cause analysis:

This problem occurs due to the security settings in Internet Explorer and Google Chrome. In Internet Explorer and Google Chrome, local directories are not displayed to ensure security. This problem does not affect function use.

Conclusion and Solution

Solution:

In Internet Explorer, perform the following operations to display local directories:

Choose Tools > Internet options > Security > Custom level, find Include local directory path when uploading files to a server under Miscellaneous, and select Enable.

Experience

None

Note

None

iBMC WebUI Cannot Be Accessed After the Correct User Name and Password Are Entered or the WebUI Does Not Respond After Login
Problem Description
Table 5-210 Basic information

Item

Information

Source of the Problem

iBMC

Intended Product

All servers

Release Date

2016-01-18

Keyword

User name, password

Symptom

Symptom 1: The iBMC WebUI cannot be accessed after the correct user name and password are entered.

Symptom 2: The iBMC WebUI does not respond after login.

Key Process and Cause Analysis

The content cached in the browser is inconsistent with the current WebUI. As a result, the WebUI cannot be refreshed.

There are the following scenarios:

  1. On the browser, a server of the same IP address has been accessed before, and the WebUI version of the accessed server is different from that of the current server.
  2. The iBMC has been upgraded on the server, and the old and new versions are far apart. Therefore, the WebUIs of the two versions differ greatly.
Conclusion and Solution

Method 1: Press CTRL+F5 to forcibly refresh the WebUI.

Method 2: Clear the cache in the browser, restart the browser, and log in to the WebUI again.

  1. Press CTRL+SHIFT+DELETE to clear the cache in the browser.
  2. After the cache is cleared, restart the browser, and log in to the WebUI again.
Experience

None

Note

None

After the KVM Is Started on a macOS, the Mouse and Keyboard Cannot Be Used
Problem Description
Table 5-211 Basic information

Item

Information

Source of the Problem

iBMC

Intended Product

All servers

Release Date

2016-01-18

Keyword

macOS, KVM

Symptom

After the KVM is started on a macOS, the mouse and keyboard cannot be used.

Key Process and Cause Analysis

On a macOS, the KVM client obtains the keyboard hardware codes in a different way. As a result, correct key values cannot be entered using the keyboard.

Conclusion and Solution

Upgrade the iBMC to version 1.58 or later.

Experience

None

Note

None

An Error Occurs When a PCIe Card Is Installed on an RH8100 V3 Server
Problem Description
Table 5-212 Basic information

Item

Information

Source of the Problem

RH8100 V3

Intended Product

RH8100 V3 server and RH5885H V3 server

Release Date

2016-03-01

Keyword

I/O resource, PCIe card

Symptom

Hardware configuration:

RH8100 V3 server, of which the configuration is shown in Figure 5-295

Figure 5-295 Hardware configuration

Symptom:

When a PCIe card (030WSQ10F6003451) is installed in slot 4, the message "CPU6 config error" is displayed, as shown in Figure 5-296.

Figure 5-296 Symptom

Key Process and Cause Analysis

Key process:

  • Remove the added 03030WSQ PCIe card. It is found that the system is started properly.
  • Collect system startup logs and have the BIOS engineers analyze the logs.
  • View the PCIe card configuration guide of the RH8100 V3 server to determine whether the problem occurs due to I/O resource conflicts.

Cause analysis:

  • Determine whether the 8P or 4P mode is used.
  • Determine whether the existing PCIe cards require I/O resources and whether the newly added PCIe card requires I/O resources.
  • After a configuration table comparison, it is found that the newly added PCIe card requires I/O resources and only the hot-swapping slots among the remaining PCIe slots support the requirements of the newly added PCIe card. Therefore, an error occurs if the newly added PCIe card is installed in a slot other than a hot-swapping slot.
Conclusion and Solution

Conclusion:

The I/O resource configuration problems are intercepted during the verification before a PCIe card is delivered. When a CPU configuration error occurs after the customer adds a PCIe card, check whether the problem is caused by I/O resource conflicts first.

Solution:

When a similar problem occurs on an RH5885H V3 server, check whether the problem is caused by I/O resource conflicts first.

Experience

None

Note

None

Failed to Upload a File During an Online Upgrade
Problem Description
Table 5-213 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

Rack servers

Release Date

2015-03-17

Keyword

Online upgrade, file upload failure

Symptom

During an online upgrade of iMana 200, the system displays a message indicating that an upgrade file fails to be uploaded.

Key Process and Cause Analysis

Possible Causes

  • The IP addresses of an uploading tool and iMana 200 are not on the same network segment.
  • The network cable is loose or damaged.
Conclusion and Solution
  1. Check whether the IP addresses of the uploading tool and iMana 200 are on the same network segment.

  2. If they are not, change the IP address of the uploading tool to ensure that it is on the same network segment as the IP address of iMana 200.
  3. Check whether the network cable is loose.

  4. If it is loose, securely connect the network cable.
  5. Check whether the network cable is damaged.

  6. If it is damaged, replace the network cable.
  7. If the problem persists, contact Huawei technical support by referring to Obtaining Technical Support.
Experience

None

Note

None

Failed to Set the Port Number to 5900 on iBMC of V3 Servers
Problem Description
Table 5-214 Basic information

Item

Information

Source of the Problem

RH2288H V3, CH121 V3

Intended Product

V3 servers

Release Date

2018-05-04

Keyword

BMC, KVM, 5900

Symptom

An NA customer reports that the port number cannot be set to 5900 on the iBMC WebUI of the CH121 V3. Other port numbers, such as 5800, are normal.

Key Process and Cause Analysis
  1. Port number setting rules:

    View the port numbers used by the system services in the iBMC user guide.

    Value: an integer ranging from 1 to 65535

    Default value:

    KVM: 2198

    VNC: 5900

    The default port number of the V5 VNC feature is 5900.

  2. Problem cause:

    BMC R&D engineers confirm that 5900 is used by the V5 VNC feature. The V3 and V5 servers share the same iBMC software platform.

    The VNC feature is incorporated into the new iBMC of the V3 servers, and 5900 is allocated to the VNC feature by default. However, V3 servers do not support the VNC feature. Therefore, the VNC port number 5900 is not displayed on the iBMC WebUI.

  3. Similar scenarios:

    This problem occurs on V3 servers with BMC 2.53 or later.

Conclusion and Solution

Conclusion:

A bug exists in the code of the V3 server iBMC. The 5900 port number is used by the V5 VNC feature.

Solution:

Temporary solution: Do not use 5900 as a port number. If the port number is required to be 5900, roll back the iBMC version.

Long-term solution: Upgrade the iBMC to 3.04 or later.
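The behavior described in this case can be summarized in a short Python sketch. It models only what is stated above: the valid port range of 1 to 65535, and port 5900 being unavailable on V3 servers from BMC 2.53 until the fix in 3.04. The function names are hypothetical, the check does not cover conflicts with ports used by other services, and the version 2.50 in the example is a hypothetical earlier version.

```python
def parse_version(text):
    """Turn '2.53' into (2, 53) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


AFFECTED_FROM = parse_version("2.53")  # problem occurs on 2.53 or later
FIXED_IN = parse_version("3.04")       # long-term solution: 3.04 or later


def can_set_port(port, ibmc_version):
    """Return True if this port can be configured on a V3 server iBMC."""
    if not 1 <= port <= 65535:
        return False
    version = parse_version(ibmc_version)
    if port == 5900 and AFFECTED_FROM <= version < FIXED_IN:
        return False  # 5900 is reserved for the VNC feature in these versions
    return True


print(can_set_port(5900, "2.53"))
print(can_set_port(5800, "2.53"))
print(can_set_port(5900, "3.04"))
```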

Note

None

Common Problems of Fan Modules and Power Supplies

RH1288 V3 Fans Keep Running Rapidly
Problem Description
Table 5-215 Basic information

Item

Information

Source of the Problem

RH1288 V3

Intended Product

V3 servers

Release Date

2015-11-24

Keyword

Fan, high speed, loud noise

Symptom

Symptom:

The fans of an RH1288 V3 server produce loud noise. Fan alarms are generated on the iBMC WebUI, as shown in Figure 5-297, and the fan speed is higher than normal speed, as shown in Figure 5-298.

Figure 5-297 Fan alarms

Figure 5-298 Fan speed information

Key Process and Cause Analysis

Key process:

After a server is powered on, the iBMC scans fans and initializes the fan speed based on the speed range and fan type (single-fan or dual-fan).

At 17:20:23 on October 23, the server was powered on. At 17:20:32, the server was powered off by pressing the power button, and the fan type failed to be identified (identifying the fan type takes about 30 seconds and varies depending on the server model). At 14:05:08 on November 2, the iBMC identified several fan types, generated alarms for all fans, and adjusted the fan speed to 80% of the maximum speed, as shown in Figure 5-299, Figure 5-300, and Figure 5-301.

Figure 5-299 iBMC logs

Figure 5-300 Failing to identify the fan type due to system power-off

Figure 5-301 Several fan types identified

Conclusion and Solution

Conclusion:

Powering off the system while the iBMC is identifying the fan type results in identification errors. The fan speed is adjusted to 80% of the maximum speed, which has no impact on the service system.

Solution:

  1. Preventive measure: Reset the iBMC.
  2. Solution: Upgrade the iBMC to 2.01 or later.
Experience

None

Note

None

RH5885H V3 Fans Keep Running Rapidly Due to a BMC Board Fault
Problem Description
Table 5-216 Basic information

Item

Information

Source of the Problem

RH5885H V3

Intended Product

RH5885H V3

Release Date

2015-12-30

Keyword

Fan module, high speed, BMC

Symptom

Hardware configuration:

RH5885H V3

Symptom:

  • Figure 5-302 shows the alarms on the iMana 200 WebUI.
    Figure 5-302 Alarms on the iMana 200 WebUI
  • The firmware version information does not include the FPGA version.
    Figure 5-303 Firmware version information
Key Process and Cause Analysis

Key process:

The BMC adjusts the fan speed based on the monitoring results of component temperature sensors.

In this case, the fan speed is abnormal and alarms are generated for almost all component temperature sensors on the BMC WebUI. Simultaneous failure of so many independent components on one server is highly unlikely. It is more likely that the BMC cannot obtain temperature sensor data correctly. The missing FPGA version (the FPGA is on the BMC board) also indicates that the BMC board is faulty. Check the BMC software and the BMC board.

  • After the BMC is reset or switched to another image, the problem persists, indicating that the BMC software is not faulty.
  • After the BMC board is replaced, the problem is resolved.
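The reasoning above (many independent sensors alarming at once points to a shared cause) can be expressed as a simple heuristic. The following is a minimal sketch; the sensor names and the 80% threshold are illustrative assumptions, not values from the iMana 200 or BMC firmware.

```python
from typing import Mapping


def suspect_common_cause(sensor_alarms: Mapping[str, bool], threshold: float = 0.8) -> bool:
    """Return True if the share of alarming sensors suggests a shared fault.

    sensor_alarms maps a sensor name to True when that sensor has an alarm.
    threshold is a hypothetical cut-off chosen for illustration only.
    """
    if not sensor_alarms:
        return False
    alarming = sum(1 for in_alarm in sensor_alarms.values() if in_alarm)
    return alarming / len(sensor_alarms) >= threshold


# Hypothetical sensor names; five of six sensors report alarms.
alarms = {
    "CPU1 Temp": True,
    "CPU2 Temp": True,
    "Inlet Temp": True,
    "Outlet Temp": True,
    "Disk BP Temp": True,
    "PSU1 Temp": False,
}

print(suspect_common_cause(alarms))  # True: check the BMC data path before replacing components
```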
Conclusion and Solution

Solution:

Replace the BMC board.

Experience

None

Note

None

Abnormal Fan Speed on the RH5885 V3 Equipped with the New Mainboard
Problem Description
Table 5-217 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | RH5885 V3 |
| Intended Product | RH5885 V3 |
| Release Date | 2018-01-22 |
| Keyword | VRD, sensor, CPLD |

Symptom

On the RH5885 V3 equipped with the VRD power module mainboard, if the CPLD version is 0.07, abnormal fan speed may occur.

Key Process and Cause Analysis

Analyze the feedback information. The procedure is as follows:

  1. Check the delivery date. The affected servers were delivered around November 2017.

  2. Check the mainboard PCB version. The affected version is .C.

  3. Check whether the CPLD version is 0.07.

  4. If these conditions are met, the system occasionally fails to read the VRD sensor. As a result, the fan modules occasionally run at high speed.
Conclusion and Solution

Conclusion:

For abnormal fan speed on the RH5885 V3 that meets the preceding conditions, update iBMC and CPLD to the latest versions.
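The conditions in this case can be collected into a small check. This is only a sketch: the `ServerInfo` fields and the function name are hypothetical, and the delivery date is left out because the case gives it only approximately (around November 2017).

```python
from dataclasses import dataclass


@dataclass
class ServerInfo:
    """Hypothetical record of the fields that matter for this case."""
    model: str
    has_vrd_mainboard: bool
    pcb_version: str
    cpld_version: str


def matches_vrd_fan_speed_case(info: ServerInfo) -> bool:
    """Return True if the server meets the conditions described in this case."""
    return (
        info.model == "RH5885 V3"
        and info.has_vrd_mainboard
        and info.pcb_version == ".C"
        and info.cpld_version == "0.07"
    )


server = ServerInfo("RH5885 V3", True, ".C", "0.07")
if matches_vrd_fan_speed_case(server):
    print("Update the iBMC and CPLD to the latest versions.")
```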

Experience

None

Note

None

Abnormal Fan Speed on 4-Socket and 8-Socket Servers
Problem Description
Table 5-218 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | RH5885 V3, RH5885H V3 |
| Intended Product | RH5885 V3, RH5885H V3, and RH8100 V3 |
| Release Date | 2018-02-01 |
| Keyword | Hard disk backplane, signal cable, fan module |

Symptom

The fan speed of the RH5885 V3 is abnormal, and alarms indicating faulty fan modules are generated in the SEL logs. Abnormal fan speed may also occur on the RH8100 V3.

Key Process and Cause Analysis

Analyze the feedback information. The procedure is as follows:

For the 4-socket RH5885 V3/RH5885H V3:

  1. Swap components to identify the cause of the abnormal fan speed:
    1. If the fault follows the fan module to whichever slot it is installed in, replace the fan module.
    2. If the fault stays with the slot regardless of which fan module is installed, replace the hard disk backplane.
    3. If the fault spreads to slots that were previously normal, replace the hard disk backplane and both fan modules.
  2. Fan speed adjustment and monitoring mechanism on 4-socket servers:
    1. The fan modules are connected to the hard disk backplane.

      Hardware structure:

      Cable connections (connectors): 12 V power supply (enabled after the fan is detected), fan module presence detection (detected by the backplane CPLD), speed control signal (controlled by the backplane CPLD), and speed detection signal (detected by the backplane CPLD).

      The backplane CPLD uses signal cables to send information to the mainboard. The information includes hard disk presence, fan module presence, fan speed, and fan speed adjustment.

  3. The fan speed signal link is as follows: iBMC<->mainboard (CPLD)<->signal cables<->hard disk backplane<->fan modules. Therefore, when the fan speed is abnormal, the system may report different types of alarms such as fan module absence or abnormal fan speed. You need to analyze the problem based on the alarm information and topology of the signal link.
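The swap test and the signal link for 4-socket servers can be summarized as a lookup. The sketch below is one reading of the rules in this case, not a Huawei tool; the enum and function names are hypothetical.

```python
from enum import Enum
from typing import List

# Signal link from the case: each element can be the source of a fan alarm.
FAN_SIGNAL_LINK = [
    "iBMC",
    "mainboard (CPLD)",
    "signal cables",
    "hard disk backplane",
    "fan modules",
]


class SwapResult(Enum):
    """What happens to the fault when fan modules are swapped between slots."""
    FOLLOWS_FAN_MODULE = "fault moves with the fan module"
    STAYS_WITH_SLOT = "fault stays with the slot"
    SPREADS_TO_NORMAL_SLOTS = "fault spreads to previously normal slots"


def parts_to_replace(result: SwapResult) -> List[str]:
    """Map the swap-test observation to the parts this case says to replace."""
    if result is SwapResult.FOLLOWS_FAN_MODULE:
        return ["fan module"]
    if result is SwapResult.STAYS_WITH_SLOT:
        return ["hard disk backplane"]
    return ["hard disk backplane", "fan module 1", "fan module 2"]


print(parts_to_replace(SwapResult.STAYS_WITH_SLOT))  # ['hard disk backplane']
```

If none of these replacements helps, the remaining elements of `FAN_SIGNAL_LINK` (signal cables, mainboard CPLD, iBMC) are the next candidates to examine.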

For the 8-socket RH8100 V3:

Swap the fan modules to identify the cause. If the fault follows a fan module, replace that fan module to resolve the problem.

Conclusion and Solution

None

Experience

For abnormal fan speed on 4-socket and 8-socket servers, handle the problem by referring to the troubleshooting case.

Note

None

I2C Link Troubleshooting Method and Sensor Information
Problem Description
Table 5-219 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | Servers |
| Intended Product | Rack servers |
| Release Date | 2018-01-09 |
| Keyword | I2C, sensor problem |

Symptom

The I2C link topology varies on different server models, and multiple components (such as sensors) exist on the I2C link. Therefore, troubleshooting for the I2C link is difficult. This article provides the I2C topologies of different server models and the troubleshooting method.

Key Process and Cause Analysis

Symptom:

The RH2288H V3 server reports an alarm indicating that the BMC fails to read the electronic labels of the air inlet, air outlet, and mainboard.

Troubleshooting procedure:

  1. The fault persists after the mainboard, mounting ears, and cables are replaced.
  2. The problem persists after the LOM, RAID controller card, and mounting ears are removed. According to the I2C link topology, the problem is caused by the hard disk backplane. After the cables of the hard disk backplane are removed, the problem is resolved.

Analyze the feedback information. The procedure is as follows:

  1. Symptom analysis:

    The alarms include the air inlet, air outlet, and FRU alarms. The I2C topology shows that the sensors are on the same I2C link. Therefore, a faulty slave device may cause the entire I2C link to be abnormal.

  2. Problem diagnosis:

    Use the method of exclusion to locate the cause. Based on the I2C topology, disconnect the slave hardware component to locate the faulty device.

  3. Information about the I2C link:

For details about the I2C protocol, see the attachment.

On the I2C link, the control device (host) and the I/O devices (slaves) share the same bus. If any I/O device is abnormal, it can hold the SCL or SDA line of the I2C bus so that the bus hangs. As a result, the entire link is abnormal.
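The method of exclusion described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the bus is modeled as healthy only when no connected device is faulty, exactly one device is faulty, and the device names are taken from this case for illustration. The `probe` callback stands in for physically disconnecting parts and rechecking the alarms.

```python
from typing import Callable, List, Set


def bus_is_healthy(connected: Set[str], faulty: Set[str]) -> bool:
    """Model of a shared I2C bus: one faulty slave makes the whole link abnormal."""
    return not (connected & faulty)


def find_faulty_by_exclusion(devices: List[str], probe: Callable[[Set[str]], bool]) -> List[str]:
    """Disconnect one device at a time and report those whose removal restores the bus."""
    suspects = []
    for device in devices:
        remaining = set(devices) - {device}
        if probe(remaining):
            suspects.append(device)
    return suspects


devices = ["LOM", "RAID controller card", "mounting ears", "hard disk backplane"]
faulty = {"hard disk backplane"}  # simulated fault, matching the outcome of this case

suspects = find_faulty_by_exclusion(devices, lambda connected: bus_is_healthy(connected, faulty))
print(suspects)  # ['hard disk backplane']
```

With more than one faulty device, removing a single device never restores the bus, so the list comes back empty and devices have to be removed in groups instead.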

Conclusion and Solution

Conclusion:

The abnormal hard disk backplane causes the entire I2C link to be abnormal. After the hard disk backplane is replaced, the problem is resolved.

Experience

The link topologies of the FRUs and temperature sensors on different products have been added to this article.

The positions of air inlet temperature sensors on common servers are as follows:

| Product Name | Location | Access from BMC |
| --- | --- | --- |
| BH620 | On the management module, not on the compute node | No |
| RH2285 | Behind the fan modules on the mainboard | Yes |
| RH2285 V2 | On the left mounting ear | Yes |
| RH2288 V2 | On the left mounting ear | Yes |
| RH2285H V2 | On the left mounting ear | Yes |
| RH2288H V2 | On the left mounting ear | Yes |
| RH1288 V2 | On the front panel | Yes |
| XH320 | On the mainboard and close to the front panel | Yes |
| XH320 V2 | On the mainboard and close to the front panel | Yes |
| XH321 V2 | On the mainboard and close to the front panel | Yes |
| DH321 V2 | On the mainboard and close to the front panel | Yes |
| RH1288 V3/V5 | On the indicator board | Yes |
| RH2288 V3/V5 | On the left mounting ear | Yes |
| RH2288H V3/V5 | On the left mounting ear | Yes |
| RH5885 V3 | On the right mounting ear | Yes |
| RH5885H V3 | On the right mounting ear | Yes |
| RH8100 V3 | On the left mounting ear | Yes |

For details, see the following documents:

Huawei Rack Server iBMC Alarm Handling http://support.huawei.com/enterprise/en/doc/EDOC1000054724/?idPath=7919749%7C9856522%7C21782478%7C21782482%7C21000390

Huawei High-Density Server iBMC Alarm Handling

http://support.huawei.com/enterprise/en/doc/EDOC1000157054/?idPath=7919749%7C9856522%7C21782478%7C21782482%7C21152126

Note

If the system reports multiple sensor alarms, analyze the alarms by referring to the cases on the following website:

http://3ms.huawei.com/hi/group/1004825/thread_6884877.html?mapId=8574321

I2C topology of the RH2288H V3 and RH2288 V3:

I2C topology of the RH1288H V3:

I2C topology of the RH5885 V3 E7 V2/E7 V3 (BC61BLCA, BC61BLCC):

I2C topology of the RH5885 V3 E7 V3/V4 (BC61BLCB):

I2C topology of the RH5885H V3 E7 V2/V3 (BC61BFSA, BC61BFSB):

I2C topology of the RH5885H V3 E7 V4 (BC61BFSC):

I2C topology of the RH2485 V2:

None

I2C topology of the RH8100 V3:

See the attachment.

I2C topology of the 2488 V5:

I2C topology of the 2488H V5:

Common BIOS Problems

BIOS Upgrade Failed But the New BIOS Version Was Displayed After Startup
Problem Description
Table 5-220 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | RH8100 V3 |
| Intended Product | Rack Server |
| Release Date | 2015-03-17 |
| Keyword | BIOS upgrade failed |

Symptom
  • Hardware configuration: RH8100 V3 server
  • iBMC: 1.00
  • BIOS: (U6145)V019

If a power failure occurs in the RH8100 V3 chassis during an online BIOS upgrade, the new BIOS version may still be displayed after the RH8100 V3 is powered on again, even though the upgrade was reported as failed.

Key Process and Cause Analysis

Possible Causes:

The RH8100 V3 supports dual BIOS flashes in redundancy mode. You need to upgrade flash 0 and then flash 1. If either flash fails to be upgraded, an upgrade failure message is displayed.

If the upgrade of flash 1 fails or is forcibly interrupted by a power failure, flash 0 has already been upgraded successfully. After the server restarts, the new BIOS version in flash 0 is in use.
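The sequential upgrade of the two flashes can be modeled in a few lines. This is a simplified sketch, not the actual upgrade tool: the function name is hypothetical, "NEW" stands in for the target BIOS version (which the case does not name), and the model assumes the server runs the BIOS from flash 0 after restart, as described above.

```python
from typing import List, Optional


def upgrade_bios(flashes: List[str], new_version: str,
                 interrupted_at: Optional[int] = None) -> bool:
    """Upgrade flash 0, then flash 1. Return False if any flash is not upgraded.

    interrupted_at is the index of the flash whose upgrade is interrupted
    (for example, by a power failure), or None for an uninterrupted run.
    """
    for index in range(len(flashes)):
        if index == interrupted_at:
            return False  # the whole upgrade is reported as failed
        flashes[index] = new_version
    return True


flashes = ["V019", "V019"]          # flash 0 and flash 1 before the upgrade
succeeded = upgrade_bios(flashes, "NEW", interrupted_at=1)

print(succeeded, flashes)           # False ['NEW', 'V019']
print("Running version:", flashes[0])
```

The run reports a failure, yet flash 0 already holds the new version, which is why the new version is displayed after startup.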

Conclusion and Solution

Procedure:

  1. The BIOS of the new version runs properly. However, you are advised to perform the upgrade again and ensure that both flash 0 and flash 1 are successfully upgraded to improve BIOS reliability.

No BMC Alarm When the RH2288 V3 Yada 460 W Power Supply Is Over-Voltage and PFC Protection Is Enabled

Problem Description
Table 5-221 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | RH2288 V3 |
| Intended Product | 2P rack servers with Yada 460 W power supply |
| Release Date | 2018-3-23 |
| Keyword | Yada 460 W power supply |

Symptom

Twelve RH2288 V3 servers in a cabinet in the customer's equipment room use 1+1 redundant power supplies. On March 14, the power supply system was overhauled and the PS2 was cut off. After the maintenance was complete, the power supply was restored. On March 15, the PS1 was cut off, and the devices powered off. During this period, no power exception alarm was generated on the BMC.

Key Process and Cause Analysis

Analyze the reasons for device power-off:

Figure 5-304 shows the power distribution on site.

Figure 5-304 Power supply system

The RH2288 V3 servers use 1+1 redundant power supplies. On March 14, input B was cut off for maintenance, and the servers continued to work properly. After the maintenance was complete, power was restored. Analysis of the BMC logs shows that the servers were powered off immediately after being powered on, as shown in Figure 5-305. Further checks show that this was caused by a three-phase imbalance of the power supply system after the equipment room was overhauled.

Figure 5-305 SEL

After the power supply system was adjusted, the servers were powered on again. The power indicators were not checked, and no power alarm was generated on the BMC. On March 15, input A was cut off for maintenance, and 10 servers powered off. Analysis confirmed that the power indicators on these 10 servers were off: input B was supplying power, but the PS2 had no output. This means that the PS2 was either in a protection state or damaged.

After the power cable was removed and reinserted, the power supply recovered. This verified that the PS2 had no output because it had entered a protection state that disabled its output.

Cause analysis on power supply protection:

The personnel in the equipment room confirmed that a three-phase imbalance occurred during the maintenance of input B. In that case, the normal 220 V single-phase voltage is distorted. When one of the three phases is open, the single-phase voltage can rise to 380 V. Experiments show that when the AC voltage suddenly changes to 325 V, the power supply enters the bulk overvoltage protection state and the boost circuit stops working. After the input voltage recovers, the power indicator remains off, which reproduces the fault.

Summary: On March 14, the power supply system was abnormal. As a result, the PS2 entered overvoltage protection and had no output. On March 15, the devices powered off after the PS1 was cut off.
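The voltages quoted above are related by the standard three-phase relationship: the line-to-line voltage is the phase voltage multiplied by the square root of 3. The following minimal sketch checks the numbers; the 325 V trigger level is taken from the experiment described in this case, and the function name is hypothetical.

```python
import math

PHASE_VOLTAGE = 220.0          # normal single-phase voltage (V)
OVP_TRIGGER_VOLTAGE = 325.0    # level at which the experiment triggered bulk overvoltage protection (V)


def triggers_bulk_ovp(input_voltage: float) -> bool:
    """Return True if the input voltage reaches the level that triggered protection in the experiment."""
    return input_voltage >= OVP_TRIGGER_VOLTAGE


line_to_line = PHASE_VOLTAGE * math.sqrt(3)

print(round(line_to_line))              # 381, nominally quoted as 380 V
print(triggers_bulk_ovp(line_to_line))  # True
print(triggers_bulk_ovp(PHASE_VOLTAGE)) # False
```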

Analyze the reason why there is no power alarm reported on the BMC:

The power supply alarms on the BMC are reported by the PSU. The PSU records its own alarms in its key event register. Currently, this register records the following items: 12 V undervoltage (Standby UV), fan fault (FAN Fail), 12 V output overcurrent (Main OC), overtemperature (OTP), and 12 V output overvoltage (Main OV). The problem occurs because bulk protection triggered by input overvoltage does not set the key event register. Therefore, no alarm can be displayed on the BMC.

Experiments confirm that when input overvoltage occurs and PFC undervoltage is triggered, the failure bit is 0, the key event register is not set, and the BMC does not display an alarm. See Figure 5-306.

Figure 5-306 Register alarm verification

When Standby UV, FAN Fail, Main OC, OTP, or Main OV protection is triggered, the failure bit in the register is set to 1 and the BMC displays a real-time alarm.

Summary: Because the input overvoltage protection scenario is not included in the key event register, the BMC cannot display an alarm for it. In other no-output scenarios, alarms are properly displayed on the BMC.
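The alarm gap can be illustrated with a small model of the key event register. This is only a sketch: the bit positions, the protection names used as dictionary keys, and the function names are assumptions for illustration, not the PSU's real register layout.

```python
from enum import IntFlag


class KeyEvent(IntFlag):
    """Events the register records according to this case; bit positions are hypothetical."""
    NONE = 0
    STANDBY_UV = 1 << 0
    FAN_FAIL = 1 << 1
    MAIN_OC = 1 << 2
    OTP = 1 << 3
    MAIN_OV = 1 << 4


# Which register bit each protection sets. Input overvoltage (bulk) protection sets none.
PROTECTION_TO_EVENT = {
    "standby_uv": KeyEvent.STANDBY_UV,
    "fan_fail": KeyEvent.FAN_FAIL,
    "main_oc": KeyEvent.MAIN_OC,
    "otp": KeyEvent.OTP,
    "main_ov": KeyEvent.MAIN_OV,
    "input_ov_bulk": KeyEvent.NONE,
}


def key_event_register(protection: str) -> KeyEvent:
    """Return the register value after the given protection is triggered."""
    return PROTECTION_TO_EVENT.get(protection, KeyEvent.NONE)


def bmc_shows_alarm(protection: str) -> bool:
    """The BMC can only display an alarm if the PSU set a bit in the register."""
    return key_event_register(protection) != KeyEvent.NONE


print(bmc_shows_alarm("main_ov"))        # True
print(bmc_shows_alarm("input_ov_bulk"))  # False: output is disabled but no alarm appears
```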

Conclusion and Solution

Conclusion:

According to the feedback information and log analysis, the abnormal power-off of the servers was caused by abnormal AC input (because of the three-phase imbalance, the normal 220 V voltage can theoretically rise to 380 V), which exceeds the power supply specifications. The PFC overvoltage protection was therefore triggered and the PSU had no output. The PSU does not record this exception in the key event register, so no alarm is reported to the BMC after the protection is triggered.

Solution:

  • If a PSU has no output, remove and then reinsert its power cable.
  • To address the missing BMC alarm when PFC overvoltage protection disables the output:

    Optimize the BMC software so that it automatically determines whether overvoltage protection has been triggered and reports an alarm promptly.

Experience

None

Note

None

Failure of Some Devices to Join the Cluster During EVC Configuration on an RH2288 V3 VMware VM

Problem Description
Table 5-222 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | RH2288 V3 |
| Intended Product | VM configuration EVC |
| Release Date | 2018-1-12 |
| Keyword | VM, EVC, VMware |

Symptom

Configuration:

  • CPU: Broadwell platform
  • OS: VMware 6.0

During EVC configuration on a VMware VM, some devices fail to join the cluster.

Key Process and Cause Analysis

Check the mapping between the VMware version and the EVC function supported by the Intel CPU. See Figure 5-307.

Figure 5-307 Intel EVC supported by the vCenter Server release versions

The baseline shows that Intel Broadwell CPUs support EVC only with VMware 6.5. The customer's version (6.0) does not meet this requirement, so the devices fail to join the cluster. If the version is correct, check whether the Monitor/Mwait feature is enabled in the BIOS. See Figure 5-308.

Figure 5-308 Enabling the Monitor/Mwait feature in the BIOS
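The version check in this case can be sketched as a simple comparison. The mapping below contains only the one entry stated in this case (Broadwell requires VMware 6.5); it is not a copy of the VMware compatibility baseline, and the function names are hypothetical.

```python
from typing import Tuple

# Only the requirement stated in this case; other CPU generations are intentionally omitted.
MIN_VERSION_FOR_EVC = {"Broadwell": "6.5"}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '6.5' into a comparable tuple (6, 5)."""
    return tuple(int(part) for part in version.split("."))


def evc_supported(cpu_generation: str, vmware_version: str) -> bool:
    """Return True if the VMware version meets the minimum for the CPU generation."""
    minimum = MIN_VERSION_FOR_EVC[cpu_generation]
    return parse_version(vmware_version) >= parse_version(minimum)


print(evc_supported("Broadwell", "6.0"))  # False: the customer's configuration
print(evc_supported("Broadwell", "6.5"))  # True: after the upgrade
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes with multi-digit version components.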
Conclusion and Solution

Conclusion:

The customer's VMware version does not match the configured Broadwell CPUs. As a result, EVC cannot be used and the devices cannot be added to the cluster.

Solution:

Upgrade the VMware version to 6.5.

Experience

None

Note

None

Updated: 2019-02-25

Document ID: EDOC1000041338
