Huawei Server Maintenance Manual 09

Viewing Fault Symptoms to Diagnose Faults


Table 3-32 lists the minimum configuration of servers.
Table 3-32 Minimum configuration of servers

RH1288 V3, RH2288 V3, RH2288H V3, and 5288 V3

  • Minimum configuration: one CPU, installed in slot CPU1; one DIMM, installed in slot DIMM000 (A)
  • Remarks: None

RH8100 V3 (8P)

  • Minimum configuration: one CPU, installed in slot CPU1; one memory board, installed in slot 1; one DIMM, installed in slot DIMM000; one HFC board, installed in slot HFC2
  • Remarks: Dual system mode (one PSU, installed in any slot)

RH8100 V3 (dual-system primary 4P)

  • Minimum configuration: one CPU, installed in slot CPU1; one memory board, installed in slot 1; one DIMM, installed in slot DIMM000; one HFC board, installed in slot HFC2
  • Remarks: Primary 4P in the dual system (one PSU, installed in any slot)

RH8100 V3 (secondary 4P in the dual system)

  • Minimum configuration: one CPU, installed in slot CPU5; one memory board, installed in slot 9; one DIMM, installed in slot DIMM000; one HFC board, installed in slot HFC1
  • Remarks: Secondary 4P in the dual system (one PSU, installed in any slot)

RH5885 V3

  • Minimum configuration: two CPUs, installed in slots CPU1 and CPU2; one DIMM, installed in slot DIMM000
  • Remarks: One PSU, installed in any slot

RH5885H V3

  • Minimum configuration: two CPUs, installed in slots CPU1 and CPU2; one DIMM, installed in slot DIMM A1 of the first memory board
  • Remarks: One PSU, installed in any slot

CH121 V5, CH242 V5, CH121L V5, and CH221 V5

  • Minimum configuration: one CPU, installed in slot CPU1; one DIMM, installed in slot DIMM000
  • Remarks: None

Power Failures

The terms depicting server power status are defined as follows:

  • Power connected: The server is connected to a power source, and the power indicator is on.
  • Standby: The server is connected to a power source, and the power indicator is steady yellow.
  • Power-on: The server is on, and the power indicator is steady green.
  • POST: The server is in the power-on self-test (POST) process.

Diagnose and rectify power failures depending on the symptoms.

Fault Symptom

Handling Procedure

Quick Recovery Method

A power supply unit (PSU) is faulty. (The PSU has no power output and the health indicator is blinking red.)

  1. Check the PSU indicator, and record any alarms on the iMana 200 or iBMC WebUI. For details, see Checking Indicators to Locate Faults.
    NOTE:

    For an E9000 server, record alarms on the MM910 WebUI.

  2. Check whether an "AC lost" alarm is generated.
    • If yes, check that the power cable is connected properly and that the power distribution unit (PDU) is supplying power properly.
    • If no, go to 3.
  3. Replace the PSU with a spare PSU, and check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 4.
  4. Replace the PSU backplane or replace the mainboard if no PSU backplane is configured. Check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, contact Huawei technical support.
Quick recovery method:

  1. Check whether the current configuration has sufficient power supplies.
    • If yes, services are not affected.
    • If no, contact Huawei technical support.
  2. Replace the faulty PSU with a spare PSU. Do not install the faulty PSU into a server again.

A rack server is not connected to a power source. (All of its indicators are off.)

  1. Check whether the external power supply to the rack server is normal.
    • If yes, go to 2.
    • If no, resolve this issue.
  2. Replace the PSUs on the server, and check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 3.
  3. Replace the mainboard and PSU backplane, and check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, contact Huawei technical support.

Quick recovery method: Follow the handling procedure to replace any faulty modules.

The chassis where a blade server or a high-density server is located has no power.

  1. Check whether the external power supply to the chassis is normal or whether a power overload has occurred.
  2. Remove all blades, switch modules, management modules, and fan modules, label them with the slot numbers, and check that their power connectors are normal.
  3. Remove all PSUs, install the PSUs back one at a time in ascending order by slot number, and check whether the chassis can be connected to the power source. If the chassis cannot be connected to the power source with any of the installed PSUs, replace the chassis.
  4. If the chassis cannot be connected to the power source after a PSU is installed, replace the PSU.
  5. After verifying that the chassis and PSUs can be connected to the power source, install only one PSU. Then install the fan modules, management modules, switch modules, and blades one at a time in ascending order by slot number, and check whether the module can be connected to the power source.
  6. After the fault is rectified, install the fan modules, management modules, switch modules, and blades back into their original slots.

Quick recovery method: Follow the handling procedure to replace any faulty modules.

The chassis of a blade or high-density server has power but a compute node or server node does not.

  1. Remove the compute node or server node, and check whether its power connector is damaged.
    • If yes, replace the compute node or server node mainboard or replace the chassis.
    • If no, go to 2.
  2. Do not install the faulty compute node or server node into a server again. Install a spare part when available.
Quick recovery method:

  1. Remove the faulty compute node or server node. Check whether the other compute nodes or server nodes work properly. (Do not install the node into a server again.)
    • If yes, services are not affected.
    • If no, contact Huawei technical support.
  2. Follow the handling procedure to replace any faulty modules.

KVM Login Faults

Diagnose and rectify keyboard, video, and mouse (KVM) login faults depending on the symptoms.

Fault Symptom

Handling Procedure

Quick Recovery Method

The KVM is inaccessible.

  1. To check whether the KVM interface is reachable, run the telnet IP_address 8208 command using a third-party tool (such as PuTTY). A command sketch follows this table entry.

    You can query the port number by checking the VMM port number on the Configuration > Services page of the iMana WebUI or the Services page of the iBMC WebUI. The default port number is 8208.

    If the telnet command fails, directly log in to iMana 200 or iBMC from a PC.

  2. Clear the cache of all browsers and Java, and close all the browsers. Then log in to iMana 200 or iBMC.
  3. Set the security level of Java to medium or lower, or add the KVM address to the Java exception sites.
  4. Check the versions of the operating system, Java, and browsers. Firefox 23.0 or later is recommended. For details, see iMana 200 or iBMC Help.
Quick recovery method:

  1. Follow the handling procedure to replace any faulty modules.
  2. Restart iMana 200 or iBMC and replace the local PC.
  3. Connect the management network port to the local PC directly instead of through a switching network.
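
The following is a minimal sketch of the reachability check in step 1 of the handling procedure above, run from a Linux maintenance PC. The management IP address is a placeholder, and 8208 is the default VMM port quoted above.

    # Placeholder iMana 200/iBMC management IP address; replace with the actual address.
    BMC_IP=192.168.2.100

    # Basic reachability of the management interface.
    ping -c 4 "$BMC_IP"

    # Check whether the KVM/VMM port (default 8208) accepts TCP connections.
    # The telnet client may need to be installed on the maintenance PC.
    telnet "$BMC_IP" 8208

If the connection is refused or times out, return to the handling procedure above.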

The KVM displays an error message.

  • If the number of login users exceeds the maximum allowed value, check whether other users are using the KVM. If other users are using the KVM, restart iMana 200 or iBMC to force the users to log out.
  • If the KVM displays a message indicating that there is an unauthorized user, clear the cache of all browsers and Java, and close all the browsers. Then log in to iMana 200 or iBMC.
  • If the input signal is out of range, check whether the OS resolution exceeds the maximum value of 1280 x 1024.

Login to the KVM is successful, but the KVM is not functioning correctly.

  • If the keyboard and mouse cannot be used but services are operating properly, reset the USB and check whether the problem is solved.
    • If yes, no further action is required.
    • If no, restart the service system, clear the CMOS, upgrade iMana 200 or iBMC, and upgrade the BIOS.
  • If an ISO file fails to be mounted to the virtual DVD drive, log in to the virtual DVD drive port over Telnet to check whether the port is functioning correctly. Next, mount the ISO file with FusionServer Tools Toolkit V102 to check whether the ISO file is correct, upgrade the iMana 200 or iBMC, and upgrade the HMM and BIOS.

POST Faults

Diagnose and rectify power-on self-test (POST) faults depending on the symptoms.

Fault Symptom

Handling Procedure

Quick Recovery Method

The server fails to enter the standby mode after it powers on. (The power indicator is blinking yellow for over 5 minutes.)

  1. View serial port logs to determine whether the iMana 200 or iBMC has been repeatedly reset.

    If the iMana 200 or iBMC has been repeatedly reset, the logs repeatedly record the following information:

    ### JFFS2 load complete: 1107083 bytes loaded to 0x8b000000 
      ## Booting kernel from Legacy Image at 8a000000 ... 
         Image Name:   linux-2.6.34 
         Image Type:   ARM Linux Kernel Image (uncompressed) 
         Data Size:    1511292 Bytes = 1.4 MiB 
         Load Address: 86008000 
         Entry Point:  86008000 
         Verifying Checksum ... OK 
      ## Loading init Ramdisk from Legacy Image at 8b000000 ... 
         Image Name:   Ramdisk Image 
         Image Type:   ARM Linux RAMDisk Image (uncompressed) 
         Data Size:    1107019 Bytes = 1.1 MiB 
         Load Address: 00000000 
         Entry Point:  00000000 
         Verifying Checksum ... OK 
         Loading Kernel Image ... OK 
      OK 
       
      Starting kernel ...
    NOTE:
    • The CH140 and CH140 V3 compute nodes of the E9000 do not provide any serial ports. Directly ping the IP address of the iMana 200 or iBMC. If the ping tests occasionally or always fail, use the quick recovery method. If the problem persists, contact Huawei technical support.
    • During the iMana 200 or iBMC startup process, the serial port on a server is used by default. After the startup is complete, the serial port is switched to the system serial port.
  2. Contact Huawei technical support to query a case or replace the mainboard.

Quick recovery method:

For a rack server, perform the following operations:

  1. Power off the server, remove and reinstall the power cables, power on the server, and check whether the iMana 200 or iBMC is functioning correctly.
    • If yes, upgrade iMana 200 or iBMC by using software of its current version or a later version.
    • If no, check the iMana 200 or iBMC version. If the version is 1.91 or later, go to 2; otherwise, go to 3.
  2. Keep the power cables removed and add a jumper cap to the Clear_BMC_PW pin on the mainboard to attempt to restore the default settings of the iMana 200 or iBMC. Then reconnect power cables.
  3. Replace the mainboard or BMC board.

For an E9000 server, perform the following operations:

  1. Remove and reinstall the compute node and check whether the iMana 200 or iBMC is functioning correctly.
    • If yes, upgrade the iMana 200 or iBMC by using software of its current version or a later version.
    • If no, check the iMana 200 or iBMC version. If the version is 1.91 or later, go to 2; otherwise, go to 3.
  2. Keep the compute node removed and add a jumper cap to the Clear_BMC_PW pin on the mainboard to attempt to restore the default settings of iMana 200 or iBMC. Then reinstall the compute node.
  3. Replace the mainboard or BMC board.

A server in standby mode cannot power on. (The power indicator is steady yellow.)

  1. Collect iMana 200 or iBMC logs, and query the complex programmable logical device (CPLD) register to determine whether the power supply link to the mainboard has failed.
  2. Check whether the mainboard, CPUs, and DIMMs are installed properly.
Quick recovery method:

  1. Remove the external devices, including the PCIe cards and HBAs. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 2.
  2. Retain only the minimum server configuration, that is, the CPUs, mainboard, and memory modules. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 3.
  3. Check whether the CPUs, mainboard, and memory modules are faulty, and replace the faulty parts.
NOTE:

If the customer requires immediate recovery, replace the entire server.

A server powers off immediately when powered on.

  1. Collect iMana 200 or iBMC logs, and query the CPLD register to determine whether the power supply link to the mainboard has failed.
    NOTE:

    For an E9000 server, you are advised to use the MM910 for one-click log collection.

  2. Check the power supply unit (PSU) backplane and the mainboard.
Quick recovery method:

  1. Check all external power supplies, including the PDUs, PSUs, and power cables. Replace any faulty parts and check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 2.
  2. Replace the mainboard or PSU backplane.

The message "no signal" is displayed immediately after the server powers on.

  1. Collect iMana 200 or iBMC logs, and query the CPLD register to determine whether the power supply link to the mainboard has failed.
    NOTE:

    For an E9000 server, you are advised to use the MM910 for one-click log collection.

  2. Set the printing level to debug for the BIOS with the iMana 200 or iBMC CLI, restart the server, and save system serial port logs. When the fault is repeated, collect iMana 200 or iBMC logs and download the .bin file of the BIOS.
Quick recovery method (a command sketch follows this table entry):

  1. Run the ipmcset -d clearcmos command to clear the CMOS. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 2.
  2. Upgrade the iMana 200 or iBMC, and the BIOS. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 3.
  3. Remove the external devices, including the PCIe cards and HBAs. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 4.
  4. Retain only the minimum server configuration, that is, the CPUs, mainboard, and memory modules. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 5.
  5. Check whether the CPUs, mainboard, and memory modules are faulty, and replace the faulty parts.
NOTE:

If the customer requires immediate recovery, replace the entire server.
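
As a sketch of step 1 of the quick recovery method above, the CMOS can be cleared from the iMana 200 or iBMC CLI. The management address and user name below are placeholders, and the exact login procedure may differ between iMana 200 and iBMC versions.

    # Log in to the management controller CLI over SSH (placeholder address and user name).
    ssh Administrator@192.168.2.100

    # On the iMana 200 or iBMC CLI, clear the CMOS (step 1 of the quick recovery method).
    ipmcset -d clearcmos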

The server repeatedly powers on and then powers off.

  1. Enable the video recording function on the iMana 200 or iBMC WebUI.
  2. Set the printing level for debugging the BIOS with the iMana 200 or iBMC CLI, restart the server, and save system serial port logs. When the fault is repeated, collect iMana 200 or iBMC logs and download the .bin file of the BIOS.
  3. Restore the default BIOS settings, and check whether the server operates properly.
    • If yes, modify the BIOS parameters on the OS side based on actual requirements.
    • If no, collect iMana 200 or iBMC logs, download the .bin file of the BIOS. For details, see iMana 200 User Guide or iBMC User Guide of the corresponding version.
NOTE:

For an E9000 server, you are advised to use the MM910 for one-click log collection.

The POST stops responding at a screen.

  1. Capture the current screen.
  2. Collect iMana 200 or iBMC logs, and query the CPLD register to determine whether the power supply link to the mainboard has failed.
  3. Set the printing level for debugging the BIOS with the iMana 200 or iBMC CLI.
  4. Enable the video recording function on the iMana 200 or iBMC WebUI, restart the server, and save system serial port logs. When the fault is repeated, collect iMana 200 or iBMC logs and download the .bin file of the BIOS.
  5. Check the external USB devices, CPUs, drives, DIMMs, and PCIe devices.

RAID self-check is suspended.

  1. Capture the current screen on the iMana 200 or iBMC KVM or local KVM.
  2. Collect iMana 200 or iBMC logs.
Quick recovery method:

  1. If a RAID controller card firmware error exists, replace the RAID controller card, supercapacitor, or BBU. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 2.
  2. Check whether the drives, drive backplane, and SAS cables are faulty.
    • If yes, replace faulty parts.
    • If no, go to 3.
  3. If the RAID array is offline, import it again. Then check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, go to 4.
  4. If the BBU or supercapacitor runs out of power, follow the instructions shown in the displayed messages to keep the server running. After the server runs for 30 minutes, check the BBU or supercapacitor status. If the BBU or supercapacitor is abnormal, replace it.

NIC Preboot Execution Environment (PXE) boot fails.

  1. Check whether the NIC supports PXE.
  2. Check whether the BIOS PXE, NIC PXE, and NIC Universal Multi-Channel (UMC) functions are enabled. To check the NIC PXE function, press Ctrl+S during the POST process.
  3. Check the NIC.
  4. Check the PXE network environment on the service side.

Quick recovery method: Follow the handling procedure.

Memory Errors

Diagnose and rectify memory errors depending on the symptoms.

Fault Symptom

Handling Procedure

Quick Recovery Method

The memory capacity detected by the system is less than the configured memory capacity.

  1. Check whether the DIMMs are on the server compatibility list.
    • If yes, go to 2.
    • If no, replace the DIMMs with compatible DIMMs.
  2. Check whether memory mirroring has been enabled in the BIOS.
    • If yes, the memory capacity is reduced by 50% due to the memory mirroring function. You can disable the function in the BIOS. If the problem persists, go to 3.
    • If no, go to 3.
  3. Check whether the DIMM installation positions meet configuration rules.
    • If yes, go to 4.
    • If no, reinstall the DIMMs in correct slots according to the configuration rules.
  4. Check whether a "DIMM configuration error" alarm is generated by iBMC.
  5. Check whether any DIMM slots are abnormal. If a DIMM slot is abnormal, replace the mainboard.
Quick recovery method (a command sketch for checking the detected memory follows this list):

  1. If the iBMC generates the "DIMMxxx Configuration Error" alarm, replace the related DIMM.
  2. If the DIMM status displayed in iBMC or the OS is abnormal (unidentified or faulty), replace the faulty DIMMs.
  3. If memory mirroring or memory rank sparing is configured in the BIOS, the total available memory capacity is less than the configured physical memory capacity.
  4. If the DIMMs do not comply with DIMM installation rules, use Huawei Server Product Memory Configuration Assistant to reinstall the DIMMs.
  5. If DIMM installation slots are faulty, replace the mainboard.
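
The following Linux commands sketch one way to compare the memory detected by the OS with the configured capacity and to see which slots are populated. They assume root access on a Linux host and are only an example of tools that may be available on the customer OS.

    # Total memory visible to the operating system.
    free -h

    # DIMM slots reported by SMBIOS: location, size, and speed.
    # Empty slots are shown as "No Module Installed".
    dmidecode -t memory | grep -E "Locator|Size|Speed"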

An uncorrectable DIMM error is generated.

  1. Install the faulty DIMM on a different channel, and use a test tool to check whether the DIMM is causing the error.
    • If the error is caused by the DIMM, replace the DIMM.
    • If the error is caused by the DIMM slot, check the DIMM connector. If the connector is damaged, replace the mainboard or memory board.
  2. Remove the CPU connected to the faulty DIMM channel, and check whether the CPU socket pins are damaged.
    • If yes, replace the mainboard.
    • If no, go to 3.
  3. Replace the CPU connected to the faulty DIMM channel.
NOTE:

To check whether the DIMM error is rectified, use FusionServer Tools to perform a stress test on the DIMM.

Quick recovery method:

  1. Swap the positions of the DIMM you suspect to be faulty and a DIMM that is functioning correctly. Then, determine whether the fault is caused by the DIMM or the DIMM slot.
    • If the fault is caused by the DIMM you suspect to be faulty, replace the DIMM.
    • If the fault is caused by the DIMM slot, switch the positions of the CPU that corresponds to the faulty DIMM slot and another CPU. If the fault is caused by the corresponding CPU, replace the CPU; otherwise, replace the mainboard or memory board.
  2. If the preceding steps do not reproduce the fault, use Toolkit to perform memory pressure tests. If the fault is reproduced, perform 1; otherwise, contact Huawei technical support.

Drive I/O Faults

Diagnose and rectify drive I/O faults depending on the symptoms.

Fault Symptom

Handling Procedure

Quick Recovery Method

A "Disk Fault" alarm is reported to iMana 200 or iBMC.

  1. If the drive is in a RAID array and the RAID array is not functioning correctly, troubleshoot the RAID array.
  2. If the server has stopped, use Toolkit to inspect the server hardware. If the server is operating, replace the drive.
  3. If the fault persists, insert the new drive into the slot that you suspect to be faulty to check whether that slot is faulty.
Quick recovery method:

  1. If the faulty drive is not in a RAID array, the drive cannot be used and needs to be replaced. It is recommended that you configure RAID for all drives and then deploy the redundant services.
  2. Back up the data of redundant RAID arrays to avoid data loss.
  3. Follow the handling procedure to replace any faulty modules.

A RAID controller card fails to identify one or more drives.

  1. Insert the drive you suspect to be faulty into another slot, and insert a normal drive into the slot you suspect to be faulty. Then check which of these is causing the fault.
    • If the fault is caused by the drive, replace the drive.
    • If the fault is caused by the drive slot, check whether SAS cables are connected properly to all SAS ports on the drive backplane. For details, see the server user guide.
    • If the fault persists, go to 2.
  2. Replace the RAID controller card first, the SAS cables second, and the drive backplane third.
Quick recovery method:

  1. If the redundant RAID array fails or no RAID array is configured, the related drive partitions are unavailable.
  2. Move the unidentified drives or all drives in the RAID array to a standby server. Ensure that you retain their order during this process and attempt to back up data.
  3. Follow the handling procedure to replace any faulty modules.

A RAID controller card cannot identify any drives.

  1. Check whether the active indicators on the drives are on. If they are off, ensure that both the power cable and drive are installed properly.
  2. If the fault persists, check that the SAS cables and signal cables are connected properly. For details, see the server user guide.
  3. If the fault persists, replace any RAID controller card first, the SAS cables second, and the drive backplane third.

Quick recovery method: Follow the handling procedure to replace any faulty modules without changing the drive installation positions.
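
If the OS is still reachable, the following Linux commands offer a rough sketch for checking which drives the OS and the RAID controller driver can see; the driver names in the dmesg filter are examples only and should be adjusted to the installed controller.

    # Block devices currently visible to the operating system.
    lsblk

    # Kernel messages about drive and RAID controller detection
    # (megaraid and mpt are example driver names; adjust as needed).
    dmesg | grep -i -E "sd[a-z]|megaraid|mpt|raid"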

Note: If a fault occurs on the RH2288A V2 server, check whether the cable connecting the mainboard to the power adapter board is connected properly. Figure 3-10 shows the cable connection.
Figure 3-10 Cable connection

Ethernet Controller Faults

Diagnose and rectify Ethernet controller faults depending on the symptoms.

Fault Symptom

Diagnosis Procedure

Quick Recovery Method

A network port is invisible.

  1. Ensure that the NIC type, NIC driver, OS, BIOS version, and iMana 200 or iBMC version on the server or compute node are compatible.
    • If the server uses an OS that is not in the compatibility list, contact R&D engineers of the OS vendor.
    NOTE:

    It is recommended that you use an OS that is in the compatibility list.

    • If the NIC driver version is incompatible, upgrade the driver before continuing.
  2. Collect logs.
  3. To check whether the PCI device of the NIC is visible, run the lspci | grep -i eth* command in Linux (or equivalent in other operating systems) and observe the response.
    • If yes, go to 5.
    • If no, go to 4.
  4. If the PCI device is invisible, perform the following steps:
    1. View the logical mapping between the NIC and the corresponding CPU. If the CPU that is mapped to the NIC is not shown, the PCI device that maps to the CPU is invisible.
    2. Power the iMana 200 or iBMC off and then on. Check whether the fault persists.
    3. Insert the NIC you suspect to be faulty into another slot, and a normal NIC into the slot you suspect to be faulty. Then check which of these cause the fault.
  5. If the PCI device is visible but its network port is invisible, the driver cannot be loaded. To rectify the fault, perform the following steps:
    1. Run the ifconfig ethN up command in Linux (or the equivalent in other operating systems) to bring up the network ports, and check that the information in the network port configuration file is consistent with the actual physical network ports.
    2. If the driver fails to install when running the compilation script, check whether GNU C Compiler (GCC) and C/C++ Compiler and Tools have been correctly installed.
    3. Check the optical module type. If an Intel NIC and a non-Intel optical module are configured, the driver cannot be loaded and the network port is invisible.
    4. Reinstall the driver. Check that no errors are reported during the driver installation, and check whether the system logs record any failures when loading the driver.
Quick recovery method:

  1. If a visible NIC port becomes invisible when the server is running, and services can be interrupted, power the server off and on. If the fault persists, go to 2.
  2. Insert the NIC into another PCIe slot and check whether the fault is rectified.
    • If the NIC is causing the fault, replace the NIC.
    • If the PCIe slot is causing the fault, replace the mainboard.
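
A minimal Linux sketch of steps 3 to 5 of the diagnosis procedure above; eth0 is a placeholder interface name.

    # Step 3: check whether the NIC is visible as a PCI device.
    lspci | grep -i ethernet

    # Network interfaces known to the kernel and their link state.
    ip link show

    # Driver and firmware version bound to the port (placeholder name eth0).
    ethtool -i eth0

    # Kernel messages about driver load or firmware problems.
    dmesg | grep -i -E "eth|firmware"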

A communication error occurs on a network port.

  1. Check whether the network cable is connected properly to the network port.
  2. Ensure that the NIC type, NIC driver, OS, BIOS version, and iMana 200 or iBMC version meet the compatibility requirements of the server or compute node. If the NIC driver is incompatible, upgrade the driver before continuing.
  3. Collect logs.
  4. Run the ifconfig ethN command in Linux (or the equivalent in other operating systems) to check whether the network ports are up and whether IP addresses are set for the required network ports, and run the ethtool ethN command to check the link status of each port.
  5. If the fault occurs on a rack server, run the ethtool -p ethN command in Linux (or a similar command in other operating systems) to identify the physical port and check that the information in the network port configuration file of the rack server is consistent with the actual physical network ports. Also check whether the network port status indicators are on and whether the network ports on the switch are up.
  6. Check whether the network ports on the compute node and switch module are up. For details, see E9000 Blade Server Mezzanine Module-Switch Module Interface Mapping Tool.
  7. Check the settings of IP addresses, gateway addresses, VLANs, bondings, and uplink switch network ports.
Quick recovery method:

  1. Use the ping command to check whether the server or other servers on the network have network faults.
    • If the fault occurs on more than one server, check whether the external switching network is normal.
    • If the fault occurs only on one server, go to 2.
  2. Check the indicator to see the NIC port status. If the indicator is off, switch the optical module, optical cable, and uplink switch port related to the faulty NIC port with those of a normal NIC port if any of these components are faulty. Then replace them.
  3. If the NIC is causing the fault, restart the server when interruption will not affect services, and check whether the communication is normal. If the fault persists, power the server off and on. If the fault still persists, replace the NIC.
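
A brief sketch of the checks in steps 4 and 5 of the diagnosis procedure and step 1 of the quick recovery method, assuming a Linux host; eth0 and the gateway address are placeholders.

    # Link state and negotiated speed of the suspect port.
    ethtool eth0 | grep -E "Speed|Link detected"

    # IP configuration and administrative state of the port.
    ip addr show eth0

    # Basic connectivity test toward the gateway (placeholder address).
    ping -c 4 192.168.1.1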

A packet error or packet loss occurs on a network port.

  1. Ensure that the NIC type, NIC driver, OS, BIOS version, and iMana 200 or iBMC version meet the compatibility requirements of the server or compute node. If the NIC driver is incompatible, upgrade the driver before continuing.
  2. Collect logs.
  3. Check whether the numbers of network port packet losses and errors keep increasing. If there is no continuous increase, ignore this error.
  4. Insert the NIC that you suspect to be faulty into another slot, and insert a normal NIC into the slot that you suspect to be faulty. Then, check which of these is causing the fault.
  5. If the fault occurs on a rack server, connect the network cable that you suspect to be faulty to a working rack server, and connect a working network cable to the rack server that you suspect to be faulty. Then, check which of these is causing the fault.
  6. Switch the service traffic from the network port that you suspect to be faulty to a different network port. Then, check whether the fault is caused by the network port.
  7. To check parameters regarding the packet error or loss, run the ethtool -S ethN command in Linux (or similar in other operating systems).
Quick recovery method:

  1. Check whether the packet loss occurs on only one server. Run the ethtool -S ethN command to check the packet loss type, and run the top command to check the system resource usage (NIC interrupts, CPU and memory usage) and NIC traffic.
  2. When you have the customer's permission to interrupt services, connect a PC to the port and check for packet loss. Connect the PC to other working ports, and check optical modules, optical cables, and uplink switches. Then, replace or adjust components based on the actual situation.
  3. If the NIC is causing the fault, wait until services can be interrupted and then restart the server. Then, check whether the communication is normal. If the fault persists, power the server off and on. If the fault still persists, replace the NIC.
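
A short Linux sketch of the counter checks referred to in step 7 of the diagnosis procedure and step 1 of the quick recovery method; eth0 is a placeholder interface name.

    # NIC driver counters related to errors, drops, and discards.
    ethtool -S eth0 | grep -i -E "err|drop|discard"

    # Kernel-level RX/TX statistics for comparison.
    ip -s link show eth0

    # One-shot snapshot of CPU, memory, and process load while traffic is running.
    top -b -n 1 | head -n 20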

The performance of a network port does not meet requirements.

  1. Check whether the NIC type and driver are compatible with the BIOS version and the iMana 200 or iBMC version on the server or compute node. If the NIC driver version is incompatible, upgrade the NIC driver before continuing.
  2. Collect logs.
  3. Check whether the physical network port meets performance requirements.
  4. Check whether the binding between the network port interrupt and CPU queue has been modified.
  5. To check whether the TSO and GSO settings of the network port have been modified, run the ethtool -k ethN command in Linux (or equivalent in other operating systems).
  6. To check whether the network port buffer information has been modified, run the ethtool -g ethN command in Linux (or equivalent in other operating systems).
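
A Linux sketch of steps 4 to 6 of the diagnosis procedure; eth0 is a placeholder interface name, and interrupt names vary by NIC driver.

    # Step 5: TSO/GSO and other offload settings of the port.
    ethtool -k eth0 | grep -E "tcp-segmentation-offload|generic-segmentation-offload"

    # Step 6: current and maximum ring buffer (RX/TX) sizes.
    ethtool -g eth0

    # Step 4: distribution of the NIC interrupts across CPU cores.
    grep eth0 /proc/interrupts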

FC Controller Faults

Common FC Controller Faults and Handling Procedures

Diagnose and rectify FC controller faults depending on the symptoms.

Fault Symptom

Handling Procedure

The storage device fails to identify the host World Wide Port Name (WWPN).

  1. Connect to the switch and run the switchshow command (on a Brocade switch) to query the port connection status.
  2. If the switch fails to obtain the host WWPN, the host bus adapter (HBA) cannot register with the switch. In this case, do as follows:
    1. Check that the HBA and the processor connected to the PCIe bus are installed properly.
    2. (Optional) Check the mapping between the HBAs and switch modules for E9000 and E6000 servers.
    3. Check FC links between the HBA and the switch by checking the optical cable connections and the optical module power. If E9000 servers are used, check the HBA work mode.
    4. Ensure that the lpfc driver and firmware matching the E9000 are installed.
    5. If multiple switches are connected, check whether the switch connection mode (AG or TR) is correct.
    6. Collect the OS message logs and check lpfc driver information for faults.
    7. Collect log information of the switches.
  3. If the HBA is successfully registered with the switch, the switch obtains the host WWPN, but the storage cannot identify host WWPNs, rectify the fault as follows:
    1. Check the FC links (optical cables and modules) between the switch and the storage device.
    2. Check whether the HBA and the storage ports are in the same zone.
    3. Check whether the zone configurations are the same for switches from the same vendor.
    4. Collect the OS message logs and check lpfc driver information for faults.
    5. Collect the log information of switches.
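
On a Linux host that uses the lpfc driver, the local WWPNs and FC link states can also be read from sysfs, which helps confirm whether the HBA itself is up and has logged in to the fabric. This is a sketch that assumes a standard Linux FC host stack.

    # WWPNs of the local FC HBA ports as seen by the operating system.
    cat /sys/class/fc_host/host*/port_name

    # Link state of each FC host port (for example Online or Linkdown).
    cat /sys/class/fc_host/host*/port_state

    # lpfc driver messages about link or fabric login problems.
    dmesg | grep -i lpfc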

The storage device has identified the HBA WWPN, but LUNs cannot be mapped to the host.

  1. Check whether the lpfc driver and firmware matching the E9000 have been installed.
  2. Collect the OS message logs and check lpfc driver information for faults.
  3. Collect log information of the switches.
  4. If no faults are identified, faults may exist on the storage device or OS SCSI application layer. Contact the OS or storage device vendor.

Some multipath links of LUNs are down.

  1. Ensure that the installed lpfc driver and firmware match the E9000.
  2. Check for error codes on FC links between the HBA and the storage device.
  3. Collect the OS message log and check lpfc and multipath driver information for faults.
  4. Collect log information of the switches.
  5. Contact the OS multipath driver vendor or storage device vendor.

Poor data read/write performance of LUNs

  1. Check whether the installed lpfc driver and firmware match the E9000.
  2. Check for error codes on FC links between the HBA and the storage device.
  3. Run the iostat command on the host to query the I/O delay and concurrent I/O operations.
  4. Collect the OS message log and check the lpfc driver information and the I/O queue depth configured for the HBA driver.
  5. Perform drive performance tests (read and write 100 GB and 100 MB files).
  6. Contact storage analysis engineers.
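
A rough sketch of the I/O measurements in steps 3 and 5 above, assuming a Linux host. The mount point is a placeholder, and the dd test writes only about 1 GB as a quick indication rather than the full 100 GB and 100 MB tests mentioned above.

    # Per-device I/O latency, queue depth, and utilization, refreshed every 2 seconds
    # (iostat is provided by the sysstat package).
    iostat -x 2

    # Simple sequential write test against a file on the LUN (placeholder path);
    # direct I/O bypasses the page cache and reports throughput when it completes.
    dd if=/dev/zero of=/mnt/lun_test/testfile bs=1M count=1024 oflag=direct
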
Quick Recovery from FC Controller Faults

Table 3-33 describes the common quick recovery methods and handling procedures of FC controller faults.

Table 3-33 Quick recovery methods and handling procedures of FC controller faults

Fault Symptom

Quick Recovery Method

All HBA links are disconnected.

  1. Check the link redundancy status.
    • If the links are redundant, reset the switch module ports connected to the faulty HBAs, and go to 2.
    • If the links are not redundant, go to 3.
  2. Check whether the ports connected to the faulty HBAs are functioning correctly.
    • If yes, check whether the fault is rectified.
    • If no, migrate all services, and safely power off the server. Next, remove and reinstall the compute node, and power on the server. If the fault persists, apply for spare HBAs to replace the faulty ones.
  3. Before contacting Huawei technical support, it is recommended that you migrate services and collect switch module logs, OS logs, LLD networking information, and device time differences.

Storage services are affected but HBA links are normal.

  1. Migrate all services, and safely power off the server. Next, remove and reinstall the compute node, and power on the server. Then, check whether the fault is rectified.
    • If yes, no further action is required.
    • If no, contact the storage vendor for quick fault recovery.
  2. Before contacting Huawei technical support, it is recommended that you migrate services and collect switch module logs, OS logs, LLD networking information, and device time differences.

Storage LUN performance issues

  1. Check for FC link error codes on the FC switch module. If error codes exist, run the porterrshow command and determine the cause of the fault depending on the port mapping relationships.
    • If any links between the switch modules and the external switches are faulty, remove and reconnect the optical cables and modules. If a link is still faulty and spare parts are available, replace any related optical cables and modules and try again.
    • If a link between an HBA and switch module is faulty, move the compute node to a working slot to check whether the fault is caused by the HBA, switch module, or backplane. Replace any faulty modules as required.
  2. Clear the error code count history, observe the error codes for 10 minutes, test the performance, and contact the storage vendor for quick fault recovery.

Switch Module Faults

Switch Module Quick Recovery Method

Rectify switch module faults depending on the symptoms.

Fault Symptom

Quick Recovery Method

A switch module fails to be started. After logging in to the switch module over SOL, the SOL screen displays the following: Can not get config file from smm. Begin reboot ....

  1. Switch between active and standby MM910s and check whether the switch module can start normally.
    • If yes, no further action is required.
    • If no, go to 2.
  2. Restart the baseboard management controller (BMC) of the switch module and check whether the switch module can be started properly.
    • If yes, no further action is required.
    • If no, go to 3.
  3. Upgrade the switch module software to the latest version. For details, see the "Upgrading Software by Using U-Boot" section in the "Common Operations" chapter of the E9000 Server V100R001 Upgrade Guide.

A switch module fails to start. After logging in to the switch module over SOL, the SOL screen displays the following: Ensure that the optical fibers or cables are inserted on the same ports on the panel after the board replacement. During system startup, do not power off or remove the board. To continue the startup, press Y:.

  1. If services are running, connect the network cable or the optical cable to the switch module and press Y to continue.
  2. If no services are running, press Y to continue.

After logging in to a switch module over SOL, the SOL screen shows Critical Error! and only the meth port can be displayed by running display interface.

Upgrade the switch module software to a specified version or the latest version depending on the displayed message.

A network storm occurs (the Multicast and Broadcast counters of a port indicate a fault).

Perform one of the following operations:

  • Run the following commands to disable the port with abnormal traffic:

    [~HUAWEI]interface 10ge 1/17/1

    [~HUAWEI-10ge 1/17/1]shutdown

  • Disconnect the optical cable or network cable from the port that has abnormal traffic.

A port is Up but no traffic passes through the port.

  1. In the interface view, run the following commands to check whether the fault is rectified:

    [~HUAWEI]interface 10ge 1/17/1

    [~HUAWEI-10ge 1/17/1]restart

    • If yes, no further action is required.
    • If no, go to 2.
  2. Run the reboot command to restart the switch module.

Incorrect packets are generated (running the display interface command shows that the value of Total Error in the Input area is not zero and keeps increasing).

Run the display interface command and check CRC and Symbols.

  1. If the values of CRC and Symbols are not zero, perform the following operations:
    • Ensure that the optical cables are connected properly to the faulty switch module and the device it is directly connected to.
    • Check whether any optical cables are damaged.
    • Check whether the optical modules of the faulty switch module and the device it is directly connected to are working properly.
    • If there is a transmission device between the switch module and its connected device, check the transmission device gateway for alarms.
  2. If the values of CRC and Symbols are zero, run the reboot command to restart the switch module.

OS Faults

OS Installation Faults

Diagnose and rectify faults related to OS installation depending on the symptoms.

Possible Cause

Diagnosis Procedure

Incompatible OS

Use the Huawei Server Compatibility Checker to determine whether the OS is compatible with the server.

Incorrect installation method

Use the Huawei Server Compatibility Checker to check the OSs compatible with the server and the installation description. For details, see the Huawei Server OS Installation Guide or the TaiShan Series Server EulerOS Installation Guide.

ServiceCD issue

  1. Use the Huawei Server Compatibility Checker to determine whether the OS installation requires a ServiceCD.
  2. Ensure that the ServiceCD version is correct.
  3. Check whether the installation method selected using the ServiceCD is correct.

Installation process issue

  1. Check that the installation procedure is correct. For details, see the Huawei Server OS Installation Guide or the TaiShan Series Server EulerOS Installation Guide.
  2. Check whether the OS installation requires a physical DVD drive or other media.
  3. Check whether the OS installation requires a special installation DVD, for example, one integrated with drivers.
  4. Check whether the OS installation DVD is an original from the manufacturer or whether it has been modified by a third party.
  5. Disconnect any external storage devices.
  6. Ensure that the default BIOS settings are used.
  7. Ask the OS vendor for installation support.

Drive identification issue

  1. Ensure that the target drive is identified by the RAID controller, and use the Huawei Server Compatibility Checker to check whether the target drive is compatible with the server. Next, check the BIOS to see whether the target storage devices, including SATADOMs, microSD cards, and built-in USB flash drives, are identified.
  2. Check the RAID controller card model and determine whether to configure RAID (software RAID configuration, LSI SAS1078, LSI SAS2108, LSI SAS2208, LSI SAS3008, LSI SAS2308, LSI SAS3108, Avago SAS3408, Avago SAS3416iMR, Avago SAS3416IT, and Avago SAS3508).
    NOTE:

    The V5 server supports OS installation on the drive that is managed by the standard RAID controller card.

  3. Check the RAID array properties to ensure that the boot drive and the target drive are the same or in the same RAID array.
  4. Set the BIOS mode to UEFI if the drive capacity is over 2 TB.
    NOTE:

    V1 and V3 servers do not support UEFI mode.

  5. Check whether the drive is a 4K drive.
  6. Check whether the loaded RAID controller card driver is correct.
  7. Format the drive or reconfigure the RAID array.
OS Operation Faults

If you have confirmed that faults are not caused by other factors, diagnose them as follows:

Fault Symptom

Diagnosis Method

Conclusion

The server is suspended or restarted.

Disable C state, P state, T state, and ASPM in the BIOS and ensure that the server functions correctly.

The OS version does not support CPUs of the current platform.

Check whether Kdump information contains crashed process names or board vendor names. For example, FC_XX indicates an FC device breakdown.

The built-in OS drivers are incompatible.

Check whether it is a PCIe card compatibility issue.

  • There is a power supply issue. (A cat err alarm is generated on iMana 200 or iBMC.)
  • The PCIe protocol is not supported.
  • There is a driver issue.

The PCIe card is incompatible.

Check whether the breakdown screenshot contains CPUidle.

NOTE:

The G2500 server does not currently support this method.

The OS kernel is incompatible with the hardware platform.


Use the iMana 200 or iBMC to locate the fault. For example, determine whether the alarm was reported for the DIMM, drive, or mainboard component.

Circuit hardware is faulty.

Check whether the system logs contain read-only file system records, and use FusionServer Tools to rate the drive. Then decide whether to replace the drive depending on the result.

A drive fault occurred.

Check whether an imana cat err alarm is displayed on iMana 200. Use the fdm log of iMana 200 to locate the fault.

Hardware is faulty.

Check whether there is a machine check exception issue. Locate such a fault by checking the /var/log/mce.log and error codes of serial port kdump information.

  • The hardware is faulty.
  • The software or hardware interface setting is incorrect.

Collect the following information:

  • For new servers, confirm the proportion of servers that are not functioning correctly, and check whether normal servers and those functioning incorrectly have the same configurations.
  • For existing servers, confirm the number of servers that are not functioning correctly, and check whether the issues occur under specific circumstances.
  • Check iMana 200 or iBMC for hardware alarms.

After collecting the preceding information, confirm whether the issue occurs on a single or multiple servers. If the issues occur on a single server, run FusionServer Tools to locate faults. If the issues occur on multiple servers, contact Huawei technical support.

Locate the fault based on the report.

Check whether a breakdown occurs under specific circumstances after software upgrades have been performed for customer service software, database, middleware, kernel, BIOS, management modules, iMana 200 or iBMC, or storage devices.

  • The new software version has bugs.
  • Original interfaces are disabled for security purposes, causing issues.

Check whether the Kdump information of the breakdown screenshot periodically displays update_cpu_power, divide_error, or timer_xx.

NOTE:

The G2500 server does not currently support this method.

The OS has bugs or kernel defects.

Check whether the Kdump information of the breakdown screenshot non-periodically displays gethostbyname.

NOTE:

The G2500 server does not currently support this method.

Check whether the breakdown screenshot contains CPUidle.

NOTE:

The G2500 server does not currently support this method.

The OS kernel is incompatible with the hardware platform.

Fault Symptom

Diagnosis Method

Conclusion

The server is suspended or restarted.

Check whether the Kdump information contains crashed process names or board vendor names. For example, FC_XX indicates an FC device breakdown.

Drivers built in the OS are incompatible.

Check whether it is a PCIe card compatibility issue.

  • Check the power supply.
  • Check whether the PCIe protocol is supported by the PCIe card.
  • Check the drivers.

The PCIe card is incompatible.

Use iBMC to locate the fault, for example, the DIMM, drive, or mainboard component for which an alarm is reported.

Circuit hardware is faulty.

If the system logs contain read-only file system records, use FusionServer Tools to rate the drive. Decide whether to replace the drive based on the result.

A drive fault has occurred.

Check whether there is a Machine Check Exception issue. Locate such a fault by checking the /var/log/mce.log and error codes of serial port kdump information.

  • The hardware is faulty.
  • The software or hardware interface is set incorrectly.

Collect the following information:

  • For new servers, confirm the proportion of abnormal servers and check whether normal and abnormal servers have the same configurations.
  • For existing servers, confirm the number of abnormal servers and check whether the issues occur under specific circumstances.
  • Check iBMC for hardware alarms.

After collecting the preceding information, determine whether it is a single server or hardware issue. Run FusionServer Tools for fault locating.

Locate the fault based on the report.

Breakdown occurs under specific circumstances after software upgrade of customer service software, database, middleware, kernel, BIOS, iBMC, and storage devices.

  • The new software version has bugs.
  • Original interfaces are disabled for security purposes, causing issues.

Kdump information of the breakdown screenshot periodically displays update_cpu_power, divide_error, or timer_xx.

The OS has bugs or kernel defects.

Kdump information of the breakdown screenshot non-periodically displays gethostbyname.
