Huawei Server Maintenance Manual 09

V5


Common Problems During Startup and Shutdown

V5 Server Startup Fails and a VCC Power-On Timeout Alarm Is Reported
Problem Description
Table 5-223 Basic information

Item                   Information
Source of the Problem  1288H & 2288H V5
Intended Product       1288H & 2288H V5
Release Date           2017-12-01
Keyword                1288H V5, 2288H V5, VCC alarm

Symptom

After a purchased GPU card is installed on a V5 server onsite, the server cannot be powered on, and the U10 alarm is generated.

Key Process and Cause Analysis

(1) Query information about the U10 alarm.

According to the Huawei Rack Server iBMC Alarm Handling, the alarm indicates that the non-standby power supply is abnormal.

(2) Analyze logs.

When the alarm is generated, the maintenance log records that a power-on timeout occurs. The GPU power supply may be abnormal. As a result, the mainboard cannot be powered on.

Check the GPU compatibility and ensure that the GPU is properly installed.

(3) Check the GPU compatibility.

The GPU model is Tesla M10, which is purchased by the customer and is not included in the contract. However, the GPU can be found using the Huawei Enterprise Server Compatibility Checker.

(4) Install a GPU.

According to the notes of Tesla M10:

A riser card supports two M10 GPU cards. Each GPU requires a dedicated GPU cable (04150627-001). Huawei dedicated cables must be used. Do not connect the cables to the mainboard.

Figure 5-309 shows how to install the GPU properly.

Figure 5-309 Layout of the power cable 04150627-001 on the chassis

The customer purchased the GPUs separately and used industry-standard power cables instead of Huawei dedicated power cables. As a result, VCC_12V0 is short-circuited to ground, and the server cannot be powered on.

Conclusion and Solution

Conclusion:

The customer purchased the GPUs separately and used industry-standard power cables instead of Huawei dedicated power cables. As a result, VCC_12V0 is short-circuited to ground, and the server cannot be powered on.

Solution:

Replace the GPU power cables with Huawei dedicated power cables.

Experience

None

Note

None

"PCI Data Acquisition and Signal Processing Controller" Is Displayed in the OS of a V5 Server
Problem Description
Table 5-224 Basic information

Item                   Information
Source of the Problem  1288H & 2288H V5
Intended Product       1288H & 2288H V5
Release Date           2018-01-03
Keyword                1288H & 2288H V5, OS, PCI Data Acquisition and Signal Processing Controller

Symptom

When Windows Server 2016 is installed on a V5 server, the error message "PCI Data Acquisition and Signal Processing Controller" is displayed.

Key Process and Cause Analysis

(1) Analyze the symptom.

This error message indicates that a device fails to be identified. The cause is that the corresponding driver is not installed in the newly installed OS.

(2) Determine the device that fails to be identified.

The iBMC of a 1288H or 2288H V5 enables the black box function by default. However, the corresponding driver is not installed in the OS.

(3) Install the driver.

The installation of the Windows iBMA involves the black box driver, SNMP service, and hwBMAService service.

During the installation, ensure that the black box driver and the SNMP service are installed before hwBMAService. If you install hwBMAService before the black box driver or the SNMP service, some functions of the iBMA software are unavailable.

Install the black box driver by following instructions in the iBMA V100R002 User Guide.
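
The ordering rule above can be expressed as a small prerequisite check. This is only an illustrative sketch: the component names are taken from the text, while the `PREREQUISITES` table and the `order_is_valid` function are hypothetical and are not part of the iBMA installer.

```python
# Hypothetical prerequisite table: hwBMAService must come after both
# the black box driver and the SNMP service, as stated in the text.
PREREQUISITES = {
    "hwBMAService": {"black box driver", "SNMP service"},
}


def order_is_valid(install_order):
    """Return True if every component is installed after its prerequisites."""
    installed = set()
    for component in install_order:
        missing = PREREQUISITES.get(component, set()) - installed
        if missing:
            return False
        installed.add(component)
    return True
```

For example, `order_is_valid(["black box driver", "SNMP service", "hwBMAService"])` accepts the order recommended above, and any order that places hwBMAService first is rejected.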

Conclusion and Solution

Conclusion:

The black box function is enabled by default on the iBMC of V5 servers, but the corresponding driver is not installed in the Windows OS.

Solution:

Install the iBMA in Windows Server 2016 correctly.

Experience

None

Note

None

A Large Number of Unknown Base System Devices Are Displayed in Windows of a V5 Server
Problem Description
Table 5-225 Basic information

Item                   Information
Source of the Problem  2288H V5
Intended Product       V5 servers
Release Date           2018-03-05
Keyword                Base system device

Symptom

An NA customer installs Windows Server 2012 R2 on a 2288H V5. A large number of unknown base system devices are displayed in the OS, even though the chipset driver released by Huawei support has been installed.

Key Process and Cause Analysis

(1) Analyze the unknown device type.

The following figure shows the unknown devices. Nearly all of them carry the vendor ID 8086 (Intel) in their hardware IDs, which points to the chipset driver.

Base System Device PCI\VEN_8086&DEV_208F&SUBSYS_400319E5&REV_04\3&2411E6FE&...

Base System Device PCI\VEN_8086&DEV_208F&SUBSYS_400319E5&REV_04\3&2411E6FE&...

...
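
The vendor and device IDs can be read directly from a hardware ID string such as the ones listed above. The following is a minimal sketch, assuming the usual `PCI\VEN_xxxx&DEV_xxxx&SUBSYS_xxxxxxxx&REV_xx` layout; the function name `parse_pci_hardware_id` is hypothetical.

```python
import re

# Pattern for the PCI hardware ID layout seen in the listing above.
HWID_PATTERN = re.compile(
    r"PCI\\VEN_(?P<vendor>[0-9A-F]{4})&DEV_(?P<device>[0-9A-F]{4})"
    r"(?:&SUBSYS_(?P<subsys>[0-9A-F]{8}))?(?:&REV_(?P<rev>[0-9A-F]{2}))?",
    re.IGNORECASE,
)


def parse_pci_hardware_id(hardware_id):
    """Split a PCI hardware ID into its fields, or return None if it does not match."""
    match = HWID_PATTERN.search(hardware_id)
    if match is None:
        return None
    return {key: value.upper() for key, value in match.groupdict().items() if value}


fields = parse_pci_hardware_id(r"PCI\VEN_8086&DEV_208F&SUBSYS_400319E5&REV_04")
print(fields["vendor"], fields["device"])
```

Grouping the unknown devices by the extracted vendor and device fields makes it easy to see that they all share the same device ID.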

(2) Install drivers.

Download the latest drivers on Huawei support according to the server configuration, install drivers in the OS, and restart the OS.

The devices still cannot be identified, and a message indicating that the drivers are not installed is displayed.

(3) Determine the cause with Intel.

Based on the ID of unknown 8086 devices, Intel determines that the fault is related to the BIOS settings.

Based on the Intel feedback, compare the BIOS settings between normal and abnormal servers. Five differences exist, as shown in the following figure.

  • If DFXEnable is set to Disabled, the server runs properly.
  • When the value of DFXEnable is changed to Enabled on a functioning server, a large number of unknown 8086 devices are displayed in the OS with the same device ID.

DFXEnable needs to be enabled only when the FDM fault injection debugging or the DMA debugging function is required. By default, DFXEnable is disabled.

Conclusion and Solution

Conclusion:

DFXEnable is set to Enabled in the BIOS. As a result, unknown devices are displayed in the OS.

Solution:

Change the value of DFXEnable to Disabled in the BIOS.

The problem is solved.

Experience

If unknown devices are displayed in the OS, check whether the component drivers are installed correctly and whether DFXEnable is enabled in the BIOS.

Note

None

Common Problems of RAID Controller Cards and Hard Drives

Failed to Install Citrix 7.1 on a Server Configured with the Avago SAS3508
Problem Description
Table 5-226 Basic information

Item                   Information
Source of the Problem  RH2288H V5
Intended Product       FusionServer
Release Date           2018-02-03
Keyword                Citrix OS, driver installation failure

Symptom

Server: 2288H V5

RAID controller card: Avago SAS3508

Firmware version: 5.030.1073

Symptom:

During the server test, after the LSI SAS3508 driver is installed in Citrix 7.1, the system enters the emergency mode.

Key Process and Cause Analysis

Key process:

  1. Load the LSI SAS3508 driver as prompted.
    Figure 5-310 Installation process

  2. After the installation is complete, restart the server. The system cannot be accessed.
    Figure 5-311 System startup failure

Cause analysis:

For the cause analysis by Citrix, see the following article:

https://support.citrix.com/article/CTX226401

Conclusion and Solution

Conclusion:

When installing a Citrix OS, you need to load the Citrix driver twice, as shown in the following figure.

Experience

None

Note

None

"The disk Disk0 failure" Is Displayed on the iBMC of a V5 Server
Problem Description
Table 5-227 Basic information

Item                   Information
Source of the Problem  1288H & 2288H V5
Intended Product       1288H & 2288H V5
Release Date           2018-01-09
Keyword                1288H & 2288H V5, Disk0

Symptom

The error message "The disk Disk0 failure" is displayed on the iBMC of two V5 servers onsite.

Key Process and Cause Analysis

(1) Pre-verify the fault.

1) The configuration onsite is the same as that in the contract.

2) The LSI SAS3008IT is used onsite.

3) After the alarm is generated, the customer swaps the Disk0 drives between the two servers. The fault persists on both servers, which indicates that the drives work properly.

(2) Analyze logs.

Collect logs on the iBMC in one-click mode and check the logs. The alarm is related to the RAID controller card.

Cause analysis:

When the LSI SAS3008IT is connected to the pass-through backplane, the drive enclosure ID is 0xffff, which is the same as the default invalid ID configured for the Disk0 drives.

Turn on the indicator to confirm the mapping between the hard drive and the RAID controller card. When the OS is restarted or a hard drive is reinstalled, the system continues identification after it identifies the drive enclosure ID 0xffff, which is considered as an invalid ID.

In the subsequent identification process, the identification function dynamically checks whether the mapping relationship is established. The function only checks that the PD list value of the RAID controller card is the same as the saved value, and determines that the devices are identified. The identification process is then skipped, and the function returns the default identification status, which is unsuccessful. As a result, an alarm is generated indicating that the drive is missing.

Solution

Modify the get_pd_list interface at the SML lib layer. If the enclosure ID of a hard drive is 0xffff, convert the ID to a value other than 0xffff and ensure that the ID is different from other enclosure IDs in the PD list.
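
The remapping described above can be sketched as follows. This is not the actual get_pd_list code: the physical drive list is modeled as a list of dictionaries, and the field names and the `remap_invalid_enclosure_ids` function are assumptions made for illustration.

```python
INVALID_ENCLOSURE_ID = 0xFFFF  # default invalid ID mentioned in the text


def remap_invalid_enclosure_ids(pd_list):
    """Return a copy of pd_list in which enclosure ID 0xffff is replaced.

    The replacement differs from 0xffff and from every other enclosure ID
    in the list. The dict layout ('enclosure_id', 'slot') is hypothetical.
    """
    used = {
        pd["enclosure_id"]
        for pd in pd_list
        if pd["enclosure_id"] != INVALID_ENCLOSURE_ID
    }
    # range() stops before 0xffff, so the replacement is never the invalid ID.
    replacement = next(
        (i for i in range(INVALID_ENCLOSURE_ID) if i not in used), None
    )
    if replacement is None:
        raise ValueError("no free enclosure ID available")
    return [
        {**pd, "enclosure_id": replacement}
        if pd["enclosure_id"] == INVALID_ENCLOSURE_ID
        else dict(pd)
        for pd in pd_list
    ]
```

With a valid, unique enclosure ID, the identification function no longer confuses a pass-through drive with an unidentified one.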

Conclusion and Solution

Conclusion:

When a V5 server is configured with the LSI SAS3008IT, the iBMC may falsely report a drive failure alarm.

Solution:

Upgrade the LSI SAS3008IT firmware to 2288H V5 V100R005C00SPC107.

Experience

None

Note

None

RAID Controller Card Configuration for a V5 Server Is Unavailable on the Device Manager Screen
Problem Description
Table 5-228 Basic information

Item                   Information
Source of the Problem  2288H V5
Intended Product       V5 servers
Release Date           2018-02-02
Keyword                Device manager, RAID

Symptom

The RAID controller card configuration for a V5 server is unavailable on the Device Manager screen, but the RAID controller card model can be identified on the iBMC.

Key Process and Cause Analysis

(1) Check the RAID controller card model.

The LSI SAS2208, LSI SAS2308, LSI SAS3008, and LSI SAS3108 RAID controller cards support RAID configuration and startup in both UEFI mode and legacy mode.

The Avago SAS3408, Avago SAS3416, and Avago SAS3508 RAID controller cards support RAID configuration only in UEFI mode, although they can boot in legacy mode. If you need to configure RAID arrays offline on a server that boots in legacy mode, switch to UEFI mode for the configuration and then switch back to legacy mode.
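
The support matrix in the two paragraphs above can be captured in a small lookup. This is an illustrative sketch only; the `CONFIG_MODES` table restates the text, and the function name is hypothetical.

```python
# Modes in which each controller supports offline RAID configuration,
# as listed in the text above.
CONFIG_MODES = {
    "LSI SAS2208": {"UEFI", "Legacy"},
    "LSI SAS2308": {"UEFI", "Legacy"},
    "LSI SAS3008": {"UEFI", "Legacy"},
    "LSI SAS3108": {"UEFI", "Legacy"},
    "Avago SAS3408": {"UEFI"},
    "Avago SAS3416": {"UEFI"},
    "Avago SAS3508": {"UEFI"},
}


def can_configure_raid(controller, boot_mode):
    """Return True if the controller can be configured in the given boot mode."""
    return boot_mode in CONFIG_MODES[controller]
```

For example, `can_configure_raid("Avago SAS3508", "Legacy")` is false, which corresponds to the symptom in this case.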

(2) Check the boot types.

No alarm is displayed on the iBMC WebUI, indicating that the RAID controller card can be identified.

Press F11 to go to the Device Manager screen. No RAID controller card identification information is found and the RAID controller card cannot be configured.

Check the BIOS boot type, which is legacy.

The legacy BIOS does not contain the Boot From File and Administrator Secure Boot options on the Front Page screen.

(3) Determine the cause.

If the boot type is legacy, you need to configure the RAID controller card in UEFI mode.

The RAID controller card can be identified in the UEFI BIOS.

Conclusion and Solution

Conclusion:

The Avago SAS3508 RAID controller card supports RAID configuration only in UEFI mode, although it can boot in legacy mode. If you need to configure RAID arrays offline on a server that boots in legacy mode, switch to UEFI mode for the configuration and then switch back to legacy mode.

Solution:

Configure the RAID controller card in the UEFI BIOS.

Experience

None

Note

None

Occasional Initialization Failure of the LSI SAS3508 RAID Controller Card When the V5 Server Is Powered On and Off Repeatedly
Problem Description
Table 5-229 Basic information

Item                   Information
Source of the Problem  CH121 V5, 2288H V5
Intended Product       V5 servers
Release Date           2018-05
Keyword                V5, power-on, power-off, LSI SAS3508, RAID controller card, initialization failure

Symptom

During the long-term ORT reliability test of the LSI SAS3508 RAID controller card, initialization failure may occur at a low probability when the server is repeatedly powered on and off using AC power supply (simulating extreme scenarios). When the RAID controller card fails to be initialized, the OS fails to be started.

Trigger conditions:

  1. The control node uses the LSI SAS3508 RAID controller card.
  2. The PCB version of the LSI SAS3508 RAID controller card is .A.
  3. The current write policy of the LSI SAS3508 RAID controller card is Write Back or Write Back with BBU.
  4. The entire chassis is powered on and then powered off, or a compute node is removed and then inserted.

Fault symptom:

The boot device is not found during server startup, and the OS fails to be started.

Identification method:

  1. Obtain the iBMC IP address of the compute node or the 2288H V5 server from the network design document, and log in to the WebUI. The default user name is Administrator, and password is Admin@9000.

  2. Choose Information > System Info > Storage. Check whether the Type of the RAID controller card is LSI SAS3508. If yes, go to the next step; if no, this article is not applicable.

  3. Check whether the PCB version of the RAID controller card is .A. If yes, go to the next step; if no, this article is not applicable.

  4. Check whether Current Write Policy of the RAID controller card is Write Back or Write Back with BBU. If yes, this article is applicable; if no, this article is not applicable.
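
The identification steps above amount to three checks on values read from the iBMC WebUI. A minimal sketch, assuming the values are entered as the strings shown on the Storage page (the function name is hypothetical):

```python
AFFECTED_WRITE_POLICIES = {"Write Back", "Write Back with BBU"}


def article_applies(controller_type, pcb_version, write_policy):
    """Return True if all three conditions from the identification steps hold."""
    return (
        controller_type == "LSI SAS3508"
        and pcb_version == ".A"
        and write_policy in AFFECTED_WRITE_POLICIES
    )
```

A server is affected only if every check passes. For example, a card of the same type and PCB version that already uses Write Through is not affected.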

Key Process and Cause Analysis

Cause:

A consistency problem exists in the chipset of the LSI SAS3508 RAID controller card. When the server is repeatedly powered on and off using AC power supply, a signal may, at a low probability, enter a metastable state and resolve to 0 or 1 at random. As a result, the RAID software does not enter the power-off protection process, and initialization fails.

In the firmware of the LSI SAS3508 RAID controller card, the default BIOS mode is Stop on error. In this mode, when an error or configuration change occurs on the FW, the status of the UEFI driver is set to Not healthy during the startup. To log in to the OS, press F11 during server startup, and restore the driver status on the Device Manager screen.

Conclusion and Solution
NOTE:

This solution applies only to NFV scenarios. This solution affects the performance. In other scenarios, use this solution based on the actual service evaluation.

Rectification method:

If the fault described in this case occurs, perform the following operations to rectify the fault.

  1. Log in to the Device Manager screen of the LSI SAS3508 RAID controller card.

    1. Log in to the iBMC WebUI, and choose Remote Console > Java Integrated Remote Console (Shared) to access the KVM.

    2. Restart the server on the KVM.

    3. During the startup, press F11 when prompted.

    4. Enter the password (the default password is Admin@9000) and press Enter. On the management screen, choose Device Manager.

  2. On the Device Manager screen, choose Some drivers are not healthy.

  3. On the Driver Health screen, choose Repair the whole platform.

  4. "Memory/battery problems were detected" is displayed.

  5. Press Enter.

  6. Enter c and press Enter twice. If the following screen is displayed, the configuration is complete.

  7. Use the KVM to restart the server.

Solution:

For V5 servers on the live network, the problem may occur in three scenarios.

  1. The OS has been installed on the server and is running properly.
  2. A RAID group has been created on the server, but the OS is not installed.
  3. No RAID group is created on the server.

Scenario 1: The OS has been installed on the server and is running properly.

  1. Obtain MegaRAID Storcli.

    1. Log in to the Broadcom website, and choose DOWNLOADS > Management Software and Tools. The address is as follows:

      https://www.broadcom.com/products/storage/raid-controllers/megaraid-9440-8i#downloads

    2. Download MegaRAID Storcli of the latest version.

    3. Decompress the downloaded tool package, and use FileZilla or WinSCP to upload the rpm tool package from the Linux directory to the first node of FusionSphere OpenStack.

  2. Log in to the head node of FusionSphere OpenStack as the fsp user over SSH. The IP address of the head node is the reverse proxy IP address of FusionSphere OpenStack. The default password is Huawei@CLOUD8. Run the su - root command to switch to the root user. The default password is Huawei@CLOUD8!.
  3. Run the source set_env command to import environment variables.

    For V100R006C10SPCXXX, the command output is as follows:

    please choose environment variable which you want to import:

    1. openstack environment variable (keystone v3)
    2. cps environment variable
    3. openstack environment variable legacy (keystone v

    please choose:[1|2|3]

    Enter 1 and press Enter. Enter the password of OS_USERNAME. The default password is FusionSphere123.

    Run the TMOUT=0 command to disable logout on timeout.

  4. Log in to FusionSphere, and choose Summary to view the management IP addresses of the control nodes.

  5. Copy the Storcli tool package to other nodes whose cache mode needs to be modified. (In the following command, XX.XX.XX.XX indicates the management IP address of the control node to be modified).

    scp storcli-007.0504.0000.0000-1.noarch.rpm fsp@XX.XX.XX.XX:/home/fsp/

  6. Log in to the control node as the fsp user, and run the su - root command to switch to the root user. The default password is Huawei@CLOUD8!.
  7. Go to the /home/fsp directory, and run the following command to install the Storcli tool:

    rpm -ivh storcli-007.0504.0000.0000-1.noarch.rpm

  8. Go to the /opt/MegaRAID/storcli directory, and check whether the cache mode of the RAID group is RWBD or RAWBD. If yes, go to the next step.

    ./storcli64 /c0/vall show

  9. Run the following command to change the cache mode of the RAID group to RWTD:

    ./storcli64 /c0/vall set wrcache=wt

  10. Run the following command to check whether cache mode is RWTD:

    ./storcli64 /c0/vall show

  11. Go to the /home/fsp directory, and run the following commands to uninstall and delete the tool package:

    rpm -e storcli-007.0504.0000.0000-1.noarch

    rm storcli-007.0504.0000.0000-1.noarch.rpm
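
When many nodes need to be checked, the cache code from the Cache column of the `./storcli64 /c0/vall show` output can be classified with a short helper. This sketch assumes the code is built from the abbreviations in the StorCLI legend (R or NR for the read policy, WB, AWB, or WT for the write policy, C or D for the I/O policy). The function names are hypothetical, and the helper does not run StorCLI itself.

```python
import re

# Read policy (R/NR), write policy (WB/AWB/WT), I/O policy (C/D).
CACHE_CODE = re.compile(r"^(N?R)(A?W[BT])([CD])$")

WRITE_POLICIES = {
    "WB": "Write Back",
    "AWB": "Always Write Back",
    "WT": "Write Through",
}


def write_policy(cache_code):
    """Return the write policy encoded in a cache code such as 'RWBD'."""
    match = CACHE_CODE.match(cache_code)
    if match is None:
        raise ValueError(f"unrecognized cache code: {cache_code!r}")
    return WRITE_POLICIES[match.group(2)]


def needs_write_through(cache_code):
    """Return True if the virtual drive still uses a write-back policy."""
    return write_policy(cache_code) != "Write Through"
```

Under these assumptions, RWBD and RAWBD are flagged for the change in step 9, while RWTD is reported as already using Write Through.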

Scenario 2: A RAID group has been created on the server, but the OS is not installed.

  1. Log in to the Device Manager screen of the LSI SAS3508 RAID controller card. For details, see step 1 in "Rectification method".
  2. Choose Device Manager and press Enter.

  3. Choose Avago MegaRAID <SAS3508> Configuration Utility and press Enter.

  4. Choose Main Menu and press Enter.

  5. Choose Virtual Drive Management and press Enter.

  6. Choose the virtual disk to be operated and press Enter.

  7. Choose Advanced... and press Enter.

  8. Choose Default Write Cache Policy, and press Enter.
  9. Choose Write Through and press Enter.

  10. Choose Apply Changes and press Enter. "The operation has been performed successfully" is displayed.

  11. Choose OK and press Enter. The configuration is complete.
  12. Use the KVM to forcibly restart the server.

Scenario 3: No RAID group is created on the server.

  1. Access the main menu screen by referring to step 1 and step 2 in scenario 2. Choose Configuration Management and press Enter.

  2. Choose Create Virtual Drive and press Enter.

  3. Set Write Policy to Write Through.

  4. Choose Save Configuration and press Enter. The confirmation screen is displayed.
  5. Choose Confirm and press Enter.
  6. Choose Yes and press Enter. "The operation has been performed successfully" is displayed.
  7. Choose OK and press Enter. The configuration is complete.
  8. Use the KVM to forcibly restart the server.
CPU Configuration Error Caused by the SAS3IRCU Tool When the Avago SAS3416IT RAID Controller Card Is Used
Problem Description
Table 5-230 Basic information

Item                   Information
Source of the Problem  2288H V5 equipped with Avago SAS3416IT
Intended Product       Avago SAS3416IT
Release Date           2018-05
Keyword                Avago SAS3416IT, SAS3IRCU, CPU configuration error

Symptom

Problem analysis:

  1. Log analysis:

    The sas3ircu command can be run properly. However, after the command is run, the CPU UCE and configuration error alarms are generated in the BMC SEL logs.

    A CPU VTD fatal error alarm is generated in the FDM logs.

  2. Cause analysis:

    According to the official Intel document, the system reports a VTD error when the CPU identifies an invalid request.

    Broadcom confirms that the SAS3IRCU tool does not support the Avago SAS3416IT RAID controller card. Therefore, the error is caused by an incompatibility between the tool and the controller card.

Conclusion and Solution

Solution:

Use the StorCLI tool to configure the RAID controller card. The problem is resolved.

Abnormal Presence Status of All Hard Disks Managed by the RAID Controller Card on the 2288H V5
Problem Description
Table 5-231 Basic information

Item                   Information
Source of the Problem  2288H V5
Intended Product       V5 servers
Release Date           2018-05-04
Keyword                BMC, hard disk

Symptom

When the 2288H V5 is running, the system reports an alarm indicating that the presence status of multiple hard disks is abnormal.

Key Process and Cause Analysis
  1. Symptom analysis:

    In the alarm logs, the presence status of disks 0 to 11 (on the front backplane) and disks 40 to 43 (on the rear I/O modules 1 and 2) are abnormal, and the hard disks are removed and then installed within 1 second. The alarms are generated in a sequence from disks 0 to 43.

  2. Log analysis:

    In the iBMC logs, no hardware fault alarm is recorded at the time of the problem. The RAID controller card logs contain only records from the previous day.

    The preliminary judgment is that the fault is not caused by the RAID controller card hardware.

    The version information of the server is as follows:

  3. Mechanism analysis:

    On the BMC, the abnormal presence status of hard disks is obtained by querying the CPLD registers. Abnormal change of the CPLD register address may cause this problem.

    The CPLD and BMC engineers confirm that the problem is not caused by the CPLD registers. The alarms are generated for all hard disks in sequence, which does not match the symptom of an abnormal CPLD register. Interference on the GPIO signal is also unlikely, because it would cause hard disks to be reported absent at random.

    Engineers from the related fields conclude that the problem is caused by a bug in the outdated BMC version. When the system is restarted, the BMC reconfigures the presence status of the hard disks. As a result, the records in the SEL logs are generated.

Conclusion and Solution

Conclusion:

The problem is caused by a bug in the outdated BMC version 2.70. When the BMC storage module restarts, the code reconfigures the hard disk presence status. As a result, the alarm records in the SEL logs are generated.

Solution:

Upgrade the BMC to 2.94 or later.

Experience

None

Note

None

Read/Write Error of Multiple Hard Disks on the 2288H V5
Problem Description
Table 5-232 Basic information

Item                   Information
Source of the Problem  2288H V5
Intended Product       V3 and V5 servers
Release Date           2018-05-24
Keyword                LSI SAS3108, buffer, I/O

Symptom

An NA customer reports that multiple hard disks on the 2288H V5 fail to be read and written, and "kernel: Buffer I/O error on device sdX" is displayed on the OS.

Key Process and Cause Analysis
  1. Log analysis:

    On the OS, "kernel: Buffer I/O error on device sdX" indicates that the hard disk fails to be read or written properly. Faults in the RAID controller card may cause the problem.

    No error or alarm related to hardware faults exists in the BMC logs.

    The logs of the RAID controller card show that the server uses the LSI SAS3108 RAID controller card, and the RAID mode is RAID+JBOD.

    RAID mode:

    Firmware version of the LSI SAS3108 RAID controller card:

    Driver version of the RAID controller card:

  2. Cause analysis:

    The firmware version must match the driver version for the LSI SAS3108 RAID controller card to support the RAID+JBOD mode.

    The firmware version of the RAID controller card must be 4.660.00-8102 or later.

    The driver version must match the OS.

    The version information of the RAID controller card shows that the firmware version meets the requirements, but the driver is an OS built-in driver and does not match the RAID controller card. Therefore, the RAID+JBOD mode is not supported. As a result, the hard disks managed by the RAID controller card cannot be read or written properly, and the error message is generated on the OS.
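
The firmware requirement above can be checked programmatically. The following sketch assumes that firmware versions of the form 4.660.00-8102 can be compared field by field as integers. That comparison scheme and the function names are assumptions, not a documented Broadcom rule.

```python
import re

MIN_FIRMWARE = "4.660.00-8102"  # minimum version stated in the text


def version_key(version):
    """Split a version such as '4.660.00-8102' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def firmware_meets_minimum(version, minimum=MIN_FIRMWARE):
    """Return True if version is equal to or later than the minimum."""
    return version_key(version) >= version_key(minimum)
```

The driver still has to be checked separately against the OS, because a matching firmware version alone does not enable the RAID+JBOD mode.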

Conclusion and Solution

Conclusion:

The version information of the RAID controller card shows that the firmware version meets the requirements, but the driver is an OS built-in driver and does not match the RAID controller card. Therefore, the RAID+JBOD mode is not supported.

Solution:

Upgrade the driver of the LSI SAS3108 RAID controller card to the matching version.

Experience

None

Note

None

Common Problems of the Management Software

Failed to Log In to the iBMC WebUI as User root After the iBMC is Upgraded
Problem Description
Table 5-233 Basic information

Item                   Information
Source of the Problem  RH2288H V3
Intended Product       Rack servers
Release Date           2018-01-29
Keyword                iBMC, root, login

Symptom

The iBMC on 20 RH2288 V3 servers is upgraded from 2.0.6 to 2.66 by using uMate. After the upgrade, the original root account (password: Root@123) cannot be used to log in to the iBMC WebUI and can only be used to log in to the iBMC background U-Boot.

Key Process and Cause Analysis

(1) Add an account and password to the U-Boot.

The operation fails.

(2) Upgrade the iBMC under the U-Boot.

After the upgrade:

  • The old root account (password: Root@123) cannot be used.
  • The initial root account (password: Huawei12#$) cannot be used.
  • The IP address of the server can be pinged.
  • You can log in to the U-Boot but cannot log in to the iBMC WebUI.

(3) Analyze the iBMC version.

iBMC 2.02 is correctly upgraded to the latest version 2.66. iBMC 2.02 does not have compatibility problems.

iBMC 2.66 is a customized version, whose default user name and password are different from those of common servers.

After the iBMC is upgraded to a customized version, the old and default accounts cannot be used.

Conclusion and Solution

Conclusion:

After the iBMC is upgraded to a customized version, the old and default accounts cannot be used.

Solution:

After the iBMC on 18 servers is rolled back from 2.66 to 2.62, the initial account can be used.

Two servers are restored to the factory settings on iBMC 2.66. Then, only the active iBMC can be rolled back to version 2.62. You can select either of the following solutions:

  • Run the ipmcset -d rollback command to perform an active/standby switchover, and roll back the iBMC.
  • Use the keyboard, video, and mouse (KVM) onsite and configure the account in the BIOS.

Experience

None

Note

None

iBMC Password Configuration on a V5 Server
Problem Description
Table 5-234 Basic information

Item                   Information
Source of the Problem  2288H V5
Intended Product       V5 servers
Release Date           2018-03-14
Keyword                iBMC, password

Symptom

A user forgets the iBMC password. The frontline engineers remotely instruct the user to change the iBMC password in the BIOS. The following problems are found:

1) When the iBMC password is changed on the BIOS setup screen, the default password Admin@9000 cannot be changed to Admin@1000, Admin@2000, or Admin@900.

2) The default password can be changed to Huawei12#$ or Admin12#$. After the setting is successful, the password of the iBMC can be changed to the default password Admin@9000.

Key Process and Cause Analysis

(1) Check the iBMC password rules.

According to the iBMC user guide, the iBMC password must comply with the following rules.

  • The iBMC password must pass the complexity check.

    When the password complexity check is enabled, the system checks whether the password meets the complexity requirements. If no, the setting fails. The complexity check is enabled by default.

  • The iBMC password must not be one contained in the weak password dictionary.

    If the password is in the weak password dictionary, the setting fails. The weak password dictionary is enabled by default.

The password complexity check and weak password dictionary are independent of each other.

(2) Analyze the cause.

Symptom 1: The default password Admin@9000 cannot be changed to Admin@1000, Admin@2000, or Admin@900.

According to the password complexity requirements, the new password must be different from the old one by at least two characters. The password complexity check is enabled by default, and the preceding passwords do not meet the complexity requirements.

Symptom 2: The default password Admin@9000 can be changed to Huawei12#$ or Admin12#$. After the setting is successful, the password of the iBMC can be changed to the default password Admin@9000.

The preceding passwords meet the complexity requirements.

However, it is abnormal to change the password back to the default password.

The default password Admin@9000 is in the weak password dictionary. The weak password dictionary is enabled by default. A new password cannot be changed back to the default password, unless the weak password dictionary is disabled manually.

The iBMC logs show that the weak password dictionary is enabled.

The following figure shows the weak password dictionary.

One remaining possibility is that the customer actually changed the password to admin@9000 (with a lowercase a), which is not in the weak password dictionary and meets the complexity requirements.

The customer confirms that the password was changed to admin@9000.
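
The two independent checks can be illustrated with the passwords from this case. This is a simplified sketch, not the iBMC implementation. It models only the "at least two different characters" rule cited above, by counting positional differences plus the length difference, and it uses a one-entry excerpt of the weak password dictionary. Both are assumptions.

```python
WEAK_PASSWORDS = {"Admin@9000"}  # assumed excerpt of the weak password dictionary


def differing_characters(old, new):
    """Count positions where the passwords differ, plus the length difference."""
    positional = sum(1 for a, b in zip(old, new) if a != b)
    return positional + abs(len(old) - len(new))


def password_change_allowed(old, new, weak_passwords=WEAK_PASSWORDS):
    """Return (allowed, reason) for changing the password from old to new."""
    if differing_characters(old, new) < 2:
        return False, "complexity check"
    if new in weak_passwords:
        return False, "weak password dictionary"
    return True, "ok"
```

Under these assumptions, changing Admin@9000 to Admin@1000 fails the complexity check, changing Huawei12#$ back to Admin@9000 is blocked by the weak password dictionary, and changing Huawei12#$ to admin@9000 is accepted.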

Conclusion and Solution

Conclusion:

A new iBMC password must meet the password complexity requirements and cannot be in the weak password dictionary.

Solution:

Change the password according to the password specifications.

Experience

When you change the iBMC password of a V5 server, the new password must meet both requirements: it must pass the complexity check and must not be in the weak password dictionary.

Note

None

Restoring the Administrator Account of a V5 Server
Problem Description
Table 5-235 Basic information

Item                   Information
Source of the Problem  1288H & 2288H V5
Intended Product       1288H & 2288H V5
Release Date           2017-12-01
Keyword                1288H & 2288H V5, administrator, restore, account

Symptom

The Administrator account of a V5 server is deleted from the iBMC WebUI. The customer cannot log in to the iBMC WebUI and needs to restore the account.

Key Process and Cause Analysis

If the only administrator account is deleted accidentally and you cannot log in to the iBMC WebUI, you can select either of the following solutions:

1) Restore the default iBMC configuration.

Connect the serial port on the server. Press Ctrl+B when prompted to enter the U-Boot main screen, and run the datafs_reset command to restore the factory settings of the server. In this way, the Administrator account and other iBMC configurations are restored.

2) Add a local user to the iBMC by delivering standard IPMI commands in the OS.
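
The source does not name a tool for the second option. As one possible approach, the sketch below builds the command sequence for the open-source ipmitool utility. The user ID slot, channel number, and function name are assumptions, and the commands require an in-band IPMI driver in the OS. The helper only returns the commands and does not execute them.

```python
def ipmi_add_admin_commands(user_id, name, channel=1):
    """Build ipmitool commands that create and enable an administrator user.

    user_id and channel are placeholders; check the free user slots and the
    LAN channel number on the actual server before running anything.
    """
    uid = str(user_id)
    return [
        ["ipmitool", "user", "set", "name", uid, name],
        # Without a password argument, ipmitool prompts for the password.
        ["ipmitool", "user", "set", "password", uid],
        ["ipmitool", "channel", "setaccess", str(channel), uid,
         "callin=on", "ipmi=on", "link=on", "privilege=4"],
        ["ipmitool", "user", "enable", uid],
    ]


for command in ipmi_add_admin_commands(3, "Administrator"):
    print(" ".join(command))
```

Privilege level 4 corresponds to the administrator level in IPMI. The new password must still satisfy the iBMC password rules.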

Conclusion and Solution

Conclusion:

The administrator account is deleted accidentally.

Solution:

Restore the default iBMC settings or add users through in-band management.

Experience

None

Note

None

Failed to Obtain the Temperatures of the Air Inlet, Air Outlet, and RAID Controller Card on the 2288H V5
Problem Description
Table 5-236 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 2288H V5 |
| Intended Product | 2288H V5 |
| Release Date | 2018-05-21 |
| Keyword | Air inlet temperature, air outlet temperature, RAID controller card temperature, alarm |

Symptom

Hardware configuration: 2288H V5

Symptom: On a 2288H V5 server, the system reports alarms indicating that it fails to obtain the temperatures of the air inlet, air outlet, and RAID controller card. Restarting the server does not resolve the problem.

Figure 5-312 2288H V5 alarm logs

Key Process and Cause Analysis

The BMC reads the temperatures of the air inlet, air outlet, and RAID controller card from the I2C link. Faulty components on the I2C link may cause the problem. Check the components on the I2C link.

The diagnosis procedure is as follows:

  1. Remove the cable from the left mounting ear to the mainboard (the air inlet temperature sensor is on the left mounting ear). Power on the server, and check whether the system still fails to read the air outlet and RAID controller card temperatures. If no, the problem is caused by the faulty left mounting ear or the faulty cable from the left mounting ear to the mainboard; if yes, go to the next step.
  2. Remove the signal cable from the hard disk backplane to the mainboard. Power on the server, and check whether the system still fails to read the air outlet and RAID controller card temperatures. If no, the problem is caused by the faulty hard disk backplane or the faulty cable from the hard disk backplane to the mainboard; if yes, go to the next step.
  3. Power off the server, and remove the RAID controller card. Power on the server, and check whether the system still fails to read the air outlet temperature. If no, the problem is caused by the faulty RAID controller card; if yes, the problem is caused by the faulty mainboard, and the mainboard needs to be replaced.
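The elimination order of the three steps can be written as a small decision function. This is a sketch of the logic only; the function and parameter names are made up for illustration, and each flag records whether the temperature read still fails after the corresponding component is removed.

```python
def isolate_i2c_fault(fails_without_ear_cable,
                      fails_without_backplane_cable,
                      fails_without_raid_card):
    """Return the suspected faulty component on the I2C link.

    Each argument is True if the temperature read STILL fails after the
    named component is disconnected or removed (steps 1 to 3 in order).
    """
    if not fails_without_ear_cable:
        return "left mounting ear or its cable to the mainboard"
    if not fails_without_backplane_cable:
        return "hard disk backplane or its signal cable to the mainboard"
    if not fails_without_raid_card:
        return "RAID controller card"
    # Nothing removable cleared the fault, so the mainboard is suspected.
    return "mainboard"
```

For example, `isolate_i2c_fault(True, True, False)` points to the RAID controller card, and three `True` values point to the mainboard.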
Conclusion and Solution

Conclusion:

Faulty components on the I2C link cause abnormal signals. As a result, the server fails to obtain the temperatures of the air inlet, air outlet, and RAID controller card.

Note

None

Common NIC Problems

Four LOMs on a V5 Server Are Abnormal
Problem Description
Table 5-237 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 1288H & 2288H V5 |
| Intended Product | 1288H & 2288H V5 |
| Release Date | 2017-11-24 |
| Keyword | 1288H & 2288H V5, LOM, MAC |

Symptom

The four LOMs on a 2288H V5 server are abnormal, that is, the network port indicators are steady on and the MAC addresses are displayed as FF:FF:FF:FF:FF:FF.

Key Process and Cause Analysis

The four LOMs on the V5 server are directly connected to the Platform Controller Hub (PCH). If the LOM MAC addresses are abnormal, check the firmware of the X722 NIC chip integrated in the PCH.

The firmware of the X722 in the PCH is faulty.

Upgrade the firmware.
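The MAC address symptom can be checked with a short script before and after the upgrade. The sketch below is an assumption-based helper, not a Huawei tool: it only flags the all-FF pattern described in this case, and the port names in the example are illustrative.

```python
def mac_is_unprogrammed(mac):
    """True if the MAC address is FF:FF:FF:FF:FF:FF (any case, ':' or '-')."""
    octets = mac.replace("-", ":").lower().split(":")
    return len(octets) == 6 and all(octet == "ff" for octet in octets)


def find_abnormal_ports(addresses):
    """Given {port name: MAC address}, return the ports showing the symptom."""
    return sorted(port for port, mac in addresses.items()
                  if mac_is_unprogrammed(mac))


if __name__ == "__main__":
    sample = {
        "eth0": "FF:FF:FF:FF:FF:FF",
        "eth1": "00:1b:21:aa:bb:cc",  # illustrative address
    }
    print(find_abnormal_ports(sample))
```

On Linux, the address of each port can be read from `/sys/class/net/<port>/address` and passed to `find_abnormal_ports`.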

Conclusion and Solution

Conclusion

The firmware of the X722 NIC chip integrated in the PCH on the mainboard is abnormal.

Solution

Use either of the following methods to upgrade the firmware:

  • Upgrade the firmware by upgrading the BIOS on the iBMC CLI.

ipmcset -t maintenance -d upgradebios -v /tmp/biosimage.hpm

  • Download the latest LANconfig tool from the Huawei support website and upgrade the firmware.

Obtain the upgrade tool based on the OS.

Download the NIC firmware upgrade package at:

http://support.huawei.com/enterprise/en/servers/1288h-v5-pid-21872252/software/22739006?idAbsPath=fixnode01%7C7919749%7C9856522%7C21782478%7C21782482%7C21872252

Remarks: After the firmware upgrade, you need to power off and then power on the server to make the upgrade take effect.

For details, see the HUAWEI Server Firmware Upgrade Guide.

Experience

None

Note

None

NIC Fails to Be Identified When a V5 Server Is Restarted After a Network Cable Is Reinstalled
Problem Description
Table 5-238 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 1288H & 2288H V5 |
| Intended Product | 1288H & 2288H V5 |
| Release Date | 2018-01-12 |
| Keyword | 1288H & 2288H V5, NIC |

Symptom

A Japanese NA customer installs Windows Server 2016 on a V5 server, and an X722 network interface card (NIC) is used for port binding in teaming mode. After the network cable connected to one of the bound network ports is removed and the OS is restarted, the X722 is abnormal. The system logs indicate that the MAC address is invalid and the X722 cannot be started.

Key Process and Cause Analysis

The current NIC firmware version has bugs. Upgrade the firmware to 3.51 or later.

Conclusion and Solution

Conclusion:

Bugs in the Intel X722 NIC firmware cause the NIC to go missing in some cases.

Solution:

Upgrade the X722 firmware to 3.51 or later.

Experience

None

Note

None

Driver of a LOM X722 Cannot Be Installed on a V5 Server
Problem Description
Table 5-239 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 2288H V5 |
| Intended Product | Rack servers |
| Release Date | 2018-01-29 |
| Keyword | V5, X722 driver |

Symptom

The NIC driver cannot be installed on a 2288H V5.

The driver cannot be installed by using Service CD or iDriver. In Windows Server 2012, only ports 0 and 1 of the LOM (4 x GE + 2 x 10GE) are identified. After a network cable is connected to a rear LOM GE port, the port indicator turns on. The NIC installed in a PCIe slot works properly (the NIC is identified and its driver is normal).

Key Process and Cause Analysis

(1) Install the NIC driver.

The X722 driver is not integrated into Service CD 2.0 or iDriver.

Install the X722 driver using the software driver package of the corresponding OS.

(2) Check the OS version.

The customer uses Red Hat Enterprise Linux (RHEL) 6.6 and Windows Server 2012 Datacenter Edition, which cannot be found in the Huawei Enterprise Server Compatibility Checker.

To use the software driver package, the RHEL version must be 6.9 or later, and the Windows Server version must be 2012 R2 or later.

(3) Check X722 features.

The following description is provided for the OS features supported by the X722 NIC.

Conclusion and Solution

Conclusion:

The customer uses Windows Server 2012, which cannot be found in the Huawei Enterprise Server Compatibility Checker.

Visit http://support.huawei.com/onlinetoolsweb/ftca/indexEn?serise=2 to use the Huawei Enterprise Server Compatibility Checker.

To use the software driver package, the Windows Server version must be 2012 R2 or later, and Windows Server 2012 is not supported.

Solution:

You are advised to install an OS that can be found in the Huawei Enterprise Server Compatibility Checker.

Experience

None

Note

None

25G NIC PXE Fails on a V5 Server
Problem Description
Table 5-240 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 2288H V5 |
| Intended Product | V5 servers |
| Release Date | 2018-02-08 |
| Keyword | PXE |

Symptom

PXE boot fails over the 25G network ports on thirty V5 servers onsite. The following figure shows the cabling. The out-of-band IP address can be obtained only when the network cable is connected to the management port.

Key Process and Cause Analysis

(1) Check the 25G NIC model.

The BOM number of the 25G NIC is 06310106, and the description is "Network Card,25 Gigabit,64bit,SFP28,2 ports,PCIE 3.0 X8-15b3-1015-2,No Driver CD".

(2) Compare the PXE boot in different BIOS modes.

On the same server, use the 25G NIC for PXE in each BIOS boot mode.

  1. NIC PXE fails in UEFI mode.
  2. NIC PXE is successfully executed in Legacy mode.

(3) Determine the causes.

NIC PXE fails because the onsite NIC firmware has bugs. Only the latest version of the 25G NIC firmware supports PXE in UEFI mode.

The old firmware version of the NIC is 14.18.2000, which does not support PXE in UEFI mode.

After the NIC firmware is upgraded to version 14.21.1000, the problem is solved.

The firmware of the 25G NICs on the 30 servers is 14.18.2000. Upgrade the NIC firmware to the latest version.

(4) Upgrade the NIC firmware.

After the firmware is upgraded, NIC PXE is successful.

Conclusion and Solution

Conclusion:

The 25G NIC firmware version is too old and does not support PXE in UEFI mode.

Solution:

Upgrade the 25G NIC firmware to 14.21.1000 or later.
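When many servers are involved, the "14.21.1000 or later" rule can be checked in a script. The sketch below assumes that firmware versions are dot-separated integers, as in this case; the function names are illustrative, and how the installed version is collected from each server is outside its scope.

```python
def parse_version(version):
    """Turn a dotted version such as '14.18.2000' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(current, minimum="14.21.1000"):
    """True if the installed firmware is older than the minimum version."""
    return parse_version(current) < parse_version(minimum)


if __name__ == "__main__":
    for installed in ("14.18.2000", "14.21.1000"):
        print(installed, "needs upgrade:", needs_upgrade(installed))
```

Comparing integer tuples avoids the pitfall of comparing version strings as text, where "14.9" would sort after "14.21".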

Experience

None

Note

None

NIC Indicator Is Off After the SP310 NIC on the 2288H V5 Is Connected to the Switch
Problem Description
Table 5-241 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | SP310 (82599ES) 10GE NIC |
| Intended Product | 82599ES 10GE NIC |
| Release Date | 2018-05-23 |
| Keyword | SP310 NIC, 2288H V5 |

Symptom

During the on-site virtualization selection test on the 2288H V5, when the SP310 NIC is connected to the switch on the private network, the NIC indicator is off. The optical modules are 10GE modules delivered with the device.

Key Process and Cause Analysis
  1. Check whether the NIC is detected by the system. Run the ifconfig -a command on the OS. The command output shows that eight network ports are detected.

  2. Run the lspci | grep -i eth command. The command output shows that eight physical network ports are detected, which is consistent with the feedback from the on-site engineer.

  3. The preceding steps show that the NIC is detected by the system. Run the ifconfig command to check the status of all network ports. The command output shows that only eth0 and eth2 are up and other network ports are down.

  4. Run the ethtool eth0 and ethtool eth2 commands to check whether the physical link is normal. If the value of Link detected is yes, the physical links are normal.

  5. Run the ifconfig ethX up command to bring up network ports other than eth0 and eth2. Check the NIC indicator. The indicator is on.
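The checks in steps 3 to 5 can be sketched as follows. This is a hedged illustration: the port names mirror this case, the helper names are made up, and the parser relies only on the "Link detected:" line that ethtool prints.

```python
def ports_to_bring_up(all_ports, up_ports):
    """Ports listed by 'ifconfig -a' that are missing from plain 'ifconfig'."""
    return sorted(set(all_ports) - set(up_ports))


def link_detected(ethtool_output):
    """Parse 'ethtool ethX' output; True if it reports 'Link detected: yes'."""
    for line in ethtool_output.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "Link detected":
            return value.strip() == "yes"
    return False


def bring_up_commands(ports):
    """Build the 'ifconfig ethX up' command for each port that is down."""
    return [["ifconfig", port, "up"] for port in ports]


if __name__ == "__main__":
    all_ports = ["eth%d" % i for i in range(8)]
    down = ports_to_bring_up(all_ports, ["eth0", "eth2"])
    for cmd in bring_up_commands(down):
        print(" ".join(cmd))
```

Bringing a port up this way does not persist across restarts; the network port configuration file of the OS is needed for that.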
Conclusion and Solution

Conclusion:

The NIC is detected, and the detected number of network ports is the same as the actual number of network ports. However, some network ports are down. Run the ifconfig ethX up command or use the network port configuration file to bring up the network ports.

Experience

For problems related to faulty 10GE NICs, see the reference case on: http://3ms.huawei.com/hi/group/1004825/thread_6974697.html?mapId=8672477&for_statistic_from=my_threads_group_forum

Note

None

Failed Boot from PXE After the Firmware Is Updated on the 2288H V5
Problem Description
Table 5-242 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | SP310 (82599ES) 10GE NIC |
| Intended Product | 82599ES 10GE NIC |
| Release Date | 2018-05-23 |
| Keyword | SP310 NIC, 2288H V5 |

Symptom

The server fails to obtain the out-of-band IP address. After the mainboard is replaced, the system can be booted from PXE. However, miniOS detects that the NIC version is incorrect, and the firmware of the NIC needs to be upgraded. The firmware of the NIC is integrated into the BIOS. To upgrade the firmware, the BIOS needs to be upgraded to 0.57 by using the CLI (the CLI version is at least 2.90). Upgrade the BMC to 2.94, upgrade the BIOS to 0.57, and then roll back the BMC to 2.58. After the firmware is upgraded, the system fails to be booted from PXE.

Key Process and Cause Analysis

Key process:

  1. Initially, the system can be booted from PXE. After a series of upgrade operations, the server fails to be booted from PXE. Therefore, the BIOS version may cause the problem. After the BIOS is rolled back to 0.59, the problem persists.
  2. After the mainboard is replaced, the system can be booted from PXE. However, the NIC version does not meet the customer's requirements. After the NIC firmware is upgraded, the system fails to be booted from PXE. Therefore, the problem is caused not by the server hardware but by the software settings.
  3. Compare the BIOS settings before and after the fault occurs. No abnormal setting item is found, and the PXE network port is enabled.

  4. The customer reports that the BIOS is customized. Check BIOS customization settings. In settings related to PXE, the LOM is disabled. Therefore, the PXE device is not detected, and the PXE option is unavailable on the BIOS screen.

  5. After the default BIOS settings are loaded, the system still fails to detect the PXE device until the settings are saved. Press F10 to save the settings. After the settings are saved, the problem is resolved.

Cause:

The PXE device is disabled during BIOS customization. The problem is resolved after the PXE device is enabled in the BIOS.

Common Problems of Configuration and Installation

AC Power Failure Occurs During the BIOS Upgrade of a V5 Server and the BIOS Is Restored to the Default Configuration
Problem Description
Table 5-243 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 2288H V5 |
| Intended Product | V5 servers |
| Release Date | 2018-04-12 |
| Keyword | BIOS, factory settings |

Symptom

The BIOS parameters of all V5 servers of an NA customer use customized configurations.

When the customer upgraded the BIOS of a V5 server on the iBMC WebUI, the power cable was accidentally removed. As a result, an AC power failure occurred. After the restart, all BIOS parameters were restored to the default settings.

Key Process and Cause Analysis

During the BIOS upgrade, an AC power failure occurs and the upgrade is interrupted unexpectedly. As a result, programming of the BIOS flash is interrupted abnormally, and the ME zone in the BIOS flash may be damaged.

After the server is powered on again, the BIOS upgrade process continues, and the iBMC will query the ME status. When the iBMC fails to communicate with the ME, all BIOS zones are erased during the upgrade, including the variable zone of the BIOS configuration. As a result, the BIOS parameters are restored to the default settings.

Conclusion and Solution

Conclusion:

It is normal that the BIOS is restored to the factory settings after the BIOS upgrade is abnormally interrupted.

Solution:

  • If only a few servers have the problem, you are advised to manually restore the customized configuration.
  • If many servers have the problem, you are advised to export the customized BIOS configuration from a functioning server and import it to the faulty servers.

Procedure:

1) Export the customized BIOS configuration of a functioning server.

2) Import the exported configuration file to faulty servers, and make the configuration take effect.

For details, see section 3.7.10 "Import/Export" in the related iBMC user guide.

Experience

None

Note

None

Common Performance Problems

Inconsistent Performance Test Results of the 2288H V5 and CH121 V5 Caused by Different DIMM Positions
Problem Description
Table 5-244 Basic information

| Item | Information |
| --- | --- |
| Source of the Problem | 2288H V5 |
| Intended Product | 2288H V5, E9000 |
| Release Date | 2018-05-30 |
| Keyword | Memory, performance |

Symptom

A customer tests the performance of the 2288H V5 and the E9000 CH121 V5, and reports that the performance of the 2288H V5 is lower than that of the CH121 V5.

  1. When the default BIOS configuration (custom mode, non-performance priority) is used, the performance of the 2288H V5 decreases sharply after the service pressure reaches 1500 CAPS. This problem does not exist on the CH121 V5.
  2. After the 2288H V5 is set to performance priority mode, the preceding performance deterioration problem is resolved. However, the performance of the 2288H V5 in performance priority mode is still lower than the performance of the CH121 V5 in custom mode. When the service pressure exceeds 1900 CAPS, the 2288H V5 shows severe jitters; when the service pressure is 2100 CAPS, the CH121 V5 idle rate is 12.4%, and the 2288H V5 idle rate is 3.3%.
Key Process and Cause Analysis

Compare the hardware configuration of the 2288H V5 and CH121 V5. The DIMM positions are different.

The positions of the CH121 V5 DIMMs meet the requirements in Huawei Server Product Memory Configuration Assistant. However, on the 2288H V5, the DIMM in the DIMM020 slot is incorrectly inserted into the DIMM030 slot, and the DIMM in the DIMM120 slot is incorrectly inserted into the DIMM130 slot.

Adjust the DIMM positions of the 2288H V5 to be the same as those of the CH121 V5, and test the service performance. The test results of the 2288H V5 and the CH121 V5 are largely consistent. The problem is resolved.

The differences between the two DIMM insertion methods are as follows:

  1. For different DIMM positions, the system uses different methods to divide the memory space into regions.
  2. When the DIMM000/010/030 slots are used (not recommended), the memory space is divided into two regions. The two regions are accessed by different CPUs, and the influence on the performance is uncontrollable.
  3. When the DIMM000/010/020 slots are used (recommended), only one memory region exists, and all DIMMs can be accessed at the same time.

When the DIMMs are evenly inserted, the DIMMs are uniformly accessed.

  1. All memory addresses are stored in one memory region.
  2. The DIMMs are accessed in sequence according to the addresses stored in one region.
  3. The software can store data in any memory locations and obtain the same high bandwidth.

When the DIMMs are unevenly inserted, the system divides the memory space into multiple regions.

  1. The iMC divides the memory space into independent regions to achieve optimal performance.
  2. Software performance varies depending on the memory regions that are accessed.
  3. The performance is unpredictable.
Conclusion and Solution

Conclusion:

On the 2288H V5, when the DIMM000/010/030 and DIMM100/110/130 slots are used, the memory space is divided into two regions. The two regions are accessed by different CPUs, and the influence on the performance is uncontrollable.

Solution:

Use DIMM000/010/020 and DIMM100/110/120 slots according to the requirements of Huawei Server Product Memory Configuration Assistant.
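A populated-slot list can be compared against the recommended layout before performance testing. The sketch below covers only the configuration in this case (three DIMMs per CPU); other DIMM counts follow different rules in the Memory Configuration Assistant, and the function name is illustrative.

```python
# Recommended slots for three DIMMs per CPU, as stated in this case.
RECOMMENDED_SLOTS = {
    "CPU1": {"DIMM000", "DIMM010", "DIMM020"},
    "CPU2": {"DIMM100", "DIMM110", "DIMM120"},
}


def check_dimm_layout(populated_slots):
    """Compare populated slots with the recommended layout.

    Returns the slots that should be empty but are populated ('unexpected')
    and the slots that should be populated but are empty ('missing').
    """
    expected = set().union(*RECOMMENDED_SLOTS.values())
    populated = set(populated_slots)
    return {
        "unexpected": sorted(populated - expected),
        "missing": sorted(expected - populated),
    }


if __name__ == "__main__":
    onsite = ["DIMM000", "DIMM010", "DIMM030", "DIMM100", "DIMM110", "DIMM130"]
    print(check_dimm_layout(onsite))
```

For the onsite layout in this case, the check reports DIMM030 and DIMM130 as unexpected and DIMM020 and DIMM120 as missing.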

Experience

The previous BIOS version has a problem. Therefore, the DIMM positions of the 2288H V5 are temporarily changed in production. Engineers from the production, maintenance, BIOS, and performance test departments confirm that the problem has been resolved in the current BIOS version. The 2288H V5 can use the recommended DIMM insertion method. The recommended DIMM insertion method was implemented on April 20, 2018.

In the future, if similar problems occur, the influence of the solution (especially the impact on performance) should be fully evaluated.

Note

None

Updated: 2019-02-25

Document ID: EDOC1000041338