Huawei Server Maintenance Manual 09

Power-On and Power-Off Problems

CH121 V3 Power-On Failure Without Power-Off Timeout or Abnormal Power-Off Records
Problem Description
Table 5-35 Basic information

Source of the Problem: Live network
Intended Product: E9000 CH121 V3
Release Date: 2017-03
Keyword: CPU 1 status, CPU 2 status

Symptom
  • Hardware configuration:

    CH121 V3

  • Symptom:

    A newly installed CH121 V3 fails to be powered on, and the iBMC logs do not contain power-on timeout or abnormal power-off records.

Key Process and Cause Analysis
  1. The CPU is improperly installed.
  2. The CPU is faulty.
  3. The PCH status is abnormal.
  4. The mainboard hardware is faulty.

    Fault Locating:

    1. Check that the CPUs are in position and that no pins on the CPU sockets of the mainboard are bent. Check the sensor_info.txt file in the dump_info\AppDump\sensor_alarm directory. If a CPU is in position, its status value is 0x8080. If the value is 0x8000, the CPU is not in position.

    2. Swap the two CPUs and check again.
    3. Clear the CMOS by running the ipmcset -d clearcmos command to restore the default configuration, and check whether the CH121 V3 can be powered on. If it cannot, refresh the BIOS.
    4. If the fault persists, collect logs and send them to R&D engineers for analysis.
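For reference, a minimal sketch of steps 1 and 3 is shown below. The sensor names and values are illustrative (not taken from this case); the file path is from the collected log package, and ipmcset is run on the iBMC CLI.

grep -i "CPU" dump_info/AppDump/sensor_alarm/sensor_info.txt
# CPU1 Status ... 0x8080    <- CPU in position
# CPU2 Status ... 0x8000    <- CPU not in position
ipmcset -d clearcmos        # restore the default BIOS configuration, then retry the power-on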
Conclusion and Solution

Conclusion

The CPUs were not in position because they had been removed by the customer. After the CPUs were reinstalled, the problem was resolved.

Solution

Reinstall the CPUs.

Experience

None

Note

The fault locating method in this case also applies to other Huawei servers.

CH121 V3 Fails to Start and Logs Show that Switchover Repeats Between M1 and M6
Problem Description
Table 5-36 Basic information

Source of the Problem: Live network
Intended Product: All mainboards
Release Date: 2017-08-07
Keyword: FRU Hot Swap, M7, M1, M6

Symptom
  • Hardware configuration:

    CH121 V3

  • Symptom:

    After the CH121 V3 is powered on, only "NO Signal" is displayed on the KVM. The SEL log shows that switchover repeats between M1 and M6.

Key Process and Cause Analysis

1. The mainboard is faulty.

2. A CPU is faulty.

Cause Analysis:

  1. Check the SEL log (AppDump\sensor_alarm\) to ensure that no alarm is generated. If the power supply is abnormal, the mainboard will detect and report an alarm.

  2. View maintenance_log (\LogDump\) to check for errors.
  3. Leave only the CPU in socket 1 installed and check for a CPU fault.

    a. After CPU 2 is removed, power on the board again to check whether it can start. If the board starts successfully, CPU 1 is normal. If the board fails to start, remove CPU 1, store it properly, and then check CPU 2.
    b. Remove CPU 1 and install CPU 2 in the CPU 1 socket. If the board starts successfully, CPU 2 is normal.
    c. If the board fails to start with either CPU installed, collect logs and send them to R&D engineers for analysis. If the problem is caused by a CPU, replace the CPU.
Conclusion and Solution

Conclusion

As indicated by the minimum configuration result, CPU 2 is faulty.

Solution

Replace the faulty CPU.

Experience

N/A

Note

The fault locating method in this case also applies to other servers.

Uncorrectable CPU Error Occurs During CH121 V3 OS Loading
Problem Description
Table 5-37 Basic information

Source of the Problem: Live network
Intended Product: CH121 V3
Release Date: 2017-11-04
Keyword: Uncorrectable CPU error

Symptom

Hardware configuration:

CH121 V3

Software configuration:

OS: Oracle Linux 7.4

BIOS: 3.63

CPU: E5-2637 v4

Symptom:

Multiple CH121 V3 boards are configured with one E5-2637 v4 CPU each. When Oracle Linux 7.4 is loading, there is a possibility that the CPU reports an uncorrectable CPU error and the OS does not respond. The CH121 V3 boards with other configurations in the same chassis do not report any errors.

The following figure shows an OS loading failure.

The following figure shows alarms on the iBMC.

Key Process and Cause Analysis

The same problem occurs on all boards with this configuration onsite, and boards with other configurations work properly. In addition, the boards stop responding only while the OS is loading, and restarting a board rectifies the fault; if the hardware were faulty, a restart would not rectify it. Therefore, check whether the OS is compatible with the CPU and whether the CPU has known defects such as an incompatible microcode version.

When the microcode version of the CPU is earlier than 0x0b000021, the OS does not respond when the CPU microcode is upgraded.

The following figure shows the microcode defect information.

The microcode version integrated in the BIOS (3.63) is 0x0b000020, and the microcode version integrated in Oracle Linux 7.4 is 0x0b000021. The microcode version integrated in the OS is higher than that integrated in the BIOS. Therefore, the OS will upgrade the CPU microcode version during startup. As a result, the CPU reports an uncorrectable error and the OS does not respond.
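A quick way to confirm the microcode version from a running board is shown below (a minimal sketch; the output value is illustrative):

grep -m1 microcode /proc/cpuinfo
# microcode       : 0xb000021    <- microcode currently loaded on CPU 0
dmesg | grep -i microcode         # shows whether the OS reloaded a newer microcode during boot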

Conclusion and Solution

Conclusion:

The CPU has bugs. During OS startup, the CPU microcode upgrade is triggered. As a result, an uncorrectable CPU error occurs.

Solution:

Upgrade the BIOS to 3.66 or later.

Experience

None

Note

None

Common Problems of RAID Controller Cards and Hard Disks

Red Indicator of the CH242 V3 DDR4
Problem Description
Table 5-38 Basic information

Source of the Problem: Live network
Intended Product: E9000 CH242 V3 DDR4
Release Date: 2018-05
Keyword: CH242 V3 DDR4, compute node, red indicator

Symptom

A customer reports that the red indicator of the CH242 V3 DDR4 is intermittently on.

Key Process and Cause Analysis

Key process:

  1. Check the hardware logs. The compute node model is CH242 V3 DDR4.

  2. The hardware structure provided by the CH242 V3 DDR4 Compute Node User Guide shows that the red indicator is located on the RAID controller card.

  3. Check the records related to the RAID controller card in the OS logs. The model of the RAID controller card is SAS3108, and the cache size is 2048 KB.

    The RAID controller card is normal.

    The current write policy of the RAID array is Write Back, that is, data is written to the cache and then to the disk.

  4. The LSI SAS3108 RAID controller card has a Write_Pending indicator. When data in the cache is not written to the disk, the red indicator is on.

  5. The problem video shows that the two red indicators are intermittently on. The red indicators reported by the customer are Write_Pending indicators. Other compute nodes in the chassis are configured with LSI SAS2308 RAID controller cards. The LSI SAS2308 RAID controller card does not have a cache. Therefore, the red indicator situation does not occur on other compute nodes.
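A minimal sketch of checking the cache size and write policy from the OS is shown below (StorCLI native syntax is assumed; the controller index is illustrative):

./storcli64 /c0 show all | grep -i "memory size"     # RAID controller card cache size
./storcli64 /c0/vall show                            # a Cache value containing "WB" means write-back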

Cause:

When data in the cache of the LSI SAS3108 RAID controller card is not written to the disk, the Write_Pending indicator is on. The system is normal.

Note the following about related RAID controller cards:

  1. The LSI SAS2308, LSI SAS3008, and Avago SAS3408 RAID controller cards do not have a cache. Therefore, the red indicator situation does not occur on them.
  2. The LSI SAS2208 and Avago SAS3508 RAID controller cards do not have the Write_Pending indicator.

Locating the Slot of a Slow Hard Disk in a Big Data Service Scenario
Problem Description
Table 5-39 Basic information

Source of the Problem: Servers
Intended Product: Servers
Release Date: 2018-01-30
Keyword: RAID, lsscsi, storcli

Symptom

The V3 servers are equipped with the LSI SAS2208, LSI SAS2308, LSI SAS3008, or LSI SAS3108 RAID controller card. This section describes how to use the drive letter to locate the hard disk slot on Linux in a big data service scenario.

Key Process and Cause Analysis

LSI SAS2308/LSI SAS3008+Linux:

Background: Locate the slot of a slow hard disk in a big data service scenario.

On the OS:

Run the df command to query the drive letter corresponding to the abnormal file system.

Query the serial number of the hard disk.
  1. Use the SMART information to query the device serial number.

    On the OS:

    Run the smartctl -a /dev/sdb command. (The smartctl tool is required on the system. It is generally installed by default with the Linux OS.)

    The serial number of the hard disk corresponding to the sdb drive letter is 9XG50X1F.

    NOTE:

    You can use the drive letter to query the hard disk serial number only in single-disk RAID 0 and hard disk pass-through scenarios. When multiple hard disks exist under one VD, they all correspond to one drive letter. In such a scenario, do not use the drive letter to query the device serial number.

  2. Use the serial number to query the slot number.

    On the OS:

    1. Go to the \InfoCollect_Linux\modules\raid\RAIDtool\3008 directory where the tool is located, and run the chmod +x sas* command to grant execute permission.
    2. Run the ./sas3ircu 0 display command.
    3. In the command output, query the slot number by using the serial number obtained in step 1. (A combined sketch of these steps follows this list.)

      NOTE:

      You can also query the slot number by searching the raid folder in the collected log package. However, for servers (such as the X6800) that are configured with SoftRAID and a RAID controller card, the information of the RAID controller card may not exist in the log files. Therefore, using the preceding commands is more accurate.
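A combined sketch of the two steps above is shown below (the drive letter, serial number, and controller index are illustrative):

df -h                                    # map the abnormal file system to a drive letter, for example /dev/sdb
smartctl -a /dev/sdb | grep -i serial
# Serial Number:    9XG50X1F
chmod +x sas*                            # in the RAIDtool/3008 directory of the collection tool
./sas3ircu 0 display | grep -i -B8 9XG50X1F    # the matching device block lists the slot number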

Special situation:

  1. Failed to obtain the SMART information.

    On the OS:

    Check the messages logs. The messages logs show that the sdm disk in slot 11 has task abort records.

    On the OS, run the lsscsi command to view the hard disk information. In the displayed information, the first column shows the [H:C:D:L] numbers of the hard disks. Use the [H:C:D:L] number to query the disk slot number.

    Query method:

    For example, if the [H:C:D:L] number is [0:0:11:0], the meanings of the numbers are as follows:

    H: indicates the HBA number. For RAID controller cards, the number 0 indicates an onboard RAID controller card. If the system has only one RAID controller card, the H value is 0.

    C: indicates the channel number. The default value is 0. You can ignore this number.

    D: indicates the device number. If a RAID controller card is used, the value indicates the VD number. For [0:0:11:0], view the VD 11 information of the RAID controller card. The VD 11 information shows that the slot number is 11.

    For example, the sda disk is a RAID 1 array using slot 0 and slot 1, and its [H:C:D:L] number is [0:2:0:0]. The sda disk is the boot partition. The [H:C:D:L] number of the sdb disk is [0:2:2:0], so the sdb disk is in slot 2.

    In single-disk RAID 0 scenarios, one slot is used by one VD. Therefore, you can use this method to query the slot number.

    In the hard disk pass-through scenario where an LSI SAS3008 RAID controller card is used, the device number is the slot number. For example, if the [H:C:D:L] value is [0:0:11:0], the hard disk is in slot 11.

    L: indicates the LUN number, which identifies a logical unit on the SCSI device. LUNs are not used for local storage, and the default value is 0.
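A minimal sketch of the lsscsi-based lookup is shown below (the device entries are illustrative):

lsscsi
# [0:2:0:0]   disk  ...  /dev/sda    <- D=0: VD 0 (RAID 1 on slots 0 and 1), boot disk
# [0:2:2:0]   disk  ...  /dev/sdb    <- D=2: VD 2, single-disk RAID 0 in slot 2
# [0:0:11:0]  disk  ...  /dev/sdm    <- D=11: pass-through disk in slot 11 (LSI SAS3008)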

LSI SAS2208/LSI SAS3108+Linux

Query the serial number of the hard disk.

  1. Use the SMART information to query the device serial number.

    On the OS:

    Run the smartctl -a /dev/sdb command. (The smartctl tool is required on the system. It is generally installed by default with the Linux OS.)

    The serial number of the hard disk corresponding to the sdb drive letter is 9XG50X1F.

    NOTE:

    You can use the drive letter to query the hard disk serial number only in single-disk RAID 0 and JBOD scenarios. When multiple hard disks exist under one VD, they all correspond to one drive letter. In such a scenario, do not use the drive letter to query the device serial number.

  2. Use the serial number to query the slot number.
    1. Use the StorCLI tool. If the tool cannot be executed, run the chmod +x storcli command to grant execute permission.
    2. Run the storcli64 -PDList -aALL command.
    3. In the command output, query the slot number by using the serial number obtained in step 1. (A combined sketch follows the FAQ below.)

      NOTE:

      You can also query the slot number by searching the raid folder in the collected log package. However, for servers (such as the X6800) that are configured with SoftRAID and a RAID controller card, the information of the RAID controller card may not exist in the log files. Therefore, using the preceding commands is more accurate.

  3. Special situation:

    Failed to obtain the SMART information.

    Use the lsscsi command to locate the hard disk slot as described in the preceding method.

  4. Use the logs to query the slot number of the disconnected hard disk.

    Example 1

    a. Collect OS logs from the customer, and check the messages and dmesg logs. For example, the drive letter of the faulty disk reported by the customer is sdu.

    Search for sdu in the dmesg logs. In the dmesg logs, the [H:C:D:L] information of the sdu disk is [0:2:21:0]. Search for sdu in the messages logs. No record about the sdu disk is found.

    b. View the VD information in the RAID controller card logs. The logs list VD 0, VD 2, VD 3, VD 21, and VD 23. Because the device number of the sdu disk is 21, check the VD 21 information, which shows that the sdu disk is in slot 21.

    FAQ:

    Obtain the tools from the following websites:

    http://support.huawei.com/enterprise/en/software/22747368-SW1000282789

    The directory of the tool is \home\Project\tools\lsi3008\linux\sas3irc.

    http://support.huawei.com/enterprise/en/software/22400698-SW1000265416

    The directory of the tool is \InfoCollect_Linux\modules\raid\RAIDtool\3008.
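Returning to step 2 and the log-based lookup in step 4, a combined sketch is shown below (the serial number, drive letter, and output lines are illustrative):

chmod +x storcli64
./storcli64 -PDList -aALL | grep -iE "slot number|inquiry data"
# Slot Number: 2
# Inquiry Data: ... 9XG50X1F ...
grep sdu dmesg                           # dmesg file from the collected OS logs
# sd 0:2:21:0: [sdu] Attached SCSI disk  <- D=21, so check the VD 21 information in the RAID controller card logs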

Conclusion and Solution

None

Experience

None

Note

None

Red Indicator of the LSI SAS3108
Problem Description
Table 5-40 Basic information

Source of the Problem: E9000 CH242 V3 DDR4
Intended Product: LSI SAS3108 RAID controller card
Release Date: 2018-03-31
Keyword: LSI SAS3108, RAID controller card, red indicator

Symptom

A customer reports that the front red indicators of two E9000 compute nodes are on.

Key Process and Cause Analysis

Key process:

  1. Check the hardware logs. The compute node model is CH242 V3 DDR4.

  2. The hardware structure provided by the CH242 V3 DDR4 Compute Node User Guide shows that the red indicator is located on the RAID controller card.

  3. Check the records related to the RAID controller card in the OS logs. The model of the RAID controller card is SAS3108, and the cache size is 2048 KB.

    The RAID controller card is normal.

    The current write policy of the RAID array is Write Back, that is, data is written to the cache and then to the disk.

  4. The LSI SAS3108 RAID controller card has a Write_Pending indicator. When data in the cache is not written to the disk, the red indicator is on.

    The problem video shows that the two red indicators are intermittently on. The red indicators reported by the customer are Write_Pending indicators.

  5. Other compute nodes in the chassis are configured with LSI SAS2308 RAID controller cards. The LSI SAS2308 RAID controller card does not have a cache. Therefore, the red indicator situation does not occur on other compute nodes.
Conclusion and Solution

When data in the cache of the LSI SAS3108 RAID controller card is not written to the disk, the Write_Pending indicator is on. The system is normal.

Note
  1. The LSI SAS2308, LSI SAS3008, and Avago SAS3408 RAID controller cards do not have a cache. Therefore, the red indicator situation does not occur.
  2. The LSI SAS2208 RAID controller card does not have the Write_Pending indicator.
  3. When data in the cache of the LSI SAS3108 RAID controller card is not written to the disk, the Write_Pending indicator is on.
  4. The Avago SAS3508 RAID controller card does not have the Write_Pending indicator.

Problems of HBAs, FC/FCoE Switch Modules and iSCSI Switch Modules

Failed to Set Up a CX710 Switch Stack
Problem Description
Table 5-41 Basic information

Source of the Problem: Live network
Intended Product: E9000 CX710
Release Date: 2018-05
Keyword: CX710, switch module, stack

Symptom

During the factory installation, a CX710 switch module is configured for stacking and restarted. After the other CX710 switch module is configured and restarted, the stack fails to be created, and the second switch module restarts repeatedly.

Key Process and Cause Analysis

Cause analysis:

CX710 6.29 and 6.30 support port mode delivery. During the startup of a switch module, port mode delivery is normally performed twice: the first time when the switch module is added to the stack, and the second time during the configuration restoration stage. In these versions, the first port mode delivery is disabled.

The first switch module enters the configuration restoration stage and delivers its port mode. When the second switch module restarts, its first port mode delivery does not take place because it is disabled. Therefore, the port modes of the two switch modules differ, and the stack fails to be created. After the second switch module enters the configuration restoration stage and delivers its port mode, the port modes of the two switch modules become consistent. Master election begins, and the switch module with the lower priority restarts. After that switch module restarts, the process repeats. As a result, the stack fails to be created and the second switch module restarts repeatedly.

Trigger conditions:

  1. The version of the CX710 switch modules is 6.29 or 6.30.
  2. After one switch module is configured and restarted, the other switch module is configured and restarted, and the interval between the two restart operations is more than 3 minutes.
Conclusion and Solution

Rectification measure:

Two solutions are available:

Solution 1: Power off and then power on the two switch modules to restore the switch modules.

Log in to the WebUI of the MM910 management module, and power off and then power on the two CX710 switch modules simultaneously. The problem is resolved after the switch modules are powered on.

Solution 2: Restore the switch modules by using the CLI.

  1. Log in to the CLI of the switch module over SOL.

  2. Delete the stack ports of the normal switch module. For example, on the switch module in slot 1 (1E), delete the 18/1 and 18/2 stack ports.

  3. Configure the ports of the normal switch module as stack ports. For example, on the switch module in slot 1 (1E), configure the 18/1 and 18/2 ports as stack ports.

  4. After the negotiation is complete, reset the other switch module to resolve the problem.

Solution:

This problem is resolved in the next baseline version, which is planned for release around June 30, 2018.

Failed to Access the MZ512 BIOS by Pressing Ctrl+P (Fixed Network OSS)
Problem Description
Table 5-42 Basic information

Source of the Problem: Live network
Intended Product: E9000 MZ512
Release Date: 2018-05
Keyword: MZ512, multi-channel, Ctrl+P message not displayed

Symptom

During a new installation on the live network, the uTraffic deployment guide instructs the installer to enter the MZ512 NIC BIOS by pressing Ctrl+P to configure the multi-channel function. However, the Ctrl+P prompt is not displayed during startup, and pressing Ctrl+P does not open the NIC BIOS.

Key Process and Cause Analysis

Problem analysis:

  1. On the first-generation Fixed Network OSS, the U2000+uTraffic software is deployed on the CH242 V3 4-HDD (8-HDD)+MZ512+CX310 hardware. This hardware configuration supports access to the NIC BIOS by pressing Ctrl+P.
  2. The new hardware configuration is CH242 V3 DDR4+MZ312+CX310. This hardware configuration does not support accessing the NIC BIOS by pressing Ctrl+P. On the new hardware, use VMware to configure the VLAN.
  3. The outdated uTraffic deployment guide is not applicable. Deploy uTraffic based on the new deployment guide.
Conclusion and Solution

Solution:

Configure the VLAN in VMware. Do not use Ctrl+P to access the NIC BIOS.

Common Problems of the Management Software

CH242 V3 DDR4 Reports an Unstable Alarm
Problem Description
Table 5-43 Basic information

Source of the Problem: Live network
Intended Product: E9000 CH242 V3 DDR4
Release Date: 2017-05
Keyword: Stable Status:Fault status

Symptom
  • Hardware configuration:

    CH242 V3 DDR4

  • Symptom:

    The Stable Status:Fault status alarm is reported on the iBMC for blade1 (CH242 V3 DDR4).

Key Process and Cause Analysis
  1. The board is not securely inserted into the chassis.
  2. The board is faulty.
  3. The slot of the chassis is faulty.

Cause Analysis:

After the BMC is powered on, it detects the board installation signal. If an exception occurs, the iBMC reports an alarm.

Fault locating:

  1. Check that the board is completely inserted into the chassis and that the ejector levers of the board are fastened.
  2. Remove the board and check the signal connector.

If the connector on the board is damaged, replace the faulty connector. In addition, check whether the connector in the chassis slot is damaged. If the connector is damaged, the slot cannot be used. Replace the chassis.

If the connector on the board is not damaged, insert the board into another normal slot. If the fault is caused by the board, replace the board. If the fault is caused by the chassis slot, replace the chassis.

Conclusion and Solution

Conclusion

As confirmed by onsite engineers, the board had been removed before the alarm was generated. The board was not completely inserted into the chassis, and the ejector levers were not fastened. As a result, the BMC detected that the installation signal was abnormal and reported an alarm. After the board was properly inserted, the fault was rectified.

Solution

Insert the board into the chassis again and fasten the ejector levers.

Experience

When installing a board, ensure that the board is inserted into the chassis completely and the ejector levers are fastened.

Knowledge Points

N/A

Management Port Alarm Is Reported by the MM910
Problem Description
Table 5-44 Basic information

Source of the Problem: Live network
Intended Product: E9000 MM910
Release Date: 2017-10
Keyword: Management port

Symptom
  • Hardware configuration:

    MM910

  • Symptom:

    A management port alarm is reported by the MM910 and an error message "Transition to non-recoverable from less severe" is displayed on the HMM WebUI.

Key Process and Cause Analysis

1. The data configuration is incorrect.

2. The network cable is abnormal.

3. The MM910 is faulty.

4. The service side of the switch module is not powered on.

Procedure

  1. Run the smmget -d outportmode command on the command-line interface (CLI) of the active MM910 to check the value of outportmode.

    If the following command output is displayed, the value of outportmode is 1.

    root@SMM:/# smmget -d outportmode

    the outportmode: 1

    • 1 indicates that an internal port is connected to the MM910. Go to Step 2.
    • 0 indicates that an internal port is connected to a switch module. Go to Step 4.

  2. Check whether a network cable is connected to the MGMT port on the MM910 and whether the link indicator of the network port works properly.

    • If yes, contact Huawei technical support.
    • If no, go to Step 3.

  3. Connect a network cable to the MGMT network port on the MM910, ensure that the link indicator works properly, and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to Step 6.

  4. Check whether the switch modules in slots 2X and 3X are in position.

  5. Install the switch modules properly and check whether the alarm is cleared. For details about how to replace a switch module, see the E9000 Server V100R001 User Guide.

    • If yes, no further action is required.
    • If no, go to Step 6.

  6. On the CLI of the active MM910, run the smmget -l swiN:fru2 -d hotswapstate command with N set to 2 or 3. Check whether the hot swap status is M4.

    • If yes, contact Huawei technical support.
    • If no, go to Step 7.

  7. On the CLI of the active MM910, run smmset -l swiN:fru2 -d powerstate -v poweron with N set to 2 or 3. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, contact Huawei technical support.
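A minimal sketch of steps 6 and 7 is shown below (slot 2X is used as an example; the output line is illustrative):

smmget -l swi2:fru2 -d hotswapstate
# hotswap state: M1
smmset -l swi2:fru2 -d powerstate -v poweron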

Conclusion and Solution

Conclusion

Onsite engineers checked that no network cable was connected to the MGMT port on the MM910. As a result, an alarm was reported.

Solution

Connect a network cable to the MGMT port on the MM910 panel.

Experience

N/A

Note

N/A

"CPU FRB1BIST failure" Is Reported After the Mainboard of the CH242 V3 DDR4 Is Replaced
Problem Description

The mainboard of blade4 (CH242 V3 DDR4) is faulty. After the mainboard is replaced, the system fails to be powered on, and CAT Error and FRB1/BIST failure alarms are reported by all CPUs.

Symptom

The board fails to start.

Key Process and Cause Analysis
  1. The CPU is faulty.
  2. The CPU is not compatible with the BIOS version.

After the system is powered on, the CPU performs a self-check. If the check fails, the FRB1/BIST failure alarm is reported to the iBMC. If such an alarm is generated, the CPU may be faulty or the BIOS version may be incompatible.

Key process:

  1. Check the CPU model and BIOS version. If the CPU model is v4 (for example, E7-4850 v4), check whether the BIOS version is V7xx or later. If the version is too early (for example, V629), upgrade the BIOS to V7xx or later (for example, V790).
  2. If the CPU is compatible with the BIOS version, swap the suspected faulty CPU with a functioning one. If the cause cannot be determined, collect logs and send them to R&D engineers for analysis.
Conclusion and Solution

Solution

  1. Check the CPU model of the current mainboard, which is E7-4850 v4 (BDW).
  2. Check the BIOS version of the mainboard, which is V629. This version does not support v4 CPUs. After the BIOS version is upgraded to V790, the problem is resolved.
Experience

N/A

Note

N/A

E9000 Compute Node Offline Alarm on the FusionSphere WebUI
Problem Description
Table 5-45 Basic information

Source of the Problem: Live network
Intended Product: E9000
Release Date: 2018-05
Keyword: FusionSphere WebUI, compute node, offline alarm

Symptom

On the FusionSphere WebUI, the system reports that two compute nodes are offline.

Key Process and Cause Analysis

The E9000 HMM WebUI shows that the compute nodes are in position, and no exception is found. Log in to the KVM of a compute node. The BIOS screen of the compute node is displayed.

Check the SEL logs of the compute node. No exception is found. However, the logs show that the compute node was restarted at the time when the problem occurred, and that the restart was manual.

Open the operation logs of the compute node for further confirmation. The directory is \dump_info\dump_info\LogDump\maintenance_log.

At the time when the offline alarm was reported, a user logged in to the KVM, manually powered off the compute node, and then logged in to the BIOS. The problem is caused by manual operations.
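A minimal sketch of checking the operation log is shown below (the path is from the collected log package; the search keywords are illustrative):

grep -iE "power off|kvm" dump_info/dump_info/LogDump/maintenance_log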

Conclusion and Solution

None

Note

None

Connection Timeout When Using the MM910 to Log In to the KVM of an E9000 Compute Node
Problem Description
Table 5-46 Basic information

Source of the Problem: Live network
Intended Product: E9000 CH121 V3
Release Date: 2017-05
Keyword: MM910, KVM, connection timeout

Symptom

On the E9000 server, when the MM910 is used for logging in to the KVM of a compute node, the system reports connection timeout.

Key Process and Cause Analysis

The device where the problem occurs is a CH121 V3 compute node, and the other compute nodes in the chassis are all V2 compute nodes. When the MM910 is used, the KVM of the other compute nodes can be logged in to. The versions of the MM910 and the CH121 V3 BMC must match. The problem is caused by the outdated MM910 version (3.07). Upgrade the MM910 to resolve the problem.

Conclusion and Solution

Solution:

Temporary solution: Set the BMC management IP address of the compute node. Log in to the BMC of the compute node and use the BMC to log in to the KVM.

Solution: Upgrade the MM910 to the latest version. The recommended version is 5.88 or later. The CPLD and help document also need to be upgraded.

Failed to Log In to the Remote KVM and "BMC is restarting or communication is lost" Is Displayed
Problem Description
Table 5-47 Basic information

Source of the Problem: Live network
Intended Product: E9000
Release Date: 2018-05
Keyword: Remote connection, KVM, BMC is restarting or communication is lost

Symptom

During the KVM login, "BMC is restarting or communication is lost" is displayed after the compute node is selected.

Key Process and Cause Analysis

Problem analysis:

  1. The SMM version is 5.58 and the BMC version is 5.15.
  2. A bug exists in SMM 5.57 and 5.58. When SMM 5.57 or 5.58 works with an earlier BMC version, the SMM fails to query whether the compute node supports the KVM.
Conclusion and Solution

Solution:

Upgrade the SMM and BMC to the latest version.

Abnormal Sensor Alarm on the E9000 CH222 V3
Problem Description
Table 5-48 Basic information

Source of the Problem: CH222 V3
Intended Product: E9000 CH222 V3
Release Date: 2018-04-10
Keyword: Sensor, I2C, cable

Symptom

Multiple sensor alarms are generated on the CH222 V3 after the mainboard is replaced.

Key Process and Cause Analysis

Analyze the feedback information. The procedure is as follows:

  1. Compare the sensor alarm information with that of a normal compute node.

    Check the FRU logs. The outlet, inlet, and SESA sensors fail to obtain data.

  2. The I2C link topology of the CH222 V3 shows that the I2C4 link is related to the FRU and EX.

    The sensor alarms are generated after the compute node (including the mainboard, disk enclosure, and cables) is replaced.

  3. Obtain the physical structure of the I2C4 link from the user guide. The SESA sensors are connected to the mainboard by cables.

  4. The problem is resolved after the compute node is replaced again.
Conclusion and Solution

The I2C link is faulty. As a result, alarms about multiple sensors are generated.

Experience

For sensor alarms, first locate the physical I2C link, and then check the hardware components on that link.

Note

For further information, see the sensor information and other reference cases.

Configuration and Installation Problems

PCIe Error Is Reported During the Startup of the CH242 V3 DDR4
Problem Description
Table 5-49 Basic information

Source of the Problem: Live network
Intended Product: E9000 CH242 V3 DDR4
Release Date: 2017-01
Keyword: Windows Server 2012 R2, MZ510, PCIe error

Symptom
  • Hardware configuration:

    CH242 V3 DDR4 + MZ510 (Mezz1)

  • Symptom:

    There is a possibility that mezzanine card 1 reports a PCIe Error alarm during the startup of the CH242 V3 DDR4.

    OS: Windows Server 2012 R2

In Windows, the alarm information is as follows.

Key Process and Cause Analysis

Possible Causes:

  1. The MZ510 is faulty.
  2. The BIOS PCIe parameter is incompatible.

Locating Method

Key process:

  1. Swap the MZ510 with a functioning one, or install the MZ510 on another board, to check whether the fault follows the MZ510. If the MZ510 is faulty, replace it.
  2. Check whether the value of PCI-E Port Max Payload Size is set to 256B.
    1. On the BIOS setup screen, choose IntelRCSetup.

    2. Select IIO0 Configuration.

    3. Select PCI Express Port 2C.

    4. Check whether the value of PCI-E Port Max Payload Size is set to 256B. If not, change it to 256B.

  3. If the fault is not caused by the MZ510 or the BIOS parameter, check the CPU. If the CPU is normal, replace the mainboard.
Conclusion and Solution

Conclusion

After the MZ510 was replaced onsite, the problem persisted. The problem was resolved by changing the value of PCI-E Port Max Payload Size from 128B to 256B.

Solution

Change the value of PCI-E Port Max Payload Size from 128B to 256B.

Verification:

The alarm is cleared, and the board works properly.

Experience

N/A

Note

N/A

Network Disconnection Between the VM and the EoR Gateway
Problem Description
Table 5-50 Basic information

Source of the Problem: Live network
Intended Product: E9000 CH121 V3+EVS networking
Release Date: 2018-05
Keyword: EVS networking, EoR gateway, network disconnection, VM, port isolation

Symptom

On an EVS network, the network between the VM and the EoR gateway is disconnected. On the switch module, a VLANIF interface in the same VLAN as the VM is configured for testing: the network between the VM and the VLANIF interface is normal, and the network between the VLANIF interface and the EoR gateway is also normal.

Key Process and Cause Analysis
  1. Based on the source and destination MAC addresses, capture packets on the Eth-Trunk port of the switch module connected to the compute node. No packet from the VM is received by the switch module.
  2. Check the configurations of the downlink Eth-Trunk and uplink Eth-Trunk. Both contain the same configuration:

    port-isolate enable group 1

  3. The description of this configuration is as follows:

    To implement Layer 2 isolation between ports, you can add different ports to different VLANs. However, this method wastes the limited VLAN resources. Ports in the same VLAN can instead be isolated by using the port isolation feature: ports added to one isolation group are isolated from each other at Layer 2. The port isolation feature enables safer and more flexible networking.

  4. The onsite configuration therefore isolates the uplink and downlink ports from each other. After the uplink ports are moved to a different isolation group, the problem is resolved and the network recovers.
Conclusion and Solution

Conclusion:

The problem is caused by the port isolation configuration. The uplink and downlink ports are configured in one isolation group, and ports in one isolation group cannot communicate with each other. As a result, packets from the VM to the switch module cannot be forwarded to the EoR switch through the uplink port, and the network between the VM and the EoR gateway is abnormal.

Solution:

Configure the port isolation group based on the network design. Do not configure the uplink and downlink ports into one isolation group.
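A minimal sketch is shown below (VRP-style syntax is assumed; the interface name and group IDs are illustrative):

interface Eth-Trunk2
 undo port-isolate enable
 port-isolate enable group 2

Here Eth-Trunk2 stands for the uplink Eth-Trunk: either remove port isolation from the uplink or move it to a different isolation group, so that it is no longer in the same group as the downlink ports.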

Experience

None

Note

None
