Huawei Server Maintenance Manual 09

Power-On and Power-Off Problems

No Blade in an E9000 Chassis Can Be Powered On
Problem Description
Table 5-1 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2015-12-16

Keyword

CPUn Memory, CPUn Prochot

Symptom

Symptom:

The power of an E9000 server decreases from 3000 W to 1500 W, and none of the blades can be powered on properly. The power indicator turns green for a period and then turns off.

Even when only one blade is installed, that blade cannot be powered on.

Key Process and Cause Analysis

Key process:

  1. Collect blade log information by running the one-click information collection command on the management module. The blade BMC logs contain the information shown in Figure 5-1.
    Figure 5-1 Blade BMC logs

  2. Check the management module version. It is V507. According to the latest management module release notes, MM910 V512 resolves several problems, including occasional blade performance decrease and misreported CPUn Memory and CPUn Prochot alarms. The upgrade package name and download address are as follows:

    MM910-MM-V512.zip

    http://support.huawei.com/enterprise/SoftwareVersionAction!getSoftwareInfo.action?nodePath=fixnode01|7919749|9856522|9856786|19955022|19961380|19962084|19962085|19962087|21499562&idAbsPath=fixnode01|7919749|9856522|9856786|19955022|19961380&version=E9000+Chassis+V100R001C00SPC270&hidExpired=0&contentId=SW1000126221

  3. Upgrade the MM910 to version V512. The blades can be powered on properly.

Cause analysis:

Blades occasionally fail to power on because the MM910 runs an earlier version (V507).
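The version check in this case can be sketched as follows. This is a minimal illustration, assuming that management module version labels have the form "V" followed by an integer and that a larger number means a later release; the function names are hypothetical and are not part of any Huawei tool.

```python
import re

# Version named in this case as containing the fix.
FIXED_VERSION = "V512"


def version_number(label: str) -> int:
    """Extract the numeric part of a label such as 'V507' (assumed format)."""
    match = re.fullmatch(r"[Vv](\d+)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized version label: {label!r}")
    return int(match.group(1))


def needs_upgrade(current: str, fixed: str = FIXED_VERSION) -> bool:
    """Return True if the current version is earlier than the fixed version."""
    return version_number(current) < version_number(fixed)


if __name__ == "__main__":
    print(needs_upgrade("V507"))
    print(needs_upgrade("V512"))
```

With the values from this case, V507 is reported as needing an upgrade and V512 is not.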

Conclusion and Solution

Conclusion:

Blades occasionally fail to power on because the MM910 runs an earlier version (V507).

Solution:

Upgrade the MM910 to version V512.

Experience

None

Note

None

Common Problems of RAID Controller Cards and Hard Drives

E6000 Hard Drives Are Offline Due to Connection to a Low-Quality Monitor
Problem Description
Table 5-2 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2011-06-09

Keyword

Monitor, E6000, hard drive offline

Author

Qi Yanjun (employee ID: 00171744)

Symptom
  1. When keyboard, video, and mouse (KVM) cables are used to connect a monitor to a BH620, the hard drive fault indicator is steady on, and alarms are generated on the BH620 and shelf management module (SMM). Figure 5-2 shows the alarm information in the SMM log.
    Figure 5-2 SMM alarm information

  2. The server operating system (OS) behaves abnormally. If the server OS is Windows Server 2003, the system crashes. If the server OS is Linux, an error is reported on the Linux terminal, as shown in Figure 5-3.
    Figure 5-3 Linux error information
Key Process and Cause Analysis

Key process

  1. According to onsite personnel, the server is new, no power failure has occurred, and the hard drives and RAID controller cards are normal.
  2. After a monitor is connected to a BH620, faults may occur when KVM cables are connected to or disconnected from the BH620. In addition, hard drives go offline after KVM cables are connected to the high-density cable port.
  3. When other qualified monitors are connected, the server works properly in multiple tests.
  4. When the KVM cables are not connected to a monitor, the server works properly in multiple tests.
  5. The problem is first observed during KVM cable connection or disconnection. Consult with onsite engineers about the situation.
  6. The onsite monitor has no CE or CCC certification label and is a low-quality monitor. Figure 5-4 shows the rear view of the monitor, which has no certification label attached.
    Figure 5-4 Rear view of a low-quality monitor

Cause analysis

Multiple tests show that the problem is caused by the connection to a low-quality monitor.

Conclusion and Solution

Conclusion

The problem is caused by the connection to a low-quality monitor without certification.

Solution

Do not use low-quality monitors without certification on site. Use monitors that are produced by qualified manufacturers and have passed China Compulsory Certification (CCC) or other related certifications. Alternatively, enable the remote control function of the baseboard management controller (BMC) to provide the display function.

Experience

Do not use low-quality monitors without certification on site. Monitors that are produced by qualified manufacturers and have passed certification are recommended. Alternatively, enable the remote control function of the BMC to provide the display function.

Note

None

HBA and FC Switch Module Problems

The License Key for Cascading E6000 NX120 Switch Modules Fails to Be Activated
Problem Description
Table 5-3 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2011-06-24

Keyword

E6000, NX120

Author

Han Hui (employee ID: 179477)

Symptom

Hardware configuration

Four E6000 blades and eight E6000 NX120s

Symptom

A customer applies for activating the license key for cascading NX120s at http://www.brocadechina.com/. When onsite engineers activate the license key, errors occur, as shown in the red box in Figure 5-5.

Figure 5-5 Errors against activating the license key
Key Process and Cause Analysis

Cause analysis

The values of ID Type and Transaction Key for the license key are verified to be correct. According to feedback from BROCADE, the BROCADE transaction key database is abnormal.

Conclusion and Solution

Conclusion

The transaction key database of BROCADE is abnormal.

Solution

BROCADE updates the transaction key database information.

Experience

None

Note

None

Zone Conflict Occurs When NX120 Switch Modules Are Cascaded
Problem Description
Table 5-4 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2011-11-22

Keyword

E6000, NX120, license

Author

Han Hui (employee ID: 179477)

Symptom

Hardware configuration

One E6000 blade and two NX120 switch modules

Software configuration

FOS FW for NX120s is V6.2.2c3.

Symptom

NX120s fail to cascade with BROCADE B5100s. When the switchshow command is run on the NX120 CLI, the system displays "zone conflict", as shown in the red box in Figure 5-6.

Figure 5-6 Segmented errors against cascading NX120s
Key Process and Cause Analysis

Key process

  1. Run the licenseshow command to view the license for cascading NX120s, and check that the license has been added, as shown in Figure 5-7.
    Figure 5-7 License for cascading NX120s

  2. Run the switchshow command to view the switch status. The output shows that all switches on the storage area network (SAN) have the same Domain ID and the same switchname.
  3. Change the values of Domain ID and switchname for all switches based on the configuration plan, as shown in Figure 5-8.
    Figure 5-8 SAN configuration plan

    NOTE:

    Each switch on the SAN must have a unique Domain ID and a unique switchname.

  4. Disable the zone configuration of BROCADE B5100, as shown in Figure 5-9.
    Figure 5-9 Disabling the zone configuration of BROCADE B5100

  5. Reserve only the default zone configuration (XFMA_48_ZONE) of NX120s, as shown in Figure 5-10.
    Figure 5-10 Reserving the default zone configuration of NX120s

  6. Check that NX120s are cascaded, as shown in Figure 5-11.
    Figure 5-11 Checking the cascading condition for NX120s

Cause analysis

If all switches on the SAN have the same Domain ID and switchname, device conflicts occur. In addition, improper zone division causes Fibre Channel (FC) switch cascading to fail.
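The uniqueness check described above can be sketched as follows. This is a minimal illustration, assuming the Domain ID and switchname of each switch have already been copied by hand from the switchshow output into a list; the switch labels and values below are hypothetical and do not come from this case.

```python
from collections import defaultdict


def find_duplicates(switches, key):
    """Group switches by the given attribute and return values used more than once."""
    groups = defaultdict(list)
    for switch in switches:
        groups[switch[key]].append(switch["label"])
    return {value: labels for value, labels in groups.items() if len(labels) > 1}


# Hypothetical inventory; replace with the values read from each switch.
fabric = [
    {"label": "NX120-1", "domain_id": 1, "switchname": "switch"},
    {"label": "NX120-2", "domain_id": 1, "switchname": "switch"},
    {"label": "B5100", "domain_id": 2, "switchname": "core1"},
]

if __name__ == "__main__":
    print("Duplicate Domain IDs:", find_duplicates(fabric, "domain_id"))
    print("Duplicate switchnames:", find_duplicates(fabric, "switchname"))
```

Any value that appears in the result needs to be changed on one of the listed switches before cascading.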

Conclusion and Solution

Conclusion

All switches on the SAN have the same Domain ID and switchname, and the zone configuration is incorrect.

Solution

  1. Apply for a license for cascading NX120s, and add the license to NX120s.
  2. Change the values of Domain ID and switchname for all switches on an SAN based on the configuration plan.
  3. Enable the default zone configuration for NX120s, and disable the zone configuration for FC switches.
  4. After cascading NX120s, plan the zone configuration based on the actual SAN.
Experience

None

Note

None

The OS Cannot Start After SAN Boot Is Enabled on the FC HBA Card
Problem Description
Table 5-5 Basic information

Item

Information

Source of the Problem

RH5485

Intended Product

RH5485 series servers

Release Date

2013-04-08

Keyword

FC, SAN Boot, start failure

Symptom

Hardware configuration

An RH5485 configured with a Fibre Channel (FC) host bus adapter (HBA) card; the server is connected to an FC storage area network (SAN) 5600T.

Symptom

After the server is connected to the FC SAN by using an optical fibre, the operating system (OS) cannot boot from the local drive. After the server is disconnected from the FC SAN, the OS can normally boot from the local drive.

Key Process and Cause Analysis

Key process

  1. The symptom indicates that the SAN Boot function is enabled on the FC HBA card.
  2. After the SAN Boot function is enabled on the FC HBA card, the server preferentially boots from the FC SAN. After the server is connected to the FC SAN, it scans all logical unit numbers (LUNs) of the FC SAN one by one. The scanning time depends on the number of LUNs on the FC SAN. If no bootable OS is available on the FC SAN, the server boots from the local drive. Because scanning the LUNs takes a long time, the OS appears to be unable to start.
NOTE:

If the local boot mode is disabled in the basic input/output system (BIOS), the server OS does not boot from the local drive.

Conclusion and Solution

Conclusion

After the SAN Boot function is enabled on the FC HBA card, the server preferentially boots from the FC SAN. After the server is connected to the FC SAN, it scans all logical unit numbers (LUNs) of the FC SAN one by one. The scanning time depends on the number of LUNs on the FC SAN. If no bootable OS is available on the FC SAN, the server boots from the local drive. Because scanning the LUNs takes a long time, the OS appears to be unable to start.

Solution

Solution 1 (for general scenarios)

Perform the following steps to disable boot from SAN in the FC HBA BIOS:

  1. Restart the server, and press Ctrl+B as prompted to enter the HBA card configuration screen, as shown in Figure 5-12.
    Figure 5-12 HBA card configuration screen

  2. Set BIOS to Disable and press Alt+S to save the settings and exit.

Solution 2 (for the RH5485 only)

Restart the server, enter the BIOS, and set the first boot option to Legacy Only.

Experience

None

Note

None

The NX220 Cannot Be Accessed over HTTP
Problem Description
Table 5-6 Basic information

Item

Information

Source of the Problem

E6000 V2

Intended Product

E6000 V2

Release Date

2013-02-06

Keyword

HTTP, NX220

Symptom

Hardware configuration

An E6000 V2; an NX220 configured for the B1 switching plane

Symptom

  1. Log in to the management module web user interface (WebUI) using a client, choose System Management > Network Management > NEM > Manage IP Address, set the management IP address of the NX220 for the B1 switching plane to 10.77.77.77 and subnet mask to 255.255.255.0, as shown in Figure 5-13.
    Figure 5-13 Management module WebUI

  2. Open Internet Explorer on the client (IP address: 10.77.77.89) and enter http://10.77.77.77 in the address box to access the NX220. A message is displayed indicating that the interface of the client is disabled, as shown in Figure 5-14.
    Figure 5-14 Restricted access to the NX220 over HTTP
Key Process and Cause Analysis

Cause analysis

The Hypertext Transfer Protocol (HTTP) port on the NX220 is disabled before delivery. You need to access the NX220 over Hypertext Transfer Protocol Secure (HTTPS) by entering https://10.77.77.77 in the address box of the browser.

Conclusion and Solution

Conclusion

The HTTP port on the NX220 is disabled before delivery. You need to access the NX220 over HTTPS.

Solution

Access the NX220 over HTTPS by entering https://10.77.77.77 in the address box of the browser.

Experience

The following insecure protocols are disabled on the NX220 before delivery:

  1. File Transfer Protocol (FTP) is disabled. Use Secure File Transfer Protocol (SFTP) instead.
  2. Telnet is disabled. Use Secure Shell (SSH) instead.
  3. HTTP is disabled. Use HTTPS instead.
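The protocol substitutions listed above can be sketched as a small lookup. This is a minimal illustration using only the Python standard library; the helper name is hypothetical, and the sketch only rewrites the scheme of an address. It does not contact the NX220.

```python
from urllib.parse import urlsplit, urlunsplit

# Insecure protocol -> secure replacement, as listed in this case.
SECURE_REPLACEMENT = {"ftp": "sftp", "telnet": "ssh", "http": "https"}


def to_secure_url(url: str) -> str:
    """Rewrite the scheme of a URL to its secure replacement, if one is listed."""
    parts = urlsplit(url)
    scheme = SECURE_REPLACEMENT.get(parts.scheme, parts.scheme)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_secure_url("http://10.77.77.77"))
    print(to_secure_url("telnet://10.77.77.77"))
```

For the address used in this case, http://10.77.77.77 becomes https://10.77.77.77.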
Note

None

The Temporary License for NX120s Cannot Be Used
Problem Description
Table 5-7 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2012-01-10

Keyword

E6000, NX120, license

Author

Han Hui (employee ID: 179477)

Symptom

Hardware configuration

BH620 and NX120

Software configuration

FOS FW for NX120s is V5.05a.

Symptom

BROCADE provides a temporary license for NX120s. Customers fail to add the license, as shown in the red box in Figure 5-15.

Figure 5-15 Failure in adding the license for cascading NX120s

Key Process and Cause Analysis

Key process

  1. Confirm that the license is valid and that the procedure for adding the license is correct.
  2. Check the FOS FW version of the NX120s. It is V5.05a, which does not support temporary licenses. To use a temporary license, upgrade FOS FW for NX120s to V6.2.3c3.
  3. Send the FOS FW upgrade package for NX120s to the customer. FOS FW for the customer's NX120s can be remotely upgraded to V6.2.3c3. After the upgrade, a temporary license can be added, as shown in Figure 5-16.
    Figure 5-16 Success in adding the license for cascading NX120s

Cause analysis

FOS FW for NX120s is V5.05a, and the version does not support a temporary license.

Conclusion and Solution

Conclusion

FOS FW for NX120s is V5.05a, and the version does not support a temporary license. To use a temporary license, upgrade FOS FW to V6.2.3c3.

Solution

Send the FOS FW upgrade package for NX120s to customers, and remotely upgrade the customer's version to V6.2.3c3.

Experience

None

Note

None

HBA Identification Failure on the E6000 NX220
Problem Description
Table 5-8 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2013-05-30

Keyword

NX220, HBA, E6000

Symptom

Hardware configuration

E6000H chassis, MM620, NX220, and 10 BH622 V2 blades

Log in to the NX220 over the CLI. (For details, see the NX220 8G FC Switch Module V100R002 User Guide.) Run the switchshow command. The command output shows that all the ports are disabled, as shown in Figure 5-17.

Figure 5-17 NX220 status

Key Process and Cause Analysis

Key process

  1. Determine that the MU220 host bus adapter (HBA) is installed in slot Mezz1.
    Figure 5-18 MU220 HBA in slot Mezz1

  2. Check that the NX220 is installed on switching plane C on the site.
  3. Move the NX220 to switching plane B, and run the switchshow command. The command output shows that the MU220 HBA is successfully detected.

Cause analysis

HBAs must be installed in accordance with the mapping between mezzanine card slots and switching planes B and C in an E6000 blade server. Slot Mezz1 corresponds to switching plane B, and slot Mezz2 corresponds to switching plane C.

Figure 5-19 MU220 HBA in slot Mezz1
NOTE:

The network ports on the mezzanine card in slot Mezz1 connect to the switch modules in slots B1 and B2 through the backplane. The network ports on the mezzanine card in slot Mezz2 connect to the switch modules in slots C1 and C2 through the backplane.
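The slot-to-plane mapping described above can be sketched as a simple lookup. This is a minimal illustration; the function names are hypothetical, and switch module slots are assumed to be written as the plane letter followed by a number (B1, B2, C1, C2), as in the note above.

```python
# Mapping stated in this case: Mezz1 -> switching plane B, Mezz2 -> switching plane C.
MEZZ_TO_PLANE = {"Mezz1": "B", "Mezz2": "C"}


def plane_for_mezz(mezz_slot: str) -> str:
    """Return the switching plane that a mezzanine card slot connects to."""
    try:
        return MEZZ_TO_PLANE[mezz_slot]
    except KeyError:
        raise ValueError(f"unknown mezzanine slot: {mezz_slot!r}") from None


def is_matched(mezz_slot: str, switch_slot: str) -> bool:
    """Check whether a switch module slot (for example 'B1') serves the mezzanine slot."""
    return switch_slot[:1].upper() == plane_for_mezz(mezz_slot)


if __name__ == "__main__":
    # Situation found on site: HBA in Mezz1, NX220 on switching plane C.
    print(is_matched("Mezz1", "C1"))
    # After the NX220 is moved to switching plane B.
    print(is_matched("Mezz1", "B1"))
```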

Conclusion and Solution

Conclusion

The HBA is installed in slot Mezz1, which corresponds to switching plane B, but the NX220 is installed on switching plane C. The two do not match.

Solution

Install the HBA in slot Mezz1, and install the NX220 on switching plane B.

Experience

If the host fails to identify an HBA, check:

  1. Whether an operating system (OS) and the HBA driver are properly installed on the host.
  2. Whether the HBA is in good contact with the host.
  3. Whether the NX220 is compatible with the MM620.
  4. Whether HBAs are installed in accordance with the mapping between mezzanine card slots and switching planes B and C.
Note

None

VMware PSOD Due to Incompatibility of the MZ510 FC Driver and Firmware
Problem Description
Table 5-9 Basic information

Item

Information

Source of the Problem

CH121

Intended Product

E9000

Release Date

2014-02-24

Keyword

VMware, purple screen

Symptom

Hardware configuration

Two E5-2603 CPUs, sixteen 8 GB DIMMs, LSI SAS2308 RAID controller card, and MZ510

Software configuration

MZ510 driver: 8.2.3.1-127vmw; firmware: 4.4.262.3; OS: VMware 5.1

Symptom

After VMware 5.1 is installed on the CH121, a purple screen of death (PSOD) may occur, as shown in Figure 5-20.

Figure 5-20 VMware PSOD

Key Process and Cause Analysis

Key process

  1. Analyze VMware logs, and determine that the MZ510 FC driver and firmware do not match, as shown in Figure 5-21.
Figure 5-21 Analyzing VMware logs

For details, see http://kb.vmware.com/kb/2052729.

Cause analysis

A purple screen may be displayed if the MZ510 FC driver and firmware do not match. The driver and firmware versions in use are not a compatible combination verified by Huawei.

Conclusion and Solution

None

Experience

None

Note

None

NX220 Generated Error Codes Due to an Incorrect S3900 Port Mode
Problem Description
Table 5-10 Basic information

Item

Information

Source of the Problem

BH622 V2

Intended Product

All servers

Release Date

2013-07-26

Keyword

NX220, error code, storage

Symptom

Network configuration

BH622 V2 server blade with the MU220 HBA + NX220 switch module + OceanStor S3900 storage system

Software configuration

The operating system (OS) is VMware 5.1.

Symptom

The link between the BH622 V2 and the S3900 is abnormal. Error information is recorded in S3900 logs for the S3900 port that connects to the NX220, and many error codes are recorded in NX220 logs for the NX220 port that connects to the S3900. Figure 5-22 shows the error codes on the NX220 side.

Figure 5-22 Viewing error codes on the NX220 side

Key Process and Cause Analysis

Key process

  1. The analysis of NX220 logs shows that the port that connects the S3900 to the NX220 is in arbitrated loop mode.

Cause analysis

The port that connects the S3900 to the NX220 is in arbitrated loop mode.

Conclusion and Solution

Conclusion

The port that connects the S3900 to the NX220 must be in switch mode.

Solution

For OceanStor S3900 earlier than V100R002C00SPC010, change the Fibre Channel (FC) port mode from arbitrated loop to switch.

For details, see "Setting an FC Host Port" in the OceanStor S2900, S3900, S5900, or S6900 storage system product documentation.

Figure 5-23 NX220 port modes

Experience

None

Note

None

"Retry this adapter" Displayed During HBA Self-Check
Problem Description
Table 5-11 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

V1/V2 series servers

Release Date

2013-07-28

Keyword

retry this adapter

Symptom

Hardware configuration

BH62 server blade and an MU120 HBA

Software

SUSE 11 SP1 64-bit

Symptom

"retry this adapter" is displayed during the HBA self-check when the server starts. See Figure 5-24.

Figure 5-24 Alarm during power-on self-test (POST)
Key Process and Cause Analysis

Set BIOS parameters for the HBA as follows:

  1. When the server starts, press ALT+E to go to the HBA configuration screen to set the BIOS parameters.
    Figure 5-25 Opening the HBA configuration screen by pressing ALT+E

  2. Enter 2. See Figure 5-26.
    Figure 5-26 Adapter parameter screen

  3. Enter 1. See Figure 5-27.
    Figure 5-27 Entering 1

  4. Enter 2 to disable the BIOS. See Figure 5-28.
    Figure 5-28 Disabling the BIOS

  5. Check that the fault is solved after the server restarts.
Conclusion and Solution

Conclusion

The fault occurs because SAN Boot is enabled.

Solution

Disable SAN Boot.

Experience

None

Note

None

Multipath Loss Due to Bit Errors in the Links Between E6000 NX220 Switch Modules and an EMC FC Switch
Problem Description
Table 5-12 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2014-10-30

Keyword

NX220, EMC FC switch, bit error, multipath loss

Symptom

Hardware configuration

Two E6000 chassis, four NX220 switch modules, EMC FC switch, and S6900 storage device

Software configuration

  • OS: Citrix 6.1
  • MU220 firmware: 2.01A9 (U2F2.01A9)

Symptom

Four SATA drives are configured as a RAID 0 array, and two SAS drives are configured as a RAID 1 array.

Alarms indicating multipath loss are generated while the server blades in slots 1 and 5 of an E6000 chassis are running. See Figure 5-29.

Figure 5-29 Multipath loss alarms in Citrix 6.1
Key Process and Cause Analysis

Cause analysis

Storage device logs contain a large number of abort sequence (ABTS) records. See Figure 5-30.

Figure 5-30 Many ABTSs in storage device logs

Possible causes for ABTSs:

  1. A host sends an ABTS to cancel an I/O when no response is returned and the delivered I/O times out.
  2. Upon receiving an ABTS from the host, the storage device searches for the I/O in the I/O processing link list. If the storage device finds the I/O, it returns a BLS_ACC packet to the host. If the storage device does not find the I/O, it returns a BLS_RJT packet to the host.

    The BLS_RJT packet indicates either that the FC interface card of the storage device did not receive the I/O, or that the FC interface card received the I/O and returned a response that the host did not receive. The failure to receive the response is often caused by a bit error in the link between the host and the storage device.

  3. Storage device logs show no bit errors in the link between the switch modules and the storage device. The BLS_RJT packets therefore indicate that a bit error occurred in the link between the server blades and the switch modules.

    ABTSs are detected from the following storage device ports:

    Ports 0x99, 0x9b, 0x9a, and 0x9c on the switch module with the domain ID 0x14

    Port 0x9b on the switch module with the domain ID 0x0a (a few ABTSs)

    ABTSs are detected from the following switch module ports:

    Port 0x66 on the switch module with the domain ID 0x14, and port 0x67 on the switch module with the domain ID 0x0a (a few ABTSs)

    Run the switchshow command. The command output shows that the ID of the 0x66 port on the EMC FC switch is 146600.

    Figure 5-31 Viewing a port ID on an EMC FC switch
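The relationship between the port ID 146600, domain ID 0x14, and port 0x66 can be sketched as follows. This is a minimal illustration, assuming the standard 24-bit Fibre Channel address layout (domain, area, and port, 8 bits each) and assuming that the area byte corresponds to the switch port number on this switch, which is consistent with the values in this case. The function name is hypothetical.

```python
def decode_fc_port_id(port_id_hex: str) -> dict:
    """Split a 24-bit FC port ID (hex string) into domain, area, and port bytes."""
    value = int(port_id_hex, 16)
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"not a 24-bit FC port ID: {port_id_hex!r}")
    return {
        "domain": (value >> 16) & 0xFF,  # switch domain ID
        "area": (value >> 8) & 0xFF,     # assumed to match the switch port number
        "port": value & 0xFF,            # low byte; 0x00 in this case
    }


if __name__ == "__main__":
    fields = decode_fc_port_id("146600")
    print({name: hex(number) for name, number in fields.items()})
    print("Port number in decimal:", fields["area"])
```

Under these assumptions, 146600 decodes to domain 0x14 and port 0x66, and 0x66 is 102 in decimal.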

The NX220 port that corresponds to port 146600 on the EMC FC switch is port 11. See Figure 5-32.

Figure 5-32 Viewing the NX220 port corresponding to the port on an EMC FC switch

The port mapping in AG mode shows that the ports on the E6000 server blades in slots 1 and 5 are connected to the storage device through port 11 on an NX220 switch module.

Figure 5-33 Viewing the NX220 port corresponding to the port on an EMC FC switch

Figure 5-34 Viewing bit error information for the port on an NX220 switch module

Replace the optical cable for the link. Then run the statsclear command on each NX220 switch module to clear the port statistics, and run the porterrshow command to check whether new bit errors are recorded.

Figure 5-35 Viewing bit error information for the port on an NX220 switch module
Conclusion and Solution

Conclusion

A bit error occurs in the link between port 11 on the NX220 switch module and port 102 on the EMC FC switch. As a result, the links of server blades in slots 1 and 5 are unstable, which causes multipath loss.

Solution

Replace the optical module.

Experience

Check the optical module and optical cable connected to the port if a bit error occurs in an FC link.

Note

None

FCoE Connection Error Due to Incorrect Switch Module Configuration After E9000 CX311s Are Stacked
Problem Description
Table 5-13 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2014-10-20

Keyword

CX311, FCoE, stacking

Symptom

Hardware configuration

MM910, three CH242 V3 compute nodes, six MZ510 NICs, CX311 switch modules, and Brocade 300E switch

Software configuration

  • OS: Red Hat Enterprise Linux (RHEL) 6.5
  • MZ510 NIC firmware: 1.1.43.24

Symptom

Four SATA drives are configured as a RAID 0 array, and two SAS drives are configured as a RAID 1 array.

An E9000 chassis houses three CH242 V3 compute nodes. Each CH242 V3 is equipped with two MZ510 NICs. CX311 switch modules connect to the Brocade 300E switch in transparent (TR) mode. Figure 5-36 shows World Wide Port Names (WWPNs) on the CX311 in slot 2X.

Figure 5-36 Viewing WWPNs in TR mode

In theory, a maximum of six WWPNs should be displayed, so the unnecessary WWPNs need to be identified.

Key Process and Cause Analysis

Cause analysis

View Fibre Channel over Ethernet (FCoE) session information in the Fabric switch module of a CX311. It is found that the FCoE port on the CX311 in slot 3X corresponding to the MZ510 HBA uses VLAN 1002. Therefore, the FCoE port on the CX311 in slot 3X uses the CX311 in slot 2X to transmit FCoE packets.

Figure 5-37 Viewing FCoE session information in a CX311 Fabric switch module
Figure 5-38 Port configuration in a CX311 Fabric switch module

The VLAN configuration shows that the FCoE port is added to both VLAN 1002 and VLAN 1003, which is incorrect.

Modify Fabric configuration for the CX311 in slot 2X as follows:

interface 10GE2/2/1
 port link-type hybrid
 port hybrid tagged vlan 1 to 1002
 port hybrid tagged vlan 1004 to 4094
 mac-learning priority 3
 lldp tlv-enable dcbx
 dcb pfc enable mode auto
 dcb ets enable DCBX
 dcb compliance intel-oui

Modify Fabric configuration for the CX311 in slot 3X as follows:

interface 10GE3/2/1
 port link-type hybrid
 port hybrid tagged vlan 1 to 1001
 port hybrid tagged vlan 1003 to 4094
 mac-learning priority 3
 lldp tlv-enable dcbx
 dcb pfc enable mode auto
 dcb ets enable DCBX
 dcb compliance intel-oui

After the configuration is modified, the problem is resolved.
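The tagged VLAN ranges in the two configurations above follow one pattern: each port is tagged in every VLAN from 1 to 4094 except the FCoE VLAN that belongs to the other slot. The following sketch derives those ranges. It is a minimal illustration; the function name is hypothetical, and the single-VLAN form of the command (used only when a range contains one VLAN) is an assumption that does not occur in this case.

```python
def tagged_vlan_commands(excluded, first=1, last=4094):
    """Build 'port hybrid tagged vlan' lines covering first..last except the excluded VLANs."""
    commands = []
    start = first
    for vlan in sorted(set(excluded)):
        if not first <= vlan <= last:
            raise ValueError(f"VLAN {vlan} is outside {first}..{last}")
        if start <= vlan - 1:
            commands.append(_range_command(start, vlan - 1))
        start = vlan + 1
    if start <= last:
        commands.append(_range_command(start, last))
    return commands


def _range_command(low, high):
    # Assumption: a range of one VLAN is written without "to".
    if low == high:
        return f"port hybrid tagged vlan {low}"
    return f"port hybrid tagged vlan {low} to {high}"


if __name__ == "__main__":
    # Slot 2X: exclude VLAN 1003. Slot 3X: exclude VLAN 1002.
    for slot, excluded_vlan in (("2X", 1003), ("3X", 1002)):
        print(f"Slot {slot}:")
        for line in tagged_vlan_commands([excluded_vlan]):
            print(" " + line)
```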

Conclusion and Solution

Conclusion

The VLAN configuration in the Fabric switch module of each CX311 is incorrect.

Solution

Modify the VLAN configuration by using the preceding method.

Experience

None

Note

None

Failure to Use Eth-Trunk and FCoE Simultaneously on the MZ510+CX311 Network
Problem Description
Table 5-14 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2014-10-20

Keyword

MZ510, CX311

Symptom

Symptom

A customer uses the MZ510+CX311 networking scheme, as shown in Figure 5-39. In this scheme, the CX311 switch modules in slots 2X and 3X are stacked, and two 10GE ports on the MZ510 are configured as an Eth-Trunk. The gateway detects that the Eth-Trunk port is in the link down state. After the Eth-Trunk configuration is removed from the CX311 switch modules, the gateway detects that the port is in the link up state.

Figure 5-39 MZ510+CX311 networking scheme
Key Process and Cause Analysis

Cause analysis

Eth-Trunk and FCoE cannot be used simultaneously on the MZ510+CX311 network because FCoE supports only point-to-point transmission, whereas Eth-Trunk aggregates multiple links and distributes data packets across them.

Conclusion and Solution

Solution

Use either Eth-Trunk or FCoE on the MZ510+CX311 network.

Experience

None

Note

None

E9000 CX912 Reports the "zone conflict" Error
Problem Description
Table 5-15 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

CX912/NX220/NX120

Release Date

2014-09-19

Keyword

Zone conflict

Symptom

Hardware configuration

E9000 chassis with four CH242 compute nodes and two CX912 switch modules

Network configuration

Figure 5-40 CX912+SNS2124 network diagram

Symptom

The "zone conflict" error is reported when CX912s on an E9000 server are cascaded with SNS2124s. See Figure 5-41.

Figure 5-41 "zone conflict" error

Key Process and Cause Analysis

Possible causes

  1. The active zone config (zone cfg or zone set) names on the CX912s and SNS2124s are different.
  2. The members in the zones with the same name on the CX912s and SNS2124s are different.
  3. The default zone access modes of the CX912s and SNS2124s are different.
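The three possible causes above can be expressed as a comparison of the zoning state on the two sides. This is a minimal sketch, assuming the active zone configuration name, the zone members, and the default zone access mode have already been collected from each switch by hand. The class, field names, and sample values are hypothetical, and the sketch does not reproduce the merge logic of the switch firmware.

```python
from dataclasses import dataclass


@dataclass
class ZoneState:
    active_cfg: str      # name of the active zone config (zone cfg or zone set)
    zones: dict          # zone name -> set of members
    default_access: str  # for example "allaccess" or "noaccess"


def merge_conflicts(a: ZoneState, b: ZoneState) -> list:
    """Return the reasons, among the three listed causes, that apply to two switches."""
    reasons = []
    if a.active_cfg != b.active_cfg:
        reasons.append("active zone config names differ")
    for name in sorted(a.zones.keys() & b.zones.keys()):
        if set(a.zones[name]) != set(b.zones[name]):
            reasons.append(f"zone {name!r} has different members")
    if a.default_access != b.default_access:
        reasons.append("default zone access modes differ")
    return reasons


# Hypothetical values that mirror the check results in this case.
cx912 = ZoneState("cfg1", {"zone_a": {"wwpn1", "wwpn2"}}, "noaccess")
sns2124 = ZoneState("cfg1", {"zone_b": {"wwpn3", "wwpn4"}}, "allaccess")

if __name__ == "__main__":
    print(merge_conflicts(cx912, sns2124))
```

With these sample values, only the default zone access mode is reported, which corresponds to the cause identified in this case.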

Key process

The check results show that:

  1. The active zone config (zone cfg or zone set) names on the CX912s and SNS2124s are the same.
  2. The names of zones on the CX912s and SNS2124s are different.
  3. The default zone access modes of the CX912s and SNS2124s are different.
    Figure 5-42 Default zone configuration on CX912s

    Figure 5-43 Default zone configuration on SNS2124s

  4. Log in to the CLI of each CX912 and change the default zone access mode to All Access.
    defZone --allaccess
    cfgsave
    defZone --show
  5. Cascade CX912s with SNS2124s. The cascading is successful and no error is reported.
Conclusion and Solution

Conclusion

The error is reported because the default zone access modes of the CX912s and SNS2124s are different.

Solution

Change the default zone access mode of each CX912 to ensure that the default zone access modes of the CX912s and SNS2124s are the same.

Experience

None

Note

None

FCoE Link Offline Due to E9000 CX911 Ethernet Storm
Problem Description
Table 5-16 Basic information

Item

Information

Source of the Problem

E9000 CX911

Intended Product

E9000 CX911

Release Date

2016-04-26

Keyword

E9000, CX911, storm, offline

Symptom

The onsite networking uses the CH242 (4 x MXEAs), CX911 (XCUA+FCoE_GW), and S3900, as shown in the following figure. The VMware software is installed on the server blade. The FCF modules (FCoE_GW) on the two CX911s provide two external FC ports, which are connected to controller A and controller B of the FC disk array S3900 respectively.

As shown in the preceding figure, CH242s are installed in slots 1 to 4 of the E9000 chassis. After a network storm is injected into the CX911 in slot 2X, the show port command run on the switch modules in slots 2X and 3X shows that the FCoE ports of Mezz1 and Mezz3 are offline. The storage device cannot be accessed.

Key Process and Cause Analysis

Preliminary analysis:

The GW logs show that when a CX911 Ethernet plane storm occurs, all FCoE ports connected to FCoE_GW are offline, but the FC ports are not disconnected.

CMD: show port

----

Fibre Channel / Passthrough Ethernet

          Admin   Operational  Login        Config  Running  Link      Link
Port      State   State        Status       Type    Type     State     Speed
----      -----   -----------  ------       ------  -------  -----     -----
Ext1:0    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Ext2:1    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Ext3:2    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Ext4:3    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Ext5:4    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Ext6:5    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Ext7:6    Online  Online       LoggedIn     GL      F        Active    8Gb/s
Ext8:7    Online  Offline      NotLoggedIn  GL      Unknown  Inactive  Auto
Int13:8   Online  Online       LoggedIn     F       F        Active    8Gb/s
Int14:9   Online  Online       LoggedIn     F       F        Active    8Gb/s
Int15:10  Online  Offline      NotLoggedIn  F       Unknown  Inactive  Auto
Int16:11  Online  Online       LoggedIn     F       F        Active    8Gb/s

Ethernet

          Admin   Operational  Config  Link      Link
Port      State   State        Type    State     Speed   MACAddress
----      -----   -----------  ------  -----     -----   ----------
Int1:12   Online  Offline      F       Active    10Gb/s  00:c0:dd:29:66:a2
Int2:13   Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:a3
Int3:14   Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:a4
Int4:15   Online  Offline      F       Active    10Gb/s  00:c0:dd:29:66:a5
Int5:16   Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:a6
Int6:17   Online  Offline      F       Active    10Gb/s  00:c0:dd:29:66:a7
Int7:18   Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:a8
Int8:19   Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:a9
Int9:20   Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:aa
Int10:21  Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:ab
Int11:22  Online  Offline      F       Inactive  10Gb/s  00:c0:dd:29:66:ac
Int12:23  Online  Offline      F       Active    10Gb/s  00:c0:dd:29:66:ad
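The show port output above can also be scanned by a script. The following is a minimal sketch, not a Huawei tool: it assumes that each port row is whitespace-separated and starts with a port name such as Ext1:0 or Int13:8, followed by the admin state and the operational state. The function name offline_ports is hypothetical.

```python
import re

# Sample rows copied from the Fibre Channel section of the output above.
SAMPLE = """\
Ext1:0 Online Offline NotLoggedIn GL Unknown Inactive Auto
Ext7:6 Online Online LoggedIn GL F Active 8Gb/s
Int13:8 Online Online LoggedIn F F Active 8Gb/s
Int15:10 Online Offline NotLoggedIn F Unknown Inactive Auto
"""

# Port name, admin state, operational state (the first three columns).
PORT_ROW = re.compile(r"^(?P<port>(?:Ext|Int)\d+:\d+)\s+(?P<admin>\S+)\s+(?P<oper>\S+)")


def offline_ports(text):
    """Return ports that are administratively Online but operationally Offline."""
    result = []
    for line in text.splitlines():
        match = PORT_ROW.match(line.strip())
        if match and match["admin"] == "Online" and match["oper"] == "Offline":
            result.append(match["port"])
    return result


print(offline_ports(SAMPLE))
```

Header and separator lines do not match the row pattern and are skipped, so the full command output can be passed in unchanged.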

The GW logs show that the FCoE ports are disconnected because FCoE_GW does not receive the keepalive packet sent by the MZ910 connected to port 12. Therefore, FCoE_GW delivers FIP CVL to clear virtual links. However, the ports that work in FC mode on the same blade are still online.

[312][Sat Apr 26 14:55:10.488 UTC 2014][I][8600.001D][Port][Port: 12][PortID 0x10c00 PortWWN 10:00:f8:4a:bf:5b:92:86 logged into nameserver.]

[313][Sat Apr 26 14:55:10.489 UTC 2014][I][8600.0024][Port][Port: 12][Enode mac f8:4a:bf:5b:92:86 VN_Port mac 0e:fc:00:01:0c:00 port wwn 10:00:f8:4a:bf:5b:92:86 logged in.]

[314][Sat Apr 26 15:09:59.209 UTC 2014][I][8600.0021][Port][Port: 12][Enode mac f8:4a:bf:5b:92:86 VN_Port mac 0e:fc:00:01:0c:00 port wwn 10:00:f8:4a:bf:5b:92:86 logged out due Enode FKA_TOV violation.]

[315][Sat Apr 26 15:09:59.210 UTC 2014][I][8600.002C][Port][Port: 12][FIP Clear Virtual Link (CVL) being sent to Enode host f8:4a:bf:5b:92:86]

[316][Sat Apr 26 15:09:59.213 UTC 2014][I][8600.001E][Port][Port: 12][PortID 0x10c00 PortWWN 10:00:f8:4a:bf:5b:92:86 logged out of nameserver.]
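The keepalive timeouts in GW logs like those above can be extracted with a short script. This is a sketch only: the field layout is inferred from the five log lines shown here, and the function name fka_timeouts is hypothetical.

```python
import re

# Layout inferred from the sample lines:
# [seq][timestamp][severity][code][module][Port: n][message]
LOG_LINE = re.compile(
    r"^\[(?P<seq>\d+)\]\[(?P<time>[^\]]+)\]\[(?P<sev>\w)\]\[(?P<code>[0-9A-Fa-f.]+)\]"
    r"\[(?P<module>\w+)\]\[Port: (?P<port>\d+)\]\[(?P<msg>.*)\]$"
)
ENODE_MAC = re.compile(r"Enode mac (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})")

SAMPLE = [
    "[313][Sat Apr 26 14:55:10.489 UTC 2014][I][8600.0024][Port][Port: 12]"
    "[Enode mac f8:4a:bf:5b:92:86 VN_Port mac 0e:fc:00:01:0c:00 "
    "port wwn 10:00:f8:4a:bf:5b:92:86 logged in.]",
    "[314][Sat Apr 26 15:09:59.209 UTC 2014][I][8600.0021][Port][Port: 12]"
    "[Enode mac f8:4a:bf:5b:92:86 VN_Port mac 0e:fc:00:01:0c:00 "
    "port wwn 10:00:f8:4a:bf:5b:92:86 logged out due Enode FKA_TOV violation.]",
]


def fka_timeouts(lines):
    """Return (port, enode_mac) for every logout caused by an FKA_TOV violation."""
    events = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or "FKA_TOV violation" not in match["msg"]:
            continue
        mac = ENODE_MAC.search(match["msg"])
        events.append((int(match["port"]), mac["mac"] if mac else None))
    return events


print(fka_timeouts(SAMPLE))
```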

Compared with the FC protocol, the FCoE protocol establishes and maintains virtual links. The FCoE virtual links may be interrupted by the storm. However, in theory, the Ethernet storm does not affect FCoE_GW, since the Ethernet switching plane and FC switching plane are physically isolated from each other in the design and implementation of the CX911. The interconnection point between the Ethernet network and the storage network is the MZ910. Therefore, the MZ910 may be affected by the packets in the storm.

Packet analysis:

Capture packets in the storm through the serial port redirection. As shown in the following figures, the packets in the storm are mainly FIP packets and are all VLAN request packets.

Filter out these FIP packets and check again. The CX310 is used as an example to generate an Ethernet storm. Connect the CX310 and CX911 through an optical cable.

Set traffic-policy FCOE-p11 inbound and traffic-policy FCOE-p11 outbound for the CX310 panel ports to filter FIP packets, and then create a storm. FCoE links are not disconnected. Therefore, FCoE links are disconnected due to the FIP VLAN request packets.

Cause analysis:

The debug serial port output of the MZ910 stops while the FIP VLAN request packet storm is in progress and resumes after the storm stops, which indicates that the storm keeps the MZ910 firmware busy. The firmware dump logs were captured and sent to Emulex for analysis. According to Emulex, the root cause is that Lancer does not handle FIP packets whose destination MAC address is ALL_FCF_MAC. The control path processor can handle only normal FIP packets, which are received at a very low rate.

The MZ910 cannot process a large number of FIP packets with the destination MAC address being ALL_FCF_MAC. When the MZ910 receives a large number of such packets in a short time, the firmware is busy and cannot process other threads. Therefore, FIP keepalive packets cannot be sent. As a result, FCoE links are disconnected.

Conclusion and Solution

Conclusion: Emulex adds the following additional logic to the firmware code to solve this problem:

1. Drop the FIP packets with destination address equal to ALL_FCF_MAC.

2. If the number of ALL_FCF_MAC packets exceeds a certain threshold, suspend the processing thread so that other threads can get time to run, thereby recovering from the sudden burst of unhandled packets.
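The two rules above can be illustrated with a small model. This is not the Emulex firmware code: the threshold value, the function name process_batch, and the packet representation are hypothetical, and the multicast address used for ALL_FCF_MAC is the All-FCF-MACs value as we understand it from FC-BB-5, stated here as an assumption.

```python
# Assumed All-FCF-MACs multicast address (not given in the source).
ALL_FCF_MACS = "01:10:18:01:00:02"
# Hypothetical threshold; the real firmware value is not documented here.
YIELD_THRESHOLD = 100


def process_batch(packets, handle, threshold=YIELD_THRESHOLD):
    """Process (dst_mac, payload) pairs.

    Returns (handled, dropped, must_yield). must_yield is True when the
    number of dropped ALL_FCF_MAC packets exceeds the threshold, which
    models suspending the thread so that other threads (for example the
    keepalive sender) can run.
    """
    handled = 0
    dropped = 0
    for dst_mac, payload in packets:
        if dst_mac.lower() == ALL_FCF_MACS:
            dropped += 1  # rule 1: drop packets addressed to ALL_FCF_MAC
            if dropped > threshold:
                return handled, dropped, True  # rule 2: give up the CPU
            continue
        handle(payload)
        handled += 1
    return handled, dropped, False


received = []
storm = [(ALL_FCF_MACS, b"vlan-request")] * 500
print(process_batch(storm, received.append))
```

In this model a storm of 500 packets stops the batch as soon as the threshold is exceeded, instead of occupying the processing thread for the whole burst.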

Solution: Update the MZ910 firmware to the latest version (1.1.43.32 or later).

Experience

Compared with FC, FCoE adds an Ethernet part (mainly FIP), which makes communication between nodes more complicated. In addition, end-to-end links pass through the DCB network and are exposed to risks from the Ethernet network, such as network storms. As a result, an FCoE network is less robust than a traditional pure FC network, and the QoS mechanism of the DCB network and a complete network configuration are required to ensure the reliability and service quality of FCoE communication. When the E9000 uses FCoE, check the versions of the FCoE switches and FCoE mezzanine cards to ensure that no problem is caused by the configuration or version.

Note

None

CDRx Status Alarm Reported by the CX317 Pass-Through Module
Problem Description
Table 5-17 Basic information

Item

Information

Source of the Problem

Live network

Intended Product

E6000

Release Date

2012-01-10

Keyword

E6000, NX120, license

Symptom

On the HMM WebUI, the CX317 pass-through module repeatedly reports CDRx Status alarms and messages indicating that the alarms are cleared.

The following figure shows the alarm information.

There are eight CDR chips on the CX317. Each chip provides four external optical ports. When the CDRx status alarm is reported by the CX317, the 10G optical port on the corresponding panel is affected. As a result, the links become abnormal.

Key Process and Cause Analysis
  1. The links are abnormal.
  2. The CDR chip is faulty.
Conclusion and Solution

Check whether services are affected.

If services are running normally, check the CPLD version of the CX317. If the version is earlier than 013, alarms may be falsely reported. Upgrade the CPLD to 013 or later.

(Note: During the upgrade, the CX317 is automatically restarted. Migrate the services to another board before the upgrade.)

If services are affected, the CDR chip is faulty. Replace the CX317.

Experience

None

Storage Link Down of an E9000 CH121 Compute Node
Problem Description
Table 5-18 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2018-03-06

Keyword

HBA

Symptom

A storage link of an E9000 CH121 compute node is unexpectedly down. The linkdown information on the initiator is shown in the following figure.

Key Process and Cause Analysis
  1. Log in to the compute node, and check the HBA port state.

  2. Check the multipath state of the compute node. The path state of controller B is abnormal.

    The HBA state is normal, but the multipath state is abnormal. Therefore, the link from the FC plane of the switch module to the storage system is abnormal.

  3. Log in to the FC plane, and run the switchshow command. The physical state of a cable link connected to the FC plane is down.
    1. Run the show port 0 command to check the port statistics. No invalid packet exists.

    2. Run the show media 0 command to check the optical module of the port. The optical module fails to receive signals properly.

    3. The preceding analysis shows that the peer optical module or the optical cable is faulty. As a result, the optical module fails to receive signals properly. The storage engineers confirm that the peer optical module is normal.
Conclusion and Solution

Conclusion:

The optical modules of the FC plane and the storage system are normal. However, the optical module of the FC plane fails to receive signals properly. Therefore, the cable in between is faulty and needs to be replaced.

Solution:

Replace the optical cable.

Experience

None

Note

None

Failed to View the HBA Registration Information on the FC Plane (MZ912+CX911)
Problem Description
Table 5-19 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2018-05-09

Keyword

Offline, HBA

Symptom

When the MZ912 and CX911 modules are used, the registration information of an HBA is not displayed on the FC plane. The registration information on the FC plane is as follows:

Three compute nodes exist in the chassis. After the show ns command is run, the HBA registration information of only two compute nodes is displayed. Therefore, the HBA of one compute node fails to be registered on the MX510.

Key Process and Cause Analysis
  1. Log in to the OSs of the compute nodes based on the slot numbers, and check the HBA state. The HBA of a compute node is offline.

  2. Check the HBA firmware version.

  3. Check the OS messages logs. Abnormal HBA records exist.

Conclusion and Solution

Conclusion:

This problem is caused by the outdated HBA firmware. The bugs in the HBA firmware may cause the HBA port to be offline. This problem occurs on HBAs with 10.2.630.0 and earlier firmware.
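Whether a given HBA firmware version falls in the affected range can be checked with a dotted-version comparison. The following is a sketch under the assumption that firmware versions are purely numeric dotted strings. The function names are hypothetical, and the versions other than 10.2.630.0 are made-up examples.

```python
def parse_version(text):
    """Convert a dotted version string such as '10.2.630.0' to a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Upper bound of the affected range stated in the conclusion above.
AFFECTED_MAX = parse_version("10.2.630.0")


def is_affected(firmware_version):
    """Return True when the firmware is 10.2.630.0 or earlier."""
    return parse_version(firmware_version) <= AFFECTED_MAX


for version in ("10.2.470.14", "10.2.630.0", "10.6.144.21"):
    print(version, is_affected(version))
```

Comparing integer tuples avoids the string-comparison pitfall where "10.10" would sort before "10.2".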

Solution:

For details about the rectification method, visit http://support.huawei.com/enterprise/en/bulletins-product/ENEWS1000010829/.

You can contact the hardware maintenance personnel for further support.

Experience

None

Note

None

Alarm About bond1 on the OS of the E9000 CH121
Problem Description
Table 5-20 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2018-05-05

Keyword

bond

Symptom

On the OS of an E9000 CH121 compute node, an alarm about bond1 is reported to the I2000 network management system.

Key Process and Cause Analysis

Check the state of bond1. A network port is down. No exception is found in the messages logs.

Log in to the switch module corresponding to the network port. The internal port is down. Check the configuration of the port. The auto-negotiation function is disabled.
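On a Linux OS, the bond state in the first check is typically read from /proc/net/bonding/bond1. The sketch below parses that file format to list member ports whose MII status is not up. The sample text and the interface names eth2 and eth3 are illustrative, not taken from the site, and the function name down_slaves is hypothetical.

```python
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth2
MII Status: up

Slave Interface: eth2
MII Status: up
Link Failure Count: 0

Slave Interface: eth3
MII Status: down
Link Failure Count: 1
"""


def down_slaves(bonding_text):
    """Return member interfaces whose MII Status is not 'up'."""
    down = []
    current = None  # interface whose status line is expected next
    for line in bonding_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            if value != "up":
                down.append(current)
            current = None  # ignore later status lines until the next member
    return down


print(down_slaves(SAMPLE))
# On a live system (path assumed): down_slaves(open("/proc/net/bonding/bond1").read())
```

The bond-level MII Status line appears before any member section, so it is ignored and only per-member status is reported.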

Conclusion and Solution

Conclusion:

On the switch module interconnected with the service switch, the auto-negotiation function of the internal ports is disabled. As a result, network port negotiation fails, and the ports fail to be brought up.

Solution:

On the switch module, enable the auto-negotiation function of the internal ports and save the configuration.

Experience

None

Note

None

Alternate Up/Down State of All Ports on an E6000 Switch Module
Problem Description
Table 5-21 Basic information

Item

Information

Source of the Problem

E6000 NX112

Intended Product

E6000

Release Date

2018-05-18

Keyword

switch module, up, down

Symptom

On an E6000 server, all ports of the B2 switch module are brought down and then brought up alternately. Onsite engineers suspect that the switch module is faulty.

Key Process and Cause Analysis

Key process:

  1. Analyze the logs of the switch module:

    The 0/0/21 uplink port of monitor-link is down. As a result, multiple ports of the NX112 switch module are down.

  2. Check the peer port state:

    On the S9312 switch, the port connected to the E6000 B2 switch module is GigabitEthernet5/0/28. When no operation is performed on the port, the port is alternately brought up and down. The cause is that the port is physically disconnected.

  3. Cause analysis:

    The switch module is connected to a port on the S9312 switch. When no operation is performed, the port on the S9312 is alternately down and up. This problem is caused by faulty physical components. The network cables need to be checked.

Conclusion and Solution

Conclusion:

The switch module is connected to a port on the S9312 switch. When no operation is performed, the port on the S9312 is alternately down and up. This problem is caused by faulty physical components. The network cables need to be checked.

Solution:

Replace the network cables.

Experience

If the port on the switch module is alternately brought up and down, check the physical link first. The components on the link include: switch module<->optical module<->optical cable<->optical module<->peer switch. The link varies in different environments. Check whether any error alarm is generated by the switch module, and replace the optical modules and optical cables for diagnosis.

Note

None

Increasing Number of InvalidCRC Errors on the Uplink Port of the MX510
Problem Description
Table 5-22 Basic information

Item

Information

Source of the Problem

E9000 MX510

Intended Product

CX311 CX911

Release Date

2018-01-31

Keyword

MX510, abnormal link, InvalidCRC error

Symptom

An E9000 server is equipped with CH240 compute nodes and VMware ESXi 5.5.0. The CX311 switch module of the E9000 server is connected to a Brocade optical switch. The Brocade switch is used as the core optical switch and is connected to the storage system. On January 25, the links of two compute nodes were lost, and the system of one compute node was suspended.

Key Process and Cause Analysis

Key process:

  1. After the suspended compute node is restarted, check the vmkernel logs. A large number of storage command failures exist. These failures may have caused the suspension and indicate exceptions in storage communication.

  2. Check the firmware and driver of the mezzanine card. The driver and firmware are upgraded to the matching version.

  3. Check the packet counts of the ports on the switch module. The ports interconnected with the optical switch have a large number of InvalidCRC errors. InvalidCRC indicates a packet frame checking error. Generally, this error is caused by mismatching port configurations or faulty link components.

    (Port statistics figures omitted: 57: Port 1; 58: Port 0.)

  4. Check the port configuration of the Brocade optical switch. The fill words of the ports interconnected with the switch module are not configured as required.

    When the CX311 or CX911 is connected to a Brocade 8 Gbit/s optical switch, the fill word of the Brocade switch port connected to the switch module must be set to 3.

Conclusion and Solution

Early optical switches work at 1, 2, or 4 Gbit/s and use IDLE as the fill word. The Brocade 8 Gbit/s switch continues using the same fill word. However, the FC-PI-4 and FC-FS-3 protocols require that 8 Gbit/s devices use ARB as the fill word. Most 8 Gbit/s devices (including hosts, storage arrays, and switches) use ARB as the fill word. When the Brocade 8 Gbit/s switch is interconnected with 8 Gbit/s devices, the fill words are inconsistent. As a result, the Brocade switch cannot communicate with the devices properly.

Inconsistent fill words may increase the number of InvalidCRC errors.

Note
  1. On the Brocade optical switch, change the fill words of the ports interconnected with the switch module to 3. Then run the show port <port number> command every 10 minutes, four or five times, to check whether the number of InvalidCRC errors continues to increase. If the link is not restored, restart the compute node.
  2. If the number of InvalidCRC errors continues to increase after the modification, check the physical link between the switch module and the optical switch (including the optical modules at both ends and the optical cables in between).
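The repeated counter check in the steps above amounts to comparing successive InvalidCRC readings. A minimal sketch follows. The readings are made-up values, and collecting them from the show port output is left out because the exact output format is not shown in this case.

```python
def crc_deltas(samples):
    """Differences between consecutive InvalidCRC readings, oldest first."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def crc_still_increasing(samples):
    """Return True if the counter grew between any two consecutive readings."""
    return any(delta > 0 for delta in crc_deltas(samples))


# Hypothetical readings taken every 10 minutes after the fill word change.
after_fix = [1520, 1520, 1520, 1520, 1520]
not_fixed = [1520, 1533, 1560, 1601]

print(crc_still_increasing(after_fix))
print(crc_still_increasing(not_fixed))
```

A counter that stays flat over four or five readings suggests the fill word change took effect. A counter that keeps growing points to the physical link check in step 2.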
WWN Registration Failure When the CX310s Are Interconnected with the FCoE Switches
Problem Description
Table 5-23 Basic information

Item

Information

Source of the Problem

E9000 CX310

Intended Product

CX300 series switch modules

Release Date

2018-05-29

Keyword

FCoE, stacking, WWN

Symptom

An E9000 is equipped with MZ510 NICs and CX310 switch modules (two CX310s are stacked) and runs VMware. The server is connected to H3C FCoE switches and IBM storage devices. The VMware storage configurator shows that some host controllers are offline. After a switch module is removed, the link of the other switch module is normal.

Key Process and Cause Analysis

Key process:

  1. Check the FCoE registration of the switch module. The WWN links of only some hosts are registered successfully.

  2. Check the FIP statistics of the switch modules. A large number of FLOGI request packets are dropped.

  3. The CX310 stack is interconnected with the H3C FCoE switches by using an Eth-Trunk link. The Eth-Trunk configuration is as follows:

  4. The global configuration contains fcoe dual-fabric enable. As a result, the stack port does not allow FIP packets to pass through.

Conclusion and Solution

The CX310 stack is interconnected with the H3C FCoE switches by using an Eth-Trunk link. In this situation, two risks exist.

  1. Link setup risk: In the following figure, the FIP negotiation packets from port P0 are sent through the green path, and the packets from the H3C switch stack are sent through the yellow path. The FIP packets on the yellow path cannot pass the stack port due to the fcoe dual-fabric enable setting. As a result, the link fails to be set up.
  2. Reliability risk: If the FIP packets can pass the stack port, the FCoE links concentrate on the 3X switch module. When the 3X switch module is reset or faulty, the two FCoE links are affected.

Experience

Solutions:

Solution 1: Separate the FCoE connections of the 2X and 3X switch modules and set up physical active and standby links according to the deployment guide.

Solution 2: Do not use stacking and connect the 2X and 3X switch modules to the switch stack independently. This method can also build physical FCoE links.

Common Problems of the Management Software

Active Management Module Cannot Be Accessed Due to a Data Module Loop on an E6000 Server
Problem Description
Table 5-24 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2014-06-19

Keyword

Data module, management module

  • Do not connect the management network port on a management module and that on a data module to one switch. Otherwise, a network loop occurs.
  • Do not connect uplink and downlink network ports to one switch. Otherwise, a network loop occurs.
  • Data modules do not support hot swapping. You can remove data modules only after the chassis is powered off.
Symptom

The client is directly connected to management modules through network cables. When a management module is in the standby state, the static IP address of the management module can be accessed; when a management module is in the active state, the static and floating IP addresses of the management module cannot be pinged.

Key Process and Cause Analysis

Cause Analysis:

A data module is configured in the chassis, and the uplink and downlink network ports of the data module are connected to the same switch. A network loop occurs. As a result, the active management module cannot be accessed.

Recognition Method:

Connect to the active management module through a serial port and run the dmesg command. If a large amount of "eth0: duplicate address detected" information is displayed and no IP address conflict occurs in the network environment, the problem is caused by the data module loop.
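The recognition method can be scripted against saved dmesg output. The sketch below counts the message quoted above. The threshold of 10 occurrences is a hypothetical stand-in for "a large amount", and the function names are not part of any Huawei tool.

```python
def count_duplicate_address(dmesg_text, interface="eth0"):
    """Count dmesg lines reporting a duplicate address on the interface."""
    needle = f"{interface}: duplicate address detected"
    return sum(1 for line in dmesg_text.splitlines() if needle in line)


def looks_like_loop(dmesg_text, threshold=10):
    """Heuristic: many duplicate-address messages suggest a loop.

    An IP address conflict in the network must still be ruled out manually.
    """
    return count_duplicate_address(dmesg_text) >= threshold


sample = "\n".join(["eth0: duplicate address detected!"] * 25 + ["eth0: link up"])
print(count_duplicate_address(sample), looks_like_loop(sample))
```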

NOTE:

When a data module is configured in the chassis, the active management module and the data module are interconnected, but the standby management module and the data module are isolated. Therefore, the data module loop affects only the active management module.

Conclusion and Solution

None

Note

Data module installation

For details, see the E6000 Server V100R002 Installation Guide.

http://support.huawei.com/enterprise/en/doc/DOC0100436839

A data module is used to cascade the management modules of multiple E6000 chassis so that the chassis can be managed in a centralized manner, as shown in Figure 5-44.

Figure 5-44 Cascading
Virtual Media Cannot Be Used on the MM910 After an iMana Upgrade
Problem Description
Table 5-25 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2015-02-09

Keyword

Virtual media

Symptom

Hardware configuration:

E9000 with MM910 2.20

Symptom:

The virtual media cannot be used on the MM910 after the iMana is upgraded to 6.01, as shown in Figure 5-45.

Figure 5-45 Virtual media cannot be used on the MM910

Key Process and Cause Analysis

Cause analysis:

The iMana 5.30 and MM910 3.00 or their later versions incorporate the VMM encryption function. To use the encryption function, ensure that both the iMana and MM910 support this function.

Conclusion and Solution

Conclusion:

The MM910 and iMana versions are not matched.
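The version relationship described in the cause analysis can be expressed as a small check. This sketch assumes that versions are numeric major.minor strings and treats a pair as matched when both sides, or neither side, include the VMM encryption function. The source confirms only the mismatched case (iMana 6.01 with MM910 2.20), so the "neither" case is an assumption. The function names are hypothetical.

```python
def version_tuple(text):
    """Convert 'major.minor' (for example '5.30') to a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


IMANA_MIN = (5, 30)  # first iMana version with VMM encryption
MM910_MIN = (3, 0)   # first MM910 version with VMM encryption


def vmm_versions_match(imana_version, mm910_version):
    """Return True when both or neither side supports VMM encryption."""
    imana_ok = version_tuple(imana_version) >= IMANA_MIN
    mm910_ok = version_tuple(mm910_version) >= MM910_MIN
    return imana_ok == mm910_ok


print(vmm_versions_match("6.01", "2.20"))  # the combination in this case
print(vmm_versions_match("6.01", "3.00"))  # after the MM910 upgrade
```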

Solution:

Upgrade the MM910 to 3.00 or later.

Experience

None

Note

None

E9000 Management Modules Cannot Connect to the KVM Because Port 2200 Is Restricted on the Firewall
Problem Description
Table 5-26 Basic information

Item

Information

Source of the Problem

Live network

Intended Product

E9000

Release Date

2015-07-15

Keyword

E9000, management module, KVM, firewall, port 2200

Symptom

Hardware configuration:

CH242 V3 DDR4 server equipped with the LSI SAS3108 RAID controller card

Symptom:

At a site outside China, the blade KVM cannot be connected using management modules.

After a user clicks KVM via MM on the HMM Web, the KVM list is displayed normally. However, after the user clicks a blade slot number, the error message "Connect blade fail, the blade IP is 10.238.57.188" is displayed, as shown in Figure 5-46.

Figure 5-46 Error message

Management module network configuration information:

  • IP addresses of active and standby management modules: 10.238.57.186; 10.238.57.187
  • Floating IP address: 10.238.57.188
Key Process and Cause Analysis

Key process:

Check the internal IP address configuration of the active management module and find that it is correct, as shown in Figure 5-47.

Figure 5-47 Internal IP address configuration of the active management module

Check the internal IP address configuration of the standby management module and find that it is correct, as shown in Figure 5-48.

Figure 5-48 Internal IP address configuration of the standby management module

Check the internal IP address configuration of the blade BMC and find that it is correct, as shown in Figure 5-49.

Figure 5-49 Internal IP address configuration of the blade BMC

Log in to the BMC WebUI of the blade server, and view the service configuration. The port number is the default value, as shown in Figure 5-50.

Figure 5-50 Service configuration

Check the route information of the management PC and find that multiple route hops occur, as shown in Figure 5-51.

Figure 5-51 Route information of the management PC

Connect the management PC and management modules to the same switch. The symptom disappears.

Conclusion and Solution

Conclusion:

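The problem is reproduced in the local environment when port 2200 is blocked. Therefore, the KVM connection fails because the firewall between the management PC and the management modules restricts port 2200.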

Figure 5-52 Port 2200
Experience

If a similar problem occurs, you can manage blades by connecting cables to them to meet urgent service needs.

Check the firewall configuration and check whether port 2200 is blocked.
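Before reviewing the firewall rules, reachability of port 2200 from the management PC can be probed with a TCP connection attempt. The following is a minimal sketch using the Python standard library. Probing the floating IP address from this case is an assumption about which address the KVM client connects to, and a failed connection can also mean that the service is not listening rather than that a firewall blocks it.

```python
import socket


def tcp_port_open(host, port=2200, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call from the management PC (address taken from this case):
# tcp_port_open("10.238.57.188", 2200)
```

Running the same probe from a PC on the same switch as the management modules gives a baseline: if it succeeds there but fails across the routed path, an intermediate device is the likely cause.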

Note

None

A Network Loop Occurs Because the Stack Port Is Incorrectly Used on the MM910 of an E9000 Server
Problem Description
Table 5-27 Basic information

Item

Information

Source of the Problem

HMM

Intended Product

HMM

Release Date

2015-04-30

Keyword

Network loop

Symptom

A network storm occurs in the entire subnet.

Key Process and Cause Analysis

Key process:

  1. Log in to the active HMM and run the smmget -d outportmode command to view the network port out mode of the HMM. If the network port out mode is 0, go to Step 2. If the network port out mode is 1, go to Step 3.
  2. When the network port out mode is 0, do not connect one SMM to the network through its stack port and the other SMM through the switch module if both connections are in the same subnet, as shown in Figure 5-53.

    Figure 5-53 The network port out mode is 0

  3. When the network port out mode is 1, do not connect one SMM to the network through its management port and the other SMM through its stack port if both connections are in the same subnet, as shown in Figure 5-54.

    Figure 5-54 The network port out mode is 1

Conclusion and Solution

Conclusion: A network loop occurs when the stack port on the HMM is incorrectly cabled.

Solution: When the problem occurs, remove the network cable from the stack port on the HMM to eliminate the network loop.

Experience
  1. You can remove the network cable from the stack port to resolve the problem.
  2. To prevent the customer from connecting the network cable incorrectly to the stack port, the stack port needs to be disabled by default in later versions.
Note

None

The MM910 Administrator Password Is Forgotten
Problem Description
Table 5-28 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2014-04-15

Keyword

MM910, administrator password, U-Boot

Symptom

Hardware configuration

E9000 server

Symptom

The default administrator password of the MM910 is changed, but the new password is forgotten.

Key Process and Cause Analysis

Key process

  1. Log in to the MM910 U-Boot. For details, see the MM910 Management Module V100R001 User Guide.
  2. Run the following commands in sequence to restore the default password:
    1. pwdclr (This command is used to restore the default password.)
    2. saveenv (This command is used to save the setting.)
    3. reset (This command is used to restart the server.)
  3. After the restart, run the following commands in sequence to set a new administrator password:
    1. setenv passwd (This command is used to set a new password.)
    2. saveenv (This command is used to save the setting.)
    3. reset (This command is used to restart the server.)
Conclusion and Solution

Solution

Log in to the MM910 U-Boot, restore the default administrator password, and set a new administrator password immediately.

Experience

None

Note

None

MM910 Management Network Cannot Be Connected After the Switch Module CPLD Is Upgraded
Problem Description
Table 5-29 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

MM910

Release Date

2017-12-20

Keyword

MM910, CPLD, upgrade, network disconnection

Symptom

After the CPLD of the switch module is upgraded, the MM910 management network is disconnected.

Key Process and Cause Analysis

Problem Analysis:

1) Onsite networking

The management networks of the MM910s and switch modules in slots 2X and 3X are connected to an external switch. The switch modules in slots 2X and 3X are stacked, ports 2/17/15 and 3/17/15 form an eth trunk, and the external switch ports connecting to 2X and 3X form an eth trunk.

2) The CPLDs of the switch modules in slots 2X and 3X are upgraded at the same time, and both the switch modules are restarted simultaneously. The eth trunk configuration of 2/17/15 and 3/17/15 on the live network is not saved to the management module. As a result, after the switch modules are restarted, a loop is formed between the external switch and the switch modules. After the broadcast packets sent by the management module pass through the loop, MAC address flapping occurs, as shown in the following figure.

Note: As shown in the preceding figure, after the MAC address of the active management module passes through the active management module > external switch > 3X > 2X > external switch, the MAC address of the management module shifts to the port connecting the external switch and the switch module in slot 2X. As a result, the external network fails to access the management module.
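MAC address flapping of this kind can be recognized from a sequence of MAC learning events on the external switch. The sketch below is illustrative only: the event format, the port labels, the MAC address, and the function name find_mac_flapping are hypothetical, and real switches report flapping through their own logs or alarms.

```python
def find_mac_flapping(observations, min_moves=2):
    """Return {mac: moves} for MACs that changed port at least min_moves times.

    observations is an iterable of (mac, port) pairs in the order learned.
    """
    last_port = {}
    moves = {}
    for mac, port in observations:
        if mac in last_port and last_port[mac] != port:
            moves[mac] = moves.get(mac, 0) + 1
        last_port[mac] = port
    return {mac: count for mac, count in moves.items() if count >= min_moves}


MM_MAC = "00:00:5e:00:53:01"  # documentation-range address, not a real MM910 MAC
events = [
    (MM_MAC, "port-to-MM"),
    (MM_MAC, "port-to-2X"),  # broadcast returned through the loop
    (MM_MAC, "port-to-MM"),
    (MM_MAC, "port-to-2X"),
]
print(find_mac_flapping(events))
```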

Conclusion and Solution

Solution

Create an eth trunk of 2/17/15 and 3/17/15 on the switch modules in slots 2X and 3X again and save it to the management module.

Experience

None

Note

None

Common Problems of Fan Modules and Power Supplies

Indicators on the Active and Standby SMMs Are Yellow Due to I2C Link Deadlock for E6000 Power Supplies
Problem Description
Table 5-30 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2011-04-20

Keyword

I2C, SMM

Author

Li Weijia (employee ID: 00176591)

Symptom

Indicators on the active and standby shelf management modules (SMMs) are yellow when an E6000 is operating in an office (the temperature of the equipment room is about 20°C, and the ventilation is normal). Figure 5-55, Figure 5-56, and Figure 5-57 show the alarm information. The power supply unit (PSU) power cannot be obtained from the Status Monitoring interface.

Figure 5-55 MM alarm 1

Figure 5-56 MM alarm 2

Figure 5-57 PSU alarm

Key Process and Cause Analysis

Cause analysis

The inter-integrated circuit (I2C) link is in deadlock state without the self repair mechanism. In this case, the SMM I2C data clock cannot reach the operating voltage.

Conclusion and Solution

Solution

Identify the PSU whose I2C link is in deadlock state (the PSU that prevents the SMM I2C data clock from reaching the operating voltage), and replace it.

Perform the following operations only if office services are not affected. (The E6000 must not be powered off or restarted. If the E6000 is fully configured with 10 blades, ensure that three PSUs, each rated at 1600 W, supply power at the same time.)

  1. Remove and reinstall a PSU or power cables, and wait for about a minute.
  2. Check whether SMM alarms are cleared.
  3. If alarms are not cleared, repeat steps 1 and 2 on the PSUs that have not yet been removed and reinstalled until alarms are cleared. Then replace the PSU whose removal and reinstallation cleared the alarms.
Experience

None

Note

None

Nine Fan Indicators in an E6000 Are Blinking Red at the Same Time
Problem Description
Table 5-31 Basic information

Item

Information

Source of the Problem

E6000

Intended Product

E6000

Release Date

2011-04-14

Keyword

E6000, fan

Author

Li Weijia (employee ID: 00176591)

Symptom

Symptom

Alarms are generated on nine fans at the same time in a running E6000 server. The alarm indicators blink red at the same frequency. Fan alarm information repeats in the fan log, as shown in Figure 5-58. The fan speed is normal, as shown in Figure 5-59. No temperature alarm is reported on the shelf management module (SMM), as shown in Figure 5-60. The equipment room temperature is 22°C to 24°C, and the ventilation is in good condition.

Figure 5-58 Fan alarms

Figure 5-59 Fan speed

Figure 5-60 SMM

Key Process and Cause Analysis

Key process

  1. E6000 fan alarm indicators are controlled by the SMM software. If the fan speed is abnormal, the SMM software turns on fan alarm indicators.
  2. Run a script on the SMM that sets the fan speed and then reads the current fan sensor value. Stress-test the reading by polling the fan sensor in an infinite loop at intervals of 5s, 3s, 1s, and 50 ms, and observe whether fan alarm indicators are lit.
  3. The shorter the polling interval, the more frequently fan alarm indicators are lit.

Cause analysis

The SMM software controls fan alarm indicators as follows: After setting the fan speed, the SMM software obtains the sensor reading. Fan speed switchover takes time, and the fan speed does not change linearly. Therefore, if the SMM software reads the sensor while the speed is still changing, it finds a large fan speed deviation, mistakenly determines that the fans are abnormal, and turns on fan alarm indicators.
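The false alarm mechanism can be illustrated with a simple model. Everything in the sketch below is an assumption made for illustration: the exponential approach to the target speed, the 2-second time constant, the 10% tolerance, and the speed values are not taken from the SMM software.

```python
import math


def fan_reading(t, start_rpm, target_rpm, tau=2.0):
    """Modelled sensor reading t seconds after the speed is set (non-linear)."""
    return target_rpm + (start_rpm - target_rpm) * math.exp(-t / tau)


def deviation_alarm(reading, target_rpm, tolerance=0.10):
    """Return True when the reading deviates from the target by more than tolerance."""
    return abs(reading - target_rpm) / target_rpm > tolerance


# Read the sensor at different delays after switching from 3000 to 6000 rpm.
for delay in (0.05, 1, 3, 5):
    reading = fan_reading(delay, 3000, 6000)
    print(f"delay={delay}s reading={reading:.0f} alarm={deviation_alarm(reading, 6000)}")
```

In this model a reading taken 50 ms after the speed change is still far from the target and raises an alarm, while a reading taken after the fan has settled does not, which matches the trend observed in the stress test.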

Conclusion and Solution

Solution

Two solutions are available:

Solution 1

If possible, remove the active and standby SMMs at the same time, and then install them. In this way, alarms are cleared.

Solution 2

Delete fan logs to clear alarms.

  1. For details about how to view the information about the active and standby SMMs, see Note.
  2. Log in to the active SMM over File Transfer Protocol (FTP), and delete the memory.bin file in the /tmp directory.
    NOTE:

    The SMM login method is ftp://IP address of the active SMM.

  3. Log in to the standby SMM over FTP, and delete the memory.bin file in the /tmp directory.
  4. Log in to the active SMM over Telnet, and run the smmset -l smm -d failover -v 1 or smmset -d failover -v 1 command. When the message shown in Figure 5-61 is displayed, enter y to switch over the active and standby SMMs. The fan alarms are then cleared. During the switchover, the health indicators on all server blades in the chassis turn red at the same time. After the switchover is complete, the health indicators return to normal.
    Figure 5-61 Prompt
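
The two FTP deletion steps in Solution 2 could be scripted as in the following sketch, which uses the Python standard library ftplib module. The host addresses, user name, and password are placeholders to be supplied by the caller, and the assumption that the SMM FTP service accepts a standard delete request is not confirmed by this case. The active/standby switchover over Telnet is not automated here because it requires interactive confirmation.

```python
from ftplib import FTP, error_perm


def delete_fan_log(ftp, path="/tmp/memory.bin"):
    """Delete the fan log on an open FTP session.

    Returns True if the file was deleted, or False if the server refused
    the request (for example, because the file does not exist).
    """
    try:
        ftp.delete(path)
        return True
    except error_perm:
        return False


def clear_fan_logs(hosts, user, password, connect=FTP):
    """Delete memory.bin on each SMM in order (active SMM first, then standby).

    `connect` is injectable so that the logic can be exercised without a
    real SMM. By default it opens a real FTP connection.
    """
    results = {}
    for host in hosts:
        ftp = connect(host)
        try:
            ftp.login(user, password)
            results[host] = delete_fan_log(ftp)
        finally:
            ftp.quit()
    return results
```

Call clear_fan_logs with the IP address of the active SMM first and the standby SMM second, and then perform the switchover in step 4 manually.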
Experience

None

Note

You can obtain the information about the active and standby SMMs by using the following two methods:

  • Run the smmget [-l smm] -d redundancy command to determine the active SMM. The SMM in active state is the active SMM, and the SMM in standby state is the standby SMM. Viewed from the front of the E6000 chassis, the SMM in the upper left corner is SMM 1, and the SMM in the upper right corner is SMM 2.
     root@SMM:/#smmget -d redundancy 
    The Redundancy States of SMMs: 
    SMM1: Present(active)*  
    SMM2: Present(standby) 
    * = The SMM you are currently logged into
  • Identify the active and standby SMMs based on the ACT indicator status, as shown in Table 5-32.
    Table 5-32 Relationships between the ACT indicator status and the active and standby SMMs

    Indicator | State | Description
    ACT | Steady green | Active SMM
    ACT | Blinking green | Standby SMM
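
When the check is scripted, the output of smmget -d redundancy can be parsed to find the active SMM. The following sketch is based only on the sample output shown in this note. The function names are hypothetical, and the format of lines for other states (for example, an absent SMM) is an assumption, so lines that do not match the pattern are ignored.

```python
import re

# Sample output copied from the note above.
SAMPLE_OUTPUT = """\
The Redundancy States of SMMs:
SMM1: Present(active)*
SMM2: Present(standby)
* = The SMM you are currently logged into
"""

# Matches lines such as "SMM1: Present(active)*".
_SMM_LINE = re.compile(r"^\s*(SMM\d+):\s*(\w+)\((\w+)\)(\*?)\s*$")


def parse_redundancy(output):
    """Return a dict mapping each SMM name to its presence, role, and login marker."""
    states = {}
    for line in output.splitlines():
        match = _SMM_LINE.match(line)
        if match:
            name, presence, role, star = match.groups()
            states[name] = {
                "presence": presence,
                "role": role,
                "logged_in": star == "*",
            }
    return states


def active_smm(states):
    """Return the name of the SMM whose role is 'active', or None if there is none."""
    for name, info in states.items():
        if info["role"] == "active":
            return name
    return None


print(active_smm(parse_redundancy(SAMPLE_OUTPUT)))
```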

E9000 Fan Module Faults
Problem Description
Table 5-33 Basic information

Item | Information
Source of the Problem | Live network
Intended Product | E9000 fan modules
Release Date | 2017-05
Keyword | Fantray fault

Symptom
  • Hardware configuration:

    Fan modules

  • Symptom:

    The Fan9 indicator on the E9000 chassis blinks red, and the Fan9 Fault alarm is reported on the HMM WebUI.

Alarm page

Key Process and Cause Analysis

Possible Causes:

  1. The fan module is faulty.
  2. The link of the chassis backplane is faulty.
  3. The MM910 is faulty.

Locating Method:

Description of the fan module detection link: The MM910 sends a detection signal to the fan module through the backplane link. After receiving the detection signal, the fan module returns a response signal to the MM910. If the MM910 fails to receive the response signal from the fan module, it reports a fan module fault alarm.

Fault Locating:

1. Check whether the fan module is properly installed. If not, reinstall the fan module.

2. Swap the fan module with a functioning one from another slot. If the fault follows the original fan module to the new slot, the fan module is faulty and must be replaced. If the fault stays with the original slot, the fault is caused by the chassis or the MM910.

3. Inspect the fan module slot in the chassis for foreign objects or a damaged connector. If the connector is damaged, replace the chassis.

4. If alarms are reported for multiple fan modules, perform an active/standby switchover of the management modules and check whether the fault is rectified. If the fault persists, collect logs and send them to R&D engineers for analysis.
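
The swap test in the fault locating steps can be summarized as a small decision function. This is an illustrative sketch only: the function name, the result labels, and the slot numbers in the example are hypothetical. The "reseat resolved" outcome is an interpretation of step 1, not a statement from this case.

```python
FAN_MODULE_FAULTY = "fan module faulty: replace the fan module"
SLOT_OR_MM910 = "chassis or MM910 suspected: inspect the slot and connector"
RESEAT_RESOLVED = "no alarm after swap: module was probably not seated properly"
MULTIPLE_ALARMS = "multiple alarms: try MM910 active/standby switchover, then collect logs"


def diagnose_after_swap(original_slot, donor_slot, alarm_slots_after_swap):
    """Interpret the result of swapping the suspect module with a good one.

    original_slot: slot that reported the alarm before the swap.
    donor_slot: slot the functioning module came from (the suspect
        module sits there after the swap).
    alarm_slots_after_swap: slots that report a fan alarm after the swap.
    """
    alarms = set(alarm_slots_after_swap)
    if not alarms:
        return RESEAT_RESOLVED
    if alarms == {donor_slot}:
        # The alarm moved with the module.
        return FAN_MODULE_FAULTY
    if alarms == {original_slot}:
        # The alarm stayed with the slot.
        return SLOT_OR_MM910
    return MULTIPLE_ALARMS


# Example: Fan9 alarmed, its module was swapped with the one in slot 8,
# and the alarm now appears on slot 8.
print(diagnose_after_swap(9, 8, {8}))
```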

Conclusion and Solution

Conclusion:

The problem is caused by a faulty fan module. Replace the faulty fan module.

Solution:

The fan module onsite is faulty. After the fan module is replaced, the problem is resolved.

Verification:

The alarm is cleared, and the fan module works properly.

Experience

N/A

Note

N/A

Configuration and Installation Problems

Error Occurs when a K2 Video Card Is Configured on the CH221
Problem Description
Table 5-34 Basic information

Item | Information
Source of the Problem | CH221
Intended Product | CH221
Release Date | 2014-12-25
Keyword | K2 video card

Symptom

Symptom

A K2 video card is configured on a CH221 on which the video card driver has been installed. An error occurs when the video card is queried at the system level by running nvidia-smi -a. Figure 5-62 shows the error information.

Figure 5-62 Error information

Key Process and Cause Analysis

Cause analysis

The CH221 is configured with two 6-pin-to-6-pin power cables by default. If the CH221 uses a K2 video card, a separate 8-pin-to-6-pin power cable must be used in place of one of the default 6-pin-to-6-pin power cables. Otherwise, the power supply to the video card is insufficient.

Conclusion and Solution

Solution

When a K2 video card is purchased separately, look up its BOM in the compatibility list (for the CH221, the K2 video card BOM is 06320053). The configurator then automatically displays the cable package BOM.

Experience

None

Note
  • In normal cases, the information shown in Figure 5-63 is displayed.
    Figure 5-63 Normal displayed information
  • The installation rules are as follows:
    1. Verify that the installation component 02311BDD is in the order or delivery list.
    2. Verify that the video card power connectors are connected. The connectors use a foolproof design, and the cable types and quantity are selected based on the video card power connector types.
  • Pay attention to the following when installing a K2 video card on the CH221:
    1. A K2 video card has a 6-pin connector and an 8-pin connector. The mainboard has two 6-pin connectors. See Figure 5-64.
      Figure 5-64 Pin connector

    2. Pay attention to the cable connector types for the K2 video card.

      The default cables on the server are intended for video cards other than K2 video cards. They have 6-pin connectors, which have two pins fewer than the connector required by a K2 video card, as shown in Figure 5-65.

      A K2 video card requires dedicated 8-pin connector cables. The cables shown in Figure 5-65 must be replaced.

      Figure 5-65 Cable connector

    3. The connectors on the mainboard are shown in Figure 5-66.
      Figure 5-66 Connectors on the mainboard
Updated: 2019-02-25

Document ID: EDOC1000041338
