Diagnose and rectify Ethernet controller faults depending on the symptoms.
- If a fault can be located using logs or tools, see "Handling Procedure". If a fault needs to be rectified quickly onsite, see "Quick Recovery Method".
- For more fault symptoms and solutions, see the Computing Product Case Library. The Computing Product Case Library is available only to Huawei partners and Huawei engineers.
Fault Symptom
|
Handling Procedure
|
Quick Recovery Method
|
A network port is invisible.
|
1. Check whether the NIC type and driver are compatible with the OS and server (BIOS/iBMC).
   - If the NIC firmware and driver versions do not match, upgrade them to the matching versions.
2. To check whether the PCIe device is visible, run the lspci | grep -i eth command in Linux (or the equivalent in other OSs) and observe the response.
   - If the PCIe device is visible, go to 4.
   - If the PCIe device is invisible, go to 3.
3. If the PCIe device is invisible in the system, perform the following steps:
   - Check the logical topology of the NIC. If the CPU that owns the NIC's PCIe bus is not installed, the PCIe devices attached to that CPU are invisible.
   - Power the iBMC off and then on. Check whether the fault persists.
   - Insert the suspect NIC into another slot, and insert a known-good NIC into the suspect slot. Then check whether the fault follows the NIC or stays with the slot.
4. If the PCIe device is visible in the system but the network port is invisible, the driver has failed to load. In this case, perform the following steps:
   - Run the ifconfig ethN up command in Linux (or the equivalent in other OSs) to bring the network port up. Then check that the information in the network port configuration file is consistent with the actual physical network ports and that the network ports are up.
   - If an error is reported when you install the driver in compilation mode, run the gcc -v and c++ -v commands on the OS CLI. If the command output displays the corresponding version information, the GCC and C/C++ software is installed properly. Otherwise, install the GCC and C/C++ software first.
   - Check the optical module type. If an Intel NIC and a non-Intel optical module are configured, the driver cannot be loaded and the network port is invisible.
   - Reinstall the driver. Check that no errors are reported during the driver installation, and check whether the system logs record any failures when loading the driver.
5. Collect OS logs. For details, see Collecting OS Logs.
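The branch in this procedure between an invisible PCIe device and a visible device with no network port can be scripted. The following is a minimal sketch, not a Huawei tool. The function names count_eth_controllers, count_physical_ports, and next_check are hypothetical. It assumes that lspci lists the NIC as an "Ethernet controller" and that physical ports expose a device entry under /sys/class/net; both assumptions may not hold for every NIC or OS.

```shell
#!/bin/sh
# Hypothetical helpers for triaging an invisible network port (Linux only).

# Count Ethernet controllers in `lspci` output read from stdin.
# Assumption: the NIC is listed with the class name "Ethernet controller".
count_eth_controllers() {
    grep -ci 'ethernet controller' || true
}

# Count network interfaces that are backed by a hardware device.
# Assumption: physical ports have a "device" entry in sysfs.
# $1 (optional) overrides the sysfs directory.
count_physical_ports() {
    root=${1:-/sys/class/net}
    n=0
    for dev in "$root"/*; do
        if [ -e "$dev/device" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Suggest which branch of the procedure applies.
# $1 = number of Ethernet controllers, $2 = number of physical ports.
next_check() {
    if [ "$1" -eq 0 ]; then
        echo "PCIe device invisible: check topology, slot, and NIC"
    elif [ "$2" -eq 0 ]; then
        echo "PCIe device visible, port invisible: check driver loading"
    else
        echo "PCIe device and port visible"
    fi
}

# Example use on a live system:
#   next_check "$(lspci | count_eth_controllers)" "$(count_physical_ports)"
```

The helpers only read text and sysfs entries, so they do not change any port or driver state.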
|
1. If a visible NIC port becomes invisible while the server is running, and services can be interrupted, power the server off and on. If the fault persists, go to 2.
2. Insert the NIC into another PCIe slot and check whether the fault is rectified.
   - If the fault is caused by the NIC, replace the NIC.
   - If the fault is caused by the PCIe card slot, replace the mainboard.
3. If the fault persists, contact Huawei technical support.
|
A communication error occurs on a network port.
|
- Check whether the network cable is connected properly to the network port.
- Use the Computing Product Compatibility Checker to check whether the NIC type is compatible with the server board. Contact Huawei technical support to check whether the NIC firmware and driver versions match the OS version. If they do not match, upgrade the NIC firmware and driver first.
- To bring a network port up, run the ifconfig ethN up command in Linux (the command may vary in different OSs). To check whether the link is up, run the ethtool ethN command and check the Link detected field. To check whether IP addresses are set for the required network ports, run the ifconfig ethN command.
- Run the ethtool -p ethN command in Linux (the command may vary in other OSs) to blink the indicator of the port and identify the physical port. Use this to check whether the information in the network port configuration file of the server is consistent with the actual physical network ports. Then check whether the network port status indicators are on and whether the network ports on the switch are up.
NOTE: The ethtool -p ethN command applies only to PCIe cards.
- Check the settings of IP addresses, gateway addresses, VLANs, bondings, and uplink switch network ports.
- Collect OS logs. For details, see Collecting OS Logs.
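The link check in this procedure can be reduced to reading one field. The following is a minimal sketch; the function name link_state is hypothetical, and it assumes that the ethtool ethN output contains a "Link detected:" line, which may not be the case for every driver or ethtool version.

```shell
#!/bin/sh
# Hypothetical helper: print the "Link detected" value from `ethtool ethN`
# output read from stdin, or "unknown" if the field is missing.
link_state() {
    awk -F': ' '
        /Link detected/ { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# Example use on a live system (replace ethN with the port name):
#   ethtool ethN | link_state
```

A result of "no" points to the physical layer (cable, module, or switch port), whereas "yes" with failed communication points to the IP address, gateway, VLAN, or bonding settings.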
|
1. Use the ping command to check whether only this server or other servers on the network also have network faults.
   - If the fault occurs on more than one server, check whether the external switching network is normal.
   - If the fault occurs on only one server, go to 2.
2. Check the network port status (whether the status indicator is steady on). If the network port is link down (the status indicator is off), swap the module, cable, and uplink switch port of the abnormal network port with those of a normal network port to locate the faulty component. Replace or adjust the component based on the site requirements.
3. If the NIC is causing the fault, restart the server at a time when the interruption will not affect services, and check whether communication is normal. If the fault persists, power the server off and on. If the fault still persists, replace the NIC.
4. If the fault persists, contact Huawei technical support.
|
A packet error or packet loss occurs on a network port.
|
- Use the Computing Product Compatibility Checker to check whether the NIC type is compatible with the server board. Contact Huawei technical support to check whether the NIC firmware and driver versions match the OS version. If they do not match, upgrade the NIC firmware and driver first.
- Check whether the packet loss and error counters of the network port keep increasing. If they do not increase continuously, the error can be ignored.
- Insert the suspect NIC into another slot, and insert a known-good NIC into the suspect slot. Then check whether the fault follows the NIC or stays with the slot.
- Connect the suspicious network cable to a normal server, connect a normal network cable to the suspicious server, and check whether the fault is caused by the suspicious network cable.
- Switch the service traffic from the network port that you suspect to be faulty to a different network port. Then, check whether the fault is caused by the network port.
- To check parameters regarding the packet error or loss, run the ethtool -S ethN command in Linux (or similar in other operating systems).
- Collect OS logs. For details, see Collecting OS Logs.
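Whether the counters keep increasing can be checked by comparing two ethtool -S ethN snapshots taken some time apart. The following is a minimal sketch; the function name increasing_counters is hypothetical. It assumes that error and loss counters have names containing err, drop, discard, or miss. Counter names are driver-specific, so adjust the pattern to the NIC in use.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved `ethtool -S ethN` snapshots and print
# the error/drop counters that increased, with the size of the increase.
# $1 = earlier snapshot file, $2 = later snapshot file.
increasing_counters() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }            # strip leading indentation
        NR == FNR { before[$1] = $2; next }    # first file: remember values
        ($1 in before) && $1 ~ /err|drop|discard|miss/ && ($2 + 0) > (before[$1] + 0) {
            print $1 ": +" ($2 - before[$1])
        }
    ' "$1" "$2"
}

# Example use on a live system (replace ethN with the port name):
#   ethtool -S ethN > /tmp/stats.1; sleep 60; ethtool -S ethN > /tmp/stats.2
#   increasing_counters /tmp/stats.1 /tmp/stats.2
```

Empty output means that none of the matched counters increased between the two snapshots.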
|
- Check whether the packet loss occurs only on a single server. Run the ethtool -S ethN command to check the packet loss type and run the top command to check the system resource usage (software interrupts, CPU usage, and memory usage) and NIC traffic.
- When you have the customer's permission to interrupt services, connect a PC to the port and check for packet loss. Connect the PC to other working ports, and check optical modules, optical cables, and uplink switches. Then, replace or adjust components based on the actual situation.
- If the NIC is causing the fault, restart the server at a time when the interruption will not affect services, and check whether communication is normal. If the fault persists, power the server off and on. If the fault still persists, replace the NIC.
- If the fault persists, contact Huawei technical support.
|
The performance of a network port does not meet requirements.
|
- Use the Computing Product Compatibility Checker to check whether the NIC type is compatible with the server board. Contact Huawei technical support to check whether the NIC firmware and driver versions match the OS version. If they do not match, upgrade the NIC firmware and driver first.
- Check whether the physical network port meets performance requirements.
- Check whether the binding between the network port interrupt and CPU queue has been modified.
- To check whether the TSO and GSO settings of the network port have been modified, run the ethtool -k ethN command in Linux (or equivalent in other operating systems).
- To check whether the ring buffer settings of the network port have been modified, run the ethtool -g ethN command in Linux (or equivalent in other operating systems).
- Collect OS logs. For details, see Collecting OS Logs.
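The TSO and GSO check in this procedure can be sketched as a filter over the ethtool -k ethN output. The function name offload_state is hypothetical. The sketch assumes that the output lists the features as tcp-segmentation-offload and generic-segmentation-offload at the start of a line; verify the feature names against the ethtool version in use.

```shell
#!/bin/sh
# Hypothetical helper: print the state of one offload feature from
# `ethtool -k ethN` output read from stdin.
# $1 = feature name exactly as it appears at the start of a line.
offload_state() {
    awk -F': ' -v feature="$1" '$1 == feature { print $2 }'
}

# Example use on a live system (replace ethN with the port name):
#   ethtool -k ethN | offload_state tcp-segmentation-offload
#   ethtool -k ethN | offload_state generic-segmentation-offload
```

Compare the reported state with the value expected for the deployment. The helper prints nothing if the feature name is not found.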
|