Mismatch Between the Number of Queried Devices and the Actual Number of Devices
Symptom
A standard PCIe card with four devices (also called processors in this context) is installed in the environment, but a query with npu-smi or upgrade-tool returns only one device, as shown in Figure 5-10.
Possible Cause
The possible causes are as follows:
- The host dissipates heat poorly. As a result, the PCIe card overheats and the device enters the overtemperature protection state.
- The host does not provide enough interrupts for the device, so the driver cannot be loaded automatically.
- Device communication line fault:
  - The hardware communication line on the device side is unavailable.
  - The communication line on the device side is disconnected.
Solution
To rectify the fault, perform the following steps:
- Check whether the fault is caused by overtemperature. Go to the /var/log/hisi_logs/device-XX/ directory and view the black box log. By referring to the LPM3 sheet in Black Box Error Codes, check whether the recorded error is an overtemperature error.
If the device is abnormal due to poor heat dissipation of the host, power off the host and then restart it.
- Run the dmesg command to check whether the log information shown in Figure 5-11 exists.
If this log information exists, the hardware environment is faulty:
  - If the host where the Ascend AI Processor is installed is a physical machine, replace the host with higher-performance hardware.
  - If the host where the Ascend AI Processor is installed is a VM, add CPUs to the VM.
- Run the lspci | grep d100 command to query the connection status.
If fewer devices are displayed than are actually installed, as shown in Figure 5-12, the hardware communication line of the device is disconnected.
If some devices are displayed in the ff state, as shown in Figure 5-13, the device communication line is disconnected.
The PCIe card may be in poor contact with the host. Power off the server, remove and reinstall the PCIe card, and then power on the server.
If the fault persists, contact Huawei technical support.
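
The lspci check in the last step can also be scripted. The following is a minimal Python sketch, not an official tool. It assumes that the card carries four devices, that every device appears as one lspci line containing d100, and that the "ff state" in Figure 5-13 corresponds to a line ending in "(rev ff)". The function names parse_lspci, diagnose, and run_lspci are hypothetical and chosen only for illustration.

```python
import subprocess

EXPECTED_DEVICES = 4  # assumption: one standard PCIe card with four devices
DEVICE_ID = "d100"    # same pattern as in "lspci | grep d100"


def parse_lspci(output, device_id=DEVICE_ID):
    """Return (address, is_ff) for every lspci line that contains device_id."""
    entries = []
    for line in output.splitlines():
        if device_id not in line:
            continue
        address = line.split(maxsplit=1)[0]  # first field, for example 3b:00.0
        is_ff = "(rev ff)" in line           # assumption: this is the "ff state"
        entries.append((address, is_ff))
    return entries


def diagnose(entries, expected=EXPECTED_DEVICES):
    """Map the parsed entries to the two cases described in the Solution."""
    messages = []
    if len(entries) < expected:
        messages.append(
            f"{len(entries)} of {expected} devices listed: "
            "the hardware communication line may be disconnected"
        )
    for address, is_ff in entries:
        if is_ff:
            messages.append(
                f"{address} is in the ff state: "
                "the device communication line may be disconnected"
            )
    return messages


def run_lspci():
    """Run lspci on the host and return its text output."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout
```

On the host, calling diagnose(parse_lspci(run_lspci())) returns an empty list when all four devices are listed and none is in the ff state. Any returned message points to the reseating procedure described above.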