Troubleshooting
Common Errors in RL Tuning
- Error: "unknown op compute."
Check whether the operator is supported. Currently, RL tuning supports only elewise, broadcast, and reduce operators. For the full list of supported operators, see Operator Lists.
- Error: "import base64 in python3.7 failed in host XX, please fix it!"
An error is reported during the operating environment check, indicating that the base64 component is missing from Python 3.7. Ensure that the TBE development environment is ready before RL tuning.
Run the following command to install the base64 component:
pip3.7 install pybase64
- Error: "The avail space of /home/HwHiAiUser in XXX is smaller than 1G, please fix it!"
An error is reported during the operating environment check, indicating that the available space under /home/HwHiAiUser on the host is less than 1 GB. Free up disk space before RL tuning.
- Error: "stage[xx] > max_stages[128]."
The number of stages of the current operator exceeds the allowed maximum of 128. Tuning of such an operator is currently not supported.
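The two environment checks above (the base64 import and the free-space threshold) can be reproduced by hand before starting RL tuning. The following is a minimal sketch, not the check that the tuning tool itself runs: the function names are hypothetical, and the interpretation of "1G" as 1024^3 bytes is an assumption. Run it on the host with the same Python 3.7 interpreter that the tuning tool uses.

```python
import importlib
import shutil

# Assumption: "1G" in the error message means 1024**3 bytes.
MIN_FREE_BYTES = 1024 ** 3


def base64_importable() -> bool:
    """Return True if `import base64` succeeds in the current interpreter."""
    try:
        importlib.import_module("base64")
    except ImportError:
        return False
    return True


def has_enough_space(path: str, minimum: int = MIN_FREE_BYTES) -> bool:
    """Return True if the file system holding `path` has at least `minimum` free bytes."""
    return shutil.disk_usage(path).free >= minimum


def preflight_problems(path: str) -> list:
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    if not base64_importable():
        problems.append("import base64 failed in this interpreter")
    if not has_enough_space(path):
        problems.append("less than 1 GB available under " + path)
    return problems
```

For example, calling preflight_problems("/home/HwHiAiUser") on the host returns an empty list when both conditions are met. The stage limit (128) depends on the operator itself and cannot be checked this way.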
Common Errors in GA Tuning
- Error: "there is no kernel_perf_comm in PATH!"
In the inference scenario, check whether ${install_path}/atc/bin has been added to the PATH environment variable. For details, see Environment Variable Configuration.
- Error: "Failed run kernel too many!"
The following message is displayed in the tuning log:
kernelName:xxxx,ResultStatus:0-255,TotalCycle:0-xxx
kernelName is the name of the current .o file.
ResultStatus indicates the result status. For details, see Table 6-3.
Table 6-3 Result status list

| Status Code | Description | Solution |
|---|---|---|
| 0 | Execution success | - |
| 1 | GA tuning failed to preempt the device. | During GA tuning, the device resources need to be exclusively occupied by Auto Tune. Stop other processes and try GA tuning again. |
| 2 | Failed to register the operator binary file (.o). | Check if the user who performs the tuning has the write permission on the target directory. |
| 3 | Failed to execute the operator binary file on the RTS side. | If an error occurs during operator execution, search for the keywords aic_error and task_exception in the /var/log/npu/slog/host-0/host-0_*.log file on the host and analyze the log. |
| 4 | Failed to allocate the memory required for the input and output of the operator binary file on the host. | Check if the host has sufficient memory space. |
| Other | - | Contact Huawei technical support. |
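When many kernels fail, it can help to extract the status codes from the tuning log and look them up in Table 6-3 automatically. The sketch below is illustrative only. It assumes that each log entry follows the shape shown above (kernelName:...,ResultStatus:...,TotalCycle:...) with no spaces around the separators; the real log may differ. The function names and the sample kernel name are hypothetical. It also includes a helper that checks whether kernel_perf_comm can be found through PATH, which relates to the first GA error above.

```python
import re
import shutil
from typing import NamedTuple, Optional

# Descriptions copied from Table 6-3; any other code falls under "Other".
RESULT_STATUS = {
    0: "Execution success",
    1: "GA tuning failed to preempt the device.",
    2: "Failed to register the operator binary file (.o).",
    3: "Failed to execute the operator binary file on the RTS side.",
    4: "Failed to allocate the memory required for the input and output "
       "of the operator binary file on the host.",
}

# Assumed line shape, based on the message format quoted in this section.
LINE_RE = re.compile(
    r"kernelName:(?P<kernel>[^,]+),"
    r"ResultStatus:(?P<status>\d+),"
    r"TotalCycle:(?P<cycles>\d+)"
)


class KernelResult(NamedTuple):
    kernel: str
    status: int
    total_cycle: int


def parse_result_line(line: str) -> Optional[KernelResult]:
    """Parse one log line; return None if it does not match the assumed shape."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return KernelResult(
        match.group("kernel"),
        int(match.group("status")),
        int(match.group("cycles")),
    )


def describe_status(status: int) -> str:
    """Map a ResultStatus code to its Table 6-3 description."""
    return RESULT_STATUS.get(status, "Other (contact Huawei technical support)")


def kernel_perf_comm_on_path() -> bool:
    """Return True if an executable named kernel_perf_comm is reachable via PATH."""
    return shutil.which("kernel_perf_comm") is not None
```

For a hypothetical line such as kernelName:te_add_1,ResultStatus:3,TotalCycle:0, parse_result_line returns status 3, and describe_status(3) gives the RTS execution failure entry, which points to the aic_error and task_exception keywords in the host log.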