Analyzing AI Core Errors
Setting Environment Variables
The tool depends on the ADC and CCE. You need to configure the following environment variables on the server where you execute the training script:
- ADC environment variable
Add the ADC installation path under the Toolkit installation path to the PATH variable:
export install_path=/home/HwHiAiUser/Ascend/ascend-toolkit/latest # Replace it with the actual installation path.
export PATH=${install_path}/toolkit/bin:$PATH
- CCE environment variable
Add the CCE installation path under the ATC installation path to the PATH variable:
export install_path=/home/HwHiAiUser/Ascend/ascend-toolkit/latest # Replace it with the actual installation path.
export PATH=${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
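Before starting the tool, it can help to confirm that the directories added above really are on PATH. The following is a minimal sketch, not part of the toolkit: the helper name `missing_path_entries` is hypothetical, and it only compares directory names against PATH, so it does not verify that the ADC or CCE binaries themselves exist.

```python
import os


def missing_path_entries(install_path, path_value=None):
    """Return the expected tool directories that are absent from PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Normalize entries so a trailing slash does not cause a false negative.
    entries = {os.path.normpath(p) for p in path_value.split(os.pathsep) if p}
    expected = [
        os.path.join(install_path, "toolkit", "bin"),               # ADC
        os.path.join(install_path, "atc", "ccec_compiler", "bin"),  # CCE
        os.path.join(install_path, "atc", "bin"),                   # ATC
    ]
    return [d for d in expected if os.path.normpath(d) not in entries]


if __name__ == "__main__":
    # install_path is read from the environment because the commands above export it.
    install_path = os.environ.get(
        "install_path", "/home/HwHiAiUser/Ascend/ascend-toolkit/latest"
    )
    for directory in missing_path_entries(install_path):
        print(f"not on PATH: {directory}")
```

If the script prints nothing, all three directories are present in the current shell's PATH.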
Starting AI Core Error Analyzer
Option | Short Form | Required/Optional | Description |
---|---|---|---|
--remote_host | -host | Required for remote training | IP address and port number of the remote host in the remote training scenario. The default port number is 22118. |
--compile_path | -c | Required | Training script execution path. |
--output | -out | Optional | Output path. The AI Core Error report is generated to this path. If not specified, the current path is used. |
- In the remote training scenario, the specified paths are looked up first locally and then on the remote host.
- Replace the xx.xx.xx.xx argument of the --remote_host option with the actual IP address.
The AI Core Error Analyzer can help you locate AI Core errors locally or remotely. Start it by running the startup script from the command line.
Go to the script directory: {Toolkit installation path}/toolkit/tools/msaicerr, for example, /usr/local/Ascend/toolkit/tools/msaicerr
- Local scenario:
$ python3 msaicerr.pyc --compile_path /home/.../Project/aicerror_data/compile_path_train --output local_train
- Remote scenario:
$ python3 msaicerr.pyc --remote_host xx.xx.xx.xx:22118 --compile_path /home/.../aicerror/compile_path
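The local and remote invocations differ only in which options are passed, so a small wrapper can assemble either one. This is a sketch under stated assumptions: `build_msaicerr_command` is a hypothetical helper, the paths and address in the example are placeholders, and only the long option names from the table above are used.

```python
import shlex


def build_msaicerr_command(compile_path, remote_host=None, output=None,
                           script="msaicerr.pyc"):
    """Assemble the argument list for one AI Core Error Analyzer run."""
    cmd = ["python3", script]
    if remote_host:
        # Remote scenario only, in the form "IP:port" (22118 is the default port).
        cmd += ["--remote_host", remote_host]
    cmd += ["--compile_path", compile_path]
    if output:
        # Without --output, the report goes to the current path.
        cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute your own compile path and host address.
    local = build_msaicerr_command("/path/to/compile_path", output="local_train")
    remote = build_msaicerr_command("/path/to/compile_path",
                                    remote_host="xx.xx.xx.xx:22118")
    print(shlex.join(local))
    print(shlex.join(remote))
```

The returned list can be passed to `subprocess.run(cmd, cwd=script_dir, check=True)`, where `script_dir` is the msaicerr script directory described above.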
Viewing Analysis Result
The outputs of the AI Core Error Analyzer are generated to the info_xxxx directory under the path specified by --output.
├── aicerror_xxxx                 //AI Core Error Analyzer outputs
│   ├──info.txt                   //AI Core Error Analyzer analysis result summary
│   ├──te_transdata_xxxx.o
│   ├──te_transdata_xxxx.o.txt    //Decompilation file
├── collection                    //Error operator files
│   ├──compile
│   ├──kernel_meta
│   ├──CCE code file
│   ├──JSON code file
│   ├──loc.json file
│   ├──.o file
│   ├──hisi_logs                  //Black box errors
│   ├──slog
├──error.log                      //ERROR-level log messages in the log directory
├──imas.log                       //IMAS log messages of GE
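Since info.txt holds the analysis summary, a short script can locate and print it without browsing the tree by hand. This is a minimal sketch: the helper names are hypothetical, the search is recursive because the exact directory names vary per run, and nothing is assumed about the internal format of info.txt.

```python
from pathlib import Path


def find_summaries(output_dir):
    """Return every info.txt found under output_dir, sorted by path."""
    return sorted(Path(output_dir).rglob("info.txt"))


def print_summaries(output_dir):
    """Print each summary file, preceded by its location."""
    for summary in find_summaries(output_dir):
        print(f"== {summary} ==")
        # errors="replace" keeps the script usable if the file is not valid UTF-8.
        print(summary.read_text(encoding="utf-8", errors="replace"))


if __name__ == "__main__":
    # "local_train" matches the --output value used in the local example.
    print_summaries("local_train")
```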