FAQs
- What Do I Do If the container Directory Is Deleted?
- What Do I Do If the GCC Installation Fails After Privilege Escalation?
- What Do I Do If a Training Job Fails to Be Delivered Due to Full Disk Space?
- What Do I Do If Profiling Installation Fails with the Error Message "google.protobuf is not installed"?
- What Do I Do If a Training Job Has No Profiling Result?
What Do I Do If the container Directory Is Deleted?
Symptom
In the cloud scenario, profile data fails to be collected, and the following information is displayed.
Possible Cause
The container directory, or one of its subdirectories, under /var/log/npu/profiling/ was manually deleted, so profile data files can no longer be written.
Solution
The Profiling tool automatically recreates the directory. Wait until the directory has been recreated, then start a new training job.
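The wait can be scripted. The following is a minimal sketch, not part of the Profiling tool: wait_for_dir is a hypothetical helper, and the path in the commented example assumes the deleted directory is /var/log/npu/profiling/container.

```shell
# Poll until a directory exists, or give up after a timeout in seconds.
# Hypothetical helper; not shipped with the Profiling tool.
wait_for_dir() {
    dir=$1
    timeout=${2:-60}
    elapsed=0
    while [ ! -d "$dir" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $dir" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$dir is present"
}

# Example (assumed path; adjust to the directory that was deleted):
# wait_for_dir /var/log/npu/profiling/container 300 && echo "start a new training job"
```

If the function times out, the directory has not been recreated yet and a new training job would probably hit the same write failure.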
What Do I Do If the GCC Installation Fails After Privilege Escalation?
Symptom
During Profiling installation, privilege escalation is performed before the GCC, g++, and make tools are installed. The installation fails, and the information in Figure 3-25 is displayed.
Possible Cause
The dependencies are installed after privilege escalation. As a result, the generated privilege escalation file is incorrect.
Solution
Perform the following steps:
- Log in to the server as the root user and delete the /etc/sudoers.d/xxx_specific file.
Replace xxx with the Profiling installation user name.
- Log in to the server as the Profiling installation user.
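For the first step, the file name can be derived from the installation user name. The sketch below only prints the removal command so that it can be reviewed before it is run as root. The function names and the user name profuser are hypothetical.

```shell
# Build the path of the privilege escalation file for an installation user.
sudoers_file_for() {
    printf '/etc/sudoers.d/%s_specific\n' "$1"
}

# Print (rather than run) the removal command so it can be reviewed first.
# Run as root, because /etc/sudoers.d is usually not readable by other users.
print_cleanup_command() {
    file=$(sudoers_file_for "$1")
    if [ -e "$file" ]; then
        echo "rm -f $file"
    else
        echo "# $file not found; nothing to delete"
    fi
}

# Example (profuser is a placeholder for the Profiling installation user):
# print_cleanup_command profuser
```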
What Do I Do If a Training Job Fails to Be Delivered Due to Full Disk Space?
Symptom
The error message shown in Figure 3-26 is displayed when a training job is delivered in system profiling or job profiling.
Possible Cause
The message "No usable temporary directory" is displayed, which indicates that the system disk may be full.
Solution
Perform the following steps:
- Delete unnecessary files from the system disk.
- Run the df -h command to confirm that the disk has available space.
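The df check can also be automated. This is a minimal sketch that uses the POSIX output format of df (df -Pk) so that the "Available" column is easy to parse. The helper names and the 1 GiB threshold in the commented example are assumptions, not values from this guide.

```shell
# Print the available space, in KiB, of the file system that holds a path.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the file system holding $1 has at least $2 KiB available.
has_free_space() {
    avail=$(avail_kb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Example: warn if the file system holding /tmp has less than 1 GiB free.
# has_free_space /tmp 1048576 || echo "low disk space on /tmp" >&2
```

Checking /tmp as well as / may help when the temporary directory is on a separate file system.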
What Do I Do If Profiling Installation Fails with the Error Message "google.protobuf is not installed"?
Symptom
Profiling fails to be installed, and the error message shown in Figure 3-27 is displayed.
Possible Cause
The possible causes are as follows:
- Protobuf is not installed.
- Protobuf has been installed, but the Profiling installation user does not have sufficient permissions.
Solution
Perform the following steps:
- Log in to the OS as the root user.
- Run the python3.7 command and then run import google.protobuf to check whether protobuf is installed.
- If protobuf is not installed, install it by referring to "Preparing the Environment."
- If protobuf has been installed, go to the next step.
- Run the whereis python3.7 command to find the directory where Python 3.7 is located.
- Change the permission of the site-packages directory found in Step 3 to 550.
chmod -R 550 site-packages
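The import check in Step 2 can be run non-interactively. The sketch below assumes the interpreter is on the PATH. check_protobuf and print_site_packages are hypothetical helper names, and listing site-packages through Python's site module is offered as an alternative to whereis, not as the procedure this guide prescribes.

```shell
# Report whether the given interpreter can import google.protobuf.
# $1 is the interpreter command (python3.7 in this guide).
check_protobuf() {
    if "$1" -c 'import google.protobuf' >/dev/null 2>&1; then
        echo "installed"
    else
        echo "not installed"
    fi
}

# Print the site-packages directories known to the interpreter.
print_site_packages() {
    "$1" -c 'import site; print("\n".join(site.getsitepackages()))'
}

# Example:
# check_protobuf python3.7
# print_site_packages python3.7
```

Running check_protobuf as the Profiling installation user as well as root helps tell the two possible causes apart: if only root can import the module, the problem is permissions and not a missing package.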
What Do I Do If a Training Job Has No Profiling Result?
Symptom
The error message shown in Figure 3-28 is displayed, indicating that the training job has no profiling result.
Possible Cause
The possible cause is as follows:
The profiling result cannot be found because the corresponding data file is being analyzed while you are copying the profile data.
Solution
Perform the following steps:
- Log in to the OS as the Profiling running user.
- Run the ./stop.sh command to stop the Profiling process.
cd /usr/local/Ascend/toolkit/tools/profiler/bin
./stop.sh
- Delete irrelevant data from the result_dir directory.
- Run the ./start.sh command to start the Profiling process.
- Run the command to query the job again.
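The stop and start steps can be wrapped in a small helper that fails with a clear message if a script is missing. This is a sketch under one assumption: that start.sh is in the same bin directory as stop.sh. run_profiler_script is a hypothetical name. Cleaning up result_dir stays a manual step between the two calls.

```shell
# Run one of the Profiling control scripts from its bin directory.
# $1 is the bin directory, $2 is the script name (stop.sh or start.sh).
run_profiler_script() {
    bin_dir=$1
    script=$2
    if [ ! -x "$bin_dir/$script" ]; then
        echo "$bin_dir/$script not found or not executable" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$bin_dir" && "./$script")
}

# Example, run as the Profiling running user:
# BIN=/usr/local/Ascend/toolkit/tools/profiler/bin
# run_profiler_script "$BIN" stop.sh
# ... delete irrelevant data from result_dir by hand ...
# run_profiler_script "$BIN" start.sh
```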