What Do I Do If the Network Inference Fails Due to AI Core Operator Execution Timeout?
Symptom
During model inference, an error message indicates that model execution failed.
The /var/log/npu/slog/device-id/device-**.log file contains the following ERROR-level log message:
[ERROR] TSCH(-1,null):2020-06-04-10:51:09.520.395 28 (cpuid:0) ai_core_dispatcher.c:1012 bs_done_exception_proc_timeout: slot_id=1,TS_ctrl=0x4,exception_core_list=0x0,current core usage=0x1,AI_CORES_COUNT=2, fault_task=0
Cause Analysis
When an operator execution task on the AI Core times out, the Task Scheduler (TSCH) returns a timeout failure.
The timeout interval for the Ascend 310 AI Processor is 55 seconds.
Troubleshooting
Perform the following steps to locate the timeout operator:
- Check the logs on the device to find the ID of the failed task reported by the TSCH component.
Open the log file device-**.log in /var/log/npu/slog/device-id.
Search for the keyword bs_done_exception_proc_timeout to find the TSCH log message, and note the fault_task ID (for example, fault_task=0).
For an example of this log message, see Symptom.
- Check the logs on the host to find the name of the operator that failed to execute.
Open the log file host-0_*.log in /var/log/npu/slog/host-0.
Using the fault_task ID obtained in the previous step (fault_task=0 in this example), search the host-0_*.log file for TaskLaunched and task_id=0. A matching log message looks like this:
[EVENT] RUNTIME(15568,acl_caffe_interp):2020-06-04-10:50:14.522.076 [runtime/feature/src/logger.cc:1014]15570 TaskLaunched:device_id=0, stream_id=514, sq_id=514, task_id=0, kernel_name=test_case/2_16_144_417_248_408_float16/0_Interp_1_0_2_16_144_417_0_0_2_16_248_408.om/Interp_tvmbin, devfunc_name = te_interp_1ead9f4957880f1e_0__kernel0 task_type=AiCoreKernel, task_launched_num=2
In the preceding log message, Interp in kernel_name identifies the operator whose execution timed out.
You can perform operator tuning by referring to Performance Optimization.
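The two log lookups in the troubleshooting steps can also be scripted. The following Python sketch is illustrative only: the regular expressions are derived solely from the two sample log lines shown in this section, and the function names (find_fault_tasks, find_kernel_names, last_segment) are hypothetical helpers, not part of any Ascend tool. Log formats may differ between software versions, so adjust the patterns if they do not match your logs.

```python
import re

# Assumption: patterns are modeled on the sample log lines in this section only.
FAULT_RE = re.compile(r"bs_done_exception_proc_timeout:.*?fault_task\s*=\s*(\d+)")
LAUNCH_RE = re.compile(
    r"TaskLaunched:.*?\btask_id\s*=\s*(\d+),\s*kernel_name\s*=\s*([^,\s]+)"
)


def find_fault_tasks(device_log_lines):
    """Return the fault_task IDs found in TSCH timeout messages (device log)."""
    ids = []
    for line in device_log_lines:
        match = FAULT_RE.search(line)
        if match:
            ids.append(int(match.group(1)))
    return ids


def find_kernel_names(host_log_lines, task_id):
    """Return kernel_name values of TaskLaunched messages with the given task_id."""
    names = []
    for line in host_log_lines:
        match = LAUNCH_RE.search(line)
        if match and int(match.group(1)) == task_id:
            names.append(match.group(2))
    return names


def last_segment(kernel_name):
    """Return the last path segment of kernel_name, which names the operator
    in the sample log (hypothetical convenience helper)."""
    return kernel_name.rsplit("/", 1)[-1]


# Example usage (file names are placeholders; substitute your actual log files):
#   with open("device-log-file") as dev, open("host-log-file") as host:
#       host_lines = host.readlines()
#       for task in find_fault_tasks(dev):
#           for name in find_kernel_names(host_lines, task):
#               print(task, last_segment(name))
```

A search may return more than one TaskLaunched message for the same task_id, so compare the timestamps and stream IDs of the candidates with the timeout message before deciding which operator to tune.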