Profiling by Calling the acl.json File
Run the executable file of the application project; the profiling configuration is read from the acl.json file and profile data is collected automatically. After that, you can analyze the collected profile data in the development environment where the Toolkit is installed and display the visualized profiling results.
For details about building and running an application project, see Application Software Development Guide.
In addition, you must call aclInit() to initialize AscendCL and call aclFinalize() to deinitialize AscendCL.
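For reference, a minimal sketch of this call sequence is shown below. The ./acl.json path and the error handling are illustrative; aclInit(), aclFinalize(), and ACL_SUCCESS are standard AscendCL names.
#include <stdio.h>
#include "acl/acl.h"

int main(void)
{
    /* Initialize AscendCL and pass the path of the acl.json file
       so that the profiling configuration is read. */
    aclError ret = aclInit("./acl.json");
    if (ret != ACL_SUCCESS) {
        printf("aclInit failed, error code: %d\n", ret);
        return -1;
    }

    /* ... set the device, load the model, and run inference ... */

    /* Deinitialize AscendCL before the process exits. */
    ret = aclFinalize();
    if (ret != ACL_SUCCESS) {
        printf("aclFinalize failed, error code: %d\n", ret);
        return -1;
    }
    return 0;
}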
Collecting Profile Data
Configure the acl.json file, then build and run the application project by performing the following steps:
- Open the project file and check the path of the acl.json file passed in the aclInit() call, as shown in Figure 7-32.
If no acl.json path is passed to the aclInit() call, modify the call and pass the path created in Step 2.
- Modify the acl.json file in the directory (if the file does not exist, create it in the out directory after the project is built) and add the profiling configuration in the following format:
{ "profiler": { "switch": "on", "device_id": "0,1", "result_path": "output", "ai_core_metrics": "aicoreArithmeticThroughput" } }
The fields of profiler are described as follows (a complete example with different values appears at the end of this procedure):
- switch (optional): profiling switch, either on or off.
If this parameter is not included or is not set to on, profiling is disabled.
- device_id (optional): ID of the device to be profiled. Defaults to 0.
Set this parameter either to device IDs separated by commas (,) or to all to collect data about all devices.
- result_path (optional): path for dumping profile data to the disk.
After data collection is complete, directories whose names start with JOB are generated in this directory; each directory stores the profile data of one device. result_path can be an absolute path or a relative path (relative to the path where commands are executed).
- An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
- A relative path starts with a directory name, for example, output.
Create the directory in advance and ensure that the user configured during installation has the read and write permissions on it. If the specified directory does not exist, the profile data is stored in the directory of the executable file instead; in that case, ensure that the running user has the read and write permissions on that directory.
- ai_core_metrics (optional): AI Core metrics. Defaults to aicoreArithmeticThroughput.
The value can be aicoreArithmeticThroughput, aicorePipeline, aicoreSynchronization, aicoreMemoryBandwidth, aicoreInternalMemoryBandwidth, or aicorePipelineStall. For details about the corresponding metrics, see Description of AI Core analysis result.
- After the acl.json file is configured, rebuild and run the application project by referring to Application Software Development Guide.
result_path specifies the path for storing profile data, as shown in Figure 7-33.
If the acl.json file already exists, modify the file content and add profiling configurations. You do not need to rebuild the application project.
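For example, the following configuration (with illustrative values) enables profiling on all devices, dumps the results to the absolute path /home/HwHiAiUser/output, and collects pipeline metrics:
{
    "profiler": {
        "switch": "on",
        "device_id": "all",
        "result_path": "/home/HwHiAiUser/output",
        "ai_core_metrics": "aicorePipeline"
    }
}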
Analyzing Profile Data
To display the visualized results, analyze the profile data by performing the following steps.
Before data analysis, the Profiling tool must be installed in the development environment. In the following steps, the HwHiAiUser user installs the tool to the default path /home/HwHiAiUser/Ascend. Replace the user and paths as required.
- Log in to the development environment as the HwHiAiUser user.
- Upload the collected data to the /home/HwHiAiUser/ directory (or another directory on which the HwHiAiUser user has the read and write permissions) using the scp command, for example, to /home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA.
JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA is the directory for storing the profile data. Replace it with the actual directory name.
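For example, the following command, run in the development environment, pulls the data from the host where the application was executed (the IP address and source path are placeholders):
scp -r HwHiAiUser@xx.xx.xx.xx:/home/HwHiAiUser/output/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA /home/HwHiAiUser/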
- Create a log path in the user home directory (/home/HwHiAiUser).
mkdir -p .mindstudio/profiler/log
- Analyze the profile data.
Go to the /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/profiler_tool/analysis/interface directory and run the following command:
python3.7.5 msvp_import.pyc --target=/home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA --export_type=all --deviceid=0
- --target: path for storing the profile data.
- --export_type (optional): profiling results to export. Set this parameter to all to analyze all the data or to basic (default) to analyze only part of the data.
- --deviceid (optional): device ID. If this parameter is not set, all device IDs in the directory specified by --target are analyzed.
After the command is executed, the ai_core_op_summary_0.csv file is generated in the /home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA/csv directory. For details about the file, see Description of ai_core_op_summary_{device_id}.csv.
- View the profiling results in the following modes:
- Print the results in tabular format on the screen.
Go to the /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/bin directory and run the following command:
bash hiprof.sh --report_infer /home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA
- Dump the results in CSV format to the disk.
Go to the /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/profiler_tool/analysis/interface directory and run a command with the following syntax:
python3.7.5 get_msvp_info.pyc --save_file --project=$output_data --deviceid=$device_id --data_type=$export_data_type
For example, you can run the following command to export the Runtime API calls:
python3.7.5 get_msvp_info.pyc --save_file --project=/home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA --deviceid=0 --data_type=runtime_api
- For details about how to use get_msvp_info.pyc, see Exporting Profiling Results.
- --data_type: profiling results to export. Only the following types are supported:
- runtime_api: Runtime API calls
- task_scheduler: Task Scheduler analysis result
- ai_core_pmu_events: AI Core analysis result
- ge_basic: GE task and graph information
- ge_model_load: information of GE loading a model
- ge_model_time: time taken by GE to load a model
- acl: AscendCL output
- top_down: time taken by each module in the inference workflow
- ai_core_op_summary: AI Core Operator Summary
- op_counter: AI Core Operator Statistics
- all: all profiling results
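For example, to export the Task Scheduler analysis result instead, change only the --data_type value:
python3.7.5 get_msvp_info.pyc --save_file --project=/home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA --deviceid=0 --data_type=task_scheduler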
Description of Profiling Results
- Runtime API calls:
- Name: API name
- Stream ID: stream ID of an API
- Time (%): percentage of time taken by an API
- Time (ns): time taken by an API
- Calls: number of API calls
- Avg, Min, Max: average, minimum, and maximum time taken by API calls
N/A in the Stream ID column indicates that the API is called directly and does not belong to any stream.
- Task Scheduler:
- Time(%): percentage of time taken by a task
- Time(ns): time taken by a task
- Count: number of times a task is executed
- Avg, Min, Max: average, minimum, and maximum time
- Waiting: total wait time of a task
- Running: total run time of a task. An unusually long run time may indicate an incorrect operator implementation.
- Pending: total pending time of a task
- Type: task type
- API: API name
- Task ID: task ID
- Stream ID: stream ID
- AI Core:
- Task ID: task ID
- Stream ID: stream ID
- Op Name: operator name
- aicore_time: time taken to execute all instructions
- total_cycles: number of cycles taken to execute all instructions
The analysis results of the AI Core metrics are described as follows:
- aicoreArithmeticThroughput
- mac_fp16_ratio: percentage of cycles taken to execute Cube fp16 instructions
- mac_int8_ratio: percentage of cycles taken to execute Cube int8 instructions
- vec_fp32_ratio: percentage of cycles taken to execute Vector fp32 instructions
- vec_fp16_ratio: percentage of cycles taken to execute Vector fp16 instructions
- vec_int32_ratio: percentage of cycles taken to execute Vector int32 instructions
- vec_misc_ratio: percentage of cycles taken to execute Vector misc instructions
- aicorePipeline
- vec_time: time taken to execute Vector instructions
- vec_ratio: percentage of cycles taken to execute Vector instructions
- mac_time: time taken to execute Cube instructions
- mac_ratio: percentage of cycles taken to execute Cube instructions
- scalar_time: time taken to execute Scalar instructions
- scalar_ratio: percentage of cycles taken to execute Scalar instructions
- mte1_time: time taken to execute MTE1 instructions (L1-to-L0A/L0B movement)
- mte1_ratio: percentage of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B movement)
- mte2_time: time taken to execute MTE2 instructions (DDR-to-AI Core movement)
- mte2_ratio: percentage of cycles taken to execute MTE2 instructions (DDR-to-AI Core movement)
- mte3_time: time taken to execute MTE3 instructions (AI Core-to-DDR movement)
- mte3_ratio: percentage of cycles taken to execute MTE3 instructions (AI Core-to-DDR movement)
- icache_miss_rate: I-Cache miss rate
- memory_bound: indicates whether a memory bottleneck exists while the AI Core executes operators, calculated as mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, there is no memory bottleneck; otherwise, there is one. For example, if mte2_ratio is 0.6, mac_ratio is 0.4, and vec_ratio is 0.2, then memory_bound = 0.6/0.4 = 1.5, indicating a memory bottleneck.
- aicoreSynchronization
- scalar_waitflag_ratio: percentage of cycles for waiting between Scalar instructions
- cube_waitflag_ratio: percentage of cycles for waiting between Cube instructions
- vector_waitflag_ratio: percentage of cycles for waiting between Vector instructions
- mte1_waitflag_ratio: percentage of cycles for waiting between MTE1 instructions
- mte2_waitflag_ratio: percentage of cycles for waiting between MTE2 instructions
- mte3_waitflag_ratio: percentage of cycles for waiting between MTE3 instructions
- aicoreMemoryBandwidth
- ub_read_bw: UB read bandwidth (GB/s)
- ub_write_bw: UB write bandwidth (GB/s)
- l1_read_bw: L1 read bandwidth (GB/s)
- l1_write_bw: L1 write bandwidth (GB/s)
- l2_read_bw: L2 read bandwidth (GB/s)
- l2_write_bw: L2 write bandwidth (GB/s)
- main_mem_read_bw: main memory read bandwidth (GB/s)
- main_mem_write_bw: main memory write bandwidth (GB/s)
- aicoreInternalMemoryBandwidth
- scalar_ld_ratio: percentage of cycles taken to execute Scalar-read-UB instructions
- scalar_st_ratio: percentage of cycles taken to execute Scalar-write-UB instructions
- l0A_read_bw: L0A read bandwidth (GB/s)
- l0A_write_bw: L0A write bandwidth (GB/s)
- l0B_read_bw: L0B read bandwidth (GB/s)
- l0B_write_bw: L0B write bandwidth (GB/s)
- l0C_read_bw: L0C read bandwidth (GB/s)
- l0C_write_bw: L0C write bandwidth (GB/s)
- aicorePipelineStall
- vec_bankgroup_cflt_ratio: percentage of cycles in which Vector bank group conflicts occur
- vec_bank_cflt_ratio: percentage of cycles in which Vector bank conflicts occur
- vec_resc_cflt_ratio: percentage of cycles in which Vector resource conflicts occur
- mte1_iq_full_ratio: percentage of cycles in which the MTE1 instruction queue is full
- mte2_iq_full_ratio: percentage of cycles in which the MTE2 instruction queue is full
- mte3_iq_full_ratio: percentage of cycles in which the MTE3 instruction queue is full
- cube_iq_full_ratio: percentage of cycles in which the Cube instruction queue is full
- vec_iq_full_ratio: percentage of cycles in which the Vector instruction queue is full
- iq_full_ratio: sum of vec_resc_cflt_ratio, mte1_iq_full_ratio, mte2_iq_full_ratio, mte3_iq_full_ratio, cube_iq_full_ratio, and vec_iq_full_ratio
- GE task and graph information:
- Model Name: model name
- Op Name: operator name
- Op Type: operator type
- Task ID: task ID
- Block Dim: number of cores to execute a task
- Stream ID: stream ID
- Input Count: number of inputs
- Input Formats: input formats
- Input Shapes: input shapes
- Input Data Types: input data types
- Output Count: number of outputs
- Output Formats: output formats
- Output Shapes: output shapes
- Output Data Types: output data types
- Information of GE loading a model:
- Model Name: model name
- Model ID: model ID
- Stream ID: stream ID of a fusion operator
- Fusion Op: name of a fusion operator
- Original Ops: names of fused operators
- Memory Input: memory size of an input tensor
- Memory Output: memory size of an output tensor
- Memory Weight: weight memory size
- Memory Workspace: workspace memory size
- Memory Total: total memory, the sum of Memory Input, Memory Output, Memory Weight, and Memory Workspace
- Task IDs: task IDs
- Time taken by GE to load a model:
- Model Name: model name
- Model ID: model ID
- Data Index: data index
- Request ID: request ID
- Input Start Time: start time of data input
- Input Duration: time taken by the model to input data
- Inference Start Time: start time of data inference
- Inference Duration: time taken by the model to infer data
- Output Start Time: start time of data output
- Output Duration: time taken by the model to output data
- AscendCL output:
- Name: AscendCL API name
- Type: AscendCL API type
- Start Time: AscendCL API start time
- Duration: time taken to run an AscendCL API
- Process ID: process ID of an AscendCL API
- Thread ID: thread ID of an AscendCL API
- Top-down information:
- Infer ID: inference iteration ID
- Module Name: module name
- API: API name
- Start Time: start time
- Duration: total time