Profiling by Calling the AscendCL API
The Profiling tool can also be enabled by calling the AscendCL API, so that profile data is collected automatically. You can then analyze the collected profile data in a development environment where the Toolkit is installed, and view the visualized profiling results.
Collecting Profile Data
The tool provides four AscendCL APIs (aclprofInit, aclprofFinalize, aclprofStart, and aclprofStop) that are called in an application project to enable the profiling function. For details about these APIs, see "AscendCL API Reference" in the Application Software Development Guide.
Note that you must have read and write permissions on the dump path passed to the aclprofInit call. If the disk becomes full during collection, profile data cannot be dumped to the disk, so ensure that the disk has enough free space. In addition, age out old profile data regularly to prevent the disk space from being used up.
To collect profile data with Profiling, develop, build, and run the application project by referring to the Application Software Development Guide.
A dump directory (for example, JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA) is created in the configured profile data path each time the application calls the aclprofInit API.
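The following minimal sketch in C shows how these APIs are typically combined in an application. It is an illustration only: the aclprofCreateConfig and aclprofDestroyConfig calls, the ACL_PROF_* and ACL_AICORE_* constants, and the result path are assumptions based on the AscendCL API Reference and may differ across CANN versions; error handling is omitted for brevity.

#include <string.h>
#include "acl/acl.h"
#include "acl/acl_prof.h"   /* AscendCL profiling APIs */

int main(void)
{
    (void)aclInit(NULL);          /* initialize AscendCL */
    (void)aclrtSetDevice(0);      /* use device 0 */

    /* Initialize profiling. The current user must have read and write
     * permissions on this path, and the disk must have enough space. */
    const char *resultPath = "/home/HwHiAiUser/profiling_output";  /* hypothetical path */
    (void)aclprofInit(resultPath, strlen(resultPath));

    /* Describe what to collect (helper assumed per the API reference):
     * device 0, AI Core pipeline metrics, AscendCL API and task time records. */
    uint32_t deviceIdList[1] = {0};
    aclprofConfig *config = aclprofCreateConfig(
        deviceIdList, 1, ACL_AICORE_PIPE_UTILIZATION, NULL,
        ACL_PROF_ACL_API | ACL_PROF_TASK_TIME | ACL_PROF_AICORE_METRICS);

    (void)aclprofStart(config);   /* start collecting profile data */

    /* ... load the model and run inference here ... */

    (void)aclprofStop(config);    /* stop collecting profile data */
    (void)aclprofDestroyConfig(config);
    (void)aclprofFinalize();      /* release profiling resources */

    (void)aclrtResetDevice(0);
    (void)aclFinalize();
    return 0;
}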
Analyzing Profile Data
To display the visualized data, you need to analyze the profile data by taking the following steps.
Before data analysis, you need to install the development environment equipped with the Profiling tool. In the following steps, the HwHiAiUser user and the default installation path /home/HwHiAiUser/Ascend are used as examples. Replace them as required.
- Log in to the development environment as the HwHiAiUser user.
- Upload the collected data to the /home/HwHiAiUser/ directory (or another directory on which the HwHiAiUser user has read and write permissions) using the scp command.
For example, the collected data is uploaded to the path /home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA.
JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA is the directory for storing profile data. Replace it with the actual one.
- Create a log path in the user home directory (/home/HwHiAiUser).
mkdir -p .mindstudio/profiler/log
- Analyze the profile data.
Go to the /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/profiler_tool/analysis/interface directory and run the following command:
python3.7.5 msvp_import.pyc --target=/home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA --export_type=all --deviceid=0
- --target: path for storing the profile data.
- --export_type (optional): profiling results to export. Set it to all to analyze all the data, or to basic (default) to analyze only the basic data.
- --deviceid (optional): device ID. If this parameter is not specified, the data of all devices in the --target directory is analyzed.
After the command is executed, the ai_core_op_summary_0.csv file is generated in the /home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA/csv directory. For details about the file, see Description of ai_core_op_summary_{device_id}.csv.
- View the profiling results in the following modes:
- Print the results in tabular format on the screen.
Go to the /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/bin directory and run the following command:
bash hiprof.sh --report_infer /home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA
- Dump the results in CSV format to the disk.
Go to the /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/profiler_tool/analysis/interface directory and run a command with the following syntax:
python3.7.5 get_msvp_info.pyc --save_file --project=$output_data --deviceid=$device_id --data_type=$export_data_type
For example, you can run the following command to export the Task Scheduler analysis result:
python3.7.5 get_msvp_info.pyc --save_file --project=/home/HwHiAiUser/JOBDCDCHAGIAGFICDAAHIAAAAAAAAAAA --deviceid=0 --data_type=task_scheduler
- For details about how to use get_msvp_info.pyc, see Exporting Profiling Results.
- --data_type: profiling results to export. Only the following types are supported:
- task_scheduler: Task Scheduler analysis result
- ai_core_pmu_events: AI Core analysis result
- ge_basic: GE task and graph information
- ge_model_load: information of GE loading a model
- ge_model_time: time taken by GE to load a model
- acl: AscendCL output
- top_down: time taken by each module in the inference workflow
- ai_core_op_summary: AI Core Operator Summary
- op_counter: AI Core Operator Statistics
- all: all profiling results
Description of Profiling Results
- Task Scheduler:
- Time(%): percentage of time taken by a task
- Time(ns): time taken by a task
- Count: number of times a task is executed
- Avg, Min, Max: average, minimum, and maximum time
- Waiting: total wait time of a task
- Running: total run time of a task. If a task has been running for a long time, the operator implementation may be incorrect.
- Pending: total pending time of a task
- Type: task type
- API: API name
- Task ID: task ID
- Stream ID: stream ID
- AI Core:
- Task ID: task ID
- Stream ID: stream ID
- Op Name: operator name
- aicore_time: time taken to execute all instructions
- total_cycles: number of cycles taken to execute all instructions
The analysis result of the AI Core metrics is described as follows:
- aicoreArithmeticThroughput
- mac_fp16_ratio: percentage of cycles taken to execute Cube fp16 instructions
- mac_int8_ratio: percentage of cycles taken to execute Cube int8 instructions
- vec_fp32_ratio: percentage of cycles taken to execute Vector fp32 instructions
- vec_fp16_ratio: percentage of cycles taken to execute Vector fp16 instructions
- vec_int32_ratio: percentage of cycles taken to execute Vector int32 instructions
- vec_misc_ratio: percentage of cycles taken to execute Vector misc instructions
- aicorePipeline
- vec_time: time taken to execute Vector instructions
- vec_ratio: percentage of cycles taken to execute Vector instructions
- mac_time: time taken to execute Cube instructions
- mac_ratio: percentage of cycles taken to execute Cube instructions
- scalar_time: time taken to execute Scalar instructions
- scalar_ratio: percentage of cycles taken to execute Scalar instructions
- mte1_time: time taken to execute MTE1 instructions (L1-to-L0A/L0B movement)
- mte1_ratio: percentage of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B movement)
- mte2_time: time taken to execute MTE2 instructions (DDR-to-AI Core movement)
- mte2_ratio: percentage of cycles taken to execute MTE2 instructions (DDR-to-AI Core movement)
- mte3_time: time taken to execute MTE3 instructions (AI Core-to-DDR movement)
- mte3_ratio: percentage of cycles taken to execute MTE3 instructions (AI Core-to-DDR movement)
- icache_miss_rate: I-Cache miss rate
- memory_bound: AI Core memory bound, calculated as mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bottleneck exists; otherwise, a memory bottleneck exists. See the worked example after this list.
- aicoreSynchronization
- scalar_waitflag_ratio: percentage of cycles for waiting between Scalar instructions
- cube_waitflag_ratio: percentage of cycles for waiting between Cube instructions
- vector_waitflag_ratio: percentage of cycles for waiting between Vector instructions
- mte1_waitflag_ratio: percentage of cycles for waiting between MTE1 instructions
- mte2_waitflag_ratio: percentage of cycles for waiting between MTE2 instructions
- mte3_waitflag_ratio: percentage of cycles for waiting between MTE3 instructions
- aicoreMemoryBandwidth
- ub_read_bw: UB read bandwidth (GB/s)
- ub_write_bw: UB write bandwidth (GB/s)
- l1_read_bw: L1 read bandwidth (GB/s)
- l1_write_bw: L1 write bandwidth (GB/s)
- l2_read_bw: L2 read bandwidth (GB/s)
- l2_write_bw: L2 write bandwidth (GB/s)
- main_mem_read_bw: main memory read bandwidth (GB/s)
- main_mem_write_bw: main memory write bandwidth (GB/s)
- aicoreInternalMemoryBandwidth
- scalar_ld_ratio: percentage of cycles taken to execute Scalar-read-UB instructions
- scalar_st_ratio: percentage of cycles taken to execute Scalar-write-UB instructions
- l0a_read_bw: L0A read bandwidth (GB/s)
- l0a_write_bw: L0A write bandwidth (GB/s)
- l0b_read_bw: L0B read bandwidth (GB/s)
- l0b_write_bw: L0B write bandwidth (GB/s)
- l0c_read_bw: L0C read bandwidth (GB/s)
- l0c_write_bw: L0C write bandwidth (GB/s)
- aicorePipelineStall
- vec_bankgroup_cflt_ratio: percentage of cycles in which Vector bank group conflicts occur (vec_bankgroup_stall_cycles)
- vec_bank_cflt_ratio: percentage of cycles in which Vector bank conflicts occur (vec_bank_stall_cycles)
- vec_resc_cflt_ratio: percentage of cycles in which Vector resource conflicts occur
- mte1_iq_full_ratio: percentage of cycles in which the MTE1 instruction queue is full (mte1_iq_full_cycles)
- mte2_iq_full_ratio: percentage of cycles in which the MTE2 instruction queue is full (mte2_iq_full_cycles)
- mte3_iq_full_ratio: percentage of cycles in which the MTE3 instruction queue is full (mte3_iq_full_cycles)
- cube_iq_full_ratio: percentage of cycles in which the Cube instruction queue is full (cube_iq_full_cycles)
- vec_iq_full_ratio: percentage of cycles in which the Vector instruction queue is full
- iq_full_ratio: sum of vec_resc_cflt_ratio, mte1_iq_full_ratio, mte2_iq_full_ratio, mte3_iq_full_ratio, cube_iq_full_ratio, and vec_iq_full_ratio
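As a worked example of the memory_bound formula above (with hypothetical values): if mte2_ratio is 0.6, mac_ratio is 0.4, and vec_ratio is 0.2, then memory_bound = 0.6/max(0.4, 0.2) = 1.5. Because the value is greater than 1, the operator is memory bound. If mte2_ratio were 0.3 instead, memory_bound would be 0.75, indicating no memory bottleneck.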
- GE task and graph information:
- Model Name: model name
- Op Name: operator name
- Op Type: operator type
- Task ID: task ID
- Block Dim: number of cores to execute a task
- Stream ID: stream ID
- Input Count: number of inputs
- Input Formats: input formats
- Input Shapes: input shapes
- Input Data Types: input data types
- Output Count: number of outputs
- Output Formats: output formats
- Output Shapes: output shapes
- Output Data Types: output data types
- Information of GE loading a model:
- Model Name: model name
- Model ID: model ID
- Stream ID: stream ID of a fusion operator
- Fusion Op: name of a fusion operator
- Original Ops: names of fused operators
- Memory Input: memory size of an input tensor
- Memory Output: memory size of an output tensor
- Memory Weight: weight memory size
- Memory Workspace: workspace memory size
- Memory Total: total memory, the sum of Memory Input, Memory Output, Memory Weight, and Memory Workspace
- Task IDs: task IDs
- Time taken by GE to load a model:
- Model Name: model name
- Model ID: model ID
- Data Index: data index
- Request ID: request ID
- Input Start Time: start time of data input
- Input Duration: time taken by the model to input data
- Inference Start Time: start time of data inference
- Inference Duration: time taken by the model to infer data
- Output Start Time: start time of data output
- Output Duration: time taken by the model to output data
- AscendCL output:
- Name: AscendCL API name
- Type: AscendCL API type
- Start Time: AscendCL API start time
- Duration: time taken to run an AscendCL API
- Process ID: process ID of an AscendCL API
- Thread ID: thread ID of an AscendCL API
- Top-down information:
- Infer ID: inference iteration ID
- Module Name: module name
- API: API name
- Start Time: start time
- Duration: total time