Application Project Profiling
To profile an application project in the CLI, log in to the development environment server as a common user, run the script to collect profile data, and view the profiling result in the CLI.
Before starting profiling, develop, build, and run the application project to generate an executable file of the project. For details, see the Application Software Development Guide.
Collecting Profile Data
Perform the following steps to collect profile data:
- Log in to the development environment as the HwHiAiUser user created during installation.
- Copy the project executable file (all files are in the out directory), model files, dataset files, and the acl.json file to the operating environment. Ensure that the owner of the copied files is HwHiAiUser.
scp -r /home/HwHiAiUser/AscendProjects/MyAppname/run/out HwHiAiUser@x.x.x.x:/home/HwHiAiUser/HIAI_PROJECTS/MyAppname
scp -r /home/HwHiAiUser/AscendProjects/MyAppname/run/model HwHiAiUser@x.x.x.x:/home/HwHiAiUser/HIAI_PROJECTS/MyAppname
scp -r /home/HwHiAiUser/AscendProjects/MyAppname/run/data HwHiAiUser@x.x.x.x:/home/HwHiAiUser/HIAI_PROJECTS/MyAppname
scp -r /home/HwHiAiUser/AscendProjects/MyAppname/src/ HwHiAiUser@x.x.x.x:/home/HwHiAiUser/HIAI_PROJECTS/MyAppname
In this section, paths and file names in the commands are examples only. Replace them with the actual paths and file names.
- /home/HwHiAiUser/AscendProjects/MyAppname/run/out: path of the generated executable file after building
- /home/HwHiAiUser/AscendProjects/MyAppname/run/model: path of the model files
- /home/HwHiAiUser/AscendProjects/MyAppname/run/data: path of the dataset files
- /home/HwHiAiUser/AscendProjects/MyAppname/src: path of the acl.json file
- /home/HwHiAiUser/HIAI_PROJECTS/MyAppname: path of the project files in the operating environment. Replace MyAppname with the actual project name. If the project directory does not exist, create it.
After the copy operation, the directory layout on the host (the out directory of the executable file and the directories of the model files, dataset files, and acl.json file) must match the layout in the application project code path.
- x.x.x.x: for Ascend EP, the IP address of the host; for Ascend RC, the IP address of the board environment.
- Go to the directory of the hiprof.pyc script, for example, /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/profiler_tool/analysis/command.
- Set the environment variable for the hiprof command in the development environment.
export LD_LIBRARY_PATH=/home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/lib64:${LD_LIBRARY_PATH}
If .so files such as libgflags.so.2.2 cannot be found when the application project is running, fix the error by referring to Profiling Fails Due to App Execution Error.
- Collect profile data over software and hardware modules.
The command syntax is as follows. For details about the command-line options, see Table 7-1.
python3.7.5 hiprof.pyc --ip_address={ip:port} --result_dir={result_path} --profiling_options={profiling_options} --app_dir={app_path} --app={app_name} ......
- Example of executing an application project without arguments (with app_dir)
python3.7.5 hiprof.pyc --ip_address=x.x.x.x:port --result_dir=/home/HwHiAiUser/tools/out/ --profiling_options=task_trace --app_dir=/home/HwHiAiUser/HIAI_PROJECTS/MyAppname/out --app=main
- Example of executing an application project without arguments (without app_dir)
python3.7.5 hiprof.pyc --ip_address=x.x.x.x:port --result_dir=/home/HwHiAiUser/tools/out/ --profiling_options=task_trace --app=/home/HwHiAiUser/HIAI_PROJECTS/MyAppname/out/main
- Example of executing an application project with arguments
In this scenario, the application name and arguments must be enclosed together in double quotation marks (""). In the following example, benchmark is the app name, and --om, --dataDir, --batchSize, --dvppConfig, --postprocessType, and --resnet50StdFile are the additional arguments required by the app. The format of the arguments must be consistent with that used in the application project.
python3.7.5 hiprof.pyc --ip_address=x.x.x.x:port --result_dir=/home/HwHiAiUser/tools/out --profiling_options=task_trace --app_dir=/home/HwHiAiUser/HIAI_PROJECTS/Benchmark/out --app="benchmark --om model/resnet50_aipp_b8_fp16_output_FP32.om --dataDir datasets/ImageNet2012-1024/ --batchSize 8 --dvppConfig configure/dvppConfig_resnet --postprocessType resnet --resnet50StdFile configure/jpg_accuracy.csv"
- Example of executing a command supporting system profiling
python3.7.5 hiprof.pyc --ip_address=x.x.x.x:port --result_dir=/home/HwHiAiUser/tools/out/ --profiling_options=task_trace,system_trace --ai_core_profiling_mode=sample-based --app_dir=/home/HwHiAiUser/HIAI_PROJECTS/MyAppname/out --app=main
- Command-line profiling supports prefix matching: any correct, unambiguous prefix of a command-line option triggers proper execution.
For example, --profiling_option is matched as --profiling_options.
- Enter the command using an English input method and ensure that spaces are standard ASCII spaces. Otherwise, the command execution may fail.
- Every option in the command must be assigned a value. Otherwise, an exception is reported. This is native Python behavior.
- The value of the --app option cannot contain the following special characters in the double quotation marks: [';*?`!#$%^&+=<>{}]|"
If a custom option contains the preceding special characters, write the corresponding execution statements into an executable script, start the application project using that script, and use the script's name and path as the values of --app and --app_dir, respectively.
- If you press Ctrl+C to stop profiling after it has started, wait for 10 seconds before running the profiling command again. Otherwise, the execution may fail.
- If the error message Data folder is locked is reported during the command execution, the previous profiling run may have exited abnormally. Delete the files in the output result folder specified by result_dir and run the command again.
- Tip: as the HwHiAiUser user, create an alias for the hiprof.pyc script with the command alias hiprof='python3.7.5 /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/profiler/profiler_tool/analysis/command/hiprof.pyc'. You can then start profiling with the hiprof shortcut from any directory.
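The option prefix matching noted above behaves like the abbreviation support in Python's argparse, which the hiprof tool is presumably built on; the sketch below illustrates that behavior under this assumption (hiprof's actual parser and full option set may differ):

```python
import argparse

# Sketch of unambiguous-prefix option matching as argparse implements it;
# only two of the Table 7-1 options are declared here for brevity.
parser = argparse.ArgumentParser()
parser.add_argument("--profiling_options")
parser.add_argument("--result_dir")

# --profiling_option is an unambiguous prefix of --profiling_options,
# so it resolves to the full option name.
args = parser.parse_args(["--profiling_option", "task_trace"])
print(args.profiling_options)  # task_trace
```

An ambiguous prefix (one matching several options) would instead raise a parse error, which is why the note above requires the prefix to be unambiguous.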
Table 7-1 Command-line options
Option
Description
--result_dir
(Required) Directory of the profiling result file, specified as an absolute path. If the directory already exists, the suffix .old is appended to the existing directory name; if it does not exist, it is created.
The application running user must have read and write permissions on the path specified by result_dir. Set the path to a directory under the user's home directory, for example, /home/HwHiAiUser/tools/out.
Note: This path must not be occupied by another application. Otherwise, the folder will be renamed and that application's execution will be affected.
--ip_address
(Required) IP address and port number of the operating environment. The port number is optional and defaults to 22118.
For Ascend EP, use the IP address of the host; for Ascend RC, use the IP address of the board environment.
--profiling_options
(Required) Option (or options separated by commas) to be traced:
- task_trace: application project profiling.
- system_trace: system profiling.
--app
(Required) Application executable file in the operating environment. Set to a file name or a full path. For example:
- Set to a file name: --app=main.
- Set to a full path: --app=/home/HwHiAiUser/HIAI_PROJECTS/MyApp2019/out/main.
If the app has arguments, enclose the app name and arguments together in double quotation marks (""). For example, --app="main parameters1 parameters2 parameters3".
If --app is set to a file name only, --app_dir is required.
--app_dir
(Optional) Path for storing the application executable file in the operating environment.
Example: --app_dir=/home/HwHiAiUser/HIAI_PROJECTS/MyApp2019/out
NOTE:
- If you change the default value of WORK_PATH in the ide_daemon.cfg file of the ADA module, do not use ~ to indicate the home directory of the ADA running user in the operating environment. Instead, write the absolute path of the application executable file.
- Ensure that the ADA running user in the operating environment has the read and write permissions on the path of the application executable file.
--devices
(Optional) Device ID (or device IDs separated by commas):
- When profiling_options is set only to task_trace, this option defaults to all. When profiling_options is set to other values, this option defaults to 0.
- When the configured value of profiling_options contains system_trace, only one device ID can be specified.
--app_location
(Optional) Execution target of an application project or a single operator. Defaults to host.
--advisor
(Optional) Whether to provide suggestions on performance improvement, either on (default) or off.
--ai_core_profiling
(Optional) AI Core profiling switch, either on (default) or off.
--ai_core_profiling_mode
(Optional) AI Core profiling mode, either task-based (default) or sample-based.
In task-based mode, profile data is collected task by task; in sample-based mode, profile data is collected at a fixed interval (that is specified by aicore_sampling_interval).
- To collect AI Core profile data, set --ai_core_profiling to on.
- When --profiling_options is set to system_trace, this option is required and must be set to sample-based.
--aicore_sampling_interval
(Optional) AI Core sampling interval (ms). Defaults to 10. The value range is [10, 1000].
--ai_core_metrics
(Optional) AI Core metrics: aicoreArithmeticThroughput, aicorePipeline, aicoreSynchronization, aicoreMemoryBandwidth, aicoreInternalMemoryBandwidth, and aicorePipelineStall. For details about the AI Core metrics, see AI Core Metrics.
- aicoreArithmeticThroughput: percentages of arithmetic throughput.
- aicorePipeline (default): percentages of time taken by the compute units and MTEs.
- aicoreSynchronization: percentages of synchronization instructions.
- aicoreMemoryBandwidth: percentages of external memory read/write instructions.
- aicoreInternalMemoryBandwidth: percentages of internal memory read/write instructions.
- aicorePipelineStall: percentages of pipeline stall instructions.
--app_env
(Optional) A custom environment variable required in the operating environment during profiling.
Enclose the value in double quotation marks (""). Separate multiple environment variables with semicolons (;).
Example: --app_env="LD_LIBRARY_PATH=/home/HwHiAiUser/Ascend/ascend-toolkit/latest/acllib/lib64"
--cpu_profiling
(Optional) CPU (AI CPU, Ctrl CPU, and TS CPU) profiling switch, either on (default) or off.
These hardware metrics take effect only when the value of profiling_options contains system_trace.
--cpu_sampling_interval
CPU sampling interval (ms). Defaults to 20. The value range is [20, 1000].
NOTE: If the configured interval is longer than the application execution time, TS CPU sampling may fail.
--sys_profiling
Profiling switch for system CPU utilization and system memory, either on or off (default).
--sys_sampling_interval
Sampling interval for system CPU utilization and system memory (ms). Defaults to 100. The value range is [100, 1000].
--pid_profiling
Profiling switch for process CPU utilization and process memory, either on or off (default).
--pid_sampling_interval
Sampling interval for process CPU utilization and process memory (ms). Defaults to 100. The value range is [100, 1000].
--hardware_mem
Profiling switch for LLC and DDR, either on (default) or off.
--llc_profiling
LLC event to profile.
- capacity: the LLC capacity of the AI CPU and Ctrl CPU.
- bandwidth: the LLC bandwidth.
To profile LLC capacity or bandwidth, set --hardware_mem to on.
--hardware_mem_sampling_interval
Sampling interval for LLC and DDR (ms). Defaults to 20. The value range is [1, 1000].
--io_profiling
NIC profiling switch, either on (default) or off.
--io_sampling_interval
NIC sampling interval (ms). Defaults to 10. The value range is [10, 1000].
--interconnection_profiling
PCIe profiling switch, either on (default) or off.
--interconnection_sampling_interval
PCIe sampling interval (ms). Defaults to 20. The value range is [20, 1000].
--dvpp_profiling
DVPP profiling switch, either on (default) or off.
--dvpp_sampling_interval
DVPP sampling interval (ms). Defaults to 20. The value range is [10, 1000].
--aiv_profiling
(Optional) Reserved.
--aiv_profiling_mode
(Optional) Reserved.
--aiv_sampling_interval
(Optional) Reserved.
--aiv_metrics
(Optional) Reserved.
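Since an out-of-range sampling interval is only rejected at run time, it can be convenient to validate values before launching a long profiling run. The helper below is a hypothetical illustration mirroring the ranges documented in Table 7-1; it is not part of the hiprof tool:

```python
# Hypothetical validator for the sampling-interval ranges in Table 7-1 (ms).
INTERVAL_RANGES = {
    "aicore_sampling_interval": (10, 1000),
    "cpu_sampling_interval": (20, 1000),
    "sys_sampling_interval": (100, 1000),
    "pid_sampling_interval": (100, 1000),
    "hardware_mem_sampling_interval": (1, 1000),
    "io_sampling_interval": (10, 1000),
    "interconnection_sampling_interval": (20, 1000),
    "dvpp_sampling_interval": (10, 1000),
}

def check_interval(option, value):
    """Raise ValueError if value (ms) is outside the documented range for option."""
    lo, hi = INTERVAL_RANGES[option]
    if not lo <= value <= hi:
        raise ValueError(f"--{option}={value} is outside [{lo}, {hi}] ms")
    return value

print(check_interval("cpu_sampling_interval", 20))  # 20 (the default)
```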
Viewing the Profiling Results
If there are over 50 data records for Runtime API, TS API, and AI Core Metrics in the CLI, export them in CSV format by referring to Exporting Profiling Results.
The command outputs shown in this section are examples only, and these examples may not apply to all cases.
- Runtime API calls
The information in the figure is described as follows:
- Name: API name
- Stream ID: stream ID of an API
- Time (%): percentage of time taken by an API
- Time (ns): time taken by an API
- Calls: number of API calls
- Avg, Min, Max: average, minimum, and maximum time taken by API calls
In Figure 7-1, N/A in the Stream ID column indicates that the API is called directly and does not belong to any stream.
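The summary columns above can be reproduced from raw per-call durations. The sketch below uses invented numbers purely to illustrate how Time (%), Avg, Min, and Max relate; real values come from the profiling result:

```python
# Made-up Runtime API call durations in ns, keyed by API name.
calls = {
    "rtMemcpy": [1200, 800, 1000],
    "rtKernelLaunch": [500, 700],
}
total_ns = sum(sum(d) for d in calls.values())

for name, durations in calls.items():
    time_ns = sum(durations)
    print(
        name,
        f"Time(%)={100 * time_ns / total_ns:.1f}",
        f"Time(ns)={time_ns}",
        f"Calls={len(durations)}",
        f"Avg={time_ns / len(durations):.0f}",
        f"Min={min(durations)}",
        f"Max={max(durations)}",
    )
```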
- Task Scheduler summary
Figure 7-2 Task Scheduler summary
The information in the figure is described as follows:
- Time(%): percentage of time taken by a task
- Time(ns): time taken by a task
- Count: number of times a task is executed
- Avg, Min, Max: average, minimum, and maximum time
- Waiting: total wait time of a task
- Running: total run time of a task. If a task has been running for a long time, the operator implementation may be incorrect.
- Pending: total pending time of a task
- Type: task type
- API: API name
- Task ID: task ID
- Op Name: operator name
- Stream ID: stream ID
Time(%) and Time(ns) for a task may be 0 because the actual chip crystal oscillator frequency is different from the sampling frequency. As a result, the start time and end time for profiling are the same.
An API name may be empty because the following Runtime APIs do not report the information: rtRDMASend, rtNotifyWait, rtIpcOpenNotify, rtSubscribeReport, rtCallbackLaunch, rtProcessReport, rtUnSubscribeReport, rtGetRunMode, rtRDMADBSend, and rtEndGraph.
- AI Core metrics
Figure 7-3 describes the profile data collected over AI Core when --ai_core_profiling_mode takes its default value task-based and --ai_core_metrics takes its default value aicorePipeline.
The AI Core metrics are described as follows:
- Task ID: task ID
- Stream ID: stream ID
- Op Name: operator name
- aicore_time: time taken to execute all instructions
- total_cycles: number of cycles taken to execute all instructions
- vec_time: time taken to execute Vector instructions
- vec_ratio: percentage of cycles taken to execute Vector instructions
- mac_time: time taken to execute Cube instructions
- mac_ratio: percentage of cycles taken to execute Cube instructions
- scalar_time: time taken to execute Scalar instructions
- scalar_ratio: percentage of cycles taken to execute Scalar instructions
- mte1_time: time taken to execute MTE1 instructions (L1-to-L0A/L0B movement)
- mte1_ratio: percentage of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B movement)
- mte2_time: time taken to execute MTE2 instructions (DDR-to-AI Core movement)
- mte2_ratio: percentage of cycles taken to execute MTE2 instructions (DDR-to-AI Core movement)
- mte3_time: time taken to execute MTE3 instructions (AI Core-to-DDR movement)
- mte3_ratio: percentage of cycles taken to execute MTE3 instructions (AI Core-to-DDR movement)
- icache_miss_rate: I-Cache miss rate
- memory_bound: identifies a memory bottleneck when the AI Core is computing operators, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bottleneck exists. Otherwise, a memory bottleneck exists.
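The memory_bound formula above can be evaluated directly; the metric values below are invented for illustration only, not real profiling output:

```python
def memory_bound(mte2_ratio, mac_ratio, vec_ratio):
    """memory_bound = mte2_ratio / max(mac_ratio, vec_ratio);
    a value of 1 or more indicates a memory bottleneck."""
    return mte2_ratio / max(mac_ratio, vec_ratio)

# Invented metric values for illustration.
print(memory_bound(0.30, 0.60, 0.10))  # 0.5 -> no memory bottleneck
print(memory_bound(0.80, 0.40, 0.20))  # 2.0 -> memory bottleneck
```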
- Operator data provided by the GE component
Figure 7-4 GE task and graph information
The information in the figure is described as follows:
- Model Name: model name
- Op Name: operator name
- Op Type: operator type
- Task ID: task ID
- Block Dim: number of cores to execute a task
- Stream ID: stream ID
- Input Count: number of inputs
- Input Formats: input formats
- Input Shapes: input shapes
- Input Data Types: input data types
- Output Count: number of outputs
- Output Formats: output formats
- Output Shapes: output shapes
- Output Data Types: output data types
- Time taken by GE to load a model, and time taken by the model to input, infer, and output data
- Information of GE loading a model
Figure 7-5 Information of GE loading a model
The information in the figure is described as follows:
- Model Name: model name
- Model ID: model ID
- Stream ID: stream ID of a fusion operator
- Fusion Op: name of a fusion operator
- Original Ops: names of fused operators
- Memory Input: memory size of an input tensor
- Memory Output: memory size of an output tensor
- Memory Weight: weight memory size
- Memory Workspace: workspace memory size
- Memory Total: total memory, the sum of Memory Input, Memory Output, Memory Weight, and Memory Workspace
- Task IDs: task IDs
- Time taken by the model to input, infer, and output data
Figure 7-6 Time taken by the model to input, infer, and output data
The information in the figure is described as follows:
- Model Name: model name
- Model ID: model ID
- Data Index: data index
- Request ID: request ID
- Input Start Time: start time of data input
- Input Duration: time taken by the model to input data
- Inference Start Time: start time of data inference
- Inference Duration: time taken by the model to infer data
- Output Start Time: start time of data output
- Output Duration: time taken by the model to output data
- AscendCL module, operators, and Runtime API results
Figure 7-7 AscendCL module result
- Name: AscendCL API name
- Type: AscendCL API type
- Start Time: AscendCL API start time
- Duration: time taken to run an AscendCL API
- Process ID: process ID of an AscendCL API
- Thread ID: thread ID of an AscendCL API
- AI Core operator statistics
Figure 7-8 AI Core operator statistics
The information in the figure is described as follows:
- Model Name: model name
- Op Type: operator type
- Core Type: Core type
- Count: number of times that an operator is called
- Total Time: time taken to call an operator
- Avg Time, Min Time, and Max Time: average, minimum, and maximum time taken to call an operator
- Ratio: percentage of time taken by an operator to the corresponding model
- Time taken by each module in the inference workflow
Figure 7-9 Top-down information
The information in the figure is described as follows:
- Infer ID: inference iteration ID
- Module Name: module name
- API: API name
- Start Time: start time
- Duration: total time
- System profiling result
- Ctrl CPU/AI CPU/TS CPU PMU events and hotspot functions
Figure 7-10 Ctrl CPU/AI CPU/TS CPU PMU events and hotspot functions
Unknown in the figure indicates a function without a symbol table.
- NIC result
Figure 7-11 NIC result
- Duration: duration
- Bandwidth: bandwidth
- Rx Bandwidth efficiency: bandwidth efficiency of RX packets
- rxPacket/s: RX packets per second
- rxError rate: error rate of RX packets
- rxDropped rate: loss rate of RX packets
- Tx Bandwidth efficiency: bandwidth efficiency of TX packets
- txPacket/s: TX packets per second
- txError rate: error rate of TX packets
- txDropped rate: loss rate of TX packets
- DVPP result
Figure 7-12 DVPP result
- Engine type: engine type
- Engine id: engine ID
- All Time: time taken to sample each engine
- All Frame: number of frames processed in each engine sampling
- All Utilization: average utilization, calculated by dividing the accumulated processing time by the running time
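The All Utilization figure above is a simple ratio of accumulated processing time to running time; a one-line check (numbers invented for illustration):

```python
def dvpp_utilization(processing_time_ms, running_time_ms):
    """All Utilization = accumulated processing time / running time."""
    return processing_time_ms / running_time_ms

# An engine busy for 150 ms out of a 600 ms sampling window.
print(dvpp_utilization(150.0, 600.0))  # 0.25
```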
- LLC result
Figure 7-13 LLC result
- When llc_profiling is set to capacity, the LLC capacity usage is displayed.
- When llc_profiling is set to bandwidth, the LLC read/write bandwidth and hit rate are displayed.
- DDR read/write bandwidth
Figure 7-14 DDR read/write bandwidth
- System AI CPU and system Ctrl CPU utilization
Figure 7-15 System CPU utilization
The information in the figure is described as follows:
- User: percentage of the execution duration of user-mode processes
- Sys: percentage of the execution duration of kernel-mode processes
- IoWait: percentage of the I/O wait time
- Irq: percentage of the hardware interrupt time
- Soft: percentage of the soft interrupt time
- Idle: percentage of the idle time
- Process CPU utilization
Figure 7-16 Process CPU utilization (top 50 data records)
- System memory summary
Figure 7-17 System memory summary
The information in the figure is described as follows:
- Memory Total: total memory size
- Memory Free: free memory size
- Buffers: buffer size
- Cached: cache size
- Share Memory: shared memory size
- Commit Limit: virtual memory limit
- Committed AS: memory allocated by the system
- Huge Pages Total: number of allocated huge memory pages
- Huge Pages Free: number of remaining huge memory pages
- Process memory information
Figure 7-18 Process memory information (top 50 data records)
Suggestions on Performance Improvement
N/A indicates that no suggestion is given for performance improvement.
- Cube/Vector compute utilization
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "low vector compute utilization" or "low cube compute utilization" if the compute utilization of the Vector unit or Cube unit for an operator is lower than the preset lower limit.
Figure 7-19 Low Vector/Cube compute utilization
- vec_bankgroup_cflt_ratio or vec_bank_cflt_ratio
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays a bank conflict when vec_bankgroup_cflt_ratio or vec_bank_cflt_ratio of an operator reaches the preset upper limit.
Figure 7-20 Vector bank group conflict has reached the upper limit
- Memory bound
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "Low memory handling efficiency" when the memory bound of an operator reaches the preset upper limit. Check whether the burst length for data movement is too small and whether there are repeated movements.
Figure 7-21 Low data memory handling efficiency
- Vector bound
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "please check repeat counts and vector mask" when the vector bound of an operator reaches the preset upper limit. Check whether the repeat counts of Vector instructions are too small and whether Vector masks are frequently edited.
Figure 7-22 Please check repeat counts and vector mask
- Interval between adjacent operators
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "Task wait time has reached the upper limit" when the wait time between operators reaches the upper limit.
Figure 7-23 Task wait time has reached the upper limit
- transData operators
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "please check and reduce the transData" when the number of transData operators reaches the preset upper limit. Check if there are redundant transData operators.
Figure 7-24 Please check and reduce the transData
- AI CPU
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "please check and reduce aicpu operator" when the number of AI CPU operators reaches the preset upper limit. Check if there are redundant AI CPU operators.
Figure 7-25 Please check and reduce aicpu operator
- Workspace memory
Applies only to application project tracing (task_trace).
Suggestion:
The performance summary report displays "please check and reduce the memory workspace" when memory_workspace of an operator is not 0. Reduce memory_workspace as required.
Figure 7-26 Please check and reduce the workspace memory