Overview
The Ascend AI Processor is a high-performance integrated circuit dedicated to AI applications. It consists of the Ctrl CPU, AI CPU, and AI Core. The Ctrl CPU runs the operating system (OS), while the AI CPU and AI Core perform high-performance AI computing. An end-to-end Profiling system can verify the performance of the Ascend 910 AI Processor across processor verification, operator development, training, and inference. The system offers an economical way to reach optimal performance by accurately locating software and hardware bottlenecks, analyzing them efficiently, and guiding targeted optimization.
The Ascend 910 AI Processor can be profiled in three modes:
- Job profiling
  - Collects software profile data from a training job and the AI Software Stack, focusing on data augmentation, forward and backward propagation, and gradient aggregation and update.
  - Collects profile data from the Hardware Task Scheduler (HWTS) or AI Core of the Ascend 910 AI Processor to analyze the start and end times of a training job.
- System profiling
  Collects profile data unrelated to training jobs, including data from the Ctrl CPU, AI CPU, Task Scheduler (TS) CPU, high-bandwidth memory (HBM), and double data rate (DDR) memory, to analyze and optimize the performance of a single Ascend 910 AI Processor.
- Single-operator profiling
  Collects profile data of a single operator, including data from Framework, Runtime, AI Core (task-based), and HWTS.

Table 3-1 lists the modules available for job profiling, system profiling, and single-operator profiling.
Category | Node | Module | Metric |
---|---|---|---|
Job profiling | AI Host | Framework | Graph variables |
Job profiling | AI Host | HCCL | Collective communication |
Job profiling | AI Host | Runtime | Task traces |
Job profiling | Device | Data Preprocess | Data augmentation |
Job profiling | Device | AI Core (task-based) | PMU events |
Job profiling | Device | Task Scheduler Track | Task Scheduler timeline |
Job profiling | Device | Training Trace | Iteration traces |
Job profiling | Device | L2 Cache | PMU events about L2 AI Core caches |
Job profiling | Device | HWTS Log (task-based) | Task time |
System profiling | Device | Ctrl CPU | PMU events |
System profiling | Device | AI CPU | PMU events |
System profiling | Device | TS CPU | PMU events |
System profiling | Device | HCCS | High-performance inter-chip communication bandwidth |
System profiling | Device | LLC | CPU L3 caches |
System profiling | Device | DDR | SDRAM read/write bandwidth |
System profiling | Device | Memory | Memory utilization of the system and processes |
System profiling | Device | CPU Usage | CPU utilization of the system and processes |
System profiling | Device | HBM | High-bandwidth memory |
System profiling | Device | NIC | NIC speed, error rate, and packet loss rate |
System profiling | Device | RoCE | RoCE speed, error rate, and packet loss rate |
System profiling | Device | AI Core (sample-based) | PMU events about AI Core instructions retired, CPU cycles, and more |
System profiling | Device | PCIe | PCIe read/write bandwidth |
Single-operator profiling | AI Host | Framework | Graph variables |
Single-operator profiling | AI Host | Runtime | Task traces |
Single-operator profiling | Device | AI Core (task-based) | PMU events |
Single-operator profiling | Device | HWTS Log (task-based) | Task time |
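
The module list in Table 3-1 can also be handled programmatically, for example when checking which modules apply to a given profiling mode. The following Python snippet is a minimal sketch that only re-encodes the table as a nested dictionary. The names `PROFILING_MODULES`, `modules_for`, and `categories_with` are hypothetical helpers chosen for illustration and are not part of the Profiling tool or its CLI.

```python
# Illustrative encoding of Table 3-1 (category -> node -> module -> metric).
# All names below are hypothetical; they are not a Profiling tool interface.
PROFILING_MODULES = {
    "Job profiling": {
        "AI Host": {
            "Framework": "Graph variables",
            "HCCL": "Collective communication",
            "Runtime": "Task traces",
        },
        "Device": {
            "Data Preprocess": "Data augmentation",
            "AI Core (task-based)": "PMU events",
            "Task Scheduler Track": "Task Scheduler timeline",
            "Training Trace": "Iteration traces",
            "L2 Cache": "PMU events about L2 AI Core caches",
            "HWTS Log (task-based)": "Task time",
        },
    },
    "System profiling": {
        "Device": {
            "Ctrl CPU": "PMU events",
            "AI CPU": "PMU events",
            "TS CPU": "PMU events",
            "HCCS": "High-performance inter-chip communication bandwidth",
            "LLC": "CPU L3 caches",
            "DDR": "SDRAM read/write bandwidth",
            "Memory": "Memory utilization of the system and processes",
            "CPU Usage": "CPU utilization of the system and processes",
            "HBM": "High-bandwidth memory",
            "NIC": "NIC speed, error rate, and packet loss rate",
            "RoCE": "RoCE speed, error rate, and packet loss rate",
            "AI Core (sample-based)": "PMU events about AI Core instructions "
                                      "retired, CPU cycles, and more",
            "PCIe": "PCIe read/write bandwidth",
        },
    },
    "Single-operator profiling": {
        "AI Host": {
            "Framework": "Graph variables",
            "Runtime": "Task traces",
        },
        "Device": {
            "AI Core (task-based)": "PMU events",
            "HWTS Log (task-based)": "Task time",
        },
    },
}


def modules_for(category):
    """Return (node, module, metric) rows for one profiling category."""
    return [
        (node, module, metric)
        for node, modules in PROFILING_MODULES[category].items()
        for module, metric in modules.items()
    ]


def categories_with(module):
    """Return the profiling categories in which a module appears."""
    return [
        category
        for category, nodes in PROFILING_MODULES.items()
        if any(module in modules for modules in nodes.values())
    ]


if __name__ == "__main__":
    # Print the rows that apply to single-operator profiling.
    for node, module, metric in modules_for("Single-operator profiling"):
        print(f"{node:8} {module:24} {metric}")
```

A lookup such as `categories_with("HWTS Log (task-based)")` shows that the module is shared by job profiling and single-operator profiling, which mirrors the overlap visible in the table.
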
The Profiling system supports queries only through the command-line interface (CLI). Querying, starting, stopping, and viewing profiling results in a graphical user interface (GUI) are not supported. When interfaced with HUAWEI CLOUD, ModelArts can start and stop the Profiling tool. For details, see the ModelArts online help or manual.
This document describes only how to install the tool independently on a server. The Toolkit includes the Profiling tool, so if the Toolkit was installed during environment setup, you do not need to install the Profiling tool again.