Overview
Why Does Auto Tune Matter?
- To fully utilize the computing power of AI computing processors, the computation must be carefully organized.
An AI computing processor consists of multiple compute units, on-chip storage, data transfer modules, and other components. The runtime of an operator on the Ascend AI Processor therefore cannot be estimated by simply dividing the computation amount by the computing power; it also depends on how well these components cooperate. For the same operator deployed on the same processor, computing efficiency varies greatly with the pipeline, and similar computing inputs may require completely different pipelines. Only well-designed scheduling logic can exploit the full computing power of the hardware.
- Pipelines between components must be well designed to achieve optimal performance.
The theoretical maximum performance of an operator is obtained by dividing the bottleneck load (covering both computation and data transfer) by the efficiency of the corresponding processing unit. Because on-chip storage is limited, a computing task must be tiled for processing, which introduces some computation and transfer redundancy; the actual load is therefore usually greater than the theoretical one. The amount of redundancy depends on the pipeline scheme. Generally, a scheme with low redundancy is selected, or the redundancy is shifted to a non-bottleneck component (a toy model after this list illustrates the effect). To achieve optimal performance, the timing between components must therefore be carefully designed.
Figure 4-1 Efficiency comparison of different timings
- Operator scheduling is too complex to be covered by experience alone.
In TBE, much of the pipelining between computing components is controlled by the Schedule. To improve general applicability, TVM introduces a number of processing layers between a schedule and the hardware behavior it produces, so scheduling operations that look similar can behave very differently on the hardware. The space of possible schedules is far too large to be covered by experience alone; the second sketch after this list shows how different schedules for the same computation are expressed.
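To make the bottleneck-and-redundancy argument above concrete, here is a toy model in Python. All cycle counts are hypothetical, and pipelined_time is an illustrative helper rather than part of any Ascend API: it assumes compute and data transfer overlap perfectly, so the slower of the two dominates.

```python
# Toy model of the pipeline reasoning above; all numbers are hypothetical
# and pipelined_time is an illustrative helper only.
def pipelined_time(compute_cycles, transfer_cycles):
    """Runtime lower bound when compute and data transfer fully overlap:
    the bottleneck component dominates."""
    return max(compute_cycles, transfer_cycles)

# Theoretical optimum: transfer (1200 cycles) is the bottleneck.
print(pipelined_time(1000, 1200))        # 1200

# Tiling scheme A: 30% extra transfer redundancy lands on the bottleneck.
print(pipelined_time(1000, 1200 * 1.3))  # 1560.0

# Tiling scheme B: the redundancy is shifted to the compute side (the
# non-bottleneck component), e.g. by recomputing instead of
# re-transferring, so the runtime barely grows.
print(pipelined_time(1000 * 1.3, 1200))  # 1300.0
```

Scheme B carries the same total redundancy as scheme A but hides most of it behind the bottleneck, which is exactly the kind of trade-off a pipeline scheme has to make.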
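To illustrate how quickly the scheduling space grows, the following sketch uses the open-source TVM primitives that TBE builds on (the TBE Schedule API itself may differ by CANN version). The same elementwise computation is given two schedules whose lowered loop nests, and hence hardware behavior after code generation, differ substantially; the tiling factor 32 is an arbitrary choice among many.

```python
import tvm
from tvm import te

# The same computation scheduled two ways; the lowered loop nests (and
# hence the hardware behavior after code generation) differ substantially.
n = 1024
A = te.placeholder((n, n), name="A")
B = te.compute((n, n), lambda i, j: A[i, j] * 2.0, name="B")

# Schedule 1: the default schedule, a plain nested loop.
s_naive = te.create_schedule(B.op)

# Schedule 2: tile both axes. Every axis of every stage admits many such
# choices, which is why the schedule space explodes combinatorially.
s_tiled = te.create_schedule(B.op)
io, jo, ii, ji = s_tiled[B].tile(B.op.axis[0], B.op.axis[1],
                                 x_factor=32, y_factor=32)

# Inspect the loop structure each schedule produces.
print(tvm.lower(s_naive, [A, B], simple_mode=True))
print(tvm.lower(s_tiled, [A, B], simple_mode=True))
```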
The following table compares the time consumption and cost of manual tuning and auto tuning.
| | Time Consumption | Cost |
| --- | --- | --- |
| Manual tuning | Long, on the order of days | Requires experienced experts. |
| Auto tuning | Short, on the order of minutes | Mainly requires machine resources; reduces manual intervention. |
In short, it is complex to maximize operator performance on the Ascend AI Processor. Auto tuning, rather than manual tuning, is a much faster way to approach the hardware's achievable performance.
Auto Tune Functions
Auto Tune can automatically tune operators by leveraging hardware resources.
When generating your network model, you only need to set auto_tune_mode to enable the Auto Tune tool. The tool is then called automatically during operator build, and the tuning results are stored in the custom repository. From then on, the operator delivers the tuned performance whenever it is called again.
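As an illustration, the snippet below shows one documented way to pass auto_tune_mode when running a TensorFlow network through the Ascend adapter (sess.run mode). Treat it as a hedged sketch: the exact option spelling and the supported values ("RL", "GA") depend on your CANN version, so check the documentation for your release.

```python
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Sketch of enabling Auto Tune via the NPU session configuration; option
# names follow the CANN TensorFlow adapter docs and may vary by version.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Enable both tuning modes: RL (reinforcement learning) and GA (genetic
# algorithm). Tuning results are stored in the custom repository.
custom_op.parameter_map["auto_tune_mode"].s = tf.compat.as_bytes("RL,GA")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    pass  # build and run your network here
```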
The current version of Auto Tune supports only the auto tuning of AI Core operators whose computation logic is implemented using DSL APIs. For details about the supported operators, see Operator Lists.
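For reference, a DSL-implemented AI Core operator of the kind Auto Tune can tune looks roughly like the sketch below. It follows the TBE DSL pattern (placeholders, a te.lang.cce compute API, auto_schedule, and cce_build_code); the function name, shapes, and kernel_name are illustrative, and module paths can differ across CANN versions.

```python
import te.lang.cce
from te import tvm

# Sketch of a DSL-based elementwise Add operator; add_custom and
# kernel_name are illustrative names, not entries from the operator lists.
def add_custom(shape=(16, 1024), dtype="float16", kernel_name="add_custom"):
    # Declare the input tensors.
    data_x = tvm.placeholder(shape, name="data_x", dtype=dtype)
    data_y = tvm.placeholder(shape, name="data_y", dtype=dtype)

    # Express the computation with a TBE DSL API (te.lang.cce.vadd);
    # only operators built from such DSL APIs are auto-tunable.
    res = te.lang.cce.vadd(data_x, data_y)

    # auto_schedule is where tuned schedules take effect when
    # auto_tune_mode is enabled during the build.
    with tvm.target.cce():
        sch = te.lang.cce.auto_schedule(res)

    config = {"name": kernel_name,
              "tensor_list": [data_x, data_y, res]}
    te.lang.cce.cce_build_code(sch, config)
```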