Pipeline
The AI Core reads instructions in sequence but executes them in parallel, as shown in the following figure.
Instructions are decoded in sequence and executed in either of the following modes:
- Scalar instructions are executed directly.
- Other instructions are scheduled to five independent classification queues and then allocated to the corresponding execution unit for execution.
The following table lists the classification queues.
Table 2-7 Classification queues

| Queue Abbreviation | Queue Name | Description |
| --- | --- | --- |
| V | Vector instruction queue | Schedules Vector instructions. |
| M | Matrix instruction queue | Schedules Cube instructions. |
| MTE1 | Movement instruction queue 1 | Schedules memory movement instructions of the following types: L1 to L0A/L0B/UB, or L0A/L0B buffer initialization using the SPR |
| MTE2 | Movement instruction queue 2 | Schedules memory movement instructions of the following types: L2/HBM/DDR to L1/L0A/L0B/UB |
| MTE3 | Movement instruction queue 3 | Schedules memory movement instructions of the following types: UB to L2/HBM/DDR |
Instructions are thus classified by scheduling queue. Together with the scalar instruction (S for short), which is interpreted directly during decoding, there are six instruction types: S, V, M, MTE1, MTE2, and MTE3.
Except for scalar (S) instructions, instructions in different queues can be executed out of order, while instructions within a single queue are executed in sequence. That is, as long as data dependencies are satisfied, the physical execution order of instructions may differ from the order in which they are written in the code.
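The dispatch model above can be sketched in plain Python. This is an illustrative toy model with assumed instruction names, not real hardware state: instructions are decoded in program order and appended to the queue of their execution unit, so each queue preserves program order internally even though the hardware may interleave execution across queues.

```python
from collections import defaultdict

# Hypothetical instruction stream: (execution unit, operation).
program = [("MTE2", "load A: GM -> L1"),
           ("MTE2", "load B: GM -> L1"),
           ("V",    "vadd"),
           ("MTE3", "store C: UB -> GM")]

queues = defaultdict(list)
for unit, op in program:      # decode in sequence
    queues[unit].append(op)   # dispatch to that unit's queue

# Each queue preserves program order internally...
assert queues["MTE2"] == ["load A: GM -> L1", "load B: GM -> L1"]
# ...but across queues the hardware may interleave execution freely,
# so "vadd" could physically start before "load B" completes unless a
# flag or Barrier enforces the dependency.
print(dict(queues))
```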
The hardware dispatches the instructions of each type to the corresponding queue in delivery order. The Ascend AI Processor provides the Barrier and set_flag/wait_flag instructions to ensure that intra- and inter-queue instructions are executed according to their logical dependencies.
- Barrier is an instruction that constrains the execution order within a single queue. It ensures that all data read and write operations issued before it in the queue are complete before subsequent instructions execute.
- set_flag and wait_flag are a pair of instructions that specify a relationship between two instruction queues, implementing a "lock" mechanism between them. Their functions are as follows:
  - set_flag: executes after all read and write operations of the preceding instruction are complete, and sets the corresponding hardware flag bit to 1.
  - wait_flag: when this instruction executes, if the corresponding flag bit is 0, subsequent instructions in the queue are blocked; if the flag bit is 1, it is cleared to 0 and subsequent instructions execute.
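The set_flag/wait_flag handshake can be modeled with ordinary threads. The sketch below is a software analogy only: the `Flag` class, the queue functions, and the operations they log are invented for illustration and do not reflect the real hardware API. It shows how a consumer queue blocks on a flag bit until the producer queue raises it, and how waiting consumes the bit (1 becomes 0).

```python
import threading

class Flag:
    """One hardware flag bit shared by a (set-queue, wait-queue) pair."""
    def __init__(self):
        self._cond = threading.Condition()
        self._bit = 0

    def set(self):
        # set_flag: raise the bit once the preceding instruction's
        # reads and writes are complete.
        with self._cond:
            self._bit = 1
            self._cond.notify_all()

    def wait(self):
        # wait_flag: block while the bit is 0; on success, clear it
        # (1 -> 0) before letting subsequent instructions run.
        with self._cond:
            while self._bit == 0:
                self._cond.wait()
            self._bit = 0

log = []
flag = Flag()

def mte2_queue():
    log.append("MTE2: copy tile GM -> UB")  # producer's data movement
    flag.set()                              # signal: tile is ready

def vector_queue():
    flag.wait()                             # block until MTE2 signals
    log.append("V: vector add on UB tile")  # safe: data is in UB

t_v = threading.Thread(target=vector_queue)
t_m = threading.Thread(target=mte2_queue)
t_v.start(); t_m.start()
t_v.join(); t_m.join()
print(log)  # the MTE2 entry always precedes the V entry
```

Regardless of thread scheduling, the flag guarantees the vector operation never observes the buffer before the copy completes, which is exactly the cross-queue ordering the hardware instructions enforce.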
Because TBE encapsulates these dependencies, application developers do not need to program Barrier, set_flag, or wait_flag themselves. However, understanding this mechanism still helps achieve better synchronization through proper code scheduling. DSL-based operator development does not require manual code scheduling; the DSL provides the auto_schedule mechanism.