Main Functions
TIK Control
TIK control functions control the behavior of the DSL container (TIK), for example, modifying the behavior of a target program and generating binary code. The TIK control functions include the following:
- Class TIK Constructor for creating a TIK DSL container.
tik_instance = tik.Tik()
- BuildCCE for building all statements in the DSL container to generate binary code that can be executed on the Ascend AI Processor, that is, the operator binary file (.o) and operator description file (.json).
tik_instance.BuildCCE(kernel_name = "test", inputs = [input_tensor], outputs = [output_tensor])
- kernel_name: indicates the name of the AI Core kernel function in the binary code generated during the build.
- inputs: indicates the input tensors of the operator. The storage scope must be Global Memory.
- outputs: indicates the output tensors of the operator. The storage scope must be Global Memory.
By default, the files generated after the build are stored in ./kernel_meta. You can also specify a path by setting the output_files_path parameter in BuildCCE.
Data Definition
TIK supports computation on different data types, but each API supports only specific data types. The data types supported by TIK are int8, uint8, int16, uint16, int32, uint32, float16, float32, and uint1 (bool). TIK is a strongly typed language: computation between different data types is not allowed.
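Because TIK is strongly typed, it helps to keep the byte width of each type in mind when sizing buffers. The following plain-Python helper (hypothetical, not a TIK API) makes the widths explicit:

```python
# Byte widths of the numeric TIK data types (hypothetical helper, not a TIK
# API; uint1/bool is omitted because it is sub-byte).
DTYPE_BYTES = {
    "int8": 1, "uint8": 1,
    "int16": 2, "uint16": 2, "float16": 2,
    "int32": 4, "uint32": 4, "float32": 4,
}

def dtype_bytes(dtype):
    """Return the element size in bytes for a TIK dtype string."""
    return DTYPE_BYTES[dtype]
```

For example, `dtype_bytes("float16")` returns 2, the element size used in the movement examples later in this section.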
TIK provides two kinds of data types: Scalar and Tensor. This section describes the definition methods of the two data types.
Scalar
Scalar data corresponds to the data in the storage register or Scalar Buffer. TIK provides the Scalar API for defining scalar data. For example:
data_A = tik_instance.Scalar(dtype = "float32")
The data type of the scalar object is specified by dtype, and can be int8, uint8, int16, uint16, float16, int32, uint32, float32, int64, or uint64.
Alternatively, you can assign an initial value for the scalar by specifying init_value.
data_A = tik_instance.Scalar(dtype="float32", init_value=10.2)
The lifecycle of a scalar complies with the following rules:
- A scalar is created when it is declared and released when the code block in which it is located ends. The state between creation and release is called the active state.
- Only a scalar in active state can be accessed.
Figure 5-4 shows a scalar lifecycle, including the active states of variables S0, S1, and S2.
Tensor
Tensor data corresponds to the data in the storage buffer. TIK provides the Tensor API for defining tensor data. You only need to specify the data type (dtype), shape (shape), and data storage scope (scope). For example:
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
Key parameters:
- dtype: data type of the Tensor object. The value can be uint8, int8, uint16, int16, float16, uint32, int32, float32, uint64, or int64.
- shape: a list or tuple of ints, specifying the shape of the Tensor object.
- scope: buffer scope of the Tensor object, that is, buffer space where the Tensor object is located. The options are as follows:
- scope_cbuf: L1 Buffer
- scope_cbuf_out: L1OUT Buffer
- scope_ubuf: Unified Buffer
- scope_gm: Global Memory
TIK automatically allocates an address for each tensor object and avoids address conflicts between data blocks. In addition, TIK automatically checks the data dependency between tensors to implement synchronization.
The lifecycle of a tensor complies with the following rules:
- A tensor is created when it is declared and released when the code block in which it is located ends. The state between creation and release is called the active state.
- Only a tensor in active state can be accessed.
- At any time, the total buffer size of active tensors cannot exceed the total size of the corresponding physical buffers.
Figure 5-5 shows the lifecycle of a tensor.
In the preceding example, the code is divided into five time segments (1–5). Table 5-1 lists the active tensors in each time segment and the total UB size.
| Time Segment | Active Tensors | Total UB Size |
|---|---|---|
| 1 | B0 | 256 * 2 bytes |
| 2 | B0, B1 | 256 * 2 * 2 bytes |
| 3 | B0, B1, B2 | 256 * 3 * 2 bytes |
| 4 | B0, B1, B3 | 256 * 3 * 2 bytes |
| 5 | B0, B4 | 256 * 2 * 2 bytes |
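The UB totals in Table 5-1 can be reproduced with simple arithmetic, assuming each tensor in the example holds 256 float16 elements (2 bytes each):

```python
ELEMENTS = 256          # elements per tensor in the example
BYTES_PER_ELEMENT = 2   # float16

# Active tensors per time segment, as listed in Table 5-1.
active = {
    1: ["B0"],
    2: ["B0", "B1"],
    3: ["B0", "B1", "B2"],
    4: ["B0", "B1", "B3"],
    5: ["B0", "B4"],
}

# Total UB occupancy per segment, in bytes.
totals = {seg: len(ts) * ELEMENTS * BYTES_PER_ELEMENT
          for seg, ts in active.items()}
```

The peak occupancy (segments 3 and 4) is what must stay below the physical Unified Buffer size.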
In actual development, the shape of the input tensor may exceed the capacity of the Unified Buffer. In this case, the data must be moved and computed in multiple passes. Therefore, to maximize use of the Unified Buffer, the shape specified in the data definition is the maximum allowed by the Unified Buffer. A code example is as follows:
# Obtain the Unified Buffer size, in bytes.
ub_size_bytes = te.platform.get_soc_spec("UB_SIZE")
# In the Unified Buffer, data must be read and written in units of 32-byte blocks.
block_byte_size = 32
# Calculate the number of elements that a block can hold, based on the input data type dtype_x.
dtype_bytes_size = cce.cce_intrin.get_bit_len(dtype_x) // 8
data_each_block = block_byte_size // dtype_bytes_size
# Calculate the space to be allocated in the Unified Buffer, rounded down to 32-byte alignment.
ub_tensor_size = (ub_size_bytes // dtype_bytes_size // data_each_block * data_each_block)
# Create a tensor input_x_ub in the Unified Buffer.
input_x_ub = tik_instance.Tensor(dtype_x, (ub_tensor_size,), name="input_x_ub", scope=tik.scope_ubuf)
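The sizing arithmetic above can be checked in plain Python. The UB size below is a hypothetical stand-in for the value returned by `te.platform.get_soc_spec("UB_SIZE")`, and float16 input is assumed:

```python
# Hypothetical Unified Buffer size in bytes (248 KB); on real hardware this
# value comes from te.platform.get_soc_spec("UB_SIZE").
ub_size_bytes = 248 * 1024
block_byte_size = 32                 # UB is read/written in 32-byte blocks

dtype_bytes_size = 2                 # float16 occupies 2 bytes per element
data_each_block = block_byte_size // dtype_bytes_size   # elements per block

# Largest element count that fits in UB and is a whole number of blocks.
ub_tensor_size = (ub_size_bytes // dtype_bytes_size
                  // data_each_block * data_each_block)
```

The rounding (`// data_each_block * data_each_block`) guarantees the tensor occupies a whole number of 32-byte blocks.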
In addition, you can save storage space of the Unified Buffer through address reuse (address overlapping). Use the vector single-input computation as an example. When address reuse meets corresponding constraints, you can define a tensor that can be used by both the source and destination operands to save the storage space.
Data Movement
For vector computation, data is stored in the Unified Buffer and then computed. The data flow is Global Memory > Unified Buffer > Global Memory. TIK provides the data_move API to implement data movement between the Global Memory and Unified Buffer. The function prototype is as follows:
data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
In the data_move function prototype, pay attention to six parameters: dst, src, nburst, burst, src_stride, and dst_stride. dst and src indicate the destination and source operands, respectively, that is, the start addresses of the data movement; nburst indicates the number of consecutive data segments to be moved; burst indicates the length of each consecutive data segment (unit: 32-byte blocks); src_stride and dst_stride indicate the block-to-block stride between adjacent segments of the source operand and destination operand, respectively. By configuring these six parameters, data_move supports both consecutive and interleaved movement modes.
In the consecutive mode shown in Figure 5-6, a float16 tensor with shape (2, 128) can be moved to the Unified Buffer at a time. In consecutive mode, elements 0–255 form a consecutive segment. There is only one consecutive segment, so the stride between two adjacent segments of the source and destination tensors is 0. The corresponding code is as follows:
# Number of consecutive data segments to be moved
nburst = 1
# In the Unified Buffer, data must be read and written in units of 32-byte blocks.
block_byte_size = 32
# Calculate the number of elements that a block can hold, based on the input data type dtype_x.
dtype_bytes_size = cce.cce_intrin.get_bit_len(dtype_x) // 8
data_each_block = block_byte_size // dtype_bytes_size
# Length of the consecutive data segment in blocks, rounded up to 32-byte alignment
burst = math.ceil(2 * 128 / data_each_block)
# Consecutive movement
tik_instance.data_move(input_x_ub, input_x_gm, 0, nburst, burst, 0, 0)
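For the (2, 128) float16 tensor above, the burst arithmetic works out as follows (plain Python, independent of the TIK runtime):

```python
import math

# A float16 element is 2 bytes, so a 32-byte block holds 16 elements.
dtype_bytes_size = 2
data_each_block = 32 // dtype_bytes_size

# All 2 * 128 = 256 elements form one consecutive segment.
nburst = 1
burst = math.ceil(2 * 128 / data_each_block)   # segment length in blocks
```

The single segment of 16 blocks covers exactly 256 elements * 2 bytes = 512 bytes, so both strides are 0.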
In the interleaved mode shown in Figure 5-6, assume that elements 0-63 and 128-191 are to be moved. There are two consecutive data segments of the same length. In this case, the block-to-block stride between adjacent segments of the source tensor is 4 x 32 bytes, and the segments are stored consecutively in the destination tensor without any stride. The corresponding code is as follows:
# Number of consecutive data segments to be moved
nburst = 2
# In the Unified Buffer, data must be read and written in units of 32-byte blocks.
block_byte_size = 32
# Calculate the number of elements that a block can hold, based on the input data type dtype_x.
dtype_bytes_size = cce.cce_intrin.get_bit_len(dtype_x) // 8
data_each_block = block_byte_size // dtype_bytes_size
# Length of each consecutive data segment in blocks, rounded up to 32-byte alignment
burst = math.ceil(64 / data_each_block)
# Interleaved movement
tik_instance.data_move(input_x_ub, input_x_gm, 0, nburst, burst, 4, 0)
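The src_stride value of 4 can be derived from the gap between the two segments; a plain-Python sketch of the arithmetic:

```python
import math

data_each_block = 32 // 2       # 16 float16 elements per 32-byte block

# Each segment holds 64 elements; the gap between the segments
# (elements 64-127) is also 64 elements.
burst = math.ceil(64 / data_each_block)   # blocks per segment
src_stride = 64 // data_each_block        # gap of 64 elements, in blocks
```

Both the segment length and the gap come out to 4 blocks, matching the burst and src_stride values passed to data_move above.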
In both consecutive and interleaved mode, if data to be moved exceeds the upper limit of the Unified Buffer, the data needs to be moved in multiple cycles. In this case, to maximize the storage space of the Unified Buffer, the shape specified in the data definition is the maximum allowed by the Unified Buffer. An example is as follows:
# move_num indicates the input tensor size, which exceeds the maximum value ub_tensor_size allowed by the Unified Buffer.
loop_time = move_num // ub_tensor_size
# Movement offset
move_offset = 0
# Loop movement
with tik_instance.for_range(0, loop_time) as loop_index:
    move_offset = loop_index * ub_tensor_size
    burst_len = ub_tensor_size // data_each_block
    tik_instance.data_move(input_x_ub, input_x_gm[move_offset], 0, 1, burst_len, 0, 0)
move_offset = loop_time * ub_tensor_size
# Last movement
last_num = move_num % ub_tensor_size
if last_num > 0:
    burst_len = math.ceil(last_num / data_each_block)
    tik_instance.data_move(input_x_ub, input_x_gm[move_offset], 0, 1, burst_len, 0, 0)
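The loop/tail split above can be exercised in plain Python. The numbers below are hypothetical: move_num = 1000 elements, ub_tensor_size = 256 elements per pass, 16 float16 elements per block:

```python
import math

move_num = 1000          # hypothetical total element count
ub_tensor_size = 256     # hypothetical per-pass capacity, in elements
data_each_block = 16     # float16 elements per 32-byte block

moves = []               # (offset, burst_len) of each simulated data_move
loop_time = move_num // ub_tensor_size
for loop_index in range(loop_time):
    move_offset = loop_index * ub_tensor_size
    moves.append((move_offset, ub_tensor_size // data_each_block))

# Tail segment that does not fill the Unified Buffer completely.
last_num = move_num % ub_tensor_size
if last_num > 0:
    move_offset = loop_time * ub_tensor_size
    moves.append((move_offset, math.ceil(last_num / data_each_block)))
```

With these numbers the kernel issues three full passes (offsets 0, 256, 512 with a 16-block burst) plus one tail pass of 232 elements rounded up to 15 blocks.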
Vector Computation
TIK provides a large number of APIs for scheduling vector computation resources. You need to set the API parameters properly and follow the usage principles described below. The parameters of single-input and multi-input operations are similar, so the vector single-input operation API is used as an example. Its function prototype is as follows:
instruction(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
In the preceding prototype, mask specifies whether each compute unit of the vector participates in the computation. A vector operation can compute up to eight blocks at a time; in that case, mask is assigned the maximum value for the corresponding data type. If the data to be computed occupies fewer than eight blocks, only some of the Vector Units are used, and mask is assigned based on the actual data volume. Note that TIK provides two modes for assigning mask: consecutive mode and bit-wise mode. The consecutive mode is easier to use, while the bit-wise mode is more flexible but more complex. Select a mode as required.
dst and src indicate the destination operand and source operand, respectively, and give the start addresses for data access in the vector operation. Also pay attention to repeat_times, dst_rep_stride, and src_rep_stride. In the current version, a vector operation reads and computes 256 consecutive bytes in each iteration. If the data to be computed exceeds 256 bytes, multiple iterations (repeats) are required to read and compute all of it. repeat_times indicates the number of iterations in a single API call. Since each API call has a fixed delay, running multiple iterations within a single call greatly reduces unnecessary call overhead and improves overall execution efficiency. The maximum value of repeat_times is 255 due to hardware limitations of the AI Core. As shown in Figure 5-7, dst_rep_stride and src_rep_stride indicate the block-to-block stride between adjacent iterations of the destination and source operands, respectively. For convenience, the two parameters are collectively referred to as *_rep_stride.
As shown in Figure 5-7, assume that a defined tensor can be used by both the source and destination operands through address overlapping, and *_rep_stride is set to 8. Eight consecutive blue blocks are read in the first iteration, and eight consecutive gray blocks are read in the second iteration. This rule applies until all input data is processed.
Note that *_rep_stride takes special values in the following scenarios.
- When repeat_times > 1 and *_rep_stride > 8 (such as 10), data read in the vector operation is not consecutive. For example, it is separated by the two red boxes in the following figure.
- When repeat_times > 1 and *_rep_stride = 0, the first eight blocks are repeatedly read and computed in the vector operation.
- When repeat_times > 1 and 0 < *_rep_stride < 8, some data of two adjacent iterations is repeatedly read and computed in the vector operation. This scenario is generally not involved.
In conclusion, you need to set parameters based on the data access mode of the operator to obtain the correct computation result.
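To make these parameters concrete, the repeat count and tail mask for a float16 input can be computed in plain Python. The element count is hypothetical; the 128-elements-per-repeat figure follows from 256 bytes per iteration at 2 bytes per element, and the 255 cap and consecutive-mode mask convention follow the text above:

```python
total = 1000                      # hypothetical number of float16 elements
elems_per_repeat = 256 // 2       # 256 bytes per repeat / 2 bytes per element

repeat_times = total // elems_per_repeat   # full 256-byte iterations
tail = total % elems_per_repeat            # elements left over for a tail call

# Hardware caps repeat_times at 255; larger inputs need several API calls.
assert repeat_times <= 255

# In consecutive mask mode, the tail call masks in only the leftover units.
tail_mask = tail if tail > 0 else elems_per_repeat
```

Here 1000 elements need 7 full repeats plus a tail call with mask 104, which would be issued as a second instruction on the remaining data.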
Program Control
Procedural programming structure elements include sequences, branches, and loops. TIK provides the if_scope and else_scope APIs to define the branch structure in the DSL program. The for_range API defines the loop structure in the DSL program, and other DSL APIs form a sequential structure in the current code block. TIK expresses code block semantics by using the with statement of Python.
Branch
TIK provides the if_scope and else_scope APIs to define the branch structure in the DSL program. The definition method is as follows:
tik.Tik.if_scope(condition)
The if_scope() and else_scope() APIs are expressed as with statements in Python scripts. The with statement defines the if/else code block of TIK. The else part is syntactically optional. If else_scope appears, it is logically paired with the if_scope at the same indentation level immediately above it.
a = tik_instance.Scalar("int32")
# operations on scalar `a' omitted
...
with tik_instance.if_scope(a > 1):
    # operations to be performed when `a' is larger than 1
    ...
with tik_instance.else_scope():
    # operations to be performed when `a' is less than or equal to 1
    ...
Native Python statements:
int32 a;
# operations on scalar `a' omitted
...
if a > 1:
    # operations to be performed when `a' is larger than 1
    ...
else:
    # operations to be performed when `a' is less than or equal to 1
    ...
The following compares TIK Python statements with native Python statements.
Native Python branch statement:
if a:
    Instruction 1
else:
    Instruction 2
The value of a must be an immediate, that is, a constant that can be determined during the build. During the build, only one branch is compiled based on the condition: if a == True, only Instruction 1 is built; otherwise, only Instruction 2 is built.
TIK Python branch statement:
with tik_instance.if_scope(a > 3):
    Instruction 1
with tik_instance.else_scope():
    Instruction 2
if_scope and else_scope are ordinary Python calls rather than native if/else branch logic. Both branches are compiled during the TIK build, but only one of them runs, depending on the value of a at run time.
Loop
TIK provides an API to define the loop structure in the DSL program. For details, see for_range. The definition method is as follows:
tik.Tik.for_range(begint, endt, name="i", thread_num=1, block_num=1, dtype="int32")
- begint: start value of the loop variable
- endt: end value of the loop variable
- name: name of the loop variable in the TIK DSL
- thread_num: whether to enable double buffering in the loop to control instruction parallelism. For details, see Double Buffering.
- block_num: number of AI Cores used in the loop, used to control core-level parallelism. For details, see AI Core Parallelism.
- dtype: data type of the loop variable
The for_range() API is expressed as a with statement in Python scripts. The with statement defines the loop body of the DSL for loop. The loop body of the DSL for loop forms a statement block. DSL statements in the loop body are executed in parallel as much as possible.
TIK Python statements:
vec_a = tik_instance.Tensor(...)
vec_b = tik_instance.Tensor(...)
vec_c = tik_instance.Tensor(...)
# Loading data to vec_a and vec_b omitted
...
with tik_instance.for_range(0, 10) as i:
    tik_instance.vec_add(..., vec_c, vec_a, vec_b, ...)
Native Python statements:
Tensor vec_a(...);
Tensor vec_b(...);
Tensor vec_c(...);
# Loading data to vec_a and vec_b omitted
...
for i in range(0, 10):
    vec_add(vec_c, vec_a, vec_b, ...)
...
The following compares TIK Python statements with native Python statements.
Native Python loop statement:
for i in range(4):
    a[i] = b[i] + c[i]
i is an immediate. The loop is expanded during compilation.
a[0] = b[0] + c[0]
a[1] = b[1] + c[1]
a[2] = b[2] + c[2]
a[3] = b[3] + c[3]
TIK Python loop statement:
with tik_instance.for_range(0, 10) as i:
    a[i] = b[i] + c[i]
i is a scalar. The loop is not expanded during compilation. The TIK Python loop statement is similar to the native Python statement during running.
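The equivalence between the unrolled form and the loop can be checked in plain Python; the lists below are stand-ins for the tensors in the example:

```python
# Stand-ins for the tensors b and c in the example.
b = [1, 2, 3, 4]
c = [10, 20, 30, 40]

# Loop form, as written in the source.
a_loop = [0] * 4
for i in range(4):
    a_loop[i] = b[i] + c[i]

# Fully unrolled form, as produced at build time for an immediate trip count.
a_unrolled = [0] * 4
a_unrolled[0] = b[0] + c[0]
a_unrolled[1] = b[1] + c[1]
a_unrolled[2] = b[2] + c[2]
a_unrolled[3] = b[3] + c[3]
```

Both forms produce the same result; the difference is only whether the expansion happens at build time (native Python with an immediate) or at run time (TIK with a scalar loop variable).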
Data Management
This section describes the common operations on tensor and scalar data.
- Obtaining part of tensor data
Obtaining Partial Tensor Data by Using the Tensor Array Index.
- Reshaping a tensor
For details, see reshape.
- Changing the tensor data type
For details, see new_stmt_scope. This API reads data as a specified type; it does not perform data type conversion.
- Assigning a value to a scalar
Set or change the value of a scalar by using set_as.
- Assigning a value to a tensor
- Method 1: Set or change the value of a tensor by using set_as.
- Method 2: Refer to Changing the Tensor Content by Using the Tensor Array Index.