Quick Start
Objectives
This topic describes how to compile a TIK operator, using the add operator as an example.
This TIK operator is used to read 128 float16 values from addresses A and B in the Global Memory, add them, and write the result to address C in the Global Memory.
Operator Analysis
- Specify the OpType as well as the names of the operator implementation file and operator implementation function.
The naming rules are as follows:
- The operator type must be named in upper camel case to distinguish different semantics.
- You can name the operator file and operator function in either of the following ways:
- Configure opFile.value and opInterface.value in the Operator Information Definition.
- If opFile.value and opInterface.value are not configured in the Operator Information Definition, the FE converts OpType to derive the operator file name and operator function name. The conversion rules are as follows:
- Convert the first uppercase letter to a lowercase letter.
Example: Abc -> abc
- Convert the uppercase letter after a lowercase letter to a lowercase letter with an underscore (_) prefix.
Example: AbcDef -> abc_def
- Uppercase letters following a digit or an uppercase letter are regarded as a semantic string. If there is a lowercase letter after this string, convert the last uppercase letter in this string into an underscore (_) and a lowercase letter, and convert the other uppercase letters into lowercase letters. If there is no lowercase letter after the string, directly convert the string into lowercase letters.
Examples: ABCDef -> abc_def; Abc2DEf -> abc2d_ef; Abc2DEF -> abc2def; ABC2dEF -> abc2d_ef
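The following minimal sketch, written for illustration only (it is not the FE's actual implementation; the function name and regular expressions are assumptions), reproduces these conversion rules for the examples above:
import re

# Illustrative sketch of the OpType -> file/function name conversion rules above.
def optype_to_name(op_type):
    # Within an uppercase run followed by a lowercase letter, split before the last uppercase letter.
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', op_type)
    # Insert an underscore between a lowercase letter and a following uppercase letter.
    s = re.sub(r'([a-z])([A-Z])', r'\1_\2', s)
    return s.lower()

print(optype_to_name("AbcDef"))   # abc_def
print(optype_to_name("ABCDef"))   # abc_def
print(optype_to_name("Abc2DEf"))  # abc2d_ef
print(optype_to_name("Abc2DEF"))  # abc2def
print(optype_to_name("ABC2dEF"))  # abc2d_ef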
According to the naming rules, set the OpType to Add, and name both the operator implementation file and the operator implementation function add.
- Specify the operator function and mathematical expression.
This operator adds two tensors element-wise.
Mathematical expression: C = A + B
- Specify the TIK compute APIs to be used.
- Define data. The Tensor API is required.
- Move input data from the Global Memory to the Unified Buffer. The data_move API is required.
- Perform the vec_add operation for the imported data. The vec_add API is required.
- Move the obtained result from the Unified Buffer to the Global Memory. The data_move API is required.
Operator Implementation
Figure 5-3 shows the TIK custom operator development flow.
- Import the Python module.
- Define the target machine and build a TIK DSL container.
- Define the input and output data in the external storage (such as the Global Memory) and internal storage (such as the Unified Buffer) of AI Core through the data definition API.
- Move data from the external storage of the AI Core to the internal storage using the data movement API.
- Perform data computation using the scalar, vector, matrix, and other computation APIs.
- Move data from the internal storage of the AI Core back to the external storage using the data movement API.
The internal storage (such as the Unified Buffer) of the AI Core is limited. When the data volume is large, the input data and output result cannot be held in it all at once. In this case, the input data needs to be tiled: each tile is moved in, computed, and moved out in turn (a minimal tiling sketch follows this list).
- Build the operator to generate the operator binary file (.o) and operator description file (.json).
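Before the complete example, here is the tiling sketch referenced in the note above. It is illustrative only and assumes a hypothetical workload of 1024 float16 elements processed as eight 128-element tiles; the loop uses the TIK for_range API, and the build step is omitted for brevity.
from te import tik

# Illustrative tiling sketch (not part of this topic's example operator):
# process 1024 float16 elements in 8 tiles of 128 elements, reusing one set of
# 128-element Unified Buffer tensors.
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (1024,), name="data_A", scope=tik.scope_gm)
data_B = tik_instance.Tensor("float16", (1024,), name="data_B", scope=tik.scope_gm)
data_C = tik_instance.Tensor("float16", (1024,), name="data_C", scope=tik.scope_gm)
data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)
with tik_instance.for_range(0, 8) as i:
    # Move one 128-element tile in, compute it, and move the result out.
    tik_instance.data_move(data_A_ub, data_A[i * 128], 0, 1, 8, 0, 0)
    tik_instance.data_move(data_B_ub, data_B[i * 128], 0, 1, 8, 0, 0)
    tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
    tik_instance.data_move(data_C[i * 128], data_C_ub, 0, 1, 8, 0, 0)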
The following is a complete code example based on the process shown in Figure 5-3:
from te import tik

# Operator implementation function; the final return indicates that the statements
# belong inside such a function (the name follows kernel_name here for illustration).
def simple_add():
    # Construct a TIK DSL container.
    tik_instance = tik.Tik()
    # Define the input and output tensors in the Global Memory.
    data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
    data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
    data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
    # Define the working tensors in the Unified Buffer.
    data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
    data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
    data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)
    # Move the input data from the Global Memory to the Unified Buffer.
    tik_instance.data_move(data_A_ub, data_A, 0, 1, 128 // 16, 0, 0)
    tik_instance.data_move(data_B_ub, data_B, 0, 1, 128 // 16, 0, 0)
    # Add the two tensors element-wise.
    tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
    # Move the result back to the Global Memory.
    tik_instance.data_move(data_C, data_C_ub, 0, 1, 128 // 16, 0, 0)
    # Build the operator.
    tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])
    return tik_instance
Description of the preceding code:
- Import the Python module.
from te import tik
te.tik: provides all TIK-related Python functions. For details, see python/site-packages/te.egg/te/tik in the ATC installation path.
- Construct a TIK DSL container. Use the constructor described in Class TIK Constructor to do so.
tik_instance = tik.Tik()
- Define data.
Define the input data data_A and data_B and the output data data_C in the Global Memory by using Tensor. Each holds 128 float16 elements.
Define data_A_ub, data_B_ub, and data_C_ub in the Unified Buffer by using Tensor. Each holds 128 float16 elements.
- [API Definition] Tensor(dtype, shape, scope, name)
- [Parameter Analysis]
- dtype: data type of a tensor object.
- shape: shape of a tensor object.
- scope: buffer scope where the tensor object resides. scope_gm indicates the Global Memory; scope_ubuf indicates the Unified Buffer.
- name: tensor name, which must be unique.
- [Example]
# Define the input data data_A and data_B and the output data data_C in the Global Memory. Each holds 128 float16 elements.
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
# Define data_A_ub, data_B_ub, and data_C_ub in the Unified Buffer. Each holds 128 float16 elements.
data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)
- Move the data in the Global Memory to the Unified Buffer. Data movement is implemented by using the data_move API: data in data_A is moved to data_A_ub, and data in data_B is moved to data_B_ub.
- [API Definition] data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
- [Parameter Analysis]
- src/dst: source address/destination address
- sid: SMMU ID, which is fixed to 0.
- burst/nburst: burst indicates the length of each burst in units of 32 bytes, and nburst indicates the number of bursts (data movement times). The data to be moved consists of 128 float16 elements, that is, 128 x 2 bytes, which is far less than the size of the Unified Buffer (256 KB). Therefore, the input data can be moved to the Unified Buffer at a time (nburst = 1). Since one burst unit is 32 bytes, the burst length is 128 x 2 / 32 = 8, that is, burst = 8 (written as 128 // 16 in the code, because 16 float16 elements occupy 32 bytes).
- src_stride/dst_stride: strides of the source and destination addresses respectively. These two parameters need to be set when the data is moved at the specified interval. In the example, the data is moved consecutively. Therefore, both parameters are set to 0.
- [Example]
tik_instance.data_move(data_A_ub, data_A, 0, 1, 128 // 16, 0, 0)
tik_instance.data_move(data_B_ub, data_B, 0, 1, 128 // 16, 0, 0)
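For reference, the following hypothetical call (not part of this example) illustrates strided movement; it assumes tensors src_gm and dst_ub that are large enough, and that src_stride and dst_stride count 32-byte units between consecutive bursts:
# Hypothetical strided movement: move 2 bursts of 8 x 32 bytes each from src_gm,
# skipping 8 x 32 bytes in the source after each burst and packing the bursts
# contiguously in dst_ub (dst_stride = 0).
tik_instance.data_move(dst_ub, src_gm, 0, 2, 8, 8, 0)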
- Perform the vec_add operation on the data loaded to data_A_ub and data_B_ub and write the computation result back to data_C_ub.
Before implementing computation, learn about the basic operation units involved in TIK instructions.
For TIK vector instructions, 256 bytes of data can be processed per clock cycle. The masking function is provided to skip certain elements in the computation, and the iteration function is provided for repeated data computation.
TIK instructions are processed in the space and time dimensions, supporting up to 256-byte data (that is, 128 float16/uint16/int16 elements, 64 float32/uint32/int32 elements, or 256 int8/uint8 elements) in the space dimension, and supporting the repeat operation in the time dimension. The data to be computed every iteration is determined by the mask parameter. For float16 data, the vector engine computes 128 elements at a time. For example, if mask is 128, the first 128 elements in the float16 data are computed.
The add computation is implemented by using the vec_add API described below.
- [API Definition] vec_add(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
- [Parameter Analysis]
- src0, src1, dst: source operand 0, source operand 1, and destination operand, which are data_A_ub, data_B_ub, and data_C_ub respectively.
- repeat_times: repeat times. Based on the preceding TIK instruction, the computation of 128 float16 elements can be completed in one repeat. Therefore, the value of repeat_times is 1.
- dst_rep_stride, src0_rep_stride, src1_rep_stride: strides of the destination operand, source operand 0, and source operand 1 between adjacent iterations, in units of 32-byte blocks. In this example they are set to 8, meaning each iteration covers 8 x 32 bytes (256 bytes) of data.
- mask: data operation validity indicator. The value 128 indicates that all elements are computed.
- [Example]
tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
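To illustrate the repeat (time) dimension, the following hypothetical call assumes the three Unified Buffer tensors were instead defined with shape (256,); with repeat_times set to 2 and all repeat strides set to 8 blocks (256 bytes), the first iteration processes elements 0 to 127 and the second processes elements 128 to 255:
# Hypothetical variant: add 256 float16 elements in two iterations of 128 elements each
# (assumes data_A_ub, data_B_ub, and data_C_ub were defined with shape (256,)).
tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 2, 8, 8, 8)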
- Move the computation result in data_C_ub to data_C. The data movement is implemented by the data_move API. The analysis is similar to that in step 4.
tik_instance.data_move(data_C, data_C_ub, 0, 1, 128 // 16, 0, 0)
- Compile the statements in the TIK DSL container into the code that can be executed by the Ascend AI Processor.
This is done by using BuildCCE, which builds the TIK DSL container into an executable binary file for the Ascend AI Processor.
- [API Definition] BuildCCE(kernel_name, inputs, outputs, output_files_path=None, enable_l2=False)
- [Parameter Analysis]
- kernel_name: indicates the name of the AI Core kernel function in the binary code generated during the build.
- inputs: input tensors of the operator. They must reside in the Global Memory.
- outputs: output tensors of the operator. They must reside in the Global Memory.
- output_files_path: specifies the path to store files generated in the build. The default value is ./kernel_meta.
- enable_l2: This parameter does not take effect currently.
- [Example]
tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])
- Return a TIK instance.
return tik_instance
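For completeness, a usage sketch follows. It assumes the complete example is wrapped in an operator implementation function named simple_add, as shown earlier; the generated file names are indicative only.
# Build the operator by calling the implementation function. BuildCCE writes the kernel
# binary (.o) and the operator description file (.json) to ./kernel_meta by default;
# the file names follow kernel_name, e.g. simple_add.o and simple_add.json (assumption).
tik_instance = simple_add()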