Quick Start
Objectives
This section describes how to implement a TIK operator, using the Add operator as an example.
This TIK operator reads 128 float16 elements from each of the addresses A and B in the Global Memory, adds them element-wise, and writes the result to address C in the Global Memory.
Operator Analysis
- Specify the names of the operator implementation file and operator implementation function, and the operator type OpType.
The naming rules are as follows:
- The operator type must be named in upper camel case to distinguish different semantics.
- You can name the operator file and operator function using either of the following rules:
- To create user-defined names, configure the opFile.value and opInterface.value in Operator Information Definition.
- If opFile.value and opInterface.value in the Operator Information Definition are not configured, the FE derives the operator file name and operator function name from OpType using the following conversion rules:
- Convert the first uppercase letter to a lowercase letter.
Example: Abc -> abc
- Convert the uppercase letter after a lowercase letter to a lowercase letter with an underscore (_) prefix.
Example: AbcDef -> abc_def
- Uppercase letters following a digit or an uppercase letter are regarded as a semantic string. If there is a lowercase letter after this string, convert the last uppercase letter in this string into an underscore (_) and a lowercase letter, and convert the other uppercase letters into lowercase letters. If there is no lowercase letter after the string, directly convert the string into lowercase letters.
Examples: ABCDef -> abc_def; Abc2DEf -> abc2d_ef; Abc2DEF -> abc2def; ABC2dEF -> abc2d_ef
According to the naming rules, set the OpType of this operator to Add, and name the operator implementation file and operator implementation function add (a Python sketch of these conversion rules appears at the end of this list).
- Specify the operator function and mathematical expression.
This operator adds two tensors element-wise.
Mathematical expression: C = A + B
- Specify the TIK compute APIs to be used.
- Define data. The Tensor API is required.
- Move input data from the Global Memory to the Unified Buffer. The data_move API is required.
- Perform the vec_add operation for the imported data. The vec_add API is required.
- Move the obtained result from the Unified Buffer to the Global Memory. The data_move API is required.
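The naming rules above can be cross-checked with a short Python sketch. The helper optype_to_name below is purely illustrative and is not part of the TIK, ATC, or FE toolchain; it simply applies the documented conversion rules and verifies the documented examples.
def optype_to_name(op_type):
    # Illustrative helper (not a toolchain API): apply the documented
    # OpType -> operator file/function name conversion rules.
    out = []
    i, n = 0, len(op_type)
    while i < n:
        if not op_type[i].isupper():
            out.append(op_type[i])
            i += 1
            continue
        # Collect a run of consecutive uppercase letters.
        j = i
        while j < n and op_type[j].isupper():
            j += 1
        run = op_type[i:j]
        # An uppercase letter directly after a lowercase letter gets an underscore prefix.
        if i > 0 and op_type[i - 1].islower():
            out.append("_")
        if j < n and op_type[j].islower() and len(run) > 1:
            # A lowercase letter follows the run: split off the last uppercase letter.
            out.append(run[:-1].lower() + "_" + run[-1].lower())
        else:
            out.append(run.lower())
        i = j
    return "".join(out)

# The documented examples all hold under this sketch, including Add -> add.
for op_type, expected in [("Abc", "abc"), ("AbcDef", "abc_def"),
                          ("ABCDef", "abc_def"), ("Abc2DEf", "abc2d_ef"),
                          ("Abc2DEF", "abc2def"), ("ABC2dEF", "abc2d_ef"),
                          ("Add", "add")]:
    assert optype_to_name(op_type) == expected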
Operator Implementation
Figure 6-4 shows the TIK custom operator development process.
- Import the Python module.
- Define the target machine and build a container for the TIK DSL program.
- Define the input and output data in the external storage (such as the Global Memory) and internal storage (such as the Unified Buffer) of AI Core using the data definition API.
- Move data from the external storage of AI Core to the internal storage using the data movement API.
- Perform data computation using the compute APIs for scalar, vector, and matrix computation.
- Move data from the internal storage of AI Core to the external storage using the data movement API.
The internal storage (such as Unified Buffer) of AI Core is limited. When the data volume is large, the input data and output result cannot be stored as a whole. In this case, the input data needs to be fragmented, moved in, calculated, and then moved out.
- Build the operator to generate an .o binary file and a .json description file of the operator.
The following is a complete code example based on the process shown in Figure 6-4:
from te import tik

def simple_add():
    # The TIK DSL is wrapped in a function so that the TIK instance can be returned to the caller.
    tik_instance = tik.Tik()
    # Input and output tensors in the Global Memory.
    data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
    data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
    data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
    # Working tensors in the Unified Buffer.
    data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
    data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
    data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)
    # Move the input data from the Global Memory to the Unified Buffer.
    tik_instance.data_move(data_A_ub, data_A, 0, 1, 128 // 16, 0, 0)
    tik_instance.data_move(data_B_ub, data_B, 0, 1, 128 // 16, 0, 0)
    # Add the two tensors element-wise.
    tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
    # Move the result back to the Global Memory.
    tik_instance.data_move(data_C, data_C_ub, 0, 1, 128 // 16, 0, 0)
    # Build the operator binary and description file.
    tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])
    return tik_instance
Description of the preceding code:
- Import the Python module.
from te import tik
te.tik: provides all TIK-related Python functions. For details, see python/site-packages/te/tik in the ATC installation directory.
- Build a container for the TIK DSL program. Use the constructor in Class TIK Constructor to construct a TIK DSL container.
tik_instance = tik.Tik()
- Define data.
Define the input data data_A and data_B and the output data data_C in the Global Memory by using Tensor. Each consists of 128 data elements of type float16.
Define data_A_ub, data_B_ub, and data_C_ub in the Unified Buffer by using Tensor. Each consists of 128 data elements of type float16.
- [API Definition] Tensor(dtype, shape, scope, name)
- [Parameter Analysis]
- dtype: data type of a tensor object.
- shape: shape of a tensor object.
- scope: buffer space where the tensor object resides. scope_gm indicates the data resides in the Global Memory. scope_ubuf indicates the data resides in the Unified Buffer.
- name: tensor name, which must be unique.
- Example
# Define the input data data_A and data_B and the output data data_C in the Global Memory. Each consists of 128 data elements of type float16.
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
# Define data_A_ub, data_B_ub, and data_C_ub in the Unified Buffer. Each consists of 128 data elements of type float16.
data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)
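As a quick sanity check (plain Python arithmetic, not TIK API code), the three tensors placed in the Unified Buffer occupy only a tiny fraction of its 256 KB capacity, so the input does not need to be fragmented in this example:
float16_bytes = 2
tensor_bytes = 128 * float16_bytes   # 256 bytes per Unified Buffer tensor
total_ub_bytes = 3 * tensor_bytes    # data_A_ub + data_B_ub + data_C_ub
assert total_ub_bytes <= 256 * 1024  # well within the 256 KB Unified Buffer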
- Move the data in the Global Memory to the Unified Buffer. Data movement is implemented by the data_move function: data in data_A is moved to data_A_ub, and data in data_B is moved to data_B_ub.
- [API Definition] data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
- [Parameter Analysis]
- src/dst: source address/destination address
- sid: SMMU ID, which is fixed at 0
- burst/nburst: burst indicates the size of data moved each time (in units of 32 bytes), and nburst indicates the number of data movements. The data to be moved consists of 128 float16 elements, which occupies 128 x 2 = 256 bytes, far less than the size of the Unified Buffer (256 KB). Therefore, all the input data is moved to the Unified Buffer at a time, and nburst is 1. Because the unit of burst is 32 bytes, the burst length for each movement is 128 x 2 / 32 = 8 (written as 128 // 16 in the code).
- src_stride/dst_stride: strides of the source and destination addresses respectively. These two parameters need to be set when the data is moved at the specified interval. In the example, the data is moved consecutively. Therefore, both parameters are set to 0.
- Example
tik_instance.data_move(data_A_ub, data_A, 0, 1, 128 // 16, 0, 0)
tik_instance.data_move(data_B_ub, data_B, 0, 1, 128 // 16, 0, 0)
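The nburst and burst values used above can be reproduced with plain Python arithmetic; the snippet below is illustrative only and is not TIK API code:
elem_count = 128                        # float16 elements to move
elem_bytes = 2                          # bytes per float16 element
nburst = 1                              # the whole tensor is moved in a single burst
burst = elem_count * elem_bytes // 32   # burst is measured in 32-byte units -> 8
assert burst == 128 // 16               # matches the 128 // 16 written in the example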
- Perform the vec_add operation on the data loaded to data_A_ub and data_B_ub and write the computation result back to data_C_ub.
Before implementing computation, learn about the basic operation units involved in TIK instructions.
For TIK vector instructions, 256 bytes of data can be processed per clock cycle. The masking function is provided to skip certain elements in the computation, and the iteration function is provided for repeated data computation.
TIK instructions are processed in the space and time dimensions, supporting up to 256-byte data (that is, 128 float16/uint16/int16 elements, 64 float32/uint32/int32 elements, or 256 int8/uint8 elements) in the space dimension, and supporting the repeat operation in the time dimension. The data to be computed every iteration is determined by the mask parameter. For float16 data, the vector engine computes 128 elements at a time. For example, if mask is 128, the first 128 elements in the float16 data are computed.
The computation is implemented based on the Add operator by using vec_add.
- [API Definition] vec_add(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
- [Parameter Analysis]
- src0, src1, dst: source operand 0, source operand 1, and destination operand, which are data_A_ub, data_B_ub, and data_C_ub respectively.
- repeat_times: repeat times. Based on the preceding TIK instruction, the computation of 128 float16 elements can be completed in one repeat. Therefore, the value of repeat_times is 1.
- dst_rep_stride, src0_rep_stride, src1_rep_stride: block-to-block stride between the destination operand/source operand 0/source operand 1 in adjacent iterations. The unit is 32 bytes. In the following example, they are set to 8, indicating that 8 x 32 bytes of data is processed in an iteration.
- mask: data operation validity indicator. The value 128 indicates that all elements are computed.
- Example
tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
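The choice of mask and repeat_times in this call can be derived from the 256-byte-per-iteration limit described above. The following is a plain-Python sanity check, not TIK API code:
bytes_per_iteration = 256                                    # a vector instruction processes up to 256 bytes per iteration
float16_bytes = 2
elems_per_iteration = bytes_per_iteration // float16_bytes   # 128 float16 elements per iteration
mask = elems_per_iteration                                   # all 128 elements are valid
repeat_times = 128 // elems_per_iteration                    # the 128 elements fit in a single iteration
assert (mask, repeat_times) == (128, 1)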
- Move the computation result in data_C_ub to data_C. Data movement is implemented by the data_move function. The parameter analysis is similar to that of the data_move calls that move data into the Unified Buffer.
tik_instance.data_move(data_C, data_C_ub, 0, 1, 128 //16, 0, 0)
- Build the statements in the TIK DSL container into the code that can run on the Ascend AI Processor.
Build the TIK DSL container into an executable binary file of the Ascend AI Processor by using the function described in BuildCCE.
- [API Definition] BuildCCE(kernel_name, inputs, outputs, output_files_path=None, enable_l2=False)
- [Parameter Analysis]
- kernel_name: indicates the kernel name of the AI Core operator in the generated binary code.
- inputs: input tensors of the operator, which reside in the Global Memory.
- outputs: output tensors of the operator, which reside in the Global Memory.
- output_files_path: specifies the path to store files generated in the build. The default value is ./kernel_meta.
- enable_l2: This parameter does not take effect currently.
- Example
tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])
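If the build artifacts should go somewhere other than the default ./kernel_meta directory, output_files_path can be passed explicitly. The following is only a sketch based on the parameter list above; the directory name ./my_kernel_out is a made-up example:
tik_instance.BuildCCE(kernel_name="simple_add",
                      inputs=[data_A, data_B],
                      outputs=[data_C],
                      output_files_path="./my_kernel_out")  # hypothetical output directory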
- Return the TIK instance to the caller.
return tik_instance
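Putting it together, the operator function can be called like any ordinary Python function. The following usage sketch assumes the complete code above is saved in a file that defines simple_add:
if __name__ == "__main__":
    tik_instance = simple_add()
    # The .o binary and the .json description file of the operator are
    # generated under ./kernel_meta by default (see output_files_path above).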