TIK APIs
- Introduction
- General Restrictions
- Chip Configuration Management
- TIK Container Management
- Data Definition
- Scalar Management
- Tensor Management
- Program Control
- Scalar Computation
- Vector Computation
- Matrix Computation
- Data Conversion
- Data Padding
- Data Movement
Introduction
This document describes a collection of Python classes and APIs used for developing operators in the domain-specific language (DSL) of the Tensor Iterator Kernel (TIK). The TIK APIs exist as common Python language elements, but can directly or indirectly influence the DSL program. The Python code of non-TIK APIs is closely related to the constantization of neural network (NN) operators. Generally, common Python variables are used to compute related configuration properties.
Find the API definitions in atc/python/site-packages/te/te/tik in the ATC installation directory.
General Restrictions
- For user-defined tensors, the starting address of the allocated buffer scope will be aligned according to the following rules:
- Unified Buffer: 32-byte aligned
- L1 Buffer: 512-byte aligned
- L1OUT Buffer: Data of type float16 must be 512-byte aligned. Data of types float32, int32, and uint32 must be 1024-byte aligned.
- Global Memory: no alignment requirement
When the TIK data compute and data move APIs are used, the address offset of the destination and source operands must be aligned according to the preceding rules.
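The alignment rules above can be sanity-checked with a small helper. This is a plain-Python sketch; align_up is a hypothetical name, not a TIK API:

```python
def align_up(addr, alignment):
    # Round addr up to the next multiple of alignment.
    return (addr + alignment - 1) // alignment * alignment

# A buffer requested at byte offset 33 in the Unified Buffer (32-byte aligned)
# actually starts at offset 64; offsets already on a boundary are unchanged.
start = align_up(33, 32)
```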
Chip Configuration Management
Class Dprofile Constructor
Description
Manages and configures the information about the Ascend AI processor, such as the product architecture, product form, and buffer sizes at all levels.
Since the buffer size and instructions vary according to the Ascend AI processor version, the class Dprofile constructor is used to define the target machine of the Ascend AI processor and specify the hardware environment for programming.
Prototype
__init__(ai_core_arch=None, ai_core_version=None, ddk_version=None)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| ai_core_arch | Input | A string for the product architecture. Reserved, and not recommended for newly developed operators. |
| ai_core_version | Input | A string for the AI Core version. Reserved, and not recommended for newly developed operators. |
| ddk_version | Input | Reserved, and not recommended for newly developed operators. |
Restrictions
Select instructions carefully since they are related to the product architecture and product form.
If the product architecture and product form are not specified in the Dprofile, the default product form is used.
Returns
Instance of class Dprofile
Example
from te import tik
tik_dprofile = tik.Dprofile("v100", "cloud")
get_unified_buffer_size
Description
Obtains the UB size (in bytes) of the corresponding product form.
Prototype
get_unified_buffer_size()
Parameters
None
Restrictions
None
Returns
UB size (in bytes) of the corresponding product form
Example
from te import tik
tik_dprofile = tik.Dprofile("v100", "cloud")
unified_buffer_size = tik_dprofile.get_unified_buffer_size()
TIK Container Management
Class TIK Constructor
Description
Creates a TIK DSL container by passing a tik.Dprofile instance.
Prototype
__init__(profiling, disable_debug=False, err_msg_level=0)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| profiling | Input | Configuration information of the Ascend AI processor. A Dprofile instance is supported. |
| disable_debug | Input | An optional bool for disabling the debug function. Defaults to False (debug enabled). |
| err_msg_level | Input | An optional int specifying the level of error messages to be printed. Defaults to 0. Value range: 0 (user level, the default) or 1 (developer level). |
Restrictions
- If the build duration is strictly limited, the debug function can be used during operator development. After the code is submitted, you can manually set the disable_debug parameter to True when constructing a TIK instance to disable the debug function. This reduces the build time.
- If disable_debug is set to True and the debug API is called after BuildCCE is complete, the program exits abnormally and the debugging fails.
Returns
Instance of class TIK
Examples
The following is an example of enabling the debug function:
from te import tik
tik_instance = tik.Tik()
# Alternatively:
tik_instance = tik.Tik(disable_debug=False)
The following is an example of disabling the debug function:
from te import tik
tik_instance = tik.Tik(disable_debug=True)
The following is an example of setting err_msg_level to the developer level:
from te import tik
tik_instance = tik.Tik(err_msg_level=1)
BuildCCE
Description
Generates DSL defined on the target machine and compiles the DSL into binary code that is executable on the Ascend AI Processor and corresponding configuration files.
Prototype
BuildCCE(kernel_name, inputs, outputs, output_files_path=None, enable_l2=False, config=None, flowtable=None)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| kernel_name | Input | |
| inputs | Input | |
| outputs | Input | |
| output_files_path | Input | A string specifying the path of the files generated after the build. Defaults to None, indicating the path ./kernel_meta in the current directory. |
| enable_l2 | Input | A bool specifying whether to enable the L2 buffer. Defaults to False. This argument does not take effect. |
| config | Input | A dictionary of key-value pairs used to configure the operator build properties, in the format config = {"key": value}. The following key is supported: double_buffer_non_reuse (if set to True, the ping and pong variables in double_buffer are not reused). Example: config = {"double_buffer_non_reuse": True} |
| flowtable | Input | A list or tuple of InputScalars: a flow table of tiling parameters, computed by the operator selector in the dynamic-shape scenario. The total length of flowtable and inputs must be less than or equal to 64. |
Restrictions
- inputs and outputs must not have the same tensor. Otherwise, the TIK reports an error.
- All non-workspace tensors with the scope of scope_gm must be in inputs or outputs. Otherwise, a build error is reported.
- When there is no output, set outputs=[]. In this case, BuildCCE generates an output array whose length is 1 and whose data is 0, and the return value is [[0]].
- In inputs, tensors must be placed before InputScalars.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])
Data Definition
Tensor
Description
Defines a Tensor variable.
Prototype
Tensor(dtype, shape, scope, name, enable_buffer_reuse=False, no_reuse_list=None, reuse_list=None, is_workspace=False, is_atomic_add=False)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dtype | Input | Data type of the Tensor object. Must be one of the following data types: uint8, int8, uint16, int16, float16, uint32, int32, float32, uint64, int64 |
| shape | Input | A list or tuple of ints specifying the shape of the Tensor object. NOTICE: In the current version, only a list or tuple of immediates is supported. |
| scope | Input | Buffer scope of the Tensor object, that is, the buffer space where the Tensor object is located, for example, tik.scope_gm (Global Memory), tik.scope_ubuf (Unified Buffer), or tik.scope_cbuf (L1 Buffer). |
| name | Input | A string specifying the name of the Tensor object. Only digits (0–9), uppercase letters (A–Z), lowercase letters (a–z), and underscores (_) are allowed, and the name cannot start with a digit. If set to None, the name auto_tensor_$(COUNT) is used automatically, with COUNT starting at zero. NOTE: When scope is set to scope_gm, the name must not be __fake_tensor. |
| enable_buffer_reuse | Input | Reserved and not recommended |
| no_reuse_list | Input | Reserved and not recommended |
| reuse_list | Input | Reserved and not recommended |
| is_workspace | Input | A bool. Defaults to False. If set to True, the tensor stores intermediate data only; scope must be scope_gm, and the tensor must not be included in the input and output tensors. |
| is_atomic_add | Input | A bool. Defaults to False. This argument does not take effect. |
Restrictions
- When the total size of the tensors exceeds the total size of the corresponding buffer type, a build error is reported.
In the following example, the size of data_a is 1025 x 1024 bytes, which exceeds the 1 MB total size of the L1 buffer.
import numpy as np
import sys
from te import tik
import tvm

def buffer_allocate_test6():
    tik_instance = tik.Tik()
    data_a = tik_instance.Tensor("int8", (1025 * 1024,), name="data_a", scope=tik.scope_cbuf)
    tik_instance.BuildCCE(kernel_name="buffer_allocate_test", inputs=[], outputs=[])
    return tik_instance

if __name__ == "__main__":
    tik_instance = buffer_allocate_test6()
Build error:
RuntimeError: Appiled buffer size(1049600B) more than avaiable buffer size(1048576B).
- If a tensor is accessed beyond its defined scope, a build error is reported. In the following example, data_a_l1 is defined only inside new_stmt_scope. When the data_move API accesses data_a_l1 outside that scope, an error is reported.
import numpy as np
import sys
from te import tik
import tvm

def tensor_outrange_examine_test6():
    tik_instance = tik.Tik()
    data_a = tik_instance.Tensor("float16", (128,), name="data_a", scope=tik.scope_gm)
    data_b = tik_instance.Tensor("float16", (128,), name="data_b", scope=tik.scope_gm)
    with tik_instance.new_stmt_scope():
        data_a_ub = tik_instance.Tensor("float16", (128,), name="data_a_ub", scope=tik.scope_ubuf)
        data_a_l1 = tik_instance.Tensor("float16", (128,), name="data_a_l1", scope=tik.scope_cbuf)
        tik_instance.data_move(data_a_l1, data_a, 0, 1, 128 // 16, 0, 0)
    tik_instance.data_move(data_a_ub, data_a_l1, 0, 1, 128 // 16, 0, 0)
    tik_instance.data_move(data_b, data_a_ub, 0, 1, 128 // 16, 0, 0)
    tik_instance.BuildCCE(kernel_name="tensor_outrange_examine", inputs=[data_a], outputs=[data_b])
    return tik_instance
Build error:
RuntimeError: This tensor is not defined in this scope.
- After a tensor goes out of its defined scope, its buffer can be reused. In the following example, because data_a_ub1 and data_a_ub2 are out of scope, their buffer of 126,976 bytes (62 x 2 x 1024 bytes) can be reused by data_b_ub.
import numpy as np
import sys
from te import tik
import tvm

def double_buffer_test6():
    tik_instance = tik.Tik()
    data_a = tik_instance.Tensor("int8", (124 * 1024,), name="data_a", scope=tik.scope_ubuf)
    with tik_instance.for_range(0, 2):
        data_a_ub1 = tik_instance.Tensor("int8", (62 * 1024,), name="data_a_ub1", scope=tik.scope_ubuf)
        data_a_ub2 = tik_instance.Tensor("int8", (62 * 1024,), name="data_a_ub2", scope=tik.scope_ubuf)
    data_b_ub = tik_instance.Tensor("int8", (125 * 1024,), name="data_b_ub", scope=tik.scope_ubuf)
    tik_instance.BuildCCE(kernel_name="tbe_double_buffer_no_loop", inputs=[], outputs=[])
    return tik_instance

if __name__ == "__main__":
    tik_instance = double_buffer_test6()
If data_b_ub exceeds the Unified Buffer size, the following error is reported during the build:
RuntimeError: Tensor data_b_ub appiles buffer size(128000B) more than avaiable buffer size(126976B).
- shape does not support scalar arguments. It supports only immediates or Python variables.
- For user-defined tensors, the starting address of the allocated buffer scope will be aligned according to the following rules:
- UB: 32-byte aligned
- L1: 512-byte aligned
- L1OUT: 512-byte aligned
- GM: alignment not required
If the total size of a buffer type is exceeded due to address alignment, a build error is reported.
Returns
Tensor instance
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
Scalar
Description
Defines a Scalar variable.
Prototype
Scalar(dtype="int64", name="reg_buf", init_value=None)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dtype | Input | Data type of the Scalar object. Must be one of the following data types: int8, uint8, int16, uint16, float16, int32, uint32, float32, int64, uint64. Defaults to int64. |
| name | Input | A string specifying the name of the Scalar object. Only digits (0–9), uppercase letters (A–Z), lowercase letters (a–z), and underscores (_) are allowed. Defaults to reg_buf$(COUNT), with COUNT starting at 0. |
| init_value | Input | Initial value: an immediate of type int or float, a Scalar variable, a Tensor value, or an Expr consisting of Scalar variables, immediates, and Tensor values. NOTICE: If it is an Expr, the immediate cannot be of type float. |
Restrictions
When the initial value is an Expr, the immediate can only be of type int instead of float, for example:
from te import tik
tik_instance = tik.Tik()
index_reg = tik_instance.Scalar(dtype="float32")
# Immediate: float
index_reg.set_as(10.2)
# Assign an initial value to the scalar using init_value.
index_reg1 = tik_instance.Scalar(dtype="float32", init_value=10.2)
index_reg2 = tik_instance.Scalar(dtype="float32")
# Expr whose immediate is of type float. An error occurs with the CCE compiler (CCEC),
# because the hardware does not support this data type.
# index_reg2.set_as(index_reg + 2.2)
Returns
An instance of class Scalar
Example
from te import tik
tik_instance = tik.Tik()
# Immediate: integer
index_reg = tik_instance.Scalar(dtype="int32")
index_reg.set_as(10)
# Immediate: float
index_reg2 = tik_instance.Scalar(dtype="float16")
index_reg2.set_as(10.2)
# Scalar variable
index_reg3 = tik_instance.Scalar(dtype="float16")
index_reg3.set_as(index_reg2)
# Tensor value
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
index_reg3.set_as(data_A[0])
# Expr
index_reg4 = tik_instance.Scalar(dtype="int32")
index_reg4.set_as(index_reg + 20)
InputScalar
Description
Defines an InputScalar variable. An InputScalar serves as an inputs argument passed to the BuildCCE call. It supports a range of basic data types including int, uint, and float.
Prototype
InputScalar(dtype="int64", name="input_scalar")
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dtype | Input | Data type of the InputScalar object. Must be one of the following data types: int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16, float32. Defaults to int64. |
| name | Input | A string specifying the name of the InputScalar object. Only digits (0–9), uppercase letters (A–Z), lowercase letters (a–z), and underscores (_) are allowed. Defaults to input_scalar. Ensure that each InputScalar variable has a unique name. |
Restrictions
- Currently, InputScalar can be used in scenarios where a variable argument is an Expr.
For example, if the repeat_times parameter of vec_abs is an Expr, the code can be written as follows:
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
inputscalar = tik_instance.InputScalar(dtype="int16", name="inputscalar")
tik_instance.vec_abs(128, dst_ub, src_ub, inputscalar, 8, 8)
tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B, inputscalar], outputs=[])
- Ensure that each InputScalar object has a unique name.
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
abc = tik_instance.InputScalar(dtype="int16", name="abc")
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_abs(128, dst_ub, src_ub, abc, 8, 8)
tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B, abc], outputs=[])
Scalar Management
TIK scalar management class. A scalar is an independent number.
set_as
Description
Sets the scalar value.
Prototype
set_as(value, src_offset=None)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| value | Input | Value to be assigned: an immediate of type int or float, a Scalar variable, a Tensor value, or an Expr. |
| src_offset | Input | Reserved and not recommended |
Restrictions
- Scalar value assignment between different data types is not supported, for example, between float16 and float32.
- Scalar value assignment between int/uint and float16/float32 is not supported.
- Value assignment from an Expr of any type to a float16/float32 scalar is not supported.
- Value assignment from an Expr to an int/uint scalar is supported only when the Expr's scalar is of type int or uint and the Expr's immediate is of type int or float.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
# Immediate: int
index_reg = tik_instance.Scalar(dtype="int32")
index_reg.set_as(10)
# Immediate: float
index_reg2 = tik_instance.Scalar(dtype="float16")
index_reg2.set_as(10.2)
# A Scalar variable
index_reg3 = tik_instance.Scalar(dtype="float16")
index_reg3.set_as(index_reg2)
# A Tensor value
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_ubuf)
index_reg3.set_as(data_A[0])
# An Expr
index_reg4 = tik_instance.Scalar(dtype="int32")
index_reg4.set_as(index_reg + 20)
Tensor Management
TIK tensor management class
reshape
Description
Reshapes a tensor.
Prototype
reshape(new_shape)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| new_shape | Input | New shape of the Tensor object. The supported types are list(int) and tuple(int). NOTICE: In the current version, only lists or tuples consisting of integral immediates are supported. |
Restrictions
- The total size of the new shape must be the same as that of the old shape.
- The new and old tensors point to the same buffer. Changing a value through the new tensor also changes it in the old tensor.
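The aliasing described in the second restriction behaves like a NumPy reshape view. The following sketch uses NumPy only as an analogy, not TIK code:

```python
import numpy as np

a = np.arange(128, dtype=np.float16)   # "old" tensor
b = a.reshape(64, 2)                   # "new" tensor: same total size, same buffer
b[0, 1] = 99.0                         # write through the new shape...
assert a[1] == 99.0                    # ...and the old tensor sees the change
```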
Returns
The new tensor
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_A.reshape((64, 2))
reinterpret_cast_to
Description
Casts the data type of a tensor. Reads the same memory data based on the specified data type and re-interprets the bytes in the memory. The data precision conversion is not supported. For details about how to convert the data precision, see vec_conv.
For example, to cast the data type of a tensor with 128 data elements of type float16 to type float32, use reinterpret_cast_to to read the data in float32 mode. In this case, 64 data elements of type float32 are obtained. Therefore, reinterpret_cast_to does not change the data precision.
Prototype
reinterpret_cast_to(dtype)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dtype | Input | Data type of the Tensor object. Must be one of the following data types: uint8, int8, uint16, int16, float16, uint32, int32, float32, uint64, int64 |
Restrictions
- To make it easier to describe the restrictions in calling reinterpret_cast_to(), we define a factor yielded by dividing the number of bits of the original data type by that of the specified data type. Assume the original tensor is declared to have 128 float16 data entries in the buffer. To read the entries in float32 mode, the factor should be 0.5 (16/32). The call to reinterpret_cast_to() must meet the following restrictions:
- The factor must be greater than 0.
- If the factor is greater than 1, it must be an integer.
- If the factor is less than 1, pay attention to the tensor shape. The last dimension size (shape[-1]) multiplied by the factor must be an integer. Assume the original tensor is with shape (128, 1). To read its 128 float16 entries in float32 mode, shape[-1] * factor = 1 * 0.5 = 0.5, which is not an integer and therefore the preceding restriction is not met. In this case, the error message "Error: Last dimension in shape multiplies factor should be an integer" will be reported. Setting the tensor shape to 128 can avoid this error.
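The factor rules above can be sketched as a small validity check. BITS and cast_factor_ok are hypothetical helper names, not TIK APIs:

```python
BITS = {"int8": 8, "uint8": 8, "int16": 16, "uint16": 16, "float16": 16,
        "int32": 32, "uint32": 32, "float32": 32, "int64": 64, "uint64": 64}

def cast_factor_ok(src_dtype, dst_dtype, shape):
    # factor = bits of the original dtype / bits of the target dtype
    factor = BITS[src_dtype] / BITS[dst_dtype]
    if factor > 1:
        # Reading a wider type as a narrower one: the factor must be an integer.
        return factor.is_integer()
    # factor <= 1: the last dimension times the factor must be an integer.
    return (shape[-1] * factor).is_integer()

# float16 -> float32 gives factor 0.5:
# shape (128, 1): 1 * 0.5 = 0.5 -> rejected; shape (128,): 128 * 0.5 = 64 -> accepted
```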
Returns
The new tensor
Examples
Example 1:
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (16,), name="data_A", scope=tik.scope_gm)
data_B = data_A.reinterpret_cast_to("uint32")
data_C = data_B.reinterpret_cast_to("float16")
"""Example:
Input:
data_A: [ 4.812e+00  1.870e-04 -5.692e-02  2.528e-02 -9.225e+02 -1.431e+02
         -1.541e+01 -2.018e-03  1.653e-03 -4.090e+00  2.016e+01 -5.846e+04
         -8.072e-03  2.627e+00 -3.174e-02 -3.088e-01]
Returns:
data_B: [ 169952464  645507913 3631866677 2552417204 3289847493 4213394698
         1094819874 3035736080]
data_C: [ 4.812e+00  1.870e-04 -5.692e-02  2.528e-02 -9.225e+02 -1.431e+02
         -1.541e+01 -2.018e-03  1.653e-03 -4.090e+00  2.016e+01 -5.846e+04
         -8.072e-03  2.627e+00 -3.174e-02 -3.088e-01]
"""
Example 2
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (16,), name="data_A", scope=tik.scope_gm)
data_B = data_A.reinterpret_cast_to("uint16")
data_C = data_B.reinterpret_cast_to("float16")
"""Example:
Input:
data_A: [ 4.566e+01 -7.880e+02  1.414e-04 -1.300e-02 -1.893e+03 -1.622e-01
         -1.289e+00  2.478e+02 -3.107e+00 -2.072e+01  7.192e-01 -1.805e+00
          3.259e+01 -3.181e-03 -3.248e-05  4.086e+04]
Returns:
data_B: [20917 57896  2210 41640 59237 45361 48424 23486 49719 52526 14785
         48952 20499 39556 33313 30973]
data_C: [ 4.566e+01 -7.880e+02  1.414e-04 -1.300e-02 -1.893e+03 -1.622e-01
         -1.289e+00  2.478e+02 -3.107e+00 -2.072e+01  7.192e-01 -1.805e+00
          3.259e+01 -3.181e-03 -3.248e-05  4.086e+04]
"""
Obtaining Partial Tensor Data by Using the Tensor Array Index
Description
Obtains partial tensor data to form a new tensor.
Prototype
__getitem__(index_in)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| index_in | Input | Tensor array index. |
Restrictions
None
Returns
The new tensor
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
data_b = data_A[1]
Changing the Tensor Content by Using the Tensor Array Index
Description
Changes a tensor.
Prototype
__setitem__(index, value)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| index | Input | Tensor array index. |
| value | Input | Specific value, related to the data type defined by the tensor. Currently, only Scalar, Expr, and Tensor variables are supported; immediates are not supported. |
Restrictions
None
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_ubuf)
scalar_B = tik_instance.Scalar(dtype="float16", name="scalar_B", init_value=2.0)
data_A[0].set_as(scalar_B)
shape
Description
Obtains the tensor shape.
Prototype
shape()
Parameters
None
Restrictions
None
Returns
A list specifying the tensor shape
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_ubuf)
data_A.shape()
Return: [128]
set_as
Description
Sets a tensor.
Prototype
set_as(value, dst_offset=0, src_offset=None)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| value | Input | Value to be assigned. |
| dst_offset | Input | Reserved and not recommended |
| src_offset | Input | Reserved and not recommended |
Restrictions
None
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_ubuf)
data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_ubuf)
data_A[0].set_as(data_B[0])
Program Control
if_scope
Description
Creates an if statement of the TIK. When the condition is met, the statement in the structure is executed.
Prototype
if_scope(cond)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| cond | Input | An Expr specifying the judgment condition. |
Restrictions
None
Returns
TikWithScope object
Example
with self.tik_instance.if_scope(core_index != core_num - 1):
    do_something()
else_scope
Description
Creates an else statement of the TIK. If the if statement does not meet the conditions, the statement in the else_scope structure is executed.
Prototype
else_scope()
Parameters
None
Restrictions
- This function must be after if_scope.
Returns
TikWithScope object
Example
with self.tik_instance.if_scope(core_index != core_num - 1):
    do_something()
with self.tik_instance.else_scope():
    do_else_something()
for_range
Description
Indicates the for loop statement of the TIK. Double buffering and AI Core parallelism can be enabled in the for loop.
Prototype
for_range(begint, endt, name="i", thread_num=1, thread_type="whole", block_num=1, dtype="int32", for_type="serial")
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| begint | Input | Start of the for loop. begint and endt are immediates of type int or uint, scalars of type int or uint, or Exprs. If Exprs are passed, the simplified values must be integers. 0 ≤ begint ≤ endt ≤ 2147483647 |
| endt | Input | End of the for loop. Same requirements as begint. NOTE: The performance deteriorates when begint and endt are scalars. |
| name | Input | Name of the loop variable. The default name is i. |
| thread_num | Input | Whether to enable double buffering in the for loop. Value range: 1 (disabled, the default) or 2 (enabled). |
| thread_type | Input | Thread type of the for loop. This parameter is reserved and has no impact on system running. Value: whole |
| block_num | Input | Number of AI Cores used in the for loop. The maximum value is 65535. |
| dtype | Input | Variable type of the for loop. This parameter is reserved and has no impact on system running. Value: int32 |
| for_type | Input | Type of the for loop. Value: serial |
Restrictions
- If AI Core parallelism is enabled, the start value (begint) of the AI Core parallelism loop must be 0, and the number of AI Cores (block_num) must be equal to the end value (endt) of the loop.
- In a for loop, AI Core parallelism and double buffering are mutually exclusive. To enable them both, you need to use multiple loops.
- If AI Core parallelism is enabled, a tensor in the AI Core parallelism loop must be defined in the loop. If the tensor buffer allocation in both the inner and outer sides of the AI Core parallelism loop starts at 0, address overlapping and data errors may occur.
- When double buffering is enabled, two buffers are allocated for tensors defined in the for loop.
- Do not change the value of endt inside the loop body. Otherwise, the operator execution task hangs.
- When using loop variables, pay attention to the following:
# Use loop variables.
with self.tik_instance.for_range(0, 10) as i:
    with self.tik_instance.if_scope(i == 0):  # Do not use i==0.
        do_something()
    with self.tik_instance.else_scope():  # Do not use else.
        do_something()
Returns
TikWithScope object
Example
with self.tik_instance.for_range(0, 1, thread_num=1):
    do_something()
# Enable double buffering. Note that two buffers are allocated only for tensors defined in the for_range loop.
with self.tik_instance.for_range(0, 2, thread_num=2):
    # Tensor definition
    do_something()
# Enable AI Core parallelism.
with self.tik_instance.for_range(0, 2, block_num=2):
    do_something()
new_stmt_scope
Description
Creates a new statement scope, similar to a block scope in the C language.
Prototype
new_stmt_scope()
Parameters
None
Restrictions
When the program leaves new_stmt_scope, the buffers of tensors defined inside it are automatically freed, and those tensors cannot be accessed outside the scope.
Returns
TikWithScope object
Example
with tik_instance.new_stmt_scope():
    do_something()
Scalar Computation
Single Operand
scalar_abs
Description
Obtains the absolute value of a scalar: dst = |src|
Prototype
scalar_abs(dst, src)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | Destination operand. Must be one of the following data types: Ascend 910 AI Processor: a scalar of type int64 |
| src | Input | Source operand. When src is a scalar, dst must have the same data type as src. Must be one of the following data types: Ascend 910 AI Processor: a scalar or an immediate of type int64 |
Restrictions
None
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_scalar = tik_instance.Scalar(dtype="int64")
src_scalar.set_as(10)
dst_scalar = tik_instance.Scalar(dtype="int64")
tik_instance.scalar_abs(dst_scalar, src_scalar)
scalar_sqrt
Description
Extracts the square root of a scalar: dst = sqrt(|src|)
Prototype
scalar_sqrt(dst, src)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | Destination operand, which must have the same data type as the source operand. Must be one of the following data types: Ascend 910 AI Processor: a scalar of type int64/float32 |
| src | Input | Source operand. Must be one of the following data types: Ascend 910 AI Processor: a scalar or an immediate of type int64/float32 |
Restrictions
Negative source operands are supported: the absolute value is obtained before the square root is extracted.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_scalar = tik_instance.Scalar(dtype="int64")
src_scalar.set_as(10)
dst_scalar = tik_instance.Scalar(dtype="int64")
tik_instance.scalar_sqrt(dst_scalar, src_scalar)
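Per the restriction above, the operation effectively computes the square root of the absolute value. A plain-Python sketch of that semantics (scalar_sqrt_semantics is a hypothetical helper, not a TIK API):

```python
import math

def scalar_sqrt_semantics(src):
    # Negative inputs are legal: the absolute value is taken first.
    return math.sqrt(abs(src))

# scalar_sqrt_semantics(-4.0) and scalar_sqrt_semantics(4.0) both yield 2.0
```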
scalar_countbit0
Description
Counts the number of 0 bits in the 64-bit binary representation of the source operand.
Prototype
scalar_countbit0(dst, src)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | A scalar of type uint64, for the destination operand |
| src | Input | A scalar or an immediate of type uint64, for the source operand |
Restrictions
None
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_scalar = tik_instance.Scalar(dtype="uint64")
src_scalar.set_as(10)
dst_scalar = tik_instance.Scalar(dtype="uint64")
tik_instance.scalar_countbit0(dst_scalar, src_scalar)
scalar_countbit1
Description
Counts the number of 1 bits in the 64-bit binary representation of the source operand.
Prototype
scalar_countbit1(dst, src)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | A scalar of type uint64, for the destination operand |
| src | Input | A scalar or an immediate of type uint64, for the source operand |
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_scalar = tik_instance.Scalar(dtype="uint64")
src_scalar.set_as(10)
dst_scalar = tik_instance.Scalar(dtype="uint64")
tik_instance.scalar_countbit1(dst_scalar, src_scalar)
scalar_countleading0
Description
Counts the number of consecutive most-significant (leading) 0 bits in the 64-bit binary representation of the source operand.
Prototype
scalar_countleading0(dst, src)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | A scalar of type uint64, for the destination operand |
| src | Input | A scalar or an immediate of type uint64, for the source operand |
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_scalar = tik_instance.Scalar(dtype="uint64")
src_scalar.set_as(10)
dst_scalar = tik_instance.Scalar(dtype="uint64")
tik_instance.scalar_countleading0(dst_scalar, src_scalar)
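The three bit-counting instructions above (scalar_countbit0, scalar_countbit1, and scalar_countleading0) can be modeled in plain Python on the 64-bit representation. The helper names below are hypothetical, not TIK APIs:

```python
MASK64 = (1 << 64) - 1  # keep the low 64 bits

def countbit1(x):
    # Number of 1 bits in the 64-bit binary representation.
    return bin(x & MASK64).count("1")

def countbit0(x):
    # Number of 0 bits in the 64-bit binary representation.
    return 64 - countbit1(x)

def countleading0(x):
    # Number of consecutive high-order (leading) 0 bits.
    return 64 - (x & MASK64).bit_length()

# For src = 10 (binary ...1010): 2 one-bits, 62 zero-bits, 60 leading zeros.
```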
scalar_conv
Description
Converts the scalar precision (value) as follows:
- int32 to float32
- float32 to int32
- float32 to float16
- float16 to float32
Prototype
scalar_conv(round_mode, dst, src)
Parameters
| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | A scalar of type float32/float16/int32, for the destination operand |
| round_mode | Input | Rounding mode. Table 10-22 describes the precision conversions and the corresponding round_mode values. |
| src | Input | A scalar of type float32/float16/int32, for the source operand |
Restrictions
During the conversion, precision loss may occur.
The following table shows the result of each round_mode for sample input values.
Value | round | floor | ceil/ceiling | away-zero | to-zero | odd |
---|---|---|---|---|---|---|
1.8 | 2 | 1 | 2 | 2 | 1 | 2 |
1.5 | 2 | 1 | 2 | 2 | 1 | 1 |
1.2 | 1 | 1 | 2 | 1 | 1 | 1 |
0.8 | 1 | 0 | 1 | 1 | 0 | 1 |
0.5 | 0 | 0 | 1 | 1 | 0 | 1 |
0.2 | 0 | 0 | 1 | 0 | 0 | 0 |
-0.2 | 0 | -1 | 0 | 0 | 0 | 0 |
-0.5 | 0 | -1 | 0 | -1 | 0 | -1 |
-0.8 | -1 | -1 | 0 | -1 | 0 | -1 |
-1.2 | -1 | -2 | -1 | -1 | -1 | -1 |
-1.5 | -2 | -2 | -1 | -2 | -1 | -1 |
-1.8 | -2 | -2 | -1 | -2 | -1 | -2 |
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_scalar = tik_instance.Scalar(dtype="float32", init_value=10.2)
dst_scalar = tik_instance.Scalar(dtype="int32")
tik_instance.scalar_conv('round', dst_scalar, src_scalar)
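The round_mode behavior shown in the table above can be reproduced in plain Python. The mode semantics below are inferred from the table values (round = half to even, away-zero = half away from zero, to-zero = truncation, odd = half to odd); this is an illustrative sketch, not a TIK API:

```python
import math

def scalar_round(x, mode):
    # Reference implementation of the round_mode table (inferred, not a TIK API).
    if mode == "round":                 # round half to even (banker's rounding)
        return round(x)
    if mode == "floor":
        return math.floor(x)
    if mode in ("ceil", "ceiling"):
        return math.ceil(x)
    if mode == "away-zero":             # round half away from zero
        return math.trunc(x + math.copysign(0.5, x))
    if mode == "to-zero":               # truncate toward zero
        return math.trunc(x)
    if mode == "odd":                   # round half to odd
        n = math.floor(x)
        frac = x - n
        if frac > 0.5:
            return n + 1
        if frac < 0.5:
            return n
        return n if n % 2 else n + 1    # tie: pick the odd neighbor
    raise ValueError(mode)

row = [scalar_round(1.8, m)
       for m in ("round", "floor", "ceil", "away-zero", "to-zero", "odd")]
print(row)  # [2, 1, 2, 2, 1, 2]
```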
Dual Operands
scalar_max
Description
Compares two source operands and returns the maximum:
Prototype
scalar_max(dst, src0, src1)
Parameters
Parameter | Input/Output | Description |
---|---|---|
dst | Output | A scalar of type int64, for the destination operand |
src0 | Input | A scalar or an immediate of type int64, for source operand 0 |
src1 | Input | A scalar or an immediate of type int64, for source operand 1 |
Restrictions
The operands must have the same data type.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src0_scalar = tik_instance.Scalar(dtype="int64", name='src0_scalar', init_value=3)
src1_scalar = tik_instance.Scalar(dtype="int64", name='src1_scalar', init_value=2)
dst_scalar = tik_instance.Scalar(dtype="int64", name='dst_scalar')
tik_instance.scalar_max(dst_scalar, src0_scalar, src1_scalar)
scalar_min
Description
Compares two source operands and returns the minimum:
Prototype
scalar_min(dst, src0, src1)
Parameters
Parameter | Input/Output | Description |
---|---|---|
dst | Output | A scalar of type int64, for the destination operand |
src0 | Input | A scalar or an immediate of type int64, for source operand 0 |
src1 | Input | A scalar or an immediate of type int64, for source operand 1 |
Restrictions
The operands must have the same data type.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src0_scalar = tik_instance.Scalar(dtype="int64", name='src0_scalar', init_value=3)
src1_scalar = tik_instance.Scalar(dtype="int64", name='src1_scalar', init_value=2)
dst_scalar = tik_instance.Scalar(dtype="int64", name='dst_scalar')
tik_instance.scalar_min(dst_scalar, src0_scalar, src1_scalar)
Vector Computation
SIMD Instruction Execution Model
SIMD stands for single instruction, multiple data, which means that a single instruction performs operations on multiple data elements. The basic operation units of an SIMD instruction of the Ascend AI processor include two dimensions: space (in the unit of blocks) and time (in the unit of repeats). Generally, a block is 32 bytes, including 16 elements of type float16/uint16/int16, eight elements of type float32/uint32/int32, or 32 elements of type int8/uint8.
The address offset of the same block between adjacent iterations supports only the linear mode. That is, you specify a fixed block-to-block stride that determines the address offset of each block in the next iteration (repeat).
The main operands of TIK SIMD instructions are tensors, and a few operands are scalars or immediates. According to the data flow, the operations can be classified into element-wise operations and reduce operations. The element-wise operations can be further classified into single-operand, dual-operand, and triple-operand instructions (only operands that participate in the scalar operations are counted).
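As a mental model of the block/repeat layout, the sketch below enumerates which element indices of a float16 operand a single-source instruction touches when each repeat covers eight 32-byte blocks (128 float16 elements). This is a simplified illustration, not a TIK API; the actual per-repeat element count (PAR) depends on the data type, the mask, and the chip version:

```python
def touched_indices(repeat_times, rep_stride, elems_per_block=16,
                    blocks_per_repeat=8):
    # rep_stride is the block-to-block step between the starts of
    # adjacent repeats, in units of 32-byte blocks.
    indices = []
    for r in range(repeat_times):
        start = r * rep_stride * elems_per_block
        indices.extend(range(start, start + blocks_per_repeat * elems_per_block))
    return indices

# repeat_times=2, rep_stride=8: two back-to-back 128-element repeats.
idx = touched_indices(2, 8)
print(idx[0], idx[127], idx[128], idx[255])  # 0 127 128 255
```

With rep_stride greater than 8, the repeats leave a gap between them; with rep_stride equal to 8 (the usual case for full-width float16 operations), the data is processed contiguously.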
Single Input (Gather Mode)
General Definition
Description
This is a generic format for an instruction with only one source operand. Note that it is not a real instruction.
Prototype
instruction (mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
PIPE: VECTOR
Parameters
Parameter | Input/Output | Description |
---|---|---|
instruction | Input | A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
mask | Input | 128-bit mask. If a bit is set to 0, the corresponding element of the vector is masked in the computation. If a bit is set to 1, the corresponding element of the vector participates in the computation. The consecutive mode and bit-wise mode are supported. Note: mask applies to the source operand of each repeat. |
dst | Output | Destination operand, which is the start element of the tensor. For details about the supported data precision, see the specific instruction. |
src | Input | Source operand, which is the start element of the tensor. For details about the supported data precision, see the specific instruction. |
repeat_times | Input | Number of iteration repeats |
dst_rep_stride | Input | Block-to-block stride between adjacent iterations of the destination operand |
src_rep_stride | Input | Block-to-block stride between adjacent iterations of the source operand |
Restrictions
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- The degree of parallelism of each repeat depends on the data type and chip version. The following uses PAR to describe the degree of parallelism.
- dst_rep_stride and src_rep_stride are within the range [0, 255]. The unit is 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- dst and src must be declared in scope_ubuf, and the supported data types are related to the chip version. If the data types are not supported, the tool reports an error.
- dst has the same data type as src.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows. Note that each instruction might have specific restrictions.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
vec_relu
Description
Performs a ReLU operation element-wise:
ReLU stands for rectified linear unit, and is the most used activation function in artificial neural networks.
Prototype
vec_relu(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst has the same data type as src. Must be one of the following data types:
Ascend 910 AI Processor: tensors of type float16
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik
tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_relu(128, dst_ub, src_ub, 1, 8, 8)
vec_abs
Description
Computes the absolute value element-wise:
Prototype
vec_abs(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst has the same data type as src:
Ascend 910 AI Processor: tensors of type float16 or float32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik
tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_abs(128, dst_ub, src_ub, 1, 8, 8)
vec_not
Description
Performs bit-wise NOT element-wise:
Prototype
vec_not(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst and src have the same data type:
Tensors of type uint16 or int16
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik
tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("uint16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("uint16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_not(128, dst_ub, src_ub, 1, 8, 8)
vec_exp
Description
Computes the natural exponential element-wise:
Even if the e^x computation result meets the accuracy requirement, the e^x – 1 computation result using this API with float16 input fails to meet the dual-0.1% error limit (the error ratio is within 0.1% and the relative error is within 0.1%) due to the subtraction error. If the accuracy requirement for the e^x – 1 computation is high, the vec_expm1_high_preci API is preferred.
Prototype
vec_exp(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst has the same data type as src. Must be one of the following data types:
Ascend 910 AI Processor: tensors of type float16 or float32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik
tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.data_move(src_ub, src_gm, 0, 1, 8, 0, 0)
tik_instance.vec_exp(128, dst_ub, src_ub, 1, 8, 8)
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE(kernel_name="exp", inputs=[src_gm], outputs=[dst_gm])

Inputs: [0, 1, 2, 3, ......]
Returns: [1.0, 2.719, 7.391, 20.08, ......]
vec_expm1_high_preci
Description
Computes e^x – 1 element-wise:
The e^x – 1 computation result using this API offers higher accuracy than that of the vec_exp API.
Prototype
vec_expm1_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.
dst, src, and work_tensor are tensors of the same data type, float16.
- If the source operand tensor has an offset, the passing formats are as follows: tensor[offset1:offset2] means starting from offset1 and ending at offset2. tensor[offset1:] means starting from offset1. tensor[offset] means that only one element is passed. In this format the tensor cannot be sliced and a runtime error will be reported; therefore, this format is not allowed.
- If the source operand tensor does not have an offset, the tensor can be passed directly.
work_tensor:
work_tensor is a user-defined temporary buffer space for storing the intermediate result. The space is limited to scope_ubuf and is used for internal computation only.
work_tensor buffer space calculation:
- Calculate the minimum buffer space required for src computation based on repeat_times and src_rep_stride as follows: src_extent_size = (repeat_times – 1) * src_rep_stride * 16 + 128. If 0 < src_rep_stride <= 8, use 8 as src_rep_stride in this formula; otherwise, retain its original value.
- Round up the minimum buffer space required for src computation to a multiple of 32 bytes: wk_size_unit = (src_extent_size + 15)//16 * 16
- Calculate the size of work_tensor as follows: work_tensor = 11 * wk_size_unit
Example of work_tensor buffer space calculation:
- If repeat_times = 1 and src_rep_stride = 8, then src_extent_size= 128 and work_tensor = 128 * 11.
- If repeat_times = 2 and src_rep_stride = 4, then src_extent_size = (2 – 1) * 8 * 16 + 128 = 256 and work_tensor = 256 * 11.
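The sizing steps above can be collected into a small helper for the float16 case. The function name is ours, for illustration only; the formula follows the rules stated above:

```python
def expm1_work_tensor_size(repeat_times, src_rep_stride):
    # work_tensor element count for vec_expm1_high_preci (float16 source).
    # If 0 < src_rep_stride <= 8, the stride is treated as 8.
    stride = 8 if 0 < src_rep_stride <= 8 else src_rep_stride
    src_extent_size = (repeat_times - 1) * stride * 16 + 128
    wk_size_unit = (src_extent_size + 15) // 16 * 16  # round up to 32 bytes
    return 11 * wk_size_unit

print(expm1_work_tensor_size(1, 8))  # 1408 (= 128 * 11)
print(expm1_work_tensor_size(2, 4))  # 2816 (= 256 * 11)
```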
Restrictions
- dst, src, and work_tensor must be declared in scope_ubuf.
- The space of the dst, src, and work_tensor tensors cannot overlap.
- The final computation result must be within the data range. Otherwise, an infinite or saturated result is yielded.
- For other restrictions, see Restrictions.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# The required space is ((1 – 1) * 8 * 16 + 128) * 11 = 128 * 11.
work_tensor_ub = tik_instance.Tensor("float16", (128*11,), name="work_tensor_ub", scope=tik.scope_ubuf)
tik_instance.data_move(src_ub, src_gm, 0, 1, 8, 0, 0)
tik_instance.vec_expm1_high_preci(128, dst_ub, src_ub, work_tensor_ub, 1, 8, 8)
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE(kernel_name="expm1", inputs=[src_gm], outputs=[dst_gm])

Inputs: [0, 1, 2, 3, ......]
Returns: [0.0, 1.719, 6.391, 19.08, ......]
vec_ln
Description
Computes the natural logarithm element-wise:
Prototype
vec_ln(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst has the same data type as src:
Ascend 910 AI Processor: tensors of type float16 or float32
Returns
None
Restrictions
- If any value of src is not positive, an unknown result may be produced.
- For other restrictions, see Restrictions.
Example
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float16", (128,), tik.scope_gm, "src_gm")
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
dst_gm = tik_instance.Tensor("float16", (128,), tik.scope_gm, "dst_gm")
# Move the user input to the source UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 8, 0, 0)
tik_instance.vec_ln(128, dst_ub, src_ub, 1, 8, 8)
# Move the computation result to the destination GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE("v100_mini_vec_ln_test", [src_gm], [dst_gm])

Inputs: [1, 2, 3, 4, ......, 128]
Returns: [0, 0.6931, 1.0986, 1.3863, ......, 4.8520]
vec_rec
Description
Computes the reciprocal element-wise:
Using this API, the operator computation result fails to meet the dual-0.1% error limit (the error ratio is within 0.1% and the relative error is within 0.1%) with float16 input, and fails to meet the dual-0.01% error limit with float32 input. If the accuracy requirement is high, the vec_rec_high_preci API is preferred.
Prototype
vec_rec(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst must have the same data type as src. Must be one of the following data types:
Tensors of type float16 or float32
Returns
None
Restrictions
- For details, see Restrictions.
- If any value of src is 0, an unknown result may be produced.
Example 1
from te import tik
# Define a container.
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float32", (128,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float32", (128,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float32", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move data from the GM to the UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*4 // 32, 0, 0)
tik_instance.vec_rec(64, dst_ub, src_ub, 2, 8, 8)
# Move data from the UB to the GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*4 // 32, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_rec", inputs=[src_gm], outputs=[dst_gm])

Inputs: [1.2017815 -8.758528 -3.9551935 ... -1.3599057 -2.319316]
Returns: [0.83203125 -0.11401367 -0.2529297 ... -0.734375 -0.43164062]
Example 2
from te import tik
# Define a container.
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move data from the GM to the UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*2 // 32, 0, 0)
tik_instance.vec_rec(128, dst_ub, src_ub, 1, 8, 8)
# Move data from the UB to the GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*2 // 32, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_rec", inputs=[src_gm], outputs=[dst_gm])

Inputs: [-7.152 -7.24 1.771 ... -1.339 4.473]
Returns: [-0.1396 -0.1382 0.5645 ... -0.748 0.2231]
vec_rec_high_preci
Description
Computes the reciprocal element-wise:
The computation result using this API offers higher accuracy than the vec_rec API.
Prototype
vec_rec_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.
dst has the same data type as src. They are tensors of type float16 or float32. work_tensor is a tensor of type float32.
- If the source operand tensor has an offset, the passing formats are as follows: tensor[offset1:offset2] means starting from offset1 and ending at offset2. tensor[offset1:] means starting from offset1. tensor[offset] means that only one element is passed. In this format the tensor cannot be sliced and a runtime error will be reported; therefore, this format is not allowed.
- If the source operand tensor does not have an offset, the tensor can be passed directly.
work_tensor:
work_tensor is a user-defined temporary buffer space for storing the intermediate result. The space is limited to scope_ubuf and is used for internal computation only.
work_tensor buffer space calculation:
- Calculate the minimum buffer space required for src computation based on repeat_times, mask, and src_rep_stride as follows: src_extent_size = (repeat_times – 1) * src_rep_stride * block_len + mask_len
When the source operand is of type float16, block_len is 16.
When the source operand is of type float32, block_len is 8.
In consecutive mask mode, mask_len is the mask value itself.
In bit-wise mask mode, mask_len is the mask value corresponding to the most significant bit.
- Round up the minimum buffer space required for src computation to a multiple of 32 bytes: wk_size_unit = ((src_extent_size + block_len - 1)//block_len) * block_len
- Calculate the size of work_tensor as follows:
When the source operand is of type float16, work_tensor = 4 * wk_size_unit
When the source operand is of type float32, work_tensor = 2 * wk_size_unit
Example of work_tensor buffer space calculation:
- If src is of type float16, mask is 128, repeat_times is 2, and src_rep_stride is 8, then block_len is 16, mask_len is 128, and src_extent_size = (2 – 1) * 8 * 16 + 128 = 256. Rounding up to a multiple of 32 bytes gives wk_size_unit = ((256 + 16 - 1)//16) * 16 = 256. Therefore, the size of work_tensor is 4 * 256 = 1024.
- If src is of type float32, mask is 64, repeat_times is 2, and src_rep_stride is 8, then block_len is 8, mask_len is 64, and src_extent_size = (2 – 1) * 8 * 8 + 64 = 128. Rounding up to a multiple of 32 bytes gives wk_size_unit = ((128 + 8 - 1)//8) * 8 = 128. Therefore, the size of work_tensor is 2 * 128 = 256.
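The sizing steps above can be collected into a small helper covering both source data types in consecutive mask mode. The function name is ours, for illustration; the formula follows the rules stated above:

```python
def rec_work_tensor_size(dtype, mask_len, repeat_times, src_rep_stride):
    # work_tensor element count for vec_rec_high_preci, consecutive mask mode.
    block_len = 16 if dtype == "float16" else 8
    src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len
    # Round up to a multiple of 32 bytes.
    wk_size_unit = (src_extent_size + block_len - 1) // block_len * block_len
    factor = 4 if dtype == "float16" else 2
    return factor * wk_size_unit

print(rec_work_tensor_size("float16", 128, 2, 8))  # 1024
print(rec_work_tensor_size("float32", 64, 2, 8))   # 256
```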
Restrictions
- dst, src, and work_tensor must be declared in scope_ubuf.
- The space of the dst, src, and work_tensor tensors cannot overlap.
- If any value is 0, an unknown result may be produced.
- For other restrictions, see Restrictions.
Returns
None
Example 1
from te import tik
# Define a container.
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float32", (128,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float32", (128,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float32", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move data from the GM to the UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*4 // 32, 0, 0)
# Calculate the size of work_tensor.
mask = [0, 2**64 - 1]
mask_len = 64
repeat_times = 2
dst_rep_stride = 8
src_rep_stride = 8
block_len = 8  # src dtype is float32
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1)//block_len)*block_len
wk_size = 2*wk_size_unit
# Define work_tensor.
work_tensor_ub = tik_instance.Tensor("float32", (wk_size,), name="work_tensor_ub", scope=tik.scope_ubuf)
# If the work_tensor has an index, use the work_tensor[index:] format.
tik_instance.vec_rec_high_preci(mask_len, dst_ub, src_ub, work_tensor_ub[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move data from the UB to the GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*4 // 32, 0, 0)
tik_instance.BuildCCE(kernel_name="test_vec_rec_high_preci", inputs=[src_gm], outputs=[dst_gm])

Inputs: [-6.9427586 -3.5300326 1.176882 ... -6.196793 9.0379095]
Returns: [-0.14403497 -0.2832835 0.8497028 ... -0.16137381 0.11064506]
Example 2
from te import tik
# Define a container.
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move data from the GM to the UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*2 // 32, 0, 0)
# Calculate the size of work_tensor.
mask = 128
mask_len = mask
repeat_times = 1
dst_rep_stride = 8
src_rep_stride = 8
block_len = 16  # src dtype is float16
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1) // block_len)*block_len
wk_size = 4*wk_size_unit
# Define work_tensor.
work_tensor_ub = tik_instance.Tensor("float32", (wk_size,), name="work_tensor_ub", scope=tik.scope_ubuf)
# If the work_tensor has an index, use the work_tensor[index:] format.
tik_instance.vec_rec_high_preci(mask_len, dst_ub, src_ub, work_tensor_ub[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move data from the UB to the GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*2 // 32, 0, 0)
tik_instance.BuildCCE(kernel_name="test_vec_rec_high_preci", inputs=[src_gm], outputs=[dst_gm])

Inputs: [-7.08 -4.434 1.294 ... 8.82 -2.854]
Returns: [-0.1412 -0.2256 0.773 ... 0.1134 -0.3503]
vec_rsqrt
Description
Computes the reciprocal of the square root element-wise:
Using this API, the operator computation result fails to meet the dual-0.1% error limit (the error ratio is within 0.1% and the relative error is within 0.1%) with float16 input, and fails to meet the dual-0.01% error limit with float32 input. If the accuracy requirement is high, the vec_rsqrt_high_preci API is preferred.
Prototype
vec_rsqrt(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst has the same data type as src:
Tensors of type float16 or float32
Returns
None
Restrictions
- For details, see Restrictions.
- If any value of src is not positive, an unknown result may be produced.
Example
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
# Move the user input to the source UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 8, 0, 0)
tik_instance.vec_rsqrt(128, dst_ub, src_ub, 1, 8, 8)
# Move the computation result to the destination GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt", [src_gm], [dst_gm])

Inputs: [1, 2, 3, 4, ......, 128]
Returns: [0.998, 0.705, 0.576, 0.499, ......, 0.08813]
vec_rsqrt_high_preci
Description
Computes the reciprocal of the square root element-wise:
The computation result using this API offers higher accuracy than the vec_rsqrt API.
Prototype
vec_rsqrt_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.
dst has the same data type as src. They are tensors of type float16 or float32. work_tensor is a tensor of type float32.
- If the source operand tensor has an offset, the passing formats are as follows: tensor[offset1:offset2] means starting from offset1 and ending at offset2. tensor[offset1:] means starting from offset1. tensor[offset] means that only one element is passed. In this format the tensor cannot be sliced and a runtime error will be reported; therefore, this format is not allowed.
- If the source operand tensor does not have an offset, the tensor can be passed directly.
work_tensor:
work_tensor is a user-defined temporary buffer space for storing the intermediate result. The space is limited to scope_ubuf and is used for internal computation only.
work_tensor buffer space calculation:
- Calculate the minimum buffer space required for src computation based on repeat_times, mask, and src_rep_stride as follows: src_extent_size = (repeat_times – 1) * src_rep_stride * block_len + mask_len
When the source operand is of type float16, block_len is 16.
When the source operand is of type float32, block_len is 8.
In consecutive mask mode, mask_len is the mask value itself.
In bit-wise mask mode, mask_len is the mask value corresponding to the most significant bit.
- Round up the minimum buffer space required for src computation to a multiple of 32 bytes: wk_size_unit = ((src_extent_size + block_len - 1)//block_len) * block_len
- Calculate the size of work_tensor as follows:
For Ascend 910 AI Processor:
When the source operand is of type float16, work_tensor = 5 * wk_size_unit
When the source operand is of type float32, work_tensor = 3 * wk_size_unit
Example of work_tensor buffer space calculation:
For Ascend 910 AI Processor:
If src is of type float16, mask is 128, repeat_times is 2, and src_rep_stride is 8, then block_len is 16, mask_len is 128, and src_extent_size = (2 – 1) * 8 * 16 + 128 = 256. Rounding up to a multiple of 32 bytes gives wk_size_unit = 256. Therefore, the size of work_tensor is 5 * 256 = 1280.
Returns
None
Restrictions
- dst, src, and work_tensor must be declared in scope_ubuf.
- The space of the dst, src, and work_tensor tensors cannot overlap.
- If any value of src is not positive, an unknown result may be produced.
- For other restrictions, see Restrictions.
Example 1
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move the user input to the source UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*2 // 32, 0, 0)
mask = 128
mask_len = mask  # In consecutive mask mode, mask_len is the mask value itself.
repeat_times = 1
dst_rep_stride = 8
src_rep_stride = 8
block_len = 16  # src dtype is float16
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1) // block_len)*block_len
wk_size = 6*wk_size_unit  # Obtain the size of work_tensor.
# Define work_tensor.
work_tensor = tik_instance.Tensor("float32", (wk_size,), name="work_tensor", scope=tik.scope_ubuf)
# If the tensor has an index offset, add a colon (:) after the subscript in the following format. Otherwise, the program will report an error.
tik_instance.vec_rsqrt_high_preci(mask, dst_ub, src_ub, work_tensor[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move the computation result to the destination GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*2 // 32, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt_high_preci", inputs=[src_gm], outputs=[dst_gm])

For example:
Inputs: src_gm = [6.996 1.381 5.996 7.902 ... 5.113 5.78 1.672 5.418]
Returns: dst_gm = [0.3782 0.851 0.4084 0.3557 ... 0.4421 0.416 0.7734 0.4297]
Example 2
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
dst_gm = tik_instance.Tensor("float32", (128,), name="dst_gm", scope=tik.scope_gm)
src_gm = tik_instance.Tensor("float32", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float32", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move the user input to the source UB.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*4 // 32, 0, 0)
mask = [0, 2**64 - 1]
mask_len = 64  # In bit-wise mask mode, mask_len is the mask value corresponding to the most significant bit.
repeat_times = 2
dst_rep_stride = 8
src_rep_stride = 8
block_len = 8  # src dtype is float32
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1)//block_len)*block_len
wk_size = 4*wk_size_unit  # Obtain the size of work_tensor.
# Define work_tensor.
work_tensor = tik_instance.Tensor("float32", (wk_size,), name="work_tensor", scope=tik.scope_ubuf)
# If the tensor has an index offset, add a colon (:) after the subscript in the following format. Otherwise, the program will report an error.
tik_instance.vec_rsqrt_high_preci(mask, dst_ub, src_ub, work_tensor[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move the computation result to the destination GM.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*4 // 32, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt_high_preci", inputs=[src_gm], outputs=[dst_gm])

For example:
Inputs: src_gm = [5.349619, 0.4301902, 4.7152824, 9.539162, ..., 5.7243876, 4.4785686, 7.030495, 7.489954]
Returns: dst_gm = [0.43235308, 1.5246484, 0.46051747, 0.32377616, ..., 0.41796073, 0.47253108, 0.37714386, 0.36539316]
Dual Inputs (Gather Mode)
General Definition
Description
This is a generic format for an instruction with two source operands. Note that it is not a real instruction.
Prototype
instruction (mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
Parameter | Input/Output | Description |
---|---|---|
instruction | Input | A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
mask | Input | For details, see the description of the mask parameter in Table 10-25. |
dst | Output | Destination operand, which is the start element of the tensor. For details about the supported data type, see the specific instruction. |
src0 | Input | Source operand 0, which is the start element of the tensor. For details about the supported data type, see the specific instruction. |
src1 | Input | Source operand 1, which is the start element of the tensor. For details about the supported data type, see the specific instruction. |
repeat_times | Input | Number of iteration repeats |
dst_rep_stride | Input | Block-to-block stride between adjacent iterations of the destination operand |
src0_rep_stride | Input | Block-to-block stride between adjacent iterations of source operand 0 |
src1_rep_stride | Input | Block-to-block stride between adjacent iterations of source operand 1 |
Restrictions
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- The degree of parallelism of each repeat depends on the data precision and chip version. The following uses PAR to describe the degree of parallelism.
- dst_rep_stride, src0_rep_stride, and src1_rep_stride are within the range [0, 255]. The unit is 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported. Address overlapping is supported in the following cases: (1) The argument of instructions vec_add, vec_sub, vec_mul, vec_max, vec_min, vec_and, or vec_or is of type float16, int32, or float32, and the destination operand completely overlaps the second source operand. (2) src1_rep_stride = dst_rep_stride, and src0 and src1 do not overlap.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
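The repeat/stride addressing rules above can be modeled in plain Python. This is an illustrative sketch, not TIK code; the helper name and the float16 figures (16 elements per 32-byte block, 8 blocks per repeat) are our assumptions for the example:

```python
def element_offset(repeat, block, lane, rep_stride, elems_per_block=16):
    """Offset, in elements, of `lane` within `block` of iteration `repeat`,
    for an operand whose block-to-block iteration stride is `rep_stride`
    (unit: 32-byte blocks). Illustrative model only."""
    # Iteration r starts rep_stride blocks after iteration r-1; within an
    # iteration, consecutive blocks are adjacent.
    return (repeat * rep_stride + block) * elems_per_block + lane

# Dense float16 layout: rep_stride=8 means iterations are back to back.
print(element_offset(1, 0, 0, 8))   # first lane of the second iteration
```

With rep_stride=8 and float16 data, iteration 1 starts exactly 128 elements (8 blocks) after iteration 0, which is the dense case used by most examples in this section.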
vec_add
Description
Performs addition element-wise:
Prototype
vec_add(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Ascend 910 AI Processor: tensors of type float16/float32/int32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_add(128, dst_ub, src0_ub, src1_ub, 1, 8, 8, 8)
vec_sub
Description
Performs subtraction element-wise:
Prototype
vec_sub(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Ascend 910 AI Processor: tensors of type float16/float32/int32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_sub(128, dst_ub, src0_ub, src1_ub, 1, 8, 8, 8)
vec_mul
Description
Performs multiplication element-wise:
Prototype
vec_mul(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Ascend 910 AI Processor: tensors of type float16/float32/int32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_mul(128, dst_ub, src0_ub, src1_ub, 1, 8, 8, 8)
vec_max
Description
Computes the maximum element-wise:
Prototype
vec_max(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Ascend 910 AI Processor: tensors of type float16/float32/int32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float32", (64,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float32", (64,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (64,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_max(64, dst_ub, src0_ub, src1_ub, 1, 8, 8, 8)
vec_min
Description
Computes the minimum element-wise:
Prototype
vec_min(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Ascend 910 AI Processor: tensors of type float16/float32/int32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float32", (64,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float32", (64,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (64,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_min(64, dst_ub, src0_ub, src1_ub, 1, 8, 8, 8)
vec_and
Description
Performs bit-wise AND element-wise:
Prototype
vec_and(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Tensors of type uint16 or int16
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("uint16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("uint16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("uint16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_and([0, 2**64 - 1], dst_ub, src0_ub, src1_ub, 1, 8, 8, 8)
vec_or
Description
Performs bit-wise OR element-wise:
Prototype
vec_or(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
Parameters
For details, see Parameters.
dst, src0, and src1 have the same data type:
Tensors of type uint16 or int16
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("uint16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("uint16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("uint16", (128,), name="dst_ub", scope=tik.scope_ubuf)
mask_h = tik_instance.Scalar(dtype="uint64", init_value=1)
mask_l = tik_instance.Scalar(dtype="uint64", init_value=15)
mask = [mask_h, mask_l]
repeat_times = tik_instance.Scalar(dtype="int32", init_value=1)
tik_instance.vec_or(mask, dst_ub, src0_ub, src1_ub, repeat_times, 8, 8, 8)
Dual Scalar Inputs (Gather Mode)
General Definition
Description
This is a generic format for an instruction with two source operands (src and scalar). Note that it is not a real instruction.
Prototype
instruction (mask, dst, src, scalar, repeat_times, dst_rep_stride, src_rep_stride, mask_mode="normal")
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
instruction |
Input |
A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
mask |
Input |
There are two modes based on mask_mode:
For the Ascend 910 AI Processor, the normal mode is used. |
dst |
Output |
Vector destination operand, which is the start element of the tensor. For details about the supported data precision, see the specific instruction. |
src |
Input |
Vector source operand, which is the start element of the tensor. For details about the supported data precision, see the specific instruction. |
scalar |
Input |
A scalar or immediate, specifying the scalar source operand |
repeat_times |
Input |
Number of iteration repeats |
dst_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the vector destination operand |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the vector source operand |
mask_mode |
Input |
A string specifying the mask mode. The options are as follows:
For Ascend 910 AI Processor, this parameter has no effect. |
Restrictions
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- dst_rep_stride and src_rep_stride are within the range [0,255]. The unit is 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The addresses of dst and src cannot overlap.
- The argument of the scalar parameter is a scalar or an immediate of type int/float.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
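As an illustration of the vector-scalar semantics and the stride rules above, here is a pure-Python model of a vec_adds-style instruction. This is our sketch, not TIK code; it assumes a full mask and float16 block geometry (16 elements per 32-byte block, 8 blocks per repeat):

```python
def vec_adds_sketch(dst, src, scalar, repeat_times, dst_rep_stride,
                    src_rep_stride, elems_per_block=16, blocks_per_rep=8):
    """Illustrative model: each iteration adds `scalar` to one full repeat
    of `src` and writes it to `dst`, honoring the block-to-block strides
    (unit: 32-byte blocks)."""
    par = elems_per_block * blocks_per_rep   # 128 lanes for float16
    for r in range(repeat_times):
        d0 = r * dst_rep_stride * elems_per_block
        s0 = r * src_rep_stride * elems_per_block
        for i in range(par):
            dst[d0 + i] = src[s0 + i] + scalar

src = list(range(128))
dst = [0] * 128
vec_adds_sketch(dst, src, 2, 1, 8, 8)   # one dense repeat
```

The same loop structure describes vec_muls with `*` in place of `+`.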
vec_adds
Description
Performs addition between a vector and a scalar element-wise:
Prototype
vec_adds(mask, dst, src, scalar, repeat_times, dst_rep_stride, src_rep_stride, mask_mode="normal")
Parameters
For details, see Parameters.
The dst, src, and scalar operands have the same data type:
Ascend 910 AI Processor: tensors of type float16 or float32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
scalar = tik_instance.Scalar(dtype="float16", init_value=2)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_adds(128, dst_ub, src_ub, scalar, 1, 8, 8)
vec_muls
Description
Performs multiplication between a vector and a scalar element-wise:
Prototype
vec_muls(mask, dst, src, scalar, repeat_times, dst_rep_stride, src_rep_stride, mask_mode="normal")
Parameters
For details, see Parameters.
dst must have the same data type as src. If the scalar operand is a scalar, it must have the same data type as dst and src. The following data types are supported:
Ascend 910 AI Processor: tensors of type float16 or float32
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
scalar = 2
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_muls(128, dst_ub, src_ub, scalar, 1, 8, 8)
Triple Scalar Inputs (Gather Mode)
General Definition
Description
This is a generic format for an instruction with three source operands (src, dst, and scalar). Note that it is not a real instruction.
Prototype
instruction (mask, dst, src, scalar, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
instruction |
Input |
A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
dst |
Output |
Vector destination operand or source operand 1, which is the start element of the tensor. For details about the supported data precision, see the specific instruction. |
src |
Input |
Vector source operand 0, which is the start element of the tensor. For details about the supported data precision, see the specific instruction. |
scalar |
Input |
A scalar or immediate, for the scalar source operand |
repeat_times |
Input |
Number of iteration repeats |
dst_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the vector destination operand |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the vector source operand |
Restrictions
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- dst_rep_stride and src_rep_stride are within the range [0, 65535]. The unit is 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- dst is both the destination operand and the source operand.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
vec_axpy
Description
Performs multiplication-accumulation between a vector and a scalar element-wise.
Prototype
vec_axpy(mask, dst, src, scalar, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters.
dst, src, and scalar are tensors of type float16 or float32. src and scalar must have the same data type.
The supported precision combinations are as follows:
Type |
src.dtype |
scalar.dtype |
dst.dtype |
PAR/Repeat |
---|---|---|---|---|
fp16 |
float16 |
float16 |
float16 |
128 |
fp32 |
float32 |
float32 |
float32 |
64 |
fmix |
float16 |
float16 |
float32 |
64 |
Returns
None
Restrictions
- For details, see Restrictions.
- Note that mixed precision (fmix) is supported.
- In fmix mode, only the first four blocks of src are computed every iteration.
Example
from te import tik

tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
scalar = 2
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_axpy(64, dst_ub, src_ub, scalar, 1, 8, 4)
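The accumulation semantics (dst acts as both a source and the destination, as noted in the restrictions above) can be sketched in plain Python. This is our illustrative model of one repeat over n lanes, not TIK code:

```python
def vec_axpy_sketch(dst, src, scalar, n):
    """Illustrative model of one vec_axpy repeat: dst accumulates
    src * scalar element-wise (dst is read and written)."""
    for i in range(n):
        dst[i] += src[i] * scalar

dst = [1.0, 1.0, 1.0, 1.0]
src = [2.0, 3.0, 4.0, 5.0]
vec_axpy_sketch(dst, src, 2, 4)   # dst becomes [5.0, 7.0, 9.0, 11.0]
```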
Comparison and Selection Instructions (Gather Mode)
vec_cmpv_xx
Description
Performs element-wise comparison to generate a 1-bit result. 1'b1 indicates true, and 1'b0 indicates false. Multiple comparison modes are supported.
Prototype
vec_cmpv_xx (dst, src0, src1, repeat_times, src0_rep_stride, src1_rep_stride)
PIPE: vector
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
instruction |
Input |
Instruction name. The following comparison modes are supported:
|
dst |
Output |
Destination operand, which is the start element of the tensor. Must be one of the following data types: uint64, uint32, uint16, uint8 |
src0 |
Input |
Source operand 0, which is the start element of the tensor. Must be one of the following data types: Ascend 910 AI Processor: float16 or float32 |
src1 |
Input |
Source operand 1, which is the start element of the tensor. src1 has the same data type as src0. |
repeat_times |
Input |
Number of iteration repeats
|
src0_rep_stride |
Input |
Block-to-block stride between adjacent iterations of source operand 0 |
src1_rep_stride |
Input |
Block-to-block stride between adjacent iterations of source operand 1 |
Returns
None
Restrictions
- The mask parameter is unavailable.
- dst is generated continuously. For example, if the source operand is of type float16 while the destination operand is of type uint16, eight elements of dst are skipped between adjacent iterations.
- src0_rep_stride and src1_rep_stride are within the range [0, 255], in the unit of blocks. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("uint16", (16,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_dup(16, dst_ub, 5, 1, 1)  # Initializes dst_ub to all 5s.
tik_instance.vec_cmpv_eq(dst_ub, src0_ub, src1_ub, 1, 8, 8)
"""For example:
Inputs (float16):
src0_ub = {1,2,3,...,128}
src1_ub = {2,2,2,...,2}
Returns:
dst_ub = {2,0,0,0,0,0,0,0,5,5,5,5,5,5,5,5}
"""
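The result layout in the example above (128 float16 comparisons packed into 8 uint16 words, one bit per lane) can be modeled in plain Python. This is an illustrative sketch under the assumption that bits are packed little-endian within each word, not TIK code:

```python
def vec_cmpv_eq_sketch(src0, src1, bits_per_word=16):
    """Illustrative model of vec_cmpv_eq: element-wise equality packed
    into a bit mask; bit b of word w corresponds to lane w*16 + b."""
    words = []
    for w in range(len(src0) // bits_per_word):
        word = 0
        for b in range(bits_per_word):
            i = w * bits_per_word + b
            if src0[i] == src1[i]:
                word |= 1 << b
        words.append(word)
    return words

# Reproduces the documented example: only lane 1 (value 2) matches,
# so the first word is 0b10 = 2 and the rest are 0.
out = vec_cmpv_eq_sketch(list(range(1, 129)), [2] * 128)
```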
vec_sel
Description
Selects elements bit-wise. 1'b1: selected from src0; 1'b0: selected from src1.
Prototype
vec_sel(mask, mode, dst, sel, src0, src1, repeat_times, dst_rep_stride=0, src0_rep_stride=0, src1_rep_stride=0)
PIPE: vector
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
mode |
Input |
Instruction mode. The options are as follows:
0: Select between two tensors based on sel. Multiple iterations are supported. Each iteration is based on the first 128 bits (if the destination operand is of type float16) or 64 bits (if the destination operand is of type float32) of sel.
1: Select between a tensor and a scalar bit-wise based on sel. Multiple iterations are supported.
2: Select between two tensors bit-wise based on sel. Multiple iterations are supported.
Ascend 910 AI Processor supports only mode 0. |
dst |
Output |
A tensor for the start element of the destination operand. Must be one of the following data types: Ascend 910 AI Processor: float16 or float32 |
sel |
Input |
Mask selection. Each bit indicates the selection of an element. In mode 0, 1, or 2, sel is a tensor of type uint8/uint16/uint32/uint64. In mode 1 or 2, elements are consumed continuously between iterations. |
src0 |
Input |
A tensor for the start element of source operand 0. Note: dst must have the same data type as src0 and src1. |
src1 |
Input |
A tensor for the start element of source operand 1. In mode 0 or 2, the argument is a tensor. In mode 1, the argument is a scalar or an immediate of type int/float. Note: dst must have the same data type as src0 and src1. |
repeat_times |
Input |
Number of iteration repeats |
dst_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the destination operand |
src0_rep_stride |
Input |
Block-to-block stride between adjacent iterations of source operand 0 |
src1_rep_stride |
Input |
Block-to-block stride between adjacent iterations of source operand 1 Note: This parameter is invalid in mode 1. |
Returns
None
Restrictions
- The mode argument must be an immediate.
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- dst_rep_stride, src0_rep_stride, and src1_rep_stride are within the range [0, 255], in the unit of 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- dst and src0 must be different tensors or the same element of the same tensor, not different elements of the same tensor. This also applies to dst and src1.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Example
Ascend 910 AI Processor:
from te import tik

tik_instance = tik.Tik()
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
sel = tik_instance.Tensor("uint16", (8,), name="sel", scope=tik.scope_ubuf)
tik_instance.vec_sel(128, 0, dst_ub, sel, src0_ub, src1_ub, 1, 8, 8, 8)
"""For example:
Inputs (float16):
src0_ub = {1,2,3,...,128}
src1_ub = {2,2,2,...,2}
sel = [2,0,0,0,0,0,0,0]
Returns:
dst_ub = {2,2,2,...,2}
"""
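The mode-0 bit selection can be sketched in plain Python. This is an illustrative model only (not TIK code); the uint16 word layout of sel, with bit b of word w selecting lane w*16 + b, is our assumption:

```python
def vec_sel_mode0_sketch(sel_words, src0, src1, n):
    """Illustrative model of vec_sel mode 0: for each lane i, pick
    src0[i] when the corresponding sel bit is 1, else src1[i]."""
    out = []
    for i in range(n):
        word, bit = divmod(i, 16)            # sel stored as uint16 words
        take_src0 = (sel_words[word] >> bit) & 1
        out.append(src0[i] if take_src0 else src1[i])
    return out

# Mirrors the documented example: sel = [2,0,...,0] sets only bit 1,
# so lane 1 comes from src0 (value 2) and every other lane from src1.
out = vec_sel_mode0_sketch([2] + [0] * 7, list(range(1, 129)), [2] * 128, 128)
```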
Data Conversion Instructions (Gather Mode)
vec_conv
Description
Converts the precision based on the data types of the src and dst tensors.
Prototype
vec_conv(mask, round_mode, dst, src, repeat_times, dst_rep_stride, src_rep_stride, deqscale=None, ldst_high_half=False)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
round_mode |
Input |
Rounding mode. The following string-based configurations are supported:
|
dst |
Output |
Destination operand |
src |
Input |
Source operand |
repeat_times |
Input |
Number of iteration repeats |
dst_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the destination operand |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the source operand |
deqscale |
Input |
Quantization scale, which is an auxiliary conversion parameter. Defaults to None. The argument is a scalar of type float16 or an immediate of type float. |
ldst_high_half |
Input |
A bool specifying whether dst_list or src_list stores or comes from the upper or lower half of each block. Defaults to False. True indicates the upper half, and False indicates the lower half. Note: This parameter defines different functions for different combinations, indicating the storage and read of dst_list and src_list respectively. Ascend 910 AI Processor does not support this parameter. |
src.dtype |
dst.dtype |
Supported round_mode |
deqscale |
---|---|---|---|
float16 |
int32 |
'round', 'floor', 'ceil', 'ceiling', 'away-zero', 'to-zero' |
None |
float32 |
int32 |
'round', 'floor', 'ceil', 'ceiling', 'away-zero', 'to-zero' |
None |
int32 |
float32 |
'', 'none' |
None |
float16 |
float32 |
'', 'none' |
None |
float32 |
float16 |
'', 'none', 'odd' |
None |
float16 |
int8 |
'', 'none', 'floor', 'ceil', 'ceiling', 'away-zero', 'to-zero' |
None |
float16 |
uint8 |
'', 'none', 'floor', 'ceil', 'ceiling', 'away-zero', 'to-zero' |
None |
int32 |
float16 |
'', 'none' |
Scalar (float16)/Immediate (float) |
uint8 |
float16 |
'', 'none' |
None |
int8 |
float16 |
'', 'none' |
None |
Returns
None
Restrictions
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- The degree of parallelism of each repeat depends on the data precision and chip version. For example, 64 source or destination elements are operated in each repeat during f32-to-f16 conversion.
- dst_rep_stride and src_rep_stride are within the range [0, 255], in the unit of 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The supported data types of dst and src are related to the chip version. If the data types are not supported, the tool reports an error.
- dst and src must be different tensors, or the same element of the same tensor, not different elements of the same tensor.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("int32", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_conv(64, "round", dst_ub, src_ub, 2, 8, 4)
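The round_mode names in the table can be illustrated with a plain-Python mapping. The exact hardware behavior is our assumption here (in particular, that 'round' means round half to even and 'away-zero' means round half away from zero); this sketch is not TIK code:

```python
import math

def tik_round_sketch(x, mode):
    """Illustrative mapping of vec_conv float-to-int rounding modes.
    Mode names follow the table above; tie-breaking behavior is assumed."""
    if mode == "round":                 # assumed: round half to even
        return round(x)
    if mode == "floor":
        return math.floor(x)
    if mode in ("ceil", "ceiling"):
        return math.ceil(x)
    if mode == "away-zero":             # assumed: round half away from zero
        return int(x + math.copysign(0.5, x))
    if mode == "to-zero":               # truncate toward zero
        return math.trunc(x)
    raise ValueError(f"unknown round_mode: {mode}")
```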
Pair Reduce
General Definition
Description
This is a generic format for a pair-reduce instruction, which uniformly processes adjacent pairs of source operands in each block of the current iteration. Note that it is not a real instruction.
Prototype
instruction (mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
instruction |
Input |
A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
dst |
Output |
Destination operand, which is the start element of the tensor |
src |
Input |
Source operand, which is the start element of the tensor |
repeat_times |
Input |
Number of iteration repeats |
dst_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the destination operand |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the source operand |
Restrictions
- repeat_times is within the range [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If an immediate is passed, 0 is not supported.
- The degree of parallelism of each repeat depends on the data precision and chip version. The following uses PAR to describe the degree of parallelism.
- dst_rep_stride and src_rep_stride are within the range [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If dst_rep_stride is set to 0, the value 1 is used.
- Note that dst_rep_stride is implemented differently here: its unit is 128 bytes.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
vec_cpadd
Description
Adds elements (odd and even) between adjacent pairs.
Prototype
vec_cpadd(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters. Tensor dst has the same data type as src (float16).
Returns
None
Restrictions
For details, see Restrictions.
Example
from te import tik

tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (64,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_cpadd(128, dst_ub, src_ub, 1, 1, 8)
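The adjacent-pair addition can be modeled in plain Python (illustrative sketch, not TIK code): each destination element is the sum of one (even, odd) pair of source elements, so the output has half as many elements as the input.

```python
def vec_cpadd_sketch(src):
    """Illustrative model of vec_cpadd: dst[i] = src[2*i] + src[2*i + 1]."""
    return [src[2 * i] + src[2 * i + 1] for i in range(len(src) // 2)]

print(vec_cpadd_sketch([1, 2, 3, 4]))   # pairs (1,2) and (3,4)
```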
Reduce
General Definition
Description
This is the generic format for the reduce instruction, which uniformly processes all source operands. Note that it is not a real instruction.
Prototype
instruction (mask, dst, src, work_tensor, repeat_times, src_rep_stride, cal_index=False)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
instruction |
Input |
A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
dst |
Output |
Destination operand, which is the start element of the tensor |
src |
Input |
Source operand, which is the start element of the tensor |
work_tensor |
Input |
A tensor for storing intermediate results during instruction execution. Pay attention to the required space size; for details, see the restrictions of each instruction. |
repeat_times |
Input |
Number of iteration repeats |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the source operand |
cal_index |
Input |
A bool that specifies whether to also obtain the index of the maximum or minimum value (supported only by vec_reduce_max and vec_reduce_min). Defaults to False. The options are as follows:
|
vec_reduce_add
Description
Adds all input data.
Each two data pieces are added in binary tree mode.
Assume that the source operand is 256 pieces of float16 data [data0, data1, data2, ..., data255]. The computation can be completed in two repeats as follows:
- [data0, data1, data2, ..., data127] is the source operand of the first repeat. result01 is obtained through the following calculation:
- Add data0 and data1 to obtain data00, add data2 and data3 to obtain data01, ..., add data124 and data125 to obtain data62, and add data126 and data127 to obtain data63.
- Add data00 and data01 to obtain data000, add data02 and data03 to obtain data001, ..., and add data62 and data63 to obtain data031.
- This rule applies until result01 is obtained.
- [data128, data129, data130, ..., data255] is the source operand of the second repeat. result02 is obtained in the same way.
- Add result01 and result02 to obtain the final result [data], whose destination operand is one float16 element.
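The binary-tree procedure above can be sketched in plain Python (illustrative, not TIK code): adjacent elements are summed level by level until one value remains.

```python
def tree_reduce_add(data):
    """Pairwise (binary-tree) summation: each level adds adjacent pairs;
    an odd trailing element is carried up unchanged."""
    data = list(data)
    while len(data) > 1:
        pairs = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:
            pairs.append(data[-1])
        data = pairs
    return data[0]

# 256 ones, as in the vec_reduce_add example below in this section.
print(tree_reduce_add([1] * 256))
```

Pairwise summation also tends to accumulate less floating-point error than a sequential loop, which is one reason hardware reductions are organized this way.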
Prototype
vec_reduce_add(mask, dst, src, work_tensor, repeat_times, src_rep_stride)
Parameters
dst, src, and work_tensor have the same data types. For details, see Parameters.
Ascend 910 AI Processor: tensors of type float16 or float32
Returns
None
Restrictions
- The work_tensor space requires at least repeat_times elements. For example, when repeat_times=120, the shape of work_tensor has at least 120 elements.
- repeat_times is within the range [1, 4095]. The argument is a scalar of type int32, an immediate of type int, or an Expr of type int32.
- src_rep_stride is within the range [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- Note that if the result overflows while adding two elements, there are two processing modes: return the defined maximum value, or return inf/nan. The mode is selected by the inf/nan control bit. In max-value mode, if the sum of float16 data is greater than 65504, the output is 65504. For example, for the source operand [60000, 60000, -30000, 100]: 60000 + 60000 > 65504, so the result overflows and the maximum value 65504 is used instead. Meanwhile, -30000 + 100 = -29900, and 65504 - 29900 = 35604.
- Address overlapping among src, dst, and work_tensor is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
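In max-value mode, the clamping behavior in the overflow example above can be modeled as follows. This is a plain-Python sketch; sat_add is a hypothetical helper for illustration, not a TIK API.

```python
F16_MAX = 65504.0  # largest finite float16 value

def sat_add(a, b):
    """Add two values and clamp the sum to the float16 range (max-value mode)."""
    return max(-F16_MAX, min(a + b, F16_MAX))

# Source operand from the example above: [60000, 60000, -30000, 100]
step1 = sat_add(60000.0, 60000.0)  # overflows, clamped to 65504.0
step2 = sat_add(-30000.0, 100.0)   # -29900.0, no overflow
final = sat_add(step1, step2)      # 65504.0 - 29900.0 = 35604.0
```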
Example
from te import tik
tik_instance = tik.Tik()
dst_ub = tik_instance.Tensor("float16", (32,), tik.scope_ubuf, "dst_ub")
src_ub = tik_instance.Tensor("float16", (256,), tik.scope_ubuf, "src_ub")
work_tensor_ub = tik_instance.Tensor("float16", (32,), tik.scope_ubuf, "work_tensor_ub")
tik_instance.vec_reduce_add(128, dst_ub, src_ub, work_tensor_ub, 2, 8)
Description:
Inputs: src_ub = [1, 1, 1, ..., 1]
Return: dst_ub = [256]
vec_reduce_max
Description
Obtains the maximum value and its corresponding index position among the input data. If there are multiple maximum values, see the restrictions to determine which one is returned.
Prototype
vec_reduce_max(mask, dst, src, work_tensor, repeat_times, src_rep_stride, cal_index=False)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
dst |
Input |
A tensor for the start element of the destination operand |
src |
Input |
A tensor for the start element of the source operand |
work_tensor |
Input |
A tensor that stores intermediate results during instruction execution. Pay attention to the required space size. For details, see the restrictions of each instruction. |
repeat_times |
Input |
Number of iteration repeats. The argument is a scalar of type int32, an immediate of type int, or an Expr of type int32. An immediate is recommended because it provides higher performance. |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the source operand. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
cal_index |
Input |
A bool that specifies whether to obtain the index of the maximum value. Defaults to False. The options are as follows:
|
dst, src, and work_tensor have the same data types. For details, see Parameters.
Ascend 910 AI Processor: tensors of type float16
Returns
None
Restrictions
- The argument of repeat_times is a scalar of type int32, an immediate of type int, or an Expr of type int32.
- When cal_index is set to False, repeat_times is within the range [1, 4095].
- When cal_index is set to True:
- If the operand data type is int16, the maximum value of the index (int16) is 32767, meaning that a maximum of 255 iterations are supported. Therefore, repeat_times is within the range [1, 255].
- If the operand data type is float16, the maximum value of the index (float16) is 65504, meaning that a maximum of 511 iterations are supported. Therefore, repeat_times is within the range [1, 511].
- Similarly, if the operand data type is float32, repeat_times is within the range [1, 4095].
- src_rep_stride is within the range [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The storage sequence of the dst result is: maximum value, then its index. The index is stored as an integer bit pattern. For example, if dst is defined as float16 but the index is uint16, reading the index in float16 format produces a wrong value. Therefore, call the reinterpret_cast_to() method to read the index data as the corresponding integer type.
- Restrictions for the work_tensor space are as follows:
- If cal_index is set to False, at least (repeat_times x 2) elements are required. For example, when repeat_times=120, the shape of work_tensor has at least 240 elements.
- When cal_index is set to True, the space size is calculated by using the following formula. For details about examples, see Example.
# DTYPE_SIZE indicates the data type size, in bytes. For example, float16 occupies 2 bytes.
elements_per_block = 32 // DTYPE_SIZE[dtype]    # Number of elements per block.
elements_per_repeat = 256 // DTYPE_SIZE[dtype]  # Number of elements per repeat.
it1_output_count = 2*repeat_times  # Number of elements generated in the first iteration.
# ceil_div performs division and rounds the result up.
it2_align_start = ceil_div(it1_output_count, elements_per_block)*elements_per_block  # Start offset of the second iteration.
it2_output_count = ceil_div(it1_output_count, elements_per_repeat)*2  # Number of elements generated in the second iteration.
it3_align_start = ceil_div(it2_output_count, elements_per_block)*elements_per_block  # Start offset of the third iteration.
it3_output_count = ceil_div(it2_output_count, elements_per_repeat)*2  # Number of elements generated in the third iteration.
it4_align_start = ceil_div(it3_output_count, elements_per_block)*elements_per_block  # Start offset of the fourth iteration.
it4_output_count = ceil_div(it3_output_count, elements_per_repeat)*2  # Number of elements generated in the fourth iteration.
final_work_tensor_need_size = it2_align_start + it3_align_start + it4_align_start + it4_output_count  # Required work_tensor size.
- Address overlapping between dst and work_tensor is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
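The formula above can be wrapped in a small helper to size work_tensor ahead of time. This is a plain-Python sketch; work_tensor_size is a hypothetical helper for illustration, not a TIK API.

```python
import math

DTYPE_SIZE = {"int16": 2, "float16": 2, "float32": 4}

def ceil_div(a, b):
    """Divide and round up."""
    return math.ceil(a / b)

def work_tensor_size(dtype, repeat_times):
    """work_tensor element count for cal_index=True, per the formula above."""
    elements_per_block = 32 // DTYPE_SIZE[dtype]
    elements_per_repeat = 256 // DTYPE_SIZE[dtype]
    it1_output_count = 2 * repeat_times
    it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block
    it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2
    it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block
    it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2
    it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block
    it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2
    return it2_align_start + it3_align_start + it4_align_start + it4_output_count

# float16 with repeat_times=65: 144 + 16 + 16 + 2 = 178 elements
```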
Example
from te import tik
tik_instance = tik.Tik()
dst_ub = tik_instance.Tensor("float16", (2,), tik.scope_ubuf, "dst_ub")
src_ub = tik_instance.Tensor("float16", (256,), tik.scope_ubuf, "src_ub")
work_tensor_ub = tik_instance.Tensor("float16", (18,), tik.scope_ubuf, "work_tensor_ub")
tik_instance.vec_reduce_max(128, dst_ub, src_ub, work_tensor_ub, 2, 8, cal_index=True)
- [Example 1]
src, work_tensor, and dst are tensors of type float16. src has shape (65, 128), and repeat_times of vec_reduce_max/vec_reduce_min is 65.
The following is an API calling example:
tik_instance.vec_reduce_max(128, dst, src, work_tensor, 65, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 16 (elements)
elements_per_repeat = 128 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 16) * 16 = 144 (elements)
it2_output_count = ceil_div(130, 128) * 2 = 4 (elements)
it3_align_start = ceil_div(4, 16) * 16 = 16 (elements)
it3_output_count = ceil_div(4, 128) * 2 = 2 (elements)
The final maximum value and its index can be obtained after three iterations. The required space of work_tensor is it2_align_start + it3_align_start + it3_output_count = 144 + 16 + 2 = 162 (elements).
- [Example 2]
src, work_tensor, and dst are tensors of type float16. src has shape (65, 128). repeat_times of vec_reduce_max and vec_reduce_min is a scalar with the value 65. If repeat_times is a scalar or contains a scalar, four iterations of calculation are required.
The following is an API calling example:
scalar = tik_instance.Scalar(init_value=65, dtype="int32")
tik_instance.vec_reduce_max(128, dst, src, work_tensor, scalar, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 16 (elements)
elements_per_repeat = 128 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 16) * 16 = 144 (elements)
it2_output_count = ceil_div(130, 128) * 2 = 4 (elements)
it3_align_start = ceil_div(4, 16) * 16 = 16 (elements)
it3_output_count = ceil_div(4, 128) * 2 = 2 (elements)
it4_align_start = ceil_div(2, 16) * 16 = 16 (elements)
it4_output_count = ceil_div(2, 128) * 2 = 2 (elements)
When repeat_times is a scalar or contains a scalar, the result is actually ready after the third round. However, because the scalar value is unknown at Python compilation time, a fourth round is reserved. work_tensor size = it2_align_start + it3_align_start + it4_align_start + it4_output_count = 144 + 16 + 16 + 2 = 178 (elements).
- [Example 3]
src, work_tensor, and dst are tensors of type float32. src has shape (65, 64), and repeat_times of vec_reduce_max/vec_reduce_min is 65.
The following is an API calling example:
tik_instance.vec_reduce_max(64, dst, src, work_tensor, 65, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 8 (elements)
elements_per_repeat = 64 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 8) * 8 = 136 (elements)
it2_output_count = ceil_div(130, 64) * 2 = 6 (elements)
it3_align_start = ceil_div(6, 8) * 8 = 8 (elements)
it3_output_count = ceil_div(6, 64) * 2 = 2 (elements)
The final maximum value and its index can be obtained after three iterations. The required space of work_tensor is it2_align_start + it3_align_start + it3_output_count = 136 + 8 + 2 = 146 (elements).
- [Example 4]
src, work_tensor, and dst are float32 tensors. The shape of src is (65, 64). repeat_times of vec_reduce_max and vec_reduce_min is a scalar with the value 65. If repeat_times is a scalar or contains a scalar, four iterations of calculation are required.
The following is an API calling example:
scalar = tik_instance.Scalar(init_value=65, dtype="int32")
tik_instance.vec_reduce_max(64, dst, src, work_tensor, scalar, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 8 (elements)
elements_per_repeat = 64 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 8) * 8 = 136 (elements)
it2_output_count = ceil_div(130, 64) * 2 = 6 (elements)
it3_align_start = ceil_div(6, 8) * 8 = 8 (elements)
it3_output_count = ceil_div(6, 64) * 2 = 2 (elements)
it4_align_start = ceil_div(2, 8) * 8 = 8 (elements)
it4_output_count = ceil_div(2, 64) * 2 = 2 (elements)
When repeat_times is a scalar or contains a scalar, the result is actually ready after the third round. However, because the scalar value is unknown at Python compilation time, a fourth round is reserved. work_tensor size = it2_align_start + it3_align_start + it4_align_start + it4_output_count = 136 + 8 + 8 + 2 = 154 (elements).
vec_reduce_min
Description
Obtains the minimum value and its corresponding index position among the input data. If there are multiple minimum values, see the restrictions to determine which one is returned.
Prototype
vec_reduce_min(mask, dst, src, work_tensor, repeat_times, src_rep_stride, cal_index=False)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
dst |
Input |
A tensor for the start element of the destination operand |
src |
Input |
A tensor for the start element of the source operand |
work_tensor |
Input |
A tensor that stores intermediate results during instruction execution. Pay attention to the required space size. For details, see the restrictions of each instruction. |
repeat_times |
Input |
Number of iteration repeats. The argument is a scalar of type int32, an immediate of type int, or an Expr of type int32. Immediate is recommended because it provides higher performance. |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the source operand. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
cal_index |
Input |
A bool that specifies whether to obtain the index of the minimum value (supported only by vec_reduce_max and vec_reduce_min). Defaults to False. The options are as follows:
|
dst, src, and work_tensor have the same data types. For details, see Parameters.
Ascend 910 AI Processor: tensors of type float16
Returns
None
Restrictions
- The argument of repeat_times is a scalar of type int32, an immediate of type int, or an Expr of type int32.
- When cal_index is set to False, repeat_times is within the range [1, 4095].
- When cal_index is set to True:
- If the operand data type is int16, the maximum value of the index (int16) is 32767, meaning that a maximum of 255 iterations are supported. Therefore, repeat_times is within the range [1, 255].
- If the operand data type is float16, the maximum value of the index (float16) is 65504, meaning that a maximum of 511 iterations are supported. Therefore, repeat_times is within the range [1, 511].
- Similarly, if the operand data type is float32, repeat_times is within the range [1, 4095].
- src_rep_stride is within the range [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The storage sequence of the dst result is: minimum value, then its index. The index is stored as an integer bit pattern. For example, if dst is defined as float16 but the index is uint16, reading the index in float16 format produces a wrong value. Therefore, call the reinterpret_cast_to() method to read the index data as the corresponding integer type.
- Restrictions for the work_tensor space are as follows:
- If cal_index is set to False, at least (repeat_times x 2) elements are required. For example, when repeat_times=120, the shape of work_tensor has at least 240 elements.
- When cal_index is set to True, the space size is calculated by using the following formula.
# DTYPE_SIZE indicates the data type size, in bytes. For example, float16 occupies 2 bytes.
elements_per_block = 32 // DTYPE_SIZE[dtype]    # Number of elements per block.
elements_per_repeat = 256 // DTYPE_SIZE[dtype]  # Number of elements per repeat.
it1_output_count = 2*repeat_times  # Number of elements generated in the first iteration.
# ceil_div performs division and rounds the result up.
it2_align_start = ceil_div(it1_output_count, elements_per_block)*elements_per_block  # Start offset of the second iteration.
it2_output_count = ceil_div(it1_output_count, elements_per_repeat)*2  # Number of elements generated in the second iteration.
it3_align_start = ceil_div(it2_output_count, elements_per_block)*elements_per_block  # Start offset of the third iteration.
it3_output_count = ceil_div(it2_output_count, elements_per_repeat)*2  # Number of elements generated in the third iteration.
it4_align_start = ceil_div(it3_output_count, elements_per_block)*elements_per_block  # Start offset of the fourth iteration.
it4_output_count = ceil_div(it3_output_count, elements_per_repeat)*2  # Number of elements generated in the fourth iteration.
final_work_tensor_need_size = it2_align_start + it3_align_start + it4_align_start + it4_output_count  # Required work_tensor size.
- Address overlapping between dst and work_tensor is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
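The reinterpret_cast_to() requirement noted above can be illustrated in plain Python: reading an integer index bit pattern as float16 yields a meaningless value, which is why the bits must be reinterpreted rather than value-converted. This sketch uses the standard struct module ('e' is the IEEE float16 format); the index value 5 is a hypothetical example.

```python
import struct

index = 5  # an example index value stored as raw uint16 bits in dst
bits = struct.pack('<H', index)            # the uint16 bit pattern
misread = struct.unpack('<e', bits)[0]     # the same bits read as float16
# misread is a tiny subnormal (5 * 2**-24, about 3e-07), not 5.0
```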
Example
from te import tik
tik_instance = tik.Tik()
dst_ub = tik_instance.Tensor("float16", (32,), tik.scope_ubuf, "dst_ub")
src_ub = tik_instance.Tensor("float16", (256,), tik.scope_ubuf, "src_ub")
work_tensor_ub = tik_instance.Tensor("float16", (18,), tik.scope_ubuf, "work_tensor_ub")
tik_instance.vec_reduce_min(128, dst_ub, src_ub, work_tensor_ub, 2, 8, cal_index=True)
- [Example 1]
src, work_tensor, and dst are tensors of type float16. src has shape (65, 128), and repeat_times of vec_reduce_max/vec_reduce_min is 65.
The following is an API calling example:
tik_instance.vec_reduce_min(128, dst, src, work_tensor, 65, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 16 (elements)
elements_per_repeat = 128 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 16) * 16 = 144 (elements)
it2_output_count = ceil_div(130, 128) * 2 = 4 (elements)
it3_align_start = ceil_div(4, 16) * 16 = 16 (elements)
it3_output_count = ceil_div(4, 128) * 2 = 2 (elements)
The final minimum value and its index can be obtained after three iterations. The required space of work_tensor is it2_align_start + it3_align_start + it3_output_count = 144 + 16 + 2 = 162 (elements).
- [Example 2]
src, work_tensor, and dst are tensors of type float16. src has shape (65, 128). repeat_times of vec_reduce_max and vec_reduce_min is a scalar with the value 65. If repeat_times is a scalar or contains a scalar, four iterations of calculation are required.
The following is an API calling example:
scalar = tik_instance.Scalar(init_value=65, dtype="int32")
tik_instance.vec_reduce_min(128, dst, src, work_tensor, scalar, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 16 (elements)
elements_per_repeat = 128 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 16) * 16 = 144 (elements)
it2_output_count = ceil_div(130, 128) * 2 = 4 (elements)
it3_align_start = ceil_div(4, 16) * 16 = 16 (elements)
it3_output_count = ceil_div(4, 128) * 2 = 2 (elements)
it4_align_start = ceil_div(2, 16) * 16 = 16 (elements)
it4_output_count = ceil_div(2, 128) * 2 = 2 (elements)
When repeat_times is a scalar or contains a scalar, the result is actually ready after the third round. However, because the scalar value is unknown at Python compilation time, a fourth round is reserved. work_tensor size = it2_align_start + it3_align_start + it4_align_start + it4_output_count = 144 + 16 + 16 + 2 = 178 (elements).
- [Example 3]
src, work_tensor, and dst are tensors of type float32. src has shape (65, 64), and repeat_times of vec_reduce_max/vec_reduce_min is 65.
The following is an API calling example:
tik_instance.vec_reduce_min(64, dst, src, work_tensor, 65, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 8 (elements)
elements_per_repeat = 64 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 8) * 8 = 136 (elements)
it2_output_count = ceil_div(130, 64) * 2 = 6 (elements)
it3_align_start = ceil_div(6, 8) * 8 = 8 (elements)
it3_output_count = ceil_div(6, 64) * 2 = 2 (elements)
The final minimum value and its index can be obtained after three iterations. The required space of work_tensor is it2_align_start + it3_align_start + it3_output_count = 136 + 8 + 2 = 146 (elements).
- [Example 4]
src, work_tensor, and dst are float32 tensors. The shape of src is (65, 64). repeat_times of vec_reduce_max and vec_reduce_min is a scalar with the value 65. If repeat_times is a scalar or contains a scalar, four iterations of calculation are required.
The following is an API calling example:
scalar = tik_instance.Scalar(init_value=65, dtype="int32")
tik_instance.vec_reduce_min(64, dst, src, work_tensor, scalar, 8, cal_index=True)
The space of work_tensor is calculated as follows:
elements_per_block = 8 (elements)
elements_per_repeat = 64 (elements)
it1_output_count = 2 * 65 = 130 (elements)
it2_align_start = ceil_div(130, 8) * 8 = 136 (elements)
it2_output_count = ceil_div(130, 64) * 2 = 6 (elements)
it3_align_start = ceil_div(6, 8) * 8 = 8 (elements)
it3_output_count = ceil_div(6, 64) * 2 = 2 (elements)
it4_align_start = ceil_div(2, 8) * 8 = 8 (elements)
it4_output_count = ceil_div(2, 64) * 2 = 2 (elements)
When repeat_times is a scalar or contains a scalar, the result is actually ready after the third round. However, because the scalar value is unknown at Python compilation time, a fourth round is reserved. work_tensor size = it2_align_start + it3_align_start + it4_align_start + it4_output_count = 136 + 8 + 8 + 2 = 154 (elements).
Matrix Computation
conv2d
Description
Performs 2D convolution on an input tensor and a weight tensor and outputs a result tensor.
The following data types are supported (feature_map:weight:dst):
(1) uint8:int8:int32
(2) int8:int8:int32
(3) float16:float16:float32
Prototype
conv2d(dst, feature_map, weight, fm_shape, kernel_shape, stride, pad, dilation, pad_value=0, init_l1out=True)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
dst |
Output |
Start element of the destination operand. For details about data type restrictions, see Table 10-39. The scope is L1OUT. Has format [Cout/16, Ho, Wo, 16] and size Cout * Ho * Wo, where Ho and Wo are calculated as follows: Ho = floor((H + pad_top + pad_bottom – dilation_h * (Kh – 1) – 1) / stride_h + 1); Wo = floor((W + pad_left + pad_right – dilation_w * (Kw – 1) – 1) / stride_w + 1). The hardware requires that Ho * Wo be a multiple of 16. When defining the dst tensor, round the shape up to the nearest multiple of 16: the actual size should be Cout * round_howo, where round_howo = ceil(Ho * Wo / 16) * 16. The invalid data introduced by the round-up is removed in the subsequent fixpipe operation. |
feature_map |
Input |
Start element of the input tensor operand. For details about data type restrictions, see Table 10-39. The scope is L1. |
weight |
Input |
Start element of the weight tensor operand. For details about the data type restrictions, see Table 10-39. The scope is L1. |
fm_shape |
Input |
Shape of the input tensor, in the format of [C1, H, W, C0]. C1 * C0 indicates the number of input channels.
H is an immediate of type int, specifying the height. The value range is [1, 4096]. W is an immediate of type int, specifying the width. The value range is [1, 4096]. |
kernel_shape |
Input |
Shape of each convolution kernel tensor, in the format of [C1, Kh, Kw, Cout, C0]. C1 * C0 indicates the number of input channels.
Cout is an int specifying the number of convolution kernels. The value is a multiple of 16 within the range [16, 4096]. Kh is an int specifying the height of each convolution kernel. The value range is [1, 255]. Kw is an int specifying the width of each convolution kernel. The value range is [1, 255]. |
stride |
Input |
Convolution stride, in the format of [stride_h, stride_w]. stride_h: an int specifying the height stride. The value range is [1, 63]. stride_w: an int specifying the width stride. The value range is [1, 63]. |
pad |
Input |
Padding factors, in the format of [pad_left, pad_right, pad_top, pad_bottom]. pad_left: an int specifying the number of columns to be padded to the left of the feature_map. The value range is [0, 255]. pad_right: an int specifying the number of columns to be padded to the right of the feature_map. The value range is [0, 255]. pad_top: an int specifying the number of rows to be padded to the top of the feature_map. The value range is [0, 255]. pad_bottom: an int specifying the number of rows to be padded to the bottom of the feature_map. The value range is [0, 255]. |
dilation |
Input |
Convolution dilation factors, in the format of [dilation_h, dilation_w]. dilation_h: an int specifying the height dilation factor. The value range is [1, 255]. dilation_w: an int specifying the width dilation factor. The value range is [1, 255]. The width and height of the dilated convolution kernel are calculated as follows: width = dilation_w * (Kw – 1) + 1; height = dilation_h * (Kh – 1) + 1 |
pad_value |
Input |
Padding value, an immediate of type int or float. Defaults to 0. Value range: If feature_map is of type uint8, pad_value is within the range [0, 255]. If feature_map is of type int8, pad_value is within the range [–128, +127]. If feature_map is of type uint8 or int8, pad_value is an immediate of type int. If feature_map is of type float16, pad_value is within the range [–65504, +65504]. |
init_l1out |
Input |
A bool specifying whether to initialize dst. Defaults to True.
|
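The Ho/Wo and round_howo formulas in the table above can be checked with a small helper. This is a plain-Python sketch; conv2d_out_shape is a hypothetical helper for illustration, not a TIK API.

```python
import math

def conv2d_out_shape(h, w, kh, kw, stride, pad, dilation):
    """Output height/width and the 16-aligned howo used to size the dst tensor."""
    stride_h, stride_w = stride
    pad_left, pad_right, pad_top, pad_bottom = pad
    dilation_h, dilation_w = dilation
    ho = (h + pad_top + pad_bottom - dilation_h * (kh - 1) - 1) // stride_h + 1
    wo = (w + pad_left + pad_right - dilation_w * (kw - 1) - 1) // stride_w + 1
    round_howo = math.ceil(ho * wo / 16) * 16
    return ho, wo, round_howo

# 4x4 input, 2x2 kernel, stride 1, no padding, no dilation -> Ho = Wo = 3, round_howo = 16
```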
Restrictions
- It takes a long time to perform step-by-step debugging. Therefore, step-by-step debugging is not recommended.
- This instruction must not be used together with the vector instructions.
- This instruction should be used together with the fixpipe instruction.
- This instruction does not support the scenario where W equals Kw and H is greater than Kh; such a configuration produces unexpected results.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Returns
None
Examples
Example 1: feature_map:weight:dst of type uint8:int8:int32
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
feature_map_gm = tik_instance.Tensor("uint8", [1, 4, 4, 32], name='feature_map_gm', scope=tik.scope_gm)
weight_gm = tik_instance.Tensor("int8", [1, 2, 2, 32, 32], name='weight_gm', scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("int32", [2, 9, 16], name='dst_gm', scope=tik.scope_gm)
feature_map = tik_instance.Tensor("uint8", [1, 4, 4, 32], name='feature_map', scope=tik.scope_cbuf)
weight = tik_instance.Tensor("int8", [1, 2, 2, 32, 32], name='weight', scope=tik.scope_cbuf)
# dst has shape [2, 16, 16]: cout = 32, cout_blocks = 2, ho = 3, wo = 3, howo = 9, so round_howo = 16.
dst = tik_instance.Tensor("int32", [2, 16, 16], name='dst', scope=tik.scope_cbuf_out)
# Move data from the GM to the source operand tensors.
tik_instance.data_move(feature_map, feature_map_gm, 0, 1, 16, 0, 0)
tik_instance.data_move(weight, weight_gm, 0, 1, 128, 0, 0)
# Perform convolution.
tik_instance.conv2d(dst, feature_map, weight, [1, 4, 4, 32], [1, 2, 2, 32, 32], [1, 1], [0, 0, 0, 0], [1, 1], 0)
# Move dst from L1OUT to the GM by co-working with the fixpipe instruction.
# cout_blocks = 2, cburst_num = 2, burst_len = howo * 16 * src_dtype_size / 32 = 9 * 16 * 4 / 32 = 18
tik_instance.fixpipe(dst_gm, dst, 2, 18, 0, 0, extend_params=None)
tik_instance.BuildCCE(kernel_name="conv2d", inputs=[feature_map_gm, weight_gm], outputs=[dst_gm])
Inputs:
feature_map_gm: [[[[2, 4, 2, 3, 2, ..., 3, 3, 0]]]]
weight_gm: [[[[[-3, -5, -4, ..., -2, -4, -2]]]]]
Returns:
dst_gm: [[[-230, -11, -83, -103, -123, ..., -174, -255]]]
Example 2: feature_map:weight:dst of type float16:float16:float32
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
feature_map_gm = tik_instance.Tensor("float16", [2, 4, 4, 16], name='feature_map_gm', scope=tik.scope_gm)
weight_gm = tik_instance.Tensor("float16", [2, 2, 2, 16, 16], name='weight_gm', scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float32", [1, 4, 16], name='dst_gm', scope=tik.scope_gm)
feature_map = tik_instance.Tensor("float16", [2, 4, 4, 16], name='feature_map', scope=tik.scope_cbuf)
weight = tik_instance.Tensor("float16", [2, 2, 2, 16, 16], name='weight', scope=tik.scope_cbuf)
# dst has shape [1, 16, 16]: cout = 16, cout_blocks = 1, ho = 2, wo = 2, howo = 4, so round_howo = 16.
dst = tik_instance.Tensor("float32", [1, 16, 16], name='dst', scope=tik.scope_cbuf_out)
# Move data from the GM to the source operand tensors.
tik_instance.data_move(feature_map, feature_map_gm, 0, 1, 32, 0, 0)
tik_instance.data_move(weight, weight_gm, 0, 1, 128, 0, 0)
# Perform convolution.
tik_instance.conv2d(dst, feature_map, weight, [2, 4, 4, 16], [2, 2, 2, 16, 16], [1, 1], [0, 0, 0, 0], [2, 2], 0)
# Move dst from L1OUT to the GM by co-working with the fixpipe instruction.
# cout_blocks = 1, cburst_num = 1, burst_len = howo * 16 * src_dtype_size / 32 = 4 * 16 * 4 / 32 = 8
tik_instance.fixpipe(dst_gm, dst, 1, 8, 0, 0, extend_params=None)
tik_instance.BuildCCE(kernel_name="conv2d", inputs=[feature_map_gm, weight_gm], outputs=[dst_gm])
Inputs:
feature_map_gm: [[[[0.0, 0.01, 0.02, 0.03, 0.04, ..., 5.09, 5.1, 5.11]]]]
weight_gm: [[[[[0.0, 0.01, 0.02, 0.03, 0.04, ..., 20.46, 20.47]]]]]
Returns:
dst_gm: [[[3568.7373, 3612.8433, 3657.0618, 3701.162 , 3745.287 , 3789.4834, 3833.6282, 3877.876 , 3921.9812, 3966.0745, 4010.311 , 4054.4119, 4098.5713, 4142.702 , 4186.8457, 4231.0312],
[3753.9888, 3801.3733, 3848.8735, 3896.2534, 3943.6558, 3991.1353, 4038.5586, 4086.0913, 4133.4736, 4180.8457, 4228.3643, 4275.745 , 4323.1826, 4370.5947, 4418.016 , 4465.4844],
[4309.196 , 4366.4077, 4423.745 , 4480.9565, 4538.1816, 4595.5054, 4652.755 , 4710.135 , 4767.34 , 4824.5405, 4881.897 , 4939.1104, 4996.374 , 5053.6226, 5110.871 , 5168.179 ],
[4494.4526, 4554.944 , 4615.564 , 4676.0557, 4736.5586, 4797.166 , 4857.695 , 4918.3604, 4978.8433, 5039.323 , 5099.9624, 5160.456 , 5220.999 , 5281.5293, 5342.0566, 5402.6475]]]
fixpipe
Description
Processes the matrix computation result (for example, adds a bias to and quantizes the computation result) and moves the data from the L1OUT buffer to the GM.
Prototype
fixpipe(dst, src, cburst_num, burst_len, dst_stride, src_stride, extend_params=None)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
dst |
Output |
A tensor of type float16, float32, or int32, for the start element of the destination operand. For details about data type restrictions, see Table 10-41. The scope is GM. After fixpipe processing, the extra data allocated during matrix computation is deleted in addition to the offset and quantization operations. If this API is used to process the conv2d result, the format is [cout_blocks, howo, 16]. If this API is used to process the matmul result, the format is [N1, m, N0]. Note: For the meanings of cout_blocks and howo, see the parameter description of conv2d in Parameters. For the meanings of N1, m, and N0, see parameter description of matmul in Parameters. |
src |
Input |
A tensor of type float32 or int32, for the start element of the source operand. For details about data type restrictions, see Table 10-41. The scope is L1OUT. The source operand is the result of matrix computation. If this API is used to process the conv2d result, the format is [cout_blocks, round_howo, 16]. If this API is used to process the matmul result, the format is [N1, M, N0]. Note: For the meanings of cout_blocks and round_howo, see the parameter description of conv2d in Parameters. For the meanings of N1, M, and N0, see the parameter description of matmul in Parameters. |
cburst_num |
Input |
An immediate of type int specifying the number of bursts. The value range is [1, 4095]. If this API is used to process the conv2d result, the format is [cout_blocks, round_howo, 16], where, cburst_num is set to cout_blocks. If this API is used to process the matmul result, the format is [N1, M, N0], where, cburst_num is set to N1. Note: For the meanings of cout_blocks and round_howo, see the parameter description of conv2d in Parameters. For the meanings of N1, M, and N0, see parameter description of matmul in Parameters. |
burst_len |
Input |
Burst length, in the unit of 32 bytes. The value is an even number within the range [2, 65535]. The argument is an immediate of type int. For src, the valid data segment length of each burst is as follows:
|
dst_stride |
Input |
Tail-to-header stride between adjacent bursts of the dst operand tensor, in the unit of 32 bytes. The value range is [0, 65535]. The argument is an immediate of type int. |
src_stride |
Input |
Tail-to-header stride between adjacent bursts of the src operand tensor, in the unit of 256 elements. The value range is [0, 65535]. The argument is an immediate of type int. This parameter is reserved. To ensure data accuracy, pass 0. |
extend_params |
Input |
A dictionary of extended parameters. Defaults to None. Currently, three keys are supported: bias, quantize_params, and relu, which are described as follows: 1. key "bias" value: Defaults to None, indicating bias disabled. To enable bias, specify the value as the start element of the bias operand. Has the same data type as src (a tensor of type int32 or float32). Has shape [Cout, ]. Cout: number of convolution kernels if src is the output of conv2d; or the length in the N dimension if src is the output of matmul. The tensor scope is L1. 2. key "quantize_params" value: Defaults to None, indicating quantization disabled. If enabled, the value is a dictionary of two keys: "mode" and "mode_param". The mode argument is a string, for the quantization mode:
mode_param has the following meanings:
3. key "relu" value: Defaults to False. Must be a bool. False indicates the ReLU function is disabled. True indicates that the ReLU function is enabled. Notes:
|
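For a conv2d result, each burst carries howo * 16 elements, so burst_len can be derived from howo and the src element size. This is a plain-Python sketch based on the comments in the conv2d examples; fixpipe_burst_len is a hypothetical helper, not a TIK API.

```python
def fixpipe_burst_len(howo, src_dtype_size):
    """burst_len in 32-byte units for one cburst of a conv2d result."""
    burst_len = howo * 16 * src_dtype_size // 32
    assert burst_len % 2 == 0, "burst_len must be an even number"
    return burst_len

# conv2d Example 1: howo = 9, int32 src (4 bytes per element) -> burst_len = 18
```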
Restrictions
- It takes a long time to perform step-by-step debugging. Therefore, step-by-step debugging is not recommended.
- The functions enabled in extend_params are executed in the following sequence:
- This instruction must not be used together with the vector instructions.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
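The relationship between cburst_num, burst_len, and the shape of a matmul result can be sketched in plain Python. The helper below is illustrative only (it is not part of the TIK API); it follows the formula used in the matmul examples later in this document, where burst_len = M * N0 * dtype_size // 32.

```python
def fixpipe_params_for_matmul(n1, m, n0, dtype_size):
    """Derive (cburst_num, burst_len) for a matmul result of shape [N1, M, N0].

    cburst_num is N1; burst_len is the valid data of one burst
    (M * N0 elements) expressed in 32-byte units.
    """
    burst_len = m * n0 * dtype_size // 32
    return n1, burst_len

# For an int32 [10, 30, 16] matmul result (as in the matmul examples):
# burst_len = 30 * 16 * 4 // 32 = 60
print(fixpipe_params_for_matmul(10, 30, 16, 4))  # -> (10, 60)
```

Note that the computed burst_len must also satisfy the documented constraint of being an even value within [2, 65535].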
Returns
None
Examples
Example 1: src is of type int32 and dst is of type float16, bias is disabled, and mode_param is a tensor argument.
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
feature_map_gm = tik_instance.Tensor("uint8", [1, 4, 4, 32], name='feature_map_gm', scope=tik.scope_gm)
weight_gm = tik_instance.Tensor("int8", [1, 2, 2, 32, 32], name='weight_gm', scope=tik.scope_gm)
deqscale_gm = tik_instance.Tensor("float16", [16], name='deqscale_gm', scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", [2, 9, 16], name='dst_gm', scope=tik.scope_gm)
feature_map = tik_instance.Tensor("uint8", [1, 4, 4, 32], name='feature_map', scope=tik.scope_cbuf)
weight = tik_instance.Tensor("int8", [1, 2, 2, 32, 32], name='weight', scope=tik.scope_cbuf)
deqscale = tik_instance.Tensor("float16", [16], name='deqscale', scope=tik.scope_cbuf)
dst_l1out = tik_instance.Tensor("int32", [2, 16, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
# Move data from the GM to the source operand tensor.
tik_instance.data_move(feature_map, feature_map_gm, 0, 1, 16, 0, 0)
tik_instance.data_move(weight, weight_gm, 0, 1, 128, 0, 0)
tik_instance.data_move(deqscale, deqscale_gm, 0, 1, 1, 0, 0)
# Perform convolution.
tik_instance.conv2d(dst_l1out, feature_map, weight, [1, 4, 4, 32], [1, 2, 2, 32, 32], [1, 1], [0, 0, 0, 0], [1, 1], 0)
# Perform quantization using fixpipe.
tik_instance.fixpipe(dst_gm, dst_l1out, 2, 18, 0, 0, extend_params={"bias": None, "quantize_params": {"mode": "int322fp16", "mode_param": deqscale}})
tik_instance.BuildCCE(kernel_name="conv2d", inputs=[feature_map_gm, weight_gm, deqscale_gm], outputs=[dst_gm])

Inputs:
feature_map_gm: [[[[3, 2, 4, 2, ..., 4, 3]]]]
weight_gm: [[[[[0, -5, -3, ..., -4, -2]]]]]
deqscale_gm: [0.1214, -0.2238, ..., 0.4883, 0.2788]
Returns:
dst_gm: [[[-13.48, 39.38, -114.8, 30.38, ..., 9.766, -24.81]]]
Example 2: src is of type float32 and dst is of type float16, bias is enabled, and mode_param is None.
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
feature_map_gm = tik_instance.Tensor("float16", [2, 4, 4, 16], name='feature_map_gm', scope=tik.scope_gm)
weight_gm = tik_instance.Tensor("float16", [2, 2, 2, 16, 16], name='weight_gm', scope=tik.scope_gm)
bias_gm = tik_instance.Tensor("float32", (16,), name='bias_gm', scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", [1, 4, 16], name='dst_gm', scope=tik.scope_gm)
feature_map = tik_instance.Tensor("float16", [2, 4, 4, 16], name='feature_map', scope=tik.scope_cbuf)
weight = tik_instance.Tensor("float16", [2, 2, 2, 16, 16], name='weight', scope=tik.scope_cbuf)
bias = tik_instance.Tensor("float32", (16,), name='bias', scope=tik.scope_cbuf)
dst_l1out = tik_instance.Tensor("float32", [1, 16, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
# Move data from the GM to the source operand tensor.
tik_instance.data_move(feature_map, feature_map_gm, 0, 1, 32, 0, 0)
tik_instance.data_move(weight, weight_gm, 0, 1, 128, 0, 0)
tik_instance.data_move(bias, bias_gm, 0, 1, 2, 0, 0)
# Perform convolution.
tik_instance.conv2d(dst_l1out, feature_map, weight, [2, 4, 4, 16], [2, 2, 2, 16, 16], [1, 1], [0, 0, 0, 0], [2, 2], 0)
# Perform bias and quantization using fixpipe.
tik_instance.fixpipe(dst_gm, dst_l1out, 1, 8, 0, 0, extend_params={"bias": bias, "quantize_params": {"mode": "fp322fp16", "mode_param": None}})
tik_instance.BuildCCE(kernel_name="conv2d", inputs=[feature_map_gm, weight_gm, bias_gm], outputs=[dst_gm])

Inputs:
feature_map_gm: [[[[0.0, 0.01, 0.02, 0.03, 0.04, ..., 5.09, 5.1, 5.11]]]]
weight_gm: [[[[[0.0, 0.01, 0.02, 0.03, 0.04, ..., 20.46, 20.47]]]]]
bias_gm: [0.0, 1.0, 2.0, 3.0, ..., 14.0, 15.0]
Returns:
dst_gm: [[[3568., 3614., 3660., 3704., 3750., 3794., 3840., 3884., 3930., 3976., 4020., 4066., 4110., 4156., 4200., 4250.],
  [3754., 3802., 3850., 3900., 3948., 3996., 4044., 4094., 4140., 4188., 4240., 4290., 4336., 4384., 4430., 4480.],
  [4308., 4370., 4424., 4484., 4544., 4600., 4660., 4716., 4776., 4830., 4892., 4950., 5010., 5068., 5124., 5184.],
  [4496., 4556., 4616., 4680., 4740., 4804., 4864., 4924., 4988., 5050., 5108., 5172., 5230., 5296., 5356., 5416.]]]
matmul
Description
Multiplies matrix a by matrix b and outputs a result tensor.
For details about the data type restrictions, see Table 10-43.
Prototype
matmul(dst, a, b, m, k, n, init_l1out=True)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
dst |
Output |
Start element of the destination operand. For details about the data type restrictions, see Table 10-43. The scope is L1OUT. A tensor in the format of [N1, M, N0], where N = N1 * N0
|
a |
Input |
Source operand, matrix tensor a. For details about the data type restrictions, see Table 10-43. The scope is L1. A tensor in the format of [K1, M, K0], where K = K1 * K0
|
b |
Input |
Source operand, matrix tensor b. For details about the data type restrictions, see Table 10-43. The scope is L1. A tensor in the format of [K1, N, K0], where K = K1 * K0
|
m |
Input |
An immediate of type int specifying the valid height of matrix a. The value range is [1, 4096]. Note: The m argument does not need to be rounded up to a multiple of 16. |
k |
Input |
An immediate of type int specifying the valid width of matrix a and the valid height of matrix b. If matrix a is of type float16, the value range is [1, 16384]. If matrix a is of type int8/uint8, the value range is [1, 32768]. Note: The k argument does not need to be rounded up to a multiple of 16. |
n |
Input |
An immediate of type int specifying the valid width of matrix b. The value range is [1, 4096]. Note: The n argument does not need to be rounded up to a multiple of 16. |
init_l1out |
Input |
A bool specifying whether to initialize dst. Defaults to True.
|
Restrictions
- It takes a long time to perform step-by-step debugging. Therefore, step-by-step debugging is not recommended.
- The tensor[immediate] or tensor[scalar] format indicates a 1-element tensor. To specify the computation start (with an offset), use the tensor[immediate:] or tensor[scalar:] format.
- For Ascend 910 AI Processor, the start addresses of the source operands a and b of the instruction must be 512-byte aligned. For example, when tensor slices are input and the source operand is of type float16, tensor[256:] can be used. However, tensor[2:] does not meet the alignment requirement, and an unknown error may occur.
- The start address of the destination operand dst must be 1024-byte aligned. For example, when tensor slices are input and the destination operand is of type float32, tensor[256:] can be used. However, tensor[2:] does not meet the alignment requirement, and an unknown error may occur.
- This instruction must not be used together with the vector instructions.
- The m, k, and n arguments do not need to be rounded up to multiples of 16. However, due to hardware restrictions, the shapes of operands dst, a, and b must meet the following alignment requirements: the m and n dimensions must be rounded up to multiples of 16, and the k dimension must be rounded up to a multiple of 16 or 32, depending on the operand data type.
- When n is not a multiple of 16, the invalid data in the n dimension of dst needs to be processed by the user. When m is not a multiple of 16, the invalid data in the m dimension of dst can be deleted in the fixpipe instruction. The following figure shows the implementation diagram of the matmul API. The rightmost data block is the output result after dst is processed by the fixpipe API.
- This instruction should be used together with the fixpipe instruction.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
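Under the rounding rules above, the required operand shapes can be sketched in plain Python. The helper below is illustrative only (it is not part of the TIK API); it assumes K0 is 32 for int8/uint8 and 16 otherwise, consistent with the k value ranges stated in the parameter table.

```python
def matmul_operand_shapes(m, k, n, dtype="float16"):
    """Return the padded shapes of a, b, and dst for valid sizes m, k, n."""
    def round_up(x, base):
        return -(-x // base) * base        # ceiling division, then scale
    k0 = 32 if dtype in ("int8", "uint8") else 16
    m_pad = round_up(m, 16)
    n_pad = round_up(n, 16)
    k_pad = round_up(k, k0)
    a_shape = [k_pad // k0, m_pad, k0]     # [K1, M, K0]
    b_shape = [k_pad // k0, n_pad, k0]     # [K1, N, K0]
    dst_shape = [n_pad // 16, m_pad, 16]   # [N1, M, N0]
    return a_shape, b_shape, dst_shape

# m=30, k=64, n=160 with int8 operands, as in Example 1 of this section:
print(matmul_operand_shapes(30, 64, 160, "int8"))
# -> ([2, 32, 32], [2, 160, 32], [10, 32, 16])
```

These padded shapes match the a_l1, b_l1, and dst_l1out tensors declared in the examples; the float16 variant of the same sizes yields a_l1 = [4, 32, 16], as in Example 2.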
Returns
None
Examples
Example 1: Matrix a and matrix b are of type int8, dst is of type int32, and ReLU is implemented using fixpipe.
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
a_gm = tik_instance.Tensor("int8", [2, 32, 32], name='a_gm', scope=tik.scope_gm)
b_gm = tik_instance.Tensor("int8", [2, 160, 32], name='b_gm', scope=tik.scope_gm)
# For matmul, m = 30. The fixpipe instruction deletes invalid data from dst_l1out. Therefore, set the m dimension of dst_gm to 30.
dst_gm = tik_instance.Tensor("int32", [10, 30, 16], name='dst_gm', scope=tik.scope_gm)
a_l1 = tik_instance.Tensor("int8", [2, 32, 32], name='a_l1', scope=tik.scope_cbuf)
b_l1 = tik_instance.Tensor("int8", [2, 160, 32], name='b_l1', scope=tik.scope_cbuf)
dst_l1out = tik_instance.Tensor("int32", [10, 32, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
# Move data to the source operand.
tik_instance.data_move(a_l1, a_gm, 0, 1, 64, 0, 0)
tik_instance.data_move(b_l1, b_gm, 0, 1, 320, 0, 0)
# Perform matmul. The m, k, and n arguments are 30, 64, and 160, respectively. The m dimension of dst_l1out is rounded up to 32.
tik_instance.matmul(dst_l1out, a_l1, b_l1, 30, 64, 160)
# Move data to dst_gm, where burst_len = 30 * 16 * dst_l1out_dtype_size // 32 = 60.
tik_instance.fixpipe(dst_gm, dst_l1out, 10, 60, 0, 0, extend_params={"relu": True})
tik_instance.BuildCCE(kernel_name="matmul", inputs=[a_gm, b_gm], outputs=[dst_gm])

Inputs:
a_l1 = [[[-1, -1, -1, ..., -1, -1, -1] ... [-1, -1, -1, ..., -1, -1, -1]] [[-1, -1, -1, ..., -1, -1, -1] ... [-1, -1, -1, ..., -1, -1, -1]]]
b_l1 = [[[1, 1, 1, ..., 1, 1, 1] ... [1, 1, 1, ..., 1, 1, 1]] [[1, 1, 1, ..., 1, 1, 1] ... [1, 1, 1, ..., 1, 1, 1]]]
Returns:
dst_gm = [[[0, 0, 0, ..., 0, 0, 0] ... [0, 0, 0, ..., 0, 0, 0]] ... [[0, 0, 0, ..., 0, 0, 0] ... [0, 0, 0, ..., 0, 0, 0]]]
Example 2: Matrix a and matrix b are of type float16, dst is of type float32, and element-wise addition is implemented using fixpipe.
from te import tik
tik_instance = tik.Tik()
# Define the tensors.
a_gm = tik_instance.Tensor("float16", [4, 32, 16], name='a_gm', scope=tik.scope_gm)
b_gm = tik_instance.Tensor("float16", [4, 160, 16], name='b_gm', scope=tik.scope_gm)
element_wise_add_gm = tik_instance.Tensor("float32", [10, 32, 16], name='element_wise_add_gm', scope=tik.scope_gm)
# For matmul, m = 30. The fixpipe instruction deletes invalid data from dst_l1out. Therefore, set the m dimension of dst_gm to 30.
dst_gm = tik_instance.Tensor("float32", [10, 30, 16], name='dst_gm', scope=tik.scope_gm)
a_l1 = tik_instance.Tensor("float16", [4, 32, 16], name='a_l1', scope=tik.scope_cbuf)
b_l1 = tik_instance.Tensor("float16", [4, 160, 16], name='b_l1', scope=tik.scope_cbuf)
element_wise_add = tik_instance.Tensor("float32", [10, 32, 16], name='element_wise_add', scope=tik.scope_cbuf)
dst_l1out = tik_instance.Tensor("float32", [10, 32, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
# Move data to the source operand.
tik_instance.data_move(a_l1, a_gm, 0, 1, 128, 0, 0)
tik_instance.data_move(b_l1, b_gm, 0, 1, 640, 0, 0)
tik_instance.data_move(element_wise_add, element_wise_add_gm, 0, 1, 640, 0, 0)
# Perform matmul. The m, k, and n arguments are 30, 64, and 160, respectively. The m dimension of dst_l1out is rounded up to 32.
tik_instance.matmul(dst_l1out, a_l1, b_l1, 30, 64, 160)
# Move data to dst_gm, where burst_len = 30 * 16 * dst_l1out_dtype_size // 32 = 60.
tik_instance.fixpipe(dst_gm, dst_l1out, 10, 60, 0, 0, extend_params={"element-wise-add": element_wise_add})
tik_instance.BuildCCE(kernel_name="matmul", inputs=[a_gm, b_gm, element_wise_add_gm], outputs=[dst_gm])

Inputs:
a_l1 = [[[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] ... [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]] ... [[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] ... [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]]]
b_l1 = [[[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] ... [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]] ... [[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] ... [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]]]
element_wise_add = [[[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] ... [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]] ... [[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] ... [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]]]
Returns:
dst_gm = [[[65.0, 65.0, 65.0, ..., 65.0, 65.0, 65.0] ... [65.0, 65.0, 65.0, ..., 65.0, 65.0, 65.0]] ... [[65.0, 65.0, 65.0, ..., 65.0, 65.0, 65.0] ... [65.0, 65.0, 65.0, ..., 65.0, 65.0, 65.0]]]
Data Conversion
Format Conversion
vec_trans
Description
Transposes 16x16 two-dimensional matrix data blocks, repeated repeat_times times. Each iteration operates on 256 elements at consecutive addresses. The addresses of different iterations can be nonconsecutive; the spacing between adjacent iterations is specified by dst_rep_stride and src_rep_stride.
Prototype
vec_trans(dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
Parameter |
Input/Output |
Description |
dst |
Output |
A tensor for the destination operand. Must be one of the following data types: int16, uint16, float16 |
src |
Input |
A tensor for the source operand. Must be one of the following data types: int16, uint16, float16 |
repeat_times |
Input |
Number of repeats. The argument is a scalar of type int/uint, an immediate of type int, or an Expr of type int/uint. The value range is [1, 4095]. |
dst_rep_stride |
Input |
dst address space between adjacent iterations (unit: 512 bytes). The argument is a scalar of type int/uint, an immediate of type int, or an Expr of type int/uint. The value range is [0, 4095]. |
src_rep_stride |
Input |
src address space between adjacent iterations (unit: 512 bytes). The argument is a scalar of type int/uint, an immediate of type int, or an Expr of type int/uint. The value range is [0, 4095]. |
Restrictions
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restriction is that the source operand must completely overlap the destination operand.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
dst_ub = tik_instance.Tensor("float16", (1, 16, 16), name='dst_ub', scope=tik.scope_ubuf)
src_ub = tik_instance.Tensor("float16", (1, 16, 16), name='src_ub', scope=tik.scope_ubuf)
tik_instance.vec_trans(dst_ub, src_ub, 1, 1, 1)

Inputs:
src_ub = [1, 2, 3, 4, ..., 256]
Returns:
dst_ub = [1, 17, 33, 49, ..., 256]
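The transpose semantics can be modeled in plain Python. This sketch is illustrative, not TIK code; it assumes each rep_stride counts the spacing between the start addresses of adjacent iterations, in 512-byte units (256 float16 elements, i.e. exactly one 16x16 block).

```python
def vec_trans_sim(src, repeat_times=1, dst_rep_stride=1, src_rep_stride=1):
    """Model vec_trans on a flat float16 buffer: each iteration
    transposes one 16x16 block of 256 consecutive elements."""
    dst = [0] * len(src)
    for r in range(repeat_times):
        s_off = r * src_rep_stride * 256   # 512 bytes = 256 fp16 elements
        d_off = r * dst_rep_stride * 256
        for i in range(16):
            for j in range(16):
                dst[d_off + j * 16 + i] = src[s_off + i * 16 + j]
    return dst

src = list(range(1, 257))   # 1..256, as in the example above
dst = vec_trans_sim(src)
print(dst[:4])              # -> [1, 17, 33, 49]
```

The first output elements reproduce the dst_ub values shown in the example: element (j, i) of the output block comes from element (i, j) of the input block.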
vec_trans_scatter
Description
Converts NCHW into NC1HWC0. If the data type is float32, int32, uint32, int16, uint16, or float16, then C0 is 16. If the data type is uint8 or int8, then C0 is 32.
Prototype
vec_trans_scatter(dst_high_half, src_high_half, dst_list, src_list, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
dst_high_half |
Input |
A bool specifying whether to store the data of dst_list[*] to the upper or lower half of the block. Applies only when the data type is int8 or uint8. The options are as follows:
|
src_high_half |
Input |
A bool specifying whether to read the data of src_list[*] from the upper or lower half of the block. Applies only when the data type is int8 or uint8. The options are as follows:
|
dst_list |
Output |
A list of elements, specifying the vector destination operand. Each element marks the start of a destination operand. The supported data types are as follows: Ascend 910 AI Processor: tensor (int8/uint8/int16/uint16/float16) |
src_list |
Input |
A list of elements, specifying the vector source operand. Each element marks the start of a source operand. Has the same data type as dst_list. |
repeat_times |
Input |
Number of iteration repeats, in the unit of blocks. The value range is [0, 255]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. Notes: 1. When repeat_times=1, the valid start of a destination or source operand is the start of dst_list or src_list plus dst_rep_stride or src_rep_stride. 2. When repeat_times > 1, the valid start of a destination or source operand in the first repeat is the start of dst_list or src_list. In the second repeat, dst_rep_stride or src_rep_stride is added, and so on for subsequent repeats. |
dst_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the destination operand, in the unit of blocks. The value range is [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
src_rep_stride |
Input |
Block-to-block stride between adjacent iterations of the source operand, in the unit of blocks. The value range is [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
Restrictions
- Generally, each element in src_list or dst_list is configured as the start of each HW plane.
- For better performance, it is recommended that dst_high_half and src_high_half be fixed when the data type is int8 or uint8, and be changed after the repeat in the H and W directions.
- The mask value does not affect the execution of the API.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeat_times = 1), the source operand sequence and the target operand sequence must be completely the same. Partial overlapping is not supported. Instead, each block must be the same.
- For multiple repeats (repeat_times > 1), if there is a dependency between the source operand sequence and the destination operand sequence, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
dst_ub = tik_instance.Tensor("float16", (256,), tik.scope_ubuf, "dst_ub")
src_ub = tik_instance.Tensor("float16", (256,), tik.scope_ubuf, "src_ub")
dst_list = [dst_ub[16 * i] for i in range(16)]
src_list = [src_ub[16 * i] for i in range(16)]
tik_instance.vec_trans_scatter(True, False, dst_list, src_list, 1, 0, 0)
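The NCHW to NC1HWC0 layout that this instruction targets can be modeled in plain Python. This sketch is illustrative only (it is not TIK code); it uses C0 = 16 as for 16-bit and 32-bit types, and zero-pads the C dimension when C is not a multiple of C0, per the description above.

```python
def nchw_to_nc1hwc0(data, n, c, h, w, c0=16):
    """Rearrange a flat NCHW buffer into nested NC1HWC0 lists,
    zero-padding the C dimension up to a multiple of c0."""
    c1 = -(-c // c0)   # ceil(C / C0)
    out = [[[[[0] * c0 for _ in range(w)] for _ in range(h)]
            for _ in range(c1)] for _ in range(n)]
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    v = data[((ni * c + ci) * h + hi) * w + wi]
                    out[ni][ci // c0][hi][wi][ci % c0] = v
    return out

# A 1x17x1x2 NCHW tensor needs C1 = 2 blocks of C0 = 16;
# channel 16 lands in lane 0 of the second C1 block.
out = nchw_to_nc1hwc0(list(range(34)), 1, 17, 1, 2)
print(out[0][1][0][0][0])   # -> 32
```

In the TIK example above, each of the 16 elements of src_list points at one 16-element row of the unified buffer; the instruction gathers those rows into the interleaved C0 lanes that this model produces.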
Data Padding
vec_dup
Description
Copies a Scalar variable or an immediate multiple times and fills it into the vector (PAR indicates the degree of parallelism).
Prototype
vec_dup(mask, dst, scalar, repeat_times, dst_rep_stride)
PIPE: vector
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
mask |
Input |
For details, see the description of the mask parameter in Table 10-25. |
dst |
Output |
A tensor for the start element of the destination operand. Must be one of the following data types: uint16, int16, float16, uint32, int32, float32 |
scalar |
Input |
A scalar or an immediate, for the source operand to be copied. Has the same dtype as dst. |
repeat_times |
Input |
Number of iteration repeats. The addresses of the source and destination operands change upon every iteration. The value range is [0, 255]. If an immediate is passed, 0 is not supported. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
dst_rep_stride |
Input |
Block-to-block stride in a single iteration of the destination operand. The value range is [0, 255], in the unit of 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
Restrictions
- If the argument is a scalar, you need to ensure that it is within the range.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
dst_ub = tik_instance.Tensor("float16", (128,), tik.scope_ubuf, "dst_ub")
src_scalar = tik_instance.Scalar(init_value=0, dtype="float16")
tik_instance.vec_dup(128, dst_ub, src_scalar, 1, 8)
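The addressing in the example can be modeled in plain Python. This sketch is illustrative only; it assumes dst_rep_stride is the stride between the start addresses of adjacent repeats, in 32-byte blocks (16 float16 elements per block), so dst_rep_stride=8 advances 128 elements per repeat.

```python
def vec_dup_sim(dst, mask, value, repeat_times, dst_rep_stride,
                elems_per_block=16):
    """Model vec_dup on a flat buffer: each repeat writes `mask`
    elements, starting dst_rep_stride blocks after the previous
    repeat's start address."""
    for r in range(repeat_times):
        base = r * dst_rep_stride * elems_per_block
        for i in range(mask):
            dst[base + i] = value
    return dst

buf = [0.0] * 256
# Two back-to-back repeats of 128 fp16 elements fill all 256 elements.
vec_dup_sim(buf, 128, 5.0, 2, 8)
print(buf[0], buf[255])   # -> 5.0 5.0
```

With mask=128 and dst_rep_stride=8, consecutive repeats are contiguous; a larger stride would leave untouched gaps between the filled regions.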
Data Movement
data_move
Description
Moves data between buffers based on the scopes of the src and dst tensors. The supported paths are as follows:
- UB->UB
- UB->OUT
- OUT->UB
- OUT->L1
Prototype
data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
Parameters
Parameter |
Input/Output |
Description |
---|---|---|
dst |
Output |
Destination operand. For details about the data type restrictions, see Table 10-47. |
src |
Input |
Source operand. For details about the data type restrictions, see Table 10-47. |
sid |
Input |
A scalar, an immediate, or an Expr of type int32, specifying the SMMU ID, which is hardware-related. The value range is [0, 15]. The value 0 is recommended. |
nburst |
Input |
A scalar, an immediate, or an Expr of type int32, specifying the number of the data segments to be transmitted. The value range is [1, 4095]. |
burst |
Input |
Burst length. The value range is [1, 65535], in the unit of 32 bytes. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
src_stride |
Input |
Burst-to-burst stride of the source tensor. The value range is [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
dst_stride |
Input |
Burst-to-burst stride of the destination tensor. The value range is [0, 65535]. The argument is a scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
*args |
Input |
Extended positional arguments (reserved) |
**argv |
Input |
Extended arguments |
src.scope |
dst.scope |
dtype (src and dst Have the Same dtype) |
burst Unit |
src_stride Unit |
dst_stride Unit |
---|---|---|---|---|---|
OUT |
L1 |
uint8, int8, float16, uint16, int16, float32, int32, uint32, float64, uint64, int64 |
32 bytes |
32 bytes |
32 bytes |
OUT |
UB |
uint8, int8, float16, uint16, int16, float32, int32, uint32, float64, uint64, int64 |
32 bytes |
32 bytes |
32 bytes |
UB |
OUT |
uint8, int8, float16, uint16, int16, float32, int32, uint32, float64, uint64, int64 |
32 bytes |
32 bytes |
32 bytes |
UB |
UB |
uint8, int8, float16, uint16, int16, float32, int32, uint32, float64, uint64, int64 |
32 bytes |
32 bytes |
32 bytes |
Restrictions
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Returns
None
Example
from te import tik
tik_instance = tik.Tik()
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.data_move(dst_ub, src_ub, 0, 1, 128 // 16, 0, 0)
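The burst argument in the example (128 // 16) is the segment size expressed in 32-byte units. A plain-Python helper (illustrative, not part of the TIK API) makes the arithmetic explicit:

```python
def burst_len(num_elements, dtype_size):
    """Return the data_move burst length in 32-byte units."""
    total_bytes = num_elements * dtype_size
    assert total_bytes % 32 == 0, "segment must span whole 32-byte blocks"
    return total_bytes // 32

# 128 float16 elements = 256 bytes = 8 units, matching 128 // 16 above.
print(burst_len(128, 2))   # -> 8
```

The same formula applies per burst when nburst > 1; src_stride and dst_stride then give the gap between bursts, also in 32-byte units.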