Matmul Compute API
matmul
Description
Performs matrix multiplication. The formula is tensor_c = trans_a(tensor_a) * trans_b(tensor_b) + tensor_bias.
For tensor_a and tensor_b, the last two dimensions of the shape (after transposition) must satisfy the matrix multiplication condition (M, K) * (K, N) = (M, N). Additional leading dimensions are supported. If format_a is set to fractal, the data layout of tensor_a must match the fractal structure of L0A; likewise, if format_b is set to fractal, the data layout of tensor_b must match the fractal structure of L0B. If the format is ND, the corresponding tensor uses the ND layout.
The API is defined in python/site-packages/te/lang/cce/te_compute/mmad_compute.py in the ATC installation path.
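For the ND layout, the shape condition can be read as follows; this is a minimal sketch with hypothetical shapes, showing only the shape arithmetic:
# (M, K) * (K, N) = (M, N), with M = 1024, K = 256, N = 512.
a_shape = (1024, 256)   # (M, K) when trans_a=False
b_shape = (256, 512)    # (K, N) when trans_b=False
# With trans_a=True, tensor_a would instead be stored as (K, M) = (256, 1024)
# and transposed internally before the multiplication.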
Restrictions
This API cannot be used in conjunction with other TBE DSL APIs.
The input supports float16, and the output supports float16 and float32.
Prototype
te.lang.cce.matmul(tensor_a, tensor_b, trans_a=False, trans_b=False, format_a="ND", format_b="ND", alpha_num=1.0, beta_num=0.0, dst_dtype="float16", tensor_bias=None, quantize_params=None)
Parameters
- tensor_a: a tvm.tensor for matrix A
- tensor_b: a tvm.tensor for matrix B
- trans_a: a bool specifying whether to transpose matrix A
- trans_b: a bool specifying whether to transpose matrix B
- format_a: format of matrix A, either ND or fractal. Defaults to ND.
- format_b: format of matrix B, either ND or fractal. Defaults to ND.
- alpha_num: a broadcast parameter, which is not used currently. Defaults to 1.0.
- beta_num: a broadcast parameter, which is not used currently. Defaults to 0.0.
- dst_dtype: output data type, either float16 or float32.
- tensor_bias: bias tensor. Defaults to None. If not None, tensor_bias is added to the result of multiplying matrix A by matrix B. The shape of tensor_bias supports broadcasting, and its data type must be the same as dst_dtype.
- quantize_params: quantization parameters, in dictionary format. Defaults to None, which disables quantization. If quantize_params is not None, quantization is enabled (see the sketch after the quantization notes below). The dictionary keys are as follows:
- quantize_alg: quantization mode. The value can be NON_OFFSET (default) or HALF_OFFSET_A.
- scale_mode_a: reserved
- scale_mode_b: reserved
- scale_mode_out: value type of the output dequantization parameter. The value can be SCALAR (default) or VECTOR.
- sqrt_mode_a: reserved
- sqrt_mode_b: reserved
- sqrt_mode_out: whether to take the square root of scale_drq. The value can be NON_SQRT (default) or SQRT.
- scale_q_a: reserved
- offset_q_a: reserved
- scale_q_b: reserved
- offset_q_b: reserved
- scale_drq: placeholder for the output dequantization or requantization weight parameter. Defaults to None.
- offset_drq: reserved
The quantization modes are as follows:
- Input quantization: refers to quantization from input data to intermediate data. Generally, fp16 data is quantized to int8 or uint8 data.
- Output quantization: refers to quantization from intermediate data to output data. The following two quantization modes are available:
- Requantization: converts int32 to int8.
- Dequantization: converts int32 to fp16.
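For example, a dequantization configuration might look as follows. This is a minimal sketch: the key values shown are the documented defaults plus a caller-created scale_drq placeholder, whose shape and dtype here are assumptions for illustration.
# Hypothetical quantize_params dictionary using the keys documented above.
scale_drq = tvm.placeholder((1, ), name='scale_drq', dtype='float16')
quantize_params = {
    "quantize_alg": "NON_OFFSET",   # default quantization mode
    "scale_mode_out": "SCALAR",     # scalar dequantization parameter
    "sqrt_mode_out": "NON_SQRT",    # do not take the square root of scale_drq
    "scale_drq": scale_drq,         # dequantization weight placeholder
}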
Returns
tensor_c: a tvm.tensor for the result
Example
from te import tvm
import te.lang.cce

a_shape = (1024, 256)
b_shape = (256, 512)
bias_shape = (512, )
in_dtype = "float16"
dst_dtype = "float32"
# Placeholders for the ND-layout inputs; the bias broadcasts over the M dimension.
tensor_a = tvm.placeholder(a_shape, name='tensor_a', dtype=in_dtype)
tensor_b = tvm.placeholder(b_shape, name='tensor_b', dtype=in_dtype)
tensor_bias = tvm.placeholder(bias_shape, name='tensor_bias', dtype=dst_dtype)
res = te.lang.cce.matmul(tensor_a, tensor_b, trans_a=False, trans_b=False, dst_dtype=dst_dtype, tensor_bias=tensor_bias)
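To compile the computation into an operator, the result is typically passed to the TBE scheduler and build step. The following is a minimal sketch continuing the example above, assuming the standard TBE DSL build flow; the kernel name and config contents are illustrative:
# Schedule and build the kernel (names here are assumptions for illustration).
with tvm.target.cce():
    sch = te.lang.cce.auto_schedule(res)
config = {"name": "matmul_example",
          "tensor_list": [tensor_a, tensor_b, tensor_bias, res]}
te.lang.cce.cce_build_code(sch, config)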