Quantization
Quantization processes model parameters and data using low-bit techniques to make the network model more compact. This reduces storage space and transfer latency and improves computing efficiency.
You can use your own framework and tools to perform quantization and inject the quantization parameters (scale_d, scale_w, and offset_d) into the model during IR graph construction.
- Currently, only the Conv2D, DepthwiseConv2D and FullyConnection operators support quantization.
- When the channel dimension of the input data of the Conv2D, DepthwiseConv2D, or FullyConnection operator is less than or equal to 16, INT8 quantization does not improve performance because of padding. Therefore, for these three operators, the channel dimension of the input data must be greater than 16; otherwise, quantization cannot be performed.
Take the Conv2D operator as an example. Insert the AscendQuant quantization operator before the Conv2D operator and insert the AscendDequant dequantization operator after the Conv2D operator to implement model quantization, as shown in Figure 4-4.
The AscendQuant quantization operator converts float data into int8 data using the following formula: data_int8 = round((data_float x scale) + offset), where scale = 1/scale_d and offset = offset_d.
The AscendDequant operator converts int32 data into float16 data using the following formula: data_float = data_int32 x deq_scale, where deq_scale = scale_d x scale_w.
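The two formulas above can be checked numerically. The sketch below simulates the quantize/dequantize round trip in NumPy; the values of scale_d, scale_w, and offset_d are made up for illustration (in practice they come from calibration), and the int32 accumulator values are mock data standing in for a Conv2D output.

```python
import numpy as np

# Hypothetical calibration outputs (illustrative values only).
scale_d = 0.05        # data quantization scale
scale_w = 0.002       # weight quantization scale
offset_d = -128.0     # data quantization offset

# AscendQuant: data_int8 = round((data_float * scale) + offset),
# with scale = 1 / scale_d and offset = offset_d.
scale = 1.0 / scale_d
data_float = np.array([0.0, 1.6, 6.35], np.float32)
data_int8 = np.clip(np.round(data_float * scale + offset_d),
                    -128, 127).astype(np.int8)

# AscendDequant: data_float = data_int32 * deq_scale,
# with deq_scale = scale_d * scale_w.
deq_scale = scale_d * scale_w
acc_int32 = np.array([1000, -2000, 30000], np.int32)  # mock conv accumulator
data_fp16 = (acc_int32 * deq_scale).astype(np.float16)
```

Note the clip to [-128, 127]: float values outside the representable int8 range must saturate rather than wrap.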
Inserting AscendQuant Before Conv2D
AscendQuant operator prototype definition:
REG_OP(AscendQuant)
    .INPUT(x, TensorType({DT_FLOAT16, DT_FLOAT32}))
    .OUTPUT(y, TensorType({DT_INT8}))
    .REQUIRED_ATTR(scale, Float)
    .REQUIRED_ATTR(offset, Float)
    .ATTR(sqrt_mode, Bool, false)
    .ATTR(round_mode, String, "Round")
    .OP_END_FACTORY_REG(AscendQuant)
The AscendQuant operator has one input (x), two required attributes (scale and offset), and two optional attributes (sqrt_mode and round_mode). They are described as follows:
- x: a tensor of type float16 or float32, for the input of the AscendQuant operator.
- scale: a float16 or float32 specifying the quantization factor (scale = 1/scale_d). The value should be within the float16 range; otherwise, set sqrt_mode to True.
- offset: a float16 or float32 specifying the quantization offset (offset = offset_d).
- sqrt_mode: If set to True, square root extraction is performed on the scale. Defaults to False (recommended). Set sqrt_mode to True only when the value of the scale exceeds the float16 range, to avoid precision loss.
- round_mode: a string, one of Round, Floor, Ceiling, or Truncate, specifying the round mode for float-to-int conversion. Defaults to Round.
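The four round modes differ only in how fractional values are mapped to integers. The sketch below illustrates them with NumPy stand-ins; the exact tie-breaking behavior of the hardware Round mode is not specified here, so np.round (round-half-to-even) is used as an assumption.

```python
import numpy as np

x = np.array([-1.5, -0.7, 0.5, 1.5, 2.3], np.float32)

round_modes = {
    "Round":    np.round(x),   # to nearest (half-to-even stand-in)
    "Floor":    np.floor(x),   # toward negative infinity
    "Ceiling":  np.ceil(x),    # toward positive infinity
    "Truncate": np.trunc(x),   # toward zero
}
```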
Create an AscendQuant operator instance based on the operator prototype definition.
auto quant = op::AscendQuant("quant")
    .set_input_x(data)
    .set_attr_scale(1.00049043)  // Specify scale.
    .set_attr_offset(-128.0);    // Specify offset.
Conv2D
Set AscendQuant as the input of Conv2D, and set the output type to int32.
// const op: conv2d weight
auto weight_shape = ge::Shape({ 5, 17, 1, 1 });
TensorDesc desc_weight_1(weight_shape, FORMAT_NCHW, DT_INT8);
Tensor weight_tensor(desc_weight_1);
uint32_t weight_1_len = weight_shape.GetShapeSize();
bool res = GetConstTensorFromBin(PATH + "const_0.bin", weight_tensor, weight_1_len);
if (!res) {
    std::cout << "GetConstTensorFromBin Failed!" << std::endl;
    return -1;
}
auto conv_weight = op::Const("const_0")
    .set_attr_value(weight_tensor);

// const op: conv2d bias
auto bias_shape = ge::Shape({ 5 });
TensorDesc desc_bias(bias_shape, FORMAT_NCHW, DT_INT32);
Tensor bias_tensor(desc_bias);
uint32_t bias_len = bias_shape.GetShapeSize() * sizeof(int32_t);
res = GetConstTensorFromBin(PATH + "const_1.bin", bias_tensor, bias_len);
if (!res) {
    std::cout << "GetConstTensorFromBin Failed!" << std::endl;
    return -1;
}
auto conv_bias = op::Const("const_1")
    .set_attr_value(bias_tensor);

// conv2d op
auto conv2d = op::Conv2D("Conv2d")
    .set_input_x(quant)
    .set_input_filter(conv_weight)
    .set_input_bias(conv_bias)
    .set_attr_strides({ 1, 1, 1, 1 })
    .set_attr_pads({ 0, 0, 0, 0 })
    .set_attr_dilations({ 1, 1, 1, 1 });
TensorDesc conv2d_input_desc_x(ge::Shape(), FORMAT_NCHW, DT_INT8);       // After quantization, set the data type of input_x to int8.
TensorDesc conv2d_input_desc_filter(ge::Shape(), FORMAT_NCHW, DT_INT8);  // After quantization, set the data type of input_filter to int8.
TensorDesc conv2d_input_desc_bias(ge::Shape(), FORMAT_NCHW, DT_INT32);   // After quantization, set the data type of input_bias to int32.
TensorDesc conv2d_output_desc_y(ge::Shape(), FORMAT_NCHW, DT_INT32);     // After quantization, set the data type of output_y to int32.
conv2d.update_input_desc_x(conv2d_input_desc_x);
conv2d.update_input_desc_filter(conv2d_input_desc_filter);
conv2d.update_input_desc_bias(conv2d_input_desc_bias);
conv2d.update_output_desc_y(conv2d_output_desc_y);
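The int32 output type is needed because int8 x int8 products accumulated across input channels quickly exceed the range of smaller integer types. A quick numeric sketch, using the 17 input channels of the 5x17x1x1 weight above and worst-case int8 values:

```python
import numpy as np

# Worst-case accumulation for a 1x1 conv over 17 input channels:
# 17 products of 127 * 127, summed in int32 to avoid overflow.
x = np.full(17, 127, np.int8)
w = np.full(17, 127, np.int8)
acc = int(np.sum(x.astype(np.int32) * w.astype(np.int32)))
# acc = 17 * 127 * 127 = 274193, far beyond the int8/int16 range.
```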
Inserting AscendDequant After Conv2D
AscendDequant operator prototype definition:
REG_OP(AscendDequant)
    .INPUT(x, TensorType({DT_INT32}))
    .INPUT(deq_scale, TensorType({DT_FLOAT16, DT_UINT64}))
    .OUTPUT(y, TensorType({DT_FLOAT16}))
    .ATTR(sqrt_mode, Bool, false)
    .ATTR(relu_flag, Bool, false)
    .OP_END_FACTORY_REG(AscendDequant)
The AscendDequant operator has two inputs (x and deq_scale), and two optional attributes (sqrt_mode and relu_flag). The parameters are described as follows:
- x: a tensor of type int32 for the input of the AscendDequant operator.
- deq_scale: a tensor of type uint64 specifying the dequantization factor (deq_scale = scale_d x scale_w). Its shape is either [1] or the same as the channel dimension of the Conv2D output.
You need to convert the float32 data obtained by multiplying scale_d and scale_w into uint64 and fill the result into the lower 32 bits of deq_scale. The upper 32 bits must be all 0s.
import numpy as np

def trans_float32_scale_deq_to_uint64(scale_deq):
    # Reinterpret the float32 bit pattern as uint32, then widen to
    # uint64 so it occupies the lower 32 bits (upper 32 bits stay 0).
    float32_scale_deq = np.array(scale_deq, np.float32)
    uint32_scale_deq = np.frombuffer(float32_scale_deq, np.uint32)
    uint64_result = np.zeros(float32_scale_deq.shape, np.uint64)
    uint64_result |= uint32_scale_deq.astype(np.uint64)
    return uint64_result
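The same packing can be verified with a reinterpreting view. The sketch below uses hypothetical per-channel scales for the 5-channel Conv2D output from this example and checks that the float32 bit pattern of scale_d x scale_w lands in the lower 32 bits of each uint64, with the upper 32 bits zero.

```python
import numpy as np

# Hypothetical calibration scales (illustrative values only).
scale_d = np.float32(0.05)
scale_w = np.array([0.001, 0.002, 0.003, 0.004, 0.005], np.float32)

# deq_scale = scale_d * scale_w, packed into the low 32 bits of uint64.
deq_scale = (scale_d * scale_w).astype(np.float32)
packed = deq_scale.view(np.uint32).astype(np.uint64)
assert np.all(packed >> np.uint64(32) == 0)  # upper 32 bits are all 0s
```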
- sqrt_mode: If set to True, square root extraction is performed on the deq_scale. Defaults to False (recommended). Set sqrt_mode to True only when the value of the deq_scale exceeds the float16 range, to avoid precision loss.
- relu_flag: If set to True, performs ReLU. Defaults to False.
Create an AscendDequant operator instance based on the operator prototype definition.
// Construct dequant_scale.
TensorDesc desc_dequant_shape(ge::Shape({ 5 }), FORMAT_NCHW, DT_UINT64);
Tensor dequant_tensor(desc_dequant_shape);
uint32_t dequant_scale_len = 5 * sizeof(uint64_t);
res = GetConstTensorFromBin(PATH + "const_2.bin", dequant_tensor, dequant_scale_len);
if (!res) {
    std::cout << "GetConstTensorFromBin Failed!" << std::endl;
    return -1;
}
auto dequant_scale = op::Const("dequant_scale")
    .set_attr_value(dequant_tensor);

// Define the AscendDequant operator.
auto dequant = op::AscendDequant("dequant")
    .set_input_x(conv2d)
    .set_input_deq_scale(dequant_scale);
Set the output of AscendDequant as the input of other operators, or as the graph output.
auto bias_add_1 = op::BiasAdd("bias_add_1")
    .set_input_x(dequant)
    .set_input_bias(bias_weight_1)
    .set_attr_data_format("NCHW");