Overview
This document describes how to quantize a TensorFlow model using the Ascend Model Compression and Training Toolkit (AMCT). Quantization reduces the precision of the model inputs (both weights and data), making the model more compact, reducing transfer latency, and improving compute efficiency. Figure 4-1 shows the AMCT workflow.
By quantization method, quantization is classified into calibration-based quantization and retrain-based quantization. By quantization object, each of these methods is further divided into weight quantization and data quantization.
As used in this document, the following terms have the meanings specified below.
Calibration-based Quantization
Calibration-based quantization is a solution that determines the quantization factors of data and weights using a small calibration dataset. For details about the quantization workflow, see Calibration-based Quantization. Calibration-based quantization does not support execution on more than one GPU at a time.
Currently, the following layers support quantization: MatMul, Conv2D (dilation = 1), DepthwiseConv2dNative (dilation = 1), Conv2DBackpropInput (dilation = 1), and AvgPool.
- Calibration dataset
To determine the quantization parameters of data, the calibration algorithm feeds each sample in the calibration dataset to the model, accumulates the input data of each layer (or operation) to be quantized, and uses the accumulated data as the input of the quantization algorithm that determines the quantization parameters. Because the quantization parameters, and therefore the accuracy of the quantized model, depend strongly on the choice of calibration dataset, you are advised to calibrate the model using a subset of images from the validation dataset.
- Data quantization
Data quantization collects statistics on the input data of each layer (or operation) to be quantized to find the optimal pair of scale and offset per layer (or operation). UINT8 data (q_uint8) is converted from high-precision source data (d_float) using the following formula: q_uint8 = round(d_float/scale) - offset, where scale is the scaling factor of the floating-point values and offset is the offset (a sketch of this conversion follows this list).
Data is an intermediate result of model inference, so its value range depends on the input. Therefore, a group of reference inputs (the calibration dataset) is used as stimuli to record the input data of each layer (or operation) to be quantized while searching for the quantization parameters (scale and offset). During data calibration, extra memory (video memory/RAM) is needed to store the input data used to determine the quantization parameters, so the video memory/RAM usage is higher than that of inference alone. The size of the extra memory is positively correlated with batch_size x batch_num during calibration.
- Weight quantization
The weights of the model and their value ranges are already fixed by the time the model is prepared for inference acceleration. Therefore, quantization can be performed directly based on the value range of each weight, without a calibration dataset.
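To make the scale/offset search concrete, the following is a minimal NumPy sketch, not the AMCT implementation, of how quantization parameters might be derived from accumulated calibration inputs (data quantization) or directly from a known weight tensor (weight quantization), and how the q_uint8 = round(d_float/scale) - offset conversion is applied. The simple min/max parameter search, the function names, and the random tensors are illustrative assumptions.

```python
import numpy as np

def find_quant_params(tensor, num_bits=8):
    """Derive (scale, offset) from a tensor's value range.
    Illustrative min/max search; the actual AMCT search algorithm may differ."""
    qmin, qmax = 0, 2 ** num_bits - 1                 # UINT8 range [0, 255]
    d_min = min(float(tensor.min()), 0.0)             # keep 0.0 exactly representable
    d_max = max(float(tensor.max()), 0.0)
    scale = max(d_max - d_min, 1e-8) / (qmax - qmin)
    offset = round(d_min / scale)                     # so q = round(d/scale) - offset lands in [0, 255]
    return scale, offset

def quantize(d_float, scale, offset, num_bits=8):
    """Apply q_uint8 = round(d_float / scale) - offset, clipped to the UINT8 range."""
    q = np.round(d_float / scale) - offset
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q_uint8, scale, offset):
    """Recover an approximation of the original data: d_float ~ scale * (q_uint8 + offset)."""
    return scale * (q_uint8.astype(np.float32) + offset)

# Data quantization: accumulate the input data of one layer over the calibration
# batches, then search for that layer's (scale, offset) on the accumulated data.
calibration_inputs = [np.random.randn(4, 224, 224, 3).astype(np.float32) for _ in range(8)]
accumulated = np.concatenate(calibration_inputs, axis=0)
data_scale, data_offset = find_quant_params(accumulated)

# Weight quantization: weight values are already known, so the parameters are
# derived directly from the weight tensor; no calibration dataset is needed.
weights = np.random.randn(3, 3, 3, 64).astype(np.float32)
w_scale, w_offset = find_quant_params(weights)
q_weights = quantize(weights, w_scale, w_offset)
print('max reconstruction error:', np.abs(dequantize(q_weights, w_scale, w_offset) - weights).max())
```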
Retrain-based Quantization
Retrain-based quantization inserts quantization operations during model training on the complete training dataset to find the optimal quantization factors of data and weights. In this solution, the quantization parameters are further optimized with a large dataset so that they better match the current data, thereby improving accuracy compared with regular quantization.
Currently, the following layers support retrain-based quantization: MatMul, Conv2D (dilation = 1), DepthwiseConv2dNative (dilation = 1), Conv2DBackpropInput (dilation = 1), and AvgPool.
- Training dataset
The dataset on which the original network was trained.
- Data quantization
Data quantization collects statistics on the input data of each layer (or operation) to be quantized to find the optimal pair of scale and offset per layer (or operation). Data is an intermediate result of model inference. The ARQ retrain algorithm continuously optimizes this pair of parameters during the retrain process to obtain the optimal result (see the sketch after this list).
- Weight quantization
Weight quantization continuously optimizes the quantization parameters of the weights during the retrain process to obtain the optimal weight quantization parameters.
Retrain-based quantization allows only uniform quantization.
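To illustrate the general idea behind retrain-based quantization (the sketch referenced above), the following minimal TensorFlow example inserts "fake quantization" operations into a convolution layer's forward pass so that the weights and simple min/max quantization ranges are optimized together during retraining. It is a conceptual sketch only; the layer class, variable initial values, and training setup are assumptions, and it does not reproduce the AMCT retrain implementation or the ARQ algorithm.

```python
import tensorflow as tf

class FakeQuantConv2D(tf.keras.layers.Layer):
    """Conv2D layer whose input data and weights pass through fake-quantization ops,
    so the weights and the quantization ranges are optimized during retraining.
    (Hypothetical layer for illustration; not part of AMCT.)"""

    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.kernel_size = kernel_size
        # Trainable quantization ranges for the layer input (data) and the weights.
        self.data_min = tf.Variable(-6.0, trainable=True)
        self.data_max = tf.Variable(6.0, trainable=True)
        self.w_min = tf.Variable(-1.0, trainable=True)
        self.w_max = tf.Variable(1.0, trainable=True)

    def build(self, input_shape):
        self.kernel = self.add_weight(
            name='kernel',
            shape=(self.kernel_size, self.kernel_size, int(input_shape[-1]), self.filters),
            initializer='glorot_uniform',
            trainable=True)

    def call(self, inputs):
        # Data quantization: simulate 8-bit quantization of the layer input.
        q_inputs = tf.quantization.fake_quant_with_min_max_vars(
            inputs, self.data_min, self.data_max, num_bits=8)
        # Weight quantization: simulate 8-bit quantization of the kernel.
        q_kernel = tf.quantization.fake_quant_with_min_max_vars(
            self.kernel, self.w_min, self.w_max, num_bits=8)
        return tf.nn.conv2d(q_inputs, q_kernel, strides=1, padding='SAME')

# During retraining on the full training dataset, gradients flow through the
# fake-quant ops, so the quantization parameters adapt to the data.
layer = FakeQuantConv2D(filters=16, kernel_size=3)
features = layer(tf.random.normal([1, 32, 32, 3]))
```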
Tensor Decomposition
If the original TensorFlow model contains a Conv2D layer that meets the following conditions, the layer can be decomposed into two or three child layers. You can then use AMCT to convert the decomposed TensorFlow model into a quantized model that delivers better inference performance when deployed on the Ascend AI Processor. For details about how to decompose the original model, see Tensor Decomposition. Decomposition of the original model is optional; a condition-check sketch is provided at the end of this section.
- group = 1, dilation = (1,1)
- kernel_h = kernel_w, and kernel_h > 2
Figure 4-2 shows the resnet_v2_50 model before and after decomposition.
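As an illustration of the decomposition conditions, the following sketch, assuming a Keras model, lists the Conv2D layers that satisfy them (groups = 1, dilation = (1, 1), square kernel, kernel size greater than 2). The helper function name and the use of ResNet50V2 as a stand-in for resnet_v2_50 are assumptions; AMCT performs the actual decomposition.

```python
import tensorflow as tf

def decomposable_conv2d_layers(model):
    """Return the names of Conv2D layers that meet the decomposition conditions:
    groups = 1, dilation = (1, 1), square kernel, and kernel size greater than 2."""
    eligible = []
    for layer in model.layers:
        if not isinstance(layer, tf.keras.layers.Conv2D):
            continue
        kernel_h, kernel_w = layer.kernel_size
        if (layer.groups == 1
                and tuple(layer.dilation_rate) == (1, 1)
                and kernel_h == kernel_w
                and kernel_h > 2):
            eligible.append(layer.name)
    return eligible

# Example: check a ResNet50V2 backbone (a stand-in for the resnet_v2_50 model in Figure 4-2).
model = tf.keras.applications.ResNet50V2(weights=None)
print(decomposable_conv2d_layers(model))
```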