Tool Principles and Highlights
Tool Principles
The AMCT is a Caffe-based Python toolkit that implements Conv+BN+Scale, Deconv+BN+Scale, BN+Scale+Conv, and FC+BN+Scale fusion, as well as 8-bit quantization of data and weights in neural networks. The toolkit decouples model quantization from model conversion: it independently quantizes the quantization-capable operators in a model and outputs a .prototxt model file and a .caffemodel weight file. The obtained accuracy simulation model can run on the CPU or GPU to complete accuracy simulation. The obtained deployable model can run on the Ascend AI Processor to improve the inference performance.
Figure 3-4 shows the workflow. The operations in blue are performed by the user, and the operations in gray are performed by calling APIs provided by the AMCT. You import the AMCT library into the original Caffe network inference code and call the APIs at the appropriate locations to implement quantization. The tool supports the following scenarios:
- Calibration-based quantization
- Scenario 1
- Construct the original Caffe model, then call the create_quant_config API to generate a quantization configuration file from the model (or from a simplified configuration file).
- Based on the Caffe model and quantization configuration file, call the init API to initialize the tool, configure the quantization factor storage file, and parse the model into a graph.
- Call the quantize_model API to optimize the graph of the original Caffe model by inserting the quantization algorithm. Then perform inference with the optimized model in the Caffe environment, using the image dataset and calibration dataset preset in the AMCT, to obtain the quantization factors.
The image dataset is used to test the accuracy of the quantized model during inference in the Caffe environment. The calibration dataset is used to generate the quantization factors that preserve accuracy.
- Call the save_model API to save the quantized model as an accuracy simulation model (with its weight file) that runs in the Caffe environment and a deployable model (with its weight file) that runs on the Ascend AI Processor, as sketched below.
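The following is a minimal sketch of the scenario 1 calibration flow. The API names (create_quant_config, init, quantize_model, save_model) come from this section; the exact signatures, argument order, file names, and the 'Both' save mode are assumptions that may differ in your AMCT release.

```python
import caffe
import amct_caffe as amct

MODEL = 'ResNet50.prototxt'          # assumed original model file
WEIGHTS = 'ResNet50.caffemodel'      # assumed original weight file
CONFIG = 'quant_config.json'         # quantization configuration file
RECORD = 'scale_offset_record.txt'   # quantization factor storage file
BATCH_NUM = 2                        # assumed number of calibration batches

# 1. Generate the quantization configuration file from the original model.
amct.create_quant_config(CONFIG, MODEL, WEIGHTS)

# 2. Initialize the tool: register the quantization factor storage file
#    and parse the model into a graph.
graph = amct.init(CONFIG, MODEL, WEIGHTS, RECORD)

# 3. Insert the quantization algorithm into the graph and write out the
#    calibration model as .prototxt/.caffemodel files.
amct.quantize_model(graph, 'modified.prototxt', 'modified.caffemodel')

# 4. Run inference on the calibration dataset so the inserted layers can
#    collect quantization factors (data feeding is elided for brevity).
net = caffe.Net('modified.prototxt', 'modified.caffemodel', caffe.TEST)
for _ in range(BATCH_NUM):
    net.forward()

# 5. Save the accuracy simulation model and the deployable model.
amct.save_model(graph, 'Both', './results/ResNet50')
```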
- Scenario 2
If you have generated your own quantization factors and have the original Caffe model, you can skip the scenario 1 APIs and call the convert_model API to complete quantization directly, as sketched below. Quantization Example Using the convert_model API gives a full example of this scenario.
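A corresponding minimal sketch of the scenario 2 flow; the argument order (model, weights, quantization factor record file, output path) is an assumption.

```python
import amct_caffe as amct

# Quantize directly from user-supplied quantization factors; no calibration
# inference is needed in this scenario.
amct.convert_model('ResNet50.prototxt',        # original model
                   'ResNet50.caffemodel',      # original weights
                   'scale_offset_record.txt',  # your own quantization factors
                   './results/ResNet50')       # output path for both models
```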
- Retrain-based quantization
- Construct the original Caffe model, then call the create_quant_retrain_config API to generate a quantization configuration file from the model (or from a simplified configuration file).
- Call the create_quant_retrain_model API to optimize the original Caffe model by inserting the quantization algorithm. Then retrain the optimized model in the Caffe environment, using the image dataset and calibration dataset preset in the AMCT, to obtain the quantization factors.
- Call the save_quant_retrain_model API to save the retrained model as an accuracy simulation model (with its weight file) that runs in the Caffe environment and a deployable model (with its weight file) that runs on the Ascend AI Processor, as sketched below.
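A minimal sketch of the retrain flow, assuming signatures that mirror the calibration APIs; the solver file, iteration count, file names, and return values are illustrative only.

```python
import caffe
import amct_caffe as amct

CONFIG = 'retrain_config.json'       # assumed configuration file name

# 1. Generate the retrain quantization configuration file.
amct.create_quant_retrain_config(CONFIG, 'ResNet50.prototxt',
                                 'ResNet50.caffemodel')

# 2. Insert the quantization algorithm into the model for retraining.
graph = amct.create_quant_retrain_model(CONFIG, 'ResNet50.prototxt',
                                        'ResNet50.caffemodel',
                                        'retrain.prototxt',
                                        'retrain.caffemodel')

# 3. Retrain in the Caffe environment to learn the quantization factors
#    (the solver is assumed to reference the retrain model).
solver = caffe.SGDSolver('solver.prototxt')
solver.step(1000)                    # assumed iteration count

# 4. Save the accuracy simulation model and the deployable model.
amct.save_quant_retrain_model(graph, 'Both', './results/ResNet50_retrain')
```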
Fusion Functions
- Conv+BN+Scale fusion: Before quantization, the "Convolution+BatchNorm+Scale" structure in the model is fused into a single Convolution layer. After fusion, the BatchNorm and Scale layers are removed.
- Deconv+BN+Scale fusion: Before quantization, the "Deconvolution+BatchNorm+Scale" structure in the model is fused into a single Deconvolution layer. After fusion, the BatchNorm and Scale layers are removed.
- BN+Scale+Conv fusion: Before quantization, the "BatchNorm+Scale+Convolution" structure in the model is fused into a single Convolution layer. After fusion, the BatchNorm and Scale layers are removed.
- FC+BN+Scale fusion: Before quantization, the "InnerProduct+BatchNorm+Scale" structure in the model is fused into a single InnerProduct layer. After fusion, the BatchNorm and Scale layers are removed (the parameter folding is sketched below).
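These fusions all rest on the same arithmetic: the BatchNorm statistics and Scale parameters are folded into the weights and bias of the adjacent layer. A minimal NumPy sketch for the Conv+BN+Scale case follows; it shows the standard folding math, not AMCT internals, and the names and eps value are illustrative.

```python
import numpy as np

def fold_bn_scale_into_conv(w, b, mean, var, gamma, beta, eps=1e-5):
    """Fold BatchNorm statistics (mean, var) and Scale parameters
    (gamma, beta) into the preceding convolution.

    w: convolution weights, shape (C_out, C_in, kH, kW)
    b, mean, var, gamma, beta: per-output-channel vectors of length C_out
    (mean/var are assumed already normalized by Caffe's moving-average
    scale factor)
    """
    factor = gamma / np.sqrt(var + eps)          # per-channel multiplier
    folded_w = w * factor[:, None, None, None]   # scale each output filter
    folded_b = (b - mean) * factor + beta        # fold the shift into the bias
    return folded_w, folded_b
```

Convolving with folded_w and adding folded_b is numerically equivalent to running Convolution, BatchNorm, and Scale in sequence, which is why the BatchNorm and Scale layers can be removed after fusion.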
Tool Highlights
- Lightweight: You only need to install the tool and rebuild the Caffe environment.
- Easy-to-use APIs: You can complete quantization using APIs based on the Caffe inference script.
- Hardware compatibility: The generated deployable model can be converted to an offline model by using the ATC tool to implement 8-bit inference on the Ascend AI Processor (see the example after this list).
- Configurable quantization: You can modify the quantization configuration file and adjust the quantization policy to obtain the optimal quantization result.
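As an illustration of the hardware-compatibility point above, the following sketch invokes the ATC tool on a deployable model. The paths, output name, and SoC version are placeholders; --framework=0 selects Caffe as the source framework.

```python
import subprocess

# Convert the AMCT deployable model to an offline model with ATC.
subprocess.run([
    'atc',
    '--framework=0',                      # 0 = Caffe
    '--model=deploy_model.prototxt',      # placeholder deploy model
    '--weight=deploy_weights.caffemodel', # placeholder deploy weights
    '--output=resnet50_quant',            # placeholder output name
    '--soc_version=Ascend310',            # placeholder target SoC
], check=True)
```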