Quantization

  1. Quantization Introduction

  2. Quantization Fundamentals

  3. Get Started

    3.1 Post Training Quantization

    3.2 Specify Quantization Rules

    3.3 Specify Quantization Backend and Device

  4. Examples

Quantization Introduction

Quantization is a popular deep learning model optimization technique invented to improve inference speed. It reduces the number of bits required by converting a set of real-valued numbers into a lower-bit data representation, such as int8 and int4, mainly during the inference phase, with minimal to no loss in accuracy. This lowers the memory requirement, cache miss rate, and computational cost of using neural networks, and ultimately delivers higher inference performance. On 3rd Gen Intel® Xeon® Scalable Processors, users can expect up to a 4x theoretical performance speedup. Further performance improvement is expected with Intel® Advanced Matrix Extensions on 4th Gen Intel® Xeon® Scalable Processors.

Quantization Fundamentals

Affine quantization and Scale quantization are two common range mapping techniques used in tensor conversion between different data types.

The math equation is: $X_{int8} = round(X_{fp32} / Scale + ZeroPoint)$.

Affine Quantization

This is the so-called asymmetric quantization, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

where:

If INT8 is specified, $Scale = (|X_{max} - X_{min}|) / 127$ and $ZeroPoint = -128 - X_{min} / Scale$.

or

If UINT8 is specified, $Scale = (|X_{max} - X_{min}|) / 255$ and $ZeroPoint = - X_{min} / Scale$.
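
For illustration, here is a minimal NumPy sketch of the uint8 affine (asymmetric) formulas above; the tensor values are made up for the example only:

import numpy as np

# Made-up fp32 tensor used only to illustrate the affine (asymmetric) formulas above.
x_fp32 = np.array([-1.5, -0.3, 0.0, 0.8, 2.1], dtype=np.float32)

x_min, x_max = x_fp32.min(), x_fp32.max()

# UINT8 case: Scale = |X_max - X_min| / 255 and ZeroPoint = -X_min / Scale
scale = abs(x_max - x_min) / 255
zero_point = round(-x_min / scale)

# Quantize: X_uint8 = round(X_fp32 / Scale + ZeroPoint), clipped to [0, 255]
x_uint8 = np.clip(np.round(x_fp32 / scale + zero_point), 0, 255).astype(np.uint8)

# Dequantize back to approximate fp32 values
x_dequant = (x_uint8.astype(np.float32) - zero_point) * scale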

Scale Quantization

This is the so-called symmetric quantization, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The math equation is: $X_{int8} = round(X_{fp32} / Scale + ZeroPoint)$.

where:

If INT8 is specified, $Scale = max(abs(X_{max}), abs(X_{min})) / 127$ and $ZeroPoint = 0$.

or

If UINT8 is specified, $Scale = max(abs(X_{max}), abs(X_{min})) / 255$ and $ZeroPoint = 128$.
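
As a companion to the affine sketch above, here is a minimal NumPy sketch of the int8 symmetric formulas, again with a made-up tensor:

import numpy as np

# Made-up fp32 tensor used only to illustrate the symmetric formulas above.
x_fp32 = np.array([-1.5, -0.3, 0.0, 0.8, 2.1], dtype=np.float32)

# INT8 case: Scale = max(|X_max|, |X_min|) / 127 and ZeroPoint = 0
scale = max(abs(x_fp32.max()), abs(x_fp32.min())) / 127
zero_point = 0

# Quantize: with ZeroPoint fixed at 0, the mapping is a pure rescaling
x_int8 = np.clip(np.round(x_fp32 / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize back to approximate fp32 values
x_dequant = (x_int8.astype(np.float32) - zero_point) * scale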

NOTE

Sometimes the reduce_range feature, which uses a 7-bit width (1 sign bit + 6 data bits) to represent the int8 range, may be needed on some early Xeon platforms. Those platforms can have overflow issues due to the fp16 intermediate calculation result when executing the int8 dot product operation. After the AVX512_VNNI instruction was introduced, this issue was solved by supporting fp32 intermediate data.

Quantization Support Matrix

| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization |
| --- | --- | --- | --- |
| ONNX Runtime | MLAS | Activation (int8/uint8), Weight (int8/uint8) | Activation (int8/uint8), Weight (int8/uint8) |

Note

Activation (uint8) + Weight (int8) is recommended for performance on x86-64 machines with AVX2 and AVX512 extensions.


Quantization Approaches

Quantization has two different approaches, both of which are inference-time optimizations:

  1. post training dynamic quantization
  2. post training static quantization

Post Training Dynamic Quantization

The weights of the neural network are quantized offline from float32 format into 8-bit format. The activations of the neural network are quantized as well, with the min/max range collected during inference runtime.

This approach is widely used for dynamic-length neural networks, such as NLP models. A minimal call sketch follows below.
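
A minimal sketch of post training dynamic quantization with Neural Compressor; the file paths are placeholders for illustration only:

from onnx_neural_compressor.quantization import quantize, config

# Dynamic quantization needs no calibration data: activation ranges are
# computed on the fly at inference time, so only the model paths are required.
quantize(
    "model.onnx",           # model_input: path to the fp32 ONNX model (placeholder)
    "model_quant.onnx",     # model_output: path to save the quantized model (placeholder)
    config.DynamicQuantConfig(),
)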

Post Training Static Quantization

Compared with post training dynamic quantization, the min/max ranges of weights and activations are collected offline on a so-called calibration dataset. This dataset should represent the data distribution of the unseen inference data. The calibration process runs on the original fp32 model and dumps out all the tensor distributions for Scale and ZeroPoint calculations. Usually about 100 samples are enough for calibration.

This is the main quantization approach to try first, because it usually provides better performance than post training dynamic quantization. A sketch of a calibration data reader follows below.
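
Here is a minimal sketch of a calibration data reader built on the data_reader.CalibrationDataReader base class used in the Get Started example below; the input name "input", the tensor shape, and the random data are assumptions for illustration and must match the real model:

import numpy as np

from onnx_neural_compressor import data_reader


class RandomCalibrationDataReader(data_reader.CalibrationDataReader):
    """Feeds a fixed number of calibration samples, one feed dict per get_next() call."""

    def __init__(self, num_samples=100):
        # "input" and the (1, 3, 224, 224) shape are assumptions for illustration;
        # they must match the real model's input name and shape.
        self.samples = [
            {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        ]
        self.iterator = iter(self.samples)

    def get_next(self):
        # Return the next feed dict, or None when calibration data is exhausted.
        return next(self.iterator, None)

    def rewind(self):
        # Restart iteration so the reader can be reused.
        self.iterator = iter(self.samples)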

Get Started

The design philosophy of the quantization interface of Neural Compressor is ease of use. It asks the user to provide model_input, model_output and quant_config. These parameters are used to quantize and save the model.

model_input is the path to the ONNX model or the ONNX model object.

model_output is the path to save the quantized ONNX model.

quant_config is the configuration used for quantization.

Users can leverage Neural Compressor to directly generate a fully quantized model without accuracy validation. Currently, Neural Compressor supports Post Training Static Quantization and Post Training Dynamic Quantization.

Post Training Quantization

from onnx_neural_compressor.quantization import quantize, config
from onnx_neural_compressor import data_reader


class DataReader(data_reader.CalibrationDataReader):
    # Implement these two methods to yield calibration samples (see the sketch above).
    def get_next(self): ...

    def rewind(self): ...


model_input = "model.onnx"          # placeholder: path to the fp32 ONNX model (or an ONNX model object)
model_output = "model_quant.onnx"   # placeholder: path to save the quantized ONNX model

calibration_data_reader = DataReader()  # only needed by StaticQuantConfig
qconfig = config.StaticQuantConfig(calibration_data_reader)  # or qconfig = config.DynamicQuantConfig()
quantize(model_input, model_output, qconfig)
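
To sanity-check the result, the quantized model can be loaded like any other ONNX model. A minimal sketch with onnxruntime follows; the input shape is an assumption and must match the model's actual input:

import numpy as np
import onnxruntime as ort

# Load the quantized model produced by quantize() above.
session = ort.InferenceSession("model_quant.onnx", providers=["CPUExecutionProvider"])

# Query the session for the real input name; the shape below is an assumption.
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)

outputs = session.run(None, {input_name: dummy_input})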

Specify Quantization Rules

Neural Compressor supports specifying quantization rules by operator name. Users can use the set_local API of the configs to achieve this, as shown in the code below:

# Local override for one operator: per-tensor (per_channel=False) quantization.
op_config = config.StaticQuantConfig(per_channel=False)
# Global config: per-channel quantization for all quantizable operators.
quant_config = config.StaticQuantConfig(
    per_channel=True,
)
# Apply the override to the named MatMul node; all other operators keep the global setting.
quant_config.set_local("/h.4/mlp/fc_out/MatMul", op_config)
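In this example the model is quantized per-channel globally, while the single MatMul node named /h.4/mlp/fc_out/MatMul falls back to per-tensor quantization; the same set_local pattern can be used to override other config options for individual operators.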

Specify Quantization Backend and Device

Neural Compressor quantizes models with the user-specified backend, or automatically detects the hardware and software status to decide which backend should be used. The automatic selection priority is: GPU/NPU > CPU.

| Backend | Backend Library | Support Device (cpu as default) |
| --- | --- | --- |
| CPUExecutionProvider | MLAS | cpu |
| TensorrtExecutionProvider | TensorRT | gpu |
| CUDAExecutionProvider | CUDA | gpu |
| DnnlExecutionProvider | OneDNN | cpu |
| DmlExecutionProvider* | OneDNN | npu |


Note

DmlExecutionProvider support is experimental; please expect exceptions.

Known limitation: the batch size of ONNX models has to be fixed to 1 for DmlExecutionProvider; multi-batch and dynamic batch are not supported yet.
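
For reference, the execution provider used at inference time is selected when the onnxruntime session is created. A minimal sketch follows; the provider list is only an example, pick the providers available in your onnxruntime build:

import onnxruntime as ort

# List the execution providers available in the installed onnxruntime build.
print(ort.get_available_providers())

# Request a GPU provider with CPU fallback; unavailable providers are skipped.
session = ort.InferenceSession(
    "model_quant.onnx",  # placeholder path for the quantized model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # providers actually enabled for this session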

Examples

Users can refer to the examples for details on how to quantize a new model.