# Quantization Embedl Deploy provides hardware-aware INT8 quantization through explicit QDQ (Quantize/DeQuantize) stub placement. Unlike uniform quantization approaches that insert QDQ nodes around every operator, Embedl Deploy places stubs only at positions declared by each pattern's `qdq_points` — ensuring that quantization does not break operator fusions in the target hardware compiler. ## Quantization pipeline The quantization pipeline has three steps: ``` ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ transform() │ ──▶ │insert_qdq() │ ──▶ │ calibrate │ │(fuse model) │ │(add stubs) │ │(set scales) │ └─────────────┘ └─────────────┘ └─────────────┘ ``` 1. **Transform** — apply conversions and fusions. Each pattern declares its `qdq_points` attribute (e.g., `INPUT`, `RESIDUAL_INPUT`), and these are stored in the returned `PatternMatch` objects for use in the next step. 2. **Insert QDQ** — read the `qdq_points` from each `PatternMatch` and place `QuantStub` (for activations) and `WeightFakeQuantize` (for weights) modules at those positions. A `PatternMatch` is a dataclass that records which nodes were matched and what QDQ points apply to that fusion. 3. **Calibrate** — run representative data through the model to compute scale and zero-point for each quantizer. 
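To make the data flow between steps 1 and 2 concrete, here is a minimal sketch of what a `PatternMatch` record might look like. The field names below are assumptions for illustration only; the library's actual dataclass may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PatternMatch:
    # Hypothetical field names -- the library's real definition may differ.
    pattern_name: str                          # e.g. "ConvBNReLUPattern"
    nodes: list = field(default_factory=list)  # graph nodes covered by the fusion
    qdq_points: tuple = ()                     # e.g. ("INPUT", "RESIDUAL_INPUT")


# A residual-add fusion would carry two QDQ points, one per input path:
match = PatternMatch(
    "ConvBNAddReLUPattern",
    qdq_points=("INPUT", "RESIDUAL_INPUT"),
)
```

The key point is that `transform()` produces these records and `insert_qdq()` consumes them, so the QDQ placement decision travels with the fusion rather than being recomputed from the graph.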
## Step 1: Transform and fuse ```python import torch from torchvision.models import resnet50 from embedl_deploy import transform from embedl_deploy.tensorrt import TENSORRT_PATTERNS model = resnet50(weights="DEFAULT").eval() result = transform(model, patterns=TENSORRT_PATTERNS) fused_model = result.model matches = result.matches # needed for insert_qdq ``` ## Step 2: Insert QDQ stubs ```python from torch import nn from embedl_deploy.quantize import ( QuantConfig, TensorQuantConfig, insert_qdq, ) quantized_model = insert_qdq( fused_model, matches, config=QuantConfig( activation=TensorQuantConfig(n_bits=8, symmetric=True), weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True), ), ) ``` ### QuantConfig `QuantConfig` controls how quantization stubs are configured: | Parameter | Type | Description | |---|---|---| | `activation` | `TensorQuantConfig` | Config for activation quantizers | | `weight` | `TensorQuantConfig` | Config for weight quantizers | | `skip_weight_quant_for` | `tuple[type, ...]` | Module types to skip weight quantization for | ### TensorQuantConfig | Parameter | Type | Default | Description | |---|---|---|---| | `n_bits` | `int` | 8 | Quantization bit width | | `symmetric` | `bool` | `True` | Use symmetric quantization | | `per_channel` | `bool` | `False` | Per-channel (weights) vs per-tensor | **Recommended settings for TensorRT:** - Activations: 8-bit, symmetric, per-tensor - Weights: 8-bit, symmetric, **per-channel** - Skip weight quantization for `LayerNorm` (it runs in FP16 on TensorRT) ```python config = QuantConfig( activation=TensorQuantConfig(n_bits=8, symmetric=True), weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True), skip_weight_quant_for=(nn.LayerNorm,), ) ``` ## Step 3: Calibrate ```python from embedl_deploy.quantize import calibrate # Define a forward loop that runs calibration data through the model def forward_loop(model): for batch_tensor, _ in calibration_loader: model(batch_tensor) 
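# `calibration_loader` above is assumed to be a DataLoader you define
# elsewhere. A few hundred representative samples are typically enough;
# a variant (a sketch, not part of the library's API) caps the number
# of calibration batches with itertools.islice:
import itertools

def forward_loop_limited(model, max_batches=32):
    for batch_tensor, _ in itertools.islice(calibration_loader, max_batches):
        model(batch_tensor)
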
calibrate(quantized_model, forward_loop) ``` `calibrate` runs the forward loop on the model, collecting activation statistics to compute the optimal scale and zero-point for each `QuantStub`. ## QDQ point placement Each fused pattern declares where QDQ stubs should be placed via its `qdq_points` attribute. This is the key difference from uniform quantization: | Pattern | QDQ points | Rationale | |---|---|---| | `ConvBNPattern` | `INPUT` | Quantize activations entering the conv | | `ConvBNReLUPattern` | `INPUT` | Quantize input; ReLU is fused, no separate stub needed | | `ConvBNAddReLUPattern` | `INPUT`, `RESIDUAL_INPUT` | Both paths into the add must be quantized | | `StemConvBNReLUMaxPoolPattern` | `INPUT` | First layer — quantize input image | | `LinearPattern` | `INPUT` | Quantize activations entering the linear | | `LinearReLUPattern` | `INPUT` | Quantize input; activation is fused | | `LayerNormPattern` | (none) | Not quantized — hurts accuracy, no latency benefit | | `MHAInProjectionPattern` | `INPUT` | Quantize input to Q/K/V projection | | `ScaledDotProductAttentionPattern` | `INPUT`, `KEY_INPUT`, `VALUE_INPUT` | Q, K, V each need separate quantization | | `AdaptiveAvgPoolPattern` | `INPUT`, `OUTPUT` | Both sides for TRT fusion | ## Why pattern-aware QDQ matters ### The problem with uniform quantization Tools like NVIDIA ModelOpt apply QDQ stubs around every operator indiscriminately. This causes several issues: 1. **Broken fusions** — QDQ nodes between operators that should be fused (e.g., between Conv and BN) prevent the hardware compiler from merging them into a single kernel. 2. **Reformatting overhead** — quantizing memory-bound operators like depthwise convolutions or global average pooling forces TensorRT to insert data reformatting layers (INT8 ↔ FP16), which can cost more than the quantization saves. 3. 
**Accuracy loss without latency gain** — quantizing operators that don't benefit from INT8 (LayerNorm, element-wise ops) reduces accuracy with no performance improvement. ### Embedl Deploy's approach Pattern-declared QDQ points ensure: - Stubs are placed **outside** fused operator groups, not between them. - Memory-bound operators can be left unquantized when beneficial. - The hardware compiler sees exactly the QDQ topology it expects. ## Full example: ResNet50 INT8 PTQ ```python import torch from torchvision.models import resnet50 from embedl_deploy import transform from embedl_deploy.tensorrt import TENSORRT_PATTERNS from embedl_deploy.quantize import ( QuantConfig, TensorQuantConfig, QuantStub, WeightFakeQuantize, calibrate, insert_qdq, ) # 1. Load and fuse model = resnet50(weights="DEFAULT").eval() result = transform(model, patterns=TENSORRT_PATTERNS) # 2. Insert QDQ stubs quantized = insert_qdq( result.model, result.matches, config=QuantConfig( activation=TensorQuantConfig(n_bits=8, symmetric=True), weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True), ), ) # Count inserted stubs n_quant = sum(1 for m in quantized.modules() if isinstance(m, QuantStub)) n_wfq = sum(1 for m in quantized.modules() if isinstance(m, WeightFakeQuantize)) print(f"QuantStubs: {n_quant}, WeightFakeQuantize: {n_wfq}") # 3. Calibrate def forward_loop(model): for batch in calibration_batches[:32]: model(batch) calibrate(quantized, forward_loop) # 4. 
Export torch.onnx.export( quantized.cpu().eval(), torch.randn(1, 3, 224, 224), "resnet50_int8.onnx", opset_version=20, input_names=["input"], output_names=["output"], dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}}, ) ``` Compile with TensorRT using `--best` (FP16 + INT8): ```bash trtexec --onnx=resnet50_int8.onnx --best ``` ## Quantization-Aware Training (QAT) For higher accuracy, you can fine-tune the quantized model with QAT after calibration: ```python from embedl_deploy.quantize import ( enable_fake_quant, disable_fake_quant, freeze_bn_stats, prepare_qat, ) # Prepare for QAT (enables fake quantization in training mode) prepare_qat(quantized) # Fine-tune with your training loop quantized.train() enable_fake_quant(quantized) freeze_bn_stats(quantized) # Keep BN statistics from calibration for epoch in range(num_epochs): for images, targets in train_loader: output = quantized(images) loss = criterion(output, targets) loss.backward() optimizer.step() optimizer.zero_grad() # Switch back to eval for export quantized.eval() disable_fake_quant(quantized) ```
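QAT fine-tuning works because fake quantization is made differentiable. A common mechanism for this, shown here as an illustration of the general technique rather than Embedl Deploy's actual implementation, is the straight-through estimator (STE): round to the quantization grid in the forward pass, but let gradients flow through unchanged in the backward pass.

```python
import torch


class FakeQuantSTE(torch.autograd.Function):
    """Straight-through estimator for symmetric INT8 fake quantization (sketch)."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize-dequantize: snap values to the INT8 grid defined by `scale`.
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Ignore the rounding in the backward pass: gradient passes through as-is.
        return grad_output, None


x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.1))
y.sum().backward()  # x.grad is all ones: rounding is treated as identity
```

Without the STE (or a similar surrogate gradient), the rounding step would have zero gradient almost everywhere and the weights could not be fine-tuned through the quantizers.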