Quantization#

Embedl Deploy provides hardware-aware INT8 quantization through explicit QDQ (Quantize/DeQuantize) stub placement. Unlike uniform quantization approaches that insert QDQ nodes around every operator, Embedl Deploy places stubs only at positions declared by each pattern’s qdq_points — ensuring that quantization does not break operator fusions in the target hardware compiler.

Quantization pipeline#

The quantization pipeline has three steps:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ transform() │ ──▶ │insert_qdq() │ ──▶ │ calibrate   │
│(fuse model) │     │(add stubs)  │     │(set scales) │
└─────────────┘     └─────────────┘     └─────────────┘
  1. Transform — apply conversions and fusions. Each pattern declares its qdq_points attribute (e.g., INPUT, RESIDUAL_INPUT), and these are stored in the returned PatternMatch objects for use in the next step.

  2. Insert QDQ — read the qdq_points from each PatternMatch and place QuantStub (for activations) and WeightFakeQuantize (for weights) modules at those positions. A PatternMatch is a dataclass that records which nodes were matched and what QDQ points apply to that fusion.

  3. Calibrate — run representative data through the model to compute scale and zero-point for each quantizer.
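The PatternMatch objects mentioned above can be pictured with a small standalone sketch. The field names and the QDQPoint enum below are illustrative assumptions for exposition, not the library's exact definitions (only the point names INPUT, RESIDUAL_INPUT, etc. come from these docs):

```python
from dataclasses import dataclass
from enum import Enum, auto


class QDQPoint(Enum):
    """Illustrative QDQ point identifiers (names follow the tables below)."""
    INPUT = auto()
    RESIDUAL_INPUT = auto()
    KEY_INPUT = auto()
    VALUE_INPUT = auto()
    OUTPUT = auto()


@dataclass
class PatternMatch:
    """Records which graph nodes a pattern matched and where QDQ stubs go."""
    pattern_name: str
    node_names: tuple
    qdq_points: tuple


# A residual Conv-BN-Add-ReLU fusion quantizes both inputs to the add
match = PatternMatch(
    pattern_name="ConvBNAddReLUPattern",
    node_names=("layer1.0.conv1", "layer1.0.bn1", "layer1.0.relu"),
    qdq_points=(QDQPoint.INPUT, QDQPoint.RESIDUAL_INPUT),
)
```

The key property is that each match carries its own qdq_points, so the insertion step needs no global heuristics: it simply reads the declared points per fusion.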

Step 1: Transform and fuse#

import torch
from torchvision.models import resnet50

from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_PATTERNS

model = resnet50(weights="DEFAULT").eval()
result = transform(model, patterns=TENSORRT_PATTERNS)
fused_model = result.model
matches = result.matches  # needed for insert_qdq

Step 2: Insert QDQ stubs#

from torch import nn
from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    insert_qdq,
)

quantized_model = insert_qdq(
    fused_model,
    matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    ),
)

QuantConfig#

QuantConfig controls how quantization stubs are configured:

| Parameter | Type | Description |
|---|---|---|
| activation | TensorQuantConfig | Config for activation quantizers |
| weight | TensorQuantConfig | Config for weight quantizers |
| skip_weight_quant_for | tuple[type, ...] | Module types to skip weight quantization for |

TensorQuantConfig#

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_bits | int | 8 | Quantization bit width |
| symmetric | bool | True | Use symmetric quantization |
| per_channel | bool | False | Per-channel (weights) vs per-tensor |
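To make the per_channel flag concrete, here is a small self-contained sketch (plain Python, independent of the library) of why per-channel scales help weights. With symmetric INT8, scale = amax / 127 and zero_point = 0; a single per-tensor scale is dominated by the largest channel:

```python
# Weight rows (output channels) with very different magnitudes:
weight = [
    [0.02, -0.01, 0.015],  # channel 0: small values
    [2.0, -1.5, 1.8],      # channel 1: large values
]

QMAX = 127  # symmetric signed 8-bit

# Per-tensor: one scale for the whole weight, dominated by channel 1.
per_tensor_scale = max(abs(v) for row in weight for v in row) / QMAX

# Per-channel: one scale per output channel, so channel 0 keeps precision.
per_channel_scales = [max(abs(v) for v in row) / QMAX for row in weight]

print(f"per-tensor scale:   {per_tensor_scale:.6f}")
print(f"per-channel scales: {per_channel_scales}")

# With the per-tensor scale, channel 0's largest value collapses to a
# single integer step: round(0.02 / per_tensor_scale) == 1.
q = round(0.02 / per_tensor_scale)
```

Per-channel quantization gives each output channel its own resolution, which is why it is the recommended setting for weights below.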

Recommended settings for TensorRT:

  • Activations: 8-bit, symmetric, per-tensor

  • Weights: 8-bit, symmetric, per-channel

  • Skip weight quantization for LayerNorm (it runs in FP16 on TensorRT)

config = QuantConfig(
    activation=TensorQuantConfig(n_bits=8, symmetric=True),
    weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    skip_weight_quant_for=(nn.LayerNorm,),
)

Step 3: Calibrate#

from embedl_deploy.quantize import calibrate

# Define a forward loop that runs calibration data through the model.
# `calibration_loader` is any iterable yielding (input, label) batches.
def forward_loop(model):
    for batch_tensor, _ in calibration_loader:
        model(batch_tensor)

calibrate(quantized_model, forward_loop)

calibrate runs the forward loop on the model, collecting activation statistics to compute the optimal scale and zero-point for each QuantStub.
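Conceptually (the library's actual observer implementation may differ), symmetric calibration reduces to tracking the maximum absolute activation value across the forward loop and dividing by the integer range:

```python
class MinMaxObserver:
    """Toy symmetric observer: tracks amax over calibration batches."""

    def __init__(self, n_bits=8):
        self.qmax = 2 ** (n_bits - 1) - 1  # 127 for signed 8-bit
        self.amax = 0.0

    def observe(self, values):
        self.amax = max(self.amax, max(abs(v) for v in values))

    def compute_qparams(self):
        # Symmetric quantization: zero_point is always 0.
        scale = self.amax / self.qmax if self.amax > 0 else 1.0
        return scale, 0


obs = MinMaxObserver(n_bits=8)
for batch in ([0.5, -1.2, 0.9], [2.54, -0.3]):  # stand-in activation batches
    obs.observe(batch)

# amax = 2.54, so scale = 2.54 / 127 ≈ 0.02, zero_point = 0
scale, zero_point = obs.compute_qparams()
```

More sophisticated observers (histogram or percentile based) clip outliers before computing the scale, but the interface is the same: statistics in, (scale, zero_point) out.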

QDQ point placement#

Each fused pattern declares where QDQ stubs should be placed via its qdq_points attribute. This is the key difference from uniform quantization:

| Pattern | QDQ points | Rationale |
|---|---|---|
| ConvBNPattern | INPUT | Quantize activations entering the conv |
| ConvBNReLUPattern | INPUT | Quantize input; ReLU is fused, no separate stub needed |
| ConvBNAddReLUPattern | INPUT, RESIDUAL_INPUT | Both paths into the add must be quantized |
| StemConvBNReLUMaxPoolPattern | INPUT | First layer; quantize the input image |
| LinearPattern | INPUT | Quantize activations entering the linear |
| LinearReLUPattern | INPUT | Quantize input; activation is fused |
| LayerNormPattern | (none) | Not quantized; hurts accuracy, no latency benefit |
| MHAInProjectionPattern | INPUT | Quantize input to Q/K/V projection |
| ScaledDotProductAttentionPattern | INPUT, KEY_INPUT, VALUE_INPUT | Q, K, V each need separate quantization |
| AdaptiveAvgPoolPattern | INPUT, OUTPUT | Both sides for TRT fusion |
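The table above can be read as a lookup that drives stub insertion. The sketch below is a hypothetical illustration of that mechanism, not the library's internals; the pattern and point names come from the table, everything else is invented for exposition:

```python
# Each pattern declares its QDQ points; an empty tuple means "never quantize".
PATTERN_QDQ_POINTS = {
    "ConvBNReLUPattern": ("INPUT",),
    "ConvBNAddReLUPattern": ("INPUT", "RESIDUAL_INPUT"),
    "LayerNormPattern": (),  # declared empty: no stubs ever placed
}


def stubs_for(matches):
    """Yield (module_name, point) pairs where QuantStubs would be inserted."""
    for module_name, pattern in matches:
        for point in PATTERN_QDQ_POINTS[pattern]:
            yield module_name, point


matches = [
    ("layer1.0", "ConvBNReLUPattern"),
    ("layer1.1", "ConvBNAddReLUPattern"),
    ("encoder.norm", "LayerNormPattern"),
]

# layer1.0 gets one stub, layer1.1 gets two, encoder.norm gets none.
placed = list(stubs_for(matches))
```

Because LayerNormPattern declares no points, it is skipped without any special-case logic; the declaration itself is the policy.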

Why pattern-aware QDQ matters#

The problem with uniform quantization#

Tools like NVIDIA ModelOpt apply QDQ stubs around every operator indiscriminately. This causes several issues:

  1. Broken fusions — QDQ nodes between operators that should be fused (e.g., between Conv and BN) prevent the hardware compiler from merging them into a single kernel.

  2. Reformatting overhead — quantizing memory-bound operators like depthwise convolutions or global average pooling forces TensorRT to insert data reformatting layers (INT8 ↔ FP16), which can cost more than the quantization saves.

  3. Accuracy loss without latency gain — quantizing operators that don’t benefit from INT8 (LayerNorm, element-wise ops) reduces accuracy with no performance improvement.

Embedl Deploy’s approach#

Pattern-declared QDQ points ensure:

  • Stubs are placed outside fused operator groups, not between them.

  • Memory-bound operators can be left unquantized when beneficial.

  • The hardware compiler sees exactly the QDQ topology it expects.

Full example: ResNet50 INT8 PTQ#

import torch
from torchvision.models import resnet50

from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_PATTERNS
from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    QuantStub,
    WeightFakeQuantize,
    calibrate,
    insert_qdq,
)

# 1. Load and fuse
model = resnet50(weights="DEFAULT").eval()
result = transform(model, patterns=TENSORRT_PATTERNS)

# 2. Insert QDQ stubs
quantized = insert_qdq(
    result.model,
    result.matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    ),
)

# Count inserted stubs
n_quant = sum(1 for m in quantized.modules() if isinstance(m, QuantStub))
n_wfq = sum(1 for m in quantized.modules() if isinstance(m, WeightFakeQuantize))
print(f"QuantStubs: {n_quant}, WeightFakeQuantize: {n_wfq}")

# 3. Calibrate on a few dozen representative batches
# (`calibration_batches` is a list of input tensors prepared elsewhere)
def forward_loop(model):
    for batch in calibration_batches[:32]:
        model(batch)

calibrate(quantized, forward_loop)

# 4. Export
torch.onnx.export(
    quantized.cpu().eval(),
    torch.randn(1, 3, 224, 224),
    "resnet50_int8.onnx",
    opset_version=20,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

Compile with TensorRT using --best (FP16 + INT8):

trtexec --onnx=resnet50_int8.onnx --best

Quantization-Aware Training (QAT)#

For higher accuracy, you can fine-tune the quantized model with QAT after calibration:

from embedl_deploy.quantize import (
    enable_fake_quant,
    disable_fake_quant,
    freeze_bn_stats,
    prepare_qat,
)

# Prepare for QAT (enables fake quantization in training mode)
prepare_qat(quantized)

# Fine-tune with your training loop (criterion, optimizer, and
# train_loader are your usual training objects)
quantized.train()
enable_fake_quant(quantized)
freeze_bn_stats(quantized)  # Keep BN statistics from calibration

for epoch in range(num_epochs):
    for images, targets in train_loader:
        optimizer.zero_grad()
        output = quantized(images)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

# Switch back to eval for export
quantized.eval()
disable_fake_quant(quantized)