# Quantization
Embedl Deploy provides hardware-aware INT8 quantization through explicit QDQ
(Quantize/DeQuantize) stub placement. Unlike uniform quantization approaches
that insert QDQ nodes around every operator, Embedl Deploy places stubs only
at positions declared by each pattern's `qdq_points`, ensuring that
quantization does not break operator fusions in the target hardware compiler.
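Conceptually, each QDQ pair simulates INT8 storage: a tensor is rounded onto a 255-level grid and mapped back to float, so the rest of the network sees the rounding error it will incur on hardware. A minimal sketch in plain PyTorch (independent of Embedl Deploy's actual stub implementation):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Simulate a symmetric INT8 quantize -> dequantize round trip:
    round onto the integer grid [-127, 127], then map back to float."""
    q = torch.clamp(torch.round(x / scale), -127, 127)  # INT8 grid
    return q * scale                                    # back to float

x = torch.tensor([0.3, -1.2, 2.5])
scale = x.abs().max().item() / 127       # symmetric scale from max |x|
xq = fake_quantize(x, scale)
# Within the clipping range, the round-trip error is at most half a step
print((x - xq).abs().max().item() <= scale / 2)
```

Placing such round trips only where the hardware compiler expects them, rather than around every operator, is the core idea of the rest of this page.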
## Quantization pipeline
The quantization pipeline has three steps:
```text
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ transform() │ ──▶ │insert_qdq() │ ──▶ │  calibrate  │
│(fuse model) │      │ (add stubs) │      │(set scales) │
└─────────────┘      └─────────────┘      └─────────────┘
```
1. **Transform**: apply conversions and fusions. Each pattern declares its
   `qdq_points` attribute (e.g., `INPUT`, `RESIDUAL_INPUT`), and these are stored
   in the returned `PatternMatch` objects for use in the next step.
2. **Insert QDQ**: read the `qdq_points` from each `PatternMatch` and place
   `QuantStub` (for activations) and `WeightFakeQuantize` (for weights) modules
   at those positions. A `PatternMatch` is a dataclass that records which nodes
   were matched and what QDQ points apply to that fusion.
3. **Calibrate**: run representative data through the model to compute the scale
   and zero-point for each quantizer.
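The record tying a fusion to its QDQ points can be pictured as a small dataclass. A hypothetical sketch (field names are illustrative; the actual `PatternMatch` in `embedl_deploy` may differ):

```python
from dataclasses import dataclass
from enum import Enum, auto

class QDQPoint(Enum):
    """Illustrative stand-ins for qdq_points values (e.g. INPUT, RESIDUAL_INPUT)."""
    INPUT = auto()
    RESIDUAL_INPUT = auto()

@dataclass
class PatternMatchSketch:
    """Hypothetical shape of a PatternMatch: which graph nodes were fused,
    and where insert_qdq() should place stubs around that fusion."""
    pattern_name: str
    node_names: tuple[str, ...]
    qdq_points: tuple[QDQPoint, ...]

# A Conv+BN+ReLU fusion that needs only its input quantized:
m = PatternMatchSketch("conv_bn_relu", ("conv1", "bn1", "relu1"), (QDQPoint.INPUT,))
print(m.qdq_points)
```

The key property is that QDQ points travel with the match: `insert_qdq()` never has to guess where stubs belong, because each pattern declared that at transform time.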
### Step 1: Transform and fuse
```python
import torch
from torchvision.models import resnet50

from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_PATTERNS

model = resnet50(weights="DEFAULT").eval()
result = transform(model, patterns=TENSORRT_PATTERNS)
fused_model = result.model
matches = result.matches  # needed for insert_qdq
```
### Step 2: Insert QDQ stubs
```python
from torch import nn

from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    insert_qdq,
)

quantized_model = insert_qdq(
    fused_model,
    matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    ),
)
```
#### QuantConfig
`QuantConfig` controls how quantization stubs are configured:

| Parameter | Type | Description |
|---|---|---|
| `activation` | `TensorQuantConfig` | Config for activation quantizers |
| `weight` | `TensorQuantConfig` | Config for weight quantizers |
| `skip_weight_quant_for` | tuple of module types | Module types to skip weight quantization for |
#### TensorQuantConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_bits` | `int` | `8` | Quantization bit width |
| `symmetric` | `bool` | | Use symmetric quantization |
| `per_channel` | `bool` | | Per-channel (weights) vs per-tensor |
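The `per_channel` choice matters most for weights, where output channels can have very different ranges. A sketch contrasting the two granularities in plain PyTorch (independent of `TensorQuantConfig` internals):

```python
import torch

def quant_error(w: torch.Tensor, scale: torch.Tensor) -> float:
    """Max abs error after a symmetric INT8 fake-quantization round trip."""
    q = torch.clamp(torch.round(w / scale), -127, 127) * scale
    return (w - q).abs().max().item()

# Two output channels with very different weight ranges
w = torch.stack([torch.linspace(-0.01, 0.01, 8), torch.linspace(-10.0, 10.0, 8)])

per_tensor = w.abs().max() / 127                       # one scale for everything
per_channel = w.abs().amax(dim=1, keepdim=True) / 127  # one scale per channel

# With a single per-tensor scale, the small channel rounds to zero;
# its own per-channel scale preserves it:
small_err_pt = quant_error(w[0], per_tensor)
small_err_pc = quant_error(w[0], per_channel[0])
print(small_err_pc < small_err_pt)
```

This is why the recommended settings below use per-channel quantization for weights but per-tensor for activations, whose scale must be fixed before the values are known.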
Recommended settings for TensorRT:

- Activations: 8-bit, symmetric, per-tensor
- Weights: 8-bit, symmetric, per-channel
- Skip weight quantization for `LayerNorm` (it runs in FP16 on TensorRT)
```python
config = QuantConfig(
    activation=TensorQuantConfig(n_bits=8, symmetric=True),
    weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    skip_weight_quant_for=(nn.LayerNorm,),
)
```
### Step 3: Calibrate
```python
from embedl_deploy.quantize import calibrate

# Define a forward loop that runs calibration data through the model
def forward_loop(model):
    for batch_tensor, _ in calibration_loader:
        model(batch_tensor)

calibrate(quantized_model, forward_loop)
```
`calibrate` runs the forward loop on the model, collecting activation
statistics to compute the optimal scale and zero-point for each `QuantStub`.
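What calibration computes can be sketched with a min/max observer: track the largest activation magnitude across batches, then derive the symmetric INT8 scale from it. This is a simplified illustration, not Embedl Deploy's actual observer:

```python
import torch

class MinMaxObserver:
    """Track the max absolute activation value seen across calibration
    batches; for symmetric quantization the zero-point is 0 and the
    scale maps that max onto the top INT8 level."""
    def __init__(self):
        self.amax = 0.0

    def observe(self, x: torch.Tensor) -> None:
        self.amax = max(self.amax, x.abs().max().item())

    def scale(self) -> float:
        # 127 positive INT8 levels for symmetric quantization
        return self.amax / 127

obs = MinMaxObserver()
for batch in [torch.randn(4, 8), torch.randn(4, 8) * 3]:
    obs.observe(batch)
print(obs.scale() > 0)
```

Running more (and more representative) batches tightens the statistics, which is why the forward loop should use real calibration data rather than random inputs.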
## QDQ point placement
Each fused pattern declares where QDQ stubs should be placed via its
`qdq_points` attribute. This is the key difference from uniform quantization:
| Pattern | QDQ points | Rationale |
|---|---|---|
| | | Quantize activations entering the conv |
| | | Quantize input; ReLU is fused, no separate stub needed |
| | | Both paths into the add must be quantized |
| | | First layer: quantize the input image |
| | | Quantize activations entering the linear |
| | | Quantize input; activation is fused |
| | (none) | Not quantized: hurts accuracy, no latency benefit |
| | | Quantize input to the Q/K/V projection |
| | | Q, K, V each need separate quantization |
| | | Both sides quantized for TensorRT fusion |
## Why pattern-aware QDQ matters
### The problem with uniform quantization
Tools like NVIDIA ModelOpt apply QDQ stubs around every operator indiscriminately. This causes several issues:
- **Broken fusions:** QDQ nodes between operators that should be fused (e.g., between Conv and BN) prevent the hardware compiler from merging them into a single kernel.
- **Reformatting overhead:** quantizing memory-bound operators such as depthwise convolutions or global average pooling forces TensorRT to insert data reformatting layers (INT8 ↔ FP16), which can cost more than the quantization saves.
- **Accuracy loss without latency gain:** quantizing operators that don't benefit from INT8 (LayerNorm, element-wise ops) reduces accuracy with no performance improvement.
### Embedl Deploy's approach
Pattern-declared QDQ points ensure:

- Stubs are placed outside fused operator groups, not between them.
- Memory-bound operators can be left unquantized when beneficial.
- The hardware compiler sees exactly the QDQ topology it expects.
## Full example: ResNet50 INT8 PTQ
```python
import torch
from torchvision.models import resnet50

from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_PATTERNS
from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    QuantStub,
    WeightFakeQuantize,
    calibrate,
    insert_qdq,
)

# 1. Load and fuse
model = resnet50(weights="DEFAULT").eval()
result = transform(model, patterns=TENSORRT_PATTERNS)

# 2. Insert QDQ stubs
quantized = insert_qdq(
    result.model,
    result.matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    ),
)

# Count inserted stubs
n_quant = sum(1 for m in quantized.modules() if isinstance(m, QuantStub))
n_wfq = sum(1 for m in quantized.modules() if isinstance(m, WeightFakeQuantize))
print(f"QuantStubs: {n_quant}, WeightFakeQuantize: {n_wfq}")

# 3. Calibrate (calibration_batches: your list of representative input tensors)
def forward_loop(model):
    for batch in calibration_batches[:32]:
        model(batch)

calibrate(quantized, forward_loop)

# 4. Export
torch.onnx.export(
    quantized.cpu().eval(),
    torch.randn(1, 3, 224, 224),
    "resnet50_int8.onnx",
    opset_version=20,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```
Compile with TensorRT using `--best` (FP16 + INT8):

```shell
trtexec --onnx=resnet50_int8.onnx --best
```
## Quantization-Aware Training (QAT)
For higher accuracy, you can fine-tune the quantized model with QAT after calibration:
```python
from embedl_deploy.quantize import (
    enable_fake_quant,
    disable_fake_quant,
    freeze_bn_stats,
    prepare_qat,
)

# Prepare for QAT (enables fake quantization in training mode)
prepare_qat(quantized)

# Fine-tune with your training loop
quantized.train()
enable_fake_quant(quantized)
freeze_bn_stats(quantized)  # Keep BN statistics from calibration

for epoch in range(num_epochs):
    for images, targets in train_loader:
        output = quantized(images)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Switch back to eval for export
quantized.eval()
disable_fake_quant(quantized)
```
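QAT works at all because the non-differentiable rounding inside fake quantization is bypassed with a straight-through estimator (STE): the forward pass sees quantized values, while the backward pass treats the round trip as the identity. A hedged sketch of the standard trick (Embedl Deploy's `WeightFakeQuantize` may implement this differently):

```python
import torch

def fake_quant_ste(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Symmetric INT8 fake quantization with a straight-through estimator.
    The detach() trick makes the forward output quantized while the
    backward pass sees gradient 1 (identity) through the round trip."""
    q = torch.clamp(torch.round(x / scale), -127, 127) * scale
    return x + (q - x).detach()  # forward: q; backward: d/dx = 1

x = torch.tensor([0.3, -1.2, 2.5], requires_grad=True)
y = fake_quant_ste(x, scale=2.5 / 127)
y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.])
```

Because gradients pass through unchanged, the fine-tuning loop above can nudge the underlying float weights until they sit well on the INT8 grid.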