# Quantization
Embedl Deploy provides hardware-aware INT8 quantization through explicit QDQ
(Quantize/DeQuantize) stub placement. Unlike uniform quantization approaches
that insert QDQ nodes around every operator, Embedl Deploy places stubs only
at positions declared by each pattern's `qdq_points`, ensuring that
quantization does not break operator fusions in the target hardware compiler.
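Conceptually, each QDQ pair simulates INT8 storage: a tensor is rounded onto a 255-level grid and mapped back to float, so the rest of the network sees the rounding error it will incur on hardware. A minimal sketch in plain PyTorch (independent of Embedl Deploy's actual stub implementation):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Simulate a symmetric INT8 quantize -> dequantize round trip:
    round onto the integer grid [-127, 127], then map back to float."""
    q = torch.clamp(torch.round(x / scale), -127, 127)  # INT8 grid
    return q * scale                                    # back to float

x = torch.tensor([0.3, -1.2, 2.5])
scale = x.abs().max().item() / 127       # symmetric scale from max |x|
xq = fake_quantize(x, scale)
# Within the clipping range, the round-trip error is at most half a step
print((x - xq).abs().max().item() <= scale / 2)
```

Placing such round trips only where the hardware compiler expects them, rather than around every operator, is the core idea of the rest of this page.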
## Quantization pipeline
The quantization pipeline has three steps:
```text
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│ transform() │ ──▶ │insert_qdq() │ ──▶ │  calibrate  │
│(fuse model) │      │ (add stubs) │      │(set scales) │
└─────────────┘      └─────────────┘      └─────────────┘
```
1. **Transform**: apply conversions and fusions. Each pattern declares its
   `qdq_points` attribute (e.g., `INPUT`, `RESIDUAL_INPUT`), and these are stored
   in the returned `PatternMatch` objects for use in the next step.
2. **Insert QDQ**: read the `qdq_points` from each `PatternMatch` and place
   `QuantStub` (for activations) and `WeightFakeQuantize` (for weights) modules
   at those positions. A `PatternMatch` is a dataclass that records which nodes
   were matched and what QDQ points apply to that fusion.
3. **Calibrate**: run representative data through the model to compute the scale
   and zero-point for each quantizer.
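The record tying a fusion to its QDQ points can be pictured as a small dataclass. A hypothetical sketch (field names are illustrative; the actual `PatternMatch` in `embedl_deploy` may differ):

```python
from dataclasses import dataclass
from enum import Enum, auto

class QDQPoint(Enum):
    """Illustrative stand-ins for qdq_points values (e.g. INPUT, RESIDUAL_INPUT)."""
    INPUT = auto()
    RESIDUAL_INPUT = auto()

@dataclass
class PatternMatchSketch:
    """Hypothetical shape of a PatternMatch: which graph nodes were fused,
    and where insert_qdq() should place stubs around that fusion."""
    pattern_name: str
    node_names: tuple[str, ...]
    qdq_points: tuple[QDQPoint, ...]

# A Conv+BN+ReLU fusion that needs only its input quantized:
m = PatternMatchSketch("conv_bn_relu", ("conv1", "bn1", "relu1"), (QDQPoint.INPUT,))
print(m.qdq_points)
```

The key property is that QDQ points travel with the match: `insert_qdq()` never has to guess where stubs belong, because each pattern declared that at transform time.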
### Step 1: Transform and fuse
```python
import torch
from torchvision.models import resnet50

from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_PATTERNS

model = resnet50(weights="DEFAULT").eval()
result = transform(model, patterns=TENSORRT_PATTERNS)
fused_model = result.model
matches = result.matches  # needed for insert_qdq
```
### Step 2: Insert QDQ stubs
```python
from torch import nn

from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    insert_qdq,
)

quantized_model = insert_qdq(
    fused_model,
    matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    ),
)
```
#### QuantConfig
`QuantConfig` controls how quantization stubs are configured:

| Parameter | Type | Description |
|---|---|---|
| `activation` | `TensorQuantConfig` | Config for activation quantizers |
| `weight` | `TensorQuantConfig` | Config for weight quantizers |
| `skip_weight_quant_for` | tuple of module types | Module types to skip weight quantization for |
#### TensorQuantConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_bits` | `int` | `8` | Quantization bit width |
| `symmetric` | `bool` | | Use symmetric quantization |
| `per_channel` | `bool` | | Per-channel (weights) vs per-tensor |
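The `per_channel` choice matters most for weights, where output channels can have very different ranges. A sketch contrasting the two granularities in plain PyTorch (independent of `TensorQuantConfig` internals):

```python
import torch

def quant_error(w: torch.Tensor, scale: torch.Tensor) -> float:
    """Max abs error after a symmetric INT8 fake-quantization round trip."""
    q = torch.clamp(torch.round(w / scale), -127, 127) * scale
    return (w - q).abs().max().item()

# Two output channels with very different weight ranges
w = torch.stack([torch.linspace(-0.01, 0.01, 8), torch.linspace(-10.0, 10.0, 8)])

per_tensor = w.abs().max() / 127                       # one scale for everything
per_channel = w.abs().amax(dim=1, keepdim=True) / 127  # one scale per channel

# With a single per-tensor scale, the small channel rounds to zero;
# its own per-channel scale preserves it:
small_err_pt = quant_error(w[0], per_tensor)
small_err_pc = quant_error(w[0], per_channel[0])
print(small_err_pc < small_err_pt)
```

This is why the recommended settings below use per-channel quantization for weights but per-tensor for activations, whose scale must be fixed before the values are known.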
Recommended settings for TensorRT:

- Activations: 8-bit, symmetric, per-tensor
- Weights: 8-bit, symmetric, per-channel
- Skip weight quantization for `LayerNorm` (it runs in FP16 on TensorRT)
```python
config = QuantConfig(
    activation=TensorQuantConfig(n_bits=8, symmetric=True),
    weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    skip_weight_quant_for=(nn.LayerNorm,),
)
```
### Step 3: Calibrate
```python
from embedl_deploy.quantize import calibrate

# Define a forward loop that runs calibration data through the model
def forward_loop(model):
    for batch_tensor, _ in calibration_loader:
        model(batch_tensor)

calibrate(quantized_model, forward_loop)
```
`calibrate` runs the forward loop on the model, collecting activation
statistics to compute the optimal scale and zero-point for each `QuantStub`.
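What calibration computes can be sketched with a min/max observer: track the largest activation magnitude across batches, then derive the symmetric INT8 scale from it. This is a simplified illustration, not Embedl Deploy's actual observer:

```python
import torch

class MinMaxObserver:
    """Track the max absolute activation value seen across calibration
    batches; for symmetric quantization the zero-point is 0 and the
    scale maps that max onto the top INT8 level."""
    def __init__(self):
        self.amax = 0.0

    def observe(self, x: torch.Tensor) -> None:
        self.amax = max(self.amax, x.abs().max().item())

    def scale(self) -> float:
        # 127 positive INT8 levels for symmetric quantization
        return self.amax / 127

obs = MinMaxObserver()
for batch in [torch.randn(4, 8), torch.randn(4, 8) * 3]:
    obs.observe(batch)
print(obs.scale() > 0)
```

Running more (and more representative) batches tightens the statistics, which is why the forward loop should use real calibration data rather than random inputs.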
## QDQ point placement
Each fused pattern declares where QDQ stubs should be placed via its
`qdq_points` attribute. This is the key difference from uniform quantization:
| Pattern | QDQ points | Rationale |
|---|---|---|
| | | Quantize activations entering the conv |
| | | Quantize input; ReLU is fused, no separate stub needed |
| | | Both paths into the add must be quantized |
| | | First layer: quantize the input image |
| | | Quantize activations entering the linear |
| | | Quantize input; activation is fused |
| | (none) | Not quantized: hurts accuracy, no latency benefit |
| | | Quantize input to the Q/K/V projection |
| | | Q, K, V each need separate quantization |
| | | Both sides quantized for TensorRT fusion |
## Why pattern-aware QDQ matters
### The problem with uniform quantization
Tools like NVIDIA ModelOpt apply QDQ stubs around every operator indiscriminately. This causes several issues:
- **Broken fusions:** QDQ nodes between operators that should be fused (e.g., between Conv and BN) prevent the hardware compiler from merging them into a single kernel.
- **Reformatting overhead:** quantizing memory-bound operators such as depthwise convolutions or global average pooling forces TensorRT to insert data reformatting layers (INT8 ↔ FP16), which can cost more than the quantization saves.
- **Accuracy loss without latency gain:** quantizing operators that don't benefit from INT8 (LayerNorm, element-wise ops) reduces accuracy with no performance improvement.
### Embedl Deploy's approach
Pattern-declared QDQ points ensure:

- Stubs are placed outside fused operator groups, not between them.
- Memory-bound operators can be left unquantized when beneficial.
- The hardware compiler sees exactly the QDQ topology it expects.
## Full example: ResNet50 INT8 PTQ
```python
import torch
from torchvision.models import resnet50

from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_PATTERNS
from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    QuantStub,
    WeightFakeQuantize,
    calibrate,
    insert_qdq,
)

# 1. Load and fuse
model = resnet50(weights="DEFAULT").eval()
result = transform(model, patterns=TENSORRT_PATTERNS)

# 2. Insert QDQ stubs
quantized = insert_qdq(
    result.model,
    result.matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    ),
)

# Count inserted stubs
n_quant = sum(1 for m in quantized.modules() if isinstance(m, QuantStub))
n_wfq = sum(1 for m in quantized.modules() if isinstance(m, WeightFakeQuantize))
print(f"QuantStubs: {n_quant}, WeightFakeQuantize: {n_wfq}")

# 3. Calibrate (calibration_batches: your list of representative input tensors)
def forward_loop(model):
    for batch in calibration_batches[:32]:
        model(batch)

calibrate(quantized, forward_loop)

# 4. Export
torch.onnx.export(
    quantized.cpu().eval(),
    torch.randn(1, 3, 224, 224),
    "resnet50_int8.onnx",
    opset_version=20,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```
Compile with TensorRT using `--best` (FP16 + INT8):

```shell
trtexec --onnx=resnet50_int8.onnx --best
```
## Quantization-Aware Training (QAT)
For higher accuracy, you can fine-tune the quantized model with QAT after calibration:
```python
from embedl_deploy.quantize import (
    enable_fake_quant,
    disable_fake_quant,
    freeze_bn_stats,
    prepare_qat,
)

# Prepare for QAT (enables fake quantization in training mode)
prepare_qat(quantized)

# Fine-tune with your training loop
quantized.train()
enable_fake_quant(quantized)
freeze_bn_stats(quantized)  # Keep BN statistics from calibration

for epoch in range(num_epochs):
    for images, targets in train_loader:
        output = quantized(images)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Switch back to eval for export
quantized.eval()
disable_fake_quant(quantized)
```
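QAT works at all because the non-differentiable rounding inside fake quantization is bypassed with a straight-through estimator (STE): the forward pass sees quantized values, while the backward pass treats the round trip as the identity. A hedged sketch of the standard trick (Embedl Deploy's `WeightFakeQuantize` may implement this differently):

```python
import torch

def fake_quant_ste(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Symmetric INT8 fake quantization with a straight-through estimator.
    The detach() trick makes the forward output quantized while the
    backward pass sees gradient 1 (identity) through the round trip."""
    q = torch.clamp(torch.round(x / scale), -127, 127) * scale
    return x + (q - x).detach()  # forward: q; backward: d/dx = 1

x = torch.tensor([0.3, -1.2, 2.5], requires_grad=True)
y = fake_quant_ste(x, scale=2.5 / 127)
y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.])
```

Because gradients pass through unchanged, the fine-tuning loop above can nudge the underlying float weights until they sit well on the INT8 grid.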