Benchmark Results#

This page summarizes INT8 Post-Training Quantization (PTQ) benchmark results for three architecture families: ResNet, ConvNeXt, and Vision Transformer (ViT). Each benchmark compares:

  • Baseline FP16 — the pretrained model compiled with TensorRT in FP16 mode.

  • Embedl Deploy Mixed-Precision — pattern-aware QDQ placement, e.g., depthwise convolutions and global average pooling are left in FP16 while compute-bound operators are quantized to INT8.

For reference, results from uniform INT8 quantization (NVIDIA ModelOpt `INT8_DEFAULT_CFG`) are also included. "Uniform" here means INT8 is applied to every operator regardless of its compute characteristics, in contrast to selective mixed-precision quantization, which leaves memory-bound operators in FP16.

Test setup#

  • GPU: NVIDIA L4

  • TensorRT: 10.9

  • Dataset: ImageNette (10-class subset of ImageNet)

  • Calibration: 32 batches of 32 images from the training set

  • Latency: measured with `trtexec --useCudaGraph --useSpinWait --noDataTransfers --duration=30`

  • Accuracy: Top-1 / Top-5 on the ImageNette validation set (3,925 images)
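Top-1 counts a sample as correct when the highest logit matches the label; Top-5 when the label appears among the five highest. A minimal sketch of the metric (illustrative only; the benchmark script's actual implementation may differ):

```python
def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for scores, label in zip(logits, labels):
        # Indices of the k largest scores, highest first.
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)

logits = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 2, 2]
print(topk_accuracy(logits, labels, k=1))  # 2 of 3 predictions correct
```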

ResNet50#

ResNet50 is a classic convolutional architecture with:

  • 7×7 stem convolution + MaxPool

  • 16 bottleneck blocks with residual connections

  • Global average pooling + linear classifier

  • 25.6M parameters

Fusions applied#

| Fused module                   | Count |
|--------------------------------|-------|
| FusedConvBNReLUMaxPool (stem)  | 1     |
| FusedConvBNAddReLU (residuals) | 16    |
| FusedConvBNReLU (main path)    | 16    |
| FusedConvBN                    | 17    |
| FusedAdaptiveAvgPool2d         | 1     |
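Each FusedConvBN* module folds the BatchNorm statistics into the preceding convolution's weights and bias, so Conv+BN executes as a single conv before quantization. A scalar sketch of the standard folding identity (in practice this is applied per output channel):

```python
import math

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into a conv: BN(w*x + b) == w2*x + b2."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Check equivalence on a single value.
w, b, gamma, beta, mean, var = 0.5, 0.1, 1.2, -0.3, 0.4, 0.25
x = 2.0
bn_out = (w * x + b - mean) / math.sqrt(var + 1e-5) * gamma + beta
w2, b2 = fold_bn_into_conv(w, b, gamma, beta, mean, var)
assert abs((w2 * x + b2) - bn_out) < 1e-9
```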

Results#

| Variant                       | Top-1  | Top-5  | Latency (ms) | Speedup |
|-------------------------------|--------|--------|--------------|---------|
| Baseline FP16                 | 96.36% | 99.69% | 0.267        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.10% | 99.67% | 0.218        | 1.22x   |
| ModelOpt INT8                 | 96.08% | 99.64% | 0.222        | 1.20x   |

Analysis: ResNet50 has no depthwise convolutions, so the mixed-precision advantage comes from correct QDQ placement around residual connections (RESIDUAL_INPUT QDQ points) and from leaving global average pooling unquantized. Both quantized variants achieve solid speedups with minimal accuracy loss (<0.3pp Top-1).
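A QDQ point inserts a QuantizeLinear/DequantizeLinear pair that simulates INT8 on a tensor: values are snapped to a scaled int8 grid and mapped back to float. A minimal sketch of symmetric per-tensor fake quantization (illustrative; TensorRT's actual kernels differ):

```python
def qdq(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulate a QuantizeLinear -> DequantizeLinear pair for one value:
    round to the integer grid, clamp to the int8 range, map back to float."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

# Values inside the representable range round to the nearest grid point;
# values outside saturate at the clamp limits.
print(qdq(0.123, scale=0.01))  # rounds to the 0.12 grid point
print(qdq(5.0, scale=0.01))    # saturates at 127 * 0.01
```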

ConvNeXt#

ConvNeXt is a modernized CNN that uses:

  • Depthwise 7×7 convolutions (memory-bound)

  • LayerNorm instead of BatchNorm

  • GELU activations

  • No residual add + ReLU pattern (uses element-wise add without activation)

ConvNeXt is where mixed-precision makes the biggest difference.
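Depthwise convolutions are memory-bound: they perform very little arithmetic per byte of activation and weight traffic, so INT8 compute offers no gain while the extra QDQ reformatting adds cost. A back-of-the-envelope arithmetic-intensity estimate (a sketch with assumed ConvNeXt-like shapes, not a profiler measurement):

```python
def conv_arithmetic_intensity(c_in, c_out, k, h, w, depthwise=False, bytes_per_el=2):
    """Rough MACs-per-byte estimate for a conv layer (stride 1, same padding).
    Low values indicate a memory-bound layer that gains little from INT8 math."""
    if depthwise:
        c_out = c_in                      # depthwise: one filter per input channel
        macs = c_in * k * k * h * w
        params = c_in * k * k
    else:
        macs = c_in * c_out * k * k * h * w
        params = c_in * c_out * k * k
    traffic = (c_in * h * w + c_out * h * w + params) * bytes_per_el
    return macs / traffic

# Dense 3x3 conv vs a ConvNeXt-style depthwise 7x7 at the same resolution.
dense = conv_arithmetic_intensity(96, 96, 3, 56, 56)
dw = conv_arithmetic_intensity(96, 96, 7, 56, 56, depthwise=True)
print(dense, dw)  # the depthwise layer does far fewer MACs per byte moved
```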

ConvNeXt Tiny (28.6M params)#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 96.56% | 100.00% | 0.560        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.10% | 99.97%  | 0.486        | 1.15x   |
| ModelOpt INT8                 | 96.13% | 99.95%  | 0.538        | 1.04x   |

ConvNeXt Base (88.6M params)#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 97.02% | 100.00% | 1.132        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.61% | 100.00% | 0.932        | 1.21x   |
| ModelOpt INT8                 | 96.08% | 99.97%  | 1.100        | 1.03x   |

ConvNeXt Large (197.8M params)#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 97.25% | 100.00% | 2.193        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.74% | 100.00% | 1.701        | 1.29x   |
| ModelOpt INT8                 | 96.13% | 99.97%  | 2.151        | 1.02x   |

Analysis: ConvNeXt Large is where mixed-precision shines: Embedl Deploy achieves a 1.29x speedup by skipping QDQ on depthwise convolutions, avoiding TensorRT reformatting overhead. In contrast, uniform INT8 (ModelOpt) achieves only 1.02x because the reformatting overhead offsets the INT8 compute gains.

The benefit scales with model size: Tiny (1.15x), Base (1.21x), Large (1.29x). Larger models have more depthwise convolution layers where selective QDQ placement pays off.
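The speedup figures are simply the ratio of FP16 latency to quantized latency; a quick check against the ConvNeXt tables above:

```python
# Speedup = FP16 latency / quantized latency, using the ConvNeXt numbers above.
fp16_ms = {"tiny": 0.560, "base": 1.132, "large": 2.193}
embedl_ms = {"tiny": 0.486, "base": 0.932, "large": 1.701}
speedups = {name: round(fp16_ms[name] / embedl_ms[name], 2) for name in fp16_ms}
print(speedups)  # {'tiny': 1.15, 'base': 1.21, 'large': 1.29}
```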

Vision Transformer (ViT-B/16)#

ViT-B/16 is a pure transformer architecture with:

  • 16×16 patch embedding (Conv2d)

  • 12 self-attention layers (MultiheadAttention)

  • MLP blocks (Linear → GELU → Linear)

  • LayerNorm throughout

  • 86.6M parameters

Conversions applied#

DecomposeMultiheadAttentionPattern decomposes each of the 12 nn.MultiheadAttention modules into:

  • MHAInProjection (Q/K/V linear)

  • ScaledDotProductAttention

  • nn.Linear (output projection)
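Decomposing the attention modules exposes the projections and the attention core as separate nodes, so QDQ stubs can be placed around each independently. The core computes softmax(QKᵀ/√d)V; a minimal pure-Python sketch of that computation (illustrative, not the library's implementation):

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V on plain nested lists."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

# Attention weights per query sum to 1, so identical value rows pass through.
out = scaled_dot_product_attention(
    [[1.0, 0.0], [0.0, 1.0]],  # Q
    [[1.0, 0.0], [0.0, 1.0]],  # K
    [[3.0, 4.0], [3.0, 4.0]],  # V (both rows equal)
)
print(out)  # every output row is [3.0, 4.0]
```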

Results#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 98.70% | 100.00% | 0.788        | 1.00x   |
| Embedl Deploy Mixed-Precision | 98.55% | 100.00% | 0.653        | 1.21x   |
| ModelOpt INT8                 | 98.42% | 100.00% | 0.702        | 1.12x   |

Analysis: Embedl Deploy achieves a 1.21x speedup with only a −0.15pp Top-1 drop. The key design choices:

  • LayerNorm is left unquantized (`qdq_points = frozenset()`): quantizing LayerNorm hurts accuracy with no latency benefit.

  • The Q/K/V projections and the attention core receive separate, properly placed QDQ stubs.

  • Weight quantization is skipped for LayerNorm via `skip_weight_quant_for`.

Summary#

| Architecture   | Embedl Speedup | ModelOpt Speedup | Embedl Top-1 Drop | ModelOpt Top-1 Drop |
|----------------|----------------|------------------|-------------------|---------------------|
| ResNet50       | 1.22x          | 1.20x            | −0.26pp           | −0.28pp             |
| ConvNeXt Tiny  | 1.15x          | 1.04x            | −0.46pp           | −0.43pp             |
| ConvNeXt Base  | 1.21x          | 1.03x            | −0.41pp           | −0.94pp             |
| ConvNeXt Large | 1.29x          | 1.02x            | −0.51pp           | −1.12pp             |
| ViT-B/16       | 1.21x          | 1.12x            | −0.15pp           | −0.28pp             |

Key takeaways#

  1. Mixed-precision matters most for depthwise-heavy architectures. ConvNeXt Large achieves a 1.29x speedup thanks to pattern-aware QDQ placement.

  2. Uniform quantization can be counterproductive. Quantizing every operator barely improves over FP16 on ConvNeXt because TensorRT reformatting overhead offsets INT8 gains on depthwise convolutions.

  3. Accuracy preservation improves with smart QDQ placement. Leaving memory-bound and element-wise operators in FP16 reduces quantization noise.

  4. The pattern-based approach scales with model complexity. The benefit of selective quantization widens as models get larger and use more depthwise/element-wise operators.

Reproducing these results#

The full benchmark script is available as a tutorial: Deploying vision models with Embedl Deploy.

```shell
# ResNet50
python compare_modelopt.py --model resnet50 --crop-size 224

# ConvNeXt variants
python compare_modelopt.py --model convnext_tiny
python compare_modelopt.py --model convnext_base
python compare_modelopt.py --model convnext_large

# Vision Transformer
python compare_modelopt.py --model vit_b_16 --crop-size 224 --resize-size 256

# Include ModelOpt uniform INT8 for comparison
python compare_modelopt.py --compare-modelopt
```

Requirements: NVIDIA GPU, TensorRT 10.x, PyTorch 2.x, `modelopt` (for comparison), `onnx`, `onnxsim`.