Benchmark Results#

This page summarizes INT8 Post-Training Quantization (PTQ) benchmark results for three architecture families: ResNet, ConvNeXt, and Vision Transformer (ViT). Each benchmark compares:

  • Baseline FP16 — the pretrained model compiled with TensorRT in FP16 mode.

  • Embedl Deploy Mixed-Precision — pattern-aware QDQ placement, e.g., depthwise convolutions and global average pooling are left in FP16 while compute-bound operators are quantized to INT8.

For reference, results from uniform INT8 quantization (NVIDIA ModelOpt `INT8_DEFAULT_CFG`) are also included. "Uniform" here means INT8 is applied to every operator regardless of its compute characteristics, in contrast to selective mixed-precision quantization, which leaves memory-bound operators in FP16.

Test setup#

  • GPU: NVIDIA L4

  • TensorRT: 10.9

  • Dataset: ImageNette (10-class subset of ImageNet)

  • Calibration: 32 batches of 32 images from the training set

  • Latency: measured with `trtexec --useCudaGraph --useSpinWait --noDataTransfers --duration=30`

  • Accuracy: Top-1 / Top-5 on the ImageNette validation set (3,925 images)
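Top-1 counts a sample as correct when the highest logit matches the label; Top-5 when the label appears among the five highest. A minimal sketch of the metric (illustrative only; the benchmark script's actual implementation may differ):

```python
def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for scores, label in zip(logits, labels):
        # Indices of the k largest scores, highest first.
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)

logits = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 2, 2]
print(topk_accuracy(logits, labels, k=1))  # 2 of 3 predictions correct
```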

ResNet50#

ResNet50 is a classic convolutional architecture with:

  • 7×7 stem convolution + MaxPool

  • 16 bottleneck blocks with residual connections

  • Global average pooling + linear classifier

  • 25.6M parameters

Fusions applied#

| Fused module                   | Count |
|--------------------------------|-------|
| FusedConvBNReLUMaxPool (stem)  | 1     |
| FusedConvBNAddReLU (residuals) | 16    |
| FusedConvBNReLU (main path)    | 16    |
| FusedConvBN                    | 17    |
| FusedAdaptiveAvgPool2d         | 1     |
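Each FusedConvBN* module folds the BatchNorm statistics into the preceding convolution's weights and bias, so Conv+BN executes as a single conv before quantization. A scalar sketch of the standard folding identity (in practice this is applied per output channel):

```python
import math

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into a conv: BN(w*x + b) == w2*x + b2."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Check equivalence on a single value.
w, b, gamma, beta, mean, var = 0.5, 0.1, 1.2, -0.3, 0.4, 0.25
x = 2.0
bn_out = (w * x + b - mean) / math.sqrt(var + 1e-5) * gamma + beta
w2, b2 = fold_bn_into_conv(w, b, gamma, beta, mean, var)
assert abs((w2 * x + b2) - bn_out) < 1e-9
```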

Results#

| Variant                       | Top-1  | Top-5  | Latency (ms) | Speedup |
|-------------------------------|--------|--------|--------------|---------|
| Baseline FP16                 | 96.36% | 99.69% | 0.267        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.10% | 99.67% | 0.218        | 1.22x   |
| ModelOpt INT8                 | 96.08% | 99.64% | 0.222        | 1.20x   |

Analysis: ResNet50 has no depthwise convolutions, so the mixed-precision advantage comes from correct QDQ placement around residual connections (RESIDUAL_INPUT QDQ points) and from leaving global average pooling unquantized. Both quantized variants achieve solid speedups with minimal accuracy loss (<0.3pp Top-1).
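A QDQ point inserts a QuantizeLinear/DequantizeLinear pair that simulates INT8 on a tensor: values are snapped to a scaled int8 grid and mapped back to float. A minimal sketch of symmetric per-tensor fake quantization (illustrative; TensorRT's actual kernels differ):

```python
def qdq(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Simulate a QuantizeLinear -> DequantizeLinear pair for one value:
    round to the integer grid, clamp to the int8 range, map back to float."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

# Values inside the representable range round to the nearest grid point;
# values outside saturate at the clamp limits.
print(qdq(0.123, scale=0.01))  # rounds to the 0.12 grid point
print(qdq(5.0, scale=0.01))    # saturates at 127 * 0.01
```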

ConvNeXt#

ConvNeXt is a modernized CNN that uses:

  • Depthwise 7×7 convolutions (memory-bound)

  • LayerNorm instead of BatchNorm

  • GELU activations

  • No residual add + ReLU pattern (uses element-wise add without activation)

ConvNeXt is where mixed-precision makes the biggest difference.
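Depthwise convolutions are memory-bound: they perform very little arithmetic per byte of activation and weight traffic, so INT8 compute offers no gain while the extra QDQ reformatting adds cost. A back-of-the-envelope arithmetic-intensity estimate (a sketch with assumed ConvNeXt-like shapes, not a profiler measurement):

```python
def conv_arithmetic_intensity(c_in, c_out, k, h, w, depthwise=False, bytes_per_el=2):
    """Rough MACs-per-byte estimate for a conv layer (stride 1, same padding).
    Low values indicate a memory-bound layer that gains little from INT8 math."""
    if depthwise:
        c_out = c_in                      # depthwise: one filter per input channel
        macs = c_in * k * k * h * w
        params = c_in * k * k
    else:
        macs = c_in * c_out * k * k * h * w
        params = c_in * c_out * k * k
    traffic = (c_in * h * w + c_out * h * w + params) * bytes_per_el
    return macs / traffic

# Dense 3x3 conv vs a ConvNeXt-style depthwise 7x7 at the same resolution.
dense = conv_arithmetic_intensity(96, 96, 3, 56, 56)
dw = conv_arithmetic_intensity(96, 96, 7, 56, 56, depthwise=True)
print(dense, dw)  # the depthwise layer does far fewer MACs per byte moved
```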

ConvNeXt Tiny (28.6M params)#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 96.56% | 100.00% | 0.560        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.10% | 99.97%  | 0.486        | 1.15x   |
| ModelOpt INT8                 | 96.13% | 99.95%  | 0.538        | 1.04x   |

ConvNeXt Base (88.6M params)#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 97.02% | 100.00% | 1.132        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.61% | 100.00% | 0.932        | 1.21x   |
| ModelOpt INT8                 | 96.08% | 99.97%  | 1.100        | 1.03x   |

ConvNeXt Large (197.8M params)#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 97.25% | 100.00% | 2.193        | 1.00x   |
| Embedl Deploy Mixed-Precision | 96.74% | 100.00% | 1.701        | 1.29x   |
| ModelOpt INT8                 | 96.13% | 99.97%  | 2.151        | 1.02x   |

Analysis: ConvNeXt Large is where mixed-precision shines: Embedl Deploy achieves a 1.29x speedup by skipping QDQ on depthwise convolutions, avoiding TensorRT reformatting overhead. In contrast, uniform INT8 (ModelOpt) achieves only 1.02x because the reformatting overhead offsets the INT8 compute gains.

The benefit scales with model size: Tiny (1.15x), Base (1.21x), Large (1.29x). Larger models have more depthwise convolution layers where selective QDQ placement pays off.
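The speedup figures are simply the ratio of FP16 latency to quantized latency; a quick check against the ConvNeXt tables above:

```python
# Speedup = FP16 latency / quantized latency, using the ConvNeXt numbers above.
fp16_ms = {"tiny": 0.560, "base": 1.132, "large": 2.193}
embedl_ms = {"tiny": 0.486, "base": 0.932, "large": 1.701}
speedups = {name: round(fp16_ms[name] / embedl_ms[name], 2) for name in fp16_ms}
print(speedups)  # {'tiny': 1.15, 'base': 1.21, 'large': 1.29}
```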

Vision Transformer (ViT-B/16)#

ViT-B/16 is a pure transformer architecture with:

  • 16×16 patch embedding (Conv2d)

  • 12 self-attention layers (MultiheadAttention)

  • MLP blocks (Linear → GELU → Linear)

  • LayerNorm throughout

  • 86.6M parameters

Conversions applied#

DecomposeMultiheadAttentionPattern decomposes each of the 12 nn.MultiheadAttention modules into:

  • MHAInProjection (Q/K/V linear)

  • ScaledDotProductAttention

  • nn.Linear (output projection)
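Decomposing the attention modules exposes the projections and the attention core as separate nodes, so QDQ stubs can be placed around each independently. The core computes softmax(QKᵀ/√d)V; a minimal pure-Python sketch of that computation (illustrative, not the library's implementation):

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V on plain nested lists."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

# Attention weights per query sum to 1, so identical value rows pass through.
out = scaled_dot_product_attention(
    [[1.0, 0.0], [0.0, 1.0]],  # Q
    [[1.0, 0.0], [0.0, 1.0]],  # K
    [[3.0, 4.0], [3.0, 4.0]],  # V (both rows equal)
)
print(out)  # every output row is [3.0, 4.0]
```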

Results#

| Variant                       | Top-1  | Top-5   | Latency (ms) | Speedup |
|-------------------------------|--------|---------|--------------|---------|
| Baseline FP16                 | 98.70% | 100.00% | 0.788        | 1.00x   |
| Embedl Deploy Mixed-Precision | 98.55% | 100.00% | 0.653        | 1.21x   |
| ModelOpt INT8                 | 98.42% | 100.00% | 0.702        | 1.12x   |

Analysis: Embedl Deploy achieves a 1.21x speedup with only a −0.15pp Top-1 drop. The key design choices:

  • LayerNorm is left unquantized (`qdq_points = frozenset()`): quantizing LayerNorm hurts accuracy with no latency benefit.

  • The Q/K/V projections and the attention core receive separate, properly placed QDQ stubs.

  • Weight quantization is skipped for LayerNorm via `skip_weight_quant_for`.

Summary#

| Architecture   | Embedl Speedup | ModelOpt Speedup | Embedl Top-1 Drop | ModelOpt Top-1 Drop |
|----------------|----------------|------------------|-------------------|---------------------|
| ResNet50       | 1.22x          | 1.20x            | −0.26pp           | −0.28pp             |
| ConvNeXt Tiny  | 1.15x          | 1.04x            | −0.46pp           | −0.43pp             |
| ConvNeXt Base  | 1.21x          | 1.03x            | −0.41pp           | −0.94pp             |
| ConvNeXt Large | 1.29x          | 1.02x            | −0.51pp           | −1.12pp             |
| ViT-B/16       | 1.21x          | 1.12x            | −0.15pp           | −0.28pp             |

Key takeaways#

  1. Mixed-precision matters most for depthwise-heavy architectures. ConvNeXt Large achieves a 1.29x speedup thanks to pattern-aware QDQ placement.

  2. Uniform quantization can be counterproductive. Quantizing every operator barely improves over FP16 on ConvNeXt because TensorRT reformatting overhead offsets INT8 gains on depthwise convolutions.

  3. Accuracy preservation improves with smart QDQ placement. Leaving memory-bound and element-wise operators in FP16 reduces quantization noise.

  4. The pattern-based approach scales with model complexity. The benefit of selective quantization widens as models get larger and use more depthwise/element-wise operators.

Reproducing these results#

The full benchmark script is available as a tutorial: Deploying vision models with Embedl Deploy.

```shell
# ResNet50
python compare_modelopt.py --model resnet50 --crop-size 224

# ConvNeXt variants
python compare_modelopt.py --model convnext_tiny
python compare_modelopt.py --model convnext_base
python compare_modelopt.py --model convnext_large

# Vision Transformer
python compare_modelopt.py --model vit_b_16 --crop-size 224 --resize-size 256

# Include ModelOpt uniform INT8 for comparison
python compare_modelopt.py --compare-modelopt
```

Requirements: NVIDIA GPU, TensorRT 10.x, PyTorch 2.x, `modelopt` (for comparison), `onnx`, `onnxsim`.