# Benchmark Results

This page summarizes INT8 Post-Training Quantization (PTQ) benchmark results for three architecture families: **ResNet**, **ConvNeXt**, and **Vision Transformer (ViT)**.

Each benchmark compares:

- **Baseline FP16** — the pretrained model compiled with TensorRT in FP16 mode.
- **Embedl Deploy Mixed-Precision** — pattern-aware QDQ placement, e.g., depthwise convolutions and global average pooling are left in FP16 while compute-bound operators are quantized to INT8.

For reference, results from uniform INT8 quantization (NVIDIA ModelOpt `INT8_DEFAULT_CFG`) are also included. "Uniform" here means INT8 is applied to every operator regardless of its compute characteristics — in contrast to selective mixed-precision quantization, which leaves memory-bound operators in FP16.

## Test setup

- **GPU:** NVIDIA L4
- **TensorRT:** 10.9
- **Dataset:** ImageNette (10-class subset of ImageNet)
- **Calibration:** 32 batches of 32 images from the training set
- **Latency:** measured with `trtexec --useCudaGraph --useSpinWait --noDataTransfers --duration=30`
- **Accuracy:** Top-1 / Top-5 on the ImageNette validation set (3,925 images)

## ResNet50

ResNet50 is a classic convolutional architecture with:

- 7×7 stem convolution + MaxPool
- 16 bottleneck blocks with residual connections
- Global average pooling + linear classifier
- **25.6M parameters**

### Fusions applied

| Fused module | Count |
|---|---|
| `FusedConvBNReLUMaxPool` (stem) | 1 |
| `FusedConvBNAddReLU` (residuals) | 16 |
| `FusedConvBNReLU` (main path) | 16 |
| `FusedConvBN` | 17 |
| `FusedAdaptiveAvgPool2d` | 1 |

### Results

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 96.36% | 99.69% | 0.267 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.10% | 99.67% | 0.218 | 1.22x |
| ModelOpt INT8 | 96.08% | 99.64% | 0.222 | 1.20x |

**Analysis:** ResNet50 has no depthwise convolutions, so the mixed-precision advantage comes from
correct QDQ placement around residual connections (`RESIDUAL_INPUT` QDQ points) and skipping GlobalAvgPool quantization. Both approaches achieve good speedup with minimal accuracy loss (<0.3pp Top-1).

## ConvNeXt

ConvNeXt is a modernized CNN that uses:

- Depthwise 7×7 convolutions (memory-bound)
- LayerNorm instead of BatchNorm
- GELU activations
- No residual add + ReLU pattern (uses element-wise add without activation)

ConvNeXt is where mixed-precision makes the biggest difference.

### ConvNeXt Tiny (28.6M params)

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 96.56% | 100.00% | 0.560 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.10% | 99.97% | 0.486 | 1.15x |
| ModelOpt INT8 | 96.13% | 99.95% | 0.538 | 1.04x |

### ConvNeXt Base (88.6M params)

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 97.02% | 100.00% | 1.132 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.61% | 100.00% | 0.932 | 1.21x |
| ModelOpt INT8 | 96.08% | 99.97% | 1.100 | 1.03x |

### ConvNeXt Large (197.8M params)

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 97.25% | 100.00% | 2.193 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.74% | 100.00% | 1.701 | 1.29x |
| ModelOpt INT8 | 96.13% | 99.97% | 2.151 | 1.02x |

**Analysis:** ConvNeXt Large is where mixed-precision shines — Embedl Deploy achieves **1.29x speedup** by skipping QDQ on depthwise convolutions, avoiding TensorRT reformatting overhead. In contrast, uniform INT8 quantization achieves only 1.02x because the reformatting overhead offsets the INT8 compute gains.

The benefit scales with model size: Tiny (1.15x), Base (1.21x), Large (1.29x). Larger models have more depthwise convolution layers where selective QDQ placement pays off.
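The "memory-bound" label on depthwise convolutions can be made concrete with a back-of-envelope arithmetic-intensity estimate (FLOPs per byte of memory traffic). The sketch below uses illustrative shapes of our own choosing, not shapes or measurements taken from this benchmark:

```python
# Rough arithmetic intensity (FLOPs per byte moved) of a depthwise 7x7
# convolution versus a dense 1x1 convolution of the same width, at FP16.
# Low intensity means the layer is limited by memory bandwidth, so INT8
# compute gains (plus any QDQ reformatting) buy little.

def arithmetic_intensity(h, w, c_in, c_out, k, groups, bytes_per_elem=2):
    macs = h * w * k * k * (c_in // groups) * c_out
    flops = 2 * macs  # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (
        h * w * c_in                        # read input activations
        + k * k * (c_in // groups) * c_out  # read weights
        + h * w * c_out                     # write output activations
    )
    return flops / bytes_moved

# Depthwise 7x7 (groups == channels), as in a ConvNeXt block
dw = arithmetic_intensity(14, 14, 512, 512, k=7, groups=512)
# Dense 1x1 pointwise convolution of the same width
pw = arithmetic_intensity(14, 14, 512, 512, k=1, groups=1)

print(f"depthwise 7x7: {dw:.1f} FLOPs/byte")   # 21.8
print(f"pointwise 1x1: {pw:.1f} FLOPs/byte")   # 111.0
```

Even with these illustrative shapes, the depthwise layer moves roughly as many bytes as the pointwise one while doing about 5x fewer FLOPs, which is why quantizing it tends to add reformat overhead rather than save compute.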
## Vision Transformer (ViT-B/16)

ViT-B/16 is a pure transformer architecture with:

- 16×16 patch embedding (Conv2d)
- 12 self-attention layers (MultiheadAttention)
- MLP blocks (Linear → GELU → Linear)
- LayerNorm throughout
- **86.6M parameters**

### Conversions applied

`DecomposeMultiheadAttentionPattern` decomposes each of the 12 `nn.MultiheadAttention` modules into:

- `MHAInProjection` (Q/K/V linear)
- `ScaledDotProductAttention`
- `nn.Linear` (output projection)

### Results

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 98.70% | 100.00% | 0.788 | 1.00x |
| Embedl Deploy Mixed-Precision | 98.55% | 100.00% | 0.653 | 1.21x |
| ModelOpt INT8 | 98.42% | 100.00% | 0.702 | 1.12x |

**Analysis:** Embedl Deploy achieves 1.21x speedup with only a −0.15pp Top-1 drop. The key design choices:

- LayerNorm is left unquantized (`qdq_points = frozenset()`) — quantizing LayerNorm hurts accuracy with no latency benefit.
- Q/K/V projections and attention receive properly placed, separate QDQ stubs.
- Weight quantization is skipped for `LayerNorm` via `skip_weight_quant_for`.

## Summary

| Architecture | Embedl Speedup | ModelOpt Speedup | Embedl Top-1 Drop | ModelOpt Top-1 Drop |
|---|---|---|---|---|
| ResNet50 | 1.22x | 1.20x | −0.26pp | −0.28pp |
| ConvNeXt Tiny | 1.15x | 1.04x | −0.46pp | −0.43pp |
| ConvNeXt Base | 1.21x | 1.03x | −0.41pp | −0.94pp |
| ConvNeXt Large | **1.29x** | 1.02x | −0.51pp | −1.12pp |
| ViT-B/16 | 1.21x | 1.12x | −0.15pp | −0.28pp |

### Key takeaways

1. **Mixed-precision matters most for depthwise-heavy architectures.** ConvNeXt Large achieves 1.29x speedup thanks to pattern-aware QDQ placement.
2. **Uniform quantization can be counterproductive.** Quantizing every operator barely improves over FP16 on ConvNeXt because TensorRT reformatting overhead offsets INT8 gains on depthwise convolutions.
3. **Accuracy preservation improves with smart QDQ placement.** Leaving memory-bound and element-wise operators in FP16 reduces quantization noise.
4. **The pattern-based approach scales with model complexity.** The benefit of selective quantization widens as models get larger and use more depthwise/element-wise operators.

## Reproducing these results

The full benchmark script is available as a tutorial: {doc}`../auto_tutorials/deploy_torchvision`.

```bash
# ResNet50
python compare_modelopt.py --model resnet50 --crop-size 224

# ConvNeXt variants
python compare_modelopt.py --model convnext_tiny
python compare_modelopt.py --model convnext_base
python compare_modelopt.py --model convnext_large

# Vision Transformer
python compare_modelopt.py --model vit_b_16 --crop-size 224 --resize-size 256

# Include ModelOpt uniform INT8 for comparison
python compare_modelopt.py --compare-modelopt
```

Requirements: NVIDIA GPU, TensorRT 10.x, PyTorch 2.x, `modelopt` (for comparison), `onnx`, `onnxsim`.
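When reproducing the results, the derived columns of the summary table can be cross-checked directly from the raw measurements: speedup is the ratio of baseline to variant latency, and the Top-1 drop is the difference to the FP16 baseline in percentage points. A minimal sketch (the `summarize` helper is ours, not part of the benchmark script; the input numbers are copied from the ResNet50 and ConvNeXt Large tables above):

```python
# Derive the summary table's "Speedup" and "Top-1 Drop" columns from the
# raw latency / accuracy measurements reported per model.

def summarize(baseline_ms, variant_ms, baseline_top1, variant_top1):
    speedup = baseline_ms / variant_ms      # >1.0 means faster than FP16
    drop_pp = variant_top1 - baseline_top1  # negative = accuracy loss
    return round(speedup, 2), round(drop_pp, 2)

# ResNet50, Embedl Deploy Mixed-Precision
print(summarize(0.267, 0.218, 96.36, 96.10))  # (1.22, -0.26)

# ConvNeXt Large, ModelOpt INT8
print(summarize(2.193, 2.151, 97.25, 96.13))  # (1.02, -1.12)
```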