# Benchmark Results

This page summarizes INT8 Post-Training Quantization (PTQ) benchmark results for three architecture families: **ResNet**, **ConvNeXt**, and **Vision Transformer (ViT)**.

Each benchmark compares:

- **Baseline FP16** — the pretrained model compiled with TensorRT in FP16 mode.
- **Embedl Deploy Mixed-Precision** — pattern-aware QDQ placement, e.g., depthwise convolutions and global average pooling are left in FP16 while compute-bound operators are quantized to INT8.

For reference, results from uniform INT8 quantization (NVIDIA ModelOpt `INT8_DEFAULT_CFG`) are also included. "Uniform" here means INT8 is applied to every operator regardless of its compute characteristics — in contrast to selective mixed-precision quantization, which leaves memory-bound operators in FP16.

## Test setup

- **GPU:** NVIDIA L4
- **TensorRT:** 10.9
- **Dataset:** ImageNette (10-class subset of ImageNet)
- **Calibration:** 32 batches of 32 images from the training set
- **Latency:** measured with `trtexec --useCudaGraph --useSpinWait --noDataTransfers --duration=30`
- **Accuracy:** Top-1 / Top-5 on the ImageNette validation set (3,925 images)

## ResNet50

ResNet50 is a classic convolutional architecture with:

- 7×7 stem convolution + MaxPool
- 16 bottleneck blocks with residual connections
- Global average pooling + linear classifier
- **25.6M parameters**

### Fusions applied

| Fused module | Count |
|---|---|
| `FusedConvBNReLUMaxPool` (stem) | 1 |
| `FusedConvBNAddReLU` (residuals) | 16 |
| `FusedConvBNReLU` (main path) | 16 |
| `FusedConvBN` | 17 |
| `FusedAdaptiveAvgPool2d` | 1 |

### Results

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 96.36% | 99.69% | 0.267 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.10% | 99.67% | 0.218 | 1.22x |
| ModelOpt INT8 | 96.08% | 99.64% | 0.222 | 1.20x |

**Analysis:** ResNet50 has no depthwise convolutions, so the mixed-precision advantage comes from
correct QDQ placement around residual connections (`RESIDUAL_INPUT` QDQ points) and skipping GlobalAvgPool quantization. Both approaches achieve good speedup with minimal accuracy loss (<0.3pp Top-1).

## ConvNeXt

ConvNeXt is a modernized CNN that uses:

- Depthwise 7×7 convolutions (memory-bound)
- LayerNorm instead of BatchNorm
- GELU activations
- No residual add + ReLU pattern (uses element-wise add without activation)

ConvNeXt is where mixed-precision makes the biggest difference.

### ConvNeXt Tiny (28.6M params)

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 96.56% | 100.00% | 0.560 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.10% | 99.97% | 0.486 | 1.15x |
| ModelOpt INT8 | 96.13% | 99.95% | 0.538 | 1.04x |

### ConvNeXt Base (88.6M params)

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 97.02% | 100.00% | 1.132 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.61% | 100.00% | 0.932 | 1.21x |
| ModelOpt INT8 | 96.08% | 99.97% | 1.100 | 1.03x |

### ConvNeXt Large (197.8M params)

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 97.25% | 100.00% | 2.193 | 1.00x |
| Embedl Deploy Mixed-Precision | 96.74% | 100.00% | 1.701 | 1.29x |
| ModelOpt INT8 | 96.13% | 99.97% | 2.151 | 1.02x |

**Analysis:** ConvNeXt Large is where mixed-precision shines — Embedl Deploy achieves **1.29x speedup** by skipping QDQ on depthwise convolutions, avoiding TensorRT reformatting overhead. In contrast, uniform INT8 quantization achieves only 1.02x because the reformatting overhead offsets the INT8 compute gains.

The benefit scales with model size: Tiny (1.15x), Base (1.21x), Large (1.29x). Larger models have more depthwise convolution layers where selective QDQ placement pays off.
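The "memory-bound" label on depthwise convolutions can be made concrete with a back-of-envelope arithmetic-intensity estimate (FLOPs per byte of memory traffic). The sketch below uses illustrative shapes of our own choosing, not shapes or measurements taken from this benchmark:

```python
# Rough arithmetic intensity (FLOPs per byte moved) of a depthwise 7x7
# convolution versus a dense 1x1 convolution of the same width, at FP16.
# Low intensity means the layer is limited by memory bandwidth, so INT8
# compute gains (plus any QDQ reformatting) buy little.

def arithmetic_intensity(h, w, c_in, c_out, k, groups, bytes_per_elem=2):
    macs = h * w * k * k * (c_in // groups) * c_out
    flops = 2 * macs  # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (
        h * w * c_in                        # read input activations
        + k * k * (c_in // groups) * c_out  # read weights
        + h * w * c_out                     # write output activations
    )
    return flops / bytes_moved

# Depthwise 7x7 (groups == channels), as in a ConvNeXt block
dw = arithmetic_intensity(14, 14, 512, 512, k=7, groups=512)
# Dense 1x1 pointwise convolution of the same width
pw = arithmetic_intensity(14, 14, 512, 512, k=1, groups=1)

print(f"depthwise 7x7: {dw:.1f} FLOPs/byte")   # 21.8
print(f"pointwise 1x1: {pw:.1f} FLOPs/byte")   # 111.0
```

Even with these illustrative shapes, the depthwise layer moves roughly as many bytes as the pointwise one while doing about 5x fewer FLOPs, which is why quantizing it tends to add reformat overhead rather than save compute.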
## Vision Transformer (ViT-B/16)

ViT-B/16 is a pure transformer architecture with:

- 16×16 patch embedding (Conv2d)
- 12 self-attention layers (MultiheadAttention)
- MLP blocks (Linear → GELU → Linear)
- LayerNorm throughout
- **86.6M parameters**

### Conversions applied

`DecomposeMultiheadAttentionPattern` decomposes each of the 12 `nn.MultiheadAttention` modules into:

- `MHAInProjection` (Q/K/V linear)
- `ScaledDotProductAttention`
- `nn.Linear` (output projection)

### Results

| Variant | Top-1 | Top-5 | Latency (ms) | Speedup |
|---|---|---|---|---|
| Baseline FP16 | 98.70% | 100.00% | 0.788 | 1.00x |
| Embedl Deploy Mixed-Precision | 98.55% | 100.00% | 0.653 | 1.21x |
| ModelOpt INT8 | 98.42% | 100.00% | 0.702 | 1.12x |

**Analysis:** Embedl Deploy achieves 1.21x speedup with only a −0.15pp Top-1 drop. The key design choices:

- LayerNorm is left unquantized (`qdq_points = frozenset()`) — quantizing LayerNorm hurts accuracy with no latency benefit.
- Q/K/V projections and attention receive properly placed, separate QDQ stubs.
- Weight quantization is skipped for `LayerNorm` via `skip_weight_quant_for`.

## Summary

| Architecture | Embedl Speedup | ModelOpt Speedup | Embedl Top-1 Drop | ModelOpt Top-1 Drop |
|---|---|---|---|---|
| ResNet50 | 1.22x | 1.20x | −0.26pp | −0.28pp |
| ConvNeXt Tiny | 1.15x | 1.04x | −0.46pp | −0.43pp |
| ConvNeXt Base | 1.21x | 1.03x | −0.41pp | −0.94pp |
| ConvNeXt Large | **1.29x** | 1.02x | −0.51pp | −1.12pp |
| ViT-B/16 | 1.21x | 1.12x | −0.15pp | −0.28pp |

### Key takeaways

1. **Mixed-precision matters most for depthwise-heavy architectures.** ConvNeXt Large achieves 1.29x speedup thanks to pattern-aware QDQ placement.
2. **Uniform quantization can be counterproductive.** Quantizing every operator barely improves over FP16 on ConvNeXt because TensorRT reformatting overhead offsets INT8 gains on depthwise convolutions.
3. **Accuracy preservation improves with smart QDQ placement.** Leaving memory-bound and element-wise operators in FP16 reduces quantization noise.
4. **The pattern-based approach scales with model complexity.** The benefit of selective quantization widens as models get larger and use more depthwise/element-wise operators.

## Reproducing these results

The full benchmark script is available as a tutorial: {doc}`../auto_tutorials/deploy_torchvision`.

```bash
# ResNet50
python compare_modelopt.py --model resnet50 --crop-size 224

# ConvNeXt variants
python compare_modelopt.py --model convnext_tiny
python compare_modelopt.py --model convnext_base
python compare_modelopt.py --model convnext_large

# Vision Transformer
python compare_modelopt.py --model vit_b_16 --crop-size 224 --resize-size 256

# Include ModelOpt uniform INT8 for comparison
python compare_modelopt.py --compare-modelopt
```

Requirements: NVIDIA GPU, TensorRT 10.x, PyTorch 2.x, `modelopt` (for comparison), `onnx`, `onnxsim`.
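When reproducing the results, the derived columns of the summary table can be cross-checked directly from the raw measurements: speedup is the ratio of baseline to variant latency, and the Top-1 drop is the difference to the FP16 baseline in percentage points. A minimal sketch (the `summarize` helper is ours, not part of the benchmark script; the input numbers are copied from the ResNet50 and ConvNeXt Large tables above):

```python
# Derive the summary table's "Speedup" and "Top-1 Drop" columns from the
# raw latency / accuracy measurements reported per model.

def summarize(baseline_ms, variant_ms, baseline_top1, variant_top1):
    speedup = baseline_ms / variant_ms      # >1.0 means faster than FP16
    drop_pp = variant_top1 - baseline_top1  # negative = accuracy loss
    return round(speedup, 2), round(drop_pp, 2)

# ResNet50, Embedl Deploy Mixed-Precision
print(summarize(0.267, 0.218, 96.36, 96.10))  # (1.22, -0.26)

# ConvNeXt Large, ModelOpt INT8
print(summarize(2.193, 2.151, 97.25, 96.13))  # (1.02, -1.12)
```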