# Quantization Embedl Deploy provides hardware-aware INT8 quantization through explicit QDQ (Quantize/DeQuantize) stub placement. Unlike uniform quantization approaches that insert QDQ nodes around every operator, Embedl Deploy places stubs only at positions declared by each pattern's `qdq_points` — ensuring that quantization does not break operator fusions in the target hardware compiler. ## Quantization pipeline The quantization pipeline has three steps: ``` ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ transform() │ ──▶ │insert_qdq() │ ──▶ │ calibrate │ │(fuse model) │ │(add stubs) │ │(set scales) │ └─────────────┘ └─────────────┘ └─────────────┘ ``` 1. **Transform** — apply conversions and fusions. Each pattern declares its `qdq_points` attribute (e.g., `INPUT`, `RESIDUAL_INPUT`), and these are stored in the returned `PatternMatch` objects for use in the next step. 2. **Insert QDQ** — read the `qdq_points` from each `PatternMatch` and place `QuantStub` (for activations) and `WeightFakeQuantize` (for weights) modules at those positions. A `PatternMatch` is a dataclass that records which nodes were matched and what QDQ points apply to that fusion. 3. **Calibrate** — run representative data through the model to compute scale and zero-point for each quantizer. 
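To make the data flow between steps 1 and 2 concrete, here is a minimal sketch of what a `PatternMatch` record might look like. The field names below are assumptions for illustration only; the library's actual dataclass may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PatternMatch:
    # Hypothetical field names -- the library's real definition may differ.
    pattern_name: str                          # e.g. "ConvBNReLUPattern"
    nodes: list = field(default_factory=list)  # graph nodes covered by the fusion
    qdq_points: tuple = ()                     # e.g. ("INPUT", "RESIDUAL_INPUT")


# A residual-add fusion would carry two QDQ points, one per input path:
match = PatternMatch(
    "ConvBNAddReLUPattern",
    qdq_points=("INPUT", "RESIDUAL_INPUT"),
)
```

The key point is that `transform()` produces these records and `insert_qdq()` consumes them, so the QDQ placement decision travels with the fusion rather than being recomputed from the graph.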
## Step 1: Transform and fuse ```python import torch from torchvision.models import resnet50 from embedl_deploy import transform from embedl_deploy.tensorrt import TENSORRT_PATTERNS model = resnet50(weights="DEFAULT").eval() result = transform(model, patterns=TENSORRT_PATTERNS) fused_model = result.model matches = result.matches # needed for insert_qdq ``` ## Step 2: Insert QDQ stubs ```python from torch import nn from embedl_deploy.quantize import ( QuantConfig, TensorQuantConfig, insert_qdq, ) quantized_model = insert_qdq( fused_model, matches, config=QuantConfig( activation=TensorQuantConfig(n_bits=8, symmetric=True), weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True), ), ) ``` ### QuantConfig `QuantConfig` controls how quantization stubs are configured: | Parameter | Type | Description | |---|---|---| | `activation` | `TensorQuantConfig` | Config for activation quantizers | | `weight` | `TensorQuantConfig` | Config for weight quantizers | | `skip_weight_quant_for` | `tuple[type, ...]` | Module types to skip weight quantization for | ### TensorQuantConfig | Parameter | Type | Default | Description | |---|---|---|---| | `n_bits` | `int` | 8 | Quantization bit width | | `symmetric` | `bool` | `True` | Use symmetric quantization | | `per_channel` | `bool` | `False` | Per-channel (weights) vs per-tensor | **Recommended settings for TensorRT:** - Activations: 8-bit, symmetric, per-tensor - Weights: 8-bit, symmetric, **per-channel** - Skip weight quantization for `LayerNorm` (it runs in FP16 on TensorRT) ```python config = QuantConfig( activation=TensorQuantConfig(n_bits=8, symmetric=True), weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True), skip_weight_quant_for=(nn.LayerNorm,), ) ``` ## Step 3: Calibrate ```python from embedl_deploy.quantize import calibrate # Define a forward loop that runs calibration data through the model def forward_loop(model): for batch_tensor, _ in calibration_loader: model(batch_tensor) 
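# `calibration_loader` above is assumed to be a DataLoader you define
# elsewhere. A few hundred representative samples are typically enough;
# a variant (a sketch, not part of the library's API) caps the number
# of calibration batches with itertools.islice:
import itertools

def forward_loop_limited(model, max_batches=32):
    for batch_tensor, _ in itertools.islice(calibration_loader, max_batches):
        model(batch_tensor)
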
calibrate(quantized_model, forward_loop) ``` `calibrate` runs the forward loop on the model, collecting activation statistics to compute the optimal scale and zero-point for each `QuantStub`. ## QDQ point placement Each fused pattern declares where QDQ stubs should be placed via its `qdq_points` attribute. This is the key difference from uniform quantization: | Pattern | QDQ points | Rationale | |---|---|---| | `ConvBNPattern` | `INPUT` | Quantize activations entering the conv | | `ConvBNReLUPattern` | `INPUT` | Quantize input; ReLU is fused, no separate stub needed | | `ConvBNAddReLUPattern` | `INPUT`, `RESIDUAL_INPUT` | Both paths into the add must be quantized | | `StemConvBNReLUMaxPoolPattern` | `INPUT` | First layer — quantize input image | | `LinearPattern` | `INPUT` | Quantize activations entering the linear | | `LinearReLUPattern` | `INPUT` | Quantize input; activation is fused | | `LayerNormPattern` | (none) | Not quantized — hurts accuracy, no latency benefit | | `MHAInProjectionPattern` | `INPUT` | Quantize input to Q/K/V projection | | `ScaledDotProductAttentionPattern` | `INPUT`, `KEY_INPUT`, `VALUE_INPUT` | Q, K, V each need separate quantization | | `AdaptiveAvgPoolPattern` | `INPUT`, `OUTPUT` | Both sides for TRT fusion | ## Why pattern-aware QDQ matters ### The problem with uniform quantization Tools like NVIDIA ModelOpt apply QDQ stubs around every operator indiscriminately. This causes several issues: 1. **Broken fusions** — QDQ nodes between operators that should be fused (e.g., between Conv and BN) prevent the hardware compiler from merging them into a single kernel. 2. **Reformatting overhead** — quantizing memory-bound operators like depthwise convolutions or global average pooling forces TensorRT to insert data reformatting layers (INT8 ↔ FP16), which can cost more than the quantization saves. 3. 
**Accuracy loss without latency gain** — quantizing operators that don't benefit from INT8 (LayerNorm, element-wise ops) reduces accuracy with no performance improvement. ### Embedl Deploy's approach Pattern-declared QDQ points ensure: - Stubs are placed **outside** fused operator groups, not between them. - Memory-bound operators can be left unquantized when beneficial. - The hardware compiler sees exactly the QDQ topology it expects. ## Full example: ResNet50 INT8 PTQ ```python import torch from torchvision.models import resnet50 from embedl_deploy import transform from embedl_deploy.tensorrt import TENSORRT_PATTERNS from embedl_deploy.quantize import ( QuantConfig, TensorQuantConfig, QuantStub, WeightFakeQuantize, calibrate, insert_qdq, ) # 1. Load and fuse model = resnet50(weights="DEFAULT").eval() result = transform(model, patterns=TENSORRT_PATTERNS) # 2. Insert QDQ stubs quantized = insert_qdq( result.model, result.matches, config=QuantConfig( activation=TensorQuantConfig(n_bits=8, symmetric=True), weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True), ), ) # Count inserted stubs n_quant = sum(1 for m in quantized.modules() if isinstance(m, QuantStub)) n_wfq = sum(1 for m in quantized.modules() if isinstance(m, WeightFakeQuantize)) print(f"QuantStubs: {n_quant}, WeightFakeQuantize: {n_wfq}") # 3. Calibrate def forward_loop(model): for batch in calibration_batches[:32]: model(batch) calibrate(quantized, forward_loop) # 4. 
Export torch.onnx.export( quantized.cpu().eval(), torch.randn(1, 3, 224, 224), "resnet50_int8.onnx", opset_version=20, input_names=["input"], output_names=["output"], dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}}, ) ``` Compile with TensorRT using `--best` (FP16 + INT8): ```bash trtexec --onnx=resnet50_int8.onnx --best ``` ## Quantization-Aware Training (QAT) For higher accuracy, you can fine-tune the quantized model with QAT after calibration: ```python from embedl_deploy.quantize import ( enable_fake_quant, disable_fake_quant, freeze_bn_stats, prepare_qat, ) # Prepare for QAT (enables fake quantization in training mode) prepare_qat(quantized) # Fine-tune with your training loop quantized.train() enable_fake_quant(quantized) freeze_bn_stats(quantized) # Keep BN statistics from calibration for epoch in range(num_epochs): for images, targets in train_loader: output = quantized(images) loss = criterion(output, targets) loss.backward() optimizer.step() optimizer.zero_grad() # Switch back to eval for export quantized.eval() disable_fake_quant(quantized) ```
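QAT fine-tuning works because fake quantization is made differentiable. A common mechanism for this, shown here as an illustration of the general technique rather than Embedl Deploy's actual implementation, is the straight-through estimator (STE): round to the quantization grid in the forward pass, but let gradients flow through unchanged in the backward pass.

```python
import torch


class FakeQuantSTE(torch.autograd.Function):
    """Straight-through estimator for symmetric INT8 fake quantization (sketch)."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize-dequantize: snap values to the INT8 grid defined by `scale`.
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Ignore the rounding in the backward pass: gradient passes through as-is.
        return grad_output, None


x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.1))
y.sum().backward()  # x.grad is all ones: rounding is treated as identity
```

Without the STE (or a similar surrogate gradient), the rounding step would have zero gradient almost everywhere and the weights could not be fine-tuned through the quantizers.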