Operator Fusions#
Fusions combine sequences of operators into single fused modules that map
directly to hardware-accelerated kernels. After fusion, the model graph
contains modules like FusedConvBNReLU instead of separate Conv2d,
BatchNorm2d, and ReLU layers.
Fused modules:

- Are numerically equivalent to the original operator sequence (no weight folding at this stage — that’s handled by the hardware compiler).
- Declare QDQ insertion points so quantization stubs are placed correctly.
- Export cleanly to ONNX for downstream compilation.
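The weight folding deferred to the hardware compiler is the standard inference-time BatchNorm identity: the per-channel affine `γ·(z − mean)/√(var + ε) + β` collapses into one scale and one bias that can be absorbed into the preceding convolution. A minimal NumPy check of that identity (arbitrary shapes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 8  # channels
gamma, beta = rng.standard_normal(c), rng.standard_normal(c)
mean, var = rng.standard_normal(c), rng.random(c) + 0.1
eps = 1e-5

# Inference-mode BatchNorm applied to per-channel activations z
# (e.g. convolution outputs):
z = rng.standard_normal((4, c))
bn_out = gamma * (z - mean) / np.sqrt(var + eps) + beta

# Folded form: one scale and one bias per channel, absorbable into the
# convolution's weights and bias.
scale = gamma / np.sqrt(var + eps)
bias = beta - mean * scale
folded_out = scale * z + bias

assert np.allclose(bn_out, folded_out)
```

Because a convolution is linear in its weights, scaling each output channel by `scale` and shifting by `bias` reproduces the BN exactly, which is why the fused module can stay numerically equivalent while deferring the folding itself.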
Convolution fusions#
ConvBNPattern#
Matches: Conv2d → [BatchNorm2d]
Produces: FusedConvBN
The most basic convolution fusion. The BatchNorm2d is optional — a bare
Conv2d is also matched (useful for ensuring QDQ stub placement even on
convolutions without batch normalization).
QDQ points: INPUT
ConvBNReLUPattern#
Matches: Conv2d → [BatchNorm2d] → Activation
Produces: FusedConvBNReLU
Fuses convolution, optional batch normalization, and an activation function.
The pattern is named after the most common case (ReLU), but it matches any of
these activations: ReLU, ReLU6, LeakyReLU, ELU, GELU, SiLU,
Hardswish, Hardsigmoid.
QDQ points: INPUT
StemConvBNReLUMaxPoolPattern#
Matches: Conv2d(3in, 7×7) → [BatchNorm2d] → Activation → MaxPool2d
Produces: FusedConvBNReLUMaxPool
Captures the classification network stem found in ResNet and similar
architectures. The convolution is constrained to in_channels=3 and kernel_size=(7,7) so that only the actual stem is matched.
QDQ points: INPUT
ConvBNAddReLUPattern#
Matches: Conv2d → BatchNorm2d → add(·, residual) → Activation
Produces: FusedConvBNAddReLU
Captures the tail of ResNet-style bottleneck blocks where the convolution
path merges with a skip connection before the final activation. This is a
branching pattern — it matches a Fork topology with two inputs feeding
into an operator.add node.
QDQ points: INPUT, RESIDUAL_INPUT
The RESIDUAL_INPUT QDQ point ensures the skip connection is also quantized,
which is critical for TensorRT to fuse the entire residual block into a single
INT8 kernel.
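In PyTorch terms, the matched topology is the tail of a torchvision-style Bottleneck block. A minimal sketch of a module this pattern would match (plain PyTorch; the class and names here are hypothetical, for illustration only):

```python
import torch
import torch.nn as nn

class BottleneckTail(nn.Module):
    """Illustrative Conv2d -> BatchNorm2d -> add -> ReLU topology.

    Two inputs feed the add node: the main convolution path and the
    skip connection (the RESIDUAL_INPUT of the fused module).
    """
    def __init__(self, channels: int = 8):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        # Fork topology: conv path and residual merge before the activation.
        return self.relu(self.bn(self.conv(x)) + residual)
```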
Linear fusions#
LinearReLUPattern#
Matches: Linear → Activation
Produces: FusedLinearReLU
QDQ points: INPUT
LinearPattern#
Matches: standalone Linear
Produces: FusedLinear
QDQ points: INPUT
LayerNormPattern#
Matches: LayerNorm
Produces: FusedLayerNorm
QDQ points: none (empty qdq_points)
LayerNorm is not quantized — it is memory-bound rather than compute-bound and runs efficiently in FP16/FP32. Placing QDQ stubs around LayerNorm would hurt accuracy without improving latency.
Attention fusions#
These patterns match the sub-modules produced by the
DecomposeMultiheadAttentionPattern conversion.
MHAInProjectionPattern#
Matches: MHAInProjection
Produces: FusedMHAInProjection
QDQ points: INPUT
ScaledDotProductAttentionPattern#
Matches: ScaledDotProductAttention
Produces: FusedScaledDotProductAttention
QDQ points: INPUT, KEY_INPUT, VALUE_INPUT
The three QDQ points ensure that Q, K, and V tensors are independently quantized, matching TensorRT’s expected input format for fused attention kernels.
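For reference, the computation the fused module implements — softmax(QKᵀ/√d)·V — in a minimal NumPy sketch (single-head 2-D tensors for brevity; illustrative, not the library's implementation):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (seq_q, d), k: (seq_k, d), v: (seq_k, d_v) -> (seq_q, d_v)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v
```

A quick sanity check: with q all zeros, the attention weights are uniform, so the output is the mean of v over the sequence axis.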
Pooling fusions#
AdaptiveAvgPoolPattern#
Matches: AdaptiveAvgPool2d
Produces: FusedAdaptiveAvgPool2d
QDQ points: INPUT, OUTPUT
Note
In smart-quantization workflows, the AdaptiveAvgPoolPattern is
intentionally omitted from the pattern list to skip QDQ placement around
global average pooling. This is because global average pooling is a cheap,
memory-bound reduction — quantizing it adds TensorRT reformatting overhead
without meaningful compute savings. See Custom Patterns for details.
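Omitting a pattern amounts to passing a filtered pattern list to transform. A sketch with stand-in classes (the real names come from embedl_deploy.tensorrt; the placeholder class definitions here only make the snippet self-contained):

```python
# Stand-ins for the real pattern classes documented on this page.
class ConvBNReLUPattern: ...
class AdaptiveAvgPoolPattern: ...

TENSORRT_FUSION_PATTERNS = [ConvBNReLUPattern, AdaptiveAvgPoolPattern]

# Skip QDQ placement around global average pooling by dropping its pattern.
smart_patterns = [
    p for p in TENSORRT_FUSION_PATTERNS if p is not AdaptiveAvgPoolPattern
]
```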
Fusion summary by architecture#
ResNet50#
| Fused module | Count | Description |
|---|---|---|
| FusedConvBNReLUMaxPool | 1 | Stem: Conv(7×7) + BN + ReLU + MaxPool |
| FusedConvBNAddReLU | 16 | Bottleneck residual blocks |
| FusedConvBNReLU | 16 | Main-path Conv + BN + ReLU |
| FusedConvBN | 17 | Conv + BN without activation |
| FusedAdaptiveAvgPool2d | 1 | Global average pool |
| Total | 51 | |
Conversions applied: FlattenLinearToConv1x1Pattern converts the
Flatten → Linear classifier into Conv2d(1×1) → Flatten.
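The two forms are numerically identical: a Linear layer on flattened 1×1 feature maps computes the same matmul as a 1×1 convolution. A NumPy check (shapes chosen arbitrarily; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c_in, c_out = 4, 16, 10
x = rng.standard_normal((n, c_in, 1, 1))  # globally pooled feature maps
w = rng.standard_normal((c_out, c_in))    # shared classifier weights
b = rng.standard_normal(c_out)

# Flatten -> Linear
linear_out = x.reshape(n, c_in) @ w.T + b

# Conv2d(1x1) -> Flatten: a 1x1 conv is a per-pixel matmul over channels
conv_out = np.einsum('nchw,oc->nohw', x, w) + b[:, None, None]
conv_out = conv_out.reshape(n, c_out)

assert np.allclose(linear_out, conv_out)
```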
ConvNeXt (Tiny/Base/Large)#
| Fused module | Count | Description |
|---|---|---|
| FusedConvBN | 36/72/108 | Depthwise + pointwise convolutions |
| FusedConvBNReLU | 27/54/81 | Conv + BN + GELU chains |
| FusedLayerNorm | 27/54/81 | LayerNorm layers (unquantized) |
| FusedLinear | 27/54/81 | Standalone linear layers |
Counts increase with model depth (Tiny: 9 stages × 3, Base: 9 × 6, Large: 9 × 9).
Conversions applied:
- RemoveIdentityAdaptiveAvgPoolPattern removes identity pooling ops.
- FlattenLinearToConv1x1Pattern converts the classifier head.
ConvNeXt uses depthwise separable convolutions extensively. The default pattern set quantizes all convolutions equally, but mixed-precision (see Custom Patterns) skips depthwise convolutions for better latency.
Vision Transformer (ViT-B/16)#
| Fused module | Count | Description |
|---|---|---|
| FusedConvBN | 1 | Patch embedding Conv2d |
| FusedMHAInProjection | 12 | Q/K/V projections |
| FusedScaledDotProductAttention | 12 | Attention computation |
| FusedLinear | 36+ | Out-proj + MLP linear layers |
| FusedLinearReLU | 12 | MLP hidden → GELU chains |
| FusedLayerNorm | 25 | Pre/post-norm layers (unquantized) |
Conversions applied: DecomposeMultiheadAttentionPattern decomposes all 12
attention layers into explicit sub-modules.
Running fusions only#
```python
from embedl_deploy import transform
from embedl_deploy.tensorrt import TENSORRT_FUSION_PATTERNS

result = transform(model, patterns=TENSORRT_FUSION_PATTERNS)
```
To inspect what was fused:
```python
from collections import Counter

# Count fused modules by class name; type(m).__name__ is always a str,
# so the 'Fused' prefix check is sufficient on its own.
fused_counts = Counter(
    type(m).__name__
    for m in result.model.modules()
    if type(m).__name__.startswith('Fused')
)
for name, count in sorted(fused_counts.items()):
    print(f"  {name}: {count}")
```