Custom Patterns#

The built-in TENSORRT_PATTERNS list is a good starting point, but real-world deployment often benefits from a custom pattern list that skips quantization on operators where INT8 hurts more than it helps.

This is the “mixed-precision” strategy: selectively place QDQ stubs based on the compute characteristics of each operator and its behavior on specific hardware.

Why customize?#

Two operators commonly benefit from being left in FP16:

Depthwise convolutions#

Depthwise convolutions (groups == in_channels) are memory-bound, not compute-bound. Quantizing them to INT8 forces TensorRT to insert data reformatting layers (INT8 → FP16 → INT8) around the convolution, and this reformatting overhead typically exceeds the compute savings from INT8.
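A back-of-envelope MAC count makes the memory-bound claim concrete: a depthwise convolution performs `in_channels` times fewer multiply-accumulates than a dense convolution of the same shape, so INT8 has very little compute to accelerate. A quick sketch in plain Python (the shapes are illustrative):

```python
def conv2d_macs(c_in, c_out, k, h, w, groups=1):
    """Multiply-accumulates for a k x k Conv2d producing an h x w output."""
    return c_out * (c_in // groups) * k * k * h * w


dense = conv2d_macs(96, 96, 7, 56, 56)                  # regular 7x7 conv
depthwise = conv2d_macs(96, 96, 7, 56, 56, groups=96)   # depthwise 7x7 conv
print(dense // depthwise)  # → 96: same data moved, 96x less compute
```

Both layers read and write the same activations, so the depthwise variant is dominated by memory traffic rather than arithmetic.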

This effect is especially pronounced on ConvNeXt, which uses depthwise 7×7 convolutions throughout.
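To see how much of a model this affects, the same `groups == in_channels` test can be applied directly to any `nn.Module`. A minimal sketch in plain PyTorch (no `embedl_deploy` APIs; the ConvNeXt-style block below is illustrative):

```python
import torch.nn as nn


def count_depthwise_convs(model: nn.Module) -> int:
    """Count Conv2d layers where groups == in_channels > 1 (depthwise)."""
    return sum(
        1
        for m in model.modules()
        if isinstance(m, nn.Conv2d)
        and m.groups > 1
        and m.groups == m.in_channels
    )


# A ConvNeXt-style block: depthwise 7x7 followed by pointwise 1x1 convs
block = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96),  # depthwise
    nn.Conv2d(96, 384, kernel_size=1),                       # pointwise
    nn.GELU(),
    nn.Conv2d(384, 96, kernel_size=1),                       # pointwise
)
print(count_depthwise_convs(block))  # → 1
```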

Global average pooling#

AdaptiveAvgPool2d is a memory-bound reduction with no matrix multiplication. INT8 quantization adds QDQ stubs on both sides but provides negligible compute benefit while risking accuracy loss.
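There really is no arithmetic worth accelerating: global average pooling is just a mean over the spatial dimensions, as a quick check in plain PyTorch confirms:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 7, 7)
pooled = nn.AdaptiveAvgPool2d((1, 1))(x)   # global average pooling
mean = x.mean(dim=(2, 3), keepdim=True)    # the same reduction, written out
assert torch.allclose(pooled, mean, atol=1e-6)
```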

Writing a custom pattern#

A custom pattern is a subclass of Pattern with:

  • tree — the node topology to match

  • qdq_points — where QDQ stubs should be placed (empty = no quantization)

  • match() — how to find occurrences

  • replace() — how to rewrite the graph

Here is a complete example that matches depthwise convolutions and skips quantization by declaring empty qdq_points:

import torch.nn as nn
from torch import fx

from embedl_deploy._internal.core.match import match_tree
from embedl_deploy._internal.core.pattern import (
    Pattern,
    PatternMatch,
    Wildcard,
    get_module,
)
from embedl_deploy._internal.core.replace import replace_tree
from embedl_deploy._internal.tensorrt.modules.conv import FusedConvBN


def _is_depthwise_conv(node: fx.Node) -> bool:
    """Return True for a depthwise Conv2d (groups == in_channels > 1)."""
    module = get_module(node)
    return (
        isinstance(module, nn.Conv2d)
        and module.groups > 1
        and module.groups == module.in_channels
    )


class DepthwiseConvBNPattern(Pattern):
    """Match depthwise Conv2d → [BatchNorm2d] without quantization."""

    tree = (_is_depthwise_conv, Wildcard((nn.BatchNorm2d,)))
    qdq_points = frozenset()  # no QDQ stubs

    def match(self, graph_module: fx.GraphModule) -> list[PatternMatch]:
        return match_tree(graph_module, pattern=self)

    def replace(self, pattern_match: PatternMatch) -> list[fx.Node]:
        tree_match = pattern_match.tree_match
        conv = get_module(tree_match.get_node(0))
        wild_nodes = tree_match.get_node(1).nodes
        bn = get_module(wild_nodes[0]) if wild_nodes else None
        return replace_tree(
            pattern_match,
            [FusedConvBN(conv, bn, bn_foldable=bn is not None)],
        )

Key points:

  • The _is_depthwise_conv predicate uses a callable check instead of a module type, allowing fine-grained matching.

  • qdq_points = frozenset() means insert_qdq will not place any quantization stubs around matched depthwise convolutions.

  • The replacement uses the existing FusedConvBN module, which is mathematically equivalent to the original Conv2d → BatchNorm2d sequence — BatchNorm can be folded into Conv2d without changing the output.
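The folding mentioned in the last point is standard arithmetic: in eval mode, BatchNorm applies a per-channel affine transform that can be absorbed into the convolution's weight and bias. A minimal sketch of the math in plain PyTorch (FusedConvBN's actual implementation may differ in detail):

```python
import torch
import torch.nn as nn


def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to conv followed by bn (eval mode)."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding,
        groups=conv.groups, bias=True,
    )
    with torch.no_grad():
        # BN(y) = gamma * (y - running_mean) / sqrt(running_var + eps) + beta
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.bias)
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused


conv = nn.Conv2d(8, 8, 3, padding=1, groups=8)  # depthwise example
bn = nn.BatchNorm2d(8)
x = torch.randn(1, 8, 16, 16)
bn.train()
bn(conv(x))  # one forward pass to populate BN running statistics
bn.eval()
with torch.no_grad():
    assert torch.allclose(fold_bn_into_conv(conv, bn)(x), bn(conv(x)), atol=1e-4)
```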

Building a custom pattern list#

A custom pattern list is assembled by combining built-in patterns with your custom ones. Order matters — longer/more-specific patterns first:

# NOTE: These imports use internal APIs. Public exports may be added
# in a future release.
from embedl_deploy._internal.tensorrt.patterns.conversions import (
    DecomposeMultiheadAttentionPattern,
    FlattenLinearToConv1x1Pattern,
    RemoveIdentityAdaptiveAvgPoolPattern,
)
from embedl_deploy._internal.tensorrt.patterns.fusions.conv import (
    ConvBNAddReLUPattern,
    ConvBNPattern,
    ConvBNReLUPattern,
    StemConvBNReLUMaxPoolPattern,
)
from embedl_deploy._internal.tensorrt.patterns.fusions.linear import (
    LayerNormPattern,
    LinearPattern,
    LinearReLUPattern,
)
from embedl_deploy._internal.tensorrt.patterns.fusions.attention import (
    MHAInProjectionPattern,
    ScaledDotProductAttentionPattern,
)

SMART_PATTERNS = [
    # -- Conversions (applied first, iteratively) --
    DecomposeMultiheadAttentionPattern(),
    FlattenLinearToConv1x1Pattern(),
    RemoveIdentityAdaptiveAvgPoolPattern(),

    # -- Fusions (longest first) --
    StemConvBNReLUMaxPoolPattern(),
    ConvBNAddReLUPattern(),
    ConvBNReLUPattern(),
    LinearReLUPattern(),
    DepthwiseConvBNPattern(),  # custom: no QDQ on depthwise
    ConvBNPattern(),
    LinearPattern(),
    LayerNormPattern(),
    MHAInProjectionPattern(),
    ScaledDotProductAttentionPattern(),
    # NOTE: AdaptiveAvgPoolPattern() intentionally omitted
    #       → no QDQ stubs around GlobalAveragePooling
]

Two deliberate omissions:

  1. No AdaptiveAvgPoolPattern — global average pooling stays in FP16.

  2. DepthwiseConvBNPattern with empty qdq_points — depthwise convolutions are fused but not quantized.
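If you maintain a long custom list, ordering mistakes are easy to make: a generic ConvBNPattern placed before ConvBNReLUPattern would consume the Conv + BN pair and leave the ReLU unfused. One guard is to sort fusion patterns by tree size. The helper below is hypothetical (not an embedl_deploy API) and assumes each pattern exposes a `tree` tuple, as in the subclass above:

```python
def sort_by_specificity(patterns):
    """Order patterns so larger (more specific) trees are tried first."""
    return sorted(patterns, key=lambda p: len(p.tree), reverse=True)


class _FakePattern:
    """Stand-in exposing the same `tree` attribute as Pattern subclasses."""
    def __init__(self, name, tree_len):
        self.name = name
        self.tree = tuple(range(tree_len))


ordered = sort_by_specificity(
    [_FakePattern("ConvBN", 2), _FakePattern("ConvBNReLU", 3)]
)
print([p.name for p in ordered])  # → ['ConvBNReLU', 'ConvBN']
```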

Using the custom pattern list#

import torch
import torch.nn as nn
from torch.fx.passes.shape_prop import ShapeProp

# NOTE: These imports use internal APIs. Public exports may be added
# in a future release.
from embedl_deploy._internal.core.modules import symbolic_trace

from embedl_deploy import transform
from embedl_deploy.quantize import (
    QuantConfig,
    TensorQuantConfig,
    insert_qdq,
    calibrate,
)

# Define your model (e.g., from torchvision)
# my_model = torchvision.models.convnext_tiny(weights="DEFAULT")

# Trace and propagate shapes (needed for conversion patterns)
model = my_model.cpu().eval()
gm = symbolic_trace(model)
ShapeProp(gm).propagate(torch.randn(1, 3, 224, 224))

# Transform with custom patterns
result = transform(gm, patterns=SMART_PATTERNS)
fused_model = result.model
matches = result.matches

# Verify lossless fusion (use the same input for both models)
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    y_orig = model(x)
    y_fused = fused_model(x)
max_diff = (y_orig - y_fused).abs().max().item()
assert max_diff < 1e-4

# Insert QDQ and calibrate
quantized = insert_qdq(
    fused_model,
    matches,
    config=QuantConfig(
        activation=TensorQuantConfig(n_bits=8, symmetric=True),
        weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
        skip_weight_quant_for=(nn.LayerNorm,),
    ),
)

# calibration_batches: your own iterable of representative input batches
# (e.g., a few dozen batches from your validation loader)
def forward_loop(model):
    for batch in calibration_batches[:32]:
        model(batch)

calibrate(quantized, forward_loop)
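For intuition about what the inserted stubs do: a QDQ pair implements fake quantization (quantize, then immediately dequantize) so that TensorRT can later fuse the pair into real INT8 kernels. A minimal sketch of symmetric per-tensor fake quantization, matching the `n_bits=8, symmetric=True` config above but independent of embedl_deploy:

```python
import torch


def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization: quantize, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1        # 127 for 8 bits
    scale = x.abs().max() / qmax        # symmetric: zero point is 0
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale


x = torch.randn(4, 16)
xq = fake_quantize(x)
# Round-trip error is at most half a quantization step
assert (x - xq).abs().max() <= x.abs().max() / 127
```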

Impact on benchmarks#

The mixed-precision strategy makes a significant difference on models with depthwise convolutions. On ConvNeXt Large (NVIDIA RTX 4090, TensorRT 10.9):

Variant                       Latency   Speedup
Baseline FP16                 2.19 ms   1.00x
Blanket INT8 (ModelOpt)       2.15 ms   1.02x
Smart INT8 (Embedl Deploy)    1.70 ms   1.29x

Blanket quantization barely improves over FP16 because TensorRT reformatting overhead around depthwise convolutions offsets the INT8 compute gains. Smart quantization avoids this by leaving depthwise convolutions in FP16.

See Benchmark Results for full results across architectures.