embedl_deploy.tensorrt.modules package#

Module contents:

Public re-exports of TensorRT fused nn.Module classes.

Users import from here:

from embedl_deploy.tensorrt.modules import FusedConvBNAct, FusedConvBN
class embedl_deploy.tensorrt.modules.FusedAdaptiveAvgPool2d(pool: AdaptiveAvgPool2d)[source]#

Bases: FusedModule

Fused wrapper for AdaptiveAvgPool2d.

forward(x: Tensor) → Tensor[source]#

Apply adaptive average pooling.

inputs_to_quantize: set[int] = set()#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
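The `inputs_to_quantize` contract can be illustrated with a small stand-alone sketch. `FakeConvLike`, `FakeAddLike`, and `insert_quant_stubs` below are hypothetical stand-ins, not embedl_deploy APIs; they only show how a Q/DQ insertion pass might read the attribute:

```python
# Hypothetical sketch of how a Q/DQ insertion pass could consume
# `inputs_to_quantize`. These classes are illustrative stand-ins,
# not part of embedl_deploy's API.

class FakeFusedModule:
    # Indices of positional inputs that should be wrapped in a QuantStub.
    inputs_to_quantize: set = set()

class FakeConvLike(FakeFusedModule):
    inputs_to_quantize = {0}      # quantize the activation input only

class FakeAddLike(FakeFusedModule):
    inputs_to_quantize = {0, 1}   # quantize both branches of a residual add

def insert_quant_stubs(module: FakeFusedModule, num_inputs: int) -> list[str]:
    """Return, per positional input, whether it would get a QuantStub."""
    return [
        "quant" if i in module.inputs_to_quantize else "passthrough"
        for i in range(num_inputs)
    ]

print(insert_quant_stubs(FakeConvLike(), 1))   # ['quant']
print(insert_quant_stubs(FakeAddLike(), 2))    # ['quant', 'quant']
```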

class embedl_deploy.tensorrt.modules.FusedConvBN(conv: Conv2d, bn: BatchNorm2d | None, *, bn_foldable: bool = True)[source]#

Bases: FusedModule

Fused Conv2d [BatchNorm2d] (no activation).

forward(x: Tensor) → Tensor[source]#

Apply conv [bn].

inputs_to_quantize: set[int] = {0}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
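The `bn_foldable` flag reflects that a BatchNorm following a conv can be folded into the conv's weight and bias at convert time. A minimal scalar sketch of that algebra (not embedl_deploy's implementation):

```python
import math

# Scalar sketch of BatchNorm folding, the algebra behind `bn_foldable`.
# For a conv output y = w*x + b followed by
# BN(y) = (y - mean) / sqrt(var + eps) * gamma + beta,
# the BN can be absorbed into new conv parameters w' and b'.

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Folding must leave the output unchanged:
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.2, -0.3, 0.05, 0.9
x = 2.0

unfused = ((w * x + b) - mean) / math.sqrt(var + 1e-5) * gamma + beta
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
fused = w_f * x + b_f
assert math.isclose(unfused, fused)
```

The same identity holds per output channel for a real Conv2d/BatchNorm2d pair, which is why the fused module can drop the BN entirely when `bn_foldable` is true.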

class embedl_deploy.tensorrt.modules.FusedConvBNAct(conv: Conv2d, bn: BatchNorm2d | None, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, *, bn_foldable: bool = True)[source]#

Bases: FusedModule

Fused Conv2d [BatchNorm2d] Activation.

forward(x: Tensor) → Tensor[source]#

Apply conv [bn] act.

inputs_to_quantize: set[int] = {0}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

class embedl_deploy.tensorrt.modules.FusedConvBNActMaxPool(conv: Conv2d, bn: BatchNorm2d | None, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, maxpool: MaxPool2d, *, bn_foldable: bool = True)[source]#

Bases: FusedModule

Fused Conv2d [BatchNorm2d] Activation MaxPool2d.

forward(x: Tensor) → Tensor[source]#

Apply conv [bn] act maxpool.

inputs_to_quantize: set[int] = {0}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

class embedl_deploy.tensorrt.modules.FusedConvBNAddAct(conv: Conv2d, bn: BatchNorm2d, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, *, bn_foldable: bool = True)[source]#

Bases: FusedModule

Fused Conv2d BatchNorm2d add(·, residual) Activation.

forward() accepts two inputs: the main tensor x and the residual tensor.

forward(x: Tensor, residual: Tensor) → Tensor[source]#

Apply conv bn add(·, residual) act.

inputs_to_quantize: set[int] = {0, 1}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
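Both positional inputs carry a QuantStub here (`inputs_to_quantize = {0, 1}`) because the residual add consumes two tensors, and each operand of an int8 elementwise add needs its own Q/DQ pair. A toy sketch of that round-trip, assuming a symmetric int8 scheme (illustrative arithmetic, not embedl_deploy's QuantStub implementation):

```python
import math

# Symmetric int8 fake-quantization applied to both operands of a residual
# add, mirroring inputs_to_quantize = {0, 1}. Illustrative only.

def fake_quant(x: float, scale: float) -> float:
    """Quantize to int8 and immediately dequantize (round-trip)."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale

def quantized_residual_add(x: float, residual: float, scale: float) -> float:
    # Both positional inputs (indices 0 and 1) pass through a fake-quant
    # step before the add, so each operand has an explicit Q/DQ pair.
    return fake_quant(x, scale) + fake_quant(residual, scale)

# 0.300 and 0.128 round-trip to roughly 0.30 and 0.13 at scale 0.01:
result = quantized_residual_add(0.300, 0.128, scale=0.01)
```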

class embedl_deploy.tensorrt.modules.FusedLayerNorm(layer_norm: LayerNorm)[source]#

Bases: FusedModule

Fused wrapper for a standalone LayerNorm.

Weight quantization is disabled by default. LayerNorm’s learnable weight is an element-wise affine scale, so quantizing it yields negligible savings while hurting accuracy.

Parameters:

layer_norm – The nn.LayerNorm from the matched chain.

forward(x: Tensor) → Tensor[source]#

Apply layer_norm, fake-quantizing the weight.

inputs_to_quantize: set[int] = set()#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

prefers_fp_input: bool = True#

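
To see why the affine weight is a poor quantization target, here is a minimal pure-Python LayerNorm over the last dimension (illustrative only, not the fused module's code): `weight` and `bias` hold just one scalar per feature, a negligible fraction of the model's parameters.

```python
import math

# Minimal LayerNorm over a 1-D feature vector, making the "element-wise
# affine scale" concrete: weight and bias have one entry per feature.

def layer_norm(x, weight, bias, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, weight=[1.0] * 4, bias=[0.0] * 4)
# With unit affine parameters the output is just the normalized input,
# so it has (approximately) zero mean and unit variance.
assert abs(sum(out)) < 1e-6
```
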
class embedl_deploy.tensorrt.modules.FusedLinear(linear: Linear)[source]#

Bases: FusedModule

Fused wrapper for a standalone Linear layer.

Parameters:

linear – The nn.Linear from the matched chain.

forward(x: Tensor) → Tensor[source]#

Apply linear, fake-quantizing the weight.

inputs_to_quantize: set[int] = {0}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

class embedl_deploy.tensorrt.modules.FusedLinearAct(linear: Linear, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh)[source]#

Bases: FusedModule

Fused Linear Activation.

Parameters:
  • linear – The nn.Linear from the matched chain.

  • act – The activation module from the matched chain.

forward(x: Tensor) → Tensor[source]#

Apply linear activation, fake-quantizing the weight.

inputs_to_quantize: set[int] = {0}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

class embedl_deploy.tensorrt.modules.FusedMHAInProjection(in_proj: MHAInProjection)[source]#

Bases: FusedModule

Fused wrapper for MHAInProjection.

Allows the Q/DQ insertion pass to place quantize / dequantize stubs around the input projection and to attach a WeightFakeQuantize for the packed linear weight.

Parameters:

in_proj – The MHAInProjection from the decomposed MHA.

forward(query: Tensor, _key: Tensor, _value: Tensor) → tuple[Tensor, ...][source]#

Project input to per-head (Q, K, V) tensors.

When weight_fake_quant is set, fake-quantizes the packed projection weight before the linear operation. Only query is used; _key and _value are accepted to match the call-site signature but ignored for self-attention.

Parameters:

query – Input tensor of shape [B, S, E].

Returns:

Tuple (Q, K, V) each of shape [B, num_heads, S, head_dim].

inputs_to_quantize: set[int] = {0}#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
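The shape bookkeeping implied by the docstring can be checked with plain arithmetic. The packed-weight layout `(3 * E, E)` below is an assumption based on the usual torch.nn.MultiheadAttention `in_proj_weight` convention, not something this page specifies:

```python
# Shape arithmetic for the packed in-projection, per the docstring:
# input [B, S, E] -> Q, K, V each [B, num_heads, S, head_dim].

B, S, E = 2, 16, 64
num_heads = 8
head_dim = E // num_heads          # E must divide evenly by num_heads

packed_weight_shape = (3 * E, E)   # assumed: Q, K, V projections stacked
qkv_shape = (B, num_heads, S, head_dim)

assert num_heads * head_dim == E
```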

class embedl_deploy.tensorrt.modules.FusedScaledDotProductAttention(attention: ScaledDotProductAttention)[source]#

Bases: FusedModule

Fused wrapper for ScaledDotProductAttention.

Allows the Q/DQ insertion pass to place quantize / dequantize stubs on each of the three inputs (Q, K, V).

Additionally holds an internal QuantStub between the softmax output and the second batched matrix multiply (BMM2). When None (before Q/DQ insertion) the forward pass is numerically identical to the unwrapped ScaledDotProductAttention.

Parameters:

attention – The ScaledDotProductAttention from the decomposed MHA.

forward(q: Tensor, k: Tensor, v: Tensor) → Tensor[source]#

Compute scaled dot-product attention.

Performs manual attention with an internal quantization step between softmax and BMM2.

Parameters:
  • q – Query tensor [B, num_heads, S, head_dim].

  • k – Key tensor [B, num_heads, S, head_dim].

  • v – Value tensor [B, num_heads, S, head_dim].

Returns:

Output tensor [B, S, embed_dim].

inputs_to_quantize: set[int] = set()#

Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
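For reference, the computation this module wraps, softmax(Q Kᵀ / √head_dim) V, can be written out in pure Python for a single head and batch with 2-D Q, K, V (illustrative only; the fused module operates on 4-D batched tensors and adds the internal Q/DQ step between softmax and BMM2):

```python
import math

# Reference scaled dot-product attention for 2-D Q, K, V (one head).

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)                       # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def sdpa(q, k, v):
    head_dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)                                  # BMM1: Q @ K^T
    scores = [[s / math.sqrt(head_dim) for s in row] for row in scores]
    attn = [softmax(row) for row in scores]                  # attention weights
    return matmul(attn, v)                                   # BMM2: attn @ V

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = sdpa(q, k, v)
# Each softmax row sums to 1, so every output row is a convex
# combination of the rows of v.
```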