`embedl_deploy.tensorrt.modules` package

Module contents:

Public re-exports of TensorRT fused nn.Module classes.

Users import from here:

from embedl_deploy.tensorrt.modules import FusedConvBNAct, FusedConvBN

class embedl_deploy.tensorrt.modules.FusedAdaptiveAvgPool2d(pool: AdaptiveAvgPool2d)

Bases: FusedModule

Fused wrapper for AdaptiveAvgPool2d.

forward(x: Tensor) → Tensor: Apply adaptive average pooling.

inputs_to_quantize: set[int] = {}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
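The rule that every subclass sets inputs_to_quantize explicitly could be checked when the class is defined. The following is a minimal sketch under stated assumptions, not the library's implementation: FusedModuleSketch, PoolSketch, and ConvSketch are hypothetical stand-ins, and enforcing the rule through `__init_subclass__` is an assumption. Note that in Python source an empty set is spelled `set()`, because a bare `{}` literal is a dict.

```python
class FusedModuleSketch:
    """Hypothetical stand-in for FusedModule; not the library's class."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Look only in the subclass's own namespace, so a value inherited
        # from a parent class does not count as "set explicitly".
        if "inputs_to_quantize" not in cls.__dict__:
            raise TypeError(f"{cls.__name__} must set inputs_to_quantize")


class PoolSketch(FusedModuleSketch):
    # No positional input receives a QuantStub.
    # An empty set is written set(); a bare {} would be a dict.
    inputs_to_quantize = set()


class ConvSketch(FusedModuleSketch):
    # Only the first positional argument (index 0) is quantized.
    inputs_to_quantize = {0}


def defines_without_attribute() -> bool:
    """Return True if a subclass lacking the attribute is rejected."""
    try:
        class MissingSketch(FusedModuleSketch):
            pass
    except TypeError:
        return True
    return False


print(PoolSketch.inputs_to_quantize, ConvSketch.inputs_to_quantize)
print("rejected:", defines_without_attribute())
```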

class embedl_deploy.tensorrt.modules.FusedConvBN(conv: Conv2d, bn: BatchNorm2d | None)

Bases: FusedModule

Fused Conv2d → [BatchNorm2d] (no activation).

forward(x: Tensor) → Tensor: Apply conv → [bn].

inputs_to_quantize: set[int] = {0}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

Bases: FusedModule

Fused Conv2d → [BatchNorm2d] → Act.

forward(x: Tensor) → Tensor: Apply conv → [bn] → act.

inputs_to_quantize: set[int] = {0}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

Bases: FusedModule

Fused Conv2d → [BatchNorm2d] → Activation → MaxPool2d.

forward(x: Tensor) → Tensor: Apply conv → [bn] → act → maxpool.

inputs_to_quantize: set[int] = {0}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

Bases: FusedModule

Fused Conv2d → BatchNorm2d → add(·, residual) → Activation.

forward() accepts two inputs: the main tensor x and the residual tensor.

forward(x: Tensor, residual: Tensor) → Tensor: Apply conv → bn → add(·, residual) → act.

inputs_to_quantize: set[int] = {0, 1}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
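To illustrate how a set of positional indices such as {0, 1} can drive input quantization, here is a small sketch. It is not the Q/DQ insertion pass itself: quantize_selected and tag are hypothetical helpers, and tag merely marks a value where a real pass would insert a QuantStub.

```python
def quantize_selected(args, inputs_to_quantize, quant_stub):
    """Apply quant_stub to each positional argument whose index is listed."""
    return tuple(
        quant_stub(arg) if index in inputs_to_quantize else arg
        for index, arg in enumerate(args)
    )


def tag(value):
    # Placeholder for a QuantStub: marks the value instead of quantizing it.
    return ("q", value)


# Two-input module: both the main tensor and the residual are marked.
print(quantize_selected(("x", "residual"), {0, 1}, tag))

# Single-input setting: only index 0 is marked, index 1 passes through.
print(quantize_selected(("x", "residual"), {0}, tag))

# Empty set: nothing is marked.
print(quantize_selected(("x", "residual"), set(), tag))
```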

class embedl_deploy.tensorrt.modules.FusedLayerNorm(layer_norm: LayerNorm)

Bases: FusedModule

Fused wrapper for a standalone LayerNorm.

Weight quantization is disabled by default. LayerNorm’s learnable weight is an element-wise affine scale with one value per normalized feature, so quantizing it saves very little while risking a loss of accuracy.

Parameters: layer_norm – The nn.LayerNorm from the matched chain.
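The size side of the argument for leaving the LayerNorm weight unquantized can be made concrete with a little arithmetic. This sketch compares element counts only and says nothing about accuracy; the helper names and the embedding size of 768 are illustrative assumptions, not values from the library.

```python
def layer_norm_weight_count(normalized_dim: int) -> int:
    """Elements in a LayerNorm weight: one scale per normalized feature."""
    return normalized_dim


def linear_weight_count(in_features: int, out_features: int) -> int:
    """Elements in a Linear weight matrix (bias excluded)."""
    return in_features * out_features


embed_dim = 768  # illustrative size, not taken from the library
ln_elements = layer_norm_weight_count(embed_dim)
linear_elements = linear_weight_count(embed_dim, embed_dim)
fraction = ln_elements / linear_elements

print(f"LayerNorm weight: {ln_elements} elements")
print(f"Linear weight:    {linear_elements} elements")
print(f"LayerNorm / Linear = {fraction:.4%}")
```

For a square Linear layer of the same width, the LayerNorm weight is a factor of embed_dim smaller, so its storage is a rounding error next to the matrix weights.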

forward(x: Tensor) → Tensor: Apply layer_norm.

inputs_to_quantize: set[int] = {}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

prefers_fp_input: bool = True

class embedl_deploy.tensorrt.modules.FusedLinear(linear: Linear)

Bases: FusedModule

Fused wrapper for a standalone Linear layer.

Parameters: linear – The nn.Linear from the matched chain.

forward(x: Tensor) → Tensor: Apply linear, fake-quantizing the weight.

inputs_to_quantize: set[int] = {0}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
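Fake-quantizing a weight, as FusedLinear.forward is described as doing, means rounding the values onto an integer grid and mapping them straight back to floats, so the rest of the computation stays in floating point while seeing the quantization error. The sketch below assumes symmetric, per-tensor, 8-bit quantization; the scheme the library actually uses is not stated here and may differ (per-channel scales, for example). The function name fake_quantize is hypothetical.

```python
def fake_quantize(values, num_bits=8):
    """Quantize then dequantize a flat list of floats (symmetric, per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                 # nothing to scale
    scale = max_abs / qmax                  # float step between grid points
    result = []
    for v in values:
        q = round(v / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the signed range
        result.append(q * scale)            # back to floating point
    return result


weights = [0.5, -1.0, 0.25, 0.0]
quantized = fake_quantize(weights)
step = max(abs(w) for w in weights) / 127

for original, approx in zip(weights, quantized):
    print(f"{original:+.4f} -> {approx:+.6f} (error {abs(original - approx):.6f})")
```

Each value moves by at most half a grid step, which is the error the downstream layers are exposed to during calibration or training.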

Bases: FusedModule

Fused Linear → Activation.

Parameters:

linear – The nn.Linear from the matched chain.
act – The activation module from the matched chain.

forward(x: Tensor) → Tensor: Apply linear → activation, fake-quantizing the weight.

inputs_to_quantize: set[int] = {0}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

class embedl_deploy.tensorrt.modules.FusedMHAInProjection(in_proj: MHAInProjection)

Bases: FusedModule

Fused wrapper for MHAInProjection.

Allows the Q/DQ insertion pass to place quantize / dequantize stubs around the input projection and to attach a WeightFakeQuantize for the packed linear weight.

Parameters: in_proj – The MHAInProjection from the decomposed MHA.

forward(query: Tensor, _key: Tensor, _value: Tensor) → tuple[Tensor, ...]

Project input to per-head (Q, K, V) tensors.

Fake-quantizes the packed projection weight when enabled, then performs the linear operation. Only query is used; _key and _value are accepted to match the call-site signature but ignored for self-attention.

Parameters: query – Input tensor of shape [B, S, E].
Returns: Tuple (Q, K, V) each of shape [B, num_heads, S, head_dim].
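The shape bookkeeping between the [B, S, E] input and the per-head outputs can be sketched as follows. This assumes the usual multi-head convention head_dim = E / num_heads, which this page does not state explicitly; qkv_shapes is a hypothetical helper that computes shapes only and performs no projection.

```python
def qkv_shapes(batch, seq_len, embed_dim, num_heads):
    """Return the (Q, K, V) shapes for an input of shape [B, S, E]."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads       # assumed convention
    per_head = (batch, num_heads, seq_len, head_dim)
    return per_head, per_head, per_head


q_shape, k_shape, v_shape = qkv_shapes(batch=2, seq_len=16, embed_dim=64, num_heads=8)
print("Q:", q_shape)
print("K:", k_shape)
print("V:", v_shape)
```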

inputs_to_quantize: set[int] = {0}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.

class embedl_deploy.tensorrt.modules.FusedScaledDotProductAttention(attention: ScaledDotProductAttention)

Bases: FusedModule

Fused wrapper for ScaledDotProductAttention.

Allows the Q/DQ insertion pass to place quantize / dequantize stubs on each of the three inputs (Q, K, V).

Additionally holds an internal QuantStub between the softmax output and the second batched matrix multiply (BMM2). When that stub is disabled the forward pass delegates to the unwrapped ScaledDotProductAttention; when enabled it performs manual attention with the quantization step.

Parameters: attention – The ScaledDotProductAttention from the decomposed MHA.

forward(q: Tensor, k: Tensor, v: Tensor, attn_mask: Tensor | None = None) → Tensor

Compute scaled dot-product attention.

When the SDPA has been surrounded by QuantStubs on its Q/K/V inputs and the internal softmax quant stub is enabled, performs manual attention with a quantization step between softmax and BMM2. Otherwise delegates to the wrapped attention module so TensorRT can fuse it into its native FP16 MHA kernel.
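The manual path described above, with a quantization step between the softmax and BMM2, can be sketched for a single head in plain Python. This is an illustration under assumptions, not the module's code: the helpers are hypothetical, tensors are nested lists, and the softmax output is snapped to a fixed grid of 127 levels over [0, 1], which may not match the scale the library's internal QuantStub uses.

```python
import math


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (nested lists)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)                              # subtract max for stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_probs(row, levels=127):
    # Assumed grid: probabilities in [0, 1] rounded to multiples of 1/levels.
    return [round(p * levels) / levels for p in row]


def attention(q, k, v, quantize_softmax=False):
    """Single-head scaled dot-product attention on [S, head_dim] matrices."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]  # BMM1
    probs = [softmax(row) for row in scores]
    if quantize_softmax:
        probs = [quantize_probs(row) for row in probs]  # step before BMM2
    return matmul(probs, v)                                                  # BMM2


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

reference = attention(q, k, v)
quantized = attention(q, k, v, quantize_softmax=True)
max_diff = max(
    abs(a - b) for ra, rb in zip(reference, quantized) for a, b in zip(ra, rb)
)
print("reference:", reference)
print("quantized:", quantized)
print("max abs difference:", max_diff)
```

With quantize_softmax left off, the function is the ordinary attention computation, which corresponds to the case where the wrapper delegates to the inner module.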

Parameters:

q – Query tensor [B, num_heads, S, head_dim].
k – Key tensor [B, num_heads, S, head_dim].
v – Value tensor [B, num_heads, S, head_dim].
attn_mask – Optional mask forwarded to the inner attention. Either an additive float mask broadcastable to [B, num_heads, S, S] or a bool mask where True means “attend”.

Returns:

Output tensor [B, num_heads, S, head_dim]. Callers are responsible for any subsequent head-flattening reshape.
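The two mask conventions accepted by attn_mask are related by a simple conversion: a bool mask where True means "attend" corresponds to an additive mask that is 0 at allowed positions and negative infinity at blocked ones. The sketch below shows that conversion and its effect on one softmax row. The helper names are hypothetical, the masks are nested lists rather than tensors, and how the inner attention module handles masks internally is not described on this page.

```python
import math


def to_additive_mask(bool_mask):
    """Convert a bool mask (True = attend) into an additive float mask."""
    return [[0.0 if allowed else -math.inf for allowed in row] for row in bool_mask]


def masked_softmax(scores, additive_mask):
    """Add the mask to the scores, then apply a row-wise softmax."""
    result = []
    for score_row, mask_row in zip(scores, additive_mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(shifted)  # finite as long as one position is allowed
        exps = [math.exp(v - peak) for v in shifted]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Causal mask for S = 2: position 0 cannot attend to position 1.
causal = [[True, False], [True, True]]
additive = to_additive_mask(causal)
probs = masked_softmax([[1.0, 2.0], [1.0, 1.0]], additive)

print("additive mask:", additive)
print("probabilities:", probs)
```

A row in which every position is blocked has no finite maximum and produces NaN in this sketch, so callers would need to avoid fully masked rows or handle them separately.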

inputs_to_quantize: set[int] = {}: Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
