embedl_deploy.tensorrt.modules package
Module contents:
Public re-exports of TensorRT fused nn.Module classes.
Users import from here:
from embedl_deploy.tensorrt.modules import FusedConvBNAct, FusedConvBN
- class embedl_deploy.tensorrt.modules.FusedAdaptiveAvgPool2d(pool: AdaptiveAvgPool2d)
Bases: FusedModule
Fused wrapper for AdaptiveAvgPool2d.
- class embedl_deploy.tensorrt.modules.FusedConvBN(conv: Conv2d, bn: BatchNorm2d | None)
Bases: FusedModule
Fused Conv2d → [BatchNorm2d] (no activation).
- class embedl_deploy.tensorrt.modules.FusedConvBNAct(conv: Conv2d, bn: BatchNorm2d | None, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh)
Bases: FusedModule
Fused Conv2d → [BatchNorm2d] → Activation.
- class embedl_deploy.tensorrt.modules.FusedConvBNActMaxPool(conv: Conv2d, bn: BatchNorm2d | None, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, maxpool: MaxPool2d)
Bases: FusedModule
Fused Conv2d → [BatchNorm2d] → Activation → MaxPool2d.
- class embedl_deploy.tensorrt.modules.FusedConvBNAddAct(conv: Conv2d, bn: BatchNorm2d, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh)
Bases: FusedModule
Fused Conv2d → BatchNorm2d → add(·, residual) → Activation.
forward() accepts two inputs: the main tensor x and the residual tensor.
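The ordering described above (residual added before the activation) can be sketched with scalar stand-ins. This is not the library's implementation: `conv_bn_add_act` and the three helper functions are hypothetical plain-Python callables, chosen only to make the arithmetic easy to follow.

```python
def conv_bn_add_act(x, residual, conv, bn, act):
    """Apply conv -> bn -> add(., residual) -> act, in that order."""
    return act(bn(conv(x)) + residual)


# Hypothetical scalar stand-ins for the real layers (assumptions, not the API).
def conv(x):
    return 2.0 * x  # stands in for Conv2d


def bn(y):
    return (y - 1.0) / 2.0  # stands in for BatchNorm2d with fixed statistics


def relu(z):
    return max(z, 0.0)  # stands in for the activation


fused = conv_bn_add_act(-3.0, 5.0, conv, bn, relu)

# Activating before the add produces a different value, so the residual
# must reach the block before the activation is applied.
act_first = relu(bn(conv(-3.0))) + 5.0
```

Here `fused` and `act_first` differ, which illustrates why the residual is passed into the fused block as a second input rather than added afterwards.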
- class embedl_deploy.tensorrt.modules.FusedLayerNorm(layer_norm: LayerNorm)
Bases: FusedModule
Fused wrapper for a standalone LayerNorm.
Weight quantization is disabled by default. LayerNorm’s learnable weight is an element-wise affine scale, so quantizing it yields negligible savings while hurting accuracy.
- Parameters:
layer_norm – The nn.LayerNorm from the matched chain.
- inputs_to_quantize: set[int] = {}
Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
- prefers_fp_input: bool = True
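To make the role of inputs_to_quantize concrete, the sketch below shows how a pass could read such a class-level set to decide which positional inputs receive a stub. It is plain Python with no dependency on embedl_deploy; `FusedBase`, `TwoInputNode`, `FpInputNode`, and `plan_stubs` are hypothetical names, and the real Q/DQ insertion pass may work differently.

```python
class FusedBase:
    """Hypothetical stand-in for a fused-module base class."""

    # An empty set is spelled set(); a bare {} would be an empty dict.
    inputs_to_quantize: set[int] = set()


class TwoInputNode(FusedBase):
    # Quantize both the first and the second positional input.
    inputs_to_quantize = {0, 1}


class FpInputNode(FusedBase):
    # Keep every input in floating point.
    inputs_to_quantize = set()


def plan_stubs(node_cls, num_inputs):
    """Return one flag per positional input: True means 'insert a stub'."""
    return [index in node_cls.inputs_to_quantize for index in range(num_inputs)]


two_input_plan = plan_stubs(TwoInputNode, 3)
fp_plan = plan_stubs(FpInputNode, 2)
```

Keeping the indices on the class, rather than on each instance, lets a graph pass make the decision from the node type alone.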
- class embedl_deploy.tensorrt.modules.FusedLinear(linear: Linear)
Bases: FusedModule
Fused wrapper for a standalone Linear layer.
- Parameters:
linear – The nn.Linear from the matched chain.
- class embedl_deploy.tensorrt.modules.FusedLinearAct(linear: Linear, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh)
Bases: FusedModule
Fused Linear → Activation.
- Parameters:
linear – The nn.Linear from the matched chain.
act – The activation module from the matched chain.
- class embedl_deploy.tensorrt.modules.FusedMHAInProjection(in_proj: MHAInProjection)
Bases: FusedModule
Fused wrapper for MHAInProjection.
Allows the Q/DQ insertion pass to place quantize / dequantize stubs around the input projection and to attach a WeightFakeQuantize for the packed linear weight.
- Parameters:
in_proj – The MHAInProjection from the decomposed MHA.
- forward(query: Tensor, _key: Tensor, _value: Tensor) → tuple[Tensor, ...]
Project input to per-head (Q, K, V) tensors.
Fake-quantizes the packed projection weight when enabled, then performs the linear operation. Only query is used; _key and _value are accepted to match the call-site signature but are ignored for self-attention.
- Parameters:
query – Input tensor of shape [B, S, E].
- Returns:
Tuple (Q, K, V), each of shape [B, num_heads, S, head_dim].
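The shape change from [B, S, E] to three [B, num_heads, S, head_dim] tensors can be illustrated with nested lists. The sketch below starts from an already-projected packed output of shape [B][S][3*E] and assumes the packed layout is Q, then K, then V along the last dimension (the convention used by torch.nn.MultiheadAttention); whether MHAInProjection uses the same layout is an assumption. `split_packed_qkv` is a hypothetical helper, not part of embedl_deploy.

```python
def split_packed_qkv(packed, num_heads):
    """Split [B][S][3*E] nested lists into Q, K, V of [B][num_heads][S][head_dim].

    Assumes the last dimension is laid out as [Q | K | V].
    """
    packed_dim = len(packed[0][0])
    if packed_dim % 3 != 0:
        raise ValueError("last dimension must be 3 * embed_dim")
    embed_dim = packed_dim // 3
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads

    def heads(offset):
        # For each batch element and head, slice that head's columns
        # out of every sequence position.
        return [
            [
                [
                    row[offset + h * head_dim : offset + (h + 1) * head_dim]
                    for row in batch
                ]
                for h in range(num_heads)
            ]
            for batch in packed
        ]

    return heads(0), heads(embed_dim), heads(2 * embed_dim)


# B = 1, S = 1, E = 4, so the packed last dimension has 12 entries.
packed = [[list(range(12))]]
q, k, v = split_packed_qkv(packed, num_heads=2)
```

With two heads and E = 4, each of `q`, `k`, and `v` ends up with shape [1][2][1][2], matching the [B, num_heads, S, head_dim] layout described above.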
- class embedl_deploy.tensorrt.modules.FusedScaledDotProductAttention(attention: ScaledDotProductAttention)
Bases: FusedModule
Fused wrapper for ScaledDotProductAttention.
Allows the Q/DQ insertion pass to place quantize / dequantize stubs on each of the three inputs (Q, K, V).
Additionally holds an internal QuantStub between the softmax output and the second batched matrix multiply (BMM2). When that stub is disabled the forward pass delegates to the unwrapped ScaledDotProductAttention; when enabled it performs manual attention with the quantization step.
- Parameters:
attention – The ScaledDotProductAttention from the decomposed MHA.
- forward(q: Tensor, k: Tensor, v: Tensor, attn_mask: Tensor | None = None) → Tensor
Compute scaled dot-product attention.
When QuantStubs have been placed on the Q/K/V inputs and the internal softmax quant stub is enabled, performs manual attention with a quantization step between softmax and BMM2. Otherwise delegates to the wrapped attention module so TensorRT can fuse it into its native FP16 MHA kernel.
- Parameters:
q – Query tensor [B, num_heads, S, head_dim].
k – Key tensor [B, num_heads, S, head_dim].
v – Value tensor [B, num_heads, S, head_dim].
attn_mask – Optional mask forwarded to the inner attention. Either an additive float mask broadcastable to [B, num_heads, S, S] or a bool mask where True means “attend”.
- Returns:
Output tensor [B, num_heads, S, head_dim]. Callers are responsible for any subsequent head-flattening reshape.
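The "manual attention" path and the two mask conventions can be sketched for a single batch element and a single head using plain Python lists. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, and the fake-quantization step uses a symmetric 127-level grid as a stand-in for whatever the internal QuantStub actually applies.

```python
import math


def bool_mask_to_additive(mask):
    """True means "attend" (add 0.0); False means "block" (add -inf)."""
    return [[0.0 if keep else -math.inf for keep in row] for row in mask]


def fake_quantize(p, levels=127):
    """Round to a symmetric integer grid, then map back to float (assumed scheme)."""
    q = max(-levels, min(levels, round(p * levels)))
    return q / levels


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_single_head(q, k, v, additive_mask=None, quantize_probs=False):
    """q, k, v: [S][head_dim] lists. Rows that are fully masked are not handled."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_row in enumerate(q):
        # BMM1: scaled dot product of one query row with every key row.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if additive_mask is not None:
            scores = [s + m for s, m in zip(scores, additive_mask[i])]
        probs = softmax(scores)
        if quantize_probs:
            # Quantization step between softmax and BMM2.
            probs = [fake_quantize(p) for p in probs]
        # BMM2: probability-weighted sum of the value rows.
        out.append(
            [sum(p * v_row[d] for p, v_row in zip(probs, v)) for d in range(len(v[0]))]
        )
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
causal = bool_mask_to_additive([[True, False], [True, True]])
plain = attention_single_head(identity, identity, identity, causal)
quantized = attention_single_head(identity, identity, identity, causal, quantize_probs=True)
```

Because the values are the identity matrix, each output row equals the attention probabilities, so comparing `plain` and `quantized` shows the rounding error introduced between softmax and BMM2 directly.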