embedl_deploy.tensorrt.modules package#
Module contents:
Public re-exports of TensorRT fused nn.Module classes.
Users import from here:
from embedl_deploy.tensorrt.modules import FusedConvBNAct, FusedConvBN
- class embedl_deploy.tensorrt.modules.FusedAdaptiveAvgPool2d(pool: AdaptiveAvgPool2d)[source]#
Bases: FusedModule
Fused wrapper for AdaptiveAvgPool2d.
- class embedl_deploy.tensorrt.modules.FusedConvBN(conv: Conv2d, bn: BatchNorm2d | None, *, bn_foldable: bool = True)[source]#
Bases: FusedModule
Fused Conv2d → [BatchNorm2d] (no activation).
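The batch-norm folding that the `bn_foldable` flag refers to can be sketched in NumPy (the helper below is illustrative, not the library's implementation): folding a BatchNorm2d into the preceding Conv2d rescales each output filter and shifts the bias, so the fused layer computes the same result with a single convolution.

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BN parameters into a conv weight/bias.

    w: conv weight [C_out, C_in, kH, kW], b: conv bias [C_out].
    gamma/beta/mean/var: BN parameters, each of shape [C_out].
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    w_folded = w * scale[:, None, None, None]   # rescale each output filter
    b_folded = (b - mean) * scale + beta        # shift the bias
    return w_folded, b_folded

# Check on a 1x1 conv applied to a single pixel, where conv is just a matmul.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 1, 1))
b = rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)

x = rng.normal(size=3)                          # one input pixel, 3 channels
conv = w[:, :, 0, 0] @ x + b                    # plain conv output
bn = (conv - mean) / np.sqrt(var + 1e-5) * gamma + beta

w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)
fused = w_f[:, :, 0, 0] @ x + b_f               # folded conv, no BN needed
assert np.allclose(bn, fused)
```

Since the fold is exact (BN is an affine per-channel map), the fused module is numerically equivalent to the original Conv2d → BatchNorm2d pair in eval mode.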
- class embedl_deploy.tensorrt.modules.FusedConvBNAct(conv: Conv2d, bn: BatchNorm2d | None, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, *, bn_foldable: bool = True)[source]#
Bases: FusedModule
Fused Conv2d → [BatchNorm2d] → Activation.
- class embedl_deploy.tensorrt.modules.FusedConvBNActMaxPool(conv: Conv2d, bn: BatchNorm2d | None, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, maxpool: MaxPool2d, *, bn_foldable: bool = True)[source]#
Bases: FusedModule
Fused Conv2d → [BatchNorm2d] → Activation → MaxPool2d.
- class embedl_deploy.tensorrt.modules.FusedConvBNAddAct(conv: Conv2d, bn: BatchNorm2d, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh, *, bn_foldable: bool = True)[source]#
Bases: FusedModule
Fused Conv2d → BatchNorm2d → add(·, residual) → Activation. forward() accepts two inputs: the main tensor x and the residual tensor.
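A minimal NumPy sketch of the two-input forward described above (the function name, the 1×1 conv, and ReLU as the activation are illustrative assumptions, not the library API; BN is assumed already folded into `w`/`b`):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def conv_bn_add_act(x, residual, w, b):
    """Sketch of a FusedConvBNAddAct-style forward: 1x1 conv (with BN
    already folded into w/b), residual add, then the activation."""
    y = np.einsum('oi,bihw->bohw', w, x) + b[None, :, None, None]
    return relu(y + residual)

x = np.ones((1, 3, 2, 2))                  # main input [B, C_in, H, W]
residual = np.full((1, 4, 2, 2), -100.0)   # residual branch [B, C_out, H, W]
w, b = np.ones((4, 3)), np.zeros(4)
y = conv_bn_add_act(x, residual, w, b)     # conv gives 3.0; 3 - 100 < 0, so ReLU clamps to 0
```

The key point is the call shape: unlike the other fused conv modules, this one's forward takes the residual tensor as a second positional argument.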
- class embedl_deploy.tensorrt.modules.FusedLayerNorm(layer_norm: LayerNorm)[source]#
Bases: FusedModule
Fused wrapper for a standalone LayerNorm.
Weight quantization is disabled by default: LayerNorm’s learnable weight is an element-wise affine scale, so quantizing it yields negligible savings while hurting accuracy.
- Parameters:
layer_norm – The nn.LayerNorm from the matched chain.
- inputs_to_quantize: set[int] = set()#
Positional argument indices that should receive a QuantStub. The Q/DQ insertion pass uses this to decide which inputs of the fused node to quantize. Every subclass must set this explicitly.
- prefers_fp_input: bool = True#
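How a subclass declares these attributes can be sketched as follows (`FusedModule` here is a plain stand-in and `MyFusedAddAct` a hypothetical subclass, not the real library classes):

```python
class FusedModule:
    """Stand-in base class: each fused subclass must declare which
    positional inputs of its forward() receive a QuantStub."""
    inputs_to_quantize: set[int] = set()   # must be overridden per subclass
    prefers_fp_input: bool = False

class MyFusedAddAct(FusedModule):
    # Quantize both the main input (index 0) and the residual (index 1),
    # so the Q/DQ pass places a stub on each forward() argument.
    inputs_to_quantize = {0, 1}

assert MyFusedAddAct.inputs_to_quantize == {0, 1}
```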
- class embedl_deploy.tensorrt.modules.FusedLinear(linear: Linear)[source]#
Bases: FusedModule
Fused wrapper for a standalone Linear layer.
- Parameters:
linear – The nn.Linear from the matched chain.
- class embedl_deploy.tensorrt.modules.FusedLinearAct(linear: Linear, act: ReLU | ReLU6 | GELU | SiLU | Mish | Hardswish | Hardsigmoid | LeakyReLU | PReLU | ELU | Sigmoid | Tanh)[source]#
Bases: FusedModule
Fused Linear → Activation.
- Parameters:
linear – The nn.Linear from the matched chain.
act – The activation module from the matched chain.
- class embedl_deploy.tensorrt.modules.FusedMHAInProjection(in_proj: MHAInProjection)[source]#
Bases: FusedModule
Fused wrapper for MHAInProjection. Allows the Q/DQ insertion pass to place quantize / dequantize stubs around the input projection and to attach a WeightFakeQuantize for the packed linear weight.
- Parameters:
in_proj – The MHAInProjection from the decomposed MHA.
- forward(query: Tensor, _key: Tensor, _value: Tensor) → tuple[Tensor, ...][source]#
Project the input to per-head (Q, K, V) tensors. When weight_fake_quant is set, the packed projection weight is fake-quantized before the linear operation. Only query is used; _key and _value are accepted to match the call-site signature but are ignored for self-attention.
- Parameters:
query – Input tensor of shape [B, S, E].
- Returns:
Tuple (Q, K, V), each of shape [B, num_heads, S, head_dim].
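The pack-then-split shape bookkeeping can be sketched in NumPy (`mha_in_projection` is an illustrative stand-in; the fake-quantization step on the packed weight is omitted):

```python
import numpy as np

def mha_in_projection(query, w_packed, b_packed, num_heads):
    """Sketch of a packed MHA input projection: one matmul with the
    packed [3E, E] weight, then a split into per-head Q, K, V."""
    B, S, E = query.shape
    head_dim = E // num_heads
    proj = query @ w_packed.T + b_packed        # [B, S, 3E]
    q, k, v = np.split(proj, 3, axis=-1)        # each [B, S, E]
    def to_heads(t):                            # [B, S, E] -> [B, H, S, D]
        return t.reshape(B, S, num_heads, head_dim).transpose(0, 2, 1, 3)
    return to_heads(q), to_heads(k), to_heads(v)

# B=2, S=5, E=8, num_heads=4 -> head_dim=2; packed weight is [3E, E] = [24, 8].
q, k, v = mha_in_projection(np.zeros((2, 5, 8)),
                            np.zeros((24, 8)), np.zeros(24), num_heads=4)
assert q.shape == k.shape == v.shape == (2, 4, 5, 2)
```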
- class embedl_deploy.tensorrt.modules.FusedScaledDotProductAttention(attention: ScaledDotProductAttention)[source]#
Bases: FusedModule
Fused wrapper for ScaledDotProductAttention. Allows the Q/DQ insertion pass to place quantize / dequantize stubs on each of the three inputs (Q, K, V).
Additionally holds an internal QuantStub between the softmax output and the second batched matrix multiply (BMM2). When the stub is None (before Q/DQ insertion), the forward pass is numerically identical to the unwrapped ScaledDotProductAttention.
- Parameters:
attention – The ScaledDotProductAttention from the decomposed MHA.
- forward(q: Tensor, k: Tensor, v: Tensor) → Tensor[source]#
Compute scaled dot-product attention.
Performs manual attention with an internal quantization step between softmax and BMM2.
- Parameters:
q – Query tensor [B, num_heads, S, head_dim].
k – Key tensor [B, num_heads, S, head_dim].
v – Value tensor [B, num_heads, S, head_dim].
- Returns:
Output tensor [B, S, embed_dim].
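The math this forward performs, minus the internal QuantStub between softmax and BMM2, can be sketched in NumPy (a reference implementation under the shapes above, not the library's code):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q @ k^T / sqrt(head_dim)) @ v, then merge heads
    back to [B, S, embed_dim]. The fused module's QuantStub would sit
    between the softmax output (probs) and the second matmul (BMM2)."""
    B, H, S, D = q.shape
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D)   # BMM1: [B, H, S, S]
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over keys
    out = probs @ v                                     # BMM2: [B, H, S, D]
    return out.transpose(0, 2, 1, 3).reshape(B, S, H * D)

# B=2, H=4, S=5, D=2 -> embed_dim = H * D = 8.
out = scaled_dot_product_attention(np.ones((2, 4, 5, 2)),
                                   np.ones((2, 4, 5, 2)),
                                   np.ones((2, 4, 5, 2)))
assert out.shape == (2, 5, 8)
assert np.allclose(out, 1.0)   # uniform softmax over all-ones values
```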