embedl_deploy.quantize package#

Module contents:

Public quantization API.

Users import from here:

from embedl_deploy.quantize import configure, calibrate_qdq, quantize
class embedl_deploy.quantize.CalibrationMethod(*values)[source]#

Bases: Enum

Algorithm for collecting activation statistics during PTQ calibration.

Each member’s value is the torch.ao observer class that backs QuantStub during calibration.

HISTOGRAM = <class 'torch.ao.quantization.observer.HistogramObserver'>#

Build a histogram and search for the optimal quantization range.

MINMAX = <class 'torch.ao.quantization.observer.MinMaxObserver'>#

Track the global minimum and maximum (default, fastest).

MOVING_AVERAGE_MINMAX = <class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>#

Exponential moving average of min/max — less sensitive to outliers.

class embedl_deploy.quantize.ModulesToSkip(stub: set[type[Module] | Module] = <factory>, weight: set[type[Module] | Module] = <factory>, smooth: set[type[Module] | Module] = <factory>)[source]#

Bases: object

Specifies which modules or module types to leave disabled during configure.

smooth: set[type[Module] | Module]#

Modules to leave smooth quantization disabled for.

stub: set[type[Module] | Module]#

Modules to leave stub quantization disabled for.

weight: set[type[Module] | Module]#

Modules to leave weight quantization disabled for.

class embedl_deploy.quantize.QuantConfig(activation: TensorQuantConfig = <factory>, weight: TensorQuantConfig = <factory>, smooth_quant: SmoothQuantConfig = <factory>, skip: ModulesToSkip = <factory>)[source]#

Bases: object

Top-level quantization configuration.

Bundles separate TensorQuantConfig instances for activations and weights so that each can be configured independently.

Parameters:
  • activation – Settings for activation (inter-layer) quantization.

  • weight – Settings for weight quantization (Conv kernels only; bias is always left in floating-point).

  • smooth_quant – SmoothQuant settings applied during calibration. See SmoothQuantConfig.

  • skip – Specifies which modules or module types to leave disabled during configure. See ModulesToSkip.

activation: TensorQuantConfig#

Settings for activation quantization.

skip: ModulesToSkip#

Modules or module types to leave disabled during configure.

smooth_quant: SmoothQuantConfig#

SmoothQuant settings applied during calibration.

weight: TensorQuantConfig#

Settings for weight quantization.
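
As a sketch of how these pieces compose (based only on the signatures documented above; the chosen field values, and skipping nn.Linear in particular, are illustrative assumptions, not recommendations):

```python
import torch.nn as nn

from embedl_deploy.quantize import (
    CalibrationMethod,
    ModulesToSkip,
    QuantConfig,
    TensorQuantConfig,
)

# Histogram-calibrated asymmetric activations, per-channel symmetric
# weights, and quantization left disabled for all nn.Linear modules.
config = QuantConfig(
    activation=TensorQuantConfig(
        n_bits=8,
        symmetric=False,
        calibration_method=CalibrationMethod.HISTOGRAM,
    ),
    weight=TensorQuantConfig(n_bits=8, symmetric=True, per_channel=True),
    skip=ModulesToSkip(stub={nn.Linear}, weight={nn.Linear}),
)
```

The resulting object is passed unchanged to configure().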

class embedl_deploy.quantize.QuantStub(consumers: set[Module], n_bits: int = 8, symmetric: bool = True, calibration_method: CalibrationMethod = CalibrationMethod.MINMAX, *, fixed_calibration: tuple[float, int] | None = None)[source]#

Bases: Module

Quantize a floating-point tensor.

During calibration the module delegates statistics collection to a torch.ao observer selected by calibration_method. After calibration, scale and zero_point are derived from the observer and used by torch.fake_quantize_per_tensor_affine() in the forward pass.

Parameters:
  • consumers – Set of modules that consume this stub’s output.

  • n_bits – Number of quantization bits (default 8).

  • symmetric – Symmetric or asymmetric quantization.

  • calibration_method – Algorithm used to collect activation statistics. Defaults to MINMAX.

  • fixed_calibration – Fixed (scale, zero_point) tuple. When provided, calibration will not override the values.

compute_parameters() None[source]#

Derive scale and zero_point from the observer.

Raises:

RuntimeError – If no data was observed during calibration.

forward(x: Tensor) Tensor[source]#

Fake-quantize x, updating observer stats if calibrating.

scale: Tensor#

Quantization scale used by the fake-quantize forward pass.

zero_point: Tensor#

Zero point used by the fake-quantize forward pass.

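
A plain-Python sketch of what MINMAX calibration followed by fake quantization computes. QuantStub delegates this to a torch.ao observer and torch.fake_quantize_per_tensor_affine(); the helper names and the asymmetric affine formulas below are illustrative conventions, not the package's internals:

```python
def derive_parameters(x_min, x_max, quant_min=0, quant_max=255):
    """Derive (scale, zero_point) from an observed min/max range."""
    # The representable range must include zero.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (quant_max - quant_min)
    zero_point = round(quant_min - x_min / scale)
    return scale, max(quant_min, min(quant_max, zero_point))

def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Quantize then immediately dequantize a single value."""
    q = max(quant_min, min(quant_max, round(x / scale) + zero_point))
    return (q - zero_point) * scale

# Calibration observed values in [-2, 6]; 0.3 is snapped to the 8-bit grid.
scale, zp = derive_parameters(-2.0, 6.0)
y = fake_quantize(0.3, scale, zp)
```
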
class embedl_deploy.quantize.SmoothQuantConfig(alpha: float = 0.5)[source]#

Bases: object

SmoothQuant migration settings.

Controls the per-channel weight/activation redistribution applied by calibrate_smooth_quant().

Parameters:

alpha – Migration strength in [0, 1]. 0 keeps all difficulty on activations; 1 pushes it entirely to weights.

alpha: float = 0.5#

Migration strength in [0, 1].
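
The role of alpha in the per-channel redistribution performed by calibrate_smooth_quant() can be illustrated with the standard SmoothQuant scale formula (the helper below is hypothetical, not part of the package):

```python
def smooth_scale(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel migration scale s = max|X|^alpha / max|W|^(1 - alpha)."""
    return act_absmax ** alpha / weight_absmax ** (1.0 - alpha)

# A channel with activation absmax 8 and weight absmax 2, balanced migration:
s = smooth_scale(8.0, 2.0, alpha=0.5)
# Activations are divided by s and the matching weight rows multiplied by s,
# so (X / s) @ (s * W) == X @ W: only quantization difficulty moves.
```
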

class embedl_deploy.quantize.TensorQuantConfig(n_bits: int = 8, symmetric: bool = True, per_channel: bool = False, calibration_method: CalibrationMethod = CalibrationMethod.MINMAX)[source]#

Bases: object

Quantization settings for a single tensor class (activation or weight).

Parameters:
  • n_bits – Number of bits for the quantized representation. Must be between 2 and 16 inclusive.

  • symmetric – When True the quantized range is centred on zero (zero_point = 0). When False an asymmetric range is used, allowing better coverage of distributions that are not centred on zero.

  • per_channel – When True and when used for weight quantization, a separate scale/zero-point is computed along the output-channel axis (axis 0). Defaults to False for backward compatibility but should be set to True for production quantization of Conv/Linear weights.

calibration_method: CalibrationMethod = CalibrationMethod.MINMAX#

Calibration algorithm used by QuantStub to collect activation statistics. Only relevant for activation configs; weight quantization computes scale/zero-point on-the-fly.

n_bits: int = 8#

Number of quantization bits (2–16).

per_channel: bool = False#

Per-channel quantization (weight only).

property quant_max: int#

Maximum representable integer value.

property quant_min: int#

Minimum representable integer value.

quant_range() tuple[bool, int, int][source]#

Return (symmetric, quant_min, quant_max).

symmetric: bool = True#

Symmetric (zero_point = 0) or asymmetric range.
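
The integer grid implied by n_bits and symmetric can be sketched as follows, mirroring the (symmetric, quant_min, quant_max) tuple returned by quant_range(). The exact convention (for example, whether the symmetric minimum is -2^(n-1) or the narrowed -(2^(n-1) - 1)) is an assumption here; quant_min and quant_max are authoritative:

```python
def quant_range(n_bits=8, symmetric=True):
    """Return (symmetric, quant_min, quant_max) for an n-bit grid."""
    if not 2 <= n_bits <= 16:
        raise ValueError("n_bits must be between 2 and 16 inclusive")
    if symmetric:
        # Signed, zero-centred grid.
        return True, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    # Unsigned, asymmetric grid.
    return False, 0, 2 ** n_bits - 1
```
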

class embedl_deploy.quantize.WeightFakeQuantize(consumers: set[Module], n_bits: int = 8, symmetric: bool = True, per_channel: bool = False, *, channel_axis: int = 0)[source]#

Bases: Module

Fake-quantize a weight tensor during the forward pass.

Unlike QuantStub (which requires a calibration pass), this module computes scale and zero_point on-the-fly from the weight tensor. This is the correct behavior for QAT, where the weights change at every training step.

Parameters:
  • consumers – Set of modules that consume this module’s output.

  • n_bits – Number of quantization bits.

  • symmetric – Symmetric (zero_point = 0) or asymmetric.

  • per_channel – Use per-channel quantization along channel_axis.

  • channel_axis – The axis along which per-channel scales are computed (default 0, i.e. output channels).

forward(weight: Tensor) Tensor[source]#

Fake-quantize weight using on-the-fly scale/zero_point.
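
Per-channel scale computation along axis 0 (output channels) can be sketched in plain Python for a 2-D weight; the real module recomputes this from the live weight tensor on every forward pass, and the symmetric-scale formula below is a common convention rather than the package's exact implementation:

```python
def per_channel_scales(weight, quant_max=127):
    """One symmetric scale per output channel (row) of a 2-D weight."""
    return [max(abs(w) for w in row) / quant_max for row in weight]

weight = [
    [0.5, -1.27, 0.3],   # channel 0: absmax 1.27
    [2.54, 0.1, -0.2],   # channel 1: absmax 2.54
]
scales = per_channel_scales(weight)  # one scale per row
```
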

embedl_deploy.quantize.calibrate_qdq(model: GraphModule, forward_loop: Callable[[GraphModule], None]) None[source]#

Calibrate Q/DQ stubs by running the user’s forward loop.

Switches every QuantStub into calibration mode, invokes forward_loop once, then finalizes scale / zero_point from the observed min/max ranges.

The model is modified in-place.

Parameters:
  • model – A configured GraphModule whose fused modules have been set up by configure().

  • forward_loop

    (model) -> None callable that runs representative data through the model. The caller controls batch size, device placement, and iteration count. Example:

    def forward_loop(model):
        for batch in calib_loader:
            model(batch)
    

Raises:
  • ValueError – If the model contains no enabled QuantStub modules.

  • RuntimeError – If any stub did not observe finite values during the loop.

  • Exception – Re-raises any exception from forward_loop after restoring model state.

embedl_deploy.quantize.calibrate_smooth_quant(model: GraphModule, forward_loop: Callable[[GraphModule], None]) None[source]#

Calibrate and apply SmoothQuant to a fused model in-place.

Migrates quantization difficulty from activations to weights for every LayerNorm → Linear pair that has an enabled SmoothQuantObserver.

Must be called after transform() (fusion) and configure(), and before calibrate_qdq().

Parameters:
  • model – A fused GraphModule whose observers have been enabled by configure.

  • forward_loop – (model) -> None callable that runs representative data through the model. The caller controls batch size, device placement, and iteration count.

Raises:

Exception – Re-raises any exception from forward_loop after restoring model state.

embedl_deploy.quantize.configure(model: GraphModule, config: QuantConfig) None[source]#

Configure quantization settings on all fused modules in-place.

Walks every FusedModule and:

  • Configures and enables each QuantStub. Stubs without fixed_calibration receive config.activation; stubs not excluded by config.skip are enabled.

  • Configures and enables weight_fake_quant (respecting skip).

  • Enables smooth_quant_observer where present and copies the smooth_quant config from config.

Parameters:
  • model – A GraphModule produced by the fusion step. Modified in-place.

  • config – A QuantConfig controlling activation bits, weight bits, and which module types to skip.

embedl_deploy.quantize.disable_fake_quant(model: Module) Module[source]#

Disable fake quantization throughout the model.

Parameters:

model – The model to modify in-place.

Returns:

The same model, for method chaining.

embedl_deploy.quantize.enable_fake_quant(model: Module) Module[source]#

Enable fake quantization in all stubs and weight quantizers.

Parameters:

model – The model to modify in-place.

Returns:

The same model, for method chaining.

embedl_deploy.quantize.freeze_bn_stats(model: Module) Module[source]#

Freeze BatchNorm running statistics.

Puts all BatchNorm*d layers into eval mode so running_mean and running_var are no longer updated. Affine parameters remain trainable.

Parameters:

model – The model to modify in-place.

Returns:

The same model, for method chaining.

embedl_deploy.quantize.prepare_qat(model: Module) Module[source]#

Prepare a quantized model for quantization-aware training.

Sets the model to training mode so that QuantStub nodes propagate gradients through STE and WeightFakeQuantize nodes apply fake-quant to weights.

Parameters:

model – A quantized nn.Module.

Returns:

The same model, in-place, for method chaining.

embedl_deploy.quantize.quantize(model: GraphModule, args: tuple[Any, ...], config: QuantConfig | None = None, *, forward_loop: Callable[[GraphModule], None]) GraphModule[source]#

Configure, insert Q/DQ stubs, optimize, and calibrate in one call.

Convenience wrapper that chains configure() → calibrate_smooth_quant() → calibrate_qdq().

Parameters:
  • model – A GraphModule produced by the fusion step.

  • args – Example inputs used for shape propagation, which produces the tensor metadata required for calibration.

  • config – Optional QuantConfig. Defaults to 8-bit symmetric.

  • forward_loop – (model) -> None callable that runs representative data through the model. The caller controls batch size, device placement, and iteration count.

Returns:

The quantized GraphModule with calibrated stubs.
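
An end-to-end sketch of the one-call workflow. It assumes a fusion step has already produced a GraphModule; fused_model, example_input, and calib_loader are hypothetical placeholders supplied by the caller:

```python
from embedl_deploy.quantize import QuantConfig, quantize

def forward_loop(model):
    # Run representative data through the model; the caller controls
    # batch size, device placement, and iteration count.
    for batch in calib_loader:
        model(batch)

quantized = quantize(
    fused_model,             # GraphModule produced by the fusion step
    args=(example_input,),   # used for shape propagation
    config=QuantConfig(),    # defaults to 8-bit symmetric
    forward_loop=forward_loop,
)
```

Omitting config falls back to the 8-bit symmetric defaults; passing the QuantConfig built above gives full control over bits, calibration method, and skipped modules.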