# Embedl Deploy

Build deployable physical AI models. Design, optimize, and quantize in PyTorch. Deploy confidently on the edge.

## AI Deployment challenges

The expected workflow for AI model deployment is:

1. Design models in Python (e.g., in PyTorch).
2. Convert and quantize to an intermediate format (e.g., ONNX).
3. Compile with a hardware toolchain (e.g., TensorRT).

In practice, this workflow is rarely smooth and often breaks, delaying deployment. Python-first model design, opaque toolchain rewrites, and poor quantization create late-stage failures when teams try to export, quantize, and compile models for specific target hardware. The result is a risk of failing to deploy models on time, debugging cycles spanning months, performance bottlenecks, and missed key metrics such as accuracy, latency, throughput, or memory.

```{list-table}
:header-rows: 1

* -
  - What fails in practice
  - Root cause
  - Deployment impact
* - Complex models
  - Graph tracing often fails on dynamic control flow, and export to intermediate formats becomes unreliable.
  - Design-hardware mismatch: hardware constraints are not surfaced during model design.
  - Late discovery during export, quantization, or compilation, delaying deployment.
* - Unsupported ops
  - Hardware compilers enforce strict operator constraints that surface as late-stage deployment issues.
  - Toolchain knowledge gap: hardware constraints and fusions are not transparent during model design. Specialized expertise and debugging are needed to understand hardware behavior.
  - Manual graph rewrites become necessary.
* - Graph rewrites
  - Black-box compilers can rewrite graphs unpredictably, making root-cause analysis difficult.
  - Hidden toolchain transformations are applied to make models compile on target hardware.
  - Slow debugging loops and reduced engineering productivity.
* - Quantization issues
  - Poor quantization operator placement breaks fusion opportunities and can hurt performance. Quantized models can even become slower than the baseline models.
  - Quantization configuration does not match hardware behavior, especially in mixed-precision workflows.
  - Inability to reach target metrics such as accuracy, latency, throughput, and memory consumption.
```

## The Embedl Deploy Solution

```{list-table}
:header-rows: 1

* -
  - Solution
  - Why it matters
  - Benefit
* - PyTorch-native workflow
  - Transform, optimize, and quantize in PyTorch.
  - Keeps iteration in one environment.
  - Faster development and clearer debugging.
* - Hardware-aware transforms
  - Enforces compiler constraints, fusions, and possible graph rewrites in PyTorch.
  - Removes opaque toolchain rewrites and the need for specialized knowledge about target compiler behavior. Enforces deployable graph behavior, so what you validate is what runs.
  - Predictable and transparent behavior rather than late-stage deployment issues.
* - Quantization
  - Hardware-aware quantization optimization with pattern-based QDQ placements in PyTorch.
  - Reliable quantization that does not break operator fusions and matches hardware behavior, reducing the need for specialized knowledge and manual debugging.
  - Higher accuracy, predictable quantization performance, and lower deployment risk.
* - Production readiness
  - Surfaces hardware behavior early and reduces late rework.
  - Teams resolve compatibility issues sooner.
  - Faster time-to-market and better first-pass success.
```

Embedl Deploy packages hardware expertise into reusable patterns, making any PyTorch model deployable with a predictable, debuggable pipeline. Quantization is hardware-aware, and QDQ placements are validated in PyTorch before compilation, preserving accuracy while hitting latency and memory targets.

The public release supports NVIDIA TensorRT. Enterprise support for additional hardware is available at [embedl.com](https://www.embedl.com/).
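The hardware-aware transforms above revolve around fusions such as Conv + BatchNorm. As a rough illustration of the underlying idea (this is plain PyTorch, not the Embedl Deploy API; all names are illustrative), the following sketch folds an eval-mode `BatchNorm2d` into a preceding `Conv2d` so the two layers become a single operator:

```python
import torch
from torch import nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into a preceding Conv2d (illustrative)."""
    w = conv.weight.detach().clone()
    b = (conv.bias.detach().clone() if conv.bias is not None
         else torch.zeros(conv.out_channels))
    # In eval mode, BN computes y = (x - mean) / sqrt(var + eps) * gamma + beta,
    # a per-channel affine map that can be absorbed into the conv parameters.
    scale = bn.weight.detach() / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight = nn.Parameter(w * scale.reshape(-1, 1, 1, 1))
    fused.bias = nn.Parameter((b - bn.running_mean) * scale + bn.bias.detach())
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
# Give BN non-trivial running statistics so the comparison is meaningful.
bn.running_mean.uniform_(-0.5, 0.5)
bn.running_var.uniform_(0.5, 1.5)
conv.eval(); bn.eval()

x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    ref = bn(conv(x))          # original two-layer subgraph
    out = fold_bn_into_conv(conv, bn)(x)  # fused single conv
print(torch.allclose(ref, out, atol=1e-5))
```

A real hardware-aware transform additionally has to match such subgraphs inside an arbitrary model graph and verify that the replacement is one the target compiler actually fuses, which is the part Embedl Deploy's patterns encapsulate.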
## Core concepts

- **Modules**: PyTorch ``nn.Module`` objects (e.g., ``ConvBatchNormRelu``) that represent hardware-friendly subgraph replacements.
- **Patterns**: Hardware-aware graph patterns covering fusions, conversions, and quantization placements. A pattern finds a subgraph and replaces it with a hardware-aware equivalent. Patterns are determined by extensive exploration of the hardware toolchain's restrictions and verified against the target compiler.
- **Transform**: Applies patterns to your model in one pass, returning a deployable ``nn.Module`` that behaves identically to the original.

The images below show the layer mapping before and after TensorRT compilation:

```{raw} html