# Embedl Deploy

Build deployable physical AI models. Design, optimize, and quantize in PyTorch. Deploy confidently on the edge.

## AI Deployment challenges

The expected workflow for AI model deployment is:

1. Design models in Python (e.g., in PyTorch).
2. Convert and quantize to an intermediate format (e.g., ONNX).
3. Compile with a hardware toolchain (e.g., TensorRT).

In practice, this workflow is rarely smooth and often breaks, delaying deployment. Python-first model design, opaque toolchain rewrites, and poor quantization create late-stage failures when teams try to export, quantize, and compile models for specific target hardware. The result is a risk of failing to deploy models on time, debugging cycles spanning months, performance bottlenecks, and missed key metrics such as accuracy, latency, throughput, or memory.

```{list-table}
:header-rows: 1

* -
  - What fails in practice
  - Root cause
  - Deployment impact
* - Complex models
  - Graph tracing often fails on dynamic control flow, and export to intermediate formats becomes unreliable.
  - Design-hardware mismatch: hardware constraints are not surfaced during model design.
  - Late discovery during export, quantization, or compilation, delaying deployment.
* - Unsupported ops
  - Hardware compilers enforce strict operator constraints that surface as late-stage deployment issues.
  - Toolchain knowledge gap: hardware constraints and fusions are not transparent during model design. Specialized expertise and debugging are needed to understand hardware behavior.
  - Manual graph rewrites become necessary.
* - Graph rewrites
  - Black-box compilers can rewrite graphs unpredictably, making root-cause analysis difficult.
  - Hidden toolchain transformations are applied to make models compile on target hardware.
  - Slow debugging loops and reduced engineering productivity.
* - Quantization issues
  - Poor quantization operator placement breaks fusion opportunities and can hurt performance. Quantized models can even become slower than the baseline models.
  - Quantization configuration does not match hardware behavior, especially in mixed-precision workflows.
  - Inability to reach target metrics such as accuracy, latency, throughput, and memory consumption.
```

## The Embedl Deploy Solution

```{list-table}
:header-rows: 1

* -
  - Solution
  - Why it matters
  - Benefit
* - PyTorch-native workflow
  - Transform, optimize, and quantize in PyTorch.
  - Keeps iteration in one environment.
  - Faster development and clearer debugging.
* - Hardware-aware transforms
  - Enforces compiler constraints, fusions, and possible graph rewrites in PyTorch.
  - Removes opaque toolchain rewrites and the need for specialized knowledge about target compiler behavior. Enforces deployable graph behavior, so what you validate is what runs.
  - Predictable and transparent behavior rather than late-stage deployment issues.
* - Quantization
  - Hardware-aware quantization optimization with pattern-based QDQ placements in PyTorch.
  - Reliable quantization that does not break operator fusions and matches hardware behavior, reducing the need for specialized knowledge and manual debugging.
  - Higher accuracy, predictable quantization performance, and lower deployment risk.
* - Production readiness
  - Surfaces hardware behavior early and reduces late rework.
  - Teams resolve compatibility issues sooner.
  - Faster time-to-market and better first-pass success.
```

Embedl Deploy packages hardware expertise into reusable patterns, making any PyTorch model deployable with a predictable, debuggable pipeline. Quantization is hardware-aware, and QDQ placements are validated in PyTorch before compilation, preserving accuracy while hitting latency and memory targets.

The public release supports NVIDIA TensorRT. Enterprise support for additional hardware is available at [embedl.com](https://www.embedl.com/).
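The hardware-aware transforms above revolve around fusions such as Conv + BatchNorm. As a rough illustration of the underlying idea (this is plain PyTorch, not the Embedl Deploy API; all names are illustrative), the following sketch folds an eval-mode `BatchNorm2d` into a preceding `Conv2d` so the two layers become a single operator:

```python
import torch
from torch import nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into a preceding Conv2d (illustrative)."""
    w = conv.weight.detach().clone()
    b = (conv.bias.detach().clone() if conv.bias is not None
         else torch.zeros(conv.out_channels))
    # In eval mode, BN computes y = (x - mean) / sqrt(var + eps) * gamma + beta,
    # a per-channel affine map that can be absorbed into the conv parameters.
    scale = bn.weight.detach() / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight = nn.Parameter(w * scale.reshape(-1, 1, 1, 1))
    fused.bias = nn.Parameter((b - bn.running_mean) * scale + bn.bias.detach())
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
# Give BN non-trivial running statistics so the comparison is meaningful.
bn.running_mean.uniform_(-0.5, 0.5)
bn.running_var.uniform_(0.5, 1.5)
conv.eval(); bn.eval()

x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    ref = bn(conv(x))          # original two-layer subgraph
    out = fold_bn_into_conv(conv, bn)(x)  # fused single conv
print(torch.allclose(ref, out, atol=1e-5))
```

A real hardware-aware transform additionally has to match such subgraphs inside an arbitrary model graph and verify that the replacement is one the target compiler actually fuses, which is the part Embedl Deploy's patterns encapsulate.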
## Core concepts

- **Modules**: PyTorch ``nn.Module`` objects (e.g., ``ConvBatchNormRelu``) that represent hardware-friendly subgraph replacements.
- **Patterns**: Hardware-aware graph patterns covering fusions, conversions, and quantization placements. A pattern finds a subgraph and replaces it with a hardware-aware equivalent. Patterns are determined by extensive exploration of the hardware toolchain's restrictions and verified against the target compiler.
- **Transform**: Applies patterns to your model in one pass, returning a deployable ``nn.Module`` that behaves identically to the original.

The images below show the layer mapping before and after TensorRT compilation:

```{raw} html