.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorials/sam3.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorials_sam3.py:


Deploying SAM3 as an INT8 TensorRT engine
=========================================

This tutorial shows how to deploy the detector branch of ``facebook/sam3``
end-to-end with Embedl Deploy: from the HuggingFace checkpoint to a quantized
(mixed-precision) TensorRT engine running at about 12 QPS at 924×924 input on
an NVIDIA L4 GPU.

Install dependencies:

.. code-block:: bash

    pip install torch transformers onnx opencv-python matplotlib onnxscript
    pip install --upgrade tensorrt-cu12
    pip install "embedl-deploy[tensorrt]"

Pipeline:

1. Export the HF SAM3 detector at 924×924 in ``fp32`` with ``torch.export``
   as a ``pt2`` file (ATen graph).
2. Apply layer fusions + post-training quantization (PTQ) with
   mixed-precision using ``embedl_deploy``, calibrating on real frames from a
   demo video. Calibrate on a larger dataset, e.g., `COCO `_ or `LVIS `_, for
   better accuracy.

The output is a PyTorch model with fake quantization operators, i.e.,
Quantize/Dequantize (QDQ) nodes, that can be exported to ONNX for compilation
with TensorRT. Once the quantized model is exported, a TensorRT engine can be
built with mixed precision to run inference.

.. note::

    The full pipeline through the TensorRT engine build, sanity check,
    benchmark and video demo requires an NVIDIA GPU with TensorRT 10.x
    (10.16 pip wheel recommended). Steps 1–3 (``fp32`` export, INT8 PTQ,
    ONNX export) and the QDQ-vs-``fp32`` sanity check also run on CPU, but
    slowly.

.. GENERATED FROM PYTHON SOURCE LINES 43-45

.. code-block:: Python
    :dedent: 1


.. GENERATED FROM PYTHON SOURCE LINES 49-51

Constants
---------

.. GENERATED FROM PYTHON SOURCE LINES 51-96

.. code-block:: Python

    import sys
    import time
    from pathlib import Path

    import cv2  # type: ignore[import-not-found]
    import numpy as np  # type: ignore[import-not-found]
    import torch
    from torch import nn

    sys.setrecursionlimit(20000)

    ARTIFACTS_PATH = Path("artifacts")
    ARTIFACTS_PATH.mkdir(parents=True, exist_ok=True)

    MODEL_ID = "facebook/sam3"
    IMAGE_SIZE = 924  # multiple of patch size (14); triggers fusion patterns
    PATCH_SIZE = 14
    CONTEXT_LENGTH = 32
    PROMPT = "person"
    N_CALIB = 16  # frames used for PTQ calibration
    CONFIDENCE = 0.25
    BUILDER_OPTIMIZATION_LEVEL = 3

    VIDEO_URL = (
        "https://huggingface.co/datasets/hf-internal-testing/"
        "sam2-fixtures/resolve/main/bedroom.mp4"
    )

    # Normalisation — must match the SAM3 training-time preprocessing.
    MEAN = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
    STD = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
    MEAN_NP = np.array([0.5, 0.5, 0.5], dtype=np.float32)
    STD_NP = np.array([0.5, 0.5, 0.5], dtype=np.float32)

    FP32_PT2 = ARTIFACTS_PATH / f"sam3_resized_{IMAGE_SIZE}.pt2"
    QDQ_PT2 = ARTIFACTS_PATH / f"sam3_resized_{IMAGE_SIZE}_int8_qdq.pt2"
    QDQ_ONNX = ARTIFACTS_PATH / f"sam3_resized_{IMAGE_SIZE}_int8_qdq.onnx"
    QDQ_ONNX_FIXED = (
        ARTIFACTS_PATH / f"sam3_resized_{IMAGE_SIZE}_int8_qdq_fixed.onnx"
    )
    ENGINE = ARTIFACTS_PATH / "sam3.engine"
    TIMING_CACHE = ARTIFACTS_PATH / "trt_timing.cache"

    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


.. GENERATED FROM PYTHON SOURCE LINES 97-110

Step 1 — fp32 ``torch.export`` of the SAM3 detector
---------------------------------------------------

We override the image-size fields in the model config so that the encoder runs
at 924×924, a multiple of the 14-px patch size, which is what lets the
intended fusion patterns match. The detector model is wrapped in a thin
``nn.Module`` exposing a clean ``(image, tokens) → (masks, logits, boxes)``
signature, because the export needs plain tensor inputs and outputs.

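
The size constraint above can be checked with a few lines of arithmetic. The
sketch below is illustrative and not part of the original script: the helper
name ``feature_grid`` is hypothetical, and the 4×/2×/1× multipliers are
assumed to correspond to the three backbone feature levels set in the export
code of this step.

.. code-block:: Python

    IMAGE_SIZE = 924
    PATCH_SIZE = 14


    def feature_grid(image_size: int, patch_size: int) -> list[list[int]]:
        """Return assumed per-level feature-map sizes for a square input."""
        if image_size % patch_size != 0:
            raise ValueError("image size must be a multiple of the patch size")
        grid = image_size // patch_size  # number of patches per side
        return [[grid * 4, grid * 4], [grid * 2, grid * 2], [grid, grid]]


    sizes = feature_grid(IMAGE_SIZE, PATCH_SIZE)
    print(sizes)  # [[264, 264], [132, 132], [66, 66]]

With 924 / 14 = 66 patches per side, a size that is not a multiple of 14
(for example 1000) would be rejected by this check.
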
Everything stays ``fp32`` at this stage: the ViT backbone has internal
``fp32`` regions (attention softmax) that emit
``aten._to_copy(dtype=float32)`` nodes when surrounded by ``fp16``, and those
nodes then conflict with fused conv/BN weights. The precision is lowered
later, when the TensorRT engine is built.

.. GENERATED FROM PYTHON SOURCE LINES 110-178

.. code-block:: Python

    from transformers import (  # type: ignore[import-not-found]
        AutoConfig,
        AutoTokenizer,
        Sam3VideoModel,
    )
    from transformers.video_utils import (  # type: ignore[import-not-found]
        load_video,
    )


    def patch_image_size(cfg: AutoConfig, image_size: int) -> AutoConfig:
        """Patch every image-size field of the config to ``image_size``."""
        grid = image_size // PATCH_SIZE
        feat_sizes = [[grid * 4, grid * 4], [grid * 2, grid * 2], [grid, grid]]
        cfg.image_size = image_size
        cfg.low_res_mask_size = grid * 4
        for sub in (cfg.detector_config, cfg.tracker_config):
            sub.image_size = image_size
            sub.vision_config.backbone_feature_sizes = feat_sizes
            sub.vision_config.backbone_config.image_size = image_size
        cfg.tracker_config.memory_attention_rope_feat_sizes = [grid, grid]
        return cfg


    class Sam3DetectorWrapper(nn.Module):
        """Simple wrapper for tensor inputs and outputs."""

        def __init__(self, detector: nn.Module) -> None:
            """Store the SAM3 detector sub-module."""
            super().__init__()
            self.detector = detector

        def forward(
            self, image: torch.Tensor, tokenized_text: torch.Tensor
        ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
            """Run the detector and return masks, logits, and boxes."""
            out = self.detector(pixel_values=image, input_ids=tokenized_text)
            return out.pred_masks, out.pred_logits, out.pred_boxes


    def export_sam3_fp32_pt2(path: Path) -> None:
        """Export the SAM3 detector to a fp32 pt2 file via torch.export."""
        cfg = patch_image_size(AutoConfig.from_pretrained(MODEL_ID), IMAGE_SIZE)
        model = (
            Sam3VideoModel.from_pretrained(
                MODEL_ID, config=cfg, torch_dtype=torch.float32
            )
            .eval()
            .to(DEVICE)
        )
        wrapped = Sam3DetectorWrapper(model.detector_model)

        img = torch.randn(1, 3, IMAGE_SIZE, IMAGE_SIZE, device=DEVICE)
        ids = torch.randint(
            0, 32000, (1, CONTEXT_LENGTH), dtype=torch.long, device=DEVICE
        )
        with torch.no_grad():
            exported = torch.export.export(wrapped, (img, ids), strict=False)
        torch.export.save(exported, str(path))
        print(f" {path.stat().st_size / 1e9:.2f} GB")

        # Cleanup GPU memory
        del model, wrapped, img, ids, exported
        torch.cuda.empty_cache()


    export_sam3_fp32_pt2(path=FP32_PT2)


.. GENERATED FROM PYTHON SOURCE LINES 179-192

Step 2 — Fuse and INT8-quantize with Embedl Deploy
--------------------------------------------------

This is the core of the recipe: Embedl Deploy fuses and quantizes the model.
On top of standard post-training static quantization we apply two
customizations to balance performance and accuracy:

**Mixed-precision.** We quantize the encoder to INT8 but exclude one node from
quantization to preserve accuracy: the first 3-channel convolution with a
≥7×7 kernel (the patch-embed stem).

**SmoothQuant.** We skip SmoothQuant for all ``nn.LayerNorm`` modules in the
model.

.. GENERATED FROM PYTHON SOURCE LINES 192-322

.. code-block:: Python

    from embedl_deploy import transform
    from embedl_deploy._internal.core.quantize.config import (
        CalibrationMethod,
        ModulesToSkip,
        QuantConfig,
        TensorQuantConfig,
    )
    from embedl_deploy._internal.core.quantize.main import quantize
    from embedl_deploy.tensorrt import TENSORRT_PATTERNS


    def load_video_frames(n: int) -> list[torch.Tensor]:
        """Sample ``n`` evenly-spaced frames from the demo video."""
        frames, _ = load_video(VIDEO_URL)
        idxs = (
            np.linspace(0, len(frames) - 1, n, dtype=int).tolist()
            if n < len(frames)
            else list(range(len(frames)))
        )
        out: list[torch.Tensor] = []
        for i in idxs:
            resized = cv2.resize(
                frames[i], (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_LINEAR
            )
            # HWC uint8 -> CHW float in [0, 1], then normalise.
            t = torch.from_numpy(resized).permute(2, 0, 1).float() / 255.0
            out.append((t - MEAN) / STD)
        return out


    def _find_patch_embed_conv(model: nn.Module) -> nn.Conv2d | None:
        # The 3-channel Conv with a ≥7×7 kernel is the patch-embed stem.
        for m in model.modules():
            if (
                isinstance(m, nn.Conv2d)
                and m.in_channels == 3
                and max(m.kernel_size) >= 7
            ):
                return m
        return None


    def quantize_to_qdq(gm: torch.fx.GraphModule) -> None:
        """Fuse and INT8-quantize the fp32 graph with Embedl Deploy."""
        fused = (
            transform(gm, TENSORRT_PATTERNS)
            .model.eval()
            .to(device=DEVICE, dtype=torch.float32)
        )

        # Keep the patch-embed stem out of quantization (mixed precision).
        stub_w_skip: set[nn.Module] = set()
        if (patch := _find_patch_embed_conv(fused)) is not None:
            stub_w_skip.add(patch)

        quant_cfg = QuantConfig(
            activation=TensorQuantConfig(
                n_bits=8,
                symmetric=True,
                per_channel=False,
                calibration_method=CalibrationMethod.MINMAX,
            ),
            weight=TensorQuantConfig(
                n_bits=8,
                symmetric=True,
                per_channel=True,
                calibration_method=CalibrationMethod.MINMAX,
            ),
            skip=ModulesToSkip(
                stub=stub_w_skip,  # type: ignore[arg-type]
                weight=stub_w_skip,  # type: ignore[arg-type]
                smooth={nn.LayerNorm},
            ),
        )

        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        input_ids = tokenizer(
            PROMPT,
            padding="max_length",
            max_length=CONTEXT_LENGTH,
            truncation=True,
            return_tensors="pt",
        )["input_ids"].to(device=DEVICE)

        print(f"Loading {N_CALIB} calibration frames from the demo video")
        calib_imgs = load_video_frames(N_CALIB)

        def calib_fn(model_fn: nn.Module) -> None:
            # pylint: disable=cell-var-from-loop
            """Run the model on the calibration data to collect quant stats."""
            model_fn.eval()
            for img in calib_imgs:  # noqa: F821
                with torch.no_grad():
                    model_fn(
                        img.unsqueeze(0).to(device=DEVICE, dtype=torch.float32),
                        input_ids,  # noqa: F821
                    )

        dummy_img = torch.randn(1, 3, IMAGE_SIZE, IMAGE_SIZE, device=DEVICE)
        qmodel = quantize(
            fused,
            args=(dummy_img, input_ids),
            config=quant_cfg,
            forward_loop=calib_fn,
            freeze_weights=True,
        )

        # Cleanup calibration frames after quantization
        # (input_ids still needed for export)
        del calib_imgs
        torch.cuda.empty_cache()

        print(f"Re-exporting QDQ → {QDQ_PT2.name}")
        qmodel.eval()
        with torch.no_grad():
            exported = torch.export.export(
                qmodel, (dummy_img, input_ids), strict=False
            )
        torch.export.save(exported, str(QDQ_PT2))
        print(f" {QDQ_PT2.stat().st_size / 1e9:.2f} GB")

        # Cleanup GPU memory
        del fused, qmodel, exported, dummy_img, input_ids
        torch.cuda.empty_cache()


    graph_module: torch.fx.GraphModule = torch.export.load(FP32_PT2).module()
    quantize_to_qdq(graph_module)

    # Cleanup GPU memory after quantization
    del graph_module
    torch.cuda.empty_cache()


.. GENERATED FROM PYTHON SOURCE LINES 323-331

Export quantized model to ONNX
------------------------------

We use the dynamo path of ``torch.onnx.export`` (the classic ``torchscript``
path calls ``model.train(False)``, which ``torch.export``-loaded modules do
not support). ``QuantizeLinear``/``DequantizeLinear`` pairs carry the
calibrated scales in the resulting ONNX model.

.. GENERATED FROM PYTHON SOURCE LINES 331-359

.. code-block:: Python

    def export_to_onnx(qdq_model_path: Path, onnx_model_path: Path) -> None:
        """Export the QDQ pt2 graph to ONNX with opset 18 and external data."""
        model: torch.fx.GraphModule = torch.export.load(qdq_model_path).module()

        img = torch.randn(1, 3, IMAGE_SIZE, IMAGE_SIZE, device=DEVICE)
        ids = torch.randint(
            0, 32000, (1, CONTEXT_LENGTH), dtype=torch.long, device=DEVICE
        )

        print(f"Exporting to {onnx_model_path.name}")
        with torch.no_grad():
            torch.onnx.export(
                model,
                (img, ids),
                str(onnx_model_path),
                input_names=["image", "tokenized_text"],
                output_names=["pred_masks", "pred_logits", "pred_boxes"],
                do_constant_folding=True,
                dynamo=True,
            )

        # Cleanup GPU memory
        del model, img, ids
        torch.cuda.empty_cache()


    export_to_onnx(qdq_model_path=QDQ_PT2, onnx_model_path=QDQ_ONNX)


.. GENERATED FROM PYTHON SOURCE LINES 360-368

Build the TensorRT engine
-------------------------

* INT8 + FP16 hybrid precision: TensorRT picks the precision per layer.
* Shared timing cache file: kernel-timing results persist across runs of this
  script and subsequent variants of the same ONNX.

This cell and every cell below it require TensorRT.

.. GENERATED FROM PYTHON SOURCE LINES 368-419

.. code-block:: Python

    import tensorrt as trt


    def build_engine(onnx_path: Path, engine_path: Path) -> None:
        """Build a TensorRT INT8+FP16 engine from the QDQ ONNX graph."""
        logger = trt.Logger(trt.Logger.WARNING)
        trt.init_libnvinfer_plugins(logger, "")
        builder = trt.Builder(logger)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )
        parser = trt.OnnxParser(network, logger)

        print(f"Parsing {onnx_path.name}")
        if not parser.parse(onnx_path.read_bytes(), path=str(onnx_path)):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 * 1024**3)
        config.builder_optimization_level = BUILDER_OPTIMIZATION_LEVEL
        config.set_flag(trt.BuilderFlag.FP16)
        config.set_flag(trt.BuilderFlag.INT8)

        # Reuse kernel timings from a previous build when a cache file exists.
        cache = config.create_timing_cache(
            TIMING_CACHE.read_bytes() if TIMING_CACHE.exists() else b""
        )
        config.set_timing_cache(cache, ignore_mismatch=False)

        print(
            f"Building INT8+FP16 engine (opt-level {BUILDER_OPTIMIZATION_LEVEL}). "
            "It can take 15-30 min for the first build without a timing cache."
        )
        t0 = time.perf_counter()
        plan = builder.build_serialized_network(network, config)
        if plan is None:
            raise RuntimeError("build_serialized_network returned None")
        dt = time.perf_counter() - t0

        TIMING_CACHE.write_bytes(bytes(config.get_timing_cache().serialize()))
        engine_path.write_bytes(bytes(plan))
        print(
            f" built in {dt:.0f} s → {engine_path.name} "
            f"({len(bytes(plan)) / 1e6:.0f} MB)"
        )


    build_engine(onnx_path=QDQ_ONNX, engine_path=ENGINE)

    # Cleanup GPU memory after engine build (TensorRT may hold resources)
    torch.cuda.empty_cache()


.. GENERATED FROM PYTHON SOURCE LINES 420-426

Run the engine on a video
-------------------------

We wrap the TRT engine in a simple class that manages the I/O buffers as CUDA
tensors and exposes a clean ``infer()`` method.

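
Keeping the I/O buffers as preallocated tensors works because an in-place
``copy_`` rewrites a tensor's contents without moving its storage, so an
address registered once with an execution context stays valid across calls.
The CPU-only sketch below illustrates that property without TensorRT; the
buffer names, the tiny shapes, and the ``fill`` helper are assumptions made
for the example, not part of the original script.

.. code-block:: Python

    import numpy as np
    import torch

    # Preallocated I/O buffers (small CPU stand-ins for the CUDA tensors).
    bufs = {
        "image": torch.empty((1, 3, 4, 4), dtype=torch.float32),
        "tokenized_text": torch.empty((1, 8), dtype=torch.int64),
    }
    # Addresses as they would be registered once with the runtime.
    ptrs = {name: buf.data_ptr() for name, buf in bufs.items()}


    def fill(feeds: dict) -> None:
        """Copy host data into the existing buffers, casting dtypes as needed."""
        for name, x in feeds.items():
            bufs[name].copy_(torch.as_tensor(x))


    fill(
        {
            "image": np.ones((1, 3, 4, 4), dtype=np.float64),
            "tokenized_text": np.arange(8, dtype=np.int32).reshape(1, 8),
        }
    )

    same_address = all(bufs[n].data_ptr() == ptrs[n] for n in bufs)
    print(same_address, bufs["image"].dtype, bufs["tokenized_text"].dtype)

The feeds deliberately use ``float64`` and ``int32`` to show that ``copy_``
casts to the buffer's dtype, so callers do not have to match the engine's
tensor types exactly.
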
The demo video is the same one used for calibration; the output is written to
a new video file with the predicted masks overlaid.

.. GENERATED FROM PYTHON SOURCE LINES 426-594

.. code-block:: Python

    _TRT_TORCH = {
        trt.float32: torch.float32,
        trt.float16: torch.float16,
        trt.int32: torch.int32,
        trt.int64: torch.int64,
        trt.int8: torch.int8,
        trt.bool: torch.bool,
    }


    class TrtRunner:
        """TRT engine wrapper backed by torch.cuda tensors as I/O buffers."""

        def __init__(self, engine_path: Path) -> None:
            """Initialize with an engine."""
            logger = trt.Logger(trt.Logger.WARNING)
            trt.init_libnvinfer_plugins(logger, "")
            self.engine = trt.Runtime(logger).deserialize_cuda_engine(
                engine_path.read_bytes()
            )
            self.ctx = self.engine.create_execution_context()
            self.stream = torch.cuda.Stream()
            self.outputs: list[str] = []
            self.bufs: dict[str, torch.Tensor] = {}
            # Allocate one CUDA tensor per I/O tensor and register its address.
            for i in range(self.engine.num_io_tensors):
                name = self.engine.get_tensor_name(i)
                shape = tuple(self.engine.get_tensor_shape(name))
                dtype = _TRT_TORCH[self.engine.get_tensor_dtype(name)]
                self.bufs[name] = torch.empty(shape, dtype=dtype, device="cuda")
                self.ctx.set_tensor_address(name, self.bufs[name].data_ptr())
                if self.engine.get_tensor_mode(name) == trt.TensorIOMode.OUTPUT:
                    self.outputs.append(name)

        @property
        def input_size(self) -> int:
            """Get the input size."""
            return int(self.engine.get_tensor_shape("image")[2])

        def infer(
            self, feeds: dict[str, np.ndarray | torch.Tensor]
        ) -> dict[str, np.ndarray]:
            """Copy inputs into device buffers, execute, return outputs on host."""
            for name, x in feeds.items():
                self.bufs[name].copy_(torch.as_tensor(x))
            self.ctx.execute_async_v3(self.stream.cuda_stream)
            self.stream.synchronize()
            return {n: self.bufs[n].cpu().numpy() for n in self.outputs}


    def _sigmoid_np(x: np.ndarray) -> np.ndarray:
        return 1 / (1 + np.exp(-np.clip(x, -50, 50)))


    def _postprocess(
        pred_masks: np.ndarray, pred_logits: np.ndarray, h: int, w: int
    ) -> np.ndarray:
        """Return a stack of binary masks for detections above ``CONFIDENCE``."""
        logits = pred_logits[0, :, 0] if pred_logits.ndim == 3 else pred_logits[0]
        keep = _sigmoid_np(logits) > CONFIDENCE
        masks = pred_masks[0][keep]
        if len(masks) == 0:
            return np.zeros((0, h, w), dtype=bool)
        return np.stack(
            [
                _sigmoid_np(cv2.resize(m, (w, h), interpolation=cv2.INTER_LINEAR))
                > 0.5
                for m in masks
            ]
        )


    def _overlay(
        frame_rgb: np.ndarray, masks: np.ndarray, hud: str, colors: np.ndarray
    ) -> np.ndarray:
        canvas = frame_rgb.copy().astype(np.uint8)
        for i, m in enumerate(masks):
            sel = m.astype(bool)
            c = colors[i % len(colors)]
            canvas[sel] = (canvas[sel] * 0.5 + c * 0.5).astype(np.uint8)
        (tw, th), _ = cv2.getTextSize(hud, cv2.FONT_HERSHEY_SIMPLEX, 0.7, 2)
        cv2.rectangle(canvas, (6, 6), (14 + tw, 18 + th), (0, 0, 0), -1)
        cv2.putText(
            canvas,
            hud,
            (10, 10 + th),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.7,
            (255, 255, 255),
            2,
            cv2.LINE_AA,
        )
        return canvas


    def _process_frame(  # pylint: disable=too-many-positional-arguments
        runner: TrtRunner,
        frame: np.ndarray,
        height: int,
        width: int,
        colors: np.ndarray,
        writer: cv2.VideoWriter,
    ) -> tuple[float, np.ndarray]:
        """Process a single frame and write to video."""
        resized = cv2.resize(
            frame,
            (runner.input_size, runner.input_size),
            interpolation=cv2.INTER_LINEAR,
        )
        img = (resized.astype(np.float32) / 255.0 - MEAN_NP) / STD_NP
        img = np.ascontiguousarray(img.transpose(2, 0, 1)[np.newaxis])

        t0 = time.perf_counter()
        out = runner.infer({"image": img})
        fwd = (time.perf_counter() - t0) * 1000

        masks = _postprocess(out["pred_masks"], out["pred_logits"], height, width)
        hud = f"TensorRT INT8 | {fwd:.1f} ms | {1000 / fwd:.1f} FPS"
        canvas = _overlay(frame, masks, hud, colors)
        writer.write(cv2.cvtColor(canvas, cv2.COLOR_RGB2BGR))
        return fwd, canvas


    def run_engine(
        engine: Path, output_path: Path
    ) -> tuple[Path, list[np.ndarray]]:
        """Run engine on input video with overlay."""
        # pylint: disable=too-many-locals
        out_path = output_path
        frames, _ = load_video(VIDEO_URL)
        h, w = frames[0].shape[:2]

        runner = TrtRunner(engine)

        # The prompt never changes, so tokenize it once and leave it in the
        # "tokenized_text" input buffer (name set in the ONNX export above).
        # Without this the buffer would hold uninitialized memory.
        input_ids = AutoTokenizer.from_pretrained(MODEL_ID)(
            PROMPT,
            padding="max_length",
            max_length=CONTEXT_LENGTH,
            truncation=True,
            return_tensors="pt",
        )["input_ids"]
        runner.bufs["tokenized_text"].copy_(input_ids)
        runner.bufs["image"].zero_()

        for _ in range(3):  # warm up on a blank image
            runner.ctx.execute_async_v3(runner.stream.cuda_stream)
        runner.stream.synchronize()

        colors = np.random.default_rng(42).integers(64, 256, size=(32, 3))
        writer = cv2.VideoWriter(
            str(out_path), cv2.VideoWriter_fourcc(*"mp4v"), 30, (w, h)
        )
        fwd_ms: list[float] = []
        sample_overlays: list[np.ndarray] = []
        sample_idxs = {0, len(frames) // 3, 2 * len(frames) // 3}

        print(f"Inference on {len(frames)} frames")
        for i, frame in enumerate(frames):
            fwd, canvas = _process_frame(runner, frame, h, w, colors, writer)
            fwd_ms.append(fwd)
            if i in sample_idxs:
                sample_overlays.append(canvas)
        writer.release()

        a = np.array(fwd_ms)
        print(
            f" {len(frames)} frames | mean {a.mean():.1f} ms | "
            f"p95 {np.percentile(a, 95):.1f} ms | {1000 / a.mean():.1f} FPS"
        )
        print(f" Saved: {out_path}")

        # Cleanup GPU memory and TensorRT resources
        del runner, frames, colors, fwd_ms, a
        torch.cuda.empty_cache()

        return out_path, sample_overlays


    video_path, samples = run_engine(ENGINE, ARTIFACTS_PATH / "output_trt.mp4")


.. GENERATED FROM PYTHON SOURCE LINES 595-606

Summary
-------

::

    Pipeline: HuggingFace (fp32 .pt2) ─▶ Embedl Deploy (QDQ .pt2) ─▶ ONNX ─▶ TensorRT engine

    On an NVIDIA L4 with TensorRT 10.16:
      Throughput: ~12 QPS / ~85 ms latency at 924×924.


.. _sphx_glr_download_auto_tutorials_sam3.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: sam3.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: sam3.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: sam3.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_