PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

arXiv cs.CV / 3/13/2026


Key Points

PicoSAM3 is a lightweight, promptable segmentation model designed for edge and in-sensor execution, with 1.3 million parameters and deployment on the Sony IMX500 vision sensor.
It fuses a dense CNN backbone with region-of-interest prompt encoding, Efficient Channel Attention, and distillation from SAM2 and SAM3 to boost performance at low complexity.
On COCO and LVIS benchmarks, PicoSAM3 achieves 65.45% and 64.01% mIoU respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity.
The INT8-quantized version preserves accuracy with negligible degradation and enables real-time in-sensor inference at 11.82 ms latency on the IMX500 while fully complying with the sensor's memory and operator constraints.
Ablation studies show that distillation from large SAM models can yield up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
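The accuracy figures above are reported as mIoU (mean intersection over union). As a minimal sketch of that metric for binary masks, assuming a simple per-mask average: this summary does not say how the paper averages (per mask, per image, or per class), and the names `iou` and `mean_iou` are illustrative, not taken from the PicoSAM3 code.

```python
from typing import Sequence

# A 2-D binary mask: 1 marks an object pixel, 0 marks background.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average IoU over paired predicted and ground-truth masks."""
    if not preds or len(preds) != len(targets):
        raise ValueError("need equally many predictions and targets (at least one)")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred_a, target_a = [[1, 1], [0, 0]], [[1, 0], [0, 0]]  # IoU = 1/2
    pred_b, target_b = [[0, 1], [0, 1]], [[0, 1], [0, 1]]  # IoU = 1
    print(mean_iou([pred_a, pred_b], [target_a, target_b]))  # prints 0.75
```

Under this definition, a score of 65.45% means that predicted and ground-truth masks overlap on roughly two thirds of their combined area, averaged over the evaluation set.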

Abstract

Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
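To illustrate why INT8 quantization can preserve accuracy, here is a minimal sketch of symmetric per-tensor quantization. This is a generic textbook scheme and an assumption on our part: the abstract does not describe the quantization scheme used for PicoSAM3 or the IMX500 toolchain, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]  # toy values, not real model weights
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, scale, worst)
```

With round-to-nearest, the per-value reconstruction error is bounded by half the scale, which is one reason a well-conditioned network can lose little accuracy after quantization.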

