PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation
arXiv cs.CV / 3/13/2026
Key Points
- PicoSAM3 is a lightweight, promptable segmentation model designed for edge and in-sensor execution. It has 1.3 million parameters and is deployed on the Sony IMX500 vision sensor.
- It fuses a dense CNN backbone with region-of-interest prompt encoding, Efficient Channel Attention, and distillation from SAM2 and SAM3 to boost performance at low complexity.
- On COCO and LVIS benchmarks, PicoSAM3 achieves 65.45% and 64.01% mIoU respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity.
- The INT8-quantized version shows negligible accuracy degradation and runs real-time in-sensor inference at 11.82 ms latency within the IMX500's hardware constraints.
- Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU over supervised training, demonstrating that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
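
The accuracy figures above are reported as mIoU (mean Intersection over Union). As a rough illustration of the metric, here is a minimal Python sketch that scores binary masks given as nested lists of 0/1 values. The function names `mask_iou` and `mean_iou` are hypothetical, and treating two empty masks as a perfect match is an assumption made here; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (illustrative representation only).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on, t_on = bool(p), bool(t)
            inter += p_on and t_on   # pixel set in both masks
            union += p_on or t_on    # pixel set in at least one mask
    if union == 0:
        # Assumption: two empty masks are scored as a perfect match.
        return 1.0
    return inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-mask IoU over a set of (prediction, ground truth) pairs."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred_a = [[1, 1], [0, 0]]
    gt_a = [[1, 0], [0, 0]]      # overlap 1 pixel, union 2 pixels
    pred_b = [[0, 1], [0, 1]]
    gt_b = [[0, 1], [0, 1]]      # identical masks
    print(mean_iou([pred_a, pred_b], [gt_a, gt_b]))
```

In this toy case the first pair scores 0.5 and the second 1.0, so the mean is 0.75. A reported mIoU of 65.45% corresponds to the same average taken over all evaluated masks.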