Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems
arXiv cs.CV / 4/17/2026
Key Points
- Paza is a model-agnostic, zero-shot retail theft detection framework that detects concealment behaviors without training on proprietary datasets.
- It uses a layered pipeline where low-cost object detection and pose estimation run continuously, and an expensive vision-language model (VLM) is called only when a multi-signal behavioral pre-filter is triggered.
- The suspicion pre-filter (dwell time plus at least one behavioral signal) cuts VLM invocations by 240x, capping calls at roughly 10 per minute and allowing a single GPU to cover 10–20 stores.
- The system can swap VLM backends via OpenAI-compatible endpoints (e.g., Gemma 4, Qwen3.5-Omni, GPT-4o) without code changes, helping operators adapt as model offerings evolve.
- On the DCSASS synthesized shoplifting dataset, the VLM component achieves 89.5% precision and 92.8% specificity at 59.3% recall in a zero-shot setting, with cost estimated at $50–100 per store per month and a privacy-preserving face obfuscation design.
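The gating and backend-swapping ideas above can be sketched in a few lines. This is an illustrative assumption of how such a layered pipeline might look, not Paza's actual code: the names (`TrackState`, `should_invoke_vlm`, `build_vlm_request`), the dwell threshold, and the prompt are all hypothetical; only the pre-filter rule (dwell time plus at least one behavioral signal) and the OpenAI-compatible payload shape come from the article.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the layered gating described above: cheap
# detectors run continuously, and the expensive VLM is invoked only
# when dwell time AND at least one behavioral signal fire.

DWELL_THRESHOLD_S = 30.0  # assumed dwell-time threshold (not from the paper)

@dataclass
class TrackState:
    """Per-person state accumulated by the low-cost detection/pose layer."""
    dwell_time_s: float
    behavioral_signals: List[str] = field(default_factory=list)  # e.g. ["hand_to_bag"]

def should_invoke_vlm(track: TrackState) -> bool:
    """Multi-signal pre-filter: dwell time plus >= 1 behavioral signal."""
    return (track.dwell_time_s >= DWELL_THRESHOLD_S
            and len(track.behavioral_signals) > 0)

def build_vlm_request(frame_b64: str, model: str = "gpt-4o") -> dict:
    """Build an OpenAI-compatible chat payload. Swapping `model` (or the
    endpoint base URL) changes the VLM backend with no other code changes."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this person concealing merchandise? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
    }

if __name__ == "__main__":
    quiet = TrackState(dwell_time_s=5.0)
    suspicious = TrackState(dwell_time_s=45.0, behavioral_signals=["hand_to_bag"])
    print(should_invoke_vlm(quiet), should_invoke_vlm(suspicious))  # False True
```

Because every popular serving stack (vLLM, Ollama, hosted APIs) exposes this same chat-completions schema, the payload built here could be POSTed to any of them unchanged, which is the backend-agnosticism the article highlights.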