Empirical Recipes for Efficient and Compact Vision-Language Models
arXiv cs.CV / 3/19/2026
Key Points
- The paper conducts an end-to-end efficiency analysis of compact vision-language models (VLMs) to identify the dominant bottlenecks in inference latency.
- It develops optimization recipes that cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M while preserving accuracy, and that transfer across architectures and serving frameworks (a sketch of how TTFT can be measured follows this list).
- It introduces ArgusVLM, a new model family that produces structured perception outputs while remaining compact and efficient and achieving strong performance.
- The work distills these findings into practical guidance for building efficient VLM systems and validates the recipes across diverse benchmarks.
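
As a rough illustration of the headline metric, here is a minimal Python sketch of how TTFT can be measured with a streaming generate call. This assumes a Hugging Face transformers-style API; the model name and prompt are placeholders chosen for the example, not the paper's actual benchmark harness.

```python
import time
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder model -- any small causal LM works for this illustration.
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Describe the scene in one sentence.", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# Run generation in a background thread so we can time the first streamed token.
start = time.perf_counter()
worker = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=32))
worker.start()
first_chunk = next(iter(streamer))  # blocks until the first decoded token arrives
ttft = time.perf_counter() - start
worker.join()

print(f"TTFT: {ttft * 1000:.1f} ms (first chunk: {first_chunk!r})")
```

TTFT captures the prefill cost (image encoding plus prompt processing for a VLM), which is why recipes that shrink or restructure the visual token stream can reduce it so sharply on small models.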