Empirical Recipes for Efficient and Compact Vision-Language Models
arXiv cs.CV / 3/19/2026
Key Points
- The paper presents an end-to-end efficiency analysis of compact vision-language models, identifying the dominant contributors to inference latency.
- It develops optimization recipes that cut time to first token (TTFT; see the sketch after this list) by 53% on InternVL3-2B and 93% on SmolVLM-256M while preserving accuracy, and that transfer across architectures and serving frameworks.
- It introduces ArgusVLM, a new model family with structured perception outputs that remains compact and efficient while achieving strong performance.
- The work provides practical guidance for building efficient VLM systems and demonstrates the broad applicability of the recipes across diverse benchmarks.
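TTFT is the delay between submitting a request and receiving the first generated token; for a VLM it is dominated by the vision-encoder forward pass and the language-model prefill, which is what the paper's recipes target. Below is a minimal sketch of how one might measure TTFT with Hugging Face transformers' streaming API, not the paper's own harness; the checkpoint ID and image path are placeholders for illustration.

```python
import time
from threading import Thread

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor, TextIteratorStreamer

# Placeholder checkpoint and image; the paper benchmarks InternVL3-2B and
# SmolVLM-256M under its own serving setup.
MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # any test image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

# Generate in a background thread and stream tokens, timestamping the first
# decoded chunk (a close proxy for the first token). The start-to-first-chunk
# gap covers vision encoding plus the language-model prefill; image
# preprocessing above is excluded.
streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True)
thread = Thread(
    target=model.generate,
    kwargs={**inputs, "streamer": streamer, "max_new_tokens": 64},
)
start = time.perf_counter()
thread.start()
first_chunk = next(iter(streamer))  # blocks until the first chunk arrives
ttft = time.perf_counter() - start
print(f"TTFT: {ttft * 1000:.1f} ms (first chunk: {first_chunk!r})")
thread.join()
```

Averaging over several runs after a warm-up pass gives a more stable estimate, since the first forward pass also pays one-time CUDA kernel compilation and allocation costs.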