AI Navigate

Empirical Recipes for Efficient and Compact Vision-Language Models

arXiv cs.CV · March 19, 2026


Key Points

  • The paper conducts an end-to-end efficiency analysis of compact vision-language models to identify the dominant inference latency bottlenecks.
  • It develops optimization recipes that cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M while preserving accuracy, and that transfer across VLM architectures and common serving frameworks.
  • It introduces ArgusVLM, a new model family with structured perception outputs that remains compact and efficient while achieving strong performance.
  • The work provides practical guidance for building efficient VLM systems, validated across diverse benchmarks.

Abstract

Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
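The headline metric throughout is time to first token. As a point of reference, here is a minimal sketch of how TTFT is typically measured against a streaming generation API: the clock starts when the request is issued and stops when the first output token arrives. The `fake_stream` generator below is a hypothetical stand-in, not the paper's serving stack or any specific framework's API.

```python
import time

def fake_stream(prompt, n_tokens=5):
    """Hypothetical stand-in for a streaming VLM generate call.

    A real measurement would replace this with the serving framework's
    token stream (the paper's models and frameworks are not used here).
    """
    for i in range(n_tokens):
        yield f"tok{i}"

def measure_ttft(stream):
    """Time to first token: wall-clock seconds from request start
    until the first output token is received."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the first token arrives
    return first, time.perf_counter() - start

first_token, ttft = measure_ttft(fake_stream("describe the image"))
print(f"first token: {first_token}, TTFT: {ttft:.6f}s")
```

For a prefill-heavy workload like vision-language inference, TTFT is dominated by image encoding and prompt prefill rather than decoding, which is why the paper's recipes target that phase.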