Empirical Recipes for Efficient and Compact Vision-Language Models
arXiv cs.CV / 3/19/2026
Key Points
- The paper conducts an end-to-end efficiency analysis of compact vision-language models (VLMs) to identify the dominant bottlenecks in inference latency.
- It develops optimization recipes that cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M while preserving accuracy, and that transfer across architectures and serving frameworks (a sketch of how TTFT can be measured follows this list).
- It introduces ArgusVLM, a new model family that produces structured perception outputs while remaining compact and efficient and achieving strong performance.
- The work distills these findings into practical guidance for building efficient VLM systems and validates the recipes across diverse benchmarks.
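
As a rough illustration of the headline metric, here is a minimal Python sketch of how TTFT can be measured with a streaming generate call. This assumes a Hugging Face transformers-style API; the model name and prompt are placeholders chosen for the example, not the paper's actual benchmark harness.

```python
import time
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder model -- any small causal LM works for this illustration.
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Describe the scene in one sentence.", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# Run generation in a background thread so we can time the first streamed token.
start = time.perf_counter()
worker = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=32))
worker.start()
first_chunk = next(iter(streamer))  # blocks until the first decoded token arrives
ttft = time.perf_counter() - start
worker.join()

print(f"TTFT: {ttft * 1000:.1f} ms (first chunk: {first_chunk!r})")
```

TTFT captures the prefill cost (image encoding plus prompt processing for a VLM), which is why recipes that shrink or restructure the visual token stream can reduce it so sharply on small models.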