AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

arXiv cs.CV / April 20, 2026


Key Points

  • AdaVFM is proposed as a framework to run language-aligned vision foundation models efficiently on edge devices despite latency and power limits.
  • The approach dynamically adjusts computation at runtime based on scene context and task complexity, motivated by the finding that model-size reduction affects tasks differently.
  • AdaVFM integrates neural architecture search (NAS) into the VFM backbone so the system can execute lightweight subnetworks during inference.
  • A cloud-based multimodal LLM controls the runtime execution through a context-aware agent, enabling coordinated adaptation between edge inference and cloud guidance.
  • Experiments on zero-shot classification and open-vocabulary segmentation show improved accuracy-efficiency trade-offs, with gains up to +7.9% acc@1 on IN1K and +5.2% mIoU on ADE20K, and up to 77.9% lower average FLOPs for similar accuracy.
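The core idea in the second and third bullets, that model-size reduction hurts some tasks more than others, so the subnet should be chosen at runtime, can be illustrated with a minimal sketch. Everything here is a hypothetical illustration: the class names, depth/width choices, thresholds, and the FLOPs model are assumptions, not details from the paper.

```python
# Hypothetical sketch of AdaVFM-style runtime subnet selection.
# All names, thresholds, and configurations are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetConfig:
    depth: int          # number of transformer blocks to execute
    width_ratio: float  # fraction of channels kept in each block


def relative_flops(cfg: SubnetConfig, full_depth: int = 12) -> float:
    """Cost of a subnet relative to the full backbone, under a rough
    model: linear in depth, quadratic in width for dense layers."""
    return (cfg.depth / full_depth) * cfg.width_ratio ** 2


def select_subnet(task: str, scene_complexity: float) -> SubnetConfig:
    """Pick a subnet per frame. The assumption (mirroring the paper's
    task-dependence finding) is that dense prediction such as
    open-vocabulary segmentation is more sensitive to capacity loss
    than zero-shot classification, so it keeps more compute."""
    if task == "segmentation":
        if scene_complexity > 0.5:
            return SubnetConfig(depth=12, width_ratio=1.0)
        return SubnetConfig(depth=9, width_ratio=0.75)
    # zero-shot classification tolerates smaller subnets
    if scene_complexity > 0.5:
        return SubnetConfig(depth=9, width_ratio=0.75)
    return SubnetConfig(depth=6, width_ratio=0.5)
```

For example, an easy classification frame would run at `SubnetConfig(depth=6, width_ratio=0.5)`, about 12.5% of the full backbone's FLOPs under this cost model, which is the kind of saving the reported "up to 77.9% lower average FLOPs" figure averages over a workload.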

Abstract

Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing the best prior models of comparable VFM size by up to 7.9% acc@1 on IN1K and 5.2% mIoU on ADE20K. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.
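The abstract's split of roles, edge-side inference with a cloud-side LLM agent issuing execution plans, can be sketched as a simple control loop. This is a hypothetical outline under stated assumptions: the agent is a stub standing in for a multimodal LLM call, and the plan format and replan trigger are invented for illustration.

```python
# Hypothetical edge-cloud control loop for AdaVFM-style adaptation.
# The cloud agent is a stub; a real system would query a multimodal LLM
# with a context summary and parse its response into an execution plan.


def cloud_agent_decide(context: dict) -> dict:
    """Stub for the cloud-side, context-aware LLM agent: maps a context
    summary to an execution plan for the edge backbone."""
    busy = context["scene_complexity"] > 0.5
    return {"depth": 12 if busy else 6, "width_ratio": 1.0 if busy else 0.5}


def edge_step(frame_stats: dict, cached_plan: dict, replan: bool) -> dict:
    """One inference step on the edge device. The cached plan is reused
    across frames; the (assumed) replan flag would fire on events such
    as a scene change, so the cloud round-trip is off the per-frame path."""
    if replan or not cached_plan:
        cached_plan = cloud_agent_decide(frame_stats)
    return cached_plan
```

Keeping the LLM out of the per-frame loop and only consulting it on context changes is one plausible way to reconcile cloud guidance with edge latency budgets; the paper's exact coordination protocol is not specified in this summary.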