AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

arXiv cs.CV / 4/20/2026



Key Points

AdaVFM is proposed as a framework to run language-aligned vision foundation models efficiently on edge devices despite latency and power limits.
The approach dynamically adjusts computation at runtime based on scene context and task complexity, motivated by the finding that model-size reduction affects tasks differently.
AdaVFM integrates neural architecture search (NAS) into the VFM backbone so the system can execute lightweight subnetworks during inference.
A cloud-based multimodal LLM controls the runtime execution through a context-aware agent, enabling coordinated adaptation between edge inference and cloud guidance.
Experiments on zero-shot classification and open-vocabulary segmentation show improved accuracy-efficiency trade-offs, with gains up to +7.9% acc@1 on IN1K and +5.2% mIoU on ADE20K, and up to 77.9% lower average FLOPs for similar accuracy.
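The runtime adaptation described in the key points can be pictured with a small sketch. This is not the paper's implementation: the subnet names, GFLOPs figures, accuracy estimates, and the `pick_subnet` helper are all hypothetical. They only illustrate how a controller might choose the cheapest subnetwork that meets an accuracy target, which in AdaVFM's setting would presumably be informed by the cloud-side agent's reading of the scene and task.

```python
from dataclasses import dataclass


@dataclass
class Subnet:
    name: str
    gflops: float          # cost of one forward pass (hypothetical value)
    est_accuracy: dict     # task name -> estimated accuracy (hypothetical value)


# Hypothetical subnetworks carved out of one NAS-trained backbone.
SUBNETS = [
    Subnet("small", 1.0, {"classification": 0.70, "segmentation": 0.30}),
    Subnet("medium", 2.5, {"classification": 0.74, "segmentation": 0.38}),
    Subnet("large", 4.5, {"classification": 0.76, "segmentation": 0.44}),
]


def pick_subnet(task, target_accuracy, subnets=SUBNETS):
    """Return the cheapest subnet whose estimated accuracy meets the target.

    If no subnet meets the target, fall back to the most accurate one.
    """
    good_enough = [s for s in subnets if s.est_accuracy[task] >= target_accuracy]
    if good_enough:
        return min(good_enough, key=lambda s: s.gflops)
    return max(subnets, key=lambda s: s.est_accuracy[task])


# An easy classification scene can run on a cheaper subnet,
# while a demanding segmentation request needs the largest one.
print(pick_subnet("classification", 0.72).name)  # medium
print(pick_subnet("segmentation", 0.40).name)    # large
```

The same target maps to different subnets for different tasks, which mirrors the point that shrinking the model does not affect every task equally.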

Abstract

Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to 7.9% in acc@1 on IN1K and 5.2% mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.
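To make the "average FLOPs" figure concrete: when different inputs run on different subnetworks, the average cost is a frequency-weighted mean of the per-subnet costs, and the reduction is measured against a fixed-size model. The sketch below uses made-up numbers, not results from the paper, and `average_flops_reduction` is a hypothetical helper.

```python
def average_flops_reduction(usage, baseline_gflops):
    """Compute average cost and relative saving versus a fixed-size baseline.

    usage: list of (fraction_of_inputs, gflops) pairs; fractions must sum to 1.
    Returns (average_gflops, fractional_reduction).
    """
    total = sum(fraction for fraction, _ in usage)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("usage fractions must sum to 1")
    average = sum(fraction * gflops for fraction, gflops in usage)
    return average, 1.0 - average / baseline_gflops


# Illustrative mix: half of the inputs run on a 1 GFLOP subnet,
# a quarter on a 2 GFLOP subnet, and a quarter on the full 4 GFLOP model.
avg, reduction = average_flops_reduction(
    [(0.5, 1.0), (0.25, 2.0), (0.25, 4.0)],
    baseline_gflops=4.0,
)
print(f"{avg} GFLOPs on average, {reduction:.0%} lower than the baseline")
# prints "2.0 GFLOPs on average, 50% lower than the baseline"
```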

