VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

arXiv cs.CV / 4/17/2026


Key Points

  • The paper addresses the quadratic compute cost in vision-language models by optimizing visual token pruning configurations rather than relying on fixed, predefined settings.
  • It introduces VisPCO, which formulates pruning as a budget-aware Pareto-frontier optimization problem and uses continuous relaxation with straight-through estimators for gradient-based search.
  • The optimization is solved using the Augmented Lagrangian method to automatically find pruning configurations that balance computation and performance.
  • Experiments on eight visual benchmarks show the method closely approximates a grid-search-derived empirical Pareto frontier and generalizes across different pruning methods and VLM architectures.
  • By learning kernel functions, the research analyzes layer-wise pruning behavior and finds that multi-step progressive pruning better reflects the model’s hierarchical compression structure than single-layer pruning.
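The budget-constrained search described above can be illustrated with a toy Augmented Lagrangian solve. This is not the paper's implementation; `f` (a performance-loss proxy) and `g` (a compute-budget constraint, e.g. FLOPs minus budget) are hypothetical stand-ins, and the method shown is the standard Augmented Lagrangian update for an inequality constraint g(x) ≤ 0:

```python
def f(x):          # hypothetical "performance loss" proxy
    return (x - 1.0) ** 2

def g(x):          # hypothetical budget constraint, must satisfy g(x) <= 0
    return x - 0.5

def augmented_lagrangian_min(lr=0.05, rho=10.0, outer=20, inner=200):
    """Minimize f subject to g <= 0 via the Augmented Lagrangian method."""
    x, lam = 0.0, 0.0
    for _ in range(outer):
        for _ in range(inner):
            # gradient of L(x) = f(x) + (1/(2*rho)) * (max(0, lam + rho*g(x))^2 - lam^2)
            t = max(0.0, lam + rho * g(x))
            grad = 2.0 * (x - 1.0) + t * 1.0   # dg/dx = 1 for this toy g
            x -= lr * grad                     # inner gradient-descent step
        lam = max(0.0, lam + rho * g(x))       # outer multiplier update
    return x

x_star = augmented_lagrangian_min()
# the constrained optimum is x = 0.5, where the budget is exactly spent
```

Each outer iteration tightens the multiplier so the budget constraint becomes exactly active, which mirrors how the paper's search is driven toward configurations on the computation-performance frontier.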

Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
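The straight-through estimator mentioned in the abstract can be sketched generically. This is not the authors' code: hard token selection (a top-k keep mask) has a zero gradient almost everywhere, so the straight-through trick uses the hard mask in the forward pass but pretends it was the identity in the backward pass, letting gradients reach the continuous pruning scores:

```python
import numpy as np

def ste_mask(scores, keep_ratio):
    """Forward pass: hard top-k keep mask over token importance scores."""
    k = max(1, int(round(keep_ratio * scores.size)))
    hard = np.zeros_like(scores)
    hard[np.argsort(scores)[-k:]] = 1.0   # keep the k highest-scoring tokens
    return hard

def ste_grad(upstream_grad):
    """Backward pass: d(hard)/d(scores) is zero almost everywhere, so the
    straight-through estimator replaces it with the identity Jacobian."""
    return upstream_grad

scores = np.array([0.1, 0.9, 0.4, 0.7])
mask = ste_mask(scores, keep_ratio=0.5)   # keeps the two strongest tokens
```

In an autograd framework this would be a custom backward function; the manual split here just makes the forward/backward asymmetry explicit.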