VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

arXiv cs.CV / 4/17/2026


Key Points

  • The paper addresses the quadratic compute cost in vision-language models by optimizing visual token pruning configurations rather than relying on fixed, predefined settings.
  • It introduces VisPCO, which formulates pruning as a budget-aware Pareto-frontier optimization problem and uses continuous relaxation with straight-through estimators for gradient-based search.
  • The optimization is solved using the Augmented Lagrangian method to automatically find pruning configurations that balance computation and performance.
  • Experiments on eight visual benchmarks show the method closely approximates a grid-search-derived empirical Pareto frontier and generalizes across different pruning methods and VLM architectures.
  • By learning kernel functions, the research analyzes layer-wise pruning behavior and finds that multi-step progressive pruning better reflects the model’s hierarchical compression structure than single-layer pruning.
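The budget-constrained search described above can be illustrated with a toy Augmented Lagrangian solve. This is not the paper's implementation; `f` (a performance-loss proxy) and `g` (a compute-budget constraint, e.g. FLOPs minus budget) are hypothetical stand-ins, and the method shown is the standard Augmented Lagrangian update for an inequality constraint g(x) ≤ 0:

```python
def f(x):          # hypothetical "performance loss" proxy
    return (x - 1.0) ** 2

def g(x):          # hypothetical budget constraint, must satisfy g(x) <= 0
    return x - 0.5

def augmented_lagrangian_min(lr=0.05, rho=10.0, outer=20, inner=200):
    """Minimize f subject to g <= 0 via the Augmented Lagrangian method."""
    x, lam = 0.0, 0.0
    for _ in range(outer):
        for _ in range(inner):
            # gradient of L(x) = f(x) + (1/(2*rho)) * (max(0, lam + rho*g(x))^2 - lam^2)
            t = max(0.0, lam + rho * g(x))
            grad = 2.0 * (x - 1.0) + t * 1.0   # dg/dx = 1 for this toy g
            x -= lr * grad                     # inner gradient-descent step
        lam = max(0.0, lam + rho * g(x))       # outer multiplier update
    return x

x_star = augmented_lagrangian_min()
# the constrained optimum is x = 0.5, where the budget is exactly spent
```

Each outer iteration tightens the multiplier so the budget constraint becomes exactly active, which mirrors how the paper's search is driven toward configurations on the computation-performance frontier.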

Abstract

Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
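The straight-through estimator mentioned in the abstract can be sketched generically. This is not the authors' code: hard token selection (a top-k keep mask) has a zero gradient almost everywhere, so the straight-through trick uses the hard mask in the forward pass but pretends it was the identity in the backward pass, letting gradients reach the continuous pruning scores:

```python
import numpy as np

def ste_mask(scores, keep_ratio):
    """Forward pass: hard top-k keep mask over token importance scores."""
    k = max(1, int(round(keep_ratio * scores.size)))
    hard = np.zeros_like(scores)
    hard[np.argsort(scores)[-k:]] = 1.0   # keep the k highest-scoring tokens
    return hard

def ste_grad(upstream_grad):
    """Backward pass: d(hard)/d(scores) is zero almost everywhere, so the
    straight-through estimator replaces it with the identity Jacobian."""
    return upstream_grad

scores = np.array([0.1, 0.9, 0.4, 0.7])
mask = ste_mask(scores, keep_ratio=0.5)   # keeps the two strongest tokens
```

In an autograd framework this would be a custom backward function; the manual split here just makes the forward/backward asymmetry explicit.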