QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models
arXiv cs.CV / 4/6/2026
Key Points
- The paper finds that vision token pruning and post-training quantization (PTQ) for multimodal LLMs are tightly coupled: naïve pruning can remove the activation outliers needed for numerical stability, increasing quantization error in low-bit settings (e.g., W4A4).
- It proposes QAPruner, a quantization-aware pruning framework that scores tokens with a lightweight hybrid sensitivity metric, combining simulated group-wise quantization error and outlier intensity with semantic relevance scores.
- Experiments on standard LLaVA architectures show QAPruner outperforms baselines that combine PTQ and pruning without accounting for their interaction.
- At an aggressive setting that retains only 12.5% of visual tokens, QAPruner improves accuracy by 2.24% over the baseline and can even surpass dense quantization without pruning.
- The authors position QAPruner as the first approach explicitly co-optimizing vision token pruning and PTQ for accurate low-bit inference in MLLMs.
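The hybrid scoring described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the function names, the 4-bit symmetric group-wise quantizer, the outlier-intensity definition (peak-to-mean magnitude ratio), and the equal-weight combination rule are all assumptions made for the sketch.

```python
import numpy as np

def groupwise_quant_error(x, bits=4, group_size=32):
    # Simulate symmetric group-wise quantization and return the MSE it incurs
    # (a stand-in for the paper's "simulated group-wise quantization error").
    qmax = 2 ** (bits - 1) - 1
    g = x.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero groups
    gq = np.round(g / scale).clip(-qmax - 1, qmax) * scale
    return float(((g - gq) ** 2).mean())

def outlier_intensity(x):
    # Peak-to-mean magnitude ratio: large values flag activation outliers
    # that low-bit quantization relies on (illustrative definition).
    return float(np.abs(x).max() / (np.abs(x).mean() + 1e-8))

def hybrid_scores(tokens, relevance, alpha=0.5, beta=0.5):
    # Combine quantization sensitivity with semantic relevance; the
    # equal weighting and normalization are assumptions of this sketch.
    sens = np.array([alpha * groupwise_quant_error(t) + beta * outlier_intensity(t)
                     for t in tokens])
    sens = sens / (sens.max() + 1e-8)  # normalize sensitivity to [0, 1]
    return np.asarray(relevance) + sens

def prune_tokens(tokens, relevance, keep_ratio=0.125):
    # Keep the top-scoring fraction of visual tokens (12.5% matches the
    # paper's most aggressive setting).
    scores = hybrid_scores(tokens, relevance)
    k = max(1, int(len(tokens) * keep_ratio))
    return np.argsort(scores)[::-1][:k]  # indices of tokens to keep
```

Under this sketch, a token carrying a large activation outlier gets a high sensitivity score even if its semantic relevance is modest, so pruning preserves the outliers that W4A4 quantization needs for stability.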