Prompt-Guided Prefiltering for VLM Image Compression

arXiv cs.AI / 4/2/2026


Key Points

  • Assuming query images are uploaded to a cloud-hosted VLM, the paper proposes "prompt-guided prefiltering," a preprocessing step that suppresses task-irrelevant detail so images can be compressed more efficiently.
  • The method identifies the image regions relevant to the text prompt and smooths out less relevant areas while preserving important information, improving compression efficiency.
  • It is a codec-agnostic, plug-and-play module that can be inserted in front of either conventional or learned encoders.
  • Across multiple VQA benchmarks, it reportedly reduces the average bitrate by 25-50% while maintaining task accuracy.
  • The source code is publicly available, making it straightforward to verify the implementation and apply it to VLM-oriented image compression.

Abstract

The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, hence efficient image compression becomes crucial. However, traditional human-centric codecs are suboptimal in this setting because they preserve many task-irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt-driven VLMs with an open-ended variety of objectives. We propose a lightweight, plug-and-play, prompt-guided prefiltering module to identify image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression efficiency. It is codec-agnostic and can be applied before conventional and learned encoders. Experiments on several VQA benchmarks show that our approach achieves a 25-50% average bitrate reduction while maintaining the same task accuracy. Our source code is available at https://github.com/bardia-az/pgp-vlm-compression.
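The core idea, keeping prompt-relevant regions sharp while smoothing the rest before the image reaches any codec, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, a simple box blur stands in for whatever smoothing the module applies, and the per-pixel `relevance` map (values in [0, 1]) is assumed to come from some text-image relevance model in the actual method.

```python
import numpy as np

def box_blur(img: np.ndarray, k: int = 5) -> np.ndarray:
    """Separable k x k box blur with edge padding (stand-in for the
    smoothing used by the prefiltering module; k must be odd)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    kernel = np.ones(k) / k
    # horizontal pass, then vertical pass
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, out)
    return out

def prefilter(image: np.ndarray, relevance: np.ndarray, k: int = 5) -> np.ndarray:
    """Blend the original and a blurred copy per pixel: relevance 1 keeps
    the original detail, relevance 0 fully smooths the region. The result
    is then passed unchanged to any codec (JPEG, a learned encoder, ...)."""
    smoothed = box_blur(image, k)
    return relevance * image + (1.0 - relevance) * smoothed
```

Because smoothing lowers local high-frequency energy, the downstream codec spends fewer bits on the de-emphasized regions, which is where the 25-50% bitrate savings would come from under this reading of the abstract.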