Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

arXiv cs.CV / 4/9/2026


Key Points

  • Q-Zoom proposes a framework in which a multimodal LLM does not process high-resolution images unconditionally; instead, perception switches coarse-to-fine on demand, driven by the query.
  • A lightweight Dynamic Gating Network bypasses high-resolution processing when coarse global features suffice, curbing the quadratic growth in self-attention compute.
  • When a task requires fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) localizes the task-relevant RoI with high precision from intermediate features, trained in a fully self-supervised manner.
  • A consistency-aware generation strategy derives deterministic routing labels for the gate, while a continuous spatio-temporal alignment scheme and targeted fine-tuning fuse the dense RoI information into the coarse global layout.

Abstract

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
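The coarse-to-fine routing described above can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's implementation: `dynamic_gate` stands in for the Dynamic Gating Network (here a single linear layer with random weights), and the routing threshold and function names are assumptions. It only shows the control flow: pool coarse global features, fuse them with the query embedding, and skip the high-resolution RoI path when the gate is confident the coarse view suffices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_gate(coarse_feats, query_emb, W, b):
    """Toy stand-in for the Dynamic Gating Network: fuse mean-pooled
    coarse visual features with the query embedding and output the
    probability that high-resolution processing is needed."""
    pooled = coarse_feats.mean(axis=0)              # (d,) global summary
    fused = np.concatenate([pooled, query_emb])     # (2d,) visual + query
    logit = fused @ W + b
    return 1.0 / (1.0 + np.exp(-logit))             # sigmoid probability

def route(coarse_feats, query_emb, W, b, threshold=0.5):
    """Coarse-to-fine routing: answer from the coarse global view when
    the gate allows it; otherwise fall through to the (RoI-based)
    high-resolution path. Threshold is an assumed hyperparameter."""
    p = dynamic_gate(coarse_feats, query_emb, W, b)
    return "high_res_roi" if p >= threshold else "coarse_only"

# Example with random features: 16 coarse visual tokens of dim 8.
d = 8
W = rng.normal(size=2 * d)
b = 0.0
feats = rng.normal(size=(16, d))
query = rng.normal(size=d)
decision = route(feats, query, W, b)
```

In the full system, the "high_res_roi" branch would invoke the SD-RPN to crop the task-relevant region before re-encoding it at high resolution; skipping that branch for easy queries is what yields the reported throughput gains.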