ReflectCAP: Detailed Image Captioning with Reflective Memory

arXiv cs.AI / 4/15/2026


Key Points

  • ReflectCAP (Reflective Note-Guided Captioning) targets both factuality and fine-grained coverage in detailed image captioning by iteratively analyzing an LVLM's hallucinations (errors) and omissions and turning them into guidelines.
  • A multi-agent pipeline extracts what the target LVLM consistently gets wrong and what it systematically misses, and distills these patterns into reusable guidelines called Structured Reflection Notes.
  • At inference time, these notes steer caption generation along both axes, what to avoid and what to attend to, improving the factuality/coverage trade-off across 8 LVLMs including the GPT-4.1 family, the Qwen series, and InternVL variants.
  • In head-to-head evaluation on CapArena-Auto, ReflectCAP outperforms strong reference models, and it achieves a better quality/compute balance than model scaling while avoiding the 21–36% extra computational overhead incurred by existing multi-agent methods.

Abstract

Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes (what to avoid and what to attend to), yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21–36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.
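The two-phase idea in the abstract (distill recurring errors into reusable notes, then inject them into the captioning prompt) can be sketched as a minimal toy in Python. This is not the paper's implementation: the error-log schema, the "seen at least twice" threshold for treating a pattern as systematic, and the prompt layout are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ReflectionNote:
    """One distilled guideline: a pattern to avoid or an aspect to attend to."""
    kind: str       # "avoid" (recurring hallucination) or "attend" (recurring omission)
    guideline: str

def distill_notes(error_log: list[dict]) -> list[ReflectionNote]:
    """Distill recurring errors/omissions into reusable notes.

    `error_log` entries use a hypothetical schema:
    {"type": "hallucination" | "omission", "pattern": str}.
    A pattern observed at least twice is treated as systematic (assumed threshold).
    """
    counts: dict[tuple[str, str], int] = {}
    for entry in error_log:
        key = (entry["type"], entry["pattern"])
        counts[key] = counts.get(key, 0) + 1
    notes = []
    for (etype, pattern), n in counts.items():
        if n >= 2:  # consistent across analysis rounds
            kind = "avoid" if etype == "hallucination" else "attend"
            notes.append(ReflectionNote(kind, pattern))
    return notes

def build_prompt(base_instruction: str, notes: list[ReflectionNote]) -> str:
    """Prepend the structured notes to the captioning instruction,
    steering generation along both axes: what to avoid, what to attend to."""
    avoid = [n.guideline for n in notes if n.kind == "avoid"]
    attend = [n.guideline for n in notes if n.kind == "attend"]
    lines = [base_instruction]
    if avoid:
        lines.append("Avoid: " + "; ".join(avoid))
    if attend:
        lines.append("Attend to: " + "; ".join(attend))
    return "\n".join(lines)
```

With a toy log where "counting objects" is hallucinated twice and "background text" is omitted twice, `build_prompt("Describe the image in detail.", distill_notes(log))` yields an instruction followed by one "Avoid:" line and one "Attend to:" line; a real system would pass this augmented prompt to the target LVLM.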