ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces ICBench, a new large-scale image captioning benchmark designed to better evaluate multimodal large language models (MLLMs) by addressing shortcomings of existing benchmarks: limited caption-length diversity, lack of coverage of recent MLLMs, and insufficient human annotation.
  • ICBench covers 12 content categories across 2K images, with captions generated by 10 advanced MLLMs in both short and long settings (2,000 images × 10 models × 2 settings = 40,000 captions).
  • Human subjective studies produce mean opinion scores (MOSs) on fine-grained dimensions: short captions are rated for fluency, relevance, and conciseness, while long captions are rated for fluency, relevance, and completeness.
  • The authors propose ITIScore, an automated image-to-text-to-image reconstruction-consistency metric, and report strong correlation with human judgments plus zero-shot generalization to other public captioning datasets (see the sketch after this list).
  • The authors state that the dataset and evaluation metric will be released upon publication.
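The summary does not spell out the ITIScore pipeline, but the image-to-text-to-image idea can be sketched: regenerate an image from the candidate caption with a text-to-image model, then score the caption by how closely the reconstruction matches the original. The sketch below is a minimal illustration, not the authors' implementation; it assumes CLIP image embeddings as the consistency measure and leaves the text-to-image generator (`regenerate`) abstract.

```python
# Minimal sketch of an image-to-text-to-image consistency score.
# Assumptions (not from the paper): CLIP embeddings for similarity,
# and an abstract `regenerate` callable for the text-to-image step.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(image: Image.Image) -> torch.Tensor:
    # Encode an image into a unit-normalized CLIP embedding.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

def reconstruction_consistency(original: Image.Image, caption: str, regenerate) -> float:
    # Regenerate an image from the caption, then score the caption by
    # the cosine similarity between the original and the reconstruction.
    # `regenerate` is any text-to-image callable (hypothetical stand-in).
    reconstructed = regenerate(caption)
    sim = (image_embedding(original) * image_embedding(reconstructed)).sum()
    return sim.item()
```

Any text-to-image backend can stand in for `regenerate`, e.g. a diffusers pipeline: `regenerate = lambda c: pipe(c).images[0]`.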

Abstract

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
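For context, alignment between an automatic metric and human MOSs is conventionally reported with rank (SRCC) and linear (PLCC) correlation coefficients. A minimal sketch of that validation step, assuming SciPy and toy values rather than the paper's exact protocol:

```python
# Hedged sketch: standard correlation check between an automatic
# metric and human mean opinion scores (MOS). The paper's exact
# validation protocol is not specified in this summary.

from scipy.stats import pearsonr, spearmanr

def agreement_with_humans(metric_scores, mos):
    # SRCC: rank correlation; PLCC: linear correlation, both over
    # the same set of captions scored by the metric and by humans.
    srcc, _ = spearmanr(metric_scores, mos)
    plcc, _ = pearsonr(metric_scores, mos)
    return srcc, plcc

# Toy example (values are illustrative, not from the paper):
# agreement_with_humans([0.71, 0.54, 0.88], [4.2, 3.1, 4.8])
```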