ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces ICBench, a new large-scale image captioning benchmark designed to better evaluate multimodal large language models (MLLMs) by addressing shortcomings of existing benchmarks: limited caption-length diversity, lack of coverage of recent MLLMs, and insufficient human annotation.
  • ICBench covers 12 content categories across 2K images, with captions generated by 10 advanced MLLMs in both short and long settings (2,000 images × 10 models × 2 settings = 40,000 captions).
  • Human subjective studies produce mean opinion scores (MOSs) on fine-grained dimensions: short captions are rated for fluency, relevance, and conciseness, while long captions are rated for fluency, relevance, and completeness.
  • The authors propose ITIScore, an automated image-to-text-to-image reconstruction-consistency metric, and report strong correlation with human judgments plus zero-shot generalization to other public captioning datasets (see the sketch after this list).
  • The authors state that the dataset and evaluation metric will be released upon publication.
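The summary does not spell out the ITIScore pipeline, but the image-to-text-to-image idea can be sketched: regenerate an image from the candidate caption with a text-to-image model, then score the caption by how closely the reconstruction matches the original. The sketch below is a minimal illustration, not the authors' implementation; it assumes CLIP image embeddings as the consistency measure and leaves the text-to-image generator (`regenerate`) abstract.

```python
# Minimal sketch of an image-to-text-to-image consistency score.
# Assumptions (not from the paper): CLIP embeddings for similarity,
# and an abstract `regenerate` callable for the text-to-image step.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(image: Image.Image) -> torch.Tensor:
    # Encode an image into a unit-normalized CLIP embedding.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

def reconstruction_consistency(original: Image.Image, caption: str, regenerate) -> float:
    # Regenerate an image from the caption, then score the caption by
    # the cosine similarity between the original and the reconstruction.
    # `regenerate` is any text-to-image callable (hypothetical stand-in).
    reconstructed = regenerate(caption)
    sim = (image_embedding(original) * image_embedding(reconstructed)).sum()
    return sim.item()
```

Any text-to-image backend can stand in for `regenerate`, e.g. a diffusers pipeline: `regenerate = lambda c: pipe(c).images[0]`.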

Abstract

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
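For context, alignment between an automatic metric and human MOSs is conventionally reported with rank (SRCC) and linear (PLCC) correlation coefficients. A minimal sketch of that validation step, assuming SciPy and toy values rather than the paper's exact protocol:

```python
# Hedged sketch: standard correlation check between an automatic
# metric and human mean opinion scores (MOS). The paper's exact
# validation protocol is not specified in this summary.

from scipy.stats import pearsonr, spearmanr

def agreement_with_humans(metric_scores, mos):
    # SRCC: rank correlation; PLCC: linear correlation, both over
    # the same set of captions scored by the metric and by humans.
    srcc, _ = spearmanr(metric_scores, mos)
    plcc, _ = pearsonr(metric_scores, mos)
    return srcc, plcc

# Toy example (values are illustrative, not from the paper):
# agreement_with_humans([0.71, 0.54, 0.88], [4.2, 3.1, 4.8])
```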