ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
arXiv cs.CV / 4/7/2026
Key Points
- The paper introduces ICBench, a new large-scale image captioning benchmark for evaluating multimodal large language models (MLLMs). It targets three shortcomings of existing benchmarks: limited caption-length diversity, little coverage of recent MLLMs, and insufficient human annotation.
- ICBench covers 12 content categories with 2K images. Captions are generated by 10 advanced MLLMs in both a short and a long caption setting, yielding 40K captions in total (consistent with 2K images × 10 models × 2 settings).
- Human subjective studies produce mean opinion scores (MOSs) on fine-grained dimensions: short captions are rated for fluency, relevance, and conciseness, while long captions are rated for fluency, relevance, and completeness.
- The authors propose ITIScore, an automated image-to-text-to-image reconstruction-consistency metric, and report strong correlation with human judgments plus zero-shot generalization to other public captioning datasets.
- The authors state that the dataset and evaluation metric will be released upon publication.
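
The summary above does not give the formula behind ITIScore, so the following is only a minimal sketch of the general image-to-text-to-image idea, not the authors' implementation. It assumes a caption is scored by regenerating an image from it and comparing that image to the original in an embedding space. The callables `text_to_image` and `embed_image`, the use of cosine similarity, and the choice of Spearman rank correlation for comparing against MOSs are all assumptions made for illustration.

```python
import math
from typing import Any, Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_consistency(
    image: Any,
    caption: str,
    text_to_image: Callable[[str], Any],    # hypothetical generator, supplied by the caller
    embed_image: Callable[[Any], Vector],   # hypothetical image encoder, supplied by the caller
) -> float:
    """Score a caption by how closely an image regenerated from it matches the original.

    Assumption: similarity is measured as cosine similarity of image embeddings.
    """
    reconstructed = text_to_image(caption)
    return cosine_similarity(embed_image(image), embed_image(reconstructed))


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """Mean opinion score: the plain average of the human ratings for one caption."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


def average_ranks(values: Sequence[float]) -> list:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_rx = sum(rx) / len(rx)
    mean_ry = sum(ry) / len(ry)
    cov = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    var_x = sum((a - mean_rx) ** 2 for a in rx)
    var_y = sum((b - mean_ry) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

In use, one would compute `reconstruction_consistency` for each caption with a real text-to-image model and image encoder plugged in, compute `mean_opinion_score` from that caption's human ratings, and pass the two resulting lists to `spearman_correlation` to see how well the automated score tracks human judgment.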