MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

arXiv cs.CL / 4/30/2026

💬 Opinion · Models & Research

Key Points

  • The paper introduces MINOS, a multimodal evaluation model designed to better assess bidirectional image-text generation, addressing shortcomings of traditional multimodal evaluation metrics.
  • It constructs a high-quality evaluation dataset, Minos-57K, using rigorous quality control and covering evaluation samples from 15 datasets.
  • MINOS is trained with SFT (supervised fine-tuning) and preference alignment to improve evaluation reliability across both image-to-text (I2T) and text-to-image (T2I) tasks.
  • The authors report state-of-the-art results on 16 out-of-domain datasets among open-source multimodal evaluation models, despite using less than half the training data scale of prior work.
  • Extensive experiments emphasize that quality control, joint training across I2T and T2I, and preference alignment are key factors for consistently strong evaluation performance.
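
The SFT-then-preference-alignment recipe in the key points above can be illustrated with the Direct Preference Optimization (DPO) objective, a common choice for preference alignment; note the paper's exact alignment method is not specified in this summary, so this is an assumption. A minimal sketch in pure Python:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the preferred (chosen)
    and dispreferred (rejected) evaluation outputs under the policy
    being trained and under a frozen SFT reference model.
    """
    # Implicit reward margins relative to the reference model.
    margin_chosen = logp_chosen - ref_logp_chosen
    margin_rejected = logp_rejected - ref_logp_rejected
    # Negative log-sigmoid of the scaled margin difference.
    logits = beta * (margin_chosen - margin_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# If the policy favors the chosen output more than the reference does,
# the loss drops below log(2), its value at a zero margin difference.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2.0))  # True
```

Minimizing this loss pushes the policy to rank preferred evaluation outputs above dispreferred ones while staying close to the SFT reference.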

Abstract

Evaluation is important for multimodal generation tasks, but traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing studies often simply collect large-scale evaluation data for training while overlooking its quality. Moreover, currently proposed evaluation models often struggle to achieve consistently strong performance across both image-to-text (I2T) and text-to-image (T2I) tasks. In this paper, through rigorous quality control strategies, we construct a comprehensive multimodal evaluation dataset, Minos-57K, with evaluation samples drawn from 15 datasets, and use it to develop the multimodal evaluation model Minos with SFT and preference-alignment training strategies. Notably, despite using less than half the training data of prior work, our model achieves state-of-the-art evaluation performance among all open-source multimodal evaluation models across 16 out-of-domain datasets covering both I2T and T2I tasks, and remains competitive with closed-source models. Extensive experiments demonstrate the importance of the quality control process, joint training on evaluation data from both I2T and T2I generation tasks, and further preference alignment.