CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
arXiv cs.CV / 3/30/2026
Key Points
- The paper introduces CREval, an automated, QA-based evaluation pipeline aimed at making multimodal image-manipulation model scoring more complete and interpretable than opaque MLLM-based metrics (a rough sketch of such a QA-style scorer follows this list).
- It also releases CREval-Bench, a benchmark for creative image editing under complex instructions, spanning three categories and nine creative dimensions with 800+ editing samples and 13K evaluation queries.
- Using CREval and CREval-Bench, the authors evaluate a range of state-of-the-art open- and closed-source models and find that closed-source models generally perform better on complex, creative edits.
- Despite performance gaps, the study reports that all evaluated models still struggle to carry out such complex creative edits effectively.
- User studies show high consistency between CREval’s automated metrics and human judgments, positioning CREval as a practical foundation for future evaluation and research.
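To make the QA-based idea concrete, below is a minimal, hypothetical sketch of how an interpretable, question-driven scorer could work: each edited image is checked against a set of yes/no questions grouped by creative dimension, and the score is the per-dimension pass rate. This is not the paper's released implementation; the `EvalQuery`, `score_edit`, and `ask_mllm` names, and the aggregation scheme, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalQuery:
    """One yes/no evaluation question tied to a single creative dimension."""
    dimension: str   # e.g. "instruction_following" or "visual_coherence"
    question: str    # e.g. "Is the cat in the edited image wearing a top hat?"


def score_edit(
    edited_image_path: str,
    queries: List[EvalQuery],
    ask_mllm: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Aggregate yes/no answers into per-dimension pass rates.

    `ask_mllm(image_path, question)` is a hypothetical adapter around any
    multimodal LLM; it should return True when the model answers "yes".
    """
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for q in queries:
        total[q.dimension] = total.get(q.dimension, 0) + 1
        if ask_mllm(edited_image_path, q.question):
            passed[q.dimension] = passed.get(q.dimension, 0) + 1

    per_dimension = {dim: passed.get(dim, 0) / n for dim, n in total.items()}
    overall = (
        sum(per_dimension.values()) / len(per_dimension) if per_dimension else 0.0
    )
    return {"per_dimension": per_dimension, "overall": overall}
```

Because every score traces back to individual questions, a low result on one dimension can be explained by listing the failed queries, which is the interpretability advantage the Key Points attribute to QA-style evaluation over a single opaque MLLM rating.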