Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro

arXiv cs.CV / 4/7/2026


Key Points

  • The paper identifies a key failure mode in multi-turn, multi-modal image editing: repeated edits cause iterative degradation that accumulates visible noise and can break adherence to even simple instructions.
  • It introduces Banana100, a new dataset of 28,000 images created via 100 iterative editing steps across varied textures and image content, specifically targeting this degradation phenomenon.
  • The authors report that common no-reference image quality assessment (NR-IQA) metrics fail to reliably flag heavily degraded images, with none of 21 popular metrics consistently scoring degraded images lower than clean ones.
  • The dual breakdown—both in image generators and in evaluators—raises concerns about training stability and the safety of deployed agentic systems if low-quality synthetic outputs bypass quality filters.
  • The authors release code and data to support building more robust editing systems and more reliable quality evaluation for multi-turn agentic workflows.
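The degradation the paper describes can be pictured with a toy model (a hypothetical sketch, not the paper's actual editing pipeline): if each of the 100 edits injects a small amount of independent per-pixel noise, the accumulated noise standard deviation grows roughly with the square root of the number of edits, so errors that are invisible after one edit become conspicuous after a hundred.

```python
import math
import random

def simulate_iterative_edits(n_steps=100, per_edit_noise_std=0.01,
                             n_pixels=1024, seed=0):
    """Toy model of multi-turn editing: each 'edit' adds small i.i.d.
    Gaussian noise to every pixel. Returns the empirical std of the
    accumulated noise after n_steps. Parameter values are illustrative
    assumptions, not taken from the paper."""
    rng = random.Random(seed)
    noise = [0.0] * n_pixels
    for _ in range(n_steps):
        for i in range(n_pixels):
            noise[i] += rng.gauss(0.0, per_edit_noise_std)
    mean = sum(noise) / n_pixels
    var = sum((x - mean) ** 2 for x in noise) / n_pixels
    return math.sqrt(var)
```

Under this independence assumption, 100 edits at per-edit std 0.01 yield an accumulated std of about 0.01 × √100 = 0.1, a tenfold amplification; correlated artifacts (the more realistic case for generative editors) would accumulate even faster.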

Abstract

The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although the latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing: the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to severe visible noise and a failure to follow even simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps across diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation: of 21 popular no-reference image quality assessment (NR-IQA) metrics, none consistently assigns lower scores to heavily degraded images than to clean ones. This dual failure of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems if the low-quality synthetic data generated by multi-turn edits escapes quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.
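The consistency criterion applied to the 21 NR-IQA metrics can be made concrete with a small helper (a sketch under assumed, made-up scores; the paper's actual evaluation protocol and metric implementations are not reproduced here): for each (clean, degraded) image pair, check whether the metric ranks the degraded image worse, and report the fraction of pairs where it does. A metric that "consistently" flags degradation would score 1.0.

```python
def degradation_detection_rate(score_pairs, higher_is_better=True):
    """Fraction of (clean_score, degraded_score) pairs in which a
    no-reference IQA metric ranks the degraded image worse than the
    clean one. 1.0 means the metric is fully consistent; the paper
    reports that no tested metric achieves this."""
    if not score_pairs:
        raise ValueError("no score pairs given")
    hits = 0
    for clean, degraded in score_pairs:
        if higher_is_better:
            hits += degraded < clean
        else:  # lower-is-better metrics (e.g. distortion-style scores)
            hits += degraded > clean
    return hits / len(score_pairs)
```

For example, with hypothetical higher-is-better scores `[(0.9, 0.4), (0.8, 0.85), (0.7, 0.3)]`, the rate is 2/3: the metric misses the degradation in the second pair, which is exactly the failure mode that would let low-quality synthetic images slip past a threshold-based quality filter.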