DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

arXiv cs.CV / 4/23/2026


Key Points

  • The paper introduces DistortBench, a diagnostic no-reference benchmark to test how well vision-language models (VLMs) identify image distortion type and severity.
  • DistortBench includes 13,500 four-choice questions spanning 27 distortion types grouped into six perceptual categories, each applied at five severity levels; 25 distortions follow KADID-10k calibrations, and two added rotation distortions use angle-based levels (see the sketch after this list).
  • The authors evaluate 18 VLMs (17 open-weight models from five families and one proprietary model), finding that even the top model achieves only 61.9% accuracy versus a human majority-vote baseline of 65.7%.
  • Analysis shows limited and non-monotonic scaling with model size, performance degradation in most “base–thinking” pairs (a base model compared with its reasoning-oriented “thinking” variant), and different severity-response behaviors across model families.
  • Despite strong performance on high-level multimodal tasks, VLMs struggle with low-level distortion perception; the authors position DistortBench as a tool to measure and improve this capability, which they identify as a key weakness of current models.
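
To make the evaluation protocol concrete, here is a minimal sketch of what a DistortBench-style item and its accuracy scoring could look like. The field names, structure, and example values are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the benchmark's actual data format.
from dataclasses import dataclass

@dataclass
class DistortionQuestion:
    image_path: str     # distorted image shown to the model
    question: str       # e.g. "Which distortion affects this image?"
    choices: list[str]  # four candidate distortion types
    answer: str         # ground-truth distortion type
    category: str       # one of six perceptual categories
    severity: int       # severity level, 1 (mild) to 5 (severe)

def accuracy(predictions: list[str], questions: list[DistortionQuestion]) -> float:
    """Fraction of four-choice questions answered correctly."""
    correct = sum(pred == q.answer for pred, q in zip(predictions, questions))
    return correct / len(questions)
```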

Abstract

Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base–thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.
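
For intuition on the severity construction, the sketch below shows how a rotation distortion with five monotonically increasing, angle-based levels might be generated. The specific angles and the use of Pillow are assumptions for illustration, not the paper's calibration.

```python
# Illustrative sketch: the angle values are assumed, not the paper's
# calibration; Pillow is assumed as the imaging library.
from PIL import Image

ROTATION_ANGLES = {1: 2, 2: 5, 3: 10, 4: 20, 5: 45}  # degrees per severity level (assumed)

def rotate_distortion(image: Image.Image, severity: int) -> Image.Image:
    """Rotate the image by an angle that grows monotonically with severity (1-5)."""
    return image.rotate(ROTATION_ANGLES[severity], resample=Image.Resampling.BICUBIC)
```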