Evian: Towards Explainable Visual Instruction-tuning Data Auditing

arXiv cs.CV / 4/23/2026


Key Points

  • The paper argues that LVLM performance hinges on high-quality training data, and that existing filtering methods are too coarse to catch subtle semantic issues such as logical fallacies and factual errors.
  • It introduces a 300K-sample benchmark created by systematically injecting diverse, subtle defects to better stress-test visual-instruction data auditing.
  • The authors propose a “Decomposition-then-Evaluation” approach that breaks model outputs into visual descriptions, subjective inferences, and factual claims for more fine-grained diagnosis.
  • They implement this as EVIAN, an automated auditing framework that evaluates image-text consistency, logical coherence, and factual accuracy, and show that a model fine-tuned on a compact, high-quality subset curated by EVIAN can outperform models trained on far larger datasets.
  • Experiments indicate that auditing benefits from decomposing work into verifiable subtasks, and that logical coherence is the most critical dimension for judging data quality.
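The "Decomposition-then-Evaluation" idea above can be sketched in a few lines of code. This is a minimal illustrative mock-up, not the authors' implementation: the component names mirror the paper's taxonomy (visual description, subjective inference, factual claim), but the `audit` function, the `score_fn` judge interface, and the aggregation weights are all hypothetical. In the real framework, each component would presumably be scored by a model-based judge; here a caller-supplied function stands in for that.

```python
from dataclasses import dataclass
from enum import Enum

class ComponentType(Enum):
    # Component taxonomy from the paper's decomposition step.
    VISUAL_DESCRIPTION = "visual_description"
    SUBJECTIVE_INFERENCE = "subjective_inference"
    FACTUAL_CLAIM = "factual_claim"

@dataclass
class Component:
    text: str
    kind: ComponentType

@dataclass
class AuditScores:
    # The three evaluation axes named in the paper, scored in [0, 1].
    image_text_consistency: float
    logical_coherence: float
    factual_accuracy: float

    def overall(self, w_cons=1.0, w_logic=2.0, w_fact=1.0) -> float:
        # Illustrative weighted mean; logical coherence is up-weighted
        # here only because the paper reports it as the most critical
        # axis -- the actual weighting scheme is not specified.
        total = w_cons + w_logic + w_fact
        return (w_cons * self.image_text_consistency
                + w_logic * self.logical_coherence
                + w_fact * self.factual_accuracy) / total

def audit(components: list[Component], score_fn) -> AuditScores:
    """Score each decomposed component, then average per axis.

    `score_fn(component) -> AuditScores` is a placeholder for the
    model-based judges an auditing framework would actually invoke.
    """
    per = [score_fn(c) for c in components]
    n = max(len(per), 1)
    return AuditScores(
        sum(s.image_text_consistency for s in per) / n,
        sum(s.logical_coherence for s in per) / n,
        sum(s.factual_accuracy for s in per) / n,
    )
```

A curation pipeline could then keep only samples whose `overall()` score clears a threshold, which is one plausible way a small high-quality subset would be selected.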

Abstract

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.