Evian: Towards Explainable Visual Instruction-tuning Data Auditing

arXiv cs.CV / 4/23/2026


Key Points

  • The paper argues that LVLM performance hinges on high-quality training data, and that existing filtering methods are too coarse to catch subtle semantic issues such as logical fallacies and factual errors.
  • It introduces a 300K-sample benchmark created by systematically injecting diverse, subtle defects to better stress-test visual-instruction data auditing.
  • The authors propose a “Decomposition-then-Evaluation” approach that breaks model outputs into visual descriptions, subjective inferences, and factual claims for more fine-grained diagnosis.
  • They implement this as EVIAN, an automated auditing framework that evaluates image-text consistency, logical coherence, and factual accuracy, and show that a model fine-tuned on a compact, high-quality subset curated by EVIAN can outperform models trained on far larger datasets.
  • Experiments indicate that auditing benefits from decomposing work into verifiable subtasks, and that logical coherence is the most critical dimension for judging data quality.
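The "Decomposition-then-Evaluation" idea above can be sketched in a few lines of code. This is a minimal illustrative mock-up, not the authors' implementation: the component names mirror the paper's taxonomy (visual description, subjective inference, factual claim), but the `audit` function, the `score_fn` judge interface, and the aggregation weights are all hypothetical. In the real framework, each component would presumably be scored by a model-based judge; here a caller-supplied function stands in for that.

```python
from dataclasses import dataclass
from enum import Enum

class ComponentType(Enum):
    # Component taxonomy from the paper's decomposition step.
    VISUAL_DESCRIPTION = "visual_description"
    SUBJECTIVE_INFERENCE = "subjective_inference"
    FACTUAL_CLAIM = "factual_claim"

@dataclass
class Component:
    text: str
    kind: ComponentType

@dataclass
class AuditScores:
    # The three evaluation axes named in the paper, scored in [0, 1].
    image_text_consistency: float
    logical_coherence: float
    factual_accuracy: float

    def overall(self, w_cons=1.0, w_logic=2.0, w_fact=1.0) -> float:
        # Illustrative weighted mean; logical coherence is up-weighted
        # here only because the paper reports it as the most critical
        # axis -- the actual weighting scheme is not specified.
        total = w_cons + w_logic + w_fact
        return (w_cons * self.image_text_consistency
                + w_logic * self.logical_coherence
                + w_fact * self.factual_accuracy) / total

def audit(components: list[Component], score_fn) -> AuditScores:
    """Score each decomposed component, then average per axis.

    `score_fn(component) -> AuditScores` is a placeholder for the
    model-based judges an auditing framework would actually invoke.
    """
    per = [score_fn(c) for c in components]
    n = max(len(per), 1)
    return AuditScores(
        sum(s.image_text_consistency for s in per) / n,
        sum(s.logical_coherence for s in per) / n,
        sum(s.factual_accuracy for s in per) / n,
    )
```

A curation pipeline could then keep only samples whose `overall()` score clears a threshold, which is one plausible way a small high-quality subset would be selected.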

Abstract

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.