Synthetic Designed Experiments for Diagnosing Vision Model Failure

arXiv cs.CV / 5/5/2026


Key Points

  • The paper argues that existing synthetic data pipelines for computer vision often use an “open-loop” approach that doesn’t explicitly diagnose which scene factors drive a model’s failure modes.
  • It proposes SDRS (Synthetic Designed Experiments for Representational Sufficiency), which uses Design of Experiments methods to audit a vision model’s factor-sensitivity profile by treating the model as a black box and the generator as an experimental apparatus.
  • SDRS decomposes sensitivity using ANOVA and classifies observed failures into two actionable gap types: Type I coverage gaps (underrepresented factor levels) and Type II spurious reliance gaps (dependence on nuisance variables).
  • Across three validation experiments (dSprites bias diagnosis, procedural segmentation shortcut detection, and entanglement/cross-factor contamination detection), targeted synthetic data guided by the audit substantially improves task metrics: dSprites accuracy rises from 49.9% to 79.0%, and segmentation mIoU from 0.948 to 0.998.
  • The work also identifies an open problem for representation-level correction: per-factor invariance penalties can transfer sensitivity from one factor to another, motivating further research.
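The audit described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the model is a black box scored by a per-image error function, each factor is varied across its levels while the others are randomized, and the fraction of error variance attributable to that factor (a one-way ANOVA eta-squared) decides whether it is flagged. The `audit` and `eta_squared` names, the threshold, and the scene-dictionary interface are all our assumptions.

```python
# Hypothetical sketch of an SDRS-style factor-sensitivity audit (not the
# paper's implementation). Per the paper's taxonomy: high sensitivity to
# a task-relevant factor suggests a Type I (coverage) gap, while high
# sensitivity to a nuisance factor suggests a Type II (spurious reliance) gap.
import numpy as np

def eta_squared(errors_by_level):
    """Between-level variance / total variance (one-way ANOVA effect size)."""
    all_errors = np.concatenate(errors_by_level)
    grand_mean = all_errors.mean()
    ss_total = ((all_errors - grand_mean) ** 2).sum()
    ss_between = sum(len(e) * (e.mean() - grand_mean) ** 2
                     for e in errors_by_level)
    return ss_between / ss_total if ss_total > 0 else 0.0

def audit(model_error, factors, n_per_level=200, threshold=0.1, seed=0):
    """factors: dict name -> (levels, is_nuisance). Returns per-factor labels."""
    rng = np.random.default_rng(seed)
    report = {}
    for name, (levels, is_nuisance) in factors.items():
        errors_by_level = []
        for level in levels:
            # Hold the audited factor fixed; randomize all the others.
            scenes = [{n: (level if n == name else rng.choice(f[0]))
                       for n, f in factors.items()}
                      for _ in range(n_per_level)]
            errors_by_level.append(np.array([model_error(s) for s in scenes]))
        eta2 = eta_squared(errors_by_level)
        if eta2 < threshold:
            report[name] = ("ok", eta2)
        else:
            gap = ("Type II (spurious reliance)" if is_nuisance
                   else "Type I (coverage)")
            report[name] = (gap, eta2)
    return report
```

For example, a toy model whose error depends only on a nuisance `texture` factor would have `texture` flagged as a Type II gap while a task factor such as `shape` passes the audit.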

Abstract

Current synthetic data pipelines for computer vision generate images without diagnosing what the downstream model actually needs. This open-loop paradigm treats synthetic data as cheap real data, randomly sampling the generator's output space and hoping to cover the model's failure modes. We argue this fundamentally misuses synthetic data's unique property: the controllable, independent variation of scene factors. Drawing on the statistical theory of Design of Experiments (DoE), we propose Synthetic Designed Experiments for Representational Sufficiency (SDRS). SDRS treats the downstream model as a black-box system and the synthetic generator as an experimental apparatus. Using fractional factorial designs, SDRS efficiently audits a model's factor-sensitivity profile via ANOVA decomposition. It classifies failures into two actionable types: Type I gaps (coverage failures on underrepresented factor levels) and Type II gaps (reliance on spurious nuisance dependencies). The audit then prescribes targeted synthetic data to address each gap type. We validate SDRS on three experiments: (1) a controlled diagnostic on dSprites with planted biases, where the audit correctly identifies both gap types and targeted data improves accuracy from 49.9% to 79.0%; (2) a dense segmentation task on procedural scenes, where detecting background-complexity shortcuts and applying targeted data improves mIoU from 0.948 to 0.998; and (3) an entanglement detection experiment showing that the ANOVA audit identifies cross-factor contamination in imperfect generators. Finally, we show that per-factor invariance penalties can transfer sensitivity between factors, identifying an open problem for representation-level correction.
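The fractional factorial designs the abstract mentions are the classical DoE device for probing many factors with few runs. As a minimal sketch (our construction, not necessarily the design family the paper uses), a two-level half-fraction of a full 2^k factorial fixes the last factor to the product of the others, halving the number of rendered scenes while keeping main effects estimable:

```python
# Sketch of a 2^(k-1) half-fraction factorial design over k two-level
# factors, using the defining relation X_k = X_1 * ... * X_{k-1}.
# Each run assigns a +1/-1 (high/low) level to every scene factor; one
# image per run, plus an ANOVA fit of main effects, estimates factor
# sensitivities with half the renders of a full factorial.
from itertools import product

def half_fraction(k):
    """Return the 2^(k-1) runs of a half-fraction design for k factors."""
    runs = []
    for base in product([-1, 1], repeat=k - 1):
        aliased = 1
        for level in base:
            aliased *= level  # defining relation for the last factor
        runs.append(base + (aliased,))
    return runs

design = half_fraction(4)  # 8 runs instead of the 16 of a full 2^4 design
```

A consequence of the defining relation is that the product of levels in every run equals +1, which is exactly the aliasing pattern that trades a halved run count for confounding the highest-order interaction with the mean.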