WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

arXiv cs.AI / 5/1/2026


Key Points

  • WaferSAGE is a wafer-defect visual question answering framework that uses small vision-language models to perform domain-specific semiconductor inspection tasks.
  • To overcome scarce labeled data, it introduces a three-stage synthetic data pipeline that cleans noisy labels, generates detailed defect descriptions, and converts them into rubric criteria that guide both VQA-pair synthesis and evaluation.
  • The framework uses a dual assessment approach that combines rule-based metrics with LLM-Judge scores, aligning them via Bayesian optimization for more reliable automated evaluation.
  • It applies curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, enabling a 4B-parameter Qwen3-VL model to reach an LLM-Judge score of 6.493, close to Gemini-3-Flash (7.149), while remaining suitable for fully on-premise deployment.
  • The authors argue that well-trained small, domain-specific models can outperform proprietary large models in specialized industrial visual understanding, supporting privacy-preserving and cost-effective deployment.
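The summary does not say which clustering algorithm or features the label-cleaning stage uses. The sketch below assumes one plausible variant: run k-means with one cluster per defect class, seed each centroid at its class mean, and drop every sample whose label disagrees with the majority label of the cluster it lands in. The feature vectors (in practice perhaps resized, flattened wafer maps), the function names `kmeans` and `clean_labels`, and the toy data are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence

Point = Sequence[float]


def _sqdist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(rows: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def kmeans(points: List[Point], centroids: List[Point], iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means; returns the cluster index of every point."""
    centroids = [list(c) for c in centroids]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for each point.
        assign = [
            min(range(len(centroids)), key=lambda k: _sqdist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new = []
        for k, c in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == k]
            new.append(_mean(members) if members else c)
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return assign


def clean_labels(points: List[Point], labels: List[str]) -> List[bool]:
    """Hypothetical cleaning rule: return a keep-mask where False marks samples
    whose label disagrees with the majority label of their cluster."""
    classes = sorted(set(labels))
    # One centroid per class, seeded at that class's (possibly noisy) mean.
    init = [_mean([p for p, y in zip(points, labels) if y == c]) for c in classes]
    assign = kmeans(points, init)
    majority = {
        k: Counter(y for y, a in zip(labels, assign) if a == k).most_common(1)[0][0]
        for k in set(assign)
    }
    return [y == majority[a] for y, a in zip(labels, assign)]


# Toy 2-D features: the last point sits in the "Center" group but is labeled "Edge-Ring".
pts = [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2], [0.1, -0.2],
       [10.0, 10.0], [10.2, 9.9], [9.8, 10.1], [10.1, 10.2], [0.05, 0.05]]
labels = ["Center"] * 4 + ["Edge-Ring"] * 5
print(clean_labels(pts, labels))  # last entry is False
```

Seeding centroids from class means keeps the cluster-to-class correspondence stable across runs; the rule only helps if the chosen features actually separate defect classes, which raw pixels may not.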
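The summary says rule-based metrics are aligned with LLM-Judge scores via Bayesian optimization but does not give the objective. One plausible reading, assumed here, is to search for metric weights whose weighted sum correlates best (Pearson) with judge scores on a calibration set. To stay dependency-free, this sketch swaps the Bayesian optimizer for an exhaustive grid search over weights on the simplex; a Bayesian optimizer would instead propose candidate weights one at a time from a surrogate model, which matters when each evaluation is expensive. The metric names, scores, and functions are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def combine(weights: Sequence[float], rows: List[Sequence[float]]) -> List[float]:
    """Weighted sum of the rule-based metrics for each answer."""
    return [sum(w * m for w, m in zip(weights, row)) for row in rows]


def align_weights(rows, judge, steps: int = 10) -> Tuple[List[float], float]:
    """Grid search (stand-in for Bayesian optimization) over non-negative
    weights summing to 1, maximizing correlation with LLM-Judge scores."""
    best_w: List[float] = []
    best_r = float("-inf")
    for combo in product(range(steps + 1), repeat=len(rows[0])):
        if sum(combo) != steps:  # keep only points on the simplex
            continue
        w = [c / steps for c in combo]
        r = pearson(combine(w, rows), judge)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r


# Hypothetical calibration set: [keyword_recall, rubric_coverage] per answer,
# with judge scores on a 0-10 scale that happen to weight them 0.7 / 0.3.
rows = [[0.9, 0.2], [0.4, 0.8], [0.7, 0.5], [0.1, 0.3], [0.6, 0.9]]
judge = [10 * (0.7 * a + 0.3 * b) for a, b in rows]
weights, r = align_weights(rows, judge)
print(weights, round(r, 4))  # [0.7, 0.3] 1.0
```

Pearson correlation ignores scale and offset, so the aligned composite tracks the judge's ranking and linear trend; mapping it onto the judge's own scale would need a separate affine fit.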

Abstract

We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from a limited set of labeled wafer maps, we employ clustering-based cleaning to filter label noise, then use vision-language models to generate comprehensive defect descriptions, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.
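In the abstract, each defect description becomes a set of rubric criteria covering defect type, spatial distribution, morphology, and root cause, and those criteria later shape the RL reward. The paper's exact scoring rule is not given here; this sketch assumes the simplest version, a weighted fraction of satisfied criteria, and checks each criterion with case-insensitive keyword matching. The `Criterion` class, its weights, and the keywords are hypothetical placeholders, not taken from the paper; a real system might ask an LLM judge whether each criterion is met instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One rubric item (hypothetical): met if any keyword occurs in the answer."""
    name: str
    keywords: List[str]
    weight: float = 1.0

    def satisfied(self, answer: str) -> bool:
        text = answer.lower()
        return any(k.lower() in text for k in self.keywords)


def rubric_reward(answer: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of rubric criteria the answer satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in rubric if c.satisfied(answer))
    return earned / total


# Illustrative rubric for one Edge-Ring wafer map; keywords are placeholders.
rubric = [
    Criterion("defect type", ["edge-ring", "edge ring"], weight=2.0),
    Criterion("spatial distribution", ["periphery", "outer edge"]),
    Criterion("morphology", ["annular", "ring-shaped"]),
    Criterion("root cause", ["process", "equipment"]),
]
answer = "An Edge-Ring pattern: failing dies form an annular band along the wafer periphery."
print(rubric_reward(answer, rubric))  # 0.8 (root cause not mentioned)
```

Weighting the defect-type criterion more heavily is only an example of how a rubric could encode priorities; keyword matching is brittle to paraphrase, which is one reason to pair rule-based checks with an LLM judge.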
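GSPO, as introduced by the Qwen team, computes one importance ratio per sampled response rather than per token: the length-normalized likelihood ratio s_i = (pi_theta(y_i | x) / pi_old(y_i | x))^(1 / |y_i|), combined with group-normalized advantages and PPO-style clipping. The sketch below only evaluates that clipped objective for one group of G responses from per-token log-probabilities. It computes no gradients (a real trainer would use an autodiff framework), the clip range of 0.2 is a placeholder rather than a value from GSPO or WaferSAGE, the rewards could come from any scorer such as a rubric-based one, and the curriculum ordering of training prompts is not modeled.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within the group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(logp_new: Sequence[float], logp_old: Sequence[float]) -> float:
    """Length-normalized sequence importance ratio:
    exp((1/|y|) * sum_t (log pi_new(y_t) - log pi_old(y_t)))."""
    diff = sum(n - o for n, o in zip(logp_new, logp_old))
    return math.exp(diff / len(logp_new))


def gspo_objective(logps_new, logps_old, rewards, clip_eps: float = 0.2) -> float:
    """Clipped GSPO surrogate for one group (to be maximized).

    logps_new / logps_old: per-token log-probs of each of the G responses
    under the current and old policy; rewards: one scalar per response.
    """
    advantages = group_advantages(rewards)
    terms = []
    for new, old, adv in zip(logps_new, logps_old, advantages):
        s = sequence_ratio(new, old)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (PPO-style) choice between unclipped and clipped terms.
        terms.append(min(s * adv, s_clipped * adv))
    return mean(terms)


# Two one-token responses: the first is favored by the new policy and earned more reward.
new = [[0.0], [-1.0]]
old = [[-1.0], [-1.0]]
print(round(gspo_objective(new, old, rewards=[1.0, 0.0]), 4))  # 0.1: ratio e is clipped to 1.2
```

Normalizing the log-ratio by response length keeps long and short answers on a comparable scale, and clipping the whole sequence at once means a response is either kept in or dropped from the update as a unit.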