Diagnosing Capability Gaps in Fine-Tuning Data
arXiv cs.LG / 5/1/2026
Key Points
- The paper introduces GoalCover, a framework to diagnose capability gaps in fine-tuning datasets before running expensive LLM training by decomposing goals into atomic subgoals and assessing coverage.
- GoalCover assigns LLM-based alignment scores to training samples for each subgoal and uses low-scoring sample explanations to surface which capabilities are missing.
- Controlled corruption experiments on medical QA, legal summarization, and code generation show that GoalCover reliably separates targeted capability degradation from non-targeted impacts (25.6% vs. 2.1% average degradation; Cohen's d = 1.24).
- In a financial-summarization reinforcement fine-tuning task with Qwen-3-14B, filtering data via GoalCover raises LLM-judge reward from 3.77 to 4.12, and combining the filtered data with goal-conditioned synthetic samples performs best (4.20).
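The diagnostic loop described in the key points can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `judge` stands in for the LLM-based alignment scorer, and the function and threshold names are assumptions for the sake of the example.

```python
# Hypothetical sketch of the GoalCover idea: score every training sample
# against each atomic subgoal, then flag subgoals whose coverage (fraction
# of well-aligned samples) falls below a threshold. The paper uses an
# LLM judge; here `judge` is any callable (sample, subgoal) -> score.

def subgoal_coverage(samples, subgoals, judge, min_score=0.5):
    """Return {subgoal: coverage}, where coverage is the fraction of
    samples whose judge score for that subgoal meets min_score."""
    coverage = {}
    for sg in subgoals:
        scores = [judge(s, sg) for s in samples]
        coverage[sg] = sum(sc >= min_score for sc in scores) / len(scores)
    return coverage

def capability_gaps(coverage, threshold=0.3):
    """Subgoals with coverage below threshold, i.e. likely missing
    capabilities in the fine-tuning dataset."""
    return [sg for sg, cov in coverage.items() if cov < threshold]

# Toy usage with a keyword-matching stand-in for the LLM judge:
samples = ["revenue grew 10% year over year",
           "EBITDA margin improved in Q3",
           "net income fell on higher costs"]
keywords = {"revenue analysis": "revenue", "risk disclosure": "risk"}
judge = lambda s, sg: 1.0 if keywords[sg] in s else 0.0

cov = subgoal_coverage(samples, list(keywords), judge)
gaps = capability_gaps(cov)  # "risk disclosure" has zero coverage
```

A real deployment would replace `judge` with prompted LLM scoring and keep the low-scoring samples' explanations, which is how the framework surfaces *which* capabilities are missing rather than just that coverage is low.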