Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

arXiv cs.LG / 4/14/2026


Key Points

  • Skill-SD is proposed as a framework that uses self-distillation to compensate for the poor sample efficiency of reinforcement learning in multi-turn LLM agent training (sparse rewards, long horizons).
  • Existing OPSD (token-level supervision from a fixed privileged teacher) struggles to represent the diverse valid strategies in agent tasks, and naively combining it with RL tends to cause training collapse.
  • Skill-SD summarizes completed agent trajectories into natural-language "skills" and supplies them as dynamic privileged information only to the teacher; the student always acts under the plain task prompt and internalizes the guidance through distillation.
  • To stabilize training, a token-level distillation loss with gradient correction based on importance-weighted reverse KL is introduced, and the teacher is dynamically synchronized with the improving student.
  • On agentic benchmarks, large gains over vanilla GRPO and vanilla OPD are reported on AppWorld and Sokoban.
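The importance-weighted reverse-KL objective mentioned above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the function names, the clipping constant, and the use of a stored behavior-policy probability to form the importance ratio are all assumptions.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Per-token reverse KL: D_KL(student || teacher) = sum_i p_i log(p_i / q_i)."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def skill_sd_loss(student_dists, teacher_dists, sampled_tokens, behavior_probs):
    """Average importance-weighted reverse KL over a trajectory's tokens.

    student_dists / teacher_dists: per-token vocabulary distributions.
    sampled_tokens: token ids actually sampled when the rollout was collected.
    behavior_probs: probability the (possibly stale) rollout policy assigned to
    each sampled token; the ratio corrects for off-policy drift (hypothetical).
    """
    total = 0.0
    for s_dist, t_dist, tok, b_p in zip(
            student_dists, teacher_dists, sampled_tokens, behavior_probs):
        w = s_dist[tok] / b_p   # importance ratio
        w = min(w, 2.0)         # simple clipping for stability (assumed)
        total += w * reverse_kl(s_dist, t_dist)
    return total / len(sampled_tokens)
```

When student and teacher distributions coincide the loss is zero; otherwise the reverse KL pulls the student toward modes the skill-conditioned teacher already prefers, with each token's contribution reweighted by how likely the student is to have produced it.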

Abstract

Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent's own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill-sd/
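The teacher/student conditioning asymmetry described in the abstract can be sketched as below. The prompt template and function name are hypothetical illustrations, not the paper's code; only the asymmetry itself (skill text reaches the teacher, never the student) comes from the source.

```python
def build_prompts(task_prompt, skill_summary):
    """Teacher sees the task plus a distilled skill; student sees the task only.

    skill_summary: a natural-language summary of successful behaviors,
    mistakes, and workflows extracted from completed trajectories.
    The bracketed header below is an assumed template, not the paper's.
    """
    teacher_prompt = (
        f"{task_prompt}\n\n"
        f"[Privileged skill hint - training only]\n{skill_summary}"
    )
    student_prompt = task_prompt  # plain task prompt, no privileged info
    return teacher_prompt, student_prompt
```

Because the skill text conditions only the teacher, it can be regenerated from fresh trajectories every round (the "dynamic" privileged information), while the student's deployment-time input distribution is never contaminated by training-only hints.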