SODA: Semi On-Policy Black-Box Distillation for Large Language Models

Key Points

The paper introduces SODA, a semi on-policy black-box distillation method that aims to resolve the trade-off between error-correcting on-policy approaches and unstable/expensive adversarial distillation.

Abstract

Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.

SODA: Semi On-Policy Black-Box Distillation for Large Language Models

Key Points

Abstract

Related Articles

Human-Aligned Decision Transformers for satellite anomaly response operations with ethical auditability baked in

That Smoking-Gun Video? It's Not Evidence. It's a Suspect.

AI Citation Registries and Website-Based Publishing Constraints

Amazon S3 Files: The End of the Object vs. File War (And Why It Matters in the AI Agent Era)

大模型价格战2025：谁在烧钱谁在赚？深度解析AI成本暴跌背后的生死博弈

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Related Articles

Human-Aligned Decision Transformers for satellite anomaly response operations with ethical auditability baked in
Dev.to

That Smoking-Gun Video? It's Not Evidence. It's a Suspect.
Dev.to

AI Citation Registries and Website-Based Publishing Constraints
Dev.to

Amazon S3 Files: The End of the Object vs. File War (And Why It Matters in the AI Agent Era)
Dev.to

大模型价格战2025：谁在烧钱谁在赚？深度解析AI成本暴跌背后的生死博弈
Dev.to