AI Navigate

Knowledge Distillation for Large Language Models

arXiv cs.CL / 3/17/2026


Key Points

  • The paper proposes a resource-efficient framework for compressing large language models via knowledge distillation combined with guided chain-of-thought reinforcement learning, using Qwen 3B as the teacher and Qwen 0.5B as the student.
  • It applies distillation across English Dolly-15k, Spanish Dolly-15k, and code datasets BugNet and PyTorrent, with English-tuned hyperparameters, achieving 70-91% of the teacher's performance in English, up to 95% in Spanish, and up to 93.5% Rouge-L on code.
  • For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization on CoT-annotated Codeforces data improves reasoning coherence and solution correctness versus knowledge distillation alone.
  • Post-training 4-bit weight quantization further reduces memory footprint and inference latency, enabling deployment in resource-constrained settings.
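The summary does not spell out the distillation objective, but the standard formulation for this kind of teacher-student setup is a KL-divergence loss between temperature-softened teacher and student distributions (Hinton et al.'s formulation). The sketch below is illustrative only; the function names and the temperature value are assumptions, not details from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a consistent magnitude as T varies."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return (temperature ** 2) * kl.mean()
```

In practice this term is usually mixed with an ordinary cross-entropy loss on the ground-truth labels; the mixing weight, like the temperature, is one of the hyperparameters the paper reports tuning in the English setting.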

Abstract

We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and the code datasets BugNet and PyTorrent, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% of its performance in English, up to 95% in Spanish, and up to 93.5% Rouge-L on code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
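The abstract mentions post-training 4-bit weight quantization but not the specific scheme. A minimal sketch of one common variant, symmetric per-tensor round-to-nearest quantization, shows where the memory savings come from: each weight is stored as a 4-bit integer in [-8, 7] plus one shared scale. The function names are hypothetical, and real 4-bit schemes (e.g. group-wise scales) are more elaborate.

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric per-tensor 4-bit quantization: map floats to integer
    levels in [-8, 7] using a single shared scale factor.
    Assumes the tensor is not all zeros."""
    scale = np.abs(weights).max() / 7.0  # 7 = largest positive 4-bit level
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 4-bit levels and the scale."""
    return q.astype(np.float32) * scale
```

Storing 4-bit levels instead of 16- or 32-bit floats cuts the weight memory by 4-8x, at the cost of a bounded rounding error of at most half a quantization step per weight, which is why the summary pairs quantization with distillation rather than relying on it alone.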