Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

arXiv cs.CL / 4/6/2026


Key Points

  • The paper proposes a reinforcement-learning (RL) framework for label-free knowledge distillation where an LLM judge generates training rewards from unlabeled data.
  • Unlike prior RL distillation methods, which require verifiable rewards derived from ground-truth labels, the judge supplies a reward signal without any labeled supervision.
  • The judge is designed to output a single token, reducing the compute cost of reward computation and making large-scale training more practical.
  • Experiments indicate that combining the LLM-judge rewards with verifiable rewards produces substantial gains on math reasoning benchmarks.
  • The authors conclude that LLM-based evaluators can serve as effective training signals for RL fine-tuning, potentially broadening how supervision is obtained.
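
The single-token judging idea in the third bullet can be sketched as follows. This is a toy illustration, not the paper's implementation: `judge_logits` is a hypothetical stand-in for one forward pass of a judge LLM prompted to answer "yes" or "no", and the reward is read directly from the logits of that single verdict token, so no multi-token critique ever needs to be decoded.

```python
import math

YES, NO = 0, 1  # hypothetical token ids for the judge's "yes" / "no" verdict


def judge_logits(prompt: str, answer: str) -> dict[int, float]:
    # Toy stub standing in for a judge LLM forward pass.
    # A real system would score the pair (prompt, answer) with the judge
    # model and return the next-token logits at the verdict position.
    score = 2.0 if "42" in answer else -2.0
    return {YES: score, NO: -score}


def single_token_reward(prompt: str, answer: str) -> float:
    # Reward = P(judge emits "yes"), obtained from a single forward
    # pass by applying softmax over just the two verdict-token logits.
    logits = judge_logits(prompt, answer)
    z_yes, z_no = logits[YES], logits[NO]
    m = max(z_yes, z_no)  # subtract max for numerical stability
    e_yes = math.exp(z_yes - m)
    e_no = math.exp(z_no - m)
    return e_yes / (e_yes + e_no)
```

Because the reward is a probability in [0, 1] computed from one token, it plugs into standard RL fine-tuning loops (e.g. as the scalar return for a sampled completion) at a fraction of the cost of decoding a free-form judge critique.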

Abstract

Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM acting as a judge, evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and replacing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.