Task-Specific Knowledge Distillation via Intermediate Probes
arXiv cs.AI / 3/16/2026
Key Points
- The proposed method distills knowledge by training lightweight probes on frozen teacher hidden states and using the probes' predictions, rather than the teacher's output logits, as supervision for the student (see the sketch after this list).
- Probes on intermediate representations provide cleaner labels, effectively denoising the distillation signal and bypassing brittle vocabulary-projection and answer-token-selection steps.
- The approach yields consistent improvements across four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with gains most pronounced when data is scarce.
- It requires no architectural changes to either student or teacher, is architecture-agnostic, and adds minimal compute since probe training is cheap and teacher representations can be cached.
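
Below is a minimal sketch of how the pipeline could look in PyTorch, assuming a classification-style task with gold labels available for probe training. The dimensions, the linear probe, the temperature-scaled KL loss, and the helper names (`train_probe`, `distill_step`) are illustrative placeholders, not the paper's implementation; the point is only the two-stage flow: fit a cheap probe on cached teacher hidden states, then supervise the student with the probe's soft predictions instead of teacher logits.

```python
# Sketch of probe-based distillation: train a probe on frozen teacher
# hidden states, then distill the student against the probe's predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEACHER_DIM, STUDENT_DIM, NUM_CLASSES = 1024, 256, 4  # illustrative sizes

# --- Step 1: lightweight probe on frozen, cached teacher hidden states.
probe = nn.Linear(TEACHER_DIM, NUM_CLASSES)
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_probe(cached_hidden, labels, epochs=3):
    """cached_hidden: (N, TEACHER_DIM) teacher states at the chosen layer,
    precomputed once so the teacher never runs again during distillation."""
    for _ in range(epochs):
        probe_opt.zero_grad()
        loss = F.cross_entropy(probe(cached_hidden), labels)
        loss.backward()
        probe_opt.step()

# --- Step 2: distill the student from the probe's soft predictions,
# bypassing the teacher's vocabulary projection / answer-token selection.
student = nn.Sequential(nn.Linear(STUDENT_DIM, STUDENT_DIM), nn.ReLU(),
                        nn.Linear(STUDENT_DIM, NUM_CLASSES))
student_opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(student_inputs, cached_hidden, temperature=2.0):
    with torch.no_grad():  # probe supplies supervision; no gradient flows to it
        soft_targets = F.softmax(probe(cached_hidden) / temperature, dim=-1)
    student_logits = student(student_inputs)
    loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    soft_targets, reduction="batchmean") * temperature ** 2
    student_opt.zero_grad()
    loss.backward()
    student_opt.step()
    return loss.item()

if __name__ == "__main__":
    # Tiny synthetic demo so the sketch runs end to end.
    N = 32
    cached_hidden = torch.randn(N, TEACHER_DIM)    # stand-in for cached teacher states
    labels = torch.randint(0, NUM_CLASSES, (N,))
    student_inputs = torch.randn(N, STUDENT_DIM)   # stand-in for student features
    train_probe(cached_hidden, labels)
    print("distill loss:", distill_step(student_inputs, cached_hidden))
```

Because the teacher's hidden states are cached once and the probe is a single linear layer, the extra cost over standard fine-tuning is small, which matches the paper's claim of minimal added compute.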
