Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

arXiv cs.LG / 4/7/2026

Key Points

  • The paper argues that knowledge distillation’s persistent performance “loss floor” is fundamentally geometric, driven by how neural networks use superposition to represent many features with limited width.

Abstract

Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/\left((1-\alpha)\ln\frac{1}{1-\alpha}\right)$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution's long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.
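
The capacity arithmetic in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the critical width is assumed to be the width at which the budget d_S · g(α) equals the feature count F, and the floor function is only one plausible reading of "importance-weighted loss floor" (the summed importance of the features that fall outside the budget).

```python
import math


def capacity_factor(alpha: float) -> float:
    """g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha))), as stated in the abstract."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))


def feature_budget(d_s: int, alpha: float) -> float:
    """Upper bound on the number of features a width-d_s student can encode."""
    return d_s * capacity_factor(alpha)


def critical_width(num_features: float, alpha: float) -> float:
    """Assumption: the width at which the budget equals the feature count F."""
    return num_features / capacity_factor(alpha)


def fraction_lost(d_s: int, num_features: float, alpha: float) -> float:
    """Share of the F features that exceed the student's budget (0 if all fit)."""
    return max(0.0, 1.0 - feature_budget(d_s, alpha) / num_features)


def importance_weighted_floor(importances, d_s: int, alpha: float) -> float:
    """Assumed reading of the floor: keep the most important features up to
    the budget and sum the importance of everything that does not fit."""
    budget = int(feature_budget(d_s, alpha))
    ranked = sorted(importances, reverse=True)
    return sum(ranked[budget:])


if __name__ == "__main__":
    F, ALPHA = 28_700, 0.992  # rounded values quoted in the abstract
    print("g(alpha):", capacity_factor(ALPHA))
    print("critical width:", critical_width(F, ALPHA))
    # 128 is a hypothetical student width chosen for illustration only.
    print("fraction lost at width 128:", fraction_lost(128, F, ALPHA))
```

With the rounded α of 0.992 this sketch gives a critical width of roughly 1,100, slightly above the reported 1,065; the gap is presumably due to rounding of α, since g(α) changes quickly as α approaches 1.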