Cross-Tokenizer LLM Distillation through a Byte-Level Interface
arXiv cs.CL / 4/10/2026
Key Points
- The paper addresses cross-tokenizer LLM distillation (CTD), where a teacher and student use different tokenizers, noting that prior work often depends on complex heuristic vocabulary alignment.
- It introduces Byte-Level Distillation (BLD), a baseline that aligns the teacher and student by converting the teacher’s output distribution into byte-level probabilities.
- The method adds a lightweight byte-level decoder head to the student and performs distillation through a shared byte-level interface to enable knowledge transfer without tokenizer matching.
- Experiments show BLD is competitive with, and sometimes surpasses, more sophisticated CTD approaches across multiple distillation tasks and benchmarks using models sized from 1B to 8B parameters.
- Despite strong results, the authors conclude that consistent improvements across all tasks and benchmarks are still elusive, reinforcing that CTD remains an open research problem.
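The core alignment step described above, converting a teacher's token-level output distribution into byte-level probabilities, can be sketched as follows. This is a minimal illustration under assumed simplifications (marginalizing only over the first byte of each token's UTF-8 encoding), not the paper's actual implementation; the function and variable names are hypothetical:

```python
def token_to_byte_probs(token_probs, vocab):
    """Marginalize a next-token distribution into a next-byte distribution.

    Each token's probability mass is assigned to the first byte of its
    UTF-8 encoding, so tokens from any tokenizer map onto a shared
    256-way byte vocabulary that a student's byte-level head can match.
    """
    byte_probs = [0.0] * 256
    for tok, p in zip(vocab, token_probs):
        encoded = tok.encode("utf-8")
        if encoded:  # skip special/empty tokens
            byte_probs[encoded[0]] += p
    return byte_probs


# Toy example: two tokens share the leading byte "a", so their mass merges.
vocab = ["a", "ab", "b"]
probs = [0.5, 0.3, 0.2]
byte_dist = token_to_byte_probs(probs, vocab)
```

Because both teacher and student distributions land in the same 256-dimensional byte space, a standard distillation loss (e.g. KL divergence) can then be applied without any vocabulary matching between the two tokenizers.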