Cross-Tokenizer LLM Distillation through a Byte-Level Interface

arXiv cs.CL / 4/10/2026


Key Points

  • The paper addresses cross-tokenizer LLM distillation (CTD), where a teacher and student use different tokenizers, noting that prior work often depends on complex heuristic vocabulary alignment.
  • It introduces Byte-Level Distillation (BLD), a baseline that aligns the teacher and student by converting the teacher’s output distribution into byte-level probabilities.
  • The method adds a lightweight byte-level decoder head to the student and performs distillation through a shared byte-level interface to enable knowledge transfer without tokenizer matching.
  • Experiments show BLD is competitive with, and sometimes surpasses, more sophisticated CTD approaches across multiple distillation tasks and benchmarks using models sized from 1B to 8B parameters.
  • Despite strong results, the authors conclude that consistent improvements across all tasks and benchmarks are still elusive, reinforcing that CTD remains an open research problem.

Abstract

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
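To make the core idea concrete, here is a toy sketch of the byte-level interface. It is not the paper's implementation: it only marginalizes each model's token distribution into a distribution over the *first* next byte (summing the mass of all tokens whose UTF-8 encoding starts with that byte), whereas BLD's actual conversion, decoder head, and training setup are more involved. All names (`next_byte_distribution`, `byte_level_kl`, the toy vocabularies) are illustrative assumptions.

```python
import math
from collections import defaultdict

def next_byte_distribution(token_probs, vocab):
    """Marginalize a token-level distribution into a distribution over the
    next byte: each token contributes its probability mass to the first
    byte of its UTF-8 encoding. (Toy first-byte approximation, not BLD.)"""
    byte_probs = defaultdict(float)
    for tok_id, p in token_probs.items():
        encoded = vocab[tok_id].encode("utf-8")
        if encoded:  # skip empty/special tokens
            byte_probs[encoded[0]] += p
    return dict(byte_probs)

def byte_level_kl(teacher_bytes, student_bytes, eps=1e-12):
    """KL(teacher || student) over the shared 256-symbol byte alphabet --
    a plausible distillation loss at the byte-level interface."""
    kl = 0.0
    for b, p in teacher_bytes.items():
        q = student_bytes.get(b, 0.0)
        kl += p * math.log((p + eps) / (q + eps))
    return kl

# Toy example: teacher and student use different tokenizers, yet both
# induce comparable distributions over the next byte.
teacher_vocab = {0: "hello", 1: "help", 2: "world"}
teacher_probs = {0: 0.5, 1: 0.3, 2: 0.2}   # byte "h": 0.8, "w": 0.2

student_vocab = {0: "he", 1: "wo", 2: "x"}
student_probs = {0: 0.7, 1: 0.2, 2: 0.1}   # byte "h": 0.7, "w": 0.2, "x": 0.1

t = next_byte_distribution(teacher_probs, teacher_vocab)
s = next_byte_distribution(student_probs, student_vocab)
loss = byte_level_kl(t, s)  # small positive value: distributions are close
```

The point of the sketch is that no vocabulary alignment heuristics are needed: because every tokenizer ultimately emits bytes, both models can always be projected onto the same 256-symbol alphabet and compared there.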