AI Navigate

[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing?

Reddit r/MachineLearning / 3/16/2026

💬 Opinion · Ideas & Deep Analysis

Key Points

  • Lossless tokenization can induce any target distribution over strings via a canonical construction without reducing the model's expressiveness.
  • Under the canonical distribution, the entropy H(Q) equals H(P), meaning no extra entropy is added by tokenization.
  • In practice, models leak roughly 0.5–2% probability onto non-canonical tokenizations, and introducing this noise via techniques like BPE-Dropout can improve generalization.
  • The practical takeaway: concentrating probability on canonical tokenizations, though theoretically optimal, is not always best empirically — noisy schemes like BPE-Dropout can improve generalization.

I wrote up a short information-theoretic argument for why lossless tokenization neither restricts the expressiveness of language models nor introduces unavoidable redundancy. The key ideas:

  • Any target distribution over strings can be exactly induced by a distribution over token sequences (via the canonical construction)
  • The canonical distribution achieves H(Q) = H(P) — no extra entropy from tokenization
  • In practice, models do leak ~0.5–2% probability onto non-canonical tokenizations (Chirkova et al., 2023), and deliberately introducing this noise via BPE-Dropout can actually help generalization

https://douglasswng.github.io/why-tokens-enough/
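The canonical construction can be sketched in a few lines. This is a toy illustration under assumed names (a hypothetical greedy longest-match tokenizer and a hand-picked distribution `P`), not code from the linked write-up: because detokenization is lossless and each string maps to exactly one canonical token sequence, pushing `P(s)` onto that sequence yields a `Q` with identical entropy.

```python
import math

# Toy target distribution P over strings (assumed for illustration).
P = {"aa": 0.5, "ab": 0.25, "b": 0.25}

def canonical_tokenize(s, vocab=("aa", "ab", "a", "b")):
    """Greedy longest-match tokenization: one canonical sequence per string.

    The vocab is ordered longest-first, so e.g. "aa" becomes ("aa",)
    rather than the non-canonical ("a", "a")."""
    toks, i = [], 0
    while i < len(s):
        for v in vocab:
            if s.startswith(v, i):
                toks.append(v)
                i += len(v)
                break
    return tuple(toks)

# Canonical construction: Q puts mass P(s) on the canonical tokenization
# of s and zero everywhere else.
Q = {canonical_tokenize(s): p for s, p in P.items()}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values())

# The map s -> canonical tokens is injective (detokenization is lossless),
# so Q is just P relabeled: H(Q) = H(P), no extra entropy from tokenization.
assert abs(entropy(Q) - entropy(P)) < 1e-12
```

Any probability a model leaks onto non-canonical sequences such as `("a", "a")` is mass this construction shows it did not need to spend — which is where the 0.5–2% figure comes in.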

I'm curious whether people find this kind of formalization useful or if it's "obviously true" and not worth writing down. The practical punchline — that the theoretically optimal thing (concentrate on canonical tokenizations) isn't always best in practice (BPE-Dropout helps) — was the part I found most interesting.
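For readers unfamiliar with the mechanism, here is a minimal sketch of the BPE-Dropout idea (Provilkov et al., 2020) with an assumed toy merge table, not any production tokenizer: each applicable merge is skipped with probability `p_drop`, sampling non-canonical tokenizations that still detokenize to the same string.

```python
import random

# Toy learned merge table, in merge order (assumed for illustration).
MERGES = [("a", "a"), ("a", "b"), ("aa", "b")]

def bpe_tokenize(s, p_drop=0.0, rng=random):
    """BPE over characters; with p_drop > 0, each merge is randomly skipped
    (BPE-Dropout), yielding non-canonical but still lossless tokenizations."""
    toks = list(s)
    for left, right in MERGES:
        out, i = [], 0
        while i < len(toks):
            if (i + 1 < len(toks) and toks[i] == left
                    and toks[i + 1] == right and rng.random() >= p_drop):
                out.append(left + right)  # apply this merge
                i += 2
            else:
                out.append(toks[i])       # keep the token as-is
                i += 1
        toks = out
    return toks

# p_drop=0 recovers the canonical tokenization; p_drop>0 samples
# alternatives, but every sample concatenates back to the input string.
assert bpe_tokenize("aab") == ["aab"]
rng = random.Random(0)
for _ in range(100):
    assert "".join(bpe_tokenize("aab", p_drop=0.5, rng=rng)) == "aab"
```

The losslessness invariant in the final loop is what makes the theory above apply to the noisy tokenizer too: the noise only redistributes mass among tokenizations of the same string.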

submitted by /u/36845277