AI Navigate

[D] Lossless tokenizers lose nothing and add nothing — trivial observation or worth formalizing?

Reddit r/MachineLearning / 3/16/2026

💬 Opinion · Ideas & Deep Analysis

Key Points

  • Lossless tokenization can induce any target distribution over strings via a canonical construction without reducing the model's expressiveness.
  • Under the canonical distribution, the entropy H(Q) equals H(P), meaning no extra entropy is added by tokenization.
  • In practice, models leak roughly 0.5–2% probability onto non-canonical tokenizations, and introducing this noise via techniques like BPE-Dropout can improve generalization.
  • The practical takeaway: concentrating probability on canonical tokenizations, though theoretically optimal, is not always best empirically — noisy schemes like BPE-Dropout can improve generalization.

I wrote up a short information-theoretic argument for why lossless tokenization neither restricts the expressiveness of language models nor introduces unavoidable redundancy. The key ideas:

  • Any target distribution over strings can be exactly induced by a distribution over token sequences (via the canonical construction)
  • The canonical distribution achieves H(Q) = H(P) — no extra entropy from tokenization
  • In practice, models do leak ~0.5–2% probability onto non-canonical tokenizations (Chirkova et al., 2023), and deliberately introducing this noise via BPE-Dropout can actually help generalization

https://douglasswng.github.io/why-tokens-enough/
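The canonical construction can be sketched in a few lines. This is a toy illustration under assumed names (a hypothetical greedy longest-match tokenizer and a hand-picked distribution `P`), not code from the linked write-up: because detokenization is lossless and each string maps to exactly one canonical token sequence, pushing `P(s)` onto that sequence yields a `Q` with identical entropy.

```python
import math

# Toy target distribution P over strings (assumed for illustration).
P = {"aa": 0.5, "ab": 0.25, "b": 0.25}

def canonical_tokenize(s, vocab=("aa", "ab", "a", "b")):
    """Greedy longest-match tokenization: one canonical sequence per string.

    The vocab is ordered longest-first, so e.g. "aa" becomes ("aa",)
    rather than the non-canonical ("a", "a")."""
    toks, i = [], 0
    while i < len(s):
        for v in vocab:
            if s.startswith(v, i):
                toks.append(v)
                i += len(v)
                break
    return tuple(toks)

# Canonical construction: Q puts mass P(s) on the canonical tokenization
# of s and zero everywhere else.
Q = {canonical_tokenize(s): p for s, p in P.items()}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values())

# The map s -> canonical tokens is injective (detokenization is lossless),
# so Q is just P relabeled: H(Q) = H(P), no extra entropy from tokenization.
assert abs(entropy(Q) - entropy(P)) < 1e-12
```

Any probability a model leaks onto non-canonical sequences such as `("a", "a")` is mass this construction shows it did not need to spend — which is where the 0.5–2% figure comes in.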

I'm curious whether people find this kind of formalization useful or if it's "obviously true" and not worth writing down. The practical punchline — that the theoretically optimal thing (concentrate on canonical tokenizations) isn't always best in practice (BPE-Dropout helps) — was the part I found most interesting.
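For readers unfamiliar with the mechanism, here is a minimal sketch of the BPE-Dropout idea (Provilkov et al., 2020) with an assumed toy merge table, not any production tokenizer: each applicable merge is skipped with probability `p_drop`, sampling non-canonical tokenizations that still detokenize to the same string.

```python
import random

# Toy learned merge table, in merge order (assumed for illustration).
MERGES = [("a", "a"), ("a", "b"), ("aa", "b")]

def bpe_tokenize(s, p_drop=0.0, rng=random):
    """BPE over characters; with p_drop > 0, each merge is randomly skipped
    (BPE-Dropout), yielding non-canonical but still lossless tokenizations."""
    toks = list(s)
    for left, right in MERGES:
        out, i = [], 0
        while i < len(toks):
            if (i + 1 < len(toks) and toks[i] == left
                    and toks[i + 1] == right and rng.random() >= p_drop):
                out.append(left + right)  # apply this merge
                i += 2
            else:
                out.append(toks[i])       # keep the token as-is
                i += 1
        toks = out
    return toks

# p_drop=0 recovers the canonical tokenization; p_drop>0 samples
# alternatives, but every sample concatenates back to the input string.
assert bpe_tokenize("aab") == ["aab"]
rng = random.Random(0)
for _ in range(100):
    assert "".join(bpe_tokenize("aab", p_drop=0.5, rng=rng)) == "aab"
```

The losslessness invariant in the final loop is what makes the theory above apply to the noisy tokenizer too: the noise only redistributes mass among tokenizations of the same string.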

submitted by /u/36845277