Stochasticity in Tokenisation Improves Robustness
arXiv cs.CL / 4/20/2026
Key Points
- The paper argues that deterministic canonical tokenisation makes LLMs brittle under perturbations and adversarial tokenisation attacks, while stochastic tokenisation can improve internal stability.
- It systematically evaluates stochastic tokenisation across multiple learning regimes (pre-training, supervised fine-tuning, and in-context learning), datasets, and model architectures, focusing on robustness to both adversarial and random perturbations.
- Training with uniformly sampled stochastic tokenisations during pre-training and fine-tuning improves robustness against both random and adversarial perturbations.
- When a canonically trained Llama-1b model is evaluated on uniformly sampled non-canonical tokenisations, its accuracy drops by 29.8%, highlighting how sensitive models are to tokenisation choices.
- The authors report that using stochastic tokenisation during training preserves accuracy without increasing inference cost, suggesting a practical robustness gain.
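To make the idea of non-canonical tokenisations concrete, here is a minimal sketch of stochastic subword segmentation via merge dropout (in the style of BPE-dropout), which is one common way to sample alternative tokenisations. The toy merge table and `tokenise` function are illustrative assumptions, not the paper's actual sampling scheme, which draws tokenisations uniformly.

```python
import random

# Toy BPE merge table in rank order. A real tokeniser learns this from
# data; this tiny table is a hypothetical illustration.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def tokenise(word, dropout=0.0, rng=random):
    """Greedy BPE with merge dropout: each applicable merge is skipped
    with probability `dropout`, so dropout=0.0 yields the canonical
    segmentation and dropout>0 yields random non-canonical ones."""
    tokens = list(word)
    for a, b in MERGES:  # apply merges in rank order
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b and rng.random() >= dropout:
                tokens[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return tokens

print(tokenise("lower", dropout=0.0))  # canonical: ['low', 'er']
print(tokenise("lower", dropout=0.5))  # e.g. ['l', 'o', 'w', 'er']
```

Training on segmentations sampled this way exposes the model to many tokenisations of the same string, which is the mechanism the paper credits for the robustness gains.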