When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations
arXiv cs.CL · April 21, 2026
Key Points
- The study finds that informal surface forms can substantially degrade NLI accuracy, with the severity depending on whether the root cause is tokenizer failure or distribution shift; experiments cover ELECTRA-small and RoBERTa-large on SNLI and MultiNLI.
- Slang substitution causes only minor degradation (up to 1.1pp) because most slang tokens remain within WordPiece coverage and therefore do not cause major signal loss.
- Emoji replacement is a severe failure mode because content words become [UNK] after WordPiece tokenization, with most emoji examples containing at least one [UNK], effectively erasing input information before the model processes it.
- Noise tokens such as Gen-Z fillers are fully in-vocabulary but still hurt accuracy because they are absent from NLI training data; the paper shows that the right mitigation differs by failure mode: preprocessing normalization for emojis, data augmentation for noise.
- Combining both preprocessing and augmentation yields large gains on mixed variants (e.g., ELECTRA on SNLI improves from 75.88% to 88.93%) while remaining competitive against GPT-4o-mini zero-shot.
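The two targeted mitigations above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: it maps emojis to their Unicode names (so a WordPiece vocabulary sees ordinary English words instead of [UNK]) and splices filler phrases into training sentences as augmentation. The emoji heuristic and the `FILLERS` inventory are assumptions for illustration only.

```python
import random
import unicodedata

def normalize_emojis(text: str) -> str:
    """Preprocessing mitigation: replace each emoji with its lowercase
    Unicode name, turning would-be [UNK] tokens into in-vocabulary words."""
    pieces = []
    for ch in text:
        cp = ord(ch)
        # Rough emoji heuristic (an assumption): supplementary-plane
        # symbols plus the Miscellaneous Symbols / Dingbats blocks.
        if cp >= 0x1F000 or 0x2600 <= cp <= 0x27BF:
            try:
                pieces.append(" " + unicodedata.name(ch).lower() + " ")
            except ValueError:
                pieces.append(" ")  # unnamed codepoint: drop it
        else:
            pieces.append(ch)
    return " ".join("".join(pieces).split())

# Hypothetical filler inventory; the paper's actual set is not reproduced here.
FILLERS = ["fr fr", "no cap", "lowkey"]

def add_filler_noise(sentence: str, rng: random.Random, p: float = 0.5) -> str:
    """Augmentation mitigation: with probability p, insert a filler phrase
    at a random position so the model sees noise tokens during training."""
    if rng.random() < p:
        words = sentence.split()
        i = rng.randrange(len(words) + 1)
        words.insert(i, rng.choice(FILLERS))
        return " ".join(words)
    return sentence
```

For example, `normalize_emojis("the dog is 😀")` yields `"the dog is grinning face"`, which a WordPiece tokenizer can segment normally instead of emitting [UNK].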