When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations

arXiv cs.CL / 4/21/2026


Key Points

  • The study finds that informal surface forms can substantially degrade NLI accuracy, with the severity depending on whether the underlying issue is tokenizer failure or distribution shift; experiments cover ELECTRA-small and RoBERTa-large on SNLI and MultiNLI.
  • Slang substitution causes only minor degradation (up to 1.1pp) because most slang tokens remain within WordPiece coverage, so little input signal is lost.
  • Emoji replacement is a severe failure mode because content words become [UNK] after WordPiece tokenization, with most emoji examples containing at least one [UNK], effectively erasing input information before the model processes it.
  • Noise tokens like Gen-Z fillers are fully in-vocabulary but still hurt accuracy because they are absent from NLI training data. Accordingly, the paper shows the targeted mitigations differ by failure mode: preprocessing normalization for emojis, data augmentation for noise.
  • Combining both preprocessing and augmentation yields large gains on mixed variants (e.g., ELECTRA on SNLI improves from 75.88% to 88.93%) while remaining competitive against GPT-4o-mini zero-shot.
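The two tokenizer-level outcomes above can be illustrated with a toy sketch (not the paper's code): a minimal whole-word lookup against a tiny hypothetical vocabulary, standing in for WordPiece, shows why in-vocabulary slang survives while emoji collapse to [UNK] before the model sees them.

```python
# Toy stand-in for a WordPiece vocabulary (hypothetical; real WordPiece
# also does subword splitting, which this sketch omits).
VOCAB = {"the", "dog", "is", "happy", "gonna", "homie", "[UNK]"}

def tokenize(text):
    """Map each whitespace token to itself if in-vocabulary, else [UNK]."""
    return [tok if tok in VOCAB else "[UNK]" for tok in text.lower().split()]

print(tokenize("the dog is happy"))  # clean text: fully covered
print(tokenize("the dog is 😊"))      # emoji variant: content word erased
```

The first call returns `['the', 'dog', 'is', 'happy']`; the second returns `['the', 'dog', 'is', '[UNK]']`, the signal-destruction pattern the paper measures (93.6% of emoji examples contain at least one [UNK]).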

Abstract

We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M) and RoBERTa-large (355M) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., "going to" -> "gonna", "friend" -> "homie") causes minimal degradation (at most 1.1pp): slang vocabulary falls largely within WordPiece coverage, so the tokenizer handles it without signal loss. Emoji replacement swaps content words for Unicode characters that ELECTRA's WordPiece tokenizer maps to [UNK], destroying the input signal before any learned parameters see it (93.6% of emoji examples contain at least one [UNK], mean 2.91 per example). Noise tokens (no cap, deadass, tbh) are fully in-vocabulary but absent from NLI training data, consistent with the model assigning them inferential weight they do not carry. The two failure modes respond to different interventions: preprocessing recovers emoji accuracy by normalizing text before tokenization; augmentation handles noise by exposing the model to noise-bearing examples during training. A hybrid of both achieves 88.93% on the combined variant for ELECTRA on SNLI (up from 75.88%), with no statistically significant drop on clean text. Against GPT-4o-mini zero-shot, unmitigated ELECTRA is significantly worse on transformed variants (p < 0.0001); hybrid ELECTRA surpasses it across all SNLI variants and reaches statistical parity on MultiNLI.
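The two interventions the abstract describes can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the emoji-to-word lexicon is hypothetical, and only the three filler tokens named in the abstract are used.

```python
import random

EMOJI_TO_WORD = {"😊": "happy", "🐶": "dog"}   # hypothetical lexicon
NOISE_TOKENS = ["no cap", "deadass", "tbh"]   # fillers named in the paper

def normalize(text):
    """Preprocessing mitigation: rewrite emoji as words *before*
    tokenization, so the tokenizer never maps them to [UNK]."""
    for emoji, word in EMOJI_TO_WORD.items():
        text = text.replace(emoji, word)
    return text

def augment(examples, rate=0.5, seed=0):
    """Augmentation mitigation: append noise-bearing copies of training
    examples so the model learns the fillers carry no inferential weight."""
    rng = random.Random(seed)
    noisy = [f"{rng.choice(NOISE_TOKENS)} {ex}" for ex in examples
             if rng.random() < rate]
    return examples + noisy

print(normalize("the 🐶 is 😊"))  # -> "the dog is happy"
```

The hybrid configuration reported in the paper applies both: normalization at inference time and noise augmentation at training time, which is what lifts ELECTRA on the combined SNLI variant from 75.88% to 88.93%.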