How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
arXiv cs.CL / 5/4/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- Existing white-box LLM out-of-distribution (OOD) detection methods (e.g., CED, RAUQ, and WildGuard confidence scores) can be structurally confounded by input sequence length, and their apparent advantage collapses under length-matched evaluation.
- The paper shows that even baseline measures such as raw attention entropy exhibit the same length dependence, which the authors attribute to attention entropy's approximately Θ(log T) growth with input length (see the entropy sketch after this list).
- To recover genuine OOD signal once the length confound is removed, the authors propose a two-pathway framework that separates "what the text is about" (embeddings) from "how the model processes it" (the hidden-state trajectory across layers).
- They find that embedding-based features work best for vocabulary-distinctive OOD, while trajectory-based features better detect covert-intent inputs that reuse normal vocabulary, reporting strong performance on covert-intent/jailbreak cases (see the two-pathway scoring sketch after this list).
- Three lines of supporting evidence are presented: a crossover between k-NN (embedding) and trajectory scoring across tasks; layer-wise diagnostics that identify length artifacts; and circuit attribution showing that adversarial and semantic inputs engage different attention circuits. Results partially replicate across models, and code is released.
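Why length confounds attention-based scores is easy to see in isolation: uniform attention over T tokens has entropy exactly ln T, and softmax over i.i.d. random logits sits a roughly constant gap below that bound, so the raw statistic tracks input length rather than distribution shift. The sketch below is a minimal numpy illustration of this Θ(log T) scaling, not the paper's code; the random logits stand in for a single attention head's row.

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy (nats) of one attention distribution."""
    w = weights / weights.sum()
    return -(w * np.log(w + 1e-12)).sum()

rng = np.random.default_rng(0)
for T in (64, 256, 1024, 4096):
    # Softmax over random logits stands in for one head's attention row.
    logits = rng.normal(size=T)
    attn = np.exp(logits - logits.max())
    h = attention_entropy(attn)
    # ln(T) is the uniform-attention upper bound; the gap below it stays
    # roughly constant as T grows, so the raw entropy tracks length.
    print(f"T={T:5d}  entropy={h:.2f}  ln(T)={np.log(T):.2f}")
```

Length-matched evaluation removes exactly this artifact: once T is held fixed across in-distribution and OOD inputs, any remaining entropy difference has to come from the distribution itself.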
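The two pathways can be sketched as two scores over features an LLM already exposes. This is a toy illustration under assumptions, not the paper's released implementation: `knn_embedding_score` and `trajectory_score` are hypothetical names, the arrays are random stand-ins, and a real version would use sentence embeddings and mean-pooled per-layer hidden states from the model.

```python
import numpy as np

def knn_embedding_score(query_emb, train_embs, k=10):
    """Pathway 1 ("what the text is about"): distance to the k-th nearest
    in-distribution embedding; larger means more OOD."""
    d = np.linalg.norm(train_embs - query_emb, axis=1)
    return np.sort(d)[k - 1]

def trajectory_score(hidden_states, ref_trajectory):
    """Pathway 2 ("how the model processes it"): compare the layer-wise
    direction of hidden-state updates against an in-distribution reference.
    hidden_states: (L+1, d) per-layer mean-pooled states for one input.
    ref_trajectory: (L, d) unit-norm mean in-distribution update directions."""
    deltas = np.diff(hidden_states, axis=0)               # (L, d) per-layer updates
    deltas /= np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-12
    cos = (deltas * ref_trajectory).sum(axis=1)           # per-layer cosine similarity
    return 1.0 - cos.mean()                               # larger means atypical processing

# Toy usage with random stand-in features (d=32, a 12-layer model).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 32))
print(knn_embedding_score(rng.normal(size=32), train))
states = rng.normal(size=(13, 32))
ref = rng.normal(size=(12, 32))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)
print(trajectory_score(states, ref))
```

The division of labor in the key points maps onto these two scores: vocabulary-distinctive OOD shifts the embedding itself, so the k-NN distance moves, while covert-intent inputs reuse normal vocabulary and only show up in how the hidden state evolves across layers.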