How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework

arXiv cs.CL / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Existing white-box LLM out-of-distribution (OOD) detection methods (e.g., CED, RAUQ, and WildGuard confidence scores) can be structurally confounded by input sequence length, and their apparent advantage collapses under length-matched evaluation.
  • The paper shows that even baseline measures like raw attention entropy exhibit the same length dependence, which the authors attribute to attention entropy's Θ(log T) dependence on input length.
  • To recover genuine OOD signals after deconfounding, the authors propose a two-pathway framework that separates “what the text is about” (embeddings) from “how the model processes it” (hidden-state trajectory across layers).
  • They find that embedding-based features work best for vocabulary-distinctive OOD, while trajectory-based features better detect covert-intent inputs that reuse normal vocabulary, reporting strong performance on covert-intent/jailbreak cases.
  • Three supporting lines of evidence are presented: a crossover between k-NN (embedding) and trajectory scoring across tasks, layer-wise diagnostics identifying length artifacts, and circuit attribution showing that adversarial inputs engage attention circuits more than semantic ones, with partial replication across models and code to be released.
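The Θ(log T) length dependence the authors describe is easy to see in isolation: for near-uniform attention over T tokens, the Shannon entropy of an attention row is close to log T, so a raw mean-entropy score tracks sequence length rather than content. A minimal sketch with synthetic attention matrices (the `attention_entropy` helper and the weakly peaked softmax rows are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy H(α) over the rows of one attention matrix."""
    attn = np.clip(attn, 1e-12, 1.0)
    return float(np.mean(-np.sum(attn * np.log(attn), axis=-1)))

rng = np.random.default_rng(0)
for T in (16, 64, 256):
    # Weakly peaked softmax attention over T positions (synthetic, not from a model).
    logits = 0.1 * rng.normal(size=(T, T))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # H stays within a few hundredths of log T as T grows.
    print(f"T={T:4d}  H={attention_entropy(attn):.2f}  log T={np.log(T):.2f}")
```

Any OOD score built on this raw entropy inherits the log T trend, which is why the paper's length-matched evaluation is needed before attributing signal to the input's content.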

Abstract

Recent white-box OOD detection methods for LLMs -- including CED, RAUQ, and WildGuard confidence scores -- appear effective, but we show they are structurally confounded by sequence length (|r| ≥ 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(α) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention's Θ(log T) dependence on input length. To identify genuine OOD signals after deconfounding, we propose a two-pathway framework: embeddings capture what text is about (effective for topic shifts), while the processing trajectory -- hidden-state evolution across layers -- captures how the model processes input. The relative power of each pathway varies along a vocabulary-transparency spectrum: embedding methods excel on vocabulary-distinctive OOD, while trajectory features detect covert-intent inputs that share vocabulary with normal text (0.721 avg AUROC; Jailbreak: 0.850). Three evidence lines support this framework: (1) a crossover between k-NN and trajectory scoring across 6 tasks, where each pathway wins on different OOD types; (2) a per-layer analysis showing that layer-0 k-NN signal is almost entirely a length artifact (Jailbreak: 0.759 raw → 0.389 matched) -- processing constructs genuine OOD signal from near-chance embeddings; and (3) circuit attribution showing adversarial tasks engage attention circuits more than semantic tasks (p = 0.022; Jailbreak patching p < 0.001), with partial cross-model replication. Code release upon publication.
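The two pathways in the abstract can be caricatured as two scoring functions over a model's internals: a k-NN distance in embedding space ("what the text is about") and a feature of the layer-wise hidden-state trajectory ("how the model processes it"). A hedged sketch; the function names, the cosine-drift feature, and all shapes are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def knn_embedding_score(x, bank, k=10):
    """Embedding pathway: distance from a pooled embedding x, shape (d,), to its
    k-th nearest neighbour in an in-distribution bank, shape (n, d).
    Larger distance suggests the input is OOD by topic/vocabulary."""
    dists = np.linalg.norm(bank - x, axis=1)
    return float(np.sort(dists)[k - 1])

def trajectory_score(hiddens):
    """Processing pathway (illustrative feature): cumulative cosine drift of the
    pooled hidden state across layers, shape (L + 1, d).
    Scores *how* the input moves through the network, not what it says."""
    h = hiddens / np.linalg.norm(hiddens, axis=1, keepdims=True)
    return float(np.sum(1.0 - np.sum(h[:-1] * h[1:], axis=1)))

# Toy check: an input whose representation never changes across layers has zero drift.
static = np.ones((13, 64))          # embedding + 12 layers, all identical
print(trajectory_score(static))     # 0.0
```

The division of labour then follows the vocabulary-transparency spectrum described above: a covert-intent prompt can sit close to the in-distribution bank (low k-NN score) while still producing an unusual trajectory through the layers.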