How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework

arXiv cs.CL / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Existing white-box LLM out-of-distribution (OOD) detection methods (e.g., CED, RAUQ, and WildGuard confidence scores) can be structurally confounded by input sequence length, and their apparent advantage collapses under length-matched evaluation.
  • The paper shows that even baseline measures like raw attention entropy exhibit the same length dependence, which the authors attribute to attention entropy's Θ(log T) dependence on input length.
  • To recover genuine OOD signals after deconfounding, the authors propose a two-pathway framework that separates “what the text is about” (embeddings) from “how the model processes it” (hidden-state trajectory across layers).
  • They find that embedding-based features work best for vocabulary-distinctive OOD, while trajectory-based features better detect covert-intent inputs that reuse normal vocabulary, reporting strong performance on covert-intent/jailbreak cases.
  • Three supporting lines of evidence are presented: a crossover between k-NN (embedding) and trajectory scoring across tasks, layer-wise diagnostics identifying length artifacts, and circuit attribution showing that adversarial inputs engage attention circuits more than semantic ones, with partial replication across models and code to be released.
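The Θ(log T) length dependence the authors describe is easy to see in isolation: for near-uniform attention over T tokens, the Shannon entropy of an attention row is close to log T, so a raw mean-entropy score tracks sequence length rather than content. A minimal sketch with synthetic attention matrices (the `attention_entropy` helper and the weakly peaked softmax rows are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def attention_entropy(attn):
    """Mean Shannon entropy H(α) over the rows of one attention matrix."""
    attn = np.clip(attn, 1e-12, 1.0)
    return float(np.mean(-np.sum(attn * np.log(attn), axis=-1)))

rng = np.random.default_rng(0)
for T in (16, 64, 256):
    # Weakly peaked softmax attention over T positions (synthetic, not from a model).
    logits = 0.1 * rng.normal(size=(T, T))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # H stays within a few hundredths of log T as T grows.
    print(f"T={T:4d}  H={attention_entropy(attn):.2f}  log T={np.log(T):.2f}")
```

Any OOD score built on this raw entropy inherits the log T trend, which is why the paper's length-matched evaluation is needed before attributing signal to the input's content.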

Abstract

Recent white-box OOD detection methods for LLMs -- including CED, RAUQ, and WildGuard confidence scores -- appear effective, but we show they are structurally confounded by sequence length (|r| ≥ 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(α) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention's Θ(log T) dependence on input length. To identify genuine OOD signals after deconfounding, we propose a two-pathway framework: embeddings capture what text is about (effective for topic shifts), while the processing trajectory -- hidden-state evolution across layers -- captures how the model processes input. The relative power of each pathway varies along a vocabulary-transparency spectrum: embedding methods excel on vocabulary-distinctive OOD, while trajectory features detect covert-intent inputs that share vocabulary with normal text (0.721 avg AUROC; Jailbreak: 0.850). Three evidence lines support this framework: (1) a crossover between k-NN and trajectory scoring across 6 tasks, where each pathway wins on different OOD types; (2) a per-layer analysis showing that layer-0 k-NN signal is almost entirely a length artifact (Jailbreak: 0.759 raw → 0.389 matched) -- processing constructs genuine OOD signal from near-chance embeddings; and (3) circuit attribution showing adversarial tasks engage attention circuits more than semantic tasks (p = 0.022; Jailbreak patching p < 0.001), with partial cross-model replication. Code release upon publication.
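The two pathways in the abstract can be caricatured as two scoring functions over a model's internals: a k-NN distance in embedding space ("what the text is about") and a feature of the layer-wise hidden-state trajectory ("how the model processes it"). A hedged sketch; the function names, the cosine-drift feature, and all shapes are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def knn_embedding_score(x, bank, k=10):
    """Embedding pathway: distance from a pooled embedding x, shape (d,), to its
    k-th nearest neighbour in an in-distribution bank, shape (n, d).
    Larger distance suggests the input is OOD by topic/vocabulary."""
    dists = np.linalg.norm(bank - x, axis=1)
    return float(np.sort(dists)[k - 1])

def trajectory_score(hiddens):
    """Processing pathway (illustrative feature): cumulative cosine drift of the
    pooled hidden state across layers, shape (L + 1, d).
    Scores *how* the input moves through the network, not what it says."""
    h = hiddens / np.linalg.norm(hiddens, axis=1, keepdims=True)
    return float(np.sum(1.0 - np.sum(h[:-1] * h[1:], axis=1)))

# Toy check: an input whose representation never changes across layers has zero drift.
static = np.ones((13, 64))          # embedding + 12 layers, all identical
print(trajectory_score(static))     # 0.0
```

The division of labour then follows the vocabulary-transparency spectrum described above: a covert-intent prompt can sit close to the in-distribution bank (low k-NN score) while still producing an unusual trajectory through the layers.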