AI Navigate

Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias

arXiv cs.AI / 3/12/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The 'Lost in the Middle' phenomenon is not just a training artifact but an inherent property of causal decoders with residual connections, appearing as a U-shaped retrieval curve from initialization onward.
  • The authors model multi-layer causal attention as iterated powers of the Cesàro matrix and derive a closed-form influence density, identifying a Primacy Tail at the start and a Recency Delta at the end due to masking and residuals.
  • Between these extremes there exists a factorial dead zone of order 1/(H-1)!, making middle-context retrieval and training structurally difficult, with the depth H controlling the width of this region.
  • Empirical checks show untrained Qwen2 and GPT-2 architectures exhibit the U-shape at Step 0, and the phenomenon persists with or without RoPE, indicating it is not solely a training artifact.
  • While not claiming the bias is insurmountable, the work establishes the architectural baseline and clarifies what future interventions should target to overcome it.

Abstract

The "Lost in the Middle" phenomenon, a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle, is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: *the U-shape is already present at initialization, before any training or positional encoding takes effect.* It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Cesàro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated O(1) anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order O(1/(H-1)!), where H is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step 0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.
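The factorial dead zone can also be checked directly. In the continuous limit, the last row of C^H at mid-context position j tracks (ln(n/j))^(H-1) / ((H-1)! n), so the decay ratio between successive depths grows roughly like H/ln 2 rather than staying constant; the influence collapses factorially, not geometrically, with depth. A minimal sketch (the uniform causal-averaging matrix stands in for the paper's Cesàro operator, and the residual path is omitted here to isolate the attention term):

```python
import numpy as np

n = 2048
C = np.tril(np.ones((n, n)))
C /= C.sum(axis=1, keepdims=True)  # Cesaro (causal averaging) matrix

# Influence of the mid-context position on the last token, per depth H.
mid = {}
M = np.eye(n)
for H in range(1, 6):
    M = M @ C                      # M = C^H after H iterations
    mid[H] = M[-1, n // 2]

# Closed-form density predicts mid[H] ~ (ln 2)^(H-1) / ((H-1)! * n),
# so the ratio mid[H] / mid[H+1] ~ H / ln 2 should grow with depth:
ratios = [mid[H] / mid[H + 1] for H in range(1, 5)]
print(ratios)
assert all(b > a for a, b in zip(ratios, ratios[1:]))  # factorial, not geometric
```

Strictly increasing ratios are the signature of factorial decay: a geometric decay would give a constant ratio, while here each extra layer divides the mid-context influence by a larger factor than the last.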