Coupled Query-Key Dynamics for Attention

arXiv cs.LG / 4/3/2026

Key Points

  • The paper proposes “coupled QK dynamics,” where query and key representations are evolved jointly via shared learned dynamics before computing attention scores, rather than using static, independent projections.

Abstract

Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys *jointly* through shared learned dynamics before scoring - which we call **coupled QK dynamics** - improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55–22.62 perplexity vs. 24.22 for standard attention (a 6.6–6.9% reduction), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8× higher seed variance. The integration step count (1–7) is similarly irrelevant - a single coupled step suffices. A compute-matched comparison reveals that coupling is a *sample-efficiency* mechanism: standard attention trained for 2.4× longer (matching wall-clock time) reaches the same perplexity, but requires 2.4× more tokens. The advantage scales to 150M (−6.7%) but narrows at 350M (−1.0%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 −6.6%, PubMed −4.5%) but degrades on heterogeneous web text (+10.3%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
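
To make the mechanism concrete, here is a minimal NumPy sketch of the idea as described in the abstract: queries and keys are co-evolved through one shared Euler step before scaled dot-product scoring. The specific parameterization of the shared dynamics (`Wc`, the `tanh` nonlinearity, the step size `dt`) is a hypothetical choice for illustration, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coupled_qk_attention(x, Wq, Wk, Wv, Wc, dt=0.1):
    """Single-head attention with one coupled Euler step on Q and K.

    Wc parameterizes a *shared* learned dynamics f(z) = tanh(z @ Wc)
    (hypothetical form). Q is updated from K and K from Q through the
    same map, so the two representations co-evolve before scoring,
    rather than being static, independent projections.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # one explicit Euler step of the coupled system:
    #   dQ/dt = f(K),  dK/dt = f(Q)
    Qn = Q + dt * np.tanh(K @ Wc)
    Kn = K + dt * np.tanh(Q @ Wc)
    d = Qn.shape[-1]
    weights = softmax(Qn @ Kn.T / np.sqrt(d), axis=-1)
    return weights @ V
```

Since the abstract finds that a single coupled step suffices and the integrator choice is irrelevant, a one-step Euler update like the above is the natural minimal instantiation; the extra parameters are just `Wc`, shared between the Q- and K-updates.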