AI Navigate

Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

arXiv cs.LG / 3/16/2026


Key Points

  • The study analyzes a linearized attention mechanism with an exact correspondence to a data-dependent Gram-induced kernel, and shows that it does not converge to the infinite-width NTK limit even at large widths: a spectral amplification effect cubes the Gram matrix's condition number κ, so convergence requires width m = Ω(κ⁶).
  • It introduces the concept of influence malleability, revealing that attention exhibits 6–9× higher malleability than ReLU networks, meaning it can dynamically alter reliance on training examples.
  • This malleability has dual implications: the data-dependent kernel can reduce approximation error by aligning with task structure, but it also increases susceptibility to adversarial manipulation of training data.
  • The results suggest that attention's power and vulnerability stem from its departure from the kernel regime, with important consequences for the design and robustness of attention-based models.
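The spectral amplification claim above has a simple linear-algebra core. The paper's exact kernel construction is not reproduced here, but as an illustrative stand-in: applying a symmetric PSD Gram matrix three times cubes every eigenvalue, and therefore cubes the condition number.

```python
import numpy as np

# Illustrative sketch (not the paper's construction): for a symmetric PSD
# Gram matrix G, the matrix G^3 has eigenvalues lambda_i^3, so its
# condition number is kappa(G)^3 -- the "spectral amplification" effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
G = X @ X.T / 200              # full-rank, reasonably conditioned PSD Gram matrix

def cond(M):
    w = np.linalg.eigvalsh(M)  # eigenvalues of a symmetric matrix, ascending
    return w.max() / w.min()

kappa = cond(G)
kappa_cubed = cond(G @ G @ G)  # eigenvalues cube => condition number cubes

# The convergence threshold quoted in the abstract, m = Omega(kappa^6),
# is then the square of the amplified condition number kappa^3.
print(kappa_cubed / kappa**3)  # ~1.0 up to floating-point error
```

Because κ⁶ grows so fast, even a Gram matrix with a modest κ in the hundreds already pushes the required width beyond anything trainable, which is the sense in which the threshold "exceeds any practical width."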

Abstract

Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. Analyzing a linearized attention mechanism that corresponds exactly to a data-dependent Gram-induced kernel, the authors show, both empirically and theoretically within the Neural Tangent Kernel (NTK) framework, that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the Gram matrix's condition number κ, requiring width m = Ω(κ⁶) for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity to dynamically alter reliance on training examples. Attention exhibits 6–9× higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but this same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.
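The paper's formal definition of influence malleability is not reproduced here, but a rough, assumed proxy conveys the idea: in kernel regression the prediction is f(x) = k(x)ᵀK⁻¹y, so (K⁻¹k(x))ᵢ is the weight the model places on training label yᵢ. A fixed NTK-regime kernel freezes these weights; a data-dependent kernel lets them shift as features adapt.

```python
import numpy as np

# Hedged sketch of an influence proxy (the paper's exact definition is
# not reproduced here). We compare the kernel-regression influence weights
# (K^{-1} k(x))_i before and after a hypothetical learned change of features.
rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.normal(size=(n, d))
x = rng.normal(size=d)

def influence_weights(K, k_x):
    # weight each training label receives in the prediction f(x) = k(x)^T K^{-1} y
    return np.linalg.solve(K + 1e-6 * np.eye(len(K)), k_x)

K0 = X @ X.T                                   # kernel with fixed (NTK-regime) features
w0 = influence_weights(K0, X @ x)

A = np.eye(d) + 0.3 * rng.normal(size=(d, d))  # hypothetical learned feature map
Xt, xt = X @ A, x @ A                          # features after training
K1 = Xt @ Xt.T                                 # data-dependent kernel
w1 = influence_weights(K1, Xt @ xt)

# Relative shift of the influence weights: zero for a frozen kernel,
# strictly positive when the kernel moves with the data.
malleability = np.linalg.norm(w1 - w0) / np.linalg.norm(w0)
print(malleability)
```

Under this proxy, the paper's 6–9× figure would correspond to attention's influence weights moving that much more than a ReLU network's over training, which is also why the same mechanism that aligns the kernel with task structure opens the door to adversarial manipulation of training data.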