Abstract
Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection W_Q may be set to the identity without noticeable performance deterioration. This is possible because attention depends on X only through the products XW_Q, XW_K, XW_V, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace W_Q \in \mathbb{R}^{d \times d} with a nonlinear residual of the form Q(X) = X + f_\theta(X), where f_\theta is a bottleneck MLP with d^2 + O(d) parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3-Small-style models show consistent improvements over the baseline, comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
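The residual query above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the GELU activation, the zero initialization of the output projection, and the bottleneck width r = d/2 (chosen so the parameter count works out to d^2 + O(d)) are all assumptions for the sake of the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed activation; not specified in the abstract)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def residual_query(X, W1, b1, W2, b2):
    """Q(X) = X + f_theta(X), where f_theta is a bottleneck MLP."""
    return X + gelu(X @ W1 + b1) @ W2 + b2

d, r = 64, 32  # hypothetical sizes; r = d/2 keeps parameters at d^2 + O(d)
rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, (d, r)); b1 = np.zeros(r)
W2 = np.zeros((r, d)); b2 = np.zeros(d)  # zero-init output: Q(X) = X at start

X = rng.normal(size=(8, d))  # a batch of 8 token embeddings
Q = residual_query(X, W1, b1, W2, b2)

# The identity term anchors the module to the known-good prior Q(X) = X.
assert np.allclose(Q, X)
# Parameter count: d*r + r + r*d + d = d^2 + O(d) when r = d/2.
n_params = W1.size + b1.size + W2.size + b2.size
assert n_params == d * d + r + d
```

With this zero initialization the module starts as the exact identity map, so training begins from the behavior of the identity-W_Q model and the MLP learns only the nonlinear correction.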