FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

arXiv cs.LG · April 22, 2026


Key Points

  • The paper introduces FG$^2$-GDN, a new linear-attention model that extends Gated Delta Networks by improving how the delta-rule update’s learning rate is parameterized.
  • Unlike prior work, where the delta learning rate $\beta_t$ is a single scalar, FG$^2$-GDN uses channel-wise vectors to enable finer, dimension-specific adaptation.
  • The paper further proposes FG$^2$-GDN+, which separates key and value scaling to independently control erasure strength and write strength.
  • Experiments on both synthetic and real-world benchmarks indicate that FG$^2$-GDN and FG$^2$-GDN+ achieve better associative recall and long-context understanding than GDN and KDA while keeping computational efficiency comparable.
  • Overall, the work demonstrates that increasing “fine-grained control” in delta-rule mechanisms can strengthen associative memory and long-context performance in linear-attention architectures.
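To make the key-point about channel-wise learning rates concrete, here is a minimal NumPy sketch of one delta-rule state update, contrasting a scalar $\beta_t$ (GDN/KDA-style) with a channel-wise vector (FG$^2$-GDN-style). The function names and the exact placement of the per-channel gain are illustrative assumptions, not the paper's implementation; decay gating is omitted for clarity.

```python
import numpy as np

def delta_step_scalar(S, k, v, beta):
    """One delta-rule update with a scalar learning rate (GDN/KDA-style).

    S: (d_v, d_k) associative state, k: (d_k,) key, v: (d_v,) value,
    beta: scalar in (0, 1]. Erases the old value bound to k, writes the new one.
    """
    v_old = S @ k                                # value currently recalled for k
    return S + beta * np.outer(v - v_old, k)     # rank-1 correction

def delta_step_channelwise(S, k, v, beta_vec):
    """Hypothetical channel-wise variant: beta_vec has shape (d_k,), so each
    key channel gets its own learning rate, analogous to per-coordinate
    adaptive optimizers (AdaGrad/Adam) vs. plain SGD."""
    v_old = S @ k
    return S + np.outer(v - v_old, beta_vec * k) # per-channel scaled correction
```

With a unit-norm key and $\beta = 1$ (or all-ones `beta_vec`), a single step writes exactly: the updated state recalls `v` for key `k`. The channel-wise version reduces to the scalar one when all entries of `beta_vec` are equal, which is the sense in which it is a strict generalization.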

Abstract

Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector, analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
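The abstract's FG$^2$-GDN+ idea, separate key and value scaling for independent erasure and write strength, can be sketched as follows. The decomposition into an explicit erase term and write term, and the vectors `beta_k`/`beta_v`, are assumptions inferred from the abstract's description, not the paper's actual parameterization (and decay gating is again omitted).

```python
import numpy as np

def delta_step_decoupled(S, k, v, beta_k, beta_v):
    """Hypothetical FG^2-GDN+-style update with decoupled gains.

    S: (d_v, d_k) state, k: (d_k,) key, v: (d_v,) value.
    beta_k: (d_k,) controls erasure strength (how much of the old
            association along k is removed).
    beta_v: (d_k,) controls write strength (how strongly the new value
            is bound to k).
    """
    erase = np.outer(S @ k, beta_k * k)  # remove old value bound to k
    write = np.outer(v, beta_v * k)      # bind new value to k
    return S - erase + write
```

Setting `beta_k == beta_v` recovers the single channel-wise learning rate; letting them differ allows, for example, aggressive erasure with a conservative write (or vice versa), which is the extra degree of freedom the abstract attributes to FG$^2$-GDN+.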