GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent
arXiv cs.CL / 3/17/2026
💬 Opinion / Models & Research
Key Points
- GradMem writes context into memory via per-sample test-time gradient descent while keeping model weights frozen.
- It optimizes a model-level, self-supervised context-reconstruction loss, so the memory write is iterative and loss-driven, with each gradient step correcting residual reconstruction error (see the sketch after this list).
- On associative key–value retrieval, GradMem outperforms forward-only memory writers of the same size and scales capacity more effectively with more gradient steps.
- When applied to pretrained language models, it achieves competitive results on natural language tasks such as bAbI and SQuAD variants, using only the information encoded in memory.
- The approach offers a memory-efficient alternative to large per-layer KV caches for long-context conditioning in transformers.
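The write procedure is easiest to picture as a small optimization loop: freeze the model, allocate a fresh per-sample memory tensor, and take a few gradient steps on a context-reconstruction loss until the memory encodes the context. The sketch below is a minimal illustration of that idea, assuming a PyTorch model that accepts a `memory` argument and returns next-token logits; the function name, model interface, loss form, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def write_context_to_memory(model, context_ids, mem_shape,
                            num_steps=20, lr=1e-2):
    """Write a context into a per-sample memory tensor via test-time
    gradient descent, leaving the model weights frozen throughout."""
    # Freeze all model weights; only the memory receives gradients.
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)

    # Fresh per-sample memory, optimized from scratch at test time.
    memory = torch.zeros(mem_shape, requires_grad=True)
    opt = torch.optim.Adam([memory], lr=lr)

    for _ in range(num_steps):
        opt.zero_grad()
        # Assumed interface: the model conditions on `memory` and
        # returns next-token logits over the context tokens.
        logits = model(input_ids=context_ids, memory=memory)
        # Self-supervised reconstruction loss: the frozen model,
        # reading the memory, must predict the context itself.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            context_ids[:, 1:].reshape(-1))
        loss.backward()   # each step corrects residual write error
        opt.step()

    return memory.detach()
```

At answer time, the optimized memory tensor stands in for the full per-layer KV cache over the context, which is where the memory savings noted above would come from.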