Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

arXiv stat.ML / 5/1/2026


Key Points

The paper argues that full-batch gradient descent in parametric model training can create a systematic generalization gap, making training error a poor proxy for test error due to bias toward the exact training data.
It proposes a new theory-based training algorithm, decoupled descent (DD), designed to satisfy a train-test identity so that train error asymptotically tracks test error for stylized Gaussian mixture models.
DD uses approximate message passing ideas to iteratively cancel biases introduced by data reuse, aiming to make “zero-cost” validation feasible while using all data.
The algorithm’s behavior is characterized by a low-dimensional state evolution recursion, which makes its training dynamics transparent and tractable.
On XOR classification, DD outperforms standard gradient descent. On noisy MNIST and nonlinear probing of CIFAR-10, where the theoretical assumptions are relaxed, DD narrows the generalization gap relative to gradient descent.
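The generalization gap described in the first key point can be reproduced on synthetic data. The sketch below is an illustration only, not the paper's experiment: it runs plain full-batch gradient descent on a logistic loss over a two-component Gaussian mixture. The names `make_gmm`, `error_rate`, and `run_gd`, along with the dimensions, step size, and choice of loss, are assumptions made for this example.

```python
import numpy as np

def make_gmm(n, d, mu, rng):
    """Draw n points from a symmetric two-component Gaussian mixture."""
    y = rng.choice([-1.0, 1.0], size=n)                 # class labels
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    return X, y

def error_rate(w, X, y):
    """Fraction of points misclassified by the linear rule sign(X @ w)."""
    return float(np.mean(np.sign(X @ w) != y))

def run_gd(n=200, d=400, steps=300, lr=0.5, seed=0):
    """Full-batch GD on the mean logistic loss; returns (train, test) errors per step."""
    rng = np.random.default_rng(seed)
    mu = np.full(d, 1.0 / np.sqrt(d))                   # class mean with unit norm
    X_tr, y_tr = make_gmm(n, d, mu, rng)
    X_te, y_te = make_gmm(5000, d, mu, rng)             # fresh data, never trained on
    w = np.zeros(d)
    history = []
    for _ in range(steps):
        margins = y_tr * (X_tr @ w)
        # Gradient of mean(log(1 + exp(-margin))) with respect to w.
        coef = y_tr / (1.0 + np.exp(margins))
        grad = -(X_tr * coef[:, None]).mean(axis=0)
        w -= lr * grad                                  # every step reuses the same data
        history.append((error_rate(w, X_tr, y_tr), error_rate(w, X_te, y_te)))
    return history

if __name__ == "__main__":
    train_err, test_err = run_gd()[-1]
    print(f"train error: {train_err:.3f}, test error: {test_err:.3f}")
```

In this overparameterized setting (more features than samples), the training error is driven toward zero while the error on fresh samples stays well above it, so the training error alone says little about test performance.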

Abstract

In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic "generalization gap", where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity, enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and 100% data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.
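The abstract does not spell out the DD update itself, so the sketch below does not attempt to reproduce it. Instead it shows the textbook approximate message passing (AMP) iteration for sparse linear regression with a soft-threshold denoiser. It illustrates two ingredients the abstract names: a correction term (the Onsager term) that compensates for reusing the same data matrix at every iteration, and a scalar state evolution recursion that predicts the algorithm's error. All names (`amp_sparse_regression`, `state_evolution`, `demo`), the three-point signal prior, and the parameter values are assumptions chosen for this example.

```python
import numpy as np

def soft_threshold(v, theta):
    """Soft-threshold denoiser: shrink each entry toward zero by theta."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def amp_sparse_regression(A, y, alpha=1.5, iters=30, onsager=True):
    """Textbook AMP for y = A @ x0 + noise with i.i.d. N(0, 1/n) entries in A."""
    n, N = A.shape
    x = np.zeros(N)
    z = y.copy()                                        # residual at x = 0
    for _ in range(iters):
        theta = alpha * np.linalg.norm(z) / np.sqrt(n)  # threshold scaled to residual level
        x_new = soft_threshold(x + A.T @ z, theta)
        # Onsager term: (N/n) * z * mean(eta') = z * (nonzeros in x_new) / n.
        correction = z * (np.count_nonzero(x_new) / n) if onsager else 0.0
        z = y - A @ x_new + correction
        x = x_new
    return x

def state_evolution(eps, delta, sigma, alpha=1.5, iters=30, samples=200_000, seed=1):
    """Monte Carlo scalar recursion predicting the MSE of the AMP iterate."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 0.0, 1.0], size=samples, p=[eps / 2, 1 - eps, eps / 2])
    Z = rng.standard_normal(samples)
    tau2 = sigma**2 + np.mean(X**2) / delta             # effective noise at x = 0
    for _ in range(iters):
        tau = np.sqrt(tau2)
        mse = np.mean((soft_threshold(X + tau * Z, alpha * tau) - X) ** 2)
        tau2 = sigma**2 + mse / delta                   # one-dimensional update
    return float(mse)

def demo(n=1000, N=2000, eps=0.1, sigma=0.01, seed=0):
    """Return (MSE of the all-zero estimate, MSE of the AMP estimate)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, N)) / np.sqrt(n)
    x0 = rng.choice([-1.0, 0.0, 1.0], size=N, p=[eps / 2, 1 - eps, eps / 2])
    y = A @ x0 + sigma * rng.standard_normal(n)
    x_hat = amp_sparse_regression(A, y)
    return float(np.mean(x0**2)), float(np.mean((x_hat - x0) ** 2))

if __name__ == "__main__":
    mse0, mse_amp = demo()
    mse_se = state_evolution(eps=0.1, delta=0.5, sigma=0.01)
    print(f"initial MSE {mse0:.4f}, AMP MSE {mse_amp:.6f}, state evolution {mse_se:.6f}")
```

Passing `onsager=False` turns the loop into plain iterative thresholding, which allows a side-by-side comparison of the iteration with and without the correction. The recursion in `state_evolution` tracks a single scalar regardless of the problem dimension, which is the sense in which such dynamics are called low-dimensional.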

