[D] The Bitter Lesson of Optimization: Why training Neural Networks to update themselves is mathematically brutal (but probably inevitable)

Reddit r/MachineLearning / 4/8/2026

💬 Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • The article argues that while neural networks are learned, the core training algorithms (e.g., Adam/AdamW) are still hand-designed, echoing Richard Sutton’s “Bitter Lesson” that heuristics eventually give way to learning-based general methods.
  • It explains the two-loop setup for “learned optimizers,” where a learned optimizer network outputs update rules by minimizing a trajectory loss that accounts for the optimizee’s entire training dynamics, not just the final objective.
  • It contends that making learned optimizers practical at scale is “mathematically brutal,” highlighting severe scaling limits when replacing standard optimizers.
  • The piece frames learned optimizers as likely inevitable in the future of training and fine-tuning, even though current approaches remain constrained by challenging theory-to-practice requirements.

Are we still stuck in the "feature engineering" era of optimization?

We trust neural networks to learn unimaginably complex patterns from data, yet the algorithms we use to train them (like Adam or AdamW) are entirely hand-designed by humans. Richard Sutton's famous "Bitter Lesson" dictates that hand-crafted heuristics ultimately lose to general methods that leverage learning. So, why aren't we all using torch.optim.NeuralNetOptimizer to train our LLMs today?

https://preview.redd.it/k17ltm9dtytg1.png?width=2560&format=png&auto=webp&s=168c6659f47a80dc2231f1c143ecc5d7c87e4a6b

I recently spent some time investigating the math and mechanics of "Learned Optimizers" (letting an AI optimize another AI). While the theory is beautiful, the practical scaling limits are brutal. Here is a breakdown of why replacing Adam is so hard, and how this might impact the future of training and fine-tuning models.

(This article is a highly compacted version of the one I wrote in my blog)

1. The Optimizer vs. Optimizee Dynamics

To learn an optimizer, we set up a two-loop system.

  • The Optimizee (f): The base model we are training (e.g., an LLM). Its parameters are θ.
  • The Optimizer (g): A neural network parameterized by φ. It ingests features (gradients, momentum) and outputs the parameter update Δθ.

Instead of minimizing the final loss, the Optimizer minimizes the Trajectory Loss: the expected sum of the optimizee's losses across an entire trajectory of training steps. This forces the optimizer to care about the dynamics, penalizing slow convergence and rewarding stability.

https://preview.redd.it/xrry5knfvytg1.png?width=2963&format=png&auto=webp&s=d0a7fff1fd29583fad899a9420604c50c12d4dac
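To make the two-loop setup concrete, here is a toy sketch (my own illustration, not from the post): the optimizee is a simple quadratic, and the "learned" optimizer is shrunk to a single meta-parameter φ with update rule Δθ = −exp(φ)·g, so the trajectory loss L(φ) = Σₜ f(θₜ) can be evaluated directly.

```python
import numpy as np

def optimizee_loss(theta):
    # Toy optimizee f: a quadratic bowl standing in for an LLM's loss.
    return 0.5 * np.sum(theta ** 2)

def learned_update(grad, phi):
    # Toy optimizer g: a one-meta-parameter "learned" rule, delta = -exp(phi) * grad.
    # A real learned optimizer would be a small neural net over many features.
    return -np.exp(phi) * grad

def trajectory_loss(phi, theta0, steps=20):
    # Sum of the optimizee's losses over the whole unrolled trajectory,
    # not just the final loss: slow convergence is penalized at every step.
    theta, total = theta0.copy(), 0.0
    for _ in range(steps):
        grad = theta  # gradient of the quadratic above
        theta = theta + learned_update(grad, phi)
        total += optimizee_loss(theta)
    return total

theta0 = np.array([3.0, -2.0])
# phi = 0 (step size 1) solves this quadratic in one step, so its
# trajectory loss is far lower than the timid phi = -3 rule's.
print(trajectory_loss(0.0, theta0), trajectory_loss(-3.0, theta0))
```

In the real setting, of course, φ is the full weight vector of the optimizer network and the meta-gradient ∂L/∂φ has to flow through every inner training step, which is exactly where the trouble starts.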

2. Truncated Backpropagation and Its Bias

Minimizing the trajectory loss means unrolling the optimizee's entire training run and backpropagating through every step, which is prohibitive in both memory and compute. To make it tractable, we use Truncated Backpropagation Through Time (TBPTT). But truncation does not just approximate the objective; it changes it. The optimizer becomes inherently blind to consequences beyond the truncation window, systematically biasing the learned update rules toward short-horizon, greedy strategies.
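The greedy bias is easy to demonstrate on a toy problem (a sketch of my own, assuming plain gradient descent with a single "learned" step size on an ill-conditioned quadratic): the step size that minimizes a 2-step truncated trajectory loss is unstable over the full horizon.

```python
import numpy as np

# Ill-conditioned quadratic: f(x, y) = 0.5 * (x^2 + 25 * y^2).
# Curvatures differ 25x, so a step size that looks great over a short
# truncation window destabilizes the steep direction over long horizons.
CURV = np.array([1.0, 25.0])

def trajectory_loss(lr, steps):
    theta = np.array([10.0, 0.01])
    total = 0.0
    for _ in range(steps):
        theta = theta - lr * CURV * theta          # gradient step
        total += 0.5 * np.sum(CURV * theta ** 2)   # accumulate f along the way
    return total

lrs = [0.02, 0.05, 0.07, 0.3, 0.5]
best_truncated = min(lrs, key=lambda lr: trajectory_loss(lr, steps=2))
best_full      = min(lrs, key=lambda lr: trajectory_loss(lr, steps=50))
print(best_truncated, best_full)
```

The 2-step objective picks a step size above the stability threshold 2/25 of the steep direction (it pays off within the window), while the 50-step objective is forced below it. A meta-trained optimizer that only ever sees truncated windows inherits exactly this kind of myopia.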

3. The Theorem of Optimizer Dilution

If our learned optimizer had unconstrained access to the global loss landscape of a 1-billion-parameter model, even the simplest dense map from an N-dimensional gradient to an N-dimensional update would need O(N²) weights and compute: for N = 10⁹ that is 10¹⁸ entries, far beyond any feasible hardware.
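The back-of-envelope arithmetic (assuming fp32 storage for a single dense N→N linear map):

```python
N = 10**9                      # parameters in the optimizee
dense_weights = N * N          # one weight per (input, output) coordinate pair
fp32_bytes = dense_weights * 4
# 1e+18 weights, about 4 exabytes -- hence the weight sharing that follows
print(f"{dense_weights:.0e} weights, {fp32_bytes / 1e18:.0f} exabytes")
```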

To make it tractable, we share a tiny MLP across all parameters. For instance, Metz et al. (2022) used an ultra-tiny MLP (only 197 parameters) that processes 39 distinct input features per coordinate (local states, AdaFactor-normalized stats, global training context).

But because the exact same optimizer is applied independently to each parameter, it only sees local information: it is forced into the restricted class of coordinate-wise methods. However expressive the learned rule, it acts as a supercharged diagonal preconditioner and cannot represent the full curvature of the loss.
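A minimal sketch of that weight sharing (my own toy, assuming just 2 input features per coordinate instead of the 39 used by Metz et al., and a 33-weight MLP instead of their 197):

```python
import numpy as np

rng = np.random.default_rng(0)

# The shared optimizer g: a tiny MLP, 2 features in -> 8 hidden -> 1 out.
# phi is these 33 weights, no matter how many parameters it has to update.
W1, b1 = rng.normal(0.0, 0.1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 0.1, (8, 1)), np.zeros(1)

def learned_update(grad, momentum):
    # Every coordinate goes through the SAME weights, independently:
    # stack the local features into (N, 2) and batch them through the MLP.
    feats = np.stack([grad, momentum], axis=-1)
    h = np.tanh(feats @ W1 + b1)          # (N, 8)
    return (h @ W2 + b2).squeeze(-1)      # (N,) -- one update per coordinate

N = 100_000
grad, mom = rng.normal(size=N), rng.normal(size=N)
update = learned_update(grad, mom)
n_phi = W1.size + b1.size + W2.size + b2.size
print(update.shape, n_phi)  # O(N) work from 33 shared weights
```

Because `learned_update` never mixes coordinates, permuting its inputs just permutes its outputs — exactly the diagonal-preconditioner restriction described above.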

Tooling is already emerging:

Libraries like PyLO (PyTorch) now allow you to swap Adam for learned optimizers like VeLO with a single line of code. Even more interesting is their Hugging Face Hub integration. Meta-trained optimizers can be pushed and pulled from the Hub just like model weights.

Imagine a future for local finetuning where models do not just ship their weights, but also bundle the learned optimizer they were meta-trained with, perfectly tuned to that specific model's gradient geometry.

https://preview.redd.it/00a0ermlvytg1.png?width=4470&format=png&auto=webp&s=43aeac54ec750719f4280393be549bd81d085a6a

Discussion

I am really curious to hear what this community thinks:

  1. Do you think learned optimizers will eventually cross the compute-efficiency threshold to replace AdamW in standard LLM pre-training?
  2. Could bundling models with their own specialized update rules become the standard for parameter-efficient fine-tuning (PEFT/LoRA)?

Full Breakdown: Towards a Bitter Lesson of Optimization

submitted by /u/Accurate-Turn-2675