Are we still stuck in the "feature engineering" era of optimization? We trust neural networks to learn unimaginably complex patterns from data, yet the algorithms we use to train them (like Adam or AdamW) are entirely hand-designed by humans. Richard Sutton's famous "Bitter Lesson" argues that hand-crafted heuristics ultimately lose to general methods that leverage learning. So why aren't we all using learned optimizers?

I recently spent some time investigating the math and mechanics of learned optimizers (letting an AI optimize another AI). While the theory is beautiful, the practical scaling limits are brutal. Here is a breakdown of why replacing Adam is so hard, and how this might impact the future of training and fine-tuning models. (This article is a highly compacted version of the one on my blog.)

1. The Optimizer vs. Optimizee Dynamics

To learn an optimizer, we set up a two-loop system: an inner loop in which the learned optimizer updates an "optimizee" network, and an outer loop in which the optimizer's own weights are updated.
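As a toy illustration of the two-loop structure (not any paper's actual setup): below, the "optimizee" is a simple quadratic and the "learned optimizer" is just a single meta-learned step size phi. The outer loop scores phi by the summed loss along the whole unrolled inner trajectory, and here uses a finite-difference meta-gradient to stay dependency-free.

```python
import numpy as np

def optimizee_loss(theta):
    # Toy optimizee: a quadratic bowl standing in for a network's loss.
    return 0.5 * np.sum(theta ** 2)

def unroll(phi, theta0, T=20):
    # Inner loop: apply the "learned" update rule for T steps and
    # accumulate the trajectory loss (sum of losses along the way).
    theta = theta0.copy()
    traj_loss = 0.0
    for _ in range(T):
        grad = theta                   # d/dtheta of 0.5 * ||theta||^2
        theta = theta - phi * grad     # update rule: a meta-learned step size
        traj_loss += optimizee_loss(theta)
    return traj_loss

# Outer loop: update phi against the trajectory loss. A real learned
# optimizer would backprop through the unroll; a central finite
# difference keeps this sketch self-contained.
phi, eps, meta_lr = 0.05, 1e-4, 1e-4
theta0 = np.array([2.0, -3.0])
for _ in range(100):
    meta_grad = (unroll(phi + eps, theta0) - unroll(phi - eps, theta0)) / (2 * eps)
    phi -= meta_lr * meta_grad

print(phi)  # phi has grown from 0.05 toward a faster step size
```

Because the objective is the sum over the whole trajectory, slow early progress is penalized directly, which is exactly the "cares about dynamics" property described below.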
Instead of minimizing the final loss, the optimizer minimizes the trajectory loss: the expected sum of the optimizee's losses across an entire trajectory of training steps. This forces the optimizer to care about the dynamics, penalizing slow convergence and rewarding stability.

2. The Mathematical Wall: Jacobians and Instability

Why is training the optimizer computationally brutal? When you backpropagate through the unrolled optimization steps to update the optimizer's weights (φ), you have to take the derivative of the previous gradient with respect to the parameters. That is a Hessian. Furthermore, when you unroll the derivative over time, you are computing a sum of products of Jacobians. From a dynamical systems perspective, if the spectral radius (the largest eigenvalue magnitude) of those Jacobians is greater than 1, the cumulative product causes trajectories to diverge exponentially. It is the exact same fundamental instability that plagues the training of standard RNNs.

To fix this, we use Truncated Backpropagation Through Time (TBPTT). But truncation does not just approximate the objective; it changes it. The optimizer becomes inherently blind to long-term consequences, systematically biasing the learned update rules toward short-horizon, greedy strategies.

3. The Theorem of Optimizer Dilution

If our learned optimizer had unconstrained access to the global loss landscape of a 1-billion-parameter model, mapping an N-dimensional gradient to an N-dimensional update would require O(N²) compute, which is infeasible at that scale. To make it tractable, we share a tiny MLP across all parameters. For instance, Metz et al. (2022) used an ultra-tiny MLP (only 197 parameters) that processes 39 distinct input features per coordinate (local states, AdaFactor-normalized stats, global training context). But because the exact same optimizer is applied independently to each parameter, it only sees local information. It is forced into the restricted class of coordinate-wise methods.
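A minimal numpy sketch of why weight sharing forces coordinate-wise behavior (the sizes here are illustrative, not the real 39-feature, 197-parameter network): the same tiny MLP maps each coordinate's local features to that coordinate's update, so perturbing one coordinate's gradient cannot change any other coordinate's update.

```python
import numpy as np

rng = np.random.default_rng(0)
# A tiny shared MLP: per-coordinate features in, per-coordinate update out.
# (Illustrative sizes; the real optimizer uses far richer features.)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def learned_update(grad):
    # Features per coordinate: the raw gradient and its square.
    feats = np.stack([grad, grad ** 2], axis=1)  # shape (N, 2)
    h = np.tanh(feats @ W1 + b1)                 # shared weights, applied per row
    return (h @ W2 + b2).ravel()                 # shape (N,)

g = rng.normal(size=5)
u = learned_update(g)

# Perturb coordinate 0's gradient: every other coordinate's update is
# untouched, because the rule is strictly coordinate-wise.
g2 = g.copy()
g2[0] += 10.0
u2 = learned_update(g2)
print(np.allclose(u[1:], u2[1:]))  # True
```

No matter how expressive the shared network is, its output for coordinate i is a function of coordinate i's features alone, which is the restriction the next section draws a consequence from.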
Even if the update rule is entirely learned, a coordinate-wise optimizer acts as a supercharged diagonal preconditioner: it cannot represent the full curvature of the loss.

Tooling is already emerging. Libraries like PyLO (PyTorch) now let you swap Adam for learned optimizers like VeLO with a single line of code. Even more interesting is their Hugging Face Hub integration: meta-trained optimizers can be pushed to and pulled from the Hub just like model weights. Imagine a future for local fine-tuning where models do not just ship their weights, but also bundle the learned optimizer they were meta-trained with, perfectly tuned to that specific model's gradient geometry.

Discussion

I am really curious to hear what this community thinks:
Full Breakdown: Towards a Bitter Lesson of Optimization [link] [comments]
The Bitter Lesson of Optimization: Why training Neural Networks to update themselves is mathematically brutal (but probably inevitable)
Reddit r/LocalLLaMA / 4/8/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The article argues that while “hand-crafted” optimizers like Adam are still dominating neural network training, the broader “Bitter Lesson” suggests we should instead learn update rules using general methods.
- It explains learned optimizers as a two-loop setup where a neural-network optimizer is trained to minimize a “trajectory loss,” focusing on training dynamics (stability and convergence speed) rather than just final loss.
- Despite promising theory, the author says the practical scaling limits are severe, making it mathematically and computationally brutal to replace standard optimizers with learned ones in large-scale LLM training.
- It discusses how these limits could shape the future of model training and fine-tuning, implying learned optimization may be constrained to certain contexts or require new breakthroughs.
- The piece frames the problem as “inevitable” long-term, but difficult in the short-to-medium term due to the complexity of optimizing the optimizer itself.
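The instability the summary above calls "mathematically brutal" can be reproduced in a few lines: backpropagating through T unrolled steps multiplies T Jacobians together, and the product's norm shrinks or explodes depending on whether the spectral radius sits below or above 1. This is a toy linear-dynamics sketch, not a real learned-optimizer unroll.

```python
import numpy as np

rng = np.random.default_rng(1)

def jacobian_product_norm(spectral_radius, T=50, n=4):
    # Random linear step dynamics theta_{t+1} = A @ theta_t, rescaled so
    # the largest |eigenvalue| of A equals the requested spectral radius.
    A = rng.normal(size=(n, n))
    A *= spectral_radius / np.max(np.abs(np.linalg.eigvals(A)))
    # Backprop through T unrolled steps multiplies T copies of the
    # step Jacobian; here that is simply A^T.
    return np.linalg.norm(np.linalg.matrix_power(A, T))

print(jacobian_product_norm(0.9))  # shrinks as T grows: stable dynamics
print(jacobian_product_norm(1.1))  # grows exponentially with T: divergence
```

The same dichotomy is what makes long unrolls explode (forcing TBPTT) and what ties learned-optimizer training to the classic RNN gradient problem.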