Why dynamically routing multi-timescale advantages in PPO causes policy collapse (and a simple decoupled fix) [R]

Reddit r/MachineLearning / 4/16/2026


Key Points

  • The article argues that using dynamically routed multi-timescale advantages (different discount factors) inside PPO with an Actor-Critic architecture often triggers irreversible policy collapse or poor local optima.
  • It attributes the failure to two optimization pathologies: “surrogate objective hacking,” where the temporal attention/router exploits shortcuts in the PPO surrogate loss, and the “temporal uncertainty paradox,” where routing prefers short horizons because their aleatoric uncertainty is lower.
  • The proposed diagnosis is illustrated in delayed-reward tasks (e.g., LunarLander), where the agent may become overly short-sighted and hover to maximize shaping rewards rather than committing to a successful landing.
  • The author’s fix (“Target Decoupling” / “Representation over Routing”) isolates the Actor from the multi-timescale routing by keeping multi-timescale learning on the Critic side while updating the Actor using only long-term advantage signals.
  • The author reports that decoupling prevents hovering, enables robust learning (e.g., consistently surpassing a 200-point threshold across seeds), and provides a minimal PyTorch reproducible example plus linked paper and GitHub.

Hi folks,

I’m an undergrad doing some research on temporal credit assignment, and I recently ran into a frustrating issue. Trying to fuse multi-timescale advantages (like γ = 0.5, 0.9, 0.99, 0.999) inside an Actor-Critic architecture usually leads to irreversible policy collapse or really weird local optima.
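To make "multi-timescale advantages" concrete, here is a minimal sketch of computing Monte-Carlo returns for one episode at each timescale. This is my own illustrative helper (`discounted_returns` is not from the linked repo), assuming plain discounted returns rather than GAE:

```python
# Minimal sketch: per-timescale discounted returns for one episode.
# `discounted_returns` is a hypothetical helper, not the repo's actual code.

GAMMAS = [0.5, 0.9, 0.99, 0.999]

def discounted_returns(rewards, gamma):
    """Backward pass computing G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

rewards = [0.0, 0.0, 0.0, 1.0]  # single delayed reward at the last step
multi = {g: discounted_returns(rewards, g) for g in GAMMAS}
# From t=0, short timescales barely see the delayed reward:
# multi[0.5][0] == 0.125, while multi[0.999][0] == 0.999**3 ≈ 0.997
```

The point of fusing these is that each γ head sees a different credit-assignment horizon for the same trajectory.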

I spent some time diagnosing exactly why this happens, and it boils down to two main optimization pathologies:

  1. Surrogate Objective Hacking: When the temporal attention mechanism is exposed to policy gradients, the optimizer just finds a shortcut. It manipulates the attention weights to minimize the PPO surrogate loss while ignoring actual control of the environment.
  2. The Paradox of Temporal Uncertainty: If you try to fix the above by using a gradient-free method (like inverse-variance weighting), the router just locks onto the short-term horizons because their aleatoric uncertainty is inherently lower. In delayed-reward environments like LunarLander, the agent becomes so short-sighted that it just endlessly hovers in mid-air to hoard small shaping rewards, terrified of committing to a landing.
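The second pathology is easy to demonstrate numerically. Below is a toy sketch (my own illustrative setup, not the author's router): episodes have zero-mean noisy shaping rewards plus a delayed terminal reward, so return variance grows with γ, and inverse-variance weighting collapses onto the shortest horizon:

```python
# Sketch of the "temporal uncertainty paradox": with noisy per-step shaping
# rewards, return variance grows with gamma, so gradient-free
# inverse-variance weights lock onto the shortest horizon.
import random
import statistics

random.seed(0)
GAMMAS = [0.5, 0.9, 0.99, 0.999]
T, EPISODES = 30, 500

def episode_return(gamma):
    # zero-mean noisy shaping rewards each step, plus a delayed +10 at the end
    rewards = [random.gauss(0.0, 1.0) for _ in range(T)] + [10.0]
    return sum(gamma ** t * r for t, r in enumerate(rewards))

variances = [
    statistics.pvariance([episode_return(g) for _ in range(EPISODES)])
    for g in GAMMAS
]
inv_var = [1.0 / v for v in variances]
weights = [w / sum(inv_var) for w in inv_var]
# weights[0] (gamma = 0.5) dominates: the "router" prefers the short horizon
# even though only the long horizon can see the delayed +10 clearly.
```

Theoretically the noise contributes variance σ² · Σ γ^{2t}, which is ≈ 1.33 for γ = 0.5 but ≈ 29 for γ = 0.999 over 30 steps, so the weighting is stacked against the long horizon by construction.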

The Solution: Target Decoupling

The fix I found is essentially "Representation over Routing." You keep the multi-timescale predictions on the Critic side (which forces the network to learn incredibly robust auxiliary representations), but you strictly isolate the Actor. The Actor only gets updated using the advantage computed from the longest horizon (γ = 0.999).

Once decoupled, the agent stops hovering and learns a highly fuel-efficient, perfect landing, consistently breaking the 200-point threshold across multiple seeds without any hyperparameter hacking.

I got tired of bloated RL codebases, so I wrote a strict 4-stage Minimal Reproducible Example (MRE) in pure PyTorch so you can see the agent crash, hover, and finally succeed in just a few minutes.

Paper (arXiv): https://doi.org/10.48550/arXiv.2604.13517

GitHub (MRE + GIFs): https://github.com/ben-dlwlrma/Representation-Over-Routing

I built this MRE as a standalone project to really understand the math behind PPO and temporal routing. I've fully open-sourced the code and the preprint, hoping it saves someone else the headache of debugging similar "attention hijacking" bugs.

Feel free to use the code as a reference or a starting point if you're building multi-horizon agents. Hope you find it useful!

submitted by /u/dlwlrma_22