AI Navigate

Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

arXiv cs.AI / 3/16/2026


Key Points

  • CAPO (Consensus Aggregation for Policy Optimization) rethinks PPO by running K replicates on the same batch with different minibatch shuffles and aggregating them into a consensus instead of increasing optimization depth.
  • The work analyzes updates in both Euclidean parameter space and the natural parameter space of the policy distribution, showing that in natural parameter space the consensus provably achieves a higher KL-penalized surrogate and tighter trust-region compliance than the mean expert, with parameter averaging inheriting these guarantees approximately.
  • Empirically, CAPO achieves up to 8.6x improvements over PPO and compute-matched deeper baselines on continuous control tasks under a fixed sample budget.
  • The authors argue that depth introduces waste while signal saturates, so widening optimization can improve performance without additional environment interactions.
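The width-over-depth loop described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: `surrogate_grad` is a toy stand-in for PPO's clipped minibatch gradient, and the consensus here is plain Euclidean parameter averaging.

```python
import numpy as np

def surrogate_grad(theta, minibatch):
    # Toy quadratic surrogate: the gradient pulls theta toward the
    # minibatch mean (stand-in for the clipped-PPO minibatch gradient).
    return theta - minibatch.mean(axis=0)

def ppo_replicate(theta0, batch, epochs=4, n_minibatches=4, lr=0.1, seed=0):
    # One PPO replicate: several epochs of minibatch SGD on the same batch,
    # distinguished from its siblings only by the shuffle seed.
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    idx = np.arange(len(batch))
    for _ in range(epochs):
        rng.shuffle(idx)  # replicate-specific minibatch shuffling order
        for mb in np.array_split(idx, n_minibatches):
            theta -= lr * surrogate_grad(theta, batch[mb])
    return theta

def capo_update(theta0, batch, K=4, **kw):
    # Widen instead of deepen: run K replicates on the same batch and
    # aggregate them into a consensus by parameter averaging.
    replicas = [ppo_replicate(theta0, batch, seed=k, **kw) for k in range(K)]
    return np.mean(replicas, axis=0)

rng = np.random.default_rng(0)
batch = rng.normal(size=(64, 3))  # one fixed batch; no extra environment steps
theta_consensus = capo_update(np.zeros(3), batch, K=4)
```

Note that all K replicates consume the same sample budget as a single PPO run: they reuse one batch and differ only in optimization noise, which the averaging step is meant to cancel.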

Abstract

Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: K PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves a higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.
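The logarithmic opinion pool mentioned in the abstract averages experts in natural parameter space. For a 1-D Gaussian policy the natural parameters are (mu/sigma^2, -1/(2 sigma^2)), and averaging them gives the normalized geometric mean of the expert densities. The sketch below is the textbook construction; CAPO's exact aggregation details may differ.

```python
import numpy as np

def to_natural(mu, var):
    # Natural parameters of a 1-D Gaussian: (mu/var, -1/(2*var)).
    return np.array([mu / var, -0.5 / var])

def from_natural(eta):
    # Invert the map back to (mean, variance).
    var = -0.5 / eta[1]
    return eta[0] * var, var

def log_opinion_pool(params):
    # Average the K experts' natural parameters; the result stays in the
    # Gaussian family (closed under the logarithmic opinion pool).
    etas = np.array([to_natural(mu, var) for mu, var in params])
    return from_natural(etas.mean(axis=0))

# Two equal-variance experts: the pooled mean is the average of the means
# and the variance is unchanged.
experts = [(0.0, 1.0), (2.0, 1.0)]
mu_c, var_c = log_opinion_pool(experts)  # → (1.0, 1.0)
```

When expert variances differ, the pool weights each mean by its precision, which is one intuition for why natural-space aggregation can behave better than naive Euclidean averaging of distribution parameters.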