Optimal control of the future via prospective learning with control

arXiv stat.ML / 5/6/2026


Key Points

  • The paper argues that “optimal control of the future” should go beyond standard reinforcement learning (RL) by extending supervised learning to learning-to-control settings.
  • It introduces “Prospective Learning with Control” (PLuC), showing that empirical risk minimization (ERM) can asymptotically achieve the Bayes-optimal policy under fairly general assumptions (a rough sketch of the objective follows this list).
  • The authors focus on a non-stationary, reset-free environment and demonstrate that this setting is where typical RL approaches break down or become inefficient.
  • On a simple 1-D foraging benchmark framed as a prospective-learning problem, modern RL methods (and even time-aware variants) converge orders of magnitude more slowly than the proposed prospective foraging agents.
  • The work provides an implementation via an open-source repository, enabling others to experiment with the PLuC framework.
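For intuition about what the ERM result is claiming, here is a rough sketch of a prospective ERM objective in the style of the prospective-learning literature. The notation (hypothesis class \(\mathcal{H}\), loss \(\ell\), time-indexed hypothesis \(h\)) is an illustrative assumption, not taken verbatim from the paper:

```latex
% Sketch of a prospective ERM objective (assumed notation, not the paper's).
% h maps (time, observation) to an action; the empirical risk is averaged
% over the single, unbroken stream of experience observed so far:
\hat{h}_t \;=\; \operatorname*{arg\,min}_{h \in \mathcal{H}}
  \;\frac{1}{t} \sum_{s=1}^{t} \ell\bigl(h(s, x_s),\, y_s\bigr)
```

The salient difference from standard RL is that the hypothesis takes time as an explicit input, so a learner can represent and extrapolate non-stationary structure rather than treating it as noise.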

Abstract

Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called “Prospective Learning with Control” (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes-optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary, reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. Code is available at: https://github.com/neurodata/procontrol.
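The benchmark itself lives in the linked repository; to get a feel for what “non-stationary, reset-free 1-D foraging” means, here is a minimal toy sketch. All names, dynamics, and parameters below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class Foraging1D:
    """Toy non-stationary, reset-free 1-D foraging environment (illustrative only).

    The agent walks left/right on a line of cells; one cell holds food. The food
    location moves on a schedule (non-stationarity), and there are no episode
    resets: the agent lives in a single unbroken stream of experience.
    """

    def __init__(self, n_cells=20, drift_period=500, seed=0):
        self.n_cells = n_cells
        self.drift_period = drift_period  # how often the food relocates
        self.rng = np.random.default_rng(seed)
        self.agent = int(self.rng.integers(n_cells))
        self.food = int(self.rng.integers(n_cells))
        self.t = 0

    def step(self, action):
        """action: -1 (left) or +1 (right). Returns (observation, reward)."""
        self.agent = int(np.clip(self.agent + action, 0, self.n_cells - 1))
        reward = 1.0 if self.agent == self.food else 0.0
        self.t += 1
        # Non-stationarity: the food relocates on a schedule, with no reset
        # signal delivered to the agent.
        if self.t % self.drift_period == 0:
            self.food = int(self.rng.integers(self.n_cells))
        # Exposing t in the observation is what a time-aware/prospective
        # learner can exploit; a stationary RL agent would ignore it.
        return (self.t, self.agent), reward


# Roll a random policy for a few steps to show the interaction loop.
env = Foraging1D()
for _ in range(5):
    obs, r = env.step(int(env.rng.choice([-1, 1])))
    print(obs, r)
```

A stationary RL policy sees the drifting food location as noise, which is one plausible reading of why the paper finds such methods converge slowly here, while a prospective learner that conditions on time can track or anticipate the drift.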