Auction-Based Online Policy Adaptation for Evolving Objectives

arXiv cs.LG / 4/3/2026


Key Points

  • The paper studies multi-objective reinforcement learning where objectives from the same family can appear or disappear at runtime, requiring policies that adapt efficiently to the changing set of active goals.
  • It introduces a modular framework in which each objective has its own selfish local policy; a novel auction-based coordination mechanism selects actions via bids that reflect the urgency of the current state.
  • The approach supports dynamic adaptation by adding or removing the corresponding local policies when objectives change, and it enables rapid runtime switching by deploying parameterized policy copies for objectives from the same family.
  • The selfish local policies are computed by reformulating the problem as a general-sum game, where each policy must learn not only to satisfy its objective but also to reason about other objectives and submit calibrated bids.
  • Experiments on Atari Assault and a gridworld path-planning task with dynamic targets show substantially better performance than monolithic PPO-trained policies.
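To make the auction mechanism concrete, here is a minimal sketch in Python. In the paper, actions and bids come from PPO-trained networks; here `act_fn` and `bid_fn` are hypothetical stand-in functions, and the 1-D reachability example (integer states, "step toward a target") is an illustration, not the paper's benchmark.

```python
class LocalPolicy:
    """Hypothetical selfish policy for one objective. In the paper the
    action and bid would come from learned networks; here both are
    supplied as plain functions for illustration."""
    def __init__(self, objective_id, act_fn, bid_fn):
        self.objective_id = objective_id
        self.act_fn = act_fn    # state -> action
        self.bid_fn = bid_fn    # state -> urgency bid in [0, 1]

def auction_step(policies, state):
    """Each active policy bids on the current state; the highest
    bidder wins the right to execute its own action."""
    winner = max(policies, key=lambda p: p.bid_fn(state))
    return winner.act_fn(state), winner.objective_id

# Toy example: two reachability objectives on an integer line;
# urgency is proximity to each policy's target.
goals = {"A": 3, "B": 9}

def make_policy(name, target):
    act = lambda s: 1 if s < target else -1        # step toward target
    bid = lambda s: 1.0 / (1.0 + abs(s - target))  # nearer => more urgent
    return LocalPolicy(name, act, bid)

policies = [make_policy(n, t) for n, t in goals.items()]
action, winner = auction_step(policies, state=4)  # "A" (target 3) is closest
```

Because coordination lives entirely in `auction_step`, adding or dropping an objective is just adding or removing an entry in `policies` — the interpretable trade-off the key points describe.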

Abstract

We consider multi-objective reinforcement learning problems where objectives come from an identical family -- such as the class of reachability objectives -- and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on Atari Assault and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
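The runtime-adaptation idea in the abstract — objectives from one family share a parameterized policy, so a new objective is served by deploying another copy — can be sketched as follows. The names `FamilyPolicy`, `instantiate`, and the handler functions are hypothetical; the point is only that adaptation is a dictionary update, with no retraining.

```python
class FamilyPolicy:
    """Hypothetical shared policy for one objective family (e.g.
    reachability): a single set of trained weights, parameterized
    by each objective's target."""
    def __init__(self, weights="shared-ppo-weights"):
        self.weights = weights  # trained once for the whole family

    def instantiate(self, target):
        # A deployed copy is just (shared weights, objective parameter).
        return {"weights": self.weights, "target": target}

family = FamilyPolicy()
active = {}  # objective id -> deployed local policy

def objective_appeared(obj_id, target):
    active[obj_id] = family.instantiate(target)  # immediate, no retraining

def objective_disappeared(obj_id):
    active.pop(obj_id, None)  # drop the corresponding local policy

objective_appeared("g1", target=(2, 5))
objective_appeared("g2", target=(7, 1))
objective_disappeared("g1")
```

At every step, whichever policies remain in `active` would compete in the auction described above, so the action-selection logic never needs to know how many objectives currently exist.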