Poly-EPO: Training Exploratory Reasoning Models

arXiv cs.AI / 4/21/2026


Key Points

  • The paper proposes a post-training framework that steers language models toward optimistic exploratory behavior while aligning responses with a reward function.
  • It introduces a general set reinforcement learning (set RL) recipe for optimizing LMs under arbitrary objectives by adapting advantage computation to the set setting.
  • The method, Polychromic Exploratory Policy Optimization (Poly-EPO), is instantiated with an objective designed to explicitly balance exploration and exploitation.
  • Experiments on multiple reasoning benchmarks indicate improved generalization, higher pass@k coverage (see the estimator sketched below), greater diversity in generated reasoning, and effective scaling with test-time compute.

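For context, pass@k coverage is usually measured with the standard unbiased estimator introduced with the Codex evaluations (Chen et al., 2021): draw n samples per problem, count the c that are correct, and compute the chance that a random subset of k contains at least one success. The paper's exact evaluation protocol isn't spelled out here, so the snippet below only illustrates that standard estimator, not the authors' code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of those samples that are correct
    k: attempt budget being evaluated
    Returns the probability that at least one of k samples chosen
    uniformly without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct out of 16 samples gives pass@4 of about 0.61.
print(round(pass_at_k(n=16, c=3, k=4), 3))  # 0.607
```
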
Abstract

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@k coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
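
The abstract does not give the exact set objective or advantage rule, so the sketch below is only a guess at the general shape: a hypothetical "polychromic" set objective that couples mean reward with pairwise diversity, and a leave-one-out advantage that credits each response for its marginal contribution to the set objective, normalized within the set so it can replace per-sample advantages in a standard policy-gradient or PPO-style update. All names and the specific functional form are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def set_objective(rewards: np.ndarray, pairwise_dist: np.ndarray,
                  alpha: float = 1.0) -> float:
    """Hypothetical polychromic set objective: reward a set of responses for
    being collectively accurate (mean reward) and exploratory (mean pairwise
    dissimilarity between responses), coupled so diversity only helps when
    the set also earns reward."""
    diversity = pairwise_dist[np.triu_indices_from(pairwise_dist, k=1)].mean()
    return rewards.mean() * (1.0 + alpha * diversity)

def set_advantages(rewards: np.ndarray, pairwise_dist: np.ndarray) -> np.ndarray:
    """Leave-one-out credit assignment over a sampled set (size >= 3 assumed):
    each response's advantage is how much the set objective drops when that
    response is removed, then centered and scaled within the set, mirroring
    group-normalized advantages in GRPO-style methods."""
    n = len(rewards)
    full = set_objective(rewards, pairwise_dist)
    contrib = np.empty(n)
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        contrib[i] = full - set_objective(rewards[keep],
                                          pairwise_dist[np.ix_(keep, keep)])
    return (contrib - contrib.mean()) / (contrib.std() + 1e-8)
```

Because everything specific to the set objective is folded into these scalar advantages, the rest of an off-the-shelf RL pipeline (token-level policy-gradient loss, clipping, KL regularization) can stay unchanged, which is the adaptation the abstract describes: standard RL algorithms carried over to the set setting through the advantage computation alone.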