Poly-EPO: Training Exploratory Reasoning Models

arXiv cs.AI / 4/21/2026


Key Points

  • The paper proposes a post-training framework that steers language models toward optimistic exploratory behavior while aligning responses with a reward function.
  • It introduces a general set reinforcement learning (set RL) recipe for optimizing LMs under arbitrary objectives by adapting advantage computation to the set setting.
  • The method, Polychromic Exploratory Policy Optimization (Poly-EPO), is instantiated with an objective designed to explicitly balance exploration and exploitation.
  • Experiments on multiple reasoning benchmarks indicate improved generalization, higher pass@k coverage (see the estimator sketched below), greater diversity in generated reasoning, and effective scaling with test-time compute.

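For context, pass@k coverage is usually measured with the standard unbiased estimator introduced with the Codex evaluations (Chen et al., 2021): draw n samples per problem, count the c that are correct, and compute the chance that a random subset of k contains at least one success. The paper's exact evaluation protocol isn't spelled out here, so the snippet below only illustrates that standard estimator, not the authors' code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of those samples that are correct
    k: attempt budget being evaluated
    Returns the probability that at least one of k samples chosen
    uniformly without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct out of 16 samples gives pass@4 of about 0.61.
print(round(pass_at_k(n=16, c=3, k=4), 3))  # 0.607
```
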
Abstract

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@k coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
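
The abstract does not give the exact set objective or advantage rule, so the sketch below is only a guess at the general shape: a hypothetical "polychromic" set objective that couples mean reward with pairwise diversity, and a leave-one-out advantage that credits each response for its marginal contribution to the set objective, normalized within the set so it can replace per-sample advantages in a standard policy-gradient or PPO-style update. All names and the specific functional form are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def set_objective(rewards: np.ndarray, pairwise_dist: np.ndarray,
                  alpha: float = 1.0) -> float:
    """Hypothetical polychromic set objective: reward a set of responses for
    being collectively accurate (mean reward) and exploratory (mean pairwise
    dissimilarity between responses), coupled so diversity only helps when
    the set also earns reward."""
    diversity = pairwise_dist[np.triu_indices_from(pairwise_dist, k=1)].mean()
    return rewards.mean() * (1.0 + alpha * diversity)

def set_advantages(rewards: np.ndarray, pairwise_dist: np.ndarray) -> np.ndarray:
    """Leave-one-out credit assignment over a sampled set (size >= 3 assumed):
    each response's advantage is how much the set objective drops when that
    response is removed, then centered and scaled within the set, mirroring
    group-normalized advantages in GRPO-style methods."""
    n = len(rewards)
    full = set_objective(rewards, pairwise_dist)
    contrib = np.empty(n)
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        contrib[i] = full - set_objective(rewards[keep],
                                          pairwise_dist[np.ix_(keep, keep)])
    return (contrib - contrib.mean()) / (contrib.std() + 1e-8)
```

Because everything specific to the set objective is folded into these scalar advantages, the rest of an off-the-shelf RL pipeline (token-level policy-gradient loss, clipping, KL regularization) can stay unchanged, which is the adaptation the abstract describes: standard RL algorithms carried over to the set setting through the advantage computation alone.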