Flexible Empowerment at Reasoning with Extended Best-of-N Sampling
arXiv cs.LG / April 20, 2026
Key Points
- The paper introduces a reinforcement learning method that integrates “empowerment” — an information-theoretic measure of how much influence an agent's actions have over its future states — into action selection to better manage the exploration–exploitation dilemma (EED).
- It argues that simply adding empowerment as an intrinsic reward bonus can be inefficient because the policy must first be learned before exploration emphasis can be adjusted.
- To address this, the authors leverage best-of-N (BoN) sampling — a technique widely used to improve reasoning with foundation models — so that modified policies are obtained implicitly, without learning separate policy networks.
- They further extend BoN sampling using Tsallis statistics, yielding a generalizable way to control how strongly the base policy is modified while keeping computational cost manageable.
- Experiments on toy problems and complex locomotion tasks show that the approach manages the EED effectively and improves overall RL performance.
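The core idea behind BoN sampling as an implicit policy modifier can be sketched in a few lines: draw N candidates from the unchanged base policy and keep the highest-scoring one. The sketch below is illustrative only — the function name `bon_sample`, the uniform toy policy, and the linear reward are assumptions, not the paper's actual setup.

```python
import random

def bon_sample(policy_sample, reward, n):
    """Draw n candidates from the base policy and keep the best-scoring one.
    This realizes a sharpened policy implicitly, without training a new network."""
    candidates = [policy_sample() for _ in range(n)]
    return max(candidates, key=reward)

# Toy setting: uniform base policy over 5 discrete actions, linear reward.
random.seed(0)
actions = [0, 1, 2, 3, 4]
base = lambda: random.choice(actions)
reward = lambda a: a  # hypothetical reward: higher action index is better

freq = {a: 0 for a in actions}
for _ in range(10_000):
    freq[bon_sample(base, reward, n=4)] += 1
print(freq)  # probability mass shifts toward high-reward actions as n grows
```

With n=1 this recovers the base policy exactly; larger n concentrates the induced distribution on high-reward actions, which is why N acts as a knob on how strongly the policy is modified.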
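One plausible reading of the Tsallis extension is a soft variant of BoN: instead of a hard argmax over candidates, resample one candidate with weight given by the Tsallis q-exponential of its reward, so the deformation parameter q tunes how far the implicit policy departs from the base policy. The paper's exact formulation is not given here, so everything below (`q_exp`, `soft_bon`, the toy policy and reward) is a hypothetical sketch of that idea.

```python
import math
import random

def q_exp(x, q):
    """Tsallis q-exponential exp_q(x); reduces to exp(x) as q -> 1.
    Outside the support (1 + (1-q)x <= 0) we return 0 for simplicity."""
    if abs(q - 1.0) < 1e-9:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def soft_bon(policy_sample, reward, n, q, beta=1.0):
    """Draw n candidates from the base policy, then resample one with
    probability proportional to exp_q(beta * reward); q controls how
    strongly the implicit policy deviates from the base policy."""
    cands = [policy_sample() for _ in range(n)]
    weights = [q_exp(beta * reward(a), q) for a in cands]
    if sum(weights) <= 0:
        return random.choice(cands)  # degenerate case: fall back to a uniform pick
    return random.choices(cands, weights=weights)[0]

# Toy demo (uniform base policy and linear reward are stand-in assumptions):
random.seed(0)
actions = [0, 1, 2, 3, 4]
counts = {a: 0 for a in actions}
for _ in range(5000):
    a = soft_bon(lambda: random.choice(actions), lambda a: a, n=4, q=0.5)
    counts[a] += 1
print(counts)  # milder sharpening than hard argmax; q tunes the strength
```

Because exp_q interpolates between polynomial weighting (q < 1) and the ordinary exponential (q = 1), varying q gives a continuum between gentle reweighting of the base policy and aggressive best-candidate selection, which matches the stated goal of controlling modification strength at manageable cost.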