Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

arXiv cs.AI / March 27, 2026


Key Points

  • Language models implicitly represent a distribution over answers, but common post-training methods tend to collapse it to a single dominant mode, which can hurt tasks with ambiguity or multiple valid answers.
  • The paper proposes a multi-answer reinforcement learning (RL) method that trains LMs to perform distributional reasoning by generating multiple plausible hypotheses in a single forward pass while producing confidence-aware outputs.
  • By modifying the RL objective, the approach internalizes parts of inference-time search into generation, reducing the need for computationally intensive repeated sampling to find non-modal answers.
  • Experiments on question answering, medical diagnosis, and coding benchmarks show improved diversity, coverage, and set-level calibration versus single-answer RL baselines, with fewer tokens needed to output multiple answers.
  • On coding tasks, the multi-answer RL models also achieve substantially higher accuracy, positioning the method as a compute-efficient alternative to inference-time scaling strategies like best-of-k.

Abstract

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to surface non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single-answer-trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
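The paper does not spell out its modified RL objective in this summary, but the core idea of rewarding a *set* of answers rather than a single one can be sketched. Below is a minimal, hypothetical Python illustration (the function name, weighting, and penalty term are assumptions, not the authors' actual objective): a set-level reward that credits coverage of the valid answers and precision of the predicted set, with a small penalty against padding the set with extra guesses.

```python
# Hypothetical sketch of a set-level reward for multi-answer RL.
# NOT the paper's actual objective; an assumed illustration of rewarding
# a predicted answer *set* for coverage and precision, so the policy is
# encouraged to emit several plausible answers in one pass rather than
# collapsing onto a single dominant mode.

def set_level_reward(predicted, valid, length_penalty=0.1):
    """Score a predicted answer set against the set of valid answers.

    coverage:  fraction of valid answers the prediction contains.
    precision: fraction of predicted answers that are valid.
    The penalty term (assumed) discourages inflating the set with
    extra guesses beyond the number of valid answers.
    """
    predicted, valid = set(predicted), set(valid)
    if not predicted or not valid:
        return 0.0
    hits = len(predicted & valid)
    coverage = hits / len(valid)
    precision = hits / len(predicted)
    excess = max(0, len(predicted) - len(valid))
    return coverage * precision - length_penalty * excess

# A single modal answer covers only part of an ambiguous question:
#   set_level_reward(["a"], ["a", "b"])       -> 0.5
# A set covering both valid answers scores higher:
#   set_level_reward(["a", "b"], ["a", "b"])  -> 1.0
```

Under a reward of this shape, repeated sampling (as in best-of-k) becomes unnecessary at inference time: the policy is trained to emit the diverse candidate set directly, which is consistent with the paper's reported token savings.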