Distributional Alignment Games for Answer-Level Fine-Tuning

arXiv cs.LG · 5/1/2026


Key Points

  • The paper addresses Answer-Level Fine-Tuning (ALFT), optimizing language models based on the correctness or properties of final answers rather than intermediate reasoning traces.
  • It shows that directly optimizing answer-level objectives is computationally intractable, since it requires marginalizing over the vast space of latent reasoning paths.
  • To make the problem tractable, the authors lift it to a game-theoretic “Distributional Alignment Game”: a two-player interaction between a Policy (the generator) and a Target (an auxiliary distribution).
  • The authors prove that the Nash equilibrium of this game exactly coincides with the solution of the original answer-level optimization, turning an intractable marginalization step into a tractable projection problem (a schematic reading of this lift is sketched after this list).
  • They demonstrate that the framework unifies recent diversity and self-improvement (coherence) methods, and they propose efficient algorithms compatible with Group Relative Policy Optimization (GRPO), such as Coherence-GRPO, yielding significant complexity reductions on mathematical reasoning tasks.
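
To make the "marginalization → projection" claim concrete, the following is one schematic reading of the lift, in generic notation rather than the paper's own: the answer-level functional F, the trace-to-answer map ans(·), the divergence D, and the constraint set Q are all assumptions for illustration.

```latex
% Answer-level objective: the training signal depends only on the marginal
% answer distribution, whose sum ranges over exponentially many traces z.
\max_{\theta}\; F\bigl(p_\theta(\cdot \mid x)\bigr),
\qquad
p_\theta(y \mid x) \;=\; \sum_{z \,:\, \mathrm{ans}(z) = y} \pi_\theta(z \mid x).

% Variational lift (illustrative, not the paper's notation): alternate a
% Target step, which projects an auxiliary answer distribution q onto the
% feasible set, with a Policy step, a standard trace-level RL update.
q^{(t+1)} \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}}\;
            D\bigl(q \,\Vert\, \hat{p}_{\theta^{(t)}}(\cdot \mid x)\bigr)
\qquad \text{(Target / projection step)}

\theta^{(t+1)} \;=\; \operatorname*{arg\,max}_{\theta}\;
                 \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
                 \bigl[\log q^{(t+1)}\bigl(\mathrm{ans}(z)\bigr)\bigr]
\qquad \text{(Policy step)}
```

Here \hat{p}_{\theta^{(t)}} is a Monte Carlo estimate of the answer marginal built from a group of sampled traces, which is what makes the scheme compatible with GRPO-style group sampling; at a fixed point of the alternation, q and the policy's answer marginal agree, mirroring the paper's Nash-equilibrium characterization.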

Abstract

We focus on the problem of *Answer-Level Fine-Tuning* (ALFT), where the goal is to optimize a language model based on the correctness or properties of its final answers, rather than the specific reasoning traces used to produce them. Directly optimizing answer-level objectives is computationally intractable due to the need to marginalize over the vast space of latent reasoning paths. To overcome this, we propose a general game-theoretical framework that lifts the problem to a *Distributional Alignment Game*. We formulate ALFT as a two-player game between a Policy (the generator) and a Target (an auxiliary distribution). We prove that the Nash Equilibrium of this game corresponds exactly to the solution of the original answer-level optimization problem. This variational perspective transforms the intractable marginalization problem into a tractable projection problem. We demonstrate that this framework unifies recent approaches to diversity and self-improvement (coherence) and provide efficient algorithms compatible with Group Relative Policy Optimization (GRPO), such as Coherence-GRPO, yielding significant complexity gains in mathematical reasoning tasks.
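
To connect this to GRPO-style training, the sketch below shows the plain group-relative advantage computation that an answer-level reward would plug into. It is a minimal illustration of vanilla GRPO's advantage step under an exact-match correctness reward, not the paper's Coherence-GRPO; the function names and the reward choice are assumptions for the example.

```python
# Minimal sketch: group-relative advantages for answer-level rewards.
# This is the standard GRPO normalization, not the paper's Coherence-GRPO.
import numpy as np

def answer_level_reward(pred_answer: str, gold_answer: str) -> float:
    """Reward that depends only on the final answer, never on the trace."""
    return 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one prompt's group of G sampled completions."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: four sampled reasoning traces for one math problem; two of them
# reach the correct final answer "42" even though their traces may differ.
final_answers = ["41", "42", "42", "7"]
rewards = np.array([answer_level_reward(a, "42") for a in final_answers])
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```

Because the reward depends only on ans(z), every trace that reaches the same final answer receives the same raw reward; the group normalization then converts this answer-level signal into per-trace advantages without ever evaluating the intractable marginal explicitly.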