
PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

arXiv cs.CL / 3/20/2026


Key Points

  • PowerFlow introduces a principled, distribution-matching view of unsupervised fine-tuning for LLMs by using GFlowNet as an amortized variational sampler for unnormalized densities.
  • It adds a length-aware Trajectory-Balance objective to explicitly neutralize the structural length biases inherent in autoregressive generation.
  • By targeting α-power distributions, PowerFlow can sharpen the model (α > 1) to enhance logical reasoning or flatten it (α < 1) to unlock expressive creativity.
  • Experiments show PowerFlow outperforms existing RLIF methods, matches or surpasses supervised baselines, and improves diversity without sacrificing quality, shifting the Pareto frontier in creative tasks.
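The α-power mechanism in the points above has a compact form: raise a base distribution p to the power α and renormalize, so that α > 1 concentrates mass on likely continuations while α < 1 spreads it out. A minimal sketch on a toy categorical distribution (the three-way distribution and helper names here are illustrative, not from the paper):

```python
import math

def alpha_power(probs, alpha):
    """Raise a categorical distribution to the alpha power and renormalize.

    alpha > 1 sharpens the distribution (mass concentrates on likely tokens);
    alpha < 1 flattens it (mass spreads toward unlikely tokens).
    """
    powered = [p ** alpha for p in probs]
    z = sum(powered)  # partition function of p^alpha
    return [p / z for p in powered]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

base = [0.6, 0.3, 0.1]           # toy next-token distribution
sharp = alpha_power(base, 2.0)   # alpha > 1: "reasoning" regime
flat = alpha_power(base, 0.5)    # alpha < 1: "creative" regime

# Sharpening lowers entropy, flattening raises it.
assert entropy(sharp) < entropy(base) < entropy(flat)
```

Entropy cleanly separates the two regimes, which is why a single scalar α can steer the model toward either determinism or diversity.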

Abstract

Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting α-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution (α > 1) to intensify logical reasoning, or flattening it (α < 1) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
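The Trajectory-Balance objective the abstract refers to can be sketched in simplified form. In autoregressive generation the backward policy is deterministic (each string has a single left-to-right construction), so the standard TB residual reduces to log Z + Σ_t log π(a_t | s_t) − log R(x), squared. The code below is a minimal illustration under that simplification; the function name is mine, and PowerFlow's length-aware correction is deliberately not reproduced here:

```python
import math

def tb_loss(log_Z, step_logprobs, log_reward):
    """Squared trajectory-balance residual for an autoregressive sampler.

    With a deterministic backward policy, the residual is
        log Z + sum_t log pi(a_t | s_t) - log R(x),
    which vanishes exactly when the sampler matches R(x) / Z.
    (PowerFlow's length-aware variant is not modeled here.)
    """
    return (log_Z + sum(step_logprobs) - log_reward) ** 2

# Toy check: target the alpha-power of a 3-token base distribution (alpha = 2),
# i.e. unnormalized reward R(x) = p(x)^alpha, as in the paper's setup.
base = [0.6, 0.3, 0.1]
alpha = 2.0
Z = sum(p ** alpha for p in base)          # partition function of p^alpha
policy = [p ** alpha / Z for p in base]    # a policy that matches the target

# When the policy equals the normalized target, every residual is zero.
losses = [tb_loss(math.log(Z), [math.log(q)], alpha * math.log(p))
          for p, q in zip(base, policy)]
assert all(loss < 1e-12 for loss in losses)
```

The toy check shows why this is a well-defined optimization target: the loss has a unique zero at the α-power distribution, unlike heuristic intrinsic rewards with no fixed point.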