Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

arXiv cs.CL / 4/8/2026


Key Points

  • The paper reframes LLM evaluation from single-round accuracy against fixed ground truth to multi-round distribution alignment, asking whether outputs match a desired target probability distribution under repeated prompting.
  • Experiments show that off-the-shelf LLMs and common alignment methods like prompt engineering and Direct Preference Optimization do not reliably control distributional properties for attributes such as gender, race, and sentiment in occupational contexts.
  • The authors propose a KL-optimized fine-tuning method that combines Steering Token Calibration with Semantic Alignment, using a hybrid loss to anchor latent steering-token probability mass (via KL divergence) and enforce semantic consistency (via a Kahneman–Tversky–style optimization term).
  • Across six datasets, the approach is reported to substantially outperform baselines, enabling more precise control over attribute generation distributions in multi-round settings.
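The hybrid objective described above can be sketched in miniature. The snippet below is illustrative only, not the paper's implementation: `hybrid_loss`, the distributions, and the weight `lam` are hypothetical names, and the KTO-style semantic term is stubbed as a precomputed scalar.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def hybrid_loss(target_dist, steering_token_probs, kto_term, lam=1.0):
    """Illustrative hybrid objective: a KL term anchors the probability
    mass assigned to latent steering tokens to the target distribution,
    while a weighted KTO-style term (stubbed here) would bind those
    tokens to semantically consistent responses."""
    return kl_divergence(target_dist, steering_token_probs) + lam * kto_term

# Hypothetical example: target 50/50 gender split vs. a skewed model
loss = hybrid_loss([0.5, 0.5], [0.7, 0.3], kto_term=0.2, lam=0.5)
```

In a real fine-tuning loop both terms would be differentiable functions of model logits; this sketch only shows how the two objectives combine into one scalar loss.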

Abstract

While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g., one reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback–Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman–Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.
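The evaluation shift the abstract describes — from single-round accuracy to multi-round distribution alignment — can be made concrete with a small sketch. This is an assumed protocol, not the paper's code: attribute extraction from generations is stubbed as a list of labels, and KL divergence is used as one plausible alignment gap.

```python
import math
from collections import Counter

def empirical_distribution(samples, categories):
    """Frequency of each attribute category across repeated generations."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[c] / n for c in categories]

def alignment_gap(samples, categories, target, eps=1e-12):
    """KL(target || empirical); 0 means the model's multi-round outputs
    exactly match the desired target distribution."""
    emp = empirical_distribution(samples, categories)
    return sum(t * math.log((t + eps) / (e + eps)) for t, e in zip(target, emp))

# Hypothetical: gender labels extracted from 10 repeated generations,
# compared against a uniform 50/50 target
samples = ["male"] * 7 + ["female"] * 3
gap = alignment_gap(samples, ["male", "female"], [0.5, 0.5])
```

Under repeated prompting, a well-aligned model would drive `gap` toward zero; a model that always emits the majority label would show a large, stable gap regardless of single-round plausibility.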