The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

arXiv cs.LG / 5/5/2026


Key Points

  • Offline evaluation of language models using usage logs can be systematically biased when the model selection process is confounded with user-side factors that also affect judging quality.
  • The paper argues that directly comparing logged scores does not estimate a single common target quantity, because each model's logged population is self-selected.
  • It proposes a three-source approach combining a large observational log (OBS), a small randomized experiment that overrides model choice (EXP), and an offline simulator (SIM) that replays models on cached contexts.
  • The authors provide an identification theorem showing that EXP and SIM together are sufficient to recover causal model values, while OBS is used mainly to reduce estimation error rather than to ensure validity.
  • Experiments compare six estimator families and find that no method wins across all settings; performance depends on how much unbiased EXP supervision is available and on how well the target reward aligns with the structure inferred from OBS. A minimal sketch of one way the three sources might be combined follows this list.
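
One way to read the three-source design is as a variance-reduction wrapper around the small randomized experiment. The sketch below is not the paper's estimator; it is a toy illustration, on synthetic stand-in data, of the division of labor described in the key points: OBS fits a proxy reward model, SIM replays candidate models on cached contexts and is scored by that proxy, and EXP supplies the unconfounded correction. All names, constants, and the ridge proxy are illustrative assumptions.

```python
# Minimal sketch of a three-source estimator in the spirit of the key points above;
# this is NOT the paper's method. Every name and constant is an illustrative stand-in.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_obs, n_exp, n_sim, d = 5000, 200, 1000, 8

# OBS: large confounded log of contexts and judged scores (model choice not random).
obs_X = rng.normal(size=(n_obs, d))
obs_y = obs_X @ rng.normal(size=d) + rng.normal(size=n_obs)

# EXP: small experiment in which the model assignment is randomized.
exp_X = rng.normal(size=(n_exp, d))
exp_m = rng.integers(0, 2, size=n_exp)
exp_y = exp_X @ rng.normal(size=d) + 0.5 * exp_m + rng.normal(size=n_exp)

# SIM: cached contexts on which every candidate model can be replayed offline.
sim_X = rng.normal(size=(n_sim, d))

# Step 1 -- OBS is used only to fit a proxy reward model (variance reduction).
proxy = Ridge(alpha=1.0).fit(obs_X, obs_y)

def proxy_score(X, m):
    """Toy stand-in for 'replay model m on these contexts and score the outputs'."""
    return proxy.predict(X + 0.1 * m)

# Step 2 -- score SIM replays of each candidate model with the OBS-fitted proxy.
# Step 3 -- correct with EXP: the residual on randomized data centers the estimate
#           even when the proxy is misspecified, so EXP carries the validity.
values = {}
for m in (0, 1):
    sim_term = proxy_score(sim_X, m).mean()
    mask = exp_m == m
    correction = (exp_y[mask] - proxy_score(exp_X[mask], m)).mean()
    values[m] = sim_term + correction

print({m: round(float(v), 3) for m, v in values.items()})
```

In this decomposition only the EXP term carries the causal burden: a poor OBS proxy inflates variance but does not bias the estimate, mirroring the claim that OBS serves estimation error rather than validity.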

Abstract

Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
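
To make the confounding claim concrete, the standard potential-outcomes reading is sketched below. The notation (Y(m) for the score a context would receive under candidate model m, M for the logged model, U for user-side confounders) is ours, not the paper's.

```latex
% Potential-outcomes reading of the bias claim; notation is illustrative.
% Y(m): judged score under candidate model m, M: logged model, U: user-side confounders.
\[
  \underbrace{\mathbb{E}[\,Y \mid M = m\,]}_{\text{naive average over the OBS log}}
  \;=\; \mathbb{E}\big[Y(m) \mid M = m\big]
  \;\neq\; \underbrace{\mathbb{E}\big[Y(m)\big]}_{\text{causal model value}}
  \quad \text{when } U \text{ drives both } M \text{ and } Y .
\]
\[
  \text{In EXP, randomization gives } M \perp\!\!\!\perp Y(m) \text{, so }
  \mathbb{E}_{\mathrm{EXP}}[\,Y \mid M = m\,] \;=\; \mathbb{E}\big[Y(m)\big].
\]
```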