The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
arXiv cs.LG / 5/5/2026
Key Points
- Offline evaluation of language models using usage logs can be systematically biased when the model selection process is confounded with user-side factors that also affect judging quality.
- The paper argues that naively comparing logged scores does not estimate any single, common target quantity, because each model's logged population is self-selected rather than randomly assigned.
- It proposes a three-source approach combining a large observational log (OBS), a small randomized experiment that overrides model choice (EXP), and an offline simulator (SIM) that replays models on cached contexts.
- The authors provide an identification theorem showing that EXP and SIM together are sufficient to recover causal model values, while OBS is used mainly to reduce estimation error rather than to ensure validity.
- Experiments compare six estimator families and find that no method wins across all settings; performance depends on how much unbiased EXP supervision is available and on how well the target reward matches the structure inferred from OBS.
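The EXP+SIM combination described above can be illustrated as a control-variate estimator: the small randomized EXP slice supplies an unbiased but noisy reward estimate, the simulator supplies correlated replay scores on a large cached context pool, and a coefficient fit from their covariance trades one off against the other. This is a minimal sketch under assumed inputs; the function name, data shapes, and toy values are illustrative, not the paper's actual estimator.

```python
import numpy as np

def value_estimate(exp_rewards, sim_on_exp, sim_on_pool):
    """Control-variate combination of EXP and SIM signals (illustrative sketch).

    exp_rewards : unbiased judged rewards from the small randomized EXP slice
    sim_on_exp  : simulator (SIM) replay scores on those same EXP contexts
    sim_on_pool : SIM replay scores on the large pool of cached contexts
    """
    exp_rewards = np.asarray(exp_rewards, dtype=float)
    sim_on_exp = np.asarray(sim_on_exp, dtype=float)
    # Variance-minimizing coefficient: cov(reward, sim) / var(sim),
    # both taken on the EXP slice where we observe the pair.
    c = np.cov(exp_rewards, sim_on_exp)
    lam = c[0, 1] / c[1, 1] if c[1, 1] > 0 else 0.0
    # Start from the unbiased EXP mean, then correct by how far the SIM mean
    # on the EXP slice sits from the SIM mean on the full cached pool.
    return exp_rewards.mean() + lam * (np.mean(sim_on_pool) - sim_on_exp.mean())

# When SIM tracks the reward exactly, lam = 1 and the estimate moves all the
# way to the pool-level SIM mean.
est = value_estimate([1.0, 2.0, 3.0, 4.0],
                     [1.0, 2.0, 3.0, 4.0],
                     [1.0, 2.0, 3.0, 4.0, 5.0])
print(est)  # → 3.0
```

Because EXP is randomized, the leading term is unbiased on its own; the SIM correction only reduces variance when the simulator correlates with the true reward, which mirrors the paper's claim that OBS-style auxiliary signals affect efficiency rather than validity.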