The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

arXiv cs.CL / 4/22/2026


Key Points

  • The study evaluates how well LLMs can predict fan experience ratings from open-ended MLB post-game survey text, building on an earlier GPT 4.1 baseline of about 67% agreement within ±1 point (the metric is sketched after this list).
  • Across ~10,000 surveys from five MLB teams, modest prompt customization improved GPT 4.1 performance slightly (from 67% to ~69% within ±1), while swapping models (GPT 4.1-mini or GPT 5.2) reduced accuracy even when paired with that best prompt.
  • The dominant factor was neither prompt nor model choice: variation in the linguistic character of the input text moved accuracy more than an order of magnitude more than either engineering lever did.
  • The paper argues the “ceiling” has two components—text-reading bias (which prompt design can partially correct) and missing information about fans’ actual decisions (which cannot be fixed because it is not present in the text).
  • Overall, the findings suggest prompt engineering has a specific, predictable benefit limited to the portion of error that stems from how the model interprets text, rather than broadly overcoming rating measurement limits.
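
The within-±1 agreement figure quoted above is simply the share of responses whose LLM-predicted rating lands within one point of the fan's self-reported rating. Here is a minimal sketch of that computation, assuming integer ratings; the function name, variable names, and toy values are illustrative, not taken from the paper:

```python
def within_one_agreement(predicted, reported):
    """Fraction of responses where the predicted rating is within
    +/-1 point of the fan-reported rating (the paper's headline metric)."""
    if len(predicted) != len(reported):
        raise ValueError("predictions and ground truth must align")
    hits = sum(1 for p, r in zip(predicted, reported) if abs(p - r) <= 1)
    return hits / len(reported)

# Toy example: 4 of 5 predictions land within one point -> 0.8 agreement.
print(within_one_agreement([7, 9, 4, 10, 6], [8, 9, 6, 10, 5]))
```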

Abstract

An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams, drawn from two prompts (the original baseline and a moderately customized version) and three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within-±1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline level, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more with the linguistic character of the text than with the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a gap between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that "prompt engineering helps a little" but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.
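
The abstract's central comparison can be read as grouping that same within-±1 flag two ways: by prompt/model configuration (the engineering levers) and by some property of the input text. The sketch below uses response length as a purely illustrative stand-in for "linguistic character"; the table schema, column names, and pandas workflow are assumptions, not the paper's code:

```python
import pandas as pd

# Hypothetical results table: one row per (survey response, configuration).
df = pd.DataFrame({
    "config":    ["baseline-4.1", "custom-4.1", "custom-4.1", "custom-5.2"],
    "predicted": [7, 9, 4, 10],
    "reported":  [8, 9, 6, 10],
    "n_words":   [12, 85, 3, 40],   # length of the open-ended response
})

df["within_1"] = (df["predicted"] - df["reported"]).abs() <= 1

# Engineering levers: agreement per prompt/model configuration.
by_config = df.groupby("config")["within_1"].mean()

# Input text: agreement per length band, a stand-in for the "linguistic
# character" dimension the paper finds dominates both levers.
bands = pd.cut(df["n_words"], bins=[0, 10, 50, 1000],
               labels=["terse", "moderate", "detailed"])
by_text = df.groupby(bands, observed=True)["within_1"].mean()

print(by_config, by_text, sep="\n\n")
```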