The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
arXiv cs.CL / 4/22/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study evaluates how far LLMs can predict fan experience ratings from open-ended MLB survey text, building on an earlier GPT-4.1 baseline (about 67% agreement within ±1; see the metric sketch after this list).
- Across ~10,000 surveys from five MLB teams, modest prompt customization improved GPT-4.1's performance slightly (from 67% to about 69% agreement within ±1), while switching models (to GPT-4.1-mini or GPT-5.2) generally reduced accuracy, even with the best-performing prompt.
- The dominant factor was neither prompt nor model choice: variation in the linguistic character of the input text shifted accuracy by more than an order of magnitude more than those engineering levers did.
- The paper argues the “ceiling” has two components: text-reading bias, which prompt design can partially correct, and missing information about fans’ actual decisions, which cannot be fixed because it is simply not present in the text (see the toy simulation below).
- Overall, the findings suggest prompt engineering delivers a specific, predictable benefit, confined to the portion of error that stems from how the model interprets the text; it does not broadly overcome the measurement limits of rating prediction.
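
For concreteness, here is a minimal sketch of the "agreement within ±1" metric the results above are reported in: the share of model-predicted ratings that land within one point of the fan's self-reported rating. The function name, variable names, and the 1-10 scale are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the "within ±1" agreement metric.
# Assumptions: integer ratings on a 1-10 scale; all names are illustrative.

def within_one_agreement(predicted: list[int], actual: list[int]) -> float:
    """Fraction of predictions landing within ±1 of the true rating."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and rating lists must be the same length")
    hits = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual))
    return hits / len(actual)

# Example: 4 of 5 predictions fall within one point of the human rating.
print(within_one_agreement([7, 5, 9, 3, 8], [8, 5, 7, 3, 8]))  # -> 0.8
```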

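To see why part of the error is irreducible, consider a toy simulation (an invented illustration, not the paper's model): if the true rating depends partly on something the fan never wrote down, even a predictor that reads the text perfectly hits a hard ceiling. Prompt engineering can shrink reading bias (zero by construction here) but cannot recover the unobserved term.

```python
# Toy ceiling simulation: all distributions here are invented for illustration.
import random

random.seed(0)
N = 10_000
hits = 0
for _ in range(N):
    sentiment = random.randint(1, 10)       # signal recoverable from the text
    unobserved = random.choice([-2, 0, 2])  # e.g., a renewal decision never mentioned
    true_rating = max(1, min(10, sentiment + unobserved))
    prediction = sentiment                  # an ideal text-only reader: no reading bias
    hits += abs(prediction - true_rating) <= 1

# Prints roughly 0.47: well below perfect, no matter how good the prompt is.
print(f"±1 agreement ceiling for a perfect text reader: {hits / N:.2f}")
```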