Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

arXiv cs.AI / April 21, 2026

📰 News · Models & Research

Key Points

  • The paper studies how to allocate a fixed number of human survey responses across tasks when LLM-generated answers are available but vary unpredictably in accuracy by question.
  • It introduces a question-specific “rectification difficulty” that determines how rapidly estimation variance decreases as more human samples are added.
  • Using this rectification difficulty, the authors derive a closed-form optimal allocation rule that assigns more human effort to questions where the LLM is least reliable (a minimal sketch of this idea follows the list).
  • Because rectification difficulty depends on unobserved human responses, the paper proposes a meta-learning approach, trained on historical survey data, that predicts it for entirely new survey tasks without needing pilot human data.
  • Experiments on two datasets spanning different domains, question types, and LLMs show substantial efficiency gains: the approach captures 61–79% of the theoretically attainable improvement and reduces MSE by 11.4% and 10.5% on the two datasets, all without pilot data.
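The paper's exact rule isn't reproduced in this summary, but its shape can be illustrated: if question q's rectified estimate has variance roughly d_q / n_q, where d_q is the rectification difficulty and n_q the number of human responses, then minimizing total variance under a fixed budget N yields (via a Lagrange multiplier) the classic square-root allocation n_q ∝ √d_q. A minimal sketch under that assumption, with function and variable names that are illustrative rather than the paper's:

```python
import numpy as np

def allocate_budget(difficulty, budget):
    """Split a fixed human-response budget across survey questions.

    Assumes each question's estimation variance decays as d_q / n_q,
    so minimizing sum_q d_q / n_q subject to sum_q n_q = budget
    gives n_q proportional to sqrt(d_q). `difficulty` stands in for
    the paper's per-question rectification-difficulty estimates.
    """
    weights = np.sqrt(np.asarray(difficulty, dtype=float))
    raw = budget * weights / weights.sum()
    n = np.floor(raw).astype(int)
    # Hand leftover samples to the largest fractional remainders.
    leftover = budget - n.sum()
    n[np.argsort(raw - n)[::-1][:leftover]] += 1
    return n

# Example: three questions, LLM least reliable on the last one.
print(allocate_budget([1.0, 1.0, 4.0], budget=400))  # -> [100 100 200]
```

Note the square-root shape: a question whose difficulty is 4x larger gets only 2x the human samples, because the marginal value of each extra human label shrinks as a question's sample grows.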

Abstract

Large Language Models can generate synthetic survey responses at low cost, but their accuracy varies unpredictably across questions. We study the design problem of allocating a fixed budget of human respondents across estimation tasks when cheap LLM predictions are available for every task. Our framework combines three components. First, building on Prediction-Powered Inference, we characterize a question-specific rectification difficulty that governs how quickly the estimator's variance decreases with human sample size. Second, we derive a closed-form optimal allocation rule that directs more human labels to tasks where the LLM is least reliable. Third, since rectification difficulty depends on unobserved human responses for new surveys, we propose a meta-learning approach, trained on historical data, that predicts it for entirely new tasks without pilot data. The framework extends to general M-estimation, covering regression coefficients and multinomial logit partworths for conjoint analysis. We validate the framework on two datasets spanning different domains, question types, and LLMs, showing that our approach captures 61-79% of the theoretically attainable efficiency gains, achieving 11.4% and 10.5% MSE reductions without requiring any pilot human data for the target survey.
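For intuition about where rectification difficulty comes from, the sketch below shows the standard Prediction-Powered Inference mean estimator that the framework builds on: cheap LLM predictions on a large respondent pool are debiased by a rectifier measured on the small human sample, and the variance of that rectifier is the question-specific quantity that shrinks with human sample size. This is a generic PPI sketch, not the paper's code; and because the rectifier variance depends on human responses that are unobserved for a new survey, the paper's meta-learning component exists precisely to predict it before any are collected.

```python
import numpy as np

def ppi_mean(y_human, f_human, f_pool):
    """Prediction-powered estimate of one survey question's mean.

    y_human: human responses on the small labeled sample (size n)
    f_human: LLM predictions for those same respondents
    f_pool:  LLM predictions on a large unlabeled respondent pool

    Estimator: mean(f_pool) + mean(y_human - f_human).
    The human-sample part of its variance is var(y - f) / n, so the
    rectifier's variance acts as a per-question difficulty: the worse
    the LLM tracks humans, the slower the variance shrinks and the
    more of the human budget that question deserves.
    """
    rectifier = np.asarray(y_human) - np.asarray(f_human)
    estimate = np.mean(f_pool) + rectifier.mean()
    difficulty = rectifier.var(ddof=1)  # feeds the allocation sketch above
    return estimate, difficulty
```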