Adaptive Budget Allocation in LLM-Augmented Surveys

arXiv cs.LG · April 15, 2026


Key Points

  • The paper studies how to allocate a limited human-labeling budget across survey questions when LLM-generated responses are cheap but question-level reliability is unknown before collection.
  • It proposes an adaptive budget allocation algorithm that learns which questions are hardest for the LLM in real time by using each human label both to improve that question’s estimate and to measure LLM prediction error for it.
  • The authors prove that the gap between their allocation and the optimal one shrinks to zero as the human budget grows, without requiring a pilot study or pre-known per-question LLM accuracy.
  • Experiments on synthetic data and a real survey dataset (68 questions, 2000+ respondents) show that uniform human labeling wastes 10–12% of the budget relative to the optimum, while the adaptive method cuts that waste to 2–6% and matches uniform sampling's estimation quality with fewer human labels.
  • The framework is positioned as broadly applicable to any setting where scarce human oversight must be distributed across tasks with unknown LLM reliability.
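The core mechanism described above — each human label both refines a question's estimate and updates an estimate of the LLM's error on that question — can be sketched as a bandit-style allocation loop. The paper's exact allocation rule is not given in this summary, so the UCB-style selection below, along with the function name, error metric, and data layout, is a hypothetical stand-in for illustration:

```python
import math
import random

def adaptive_allocation(llm_preds, human_pool, budget, seed=0):
    """Hypothetical sketch of adaptive human-label allocation.

    llm_preds:  per-question LLM predictions (one number per question).
    human_pool: per-question lists of available human responses to draw from.
    budget:     total number of human labels we can afford.

    At each step, the next human label goes to the question with the
    largest upper confidence bound on estimated LLM prediction error,
    so unreliable questions attract more of the budget over time.
    """
    rng = random.Random(seed)
    n_q = len(llm_preds)
    err_sum = [0.0] * n_q            # accumulated |LLM - human| per question
    counts = [0] * n_q               # human labels collected per question
    labels = [[] for _ in range(n_q)]

    for t in range(budget):
        if t < n_q:
            q = t                    # label each question once to initialize
        else:
            # UCB score: mean observed error + exploration bonus
            q = max(
                range(n_q),
                key=lambda i: err_sum[i] / counts[i]
                + math.sqrt(2 * math.log(t + 1) / counts[i]),
            )
        y = rng.choice(human_pool[q])        # collect one human response
        labels[q].append(y)                  # dual role 1: refine estimate
        err_sum[q] += abs(llm_preds[q] - y)  # dual role 2: measure LLM error
        counts[q] += 1
    return counts, labels
```

In this toy form, a question where the LLM disagrees with humans accumulates a higher mean error and keeps winning the selection step, which is the qualitative behavior the key points describe: budget flows to the questions where the LLM is least reliable, with no pilot study needed.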

Abstract

Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10–12% of the budget relative to the optimal; our algorithm reduces this waste to 2–6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.