Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics

arXiv cs.CL / 4/6/2026


Key Points

  • The paper proposes a framework for using LLMs as valid measurement instruments for latent economic variables, in particular the cognitive content of occupational tasks, at a granularity finer than traditional survey instruments can reach.
  • It formalizes four validity conditions—semantic exogeneity, construct relevance, monotonicity, and model invariance—to justify when LLM-generated scores can serve as instruments.
  • The authors apply the method to construct an Augmented Human Capital Index (AHC_o) from 18,796 O*NET task statements scored by Claude Haiku 4.5, and report strong convergent validity against six existing AI exposure indices.
  • Statistical checks support the index's measurement quality: discriminant validity, a PCA that reveals two distinct AI-related dimensions (augmentation vs. substitution), and inter-model reliability measured with Pearson r and Krippendorff's alpha (see the sketch after this list).
  • The study also finds that task-level rankings are robust to prompt variation and that ORIV estimation corrects the attenuation bias that measurement error induces in OLS, with the approach intended to generalize to other domains requiring scalable semantic quantification.
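
As a concrete illustration of the inter-model reliability check in the fourth bullet, the sketch below computes Pearson r and Krippendorff's alpha over paired task-level scores from two models. Everything here is simulated: the 0-100 score scale, the noise levels, and the variable names are stand-ins, and the third-party krippendorff package is one convenient way to compute interval-level alpha, not something the paper specifies.

```python
# Simulated stand-in for the paper's inter-model reliability check.
# Scores, scale, and noise levels are illustrative assumptions.
import numpy as np
from scipy.stats import pearsonr
import krippendorff  # third-party: pip install krippendorff

rng = np.random.default_rng(0)
n = 3666                                    # paired scores, as in the paper
latent = rng.uniform(0, 100, size=n)        # unobserved task construct
scores_a = latent + rng.normal(0, 12, n)    # e.g. Claude Haiku 4.5 scores
scores_b = latent + rng.normal(0, 12, n)    # e.g. a second model's scores

r, _ = pearsonr(scores_a, scores_b)

# Krippendorff's alpha treats the two models as raters over the same units;
# the interval level of measurement matches continuous scores.
alpha = krippendorff.alpha(
    reliability_data=np.vstack([scores_a, scores_b]),
    level_of_measurement="interval",
)
print(f"Pearson r = {r:.2f}, Krippendorff's alpha = {alpha:.2f}")
```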

Abstract

This paper establishes the theoretical and practical foundations for using Large Language Models (LLMs) as measurement instruments for latent economic variables -- specifically variables that describe the cognitive content of occupational tasks at a level of granularity not achievable with existing survey instruments. I formalize four conditions under which LLM-generated scores constitute valid instruments: semantic exogeneity, construct relevance, monotonicity, and model invariance. I then apply this framework to the Augmented Human Capital Index (AHC_o), constructed from 18,796 O*NET task statements scored by Claude Haiku 4.5, and validated against six existing AI exposure indices. The index shows strong convergent validity (r = 0.85 with Eloundou GPT-gamma, r = 0.79 with Felten AIOE) and discriminant validity. Principal component analysis confirms that AI-related occupational measures span two distinct dimensions -- augmentation and substitution. Inter-rater reliability across two LLM models (n = 3,666 paired scores) yields Pearson r = 0.76 and Krippendorff's alpha = 0.71. Prompt sensitivity analysis across four alternative framings shows that task-level rankings are robust. Obviously Related Instrumental Variables (ORIV) estimation recovers coefficients 25% larger than OLS, consistent with classical measurement error attenuation. The methodology generalizes beyond labor economics to any domain where semantic content must be quantified at scale.
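
To make the abstract's attenuation claim concrete, the following minimal simulation reproduces the ORIV logic in the just-identified, single-regressor case: two independently noisy scores of the same latent variable instrument each other, following the general construction of Gillen, Snowberg, and Yariv. The coefficient, error variances, and sample layout below are illustrative assumptions, not the paper's data.

```python
# ORIV vs. OLS under classical measurement error (simulated illustration).
import numpy as np

rng = np.random.default_rng(1)
n = 18_796
x_true = rng.normal(size=n)                # latent cognitive content of tasks
y = 1.0 * x_true + rng.normal(0, 1, n)     # outcome; true coefficient is 1.0
x1 = x_true + rng.normal(0, 0.7, n)        # LLM score A, classical error
x2 = x_true + rng.normal(0, 0.7, n)        # LLM score B, independent error

def ols_slope(x, y):
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / (x @ x)

def iv_slope(x, z, y):
    # Just-identified 2SLS slope: cov(z, y) / cov(z, x)
    x, z, y = x - x.mean(), z - z.mean(), y - y.mean()
    return (z @ y) / (z @ x)

beta_ols = ols_slope(x1, y)                # attenuated toward zero
# ORIV stacks the two cross-instrumented IV regressions; with one regressor
# and demeaned data this is equivalent to averaging the two IV slopes.
beta_oriv = 0.5 * (iv_slope(x1, x2, y) + iv_slope(x2, x1, y))
print(f"OLS: {beta_ols:.2f}, ORIV: {beta_oriv:.2f}")  # ~0.67 vs ~1.00
```

Under classical measurement error the OLS slope converges to the true coefficient times the reliability ratio var(x*) / (var(x*) + var(u)), so an ORIV estimate 25% above OLS, as reported in the abstract, implies a reliability ratio of roughly 0.8 for the LLM scores.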