Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting

arXiv cs.AI / 4/6/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper tests whether large language models capture human social meaning both qualitatively and quantitatively, using two new calibration-focused metrics, the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS), to separate structural fidelity from magnitude calibration (see the sketch after this list).
  • In a case study on numerical (im)precision, frontier LLMs reproduce the qualitative structure of human social inferences but differ widely in how well they calibrate the magnitude of those inferences.
  • Prompting grounded in pragmatic theory, specifically encouraging reasoning about the speaker’s knowledge state and communicative motives, reduces magnitude deviation most consistently, while prompting focused on alternative-awareness tends to amplify exaggeration.
  • Combining both pragmatic components is the only intervention that improves all calibration-sensitive metrics across every evaluated model, though fine-grained magnitude calibration remains only partially resolved.
  • Overall, the results suggest LLMs model the inferential structure of pragmatic/social reasoning but still distort inferential strength; pragmatic-theory prompting offers a useful but incomplete remedy.

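To make the two metrics concrete, here is a minimal Python sketch of one plausible operationalization. The paper's exact formulas are not given in this summary, so the definitions below are assumptions: ESR is taken as the ratio of model to human standardized effect sizes (Cohen's d) for an imprecise-vs-precise contrast, and CDS as the mean absolute deviation between per-item model and human mean ratings.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples of ratings."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

def effect_size_ratio(model_imprecise, model_precise,
                      human_imprecise, human_precise):
    """ESR (assumed form): model effect size over human effect size.
    ESR near 1: magnitude matches humans; > 1: exaggerated; < 1: attenuated.
    A positive ratio means the model's contrast points the same way as humans'."""
    return (cohens_d(model_imprecise, model_precise) /
            cohens_d(human_imprecise, human_precise))

def calibration_deviation_score(model_means, human_means):
    """CDS (assumed form): mean absolute deviation between per-item model
    and human mean ratings on the same scale; 0 = perfectly calibrated."""
    m, h = np.asarray(model_means, float), np.asarray(human_means, float)
    return float(np.mean(np.abs(m - h)))

# Toy example: the model gets the direction right (structure) but
# exaggerates the size of the contrast (magnitude), so ESR > 1.
rng = np.random.default_rng(0)
human_p, human_i = rng.normal(4.0, 1.0, 60), rng.normal(5.0, 1.0, 60)
model_p, model_i = rng.normal(3.5, 1.0, 60), rng.normal(6.5, 1.0, 60)
print(round(effect_size_ratio(model_i, model_p, human_i, human_p), 2))
```

The point of separating the two quantities: a model can show a large effect in the right direction (good structure) while ESR and CDS reveal systematic over- or understatement of inferential strength (poor calibration).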
Abstract

Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.
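As an illustration of what prompting conditions derived from these two pragmatic assumptions might look like, here is a hypothetical sketch. The condition names and template wording are assumptions for exposition, not the paper's actual prompts.

```python
# Hypothetical prompt templates; the paper's exact wording is not reproduced here.
ALTERNATIVES = (
    "Before answering, consider what else the speaker could have said "
    "(for example, a round number instead of a precise one) and what "
    "choosing this wording over those alternatives signals."
)
KNOWLEDGE_AND_MOTIVES = (
    "Before answering, reason about what the speaker plausibly knows "
    "and what they are trying to achieve by phrasing it this way."
)

def build_prompt(utterance: str, question: str, condition: str) -> str:
    """Assemble a rating prompt for one of the (assumed) four conditions."""
    preamble = {
        "baseline": "",
        "alternatives": ALTERNATIVES,
        "knowledge_motives": KNOWLEDGE_AND_MOTIVES,
        "combined": ALTERNATIVES + " " + KNOWLEDGE_AND_MOTIVES,
    }[condition]
    return f'{preamble}\nSpeaker: "{utterance}"\n{question}'.strip()

print(build_prompt("It cost $203.", "How confident is the speaker? (1-7)",
                   "combined"))
```

Under this framing, the "combined" condition corresponds to the intervention the paper reports as the only one improving all calibration-sensitive metrics across all models.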