Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

arXiv cs.CL / 3/27/2026


Key Points

  • The paper investigates how prompt engineering choices affect LLM-based classification performance for social science texts, targeting improvements in accuracy and cost efficiency compared with traditional computational methods.
  • It systematically varies three prompt components—label descriptions, instructional nudges, and few-shot examples—across two example tasks to identify what reliably boosts results.
  • Results indicate that adding only a minimal amount of prompt context produces the largest performance gains, while additional context beyond that often delivers diminishing returns.
  • The study finds that increasing prompt context can sometimes reduce accuracy, highlighting that “more prompting” is not universally beneficial.
  • Performance is shown to vary substantially across different LLMs, tasks, and batch sizes, implying each classification setup needs individual validation rather than one-size-fits-all prompt rules.
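The three prompt components above can be combined into a grid of prompt variants for systematic comparison. The sketch below is a minimal illustration of that idea; the component wording, labels, and function names are hypothetical and not taken from the paper.

```python
from itertools import product

# Hypothetical prompt components (wording is illustrative, not from the paper).
LABEL_DESCRIPTIONS = "Labels: POSITIVE (approves of the policy), NEGATIVE (opposes it)."
NUDGE = "Think carefully and answer with exactly one label."
FEW_SHOT = 'Example: "This bill will help families." -> POSITIVE'

def build_prompt(text, use_descriptions, use_nudge, use_few_shot):
    """Assemble a classification prompt from optional context components."""
    parts = ["Classify the following text as POSITIVE or NEGATIVE."]
    if use_descriptions:
        parts.append(LABEL_DESCRIPTIONS)
    if use_nudge:
        parts.append(NUDGE)
    if use_few_shot:
        parts.append(FEW_SHOT)
    parts.append(f'Text: "{text}"\nLabel:')
    return "\n\n".join(parts)

# Enumerate all 2^3 = 8 variants, from bare prompt to full context.
variants = [build_prompt("Taxes are too high.", d, n, f)
            for d, n, f in product([False, True], repeat=3)]
```

Each variant would then be sent to the LLM and scored against gold labels, which is how diminishing (or negative) returns from extra context can be detected per model and task.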

Abstract

Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, given the wide variance in performance across current tests, we turn to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different example tasks, our tests illustrate that a minimal increase in prompt context yields the largest performance gains, while further increases in context tend to yield only marginal improvements thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch sizes, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.