How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

arXiv cs.AI / 4/2/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The study evaluates whether LLM-as-judge scoring aligns with trained human evaluators' ratings of one-sentence interpretive responses generated for qualitative interview excerpts.
  • Using 712 K-12 mathematics teacher interview excerpts and five inference models (Command R+, Gemini 2.5 Pro, GPT-5.1, Llama 4 Scout-17B Instruct, Qwen 3-32B Dense), the authors compare AWS Bedrock LLM-as-judge metrics against human ratings of interpretive accuracy, nuance preservation, and coherence.
  • LLM-as-judge scores reflect broad directional trends in model-level performance, but diverge substantially from human ratings in magnitude at the individual excerpt level.
  • Among automated metrics, coherence aligns best with aggregated human judgments, while faithfulness and correctness show systematic misalignment—especially for non-literal and nuanced interpretations.
  • The results recommend using LLM-as-judge primarily for screening out underperforming models rather than for replacing human judgment in qualitative research workflows; safety-related metrics proved largely irrelevant to interpretive quality.

Abstract

As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluations were conducted using AWS Bedrock's LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. Results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among automated metrics, Coherence showed the strongest alignment with aggregated human ratings, whereas Faithfulness and Correctness revealed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These findings suggest that LLM-as-judge methods are better suited for screening or eliminating underperforming models than for replacing human judgment, offering practical guidance for systematic comparison and selection of LLMs in qualitative research workflows.
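The model-screening check the abstract describes, comparing judge scores against human ratings at the model level, can be sketched as follows. This is a minimal illustration, not the paper's actual analysis: the score values are hypothetical, and Spearman rank correlation is used here simply as one reasonable way to test whether the judge preserves the human ordering of models even when score magnitudes diverge.

```python
from statistics import mean

def ranks(xs):
    """Return average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # group equal values so tied items get the same average rank
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-model mean scores for five candidate models:
# automated judge scores (0-1) vs. aggregated human ratings (1-5 rubric).
judge = [0.81, 0.74, 0.88, 0.69, 0.77]
human = [3.90, 3.50, 4.20, 3.10, 3.60]

print(spearman(judge, human))  # → 1.0: same model ordering in this toy data
```

A high rank correlation at the model level, as in this toy example, is consistent with the paper's finding that judge scores "capture broad directional trends": the ordering agrees even though the score scales differ. Running the same correlation per excerpt rather than per model is where the paper reports the divergence, which is why the authors position LLM-as-judge as a screening tool rather than a replacement for human rating.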