How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows
arXiv cs.AI / 4/2/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- The study evaluates whether LLM-as-judge scoring aligns with trained human evaluators when generating and judging one-sentence interpretive responses for qualitative interview excerpts.
- Using 712 K-12 mathematics teacher interview excerpts and five inference models (Command R+, Gemini 2.5 Pro, GPT-5.1, Llama 4 Scout-17B Instruct, Qwen 3-32B Dense), the authors compare AWS Bedrock LLM-as-judge metrics against human ratings of interpretive accuracy, nuance preservation, and coherence.
- LLM-as-judge scores reflect broad directional trends in model-level performance, but diverge substantially from human ratings in magnitude at the individual excerpt level.
- Among automated metrics, coherence aligns best with aggregated human judgments, while faithfulness and correctness show systematic misalignment—especially for non-literal and nuanced interpretations.
- The results recommend using LLM-as-judge primarily for screening out underperforming models rather than as a replacement for human judgment in qualitative research workflows; safety metrics, meanwhile, proved largely irrelevant to interpretive quality.
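The distinction between model-level and excerpt-level agreement in the findings above can be made concrete with a small sketch. All numbers below are hypothetical, and the two checks (comparing mean-score rankings, and mean absolute difference per excerpt) are illustrative stand-ins for the paper's actual analyses: an automated judge can rank models the same way humans do while still disagreeing with humans on individual items.

```python
# Hypothetical illustration: model-level directional agreement can coexist
# with substantial per-excerpt divergence. Scores are invented, not from
# the paper; "A"/"B"/"C" are placeholder model names on a 1-5 scale.
from statistics import mean

# judge[m][i] = LLM-as-judge score for model m on excerpt i;
# human[m][i] = trained-evaluator rating for the same response.
judge = {"A": [4.8, 4.9, 4.7, 4.8], "B": [3.9, 4.1, 4.0, 3.8], "C": [2.1, 2.3, 2.0, 2.2]}
human = {"A": [4.0, 3.2, 4.5, 3.8], "B": [3.5, 3.0, 3.8, 2.9], "C": [2.5, 1.8, 2.2, 2.0]}

def ranking(scores):
    """Rank model names by mean score, best first."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)

# Model-level check: do both raters order the models the same way?
same_ranking = ranking(judge) == ranking(human)

# Excerpt-level check: mean absolute difference across individual items.
mad = mean(abs(j - h)
           for m in judge
           for j, h in zip(judge[m], human[m]))

print(same_ranking, round(mad, 2))  # rankings agree, yet per-item gaps remain
```

With these toy numbers, both raters rank the models A > B > C, yet the per-excerpt scores differ by about 0.63 points on average, which is the pattern the key points describe: useful for screening models, unreliable as a substitute for item-level human judgment.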