AI Navigate

Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

arXiv cs.AI / 3/12/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper compares human-in-the-loop (HITL) evaluation with automated chain-of-thought prompting for interview answer assessment and improvement using LLMs, showing both approaches yield positive rating gains while HITL provides stronger training benefits.
  • Quantitative results show confidence rising from 3.16 to 4.16 and authenticity rising from 2.94 to 4.53 under HITL (both p < 0.001, with a Cohen's d of 3.21 for the authenticity gain); see the sketch after this list for how these statistics are computed.
  • The HITL method also requires five times fewer iterations (about 1.0 versus 5.0, p < 0.001) and achieves full integration of personal details.
  • Both methods converge rapidly, with mean iterations below one, and HITL achieves a 100 percent success rate among initially weak answers versus 84 percent for the automated approach (Cohen's h = 0.82, a large effect), indicating the primary bottleneck is context availability rather than compute.
  • The authors propose a 'bar raiser' adversarial mechanism to simulate realistic interviewer behavior, but note that quantitative validation remains future work and conclude that domain-specific enhancements and context-aware method selection are essential.
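
The reported statistics follow standard conventions for a within-subject paired design. As a minimal sketch (not the authors' code), here is how the paired t-test, the paired-samples Cohen's d, and Cohen's h for the success-rate comparison would be computed. The rating arrays are hypothetical stand-ins sampled to roughly match the reported means; since the real per-answer scores are not public, this will not reproduce the paper's exact values.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in ratings: the paper's n = 50 per-answer scores
# are not public, so these are sampled around the reported means.
rng = np.random.default_rng(42)
before = rng.normal(2.94, 0.5, size=50)  # e.g. authenticity, pre-improvement
after = rng.normal(4.53, 0.5, size=50)   # e.g. authenticity, post-improvement

# Paired (within-subject) t-test, matching the paper's design
res = stats.ttest_rel(after, before)

# Cohen's d for paired samples: mean of the differences over their SD
diff = after - before
d = diff.mean() / diff.std(ddof=1)

# Cohen's h for two proportions (success rate among initially weak answers)
def cohens_h(p1: float, p2: float) -> float:
    """Arcsine-transformed difference between two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

h = cohens_h(1.00, 0.84)  # HITL vs. automated, as reported above
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3g}, d = {d:.2f}, h = {h:.2f}")
```

Plugging in the reported success rates gives h = 2·arcsin√1.00 − 2·arcsin√0.84 ≈ 0.82, matching the paper's figure.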

Abstract

Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain-of-thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question–answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human-in-the-loop and automated chain-of-thought improvement. Using a within-subject paired design with n = 50, both approaches show positive rating improvements. The human-in-the-loop approach provides significant training benefits: confidence improves from 3.16 to 4.16 (p < 0.001) and authenticity improves from 2.94 to 4.53 (p < 0.001, Cohen's d = 3.21). The human-in-the-loop method also requires five times fewer iterations (1.0 versus 5.0, p < 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human-in-the-loop approach achieving a 100 percent success rate compared to 84 percent for automated approaches among initially weak answers (Cohen's h = 0.82, a large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity-bias model, named "bar raiser," to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain-of-thought prompting provides a useful foundation for interview evaluation, domain-specific enhancements and context-aware approach selection are essential for realistic and pedagogically valuable results.
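
A quick consistency check on the headline effect size, assuming d is the paired-samples variant (the mean rating change divided by the standard deviation of the per-answer changes):

$$ d = \frac{\bar{x}_{\mathrm{diff}}}{s_{\mathrm{diff}}} = 3.21 \;\Longrightarrow\; s_{\mathrm{diff}} \approx \frac{4.53 - 2.94}{3.21} \approx 0.50 $$

That is, a d of 3.21 corresponds to an average authenticity gain of 1.59 points with only about half a point of spread in the individual changes, which is why the effect registers so far above conventional "large" thresholds.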