An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

arXiv cs.CL / 4/10/2026

Key Points

  • The paper proposes an agentic evaluation architecture to detect historical bias in educational textbooks at scale using a multimodal screening agent, a five-agent heterogeneous jury, and a meta-agent that synthesizes verdicts and escalates to humans when needed.
  • A key contribution is a Source Attribution Protocol that separates the textbook narrative from quoted historical sources to reduce systematic false positives common in single-model evaluators.
  • In experiments on Romanian upper-secondary history textbooks, the agentic approach classified 83.3% of 270 screened excerpts as pedagogically acceptable (mean severity 2.9/7), compared with 5.4/7 under a zero-shot baseline, which the authors read as evidence that agentic deliberation reduces over-penalization.
  • In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline.
  • The authors argue the method is cost-effective (about $2 per textbook), positioning agentic evaluation as viable decision-support for educational governance.
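The jury and meta-agent stages described in the points above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how verdicts are synthesized, so the median vote, the disagreement threshold `max_spread`, the 1 to 7 severity range, and the names `Verdict` and `synthesize` are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Verdict:
    severity: float  # aggregated severity (assumed 1-7 scale)
    escalate: bool   # True when the case should go to a human reviewer


def synthesize(jury_scores: list[int], max_spread: int = 3) -> Verdict:
    """Hypothetical meta-agent: median of five jury scores plus a
    disagreement check. Both rules are assumptions, not the paper's."""
    if len(jury_scores) != 5:
        raise ValueError("expected exactly five jury scores")
    if any(not 1 <= s <= 7 for s in jury_scores):
        raise ValueError("severity must lie on the assumed 1-7 scale")
    # Escalate when the gap between the harshest and the most lenient
    # juror exceeds the (hypothetical) threshold.
    spread = max(jury_scores) - min(jury_scores)
    return Verdict(severity=median(jury_scores), escalate=spread > max_spread)


if __name__ == "__main__":
    print(synthesize([2, 3, 3, 2, 4]))  # jurors broadly agree
    print(synthesize([1, 2, 2, 7, 3]))  # one outlier triggers escalation
```

A median is used in this sketch because a single outlying juror cannot drag the aggregate, while the spread check still surfaces that outlier for human review.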

Abstract

History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.
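The idea behind the Source Attribution Protocol, separating the textbook's own narrative from quoted historical sources before evaluation, can be illustrated with a small sketch. The abstract does not describe how the protocol detects quotations, so the approach below is purely an assumption: it treats any text between paired quotation marks as a quoted source, and the names `QUOTE` and `split_attribution` are hypothetical.

```python
import re

# Assumption: quoted sources are delimited by paired quotation marks
# (straight, English curly, or Romanian-style low/high marks).
QUOTE = re.compile(r'"([^"]*)"|“([^”]*)”|„([^”]*)”')


def split_attribution(excerpt: str) -> tuple[str, list[str]]:
    """Return (narrative, quoted_sources) for one textbook excerpt."""
    quoted = [
        next(g for g in m.groups() if g is not None)
        for m in QUOTE.finditer(excerpt)
    ]
    # Remove the quoted spans, then collapse leftover whitespace.
    narrative = " ".join(QUOTE.sub(" ", excerpt).split())
    return narrative, quoted


if __name__ == "__main__":
    text = ('The chronicler wrote: "They were barbarians." '
            "The textbook then asks students to assess the claim.")
    narrative, sources = split_attribution(text)
    print(narrative)
    print(sources)
```

Under this assumption, an evaluator would score only the narrative for authorial bias and treat the quoted passages as evidence attributed to their original authors, which is the kind of misattribution the protocol is said to prevent.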