Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
arXiv cs.AI / 4/30/2026
Key Points
- The paper introduces a Human-in-the-Loop benchmarking framework for evaluating how effectively heterogeneous LLMs can automate competency-based assessment in secondary-level mathematics, addressing the manual burden of competency mapping in competency-based education (CBE).
- Using Nepal’s Grade 10 Optional Mathematics curriculum, the authors build a multi-dimensional rubric spanning four math topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior & Correlation (a sketch of one possible rubric structure follows this list).
- In a multi-provider ensemble (two open-weight Llama-family models plus two proprietary Gemini frontier models), the results reveal an “architecture-compatibility gap”: compliance with instruction and rubric constraints matters more than sheer model size.
- The Gemini sparse-MoE models achieved only “Fair Agreement” with faculty ground truth (κ≈0.38), while the larger 70B Orion model showed “No Agreement” (κ≈-0.0261), indicating unreliable rubric-constrained grading (see the kappa sketch after this list).
- The study concludes that LLMs are not ready for autonomous certification but can add value as assistive evidence extractors within a Human-in-the-Loop workflow.
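
To make the rubric concrete, here is a minimal sketch of a four-topic × four-competency scoring grid in Python. The competency names are taken from the summary above; the topic labels and the per-cell score levels are hypothetical placeholders, since the article does not enumerate the four topics or the scale.

```python
# Minimal sketch of a multi-dimensional competency rubric as a scoring grid.
# Competency names come from the article; TOPICS and the score levels are
# hypothetical placeholders (the article does not name the four topics).
COMPETENCIES = [
    "Comprehension",
    "Knowledge",
    "Operational Fluency",
    "Behavior & Correlation",
]
TOPICS = ["Topic 1", "Topic 2", "Topic 3", "Topic 4"]  # placeholders

def empty_rubric():
    """One score cell per (topic, competency) pair, filled per student response."""
    return {topic: {comp: None for comp in COMPETENCIES} for topic in TOPICS}

rubric = empty_rubric()
rubric["Topic 1"]["Comprehension"] = 3  # e.g., a faculty rater or LLM assigns a level
```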
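
The agreement labels quoted above map onto Cohen's kappa read through Landis–Koch-style bands, where κ in roughly 0.21–0.40 counts as “Fair” and κ at or below zero indicates no agreement beyond chance. Below is a minimal sketch of that computation, assuming plain categorical grades from a faculty rater and a model; the grade labels and data are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters grading the same items."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / n**2
    return (p_obs - p_exp) / (1 - p_exp)  # assumes raters use more than one label

def agreement_band(kappa):
    """Landis-Koch style bands; negative kappa read as 'No Agreement'."""
    if kappa <= 0:
        return "No Agreement"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost Perfect")]:
        if kappa <= upper:
            return label

# Hypothetical grades on five responses (labels are placeholders).
faculty = ["meets", "below", "meets", "meets", "below"]
model   = ["meets", "meets", "meets", "below", "below"]
k = cohens_kappa(faculty, model)
print(f"kappa = {k:.3f} -> {agreement_band(k)}")  # kappa = 0.167 -> Slight
```

Reading the paper's numbers through these bands: κ≈0.38 lands the Gemini models in the “Fair” band, while κ≈-0.0261 falls below zero, i.e., marginally worse than chance agreement with the faculty graders.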