Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

arXiv cs.CL / 4/23/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Automated grading at scale can reduce cost and latency by using cascade systems where small LMs score easy items and escalate hard ones, but effective escalation depends on reliable confidence signals.
  • The study tests verbalized confidence (having the LM output a numerical confidence alongside its prediction) as a routing signal using 2,100 expert-scored decisions from student–AI math conversations and model pairings across GPT-5.4, Claude 4.5+, and Gemini 3.1.
  • Confidence quality varies greatly across small LMs: the best small model achieved AUROC 0.857 for confidence discrimination, while the worst produced an almost degenerate confidence distribution that cannot support good routing.
  • Lower LM confidence correlates with human scoring difficulty: confidence was lower on cases where human annotators disagreed with each other and took longer to annotate.
  • When confidence discrimination is strong, the cascade can approach large-LM accuracy (kappa 0.802 vs. 0.819) while cutting cost by 76% and latency by 61%, but weak/degenerate confidence prevents closing the accuracy gap regardless of threshold.
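The routing logic described above can be sketched as a simple confidence-threshold cascade. This is a minimal illustration, not the paper's implementation: the model interfaces, item fields, and the 0.8 threshold are all hypothetical placeholders.

```python
# Hypothetical sketch of a confidence-threshold cascade.
# small_model_score / large_model_score stand in for real LM calls.

def small_model_score(item):
    # Placeholder: a small LM returns (score, verbalized confidence in [0, 1]).
    return item["small_score"], item["small_confidence"]

def large_model_score(item):
    # Placeholder: the larger, costlier LM's score.
    return item["large_score"]

def cascade_score(item, threshold=0.8):
    """Route to the large LM only when the small LM's verbalized
    confidence falls below the escalation threshold."""
    score, confidence = small_model_score(item)
    if confidence >= threshold:
        return score, "small"
    return large_model_score(item), "large"

items = [
    {"small_score": 1, "small_confidence": 0.95, "large_score": 1},
    {"small_score": 0, "small_confidence": 0.40, "large_score": 1},
]
results = [cascade_score(it) for it in items]
# The high-confidence item stays with the small LM; the low-confidence
# item escalates, so the cascade pays the large-LM cost only when needed.
```

Raising the threshold escalates more items (higher accuracy, higher cost); lowering it keeps more items with the small LM (cheaper, but only safe if low confidence reliably flags the small LM's errors).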

Abstract

Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.
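The AUROC figures above measure how well verbalized confidence separates the small LM's correct predictions from its incorrect ones. A minimal way to compute it (equivalent to the Mann-Whitney U statistic) is sketched below with toy data; the numbers are illustrative, not from the paper.

```python
# AUROC of a confidence signal: the probability that a randomly chosen
# correct prediction received strictly higher confidence than a randomly
# chosen incorrect one (ties count half).

def auroc(confidences, correct):
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: confident-when-right, unsure-when-wrong -> AUROC 1.0.
print(auroc([0.9, 0.8, 0.6, 0.4], [True, True, False, False]))

# A degenerate confidence distribution (every item gets the same value)
# yields AUROC 0.5 -- chance level, so no threshold can route well.
print(auroc([0.7, 0.7, 0.7, 0.7], [True, True, False, False]))
```

This is why the paper frames confidence discrimination as the bottleneck: a near-degenerate distribution gives the router no signal to act on, regardless of where the threshold is set.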