Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
arXiv cs.CL · April 23, 2026
Key Points
- Automated grading at scale can reduce cost and latency by using cascade systems where small LMs score easy items and escalate hard ones, but effective escalation depends on reliable confidence signals.
- The study tests verbalized confidence (having the LM output a numerical confidence alongside its prediction) as a routing signal using 2,100 expert-scored decisions from student–AI math conversations and model pairings across GPT-5.4, Claude 4.5+, and Gemini 3.1.
- Confidence quality varies greatly across small LMs: the best small model achieved AUROC 0.857 for confidence discrimination, while the worst produced an almost degenerate confidence distribution that cannot support good routing.
- Lower LM confidence correlates with human scoring difficulty: items where the model reported low confidence were also the items where human annotators disagreed with each other and took longer to annotate.
- When confidence discrimination is strong, the cascade can approach large-LM accuracy (kappa 0.802 vs. 0.819) while cutting cost by 76% and latency by 61%; when confidence is weak or degenerate, no choice of escalation threshold closes the accuracy gap.
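The two quantities driving these results are (1) how well verbalized confidence separates correct from incorrect small-LM scores, measured by AUROC, and (2) the threshold rule that escalates low-confidence items to the large model. The sketch below illustrates both with plain-Python helpers; the function names and toy data are hypothetical, not from the paper, and the AUROC is computed via the standard Mann–Whitney pairwise-comparison identity.

```python
def auroc(confidences, correct):
    """AUROC of confidence as a predictor of correctness.

    Equals the fraction of (correct, incorrect) pairs where the
    correct prediction received the higher confidence (ties = 0.5).
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def route(confidences, threshold):
    """Cascade rule: keep the small LM's score when its verbalized
    confidence clears the threshold, else escalate to the large LM."""
    return ["small" if c >= threshold else "large" for c in confidences]


# Toy illustration: confidence mostly ranks correct items higher,
# with one tie between a correct and an incorrect prediction.
confs = [0.9, 0.5, 0.5, 0.2]
corr = [1, 1, 0, 0]
print(auroc(confs, corr))           # 0.875
print(route(confs, threshold=0.7))  # ['small', 'large', 'large', 'large']
```

Raising the threshold escalates more items, trading cost and latency back toward the large model; with a degenerate confidence distribution (nearly all items at the same value), every threshold either escalates everything or nothing, which is why routing fails for the weakest small model.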