Cross-Model Disagreement as a Label-Free Correctness Signal

arXiv cs.AI / March 27, 2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The paper addresses label-free detection of when a language model's answer is incorrect, highlighting that common uncertainty signals can fail under “confident errors,” where a model is wrong but certain.
  • It proposes cross-model disagreement as a training-free correctness indicator by having a second verifier model score the first model’s generated answer using a single forward pass.
  • It instantiates two metrics: Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME), both computed without requiring verifier generation or ground-truth correctness labels.
  • Experiments across reasoning, retrieval, and math benchmarks (MMLU, TriviaQA, GSM8K) show CMP and CME outperform within-model uncertainty baselines, with CMP reaching a mean AUROC of 0.75 on MMLU versus 0.59 for a within-model entropy baseline.
  • The authors argue the method can be directly integrated into production pipelines for routing, monitoring, selective prediction, data filtering, and scalable oversight of language model systems.
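The two metrics above can be sketched in a few lines. This is a minimal, self-contained illustration of the definitions as summarized here (verifier perplexity and mean entropy over the generator's answer tokens), not the paper's actual implementation; the toy distributions and function names are assumptions for demonstration.

```python
import math

def cross_model_perplexity(verifier_probs, answer_ids):
    """CMP: the verifier's perplexity on the generator's answer tokens.
    verifier_probs[t] is the verifier's next-token distribution at answer
    position t (probabilities over a toy vocab); answer_ids[t] is the
    token the generator actually emitted at that position."""
    log_probs = [math.log(verifier_probs[t][answer_ids[t]])
                 for t in range(len(answer_ids))]
    return math.exp(-sum(log_probs) / len(log_probs))

def cross_model_entropy(verifier_probs, answer_ids):
    """CME: the verifier's mean Shannon entropy at the answer positions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in verifier_probs[:len(answer_ids)]]
    return sum(entropies) / len(entropies)

# Toy 4-token vocabulary; the verifier reads the generator's answer.
verifier_probs = [
    [0.7, 0.1, 0.1, 0.1],      # verifier strongly expects token 0
    [0.25, 0.25, 0.25, 0.25],  # verifier is maximally uncertain
]
answer_ids = [0, 3]  # the generator's answer tokens

print(cross_model_perplexity(verifier_probs, answer_ids))
print(cross_model_entropy(verifier_probs, answer_ids))
```

A disagreeing answer token (token 3 above, to which the verifier assigns only 0.25) drives CMP up; CME instead captures how flat the verifier's distributions are, regardless of which token the generator chose. In practice both quantities come from one verifier forward pass over the generator's output, with no verifier decoding.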

Abstract

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
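The AUROC figures quoted above measure how well a score separates incorrect answers from correct ones. As a hedged sketch of that evaluation (the scores and labels below are made up, and the rank-based AUROC formulation here is standard, not taken from the paper):

```python
def auroc(scores, labels):
    """AUROC: probability that a randomly chosen positive example
    (an incorrect answer, label 1) receives a higher score than a
    randomly chosen negative example (a correct answer, label 0),
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical CMP scores for five answers; label 1 = answer was wrong.
scores = [5.2, 1.1, 3.8, 0.9, 4.5]
labels = [1, 0, 1, 0, 0]
print(auroc(scores, labels))
```

An AUROC of 0.5 means the score is uninformative; 1.0 means the score perfectly ranks every incorrect answer above every correct one, so the 0.75 vs 0.59 gap reported for CMP over within-model entropy is a substantial improvement in ranking quality.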