Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges

arXiv cs.LG / 3/19/2026

Key Points

  • LLMs are increasingly used as automated judges and synthetic labelers, but their stochasticity and overconfidence complicate deployment when external ground truth is limited.
  • The authors propose a practical calibration protocol based on controlled input interventions, asserting that increasing noise severity should lead to a statistically significant deterioration in task performance, evaluated via a slope-based hypothesis test over repeated trials.
  • They implement SNR perturbations for tabular data and lexical perturbations for text data, and validate the approach across UCI tabular benchmarks and four text classification datasets, revealing modality-dependent behavior.
  • A modality gap emerges: text-based judges degrade predictably, while most tabular datasets show no statistically significant deterioration even under substantial SNR reduction. The work also provides a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
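
To make the two intervention types concrete, here is a minimal Python sketch of what an SNR perturbation and a lexical perturbation could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names `add_noise_at_snr` and `perturb_text`, the use of additive Gaussian noise, the decibel parameterization of SNR, and the character-replacement scheme are all hypothetical choices made for this example.

```python
import math
import random
import statistics


def add_noise_at_snr(column, snr_db, rng):
    """Add zero-mean Gaussian noise to a numeric column.

    Assumption: SNR is defined as var(signal) / var(noise), expressed in dB.
    Lower snr_db means a more severe perturbation.
    """
    signal_var = statistics.pvariance(column)
    noise_std = math.sqrt(signal_var / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in column]


def perturb_text(text, rate, rng):
    """Replace each alphabetic character with a random lowercase letter
    with probability `rate` (an assumed, simple lexical perturbation)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical feature column and sentence, perturbed at increasing severity.
    feature = [1.0, 2.5, 3.1, 4.8, 5.2]
    for snr_db in (20, 10, 0):
        print(snr_db, add_noise_at_snr(feature, snr_db, rng))
    for rate in (0.0, 0.1, 0.3):
        print(rate, perturb_text("the service was excellent", rate, rng))
```

Sweeping `snr_db` downward or `rate` upward yields the ordered severity levels that a deterioration trend can then be measured against.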

Abstract

Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: if noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show a lack of statistically significant performance deterioration even under significant signal-to-noise reduction. Interestingly, we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
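
The slope-based test described in the abstract can be sketched roughly as follows. The abstract does not specify the exact test statistic or null distribution, so this example swaps in a one-sided permutation test on the ordinary least-squares slope of score against noise severity. The function names (`ols_slope`, `slope_permutation_test`), the number of permutations, and the accuracy values are hypothetical and chosen only for illustration.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_permutation_test(severities, scores, n_perm=2000, seed=0):
    """One-sided test of H0: no deterioration vs. H1: slope < 0.

    Assumption: a permutation null (scores shuffled across severity levels)
    stands in for whatever test the paper actually uses.
    Returns (observed_slope, p_value).
    """
    rng = random.Random(seed)
    observed = ols_slope(severities, scores)
    shuffled = list(scores)
    at_least_as_negative = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ols_slope(severities, shuffled) <= observed:
            at_least_as_negative += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (at_least_as_negative + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical accuracies: three repeated trials at each of four severities.
    severities = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
    scores = [0.90, 0.88, 0.91, 0.80, 0.82, 0.79,
              0.70, 0.72, 0.69, 0.60, 0.61, 0.58]
    slope, p_value = slope_permutation_test(severities, scores)
    print(f"slope={slope:.3f}, p={p_value:.4f}")
```

Under this reading, a judge counts as responsive to the intervention when the slope is negative and the p-value falls below a chosen significance level; a dataset where the null cannot be rejected would fall on the insensitive side of the modality gap.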