Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges

arXiv cs.LG / 3/19/2026

Key Points

LLMs are increasingly used as automated judges and synthetic labelers, but their stochasticity and overconfidence complicate deployment when external ground truth is limited.
The authors propose a practical calibration protocol based on controlled input interventions. Its premise is that as noise severity increases, task performance should deteriorate in a statistically significant way, which they check with a slope-based hypothesis test over repeated trials.
They implement signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data, and evaluate the approach on UCI tabular benchmarks and four text classification datasets, where behavior turns out to depend on modality.
A modality gap is observed: text-based judges degrade predictably, while the majority of tabular datasets show no statistically significant deterioration under noise. The work also provides a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
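The slope-based check described in the key points can be illustrated with a short sketch. This summary does not say which test statistic the authors use, so the example below assumes an ordinary least-squares slope with a one-sided permutation p-value; the function names and the accuracy figures are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_permutation_test(severity, score, n_perm=2000, seed=0):
    """One-sided test of H0: slope >= 0 against H1: slope < 0.

    Returns (observed_slope, p_value). The p-value is the share of score
    shuffles whose slope is at least as negative as the observed one.
    This is an assumed stand-in for the paper's test, not its definition.
    """
    rng = random.Random(seed)
    observed = ols_slope(severity, score)
    shuffled = list(score)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ols_slope(severity, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical accuracies: 5 noise levels, 4 repeated trials per level.
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
severity = [lv for lv in levels for _ in range(4)]
degrading = [0.90, 0.88, 0.91, 0.89, 0.84, 0.86, 0.83, 0.85,
             0.78, 0.80, 0.77, 0.79, 0.70, 0.72, 0.71, 0.69,
             0.62, 0.60, 0.63, 0.61]
flat = [0.55, 0.52, 0.56, 0.53, 0.54, 0.56, 0.52, 0.55,
        0.53, 0.55, 0.54, 0.52, 0.56, 0.53, 0.55, 0.54,
        0.52, 0.55, 0.54, 0.56]

slope_deg, p_deg = slope_permutation_test(severity, degrading)
slope_flat, p_flat = slope_permutation_test(severity, flat)
print(f"degrading: slope={slope_deg:.3f}, p={p_deg:.4f}")
print(f"flat:      slope={slope_flat:.3f}, p={p_flat:.4f}")
```

Under this sketch, a judge whose scores fall steadily with noise yields a negative slope and a small p-value, while a judge whose scores stay flat does not reject the null, which is the pattern the key points describe for many tabular datasets.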

Abstract

Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: if noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show no statistically significant performance deterioration, even under substantial signal-to-noise reduction. Interestingly, we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
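The two intervention types named in the abstract can be sketched as follows. The abstract does not give their exact definitions, so this example assumes additive zero-mean Gaussian noise scaled to a target SNR in decibels for a numeric column, and adjacent-character swaps as one possible lexical perturbation; `add_noise_at_snr` and `swap_adjacent_chars` are hypothetical names.

```python
import math
import random
from statistics import pvariance


def add_noise_at_snr(column, snr_db, rng):
    """Add zero-mean Gaussian noise so var(signal) / var(noise) matches snr_db.

    Assumption: SNR is defined on variances and expressed in decibels.
    """
    signal_var = pvariance(column)
    noise_var = signal_var / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_var)
    return [v + rng.gauss(0.0, sigma) for v in column]


def swap_adjacent_chars(text, rate, rng):
    """Swap two adjacent inner characters in roughly `rate` of the words.

    Words of three characters or fewer are left alone, and the first and
    last characters of a word are never moved.
    """
    out = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)


# Hypothetical inputs for illustration only.
column = [5.1, 4.9, 6.3, 5.8, 7.0, 6.4, 5.5, 6.1]
sentence = "the service was excellent and the staff were friendly"

for snr_db in (20, 10, 0):
    noisy = add_noise_at_snr(column, snr_db, random.Random(0))
    print(snr_db, [round(v, 2) for v in noisy])

for rate in (0.0, 0.5, 1.0):
    print(rate, swap_adjacent_chars(sentence, rate, random.Random(0)))
```

Lower SNR values and higher swap rates both correspond to higher noise severity, so either knob can serve as the x-axis when checking for a deterioration trend.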

Related Articles

The programming passion is melting (Dev.to)

Maximize Developer Revenue with Monetzly's Innovative API for AI Conversations (Dev.to)

Co-Activation Pattern Detection for Prompt Injection: A Mechanistic Interpretability Approach Using Sparse Autoencoders (Reddit r/LocalLLaMA)

Nvidia GTC 2026: Jensen Huang Bets $1 Trillion on the Age of the AI Factory (Dev.to)

How to Train Custom Language Models: Fine-Tuning vs Training From Scratch (2026) (Dev.to)
