Identifying Influential N-grams in Confidence Calibration via Regression Analysis

arXiv cs.CL / 4/8/2026


Key Points

  • The paper studies why LLMs remain overconfident during reasoning despite containing linguistic uncertainty cues, framing confidence as a regression target linked to specific textual patterns.
  • By predicting confidence from reasoning-related n-grams and analyzing their relationships, the authors identify particular linguistic expressions that are strongly associated with overconfidence across multiple models and QA benchmarks.
  • Several of the identified cue phrases overlap with expressions deliberately inserted at test time (test-time scaling) to improve reasoning performance, suggesting a mechanistic link between prompting artifacts and confidence calibration.
  • The authors conduct causality and verification tests to show the extracted linguistic information genuinely affects confidence, not just correlates with it.
  • They conclude that confidence calibration can be achieved by suppressing the overconfident expressions while maintaining performance.
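
Calibration, as discussed in the key points, is commonly quantified with Expected Calibration Error (ECE): bin responses by stated confidence and compare each bin's average confidence to its accuracy. The sketch below uses invented toy numbers to show how an overconfident model scores; the function and data are illustrative, not taken from the paper.

```python
# Expected Calibration Error (ECE): bin predictions by confidence,
# then take the weighted gap between average confidence and accuracy.
def ece(confidences, corrects, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

# Toy overconfident model: near-certain confidence, mixed correctness.
confs = [0.9, 0.95, 0.9, 0.85, 0.9, 0.95]
correct = [1, 0, 1, 0, 1, 0]
print(round(ece(confs, correct), 3))  # → 0.408 (large gap = miscalibrated)
```

A well-calibrated model would drive this gap toward zero; the paper's claim is that suppressing overconfidence-associated expressions reduces it without hurting task accuracy.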

Abstract

While large language models (LLMs) improve performance through explicit reasoning, their responses are often overconfident, even when they include linguistic expressions of uncertainty. In this work, we identify which linguistic expressions are related to confidence by applying regression analysis. Specifically, we treat confidence as the dependent variable, predict it from n-grams appearing in the reasoning parts of LLM outputs, and analyze the relationship between specific n-grams and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted during test-time scaling to improve reasoning performance. Through causality tests verifying that the extracted linguistic information truly affects confidence, we show that calibration can be achieved simply by suppressing those overconfident expressions, without a drop in performance.
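
The abstract's core move can be sketched as an ordinary ridge regression of confidence on bag-of-n-gram features, followed by coefficient inspection. Everything below is an illustrative assumption: the toy (reasoning text, confidence) pairs, bigram features, and the pure-Python ridge solver stand in for the paper's actual models, data, and regression setup.

```python
# Hypothetical sketch: regress confidence on n-gram counts from the
# reasoning text, then rank n-grams by coefficient. Large positive
# weights flag candidate overconfidence-associated expressions.

def ngrams(text, n=2):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

# Toy (reasoning_text, confidence) pairs; all values are invented.
samples = [
    ("wait let me double check the arithmetic", 0.55),
    ("therefore the answer is clearly correct", 0.97),
    ("i am not sure but it might be", 0.40),
    ("so the answer is definitely this option", 0.95),
    ("let me verify the steps once more", 0.60),
    ("thus the final answer is certain", 0.93),
]

vocab = sorted({g for text, _ in samples for g in ngrams(text)})
index = {g: j for j, g in enumerate(vocab)}

# Design matrix: bigram counts per sample, plus an intercept column.
X = []
for text, _ in samples:
    row = [0.0] * (len(vocab) + 1)
    row[-1] = 1.0  # intercept
    for g in ngrams(text):
        row[index[g]] += 1.0
    X.append(row)
y = [conf for _, conf in samples]

def ridge(X, y, lam=0.1):
    """Solve (X^T X + lam*I) beta = X^T y by Gaussian elimination."""
    p = len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(len(X)))
          + (lam if j == k else 0.0) for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
    for col in range(p):  # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in reversed(range(p)):  # back substitution
        beta[r] = (b[r] - sum(A[r][k] * beta[k]
                              for k in range(r + 1, p))) / A[r][r]
    return beta

beta = ridge(X, y)
ranked = sorted(zip(vocab, beta[:-1]), key=lambda t: -t[1])
for gram, w in ranked[:3]:
    print(f"{gram}: {w:+.3f}")
```

In the paper's setting the same inspection would run over real model reasoning traces and elicited confidence scores; the top-weighted n-grams are then candidates for the suppression-based calibration the abstract describes.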