Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling

arXiv cs.LG · April 16, 2026


Key Points

  • Telecom-domain LLMs (tested with Gemma-3 variants) often produce biased and unreliable confidence scores, commonly showing systematic overconfidence on task answers.
  • The paper finds that single-pass, verbalized confidence estimation does not track true correctness in datasets spanning 3GPP specification analysis and O-RAN troubleshooting benchmarks.
  • It introduces a Twin-Pass Chain-of-Thought (CoT)-Ensembling approach that runs multiple independent reasoning evaluations and aggregates them into a more calibrated confidence score.
  • Experiments on TeleQnA, ORANBench, and srsRANBench show the method can reduce Expected Calibration Error (ECE) by up to 88%, improving trustworthiness of LLM self-assessment.
  • The authors position the technique as a practical route to safer verification and more reliable deployment of LLM outputs in telecommunications workflows.
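The aggregation idea in the key points can be sketched in a few lines. The function below is a hypothetical illustration, not the paper's exact Twin-Pass procedure: it samples several independent reasoning passes, majority-votes the answer, and blends the agreement rate with the mean verbalized confidence of the winning passes. The `generate` callable is a stand-in for a sampled LLM call.

```python
from collections import Counter

def ensemble_confidence(generate, question, n_passes=5, seed=0):
    """Aggregate independent CoT passes into one answer plus a
    frequency-based confidence score (a sketch of the idea only).

    `generate(question, seed)` is a hypothetical stand-in for a
    temperature-sampled LLM call returning
    (answer, verbalized_confidence).
    """
    passes = [generate(question, seed + i) for i in range(n_passes)]
    votes = Counter(answer for answer, _ in passes)
    answer, count = votes.most_common(1)[0]
    # Agreement rate across passes, blended 50/50 with the mean
    # verbalized confidence of the passes that voted for the winner.
    agreement = count / n_passes
    mean_conf = sum(c for a, c in passes if a == answer) / count
    return answer, 0.5 * (agreement + mean_conf)
```

Using self-consistency across passes, rather than a single verbalized score, is what lets disagreement between reasoning chains pull the reported confidence down toward the model's actual accuracy.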

Abstract

Large Language Models (LLMs) are increasingly applied to complex telecommunications tasks, including 3GPP specification analysis and O-RAN network troubleshooting. However, a critical limitation remains: LLM-generated confidence scores are often biased and unreliable, frequently exhibiting systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice. In this paper, we study confidence calibration in telecom-domain LLMs using the representative Gemma-3 model family (4B, 12B, and 27B parameters), evaluated on TeleQnA, ORANBench, and srsRANBench. We show that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions. To address this, we propose a novel Twin-Pass Chain-of-Thought (CoT)-Ensembling methodology for improving confidence estimation by leveraging multiple independent reasoning evaluations and aggregating their assessments into a calibrated confidence score. Our approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. These results highlight the limitations of current confidence estimation practices and demonstrate a practical path toward more trustworthy evaluation of LLM outputs in telecommunications.
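For readers unfamiliar with the metric the paper optimizes, Expected Calibration Error bins predictions by confidence and takes a weighted average of the gap between each bin's mean confidence and its accuracy. The standard equal-width-binned estimator can be computed as follows (a minimal sketch; the paper's binning details may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of |mean confidence - accuracy|,
    weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:  # put confidence == 0 in the first bin
            mask |= confidences == 0.0
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap
    return ece
```

As a sanity check, a model that always reports confidence 1.0 but is right only half the time has an ECE of 0.5, which is the kind of systematic overconfidence the abstract describes.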