Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

arXiv cs.LG, April 28, 2026


Key Points

  • The paper argues that evaluating lifelong/continual LLM fine-tuning only by accuracy retention is incomplete, because uncertainty calibration (coverage reliability) can degrade much faster than top-1 accuracy.
  • Experiments across three model families and eight sequential task sequences show that coverage loss is on average about 3.4× larger than accuracy loss, including a case where coverage falls from 0.92 to 0.61 while accuracy stays within about 3 points of baseline.
  • The study finds that standard continual-learning methods that preserve accuracy do not necessarily preserve conformal coverage, and that naive calibration baselines recover only part of the coverage gap.
  • To address this, the authors propose “calibration replay,” a lightweight post-hoc method that keeps a small task-specific held-out buffer and refits task-specific conformal thresholds after each update, restoring coverage close to nominal with minimal memory and no training-time gradient cost.
  • The work also provides theoretical support: a drift decomposition, a finite-sample guarantee of exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds alone are insufficient; extensions to open-ended generation are left as exploratory.
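The refitting step in calibration replay can be illustrated with standard split conformal prediction. The sketch below is a minimal, hypothetical reconstruction (not the authors' code): each task keeps a small held-out buffer, and after every model update the buffer is re-scored under the current model to recompute that task's conformal quantile. Function names and the nonconformity score (one minus the probability of the true label) are illustrative assumptions.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: with m calibration scores, the
    ceil((m+1)(1-alpha))/m empirical quantile guarantees >= 1-alpha
    coverage under exchangeability."""
    m = len(cal_scores)
    q = np.ceil((m + 1) * (1 - alpha)) / m
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def calibration_replay(buffers, alpha=0.1):
    """Refit one threshold per task after a model update.
    `buffers[t]` holds (probs, labels) for task t's held-out buffer,
    where `probs` are class probabilities under the *current* model."""
    thresholds = {}
    for task, (probs, labels) in buffers.items():
        # Nonconformity score: 1 - p(true label).
        scores = 1.0 - probs[np.arange(len(labels)), labels]
        thresholds[task] = conformal_threshold(scores, alpha)
    return thresholds

def prediction_set(probs, qhat):
    """All labels whose nonconformity score falls below the threshold."""
    return np.flatnonzero(1.0 - probs <= qhat)
```

Because only probabilities for a buffer of roughly m = 200 examples per task are re-scored, the procedure adds no gradient computation, which matches the paper's claim of negligible training-time cost.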

Abstract

Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
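The mixture-validity point — that a single pooled threshold is valid for the task mixture but not for each task — can be demonstrated with a toy numeric example. The data below are synthetic and hypothetical, chosen only to make an "easy" task (low nonconformity scores) coexist with a "hard" one (high scores); they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, alpha = 1000, 20000, 0.1

# Synthetic nonconformity scores: task A is "easy", task B is "hard".
cal_a, cal_b = rng.uniform(0.0, 0.5, m), rng.uniform(0.5, 1.0, m)
test_a, test_b = rng.uniform(0.0, 0.5, n), rng.uniform(0.5, 1.0, n)

def qhat(scores, alpha):
    # Split-conformal quantile at level ceil((m+1)(1-alpha))/m.
    q = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
    return np.quantile(scores, min(q, 1.0), method="higher")

# Pooled threshold: coverage holds on the task *mixture*...
pooled = qhat(np.concatenate([cal_a, cal_b]), alpha)
cov_mix = np.mean(np.concatenate([test_a, test_b]) <= pooled)

# ...but the hard task is under-covered and the easy task over-covered.
cov_b_pooled = np.mean(test_b <= pooled)

# Per-task thresholds (what calibration replay maintains) restore
# per-task coverage.
cov_b_per_task = np.mean(test_b <= qhat(cal_b, alpha))
```

On this toy data the mixture coverage sits near the nominal 90%, while the hard task's coverage under the pooled threshold falls to roughly 80%; the per-task threshold returns it to near nominal, mirroring the paper's argument for task-specific thresholds.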