AI Navigate

[R] Internal transformer signals predict generation correctness: a 14,540-trace empirical study across 4 models and 2 benchmarks

Reddit r/MachineLearning / 3/17/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The study analyzes internal transformer signals to predict generation correctness across four models and two benchmarks, using a Pass@k design (k = 10 samples per prompt at temperatures 0.7 and 0.8) and grouped cross-validation so that no prompt leaks between train and test folds.
  • Of 14,540 total traces, 11,403 were used for correctness analysis after excluding format failures; correctness was predicted with a HistGradientBoosting classifier under StratifiedGroupKFold and evaluated by AUROC.
  • Results show that the most informative signal tier depends on model/task: for Qwen-HumanEval, early-window features provide the dominant gain (T4), while for Mistral-GSM8K, full feature sets can hurt performance compared to earlier tiers.
  • Early-window mean surprisal over the first 10 generated tokens yields 0.80 AUROC for Mixtral-HumanEval and 0.73 for Mistral-HumanEval, and ranking the k = 10 candidates per prompt by this signal substantially improves the odds of selecting a correct generation.
  • Even among the most confident outputs, internal signals remain predictive (e.g., 0.92 AUROC for Qwen-HumanEval), suggesting internal-state signals carry information orthogonal to output confidence.

Experimental design

  • Models: Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Mixtral-8x7B-Instruct-v0.1
  • Benchmarks: GSM8K (200 prompts), HumanEval (164 prompts)
  • Design: Pass@k, k=10 per prompt (5 runs at temperature 0.7, 5 at 0.8), each graded independently
  • Evaluation: Grouped 5-fold CV by question ID — no prompt appears in both train and test folds
  • Scale: 14,540 total traces; 11,403 used for correctness analysis after excluding format failures
  • Classifier: HistGradientBoosting with StratifiedGroupKFold
  • Metric: AUROC

A prior version of this experiment used greedy decoding, which produced identical outputs per prompt and therefore zero within-prompt variance. That design could not answer this question, so the experiment was redesigned from scratch.

Results

Signal ablation (T1–T6):

Tiered ablation from entropy-only (T1, 1 feature) through full feature set (T6, 104 features) under grouped CV:

| Model | Dataset | T6 AUROC |
| --- | --- | --- |
| Qwen-2.5-7B | HumanEval | 0.90 |
| Mixtral-8x7B | HumanEval | 0.82 |
| Mistral-7B | HumanEval | 0.77 |
| Mistral-7B | GSM8K | 0.67 |
| Llama-3.1-8B | GSM8K | 0.64 |
| Qwen-2.5-7B | GSM8K | 0.60 |

Which tier provides the largest gain varies by model/task. For Qwen/HumanEval, T4 (early-window features) provides the dominant jump (0.73 → 0.85). For Mistral/GSM8K, T6 underperforms T5 — adding the full feature set hurts.

Early-window signals:

Mean surprisal over the first 10 generated tokens achieves 0.80 AUROC for Mixtral/HumanEval and 0.73 for Mistral/HumanEval as a standalone predictor. Ranking the k = 10 candidates per prompt by this single signal lifts selection accuracy:

  • Mixtral/HumanEval: 15% (random) → 50% (+35 pp)
  • Mistral/HumanEval: 16% → 48% (+32 pp)
  • Qwen/HumanEval: 31% → 56% (+25 pp)
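The ranking rule above can be sketched directly, assuming per-token log-probs are available for each candidate. The candidate names and values below are hypothetical:

```python
def mean_surprisal(token_logprobs, window=10):
    """Mean surprisal (negative log-prob) over the first `window` generated tokens."""
    head = token_logprobs[:window]
    return -sum(head) / len(head)

# Hypothetical per-token log-probs for three candidate generations of one prompt.
candidates = {
    "cand_a": [-0.2, -0.1, -0.3, -0.2, -0.1, -0.4, -0.2, -0.1, -0.3, -0.2],
    "cand_b": [-1.5, -2.0, -1.0, -1.8, -2.2, -1.4, -1.9, -1.1, -1.6, -2.0],
    "cand_c": [-0.8, -0.6, -0.9, -0.7, -0.5, -0.8, -0.6, -0.9, -0.7, -0.8],
}

# Rank candidates by early-window mean surprisal, lowest (most confident) first.
ranked = sorted(candidates, key=lambda c: mean_surprisal(candidates[c]))
best = ranked[0]
```

Selecting `best` instead of a random candidate is exactly the best-of-k reranking whose accuracy gains are reported above.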

Confidence calibration:

Accuracy in the most confident quintile by top-k margin: Mixtral 2.8%, Mistral 6.4%, Qwen 20.4%, Llama 33.5%. Within the high-confidence subset, internal signals still achieve 0.92 AUROC (Qwen/HumanEval, compound_density_per_100t). Output confidence and internal-state signals appear to carry orthogonal information.
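The quintile analysis can be sketched as follows, assuming `margin` holds the top-k probability margin per trace and `signal` a scalar internal-state feature; all values here are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 500
# Hypothetical per-trace confidence: top-k probability margin p(top1) - p(top2).
margin = rng.uniform(0.0, 0.9, size=n)
correct = rng.random(n) < 0.3                     # synthetic correctness labels
signal = correct + rng.normal(scale=0.8, size=n)  # synthetic internal-state feature

# Accuracy within the most confident quintile by top-k margin.
order = np.argsort(margin)
top_quintile = order[-n // 5:]
acc_top = float(correct[top_quintile].mean())

# AUROC of the internal signal restricted to the high-confidence subset.
auc_high_conf = roc_auc_score(correct[top_quintile], signal[top_quintile])
```

The point of the restriction is the second number: if `auc_high_conf` stays well above chance inside the top quintile, the internal signal is adding information that the output-confidence margin does not already carry.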

Architecture dependence:

MoE and dense models show fundamentally different internal signal distributions. collapsed_rate_mean separates Mixtral from all three dense models at rank-biserial −0.899. Cross-model alignment for composite risk scores is near-zero or negative (Spearman ρ ranging from −0.16 to +0.07 across model pairs on GSM8K). Per-architecture calibration appears necessary — a universal composite score does not transfer.
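Cross-model alignment of composite risk scores reduces to a rank correlation over shared prompts. A sketch with synthetic, independent scores, which is roughly what near-zero alignment looks like:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_prompts = 200
# Hypothetical per-prompt composite risk scores from two different models.
risk_model_a = rng.normal(size=n_prompts)
risk_model_b = rng.normal(size=n_prompts)  # independent, so alignment ~ 0

rho, pval = spearmanr(risk_model_a, risk_model_b)
```

A `rho` near zero (or negative) across model pairs is what motivates the per-architecture calibration conclusion: a score fitted on one model's internals does not rank another model's prompts the same way.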

Format failure:

GSM8K format failure rates (missing #### delimiter): Mistral 72.2%, Mixtral 62.1%, Llama 17.9%, Qwen 4.5%. Internal signals predict Mistral format failures at 0.88 AUROC (hidden_max_abs_last_layer_mean) and Mixtral format failures at 0.83 (focused_head_mean_zscore).
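The failure condition itself is simple: GSM8K grading expects the final answer after a `####` delimiter, and an output without it cannot be graded. A minimal sketch (the helper names are mine, not the study's):

```python
def has_gsm8k_answer(output: str) -> bool:
    """GSM8K grading expects a final numeric answer after a '####' delimiter."""
    return "####" in output

def extract_answer(output: str):
    """Return the graded answer string, or None on a format failure."""
    if not has_gsm8k_answer(output):
        return None
    return output.rsplit("####", 1)[1].strip()

ok = extract_answer("Step 1: ... Step 2: ...\n#### 42")  # -> "42"
bad = extract_answer("The answer is 42.")                # -> None (format failure)
```

The interesting result is not the check but that internal signals predict this binary outcome before the delimiter is (or is not) emitted.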

Layer analysis:

Per-layer correlation of attention entropy and L2 norm with correctness shows strong layer-specificity. Qwen layer 2 attention entropy correlates with HumanEval correctness at r = −0.484 (p ≈ 10⁻⁹⁷). Peak layers vary substantially by model and task; no universal "correctness layer" was identified.
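The per-layer analysis amounts to correlating each layer's summary statistic with the correctness label and looking for the peak. A sketch on synthetic data where, by construction, only layer 2 carries (negative) signal:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n_traces, n_layers = 400, 24
# Hypothetical per-layer mean attention entropy for each trace.
entropy = rng.normal(size=(n_traces, n_layers))
# Synthetic correctness labels: only layer 2's entropy carries (negative) signal.
correct = ((-entropy[:, 2] + rng.normal(size=n_traces)) > 0).astype(int)

per_layer_r = [pearsonr(entropy[:, layer], correct)[0] for layer in range(n_layers)]
peak_layer = int(np.argmin(per_layer_r))  # layer with the most negative correlation
```

On real traces the peak layer shifts with model and task, which is exactly the "no universal correctness layer" observation.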

Negative results

The built-in composite risk_score saturates at 1.0 for 94–96% of Mistral/Mixtral traces, and its AUROC is near-chance in several cells; ECE ranges from 0.24 to 0.70 before Platt scaling. A 25-element fingerprint vector tracked throughout the experiment turned out to be a concatenation of existing summary statistics, contributing no independent predictive information. The 104-feature set collapses into approximately 47 correlated families at |r| > 0.80; a curated set of ~15 representatives preserves most of the predictive information.
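The collapse into correlated families can be approximated with a greedy grouping at |r| > 0.80. A sketch on synthetic features built as three groups of near-duplicates (the grouping rule is my illustration, not necessarily the study's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
base = rng.normal(size=(n, 3))  # three independent underlying signals
# Nine synthetic features: three noisy near-duplicates of each signal.
X = np.column_stack(
    [base[:, i] + rng.normal(scale=0.1, size=n) for i in range(3) for _ in range(3)]
)

corr = np.corrcoef(X, rowvar=False)
threshold = 0.80
# Greedy grouping: a feature joins the first family whose representative
# (the family's first member) it correlates with at |r| > threshold.
families = []
for j in range(X.shape[1]):
    for fam in families:
        if abs(corr[j, fam[0]]) > threshold:
            fam.append(j)
            break
    else:
        families.append([j])
representatives = [fam[0] for fam in families]
```

Keeping only `representatives` is the curated-subset idea: one feature per correlated family retains most of the predictive information at a fraction of the dimensionality.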

Data and code

Full experiment (scripts, traces, analysis outputs, calibration results): Experiment directory, Validation report

submitted by /u/Ok_Exercise_7895