Experimental design
- Models: Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Mixtral-8x7B-Instruct-v0.1
- Benchmarks: GSM8K (200 prompts), HumanEval (164 prompts)
- Design: Pass@k, k=10 per prompt (5 runs at temperature 0.7, 5 at 0.8), each graded independently
- Evaluation: Grouped 5-fold CV by question ID — no prompt appears in both train and test folds
- Scale: 14,540 total traces; 11,403 used for correctness analysis after excluding format failures
- Classifier: HistGradientBoosting with StratifiedGroupKFold
- Metric: AUROC
A prior version of this experiment used greedy decoding, which produced identical outputs per prompt and zero within-prompt variance. That design was fundamentally wrong for this question and was redesigned from scratch.
Results
Signal ablation (T1–T6):
Tiered ablation from entropy-only (T1, 1 feature) through full feature set (T6, 104 features) under grouped CV:
| Model | Dataset | T6 AUROC |
|---|---|---|
| Qwen-2.5-7B | HumanEval | 0.90 |
| Mixtral-8x7B | HumanEval | 0.82 |
| Mistral-7B | HumanEval | 0.77 |
| Mistral-7B | GSM8K | 0.67 |
| Llama-3.1-8B | GSM8K | 0.64 |
| Qwen-2.5-7B | GSM8K | 0.60 |
The tier that provides the largest gain varies by model and task. For Qwen/HumanEval, T4 (early-window features) provides the dominant jump (0.73 → 0.85). For Mistral/GSM8K, T6 underperforms T5: adding the full feature set actively hurts.
Early-window signals:
Mean surprisal over the first 10 generated tokens, used as a single feature, reaches a predictive power of 0.80 for Mixtral/HumanEval and 0.73 for Mistral/HumanEval. Ranking the k=10 candidates per prompt by this one signal raises top-1 selection accuracy:
- Mixtral/HumanEval: 15% (random) → 50% (+35 pp)
- Mistral/HumanEval: 16% → 48% (+32 pp)
- Qwen/HumanEval: 31% → 56% (+25 pp)
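The ranking step above is simple enough to sketch directly. `token_logprobs` is a hypothetical per-candidate list of generated-token log-probabilities; the names are illustrative, not the repo's API:

```python
# Rank k candidates per prompt by mean surprisal over the first 10
# generated tokens, then keep the lowest-surprisal candidate.
import numpy as np

def early_surprisal(token_logprobs, window=10):
    """Mean surprisal (negative log-prob) over the first `window` tokens."""
    lp = np.asarray(token_logprobs[:window], dtype=float)
    return float(-lp.mean())

def pick_candidate(candidates, window=10):
    """Index of the candidate with the lowest early-window surprisal."""
    scores = [early_surprisal(c, window) for c in candidates]
    return int(np.argmin(scores))

# Toy example: candidate 1 is most confident in its opening tokens.
cands = [[-2.0] * 12, [-0.5] * 12, [-1.2] * 12]
best = pick_candidate(cands)   # → 1
```

No classifier is involved here; the +25 to +35 pp gains come from sorting candidates by this single scalar.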
Confidence calibration:
Accuracy in the most confident quintile by top-k margin: Mixtral 2.8%, Mistral 6.4%, Qwen 20.4%, Llama 33.5%. Within the high-confidence subset, internal signals still achieve 0.92 AUROC (Qwen/HumanEval, compound_density_per_100t). Output confidence and internal-state signals appear to carry orthogonal information.
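The quintile analysis can be sketched as follows, assuming a per-trace top-k margin (probability gap between the two most likely outputs) and a 0/1 correctness label; the toy data here is deliberately simple, not the experiment's distribution:

```python
# Bucket traces into confidence quintiles by top-k margin and measure
# accuracy within each bucket (low → high confidence).
import numpy as np

def quintile_accuracy(margin, correct):
    """Mean accuracy within each confidence quintile, lowest margin first."""
    order = np.argsort(margin)
    folds = np.array_split(order, 5)
    return [float(np.mean(np.asarray(correct)[f])) for f in folds]

# Toy, perfectly calibrated data: correct exactly when margin > 0.5.
margin = np.linspace(0.0, 1.0, 100)
correct = (margin > 0.5).astype(int)
accs = quintile_accuracy(margin, correct)   # → [0.0, 0.0, 0.5, 1.0, 1.0]
```

In the toy data accuracy rises monotonically with confidence; the striking result in the table above is that for Mixtral and Mistral the top quintile is *less* accurate than chance would suggest.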
Architecture dependence:
MoE and dense models show fundamentally different internal signal distributions. collapsed_rate_mean separates Mixtral from all three dense models at rank-biserial −0.899. Cross-model alignment for composite risk scores is near-zero or negative (Spearman ρ ranging from −0.16 to +0.07 across model pairs on GSM8K). Per-architecture calibration appears necessary — a universal composite score does not transfer.
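The alignment check reduces to a rank correlation of per-prompt scores across model pairs. A minimal sketch with hypothetical risk scores (the values below are toy numbers, not measured ones):

```python
# Spearman rank correlation of composite risk scores for the same prompts
# scored by two different models. Near-zero or negative rho means the
# models disagree about which prompts are risky.
from scipy.stats import spearmanr

model_a = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8]   # hypothetical risk scores
model_b = [0.2, 0.6, 0.4, 0.1, 0.9, 0.3]   # same prompts, other model
rho, p_value = spearmanr(model_a, model_b)
```

A per-model (or at least per-architecture) calibration step is the practical consequence: a threshold tuned on one model's score distribution says little about another's.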
Format failure:
GSM8K format failure rates (missing #### delimiter): Mistral 72.2%, Mixtral 62.1%, Llama 17.9%, Qwen 4.5%. Internal signals predict Mistral format failures at predictive power 0.88 (hidden_max_abs_last_layer_mean) and Mixtral at 0.83 (focused_head_mean_zscore).
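The format check itself is a one-liner: GSM8K's answer convention is a `#### <answer>` delimiter, so a completion without it cannot be graded. A sketch (the regex is an assumption about what counts as a parseable answer, not the repo's exact grader):

```python
# A GSM8K completion counts as a format failure if it lacks a parseable
# "#### <answer>" delimiter.
import re

ANSWER_RE = re.compile(r"####\s*(-?[\d,.]+)")

def format_failure(completion: str) -> bool:
    """True if no '#### <number>' answer delimiter is present."""
    return ANSWER_RE.search(completion) is None

ok = format_failure("Step 1... Step 2...\n#### 42")   # → False
bad = format_failure("The answer is 42.")             # → True
```

The notable finding is not the check but that internal signals predict the failure *before* it happens, at 0.83–0.88.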
Layer analysis:
Per-layer correlation of attention entropy and L2 norm with correctness shows strong layer-specificity. Qwen layer 2 attention entropy correlates with HumanEval correctness at r = −0.484 (p ≈ 10⁻⁹⁷). Peak layers vary substantially by model and task — no universal correctness layer identified.
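The layer scan is a straightforward per-column correlation. A sketch on synthetic data, where by construction only one layer carries signal (layer indices and shapes are illustrative):

```python
# Per-layer Pearson correlation between a per-trace layer signal
# (e.g. attention entropy) and a 0/1 correctness label, then find the
# peak layer by absolute correlation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_traces, n_layers = 200, 32
signal = rng.normal(size=(n_traces, n_layers))   # toy per-layer entropies
# Plant a negative association at layer 2 only:
correct = (signal[:, 2] + rng.normal(size=n_traces) < 0).astype(int)

r_by_layer = [pearsonr(signal[:, layer], correct)[0]
              for layer in range(n_layers)]
peak = int(np.argmax(np.abs(r_by_layer)))
```

In the real data the peak layer moves with model and task, which is why a single "correctness layer" could not be fixed in advance.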
Negative results
- The built-in composite risk_score saturates at 1.0 for 94–96% of Mistral/Mixtral traces; its AUROC is near-chance in several cells, and ECE ranges from 0.24 to 0.70 before Platt scaling.
- A 25-element fingerprint vector tracked throughout the experiment turned out to be a concatenation of existing summary statistics — no independent predictive information.
- The 104-feature set collapses into approximately 47 correlated families at |r| > 0.80; a curated set of ~15 representatives preserves most predictive information.
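The redundancy analysis behind the "correlated families" count can be sketched with a greedy grouping pass. This is an illustrative reimplementation under assumed thresholds, not the repo's exact procedure:

```python
# Greedily group feature columns whose pairwise |Pearson r| exceeds 0.80;
# the first member of each group serves as its representative.
import numpy as np

def correlated_families(X, threshold=0.80):
    """Greedy grouping of columns of X by |correlation| > threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(corr.shape[0]))
    families = []
    while unassigned:
        rep = unassigned.pop(0)
        fam = [rep] + [j for j in unassigned if corr[rep, j] > threshold]
        unassigned = [j for j in unassigned if j not in fam]
        families.append(fam)
    return families

# Toy data: 9 features that are noisy copies of 3 latent variables.
rng = np.random.default_rng(3)
base = rng.normal(size=(500, 3))
X = np.column_stack([base[:, i % 3] + 0.1 * rng.normal(size=500)
                     for i in range(9)])
fams = correlated_families(X)   # → 3 families of 3 features each
```

Greedy grouping is order-dependent, so it approximates rather than exactly reproduces a clustering; for the purpose of picking representatives it is usually sufficient.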
Data and code
Full experiment (scripts, traces, analysis outputs, calibration results): Experiment directory, Validation report