Experimental design
- Models: Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Mixtral-8x7B-Instruct-v0.1
- Benchmarks: GSM8K (200 prompts), HumanEval (164 prompts)
- Design: Pass@k, k=10 per prompt (5 runs at temperature 0.7, 5 at 0.8), each graded independently
- Evaluation: Grouped 5-fold CV by question ID — no prompt appears in both train and test folds
- Scale: 14,540 total traces; 11,403 used for correctness analysis after excluding format failures
- Classifier: HistGradientBoosting with StratifiedGroupKFold
- Metric: AUROC
A prior version of this experiment used greedy decoding, which produced identical outputs per prompt and zero within-prompt variance. That design was fundamentally wrong for this question and was redesigned from scratch.
Results
Signal ablation (T1–T6):
Tiered ablation from entropy-only (T1, 1 feature) through full feature set (T6, 104 features) under grouped CV:
| Model | Dataset | T6 AUROC |
|---|---|---|
| Qwen-2.5-7B | HumanEval | 0.90 |
| Mixtral-8x7B | HumanEval | 0.82 |
| Mistral-7B | HumanEval | 0.77 |
| Mistral-7B | GSM8K | 0.67 |
| Llama-3.1-8B | GSM8K | 0.64 |
| Qwen-2.5-7B | GSM8K | 0.60 |
The tier that provides the largest gain varies by model and task. For Qwen/HumanEval, T4 (early-window features) provides the dominant jump (0.73 → 0.85). For Mistral/GSM8K, T6 underperforms T5: adding the full feature set actively hurts.
Early-window signals:
Mean surprisal over the first 10 generated tokens, used as a single feature, reaches a predictive power of 0.80 for Mixtral/HumanEval and 0.73 for Mistral/HumanEval. Ranking the k=10 candidates per prompt by this one signal raises top-1 selection accuracy:
- Mixtral/HumanEval: 15% (random) → 50% (+35 pp)
- Mistral/HumanEval: 16% → 48% (+32 pp)
- Qwen/HumanEval: 31% → 56% (+25 pp)
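The ranking step above is simple enough to sketch directly. `token_logprobs` is a hypothetical per-candidate list of generated-token log-probabilities; the names are illustrative, not the repo's API:

```python
# Rank k candidates per prompt by mean surprisal over the first 10
# generated tokens, then keep the lowest-surprisal candidate.
import numpy as np

def early_surprisal(token_logprobs, window=10):
    """Mean surprisal (negative log-prob) over the first `window` tokens."""
    lp = np.asarray(token_logprobs[:window], dtype=float)
    return float(-lp.mean())

def pick_candidate(candidates, window=10):
    """Index of the candidate with the lowest early-window surprisal."""
    scores = [early_surprisal(c, window) for c in candidates]
    return int(np.argmin(scores))

# Toy example: candidate 1 is most confident in its opening tokens.
cands = [[-2.0] * 12, [-0.5] * 12, [-1.2] * 12]
best = pick_candidate(cands)   # → 1
```

No classifier is involved here; the +25 to +35 pp gains come from sorting candidates by this single scalar.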
Confidence calibration:
Accuracy in the most confident quintile by top-k margin: Mixtral 2.8%, Mistral 6.4%, Qwen 20.4%, Llama 33.5%. Within the high-confidence subset, internal signals still achieve 0.92 AUROC (Qwen/HumanEval, compound_density_per_100t). Output confidence and internal-state signals appear to carry orthogonal information.
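The quintile analysis can be sketched as follows, assuming a per-trace top-k margin (probability gap between the two most likely outputs) and a 0/1 correctness label; the toy data here is deliberately simple, not the experiment's distribution:

```python
# Bucket traces into confidence quintiles by top-k margin and measure
# accuracy within each bucket (low → high confidence).
import numpy as np

def quintile_accuracy(margin, correct):
    """Mean accuracy within each confidence quintile, lowest margin first."""
    order = np.argsort(margin)
    folds = np.array_split(order, 5)
    return [float(np.mean(np.asarray(correct)[f])) for f in folds]

# Toy, perfectly calibrated data: correct exactly when margin > 0.5.
margin = np.linspace(0.0, 1.0, 100)
correct = (margin > 0.5).astype(int)
accs = quintile_accuracy(margin, correct)   # → [0.0, 0.0, 0.5, 1.0, 1.0]
```

In the toy data accuracy rises monotonically with confidence; the striking result in the table above is that for Mixtral and Mistral the top quintile is *less* accurate than chance would suggest.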
Architecture dependence:
MoE and dense models show fundamentally different internal signal distributions. collapsed_rate_mean separates Mixtral from all three dense models at rank-biserial −0.899. Cross-model alignment for composite risk scores is near-zero or negative (Spearman ρ ranging from −0.16 to +0.07 across model pairs on GSM8K). Per-architecture calibration appears necessary — a universal composite score does not transfer.
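The alignment check reduces to a rank correlation of per-prompt scores across model pairs. A minimal sketch with hypothetical risk scores (the values below are toy numbers, not measured ones):

```python
# Spearman rank correlation of composite risk scores for the same prompts
# scored by two different models. Near-zero or negative rho means the
# models disagree about which prompts are risky.
from scipy.stats import spearmanr

model_a = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8]   # hypothetical risk scores
model_b = [0.2, 0.6, 0.4, 0.1, 0.9, 0.3]   # same prompts, other model
rho, p_value = spearmanr(model_a, model_b)
```

A per-model (or at least per-architecture) calibration step is the practical consequence: a threshold tuned on one model's score distribution says little about another's.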
Format failure:
GSM8K format failure rates (missing #### delimiter): Mistral 72.2%, Mixtral 62.1%, Llama 17.9%, Qwen 4.5%. Internal signals predict Mistral format failures at predictive power 0.88 (hidden_max_abs_last_layer_mean) and Mixtral at 0.83 (focused_head_mean_zscore).
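The format check itself is a one-liner: GSM8K's answer convention is a `#### <answer>` delimiter, so a completion without it cannot be graded. A sketch (the regex is an assumption about what counts as a parseable answer, not the repo's exact grader):

```python
# A GSM8K completion counts as a format failure if it lacks a parseable
# "#### <answer>" delimiter.
import re

ANSWER_RE = re.compile(r"####\s*(-?[\d,.]+)")

def format_failure(completion: str) -> bool:
    """True if no '#### <number>' answer delimiter is present."""
    return ANSWER_RE.search(completion) is None

ok = format_failure("Step 1... Step 2...\n#### 42")   # → False
bad = format_failure("The answer is 42.")             # → True
```

The notable finding is not the check but that internal signals predict the failure *before* it happens, at 0.83–0.88.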
Layer analysis:
Per-layer correlation of attention entropy and L2 norm with correctness shows strong layer-specificity. Qwen layer 2 attention entropy correlates with HumanEval correctness at r = −0.484 (p ≈ 10⁻⁹⁷). Peak layers vary substantially by model and task — no universal correctness layer identified.
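The layer scan is a straightforward per-column correlation. A sketch on synthetic data, where by construction only one layer carries signal (layer indices and shapes are illustrative):

```python
# Per-layer Pearson correlation between a per-trace layer signal
# (e.g. attention entropy) and a 0/1 correctness label, then find the
# peak layer by absolute correlation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_traces, n_layers = 200, 32
signal = rng.normal(size=(n_traces, n_layers))   # toy per-layer entropies
# Plant a negative association at layer 2 only:
correct = (signal[:, 2] + rng.normal(size=n_traces) < 0).astype(int)

r_by_layer = [pearsonr(signal[:, layer], correct)[0]
              for layer in range(n_layers)]
peak = int(np.argmax(np.abs(r_by_layer)))
```

In the real data the peak layer moves with model and task, which is why a single "correctness layer" could not be fixed in advance.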
Negative results
- The built-in composite risk_score saturates at 1.0 for 94–96% of Mistral/Mixtral traces; its AUROC is near-chance in several cells, and ECE ranges from 0.24 to 0.70 before Platt scaling.
- A 25-element fingerprint vector tracked throughout the experiment turned out to be a concatenation of existing summary statistics — no independent predictive information.
- The 104-feature set collapses into approximately 47 correlated families at |r| > 0.80; a curated set of ~15 representatives preserves most predictive information.
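The redundancy analysis behind the "correlated families" count can be sketched with a greedy grouping pass. This is an illustrative reimplementation under assumed thresholds, not the repo's exact procedure:

```python
# Greedily group feature columns whose pairwise |Pearson r| exceeds 0.80;
# the first member of each group serves as its representative.
import numpy as np

def correlated_families(X, threshold=0.80):
    """Greedy grouping of columns of X by |correlation| > threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(corr.shape[0]))
    families = []
    while unassigned:
        rep = unassigned.pop(0)
        fam = [rep] + [j for j in unassigned if corr[rep, j] > threshold]
        unassigned = [j for j in unassigned if j not in fam]
        families.append(fam)
    return families

# Toy data: 9 features that are noisy copies of 3 latent variables.
rng = np.random.default_rng(3)
base = rng.normal(size=(500, 3))
X = np.column_stack([base[:, i % 3] + 0.1 * rng.normal(size=500)
                     for i in range(9)])
fams = correlated_families(X)   # → 3 families of 3 features each
```

Greedy grouping is order-dependent, so it approximates rather than exactly reproduces a clustering; for the purpose of picking representatives it is usually sufficient.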
Data and code
Full experiment (scripts, traces, analysis outputs, calibration results): Experiment directory, Validation report