Training-time intervention yields 63.4% blind-pair human preference at matched val-loss (1.2B params, 320 judgments, p = 1.98 × 10⁻⁵) [R]

Reddit r/MachineLearning / 4/22/2026


Key Points

  • Two 1.2B-parameter LMs trained on the same data were compared using a blind human preference A/B test, with one model using a Predictive-Coding-inspired precision-weighted gain and divergence-scaled layer gradients during training instead of standard cross-entropy.
  • Despite indistinguishable smoothed validation loss between the models, judges preferred the gain-trained model in 63.4% of decisive comparisons (320 judgments; p = 1.98 × 10⁻⁵), indicating training-time interventions can shift human preference without changing aggregate loss.
  • The proposed method is optimizer- and architecture-agnostic and adds only lightweight per-step operations, with mean-normalization designed to preserve total gradient budget.
  • The author argues that mean-normalization is a key generalizable requirement, citing smaller-scale experiments where alternative gain formulations that suppress confident-token gradients or let the mean gain drift during training caused training collapse.
  • Limitations include single-seed/single-pair evaluation, training compute well short of Chinchilla-optimal, no complete mechanism ablations at 1.2B scale, and short-form-only prompts due to undertraining.

TL;DR. I ran a blind A/B preference evaluation between two 1.2B-parameter LMs trained on identical data (same order, same seed, 30K steps / 3.9B tokens) - one with a Predictive-Coding-inspired precision-weighted gain function plus per-layer divergence-scaled gradients, one with standard cross-entropy. Smoothed val loss between the two is statistically indistinguishable (0.004-nat difference, well inside step-to-step noise). Ten judges (seven humans, three foundation models across Anthropic / OpenAI / Google) gave 320 pairwise judgments. The gain-trained model was preferred in 63.4% of decisive comparisons (p = 1.98 × 10⁻⁵, two-sided binomial). Training-time intervention (not RLHF) can shift human preference meaningfully while leaving the aggregate loss metric untouched.

Method (briefly). Two composable mechanisms:

  • Per-token precision-weighted gain: gain_i = 1 + s · (ℓ_i − mean(ℓ)) / var(ℓ), applied to the per-token CE before backward. Mean-normalized by construction - total gradient budget preserved. Detached, so the gain weights don't contribute their own gradients.
  • Per-layer divergence-scaled gradients: after backward, each transformer block's parameter gradients are multiplied by a factor proportional to ‖x_out − x_in‖ / ‖x_in‖ measured during the forward pass. Also mean-normalized.
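The per-token gain above is simple enough to sketch directly. This is a hedged, illustrative NumPy version of the math only (names like `precision_gain` are mine, not from the repo); in a real training loop the gain would be a detached tensor applied to the per-token CE before backward:

```python
import numpy as np

def precision_gain(ce, s=0.5, eps=1e-8):
    # gain_i = 1 + s * (l_i - mean(l)) / var(l); centered on the batch mean,
    # so mean(gain) == 1 by construction and the total gradient budget is preserved.
    ce = np.asarray(ce, dtype=np.float64)
    return 1.0 + s * (ce - ce.mean()) / max(ce.var(), eps)

def precision_weighted_loss(ce, s=0.5):
    # In a framework with autograd, precision_gain would be computed under
    # no-grad / detached so the gain weights contribute no gradients of their own.
    gain = precision_gain(ce, s)
    return float((gain * np.asarray(ce, dtype=np.float64)).mean())
```

Note that high-loss (surprising) tokens get gain above 1 and confident tokens get gain below 1, but nothing is zeroed out.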

Both are optimizer-agnostic, architecture-agnostic, and cost a few elementwise ops per step with no measurable throughput impact.
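The layer-gain mechanism can be sketched the same way. Below is an illustrative NumPy version of just the scale computation (function name and shapes are my assumptions); the final step, multiplying each block's parameter gradients by its scale after backward, is framework-specific and shown only as a comment:

```python
import numpy as np

def divergence_scales(x_ins, x_outs, eps=1e-8):
    # One scale per transformer block, proportional to ||x_out - x_in|| / ||x_in||
    # measured during the forward pass, then mean-normalized across layers so the
    # average gradient multiplier is 1.0.
    div = np.array([np.linalg.norm(o - i) / (np.linalg.norm(i) + eps)
                    for i, o in zip(x_ins, x_outs)])
    return div / max(div.mean(), eps)

# After backward (framework-specific, e.g. in PyTorch):
# for block, scale in zip(blocks, divergence_scales(x_ins, x_outs)):
#     for p in block.parameters():
#         p.grad *= scale
```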

A finding I think generalizes even if my specific method doesn't: mean-normalization of the gain is load-bearing. Phase 1 experiments (50M params) showed that shape variants which suppress gradient on confident tokens (focal loss) or let the mean gain drift away from 1.0 over training (sigmoid) both caused training to degenerate. The working precision formulation enforces mean(gain) = 1.0 by construction via centering on the batch mean.
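A toy illustration of that contrast (my own example, not from the paper): the centered gain has mean exactly 1 for any loss distribution, while a focal-style weight (1 − p_t)^γ with p_t = exp(−CE) both suppresses confident tokens and has a distribution-dependent mean, so the total gradient budget drifts over training:

```python
import numpy as np

ce = np.array([0.2, 0.5, 1.0, 3.0])  # toy per-token CE values (nats)

# Centered precision gain: mean pinned to 1.0 regardless of the loss distribution.
prec = 1.0 + 0.5 * (ce - ce.mean()) / ce.var()

# Focal-style weight (gamma = 2): near-zero on confident tokens, and its mean
# depends on the loss distribution, so the gradient budget is not preserved.
focal = (1.0 - np.exp(-ce)) ** 2
```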

Limitations I'll pre-acknowledge to save skeptics some typing:

  • Single-seed, single-pair comparison at 1.2B. No multi-seed replication yet.
  • 16.4% of Chinchilla-optimal training (3.9B tokens for a ~1.2B-param model vs. the ~24B Chinchilla would prescribe).
  • I did not ablate the two mechanisms (token gain vs. layer gain) separately at 1.2B scale, so I can't tell you which one is doing the work. A paired ablation at 1.5B (ongoing full-Chinchilla run) does confirm that one specific detail of the layer-gain mechanism, layer 0's participation in the divergence-normalization mean, is load-bearing.
  • A/B prompts are short-form because both models are too undertrained for coherent long-form output.
  • The three foundation-model judges span three labs but share web-corpus training data. Humans and FMs did converge on the same verdict (65.3% vs 59.8% decisive preference), which I find reassuring but not dispositive.

Links

  • Paper (PDF) and method code: github.com/troycorbinz/precision-weighted-training
  • A/B evaluation webapp and raw training metrics as JSON: same repo
  • Training runs are logged in a private W&B project, but every numerical claim in the paper is independently verifiable from the JSON files under /paper/data.

One ask at the end. I'm an independent researcher without institutional affiliation, and I'd like to submit this to arXiv (cs.LG primary, cs.CL cross-list). I need one cs.LG endorsement. If you've made prior cs.LG submissions, read the paper, and feel it meets the bar, I'd appreciate an endorsement (endorsement code is ready). An honest pass is also fine if it doesn't; I'd rather hear a "no" than nothing.

Happy to answer questions, defend claims, or accept pushback.

submitted by /u/ScreamingAmish