TL;DR. I ran a blind A/B preference evaluation between two 1.2B-parameter LMs trained on identical data (same order, same seed, 30K steps / 3.9B tokens) - one with a Predictive-Coding-inspired precision-weighted gain function plus per-layer divergence-scaled gradients, one with standard cross-entropy. Smoothed val loss between the two is statistically indistinguishable (0.004-nat difference, well inside step-to-step noise). Ten judges (seven humans, three foundation models across Anthropic / OpenAI / Google) gave 320 pairwise judgments. The gain-trained model was preferred in 63.4% of decisive comparisons (p = 1.98 × 10⁻⁵, two-sided binomial). In other words, a training-time intervention (not RLHF) can shift human preference meaningfully while leaving the aggregate loss metric untouched.
Method (briefly). Two composable mechanisms:
- Per-token precision-weighted gain: gain_i = 1 + s · (ℓ_i − mean(ℓ)) / var(ℓ), applied to the per-token CE before backward. Mean-normalized by construction, so the total gradient budget is preserved. Detached, so the gain weights don't contribute their own gradients.
- Per-layer divergence-scaled gradients: after backward, each transformer block's parameter gradients are multiplied by a factor proportional to ‖x_out − x_in‖ / ‖x_in‖, measured during the forward pass. Also mean-normalized.
Both are optimizer-agnostic, architecture-agnostic, and cost a few elementwise ops per step with no measurable throughput impact.
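For concreteness, here's a minimal PyTorch sketch of the two mechanisms as described above. This is illustrative only - function and variable names are mine, not from the released code:

```python
import torch
import torch.nn as nn

def precision_weighted_ce(per_token_ce: torch.Tensor, s: float = 0.5) -> torch.Tensor:
    # gain_i = 1 + s * (l_i - mean(l)) / var(l), computed on detached losses
    # so the gain itself contributes no gradient of its own.
    l = per_token_ce.detach()
    gain = 1.0 + s * (l - l.mean()) / l.var().clamp_min(1e-8)
    # mean(gain) == 1 by construction: the total gradient budget is preserved.
    return (gain * per_token_ce).mean()

def apply_divergence_scaling(blocks, rel_divergences):
    # rel_divergences[i] = ||x_out - x_in|| / ||x_in|| for block i, recorded
    # during the forward pass. Mean-normalize so the average factor is 1,
    # then rescale each block's parameter gradients after backward().
    d = torch.tensor(rel_divergences)
    factors = d / d.mean().clamp_min(1e-8)
    for block, f in zip(blocks, factors.tolist()):
        for p in block.parameters():
            if p.grad is not None:
                p.grad.mul_(f)
```

In the training loop, the first function replaces the usual mean-reduction of the CE, and the second runs between `loss.backward()` and `optimizer.step()`.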
A finding I think generalizes even if my specific method doesn't: mean-normalization of the gain is load-bearing. In Phase 1 experiments (50M params), shape variants that suppress gradient on confident tokens (focal loss) or let gain/mean drift away from 1.0 over training (sigmoid) both destabilized training. The working precision formulation enforces gain/mean = 1.0 by construction via centering on the batch mean.
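A toy illustration of why the centering matters (my framing here, not a Phase 1 reproduction): a centered gain has mean exactly 1 for any loss distribution, while a focal-style gain's mean drifts with the distribution, which silently rescales the effective learning rate.

```python
import torch

torch.manual_seed(0)
losses = torch.rand(1000) * 2.0  # stand-in for per-token CE values

s = 0.5
# Centered gain: mean(l - mean(l)) == 0, so mean(gain) == 1 exactly,
# independent of what the loss distribution looks like.
centered = 1.0 + s * (losses - losses.mean()) / losses.var()

# Focal-style gain, (1 - p)^2 with p = exp(-l): suppresses confident
# tokens, but its mean depends on the loss distribution and sits well
# below 1, shrinking the total gradient budget.
focal = (1.0 - torch.exp(-losses)) ** 2

print(centered.mean().item(), focal.mean().item())
```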
Limitations I'll pre-acknowledge to save skeptics some typing:
- Single-seed, single-pair comparison at 1.2B. No multi-seed replication yet.
- 16.4% of Chinchilla-optimal training (3.9B tokens for a ~1.2B-param model vs. ~24B Chinchilla would prescribe).
- I did not ablate the two mechanisms (token gain vs. layer gain) separately at 1.2B scale, so I can't tell you which one is doing the work. A paired ablation at 1.5B (full-Chinchilla run, ongoing) does confirm that one specific detail of the layer-gain mechanism, layer 0's participation in the divergence-normalization mean, is load-bearing.
- A/B prompts are short-form because both models are too undertrained for coherent long-form output.
- The three foundation-model judges span three labs but share web-corpus training data. Humans and FMs did converge on the same verdict (65.3% vs 59.8% decisive preference), which I find reassuring but not dispositive.
Links
- Paper (PDF) and method code: github.com/troycorbinz/precision-weighted-training
- A/B evaluation webapp and raw training metrics as JSON: same repo
- Training runs are logged in a private W&B project, but every numerical claim in the paper is independently verifiable from the JSON files under /paper/data.
One ask at the end. I'm an independent researcher without institutional affiliation, and I'd like to submit this to arXiv (cs.LG primary, cs.CL cross-list). I need one cs.LG endorsement. If you've made prior cs.LG submissions, read the paper, and feel it meets the bar, I'd appreciate an endorsement - endorsement code is ready. Honest pass is also fine if it doesn't; I'd rather hear a "no" than nothing.
Happy to answer questions, defend claims, or accept pushback.


