Relational Preference Encoding in Looped Transformer Internal States
arXiv cs.LG / 4/14/2026
Key Points
- A new arXiv study analyzes how a 2.6B "looped transformer" (Ouro-2.6B-Thinking) encodes human preferences across its iterative internal states, using the Anthropic HH-RLHF dataset with the base weights kept frozen.
- Lightweight evaluator heads trained on per-iteration hidden states reach 95.2% test accuracy in a pairwise setting, outperforming a full-batch L-BFGS probe (84.5%) while the underlying model remains unchanged.
- The authors find preference is encoded primarily in a relational manner: linear probes on pairwise differences perform well (84.5%), whereas independent nonlinear evaluators and independent classifiers are much weaker. This suggests the model encodes consistency between responses rather than directly predicting the noisy preference labels.
- Experiments and controls show that architectural and optimization details can create misleading performance ceilings for pairwise versus pointwise evaluators; the authors propose a "flip test" as a mandatory diagnostic for detecting evaluator bias and degenerate pairwise solutions.
- A cosine learning-rate "dead zone" unintentionally functioned like early stopping, with test accuracy degrading substantially by later epochs. Cross-epoch analysis indicates that antisymmetry stays stable while sign-flip rates track scorer bias.
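The setup in the first two points can be sketched in miniature: a frozen "loop body" is applied repeatedly to produce one hidden state per iteration, and a lightweight linear head reads each iteration's state. Everything below is illustrative (random weights, tiny dimensions), not the paper's actual architecture or training code.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 32, 4  # hidden size and loop count, both hypothetical

# Frozen "loop body": the same weight block reapplied at every iteration,
# standing in for the shared-weight recurrence of a looped transformer.
W = rng.normal(size=(d, d)) / np.sqrt(d)

def iterate_states(h0, steps=T):
    """Return the hidden state after each pass through the frozen loop body."""
    states, h = [], h0
    for _ in range(steps):
        h = np.tanh(h @ W)  # one loop iteration; W is never updated
        states.append(h)
    return states

h0 = rng.normal(size=(8, d))       # batch of 8 inputs
per_iter = iterate_states(h0)      # T per-iteration hidden states

# One lightweight evaluator head per iteration; only these would be trained.
heads = [rng.normal(size=d) for _ in range(T)]
scores = [s @ w_head for s, w_head in zip(per_iter, heads)]
print(len(scores), scores[0].shape)
```

The key design point mirrored here is that only the tiny per-iteration heads carry trainable parameters, so any preference signal they recover must already be present in the frozen model's internal states.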
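The "flip test" diagnostic can be sketched as follows: swap the order of each pair and check that the evaluator's predicted winner flips. The two toy evaluators below (an antisymmetric one and a degenerate one that ignores the comparison) are my own illustrative constructions, not the paper's models.

```python
import numpy as np

def flip_test(score_fn, pairs):
    """Fraction of pairs whose predicted winner flips when input order is
    swapped. A sound pairwise evaluator should flip on (nearly) every pair;
    a rate well below 1.0 signals positional bias or a degenerate solution."""
    flips = sum((score_fn(a, b) > 0) != (score_fn(b, a) > 0) for a, b in pairs)
    return flips / len(pairs)

rng = np.random.default_rng(1)
w = rng.normal(size=16)

# Antisymmetric evaluator: scores the difference, so swapping negates the score.
good = lambda a, b: float(w @ (a - b))
# Degenerate evaluator: only looks at the first argument, ignoring the pairing.
bad = lambda a, b: float(w @ a)

pairs = [(rng.normal(size=16), rng.normal(size=16)) for _ in range(500)]
good_rate = flip_test(good, pairs)
bad_rate = flip_test(bad, pairs)
print("antisymmetric evaluator flip rate:", good_rate)
print("degenerate evaluator flip rate:   ", bad_rate)
```

The degenerate evaluator flips only by chance (around half the time), while the antisymmetric one flips on essentially every pair, which is why a flip rate near 1.0 is a necessary condition for a trustworthy pairwise scorer.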
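The learning-rate "dead zone" mentioned in the last point is a generic property of cosine decay: near the end of the schedule the step size collapses toward zero, so parameters barely move and the tail of training behaves like implicit early stopping. A minimal sketch of a standard cosine schedule (the specific values are illustrative, not the paper's hyperparameters):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Standard cosine decay from lr_max to lr_min over total_steps."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

total = 1000
# Late in the schedule the LR is orders of magnitude below its peak,
# so updates in those epochs are effectively frozen.
for step in (0, 500, 900, 990):
    print(step, f"{cosine_lr(step, total):.2e}")
```

At 99% of the schedule the rate here is below a millionth of its peak, which is why resuming or extending training past that point changes the model so little.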