Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
arXiv cs.LG / 4/17/2026
Key Points
- Alignment-tuned language models can suppress factual log-probabilities for politically sensitive topics even when the underlying knowledge remains in hidden representations.
- The study proposes a small post-transformer adapter (786K parameters, ~0.02% of a Qwen3 base model) trained on frozen hidden states that restores correct log-probabilities on 31 ideology-discriminating factual items.
- The adapter memorizes all training facts and generalizes to 11–39% of held-out facts across multiple random splits and model scales; the authors attribute the absence of knowledge regressions to anchored training.
- Coherence depends on where the adapter is applied: using it only at the last/current prediction token yields coherent, less-censored text, while applying it at all token positions (or in logit space) leads to incoherent generation.
- The authors identify and fix a previously undocumented Apple MLX silent gradient bug that caused earlier null results, and they provide a minimal reproduction and guidance for adapter research in MLX.
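The setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the two-layer residual MLP form, the weight shapes, and all names here are assumptions. The key structural point from the paper is reflected in `next_token_logits`: the adapter acts on the frozen hidden state of only the last (current prediction) position, since applying it at every position, or directly in logit space, reportedly destroys coherence.

```python
# Hypothetical sketch of a post-transformer adapter on frozen hidden states.
# Architecture, dimensions, and names are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_adapter, vocab = 64, 32, 100

# Stand-in for the frozen unembedding matrix of the base model.
W_unembed = rng.normal(scale=0.02, size=(vocab, d_model))

# Small trainable adapter: h -> h + W2 @ relu(W1 @ h)
W1 = rng.normal(scale=0.02, size=(d_adapter, d_model))
W2 = rng.normal(scale=0.02, size=(d_model, d_adapter))

def adapter(h):
    """Residual correction to a single hidden-state vector."""
    return h + W2 @ np.maximum(W1 @ h, 0.0)

def next_token_logits(hidden_states, use_adapter=True):
    """hidden_states: (seq_len, d_model) outputs of the frozen transformer.
    The adapter is applied ONLY at the last position; per the paper,
    applying it at all positions (or in logit space) yields incoherent text."""
    h_last = hidden_states[-1]
    if use_adapter:
        h_last = adapter(h_last)
    return W_unembed @ h_last

hiddens = rng.normal(size=(10, d_model))
logits = next_token_logits(hiddens)
print(logits.shape)  # (100,)
```

During generation, the base model runs unchanged and the adapter's correction is re-applied at each decoding step to the newest hidden state, which is what lets the rest of the sequence stay coherent.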


