Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters

arXiv cs.LG · April 17, 2026


Key Points

  • Alignment-tuned language models can suppress factual log-probabilities for politically sensitive topics even when the underlying knowledge remains in hidden representations.
  • The study proposes a small post-transformer adapter (786K parameters, ~0.02% of a Qwen3 base model) trained on frozen hidden states that restores correct log-probabilities on 31 ideology-discriminating factual items.
  • The adapter memorizes all training facts and generalizes to 11–39% of held-out facts across multiple random splits and model scales, with no knowledge regressions, which the authors attribute to anchored training.
  • Coherence depends on where the adapter is applied: using it only at the last/current prediction token yields coherent, less-censored text, while applying it at all token positions (or in logit space) leads to incoherent generation.
  • The authors identify and fix a previously undocumented Apple MLX silent gradient bug that caused earlier null results, and they provide a minimal reproduction and guidance for adapter research in MLX.
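To make the "last-position-only" intervention concrete, here is a minimal NumPy sketch of a residual linear-bottleneck adapter of the kind the paper describes, applied only at the current prediction position. The dimensions, initialization, and residual form are illustrative assumptions, not the authors' exact architecture (the paper also tests a gated SwiGLU variant, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 64, 8   # assumed toy sizes; Qwen3 uses a much larger d_model

# Frozen hidden states for one sequence: (seq_len, d_model).
# In the paper these come from the final transformer layer of a frozen model.
hidden = rng.normal(size=(5, d_model))

# Adapter parameters: the only trainable weights; the transformer stays frozen.
W_down = rng.normal(scale=0.02, size=(d_model, d_bottleneck))
W_up = rng.normal(scale=0.02, size=(d_bottleneck, d_model))

def adapter(h):
    """Residual linear-bottleneck correction: h + up(down(h))."""
    return h + (h @ W_down) @ W_up

# Last-position-only application: correct just the state that predicts the
# next token, leaving all earlier positions untouched -- the mode the paper
# reports as yielding coherent, less-censored generation. Applying adapter()
# to every row instead corresponds to the all-positions mode that broke
# coherence in their experiments.
corrected = hidden.copy()
corrected[-1] = adapter(hidden[-1])
```

During generation this correction would be recomputed at each decoding step, so only the state feeding the next-token distribution is ever modified.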

Abstract

Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11–39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern `nn.value_and_grad(model, fn)(model.parameters())` returns zero gradients without error; the correct pattern `nn.value_and_grad(model, fn)(model, data)` resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.