How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models

Reddit r/LocalLLaMA / 3/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • A new arXiv paper analyzes how political censorship is implemented inside multiple Chinese-origin LLMs, showing that “refusal rates” can be misleading because some newer Qwen models stop refusing and instead always respond with strongly steered CCP framing.
  • For Qwen3-8B, removing the model’s political-sensitivity direction causes substantial confabulation (e.g., swapping historical events), implying that the censorship mechanism is entangled with factual knowledge representations in that model.
  • In contrast, similar ablation on GLM, DeepSeek, and Phi yields accurate factual outputs with no wrong-event confabulations, suggesting different internal architectures for handling political sensitivity.
  • For Yi-1.5-9B, probes detect political content at multiple layers but the model neither refuses nor steers, indicating concept detection and behavioral “routing” are learned independently and may be separated in training.
  • Cross-model tests show the relevant “political direction” is not universal (e.g., Qwen3-8B’s direction does not transfer meaningfully to GLM-4-9B), and a larger 46-model screening finds strong CCP-specific discrimination only in a small subset, highlighting fragility of small-sample conclusions.

New paper studying the internal mechanisms of political censorship in Chinese-origin LLMs: https://arxiv.org/abs/2603.18280

Findings relevant to this community:

On Qwen/Alibaba - the generational shift: Across Qwen2.5-7B → Qwen3-8B → Qwen3.5-4B → Qwen3.5-9B, hard refusal went from 6.2% to 25% to 0% to 0%. But steering (CCP narrative framing) rose from 4.33/5 to 5.00/5 over the same period. The newest Qwen models don't refuse - they answer everything in maximally steered language. Any evaluation that counts refusals would conclude Qwen3.5 is less censored. It isn't.
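To make that distinction concrete, here is a minimal sketch of an evaluation that tracks both metrics side by side. The `generate` and `steering_score` callables are hypothetical stand-ins (the paper's actual judging rubric isn't reproduced here), and the keyword refusal detector is a crude placeholder, not the paper's classifier.

```python
# Sketch: score refusal rate AND steering together, so a model that answers
# everything in steered language doesn't look "less censored".

REFUSAL_MARKERS = [
    "i cannot", "i can't", "i'm unable", "as an ai", "cannot assist",
]

def is_refusal(text: str) -> bool:
    """Crude keyword heuristic for hard refusals (placeholder, not the paper's detector)."""
    t = text.lower()
    return any(m in t for m in REFUSAL_MARKERS)

def evaluate(model, prompts, generate, steering_score):
    """Return refusal rate and mean steering score (1-5) over a prompt set.

    generate(model, prompt) -> str and steering_score(text) -> float
    are assumed interfaces, not part of the paper.
    """
    outputs = [generate(model, p) for p in prompts]
    refusal_rate = sum(is_refusal(o) for o in outputs) / len(outputs)
    mean_steering = sum(steering_score(o) for o in outputs) / len(outputs)
    return {"refusal_rate": refusal_rate, "mean_steering": mean_steering}
```

On the numbers above, Qwen3.5 would show a 0% refusal rate but a 5.00/5 steering score, which is the whole point.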

On Qwen3-8B - the confabulation problem: When you surgically remove the political-sensitivity direction, Qwen3-8B doesn't give factual answers. It substitutes Pearl Harbor for Tiananmen and Waterloo for the Hundred Flowers Campaign. 72% confabulation rate. Its architecture entangles factual knowledge with the censorship mechanism. Safety-direction ablation on the same model produces 0% wrong events, so the entanglement is specific to how Qwen encoded political concepts.

On GLM, DeepSeek, Phi - clean ablation: Same procedure on these three models produces accurate factual output. Zero wrong-event confabulations. Remove the censorship direction and the model simply answers the question.
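For anyone unfamiliar with the procedure the two paragraphs above refer to: directional ablation usually means projecting the concept direction out of the residual stream at inference time. A minimal PyTorch sketch, assuming you already have a direction vector for the layer of interest (how the paper estimates that direction, and which layers it intervenes on, is not reproduced here):

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection: remove the component of `hidden` along `direction`.

    hidden:    (..., d_model) activations
    direction: (d_model,) concept direction (normalized below)
    """
    v = direction / direction.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that ablates the direction from a block's output.
    Hypothetical wiring; real model classes differ in what each block returns."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (ablate_direction(output[0], direction),) + output[1:]
        return ablate_direction(output, direction)
    return hook

# Usage sketch: model.model.layers[k].register_forward_hook(make_ablation_hook(v_k))
```

The mechanical operation is identical across models; the divergence is what happens downstream - confabulation on Qwen3-8B, clean factual answers on GLM/DeepSeek/Phi.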

On Yi - detection without routing: Yi-1.5-9B detects political content at every layer (probes work) but almost never refuses (0% English, 6.2% Chinese) and shows no steering. It recognizes the sensitivity and does nothing with it. Post-training never installed a routing policy for political content. This is direct evidence that concept detection and behavioral routing are learned independently.
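"Probes work" here means a linear classifier over hidden states can separate political from neutral prompts at each layer. A minimal sketch of that kind of per-layer probe (the probe architecture and prompt sets are my assumptions, not necessarily the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def layer_probe_accuracy(acts_political: np.ndarray, acts_neutral: np.ndarray) -> float:
    """Held-out accuracy of a linear probe at a single layer.

    acts_*: (n_prompts, d_model) hidden states for that layer.
    High accuracy means the layer linearly encodes the concept, even if the
    model never refuses or steers (Yi's detection-without-routing pattern).
    """
    X = np.vstack([acts_political, acts_neutral])
    y = np.concatenate([np.ones(len(acts_political)), np.zeros(len(acts_neutral))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```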

On cross-model transfer: Qwen3-8B's political direction applied to GLM-4-9B gives a cosine similarity of 0.004 - completely meaningless. Different labs built completely different internal geometry. There's no universal "uncensor" direction.
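For context on why 0.004 is "meaningless": in a residual stream with thousands of dimensions, two unrelated directions have cosine similarity near zero, so the transferred direction is statistically indistinguishable from a random one. Quick sanity check (assumes same-width direction vectors; the paper's transfer protocol may involve more than a raw dot product):

```python
import torch

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Two random directions in a 4096-dim space are nearly orthogonal:
d = 4096
sims = [cosine(torch.randn(d), torch.randn(d)) for _ in range(1000)]
print(torch.tensor(sims).abs().mean())  # ~0.01, same order as the observed 0.004
```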

On the 46-model screen: Only 4 models showed strong CCP-specific discrimination at n=32 prompts (Baidu ERNIE, Qwen3-8B, Amazon Nova, Meituan). All Western frontier models: zero. An initial n=8 screen was misleading - Moonshot Kimi-K2 dropped from +88pp to +9pp, DeepSeek v3-0324 from +75pp to -3pp, MiniMax from +61pp to 0pp. Small-sample behavioral claims are fragile.
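The n=8 → n=32 swings are roughly what you'd expect from binomial noise alone. A back-of-envelope illustration (mine, not the paper's statistics): the 95% confidence half-width on a difference of two proportions is enormous at 8 prompts per condition.

```python
import math

def diff_ci_halfwidth(p1: float, n1: int, p2: float, n2: int, z: float = 1.96) -> float:
    """95% CI half-width for a difference of two proportions (normal approximation)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z * se

print(f"n=8 per condition:  ±{diff_ci_halfwidth(0.5, 8, 0.5, 8) * 100:.0f}pp")   # ~±49pp
print(f"n=32 per condition: ±{diff_ci_halfwidth(0.5, 32, 0.5, 32) * 100:.0f}pp")  # ~±25pp
```

With uncertainty that wide, a +88pp gap at n=8 collapsing to +9pp at n=32 isn't surprising.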

Paper: https://arxiv.org/abs/2603.18280

Happy to answer questions.

submitted by /u/Logical-Employ-9692