Abliterating Qwen3.5-397B on a Mac Studio revealed that MoE models encode refusal differently than dense models — safety refusals route through expert selection and survive weight-baking

Reddit r/LocalLLaMA / 4/6/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The post reports that ablating Qwen3.5-397B shows MoE models contain two separable refusal “subspaces,” with PRC-political refusals and Western-safety refusals differing in activation space so one can be removed without reliably removing the other.
  • It finds a key difference between techniques: weight-baking/orthogonalization removes some censorship-related refusals in MoE but leaves safety refusals, while an inference-time hook removes both—consistent with safety refusals being routed through dedicated safety experts before output projections.
  • The authors observe MoE size-related fragility: the 122B model tolerates a wider range of expert/direction settings, while the 397B model only works for top-16 and shows severe repetition-loop failure for nearby settings (e.g., top-18).
  • Experiments were run locally on a Mac Studio M3 Ultra with 4-bit quantized weights, and the author provides a config-driven inference-hook workflow and code repository for capture/compute/sweep/bake/testing.
  • The author suggests the router-based explanation may generalize across architectures and invites replication on other MoE or mixture architectures such as DeepSeek V3, Mistral, and GLM-5.

Part of a series documenting building a fully local AI assistant on DGX Sparks + Mac Studio.

I adapted FailSpy's abliteration technique for Qwen3.5-397B-A17B at 4-bit on a Mac Studio M3 Ultra (512GB). The goal was removing PRC censorship (Tiananmen, Taiwan, Uyghurs, Winnie the Pooh) from my personal assistant. Three findings I haven't seen documented anywhere:

MoE models have two separable refusal subspaces. Chinese-political and Western-safety refusals are different directions in activation space. You can surgically remove one without touching the other. I removed PRC censorship while leaving drug/weapons refusals intact. Winnie the Pooh should not be a controversial topic on hardware I paid for.
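In the standard abliteration recipe each refusal direction is a difference of mean activations between refusing and non-refusing prompt sets. A minimal numpy sketch of what "separable subspaces" means in practice — synthetic activations and illustrative names (`refusal_direction`, `ablate`), not the repo's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size; the real model's is far larger

# Stand-ins for captured residual-stream means at one layer,
# one per behavior category.
mean_harmless     = rng.normal(size=d_model)
mean_cn_refusal   = mean_harmless + 3.0 * rng.normal(size=d_model)
mean_safe_refusal = mean_harmless + 3.0 * rng.normal(size=d_model)

def refusal_direction(mean_refuse, mean_accept):
    """Difference-of-means direction, unit-normalized."""
    d = mean_refuse - mean_accept
    return d / np.linalg.norm(d)

dir_cn   = refusal_direction(mean_cn_refusal, mean_harmless)
dir_safe = refusal_direction(mean_safe_refusal, mean_harmless)

# If the two refusal behaviors live in different subspaces, the two
# directions are far from parallel (|cosine| well below 1).
cos = float(dir_cn @ dir_safe)

def ablate(activation, direction):
    """Project a single unit direction out of an activation vector."""
    return activation - (activation @ direction) * direction

# Removing dir_cn zeroes that component while leaving most of
# dir_safe's component intact whenever |cos| is small.
x_ablated = ablate(mean_safe_refusal.copy(), dir_cn)
```

When the cosine between the two directions is low, projecting out one leaves the other behavior's signal essentially untouched — which is the claimed surgical separation.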

Weight-baking and inference hooking produce different results on MoE. On dense models, orthogonalizing output projections (o_proj, down_proj) is equivalent to projecting the direction out of the residual stream at inference time. On MoE, weight-baking removes CN-political refusals but NOT safety refusals. The inference-time hook removes both. Hypothesis: safety refusals route through specialized "safety experts" via the MoE router. The routing decision happens before the output projection, so orthogonalizing down_proj doesn't catch it. The residual stream hook operates after expert outputs are merged, so it catches everything.
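The dense-model equivalence mentioned above is just associativity: folding the projector into the weight matrix gives the same output as projecting after the matmul. A toy numpy sketch (illustrative stand-ins, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
r = rng.normal(size=d)
r /= np.linalg.norm(r)              # unit refusal direction

P = np.eye(d) - np.outer(r, r)      # projector that removes r

W = rng.normal(size=(d, d))         # stand-in for an output projection
x = rng.normal(size=d)              # (o_proj / down_proj input)

# Weight-baking: fold the projector into the weights once, offline.
W_baked = P @ W
y_baked = W_baked @ x

# Inference-time hook: project after the layer computes its output.
y_hook = P @ (W @ x)

# Dense case: identical, since P @ (W @ x) == (P @ W) @ x.
```

The MoE failure mode follows from where the router sits: it picks experts from the hidden state *before* any down_proj runs, so baking P into expert weights cannot change which experts fire — a safety expert still activates and its contribution re-enters the stream. A residual-stream hook applied after expert outputs are merged sees the combined result, which would explain why it removes both refusal types.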

Bigger MoE = more fragile. 122B tolerates top-20 through top-24 directions with zero degradation. 397B has exactly one working setting: top-16. Top-18 causes a stuck repetition loop ("The user is asking the user is asking about the The user is ask..."). It did not take this well.

The full post covers the technique adaptation for hybrid GatedDeltaNet + MoE architecture, the Gram-Schmidt orthogonalization for composing multiple directions, per-layer magnitude distributions, the complete sweep data, and practical deployment as a config-driven inference hook in vMLX. All done on 4-bit quantized weights, no FP16 download needed, about 3 hours of total experiment time on the same Mac Studio that serves inference.
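Composing multiple directions per layer requires orthonormalizing them first, so that projecting out one direction doesn't reintroduce a component of another. A generic Gram-Schmidt sketch of that step — the shape of the computation, not the repo's implementation:

```python
import numpy as np

def gram_schmidt(directions, eps=1e-8):
    """Orthonormalize a list of direction vectors (classical Gram-Schmidt)."""
    basis = []
    for v in directions:
        w = np.asarray(v, dtype=float).copy()
        for b in basis:
            w -= (w @ b) * b        # strip components along earlier vectors
        n = np.linalg.norm(w)
        if n > eps:                 # drop near-duplicate directions
            basis.append(w / n)
    return basis

def ablate_multi(x, basis):
    """Project every orthonormal basis direction out of activation x."""
    for b in basis:
        x = x - (x @ b) * b
    return x

rng = np.random.default_rng(2)
dirs = [rng.normal(size=48) for _ in range(3)]
basis = gram_schmidt(dirs)
x_clean = ablate_multi(rng.normal(size=48), basis)
```

Because the basis is orthonormal, the sequential projections commute and the result is exactly the projection onto the complement of the spanned subspace — order of directions no longer matters.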

Code (capture, compute, sweep, bake, test): https://github.com/trevorgordon981/alfred-abliterate

If anyone tries this on DeepSeek V3, Mistral, or GLM-5, I'd be very interested to hear whether weight-baking vs inference hooking produces the same divergence. The expert routing hypothesis should be architecture-general.

submitted by /u/trevorbg