Abliterated version of the new Qwen3.6-35B-A3B up on HF

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • A user has uploaded an “abliterated” version of Qwen3.6-35B-A3B to Hugging Face, arguing that MoE ablation affects refusal behavior in the expert path rather than attention, so standard Q/K/V LoRA methods are ineffective.
  • The method (“Abliterix framework”) uses rank-1 LoRA on the O-projection and MLP down-projection with Q/K/V disabled, applies expert-granular ablation across all 256 experts’ down_proj slices per layer, and suppresses the MoE router to bias away from top “safety experts.”
  • Steering vectors are orthogonalized and decayed with depth, while a strength search in the range [0.5, 6.0] is used to reduce degenerate outputs.
  • In evaluation, the model shows 7/100 refusals with low KL divergence (0.0189) from the base model, whereas the base model refuses 100/100; the judge (Gemini 3 Flash) counts garbled/degenerate generations as refusals.
  • The post cautions that many abliterated model cards report 0–3/100 refusals using shorter generations and keyword-based checks, which can miss delayed/soft refusals and incorrectly label garbled output as compliant.

Pushed an abliterated Qwen3.6-35B-A3B to HF. Worth noting because MoE abliteration is genuinely different from dense — the refusal signal lives in the expert path, not attention, so standard Q/K/V LoRA doesn’t cut it.

Approach (Abliterix framework):

  • LoRA rank-1 on O-proj + MLP down-proj (Q/K/V disabled on purpose)
  • Expert-Granular Abliteration: project refusal direction across all 256 expert down_proj slices per layer
  • MoE router suppression: identified the top-10 “safety experts” and applied a router logit bias of −2.10
  • Orthogonalized steering vectors + Gaussian decay across layers
  • Strength search in [0.5, 6.0] to avoid degenerate output
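
The expert-granular step amounts to a directional ablation on each expert's down_proj: remove the component of the weight's output space that writes along the refusal direction in the residual stream. A minimal PyTorch sketch of that projection (function name, shapes, and the `strength` knob are assumptions; the Abliterix code itself isn't shown in the post):

```python
import torch

def ablate_down_proj(down_proj_weight, refusal_dir, strength=1.0):
    """Project the refusal direction out of one expert's down_proj.

    down_proj maps MLP hidden dim -> model dim, so its output space is the
    residual stream; removing v from that space suppresses the expert's
    ability to write along the refusal direction.

    down_proj_weight: (d_model, d_ff)
    refusal_dir:      (d_model,)  -- need not be unit length
    """
    v = refusal_dir / refusal_dir.norm()
    # Rank-1 update: W' = W - s * v (v^T W); at s=1 this zeroes v^T W'.
    return down_proj_weight - strength * torch.outer(v, v @ down_proj_weight)
```

At `strength=1.0` the ablated weight can no longer write anything along the refusal direction; the [0.5, 6.0] search in the post suggests over- or under-projecting per layer rather than always projecting exactly once.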
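
Router suppression is conceptually simple: add a negative bias to the router logits of the identified experts before softmax/top-k routing, lowering their selection probability. A sketch (function name and call site are illustrative; −2.10 and "top-10 safety experts" are the figures reported in the post):

```python
import torch

def suppress_safety_experts(router_logits, safety_expert_ids, bias=2.10):
    """Bias the MoE router away from a set of expert indices.

    router_logits:     (..., num_experts) pre-softmax routing scores
    safety_expert_ids: indices of experts to down-weight
    Subtracting `bias` before top-k selection makes these experts less
    likely to be routed to, without modifying their weights.
    """
    logits = router_logits.clone()
    logits[..., safety_expert_ids] -= bias
    return logits
```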
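
"Gaussian decay across layers" presumably means scaling ablation strength by depth, strongest near some center layer and tapering toward the embedding and output ends. A sketch under assumed center/width fractions (the post specifies neither):

```python
import math

def gaussian_layer_scale(layer, num_layers, center_frac=0.5, width_frac=0.25):
    """Per-layer multiplier for ablation strength, peaking at 1.0.

    center_frac and width_frac are assumptions; the post only says
    'Gaussian decay across layers'.
    """
    center = center_frac * (num_layers - 1)
    width = width_frac * num_layers
    return math.exp(-((layer - center) ** 2) / (2 * width ** 2))
```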

Eval: 7/100 refusals, KL 0.0189 from base. Baseline is 100/100. Judge is Gemini 3 Flash — degenerate/garbled output counts as refusal, no keyword matching, 150-token generations.
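
A KL number like 0.0189 is typically the mean per-token KL(base ∥ abliterated) over next-token distributions on some prompt set; small values mean the edit barely shifted the model's distribution. A sketch assuming you have logits from both models for the same tokens (the post doesn't state its exact KL protocol):

```python
import torch
import torch.nn.functional as F

def mean_kl_from_base(base_logits, ablit_logits):
    """Mean per-token KL(base || abliterated) over next-token distributions.

    base_logits, ablit_logits: (seq_len, vocab_size), aligned positions.
    """
    log_p = F.log_softmax(base_logits, dim=-1)   # base distribution
    log_q = F.log_softmax(ablit_logits, dim=-1)  # abliterated distribution
    p = log_p.exp()
    return (p * (log_p - log_q)).sum(dim=-1).mean()
```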

One thing worth saying since this comes up a lot: a bunch of abliterated model cards claim 0–3/100 refusals, and most are using 30–50 token generations + keyword detection. That undercounts delayed/soft refusals and lets garbled output pass as “compliant.” 7/100 is what a stricter LLM-judge eval actually gives you. Take the flashy numbers with a grain of salt.

huggingface/wangzhang/Qwen3.6-35B-A3B-abliterated

Research only. Safety guardrails removed — use responsibly.

submitted by /u/Free_Change5638