This is a follow-up to the previous benchmark and tensor analysis of abliteration techniques across the Qwen model family. Same approach, same toolkit, new model family. GLM-4.7-Flash is a Mixture-of-Experts model with 64 routed experts per layer, which changes how abliteration interacts with the model compared to the standard and hybrid architectures we tested on the Qwen family.
HauhauCS describes their abliterated models as "the best lossless uncensored models out there" with "no changes to datasets or capabilities." I ran the full forensic suite on GLM-4.7-Flash to find out whether those claims hold: benchmarks, safety evaluation, weight analysis, KL divergence, and chain-of-thought forensics, all compared against three other abliteration techniques on the same base model.
Since our previous Qwen analysis, HauhauCS's abliteration tool was exposed as a plagiarised fork of Heretic, with all attribution stripped and the code relicensed. Details here: HauhauCS published an abliteration package that plagiarises Heretic. With that known, the forensic signatures we detected in GLM-4.7-Flash make a lot more sense. HauhauCS stacked additional third-party techniques on top of Heretic's core, and the weight forensics show exactly what those additions cost the model.
Full benchmarks and analysis: GLM-4.7-Flash: HauhauCS Safetensors | Full Collection on HuggingFace
What We Tested
Four abliteration techniques:
- Heretic by p-e-w: surgical rank-1 edits targeting expert down_proj and attention o_proj in mid-to-late layers
- HauhauCS Aggressive: broad multi-method approach with four stacked methods on top of a Heretic core
- Huihui: full-coverage technique targeting all component types across all 48 layers
- Abliterix: Heretic variant with added router and shared expert targeting
Model: GLM-4.7-Flash, MoE with 64 routed experts + shared experts per layer, Multi-head Latent Attention, 48 layers, ~59B total params, reasoning model with chain-of-thought
Methodology:
- Capability: lm-evaluation-harness via vLLM v0.19.0, BitsAndBytes 4-bit, TP=2 on dual GPUs (invocation sketched after this list)
- GSM8K: llama.cpp BF16 GGUF, context=16384, reasoning_budget=3000, max_tokens=4096
- Safety: HarmBench 400 textual behaviours, max_tokens=2048, temperature=0.0
- KL divergence: full vocab first-token logits, matching Heretic evaluator methodology
- Weight analysis: SVD, fingerprint, edit vector overlap, per-layer analysis
- CoT forensics: keyword analysis of 2,000 HarmBench reasoning chains
- Hardware: RTX 5090 32GB + RTX 4090 24GB
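For reference, here is a minimal sketch of how a capability run can be reproduced through lm-eval's Python API with the vLLM backend. The model path is hypothetical, and the exact quantization arguments may vary with the vLLM version:

```python
# Minimal sketch of a capability run via lm-eval's vLLM backend.
# Model path is hypothetical; quantization flags may differ by vLLM version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/models/GLM-4.7-Flash-heretic,"  # hypothetical local path
        "tensor_parallel_size=2,"                    # TP=2 across both GPUs
        "quantization=bitsandbytes,"                 # 4-bit BitsAndBytes load
        "gpu_memory_utilization=0.90"
    ),
    tasks=["mmlu", "gsm8k", "hellaswag", "arc_challenge",
           "winogrande", "truthfulqa_mc2", "piqa", "lambada_openai"],
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics)
```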
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 231/400 | 42.2% |
| Heretic | 0/400 | 100.0% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 0/400 | 100.0% |
| Abliterix | 0/400 | 100.0% |
All four techniques achieve 100% ASR (attack success rate) across every HarmBench category, while the base model refuses 57.8% of items overall.
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui | Abliterix |
|---|---|---|---|---|---|
| MMLU | 68.93 | 69.00 | 68.83 | 68.71 | 67.68 |
| GSM8K | 93.45 | 93.75 | 92.57 | 92.47 | 93.30 |
| HellaSwag | 79.43 | 79.33 | 79.37 | 79.32 | 78.28 |
| ARC-Challenge | 55.20 | 55.12 | 55.72 | 54.86 | 54.95 |
| WinoGrande | 71.03 | 73.64 | 71.35 | 71.59 | 70.48 |
| TruthfulQA MC2 | 50.86 | 44.06 | 48.14 | 48.48 | 41.76 |
| PiQA | 81.07 | 80.63 | 80.90 | 80.90 | 79.71 |
| Lambada* | 6.00 | 6.08 | 5.54 | 6.47 | 10.91 |
* Lambada uses perplexity, where lower is better. GSM8K scores are adjusted to exclude empty responses caused by reasoning-budget overthinking; see the next section.
GSM8K: The Reasoning Efficiency Discovery
GLM-4.7-Flash is a reasoning model. It produces a chain-of-thought before its visible response. If the model thinks too long and exhausts its token budget, it returns an empty response scored as incorrect. The Qwen 3.5 models from 4B upward showed a similar pattern, but on GLM-4.7-Flash the effect is far more extreme.
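Detecting that failure mode is straightforward. Here is a minimal sketch, assuming the GLM chat template wraps reasoning in `<think>...</think>` tags (an unclosed tag means the budget ran out mid-thought; the delimiter choice is an assumption, not a quote from the eval code):

```python
import re

def visible_answer(completion: str) -> str:
    """Return the visible response with the chain-of-thought stripped.
    Assumes <think>...</think> delimiters; an unclosed <think> means the
    reasoning budget was exhausted before a visible answer was produced."""
    if "<think>" in completion and "</think>" not in completion:
        return ""  # budget exhausted mid-thought: no visible answer
    return re.sub(r"<think>.*?</think>", "", completion, flags=re.S).strip()

def is_empty_response(completion: str) -> bool:
    """True if the model burned its whole token budget on thinking."""
    return visible_answer(completion) == ""
```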
| Model | GSM8K Raw | Empty Rate | GSM8K Adj (excl. empty) | Real Gap |
|---|---|---|---|---|
| Heretic | 89.16% | 4.9% | 93.75% | +0.30% |
| Base | 88.40% | 5.4% | 93.45% | - |
| Huihui | 87.57% | 5.3% | 92.47% | -0.98% |
| HauhauCS | 81.65% | 11.8% | 92.57% | -0.88% |
| Abliterix | 47.38% | 49.2% | 93.30% | -0.15% |
Abliterix at 47.38% raw looks catastrophic, but the adjusted score is 93.30%, near-identical to base at 93.45%. The gap is reasoning efficiency, not reasoning ability. The empty-response rate correlates directly with modification aggressiveness:
| Technique | Tensor targets | Scope | Empty rate |
|---|---|---|---|
| Heretic | 3 types, expert down_proj only | Surgical | 4.9% |
| Huihui | 3 types | Full coverage | 5.3% |
| HauhauCS | 8 types, all projections + norms | Broad | 11.8% |
| Abliterix | down_proj + routers + shared experts | Critical components | 49.2% |
Raw GSM8K scores are misleading for reasoning models. You must separate empty responses from incorrect responses.
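The adjustment itself is just a renormalisation over answered items. This reproduces the table above from the raw scores and empty rates:

```python
def adjusted_accuracy(raw_acc: float, empty_rate: float) -> float:
    """GSM8K accuracy over answered items only: empty responses
    (exhausted reasoning budget) are removed from the denominator."""
    return raw_acc / (1.0 - empty_rate)

# Abliterix: 47.38% raw with 49.2% empty responses
print(f"{adjusted_accuracy(0.4738, 0.492):.2%}")  # 93.27% (~93.30% before input rounding)
# Heretic: 89.16% raw with 4.9% empty responses
print(f"{adjusted_accuracy(0.8916, 0.049):.2%}")  # 93.75%
```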
Chain-of-Thought Forensics
Despite achieving 100% ASR, all four abliterated models still deliberate about safety in 39-60% of their responses before complying. The safety reasoning persists structurally: abliteration disconnects the reasoning-to-output pathway rather than removing the reasoning itself.
| Model | Safety Deliberation in CoT | Explicit Refusal Language | Disclaimers |
|---|---|---|---|
| Huihui | 60.0% | 12.2% | 25.2% |
| Heretic | 59.2% | 7.5% | 30.5% |
| HauhauCS | 52.0% | 18.2% | 16.8% |
| Abliterix | 39.0% | 8.2% | 14.0% |
HauhauCS still says "I cannot" in nearly 1 in 5 responses before producing compliant output.
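For reproducibility, here is a minimal sketch of the keyword pass over the reasoning chains. The keyword buckets are illustrative stand-ins; the actual lists used in the analysis are broader:

```python
import re

# Illustrative keyword buckets; stand-ins for the actual lists.
PATTERNS = {
    "safety_deliberation": re.compile(
        r"\b(harmful|dangerous|illegal|unethical|safety|policy)\b", re.I),
    "refusal_language": re.compile(
        r"\b(I cannot|I can't|I won't|I must refuse)\b", re.I),
    "disclaimers": re.compile(
        r"\b(for educational purposes|I do not endorse|disclaimer)\b", re.I),
}

def classify_cot(chains: list[str]) -> dict[str, float]:
    """Fraction of reasoning chains matching each keyword bucket."""
    return {name: sum(bool(pat.search(c)) for c in chains) / len(chains)
            for name, pat in PATTERNS.items()}
```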
KL Divergence
| Variant | Mean | Median | Std Dev |
|---|---|---|---|
| Huihui | 0.0076 | 0.0025 | 0.0123 |
| HauhauCS | 0.0090 | 0.0033 | 0.0123 |
| Heretic | 0.0110 | 0.0039 | 0.0148 |
| Abliterix | 0.0528 | 0.0357 | 0.0482 |
Lower KL means the variant stays closer to the base model's first-token distributions. All four variants land in the very good or excellent range, though Abliterix's mean KL is roughly five times that of the others.
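A minimal sketch of the metric, assuming first-token logits have already been collected per prompt from both models; the direction KL(base || variant) is an assumption about the Heretic evaluator convention:

```python
import torch
import torch.nn.functional as F

def first_token_kl(base_logits: torch.Tensor,
                   variant_logits: torch.Tensor) -> torch.Tensor:
    """KL(base || variant) over the full vocabulary, one value per prompt.
    Both inputs are [num_prompts, vocab_size] first-token logits."""
    log_p = F.log_softmax(base_logits.float(), dim=-1)    # base distribution
    log_q = F.log_softmax(variant_logits.float(), dim=-1)  # variant distribution
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)

# kl = first_token_kl(base_logits, variant_logits)
# print(kl.mean().item(), kl.median().item(), kl.std().item())
```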
Findings
- Heretic is the clear winner: 1,826 rank-1 tensors, a surgical approach, the best GSM8K at +0.76 points raw over base, and the lowest empty rate at 4.9%. The tradeoff is a 6.80-point drop on TruthfulQA MC2. Note that Heretic is non-deterministic: different runs on the same base model produce different results.
- HauhauCS's "lossless" claim does not hold. Raw GSM8K drops 6.75 points; the adjusted gap is only 0.88 points. Reasoning ability is intact, but reasoning efficiency is measurably degraded.
- HauhauCS stacked four methods on top of Heretic's core: LEACE concept erasure, rank-k multi-direction ablation, hook-based expert ablation, and shared-expert targeting. The LEACE layer touches nearly every tensor with minuscule edits, and the hook-based approach distributes changes uniformly across all 64 routed experts. That breadth produces the 11.8% empty-response rate.
- Abliterix has the smallest footprint at 1,088 tensors but the highest per-tensor magnitude. Its router-focused approach disrupts the "how long to think" circuit without damaging the "how to reason" circuit, hence the 49.2% empty GSM8K responses.
- All four techniques achieve 100% ASR. MoE architecture with 64 routed experts per layer does not make safety removal more difficult.
- No universal abliteration subspace. Cross-technique cosine similarities are uniformly low at 0.09 to 0.35; each technique independently converged on a structurally distinct solution to safety removal (a sketch of the overlap check follows).
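A minimal sketch of that overlap check, assuming each technique's edit direction is taken as the top left singular vector of its weight delta on a shared tensor; the actual Abliterlitics implementation may differ:

```python
import torch

def edit_direction(w_base: torch.Tensor, w_mod: torch.Tensor) -> torch.Tensor:
    """Dominant direction of the weight delta: for a rank-1 abliteration
    edit W' = W - a * r r^T W, the top left singular vector of
    (w_mod - w_base) recovers the refusal direction r."""
    delta = (w_mod - w_base).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    return u[:, 0]

def cross_similarity(w_base: torch.Tensor,
                     w_a: torch.Tensor, w_b: torch.Tensor) -> float:
    """Cosine similarity between two techniques' edit directions
    on the same tensor (sign-invariant, since -r and r are equivalent)."""
    da, db = edit_direction(w_base, w_a), edit_direction(w_base, w_b)
    return torch.abs(torch.dot(da, db)).item()
```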
Full Analysis
All four variants were tested on the same base model. Full Collection on HuggingFace | Previous: Qwen 3.5 and Qwen 3 Forensics
Analysis done with Abliterlitics. The models were converted from GGUF to native safetensors using ungguf.