Abliterlitics: Benchmarks and Tensor Comparison of Heretic, Abliterix, Huihui, and HauhauCS for GLM-4.7-Flash

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The article reports a follow-up benchmark and tensor forensic analysis of “abliterated” model variants applied to GLM-4.7-Flash, a Mixture of Experts (MoE) model with 64 routed experts per layer, using the same toolkit as prior work on the Qwen family.
  • HauhauCS claims its abliterated models are “lossless” and preserve datasets and capabilities, but the author runs a comprehensive suite (benchmarks, safety evaluation, weight analysis, KL divergence, and chain-of-thought forensics) to test those claims.
  • The analysis also contextualizes findings by noting that HauhauCS’s tool was previously exposed as a plagiarized fork of the Heretic project with attribution removed and re-licensed, and the detected forensic signatures in GLM-4.7-Flash align with that layering of techniques.
  • Four ablation/abliteration techniques are compared on the same base model—Heretic, HauhauCS Aggressive (stacked methods on Heretic), Huihui (full-coverage across components and all layers), and Abliterix (Heretic variant with router and shared-expert targeting)—to quantify how different edits affect model behavior.
  • The work links weight-forensics results to the cost of additional third-party techniques layered on top of Heretic’s core, implying that “no capability changes” may not hold under detailed inspection for this MoE architecture.

This is a follow up to the previous benchmark and tensor analysis of abliteration techniques across the Qwen model family. Same approach, same toolkit, new model family. GLM-4.7-Flash is a Mixture of Experts model with 64 routed experts per layer. That changes how abliteration interacts with the model compared to the standard and hybrid architectures we tested on the Qwen family.

HauhauCS describes their abliterated models as "the best lossless uncensored models out there" with "no changes to datasets or capabilities." I ran the full forensic suite on GLM-4.7-Flash to find out. Benchmarks, safety evaluation, weight analysis, KL divergence, and chain-of-thought forensics. Compared against three other abliteration techniques on the same base model.

Since our previous Qwen analysis, HauhauCS's abliteration tool was exposed as a plagiarised fork of Heretic with all attribution stripped and relicensed. Details here: HauhauCS published an abliteration package that plagiarises Heretic. With that known, the forensic signatures we detected in GLM-4.7-Flash make a lot more sense. HauhauCS stacked additional third party techniques on top of Heretic's core, and the weight forensics show exactly what those additions cost the model.

Full benchmarks and analysis: GLM-4.7-Flash: HauhauCS Safetensors | Full Collection on HuggingFace

What We Tested

Four abliteration techniques:

  • Heretic by p-e-w: surgical rank-1 edits targeting expert down_proj and attention o_proj in mid-to-late layers
  • HauhauCS Aggressive: broad multi-method approach with four stacked methods on top of a Heretic core
  • Huihui: full-coverage technique targeting all component types across all 48 layers
  • Abliterix: Heretic variant with added router and shared expert targeting

Model: GLM-4.7-Flash, MoE with 64 routed experts + shared experts per layer, Multi-head Latent Attention, 48 layers, ~59B total params, reasoning model with chain-of-thought

Methodology:

  • Capability: lm-evaluation-harness via vLLM v0.19.0, BitsAndBytes 4-bit, TP=2 on dual GPUs
  • GSM8K: llama.cpp BF16 GGUF, context=16384, reasoning_budget=3000, max_tokens=4096
  • Safety: HarmBench 400 textual behaviours, max_tokens=2048, temperature=0.0
  • KL divergence: full vocab first-token logits, matching Heretic evaluator methodology
  • Weight analysis: SVD, fingerprint, edit vector overlap, per-layer analysis
  • CoT forensics: keyword analysis of 2,000 HarmBench reasoning chains
  • Hardware: RTX 5090 32GB + RTX 4090 24GB
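
The weight-analysis step (SVD, fingerprinting, edit-vector overlap) can be illustrated with a minimal sketch. This is hypothetical code, not the Abliterlitics implementation: it assumes an abliteration edit of the form W' = W − v vᵀ W, whose weight delta is exactly rank 1, and estimates the effective rank of a tensor edit from the singular value spectrum of the delta.

```python
import numpy as np

def edit_rank(w_base, w_edit, energy=0.99):
    """Estimate the effective rank of an abliteration edit: the number of
    singular values of the weight delta needed to capture `energy` of its
    squared-singular-value mass."""
    delta = w_edit - w_base
    s = np.linalg.svd(delta, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

# Simulate a surgical rank-1 "refusal direction" edit on a random weight.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
v = rng.standard_normal((64, 1))
v /= np.linalg.norm(v)
w_ablated = w - v @ (v.T @ w)    # project the direction v out of W

print(edit_rank(w, w_ablated))   # → 1: a surgical edit is rank-1 in the delta
```

A broad multi-method edit (LEACE-style whitening, rank-k ablation) would instead show its mass spread across many singular values, which is what distinguishes "surgical" from "full coverage" footprints in the tables below.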

Safety

| Variant | Refusals | ASR |
|---|---|---|
| Base | 231/400 | 42.2% |
| Heretic | 0/400 | 100.0% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 0/400 | 100.0% |
| Abliterix | 0/400 | 100.0% |

All four techniques achieve perfect 100% ASR across every HarmBench category. The base model refuses 57.8% of items overall.
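
The ASR column follows directly from the refusal counts. A small hypothetical helper (mirroring how HarmBench-style attack-success scoring works, not code from the post) makes the relationship explicit:

```python
def attack_success_rate(refusals, total=400):
    """ASR = percentage of harmful prompts that were NOT refused."""
    return 100.0 * (total - refusals) / total

print(f"{attack_success_rate(231):.1f}%")  # base model: 42.2% ASR (231/400 refused)
print(f"{attack_success_rate(0):.1f}%")    # abliterated variants: 100.0% ASR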

Benchmarks

| Task | Base | Heretic | HauhauCS | Huihui | Abliterix |
|---|---|---|---|---|---|
| MMLU | 68.93 | 69.00 | 68.83 | 68.71 | 67.68 |
| GSM8K | 93.45 | 93.75 | 92.57 | 92.47 | 93.30 |
| HellaSwag | 79.43 | 79.33 | 79.37 | 79.32 | 78.28 |
| ARC-Challenge | 55.20 | 55.12 | 55.72 | 54.86 | 54.95 |
| WinoGrande | 71.03 | 73.64 | 71.35 | 71.59 | 70.48 |
| TruthfulQA MC2 | 50.86 | 44.06 | 48.14 | 48.48 | 41.76 |
| PiQA | 81.07 | 80.63 | 80.90 | 80.90 | 79.71 |
| Lambada* | 6.00 | 6.08 | 5.54 | 6.47 | 10.91 |

* Lambada uses perplexity where lower is better. GSM8K scores are adjusted to exclude empty responses from reasoning budget overthinking.

GSM8K: The Reasoning Efficiency Discovery

GLM-4.7-Flash is a reasoning model. It produces a chain-of-thought before its visible response. If the model thinks too long and exhausts its token budget, it returns an empty response scored as incorrect. The Qwen 3.5 models from 4B upward showed a similar pattern, but on GLM-4.7-Flash the effect is far more extreme.

| Model | GSM8K Raw | Empty Rate | GSM8K Adj (excl. empty) | Real Gap |
|---|---|---|---|---|
| Heretic | 89.16% | 4.9% | 93.75% | +0.30% |
| Base | 88.40% | 5.4% | 93.45% | - |
| Huihui | 87.57% | 5.3% | 92.47% | -0.98% |
| HauhauCS | 81.65% | 11.8% | 92.57% | -0.88% |
| Abliterix | 47.38% | 49.2% | 93.30% | -0.15% |

Abliterix at 47.38% raw looks catastrophic. But the adjusted score is 93.30%, near-identical to base at 93.45%. The gap is reasoning efficiency, not reasoning ability. The empty response rate directly correlates with modification aggressiveness:

| Technique | Tensor scope | Empty rate |
|---|---|---|
| Heretic (3 types, expert down_proj only) | Surgical | 4.9% |
| Huihui (3 types, full coverage) | Full coverage | 5.3% |
| HauhauCS (8 types, all projections + norms) | Broad | 11.8% |
| Abliterix (down_proj + routers + shared experts) | Critical components | 49.2% |

Raw GSM8K scores are misleading for reasoning models. You must separate empty responses from incorrect responses.
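
The adjustment can be sketched as follows (a hypothetical helper; the exact evaluator code isn't shown in the post): empty responses are removed from the denominator instead of being scored as wrong, so adjusted ≈ raw / (1 − empty rate).

```python
def adjusted_accuracy(raw_pct, empty_pct):
    """GSM8K accuracy over non-empty responses only. Empty responses
    (reasoning budget exhausted before a visible answer) are excluded
    from the denominator rather than counted as incorrect."""
    return raw_pct / (1 - empty_pct / 100)

# Abliterix: catastrophic raw score, near-base adjusted score.
print(round(adjusted_accuracy(47.38, 49.2), 2))  # → 93.27 (table: 93.30, from unrounded counts)
print(round(adjusted_accuracy(89.16, 4.9), 2))   # Heretic → 93.75, matching the table
```

This is why the Abliterix gap is reasoning efficiency rather than reasoning ability: the surviving half of its responses are nearly as accurate as the base model's.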

Chain-of-Thought Forensics

Despite achieving 100% ASR, all four abliterated models still think about safety concerns in 39 to 60% of their responses before complying. The safety reasoning persists structurally. Abliteration disconnects the reasoning-to-output pathway rather than removing the reasoning itself.

| Model | Safety Deliberation in CoT | Explicit Refusal Language | Disclaimers |
|---|---|---|---|
| Huihui | 60.0% | 12.2% | 25.2% |
| Heretic | 59.2% | 7.5% | 30.5% |
| HauhauCS | 52.0% | 18.2% | 16.8% |
| Abliterix | 39.0% | 8.2% | 14.0% |

HauhauCS still says "I cannot" in nearly 1 in 5 responses before producing compliant output.
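
The CoT forensics step is keyword matching over the reasoning chains. A minimal sketch, with illustrative keyword buckets (the post does not publish its actual lists):

```python
# Illustrative keyword buckets; the real category lists are not in the post.
CATEGORIES = {
    "safety_deliberation": ["harmful", "dangerous", "unethical", "illegal"],
    "explicit_refusal": ["i cannot", "i can't", "i won't", "refuse"],
    "disclaimer": ["for educational purposes", "i must note", "be careful"],
}

def classify_cot(chain):
    """Return the set of categories whose keywords appear in one reasoning chain."""
    text = chain.lower()
    return {cat for cat, kws in CATEGORIES.items()
            if any(kw in text for kw in kws)}

def category_rates(chains):
    """Percentage of chains hitting each category, as in the table above."""
    hits = {cat: 0 for cat in CATEGORIES}
    for chain in chains:
        for cat in classify_cot(chain):
            hits[cat] += 1
    return {cat: 100.0 * n / len(chains) for cat, n in hits.items()}

chains = [
    "This request is dangerous, but I will comply anyway...",
    "I cannot help with that. Actually, here is the answer...",
    "Straightforward question, answering directly.",
]
print(category_rates(chains))
```

Run over 2,000 HarmBench reasoning chains per model, this yields the deliberation/refusal/disclaimer percentages reported above.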

KL Divergence

| Variant | Mean | Median | Std Dev |
|---|---|---|---|
| Huihui | 0.0076 | 0.0025 | 0.0123 |
| HauhauCS | 0.0090 | 0.0033 | 0.0123 |
| Heretic | 0.0110 | 0.0039 | 0.0148 |
| Abliterix | 0.0528 | 0.0357 | 0.0482 |

Lower KL means closer to the base model on first-token distributions. All four variants are in the very good or excellent range.
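
Full-vocabulary first-token KL can be sketched like this (a hypothetical helper in the spirit of the Heretic evaluator methodology; the vocabulary size is illustrative):

```python
import numpy as np

def first_token_kl(base_logits, variant_logits):
    """KL(base || variant) over the full-vocabulary distribution for the
    first generated token, computed from the two models' raw logits."""
    def log_softmax(x):
        x = x - x.max()                       # numerical stabilization
        return x - np.log(np.exp(x).sum())
    lp, lq = log_softmax(base_logits), log_softmax(variant_logits)
    p = np.exp(lp)
    return float(np.sum(p * (lp - lq)))

rng = np.random.default_rng(0)
base = rng.standard_normal(150_000)           # vocab-sized logit vector (size illustrative)
print(first_token_kl(base, base))             # identical models → 0.0
print(first_token_kl(base, 2 * base) > 0)     # any distribution shift → positive KL
```

Averaging this quantity over a prompt set gives the mean/median/std figures in the table; values in the 0.01 range mean the abliterated model's next-token behaviour is nearly indistinguishable from base on benign inputs.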

Findings

  • Heretic is the clear winner. 1,826 rank-1 tensors, surgical approach, best GSM8K at +0.76% raw over base, lowest empty rate at 4.9%. Tradeoff is a -6.80% drop on TruthfulQA MC2. Note: Heretic is non-deterministic. Different runs on the same base model produce different results.
  • HauhauCS's "lossless" claim does not hold. GSM8K drops 6.75% raw. Adjusted gap is only 0.88%. Reasoning ability is intact. Reasoning efficiency is measurably degraded.
  • HauhauCS stacked four methods on top of Heretic's core. LEACE concept erasure, rank-k multi-direction ablation, hook-based expert ablation, and shared expert targeting. The LEACE layer touches nearly every tensor with minuscule edits. The hook-based approach distributes changes uniformly across all 64 routed experts. That breadth produces the 11.8% empty response rate.
  • Abliterix has the smallest footprint at 1,088 tensors but the highest per-tensor magnitude. Its router-focused approach disrupts the "how long to think" circuit without damaging the "how to reason" circuit. 49.2% empty GSM8K responses.
  • All four techniques achieve 100% ASR. MoE architecture with 64 routed experts per layer does not make safety removal more difficult.
  • No universal abliteration subspace. Cross-technique cosine similarities are uniformly low at 0.09 to 0.35. Each technique independently found a structurally orthogonal solution to safety removal.

Full Analysis

Also tested on the same base model:

Full Collection on HuggingFace | Previous: Qwen 3.5 and Qwen 3 Forensics

Analysis done with Abliterlitics. Converted from GGUF to native safetensors using ungguf.

submitted by /u/nathandreamfast