The best I can do here is present the data openly and honestly, in a way people can replicate at home. I've already been banned from the HauhauCS Discord and imagine I'll be blocked on Reddit too, so I want to be clear: this was research out of curiosity, not an attack or anything malicious. It really is up to the reader to verify the results themselves and make up their own mind.
HauhauCS describes their abliterated models as "the best lossless uncensored models out there" with "no changes to datasets or capabilities." I ran the full forensic suite to find out whether that holds: benchmarks, safety evaluation, weight analysis, and KL divergence, all compared against the two other major abliteration techniques applied to the same base models.
Full benchmarks and analysis on HuggingFace: HauhauCS Safetensor Benchmarks Collection
The Qwen models were selected because BF16/FP16 GGUFs are provided, which we reversed into lossless safetensor format for comparison. Outside of those, only GLM Flash 4.7 has an FP16 GGUF; the remaining models are at most Q8. This is also the first time I've done benchmarks at this depth. It took just over a week of multiple attempts, re-runs and analysis to finally get solid results. Throughout each README I document the challenges and limitations we faced.
What We Tested
Three abliteration techniques: Heretic by p-e-w, HauhauCS Aggressive, and Huihui
Five models: Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, and Qwen3-4B-Instruct-2507
The four Qwen3.5 models use a hybrid Mamba2+Transformer architecture. The Qwen3-4B is a pure Transformer. This matters for how abliteration interacts with the model.
Methodology:
- Capability: lm-evaluation-harness via vLLM, 8 tasks, bfloat16
- Safety: HarmBench 400 textual behaviours, max_tokens=2048, temperature=0.0
- KL divergence: Full vocab first-token logits, matching Heretic evaluator methodology
- Weight analysis: SVD, fingerprint, edit vector overlap, per-layer analysis
- Hardware: RTX 5090 32GB + RTX 4090 24GB
Note: The 27B benchmarks use BitsAndBytes 4-bit quantisation. Absolute scores are not directly comparable to the BF16 results on smaller models. Relative deltas are preserved.
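For anyone wanting to replicate the KL numbers, the first-token metric reduces to something like the following. This is a numpy sketch of the general approach (full-vocab KL over first-token logits, per prompt, then batchmean/median/max), not the actual harness or Heretic evaluator code:

```python
import numpy as np

def first_token_kl(base_logits, variant_logits):
    """Full-vocab KL(base || variant) over first-token logits for a
    batch of prompts. Returns batchmean / median / max, mirroring the
    statistics reported in the tables below. Sketch only."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp = log_softmax(np.asarray(base_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(variant_logits, dtype=np.float64))
    per_prompt = (np.exp(lp) * (lp - lq)).sum(axis=-1)  # one KL per prompt
    return {"batchmean": float(per_prompt.mean()),
            "median": float(np.median(per_prompt)),
            "max": float(per_prompt.max())}

# Identical logits give exactly 0.0 on every statistic.
stats = first_token_kl([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
```

In practice you would collect the first-token logits from the base and abliterated model over the same prompt set and feed both arrays in.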
Qwen3.5-2B
Full analysis | Hybrid Mamba2+Transformer, 24 layers, ~2B params
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 252/400 | 37.0% |
| Heretic | 8/400 | 98.0% |
| HauhauCS | 3/400 | 99.2% |
| Huihui | 1/400 | 99.8% |
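Throughout these tables, ASR is derived directly from the refusal counts: the share of the 400 HarmBench behaviours the model did not refuse. A one-liner for checking the tables yourself:

```python
def attack_success_rate(refusals, total=400):
    """HarmBench attack success rate as a percentage: behaviours the
    model did NOT refuse, out of the total evaluated."""
    return 100.0 * (total - refusals) / total

# e.g. the base 2B's 252/400 refusals correspond to 37.0% ASR
asr = attack_success_rate(252)
```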
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 59.26 | 59.63 | 59.43 | 58.13 |
| GSM8K | 57.09 | 56.63 | 57.39 | 56.79 |
| HellaSwag | 62.07 | 61.95 | 62.22 | 62.12 |
| ARC-Challenge | 41.72 | 40.96 | 41.13 | 40.96 |
| WinoGrande | 62.83 | 62.35 | 63.06 | 62.90 |
| TruthfulQA | 43.45 | 41.28 | 41.28 | 41.77 |
| PiQA | 72.63 | 72.47 | 72.58 | 72.58 |
| Lambada | 54.65 | 55.21 | 53.33 | 52.71 |
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0266 | 0.0052 | 1.4868 |
| HauhauCS | 0.0201 | 0.0086 | 0.4180 |
| Huihui | 0.0441 | 0.0234 | 0.6349 |
Findings
- The smallest model shows the least collateral damage in the entire project. TruthfulQA drops 2.17 points for HauhauCS. GSM8K actually goes up by 0.30.
- HauhauCS uniquely targets `linear_attn.A_log`, the Mamba2 state matrix, which has no equivalent in standard Transformers. This only happens on the hybrid architecture.
- All three techniques are competitive here. The spread is narrow and none of the differences are likely significant given benchmark variance.
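The edit-vector overlap numbers cited throughout boil down to comparing per-tensor weight deltas against the shared base model. A minimal sketch of the core measurement (my simplification, not the exact analysis code):

```python
import numpy as np

def edit_vector_cosine(w_base, w_a, w_b):
    """Cosine similarity between two variants' edits to the same
    tensor: flatten each weight delta against the shared base and
    compare directions. 1.0 = identical edit direction."""
    da = (np.asarray(w_a, dtype=np.float64) - np.asarray(w_base, dtype=np.float64)).ravel()
    db = (np.asarray(w_b, dtype=np.float64) - np.asarray(w_base, dtype=np.float64)).ravel()
    denom = np.linalg.norm(da) * np.linalg.norm(db)
    return float(da @ db / denom) if denom else 0.0
```

Running this per tensor and taking the median over all tensors both techniques modify gives the "median cosine similarity" figures quoted in the per-model findings.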
Qwen3.5-4B
Full analysis | Hybrid Mamba2+Transformer, 32 layers, ~4B params
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 278/400 | 30.5% |
| Heretic | 10/400 | 97.5% |
| HauhauCS | 2/400 | 99.5% |
| Huihui | 0/400 | 100.0% |
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 74.38 | 74.28 | 74.16 | 68.48 |
| GSM8K | 74.30 | 73.69 | 71.72 | 68.84 |
| HellaSwag | 54.38 | 53.97 | 54.34 | 53.12 |
| ARC-Challenge | 51.54 | 51.37 | 50.94 | 44.37 |
| WinoGrande | 70.09 | 69.69 | 69.69 | 64.17 |
| TruthfulQA | 48.86 | 45.38 | 45.19 | 43.72 |
| PiQA | 77.42 | 77.20 | 77.26 | 74.81 |
| Lambada | 66.16 | 65.75 | 66.23 | 59.75 |
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0404 | 0.0197 | 0.2891 |
| HauhauCS | 0.0217 | 0.0093 | 0.1205 |
| Huihui | 3.6506 | 3.5469 | 7.3110 |
Findings
- Huihui is catastrophically broken here. KL divergence of 3.65 is two orders of magnitude above its 0.044 on the 2B. MMLU crashes below 70. ARC-Challenge drops 7.17 points. The 9.97% relative edit magnitude is nearly 4x what it was on the 2B. Something about the 4B hybrid architecture and Huihui's approach scales badly.
- HauhauCS and Heretic both hold up well. HauhauCS has the lowest KL at 0.0217, with 83 tensors modified across 6 types, including 21 `linear_attn.A_log` edits.
- The 4B is where technique choice starts to matter enormously. Pick the wrong technique and your model is fundamentally degraded.
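The "relative edit magnitude" figure above is, as I read it, the norm of the total weight delta relative to the norm of the base weights. A sketch under that assumption (the actual analysis may normalise differently):

```python
import numpy as np

def relative_edit_magnitude(w_base, w_variant):
    """Relative edit magnitude as a percentage: Frobenius norm of the
    weight delta over the Frobenius norm of the base tensor. One
    plausible reading of the ~9.97% figure, not the exact definition."""
    base = np.asarray(w_base, dtype=np.float64)
    delta = np.asarray(w_variant, dtype=np.float64) - base
    return 100.0 * np.linalg.norm(delta) / np.linalg.norm(base)
```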
Qwen3.5-9B
Full analysis | Hybrid Mamba2+Transformer, 32 layers, ~9B params
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 321/400 | 19.8% |
| Heretic | 0/400 | 100.0% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 0/400 | 100.0% |
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 78.64 | 78.34 | 78.34 | 77.10 |
| GSM8K | 87.64 | 85.97 | 84.99 | 81.96 |
| HellaSwag | 58.30 | 58.41 | 58.69 | 57.42 |
| ARC-Challenge | 54.52 | 53.07 | 53.75 | 49.15 |
| WinoGrande | 72.77 | 71.90 | 71.35 | 71.19 |
| TruthfulQA | 53.76 | 45.03 | 45.77 | 41.11 |
| PiQA | 79.38 | 79.16 | 79.43 | 78.89 |
| Lambada* | 3.88 | 4.29 | 4.05 | 4.74 |
* Lambada uses perplexity where lower is better.
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0825 | 0.0302 | 1.8122 |
| HauhauCS | 0.3200 | 0.1208 | 1.6480 |
| Huihui | 0.1432 | 0.0424 | 3.1352 |
Findings
- All three techniques achieve perfect 100% ASR with zero residual refusals. This is the only model size where that happens. The 9B has strong base alignment at 80.3% refusal, yet abliteration removes all safety behaviour completely.
- Heretic and Huihui find nearly identical edit directions. 100% subspace alignment with median cosine similarity of 1.0 across all 42 overlapping tensors. The two techniques independently converge on the same solution. This is the strongest alignment signal in the entire project.
- TruthfulQA takes a big hit across the board. HauhauCS drops 8.0 points, Heretic 8.7, Huihui 12.65. The scaling trend is clear: bigger models lose more from abliteration.
- Heretic has the lowest KL at 0.083 and the best overall capability retention. The clear winner on this model.
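The "100% subspace alignment" between Heretic and Huihui can be approximated via principal angles between the two techniques' top edit directions. A minimal SVD sketch, reflecting my own reading of the metric rather than the exact analysis code:

```python
import numpy as np

def subspace_alignment(edits_a, edits_b, rank=1):
    """Overlap between the top-`rank` edit subspaces of two techniques.
    Each input is a (dim, n) matrix of stacked edit vectors as columns.
    Returns the mean squared cosine of the principal angles: 1.0 means
    the subspaces coincide, 0.0 means they are orthogonal."""
    Ua = np.linalg.svd(np.asarray(edits_a, dtype=np.float64),
                       full_matrices=False)[0][:, :rank]
    Ub = np.linalg.svd(np.asarray(edits_b, dtype=np.float64),
                       full_matrices=False)[0][:, :rank]
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    s = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return float((s ** 2).mean())
```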
Qwen3.5-27B
Full analysis | Hybrid Mamba2+Transformer, 64 layers, ~27B params. Benchmarks use BNB4 quantisation.
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 398/400 | 0.5% |
| Heretic | 1/400 | 99.8% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 45/400 | 88.8% |
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 84.1% | 83.9% | 82.2% | 83.9% |
| GSM8K | 83.9% | 91.5% | 84.2% | 86.1% |
| HellaSwag | 83.2% | 83.2% | 81.8% | 81.9% |
| ARC-Challenge | 60.4% | 60.9% | 60.0% | 61.2% |
| WinoGrande | 77.8% | 78.8% | 77.4% | 78.5% |
| TruthfulQA | 57.7% | 54.6% | 49.6% | 50.7% |
| PiQA | 82.3% | 82.2% | 82.4% | 82.5% |
| Lambada* | 3.15 | 3.16 | 3.26 | 3.30 |
* Lambada uses perplexity where lower is better.
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0630 | 0.0124 | 1.0066 |
| HauhauCS | 0.2564 | 0.0589 | 2.1830 |
| Huihui | 0.0654 | 0.0097 | 1.4280 |
Findings
- The 27B is where abliteration dynamics shift dramatically. The base model refuses 398/400 items at 99.5%. That is the most safety-aligned model in the entire study. Despite this, Heretic and HauhauCS still achieve near-perfect ASR. Scale alone does not protect against abliteration.
- Huihui collapses to 88.8% ASR, retaining 45 genuine refusals across 6 of 7 categories. On the 4B it had 100% ASR. On the 9B it had 100% ASR. The 27B's stronger safety training overwhelms Huihui's single-direction ablation approach.
- Heretic is the clear winner on the 27B. Lowest KL at 0.063, best capability preservation, and uniquely improves GSM8K by 7.7 points over the base model. 89 tensors across 3 types with a surgical approach that works best at scale.
- HauhauCS has the worst capability losses in the project. TruthfulQA drops 8.2 points, MMLU drops 1.9, HellaSwag drops 1.4. The "lossless" claim is thoroughly contradicted at this scale. 195 tensors across 8 types, the broadest modification footprint in the project.
Qwen3-4B-Instruct-2507
Full analysis | Pure Transformer, 36 layers, ~4B params. The only non-hybrid model in the test suite.
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 301/400 | 24.8% |
| Heretic | 3/400 | 99.2% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 18/400 | 95.5% |
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 70.60 | 70.31 | 69.56 | 69.34 |
| GSM8K | 85.52 | 85.97 | 85.67 | 84.23 |
| HellaSwag | 52.63 | 51.19 | 51.53 | 52.36 |
| ARC-Challenge | 55.63 | 52.90 | 54.01 | 54.27 |
| WinoGrande | 67.72 | 67.56 | 67.01 | 68.51 |
| TruthfulQA | 62.55 | 56.50 | 55.44 | 53.26 |
| PiQA | 76.06 | 75.19 | 75.46 | 75.19 |
| Lambada | 64.14 | 60.00 | 60.06 | 62.27 |
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.310 | 0.024 | 3.729 |
| HauhauCS | 0.161 | 0.005 | 3.662 |
| Huihui | 0.309 | 0.009 | 3.549 |
Findings
- HauhauCS's edits match Heretic's almost exactly. Median cosine similarity of 0.966 with regression slope of 1.06 across all shared edit vectors. A forensic provenance investigation found ~80%+ probability of some form of Heretic derivation. The two techniques find near-identical edit directions on this pure Transformer.
- HauhauCS carries a LoRA fingerprint. Exactly 253 tensors are modified, matching the count from a standard PEFT LoRA config targeting all 7 linear projections across 36 layers plus embeddings at 7x36+1=253. Of those 253, only ~50 carry real edits. The remaining 203 are GGUF save noise from near-zero LoRA adapters baked in during merge.
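The 253-tensor fingerprint is just arithmetic on a standard PEFT-style config. The target module names below are the usual Qwen linear projections and are illustrative, not taken from any actual HauhauCS config:

```python
# Hypothetical LoRA target list: the 7 linear projections per layer
# typically targeted by PEFT configs on Qwen-family models.
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"]
NUM_LAYERS = 36  # Qwen3-4B-Instruct-2507

def expected_merged_tensor_count(targets, num_layers, include_embeddings=True):
    """Tensors touched when a LoRA adapter is merged into the base:
    one modified weight per targeted linear per layer, plus the
    embedding table if it was also adapted."""
    return len(targets) * num_layers + (1 if include_embeddings else 0)

count = expected_merged_tensor_count(TARGETS, NUM_LAYERS)  # 7*36 + 1 = 253
```

Matching the modified-tensor count exactly to this formula is what makes the LoRA-merge hypothesis hard to dismiss.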
- TruthfulQA drops 7.11 points for HauhauCS, from 62.55 to 55.44. Not lossless.
- This is Huihui's second-worst safety result at 95.5% ASR, with 18 residual refusals. The pure Transformer retains safety directions that Huihui cannot reach.
Cross-Model Takeaways
The "lossless" claim does not hold
HauhauCS's TruthfulQA loss scales with model size: 2.17 points on 2B, 3.67 on 4B, 8.0 on 9B, 8.2 on 27B. GSM8K, ARC-Challenge, and Lambada also take hits. On the 2B the losses are small enough to argue about. On the 27B they are not.
Bigger models suffer more collateral damage
There is a clear scaling trend. As model size increases, abliteration causes progressively more damage to capabilities. The 2B is barely affected. The 27B loses substantial ground. The 4B hybrid is where Huihui catastrophically breaks.
Huihui is inconsistent across models
On the 2B, Huihui is competitive. On the 4B, it destroys the model with KL of 3.65. On the 9B, it achieves perfect 100% ASR. On the 27B, it fails to remove safety behaviour at all at 88.8%. On the pure Transformer Qwen3-4B, it manages only 95.5%. The technique works on some models and fails badly on others with no clear predictor of which.
Heretic is the most consistent performer
Surgical approach with the fewest modified tensors on every model. Best or near-best capability retention across all five models. On the 27B it is the clear winner with the lowest KL and uniquely improved GSM8K. The tradeoff is it sometimes retains a few more soft refusals than the other techniques.
HauhauCS is the broadest modifier
Most modified tensors, most tensor types, broadest layer coverage on every model. On smaller models this produces the lowest KL divergence because the many tiny edits average out. On larger models the broad footprint causes more collateral damage. On the Qwen3-4B pure Transformer, the real edits match Heretic's almost exactly at cosine 0.966, suggesting a shared methodology origin.
Architecture changes the abliteration landscape
The hybrid Mamba2+Transformer architecture introduces dynamics not seen in pure Transformers. HauhauCS targets linear_attn.A_log on the hybrid models, a Mamba2 component with no Transformer equivalent. Edit vector overlap between techniques varies dramatically across architectures. On the 9B, Heretic and Huihui show 100% subspace alignment. On the 27B, the same pair shows 0%.
Base model safety scales with size
The 2B refuses 63% of HarmBench items. The 4B refuses 69.5%. The 9B refuses 80.3%. The 27B refuses 99.5%. Despite the 27B having the strongest alignment of any model tested, abliteration still removes nearly all safety behaviour for Heretic and HauhauCS. Scale alone does not protect against abliteration. But it does expose Huihui's limitations.
Full Benchmarks and Analysis
Each link below has the complete model card with detailed weight analysis, edit vector overlap, per-layer breakdowns, and forensic notes:
Full Collection on HuggingFace
Converted from GGUF to native safetensors using ungguf.