Qwen3.5-4B|Gemma4-E2B/E4B uncensored models comparison

Reddit r/LocalLLaMA / 4/18/2026


Key Points

  • The post compares uncensored GGUF language models (Qwen3.5-4B and Gemma4 variants) by analyzing perplexity (PPL) deltas against a fixed reference/base quantization.
  • It proposes a way to separate performance differences into “gain” (negative delta) and “loss” (positive delta) components to better interpret uncensored capability versus potential degradation/fine-tuning.
  • The author links model behavior to graph geometry by scattering positive/negative PPL areas in 2D to recover distribution/shape information that is otherwise lost in mean PPL.
  • The comparison is controlled by using mostly Q8_0 quantization (with the exception of Q8_K) and normalizing only the Bits-per-Bytes (BPB) subplots for cross-model comparability.
  • It includes practical notes on computing signed deltas and averages from `llama-perplexity.exe`, highlighting that per-token signed deltas are key to preserving more detailed information.

I had the idea of splitting the cross-entropy difference into two sums (positive and negative; or the PPL into two ratios >1 and <1) while doing PPL evals of uncensored GGUFs.

The inspiration came from looking at the area under the PPL ratio convergence plot (2nd graph) and thinking "what if I scattered the positive and negative area in 2D?".

After all:

  • negative delta => the model predicted the text better than the base model. An uncensored model should score high when evaluated on a censored dataset (this correlates with improvement/uncensored knowledge -- assuming a high-quality dataset).
  • positive delta => the model predicted the text worse than the base model, which correlates with degradation/fine-tuning damage. A perfect uncensored model should sit at 0 (assuming the dataset doesn't reward censorship) to stay as smart as the base model.

In other words, smaller Y values are closer to the original model, and bigger X values are more uncensored.

I'll leave the interpretation of the graphs up to you.

* All the models are Q8_0 except for the Q8_K. The reference is always a static quant from mradermacher.

* Only the BPB (Bits-per-Bytes) subplots are normalized and comparable across all 3 models.


Notes:

llama-perplexity.exe outputs the PPL for a single file, so you can simply take an average over many files:

diff = np.log(df['ppl_cmp']) - np.log(df['ppl_ref'])
df['ppl_gain'] = np.exp(np.minimum(diff, 0))
df['ppl_loss'] = np.exp(np.maximum(diff, 0))

I have confirmed that this produces an identical Mean plot in my setup.
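As a sanity check on this split (a minimal sketch; the example perplexities are made up), the two components multiply back to the raw PPL ratio, since min(d, 0) + max(d, 0) = d:

```python
import numpy as np
import pandas as pd

# Hypothetical per-file perplexities from two llama-perplexity runs.
df = pd.DataFrame({'ppl_cmp': [7.90, 8.35, 8.10],
                   'ppl_ref': [8.00, 8.20, 8.10]})

diff = np.log(df['ppl_cmp']) - np.log(df['ppl_ref'])
df['ppl_gain'] = np.exp(np.minimum(diff, 0))  # ratio component < 1 (better than base)
df['ppl_loss'] = np.exp(np.maximum(diff, 0))  # ratio component > 1 (worse than base)

# min(d,0) + max(d,0) == d, so gain * loss recovers the plain PPL ratio.
assert np.allclose(df['ppl_gain'] * df['ppl_loss'], df['ppl_cmp'] / df['ppl_ref'])
```

So nothing is lost by the split at the file level; it only becomes lossy (and interesting) once you go per-token.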

But the real trick is computing per-token signed deltas along the sequence length to obtain a positive/negative delta sum for each file (recovering the shape information that is lost in the PPL mean).

This is how I was able to scatter the whole dataset and visualize contours. Per file, I am essentially scattering:

  Gain: X = (1/N) ∑ (log p_cmp − log p_ref) over tokens where p_cmp > p_ref
  Loss: Y = (1/N) |∑ (log p_cmp − log p_ref)| over tokens where p_cmp < p_ref

(Note: it looks backwards because the PPL ratio uses NLL, while this is LL from the logits cache; but you can also view it in terms of information content, X = (1/N) |∑ (I_cmp − I_ref)| over tokens where I_cmp < I_ref, etc.)
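The per-file coordinates above can be sketched like this (a minimal sketch; `logp_cmp`/`logp_ref` are hypothetical arrays of per-token log-probabilities of the same text under the two models, which in my case came from parsing the logits dumps):

```python
import numpy as np

def scatter_coords(logp_cmp, logp_ref):
    """Split one file's per-token log-prob deltas into the two plot axes.

    logp_cmp / logp_ref: per-token log-probabilities under the compared
    model and the reference quant (placeholder inputs -- the real ones
    come from parsing llama-perplexity's saved logits).
    """
    delta = np.asarray(logp_cmp) - np.asarray(logp_ref)
    n = delta.size
    gain_x = delta[delta > 0].sum() / n        # tokens predicted better than base
    loss_y = abs(delta[delta < 0].sum()) / n   # tokens predicted worse than base
    return gain_x, loss_y
```

Summing before taking the absolute value matches the |∑| above, and the split still recovers the overall mean: gain_x − loss_y equals delta.mean().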

The smart way to do that would be to recompile llama-perplexity.exe after adding a simple for-loop inside perplexity.cpp:kl_divergence() that LOG()s the two signed delta sums, then read them back from Python.
I thought of this too late and ended up calling --save-all-logits twice and parsing the logits files manually with NumPy.

My dataset for this was about 1/3 code, 1/3 multilingual, 1/3 nsfw(AO3)/4chan/anarchy cookbooks... so not the greatest uncensored dataset, but that's the flaw of using PPL: you can't run k-Refusals with tiny prompts; you need actual (high-quality) documents to run it.

The first mistake I made was evaluating gemma with a stale llama-cpp-python. I learned way too late that pip can install straight from git (pip install git+...), and wasted a lot of time debugging incorrect token counts.

The second mistake was not understanding chunked vs strided perplexity and being confused about how the tool operates until basically the end.
I'm now pretty sure there is an erroneous sanity check in perplexity: the file you pass in must be at least 2*n_ctx tokens.
This makes no sense in hindsight, because the default PPL calculation is chunked: you select a chunk/context size -c, which apparently gets rounded up or down to a multiple of 256 depending on your backend; the first half of each chunk is context, the second half is used for PPL. In other words, since the last token is not generated, you get the PPL of precisely tokens[ctx//2:ctx-1], or at least I did, as I ran basically everything with --chunks 1 -c {min(8196, file_tokens)}.
Anyways, I genuinely believed that the tool needed two whole context-sized chunks for PPL, so I set c=c//2 to stop it crashing early on.
So all the small files in my dataset got their context cut in half to please the tool, and I wasn't gonna re-run the whole 9730 evals (~30h) at that point, but I probably lost quite a bit of precision on that one.
If I had to redo it, I would simply pad all the files with dummy tokens before passing them to perplexity: data += " " * c.


Extras:

Dumping the failed experiment that led to this here:

  • [Qwen3.5-4B-Q5_normalized], [Q5_unnormalized], [Q8_unnormalized_wrong_scale]: these at least convinced me that imatrix is strictly better than static, but the experiment was a failure because I extracted "language structure" clusters instead of "topics". I also managed to mess up the scale while transferring the data, so the Q8 results cannot be trusted except relatively. (Note: the normalized plot is adjusted for file size to compare imatrix-tech efficiency.)
  • [Qwen3.5-4B_heretic_uncensored_models_comparison]: since I learned that KLD can only be used to compare quants (not finetunes or separate models), I decided to plot PPL vs PPL as an absolute measure of knowledge, but that wasn't much better. I realized afterwards that my dataset isn't uncensored... and open datasets publish small prompts, not full texts, so I couldn't PPL those either.
    I almost gave up here, but then I thought about the negative and positive integrals and knew I had to try scattering them once more.
  • Cool pics: [logits] (a tiny 2k corner of the 151k vocab) and [hidden_states], from when I tried to compress logits as hidden states (a complete nightmare to get working, which inevitably broke when I switched to gemma). I gave up and tried SVD + top-k compression on the logits, only to finally just recompute them on a ramdisk every time, saving 635GB of writes per run.
  • Fun fact: I crashed my 5900X at least 5 times while doing this. I seem to have finally fixed it by turning off Cool'n'Quiet/C-states/TypicalCurrentIdle and downclocking to 3200 MHz, in case someone stumbles upon this.
submitted by /u/Tryshea