A technical, 100% local writeup on how I replicated and then surpassed the Secret Detection model from Wiz (and the challenges along the way) - including labeling an entire dataset with local AI

Reddit r/LocalLLaMA / 4/6/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The author describes a fully local attempt to replicate and surpass Wiz’s reported Llama 3.2-1B “secret detection” fine-tuning results, improving to 88% precision and 84.4% recall after several weekends of experimentation.
  • They benchmarked alternative small language models (including Qwen 3.5 2B and 4B), noting that higher-performing models required more VRAM and incurred longer inference times.
  • Publicly sourced data was supplemented via procedural generation, and the dataset was labeled locally using Qwen3-Coder-Next; the project also involved training the models to output structured JSON.
  • Initial schema/JSON compliance was effectively zero for baseline SLMs, but training improved it to 98–100% compliance, enabling reliable structured predictions.
  • The work uncovered data quality pitfalls (e.g., an “embarrassing” high-entropy class and misclassified negatives that included real-world passwords), and correcting these issues improved recall for passwords.

Hey everybody, I have a strong interest in offloading work to small, specialized models that I can parallelize - this lets me scale work significantly (plus, I am less dependent on proprietary APIs).

Some time ago, I saw a blog post from Wiz about fine-tuning Llama 3.2-1B for secret detection in code. They got 86% Precision and 82% Recall. I wanted to see if I could replicate (or beat) those numbers using purely local AI and produce a local specialized model.

After a couple of weekends of trying it out, I managed to get Llama 3.2-1B to hit 88% Precision and 84.4% Recall simultaneously!

I also benchmarked Qwen 3.5-2B and 4B - as expected, they outperformed Llama 1B, at the cost of more VRAM and longer inference time.

I’ve put together a full write-up with the training stats, examples, and a step-by-step breakdown of what I went through to hit these metrics. Warning: It's technical and pretty long, but I honestly think it's fun to read.

Here are some highlights:

  • I only sourced publicly available data. This wasn't enough, so I used procedural generation to augment and improve my dataset. Labeling was done locally using Qwen3-Coder-Next (sorry Claude, you sit this one out).
  • Instead of just finding secrets, I trained the models to output structured JSON (a rough example of the target format is sketched right after this list). Initially, every vanilla SLM I tested (Llama & Qwen) scored 0% on schema compliance, but I got them to 98-100% after training.
  • I made a somewhat embarrassing mistake by including a high-entropy class that was detrimental to training, but I eventually caught and removed it.
  • I discovered that 4,500 of my "negative" samples actually contained real-world passwords (even though they don't look real!). The model was literally being trained to ignore secrets. At this point I was already clearing the metrics set by Wiz, but fixing this improved recall on passwords (a rough sketch of the re-screening pass is also below).
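
To make the structured-output bullet concrete, here is a minimal sketch of what such a target format and a compliance check could look like. The field names are illustrative guesses, not the exact schema from the write-up:

```python
# Hypothetical target format for the detector's structured output.
# Field names are illustrative; the real schema may differ.
import json

EXAMPLE_OUTPUT = {
    "contains_secret": True,
    "findings": [
        {
            "type": "api_key",   # e.g. api_key, password, private_key
            "line": 12,          # line number in the scanned snippet
            "value": "AKIA...",  # the flagged token (possibly truncated)
        }
    ],
}

def is_schema_compliant(raw: str) -> bool:
    """Loose compliance check: valid JSON with the expected top-level keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and "contains_secret" in obj and isinstance(obj.get("findings"), list)

print(is_schema_compliant(json.dumps(EXAMPLE_OUTPUT)))  # True
```

Measuring schema compliance this way is what makes the 0% → 98-100% jump easy to track across training runs.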
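And for the dataset-cleanup bullet, here is a rough sketch of a re-screening pass over the "negative" split. The actual cleanup relied on the local labeling model; the regex/entropy heuristics below are just a cheap stand-in to show the shape of the pass:

```python
# Sketch of a re-screening pass over the "negative" samples.
# The real cleanup used a local model for labeling; these heuristics
# only illustrate how contaminated negatives can be surfaced.
import math
import re
from collections import Counter

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                           # AWS access key id
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),   # PEM private keys
    re.compile(r"(?i)(password|passwd|pwd)\s*[:=]\s*\S{8,}"),  # hardcoded passwords
]

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def looks_suspicious(sample_text: str) -> bool:
    """Flag a 'negative' sample for a second look by the labeling model."""
    if any(p.search(sample_text) for p in SECRET_PATTERNS):
        return True
    # Long, high-entropy tokens often turn out to be real credentials.
    return any(len(tok) >= 20 and shannon_entropy(tok) > 4.0
               for tok in re.findall(r"\S+", sample_text))

# Anything flagged gets re-labeled (or dropped) instead of staying a negative.
```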

Would love to hear if anyone else is pursuing efficient 1B/3B fine-tunes for specialized tasks, and what your stack looks like!

AI Disclaimer: I wrote everything myself - this post and the full write-up. Please point out any typos!

submitted by /u/Oatilis