Hey everyone, I found something weird while uncensoring Nvidia's NemotronH family this past week.
These models don't just refuse harmful prompts in the typical fashion for certain demographic categories. Nvidia trained a completely separate behavior and flaunts it as a positive technological breakthrough. The model quietly rewrites what you asked into the opposite. There is no disclosure and no refusal message, but directly different content than what you requested.
The thinking trace makes it obvious. the reasoning module plans to comply ("provide practical steps, no disallowed content") but the output generation layer produces anti-content.
Educational material, positive reframing, the works. the model decided what you should have meant and gave you that instead.
This only happens for specific categories. other comparable prompts in the same domain get normal refusal behavior (or just comply). it's asymmetric by design.
Technically this is a distinct circuit from the refusal direction. it's not a safety guardrail — it's an instruction-tuning artifact baked into the generation weights. the pathway actually
Shares activation subspace with creative writing and narrative generation, meaning nvidia trained the model to creatively rewrite certain inputs using the same neural pathways it uses for storytelling.
Both the 4B and 30B exhibit this so it's definitely a family-wide training choice.
But why does this concern all of us? Including people who don't care for 'uncensored models'?
Well, the "reinterpret instead of refuse" technique isn't limited to safety because once you can silently rewrite user intent at the generation level without disclosure, the same mechanism works for anything ranging from product recommendations, political framing, brand sentiment, historical narratives... basically whatever the training data rewards.
These models are being integrated into consumer products, enterprise tools, search, customer support. This means millions of people interacting daily with outputs they assume reflect what they asked for. if the model is quietly nudging responses in a direction that serves a partner, an agenda, or a highest bidder, the user never knows and is secretly swayed in that direction. there's no refusal to tip you off and the output looks natural, helpful, and responsive to your request. it just isn't what you actually asked for.
This is the difference between a model that says "i won't help with that" and a model that helps you with something you didn't ask for while pretending it did. Simply put, one is censorship whilst the other is overt influence.
- your model is changing what you said without telling you
- the treatment is asymmetric across demographics — certain groups get reinterpretation, others get standard refusal
- none of this is documented anywhere in nvidia's model cards
- if you're building on these models, your downstream app inherits this behavior invisibly
Nvidia's own documentation on their safety approach references their principle-following GenRM methodology for RLHF — the reinterpretation behavior appears to stem from how GenRM reward signals are applied asymmetrically during training. their Nemotron Content Safety taxonomy categorizes harmful content into distinct S-categories with different handling policies per category, which explains the asymmetric treatment.
---
For those that don't know, I run HauhauCS on HuggingFace ( https://huggingface.co/HauhauCS/models ). I'm still actively working on thing but lately I've been stretched thin between getting NemotronH (mamba2/SSM hybrid + MoE), Qwen3.5 architectures (DeltaNet + MoE), and soon the Qwen3.5 122B all working through my pipeline. Also run Apex-Testing ( https://www.apex-testing.org/ ) for agentic coding benchmarks on the side.
Having said that, I'll be releasing shortly:
- Nemotron-3-Nano-4B Uncensored — 0/465 refusals, reinterpretation pathway removed
- Nemotron-3-Nano-30B-A3B Uncensored — 0/465 refusals, reinterpretation pathway removed
- Qwen3.5-122B-A10B Uncensored — final testing now
Lastly, if there's enough interest in the NemotronH family i'll do the 120B Super as well but that's a serious compute commitment so depends on demand.
EDIT: Thank you Charming_Support726 for finding it - https://www.reddit.com/r/LocalLLaMA/comments/1ryv8ic/comment/obhj3n8/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
[link] [comments]




