Detoxification for LLM: From Dataset Itself
arXiv cs.CL / 4/22/2026
📰 NewsIdeas & Deep AnalysisModels & Research
Key Points
- The paper argues that most LLM detoxification methods address toxicity after training or during inference, but not the root cause: toxic content in the pretraining dataset itself.
- It proposes HSPD (Hierarchical Semantic-Preserving Detoxification), which detoxifies raw corpora by rewriting toxic spans while preserving their semantics using SoCD (Soft Contrastive Decoding).
- The authors claim the detoxified corpus can be used as a drop-in replacement for fine-tuning and other training pipelines, aiming to reduce toxic behavior learned during pretraining.
- Experiments on GPT2-XL report improved detoxification performance, lowering Toxicity Probability from 0.42 to 0.18 and Expected Maximum Toxicity from 0.43 to 0.20.
- Results are also reported to be consistently strong on LLaMA2-7B, OPT-6.7B, and Falcon-7B, suggesting corpus-level, semantics-preserving rewriting can suppress downstream toxicity without sacrificing data utility.
Related Articles
The 67th Attempt: When Your "Knowledge Management" System Becomes a Self-Fulfilling Prophecy of Excellence
Dev.to
Context Engineering for Developers: A Practical Guide (2026)
Dev.to
GPT-5.5 is here. So is DeepSeek V4. And honestly, I am tired of version numbers.
Dev.to
I Built an AI Image Workflow with GPT Image 2.0 (+ Fixing Its Biggest Flaw)
Dev.to
Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF
Reddit r/LocalLLaMA