Self-Debias: Self-correcting for Debiasing Large Language Models
arXiv cs.CL / 4/10/2026
Key Points
- The paper identifies a “Bias Propagation” problem in LLM chain-of-thought reasoning: once a social bias is triggered, it cascades through subsequent reasoning steps.
- It proposes Self-Debias, a progressive, intrinsic self-correction framework that reallocates probability mass from biased heuristics toward unbiased reasoning paths.
- Unlike broad penalty-based preference optimization, Self-Debias uses a fine-grained trajectory-level objective with dynamic debiasing constraints to revise biased reasoning suffixes while keeping correct context prefixes.
- The method includes an online self-improvement loop via consistency filtering to automatically generate supervision signals, enabling stronger performance with only ~20k annotated samples and without continuous external oversight.
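The consistency-filtering step in the last point can be illustrated with a generic sketch: sample several reasoning trajectories for the same prompt, then keep only those whose final answers agree with a confident majority, using them as pseudo-supervision. This is a hypothetical minimal illustration of the general technique, not the paper's actual implementation; the function name, `min_agreement` threshold, and data format are all assumptions.

```python
from collections import Counter

def consistency_filter(samples, min_agreement=0.6):
    """Keep trajectories whose final answer matches a strong majority.

    `samples` is a list of (reasoning_trajectory, final_answer) pairs
    sampled from the model for one prompt (hypothetical format). If no
    answer reaches the agreement threshold, the prompt yields no
    supervision signal.
    """
    counts = Counter(ans for _, ans in samples)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count / len(samples) < min_agreement:
        return top_answer, []  # no confident consensus -> skip this prompt
    kept = [traj for traj, ans in samples if ans == top_answer]
    return top_answer, kept

# Toy usage: five sampled trajectories for the same prompt.
samples = [
    ("path A", "yes"), ("path B", "yes"), ("path C", "yes"),
    ("path D", "no"),  ("path E", "yes"),
]
answer, supervision = consistency_filter(samples)
# 4 of 5 trajectories agree on "yes", so they become pseudo-labels.
```

Filtering like this lets a model generate its own training signal online, which is consistent with the summary's claim that Self-Debias avoids continuous external oversight.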