CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
arXiv cs.CL · April 20, 2026
Key Points
- The paper tackles a key challenge in machine unlearning for Large Reasoning Models (LRMs) that rely on long chain-of-thought (CoT) reasoning: existing unlearning methods either fail to fully remove the unwanted knowledge, which can persist in intermediate reasoning steps, or degrade the model's reasoning performance.
- It introduces CiPO (Counterfactual Unlearning through Iterative Preference Optimization), which reframes unlearning as a targeted intervention in the CoT: the framework generates counterfactual reasoning traces that lead to a designated "unlearning answer."
- CiPO then applies iterative preference tuning: as the LRM learns from the counterfactual traces, the framework refreshes its preference data each round to push the model further from the original model's behavior (see the sketch after this list).
- Experiments on challenging benchmarks indicate that CiPO removes the targeted knowledge from both the intermediate CoT steps and the final answers while largely preserving the model's general reasoning ability.
- Overall, the authors argue that this iterative optimization loop resolves the dilemma between complete unlearning and maintained reasoning quality.
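The summary does not specify the exact preference objective, so the following is a minimal sketch assuming a standard DPO-style loss as the preference-optimization step. Everything here is illustrative: the per-trace scalar scores stand in for sequence log-probabilities under the policy, and the comment marking where counterfactual traces would be regenerated reflects the iterative loop described above, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective: prefer the counterfactual ("chosen") trace over the
    original ("rejected") trace, relative to a frozen reference model."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Toy stand-in: one learnable scalar per trace plays the role of that
# trace's sequence log-probability under the policy; the reference
# scores are frozen copies of the policy's starting point.
n_pairs = 8
ref_chosen = torch.randn(n_pairs)    # counterfactual traces -> unlearning answer
ref_rejected = torch.randn(n_pairs)  # original traces -> knowledge to forget
pol_scores = torch.nn.Parameter(
    torch.stack([ref_chosen, ref_rejected]).clone())
optimizer = torch.optim.Adam([pol_scores], lr=0.1)

for round_idx in range(3):  # outer loop: iterative preference optimization
    for _ in range(50):     # inner loop: ordinary preference tuning
        loss = dpo_loss(pol_scores[0], pol_scores[1], ref_chosen, ref_rejected)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # In the paper's loop, new counterfactual traces would be generated
    # here from the updated model, refreshing the preference pairs so the
    # divergence from the original model keeps growing; this toy only
    # reports how far the preference margin has moved.
    margin = (pol_scores[0] - pol_scores[1]).mean().item()
    print(f"round {round_idx}: mean chosen-minus-rejected score = {margin:.2f}")
```

The point the sketch illustrates is the iterative refresh: a single pass of preference tuning stays anchored to pairs drawn from the original model, whereas regenerating the preference data each round lets the divergence from the original model grow across rounds.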