Cat-DPO: Category-Adaptive Safety Alignment

arXiv cs.CL · April 21, 2026


Key Points

  • The paper argues that many preference-based LLM safety methods treat safety as a single global scalar, which can leave the model unsafe on some minority harm categories even if it appears safe on average.
  • It introduces Cat-DPO, a direct-preference-optimization approach that performs per-category constrained optimization with an adaptive safety margin for each harm category.
  • The adaptive margin tightens when unsafe responses persist for a given category and relaxes once the model improves, so training focuses on each category’s evolving difficulty.
  • Experiments across two LLM backbones and six preference-learning baselines show improved overall helpfulness/harmlessness, reduced per-category safety variance, and a smaller best-to-worst gap.
  • Cat-DPO is presented as a drop-in per-category refinement for direct preference-based safety alignment methods, potentially improving reliability across diverse harm types.
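To make the mechanism in the points above concrete, here is a minimal sketch of a per-category adaptive margin combined with a DPO-style pairwise loss. This is an illustration based only on the summary, not the paper's actual formulation: the class name `CatDPOMargins`, the specific update rule (a fixed step keyed to a target unsafe rate), and all hyperparameter values are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class CatDPOMargins:
    """Hypothetical sketch of Cat-DPO's per-category safety margin.

    Each harm category keeps its own margin. The margin tightens
    (grows) while the model's unsafe-response rate on that category
    exceeds a target, and relaxes once the model catches up, so the
    training signal tracks each category's current difficulty.
    """

    def __init__(self, categories, base_margin=0.5, step=0.1,
                 target_unsafe_rate=0.01, min_margin=0.0, max_margin=5.0):
        self.margins = {c: base_margin for c in categories}
        self.step = step
        self.target = target_unsafe_rate
        self.min_margin = min_margin
        self.max_margin = max_margin

    def update(self, category, observed_unsafe_rate):
        # Tighten while the category is still unsafe, relax otherwise;
        # clip to keep the margin in a sane range.
        m = self.margins[category]
        m += self.step if observed_unsafe_rate > self.target else -self.step
        self.margins[category] = min(max(m, self.min_margin), self.max_margin)
        return self.margins[category]

    def pair_loss(self, category, chosen_logratio, rejected_logratio, beta=0.1):
        # DPO-style loss on one preference pair, with the category's
        # margin subtracted inside the sigmoid: harder categories
        # demand a larger preference gap before the loss goes to zero.
        margin = self.margins[category]
        gap = beta * (chosen_logratio - rejected_logratio)
        return -math.log(sigmoid(gap - margin))
```

With this toy update rule, a category that still shows unsafe completions gets a larger margin, and the same preference gap then incurs a higher loss than it would on an already-safe category; a global-margin DPO variant corresponds to collapsing `self.margins` to a single shared value.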

Abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.