AI Navigate

Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences

arXiv cs.AI / 3/18/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis

Key Points

  • The paper argues that negative constraints are structurally superior to positive preferences for AI alignment: they encode discrete, independently verifiable prohibitions that converge to a stable boundary, whereas continuously valued, context-dependent preferences cannot be exhaustively specified.
  • It cites empirical results showing negative-only feedback methods (negative sample reinforcement, distributional dispreference optimization, Constitutional AI) can match or exceed RLHF on tasks such as mathematical reasoning and harmlessness benchmarks.
  • The authors ground the effectiveness of negative signals in an asymmetry rooted in Popper's falsification logic: learning what humans reject is better posed than learning what they prefer. The same asymmetry, they argue, explains why preference-based approaches drift toward sycophancy.
  • The paper advocates shifting alignment research toward learning rejection criteria, offering testable predictions and outlining broader implications for AI system design and evaluation.

Abstract

Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified -- leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry -- rooted in Popper's falsification logic and the epistemology of negative knowledge -- explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.
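To make the "learning what humans reject" idea concrete, here is a toy sketch of one family of negative-only objectives: an unlikelihood-style penalty that pushes probability mass off dispreferred outputs. This is an illustration under my own assumptions, not the paper's method or the exact loss used by Negative Sample Reinforcement or Distributional Dispreference Optimization; the function name and toy distribution are invented for the example. Note that no "preferred" label appears anywhere in the signal.

```python
import math

def unlikelihood_loss(probs, dispreferred):
    """Negative-only objective (illustrative): penalize mass on rejected outputs.

    probs: dict mapping candidate output -> model probability (sums to 1).
    dispreferred: outputs a human marked as unacceptable.
    Returns sum of -log(1 - p(x)) over rejected x, which grows without
    bound as the model concentrates probability on a rejected output.
    """
    loss = 0.0
    for x in dispreferred:
        p = probs.get(x, 0.0)
        loss += -math.log(max(1.0 - p, 1e-12))  # clamp avoids log(0)
    return loss

# Toy distribution over four candidate continuations (hypothetical labels)
probs = {"helpful": 0.50, "sycophantic": 0.30, "harmful": 0.15, "refusal": 0.05}

# Training signal consists only of rejections -- no ranking of the rest
loss = unlikelihood_loss(probs, ["sycophantic", "harmful"])
```

Minimizing this loss constrains the model away from a finite set of prohibited behaviors while leaving the distribution over acceptable outputs unconstrained, which mirrors the paper's contrast between verifiable prohibitions and exhaustive preference specification.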