CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
arXiv cs.LG / 3/20/2026
Key Points
- The paper proposes observational reward modeling to learn reward models from user interactions like clicks, copies, and upvotes, as a scalable alternative to traditional expert annotations.
- It identifies two main challenges: annotation noise that makes observed feedback deviate from true user preferences, and selection bias from users who only give feedback on responses they feel strongly about.
- CausalRM introduces a noise-aware surrogate loss that explicitly models how annotation errors occur and is provably equivalent to the primal loss under noise-free conditions, and it uses propensity scores to reweight training samples and remove the user-preference bias (a minimal code sketch follows the key points).
- Experiments across diverse LLM backbones and benchmarks show substantial gains, including improvements of 49.2% on WildGuardMix and 32.7% on HarmBench; code is available on the project website.