Document-tuning for robust alignment to animals

arXiv cs.AI / 4/16/2026

Key Points

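The study tests whether finetuning on synthetic documents can make value alignment more robust, using compassion toward animals as the target value.
For evaluation, the authors build the Animal Harm Benchmark (AHB), which measures reasoning about harm to animals with 26 questions across 13 ethical dimensions, and release it as a dataset and an Inspect evaluation.
On the AHB, training on 3000 synthetic documents reaches 77%, compared with 40% for conventional instruction-tuning, with no sign of degradation on standard safety benchmarks or capabilities.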

Abstract

We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, available as a dataset and an Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, and the advantage disappears after 5000 samples. Our exploratory results suggest that document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.
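
As an illustration of how a benchmark whose questions are grouped by ethical dimension might be summarized into a single percentage, here is a minimal sketch. It assumes each graded answer is a pass/fail result tagged with one dimension; the abstract does not describe the AHB's actual scoring rule, so the function `score_by_dimension`, the dimension names, and the sample results below are all hypothetical and are not data from the paper.

```python
from collections import defaultdict


def score_by_dimension(results):
    """Aggregate pass/fail results into overall and per-dimension pass rates.

    results: iterable of (dimension, passed) pairs, where `passed` is a bool.
    Returns (overall_rate, {dimension: rate}).
    """
    totals = defaultdict(int)   # number of graded answers per dimension
    passes = defaultdict(int)   # number of passing answers per dimension
    for dimension, passed in results:
        totals[dimension] += 1
        passes[dimension] += int(passed)

    per_dimension = {d: passes[d] / totals[d] for d in totals}
    graded = sum(totals.values())
    overall = sum(passes.values()) / graded if graded else 0.0
    return overall, per_dimension


# Hypothetical graded answers for illustration only (not AHB data).
example_results = [
    ("moral_consideration", True),
    ("moral_consideration", False),
    ("harm_minimization", True),
    ("harm_minimization", True),
]

overall, per_dimension = score_by_dimension(example_results)
print(f"overall pass rate: {overall:.2f}")
for name, rate in sorted(per_dimension.items()):
    print(f"{name}: {rate:.2f}")
```

Reporting the per-dimension rates alongside the overall figure makes it easier to see whether a gain is spread across dimensions or concentrated in a few of them.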