Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment
arXiv cs.AI / 4/14/2026
Key Points
- The paper argues that AI alignment cannot be reduced to simply applying stated principles or preferences, because principles often do not determine their own concrete application in real cases.
- It frames alignment as requiring interpretive, context-sensitive judgment about how to read, apply, and prioritize principles when they conflict or are too broad, or when the relevant facts are unclear.
- It links this interpretive component to empirical observations that a substantial portion of preference-labeling data involves situations of principle conflict or indifference, where the principle set does not uniquely dictate an outcome.
- The authors draw an operational implication: alignment-relevant behaviors may become visible only in the distribution of model responses during deployment, since interpretive judgments manifest in the model's actual outputs.
- They formalize this risk by distinguishing deployment-induced from corpus-induced evaluation and showing that off-policy audits can miss failures when the two response distributions diverge (a toy illustration follows this list).
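
The last point can be made concrete with a toy simulation, a minimal sketch in Python. The interpretation labels, failure rates, and mixture weights below are illustrative assumptions, not numbers from the paper: the only claim exercised is that when the deployed model's responses shift toward a different reading of the same principle than the audit corpus reflects, the off-policy failure estimate understates the on-policy one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup -- labels, failure rates, and mixture weights are
# assumptions for this sketch, not figures from the paper.
# Responses fall into two readings ("interpretations") of the same principle.
p_fail = {"A": 0.01, "B": 0.60}       # per-interpretation probability of a policy violation

corpus_mix = {"A": 0.95, "B": 0.05}   # off-policy audit corpus: mostly interpretation A
deploy_mix = {"A": 0.40, "B": 0.60}   # deployed model: shifts toward interpretation B

def failure_rate(mix, n=100_000):
    """Monte Carlo estimate of the violation rate under a given response mix."""
    kinds = rng.choice(list(mix), size=n, p=list(mix.values()))
    fails = rng.random(n) < np.vectorize(p_fail.get)(kinds)
    return fails.mean()

print(f"off-policy (corpus) estimate: {failure_rate(corpus_mix):.3f}")
print(f"on-policy (deployment) rate:  {failure_rate(deploy_mix):.3f}")
```

Running the sketch prints a corpus-based estimate near 0.04 against a deployment rate near 0.36; the gap comes entirely from the divergence between the two response mixes, which is exactly the failure mode the paper's deployment-induced vs. corpus-induced distinction is meant to surface.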



