Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency

arXiv cs.CL / 4/13/2026


Key Points

  • This work studies hierarchical instruction-following in LLMs, focusing on "benign" instruction conflicts that arise when instructions with different authority levels (system policies, user requests, tool outputs, retrieved context, etc.) are given simultaneously, and on making models respect hierarchical priorities among them.
  • The proposed Neuro-Symbolic Hierarchical Alignment (NSHA) formulates instruction resolution at inference time as a constraint satisfaction problem, deriving a maximally consistent set of applicable instructions under hierarchical constraints.
  • At training time, the solver's inference-stage decisions are distilled into model parameters using automatically constructed supervision, teaching the model hierarchically aligned behavior.
  • Across multiple benchmarks covering rule following, task execution, tool use, and safety, the approach improves performance under conflicts in both single-turn and multi-turn interactions while maintaining utility in reference (conflict-free) settings.

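To make the inference-time idea concrete, here is a minimal, hypothetical sketch of instruction resolution as a greedy consistency check under a priority order. The instruction names, priority table, and conflict representation are all illustrative assumptions, not from the paper; the actual NSHA solver formulates this as a constraint satisfaction problem rather than a greedy pass.

```python
# Hypothetical priority table: lower number = higher authority.
# The source names mirror the paper's hierarchy (system > user > tool > retrieved),
# but the numeric values and this greedy procedure are illustrative assumptions.
PRIORITY = {"system": 0, "user": 1, "tool": 2, "retrieved": 3}

def resolve(instructions, conflicts):
    """Select a consistent instruction set under the hierarchy.

    instructions: list of (instruction_id, source) tuples
    conflicts: set of frozensets of instruction_ids that cannot co-occur
    Greedy by priority: a simple stand-in for a full CSP solver.
    """
    chosen = []
    # Consider higher-authority instructions first; keep each one only if
    # it does not conflict with anything already accepted.
    for inst_id, source in sorted(instructions, key=lambda x: PRIORITY[x[1]]):
        if all(frozenset({inst_id, kept}) not in conflicts for kept, _ in chosen):
            chosen.append((inst_id, source))
    return [inst_id for inst_id, _ in chosen]

insts = [("no_slang", "system"), ("be_casual", "user"), ("use_slang", "tool")]
confl = {frozenset({"no_slang", "use_slang"})}
print(resolve(insts, confl))  # -> ['no_slang', 'be_casual']
```

A real CSP formulation would search over all subsets (or use propagation) to guarantee maximal consistency; the greedy pass above only guarantees that higher-authority instructions are never dropped in favor of lower-authority ones.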
Abstract

Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.