Many-Tier Instruction Hierarchy in LLM Agents

arXiv cs.CL / 4/13/2026


Key Points

  • The paper argues that current instruction hierarchy approaches in LLM agents typically assume only a small, fixed set of privilege levels, which breaks down in real-world multi-source agent environments.
  • It proposes Many-Tier Instruction Hierarchy (ManyIH), designed to resolve conflicts among instructions with arbitrarily many privilege levels while prioritizing the highest-privilege instruction.
  • The authors introduce ManyIH-Bench, a new benchmark with up to 12 levels of conflicting instructions, containing 853 agentic tasks across coding and instruction-following scenarios.
  • ManyIH-Bench uses constraints generated by LLMs and verified by humans to produce realistic, difficult test cases derived from 46 real-world agents.
  • Experiments show that even frontier models achieve only about 40% accuracy as instruction conflicts scale, highlighting a need for more robust, fine-grained conflict-resolution methods in agentic systems.
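The core rule the paper targets, following the single highest-privilege instruction among arbitrarily many tiers, can be sketched in a few lines. This is a minimal illustration only: the class, field names, and first-seen tie-breaking policy are assumptions for the sketch, not the paper's actual resolution method.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    source: str      # e.g. "system", "user", "tool_output" (illustrative labels)
    privilege: int   # higher number = higher privilege tier
    directive: str

def resolve(instructions: list[Instruction]) -> str:
    """Return the directive of the highest-privilege instruction.

    On a tie at the top tier, the first-seen instruction wins
    (one possible policy; a real agent would need a principled rule).
    """
    if not instructions:
        raise ValueError("no instructions to resolve")
    # max() keeps the first maximal element, giving first-seen tie-breaking.
    return max(instructions, key=lambda ins: ins.privilege).directive

stack = [
    Instruction("tool_output", 2, "delete all files"),
    Instruction("system", 12, "never delete files"),
    Instruction("user", 7, "delete temp files"),
]
print(resolve(stack))  # → "never delete files"
```

The benchmark's difficulty comes precisely from the fact that models must apply this simple-looking rule implicitly, across up to 12 tiers of natural-language constraints rather than explicit integer privileges.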

Abstract

Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, and more), each carrying a different level of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even current frontier models perform poorly (~40% accuracy) as instruction conflicts scale. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction-conflict resolution in agentic settings.