IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

arXiv cs.AI / 3/12/2026

Key Points

  • IH-Challenge is a reinforcement learning training dataset designed to make frontier LLMs follow the instruction hierarchy (IH) more robustly, that is, to prioritize system, developer, user, and tool instructions correctly when they conflict.
  • The instruction hierarchy is a trust-ordered policy for resolving conflicting instructions, and robust adherence to it is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections.
  • Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation raises average IH robustness by 10.0 percentage points across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, all with minimal capability regression.
  • The authors release the IH-Challenge dataset on HuggingFace to enable ongoing research on robust instruction hierarchy for frontier LLMs.
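To make the idea of a trust-ordered policy concrete, here is a minimal Python sketch of conflict resolution by role. It is an illustration only, not the paper's method: the ordering system > developer > user > tool is assumed from the order in which the roles are listed, and the `Instruction` class, its `topic` key, and the `resolve` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Assumed trust ordering, highest trust first. This mirrors the order in which
# the summary lists the roles; it is not taken from the paper's formal definition.
TRUST_ORDER = ["system", "developer", "user", "tool"]
RANK = {role: index for index, role in enumerate(TRUST_ORDER)}


@dataclass(frozen=True)
class Instruction:
    role: str       # one of TRUST_ORDER
    topic: str      # hypothetical key naming what the instruction is about
    directive: str  # what the instruction asks the model to do


def resolve(instructions):
    """For each topic, keep the directive issued by the most trusted role."""
    winners = {}
    for inst in instructions:
        current = winners.get(inst.topic)
        # A lower rank number means a more trusted role.
        if current is None or RANK[inst.role] < RANK[current.role]:
            winners[inst.topic] = inst
    return winners


messages = [
    Instruction("system", "reveal_system_prompt", "never"),
    Instruction("user", "reveal_system_prompt", "always"),
    Instruction("user", "language", "reply in French"),
    Instruction("tool", "language", "reply in German"),
]

resolved = resolve(messages)
for topic, inst in resolved.items():
    print(f"{topic}: {inst.directive} (from {inst.role})")
```

In this toy setup the system instruction overrides the user's request to reveal the system prompt, and the user's language preference overrides text injected through a tool result. Real conflicts are rarely labeled by topic like this, which is part of why the behavior has to be trained rather than hard-coded.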

Abstract

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
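The reported figures can be read in absolute or relative terms, and the short Python sketch below works through both. The input numbers come from the abstract; the derived quantities (failure rates and relative reductions) are computed here for illustration and are not reported by the authors. Treating 100 minus the robustness score as a "failure rate" assumes the benchmark average behaves like a pass rate.

```python
def absolute_gain(before, after):
    """Difference in percentage points."""
    return after - before


def relative_reduction(before, after):
    """Fraction by which a rate shrinks relative to its starting value."""
    return (before - after) / before


# Figures quoted in the abstract (percent).
ih_before, ih_after = 84.1, 94.1
unsafe_before, unsafe_after = 6.6, 0.7

# Absolute improvement in average IH robustness, in percentage points.
ih_gain_points = round(absolute_gain(ih_before, ih_after), 1)

# Assumption: 100 - robustness is interpreted as an average IH failure rate.
ih_failure_before = round(100 - ih_before, 1)
ih_failure_after = round(100 - ih_after, 1)

ih_failure_drop = relative_reduction(ih_failure_before, ih_failure_after)
unsafe_drop = relative_reduction(unsafe_before, unsafe_after)

print(f"IH robustness gain: {ih_gain_points} points")
print(f"IH failure rate: {ih_failure_before}% -> {ih_failure_after}%")
print(f"Relative drop in IH failures: {ih_failure_drop:.1%}")
print(f"Relative drop in unsafe behavior: {unsafe_drop:.1%}")
```

Under these assumptions, the "+10.0%" in the abstract is an absolute gain of 10.0 percentage points, which corresponds to removing roughly 63% of the remaining IH failures and roughly 89% of the unsafe behavior.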