IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

arXiv cs.AI / 3/12/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

IH-Challenge is a reinforcement learning training dataset designed to improve instruction hierarchy in frontier LLMs by prioritizing system, developer, user, and tool instructions during conflicts.
It targets defense against jailbreaks, system prompt extractions, and agentic prompt injections by providing a trust-ordered policy for resolving conflicting instructions.
Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation yields about a 10-point gain in IH robustness across 16 benchmarks (from 84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7%, and saturates an internal static agentic prompt injection evaluation with minimal capability regression.
The authors release the IH-Challenge dataset on HuggingFace to enable ongoing research on robust instruction hierarchy for frontier LLMs.

Abstract

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.

How CVE-2026-25253 exposed every OpenClaw user to RCE — and how to fix it in one command

Dev.to

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

Dev.to

What CVE-2026-25253 Taught Me About Building Safe AI Assistants

Dev.to

Day 52: Building vs Shipping — Why We Had 711 Commits and 0 Users

Dev.to

The Dawn of the Local AI Era: From iPhone 17 Pro to the Future of NVIDIA RTX

Dev.to

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Key Points

Abstract

Related Articles

How CVE-2026-25253 exposed every OpenClaw user to RCE — and how to fix it in one command

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

What CVE-2026-25253 Taught Me About Building Safe AI Assistants

Day 52: Building vs Shipping — Why We Had 711 Commits and 0 Users

The Dawn of the Local AI Era: From iPhone 17 Pro to the Future of NVIDIA RTX

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer