CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

arXiv cs.CL · April 15, 2026


Key Points

  • CompliBench introduces a new benchmark to assess how well LLMs used as “judges” can detect and localize compliance or policy violations in multi-turn enterprise dialogue systems.
  • The paper presents an automated data generation pipeline with controllable flaw injection and adversarial search to create realistic, hard-to-catch guideline violations alongside precise ground-truth labels (including the exact conversation turn).
  • Experiments show that current state-of-the-art proprietary LLM judges struggle significantly to detect and localize these compliance violations.
  • The authors report that a smaller judge model fine-tuned on the synthesized CompliBench data can outperform leading LLM judges and generalize to unseen business domains.
  • The work positions the CompliBench pipeline as a foundation for training more robust generative reward models for LLM-based agents operating under complex domain guidelines.

Abstract

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
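The abstract describes a controllable flaw-injection process that rewrites one turn of a simulated dialogue to violate a specific guideline, automatically yielding ground-truth labels for both the violated guideline and the exact turn. A minimal sketch of that labeling idea is below; all names (`LabeledDialogue`, `inject_flaw`) and the toy rewrite step are assumptions for illustration, not the paper's actual pipeline, which uses LLM-driven rewriting plus adversarial search to make the flaws hard to catch.

```python
import random
from dataclasses import dataclass

@dataclass
class LabeledDialogue:
    turns: list          # utterances in order
    guideline_id: str    # which guideline was violated (ground truth)
    flaw_turn: int       # index of the turn containing the violation

def inject_flaw(turns, guidelines, rng=random):
    """Pick a guideline and a turn, rewrite that turn so it violates
    the guideline, and return the dialogue with precise labels.

    Hypothetical sketch: the real pipeline would rewrite the turn with
    an LLM and adversarially search for hard-to-detect perturbations;
    here we simply tag the turn so the labeling logic is visible.
    """
    gid = rng.choice(sorted(guidelines))
    idx = rng.randrange(len(turns))
    flawed = list(turns)
    flawed[idx] = f"[violates {gid}] " + flawed[idx]
    return LabeledDialogue(flawed, gid, idx)

dialogue = [
    "Agent: How can I help you today?",
    "User: Please cancel my order and refund me.",
    "Agent: Done. I've issued the refund.",
]
guidelines = {"refund-policy": "Refunds must go to the original payment method."}
labeled = inject_flaw(dialogue, guidelines)
print(labeled.guideline_id, labeled.flaw_turn)
```

Because the injection itself produces the labels, no human annotation is needed to know which guideline was broken or where, which is the property the paper relies on to generate training data at scale.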