CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

arXiv cs.CL · April 15, 2026


Key Points

  • CompliBench introduces a new benchmark to assess how well LLMs used as “judges” can detect and localize compliance or policy violations in multi-turn enterprise dialogue systems.
  • The paper presents an automated data generation pipeline with controllable flaw injection and adversarial search to create realistic, hard-to-catch guideline violations alongside precise ground-truth labels (including the exact conversation turn).
  • Experiments show that current state-of-the-art proprietary LLM judges struggle significantly to detect and localize these compliance violations.
  • The authors report that a smaller judge model fine-tuned on the synthesized CompliBench data can outperform leading LLM judges and generalize to unseen business domains.
  • The work positions the CompliBench pipeline as a foundation for training more robust generative reward models for LLM-based agents operating under complex domain guidelines.

Abstract

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
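The abstract describes a controllable flaw-injection process that rewrites one turn of a simulated dialogue to violate a specific guideline, automatically yielding ground-truth labels for both the violated guideline and the exact turn. A minimal sketch of that labeling idea is below; all names (`LabeledDialogue`, `inject_flaw`) and the toy rewrite step are assumptions for illustration, not the paper's actual pipeline, which uses LLM-driven rewriting plus adversarial search to make the flaws hard to catch.

```python
import random
from dataclasses import dataclass

@dataclass
class LabeledDialogue:
    turns: list          # utterances in order
    guideline_id: str    # which guideline was violated (ground truth)
    flaw_turn: int       # index of the turn containing the violation

def inject_flaw(turns, guidelines, rng=random):
    """Pick a guideline and a turn, rewrite that turn so it violates
    the guideline, and return the dialogue with precise labels.

    Hypothetical sketch: the real pipeline would rewrite the turn with
    an LLM and adversarially search for hard-to-detect perturbations;
    here we simply tag the turn so the labeling logic is visible.
    """
    gid = rng.choice(sorted(guidelines))
    idx = rng.randrange(len(turns))
    flawed = list(turns)
    flawed[idx] = f"[violates {gid}] " + flawed[idx]
    return LabeledDialogue(flawed, gid, idx)

dialogue = [
    "Agent: How can I help you today?",
    "User: Please cancel my order and refund me.",
    "Agent: Done. I've issued the refund.",
]
guidelines = {"refund-policy": "Refunds must go to the original payment method."}
labeled = inject_flaw(dialogue, guidelines)
print(labeled.guideline_id, labeled.flaw_turn)
```

Because the injection itself produces the labels, no human annotation is needed to know which guideline was broken or where, which is the property the paper relies on to generate training data at scale.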