ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

arXiv cs.CL / 5/5/2026


Key Points

  • The paper introduces ARGUS, a policy-adaptive advertising governance system designed for non-stationary regulatory environments where new mandates cause outdated labels and ambiguous reasoning in historical data.
  • ARGUS uses a three-stage pipeline—Policy Seeding, Adversarial Label Rectification (via a Prosecutor-Defender-Umpire architecture), and Latent Knowledge Discovery (tripartite dialectical discussion) to find both clear and “gray-area” violations.
  • To handle sparse new policy data, the system leverages RAG-enhanced policy knowledge and Chain-of-Thought-based reward signals to guide evolving reinforcement learning toward regulations that change over time.
  • Experiments on industrial and public datasets show ARGUS outperforms traditional fine-tuning baselines, achieving stronger policy-adaptive performance with minimal labeled “gold” data.
  • Overall, ARGUS frames ad governance as an evolving multi-agent, adversarially adjudicated reasoning problem rather than a static classifier trained once on fixed labels.
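The adversarial rectification stage described above can be sketched as a small adjudication loop. This is a minimal illustration, not the paper's implementation: the `prosecutor`, `defender`, and `umpire` functions are hypothetical stand-ins (simple keyword rules here) for what would be LLM agents prompted with the new policy text.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    violates: bool
    rationale: str

def prosecutor(ad_text: str, policy: str) -> str:
    # Placeholder: a real system would prompt an LLM to argue that the ad
    # violates the new policy. Here we just flag policy-keyword matches.
    hits = [w for w in policy.split() if w.lower() in ad_text.lower()]
    return f"Violation argued: mentions {hits}" if hits else "No argument"

def defender(ad_text: str, policy: str) -> str:
    # Placeholder: a real defender agent would argue the ad is compliant.
    return "Compliance argued: no explicit prohibited claim"

def umpire(pros_case: str, def_case: str) -> Verdict:
    # Placeholder adjudication: side with the prosecutor only when it
    # produced a concrete argument; otherwise accept the defense.
    if pros_case.startswith("Violation argued"):
        return Verdict(True, pros_case)
    return Verdict(False, def_case)

def rectify_label(ad_text: str, stale_label: bool, policy: str) -> bool:
    """Override a stale historical label with the umpire's verdict
    under the new policy mandate."""
    verdict = umpire(prosecutor(ad_text, policy), defender(ad_text, policy))
    return verdict.violates
```

Under this sketch, a historical "compliant" label on an ad that now trips a new mandate is flipped by the umpire, which is the label-rectification behavior the paper attributes to its Prosecutor-Defender-Umpire stage.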

Abstract

Online advertising governance faces significant challenges due to the non-stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three-stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a "Prosecutor-Defender-Umpire" architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, "gray-area" violations. By leveraging RAG-enhanced policy knowledge and Chain-of-Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine-tuning baselines, achieving superior policy-adaptive learning with minimal gold data.
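The reward design mentioned in the abstract, retrieved policy clauses combined with a Chain-of-Thought-grounded signal, can be illustrated with a toy sketch. Everything here is an assumption for illustration: `retrieve_clauses` stands in for a real RAG retriever (token overlap instead of embeddings), and `cot_reward` is a hypothetical shaping function, not the paper's actual reward.

```python
def retrieve_clauses(ad_text: str, policy_index: list[str], k: int = 2) -> list[str]:
    # Placeholder RAG step: rank policy clauses by naive token overlap
    # with the ad; a real system would use an embedding retriever.
    ad_tokens = set(ad_text.lower().split())
    def overlap(clause: str) -> int:
        return len(set(clause.lower().split()) & ad_tokens)
    return sorted(policy_index, key=overlap, reverse=True)[:k]

def cot_reward(rationale: str, predicted: bool, rectified_label: bool,
               clauses: list[str]) -> float:
    # Hypothetical dynamic reward: a correctness term against the
    # rectified label, plus a grounding bonus when the chain-of-thought
    # actually cites retrieved policy language.
    correct = 1.0 if predicted == rectified_label else -1.0
    cited = any(tok in rationale.lower()
                for clause in clauses for tok in clause.lower().split())
    return correct + (0.5 if cited else 0.0)
```

The point of the sketch is the coupling: because the retrieved clauses change whenever the policy index is updated, the reward signal shifts with the regulations, which is how an evolving-RL loop could stay synchronized with new mandates.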