DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

arXiv cs.AI · April 16, 2026


Key Points

  • The paper introduces DeEscalWild, a real-world benchmark dataset for automated de-escalation training focused on police–civilian interactions distilled from open-source videos.
  • It distills 5,000 raw inputs into 1,500 high-fidelity scenarios via a hybrid pipeline that combines human-in-the-loop verification with LLM-as-a-judge filtering.
  • The released corpus contains 285,887 dialogue turns (~4.7M tokens), enabling fine-tuning and evaluation of small language models for de-escalation dialogue generation.
  • Experiments show fine-tuned SLMs significantly outperform their base models on multiple NLP metrics (ROUGE-L, BLEU-4, METEOR, BERTScore).
  • A domain-optimized Qwen 2.5 3B-Instruct model outperforms a general-purpose Gemini 2.5 Flash baseline, suggesting practical, low-latency, edge-deployable training systems are feasible.

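The hybrid filtering step in the second bullet can be sketched as a simple triage loop: an automated judge scores each scenario, high scores pass, low scores are dropped, and borderline cases go to human reviewers. This is an illustrative sketch only; the function names, score thresholds, and routing logic are assumptions, not the authors' actual pipeline.

```python
# Hedged sketch of a hybrid LLM-as-a-judge + human-in-the-loop filter.
# The judge callable, score range [0, 1], and thresholds are illustrative
# assumptions; DeEscalWild's exact criteria are described in the paper.
from typing import Callable, Dict, List, Tuple

def filter_scenarios(
    scenarios: List[Dict],
    judge: Callable[[Dict], float],
    accept: float = 0.8,
    reject: float = 0.4,
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Triage scenarios into accepted / needs-human-review / rejected."""
    accepted, review, rejected = [], [], []
    for s in scenarios:
        score = judge(s)  # e.g., an LLM scoring realism and fidelity
        if score >= accept:
            accepted.append(s)       # keep automatically
        elif score >= reject:
            review.append(s)         # route to human-in-the-loop verification
        else:
            rejected.append(s)       # discard low-quality raw input
    return accepted, review, rejected

# Usage with a stub judge standing in for a real LLM call:
stub_scores = {"a": 0.9, "b": 0.6, "c": 0.1}
scenarios = [{"id": k} for k in stub_scores]
acc, rev, rej = filter_scenarios(scenarios, lambda s: stub_scores[s["id"]])
print(len(acc), len(rev), len(rej))  # 1 1 1
```

In a real pipeline the stub would be replaced by a model call, and only the middle bucket would consume reviewer time, which is what lets 5,000 raw inputs shrink to 1,500 verified scenarios without hand-checking everything.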
Abstract

Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police–civilian interactions extracted from open-source video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process (combining human-in-the-loop verification with LLM-as-a-Judge evaluation) to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge.
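Of the metrics the abstract cites, ROUGE-L is the most self-contained: it scores a generated response against a reference by the longest common subsequence (LCS) of their tokens. The pure-Python sketch below shows the computation on whitespace tokens; the paper's exact tokenization and scoring settings are not specified here, so treat this as a minimal illustration rather than the authors' evaluation code.

```python
# Minimal sketch of ROUGE-L F1 (LCS-based), one of the reported metrics.
# Whitespace tokenization is an illustrative assumption.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref)       # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("please step back and stay calm",
                 "please stay calm and step back"))  # word overlap but reordered
```

Because ROUGE-L rewards shared word order rather than exact matches, a fine-tuned SLM that learns the register of de-escalation dialogue can score well even when it paraphrases the reference response; in practice one would compute this (plus BLEU-4, METEOR, and BERTScore) over the held-out test split and average.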