Where Experts Disagree, Models Fail: Detecting Implicit Legal Citations in French Court Decisions

arXiv cs.AI · March 25, 2026


Key Points

  • The paper asks how frequently French first-instance courts implicitly apply Civil Code rules, a task that requires distinguishing genuine legal reasoning from mere semantic similarity.
  • It introduces an annotated benchmark of 1,015 passage–article pairs created by three legal experts to support evaluation of implicit legal citation detection.
  • The authors find that expert disagreement is a strong predictor of model failures, with moderate inter-annotator agreement (κ = 0.33) and many disputes centered on whether text is factual description or legal reasoning.
  • A supervised ensemble model reaches F1 = 0.70 (77% accuracy), but performance is asymmetric: 68% of false positives fall on the 33% of cases where the experts disagreed.
  • Reframing the task as top-k ranking and using multi-model consensus improves results, achieving 76% precision at k=200 in an unsupervised setting, with remaining errors concentrated in legally ambiguous applications.
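The reported κ = 0.33 is a chance-corrected agreement statistic. As a minimal sketch (assuming pairwise Cohen's kappa over binary labels, where 1 marks "legal reasoning" and 0 "factual description"; the paper's exact agreement measure over three annotators may differ):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Pairwise Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently with their own
    # marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for six passage-article pairs (not from the paper).
expert_1 = [1, 1, 1, 0, 0, 0]
expert_2 = [1, 1, 0, 0, 0, 0]
print(cohen_kappa(expert_1, expert_2))  # → 0.666...
```

With three annotators, pairwise kappas are typically averaged or replaced by Fleiss' kappa; the choice affects the exact value but not the qualitative "moderate agreement" reading.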

Abstract

Computational methods applied to legal scholarship hold the promise of analyzing law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? This requires distinguishing legal reasoning from semantic similarity. We focus on implicit citation of the French Civil Code in first-instance court decisions and introduce a benchmark of 1,015 passage-article pairs annotated by three legal experts. We show that expert disagreement predicts model failures. Inter-annotator agreement is moderate (κ = 0.33) with 43% of disagreements involving the boundary between factual description and legal reasoning. Our supervised ensemble achieves F1 = 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall on the 33% of cases where the annotators disagreed. Despite these limits, reframing the task as top-k ranking and leveraging multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to surface legally ambiguous applications rather than obvious errors.
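The top-k reframing can be sketched as follows: average each pair's scores across models into a consensus score, rank pairs by that score, and measure precision among the k highest-ranked. This is a minimal illustration with hypothetical pair identifiers and scores, not the paper's models or data:

```python
from statistics import mean

def consensus_ranking(model_scores):
    """Average each pair's scores across models into one consensus score."""
    return {pair: mean(scores) for pair, scores in model_scores.items()}

def precision_at_k(scores, labels, k):
    """Fraction of true implicit citations among the k highest-scored pairs."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(labels[pair] for pair in top_k) / k

# Hypothetical scores from three models for four passage-article pairs.
model_scores = {
    "passage1-art1240": [0.90, 0.80, 0.95],
    "passage2-art1103": [0.30, 0.70, 0.50],
    "passage3-art544":  [0.85, 0.90, 0.80],
    "passage4-art1354": [0.20, 0.10, 0.15],
}
# 1 = experts judged the pair an implicit citation, 0 = not.
labels = {
    "passage1-art1240": 1,
    "passage2-art1103": 0,
    "passage3-art544":  1,
    "passage4-art1354": 0,
}

consensus = consensus_ranking(model_scores)
print(precision_at_k(consensus, labels, k=2))  # → 1.0
```

Ranking sidesteps the need for a single decision threshold: instead of classifying every pair, the system surfaces the k pairs most likely to be implicit citations, which is how the paper reports 76% precision at k = 200.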