Where Experts Disagree, Models Fail: Detecting Implicit Legal Citations in French Court Decisions

arXiv cs.AI · March 25, 2026


Key Points

  • The paper asks how frequently French first-instance courts implicitly apply Civil Code rules, a task that requires distinguishing genuine legal reasoning from mere semantic similarity.
  • It introduces an annotated benchmark of 1,015 passage–article pairs created by three legal experts to support evaluation of implicit legal citation detection.
  • The authors find that expert disagreement is a strong predictor of model failures, with moderate inter-annotator agreement (κ = 0.33) and many disputes centered on whether text is factual description or legal reasoning.
  • A supervised ensemble model reaches F1 = 0.70 (77% accuracy), but performance is asymmetric: 68% of false positives fall on the 33% of cases where the experts disagreed.
  • Reframing the task as top-k ranking and using multi-model consensus improves results, achieving 76% precision at k=200 in an unsupervised setting, with remaining errors concentrated in legally ambiguous applications.
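The reported κ = 0.33 is a chance-corrected agreement statistic. As a minimal sketch (assuming pairwise Cohen's kappa over binary labels, where 1 marks "legal reasoning" and 0 "factual description"; the paper's exact agreement measure over three annotators may differ):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Pairwise Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently with their own
    # marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for six passage-article pairs (not from the paper).
expert_1 = [1, 1, 1, 0, 0, 0]
expert_2 = [1, 1, 0, 0, 0, 0]
print(cohen_kappa(expert_1, expert_2))  # → 0.666...
```

With three annotators, pairwise kappas are typically averaged or replaced by Fleiss' kappa; the choice affects the exact value but not the qualitative "moderate agreement" reading.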

Abstract

Computational methods applied to legal scholarship hold the promise of analyzing law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? This requires distinguishing legal reasoning from semantic similarity. We focus on implicit citation of the French Civil Code in first-instance court decisions and introduce a benchmark of 1,015 passage-article pairs annotated by three legal experts. We show that expert disagreement predicts model failures. Inter-annotator agreement is moderate (κ = 0.33) with 43% of disagreements involving the boundary between factual description and legal reasoning. Our supervised ensemble achieves F1 = 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall on the 33% of cases where the annotators disagreed. Despite these limits, reframing the task as top-k ranking and leveraging multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to surface legally ambiguous applications rather than obvious errors.
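The top-k reframing can be sketched as follows: average each pair's scores across models into a consensus score, rank pairs by that score, and measure precision among the k highest-ranked. This is a minimal illustration with hypothetical pair identifiers and scores, not the paper's models or data:

```python
from statistics import mean

def consensus_ranking(model_scores):
    """Average each pair's scores across models into one consensus score."""
    return {pair: mean(scores) for pair, scores in model_scores.items()}

def precision_at_k(scores, labels, k):
    """Fraction of true implicit citations among the k highest-scored pairs."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(labels[pair] for pair in top_k) / k

# Hypothetical scores from three models for four passage-article pairs.
model_scores = {
    "passage1-art1240": [0.90, 0.80, 0.95],
    "passage2-art1103": [0.30, 0.70, 0.50],
    "passage3-art544":  [0.85, 0.90, 0.80],
    "passage4-art1354": [0.20, 0.10, 0.15],
}
# 1 = experts judged the pair an implicit citation, 0 = not.
labels = {
    "passage1-art1240": 1,
    "passage2-art1103": 0,
    "passage3-art544":  1,
    "passage4-art1354": 0,
}

consensus = consensus_ranking(model_scores)
print(precision_at_k(consensus, labels, k=2))  # → 1.0
```

Ranking sidesteps the need for a single decision threshold: instead of classifying every pair, the system surfaces the k pairs most likely to be implicit citations, which is how the paper reports 76% precision at k = 200.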