AI Navigate

The COTe score: A decomposable framework for evaluating Document Layout Analysis models

arXiv cs.CV / 3/16/2026

Key Points

  • The paper introduces the Structural Semantic Unit (SSU), a relational labelling approach, and the COTe score, a decomposable metric for evaluating document layout analysis beyond traditional IoU, F1, and mAP.
  • It shows that COTe captures semantic structure, reveals distinct failure modes such as semantic boundary breaches or repeated parsing of the same region, and is more informative than traditional metrics.
  • The authors evaluate 5 common DLA models on 3 DLA datasets and report that COTe reduces the interpretation-performance gap by up to 76% relative to F1.
  • Importantly, COTe's granularity robustness holds even without explicit SSU labeling, lowering barriers to adoption.
  • They also release an SSU-labeled dataset and a Python library to apply COTe in DLA projects.

Abstract

Document layout analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1, or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can make the metrics give a misleading or uninformative picture of model performance. To encourage more robust, comparable, and nuanced DLA, we introduce the Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical to the semantic structure of the content, and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to F1. Notably, we find that COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU-labelled dataset and a Python library for applying COTe in DLA projects.
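
To make the granularity concern concrete, the sketch below computes plain IoU for axis-aligned boxes, one of the traditional metrics the abstract names. It does not implement COTe or use the authors' library; the box format, the function names `area` and `iou`, and the example coordinates are assumptions chosen for illustration. A single ground-truth paragraph is predicted as two stacked halves, so the text is fully covered, yet each prediction scores an IoU of only 0.5.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # The intersection is itself a box; area() returns 0 if the boxes are disjoint.
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: one labelled paragraph, predicted as two halves.
ground_truth: Box = (0.0, 0.0, 100.0, 40.0)
top_half: Box = (0.0, 0.0, 100.0, 20.0)
bottom_half: Box = (0.0, 20.0, 100.0, 40.0)

print(iou(ground_truth, top_half))     # 0.5
print(iou(ground_truth, bottom_half))  # 0.5
print(iou(top_half, bottom_half))      # 0.0 (the halves only share an edge)
```

Under a matching rule that requires IoU strictly above 0.5, both halves would count as misses even though together they cover the paragraph exactly. This is the kind of labelling-granularity sensitivity that the abstract says COTe is robust to.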