Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation

arXiv cs.CL / 4/21/2026


Key Points

  • The paper argues that meeting effectiveness is often measured using post-hoc surveys that produce only coarse, single scores and fail to reflect the time-varying nature of discussions.
  • It proposes a temporal, fine-grained evaluation paradigm that defines effectiveness as the rate of objective achievement over time and scores it per topical segment within a meeting (one possible formalization is sketched just after this list).
  • The authors introduce the AMI Meeting Effectiveness (AMI-ME) dataset, built from 130 AMI Corpus meetings and containing 2,459 human-annotated topical segments.
  • They develop an automatic evaluation framework that uses a Large Language Model (LLM) as a “judge” to score each segment’s effectiveness against the meeting’s overall objectives, and they benchmark it for generalizability across multiple meeting types.
  • The study also evaluates an end-to-end pipeline from raw speech to effectiveness scoring, and the dataset and code are planned to be publicly released to support future research.
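
The phrase "rate of objective achievement over time" admits a simple reading as a per-segment ratio. The display below is only one plausible formalization for illustration; the symbols and the exact operationalization are ours, not taken from the paper.

```latex
% Illustrative reading only -- not the paper's stated formula.
% E(s):        effectiveness of topical segment s
% \Delta A(s): progress toward the meeting's objectives made during s
% \Delta t(s): time spent on segment s
E(s) = \frac{\Delta A(s)}{\Delta t(s)}
```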

Abstract

Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. This reliance on manual assessment is inherently limited in scalability, cost, and reproducibility, and a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and a temporal, fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through extensive experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. Our dataset and code will be publicly available; the AMI-ME dataset and the Automatic Evaluation Framework can be found at: this URL.
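
To make the segment-level judging step concrete, here is a minimal sketch of an LLM-as-judge scorer over topical segments. Everything in it is an illustrative assumption rather than the authors' released implementation: the prompt wording, the 0-to-1 scale, the `Segment` container, the model name, and the use of an OpenAI-style chat client are all placeholders.

```python
# Illustrative sketch only: per-segment LLM-as-judge scoring of meeting
# effectiveness, in the spirit of the framework described above. The prompt,
# the 0-1 scale, and the model choice are assumptions, not the paper's setup.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


@dataclass
class Segment:
    topic: str        # topical-segment label (e.g., from AMI annotations)
    transcript: str   # the segment's dialogue text


JUDGE_PROMPT = """You are evaluating meeting effectiveness.
Meeting objectives:
{objectives}

Topical segment ("{topic}"):
{transcript}

Rate how much this segment advanced the meeting's objectives per unit of
time spent, as a number between 0 (no progress) and 1 (maximal progress).
Reply with the number only."""


def score_segment(objectives: str, seg: Segment, model: str = "gpt-4o-mini") -> float:
    """Ask the LLM judge for one segment's effectiveness score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                objectives=objectives, topic=seg.topic, transcript=seg.transcript
            ),
        }],
        temperature=0,
    )
    # Assumes the model complies and returns a bare number.
    return float(response.choices[0].message.content.strip())


def score_meeting(objectives: str, segments: list[Segment]) -> list[float]:
    """Score every topical segment, yielding a temporal effectiveness profile."""
    return [score_segment(objectives, seg) for seg in segments]
```

Aggregating the per-segment scores (for instance, a time-weighted mean) would recover a single meeting-level number while preserving the temporal profile that a one-shot post-hoc survey score cannot express.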