Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation

arXiv cs.CL / 4/21/2026


Key Points

  • The paper argues that meeting effectiveness is often measured using post-hoc surveys that produce only coarse, single scores and fail to reflect the time-varying nature of discussions.
  • It proposes a temporal, fine-grained evaluation paradigm that defines effectiveness as the rate of objective achievement over time and scores it per topical segment within a meeting (one possible formalization is sketched just after this list).
  • The authors introduce the AMI Meeting Effectiveness (AMI-ME) dataset, built from 130 AMI Corpus meetings and containing 2,459 human-annotated topical segments.
  • They develop an automatic evaluation framework that uses a Large Language Model (LLM) as a “judge” to score each segment’s effectiveness against the meeting’s overall objectives, and they benchmark it for generalizability across multiple meeting types.
  • The study also evaluates an end-to-end pipeline from raw speech to effectiveness scoring, and the dataset and code are planned to be publicly released to support future research.
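
The phrase "rate of objective achievement over time" admits a simple reading as a per-segment ratio. The display below is only one plausible formalization for illustration; the symbols and the exact operationalization are ours, not taken from the paper.

```latex
% Illustrative reading only -- not the paper's stated formula.
% E(s):        effectiveness of topical segment s
% \Delta A(s): progress toward the meeting's objectives made during s
% \Delta t(s): time spent on segment s
E(s) = \frac{\Delta A(s)}{\Delta t(s)}
```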

Abstract

Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. This reliance on manual assessment is inherently limited in scalability, cost, and reproducibility, and a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and a temporal, fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through extensive experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. Our dataset and code will be publicly available; the AMI-ME dataset and the Automatic Evaluation Framework can be found at: this URL.
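
To make the segment-level judging step concrete, here is a minimal sketch of an LLM-as-judge scorer over topical segments. Everything in it is an illustrative assumption rather than the authors' released implementation: the prompt wording, the 0-to-1 scale, the `Segment` container, the model name, and the use of an OpenAI-style chat client are all placeholders.

```python
# Illustrative sketch only: per-segment LLM-as-judge scoring of meeting
# effectiveness, in the spirit of the framework described above. The prompt,
# the 0-1 scale, and the model choice are assumptions, not the paper's setup.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


@dataclass
class Segment:
    topic: str        # topical-segment label (e.g., from AMI annotations)
    transcript: str   # the segment's dialogue text


JUDGE_PROMPT = """You are evaluating meeting effectiveness.
Meeting objectives:
{objectives}

Topical segment ("{topic}"):
{transcript}

Rate how much this segment advanced the meeting's objectives per unit of
time spent, as a number between 0 (no progress) and 1 (maximal progress).
Reply with the number only."""


def score_segment(objectives: str, seg: Segment, model: str = "gpt-4o-mini") -> float:
    """Ask the LLM judge for one segment's effectiveness score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                objectives=objectives, topic=seg.topic, transcript=seg.transcript
            ),
        }],
        temperature=0,
    )
    # Assumes the model complies and returns a bare number.
    return float(response.choices[0].message.content.strip())


def score_meeting(objectives: str, segments: list[Segment]) -> list[float]:
    """Score every topical segment, yielding a temporal effectiveness profile."""
    return [score_segment(objectives, seg) for seg in segments]
```

Aggregating the per-segment scores (for instance, a time-weighted mean) would recover a single meeting-level number while preserving the temporal profile that a one-shot post-hoc survey score cannot express.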