Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline
arXiv cs.CL / 4/24/2026
Key Points
- The paper introduces a reusable, cross-domain evaluation pipeline for generative AI applications, demonstrated on AI meeting summaries and packaged as a public artifact derived from a dataset pipeline.
- The approach modularizes the workflow into five stages (source intake, structured reference construction, candidate generation, structured scoring, and reporting) and treats both ground truth and evaluator outputs as typed, persisted artifacts; a code sketch of this layout follows the list.
- Offline benchmarking covers 114 meetings spanning three domains (city_council, private_data, whitehouse_press_briefings), producing 340 meeting-model pairs and 680 judge runs over GPT-4.1-mini, GPT-5-mini, and GPT-5.1.
- Results show GPT-4.1-mini posts the top mean accuracy (0.583), while GPT-5.1 leads on completeness (0.886) and coverage (0.942); sign tests find no significant accuracy winner but significant retention gains for GPT-5.1 (a sign-test sketch also appears after the list).
- A contrastive baseline and typed analysis highlight that whitehouse_press_briefings is especially challenging on accuracy due to frequent unsupported specifics, and a follow-up deployment indicates GPT-5.1 outperforms GPT-4.1-mini on all metrics, with robust improvements on retention.
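
A minimal Python sketch of what the five-stage layout with typed, persisted artifacts could look like. The class and function names here (ReferenceSummary, JudgeScore, persist) are illustrative assumptions, not identifiers from the paper or its released artifact:

```python
# Sketch of a five-stage pipeline whose intermediate products are typed,
# persisted artifacts. All names are hypothetical, not from the paper.
from dataclasses import dataclass, asdict
import json
import pathlib

@dataclass
class ReferenceSummary:          # structured ground truth (stage 2 output)
    meeting_id: str
    domain: str                  # e.g. "city_council"
    key_points: list[str]

@dataclass
class JudgeScore:                # structured evaluator output (stage 4)
    meeting_id: str
    model: str
    accuracy: float
    completeness: float
    coverage: float

def persist(artifact, path: pathlib.Path) -> None:
    """Write a typed artifact to disk as JSON so every run is replayable."""
    path.write_text(json.dumps(asdict(artifact), indent=2))

# Stage order: source intake -> reference construction -> candidate
# generation -> structured scoring -> reporting. Each stage reads the
# previous stage's persisted artifacts rather than in-memory state, which
# is what makes the pipeline reusable across domains.
```

Persisting both references and judge outputs as typed records is what lets the same scoring and reporting stages be reused across the three meeting domains without rewiring.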
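The sign tests behind the significance claims are a standard paired comparison over per-meeting scores. The sketch below shows how such a test could be computed with SciPy's binomtest; the helper name sign_test and the example scores are hypothetical, and the paper's exact test configuration is not reproduced here:

```python
# Paired two-sided sign test over per-meeting metric differences.
from scipy.stats import binomtest

def sign_test(scores_a: list[float], scores_b: list[float]) -> float:
    """Does model A beat model B on more meetings than chance predicts?
    Ties are dropped, per the standard formulation of the sign test."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses                      # ties excluded
    return binomtest(wins, n, p=0.5).pvalue

# Example: completeness scores for the same meetings under two models.
p = sign_test([0.91, 0.88, 0.95, 0.80], [0.85, 0.90, 0.78, 0.75])
print(f"two-sided p-value: {p:.3f}")
```

A sign test only uses the direction of each per-meeting difference, not its magnitude, which makes it robust to the non-normal score distributions typical of LLM-judge metrics.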