HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv cs.CL / April 23, 2026


Key Points

  • Humor generation evaluation for LLMs remains difficult because prior methods produce separate, non-comparable metrics, so it’s hard to rank and track model progress consistently.
  • The paper introduces HumorRank, a tournament-based leaderboard that converts pairwise humor judgments into unified, globally consistent rankings.
  • It uses the SemEval-2026 MWAHAHA dataset and runs extensive automated pairwise evaluations across nine models (including proprietary, open-weight, and specialized systems).
  • HumorRank applies GTVH-based pairwise judging aggregated through an Adaptive Swiss tournament, with Bradley-Terry MLE to estimate overall humor capability.
  • The authors find that humor quality depends more on mastering comedic mechanisms than on model scale, providing a scalable and interpretable benchmarking approach.
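The "Adaptive Swiss tournament" named above is the paper's own variant, whose exact pairing rules are not spelled out here; the general Swiss idea it builds on can be sketched as pairing entrants with similar running scores while avoiding rematches. The function and model names below are illustrative, not from the paper.

```python
# Sketch of one round of standard Swiss-style pairing: sort entrants by
# current score, then pair each with the nearest-scoring opponent it has
# not yet faced. (The paper's "Adaptive Swiss" scheme may differ.)

def swiss_pairings(scores, played):
    """scores: {model: points}; played: set of frozensets of past matchups."""
    order = sorted(scores, key=scores.get, reverse=True)
    pairs, used = [], set()
    for a in order:
        if a in used:
            continue
        for b in order:
            # Pick the highest-ranked unused opponent not already played.
            if b is not a and b not in used and frozenset((a, b)) not in played:
                pairs.append((a, b))
                used.update((a, b))
                break
        # If no unplayed opponent remains, `a` sits out this round.
    return pairs

# Four hypothetical models after two rounds; m1 and m2 already met.
scores = {"m1": 2, "m2": 2, "m3": 1, "m4": 0}
played = {frozenset(("m1", "m2"))}
pairs = swiss_pairings(scores, played)  # m1 meets m3; m2 meets m4
```

Pairing by running score concentrates the judging budget on close matchups, which is what makes a Swiss design cheaper than a full round-robin over every model pair.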

Abstract

Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent rankings of humor generation capability. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.
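The Bradley-Terry step the abstract mentions fits a latent strength for each model from the pairwise win counts the tournament produces. A minimal sketch of the standard MM (minorization-maximization) fixed-point update for Bradley-Terry MLE, with made-up win counts (not the paper's data):

```python
# Bradley-Terry MLE via the classic MM update: each model i gets a
# strength p_i such that P(i beats j) = p_i / (p_i + p_j).
# wins[i][j] = number of pairwise judgments where model i beat model j.

def bradley_terry(wins, iters=200):
    n = len(wins)
    p = [1.0] * n  # uniform starting strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            # Expected-comparison denominator from the MM bound.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize so strengths sum to 1
    return p

# Illustrative 3-model win matrix (hypothetical counts).
wins = [
    [0, 8, 9],   # model A
    [2, 0, 6],   # model B
    [1, 4, 0],   # model C
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
# Higher strength = stronger humor generator, so A > B > C here.
```

Because the fitted strengths come from one joint likelihood over all comparisons, the resulting ranking is globally consistent even when individual matchups disagree.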