HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv cs.CL / April 23, 2026


Key Points

  • Humor generation evaluation for LLMs remains difficult because prior methods produce separate, non-comparable metrics, so it’s hard to rank and track model progress consistently.
  • The paper introduces HumorRank, a tournament-based leaderboard that converts pairwise humor judgments into unified, globally consistent rankings.
  • It uses the SemEval-2026 MWAHAHA dataset and runs extensive automated pairwise evaluations across nine models (including proprietary, open-weight, and specialized systems).
  • HumorRank applies GTVH-based pairwise judging aggregated through an Adaptive Swiss tournament, with Bradley-Terry MLE to estimate overall humor capability.
  • The authors find that humor quality depends more on mastering comedic mechanisms than on model scale, providing a scalable and interpretable benchmarking approach.
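The "Adaptive Swiss tournament" named above is the paper's own variant, whose exact pairing rules are not spelled out here; the general Swiss idea it builds on can be sketched as pairing entrants with similar running scores while avoiding rematches. The function and model names below are illustrative, not from the paper.

```python
# Sketch of one round of standard Swiss-style pairing: sort entrants by
# current score, then pair each with the nearest-scoring opponent it has
# not yet faced. (The paper's "Adaptive Swiss" scheme may differ.)

def swiss_pairings(scores, played):
    """scores: {model: points}; played: set of frozensets of past matchups."""
    order = sorted(scores, key=scores.get, reverse=True)
    pairs, used = [], set()
    for a in order:
        if a in used:
            continue
        for b in order:
            # Pick the highest-ranked unused opponent not already played.
            if b is not a and b not in used and frozenset((a, b)) not in played:
                pairs.append((a, b))
                used.update((a, b))
                break
        # If no unplayed opponent remains, `a` sits out this round.
    return pairs

# Four hypothetical models after two rounds; m1 and m2 already met.
scores = {"m1": 2, "m2": 2, "m3": 1, "m4": 0}
played = {frozenset(("m1", "m2"))}
pairs = swiss_pairings(scores, played)  # m1 meets m3; m2 meets m4
```

Pairing by running score concentrates the judging budget on close matchups, which is what makes a Swiss design cheaper than a full round-robin over every model pair.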

Abstract

Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent rankings of humor generation capability. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.
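The Bradley-Terry step the abstract mentions fits a latent strength for each model from the pairwise win counts the tournament produces. A minimal sketch of the standard MM (minorization-maximization) fixed-point update for Bradley-Terry MLE, with made-up win counts (not the paper's data):

```python
# Bradley-Terry MLE via the classic MM update: each model i gets a
# strength p_i such that P(i beats j) = p_i / (p_i + p_j).
# wins[i][j] = number of pairwise judgments where model i beat model j.

def bradley_terry(wins, iters=200):
    n = len(wins)
    p = [1.0] * n  # uniform starting strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            # Expected-comparison denominator from the MM bound.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize so strengths sum to 1
    return p

# Illustrative 3-model win matrix (hypothetical counts).
wins = [
    [0, 8, 9],   # model A
    [2, 0, 6],   # model B
    [1, 4, 0],   # model C
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
# Higher strength = stronger humor generator, so A > B > C here.
```

Because the fitted strengths come from one joint likelihood over all comparisons, the resulting ranking is globally consistent even when individual matchups disagree.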