HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
arXiv cs.CL, April 23, 2026
Key Points
- Evaluating humor generation in LLMs remains difficult because prior methods produce separate, non-comparable metrics, making it hard to rank models and track progress consistently.
- The paper introduces HumorRank, a tournament-based leaderboard that converts pairwise humor judgments into unified, globally consistent rankings.
- It uses the SemEval-2026 MWAHAHA dataset and runs extensive automated pairwise evaluations across nine models (including proprietary, open-weight, and specialized systems).
- HumorRank applies pairwise judging grounded in the GTVH (General Theory of Verbal Humor), aggregates comparisons through an Adaptive Swiss tournament, and fits a Bradley-Terry model via maximum-likelihood estimation to score overall humor capability.
- The authors find that humor quality depends more on mastering comedic mechanisms than on model scale, providing a scalable and interpretable benchmarking approach.
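To make the ranking step concrete, here is a minimal sketch of Bradley-Terry maximum-likelihood estimation from a pairwise win matrix, using the standard minorization-maximization (MM) update. The matchup counts and model count below are illustrative, not from the paper, and this is a generic implementation rather than the authors' exact pipeline.

```python
def bradley_terry(wins, n_models, iters=200):
    """Estimate Bradley-Terry strengths from pairwise comparisons.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1; higher means funnier
    under the judged comparisons.
    """
    p = [1.0] * n_models  # initial strength for every model
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            # MM update: total wins of i divided by a weighted count
            # of all games i played, weighted by current strengths.
            num = sum(wins[i][j] for j in range(n_models) if j != i)
            den = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models) if j != i
            )
            new_p.append(num / den if den > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # renormalize each iteration
    return p

# Toy example with 3 hypothetical models; model 0 wins most matchups.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins, 3)
ranking = sorted(range(3), key=lambda i: -strengths[i])
```

In a tournament setting like the one described, the Swiss pairing stage decides *which* comparisons get made, and the Bradley-Terry fit then turns the resulting win matrix into a single globally consistent ranking.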