POEMetric: The Last Stanza of Humanity
arXiv cs.CL / 4/7/2026
Key Points
- The paper introduces POEMetric, a comprehensive evaluation framework for poetry quality that covers instruction-following, creativity and language richness, emotional resonance, and overall poem appraisal, including authorship estimation.
- Researchers compiled a human reference dataset of 203 English poems across seven fixed forms, annotated with meter, rhyme patterns, and themes, and generated 6,090 corresponding poems using 30 LLMs under matched form and theme conditions (203 poems × 30 models, i.e. one generated poem per human poem per model).
- Using rule-based evaluation and LLM-as-a-judge (validated by human experts), results show LLMs can achieve strong form accuracy and theme alignment but consistently underperform humans on advanced poetic abilities.
- Compared with human poets, the best LLMs fall short in creativity, idiosyncrasy, emotional resonance, and effective use of imagery and literary devices, leading to lower overall poem-quality scores.
- The authors release the dataset and code publicly, positioning POEMetric as a practical benchmark for measuring how close LLM-generated poetry is to human performance.
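
To make the "rule-based evaluation" of form accuracy more concrete, here is a minimal Python sketch of what such a check could look like. It is an illustration only, not POEMetric's actual rule set: the function names (`rhyme_scheme`, `matches_form`) are hypothetical, and the rhyme test is a deliberately crude assumption that compares the last two letters of each line's final word rather than its pronunciation.

```python
import re

def last_word(line: str) -> str:
    """Return the final alphabetic word of a line, lowercased."""
    words = re.findall(r"[a-z']+", line.lower())
    return words[-1] if words else ""

def rhyme_scheme(lines: list[str], suffix_len: int = 2) -> str:
    """Label lines A, B, C, ... by the spelling of their final word's ending.

    Assumption: two lines "rhyme" if their last words share the same
    trailing letters. A real checker would compare phonemes instead.
    """
    labels: dict[str, str] = {}
    scheme = []
    for line in lines:
        key = last_word(line)[-suffix_len:]
        if key not in labels:
            labels[key] = chr(ord("A") + len(labels))
        scheme.append(labels[key])
    return "".join(scheme)

def matches_form(poem: str, expected_lines: int, expected_scheme: str) -> bool:
    """Check line count and rhyme scheme against a target fixed form."""
    lines = [l for l in poem.splitlines() if l.strip()]
    return len(lines) == expected_lines and rhyme_scheme(lines) == expected_scheme

# Hypothetical example poem, written for this sketch.
poem = """The cat sat on the mat
A dog ran through the night
It wore a tiny hat
And vanished out of sight"""

print(rhyme_scheme(poem.splitlines()))   # ABAB
print(matches_form(poem, 4, "ABAB"))     # True
```

A check like this can only verify surface structure such as line count and rhyme pattern, which is consistent with the summary's point that form accuracy is the part LLMs handle well, while creativity and emotional resonance need a judge.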


