GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals
arXiv cs.CL / 3/12/2026
📰 News · Models & Research
Key Points
- GhazalBench is introduced as a usage-grounded benchmark for evaluating LLMs on Persian ghazals, testing two abilities: producing faithful paraphrases and retrieving canonical verses.
- The evaluation across several proprietary and open-weight multilingual LLMs reveals a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based tasks, while recognition-based tasks reduce this gap.
- An English sonnet benchmark shows markedly higher recall, suggesting the limits are tied to training exposure rather than architectural constraints.
- The authors advocate evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts, and GhazalBench is publicly available at the linked GitHub repository.
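The completion-vs-recognition dissociation described above can be made concrete with a small scoring sketch. This is a hypothetical illustration, not GhazalBench's actual harness: the verse strings, the `exact_match` and `recognition_score` helpers, and the toy item are all invented here, and the benchmark's real task formats and normalization may differ.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Completion-style scoring: the model must reproduce the canonical
    verse verbatim (after trivial whitespace normalization)."""
    return " ".join(prediction.split()) == " ".join(gold.split())

def recognition_score(choice_index: int, gold_index: int) -> bool:
    """Recognition-style scoring: the model only has to pick the canonical
    verse out of a small candidate set."""
    return choice_index == gold_index

# Toy item: an opening hemistich as cue, the gold continuation, and distractors.
item = {
    "cue": "opening hemistich (placeholder)",
    "gold": "verse B",
    "choices": ["verse A", "verse B", "verse C"],
    "gold_index": 1,
}

# A model that paraphrases well but recalls imperfectly: its free-form
# completion drifts from the canonical wording, yet it still recognizes
# the right verse among candidates -- the dissociation the paper reports.
completion_output = "verse B (approximate wording)"
recognized_index = 1

completion_correct = exact_match(completion_output, item["gold"])          # False
recognition_correct = recognition_score(recognized_index, item["gold_index"])  # True
```

Under this scoring, a model can fail every completion item while passing the matching recognition items, which is the gap the key points describe narrowing under recognition-based evaluation.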