Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry
arXiv cs.CL / 3/18/2026
Key Points
- The Tarab Corpus is a large-scale dataset of Arabic lyrics and poetry comprising 2.56 million verses and 13.5 million tokens, which the paper presents as the largest open Arabic corpus of creative text spanning classical to contemporary production.
- It covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties (Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi), with texts spanning more than fourteen centuries.
- Each verse includes structured metadata on linguistic variety, geographic origin, and historical or cultural context to enable cross-genre and diachronic analysis.
- The paper describes the data collection, normalization, and validation pipeline and reports baseline analyses for variety identification and genre differentiation. The dataset is publicly available on Hugging Face.
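
To make the per-verse metadata concrete, here is a minimal Python sketch of how one record and a simple per-variety breakdown might look. The field names (`text`, `variety`, `origin`, `context`), the helper functions, and the sample rows are all hypothetical placeholders chosen for illustration; the released schema and the paper's tokenization may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Verse:
    """One verse plus the three metadata facets named in the summary.

    All field names are assumptions, not the corpus's actual schema.
    """
    text: str
    variety: str   # linguistic variety, e.g. "Classical", "MSA", "Egyptian"
    origin: str    # geographic origin
    context: str   # historical or cultural context


def verses_per_variety(verses):
    """Count how many verses belong to each linguistic variety."""
    return Counter(v.variety for v in verses)


def tokens_per_variety(verses):
    """Sum whitespace-delimited tokens per variety (a crude stand-in tokenizer)."""
    totals = Counter()
    for v in verses:
        totals[v.variety] += len(v.text.split())
    return totals


# Placeholder records for illustration only; not taken from the corpus.
SAMPLE = [
    Verse("نص تجريبي أول", "Egyptian", "Egypt", "contemporary song"),
    Verse("نص تجريبي ثان طويل", "Egyptian", "Egypt", "contemporary song"),
    Verse("نص تجريبي ثالث", "Classical", "Arabian Peninsula", "classical poetry"),
]

if __name__ == "__main__":
    print(verses_per_variety(SAMPLE))
    print(tokens_per_variety(SAMPLE))
```

Grouping on the `variety` field is the kind of operation the structured metadata is meant to support; the same pattern would extend to `origin` or `context` for cross-genre comparisons.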