Training Data Size Sensitivity in Unsupervised Rhyme Recognition
arXiv cs.CL / 4/10/2026
Key Points
- The paper studies how sensitive unsupervised rhyme recognition performance is to the amount of training data, using RhymeTagger, a language-independent tool that infers rhyme from repeating patterns in poetry corpora.
- It evaluates RhymeTagger across seven languages and analyzes how both training size and cross-language differences affect classification accuracy.
- To establish a realistic benchmark, the authors measure inter-annotator agreement on a manually annotated poem subset and identify causes of expert disagreement, including phonetic similarity and the positional distance between rhyming words.
- The study compares RhymeTagger against three large language models in a one-shot setup, finding that LLMs without strong phonetic representations struggle, while RhymeTagger can exceed the human inter-annotator agreement benchmark once training data is sufficient.
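To make the task concrete, here is a minimal, self-contained sketch of rhyme-scheme labeling. It uses shared orthographic suffixes as a crude proxy for the phonetic similarity that RhymeTagger actually models; the function name and suffix threshold are illustrative assumptions, not taken from the paper.

```python
def rhyme_scheme(end_words, min_suffix=2):
    """Assign a rhyme-scheme label (A, B, C, ...) to each line-ending word.

    Two words are treated as rhyming when they share an orthographic
    suffix of at least `min_suffix` characters -- a rough stand-in for
    the phonetic similarity a real rhyme recognizer would use.
    """
    labels = []
    groups = []  # one representative word per rhyme class
    for word in end_words:
        for i, rep in enumerate(groups):
            # Length of the longest common suffix of word and rep.
            k = 0
            while (k < min(len(word), len(rep))
                   and word[-1 - k] == rep[-1 - k]):
                k += 1
            if k >= min_suffix:
                labels.append(chr(ord("A") + i))
                break
        else:
            # No existing class matched: open a new rhyme class.
            groups.append(word)
            labels.append(chr(ord("A") + len(groups) - 1))
    return "".join(labels)

# rhyme_scheme(["day", "night", "way", "light"]) -> "ABAB"
```

An orthographic proxy like this fails exactly where the paper locates expert disagreement: pairs such as "tree" / "sea" rhyme phonetically but share no written suffix, which is why systems trained on phonetic representations (or corpora large enough to learn such patterns) are needed.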