findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
arXiv cs.AI / 3/30/2026
Key Points
- findsylls is introduced as a modular, language-agnostic toolkit that standardizes syllable segmentation by unifying classical syllable detectors with end-to-end syllabifiers under a common interface.
- The framework supports syllable embedding extraction and multi-granular evaluation, enabling controlled comparisons of token rates, representations, and algorithms.
- It implements and standardizes existing methods such as Sylber and VG-HuBERT while allowing components to be recombined for reproducible experimentation.
- The paper demonstrates the toolkit on English and Spanish corpora and extends it to an under-documented Central Mande language (Kono) using newly hand-annotated data.
- By providing a single pipeline for both high-resource and under-resourced languages, findsylls aims to reduce fragmentation in syllabification research and improve cross-study comparability.
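The "common interface" idea in the key points can be illustrated with a minimal sketch. This is not the findsylls API: every name here (`Segmenter`, `EnergyThresholdSegmenter`, `FixedWindowSegmenter`, `compare_token_rates`) is hypothetical, and the two toy segmenters only stand in for real detectors such as Sylber or VG-HuBERT. The sketch assumes each method returns `(start, end)` times in seconds, which is enough to compare token rates across interchangeable algorithms.

```python
from typing import Protocol, Sequence

Segment = tuple[float, float]  # (start_s, end_s); an assumed output format


class Segmenter(Protocol):
    """Hypothetical shared interface: any segmenter maps audio to segments."""

    def segment(self, samples: Sequence[float], sample_rate: int) -> list[Segment]: ...


class EnergyThresholdSegmenter:
    """Toy stand-in for a classical detector: runs of high-energy frames become segments."""

    def __init__(self, frame_len: int = 4, threshold: float = 0.1) -> None:
        self.frame_len = frame_len
        self.threshold = threshold

    def segment(self, samples: Sequence[float], sample_rate: int) -> list[Segment]:
        fl = self.frame_len
        n_frames = len(samples) // fl
        segments: list[Segment] = []
        start = None  # index of the first frame of the currently open segment
        for i in range(n_frames):
            frame = samples[i * fl:(i + 1) * fl]
            energy = sum(x * x for x in frame) / fl
            if energy > self.threshold:
                if start is None:
                    start = i
            elif start is not None:
                segments.append((start * fl / sample_rate, i * fl / sample_rate))
                start = None
        if start is not None:  # close a segment that runs to the end of the audio
            segments.append((start * fl / sample_rate, n_frames * fl / sample_rate))
        return segments


class FixedWindowSegmenter:
    """Toy baseline: cut the audio into equal-length windows."""

    def __init__(self, window_s: float = 2.0) -> None:
        self.window_s = window_s

    def segment(self, samples: Sequence[float], sample_rate: int) -> list[Segment]:
        duration = len(samples) / sample_rate
        segments: list[Segment] = []
        t = 0.0
        while t < duration:
            segments.append((t, min(t + self.window_s, duration)))
            t += self.window_s
        return segments


def compare_token_rates(
    segmenters: dict[str, Segmenter], samples: Sequence[float], sample_rate: int
) -> dict[str, float]:
    """Return tokens per second for each segmenter on the same audio."""
    duration = len(samples) / sample_rate
    return {
        name: len(seg.segment(samples, sample_rate)) / duration
        for name, seg in segmenters.items()
    }
```

Because both classes satisfy the same protocol, swapping one algorithm for another changes only the dictionary passed to `compare_token_rates`, which is the kind of controlled comparison the summary describes.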