findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

arXiv cs.AI / 3/30/2026

💬 OpinionIdeas & Deep AnalysisTools & Practical UsageModels & Research

Key Points

  • findsylls is introduced as a modular, language-agnostic toolkit that standardizes syllable segmentation by unifying classical syllable detectors with end-to-end syllabifiers under a common interface.
  • The framework supports syllable embedding extraction and multi-granular evaluation, enabling controlled comparisons of token rates, representations, and algorithms.
  • It implements and standardizes existing methods such as Sylber and VG-HuBERT while allowing components to be recombined for reproducible experimentation.
  • The paper demonstrates the toolkit on English and Spanish corpora and extends it to an under-documented Central Mande language (Kono) using newly hand-annotated data.
  • By providing a single pipeline for both high-resource and under-resourced languages, findsylls aims to reduce fragmentation in syllabification research and improve cross-study comparability.

Abstract

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding | AI Navigate