Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment
arXiv cs.AI / 4/10/2026
Key Points
- Harf-Speech is a modular framework for Arabic phoneme-level pronunciation assessment, intended to support scalable speech therapy and language learning in a setting where validated Arabic tools are scarce.
- The system combines a Modern Standard Arabic (MSA) phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein-style alignment, and a scoring method that blends longest-common-subsequence and edit-distance metrics.
- Three Arabic ASR architectures are fine-tuned on phoneme data and benchmarked against zero-shot multimodal models, with OmniASR-CTC-1B-v2 achieving an 8.92% phoneme error rate.
- For clinical validation, three certified speech-language pathologists scored 40 utterances. Harf-Speech produced clinically aligned, interpretable scores that correlate with the expert ratings (Pearson r = 0.791, ICC(2,1) = 0.659) and outperform prior end-to-end assessment frameworks.
- According to the reported results, agreement between Harf-Speech and the experts is comparable to agreement among the experts themselves; the emphasis is on clinical alignment rather than on generic pronunciation-scoring accuracy alone.
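
To make the alignment and scoring ideas in the key points concrete, here is a minimal Python sketch of a phoneme error rate and a blended LCS/edit-distance score. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give the blending formula, so the equal weighting (`lcs_weight=0.5`), the normalization choices, the function names, and the example phoneme symbols are all hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Length of the longest common subsequence of two phoneme sequences."""
    prev = [0] * (len(hyp) + 1)
    for r in ref:
        curr = [0]
        for j, h in enumerate(hyp, start=1):
            if r == h:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def blended_score(ref: Sequence[str], hyp: Sequence[str],
                  lcs_weight: float = 0.5) -> float:
    """Hypothetical blend of LCS similarity and edit-distance similarity.

    ASSUMPTION: the weighting and normalization are illustrative only;
    the source does not specify how the two metrics are combined.
    """
    if not ref:
        raise ValueError("reference sequence must not be empty")
    lcs_sim = lcs_length(ref, hyp) / len(ref)
    longest = max(len(ref), len(hyp))
    edit_sim = max(0.0, 1.0 - edit_distance(ref, hyp) / longest)
    return lcs_weight * lcs_sim + (1.0 - lcs_weight) * edit_sim


# Hypothetical example: expected phonemes vs. a recognized sequence
# with one substituted phoneme.
expected = ["k", "i", "t", "a", "b"]
produced = ["k", "i", "d", "a", "b"]
per = phoneme_error_rate(expected, produced)
score = blended_score(expected, produced)
```

In a pipeline like the one described, `expected` would come from the phonetizer and `produced` from the speech-to-phoneme model; the per-phoneme alignment itself could be recovered by backtracking through the same edit-distance table.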



