Harf-Speech: A Clinically Aligned Framework for Arabic Phoneme-Level Speech Assessment

arXiv cs.AI / 4/10/2026

Key Points

  • Harf-Speech is introduced as a modular framework for Arabic phoneme-level pronunciation assessment, aimed at supporting scalable speech therapy and language learning, where validated Arabic tools remain scarce.
  • The system combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein-style alignment, and a blended scorer based on longest common subsequence (LCS) and edit-distance metrics (see the sketch after this list).
  • Three Arabic ASR architectures are fine-tuned on phoneme data and benchmarked against zero-shot multimodal models, with the best, OmniASR-CTC-1B-v2, achieving an 8.92% phoneme error rate (PER).
  • For clinical validation, three certified speech-language pathologists independently scored 40 utterances; Harf-Speech's scores correlate with mean expert ratings (Pearson r = 0.791, ICC(2,1) = 0.659) and outperform prior end-to-end assessment frameworks.
  • The reported results position Harf-Speech as yielding scores comparable to inter-rater expert agreement, emphasizing clinical alignment rather than only generic pronunciation scoring accuracy.
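
As a rough illustration of the alignment-and-scoring stage, the sketch below blends an LCS ratio with a normalized edit-distance similarity over phoneme token sequences. The paper does not publish its exact formula, so the `blended_score` function, its `alpha` mixing weight, and the toy phoneme strings are hypothetical; the PER definition (edit distance normalized by reference length) is the standard one.

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Edit distance between two phoneme sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def lcs_length(ref: list[str], hyp: list[str]) -> int:
    """Length of the longest common subsequence of phonemes."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def per(ref: list[str], hyp: list[str]) -> float:
    """Phoneme error rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def blended_score(ref: list[str], hyp: list[str], alpha: float = 0.5) -> float:
    """Hypothetical blend of LCS ratio and edit-distance similarity in [0, 1].
    `alpha` is an illustrative mixing weight, not a value from the paper."""
    lcs_ratio = lcs_length(ref, hyp) / max(len(ref), 1)
    edit_sim = 1.0 - levenshtein(ref, hyp) / max(len(ref), len(hyp), 1)
    return alpha * lcs_ratio + (1.0 - alpha) * edit_sim

# Toy example: reference vs. a hypothesis with one substituted phoneme.
ref = ["s", "a", "l", "aː", "m"]
hyp = ["s", "a", "r", "aː", "m"]
print(f"PER = {per(ref, hyp):.2f}, blended score = {blended_score(ref, hyp):.2f}")
```

For this toy pair, one substituted phoneme out of five yields PER = 0.20 and a blended score of 0.80; the actual system maps such scores onto a clinical scale.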

Abstract

Automated phoneme-level pronunciation assessment is vital for scalable speech therapy and language learning, yet validated tools for Arabic remain scarce. We present Harf-Speech, a modular system scoring Arabic pronunciation at the phoneme level on a clinical scale. It combines an MSA phonetizer, a fine-tuned speech-to-phoneme model, Levenshtein alignment, and a blended scorer using longest common subsequence and edit-distance metrics. We fine-tune three ASR architectures on Arabic phoneme data and benchmark them against zero-shot multimodal models; the best, OmniASR-CTC-1B-v2, achieves 8.92% phoneme error rate. Three certified speech-language pathologists independently scored 40 utterances for clinical validation. Harf-Speech attains a Pearson correlation of 0.791 and ICC(2,1) of 0.659 with mean expert scores, outperforming existing end-to-end assessment frameworks. These results show Harf-Speech yields clinically aligned, interpretable scores comparable to inter-rater expert agreement.
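
To make the validation metrics concrete, the following sketch computes a Pearson correlation and ICC(2,1) (the standard two-way random-effects, absolute-agreement, single-rater form from Shrout and Fleiss) from a ratings matrix. The ratings and system scores below are invented for illustration, and pairing the system with the mean expert score as two "raters" is an assumption about how the paper applied the metric.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` has shape (n_subjects, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)              # between-subjects mean square
    ms_c = ss_cols / (k - 1)              # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Illustrative data: 3 SLPs rating 5 utterances (the paper used 40).
expert = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [1, 2, 2]], float)
system = np.array([4.2, 2.1, 4.8, 3.0, 1.5])

# Pearson correlation between system scores and mean expert scores.
r = np.corrcoef(system, expert.mean(axis=1))[0, 1]
print(f"Pearson r = {r:.3f}")

# Agreement between system and mean expert scores, each treated as one rater.
pair = np.column_stack([system, expert.mean(axis=1)])
print(f"ICC(2,1) = {icc_2_1(pair):.3f}")
```

Unlike Pearson correlation, which is invariant to scale and offset, ICC(2,1) penalizes absolute disagreement, which is why the paper reports both.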