Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations

arXiv cs.CL / 4/14/2026


Key Points

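  • The study proposes a training-free method for quantifying dysarthria severity across languages. It measures, within frozen HuBERT self-supervised speech representations, how far phonological feature subspaces estimated from healthy controls have degraded.
  • Without labelled pathological data or a supervised severity model, the method extracts per-speaker phone-level embeddings with Montreal Forced Aligner, computes d-prime along several phonological contrasts (e.g. nasality, voicing, stridency) and vowel features, and builds a 12-dimensional phonological profile.
  • Across 10 corpora, 5 languages, and 3 aetiologies (890 speakers in total), the main consonant d-prime features correlate significantly with clinical severity. A meta-analysis shows consistent effects, and the result holds under FDR correction, leave-one-corpus-out removal, and alignment-quality controls.
  • All 12 features also statistically distinguish healthy controls from severely dysarthric speakers, and the nasality measure decreases monotonically with severity grade in most corpora.
  • The method applies under one minimal requirement (an MFA acoustic model must exist for the target language), and the authors state that they release the pipeline and phone feature configurations for six languages.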

Abstract

Dysarthric speech severity assessment typically requires trained clinicians or supervised models built from labelled pathological speech, limiting scalability across languages and clinical settings. We present a training-free method that quantifies dysarthria severity by measuring degradation in phonological feature subspaces within frozen HuBERT representations. No supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner. For each speaker, we extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy controls, and construct a 12-dimensional phonological profile. Evaluating 890 speakers across 10 corpora, 5 languages (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson's disease, cerebral palsy, ALS), we find that all five consonant d-prime features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p < 2e-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero). The effect replicates within individual corpora, survives FDR correction, and remains robust to leave-one-corpus-out removal and alignment quality controls. Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p < 0.001). The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages). We release the full pipeline and phone feature configurations for six languages.
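
The core measurement, a d-prime score along a contrast direction estimated from healthy controls, can be illustrated with a minimal sketch. This is not the released pipeline. The abstract does not say how directions are estimated, so the mean-difference direction, the pooled-standard-deviation form of d-prime, the function names `contrast_direction` and `d_prime`, and the synthetic 16-dimensional "embeddings" are all assumptions made for illustration.

```python
import numpy as np


def contrast_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`.

    Assumption: a simple mean-difference direction; the paper may use
    a different estimator.
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def d_prime(pos, neg, direction):
    """d' between two embedding sets after projection onto `direction`."""
    p = pos @ direction
    n = neg @ direction
    pooled_sd = np.sqrt(0.5 * (p.var(ddof=1) + n.var(ddof=1)))
    return (p.mean() - n.mean()) / pooled_sd


# Synthetic stand-ins for phone-level embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0  # the contrast lives along the first coordinate


def sample(n, offset):
    """Draw n unit-variance embeddings shifted by `offset` along `axis`."""
    return rng.normal(size=(n, dim)) + offset * axis


# Direction is estimated from healthy-control phones only.
direction = contrast_direction(sample(200, 2.0), sample(200, -2.0))

# Score two hypothetical speakers along that fixed direction.
clear_score = d_prime(sample(100, 2.0), sample(100, -2.0), direction)
degraded_score = d_prime(sample(100, 0.5), sample(100, -0.5), direction)
```

In this toy setup the speaker whose two phone classes sit closer together receives the lower d-prime, which mirrors the idea that a degraded contrast shrinks separation along a direction fixed in advance from controls. Repeating the computation for each contrast would give one profile entry per feature.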