AI Navigate

Evaluating Large Language Models for Gait Classification Using Text-Encoded Kinematic Waveforms

arXiv cs.LG / 3/17/2026


Key Points

  • The study evaluated whether general-purpose LLMs can classify continuous gait kinematics when encoded as textual numeric sequences and compared their performance to traditional classifiers (KNN and OCSVM) using Leave-One-Subject-Out cross-validation.
  • The supervised KNN achieved the highest multiclass MCC of 0.88, outperforming the zero-shot LLMs.
  • GPT-5 with reference grounding reached a multiclass MCC of 0.70 and a binary MCC of 0.68, still below the KNN and above the class-independent OCSVM.
  • Using high-confidence predictions increased the LLM multiclass MCC to 0.83 on the filtered subset, indicating sensitivity to confidence thresholds.
  • The smaller o4-mini model performed comparably to larger models, highlighting computational efficiency; the authors suggest LLMs are better suited to exploratory analysis than to direct diagnostic use.
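The confidence-filtering effect noted above (multiclass MCC rising when low-confidence predictions are discarded) can be sketched in miniature. The helper names and toy predictions below are illustrative stand-ins, not the study's code or data, and the MCC shown is the simple binary form:

```python
import math

def binary_mcc(y_true, y_pred):
    # Matthews Correlation Coefficient for binary labels (0/1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def filter_by_confidence(preds, threshold):
    # Keep only predictions whose self-rated confidence meets the threshold.
    return [(t, p) for t, p, conf in preds if conf >= threshold]

# Toy data: (true_label, predicted_label, model's self-rated confidence).
preds = [(1, 1, 0.9), (0, 0, 0.8), (1, 0, 0.3),
         (0, 1, 0.4), (1, 1, 0.95), (0, 0, 0.85)]

all_pairs = [(t, p) for t, p, _ in preds]
kept = filter_by_confidence(preds, 0.5)

mcc_all = binary_mcc([t for t, _ in all_pairs], [p for _, p in all_pairs])
mcc_high = binary_mcc([t for t, _ in kept], [p for _, p in kept])
```

On this toy set, both errors carry low confidence, so the filtered MCC exceeds the unfiltered one, mirroring the pattern the study reports.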

Abstract

Background: Machine learning (ML) enhances gait analysis but often lacks the level of interpretability desired for clinical adoption. Large Language Models (LLMs) may offer explanatory capabilities and confidence-aware outputs when applied to structured kinematic data. This study therefore evaluated whether general-purpose LLMs can classify continuous gait kinematics when represented as textual numeric sequences and how their performance compares to conventional ML approaches.

Methods: Lower-body kinematics were recorded from 20 participants performing seven gait patterns. A supervised KNN classifier and a class-independent One-Class SVM (OCSVM) were compared against zero-shot LLMs (GPT-5, GPT-5-mini, GPT-4.1, and o4-mini). Models were evaluated using Leave-One-Subject-Out (LOSO) cross-validation. LLMs were tested both with and without explicit reference gait statistics.

Results: The supervised KNN achieved the highest performance (multiclass Matthews Correlation Coefficient, MCC = 0.88). The best-performing LLM (GPT-5) with reference grounding achieved a multiclass MCC of 0.70 and a binary MCC of 0.68, outperforming the class-independent OCSVM (binary MCC = 0.60). LLM performance was highly dependent on explicit reference information and self-rated confidence; when restricted to high-confidence predictions, multiclass MCC increased to 0.83 on the filtered subset. Notably, the computationally efficient o4-mini model performed comparably to larger models.

Conclusion: When continuous kinematic waveforms were encoded as textual numeric tokens, general-purpose LLMs, even with reference grounding, did not match supervised multiclass classifiers for precise gait classification and are better regarded as exploratory systems requiring cautious, human-guided interpretation rather than diagnostic use.
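The study's core input representation is a continuous kinematic waveform serialized as a textual numeric sequence. A minimal sketch of one plausible serialization follows; the exact format used in the paper is not reproduced here, and `encode_waveform` plus the sample values are hypothetical:

```python
def encode_waveform(name, samples, decimals=2):
    # Render a kinematic waveform as a plain-text numeric sequence,
    # one plausible way to present a continuous signal to a text-only LLM.
    # (Illustrative only; the paper's actual serialization may differ.)
    values = ", ".join(f"{v:.{decimals}f}" for v in samples)
    return f"{name}: [{values}]"

# Hypothetical knee-flexion samples over part of a gait cycle (degrees).
knee = [5.0, 12.3, 20.1, 15.7, 8.2]
prompt_line = encode_waveform("knee_flexion_deg", knee)
```

Such a line could then be embedded in a zero-shot prompt, optionally alongside per-class reference statistics, which is the "reference grounding" condition the abstract describes.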