Evaluating the Limitations of Protein Sequence Representations for Parkinson's Disease Classification

arXiv cs.AI / 4/15/2026

Key Points

  • The arXiv study tests how well protein primary-sequence representations alone can classify Parkinson’s disease, addressing uncertainty about whether sequence data contains sufficient discriminatory signal.
  • Using nested, stratified, leakage-free cross-validation, the authors evaluate multiple feature types: amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and protein language model embeddings (e.g., ProtBERT).
  • The best configuration (ProtBERT + MLP) achieves moderate performance (F1 ≈ 0.704 and ROC-AUC ≈ 0.748), suggesting limited standalone discriminative power from sequences.
  • Simpler representations such as k-mers reach comparable F1 (up to ~0.667), but they do so with recall near 0.98 and precision around 0.50, which reflects a strong bias toward positive predictions.
  • Unsupervised analyses reveal no intrinsic structure aligned with the class labels, and a Friedman test (p = 0.1749) finds no significant performance differences across models. The authors conclude that primary sequences alone do not capture class structure well, which motivates richer feature sources such as structural, functional, or interaction-based descriptors.
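
The precision and recall quoted above for the k-mer models can be checked against the reported F1 with a few lines of arithmetic. The sketch below also computes the F1 of a hypothetical classifier that labels every protein positive. That baseline assumes a balanced dataset (50% positives), which is an assumption for illustration only, since the class ratio is not given here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate precision/recall reported for the k-mer models.
kmer_f1 = f1_score(precision=0.50, recall=0.98)

# Hypothetical baseline: predict "positive" for every protein.
# On an ASSUMED balanced dataset this gives precision 0.50 and recall 1.00.
always_positive_f1 = f1_score(precision=0.50, recall=1.00)

print(f"k-mer F1 from reported P/R:            {kmer_f1:.3f}")
print(f"always-positive F1 (assumed balanced): {always_positive_f1:.3f}")
```

Under that assumption, the two values are nearly identical. This is consistent with the reading that the k-mer models' F1 comes largely from predicting the positive class almost everywhere.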

Abstract

The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.
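
As a rough illustration of what a nested stratified cross-validation framework involves, the following sketch builds the index splits in plain Python. It is not the authors' code: the function names, fold counts, seed handling, and toy labels are all hypothetical. The point it shows is that the inner folds used for model selection are drawn only from each outer training set, so the outer test fold never influences tuning.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, k, seed):
    """Split `indices` into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        # Deal class members round-robin so each fold gets a similar share.
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) lists of indices."""
    all_idx = list(range(len(labels)))
    outer = stratified_folds(all_idx, labels, outer_k, seed)
    for f, test in enumerate(outer):
        train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        # Inner folds are built from the outer training indices only.
        inner = stratified_folds(train, labels, inner_k, seed + f + 1)
        inner_splits = []
        for h, val in enumerate(inner):
            fit = [i for j, fold in enumerate(inner) if j != h for i in fold]
            inner_splits.append((fit, val))
        yield train, test, inner_splits


# Toy labels (made-up data): 1 = disease-associated, 0 = control.
labels = [1] * 20 + [0] * 20
splits = list(nested_splits(labels))

for train, test, inner_splits in splits:
    for fit, val in inner_splits:
        # Leakage check: the outer test fold never appears in inner data.
        assert not set(test) & (set(fit) | set(val))

print(len(splits), "outer folds,", len(splits[0][2]), "inner folds each")
```

In a full pipeline, hyperparameters would be chosen on the inner `(fit, val)` pairs, and the selected configuration would then be scored once on the corresponding outer test fold. Any feature scaling or vocabulary building would likewise be fitted on training indices only.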