Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data

arXiv cs.CV / 27 Mar 2026


Key Points

  • The study evaluates whether self-supervised learning (SSL) can improve knee osteoarthritis (OA) diagnosis and prognosis compared with ImageNet-pretrained initialization using both image-only and image-text (multimodal) hospital data.
  • For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL results are mixed: image-only SSL helps during linear probing but does not beat ImageNet when the full model is fine-tuned, and multimodal SSL does not improve grading performance.
  • The authors attribute the diagnostic underperformance to strong severity bias in the uncurated hospital pretraining corpus, where an estimated 93% of images correspond to KL grade 3.
  • In contrast, the same multimodal SSL initialization substantially improves prognostic modeling, outperforming ImageNet baselines in predicting 4-year structural incidence and progression, including external validation.
  • The findings suggest uncurated image-text data may be ineffective for diagnosis when pretraining and task distributions diverge, but can provide a useful signal for prognosis when the downstream task matches the data distribution.
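The linear-probing versus full fine-tuning distinction in the second bullet can be sketched in a few lines. This is a minimal toy illustration, not the authors' code: the "encoder" is a fixed random projection standing in for a pretrained backbone, and only a logistic head is trained, which is what "linear probing (frozen encoder)" means; in full fine-tuning the encoder weights would be updated as well.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_enc):
    # Stand-in for a pretrained backbone: a fixed nonlinear projection.
    return np.tanh(x @ W_enc)

def train_linear_probe(X, y, W_enc, lr=0.1, steps=200):
    """Train only a linear head on frozen encoder features (linear probing)."""
    feats = encoder(X, W_enc)  # encoder weights W_enc are never updated
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feats @ w))      # logistic head
        w -= lr * feats.T @ (p - y) / len(y)      # gradient step on head only
    return w

# Toy binary task standing in as a stand-in for grade classification.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
W_enc = rng.normal(size=(8, 16))  # "pretrained" encoder, frozen throughout

w = train_linear_probe(X, y, W_enc)
acc = ((encoder(X, W_enc) @ w > 0).astype(float) == y).mean()
print(f"linear-probe train accuracy: {acc:.2f}")
```

The design point the paper's result hinges on: a frozen encoder can only succeed if its pretrained features already separate the downstream classes, whereas full fine-tuning can overwrite weak features, which is why image-only SSL helped in the probing regime but lost its edge once the whole model was trained.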

Abstract

This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (an estimated 93% of images at KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.
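The prognosis comparison above is reported in AUROC, which can be computed with the standard rank-sum (Mann-Whitney U) formulation. A minimal sketch follows; the function name and example values are illustrative only (ties in scores are not handled here):

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative. Assumes no tied scores.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # Sum of positive ranks, minus the minimum possible rank sum,
    # normalized by the number of (positive, negative) pairs.
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

Read against this scale, the gap the authors report on external validation (0.701 vs. 0.599 at 10% labeled data) means the multimodal-SSL model ranks a progressing knee above a non-progressing one about 70% of the time, versus roughly 60% for the ImageNet baseline.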