Enhancing Online Support Group Formation Using Topic Modeling Techniques

arXiv stat.ML / 3/27/2026

Key Points

  • The study addresses how online health communities can form more personalized and semantically coherent peer support groups, noting that existing methods struggle with scalability and static, weakly personalized categorization.
  • It proposes two machine-learning approaches, gDMR and gSTM, that use users’ text, demographic profiles, and network-derived node embeddings to automate support group formation.
  • Evaluations on a large MedHelp.org dataset (over 2 million posts) show both models outperform baselines (LDA, DMR, STM) on held-out log likelihood, semantic coherence, and internal group consistency.
  • The gDMR variant yields group covariates that ease practical implementation by drawing on relational patterns in the user network and demographic data, while gSTM uses sparsity constraints to generate more distinct, thematically specific groups.
  • Qualitative validation indicates that automatically generated groups align with manually coded health themes, suggesting the framework could reduce manual curation and improve engagement and peer support quality.

Abstract

Online health communities (OHCs) are vital for fostering peer support and improving health outcomes. Support groups within these platforms can provide more personalized and cohesive peer support, yet traditional support group formation methods face challenges related to scalability, static categorization, and insufficient personalization. To overcome these limitations, we propose two novel machine learning models for automated support group formation: the Group-specific Dirichlet Multinomial Regression (gDMR) and the Group-specific Structured Topic Model (gSTM). These models integrate user-generated textual content, demographic profiles, and interaction data represented through node embeddings derived from user networks to systematically automate personalized, semantically coherent support group formation. We evaluate the models on a large-scale dataset from MedHelp.org, comprising over 2 million user posts. Both models substantially outperform baseline methods, including LDA, DMR, and STM, in predictive accuracy (held-out log likelihood), semantic coherence (UMass metric), and internal group consistency. The gDMR model yields group covariates that facilitate practical implementation by leveraging relational patterns from network structures and demographic data. In contrast, gSTM emphasizes sparsity constraints to generate more distinct and thematically specific groups. Qualitative analysis further validates the alignment between model-generated groups and manually coded themes, showing the practical relevance of the models in informing groups that address diverse health concerns such as chronic illness management, diagnostic uncertainty, and mental health. By reducing reliance on manual curation, these frameworks provide scalable solutions that enhance peer interactions within OHCs, with implications for patient engagement, community resilience, and health outcomes.
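
The abstract names the UMass metric as its measure of semantic coherence. As a rough illustration of what that metric computes, the sketch below scores one topic by how often its top words co-occur in the same documents. It follows the commonly used UMass formulation (sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words) and is not the authors' evaluation code. The function name `umass_coherence` and the toy posts are hypothetical, chosen only for this example.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence for a single topic (illustrative sketch).

    top_words: topic words ordered from most to least probable.
    documents: iterable of tokenized documents (lists of strings).
    Scores closer to zero indicate words that co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                # Assumption: skip pairs whose conditioning word never
                # appears, to avoid dividing by zero.
                continue
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Hypothetical toy corpus standing in for tokenized forum posts.
posts = [
    ["pain", "doctor", "sleep"],
    ["pain", "doctor"],
    ["sleep", "anxiety"],
    ["pain", "anxiety"],
]

coherent = umass_coherence(["pain", "doctor"], posts)
less_coherent = umass_coherence(["pain", "sleep"], posts)
print(coherent, less_coherent)
```

In this toy corpus, "pain" and "doctor" appear together in two of the three posts containing "pain", so that pair scores higher than "pain" and "sleep", which share only one post. A model-level score would typically average this quantity over all topics.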