A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection

arXiv cs.CL / 4/28/2026


Key Points

  • The paper introduces a uniform benchmark suite of four Reddit-derived datasets for NLP-based mental health detection, covering suicidal ideation detection, binary general mental disorder detection, bipolar disorder detection, and multi-class mental disorder classification.
  • The datasets were created with detailed annotation guidelines, linguistic inspection, and human verification to improve quality and reproducibility.
  • Inter-annotator agreement is reported to exceed a baseline of 0.8 on all datasets, supporting the trustworthiness of the labels (see the agreement sketch after this list).
  • Results from prior work with both transformer and contextualized recurrent models show high performance (F1 of approximately 93–99%), indicating the benchmarks are effective for model evaluation.
  • By consolidating these resources into widely accessible, complementary tasks, the work enables cross-task comparison, multi-task learning, and fairer model benchmarking for mental-health-focused NLP research.
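
The paper does not name the agreement metric behind that 0.8 figure; Cohen's kappa is a common choice for this kind of annotation study, so here is a minimal sketch of such a check using scikit-learn. The annotator labels below are invented for illustration.

```python
# Hedged sketch: inter-annotator agreement via Cohen's kappa.
# The benchmark's actual metric and annotation data are not given in this
# summary, so both annotators' labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Two annotators label the same ten Reddit posts (1 = at-risk, 0 = control)
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.80 here; >= 0.8 is commonly read as near-perfect agreement
```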

Abstract

The growing availability of online support groups has opened new windows for studying mental health through natural language processing (NLP). This line of research is hindered, however, by a lack of high-quality, well-validated datasets. Existing studies tend to build task-specific corpora without consolidating them into widely available resources, which makes reproducibility and cross-task comparison difficult. In this paper, we present a uniform benchmark suite of four Reddit-based datasets for distinct but complementary tasks: (i) suicidal ideation detection, (ii) binary general mental disorder detection, (iii) bipolar disorder detection, and (iv) multi-class mental disorder classification. All datasets were built on careful linguistic inspection, well-defined annotation guidelines, and human verification. Inter-annotator agreement consistently exceeded the baseline score of 0.8, supporting the trustworthiness of the labels. Results from prior work show that both transformer and contextualized recurrent models achieve excellent performance on these tasks (F1 ≈ 93–99%), further validating the datasets' usefulness. By combining these resources, we establish a unified foundation for reproducible mental health NLP research that supports cross-task benchmarking, multi-task learning, and fair model comparison. The benchmark suite offers the research community an accessible, varied resource for advancing computational approaches to mental health.
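
As a rough illustration of how F1 scores in that range could be reproduced on one of the tasks, the sketch below evaluates a binary suicidal-ideation classifier with Hugging Face Transformers and scikit-learn. The checkpoint name, example texts, and label mapping are all placeholders, not the authors' actual setup.

```python
# Hedged sketch: scoring a (hypothetical) fine-tuned transformer on the
# binary suicidal-ideation task. "your-org/suicide-risk-bert" is a
# placeholder checkpoint, and LABEL_1 = at-risk is an assumed mapping.
from sklearn.metrics import f1_score
from transformers import pipeline

clf = pipeline("text-classification", model="your-org/suicide-risk-bert")

texts = ["I can't see a way forward anymore.", "Had a great hike this weekend."]
gold = [1, 0]  # illustrative gold labels: 1 = at-risk, 0 = control

pred = [1 if out["label"] == "LABEL_1" else 0 for out in clf(texts)]
print(f"F1: {f1_score(gold, pred):.2f}")  # the paper reports F1 of roughly 0.93-0.99
```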