Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

arXiv cs.CL / 4/9/2026


Key Points

  • The study addresses underdiagnosis of depression in primary care by testing automated detection from routine, passively collected audio-recorded clinical encounters.
  • Using 1,108 encounters from the Establishing Focus study (labels defined by PHQ-9), the authors evaluated three supervised models (Sentence-BERT+LR, LIWC+LR, and ModernBERT) against a zero-shot GPT-OSS baseline.
  • GPT-OSS performed best overall, achieving AUPRC=0.510 and AUROC=0.774, while LIWC+LR was competitive among supervised approaches (AUPRC=0.500, AUROC=0.742).
  • Combined dyadic transcripts (patient+provider) outperform single-speaker setups; providers linguistically mirror patients in depression encounters, which adds predictive signal that neither speaker captures alone.
  • Meaningful performance is attainable from early dialogue (first 128 patient tokens: AUPRC=0.356, AUROC=0.675), indicating potential for in-the-moment clinical decision support as a low-burden complement to existing screening.
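
The early-dialogue setting above can be illustrated with a minimal sketch of how one might cut a diarized transcript down to the first 128 patient tokens. Everything here is an assumption for illustration: the `(speaker, utterance)` turn format, the `"patient"` label, the function name `first_patient_tokens`, and whitespace tokenization (the summary does not say which tokenizer the authors used, and a subword tokenizer would give different counts).

```python
from typing import List, Tuple

# Hypothetical turn format: (speaker label, utterance text).
Turn = Tuple[str, str]


def first_patient_tokens(
    turns: List[Turn],
    limit: int = 128,
    patient_label: str = "patient",
) -> List[str]:
    """Return up to `limit` tokens spoken by the patient, in dialogue order.

    Tokens are whitespace-delimited words, which is an assumption of this
    sketch rather than the tokenization used in the paper.
    """
    if limit <= 0:
        return []
    tokens: List[str] = []
    for speaker, utterance in turns:
        if speaker != patient_label:
            continue  # skip provider turns in the patient-only setting
        for token in utterance.split():
            tokens.append(token)
            if len(tokens) == limit:
                return tokens
    return tokens


# Toy transcript, invented purely to exercise the function.
example_turns: List[Turn] = [
    ("provider", "What brings you in today?"),
    ("patient", "I have been tired lately"),
    ("provider", "How long has that been going on?"),
    ("patient", "A few weeks"),
]

early_text = " ".join(first_patient_tokens(example_turns, limit=6))
print(early_text)
```

The truncated text would then be passed to whichever classifier is being evaluated; a dyadic variant would keep provider turns as well instead of skipping them.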

Abstract

Depression is underdiagnosed in primary care, yet timely identification remains critical. Recorded clinical encounters, increasingly common with digital scribing technologies, present an opportunity to detect depression from naturalistic dialogue. We investigated automated depression detection from 1,108 audio-recorded primary care encounters in the Establishing Focus study, with depression defined by PHQ-9 (n=253 depressed, n=855 non-depressed). We compared three supervised approaches, Sentence-BERT + Logistic Regression (LR), LIWC+LR, and ModernBERT, against a zero-shot GPT-OSS baseline. GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774), with LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742). Combined dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters, an additive signal not captured by either speaker alone. Meaningful detection is achievable from the first 128 patient tokens (AUPRC=0.356, AUROC=0.675), supporting in-the-moment clinical decision support. These findings argue for passively collected clinical audio as a low-burden complement to existing screening workflows.
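
To put the reported AUPRC values in context, it helps to compare them with the chance level for this dataset. A classifier that ranks encounters at random has an expected AUPRC roughly equal to the prevalence of the positive class, while chance-level AUROC is 0.5 regardless of prevalence. The short calculation below uses only the class counts and AUPRC figures quoted in the abstract; the variable names and the `lift` helper are introduced here for illustration and are not from the paper.

```python
# Class counts as reported in the abstract.
n_depressed = 253
n_non_depressed = 855
n_total = n_depressed + n_non_depressed

# Expected AUPRC of a random ranking is approximately the positive-class prevalence.
prevalence = n_depressed / n_total


def lift(auprc: float, baseline: float) -> float:
    """Ratio of a reported AUPRC to the chance-level baseline."""
    return auprc / baseline


# AUPRC values quoted in the abstract.
reported_auprc = {
    "GPT-OSS (zero-shot)": 0.510,
    "LIWC+LR": 0.500,
    "First 128 patient tokens": 0.356,
}

print(f"encounters: {n_total}, prevalence: {prevalence:.3f}")
for name, auprc in reported_auprc.items():
    print(f"{name}: AUPRC={auprc:.3f}, lift over chance={lift(auprc, prevalence):.2f}x")
```

With a prevalence of about 0.23, the best full-transcript AUPRC is a little more than twice the chance level, and the early-dialogue result still sits clearly above it.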