Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

arXiv cs.CL / 4/21/2026

Key Points

  • The study evaluates whether LLM uncertainty calibration for medical QA remains reliable when patient social identity descriptors (sexual orientation and religion) are included.
  • Across nine general-purpose and biomedical LLMs, tested on 2,364 medical questions and their counterfactual variants, the authors report a “calibration crisis”: identity markers degrade both accuracy and confidence calibration.
  • “Homosexual” markers are found to consistently trigger performance drops, and intersectional identities cause idiosyncratic, non-additive harms to calibration.
  • A clinician-validated open-ended generation case study supports the finding that these calibration failures are not caused by the multiple-choice question format.
  • The paper warns that relying on LLM confidence signals in clinical workflows poses a significant safety and equity risk, because social identity cues affect how reliable those uncertainty estimates are.
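
To make "confidence calibration" concrete, here is a minimal sketch of one common calibration metric, expected calibration error (ECE) with equal-width bins. The summary above does not say which metric the paper uses, so the choice of ECE, the function name, and the toy numbers are all assumptions for illustration, not the authors' method or results.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted mean of
    |accuracy - mean confidence| over confidence bins.

    confidences: floats in [0, 1], one per answered question.
    correct: booleans, True where the model's answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data (NOT from the paper): the model reports 0.9
# confidence on every question in both conditions.
toy_conf = [0.9] * 10
baseline_correct = [True] * 9 + [False]            # 90% accurate
with_marker_correct = [True] * 6 + [False] * 4     # 60% accurate

ece_baseline = expected_calibration_error(toy_conf, baseline_correct)
ece_with_marker = expected_calibration_error(toy_conf, with_marker_correct)
```

In this toy setup the stated confidence never changes, so the drop in accuracy shows up directly as a larger gap between confidence and accuracy. That is the kind of degradation the key points describe: the confidence signal itself becomes less trustworthy, not just the predictions.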

Abstract

Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.
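
The abstract's concern about confidence-based workflows can be illustrated with a small sketch of threshold-based deferral, where answers below a confidence threshold are routed to a clinician. The threshold value, the record layout, and the toy records below are hypothetical and are not taken from the paper; the sketch only shows why an overconfident model lets more wrong answers bypass review.

```python
def triage(predictions, threshold=0.8):
    """Split predictions into auto-accepted and clinician-deferred lists.

    predictions: list of dicts with "confidence" (float in [0, 1]) and
    "correct" (bool). The threshold is an illustrative assumption.
    """
    accepted = [p for p in predictions if p["confidence"] >= threshold]
    deferred = [p for p in predictions if p["confidence"] < threshold]
    return accepted, deferred


def accepted_error_rate(accepted):
    """Fraction of auto-accepted answers that are wrong (0.0 if none accepted)."""
    if not accepted:
        return 0.0
    return sum(1 for p in accepted if not p["correct"]) / len(accepted)


# Hypothetical toy records (NOT study data). In the "calibrated" set the
# wrong answer carries low confidence and is deferred; in the "miscalibrated"
# set the wrong answers carry high confidence and slip through.
calibrated = [
    {"confidence": 0.95, "correct": True},
    {"confidence": 0.90, "correct": True},
    {"confidence": 0.55, "correct": False},
    {"confidence": 0.85, "correct": True},
]
miscalibrated = [
    {"confidence": 0.95, "correct": True},
    {"confidence": 0.90, "correct": False},
    {"confidence": 0.88, "correct": False},
    {"confidence": 0.85, "correct": True},
]

acc_ok, def_ok = triage(calibrated)
acc_bad, def_bad = triage(miscalibrated)
rate_ok = accepted_error_rate(acc_ok)
rate_bad = accepted_error_rate(acc_bad)
```

With the calibrated records, the single wrong answer is deferred and none of the accepted answers is wrong. With the miscalibrated records nothing is deferred and half of the accepted answers are wrong, which is the failure mode a confidence-gated workflow cannot detect on its own.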