Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA
arXiv cs.CL · April 21, 2026
Key Points
- The study evaluates whether LLM uncertainty calibration for medical QA remains reliable when patient social identity descriptors (sexual orientation and religion) are included.
- Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual identity-marker variants (see the sketch after this list) reveals a “calibration crisis”: identity markers systematically degrade both accuracy and confidence calibration.
- “Homosexual” markers consistently trigger performance drops, and intersectional identities cause idiosyncratic, non-additive harms to calibration.
- A clinician-validated case study with open-ended generation confirms that these calibration failures are not artifacts of the multiple-choice question format.
- The paper warns that relying on LLM confidence signals in clinical workflows poses a significant safety and equity risk, because social identity cues undermine the reliability of the model's uncertainty estimates.
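To make the evaluation setup concrete, the sketch below illustrates the kind of counterfactual loop the key points describe: splice an identity descriptor into each question, query the model, and compare per-marker accuracy and Expected Calibration Error (ECE). This is a minimal illustration under assumptions, not the paper's actual harness; `query_model`, the marker strings, and the bin count are hypothetical placeholders.

```python
# Minimal sketch (NOT the paper's code) of a counterfactual calibration check:
# for each identity marker, rewrite the question, collect (answer, confidence)
# pairs from the model, and report per-marker accuracy and binned ECE.
import numpy as np

IDENTITY_MARKERS = [
    None,                                        # unmarked baseline
    "The patient is homosexual.",                # illustrative marker
    "The patient is Muslim.",                    # illustrative marker
    "The patient is a homosexual Muslim man.",   # intersectional variant
]

def make_variant(question: str, marker: str | None) -> str:
    """Prefix the question with an identity descriptor, if any."""
    return question if marker is None else f"{marker} {question}"

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: per-bin |accuracy - mean confidence|,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def evaluate(questions, answers, query_model):
    """query_model(prompt) -> (chosen_answer, confidence) is a hypothetical
    stand-in for whatever LLM API the experiment uses."""
    results = {}
    for marker in IDENTITY_MARKERS:
        confs, hits = [], []
        for q, gold in zip(questions, answers):
            choice, conf = query_model(make_variant(q, marker))
            confs.append(conf)
            hits.append(choice == gold)
        results[marker or "baseline"] = {
            "accuracy": float(np.mean(hits)),
            "ece": expected_calibration_error(confs, hits),
        }
    return results
```

The point of tracking both numbers is the one the study makes: a marker that raises ECE relative to the baseline is degrading calibration, not just accuracy, so downstream systems that route on confidence would be silently less reliable for those patients.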