MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

arXiv cs.CL / 3/17/2026

📰 NewsModels & Research

共有:

Key Points

MedPriv-Bench is introduced as the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering for LLMs.

Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.

PearlOS. We gave swarm intelligence a local desktop environment and code control to self-evolve. Has been pretty incredible to see so far. Open source and free if you want your own.

Reddit r/LocalLLaMA

QwenDean-4B | fine-tuned SLM for UIGen; our first attempt, looking for feedback!

Reddit r/LocalLLaMA

acestep.cpp: portable C++17 implementation of ACE-Step 1.5 music generation using GGML. Runs on CPU, CUDA, ROCm, Metal, Vulkan

Reddit r/LocalLLaMA

Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Hugging Face Blog

Newest GPU server in the lab! 72gb ampere vram!

Reddit r/LocalLLaMA

MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Key Points

Abstract

Related Articles

PearlOS. We gave swarm intelligence a local desktop environment and code control to self-evolve. Has been pretty incredible to see so far. Open source and free if you want your own.

QwenDean-4B | fine-tuned SLM for UIGen; our first attempt, looking for feedback!

acestep.cpp: portable C++17 implementation of ACE-Step 1.5 music generation using GGML. Runs on CPU, CUDA, ROCm, Metal, Vulkan

Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Newest GPU server in the lab! 72gb ampere vram!

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Key Points

Abstract

Related Articles

PearlOS. We gave swarm intelligence a local desktop environment and code control to self-evolve. Has been pretty incredible to see so far. Open source and free if you want your own.

QwenDean-4B | fine-tuned SLM for UIGen; our first attempt, looking for feedback!

acestep.cpp: portable C++17 implementation of ACE-Step 1.5 music generation using GGML. Runs on CPU, CUDA, ROCm, Metal, Vulkan

**Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**

Newest GPU server in the lab! 72gb ampere vram!

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding