A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

arXiv cs.CV / 4/9/2026


Key Points

  • The paper presents a utility-preserving de-identification pipeline (UPDP) aimed at enabling cross-hospital radiology data sharing without losing clinically important signal for training medical AI models.
  • UPDP uses a blacklist of privacy-sensitive terms plus a whitelist of pathology-related terms, and generates privacy-filtered but pathology-preserving synthetic counterparts of the radiology images.
  • The approach also involves ID-filtered reports, allowing the resulting de-identified images and text to be securely shared across hospitals for downstream model development and evaluation.
  • Experiments on public chest X-ray benchmarks show that the pipeline effectively removes identity-related information while maintaining competitive diagnostic accuracy; identity-related prediction accuracy drops markedly, confirming privacy protection.
  • In cross-hospital experiments, combining de-identified shared data with local hospital data improves performance relative to using local data alone.
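
The cross-hospital finding in the last bullet can be illustrated with a minimal data-mixing sketch. The `shared_ratio` knob and the sampling scheme here are assumptions for illustration, not the paper's training recipe:

```python
import random

def build_training_pool(local, shared, shared_ratio=0.5, seed=0):
    """Mix local hospital samples with de-identified shared samples.

    shared_ratio is the target fraction of the pool drawn from the shared
    data (a hypothetical knob; the paper's exact mixing strategy may differ).
    """
    rng = random.Random(seed)
    if shared_ratio >= 1.0:
        n_shared = len(shared)
    else:
        # Solve n_shared / (len(local) + n_shared) = shared_ratio.
        n_shared = int(len(local) * shared_ratio / (1.0 - shared_ratio))
    pool = local + rng.sample(shared, min(n_shared, len(shared)))
    rng.shuffle(pool)
    return pool
```

With `shared_ratio=0.5`, a hospital with N local samples would train on roughly N additional de-identified shared samples, matching the "local plus shared" setting evaluated in the paper.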

Abstract

Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance than local data alone.
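
The report-side blacklist/whitelist filtering described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the term lists, the `[REDACTED]` token, and the rule that whitelisted pathology terms take precedence are all placeholder assumptions.

```python
import re

# Illustrative placeholders: the paper compiles its own blacklist of
# privacy-sensitive terms and whitelist of pathology-related terms.
BLACKLIST = ["john doe", "general hospital", "mrn 123456"]
WHITELIST = ["pleural effusion", "cardiomegaly", "pneumothorax"]

def _spans(terms, text):
    """Return all (start, end) character spans matching any term."""
    spans = []
    for term in terms:
        spans.extend(m.span()
                     for m in re.finditer(re.escape(term), text, re.IGNORECASE))
    return spans

def deidentify_report(text):
    """Redact blacklisted spans unless they overlap a whitelisted pathology
    term, so diagnostic content survives (hypothetical precedence rule)."""
    keep = _spans(WHITELIST, text)
    redact = [s for s in _spans(BLACKLIST, text)
              if not any(ks < s[1] and s[0] < ke for ks, ke in keep)]
    # Rewrite from the end of the string so earlier offsets stay valid.
    for start, end in sorted(redact, reverse=True):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text
```

For example, `deidentify_report("Patient John Doe at General Hospital shows pleural effusion.")` redacts the name and institution while leaving the pathology finding intact, reflecting the pipeline's goal of privacy removal with utility preservation.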