MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

arXiv cs.CL / 3/11/2026

Tools & Practical Usage | Models & Research

Key Points

  • The paper presents MultiGraSCCo, a multilingual anonymization benchmark with annotations of personal identifiers in ten languages, built with a neural machine translation methodology that preserves the original annotations and renders names of people and cities in a culturally and contextually appropriate form for each target language.
  • Because access to real patient data is a bottleneck, the benchmark relies on synthetic data and on machine translation of validated data from a high-resource language to produce annotated data for low-resource languages.
  • An evaluation study with medical professionals confirmed the quality of the translations, both in general and with respect to the translation and adaptation of personal information.
  • The benchmark includes over 2,500 personal information annotations and serves multiple applications such as training annotators, cross-institutional annotation validation, and improving automatic personal information detection systems.
  • The authors provide the benchmark and annotation guidelines publicly to support ongoing research into data anonymization and privacy-preserving techniques in healthcare data sharing.
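
The key points above mention translation that preserves the original annotations. The page does not describe how the authors implement this, so the following is only a minimal sketch of one common approach, assuming inline numbered markers are wrapped around each annotated span before translation and read back afterwards. All names here (`insert_markers`, `extract_spans`, `translate_with_annotations`, `toy_translate`) are hypothetical, the example sentence is invented, and `toy_translate` is a lookup-table stand-in for a real translation system.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Matches a numbered marker pair such as <e0>...</e0>.
MARKER = re.compile(r"<e(\d+)>(.*?)</e\1>", re.DOTALL)


def insert_markers(text: str, spans: List[Span]) -> str:
    """Wrap each annotated span in numbered tags, e.g. <e0>Anna</e0>.

    Assumes the spans are sorted by start offset and do not overlap.
    """
    out, cursor = [], 0
    for i, (start, end, _label) in enumerate(spans):
        out.append(text[cursor:start])
        out.append(f"<e{i}>{text[start:end]}</e{i}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def extract_spans(marked: str, labels: List[str]) -> Tuple[str, List[Span]]:
    """Strip the tags and recompute character offsets in the clean text."""
    clean, spans, cursor, offset = [], [], 0, 0
    for m in MARKER.finditer(marked):
        before = marked[cursor:m.start()]
        clean.append(before)
        offset += len(before)
        surface = m.group(2)
        spans.append((offset, offset + len(surface), labels[int(m.group(1))]))
        clean.append(surface)
        offset += len(surface)
        cursor = m.end()
    clean.append(marked[cursor:])
    return "".join(clean), spans


def translate_with_annotations(
    text: str, spans: List[Span], translate: Callable[[str], str]
) -> Tuple[str, List[Span]]:
    """Translate text while carrying span labels across via markers."""
    ordered = sorted(spans)
    labels = [label for _, _, label in ordered]
    marked = insert_markers(text, ordered)
    return extract_spans(translate(marked), labels)


def toy_translate(marked: str) -> str:
    """Hypothetical stand-in for an MT system: a fixed lookup table.

    It also swaps the name and city for target-language forms, purely to
    illustrate the idea of adapting personal information.
    """
    table = {
        "wohnt in": "lives in",
        "Herr": "Mr.",
        "Müller": "Miller",
        "Köln": "Cologne",
    }
    for source, target in table.items():
        marked = marked.replace(source, target)
    return marked


if __name__ == "__main__":
    source_text = "Herr Müller wohnt in Köln."
    source_spans = [(5, 11, "NAME"), (21, 25, "LOCATION")]
    target_text, target_spans = translate_with_annotations(
        source_text, source_spans, toy_translate
    )
    print(target_text)
    for start, end, label in target_spans:
        print(label, repr(target_text[start:end]))
```

A real translation system may drop, reorder, or duplicate markers, so a practical pipeline would need to validate that every marker survives translation before trusting the recovered offsets.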

Computer Science > Computation and Language

arXiv:2603.08879 (cs)
[Submitted on 9 Mar 2026]

Title: MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

Authors: Ibrahim Baroud and 6 other authors
Abstract: Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each target language. Our evaluation study with medical professionals confirms the quality of the translations, both in general and with respect to the translation and adaptation of personal information. Our benchmark with over 2,500 annotations of personal information can be used in many applications, including training annotators, validating annotations across institutions without legal complications, and helping improve the performance of automatic personal information detection. We make our benchmark and annotation guidelines available for further research.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.08879 [cs.CL]
  (or arXiv:2603.08879v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2603.08879
arXiv-issued DOI via DataCite

Submission history

From: Ibrahim Baroud
[v1] Mon, 9 Mar 2026 19:44:36 UTC (383 KB)