MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

arXiv cs.CL / 3/11/2026


Key Points

The paper presents MultiGraSCCo, a multilingual anonymization benchmark with annotations of personal identifiers across ten languages, developed using neural machine translation that preserves original annotations and culturally adapts names.
The benchmark addresses the difficulty of accessing sensitive patient data by combining synthetic data with machine translation to produce high-quality annotated data for anonymization research, including for low-resource languages.
Medical professionals evaluated the translations and confirmed their quality, both in general and with respect to the translation and adaptation of personal information.
The benchmark includes over 2,500 personal information annotations and serves multiple applications such as training annotators, cross-institutional annotation validation, and improving automatic personal information detection systems.
The authors provide the benchmark and annotation guidelines publicly to support ongoing research into data anonymization and privacy-preserving techniques in healthcare data sharing.


arXiv:2603.08879 (cs)

[Submitted on 9 Mar 2026]

Title: MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

Authors: Ibrahim Baroud, Christoph Otto, Vera Czehmann, Christine Hovhannisyan, Lisa Raithel, Sebastian Möller, Roland Roller


Abstract: Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each target language. Our evaluation study with medical professionals confirms the quality of the translations, both in general and with respect to the translation and adaptation of personal information. Our benchmark with over 2,500 annotations of personal information can be used in many applications, including training annotators, validating annotations across institutions without legal complications, and helping improve the performance of automatic personal information detection. We make our benchmark and annotation guidelines available for further research.
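
The abstract does not spell out how annotations survive translation, so the following is only a minimal sketch of one common approach, not the authors' method: wrap each annotated span in inline markers, pass the marked-up text through a translator, then strip the markers and recompute character offsets. Everything here is an assumption made for illustration: the Span type, the add_tags and strip_tags helpers, the NAME and LOCATION labels, the invented example sentence, and toy_translate, which is a hard-coded stand-in for a real machine translation system and its name adaptation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Character-offset annotation (hypothetical schema, not the paper's)."""
    start: int
    end: int
    label: str


def add_tags(text, spans):
    """Wrap each non-overlapping span in <LABEL>...</LABEL> markers."""
    out, cursor = [], 0
    for s in sorted(spans, key=lambda s: s.start):
        out.append(text[cursor:s.start])
        out.append(f"<{s.label}>{text[s.start:s.end]}</{s.label}>")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


# Matches <LABEL>body</LABEL> where both labels are identical.
TAG = re.compile(r"<(?P<label>[A-Z_]+)>(?P<body>.*?)</(?P=label)>", re.DOTALL)


def strip_tags(tagged):
    """Remove markers; return plain text and spans with recomputed offsets."""
    out, spans, cursor, length = [], [], 0, 0
    for m in TAG.finditer(tagged):
        prefix = tagged[cursor:m.start()]
        out.append(prefix)
        length += len(prefix)
        body = m.group("body")
        spans.append(Span(length, length + len(body), m.group("label")))
        out.append(body)
        length += len(body)
        cursor = m.end()
    out.append(tagged[cursor:])
    return "".join(out), spans


def toy_translate(tagged):
    """Stand-in for an MT system (assumption): handles only the example below."""
    return (tagged.replace("Herr", "Mr.")
                  .replace("wohnt in", "lives in")
                  .replace("Müller", "Miller")    # illustrative name adaptation
                  .replace("Köln", "Cologne"))    # illustrative city rendering


# Invented example sentence; offsets are computed rather than hard-coded.
text = "Herr Müller wohnt in Köln."
spans = [
    Span(text.index("Müller"), text.index("Müller") + len("Müller"), "NAME"),
    Span(text.index("Köln"), text.index("Köln") + len("Köln"), "LOCATION"),
]

plain, new_spans = strip_tags(toy_translate(add_tags(text, spans)))
for s in new_spans:
    print(s.label, repr(plain[s.start:s.end]))
```

The design choice in this sketch is that offsets are never translated directly. They are rebuilt from the markers that come back, so spans stay aligned even when the translated text is longer or shorter than the source. A real system would also need to handle markers that the translator drops, reorders, or corrupts.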

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.08879 [cs.CL] (or arXiv:2603.08879v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2603.08879 (arXiv-issued DOI via DataCite)

Submission history

From: Ibrahim Baroud
[v1] Mon, 9 Mar 2026 19:44:36 UTC (383 KB)
