Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

arXiv cs.CL / 4/1/2026


Key Points

  • The paper targets accurate privacy sensitivity assessment for text, noting that while LLMs can match human privacy judgments, they are too costly to run on sensitive data at scale.
  • It proposes distilling the privacy-evaluation capability of Mistral Large 3 (675B) into much smaller encoder classifiers (down to ~150M parameters) to make privacy scoring more practical.
  • Using a large dataset of privacy-annotated texts spanning 10 diverse domains, the authors train lightweight models that maintain strong agreement with human annotations.
  • The approach is validated against human-labeled test data and is presented as a usable evaluation metric for de-identification systems, improving feasibility for real-world privacy workflows.
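The distillation step described above — training a small student classifier to reproduce the large model's privacy judgments — is typically done with a soft-label objective. The paper's exact loss is not stated in this summary, so the following is a generic knowledge-distillation sketch (temperature-scaled soft cross-entropy, in plain Python for illustration):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_probs, student_logits, temperature=2.0):
    """Soft-label cross-entropy between the teacher's privacy-class
    distribution and the student's (temperature-scaled) prediction.
    A standard KD objective, not necessarily the paper's exact loss."""
    student_probs = softmax([z / temperature for z in student_logits])
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs))

# Example: teacher rates a text 90% "sensitive", 10% "non-sensitive".
teacher = [0.9, 0.1]
loss_agree = distillation_loss(teacher, [4.0, -4.0])   # student agrees
loss_disagree = distillation_loss(teacher, [-4.0, 4.0])  # student disagrees
```

A student that matches the teacher's distribution incurs a much lower loss, which is what drives the small encoder toward the 675B model's judgments.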

Abstract

Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.
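One concrete use of the distilled classifier, per the abstract, is as an evaluation metric for de-identification systems. A minimal sketch, assuming the classifier is wrapped as a `scorer(text) -> sensitivity` function (the `toy_scorer` below is a hypothetical keyword stand-in, not the paper's model):

```python
def residual_sensitivity(scorer, deidentified_texts):
    """Metric sketch: mean predicted privacy sensitivity over a
    de-identification system's outputs (lower = better scrubbing).
    `scorer` stands in for the distilled encoder classifier."""
    scores = [scorer(t) for t in deidentified_texts]
    return sum(scores) / len(scores)

def toy_scorer(text):
    # Hypothetical stand-in: flags texts still containing identifier-like cues.
    cues = ("ssn", "@", "phone")
    return 1.0 if any(c in text.lower() for c in cues) else 0.0

raw = ["Call me at my phone 555-0100", "Email: a@b.com"]
scrubbed = ["Call me at [REDACTED]", "Email: [REDACTED]"]
```

Comparing `residual_sensitivity` on raw versus scrubbed text quantifies how much sensitivity a de-identification system removes, without sending the data to a large hosted LLM.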