Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection

arXiv cs.LG / 4/13/2026

Key Points

  • The paper proposes a compliant system for building cybercrime-analysis datasets by collecting multi-modal content from Telegram and anonymizing personal data to meet GDPR and Spanish penal-code requirements.
  • It evaluates speech-to-text pipelines with signal-enhancement techniques for extracting text from audio, finding that Parakeet delivers the best audio transcription performance.
  • It compares Named Entity Recognition (NER) approaches for detecting sensitive information, including Microsoft Presidio and transformer-based AI models, with the proposed NER solutions achieving the highest F1 scores.
  • The study introduces anonymization metrics to measure how well structural coherence is preserved while still protecting personal information in support of lawful cybersecurity research.
  • Overall, the work targets practical dataset creation for social-engineering detection by combining transcription, NER-based redaction, and measurable anonymization quality controls.
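
The NER-based redaction step named in the points above can be illustrated with a minimal sketch. It is not the paper's pipeline: it assumes entity spans have already been produced by some detector (for example Presidio or a transformer model), and the `Span` type, the `redact` function, the placeholder format, and the sample message are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected entity: character offsets [start, end) plus a label."""
    start: int
    end: int
    label: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each detected span with a typed placeholder such as <PERSON>.

    Overlapping spans are resolved by keeping the earliest-starting,
    longest one; that policy is an assumption of this sketch.
    """
    # Sort by start offset; on ties, put the longer span first.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    pieces: list[str] = []
    cursor = 0
    for span in ordered:
        if span.start < cursor:
            # Overlaps a span that was already replaced; skip it.
            continue
        pieces.append(text[cursor:span.start])
        pieces.append(f"<{span.label}>")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical message and hand-written detector output.
message = "Call Maria Lopez at 612345678 tomorrow."
detected = [Span(5, 16, "PERSON"), Span(20, 29, "PHONE_NUMBER")]
redacted = redact(message, detected)
print(redacted)
```

Typed placeholders keep the sentence structure readable after redaction, which is the kind of property an anonymization-quality metric would need to measure; the summary does not say which replacement strategy the authors actually use.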

Abstract

This study addresses the challenge of creating datasets for cybercrime analysis while complying with regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, it proposes a system that collects text, audio, and images from the Telegram platform, implements speech-to-text transcription models that incorporate signal enhancement techniques, and evaluates different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models built on a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest F1-score values in detecting sensitive information. In addition, anonymization metrics are presented that evaluate how well the structural coherence of the data is preserved while guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.
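
To make the F1-score comparison concrete, the following is a minimal sketch of how an entity-level F1 score could be computed for a sensitive-information detector. The abstract does not state the matching criterion used in the paper, so this example assumes exact-match scoring on `(start, end, label)` triples; the function name `entity_f1` and the sample spans are hypothetical.

```python
Entity = tuple[int, int, str]  # (start offset, end offset, label)


def entity_f1(gold: set[Entity], predicted: set[Entity]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over entity triples.

    A prediction counts as correct only if its offsets and label all
    match a gold entity (an assumption of this sketch).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations: two entities are found, one is missed,
# and one prediction is spurious.
gold = {(5, 16, "PERSON"), (20, 29, "PHONE_NUMBER"), (40, 52, "LOCATION")}
predicted = {(5, 16, "PERSON"), (20, 29, "PHONE_NUMBER"), (60, 70, "PERSON")}
scores = entity_f1(gold, predicted)
print(scores)
```

With two of three gold entities recovered and two of three predictions correct, precision, recall, and F1 all equal 2/3 here. For anonymization, recall is usually the more critical of the two, since a missed entity leaves personal data exposed, whereas a spurious one only removes extra text.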