Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
arXiv cs.LG / 4/13/2026
Key Points
- The paper proposes a compliant system for building cybercrime-analysis datasets by collecting multi-modal content from Telegram and anonymizing personal data to meet GDPR and Spanish penal-code requirements.
- It evaluates speech-to-text pipelines with signal-enhancement techniques for extracting text from audio, finding that Parakeet delivers the best audio transcription performance.
- It compares Named Entity Recognition (NER) approaches for detecting sensitive information, including Microsoft Presidio and transformer-based AI models, with the proposed NER solutions achieving the highest F1 scores.
- The study introduces anonymization metrics to measure how well structural coherence is preserved while still protecting personal information in support of lawful cybersecurity research.
- Overall, the work targets practical dataset creation for social-engineering detection by combining transcription, NER-based redaction, and measurable anonymization quality controls.
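
The redaction and scoring steps summarized above can be illustrated with a minimal sketch. This is not the paper's pipeline and it does not use Microsoft Presidio or a transformer model: two hand-written regular expressions stand in for the NER detector, and the names `PATTERNS`, `detect`, `redact`, and `span_f1` are hypothetical, chosen only for this example. It shows the general idea of replacing each detected entity with a typed placeholder, so that the sentence structure survives, and of scoring a detector with an exact-match, span-level F1.

```python
import re
from typing import Dict, List, Pattern, Tuple

# A detected entity: (start offset, end offset, label).
Span = Tuple[int, int, str]

# Hypothetical stand-in patterns; a real system would use a trained NER model.
PATTERNS: Dict[str, Pattern[str]] = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def detect(text: str) -> List[Span]:
    """Return all pattern matches as sorted (start, end, label) spans."""
    spans: List[Span] = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a typed placeholder such as <EMAIL>.

    Spans are applied right to left so earlier offsets stay valid.
    Assumes the spans do not overlap.
    """
    redacted = text
    for start, end, label in sorted(spans, reverse=True):
        redacted = redacted[:start] + f"<{label}>" + redacted[end:]
    return redacted


def span_f1(predicted: List[Span], gold: List[Span]) -> float:
    """Exact-match, span-level F1 between predicted and gold entities."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    message = "Contact ana@example.com or call +34 600 123 456 today."
    found = detect(message)
    print(redact(message, found))
    # Score a detector that found only the first of the two gold entities.
    print(span_f1(found[:1], found))
```

Typed placeholders, rather than blank deletions, keep the redacted text readable for downstream classifiers, which is one plausible reading of "structural coherence"; the paper's actual anonymization metrics are not specified in this summary.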