A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

arXiv cs.CL / 4/22/2026


Key Points

  • Existing NER models trained on clean, high-resource data can experience catastrophic performance drops on noisy, sparse UGC (e.g., social media), and prior fixes often don’t generalize well.
  • The paper attributes many surface-level UGC failure symptoms (e.g., neologisms, alias drift, non-standard orthography, rare entities, class imbalance) to a shared underlying cause: low Information Density (ID), shown as an independent factor via controlled resampling experiments.
  • It introduces Attention Spectrum Analysis (ASA) to quantify how reduced ID causes “attention blunting,” which in turn degrades NER performance.
  • Based on the mechanism, the authors propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic method that enhances semantic density in information-sparse regions via selective back-translation without changing the base model architecture.
  • Experiments on standard UGC NER datasets (WNUT2017, Twitter-NER, WNUT2016) show up to +4.5% absolute F1 gains and new state-of-the-art (SOTA) results on WNUT2017.
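The summary above does not spell out how WOM detects sparse regions or applies back-translation, so the following is only an illustrative sketch: the `window_density` heuristic, the window size, the threshold, the stopword list, and the `back_translate` callback are all assumptions, not the paper's actual method.

```python
# Hypothetical WOM-style pipeline sketch: score fixed-size token windows with
# a crude information-density proxy, then back-translate only the sparse ones.
# Every name, heuristic, and default value here is an illustrative assumption.
from typing import Callable, FrozenSet, List

def window_density(tokens: List[str], stopwords: FrozenSet[str]) -> float:
    """Proxy for information density: fraction of non-stopword tokens."""
    if not tokens:
        return 0.0
    return sum(t.lower() not in stopwords for t in tokens) / len(tokens)

def selective_back_translate(
    tokens: List[str],
    back_translate: Callable[[str], str],  # e.g. en -> de -> en round-trip
    window: int = 5,
    threshold: float = 0.5,
    stopwords: FrozenSet[str] = frozenset({"the", "a", "an", "of", "to", "is", "and"}),
) -> str:
    """Rewrite only information-sparse windows; leave dense spans untouched."""
    out: List[str] = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i : i + window]
        if window_density(chunk, stopwords) < threshold:
            # Sparse region: route through back-translation to densify it.
            out.append(back_translate(" ".join(chunk)))
        else:
            out.extend(chunk)
    return " ".join(out)
```

In this sketch, dense windows pass through verbatim, so entity-bearing spans are untouched while filler-heavy spans are rewritten; a real implementation would also need to re-align NER labels after back-translation, which is omitted here.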

Abstract

Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation: employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to “attention blunting,” ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
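The abstract does not define how ASA quantifies "attention blunting," but a common reading is that attention weights flatten toward a uniform distribution over context tokens. The sketch below is a hypothetical proxy along those lines, using normalized Shannon entropy; the function name and the entropy-based metric are assumptions, not the paper's actual analysis.

```python
# Hypothetical proxy for "attention blunting": normalized Shannon entropy of
# an attention distribution. 1.0 means uniform (fully blunted attention),
# 0.0 means one-hot (sharply focused attention). Illustrative only.
import math
from typing import List

def attention_entropy(weights: List[float]) -> float:
    """Return entropy of `weights` normalized to [0, 1] by log(n)."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) <= 1 or len(weights) <= 1:
        return 0.0  # a single (or one-hot) weight is maximally sharp
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(weights))
```

Under this reading, low-ID inputs would show entropy values drifting toward 1.0, which could then be correlated with per-sentence NER errors.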