A Hybrid Method for Low-Resource Named Entity Recognition

arXiv cs.AI / 5/7/2026


Key Points

  • The paper proposes a hybrid neurosymbolic framework for low-resource Vietnamese named entity recognition that combines rule-based label reduction with fine-tuned pre-trained language models.
  • It uses a two-stage pipeline where rules first group relational and special categories to reduce label complexity, and a post-processing step restores the fine-grained labels for practical application use.
  • To address limited annotated data and label-set heterogeneity, the study introduces a scalable data augmentation strategy that leverages LLMs to expand the label set without requiring full re-annotation.
  • Evaluated on five domain-specific datasets (e.g., logistics, wildlife, healthcare), the method substantially outperforms strong RoBERTa-based baselines, with large F1 gains across multiple benchmarks.
  • Reported improvements include 90% vs 83% (Customer Service), 84% vs 73% (GAM), and 94% vs 91% (PhoNER_Covid19), demonstrating effectiveness for specialized Vietnamese NER settings.

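The reduce-then-restore pipeline in the key points can be sketched as two small mapping steps. This is an illustrative assumption of how such a scheme might look, not the paper's actual label set: `COARSE_MAP`, `FINE_LEXICONS`, and all label names here are hypothetical.

```python
# Stage 1: rule-based grouping collapses related fine-grained labels into
# coarse categories before fine-tuning (hypothetical label names).
COARSE_MAP = {
    "DRUG_NAME": "MED_TERM",
    "DISEASE": "MED_TERM",
    "SYMPTOM": "MED_TERM",
    "SHIP_ID": "LOGISTICS_CODE",
    "TRACKING_NO": "LOGISTICS_CODE",
}

# Stage 2 (post-processing): small lexicons map coarse model predictions
# back to fine-grained labels for application-level use.
FINE_LEXICONS = {
    "MED_TERM": {"paracetamol": "DRUG_NAME", "covid-19": "DISEASE"},
}

def reduce_labels(tags):
    """Collapse fine-grained tags into coarse groups for model training."""
    return [COARSE_MAP.get(t, t) for t in tags]

def restore_labels(tokens, coarse_tags):
    """Restore fine-grained labels from coarse output via rule lexicons."""
    restored = []
    for tok, tag in zip(tokens, coarse_tags):
        lexicon = FINE_LEXICONS.get(tag, {})
        restored.append(lexicon.get(tok.lower(), tag))
    return restored
```

For example, `reduce_labels(["DRUG_NAME", "O", "SHIP_ID"])` yields `["MED_TERM", "O", "LOGISTICS_CODE"]`, and `restore_labels(["Paracetamol", "relieves", "pain"], ["MED_TERM", "O", "O"])` recovers `["DRUG_NAME", "O", "O"]`.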
Abstract

Named Entity Recognition (NER) is a critical component of Natural Language Processing with diverse applications in information extraction and conversational AI. However, NER in specific domains for low-resource languages faces challenges such as limited annotated data and heterogeneous label sets. This study addresses these issues by proposing a hybrid neurosymbolic framework that integrates rule-based processing with deep learning models for Vietnamese NER. The core idea is a two-stage pipeline: first, a rule-based component reduces label complexity by grouping relational and special categories; second, pre-trained language models are fine-tuned for high-precision extraction. A post-processing module then restores the fine-grained labels, preserving expressiveness for application-level use. To mitigate data scarcity, a scalable data augmentation strategy leveraging Large Language Models (LLMs) is introduced to expand the label set without full re-annotation, which is a significant novelty of this work. The method was evaluated across five domain-specific datasets, including logistics, wildlife, and healthcare. Experimental results demonstrate substantial improvements over strong RoBERTa-based baselines. Specifically, the proposed system achieved F1 scores of 90 percent on Customer Service, up from 83 percent; 84 percent on GAM, up from 73 percent; 83 percent on AI Fluent, up from 80 percent; 94 percent on PhoNER_Covid19, up from 91 percent; and 60 percent on Rare Wildlife, up from 36 percent. These findings confirm that the hybrid approach effectively captures the linguistic complexity of Vietnamese and contextual nuances in specialized domains, offering a robust contribution to low-resource NER research.
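The abstract's augmentation idea, expanding the label set without re-annotating existing data, can be sketched as overlaying LLM-proposed spans for new entity types onto the existing BIO tags, touching only positions currently tagged `O`. Everything below is a hypothetical illustration: `llm_propose_spans` is a toy stand-in for a real LLM call, and the entity types are invented.

```python
def llm_propose_spans(tokens, new_types):
    """Stand-in for an LLM prompt returning spans for NEW entity types only.
    A real system would query an LLM; a toy keyword rule stands in here."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "pangolin" and "WILDLIFE" in new_types:
            spans.append((i, i, "WILDLIFE"))  # (start, end, type), inclusive
    return spans

def merge_new_labels(tokens, bio_tags, new_types):
    """Overlay proposed spans onto existing BIO tags, filling only 'O' slots,
    so the original annotations never need to be redone."""
    merged = list(bio_tags)
    for start, end, etype in llm_propose_spans(tokens, new_types):
        # Keep existing annotations authoritative: skip any overlapping span.
        if all(merged[i] == "O" for i in range(start, end + 1)):
            merged[start] = f"B-{etype}"
            for i in range(start + 1, end + 1):
                merged[i] = f"I-{etype}"
    return merged
```

For instance, `merge_new_labels(["A", "pangolin", "was", "seen"], ["O", "O", "O", "O"], {"WILDLIFE"})` adds `B-WILDLIFE` at the second token, while a token already labelled by the original annotators is left untouched. This "fill only O" design choice is one plausible way to expand a label set without conflicting with, or re-annotating, the existing gold labels.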