DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

arXiv cs.CL / 4/20/2026

📰 News · Models & Research

Key Points

  • The paper proposes DiZiNER, a disagreement-guided instruction refinement framework that simulates pilot annotation to improve zero-shot named entity recognition (NER) with LLMs.
  • DiZiNER uses multiple heterogeneous LLM “annotators” to label the same texts, then a supervisor model reviews disagreements to iteratively refine the task instructions.
  • Evaluated on 18 NER benchmarks, DiZiNER sets a new zero-shot state of the art on 14 datasets, improving on prior best results by +8.0 F1.
  • The approach narrows the gap between zero-shot and supervised systems by more than +11 points and even outperforms its supervisor model (GPT-5 mini), suggesting the gains come from better instruction refinement rather than larger model capacity.
  • Pairwise agreement among models is strongly correlated with NER performance, supporting the core premise that disagreement signals can drive effective instruction improvement.
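The pilot-annotation loop described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: real LLM annotator and supervisor calls are replaced by stub data, and the function names (`find_disagreements`, `pairwise_agreement`, `refine_instruction`) are hypothetical.

```python
from itertools import combinations

def find_disagreements(annotations):
    """Spans whose entity labels differ across annotators."""
    spans = set().union(*(set(a) for a in annotations))
    return {s: [a.get(s) for a in annotations]
            for s in spans
            if len({a.get(s) for a in annotations}) > 1}

def pairwise_agreement(annotations):
    """Mean fraction of spans on which each annotator pair agrees."""
    spans = set().union(*(set(a) for a in annotations))
    scores = [sum(a.get(s) == b.get(s) for s in spans) / len(spans)
              for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)

def refine_instruction(instruction, disagreements):
    """Stub 'supervisor': fold each disputed span into the prompt."""
    notes = [f"Clarify how to label '{s}' "
             f"(seen as {sorted({l for l in labels if l})})."
             for s, labels in disagreements.items()]
    return instruction + " " + " ".join(notes)

# Toy pilot round: three heterogeneous "annotators", two candidate spans.
anns = [
    {"Paris": "LOC", "Apple": "ORG"},
    {"Paris": "LOC", "Apple": "ORG"},
    {"Paris": "LOC", "Apple": "MISC"},
]
diffs = find_disagreements(anns)   # {'Apple': ['ORG', 'ORG', 'MISC']}
score = pairwise_agreement(anns)   # 2/3
prompt = refine_instruction("Tag LOC/ORG/MISC entities.", diffs)
```

In the actual framework the refined instruction would be fed back to the annotators for another round, and the agreement score would be the quantity shown to correlate with final NER F1.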

Abstract

Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.