BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition

arXiv cs.CL / 4/6/2026

Key Points

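BioUNER is a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition, built from text collected from online Urdu medical articles, medical prescription information, and hospital and health blogs.
After preprocessing, three native annotators familiar with the medical domain annotated 153K tokens using the Doccano text annotation tool.
An inter-annotator agreement score of 0.78 was achieved, which the authors present as validation of the dataset's gold-standard quality.
The dataset was evaluated both intrinsically and extrinsically, and several machine learning and deep learning models, including SVM, LSTM, mBERT, and XLM-RoBERTa, were evaluated on it to demonstrate its usefulness as a benchmark.
BioUNER provides a reliable, comparable evaluation basis for Urdu medical NLP.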

Abstract

In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native annotators familiar with the medical domain annotated 153K tokens using the Doccano text annotation tool. Following annotation, the proposed BioUNER dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, thereby validating the dataset as gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT (mBERT), and XLM-RoBERTa. The gold-standard BioUNER dataset serves as a reliable benchmark and a valuable addition to Urdu language processing resources.
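To make the inter-annotator agreement figure concrete, here is a minimal sketch of how a token-level agreement score could be computed for three annotators. It assumes Fleiss' kappa as the metric; the abstract does not say which agreement coefficient produced the 0.78 score, so this is an illustration and not the authors' procedure. The function name `fleiss_kappa`, the label names (`DISEASE`, `DRUG`, `O`), and the toy ratings are all hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of annotators.

    `ratings` is a list of per-token label lists, e.g. [["O", "O", "DRUG"], ...].
    Assumption: Fleiss' kappa is used here for illustration only; the source
    does not state which agreement metric was applied.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Proportion of agreeing annotator pairs for this token.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label distribution.
    expected = sum((c / total) ** 2 for c in label_totals.values())
    if expected == 1.0:
        return 1.0  # only one label ever used; agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: four tokens, three annotators each.
toy_ratings = [
    ["DISEASE", "DISEASE", "DISEASE"],
    ["O", "O", "O"],
    ["DRUG", "DRUG", "O"],
    ["O", "O", "O"],
]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.707
```

On the toy data, one disagreement in four tokens yields a kappa of about 0.71, which gives a rough sense of how much disagreement a score near 0.78 can accommodate on a small label set.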