CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

arXiv cs.CL / 4/1/2026



Key Points

The paper introduces CADEL, an annotated corpus of Japanese administrative web documents for entity linking, the task of mapping expressions in text to the knowledge base entries they refer to.
It addresses a key gap in the field, noting that most entity-linking resources and evaluation materials have historically focused on English, leaving Japanese benchmarking limited.
The authors define a corpus design policy for entity linking and build the corpus with rich coverage of expressions that refer to Japan-specific entities.
Annotation quality is validated through high inter-annotator agreement, indicating reliable labeling for training and evaluation.
A preliminary entity disambiguation experiment based on string matching suggests that the corpus contains a substantial number of non-trivial cases, which supports its potential usefulness as an evaluation benchmark for stronger entity linking systems.
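
To make the agreement point concrete, here is a minimal sketch of one common agreement statistic, Cohen's kappa, computed over two annotators' entity labels for the same mentions. The summary does not say which agreement measure the authors used, so kappa is an assumption made here for illustration, and the function name and the labels (`E1`, `E2`, `NIL`) are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same mentions.

    Assumption: this metric is chosen only for illustration; the paper's
    actual agreement measure is not stated in this summary.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of mentions given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: entity IDs chosen by each annotator for four mentions.
annotator_a = ["E1", "E1", "E2", "NIL"]
annotator_b = ["E1", "E1", "E2", "E2"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is lower than the simple match rate whenever a few labels dominate.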

Abstract

Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.
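
The string-matching experiment mentioned in the abstract can be pictured with a minimal sketch like the one below. It is not the authors' implementation: the toy knowledge base, its `KB:` identifiers, and the function names are all hypothetical, and exact matching after Unicode NFKC normalization is an assumed simplification. A mention is linked only when its surface form matches exactly one entry; a mention that matches several entries, or none, is the kind of non-trivial case that string matching alone cannot resolve.

```python
import unicodedata
from collections import defaultdict

# Hypothetical toy knowledge base; the IDs and entries are illustrative only.
TOY_KB = {
    "KB:001": ["中央区 (東京都)", "中央区"],
    "KB:002": ["中央区 (大阪市)", "中央区"],
    "KB:003": ["厚生労働省", "厚労省"],
}


def normalize(text):
    # NFKC folds full-width and half-width variants, common in Japanese web text.
    return unicodedata.normalize("NFKC", text).strip().lower()


def build_index(kb):
    """Map each normalized name or alias to the set of entity IDs using it."""
    index = defaultdict(set)
    for entity_id, names in kb.items():
        for name in names:
            index[normalize(name)].add(entity_id)
    return index


def link_by_string_match(mention, index):
    """Return (linked_id_or_None, candidates) using exact string match only."""
    candidates = sorted(index.get(normalize(mention), set()))
    if len(candidates) == 1:
        return candidates[0], candidates
    # Zero or multiple candidates: string matching cannot decide.
    return None, candidates


index = build_index(TOY_KB)
unambiguous = link_by_string_match("厚労省", index)
ambiguous = link_by_string_match("中央区", index)
unknown = link_by_string_match("未知の組織", index)
```

In this toy setup 「厚労省」 resolves to a single entry, while 「中央区」 matches two wards with the same name and stays unresolved without context.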


