Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

arXiv cs.CL / 4/27/2026

Key Points

The paper addresses a gap in NLP resources by enabling entity classification for lesser-known or newly introduced entities using only entity names and gold labels as training data.
It proposes a framework that dynamically acquires descriptive text for each entity, using a novel text acquisition method that combines web sources with large language models (LLMs).
The acquired entity descriptions are then used to build a text-based classifier tailored to the target task and taxonomy.
Experiments on two real-world classification settings (organizations mapped to Standard Industrial Classification (SIC) codes and healthcare providers mapped to healthcare provider taxonomy codes) yield best macro-averaged F1 scores of 82.3% (SIC) and 72.9% (healthcare).
The work is designed to help domain experts create task-specific classifiers more easily without needing extensive task-specific text corpora up front.
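
The flow described in the key points (entity names and gold labels in, acquired descriptions in the middle, a text-based classifier out) can be illustrated with a minimal sketch. This is not the paper's implementation: `acquire_description` is a hypothetical stand-in that reads from a hard-coded table instead of querying the web or an LLM, the entity names and the labels `mining` and `finance` are invented toy data rather than real SIC codes, and a small Naive Bayes model is swapped in for whatever classifier the authors actually use.

```python
from collections import Counter, defaultdict
import math

# ASSUMPTION: stub table standing in for the paper's web + LLM text acquisition.
# All entity names and descriptions here are invented for illustration.
STUB_SOURCES = {
    "Acme Drilling Co": "oil and gas well drilling contractor",
    "Blue River Bank": "commercial bank offering loans and deposits",
    "Gulf Petroleum Services": "oil field drilling and gas extraction services",
    "First Harbor Savings": "savings bank offering deposits and mortgage loans",
    "Northern Rig Partners": "contract drilling of oil wells",
}


def acquire_description(name):
    """Return descriptive text for an entity; fall back to the bare name."""
    return STUB_SOURCES.get(name, name)


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_from_names(names, labels):
    """Training input is only names and gold labels; text is acquired on the fly."""
    texts = [acquire_description(name) for name in names]
    return NaiveBayesTextClassifier().fit(texts, labels)


train_names = ["Acme Drilling Co", "Blue River Bank",
               "Gulf Petroleum Services", "First Harbor Savings"]
train_labels = ["mining", "finance", "mining", "finance"]  # toy labels, not SIC codes

clf = train_from_names(train_names, train_labels)
prediction = clf.predict(acquire_description("Northern Rig Partners"))
print(prediction)  # mining
```

The point of the sketch is the interface: the domain expert supplies only `train_names` and `train_labels`, and everything the classifier learns from comes from the acquired text. Replacing the stub with a real acquisition step would not change the rest of the code.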

Abstract

Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and healthcare providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both the web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
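
Since both headline results are macro-averaged F1-scores, a short sketch of how that metric is computed may help. It uses the standard definition (per-class F1, then an unweighted mean over classes); the labels `A`, `B`, `C` and the predictions are invented, and averaging over every label that appears in either the gold or predicted set is an assumption about the evaluation convention, not something the abstract specifies.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data: class A is frequent, classes B and C are rare.
gold = ["A", "A", "A", "A", "B", "C"]
pred = ["A", "A", "A", "A", "B", "B"]

score = macro_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(round(score, 3), round(accuracy, 3))  # 0.556 0.833
```

In this toy case accuracy is about 0.83 while macro F1 is about 0.56, because the single missed `C` instance drags that class's F1 to zero and each class counts equally. That sensitivity to rare classes is why macro averaging is a common choice for taxonomies with many sparsely populated codes.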