The GELATO Dataset for Legislative NER

arXiv cs.CL / 3/17/2026

Key Points

  • GELATO is introduced as a dataset of U.S. House and Senate bills from the 118th Congress, using a novel two-level NER ontology designed for legislative texts.
  • The paper fine-tunes transformer models (BERT, RoBERTa) for first-level entity prediction and uses LLMs with optimized prompts for second-level predictions.
  • Results show RoBERTa outperforming BERT on first-level predictions and LLMs improving second-level extraction, suggesting an effective model combination for legislative NER.
  • The dataset and approach are positioned to enable future research and downstream NLP tasks in government and policy domains.

Abstract

This paper introduces GELATO (Government, Executive, Legislative, and Treaty Ontology), a dataset of U.S. House and Senate bills from the 118th Congress annotated using a novel two-level named entity recognition ontology designed for U.S. legislative texts. We fine-tune transformer-based models (BERT, RoBERTa) of different architectures and sizes on this dataset for first-level prediction. We then use LLMs with optimized prompts to complete the second-level prediction. The strong performance of RoBERTa and the relatively weak performance of BERT models, together with the use of LLMs as second-level predictors, support future research in legislative NER and downstream tasks using these model combinations as extraction tools.
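The two-stage design described above can be sketched in code: a fine-tuned token classifier emits coarse BIO tags, which are collapsed into entity spans, and each span is then handed to an LLM via a prompt that asks for a fine-grained subtype. The label names, example sentence, and prompt wording below are illustrative assumptions, not GELATO's actual ontology or the paper's prompts.

```python
# Sketch of a two-level NER pipeline: coarse BIO decoding followed by
# an LLM prompt for second-level typing. Labels and prompt text are
# hypothetical, for illustration only.

def bio_to_spans(tokens, tags):
    """Collapse per-token BIO tags into (entity_text, coarse_type) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

def second_level_prompt(sentence, span, coarse_type, subtypes):
    """Build an LLM prompt asking to refine a coarse entity into a subtype."""
    return (
        f"Sentence: {sentence}\n"
        f'Entity: "{span}" (coarse type: {coarse_type})\n'
        f"Choose the best subtype from: {', '.join(subtypes)}.\n"
        "Answer with the subtype only."
    )

tokens = "the Committee on Appropriations shall report".split()
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]
print(bio_to_spans(tokens, tags))
# → [('Committee on Appropriations', 'ORG')]
```

In this sketch, the first-level tags would come from a fine-tuned RoBERTa token classifier; the prompt string would be sent to whichever LLM serves as the second-level predictor.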