AI Navigate

ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries

arXiv cs.CL / 3/20/2026

Key Points

  • ELM is a hybrid ensemble that combines six encoder-only language models (three for the top portion and three for the bottom portion of each report) with a large language model. A tumor group is assigned directly when at least five of the six encoders agree; otherwise, the LLM arbitrates.
  • On a held-out test set of 2,058 pathology reports across 19 tumor groups, ELM achieves weighted precision and recall of 0.94, a statistically significant improvement (p<0.001) over encoder-only ensembles (0.91 F1-score), and substantially outperforms rule-based approaches.
  • In production at the British Columbia Cancer Registry, ELM reduced manual review by about 60–70%, saving an estimated 900 person-hours annually while maintaining data quality.
  • The study claims this is the first successful deployment of a hybrid architecture combining small encoder-only models with an LLM for tumor group classification in a real-world population-based cancer registry setting.
  • ELM delivers notable F1-score gains in challenging categories: leukemia (0.76 to 0.88), lymphoma (0.76 to 0.89), and skin cancer (0.44 to 0.58).
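
The agreement rule described in the key points can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_tumor_group`, the `llm_arbiter` callable, and the choice to pass the LLM every group that received at least one encoder vote are all assumptions, since the summary only says the prompt is "constrained to likely tumor groups".

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Sequence


def assign_tumor_group(
    encoder_votes: Sequence[str],
    llm_arbiter: Callable[[list[str]], str],
    min_agreement: int = 5,
) -> tuple[str, str]:
    """Return (tumor_group, source), where source is 'ensemble' or 'llm'.

    encoder_votes holds one predicted label per encoder (six in total).
    llm_arbiter is a hypothetical stand-in for the LLM call; it receives
    the candidate groups and returns one of them.
    """
    if len(encoder_votes) != 6:
        raise ValueError("expected one vote from each of the six encoders")

    counts = Counter(encoder_votes)
    top_label, top_count = counts.most_common(1)[0]

    # At least five of six encoders agree: accept the ensemble's label.
    if top_count >= min_agreement:
        return top_label, "ensemble"

    # Otherwise defer to the LLM. Assumption: "likely" groups are those
    # that received at least one encoder vote.
    candidates = sorted(counts)
    return llm_arbiter(candidates), "llm"
```

With votes such as five "lymphoma" and one "leukemia", the sketch returns the ensemble label without calling the arbiter; with a four-to-two split it hands both candidates to `llm_arbiter`.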

Abstract

Background: Population-based cancer registries (PBCRs) manually extract data from unstructured pathology reports, a labor-intensive process in which assigning reports to tumor groups can consume 900 person-hours annually for approximately 100,000 reports at a medium-sized registry. Current automated rule-based systems fail to handle the linguistic complexity of this classification task.

Materials and Methods: We present ELM (Ensemble of Language Models), a novel hybrid approach combining small encoder-only language models with large language models (LLMs). ELM employs an ensemble of six fine-tuned encoder-only models, three analyzing the top portion and three analyzing the bottom portion of each report, to maximize text coverage given token limits. A tumor group is assigned when at least five of the six models agree; otherwise, an LLM arbitrates using a carefully curated prompt constrained to likely tumor groups.

Results: On a held-out test set of 2,058 pathology reports spanning 19 tumor groups, ELM achieves weighted precision and recall of 0.94, a statistically significant improvement (p<0.001) over encoder-only ensembles (0.91 F1-score), and substantially outperforms rule-based approaches. ELM demonstrates particular gains for challenging categories, including leukemia (F1: 0.76 to 0.88), lymphoma (0.76 to 0.89), and skin cancer (0.44 to 0.58).

Discussion: Deployed in production at the British Columbia Cancer Registry, ELM has reduced manual review requirements by approximately 60-70%, saving an estimated 900 person-hours annually while maintaining data quality standards.

Conclusion: ELM represents the first successful deployment of a hybrid architecture combining small encoder-only models with an LLM for tumor group classification in a real-world PBCR setting, demonstrating how a strategic combination of language models can achieve both high accuracy and operational efficiency.
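
To illustrate why the abstract pairs top-portion and bottom-portion encoders, the sketch below shows one way a long report could be cut into two windows that each fit a model's token limit. It is an assumption-laden illustration: the helper name `split_top_bottom`, the default limit of 512 tokens, and the whitespace tokenization in the usage lines are hypothetical stand-ins, as the abstract does not state the tokenizer or the limit used.

```python
from __future__ import annotations

from typing import Sequence, TypeVar

T = TypeVar("T")


def split_top_bottom(
    tokens: Sequence[T], max_tokens: int = 512
) -> tuple[list[T], list[T]]:
    """Return the first and last max_tokens tokens of a report.

    max_tokens=512 is an assumed limit, not a value from the paper.
    For reports shorter than max_tokens the two windows are identical.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    top = list(tokens[:max_tokens])      # input for the top-portion encoders
    bottom = list(tokens[-max_tokens:])  # input for the bottom-portion encoders
    return top, bottom


# Whitespace splitting stands in for a real subword tokenizer.
report = "Specimen received in formalin ... final diagnosis follows"
top_window, bottom_window = split_top_bottom(report.split(), max_tokens=4)
```

Under this scheme a report longer than twice the limit still loses its middle section, which is consistent with the abstract's phrasing of maximizing, rather than guaranteeing, text coverage.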