Budget-Aware Routing for Long Clinical Text

arXiv cs.AI / 5/4/2026


Key Points

  • The paper addresses the high token cost and latency of large language model deployments by selecting a budgeted subset of long clinical inputs under a strict token limit.
  • It formulates budgeted context selection as a knapsack-constrained subset selection problem and studies how document unitization and unit selection affect performance.
  • The authors propose RCD, a monotone submodular objective that jointly optimizes relevance, coverage, and diversity for the selected context.
  • They compare multiple unitization strategies (sentence, section, window, cluster) and add a routing heuristic that adapts to different budget regimes.
  • Experiments across MIMIC discharge notes, Cochrane abstracts, and L-Eval show that optimal strategies vary by task and evaluation setting, with evaluation metrics like ROUGE and BERTScore reflecting quality differently, and they release accompanying code.
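The knapsack-constrained selection described above can be sketched as a cost-scaled greedy over document units. This is a minimal illustration only: the toy objective below (relevance minus redundancy, scaled by token cost) stands in for, and is not, the paper's RCD objective, and all inputs are invented.

```python
# Illustrative greedy for knapsack-constrained subset selection.
# The objective is a toy stand-in, NOT the paper's RCD objective:
# it rewards query relevance and penalizes similarity to units
# already selected, then scales the gain by token cost.

def select_units(units, budget, relevance, similarity, lam=0.7):
    """Pick document units under a strict token budget.

    units:      list of (unit_id, token_cost) pairs
    relevance:  dict unit_id -> relevance score in [0, 1]
    similarity: dict frozenset({id_a, id_b}) -> similarity in [0, 1]
    """
    selected, spent = [], 0
    remaining = dict(units)
    while remaining:
        best_id, best_ratio = None, 0.0
        for uid, cost in remaining.items():
            if spent + cost > budget:
                continue  # unit no longer fits the budget
            # Marginal gain: relevance minus redundancy w.r.t. selection.
            redundancy = max(
                (similarity[frozenset((uid, s))] for s in selected),
                default=0.0,
            )
            gain = lam * relevance[uid] - (1 - lam) * redundancy
            ratio = gain / cost  # cost-scaled greedy criterion
            if ratio > best_ratio:
                best_id, best_ratio = uid, ratio
        if best_id is None:
            break  # nothing affordable improves the objective
        selected.append(best_id)
        spent += remaining.pop(best_id)
    return selected, spent
```

With near-duplicate units, the redundancy term steers the greedy toward a cheaper, less similar unit even when a redundant one is more relevant, which is the basic relevance/diversity trade-off the Key Points describe.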

Abstract

A key challenge for large language models is token cost per query and overall deployment cost. Clinical inputs are long, heterogeneous, and often redundant, while downstream tasks are short and high stakes. We study budgeted context selection, where a subset of document units is chosen under a strict token budget so an off-the-shelf generator can meet fixed cost and latency constraints. We cast this as a knapsack-constrained subset selection problem with two design choices, unitization that defines document segmentation and selection that determines which units are kept. We propose RCD, a monotone submodular objective that balances relevance, coverage, and diversity. We compare sentence, section, window, and cluster-based unitization, and introduce a routing heuristic that adapts to the budget regime. Experiments on MIMIC discharge notes, Cochrane abstracts, and L-Eval show that optimal strategies depend on the evaluation setting. Positional heuristics perform best at low budgets in extractive tasks, while diversity-aware methods such as MMR improve LLM generation. Selector choice matters more than unitization, with cluster-based grouping reducing performance and other schemes behaving similarly. ROUGE saturates for LLM summaries, while BERTScore better reflects quality differences. We release our code at https://github.com/stone-technologies/ACL_budget_paper.
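The routing heuristic the abstract mentions could take a form like the sketch below, switching from positional to diversity-aware selection as the budget loosens. The function name, the budget-to-document ratio, and the 0.1 threshold are all assumptions for illustration; the paper only reports that positional heuristics win at low budgets while MMR-style selection helps LLM generation.

```python
# Hypothetical budget-regime router (thresholds invented for illustration).
# Tight budgets favor positional selection of leading units; looser
# budgets leave room for diversity-aware selection such as MMR.

def route_selector(budget_tokens: int, doc_tokens: int) -> str:
    """Choose a selection strategy from the budget-to-document ratio."""
    ratio = budget_tokens / max(doc_tokens, 1)
    if ratio < 0.1:   # tight budget: keep leading units (positional)
        return "lead"
    return "mmr"      # looser budget: diversity-aware selection

# Example: a 200-token budget over a 4000-token note routes to "lead",
# while a 1500-token budget over the same note routes to "mmr".
```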
