PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLMs for PET/CT Report Impression Generation

arXiv cs.CV / 3/12/2026

Key Points

  • The authors introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale dataset with over 41,000 real-world PET/CT reports for generating diagnostic impressions.
  • They evaluate 27 models, spanning proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and find that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings.
  • They train a domain-adapted 7B model, PET-F2I-7B, by fine-tuning Qwen2.5-7B-Instruct with LoRA; it achieves a BLEU-4 of 0.708 and a 3.0x improvement in entity coverage over the strongest baseline.
  • They introduce three clinically grounded metrics, Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR), to measure diagnostic completeness and factual reliability alongside standard NLG metrics.
  • PET-F2I-7B also offers advantages in cost, latency, and privacy, and PET-F2I-41K provides a standardized evaluation framework to accelerate the development of reliable, clinically deployable PET/CT reporting systems.
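
To make the LoRA fine-tuning mentioned in the key points more concrete, here is a minimal pure-Python sketch of the general idea: the pretrained weight matrix W stays frozen, and only a low-rank update scaled by alpha / r is trained. This is not the authors' training code. The class name `LoRALinear`, the layer sizes, the rank `r`, and the scaling `alpha` are illustrative assumptions, and a real fine-tuning run would rely on a deep-learning framework rather than Python lists.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer with a LoRA-style low-rank adapter.

    Output: W x + (alpha / r) * B (A x), where W is frozen and only
    A (r x d_in) and B (d_out x r) would be trained.
    """

    def __init__(self, d_in, d_out, r, alpha, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weights (random here, purely for illustration).
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable down-projection A, small random initialization.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
        # Trainable up-projection B, zero-initialized so that the adapted
        # layer starts out identical to the frozen base layer.
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


# Illustrative sizes only; they are not taken from the paper.
layer = LoRALinear(d_in=64, d_out=64, r=4, alpha=8)
print(layer.trainable_params(), "trainable vs", layer.frozen_params(), "frozen")
```

For a square layer of width d, the adapter adds 2 * r * d trainable parameters against d * d frozen ones, which is why a small rank keeps adapting a 7B model comparatively cheap.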

Abstract

PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.
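
The abstract names the entity-based metrics but this summary does not give their formulas. The sketch below shows one plausible reading, offered as an assumption rather than the paper's definition: ECR as the fraction of reference entities that appear in the generated impression, and UER as its complement. The function name `entity_rates`, the case-insensitive substring matching, and the example strings are all hypothetical. FCR is not sketched, since it requires judging whether statements agree with the findings, not just whether entities are present.

```python
from typing import Iterable, Tuple


def entity_rates(reference_entities: Iterable[str], generated: str) -> Tuple[float, float]:
    """Return (coverage_rate, uncovered_rate) under an ASSUMED definition.

    coverage_rate  = covered reference entities / all reference entities
    uncovered_rate = 1 - coverage_rate
    An entity counts as covered if it occurs as a case-insensitive
    substring of the generated text; real systems would likely use a
    clinical entity extractor with synonym handling instead.
    """
    text = generated.lower()
    reference = {e.strip().lower() for e in reference_entities if e.strip()}
    if not reference:
        raise ValueError("at least one reference entity is required")
    covered = {e for e in reference if e in text}
    coverage = len(covered) / len(reference)
    return coverage, 1.0 - coverage


# Made-up example, not drawn from the dataset.
entities = ["right upper lobe", "mediastinal lymph node", "liver", "bone"]
impression = (
    "Hypermetabolic nodule in the right upper lobe with "
    "mediastinal lymph node involvement."
)
print(entity_rates(entities, impression))  # (0.5, 0.5)
```

Substring matching is the simplest possible choice and will miss paraphrases and abbreviations, so scores from this sketch would not be comparable to the figures reported in the paper.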