Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?

arXiv cs.CL / 4/16/2026


Key Points

  • The paper evaluates whether large language models can reliably extract physiology index values and their anatomical locations from unstructured coronary angiography (CAG) reports, focusing on Portuguese clinical text.
  • Using a corpus of 1,342 reports, the study is presented as the first to address physiology-index extraction at this scale for CAG reports and as one of the few efforts targeting CAG/Portuguese clinical language.
  • It compares local privacy-preserving general-purpose and medical LLMs under multiple prompting strategies (zero-shot, few-shot, and few-shot with implausible examples) and tests constrained generation plus a RegEx-based post-processing step.
  • The authors propose a multi-stage evaluation framework that separately measures format validity, value detection, and value correctness while considering asymmetric clinical error costs.
  • Results indicate that non-medical models can perform on par with medical ones, with the best overall performance reported for Llama under zero-shot prompting and GPT-OSS showing the highest robustness to prompt changes, while constrained generation and RegEx augmentation did not significantly improve most models' outcomes.
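
The RegEx post-processing step mentioned above can be sketched as follows. The paper does not publish its actual patterns, so this is a hypothetical pass that assumes indices appear in free text as, e.g., "FFR 0.78" or "iFR: 0,89" (Portuguese decimal comma):

```python
import re

# Hypothetical pattern; the authors' actual RegEx layer is not reproduced here.
# Assumes index names (FFR, iFR) are followed by a value between 0 and 1,
# written with either a decimal point or a Portuguese decimal comma.
INDEX_PATTERN = re.compile(
    r"\b(FFR|iFR)\b[\s:=]*([01][.,]\d{1,2})",
    re.IGNORECASE,
)

def extract_indices(report_text: str) -> list[dict]:
    """Return (index name, numeric value) pairs found in unstructured text."""
    results = []
    for name, raw_value in INDEX_PATTERN.findall(report_text):
        value = float(raw_value.replace(",", "."))  # normalise decimal comma
        results.append({"index": name, "value": value})
    return results
```

A pass like this can salvage values from model outputs that are numerically correct but do not conform to the requested template, which is the role the paper assigns to its RegEx layer.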

Abstract

Coronary angiography (CAG) reports contain clinically relevant physiological measurements, yet this information is typically in the form of unstructured natural language, limiting its use in research. We investigate the use of Large Language Models (LLMs) to automatically extract these values, along with their anatomical locations, from Portuguese CAG reports. To our knowledge, this study is the first to address physiology-index extraction from a large corpus (1,342 reports) of CAG reports, and one of the few focusing on CAG or Portuguese clinical text. We explore local, privacy-preserving general-purpose and medical LLMs under different settings. Prompting strategies included zero-shot, few-shot, and few-shot prompting with implausible examples. In addition, we apply constrained generation and introduce a post-processing step based on RegEx. Given the sparsity of measurements, we propose a multi-stage evaluation framework separating format validity, value detection, and value correctness, while accounting for asymmetric clinical error costs. This study demonstrates the potential of LLMs for extracting physiological indices from Portuguese CAG reports. Non-medical models performed similarly to medical ones: the best results were obtained with Llama under zero-shot prompting, while GPT-OSS demonstrated the highest robustness to changes in the prompts. While MedGemma achieved results similar to those of non-medical models, MedLlama's outputs were out-of-format in the unconstrained setting and showed significantly lower performance in the constrained one. Changes in the prompting technique and adding a RegEx layer showed no significant improvement across models, while constrained generation decreased performance, although it has the benefit of enabling specific models that cannot otherwise conform to the templates.
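
The multi-stage evaluation framework described in the abstract can be illustrated with a minimal sketch. The stage names follow the paper (format validity, value detection, value correctness), but the schema, field names, and matching logic below are assumptions for illustration, not the authors' implementation:

```python
import json

def stage1_format_valid(output: str) -> bool:
    """Stage 1: does the raw model output parse as the expected JSON schema?
    (Assumed schema: a list of {"index", "location", "value"} records.)"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, list) and all(
        isinstance(item, dict) and {"index", "location", "value"} <= set(item)
        for item in data
    )

def stage2_detection(pred: list, gold: list) -> tuple[int, int, int]:
    """Stage 2: was a value reported at the right (index, location) pair?
    Returns (true positives, false positives, false negatives); keeping
    FP and FN separate supports asymmetric clinical error costs."""
    pred_keys = {(p["index"], p["location"]) for p in pred}
    gold_keys = {(g["index"], g["location"]) for g in gold}
    tp = len(pred_keys & gold_keys)
    fp = len(pred_keys - gold_keys)  # hallucinated measurements
    fn = len(gold_keys - pred_keys)  # missed measurements
    return tp, fp, fn

def stage3_correctness(pred: list, gold: list) -> float:
    """Stage 3: among correctly detected entries, fraction with the exact
    gold-standard value."""
    gold_map = {(g["index"], g["location"]): g["value"] for g in gold}
    matched = [p for p in pred if (p["index"], p["location"]) in gold_map]
    if not matched:
        return 0.0
    correct = sum(
        1 for p in matched if p["value"] == gold_map[(p["index"], p["location"])]
    )
    return correct / len(matched)
```

Separating the stages this way keeps format failures (e.g., MedLlama's out-of-format outputs) from silently deflating detection and correctness scores, which is the motivation the paper gives for the framework.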