Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

arXiv cs.CL / 3/30/2026

Key Points

The paper proposes a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinician questions directly from Finnish EHR text without transferring data externally.
It benchmarks multiple open-source LLMs (4B–70B parameters) using an offline dataset of 1,664 expert-annotated question–answer pairs from 183 patients, with most text in Finnish.
Llama-3.1-70B achieved high free-text performance (95.3% accuracy and 97.3% consistency across semantically equivalent question variants), while Qwen3-30B-A3B-2507 showed comparable results.
Quantization to 4-bit and 8-bit precision reduced GPU memory requirements while largely preserving predictive performance, making offline deployment more feasible.
Clinical evaluation found clinically significant errors in 2.9% of outputs. Semantically equivalent questions sometimes produced discordant answers, including cases (0.96%) where one formulation was correct and the other contained a clinically significant error, underscoring the need for validation and human oversight.
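
To give a rough sense of why quantization matters for local deployment, the sketch below estimates the memory needed to hold model weights alone at different precisions. The nominal 70e9 parameter count and the 16-bit baseline are assumptions for illustration; the summary does not state the unquantized precision, and real GPU usage also includes activations, the KV cache, and quantization overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    nominal_params = 70e9  # assumption: nominal size of a "70B" model
    for bits in (16, 8, 4):  # assumption: 16-bit taken as the baseline
        gb = weight_memory_gb(nominal_params, bits)
        print(f"{bits}-bit weights: about {gb:.0f} GB")
```

Under these assumptions, halving the bit width halves the weight footprint, which is the main reason 4-bit and 8-bit variants can fit on fewer or smaller GPUs.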

Abstract

Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.
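
The two headline metrics can be illustrated with a small sketch. The abstract does not give the exact definition of consistency, so the version here is one plausible reading: the share of question groups whose semantically equivalent variants all received the same correctness judgment. The `JudgedAnswer` type, the `group_id` field, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    group_id: str  # hypothetical id shared by semantically equivalent variants
    correct: bool  # expert judgment of the model's answer


def accuracy(items: list[JudgedAnswer]) -> float:
    """Fraction of all answers judged correct."""
    return sum(item.correct for item in items) / len(items)


def consistency(items: list[JudgedAnswer]) -> float:
    """Fraction of groups whose variants all share one judgment (assumed definition)."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for item in items:
        groups[item.group_id].append(item.correct)
    agreeing = sum(1 for judgments in groups.values() if len(set(judgments)) == 1)
    return agreeing / len(groups)


# Toy data: three questions, each asked in two equivalent phrasings.
toy = [
    JudgedAnswer("q1", True), JudgedAnswer("q1", True),
    JudgedAnswer("q2", True), JudgedAnswer("q2", False),  # discordant pair
    JudgedAnswer("q3", False), JudgedAnswer("q3", False),
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(toy):.3f}")
    print(f"consistency: {consistency(toy):.3f}")
```

In this toy set, `q2` is the kind of discordant pair the abstract highlights: one phrasing is answered correctly and the other is not, which lowers consistency even though each answer is scored independently for accuracy.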