Leveraging Large Language Models to Extract and Translate Medical Information in Doctors' Notes for Health Records and Diagnostic Billing Codes

arXiv cs.CL / 3/25/2026


Key Points

  • The thesis proposes a privacy-preserving, on-device offline system that uses open-weight LLMs to extract medical facts from doctors’ notes and map them to ICD-10-CM diagnostic billing codes without cloud services.
  • It evaluates multiple local open-weight models (e.g., Llama 3.2, Mistral, Phi, DeepSeek) on consumer-grade hardware using Ollama, LangChain, and containerized deployment, along with a synthetic medical-note benchmark.
  • Enforcing a strict JSON output schema yields near-100% formatting compliance, but generating the correct, specific diagnostic codes remains difficult—especially for smaller 7B–20B parameter models.
  • The work finds that few-shot prompting can worsen results due to overfitting and hallucinations, while retrieval-augmented generation helps with discovering unseen codes but often suffers from context-window saturation.
  • The authors conclude that fully automated unsupervised coding with local open-source models is not yet dependable and recommend a human-in-the-loop workflow, while contributing a reproducible local LLM pipeline and benchmark dataset.
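The strict JSON schema enforcement mentioned above can be made concrete with a small validation step. The sketch below is illustrative only: the field names (`diagnoses`, `icd10_code`, `evidence`) are assumed for the example, not the thesis's actual schema, and the regex is a generic approximation of ICD-10-CM code shape.

```python
import json
import re

# Hypothetical output schema: the thesis enforces a strict JSON format,
# but these exact field names are assumptions for illustration.
REQUIRED_FIELDS = {"icd10_code", "evidence"}

# Rough ICD-10-CM shape: a letter, two digits, then optionally a dot
# and 1-4 alphanumerics (e.g. "E11.9", "S72.001A").
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(?:\.[0-9A-Z]{1,4})?$")

def validate_output(raw: str) -> list[dict]:
    """Parse a model response; keep only schema-conformant entries."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # formatting failure: counts against schema compliance
    valid = []
    for entry in payload.get("diagnoses", []):
        if not REQUIRED_FIELDS <= entry.keys():
            continue  # missing required field
        if ICD10_PATTERN.match(entry["icd10_code"]):
            valid.append(entry)
    return valid

# A well-formed entry passes; a malformed "code" is silently dropped.
response = ('{"diagnoses": ['
            '{"icd10_code": "E11.9", "evidence": "type 2 diabetes"}, '
            '{"icd10_code": "diabetes", "evidence": "?"}]}')
print(validate_output(response))
```

Separating format validation from code-accuracy scoring like this reflects the paper's finding: formatting compliance can be near-perfect even when the generated code itself is wrong.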

Abstract

Physician burnout in the United States has reached critical levels, driven in part by the administrative burden of Electronic Health Record (EHR) documentation and complex diagnostic codes. To relieve this strain and maintain strict patient privacy, this thesis explores an on-device, offline automatic medical coding system. The work focuses on using open-weight Large Language Models (LLMs) to extract clinical information from physician notes and translate it into ICD-10-CM diagnostic codes without reliance on cloud-based services. A privacy-focused pipeline was developed using Ollama, LangChain, and containerized environments to evaluate multiple open-weight models, including Llama 3.2, Mistral, Phi, and DeepSeek, on consumer-grade hardware. Model performance was assessed under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting strategies using a novel benchmark of synthetic medical notes. Results show that strict JSON schema enforcement achieved near-100% formatting compliance, but accurate generation of specific diagnostic codes remains challenging for smaller local models (7B–20B parameters). Contrary to common prompt-engineering guidance, few-shot prompting degraded performance through overfitting and hallucinations. While RAG enabled limited discovery of unseen codes, it frequently saturated context windows, reducing overall accuracy. The findings suggest that fully automated unsupervised coding with local open-source models is not yet reliable; instead, a human-in-the-loop assisted coding approach is currently the most practical path forward. This work contributes a reproducible local LLM architecture and benchmark dataset for privacy-preserving medical information extraction and coding.
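As a concrete illustration of the RAG strategy the abstract describes, the sketch below retrieves candidate ICD-10-CM codes for a note by simple word overlap and packs them into a prompt. Everything here is a toy assumption: the three-entry code table, the lexical scoring, and the prompt wording are invented for the example (a real pipeline would search the full code set, typically with embeddings), but it shows the trade-off the thesis reports, since every retrieved candidate consumes context-window budget.

```python
# Toy code table standing in for the full ICD-10-CM index
# (illustrative entries, not the thesis's retrieval corpus).
CODE_TABLE = {
    "E11.9": "type 2 diabetes mellitus without complications",
    "I10": "essential (primary) hypertension",
    "J45.909": "unspecified asthma, uncomplicated",
}

def tokenize(text: str) -> set[str]:
    """Lowercase and strip trailing punctuation for crude matching."""
    return {w.strip(".,()") for w in text.lower().split()}

def retrieve_candidates(note: str, k: int = 2) -> list[str]:
    """Score each code description by word overlap; keep the top k."""
    note_words = tokenize(note)
    scored = sorted(
        CODE_TABLE,
        key=lambda code: len(note_words & tokenize(CODE_TABLE[code])),
        reverse=True,
    )
    return scored[:k]

def build_prompt(note: str, k: int = 2) -> str:
    """Pack retrieved candidates into the prompt. Keeping k small
    matters: too many candidates saturate small models' context."""
    lines = [f"{c}: {CODE_TABLE[c]}" for c in retrieve_candidates(note, k)]
    return ("Candidate codes:\n" + "\n".join(lines)
            + f"\n\nNote: {note}\n"
            "Return the best ICD-10-CM code as JSON.")

note = "Patient with poorly controlled type 2 diabetes and hypertension."
print(build_prompt(note))
```

The `k` cap is where the abstract's context-saturation finding bites: raising it surfaces more unseen codes but crowds the note itself out of a small model's window.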