Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
arXiv cs.CL / 4/30/2026
Key Points
- The paper introduces LUNGUAGE, a benchmark dataset for structured chest X-ray report generation that supports both single-report evaluation and longitudinal (patient-level) assessment across multiple studies.
- LUNGUAGE includes 1,473 expert-annotated chest X-ray reports. A subset of 186 also carries expert-reviewed longitudinal annotations that capture disease progression and inter-study intervals.
- It presents a two-stage structuring framework that converts generated reports into fine-grained, schema-aligned structured reports to enable longitudinal interpretation.
- The authors propose LUNGUAGESCORE, an interpretable evaluation metric that compares structured outputs at the entity, relation, and attribute levels while enforcing temporal consistency across patient timelines.
- The authors present the work as the first to offer a benchmark dataset, a structuring framework, and an evaluation metric for sequential radiology reporting, and they report that LUNGUAGESCORE is effective for evaluating structured reports.
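
To make the entity-, relation-, and attribute-level comparison concrete, here is a minimal Python sketch. It is not the authors' LUNGUAGESCORE: this summary does not give the metric's definition, so the sketch substitutes plain set-overlap F1 at each level and a simple per-study check of the reported change status as a stand-in for temporal consistency. The names `Finding`, `f1`, `study_score`, and `timeline_score`, the field layout, and the example findings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One structured finding in a report (hypothetical schema, not the paper's)."""
    entity: str                          # e.g. "pleural effusion"
    location: str                        # relation target, e.g. "left lower lobe"
    attributes: frozenset = frozenset()  # e.g. frozenset({"small"})
    change: str = "none"                 # status relative to the prior study


def f1(pred: set, gold: set) -> float:
    """Set-overlap F1; two empty sets count as a perfect match."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def study_score(pred: list, gold: list) -> dict:
    """Score one study at the entity, relation, and attribute levels."""
    return {
        "entity": f1({f.entity for f in pred}, {f.entity for f in gold}),
        "relation": f1({(f.entity, f.location) for f in pred},
                       {(f.entity, f.location) for f in gold}),
        "attribute": f1({(f.entity, a) for f in pred for a in f.attributes},
                        {(f.entity, a) for f in gold for a in f.attributes}),
    }


def timeline_score(pred_studies: list, gold_studies: list) -> dict:
    """Average the per-study scores over a patient timeline and add a
    temporal score over (study index, entity, change) triples."""
    if not gold_studies or len(pred_studies) != len(gold_studies):
        raise ValueError("timelines must be non-empty and of equal length")
    per_study = [study_score(p, g) for p, g in zip(pred_studies, gold_studies)]
    result = {level: sum(s[level] for s in per_study) / len(per_study)
              for level in ("entity", "relation", "attribute")}
    result["temporal"] = f1(
        {(i, f.entity, f.change) for i, s in enumerate(pred_studies) for f in s},
        {(i, f.entity, f.change) for i, s in enumerate(gold_studies) for f in s},
    )
    return result


# Two-study timeline: the prediction misses the worsening in the second study.
gold = [
    [Finding("pleural effusion", "left lower lobe", frozenset({"small"}), "new")],
    [Finding("pleural effusion", "left lower lobe", frozenset({"moderate"}), "worsened")],
]
pred = [
    [Finding("pleural effusion", "left lower lobe", frozenset({"small"}), "new")],
    [Finding("pleural effusion", "left lower lobe", frozenset({"small"}), "stable")],
]
scores = timeline_score(pred, gold)
print(scores)
```

In this toy timeline the predicted reports name the right entity and location in both studies but miss the change in size and status in the second one, so the entity and relation scores stay at 1.0 while the attribute and temporal scores drop to 0.5. Separating the levels this way is what makes a score of this kind interpretable: it shows which part of the structured report went wrong, not just that something did.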