GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation
arXiv cs.CL / 3/30/2026
Key Points
- The GS-BrainText dataset compiles 8,511 brain radiology reports from the Generation Scotland cohort, with 2,431 reports annotated for 24 brain disease phenotypes.
- It is a multi-site UK dataset spanning five Scottish NHS health boards and includes a broad age distribution (mean 58, median 53), designed to support generalisable clinical NLP development and validation.
- Expert annotations were produced using a defined schema with multidisciplinary clinical oversight, including 10–100% double annotation per site and formal quality assurance procedures.
- Benchmarking with the rule-based EdIE-R system shows performance variability across health boards (F1 86.13–98.13), phenotypes (F1 22.22–100), and age groups (F1 87.01–98.13), underscoring generalisation challenges.
- The release targets a gap in UK clinical text resources and enables research into linguistic variation, expression of diagnostic uncertainty, and how dataset characteristics affect NLP performance.
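
The per-board, per-phenotype, and per-age-group F1 ranges above come from scoring the same system on different slices of the annotated reports. A minimal sketch of that kind of stratified scoring follows. It assumes each record is a `(group, gold, predicted)` triple with boolean labels for a single phenotype. The function name `f1_by_group`, the record layout, and the toy board names are illustrative assumptions, not the paper's evaluation code or the EdIE-R interface.

```python
from collections import defaultdict


def f1_by_group(records):
    """Compute F1 (as a percentage) separately for each group.

    records: iterable of (group, gold, predicted) triples, where gold and
    predicted are booleans for one phenotype (hypothetical layout).
    Returns a dict mapping group -> F1 in [0, 100], or None when the group
    has no true positives, false positives, or false negatives.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for group, gold, pred in records:
        c = counts[group]  # registers the group even if it only has true negatives
        if gold and pred:
            c["tp"] += 1
        elif pred and not gold:
            c["fp"] += 1
        elif gold and not pred:
            c["fn"] += 1

    scores = {}
    for group, c in counts.items():
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[group] = 100.0 * 2 * c["tp"] / denom if denom else None
    return scores


# Toy data only: these labels are made up and do not come from GS-BrainText.
toy_records = [
    ("board_a", True, True),
    ("board_a", True, False),
    ("board_a", False, True),
    ("board_b", True, True),
    ("board_b", False, False),
]
print(f1_by_group(toy_records))
```

Grouping by health board, phenotype, or age band only changes what is passed as the first element of each triple, so the same function covers all three breakdowns reported above.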