Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

arXiv cs.CL / 5/5/2026


Key Points

  • The paper reports that current LLMs struggle to recall and follow Brazil’s Unified Health System (SUS) guideline knowledge in Brazilian Portuguese, motivating a domain-specific approach.
  • It adapts Qwen2.5-14B-Instruct using continual pre-training plus Group Relative Policy Optimization (GRPO) on synthetic data generated from 178 official clinical guidelines (~5.4M tokens).
  • The authors introduce HealthBench-BR (1,780 balanced true/false assertions) and PCDT-QA (890 open-ended questions), addressing the lack of Brazilian-protocol-grounded evaluation benchmarks.
  • The best 14B-parameter model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming several larger commercial or web-grounded systems, with ablations highlighting the importance of generator diversity and reinforcement learning.
  • All datasets, benchmarks, and model weights are released to enable reproducible clinical NLP research in Brazilian Portuguese, alongside public code and artifacts on GitHub.
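Scoring a balanced true/false benchmark like HealthBench-BR reduces to plain accuracy over the assertions, since the label split is 50/50. A minimal sketch of such a harness follows; the item format, the `ask_model` callable, and the answer normalization are illustrative assumptions, not the paper's actual evaluation code:

```python
def normalize(answer: str) -> bool:
    """Map a free-text model answer to a boolean verdict
    (accepting Portuguese or English affirmatives)."""
    a = answer.strip().lower()
    return a.startswith(("true", "verdadeiro", "sim", "yes"))

def accuracy(items, ask_model):
    """items: list of (assertion_text, gold_bool) pairs.
    ask_model: callable mapping an assertion to the model's answer."""
    correct = sum(normalize(ask_model(text)) == gold for text, gold in items)
    return correct / len(items)

# Toy run with a stub "model" that always answers "Verdadeiro":
# on a balanced set it can score no better than 50%.
items = [("Assertion A", True), ("Assertion B", False)]
print(accuracy(items, lambda text: "Verdadeiro"))  # → 0.5
```

Because the set is balanced, a degenerate always-true (or always-false) responder lands at 50%, so the reported 83.9% reflects genuine guideline recall rather than label bias.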

Abstract

Brazil's Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats -- rephrases, wiki-style articles, and question-answer pairs -- using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at https://github.com/hugoabonizio/clinical-protocols-br
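GRPO's core idea is to skip the learned value critic of PPO: for each prompt, a group of completions is sampled, and each completion's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation, with an illustrative 0/1 guideline-correctness reward (the actual reward design in the paper may differ):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: center each sampled completion's
    reward on the group mean and scale by the group's standard
    deviation, so no separate value model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled answers scored 0/1 for guideline accuracy:
# correct answers get positive advantage, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

These per-completion advantages then weight a clipped policy-gradient update, typically with a KL penalty toward the continually pre-trained reference model.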