RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation
arXiv cs.CV / 5/1/2026
Key Points
- Radiology report generation aims to reduce radiologists' workload and errors by producing diagnostic reports from medical images, but aligning fine-grained visual features with the hierarchical structure of long reports remains a key challenge.
- The paper argues that many existing image-text methods treat reports as flat sequences, which limits fine-grained cross-modal alignment and reduces accuracy.
- It introduces RIHA (Report-Image Hierarchical Alignment Transformer), an end-to-end framework that aligns radiology images with reports at multiple levels: paragraph, sentence, and word.
- RIHA uses a Visual Feature Pyramid and a Text Feature Pyramid, connected by a Cross-modal Hierarchical Alignment module that applies optimal transport, plus Relative Positional Encoding in the decoder to improve token-level alignment.
- Experiments on IU-Xray and MIMIC-CXR show RIHA surpasses prior state-of-the-art methods on both language generation performance and clinical efficacy metrics.
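The core mechanism the paper describes, matching text features to visual features via optimal transport, can be illustrated with a minimal sketch. This is not the paper's actual module: the Sinkhorn solver, cosine cost, uniform marginals, and all variable names below are assumptions for illustration, showing how sentence-level text embeddings might be softly matched to image-region embeddings at one level of the hierarchy.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.
    Returns a soft transport plan matching rows (e.g. sentence features)
    to columns (e.g. image-region features) under uniform marginals."""
    n, m = cost.shape
    K = np.exp(-cost / reg)                 # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginal weights
    u = np.ones(n) / n
    for _ in range(n_iters):
        v = b / (K.T @ u)                   # alternate scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]      # transport plan, sums to 1

def cosine_cost(X, Y):
    """Cross-modal cost = 1 - cosine similarity between feature rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

# Toy example: 3 sentence embeddings vs. 4 image-region embeddings.
rng = np.random.default_rng(0)
sent_feats = rng.normal(size=(3, 8))
region_feats = rng.normal(size=(4, 8))

cost = cosine_cost(sent_feats, region_feats)
plan = sinkhorn(cost)
# An alignment loss would penalize transporting mass at high cost.
align_loss = float((plan * cost).sum())
```

In a full model, a loss like `align_loss` would be computed at each level of the pyramid (paragraph, sentence, word) and backpropagated through both encoders, pulling matched cross-modal features together.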