CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
arXiv cs.CV / 4/28/2026
Key Points
- CheXmix proposes a unified early-fusion generative pretraining approach for vision-language medical imaging models, addressing limitations of CLIP+LLM projection layers used in many MLLM pipelines.
- The method trains on large chest X-ray datasets paired with radiology reports and extends Chameleon’s autoregressive framework with a two-stage multimodal generative pretraining strategy that blends masked autoencoder strengths with MLLM training.
- CheXmix is designed to support both discriminative and generative tasks at coarse and fine-grained levels, enabling flexible use across multiple chest X-ray problem types.
- Reported evaluations show CheXmix outperforming other generative baselines by 6.0% on average across masking ratios, beating CheXagent by 8.6% AUROC at high image-masking ratios on CheXpert, improving image inpainting by 51.0% over text-only generators, and scoring 45% higher than CheXagent on the GREEN metric for radiology report generation.
- The paper provides an open-source codebase at the linked GitHub repository, supporting replication and further research.
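The early-fusion idea described above can be sketched in a few lines: discrete image tokens and text tokens share one sequence, a fraction of the image tokens is masked (masked-autoencoder style), and the model would be trained autoregressively to reconstruct the originals. This is a minimal illustrative sketch, not the paper's implementation; `MASK_ID`, the token values, and the function name are all hypothetical.

```python
import random

MASK_ID = -1  # hypothetical sentinel for masked image tokens


def build_early_fusion_sequence(image_tokens, text_tokens, mask_ratio, rng=None):
    """Build a toy Chameleon-style early-fusion training pair.

    Image tokens (e.g. from a discrete VQ tokenizer) and report text tokens
    are concatenated into one sequence; a fraction of image tokens is
    replaced by MASK_ID so an autoregressive model must reconstruct them.
    """
    rng = rng or random.Random(0)
    n_mask = int(len(image_tokens) * mask_ratio)
    mask_positions = set(rng.sample(range(len(image_tokens)), n_mask))
    corrupted = [MASK_ID if i in mask_positions else t
                 for i, t in enumerate(image_tokens)]
    # Inputs: corrupted image tokens followed by the report text.
    # Targets: the uncorrupted tokens, so masked positions carry loss signal.
    inputs = corrupted + text_tokens
    targets = image_tokens + text_tokens
    return inputs, targets


# Example: 10 image tokens, 2 text tokens, 50% image masking.
inputs, targets = build_early_fusion_sequence(list(range(10)), [100, 101], 0.5)
```

The mask ratio is the knob the paper's evaluations sweep over; at high ratios, a model trained this way must rely on cross-modal context (the report text) to fill in missing image content.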