Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

arXiv cs.CV · April 15, 2026


Key Points

  • The study finds that, for diffusion-based medical image super-resolution, the choice of the underlying VAE/latent autoencoder—not the diffusion architecture—most strongly limits reconstruction fidelity.
  • Replacing the generic Stable Diffusion VAE with MedVAE (a domain-specific autoencoder pretrained on 1.6M+ medical images) improves reconstruction by about +2.91 to +3.29 dB PSNR across knee MRI, brain MRI, and chest X-ray.
  • Wavelet analysis shows the gains concentrate in the finest spatial-frequency bands that encode anatomically relevant fine structure.
  • Across ablations (inference schedules, prediction targets, and generative architectures), the fidelity gap stays stable to within ±0.15 dB, while hallucination rates remain comparable, indicating that reconstruction fidelity and generative hallucination are controlled by independent pipeline components.
  • The authors propose a practical selection criterion: domain-specific autoencoder reconstruction quality (measured without diffusion training) predicts downstream SR performance (R² = 0.67), and they release code and trained weights on GitHub.
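The headline comparison is a paired one: each test image is reconstructed through both VAEs, and the per-image PSNR gain is summarized by its mean and a paired Cohen's d. A minimal stdlib-only sketch of that arithmetic, using made-up placeholder gains rather than the paper's data (which covers n = 1,820 images):

```python
# Sketch of a paired PSNR comparison and effect size (Cohen's d).
# The per-image gains below are hypothetical placeholders, not the paper's data.
import math
from statistics import mean, stdev

def psnr(ref, recon, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-length intensity lists."""
    mse = mean((r - x) ** 2 for r, x in zip(ref, recon))
    return 10 * math.log10(max_val ** 2 / mse)

def paired_cohens_d(diffs):
    """Effect size for paired differences: mean(diff) / sample std(diff)."""
    return mean(diffs) / stdev(diffs)

# Hypothetical per-image PSNR gains (MedVAE minus SD-VAE), in dB.
gains = [2.0, 4.5, 1.0, 5.0, 3.0, 2.5, 4.0, 1.5, 3.5, 3.0]
print(f"mean gain: {mean(gains):.2f} dB, Cohen's d: {paired_cohens_d(gains):.2f}")
```

A paired design matters here: because both autoencoders see the same images, the effect size is computed on per-image differences rather than on two independent samples, which is also what the Wilcoxon signed-rank test in the abstract assumes.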

Abstract

Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields a +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within ±0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.
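The proposed screening criterion amounts to a simple regression: measure each candidate autoencoder's reconstruction PSNR (no diffusion training needed), measure downstream SR PSNR, and check how much variance the former explains. A minimal sketch of that R² computation on hypothetical pairs (the paper reports R² = 0.67 on its own data; all numbers below are placeholders):

```python
# Sketch of the screening-criterion regression: R^2 of downstream SR PSNR
# against autoencoder-only reconstruction PSNR. Data is hypothetical.
from statistics import mean

def r_squared(x, y):
    """Coefficient of determination for a simple least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical pairs: (VAE reconstruction PSNR, downstream SR PSNR), in dB.
vae_psnr = [30.0, 32.0, 34.0, 36.0]
sr_psnr  = [27.0, 28.0, 30.0, 31.0]
print(f"R^2 = {r_squared(vae_psnr, sr_psnr):.2f}")
```

If this relationship holds for a given modality, VAE candidates can be ranked by reconstruction quality alone, which is far cheaper than training a diffusion model per candidate — the practical point the abstract makes.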