AI Navigate

Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

arXiv cs.CV / 3/17/2026

📰 News · Models & Research

Key Points

  • The work addresses controllable pathology image synthesis, noting that prior text-guided diffusion models offer coarse global control and lack fine-grained structural constraints.
  • It introduces a scalable multi-agent LVLM annotation framework that combines image description, diagnostic step extraction, and automatic quality judgment to produce clinically aligned supervision at scale.
  • It presents IC-DiT, a layout-aware diffusion transformer that fuses spatial layouts, textual descriptions, and visual embeddings with hierarchical multimodal attention to preserve morphology while maintaining global semantic coherence.
  • Experiments on five histopathology datasets show IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency, with generated images also boosting downstream tasks like cancer classification and survival analysis.
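The annotation pipeline described above coordinates three agent roles: describing the image, extracting diagnostic steps, and judging quality, with low-quality outputs routed back for retry or human verification. The paper does not publish its orchestration code, so the sketch below is a minimal hypothetical coordinator; the function names (`describe`, `extract_steps`, `judge`) and the retry/threshold logic are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Annotation:
    description: str
    diagnostic_steps: List[str]
    quality_score: float

def annotate_patch(patch_id: str,
                   describe: Callable[[str], str],        # LVLM agent 1: image description
                   extract_steps: Callable[[str], List[str]],  # agent 2: diagnostic steps
                   judge: Callable[[str, List[str]], float],   # agent 3: quality judgment
                   threshold: float = 0.8,
                   max_retries: int = 3) -> Optional[Annotation]:
    """Hypothetical coordinator: run the three agents in sequence and
    retry until the judge's score clears the threshold."""
    for _ in range(max_retries):
        desc = describe(patch_id)
        steps = extract_steps(desc)
        score = judge(desc, steps)
        if score >= threshold:
            return Annotation(desc, steps, score)
    return None  # unresolved cases would go to human verification
```

Failed patches returning `None` mirrors the paper's human-verification step: the automatic judge filters at scale, and only residual cases need expert review.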

Abstract

Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.
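The abstract describes IC-DiT as fusing three conditioning streams (spatial layouts, text, visual embeddings) into a diffusion transformer via hierarchical multimodal attention. The architecture details are not given here, so the following is only a toy NumPy sketch of one plausible reading: image tokens first cross-attend to layout tokens (local structure), then to concatenated text and visual tokens (global semantics). All shapes, the two-stage ordering, and the single-head attention are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    """Single-head attention: queries (Nq, d) attend over context (Nc, d)."""
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

rng = np.random.default_rng(0)
d = 32
img = rng.normal(size=(16, d))     # noisy image patch tokens
layout = rng.normal(size=(4, d))   # spatial layout tokens (assumed encoding of region masks)
text = rng.normal(size=(8, d))     # text description tokens
visual = rng.normal(size=(6, d))   # reference visual embedding tokens

# Hierarchical fusion (assumed ordering): structure first, semantics second,
# with residual connections as in standard transformer blocks.
h = img + cross_attention(img, layout, d)
h = h + cross_attention(h, np.concatenate([text, visual]), d)
print(h.shape)  # (16, 32): image tokens, now conditioned on all three streams
```

The design point the abstract emphasizes is that layout conditioning constrains morphology locally while text/visual conditioning keeps the patch globally coherent; a staged attention scheme like this is one way to separate those two roles.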