CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study

arXiv cs.CV / 4/1/2026


Key Points

  • The study investigates whether CT images can serve as training-time-only supervision for distilling a binary disease/no-disease chest X-ray classifier, so that CT is not required at inference.
  • Using patient-level paired data and a teacher–student distillation setup, the authors find that a stripped-down plain cross-modal logit-KD baseline outperforms the more complex JDCNet variant on a small four-image validation subset.
  • Eight Monte Carlo patient-level resampling experiments suggest results are sensitive to dataset splits, with late fusion achieving the best mean accuracy while different strategies perform best for macro-F1 and balanced accuracy.
  • Stronger mechanism controls (attention transfer and feature hints) do not reliably restore a robust cross-modality advantage, highlighting likely failure modes in the cross-modality transfer.
  • The paper’s main contribution is presented as a reproducible, evidence-bounded pilot protocol that clarifies the task definition, instability in rankings, and minimum requirements for future credible CT-to-X-ray claims rather than a new validated architecture.
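The "plain cross-modal logit-KD" control mentioned above is not spelled out in this summary; a minimal sketch of the standard temperature-scaled logit distillation loss it presumably follows (the temperature `T`, mixing weight `alpha`, and NumPy formulation are assumptions, not details from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Plain logit-KD: alpha * CE(student, labels)
    + (1 - alpha) * T^2 * KL(teacher_T || student_T).
    Here the teacher would be the CT branch and the student the X-ray branch."""
    p_t = softmax(teacher_logits, T)                     # soft targets from the teacher
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - log_p_s), axis=-1).mean()
    log_p_hard = np.log(softmax(student_logits, 1.0) + 1e-12)
    ce = -log_p_hard[np.arange(len(labels)), labels].mean()
    return alpha * ce + (1.0 - alpha) * (T ** 2) * kl
```

The `T ** 2` factor keeps the soft-target gradient magnitude comparable across temperatures, which is the usual convention in logit distillation.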

Abstract

Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher–student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.
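The abstract's "patient-level Monte Carlo resamples" hinge on splitting by patient rather than by image, so that paired CT/X-ray images from one patient never leak across train and test. A minimal sketch of such a split (the `test_frac`, seed handling, and toy cohort layout are illustrative assumptions, not the paper's protocol):

```python
import random

def patient_level_split(records, test_frac=0.25, seed=0):
    """Monte Carlo resample split by patient ID.

    `records` maps patient_id -> list of that patient's image identifiers,
    so all of a patient's images land on the same side of the split.
    """
    rng = random.Random(seed)
    patients = sorted(records)
    rng.shuffle(patients)
    n_test = max(1, round(test_frac * len(patients)))
    test_patients = set(patients[:n_test])
    train = [img for p in patients[n_test:] for img in records[p]]
    test = [img for p in test_patients for img in records[p]]
    return train, test

# Toy cohort: 12 hypothetical patients, each with a paired X-ray and CT.
cohort = {f"p{i}": [f"p{i}_xray", f"p{i}_ct"] for i in range(12)}

# Repeated resamples with different seeds, as in a Monte Carlo protocol.
splits = [patient_level_split(cohort, seed=s) for s in range(8)]
```

Ranking models across many such resamples, rather than on one fixed split, is what exposes the split sensitivity the paper reports.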