CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
arXiv cs.CV / 4/16/2026
Key Points
- The paper studies vision-language contrastive training for aligning 3D abdominal CT volumes with radiology reports, reproducing the Merlin dual-encoder approach with a symmetric InfoNCE loss (see the loss sketch after this list) and improving on its zero-shot performance (macro F1 74.45% vs. 73.00%).
- It finds that explicit normal/abnormal class-ratio balancing within training batches (25:75, 50:50, 75:25) generally reduces performance compared with an unbalanced baseline, with 75:25 performing best among the balanced settings (72.02%); a batch-sampler sketch follows this list.
- Data scaling ablations on a 4,362-study subset show that performance increases sub-linearly as training data grows (macro F1 65.26% at 20% of the data to 71.88% at 100%), with large variability in which findings benefit.
- Enforcing 50:50 balanced sampling on the smaller subset further degrades results (68.01%), suggesting that class balancing can be harmful even when dataset size is held fixed.
- The authors conclude that the stochastic diversity of random sampling, together with Merlin's alternating batching across anatomical subsections, provides better regularization than engineered class ratios under the small-batch constraints common in 3D medical imaging.
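The symmetric InfoNCE loss referenced above scores every CT-volume embedding against every report embedding in the batch and applies cross-entropy in both directions. Below is a minimal PyTorch sketch of that objective; the temperature value, tensor shapes, and function name are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of a CLIP-style symmetric InfoNCE loss for a dual encoder.
# The temperature and embedding dimension are assumed, not taken from the paper.
import torch
import torch.nn.functional as F

def symmetric_infonce(image_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the image and text encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; entry (i, j) compares CT volume i with report j.
    logits = image_emb @ text_emb.t() / temperature

    # Matched volume-report pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```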
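The batch-composition experiments fix the share of normal and abnormal studies drawn into each batch. The sketch below shows one way such a ratio could be enforced; the binary label convention, class names, and sampler interface are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sampler that enforces a fixed normal:abnormal ratio per batch,
# as in the 25:75 / 50:50 / 75:25 ablations. Details are assumed, not the
# authors' exact code.
import random
from typing import Iterator, List, Sequence

class RatioBatchSampler:
    def __init__(self, labels: Sequence[int], batch_size: int,
                 normal_fraction: float, seed: int = 0) -> None:
        # labels[i] == 0 for a "normal" study, 1 for "abnormal".
        self.normal_idx = [i for i, y in enumerate(labels) if y == 0]
        self.abnormal_idx = [i for i, y in enumerate(labels) if y == 1]
        self.batch_size = batch_size
        self.n_normal = round(batch_size * normal_fraction)
        self.n_abnormal = batch_size - self.n_normal
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[List[int]]:
        normals = self.normal_idx[:]
        abnormals = self.abnormal_idx[:]
        self.rng.shuffle(normals)
        self.rng.shuffle(abnormals)
        # Yield index batches until either class pool is exhausted.
        while len(normals) >= self.n_normal and len(abnormals) >= self.n_abnormal:
            batch = [normals.pop() for _ in range(self.n_normal)]
            batch += [abnormals.pop() for _ in range(self.n_abnormal)]
            self.rng.shuffle(batch)
            yield batch
```

With a batch size of 8 and `normal_fraction=0.75`, each batch would hold 6 normal and 2 abnormal studies; dropping the sampler in favor of plain random batching corresponds to the unbalanced baseline that the paper finds performs best.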