InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language

arXiv cs.AI / 4/25/2026


Key Points

  • The study fine-tunes a foundational multimodal vision-language model (PaliGemma-2) to generate natural-language descriptions of embryo morphology, cell cycle, and developmental stage from IVF time-lapse imagery.
  • Using a publicly available dataset, the researchers trained InVitroVision with only 1,000 image-caption pairs, addressing multimodal IVF data that many prior approaches leave underused.
  • InVitroVision reportedly outperformed a commercial model (ChatGPT 5.2) and other base models on overall evaluation metrics.
  • The model’s performance improved as the training dataset size increased, indicating better generalization with more data despite limited initial annotations.
  • The authors argue the method could support knowledge retrieval with large language models by connecting generated descriptions to scientific evidence from publications and guidelines, and could enable few-shot adaptation across IVF downstream tasks.
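The summary does not include the authors' code, but the described setup (pairing time-lapse frames with captions covering morphology, cell cycle, and developmental stage) can be sketched. Below is a minimal, hypothetical example of shaping such pairs into the prefix/suffix format commonly used to fine-tune PaliGemma-family models; all field names, paths, and the caption text are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EmbryoCaptionExample:
    """One training pair: a time-lapse frame plus its natural-language caption.

    Field names are illustrative; the actual dataset schema is not
    specified in this summary.
    """
    image_path: str
    caption: str  # describes morphology, cell cycle, and developmental stage

def to_paligemma_pair(example: EmbryoCaptionExample,
                      prompt: str = "caption en") -> dict:
    """Format a pair in the prefix/suffix style PaliGemma-family models
    are commonly fine-tuned with: the prompt is the prefix, and the
    caption is the suffix (the target sequence the model learns to emit).
    """
    return {
        "image": example.image_path,
        "prefix": prompt,
        "suffix": example.caption,
    }

# One hypothetical pair out of the ~1,000 used for fine-tuning
ex = EmbryoCaptionExample(
    image_path="frames/embryo_001_t42.png",
    caption="Embryo at the 4-cell stage; even blastomeres, no fragmentation.",
)
pair = to_paligemma_pair(ex)
```

In a real fine-tuning loop, each such pair would be passed through the model's processor (image plus prefix as input, suffix as the supervised target) before computing the language-modeling loss.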

Abstract

The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.