Vision-Language Model for Describing and Segmenting Blastocysts: Advancing AI-Based Multi-Task Approaches to Understand Morphology of Developing Embryos

  • Nicklas Daniel Neu

    Student thesis: Master's Thesis

    Abstract

    Multimodal deep learning has recently gained increasing attention in biomedical imaging, particularly in assisted reproduction, where accurate embryo evaluation is crucial for successful in vitro fertilization (IVF). Compared to unimodal approaches, multimodal models offer greater representational capacity, as they can combine visual and textual information within a unified framework. Traditional grading methods based on morphology remain subjective and limited in predictive accuracy, motivating the exploration of automated, minimally invasive, and objective solutions. In this context, an annotation tool was developed at the Software Competence Center Hagenberg (SCCH) to support embryologists in generating high-quality image–text annotations for training and evaluation. This thesis investigates the feasibility of employing vision–language models for embryo assessment, focusing on captioning and segmentation of blastocyst images. Two foundation models, Florence-2 and PaliGemma2, were fine-tuned on a time-lapse embryo dataset annotated with captions and segmentation masks. The study evaluates model performance across varying training-set sizes, learning rates, and low-rank adaptation (LoRA) parameter configurations, with BERTScore and Intersection-over-Union (IoU) serving as key metrics. Experiments were conducted on a multi-GPU high-performance computing setup, enabling efficient large-batch training. The results show that fine-tuned models significantly outperform the baselines, with BERTScores improving from 0.7946 to 0.9005 for Florence-2 and from 0.8143 to 0.9103 for PaliGemma2. Similarly, segmentation with PaliGemma2 reached an IoU of 0.7245 and an F1 score of 0.8091, demonstrating the model's ability to capture biologically meaningful structures such as the zona pellucida, trophectoderm, inner cell mass, and blastocoel.
Qualitative examples further confirm that the fine-tuned models provide more accurate and domain-specific descriptions than the foundation versions. However, the analysis highlights that data scarcity remains a major limitation, constraining generalization across embryo stages. A follow-up project at SCCH, in collaboration with Kepler Universitätsklinikum (Kepler University Hospital) and the Wunschkind Klinik Dr. Brunbauer, will address this limitation by expanding the dataset, introducing additional training strategies, and employing data augmentation. Overall, this thesis provides a proof of concept that vision–language models can be adapted for embryo analysis, laying the foundation for future research towards Artificial Intelligence–assisted embryo selection in IVF.
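The segmentation metrics reported above (IoU and F1) can be sketched for binary masks as follows. This is an illustrative helper under the usual definitions, not code from the thesis; the function name `iou_and_f1` and the NumPy-based implementation are assumptions for the sake of the example:

```python
import numpy as np

def iou_and_f1(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """Compute Intersection-over-Union and F1 (Dice) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # By convention, two empty masks are treated as a perfect match.
    iou = intersection / union if union else 1.0
    total = pred.sum() + target.sum()
    f1 = 2 * intersection / total if total else 1.0
    return float(iou), float(f1)
```

In a multi-class setting (zona pellucida, trophectoderm, inner cell mass, blastocoel), these scores would typically be computed per class and then averaged.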
    Date of Award: 2025
    Original language: English
    Supervisor: Julia Vetter

    Study program

    • Data Science and Engineering
