This thesis explores the integration of SNOMED CT, a comprehensive clinical ontology, with Large Language Models (LLMs) to enhance medical information retrieval, particularly for the task of medical concept normalization. Two complementary approaches are investigated: (i) training-time integration, where ontology knowledge informs neural embedding models, and (ii) inference-time integration, where agentic retrieval systems dynamically navigate the ontology. For the training-time approach, a contrastive fine-tuning pipeline is developed that combines SNOMED CT’s synonyms, textual descriptions, and inferred relationships with synthetic definitions generated by LLMs. A novel distance-weighted loss function is introduced to incorporate hierarchical structure into the embedding space. Fine-tuned models achieve improvements of up to six points (Spearman’s 𝜌 × 100) on biomedical concept representation (BCR) benchmarks compared to strong baselines, and maintain competitive performance on entity linking and asymmetric retrieval tasks. For inference-time integration, an agentic retrieval system is implemented that equips LLMs with tools for ontology navigation, query rewriting, and graph traversal. Evaluation on the SNOMED CT Entity Linking Challenge dataset shows that the agentic approach raises the share of correctly linked top-1 concepts from 51 percent (dense retrieval baseline) to 65 percent with reasoning-augmented agents, and reduces hierarchical error distance by over 40 percent. Because entity linking to SNOMED CT is an inherently ambiguous task, adherence to the correct hierarchy matters more than the exact linked concept. The agents' ability to link a concept in the right hierarchy allows for an almost 20 percent increase in top-1 performance. The presented integration establishes a strong baseline for future agentic systems, with clear potential for additional engineering gains.
In addition to empirical findings, this thesis contributes a reusable Python library (SnowUtils) that streamlines SNOMED CT access across Neo4j, Snowstorm, and Snowstormlite backends. The results demonstrate that ontology-aware embedding and retrieval methods can meaningfully improve semantic accuracy in clinical information systems, establishing a foundation for future ontology-integrated AI applications.
- Data Science and Engineering
Integration of SNOMED CT and Large Language Models for Medical Information Retrieval
Geppner, P. (Author). 2025
Student thesis: Master's Thesis