Assessing the Viability of NMR Metabolomics Data for the Prognosis of Pulmonary Hypertension

  • Sebastian Pritz

    Student thesis: Master's Thesis

    Abstract

    Pulmonary Hypertension (PH) is a progressive lung disease characterized by reduced
    pulmonary artery flexibility, which can lead to right heart failure and death. Given
    the disease’s nonspecific symptoms and its relatively high global prevalence of 1%,
    there is a critical need for reliable diagnostic and prognostic methods. While existing detection methods are either invasive or rely on biomarkers that lack adequate
    diagnostic and prognostic capabilities, previous studies have explored the potential of
    metabolomic data. This thesis investigates the use of NMR data as an alternative approach for prognosis by conducting exploratory analysis and evaluating the performance
    of various machine learning methods for both classification and regression. Initial attempts using dimensionality reduction were inconclusive; however, Kaplan-Meier and
    Cox Proportional-Hazards analyses identified a substantial number of features related
    to survival. Machine learning models achieved moderate success in predicting three-year survival (F1-score of 0.75), but the resulting Kaplan-Meier and Cox-PH models
    were not statistically significant in the test set, as indicated by the three-year log-rank
    test (p-value of 0.054) and confidence intervals overlapping with the baseline hazard.
    Regression models performed poorly, likely because the limited data were insufficient for
    the complexity of regression tasks. To address the need for interpretability in the medical field, the thesis also applied genetic programming and NSGA-II, which produced
    results comparable to other machine learning models like SVM, RF, and XGBoost,
    while also providing interpretable mathematical models. Some models using NMR data
    demonstrated greater discriminative power in Kaplan-Meier and Cox-PH analyses than
    both COMPERA2 and FPHR4p on the test set. However, this could be attributed to
    challenges in data partitioning, as correlations between target and input features were
    found to be highly variable depending on the random seed used during data splitting.
    While this variability might suggest population heterogeneity, clustering approaches and
    exploratory analysis were unable to confirm this hypothesis. Further research is needed
    to validate these findings.
    Date of Award: 2024
    Original language: English (American)
    Supervisor: Ulrich Bodenhofer
