Assessing the Viability of NMR Metabolomics Data for the Prognosis of Pulmonary Hypertension

  • Sebastian Pritz

Student thesis: Master's Thesis

Abstract

Pulmonary Hypertension (PH) is a progressive lung disease characterized by reduced pulmonary artery flexibility, which can lead to right heart failure and death. Given the disease’s nonspecific symptoms and its relatively high global prevalence of 1%, there is a critical need for reliable diagnostic and prognostic methods. While existing detection methods are either invasive or rely on biomarkers that lack adequate diagnostic and prognostic capabilities, previous studies have explored the potential of metabolomic data. This thesis investigates the use of NMR data as an alternative approach for prognosis by conducting exploratory analysis and evaluating the performance of various machine learning methods for both classification and regression. Initial attempts using dimensionality reduction were inconclusive; however, Kaplan-Meier and Cox Proportional-Hazards analyses identified a substantial number of features related to survival. Machine learning models achieved moderate success in predicting three-year survival (F1-score of 0.75), but the resulting Kaplan-Meier and Cox-PH models were not statistically significant in the test set, as indicated by the three-year log-rank test (p-value of 0.054) and confidence intervals overlapping with the baseline hazard. Regression models performed poorly, likely due to the limited data being insufficient for the complexity of regression tasks. To address the need for interpretability in the medical field, the thesis also applied genetic programming and NSGA-II, which produced results comparable to other machine learning models like SVM, RF, and XGBoost, while also providing interpretable mathematical models. Some models using NMR data demonstrated greater discriminative power in Kaplan-Meier and Cox-PH analyses than both COMPERA2 and FPHR4p on the test set. However, this could be attributed to challenges in data partitioning, as correlations between target and input features were found to be highly variable depending on the random seed used during data splitting. While this variability might suggest population heterogeneity, clustering approaches and exploratory analysis were unable to confirm this hypothesis. Further research is needed to validate these findings.
Date of Award: 2024
Original language: English (American)
Supervisor: Ulrich Bodenhofer
