Abstract
The anatomical location of colorectal carcinoma (CRC) influences therapy results as well as overall survival, underlining the importance of correctly classifying the tumours into proximal
or distal areas. Thus, four Machine Learning algorithms (Decision Trees, k-Nearest Neighbors,
Logistic Regression, and Support Vector Machines) were trained via nested cross-validation in
order to classify colorectal carcinomas into right-sided or left-sided, by using NGS and clinical
data provided by the Ordensklinikum Linz Barmherzige Schwestern, a hospital in Upper Austria.
The data comprised more than 2770 CRC patients. However, joining the data sets revealed
a large loss of somatic mutation data spanning multiple years of analyses, which
reduced the number of patients to just 112. The NGS data had to be filtered to six genes
(KRAS, NRAS, BRAF, PIK3CA, TP53, and APC) in order to mitigate the risk of overfitting.
For each ML algorithm, the model achieving the best test accuracy was determined. All
four models exceeded the baseline accuracy of 64%. While the LR, SVM, and kNN
models all achieved the same highest overall accuracy of 86.96%, both the highest balanced
accuracy (84.17%) and AUC (0.84) were achieved by the Logistic Regression model.
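
To illustrate the evaluation scheme, the following is a minimal sketch of nested cross-validation for the four classifier families using scikit-learn. It is not the thesis code: the data is synthetic (112 samples with six features, loosely mirroring the cohort size and gene count), and the hyperparameter grids, fold counts, feature scaling, and the majority-class reading of the baseline are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 112 samples, 6 features, imbalanced classes.
X, y = make_classification(
    n_samples=112, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.64, 0.36], random_state=0,
)

# Hypothetical hyperparameter grids, one per algorithm family.
candidates = {
    "DT": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 3, 5]}),
    "kNN": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 7]},
    ),
    "LR": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "SVM": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1.0, 10.0]},
    ),
}

# The inner loop tunes hyperparameters; the outer loop estimates generalization.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=inner, scoring="balanced_accuracy")
    scores = cross_val_score(search, X, y, cv=outer, scoring="balanced_accuracy")
    results[name] = scores.mean()

# Majority-class accuracy, used here as an assumed definition of the baseline.
baseline = np.bincount(y).max() / len(y)

for name, score in results.items():
    print(f"{name}: mean balanced accuracy = {score:.3f}")
print(f"majority-class baseline accuracy = {baseline:.3f}")
```

Because the hyperparameter search runs only on the training portion of each outer fold, the outer scores are not biased by tuning, which matters for a cohort as small as 112 patients. Balanced accuracy is used for scoring since it averages per-class recall and is therefore less flattering to a majority-class predictor than overall accuracy.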
Date of Award | 2024 |
---|---|
Original language | English (American) |
Supervisor | Gerald Lirk (Supervisor) |