TY - GEN
T1 - Data based prediction of cancer diagnoses using heterogeneous model ensembles - A case study for breast cancer, melanoma, and cancer in the respiratory system
AU - Winkler, Stephan M.
AU - Affenzeller, Michael
AU - Stekel, Herbert
AU - Schaller, Susanne
PY - 2014
Y1 - 2014
AB - In this paper we discuss heterogeneous estimation model ensembles for cancer diagnoses produced using various machine learning algorithms. Based on patients' data records including standard blood parameters, tumor markers, and information about the diagnosis of tumors, the goal is to identify mathematical models for estimating cancer diagnoses. Several machine learning approaches implemented in HeuristicLab and WEKA have been applied to identify estimators for selected cancer diagnoses: k-nearest neighbor learning, decision trees, artificial neural networks, support vector machines, random forests, and genetic programming. The models produced using these methods have been combined into heterogeneous model ensembles. All models trained during the learning phase are applied during the test phase; the final classification is annotated with a confidence value that specifies how reliable the models are regarding the presented decision. We calculate the final estimation for each sample via majority voting, and the relative share of a sample's majority vote is used to calculate the confidence in the final estimation. We use a confidence threshold that specifies the minimum confidence level that has to be reached; if this threshold is not reached for a sample, no prediction is made for that sample. As we show in the results section, the accuracies of diagnoses of breast cancer, melanoma, and respiratory system cancer can thus be increased significantly. Increasing the confidence threshold leads to higher classification accuracies, at the cost of a significantly smaller proportion of samples for which a classification is given.
KW - Cancer diagnosis estimation
KW - Data mining
KW - Machine learning
KW - Statistical analysis
KW - Tumor marker data
UR - http://www.scopus.com/inward/record.url?scp=84905657087&partnerID=8YFLogxK
U2 - 10.1145/2598394.2609853
DO - 10.1145/2598394.2609853
M3 - Conference contribution
AN - SCOPUS:84905657087
SN - 9781450328814
T3 - GECCO 2014 - Companion Publication of the 2014 Genetic and Evolutionary Computation Conference
SP - 1337
EP - 1344
BT - GECCO 2014 - Companion Publication of the 2014 Genetic and Evolutionary Computation Conference
PB - Association for Computing Machinery
T2 - 16th Genetic and Evolutionary Computation Conference, GECCO 2014
Y2 - 12 July 2014 through 16 July 2014
ER -