TY - GEN
T1 - Data Aggregation for Reducing Training Data in Symbolic Regression
AU - Kammerer, Lukas
AU - Kronberger, Gabriel
AU - Kommenda, Michael
N1 - Funding Information:
The authors gratefully acknowledge support by the Austrian Research Promotion Agency (FFG) within project #867202, as well as the Christian Doppler Research Association and the Federal Ministry of Digital and Economic Affairs within the Josef Ressel Centre for Symbolic Regression.
Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - The growing volume of data makes the use of computationally intensive machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models’ test accuracy for each method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means clustering and random sampling lead to only a very small loss in test accuracy when the data is reduced to 30% of the original data, while the speed-up is proportional to the size of the data set. Binning, on the contrary, leads to models with very high test error.
AB - The growing volume of data makes the use of computationally intensive machine learning techniques such as symbolic regression with genetic programming more and more impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models’ test accuracy for each method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means clustering and random sampling lead to only a very small loss in test accuracy when the data is reduced to 30% of the original data, while the speed-up is proportional to the size of the data set. Binning, on the contrary, leads to models with very high test error.
KW - Machine learning
KW - Sampling
KW - Symbolic regression
UR - http://www.scopus.com/inward/record.url?scp=85083997369&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-45093-9_46
DO - 10.1007/978-3-030-45093-9_46
M3 - Conference contribution
SN - 9783030450922
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 378
EP - 386
BT - Computer Aided Systems Theory – EUROCAST 2019 - 17th International Conference, Revised Selected Papers
A2 - Moreno-Díaz, Roberto
A2 - Quesada-Arencibia, Alexis
A2 - Pichler, Franz
PB - Springer
T2 - 17th International Conference on Computer Aided Systems Theory, EUROCAST 2019
Y2 - 17 February 2019 through 22 February 2019
ER -