Data Aggregation for Reducing Training Data in Symbolic Regression

Research output: Chapter in Book/Report/Conference proceedings, Conference contribution, peer-review

Abstract

The growing volume of data makes computationally intensive machine learning techniques, such as symbolic regression with genetic programming, increasingly impractical. This work discusses methods to reduce the training data and thereby the runtime of genetic programming. The data is aggregated in a preprocessing step before the actual machine learning algorithm is run. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effect on the trained models’ test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with that of random forests and linear regression. The results show that k-means and random sampling lead to a very small loss in test accuracy when the data is reduced to only 30% of its original size, while the speed-up is proportional to the size of the data set. Binning, in contrast, leads to models with very high test error.
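
The two reduction methods that the abstract reports as working well can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each row is a tuple of feature values followed by the target, that k-means runs on this joint vector, and that each centroid then serves as one aggregated training row. The function names `random_sample` and `kmeans_aggregate`, the toy data, and the plain Lloyd's iteration are all hypothetical choices made for illustration.

```python
import random


def random_sample(rows, fraction, seed=0):
    """Keep a uniformly random subset containing `fraction` of the rows."""
    rng = random.Random(seed)
    n_keep = max(1, round(fraction * len(rows)))
    return rng.sample(rows, n_keep)


def kmeans_aggregate(rows, k, n_iter=20, seed=0):
    """Replace the rows by k centroids found with plain Lloyd's algorithm.

    Assumption (not from the abstract): each row is (x_1, ..., x_d, y) and
    clustering runs on the joint vector, so a centroid is one aggregated row.
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    # Initialise the centroids with k distinct rows chosen at random.
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        # Assignment step: group rows by nearest centroid (squared Euclidean).
        members = [[] for _ in range(k)]
        for r in rows:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])),
            )
            members[nearest].append(r)
        # Update step: move each non-empty cluster's centroid to its mean.
        for c, group in enumerate(members):
            if group:
                centroids[c] = [
                    sum(r[j] for r in group) / len(group) for j in range(dim)
                ]
    return [tuple(c) for c in centroids]


# Toy data set (hypothetical): 100 rows of (x, y) with y = x^2.
data_rng = random.Random(42)
data = [(x, x ** 2) for x in (data_rng.uniform(-1, 1) for _ in range(100))]

# Reduce to 30% of the original size with both methods.
reduced_random = random_sample(data, 0.3)
reduced_kmeans = kmeans_aggregate(data, k=30)
```

Random sampling keeps original rows unchanged, whereas the k-means variant produces new, averaged rows. Either reduced set would then be passed to the learner in place of the full training data.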

Original language: English
Title of host publication: Computer Aided Systems Theory – EUROCAST 2019 - 17th International Conference, Revised Selected Papers
Editors: Roberto Moreno-Díaz, Alexis Quesada-Arencibia, Franz Pichler
Publisher: Springer
Pages: 378-386
Number of pages: 9
ISBN (Print): 9783030450922
Publication status: Published - 2020
Event: 17th International Conference on Computer Aided Systems Theory, EUROCAST 2019 - Las Palmas de Gran Canaria, Spain
Duration: 17 Feb 2019 – 22 Feb 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12013 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Conference on Computer Aided Systems Theory, EUROCAST 2019
Country/Territory: Spain
City: Las Palmas de Gran Canaria
Period: 17.02.2019 – 22.02.2019

Keywords

  • Machine learning
  • Sampling
  • Symbolic regression
