TY - GEN
T1 - Comparing Methods for Estimating Marginal Likelihood in Symbolic Regression
AU - Leser, Patrick
AU - Bomarito, Geoffrey
AU - Kronberger, Gabriel
AU - Olivetti De França, Fabrício
N1 - Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/7/14
Y1 - 2024/7/14
N2 - Marginal likelihood has been proposed as a genetic programming-based symbolic regression (GPSR) fitness metric to prevent overly complex expressions and overfitting, particularly when data is limited and noisy. Here, two particular methods for estimating marginal likelihood - the Laplace approximation and sequential Monte Carlo - are studied with a focus on tradeoffs between accuracy and computational efficiency. The comparison focuses on practical challenges in the context of two sets of example problems. First, the methods are compared on handcrafted expressions exhibiting nonlinearity and multimodality in their respective posterior distributions. Next, the methods are compared on a real-world set of equations produced by GPSR using training data from a well-known symbolic regression benchmark. A key finding is that there are potentially significant differences between the methods that, for example, could lead to conflicting selection of expressions within a GPSR implementation. However, it is concluded that there are scenarios where either method could be preferred over the other based on accuracy or computational budget. Algorithmic improvements for both methods as well as future areas of study are discussed.
KW - equation learning
KW - marginal likelihood
KW - model selection
KW - symbolic regression
UR - http://www.scopus.com/inward/record.url?scp=85201932439&partnerID=8YFLogxK
U2 - 10.1145/3638530.3664142
DO - 10.1145/3638530.3664142
M3 - Conference contribution
AN - SCOPUS:85201932439
T3 - GECCO 2024 Companion - Proceedings of the 2024 Genetic and Evolutionary Computation Conference Companion
SP - 2058
EP - 2066
BT - GECCO 2024 Companion - Proceedings of the 2024 Genetic and Evolutionary Computation Conference Companion
PB - Association for Computing Machinery, Inc
T2 - 2024 Genetic and Evolutionary Computation Conference Companion, GECCO 2024 Companion
Y2 - 14 July 2024 through 18 July 2024
ER -