TY - GEN
T1 - A symbolic regression based scoring system improving peptide identification for MS Amanda
AU - Dorfer, Viktoria
AU - Maltsev, Sergey
AU - Dreiseitl, Stephan
AU - Mechtler, Karl
AU - Winkler, Stephan M.
PY - 2015/7/11
Y1 - 2015/7/11
N2 - Peptide search engines are algorithms that identify peptides (i.e., short proteins or parts of proteins) from mass spectra of biological samples. These identification algorithms report the best matching peptide for a given spectrum and a score that represents the quality of the match; usually, the higher this score, the more reliable the respective match. To estimate the specificity and sensitivity of search engines, the identification algorithm is given sets of target sequences as well as so-called decoy sequences, which are randomly created or scrambled versions of real sequences; decoy sequences should be assigned low scores, whereas target sequences should be assigned high scores. In this paper we present an approach based on symbolic regression (using genetic programming) that helps to distinguish between target and decoy matches. On the basis of features calculated for matched sequences, and using the information on the original sequence set (target or decoy), we learn mathematical models that calculate updated scores. As an alternative to this white box modeling approach we also use a black box modeling method, namely random forests. As we show in the empirical section of this paper, this approach leads to scores that increase the number of reliably identified samples that are originally scored using the MS Amanda identification algorithm, for high resolution as well as for low resolution mass spectra.
AB - Peptide search engines are algorithms that identify peptides (i.e., short proteins or parts of proteins) from mass spectra of biological samples. These identification algorithms report the best matching peptide for a given spectrum and a score that represents the quality of the match; usually, the higher this score, the more reliable the respective match. To estimate the specificity and sensitivity of search engines, the identification algorithm is given sets of target sequences as well as so-called decoy sequences, which are randomly created or scrambled versions of real sequences; decoy sequences should be assigned low scores, whereas target sequences should be assigned high scores. In this paper we present an approach based on symbolic regression (using genetic programming) that helps to distinguish between target and decoy matches. On the basis of features calculated for matched sequences, and using the information on the original sequence set (target or decoy), we learn mathematical models that calculate updated scores. As an alternative to this white box modeling approach we also use a black box modeling method, namely random forests. As we show in the empirical section of this paper, this approach leads to scores that increase the number of reliably identified samples that are originally scored using the MS Amanda identification algorithm, for high resolution as well as for low resolution mass spectra.
KW - Peptide identification
KW - Proteomics
KW - Symbolic regression
UR - http://www.scopus.com/inward/record.url?scp=84959449110&partnerID=8YFLogxK
U2 - 10.1145/2739482.2768509
DO - 10.1145/2739482.2768509
M3 - Conference contribution
AN - SCOPUS:84959449110
SN - 9781450334884
T3 - GECCO 2015 - Companion Publication of the 2015 Genetic and Evolutionary Computation Conference
SP - 1335
EP - 1341
BT - GECCO 2015 - Companion Publication of the 2015 Genetic and Evolutionary Computation Conference
A2 - Silva, Sara
PB - Association for Computing Machinery, Inc
T2 - 17th Genetic and Evolutionary Computation Conference, GECCO 2015
Y2 - 11 July 2015 through 15 July 2015
ER -