Identifying tandem mass spectra of phosphorylated peptides before database search using machine-learning

Dorl, S. (Speaker)

Activity: Talk or presentation (Oral presentation)


INTRODUCTION Identification of post-translational modifications (PTMs) such as phosphorylation is of high interest in proteomics research, since modified proteins are often important for biological function. To identify modified peptides in tandem mass spectrometry data, database search engines consider the selected PTMs for every spectrum in a sample. Selecting many different PTMs together drastically increases the search space, leading to longer search times and more false positive peptide identifications. To counteract this, we propose using machine-learning-trained models that reliably identify, before database search, those spectra that are highly likely to represent phosphorylated peptides.

METHODS Our goal is to limit the database search for phosphorylation as a variable modification to only those spectra in a sample that are most likely to originate from phosphorylated peptides. For this purpose, we use a classification model that separates the raw MS/MS spectra into phosphorylated and non-phosphorylated before they are submitted to database search. After the spectra are split accordingly, each batch is searched separately using MS Amanda (ProteomeDiscoverer 2.1, 5 ppm precursor mass tolerance, 20 ppm fragment mass tolerance, modifications: static carbamidomethylation of cysteine, variable oxidation of methionine and variable phosphorylation of serine or threonine, UniProt Swiss-Prot protein database). The results for the phosphorylated and non-phosphorylated batches are then independently filtered to 1% FDR and combined. The classification model uses a set of features that represent the occurrences and relative intensities of all neutral losses commonly observed in tandem mass spectra of phosphorylated peptides. The model was constructed with the random forest algorithm for supervised learning on a training data set consisting of 161,379 phosphorylated spectra and 164,859 non-phosphorylated spectra.
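As a rough illustration of the kind of neutral-loss feature described above, the following Python sketch computes the relative intensity of precursor peaks that have lost H3PO4 (97.9769 Da) or HPO3 (79.9663 Da). The abstract does not list the actual feature set, so the function name `neutral_loss_features`, the restriction to precursor losses, the 20 ppm matching window, and the threshold rule `looks_phosphorylated` (a simple stand-in for the trained random forest) are all assumptions made for illustration.

```python
# Hypothetical sketch: neutral-loss features for one MS/MS spectrum.
# Names, the feature definition and the threshold are illustrative assumptions.

H3PO4 = 97.97690  # monoisotopic mass of phosphoric acid (Da)
HPO3 = 79.96633   # monoisotopic mass of metaphosphoric acid (Da)


def neutral_loss_features(precursor_mz, charge, peaks, tol_ppm=20.0):
    """Relative intensity (0..1) of precursor neutral-loss peaks.

    peaks is a list of (mz, intensity) tuples; intensities are scaled
    to the most intense peak in the spectrum (the base peak).
    """
    base = max((intensity for _, intensity in peaks), default=0.0)
    features = {}
    for name, loss in (("H3PO4", H3PO4), ("HPO3", HPO3)):
        # A neutral loss shifts the precursor m/z by loss / charge.
        target = precursor_mz - loss / charge
        tol = target * tol_ppm * 1e-6
        hits = [i for mz, i in peaks if abs(mz - target) <= tol]
        features[name] = max(hits) / base if hits and base > 0 else 0.0
    return features


def looks_phosphorylated(features, threshold=0.1):
    """Placeholder decision rule standing in for the trained classifier."""
    return max(features.values()) >= threshold


def split_spectra(spectra):
    """Split spectra into (phosphorylated, non_phosphorylated) batches.

    Each spectrum is a dict with keys 'precursor_mz', 'charge', 'peaks'.
    """
    phospho, other = [], []
    for s in spectra:
        f = neutral_loss_features(s["precursor_mz"], s["charge"], s["peaks"])
        (phospho if looks_phosphorylated(f) else other).append(s)
    return phospho, other
```

In a real workflow the feature dictionary would be turned into a feature vector and passed to a trained random forest rather than a fixed threshold; each returned batch would then be written out and searched separately.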
We obtained the training spectra by combining data from several phosphopeptide-enrichment experiments publicly available in the PRIDE repository, including a variety of human cell lines and different kinds of mouse tissue samples (exclusively high-accuracy fragment ion spectra measured on QExactive instruments). The raw data were re-analyzed using MS Amanda, and the resulting matches were filtered for high-confidence identifications (<1% FDR using Percolator and a search engine score >200).

RESULTS The classification model for fragment spectra of phosphorylated peptides achieved an average accuracy of 97.1% in 10-fold cross-validation experiments while correctly identifying an average of 96.4% of phosphorylated spectra. The model-assisted workflow was tested using data from an experiment on mouse kidney samples that was completely separate from the training data pool. In this case, the model-assisted workflow reduced the total search space of the database search by 45.6% while still yielding 99.4% as many peptide spectrum matches and 99.5% as many identified unique peptides as an equivalent standard workflow. Besides significantly reducing search time, removing non-phosphorylated spectra before database search also reduces the number of false positive identifications, which can lead to more identifications at the same FDR. Subsequently, we tested the split workflow in conditions where high numbers of false positive identifications are a major issue. These include database searches with many possible variable modifications, searches in big uncurated databases, and database searches in phosphoproteome experiments without phosphopeptide enrichment.
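The step of filtering each batch independently to 1% FDR and then combining the results can be sketched with a plain target-decoy estimate. This is a simplified stand-in, not the Percolator-based filtering used in the study: the function names `filter_fdr` and `combine_batches` are hypothetical, and the sketch assumes each peptide spectrum match (PSM) is a `(score, is_decoy)` pair where a higher score is better.

```python
# Hypothetical sketch: per-batch target-decoy FDR filtering, then merging.
# The FDR is estimated as (#decoys / #targets) above a score cutoff.


def filter_fdr(psms, max_fdr=0.01):
    """Return target PSMs in the largest top-scoring set with FDR <= max_fdr.

    psms is a list of (score, is_decoy) tuples.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for idx, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimate is still acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = idx
    return [p for p in ranked[:cutoff] if not p[1]]


def combine_batches(phospho_psms, other_psms, max_fdr=0.01):
    """Filter each batch on its own, then concatenate the accepted PSMs."""
    return filter_fdr(phospho_psms, max_fdr) + filter_fdr(other_psms, max_fdr)
```

Because each batch gets its own score cutoff, decoy hits that accumulate in one batch do not tighten the cutoff for the other, which is one way a split search can retain more identifications at the same FDR.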
Period: 12 Jan 2017
Event title: 2017 EuBIC Winter School on proteomics bioinformatics
Event type: Conference
Location: Semmering, Austria