A key goal of genetic research is the identification of variants underlying specific phenotypes. Missense mutations among these variants can have a significant impact on the structure and function of proteins in a cell. Genomics has proven successful in identifying somatic variants on a large scale. However, advances in mass spectrometry instrumentation now enable proteomics to capture almost complete proteomes. The integration of these proteomes with the underlying genomics datasets facilitates the identification of alterations not captured in reference protein databases. Proteogenomics uses customized protein sequence databases that incorporate genetic alterations and gene predictions, so that database searching can identify variant peptides. Furthermore, de novo identification of variants through spectrum clustering also yields peptide sequences not present in a reference database. To provide annotators with evidence for expressed variants and to prevent the identification of false novel protein-coding loci, it is crucial to map identified variant peptides to reference proteins. Imperfect string matching is a nontrivial problem, and numerous solutions have been proposed for text mining, ranging from alignment algorithms to suffix trees. Time-consuming preprocessing steps, however, make them impractical for proteogenomics applications. Here we describe a fast method to identify all occurrences of peptides with up to two mismatches in a reference protein database through k-mer indexing of the database and peptide-length-dependent generation of all mismatch combinations within a peptide. Our algorithm is incorporated in the genome mapping tool PoGo. To demonstrate the effectiveness of our method, we performed imperfect matching of 230,000 unique high-confidence peptide sequences from three large-scale human tissue proteome datasets against the translation of annotated protein-coding transcripts from GENCODE (v20) to resolve high-complexity regions in the protein-coding genome.
Our tool outperformed another tool, PGx, in a benchmark and identified two additional blocks of 8 amino acids in a repeat region of the SPRR3 gene. These were validated through peptides identified with the exact sequence. Furthermore, our method was able to map peptides across species between human and mouse. These data show that our method is effective in identifying parent reference proteins for variant peptides, even across the phylogenetic tree. Genomics efforts to sequence different strains of the same species, together with a lack of visualization tools, pose the challenge of mapping identified peptides onto the reference. Our algorithm has been developed with this in mind: we anticipate it will be of central utility for the interpretation of cross-strain and cross-species datasets as well as for studies on personal variation and precision medicine.
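
The matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not PoGo's implementation: the function names, the k-mer length of 5, and the rule that caps the number of mismatches by peptide length are all assumptions made for illustration. Instead of enumerating mismatch combinations, the sketch uses a related seed-and-verify scheme: every k-mer of the peptide is looked up in the index, and each candidate position is checked by counting mismatches. For a peptide of length L with at most m mismatches, at least one k-mer is guaranteed to be mismatch-free when L >= k * (m + 1), which is why the cap depends on peptide length.

```python
from collections import defaultdict


def build_kmer_index(proteins, k=5):
    """Map every k-mer to the (protein name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in proteins.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def allowed_mismatches(peptide_len, k, max_mismatches=2):
    """Hypothetical length-dependent cap (an assumption of this sketch).

    With m mismatches, a peptide of length >= k * (m + 1) always contains
    at least one k-mer without mismatches, so seeding cannot miss a hit.
    """
    return max(0, min(max_mismatches, peptide_len // k - 1))


def find_matches(peptide, proteins, index, k=5, max_mismatches=2):
    """Return sorted (protein name, start, mismatch count) tuples.

    Peptides shorter than k produce no seeds and are therefore not searched.
    """
    limit = allowed_mismatches(len(peptide), k, max_mismatches)
    hits = set()
    for offset in range(len(peptide) - k + 1):
        seed = peptide[offset:offset + k]
        for name, pos in index.get(seed, ()):
            start = pos - offset  # where the whole peptide would begin
            seq = proteins[name]
            if start < 0 or start + len(peptide) > len(seq):
                continue
            window = seq[start:start + len(peptide)]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            if mismatches <= limit:
                hits.add((name, start, mismatches))
    return sorted(hits)


# Toy example with a made-up protein; the peptide differs at two positions.
proteins = {"P1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}
index = build_kmer_index(proteins)
print(find_matches("AYIARQRQITFVKSH", proteins, index))
```

The index is built once per database and reused for every peptide, so the per-peptide cost is dominated by the number of seed hits that need verification.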
|Title of host publication||Proceedings of the 2017 EuBIC Winter School on proteomics bioinformatics|
|Publication status||Published - 2017|
|Event||2017 EuBIC Winter School on proteomics bioinformatics - Semmering, Austria; Duration: 10 Jan 2017 → 13 Jan 2017|