Protein-protein interactions (PPIs) play a major role in every living organism. With the use of high-throughput techniques, the number of known PPIs has increased dramatically in recent years. It has therefore become necessary to store PPI data in a structured way and to make it available to the scientific community. The first part of this diploma thesis gives an introduction to the data landscape of proteins and PPIs. So-called secondary PPI databases provide a better understanding of interactomes by integrating different primary data sources. The problem with secondary PPI databases is data redundancy caused by overlapping source data, so every secondary PPI database needs mechanisms to detect duplicates. The second part of the thesis introduces the duplicate detection mechanisms of three secondary PPI databases. Subsequently, a new alignment-based duplicate detection method is presented. The algorithms used to calculate the alignments, their parameters, and the alignment quality are evaluated, and the performance of alignment-based duplicate detection is assessed on test data sets using different criteria. The results show the advantages and disadvantages of this approach.
|Translated title of the contribution||Duplicate detection in integrated PPI data sets by pairwise sequence alignment|
|Publication status||Accepted/In press - 2007|