Duplicate detection in integrated PPI data sets by pairwise sequence alignment

Research output: Types of Theses › Master's Thesis / Diploma Thesis


Protein-protein interactions (PPIs) play a major role in every living organism. High-throughput techniques have dramatically increased the number of known PPIs in recent years, making it necessary to store PPI data in a structured way and to make it available to the scientific community. The first part of this diploma thesis introduces the data landscape of proteins and PPIs. So-called secondary PPI databases provide a better understanding of interactomes by integrating different primary data sources. Because the source data overlap, secondary PPI databases suffer from data redundancy, so every such database needs mechanisms to detect duplicates. The second part of the thesis introduces the duplicate detection mechanisms of three secondary PPI databases and then presents a new alignment-based duplicate detection method. The alignment algorithms, their parameters, and the resulting alignment quality are evaluated, and the performance of alignment-based duplicate detection is assessed on test data sets using different criteria. The results show the advantages and disadvantages of this approach.
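The abstract does not spell out the thesis's algorithm, so the following is only a minimal sketch of what alignment-based duplicate detection could look like. It assumes a plain Needleman-Wunsch global alignment with simple match/mismatch/gap scores, a hypothetical identity threshold of 0.95, and interactions represented as pairs of amino-acid sequences. None of these choices (function names, scores, threshold) come from the thesis.

```python
def global_alignment_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences (Needleman-Wunsch) and return the
    fraction of alignment columns that hold identical residues.
    The scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 1.0
    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting columns and identities.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                identical += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return identical / columns


def is_duplicate(x, y, threshold=0.95):
    """Treat two interactions (pairs of sequences) as duplicates if both
    partners align above the (hypothetical) identity threshold, in
    either orientation, since a PPI is an unordered pair."""
    def same(p, q):
        return global_alignment_identity(p, q) >= threshold
    return (same(x[0], y[0]) and same(x[1], y[1])) or \
           (same(x[0], y[1]) and same(x[1], y[0]))
```

Comparing sequences rather than database identifiers is what would let such a check match the same protein recorded under different accession schemes in different primary sources. The quadratic cost per comparison is one reason parameters and performance would need the kind of evaluation the abstract describes.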
Translated title of the contribution: Duplicate detection in integrated PPI data sets by pairwise sequence alignment
Original language: German
Publication status: Accepted/In press - 2007


  • Data integration
  • Duplicate detection
  • Protein interactions
  • PPI
  • PPI databases
