Abstract
Protein-protein interactions (PPIs) play a major role in every living organism.
High-throughput techniques have dramatically increased the number of known PPIs
in recent years, making it necessary to store PPI data in a structured way and
to make it available to the scientific community.
The first part of this diploma thesis introduces the data landscape of
proteins and PPIs. So-called secondary PPI databases provide a better understanding
of interactomes by integrating different primary data sources. Because these
sources overlap, secondary PPI databases suffer from data redundancy, so every
secondary PPI database needs mechanisms to detect duplicates.
The second part of the thesis introduces the duplicate detection mechanisms of three
secondary PPI databases. Subsequently, a new alignment-based duplicate
detection method is presented. The alignment algorithms, their parameters, and the
resulting alignment quality are evaluated, and the performance of alignment-based
duplicate detection is assessed on test data sets using different criteria. The
results show the advantages and disadvantages of this approach.
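
To illustrate the general idea of alignment-based duplicate detection, the following minimal Python sketch aligns two protein sequences globally and flags a pair of records as duplicates when their sequence identity reaches a threshold. The abstract does not state which alignment algorithms, scoring parameters, or decision criteria the thesis uses, so everything here is an assumption made for illustration: the choice of Needleman-Wunsch, the simple match/mismatch/gap scores, the 0.95 threshold, and the function names `global_alignment`, `identity`, and `find_duplicates`.

```python
from itertools import combinations


def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    The scoring values are illustrative assumptions, not the thesis's parameters.
    Returns the two aligned strings, with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def identity(a, b):
    """Fraction of alignment columns in which both sequences carry the same residue."""
    aligned_a, aligned_b = global_alignment(a, b)
    if not aligned_a:  # both sequences empty
        return 1.0
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


def find_duplicates(records, threshold=0.95):
    """Return ID pairs whose sequence identity is at least `threshold`.

    `records` maps a (hypothetical) protein ID to its amino acid sequence.
    The threshold is an assumed value chosen for this example.
    """
    return [
        (id_a, id_b)
        for id_a, id_b in combinations(sorted(records), 2)
        if identity(records[id_a], records[id_b]) >= threshold
    ]


if __name__ == "__main__":
    toy_records = {"A": "MKTAYIAK", "B": "MKTAYIAK", "C": "GGGGGGGG"}
    print(find_duplicates(toy_records))
```

This sketch compares every pair of records, which scales quadratically with the number of proteins, and it scores residues with a flat match/mismatch scheme. Protein alignments in practice commonly use a substitution matrix such as BLOSUM62 and affine gap penalties; whether the thesis does so is not stated in the abstract.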
Translated title of the contribution | Duplicate detection in integrated PPI data sets by pairwise sequence alignment |
---|---|
Original language | German |
Publication status | Accepted/In press - 2007 |
Keywords
- Data integration
- Duplicate detection
- Protein interactions
- PPI
- PPI databases