Evaluation of a large-scale biomedical data annotation initiative

Ronilda Lacson, Erik Pitzer, Christian Hinske, Pedro Galante, Lucila Ohno-Machado

Research output: Contribution to journal › Article › peer-review

9 Citations (Scopus)


Background: This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.

Results: There were 12,500 samples annotated with approximately 30 variables, in each of six disease categories: breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures.

Conclusion: We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.
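The two evaluation measures named in the abstract, variable coverage and strict inter-annotator agreement, can be illustrated with a minimal sketch. The data layout (a dictionary of samples mapping variable names to values, with `None` marking an unknown value), the function names, and the sample values are all assumptions made for illustration; the abstract does not give the authors' exact definitions, and the semantic and partial similarity measures are not reproduced here.

```python
from typing import Dict, Optional

# Hypothetical layout: sample id -> {variable name -> value, or None if unknown}.
Annotations = Dict[str, Dict[str, Optional[str]]]


def coverage(annotations: Annotations, variable: str) -> float:
    """Fraction of samples with a known (non-None) value for `variable`."""
    if not annotations:
        return 0.0
    known = sum(
        1 for sample in annotations.values() if sample.get(variable) is not None
    )
    return known / len(annotations)


def strict_agreement(a: Annotations, b: Annotations) -> float:
    """Fraction of (sample, variable) pairs, present for both annotators,
    on which the two values are exactly equal."""
    matches = 0
    total = 0
    for sample_id in a.keys() & b.keys():
        for variable in a[sample_id].keys() & b[sample_id].keys():
            total += 1
            if a[sample_id][variable] == b[sample_id][variable]:
                matches += 1
    return matches / total if total else 0.0


# Made-up example data, not taken from the study.
annotator_a: Annotations = {
    "s1": {"disease_state": "breast cancer", "tissue": "breast"},
    "s2": {"disease_state": "colon cancer", "tissue": None},
}
annotator_b: Annotations = {
    "s1": {"disease_state": "breast cancer", "tissue": "breast"},
    "s2": {"disease_state": "colon cancer", "tissue": "colon"},
}

tissue_coverage_a = coverage(annotator_a, "tissue")
agreement = strict_agreement(annotator_a, annotator_b)
```

Exact string equality is the strictest possible comparison; a semantic or partial measure would replace the `==` test with a similarity function over thesaurus terms.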

Original language: English
Article number: S10
Pages (from-to): 10
Number of pages: 6
Journal: BMC Bioinformatics
Issue number: SUPPL. 9
Publication status: Published - 17 Sept 2009


  • Algorithms
  • Computational Biology/methods
  • Databases, Genetic
  • Gene Expression Profiling
  • Information Storage and Retrieval

