TY - JOUR
T1 - Evaluation of a large-scale biomedical data annotation initiative
AU - Lacson, Ronilda
AU - Pitzer, Erik
AU - Hinske, Christian
AU - Galante, Pedro
AU - Ohno-Machado, Lucila
N1 - Funding Information:
The authors would like to thank the annotators who worked diligently on this project: Evelyn Pitzer, Pierre Cornell, Karrie Du, Lindy Su and Anthony Villanova. Galante was funded by grant D43TW007015 from the Fogarty International Center, NIH. This work was funded in part by grant FAS0703850 from the Komen Foundation.
PY - 2009/9/17
Y1 - 2009/9/17
N2 - Background: This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators. Results: There were 12,500 samples annotated with approximately 30 variables, in each of six disease categories - breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures. Conclusion: We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.
AB - Background: This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators. Results: There were 12,500 samples annotated with approximately 30 variables, in each of six disease categories - breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures. Conclusion: We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.
KW - Algorithms
KW - Computational Biology/methods
KW - Databases, Genetic
KW - Gene Expression Profiling
KW - Information Storage and Retrieval
UR - http://www.scopus.com/inward/record.url?scp=70349866711&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-10-S9-S10
DO - 10.1186/1471-2105-10-S9-S10
M3 - Article
C2 - 19761564
VL - 10
SP - 10
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - SUPPL. 9
M1 - S10
ER -