TY - JOUR
T1 - Ensuring the statistical soundness of competitive gene set approaches
T2 - Gene filtering and genome-scale coverage are essential
AU - Tripathi, Shailesh
AU - Glazko, Galina V.
AU - Emmert-Streib, Frank
PY - 2013/4
Y1 - 2013/4
N2 - In this article, we focus on the analysis of competitive gene set methods for detecting the statistical significance of pathways from gene expression data. Our main result is to demonstrate that some of the most frequently used gene set methods, GSEA, GSEArot and GAGE, are severely influenced by the filtering of the data in a way that such an analysis is no longer reconcilable with the principles of statistical inference, rendering the obtained results in the worst case inexpressive. A possible consequence of this is that these methods can increase their power by the addition of unrelated data and noise. Our results are obtained within a bootstrapping framework that allows a rigorous assessment of the robustness of results and enables power estimates. Our results indicate that when using competitive gene set methods, it is imperative to apply a stringent gene filtering criterion. However, even when genes are filtered appropriately, for gene expression data from chips that do not provide a genome-scale coverage of the expression values of all mRNAs, this is not enough for GSEA, GSEArot and GAGE to ensure the statistical soundness of the applied procedure. For this reason, for biomedical and clinical studies, we strongly advise not to use GSEA, GSEArot and GAGE for such data sets.
AB - In this article, we focus on the analysis of competitive gene set methods for detecting the statistical significance of pathways from gene expression data. Our main result is to demonstrate that some of the most frequently used gene set methods, GSEA, GSEArot and GAGE, are severely influenced by the filtering of the data in a way that such an analysis is no longer reconcilable with the principles of statistical inference, rendering the obtained results in the worst case inexpressive. A possible consequence of this is that these methods can increase their power by the addition of unrelated data and noise. Our results are obtained within a bootstrapping framework that allows a rigorous assessment of the robustness of results and enables power estimates. Our results indicate that when using competitive gene set methods, it is imperative to apply a stringent gene filtering criterion. However, even when genes are filtered appropriately, for gene expression data from chips that do not provide a genome-scale coverage of the expression values of all mRNAs, this is not enough for GSEA, GSEArot and GAGE to ensure the statistical soundness of the applied procedure. For this reason, for biomedical and clinical studies, we strongly advise not to use GSEA, GSEArot and GAGE for such data sets.
KW - Breast Neoplasms/genetics
KW - Data Interpretation, Statistical
KW - Female
KW - Gene Expression Profiling/methods
KW - Gene Expression Regulation, Neoplastic
KW - Genomics/methods
KW - Humans
KW - Male
KW - Oligonucleotide Array Sequence Analysis
KW - Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics
KW - Prostatic Neoplasms/genetics
KW - Sample Size
UR - http://www.scopus.com/inward/record.url?scp=84876544061&partnerID=8YFLogxK
U2 - 10.1093/nar/gkt054
DO - 10.1093/nar/gkt054
M3 - Article
C2 - 23389952
AN - SCOPUS:84876544061
SN - 0305-1048
VL - 41
SP - e82
JO - Nucleic Acids Research
JF - Nucleic Acids Research
IS - 7
ER -