Tampere University of Technology

TUTCRIS Research Portal

Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline

Research output: Contribution to journal, article, scientific, peer-reviewed

Gene set analysis approaches for RNA-seq data : performance evaluation and application guideline. / Rahmatallah, Yasir; Emmert-Streib, Frank; Glazko, Galina.

In: Briefings in Bioinformatics, Vol. 17, No. 3, 16.05.2016, p. 393-407.

BibTeX

@article{80d58e332dcb4a48bf229f826983c8af,
title = "Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline",
abstract = "Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in a form of functionally related gene sets) into the analysis. Here we provide a brief review of several statistically different GSA approaches (competitive and self-contained) that can be adapted from microarrays practice as well as those specifically designed for RNA-seq. We evaluate their performance (in terms of Type I error rate, power, robustness to the sample size and heterogeneity, as well as the sensitivity to different types of selection biases) on simulated and real RNA-seq data. Not surprisingly, the performance of various GSA approaches depends only on the statistical hypothesis they test and does not depend on whether the test was developed for microarrays or RNA-seq data. Interestingly, we found that competitive methods have lower power as well as robustness to the samples heterogeneity than self-contained methods, leading to poor results reproducibility. We also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in tested gene sets. These properties of competitive methods have been overlooked before. Our evaluation provides a concise guideline for selecting GSA approaches, best performing under particular experimental settings in the context of RNA-seq.",
author = "Yasir Rahmatallah and Frank Emmert-Streib and Galina Glazko",
note = "{\textcopyright} The Author 2015. Published by Oxford University Press.",
year = "2016",
month = "5",
day = "16",
doi = "10.1093/bib/bbv069",
language = "English",
volume = "17",
pages = "393--407",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford University Press",
number = "3",
}

RIS (suitable for import to EndNote)

TY  - JOUR
T1  - Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline
AU  - Rahmatallah, Yasir
AU  - Emmert-Streib, Frank
AU  - Glazko, Galina
N1  - © The Author 2015. Published by Oxford University Press.
PY  - 2016/5/16
Y1  - 2016/5/16
AB  - Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in a form of functionally related gene sets) into the analysis. Here we provide a brief review of several statistically different GSA approaches (competitive and self-contained) that can be adapted from microarrays practice as well as those specifically designed for RNA-seq. We evaluate their performance (in terms of Type I error rate, power, robustness to the sample size and heterogeneity, as well as the sensitivity to different types of selection biases) on simulated and real RNA-seq data. Not surprisingly, the performance of various GSA approaches depends only on the statistical hypothesis they test and does not depend on whether the test was developed for microarrays or RNA-seq data. Interestingly, we found that competitive methods have lower power as well as robustness to the samples heterogeneity than self-contained methods, leading to poor results reproducibility. We also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in tested gene sets. These properties of competitive methods have been overlooked before. Our evaluation provides a concise guideline for selecting GSA approaches, best performing under particular experimental settings in the context of RNA-seq.
DO  - 10.1093/bib/bbv069
M3  - Article
VL  - 17
SP  - 393
EP  - 407
JO  - Briefings in Bioinformatics
JF  - Briefings in Bioinformatics
SN  - 1467-5463
IS  - 3
ER  -