Abstract

A central goal of biology is understanding and describing the molecular basis of plasticity: the sets of genes that are combinatorially selected by exogenous and endogenous environmental changes, and the relations among the genes. The most viable current approach to this problem consists of determining whether sets of genes are connected by some common theme, e.g. genes from the same pathway are overrepresented among those whose differential expression in response to a perturbation is most pronounced. There are many approaches to this problem, and the results they produce show a fair amount of dispersion, but they all fall within a common framework consisting of a few basic components. We critically review these components, suggest best practices for carrying out each step, and propose a voting method for meeting the challenge of assessing different methods on a large number of experimental data sets in the absence of a gold standard.

INTRODUCTION

Understanding complex polygenic phenotypes—stage of differentiation, disease state, responsiveness to exogenous perturbations and so on—requires a combination of high-performance experimental and analytical methods for identifying related sets of genes (e.g. genes in pathways or functional classifications) associated with phenotypic changes. Identification generally means the discovery of gene sets that were not previously known to be related, as well as the determination of which sets among a known collection are related (e.g. [1]). The former, more difficult problem is also known as the pathway reconstruction or pathway annotation problem and is discussed elsewhere [2–5]; here we focus on the latter, including the topological structure of the sets.

Early methods for associating gene sets with phenotype changes first identify individual, potentially relevant genes by making a binary decision based on a quantity that measures the extent of differential expression between the phenotypes (e.g. performing a t-test and requiring P-value <0.01), and then use Fisher's exact test to determine whether a significant number of these genes belong to a prespecified gene set [6, 7]. An alternative approach begins by ranking all genes according to differential expression, and then determines if a prespecified gene set is significantly overrepresented toward the top or bottom of the ranked list. Such a procedure was first introduced by Subramanian et al. [8] and their particular method was named Gene Set Enrichment Analysis (GSEA). Here we use GSEA broadly to describe all methods for associating gene sets with phenotype changes. We abbreviate prespecified gene sets among a known collection as gene sets or simply sets. We use the terms enrichment and overrepresentation as they are used in normal parlance. They do not distinguish methods, but frames of reference: the members of a set are said to be overrepresented in a group of genes being examined, and the group is enriched in members of the set.

The GSEA method by Subramanian et al. [8] consists of the following specific steps: (i) rank all genes by the magnitudes of their differential expression and select a window in the ranked list, i.e. a contiguous run of some number of genes starting at any rank, (ii) define an enrichment score based on a weighted Kolmogorov–Smirnov (WKS) test that measures the difference between the number of genes in a prespecified gene set that are observed in the window, and the number of occurrences expected if the genes in the set were uniformly distributed in the list, (iii) simulate a background distribution of the enrichment score by shuffling samples and estimate the statistical significance of the gene set, (iv) repeat steps i–iii for all prespecified gene sets (hypotheses) and for various window sizes, (v) correct for multiple hypothesis testing.
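
For illustration, the running-sum score behind step (ii) can be sketched as follows. This is an unweighted simplification (the published WKS statistic weights each hit by the magnitude of its differential expression), and the function and variable names are our own:

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Running-sum enrichment score over a ranked gene list.

    Unweighted sketch of the Kolmogorov-Smirnov-style statistic:
    walk down the list, stepping up on set members and down on
    non-members; the score is the maximum deviation from zero.
    """
    in_set = np.isin(ranked_genes, list(gene_set))
    n_hit = in_set.sum()
    n_miss = len(ranked_genes) - n_hit
    step = np.where(in_set, 1.0 / n_hit, -1.0 / n_miss)
    running = np.cumsum(step)
    return running[np.argmax(np.abs(running))]

# Toy example: the set members cluster at the top of the ranking.
ranked = np.array(["g1", "g2", "g3", "g4", "g5", "g6"])
print(enrichment_score(ranked, {"g1", "g2"}))
```

A set concentrated at the top of the list yields a large positive score; one concentrated at the bottom yields a large negative score.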

The strategy for performing GSEA has numerous variants, depending on the method for estimating significance (WKS test, mean test, median test, Wilcoxon rank sum test, etc.); the background distribution, which is related to the method for estimating significance, but not always dictated by it; the choice of the shuffling method when the background distribution is simulated; the method for multiple hypothesis correction; and the choice of weights to account for auxiliary information such as topology of gene sets [8, 9]. Several excellent reviews compared these variants [10–14]. In particular, Ackermann and Strimmer [13] established a general modular framework for performing GSEA. They critically assessed a subset of gene set enrichment procedures using 10 simulated data sets and 2 experimental data sets. In addition, they assessed three so-called global procedures, which do not compute a gene-level enrichment score; rather, they test gene sets as the unit. Ackermann and Strimmer concluded that the selection of gene-level statistics was inconsequential, while the performance of gene set-level statistics was more variable. All of the gene set-level statistics rejected the only simulated negative control data set, and based on the percentage of the other nine simulated data sets being detected, Ackermann and Strimmer concluded that the mean test was more sensitive than the median test and Wilcoxon rank sum test, while GSEA was the least sensitive. Furthermore, they concluded that global procedures yielded worse results than the best of the six gene set-level statistics they tested. Their results on the 2 experimental data sets were inconsistent with each other and inconsistent with the results based on simulated data. They reported that the results varied greatly depending upon the choice of the null hypothesis: one global procedure, Hotelling's T 2 test, was most sensitive for one data set using one type of background distribution, while tests based on conditional FDR and enrichment score, but not the mean test, were most sensitive in other combinations of experimental data set and null hypothesis. Although testing on experimental data sets is far more biologically relevant than on simulated data sets, the results of the former are harder to interpret due to the lack of a gold standard, i.e. which gene sets are true positives and which are true negatives.

Indeed, the lack of a gold standard has greatly hampered the effort of assessing gene set enrichment methods using experimental data sets. Biological systems are so complex that simulated data sets simply cannot substitute for experimental data sets. Furthermore, it can be argued that the ability to reject false positive predictions is even more important than detecting lowly ranked true positives, because it is costly to validate a large number of predictions. Therefore, in this article, we focus on experimental data sets. First we review the core components of gene set enrichment methods in the aspects that are important for experimental data sets. Then we use 132 experimental data sets to critically assess six gene set-level statistics and one global test, which received favorable ratings in previous reviews [10, 13]. To tackle the lack of a gold standard, we introduce the concept of mutual coverage (MC), which reflects the extent to which the gene sets predicted by a particular method are reproduced by other methods. Our results suggest that: (i) the Wilcoxon rank sum test and the WKS test as implemented in GSEA provide the most effective gene set-level statistics for obtaining high MC, (ii) simulated background distributions are more accurate than analytical backgrounds, (iii) the mean and median tests achieve the highest sensitivity but poor MC, and (iv) incorporating topology of gene sets in the analysis increases the sensitivity for all six procedures without reducing MC.

COMPONENTS OF GSEA

In this section, we review the five core components of GSEA methods as illustrated in Figure 1, focusing on the aspects that are particularly important for experimental data sets.

Figure 1:

Key components of performing gene set enrichment analysis.


Data preprocessing

There are two important but frequently overlooked data preprocessing steps. Normalization allows expression values obtained from different experiments to be directly comparable [15, 16]. The expression values of a small, but different, set of genes may be missing in different microarray experiments due to technical issues. Imputation of missing data is thus important for maximal data coverage when the results of multiple experiments are compared.

A number of methods are available for normalization [16, 17], yet this critical step is frequently omitted [15]. The most common normalization algorithms—RMA [16] and MAS 5.0 [18]—are designed for expression levels generated with microarrays that follow a lognormal distribution. Thus it is important to log transform the raw intensity values from microarrays. Failure to do so would bias toward high expression values, reducing statistical power because of the increase in variance [16]. Log transform is also applied to RNA-seq data. Expression level determined by RNA-seq is usually quantified in Reads Per Kilobase exon Model per million mapped reads (RPKM; density of reads that map to a gene normalized for the length of its mature transcript and for the sequencing depth of the experiment [19]), which after log transform correlates well with normalized intensity measured with microarray, also after log transform [19].
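
As a concrete illustration, RPKM and the subsequent log transform can be computed as below. This is a simplified sketch with hypothetical names; production pipelines use dedicated tools:

```python
import numpy as np

def rpkm(counts, transcript_lengths_bp):
    """Reads Per Kilobase of transcript per Million mapped reads.

    counts: read count per gene; transcript_lengths_bp: mature
    transcript length in base pairs. RPKM normalizes for both
    transcript length and sequencing depth, as described in the text.
    """
    counts = np.asarray(counts, dtype=float)
    kb = np.asarray(transcript_lengths_bp, dtype=float) / 1e3
    millions_mapped = counts.sum() / 1e6
    return counts / kb / millions_mapped

# Log2 transform with a pseudocount, so zero counts stay finite.
values = rpkm([500, 1500], [1000, 3000])
log_values = np.log2(values + 1.0)
print(values)
```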

Missing data can be imputed using methods based on k nearest neighbors, singular value decomposition, or least square regression models. Least square regression algorithms were reported to produce lower estimation error than other methods [20, 21]. In this article we use a popular least square regression algorithm, LSimpute_gene [22], to impute missing values in all 132 experimental data sets.
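
The flavor of least square regression imputation can be sketched as follows. Note this is a simplified single-predictor version for illustration, not the LSimpute_gene algorithm itself, which builds weighted multiple regression models:

```python
import numpy as np

def impute_ls_single(expr):
    """Simplified least-squares imputation (NOT LSimpute_gene itself).

    For each gene with missing values, regress it on the gene it
    correlates with best over the samples where both are observed,
    then predict the missing entries from the fitted line.
    expr: genes x samples matrix with np.nan marking missing values.
    """
    expr = expr.copy()
    genes, _ = expr.shape
    for g in range(genes):
        miss = np.isnan(expr[g])
        if not miss.any():
            continue
        obs = ~miss
        best, best_r = None, 0.0
        for h in range(genes):
            if h == g or np.isnan(expr[h]).any():
                continue
            r = abs(np.corrcoef(expr[g, obs], expr[h, obs])[0, 1])
            if r > best_r:
                best, best_r = h, r
        if best is None:
            continue  # no complete predictor gene available
        slope, intercept = np.polyfit(expr[best, obs], expr[g, obs], 1)
        expr[g, miss] = slope * expr[best, miss] + intercept
    return expr
```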

Single gene statistics

The first step in GSEA is to compute a gene-level statistic of differential expression, e.g. a t statistic, a signal to noise ratio (mean to standard deviation ratio), a fold change or a Wilcoxon rank sum statistic. Because phenotype change can affect different genes in opposite directions, i.e. increase the expression levels of some genes while decreasing the levels of others, and we want to be able to identify the gene sets that contain both types of genes, it is desirable to eliminate the direction of differential expression by taking the absolute value or the square of the statistic [13, 23]. However, data transformations that eliminate direction—such as absolute values—lead to asymmetrical distributions, and can invalidate some analytical estimates of significance based on analytical background distributions such as the χ2 test [24, 25] (see 'Estimating significance' and 'The validity of analytical background distributions' sections).
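
The pooled-variance t statistic with direction removed can be computed per gene as in this sketch (function and argument names are our own):

```python
import numpy as np

def abs_t(group_a, group_b):
    """Absolute two-sample t statistic per gene (equal-variance form),
    so that up- and down-regulated genes contribute alike once the
    sign of differential expression is discarded.
    group_a, group_b: genes x samples matrices, one per phenotype.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    na, nb = a.shape[1], b.shape[1]
    # Pooled variance across the two phenotype groups.
    pooled = ((na - 1) * a.var(axis=1, ddof=1)
              + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    t = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return np.abs(t)
```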

The many-to-many correspondence between genes and probe sets on a microarray creates ambiguity in determining expression levels of genes [26, 27]. A common practice is to calculate the mean or median expression levels of the probe sets that correspond to the same gene; however, doing so usually increases the number of false negatives [28]. An alternative is to perform meta analysis [28], using, for example, the method proposed by Fisher [29] or by Stouffer [30, 31]. Rather than merging the expression values directly, these methods merge probe set-level statistics. A similar problem exists in RNA-seq, where some sequencing reads are mapped to multiple genomic locations. Such multi-mappers originate from paralogs, segmentally duplicated regions and low sequence complexity [32]. Ignoring multi-mappers reduces sensitivity and undercounts some genes [19]. Strategies for assigning multi-mappers are discussed in [19, 32, 33].
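
Stouffer's method, for example, combines the P-values of the probe sets that map to one gene as sketched below (unweighted form, standard library only):

```python
import math
from statistics import NormalDist

def stouffer_combine(p_values):
    """Stouffer's Z: convert each probe-set P-value to a z-score,
    sum with a sqrt(k) scaling, and map back to a single combined
    P-value for the gene."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - nd.cdf(z)
```

Two uninformative probe sets (P = 0.5 each) combine to 0.5, while two modestly significant ones reinforce each other.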

Gene set-level statistics

The purpose of a gene set-level statistic is to decide whether a gene set is distinct in some statistically significant fashion. A gene set statistic can be defined in terms of properties of the genes in the set, e.g. the mean, median, variance, etc. of a gene-level statistic (see Table 1 for more details). When a property (and its corresponding statistic) is chosen, the null hypothesis must, of course, also be specified. There are two null hypotheses as defined by Tian et al. [34]. In one case (Q1) the background distribution is obtained by shuffling genes; in the other (Q2), the background distribution is obtained by shuffling phenotypes, i.e. samples (see 'Estimating significance' section). The rationale for using Q1 is that a significant gene set should be distinguishable from an equal size set composed of randomly chosen genes. On the other hand, Q2 focuses on a gene set and tests whether its association with the phenotype change is distinguishable from randomly shuffled phenotype changes. Q2 is generally favored because it preserves the relationship of the genes in the set [11, 12, 34] and directly addresses the question of finding gene sets whose expression changes correlate with phenotype changes.

Table 1:

Commonly used gene set-level statistics

P is the given gene set. p_i is the gene-level statistic of gene i in P. Q is the set of genes not in P. y is the order statistic of P. rank_{P+Q}(x) is the rank of x in the set P + Q. NA: not applicable.

Gene set-level statistics generally ignore clinical covariates—factors such as age, sex and weight—which can also cause differential expression, confounding the impact of phenotype changes [35]. The effect of covariates can be estimated using, for example, a linear regression model [35]. If a t statistic is used at the gene level, it can be generalized using a linear regression model for covariate correction, after accounting for the increased number of variables to avoid overfitting [35].
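
A minimal sketch of covariate correction via ordinary least squares for one gene follows; the coefficient on the phenotype indicator is the covariate-adjusted effect. Names are hypothetical, and a real analysis would also moderate variances and guard against overfitting as noted above:

```python
import numpy as np

def covariate_adjusted_effect(y, phenotype, covariates):
    """Estimate the phenotype effect on one gene's expression while
    adjusting for clinical covariates (age, sex, weight, ...), via
    ordinary least squares. Returns the phenotype coefficient."""
    X = np.column_stack([np.ones(len(y)), phenotype, covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]
```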

Most gene set-level statistics also ignore relationships among genes within the set. For example, if the gene set is a pathway, its topological information is ignored. Including topological information is important for accounting for the effect of genetic buffering [36], which holds that if a gene fails to propagate its influence to a pathway neighbor, its biological role is buffered. Conversely, a gene that regulates many of its downstream genes may play a pivotal role in the expression changes of the pathway associated with phenotype changes. Methods for including topological information by weighted gene set-level statistics are discussed in [9, 37, 38].

Estimating significance

We use significance in the standard fashion: the probability, computed under the background (or null) distribution, of observing a result at least as extreme as the one obtained. The background distribution can sometimes be written analytically, as in the case of a Gaussian distribution, and it can always be simulated by shuffling experimental data. As noted in the above section, the simulated background is dictated by the choice of the null hypothesis (Q1 or Q2), which frequently leads to different conclusions [34].

Most frequently the gene set-level statistic, e.g. the mean of the t-statistic values of genes in the set, is assumed to follow a normal distribution when expression change has no association with phenotype change [34]. In such a case the significance (P-value) of a gene set can be computed analytically [25]. This assumption is in question when the expression levels of genes in a set are dependent on one another, which is common for genes in a pathway [34]. In 'The validity of analytical background distributions' section we will discuss analytical backgrounds and empirical corrections [25] to make them more useful.

To be concrete in illustrating how significance is estimated using a simulated background distribution, suppose we are interested in estimating the probability that the enrichment score obtained for a particular gene set is a chance occurrence of phenotype changes. The procedure would be to shuffle the phenotype labels, calculate the differential expression of each gene, rank all genes and compute an enrichment score for the same gene set. The entire process is repeated multiple times to obtain a distribution of enrichment scores, and the P-value of the actual enrichment score is simply the fraction of shuffles that produce enrichment scores at least as great as observed. Although simulating the background distribution obviates the need for an analytical background, it can be computationally demanding—at least N shuffles need to be performed to achieve a P-value resolution of 1/N [39].
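
The shuffling procedure just described can be sketched as follows, assuming a user-supplied `score_fn` (a hypothetical name) that maps an expression matrix and phenotype labels to an enrichment score:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(expr, labels, score_fn, n_shuffles=1000):
    """Empirical P-value by phenotype shuffling (null hypothesis Q2).

    The P-value is the fraction of label shuffles whose enrichment
    score is at least as great as the observed one; its resolution
    is limited to 1/n_shuffles.
    """
    observed = score_fn(expr, labels)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = rng.permutation(labels)
        if score_fn(expr, shuffled) >= observed:
            hits += 1
    return hits / n_shuffles
```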

Correction for multiple testing

The P-value is the appropriate measure of statistical significance when only one gene set is tested. When a large number of gene sets are tested, there can be many false positives among the gene sets that receive seemingly highly significant P-values; this is called the multiple hypothesis testing problem. The simplest procedure is to choose a P-value which, when multiplied by the number of hypotheses, i.e. the total number of tested gene sets, gives a sufficiently low corrected P-value, e.g. <0.05. This Bonferroni correction [40] is, however, very conservative and sometimes results in an unacceptably large number of false negatives. An alternative is to control the expected fraction of false positives among the predictions, or the false discovery rate (FDR), using the method by Benjamini and Hochberg [41]. The original Benjamini–Hochberg procedure assumes a uniform distribution for the P-values [41]. In some cases when there are relatively many 'non-zero' tests, i.e. when low P-values are prevalent, an FDR variant, positive FDR (pFDR), can be applied [42, 43]. The corrected P-value is called the Q-value, defined as the 'minimum FDR at which a test is called significant' [42, 44]. The relationship between Q-value and FDR is analogous to that between P-value and type I error [42]. The final significant gene sets are the ones whose Q-values are smaller than an FDR threshold.
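
The Benjamini–Hochberg adjustment can be sketched in a few lines (a simplified illustration; the pFDR/Q-value estimate additionally estimates the proportion of true nulls):

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values.

    Sort the P-values, scale the i-th smallest by m/i, then enforce
    monotonicity with a reverse cumulative minimum and restore the
    original order.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))
```

Gene sets whose adjusted values fall below the chosen FDR threshold are called significant.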

THE VALIDITY OF ANALYTICAL BACKGROUND DISTRIBUTIONS

Although appropriate usage of a standard analytical background distribution facilitates rapid computation of P-values with high precision, the critical question is how well the analytical background represents the actual background. One way to test the validity of an analytical background is to examine the distribution of P-values under the null hypothesis, which should be uniform [45, 46]. To address this question, we generated 500 null data sets by shuffling the sample tags of an arbitrarily chosen human data set from the Gene Expression Omnibus (GEO; a public database of high throughput gene expression data), GDS2835 [47]. We constructed the histograms of the null P-value distributions using five gene set-level metrics summarized in Table 1, with analytical and simulated backgrounds.

The results indicate that P-values are uniformly distributed when background distributions are generated by shuffling, irrespective of metric (Figure 2A), whereas the distributions obtained using analytical background distributions are highly nonuniform, and metric dependent (Figure 2B). The null P-value distribution of the χ2 test as shown in Figure 2B deviates greatly from a uniform distribution, because taking the absolute value of the gene-level statistic causes the background to no longer follow the χ2 distribution. Taking the absolute value does not violate the analytical background distributions of the mean, median and Wilcoxon rank sum tests; nonetheless, their null P-value distributions deviate from the uniform distribution. The biased null P-value distribution further violates the assumption of the FDR procedure ('Correction for multiple testing' section). Thus we conclude that it is more accurate to use simulated backgrounds than analytical backgrounds.

Figure 2:

P-value distribution under the null by (A) simulated background and (B) analytical background. It is clear that analytical backgrounds give biased P-value distributions. WKS (i.e. GSEA) is not shown in (B), because WKS does not follow an analytical background.

COMPARISON OF GENE SET-LEVEL STATISTICS

The problem of comparing different methods for gene set enrichment analysis is made difficult by the lack of a gold standard. One can mine the literature to obtain evidence on whether a gene set is associated with the phenotype change, but this can only be done on a small scale. An alternative is to quantify the number of overlapping predictions (gene sets called significant) by several methods [9, 48]. Because each method can capture a piece of the evidence (from the location of the mean, shape of the distribution, etc.) of biological perturbation, gene sets predicted by multiple methods should be more reliable than gene sets predicted by only one method. In this article we suggest a formal way to use the MC of multiple methods for evaluating gene set-level statistics.

Mutual coverage

We state the concept of MC by multiple methods in precise terms:

Observation 1: given several orthogonal predictors (e.g. gene set-level statistics), a gene set deemed significant by more predictors is less likely to be false than a gene set deemed significant by fewer predictors. The term orthogonal here means that different gene set-level statistics do not use correlated properties to make the prediction. Since 'mutually supported gene sets' identified by multiple predictors show statistical significance from multiple distinct properties, they might better reflect the underlying biology of phenotype changes.

Observation 2: if a particular predictor reports a high fraction of 'mutually supported gene sets', we assume it has high positive predictive value, especially if the number of predictors is exhaustive. We define the MC of a predictor as the ratio of the number of votes from other predictors agreeing with its predictions divided by the maximum number of votes possible (see 'Methods' section for details).
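
The unweighted MC score of Observation 2 can be sketched as follows (predictor and gene-set names in the example are toy placeholders):

```python
def mutual_coverage(predictions):
    """MC of each predictor: votes its predicted gene sets receive
    from the other predictors, divided by the maximum possible votes.

    predictions: dict mapping predictor name -> set of gene-set IDs
    it calls significant. Unweighted sketch of the voting scheme;
    correlation weighting (CMC) is introduced in the next section.
    """
    mc = {}
    for name, called in predictions.items():
        others = [s for n, s in predictions.items() if n != name]
        max_votes = len(called) * len(others)
        votes = sum(gs in other for gs in called for other in others)
        mc[name] = votes / max_votes if max_votes else 0.0
    return mc

demo = {"mean": {"A", "B"}, "median": {"A", "B"}, "wks": {"A", "C"}}
print(mutual_coverage(demo))
```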

Controlled mutual coverage for correlated predictors

The condition that predictors be orthogonal is almost never met. Figure 3 illustrates the Pearson correlation coefficient between the gene set-level statistics in Table 1. For instance, mean and median are strongly correlated. If a predictor has many 'echoes', its MC can be overestimated. One way to correct for correlations among predictors is to down-weight the votes from the echoes. An intuitive approach is to weight each vote according to the probability that two gene set-level statistics agree with each other by chance. In other words, if predictor A has a probability π of voting for all predictions of predictor B by chance, votes from A will be weighted by a function of π; in this article, we simply take it as the inverse of π. Therefore correlated predictors having higher π receive lower weights. To compute π, we generated 500 randomly phenotype-shuffled data sets from the aforementioned experimental data set (GEO GDS2835) and computed the expected frequency that two gene set-level statistics predict the same gene set as significant (see 'Methods' section for more details). After weighting and normalization, a controlled MC score (CMC; see 'Methods' section) is calculated and the effects of echoes are controlled.
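
A sketch of the vote weighting described above: `pi[(B, A)]` stands for the chance-agreement probability estimated from the phenotype-shuffled data sets, and the normalization shown here is one plausible choice, not necessarily the article's exact formula:

```python
def controlled_mutual_coverage(predictions, pi):
    """CMC sketch: like MC, but each vote from predictor B to A is
    weighted by 1/pi[(B, A)], the inverse of their chance-agreement
    probability, then normalized by the maximum achievable weighted
    vote total. Highly correlated 'echo' predictors (large pi) thus
    contribute less to a predictor's score.
    """
    cmc = {}
    for a, called in predictions.items():
        others = [b for b in predictions if b != a]
        max_votes = sum(len(called) / pi[(b, a)] for b in others)
        votes = sum(
            (gs in predictions[b]) / pi[(b, a)]
            for b in others for gs in called
        )
        cmc[a] = votes / max_votes if max_votes else 0.0
    return cmc
```

With all pairwise π equal, CMC reduces to the unweighted MC.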

Figure 3:

The Pearson correlation coefficient between all 10 gene set statistics. The 'W' prefix indicates a TIF weighted statistic.

We generated 132 null data sets by randomly shuffling the phenotype tags of an experimental data set (GEO GDS2835), and computed the CMC scores of the 5 original gene set-level statistics and Hotelling's T 2 test in one group, and the CMC scores of the 5 topology impact factor (TIF) [9] weighted gene set-level statistics in another group. All 11 statistics show similarly low CMC scores (the first row of Table 2, labeled as 'Null data set'), indicating that the contributions from correlated predictors have been substantially diminished. We did not compute CMC for a TIF weighted gene set-level statistic with its original statistic in one group, because they are highly correlated (Figure 3).

Table 2:

CMC of null data set compared with filtered data set, using original statistics and TIF weighted statistics at α = 0.01

Type | χ2-test | Hotelling's T 2 test | Mean test | Median test | WKS test | Wilcoxon rank sum test
Null data set (original) | 0.19 | 0.15 | 0.22 | 0.22 | 0.19 | 0.21
Original statistics | 0.21 | 0.23 | 0.24 | 0.28 | 0.40 | 0.44
Original statistics (leave one out) | 0.18; 0.16; 0.16; 0.21; 0.14 | 0.20; 0.17; 0.17; 0.21; 0.17 | 0.21; 0.20; 0.22; 0.18; 0.16 | 0.24; 0.22; 0.25; 0.20; 0.18 | 0.31; 0.33; 0.36; 0.31; 0.29 | 0.38; 0.35; 0.36; 0.33; 0.33
Null data set (weighted) | 0.19 | NA | 0.20 | 0.22 | 0.18 | 0.17
TIF weighted statistics | 0.20 | NA | 0.21 | 0.25 | 0.33 | 0.31
TIF weighted statistics (leave one out) | 0.18; 0.12; 0.12; 0.20 | NA | 0.18; 0.16; 0.20; 0.10 | 0.21; 0.18; 0.22; 0.12 | 0.23; 0.25; 0.29; 0.23 | 0.22; 0.25; 0.19; 0.27

The WKS and Wilcoxon rank sum tests show significantly higher CMC than in the null data set.

Test on 132 experimental data sets

We collected and processed 132 human data sets from GEO, in accordance with the procedure in the 'Components of GSEA' section (see 'Methods' section for details). The CMC scores of the six statistics in Table 1 and the 5 TIF weighted statistics are listed in Table 2. The Wilcoxon rank sum test and WKS test show significantly higher CMC in both original (CMC = 0.4 and 0.35, respectively) and TIF weighted forms (CMC = 0.39 and 0.35, respectively) compared with their CMC for the null data sets (0.07). To test whether the CMC score is still biased by dependency, we iteratively removed one statistic at a time and the resulting CMC scores are also listed in Table 2. The results indicate that the Wilcoxon rank sum test and WKS test still perform better than the other statistics, and the order of performance of all predictors does not change (Table 2). We conclude that predictions based on the Wilcoxon rank sum test or WKS test are more likely to be covered by other gene set statistics that use the mean, median, χ2 and Hotelling's T 2 test, and may imply higher biological inference power. In addition, the TIF weighted WKS test (PWEA [9]) reports 15% more positive predictions and still retains the higher level of CMC (see 'Supplementary Materials'), suggesting that it is more sensitive than the original WKS test. Note that TIF weighting was not applied to Hotelling's T 2 test, since this method performs principal component analysis, which cannot be applied to the topological information; moreover, one of the purposes of Hotelling's T 2 test is to reduce the influence of correlation structure within a gene set, so weighting genes based on the correlation of neighbor genes runs against its design.

DISCUSSION AND CONCLUSIONS

We reviewed approaches to gene set enrichment analysis and attempted to clarify a number of concepts that are important for application to experimental data sets, such as preprocessing of raw data, imputation of missing data, the choice of null hypothesis, and methods for generating null distributions. Our analysis of null P-value distributions indicates that analytical background distributions are less accurate than simulated background distributions. As shown previously [13], the choice of gene set-level statistics is the most important component of gene set enrichment methods; however, it is difficult to compare the performance of different gene set-level statistics when a gold standard is absent.

To address this difficulty we propose a new metric, CMC, essentially a positive predictive value in which the true positives are determined by weighted overlaps between different methods. The results of testing 132 experimental data sets suggest that the Wilcoxon rank sum test and WKS test can better cover the predictions of the mean test, the median test, the χ² test and Hotelling's T² test, but not vice versa. We postulate that this is because the WKS test captures not only location shifts but also shape changes in the observed distribution of differential expression relative to the background distribution, and the Wilcoxon rank sum test is robust to extreme values. Since the WKS test reports more gene sets than the Wilcoxon rank sum test in our experiments, it is likely to have higher sensitivity (see Supplementary Materials).
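The robustness of the rank sum statistic to extreme values can be seen in a small sketch (an illustrative function of our own, with no tie handling, not the implementation used in the paper):

```python
def rank_sum(set_scores, background_scores):
    """Wilcoxon rank-sum statistic: the sum of the ranks of the gene set's
    gene-level scores within the combined sorted list. Minimal sketch;
    assumes distinct values and applies no tie correction."""
    combined = sorted(set_scores + background_scores)
    return sum(combined.index(v) + 1 for v in set_scores)

# Inflating one gene-level score to an extreme value leaves the statistic
# unchanged, because only ranks enter the computation.
a = rank_sum([3.0, 4.0], [1.0, 2.0])
b = rank_sum([3.0, 400.0], [1.0, 2.0])
```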

To further improve the biological utility of gene set enrichment analysis, we believe that the inclusion of additional biological features, such as topology or covariates as discussed in the 'Gene set-level statistics' section, would be more useful than changing statistics. Utilizing more domain knowledge is likely to reveal more insights in the analysis. The concept of gene set enrichment analysis has been applied to biological features in addition to expression, such as SNPs, copy number variation [49] and protein–protein interaction networks.

METHODS

We collected all human gene expression data sets based on microarrays from GEO, and split each data set according to its phenotypes. All GEO data sets have proper gene name annotations for each probe set. Subsets within the same GEO entry were scrutinized and only one subset was chosen, to avoid redundancy. Data sets with fewer than 10 samples per phenotype were discarded. We imputed missing values in each test set using the LSimpute_gene algorithm [22], which constructs weighted multiple regression models based on the other genes best correlated with the genes having missing values. We ensured all expression values were log transformed. Data sets normalized by approaches other than RMA or MAS 5.0 were also discarded. In total, 132 test sets remained. We then performed a t-test and used the absolute value of the t-statistic as the gene-level statistic of each probe set. Probe sets sharing the same gene name were combined according to Stouffer's method [31].
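The last step, combining per-probe-set statistics into one gene-level score via Stouffer's method, can be sketched as follows (a minimal illustration that treats the inputs as approximately standard normal z-scores; the probe-set values in the example are hypothetical):

```python
import math

def stouffer_combine(z_scores):
    """Stouffer's method: combine z-like scores from probe sets mapping to
    the same gene as sum(z_i) / sqrt(n). Under the null, with independent
    standard normal inputs, the result is again standard normal."""
    return sum(z_scores) / math.sqrt(len(z_scores))

# Hypothetical gene measured by three probe sets:
combined = stouffer_combine([2.1, 1.8, 2.4])
```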

We use 201 pathways from KEGG as the collection of gene sets for all analyses. We tested five gene set-level statistics and one global test, Hotelling's T² test, which were reviewed favorably by Ackermann and Strimmer [13]. The Hotelling's T² statistic for a gene set is calculated as follows:

$$T^2 = \frac{n_x n_y}{n_x + n_y}\,(\bar{\mathbf{x}} - \bar{\mathbf{y}})^{\mathsf T} S^{-1} (\bar{\mathbf{x}} - \bar{\mathbf{y}}),$$

where $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are vectors of k elements (the total number of genes in the gene set), representing the mean expression levels of the genes in the gene set in the two phenotypes, with n_x and n_y samples, respectively, and S⁻¹ is the inverse of the pooled covariance matrix. The problem with Hotelling's T² test in practice is that when k > n_x + n_y, which is common, the singularity of S makes finding the inverse difficult. Kong et al. [50] solved the issue by using PCA (principal component analysis) to reduce the dimensionality, and we use the same approach to compute the Hotelling's T² statistic.
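A minimal sketch of this PCA-based computation (our own illustrative implementation, not the authors' code; the choice of two components is arbitrary here):

```python
import numpy as np

def hotelling_t2_pca(X, Y, n_components=2):
    """Hotelling's T^2 for a gene set after projecting both groups onto the
    leading principal components of the pooled, centered data, so the pooled
    covariance is invertible even when genes outnumber samples.
    X, Y: (samples x genes) expression matrices for the two phenotypes."""
    nx, ny = X.shape[0], Y.shape[0]
    pooled = np.vstack([X, Y])
    centered = pooled - pooled.mean(axis=0)
    # Principal directions from the SVD of the pooled, centered data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    P = Vt[:n_components].T                     # genes x components
    Xp, Yp = X @ P, Y @ P                       # project both groups
    d = Xp.mean(axis=0) - Yp.mean(axis=0)       # difference of group means
    Sx = np.cov(Xp, rowvar=False)
    Sy = np.cov(Yp, rowvar=False)
    S = ((nx - 1) * Sx + (ny - 1) * Sy) / (nx + ny - 2)  # pooled covariance
    return (nx * ny) / (nx + ny) * d @ np.linalg.solve(S, d)
```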

For all five gene set-level statistics (Table 1), we applied the method of Hung et al. [9] to weight the gene-level statistics by a topological influence factor (TIF). Hotelling's T² statistic cannot be separated into gene-level and gene set-level components, so TIF weighting cannot be performed. The significance levels (P-values) are calculated using simulated backgrounds and corrected for multiple testing using the FDR procedure [51].
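The FDR correction step can be sketched with the standard Benjamini–Hochberg step-up adjustment, which we assume is the scheme intended by the cited procedure [51] (a minimal implementation of our own):

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values (step-up FDR adjustment): sort P-values,
    scale each by m/rank, and enforce monotonicity from the largest rank
    down. Minimal sketch."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):       # walk from the largest P-value down
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev                    # q-value in the original order
    return q
```

A gene set is then called significant when its q-value falls below the chosen FDR cutoff (0.05 below).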

To calculate CMC, we define W_k as the prediction profile of predictor k under an FDR cutoff of 0.05. W_k is a two-dimensional M × N matrix, where M is the total number of data sets (M = 132) and N is the total number of gene sets (N = 201). W_k(i,j) = 1 if predictor k assigns a Q-value below the FDR cutoff to gene set j in data set i; otherwise W_k(i,j) = 0.

The MC of a predictor k is defined as:

$$\mathrm{MC}_k = \frac{\sum_{l \neq k} \sum_{i,j} W_l(i,j)\, W_k(i,j)}{(S-1)\sum_{i,j} W_k(i,j)},$$

where S is the total number of predictors.

The numerator is the total number of votes that predictor k gets from the other predictors, and the denominator is the maximum number of votes it can get.

To control for dependency among different predictors, we first model the probability π that predictor k agrees with predictor l in 500 null data sets, generated by shuffling the sample tags of an arbitrarily chosen human data set from the Gene Expression Omnibus (GEO; a public database of high-throughput gene expression data), GDS2835 [47]. Altering the total number of null data sets or the source experimental data set does not change π appreciably. π is defined as:

$$\pi_{kl} = \frac{\sum_{i,j} W_k^{0}(i,j)\, W_l^{0}(i,j)}{\sum_{i,j} W_k^{0}(i,j)},$$

where W⁰ denotes the prediction profiles computed on the null data sets.

We use π as the weight of each vote accordingly and define the CMC for predictor k as follows:

$$\mathrm{CMC}_k = \frac{\sum_{l \neq k} (1-\pi_{kl}) \sum_{i,j} W_l(i,j)\, W_k(i,j)}{\sum_{l \neq k} (1-\pi_{kl}) \sum_{i,j} W_k(i,j)}.$$

CMC is then a weighted ratio that indicates the fraction of predictions supported by other predictors.
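Putting the pieces together, the vote counting can be sketched as below. This is our reading of the definitions above, not the authors' code; in particular, applying the weight as (1 − π) per vote is an assumption about how 'the weight of each vote' is used.

```python
import numpy as np

def mutual_coverage(W, pi=None):
    """Sketch of the (controlled) mutual coverage score per predictor.
    W: (S, M, N) array of binary prediction profiles, one M x N matrix per
    predictor. pi: optional (S, S) matrix of null agreement probabilities;
    when given, each vote from predictor l to predictor k is down-weighted
    by (1 - pi[k, l]) -- an assumption, see the lead-in."""
    S = W.shape[0]
    scores = []
    for k in range(S):
        num = den = 0.0
        for l in range(S):
            if l == k:
                continue
            w = 1.0 if pi is None else 1.0 - pi[k, l]
            num += w * np.sum(W[l] * W[k])   # votes l casts for k's calls
            den += w * np.sum(W[k])          # maximum votes k could get
        scores.append(num / den if den else 0.0)
    return scores
```

Two predictors with identical profiles score 1.0; predictors whose significant gene sets never overlap score 0.0.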

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

  • Review critical steps in performing a statistically robust gene set enrichment analysis.

  • Demonstrate the large number of false positives due to inappropriate statistical backgrounds.

  • Introduce a novel 'controlled mutual coverage (CMC)' index to evaluate gene set statistics.

FUNDING

National Institutes of Health (grant RR022971, partial support); National Science Foundation (grant DBI-0850008, partial support).

References

1. Molecular Signatures Database v3.0.
2. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 2007;35:W182–5.
3. Quantitative proteomics analysis of the secretory pathway. Cell 2006;127(6):1265–81.
4. Proteomic survey of metabolic pathways in rice. Proc Natl Acad Sci USA 2002;99(18):11969–74.
5. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol 2009;5(8):e1000465.
6. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005;21(18):3587–95.
7. Transcriptional regulation and function during the human cell cycle. Nat Genet 2001;27(1):48–54.
8. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102(43):15545–50.
9. Identification of functional modules that correlate with phenotypic difference: the influence of network topology. Genome Biol 2010;11(2):R23.
10. On testing the significance of sets of genes. Ann Appl Stat 2007;1(1):107–29.
11. Gene-set approach for expression pattern analysis. Brief Bioinform 2008;9(3):189–97.
12. Gene-set analysis and reduction. Brief Bioinform 2009;10(1):24–34.
13. A general modular framework for gene set enrichment analysis. BMC Bioinformatics 2009;10:47.
14. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37(1):1–13.
15. Improving the scaling normalization for high-density oligonucleotide GeneChip expression microarrays. BMC Bioinformatics 2004;5:103.
16. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4(2):249–64.
17. Normalization and quantification of differential expression in gene expression microarrays. Brief Bioinform 2006;7(2):166–77.
18. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003;31(4):e15.
19. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5(7):621–8.
20. Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 2010;11:15.
21. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics 2008;9:12.
22. LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 2004;32(3):e34.
23. Absolute enrichment: gene set enrichment analysis for homeostatic systems. Nucleic Acids Res 2006;34(22):e151.
24. The folded normal distribution. Technometrics 1961;3(4):543–50.
25. Gene set enrichment analysis made simple. Stat Methods Med Res 2009;18(6):565–75.
26. ADAPT: a database of Affymetrix probesets and transcripts. Bioinformatics 2005;21(10):2552–3.
27. A sequence-based identification of the genes detected by probesets on the Affymetrix U133 Plus 2.0 array. Nucleic Acids Res 2005;33(3):e31.
28. A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics 2008;24(3):374–82.
29. Statistical Methods for Research Workers, 4th edn. London: Oliver and Boyd, 1932.
30. Meta-analytic methods, the Rorschach, and the MMPI. Psychol Assess 2001;13(4):449–51.
31. Normalization and gene p-value estimation: issues in microarray data processing. Bioinform Biol Insights 2008;2:291–305.
32. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 2010;26(4):493–500.
33. A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics 2008;91(3):281–8.
34. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 2005;102(38):13544–9.
35. Extensions to gene set enrichment. Bioinformatics 2007;23(3):306–13.
36. Biochemical networking contributes more to genetic buffering in human and mouse metabolic pathways than does gene duplication. Nat Genet 2002;32(1):191–4.
37. A novel signaling pathway impact analysis. Bioinformatics 2009;25(1):75–82.
38. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat Appl Genet Mol Biol 2004;3:Article 16.
39. Computation of significance scores of unweighted Gene Set Enrichment Analyses. BMC Bioinformatics 2007;8:290.
40. Multiple hypothesis testing: a review. Annu Rev Psychol 1995;46:561–84.
41. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995;57:289–300.
42. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 2003;31:2013–35.
43. Nonparametric Bayesian estimation of positive false discovery rates. Biometrics 2007;63(4):1126–34.
44. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 2003;100(16):9440–5.
45. Calibration of p values for testing precise null hypotheses. Amer Statistician 2001;55(1):62–71.
46. Towards the uniform distribution of null P values on Affymetrix microarrays. Genome Biol 2007;8(5):R69.
47. Human endometriosis is associated with plasma cells and overexpression of B lymphocyte stimulator. Proc Natl Acad Sci USA 2007;104(30):12451–6.
48. A systems biology approach for pathway level analysis. Genome Res 2007;17(10):1537–45.
49. GEAR: genomic enrichment analysis of regional DNA copy number changes. Bioinformatics 2008;24(3):420–1.
50. A multivariate approach for integrating genome-wide expression data and biological knowledge. Bioinformatics 2006;22(19):2373–80.
51. Controlling the false discovery rate in behavior genetics research. Behav Brain Res 2001;125(1–2):279–84.
52. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 2005;6:144.
53. JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information. Nucleic Acids Res 2006;34(Web Server issue):W510–5.
54. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 2005;21(9):1943–9.