rSEA R package

Mitra Ebrahimpoor

2019-10-06

Simultaneous Enrichment Analysis with rSEA

Overview of the Method

rSEA is a novel paradigm for simultaneous enrichment analysis of feature-sets. It combines the pre-existing self-contained and competitive approaches by defining a unified null hypothesis that includes the null hypotheses of both approaches as special cases. This null hypothesis is tested with All-Resolutions Inference (ARI), an approach to multiple testing that controls Familywise Error Rate based on closed testing and Simes tests.

rSEA is extremely flexible: not only does it allow both self-contained and competitive testing, it allows the user to choose the type of test after seeing the data. Moreover, the choice of feature-set database(s) may also be postponed until after seeing the data. The data may even be used for the definition of feature-sets, e.g. by taking subsets of feature-sets with a certain sign or magnitude of estimated effect. Users may even iterate and revise the choice of type of test and the definition of feature-sets of interest on the basis of results. Still, familywise error is controlled for all final results.

The required input is feature-wise p-values, so the functions can be used with any omics platform, experimental design or model. The only assumption needed is the Simes inequality, which allows dependence between p-values. It is the same assumption that is needed for the validity of the procedure of Benjamini and Hochberg as a method for False Discovery Rate control. Despite the flexibility and lack of independence assumptions, the new method has acceptable power compared to classical enrichment methods. Notably, the power of the method does not depend on the number of feature-sets tested. As a consequence, classical methods will do better for a limited number of candidate feature-sets, while rSEA outperforms other methods when databases are large. According to the simulation studies, rSEA is comparable in power to classical enrichment methods for a database the size of Gene Ontology.
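To make the Simes assumption a little more concrete, here is a minimal base R sketch of the Simes combined p-value for one set of p-values. It only illustrates the local test that ARI builds on; it is not the closed-testing procedure implemented in rSEA, and the function name simes_p and the toy p-values are hypothetical.

```r
# Simes combined p-value for a set of p-values:
# the minimum over i of n * p_(i) / i, where p_(i) is the i-th smallest p-value.
simes_p <- function(p) {
  p <- sort(p)
  n <- length(p)
  min(n * p / seq_len(n))
}

p_toy <- c(0.001, 0.02, 0.3, 0.8)  # toy values, not taken from the vignette data
simes_p(p_toy)
```

The Simes test rejects the hypothesis that none of the features in the set is active when this combined p-value falls below the chosen level.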

The output for each feature-set is not only an adjusted p-value for enrichment, but also a simultaneous lower confidence bound to the actual proportion of active features. Users obtain not just the presence or absence of enrichment, but also an assessment of the level of enrichment in each feature-set.

Defining the unified null hypothesis

The unified null hypothesis is very flexible and has two parameters. Besides the default self-contained and competitive tests, it can handle custom competitive testing. Here are a few examples of how to define and interpret the parameters of the unified null; for more details, please refer to the paper mentioned at the end of this document.

The unified null hypothesis is written as: \(H^{U}_{0}(S,c)\colon \pi (S)\leq c\).

Here, \(\pi(S)\) is the proportion of true discoveries (TDP) in \(S\), \(S\) is the feature-set of interest and \(c\) is the threshold for testing. So it reads: the TDP of the features in the set is at most the threshold \(c\). Different combinations of \(S\) and \(c\) are possible, and SEA accounts for all of these tests simultaneously, so looking at them one after the other does not inflate the type I error (\(\alpha\)). So, for feature-set \(S_t\) you can test:

  • Self-contained null hypothesis: \(S=S_t\) and \(c=0\)
  • Default competitive null hypothesis: \(S=S_t\) and \(c=TDP(S^{c}_{t})\)
  • Custom competitive null hypothesis: \(S=S_t\) and \(c\) chosen by the user

Here \(S^{c}_{t}\) is the set of features that are not in the feature-set of interest, also called the ‘background’ features.

Note that the value of \(c\) defines the test: if there are 15 features in \(S\), choosing \(c=0.6\) means you are testing whether more than 9 (\(0.6 \times 15\)) of the features in that set are active. To choose this value, you can start with the TDP of the background, as in the default competitive test, and refine it as you see fit.
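The arithmetic behind the threshold can be written out in a few lines of base R. This is only a worked version of the example above; the variable names are hypothetical.

```r
set_size <- 15   # number of features in the set S
c_custom <- 0.6  # user-chosen threshold for a custom competitive null

# Under the null pi(S) <= c, at most this many features in S are active
max_active_under_null <- c_custom * set_size
max_active_under_null

# Self-contained testing is the special case c = 0:
# under the null, no feature in S is active
c_self <- 0
c_self * set_size
```

Rejecting the custom competitive null with \(c=0.6\) therefore supports the claim that more than 9 of the 15 features are active.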

Usage

Currently there are four functions within rSEA: setTDP(), setTest(), SEA() and topSEA(). To portray their usage, we first explain the properties of the main arguments and then simulate a dataset of 300 features and a mock pathlist for some practical examples.

Input and format

As mentioned in the previous section, the required input is the raw feature-wise p-values, i.e. the p-values from individual testing of the features before any multiple testing correction is applied. The second required element is the name of the feature corresponding to each p-value. The names can be supplied as the names of the p-value vector, as a column of a data frame, or as a separate vector matching the p-values. The third and last element is the database of feature-sets (called pathlist from now on). There are different types of databases:

  • External: databases in literature such as GO, KEGG, Wikipathways, etc.
  • Internal: User-defined feature-sets based on the current or previous studies
  • Combination: internally modified version of public databases

No matter which database is used, the feature-sets should be stored as a list of lists, where each feature-set is a (named) list defined in terms of the featureIDs matching the input. Such pathlists are very easy to create in R, and the “Creating the Pathlists” section explains how to make a pathlist using online resources of GO, KEGG and WikiPathways.
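As an illustration of this structure, a toy pathlist could look as follows. The pathway names and featureIDs are made up, and storing each feature-set as a character vector of featureIDs is an assumption about what the functions accept.

```r
# A toy pathlist: a named list in which each element holds the
# featureIDs of one feature-set (all names and IDs are hypothetical).
pathlist_toy <- list(
  pathwayA = c("ab", "cd", "ef"),
  pathwayB = c("cd", "gh"),
  pathwayC = c("ij", "kl", "mn", "op")
)

str(pathlist_toy)      # structure of the list
lengths(pathlist_toy)  # size of each feature-set
```

Feature-sets are allowed to overlap, as "cd" does here.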

Example dataset

We simulate the p-values using runif(). To make sure there is some signal, we simulate 100 small p-values and 200 random ones. The featureIDs are just combinations of two lowercase letters. Then 20 hypothetical pathways are generated by selecting random sets of names, and these are combined to create the pathlist used by the functions setTest() and SEA(). In practice, you will have this data yourself and won’t need to run this chunk.
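A minimal sketch of such a simulation is given below. The seed, the upper limit of 0.01 for the small p-values, the pathway sizes and the object names (simdat, pvals, names, pathlist) are assumptions made for illustration, so the numbers it produces will not match the output shown in this document.

```r
set.seed(159)  # arbitrary seed, used only to make the sketch reproducible

# 100 small p-values (signal) followed by 200 uniform ones (no signal)
pvals <- c(runif(100, 0, 0.01), runif(200, 0, 1))

# 300 unique featureIDs, each a pair of lowercase letters
all_ids <- as.vector(outer(letters, letters, paste0))
ids <- sample(all_ids, 300)

simdat <- data.frame(pvals = pvals, names = ids, stringsAsFactors = FALSE)

# 20 hypothetical pathways, each a random subset of the featureIDs
pathlist <- lapply(1:20, function(i) sample(ids, sample(15:60, 1)))
names(pathlist) <- paste0("pathway", 1:20)

head(simdat)
lengths(pathlist)
```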

setTDP() function

The setTDP() function is used to get the point estimate and the lower bound for the ‘True Discovery Proportion (TDP)’ of a feature-set, which can even be the set of all features.

To get an overview of the dataset we just generated, we use the setTDP() function without a set argument. This evaluates the set of all features. Note that the data argument is optional, so you can also pass two matching vectors of p-values and featureIDs; the output of the following two calls would be the same.
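A hedged sketch of the two calling styles follows. It re-creates a stand-in dataset so that it is self-contained (seed, p-value ranges and object names are assumptions), and it assumes that setTDP() takes the p-values first and the featureIDs second and accepts data and alpha arguments; please check ?setTDP for your installed version. The calls are skipped when rSEA is not installed.

```r
set.seed(159)  # arbitrary seed; results will differ from those quoted in the text
simdat <- data.frame(
  pvals = c(runif(100, 0, 0.01), runif(200, 0, 1)),
  names = sample(as.vector(outer(letters, letters, paste0)), 300),
  stringsAsFactors = FALSE
)

if (requireNamespace("rSEA", quietly = TRUE)) {
  # Style 1: column names plus the data frame (argument names assumed)
  tdp_from_data <- rSEA::setTDP(pvals, names, data = simdat, alpha = 0.05)
  # Style 2: two matching vectors, no data argument
  tdp_from_vectors <- rSEA::setTDP(simdat$pvals, simdat$names, alpha = 0.05)
  print(tdp_from_data)
  print(tdp_from_vectors)
}
```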

So at least 10 percent (a proportion of 0.1) of the features in the dataset are associated with the outcome, and the median point estimate is 0.19. We may say that there are at least 30 (\(0.1 \times 300\)) active features in the dataset. If we take a look at the pathways, what is the proportion of active features in pathway 3? The following will provide an answer:
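A sketch of such a call, under the same assumptions as before (stand-in data, assumed argument names, skipped when rSEA is not installed), passes the third pathway through the set argument:

```r
set.seed(159)  # arbitrary seed; results will differ from those quoted in the text
simdat <- data.frame(
  pvals = c(runif(100, 0, 0.01), runif(200, 0, 1)),
  names = sample(as.vector(outer(letters, letters, paste0)), 300),
  stringsAsFactors = FALSE
)
# 20 hypothetical pathways; sizes between 15 and 60 are an assumption
pathlist <- lapply(1:20, function(i) sample(simdat$names, sample(15:60, 1)))

if (requireNamespace("rSEA", quietly = TRUE)) {
  # TDP bound and estimate for the features of pathway 3 only
  tdp_path3 <- rSEA::setTDP(pvals, names, data = simdat,
                            set = pathlist[[3]], alpha = 0.05)
  print(tdp_path3)
}
```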

So it seems that pathway 3 is enriched with active features. We can formally test that with the setTest() function, as you see below.

setTest() function

The setTest() function is used with a single set, which can even be the set of all features. It returns an adjusted p-value for the chosen test of that set. For details on defining the type of test, see the “Defining the unified null hypothesis” section.

Here, we first test whether there are any active features, so we do a self-contained test for the set of all features. We already know that the corresponding p-value will be significant, as the lower bound for the TDP was estimated to be 0.1. Then we test the default competitive null for pathway 3.
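The tests described here could be sketched as below. The data are a stand-in simulation, and the argument names testype (spelled with a single "t", as far as I recall from the package help) and testvalue, as well as the values "selfcontained" and "competitive", are assumptions that should be checked against ?setTest. The calls are skipped when rSEA is not installed.

```r
set.seed(159)  # arbitrary seed; results will differ from those quoted in the text
simdat <- data.frame(
  pvals = c(runif(100, 0, 0.01), runif(200, 0, 1)),
  names = sample(as.vector(outer(letters, letters, paste0)), 300),
  stringsAsFactors = FALSE
)
pathlist <- lapply(1:20, function(i) sample(simdat$names, sample(15:60, 1)))

if (requireNamespace("rSEA", quietly = TRUE)) {
  # Self-contained test for the set of all features (no set argument)
  print(rSEA::setTest(pvals, names, data = simdat, testype = "selfcontained"))
  # Default competitive test for pathway 3
  print(rSEA::setTest(pvals, names, data = simdat, set = pathlist[[3]],
                      testype = "competitive"))
  # Custom competitive test for pathway 3 against a threshold of 0.5
  print(rSEA::setTest(pvals, names, data = simdat, set = pathlist[[3]],
                      testype = "competitive", testvalue = 0.5))
}
```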

As expected, the self-contained null hypothesis for the set of all features is rejected, so there are some active features in the dataset. Also, pathway 3 is significantly enriched with active features. The default competitive test uses the overall TDP, which is 0.1 here, as its threshold. We also repeated the competitive test with a threshold of 0.5, to see whether more than half of the features in the set are active.

SEA() function

The SEA() function will simultaneously evaluate multiple pathways from pathlist. Here we test all the pathways in pathlist.
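A sketch of such a call is shown below, again on stand-in data. It assumes that SEA() takes the p-values and featureIDs in the same way as setTDP() and accepts data and pathlist arguments; please check ?SEA. The object name testchart1 follows the code shown later in this document, and the call is skipped when rSEA is not installed.

```r
set.seed(159)  # arbitrary seed; results will differ from those quoted in the text
simdat <- data.frame(
  pvals = c(runif(100, 0, 0.01), runif(200, 0, 1)),
  names = sample(as.vector(outer(letters, letters, paste0)), 300),
  stringsAsFactors = FALSE
)
pathlist <- lapply(1:20, function(i) sample(simdat$names, sample(15:60, 1)))

if (requireNamespace("rSEA", quietly = TRUE)) {
  # Evaluate every pathway in the pathlist simultaneously
  testchart1 <- rSEA::SEA(pvals, names, data = simdat, pathlist = pathlist)
  print(head(testchart1))
}
```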

The resulting chart has one row per pathway and a minimum of 3 columns. The columns are:

  • ID: Pathway identifier; these are in the same order as the pathlist
  • Name: Name of the pathway, printed if the pathlist is a named list
  • Size: Size of the pathway as defined in pathlist
  • Coverage: Proportion of features in the pathway that are present in the data, so the TDP is a proportion of \(Size \times Coverage\) features and not necessarily of the whole pathway
  • TDP.bound: Estimated lower bound for the proportion of true discoveries in the pathway (denoted by \(\bar \pi\) in the paper); the upper bound is always 1. It can be translated to a number of true discoveries by \(Size \times Coverage \times \bar \pi\)
  • TDP.estimate: A point estimate for the TDP (denoted by \(\hat \pi\) in the paper); this can also be translated to a number of true discoveries by \(Size \times Coverage \times \hat \pi\)
  • adj P: The adjusted p-values for the specified tests. SC. and Comp. stand for self-contained and competitive, respectively. The custom competitive test is the test of the unified null against the user-chosen thresh

Here all pathways have a coverage of 1, which is very unlikely in practice. The first pathway is of size 50, so we can infer that at least 2 (\(1 \times 50 \times 0.04\)) of its features are associated with the outcome. This is consistent with the significant self-contained adj.p-value and the non-significant competitive adj.p-value. Recall that the default competitive null hypothesis tests against the overall TDP, which we estimated as 0.1, so for this pathway the default competitive hypothesis compares the number of active features against 5 (\(1 \times 50 \times 0.1\)).
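The translation from proportions to numbers of features can be done for a whole chart at once. The sketch below uses the rows for pathways 1 and 3 as printed in the topSEA() output in this document; the added column names min_active and est_active are hypothetical.

```r
# Two rows copied from the printed chart (pathways 1 and 3)
chart <- data.frame(
  ID = c(1, 3),
  Size = c(50, 15),
  Coverage = c(1, 1),
  TDP.bound = c(0.04, 0.20),
  TDP.estimate = c(0.08, 0.2666667)
)

# Lower bound and point estimate for the number of active features
chart$min_active <- chart$Size * chart$Coverage * chart$TDP.bound
chart$est_active <- chart$Size * chart$Coverage * chart$TDP.estimate
chart
```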

If you use a larger pathlist such as Gene Ontology, you may not be interested in all of its roughly 12 thousand pathways. In that case, the pathways of interest can be selected with the select argument. Here we choose 20 pathways. You can also select by pathway name if the pathlist is named.

It seems that both pathway 8 and pathway 9 have some interesting features associated with the outcome of interest. It is interesting to see whether the union of these two has a larger TDP; we can examine this with the setTDP() function. Then we can test whether the combined set is enriched.
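Building the combined set only needs base R. The sketch below uses two made-up pathways (the seed and sizes are assumptions, so it will not reproduce the set size reported in this document); the resulting vector can then be passed as the set to setTDP() and setTest().

```r
set.seed(1)  # arbitrary seed for this illustration
ids <- sample(as.vector(outer(letters, letters, paste0)), 300)

pathway8 <- sample(ids, 40)  # hypothetical stand-ins for pathways 8 and 9
pathway9 <- sample(ids, 45)

combined <- union(pathway8, pathway9)    # unique features in either pathway
shared <- intersect(pathway8, pathway9)  # features present in both

length(combined)
length(shared)
```

Because union() drops duplicates, features shared by both pathways are counted once in the combined set.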

As you can see, the union contains 71 unique features, of which at least 3 (\(71 \times 0.042 \approx 3\)) are active. According to the self-contained test, at least some of these features are significantly associated with the outcome. Testing against 0.1 does not reject the null hypothesis that the proportion of discoveries in this new set is at most 0.1.

topSEA() function

topSEA() is a sorting function to facilitate the evaluation of the SEA chart. In this example, we first sort the table by TDP estimate to get the pathways with the largest TDP at the top of the chart. One may also wish to sort the table by Comp.adjP to get the pathways with the smallest adj.p-values for the default competitive test at the top. Another interesting output is obtained by keeping only the pathways that are significant (at level \(\alpha=0.05\)) according to the self-contained test and then ordering them by size.

require(rSEA) #load rSEA
testchart3<-topSEA(testchart1) #sorted by large TDP.estimates
head(testchart3)
#>    ID Size Coverage  TDP.bound TDP.estimate      SC.adjP    Comp.adjP
#> 3   3   15        1 0.20000000    0.2666667 5.599905e-04 0.0045303163
#> 21 21   58        1 0.17241379    0.2241379 3.556689e-08 0.0019393255
#> 42 42   23        1 0.13043478    0.2173913 1.713408e-12 0.0498190338
#> 34 34   19        1 0.15789474    0.2105263 6.512458e-05 0.0001434172
#> 19 19   48        1 0.10416667    0.2083333 3.556689e-08 0.0433095818
#> 5   5   49        1 0.06122449    0.2040816 3.556689e-08 0.1377924569

testchart4<-topSEA(testchart1, by=Comp.adjP, descending=FALSE) #sorted by smallest competitive adj.p-values
head(testchart4)

sigchart<-topSEA(testchart1, by=SC.adjP, thresh = 0.05) #keep only pathways with significant self-contained adj.p-values
sigchart2<-topSEA(sigchart, by=Size, descending=TRUE) #sorted by pathway size
head(sigchart2)
#>    ID Size Coverage  TDP.bound TDP.estimate      SC.adjP Comp.adjP
#> 22 22   56        1 0.07142857    0.1250000 3.535197e-15 0.2313215
#> 41 41   55        1 0.07272727    0.1090909 1.713408e-12 0.4803668
#> 28 28   55        1 0.07272727    0.1636364 3.556689e-08 0.1998095
#> 1   1   50        1 0.04000000    0.0800000 8.643328e-03 0.9890575
#> 9   9   50        1 0.04000000    0.1400000 3.218631e-05 0.2389773
#> 11 11   44        1 0.04545455    0.1590909 1.119805e-05 0.1998095

There are some additional options. One is removing pathways with a coverage smaller than a certain value; for example, by adding cover=0.5, only pathways with at least 50% coverage are kept.
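To illustrate what such a coverage filter means, here is a base R equivalent on a toy chart. The values are made up, and this is only a sketch of the idea, not the implementation used by topSEA().

```r
# Toy chart with made-up sizes and coverages
chart <- data.frame(
  ID = 1:4,
  Size = c(50, 15, 30, 22),
  Coverage = c(1, 0.4, 0.5, 0.75)
)

cover <- 0.5
# Keep pathways whose coverage is not smaller than the chosen value
kept <- chart[chart$Coverage >= cover, ]
kept
```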

Creating the Pathlists

The pathlist argument is a list of pathways to be evaluated and can be created in different ways; here we show a few examples of creating such a list. In practice, any collection of feature-sets can be used as long as it is stored as a list. The only data you need to create the pathlist is an annotation file that links the probe identifiers to gene symbols, gene ontology terms and other gene information. This object is called a bimap object and can be retrieved from different databases or even from a local library. Here we present two examples. One is the well-known Gene Ontology database, for which a wide range of Bioconductor tools exists. The other is WikiPathways with the rWikiPathways package.

NOTE: For reproducibility, we suggest saving a copy of the created pathlist on your local drive, as the online sources are constantly updated.

GO pathways

One standard format for annotation in Bioconductor is an annotation package. Annotation packages are readily available in Bioconductor for most commercial chip types. For custom-made arrays or for less frequently used platforms, it is possible to make your own annotation package using the AnnBuilder package. The AnnotationDbi package is a key reference for learning how to use bimap objects. AnnotationDbi is used primarily to create mapping objects that allow easy access from R to underlying annotation databases. As such, it acts as the R interface for all the standard annotation packages. For more information, read the help file of AnnotationDbi.

To create the bimap for a microarray dataset from Affymetrix hgu133a chips, install the hgu133a.db package. Use the package that is relevant for your own data.

Take the following steps to install the required bioconductor packages. For older versions of R, please refer to the appropriate Bioconductor release.
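One way to do this with the BiocManager package is sketched below. The helper name install_annotation_packages is hypothetical and the package list is an assumption based on the packages mentioned in this section; the function is only defined here, so nothing is installed until you call it.

```r
# Hypothetical helper: install any of the listed packages that are missing
install_annotation_packages <- function(pkgs = c("AnnotationDbi", "hgu133a.db")) {
  # BiocManager is the standard installer for Bioconductor packages
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    BiocManager::install(missing_pkgs)
  }
  invisible(missing_pkgs)
}

# install_annotation_packages()  # uncomment to install the missing packages
```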

Use the ls() function to view the list of objects provided with this db package and the columns() function to discover which sorts of annotations can be extracted from the database. You can see that a mapping of hgu133a to GO exists, which provides the relevant bimap data. Here we focus only on Cellular Component (CC); alternatively, Biological Process (BP) or Molecular Function (MF) can be used.

The bimap is converted to a GOList in the required format as below. (As GO is a large database, this can take a few seconds.)
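The core of such a conversion can be sketched in base R with split(). The toy table go_map stands in for a long-format probe-to-GO mapping extracted from the annotation package; its column names, probe IDs and term labels are all made up.

```r
# Toy long-format mapping: one row per probe-to-term pair (hypothetical IDs)
go_map <- data.frame(
  probe = c("p1", "p2", "p2", "p3", "p4", "p4"),
  go_id = c("GO_term_A", "GO_term_A", "GO_term_B",
            "GO_term_B", "GO_term_B", "GO_term_C"),
  stringsAsFactors = FALSE
)

# One list element per GO term, holding the probes annotated to it
GOList <- split(go_map$probe, go_map$go_id)
GOList <- lapply(GOList, unique)  # drop repeated probes within a term
str(GOList)
```

The result is a named list with one element per term, which is the shape the pathlist argument expects.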

KEGG pathways

The procedure for creating KEGG pathways is the same. Here we assume the data are from mice and the entries are based on “entrez gene identifiers”, so we use the org.Mm.eg.db package.

Here we create an indirect match between ENTREZID and PATH. You may get a warning that the mapping is one-to-many; in that case, it is better to create the pathways using the IDs from your dataset and not those from the package. To do so, in the function below, replace “keggbimap$ENTREZID” with the gene names from your dataset. Another option is to use a manually corrected mapping of ENTREZID and PATH.
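As a rough base R sketch of the idea of restricting the mapping to the IDs in your own dataset, consider the toy example below. The table keggbimap here is a made-up stand-in with ENTREZID and PATH columns, and my_ids and KEGGList are hypothetical names; the real mapping would come from the annotation package.

```r
# Toy stand-in for the ENTREZID-to-PATH mapping (all values are made up)
keggbimap <- data.frame(
  ENTREZID = c("11", "11", "22", "33", "44"),
  PATH = c("path_A", "path_B", "path_A", "path_B", "path_B"),
  stringsAsFactors = FALSE
)

my_ids <- c("11", "22", "33")  # gene IDs present in your own dataset

# Keep only the rows for genes in the dataset, then build one element per pathway
in_data <- keggbimap[keggbimap$ENTREZID %in% my_ids, ]
KEGGList <- split(in_data$ENTREZID, in_data$PATH)
str(KEGGList)
```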

Citing rSEA

If you use the rSEA package, please cite the following paper: