This is a brief introduction to the
dnapath package. This package integrates known pathway information into the differential network analysis of gene expression data. It allows for any network inference method to be used, and provides wrapper functions for several popular methods; these all start with
run_ for ease of searching. Various helper function such as
plot() are implemented for summarizing and visualizing the differential network results. This package is a companion to the paper:
Grimes, T., Potter, S. S., & Datta, S. (2019). Integrating gene regulatory pathways into differential network analysis of gene expression data. Scientific reports, 9(1), 1-12.
You can install
dnapath from CRAN:
The package contains two datasets,
cancer_pathways. The “Package data” vignette shows how these two datasets were created, and it illustrates the use of various methods in the
meso data contains gene expression data for two groups (stage ii and stage iv) of Mesothelioma tumors.
cancer_pathways data is a list of cancer-related (P53) gene pathways from the Reactome database.
These are the two primary inputs to the
dnapath() method – a gene expression dataset and a list of pathways – and are used to demonstrate the package usage in this vignette.
dnapathpackage is not limited to cancer data. The analysis demonstrated in this introduction can be applied to any gene expression datasets from any species.
The main function of the package is
dnapath(), which performs the differential network analysis on a gene expression dataset over a user-specified list of pathways. The output of this function is a
dnapath_list object (or
dnapath object if only one pathway is considered). An overview of its arguments is shown here:
dnapath(): The main function for running the differential network analysis.
xis a single matrix,
groupsmust be specified to label each row.
groupsis a vector of length equal to the number of rows in
x, and it should contain two unique elements (the two group names).
run_. The default is
run_pcorwhich computes partial correlations.
n_perm == 1, the permutation tests are not performed. If
n_permis larger than the number of possible permutations,
n_permwill be set to this value with a warning message.
set.seedprior to permutation test for each pathway. This allows results for individual pathways to be easily reproduced.
An overview of the other methods available in the
dnapath package is given in the following sections.
dnapath objects obtained from running
dnapath() can be summarized and visualized using various methdos.
filter_pathways(): Removes any pathways from the results that were not significantly differentially connected (based on the permutation test p-values for the pathway-level connectivity).
subset(): Subset the results based on pathways or genes of interest.
sort(): Sorts the pathways in the results.
by = "mean_expr", the mean expression of each pathway across both groups;
by = "mean_expr1"or
by = "mean_expr2", the mean expression of each pathway in group 1 or 2, respectively;
by = "dc_score", the differential connectivity score of the pathway;
by = "p_value", the p-value of the dc score;
by = "n_genes", the number of genes in each pathway; or
by = "n_dc"the number of significantly differentially conncted genes in each pathway.
summary() : Creates a summary table for a
plot() : Plot the differential network from a
plot_pair() : Plot the expression values of two genes to visualize their marginal association.
The user may specify any list of gene sets to use in the analysis. However, for convenience
dnapath provides a function for obtaining a list of Reactome pathways for a given species. This list may be useful for most analyses.
get_reactome_pathways(): Connects to the Reactome database (using the
reactome.dbpackage) to obtain a list of pathways for a given species.
min_sizegenes are omitted from the list.
max_sizegenes are removed from the list.
The Reactome pathways are based on entrezgene IDs, which are a unique identifier for individual genes. It is recommended to also annotate RNA-seq data using entrezgene IDs when obtaining gene expression counts. However, when summarizing and visualizing the differential network analysis results, it may be preferred to use gene symbols. The functions in this section are used to rename the genes in a
dnapath object, or a pathway list.
rename_genes(): Used to rename the genes contained in the differential network analysis results. This is useful if the user wants to, for example, create tables or plots summarizing the data using gene symbols rather than Entrezgene IDs. Note: this function can also be used to rename the genes in a pathway list.
dnapathobject, or a pathway list.
to = "symbol"will rename entrezgene IDs to gene symbols; this will automatically call the
entrez_to_symbolmethod (defined below) to obtain the gene_mat matrix. The species arugment must also be specified when to is used.
entrez_to_symbol() : Maps entrezgene IDs to gene symbols using the
biomaRt package. The Reactome pathways use entrezgene IDs, and many RNA-seq pipelines will use entrezgene IDs for annotation. However, gene symbols are more recognizable and may be preferred for summarizing and visualizing results. This method maps entrezgene IDs to gene symbols, and the resulting two column matrix can be used with the
rename_genes() method to rename entrezgene IDs as gene symbols.
symbol_to_entrez() : Maps gene symbols to entrezgene IDs using the
biomaRt package. This can be useful for renaming gene expression datasets that contain gene symbols. Note, however, that gene symbols can be ambiguous and may map to multiple entrezgene IDs. It is recommended to use entrezgene IDs throughout the analysis and only rename with gene symbols when summarizing the results.
get_genes() : This method extracts the genes in a
dnapath object, or from a pathway list. This is not meant to summarize the results of the differential network analysis, but rather to help when renaming the genes in the results; the output is a vector of gene names that can be used in
There are also serveral network inference methods provided in this package. Many of these are available in other R packages, but the format of their inputs and outputs can vary. Wrapper functions have been created to standardize these aspects. The method names are all prefixed with
run_ (for easy searching) and can be used as the network_inference arugment in
dnapath(). The methods are listed here, and additional information/references can be found in the package documentation.
The default method used is
run_pcor, which infers the gene-gene association network using partial correlations. It is important to note that different measures may be better suited for answering different research questions, however those details are beyond the scope of this vignette.
Note: For more information on any method, use the
?command in R to open the documentation (for example
This example will analyze the
meso data provided in the package. (The “Package data” vignette shows how these data were obtained.) First we load the data and take a look at it.
## List of 2 ## $ gene_expression: num [1:32, 1:150] 10.18 10.01 7.88 10.2 9.31 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:32] "TCGA.3H.AB3S" "TCGA.3U.A98E" "TCGA.3U.A98G" "TCGA.SC.AA5Z" ... ## .. ..$ : chr [1:150] "84883" "207" "208" "10000" ... ## $ groups : chr [1:32] "stageii" "stageiv" "stageiv" "stageiv" ...
The data is a list containing (1) a gene expression data set of 32 samples and 150 genes and (2) a vector indicating which group each sample belongs to – stageii (group 1) or stage iv (group 2).
The differential network analysis compares the gene-gene association network between the two groups (stage ii vs stage iv). The
dnapath package enables the use of known pathway information into the analysis, but a pathway list is not required. The
dnapath() method can be run on the full gene expression dataset, without using pathways:
## Differential network analysis between stageii (group 1) and stageiv (group 2). ## Results for an unnamed pathway (out of 1 pathways analyzed) containing 150 genes. ## Pathway p-value = 0.733. The mean expression of genes in the pathway is 9.3 in group 1 and 9.4 in group 2. ## # A tibble: 150 x 6 ## pathway genes dc_score p_value mean_expr1 mean_expr2 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 unnamed 5111 0.000569 0.0297 10.5 10.8 ## 2 unnamed 23019 0.000641 0.0594 11.6 11.8 ## 3 unnamed 57459 0.000500 0.0594 9.78 9.67 ## 4 unnamed 6233 0.000722 0.0891 13.6 13.3 ## 5 unnamed 843 0.00108 0.0990 8.94 9.23 ## 6 unnamed 3065 0.00115 0.119 10.0 10.3 ## 7 unnamed 5527 0.000825 0.119 10.4 10.6 ## 8 unnamed 5366 0.000815 0.158 6.89 7.55 ## 9 unnamed 7874 0.000800 0.168 11.2 11.1 ## 10 unnamed 3276 0.000502 0.168 11.2 11.3 ## # … with 140 more rows
Printing the results provides a summary of the differential network analysis. The output shows that “stageii” and “stageiv” groups were compared. It says a single “unnamed pathway” was analyzed containing 150 genes; this is referring to the fact that all genes in the dataset were analyzed together and no pathway list was provided. The mean expression level of all genes in each group are also shown (these are 9.3 and 9.4, respectively, for this data), and a p-value (0.723) of the overall differential connectivity.
The output shows the differential network analysis results at the gene level. This table can be obtained using
summary(results), which returns a
tibble that can be filtered and sorted using base
R functions or methods from the
dplyr package. The columns in this table include: genes, the genes analyzed in this pathway; dc_score, the differentially connectivity (DC) score estimated for this gene; p_value, the permutation p-value for the DC score (under the null hypothesis of no differential connectivity); p_value_m, monotonized p-values, which are a conservative estimate; mean_expr1, the mean expression of each gene in group 1; mean_expr2, the mean expression of each gene in group 2.
plot function can be used to view the differential network. In this case, since all genes were analyzed together, the differential network will show all genes present in the gene expression dataset.
alpha = 0.05argument will remove any edges whose differential connectivity score p-value is above 0.05.
With 150 genes, the visualization of the differential network is not very useful; it is difficult to extract much information from this plot. One of the benefits of using known pathway information is that it breaks down the analysis into smaller gene sets, and the resulting differential networks are more manageable and informative. These advantages are illustrated in the next section.
The main feature of
dnapath is that it incorporates known gene pathways into the differential network analysis. A pathway is nothing more than a set of genes that is thought to belong to a particular biological process. The
get_reactome_pathways method can be used to obtain a list of pathways from the Reactome database.
In this section, we will use the
p53_pathways data that accompanies the package. TP53 is a oncogene that is often mutated in many cancers, and the
p53_pathways list contains 13 pathways related to P53 signaling.
We refer the reader to the “Package data” vignette for details on how the
p53_pathwayslist was created. That vignette illustrates how to use
The only difference compared to the previous section is that the
pathway_list argument is now specified when running
## Differential network analysis results between stageii (group 1) and stageiv (group 2) over 13 out of 13 pathways analyzed. ## # A tibble: 13 x 7 ## pathway dc_score p_value n_genes `n_dc (0.05)` mean_expr1 mean_expr2 ## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> ## 1 Regulation of T… 0.0658 0.426 37 0 10.1 10.2 ## 2 Regulation of T… 0.0628 0.812 30 0 9.51 9.51 ## 3 Regulation of T… 0.0995 0.129 14 3 7.70 7.60 ## 4 TP53 Regulates … 0.0722 0.693 49 1 8.53 8.60 ## 5 TP53 Regulates … 0.149 0.584 14 0 7.76 7.94 ## 6 TP53 Regulates … 0.0680 0.436 44 0 8.28 8.34 ## 7 TP53 regulates … 0.108 0.248 14 0 8.52 8.46 ## 8 Regulation of T… 0.0947 0.723 19 0 9.12 9.10 ## 9 TP53 Regulates … 0.117 0.941 18 0 9.37 9.35 ## 10 TP53 Regulates … 0.112 0.0891 20 3 8.17 8.18 ## 11 TP53 regulates … 0.0780 0.733 21 0 8.46 8.50 ## 12 TP53 Regulates … 0.138 0.158 12 1 8.15 8.16 ## 13 TP53 Regulates … 0.111 0.337 12 0 7.25 7.25
The results can be filtered using the
filter_pathways method remove any pathways with differential connectivity p-values above some threshold.
## Differential network analysis results between stageii (group 1) and stageiv (group 2) over 3 out of 13 pathways analyzed. ## # A tibble: 3 x 7 ## pathway dc_score p_value n_genes `n_dc (0.05)` mean_expr1 mean_expr2 ## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> ## 1 TP53 Regulates T… 0.138 0.158 12 1 8.15 8.16 ## 2 TP53 Regulates T… 0.112 0.0891 20 3 8.17 8.18 ## 3 Regulation of TP… 0.0995 0.129 14 3 7.70 7.60
Pathways can be plotted one at a time by indexing the results list. The following code shows how to print out the first pathway.
Note: The layout of the nodes is random when generating plots. For reproducible plots, the random number generator seed needs to be set prior to plotting.
This plot uses entrezgene IDs, which are the identifiers used in the gene expression and pathway list data. At this point in the analysis it can be convenient to translate these IDs to the more recognizable gene symbols. The
rename_genes method can be used to achieve this.
Note: Internet connection is required to connect to biomaRt, which
rename_genesuses to map entrezgene IDs to gene symbols. The
dir_saveargument can be set when running
rename_genes()which will save the ID mapping obtained from biomaRt in the specified directory. This way, the mapping can be obtained once (using the internet connection) and accessed from memory in all future calls of
rename_genes(). A temporary directory is used in this example.
## - saving gene info to /var/folders/6d/pbn9v11d2kvfjsh3gz6rxfbc0000gn/T//RtmpqlVFRW/entrez_to_hsapiens.rds
## Differential network analysis between stageii (group 1) and stageiv (group 2). ## Results for the pathway "TP53 Regulates Transcription of Death Receptors and Ligands" (out of 13 pathways analyzed) containing 12 genes. ## Pathway p-value = 0.158. The mean expression of genes in the pathway is 8.1 in group 1 and 8.2 in group 2. ## # A tibble: 12 x 6 ## pathway genes dc_score p_value mean_expr1 mean_expr2 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 TP53 Regulates Transcription o… TMEM2… 0.0235 0.0495 11.5 11.4 ## 2 TP53 Regulates Transcription o… IGFBP3 0.0225 0.0792 11.1 11.9 ## 3 TP53 Regulates Transcription o… TP53B… 0.0238 0.0891 9.88 9.78 ## 4 TP53 Regulates Transcription o… PPP1R… 0.0226 0.0990 8.82 8.93 ## 5 TP53 Regulates Transcription o… FAS 0.0229 0.149 9.51 9.36 ## 6 TP53 Regulates Transcription o… TNFRS… 0.0151 0.158 6.43 6.27 ## 7 TP53 Regulates Transcription o… TNFRS… 0.0220 0.218 6.90 7.02 ## 8 TP53 Regulates Transcription o… TP53 0.0242 0.297 11.1 10.8 ## 9 TP53 Regulates Transcription o… TP73 0.0173 0.297 3.88 3.47 ## 10 TP53 Regulates Transcription o… TP63 0.0139 0.535 3.49 3.14 ## 11 TP53 Regulates Transcription o… TNFRS… 0.0134 0.584 10.8 11.0 ## 12 TP53 Regulates Transcription o… TNFRS… 0.00794 0.931 4.39 4.80
The differential network plot shows a lot of information about the two groups being compared. First, the nodes are scaled based on the gene’s average expression across both groups. If the gene is differentially expressed, then the node will be shaded blue or red, depending on if the gene has higher expression in group 1 or group 2, respectively. For example, in the plot above we see that the node for IGFBP3 is slightly red, indicating a higher expression in group 2. However, because the shade is very light, the fold change is not high. This can be verified in the table above – the fourth row shows that IGFBP3 has mean expression of 11.1 in group 1 and 11.9 in group 2, which is only a modest change.
The edges in the network are similarly colored. Blue edges indicate that the gene-gene association is stronger in group 1, whereas red edges are stronger in group 2. The thickness of the edges are scaled according to the magnitude of the association, and the opacity is scaled based on the p-value of the edge’s differential connectivity score. That is, a thick, opaque edge is a strong association with high statistical evidence of differential connectivity (low p-value), but a thin, transparent edge is a relatively weaker association with less evidence of differential connectivity (high p-value). For example, the red edge between FAS and TP73 is dark red – indicating that the association is stronger in group 2 – and its width is wide suggesting that the association is strong (in at least one of the two groups) compared to other connections. In contrast, the edge between TP53 and TNFRSF10A is light blue – indicating weaker statistical evidence of a stronger association in group 1 – and is relatively thin indicating that the two genes have a more modest association in at least one of the two groups. This all can be verified from a summary table of the differntial network’s edges (see rows 1 and 9):
## # A tibble: 66 x 6 ## pathway edges dc_score p_value nw1 nw2 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 TP53 Regulates Transcription o… FAS - TP73 0.0725 0.00990 0.0257 -0.244 ## 2 TP53 Regulates Transcription o… IGFBP3 - PP… 0.0784 0.0198 -0.0755 0.205 ## 3 TP53 Regulates Transcription o… IGFBP3 - TN… 0.100 0.0297 0.135 -0.181 ## 4 TP53 Regulates Transcription o… TMEM219 - T… 0.0933 0.0297 0.126 -0.180 ## 5 TP53 Regulates Transcription o… TNFRSF10A -… 0.0477 0.0297 -0.164 0.0543 ## 6 TP53 Regulates Transcription o… PPP1R13B - … 0.0482 0.0495 0.0647 -0.155 ## 7 TP53 Regulates Transcription o… FAS - PPP1R… 0.0351 0.0594 -0.0265 -0.214 ## 8 TP53 Regulates Transcription o… PPP1R13B - … 0.0497 0.0792 0.0701 -0.153 ## 9 TP53 Regulates Transcription o… PPP1R13B - … 0.0195 0.0792 0.0773 -0.0625 ## 10 TP53 Regulates Transcription o… TNFRSF10A -… 0.0242 0.0891 0.0551 0.210 ## # … with 56 more rows
The table summarizes each edge in the differential network, sorted by p-values. The top edges in the table are also the most noticeable edges in the plot above. Edges edges with high p-values can be filtered out using the
## # A tibble: 6 x 6 ## pathway edges dc_score p_value nw1 nw2 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 TP53 Regulates Transcription of … FAS - TP73 0.0725 0.00990 0.0257 -0.244 ## 2 TP53 Regulates Transcription of … IGFBP3 - P… 0.0784 0.0198 -0.0755 0.205 ## 3 TP53 Regulates Transcription of … IGFBP3 - T… 0.100 0.0297 0.135 -0.181 ## 4 TP53 Regulates Transcription of … TMEM219 - … 0.0933 0.0297 0.126 -0.180 ## 5 TP53 Regulates Transcription of … TNFRSF10A … 0.0477 0.0297 -0.164 0.0543 ## 6 TP53 Regulates Transcription of … PPP1R13B -… 0.0482 0.0495 0.0647 -0.155
Each edge in the differential network corresponds to a pair of genes. The association between any pair of genes can be visualized using
plot_pair(). We demonstrate this function by visualizing FAS amd TP73.
## `geom_smooth()` using formula 'y ~ x'
From this plot, it appears that the marginal association between FAS and TP73 are quite similar between the two groups. However, recall that the association measure used in the differential network analysis is the partial correlation, which is a conditional association measure that takes into account the other genes in the pathway. It is important to remember that the marginal plot created by
plot_pair() does not show all the information that the differential network analysis uses, however it may be a useful visualization to check the results obtained from the analysis.
The default network inference method is
run_pcor, which uses partial correlations to estimate the gene association network in the two groups. This is changed by setting the
network_inference argument in
dnapath(). For example:
All of the summary and plotting functions behave the same as before regardless of the network inference method used. Note however, if the chosen method produces a dense network (such as
run_corr), then the visualization may become more cluttered. In addition, some methods are more computationally intensive (such as
run_genie3), and the differential network analysis will become more time consuming. To help counter this, the number of permutations performed for the significance testing can be lowered (the default is
n_perm = 100). For example:
Alternatively, there may be tuning parameters in the network inference method that can speed up computation. For example, in
nTrees argument might be lowered to tradeoff performance for computation time. This parameter can be adjusted using the additional
... argument of
In this brief introduction to the
dnapath package, we have reviewed the basic steps for conducting the differential network analysis and for summarizing the results with tables and plots. The examples here used the Mesothelioma dataset that accompanies the package, but we note again that the package is not limited to analyzing cancer datasets.