pathfindR
is an R package for pathway enrichment analysis of gene-level differential expression/methylation data utilizing active subnetworks. The package also enables hierarchical clustering of the enriched pathways. The method is described in detail in Ulgen E, Ozisik O, Sezerman OU. 2018. pathfindR: An R Package for Pathway Enrichment Analysis Utilizing Active Subnetworks. bioRxiv. https://doi.org/10.1101/272450
Our motivation to develop this package was that direct pathway enrichment analysis of differential RNA/protein expression or DNA methylation results may not provide the researcher with the full picture. That is to say; pathway enrichment of only the list of significant genes may not be informative enough to explain the underlying disease mechanisms.
An active subnetwork is defined as a group of interconnected genes in a protein-protein interaction network (PIN) that contains most of the significant genes. Therefore, these active subnetworks define distinct disease-associated sets of genes, whether discovered through differential expression analysis or discovered because of being in interaction with a significant gene.
Therefore, we propose to leverage information from a PIN to identify distinct active subnetworks and then perform pathway enrichment analyses on these subnetworks. Briefly, this workflow first maps the significant genes onto a PIN and finds active subnetworks. Next, pathway enrichment analyses are performed using each gene set of the identified active subnetworks. Finally, these enrichment results are summarized and returned as a data frame. This workflow is implemented as the function run_pathfindR()
and further described in the “Enrichment Workflow” section of this vignette.
This process usually yields a great number of enriched pathways with related biological functions. We therefore implemented a pairwise distance metric (as proposed by Chen et al. [1]) between pathways and based on this distance metric, also implemented hierarchical clustering of the pathways through a shiny app, allowing dynamic partitioning of the dendrogram into relevant clusters. Details of clustering and partitioning of pathways are presented in the “Pathway Clustering” section of this vignette.
The overview of the enrichment workflow is presented in the figure below:
For this workflow, the wrapper function run_pathfindR()
is used. This function takes in a data frame consisting of Gene Symbol
, log-fold-change
and adjusted-p values
. The first 6 rows of an example input dataset (of rheumatoid arthritis differential-expression) can be found below:
Gene.symbol | logFC | adj.P.Val | |
---|---|---|---|
ILMN_1755092 | FAM110A | -0.6939359 | 0.0000034 |
ILMN_1730628 | RNASE2 | 1.3535040 | 0.0000101 |
ILMN_1729801 | S100A8 | 1.5448338 | 0.0000347 |
ILMN_1714991 | S100A9 | 1.0280904 | 0.0002263 |
ILMN_1762037 | TEX261 | -0.3235994 | 0.0002263 |
ILMN_1718610 | ARHGAP17 | -0.6919330 | 0.0002708 |
Executing the workflow is straightforward (but takes several minutes):
The user may want to change certain arguments of the function:
# to change the output directory
RA_output <- run_pathfindR(RA_input, output = "new_directory")
# to change the PIN (default = Biogrid)
RA_output <- run_pathfindR(RA_input, pin_name = "IntAct")
# to use an external PIN of user's choice
RA_output <- run_pathfindR(RA_input, pin_name = "/path/to/myPIN.sif")
# available gene sets are KEGG, Reactome, BioCarta, GO-BP, GO-CC and GO-MF
# default is KEGG
# to change the gene sets used for enrichment analysis
RA_output <- run_pathfindR(RA_input, gene_sets = "BioCarta")
# to change the active subnetwork search algorithm (default = "GR", i.e. greedy algorithm)
# for simulated annealing:
RA_output <- run_pathfindR(RA_input, search_method = "SA")
# to change the number of iterations (default = 10)
RA_output <- run_pathfindR(RA_input, iterations = 5)
# to manually specify the number processes used during parallel loop by foreach
# defaults to the number of detected cores
RA_output <- run_pathfindR(RA_input, n_processes = 2)
# to report the non-DEG active subnetwork genes
RA_output <- run_pathfindR(RA_input, list_active_snw_genes = TRUE)
For a full list of arguments, see ?run_pathfindR
.
The workflow consists of the following steps :
After input testing, the program attempts to convert any gene symbol that is not in the PIN to an alias symbol that is in the PIN. Next, active subnetwork search is performed via the selected algorithm. The available algorithms for active subnetwork search are:
Next, pathway enrichment analyses are performed using the genes in each of the active subnetworks. For this, up-to-date information on human gene sets from KEGG, Reactome, BioCarta and Gene Ontology were retrieved and is available for use within the package. The user may specify custom gene sets, including gene sets for non-human organisms, as described in the section “pathfindR Analysis with Custom Gene Sets”.
During enrichment analyses, pathways with adjusted-p values larger than the enrichment_threshold
(an argument of run_pathfindR()
, defaults to 0.05) are discarded. The results of enrichment analyses over all active subnetworks are combined by keeping only the lowest adjusted-p value for each pathway.
This process of active subnetwork search and enrichment analyses is repeated for a selected number of iterations (indicated by the iterations
argument of run_pathfindR()
), which is performed in parallel via the R package foreach
.
The wrapper function returns a data frame that contains the lowest and the highest adjusted-p values for each enriched pathway, as well as the numbers of times each pathway is encountered over all iterations. The first two rows of the example output of the pathfindR-enrichment workflow (performed on the rheumatoid arthritis data RA_output
) is shown below:
ID | Pathway | Fold_Enrichment | occurrence | lowest_p | highest_p | Up_regulated | Down_regulated |
---|---|---|---|---|---|---|---|
hsa03040 | Spliceosome | 28.81102 | 10 | 1.50e-06 | 0.0000015 | SF3B6, LSM3, BUD31 | SNRPB, SF3B2, U2AF2, PUF60, HNRNPA1, PCBP1, SRSF5, SRSF8, SNU13, DDX23, EIF4A3 |
hsa05130 | Pathogenic Escherichia coli infection | 48.39947 | 10 | 1.83e-05 | 0.0024281 | LY96, TLR5 | ABL1, ITGB1, TUBB, ACTB, ACTG1 |
The function also creates an HTML report results.html
that is saved in a directory, by default named pathfindr_Results
but can be changed by changing the argument output_dir
, under the current working directory. This report contains links to two other HTML files:
This document contains a table of the active subnetwork-oriented pathway enrichment results. Each enriched pathway name is linked to the visualization of that pathway, with the gene nodes colored according to their log-fold-change values. This table contains the same information as the returned data frame. Columns are:
This document contains a table of converted gene symbols. Columns are:
The document contains a second table of genes for which no interactions were identified (after checking for alias symbols).
For this workflow, the wrapper function choose_clusters()
is used. This function first calculates the pairwise distances between the pathways in the input data frame, automatically determining the gene sets used for analysis. This step uses the distance metric described by Chen et al. [1]. By default, the function clusters the pathways using this distance matrix, automatically determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments:
data("RA_output")
RA_clustered <- choose_clusters(RA_output)
#> Calculating pairwise distances between pathways
#> Clustering pathways
#> Calculating the optimal number of clusters (based on average silhouette width)
#> The maximum average silhouette width was 0.38 for k = 11
#> Returning the resulting data frame
## First 2 rows of clustered pathways data frame
knitr::kable(head(RA_clustered, 2))
ID | Pathway | Fold_Enrichment | occurrence | lowest_p | highest_p | Up_regulated | Down_regulated | Cluster | Status | |
---|---|---|---|---|---|---|---|---|---|---|
4 | hsa04722 | Neurotrophin signaling pathway | 24.84890 | 4 | 0.0000506 | 0.0006648 | CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3 | 1 | Representative | |
7 | hsa04218 | Cellular senescence | 20.06031 | 2 | 0.0016810 | 0.0016810 | MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1 | 1 | Member |
## The 16 representative pathways
knitr::kable(RA_clustered[RA_clustered$Status == "Representative", ])
ID | Pathway | Fold_Enrichment | occurrence | lowest_p | highest_p | Up_regulated | Down_regulated | Cluster | Status | |
---|---|---|---|---|---|---|---|---|---|---|
4 | hsa04722 | Neurotrophin signaling pathway | 24.84890 | 4 | 0.0000506 | 0.0006648 | CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3 | 1 | Representative | |
24 | hsa03022 | Basal transcription factors | 130.09778 | 1 | 0.0293148 | 0.0293148 | GTF2B, GTF2H5 | TAF1L, TAF4, TAF15 | 2 | Representative |
3 | hsa03013 | RNA transport | 13.07564 | 10 | 0.0000326 | 0.0029739 | NUP214 | GEMIN4, EIF4A3, RNPS1, SRRM1, NUP62, NUP93, UBE2I, RANGAP1, SUMO3, EIF2S3, EIF2B1 | 3 | Representative |
1 | hsa03040 | Spliceosome | 28.81102 | 10 | 0.0000015 | 0.0000015 | SF3B6, LSM3, BUD31 | SNRPB, SF3B2, U2AF2, PUF60, HNRNPA1, PCBP1, SRSF5, SRSF8, SNU13, DDX23, EIF4A3 | 4 | Representative |
2 | hsa05130 | Pathogenic Escherichia coli infection | 48.39947 | 10 | 0.0000183 | 0.0024281 | LY96, TLR5 | ABL1, ITGB1, TUBB, ACTB, ACTG1 | 5 | Representative |
12 | hsa00020 | Citrate cycle (TCA cycle) | 261.35714 | 4 | 0.0077542 | 0.0077542 | MDH2, PDHA1, PDHB | 6 | Representative | |
29 | hsa00970 | Aminoacyl-tRNA biosynthesis | 197.78378 | 1 | 0.0381108 | 0.0381108 | IARS, SARS, YARS | 7 | Representative | |
19 | hsa05110 | Vibrio cholerae infection | 21.77976 | 5 | 0.0210036 | 0.0362485 | ATP6V0E1, ATP6V1D | ARF1, ATP6V0E2, ACTB, ACTG1, PDIA4 | 8 | Representative |
5 | hsa03420 | Nucleotide excision repair | 190.90435 | 1 | 0.0000986 | 0.0000986 | GTF2H5, POLE4 | POLD2, RPA1, XPC | 9 | Representative |
17 | hsa03050 | Proteasome | 170.18605 | 4 | 0.0176234 | 0.0176234 | PSMD7, PSMB10 | 10 | Representative | |
15 | hsa04130 | SNARE interactions in vesicular transport | 182.95000 | 1 | 0.0166367 | 0.0166367 | STX10, STX6 | BET1L, SNAP23, STX2 | 11 | Representative |
# to display the heatmap of pathway clustering
RA_clustered <- choose_clusters(RA_output, plot_heatmap = TRUE)
#> Calculating pairwise distances between pathways
#> Clustering pathways
#> Calculating the optimal number of clusters (based on average silhouette width)
#> The maximum average silhouette width was 0.38 for k = 11
#> Returning the resulting data frame
# to display the dendrogram and clusters
RA_clustered <- choose_clusters(RA_output, plot_dend = TRUE)
#> Calculating pairwise distances between pathways
#> Clustering pathways
#> Calculating the optimal number of clusters (based on average silhouette width)
#> The maximum average silhouette width was 0.38 for k = 11
#> Returning the resulting data frame
# to change agglomeration method (default = "average")
RA_clustered <- choose_clusters(RA_output, agg_method = "centroid")
#> Calculating pairwise distances between pathways
#> Clustering pathways
#> Calculating the optimal number of clusters (based on average silhouette width)
#> The maximum average silhouette width was 0.33 for k = 12
#> Returning the resulting data frame
Alternatively, manual selection of the height at which to cut the dendrogram can be performed. For this, the user should set the auto
parameter to FALSE
. Via a shiny app, presented as an HTML document, the hierarchical clustering dendrogram is visualized. In this HTML document, the user can select the agglomeration method and the distance value at which to cut the tree. The dendrogram with the cut-off value marked with a red line is dynamically visualized and the resulting cluster assignments of the pathways along with annotation of representative pathways (chosen by smallest lowest p value) are presented as a table. This table can be saved as a csv file by pressing the button Get Pathways w\ Cluster Info
. Example usage:
The function calculate_pw_scores
can be used to calculate the pathway scores per sample. This allows the user to individually examine the scores and infer whether a pathway is activated or repressed in a given sample.
For a set of pathways \(P = \{P_1, P_2, ... , P_k\}\), where each \(P_i\) contains a set of genes, i.e. \(P_i = \{g_1, g_2, ...\}\), the pathway score matrix \(PS\) is defined as:
\(PS_{p,s} = \frac{1}{k} \sum_{g \in P_p} GS_{g,s}\) for each pathway \(p\) and for each sample \(s\).
\(GS\) is the gene score per sample matrix and is defined as: \(GS_{g,s} = (EM_{g,s} - \bar{x}_g) / sd_g\) where \(EM\) is the expression matrix (columns are samples, rows are genes), \(\bar{x}_g\) is the mean expression value of the gene and \(sd_g\) is the standard deviaton of the expression values for the gene.
An example application is provided below:
## Pathway data frame
pws_table <- pathfindR::RA_clustered
# selecting "Representative" pathways for clear visualization
pws_table <- pws_table[pws_table$Status == "Representative", ]
## Expression matrix
exp_mat <- pathfindR::RA_exp_mat
## Vector of "Case" IDs
cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708",
"GSM389711", "GSM389714", "GSM389716", "GSM389717",
"GSM389719", "GSM389721", "GSM389722", "GSM389724",
"GSM389726", "GSM389727", "GSM389730", "GSM389731",
"GSM389733", "GSM389735")
## Calculate pathway scores and plot heatmap
score_matrix <- calculate_pw_scores(pws_table, exp_mat, cases)
It is possible to use run_pathfindR()
with custom gene sets. Here we provide an example application with a gene set consisting of targets of two transcription factors.
We first load and prepare the gene sets:
## CREB target genes
CREB_targets <- normalizePath(system.file("extdata/CREB.txt",
package = "pathfindR"))
CREB_targets <- read.delim(CREB_targets, skip = 2, header = FALSE, stringsAsFactors = FALSE)
CREB_targets <- CREB_targets$V1
## MYC target genes
MYC_targets <- normalizePath(system.file("extdata/MYC.txt",
package = "pathfindR"))
MYC_targets <- read.delim(MYC_targets, skip = 2, header = FALSE, stringsAsFactors = FALSE)
MYC_targets <- MYC_targets$V1
## Prep for use
custom_genes <- list(CREB_targets, MYC_targets)
names(custom_genes) <- c("TF1", "TF2")
custom_pathways <- c("CREB target genes", "MYC target genes")
names(custom_pathways) <- c("TF1", "TF2")
We next prepare an example input, because of the way we choose genes we expect significant enrichment for MYC targets (50 MYC target genes + 5 CREB target genes). Because this is only an example, we set the logFC values to 1.5 and the adj.P.Vals to 0.001:
set.seed(123)
selected_genes <- sample(MYC_targets, 50)
selected_genes <- c(selected_genes,
sample(CREB_targets, 5))
input <- data.frame(Gene.symbol = selected_genes,
logFC = 1.5,
adj.P.Val = 0.001)
head(input)
#> Gene.symbol logFC adj.P.Val
#> 1 ENO3 1.5 0.001
#> 2 SLC12A5 1.5 0.001
#> 3 HOXA1 1.5 0.001
#> 4 TFB2M 1.5 0.001
#> 5 UBXN1 1.5 0.001
#> 6 ANKRD13B 1.5 0.001
Finally, we run the wrapper script run_pathfindR
using the custom genes as the gene sets:
custom_result <- run_pathfindR(input,
gene_sets = "Custom",
custom_genes = custom_genes,
custom_pathways = custom_pathways,
iterations = 1)
ID | Pathway | Fold_Enrichment | occurrence | lowest_p | highest_p | Up_regulated | Down_regulated |
---|---|---|---|---|---|---|---|
TF2 | MYC target genes | 14.55031 | 1 | 0.0000000 | 0.0000000 | AGMAT, ANGPT2, ANKRD13B, BATF3, C19orf26, C2CD2L, COMMD8, CTSF, CUL5, DOPEY1, ELAVL3, ENO3, FCHSD2, GATA5, HMGN2, HMHA1, HOXA1, IGF2R, IL1RAPL1, IPO4, KAT5, MAFF, MCM8, MEPCE, MPV17, NCOA6, NIT1, OSR1, PDP2, PICALM, PPP1R3B, PPP1R3C, PRMT1, PUS1, RNF43, SEC11C, SLC12A5, SMC3, TCF4, TEF, TFB2M, THUMPD2, TIAL1, UBR4, UBXN1, UTY, YPEL5, ZNF771 | |
TF1 | CREB target genes | 35.35266 | 1 | 0.0037408 | 0.0037408 | AREG, CCBL1, MAFF, OSR1, SPAG9 |
The most significantly enriched gene set is, unsurprisingly, “MYC target genes”. Interestingly, through interaction information, pathfindR was also able to identify significant enrichment for “CREB target genes” as well.
It is also possible to perform pathfindR analysis using non-human organism annotation. For this, custom gene sets and a custom PIN belonging to the species-of-interest must be specified.
[1] Chen YA, Tripathi LP, Dessailly BH, Nyström-persson J, Ahmad S, Mizuguchi K. Integrated pathway clusters with coherent biological themes for target prioritisation. PLoS ONE. 2014;9(6):e99030. 10.1371/journal.pone.0099030
[2] Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18 Suppl 1:S233-40.
[3] Ozisik O, Bakir-Gungor B, Diri B, Sezerman OU. Active Subnetwork GA: A Two Stage Genetic Algorithm Approach to Active Subnetwork Search. Current Bioinformatics. 2017; 12(4):320-8. 10.2174/1574893611666160527100444