The NetRep package provides functions for assessing the preservation of network modules across datasets.
This type of analysis is suitable where networks can be meaningfully inferred from multiple datasets. These include gene coexpression networks, protein-protein interaction networks, and microbial co-occurence networks. Modules within these networks consist of groups of nodes that are particularly interesting: for example a group of tightly connected genes associated with a disease, groups of genes annotated with the same term in the Gene Ontology database, or groups of interacting microbial species, i.e. communities.
Application of this method can answer questions such as:
A typical workflow for a NetRep analysis will usually contain the following steps, usually as separate scripts.
NetRep and its dependencies require several third party libraries to be installed. If not found, installation of the package will fail.
NetRep requires:
C++11
support for the <thread>
libary.fortran
compiler.BLAS
and LAPACK
libraries.The following sections provide operating system specific advice for getting NetRep working if installation through R fails.
The necessary fortran
and C++11
compilers are provided with the Xcode
application and subsequent installation of Command line tools
. The most recent version of OSX should prompt you to install these tools when installing the devtools
package from RStudio. Those with older versions of OSX should be able to install these tools by typing the following command into their Terminal application: xcode-select --install
.
Some users on OSX Mavericks have reported that even after this step they receive errors relating to -lgfortran
or -lquadmath
. This is reportedly solved by installing the version of gfortran
used to compile the R binary for OSX: gfortran-4.8.2
. This can be done using the following commands in your Terminal
application:
curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /
For Windows users NetRep requires R version 3.3.0 or later. The necessary fortran
and C++11
compilers are provided with the Rtools
program. We recommend installation of NetRep
through RStudio
, which should prompt the user and install these tools when running devtools::install_github("InouyeLab/NetRep")
. You may need to run this command again after Rtools
finishes installing.
If installation fails when compiling NetRep at permutations.cpp
with an error about namespace thread
, you will need to install a newer version of your compiler that supports this C++11
feature. We have found that this works on versions of gcc
as old as gcc-4.6.3
.
If installation fails prior to this step it is likely that you will need to install the necessary compilers and libraries, then reinstall R. For C++
and fortran
compilers we recommend installing g++
and gfortran
from the appropriate package manager for your operating system (e.g. apt-get
for Ubuntu). BLAS
and LAPACK
libraries can be installed by installing libblas-dev
and liblapack-dev
. Note that these libraries must be installed prior to installation of R.
Any NetRep analysis requires the following data to be provided and pre-computed for each dataset:
There are many different approaches to network inference and module detection. For gene expression data, we recommend using Weighted Gene Coexpression Network Analysis through the WGCNA package. For microbial abundance data we recommend the Python program SparCC. Microbial communities (modules) can then be defined as any group of significantly co-occuring microbes.
For this vignette, we will use gene expression data simulated for two independent cohorts. The discovery dataset was simulated to contain four modules of varying size, two of which (Modules 1 and 4) replicate in the test dataset.
Details of the simulation are provided in the documentation for the package data (see help("NetRep-data")
).
This data is provided with the NetRep package:
library("NetRep")
data("NetRep")
This command loads seven objects into the R session:
discovery_data
: a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the discovery dataset.discovery_correlation
: a matrix with 150 columns and 150 rows containing the correlation-coefficients between each pair of genes calculated from the discovery_data
matrix.discovery_network
: a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the discovery dataset.module_labels
: a named vector with 150 entries containing the module assignment for each gene as identified in the discovery dataset. Here, we’ve given genes that are not part of any module/group the label “0”.test_data
: a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the test dataset.test_correlation
: a matrix with 150 columns and 150 rows containing the correlation-coefficients between each pair of genes calculated from the test_data
matrix.test_network
: a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the test dataset.Next, we will combine these objects into list structures. All functions in the NetRep package take the following arguments:
network
: a list of interaction networks, one for each dataset.data
: a list of data matrices used to infer those networks, one for each dataset.correlation
: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.moduleAssignments
: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.modules
: a list of vectors, one vector for each discovery dataset, containing the names of the modules from that dataset to run the function on.discovery
: a vector indicating the names or indices to use as the discovery datasets in the network
, data
, correlation
, moduleAssignments
, and modules
arguments.test
: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network
, data
, and correlation
argument lists to use as the test dataset(s) for the analysis of each discovery dataset.Each of these lists may contain any number of datasets. The names provided to each list are used by the discovery
and test
arguments to determine which datasets to compare. More than one dataset can be specified in each of these arguments, for example when performing a pairwise analysis of gene coexpression modules identified in multiple tissues.
Typically we would put the code that reads in our data and sets up the input lists in its own script. This loading script can then be called from our scripts where we calculate the module preservation, visualise our networks, and calculate the network properties:
# Read in the data:
data("NetRep")
# Set up the input data structures for NetRep. We will call these datasets
# "cohort1" and "cohort2" to avoid confusion with the "discovery" and "test"
# arguments in NetRep's functions:
data_list <- list(cohort1=discovery_data, cohort2=test_data)
correlation_list <- list(cohort1=discovery_correlation, cohort2=test_correlation)
network_list <- list(cohort1=discovery_network, cohort2=test_network)
# We do not need to set up a list for the 'moduleAssignments', 'modules', or
# 'test' arguments because there is only one "discovery" dataset.
We will call these “cohort1” and “cohort2” to avoid confusion with the arguments “discovery” and “test” common to NetRep’s functions.
Now we will use NetRep to permutation test whether the network topology of each module is preserved in our test dataset using the modulePreservation
function. This function calculates seven module preservation statistics for each module (more on these later), then performs a permutation procedure in the test dataset to determine whether these statistics are significant.
We will run 10,000 permutations, and split calculation across 2 threads so that calculations are run in parallel. By default, modulePreservaton
will test the preservation of all modules, excluding the network background which is assumed to have the label “0”. This of course can be changed: there are many more arguments than shown here which control how modulePreservation
runs. See help("modulePreservation")
for a full list of arguments.
# Assess the preservation of modules in the test dataset.
preservation <- modulePreservation(
network=network_list, data=data_list, correlation=correlation_list,
moduleAssignments=module_labels, discovery="cohort1", test="cohort2",
nPerm=10000, nThreads=2
)
## [2018-06-12 12:08:40 BST] Validating user input...
## [2018-06-12 12:08:40 BST] Checking matrices for problems...
## [2018-06-12 12:08:40 BST] Input ok!
## [2018-06-12 12:08:40 BST] Calculating preservation of network subsets from dataset "cohort1" in
## dataset "cohort2".
## [2018-06-12 12:08:40 BST] Pre-computing network properties in dataset "cohort1"...
## [2018-06-12 12:08:40 BST] Calculating observed test statistics...
## [2018-06-12 12:08:40 BST] Generating null distributions from 10000 permutations using 2
## threads...
##
##
0% completed.
19% completed.
39% completed.
58% completed.
78% completed.
97% completed.
100% completed.
##
## [2018-06-12 12:08:46 BST] Calculating P-values...
## [2018-06-12 12:08:46 BST] Collating results...
## [2018-06-12 12:08:46 BST] Done!
The results returned by modulePreservation
for each dataset comparison are a list containing seven elements:
nulls
the null distribution for each statistic and module generated by the permutation procedure.observed
the observed value of each module preservation statistic for each module.p.values
the p-values for each module preservation statistic for each module.nVarsPresent
the number of variables in the discovery dataset that had corresponding measurements in the test dataset.propVarsPresent
the proportion of nodes in each module that had corresponding measurements in the test dataset.totalSize
the total number of nodes in the discovery network.alternative
the alternate hypothesis used in the test (e.g. “the module preservation statistics are higher than expected by chance”).If the test dataset has also had module discovery performed in it, a contigency table tabulating the overlap in module content between the two datasets is returned.
Let’s take a look at our results:
preservation$observed
## avg.weight coherence cor.cor cor.degree cor.contrib avg.cor avg.contrib
## 1 0.161069393 0.6187688 0.78448573 0.90843993 0.8795006 0.550004272 0.76084777
## 2 0.001872928 0.1359063 0.17270312 -0.03542772 0.5390504 0.034040922 0.23124826
## 3 0.001957475 0.1263280 0.01121223 -0.17179855 -0.1074944 -0.007631867 0.05412794
## 4 0.046291489 0.4871179 0.32610667 0.68122446 0.5251965 0.442614173 0.68239136
preservation$p.value
## avg.weight coherence cor.cor cor.degree cor.contrib avg.cor avg.contrib
## 1 0.00009999 0.00009999 0.00009999 0.00009999 0.00009999 0.00009999 0.00009999
## 2 0.98050195 0.96830317 0.00949905 0.56404360 0.00339966 0.01709829 0.00699930
## 3 0.98930107 0.98420158 0.42055794 0.80631937 0.71312869 0.99090091 0.87561244
## 4 0.00009999 0.00009999 0.00009999 0.00009999 0.00019998 0.00009999 0.00009999
For now, we will consider all statistics equally important, so we will consider a module to be preserved in “cohort2” if all the statistics have a permutation test P-value < 0.01:
# Get the maximum permutation test p-value
max_pval <- apply(preservation$p.value, 1, max)
max_pval
## 1 2 3 4
## 0.00009999 0.98050195 0.99090091 0.00019998
Only modules 1 and 4 are reproducible at this significance threshold.
So what do these statistics measure? Let’s take a look at the network topology of Module 1 in the discovery dataset, “cohort1”:
Network topology of Module 1 in the discovery dataset (“cohort1”).
From top to bottom, the plot shows:
Now, let’s take a look at the topology of Module 1 in the discovery and the test datasets side by side along with the module preservation statistics: