Introduction

The genetic information collected in genome-wide association studies (GWAS) is represented by the genotypes of various single-nucleotid polymorphisms (SNPS). Testing biological meaningful SNP Sets is a successful strategy for the evaluation of GWAS data, as it may increase power as well as interpretation of results. Via mapping of SNPs to genes forming a network, association between pathways and disease risk can be investigated.
Kernel methods are particularly well suited to cope with the challenges connected to the analysis of large SNP sets from GWAS data. They do not require to model a direct functional relationship between SNPs and effects, while at the same time can deal with high-dimensional data and allow for straightforward incorporation of covariates. The model for a logistic kernel machine regression of a pathway on a binary outcome is given by

\[logit(P(y_i=1|x_i,z_i))=x_i^t \beta + h(z_i) (1) \]

where \(y_i\) denotes the case or control status of individual i, \(x_i\) is the vector including informative covariates (such as age, sex, etc.) and \(z_i\) represents the genotypes of individual i. \(\beta\) is the regression coefficients for the parametric part of the model, while \(h(.)\) denotes an unknown function, non-parametrically incorporating the pathway's influence. The intercept is assumed to be included in \(x_i\). For more details see Liu et al (2008).

Different kernels have been proposed that convert the genomic information of two individuals into a quantitative value reflecting their genetic similarity. This package includes the linear kernel as well as two more advanced kernels, adjusting for size bias in the number of SNPs and genes in a pathway or incorporating the network structure of genes within the pathway, respectively. The kernel functions are described in more detail in the instructions below.

A variance component test, constructed around the similarity matrix, can be used to evaluate a pathway's influence on disease risk. In kangar00 p-values can be calculated with the Satterthwaite approximation or Davies method as described in Schaid (2010) and Davies (1980), respectively.

Data extraction and preparation

Pathways

The kangar00 package offers several functions for data extraction from internet databases. In the following they will be explained using the Circadian rhythm pathway as an example.

pathway gene_start gene_end chr gene
hsa04710 13276652 13387266 11 ARNTL
hsa04710 4979116 4985323 3 BHLHE40
hsa04710 26120026 26125127 12 BHLHE41
hsa04710

listing information on all genes KEGG assigned to Circadian rhythm.

Pathway object

In kangar00 all information on a specific pathway is combined in a pathway object. It includes

The following example creates a new pathway object, to which gene-interaction information has yet to be added

pathw <- pathway(id='hsa04710', adj=matrix(0), sign=as.vector(matrix(0)[matrix(0)!=0]))

Networkmatrix

The gene-gene interactions within pathways are represented by a network matrix. This quadratic matrix is of dimension equal to the number of genes in the corresponding pathway. It includes entries equal to \(1\) (representing an activation interaction), \(-1\) (denoting an inhibiting interaction) or \(0\) (no interaction).

A network matrix can be created using the function get_network_matrix(). Gene interaction information for a specific pathway is extracted from the KEGG database. It is accessed via the function retrieveKGML() from the KEGGgraph package. An internet connection is required for this step.

pathw_complete <- get_network_matrix(pathw, directed=FALSE)

will download the KEGG XML file for the pathway with ID 'hsa04710' and save it in the working directory. The function will convert the data into a network matrix and add it to the given pathway object. The expanded pathway object will be returned. The user can specify whether the gene-interaction matrix should be given directed (directed=TRUE) or undirected (directed=FALSE).

SNP positions

Kangar00 offers a function to download positions of the SNPs available in your GWAS dataset from the Ensembl database.

snp_info("rs234")  

will return a snp_info object containing the data frame

chr position rsnumber
7 105920689 rs234

Pathway Annotation

To define SNP sets representing a pathway, the function get_anno() can be used.

get_anno(snp_info, pathway_info)

will return a data frame listing all SNPs that lie inside the boundaries of one or more genes in the pathway. That means that genes can appear several times, depending on the number of SNPs mapped to them. A SNP can and will be mapped to multiple genes if they overlap. The data frame will have the following format

pathway gene chr snp position
hsa04710 CSNK1E 22 rs11089885 38413480
hsa04710 CSNK1E 22 rs13054361 38336819
hsa04710 CSNK1E 22 rs135757 38307648
hsa04710

GWAS data

Data from a case control study is needed to test a pathways influence on disease risk with the logistic kernel machine test in kangar00. Here, GWAS data is represented by the GWASdata object. It includes

A GWASdata object can be constructed as


my_gwas <- GWASdata(pheno=pheno, geno=geno, anno=anno, desc="study xy")


Calculation of Kernel Matrices

Once a GWASdata object is created, we can start to calculate kernel matrices to test a pathways influence on disease risk. Kangar00 offers three different kernel functions to compute a similarity matrix for the individuals in analysis. They will be explained in the following.

Linear Kernel (Lin)

The linear kernel assumes additive SNP effects. It is calculated as \[ZZ^t (2)\] where \(Z\) denotes the genotype matrix (See also Liu et al, 2010). In kangar00 a linear kernel can be created using the function kernel_lin(). It requires as arguments

K_lin <- lin_kernel(gwas, p, calculation='cpu')

will return a quadratic matrix of dimension equal to the number of individuals in the GWASdata object.

Size-adjusted Kernel (Sia)

The size-adjusted kernel takes into consideration the numbers of SNPs and genes in a pathway to correct for size bias. It is calculated as \[ K_{i,j} = exp( - \sqrt{ \frac{1}{r_p} } \sum_{g } ( \frac{|| z_i^g - z_j^g||} { \mu_g k_g^{eff}})^{\delta}_g ) (3)\]

Here \(z_i^g\) is the vector of individual i 's genotypes in gene \(g\) and \(r_p\) the number of genes in pathway \(p\). Scaling parameters \(k_g^{eff}\), \(\mu^g\) and \(\delta_g\) adjust for the number of genes in the pathway and the number of SNPs within these genes (for more details refer to Freytag et al. 2012).

A kernel of this type can be calculated using the function kernel_sia() with
the following arguments

K_sia <- sia_kernel(gwas, p, calculation='cpu')

will return a quadratic matrix of dimension equal to the number of individuals in the GWASdata object.

Network Kernel (Net)

The network kernel incorporates information about gene-gene interactions into the model. It is defined as \[ K = Z A N A^t Z^t (4)\]
where matrix \(A\) maps SNPs to genes, \(N\) represents the underlying network structure, and \(Z\) is the genotype matrix. The network based kernel matrix for a pathway can be calculated with the function kernel_net(). Following arguments are needed

K_net <- net_kernel(gwas, p, calculation='cpu')

will return a quadratic matrix of dimension equal to the number of individuals in the GWASdata object.

Alternatively, kernel matrices can be calculated using the function calc_kernel(). Here the kernel type is specified via an additional argument type. It can be set to lin, sia or net.

K <- calc_kernel(gwas, p, type='lin', parallel='none')

This function will simply call the suitable kernel function as described above and therefore has the same output.

Variance Component Test

A pathways influence on the probability of being a case is evaluated in a variance component test. The test statistic is \[ Q = \frac{1}{2} ( y - \mu)^t K ( y - \mu) (5)\] with \(\mu\) the vector of null model estimators given by \(\mu_i = logit^{-1} (x_i^t \beta)\) for an individual \(i\) and \(K\) a kernel matrix of the pathway to be tested. \(Q\) follows a mixture of \(X^2\) distributions which can be approximated using the Satterthwaite procedure (Schaid 2012) or Davies method as implemented in the R package QuadCompForm (Davies 1980). More details on the test can be found in Wu et al (2010).

In kangar00 the logistic kernel machine test can be applied to a SNP set defining a pathway with the function lkmt. It needs the following arguments

pval_net <- lkmt(pheno ~ 1+sex+age, K_mat, my_gwas, method='satt')

will return an object of type lkmt giving the test result for the pathway on which the kernel matrix 'K_mat' was calculated. The GWASdata object 'my_gwas' has to be the same as used to calculate the kernel matrix. The formula above would for example fit for a phenotype file of the following format (IDs in first column are always required in phenotype file)

ID pheno sex age smoker
ind1 1 1 41 1
ind2 0 0 38 0
ind3 1 1 56 1

note, that the columns to be used in the model are specified in the formula given to the lkmt() function and not all covariates have to be used.

References