Overview

Methods for genetic association, such as genome-wide association studies (GWAS), are wideley used to test for SNP level relationships for traits and disease. Accounting for multiple-comparisons of such techniques typically calls for conservative Bonferroni adjustements for the number of SNPs included in the analysis (often ~10^6). Commonly referred to as the Minimum P approach, a class level signal is often given by the minimum p-value of all SNPs within a class, where class is defined as as a meaningful segment of the genome, such as a protein coding gene. Therefore, only classes with a single strong SNP level signal are deemed significant. In this package we present tools for implementing an alternative method, known as Genetic Class Association Testing (GenCAT), which utilizing statistical signal from every SNP accross an entire class for each statistical test. Therefore, adjustments for multiple testing need only account for the number of classes used in analysis. This is performed by estimating the distribution of independent test statistics, via an eigen decomposition, using the SNP-level test statistics and genotypic correlation between SNPs. Class level test statistics are then calculated as the sum of the squared values of these independent signals. These test statistics follow a central \(chi^2\)-distribution, with degree of freedom equal to the number of independent signals within a class. More information on the GenCAT approach can be found at (Qian et al., 2015).

In this tutorial, we use SNP level results from the CARDIoGRAM consortium GWA meta-analysis (Schunkert et al., 2011). We define classes a protein coding genes, based on boundaries adapted from the KnownCanonical track from UCSC Genome Browser, build GRCh 37 (hg 19).

For the sake of time and memory, only select SNPs from chromosomes 13-15 are used.

library(GenCAT)

Data Preparation

The class level approach is carried out by the GenCAT function, which requires a six column data frame including information for SNP ID, effect allele (the allele which determines the direction of the test statistic from the SNP level association test), other allele, SNP level test statistic, chromosome number (1-22), and class label.

Mapping SNPs to classes

This package includes the function, map2class, which can be used to assign class labels to SNPs based on their genomic coordinates. The first argument, coords, is a data frame which defines the boundaries of each class in the genome. The second argument, SNPs, defines the genomic coordinates of each SNP. A third argument, extend.boundary, can also be used to specify class boundary extension. Refer to the help page (?GenCAT::map2class) for specific formatting requirements.

In this example we map the CARDIoGRAM SNPs to protein coding genes, extending the class boundaries by 5000.

data('CardioData')
data('coords')

print(head(coords))
##   chr  start   stop    class
## 1   1  69090  70008    OR4F5
## 2   1 367658 368597 OR4F29-1
## 3   1 621095 622034 OR4F29-2
## 4   1 861120 879961   SAMD11
## 5   1 879582 894679    NOC2L
## 6   1 895966 901099   KLHL17
print(head(CardioData))
##         SNP effect_allele other_allele    testStat chr  position
## 1 rs1000022             C            T -0.05761287  13 100461219
## 2 rs1000160             C            T -1.71388200  13 100950452
## 3 rs1000281             C            T  1.19326833  15  72553422
## 4 rs1000290             C            T -0.22856607  15 100949609
## 5 rs1000347             C            T  0.34416833  15  26929735
## 6 rs1000415             C            T  1.21364190  13  25031438
CardioMapped<-map2class(coords, CardioData, extend.boundary = 5000)
## [1] "Mapping SNPs to classes on chromosome 13."
## [1] "Mapping SNPs to classes on chromosome 14."
## [1] "Mapping SNPs to classes on chromosome 15."
print(head(CardioMapped))
##          SNP effect_allele other_allele   testStat chr position  class
## 1 rs10492522             C            T  1.4088115  13 19746331 TUBA3C
## 2 rs10492523             C            T  0.1550574  13 19747750 TUBA3C
## 3 rs13313346             C            T  0.8850154  13 19757529 TUBA3C
## 4 rs17076886             C            T  0.5172858  13 19755612 TUBA3C
## 5 rs17076891             C            G -0.5500411  13 19755726 TUBA3C
## 6 rs17076895             G            A -0.3420408  13 19756051 TUBA3C

Class level testing

Running GenCAT

In order to perform the class level test, genotype data must be provided from a reference population. If available, this can simply be the genotype data used in the initial SNP level analysis. Otherwise, the genotype data should come from a similar representative population. This data is be specified using the genoData and snpInfo arguments. The former must be a SnpMatrix object, defined by the snpStats package. The latter defines the chromosome number, and allele assignments for each SNP. If using the read.plink or read.pedfile functions to read in genotype data, these objects can be obtained respectively from the genotype and map elements of the resulting list object.

It is recommended that adequate SNP and sample level filtering be carried out on genotype data. SNP pairs with high missingness (low call rate), can result in missing correlations. If this happens GenCAT will return an error.

For this example, we use SNPwise filtered genotype data from 1000 Genomes from 99 individuals of northern and western european ancestry (CEU).

data('geno')

genoData<-geno$genotypes
snpInfo<-geno$map
colnames(snpInfo)<-c('chr', 'SNP', 'gen.dist', 'position', 'A1', 'A2')

GenCATtest <- GenCAT(CardioMapped, genoData=genoData, snpInfo = snpInfo)
## Loading required package: snpStats
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.2.5
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.2.5
## Note: the specification for S3 class "AsIs" in package 'BiocGenerics' seems equivalent to one from package 'DBI': not turning on duplicate class definitions for this class.
## [1] "Running GenCAT on 304 classes on chromosome 13."
## [1] "Running GenCAT on 562 classes on chromosome 14."
## [1] "Running GenCAT on 529 classes on chromosome 15."

Additional arguments

The GenCAT function is run in parallel using functionality from the doParallel package. This will support parallel processing in both Unix based and Windows machines. The number of processes to run can be specified by the workers argument (set to 2 by default).

The pcCutoff (set to 0.95 be default) argument specifies the proportion of the variability in the SNP-wise genotype correlation matrices for which to account. Using a value less than 1 will result in a dimension reduction, meaning the number of independent test statistics estimated by the eigendecomposition can be less than the total number of SNPs in a class. Refer to (Qian et al., 2015) for more information.

Output

The GenCAT function creates a list object of class GenCATtest, which has five elements.

print(str(GenCATtest))
## List of 5
##  $ GenCAT    :'data.frame':  1389 obs. of  6 variables:
##   ..$ class : chr [1:1389] "TUBA3C" "TPTE2" "MPHOSPH8" "PSPC1" ...
##   ..$ chr   : num [1:1389] 13 13 13 13 13 13 13 13 13 13 ...
##   ..$ n_SNPs: num [1:1389] 41 29 37 34 32 42 36 5 12 134 ...
##   ..$ n_Obs : num [1:1389] 6 4 4 3 2 4 9 3 5 13 ...
##   ..$ CsumT : num [1:1389] 4.74 1.08 5.29 3.72 3.69 ...
##   ..$ CsumP : num [1:1389] 0.578 0.897 0.259 0.293 0.158 ...
##  $ Used      :'data.frame':  87158 obs. of  9 variables:
##   ..$ SNP          : chr [1:87158] "rs10492522" "rs10492523" "rs17076886" "rs17076891" ...
##   ..$ effect_allele: chr [1:87158] "C" "C" "C" "C" ...
##   ..$ other_allele : chr [1:87158] "T" "T" "T" "G" ...
##   ..$ testStat     : num [1:87158] 1.409 0.155 0.517 -0.55 -0.593 ...
##   ..$ class        : chr [1:87158] "TUBA3C" "TUBA3C" "TUBA3C" "TUBA3C" ...
##   ..$ chr          : num [1:87158] 13 13 13 13 13 13 13 13 13 13 ...
##   ..$ position     : num [1:87158] 19746331 19747750 19755612 19755726 19756063 ...
##   ..$ A1           : chr [1:87158] "C" "T" "C" "G" ...
##   ..$ A2           : chr [1:87158] "T" "C" "T" "C" ...
##  $ notFound  :'data.frame':  11382 obs. of  5 variables:
##   ..$ SNP          : chr [1:11382] "rs13313346" "rs17076895" "rs3751423" "rs479287" ...
##   ..$ effect_allele: chr [1:11382] "C" "G" "C" "G" ...
##   ..$ other_allele : chr [1:11382] "T" "A" "T" "A" ...
##   ..$ testStat     : num [1:11382] 0.885 -0.342 -1.062 -0.477 -0.543 ...
##   ..$ class        : chr [1:11382] "TUBA3C" "TUBA3C" "TUBA3C" "TUBA3C" ...
##  $ unMatched :'data.frame':  42 obs. of  9 variables:
##   ..$ SNP          : chr [1:42] "rs11259842" "rs3088277" "rs3935485" "rs6422414" ...
##   ..$ effect_allele: chr [1:42] "G" "G" "C" "C" ...
##   ..$ other_allele : chr [1:42] "A" "A" "T" "T" ...
##   ..$ testStat     : num [1:42] 0.822 -0.441 -0.163 0.453 0.415 ...
##   ..$ class        : chr [1:42] "TMEM255B" "TMEM255B" "TMEM255B" "TMEM255B" ...
##   ..$ chr          : num [1:42] 13 13 13 13 13 13 13 13 13 13 ...
##   ..$ position     : num [1:42] 1.15e+08 1.15e+08 1.14e+08 1.14e+08 1.14e+08 ...
##   ..$ A1           : chr [1:42] "C" "C" "G" "G" ...
##   ..$ A2           : chr [1:42] "T" "T" "A" "A" ...
##  $ TransStats:'data.frame':  8839 obs. of  2 variables:
##   ..$ class    : chr [1:8839] "TUBA3C" "TUBA3C" "TUBA3C" "TUBA3C" ...
##   ..$ transStat: num [1:8839] 0.482 -1.819 0.185 -0.8 0.674 ...
##  - attr(*, "class")= chr "GenCATtest"
## NULL

The GenCAT element contains the results of the class level test, including test statistic(CsumT), p-value(CsumP), and number of independent signals after the dimension reduction(N_obs). The Used element contains information for SNPs included in class level testing. The notFound and unMatched elements contain information for SNPs which were removed because they either couldn’t be found in reference population, or the there was one or more allele assignment that wasn’t in the reference population’s allele assignments for that SNP. The last element, TransStats contains the independent test statistic estimates for each class.

Visualization

Creating Manhattan plots of class level results

The GenCATtest object created by the GenCAT function can be immediately used with the GenCAT_manhattan function. This will plot the \(-log_{10}(p-value)\) for each class. Giving a value to the sigThresh argument will draw a horizontal line at \(-log_{10}\) of this value. Aditionally, classes with strong signal can be highlighted and labeled by setting arguments, highlightPosi and labelPosi, to TRUE.

GenCAT_manhattan(GenCATtest, sigThresh = (0.05/nrow(GenCATtest$GenCAT)), 
highlightPosi = TRUE, labelPosi = TRUE)