Methods for genetic association, such as genome-wide association studies (GWAS), are wideley used to test for SNP level relationships for traits and disease. Accounting for multiple-comparisons of such techniques typically calls for conservative Bonferroni adjustements for the number of SNPs included in the analysis (often ~10^6). Commonly referred to as the Minimum P approach, a class level signal is often given by the minimum p-value of all SNPs within a class, where class is defined as as a meaningful segment of the genome, such as a protein coding gene. Therefore, only classes with a single strong SNP level signal are deemed significant. In this package we present tools for implementing an alternative method, known as Genetic Class Association Testing (GenCAT), which utilizing statistical signal from every SNP accross an entire class for each statistical test. Therefore, adjustments for multiple testing need only account for the number of classes used in analysis. This is performed by estimating the distribution of independent test statistics, via an eigen decomposition, using the SNP-level test statistics and genotypic correlation between SNPs. Class level test statistics are then calculated as the sum of the squared values of these independent signals. These test statistics follow a central \(chi^2\)-distribution, with degree of freedom equal to the number of independent signals within a class. More information on the GenCAT approach can be found at (Qian et al., 2015).
In this tutorial, we use SNP level results from the CARDIoGRAM consortium GWA meta-analysis (Schunkert et al., 2011). We define classes a protein coding genes, based on boundaries adapted from the KnownCanonical track from UCSC Genome Browser, build GRCh 37 (hg 19).
For the sake of time and memory, only select SNPs from chromosomes 13-15 are used.
library(GenCAT)
The class level approach is carried out by the GenCAT
function, which requires a six column data frame including information for SNP ID, effect allele (the allele which determines the direction of the test statistic from the SNP level association test), other allele, SNP level test statistic, chromosome number (1-22), and class label.
This package includes the function, map2class
, which can be used to assign class labels to SNPs based on their genomic coordinates. The first argument, coords
, is a data frame which defines the boundaries of each class in the genome. The second argument, SNPs
, defines the genomic coordinates of each SNP. A third argument, extend.boundary
, can also be used to specify class boundary extension. Refer to the help page (?GenCAT::map2class
) for specific formatting requirements.
In this example we map the CARDIoGRAM SNPs to protein coding genes, extending the class boundaries by 5000.
data('CardioData')
data('coords')
print(head(coords))
## chr start stop class
## 1 1 69090 70008 OR4F5
## 2 1 367658 368597 OR4F29-1
## 3 1 621095 622034 OR4F29-2
## 4 1 861120 879961 SAMD11
## 5 1 879582 894679 NOC2L
## 6 1 895966 901099 KLHL17
print(head(CardioData))
## SNP effect_allele other_allele testStat chr position
## 1 rs1000022 C T -0.05761287 13 100461219
## 2 rs1000160 C T -1.71388200 13 100950452
## 3 rs1000281 C T 1.19326833 15 72553422
## 4 rs1000290 C T -0.22856607 15 100949609
## 5 rs1000347 C T 0.34416833 15 26929735
## 6 rs1000415 C T 1.21364190 13 25031438
CardioMapped<-map2class(coords, CardioData, extend.boundary = 5000)
## [1] "Mapping SNPs to classes on chromosome 13."
## [1] "Mapping SNPs to classes on chromosome 14."
## [1] "Mapping SNPs to classes on chromosome 15."
print(head(CardioMapped))
## SNP effect_allele other_allele testStat chr position class
## 1 rs10492522 C T 1.4088115 13 19746331 TUBA3C
## 2 rs10492523 C T 0.1550574 13 19747750 TUBA3C
## 3 rs13313346 C T 0.8850154 13 19757529 TUBA3C
## 4 rs17076886 C T 0.5172858 13 19755612 TUBA3C
## 5 rs17076891 C G -0.5500411 13 19755726 TUBA3C
## 6 rs17076895 G A -0.3420408 13 19756051 TUBA3C
GenCAT
In order to perform the class level test, genotype data must be provided from a reference population. If available, this can simply be the genotype data used in the initial SNP level analysis. Otherwise, the genotype data should come from a similar representative population. This data is be specified using the genoData
and snpInfo
arguments. The former must be a SnpMatrix
object, defined by the snpStats package. The latter defines the chromosome number, and allele assignments for each SNP. If using the read.plink
or read.pedfile
functions to read in genotype data, these objects can be obtained respectively from the genotype
and map
elements of the resulting list object.
It is recommended that adequate SNP and sample level filtering be carried out on genotype data. SNP pairs with high missingness (low call rate), can result in missing correlations. If this happens GenCAT
will return an error.
For this example, we use SNPwise filtered genotype data from 1000 Genomes from 99 individuals of northern and western european ancestry (CEU).
data('geno')
genoData<-geno$genotypes
snpInfo<-geno$map
colnames(snpInfo)<-c('chr', 'SNP', 'gen.dist', 'position', 'A1', 'A2')
GenCATtest <- GenCAT(CardioMapped, genoData=genoData, snpInfo = snpInfo)
## Loading required package: snpStats
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.2.5
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.2.5
## Note: the specification for S3 class "AsIs" in package 'BiocGenerics' seems equivalent to one from package 'DBI': not turning on duplicate class definitions for this class.
## [1] "Running GenCAT on 304 classes on chromosome 13."
## [1] "Running GenCAT on 562 classes on chromosome 14."
## [1] "Running GenCAT on 529 classes on chromosome 15."
The GenCAT
function is run in parallel using functionality from the doParallel package. This will support parallel processing in both Unix based and Windows machines. The number of processes to run can be specified by the workers
argument (set to 2 by default).
The pcCutoff
(set to 0.95 be default) argument specifies the proportion of the variability in the SNP-wise genotype correlation matrices for which to account. Using a value less than 1 will result in a dimension reduction, meaning the number of independent test statistics estimated by the eigendecomposition can be less than the total number of SNPs in a class. Refer to (Qian et al., 2015) for more information.
The GenCAT
function creates a list object of class GenCATtest
, which has five elements.
print(str(GenCATtest))
## List of 5
## $ GenCAT :'data.frame': 1389 obs. of 6 variables:
## ..$ class : chr [1:1389] "TUBA3C" "TPTE2" "MPHOSPH8" "PSPC1" ...
## ..$ chr : num [1:1389] 13 13 13 13 13 13 13 13 13 13 ...
## ..$ n_SNPs: num [1:1389] 41 29 37 34 32 42 36 5 12 134 ...
## ..$ n_Obs : num [1:1389] 6 4 4 3 2 4 9 3 5 13 ...
## ..$ CsumT : num [1:1389] 4.74 1.08 5.29 3.72 3.69 ...
## ..$ CsumP : num [1:1389] 0.578 0.897 0.259 0.293 0.158 ...
## $ Used :'data.frame': 87158 obs. of 9 variables:
## ..$ SNP : chr [1:87158] "rs10492522" "rs10492523" "rs17076886" "rs17076891" ...
## ..$ effect_allele: chr [1:87158] "C" "C" "C" "C" ...
## ..$ other_allele : chr [1:87158] "T" "T" "T" "G" ...
## ..$ testStat : num [1:87158] 1.409 0.155 0.517 -0.55 -0.593 ...
## ..$ class : chr [1:87158] "TUBA3C" "TUBA3C" "TUBA3C" "TUBA3C" ...
## ..$ chr : num [1:87158] 13 13 13 13 13 13 13 13 13 13 ...
## ..$ position : num [1:87158] 19746331 19747750 19755612 19755726 19756063 ...
## ..$ A1 : chr [1:87158] "C" "T" "C" "G" ...
## ..$ A2 : chr [1:87158] "T" "C" "T" "C" ...
## $ notFound :'data.frame': 11382 obs. of 5 variables:
## ..$ SNP : chr [1:11382] "rs13313346" "rs17076895" "rs3751423" "rs479287" ...
## ..$ effect_allele: chr [1:11382] "C" "G" "C" "G" ...
## ..$ other_allele : chr [1:11382] "T" "A" "T" "A" ...
## ..$ testStat : num [1:11382] 0.885 -0.342 -1.062 -0.477 -0.543 ...
## ..$ class : chr [1:11382] "TUBA3C" "TUBA3C" "TUBA3C" "TUBA3C" ...
## $ unMatched :'data.frame': 42 obs. of 9 variables:
## ..$ SNP : chr [1:42] "rs11259842" "rs3088277" "rs3935485" "rs6422414" ...
## ..$ effect_allele: chr [1:42] "G" "G" "C" "C" ...
## ..$ other_allele : chr [1:42] "A" "A" "T" "T" ...
## ..$ testStat : num [1:42] 0.822 -0.441 -0.163 0.453 0.415 ...
## ..$ class : chr [1:42] "TMEM255B" "TMEM255B" "TMEM255B" "TMEM255B" ...
## ..$ chr : num [1:42] 13 13 13 13 13 13 13 13 13 13 ...
## ..$ position : num [1:42] 1.15e+08 1.15e+08 1.14e+08 1.14e+08 1.14e+08 ...
## ..$ A1 : chr [1:42] "C" "C" "G" "G" ...
## ..$ A2 : chr [1:42] "T" "T" "A" "A" ...
## $ TransStats:'data.frame': 8839 obs. of 2 variables:
## ..$ class : chr [1:8839] "TUBA3C" "TUBA3C" "TUBA3C" "TUBA3C" ...
## ..$ transStat: num [1:8839] 0.482 -1.819 0.185 -0.8 0.674 ...
## - attr(*, "class")= chr "GenCATtest"
## NULL
The GenCAT
element contains the results of the class level test, including test statistic(CsumT
), p-value(CsumP
), and number of independent signals after the dimension reduction(N_obs
). The Used
element contains information for SNPs included in class level testing. The notFound
and unMatched
elements contain information for SNPs which were removed because they either couldn’t be found in reference population, or the there was one or more allele assignment that wasn’t in the reference population’s allele assignments for that SNP. The last element, TransStats
contains the independent test statistic estimates for each class.
The GenCATtest
object created by the GenCAT
function can be immediately used with the GenCAT_manhattan
function. This will plot the \(-log_{10}(p-value)\) for each class. Giving a value to the sigThresh
argument will draw a horizontal line at \(-log_{10}\) of this value. Aditionally, classes with strong signal can be highlighted and labeled by setting arguments, highlightPosi
and labelPosi
, to TRUE
.
GenCAT_manhattan(GenCATtest, sigThresh = (0.05/nrow(GenCATtest$GenCAT)),
highlightPosi = TRUE, labelPosi = TRUE)