Statistics for de novo variant analysis

Introduction

This package provides functions to analyse de novo genetic variants using the statistical framework described in Samocha et al (2014) Nature Genetics 10.1038/ng.3050.

This vignette demonstrates the usage of the denovolyzeR package to recapitulate the analyses dectribed in this paper.

  1. What is the overall burden of de novo variation in the study cohort: are there more de novos than expected? As well as looking at the total burden of de novos, results are returned for different classes of variant, such as loss-of-function (lof), missense (mis) and synonymous (syn).
  2. Do de novo variants cluster in specific genes: are there more genes containing multiple de novos than expected?
  3. Are there any individual genes that contain more de novos than expected?

Installation

# Install the package if you haven't already.
# OPTION 1 - download binary, and install:
# install.packages("LOCAL/PATH/TO/denovolyzeR.tgz",repos=NULL)  

# OPTION 2 - install from source.  Either download and install, or use devtools:
# devtools::install_github("jamesware/denovolyzeR")

Example analysis

We start with a table of de novo variants. An example dataset is provided:

library(denovolyzeR)
# have a look at the example data:
dim(autismDeNovos)
## [1] 1040    2
head(autismDeNovos)
##      gene class
## 1  BCORL1   mis
## 2  SPANXD   mis
## 3   GLRA2   mis
## 4 RPS6KA3   non
## 5    TSR2   mis
## 6   GNL3L   syn

Overall de novo burden

First we want to know whether there are more de novos than expected, using the denovolyzeByClass() function. These variants were obtained by sequencing 1,078 cases, so we use nsamples=1078.

denovolyzeByClass(genes=autismDeNovos$gene,
                  classes=autismDeNovos$class,
                  nsamples=1078)
##   class observed expected enrichment   pValue
## 1   syn      254    302.3      0.840 0.998000
## 2   mis      655    679.0      0.965 0.826000
## 3   lof      131     94.3      1.390 0.000199
## 4  prot      786    773.2      1.020 0.328000
## 5   all     1040   1075.5      0.967 0.864000

The total number of de novos is almost exactly as our model predicts. However, we see a statistically significant excess of LOF variants in this population.

Genes containing multiple de novos

Next, we look to see if the total number of genes that contain more than one de novo is greater than expected, using the denovolyzeMultiHits() function.

denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078)
##      obs expMean expMax pValue nVars
## syn    3     3.0      9   0.61   254
## mis   31    19.5     32   0.02   655
## lof    5     1.2      5   0.01   131
## prot  43    27.9     40   0.00   786
## all   66    47.0     61   0.00  1040

ObsGenes = the number of genes in our dataset with >1 de novo variant
TotalObsDNM = the total number of de novo variants in each class
AvgExpGenes = the expected number of genes containing >1 de novo: an average obtained by permutation
MaxExpGenes = the maximum number of genes containin >1 de novo in nperms permutations (default nperms=100)
Empirical.P = an empirical p value
Note that the number of observed genes with >1 protein-altering variant does not equal the number of genes with >1 lof + number of genes with >1 missense, as genes containing 1 lof + 1 missense will only be counted as “multihits” in the combined analysis.

Here it looks like there may be an excess of genes with >1 lof variant, >1 missense, and >1 protein-altering variant. We will want to increase the number if permutations here to get a handle on our level of significance.

denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078,
                    nperms=1000)
##      obs expMean expMax pValue nVars
## syn    3     3.3     10  0.647   254
## mis   31    20.5     34  0.012   655
## lof    5     1.0      4  0.000   131
## prot  43    28.2     45  0.004   786
## all   66    47.8     68  0.003  1040

There is another important option here. The expected number of genes containing >1 hit is obtained by permutation: Given n de novo variants, how many genes contain >1 de novo? There are two options for selecting n: by default it is derived from your data: e.g. in the example above autismDeNovos contains 131 lof variants, so this is the number used in the permutation. This is controlled by the default parameter nVars="actual"

sum(autismDeNovos$class %in% c("frameshift","non","splice"))
## [1] 131

This is a conservative approach, addressing the question: “given the number of variants in our dataset, do we see more genes with >1 variant than expected?”

An alternative approach simply asks whether there are more genes with >1 variant than our de novo model predicts. This is accessed by setting nVars="expected".

denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078,
                    nperms=1000,
                    nVars="expected")
##      obs expMean expMax pValue      nVars
## syn    3     4.6     12  0.855  302.27646
## mis   31    21.6     35  0.018  678.98816
## lof    5     0.5      4  0.000   94.26139
## prot  43    27.4     45  0.001  773.24955
## all   66    50.7     73  0.010 1075.52601

Do any individual genes contain more de novos than expected

We see 6 genes containing >1 de novo lof variant. This is more than expected, but are any of these genes individually significant? We can denovolyzeByGene() to find out.

By default this function compares the number of LOF variants against expectation for each gene, and then the total number of protein-altering variants (LOF + missense). It can also be configured to return other classes if relevant.

head(
denovolyzeByGene(genes=autismDeNovos$gene,
                 classes=autismDeNovos$class,
                 nsamples=1078)
  )
##          lof.obs lof.exp lof.pValue prot.obs prot.exp prot.pValue
## DYRK1A         3       0   2.69e-08        3      0.1    2.77e-05
## SCN2A          3       0   1.83e-06        5      0.1    3.70e-07
## CHD8           3       0   7.19e-07        4      0.2    2.44e-05
## RFX8           0       0   1.00e+00        2      0.0    2.34e-05
## SUV420H1       1       0   6.37e-03        3      0.1    3.17e-05
## POGZ           2       0   1.23e-04        2      0.1    5.07e-03

Several genes meet statistical significance after correcting for multiple testing. Default options apply two tests across 0 genes, so a Bonferroni corrected p-value threshold at \(\alpha\) = 0.05 would be

Geneset analysis

The analyses presented so far have been exome-wide. It may be appropriate to restrict analyses to a geneset of interest - for example, it may be relevant to examine the burden of de novo variation in a pathway of interest, or initial variant detection may have been restricted to a set of candidate genes (rather than whole exome sequencing). All of the above funtions can be targeted to a subset of genes using the includeGenes argument.

The package includes as an example a list of nrow(fmrpGenes) genes that interact with the fragile X mental retardation protein (FMRP). Is this geneset enriched for de novos, and recurrent de novos, in our autism trios?

nrow(fmrpGenes); head(fmrpGenes)
## [1] 837
##            ensgID          enstID hgncID hgncSymbol geneName
## 1 ENSG00000142599 ENST00000337907   9965       RERE     RERE
## 2 ENSG00000149527 ENST00000449969  29037      PLCH2    PLCH2
## 3 ENSG00000078369 ENST00000378609   4396       GNB1     GNB1
## 4 ENSG00000157933 ENST00000378536  10896        SKI      SKI
## 5 ENSG00000171735 ENST00000303635  18806     CAMTA1   CAMTA1
## 6 ENSG00000188157 ENST00000379370    329       AGRN     AGRN
denovolyzeByClass(genes=autismDeNovos$gene,
                  classes=autismDeNovos$class,
                  nsamples=1078,
                  includeGenes=fmrpGenes$geneName)
##   class observed expected enrichment   pValue
## 1   syn       28     33.6      0.835 8.53e-01
## 2   mis       83     74.4      1.110 1.74e-01
## 3   lof       32      9.1      3.500 3.18e-09
## 4  prot      115     83.6      1.380 6.47e-04
## 5   all      143    117.1      1.220 1.13e-02
denovolyzeMultiHits(genes=autismDeNovos$gene,
                    classes=autismDeNovos$class,
                    nsamples=1078,
                    nperms=1000,
                    includeGenes=fmrpGenes$geneName)
## Warning in denovolyzeMultiHits(genes = autismDeNovos$gene, classes =
## autismDeNovos$class, : De novo list includes 897 genes not specified for
## inclusion. These will not be analysed.
##      obs expMean expMax pValue nVars
## syn    0     0.6      3  1.000    28
## mis    9     5.1     14  0.042    83
## lof    3     1.0      4  0.070    32
## prot  15     9.5     20  0.038   115
## all   20    14.0     25  0.035   143

Other functions

viewProbabilityTable provides access to the underlying de novo probability tables used to calculate expected de novo burdens throughout this package:

head(
  viewProbabilityTable()
  )
##   hgncID hgncSymbol          enstID          ensgID geneName          syn
## 1      5       A1BG ENST00000263100 ENSG00000121410     A1BG 8.997970e-06
## 2      7        A2M ENST00000318602 ENSG00000175899      A2M 1.543159e-05
## 3     16   SERPINA3 ENST00000467132 ENSG00000196136 SERPINA3 5.694582e-06
## 4     17      AADAC ENST00000232892 ENSG00000114771    AADAC 4.252483e-06
## 5     18       AAMP ENST00000248450 ENSG00000127837     AAMP 6.496774e-06
## 6     19      AANAT ENST00000250615 ENSG00000129673    AANAT 3.530488e-06
##            mis          non       splice   frameshift          lof
## 1 1.738961e-05 5.763794e-07 2.639868e-07 6.532817e-07 1.493648e-06
## 2 3.545894e-05 1.960148e-06 1.477263e-06 4.616751e-07 3.899086e-06
## 3 1.176919e-05 4.433874e-07 1.387157e-07 5.276555e-07 1.109759e-06
## 4 1.018458e-05 5.312742e-07 1.748982e-07 1.051152e-06 1.757324e-06
## 5 1.313861e-05 5.042914e-07 4.247556e-07 3.344019e-06 4.273066e-06
## 6 7.729807e-06 1.707018e-07 9.988864e-08 4.132204e-07 6.838108e-07
##           prot          all
## 1 1.888326e-05 2.788123e-05
## 2 3.935803e-05 5.478962e-05
## 3 1.287895e-05 1.857353e-05
## 4 1.194191e-05 1.619439e-05
## 5 1.741167e-05 2.390845e-05
## 6 8.413618e-06 1.194411e-05
head(
  viewProbabilityTable(format="long")
  )
##   hgncID hgncSymbol          enstID          ensgID geneName class
## 1    658       ARF5 ENST00000000233 ENSG00000004059     ARF5   syn
## 2   6752       M6PR ENST00000000412 ENSG00000003056     M6PR   syn
## 3   3720      FKBP4 ENST00000001008 ENSG00000004478    FKBP4   syn
## 4  20581    CYP26B1 ENST00000001146 ENSG00000003137  CYP26B1   syn
## 5  28816    NDUFAF7 ENST00000002125 ENSG00000003509  NDUFAF7   syn
## 6   4008      FUCA2 ENST00000002165 ENSG00000001036    FUCA2   syn
##          value
## 1 2.704295e-06
## 2 2.877349e-06
## 3 5.608786e-06
## 4 1.013847e-05
## 5 5.079037e-06
## 6 5.445799e-06

Most of the core functionality of this package is contained in the denovolyze function. The denovolyzeByClass and denovolyzeByGene functions used in this vignette are convenience functions that set defaults appropriate to the most common usages of this function. Full details of additional options and default behaviours are available using ?denovolyze.