Getting Started with Capture Hi-C Analysis Engine

2018-08-24

Quickstart

The main function for running the chicane method is chicane(). This takes a bam file and bed files with all restriction fragments and targeted restriction fragments as input, and produces a table with fragment interactions and associated p-values.

The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.

library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#> Loading required package: nlme
#> Loading required package: parallel
#>  **********   GAMLSS Version 5.1-0  **********
#> For more on GAMLSS look at http://www.gamlss.org/
#> Type gamlssNews() to see new features/changes/bug fixes.

# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2

if( bedtools.installed() ) {
  chicane.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments
    );

    print( chicane.results[ 1:10 ] );
}
#>                    target.id                  bait.id target.chr
#>  1: chr2:217491009-217492415 chr2:217035649-217042355       chr2
#>  2: chr2:216650000-216651600 chr2:217430817-217437011       chr2
#>  3: chr2:217290048-217292530 chr2:217430817-217437011       chr2
#>  4: chr2:216269087-216273140 chr2:217430817-217437011       chr2
#>  5: chr2:216551051-216551968 chr2:217430817-217437011       chr2
#>  6: chr2:216696003-216698549 chr2:217430817-217437011       chr2
#>  7: chr2:216879929-216883553 chr2:217430817-217437011       chr2
#>  8: chr2:216687613-216696003 chr2:217430817-217437011       chr2
#>  9: chr2:217162758-217166250 chr2:217430817-217437011       chr2
#> 10: chr2:216682129-216687613 chr2:217430817-217437011       chr2
#>     target.start target.end bait.chr bait.start  bait.end bait.to.bait
#>  1:    217491009  217492415     chr2  217035649 217042355        FALSE
#>  2:    216650000  216651600     chr2  217430817 217437011        FALSE
#>  3:    217290048  217292530     chr2  217430817 217437011        FALSE
#>  4:    216269087  216273140     chr2  217430817 217437011        FALSE
#>  5:    216551051  216551968     chr2  217430817 217437011        FALSE
#>  6:    216696003  216698549     chr2  217430817 217437011        FALSE
#>  7:    216879929  216883553     chr2  217430817 217437011        FALSE
#>  8:    216687613  216696003     chr2  217430817 217437011        FALSE
#>  9:    217162758  217166250     chr2  217430817 217437011        FALSE
#> 10:    216682129  216687613     chr2  217430817 217437011        FALSE
#>     bait.trans.count target.trans.count  distance count expected
#>  1:                0                  0  452710.0    28 6.064476
#>  2:                0                  0  783114.0    16 2.990237
#>  3:                0                  0  142625.0    33 9.143598
#>  4:                0                  0 1162800.5     8 2.101966
#>  5:                0                  0  882404.5    11 2.745668
#>  6:                0                  0  736638.0    16 4.032253
#>  7:                0                  0  552173.0    11 3.896133
#>  8:                0                  0  742106.0    15 3.889132
#>  9:                0                  0  269410.0    24 7.968272
#> 10:                0                  0  749043.0    14 3.716266
#>          p.value   q.value
#>  1: 0.0001206449 0.2038899
#>  2: 0.0005035504 0.8182288
#>  3: 0.0012757379 0.8182288
#>  4: 0.0018935853 0.8182288
#>  5: 0.0021356645 0.8182288
#>  6: 0.0037456864 0.8182288
#>  7: 0.0037711109 0.8182288
#>  8: 0.0048756166 0.8182288
#>  9: 0.0055434462 0.8182288
#> 10: 0.0061512164 0.8182288

The chicane method returns a data.table object sorted by q-value. Each row contains information about the target and bait fragments, and the observed and expected number of reads linking the two fragments.

Pre-Processing Data

The chicane method can operate directly on BAM files. To convert the BAM file into a text file, R calls the shell script prepare_bam.sh. This script relies on bedtools to select reads overlapping the captured fragments and identify the restriction fragments corresponding to each read. The resulting data is read into a data.table object, and the number of reads linking any two fragments is calculated.

This processing can take a while (~45 minutes for a 19GB BAM file on a commodity computer with 13GB RAM), and only needs to be done once for each BAM. To speed up model fitting, it is possible to pre-process the data with the prepare.data() function and run chicane directly on the resulting data table.

if( bedtools.installed() ) {
  interaction.data <- prepare.data(
    bam = bam,
    baits = baits,
    fragments = fragments
    );
}

The interaction data object contains details of the fragments, and the number of reads linking the two fragments.

if( bedtools.installed() ) print(interaction.data);
#>                      target.id                  bait.id target.chr
#>    1: chr2:200013520-200014510 chr2:217035649-217042355       chr2
#>    2: chr2:200018581-200020582 chr2:217035649-217042355       chr2
#>    3: chr2:200020582-200023537 chr2:217430817-217437011       chr2
#>    4: chr2:200033118-200034885 chr2:217430817-217437011       chr2
#>    5: chr2:200040560-200041826 chr2:217430817-217437011       chr2
#>   ---                                                             
#> 4479: chr2:229867689-229872871 chr2:217035649-217042355       chr2
#> 4480: chr2:229872871-229875614 chr2:217035649-217042355       chr2
#> 4481: chr2:229882404-229887121 chr2:217430817-217437011       chr2
#> 4482: chr2:229894321-229904945 chr2:217035649-217042355       chr2
#> 4483: chr2:229917543-229919836 chr2:217430817-217437011       chr2
#>       target.start target.end bait.chr bait.start  bait.end bait.to.bait
#>    1:    200013520  200014510     chr2  217035649 217042355        FALSE
#>    2:    200018581  200020582     chr2  217035649 217042355        FALSE
#>    3:    200020582  200023537     chr2  217430817 217437011        FALSE
#>    4:    200033118  200034885     chr2  217430817 217437011        FALSE
#>    5:    200040560  200041826     chr2  217430817 217437011        FALSE
#>   ---                                                                   
#> 4479:    229867689  229872871     chr2  217035649 217042355        FALSE
#> 4480:    229872871  229875614     chr2  217035649 217042355        FALSE
#> 4481:    229882404  229887121     chr2  217430817 217437011        FALSE
#> 4482:    229894321  229904945     chr2  217035649 217042355        FALSE
#> 4483:    229917543  229919836     chr2  217430817 217437011        FALSE
#>       bait.trans.count target.trans.count distance count
#>    1:                0                  0 17024987     1
#>    2:                0                  0 17019420     1
#>    3:                0                  0 17411854     2
#>    4:                0                  0 17399912     1
#>    5:                0                  0 17392721     1
#>   ---                                                   
#> 4479:                0                  0 12831278     1
#> 4480:                0                  0 12835240     1
#> 4481:                0                  0 12450848     1
#> 4482:                0                  0 12860631     1
#> 4483:                0                  0 12484776     1

By default, only combinations of fragments that are detected at least once are included in the data. To also include zero counts, use the include.zeros argument. Setting include.zeros = 'cis' includes all zero counts for bait/ target combinations on the same chromosome, while include.zeros = 'all' includes all possible combinations.

if( bedtools.installed() ) {
  cis.zero.results <- chicane(
    bam = bam,
    baits = baits,
    fragments = fragments,
    include.zeros = 'cis'
    ); 

    print( cis.zero.results[ 1:10 ] );
}
#>                    target.id                  bait.id target.chr
#>  1: chr2:217430817-217437011 chr2:217035649-217042355       chr2
#>  2: chr2:217035649-217042355 chr2:217035649-217042355       chr2
#>  3: chr2:217042355-217043238 chr2:217035649-217042355       chr2
#>  4: chr2:217043238-217043762 chr2:217035649-217042355       chr2
#>  5: chr2:217043762-217045205 chr2:217035649-217042355       chr2
#>  6: chr2:217045205-217048162 chr2:217035649-217042355       chr2
#>  7: chr2:217026464-217035649 chr2:217035649-217042355       chr2
#>  8: chr2:217024030-217026464 chr2:217035649-217042355       chr2
#>  9: chr2:217023088-217024030 chr2:217035649-217042355       chr2
#> 10: chr2:217048162-217060744 chr2:217035649-217042355       chr2
#>     target.start target.end bait.chr bait.start  bait.end bait.to.bait
#>  1:    217430817  217437011     chr2  217035649 217042355         TRUE
#>  2:    217035649  217042355     chr2  217035649 217042355        FALSE
#>  3:    217042355  217043238     chr2  217035649 217042355        FALSE
#>  4:    217043238  217043762     chr2  217035649 217042355        FALSE
#>  5:    217043762  217045205     chr2  217035649 217042355        FALSE
#>  6:    217045205  217048162     chr2  217035649 217042355        FALSE
#>  7:    217026464  217035649     chr2  217035649 217042355        FALSE
#>  8:    217024030  217026464     chr2  217035649 217042355        FALSE
#>  9:    217023088  217024030     chr2  217035649 217042355        FALSE
#> 10:    217048162  217060744     chr2  217035649 217042355        FALSE
#>     bait.trans.count target.trans.count distance count expected   p.value
#>  1:                0                  0 394912.0     8        8 0.5470392
#>  2:                0                  0      0.0     4       NA        NA
#>  3:                0                  0   3794.5    60       NA        NA
#>  4:                0                  0   4498.0   126       NA        NA
#>  5:                0                  0   5481.5   102       NA        NA
#>  6:                0                  0   7681.5    85       NA        NA
#>  7:                0                  0   7945.5    36       NA        NA
#>  8:                0                  0  13755.0    56       NA        NA
#>  9:                0                  0  15443.0    41       NA        NA
#> 10:                0                  0  15451.0    45       NA        NA
#>       q.value
#>  1: 0.5470392
#>  2:        NA
#>  3:        NA
#>  4:        NA
#>  5:        NA
#>  6:        NA
#>  7:        NA
#>  8:        NA
#>  9:        NA
#> 10:        NA

For most experiments including zeros will result in impractically large files. Another option for accounting for zero counts is to specify a truncated distribution through the distribution argument. However, this will slow down model fitting.

The interactions argument of the chicane() function can take either a data.table object from prepare.data or the path to such a table.

if( bedtools.installed() ) {
  file.name <- tempfile('interaction_data');
  write.table(interaction.data, file.name, row.names = FALSE);

  chicane.results <- chicane(interactions = file.name); 
}

Statistical Model

The chicane method models the expected number of reads linking two restriction fragments as a function of the distance between the loci and the “interactibility” of the bait fragment.

Let \(Y_{ij}\) denote the number of reads linking bait \(i\) and target fragment \(j\), \(d_{ij}\) be the distance between the two fragments, and \(t_i\) denote the number of reads linking the bait \(i\) with another fragment in trans. The chicane model assumes that \(Y_{ij}\) follows a distribution with mean \(\mu_{ij}\), where

If the other end fragment \(j\) is also a bait, terms are included to adjust for the trans counts of the fragment \(j\), as well as the product of the trans counts of the two fragments as shown below

Once an expected value \(\mu_{ij}\) is obtained for each combination of fragments, a \(p\)-value for the observed counts \(y_{ij}\) exceeding what’s expected by random chance is calculated as

The probability \(P(Y_{ij} \geq y_{ij})\) depends on how the distribution \(Y_{ij} \mid \mu_{ij}\) of the counts conditional on the expected value is modelled.

Distribution of the Counts

The chicane method supports several different distributions of the counts \(Y_{ij}\) conditional on the mean \(\mu_{ij}\). By default the counts are assumed to follow a negative binomial distribution, but the Poisson, truncated Poisson, or truncated negative binomial distribution can be specified through the distribution argument.

Negative Binomial

The negative binomial distribution models the counts as \(Y \sim NB(\mu, \theta)\), where \(\mu\) is the mean and \(\theta\) is a dispersion parameter. Under this model, the variance of the counts conditional on the mean \(\mu\) is given by

The model is fit using the glm.nb() function in the MASS package. The maximum likelihood estimates (MLEs) of the coefficients \(\hat{\beta_i}\) are used to provide an estimate of the expected number of counts for each combination of restriction fragments. The MLE of the dispersion parameter \(\theta\) is used as the dispersion parameter (size) in the negative binomial model.

Occasionally, the model fitting procedure will result in numerical errors as the result of lack of convergence. This is most often due to lack of overdispersion in the data leading to an unstable estimate of \(\theta\). If the variance is approximately equal to the mean, the \(\theta\) term goes to infinity and the model fails to converge.

When the model throws a numerical warning or fails to converge, the chicane() function will attempt to diagnose if the problem was due to lack of overdispersion. This is done by fitting a negative binomial regerssion model with 25 iterations, and using the resulting fitted values and dispersion parameter to examine the variance of the negative binomial. If the estimated variance at the point with the largest fitted value is less than 0.1% greater than the corresponding estimate of the Poisson variance (i.e. the fitted value at that point), a Poisson distribution is fit instead.

Poisson

Unlike the negative binomial, the Poisson distribution does not allow for the variance of the counts to exceed their mean. This model assumes that \(Y \sim \text{Poisson}(\mu)\), giving \(Var(Y) = \mu\). In practice, capture Hi-C counts often show more variability than can be explained by the Poisson model, leading to false positive interaction calls.

Truncated Negative Binomial

When fitting the model without including unobserved combinations of fragments (see Pre-processing Data), there are no counts of zero in the data. To accommodate this in the statistical model, a truncated Poisson distribution can be fit using the GAMLSS package (Stasinopoulos and Rigby 2016). This model assigns \(P(Y = 0) = 0\), and assumes the probabilities are proportional to the negative binomial probabilities for values \(y > 0\).

Truncated Poisson

Similarly to the truncated negative binomial, the truncated Poisson distribution is supported by setting distribution = 'truncated-poisson'. This model assumes the probabilities are proportional to the Poisson probabilities for values \(y>0\).

Adjusting for Covariates

Adjusting for distance is done in a piecewise linear fashion. Cis-data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.

By default, chicane tries to split the data based on distance quantiles, i.e. into 100 equally sized groups. In some cases, particularly when dealing with smaller datasets, the resulting datasets are too small for model fitting. When this occurs, the number of distance bins are halved until all resulting datasets are large enough for the model to be fit. To override the default behaviour, pass the desired number of distance bins to distance.bins. For example, setting distance.bins = 4 results in the count model being fit separately in each quartile. Trans interactions are fit separately from cis interactions.

Filtering Baits and Targets

By default, all baits and targets are included when fitting the chicane model. An alternative approach is to filter out baits and fragments with low (Dryden et al. 2014) or high (Cairns et al. 2016) degree of “interactibility”, as inferred through the trans counts. Chicane also supports this model through the bait.filters and target.filters arguments.

Each of these take a vector of length two, where each element corresponds to the proportion of fragments that should be considered to fall below the “lower limit” and “upper limit.” For example, passing in bait.filters = c(0.3, 0.9) means that the baits with the lowest 30% or highest 10% of trans counts will be removed before fitting the model.

Filtering fragments may affect multiple testing correction by changing the number of tests performed.

Multiple Testing Correction

By default, chicane performs multiple testing correction separately for each bait. To change this to a global multiple testing correction, set multiple.testing.correction = 'global' in the main chicane() function.

Merging Biological Replicates

chicane can merge replicates at the data processing step. If more than one BAM file is available, these can be passed as vector to the bam argument of the chicane() and prepare.data() functions. The replicate.merging.method determines how replicates are merged. Available options are ‘sum’ and ‘weighted-sum’.

Adjusting for Custom Covariates

chicane allows users to specify additional covariates that should be added to the model with the adjustment.terms argument. For example, we can add a term to adjust for the chromosome of the target fragment.

data(bre80);
adjusted.results <- chicane(
  interactions = bre80, 
  adjustment.terms = 'target.chr'
  );

print( adjusted.results[ 1:5 ] );
#>                   target.id                  bait.id target.chr
#> 1:    chr19:1155128-1228414 chr2:217035649-217042355      chr19
#> 2: chr1:117125276-117135137 chr2:217035649-217042355       chr1
#> 3:   chr1:14950583-14951791 chr2:217035649-217042355       chr1
#> 4:   chr1:49076960-49080684 chr2:217035649-217042355       chr1
#> 5: chr3:145680459-145682483 chr2:217035649-217042355       chr3
#>    target.start target.end bait.chr bait.start  bait.end bait.to.bait
#> 1:      1155128    1228414     chr2  217035649 217042355        FALSE
#> 2:    117125276  117135137     chr2  217035649 217042355        FALSE
#> 3:     14950583   14951791     chr2  217035649 217042355        FALSE
#> 4:     49076960   49080684     chr2  217035649 217042355        FALSE
#> 5:    145680459  145682483     chr2  217035649 217042355        FALSE
#>    bait.trans.count target.trans.count distance count expected   p.value
#> 1:             9494                  2       NA     2 1.003963 0.2656989
#> 2:             9494                  2       NA     2 1.003992 0.2657096
#> 3:             9494                  2       NA     2 1.003992 0.2657096
#> 4:             9494                  2       NA     2 1.003992 0.2657096
#> 5:             9494                  2       NA     2 1.004014 0.2657179
#>      q.value
#> 1: 0.6417435
#> 2: 0.6417435
#> 3: 0.6417435
#> 4: 0.6417435
#> 5: 0.6417435

The adjustment.terms argument also supports more complex expressions such as log-transformations. Multiple adjustment terms can be added by passing a vector.

Any adjustment term must correspond to a column already present in the data. If you would like to adjust for anything that is not already present in the chicane output (e.g. GC content), there’s a three step approach:

  1. Run prepare.data() on your BAM file
  2. Add extra columns to the resulting data frame
  3. Run chicane() with the interactions argument

Helper Functions

The convert.hicup.digest.bed() function can be used to convert a digested genome in HiCUP format to a BED file.

Session Info

sessionInfo();
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /opt/gridware/apps/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  splines   stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.11.4 chicane_0.1.0     gamlss.tr_5.0-0   gamlss_5.1-0     
#> [5] nlme_3.1-137      gamlss.data_5.0-1 gamlss.dist_5.0-6 MASS_7.3-50      
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18     knitr_1.20       magrittr_1.5     lattice_0.20-35 
#>  [5] foreach_1.4.4    stringr_1.3.1    tools_3.5.0      grid_3.5.0      
#>  [9] iterators_1.0.10 htmltools_0.3.6  yaml_2.2.0       survival_2.41-3 
#> [13] rprojroot_1.3-2  digest_0.6.15    Matrix_1.2-14    codetools_0.2-15
#> [17] evaluate_0.11    rmarkdown_1.10   stringi_1.2.4    compiler_3.5.0  
#> [21] backports_1.1.2

Acknowledgements

Development of chicane was supported by Breast Cancer Now.

References

Baxter, Joseph S, Olivia C Leavy, Nicola H Dryden, Sarah Maguire, Nichola Johnson, Vita Fedele, Nikiana Simigdala, et al. 2018. “Capture Hi-C Identifies Putative Target Genes at 33 Breast Cancer Risk Loci.” Nature Communications 9 (1): 1028.

Cairns, Jonathan, Paula Freire-Pritchett, Steven W Wingett, Csilla Várnai, Andrew Dimond, Vincent Plagnol, Daniel Zerbino, et al. 2016. “CHiCAGO: robust Detection of DNA Looping Interactions in Capture Hi-C Data.” Genome Biology 17 (1): 127.

Dryden, Nicola H, Laura R Broome, Frank Dudbridge, Nichola Johnson, Nick Orr, Stefan Schoenfelder, Takashi Nagano, et al. 2014. “Unbiased Analysis of Potential Targets of Breast Cancer Susceptibility Loci by Capture Hi-C.” Genome Research 24 (11): 1854–1868.

Stasinopoulos, Mikis, and Bob Rigby. 2016. gamlss.Tr: Generating and Fitting Truncated “Gamlss.family” Distributions. https://CRAN.R-project.org/package=gamlss.tr.