The main function for running the chicane method is chicane()
. This takes a bam file and bed files with all restriction fragments and targeted restriction fragments as input, and produces a table with fragment interactions and associated p-values.
The package comes with a subset of reads from a capture Hi-C experiment on the Bre80 normal epithelial breast tissue cell line (Baxter et al. 2018). The experiment targeted several breast cancer risk loci in non-coding regions of the genome, and aimed to associate these risk variants with genes. Two of the risk SNPs are rs13387042 and rs16857609, and reads that map to them have been included in the file Bre80_2q35.bam.
library(chicane);
#> Loading required package: gamlss.tr
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: gamlss
#> Loading required package: splines
#> Loading required package: gamlss.data
#> Loading required package: nlme
#> Loading required package: parallel
#> ********** GAMLSS Version 5.1-0 **********
#> For more on GAMLSS look at http://www.gamlss.org/
#> Type gamlssNews() to see new features/changes/bug fixes.
# example BAM file, baits, and restriction fragments
bam <- system.file('extdata', 'Bre80_2q35.bam', package = 'chicane');
baits <- system.file('extdata', '2q35.bed', package = 'chicane');
fragments <- system.file('extdata', 'GRCh38_HindIII_chr2.bed.gz', package = 'chicane'); # HindIII fragments on chromosome 2
if( bedtools.installed() ) {
chicane.results <- chicane(
bam = bam,
baits = baits,
fragments = fragments
);
print( chicane.results[ 1:10 ] );
}
#> target.id bait.id target.chr
#> 1: chr2:217491009-217492415 chr2:217035649-217042355 chr2
#> 2: chr2:216650000-216651600 chr2:217430817-217437011 chr2
#> 3: chr2:217290048-217292530 chr2:217430817-217437011 chr2
#> 4: chr2:216269087-216273140 chr2:217430817-217437011 chr2
#> 5: chr2:216551051-216551968 chr2:217430817-217437011 chr2
#> 6: chr2:216696003-216698549 chr2:217430817-217437011 chr2
#> 7: chr2:216879929-216883553 chr2:217430817-217437011 chr2
#> 8: chr2:216687613-216696003 chr2:217430817-217437011 chr2
#> 9: chr2:217162758-217166250 chr2:217430817-217437011 chr2
#> 10: chr2:216682129-216687613 chr2:217430817-217437011 chr2
#> target.start target.end bait.chr bait.start bait.end bait.to.bait
#> 1: 217491009 217492415 chr2 217035649 217042355 FALSE
#> 2: 216650000 216651600 chr2 217430817 217437011 FALSE
#> 3: 217290048 217292530 chr2 217430817 217437011 FALSE
#> 4: 216269087 216273140 chr2 217430817 217437011 FALSE
#> 5: 216551051 216551968 chr2 217430817 217437011 FALSE
#> 6: 216696003 216698549 chr2 217430817 217437011 FALSE
#> 7: 216879929 216883553 chr2 217430817 217437011 FALSE
#> 8: 216687613 216696003 chr2 217430817 217437011 FALSE
#> 9: 217162758 217166250 chr2 217430817 217437011 FALSE
#> 10: 216682129 216687613 chr2 217430817 217437011 FALSE
#> bait.trans.count target.trans.count distance count expected
#> 1: 0 0 452710.0 28 6.064476
#> 2: 0 0 783114.0 16 2.990237
#> 3: 0 0 142625.0 33 9.143598
#> 4: 0 0 1162800.5 8 2.101966
#> 5: 0 0 882404.5 11 2.745668
#> 6: 0 0 736638.0 16 4.032253
#> 7: 0 0 552173.0 11 3.896133
#> 8: 0 0 742106.0 15 3.889132
#> 9: 0 0 269410.0 24 7.968272
#> 10: 0 0 749043.0 14 3.716266
#> p.value q.value
#> 1: 0.0001206449 0.2038899
#> 2: 0.0005035504 0.8182288
#> 3: 0.0012757379 0.8182288
#> 4: 0.0018935853 0.8182288
#> 5: 0.0021356645 0.8182288
#> 6: 0.0037456864 0.8182288
#> 7: 0.0037711109 0.8182288
#> 8: 0.0048756166 0.8182288
#> 9: 0.0055434462 0.8182288
#> 10: 0.0061512164 0.8182288
The chicane
method returns a data.table
object sorted by q-value. Each row contains information about the target and bait fragments, and the observed and expected number of reads linking the two fragments.
The chicane
method can operate directly on BAM files. To convert the BAM file into a text file, R calls the shell script prepare_bam.sh
. This script relies on bedtools to select reads overlapping the captured fragments and identify the restriction fragments corresponding to each read. The resulting data is read into a data.table
object, and the number of reads linking any two fragments is calculated.
This processing can take a while (~45 minutes for a 19GB BAM file on a commodity computer with 13GB RAM), and only needs to be done once for each BAM. To speed up model fitting, it is possible to pre-process the data with the prepare.data()
function and run chicane
directly on the resulting data table.
if( bedtools.installed() ) {
interaction.data <- prepare.data(
bam = bam,
baits = baits,
fragments = fragments
);
}
The interaction data object contains details of the fragments, and the number of reads linking the two fragments.
if( bedtools.installed() ) print(interaction.data);
#> target.id bait.id target.chr
#> 1: chr2:200013520-200014510 chr2:217035649-217042355 chr2
#> 2: chr2:200018581-200020582 chr2:217035649-217042355 chr2
#> 3: chr2:200020582-200023537 chr2:217430817-217437011 chr2
#> 4: chr2:200033118-200034885 chr2:217430817-217437011 chr2
#> 5: chr2:200040560-200041826 chr2:217430817-217437011 chr2
#> ---
#> 4479: chr2:229867689-229872871 chr2:217035649-217042355 chr2
#> 4480: chr2:229872871-229875614 chr2:217035649-217042355 chr2
#> 4481: chr2:229882404-229887121 chr2:217430817-217437011 chr2
#> 4482: chr2:229894321-229904945 chr2:217035649-217042355 chr2
#> 4483: chr2:229917543-229919836 chr2:217430817-217437011 chr2
#> target.start target.end bait.chr bait.start bait.end bait.to.bait
#> 1: 200013520 200014510 chr2 217035649 217042355 FALSE
#> 2: 200018581 200020582 chr2 217035649 217042355 FALSE
#> 3: 200020582 200023537 chr2 217430817 217437011 FALSE
#> 4: 200033118 200034885 chr2 217430817 217437011 FALSE
#> 5: 200040560 200041826 chr2 217430817 217437011 FALSE
#> ---
#> 4479: 229867689 229872871 chr2 217035649 217042355 FALSE
#> 4480: 229872871 229875614 chr2 217035649 217042355 FALSE
#> 4481: 229882404 229887121 chr2 217430817 217437011 FALSE
#> 4482: 229894321 229904945 chr2 217035649 217042355 FALSE
#> 4483: 229917543 229919836 chr2 217430817 217437011 FALSE
#> bait.trans.count target.trans.count distance count
#> 1: 0 0 17024987 1
#> 2: 0 0 17019420 1
#> 3: 0 0 17411854 2
#> 4: 0 0 17399912 1
#> 5: 0 0 17392721 1
#> ---
#> 4479: 0 0 12831278 1
#> 4480: 0 0 12835240 1
#> 4481: 0 0 12450848 1
#> 4482: 0 0 12860631 1
#> 4483: 0 0 12484776 1
By default, only combinations of fragments that are detected at least once are included in the data. To also include zero counts, use the include.zeros
argument. Setting include.zeros = 'cis'
includes all zero counts for bait/ target combinations on the same chromosome, while include.zeros = 'all'
includes all possible combinations.
if( bedtools.installed() ) {
cis.zero.results <- chicane(
bam = bam,
baits = baits,
fragments = fragments,
include.zeros = 'cis'
);
print( cis.zero.results[ 1:10 ] );
}
#> target.id bait.id target.chr
#> 1: chr2:217430817-217437011 chr2:217035649-217042355 chr2
#> 2: chr2:217035649-217042355 chr2:217035649-217042355 chr2
#> 3: chr2:217042355-217043238 chr2:217035649-217042355 chr2
#> 4: chr2:217043238-217043762 chr2:217035649-217042355 chr2
#> 5: chr2:217043762-217045205 chr2:217035649-217042355 chr2
#> 6: chr2:217045205-217048162 chr2:217035649-217042355 chr2
#> 7: chr2:217026464-217035649 chr2:217035649-217042355 chr2
#> 8: chr2:217024030-217026464 chr2:217035649-217042355 chr2
#> 9: chr2:217023088-217024030 chr2:217035649-217042355 chr2
#> 10: chr2:217048162-217060744 chr2:217035649-217042355 chr2
#> target.start target.end bait.chr bait.start bait.end bait.to.bait
#> 1: 217430817 217437011 chr2 217035649 217042355 TRUE
#> 2: 217035649 217042355 chr2 217035649 217042355 FALSE
#> 3: 217042355 217043238 chr2 217035649 217042355 FALSE
#> 4: 217043238 217043762 chr2 217035649 217042355 FALSE
#> 5: 217043762 217045205 chr2 217035649 217042355 FALSE
#> 6: 217045205 217048162 chr2 217035649 217042355 FALSE
#> 7: 217026464 217035649 chr2 217035649 217042355 FALSE
#> 8: 217024030 217026464 chr2 217035649 217042355 FALSE
#> 9: 217023088 217024030 chr2 217035649 217042355 FALSE
#> 10: 217048162 217060744 chr2 217035649 217042355 FALSE
#> bait.trans.count target.trans.count distance count expected p.value
#> 1: 0 0 394912.0 8 8 0.5470392
#> 2: 0 0 0.0 4 NA NA
#> 3: 0 0 3794.5 60 NA NA
#> 4: 0 0 4498.0 126 NA NA
#> 5: 0 0 5481.5 102 NA NA
#> 6: 0 0 7681.5 85 NA NA
#> 7: 0 0 7945.5 36 NA NA
#> 8: 0 0 13755.0 56 NA NA
#> 9: 0 0 15443.0 41 NA NA
#> 10: 0 0 15451.0 45 NA NA
#> q.value
#> 1: 0.5470392
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> 7: NA
#> 8: NA
#> 9: NA
#> 10: NA
For most experiments including zeros will result in impractically large files. Another option for accounting for zero counts is to specify a truncated distribution through the distribution
argument. However, this will slow down model fitting.
The interactions
argument of the chicane()
function can take either a data.table
object from prepare.data
or the path to such a table.
if( bedtools.installed() ) {
file.name <- tempfile('interaction_data');
write.table(interaction.data, file.name, row.names = FALSE);
chicane.results <- chicane(interactions = file.name);
}
The chicane method models the expected number of reads linking two restriction fragments as a function of the distance between the loci and the “interactibility” of the bait fragment.
Let \(Y_{ij}\) denote the number of reads linking bait \(i\) and target fragment \(j\), \(d_{ij}\) be the distance between the two fragments, and \(t_i\) denote the number of reads linking the bait \(i\) with another fragment in trans. The chicane model assumes that \(Y_{ij}\) follows a distribution with mean \(\mu_{ij}\), where
If the other end fragment \(j\) is also a bait, terms are included to adjust for the trans counts of the fragment \(j\), as well as the product of the trans counts of the two fragments as shown below
Once an expected value \(\mu_{ij}\) is obtained for each combination of fragments, a \(p\)-value for the observed counts \(y_{ij}\) exceeding what’s expected by random chance is calculated as
The probability \(P(Y_{ij} \geq y_{ij})\) depends on how the distribution \(Y_{ij} \mid \mu_{ij}\) of the counts conditional on the expected value is modelled.
The chicane method supports several different distributions of the counts \(Y_{ij}\) conditional on the mean \(\mu_{ij}\). By default the counts are assumed to follow a negative binomial distribution, but the Poisson, truncated Poisson, or truncated negative binomial distribution can be specified through the distribution
argument.
The negative binomial distribution models the counts as \(Y \sim NB(\mu, \theta)\), where \(\mu\) is the mean and \(\theta\) is a dispersion parameter. Under this model, the variance of the counts conditional on the mean \(\mu\) is given by
The model is fit using the glm.nb()
function in the MASS package. The maximum likelihood estimates (MLEs) of the coefficients \(\hat{\beta_i}\) are used to provide an estimate of the expected number of counts for each combination of restriction fragments. The MLE of the dispersion parameter \(\theta\) is used as the dispersion parameter (size
) in the negative binomial model.
Occasionally, the model fitting procedure will result in numerical errors as the result of lack of convergence. This is most often due to lack of overdispersion in the data leading to an unstable estimate of \(\theta\). If the variance is approximately equal to the mean, the \(\theta\) term goes to infinity and the model fails to converge.
When the model throws a numerical warning or fails to converge, the chicane()
function will attempt to diagnose if the problem was due to lack of overdispersion. This is done by fitting a negative binomial regerssion model with 25 iterations, and using the resulting fitted values and dispersion parameter to examine the variance of the negative binomial. If the estimated variance at the point with the largest fitted value is less than 0.1% greater than the corresponding estimate of the Poisson variance (i.e. the fitted value at that point), a Poisson distribution is fit instead.
Unlike the negative binomial, the Poisson distribution does not allow for the variance of the counts to exceed their mean. This model assumes that \(Y \sim \text{Poisson}(\mu)\), giving \(Var(Y) = \mu\). In practice, capture Hi-C counts often show more variability than can be explained by the Poisson model, leading to false positive interaction calls.
When fitting the model without including unobserved combinations of fragments (see Pre-processing Data), there are no counts of zero in the data. To accommodate this in the statistical model, a truncated Poisson distribution can be fit using the GAMLSS package (Stasinopoulos and Rigby 2016). This model assigns \(P(Y = 0) = 0\), and assumes the probabilities are proportional to the negative binomial probabilities for values \(y > 0\).
Similarly to the truncated negative binomial, the truncated Poisson distribution is supported by setting distribution = 'truncated-poisson'
. This model assumes the probabilities are proportional to the Poisson probabilities for values \(y>0\).
Adjusting for distance is done in a piecewise linear fashion. Cis-data is ordered by distance, and split into equally sized bins. The count model is then fit separately within each bin, and results are merged at the end by concatenating the results from the individual model fits.
By default, chicane
tries to split the data based on distance quantiles, i.e. into 100 equally sized groups. In some cases, particularly when dealing with smaller datasets, the resulting datasets are too small for model fitting. When this occurs, the number of distance bins are halved until all resulting datasets are large enough for the model to be fit. To override the default behaviour, pass the desired number of distance bins to distance.bins
. For example, setting distance.bins = 4
results in the count model being fit separately in each quartile. Trans interactions are fit separately from cis interactions.
By default, all baits and targets are included when fitting the chicane model. An alternative approach is to filter out baits and fragments with low (Dryden et al. 2014) or high (Cairns et al. 2016) degree of “interactibility”, as inferred through the trans counts. Chicane also supports this model through the bait.filters
and target.filters
arguments.
Each of these take a vector of length two, where each element corresponds to the proportion of fragments that should be considered to fall below the “lower limit” and “upper limit.” For example, passing in bait.filters = c(0.3, 0.9)
means that the baits with the lowest 30% or highest 10% of trans counts will be removed before fitting the model.
Filtering fragments may affect multiple testing correction by changing the number of tests performed.
By default, chicane performs multiple testing correction separately for each bait. To change this to a global multiple testing correction, set multiple.testing.correction = 'global'
in the main chicane()
function.
chicane can merge replicates at the data processing step. If more than one BAM file is available, these can be passed as vector to the bam
argument of the chicane()
and prepare.data()
functions. The replicate.merging.method
determines how replicates are merged. Available options are ‘sum’ and ‘weighted-sum’.
chicane allows users to specify additional covariates that should be added to the model with the adjustment.terms
argument. For example, we can add a term to adjust for the chromosome of the target fragment.
data(bre80);
adjusted.results <- chicane(
interactions = bre80,
adjustment.terms = 'target.chr'
);
print( adjusted.results[ 1:5 ] );
#> target.id bait.id target.chr
#> 1: chr19:1155128-1228414 chr2:217035649-217042355 chr19
#> 2: chr1:117125276-117135137 chr2:217035649-217042355 chr1
#> 3: chr1:14950583-14951791 chr2:217035649-217042355 chr1
#> 4: chr1:49076960-49080684 chr2:217035649-217042355 chr1
#> 5: chr3:145680459-145682483 chr2:217035649-217042355 chr3
#> target.start target.end bait.chr bait.start bait.end bait.to.bait
#> 1: 1155128 1228414 chr2 217035649 217042355 FALSE
#> 2: 117125276 117135137 chr2 217035649 217042355 FALSE
#> 3: 14950583 14951791 chr2 217035649 217042355 FALSE
#> 4: 49076960 49080684 chr2 217035649 217042355 FALSE
#> 5: 145680459 145682483 chr2 217035649 217042355 FALSE
#> bait.trans.count target.trans.count distance count expected p.value
#> 1: 9494 2 NA 2 1.003963 0.2656989
#> 2: 9494 2 NA 2 1.003992 0.2657096
#> 3: 9494 2 NA 2 1.003992 0.2657096
#> 4: 9494 2 NA 2 1.003992 0.2657096
#> 5: 9494 2 NA 2 1.004014 0.2657179
#> q.value
#> 1: 0.6417435
#> 2: 0.6417435
#> 3: 0.6417435
#> 4: 0.6417435
#> 5: 0.6417435
The adjustment.terms
argument also supports more complex expressions such as log-transformations. Multiple adjustment terms can be added by passing a vector.
Any adjustment term must correspond to a column already present in the data. If you would like to adjust for anything that is not already present in the chicane output (e.g. GC content), there’s a three step approach:
prepare.data()
on your BAM filechicane()
with the interactions
argumentThe convert.hicup.digest.bed()
function can be used to convert a digested genome in HiCUP format to a BED file.
sessionInfo();
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#>
#> Matrix products: default
#> BLAS/LAPACK: /opt/gridware/apps/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel splines stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] data.table_1.11.4 chicane_0.1.0 gamlss.tr_5.0-0 gamlss_5.1-0
#> [5] nlme_3.1-137 gamlss.data_5.0-1 gamlss.dist_5.0-6 MASS_7.3-50
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.18 knitr_1.20 magrittr_1.5 lattice_0.20-35
#> [5] foreach_1.4.4 stringr_1.3.1 tools_3.5.0 grid_3.5.0
#> [9] iterators_1.0.10 htmltools_0.3.6 yaml_2.2.0 survival_2.41-3
#> [13] rprojroot_1.3-2 digest_0.6.15 Matrix_1.2-14 codetools_0.2-15
#> [17] evaluate_0.11 rmarkdown_1.10 stringi_1.2.4 compiler_3.5.0
#> [21] backports_1.1.2
Development of chicane was supported by Breast Cancer Now.
Baxter, Joseph S, Olivia C Leavy, Nicola H Dryden, Sarah Maguire, Nichola Johnson, Vita Fedele, Nikiana Simigdala, et al. 2018. “Capture Hi-C Identifies Putative Target Genes at 33 Breast Cancer Risk Loci.” Nature Communications 9 (1): 1028.
Cairns, Jonathan, Paula Freire-Pritchett, Steven W Wingett, Csilla Várnai, Andrew Dimond, Vincent Plagnol, Daniel Zerbino, et al. 2016. “CHiCAGO: robust Detection of DNA Looping Interactions in Capture Hi-C Data.” Genome Biology 17 (1): 127.
Dryden, Nicola H, Laura R Broome, Frank Dudbridge, Nichola Johnson, Nick Orr, Stefan Schoenfelder, Takashi Nagano, et al. 2014. “Unbiased Analysis of Potential Targets of Breast Cancer Susceptibility Loci by Capture Hi-C.” Genome Research 24 (11): 1854–1868.
Stasinopoulos, Mikis, and Bob Rigby. 2016. gamlss.Tr: Generating and Fitting Truncated “Gamlss.family” Distributions. https://CRAN.R-project.org/package=gamlss.tr.