valr overview

2017-07-21

Why valr?

Why another tool set for interval manipulations? There are several other software packages available for genome interval analysis. However, based on our experiences teaching genome analysis, we were motivated to develop a toolset that:

valr can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.

Familiar tools, natively in R

The functions in valr have similar names to their BEDtools counterparts, and so will be familiar to users coming from the BEDtools suite. Similar to pybedtools, valr has a terse syntax:

library(valr)
library(dplyr)

snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)

# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)

nearby %>%
  select(starts_with('name'), .overlap, .dist) %>%
  filter(abs(.dist) < 1000)
#> # A tibble: 285 x 4
#>         name.x            name.y .overlap .dist
#>          <chr>             <chr>    <int> <int>
#>  1   rs2261631             P704P        0  -267
#>  2 rs570770556             POTEH        0  -912
#>  3 rs538163832             POTEH        0  -952
#>  4   rs9606135            TPTEP1        0  -421
#>  5  rs11912392 ANKRD62P1-PARP4P3        0   104
#>  6   rs8136454          BC038197        0   355
#>  7   rs5992556              XKR3        0  -455
#>  8 rs114101676              GAB4        0   473
#>  9  rs62236167             CECR7        0   261
#> 10   rs5747023             CECR1        0  -386
#> # ... with 275 more rows

Input data

valr assigns common column names to facilitate comparisons between tbls. All tbls will have chrom, start, and end columns, and some tbls from multi-column formats will have additional pre-determined column names. See the read_bed() documentation for details.

bed_file <- valr_example("3fields.bed.gz")
read_bed(bed_file) # accepts filepaths or URLs
#> # A tibble: 10 x 3
#>    chrom  start    end
#>    <chr>  <int>  <int>
#>  1  chr1  11873  14409
#>  2  chr1  14361  19759
#>  3  chr1  14406  29370
#>  4  chr1  34610  36081
#>  5  chr1  69090  70008
#>  6  chr1 134772 140566
#>  7  chr1 321083 321115
#>  8  chr1 321145 321207
#>  9  chr1 322036 326938
#> 10  chr1 327545 328439

valr can also operate on BED-like data.frames already constructed in R, provided that columns named chrom, start and end are present. New tbls can also be contructed using trbl_interval().

bed <- trbl_interval(
  ~chrom, ~start,  ~end, 
  "chr1", 1657492, 2657492, 
  "chr2", 2501324, 3094650
)

bed
#> # A tibble: 2 x 3
#>   chrom   start     end
#>   <chr>   <dbl>   <dbl>
#> 1  chr1 1657492 2657492
#> 2  chr2 2501324 3094650

Interval coordinates

valr adheres to the BED format which specifies that the start position for an interval is zero based and the end position is one-based. The first position in a chromosome is 0. The end position for a chromosome is one position passed the last base, and is not included in the interval. For example:

# a chromosome 100 basepairs in length
chrom <- trbl_interval(
  ~chrom, ~start, ~end, 
  "chr1", 0,      100
)

chrom
#> # A tibble: 1 x 3
#>   chrom start   end
#>   <chr> <dbl> <dbl>
#> 1  chr1     0   100

# single basepair intervals
bases <- trbl_interval(
  ~chrom, ~start, ~end, 
  "chr1", 0,      1, # first base of chromosome
  "chr1", 1,      2,  # second base of chromosome
  "chr1", 99,     100 # last base of chromosome
)

bases
#> # A tibble: 3 x 3
#>   chrom start   end
#>   <chr> <dbl> <dbl>
#> 1  chr1     0     1
#> 2  chr1     1     2
#> 3  chr1    99   100

Remote databases

Remote databases can be accessed with db_ucsc() (to access the UCSC Browser) and db_ensembl() (to access Ensembl databases).

# access the `refGene` tbl on the `hg38` assembly.
if(require(RMySQL)) {
  ucsc <- db_ucsc('hg38')
  tbl(ucsc, 'refGene')
}

Visual documentation

The bed_glyph() tool illustrates the results of operations in valr, similar to those found in the BEDtools documentation. This glyph shows the result of intersecting x and y intervals with bed_intersect():

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)

y <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)

bed_glyph(bed_intersect(x, y))

And this glyph illustrates bed_merge():

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  'chr1',      1,      50,
  'chr1',      10,     75,
  'chr1',      100,    120
)

bed_glyph(bed_merge(x))

Reproducible reports

valr can be used in RMarkdown documents to generate reproducible work-flows for data processing. Because valr is reasonably fast (see the benchmarks), we now use it in lieu of other tools for exploratory analysis of genomic data sets in R.

Command-line tools like BEDtools and bedops can be used in reproducible workflows (e.g., with snakemake), but it is cumbersome to move from command-line tools to exploratory analysis and plotting software. pybedtools can be used within ipython notebooks to accomplish a similar goal, but others have pointed out issues with this approach, including clunky version control. Because RMarkdown files are text files, they are readily kept under version control. Moreover, new features in RStudio (e.g. notebook viewing) enable similar functionality to ipython.

Grouping data

The group_by function in dplyr can be used to perform fuctions on subsets of single and multiple data_frames. Functions in valr leverage grouping to enable a variety of comparisons. For example, intervals can be grouped by strand to perform comparisons among intervals on the same strand.

x <- tibble::tribble(
  ~chrom, ~start, ~end, ~strand,
  'chr1', 1,      100,  '+',
  'chr1', 50,     150,  '+',
  'chr2', 100,    200,  '-'
)

y <- tibble::tribble(
  ~chrom, ~start, ~end, ~strand,
  'chr1', 50,     125,  '+',
  'chr1', 50,     150,  '-',
  'chr2', 50,     150,  '+'
)

# intersect tbls by strand
x <- group_by(x, strand)
y <- group_by(y, strand)

bed_intersect(x, y)
#> # A tibble: 2 x 8
#>   chrom start.x end.x strand.x start.y end.y strand.y .overlap
#>   <chr>   <dbl> <dbl>    <chr>   <dbl> <dbl>    <chr>    <int>
#> 1  chr1       1   100        +      50   125        +       50
#> 2  chr1      50   150        +      50   125        +       75

Comparisons between intervals on opposite strands are done using the flip_strands() function:

x <- group_by(x, strand)

y <- flip_strands(y)
y <- group_by(y, strand)

bed_intersect(x, y)
#> # A tibble: 3 x 8
#>   chrom start.x end.x strand.x start.y end.y strand.y .overlap
#>   <chr>   <dbl> <dbl>    <chr>   <dbl> <dbl>    <chr>    <int>
#> 1  chr1       1   100        +      50   150        +       50
#> 2  chr1      50   150        +      50   150        +      100
#> 3  chr2     100   200        -      50   150        -       50

Both single set (e.g. bed_merge()) and multi set operations will respect groupings in the input intervals.

Column specification

Columns in BEDtools are referred to by position:

# calculate the mean of column 6 for intervals in `b` that overlap with `a`
bedtools map -a a.bed -b b.bed -c 6 -o mean

In valr, columns are referred to by name and can be used in multiple name/value expressions for summaries.

# calculate the mean and variance for a `value` column
bed_map(a, b, .mean = mean(value), .var = var(value))

# report concatenated and max values for merged intervals
bed_merge(a, .concat = concat(value), .max = max(value))

Getting started

Meta-analysis

This demonstration illustrates how to use valr tools to perform a “meta-analysis” of signals relative to genomic features. Here we to analyze the distribution of histone marks surrounding transcription start sites.

First we load libraries and relevant data.

# `valr_example()` identifies the path of example files
bedfile <- valr_example('genes.hg19.chr22.bed.gz')
genomefile <- valr_example('hg19.chrom.sizes.gz')
bgfile  <- valr_example('hela.h3k4.chip.bg.gz')

genes <- read_bed(bedfile, n_fields = 6)
genome <- read_genome(genomefile)
y <- read_bedgraph(bgfile)

Then we generate 1 bp intervals to represent transcription start sites (TSSs). We focus on + strand genes, but - genes are easily accomodated by filtering them and using bed_makewindows() with reversed window numbers.

# generate 1 bp TSS intervals, `+` strand only
tss <- genes %>%
  filter(strand == '+') %>%
  mutate(end = start + 1)

# 1000 bp up and downstream
region_size <- 1000
# 50 bp windows
win_size <- 50

# add slop to the TSS, break into windows and add a group
x <- tss %>%
  bed_slop(genome, both = region_size) %>%
  bed_makewindows(genome, win_size)

x
#> # A tibble: 13,530 x 7
#>    chrom    start      end      name score strand .win_id
#>    <chr>    <int>    <int>     <chr> <chr>  <chr>   <int>
#>  1 chr22 16161065 16161115 LINC00516     3      +       1
#>  2 chr22 16161115 16161165 LINC00516     3      +       2
#>  3 chr22 16161165 16161215 LINC00516     3      +       3
#>  4 chr22 16161215 16161265 LINC00516     3      +       4
#>  5 chr22 16161265 16161315 LINC00516     3      +       5
#>  6 chr22 16161315 16161365 LINC00516     3      +       6
#>  7 chr22 16161365 16161415 LINC00516     3      +       7
#>  8 chr22 16161415 16161465 LINC00516     3      +       8
#>  9 chr22 16161465 16161515 LINC00516     3      +       9
#> 10 chr22 16161515 16161565 LINC00516     3      +      10
#> # ... with 13,520 more rows

Now we use the .win_id group with bed_map() to caluclate a sum by mapping y signals onto the intervals in x. These data are regrouped by .win_id and a summary with mean and sd values is calculated.

# map signals to TSS regions and calculate summary statistics.
res <- bed_map(x, y, win_sum = sum(value, na.rm = TRUE)) %>%
  group_by(.win_id) %>%
  summarize(win_mean = mean(win_sum, na.rm = TRUE),
            win_sd = sd(win_sum, na.rm = TRUE))

res
#> # A tibble: 41 x 3
#>    .win_id win_mean    win_sd
#>      <int>    <dbl>     <dbl>
#>  1       1 100.8974  85.83423
#>  2       2 110.6829  81.13521
#>  3       3 122.9070  99.09635
#>  4       4 116.2800  96.30098
#>  5       5 116.3500 102.33773
#>  6       6 124.9048  95.08887
#>  7       7 122.9437  94.39792
#>  8       8 127.5946  91.47407
#>  9       9 130.2051  95.71309
#> 10      10 130.1220  88.82809
#> # ... with 31 more rows

Finally, these summary statistics are used to construct a plot that illustrates histone density surrounding TSSs.

library(ggplot2)

x_labels <- seq(-region_size, region_size, by = win_size * 5)
x_breaks <- seq(1, 41, by = 5)

sd_limits <- aes(ymax = win_mean + win_sd, ymin = win_mean - win_sd)

ggplot(res, aes(x = .win_id, y = win_mean)) +
  geom_point() + geom_pointrange(sd_limits) + 
  scale_x_continuous(labels = x_labels, breaks = x_breaks) + 
  xlab('Position (bp from TSS)') + ylab('Signal') + 
  ggtitle('Human H3K4me3 signal near transcription start sites') +
  theme_classic()

API

Function names are similar to their their BEDtools counterparts, with some additions.

Data types

Reading data

Transforming single interval sets

Comparing multiple interval sets

Randomizing intervals

Interval statistics

Utilities