Intro to jackalope

This document provides brief examples of how jackalope can be used to generate sequencing data that can inform some common sampling decisions for HTS studies.

For an example reference assembly, I simply simulated one of size 1 kb using the following code:

ref <- create_genome(10, 1e2)

This resulted in the following ref_genome object:

#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#>   name                             chromosome                             length
#> chrom0     TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA       100
#> chrom1     CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA       100
#> chrom2     AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG       100
#> chrom3     TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT       100
#> chrom4     AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA       100
#> chrom5     CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA       100
#> chrom6     GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC       100
#> chrom7     CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT       100
#> chrom8     TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG       100
#> chrom9     TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT       100

For molecular-evolution information, I used the TN93 model, an insertion rate of 2e-5 for lengths from 1 to 10, and a deletion rate of 1e-5 for lengths from 1 to 40. By default, the indel function creates rates that are proportional to exp(-L) for indel length L from 1 to the maximum length.

sub <- sub_TN93(pi_tcag = c(0.1, 0.2, 0.3, 0.4),
                alpha_1 = 0.0001, alpha_2 = 0.0002,
                beta = 0.00015)
ins <- indels(rate = 2e-5, max_length = 10)
del <- indels(rate = 1e-5, max_length = 40)

Assembling a genome

The example here produces FASTQ files from the known reference assembly that could test strategies for how to assemble a similar genome using HTS data.

The first strategy is to use only PacBio sequencing. The PacBio Sequel system produces up to 500,000 reads per Single Molecule, Real-Time (SMRT) cell, so you could run the following for two cells:

pacbio(ref, out_prefix = "pacbio", n_reads = 2 * 500e3)

An alternative, hybrid strategy uses 1 SMRT cell of PacBio sequencing and 1 lane (\(\sim 500\) million reads) of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system (the default Illumina system in jackalope):

pacbio(ref, out_prefix = "pacbio", n_reads = 500e3)
illumina(ref, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100)

The last strategy combines 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing with 1 flow cell of \(2 \times 250\)bp mate-pair sequencing on an Illumina MiSeq v3. The mate-pair sequencing uses longer fragments (defaults are mean of 400 and standard deviation of 100) to better cover highly repetitive regions.

illumina(ref, out_prefix = "ill_pe", n_reads = 500e6, paired = TRUE,
         read_length = 100)
illumina(ref, out_prefix = "ill_mp", seq_sys = "MSv3",
         read_length = 250, n_reads = 50e6, matepair = TRUE, 
         frag_mean = 3000, frag_sd = 500)

Estimating divergence between populations

Here, I will demonstrate how to generate population-genomic data of a type that might be used to estimate the divergence between two populations. I first use the scrm package to conduct coalescent simulations that will generate segregating sites for 10 haploid variants from the reference genome. Five of the variants are from one population, five from another. The symmetrical migration rate is 100 individuals per generation. I set the population-scaled mutation rate (\(\theta = 4 N_0 \mu\)) to 10 for this example.

library(scrm)
ssites <- scrm(paste("10", ref$n_chroms(), "-t 10 -I 2 5 5 100"))

Using the previously created objects for molecular evolution information (sub, ins, and del), I create variants from the reference genome:

vars <- create_variants(ref, vars_ssites(ssites), sub, ins, del)

This results in the following set of variants:

#>                               << Variants object >>
#> # Variants: 10
#> # Mutations: 1,505
#> 
#>                           << Reference genome info: >>
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#>   name                             chromosome                             length
#> chrom0     TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA       100
#> chrom1     CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA       100
#> chrom2     AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG       100
#> chrom3     TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT       100
#> chrom4     AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA       100
#> chrom5     CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA       100
#> chrom6     GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC       100
#> chrom7     CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT       100
#> chrom8     TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG       100
#> chrom9     TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT       100

For a file of true divergences from the reference genome, the write_vcf function writes the variants object to a VCF file:

write_vcf(vars, "variants")

Lastly, I simulate 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing. In this case, individuals within a population are pooled, and the population sequences are derived from are identified by barcodes.

illumina(vars, out_prefix = "vars_illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100, barcodes = c(rep("AACCGCGG", 20), 
                                         rep("GGTTATAA", 20)))

The below example instead has each individual variant’s reads output to separate FASTQ files:

illumina(vars, out_prefix = "vars_illumina", n_reads = 500e6, paired = TRUE,
         read_length = 100, sep_files = TRUE)

Constructing a phylogeny

From one phylogenetic tree

This section shows how jackalope can generate variants from a phylogeny, then simulate sequencing data from those variants to test phylogeny reconstruction methods. First, I generated a random coalescent tree of 10 species using the ape package.

Then I input that to the create_variants function alongside the molecular evolution information.

This results in the following variants object:

#>                               << Variants object >>
#> # Variants: 10
#> # Mutations: 3
#> 
#>                           << Reference genome info: >>
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#>   name                             chromosome                             length
#> chrom0     TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA       100
#> chrom1     CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA       100
#> chrom2     AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG       100
#> chrom3     TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT       100
#> chrom4     AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA       100
#> chrom5     CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA       100
#> chrom6     GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC       100
#> chrom7     CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT       100
#> chrom8     TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG       100
#> chrom9     TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT       100

Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.

From gene trees

Similar to the section above, the ultimate goal here is to test phylogeny reconstruction methods. The difference in this section is that instead of using a single, straightforward phylogeny, I use multiple gene trees per chromosome. In the species used in these simulations, species 3 diverged from 1 and 2 at \(t = 0.5\), where \(t\) indicates time into the past and is in units of \(4 N_0\) generations. Species 1 and 2 diverged at \(t = 0.25\). I assume a recombination rate of \(1 / (4 N_0)\) recombination events per chromosome per generation. Because each chromosome is a different length and the length is required for including a recombination rate, I had to run scrm separately for each chromosome. There are 4 diploid individuals sampled per species.

This results in the following variants object:

#>                               << Variants object >>
#> # Variants: 24
#> # Mutations: 433
#> 
#>                           << Reference genome info: >>
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#>   name                             chromosome                             length
#> chrom0     TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA       100
#> chrom1     CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA       100
#> chrom2     AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG       100
#> chrom3     TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT       100
#> chrom4     AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA       100
#> chrom5     CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA       100
#> chrom6     GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC       100
#> chrom7     CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT       100
#> chrom8     TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG       100
#> chrom9     TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT       100

To store mutation information by diploid sample, the write_vcf function writes the variants object to a VCF file. It assigns haplotypes to diploid samples using a matrix for the sample_matrix argument:

Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.