This document provides brief examples of how jackalope
can be used to generate sequencing data that can inform some common sampling decisions for HTS studies.
For an example reference assembly, I simply simulated one of size 1 kb using the following code:
This resulted in the following ref_genome
object:
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#> name chromosome length
#> chrom0 TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA 100
#> chrom1 CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA 100
#> chrom2 AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG 100
#> chrom3 TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT 100
#> chrom4 AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA 100
#> chrom5 CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA 100
#> chrom6 GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC 100
#> chrom7 CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT 100
#> chrom8 TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG 100
#> chrom9 TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT 100
For molecular-evolution information, I used the TN93 model, an insertion rate of 2e-5
for lengths from 1 to 10, and a deletion rate of 1e-5
for lengths from 1 to 40. By default, the indel
function creates rates that are proportional to exp(-L)
for indel length L
from 1 to the maximum length.
sub <- sub_TN93(pi_tcag = c(0.1, 0.2, 0.3, 0.4),
alpha_1 = 0.0001, alpha_2 = 0.0002,
beta = 0.00015)
ins <- indels(rate = 2e-5, max_length = 10)
del <- indels(rate = 1e-5, max_length = 40)
The example here produces FASTQ files from the known reference assembly that could test strategies for how to assemble a similar genome using HTS data.
The first strategy is to use only PacBio sequencing. The PacBio Sequel system produces up to 500,000 reads per Single Molecule, Real-Time (SMRT) cell, so you could run the following for two cells:
An alternative, hybrid strategy uses 1 SMRT cell of PacBio sequencing and 1 lane (\(\sim 500\) million reads) of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system (the default Illumina system in jackalope
):
pacbio(ref, out_prefix = "pacbio", n_reads = 500e3)
illumina(ref, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
read_length = 100)
The last strategy combines 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing with 1 flow cell of \(2 \times 250\)bp mate-pair sequencing on an Illumina MiSeq v3. The mate-pair sequencing uses longer fragments (defaults are mean of 400 and standard deviation of 100) to better cover highly repetitive regions.
Here, I will demonstrate how to generate population-genomic data of a type that might be used to estimate the divergence between two populations. I first use the scrm
package to conduct coalescent simulations that will generate segregating sites for 10 haploid variants from the reference genome. Five of the variants are from one population, five from another. The symmetrical migration rate is 100 individuals per generation. I set the population-scaled mutation rate (\(\theta = 4 N_0 \mu\)) to 10
for this example.
Using the previously created objects for molecular evolution information (sub
, ins
, and del
), I create variants from the reference genome:
This results in the following set of variants:
#> << Variants object >>
#> # Variants: 10
#> # Mutations: 1,505
#>
#> << Reference genome info: >>
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#> name chromosome length
#> chrom0 TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA 100
#> chrom1 CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA 100
#> chrom2 AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG 100
#> chrom3 TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT 100
#> chrom4 AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA 100
#> chrom5 CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA 100
#> chrom6 GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC 100
#> chrom7 CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT 100
#> chrom8 TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG 100
#> chrom9 TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT 100
For a file of true divergences from the reference genome, the write_vcf
function writes the variants
object to a VCF file:
Lastly, I simulate 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing. In this case, individuals within a population are pooled, and the population sequences are derived from are identified by barcodes.
illumina(vars, out_prefix = "vars_illumina", n_reads = 500e6, paired = TRUE,
read_length = 100, barcodes = c(rep("AACCGCGG", 20),
rep("GGTTATAA", 20)))
The below example instead has each individual variant’s reads output to separate FASTQ files:
This section shows how jackalope
can generate variants from a phylogeny, then simulate sequencing data from those variants to test phylogeny reconstruction methods. First, I generated a random coalescent tree of 10 species using the ape
package.
Then I input that to the create_variants
function alongside the molecular evolution information.
This results in the following variants
object:
#> << Variants object >>
#> # Variants: 10
#> # Mutations: 3
#>
#> << Reference genome info: >>
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#> name chromosome length
#> chrom0 TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA 100
#> chrom1 CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA 100
#> chrom2 AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG 100
#> chrom3 TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT 100
#> chrom4 AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA 100
#> chrom5 CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA 100
#> chrom6 GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC 100
#> chrom7 CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT 100
#> chrom8 TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG 100
#> chrom9 TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT 100
Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.
Similar to the section above, the ultimate goal here is to test phylogeny reconstruction methods. The difference in this section is that instead of using a single, straightforward phylogeny, I use multiple gene trees per chromosome. In the species used in these simulations, species 3 diverged from 1 and 2 at \(t = 0.5\), where \(t\) indicates time into the past and is in units of \(4 N_0\) generations. Species 1 and 2 diverged at \(t = 0.25\). I assume a recombination rate of \(1 / (4 N_0)\) recombination events per chromosome per generation. Because each chromosome is a different length and the length is required for including a recombination rate, I had to run scrm
separately for each chromosome. There are 4 diploid individuals sampled per species.
# Run scrm for one chromosome size:
one_chrom <- function(size) {
sims <- scrm(
paste("24 1",
# Output gene trees:
"-T",
# Recombination:
"-r 1", size,
# 3 species with no ongoing migration:
"-I 3", paste(rep("8", 3), collapse = " "), "0",
# Species 2 derived from 1 at time 1.0:
"-ej 0.5 2 1",
# Species 3 derived from 2 at time 0.5:
"-ej 0.25 3 2"
))
return(sims$trees[[1]])
}
# For all chromosomes:
gtrees <- list(trees = lapply(ref$sizes(), one_chrom))
This results in the following variants
object:
#> << Variants object >>
#> # Variants: 24
#> # Mutations: 433
#>
#> << Reference genome info: >>
#> < Set of 10 chromosomes >
#> # Total size: 1,000 bp
#> name chromosome length
#> chrom0 TGTTCTCACGGAAAACGGGAAGTTTTTG...GTCACGTCAATGGAGTCTTGGGTCTCTA 100
#> chrom1 CCTCATGTTTCTAGTAGCTGCAACGCTC...TTGTGAACTCCCAGAGGCCTATGCTAAA 100
#> chrom2 AAATCGTCAAGGCTTCGACGTAGAGCCG...CATTGCAGTTCGGAACCGAGGACGGGCG 100
#> chrom3 TTCAGGCAGTGTGGTCTACTGGGCGCAG...TCAGAGCGATAAGATAGATCCGGATAAT 100
#> chrom4 AGGCTGACACCGCCGATGGCCGGGTCCA...GGGCGTCGGGCGAATGCTCGCATGCGAA 100
#> chrom5 CACTGAACGTTAGCAACTTCATCTACAT...GATTTCCTATCTGTGTGTTAGGACTATA 100
#> chrom6 GCGGTTAAGACCGAACCATCCGTGGGAC...AAAGTCTCTCGCAAGCCAGTCAGTGCCC 100
#> chrom7 CCGAATGCCAGTAAGCTCGAGCGTATAA...GTAAGCGTAACGCGCGTGCTGGTTTCGT 100
#> chrom8 TATAGCTAGACCGCCTCCGCGTACGACC...CGGCCGGGCTATCGGGCCTGGTAGGGGG 100
#> chrom9 TAAAGAGTGGGCGGGTCCGTAAAGTACC...CCACGAAGGCTCACGGTCACAATAGGTT 100
To store mutation information by diploid sample, the write_vcf
function writes the variants
object to a VCF file. It assigns haplotypes to diploid samples using a matrix for the sample_matrix
argument:
write_vcf(vars, out_prefix = "var_gtrees",
sample_matrix = matrix(1:vars$n_vars(), ncol = 2, byrow = TRUE))
Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.