# Introduction

HTSSIP has some functionality for simulating basic HTS-SIP datasets. With that said, I recommend using a more sophisticated simulation toolset such as SIPSim [LINK] for applications other than simple testing of HTS-SIP data analysis functions.

HTSSIP relies heavily on the great R package coenocliner. See this tutorial for a short and simple introduction.

# Simulating a HTS-SIP dataset

In this vignette, we’re going to simulate gradient fraction communities for 6 gradients, with the basic experimental setup as follows:

• Treatments: 13C-glucose vs 12C-control
• Treatment replicates: 3 (each)

First, let’s load some packages including HTSSIP.

library(dplyr)
library(ggplot2)
library(HTSSIP)

OK, let’s set the parameters needed for community simulations. We are basically going to follow the coenocliner tutorial, but instead of a transect along an environmental gradient, we are simulating communities in each fraction of a density gradient.

# setting parameters for tests
set.seed(318)                              # reproducibility
M = 6                                      # number of OTUs (species)
ming = 1.67                                # gradient minimum...
maxg = 1.78                                # ...and maximum
nfrac = 24                                 # number of gradient fractions
locs = seq(ming, maxg, length=nfrac)       # Fraction BD's
tol  = rep(0.005, M)                       # species tolerances
h    = ceiling(rlnorm(M, meanlog=11))      # max abundances

opt = rnorm(M, mean=1.71, sd=0.008)        # species optima (drawn from a normal dist.)
params = cbind(opt=opt, tol=tol, h=h)      # put in a matrix

With the current parameters, we can simulate the gradient fraction communities for 1 density gradient:

df_OTU = gradient_sim(locs, params)
df_OTU
##    OTU.1 OTU.2 OTU.3 OTU.4 OTU.5  OTU.6 Buoyant_density
## 1      0     0     0     0     0      0        1.670000
## 2      0     0     0     0     0      0        1.674783
## 3      0     0     5     0     0      0        1.679565
## 4      0     0   131     0     0      0        1.684348
## 5      0     3  2074     4    14     22        1.689130
## 6      0   134 13016   131   491    727        1.693913
## 7     15  1891 32630  1177  6248   9066        1.698696
## 8    408 11003 33250  4449 31193  48158        1.703478
## 9   3665 25564 13439  6521 62709 100021        1.708261
## 10 12256 23561  2218  3901 50332  83169        1.713043
## 11 16510  8423   151   948 16405  28235        1.717826
## 12  8946  1236     3   108  2052   3800        1.722609
## 13  1953    73     0     3   102    199        1.727391
## 14   163     1     0     0     4      5        1.732174
## 15     5     0     0     0     0      0        1.736957
## 16     0     0     0     0     0      0        1.741739
## 17     0     0     0     0     0      0        1.746522
## 18     0     0     0     0     0      0        1.751304
## 19     0     0     0     0     0      0        1.756087
## 20     0     0     0     0     0      0        1.760870
## 21     0     0     0     0     0      0        1.765652
## 22     0     0     0     0     0      0        1.770435
## 23     0     0     0     0     0      0        1.775217
## 24     0     0     0     0     0      0        1.780000

As you can see, the abundance distribution of each OTU is approximately Gaussian, with varying optima among OTUs.
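To make the shape of these distributions concrete, here is a minimal base-R sketch of a Gaussian response curve for a single OTU. It is an illustration only: the function name `gaussian_response_sketch` and the parameter values are hypothetical, and the Poisson step for turning expected abundances into counts is an assumption rather than a description of what `gradient_sim` does internally.

```r
# Hypothetical helper: expected abundance of one OTU at buoyant density x,
# given its optimum (opt), tolerance (tol), and maximum abundance (h)
gaussian_response_sketch = function(x, opt, tol, h) {
  h * exp(-(x - opt)^2 / (2 * tol^2))
}

set.seed(318)
locs_sketch = seq(1.67, 1.78, length.out=24)     # fraction buoyant densities
expected_abund = gaussian_response_sketch(locs_sketch, opt=1.71, tol=0.005, h=50000)

# Assumption: integer counts drawn around the expected values with Poisson noise
count_sketch = rpois(length(expected_abund), lambda=expected_abund)

# The fraction closest to the optimum carries the highest expected abundance
locs_sketch[which.max(expected_abund)]
```

A small tolerance relative to the gradient range (0.005 versus 0.11 here) is why each OTU is confined to a handful of neighboring fractions.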

### Simulating the full SIP experiment

If all OTUs in the 13C-treatment incorporated labeled isotope, then their abundance distributions should be shifted to ‘heavier’ buoyant densities. Let’s give the 13C-treatment gradients a higher mean species optimum. For kicks, let’s also increase the variance of the species optima (representing more variable isotope incorporation percentages).

opt1 = rnorm(M, mean=1.7, sd=0.005)      # species optima
params1 = cbind(opt=opt1, tol=tol, h=h)  # put in a matrix
opt2 = rnorm(M, mean=1.7, sd=0.005)
params2 = cbind(opt=opt2, tol=tol, h=h)
opt3 = rnorm(M, mean=1.7, sd=0.005)
params3 = cbind(opt=opt3, tol=tol, h=h)
opt4 = rnorm(M, mean=1.72, sd=0.008)
params4 = cbind(opt=opt4, tol=tol, h=h)
opt5 = rnorm(M, mean=1.72, sd=0.008)
params5 = cbind(opt=opt5, tol=tol, h=h)
opt6 = rnorm(M, mean=1.72, sd=0.008)
params6 = cbind(opt=opt6, tol=tol, h=h)

# we need a named list of parameters for the next step. The names in the list will be used as sample IDs
params_all = list(
'12C-Con_rep1' = params1,
'12C-Con_rep2' = params2,
'12C-Con_rep3' = params3,
'13C-Glu_rep1' = params4,
'13C-Glu_rep2' = params5,
'13C-Glu_rep3' = params6
)
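The six parameter matrices above follow one pattern, so the same kind of named list could also be built programmatically. The sketch below is one possible way to do that in base R; `make_params_sketch` and `params_sketch` are hypothetical names introduced here for illustration and are not part of HTSSIP.

```r
set.seed(318)
M   = 6                                   # number of OTUs
tol = rep(0.005, M)                       # species tolerances
h   = ceiling(rlnorm(M, meanlog=11))      # max abundances

# Hypothetical helper: one parameter matrix for one gradient
make_params_sketch = function(opt_mean, opt_sd) {
  cbind(opt=rnorm(M, mean=opt_mean, sd=opt_sd), tol=tol, h=h)
}

# Controls get the lower mean optimum; treatments get a higher mean and sd
opt_means = c(rep(1.70, 3), rep(1.72, 3))
opt_sds   = c(rep(0.005, 3), rep(0.008, 3))

params_sketch = Map(make_params_sketch, opt_means, opt_sds)
# The list names double as sample IDs
names(params_sketch) = c(paste0('12C-Con_rep', 1:3), paste0('13C-Glu_rep', 1:3))

names(params_sketch)
```

Writing the list out by hand, as in the vignette, is easier to read for six gradients; a helper like this mainly pays off when the number of replicates grows.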

Additional sample metadata can be added to the resulting phyloseq object that we are going to simulate. To add metadata, just make a data.frame object with a Gradient column, which will be used for matching to the simulated samples. The Gradient column values should match the names in the parameter list.

meta = data.frame(
'Gradient' = c('12C-Con_rep1', '12C-Con_rep2', '12C-Con_rep3',
'13C-Glu_rep1', '13C-Glu_rep2', '13C-Glu_rep3'),
'Treatment' = c(rep('12C-Con', 3), rep('13C-Glu', 3)),
'Replicate' = c(1:3, 1:3)
)
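Because the Gradient values must match the names of the parameter list, one way to avoid mismatches is to derive the metadata columns from those names. This is a minimal sketch that assumes the `<treatment>_rep<number>` naming scheme used in this vignette; `gradient_ids` and `meta_sketch` are hypothetical names.

```r
# Sample IDs, written out here so the sketch is self-contained;
# in practice these could come from names(params_all)
gradient_ids = c(paste0('12C-Con_rep', 1:3), paste0('13C-Glu_rep', 1:3))

meta_sketch = data.frame(
  Gradient  = gradient_ids,
  # everything before "_rep<number>" is the treatment label
  Treatment = sub('_rep[0-9]+$', '', gradient_ids),
  # the trailing number is the replicate
  Replicate = as.integer(sub('^.*_rep', '', gradient_ids)),
  stringsAsFactors = FALSE
)

meta_sketch
```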

# do the names match?
all(meta$Gradient %in% names(params_all))
## [1] TRUE

OK. Let’s make the phyloseq object with the parameters that we specified above.

# physeq object
physeq_rep3 = HTSSIP_sim(locs, params_all, meta=meta)
physeq_rep3
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 6 taxa and 144 samples ]
## sample_data() Sample Data:       [ 144 samples by 4 sample variables ]

How does the sample_data table look?

phyloseq::sample_data(physeq_rep3) %>% head
##                           Gradient Buoyant_density Treatment Replicate
## 12C-Con_rep1_1.668185 12C-Con_rep1        1.668185   12C-Con         1
## 12C-Con_rep1_1.680254 12C-Con_rep1        1.680254   12C-Con         1
## 12C-Con_rep1_1.679431 12C-Con_rep1        1.679431   12C-Con         1
## 12C-Con_rep1_1.682305 12C-Con_rep1        1.682305   12C-Con         1
## 12C-Con_rep1_1.689408 12C-Con_rep1        1.689408   12C-Con         1
## 12C-Con_rep1_1.691388 12C-Con_rep1        1.691388   12C-Con         1

## Simulating qPCR data

The q-SIP analysis requires qPCR measurements of gene copies for each gradient fraction. We can simulate this data for the HTS-SIP dataset phyloseq object that we just created.

The qPCR value simulation function is fairly flexible. For input, functions are provided that define how qPCR values (averages & variability) relate to buoyant density. For example, you can set the ‘peak’ of qPCR values to be located at a BD of 1.7 for unlabeled gradient samples and a ‘peak’ at a BD of 1.75 for labeled samples. For this example, let’s set the error among replicate qPCR reactions (technical replicates) to scale with the mean qPCR values for that gradient sample.

# unlabeled control qPCR values
control_mean_fun = function(x) dnorm(x, mean=1.70, sd=0.01) * 1e8
# error will scale with the mean
control_sd_fun = function(x) control_mean_fun(x) / 3
# labeled treatment values
treat_mean_fun = function(x) dnorm(x, mean=1.75, sd=0.01) * 1e8
# error will scale with the mean
treat_sd_fun = function(x) treat_mean_fun(x) / 3

OK. Let’s simulate the qPCR data.

physeq_rep3_qPCR = qPCR_sim(physeq_rep3,
control_expr='Treatment=="12C-Con"',
control_mean_fun=control_mean_fun,
control_sd_fun=control_sd_fun,
treat_mean_fun=treat_mean_fun,
treat_sd_fun=treat_sd_fun)

physeq_rep3_qPCR %>% names
## [1] "raw"     "summary"

The output is a list containing:

• ‘raw’ data
  • qPCR values for each PCR reaction
• ‘summary’ data
  • qPCR mean values (& standard deviations) for each gradient fraction sample

physeq_rep3_qPCR$raw %>% head
##   Buoyant_density IS_CONTROL qPCR_tech_rep1 qPCR_tech_rep2 qPCR_tech_rep3
## 1        1.668185       TRUE       38380656       37801138       22008529
## 2        1.680254       TRUE      560887106      659743071      354651067
## 3        1.679431       TRUE      618877197      770165585      538409720
## 4        1.682305       TRUE      554753523     1190798381     1037082632
## 5        1.689408       TRUE     3312072961     1398862657      707835246
## 6        1.691388       TRUE     1515455205     3901629542     2667702680
##                  Sample     Gradient Treatment Replicate
## 1 12C-Con_rep1_1.668185 12C-Con_rep1   12C-Con         1
## 2 12C-Con_rep1_1.680254 12C-Con_rep1   12C-Con         1
## 3 12C-Con_rep1_1.679431 12C-Con_rep1   12C-Con         1
## 4 12C-Con_rep1_1.682305 12C-Con_rep1   12C-Con         1
## 5 12C-Con_rep1_1.689408 12C-Con_rep1   12C-Con         1
## 6 12C-Con_rep1_1.691388 12C-Con_rep1   12C-Con         1

physeq_rep3_qPCR$summary %>% head
##   IS_CONTROL                Sample Buoyant_density qPCR_tech_rep_mean
## 1      FALSE 13C-Glu_rep1_1.671712        1.671712       1.953729e-04
## 2      FALSE 13C-Glu_rep1_1.671722        1.671722       2.710671e-04
## 3      FALSE 13C-Glu_rep1_1.680311        1.680311       1.368177e-01
## 4      FALSE 13C-Glu_rep1_1.683540        1.683540       1.039858e+00
## 5      FALSE 13C-Glu_rep1_1.688831        1.688831       2.230201e+01
## 6      FALSE 13C-Glu_rep1_1.696594        1.696594       2.143897e+03
##   qPCR_tech_rep_sd     Gradient Treatment Replicate
## 1     5.435725e-05 13C-Glu_rep1   13C-Glu         1
## 2     3.818031e-05 13C-Glu_rep1   13C-Glu         1
## 3     6.395073e-02 13C-Glu_rep1   13C-Glu         1
## 4     6.074444e-01 13C-Glu_rep1   13C-Glu         1
## 5     1.446576e+01 13C-Glu_rep1   13C-Glu         1
## 6     6.352483e+02 13C-Glu_rep1   13C-Glu         1

Let’s plot the data.

x_lab = 'Buoyant density (g ml^-1)'
y_lab = '16S rRNA gene copies'
ggplot(physeq_rep3_qPCR$summary, aes(Buoyant_density, qPCR_tech_rep_mean,
ymin=qPCR_tech_rep_mean-qPCR_tech_rep_sd,
ymax=qPCR_tech_rep_mean+qPCR_tech_rep_sd,
color=IS_CONTROL, shape=Replicate)) +
geom_pointrange() +
scale_color_discrete('Unlabeled\ncontrol') +
labs(x=x_lab, y=y_lab) +
theme_bw()

With this simulation, we made the separation between labeled and unlabeled DNA pretty easy to distinguish.
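To see how a pair of mean and standard-deviation functions translates into technical replicates, here is a small base-R sketch. It is not the internals of `qPCR_sim`: drawing replicates from a normal distribution and truncating negative values at zero are both assumptions made for illustration, and `bd_sketch` and `tech_reps` are hypothetical names.

```r
set.seed(1)

# Same shape as the control functions used above: peak at BD 1.70,
# with the standard deviation scaling with the mean
mean_fun_sketch = function(x) dnorm(x, mean=1.70, sd=0.01) * 1e8
sd_fun_sketch   = function(x) mean_fun_sketch(x) / 3

bd_sketch  = c(1.68, 1.70, 1.72)   # three example fraction buoyant densities
n_tech_rep = 3                     # technical replicates per fraction

# Assumption: each replicate is a normal draw around the fraction's mean
tech_reps = t(sapply(bd_sketch, function(x) {
  rnorm(n_tech_rep, mean=mean_fun_sketch(x), sd=sd_fun_sketch(x))
}))
tech_reps[tech_reps < 0] = 0       # copy numbers cannot be negative
colnames(tech_reps) = paste0('qPCR_tech_rep', 1:n_tech_rep)

data.frame(Buoyant_density=bd_sketch, tech_reps,
           rep_mean=rowMeans(tech_reps),
           rep_sd=apply(tech_reps, 1, sd))
```

Because the standard deviation is a fixed fraction of the mean, fractions near the peak show the largest absolute spread among replicates, which is what the error bars in the plot reflect.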

OK. Just to show the flexibility of the qPCR value simulation function, let’s try using some other functions as input.

# using the Cauchy distribution instead of normal distributions
control_mean_fun = function(x) dcauchy(x, location=1.70, scale=0.01) * 1e8
control_sd_fun = function(x) control_mean_fun(x) / 3
treat_mean_fun = function(x) dcauchy(x, location=1.75, scale=0.01) * 1e8
treat_sd_fun = function(x) treat_mean_fun(x) / 3
# simulating qPCR values
physeq_rep3_qPCR = qPCR_sim(physeq_rep3,
control_expr='Treatment=="12C-Con"',
control_mean_fun=control_mean_fun,
control_sd_fun=control_sd_fun,
treat_mean_fun=treat_mean_fun,
treat_sd_fun=treat_sd_fun)

Now, how does the data look?

ggplot(physeq_rep3_qPCR$summary, aes(Buoyant_density, qPCR_tech_rep_mean,
ymin=qPCR_tech_rep_mean-qPCR_tech_rep_sd,
ymax=qPCR_tech_rep_mean+qPCR_tech_rep_sd,
color=IS_CONTROL, shape=Replicate)) +
geom_pointrange() +
scale_color_discrete('Unlabeled\ncontrol') +
labs(x=x_lab, y=y_lab) +
theme_bw()
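The main difference to expect from the Cauchy-based functions is heavier tails: mean qPCR values fall off much more slowly with distance from the peak. The base-R sketch below compares the two control mean functions at a few buoyant densities; `bd_tail_sketch` and the other variable names are hypothetical and used only for this comparison.

```r
# Buoyant densities at the control peak, 0.02 away, and 0.05 away
bd_tail_sketch = c(1.70, 1.72, 1.75)

normal_vals = dnorm(bd_tail_sketch, mean=1.70, sd=0.01) * 1e8
cauchy_vals = dcauchy(bd_tail_sketch, location=1.70, scale=0.01) * 1e8

# ratio > 1 means the Cauchy curve predicts more gene copies at that density
data.frame(Buoyant_density=bd_tail_sketch,
           normal=normal_vals,
           cauchy=cauchy_vals,
           ratio=cauchy_vals / normal_vals)
```

With these settings the normal curve is higher at the peak, while the Cauchy curve is higher in the tails, so the labeled and unlabeled distributions overlap more than in the first simulation.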

# Session info

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] phyloseq_1.19.1 HTSSIP_1.2.0    ggplot2_2.2.1   tidyr_0.6.1
## [5] dplyr_0.5.0
##
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2      splines_3.3.3       lattice_0.20-35
##  [4] rhdf5_2.18.0        colorspace_1.3-2    htmltools_0.3.5
##  [7] stats4_3.3.3        yaml_2.1.14         mgcv_1.8-17
## [10] survival_2.41-3     DBI_0.6-1           BiocGenerics_0.20.0
## [13] foreach_1.4.3       plyr_1.8.4          stringr_1.2.0
## [16] zlibbioc_1.20.0     Biostrings_2.42.1   munsell_0.4.3
## [19] gtable_0.2.0        codetools_0.2-15    evaluate_0.10
## [22] labeling_0.3        Biobase_2.34.0      knitr_1.15.1
## [25] permute_0.9-4       IRanges_2.8.2       biomformat_1.2.0
## [28] parallel_3.3.3      Rcpp_0.12.10        scales_0.4.1
## [31] backports_1.0.5     vegan_2.4-3         S4Vectors_0.12.2
## [34] jsonlite_1.4        XVector_0.14.1      digest_0.6.12
## [37] coenocliner_0.2-2   stringi_1.1.5       grid_3.3.3
## [58] compiler_3.3.3