First, let’s load some packages including HTSSIP
.
library(dplyr)
library(ggplot2)
library(HTSSIP)
See HTSSIP introduction vignette for a description on why dataset parsing (all treatment-control comparisons) is needed.
Let’s see the already parsed dataset
physeq_S2D2_l
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10001 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 10001 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10001 tips and 10000 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10001 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 10001 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10001 tips and 10000 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10001 taxa and 47 samples ]
## sample_data() Sample Data: [ 47 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 10001 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10001 tips and 10000 internal nodes ]
##
## $`(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')`
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10001 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 10001 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10001 tips and 10000 internal nodes ]
Let’s set some parameters used later.
# adjusted P-value cutoff
padj_cutoff = 0.1
# number of cores for parallel processing (increase depending on your computational hardware)
ncores = 2
First, we’ll just run HR-SIP on 1 treatment-control comparison. Let’s get the individual phyloseq object.
physeq = physeq_S2D2_l[[1]]
physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10001 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 17 sample variables ]
## tax_table() Taxonomy Table: [ 10001 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 10001 tips and 10000 internal nodes ]
Let’s check that the samples belong to either a 13C-treatment or 12C-control.
physeq %>% sample_data %>% .$Substrate %>% table
## .
## 12C-Con 13C-Cel
## 23 23
OK, we should be ready to run HR-SIP!
Note that the design
parameter for HRSIP()
is the experimental design parameter for calculating log2 fold change (l2fc) values with DESeq. Here, it’s used to distinguish label-treatment and unlabel-control samples.
df_l2fc = HRSIP(physeq,
design = ~Substrate,
padj_cutoff = padj_cutoff,
sparsity_threshold = c(0,0.15,0.3)) # just using 3 thresholds to reduce time
## Sparsity threshold: 0
## Density window: 1.7-1.75
## Sparsity threshold: 0.15
## Density window: 1.7-1.75
## Sparsity threshold: 0.3
## Density window: 1.7-1.75
## Sparsity threshold with the most rejected hypotheses: 0.15
df_l2fc %>% head(n=3)
## # A tibble: 3 × 17
## OTU log2FoldChange p padj Rank1 Rank2
## <chr> <dbl> <dbl> <dbl> <fctr> <fctr>
## 1 OTU.514 -0.2450007 0.8528092 1 Bacteria __Proteobacteria
## 2 OTU.816 -0.5302828 0.9141156 1 Bacteria __Proteobacteria
## 3 OTU.1099 -0.3096987 0.8362421 1 Bacteria __Acidobacteria
## # ... with 11 more variables: Rank3 <fctr>, Rank4 <fctr>, Rank5 <fctr>,
## # Rank6 <fctr>, Rank7 <fctr>, Rank8 <fctr>, density_min <dbl>,
## # density_max <dbl>, sparsity_threshold <dbl>, sparsity_apply <chr>,
## # l2fc_threshold <dbl>
How many “incorporators”" (rejected hypotheses)?
df_l2fc %>%
filter(padj < padj_cutoff) %>%
group_by() %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length)
## # A tibble: 1 × 1
## n_incorp_OTUs
## <int>
## 1 6
Let’s plot a breakdown of incorporators for each phylum.
# summarizing
df_l2fc_s = df_l2fc %>%
filter(padj < padj_cutoff) %>%
mutate(Rank2 = gsub('^__', '', Rank2)) %>%
group_by(Rank2) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
ungroup()
# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
geom_bar(stat='identity') +
labs(x='Phylum', y='Number of incorporators') +
theme_bw() +
theme(
axis.text.x = element_text(angle=45, hjust=1)
)
Let’s now run HR-SIP on all treatment-control comparisons in the dataset:
# Number of comparisons
physeq_S2D2_l %>% length
## [1] 4
The function plyr::ldply()
is useful (compared to lapply()
) beccause it can be run in parallel and returns a data.frame object.
# Running in parallel; you may need to change the backend for your environment.
# Or you can just set .parallel=FALSE.
doParallel::registerDoParallel(ncores)
df_l2fc = plyr::ldply(physeq_S2D2_l,
HRSIP,
design = ~Substrate,
padj_cutoff = padj_cutoff,
sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
.parallel=TRUE)
df_l2fc %>% head(n=3)
## .id
## 1 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 2 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 3 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## OTU log2FoldChange p padj Rank1 Rank2
## 1 OTU.514 -0.2450007 0.8528092 1 Bacteria __Proteobacteria
## 2 OTU.816 -0.5302828 0.9141156 1 Bacteria __Proteobacteria
## 3 OTU.1099 -0.3096987 0.8362421 1 Bacteria __Acidobacteria
## Rank3 Rank4 Rank5
## 1 __Deltaproteobacteria __Desulfobacterales __Nitrospinaceae
## 2 __Deltaproteobacteria __Desulfobacterales __Nitrospinaceae
## 3 __32-21 __uncultured_bacterium <NA>
## Rank6 Rank7 Rank8 density_min density_max
## 1 __uncultured __uncultured_bacterium <NA> 1.7 1.75
## 2 __uncultured __uncultured_bacterium <NA> 1.7 1.75
## 3 <NA> <NA> <NA> 1.7 1.75
## sparsity_threshold sparsity_apply l2fc_threshold
## 1 0.15 all 0.25
## 2 0.15 all 0.25
## 3 0.15 all 0.25
Each specific phyloseq subset (treatment-control comparison) is delimited with the “.id” column.
df_l2fc %>% .$.id %>% unique
## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"
For clarity, let’s edit these long strings to make them more readable when plotted.
df_l2fc = df_l2fc %>%
mutate(.id = gsub(' \\| ', '\n', .id))
df_l2fc %>% .$.id %>% unique
## [1] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')"
## [2] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')"
How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?
Note: you could set one sparsity cutoff for all comparisons by setting the sparsity_cutoff
to a specific value.
df_l2fc %>%
filter(padj < padj_cutoff) %>%
group_by(.id, sparsity_threshold) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
as.data.frame
## .id
## 1 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14')\n(Substrate=='13C-Glu' & Day == '14')
## 3 (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Cel' & Day == '3')
## 4 (Substrate=='12C-Con' & Day=='3')\n(Substrate=='13C-Glu' & Day == '3')
## sparsity_threshold n_incorp_OTUs
## 1 0.30 141
## 2 0.30 204
## 3 0.15 6
## 4 0.30 45
How about a breakdown of incorporators for each phylum in each comparision.
# summarizing
df_l2fc_s = df_l2fc %>%
filter(padj < padj_cutoff) %>%
mutate(Rank2 = gsub('^__', '', Rank2)) %>%
group_by(.id, Rank2) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
ungroup()
# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
geom_bar(stat='identity') +
labs(x='Phylum', y='Number of incorporators') +
facet_wrap(~ .id, scales='free') +
theme_bw() +
theme(
axis.text.x = element_text(angle=55, hjust=1)
)
MW-HR-SIP is run very similarly to HRSIP, but it uses multiple buoyant density (BD) windows. MW-HR-SIP is performed with the HRSIP()
function, but multiple BD windows are specified.
Let’s use 3 buoyant density windows (g/ml):
1.70-1.73, 1.72-1.75, 1.74-1.77
windows = data.frame(density_min=c(1.70, 1.72, 1.74),
density_max=c(1.73, 1.75, 1.77))
windows
## density_min density_max
## 1 1.70 1.73
## 2 1.72 1.75
## 3 1.74 1.77
Running HRSIP with all 3 BD windows. Let’s run this in parallel to speed things up.
You can turn off parallel processing by setting the parallel
option to FALSE
. Also, there’s 2 different levels that could be parallelized: either the ldply()
or HRSIP()
. Here, I’m running HRSIP in parallel, but it may make sense in other situations (eg., many comparisons but few density windows and/or sparsity cutoffs) to use ldply in parallel only.
doParallel::registerDoParallel(ncores)
df_l2fc = plyr::ldply(physeq_S2D2_l,
HRSIP,
density_windows = windows,
design = ~Substrate,
padj_cutoff = padj_cutoff,
sparsity_threshold = c(0,0.15,0.3), # just using 3 thresholds to reduce run time
.parallel = TRUE)
df_l2fc %>% head(n=3)
## .id
## 1 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 2 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 3 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## OTU log2FoldChange p padj Rank1 Rank2
## 1 OTU.514 -0.1860702 0.77035746 1 Bacteria __Proteobacteria
## 2 OTU.1415 0.6690787 0.23939655 1 Bacteria __Acidobacteria
## 3 OTU.1136 1.4397854 0.02300216 1 Bacteria __Acidobacteria
## Rank3 Rank4 Rank5
## 1 __Deltaproteobacteria __Desulfobacterales __Nitrospinaceae
## 2 __DA023 <NA> <NA>
## 3 __DA023 __uncultured_bacterium <NA>
## Rank6 Rank7 Rank8 density_min density_max
## 1 __uncultured __uncultured_bacterium <NA> 1.7 1.73
## 2 <NA> <NA> <NA> 1.7 1.73
## 3 <NA> <NA> <NA> 1.7 1.73
## sparsity_threshold sparsity_apply l2fc_threshold
## 1 0.3 all 0.25
## 2 0.3 all 0.25
## 3 0.3 all 0.25
Let’s check that we have all treatment-control comparisons.
df_l2fc %>% .$.id %>% unique
## [1] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')"
## [2] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')"
## [3] "(Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')"
## [4] "(Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')"
How many incorporators (rejected hypotheses) & which sparsity cutoff was used for each comparison?
Note: one sparsity cutoff could be set for all comparisons by setting the sparsity_cutoff
to a specific value.
df_l2fc %>%
filter(padj < padj_cutoff) %>%
group_by(.id, sparsity_threshold) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
as.data.frame
## .id
## 1 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Cel' & Day == '14')
## 2 (Substrate=='12C-Con' & Day=='14') | (Substrate=='13C-Glu' & Day == '14')
## 3 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Cel' & Day == '3')
## 4 (Substrate=='12C-Con' & Day=='3') | (Substrate=='13C-Glu' & Day == '3')
## sparsity_threshold n_incorp_OTUs
## 1 0.3 158
## 2 0.3 276
## 3 0.3 8
## 4 0.3 67
The density windows can vary for each OTU. Let’s plot which density windows were used for the OTUs in the dataset.
# summarizing
df_l2fc_s = df_l2fc %>%
mutate(.id = gsub(' \\| ', '\n', .id)) %>%
filter(padj < padj_cutoff) %>%
mutate(density_range = paste(density_min, density_max, sep='-')) %>%
group_by(.id, sparsity_threshold, density_range) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length)
#plotting
ggplot(df_l2fc_s, aes(.id, n_incorp_OTUs, fill=density_range)) +
geom_bar(stat='identity', position='fill') +
labs(x='Control-treatment comparision', y='Fraction of incorporators') +
scale_y_continuous(expand=c(0,0)) +
theme_bw() +
theme(
axis.text.x = element_text(angle=55, hjust=1)
)
Different BD windows were used for different treatment-control comparisons because the amount of BD shift likely varied among taxa. For example, if a taxon incorporates 100% 13C isotope, then a very ‘heavy’ BD window may show a larger l2fc than a less ‘heavy’ BD window.
Let’s look at a breakdown of incorporators for each phylum in each comparision.
# summarizing
df_l2fc_s = df_l2fc %>%
mutate(.id = gsub(' \\| ', '\n', .id)) %>%
filter(padj < padj_cutoff) %>%
mutate(Rank2 = gsub('^__', '', Rank2)) %>%
group_by(.id, Rank2) %>%
summarize(n_incorp_OTUs = OTU %>% unique %>% length) %>%
ungroup()
# plotting
ggplot(df_l2fc_s, aes(Rank2, n_incorp_OTUs)) +
geom_bar(stat='identity') +
labs(x='Phylum', y='Number of incorporators') +
facet_wrap(~ .id, scales='free') +
theme_bw() +
theme(
axis.text.x = element_text(angle=55, hjust=1)
)
Note that the MW-HR-SIP method identifies more incorporators than the HR-SIP method (which uses just one BD window).
MW-HR-SIP detects more taxa for 2 main reasons. First, taxa vary in G+C content, so using only 1 BD window likely encompasses BD shifts for taxa of certain G+C contents (eg., ~50% G+C), but may miss other taxa with higher or lower G+C content. Second, taxa can vary in how much isotope was incorporated, which will affect where each taxon’s DNA is in the density gradient.
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phyloseq_1.19.1 HTSSIP_1.2.0 ggplot2_2.2.1 tidyr_0.6.1
## [5] dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.34.0 jsonlite_1.4
## [3] splines_3.3.3 foreach_1.4.3
## [5] Formula_1.2-1 assertthat_0.2.0
## [7] stats4_3.3.3 latticeExtra_0.6-28
## [9] yaml_2.1.14 RSQLite_1.1-2
## [11] backports_1.0.5 lattice_0.20-35
## [13] digest_0.6.12 GenomicRanges_1.26.4
## [15] RColorBrewer_1.1-2 XVector_0.14.1
## [17] checkmate_1.8.2 colorspace_1.3-2
## [19] htmltools_0.3.5 Matrix_1.2-8
## [21] plyr_1.8.4 XML_3.98-1.6
## [23] DESeq2_1.14.1 genefilter_1.56.0
## [25] zlibbioc_1.20.0 xtable_1.8-2
## [27] scales_0.4.1 BiocParallel_1.8.2
## [29] annotate_1.52.1 tibble_1.3.0
## [31] htmlTable_1.9 mgcv_1.8-17
## [33] IRanges_2.8.2 SummarizedExperiment_1.4.0
## [35] nnet_7.3-12 BiocGenerics_0.20.0
## [37] lazyeval_0.2.0 survival_2.41-3
## [39] magrittr_1.5 memoise_1.1.0
## [41] evaluate_0.10 doParallel_1.0.10
## [43] nlme_3.1-131 MASS_7.3-45
## [45] foreign_0.8-67 vegan_2.4-3
## [47] tools_3.3.3 data.table_1.10.4
## [49] stringr_1.2.0 S4Vectors_0.12.2
## [51] locfit_1.5-9.1 munsell_0.4.3
## [53] cluster_2.0.6 AnnotationDbi_1.36.2
## [55] Biostrings_2.42.1 ade4_1.7-6
## [57] compiler_3.3.3 GenomeInfoDb_1.10.3
## [59] rhdf5_2.18.0 grid_3.3.3
## [61] coenocliner_0.2-2 RCurl_1.95-4.8
## [63] iterators_1.0.8 biomformat_1.2.0
## [65] htmlwidgets_0.8 igraph_1.0.1
## [67] bitops_1.0-6 base64enc_0.1-3
## [69] labeling_0.3 rmarkdown_1.4
## [71] gtable_0.2.0 codetools_0.2-15
## [73] multtest_2.30.0 DBI_0.6-1
## [75] reshape2_1.4.2 R6_2.2.0
## [77] gridExtra_2.2.1 knitr_1.15.1
## [79] Hmisc_4.0-2 rprojroot_1.2
## [81] permute_0.9-4 ape_4.1
## [83] stringi_1.1.5 parallel_3.3.3
## [85] Rcpp_0.12.10 geneplotter_1.52.0
## [87] rpart_4.1-10 acepack_1.4.1