Overview

In the Introduction vignette users learned how to perform the most fundamental computations and visualizations for phylotranscriptomics analyses using the myTAI package. Especially in the Introduction vignette we demonstrated how to perform Phylostratigraphy and Divergence Stratigraphy as well as the construction of PhyloExpressionSets and DivergenceExpressionSets using the MatchMap() function.

In the Intermediate vignette users learned about more detailed analyses and more specialized techniques to investigate the observed phylotranscriptomics patterns (TAI, TDI, RE, etc.).

This vignette aims to provide users with in-depth tutorials on how to retrieve published Phylostratigraphic Maps and Divergence Maps, the integrattion of these data sets into cutomized phylotranscriptomics pipelines, and additional types of phylotranscriptomics analyses focusing on the expression level distributions of all age or divergence categories.

Biological Sequence Retrieval with biomartr

Both methods Phylostratigraphy and Divergence Stratigraphy need genome or proteome information for their analyses. This section introduces users to the retrieval of biological sequences stored as *.fasta format to be able to perform Phylostratigraphy and Divergence Stratigraphy.

The classical way to download biological sequences is via the Terminal application:


# download CDS file of A. thaliana
curl ftp://ftp.ensemblgenomes.org/pub/
plants/release-23/fasta/arabidopsis_thaliana/
cds/Arabidopsis_thaliana.TAIR10.23.cds.all.fa.gz 
-o Arabidopsis_thaliana.TAIR10.23.cds.all.fa.gz

# download CDS file of A. lyrata
curl ftp://ftp.ensemblgenomes.org/pub/plants/
release-23/fasta/arabidopsis_lyrata/cds/
Arabidopsis_lyrata.v.1.0.23.cds.all.fa.gz 
-o Arabidopsis_lyrata.v.1.0.23.cds.all.fa.gz

Alternatively, users can use the Biological Data Retrieval package biomartr to download proteomes from the refseq database (see Sequence Retrieval Vignette for details).

# install biomartr from CRAN
install.packages("biomartr",
                 repos        = "https://cran.rstudio.com/",
                 dependencies = TRUE)
# download the proteome of Arabidopsis thaliana from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
getProteome( db       = "refseq", 
             kingdom  = "plant",
             organism = "Arabidopsis thaliana",
             path     = file.path("_ncbi_downloads","proteomes") )

The getProteome() function creates a directory named _ncbi_downloads/proteomes into which the orresponding proteome named Arabidopsis_thaliana_protein.faa.gz is downloaded. The read_proteome() function enables users to work with the proteome as data.table object.

# path to proteome: '_ncbi_downloads/proteomes/Arabidopsis_thaliana_protein.faa.gz'
file_path <- file.path("_ncbi_downloads","proteomes","Arabidopsis_thaliana_protein.faa.gz")
# read proteome as data.table object
Ath_proteome <- read_proteome(file_path, format = "fasta")

In case users would like to store the proteome file at a different location, they can specify the path = file.path("put","your","path","here") argument.

Download Published Phylostratigraphic Maps and Divergence Maps

Performing the Phylostratigraphy and Divergence Stratigraphy algorithms is a time consuming and computationally complex task. However, the corresponding Phylostratigraphic Maps and Divergence Maps need to be computed only once for each organism of interest. Hence, already published Phylostratigraphic Maps and Divergence Maps can be used without any additional computational tasks to combine them with any corresponding expression dataset of interest to capture evolutionary signals with myTAI.

Here we provide a collection of scripts and documentation of already published Phylostratigraphic Maps and Divergence Maps and step by step instructions on how to retrieve them for the use with myTAI.

Only in cases where users whish to create custom Phylostratigraphic Maps and Divergence Maps based on specialized reference databases or more recent genome version, the Phylostratigraphy and Divergence Stratigraphy algorithms need to be performed.

Investigating Age or Divergence Category Specific Expression Level Distributions

Gene expression levels are a fundamental aspect of phylotranscriptomics studies. In detail, phylotranscriptomic measures aim to quantify the expression intensity of genes deriving from common age or divergence categories to detect stages of evolutionary constraints. Hence, the gene expression distribution of age or divergence categories as well as their differences within and between stages or categories allow us to investigate the age (PS) or divergence (DS) category specific contribution to the corresponding transcriptome.

For this purpose, the PlotCategoryExpr() aims to visualize the expression level distribution of each phylostratum during each time point or experiment as barplot, dot plot, or violin plot enabling users to quantify the age (PS) or divergence (DS) category specific contribution to the corresponding transcriptome.

This way of visualizing the gene expression distribution of each age (PS) or divergence (DS) category during all developmental stages or experiments allows users to detect specific age or divergence categories contributing significant levels of gene expression to the underlying biological process (transcriptome).

library(myTAI)

data(PhyloExpressionSetExample)

# category-centered visualization of PS specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
                 legendName    = "PS",
                 test.stat     = TRUE,
                 type          = "category-centered",
                 distr.type    = "boxplot",
                 log.expr      = TRUE)
#>                   Zygote Quadrant Globular Heart Torpedo Bent  Mature
#> category-centered "***"  "***"    "***"    "***" "***"   "***" "***"

                  Zygote Quadrant Globular Heart Torpedo Bent  Mature
category-centered "***"  "***"    "***"    "***" "***"   "***" "***"

The resulting boxplot illustrates the log expression levels of each phylostratum during each developmental stage. Additionally, a Kruskal-Wallis Rank Sum Test as well as a Benjamini & Hochberg p-value adjustment for multiple comparisons is performed (test.stat = TRUE) to statistically quantify the differences between expression levels of different age or divergence categories. This type of analysis allows users to detect stages or experiments that show high diviation between age or divergence category contributions to the overall transcriptome or no significant deviations of age or divergence categories, suggesting equal age or divergence category contributions to the overall transcriptome. The corresponding P-values are printed to the console using the following notation:

In this case all developmental stages show significant differences in phylostratum specific gene expression.

Please notice that users need to define the legendName argument as PS or DS to specify whether the input ExpressionSet is a PhyloExpressionSet (legendName = 'PS') or DivergenceExpressionSet (legendName = 'DS').

Alternatively, users can investigate the differences of gene expression between all stages or experiments for each age or divergence category by specifying type = 'stage-centered'.

library(myTAI)

data(PhyloExpressionSetExample)

# stage-centered visualization of PS specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
                 legendName    = "PS",
                 test.stat     = TRUE,
                 type          = "stage-centered",
                 distr.type    = "boxplot",
                 log.expr      = TRUE)
#>                PS1   PS2   PS3    PS4 PS5    PS6 PS7    PS8    PS9    PS10  PS11   PS12 
#> stage-centered "***" "***" "n.s." "*" "n.s." "*" "n.s." "n.s." "n.s." "***" "n.s." "***"

               PS1   PS2   PS3    PS4 PS5    PS6 PS7    PS8    PS9    PS10  PS11   PS12 
stage-centered "***" "***" "n.s." "*" "n.s." "*" "n.s." "n.s." "n.s." "***" "n.s." "***"

Here, the Kruskal-Wallis Rank Sum Test (with Benjamini & Hochberg p-value adjustment) quantifies whether or not the gene expression distribution of a single age or divergence category significantly changes throughout development or experiments. This type of analysis allows users to detect specific age or divergence categories that significantly change their expression levels throughout development or experiments.

In this case, users will observe that PS3,5,7-9,11 do not show significant differences of gene expression between developmental stages suggesting that their contribution to the overall transcriptome remains constant throughout development.

Finally, users can choose the following plot types to visualize expression distributions:

Argument: distr.type

Together, studies perfomed with PlotCategoryExpr() allow users to conclude that genes originating in specific PS or DS contribute significantly more to the overall transcriptome than other genes originating from different PS or DS categories. More specialized analyses such as PlotMeans(), PlotRE(), PlotBarRE(), TAI(), TDI(), etc. will then allow them to study the exact mean expression patterns of these age or divergence categories.

Users will notice that so far all examples shown above specified log.expr = TRUE illustrating boxplots based on log2 expression levels. This way of visualization allows better visual comparisons between age or divergence categories. However, when specifying log.expr = FALSE absolute expression levels will be visualized in the corresponding boxplot.

Alternatively, instead of specifying log.expr = TRUE users can directly pass log2 transformed expression levels to PlotCategoryExpr() via tf(PhyloExpressionSetExample,log2) (when log.expr = FALSE):

data(PhyloExpressionSetExample)

# category-centered visualization of PS specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = tf(PhyloExpressionSetExample, log2),
                 legendName    = "PS",
                 test.stat     = TRUE,
                 type          = "category-centered",
                 distr.type    = "boxplot",
                 log.expr      = FALSE)
#>                   Zygote Quadrant Globular Heart Torpedo Bent  Mature
#> category-centered "***"  "***"    "***"    "***" "***"   "***" "***"

                  Zygote Quadrant Globular Heart Torpedo Bent  Mature
category-centered "***"  "***"    "***"    "***" "***"   "***" "***" 

Or any other expression level transformation, e.g. sqrt.

data(PhyloExpressionSetExample)

# category-centered visualization of PS specific expression level distributions (sqrt-scale)
PlotCategoryExpr(ExpressionSet = tf(PhyloExpressionSetExample, sqrt),
                 legendName    = "PS",
                 test.stat     = TRUE,
                 type          = "category-centered",
                 distr.type    = "boxplot",
                 log.expr      = FALSE)
#>                   Zygote Quadrant Globular Heart Torpedo Bent  Mature
#> category-centered "***"  "***"    "***"    "***" "***"   "***" "***"