# Introduction to the myTAI Package

## Scientific Introduction: Performing Evolutionary Transcriptomics with R

In the past years, a large body of scientific studies aimed at investigating the molecular basis of variation and conservation within biological processes. These transcriptomics studies allowed us to get a glimpse into the molecular patterns and processes that underly complex biological processes such as development or cell differentiation.

Although powerful for investigating the molecular mechanisms that determine the biological process of interest, these datasets rarely capture the evolutionary history of how these expression patterns emerged or to what extent they are possibly constrained. By combining transcriptomics studies with a comparative approach, however, we can capture some evolutionary signatures that allow us to understand the variability or constraint of particular sets of genes. This evolutionary transcriptomics approach relies on the comparison of transcriptomes between species using orthologous or homologous genes as a phylogenetic variable. This method seeks to determine stages of biological processes in which gene expression patterns are more variable or more constrained than others.

Finding such transcriptome conservation or variability patterns within a biological process of interest, allows us to reconstruct parts of the evolutionary history of that particular process and might guide us in developing methods to modify such processes.

The myTAI package provides an analytics tool to perform evolutionary transcriptomics studies and is designed to detect evolutionary signatures in transcriptome data. It furthermore, seeks to provide a consistent way to design a computationally reproducible analytics process that achieves a high degree of transparency when conducting evolutionary transcriptomics studies.

Alternatively, instead of investigating the transcriptome conservation of a biological process of interest, comparative transcriptomics ( Pantalacci and Semon, 2014; Roux et al., 2015 ) has been developed as a method to study the conservation and variation of gene expression profiles of orthologous genes between related species for a biological process of intest.

For some biological studies, it can be useful to choose both approaches, comparative transcriptomics and transcriptome conservation quantification, to independently confirm evolutionary constraints that might be present in a biological process of interest.

The advantage of comparative transcriptomics over transcriptome conservation quantification is that this method uses multiple species to confirm gene expression profile conservation within a biological process of interest. However, one should be aware that most comparative transcriptomics studies limit their quantification and inference on a subset of the transcriptome. Since comparative transcriptomics studies rely on transcripts of orthologous genes, the orthology inference method and the evolutionary distance between compared species can limit the set of investigated transcripts up to 1/3 of the actual transcriptome.

In contrast, the advantage of transcriptome conservation quantification over comparative transcriptomics is that although transcriptome conservation quantification captures less evolutionary signal it can be performed using only one reference species. Thus, hundreds of thousands of transcriptome datasets that are publically accessible can be revisited and investigated in the light of transcriptome conservation without having to conduct new experiments to fulfill the experimental design standards of comparative transcriptomics studies.

Furthermore, although also possible in comparative transcriptomics studies, transcriptome conservation quantification allows to easily include miRNA or lncRNA expression levels to quantify transcriptome conservation. Hence, transcriptome conservation quantification experiments capture almost the entire transcriptome when screening for constrained stages or treatments in biological processes.

In conclusion, transcriptome conservation quantification experiments can serve as a first approach to screen for the potential existence of (evolutionary/developmental/etc.) constraints within a biological process of interest. If constraints were found in particular stages or treatments which might limit the biological process of interest, comparative transcriptomics experiments can be designed to confirm the existence of those constraints in multiple species.

In this regard, the myTAI package serves as analytics tool to revisit existing transcriptome datasets to investigate them in the light of transcriptome conservation.

## Installation

### Package dependencies

If users are interested in performing differential gene expression analyses with myTAI, they may install the edgeR package.

### Install the edgeR package

source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

Users can download myTAI from CRAN :

# install myTAI from CRAN
install.packages("myTAI", dependencies = TRUE)

## Motivation

Using embryo development of the plant Arabidopsis thaliana as an example, we ask the user to imagine how one would investigate the differences of developmental transcriptomes across developmental stages.

Figure 1: Gene expression distributions (= developmental transcriptome) throughout seven stages of A. thaliana embryo development. Embryo development is divided into three phases: early embryogenesis (purple), mid-embryogenesis (green), and late embryogenesis (brown). This boxplot illustrates that the overall distributions of log2 expression levels (y-axis) hardly differ between developmental stages (x-axis) although the difference on the global scale is statistically significant (Kruskal-Wallis Rank Sum Test: p < 2e-16). Hence, a clear visual pattern of gene expression differences between early, mid, and late embryogenesis on the global scale can not be inferred. Adapted from Drost, 2016.

The objective of performing evolutionary transcriptomics studies is to classify a transcriptome into different categories of genes sharing similar evolutionary origins (detectable homologs) or genes that share similar phylogenetic relationships (orthologous genes) and to study the overall expression patterns of these classified genes throughout the biological process of interest. Thus, by introducing a phylogenetic or taxonomic variable to a transcriptome dataset, we can determine stages or time points that are under stronger constraints than others, indicating switches between biological programs or functions.

Figure 2: Gene expression distributions (= developmental transcriptome) throughout seven stages of A. thaliana embryo development classified into distinctive age categories. Each box represents the developmental stage during A. thaliana embryogenesis, the y-axis denotes the log2 expression levels of genes that fall into the corresponding age category shown on the x-axis. Hence, each boxplot represents the gene expression distribution of genes that are classified into the corresponding age class during a specific developmental stage. The gene age distribution of A. thaliana genes range from PS1 to PS12 where PS1 represents the evolutionarily most distant age category (cellular org.) and PS12 the evolutionary most recent age category (A. thaliana specific). Yellow dots in the boxplots denote the mean expression level of the corresponding expression distribution. This visualization illustrates that although the global gene expression distributions do not change visually between developmental stages (Fig. 1), the global gene expression distributions of age categories differ between stages of A. thaliana embryo development, and thus, allow studying the effect of transcriptome evolution and conservation on embryo development. Adapted from Drost, 2016.

Conceptually, the idea behind evolutionary transcriptomics studies is to combine the phylogenetic relationship between species (usually retrieved from comparative genomics studies in terms of sequence homology) with transcriptome data of a reference species quantifying a particular biological process of interest (e.g. mutant gene expression versus WT gene expression, stress responses, cell differentiation, development, etc.). Usually, transcriptome data comes from Next Generation Sequencing technologies such as RNA-Seq or from Microarray experiments.

Phylogenetic Information + Transcriptome Data

Or in other words:

Comparative Genomics + Transcriptomics = Evolutionary Transcriptomics

In theory, any published or newly generated transcriptome dataset can be used to capture evolutionary signatures with myTAI.

myTAI is designed to receive phylogenetic information obtained from comparative genomics data and transcriptome data as input and internally combines these datasets to perform evolutionary transcriptomics analyses.

Figure 3: Workflow describing the input and output of the myTAI package. The myTAI package takes phylogenetic information such as phylogenetic trees (see Dunn, 2013 ), genomic phylostratography based gene age inference (see Domazet-Loso et al., 2007; Capra et al., 2013; Liebeskind et al., 2016 ), by dNdS estimation of orthologous genes (see Quint, Drost et al., 2012 and Drost et al., 2015), or phylogenetic reconciliation (see Doyon et al, 2011 ) and a RNA-Seq or Microarray based transcriptome dataset as input. Internally, myTAI then combines the phylogenetic data and the transcriptome data an provides numerous functions to perform evolutionary trancriptomics analyses. Here, we examplify the output of the functions PlotSignature(), PlotRE() and PlotCategoryExpr().

## Retrieval of phylogenetic or taxonomic information

For the comparative genomics part there are different methods and tools to quantify sequence homolgy between genes, miRNAs, lncRNAs etc of a reference species and related species. For example, for phylogenetic or taxonomic information retrieval such as phylogenetic trees, genomic phylostratography based gene age inference, dNdS estimation of orthologous genes or phylogenetic reconciliation can be used. Below users can find the most recent tools and resources for retrieving or computing phylogenetic or taxonomic relationships for an organism of interest.

### Genomic phylostratography based gene age inference

As intensely discussed in the past years (Capra et al., 2013; Altenhoff et al., 2016; Liebeskind et al., 2016), gene age inference is not a trivial task and might be biased in some currently existing approaches (Liebeskind et al., 2016; Yin et al., 2018; Casola 2018).

In particular, Moyers & Zhang argue that genomic phylostratigraphy (a prominent BLAST based gene age inference method)

1. underestimates gene age for a considerable fraction of genes,

2. is biased for rapidly evolving proteins which are short, and/or their most conserved block of sites is small, and

3. these biases create spurious nonuniform distributions of various gene properties among age groups, many of which cannot be predicted a priori (Moyers & Zhang, 2015; Moyers & Zhang, 2016; Liebeskind et al., 2016).

However, these arguments were based on simulated data and were inconclusive due to errors in their analyses. Furthermore, Domazet-Loso et al., 2017 provide convincing evidence that there is no phylostratigraphic bias. As a response, Moyers & Zhang, 2017 recently published a counter-study stating that a phylostratigraphic trend claimed by Domazet-Loso et al., 2017 to be robust to error disappears when genes likely to be error-resistant are analyzed. Moyers & Zhang, 2017 further suggest a more robust methodology for controlling for the effects of error by first restricting to those genes which can be simulated and then removing those genes which, through simulation, have been shown to be error-prone (see also Moyers & Zhang, 2018).

In general, an objective benchmarking set representing the tree of life is still missing and therefore any procedure aiming to quantify gene ages will be biased to some degree. Based on this debate a recent study suggested to perform gene age inference by combining several common orthology inference algorithms to create gene age datasets and then characterize the error around each age-call on a per-gene and per-algorithm basis. Using this approach systematic error was found to be a large factor in estimating gene age, suggesting that simple consensus algorithms are not enough to give a reliable point estimate (Liebeskind et al., 2016). This was also observed by Moyers & Zhang, 2018) when running alternative tools such as PSIBLAST, HMMER, OMA, etc. However, by generating a consensus gene age and quantifying the possible error in each workflow step, Liebeskind et al., 2016 provide a very useful database of consensus gene ages for a variety of genomes.

Alternatively, Stephen Smith, 2016 argues that de novo gene birth/death and gene family expansion/contraction studies should avoid drawing direct inferences of evolutionary relatedness from measures of sequence similarity alone, and should instead, where possible, use more rigorous phylogeny-based methods. For this purpose, I recommend researchers to consult the phylomedb database to retrieve phylogeny-based gene orthology relationships and use these age estimates in combination with myTAI. Alternatively, users might find the simulation based removal approach propsed by Moyers & Zhang, 2018 more suitable.

Evidently, these advancements in gene age research are very recent and gene age inference is a very young and active field of genomic research. Therefore, many more studies need to address the robust and realistic inference of gene age and a community standard is still missing.

Despite the ongoing debate about how to correctly infer gene age, users of myTAI can perform any gene age inference method they find most appropriate for their biological question and pass this gene age inference table as input to myTAI. To do so, users need to follow the following data format specifications to use their gene age inference table with myTAI. However, even when users rely on established procedures such as phylostratigraphy the gene age inference bias will be present as ‘systematic error’ in all developmental stages for which TAI or TDI computations are performed. Thus, stages of constraint will be detectable in any case. Since TAI or TDI computations are indended to enable screening for conserved or constrained stages in developmental or biological processes for further downstream experimental studies, even simple approaches such as phylostratigraphy can give first evidence for the existence of transcriptomic contraints within a biological process. If researchers then wish to extract the exact candidate genes that might potentially cause such transcriptome constraints then I would advise to rely on more superior approaches of gene age inference as discussed above.

In my opinion, what is completely missing in this entire debate is the bioinformatics/technical aspect of using BLAST or any other BLAST-like tool for gene age inference. Recently, it was intesely discussed how BLAST hits are biased by the use of the default argument max_target_seqs (Shah et al., 2018). The main issue of how this max_target_seqs is set is that:

According to the BLAST documentation itself (2008), this parameter represents the ‘number of aligned sequences to keep’. This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang et al., 2016) the authors explicitly state ‘Setting ’max target seqs’ as ‘1’ only the best match result was considered’. To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter ‘-max_target_seqs 1’ simply returns the first good hit found in the database, not the best hit as one would assume. Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence. Even ordering the database in a different way would cause BLAST to return a different ‘top hit’ when setting the max_target_seqs parameter to 1. - Shah et al., 2018

The solution to this issue seems to be that any BLAST search must be performed with a significantly high -max_target_seqs, e.g. -max_target_seqs 10000 (see https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5 for details) and best hits must be filtered subsequently. It is not clear from any of the studies referenced above how the best BLAST hit was retrieved and which -max_target_seqs values were used to perform BLAST searches in the respective study. Thus, the comparability of the results between studies is impossible and any individual claim made in these studies might be biased.

In addition, the -max_target_seqs argument issue seems not to be the only issue that might influence technical differences in BLAST hit results. Gonzalez-Pech et al., 2018 discuss another problem of retrieving the best BLAST hits based on E-value thresholds.

Many users assume that BLAST alignment hits with E-values less than or equal to the predefined threshold (e.g. 105 via the specification of evalue 1e-5) are identified after the search is completed, in a final step to rank all alignments by E-value, from the smallest (on the top of the list of results) to the largest E-value (at the bottom of the list). However, the E-value filtering step does not occur at the final stage of BLAST; it occurs earlier during the scanning phase (Altschul et al., 1997; Camacho et al., 2009). During this phase, a gapped alignment is generated using less-sensitive heuristic parameters (Camacho et al., 2009); alignments with an E-value that satisfies the defined cut-off are included in the subsequent phase of the BLAST algorithm (and eventually reported). During the final (trace-back) phase, these gapped alignments are further adjusted using moresensitive heuristic parameters (Camacho et al., 2009), and the E-value for each of these refined alignments is then recalculated. - Gonzalez-Pech et al., 2018

This means that if one study mentioned above ran a BLAST search with a BLAST parameter configuration of lets say -max_target_seqs 250 (default value in BLAST) and evalue 10 (default value in BLAST) and then subsequently selected the best hit which returned the smallest E-value and another study used the parameter configuration -max_target_seqs 1 and evalue 0.0001 then the results of both studies would not be comparable and the proposed gene age inference bias might simply result from a technical difference in running BLAST searches.

In more detail, even if one study for example ran BLAST with evalue 10 (default value in BLAST) and then subsequently filtered for hits that resulted in evalue < 0.0001 whereas another study ran BLAST directly with evalue 0.0001, according to Gonzalez-Pech et al., 2018 these studies although referring to the same E-value threshold for filtering hits will result in different sets of filtered BLAST hits.

Hence, all of the above mentioned approaches are far from being perfect and much more research is needed to systematically compare different approaches for gene age inference.

The rational behind gene age inference is to assign each protein coding gene of an organism of interest with an evolutionary age estimate which aims to quantify its potential origin within the tree of life (detectable sequence homolog; orphan gene (see Tautz & Domazet-Loso, 2011)). Hence, gene age inference generates a table storing the gene age in the first column and the corresponding gene id of the organism of iterest in the second column. This table is named phylostratigraphic map.

__Generate or retrieve phylostratigraphic maps:__

• R package: phylostratr which implements Phylostratigraphy as an R package - Please consult the Vignettes for examples.
• createPSmap.pl: generate a phylostratigraphic map (implemented by Alexander Gabel)
• phylostratigraphy.pl: generate a phylostratigraphic (implemented by Cheng et al. 2015)
• phylo_pipeline.py: generate a phylostratigraphic (implemented by Shuqing Xu)
• ORFanFinder: generate a phylostratigraphic map
• Protein Historian: generate a gene age map
• Liebeskind et al., 2016: use a gene age consensus approach to estimating gene ages for model organisms
• orthoscape: a cytoscape application for grouping and visualization KEGG based gene networks by taxonomy and homology principles (implemented by Mustafin et al., 2017)
• RecBlast: Cloud-Based Large Scale Orthology Detection (by Efrat Rapoport and Moran Neuhof)

### dNdS estimation of orthologous genes

We recently proposed to use the classical dNdS measure to quantify the sequence conservation of protein coding genes between closely related species. This way, we combine the information about the selective pressure acting on a particular gene with its expression level during a particular time point or condition. We refer to this approach as Divergence Stratigraphy (Drost et al., 2015 Mol. Biol. Evol.). Analogous to gene age inference methods, divergence stratigraphy generates a table storing the sequence conservation estimate in the first column and the corresponding gene id of the organism of iterest in the second column. This table is named divergence stratigraphic map.

__Generate or retrieve divergence stratigraphic maps:__

• orthologr: generate a divergence stratigraphic map (implemented by Hajk-Georg Drost)
• compute_dNdS.pl: generate a divergence stratigraphic map (implemented by Cheng et al. 2015)
• MetaPhOrs: retrieve phylogeny-based orthology and paralogy predictions
• orthoscape: a cytoscape application for grouping and visualization KEGG based gene networks by taxonomy and homology principles (implemented by Mustafin et al., 2017)
• RecBlast: Cloud-Based Large Scale Orthology Detection (by Efrat Rapoport and Moran Neuhof)

### Generate custom table

In general, users can construct their own gene age assignment methods and are not limited to the methods listed above. After formatting the corresponding results to the age map specification (age assignment in the first column and gene id in the second column), users can use any function in myTAI with their custom gene age assignment table.

## Getting Started with myTAI

myTAI takes an age map and an expression dataset as input and combines both tables to the quantify transcriptome conservation for the biological process of interest.

### Defining input data standards

The following code illustrates an example structure of an age map. Here we choose genomic phylostratigraphy and dNdS estimation as method to generate a phylostratigraphic map and divergence stratigraphic map:

# load myTAI
library(myTAI)

# load example data sets (stored in myTAI)
data(PhyloExpressionSetExample)
data(DivergenceExpressionSetExample)

# show an example phylostratigraphic map of Arabidopsis thaliana
head(PhyloExpressionSetExample[ , c("Phylostratum","GeneID")])
  Phylostratum      GeneID
1            1 at1g01040.2
2            1 at1g01050.1
3            1 at1g01070.1
4            1 at1g01080.2
5            1 at1g01090.1
6            1 at1g01120.1

In detail, a phylostratigraphic map stores the gene age assignment generated with e.g. phylostratigraphy in the first columns and the corresponding gene id in the second column.

Analogously, a divergence stratigraphic map stores the gene age assignment generated with e.g. divergence stratigraphy in the first column and the corresponding gene id in the second column:

# show an example structure of a Divergence Map
head(DivergenceExpressionSetExample[ , c("Divergence.stratum","GeneID")])
  Divergence.stratum      GeneID
1                  1 at1g01050.1
2                  1 at1g01120.1
3                  1 at1g01140.3
4                  1 at1g01170.1
5                  1 at1g01230.1
6                  1 at1g01540.2

Hence, myTAI relies on pre-computed age maps fulfilling the aforementioned standard for all downstream analyses. It does not matter whether or not age maps contain categorized age values like in phylostratigraphic maps or e.g. phylogenetic distance values generated by phylogenetic inference.

### Expression dataset specification

The aim of any evolutionary transcriptomics study is to quantify transcriptome conservation in biological processes. For this purpose, users need to provide the transcriptome dataset of their studied biological process.

In the following examples we will use a gene expression dataset covering seven stages of Arabidopsis thaliana embryo development. This data format is defined as ExpressionMatrix in the myTAI data format specification.

# gene expression set

GeneID     Zygote   Quadrant   Globular      Heart    Torpedo       Bent    Mature
1 at1g01040.2  2173.6352  1911.2001  1152.5553  1291.4224  1000.2529   962.9772 1696.4274
2 at1g01050.1  1501.0141  1817.3086  1665.3089  1564.7612  1496.3207  1114.6435 1071.6555
3 at1g01070.1  1212.7927  1233.0023   939.2000   929.6195   864.2180   877.2060  894.8189
4 at1g01080.2  1016.9203   936.3837  1181.3381  1329.4734  1392.6429  1287.9746  861.2605
5 at1g01090.1 11424.5667 16778.1685 34366.6493 39775.6405 56231.5689 66980.3673 7772.5617
6 at1g01120.1   844.0414   787.5929   859.6267   931.6180   942.8453   870.2625  792.7542

The function MatchMap() allows users to join a phylostratigraphic map with an ExpressionMatrix to obtain a joined table referred to as PhyloExpressionSet. In some cases, the GeneIDs stored in the ExpressionMatrix and in the phylostratigraphic map do not match. This is due to GeneID mappings between different databases and annotations. To map non matching GeneIDs between databases and annotations, please consult the Functional Annotation Vignette in the biomartr package. The biomartr package allows users to map GeneIDs between database annotations.

After matching a phylostratigraphic map with an ExpressionMatrix using the MatchMap() function, a standard PhyloExpressionSet is returned storing the phylostratum assignment of a given gene in the first column, the gene id of the corresponding gene in the second column, and the entire gene expression set (time series or treatments) starting with the third column. This format is crucial for all functions that are implemented in the myTAI package.

library(myTAI)

# load the example data set
data(PhyloExpressionSetExample)

# construct an example Phylostratigraphic Map
Example.PhylostratigraphicMap <- PhyloExpressionSetExample[ , 1:2]
# construct an example ExpressionMatrix
Example.ExpressionMatrix <- PhyloExpressionSetExample[ , 2:9]

# join a PhylostratigraphicMap with an ExpressionMatrix using MatchMap()
Example.PhyloExpressionSet <- MatchMap(Example.PhylostratigraphicMap, Example.ExpressionMatrix)

# look at a standard PhyloExpressionSet
head(Example.PhyloExpressionSet, 3)
Phylostratum      GeneID    Zygote  Quadrant  Globular    Heart  Torpedo     Bent   Mature
1            4 at1g01010.1  878.2158  828.2301  776.0703 753.9589 775.3377 756.2460 999.9118
2            2 at1g01020.1 1004.9710 1106.2621 1037.5141 939.0830 961.5249 871.4684 997.5953
3            3 at1g01030.1  819.4880  771.6396  810.8717 866.7780 773.7893 747.9941 785.6105

Analogous to a standard PhyloExpressionSet, a standard DivergenceExpressionSet is a data.frame storing the divergence stratum assignment of a given gene in the first column, the gene id of the corresponding gene in the second column, and the entire gene expression set (time series or treatments) starting with the third column.

The following DivergenceExpressionSet example illustrates the standard DivergenceExpressionSet data set format.

# head of an example standard DivergenceExpressionSet
head(DivergenceExpressionSetExample, 3) 
  Divergence.stratum      GeneID    Zygote  Quadrant  Globular    Heart   Torpedo      Bent    Mature
1                  1 at1g01050.1 1501.0141 1817.3086 1665.3089 1564.761 1496.3207 1114.6435 1071.6555
2                  1 at1g01120.1  844.0414  787.5929  859.6267  931.618  942.8453  870.2625  792.7542
3                  1 at1g01140.3 1041.4291  908.3929 1068.8832  967.749 1055.1901 1109.4662  825.4633


A DivergenceExpressionSet defines the joined table between a divergence stratigraphic map and a Expression Set. A DivergenceExpressionSet can be generated analogous to a PhyloExpressionSet by joining a divergence stratigraphic map with an ExpressionMatrix using the MatchMap() function. In some cases, the GeneIDs stored in the ExpressionMatrix and in the divergence stratigraphic map do not match. This is due to GeneID mappings between different databases and annotations. To map non matching GeneIDs between databases and annotations, please consult the Functional Annotation Vignette in the biomartr package.

Each function implemented in myTAI checks internally whether or not the PhyloExpressionSet or DivergenceExpressionSet standard is fulfilled.

# used by all myTAI functions to check the validity of the PhyloExpressionSet standard
is.ExpressionSet(PhyloExpressionSetExample) 
[1] TRUE


In case the PhyloExpressionSet standard is violated, the is.ExpressionSet() function will return FALSE and the corresponding function within the myTAI package will return an error message.

# used a non standard PhyloExpressionSet
head(PhyloExpressionSetExample[ , 2:5], 2)
       GeneID   Zygote Quadrant Globular
1 at1g01040.2 2173.635 1911.200 1152.555
2 at1g01050.1 1501.014 1817.309 1665.309

is.ExpressionSet(PhyloExpressionSetExample[ , 2:5]) 

Error in is.ExpressionSet(PhyloExpressionSetExample[, 2:5]) :
The present input object does not fulfill the ExpressionSet standard.


It might be that you work with a tibble object which will not be recognized by is.ExpressionSet. In that case, please convert your tibble object to a data.frame using the function as.data.frame().

# convert any tibble to a data.frame
PhyloExpressionSetExample <- as.data.frame(PhyloExpressionSetExample)
# now is.ExpressionSet() should return TRUE
is.ExpressionSet(PhyloExpressionSetExample)

The PhyloExpressionSet and DivergenceExpressionSet formats are crucial for all functions that are implemented in the myTAI package.

Keeping these standard data formats in mind will provide users with the most important requirements to get started with the myTAI package.

Note, that within the code of each function, the argument ExpressionSet always refers to either a PhyloExpressionSet or a DivergenceExpressionSet, whereas in specialized functions some arguments are specified as PhyloExpressionSet when they take an PhyloExpressionSet as input data set, or specified as DivergenceExpressionSet when they take an DivergenceExpressionSet as input data set.

## Performing a Standard Workflow for Evolutionary Transcriptomics Analyses

The main goal of any evolutionary transcriptomics study is to quantify transcriptome conservation at a particular stage or treatment. This is achieved by computing the average age of genes that contribute to the transcriptome at that stage or treatment. In other words, by multiplying the gene age value with the expression level of the corresponding gene and averaging over all genes, we obtain the mean age of the transcriptome. Hence, we can say that at a particular stage genes that are most expressed at this stage or treatment have (on average) the evolutionary age XY.

To obtain this mean age value, several measures were introduced:

#### Transcriptome Age Index

The first meansure named Transcriptome Age Index (TAI) was introduced by Domazet-Loso and Tautz, 2010 and represents a weighted arithmetic mean of the transcriptome age during a corresponding developmental stage s.

$$TAI_s = \sum_{i = 1}^n \frac{ps_i * e_{is}}{\sum_{i = 1}^n e_{is}}$$

where $$ps_i$$ denotes the phylostratum assignment of gene $$i$$ and $$e_{is}$$ denotes the gene expression level of gene $$i$$ at developmental time point $$s$$. A lower value of TAI describes an older transcriptome age, whereas a higher value of TAI denotes a younger transcriptome age.

The following figure shows the TAI computations for the seven stages of A. thaliana embryo development.

data(PhyloExpressionSetExample)
# Plot the Transcriptome Age Index of a given PhyloExpressionSet
# Test Statistic : Flat Line Test (default)
PlotSignature( ExpressionSet = PhyloExpressionSetExample,
measure       = "TAI",
TestStatistic = "FlatLineTest",
xlab          = "Ontogeny",
ylab          = "TAI" )
#> Plot signature: ' TAI ' and test statistic: ' FlatLineTest ' running  1000  permutations.
#> Significance status of signature:  significant.

The x-axis shows the seven stages of A. thaliana embryo development and the y-axis shows the corresponding mean transcriptome age (TAI) value. The lower the TAI value the older the mean transcriptome age and the higher the TAI value the younger the mean transcriptome age.

The interpretation of the TAI values on the y-axis is given by the next figure.

In this example, a TAI value of 3.5 quantifies that genes that contribute most the transcriptome at a particular stage emerged on average between phylostratum 3 and phylostratum 4. Due to the nature of the arithmetic mean, this value does not represent the true origin of individual genes, and thus the TAI measure is only helpful to screen for stages that express (on average) older or younger genes. Subsequent analyses such as mean expression of age categories, relative expression levels, and gene expression level distributions for each age category will then reveal which exact genes or age categories generate the overall TAI value.

To obtain a more detailed overview of which age categories contribute how much to each developmental stage, the gene expression level distributions for each age caterory and each developmental stage can be visualized (using the PlotCategoryExpr() function).

data(PhyloExpressionSetExample)
# category-centered visualization of PS
# specific expression level distributions (log-scale)
PlotCategoryExpr(ExpressionSet = PhyloExpressionSetExample,
legendName    = "PS",
test.stat     = TRUE,
type          = "category-centered",
distr.type    = "boxplot",
log.expr      = TRUE)
#>                   Zygote Quadrant Globular Heart Torpedo Bent  Mature
#> category-centered "***"  "***"    "***"    "***" "***"   "***" "***"

This figure shows that in all developmental stages, genes coming from PS1-3 are (on average) more expressed than genes coming from PS4-12. Interestingly, the gene expression level distributions of PS4-12 become more equally distributed towards the Torpedo stage which has been marked as the most conserved stage by TAI analysis. This general trend can be visualized using the PlotMeans() function.

data(PhyloExpressionSetExample)
# plot evolutionary old PS (PS1-3) vs
# evolutionary young PS (PS4-12)
PlotMeans(PhyloExpressionSetExample,
Groups = list(c(1:3), c(4:12)),
legendName = "PS",
adjust.range = TRUE)

Here, users will observe that indeed PS1-3 genes are (on average) higher expressed than PS4-12 genes.

Using a linear transformation of the mean expression levels into the interval $$[0,1]$$ (Quint et al., 2012 and Drost et al., 2015) we can compare mean expression patterns between Phylostrata independent from their actual mean expression magnitude. A relative expression level of 0 denotes the minimum mean expression level compared to all other stages and a relative expression level of 1 denotes the maximum mean expression level compared to all other stages.

The following figure illustrates the average gene expression profile for each phylostratum.

data(PhyloExpressionSetExample)
# plot evolutionary old PS (PS1-3) vs
# evolutionary young PS (PS4-12)
PlotRE(PhyloExpressionSetExample,
Groups = list(c(1:3), c(4:12)),
legendName = "PS",
adjust.range = TRUE)