Manipulating tree objects is frustrating because the functions available for working with phylo objects are fragmented, and linking external data to the phylogeny structure is harder still. Using tidy data principles can make phylogenetic tree manipulation tasks easier and consistent with tools already in wide use, including dplyr, tidyr, ggplot2 and ggtree.
Convert tree object to tidy data frame and vice versa
phylo object
The phylo class defined in ape is fundamental for phylogenetic analysis in R. Most R packages in this field rely extensively on phylo objects. The tidytree package provides an as_data_frame method to convert a phylo object to a tidy data frame, a tbl_tree object.
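library(ape)
library(tidytree)  # as_data_frame(), as.treedata(), groupOTU(), groupClade()
library(dplyr)     # tibble(), full_join() and %>%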
set.seed(2017)
tree <- rtree(4)
tree
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t2" "t1" "t4" "t3"
##
## Rooted; includes branch lengths.
x <- as_data_frame(tree)
x
## # A tibble: 7 x 4
## parent node branch.length label
## <int> <int> <dbl> <chr>
## 1 7 1 0.472 t2
## 2 7 2 0.274 t1
## 3 6 3 0.674 t4
## 4 5 4 0.00202 t3
## 5 5 5 NA <NA>
## 6 5 6 0.0393 <NA>
## 7 6 7 0.435 <NA>
The tbl_tree object can be converted back to a phylo object.
as.phylo(x)
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t2" "t1" "t4" "t3"
##
## Rooted; includes branch lengths.
Using a tbl_tree object makes tree and data manipulation easier and more effective. For example, we can link an evolutionary trait to the phylogeny using the dplyr verb full_join:
d <- tibble(label = paste0('t', 1:4),
            trait = rnorm(4))
y <- full_join(x, d, by = 'label')
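To make the effect of the join concrete, here is a minimal sketch on plain tibbles. It copies the parent, node and label columns from the output above and uses made-up trait values (1 to 4, purely illustrative, not the rnorm draws from the example), so it only assumes that dplyr is installed.

```r
library(dplyr)

# Tree table: parent/node/label copied from the tbl_tree printed above.
tree_tbl <- tibble(
  parent = c(7L, 7L, 6L, 5L, 5L, 5L, 6L),
  node   = 1:7,
  label  = c("t2", "t1", "t4", "t3", NA, NA, NA)
)

# External data: hypothetical trait values, one per tip label.
trait_tbl <- tibble(label = paste0("t", 1:4), trait = c(1, 2, 3, 4))

# Rows are matched by label, not by position; internal nodes have no
# label in trait_tbl, so their trait is NA.
joined <- full_join(tree_tbl, trait_tbl, by = "label")
joined
```

Because the match is made on label, the order of rows in the external data does not matter, and every row of the tree table is preserved.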
treedata object
The tidytree package defines a treedata class to store a phylogenetic tree with associated data. After mapping external data to the tree structure, the tbl_tree object can be converted to a treedata object.
as.treedata(y)
## 'treedata' S4 object'.
##
## ...@ phylo:
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "t2" "t1" "t4" "t3"
##
## Rooted; includes branch lengths.
##
## with the following features available:
## 'trait'.
The treedata class is also used in the treeio package to store evolutionary evidence inferred by commonly used software (BEAST, EPA, HYPHY, MrBayes, PAML, PHYLODOG, pplacer, r8s, RAxML and RevBayes).
The tidytree package also provides as_data_frame to convert a treedata object to a tidy data frame. The phylogenetic tree structure and the evolutionary inferences are stored together in the tbl_tree object, which makes it easier to manipulate evolutionary statistics inferred by different software in a consistent way, and to link external data to the same tree structure.
y %>% as.treedata %>% as_data_frame
## # A tibble: 7 x 5
## parent node branch.length label trait
## <int> <int> <dbl> <chr> <dbl>
## 1 7 1 0.472 t2 -0.00152
## 2 7 2 0.274 t1 -1.96
## 3 6 3 0.674 t4 1.56
## 4 5 4 0.00202 t3 -0.265
## 5 5 5 NA <NA> NA
## 6 5 6 0.0393 <NA> NA
## 7 6 7 0.435 <NA> NA
Grouping taxa
tidytree implements groupOTU and groupClade for adding taxa grouping information to the input tbl_tree object. This grouping information can be used directly in tree visualization (e.g. coloring a tree based on grouping) with ggtree.
groupClade
The groupClade method accepts an internal node, or a vector of internal nodes, and adds grouping information for the corresponding clade or clades.
nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
tree <- read.tree(text=nwk)
groupClade(as_data_frame(tree), c(17, 21))
## # A tibble: 25 x 5
## parent node branch.length label group
## <int> <int> <dbl> <chr> <fct>
## 1 20 1 4. A 1
## 2 20 2 4. B 1
## 3 19 3 5. C 1
## 4 18 4 6. D 1
## 5 17 5 21. E 1
## 6 22 6 4. F 2
## 7 22 7 12. G 2
## 8 21 8 8. H 2
## 9 24 9 5. I 0
## 10 24 10 2. J 0
## # ... with 15 more rows
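What "clade of an internal node" means can be sketched in base R on the small 4-tip table shown earlier. The helper clade_nodes below is hypothetical (it is not part of tidytree) and only illustrates the idea of collecting an internal node together with all of its descendants; it is not a description of how groupClade is implemented.

```r
# Parent/node columns copied from the 4-tip tbl_tree shown earlier.
edges <- data.frame(parent = c(7, 7, 6, 5, 5, 5, 6), node = 1:7)

# Hypothetical helper: the given node plus all of its descendants.
clade_nodes <- function(edges, root) {
  found <- root
  frontier <- root
  while (length(frontier) > 0) {
    # Children of the current frontier; skip the root row (parent == node).
    children <- edges$node[edges$parent %in% frontier &
                           edges$node != edges$parent]
    frontier <- setdiff(children, found)
    found <- c(found, frontier)
  }
  sort(found)
}

clade_nodes(edges, 6)
## [1] 1 2 3 6 7
```

In that table, node 6 is the parent of nodes 3 and 7, and node 7 is the parent of tips 1 and 2, so the clade of node 6 contains nodes 1, 2, 3, 6 and 7.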
groupOTU
## the input nodes can be node ID or label
groupOTU(x, c('t1', 't4'), group_name = "fake_group")
## # A tibble: 7 x 5
## parent node branch.length label fake_group
## <int> <int> <dbl> <chr> <fct>
## 1 7 1 0.472 t2 0
## 2 7 2 0.274 t1 1
## 3 6 3 0.674 t4 1
## 4 5 4 0.00202 t3 0
## 5 5 5 NA <NA> 0
## 6 5 6 0.0393 <NA> 1
## 7 6 7 0.435 <NA> 1
groupOTU traces back from the input nodes to their most recent common ancestor. In this example, nodes 2, 3, 7 and 6 (2 (t1) -> 7 -> 6 and 3 (t4) -> 6) are grouped together.
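The trace-back described here can be reproduced by hand from the parent and node columns. The following base R sketch uses two hypothetical helpers, path_to_root and trace_to_mrca, written only for illustration; it should not be read as the tidytree implementation.

```r
# Parent/node columns copied from the tbl_tree x shown above.
edges <- data.frame(parent = c(7, 7, 6, 5, 5, 5, 6), node = 1:7)

# Hypothetical helper: path from a node up to the root
# (the root is the row where parent == node).
path_to_root <- function(edges, node) {
  path <- node
  repeat {
    parent <- edges$parent[edges$node == node]
    if (parent == node) break
    path <- c(path, parent)
    node <- parent
  }
  path
}

# Hypothetical helper: all nodes visited when walking from each input
# node up to the most recent common ancestor (MRCA).
trace_to_mrca <- function(edges, nodes) {
  paths <- lapply(nodes, path_to_root, edges = edges)
  common <- Reduce(intersect, paths)  # shared ancestors, deepest first
  mrca <- common[1]
  kept <- lapply(paths, function(p) p[seq_len(match(mrca, p))])
  sort(unique(unlist(kept)))
}

trace_to_mrca(edges, c(2, 3))
## [1] 2 3 6 7
```

The result matches the rows marked 1 in the fake_group column: tips 2 (t1) and 3 (t4), plus internal nodes 7 and 6 on the way to their common ancestor.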
Related OTUs are grouped together, and they are not necessarily within a single clade. They can be monophyletic (a clade), polyphyletic or paraphyletic.
cls <- list(c1=c("A", "B", "C", "D", "E"),
            c2=c("F", "G", "H"),
            c3=c("L", "K", "I", "J"),
            c4="M")
as_data_frame(tree) %>% groupOTU(cls)
## # A tibble: 25 x 5
## parent node branch.length label group
## <int> <int> <dbl> <chr> <fct>
## 1 20 1 4. A c1
## 2 20 2 4. B c1
## 3 19 3 5. C c1
## 4 18 4 6. D c1
## 5 17 5 21. E c1
## 6 22 6 4. F c2
## 7 22 7 12. G c2
## 8 21 8 8. H c2
## 9 24 9 5. I c3
## 10 24 10 2. J c3
## # ... with 15 more rows
If there are conflicts when tracing back to the MRCA, the user can set the overlap parameter to “origin” (the first one counts), “overwrite” (default, the last one counts) or “abandon” (un-selected for grouping); see also the discussion here.