Manipulating tree objects is frustrating because the functions available for working with phylo objects are fragmented, and linking external data to the phylogeny structure is harder still. Using tidy data principles can make phylogenetic tree manipulation easier and consistent with tools already in wide use, including dplyr, tidyr, ggplot2 and ggtree.

Convert tree object to tidy data frame and vice versa

phylo object

The phylo class defined in ape is fundamental for phylogenetic analysis in R. Most R packages in this field rely extensively on phylo objects. The tidytree package provides an as_data_frame method to convert a phylo object to a tidy data frame, a tbl_tree object.

library(ape)
library(tidytree)
set.seed(2017)
tree <- rtree(4)
tree
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
## [1] "t2" "t1" "t4" "t3"
## 
## Rooted; includes branch lengths.
x <- as_data_frame(tree)
x
## # A tibble: 7 x 4
##   parent  node branch.length label
##    <int> <int>         <dbl> <chr>
## 1      7     1   0.472166386    t2
## 2      7     2   0.273833123    t1
## 3      6     3   0.674331481    t4
## 4      5     4   0.002020766    t3
## 5      5     5            NA  <NA>
## 6      5     6   0.039322336  <NA>
## 7      6     7   0.434905600  <NA>
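
To see where these columns come from, the table can be rebuilt by hand from the fields of a phylo object. The following is a minimal sketch using only ape, not tidytree's own implementation. It uses a fixed Newick string (made up for illustration) so that the result does not depend on the random number generator, and it assumes ape's usual numbering, in which tips are nodes 1 to Ntip and the root is node Ntip + 1.

```r
library(ape)

# Hypothetical four-tip tree, written as Newick so the example is reproducible.
tree <- read.tree(text = "(((A:1,B:2):3,C:4):5,D:6);")

n_tip <- Ntip(tree)
root  <- n_tip + 1L   # assumed ape convention: root is numbered Ntip + 1

# One row per edge: parent node, child node, length of the branch above the child.
tbl <- data.frame(parent        = tree$edge[, 1],
                  node          = tree$edge[, 2],
                  branch.length = tree$edge.length)

# The root has no incoming edge; add it as its own parent with no branch length.
tbl <- rbind(tbl, data.frame(parent = root, node = root, branch.length = NA))
tbl <- tbl[order(tbl$node), ]

# Tip labels index nodes 1..n_tip; internal nodes fall outside and become NA.
tbl$label <- tree$tip.label[tbl$node]
rownames(tbl) <- NULL
tbl
```

Each tip and internal node gets exactly one row, which is what makes the table convenient to join against external data keyed by node or label.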

The tbl_tree object can be converted back to a phylo object.

as.phylo(x)
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
## [1] "t2" "t1" "t4" "t3"
## 
## Rooted; includes branch lengths.
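
A quick way to convince ourselves that the round trip loses nothing is to compare a few summaries before and after conversion. This is a sketch under two assumptions: that the installed tidytree provides as_tibble() for phylo objects (recent releases use this name in place of as_data_frame), and that a made-up Newick tree is an acceptable stand-in for the random tree above.

```r
library(ape)
library(tidytree)

# Hypothetical tree with known branch lengths (they sum to 21).
tree <- read.tree(text = "(((A:1,B:2):3,C:4):5,D:6);")

x <- as_tibble(tree)    # phylo -> tbl_tree (assumed name in recent tidytree)
tree2 <- as.phylo(x)    # tbl_tree -> phylo

# Compare simple summaries rather than relying on identical internal ordering.
c(tips = Ntip(tree2), internal = Nnode(tree2), length = sum(tree2$edge.length))
```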

Using a tbl_tree object makes tree and data manipulation easier and more effective. For example, we can link an evolutionary trait to the phylogeny using the dplyr verb full_join:

library(dplyr)  # provides tibble(), full_join() and %>%
d <- tibble(label = paste0('t', 1:4),
            trait = rnorm(4))

y <- full_join(x, d, by = 'label')
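
One thing to keep in mind when choosing the join verb is what happens to labels that appear in the external data but not in the tree. The sketch below uses plain dplyr on a hand-typed copy of the tree table printed earlier (branch lengths omitted); the trait values and the extra label t5 are made up for illustration.

```r
library(dplyr)

# Tree table copied from the printed tbl_tree (parent, node, label only).
x <- tibble(parent = c(7L, 7L, 6L, 5L, 5L, 5L, 6L),
            node   = 1:7,
            label  = c("t2", "t1", "t4", "t3", NA, NA, NA))

# Hypothetical trait data, including a label (t5) that is not in the tree.
d <- tibble(label = paste0("t", 1:5),
            trait = c(0.1, 0.2, 0.3, 0.4, 0.5))

y_full <- full_join(x, d, by = "label")  # keeps t5 as a row with no node
y_left <- left_join(x, d, by = "label")  # keeps only rows of the tree table

nrow(y_full)
nrow(y_left)
```

With full_join, the unmatched label becomes an extra row whose parent and node are NA, so it no longer describes a valid tree; left_join keeps exactly one row per node. When the external data contain only labels from the tree, as in the example above, the two verbs give the same rows.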

treedata object

The tidytree package defines a treedata class to store a phylogenetic tree with associated data. After mapping external data to the tree structure, the tbl_tree object can be converted to a treedata object.

as.treedata(y)
## 'treedata' S4 object'.
## 
## ...@ tree: 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
## [1] "t2" "t1" "t4" "t3"
## 
## Rooted; includes branch lengths.
## 
## with the following features available:
##  'trait'.

The treedata class is also used in the treeio package to store evolutionary evidence inferred by commonly used software (BEAST, EPA, HYPHY, MrBayes, PAML, PHYLODOG, pplacer, r8s, RAxML and RevBayes).

The tidytree package also provides as_data_frame to convert a treedata object to a tidy data frame. The phylogenetic tree structure and the evolutionary inferences are stored in the tbl_tree object, which makes it easier to manipulate evolutionary statistics inferred by different software consistently, as well as to link external data to the same tree structure.

y %>% as.treedata %>% as_data_frame
## # A tibble: 7 x 5
##   parent  node branch.length label        trait
##    <int> <int>         <dbl> <chr>        <dbl>
## 1      7     1   0.472166386    t2 -0.001524259
## 2      7     2   0.273833123    t1 -1.958366456
## 3      6     3   0.674331481    t4  1.563222619
## 4      5     4   0.002020766    t3 -0.265336001
## 5      5     5            NA  <NA>           NA
## 6      5     6   0.039322336  <NA>           NA
## 7      6     7   0.434905600  <NA>           NA

Grouping taxa

tidytree implements groupOTU and groupClade for adding taxa grouping information to the input tbl_tree object. This grouping information can be used directly in tree visualization (e.g. coloring a tree by group) with ggtree.

groupClade

The groupClade method accepts an internal node or a vector of internal nodes and adds grouping information for the corresponding clade or clades.

nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
tree <- read.tree(text=nwk)

groupClade(as_data_frame(tree), c(17, 21))
## # A tibble: 25 x 5
##    parent  node branch.length label  group
##     <int> <int>         <dbl> <chr> <fctr>
##  1     20     1             4     A      1
##  2     20     2             4     B      1
##  3     19     3             5     C      1
##  4     18     4             6     D      1
##  5     17     5            21     E      1
##  6     22     6             4     F      2
##  7     22     7            12     G      2
##  8     21     8             8     H      2
##  9     24     9             5     I      0
## 10     24    10             2     J      0
## # ... with 15 more rows
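
To make the grouping above concrete, the same assignment can be sketched with ape alone by walking down the edge matrix from each focal node. The helper clade_nodes is hypothetical and is not how tidytree implements groupClade; whether the focal node itself belongs to its group is an assumption made here.

```r
library(ape)

nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
tree <- read.tree(text = nwk)

# Hypothetical helper: a node plus everything below it, found by repeatedly
# looking up the children of the current frontier in the edge matrix.
clade_nodes <- function(tree, node) {
  found <- node
  frontier <- node
  while (length(frontier) > 0) {
    frontier <- tree$edge[tree$edge[, 1] %in% frontier, 2]
    found <- c(found, frontier)
  }
  found
}

n_node <- Ntip(tree) + Nnode(tree)
group <- rep(0L, n_node)          # 0 means "not in any selected clade"
focal <- c(17, 21)
for (i in seq_along(focal)) {
  group[clade_nodes(tree, focal[i])] <- i
}

# Group of each tip, named by tip label.
setNames(group[1:Ntip(tree)], tree$tip.label)
```

Tips A to E fall under node 17 and tips F to H under node 21, matching the first ten rows printed above; everything else stays in group 0.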

groupOTU

## the input nodes can be node ID or label
groupOTU(x, c('t1', 't4'), group_name = "fake_group")
## # A tibble: 7 x 5
##   parent  node branch.length label fake_group
##    <int> <int>         <dbl> <chr>     <fctr>
## 1      7     1   0.472166386    t2          0
## 2      7     2   0.273833123    t1          1
## 3      6     3   0.674331481    t4          1
## 4      5     4   0.002020766    t3          0
## 5      5     5            NA  <NA>          0
## 6      5     6   0.039322336  <NA>          1
## 7      6     7   0.434905600  <NA>          1

groupOTU traces back from the input nodes to their most recent common ancestor (MRCA). In this example, nodes 2, 3, 7 and 6 (2 (t1) -> 7 -> 6 and 3 (t4) -> 6) are grouped together.
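
The tracing step can be sketched in base R using the parent column of the table printed above. This is an illustration of the idea, not tidytree's implementation, and path_to_root is a hypothetical helper.

```r
# parent[i] is the parent of node i, copied from the tbl_tree printed above.
# The root (node 5) is recorded as its own parent.
parent <- c(7, 7, 6, 5, 5, 5, 6)

# Hypothetical helper: the chain of nodes from a given node up to the root.
path_to_root <- function(node, parent) {
  path <- node
  while (parent[node] != node) {
    node <- parent[node]
    path <- c(path, node)
  }
  path
}

focus <- c(2, 3)                         # node IDs of t1 and t4
paths <- lapply(focus, path_to_root, parent = parent)

# intersect() keeps the order of its first argument, so the first shared
# ancestor on the way up is the most recent common ancestor.
mrca <- Reduce(intersect, paths)[1]

# Keep each path only up to (and including) the MRCA.
grouped <- unique(unlist(lapply(paths, function(p) p[seq_len(match(mrca, p))])))

group <- as.integer(seq_along(parent) %in% grouped)
group
```

The resulting 0/1 vector marks nodes 2, 3, 6 and 7, which agrees with the fake_group column shown above.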

Related OTUs are grouped together, and they are not necessarily within a clade. The groups can be monophyletic (a clade), polyphyletic or paraphyletic.

cls <- list(c1=c("A", "B", "C", "D", "E"),
            c2=c("F", "G", "H"),
            c3=c("L", "K", "I", "J"),
            c4="M")

as_data_frame(tree) %>% groupOTU(cls)
## # A tibble: 25 x 5
##    parent  node branch.length label  group
##     <int> <int>         <dbl> <chr> <fctr>
##  1     20     1             4     A     c1
##  2     20     2             4     B     c1
##  3     19     3             5     C     c1
##  4     18     4             6     D     c1
##  5     17     5            21     E     c1
##  6     22     6             4     F     c2
##  7     22     7            12     G     c2
##  8     21     8             8     H     c2
##  9     24     9             5     I     c3
## 10     24    10             2     J     c3
## # ... with 15 more rows

If conflicts arise when tracing back to the MRCA, the overlap parameter can be set to “origin” (the first group counts), “overwrite” (the default; the last group counts) or “abandon” (the node is left un-selected for grouping). See also the discussion here.
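
The three policies can be illustrated on a toy example in which two groups both claim node 6. The function resolve_overlap below is a hypothetical base R helper written for this illustration; it follows the descriptions in the previous paragraph and is not tidytree's code, so which nodes tidytree actually treats as conflicting is an assumption.

```r
# Hypothetical helper: assign group names to nodes under a given overlap policy.
resolve_overlap <- function(n_node, groups,
                            overlap = c("overwrite", "origin", "abandon")) {
  overlap <- match.arg(overlap)
  assignment <- rep("0", n_node)            # "0" marks an ungrouped node
  for (name in names(groups)) {
    nodes <- groups[[name]]
    if (overlap == "overwrite") {
      assignment[nodes] <- name             # later groups replace earlier ones
    } else {
      free <- assignment[nodes] == "0"
      assignment[nodes[free]] <- name       # earlier groups keep their nodes
    }
  }
  if (overlap == "abandon") {
    claims <- tabulate(unlist(groups), nbins = n_node)
    assignment[claims > 1] <- "0"           # contested nodes end up ungrouped
  }
  assignment
}

# Made-up groups on a 7-node tree; node 6 is claimed by both.
groups <- list(g1 = c(2, 7, 6), g2 = c(3, 6))

# Group of the contested node 6 under each policy.
sapply(c("overwrite", "origin", "abandon"),
       function(policy) resolve_overlap(7, groups, policy)[6])
```

Only the contested node differs between policies; nodes claimed by a single group receive that group regardless of the setting.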