phylogram: dendrograms for evolutionary analysis

Shaun Wilkinson

2018-04-13


Important Note: From version 2.0 all k-mer counting and clustering functions have been migrated to the kmer package. Apologies for any inconvenience.

Abstract

The phylogram R package is a Newick parser and tool for developing evolutionary trees as deeply-nested lists known as “dendrogram” objects. It provides functions for importing and exporting trees in parenthetic text format, as well as several tools for command-line tree manipulation. The phylogram package is released under the GPL-3 license, and is available for download from CRAN https://CRAN.R-project.org/package=phylogram and github https://github.com/shaunpwilkinson/phylogram.
Contact: shaunpwilkinson@gmail.com

Introduction

The R environment continues to gain popularity as a platform for working with evolutionary trees, due to the reproducible code-based workflow and the many powerful analytical tools available in a suite of open-source packages such as ape (Paradis et al., 2004), phangorn (Schliep, 2011) and Phytools (Revell, 2012). These packages typically employ a tree structure known as the “phylo” object, whose primary element is an integer matrix with one row for each edge in the graph, and two columns giving the indices of the connecting nodes.

An alternative tree structure is the “dendrogram” object, generated using the as.dendrogram function in the stats package (R Core Team, 2017). Rather than a matrix of edges, a dendrogram is a hierarchical list. These ‘lists of lists’ can be deeply nested, with the limit depending on the C stack size (settable via options("expressions")). A useful feature of this representation is its modularity, whereby the sub tree of a tree is itself a tree - a dendrogram within a dendrogram. This means that dendrogram objects are subsettable in the same way that standard lists are, which in addition to the inbuilt editing functions such as cut and merge, facilitates intuitive command-line tree manipulation. An especially powerful feature of this object type is that tree editing operations can be carried out recursively using fast inbuilt functions in the “apply” family such as dendrapply and lapply.

Each node of a dendrogram object has the following mandatory attributes:

Rather than lists, terminal leaf nodes are length-1 integer vectors whose values correspond to the indices of the members in the set. The “members” attributes of leaf nodes is always 1, the “midpoint” attribute is 0, and they have two additional attributes:

Aside from those listed above, users may attach other objects as attributes to the dendrogram nodes. For example, “label” attributes can be attached to inner nodes, and users can specify plotting parameters for each node by setting the attributes “nodePar” and “edgePar”.

Example 1: Build a dendrogram object manually

Consider the simple example of a tree with three members named “A”, “B” and “C”, where “B” and “C” are more closely related to each other than they are to “A”. An unweighted Newick string for this tree would be (A,(B,C)); We can manually create a dendrogram object for this basic phylogeny and plot the tree as follows:

x <- list(1, list(2, 3))
## attach "leaf" and "label" attributes to leaf nodes
attr(x[[1]], "leaf") <- TRUE
attr(x[[2]][[1]], "leaf") <- attr(x[[2]][[2]], "leaf") <- TRUE
attr(x[[1]], "label") <- "A"
attr(x[[2]][[1]], "label") <- "B"
attr(x[[2]][[2]], "label") <- "C"
## set "height" attributes for all nodes
attr(x, "height") <- 1
attr(x[[1]], "height") <- 0
attr(x[[2]], "height") <- 0.5
attr(x[[2]][[1]], "height") <- attr(x[[2]][[2]], "height") <- 0
## set "midpoints" attributes for all nodes
attr(x, "midpoint") <- 0.75
attr(x[[1]], "midpoint") <- 0
attr(x[[2]], "midpoint") <- 0.5
attr(x[[2]][[1]], "midpoint") <- attr(x[[2]][[2]], "midpoint") <- 0
## set "members" attributes for all nodes
attr(x, "members") <- 3
attr(x[[1]], "members") <- 1
attr(x[[2]], "members") <- 2
attr(x[[2]][[1]], "members") <- attr(x[[2]][[2]], "members") <- 1
## set class as "dendrogram" 
## Note that setting the class for the root node
## automatically sets the class of all nested subnodes
class(x) <- "dendrogram"

### plot the dendrogram
plot(x, yaxt = "n")

Figure 1: A simple dendrogram with three terminal leaf nodes

As demonstrated in this example, manually setting attributes on dendrogram objects can be rather tedious, motivating the development of functions to automate the generation and manipulation of these tree structures.

The ‘phylogram’ package

Here, we introduce phylogram, an R package for working with evolutionary trees as deeply-nested lists. The package contains functions for importing and exporting dendrogram objects to and from parenthetic text, and assembling, visualizing, manipulating and rendering trees for publication. These functions are detailed below with examples of their utility.

Importing and exporting trees

The Newick (a.k.a. New Hampshire) parenthetic text format (Felsenstein et al., 1986) is a universal phylogenetic tree representation that is compatible with most tree-editing software. The phylogram package features the text parser read.dendrogram that reads a character string or text file in the Newick format and creates a dendrogram object. This function supports weighted edges, labels with special meta-characters (enclosed in single quotation marks), comments (enclosed in square brackets; ignored by the parser), multifuricating nodes, and both rooted and unrooted trees. Inner-node labels are currently ignored; however, the inclusion of “label” attributes for non-leaf nodes will be available in a future version. Objects of class “dendrogram” can be exported as Newick-style parenthetic text using the function write.dendrogram.

Example 2: Import and export a tree from a Newick string

The simple Newick string in Example 1 can be imported as a dendrogram object using the read.dendrogram function as follows:

library(phylogram)
newick <- "(A,(B,C));"
x <- read.dendrogram(text = newick)
x
#> 'dendrogram' with 2 branches and 3 members total, at height 2

The following command writes the object back to the console in Newick format without edge weights:

write.dendrogram(x, edges = FALSE)
#> [1] "(A,(B,C));"

The syntax is similar when reading and writing text files, except that the text argument is replaced by file, and a valid file path is passed to the function.

If required, the dendrogram can be converted to an object of class “phylo” using the as.phylo.dendrogram method, and converted back to a dendrogram with as.dendrogram.phylo.

y <- as.phylo(x)
z <- as.dendrogram(y)
identical(x, z)
#> [1] TRUE

Tree editing/manipulation

The phylogram package features several additional functions to facilitate some of the more common manipulation operations. Leaf nodes and internal branching nodes can be removed using the function prune, which identifies and recursively deletes nodes based on regular expression pattern matching of node “label” attributes. To aid visualization, the function ladder rearranges the tree, sorting nodes by the number of members (analogous to the ladderize function in the ape package). Another function aiding in tree visualization is ultrametricize, which resets the “height” attributes of all terminal leaf nodes to zero (note that unlike ape::chronos() there is no mathematical basis for this operation; rather it is merely a visualization aid). The function reposition scales the height of all nodes in a tree by a given constant (passed via the argument shift), and features the option to reset all node heights so that height of the farthest terminal leaf node from the root is zero (by specifying shift = "reset"). The function remidpoint recursively corrects all “midpoint”, “members” and “leaf” attributes following manual editing of a tree.

Tree vizualization

Publication-quality trees can be generated from dendrogram objects using the stats plotting function plot.dendrogram, and the extensive plotting functions available in dendrogram-enhancing packages such as circlize (Gu et al., 2014) and dendextend (Galili, 2015). The latter also offers the facility to convert dendrograms to “ggdend” objects, for which many powerful ‘grammar of graphics’ plotting functions are available in the ggplot2 (Wickham, 2009) and ggdendro (DeVries and Ripley, 2016) packages. Moreover, there are several advanced plotting options for “phylo” objects in the ape package (Paradis et al., 2004), as well as the Bioconductor package ggtree (Yu et al., 2017). Given the extensive tree visualization options already available, we elected not to include any additional plotting functions in the phylogram package.

Getting help

If you experience a problem using this package please either raise it as an issue on GitHub, post it on the phylogram google group or email the author directly.

Acknowledgements

This software was developed with funding from a Rutherford Foundation Postdoctoral Research Fellowship from the Royal Society of New Zealand.

References

DeVries,A. and Ripley,B.D. (2016) ggdendro: create dendrograms and tree diagrams using ’ggplot2’.

Felsenstein,J. et al. (1986) The newick tree format, 1986.

Galili,T. (2015) dendextend: An R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31, 3718–3720.

Gu,Z. et al. (2014) circlize implements and enhances circular visualization in R. Bioinformatics, 30, 2811–2812.

Paradis,E. et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.

R Core Team (2017) R: A Language and Environment for Statistical Computing.

Revell,L.J. (2012) phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3, 217–223.

Schliep,K.P. (2011) phangorn: phylogenetic analysis in R. Bioinformatics, 27, 592–593.

Wickham,H. (2009) ggplot2: Elegant Graphics for Data Analysis Springer-Verlag, New York.

Yu,G. et al. (2017) ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution, 8, 28–36.