Plotting clustering trees

Luke Zappia

Last updated: 24 February 2019

1 What is a clustering tree?

Clustering analysis is used in many contexts to group similar samples. One problem when conducting this kind of analysis is how many clusters to use. This is usually controlled by a parameter provided to the clustering algorithm, such as \(k\) for \(k\)-means clustering.

Statistics designed to help you make this choice typically either compare two clusterings or score a single clustering. A clustering tree is different in that it visualises the relationships between clusterings at a range of resolutions.

To build a clustering tree we need to look at how cells move as the clustering resolution is increased. Each cluster forms a node in the tree and edges are constructed by considering the cells in a cluster at a lower resolution (say \(k = 2\)) that end up in a cluster at the next highest resolution (say \(k = 3\)). By connecting clusters in this way we can see how clusters are related to each other, which are clearly distinct and which are unstable. Extra information about the cells in each node can also be overlaid in order to help make the decision about which resolution to use. For more information about clustering trees please refer to our associated publication (Zappia and Oshlack 2018).
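The edge construction described above can be sketched in base R. This is only an illustration and not the code clustree uses internally: it clusters the standard `iris` measurements with `kmeans()` at \(k = 2\) and \(k = 3\) (the seed and `nstart` value are arbitrary choices), counts the samples shared by each pair of clusters, and computes for each edge the proportion of the target node's samples that arrive along it.

```r
# Cluster the four iris measurements at two resolutions (base R only)
set.seed(1)
measurements <- iris[, 1:4]
low  <- kmeans(measurements, centers = 2, nstart = 10)$cluster
high <- kmeans(measurements, centers = 3, nstart = 10)$cluster

# Every non-empty cell of the cross-tabulation is an edge in the tree
edges <- as.data.frame(table(from = low, to = high), responseName = "count")
edges <- edges[edges$count > 0, ]

# Proportion of each higher resolution node that arrives along the edge:
# samples on the edge divided by samples in the node it points to
node_sizes <- table(high)
edges$in_prop <- edges$count / as.numeric(node_sizes[as.character(edges$to)])
edges
```

Edges with a proportion close to 1 indicate a cluster that comes almost entirely from a single parent, while a node fed by several sizeable edges is a sign of instability.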

2 A simple example

To demonstrate what a clustering tree looks like we will work through a short example using the well-known iris dataset.

2.1 The data

The iris dataset consists of measurements (sepal length, sepal width, petal length and petal width) of 150 iris flowers, 50 from each of three species (Iris setosa, Iris versicolor and Iris virginica). For more information see ?iris. We are going to use a version of this dataset that has already been clustered. Let’s load the data and take a look:
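A minimal sketch of loading the data, assuming an installed clustree version that ships the pre-clustered `iris_clusts` dataset (some releases may bundle a different example dataset):

```r
library(clustree)

# Load the pre-clustered iris data bundled with the package
# (assumption: this clustree version includes `iris_clusts`)
data("iris_clusts")

# Show the first few rows: measurements, species and cluster columns
head(iris_clusts)
```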

Here we have a data.frame with the normal iris dataset, the measurements and species, plus some additional columns. These columns contain the cluster assignments from clustering this data using \(k\)-means with values of \(k\) from \(k = 1\) to \(k = 5\).

2.2 Plotting a tree

This clustering information is all we need to build a clustering tree. Each column must consist of numeric values indicating which cluster each sample has been assigned to. To plot the tree we just pass this information to the clustree function. We also need to specify a prefix string to indicate which columns contain the clusterings.
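The call might look like the following, assuming the clustering columns in `iris_clusts` are named `K1` to `K5` so that the prefix is `"K"`:

```r
library(clustree)
data("iris_clusts")

# Build the tree from every column whose name starts with "K"
tree <- clustree(iris_clusts, prefix = "K")

# The result is a ggplot object; printing it draws the tree
tree
```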

We can see that one cluster is very distinct and does not change with the value of \(k\). This cluster contains the Iris setosa samples, which are very different to the other species. On the other side of the tree we see a single cluster that splits into the two clusters we would expect to see. After this the tree becomes messier and there are nodes with multiple incoming edges. This is a good indication that we have over-clustered the data.

2.3 Controlling aesthetics

By default the size of each node is related to the number of samples in each cluster and the colour indicates the clustering resolution. Edges are coloured according to the number of samples they represent and the transparency shows the incoming node proportion, the number of samples in the edge divided by the number of samples in the node it points to. We can control these aesthetics by setting them to specific values:
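As a sketch, fixed values can be passed to the node aesthetic arguments. The particular colour, size and transparency below are arbitrary choices for illustration:

```r
library(clustree)
data("iris_clusts")

# Use one colour, size and transparency for every node
tree <- clustree(iris_clusts, prefix = "K",
                 node_colour = "purple",
                 node_size = 10,
                 node_alpha = 0.8)
tree
```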

We can also link these aesthetics to other information we have about the samples. All the additional columns in the dataset are available to be added as attributes to the nodes in our tree. Because each node represents multiple samples we need to supply an aggregation function as well as specifying a column name. Let’s try colouring the nodes according to the sepal width:
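A minimal sketch, assuming the aggregation function is given by name as a string and that the mean is a reasonable summary for this column:

```r
library(clustree)
data("iris_clusts")

# Colour each node by the mean sepal width of the samples it contains
tree <- clustree(iris_clusts, prefix = "K",
                 node_colour = "Sepal.Width",
                 node_colour_aggr = "mean")
tree
```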

We can clearly see that the distinct cluster containing the Iris setosa samples has a wider sepal on average compared to the other clusters.

2.4 Layout

By default the tree is drawn using the Reingold-Tilford tree layout algorithm which tries to place nodes below their parents (Reingold and Tilford 1981). Alternatively we could use the Sugiyama layout by specifying the layout argument. This algorithm tries to minimise the number of crossing edges (Sugiyama, Tagawa, and Toda 1981) and can produce more attractive trees in some cases.
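Switching to the Sugiyama layout could look like this:

```r
library(clustree)
data("iris_clusts")

# Draw the same tree with the Sugiyama layout instead of the default
tree <- clustree(iris_clusts, prefix = "K", layout = "sugiyama")
tree
```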

For both of these layout algorithms clustree uses slightly modified versions by default. Only the core network of edges, those that are the highest in-proportion edge for a node, are used when creating the layout. In most cases this leads to more attractive trees that are easier to interpret. To turn this off, and use all edges for deciding the layout, we can set use_core_edges to FALSE.
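For example, the following sketch lays out the Sugiyama version of the tree using every edge rather than only the core network:

```r
library(clustree)
data("iris_clusts")

# Let all edges, not just the core edges, influence node placement
tree <- clustree(iris_clusts, prefix = "K",
                 layout = "sugiyama",
                 use_core_edges = FALSE)
tree
```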

2.5 Adding labels

To make it easy to identify clusters the cluster nodes are labelled with their cluster number (controlled using the node_text arguments) but sometimes it is useful to add labels with additional information. This is done in the same way as the other aesthetics. Here we label nodes with the maximum petal length:
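A sketch of such a call, assuming the `node_label` and `node_label_aggr` arguments take a column name and the name of an aggregation function respectively:

```r
library(clustree)
data("iris_clusts")

# Label each node with the largest petal length among its samples
tree <- clustree(iris_clusts, prefix = "K",
                 node_label = "Petal.Length",
                 node_label_aggr = "max")
tree
```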

One way this can be useful is if we have assigned labels to the samples. Here is a custom function that labels a cluster if all the samples are the same species, otherwise it labels the cluster as “mixed”:
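One possible version of such a function is sketched below. The name `label_species` is chosen for illustration, and the sketch assumes the function can be passed to `node_label_aggr` by name in the same way as a built-in function such as `"max"`:

```r
library(clustree)
data("iris_clusts")

# Return the species name if a cluster is pure, otherwise "mixed"
label_species <- function(labels) {
    species <- unique(as.character(labels))
    if (length(species) == 1) {
        species
    } else {
        "mixed"
    }
}

# Use the custom function to aggregate the Species column per node
tree <- clustree(iris_clusts, prefix = "K",
                 node_label = "Species",
                 node_label_aggr = "label_species")
tree
```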

3 Clustering trees for scRNA-seq data

Clustering has become a core tool for analysing single-cell RNA-sequencing (scRNA-seq) datasets. These datasets contain gene expression measurements from hundreds to hundreds of thousands of cells. Often samples come from complex tissues containing many types of cells and clustering is used to group similar cells together. To make it easier to produce clustering trees for these kinds of datasets we provide interfaces for some of the objects commonly used to analyse scRNA-seq data.

The clustree package contains an example simulated scRNA-seq dataset that has been clustered using the SC3 and Seurat packages.
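Assuming the dataset is available under the name `sc_example`, it can be loaded and its components listed like this:

```r
library(clustree)

# Load the simulated scRNA-seq example (assumed name: sc_example)
data("sc_example")

# List the components stored in the example object
names(sc_example)
```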

#> [1] "counts"          "logcounts"       "tsne"            "sc3_clusters"   
#> [5] "seurat_clusters"

3.1 SingleCellExperiment objects

The SingleCellExperiment is one of these common objects, used across a range of Bioconductor packages. Let’s have a look at an example, but first we need to convert the example dataset to a SingleCellExperiment object:
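A sketch of the conversion, assuming the SingleCellExperiment package is installed and that the components of `sc_example` have the names listed above. The reduced dimension name `TSNE` is an illustrative choice:

```r
library(clustree)
suppressPackageStartupMessages(library(SingleCellExperiment))
data("sc_example")

# Assemble the expression matrices, SC3 clusterings and t-SNE
# coordinates into a single SingleCellExperiment object
sce <- SingleCellExperiment(
    assays = list(counts = sc_example$counts,
                  logcounts = sc_example$logcounts),
    colData = sc_example$sc3_clusters,
    reducedDims = SimpleList(TSNE = sc_example$tsne)
)

# Per-cell annotation, including the clustering columns
head(colData(sce))
```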

The clustering information is held in the colData slot.

We can plot a clustering tree in the same way we did with a data.frame. In this case the clustering column names contain a suffix that needs to be stripped away, so we will pass that along as well.
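The following sketch assumes the SC3 columns follow the pattern `sc3_<k>_clusters`, so that the prefix is `"sc3_"` and the suffix is `"_clusters"`. Check the column names of your own object before reusing these values:

```r
library(clustree)
suppressPackageStartupMessages(library(SingleCellExperiment))
data("sc_example")

# Store the counts and the SC3 clusterings in a SingleCellExperiment
sce <- SingleCellExperiment(
    assays = list(counts = sc_example$counts),
    colData = sc_example$sc3_clusters
)

# Strip both the prefix and the suffix to recover the resolution
tree <- clustree(sce, prefix = "sc3_", suffix = "_clusters")
tree
```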

3.2 Seurat objects

Clustering trees can also be produced directly from Seurat objects. Let’s convert our SingleCellExperiment to Seurat format:
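One way to obtain an equivalent Seurat object is sketched here. It builds the object from the same example data rather than converting the SingleCellExperiment directly, and it assumes Seurat version 3 or later, where `CreateSeuratObject()` takes a `counts` argument. Older releases use a different interface.

```r
library(clustree)
suppressPackageStartupMessages(library(Seurat))
data("sc_example")

# Create a Seurat object holding the counts and the Seurat clusterings
# (assumption: Seurat >= 3 argument names)
seurat <- CreateSeuratObject(counts = sc_example$counts,
                             meta.data = sc_example$seurat_clusters)

# Per-cell metadata, including the clustering columns
head(seurat[[]])
```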

In this case the clustering information is held in the meta.data slot:

Because this object is only used by the Seurat package we can assume the prefix of the clustering columns.
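In this example the clustering columns are assumed to be named `res.<resolution>`, so the prefix is passed explicitly as `"res."`. Objects clustered with other Seurat versions may use a different column prefix, and the sketch again assumes Seurat version 3 or later:

```r
library(clustree)
suppressPackageStartupMessages(library(Seurat))
data("sc_example")

# Build the Seurat object from the example data (Seurat >= 3 assumed)
seurat <- CreateSeuratObject(counts = sc_example$counts,
                             meta.data = sc_example$seurat_clusters)

# Plot the tree from the metadata columns starting with "res."
tree <- clustree(seurat, prefix = "res.")
tree
```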

3.3 Using genes as aesthetics

As well as being able to use any additional columns for aesthetics we can also use the expression of individual genes. Let’s colour the nodes in the Seurat tree by Gene730 (a highly variable gene). Again we need to supply an aggregation function.
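A sketch of colouring by gene expression follows. The median is an arbitrary choice of aggregation function, and `exprs = "counts"` is set here because this illustrative object has not been normalised. Both are assumptions, as is the use of Seurat version 3 or later:

```r
library(clustree)
suppressPackageStartupMessages(library(Seurat))
data("sc_example")

# Build the Seurat object from the example data (Seurat >= 3 assumed)
seurat <- CreateSeuratObject(counts = sc_example$counts,
                             meta.data = sc_example$seurat_clusters)

# Colour each node by the median raw count of Gene730 in its cells
tree <- clustree(seurat, prefix = "res.",
                 exprs = "counts",
                 node_colour = "Gene730",
                 node_colour_aggr = "median")
tree
```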