Loading phylogenetic data into R

Martin R. Smith



It can be a bit fiddly to get a phylogenetic dataset into R, particularly if you are not used to working with files in the NEXUS format.

The first thing that you’ll need to do is load the phangorn package, which should have been installed when you installed TreeSearch.

library('phangorn')
## Loading required package: ape

Filesystem navigation

To open a file, you’re going to have to tell R where it is. You can do this by providing the full path to the file on your system. Be careful to use forward slashes (/) rather than the backslashes (\) that you’ll get if you copy a file path in Windows: R treats a backslash inside a string as the start of an escape sequence.

filename <- "C:/nexus/matrix.nex"
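If you have copied a path from Windows, one way to convert it is sketched below. It uses base R only and reuses the illustrative path from above; each backslash has to be doubled when the path is typed into R.

```r
# A Windows path typed into R needs each backslash doubled,
# because a single backslash starts an escape sequence in an R string.
windows_path <- "C:\\nexus\\matrix.nex"

# Replace every backslash with a forward slash.
# fixed = TRUE treats the pattern as literal text, not a regular expression.
filename <- gsub("\\", "/", windows_path, fixed = TRUE)
print(filename)
```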

You can save typing by giving R a working directory. You can think of R as having a file explorer window open invisibly in the background. You can see the folder that’s open at the moment by typing getwd() at the console. setwd() tells R to open a different folder instead; setwd("../") tells R to go up to the parent directory. (You can also do this using the Graphical User Interface in RStudio.)

By setting the directory that your files are in as the working directory, you only need to specify the filename:

setwd("C:/nexus/") # You only need to do this once
filename <- "matrix.nex"
# Do something with this file
filename <- "matrix2.nex"
# Do something with this file
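The sketch below shows the same idea in a form that can be run anywhere. It assumes nothing about your own files: it creates a temporary folder (the name nexus_demo is invented) containing an empty stand-in for a data file, then checks that the bare filename is found once that folder is the working directory.

```r
# Create a temporary folder containing an empty stand-in for a data file
data_dir <- file.path(tempdir(), "nexus_demo")  # Invented folder name
dir.create(data_dir, showWarnings = FALSE)
file.create(file.path(data_dir, "matrix.nex"))

old_dir <- getwd()  # Remember the folder that R currently has open
setwd(data_dir)     # Open the demonstration folder instead

filename <- "matrix.nex"
found <- file.exists(filename)  # The bare filename is resolved against the working directory
available <- list.files()       # Files in the working directory

setwd(old_dir)      # Return to the original folder
print(found)
print(available)
```

Restoring the original working directory at the end keeps the rest of your session unaffected.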

Getting raw data:

From an Excel spreadsheet

If your data is in an Excel spreadsheet, one way to load it into R is using the xlsx package. First you’ll have to install it:

install.packages('xlsx') # You only need to do this once

Then you should prepare your Excel spreadsheet such that each row corresponds to a taxon, and each column to a character.

Then you can read the data from the Excel file by telling R which sheet, rows and columns contain your data:

library('xlsx') # Load the package in each new R session
raw_data <- as.matrix(read.xlsx(filename,
  sheetIndex=1, # Loads sheet number 1 from the excel file
  rowIndex=2:21, # Extracts rows 2 to 21
  colIndex=2:26 # Extracts columns B to Z
))
# In this example, the names of taxa are in column 1
taxon_names <- read.xlsx(filename, sheetIndex=1, rowIndex=2:21, colIndex=1, as.data.frame=FALSE)
rownames(raw_data) <- taxon_names
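The aim is a matrix with one row per taxon and one column per character, with the taxon names as row names. A minimal base R sketch of that layout follows; the taxon names and codings are invented and stand in for the contents of a real spreadsheet.

```r
# Hypothetical codings: three taxa (rows) scored for four characters (columns)
raw_data <- matrix(c("0", "1", "0", "-",
                     "1", "1", "?", "0",
                     "0", "0", "1", "1"),
                   nrow = 3, byrow = TRUE)
taxon_names <- c("Taxon_a", "Taxon_b", "Taxon_c")  # Invented names
rownames(raw_data) <- taxon_names

print(dim(raw_data))          # Number of taxa, then number of characters
print(raw_data["Taxon_b", ])  # Codings for a single taxon
```

Checking dim() and a row or two in this way is a quick test that the rows and columns you asked for match the region of the spreadsheet that holds your data.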

From a NEXUS file

TreeSearch contains an inbuilt NEXUS parser:

raw_data <- ReadCharacters(filename)
# Or, to go straight to PhyDat format:
as_phydat <- ReadAsPhyDat(filename)

This will extract character names and codings from a dataset. It’s been written to work with datasets downloaded from MorphoBank, but should work with most valid (and many invalid) NEXUS files. If there’s a file that’s not being read correctly, please let me know and I’ll try to fix it.

Alternatively, you can read a NEXUS file using the ape package, which is loaded automatically alongside phangorn:

raw_data <- ape::read.nexus.data(filename)

Non-standard elements of a NEXUS file might be beyond the capabilities of ape’s parser. In particular, you will need to replace spaces in taxon names with underscores, and to arrange all data into a single block starting BEGIN DATA. You’ll also need to strip out comments, character definitions and separate taxon blocks.
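Some of this tidying can be done in R before the file is handed to ape. The following is a rough sketch using base R only. The taxon names and the commented line are invented for illustration, and a real file may need more careful handling (for example, comments that span several lines).

```r
# Invented taxon names, one of which contains a space
taxon_names <- c("Homo sapiens", "Pan troglodytes", "Gorilla")
# Replace each space with an underscore
clean_names <- gsub(" ", "_", taxon_names, fixed = TRUE)
print(clean_names)

# An invented matrix line containing a NEXUS comment in square brackets
nexus_line <- "Gorilla 0101 [coded from museum specimen]"
# Remove bracketed comments, then trim the whitespace left behind
no_comment <- trimws(gsub("\\[[^]]*\\]", "", nexus_line))
print(no_comment)
```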

The function readNexus in the phylobase package promises to be more powerful, but I have not been able to get it to work.

From a TNT file

A TNT format dataset downloaded from MorphoBank can be parsed with ReadTntCharacters, which might also handle other TNT-compatible files. If there’s a file that’s not being read correctly, please let me know and I’ll try to fix it.

raw_data <- ReadTntCharacters(filename)
# Or, to go straight to PhyDat format:
my_data <- ReadTntAsPhyDat(filename)

Processing raw data

The next stage is to get the raw data into a format that TreeSearch can understand. If you’ve used the ReadAsPhyDat or ReadTntAsPhyDat functions, then you can skip this step – you’re already there.

Otherwise, you can try

my_data <- PhyDat(raw_data)

or if that doesn’t work,

my_data <- MatrixToPhyDat(raw_data)

These functions haven’t been exhaustively tested – if they don’t work on your dataset, please let me know.

Failing that, you can enlist the help of the phangorn package, which was installed when you installed TreeSearch:

my_data <- phyDat(raw_data, type='USER', levels=c(0:9, '-'))

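type='USER' tells the parser that the character states are defined by the user (rather than being nucleotides or amino acids), which is what is needed for morphological data.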

The levels parameter simply lists all the states that any character might take. 0:9 includes all the integer digits from 0 to 9. If you have inapplicable data in your matrix, you should list - as a separate level, as it represents an additional state (as handled by the Morphy implementation of Brazeau, Guillerme, & Smith, 2017).
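Before converting, it can help to check which tokens in your matrix are not covered by the levels you intend to supply. The sketch below uses base R only, with invented codings. It simply flags such tokens for inspection and makes no claim about how phangorn itself treats them.

```r
# 0:9 is coerced to character when combined with '-'
state_levels <- c(0:9, '-')
print(state_levels)

# Invented codings that include an inapplicable token and a polymorphism
raw_data <- matrix(c("0", "1", "-",
                     "2", "{01}", "1"),
                   nrow = 2, byrow = TRUE)

# Tokens present in the matrix but absent from the list of levels
unexpected <- setdiff(unique(as.vector(raw_data)), state_levels)
print(unexpected)
```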

What next?

You might want to:


Brazeau, M. D., Guillerme, T., & Smith, M. R. (2017). Morphological phylogenetic analysis with inapplicable data. bioRxiv. doi:10.1101/209775