Reading in branch-annotated phylogenetic trees

Files used in this tutorial:

RBrownie reads and writes modified nexus-formatted files, including nexus files which contain branch-annotated trees. In this tutorial, we'll go through how to read in your tree files and discuss the objects that are returned.

Newick
Before getting into NEXUS files however, let's start by looking at newick-formatted tree files. Newick tree strings are used to define phylogenetic trees and might look something like this (but typically much longer than this):

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

To read a standard newick string into R, you don't need to use RBrownie, but it is useful to go through the motions:

library(ape)
library(phylobase)
tree = read.tree(text="(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);")
class(tree) # "phylo"
phy = phylo4(tree)
class(phy) # "phylo4"
phyd = phylo4d(tree)
class(phyd) # "phylo4d"

tree in the case above is of class 'phylo' which is the ape packages's internal tree structure. phy is of class 'phylo4', which is the phylobase packages's internal tree format and is basically an S4 version of 'phylo'. It contains one phylogenetic tree and nothing else. 'phylo4d' is a further extension of 'phylo4', which basically adds a data.frame to the tree so that node data can be stored.

Often times data is associated with the nodes with in a tree and there is nothing in the newick standard which deals with that. SIMMAP is a modified newick format which over comes this deficiency and it has gone through a number of different iterations (and it is what RBrownie uses). Reading newick trees with SIMMAP-style branch annotations is similar:

library(RBrownie)
simtree.old = read.simmap(text="(A:{0,0.1},B:{0,0.2},(C:{0,0.3},D:{0,0.4}):{1,0.5});")
simtree.new = simtree=read.simmap.new(text="(A:[&map={0}]0.1,B:[&map={0}]0.2,(C:[&map={0}]0.3,D:[&map={0}]0.4):[&map={1}]0.5);")
class(simtree.old) == class(simtree.new) # TRUE, both are 'phylo4d'

simtree.old and simtree.new are essentially identical except that the former's data.frame is labeled "simmap_state" (the default name for old simmap), while the later's is labeled "map" (which was explicitly defined). We've used the same trees as we did above (try running the 'plot' command on all of them to confirm), but now have a datum at each tip except the root node (which we can view by typing tdata(simtree.new)).

Enter subnodes (aka internal singleton nodes). These are basically nodes which divide up a branch into multiple fragments, usually to track changes in some variable over the course of a branch. In the code below I've added a subnode to the branch connecting the ancestor of D and C with some internal node:

simtree.old = read.simmap(text="(A:{0,0.1},B:{0,0.2},(C:{0,0.3},D:{0,0.4}):{0,0.2:1,0.3});")
class(simtree.old) # "phylo4d" (w/ singleton)
hasSingle(simtree.old) # TRUE
hasSubNodes(simtree.old) # FALSE
#
simtree.new = read.simmap.new(text="(A:[&map={0}]0.1,B:[&map={0}]0.2,(C:[&map={0}]0.3,D:[&map={0}]0.4):[&map={0,0.2,1}]0.5);")
class(simtree.new) # "phylo4d_ext"
hasSingle(simtree.new) # FALSE
hasSubNodes(simtree.new) # TRUE

Notice how read.simmap and read.simmap.new both return *slightly* different objects. simtree.old is returned as a 'phylo4d' object with a singleton node. This behavior should be depreciated in a future release, but it highlights an important conceptual similarity: subnodes are basically singleton nodes with a bit more flexibility. simtree.old can easily be converted to 'phylo4d_ext' format using this command:

old.to.new = phyext(simtree.old)

Now the simtree.new and old.to.new are identical. Plotting them illustrates this:

par(mfrow=c(1,2))
plot(simtree.new)
plot(old.to.new)

Each color in the tree represents a different character state. Notice how that color changes when it hits the subnode where we specified that the character state changes:

(A:[&map={0}]0.1,B:[&map={0}]0.2,(C:[&map={0}]0.3,D:[&map={0}]0.4):[&map={0,0.2,1}]0.5);

Does the plot look like it reflects this change?

Again, 'phylo4d_ext' is, as the name implies, a direct extension of the 'phylo4d' class (from phylobase). That means all functions defined for 'phylo4d' are compatible (or should be); details on theses can be found in the phylobase documentation and phylo4 vignette. 'phylo4d_ext' does have some notable new functions however, for looking at subnodes:

snid(simtree.new) # get subnode ids
sndata(simtree.new) # similar to 'tdata(...)' function
snbranch(simtree.new) # get branch subnodes are on
snpos(simtree.new) # get position of subnode on branch

These and other 'phylo4d_ext' functions will be discussed in detail a tutorial on how to manipulate branch-annotated phylogenetic trees.

Nexus
Of course, directly reading Newick-type strings is not terribly useful in itself, so RBrownie includes methods which read NEXUS files which include branch-annotated newick strings. We'll start with another review:

Both ape and phylobase have methods for reading nexus data. Download 'nexus_test.txt' file attached to this page and place it in the same directory as getwd(). 'nexus_test.txt' simple contains a similar tree to the one we first worked with above, but in nexus format and rooted. Once it is in the correct directory it can be read in:

library(phylobase)
library(ape)
# ape:
tree = read.nexus("nexus_test.txt")
# phylobase:
phyd = readNexus("nexus_test.txt")

These two functions read standard Nexus files (though they don't read all blocks). read.nexus only reads the tree block, while readNexus reads both the tree and the related data block. RBrownie extends phylobase, but many of it's read in functions look similar to ape.

What if we have a nexus file with simmap-type tree strings however? Well, read.nexus and readNexus may or may not read them, but they since, in the new style simmap format, branch annotations are enclosed in comments ([..nexus comment..]), the data will not be read in. Again, we can use RBrownie functions to read both the tree and data from such a file. First download 'simmap_test.txt' and place it in getwd()

library(RBrownie)
output = read.nexus.simmap("simmap_test.txt")
simtree.new = output[[1]]
class(output) # 'list'
class(simtree.new) # 'phylo4d_ext'

read.nexus.simmap returns trees in a list (because nexus files usually have more than one tree in them). All trees returned in the list will be of type 'phylo4d_ext'. A list is currently being used to house multiple trees until a suitable replacement (which does away with redundant data) can be written up. Also, read.nexus.simmap guess which version of simmap is being used in the tree strings (if any) using the is.simmap function. So, when 'simmap_test.txt' is being read, each tree entry is being parsed using this code block:

#read.nexus.simmap <- function(finput="",text=NULL) {
#....
tmpstr = "(1:[&map={0}]0.1,(2:[&map={0}]0.2,(3:[&map={0}]0.3,4:[&map={0}]0.4):[&map={1}]0.5)[&map={0}]:0.1);"
if(is.simmap(text=tmpstr)){ # FALSE
trtmp = phyext(unname(read.simmap(text=tmpstr)))
} else if (is.simmap(text=tmpstr,vers=1.5)) { # TRUE!
trtmp = phyext(unname(read.simmap.new(text=tmpstr))) # use
} else if (is.simmap(text=tmpstr,vers=1.0)){
trtmp = phyext(unname(read.simmap(text=tmpstr,vers=1.0)))
} else {
trtmp = phyext(read.tree(text=tmpstr))
}
#.....
#}

Try plotting simtree.new. It should look like a typical 'phylo4d' object (because it has no subnodes).

Now download 'subnodes_test.txt', it is the same tree, but with one subnode on the D/C ancestor-to-internal node branch.

simtree.new = read.nexus.simmap("subnodes_test.txt")[[1]]

Try plotting this

plot(simtree.new)

You should see the branch with the state change color differently.

Brownie
Files which the brownie program output and which can be loaded in are modified nexus files where the input trees are either in newick OR old-style simmap format. We are currently working to upgrade brownie to work with new-style simmap tree strings. Such a file is attached called 'brownie_test.txt'. Download it and put it into getwd(). NOTE: These read functions are ONLY for objects in brownie's special nexus format. I general nexus file with simmap-style comments (like some BEAST output) should be read in with read.nexus.simmap

If you open 'brownie_test.txt', you'll see that the DATA block is gone, replaced by a CHARACTERS block and a CHARACTERS2 block. These two blocks should include continuous and standard (symbolic) data respectively and it doesn't matter which block is paired with which datatype. Also of note is the ASSUMPTIONS block. Unlike CHARACTERS2, this is a standard Nexus block. RBrownie parses it and looks for lines that start with this format: taxset SOMENAME=TAXA1 TAX2 TAX3, where SOMENAME is the name of the taxa set and TAXA# are names of any of the taxa separated by spaces. In the 'brownie_test.txt', I have defined a washington taxa set consisting of the taxa D and C.

Now that we have a sense for how this file is formatted, let's read it into R:

library(RBrownie)
br = readBrownie("brownie_test.txt")
class(br) # 'brownie'
hasSubNodes(br) # TRUE

The 'brownie' class is an extension of 'dphylo4d_ext', so it is compatible with all the functions defined for 'phylo4d_ext' and has some new notable functions. For example, 'brownie' distinguishes between continuous and discrete datatypes:

datatypes(br)

You can probably guess what each datatype stands for. "taxset" is a little different than the others however. Remember how we defined a set of taxa as washington? That is now treated as a special binary datatype that can be accessed with:

taxasets(br) # 1 - set member, 0 - not a set memeber

Conclusion
This tutorial covered the ways in which branch annotated tree data can be loaded into R using RBrownie and how to read in Brownie's special nexus format. Check out other tutorials in this series to see how to effectively analyze, manipulate, and output this type of data.

Send any feedback or noticed bugs to conrad . stack @@AT@@ gmail