Files used in this tutorial:
In this tutorial we will look at how to construct and work with tree data structures in RBrownie, withe the purpose to give readers an idea of how to load in phylogenetic data and prepare it for use with the Brownie or just to look at and manipulate branch-annotated data. This tutorial assumes that you have either read the previous tutorial or are familiar with loading trees into R using the ape, phylobase, or RBrownie packages. We'll go through a brief review:
Loading in data
ape, phylobase, and RBrownie can all read in Nexus files and this is the format that RBrownie primarily works in. If your data is not in Nexus or newick format, then you might try converting your data over using some online tool (e.g. this one) or a desktop program like MEGA or FigTree.
Let's get started. First download 'nexus_test.txt' file attached to this tutorial. Open up R, and put 'nexus_test.txt'in the directory pointed to by R when you type: getwd(). Once it is there, go back to the R prompt and type:
require(ape)
require(phylobase)
#
# ape:
tree = read.nexus("nexus_test.txt")
dat = read.nexus.data("nexus_test.txt")
#
# phylobase:
phyd = readNexus("nexus_test.txt")
class(phyd) # 'phylo4d'
Notice how ape reads in the trees and data from a nexus file separately and into different objects, while the phylobase function readNexus reads everything into one 'phylo4d' object. Because RBrownie works with an extension of the 'phylo4d' object, we really want our data to be in a 'phylo4d' object before we use it in RBrownie. Luckily, phylobase provides an easy way for us to combine tree and dat:
phyd2 = phylo4d( tree, tip.data=unlist(dat) )
class(phyd2) # 'phylo4d'
The unlist(dat) line simply converts the data list into a vector. (NOTE: if your dat has more than one data column, then you might need to condition dat differently. See the ?phylo4d documentation). phyd and phyd2 should be essentially the same.
Combining data types
Again, RBrownie data structures inherit from the 'phylobase:::phylo4d' class and all functions which apply to this class also apply to 'RBrownie:::phylo4d_ext' and 'RBrownie:::brownie' classes as well. This is important because we will now go through some useful phylobase functions looking at how to combine data.frames with trees - these methods are both useful to know in their own right and because the procedure here is typical of how RBrownie combines data internally.
Say, we have already read in our phyd object and we also have some tip data in an CSV file called 'extradata.txt' which describes some more traits of each tip (A,B,C,D) in phyd. Download 'extradata.txt' and put it in getwd()
Now we can read in our extra data:
extra = read.table(file="extradata.txt",header=TRUE,row.names=1,sep=",")
I won't go into the options for read.table as that is outside the scope of this tutorial. Once you get your data in to R, look at the row names, they should match up with the tips of our tree:
rownames(extra) # A B C D
tipLabels(phyd) # A B C D
It is not necessary to have the the row names and tip labels match up, but it does make the next step easier:
# append extra to phyd:
phyd = addData(phyd, extra)
Essentially, we just added three new data columns (colnames(extra)) to our 'phylo4d' object, phyd. phyd is just a phylogenetic tree (only one) bundle together with data for some of the nodes. Clearly, we didn't include any data for the internal nodes of our tree in 'extradata.txt', and this can be seen by typing phyd at the R prompt. All the nodes with node.type equal to internal or root have NA where tips have actual numbers.
Removing data is also easy using the rmdata function, which attempts to remove data values from node (and subnodes, if there are any):
phyd.pruned = rmdata(phyd, 1) # remove first column
phyd.pruned = rmdata(phyd, c("standard_char","isrodent") )
And let us add a quick note on the format of this function, because it might look odd to some. Due to R's support only for 'passing by value' (google it), many S4 function calls (which phylobase and RBrownie both use) look like this:
object = function1( object, arg1, arg2, other_args )
Just be aware that output from function1 often needs to be directed back to the object variable. Moving on.
Constructing a phylo4d_ext object
By far the easiest and probably most useful way to create a 'RBrownie:::phylo4d_ext' object is to use one of RBrownie's read functions like read.nexus.simmap (see the reading tutorial). But to better understand of subnodes in RBrownie, we will construct one manually using our phyd object.
First, we need to convert our 'phylobase:::phylo4d' object to a 'RBrownie:::phylo4d_ext' object. This can easily be done by calling:
phyd.ex = phyext(phyd)
The phyext function is the standard way to convert tree objects to 'phylo4d_ext' objects and is analogous to the phylo4d(...) function we used earlier.
The phyd.ex object is virtually identical to the phyd object except that there are now new (and currently empty) slots to hold subnode data. What are they? Let's take a look at the relevant part of the class definition:
# Do not execute!
setClass("phylo4d_ext",
........
representation(
subnode.id="integer",
subnode.data="data.frame",
subnode.branch="matrix",
subnode.pos="matrix",
weight="numeric" ),
,contains="phylo4d"
........ )
As you can see, the 'phylo4d_ext' is just an extension of 'phylo4d' (contains="phylo4d"), adding a number of subnode.* slots. To access those slots, RBrownie defines a number of accessor functions:
snid(phyd.ex) # empty
sndata(phyd.ex) # empty
snbranch(phyd.ex) # empty
snposition(phyd.ex) # empty
subnode.id is a vector of identifies. They are really only used internally for now. subnode.data is a data.frame with a similar structure to the output of tdata(phyd.ex). subnode.branch is a 2-column matrix holding the ancestor and descent node ids of the branch on which the subnodes lies. Finally, subnode.position is a 2-column matrix which holds to position of the subnode along the branch as a fraction of the branch length. (NOTE: It is a 2-column matrix because we plan to use two positions as credibility intervals. This has not been implemented yet.)
Adding a subnode to phyd.ex is fairly simple:
newdat = c("standard_char"=1)
phyd.ex = addSubNode(phyd.ex, 5, 1, 0.25, newdat, pos.is.fraction=TRUE)
The data for the subnode is first stored in newdat as a named vector. It could also be a data.frame as well. The key is to make sure that names(newdat) shares some common column(s) with names(tdata(phyd.ex)) - this is how RBrownie matches up data columns. Next, addSubNode is called. This can be read as: "To the object phyd.ex and a subnode between nodes 5 (the ancestor) and 1 (the descendant), which is a quarter way (0.25) along the branch, starting at node 5 (ancestor) and ending at node 1 (descendant), where the value of trait 'standard_char' is 1 along the portion of the branch from node 5 to the subnode.". We could have left pos.is.fraction as FALSE (it's default value), but argument #4 would have been treated as an absolute branch length and would have exceeded the branch length of 5 <--> 1 which is only 0.1.
When RBrownie reads in simmap-style tree strings it uses this function to add subnodes individually.
IMPORTANT NOTE: How should the value of 'standard_char' at the subnode be interpretted? The subnode does not give a range for which 'standard_char' should equal 1, it is just a point along the branch. So what then does subnode.pos indicate. RBrownie reads subnode.pos as the distance (as a fraction of the branch length) that the subnode is away from the ancestor node and the value of a trait at that subnode is considered valid between that subnode and the ancestor node (or another subnode, if there is one between this subnode and the ancestor node). This is much easiest to see in a plot (plot(phyd.ex)) - play around with adding subnodes at different position along different branches and then plotting the resulting trees. See the plotting tutorial for more on that.
Now take a look at the subnode accessors:
snid(phyd.ex) #
sndata(phyd.ex) #
snbranch(phyd.ex) #
snposition(phyd.ex) #
We should see the subnote we just added.
There are other accessor-like functions that can be used to get subnode information from a 'phylo4d_ext' object most of which are self-explanatory:
nSubNodes(phyd.ex)
hasSubNode(phyd.ex)
getSubNodeData(phyd.ex,"standard_char")
getSubNodePosish(phyd.ex)
getSubNodeEdgeInds(phyd.ex) # which edge index is each subnode is on?
#
showSubNodes(phyd.ex) # visual subnodes in the console
Converting between formats
Because, RBrownie exands ape and phylobase, there are ways to convert between 'RBrownie:::phylod4_ext' and 'phylobase:::phylo4', 'phylobase:::phylo4d', and 'ape:::phylo'. The easiest way to do this is simple to use R's build in 'as' function:
as(phyd.ex,"phylo4d") # looses subnodes + sndata
as(phyd.ex,"phylo4") # looses subnodes + all data
as(phyd.ex,"phylo") # looses subnodes + all data
Of course, naively down-converting 'phylo4d_ext' objects usually means loosing the topological information that subnodes provide. Sometimes this might not be desirable, so RBrownie provides some functions to down-convert while keeping subnode information intact. We'll look at that here
Subnodes are conceptually similar to singleton nodes (non-root nodes with an in-degree and out-degree of one), which phylobase recognizes and RBrownie provides methods to convert between subnodes and singleton nodes:
phyd.singles = collapse.subnodes(phyd.ex)
hasSingle(phyd.singles) # TRUE
class(phyd.singles) # 'phylo4d'
# convert back:
phyd.ex2 = phyext(phyd.singles)
phyd2 and phyd.ex should be similar and if they are not there is a bug in RBrownie (please report it!). Unfortunately, singletons in phylobase are not easy to work with and are of limited utility as, for example, they tend to error-out on plotting and writing functions. RBrownie gives us a way to convert singleton nodes into bifurcations with one zero-length branch attached to a junk node:
phy.junk = expand.singles(phyd.singles)
phyd.junk = expand.singles(phyd.singles,keep.data=TRUE)
which now allows you to down-convert our subnode data to the 'ape:::phylo' class, whereas with singletons and subnodes we cannot:
#tree = as(phy.singles,"phylo") # dangerous....
#tree = as(phyd.ex, "phylo") # subnodes are lost....
tree = as(phy.junk,"phylo") # convert to phylo with "subnodes"!
NOTE: The expand.singles function is relatively experimental, modifies the phyd.singles object in unstable ways, and might not work with future releases of phylobase (I'm using 0.6.2). Just keep that in mind when using it....
expand.singles might not have much utility beyond easier visualization of splits along branches, but it is there if needed.
'brownie' class
So far we've been looking mainly at phylobase and RBrownie classes and functions which are not related to RBrownie intimately. RBrownie provides another class to handle Brownie-specific analyses and data. This class is aptly named 'brownie'.
The 'RBrownie:::brownie' class is a relatively minor extension of 'RBrownie:::phylo4d_ext', adding two new slots: commands and datatypes. I'll start by explaining what these slots are and why they are necessary.
commands
commands is simply a vector of character strings where each character string represents a command to be run in the Brownie program (which RBrownie interfaces with). Like PAUP*, the Brownie program loads in a Nexus data file(s) followed by some number of commands to be executed using that data. The commands slot holds those commands which will be executed in orderafter all the other stuff in the 'brownie' object is loaded into Brownie.
To see an example of commands, load up RBrownie:
library(RBrownie)
data(parrot)
commands(parrot)
When it comes time to execute this brownie object, all the trees and data are internally loaded into the Brownie core and then these commands are executed.
RBrownie provides a number of useful functions which help construct analyses (see constructing and running analyses tutorial), but there are some useful functions to work with commands:
br = parrot[[1]]
br.nocmd = clearCommands(br)
br.nomodel = removeCommands(br,"model")
br.nomodel = removeCommands(br,c(1,3))
br.nocont = removeCommands(br,"cont")
commands(br, add=T,index=1) <- "model junk;"
The clearCommands removes all commands from the brownie object. The removeCommands removes either a certain indices or commands which start with a given character string. And finally, commands can be used to retrieve and insert new commands (see ?commands documentation for all options).
datatypes
The datatypes slot in 'brownie' classes holds information about each data column in the data slot. Why does it need to do this? Because Brownie makes a distinction between continuous and discrete data and looks for those two data types in separate nexus blocks: CHARACTERS and CHARACTERS2
NOTE: It doesn't matter to Brownie which block contains which data type
There is also a third datatype which is used to indicate that a data column is holding a taxa set (binary data where 1 -> in taxa set and 0 -> not in taxaset). Taxa sets are written to an ASSUMPTIONS nexus block and are used by Brownie during various tests
To see the string names of all datatypes type:
brownie.datatypes()
Note, that there is a forth for 'undef'ined data, but this should never be used directly and is usually a bad sign.
When an object is converted to class 'brownie', RBrownie calculates which datatype each column of the data slot is. We can see this when we convert our phyd.ex object to class 'brownie':
br = brownie(phyd.ex)
class(br) # 'brownie'
datatypes(br)
Does anything look wrong? Not necessarily, but it looks like RBrownie is treating the 'isrodent' column (index 4 - see colnames(tdata(br))) as continuous, but we'd like it to be discrete. Using the datatypes replace accessor, we can do that:
datatypes(br)[4] # 'cont'
datatypes(br)[4] <- discData()
datatypes(br)[4] # 'discrete'
Notice that RBrownie provides methods which return the correct characters strings for each data type. These functions are: discData(), contData(), taxData(), and genericData(). These are useful in locating certain datatypes. For example, if we wanted to check if any data we read in was considered 'undefined', we'd do this:
any( datatypes(br) == genericData() ) # FALSE
RBrownie uses these methods internally to return different types of data.
Taxasets are a special case which warrant more investigation. In our br object, there are currenly no taxasets:
hasTaxasets(br) # FALSE
taxasets(br) # empty data frame
However, let's consider the 'isrodent' column a taxaset and re-purpose it like we did before:
datatypes(br)[4] <- taxData()
Now taxasets(br) shows one column with our 'isrodent' data. Say we'd like to add another taxaset for future Brownie analyses. Currently this is done through the S4 replace method:
taxasets(br,taxnames="washington") <- c("D","C")
# OR
taxasets(br,taxnames="ab") <- c(1,1,0,0)
NOTE: I know this is non-S4 behavior and am working to fix it
In the first instance, we added a new taxa set called "washington" using a vector taxa names. In the second instances we added taxa set "ab" using a vector of binary data, indicating which taxa (in order) are in (1) or out (0).
To remove a taxaset, try:
br = removeTaxasets(br,"washington")
br = removeTaxasets(br,"ab")
To visualize a taxa set, trying plotting it:
plot.taxaset(br,"isrodent")
RBrownie plotting methods will be covered in detail in a future tutorial. For now though, notice how the color red covers all branches in the 'isrodent' taxaset which does not look to be monophyletic. We can also check for monophyly and paraphyly directly:
areTaxaMono(br,1) # FALSE
areTaxaPara(br,1) # TRUE
That about covers the extra bits of the 'brownie' class, instances of which are useful really only for setting up Brownie program analyses which are covered in the next tutorial.
Conclustion
Hopefully this tutorial has gotten you confortable working with the new classes which RBrownie adds to R. If you have any specific questions or bug reports please post them! Thanks!