BAMM Tutorial Phyloseminar 2015
This tutorial largely follows the tutorials by Dan Rabosky on his website.
BAMM consists of two components, the program that runs the actual analysis is in C++ (to be faster), while the analysis of the outputs happens in R.
Easiest might be to install it from the executables (both for Win and OS X), although from source and using homebrew is also possible.
- Go to the Download-Page
- download the appropriate BAMM version for your platform
- download the zipped example files
unpack everything, in case of OS X, move to the appropriate directory in the command line and type
tar -xzf bamm-2.3.0-MacOSX.tar.gz
move the files and folders to the directory you want
For this one, simply open R and type
If it is installed, load the package using
In order to run BAMM, we need nothing more than a tree and a controlfile. We will use the whale examples in this tutorial. To run your own analysis, it is a good idea to simply modify the example controlfile or use the template controlfile.
Let’s start the whale analysis and then look at the controlfile.
Run the Whales
It is easiest to use BAMM when all files are in the same directory where BAMM is, so:
- move the whale tree and the controlfile in the same folder as BAMM
- to get a result that is looking a bit more realistic, we open the ‘divcontrol’ file and change
numberOfGenerationsfrom 10k to 1Mio
- change the directory of your command line to that folder
- start the analysis by typing:
./bamm -c divcontrol.txt # on windows you can omit the './'
- wait for the MCMC to start running
- the output files will be
The main data we want to look at later is in the event_data file, and we look at the file mcmc_out to assess convergence. You might want to keep the run_info file with your other files in order to know what settings you used for a specific analysis.
A look at the controlfile
Go to the examples folder and find the ‘whales’ controlfile and tree in the ‘diversification’ subfolder. Open the controlfile ‘divcontrol.txt’. The file is pretty well annotated and most points should be fairly self-explanatory, we will quickly discuss some of them though.
- General Setup and Data Input
- MCMC Simulation Settings & Output Options
- Operators: MCMC Scaling Operators
- Operators: MCMC Move Frequencies
- Initial Parameter Values
- Metropolis Coupled MCMC
- Numerical and Other Parameters
Most values are preset in a way that will work out well, and most users work along the policy to not touch anything if you don’t know what it does, but a few parameters are worth looking at (the bold ones you definitely have to modify for your own analysis, or at least consider it…):
all the priors…
More in-depth treatment of how to fine-tune the analysis and how to use advanced features can be found here
Analyse outputs with BAMMtools
Now we want to dig into the outputs of the analysis, which we do in R. There are a number of different options regarding what we can do and should do, and we try to look at the most important ones here.
Just like for any MCMC analyses, e.g. when we date a tree with BEAST, we want to know if we ran the analysis long enough to reach convergence. To do this for BAMM, we look at the file mcmc_out in R:
mcmc <- read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/mcmc_out.txt", header=T) # load file dim(mcmc) # check dimensions head(mcmc) # check head
Now we can look at the trace plot of the log likelihood per generation (and if we want at the numbers of shifts per generation)
plot(mcmc$logLik ~ mcmc$generation) plot(mcmc$N_shifts ~ mcmc$generation)
Now we want to discard a reasonable portion as burnin, e.g. 10%…
burnstart <- floor(0.1 * nrow(mcmc)) mcmcpost <- mcmc[burnstart:nrow(mcmc), ]
… and calculate the effective sample size, which should exceed 200 if our analyses ran long enough (although different people might argue for different minimal values here).
library(coda) effectiveSize(mcmcpost$logLik) # calculates autocorrelation function effectiveSize(mcmcpost$N_shift) # effective size on N_shifts ## do this on pre-burnin MCMC effectiveSize(mcmc$logLik) effectiveSize(mcmc$N_shift)
Now for the thing we are most interested in: the rate shifts. We load the package BAMMtools, the tree we used and the event_data file, for which we specify the same burnin we decided on before:
library(BAMMtools) tree <- read.tree("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/whaletree.tre") ed <- getEventData(tree, "/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/event_data.txt", burnin=0.1)
It is possible to subset the event data file, if the size is too big for R to handle.
Now we can just look at the mean rates inferred:
plot.bammdata(ed, legend=T, spex='netdiv')
We can also look at speciation or extinction rates separately:
plot.bammdata(ed, legend=T, spex='s') plot.bammdata(ed, legend=T, spex='e')
We can get a first idea on what number of shifts were most commonly found using summary (this just tells us how many samples contain a certain number of shifts, but not WHERE those shifts are):
Bayes Factors for Shift Models
To get a more robust idea of how certain shift models are performing compared to each other, we can calculate Bayes Factors:
postfile <- mcmc # since we already loaded this priorfile <- read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/prior_probs.txt", header=T) bfmat <- computeBayesFactors(postfile, priorfile, burnin=0.1)
The resulting table shows how models with certain numbers of shifts compare to a specific other one, BFs of >20 indicate good support that a model is better than the one it is compared to, BFs of >50 indicate very strong support.
Credible Shift Sets
What our data actually looks like now, are different samples of shift regimes, all of which are sampled at a certain frequency. We can access each of them, and it is a good idea to visualize the most frequent ones:
pset <- getBranchShiftPriors(tree, priorfile) cset <- credibleShiftSet(ed, pset, BFcriterion=3) plot.credibleshiftset(cset, lwd=2.5)
Best Shift Set
Based on the credible shift sets, we might conclude that the most commonly sampled configuration of shifts is the best one:
best <- getBestShiftConfiguration(ed, prior = pset) plot.bammdata(best, lwd = 2) addBAMMshifts(best, cex=2.5)
Maximum Shift Credibility
Alternatively, pick the shiftset that maximizes the marginal likelihood of a shift on a branch, a set of maximum shift credibility, similar to when we get a maximum clade credibility tree out of a set of trees:
msc.set <- maximumShiftCredibility(ed, maximize='product') msc.config <- subsetEventData(ed, index = msc.set$sampleindex) plot.bammdata(msc.config, lwd=2) addBAMMshifts(msc.config, cex = 2)
Macroevolutionary Cohort Analysis
When we have rate shifts to lower or higher rates, some sampled configurations might cancel each other out ( i.e. some say speedup for sisterclade A, some say slowdown for sisterclade B). We can catch this (and more) by looking at the cohort matrix:
cmat <- getCohortMatrix(ed) cohorts(cmat, ed)