BAMM Tutorial Phyloseminar 2015.pdf
This tutorial largely follows the tutorials by Dan Rabosky on his website.
Installation
BAMM consists of two components, the program that runs the actual analysis is in C++ (to be faster), while the analysis of the outputs happens in R.
BAMM
Easiest might be to install it from the executables (both for Win and OS X), although from source and using homebrew is also possible.
 Go to the DownloadPage
 download the appropriate BAMM version for your platform
 download the zipped example files

unpack everything, in case of OS X, move to the appropriate directory in the command line and type
tar xzf bamm2.3.0MacOSX.tar.gz

move the files and folders to the directory you want
BAMMtools
For this one, simply open R and type
install.packages('BAMMtools')
If it is installed, load the package using
library(BAMMtools)
Run BAMM
In order to run BAMM, we need nothing more than a tree and a controlfile. We will use the whale examples in this tutorial. To run your own analysis, it is a good idea to simply modify the example controlfile or use the template controlfile.
Let’s start the whale analysis and then look at the controlfile.
Run the Whales
It is easiest to use BAMM when all files are in the same directory where BAMM is, so:
 move the whale tree and the controlfile in the same folder as BAMM
 to get a result that is looking a bit more realistic, we open the ‘divcontrol’ file and change
numberOfGenerations
from 10k to 1Mio  change the directory of your command line to that folder
 start the analysis by typing:
./bamm c divcontrol.txt # on windows you can omit the './'
 wait for the MCMC to start running
 the output files will be
 chain_swap.txt
 event_data.txt
 mcmc_out.txt
 prior_probs.txt
 run_info.txt
The main data we want to look at later is in the event_data file, and we look at the file mcmc_out to assess convergence. You might want to keep the run_info file with your other files in order to know what settings you used for a specific analysis.
A look at the controlfile
Go to the examples folder and find the ‘whales’ controlfile and tree in the ‘diversification’ subfolder. Open the controlfile ‘divcontrol.txt’. The file is pretty well annotated and most points should be fairly selfexplanatory, we will quickly discuss some of them though.
Sections:
 General Setup and Data Input
 Priors
 MCMC Simulation Settings & Output Options
 Operators: MCMC Scaling Operators
 Operators: MCMC Move Frequencies
 Initial Parameter Values
 Metropolis Coupled MCMC
 Numerical and Other Parameters
Most values are preset in a way that will work out well, and most users work along the policy to not touch anything if you don’t know what it does, but a few parameters are worth looking at (the bold ones you definitely have to modify for your own analysis, or at least consider it…):
 modeltype
 treefile
 useGlobalSamplingProbability
 globalSamplingFraction

sampleProbsFilename

all the priors…
 numberOfGenerations
 mcmcWriteFreq
 eventDataWriteFreq
 printFreq

acceptanceRateFreq

scaling operators…

move frequencies…

initial values…
 numberOfChains
 deltaT

swapPeriod
 minCladeSizeForShift

segLength
More indepth treatment of how to finetune the analysis and how to use advanced features can be found here
Analyse outputs with BAMMtools
Now we want to dig into the outputs of the analysis, which we do in R. There are a number of different options regarding what we can do and should do, and we try to look at the most important ones here.
Convergence
Just like for any MCMC analyses, e.g. when we date a tree with BEAST, we want to know if we ran the analysis long enough to reach convergence. To do this for BAMM, we look at the file mcmc_out in R:
mcmc < read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/mcmc_out.txt", header=T) # load file
dim(mcmc) # check dimensions
head(mcmc) # check head
Now we can look at the trace plot of the log likelihood per generation (and if we want at the numbers of shifts per generation)
plot(mcmc$logLik ~ mcmc$generation)
plot(mcmc$N_shifts ~ mcmc$generation)
Now we want to discard a reasonable portion as burnin, e.g. 10%…
burnstart < floor(0.1 * nrow(mcmc))
mcmcpost < mcmc[burnstart:nrow(mcmc), ]
… and calculate the effective sample size, which should exceed 200 if our analyses ran long enough (although different people might argue for different minimal values here).
library(coda)
effectiveSize(mcmcpost$logLik) # calculates autocorrelation function
effectiveSize(mcmcpost$N_shift) # effective size on N_shifts
## do this on preburnin MCMC
effectiveSize(mcmc$logLik)
effectiveSize(mcmc$N_shift)
Rate Shifts
Now for the thing we are most interested in: the rate shifts. We load the package BAMMtools, the tree we used and the event_data file, for which we specify the same burnin we decided on before:
library(BAMMtools)
tree < read.tree("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/whaletree.tre")
ed < getEventData(tree, "/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/event_data.txt", burnin=0.1)
It is possible to subset the event data file, if the size is too big for R to handle.
Phylorates Plots
Now we can just look at the mean rates inferred:
plot.bammdata(ed, legend=T, spex='netdiv')
We can also look at speciation or extinction rates separately:
plot.bammdata(ed, legend=T, spex='s')
plot.bammdata(ed, legend=T, spex='e')
We can get a first idea on what number of shifts were most commonly found using summary (this just tells us how many samples contain a certain number of shifts, but not WHERE those shifts are):
summary(ed)
Bayes Factors for Shift Models
To get a more robust idea of how certain shift models are performing compared to each other, we can calculate Bayes Factors:
postfile < mcmc # since we already loaded this
priorfile < read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/prior_probs.txt", header=T)
bfmat < computeBayesFactors(postfile, priorfile, burnin=0.1)
The resulting table shows how models with certain numbers of shifts compare to a specific other one, BFs of >20 indicate good support that a model is better than the one it is compared to, BFs of >50 indicate very strong support.
Credible Shift Sets
What our data actually looks like now, are different samples of shift regimes, all of which are sampled at a certain frequency. We can access each of them, and it is a good idea to visualize the most frequent ones:
pset < getBranchShiftPriors(tree, priorfile)
cset < credibleShiftSet(ed, pset, BFcriterion=3)
plot.credibleshiftset(cset, lwd=2.5)
Best Shift Set
Based on the credible shift sets, we might conclude that the most commonly sampled configuration of shifts is the best one:
best < getBestShiftConfiguration(ed, prior = pset)
plot.bammdata(best, lwd = 2)
addBAMMshifts(best, cex=2.5)
Maximum Shift Credibility
Alternatively, pick the shiftset that maximizes the marginal likelihood of a shift on a branch, a set of maximum shift credibility, similar to when we get a maximum clade credibility tree out of a set of trees:
msc.set < maximumShiftCredibility(ed, maximize='product')
msc.config < subsetEventData(ed, index = msc.set$sampleindex)
plot.bammdata(msc.config, lwd=2)
addBAMMshifts(msc.config, cex = 2)
Macroevolutionary Cohort Analysis
When we have rate shifts to lower or higher rates, some sampled configurations might cancel each other out ( i.e. some say speedup for sisterclade A, some say slowdown for sisterclade B). We can catch this (and more) by looking at the cohort matrix:
cmat < getCohortMatrix(ed)
cohorts(cmat, ed)
Fin!