Phyloseminar 2015: BAMM tutorial

BAMM Tutorial Phyloseminar 2015.pdf

This tutorial largely follows the tutorials by Dan Rabosky on his website.


BAMM consists of two components, the program that runs the actual analysis is in C++ (to be faster), while the analysis of the outputs happens in R.


Easiest might be to install it from the executables (both for Win and OS X), although from source and using homebrew is also possible.

  1. Go to the Download-Page
  2. download the appropriate BAMM version for your platform
  3. download the zipped example files
  4. unpack everything, in case of OS X, move to the appropriate directory in the command line and type

    tar -xzf bamm-2.3.0-MacOSX.tar.gz
  5. move the files and folders to the directory you want


For this one, simply open R and type


If it is installed, load the package using



In order to run BAMM, we need nothing more than a tree and a controlfile. We will use the whale examples in this tutorial. To run your own analysis, it is a good idea to simply modify the example controlfile or use the template controlfile.

Let’s start the whale analysis and then look at the controlfile.

Run the Whales

It is easiest to use BAMM when all files are in the same directory where BAMM is, so:

  1. move the whale tree and the controlfile in the same folder as BAMM
  2. to get a result that is looking a bit more realistic, we open the ‘divcontrol’ file and change numberOfGenerations from 10k to 1Mio
  3. change the directory of your command line to that folder
  4. start the analysis by typing:
./bamm -c divcontrol.txt   # on windows you can omit the './'
  1. wait for the MCMC to start running
  2. the output files will be
  • chain_swap.txt
  • event_data.txt
  • mcmc_out.txt
  • prior_probs.txt
  • run_info.txt

The main data we want to look at later is in the event_data file, and we look at the file mcmc_out to assess convergence. You might want to keep the run_info file with your other files in order to know what settings you used for a specific analysis.

A look at the controlfile

Go to the examples folder and find the ‘whales’ controlfile and tree in the ‘diversification’ subfolder. Open the controlfile ‘divcontrol.txt’. The file is pretty well annotated and most points should be fairly self-explanatory, we will quickly discuss some of them though.


  • General Setup and Data Input
  • Priors
  • MCMC Simulation Settings & Output Options
  • Operators: MCMC Scaling Operators
  • Operators: MCMC Move Frequencies
  • Initial Parameter Values
  • Metropolis Coupled MCMC
  • Numerical and Other Parameters

Most values are preset in a way that will work out well, and most users work along the policy to not touch anything if you don’t know what it does, but a few parameters are worth looking at (the bold ones you definitely have to modify for your own analysis, or at least consider it…):

  • modeltype
  • treefile
  • useGlobalSamplingProbability
  • globalSamplingFraction
  • sampleProbsFilename

  • all the priors…

  • numberOfGenerations
  • mcmcWriteFreq
  • eventDataWriteFreq
  • printFreq
  • acceptanceRateFreq

  • scaling operators…

  • move frequencies…

  • initial values…

  • numberOfChains
  • deltaT
  • swapPeriod

  • minCladeSizeForShift
  • segLength

More in-depth treatment of how to fine-tune the analysis and how to use advanced features can be found here

Analyse outputs with BAMMtools

Now we want to dig into the outputs of the analysis, which we do in R. There are a number of different options regarding what we can do and should do, and we try to look at the most important ones here.


Just like for any MCMC analyses, e.g. when we date a tree with BEAST, we want to know if we ran the analysis long enough to reach convergence. To do this for BAMM, we look at the file mcmc_out in R:

mcmc <- read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/mcmc_out.txt", header=T)  # load file
dim(mcmc)  # check dimensions
head(mcmc)  # check head

Now we can look at the trace plot of the log likelihood per generation (and if we want at the numbers of shifts per generation)

plot(mcmc$logLik ~ mcmc$generation)
plot(mcmc$N_shifts ~ mcmc$generation)

Now we want to discard a reasonable portion as burnin, e.g. 10%…

burnstart <- floor(0.1 * nrow(mcmc))
mcmcpost <- mcmc[burnstart:nrow(mcmc), ]

… and calculate the effective sample size, which should exceed 200 if our analyses ran long enough (although different people might argue for different minimal values here).

effectiveSize(mcmcpost$logLik)  # calculates autocorrelation function
effectiveSize(mcmcpost$N_shift)  # effective size on N_shifts
## do this on pre-burnin MCMC

Rate Shifts

Now for the thing we are most interested in: the rate shifts. We load the package BAMMtools, the tree we used and the event_data file, for which we specify the same burnin we decided on before:

tree <- read.tree("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/whaletree.tre")
ed <- getEventData(tree, "/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/event_data.txt", burnin=0.1)

It is possible to subset the event data file, if the size is too big for R to handle.

Phylorates Plots

Now we can just look at the mean rates inferred:

plot.bammdata(ed, legend=T, spex='netdiv')

We can also look at speciation or extinction rates separately:

plot.bammdata(ed, legend=T, spex='s')
plot.bammdata(ed, legend=T, spex='e')

We can get a first idea on what number of shifts were most commonly found using summary (this just tells us how many samples contain a certain number of shifts, but not WHERE those shifts are):


Bayes Factors for Shift Models

To get a more robust idea of how certain shift models are performing compared to each other, we can calculate Bayes Factors:

postfile <- mcmc  # since we already loaded this
priorfile <- read.csv("/Users/orlandoschwery/Documents/UT/Courses/Fall 15/Phyloseminar/BAMMtutorial/prior_probs.txt", header=T)
bfmat <- computeBayesFactors(postfile, priorfile, burnin=0.1)

The resulting table shows how models with certain numbers of shifts compare to a specific other one, BFs of >20 indicate good support that a model is better than the one it is compared to, BFs of >50 indicate very strong support.

Credible Shift Sets

What our data actually looks like now, are different samples of shift regimes, all of which are sampled at a certain frequency. We can access each of them, and it is a good idea to visualize the most frequent ones:

pset <- getBranchShiftPriors(tree, priorfile)
cset <- credibleShiftSet(ed, pset, BFcriterion=3)
plot.credibleshiftset(cset, lwd=2.5)

Best Shift Set

Based on the credible shift sets, we might conclude that the most commonly sampled configuration of shifts is the best one:

best <- getBestShiftConfiguration(ed, prior = pset)
plot.bammdata(best, lwd = 2)
addBAMMshifts(best, cex=2.5)

Maximum Shift Credibility

Alternatively, pick the shiftset that maximizes the marginal likelihood of a shift on a branch, a set of maximum shift credibility, similar to when we get a maximum clade credibility tree out of a set of trees:

msc.set <- maximumShiftCredibility(ed, maximize='product')
msc.config <- subsetEventData(ed, index = msc.set$sampleindex)
plot.bammdata(msc.config, lwd=2)
addBAMMshifts(msc.config, cex = 2)

Macroevolutionary Cohort Analysis

When we have rate shifts to lower or higher rates, some sampled configurations might cancel each other out ( i.e. some say speedup for sisterclade A, some say slowdown for sisterclade B). We can catch this (and more) by looking at the cohort matrix:

cmat <- getCohortMatrix(ed)
cohorts(cmat, ed)