Phylogenetic analysis

Each item includes PAUP batch files
or commands where appropriate. I wrote this page for the Farrell Lab
website, but have included it here as well, where I continue to update
it (occasionally). This page is mostly useful for the batch files which
are on it. Some useful references which provide broader overviews of
phylogenetics are Phylogenetic Trees
Made Easy
by Barry Hall, chapters 11 and 12 of the somewhat
dated (1996) Molecular
Systematics, 2nd Edition
, edited by David Hillis, Craig Moritz,
and Barbara Mable, and the recently updated Molecular Systematics
website. A great new resource is Joe Felsenstein’s new book Inferring Phylogenies.
You may also check my page of books for phylogenetics.
PAUP also has a quick start
on its web page
as well as a useful discussion
for asking questions. Useful software links can be found on
my general links page. Please direct any
comments about this page to me.
This page was last updated on February 3,
2004 , though most of its information has not been
updated since June 10, 2002.

If you find this page useful, and think it may be useful to others,
please link to it from your website. This will rank it higher in search
engines, making it easier for others to find. Thanks!

Each step listed individually
NEXUS file preparation
Batch files
Exhaustive and branch-and-bound searches
(<13 taxa)
Heuristic searches
Parsimony ratchet
Bootstrap searches
Decay indices, including partitioned Bremer support
Choosing a
likelihood model

Testing for a molecular clock
Calibrating using a published calibration
Calibrating using fossil data
Computing error bars on node times
Bayesian searches (MrBayes)
Hypothesis testing
Tree drawing and editing

NEXUS file preparation

  1. Open the NEXUS file with MacClade
    4.03 or higher.
  2. Select characters for one locus by shift-clicking on characters
    for that locus.
  3. Under the Characters menu, choose “Character Set…” then “Store
    Character Set”.
  4. Name this set by the name of the locus (omit spaces)
  5. Do the same for other loci.
  6. Make sure that the genetic code is correct for your loci by going
    to “Genetic Code…” in the Characters menu. For mixed nuclear and
    mitochondrial loci, choose Drosophila mtDNA.
  7. For protein-coding loci, select all the characters in one exon,
    then go to Characters -> “Codon Positions” -> “Calculate
    Positions…”. Select the option “Choose to minimize stop codons”.
  8. Select non-protein-coding regions (i.e., introns, rDNA) and go to
    Characters -> “Codon Positions” -> “Non-coding”.
  9. Go to Taxa -> “Taxon list”.
  10. Move the outgroup taxa so that they are next to each other by
    dragging the number to the left of the taxon name.
  11. Select the whole group of outgroup taxa by shift-clicking on
    their numbers.
  12. Go to Taxa -> “Taxon sets” -> “Store taxon set”.
  13. Name this taxon set “outgroup”.
  14. Do the same for other taxa you would like in a set. Feel free to
    change the order of the taxa to do this.
  15. When you are done, click on the icon that looks like a Mayan
    pyramid, then click on one of the taxa. This alphabetizes the taxa,
    making later steps in PAUP easier.
  16. Go to File -> “Print list” to print the list, along with taxon
  17. Save the file and quit MacClade.
  18. Open PAUP 4.0b8 or higher. PAUP often has updates, so check its website to make sure you are
    using a recent version. Also check for bug reports.
  19. From within PAUP, open the file you just saved with MacClade.
    Choose “Edit”, not “Execute”.
  20. Scroll down to below the data matrix.
  21. NEXUS files are arranged in blocks, starting with “BEGIN ###;”
    and ending with “END;”. Go to the SETS block [skip this and the next 2
    instructions if you have only one locus or no coding regions].
  22. Start a new line. Type “CHARSET”, a space, then “###first = “,
    replacing ### with the name of your first coding locus.
  23. Look for the CODONS block. You should see something within this
    block that looks like “1: 1-1300\3 1400-2700\3”, with the numbers
    changed. If your first coding locus is 1302 long, for example, copy the
    “1-1300\3″and place it after the equals sign in the line you just
    created. If your first coding locus extends all the way to 2700 in this
    example, just with an intron in the middle, copy the entire “1-1300\3
    1400-2700\3” to after the equals sign in the line you just created.
  24. Type a semicolon at the end of this line after pasting in the
  25. You have just created a character set containing all the bases
    with the first codon position from your first coding locus. Repeat the
    same process for other loci, all three positions. The general format is
    “CHARSET [charset name] = [character numbers];”.
  26. You can do the same thing with sets of taxa. You should already
    have a taxset named “outgroup”. You can make other taxsets using the
    format “TAXSET [taxset name] = [taxon numbers];”. The taxon number is
    the same as the number on the taxon’s left in MacClade’s taxon list
    window, which you have printed out. It’s also the order of taxa within
    the DATA block.
  27. Create a PAUP block. Make sure this is after one of the “end;”
    lines but before the next “Begin ###;” lines. The format is “Begin
    PAUP;” skip a few lines, then “end;”. Don’t forget the semicolons.
  28. Within the PAUP block, create a
    character partition. The general format is “CHARPARTITION [partition
    name] = [charset name 1]:[charset name 1], [charset name 2]:[charset
    name 2], [charset name 3]:[charset name 3];” For example, if the data
    consists of 28S and COI sequence, you would have a charset called 28s,
    another charset called coI, and the charparition command would be
    “CHARPARTITION locus = 28s:28s, coI:coI;” This is used in ILD tests.
  29. Within the PAUP block, paste the following text:

    outgroup outgroup /only; set maxtrees=5000 increase=auto autoclose=no
    torder=right tcompress=yes taxlabels=full storebrlens=yes
    storetreewts=yes outroot=paraphyl;

    This sets the outgroup to be whatever you determined in the outgroup
    taxset, sets a reasonable starting maxtrees number, and sets other PAUP
    preferences to useful ones for starting a search. Other PAUP commands
    can be included in the PAUP block as well.

  30. You may find it useful to keep a running log of all analyses
    done. This can be helpful in checking what p-values were of various
    tests, how many replicates were done in a particular search, etc. To do
    this, paste the following text within the PAUP block:

    log file=[name] append=yes replace=no;

    If you do this, delete the log commands from other batch files (see
    below) so that everything is logged to one file.

Batch files

Batch files are ways to give PAUP commands without
going through the menus (Mac) or entering items one at a time in the
command line (all platforms). For example, you could set up a PAUP
batch file which would do a heuristic search saving the best trees from
each addition sequence replicate, save these trees to a file, filter
for the best trees, and finally save these best trees to a second tree
file. Batch files are also a good way to ensure consistency between
searches — for example. you won’t have to remember if you used 5 or 10
addition sequence replicates per bootstrap replicate, as that
information will be in the batch file. A batch file has the following

#nexus [Tells PAUP that
this is a NEXUS file]
begin PAUP; [Starts the PAUP block]
[PAUP commands];
[Gives PAUP instructions on what to do — the same thing you enter in
the command line]
end; [Ends the PAUP block]

To create a batch file, first create a new document
within PAUP or in some other text editor (such as SimpleText). Then
type “#nexus” in the first line and “begin paup;” in the second. After
that, include the PAUP commands you want — PAUP’s command reference
document, available from the downloads section of
its website, is invaluable in
constructing these. Finally, write “end;” at the end of the document.
Save the document. Now, executing the document, the same way you
execute a data file, will start the commands running.

All the batch files on this website can be copied into
a new PAUP document and executed. Several of the batch files have
places to insert the name of a file or set, often appearing as [name].
Make sure that the brackets are removed: for example, if the file is
called tree1, “file=[name]” should become “file=tree1”. PAUP makes
batch file troubleshooting easy — if there is a problem with the file,
PAUP will open the file and put the cursor where the error occured.
PAUP does not see anything in brackets, so comments written there will
not affect analyses. The only two exceptions to this are: 1) if the
first character within the bracket is an exclamation point, PAUP will
display the contents in the display buffer; and 2) Trees can have a
[&U] or [&R] to tell PAUP if they are rooted or not.

and branch-and-bound searches (<13 taxa)

Both these search strategies are guaranteed to find the
optimal tree. However, they are not generally practical for even
moderate numbers of taxa: finding the best tree is generally an NP-hard problem.
Thus, heuristic searches are necessary for most analyses.

Heuristic searches

Heuristic searches are not guaranteed to find the
globally-optimal tree, but they can work for many more taxa than
exhaustive or branch-and-bound searches. A good starting batch file is:

begin PAUP;
log file=hsearch1.log;
set autoclose=yes;
hsearch start=stepwise addseq=random nreps=100
savereps=yes randomize=addseq rstatus=yes hold=1 swap=tbr multrees=yes;
savetrees file=hsearch1.all.tre brlen=yes;
filter best=yes permdel=yes;
savetrees brlen=yes;
log stop;

This batch file will do a heuristic search, save all
the trees found in each random addition sequence replicate in the
hsearch1.all.tre file, then filter for the best trees overall and save
them in an file. PAUP will also output a tree-island
profile. A “tree-island” is a group of trees which cannot be reached
through branchswapping from a different group of trees. Ideally, all
the trees will be on one island which will have been hit every time, no
matter what starting tree or taxon addition order. If this is the case,
the treespace is fairly simple and we would probably move on to bootstrapping. If the island(s) with the best
trees was hit nearly all the time, we’d probably be satisfied, as well.
However, if the best trees are not recovered in many of the searches,
we would need to search further.

We use a variety of strategies for these next searches.
The first strategy is to simply increase the number of addition
sequence replicates. To do this, change the nreps=100 in the batch file
above to a higher number, perhaps nreps=1000. To avoid confusion, we
may also want to replace “hsearch1” with “hsearch2” wherever it appears
before running the search again.

Another strategy is to start from random trees, rather
than random taxon addition. On the above batch file, change the
randomize=addseq to randomize=trees, change the “hsearch#” as above,
and search again, with nreps to taste.

These searches will run longer than the initial search.
A way to speed up the search while covering more of tree space is to
increase the number of addition replicates but reducing the
thoroughness of the search for each replicate. We might do this if we
believe there may be islands of good trees PAUP is missing. A sample
batch file follows. You may want to change the nreps and timelimit.

begin paup;
log file=hsearch.tlimit.log;
set maxtrees=10000 increase=auto;
hsearch rstatus=no limitperrep=yes nreps=5000 randomize=trees
timelimit=5 savereps=yes;
savetrees file=hsearch.tlimit.all.tre brlen=yes;
filter best=yes permdel=yes;
savetrees brlen=yes;

The parsimony ratchet (below) can be useful in searching treespace
broadly, as well.

Parsimony ratchet

The parsimony ratchet is a way to search treespace by
reweights characters for some iterations of a search. It is especially
good for searches with large numbers of taxa. It is described by Kevin
Nixon (Nixon, K. C. 1999. “The Parsimony Ratchet, a new method for
rapid parsimony analysis.” Cladistics 15: 407-414). Derek Sikes
and Paul Lewis have written a program, called PaupRat,
which generates batch files to implement this search strategy in PAUP.
A few notes on the use of this: 1) It’s better to do several searches
of a moderate number of nreps in each search (create a new ratchet file
for each search) than one search with many nreps; 2) It’s useful to
insert the commands:

stopcmd “filter best=yes permdel=yes”;
stopcmd “savetrees”;

into the setup.nex file before the stopcmd “[quit]”; to
automatically filter for the best trees.


Bootstrap searches

Bootstrap searches can take quite some time. One useful
feature of PAUP is the ability to search on multiple computers or on
several different occasions and combine the results. The key to this is
saving the bootstrap trees from each search and then loading them all
together, then computing the consensus tree using the tree weights. You
must be sure to keep the bootstrap search settings (except for the
number of bootstrap replicates) the same between searches for this to
be valid. This can be done through the menus (make sure to hit the
“save trees to file” checkbox in the bootstrap menu), or through the
batch files (below).

For the search itself [the search can be stopped before
completing all the bootstrap replicates; if doing multiple searches,
the treefile name should be changed for each search. You may want to
change the number of bootstrap replicates (currently set at 500) and
the number of random taxon additions per bootstrap replicate (currently
set at a low value of 10)]:

begin paup;
set storetreewts=yes;
bootstrap nreps=500
treefile=bootstrap1.tre replace=no brlen=yes/ start=stepwise addseq=random nreps=10 savereps=no randomize=addseq hold=1
swap=tbr multrees=yes;

Load all the bootstrap trees, making sure to store tree
weights (an option which should have been made the default by the above
batch file), to load all blocks, and NOT to eliminate duplicate trees.
Executing the following batch file should set all these options and
load the trees saved as bootstrap1.tre in the active folder.

begin paup;
gettrees allblocks=yes duptrees=keep storetreewts=yes storebrlens=yes mode=7

Finally, compute a majority-rule consensus tree. The
batch file for this is:

begin paup;
log file=bootstrapconsensus.log;
contree /majrule=yes strict=no le50=yes usetreewts=yes showtree=yes
treefile=finalbootstrap.tre grpfreq=yes;
log stop;

Note that this will create a bootstrap tree which will include nodes
with support less than 50% which are consistent with the majority rule
tree. Most authors choose to omit bootstrap numbers less than 50% while
continuing to show the node; a better approach, if space allows, would
be to show all bootstrap values.

Though likelihood is far slower than parsimony, this does not mean
that likelihood is too slow for bootstrapping. To speed up likelihood
bootstrap searches, one can fix the values of nuisance parameters (for
example, use the lset output from ModelTest) and time limit each
heuristic search.

Decay indices,
including partitioned Bremer support

For simple decay indices, MacClade can generate a batch
file. Go to the tree window, then go to the sigma () menu -> “Decay Index Paup
File…”. For partitioned Bremer support, we use TreeRot (see
its detailed instruction manual).

Choosing a likelihood model

We generally use David Posada’s program ModelTest
to determine the appropriate likelihood model. This consists of a batch
file which has PAUP compute the likelihood score of a tree under
various models and a program which computes the likelihood ratio test
and the AIC criterion. When the two criteria choose different models,
we use the likelihood ratio test result. This procedure will be brought
into PAUP in one of the next updates, according to PAUP’s programmers.

Testing for a molecular clock

Once the appropriate likelihood model has been selected
by ModelTest, one can test for a clock. Copy the lscores line
corresponding to ModelTest’s chosen model from the modelblock file and
paste this over the [LSCORES] lines in the following batch file:

begin paup;
set criterion=distance;
log file=clocktest.log;
DSet distance=JC objective=ME base=equal rates=equal pinv=0
subst=all negbrlen=setzero;
NJ showtree=no breakties=random;
set criterion=likelihood;
lset clock=no;
;[!Non-clock score above, clock score below];
lset clock=yes;
;[!Clock score above];
log stop;

This will result in two ln likelihood scores. Take the
difference between the scores and double it. This is your test
statistic for a chi-squared test; the number of taxa minus 2 is the
degrees of freedom. You can use a program such as Microsoft Excel (the
chidist function) or this
web site
to compute the p-value. Significant p-values mean that the
clock is rejected. People generally try each locus independently; if
you have time, you could use ModelTest to determine the correct
likelihood model for each locus, or just use the model chosen for all
the loci combined.

the tree I: Using a previously-published calibration

There are generally two ways to put ages on nodes of a
clock tree: 1) Use a previously-published rate of percent divergence
for a particular gene versus time, or 2) Use fossil information (see
next entry). Reported calibration rates often are in terms of
uncorrected percent sequence divergence per million years
(“uncorrected” means uncorrected for multiple hits). To use this value
for your tree, compute all uncorrected pairwise distances for your taxa
in PAUP. With menus, use File -> Save Distances to File, then
Options -> Distance options and select uncorrected “p”. The batch
file would be (replacing [filename] with what you want to call the
distance file, without brackets):

begin paup;
dset distance=p;
savedist format=onecolumn file=[filename] undefined=asterisk;

Examine the ensuing list (in PAUP or Microsoft Excel)
to find a pair of taxa which have a percent divergence similar to the
ones in the taxa used in the original calibration. Using these taxa,
rather than the oldest or youngest pair, makes it more likely that the
proportion of uncorrected multiple hits will be the same for the
calibration and the pair being used to calibrate the entire tree (the
rate of uncorrected divergence versus time will go down with time as
multiple hits accumulate). The calibration rate can be used to
determine the age of the node separating the chosen pair of taxa. The
age of the node is then used to calibrate the rest of the tree.

the tree II: Using fossil data

The rate of molecular evolution may be affected by many
things: mutation rate, generation time, and perhaps even body size. A
calibration based on fossil data of the group being studied may be more
accurate than a calibration from other organisms. Multiple calibration
points will give the most precise estimate. Record infomation about all
the fossils in the group, not just the oldest. Fossils tend to give
just minimum ages for the group — think about ways to determine
maximum ages, as well. For example, the group may not be older than the
evolution of life on land, or may not be older than the oldest possible
age (which is not the age of the oldest known fossil) of an obligate
host. Biogeographic information may also be useful. Fossil
calibrations, and minimum ages in general, should be mapped to the stem
of the clade (= the most recent common ancestor of the clade and its
sister group). . Maximum ages should be mapped to the most recent
common ancestor (MRCA) of the clade constrained by that calibration
(the “crown”). For example, if we have a calibration for birds using
the oldest known bird fossil (a minimum age constraint), we know that
the split between birds and their sister group had to happen eariler
than that fossil, but not that the MRCA of the bird taxa included in
our study lived before that fossil. In contrast, if we say that we know
that Lepidoptera must have evolved after the first terrestrial plants
(a maximum age constraint), we know that the MRCA of Lepidoptera cannot
be older than that, but not that the split separating Lepidoptera from
their sister group occurred after it. The total branchlength from the
present to the node of each constraint on the clock tree is recorded
(the units do not matter, as long as the branch lengths are
proportional to time). This information, plus the age and type of the
calibration constraint, are entered into an Excel sheet which
calculated the maximum and minimum possible rates given all the
calibration points. This can easily be done manually, as well. Each
branch length is then multiplied by these rates to get the maximum and
minimum length of the branch in time. Note that the rate is actually
applied to the whole tree at once — the tree is stretched or
compressed by the maximum or minimum rates — the ratio of the ages of
nodes does not change with the different rates. Other methods must be
used to calculate error bars on the ages of nodes due to finite
sequence length rather than uncertain calibration. These methods are
explained below.

error bars on ages of nodes

For determining error bars on nodes, two methods can be
used, both implemented in Sanderson’s r8s program. The first
method involves looking at the shape of the likelihood surface. The
second method uses bootstrapped datasets and calculates error bars from
the distribution of branchlengths on the bootstrapped dataset. We have
done the latter method using PAUP and Excel, due to a lack of Linux x86
computers to run r8s. First, Phylip‘s
SeqBoot is used to create bootstrapped datasets. These are edited in
PAUP to merge them into one large dataset (for example, if the original
dataset is 3000 characters long, and 500 bootstrap replicate datasets
are created, they are edited to make a single 1,500,000 character
dataset. A PAUP batch file is then constructed in Excel which divides
the combined datasets into character sets and then has PAUP evaluate
the branchlengths of the given topology (the clock tree being
evaluated) for each replicate dataset (corresponding to a charset). A
sample batch file is below for a dataset 2432 characters long. Some of
the repetitive commands have been deleted, but the pattern should be
clear. Before executing this batch file, load your long datafile, set
the likelihood parameters, and load your clock tree.

begin paup;
charset rep1 = 1 – 2432 ;
charset rep2 = 2433 – 4864 ;
charset rep3 = 4865 – 7296 ;
[…. repetitive commands not included here. Just use Excel to
follow the pattern….]
charset rep500 = 1213569 – 1216000 ;
exclude all;
include rep1 ; [! rep 1 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes
exclude all;
include rep2 ; [! rep 2 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes
exclude all;
include rep3 ; [! rep 3 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes
[…. repetitive commands not included here. Just use Excel to
follow the pattern….]

exclude all;
include rep500 ; [! rep 500 ]
savetrees root=yes brlen=yes maxdecimals=6 replace=no append=yes

The trees are all saved in one file with branchlengths
by PAUP. This tree file is then opened within Excel, with the length of
each branch put in a column. The total length of branches from the root
to the node of interest is computed for each tree and divided by the
distance from the root to the tips. Then the percentile function is
used in Microsoft Excel to determine the 95% confidence interval on the
distance from the root to the node of interest, with the distance
expressed as proportion of total tree length.

Bayesian searches (MrBayes)

Bayesian phylogenetic inference is a new way of finding
trees and tree support, in terms of probability of the topology given
the data and your prior beliefs (or the default prior beliefs of your
program). I use MrBayes
for this. Good introductions to the proper use of MrBayes (such as
diagnosing convergence) can be found at Frederick Ronquist’s site.


We often want to test our hypotheses of evolution. For
example, we may ask if the tree we get from our analysis is
significantly different than the traditional phylogeny, or whether the
observed paraphyly of a particular group is strongly supported, or
other questions. There are several ways to do this. The first step in
most of them is to create a constraint tree. Open the data file in
MacClade and go to the tree window. Create a random tree. Then use
MacClade’s tree drawing tools to create the constraint tree. A
constraint tree for the monophyly of a certain group would have only
one resolved internode branch, that between the group and the rest of
the taxa. A constraint tree for a previously-existing phylogenetic
hypothesis would be more resolved. Export these trees from MacClade,
one at a time, naming each appropriately.

Open PAUP and execute the data file. Under analysis,
choose load constraints (don’t load as backbone constraint). Then,
depending on the analysis and optimization criterion:

For Bayesian analyses: Load the trees found
after stationarity. Filter the trees to find those consistent with the
constraint tree. The command for this is “filter constraint=[constraint
name]”, where [constraint name] is replaced by the name of your
constraint tree. Record the number of these trees. This number, divided
by the total number of post-stationarity trees, is the posterior
probability of the hypothesis represented by the constraint tree. For
example, if 60 out of 1000 trees are left after filtering for trees
with a certain group monophyletic, the probability of monophyly of this
group is 6%.

Under the parsimony criterion: Perform a
thorough tree search with the constraint enforced [if using a batch
file, insert “enforce=yes constraint=[constraint name]” within the
hsearch command (somewhere between “hsearch” and the semicolon),
replacing [constraint name] with the name of your constraint]. This
generates the most parsimonious tree[s] which meet this constraint.
Save these trees. Then load your most parsimonious trees found without
the constraint without replacing the trees you just found [go to
options in the get trees window and fill in both circles, or use
“gettrees mode=7 file=[tree file name];”, replacing [tree file name]
with the name of your tree file]. The Templeton and winning-sites tests
are the appropriate tests to use in this situation and can be selected
under Trees -> Tree Scores… -> Parsimony, then choosing
nonparametric tests. The command for this is “pscores
/nonparamtest=yes;”. The Kishino-Hasegawa test, which is also
available, is inappropriate when comparing trees which were not
specified a priori. We often report this value anyway.

Under the likelihood criterion: Perform a
thorough tree search with the constraint enforced [if using a batch
file, insert “enforce=yes constraint=[constraint name]” within the
hsearch command (between “hsearch” and the following semicolon),
replacing [constraint name] with the name of your constraint]. This
generates the likeliest tree which meets this constraint. Save this
tree. Then load all likely trees without replacing the trees you just
found [go to options in the get trees window and fill in both circles,
or use “gettrees mode=7 file=[tree file name];”, replacing [tree file
name] with the name of your tree file]. What “all likely trees” means
is difficult. Some people include the maximum likelihood tree and the
most parsimonious trees. You could also include all trees with a length
no greater than 5 more than the most parsimonious trees or something
like that. Then go to Trees -> Tree Scores… -> Likelihood, then
click on the button for topology based tests. Choose the
Shimodaira-Hasegawa test. This test measures whether some trees are
better than others under likelihood. The Kishino-Hasegawa test, one
many people use, is inappropriate (non-conservative) unless all the
trees have been specified a priori (specified without using any
of the data for a tree search). We often report the value anyway,
though noting it’s not valid.

Tree drawing and

To prepare trees for publication, we use Acrobat PDF
Writer. This printer extension allows trees to be saved as PDF files
from PAUP, complete with correct branchlengths, scale bars, titles, and
bootstrap proportions. These trees may then be opened in Adobe
Illustrator and have branches colored, line thicknesses changed, keys
added, etc. The trees can be saved in PDF format or output to other
formats for publication.