Grad Core Phylogenetics 2010

Syllabus Fall 2010:

Day 1

Topics:
What trees are used for
history of life + cool discoveries about topology
Tree terminology
clades (dinos birds, reptiles, fish)
sister groups, crown/stem, etc.
rank and its importance
homology
Slides: pdf, keynote, mov (QuickTime movie)
R assignment:
By the start of class Wednesday, please:
1) Install R on a computer you can access (you don’t need to bring it in, so a desktop at home is fine). You can get an installer from http://cran.r-project.org/ .
2) Read this web page: http://bodegaphylo.wikispot.org/Phylogenetics_and_Comparative_Methods_in_R
3) Read and do the exercises (aka “type ~8 lines”) from this web page: http://bodegaphylo.wikispot.org/I._Getting_Started_with_R . Do the task view route (it’s easier).
4) Write me an email with the info you get from the See Also section when you run ?read.tree after doing the tutorial (should be one line).

Day 2

Topics:

Treespace/NP-nasty

History

Distance

Parsimony

Long branch attraction
Slides: pdf, keynote, mov (QuickTime movie)
R assignment: Plot the number of trees vs the number of taxa.
Here is the second assignment. Note that http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf is a very readable tutorial you may want to download (written by the same guy who wrote the phylogenetic package ape). The homework is due by start of class on Friday.
1. Fire up R. Input library(phangorn) to load the Phangorn package (who gets the geeky reference? Why the F->Ph transition? Get used to this sort of thing). If this gives you an error, try installing it — install.packages(“phangorn”)
2. Create a plain text file (no formatting: .txt rather than .rdf) called treespace.R. We’re going to be typing our commands in this.
3. Go to http://cran.r-project.org/doc/manuals/R-intro.html
4. Read and understand sections 1.5, 1.7, 1.10, 2.1, 2.3, 2.8, 3.2, 12.1, 12.1.4, and 13 [the other bits are handy, too, but I want to minimize your workload].
5. Read the documentation for allTrees (just type ?allTrees ), plot (same way), and length .
6. Treespace is big, as we’ll learn on Friday. Let’s see how big. In your treespace.R file, include the following commands (don’t copy in “7. Don’t paste this…” or the lines after that into the file). Remember that anything after a # sign is ignored by R (it’s a comment). I’m also attaching this as a file in case the formatting is affected by emailing it.

library(phangorn) #to load the phangorn package
x<-c() #creates an empty x vector
y<-c() #creates an empty y vector
for (number_of_taxa in 3:8) { #this starts a loop: the first time through, variable number_of_taxa=3, the next time, it equals 4, then 5, etc. up to and including 8.
    print("now doing this number of taxa:")
    print(number_of_taxa)
    x[number_of_taxa-2]<-number_of_taxa #here we are sticking the number of taxa into vector x. x[1] is the first slot in that vector, x[2] is the second slot in that vector… (it expands if it is not that size already -- not all languages do that). Think about why we do number_of_taxa - 2 rather than number_of_taxa.
    #WAY TO STORE THE LENGTH OF THE OUTPUT FROM allTrees, WHICH IS THE NUMBER OF POSSIBLE TREES, FOR PLOTTING ON THE Y-AXIS
}
#PLOT THE NUMBER OF POSSIBLE TREES VERSUS THE NUMBER OF TAXA (DO A LOG SCALE FOR NUMBER OF TREES)

7. Don't paste this or following lines into your file
8. Replace the #WAY… and #PLOT… comments with functioning R code. I've given you hints (why did I have you look at what the allTrees command does? what's that y<-c() in the code for?). 9. Try running your code -- copy and paste into R, or on a Mac you can select all and then command+return to have it run (if you're editing the text within an R window), or you can use source("treespace.R") if you have set the working directory to the folder containing the file (?setwd or use an option in the menu). 10. Fix the errors, repeat 9 (see the joys of loops? -- "while (working==FALSE) { debug some more }"). 11. Send me your final figure (screenshot, pdf, etc.) and your treespace.R file. 12. Is the number of trees growing exponentially with the number of taxa? 13. Are you sure? BONUS (optional): How can you tell? 14. If you're still up for it (this part is optional, too), modify your code to look at all trees from 3 to 9 taxa rather than from 3 to 8. You'll need to be very patient. Why? Bang your head against this for awhile, then talk to your cohort-mates for help, and/or email me. A lot of programming issues, especially early on, are frustrating in the way these problems are (i.e., what function lets me count the number of trees? how do I stick things in a vector?), and it's salutary to try to work through them. There are generally several ways to solve each problem -- go for the one that works first. Later you can tweak with speed, memory usage, code beauty (yes, really), etc. There are a few mailing lists for help with R, and help with R for phylogenetics, but [most of you] are too new to R to start posting there.

Day 3

Topics:

Models

Model selection

Bayes/Likelihood

R code for animation to make image

Slides: pdf, keynote, mov (QuickTime movie)

R assignment: Do your own tree search

The next assignment should be pretty fun. You are going to write a heuristic search algorithm for getting the maximum parsimony tree in R after downloading a set of GenBank sequences, also within R. I've sketched out a lot of the code -- the main tricky bit should be figuring out how to store the best tree as you go and learning how to use an if() function. Comments in ALL CAPS are the sections you should focus on for changing. You're encouraged to play with this assignment: maybe you want to plot the scores as you search, maybe you want to try different search strategies, etc. Feel free to email me for help or talk to your cohort-mates (remember you can use the blackboard forum, too). R code below and attached.

And just think: on Monday, many of you had never coded before. Now, you're able to create a phylogenetic search program and pull things from an online database!

Brian

library(phangorn)
accessionNumbers <- c("U15717", "U15718", "U15719", "U15720", "U15721", "U15722", "U15723", "U15724") #These come from a paper by Paradis (1997) on Tanagers
#USE THE read.GenBank command TO DOWNLOAD THESE SEQUENCES FROM GENBANK AND PUT THEM IN AN OBJECT CALLED sequences.
#Note that it's read.GenBank, not read.Genbank, read.genbank, readGenBank, etc. Programming is usually very case-sensitive
distanceMatrix<-dist.dna(sequences) #get a distance matrix
print(distanceMatrix)
convertedSeqs<-phyDat(sequences) #convert to a format useful for phangorn
treeBad<-upgma(distanceMatrix)
#the UPGMA starting tree actually is not that bad (and a NJ tree is the same as the maximum parsimony tree). So let's make it challenging by giving you a bad starting tree (for larger data sets, you would not have this "problem" of a starting tree that is too good)
#note how to get the parsimony score of a tree (use convertedSeqs, not sequences)
while(parsimony(treeBad,convertedSeqs)<220) {
    treeBad<-rNNI(treeBad)
}
maximumTries<-50
#GET SCORE OF STARTING TREE, treeBad
#STORE THE CURRENT BEST TREE, WHICH IS treeBad
#PRINT THE STARTING SCORE
for (tryNumber in 1:maximumTries) {
    #GET A NEW TREE BY DOING newTree<-rNNI( __SOMETHING__ )
    #COMPARE THE SCORE OF THE NEW TREE TO THE OLD TREE
    #USE AN IF BLOCK (if (new #INFO ON IF() IS AT http://cran.r-project.org/doc/manuals/R-intro.html#Conditional-execution
    #HERE IS AN EXAMPLE BLOCK
    if (1<2) {
        print("one is less than 2")
    }
    #PERHAPS MORE THINGS AFTER THE IF BLOCK
}
#PRINT THE FINAL SCORE
#PLOT THE FINAL TREE
#BONUS: THINK OF A WAY TO DO AN EXHAUSTIVE SEARCH USING THE allTrees COMMAND. WILL THIS WORK FOR MANY MORE TAXA?

Day 4

Topics:

Bootstrap

Consensus

Clocking

Slides: pdf, keynote, mov (QuickTime movie)

Day 5

Topics:

Supertrees & Supermatrices

Gene tree species tree

Where to find trees

Slides: pdf, keynote, mov (QuickTime movie)