Open source software

What is open source?

Open source software is software where you can see, and typically modify, the source code.

What is source code?

People understand things like this: "print('hello')". Computers understand something like this: '<C3>^D<85><C0>u<F7><89>$^L<E8>'. You write programs in people-speak (the source code), then they are converted (compiled) into computer-speak. In a language like C++, this is typically done once, and you then use the compiled program (called the "binary"). In a language like Perl, the conversion happens every time the program is run.

When you buy a program like Microsoft Excel, all you get is the binary -- something written in computer-speak. You can't understand this or (practically) go back to people-speak. This has some advantages for the software company -- if Microsoft has a really good approach to calculating averages of huge lists of numbers, a competing software company can't use the idea because it can't see how it is done, so Microsoft maintains a competitive advantage. And if Microsoft sells a lot of copies of its software, it can pay clever developers to keep working on it, ideally making it better.

However, if there's an error in the software (for example, in some versions of Excel, the random number generator when told to make numbers between 0 and 1 would sometimes spit out a huge number instead), users can't look to see where the error is or fix it themselves. Closed source also prevents people from building on existing software. A good example of the utility of reuse in phylogenetics is Schluter's ANCML -- to develop this, he took some source code from Felsenstein's PHYLIP package and modified it to do ancestral state reconstruction. This saved Schluter the work of coding everything from scratch and resulted in a useful program.

Within science, some (including me) argue that new code should be open source so you can see what the software is doing -- it is the equivalent of making sure that people can find out about the methods you use in a wet lab experiment. A paper whose authors said they got 15 DNA sequences, but would not say where or how, would get in trouble with reviewers, and so should a paper that reports a tree from those sequences without letting readers see how the program that built it works (though some disagree).

Note that the popularity and understanding of open source software have grown through time, enabled by the internet. So, a program written in the 1980s, when getting it meant mailing a check to someone who would then put a floppy disk in the mail, might not have been open source when written and might be prevented from being open-sourced now by contracts with distributors, but new work by the same authors might be open source and freely distributed on the internet (think MacClade vs. Mesquite). Much new software in phylogenetics is open source (RAxML, PHYLIP, Brownie, MrBayes, BEAST, FigTree, APE, Geiger, etc.) but some is not (SIMMAP, BayesTraits).
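To make the people-speak vs. computer-speak distinction concrete, here is a small sketch in Python. One caveat: Python compiles to bytecode for its own interpreter rather than to the native machine code a C++ compiler produces, so this is an analogy, not the same thing. The file name "<example>" is just a placeholder label.

```python
import dis

# "People-speak": source code a human can read and edit.
source = "print('hello')"

# Compile it. Python produces bytecode for its own virtual machine, not
# native machine code, but the readable-vs-unreadable contrast is the same.
code = compile(source, "<example>", "exec")

print("Source:  ", source)
print("Compiled:", code.co_code.hex())

# Going back from the compiled form takes tooling. A disassembler lists the
# instructions, but comments, layout, and the author's intent are gone.
dis.dis(code)
```

The hex string printed on the "Compiled" line plays the role of the binary: it runs, but you would not want to read or modify it by hand.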

What are open source licenses?

There are various ways to license open source software, and licenses differ in how they answer questions such as these: Can everyone see the source code? Can they modify it? Can they modify and distribute it, and do they have to cite you in the new code? Can they sell their modifications? Do their modifications have to be open source, too? Wikipedia has a description of various open source licenses.

How do I get the source code?

There are two main ways to get source code. One is to download a compressed directory (folder) containing it. This is a good static snapshot, but it might not reflect ongoing changes in the code. The other way is "checking it out" from a repository. Most software is developed using version control: think of something like track changes in a document, but for files and folders. You can see how code has changed through time and who has changed it, and you can roll it back to an earlier version if something breaks. There are also ways to merge code: if two people are working on some code at the same time, the version control software can merge their changes. Probably the two most popular version control systems are Subversion and Git. Both allow you to get a copy of the folder containing the code and then update it as the code changes. For example, to get the Brownie source code, you would install Subversion and then do an anonymous checkout of the code to store a copy of it locally on your computer. "Anonymous" checkout means anyone can do it. You can then take the code and modify it according to its license.
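The "track changes for files" idea can be sketched in a few lines of Python. This is a toy model for illustration only: the class and method names (History, commit, checkout, diff, log) are made up here, and real systems like Subversion and Git store and transfer data very differently.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Revision:
    author: str
    message: str
    text: str


class History:
    """Toy single-file repository: every commit stores a full snapshot."""

    def __init__(self):
        self.revisions = []

    def commit(self, author, message, text):
        self.revisions.append(Revision(author, message, text))
        return len(self.revisions)  # revision numbers start at 1

    def checkout(self, number=None):
        """Return the text at a given revision (the latest by default)."""
        rev = self.revisions[-1] if number is None else self.revisions[number - 1]
        return rev.text

    def diff(self, old, new):
        """Show how the file changed between two revisions."""
        a = self.checkout(old).splitlines()
        b = self.checkout(new).splitlines()
        return list(difflib.unified_diff(a, b, f"r{old}", f"r{new}", lineterm=""))

    def log(self):
        """Who changed what, oldest first."""
        return [f"r{i}: {r.author}: {r.message}"
                for i, r in enumerate(self.revisions, 1)]


repo = History()
repo.commit("alice", "first version", "mean = sum(x) / len(x)\n")
repo.commit("bob", "guard against empty input",
            "if not x:\n    raise ValueError\nmean = sum(x) / len(x)\n")

print("\n".join(repo.log()))        # who changed it
print("\n".join(repo.diff(1, 2)))   # how it changed
rolled_back = repo.checkout(1)      # roll back to the earlier version
```

The three calls at the end correspond to the three things described above: seeing who changed the code, seeing how it changed, and rolling back when something breaks.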

What do I do with changes to the code?

Let's say you check out Brownie and modify it. You could fork it -- start developing new software based on it but diverging from it (think fork in the road, or speciation event). Given Brownie's license, you don't need permission to do this. You could instead try to put your changes back into Brownie. In many open source projects, you need permission to do this -- there's always the risk of someone vandalizing existing code or introducing an error with a change. It's often better to merge changes into an existing program than to fork it -- otherwise, improvements in one version can't get into the other. However, if you want a program to do something different (think ANCML and PHYLIP's Continuous), forking can make sense.

What are trunks and branches?

Sometimes someone wants to make changes to code that will temporarily break it in the process. For example, if you wanted to change how a program handles help, there might be a period when help doesn't work at all until the new help system is in place. If this will take a long time, you might not want the program to be broken throughout, especially if people are doing bug fixes or the like on the main code. So, the main code will be the trunk -- a version that is fairly stable and working most of the time. The code that is being heavily worked on will be a branch. It's like forking the code, but with the eventual goal of merging the two later. When, for example, the new help system is available and tested, the branch can be merged back into the trunk. The merged code will incorporate both the bug fixes introduced in the trunk and the new features added in the branch. This is most helpful on big projects or with new developers (let them work on a branch and not have to worry about breaking the code for everyone).
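How a merge can keep both the trunk's bug fixes and the branch's new features is easiest to see in a small sketch. The function below does a simplified "three-way merge" at the level of whole files: it compares the trunk and the branch against the version they both started from. The function name and the example file names are hypothetical, and real version control systems merge line by line within files, which this sketch does not attempt.

```python
def three_way_merge(base, trunk, branch):
    """Merge changes made on trunk and branch since they diverged.

    Each argument maps a file name to its contents; a file that is missing
    from a mapping is treated as absent (None). Returns the merged files
    and a list of file names that both sides changed in different ways.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(trunk) | set(branch)):
        b, t, r = base.get(name), trunk.get(name), branch.get(name)
        if t == r:        # both sides agree (or neither touched the file)
            result = t
        elif t == b:      # only the branch changed it: take the branch
            result = r
        elif r == b:      # only the trunk changed it: take the trunk
            result = t
        else:             # both changed it differently: a person must decide
            conflicts.append(name)
            result = t    # keep the trunk version for now
        if result is not None:
            merged[name] = result
    return merged, conflicts


# The state of the code when the branch was created.
base = {"help.txt": "old help", "stats.py": "mean = sum(x) / len(x)"}
# Trunk: a bug fix went in while the branch was being developed.
trunk = {"help.txt": "old help",
         "stats.py": "mean = sum(x) / len(x) if x else 0.0"}
# Branch: the new help system, with one new file.
branch = {"help.txt": "new help system",
          "stats.py": "mean = sum(x) / len(x)",
          "help_index.txt": "topics"}

merged, conflicts = three_way_merge(base, trunk, branch)
print(merged)
print(conflicts)
```

Here the merged result has the fixed stats.py from the trunk and the new help files from the branch, with no conflicts. Had both sides edited help.txt in different ways, it would have been listed as a conflict for a developer to resolve by hand.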