Phylogenetic diversity
Download Zip File
Open the zip and drag the folder "Lab8" to your desktop.
|
|
1. Collect gene sequence data and environmental data from GenBank.
|
|
a. GenBank is inconsistent at best when it comes to linking geographic or environmental data to sequence data. For this reason, we have prepared a dataset for you. Ideally, though, you would generate your own sequence data with highly detailed information about location and environment.
|
|
b. Our dataset comes from ~15 archaeal 16S clone libraries.
|
| |
i. Sequence Filename = arch.756.fa
|
| |
ii. Environmental data = sequence_desc.xls
|
|
2. To assess the relatedness of 16S sequences, we first need to align them. There are many options for generating sequence alignments. Here are a few:
|
|
a. Clustal is a very good program, but very slow, so it is probably better suited to small alignments. A desktop version can be found online, and there is also an online blackbox version.
|
|
b. MAFFT is a fairly new program that is fast and gives good results. It uses a relatively novel algorithmic approach that allows for large open windows in sequences. The utility of this is that MAFFT can make fairly good alignments on datasets where some (but not all) sequences have no overlapping frames with some (but not all) others. However, in our personal experience, it hasn't produced the best alignments of 16S data. MAFFT also has desktop and blackbox versions.
|
|
c. Muscle is a very popular multiple sequence alignment tool. Like the others, it comes in several flavors including blackbox and desktop. We have provided an executable version of Muscle in phylo_lab.zip, muscle.exe.
|
| |
i. To run muscle.exe in Windows, open the command prompt (if you are new to cmd, see: command prompt tutorial), navigate to its folder, and type muscle.exe -h to open a help page.
|
| |
ii. We have also provided a batch file, arch.align.bat, that runs a basic alignment of arch.756.fa for you. Double-clicking it should run the alignment. If you are interested in the command it runs, open it in a text editor (right-click and choose Edit).
|
| |
iii. For the sake of time we have also included the alignment that is produced by Muscle, arch.756.af.
|
| |
iv. For more information see the Muscle help file
|
|
3. An alignment isn't finished just because an alignment tool has run. Each of the above tools keeps every nucleotide but inserts gaps "-" in order to line up the columns of the sequences. This generally results in many "-" at the beginning and end of the alignment. Although these end regions are informative, they are often not valuable for a phylogenetic search because so few sequences have nucleotides at these positions.
|
|
a. Download BioEdit here
|
| |
i. Install
|
|
b. Open the file arch.756.af in BioEdit
|
|
c. You should see all your sequences in rows, with the nucleotides for each position in columns. Notice the frequency of nucleotides versus gaps "-" at the beginning and end of the alignment. We won't get into when and why to trim, although there are good reasons to trim sequences at specific nucleotide positions, but we will show you how it is done.
|
| |
i. Click the dropdown next to 'Mode' near the top right. Select 'Edit'
|
| |
ii. Click on position 150 (numbers are at the top along the columns)
|
| |
iii. Click 'Edit' in the top menu, then click 'Select to beginning'
|
| |
iv. Press delete
|
| |
v. Click on position 1200
|
| |
vi. Click 'Edit' and 'Select to end'
|
| |
vii. Press delete
|
| |
viii. Save file as arch.756.trim.af (We have already provided this file for you)
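For those who would rather script the trimming than click through BioEdit, here is a minimal Python sketch of the same idea. It assumes the alignment is in aligned FASTA format, and the function names (read_fasta, gap_fraction, trim_alignment) and toy sequences are made up for illustration. The column bounds are parameters: the sketch does not model whether BioEdit removes the clicked column itself, or whether position 1200 is counted before or after the first deletion.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def gap_fraction(records, column):
    """Fraction of sequences that have a gap at a 1-based alignment column."""
    gaps = sum(1 for _, seq in records if seq[column - 1] == "-")
    return gaps / len(records)


def trim_alignment(records, start, end):
    """Keep only alignment columns start..end (1-based, inclusive)."""
    return [(name, seq[start - 1:end]) for name, seq in records]


# Toy alignment (hypothetical data, not taken from arch.756.af).
records = read_fasta([">a", "--ACGT--", ">b", "-GACGTA-", ">c", "TGACGTAC"])
print(gap_fraction(records, 1))       # how gappy is the first column?
print(trim_alignment(records, 3, 6))  # keep the well-covered middle columns
```

Looking at gap_fraction along the alignment is one way to pick trim positions from the data instead of by eye.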
|
|
4. Another problem that often arises in lab, and this one is no exception, is file formatting. Phylogenetics is exceptionally convoluted when it comes to file types and formats. To perform our phylogenetic analysis, we need an interleaved Phylip alignment file.
|
|
a. To convert from the trimmed muscle output to a .phy file, you can use a program called SeqVerter, available here: http://www.genestudio.com/seqverter.htm
|
|
b. However, we have converted it for you. The file is named arch.756.trim.phy
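To give a feel for what the conversion involves, here is a rough Python sketch that writes (name, sequence) pairs as interleaved Phylip text. It is not what SeqVerter does internally. It assumes the strict Phylip layout, in which names occupy exactly 10 characters, so longer names are truncated; the function name to_phylip_interleaved and the toy sequences are hypothetical.

```python
def to_phylip_interleaved(records, width=60):
    """Format (name, aligned_sequence) pairs as strict interleaved Phylip text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    # Strict Phylip: each name is padded or truncated to exactly 10 characters.
    names = [name[:10].ljust(10) for name, _ in records]
    if len(set(names)) != len(names):
        raise ValueError("names are not unique after truncation to 10 characters")
    lines = [f" {len(records)} {nchar}"]  # header: number of taxa, number of columns
    for start in range(0, nchar, width):
        if start > 0:
            lines.append("")  # a blank line separates interleaved blocks
        for name, (_, seq) in zip(names, records):
            prefix = name if start == 0 else ""  # names appear in the first block only
            lines.append(prefix + seq[start:start + width])
    return "\n".join(lines) + "\n"


# Toy input (hypothetical); a narrow width makes the interleaving visible.
print(to_phylip_interleaved([("seqA", "ACGTAC"), ("seqB", "AC-TAC")], width=4))
```

The truncation check matters in practice: accession-style names that differ only after the tenth character would silently collide in a strict Phylip file.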
|
|
5. Now, on to phylogenetic hypotheses. The number of programs you can use to generate a phylogeny is overwhelming. A really good place to find information on these programs is here. That site also has good information on many other tools, including alternative alignment tools. Another great resource for Mac users is Mike Robeson's site. He can likely answer a lot of questions about these programs too!
|
|
6. For our purposes, we are going to use a program called RAxML. This program is freely available here. The developer of this program is coming in early April to give a job talk at our department. RAxML performs a maximum-likelihood tree search. Because of the size of our dataset (currently 378 sequences), many older programs would take a long time to run. RAxML is really fast! And on Mac/Linux, it makes good use of multi-core machines.
|
|
a. We have included the RAxML6 Windows distro in the .zip file, RAxML.exe
|
|
b. Like Muscle, the best place to start is to use the command prompt, navigate to the folder and type raxml_vi.exe -h to bring up a help page
|
|
c. But, also like Muscle, we have provided a batch file, runraxml.bat, that will run RAxML, using a simple nucleotide substitution model, on the arch.756.trim.phy file.
|
|
d. We have also provided the resulting files in the raxml_output folder.
|
|
e. For more information read the RAxML documentation found on the website above
|
|
|
Extra: A better tree search
|
|
What we have just performed is a very simple and incomplete tree search. If you want a more thorough tree search, this is the first step:
Right click runraxml.bat and click Edit
The first change to make is to replace GTRCAT with GTRMIX. This tells RAxML to combine two models when searching for the best ML tree: it first searches for a tree using GTRCAT and then evaluates the final tree of each run using GTRGAMMA. GTRGAMMA yields more stable likelihood values; read the RAxML manual and search the primary literature if you want to know more about this.
You may have noticed that I said "for each run", whereas the current batch file only produces one tree. What we want to do now is tell RAxML to search for trees a certain number of times, say 10.
To do this, add a space after GTRMIX and then -#10. Save the file and close it. Now double-click it and watch RAxML run. This will take a while, because RAxML now repeats its search of tree space 10 times. Newer RAxML distributions and web servers do this much better, but this works fine for now.
When RAxML completes, you should have a file in the directory titled RAxML_info.arch.tree.txt. Open this file. In it you will see a list of rows starting with [0], then [1], then [2]. Those are the results for the best tree topology found in each run. Further down there will be a line reading Best Likelihood in run number, followed by a number. In my case that number is 7, which means that the best tree found across all the runs was that of run 7. So the tree I would want to use for the rest of lab is in the file titled RAxML_result.arch.tree.txt.RUN.7.
You could do further searches and run bootstraps on each of the trees. For further information be sure to read the manual.
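If you repeat the search often, picking the best run by eye gets tedious. The Python sketch below pulls the run number out of the info file text. It relies only on the phrase quoted above, "Best Likelihood in run number", being followed by an integer; the exact layout of the rest of that line is an assumption, as is the result file name pattern, which simply follows the RAxML_result.arch.tree.txt name used in this lab. The function names are hypothetical.

```python
import re


def best_run(info_text):
    """Return the run number reported as best, or None if the line is absent."""
    match = re.search(r"Best Likelihood in run number\s+(\d+)", info_text)
    return int(match.group(1)) if match else None


def result_file_for_run(run, run_name="arch.tree.txt"):
    """Build the per-run tree file name (naming pattern is an assumption)."""
    return f"RAxML_result.{run_name}.RUN.{run}"


# Stand-in text based on the lab's description, not a real RAxML log.
sample = "Best Likelihood in run number 7"
run = best_run(sample)
print(run, result_file_for_run(run))
```

In practice you would read the contents of RAxML_info.arch.tree.txt into a string and pass that to best_run.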
|
|
|
|
Extra: Mesquite
|
|
If you begin working with phylogenies more, you will need ways to explore them visually. One simple tool for loading a tree is Dendroscope; download it here if you like. In Dendroscope, you can directly open the file called RAxML_result.arch.tree.txt. However, Mesquite is a better, more functional tool. Download it here and install it.
Open Mesquite and navigate to the folder of files from lab. Open the file named arch.756.nex. This should load into Mesquite. Also open that file in a text editor to see its contents. The Mesquite file type is fairly straightforward, and many phylogeny programs will output it for you automatically. However, the program is very unforgiving of errors, so it may take some time to get used to in other projects.
You should now see a "Character Matrix" window, listing all the accession numbers and their environments. Now we want to view our tree too. Click "Get file with trees" -> "Link contents". Navigate to the RAxML outputs and open the RAxML_result.arch.tree.txt file. Select "Phylip (trees)" in the new pop-up. Now choose a new name to save this as, something like arch_full.nex is fine.
Click "Taxa&Trees" -> "New Tree Window", then choose "Stored trees" and click OK in the pop-up. A new window with the phylogenetic tree should open. Fun!
Now let's try to optimize a character on our phylogenetic tree. Because Environment is the only character we have for these archaea, we'll play with that. Click "Analysis" -> "Trace Character History". Click "Stored Characters" and OK. Click "Parsimony Ancestral States" and OK.
Scroll horizontally across your tree. You should see a legend for the color of each environment, and branches colored correspondingly. Looking through the tree within Mesquite can get rather annoying. Click "File" -> "Save tree as PDF" and save the file anywhere. Now navigate to that file on your computer and open the PDF.
If you save the .nex file now, next time you open it, the tree window and optimization will load automatically.
To skip all of these steps, you can open pre_compiled.arch.756.nex, which already has the tree loaded with the optimization done. You can also open nex.output_tree.pdf to see the tree. Finally, you could play with the "Drawing" and "Tree" menus to change the tree form; one popular form is the right-ladderized tree. Look at nex.output_tree.right_ladder.pdf to see what that looks like.
|
|
|
7. Now, we are going to evaluate the environmental use of the archaea through their phylogenetic diversity.
|
|
a. Go to UniFrac
|
|
b. Sign in with the account you made over the weekend
|
|
c. Upload our tree created by RAxML
|
| |
i. Click the first browse button
|
| |
ii. Navigate to the raxml_output folder
|
| |
iii. Double click RAxML_result.arch.tree.txt
|
|
d. Upload our environmental file. Using basic data from Genbank entries we created the file sequence_desc.xls. In this file, we cut and pasted columns B and C into arch.envs.txt.
|
| |
i. Click the second browse button
|
| |
ii. Navigate to the folder with all the data in it
|
| |
iii. Double-click arch.envs.txt
|
|
e. Click Load Tree
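A common reason for an upload to fail is that the names in the tree and the names in the environment file do not match exactly. The Python sketch below is one way to check this before uploading. It assumes the environment file is tab-delimited with the sequence name in the first column and the environment in the second (whether that matches columns B and C of sequence_desc.xls is an assumption), and that tip names in the Newick tree contain no spaces, colons, commas, or parentheses. All function names and toy data are hypothetical.

```python
import re


def tip_labels(newick):
    """Collect tip names: tokens that directly follow '(' or ',' in the tree."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def parse_env_table(text):
    """Map sequence name -> environment from tab-delimited text."""
    env_of = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        env_of[fields[0].strip()] = fields[1].strip()
    return env_of


def missing_from_envs(newick, env_of):
    """Tip names in the tree that have no environment assigned."""
    return sorted(tip_labels(newick) - set(env_of))


# Toy tree and environment table (hypothetical names).
tree = "((seq1:0.1,seq2:0.2):0.3,seq3:0.4);"
envs = parse_env_table("seq1\tsoil\nseq2\tmarine\n")
print(missing_from_envs(tree, envs))  # tips that still need an environment
```

An empty list means every tip in the tree has an environment; anything listed needs to be added to, or corrected in, the environment file.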
|
|
|
Extra: Readings
|
|
1. If you want to learn more about microbial biodiversity, start by reading this classic/seminal paper by Norm Pace.
|
|
2. Noah Fierer has explored some of the global biodiversity questions we have covered in class, but in the microbial biology context. Andrew King has brought up this particular study in some discussions: The diversity and biogeography of soil bacterial communities (PDF)
|
|
|
8. PCoA/P-TEST/UNIFRAC
|
|
a. We are going to run four analyses in UniFrac. The first analysis just measures how far apart your environments are from one another in terms of the organisms (well, in this case, sequences) that they share. The greater the distance, the more dissimilar the environments are in terms of community composition. To do this, choose 'Environment Distance Matrix' under the 'Select Analysis' tab.
|
|
b. Next run the 'UniFrac significance test'. As the tutorial for UniFrac says: 'This test tells you whether the pattern of environments on your tree is significantly different from what you would expect by chance. This analysis is performed by randomly reassigning species to environments, measuring the amount of branch length that is unique to one environment versus the amount that is shared by more than one, and testing whether the amount of shared branch length is significantly different from chance expectations.'
|
|
c. A similar test developed by our own Andy Martin, called the p-test, can also be run. The p-test has the same goal as the UniFrac significance test (to distinguish the community composition similarity/dissimilarity of different environments) but is more likely than UniFrac to give a significant result when there are many sequences that are very closely related to one another but unique to one particular environment.
|
|
d. Finally, we are going to examine how these environments 'ordinate' in a principal coordinates space (PCoA). The distance matrix you have created can be hard to 'read'. You want to look for correspondences and similarities among environments in a low-dimensional space that shows you which environments are most similar or dissimilar to which others. A PCoA does that; it belongs to the family of multidimensional scaling techniques. Remember this is a heuristic output, not a way to test a particular hypothesis. Select 'PCA' under 'Select Analysis', select 'scatter plot text labels', and then Bin Environments by 'first underscore' (not the default, first letter). Hit the 'Analyze tree' button. The output shows the ordination of the environments. Those environments most similar to each other will 'cluster' in the PCoA space.
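To make the distance and the significance test from steps a and b more concrete, here is a toy Python sketch, not the UniFrac implementation itself. It computes an unweighted UniFrac-style distance between two environments as the fraction of branch length leading to sequences from only one of them, then reshuffles environment labels among the tips to ask how often chance alone gives a distance at least as large. The tree is represented as a list of (branch length, set of tips below that branch), which is a simplification chosen for this sketch; the function names and data are hypothetical.

```python
import random


def unifrac(branches, env_of, env_a, env_b):
    """Fraction of branch length unique to one of the two environments."""
    unique = total = 0.0
    for length, tips in branches:
        envs = {env_of[t] for t in tips} & {env_a, env_b}
        if not envs:
            continue  # branch leads to neither environment
        total += length
        if len(envs) == 1:
            unique += length  # branch leads to only one of the two
    return unique / total if total else 0.0


def permutation_p_value(branches, env_of, env_a, env_b, n_perm=999, seed=0):
    """Share of label shuffles whose distance is >= the observed distance."""
    rng = random.Random(seed)
    observed = unifrac(branches, env_of, env_a, env_b)
    tips = sorted(env_of)
    labels = [env_of[t] for t in tips]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # randomly reassign environments to tips
        shuffled = dict(zip(tips, labels))
        if unifrac(branches, shuffled, env_a, env_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy tree ((a1:1,a2:1):2,(b1:1,b2:1):2); written out branch by branch.
branches = [(1, {"a1"}), (1, {"a2"}), (2, {"a1", "a2"}),
            (1, {"b1"}), (1, {"b2"}), (2, {"b1", "b2"})]
env_of = {"a1": "soil", "a2": "soil", "b1": "marine", "b2": "marine"}
print(unifrac(branches, env_of, "soil", "marine"))
print(permutation_p_value(branches, env_of, "soil", "marine"))
```

With only four tips the p-value cannot get small, which is a useful reminder that these tests need a reasonable number of sequences per environment to have any power.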
|
|
|
Extra: Readings
|
|
1. If you want to learn more about UniFrac, start by reading this paper UniFrac.
|
|
2. Andy Martin also wrote a great paper on comparing community diversity Here
|
|
3. Finally, a popular paper from the authors of UniFrac explores many interesting questions on microbial biodiversity, and is worth a read: Global patterns in bacterial diversity
|
|
|
|
So you have now, for all intents and purposes, examined whether different environments have different or similar communities of archaea. You could run the same analyses for plant or animal diversity in different environments in a similar way. The main point is that this analysis uses genetic distance as the criterion for measuring similarity or dissimilarity, not simply the number of species or the number of shared species.