Spring 2017 BIOL 312: Microbiology

A Town on Fire
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire

Instructor: Dr. Tammy Tobin, Susquehanna University
E-Mail: [email protected]

Team Application Activity #3: Statistical Analysis of Microbial Community Diversity and Composition

Names of Team Members:

Introduction:

During the last class period, you assigned your metagenomic sequences to OTUs, picked representative sequences for each OTU, made a .biom table to summarize your OTU data, and then processed the .biom tables into bar graphs in order to better visualize whether or not the results supported your hypothesis regarding the presence or absence of a single species in a Centralia soil sample. You were then asked to hypothesize which environmental parameter from the ones in the mapping file (temperature, pH, ammonia, nitrate, sulfate or total sulfur concentration) you believe plays the largest role in determining microbial community diversity and structure in Centralia.

In this activity you will use QIIME to prepare your sequence data for phylogenetic analysis, to generate a phylogenetic tree from your sequence data (we will analyze that tree offline using FigTree), and to determine if your metagenomic sequences at each site are likely to reflect the actual diversity of microorganisms living there.

Diversity Analyses

There are two types of diversity analysis that are commonly used in metagenomic studies. Alpha diversity looks at the diversity within samples, in this case, the OTU diversity within each borehole. Beta diversity describes the differences between samples. Both types of diversity analysis can be computed using QIIME. Since some of the diversity metrics we will be using require the existence of a phylogenetic tree, we will construct that first.
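
To make the difference between alpha and beta diversity concrete, here is a minimal Python sketch. It is not part of the QIIME workflow used in this activity. The sample names and OTU counts are invented for illustration, and the beta diversity measure shown (Bray-Curtis dissimilarity) is an assumption chosen for simplicity; this handout does not specify which beta diversity metric will be used.

```python
from itertools import combinations

# Hypothetical OTU counts for three samples; names and numbers are invented.
otu_table = {
    "site_1": {"OTU_A": 40, "OTU_B": 30, "OTU_C": 30},
    "site_2": {"OTU_A": 35, "OTU_B": 35, "OTU_C": 25, "OTU_D": 5},
    "site_3": {"OTU_D": 60, "OTU_E": 40},
}


def observed_otus(counts):
    """Alpha diversity in its simplest form: number of OTUs present in one sample."""
    return sum(1 for n in counts.values() if n > 0)


def bray_curtis(a, b):
    """Beta diversity between two samples (assumed metric: Bray-Curtis).

    Returns 0 when the two samples have identical counts and 1 when they
    share no OTUs at all.
    """
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total


# Alpha diversity: one number per sample.
for name, counts in otu_table.items():
    print(f"alpha  {name}: {observed_otus(counts)} OTUs")

# Beta diversity: one number per pair of samples.
for s1, s2 in combinations(otu_table, 2):
    d = bray_curtis(otu_table[s1], otu_table[s2])
    print(f"beta   {s1} vs {s2}: {d:.2f}")
```

Note that alpha diversity produces one value per sample, while beta diversity produces one value per pair of samples.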

Getting QIIME Started

Step One: Logging in to your EC2 instance from the Mac Terminal.

This should be review by now, but here are the steps:

1. Start the Terminal program (in Applications > Utilities).
2. cd to the location where your key is located. This should still be the Desktop.
3. Type in the following command after the $ prompt, replacing <IP address> with the IP address of your EC2 instance:
ssh -i "QIIME-1.pem" ubuntu@<IP address>
4. Type 'yes' in response to the next prompt.
5. You are now running QIIME through your terminal.
6. cd to the Centralia_Case_Study directory to start today's activity.

Using QIIME to Construct Phylogenetic Trees

The underlying assumption for all DNA sequence-based phylogenetic analyses is that the more closely related two species are, evolutionarily, the more similar their DNA sequences will be. This underlying assumption does have some flaws that need to be kept in mind. As you have already learned, horizontal gene transfer between species can make two species look more (or less) related than they truly are. Also, not all DNA changes are equal in terms of phenotypic outcome. Some mutations are selectively neutral while others are not. Thus, selective pressures will have an impact on the rate of nucleotide changes observed in different parts of the genome over time. Some analysis metrics take this latter situation into account by weighting base changes in different codon positions differently (to account for silent mutations, etc.). This is not done in 16S rRNA sequence analysis because there are no codons (no protein is produced).

1. Aligning sequences. In order to construct a phylogenetic tree, all of the 16S rRNA sequences in our quality-filtered fasta files will first need to be aligned to make sure that the base changes observed between sequences are due to mutations at that site in the gene, rather than to comparison of two completely different parts of the gene.
QIIME will also insert gaps, as needed, to account for the fact that insertions and deletions of bases also occur during evolution. By way of example, take the three related phrases below:

AFATCAT
AFFATCAT
TINYRATFEAREDAFFATCAT

If these phrases were compared without adjusting the default (left-justified) alignment above, they would show almost no similarity at all. Even phrases 1 and 2, which are obviously very similar, would have only two letters in common as they are currently aligned: the first A and F. After those first two letters, every subsequent letter is different. The first two letters of the third phrase (TI) would not match at all. Aligned versions of these phrases are shown below:

             A_FATCAT
             AFFATCAT
TINYRATFEAREDAFFATCAT

In this scenario, QIIME has shifted phrases 1 and 2 over to the right, so they match the corresponding part of phrase 3, and has also inserted a gap in phrase 1 to account for the additional F in phrases 2 and 3. This will give a much more accurate picture of the overall sequence identity.

The command for aligning sequences in QIIME is align_seqs.py, with an input of your representative OTU fasta file, and an output directory of aligned_sequences. This step will take several minutes, so be patient and do not hit return or enter another command until you see the $ prompt. The total command is:
align_seqs.py -i rep_set1.fna -o aligned_sequences

2. Next you will filter out uninformative sequence data. This script will remove positions that are gaps in every sequence (which can happen with some alignment programs), as well as non-conserved positions, such as 'TINYRATFEARED' in phrase 3 above (it is only present in one phrase), which are uninformative for tree building. The command for this is filter_alignment.py. The input file will be the aligned sequence file, and the output will be a folder entitled 'filtered_alignments'.
The total command is:
filter_alignment.py -i aligned_sequences/rep_set1_aligned.fasta -o filtered_alignments

3. Finally, you will ask QIIME to generate a phylogenetic tree from your filtered alignments using FastTree, a modified neighbor-joining method (it uses pairwise sequence comparisons to build the tree) that is much faster at metagenomics-level analyses than many other methods, but is still reliable (see article in References section for more details). The command is:
make_phylogeny.py -i filtered_alignments/rep_set1_aligned_pfiltered.fasta

You will be able to visualize trees in three basic ways, as shown below. The first two trees are 'rooted', while the last is not. Remember that no matter which way the tree is diagrammed, they show phylogenetic relationships in the same way.

1. In the first tree above, which taxon is most closely related to Nimravidae?
2. In the second tree above, which taxon is most closely related to Spirochetes?

That's it for phylogeny for now. We can next move on to calculating alpha diversity.

Calculating Alpha Diversity

To calculate alpha diversity, QIIME must first generate alpha rarefaction tables (in biom format). As you know from your readings, rarefaction data will not only provide information regarding the amount of diversity present within each sample, but will also help you determine if you have sampled at a sufficient depth to reveal an acceptable level of the diversity present in your sample.

We will use three different methods to actually analyze alpha diversity: Faith's Phylogenetic Diversity, Heip's Evenness and total number of OTUs.

Faith's Phylogenetic Diversity is based on the phylogenetic tree you generated in the first part of this activity. This method adds up all the branch lengths in the tree as a measure of alpha diversity.
Branch lengths are roughly analogous to the number of nucleotide changes represented in the tree. So, if you add a new OTU to a dataset that is closely related to another OTU in the sample (very few nucleotide changes between them), it will only cause a small increase in diversity. However, if you add a new OTU that comes from a totally different lineage than anything else in the sample, it will cause a much larger increase in the diversity.

Heip's Evenness measures how close in numbers the OTUs in an environment are. If a sample contains 100 sequences, and there are roughly equal numbers of sequences from each OTU, then Heip's evenness will be close to one. If, however, only one of the sequences came from one OTU, and 99 came from another, then the score will be very close to zero.

Total number of OTUs is exactly what it sounds like. If there are more OTUs, then the samples will be considered to be more diverse.

Alpha diversity is generated in four steps:

1. Generating rarefaction tables. In this step, QIIME will subsample the original OTU table at a variety of specified sequence depths, and will report the number of OTUs revealed at each depth. In the script we will use, the OTU table will first be subsampled 10 times at a depth of 10 sequences/sample, then 10 times at a depth of 120 sequences/sample, and so on until the maximum rarefaction depth is reached (we will stop at 1000 sequences/sample, although we could well choose to look at even more sequences if our data indicate this would be beneficial). The step size is 110, which means that each sampling depth will be increased by 110 until 1000 sequences/sample is reached, giving 10 depths in all (10, 120, 230, ..., 1000). Since each subsampling depth will be repeated 10 times, a total of 100 subsampled OTU tables (10 depths x 10 repetitions) will be generated in our output (multiple_rarefactions) folder. The script for this analysis is:
multiple_rarefactions.py -i output.biom -o multiple_rarefactions -m 10 -x 1000 -n 10 -s 110

2.
Next, the alpha diversity of the rarefied samples will be computed using the three different metrics.
alpha_diversity.py -i multiple_rarefactions/ -o alpha_diversity -m PD_whole_tree,observed_species,heip_e -t filtered_alignments/rep_set1_aligned_pfiltered.tre

3. At this stage there are still a large number of separate files, so those will need to be collated into a single file for graphing purposes. The command is:
collate_alpha.py -i alpha_diversity/ -o collated_alpha_diversity

4. Now we can plot the results, including the original mapping data so that we can see if alpha diversity varies with sample site, chemical or temperature parameters.
make_rarefaction_plots.py -i collated_alpha_diversity/ -m mapping_centralia.txt -o rarefaction_plots

5. Now you need to download all of your data. Use Cyberduck to connect to your instance just like you did before (detailed steps are given under "Downloading Files Using Cyberduck"). Transfer your files to your desktop by simply dragging the Centralia_Case_Study folder there. It will take a while; you have generated a LOT of data!

6. Once your data is safely on your desktop, copy it to your thumb drive.

7. You are now done with QIIME! Type exit next to the command prompt in your terminal window, and then quit the program.

Downloading Files Using Cyberduck

Download the Centralia_Case_Study folder with all of your data from your instance to your Desktop using Cyberduck.

1. Start Cyberduck. It is in your Applications folder. Cyberduck allows you to transfer files between your computer and the Amazon EC2 instance you are running.
2. Click on "Open Connection" in the upper left of the window, and choose SFTP from the top drop-down menu.
3. Type the IP address for your EC2 instance (it is still on the blackboard) into the Server window in Cyberduck.
4. Type ubuntu (exactly as written!) as your username. You do not need a password.
5.
Click on "Use Public Key Authentication" at the bottom of the Cyberduck window and then navigate to and choose your key pair (it should be on your desktop as QIIME-1.pem). Your Cyberduck window should now look something like the figure on the left.

6. If it does, click on "Connect". Then click on "Allow". A new window will open up in Cyberduck when it has connected to the server. To copy all of your Centralia files to your desktop, simply drag and drop your Centralia_Case_Study folder onto the desktop of your computer. Say yes to replace the current folder. Copying these files may take a while, so go take a break if it does.

7. Find your Centralia_Case_Study folder on your desktop and open it, and then open the rarefaction_plots folder. Double click on rarefaction_plots.html to open it. You should see a browser window that looks like this:

From the "Select a Metric" drop-down, choose PD_whole_tree and from the "Select a category" drop-down choose "Sample ID". You will now see Faith's Phylogenetic Diversity rarefaction curves for all three sample sites. Which sample has the highest species richness? Do you believe we have sufficiently sampled this location in order to see all of its microbial diversity? Justify your answer.

8. Change the metric to "Observed Species". Does this metric support your conclusions in question 7? Explain.

9. Finally, change the metric to "Heip's". Which microbial community shows the greatest evenness? How does species evenness differ from species richness?
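
For reference, the rarefaction and alpha diversity calculations behind these plots can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code QIIME runs: the sample counts are invented, the function names are hypothetical, and Heip's evenness is computed with the standard formula (e^H - 1) / (S - 1), where H is the Shannon index (natural log) and S is the number of OTUs observed.

```python
import math
import random
from collections import Counter


def rarefaction_depths(min_depth, max_depth, step):
    """Sampling depths from min_depth up to max_depth (inclusive) in steps of `step`."""
    return list(range(min_depth, max_depth + 1, step))


def subsample(counts, depth, rng):
    """Draw `depth` sequences without replacement from one sample's OTU counts."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


def observed_otus(counts):
    """Richness: how many OTUs have at least one sequence."""
    return sum(1 for n in counts.values() if n > 0)


def heip_evenness(counts):
    """Heip's evenness: (e^H - 1) / (S - 1); undefined when fewer than 2 OTUs."""
    present = [n for n in counts.values() if n > 0]
    s = len(present)
    if s < 2:
        return float("nan")
    total = sum(present)
    shannon = -sum((n / total) * math.log(n / total) for n in present)
    return (math.exp(shannon) - 1) / (s - 1)


# Hypothetical sample: 1500 sequences spread unevenly over five OTUs.
toy_sample = {"OTU_A": 900, "OTU_B": 300, "OTU_C": 250, "OTU_D": 40, "OTU_E": 10}

# Same settings as the multiple_rarefactions.py command: -m 10 -x 1000 -s 110 -n 10
depths = rarefaction_depths(10, 1000, 110)
repetitions = 10
rng = random.Random(0)  # fixed seed so the sketch is repeatable

for depth in depths:
    reps = [subsample(toy_sample, depth, rng) for _ in range(repetitions)]
    mean_otus = sum(observed_otus(r) for r in reps) / repetitions
    mean_heip = sum(heip_evenness(r) for r in reps) / repetitions
    print(f"depth {depth:4d}: mean observed OTUs {mean_otus:.1f}, mean Heip's evenness {mean_heip:.2f}")
```

A rarefaction curve is the plot of the mean metric value against depth. When the observed OTU curve flattens out as depth increases, additional sequencing is revealing few new OTUs. Richness counts how many OTUs are present, while evenness describes how equally the sequences are divided among them.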