Spring 2017
BIOL 312: Microbiology
A Town on Fire
Metagenomic Analysis of Bacterial
Communities in Soils Overlying the
Centralia, Pennsylvania Mine Fire
Instructor: Dr. Tammy Tobin, Susquehanna University
E-Mail: [email protected]
Team Application Activity #3: Statistical Analysis of Microbial
Community Diversity and Composition.
Names of Team Members:
Introduction: During the last class period, you assigned your metagenomic sequences to OTUs, picked representative
sequences for each OTU, made a .biom table to summarize your OTU data, and then processed the .biom tables into bar graphs in
order to better visualize whether or not the results supported your hypothesis regarding the presence or absence of a single species
in a Centralia soil sample. You were then asked to hypothesize which environmental parameter, from the ones in the mapping file
(temperature, pH, ammonia, nitrate, sulfate or total sulfur concentration) you believe plays the largest role in determining
microbial community diversity and structure in Centralia. In this activity you will use QIIME to prepare your sequence data for
phylogenetic analysis, to generate a phylogenetic tree from your sequence data (we will analyze that tree offline using FigTree),
and to determine if your metagenomic sequences at each site are likely to reflect the actual diversity of microorganisms living there.
Diversity Analyses
There are two types of diversity analysis that are commonly used in metagenomic studies. Alpha diversity looks at the diversity
within samples, in this case, the OTU diversity within each borehole. Beta diversity describes the differences between samples.
Both types of diversity analysis can be computed using QIIME. Since some of the diversity metrics we will be using require the
existence of a phylogenetic tree, we will construct that first.
Getting QIIME Started
Step One: Logging in to your EC2 instance from the Mac Terminal. This should be review by now, but here are the steps:
1. Start the Terminal program (in Applications > Utilities).
2. cd to the location where your key is located. This should still be the Desktop.
3. Type in the following command after the $ prompt, replacing "IP address" with the IP address of your instance: ssh -i "QIIME-1.pem" ubuntu@IP address
4. Type ‘yes’ in response to the next prompt.
5. You are now running QIIME through your terminal.
6. cd to the Centralia_Case_Study directory to start today’s activity.
Using QIIME to Construct Phylogenetic Trees
The underlying assumption for all DNA sequence-based phylogenetic analyses is that the more closely related two species are,
evolutionarily, the more closely related their DNA sequences will be. This underlying assumption does have some flaws that need
to be kept in mind. As you have already learned, horizontal gene transfer between species can make two species look more (or
less) related than they truly are. Also, not all DNA changes are equal in terms of phenotypic outcome. Some mutations are
selectively neutral while others are not. Thus, selective pressures will have an impact on the rate of nucleotide changes observed in
different parts of the genome over time. Some analysis metrics take this latter situation into account by weighting base changes in
different codon positions differently (to account for silent mutations, etc.). This is not done in 16S rRNA sequence analysis
because there are no codons (no protein is produced).
1. Aligning sequences. In order to construct a phylogenetic tree, all of the 16S rRNA sequences in our quality-filtered fasta files will
first need to be aligned to make sure that the base changes observed between sequences are due to mutations to that site in the gene,
rather than to comparison of two completely different parts of the gene. QIIME will also insert gaps, as needed, to account for the
fact that insertions and deletions of bases also occur during evolution. By way of example, take the three related phrases below:
AFATCAT
AFFATCAT
TINYRATFEAREDAFFATCAT
If these phrases were compared without adjusting the default alignment (left justified) above, they would show almost no similarity
at all. Even phrases 1 and 2, which are obviously very similar, would have only two letters in common as they are currently
aligned: the first A and F. The third phrase, which begins with TI, would match neither of them at those positions. After those first two letters, almost every subsequent
letter is different. Aligned versions of these phrases are shown below:
             A_FATCAT
             AFFATCAT
TINYRATFEAREDAFFATCAT
In this scenario, QIIME has shifted phrases 1 and 2 over to the right, so they match the corresponding phrase in 3, and has also
inserted a gap in phrase 1, to account for the additional F’s in phrases 2 and 3. This will give a much more accurate picture of the
overall sequence identity.
The command for aligning sequences in QIIME is align_seqs.py, with an input of your representative OTU fasta file (rep_set1.fna) and an
output directory of aligned_sequences. This step will take several minutes, so be patient and do not hit return or enter
another command until you see the $ prompt. The total command is:
align_seqs.py -i rep_set1.fna -o aligned_sequences
2. Next you will filter out uninformative sequence data. This script will remove positions that are gaps in every sequence (which can
happen with some alignment programs), as well as non-conserved positions, such as ‘TINYRATFEARED’ in phrase 3 above (it is present
in only one phrase), which are uninformative for tree building.
The command for this is filter_alignment.py. The input file will be the aligned sequence file, and the output will be a folder
entitled ‘filtered_alignments’. The total command is:
filter_alignment.py -i aligned_sequences/rep_set1_aligned.fasta -o filtered_alignments
3. Finally, you will ask QIIME to generate a phylogenetic tree from your filtered alignments using FastTree, a modified Neighbor-Joining method (it uses pairwise sequence comparisons to build the tree) that is much faster for metagenomics-scale analyses than
many other methods, but is still reliable (see article in References section for more details). The command is:
make_phylogeny.py -i filtered_alignments/rep_set1_aligned_pfiltered.fasta
The tree is written to filtered_alignments/rep_set1_aligned_pfiltered.tre, which you will use again when calculating alpha diversity.
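The effect of the alignment and filtering steps above can be sketched in a few lines of Python using the three phrases from the example. This is only an illustration: the function names and the "keep a column if at least two phrases have a letter there" rule are assumptions made for this sketch, not the algorithms that align_seqs.py or filter_alignment.py actually use.

```python
GAP = "_"  # gap symbol, as in the phrase example above


def count_matches(a, b):
    """Count positions where two strings carry the same non-gap letter."""
    return sum(1 for x, y in zip(a, b) if x == y and x != GAP)


def filter_columns(aligned, min_filled=2):
    """Keep only columns in which at least `min_filled` phrases have a letter.

    Hypothetical rule for illustration only; QIIME's filter is more involved.
    """
    keep = [i for i in range(len(aligned[0]))
            if sum(seq[i] != GAP for seq in aligned) >= min_filled]
    return ["".join(seq[i] for i in keep) for seq in aligned]


# Left-justified (unaligned) versions of phrases 1 and 2.
unaligned = ["AFATCAT", "AFFATCAT"]

# Aligned versions: phrases 1 and 2 are shifted right with leading gaps.
pad = GAP * len("TINYRATFEARED")
aligned = [pad + "A_FATCAT", pad + "AFFATCAT", "TINYRATFEAREDAFFATCAT"]

left_justified = count_matches(unaligned[0], unaligned[1])
after_alignment = count_matches(aligned[0], aligned[1])
filtered = filter_columns(aligned)

print("matches before alignment:", left_justified)
print("matches after alignment:", after_alignment)
print("filtered alignment:", filtered)
```

In this toy case the columns holding TINYRATFEARED are dropped because only one phrase has letters there, which is the same reasoning the handout gives for calling those positions uninformative.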
You will be able to visualize trees in three basic ways, as shown below. The first two trees are ‘rooted’, while the last is not.
Remember that no matter which way a tree is diagrammed, it shows the phylogenetic relationships in the same way.
1. In the first tree above, which taxon is most closely related to Nimravidae?
2. In the second tree above, which taxon is most closely related to Spirochetes?
That’s it for phylogeny for now. We can next move on to calculating alpha diversity.
Calculating Alpha Diversity
To calculate alpha diversity, QIIME must first generate alpha rarefaction tables (in biom format). As you know from your
readings, rarefaction data will not only provide information regarding the amount of diversity present within each sample, but will
also help you determine if you have sampled at a sufficient depth to reveal an acceptable level of the diversity present in your
sample. We will use three different metrics to analyze alpha diversity: Faith's Phylogenetic Diversity, Heip's Evenness,
and total number of OTUs.
Faith's Phylogenetic Diversity is based on the phylogenetic tree you generated in the first part of this activity. This method adds up
all the branch lengths in the tree as a measure of alpha diversity. Branch lengths are roughly analogous to the number of nucleotide
changes represented in the tree. So, if you add a new OTU to a dataset that is closely related to another OTU in the sample (very
few nucleotide changes between them), it will only cause a small increase in diversity. However, if you add a new OTU that comes
from a totally different lineage than anything else in the sample, it will cause a much larger increase in the diversity.
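The branch-length idea can be made concrete with a toy tree. The sketch below is a minimal illustration, assuming a rooted tree stored as a child-to-parent table and counting every branch on the path from each observed OTU back to the root; the node names and branch lengths are invented, and this is not QIIME's own implementation of PD_whole_tree.

```python
# Toy rooted tree: each node maps to (parent, length of the branch to that parent).
# Node names and branch lengths are invented for illustration.
TREE = {
    "otu_a": ("node_1", 0.25),
    "otu_b": ("node_1", 0.25),
    "node_1": ("root", 0.5),
    "otu_c": ("root", 2.0),
}


def faith_pd(tree, observed_tips):
    """Sum the lengths of all branches linking the observed tips to the root."""
    counted = set()  # branches already added, so shared branches count once
    total = 0.0
    for tip in observed_tips:
        node = tip
        while node in tree and node not in counted:
            parent, length = tree[node]
            total += length
            counted.add(node)
            node = parent
    return total


pd_one = faith_pd(TREE, ["otu_a"])
pd_close = faith_pd(TREE, ["otu_a", "otu_b"])    # add a close relative
pd_distant = faith_pd(TREE, ["otu_a", "otu_c"])  # add a distant lineage

print(pd_one, pd_close, pd_distant)
```

Adding the close relative otu_b raises the total only by its own short branch, while adding otu_c from a separate lineage adds its entire long branch.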
Heip's Evenness measures how close in numbers the OTUs in an environment are. If a sample contains 100 sequences, and there are
roughly equal numbers of sequences from each OTU, then Heip’s evenness will be close to one. If, however, only one of the
sequences came from one OTU, and 99 came from another, then the score will be very close to zero.
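The two cases just described can be checked numerically. The sketch below assumes the usual definition of Heip's evenness, (e^H - 1) / (S - 1), where H is the Shannon index computed with natural logarithms and S is the number of OTUs observed; the function name is made up for this example.

```python
import math


def heip_evenness(counts):
    """Heip's evenness, (exp(H) - 1) / (S - 1), from per-OTU sequence counts."""
    counts = [c for c in counts if c > 0]  # ignore OTUs that were not observed
    s = len(counts)
    if s < 2:
        raise ValueError("Heip's evenness needs at least two observed OTUs")
    n = sum(counts)
    shannon = -sum((c / n) * math.log(c / n) for c in counts)
    return (math.exp(shannon) - 1) / (s - 1)


even = heip_evenness([50, 50])   # 100 sequences split equally between two OTUs
uneven = heip_evenness([1, 99])  # one sequence from one OTU, 99 from the other

print(round(even, 3), round(uneven, 3))
```

The evenly split sample scores essentially one, and the 1-versus-99 sample scores close to zero, matching the description above.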
Total number of OTUs is exactly what it sounds like. If there are more OTUs, then the samples will be considered to be more
diverse.
Alpha diversity is generated in four steps:
1. Generating rarefaction tables. In this step, QIIME will subsample the original OTU table at a variety of specified sequence
depths, and will report the number of OTUs revealed at each depth. In the script we will use, the OTU table will first be subsampled
10 times at a depth of 10 sequences/sample, then 10 times at a depth of 120 sequences/sample, and so on until the maximum
rarefaction depth is reached (we will stop at 1000 sequences/sample, although we could well choose to look at even more
sequences if our data indicate this would be beneficial). The step size is 110, which means that the sampling depth will be
increased by 110 each time until 1000 sequences/sample is reached, giving 10 depths (10, 120, 230, ..., 1000). Since each subsampling depth will be repeated 10 times, a total of 100
subsampled OTU tables will be generated in our output (multiple_rarefactions) folder.
The script for this analysis is:
multiple_rarefactions.py -i output.biom -o multiple_rarefactions -m 10 -x 1000 -n 10 -s 110
2. Next, the alpha diversity of the rarefied samples will be computed using the three different metrics.
alpha_diversity.py -i multiple_rarefactions/ -o alpha_diversity -m PD_whole_tree,observed_species,heip_e -t filtered_alignments/rep_set1_aligned_pfiltered.tre
3. At this stage there are still a ton of separate files, so those will need to be collated into a single file for graphing purposes. The
command is:
collate_alpha.py -i alpha_diversity/ -o collated_alpha_diversity
4. Now we can plot the results, including the original mapping data so that we can see if alpha diversity varies with sample site,
chemical or temperature parameters.
make_rarefaction_plots.py -i collated_alpha_diversity/ -m mapping_centralia.txt -o rarefaction_plots
5. Now you need to download all of your data. Use Cyberduck to connect to your instance, following the steps in the section
‘Downloading Files Using Cyberduck’ below. Transfer your files to your desktop by simply dragging the Centralia_Case_Study folder there. It will take a
while…you have generated a LOT of data!
6. Once your data is safely on your desktop, copy it to your thumb drive.
7. You are now done with QIIME! Type exit at the command prompt in your terminal window, and then quit the program.
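The rarefaction schedule from step 1 above (minimum depth 10, maximum 1000, step 110, 10 repetitions per depth) can be reproduced in a short Python sketch. Everything here is illustrative: the OTU counts are invented, the function names are made up, and the sketch assumes sequences are drawn at random without replacement, which may differ in detail from what multiple_rarefactions.py does.

```python
import random


def rarefaction_depths(min_depth, max_depth, step):
    """List the sampling depths from min_depth to max_depth in steps of `step`."""
    return list(range(min_depth, max_depth + 1, step))


def rarefy(otu_counts, depth, rng):
    """Draw `depth` sequences without replacement; return the counts per OTU."""
    pool = [otu for otu, count in otu_counts.items() for _ in range(count)]
    if depth > len(pool):
        raise ValueError("sample has fewer sequences than the requested depth")
    result = {}
    for otu in rng.sample(pool, depth):
        result[otu] = result.get(otu, 0) + 1
    return result


depths = rarefaction_depths(10, 1000, 110)
reps = 10
total_tables = len(depths) * reps

# Invented sample of 1000 sequences spread unevenly over four OTUs.
sample = {"otu_a": 600, "otu_b": 300, "otu_c": 90, "otu_d": 10}
rng = random.Random(0)  # fixed seed so repeated runs give the same draws

for depth in depths:
    observed = [len(rarefy(sample, depth, rng)) for _ in range(reps)]
    print(depth, sum(observed) / reps)  # mean number of OTUs seen at this depth
```

Plotting the mean number of OTUs against depth gives a rarefaction curve; when the curve flattens, deeper sampling is revealing few new OTUs.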
Downloading Files Using Cyberduck
Download the Centralia_Case_Study folder with all of your data from your EC2 instance to your Desktop using Cyberduck.
1. Start Cyberduck. It is in your Applications folder. Cyberduck allows you to transfer files between your computer and the
Amazon EC2 instance you are running.
2. Click on “Open Connection” in the upper left of the window, and choose SFTP from the top drop-down menu.
3. Type the IP address for your EC2 instance (it is still on the blackboard) into the Server field in Cyberduck.
4. Type ubuntu (exactly as written!) as your username. You do not need a password.
5. Click on ‘Use Public Key Authentication’ at the bottom of the Cyberduck window and then navigate to and choose your key pair (it should be on your desktop as QIIME-1.pem). Your Cyberduck window should now look something like the figure on the left.
6. If it does, click on “Connect”. Then click on ‘allow’. A new window will open up in Cyberduck when it has connected to the server. To copy all of your Centralia files to your desktop, simply drag and drop your Centralia_Case_Study folder onto the desktop of your computer. Say yes to replace the current folder. Copying these files may take a while, so go take a break if it does.
7. Find your Centralia_Case_Study folder on your desktop and open it, and then open the rarefaction_plots folder. Double
click on rarefaction_plots.html to open it. You should see a browser window that looks like this:
From the “Select a Metric” drop-down, choose PD_whole_tree, and from the “Select a category” drop-down, choose “Sample ID”. You will now see Faith’s Phylogenetic Diversity rarefaction curves for all three sample sites. Which sample has the highest species richness? Do you believe we have sufficiently sampled this location in order to see all of its microbial diversity? Justify your answer.
8. Change the metric to “Observed Species”. Does this metric support your conclusions in step 7? Explain.
9. Finally, change the metric to “Heip’s”. Which microbial community shows the greatest evenness? How does species evenness
differ from species richness?