Download How to do phylogenetic analysis (Mac or Windows)

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Young, Ned 1-8
Phylogenetic Analysis
Collecting the sequences you will use:
1. Include all of the close relatives of a particular protein or other gene.
2. BLAST is a quick way to collect sequences.
3. You will also need outgroup sequences.
4. For each gene, make a fasta file of the sequences, placing your most distant
outgroup first.
5. Example of FASTA format:
a. >Peloba_ace
QLMDQTNPLSEITHKRRLSALGPGGLTRERAGFEVRDVHPTHY
GRVCPIETPEGPNIGLIASLSTYARINEHGFV
ETPYRVVRNGKVTNEIRYFSRPEEGGHAIAQANAPLDEEGRFV
NELVNARQNGEFVLIQREEIGLMDVSPKQLVS
VAASLIPFLENDDANRALMGSNMQGQN
b. >Peloba_car
MAYSVANNQLLRKHFSTIKGIIEIPNLIDIQKNSYKRFLQAGLPP
SARKNIGLEAVFRSVFPIRDFSETCSLDYV
SYSLGTPKYDVGECHQRGMTFAAPVKVCVRLVSWDVDKESG
VQAIRDIKEQEVYFGEIPLMTENGTFIINGTERV
IVSQLHRSPGVFFDHDKGKTHSSGKILYNARVIPYRGSWLDFDF
DHKDLLYVRIDRRRKLPATVLLKALGYSAEE
LLNYYYQVETVTVDGDSFKKKVNLDLLAGQRASTDVLGNDG
EIIVKANRKFTKAAIRKLADSNVEYIPVSEEEIV
GKVASTDIVDVSTGE
6. Make sure that they are all protein (or all DNA).
7. Use unique 10-letter identifiers.
a. The program ProtTest, and the PHYLIP format in general, only the
first 10 characters of the taxon names are extracted from the FASTA
file
b. Although it would be convenient in some ways, keeping the whole
NCBI heading as the sequence identifier is cumbersome.
c. Some of the programs also complain if the identifiers are purely
numeric.
8. You can edit these names later when you get to the tree displaying stage.
Alignment:
1. Upload the FASTA file(s) to the T-Coffee web server
a. http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
b. If you have over 30 sequences, you can use the MUSCLE web server,
instead.
Young, Ned 2-8
2. If the order in which you arranged the genes is informative, keep it by making
sure the alignment program is using "input order" for its output order.
3. Run.
4. Capture the output in clustalw.aln format (you can use other formats, too, but
.aln is human-readable).
Examination and trimming of the alignment:
1. Open a sequence editor, such as Mesquite
a. http://mesquiteproject.org/mesquite/mesquite.html
2. Open the T-Coffee output
3. Mesquite immediately prompts you to save the file. DO NOT do this.
4. Check the Mesquite Log window to see if there are any messages like this:
a. ERROR: nothing written for character 466 taxon 9.
5. This probably means that Mesquite has changed a U to a ! character, which
will be saved as a - (gap) character.
6. Edit these to X characters.
7. Save the file in NEXUS format.
8. Now examine the quality of the alignment.
9. If it looks bad, repeat the alignment with different parameters or different
sequences.
a. Many alignments are poor at the ends, where only a few sequences
extend.
b. Trim away leading and trailing regions unless 60% of the sequences
are present.
Choosing a model of molecular evolution:
1. Select the model of molecular evolution that is best for your data, to do any
more sensitive phylogenetic method.
a. Anything other than parsimony
2. For protein phylogenies, you will run ProtTest, using a Phylip file as input.
a. http://darwin.uvigo.es/software/prottest.html
3. In Mesquite, export as a Phylip (protein) file and check the box for
"interleaved".
4. Launch ProtTest.
5. Select this Phylip file for it to use.
6. ProtTest takes from a few minutes up to several hours, depending on the
number of sequences.
a. If ProtTest is taking too long, and you want a quicker estimate of the
phylogeny, you can try a parsimony analysis, which doesn't require a
model testing phase.
7. Save results to file, including the complete model details.
8. If you're doing DNA, there are programs other than ProtTest with analogous
function.
a. Use MrModelTest2 (http://people.scs.fsu.edu/~nylander/), rather than
ModelTest, because it restricts itself to models that MrBayes can use.
Young, Ned 3-8
Bayesian phylogenetic analysis:
1. Now you will do a Bayesian phylogenetic analysis, using MrBayes 3.1
a. http://mrbayes.csit.fsu.edu/
2. Use the NEXUS file you made before, but insert a MrBayes 'block'.
3. If necessary, re-open your NEXUS file in Mesquite.
4. "Export” the file using the “Export NEXUS for MrBayes” option, which
opens up a window with a suggested MrBayes block.
a. You can substitute your MrBayes block, just by pasting it in the box.
5. When exporting, you need place the file in the same folder with the MrBayes
application.
6. Start MrBayes.
7. Type "execute xxxxx.nex".
8. When MrBayes has finished, it makes several files.
9. Move them out of the MrBayes folder and into the working folder you were in
before.
10. One of these is suffixed .mcmc.
11. Open this file.
12. In order to check whether it ran long enough, look at the last column, where
the StDev values are.
a. If they drop below 0.01 by the end of the analysis, you're OK.
b. If not, you need to repeat it and specify more generations.
i. You'll also have to change your burnins so that, when
multiplied by samplefreq you get 1/4 of the ngen value.
13. Save the contents of the MrBayes window
14. Cut and paste all of the MrBayes output to a file and save.
Parsimony analysis:
1. When you have a lot of sequences, Bayesian analysis can take several days.
2. Parsimony is faster.
Neighbor-joining:
1. NJ is more error-prone and not recommended, except for huge data sets, but it
is fast.
Maximum Likelihood:
1. Identical to Bayesian analysis in its goals and strategy, but uses different math
2. Usually finds the same tree
3. Getting bootstrap values is a separate process and takes longer than finding
the tree.
Tree manipulation and figure drawing in Mesquite:
1. Getting trees into Mesquite:
a. In Mesquite re-open the file you used for MrBayes.
b. One of the files MrBayes makes is suffixed 'out.con' or 'con'.
c. This has the tree.
d. Select File->Include File and choose the 'out.con' tree.
Young, Ned 4-8
2.
3.
4.
5.
6.
7.
8.
e. Select Taxa&Trees->New Tree Window->Stored Trees.
i. Note: if you have changed the taxon names in Mesquite since
you ran MrBayes, this doesn't work.
f. This will bring in two versions of the tree.
i. The first has posterior probabilities; the second doesn't.
ii. You can toggle between the two using the arrows at the upper
left of the tree window.
Re-root the tree:
a. You can use the re-root tool to change the location of the tree's root.
b. Choose the branch that separates your ingroup sequences from your
outgroup sequences.
Rotating branches:
a. You can also change the appearance of the tree by rotating branches
with the interchange tool .
b. This in no way changes the information content of the tree.
c. It just makes cosmetic changes in the way the tree is displayed.
Other changes:
a. Tree form->Square Tree
b. Cut corners
c. Turn off 'Use Suggested Size'
d. Orientation->right (or left)
e. Turn 'Use Suggested Size' back on
f. Line Width->1 (or 2)
g. Branches proportional to length
h. Turn "Show scale" off
Select Save Macro for Tree Drawing.
a. It will save you time on the next tree you make.
Store Tree
Save data file
Print tree.
Notes:
Models of molecular evolution:
Why is using a correct model of molecular evolution better than using parsimony?
 Under some conditions, parsimony chooses the wrong tree (long branch
attraction).
 Methods using a model are more precise and result in fewer exact ties,
generally.
o For example, changes between two chemically similar a.a.’s can be
used as “similarity”. Under parsimony all differences are simply
“different”.
o Models usually choose a single best tree, whereas parsimony usually
chooses a large set of most parsimonious trees.
 Branch length estimates are more accurate with a model.
What does a model consist of?
 Substitution matrix
Young, Ned 5-8
o

For proteins, this is the (observed) probability of one amino acid
changing to another.
o For DNA, it is the probability of one base changing to another.
Site-to-site rate variation
o Some sites don’t vary.
o Among those that do, they vary at different rates.
Outgroup sequences:
1. Choose multiple outgroup sequences.
2. If you are wrong, and one of them turns out to be part of your ingroup, you
can still achieve a correct tree rooting by rooting between the ingroup and the
majority of the outgroup sequnces.
3. You want them to be clearly outside the group you are studying, but as close
as possible.
4. If they are too far out, it will wreak havoc with your alignment.
5. After seeing your alignment or tree, you may want to select different outgroup
sequences and repeat the whole procedure.
Combining multiple genes into one data set:
1. For the first gene's FASTA file, use unique 10-letter identifier names for the
taxa.
2. For subsequent genes, the names are not important, so long as the order
of taxa is the same.
a. When you combine the FASTA files, the names in the first will be
used and the others will be discarded.
3. In every FASTA file, all of the taxa must be represented, even if no sequence
is available.
a. The alignment programs will want some sequence, put in a short 3-4
a.a. placeholder sequence, such as WWW.
4. After the first FASTA file, make sure that the subsequent ones are in exactly
the same order and have sequences from the exact same set of taxa
a. Nothing added, Nothing removed.
5. After your sequences have been aligned, go in and replace the placeholder
WWW sequences, if any, with dashes, before you trim away leading and
trailing regions.
6. After you have made the Phylip versions of each FASTA file, combine the
Phylip files in a sequence editor, as follows.
a. Paste the second file into the first
b. Combine the seq. lengths of the 2 genes
c. Write at top of file, replacing the first length value.
d. Remove the second set of taxon names.
e. Continue until all genes have been combined.
f. Save file, using Mac line breaks.
7. You can now use this file in ProtTest or import it into Mesquite.
Bayesian analysis:
Young, Ned 6-8
1. Searches many tree topologies for the best trees that are consistent with both
the model and the data.
2. Optimality criterion:
a. Maximizes the probability of the tree, given the data (aligned
sequences) and the model of molecular evolution.
3. Bayesian analysis is the only one that, in the course of finding the tree,
provides clade confidence scores for each node.
a. These values are similar to the 'bootstrap values' produced by other
phylogeny methods, but in Bayesian analysis they are called posterior
probabilities.
b. The range from 0 to 1, rather than the 0 to 100 range for bootstraps.
Sample MrBayes block:
1. Here is an example:
begin mrbayes;
set autoclose=yes nowarn=yes;
prset aamodelpr=fixed(rtrev);
lset rates=gamma;
mcmc ngen=20000 samplefreq=10 file=5genesMrB.out;
sump burnin=500;
sumt burnin=500;
end;
2. After fixing this up a bit, you can cut it and paste it into your NEXUS file.
3. Modify the block as follows:
a. Put in the aamodel that ProtTest chose, instead of rtrev.
i. poisson, jones, dayhoff, mtrev, mtnam, wag, cprev, vt, or
blosum.
b. If, +G was specified, you include gamma in the model by leaving in
the ‘lset rates=gamma’ line
c. If +IG or +I was specified, change gamma to invgamma or inv.
4. Put in a filename you like for output.
5. Copy this whole file to the clipboard.
6. For large numbers of taxa, you may want to change three of the lines to read:
mcmc ngen=200000 samplefreq=100 file=5genesMrB.out;
sump burnin=500;
sumt burnin=500;
a. The burnin values don't change because they are in 'samples' not
'generations'.
Mixed model analysis:
1. MrBayes allows you to do a mixed model analysis.
2. You can include DNA sequence data of some genes and protein sequence data
for other genes in the same analysis.
3. Review the method for combining gene data sets.
4. If it is a mixed-model run, you can't open it directly in Mesquite, but you can
open it in a text editor such as BBEdit and change the header so it works.
Young, Ned 7-8
MrBayes block, mixed-model analysis:
1. You need to specify which columns belong to which model.
2. In the FORMAT line, you needed something like:
a. DATATYPE=MIXED(PROTEIN:1-1524, DNA:1525-3056)
3. In the mrbayes block, you need something like:
begin mrbayes;
set autoclose=yes nowarn=yes;
charset protein = 1-1524;
charset 16S = 1525-3056;
partition favored = 2: protein, 16S;
set partition = favored;
prset applyto=(1) aamodelpr=fixed(rtrev);
lset applyto=(1) rates=gamma;
prset applyto=(2) statefreqpr=dirichlet(1,1,1,1);
lset applyto=(2) nst=6 rates=invgamma;
mcmc ngen=50000 samplefreq=25 file=cat6.out;
sump burnin=500;
sumt burnin=500;
end;
Parsimony:
1. If the analysis is taking a long time, you might want to get a quick answer
using a parsimony analysis.
2. Parsimony treats identical a.a's or bases as evidence of common ancestry,
weights all changes equally and ignores molecular evolution models.
3. The program PAUP* can do parsimony analysis. It uses a NEXUS file to
save.
4. Substitute a PAUP block for the MrBayes block with a text editor.
a. For example:
BEGIN PAUP;
SET MAXTREES = 150000;
DEFAULTS HSEARCH ADDSEQ=RANDOM NORSTATUS;
HSEARCH ADDSEQ = RANDOM NREPS = 10 RSEED =1234567;
SAVETREES BRLENS = YES;
DESCRIBE 1;
END;
5. Open Mesquite and include the tree, etc., just as described in the Bayesian
analysis, except this time the tree will be in the .tre file just created by PAUP.
How to fix when a probability score is missing:
1. This happens when you change the root.
2. You can add the number by hand.
3. Jot down all of the sequences in the group whose probability value is missing.
4. Look at the file you made from the output in the MrBayes window.
Young, Ned 8-8
5. Look down near the end where it says,"List of taxa in bipartitions:" This gives
reference numbers for all of the sequences.
6. Jot down the reference numbers for the sequences are in the group whose
probability value is missing.
7. Look in the next block of text, "Summary statistics for taxon bipartitions: ".
a. This file uses stars and dots to represent the partitions on the tree.
8. Somewhere in that table is a row representing the group whose probability
you need. It will have the group members as stars and all of the other
sequences as dots OR VICE VERSA.
9. Write down the corresponding probability value and go back to Mesquite.
10. Using the name tool click on the naked branch and enter the value.
Changing the title of a tree:
1. The title that prints with your tree figure is Mesquite's "name" for the tree.
2. To change it, use
a. Tree>Save Tree As.
UNIX-style pathnames:
1. If the directory (folder) is a subdirectory of the one the application is in, use:
a. execute subdirectory/file.nex
2. If the directory (folder) is somewhere else in your user's directory, use ~ to
represent your user's main directory, for example:
a. execute ~/Geobacter/Two-component/ForPAUP/file.nex