Phylogenetic Analysis
YTSLLLSRQ
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM
Phylogenetics
What do I need to do?
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
Phylogenetics
Get related sequences of interest
So you have a sequence… now what?
MKILLLCIIFLYYVNAFKNTQKDGVSLQILK
KKRSNQVNFLNRKNDYNLIKNKNPSSSLK
STFDDIKKIISKQLSVEEDKIQMNSNFTKDL
GADSLDLVELIMALEEKFNVTISDQDALKI
NTVQDAIDYIEKNNKQ
#1: What is it?
Does the source organism have its own genome database?
Unknown/No: BLAST @ PubMed (NCBI)
Yes: BLAST @ genome database (GeneDB, PlasmoDB, etc.)
Why start with a genome-specific database?
BLAST results there link to:
Genome location/structure
Strain variability
Expression data
Pathway data
PubMed BLAST
Blastp
Protein families – Conserved Domains
BLAST Hits
Downloading sequences – FASTA format
Getting sequences – FASTA format
Saving and editing FASTA files
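A minimal sketch of reading and re-saving FASTA records outside of a text editor. The function names `parse_fasta` and `format_fasta` and the header text are made up for illustration; the sequence is the first two lines of the example protein shown earlier.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records, width=60):
    """Write records back out, wrapping sequence lines at `width` characters."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical header; sequence taken from the example protein above.
example = """>query hypothetical header
MKILLLCIIFLYYVNAFKNTQKDGVSLQILK
KKRSNQVNFLNRKNDYNLIKNKNPSSSLK
"""
records = parse_fasta(example)
```

Editing a record then amounts to changing the header or sequence string in `records` and writing `format_fasta(records)` to a new file.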
Phylogenetics
Perform multiple sequence alignments
Pair-wise sequence alignment
Global
GYTSLLLSRQNED--G
G--SLLLSHK-D-HTG
Overlap
GYTSLLLSRQNEDG
---GSLLLSHK-D-HTG
Local
TSLLLSR
TSLLLSH
Smith-Waterman
Aligning 2 sequences globally
Dynamic programming fills a score matrix cell by cell (Needleman-Wunsch for global alignment; Smith-Waterman is the local counterpart).
YTSLLLSRQ
YASLLWRQA
[Figure: score matrix for YTSLLLSRQ (columns) against YASLLWRQA (rows); the first row and column hold cumulative gap penalties -4, -8, -12, ... -36]
YTSLLLSRQ
YASLLW-RQA
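A rough sketch of the global dynamic-programming alignment of the two example sequences. The gap penalty of -4 matches the first row and column of the slide's matrix; the match and mismatch scores (+4 and -2) are assumptions standing in for whatever substitution matrix the slide used, so the scores and the exact gap placement need not match the slide.

```python
def needleman_wunsch(a, b, match=4, mismatch=-2, gap=-4):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("YTSLLLSRQ", "YASLLWRQA")
```

Swapping the fixed match/mismatch scores for a lookup in BLOSUM or PAM is the only change needed to score protein alignments more realistically.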
Multiple sequence alignment
Progressive
Align 2 closest sequences
YTSLLLSRQ
YASLLW-RQA
Add in next closest sequence
YTSLLLSRQ
YASLLW-RQA
PASIILSRQA
Continue adding….
YTSLLLSRQ
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM
Highly dependent on the initial matches.
Multiple sequence alignment
Iterative
Initial MSA Score (low)
YTTSLLLSRQ-
YATSLLWRQA-
PASIILSRQA-
GRTSIVLTRQMA
Optimize MSA score
YTTSLLLSRQ-
YATSLLW-RQ-A
PA-SIILSRQ-A
GRTSIVLTRQMA
Probabilistic methods don’t always generate the same answer
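One common way to put a number on an "MSA score" is the sum-of-pairs score, sketched below. This is an illustration rather than the scoring any particular program uses; the match, mismatch and gap values and the toy DNA rows are assumptions.

```python
from itertools import combinations


def sum_of_pairs(rows, match=1, mismatch=0, gap=-1):
    """Score an alignment by summing pairwise scores in every column."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap against gap carries no information
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


# Toy rows: the same three sequences with the gap placed well and badly.
good = ["ACGT", "ACGT", "A-GT"]
poor = ["ACGT", "ACGT", "AGT-"]
good_score = sum_of_pairs(good)
poor_score = sum_of_pairs(poor)
```

An iterative aligner repeatedly proposes changes such as moving a gap and keeps those that raise a score of this kind.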
Multiple sequence alignment programs
MSA alignment type: progressive, iterative
Pair-wise alignment type: local, global
Progressive: POA, ClustalX, T-Coffee
Iterative: Dialign, HMMs, GAs
Multiple Sequence Alignments
POAVIZ – progressive local
CLUSTAL – progressive global
POAVIZ
CLUSTALX – Parameters
CLUSTALX – Protein Weight Matrices
• 1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches.
• 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
• 3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger dataset.
BLOSUM (BLOck SUbstitution Matrix)
BLOSUM62 – Built from conserved blocks of aligned proteins; sequences at least 62% identical are clustered and counted as one, so the substitution rates are observed directly rather than extrapolated
BLOSUM99 (>99% identity) -----> BLOSUM62 (>62% identity)
Pros
Best bet for distantly divergent sequences
PAM (point accepted mutation)
Gather the substitution rates for PAM1 (99% identical sequences, 1 point mutation per 100 amino acids)
Assuming that those substitution rates are constant over time, extrapolate to greater distances (PAM n = n point mutations per 100 amino acids)
PAM1 (99% identity) -----> PAM250 (20% identity)
Pros
Very good for closely related sequences
Cons
Rare mutations under-represented
Substitution rates not constant over time
(both are problems for phylogenetic estimation)
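The PAM extrapolation can be sketched as repeated multiplication of a PAM1-style mutation probability matrix with itself. The 3-letter matrix below is invented for illustration (the real PAM1 is a 20 x 20 matrix estimated from Dayhoff's data), and `mat_pow` is a hypothetical helper name.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1" over a 3-letter alphabet: each residue stays unchanged with
# probability of about 0.99 per evolutionary step (values are made up).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]
toy_pam250 = mat_pow(toy_pam1, 250)
```

The diagonal of `toy_pam250` is far below 0.99, which is the sense in which PAM250 describes much more divergent sequences; the result is only as good as the assumption that the PAM1 rates hold over the whole time span.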
CLUSTALX
CLUSTALX - Aligning
CLUSTALX – Alignment view
CLUSTAL vs POAVIZ (global vs local)
Phylogenetics
Edit alignment
BioEdit – Alignment manipulation
Open the “.aln” file
“Back colored view” gives more contrast
Select “Edit” from the mode dropdown
Select “Insert” so that you don’t accidentally lose part of your sequence
Then select the unaligned beginning (or end) sequence and delete it….
Now save as a different file (.fasta)
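The same trimming of unaligned ends can be sketched in code. This is an illustration only, not what BioEdit does internally; it assumes "unaligned" means leading or trailing columns in which at least one sequence has a gap, and the function name and toy rows are made up.

```python
def trim_ragged_ends(rows, gap="-"):
    """Drop leading and trailing alignment columns that contain a gap.

    Internal gaps are kept; only the ragged ends are removed.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all rows of an alignment must have the same length")

    def is_full(col):
        return all(r[col] != gap for r in rows)

    start = 0
    while start < length and not is_full(start):
        start += 1
    end = length
    while end > start and not is_full(end - 1):
        end -= 1
    return [r[start:end] for r in rows]


# Toy alignment with ragged ends on both sides.
trimmed = trim_ragged_ends(["--ACG-T-", "MKAC-GTA", "-KACGGT-"])
```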
Phylogenetics
Estimate phylogenetic relationships
Tree terminology
root
common ancestor (node, branch point)
Operational taxonomic units (OTUs, leaves): A, B, C, D, E, F, G
Topology 1: monophyletic
Topology 2: paraphyletic
Topology 3: polyphyletic
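Sequence homology – orthologues and paralogues
Ancestral gene -> duplication -> A and B (last common ancestor) -> speciation -> Human A, Rat A, Human B, Rat B
Orthologues (separated by speciation): Human A and Rat A; Human B and Rat B
Paralogues (separated by duplication): A and B, e.g. Human A and Human B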
Methods of estimating phylogenetic relationships
Character-based
Maximum Parsimony (MP)
Distance-based
Neighbor-Joining (NJ)
Minimum Evolution (ME)
Probabilistic
Maximum likelihood (ML)
Bayesian inference
Methods of estimating phylogenetic relationships
Maximum Parsimony (MP)
Sequences of Taxa1 to Taxa4: AAG, AAA, GGA, AGA
Tree ((AAG, AAA), (GGA, AGA)): 3 changes required (best tree)
Tree ((AAG, GGA), (AAA, AGA)): 4 changes required
Tree ((AAG, AGA), (AAA, GGA)): 4 changes required
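The change counts for the three possible trees can be checked with Fitch's small-parsimony algorithm, sketched below. The taxon labels `t1` to `t4` and their assignment to the four sequences are assumptions for illustration; the trees are written as nested tuples with an arbitrary root, which does not affect the count.

```python
def fitch_changes(tree, site):
    """Return (state_set, changes) for one alignment column.

    `tree` is a leaf name (str) or a (left, right) tuple;
    `site` maps each leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], site)
    right_set, right_cost = fitch_changes(tree[1], site)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed somewhere below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_changes(tree, site)[1]
    return total


# Hypothetical labels for the four sequences on the slide.
seqs = {"t1": "AAG", "t2": "AAA", "t3": "GGA", "t4": "AGA"}
trees = {
    "t1+t2 | t3+t4": (("t1", "t2"), ("t3", "t4")),
    "t1+t3 | t2+t4": (("t1", "t3"), ("t2", "t4")),
    "t1+t4 | t2+t3": (("t1", "t4"), ("t2", "t3")),
}
scores = {name: parsimony_score(tree, seqs) for name, tree in trees.items()}
```

Maximum parsimony simply keeps the topology with the smallest score, here the first one.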
Methods of estimating phylogenetic relationships
Distance-based
Neighbor-Joining (NJ) Method
The NJ method clusters neighboring species that are joined by one node. It does not evaluate all the possible tree topologies, so it is not guaranteed to find the optimal tree.
Minimum Evolution (ME) Method
Estimates the total branch length of each topology exhaustively, then chooses the topology with the least total branch length. Time intensive for large numbers of taxa.
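A sketch of the first clustering step of neighbor-joining: pick the pair that minimizes the Q criterion and compute the two branch lengths to the new node. The five-taxon distance matrix is a toy example, not data from these slides, and `nj_first_join` is a hypothetical name. A full implementation would replace the joined pair with the new node and repeat.

```python
def nj_first_join(dist):
    """Return ((i, j), (len_i, len_j)) for the first neighbor-joining step.

    `dist` is a symmetric distance matrix (list of rows) for 3 or more taxa.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("neighbor-joining needs at least three taxa")
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: small for pairs close to each other
            # and far from everything else.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    len_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return (i, j), (len_i, len_j)


# Toy distances between five taxa (indices 0 to 4).
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, lengths = nj_first_join(toy)
```

Note that the pair with the smallest raw distance (taxa 3 and 4) is not the one joined first, which is why NJ uses the Q criterion rather than plain distance.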
Methods of estimating phylogenetic relationships
Prob ( data | model + tree )
Probabilistic methods
Maximum likelihood (ML)
More likely topology found
Search all possible topologies to optimize probability
Bayesian inference
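$P(\text{you getting picked}) = \dfrac{\text{prior}(\text{you}) \times \text{model}(\text{you})}{\sum_{\text{people in the class}} \text{prior}(\text{person}) \times \text{model}(\text{person})}$
Prior information and a model for selection: need both for everyone in the class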
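The classroom analogy can be worked through numerically: multiply each person's prior by the model's probability for that person, then divide by the sum over the whole class. The three-person class and all the numbers below are invented for illustration.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Normalize prior x likelihood over all candidates (Bayes' theorem)."""
    joint = {name: priors[name] * likelihoods[name] for name in priors}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


# Hypothetical class of three with equal priors.
priors = {
    "you": Fraction(1, 3),
    "alice": Fraction(1, 3),
    "bob": Fraction(1, 3),
}
# Hypothetical "model for selection": you are twice as likely to be picked.
likelihoods = {
    "you": Fraction(1, 2),
    "alice": Fraction(1, 4),
    "bob": Fraction(1, 4),
}
post = posterior(priors, likelihoods)
```

In Bayesian phylogenetics the "people in the class" are all possible trees and parameter values, which is why the sum is approximated by sampling instead of being computed directly.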
Methods of estimating phylogenetic relationships
Character
Maximum Parsimony (MP)
Distance
Neighbor-Joining (NJ)
Minimum Evolution (ME)
Probabilistic
Maximum likelihood (ML)
Bayesian inference
Estimating Phylogenetic Relationships
MEGA
MrBayes
MEGA – Molecular Evolutionary Genetics Analysis
First we have to get a MEGA formatted file made
Select ‘All Files [“ “]’ from the dropdown ‘Files of Type’ menu
Then choose the ‘.aln’ file you just made…
MEGA – making a MEGA formatted file
MEGA recognizes that you didn’t enter a MEGA formatted file… Click ‘OK’
Now click on the ‘Convert to MEGA format’ button at the top left-hand side of the screen
Make sure that the file is the right one and that the formatting is correct. Click ‘OK’.
Now we have to make sure that the file looks good before starting any analysis
-Make sure all sequences are the same length
-Remove all traces of the consensus marks
When the file looks good, save it and close both text formatter windows…
Now try ‘Activating the data file’ again, this time with the ‘.meg’ file you just made…
MEGA – input a MEGA formatted file
Make sure that the correct sequence type is selected
Make sure that the correct characters are selected for missing data and gaps.
You should now see the ‘sequence data explorer’
Minimize this window and you can begin analyzing your data…
MEGA – choose an algorithm
From the phylogeny window you can choose an appropriate algorithm.
In this case we’ll use Minimum Evolution.
MEGA – set parameters
There are two major things to think about first: ‘Model’ and ‘Rates among Sites’
In this example, I’ll use the Poisson model with gamma (γ=2.0) rate variation
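To give a feel for what these two choices do, here is a sketch of the standard Poisson-corrected distance and its gamma-corrected variant for a pair of aligned protein sequences. It assumes the 2.0 above is the shape parameter of the gamma distribution; the function names are made up and this is not MEGA's code.

```python
import math


def p_distance(a, b, gap="-"):
    """Fraction of differing residues, ignoring columns with a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson correction: d = -ln(1 - p)."""
    return -math.log(1.0 - p)


def gamma_distance(p, shape=2.0):
    """Gamma-corrected distance: d = a * ((1 - p) ** (-1 / a) - 1)."""
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)


# Two of the example sequences from the alignment slides.
p = p_distance("PASIILSRQA", "GRSIVLTRQM")
d_poisson = poisson_distance(p)
d_gamma = gamma_distance(p, shape=2.0)
```

Both corrections push the distance above the raw proportion of differences to account for multiple substitutions at the same site; allowing rates to vary among sites (gamma) pushes it further.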
Substitution models (nucleic acid)
In order of increasing sophistication (B = base frequencies, si = transition frequencies, sv = transversion frequencies, E = equal, V = variable, Sym = symmetrical substitution, G->A = A->G):
Identity
Kimura 2-parameter: B(E), si(V), sv(V)
Tamura-Nei: B(V), si(V), sv(E)
Kimura 3-parameter: B(V), si(E), sv(V)
General Time Reversible: B(V), Sym
Rate variation across sites
Gamma ( Γ ) distribution of rate variation among sites
Proportion of Invariable Sites ( I )
Γ + I + GTR
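As a concrete case, the Kimura 2-parameter distance treats transitions and transversions separately: with P and Q the observed proportions of transitional and transversional differences, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). A minimal sketch follows, with made-up toy sequences; it is only defined while both terms inside the logarithm stay positive.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    sites = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy pair: one transition (A->G) and one transversion (A->C) in 10 sites.
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
```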
Substitution models (amino acid)
In order of increasing sophistication:
Identity: no model
Poisson: probabilistic substitution rates
PAM, JTT, mtREV: extrapolation of observed substitution rates
Site specific residue frequencies: high dimensional model, but requires large dataset
Mixture models: each site can choose its own substitution model, coupled with maximum likelihood probability estimations or MCMC/Bayesian methods
MEGA – choose tree test options
Now switch over to the ‘Test of Phylogeny’ tab.
In order to determine the validity of your tree you’ll need to bootstrap it.
Since our sequence isn’t very long, only a couple hundred replications are needed.
Now click the check button, then click ‘Compute’ in the main window…
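Bootstrapping resamples the alignment columns with replacement, rebuilds the tree for each replicate, and reports how often each grouping reappears. A sketch of the resampling step only, under the assumption that the first example row is padded with a trailing gap so all rows have equal length; tree building for each replicate is left out.

```python
import random


def bootstrap_replicate(rows, rng):
    """Resample alignment columns with replacement, keeping rows in step."""
    length = len(rows[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picks) for row in rows]


# Example rows from the alignment slides (first row padded, see above).
alignment = ["YTSLLLSRQ-", "YASLLW-RQA", "PASIILSRQA", "GRSIVLTRQM"]
rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(200)]
```

Each replicate has the same number of sequences and columns as the original, but some columns appear several times and others not at all.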
MEGA – edit your tree
Your tree should appear. Not a very good one in this case. Why? Because the sequences were too similar.
The icons on the left allow you to reroot, flip branches, etc.
You can also change the format of the tree
But let’s also compute a condensed tree (select that from the ‘Compute’ menu) using a cutoff of 50%.
MEGA – interpret the tree
Four of the sequences cluster indistinguishably together, while a single other sequence stands out. If we look back at our alignments we could predict this…
MrBayes – Making a NEXUS (.nex) file
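A minimal sketch of writing an aligned set of sequences as a NEXUS data block of the kind MrBayes reads. The function name and the taxon names are made up; names are assumed to contain no spaces, and the first example row is assumed to be padded with a trailing gap.

```python
def to_nexus(records, datatype="protein"):
    """Format (name, aligned_sequence) pairs as a NEXUS data block."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    nchar = lengths.pop()
    width = max(len(name) for name, _ in records)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(records)} nchar={nchar};",
        f"  format datatype={datatype} gap=- missing=?;",
        "  matrix",
    ]
    for name, seq in records:
        lines.append(f"  {name.ljust(width)}  {seq}")
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


# Hypothetical taxon names for two of the example rows.
nexus_text = to_nexus([("seq1", "YTSLLLSRQ-"), ("seq2", "YASLLW-RQA")])
```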
MrBayes – Running MrBayes
Phylogenetics
Interpret results correctly
Quality of aligned sequences
One bad egg
Sequence similarity (think Goldilocks: not too similar, not too divergent)
Use an appropriate model
Use an appropriate estimation method
Use appropriate parameters
Try different things and compare results wisely
Determine the validity of each part of your tree
Develop a model to explain your tree
how does it square with known information?
what can you learn from your sequences?
what can’t you learn from your analysis?
The Intelligent Consumer
(You don’t have to completely understand everything in order to use it properly, but it helps to have a rough idea…)
BLAST
- stochastic processes
- random walks
Sequence alignments
- Markov processes
- dynamic programming
- Viterbi, Forward, and Backward algorithms
Bayesian phylogenetic inference
- Bayes theorem
- Bayesian inference
- Metropolis algorithm
Many uses for multiple sequence analysis…
Protein family analysis
multiple sequence alignment -> profile -> profile–HMM (hidden Markov model)
Find new proteins with same domains
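The step from a multiple sequence alignment to a profile can be sketched as per-column residue frequencies. This is a plain frequency profile, not a profile-HMM, and the function names are made up; the rows are three of the example sequences from the alignment slides.

```python
from collections import Counter


def column_profile(rows, gap="-"):
    """Return one {residue: frequency} dict per alignment column."""
    profile = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        total = sum(counts.values())
        if total:
            profile.append({res: n / total for res, n in counts.items()})
        else:
            profile.append({})  # column made up entirely of gaps
    return profile


def consensus(profile):
    """Most frequent residue per column ('-' for all-gap columns)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)


rows = ["YASLLW-RQA", "PASIILSRQA", "GRSIVLTRQM"]
profile = column_profile(rows)
```

Scoring a new sequence against such a profile, instead of against a single sequence, is what lets family-based searches pick up distant members with the same domains.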
RNA secondary structure prediction
Protein secondary structure prediction
Protein structure prediction – homology modeling
Protein sequence with known structure
Aligned sequences with unknown structure
Comparative genomics