Trees from proteins I

MODELS OF PROTEIN EVOLUTION:
AN INTRODUCTION TO AMINO
ACID EXCHANGE MATRICES
Robert Hirt
Department of Zoology, The Natural History Museum, London
Inferring trees is difficult!!!
1. The method problem
[Figure: Dataset 1 analysed with Method 1 and with Method 2; which tree relates A, B and C?]
2. The dataset problem
[Figure: Dataset 1 and Dataset 2 analysed with Method 1; which tree relates A, B and C?]
From DNA/protein sequences to trees
1. Sequence data
2. Align sequences
3. Phylogenetic signal? Patterns -> evolutionary processes?
4. Choose between distance methods and character-based methods
– Distance methods: distance calculation (which model?), then an optimality criterion (LS, ME) or a single tree (NJ)
– Character-based methods: choose a method: MB (model?), ML (model?), MP (weighting? (sites, changes)?)
– Calculate or estimate best fit tree
5. Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
Agenda
• Some general considerations
– why protein phylogenetics?
– What are we comparing? Protein sequences - some basic features
– Protein structure/function and its impact on patterns of mutations
• Amino acid exchange matrices: where do they come from
and when do we use them?
– Database searches (Blast, FASTA)
– Sequence alignment (ClustalX)
– Phylogenetics (model based methods)
Proteins were the first molecular
sequences to be used for phylogenetic
inference
• Fitch and Margoliash (1967).
Construction of phylogenetic trees.
Science 155, 279-284.
Why protein phylogenies?
• For historical reasons - the first sequences
• Most genes encode proteins
• To study protein structure, function and evolution
• Comparing DNA and protein based phylogenies can be useful
– Different genes - e.g. 18S rRNA versus EF-2 protein
– Protein encoding gene - codons versus amino acids
Phylogenies from proteins
• Parsimony
• Distance matrices
• Maximum likelihood
• Bayesian methods
Evolutionary models for amino acid
changes
• All methods have explicit or implicit evolutionary models
• Can be in the form of simple formula
– Kimura formula to estimate distances
• Most models for amino acid changes typically include
– 20x20 rate matrix
– Correction for rate heterogeneity among sites (Γ [α] + pinv)
– Assume neutrality - what if there are biases, or non neutral changes - such as selection?
DNA->Protein: the code
• 3 nucleotides (a codon) code for one amino acid
(61 codons! 61x61 rate matrices?)
Character states in DNA and
protein alignments
• DNA sequences have four states (five): A, C, G, T,
(and ± indels)
•Proteins have 20 states (21): A, C, D, E, F, G, H, I, K,
L, M, N, P, Q, R, S, T, V, W, Y (and ± indels)
—> more information in DNA or protein alignments?
DNA—>Protein
• The code is degenerate:
20 amino acids are encoded by 61 possible codons (plus 3 stop
codons)
• Complex patterns of changes among codons:
– Degeneracy of the code: most amino acids are
coded by several codons
– Synonymous/non-synonymous changes
– Synonymous changes correspond to codon changes not
affecting the coded amino acid
-> more data/information in DNA?
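The degeneracy described above can be checked directly from the standard genetic code. The following is a minimal sketch; the names `CODE`, `translate`, `codons_per_aa` and `is_synonymous` are illustrative and not taken from the slides.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODE[dna[i:i + 3]] for i in range(0, usable, 3))


# Number of codons encoding each amino acid (stop codons excluded).
codons_per_aa = Counter(aa for aa in CODE.values() if aa != "*")


def is_synonymous(codon_a, codon_b):
    """True if both codons encode the same amino acid."""
    return CODE[codon_a] == CODE[codon_b]
```

For example, `codons_per_aa["L"]` counts the leucine codons and `is_synonymous("CTC", "CTG")` tests a third-position change within leucine.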
Codon degeneracy: protein alignments as a guide
for DNA alignments
Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser
Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser
Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser
Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser
GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC
GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC
GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT
GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA
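One way to use a protein alignment as a guide for the DNA alignment is to copy each protein gap onto the unaligned coding sequence as a three-nucleotide gap. The function below is a hypothetical sketch of that idea (the name `thread_gaps` and its interface are assumptions); it expects the DNA to be in frame and to contain exactly one codon per residue.

```python
def thread_gaps(aligned_protein, dna, gap="-"):
    """Return the DNA with a three-character gap wherever the protein has a gap."""
    if len(dna) % 3:
        raise ValueError("DNA length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if n_residues != len(codons):
        raise ValueError("need exactly one codon per aligned residue")
    codon_iter = iter(codons)
    # Walk along the aligned protein, emitting a codon or a gap triplet.
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)


aligned_dna = thread_gaps("EG-S", "GAAGGAAGC")
```

The codons themselves are never realigned, so the reading frame is preserved in the resulting DNA alignment.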
DNA->Protein: code usage
• Differences in codon usage can lead to large base
composition bias - in which case one often needs to
remove the 3rd codon position, the most bias-prone site…
and possibly the 1st
• Comparing protein sequences can reduce the
compositional bias problem
—> more information in DNA or protein?
Ask James for PUTGAPS…
Models for DNA and Protein evolution
• DNA: 4 x 4 rate matrices
– Easy to estimate (can be combined with tree search)
• Protein: 20 x 20 matrices
– More complex: time and estimation problems (rare changes?) ->
empirical models from large datasets are typically used
Proteins and amino acids
• Proteins determine shape and structure of cells and
carry out most catalytic processes - 3D
• Proteins are polymers of 20 different amino acids
• Amino acid sequences determine the structure
(secondary, tertiary…) and function of the protein
• Amino acids can be categorized by their side chain
physicochemical properties
– Polarity (hydrophobic versus hydrophilic, +/- charges)
– Size (small versus large)
Amino acid physico-chemical
properties
– Major factor in protein folding
– Key to protein functions
—> Major influence in pattern
of amino acid mutations
As with transitions (Ts) versus transversions (Tv) in DNA
sequences, some amino acid changes are more common than
others: very important for sequence
comparisons (alignment and phylogenetics!)
Small <-> small changes are more common than small <-> big changes
Estimation of relative rates of residue
replacement (models of evolution)
• Differences/changes in protein alignments can be pooled and
patterns of changes investigated.
– Dependent on the selected sequences, the alignment and the counting method! Empirical
models!
• Patterns of changes give insights into the evolutionary processes
underlying protein diversification -> estimation of evolutionary
models
– How general is such a model?
• Choice of protein evolutionary models can be important for the
sequence analysis we perform (database searching, sequence alignment,
phylogenetics)
Amino acid substitution matrices
based on observed substitutions:
“empirical models”
• Summarise the substitution pattern from large
amount of existing data
• Based on a selection of proteins
– Globular proteins, membrane proteins?
– Mitochondrial proteins?
• Depends on the counting method and on which
changes are recorded
– tree dependent/independent
– restriction on the sequence divergence
Amino acid physico-chemical
properties
– Size
– Polarity
– Hydrophilic (polar, +/- charges)
– Hydrophobic (non polar)
Taylor’s Venn diagram of amino acid
properties
[Figure: Venn diagram placing the 20 amino acids in overlapping sets: Tiny, Small, Aliphatic, Hydrophobic, Polar, Charged, Aromatic; cysteine appears twice, as CS-S and CS-H]
Amino acids categories 1:
Doolittle (1985). Sci. Am. 253, 74-85.
–Small polar: S, G, D, N
–Small non-polar: T, A, P, C
–Large polar: E, Q, K, R
–Large non-polar: V, I, L, M, F
–Intermediate polarity: W, Y, H
Amino acids categories 2
–Sulfhydryl: C
–Small hydrophilic: S, T, A, P, G
–Acid, amide: D, E, N, Q
–Basic: H, R, K
–Small hydrophobic : M, I, L, V
–Aromatic: F, Y, W
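The second grouping above can be turned into a lookup table, for example to colour an alignment or to flag whether a substitution stays within one physico-chemical category. This is a minimal sketch; `CATEGORIES`, `CATEGORY_OF` and `is_conservative` are illustrative names, and "conservative" here only means "same category in this particular grouping".

```python
# "Amino acids categories 2" from the slide, as one string of letters per category.
CATEGORIES = {
    "sulfhydryl": "C",
    "small hydrophilic": "STAPG",
    "acid, amide": "DENQ",
    "basic": "HRK",
    "small hydrophobic": "MILV",
    "aromatic": "FYW",
}

# Reverse lookup: one-letter amino acid code -> category name.
CATEGORY_OF = {aa: name for name, members in CATEGORIES.items() for aa in members}


def is_conservative(a, b):
    """True if amino acids a and b fall in the same category."""
    return CATEGORY_OF[a] == CATEGORY_OF[b]
```

With this grouping `is_conservative("I", "L")` is true while `is_conservative("K", "D")` is false; a different grouping (such as the Doolittle categories) would give different answers for some pairs.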
Phylogenetic trees from protein
alignments
• Parsimony based methods - unweighted/weighted
• Distance methods - model for distance estimation
– probability of amino acid changes, site rate heterogeneity
• Maximum likelihood and Bayesian methods- model for ML
calculations
– probability of amino acid changes, site rate heterogeneity
—> Colour coding of different categories is useful for protein
alignment visual inspection
Trees from protein alignment:
Parsimony methods - cost matrices
• All changes weighted equally
• Differential weighting of changes: an attempt to
correct for homoplasy!:
– Based on the minimal number of amino acid substitutions, the genetic
code matrix (PHYLIP - PROTPARS)
– Weights based on physico-chemical properties of amino acids
– Weights based on observed frequency of amino acid substitutions in
alignments
Parsimony: unweighted matrix
for amino acid changes
– Ile -> Leu: cost = 1
– Trp -> Asp: cost = 1
– Ser -> Arg: cost = 1
– Lys -> Asp: cost = 1
Parsimony: weighted matrix for amino
acid changes, the genetic code matrix
– Ile -> Leu: cost = 1
– Trp -> Asn: cost = 3
– Ser -> Arg: cost = 2
– Lys -> Asp: cost = 2
Weighting matrix based on minimal amino acid changes (PROTPARS in PHYLIP)
W: TGG
   |||
N: AAC, AAT
A minimum of 3 changes are needed at the DNA level for W <-> N
     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  1  2  T  V  W  Y
A    0  2  1  1  2  1  2  2  2  2  2  2  1  2  2  1  2  1  1  2  2
C    2  0  2  2  1  1  2  2  2  2  2  2  2  2  1  1  1  2  2  1  1
D    1  2  0  1  2  1  1  2  2  2  2  1  2  2  2  2  2  2  1  2  1
E    1  2  1  0  2  1  2  2  1  2  2  2  2  1  2  2  2  2  1  2  2
F    2  1  2  2  0  2  2  1  2  1  2  2  2  2  2  1  2  2  1  2  1
G    1  1  1  1  2  0  2  2  2  2  2  2  2  2  1  2  1  2  1  1  2
H    2  2  1  2  2  2  0  2  2  1  2  1  1  1  1  2  2  2  2  2  1
I    2  2  2  2  1  2  2  0  1  1  1  1  2  2  1  2  1  1  1  2  2
K    2  2  2  1  2  2  2  1  0  2  1  1  2  1  1  2  2  1  2  2  2
L    2  2  2  2  1  2  1  1  2  0  1  2  1  1  1  1  2  2  1  1  2
M    2  2  2  2  2  2  2  1  1  1  0  2  2  2  1  2  2  1  1  2  3
N    2  2  1  2  2  2  1  1  1  2  2  0  2  2  2  2  1  1  2  3  1
P    1  2  2  2  2  2  1  2  2  1  2  2  0  1  1  1  2  1  2  2  2
Q    2  2  2  1  2  2  1  2  1  1  2  2  1  0  1  2  2  2  2  2  2
R    2  1  2  2  2  1  1  1  1  1  1  2  1  1  0  2  1  1  2  1  2
1    1  1  2  2  1  2  2  2  2  1  2  2  1  2  2  0  2  1  2  1  1
2    2  1  2  2  2  1  2  1  2  2  2  1  2  2  1  2  0  1  2  2  2
T    1  2  2  2  2  2  2  1  1  2  1  1  1  2  1  1  1  0  2  2  2
V    1  2  1  1  1  1  2  1  2  1  1  2  2  2  2  2  2  2  0  2  2
W    2  1  2  2  2  1  2  2  2  1  2  3  2  2  1  1  2  2  2  0  2
Y    2  1  1  2  1  2  1  2  2  2  3  1  2  2  2  1  2  2  2  2  0
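Costs of this kind can be derived from the genetic code by counting the fewest amino-acid-changing nucleotide substitutions needed to get from one amino acid to another, with silent changes free. The sketch below does this with a breadth-first search over amino acid states. It is an illustration under stated assumptions, not the PROTPARS algorithm itself: serine is split into two states (called `S1` for TCN codons and `S2` for AGT/AGC here, matching the columns labelled 1 and 2), stop codons are never used as intermediates, and all names are hypothetical.

```python
from collections import deque
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def state(codon):
    """Amino acid state of a codon; serine is split into S1 (TCN) and S2 (AGT/AGC)."""
    aa = CODE[codon]
    if aa == "S":
        return "S1" if codon.startswith("TC") else "S2"
    return aa


# Two states are neighbours if a single nucleotide change turns a codon of one
# into a codon of the other (stop codons are skipped).
NEIGHBOURS = {}
for codon in CODE:
    if CODE[codon] == "*":
        continue
    for pos, base in product(range(3), BASES):
        other = codon[:pos] + base + codon[pos + 1:]
        if CODE[other] != "*" and state(other) != state(codon):
            NEIGHBOURS.setdefault(state(codon), set()).add(state(other))


def min_steps(a, b):
    """Fewest amino-acid-changing substitutions from state a to state b."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        cur = queue.popleft()
        if cur == b:
            return dist[cur]
        for nxt in NEIGHBOURS[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("state not reachable")
```

Under these assumptions `min_steps("W", "N")` gives the three changes of the W/N example, and the two serine states differ in their cost to arginine because only AGT/AGC is one substitution away from an arginine codon.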
Distance methods
A two step approach - two choices!
1) Estimate all pairwise distances
Choose a method (100s) - has an explicit model for sequence
evolution
2) Estimate a tree from the distance matrix
Choose a method: with or without an optimality criterion?
Estimation of protein pairwise
distances
1. Simple formula
2. More complex models
• 20 x 20 matrices (evolutionary model):
– Identity matrix
– Genetic code matrix
– Mutational data matrices (MDMs)
• Correction for rate heterogeneity between sites
(Γ [α] + pinv)
The Kimura formula: correction for
multiple hits
$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^{2}/5\right)$
- Dij is the observed dissimilarity between i and j (0-1).
- Can give a good estimate of dij for 0.75 > Dij > 0
- It approximates the PAM matrix well
- If Dij ≥ 0.8541, dij = infinite.
- Does not take into account which amino acids are changing
- Implemented in Clustal and PHYLIP
-> Importance of mutational matrices, MDM!
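The Kimura correction is simple enough to compute by hand. The sketch below first measures the observed dissimilarity between two aligned sequences and then applies the formula; the function names are illustrative, and skipping gapped columns is an assumption about how the dissimilarity is counted.

```python
import math


def observed_dissimilarity(seq_i, seq_j, gap="-"):
    """Fraction of compared sites (gapped columns skipped) at which the sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_i, seq_j) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_distance(D):
    """d = -ln(1 - D - D^2/5); infinite once the argument of the log reaches zero."""
    arg = 1.0 - D - D * D / 5.0
    return math.inf if arg <= 0.0 else -math.log(arg)


D = observed_dissimilarity("AIDESLIIASIATATI", "AGDEALILASAATSTI")
d = kimura_distance(D)
```

Because the formula only sees the proportion of differing sites, two pairs of sequences with the same Dij get the same distance whichever amino acids changed.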
Amino acid substitution matrices
(MDMs)
Protein alignment may be guided by structural
interactions
• Sequence alignments based matrices
PAM, JTT, BLOSUM, WAG...
• Structure alignments based matrices
STR (for highly divergent sequences)
[Figure: Escherichia coli djlA protein and Homo sapiens djlA protein]
Protein distance measurements with
MDM
20 x 20 matrices:
• PAM, BLOSUM, WAG…matrices
• Maximum likelihood calculation which
takes into account:
– All sites in the alignment
– All pairwise rates in the matrix
– Branch length
$d_{ij} = \mathrm{ML}\left[P(n), X_{ij}, (\Gamma, p_{inv})\right]$ (dodgy notation!)
$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^{2}/5\right) = F(D_{ij})$
How is an MDM inferred?
The raw data: observed changes in pairwise
comparisons in an alignment or on a tree
seq.1 AIDESLIIASIATATI
|*||*||*||*||*||
seq.2 AGDEALILASAATSTI
How is an MDM inferred?
Observed raw changes are corrected for:
- The amino acid relative mutability
- The amino acid normalised frequency
Differences between MDMs come from:
- Choice of proteins used (membrane, globular)
- Range of sequence similarities used
- Counting methods
- On a tree [MP, ML]
- Pairwise comparison from an alignment
-> empirical models from large datasets are typically used
seq.1 AIDESLIIASIATATI
      |**|*||*||*||*||
seq.2 AGEEALILASAATSTI
Raw matrix (symmetrical!)
    A  S  T  G  I  L  E  D
A   3
S   2  1
T   0  0  2
G   0  0  0  0
I   1  0  0  1  2
L   0  0  0  0  1  1
E   0  0  0  0  0  0  1
D   0  0  0  0  0  0  1  0
-> The larger the dataset the better the estimates!
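Counting the raw data from a pairwise comparison amounts to tallying unordered pairs of aligned residues. A minimal sketch, assuming gapped columns are skipped and that a pair (a, b) is counted the same as (b, a); the name `raw_counts` is illustrative.

```python
from collections import Counter


def raw_counts(seq_i, seq_j, gap="-"):
    """Count unordered pairs of aligned residues, including identical pairs."""
    counts = Counter()
    for a, b in zip(seq_i, seq_j):
        if a != gap and b != gap:
            # Sorting the pair makes the count symmetric: (A, S) == (S, A).
            counts[tuple(sorted((a, b)))] += 1
    return counts


counts = raw_counts("AIDESLIIASIATATI", "AGEEALILASAATSTI")
```

Here `counts[("A", "S")]` pools the S->A and A->S columns of the alignment, which is why the raw matrix is symmetrical.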
Amino Acid exchange matrices
$$
R = \begin{pmatrix}
- & s_{1,2} & s_{1,3} & \cdots & s_{1,20} \\
s_{1,2} & - & s_{2,3} & \cdots & s_{2,20} \\
s_{1,3} & s_{2,3} & - & \cdots & s_{3,20} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{1,20} & s_{2,20} & s_{3,20} & \cdots & -
\end{pmatrix}, \qquad s_{ij} = s_{ji}
$$
The entries Qij of the Q matrix combine the exchangeabilities sij with the amino acid frequencies πi.
Amino Acid exchange matrices
– F: Raw matrix. Observed changes (counted on a MP tree or in pairwise comparisons)
– R: Relative rate matrix (no composition, no branch length). Exchangeabilities of amino acid pairs ij. Time reversibility
– R x diag(π1, …, π20) = Q matrix. Stationarity of amino acid frequencies (typically the observed proportion of residues in the dataset)
– Q: Rate matrix (with composition, not branch length). Instantaneous rates of change of amino acids
– P: Probability matrix (composition + branch length)
– Relatedness odds matrix: used for scoring alignments (Blast, Clustal)
Can be estimated using ML on a tree
Modified from Peter Foster
The PAM and JTT matrices
• PAM - Dayhoff et al. 1968
– Nuclear encoded genes, ~100 proteins
• JTT - Jones et al. 1992
– 59,190 accepted point mutations for 16,300
proteins
Jones, Taylor & Thornton (1992). CABIOS 8, 275-282
The BLOSUM matrices
Henikoff & Henikoff (1992). Proc Natl Acad Sci USA 89, 10915-9
• BLOcks SUbstitution Matrices
– The matrix values are based on 2000 conserved amino acid
patterns (blocks) - pairwise comparisons
-> more efficient for distantly related proteins
-> more agreement with 3D structure data
BLOSUM62 - 62% minimum sequence identity
BLOSUM50 - 50% minimum sequence identity
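The chain from exchangeabilities to a rate matrix and then to a probability matrix can be illustrated with a toy three-state alphabet instead of 20 amino acids. This is a sketch under stated assumptions: off-diagonal rates are taken as Q[i][j] = s[i][j] * pi[j], Q is rescaled to one expected substitution per unit time, and P(t) = exp(Qt) is approximated by a truncated Taylor series, which is adequate for a small toy matrix but not how phylogenetics packages compute it. All names and numbers are illustrative.

```python
def build_q(s, pi):
    """Rate matrix from symmetric exchangeabilities s and frequencies pi."""
    n = len(pi)
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # each row of Q sums to zero
    # Rescale so that the expected rate at stationarity is 1.
    rate = -sum(pi[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probs(q, t, terms=40):
    """P(t) = exp(Qt), summed as I + Qt + (Qt)^2/2! + ... (truncated)."""
    n = len(q)
    qt = [[x * t for x in row] for row in q]
    p = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in p]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, qt)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return p


# Toy values (assumed for illustration only).
S_TOY = [[0.0, 2.0, 1.0], [2.0, 0.0, 0.5], [1.0, 0.5, 0.0]]
PI_TOY = [0.5, 0.3, 0.2]
Q_TOY = build_q(S_TOY, PI_TOY)
P_TOY = transition_probs(Q_TOY, 0.5)
```

R carries no composition and no branch length, Q adds the composition, and only P depends on the branch length t. With symmetric exchangeabilities the resulting process is time reversible, so `PI_TOY[i] * P_TOY[i][j]` equals `PI_TOY[j] * P_TOY[j][i]` up to rounding.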
Comparisons of MDMs:
(sij) amino acid exchangeability
Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699
The WAG matrix
Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699
• Globular protein sequences
– 3,905 sequences from 182 protein families
• Produced a phylogenetic tree for every family and
used maximum likelihood to estimate the relative rate
values in the rate matrix (overall lnL over 182
different trees)
– Better fit of the model with most data (significant improvement of
the lnL of a tree when compared to PAM or JTT matrices)
– Might not be the best option in some cases…
BLOSUM62 Amino Acid Substitution Matrix
Log-odds matrices
$\mathrm{MDM}_{ij} = 10 \log_{10} R_{ij}$
The MDMij values are rounded to the
nearest integer
MDMij < 0 freq. less than chance
MDMij = 0 freq. expected by chance
MDMij > 0 freq. greater than chance
The log-odds matrices can be used
for scoring alignments (Blast and Clustal)
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
Residue categories, in matrix order:
– sulfhydryl: C
– small hydrophilic: S, T, P, A, G
– acid, acid-amide and hydrophilic: N, D, E, Q
– basic: H, R, K
– small hydrophobic: M, I, L, V
– aromatic: F, Y, W
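The log-odds transformation can be sketched in a few lines. The function below assumes that Rij is the observed frequency of residue i aligned with residue j divided by the frequency expected by chance (the product of the two background frequencies); the function name and the example frequencies are made up for illustration and are not taken from any published matrix.

```python
import math


def log_odds_score(pair_freq, freq_i, freq_j):
    """Rounded 10*log10 of the observed pair frequency over its chance expectation."""
    odds = pair_freq / (freq_i * freq_j)
    return round(10 * math.log10(odds))


# Made-up frequencies: a pair seen twice as often as expected by chance,
# exactly as often, and half as often.
more_than_chance = log_odds_score(0.02, 0.1, 0.1)
as_chance = log_odds_score(0.01, 0.1, 0.1)
less_than_chance = log_odds_score(0.005, 0.1, 0.1)
```

A pair observed more often than chance predicts gets a positive score and a rarer pair a negative one, so summing scores along an alignment adds up the log of the odds for each aligned position.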
Summary
• Many amino acid rate matrices exist and one needs to
choose one for protein comparisons (alignment,
phylogenetics...); do not hesitate to experiment!
• One should make a rational choice (as much as
possible):
– How was the rate matrix produced?
– What are the structural features of the sequences you are
comparing? Globular/membrane protein?
– What is the level of sequence identity of the compared
sequences?
• Always try to correct for rate heterogeneity between
sites in phylogenetics!
Summary 2
• In practice MDMs are obtained by averaging the
observed changes and amino acid frequencies between
numerous proteins (e.g. JTT, BLOSUM) and are used
for your specific dataset
– You can correct an MDM for the πi values of your data (amino
acid frequencies)
• Specific matrices have been calculated to reflect
particular composition biases (e.g. the mitochondrial
proteins matrix: mtREV24)
• Future work: What about context-dependent MDMs:
alpha helices versus beta sheets, surface accessibility?
(Heterogeneous models)
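Correcting a matrix for the amino acid frequencies of your own data starts with estimating those frequencies from the alignment. A minimal sketch, assuming gaps and ambiguity characters are simply ignored; `amino_acid_frequencies` is an illustrative name.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequences):
    """Observed proportion of each of the 20 amino acids across all sequences."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


freqs = amino_acid_frequencies(["AIDESLIIASIATATI", "AGEEALILASAATSTI"])
```

These observed proportions would then replace the frequencies that came with the published matrix, while its exchangeabilities are kept unchanged.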
From DNA/protein sequences to trees
1. Sequence data
2. Align sequences
3. Phylogenetic signal? Patterns -> evolutionary processes?
4. Choose between distance methods and character-based methods
– Distance methods: distance calculation (which model?), then an optimality criterion (LS, ME) or a single tree (NJ)
– Character-based methods: choose a method: MB (model?), ML (model?), MP (weighting? (sites, changes)?)
– Calculate or estimate best fit tree
5. Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487