MODELS OF PROTEIN EVOLUTION:
AN INTRODUCTION TO AMINO
ACID EXCHANGE MATRICES
Simon Harris
Wellcome Trust Sanger Institute, UK
Inferring trees is difficult!!!
1. The method problem
[Figure: Dataset 1 analysed with Method 1 and with Method 2; the two analyses can recover different trees for taxa A, B and C. Which one is right?]
Inferring trees is difficult!!!
2. The dataset problem
[Figure: Dataset 1 and Dataset 2 both analysed with Method 1; the two datasets can recover different trees for taxa A, B and C. Which one is right?]
From DNA/protein sequences to trees
1. Sequence data
2. Align sequences
3. Phylogenetic signal? Patterns -> evolutionary processes?
4. Choose a method
– Character based methods: MP (weighting of sites, changes?), ML (model?), MB (model?)
– Distance methods: distance calculation (which model?)
• Optimality criterion (LS, ME): calculate or estimate best fit tree
• Single tree (NJ)
5. Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
Agenda
• Some general considerations
– Why protein phylogenetics?
– What are we comparing? Protein sequences - some basic features
– Protein structure/function and its impact on patterns of mutations
• Amino acid exchange matrices: where do they come from
and when do we use them?
– Database searches (e.g. Blast, FASTA)
– Sequence alignment (e.g. ClustalX)
– Phylogenetics (model based methods: distance, ML & Bayesian)
Why protein phylogenies?
• For historical reasons - the first sequences
• Most genes encode proteins
• To study protein structure, function and evolution
• Comparing DNA and protein based phylogenies can be useful
– Different genes - e.g. 18S rRNA versus EF-2 protein
– Protein encoding gene - codons versus amino acids
Proteins were the first molecular
sequences to be used for phylogenetic
inference
• Fitch and Margoliash (1967).
Construction of phylogenetic trees.
Science 155, 279-284.
Phylogenies from proteins
• Parsimony
• Distance matrices
• Maximum likelihood
• Bayesian methods
Evolutionary models for amino acid
changes
• All methods have explicit or implicit evolutionary models
• Can be in the form of a simple formula
– Kimura formula to estimate distances
• Most models for amino acid changes typically include
– A 20x20 rate matrix (or a reduced version of it, e.g. a 6x6 rate matrix)
– Correction for rate heterogeneity among sites (Γ [α] + pinv)
– Assume stationarity and neutrality - what if there are biases in composition, or non neutral changes such as selection?
Character states in DNA and
protein alignments
• DNA sequences have four states (five): A, C, G, T,
(and ± indels)
• Proteins have 20 states (21): A, C, D, E, F, G, H, I, K,
L, M, N, P, Q, R, S, T, V, W, Y (and ± indels)
—> more information in DNA or protein alignments?
DNA->Protein: the code
• 3 nucleotides (a codon) code for one amino acid
(61 codons! 61x61 rate matrices?)
• Degeneracy of the code: most amino acids are
coded by several codons
—> more data/information in DNA?
DNA—>Protein
• The code is degenerate:
20 amino acids are encoded by 61 possible codons (3 stop
codons)
• Complex patterns of changes among codons:
– Synonymous/non synonymous changes
– Synonymous changes correspond to codon changes not
affecting the coded amino acid
Codon degeneracy: protein alignments as a guide
for DNA alignments
Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser
Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser
Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser
Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser
GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC
GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC
GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT
GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA
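The degeneracy shown in the alignment above can be checked with a short sketch. It assumes the standard genetic code (built here from the usual compact 64-letter table) and uses one-letter amino acid codes; the function name `translate` is only illustrative.

```python
# Standard genetic code, built from the usual compact 64-letter table
# (codons ordered with bases T, C, A, G; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a hyphen-separated codon string into one-letter amino acids."""
    return "".join(CODON_TABLE[codon] for codon in dna.split("-"))


# The four DNA rows from the slide.
dna_rows = [
    "GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC",
    "GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC",
    "GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT",
    "GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA",
]
proteins = [translate(row) for row in dna_rows]
for row, protein in zip(dna_rows, proteins):
    print(protein, row)
```

Although the four DNA rows differ at most codons, their translations differ at only a few positions, which is why the protein alignment is a useful guide for aligning the DNA.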
DNA->Protein: code usage
• Differences in codon usage can lead to large base
composition bias - in which case one often needs to
remove the 3rd codon position, the most bias prone site…
and possibly the 1st
• Comparing protein sequences can reduce the
compositional bias problem
—> more information in DNA or protein?
Models for DNA and Protein evolution
• DNA: 4 x 4 rate matrices
– “Easy” to estimate (can be combined with tree search)
• Protein: 20 x 20 matrices
– More complex: time and estimation problems (rare changes?) ->
• Empirical models from large datasets are typically used
• One can correct for amino acid frequencies for a given dataset
Proteins and their amino acids
• Proteins determine shape and structure of cells and
carry most catalytic processes - 3D structure
• Proteins are polymers of 20 different amino acids
• Amino acids sequence composition determines the
structure (2ndary, 3ary…) and function of the protein
• Amino acids can be categorized by their side chain
physicochemical properties
– Size (small versus large)
– Polarity (hydrophobic versus hydrophilic, +/- charges)
Amino acid physico-chemical
properties
– Major factor in protein folding
– Key to protein functions
—> Major influence in pattern
of amino acid mutations
As for Ts versus Tv in DNA sequences, some
amino acid changes are more common than
others: fundamental for sequence comparisons
(alignments and phylogenetics!)
Small <—> small > small <—> big
Estimation of relative rates of residue
replacement (models of evolution)
• Differences/changes in protein alignments can be pooled and
patterns of changes investigated.
• Patterns of changes give insights into the evolutionary processes
underlying protein diversification -> estimation of evolutionary
models
• Choice of protein evolutionary models can be important for the
sequence analysis we perform (database searching, sequence alignment,
phylogenetics)
Amino acid substitution matrices based on
observed substitutions: “empirical models”
• Summarise the substitution pattern from large
amount of existing data (‘average’ models)
• Based on a selection of proteins
– Globular proteins, membrane proteins?
– Mitochondrial proteins?
• Uses a given counting method and set of recorded
changes
– tree dependent/independent
– restriction on the sequence divergence
Amino acid physico-chemical
properties
– Size
– Polarity
• Charges (acidic/basic)
• Hydrophilic (polar)
• Hydrophobic (non polar)
Taylor's Venn diagram of amino acid properties
[Figure: Venn diagram grouping the 20 amino acids into overlapping sets: Tiny, Small, Aliphatic, Aromatic, Hydrophobic, Polar and Charged. Cysteine appears twice, as C(S-S) and C(S-H).]
Taylor (1986). J. Theor. Biol. 119: 205-218
[Figure: second diagram with regions labelled Hydrophilic, Hydrophobic, Small and Large.]
Kosiol et al. (2004). J. Theor. Biol. 228: 97-106
Amino acids categories 1:
Doolittle (1985). Sci. Am. 253, 74-85.
– Small polar: S, G, D, N
– Small non-polar: T, A, P, C
– Large polar: E, Q, K, R
– Large non-polar: V, I, L, M, F
– Intermediate polarity: W, Y, H
Amino acids categories 2
(PAM matrix)
– Sulfhydryl: C
– Small hydrophilic: S, T, A, P, G
– Acid, amide: D, E, N, Q
– Basic: H, R, K
– Small hydrophobic: M, I, L, V
– Aromatic: F, Y, W
Amino acids categories 3
(implemented in SEAVIEW colour coding)
– Tiny 1, non-polar: C
– Tiny 2, non-polar: G
– Imino acid: P
– Non-polar: M, V, L, I, A, F, W
– Acid: D, E
– Basic: R, K
– Aromatic: Y, H
– Uncharged polar: S, T, Q, N
Amino acids categories
Changes within a category are more common than
between them
• Colour coding of alignments to help visualise their
quality (ClustalX, SEAVIEW)
• Differential weighting of cost matrices in parsimony
analyses
• Mutational data matrices in model based methods (e.g.
ML and Bayesian framework)
• Recoding of the 20 amino acids into bins to focus on
changes between bins (categories) (6x6 matrix)
—> Colour coding of different categories is useful for protein
alignment visual inspection
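Recoding into bins can be sketched as follows, using the six PAM matrix categories listed earlier. The numeric bin labels 1 to 6 and the names `PAM_GROUPS` and `recode` are assumptions made for this illustration, and the example sequence is arbitrary.

```python
# Six-bin recoding based on the PAM matrix categories.
# The labels "1".."6" are arbitrary choices for this sketch.
PAM_GROUPS = {
    "1": "C",      # sulfhydryl
    "2": "STAPG",  # small hydrophilic
    "3": "DENQ",   # acid, amide
    "4": "HRK",    # basic
    "5": "MILV",   # small hydrophobic
    "6": "FYW",    # aromatic
}
AA_TO_BIN = {aa: label for label, members in PAM_GROUPS.items() for aa in members}


def recode(seq, gap="-"):
    """Replace each amino acid by its bin label; gap characters are kept."""
    return "".join(gap if aa == gap else AA_TO_BIN[aa] for aa in seq)


print(recode("AIDESLIIASIATATI"))
```

After recoding, only changes between bins remain visible, so a 6x6 rate matrix is enough to model them.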
Phylogenetic trees from protein
alignments
• Parsimony based methods - unweighted/weighted
• Distance methods - model for distance estimation
– probability of amino acid changes, site rate heterogeneity
• Maximum likelihood and Bayesian methods- model for ML
calculations
– probability of amino acid changes, site rate heterogeneity
Trees from protein alignment:
Parsimony methods - cost matrices
• All changes weighted equally
• Differential weighting of changes: an attempt to
correct for homoplasy!:
– Based on the minimal number of amino acid substitutions, the genetic
code matrix (PHYLIP-PROTPARS)
– Weights based on physico-chemical properties of amino acids
– Weights based on observed frequency of amino acid substitutions in
alignments
Parsimony: unweighted matrix
for amino acid changes
– Ile -> Leu   cost = 1
– Trp -> Asp   cost = 1
– Ser -> Arg   cost = 1
– Lys -> Asp   cost = 1
Parsimony: weighted matrix for amino
acid changes, the genetic code matrix
– Ile -> Leu   cost = 1
– Trp -> Asn   cost = 3
– Ser -> Arg   cost = 2
– Lys -> Asp   cost = 2
Weighting matrix based on minimal amino acid changes (PROTPARS in PHYLIP)

W: TGG
   |||
N: AAC / AAT
A minimum of 3 changes are needed at the DNA level for W<->N

   A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  1  2  T  V  W  Y
A  0  2  1  1  2  1  2  2  2  2  2  2  1  2  2  1  2  1  1  2  2
C  2  0  2  2  1  1  2  2  2  2  2  2  2  2  1  1  1  2  2  1  1
D  1  2  0  1  2  1  1  2  2  2  2  1  2  2  2  2  2  2  1  2  1
E  1  2  1  0  2  1  2  2  1  2  2  2  2  1  2  2  2  2  1  2  2
F  2  1  2  2  0  2  2  1  2  1  2  2  2  2  2  1  2  2  1  2  1
G  1  1  1  1  2  0  2  2  2  2  2  2  2  2  1  2  1  2  1  1  2
H  2  2  1  2  2  2  0  2  2  1  2  1  1  1  1  2  2  2  2  2  1
I  2  2  2  2  1  2  2  0  1  1  1  1  2  2  1  2  1  1  1  2  2
K  2  2  2  1  2  2  2  1  0  2  1  1  2  1  1  2  2  1  2  2  2
L  2  2  2  2  1  2  1  1  2  0  1  2  1  1  1  1  2  2  1  1  2
M  2  2  2  2  2  2  2  1  1  1  0  2  2  2  1  2  2  1  1  2  3
N  2  2  1  2  2  2  1  1  1  2  2  0  2  2  2  2  1  1  2  3  1
P  1  2  2  2  2  2  1  2  2  1  2  2  0  1  1  1  2  1  2  2  2
Q  2  2  2  1  2  2  1  2  1  1  2  2  1  0  1  2  2  2  2  2  2
R  2  1  2  2  2  1  1  1  1  1  1  2  1  1  0  2  1  1  2  1  2
1  1  1  2  2  1  2  2  2  2  1  2  2  1  2  2  0  2  1  2  1  1
2  2  1  2  2  2  1  2  1  2  2  2  1  2  2  1  2  0  1  2  2  2
T  1  2  2  2  2  2  2  1  1  2  1  1  1  2  1  1  1  0  2  2  2
V  1  2  1  1  1  1  2  1  2  1  1  2  2  2  2  2  2  2  0  2  2
W  2  1  2  2  2  1  2  2  2  1  2  3  2  2  1  1  2  2  2  0  2
Y  2  1  1  2  1  2  1  2  2  2  3  1  2  2  2  1  2  2  2  2  0

Rows and columns 1 and 2 are the two groups of serine codons; the values are consistent with 1 = TCN and 2 = AGY.
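The idea behind the genetic code matrix can be sketched by taking, for each pair of amino acids, the smallest number of nucleotide differences between any of their codons. This is a simplified illustration under the standard genetic code, not the PROTPARS implementation: in particular it treats all six serine codons as a single state, so it returns 1 for Ser <-> Arg where the matrix above distinguishes two serine groups. The name `min_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code in compact form (bases ordered T, C, A, G; '*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it (stops excluded).
CODONS_FOR = {}
for (i, b1), (j, b2), (k, b3) in product(enumerate(BASES), repeat=3):
    aa = AMINO[16 * i + 4 * j + k]
    if aa != "*":
        CODONS_FOR.setdefault(aa, []).append(b1 + b2 + b3)


def min_changes(aa1, aa2):
    """Smallest number of nucleotide differences between any codon pair."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS_FOR[aa1]
        for c2 in CODONS_FOR[aa2]
    )


for a, b in [("I", "L"), ("W", "N"), ("K", "D")]:
    print(a, "<->", b, "cost =", min_changes(a, b))
```

For Ile <-> Leu, Trp <-> Asn and Lys <-> Asp this sketch gives the same costs as the slide (1, 3 and 2).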
Distance methods
A two step approach - two choices!
1) Estimate all pairwise distances
Choose a method (100s) - has an explicit model for sequence
evolution
2) Estimate a tree from the distance matrix
Choose a method: with or without an optimality criterion?
Estimation of protein pairwise
distances
1. Simple formula
2. More complex models
• 20 x 20 matrices (evolutionary model):
– Identity matrix
– Genetic code matrix
– Mutational data matrices (MDMs)
• Correction for rate heterogeneity between sites
(Γ [α] + pInv)
The Kimura formula: correction for
multiple hits
$d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right)$
- $D_{ij}$ is the observed dissimilarity between i and j (0-1).
- Can give a good estimate of $d_{ij}$ for 0.75 > $D_{ij}$ > 0
- It approximates the PAM matrix well
- If $D_{ij}$ ≥ 0.8541, $d_{ij}$ is infinite.
- Implemented in ClustalX1.83 and PHYLIP3.62
- Does not take into account which amino acids are changing
-> Importance of mutational matrices, MDM!
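A minimal sketch of the Kimura correction, assuming two aligned sequences of equal length. The helper names `p_distance` and `kimura_distance` are illustrative and do not come from ClustalX or PHYLIP; positions with a gap in either sequence are skipped, which is one possible convention.

```python
import math


def p_distance(seq1, seq2):
    """Observed dissimilarity D: fraction of differing sites, ignoring gaps."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_distance(D):
    """Kimura correction d = -ln(1 - D - D^2/5); infinite when undefined."""
    inner = 1.0 - D - D * D / 5.0
    if inner <= 0.0:
        return math.inf
    return -math.log(inner)


D = p_distance("AIDESLIIASIATATI", "AGEEALILASAATSTI")
print(D, kimura_distance(D))
```

The correction depends only on how many sites differ, not on which amino acids are involved, which is the limitation the slide points out.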
Amino acid substitution matrices
(MDMs)
• Sequence alignment based matrices
PAM, JTT, BLOSUM, WAG...
• Structure alignment based matrices
STR (for highly divergent sequences)
Protein distance measurements with
MDM
20 x 20 matrices:
• PAM, BLOSUM, WAG…matrices
• Maximum likelihood calculation which
takes into account:
– All sites in the alignment
– All pairwise rates in the matrix
– Branch length
With an MDM: $d_{ij}$ = ML [P(n), $X_{ij}$, (Γ, pinv)] (dodgy notation!)
With the Kimura formula: $d_{ij} = -\ln\left(1 - D_{ij} - D_{ij}^2/5\right) = F(D_{ij})$
How is an MDM inferred?
Observed raw changes are corrected for:
- The amino acid relative mutability
- The amino acid normalised frequency
Differences between MDM come from:
- Choice of proteins used (membrane, globular)
- Range of sequence similarities used
- Counting methods
- On a tree [MP, ML]
- Pairwise comparison from an alignment
-> empirical models from large datasets are typically used
How is an MDM inferred?
The raw data: observed changes in pairwise
comparisons in an alignment or on a tree
seq.1 AIDESLIIASIATATI
      |**|*||*||*||*||
seq.2 AGEEALILASAATSTI

Raw matrix (symmetrical, so only the lower triangle is shown)
   A  S  T  G  I  L  E  D
A  3
S  2  1
T  0  0  2
G  0  0  0  0
I  1  0  0  1  2
L  0  0  0  0  1  1
E  0  0  0  0  0  0  1
D  0  0  0  0  0  0  1  0
-> The larger the dataset the better the estimates!
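The counting step can be sketched as below for seq.1 and seq.2 (taking AGEEALILASAATSTI as seq.2). This is the pairwise-comparison variant only; counting on a tree is not shown. The function name `raw_counts` is an assumption for this example.

```python
from collections import Counter


def raw_counts(seq1, seq2):
    """Count aligned residue pairs; (a, b) and (b, a) are pooled (symmetry)."""
    counts = Counter()
    for a, b in zip(seq1, seq2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


counts = raw_counts("AIDESLIIASIATATI", "AGEEALILASAATSTI")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pooling such counts over many alignments gives the raw matrix from which an MDM is derived.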
Amino Acid exchange matrices
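$$
Q =
\begin{pmatrix}
- & s_{1,2} & s_{1,3} & \cdots & s_{1,20} \\
s_{1,2} & - & s_{2,3} & \cdots & s_{2,20} \\
s_{1,3} & s_{2,3} & - & \cdots & s_{3,20} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{1,20} & s_{2,20} & s_{3,20} & \cdots & -
\end{pmatrix}
\times \mathrm{diag}(\pi_1, \ldots, \pi_{20})
$$
Q: rate matrix
$Q_{ij}$: instantaneous rates of change of amino acids
$s_{ij}$: exchangeabilities of amino acid pairs ij
$s_{ij} = s_{ji}$: time reversibility
$\pi_i$: stationarity of amino acid frequencies
(typically the observed proportion of residues in the dataset)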
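The construction of Q from exchangeabilities and frequencies can be sketched on a toy 3-state alphabet (a real protein model has 20 states). The numbers in `S` and `pi` are made up for illustration, and the diagonal is filled with the usual convention that each row of Q sums to zero, which is an assumption not spelled out on the slide.

```python
def build_rate_matrix(S, pi):
    """Q[i][j] = S[i][j] * pi[j] for i != j; diagonal makes rows sum to zero."""
    n = len(pi)
    Q = [[S[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q


# Hypothetical symmetric exchangeabilities and stationary frequencies.
S = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 4.0],
    [1.0, 4.0, 0.0],
]
pi = [0.5, 0.3, 0.2]

Q = build_rate_matrix(S, pi)
for row in Q:
    print(row)
```

Because S is symmetric, pi[i] * Q[i][j] equals pi[j] * Q[j][i] for every pair, which is the time reversibility property listed above.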
Amino Acid exchange matrices
F: Raw matrix. Observed changes (counted on a MP tree or in pairwise comparisons)
R: Relative rate matrix (no composition, no branch length)
Q: Rate matrix (with composition, no branch length)
P: Probability matrix (composition + branch length)
R: Relatedness odds matrix. Used for scoring alignments (BlastP, ClustalX)
The rate matrices can be estimated using ML on a tree
Modified from Peter Foster
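The step from a rate matrix Q to a probability matrix P adds the branch length t. A common way to do this, assumed here, is P(t) = exp(Qt). The sketch below uses a toy 3-state Q with made-up values and a plain truncated Taylor series, which is adequate only for short branches; real software uses more robust methods such as eigendecomposition. All names are illustrative.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(Q, t, terms=30):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]  # current series term, starts as identity
    for k in range(1, terms + 1):
        term = mat_mul(term, Qt)
        term = [[x / k for x in row] for row in term]  # now (Qt)^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P


# Hypothetical 3-state rate matrix (each row sums to zero).
Q = [
    [-0.8, 0.6, 0.2],
    [1.0, -1.8, 0.8],
    [0.5, 1.2, -1.7],
]
P = transition_probabilities(Q, 0.1)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of P sums to 1, and as t approaches 0 the matrix approaches the identity (no change along a branch of zero length).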
The PAM and JTT matrices
• PAM - Dayhoff et al. 1968
– Nuclear encoded genes, ~100 proteins
• JTT - Jones et al. 1992
– 59,190 accepted point mutations for 16,300
proteins
Jones, Taylor & Thornton (1992). CABIOS 8, 275-282
The BLOSUM matrices
Henikoff & Henikoff (1992). Proc Natl Acad Sci USA 89, 10915-9
• BLOcks SUbstitution Matrices
– The matrix values are based on 2000 conserved amino acid
patterns (blocks) - pairwise comparisons
—> more efficient for distantly related proteins
—> more agreement with 3D structure data
BLOSUM62 - 62% minimum sequence identity (BlastP default)
BLOSUM50 - 50% minimum sequence identity
BLOSUM42 - 42% minimum sequence identity (BlastP)
The WAG matrix
Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699
• Globular protein sequences
– 3,905 sequences from 182 protein families
• Produced a phylogenetic tree for each family and
used maximum likelihood to estimate the relative rate
values in the rate matrix (overall lnL over 182
different trees)
– Better fit of the model with most data (significant improvement of the tree lnL
when compared to PAM or JTT matrices)
– Might not be the best option in some cases such as for mitochondria
encoded proteins or other membrane proteins…
Comparisons of MDMs:
(sij) amino acid exchangeability
Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699
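[Figure: amino acid exchangeabilities compared across the PAM, JTT, WAG and WAG* matrices; labelled pairs include D<->E, S<->A and V<->I.]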
Log-odds matrices
MDMij = 10 log10 Rij
The MDMij values are rounded to the
nearest integer
MDMij < 0 freq. less than chance
MDMij = 0 freq. expected by chance
MDMij > 0 freq. greater than chance
The log-odds matrices can be used
for scoring alignments (Blast and ClustalX)
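The log-odds conversion can be sketched in both directions. The function names are illustrative; `R` stands for the relatedness odds Rij of the slide.

```python
import math


def log_odds_score(R):
    """MDMij = 10 * log10(Rij), rounded to the nearest integer."""
    return round(10 * math.log10(R))


def odds_from_score(score):
    """Invert the conversion (up to rounding): Rij = 10 ** (MDMij / 10)."""
    return 10 ** (score / 10)


for R in (0.5, 1.0, 1.6):
    print(R, log_odds_score(R))
print(odds_from_score(2))
```

An odds ratio of 1 (changes as frequent as expected by chance) maps to a score of 0, ratios below 1 to negative scores and ratios above 1 to positive scores.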
PAM250 Amino Acid Substitution Matrix
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0   0   1   5
N  -4   1   0   0   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1  -1   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -2   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -1  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -3  -5  -3  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

MDMij < 0 freq. less than chance
MDMij = 0 freq. expected by chance
MDMij > 0 freq. greater than chance

Residue groups (in matrix order):
(1) sulfhydryl: C
(2) small hydrophilic: S, T, P, A, G
(3) acid, acid-amide and hydrophilic: N, D, E, Q
(4) basic: H, R, K
(5) small hydrophobic: M, I, L, V
(6) aromatic: F, Y, W
BLOSUM62 Amino Acid Substitution Matrix
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11

MDMij < 0 freq. less than chance
MDMij = 0 freq. expected by chance
MDMij > 0 freq. greater than chance

Residue groups (in matrix order):
(1) sulfhydryl: C
(2) small hydrophilic: S, T, P, A, G
(3) acid, acid-amide and hydrophilic: N, D, E, Q
(4) basic: H, R, K
(5) small hydrophobic: M, I, L, V
(6) aromatic: F, Y, W
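How such a matrix scores an alignment can be sketched with a handful of BLOSUM62 entries copied from the values above. The two short sequences are made up, gaps are not handled, and the names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are assumptions for this illustration rather than part of BlastP or ClustalX.

```python
# A few BLOSUM62 entries; the matrix is symmetric, so each pair is stored once.
BLOSUM62_SUBSET = {
    ("W", "W"): 11,
    ("D", "E"): 2,
    ("I", "M"): 1,
    ("L", "L"): 4,
    ("C", "S"): -1,
}


def pair_score(a, b, matrix):
    """Look up the score for a residue pair in either order."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def alignment_score(seq1, seq2, matrix):
    """Sum the substitution scores over all aligned positions (no gaps)."""
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(alignment_score("WDILC", "WEMLS", BLOSUM62_SUBSET))
```

Conserved rare residues such as W contribute most to the total, while unlikely replacements such as C <-> S lower it.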
Log-odds matrices
MDMij = 10 log10 Rij
The MDMij values are rounded to the
nearest integer
MDMij < 0 freq. less than chance
MDMij = 0 freq. expected by chance
MDMij > 0 freq. greater than chance
I <-> M log-odds = +2 (in PAM250):
a score of 2 corresponds to $\log_{10} R_{ij} = 0.2$,
hence $R_{ij} = 10^{0.2} \approx 1.6$ (check: $\log_{10} 1.6 = 0.20412$)
Meaning I <-> M changes between two sequences occur
1.6 times more often than expected at random
Summary 1
• Many amino acid rate matrices (MDM) exist and one
needs to choose one for protein comparisons (alignment,
phylogenetics...)
– do not hesitate to experiment!
• One should make a rational choice (as much as possible):
– How was the rate matrix produced?
– What are the structural features of the sequences that you are comparing?
Globular/membrane protein?
– What is the level of sequence identity of the compared sequences?
– Does one MDM fit my data better than the others: You can use ModelGenerator
or ProtTest to compare models
• Always try to correct for rate heterogeneity between sites
in phylogenetics!
Summary 2
• In practice MDM are obtained by averaging the observed changes
and amino acid frequencies between numerous proteins (e.g. JTT,
BLOSUM) and are used for your specific dataset
– With some software you can correct an MDM for the πi values of your data
(amino acid frequencies -F option)
• Specific matrices have been calculated to reflect particular
composition biases
– the mitochondrial proteins matrix: mtREV24
– Transmembrane domains: PHAT
• Using recoding of amino acids one can generate dataset specific
models (specific GTR type model)
And…
• Other developments:
– What about context-dependent MDM: alpha helices versus
beta sheets, surface accessibility?
– Heterogeneous models between sites or taxa (branches)
– Protein LogDet? For long alignments only…
– Modeltest-like software that allows protein models to be
chosen analytically:
• ModelGenerator: http://bioinf.may.ie/software/
• ProtTest: http://darwin.uvigo.es