Trees from proteins I
MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES
Robert Hirt
Department of Zoology, The Natural History Museum, London

Inferring trees is difficult!!!
1. The method problem
2. The dataset problem
[Figure: trees of taxa A, B and C. Method problem: Dataset 1 analysed with Method 1 versus Method 2. Dataset problem: Dataset 1 versus Dataset 2 analysed with Method 1.]

From DNA/protein sequences to trees
[Flow chart]
*1 Sequence data
*2 Align sequences
*3 Phylogenetic signal? Patterns -> evolutionary processes?
*4 Choose a method
– Distance methods: distance calculation (which model?), then an optimality criterion (LS, ME) or a single tree (NJ)
– Character-based methods: MB (model?), ML (model?), MP (weighting? sites, changes?)
Calculate or estimate best fit tree
5 Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487

Agenda
• Some general considerations
– Why protein phylogenetics?
– What are we comparing? Protein sequences: some basic features
– Protein structure/function and its impact on patterns of mutations
• Amino acid exchange matrices: where do they come from and when do we use them?
– Database searches (Blast, FASTA)
– Sequence alignment (ClustalX)
– Phylogenetics (model based methods)

Proteins were the first molecular sequences to be used for phylogenetic inference
• Fitch and Margoliash (1967). Construction of phylogenetic trees. Science 155, 279-284.

Why protein phylogenies?
• For historical reasons: the first sequences
• Most genes encode proteins
• To study protein structure, function and evolution
• Comparing DNA and protein based phylogenies can be useful
– Different genes, e.g. 18S rRNA versus EF-2 protein
– Protein encoding gene: codons versus amino acids

Phylogenies from proteins
• Parsimony
• Distance matrices
• Maximum likelihood
• Bayesian methods

Evolutionary models for amino acid changes
• All methods have explicit or implicit evolutionary models
• Can be in the form of a simple formula
– Kimura formula to estimate distances
• Most models for amino acid changes typically include
– 20 x 20 rate matrix
– Correction for rate heterogeneity among sites (Γ [α] + pinv)
– Assume neutrality: what if there are biases, or non-neutral changes such as selection?

DNA -> Protein: the code
• 3 nucleotides (a codon) code for one amino acid (61 codons! 61 x 61 rate matrices?)

Character states in DNA and protein alignments
• DNA sequences have four states (five): A, C, G, T (and ± indels)
• Proteins have 20 states (21): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and ± indels)
-> more information in DNA or protein alignments?

DNA -> Protein
• The code is degenerate: 20 amino acids are encoded by 61 possible codons (plus 3 stop codons)
• Complex patterns of changes among codons
• Degeneracy of the code: most amino acids are coded by several codons
– Synonymous/non-synonymous changes
– Synonymous changes correspond to codon changes not affecting the coded amino acid
-> more data/information in DNA?
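The degeneracy figures quoted above (61 sense codons, 3 stop codons, 20 amino acids) and the synonymous/non-synonymous distinction can be checked with a short sketch. It assumes the standard genetic code (NCBI translation table 1); the names `GENETIC_CODE`, `is_synonymous` and `degeneracy` are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), assumed here.
# Codons are enumerated with bases in the order T, C, A, G at each
# position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon_a, codon_b):
    """Return True if both codons encode the same amino acid."""
    return GENETIC_CODE[codon_a] == GENETIC_CODE[codon_b]


# Number of codons per amino acid, ignoring stop codons.
degeneracy = Counter(aa for aa in GENETIC_CODE.values() if aa != "*")
sense_codons = sum(degeneracy.values())
stop_codons = sum(1 for aa in GENETIC_CODE.values() if aa == "*")

print(sense_codons, stop_codons, len(degeneracy))  # 61 3 20
print(is_synonymous("GAA", "GAG"))  # True: both encode Glu
print(is_synonymous("TGG", "TAT"))  # False: Trp versus Tyr
```

Leucine, serine and arginine each have six codons in this table, while tryptophan and methionine have one, which is why third-position changes are so often synonymous.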

Codon degeneracy: protein alignments as a guide for DNA alignments
Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser
Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser
Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser
Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser

GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC
GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC
GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT
GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA

DNA -> Protein: code usage
• Differences in codon usage can lead to a large base composition bias, in which case one often needs to remove the 3rd codon position, the most bias-prone site… and possibly the 1st
• Comparing protein sequences can reduce the compositional bias problem
-> more information in DNA or protein?
Ask James for PUTGAPS…

Models for DNA and Protein evolution
• DNA: 4 x 4 rate matrices
– Easy to estimate (can be combined with tree search)
• Protein: 20 x 20 matrices
– More complex: time and estimation problems (rare changes?)
-> empirical models from large datasets are typically used

Evolutionary models for amino acid changes
• All methods have explicit or implicit evolutionary models
• Can be in the form of a simple formula
– Kimura formula to estimate distances
• Most models for amino acid changes typically include
– 20 x 20 rate matrix
– Correction for rate heterogeneity between sites (Γ [α] + pinv)

Proteins and amino acids
• Proteins determine the shape and structure of cells and carry out most catalytic processes (3D)
• Proteins are polymers of 20 different amino acids
• Amino acid sequences determine the structure (secondary, tertiary…) and function of the protein
• Amino acids can be categorized by their side chain physico-chemical properties
– Polarity (hydrophobic versus hydrophilic, +/- charges)
– Size (small versus large)

Amino acid physico-chemical properties
– Major factor in protein folding
– Key to protein functions
-> Major influence on the pattern of amino acid mutations
As for transitions (Ts) versus transversions (Tv) in DNA sequences, some amino acid changes are more common than others: very important for sequence comparisons (alignment and phylogenetics!)
Small <-> small > small <-> big

Estimation of relative rates of residue replacement (models of evolution)
• Differences/changes in protein alignments can be pooled and the patterns of changes investigated
– Dependent on the selected sequences, the alignment and the counting method! Empirical models!
• Patterns of changes give insights into the evolutionary processes underlying protein diversification -> estimation of evolutionary models
– How general is such a model?
• The choice of protein evolutionary model can be important for the sequence analysis we perform (database searching, sequence alignment, phylogenetics)

Amino acid substitution matrices based on observed substitutions: "empirical models"
• Summarise the substitution pattern from a large amount of existing data
• Based on a selection of proteins
– Globular proteins, membrane proteins?
– Mitochondrial proteins?
• Use a given counting method and a given set of changes to be recorded
– tree dependent/independent
– restriction on the sequence divergence

Amino acid physico-chemical properties
– Size
– Polarity
– Hydrophilic (polar, +/- charges)
– Hydrophobic (non polar)
[Figure: Taylor's Venn diagram of amino acid properties; sets: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged]

Amino acid categories 1: Doolittle (1985). Sci. Am. 253, 74-85.
– Small polar: S, G, D, N
– Small non-polar: T, A, P, C
– Large polar: E, Q, K, R
– Large non-polar: V, I, L, M, F
– Intermediate polarity: W, Y, H

Amino acid categories 2
– Sulfhydryl: C
– Small hydrophilic: S, T, A, P, G
– Acid, amide: D, E, N, Q
– Basic: H, R, K
– Small hydrophobic: M, I, L, V
– Aromatic: F, Y, W
-> Colour coding of the different categories is useful for visual inspection of protein alignments

Phylogenetic trees from protein alignments
• Parsimony based methods: unweighted/weighted
• Distance methods: model for distance estimation
– probability of amino acid changes, site rate heterogeneity
• Maximum likelihood and Bayesian methods: model for ML calculations
– probability of amino acid changes, site rate heterogeneity

Trees from protein alignment: Parsimony methods - cost matrices
• All changes weighted equally
• Differential weighting of changes: an attempt to correct for homoplasy!
– Based on the minimal number of amino acid substitutions, the genetic code matrix (PHYLIP - PROTPARS)
– Weights based on physico-chemical properties of amino acids
– Weights based on observed frequency of amino acid substitutions in alignments

Parsimony: unweighted matrix for amino acid changes
– Ile -> Leu cost = 1
– Trp -> Asp cost = 1
– Ser -> Arg cost = 1
– Lys -> Asp cost = 1

Parsimony: weighted matrix for amino acid changes, the genetic code matrix
– Ile -> Leu cost = 1
– Trp -> Asn cost = 3
– Ser -> Arg cost = 2
– Lys -> Asp cost = 2

Weighting matrix based on minimal amino acid changes: PROTPARS in PHYLIP
W: TGG
   |||
N: AAC or AAT
A minimum of 3 changes is needed at the DNA level for W <-> N

  A C D E F G H I K L M N P Q R 1 2 T V W Y
A 0 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2
C 2 0 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 1 1
D 1 2 0 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1
E 1 2 1 0 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2
F 2 1 2 2 0 2 2 1 2 1 2 2 2 2 2 1 2 2 1 2 1
G 1 1 1 1 2 0 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2
H 2 2 1 2 2 2 0 2 2 1 2 1 1 1 1 2 2 2 2 2 1
I 2 2 2 2 1 2 2 0 1 1 1 1 2 2 1 2 1 1 1 2 2
K 2 2 2 1 2 2 2 1 0 2 1 1 2 1 1 2 2 1 2 2 2
L 2 2 2 2 1 2 1 1 2 0 1 2 1 1 1 1 2 2 1 1 2
M 2 2 2 2 2 2 2 1 1 1 0 2 2 2 1 2 2 1 1 2 3
N 2 2 1 2 2 2 1 1 1 2 2 0 2 2 2 2 1 1 2 3 1
P 1 2 2 2 2 2 1 2 2 1 2 2 0 1 1 1 2 1 2 2 2
Q 2 2 2 1 2 2 1 2 1 1 2 2 1 0 1 2 2 2 2 2 2
R 2 1 2 2 2 1 1 1 1 1 1 2 1 1 0 2 1 1 2 1 2
1 1 1 2 2 1 2 2 2 2 1 2 2 1 2 2 0 2 1 2 1 1
2 2 1 2 2 2 1 2 1 2 2 2 1 2 2 1 2 0 1 2 2 2
T 1 2 2 2 2 2 2 1 1 2 1 1 1 2 1 1 1 0 2 2 2
V 1 2 1 1 1 1 2 1 2 1 1 2 2 2 2 2 2 2 0 2 2
W 2 1 2 2 2 1 2 2 2 1 2 3 2 2 1 1 2 2 2 0 2
Y 2 1 1 2 1 2 1 2 2 2 3 1 2 2 2 1 2 2 2 2 0

Phylogenetic trees from protein alignments
• Parsimony based methods: unweighted/weighted
• Distance methods: model for distance estimation
– probability of amino acid changes, site rate heterogeneity
• Maximum likelihood and Bayesian methods: model for ML calculations
– probability of amino acid changes, site rate heterogeneity

Distance methods
A two step approach - two choices!
1) Estimate all pairwise distances
Choose a method (100s): each has an explicit model for sequence evolution
2) Estimate a tree from the distance matrix
Choose a method: with or without an optimality criterion?

Estimation of protein pairwise distances
1. Simple formula
2. More complex models
• 20 x 20 matrices (evolutionary model):
– Identity matrix
– Genetic code matrix
– Mutational data matrices (MDMs)
• Correction for rate heterogeneity between sites (Γ [α] + pinv)

The Kimura formula: correction for multiple hits
dij = -ln(1 - Dij - Dij^2/5)
– Dij is the observed dissimilarity between i and j (0-1)
– Can give a good estimate of dij for 0.75 > Dij > 0
– It approximates the PAM matrix well
– If Dij ≥ 0.8541, dij = infinite
– Does not take into account which amino acids are changing
– Implemented in Clustal and PHYLIP
-> Importance of mutational data matrices, MDMs!

Amino acid substitution matrices (MDMs)
• Sequence alignment based matrices: PAM, JTT, BLOSUM, WAG...
• Structure alignment based matrices: STR (for highly divergent sequences)

Protein alignment may be guided by structural interactions
[Figure: Escherichia coli djlA protein; Homo sapiens djlA protein]

Protein distance measurements with MDM
20 x 20 matrices:
• PAM, BLOSUM, WAG… matrices
• Maximum likelihood calculation which takes into account:
– All sites in the alignment
– All pairwise rates in the matrix
– Branch length
dij = ML [P(n), Xij, (Γ, pinv)] (dodgy notation!)
dij = -ln(1 - Dij - Dij^2/5) = F(Dij)

How is an MDM inferred?
The raw data: observed changes in pairwise comparisons in an alignment or on a tree
seq.1 AIDESLIIASIATATI
      |*||*||*||*||*||
seq.2 AGDEALILASAATSTI

How is an MDM inferred?
Observed raw changes are corrected for:
– The amino acid relative mutability
– The amino acid normalised frequency
Differences between MDMs come from:
– Choice of proteins used (membrane, globular)
– Range of sequence similarities used
– Counting methods
– On a tree [MP, ML]
– Pairwise comparison from an alignment
-> empirical models from large datasets are typically used
seq.1 AIDESLIIASIATATI
      |*||*||*||*||*||
seq.2 AGEEALILASAATSTI
Raw matrix (symmetrical!)
A S T G I L E D
A 3 2 0 0 1 0 0 0
S T G I L E D 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 2 1 1 0 0 1 0 0 1 0
-> The larger the dataset the better the estimates!

Amino acid exchange matrices
R (no composition, no branch length) x diag(π1, …, π20) = Q matrix
– R: symmetric matrix of the exchangeabilities sij of amino acid pairs ij (s1,2, s1,3, …, s1,20; s2,3, …, s2,20; s3,20; …)
– Q: rate matrix of the instantaneous rates of change Qij of amino acids
– Time reversibility
– Stationarity of amino acid frequencies πi (typically the observed proportion of residues in the dataset)

Amino acid exchange matrices
• R: relative rate matrix (no composition, no branch length)
• Q: rate matrix (with composition, no branch length)
• P: probability matrix (composition + branch length)
• F: raw matrix of observed changes (counted on an MP tree or in pairwise comparisons)
• Relatedness odds matrix: used for scoring alignments (Blast, Clustal)
Can be estimated using ML on a tree
Modified from Peter Foster

The PAM and JTT matrices
• PAM - Dayhoff et al. 1968
– Nuclear encoded genes, ~100 proteins
• JTT - Jones et al. 1992
– 59,190 accepted point mutations for 16,300 proteins
Jones, Taylor & Thornton (1992). CABIOS 8, 275-282

The BLOSUM matrices
Henikoff & Henikoff (1992). Proc Natl Acad Sci USA 89, 10915-9
• BLOcks SUbstitution Matrices
– The matrix values are based on 2000 conserved amino acid patterns (blocks), using pairwise comparisons
-> more efficient for distantly related proteins
-> more agreement with 3D structure data
BLOSUM62 - 62% minimum sequence identity
BLOSUM50 - 50% minimum sequence identity

Comparisons of MDMs: (sij) amino acid exchangeability
[Figure] Whelan and Goldman (2001). Mol. Biol. Evol. 18, 691-699

The WAG matrix
Whelan and Goldman (2001). Mol. Biol. Evol. 18, 691-699
• Globular protein sequences
– 3,905 sequences from 182 protein families
• Produced a phylogenetic tree for every family and used maximum likelihood to estimate the relative rate values in the rate matrix (overall lnL over 182 different trees)
– Better fit of the model with most data (significant improvement of the lnL of a tree when compared to the PAM or JTT matrices)
– Might not be the best option in some cases…

Log-odds matrices
MDMij = 10 log10 Rij
The MDMij values are rounded to the nearest integer
MDMij < 0: frequency less than chance
MDMij = 0: frequency expected by chance
MDMij > 0: frequency greater than chance
The log-odds matrices can be used for scoring alignments (Blast and Clustal)

BLOSUM62 Amino Acid Substitution Matrix
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C  9
S -1  4
T -1  1  5
P -3 -1 -1  7
A  0  1  0 -1  4
G -3  0 -2 -2  0  6
N -3  1  0 -2 -2  0  6
D -3  0 -1 -1 -2 -1  1  6
E -4  0 -1 -1 -1 -2  0  2  5
Q -3  0 -1 -1 -1 -2  0  0  2  5
H -3 -1 -2 -2 -2 -2  1 -1  0  0  8
R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5
K -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5
M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5
I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4
L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4
V -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7
W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11
Residue groups, in matrix order: sulfhydryl (C); small hydrophilic (S, T, P, A, G); acid, acid-amide and hydrophilic (N, D, E, Q); basic (H, R, K); small hydrophobic (M, I, L, V); aromatic (F, Y, W)

Summary
• Many amino acid rate matrices exist and one needs to choose one for protein comparisons (alignment, phylogenetics...): do not hesitate to experiment!
• One should make a rational choice (as much as possible):
– How was the rate matrix produced?
– What are the structural features of the sequences you are comparing? Globular/membrane protein?
– What is the level of sequence identity of the compared sequences?
• Always try to correct for rate heterogeneity between sites in phylogenetics!

Summary 2
• In practice MDMs are obtained by averaging the observed changes and amino acid frequencies over numerous proteins (e.g. JTT, BLOSUM) and are then used for your specific dataset
– You can correct an MDM for the πi values of your data (amino acid frequencies)
• Specific matrices have been calculated to reflect particular composition biases (e.g. the mitochondrial proteins matrix: mtREV24)
• Future work: what about context-dependent MDMs: alpha helices versus beta sheets, surface accessibility? (Heterogeneous models)

From DNA/protein sequences to trees
[Flow chart]
*1 Sequence data
*2 Align sequences
*3 Phylogenetic signal? Patterns -> evolutionary processes?
*4 Choose a method
– Distance methods: distance calculation (which model?), then an optimality criterion (LS, ME) or a single tree (NJ)
– Character-based methods: MB (model?), ML (model?), MP (weighting? sites, changes?)
Calculate or estimate best fit tree
5 Test phylogenetic reliability
Modified from Hillis et al. (1993). Methods in Enzymology 224, 456-487
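
As an illustration of the "distance calculation (which model?)" step in the flow chart, the Kimura correction quoted earlier in the talk, dij = -ln(1 - Dij - Dij^2/5), can be sketched in a few lines. This is a minimal sketch, not the Clustal or PHYLIP implementation; the function name `kimura_protein_distance` and the choice to return infinity once the logarithm's argument reaches zero are assumptions made for the example.

```python
import math


def kimura_protein_distance(observed_dissimilarity):
    """Kimura's correction for multiple hits: d = -ln(1 - D - D^2/5).

    D is the observed proportion of differing sites (0 to 1).
    Hypothetical helper for illustration only.
    """
    D = observed_dissimilarity
    if not 0.0 <= D <= 1.0:
        raise ValueError("observed dissimilarity must lie between 0 and 1")
    argument = 1.0 - D - (D * D) / 5.0
    if argument <= 0.0:
        # The argument reaches zero near D = 0.8541, where the
        # corrected distance becomes infinite.
        return math.inf
    return -math.log(argument)


# The corrected distance exceeds the observed dissimilarity, and the
# gap widens as sequences become more divergent.
for D in (0.1, 0.3, 0.5, 0.75):
    print(D, round(kimura_protein_distance(D), 4))
```

The threshold follows from solving 1 - D - D^2/5 = 0, which gives D = (sqrt(45) - 5)/2, about 0.8541. Because the formula uses only the fraction of differing sites, it ignores which amino acids are exchanged, which is the motivation the slides give for MDM-based distances.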