Download Similarity

Alignment methods
April 17, 2007

Quiz 1: Question on databases

Learning objectives
- Understand the difference between identity, similarity and homology.
- Understand how PAM scoring matrices are derived.
- Understand the difference between global alignment and local alignment.
- Knowledge of the Dotter software program.

Workshop: Import sequences of interest from GenBank, place them in FASTA format, and align the sequences using the Dotter program.

Homework #4 is due on Tues, April 24 at the beginning of class.

Purpose of finding differences and similarities of amino acids in two proteins
- Infer structural information
- Infer functional information
- Infer evolutionary relationships

Evolutionary Basis of Sequence Alignment
1. Similarity: Quantity that relates how much two amino acid sequences are alike.
2. Identity: Quantity that describes how much two sequences are alike in the strictest terms.
3. Homology: A conclusion drawn from data suggesting that two genes share a common evolutionary history.

Evolutionary Basis of Sequence Alignment (Cont. 1)
Why are there regions of identity?
1) Conserved function: residues participate in the reaction.
2) Structural: for example, conserved cysteine residues that form a disulfide linkage.
3) Historical: residues that are conserved solely due to a common ancestor gene.

Identity Matrix
     A  C  I  L
  A  1  0  0  0
  C  0  1  0  0
  I  0  0  1  0
  L  0  0  0  1
Simplest type of scoring matrix.

Similarity
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or something in between?

[Figure: alignment of two trypsin sequences]
One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.

Evolutionary Basis of Sequence Alignment (Cont. 2)
Note: it is possible that two proteins share a high degree of similarity but have two different functions.
For example, human zeta-crystallin is a lens protein that has no known enzymatic activity. It shares a high percentage of identity with E. coli quinone oxidoreductase. These proteins likely had a common ancestor, but their functions diverged. This is analogous to a railroad car that now functions as a diner.

Orthologs vs Paralogs
- Two proteins that have a common ancestor and exist in different species are said to be orthologs.
- Two proteins that have a common ancestor and exist in the same species are said to be paralogs.

Modular nature of proteins
The previous alignment was global. However, many proteins do not display global patterns of similarity. Instead, they possess local regions of similarity. Proteins can be thought of as assemblies of modular domains. It is thought that this may, in some cases, be due to an evolutionary process known as exon shuffling.

Modular nature of proteins (cont. 1)
  Start:                   Gene A: Exon 1a - Exon 2a
  Duplication of Exon 2a:  Gene A: Exon 1a - Exon 2a - Exon 2a
                           Gene B: Exon 1b - Exon 2b - Exon 2b
  Exchange with Gene B:    Gene A: Exon 1a - Exon 2a - Exon 3 (Exon 2b from Gene B)
                           Gene B: Exon 1b - Exon 2b - Exon 3 (Exon 2a from Gene A)

Scoring Matrices
Importance of scoring matrices
- Scoring matrices appear in all analyses involving sequence comparisons.
- The choice of matrix can strongly influence the outcome of the analysis.
- Scoring matrices implicitly represent a particular theory of relationships.
- Understanding the theory underlying a given scoring matrix can aid in choosing which matrix to use.

Scoring Matrices
When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. For example, M11 refers to the entry at the first row and the first column. In general, Mij refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value with each letter in the alphabet of the sequence.
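The index convention just described can be sketched in a few lines of Python. This is only an illustration: it reuses the four-letter alphabet (A, C, I, L) and the identity matrix from the earlier slide, and the names `INDEX`, `score` and `percent_identity` are made up for this example. Note that Python counts from 0, so M11 on the slide corresponds to `matrix[0][0]` here.

```python
# Illustrative sketch: associate a numeric index with each letter so that
# a scoring matrix entry Mij can be looked up for any pair of residues.
ALPHABET = "ACIL"  # toy alphabet taken from the identity-matrix slide
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}

# Identity matrix: 1 on the diagonal (identical residues), 0 elsewhere.
IDENTITY = [[1 if i == j else 0 for j in range(len(ALPHABET))]
            for i in range(len(ALPHABET))]


def score(a, b, matrix=IDENTITY):
    """Return the matrix entry for residues a (row) and b (column)."""
    return matrix[INDEX[a]][INDEX[b]]


def percent_identity(seq1, seq2):
    """Percent of aligned positions holding identical residues (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(score(a, b) for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


print(score("I", "I"))                    # 1
print(score("I", "L"))                    # 0
print(percent_identity("ACIL", "ACLL"))   # 75.0
```

Swapping `IDENTITY` for a similarity matrix such as a PAM matrix would change only the table, not the lookup.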
Two major scoring matrices for amino acid sequence comparisons
- PAM: derived from sequences known to be closely related (e.g., proteins from chimpanzee and human). PAM1 was created from empirical data and the other PAMs were mathematically derived.
- BLOSUM: derived from sequences that are not closely related (e.g., E. coli and human), using data stored in the BLOCKS database.

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
- Started by Margaret Dayhoff, 1978.
- A series of matrices describing the extent to which two amino acids have changed during evolution.
- Proteins were aligned by eye, and then the number of times an amino acid was substituted in different species was counted.

Protein families used to construct Dayhoff’s scoring matrix
  Protein              PAMs per 100 million years
  IgG kappa C region   37
  Kappa casein         33
  Serum albumin        26
  Cytochrome C         0.9
  Histone H3           0.14
  Histone H4           0.10

Numbers of accepted point mutations, multiplied by 10
(columns: original amino acid; rows: replacement amino acid)
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y
R   30
N  109  17
D  154   0 532
C   33  10   0   0
Q   93 120  50  76   0
E  266   0  94 831   0 422
G  579  10 156 162  10  30 112
H   21 103 226  43  10 243  23  10
I   66  30  36  13  17   8  35   0   3
L   95  17  37   0   0  75  15  17  40 253
K   57 477 322  85   0 147 104  60  23  43  39
M   29  17   0   0   0  20   7   7   0  57 207  90
F   20   7   7   0   0   0   0  17  20  90 167   0  17
P  345  67  27  10  10  93  40  49  50   7  43  43   4   7
S  772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
T  590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
W    0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Y   20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
V  365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17

Calculation of relative mutability of amino acid
- Find the frequency with which an amino acid changes to another amino acid at a certain position in a protein.
- Divide the frequency of amino acid change by the frequency with which the “j” (original) amino acid occurs in all proteins studied. This is called the “mutability”.
- Determine the constant by which the alanine mutability must be multiplied to get 100.
- Multiply the 19 other amino acid mutabilities by the same constant. This is called the relative mutability.

Relative mutabilities of amino acids
  Asn  Ser  Asp  Glu  Ala  Thr  Ile  Met  Gln  Val
  134  120  106  102  100   97   96   94   93   74

  His  Arg  Lys  Pro  Gly  Tyr  Phe  Leu  Cys  Trp
   66   65   56   56   49   41   41   40   20   18

Why are the mutabilities different?
- Amino acids with high mutabilities can be replaced by a similar amino acid (e.g., Asp for Glu).
- Conversely, amino acids with low mutabilities are unique and cannot readily be replaced.

Creation of a mutation probability matrix
- The accepted mutation data from the earlier slide and the mutability of each amino acid in nature were used to create a mutation probability matrix.
- Mij shows the probability that an original amino acid j (in columns) will be replaced by amino acid i (in rows) over a defined evolutionary interval.
- For PAM1, 1% of the amino acids have been changed.

PAM1 mutational probability matrix
[Table: PAM1 mutational probability matrix]
The values in each column sum to 10,000.

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
A 1-PAM unit is equivalent to 1 mutation found in a stretch of 2 aligned sequences, each containing 100 amino acids.

Example 1:
  ..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
    |||||||||||||| |||||||||||||||||||||||||||||||||||
  ..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
  length = 100, 1 mismatch, PAM distance = 1

A k-PAM unit is equivalent to k 1-PAM units (or M^k, the PAM1 matrix M raised to the kth power).

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
  Observed % difference           1   5  10  20  40  50   60   70   80
  Evolutionary distance in PAMs   1   5  11  23  56  80  112  159  246

Final Scoring Matrix is the Log-Odds Scoring Matrix
  S(a,b) = 10 log10(Mab / Pb)
The subscripts a and b index the original and replacement amino acids, Mab is the corresponding entry in the mutational probability matrix, and Pb is the frequency of amino acid b.

Summary of PAM Scoring Matrix
- PAM = a unit of evolution (1 PAM = 1 point mutation/100 amino acids).
- Accepted mutation means a fixed point mutation.
- Comparison of 71 groups of closely related proteins yielding 1,572 changes.
  (>85% identity)
- Different PAM matrices are derived from the PAM1 matrix by matrix multiplication.
- The matrices are converted to log-odds scoring matrices (frequency of change divided by probability of chance alignment, converted to log base 10).
- A PAM250 matrix is roughly equivalent to 20% identity in two sequences.

The Dotter Program
The program consists of three components:
- A sliding window.
- A table that gives a score for each amino acid match.
- A graph that converts the score to a dot of a certain density. The higher the density, the higher the score.

Two proteins that are similar in certain regions
Tissue plasminogen activator (PLAT) and coagulation factor 12 (F12).
[Figure: comparison of PLAT and F12 with the region of similarity marked]
A single region on F12 is similar to two regions on PLAT.

FASTA format
>gi|1244762|gb|AAA98563.1| p53 tumor suppressor homolog
MSQGTSPNSQETFNLLWDSLEQVTANEYTQIHERGVGYEYHEAEPDQTSLEISAYRIAQPDPYGRSESYD
LLNPIINQIPAPMPIADTQNNPLVNHCPYEDMPVSSTPYSPHDHVQSPQPSVPSNIKYPGEYVFEMSFAQ
PSKETKSTTWTYSEKLDKLYVRMATTCPVRFKTARPPPSGCQIRAMPIYMKPEHVQEVVKRCPNHATAKE
HNEKHPAPLHIVRCEHKLAKYHEDKYSGRQSVLIPHEMPQAGSEWVVNLYQFMCLGSCVGGPNRRPIQLV
FTLEKDNQVLGRRAVEVRICACPGRDRKADEKASLVSKPPSPKKNGFPQRSLVLTNDITKITPKKRKIDD
ECFTLKVRGRENYEILCKLRDIMELAARIPEAERLLYKQERQAPIGRLTSLPSSSSNGSQDGSRSSTAFS
TSDSSQVNSSQNNTQMVNGQVPHEEETPVTKCEPTENTIAQWLTKLGLQAYIDNFQQKGLHNMFQLDEFT
LEDLQSMRIGTGHRNKIWKSLLDYRRLLSSGTESQALQHAASNASTLSVGSQNSYCPGFYEVTRYTYKHT
ISYL

Workshop 3
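As preparation for the workshop, the three components listed for the Dotter program (sliding window, score table, score-to-density graph) can be sketched in Python. This is a rough illustration only and not Dotter's actual algorithm: identity scoring stands in for a real score table, and the window size, the density characters, and the function names `read_fasta`, `window_scores` and `render` are all assumptions made for this example.

```python
# Minimal dot-plot sketch (NOT Dotter's implementation; names are illustrative).

def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def window_scores(seq1, seq2, window=3):
    """Sliding window: score each window pair with identity scoring
    (1 per identical residue, 0 otherwise; an assumed score table)."""
    rows = max(len(seq1) - window + 1, 0)
    cols = max(len(seq2) - window + 1, 0)
    grid = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            grid[i][j] = sum(seq1[i + k] == seq2[j + k] for k in range(window))
    return grid


def render(grid, window=3):
    """Graph: map each score to a character; denser means higher score."""
    shades = " .:*#"  # assumed five-level density scale
    top = len(shades) - 1
    return "\n".join(
        "".join(shades[(s * top) // window] for s in row) for row in grid
    )


if __name__ == "__main__":
    fasta = ">seq1 toy example\nCNGTTDQVDK\n>seq2 toy example\nCNGTTDQVDK\n"
    (_, a), (_, b) = read_fasta(fasta)
    # Identical sequences give a dense main diagonal.
    print(render(window_scores(a, b, window=3), window=3))
```

A larger window suppresses isolated chance matches at the cost of blurring short regions of similarity, which is the trade-off to explore with the Dotter window setting in the workshop.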
