PAM Matrices
INTRODUCTION

PAM matrices are point accepted mutation matrices, also called percent accepted mutation matrices (Dayhoff). They are built from sequences that are very similar to each other. From such sequences we can derive substitution frequencies and use them to estimate how often each substitution occurs over evolutionary time. The matrices consider only mutations that have been “accepted” by natural selection. They were developed by Margaret Dayhoff and co-workers and published in 1978, together with a precise and rigorous approach for implementing a model of evolutionary change in their mutation data matrix.

PAM matrices are mainly used to score sequence alignments (usually of amino acids). One PAM of evolution means that 1% of residues have changed, averaged over all 20 amino acids. To get the relative frequency of each type of mutation, we count the number of times it is observed in multiple sequence alignments. The matrices are based on global alignments and on a Markov model of evolution.

Note: a global alignment arranges nucleotide or amino acid sequences so that every residue in every sequence is aligned. It is intended for sequences that are similar and roughly equal in length.

Note: a mutation is a change in the nucleotide sequence of the genome of an organism, virus, or extrachromosomal genetic element.

Model assumptions:
• Only mutations are allowed
• Sites evolve independently

[Figure: example of a codon table]

By aligning the sequences, we assert that the aligned bases in each column had a common ancestor.

PAM DISTANCES

PAM distances:
• In a PAM matrix, as the number increases so does the evolutionary distance.
• Two sequences are at 1 PAM distance if one can be converted into the other with an average of 1 accepted mutation per 100 amino acids.
• If we take many pairs of sequences at 1 PAM distance, about 1% of the aligned amino acid positions will differ within each pair.
• We can derive the frequencies expected for each of the amino acid pairs.
• The probability of two independent events is the product of the two individual probabilities.

PAM 1

• PAM 1:
  ‒ 1 substitution per 100 residues
  ‒ one PAM unit of time
  ‒ the probability of each amino acid changing into another is ~1%, and the probability of it not changing is ~99%

[Figure: Dayhoff PAM 250 matrix example]

MARKOVIAN EVOLUTION

• In a Markov process, the next state depends only on the current state.
• Evolution is Markovian: base (or amino acid) changes occur at a constant rate and depend only on the identity of the current base (or amino acid).
• Example: Markovian evolution is an extrapolation:
  • Start with all G’s and wait 1 million years (MY). Where do they go?
    - using PAM1, we expect them to mutate to about 0.0002 A, 0.0007 P, 0.9946 G, etc.
  • Wait another million years.
    - the new A’s mutate according to PAM1 for A’s
    - etc.
  • Wait another million years, and so on.
    - what is the final distribution of amino acids at positions that were once G’s?

SCORING MATRIX

A scoring matrix is used to compute alignment scores. The alignment score is the sum of the matrix entries for each aligned amino acid pair. Each entry reflects how often the two amino acids are found matched in evolutionarily related sequences, so it gives a measure of whether their alignment is more likely due to relatedness or to chance. PAM is a set of matrices used to score sequence alignments. A sequence alignment is a way of arranging sequences of amino acids or nucleic acids to identify regions of similarity. After scoring, if the alignment score is greater than 0, the sequences are considered to be related; if the score is negative, they are assumed to be unrelated. Scoring matrices can be used for any kind of sequence (e.g. DNA and amino acid). Identical amino acid pairs and frequently observed substitutions are assigned the most positive scores.
However, pairs that are unlikely to be the result of evolution, and so tend not to indicate relatedness at that location, are given negative scores.

Some matrices use scoring schemes based only on side-chain similarity. The codon table shows each DNA base combining with two other bases to form a codon, which codes for an amino acid. The amino acids are categorised below as hydrophobic or hydrophilic; this property determines the side-chain similarity of the twenty amino acids.

| Hydrophobic   | Hydrophilic   |
|---------------|---------------|
| Alanine       | Cysteine      |
| Glycine       | Aspartic acid |
| Isoleucine    | Glutamic acid |
| Leucine       | Histidine     |
| Phenylalanine | Lysine        |
| Methionine    | Asparagine    |
| Proline       | Glutamine     |
| Valine        | Arginine      |
| Tryptophan    | Serine        |
|               | Threonine     |
|               | Tyrosine      |

Example 1:
TAGHVRP
HVGGSQM
T - H = (Hydrophilic-Hydrophilic)
A - V = (Hydrophobic-Hydrophobic)
G - G = (Hydrophobic-Hydrophobic)
H - G = (Hydrophilic-Hydrophobic)
V - S = (Hydrophobic-Hydrophilic)
R - Q = (Hydrophilic-Hydrophilic)
P - M = (Hydrophobic-Hydrophobic)
Pairs with similar side chains (5) outnumber pairs with different side chains (2).
Result: the score will be positive.

Example 2:
ATGRFWV
NVTRFEY
A - N = (Hydrophobic-Hydrophilic)
T - V = (Hydrophilic-Hydrophobic)
G - T = (Hydrophobic-Hydrophilic)
R - R = (Hydrophilic-Hydrophilic)
F - F = (Hydrophobic-Hydrophobic)
W - E = (Hydrophobic-Hydrophilic)
V - Y = (Hydrophobic-Hydrophilic)
Pairs with different side chains (5) outnumber identical and similar pairs (2).
Result: the score will be negative.

$S = [s_{ij}]$ gives the score of aligning character i with character j for every pair i, j.

Example 3: Using the PAM250 scoring matrix, calculate the score of the alignment below, where character i in the first sequence is aligned with character j in the second.
STPP
CTCA
0 + 3 + (-3) + 1 = 1

Gap
1. A gap is a consecutive, uninterrupted run of spaces in a sequence alignment. It corresponds to an insertion or deletion of amino acids or nucleotides.
2. Gaps are represented as dashes in a protein or DNA sequence alignment.
3.
The length of a gap is the number of inserted or deleted characters it spans.
Example 4:
attc--ga-tggacc
a--cgtgatt---cc
Seven matches, no mismatches, four gaps and eight spaces.
4. Gaps are introduced into a sequence alignment so that the alignment can be extended into regions where one sequence has lost or gained characters that are not found in the other.

Gap penalties
1. Gap penalty values are designed to reduce the alignment score when an alignment has been interrupted by an insertion or deletion in one of the sequences.
2. A penalty value is subtracted for each gap introduced into an alignment, because each gap adds uncertainty to the alignment.
3. The gap penalty helps decide whether to accept a gap (an insertion or deletion) in an alignment when a good residue-to-residue alignment could be attained at some neighbouring point in the sequence.
4. If the gap penalty is too low, a high alignment score is attainable even between unrelated or random sequences.
5. The penalty should be small enough to allow a previously accumulated alignment to continue across an insertion or deletion in one of the sequences, but not so large that it cancels the previously accumulated alignment score.

RELATIVE MUTABILITY

Dayhoff et al. described the “relative mutability” of each amino acid as the probability that the amino acid will change over a small evolutionary time period. The total number of changes is counted (on all branches of all protein trees considered), together with the total number of occurrences of each amino acid, and a ratio is determined. The relative mutability is calculated per amino acid because amino acids are not equally mutable: some residues are observed to mutate more frequently than others per occurrence.
This is taken into consideration by defining the relative mutability of amino acid j as the number of times amino acid j mutated divided by the number of occurrences of amino acid j. The relative mutabilities and the point accepted mutation matrix are then used to calculate the mutation probability matrix.

Relative mutability = [changes] / [occurrences]

Example:
Sequence 1: ala his val ala
Sequence 2: ala arg ser val
For ala, relative mutability = [1] / [3] = 0.33
For val, relative mutability = [2] / [2] = 1.0

SUBSTITUTION FREQUENCY

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. They should reflect the true probabilities of mutations occurring over a period of evolution.

Example: $F_{G,A}$
A substitution may occur as A→G or G→A, and both directions are counted together. Therefore, in the example alignment, $F_{G,A} = 3$.

MUTATION PROBABILITY

The mutation probability matrix can be computed once the relative mutabilities have been calculated. It combines the relative mutability of each amino acid with the point accepted mutation matrix:

$M_{ij} = \dfrac{m_j F_{ij}}{\sum_i F_{ij}}$

$M_{ij}$ is the probability that an original amino acid j (in columns) will be replaced by amino acid i (in rows) over a defined evolutionary interval. The relatedness odds entries $R_{ij}$ are the $M_{ij}$ values divided by the frequency of occurrence $f_i$ of residue i; the score is the base-10 logarithm of this ratio.
• $f_G$ = 10 G / 63 residues = 0.1587
• $R_{G,A}$ = log(2.1 / 0.1587) = log(13.2325) = 1.1216 ≈ 1

QUESTION (RELATIVE MUTABILITY)

1. Given a set of aligned sequences, how many G→X substitutions are there across all pairs of sequences?
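The relative mutability and mutation probability definitions above can be sketched in Python. This is only an illustration under stated assumptions: it handles a single pair of gap-free aligned sequences, the function names `pair_counts`, `relative_mutability` and `mutation_probability` are hypothetical, and the diagonal entry is set to 1 - m_j so that each column sums to one (the formula above only defines the off-diagonal entries).

```python
from collections import Counter


def pair_counts(seq1, seq2):
    """Tally residue occurrences, per-residue changes, and symmetric pair counts F."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    F = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # A mismatch counts as one change for each residue involved.
            changes[a] += 1
            changes[b] += 1
            # Substitutions are counted in both directions, so F is symmetric.
            F[(a, b)] += 1
            F[(b, a)] += 1
    return occurrences, changes, F


def relative_mutability(occurrences, changes):
    """m_j = [changes] / [occurrences] for every residue j that occurs."""
    return {res: changes[res] / occurrences[res] for res in occurrences}


def mutation_probability(m, F):
    """M[(i, j)]: probability that original residue j is replaced by residue i."""
    residues = sorted(m)
    M = {}
    for j in residues:
        column_total = sum(F[(i, j)] for i in residues if i != j)
        for i in residues:
            if i == j:
                # Assumption: the diagonal is whatever makes the column sum to 1.
                M[(i, j)] = 1.0 - m[j]
            elif column_total:
                M[(i, j)] = m[j] * F[(i, j)] / column_total
            else:
                M[(i, j)] = 0.0
    return M


# The worked example from the text.
seq1 = ["ala", "his", "val", "ala"]
seq2 = ["ala", "arg", "ser", "val"]
occurrences, changes, F = pair_counts(seq1, seq2)
m = relative_mutability(occurrences, changes)
M = mutation_probability(m, F)
print(round(m["ala"], 2), m["val"])  # 0.33 1.0
```

With real data the counts would come from many alignments (Dayhoff used phylogenetic trees of protein families), so this two-sequence tally is only meant to make the ratio concrete.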
AMINO ACID SUBSTITUTION MATRIX

During evolution, DNA mutations occur and gradually change the genotype and phenotype of an organism. A substitution matrix describes how often a residue in a sequence is replaced by another residue over a period of time. Such matrices can be built for amino acid or DNA sequences. The divergence time and the residues substituted between two sequences are represented in the substitution matrix, which is used to determine the percentage similarity between them.

The amino acid substitution matrix is based on two important concepts, both introduced by Dayhoff and co-workers.

The first concept is the use of a “log-odds” strategy to score the alignment. Alignments presumed to be correct are collected first. The number of residue substitutions in these alignments is then estimated to form the ratio of observed to expected probabilities, where the expected probabilities are based on a model for chance alignments. To score an alignment, the odds ratios of the aligned residue pairs are multiplied or, equivalently, their logarithms are added. A negative log-odds score signifies that the residue exchange is more likely to occur in a chance alignment than in a correct alignment, and a positive score signifies the opposite.

The second concept is to base the substitution numbers on an estimated mutation rate. Every mutation is assumed to occur independently of previous mutations. For this model to be valid, the mutation rate should be calculated from alignments of closely related sequences. Otherwise, intermediate mutations could be missed.
A consequence of using closely related sequences to estimate mutation rates is that these rates must be extrapolated to model greater evolutionary distances.

PAM (Point Accepted Mutation) MATRIX

In the 1970s, Margaret Dayhoff developed the first amino acid substitution matrix, known as the PAM matrix. PAM is the unit of evolutionary change for protein sequences. Each row and column of the matrix represents one of the 20 standard amino acids. All the data used for PAM came from alignments of closely related proteins with more than 85% amino acid identity. The PAM matrix is also based on global sequence alignment, and is calculated from the differences observed between closely related proteins.

The result is represented as a log-odds table: the probability of an aligned pair in related sequences divided by its probability in unrelated sequences. This ratio is then converted to a logarithm (Dayhoff's published tables use 10 × log10; base 2 is also common and gives scores in bits).

When computing a PAM matrix, it is difficult to collect statistics about amino acid substitutions in distantly diverged sequences. For example, finding which positions correspond is hard when the PAM distance between sequences is large, because many insertions and deletions have taken place. It is therefore easier to compute the PAM matrix from closely related protein sequences.

The most likely changes in an amino acid sequence during evolution can be predicted if the ancestral relationships among a group of proteins are assessed. Margaret Dayhoff pioneered this type of analysis and used it to produce a scoring matrix (the PAM matrix). A PAM matrix is used to compare two sequences that are a specific number of PAM units apart. There are many PAM matrices, for example PAM 1, PAM 2 and PAM 50.
PAM 1 is the matrix calculated from comparisons of sequences with no more than 1% divergence in amino acid sequence. It is used as the basis for calculating the other PAM matrices: all of them are extrapolated from PAM1 by assuming that repeated mutations follow the same pattern as those in the PAM1 matrix, and that multiple substitutions can occur at the same site. Using this logic, Dayhoff derived matrices as high as PAM250.

PAM MATRIX ASSUMPTION

Several assumptions are made before a PAM matrix is generated:
• Only mutations (substitutions) are allowed to occur.
• The probability of amino acid X replacing Y is the same as the probability of amino acid Y replacing X.
• All positions in a protein are equally mutable.
• Closely related proteins are used, to reduce the chance of unobserved intermediate mutations.
• Following the Markov model, the probability of a replacement at any site depends only on the amino acid currently at that site.
• All sequences have the same amino acid composition.

GENERATING PAM MATRIX

In Dayhoff's approach, the PAM 1 matrix is constructed from 1572 observed mutations in 71 families of closely related proteins. Statistics were collected from aligned sequences that are one PAM unit apart. The PAM 1 matrix is constructed as follows:
1. $M_{ij}$ = estimated probability of amino acid j mutating into amino acid i in one PAM unit of evolutionary change.
2. Result: a 20 × 20 real matrix is generated. The values in each matrix column add up to one.
3. Every possible identity and substitution is assigned a score based on the observed frequencies of such occurrences in alignments of related proteins.
4. The diagonal of the score matrix contains only positive scores.
5.
Residue pairs with an odds ratio of 1 (log-odds score of 0) are found as alternatives at exactly the frequency predicted by chance. A ratio above 1 (positive score) means the pair is found more often than predicted by chance. A ratio below 1 (negative score) means these residues are not functionally equivalent.

[Figure: example of Dayhoff’s PAM matrix]

PROBLEM OF PAM MATRIX

The main problem of the PAM matrix is the false assumption that all sites are equally mutable. In reality, different sites have different probabilities of mutation. In addition, the matrix is biased because few protein sequences were available at that time, and it is mainly based on small globular proteins.

Calculate the difference in PAM matrix

1. A PAM 80 matrix represents sequences with an average of 50% identical amino acids.
   $M_{ii}$ = 50% = 50/100 = 0.5
   Probability of change (sum of $M_{ij}$ over all j ≠ i) = 1 - 0.5 = 0.5 = 50%
   Hence, a PAM 80 matrix corresponds to a 50% difference in amino acids.
2. A PAM 250 matrix represents sequences with an average of 20% identical amino acids.
   $M_{ii}$ = 20% = 20/100 = 0.2
   Probability of change (sum of $M_{ij}$ over all j ≠ i) = 1 - 0.2 = 0.8 = 80%
   Hence, a PAM 250 matrix corresponds to an 80% difference in amino acids.

COMPUTATIONAL STEPS

Frequency:
$f(j) = n(j) / N$
where f(j) is the frequency of the jth amino acid, n(j) is the total number of occurrences of the jth amino acid, and N is the total number of all amino acids.

Suppose the alignment is
Sequence 1: AB
Sequence 2: AA
Then n(A) = 3, n(B) = 1 and N = 4, so f(A) = 3/4 = 0.75 and f(B) = 1/4 = 0.25.

Estimated probabilities: we construct the mutation matrix M so that the entry M(i,j) represents the probability of the jth amino acid being replaced by the ith amino acid.

Expected number of mutations: the expected number of mutations is computed from f and M, and is then rescaled by a factor α to give the new expected number of mutations.

k-PAM: the probability matrix and scores for k PAM units are computed in the same way, with the values of $M^k$ plugged in instead of those of M.

DEMO SOFTWARE
• http://www.ebi.ac.uk/Tools/psa/emboss_needle/

EFFECTS OF PAM MATRICES

PAM matrices, introduced by Margaret Dayhoff, are used to find the relative frequency with which amino acids replace or substitute for each other during the course of evolution.
Substitution matrices define a value for every possible pair of residues, so that an individual score can be assigned to each aligned sequence position. With PAM substitution matrices, the evolutionary origins of proteins can be traced using substitution frequencies derived from sets of closely related protein sequences. Each matrix corresponds to a particular quantity of accepted mutations, that is, mutations that have been retained in the sequence.

PAM matrices have several advantages. Firstly, they are based on global alignment, which includes both the most conserved and the most mutable regions, so all mutations in the sequence are taken into account. This can be extremely helpful in determining the processes responsible for these mutations, and it provides criteria for how a mutation is selected and fixed in the population. Secondly, PAM tables are constructed from sequence data and so provide information about how amino acid residues change after a given number of mutations. Lastly, they provide an empirical determination of conservative replacements.

The limitations of PAM matrices are that they assume all types of mutation are distributed uniformly across the protein, and that they use data from closely related proteins to infer the relationship between more distant proteins.

COMPARISON BETWEEN PAM MATRICES AND BLOSUM

PAM matrices and BLOSUM are compared because both encode the same kind of scoring information but are derived by different methods. As a result, matrices with the same number are not equivalent: PAM100 differs from BLOSUM 100 and is instead roughly comparable to BLOSUM 90.
| METHOD | PAM matrices | BLOSUM |
|---|---|---|
| NAME | Percent accepted mutation | Blocks substitution matrices |
| ALIGNMENT | Global alignment; all mutations in the sequence are taken into account | Local alignment; only conserved blocks without gaps are used, not the whole sequence |
| OBJECTIVE | Refers to a specific evolutionary distance | Refers to percent identity; always a blend of distances as found in the databases (e.g. PROSITE) |
| ALIGNING DISTANTLY RELATED SEQUENCES | PAM 250 (long sequences) | BLOSUM 50 |
| ALIGNING CLOSELY RELATED SEQUENCES | PAM 120 (short sequences) | BLOSUM 62 |
| COMPARISON OF SEQUENCE | PAM 120: no more than 120 PAM units of divergence | BLOSUM 62: approx. 62% identity |
| RANGES | From identical to completely random | Narrower ranges than PAM matrices |
| EXTRAPOLATION | Yes, extrapolated from PAM1 using an assumed Markov chain | No, based on observed alignments from comparisons of related proteins |

Note: extrapolation is the process of estimating the value of a variable beyond the original observation interval, on the basis of its relationship with another variable.
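The extrapolation from PAM1 to higher PAM matrices described in this document can be sketched as repeated matrix multiplication. The example below is a minimal illustration, not Dayhoff's actual data: it assumes a hypothetical two-letter alphabet with equal frequencies, a made-up PAM1 matrix in which 1% of residues change, and a score scaled as 10 × log10. The helper names `mat_mul`, `pam_k`, `log_odds` and `expected_identity` are invented for this sketch.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(pam1, k):
    """Extrapolate PAM1 to PAM-k by multiplying the matrix by itself k times."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, pam1)
    return result


def log_odds(Mk, f):
    """Score s_ij = 10 * log10(M_ij / f_i); the 10 * log10 scaling is an assumption."""
    n = len(Mk)
    return [[10 * math.log10(Mk[i][j] / f[i]) for j in range(n)]
            for i in range(n)]


def expected_identity(Mk, f):
    """Average probability that a residue is unchanged, weighted by frequency."""
    return sum(f[j] * Mk[j][j] for j in range(len(f)))


# Hypothetical two-letter alphabet; entry [i][j] = P(residue j is replaced by i),
# so every column sums to 1 and 1% of residues change per PAM unit.
f = [0.5, 0.5]
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

pam250 = pam_k(pam1, 250)
scores = log_odds(pam250, f)
print(expected_identity(pam1, f), expected_identity(pam250, f))
```

In this toy model the expected identity falls from 99% at PAM1 towards the 50% expected between random two-letter sequences, which mirrors the "from identical to completely random" range listed for PAM in the table. With the real 20 × 20 PAM1 matrix the same procedure would apply, only with Dayhoff's published probabilities and frequencies in place of the hypothetical values.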