Derivation of the BLOSUM substitution matrices

Bioc 565, Dr. Cordes
Supplement handout to sequence evolution lecture, Sep 10, 2007

1. General facts about amino-acid substitution matrices:

The average likelihood of a particular mutation occurring and being accepted is influenced both by codon similarity and by chemical similarity of the amino acids (as a result of structural and functional constraints). These likelihoods are reflected in substitution matrices. The most common methods for constructing substitution matrices are empirical, i.e. the matrices are parametrized using observations of what substitutions actually occur among collections of related protein sequences. Substitution matrices are today commonly used in sequence alignment methods and database sequence similarity searches.

A pioneer in this field was Margaret Dayhoff, who created the PAM (Per cent Accepted point Mutations) substitution matrices. She used sets of closely related sequences to model how evolution occurs over short distances, and then extrapolated these findings to longer evolutionary distances.

The other most commonly used type of substitution matrix is the BLOSUM matrix, due to Henikoff and Henikoff. BLOSUM matrices use less closely related sequences for parametrization, and only use “blocks” of the best aligned portions. Portions that don’t align well and vary in length, such as surface loops, are not used. Thus the PAM and BLOSUM matrices show some differences. Matrices parametrized using soluble, globular proteins will also differ from those derived from transmembrane proteins.

A few words on nomenclature: you will see terms like “BLOSUM 62” and “PAM 250”. “BLOSUM 62” means that, before pairs were counted, sequences within each block that were 62% or more identical were clustered and treated as a single sequence, so the matrix reflects substitutions between sequences that are less than about 62% identical. “PAM 250” refers to the PAM unit: 1 PAM is the amount of evolution required to change, on average, 1% of a protein’s residues. A PAM 250 matrix is then the result of extrapolating data on closely related sequences to an evolutionary distance of 250 PAMs.
Another thing you will note is that the numbers in the PAM matrix don’t look like “percent accepted” values: many are negative numbers. This is because most matrices are presented in “log odds” form. The odds are figured as the probability of a particular change at a given evolutionary distance, normalized by the overall frequency of occurrence of a given amino acid in the database. These odds are then converted to logarithms.

2. Derivation of substitution matrices from sequence conservation data:

Construction of the BLOSUM matrices, based on Henikoff & Henikoff, PNAS USA, 89, 10915 (1992)

The general procedure for deriving generalized substitution matrices from sequence conservation data is to
1) assemble a large number of multiple alignments from different protein families
2) tabulate the frequencies of all the possible residue pairings
3) compare these frequencies to those expected in a random database of the same composition, giving an odds ratio
4) convert the odds ratio into a log format.

To make the BLOSUM matrices (which are probably the best and are certainly the most commonly used), Henikoff and Henikoff took multiple alignments of several hundred protein families and parsed them into a database of “blocks”, where a block is defined as a portion of a multiple alignment which contains no gaps. This analysis yielded a database of >2000 blocks. The block approach contrasts with the way the PAM matrices (Dayhoff) were constructed, because in that case regions containing gaps were included.

There are 20 different naturally occurring amino acids, and thus there are 20 + 19 + ... + 1 = 210 different possible unordered pairings of amino acids (20 pairings of an amino acid with itself plus 190 pairings of two different amino acids). Henikoff and Henikoff converted their database of blocks into a frequency table describing the number of occurrences $f_{ij}$ of each of the 210 possible pairings of amino acids $i$ and $j$, where $1 \le j \le i \le 20$. To illustrate construction of a frequency table, let’s look at a simple (and fake) example.
A given block can be described as having a width of w alignment positions and a depth of s sequences. The very small block shown below has w=4 (positions 1-4) and s=5 (sequences A-E). Let’s suppose that our entire database is composed of this one block.

              position in alignment
sequence      1    2    3    4
A             S    L    M    K
B             A    L    A    E
C             M    V    A    E
D             R    I    A    W
E             T    L    M    C

A block of w=4 and s=5 has a total of ws(s-1)/2 = 40 amino acid pairs, because each of the w columns contributes s(s-1)/2 = 10 pairs of sequences.

Let’s now consider, as an illustrative exercise, the frequencies of all pairs involving alanine. Alanine occurs at two positions in the block: positions 1 and 3. At the first position, there are 4 alanine-containing pairs: 1 each of the pairs AM, AR, AS and AT. At the third position, there are 9 alanine-containing pairs, including 3 different AA pairs (sequences B and C, B and D, and C and D) and 6 AM pairs (enumerate them to convince yourself). So summing the occurrences of each pair at the two positions, the frequency of occurrence of AM pairs is $f_{AM}$ = 1 + 6 = 7, and the frequency of occurrence of AA pairs is $f_{AA}$ = 0 + 3 = 3. The frequency of occurrence of AR, AS and AT pairs is just 1 each, while pairings of alanine with all other amino acids have frequencies of 0. If we did this same analysis with all 210 possible pairings we’d have a pair frequency table (or matrix) for our database.

The pair frequencies are then converted into pair probabilities. The probability of observing any particular pairing can be described as

$$q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}}$$

The denominator here simply corresponds to the sum total of all pair frequencies (which is just the total number of pairs) in the database. $q_{AM}$, for example, is 7/40 = 0.175, while $q_{AA}$ is 3/40 = 0.075.

The pair probabilities only have statistical meaning when compared to the pair probabilities that one would expect from a random database of the same amino-acid composition.
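The pair counting above can be checked mechanically. The following is a minimal Python sketch, not the Henikoffs' own program: it assumes the toy block is stored as one string per sequence, and the function name `pair_frequencies` is introduced here purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy block from the handout: one string per sequence (A-E), 4 aligned positions.
block = ["SLMK", "ALAE", "MVAE", "RIAW", "TLMC"]

def pair_frequencies(block):
    """Count unordered residue pairs, column by column (hypothetical helper)."""
    counts = Counter()
    for column in zip(*block):                   # one alignment position at a time
        for a, b in combinations(column, 2):     # every pair of sequences in the column
            counts[tuple(sorted((a, b)))] += 1   # ("A", "M") and ("M", "A") are the same pair
    return counts

f = pair_frequencies(block)
total = sum(f.values())                          # w * s * (s - 1) / 2 pairs in all
q = {pair: n / total for pair, n in f.items()}   # observed pair probabilities

print(total, f[("A", "M")], f[("A", "A")])       # prints 40 7 3
```

Pairs that never occur in the block are simply absent from `f`; a `Counter` returns 0 for them, which matches the zero frequencies described in the text.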
The expected probability of occurrence $e_{ij}$ for each pair is built from $p_i$ and $p_j$, the individual probabilities that a given residue in a pair is amino acid $i$ or $j$, respectively. The probability $p_i$ is generally figured as:

$$p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}$$

I find this formulation mildly confusing, and it should be noted that this number will also come out to equal the fractional population of amino acid i in the database. This makes sense. Intuitively, the probability that a residue taken from a pair is alanine should be equal to the fractional population of alanine in the database. For example, as figured from the equation above, $p_A$ = [3 + (10/2)]/40 = 0.2. The fractional population of alanine in the database is also 4/20 = 0.2. Similarly, $p_M$ = [1 + (10/2)]/40 = 0.15. The fractional population of methionine in the database is 3/20 = 0.15.

The expected probability is then $e_{ij} = p_i p_j$ for $i = j$ and $e_{ij} = 2 p_i p_j$ for $i \neq j$. The factor of 2 appears because the pairs are unordered: a pair of two different residues can arise as i with j or as j with i. With this factor the expected probabilities of all 210 pairings sum to 1, just as the observed probabilities do. For the example of alanine-methionine pairs, $e_{AM}$ = 2(0.2)(0.15) = 0.06. Similarly, for alanine pairing with itself, $e_{AA}$ = (0.2)(0.2) = 0.04.

The odds ratio is then computed as the observed probability $q_{ij}$ divided by the expected probability $e_{ij}$. This ratio is then converted into the log-odds ratio $s_{ij}$, where

$$s_{ij} = \log_2 \frac{q_{ij}}{e_{ij}}$$

For the AM pairing, $q_{AM}/e_{AM}$ = 0.175/0.06 = 2.92, and $s_{AM}$ = 1.54. For the AA pairing, $q_{AA}/e_{AA}$ = 0.075/0.04 = 1.88, and $s_{AA}$ = 0.91.

Note that if $q_{ij}$ is larger than $e_{ij}$, the odds ratio will be greater than 1 and the log-odds score will be positive, while if the opposite is true the odds ratio will be less than 1 and the log-odds score will be negative. Thus, pairings which occur more frequently than expected by chance will have positive log odds scores, while those which occur less frequently will have negative log odds scores.
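The arithmetic for the alanine example can be sketched in a few lines of Python. This sketch follows the convention of Henikoff & Henikoff (1992), in which the expected probability of an unordered pair is p_i p_j when the two residues are the same and 2 p_i p_j when they differ. The function names `marginal`, `expected` and `log_odds` are hypothetical, and the counts are the ones tabulated for the toy block (40 pairs in total).

```python
import math

TOTAL_PAIRS = 40  # w * s * (s - 1) / 2 for the toy block

def marginal(f_ii, f_i_other, total_pairs=TOTAL_PAIRS):
    """p_i = q_ii + sum over j != i of q_ij / 2, written in terms of raw counts."""
    return (f_ii + f_i_other / 2) / total_pairs

def expected(p_i, p_j, same):
    """e_ij under the unordered-pair convention: p_i*p_j if i == j, else 2*p_i*p_j."""
    return p_i * p_j if same else 2 * p_i * p_j

def log_odds(q_ij, e_ij):
    """Log-odds score in bits (base-2 logarithm)."""
    return math.log2(q_ij / e_ij)

p_A = marginal(f_ii=3, f_i_other=10)   # 3 AA pairs, 10 pairs of A with another residue
p_M = marginal(f_ii=1, f_i_other=10)   # 1 MM pair, 10 pairs of M with another residue

s_AM = log_odds(7 / TOTAL_PAIRS, expected(p_A, p_M, same=False))
s_AA = log_odds(3 / TOTAL_PAIRS, expected(p_A, p_A, same=True))

print(round(s_AM, 2), round(s_AA, 2))  # prints 1.54 0.91
```

Dropping the factor of 2 for unlike residues would raise every off-diagonal score by exactly 1 bit, so it is worth stating the convention explicitly whenever scores from different sources are compared.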
Note that the toy example above is unrealistic: there, alanine scores higher with methionine than with itself, whereas in any real situation alanine would have a higher log odds score with itself than with any other amino acid.

To convert them to final BLOSUM matrix element form, the log odds scores $s_{ij}$ are then all multiplied by a uniform scaling factor (2 in the case of the BLOSUM matrices, so that entries are in half-bit units) and rounded off to integers.
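As a small illustration of this last step, the sketch below scales a log-odds score by 2 and rounds it. The function name `to_matrix_entry` is hypothetical. The input 0.9 is roughly the alanine-alanine score from the toy example, and -1.3 is an arbitrary made-up value included only to show a negative entry. Python's built-in `round` sends exact ties to the nearest even integer; the handout does not say how ties were handled, so that detail is an assumption.

```python
def to_matrix_entry(log_odds_bits, scale=2):
    """Scale a log-odds score (in bits) and round to the nearest integer.

    scale=2 gives half-bit units, as described for the BLOSUM matrices.
    Exact ties (e.g. 2.5) go to the even neighbour under Python's round().
    """
    return round(scale * log_odds_bits)

print(to_matrix_entry(0.9))    # prints 2
print(to_matrix_entry(-1.3))   # prints -3
```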
