Burkhard Morgenstern
Grundlagen der Bioinformatik (Foundations of Bioinformatics), SS 07 (summer semester 2007)

Substitution matrices

All protein alignment programs depend on similarity scores s(a,b).
- The similarity score s(a,b) for amino acids a and b is based on the probability p_{a,b} of a substitution a → b.
- Idea: it is more reasonable to align amino acids that are frequently (with high probability) replaced by each other.

Computing the similarity score s(a,b) for amino acids a and b:
- p_{a,b}: probability of a substitution a → b (or b → a)
- q_a: frequency of a
- Define s(a,b) = log( p_{a,b} / (q_a q_b) )
- The score is positive if a and b are aligned more often than expected by chance (q_a q_b), and negative if less often.

Estimating the quantities:
1. Estimate q_a as the relative frequency of a (possibly with pseudocounts).
2. Estimate p_{a,b} from observed substitutions in real-world sequences.

Simplifying assumptions:
- Evolution is considered a random process: a substitution a → b occurs with probability p_{a,b}, depending only on a and b.
- p_{a,b} = p_{a,b}(t), i.e. the probability depends on the time span t in evolution since the sequences originated from a common ancestor.
- p_{a,b} does not depend on the sequence position.
- Sequence positions are independent of each other.
- p_{a,b} = p_{b,a} (symmetry).

Result: PAM matrix (Dayhoff et al.)
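The definition s(a,b) = log( p_{a,b} / (q_a q_b) ) and the two estimation steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name log_odds_scores, the optional pseudocount parameter, and the toy alignment are made up here, and a single pairwise alignment stands in for the large collections of related proteins that real matrices are estimated from. Each aligned column is counted half in each direction so that the symmetry assumption p_{a,b} = p_{b,a} holds by construction.

```python
import math


def log_odds_scores(seq1, seq2, pseudocount=0.0):
    """Return (scores, p, q) estimated from one pairwise alignment.

    p[(a, b)] is the symmetric pair probability, q[a] the residue frequency,
    and scores[(a, b)] = log(p[(a, b)] / (q[a] * q[b])).
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns that contain a gap character.
    columns = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    alphabet = sorted({residue for column in columns for residue in column})

    # Count each column half in each direction so that p[(a, b)] == p[(b, a)].
    counts = {(a, b): pseudocount for a in alphabet for b in alphabet}
    for a, b in columns:
        counts[(a, b)] += 0.5
        counts[(b, a)] += 0.5

    total = sum(counts.values())
    p = {pair: count / total for pair, count in counts.items()}
    # q_a is the marginal of p, i.e. the relative frequency of a
    # (including any pseudocounts).
    q = {a: sum(p[(a, b)] for b in alphabet) for a in alphabet}

    # Pairs that were never observed (p == 0) get no score;
    # a positive pseudocount avoids this.
    scores = {
        (a, b): math.log(p_ab / (q[a] * q[b]))
        for (a, b), p_ab in p.items()
        if p_ab > 0
    }
    return scores, p, q


# Toy alignment (made up for illustration, not real data).
scores, p, q = log_odds_scores("AAAL", "AALL")
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

In this toy case the identical pairs (A,A) and (L,L) receive positive scores and the mismatch (A,L) a negative one, because A and L are aligned to each other less often than their frequencies alone would predict.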
To calculate p_{a,b}: consider alignments of related proteins and count substitutions a → b (or b → a).

Related sequences, unaligned:
ESWTSRQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMSQRREALYWIALY

The same sequences, aligned with gaps:
ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY

Problems involved:
1. The probability p_{a,b} depends on the time t since the sequences separated in evolution: p_{a,b} = p_{a,b}(t). But p_{a,b}(t) is not linear in t for large t.
2. The alignment of the protein families must be known.
3. Multiple mutations can occur at one sequence position.
4. Protein families contain multiple sequences: the phylogenetic tree must be known.

Solution for 1 to 3 (time dependence, alignment, multiple mutations):
- Look at small evolutionary distances first and normalize to distance 1 PAM (percentage of accepted mutations, i.e. one accepted point mutation per 100 residues).
- Calculate substitution matrices for larger distances based on those for small distances.

Solution for 4 (tree must be known):
- Use parsimony to find the tree.

M. Dayhoff et al. (1978), Atlas of Protein Sequence and Structure: PAM matrices

Calculation of p_{a,b}(t):
1. Consider multiple alignments of closely related protein families.
2. Count substitutions a → b (or b → a) in the alignments, based on the phylogenetic tree.
3. Estimate p_{a,b}(t) for small times t.
4. Normalize to distance t = 1 PAM (percentage of accepted mutations).
5. Calculate the conditional probabilities p(a|b,t) for small t.
6. Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication.
7. Calculate p_{a,b}(t) for larger evolutionary distances.

Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
- Basis: BLOCKS database, gap-free regions of multiple alignments.
- Sequences are grouped into one cluster if their percentage of similarity is > L.
- Estimate p_{a,b}(t) directly, without extrapolating from small distances.
- Default values: L = 62 (BLOSUM62), L = 50 (BLOSUM50)
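To make the matrix-multiplication step of the PAM calculation above more concrete, here is a small Python sketch. It is not Dayhoff's data or procedure: the two-letter alphabet, the frequencies q, and the helper names mat_mul, mat_pow and fraction_changed are assumptions chosen for illustration. The matrix entry M[b][a] stands for the conditional probability p(a|b,t), the 1 PAM matrix is constructed so that 1% of positions are expected to change, and larger distances are obtained as matrix powers.

```python
import math


def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Return M multiplied by itself n times (identity for n == 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def fraction_changed(M, q):
    """Expected fraction of positions that differ after applying M."""
    return sum(q[i] * (1.0 - M[i][i]) for i in range(len(q)))


# Hypothetical two-letter alphabet with frequencies q (index 0 and 1).
q = [0.6, 0.4]

# x is the joint probability of each off-diagonal pair at distance 1 PAM.
# Using the same x in both directions keeps p_{a,b} = p_{b,a}, and
# 2 * x = 0.01 means 1% of positions are expected to change.
x = 0.005
M1 = [[1 - x / q[0], x / q[0]],
      [x / q[1], 1 - x / q[1]]]

# Conditional probabilities at distance 250 PAM by repeated multiplication.
M250 = mat_pow(M1, 250)

# Joint probability p_{a,b}(t) = q_b * p(a|b,t), so the log-odds score
# s(a,b) = log(p_{a,b}(t) / (q_a * q_b)) reduces to log(p(a|b,t) / q_a).
scores250 = [[math.log(M250[b][a] / q[a]) for a in range(2)]
             for b in range(2)]

print("changed at   1 PAM:", fraction_changed(M1, q))
print("changed at 250 PAM:", fraction_changed(M250, q))
```

The fraction of differing positions at 250 PAM comes out far below 250 times the 1 PAM value, which illustrates why p_{a,b}(t) is not linear in t for large t: repeated mutations at the same position overwrite each other.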