Burkhard Morgenstern, Grundlagen der Bioinformatik (Foundations of Bioinformatics): Substitutionsmatrizen (Substitution matrices), SS 07 (summer semester 2007)

Substitution matrices

All protein alignment programs depend on similarity scores s(a,b). The similarity score s(a,b) for amino acids a and b is based on the probability p_{a,b} of the substitution a → b. Idea: it is more reasonable to align amino acids that are frequently (that is, with high probability) replaced by each other.

To compute the similarity score s(a,b) for amino acids a and b, we need:
- the probability p_{a,b} of the substitution a → b (or b → a),
- the frequencies q_a and q_b of a and b.

Define s(a,b) = log( p_{a,b} / (q_a q_b) ).

The two quantities are estimated as follows:
1. Estimate q_a as the relative frequency of a (possibly with pseudocounts).
2. Estimate the probability p_{a,b} of the substitution a → b from substitutions observed in real-world sequences.

Simplifying assumptions:
- Evolution is considered a random process: the substitution a → b occurs with a probability p_{a,b} that depends on a and b.
- p_{a,b} = p_{a,b}(t), i.e. the probability depends on the time span t in evolution since the sequences originated from a common ancestor.
- p_{a,b} does not depend on the sequence position.
- Sequence positions are independent of each other.
- p_{a,b} = p_{b,a} (symmetry).

Result: PAM matrix (Dayhoff et al.).
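As an illustration of the log-odds definition s(a,b) = log( p_{a,b} / (q_a q_b) ), here is a minimal Python sketch. It is not the procedure used for any published matrix. The function name log_odds_scores, the toy two-letter sequences, and the choice of adding one pseudocount to every ordered pair are assumptions made for this example. Each alignment column is counted in both directions so that the estimate satisfies the symmetry assumption p_{a,b} = p_{b,a}, and q_a is taken as the marginal of the pair probabilities.

```python
import math
from itertools import product


def log_odds_scores(seq1, seq2, pseudocount=1.0):
    """Estimate s(a,b) = log(p_ab / (q_a * q_b)) from one pairwise alignment.

    seq1 and seq2 are aligned strings of equal length; columns that
    contain a gap character '-' are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")

    # Keep only gap-free alignment columns.
    columns = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    alphabet = sorted({residue for column in columns for residue in column})

    # Start every ordered pair at the pseudocount (an assumption of this sketch).
    counts = {pair: pseudocount for pair in product(alphabet, repeat=2)}

    # Count each column in both directions to enforce p_ab = p_ba.
    for a, b in columns:
        counts[(a, b)] += 1
        counts[(b, a)] += 1

    total = sum(counts.values())
    p = {pair: count / total for pair, count in counts.items()}

    # q_a is the marginal frequency of a under the pair distribution.
    q = {a: sum(p[(a, b)] for b in alphabet) for a in alphabet}

    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}


# Toy alignment over a hypothetical two-letter alphabet.
scores = log_odds_scores("AAGGAAGG", "AAGGAGGA")
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

In this toy example identical pairs are observed more often than expected from the residue frequencies alone, so they receive positive scores, while the rarer A/G mismatch receives a negative score.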
To calculate p_{a,b}: consider alignments of related proteins and count the substitutions a → b (or b → a). For example, take the two related sequences

ESWTSRQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMSQRREALYWIALY

and align them, with gaps marked by "-":

ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY

Problems involved:
1. The probability p_{a,b} depends on the time t since the sequences separated in evolution: p_{a,b} = p_{a,b}(t). But p_{a,b}(t) is not linear in t for large t.
2. The alignment of the protein families must be known.
3. Multiple mutations can occur at one sequence position.
4. Protein families contain multiple sequences, so the phylogenetic tree must be known.

Solution for 1 to 3 (time dependence, alignment, multiple mutations):
- Look at small evolutionary distances first and normalize to distance = 1 PAM (percentage of accepted mutations, i.e. on average one accepted substitution per 100 positions).
- Calculate substitution matrices for larger distances based on the matrices for small distances.

Solution for 4 (tree must be known): use parsimony to find the tree.

Source of the PAM matrices: M. Dayhoff et al. (1978), Atlas of Protein Sequence and Structure.

Calculation of p_{a,b}(t):
1. Consider multiple alignments of closely related protein families.
2. Count the substitutions a → b (or b → a) in the alignments, based on the phylogenetic tree.
3. Estimate p_{a,b}(t) for small times t.
4. Normalize to distance t = 1 PAM.
5. Calculate the conditional probabilities p(a|b,t) for small t.
6. Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication.
7. Calculate p_{a,b}(t) for larger evolutionary distances.

Alternative: BLOSUM matrices (S. Henikoff and J.G. Henikoff, PNAS, 1992)

Basis: the BLOCKS database, which contains gap-free regions of multiple alignments.
- Sequences within a block are clustered if their percentage of identity exceeds a threshold L.
- p_{a,b}(t) is estimated directly, without extrapolating from small distances.

Commonly used values: L = 62 (BLOSUM62) and L = 50 (BLOSUM50).
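The extrapolation step in the PAM calculation (conditional probabilities for larger distances by matrix multiplication) can be sketched in Python as follows. This is a toy illustration, not the Dayhoff data: the three-letter alphabet, the matrix PAM1_TOY, the uniform frequencies FREQS_TOY, and the helper names are all hypothetical. The sketch assumes the convention that entry [a][b] of the matrix is the conditional probability of seeing b given a after one PAM unit, so that t PAM units correspond to the t-th matrix power. The toy matrix is symmetric and changes 1% of positions on average, which matches the 1 PAM normalization.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, t):
    """Return m raised to the integer power t (t >= 0) by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


def pam_scores(pam1, freqs, t):
    """Log-odds scores at distance t PAM.

    The joint probability is p_ab(t) = q_a * p(b|a,t), so
    s(a,b) = log(p_ab(t) / (q_a * q_b)) = log(p(b|a,t) / q_b).
    """
    cond = mat_power(pam1, t)
    n = len(freqs)
    return [[math.log(cond[a][b] / freqs[b]) for b in range(n)]
            for a in range(n)]


# Hypothetical 1-PAM conditional matrix for a three-letter alphabet.
# Each row sums to 1; the matrix is symmetric.
PAM1_TOY = [
    [0.988, 0.009, 0.003],
    [0.009, 0.988, 0.003],
    [0.003, 0.003, 0.994],
]
FREQS_TOY = [1 / 3, 1 / 3, 1 / 3]

# Average fraction of positions that change in one step (should be 1%).
expected_change = sum(q * (1 - PAM1_TOY[i][i]) for i, q in enumerate(FREQS_TOY))

scores_1 = pam_scores(PAM1_TOY, FREQS_TOY, 1)
scores_250 = pam_scores(PAM1_TOY, FREQS_TOY, 250)
print("expected change at 1 PAM:", round(expected_change, 4))
print("match score, t = 1 vs t = 250:",
      round(scores_1[0][0], 3), round(scores_250[0][0], 3))
```

With growing distance t, the conditional probabilities drift toward the background frequencies, so match scores shrink and mismatch scores rise. This is why matrices for distant sequences are more tolerant of substitutions than matrices for close ones.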
