Burkhard Morgenstern
Grundlagen der Bioinformatik (Foundations of Bioinformatics)
Substitution matrices
Summer semester 2007
Substitution matrices
All protein alignment programs depend on similarity scores s(a,b).
The similarity score s(a,b) for amino acids a and b is based on
the probability p_{a,b} of the substitution a → b.
Idea: it is more reasonable to align amino acids that are
frequently (i.e. with high probability) replaced by each other.
Substitution matrices
To compute the similarity score s(a,b) for amino acids a and b, we need:
• the probability p_{a,b} of the substitution a → b (or b → a),
• the frequencies q_a and q_b of a and b.
Define
s(a,b) = log( p_{a,b} / (q_a q_b) )
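As a minimal sketch of this definition, the following Python snippet evaluates the log-odds score for one pair of amino acids. The function name log_odds_score and all numeric values are assumptions made up for illustration; they are not taken from any published matrix. The natural logarithm is used here, and a different base would only rescale the scores.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Return s(a,b) = log(p_ab / (q_a * q_b)), using the natural logarithm."""
    return math.log(p_ab / (q_a * q_b))


# Hypothetical background frequencies of two amino acids a and b.
q_a, q_b = 0.05, 0.08

# The pair is observed twice as often as expected by chance: positive score.
print(log_odds_score(0.008, q_a, q_b))

# The pair is observed half as often as expected by chance: negative score.
print(log_odds_score(0.002, q_a, q_b))
```

A positive score means the pair occurs in alignments more often than two independently drawn residues would, which is exactly the situation in which aligning a with b is reasonable.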
Substitution matrices
1. Estimate the background probability q_a as the relative frequency of a
(possibly with pseudocounts).
2. Estimate the probability p_{a,b} of the substitution a → b based
on observed substitutions in real-world sequences.
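Step 1 can be sketched in a few lines of Python. The function name background_frequencies, the pseudocount of 1 per amino acid, and the single input sequence are assumptions chosen for illustration; a real estimate would use a large sequence database.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_frequencies(sequences, pseudocount=1.0):
    """Estimate the frequency of each amino acid as a relative frequency.

    The pseudocount is added to every amino acid so that residues
    never seen in the sample still get a non-zero probability.
    """
    counts = Counter()
    for seq in sequences:
        # Ignore gap characters and any other non-standard symbols.
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


# Toy input: one sequence only, purely for demonstration.
freqs = background_frequencies(["ESWTSRQWERYTIALMSDQRREVLYWIALY"])
print(freqs["E"], freqs["C"])
```

Without the pseudocount, cysteine (C), which does not occur in the toy sequence, would get frequency 0 and make the log-odds score undefined.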
Substitution matrices
Simplifying assumptions:
• Consider evolution as a random process: the substitution a → b
occurs with a probability p_{a,b} that depends on a and b.
• p_{a,b} = p_{a,b}(t), i.e. the probability depends on the time span t
since the sequences diverged from their common ancestor.
• p_{a,b} does not depend on the sequence position.
• Sequence positions are independent of each other.
• p_{a,b} = p_{b,a} (symmetry!)
Substitution matrices
Result: PAM matrix (Dayhoff et al.)
Substitution matrices
To calculate p_{a,b}:
Consider alignments of related proteins and count the
substitutions a → b (or b → a).
Substitution matrices
Example: two related protein sequences, before alignment:
ESWTSRQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMSQRREALYWIALY
Substitution matrices
The same two sequences after alignment (gaps shown as -);
substitutions are counted column by column:
ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY
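Counting the substitutions in the alignment above can be sketched as follows. The function name count_substitutions is hypothetical, and the sketch makes two simplifying assumptions: columns containing a gap are skipped, and the pairs (a,b) and (b,a) are counted together, in line with the symmetry assumption.

```python
from collections import Counter


def count_substitutions(row1, row2):
    """Count aligned residue pairs in a pairwise alignment.

    Pairs are stored in sorted order, so a -> b and b -> a share one key.
    Columns that contain a gap character '-' are ignored.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    counts = Counter()
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue
        counts[tuple(sorted((a, b)))] += 1
    return counts


row1 = "ESWTS-RQWERYTIALMSDQRREVLYWIALY"
row2 = "ERWTSERQWERYTLALMS-QRREALYWIALY"

pairs = count_substitutions(row1, row2)
mismatches = {pair: n for pair, n in pairs.items() if pair[0] != pair[1]}
print(mismatches)
```

For this alignment the sketch finds three substitutions (S/R, I/L and V/A) among 29 gap-free columns. A realistic estimate of p_{a,b} would pool such counts over many alignments.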
Substitution matrices
Problems involved:
1. The probability p_{a,b} depends on the time t since the sequences
separated in evolution: p_{a,b} = p_{a,b}(t). But p_{a,b}(t) is not
linear in t for large t.
2. The alignment of the protein families must be known!
3. Multiple mutations can occur at one sequence position.
4. Protein families contain multiple sequences: the
phylogenetic tree must be known!
Substitution matrices

Solution for 1 to 3 (time dependence, alignment,
multiple mutations):
• Look at small evolutionary distances first and normalize
to distance = 1 PAM (point accepted mutation, i.e. one accepted
substitution per 100 residues; often read as "percent accepted
mutations").
• Calculate substitution matrices for larger distances
based on those for small distances.
Solution for 4 (tree must be known): use parsimony
to find the tree.
M. Dayhoff et al. (1978), Atlas of Protein Sequence and
Structure: PAM matrices
Substitution matrices
Calculation of p_{a,b}(t):
• Consider multiple alignments of closely related
protein families.
• Count substitutions a → b (or b → a) in the alignments,
based on the phylogenetic tree.
• Estimate p_{a,b}(t) for small times t.
• Normalize to distance t = 1 PAM (percentage of
accepted mutations).
• Calculate the conditional probabilities p(a|b,t) for small t.
• Calculate p(a|b,t) for larger evolutionary distances by
matrix multiplication.
• Calculate p_{a,b}(t) for larger evolutionary distances.
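The last three steps can be sketched in plain Python. Everything concrete here is an assumption for illustration: a hypothetical three-letter alphabet instead of 20 amino acids, made-up background frequencies q, and a toy 1-PAM matrix whose off-diagonal entries are simply proportional to q. Only the procedure (normalize to 1 PAM, raise the matrix to the power t, convert to log-odds scores) follows the steps above.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Return m raised to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical background frequencies for a three-letter alphabet.
q = [0.5, 0.3, 0.2]
n = len(q)

# Toy 1-PAM matrix: m1[b][a] = p(a | b, t = 1).
# The rate is scaled so that, on average, 1% of positions change.
rate = 0.01 / (1.0 - sum(f * f for f in q))
m1 = [[rate * q[a] if a != b else 1.0 - rate * (1.0 - q[b])
       for a in range(n)] for b in range(n)]


def scores_at(t):
    """Log-odds scores at distance t PAM.

    The joint probability is p_{a,b}(t) = q_b * p(a|b,t), so
    s(a,b) = log(p_{a,b}(t) / (q_a q_b)) = log(p(a|b,t) / q_a).
    """
    mt = mat_pow(m1, t)
    return [[math.log(mt[b][a] / q[a]) for a in range(n)] for b in range(n)]


s250 = scores_at(250)
for row in s250:
    print([round(v, 3) for v in row])
```

Because the toy matrix satisfies q_b p(a|b) = q_a p(b|a), the resulting score matrix is symmetric, matching the assumption p_{a,b} = p_{b,a}. Matrix multiplication also accounts for multiple mutations at one position, which is why p_{a,b}(t) is not linear in t.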
Substitution matrices
Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
Basis: the BLOCKS database, which contains gap-free regions of multiple
alignments.
• Cluster sequences whose percentage identity reaches or exceeds a threshold L.
• Estimate p_{a,b}(t) directly from the clustered blocks.
Default values: L = 62 (BLOSUM62), L = 50 (BLOSUM50)
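The clustering step can be sketched as follows. This is a simplified illustration, not the Henikoffs' program: the block is made up, the function names are hypothetical, clustering is done by simple single linkage, and each cluster is weighted as if it were one sequence by dividing pair counts by the cluster sizes.

```python
from collections import Counter
from itertools import combinations


def percent_identity(s1, s2):
    """Percentage of identical positions in two gap-free rows of equal length."""
    matches = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * matches / len(s1)


def cluster_block(block, threshold):
    """Single-linkage clustering: rows with identity >= threshold are merged."""
    cluster_of = list(range(len(block)))  # every row starts in its own cluster
    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) >= threshold:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if c == old else c for c in cluster_of]
    return cluster_of


def weighted_pair_counts(block, threshold):
    """Count residue pairs between clusters, each cluster weighted as one sequence."""
    cluster_of = cluster_block(block, threshold)
    size = Counter(cluster_of)
    counts = Counter()
    for i, j in combinations(range(len(block)), 2):
        if cluster_of[i] == cluster_of[j]:
            continue  # pairs inside one cluster are not counted
        weight = 1.0 / (size[cluster_of[i]] * size[cluster_of[j]])
        for a, b in zip(block[i], block[j]):
            counts[tuple(sorted((a, b)))] += weight
    return counts


# Hypothetical gap-free block of three sequences, made up for illustration.
block = ["AVLKE", "AVLKD", "SILRE"]
counts = weighted_pair_counts(block, threshold=62)
print(cluster_block(block, 62))
print(dict(counts))
```

With L = 62 the first two rows (80% identical) fall into one cluster, so their pairs against the third row each count with weight 1/2. Raising L leaves more sequences unclustered and yields a matrix suited to more closely related sequences.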