Pairwise Sequence Alignment

How can sequence similarity be measured?
• one way is the score of the alignment, but it depends on the scoring system
• another way is by measuring sequence identity
  o sequence identity is measured as the number of matches in the aligned sequences, appropriately normalized
  o for the following global alignment, the number of matches is 3 (Q, A, and B)

        Q A C – D B D
        Q A W X – B –

  o sequence identity (si) is here

        si = (# of matches when S1 and S2 are aligned) / max{length(S1), length(S2)} = 3/6 = 0.5 (or 50%)

  o however, depending on the problem, sequence identity may be defined differently; for example, for local alignments the number of matches can be divided by the alignment length (after filtering out very short alignments)
• probably the best way is by calculating the statistical significance of the match (for example, the probability that a score can be obtained using random sequences with the same first-order statistics)

What is an optimal alignment? What is a correct alignment?
• an optimal alignment is obtained using dynamic programming, and is optimal only for a particular scoring system (scoring matrix and gap penalties)
• a correct alignment is the alignment where residues are structurally superimposed; if the function of a protein is conserved, then its structure should be conserved too; thus structural alignment gives us the correct alignment
• correct alignments help us find the appropriate scoring matrix and gap penalties
• however, what is the correct alignment if a protein doesn't have a fixed 3-D conformation? This is an open problem in bioinformatics.

Local Alignment
• in many biological applications local similarity is far more important than global similarity, because many proteins are made of different domains or motifs

To be able to discuss local alignment, we define the alignment score between two empty strings as 0 (this is very important!)
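As a brief aside, the sequence identity measure defined earlier in this section can be sketched in Python. This is only an illustration: the function name sequence_identity and the use of "-" as the gap character are assumptions, and the second row of the example is written with a trailing gap (also an assumption) so that both rows have seven columns.

```python
def sequence_identity(row1, row2, gap="-"):
    """Identity of a global alignment given as two equal-length gapped rows.

    Normalizes the number of matching columns by the length of the longer
    ungapped sequence, as in the si formula of these notes.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same number of columns")
    # A column is a match when both rows hold the same residue (not a gap).
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    # Sequence lengths are the row lengths with the gap characters removed.
    len1 = len(row1) - row1.count(gap)
    len2 = len(row2) - row2.count(gap)
    longest = max(len1, len2)
    return matches / longest if longest else 0.0


# The example alignment from the notes: 3 matches, longer sequence has 6 residues.
print(sequence_identity("QAC-DBD", "QAWX-B-"))  # 0.5
```

For local alignments, the denominator could instead be the number of alignment columns, as noted above; that variant is not shown here.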
Definition: Given a pair of indices i ≤ m and j ≤ n, the local suffix alignment problem is to find a (possibly empty) suffix α of S1[1..i] and a (possibly empty) suffix β of S2[1..j] such that V(α, β) is the maximum score over all pairs of suffixes of S1[1..i] and S2[1..j]. We use v(i, j) to denote the optimal local suffix alignment score for given i, j.

The score of the highest scoring local alignment, v*, is calculated as

        v* = max{v(i, j) : i ≤ m, j ≤ n}

(we'll skip the proof)

What is the recurrence for the local alignment?

Base conditions: v(i, 0) = 0 and v(0, j) = 0 for all i, j

The recurrence relation:

        v(i, j) = max{0,
                      v(i – 1, j) + score(S1(i), –),
                      v(i, j – 1) + score(–, S2(j)),
                      v(i – 1, j – 1) + score(S1(i), S2(j))}

Here score(S1(i), –) and score(–, S2(j)) are the gap scores, which are negative (for example, –3 in the exercise below). This can also be implemented recursively!

A condition on the scoring system in order for the local alignment to make sense: the expected score of aligning two random residues must be negative. Also, there must be at least one positive entry in the scoring matrix.

The time and space complexity of the local alignment is O(mn).

Exercise: calculate the highest scoring local alignment between TGTACG and AACGT, if the scoring matrix is:

              A    C    G    T
         A    5   –4   –4   –4
         C   –4    5   –4   –4
         G   –4   –4    5   –4
         T   –4   –4   –4    5

and gap penalty = –3.

Table of v(i, j) values to fill in:

        v(i, j)   0   T   G   T   A   C   G
           0
           A
           A
           C
           G
           T

Heuristic Techniques

FASTA and BLAST
• FASTA stands for Fast-All; it was proposed by Pearson and Lipman in 1988
• BLAST stands for Basic Local Alignment Search Tool; it was proposed by Altschul et al. in 1990
• the motivation is to reduce the quadratic complexity of optimal alignments to expected sub-quadratic time (the worst case can still be O(mn), but it rarely happens for typical biological sequences)
• the algorithm is based on the observation that good alignments almost always have runs of matches (identities)
• we can then look for those places first and use them to expand the alignment

An example of a dot plot: (figure not reproduced here)

• FASTA and BLAST conceptually make a dot plot and then explore all di-peptide or tri-peptide matches
• in the case of nucleotide sequences, BLAST looks for 11-nucleotide stretches
• after the short peptide matches are found, both algorithms try to extend the matches for as long as the resulting alignment is good (this is called "hit-extension")
• a strength of FASTA and BLAST is that they report the statistical significance of the alignments they output (statistical significance is essentially the probability of obtaining such a score from random sequences)
• FASTA and BLAST are similarly good, but BLAST seems to be dominant

Sources:
Fundamental Concepts of Bioinformatics by Dan E. Krane and Michael L. Raymer
Algorithms on Strings, Trees, and Sequences by Dan Gusfield
Comparative Sequence Analysis: Finding Genes by Steven Henikoff (appeared in D. W. Smith, Biocomputing: Informatics and Genome Projects)
Biological Sequence Analysis by Richard Durbin, Sean R. Eddy, Anders Krogh and Graeme Mitchison
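Worked sketch for the Local Alignment section: the recurrence for v(i, j) can be filled in bottom-up and the best cell traced back to recover the alignment. The Python below is a minimal illustration, not a reference implementation. The function name local_align and the parameters match, mismatch, and gap are assumptions; a single match score and a single mismatch score stand in for a full scoring matrix, which is enough for the exercise matrix (5 on the diagonal, –4 elsewhere, gap –3).

```python
def local_align(s1, s2, match=5, mismatch=-4, gap=-3):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    m, n = len(s1), len(s2)
    # v[i][j] is the best score of a local alignment ending at s1[i-1], s2[j-1].
    # Row 0 and column 0 hold the base conditions v(i, 0) = v(0, j) = 0.
    v = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j] + gap,       # s1[i-1] against a gap
                          v[i][j - 1] + gap,       # gap against s2[j-1]
                          v[i - 1][j - 1] + sub)   # s1[i-1] against s2[j-1]
            if v[i][j] > best:
                best, best_i, best_j = v[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    a1, a2 = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and v[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + sub:
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# The exercise from the notes.
print(local_align("TGTACG", "AACGT"))  # (15, 'ACG', 'ACG')
```

The table uses O(mn) time and space, matching the complexity stated in the notes. When several alignments share the best score, this sketch reports only the first one it encounters.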
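Worked sketch for the FASTA and BLAST section: the first step described there, finding short exact matches and noting where they fall in the dot plot, can be illustrated with a word index. This is only a sketch of the seeding idea under simplifying assumptions; it is not how FASTA or BLAST are actually implemented, and the function name word_hits and the parameter k are hypothetical. Hits are grouped by diagonal (i – j), because runs of hits on one diagonal are the candidates for extension.

```python
from collections import defaultdict


def word_hits(query, subject, k=2):
    """Return {diagonal: [(i, j), ...]} for every exact k-letter word shared
    by both sequences, where i and j are 0-based start positions."""
    # Index every k-letter word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject and record each shared word on its diagonal i - j.
    hits = defaultdict(list)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[i - j].append((i, j))
    return dict(hits)


# Two-letter words shared by the exercise sequences.
print(word_hits("TGTACG", "AACGT", k=2))
# {2: [(3, 1), (4, 2)], -2: [(1, 3)]}
```

In this example the two hits on diagonal 2 (AC and CG) overlap and together cover ACG, the kind of run of matches that the hit-extension step would then try to extend.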