Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computational Biology, Part 8 Sequence Comparison with Dynamic Programming Robert F. Murphy Copyright 1996, 1999-2001. All rights reserved. Dynamic programming algorithms for sequence comparison Similar in concept to dot matrix Introduced for biological sequences by S. B. Needleman & C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453 (1970) Steps of basic dynamic programming method 1. Initialize matrix to match scores (0 or 1) 2. Do summation operation Finds the maximum number of matches that can be obtained starting at any position and proceeding "forward" 3. Traceback to find maximum match alignment Summation operation 1. Start in lower right corner 2. Move up one position and left one position 3. Find largest value in either (a) row segment starting one below current position and extending to the right or (b) column segment starting one to the right of current position and extending down Summation operation (cont.) 4. Add this value to the value in the current cell 5. Repeat steps 3 and 4 for all cells to the left in current row and all cells above in current column 6. If we are not in the top left corner, go to step 2 Illustration of Simple Dynamic Programming Method (Demonstration A7) Use of dynamic programming to evaluate homology between pairs of sequences If we just want to know maximum match possible between two sequences, don't need to do traceback but can just look at the highest value in the first row or column ("match score"). This represents the best possible alignment score. Extensions to basic dynamic programming method use similarity function in initialization step allow gaps use gap penalties Illustration with a Similarity Matrix (Demonstration A8) Gap penalty alternatives constant gap penalty for gap > 1 gap penalty proportional to gap size one penalty for starting a gap (gap opening penalty) different (lower) penalty for adding to a gap (gap extension penalty) Gap penalty alternatives (cont.) gap penalty proportional to gap size and sequence for nucleic acids, can be used to mimic thermodynamics of helix formation two kinds of gap opening penalties one for gap closed by AT, different for GC different gap extension penalty End gaps Some programs treat end gaps as normal gaps and apply penalties, other programs do not apply penalties for end gaps End gaps (cont.) Can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if match score changes Penalties for end gaps appropriate for aligned sequences where ends "should match" Penalties for end gaps inappropriate when surrounding sequences are expected to be different (e.g., conserved exon surrounded by varying introns Global vs. local similarity Should result of alignment include all amino acids or proteins or just those that "match"? If yes, a global alignment is desired In a global alignment, presence of mismatched elements is neutral - doesn't affect overall match score Global vs. local similarity Should result of alignment include all amino acids or proteins or just those that "match"? If no, a local alignment is desired Local alignments accomplished by including negative scores for "mismatched" positions, thus scores get worse as we move away from region of match Instead of starting traceback with highest value in first row or column, start with highest value in entire matrix, stop when score hits zero Reading for next class B & O, Chapter 7 just pp. 155-170 Additional optional reading: Sequence Analysis Primer, pp. 138-141 "Hashing and Neighborhood Algorithms" and "Statistical Approaches - BLAST” (on web site as Reading 1) Altschul et al. paper (on web site as Reading 2)