Chapter 3
Computational Molecular Biology
Michael Smith
[email protected]

Sequence Comparison
Sequence comparison is the most important operation in computational biology.
It consists of finding which parts of the sequences are alike and which parts differ.

Similarity and Alignment
Similarity: a measure of how similar the sequences are.
Alignment: a way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings.

Sequence Comparison
We want the best alignment between two or more sequences.
Global comparison: alignment involving entire sequences.
Local comparison: alignment involving substrings.
Semi-global comparison: alignment of prefixes and suffixes of the sequences.
All can be solved by dynamic programming.

Global Comparison
Consider the following DNA sequences:
    GACGGATTAG
    GATCGGAATAG
Are they similar? After alignment, the similarities are more obvious:
    GA-CGGATTAG
    GATCGGAATAG

Alignment and Score
Alignment, more precisely: the insertion of spaces at arbitrary locations along the sequences so that they end up with the same length. No column can be composed entirely of spaces.
Score: a measure of similarity. Each column receives +1 for a match, -1 for a mismatch, or -2 for a space. Sum the column values to get the score.
The alignment above has nine matches, one mismatch, and one space, so its score is 9 - 1 - 2 = 6.

Dynamic Programming
Solves an instance of a problem by taking advantage of already computed solutions for smaller instances of the same problem.
It is the main algorithmic approach used in sequence alignment.
See Figures 3.1 and 3.2.

Optimal Alignments
In Figure 3.1, start at (m,n) and follow the arrows back to (0,0). Each arrow gives one column of the alignment:
    horizontal arrow: a space in s matched with t[j]
    vertical arrow: s[i] matched with a space in t
    diagonal arrow: s[i] matched with t[j]
Many optimal alignments are possible, depending on which arrow is given priority.

Local Comparison
A local alignment between s and t is an alignment between a substring of s and a substring of t.
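Before the local variant is developed further, the global procedure described above (column scores, a table filled by dynamic programming, and a traceback from (m,n) to (0,0)) can be sketched in Python. This is a minimal sketch under stated assumptions, not the book's own listing: the names `column_score` and `global_align` are hypothetical, the scores default to the +1, -1, -2 scheme from the slides, and ties in the traceback are broken in the order diagonal, vertical, horizontal, which is only one of the possible priorities.

```python
def column_score(x, y, match=1, mismatch=-1, space=-2):
    """Score one alignment column; '-' stands for a space."""
    if x == "-" or y == "-":
        return space
    return match if x == y else mismatch


def global_align(s, t, match=1, mismatch=-1, space=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    # a[i][j] = best score of aligning the prefix s[:i] with the prefix t[:j]
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space          # s[:i] against spaces only
    for j in range(1, n + 1):
        a[0][j] = j * space          # t[:j] against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = a[i - 1][j - 1] + column_score(s[i - 1], t[j - 1],
                                                  match, mismatch, space)
            a[i][j] = max(diag, a[i - 1][j] + space, a[i][j - 1] + space)

    # Traceback: follow the arrows from (m, n) back to (0, 0).
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + column_score(
                s[i - 1], t[j - 1], match, mismatch, space):
            # diagonal arrow: s[i] matched with t[j]
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            # vertical arrow: s[i] matched with a space in t
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            # horizontal arrow: a space in s matched with t[j]
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row_s, row_t = global_align("GACGGATTAG", "GATCGGAATAG")
    print(score)   # 6, the score of the alignment shown in the slides
    print(row_s)
    print(row_t)
```

Changing the order of the three tests in the traceback loop changes which optimal alignment is reported, but not its score.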
Goal of local comparison: find the highest-scoring local alignment between two sequences.
A variation of the basic algorithm (Figure 3.2) is used: each entry (i,j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55).

Semi-Global Comparison
Scores alignments while ignoring some of the end spaces in the sequences.
End spaces are those that appear before the first or after the last character of a sequence.
For example, with end spaces ignored:
    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG--------
If we aligned the sequences in the usual way, we would get:
    CAGCACTTGGATTCTCGG
    CAGC-----G-T----GG

Extensions to the Basic Algorithm
The basic algorithm runs in O(mn) time and uses O(mn) space.
It is possible to reduce the space requirement from quadratic to linear at the expense of roughly doubling the processing time.
This is accomplished with a divide-and-conquer strategy: divide the problem into smaller subproblems and later combine their solutions to obtain a solution for the whole problem.

Gap Penalty Functions
A gap is a run of consecutive spaces.
When mutations occur, one block of several spaces is more likely than the same number of isolated spaces.
The column-by-column scoring method discussed previously is not appropriate in this case.
For example:
    A------ATTCCTTCCTTCC
    AAAGAGAATTCCTTCCTTCC
Scoring is done at the block level, not the column level:
    A   ------   ATTCCTTCCTTCC
    A   AAGAGA   ATTCCTTCCTTCC

Multiple Sequences
Multiple sequence alignment is a generalization of the two-sequence case.
A multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same length.
No column is made entirely of spaces.
See Figure 3.10.

Scoring Multiple Sequences
We need a function that takes amino acid sequences as input and returns a score.
The function must have two properties:
    It must be independent of the order of its arguments.
    For example, if a column has I, V, -, the same score should be produced when the order is -, V, I.
    It should reward the presence of many equal residues and penalize unequal residues and spaces.

Sum-of-Pairs (SP)
The sum-of-pairs (SP) function satisfies both properties.
It is the sum of the pairwise scores of all pairs of symbols in a column:
    SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)
where p(a,b) is the pairwise score of a and b.

Algorithm Paradigm
Dynamic programming is used again.
The basic algorithm can be used, but there will be problems:
    In the two-sequence case, the complexity is O(n^2).
    In the k-sequence case, the complexity is O(n^k).
    This can take a very long time if k is large.
We must reduce the number of cells to compute, by applying a heuristic.

Star Alignments
Build a multiple alignment from pairwise alignments between a fixed sequence and all the others.
The fixed sequence is the center of the star.
Example:
    a = ATTGCCATT
    b = ATGGCCATT
    c = ATCCAATTTT
    d = ATCTTCTT
    e = ACTGACC
Select a as the center of the star, then align a with b, a with c, a with d, and a with e:
    ATTGCCATT
    ATGGCCATT

    ATTGCCATT--
    ATC-CAATTTT

    ATTGCCATT
    ATCTTC-TT

    ATTGCCATT
    ACTGACC--
Combine the results:
    ATTGCCATT--
    ATGGCCATT--
    ATC-CAATTTT
    ATCTTC-TT--
    ACTGACC----

Database Search
Databases exist for searching and comparing protein and DNA sequences.
The methods described so far work, but may take too long and be impractical for searching large databases.
Novel, faster methods have been developed.

PAM Matrix
When scoring protein sequences, the +1, -1, -2 scheme may not be sufficient.
Amino acids have properties that influence the likelihood that one will be substituted for another in an evolutionary scenario.
PAM stands for Point Accepted Mutations.
A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart.
A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart.
PAM matrices are Markovian in nature. Building one requires:
    the probability of occurrence of each amino acid
    a probability transition matrix
    a score matrix

BLAST
The most frequently used programs for searching sequence databases.
Acronym for Basic Local Alignment Search Tool.
Returns a list of high-scoring segment pairs between the query sequence and sequences in the database.
http://www.ncbi.nlm.nih.gov

FAST
Another family of programs for sequence database search.
http://www.rcsb.org/pdb/index.html
BLAST and FAST use PAM matrices.
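The Markovian construction behind PAM matrices can be illustrated with a small Python sketch. It rests on several assumptions that are not in the slides: a made-up two-letter alphabet "XY" instead of the 20 amino acids, invented frequencies and 1-step transition probabilities, and a log-odds score of the form 10 * log10(M^k[a][b] / p[b]), which is one common convention. The function names are hypothetical. The only point is to show how a k-PAM transition matrix arises as the k-th power of the 1-PAM transition matrix and how a score matrix is then derived from it.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, k):
    """Raise a square matrix to the k-th power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def pam_scores(m1, freqs, k):
    """Log-odds score matrix for k units of evolution (assumed convention)."""
    mk = mat_pow(m1, k)              # k-step transition probabilities
    n = len(m1)
    return [[10 * math.log10(mk[a][b] / freqs[b]) for b in range(n)]
            for a in range(n)]


# Hypothetical toy data, for illustration only (not real amino acid values).
ALPHABET = "XY"
FREQS = [2 / 3, 1 / 3]               # probability of occurrence of X and Y
M1 = [[0.9925, 0.0075],              # 1-step transitions; each row sums to 1
      [0.0150, 0.9850]]              # about 1% of positions change per step

if __name__ == "__main__":
    for k in (1, 250):
        print(k, pam_scores(M1, FREQS, k))
```

With these toy numbers, identities score positive and substitutions negative at k = 1, and all scores shrink toward zero as k grows, because after many steps the starting letter says little about the current one.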