Sequence Analysis - Missouri State University
Sequence Analysis
CSC 487/687 Introduction to Computing for Bioinformatics

Aligning Sequences
  Sequences
    Represent protein or nucleic acid (DNA/RNA) molecules
    Order of amino acids (for proteins) or nucleotides (for DNA/RNA) along one chain
  Sequence alignment
    The identification of residue-residue correspondences
    Any assignment of correspondences that preserves the order of residues within the sequences

Evolutionary Basis of Sequence Alignment
  Identity: quantity that describes how much two sequences are alike in the strictest terms.
  Similarity: quantity that relates how much two amino acid sequences are alike.
  Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.
  Homologous sequences
    Related by evolution (common ancestors)
  Alignment of homologous sequences
    Identifies the relationship between the sequence elements
    Matches up characters that come from the same character in the ancestor

Alignment and Evolution
  Assume we know the evolutionary history relating q and d.
  The true alignment can be found using h as a template:
    h : GLVS T
    q': GLISVT
    d': GIV--T
  Given an alignment, several different evolutionary histories may be (equally) plausible.
  Example alignment:
    q': GLISVT
    d': G-I-VT
  One possible history: the ancestor is H*: GLIVT. On one branch S is inserted (->S), giving q: GLISVT. On the other branch L is deleted (L->), giving d: GIVT.

Global and Local Alignment
  Global
    Assumes that the complete sequences are the result of evolution from the same ancestor sequence
  Local
    Aligns segments of the sequences so that the segments are evolutionarily related
  (Figures: an ancestor sequence and its descendants S1 and S2)

Pairwise Sequence Alignments vs. Multiple Sequence Alignments
  Pairwise sequence alignment: two sequences
  Multiple sequence alignment: a mutual alignment of more than two sequences

The dotplot
  Captures not only the overall similarity of two sequences, but also the complete set and relative quality of different possible alignments
  Diagonal: two residues are aligned (a match or a substitution)
  Horizontal: a gap is introduced in the sequence indexing the rows
  Vertical: a gap is introduced in the sequence indexing the columns

Dotplots and alignments
  A path through the dotplot can be read as an edit script. Each move performs an operation: a substitution, an insertion or a deletion.
  When the end of the path is reached, the accumulated operations have changed one sequence into the other.
  Several different sequences of edit operations may convert one string to the other in the same number of steps.
  A sequence of edit operations derived from an optimal alignment may correspond to an actual evolutionary pathway, but it is impossible to prove that it does.
  The larger the edit distance, the larger the number of reasonable evolutionary pathways between two sequences.
  Dotplots between pairs of proteins with increasingly distant relationships: the sulphydryl proteinase papain from papaya compared with four homologues. These are the close relative kiwi fruit actinidin and the more distant relatives human procathepsin L, human cathepsin B, and a homologue from Staphylococcus aureus.

Example
  (Four figure slides)

Measures of sequence similarity
  Hamming distance: the number of positions with mismatching characters (defined for two sequences of equal length).
  Edit distance: the minimum number of "edit operations" required to change one string into the other.

What is an Alignment?
  A global alignment of two sequences A and B
    contains all characters of A and B in the same order
    one symbol from A can be aligned with one symbol from B
    a symbol can be aligned with a blank, written as '-'
    two blanks cannot be aligned
    every symbol from A and from B must be aligned
  Example: A: INVEST, B: INTEREST
    IN--VEST    INV--EST    IN-V--EST
    INTEREST    INTEREST    IN-TEREST

Computing Alignments
  There exist a large number of alignments for a pair of sequences.
  To use a computer to do the alignment in a meaningful way, we need
    Scoring scheme: a mathematical way to calculate the goodness of candidate alignments
    Search method: an algorithm able to identify high-scoring alignments

Choosing Scoring Scheme
  The scoring scheme should be
    Simple, to allow efficient calculation and search for the best alignment
    Biologically meaningful (giving good scores to biologically good alignments)

Simple Scoring Scheme
  Assign a score to each column in the alignment.
  Columns are of the following sorts:
    two aligned characters a and b, scored R(a, b)
    a character aligned with a blank, scored -g (g is the gap penalty)
  Alignment score: sum of the scores over all columns
  R: matrix giving the score for all possible character pairs (e.g., all pairs of amino acid symbols)

Alignment Score - Example
  R is the identity matrix: identical characters score 1, unequal characters score 0, g = 1

  ALIGN1:  V  -  E  I  T  G  E  I  S  T
           P  R  E  -  T  E  R  I  -  T
           0 -1  1 -1  1  0  0  1 -1  1     Score: 1

  ALIGN2:  V  E  I  T  G  E  I  S  T
           P  R  E  T  -  E  R  I  T
           0  0  0  1 -1  1  0  0  1        Score: 2

Finding the Minimum Scoring Alignment
  The number of possible alignments is large: we cannot generate all of them and score each one to find the best.
  Task: align A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n

Independence Between Sub-alignments
  Observations:
    The score of the alignment up to and including character i from A and character j from B is independent of how the rest of the sequences are aligned.
    The best solution to (i, j) can be "locked" and its score recorded in D_{i,j}.
    D_{m,n} is the score of the best global alignment.
  The problem is therefore amenable to dynamic programming.

Dynamic programming algorithm
  Individual edit operations include:
    Substitution of b_j for a_i, represented (a_i, b_j)
    Deletion of a_i from sequence A, represented (a_i, -)
    Deletion of b_j from sequence B, represented (-, b_j)
  A cost function d is defined on edit operations:
    d(a_i, b_j) = cost of a mutation in an alignment in which position i of sequence A corresponds to position j of sequence B
    d(a_i, -) or d(-, b_j) = cost of a deletion or insertion
  The minimum weighted distance between sequences A and B is
    D(A, B) = \min \sum d(x, y)
  where the sum runs over the edit operations (x, y) in one edit script and the minimum is taken over all edit scripts that convert one sequence into the other.

Three Alternative Alignment Ends
  The alignment between a_1 a_2 ... a_i and b_1 b_2 ... b_j ends in one of three ways:
    a_1..a_{i-1} is aligned with b_1..b_j, and the last column is (a_i, -)
    a_1..a_{i-1} is aligned with b_1..b_{j-1}, and the last column is (a_i, b_j)
    a_1..a_i is aligned with b_1..b_{j-1}, and the last column is (-, b_j)
  To calculate D_{i,j} we pick the one that gives the lowest cost.

Recurrence Relation
  Assume that D_{i-1,j}, D_{i-1,j-1} and D_{i,j-1} have been calculated already. Then
    D_{i,j} = \min \{ D_{i-1,j} + d(a_i, -),\; D_{i-1,j-1} + d(a_i, b_j),\; D_{i,j-1} + d(-, b_j) \}

Basis of Recursion
  Aligning the empty string to a string of length i (resp. j) can be done by aligning to i (resp. j) blanks:
    D_{0,0} = 0
    D_{0,j} = \sum_{k=1}^{j} d(-, b_k)
    D_{i,0} = \sum_{k=1}^{i} d(a_k, -)

Calculating Score of Best Alignment Using Matrix
  (Figure: the H matrix and the cost of the best alignment)

Time Complexity
  Sequences of lengths n and m: O(nm)
  Two sequences of length l: O(l^2)
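
Worked sketch of the dynamic programming algorithm
  The recurrence relation and basis of recursion above can be turned into a short program. What follows is a minimal sketch, not the course's reference implementation. The function name global_alignment_cost and the parameters sub_cost and gap_cost are assumptions introduced for illustration. The cost function d is simplified to unit costs: 0 for a match, sub_cost for a mismatch, and gap_cost for any insertion or deletion.

```python
def global_alignment_cost(a, b, sub_cost=1, gap_cost=1):
    """Return D[m][n], the minimum cost of a global alignment of a and b.

    Assumed cost function (illustrative, not taken from the slides):
    d(x, y) = 0 if x == y else sub_cost, and d(x, -) = d(-, y) = gap_cost.
    """
    m, n = len(a), len(b)
    # D[i][j] holds the cost of the best alignment of a[:i] with b[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]

    # Basis of recursion: align a prefix against blanks only.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap_cost

    # Recurrence: take the cheapest of the three alternative alignment ends.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d_sub = 0 if a[i - 1] == b[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + gap_cost,   # last column is (a_i, -)
                D[i - 1][j - 1] + d_sub,  # last column is (a_i, b_j)
                D[i][j - 1] + gap_cost,   # last column is (-, b_j)
            )
    return D[m][n]


if __name__ == "__main__":
    print(global_alignment_cost("INVEST", "INTEREST"))  # prints 3
```

  With unit costs the result is the edit distance, so INVEST and INTEREST are 3 operations apart: two insertions and one substitution, as in the alignment IN--VEST / INTEREST. The two nested loops fill an (m+1) by (n+1) table with constant work per cell, which is where the O(nm) time bound comes from.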