Download Sequence Alignment - UTK-EECS

Some Independent Study on Sequence Alignment — Lan Lin prepared for theory group meeting on July 16, 2003 Biological Background (1)  Genetic information is stored in DNA – used to make identical copies – transferred from DNA to RNA to protein  DNA is a linear polymer of 4 nucleotides – (A, T, G, C)  RNA is a similar polymer – (A, U, G, C)  Both can pair one with another — “double helix” – pairing being sequence specific (G-C, A-T/U) – templating resulting in DNA replication and RNA copy of a DNA sequence Biological Background (2)  Proteins are variable length linear, mixed polymers of 20 different amino acids – peptides and polypeptides for amino acid polymers – functional property largely determined by the amino acid sequence  RNA protein by translation of a code consisting of 3 nucleotides into 1 amino acid – one amino acid encoded by 1 ~ 6 different triplet codes – 3 stop codons specifying “end of peptide sequence” – 3 reading frames for a DNA sequence, 6 for one with its (inferred) complementary strand Sequence Analysis (1)  Some difficulties – Where the code for a protein starts and stops? – DNA frequently scattered in separate “exons”, not continuous – RNAs up- and down-stream of the coding region, non-coding regions can be quite large; not all RNAs encode proteins  Inferring structure and function from a protein sequence is even harder! – 3 levels of protein structure  primary structure — sequence of amino acids in the protein  secondary structure — polypeptide chains folding into regular structures (i.e., alpha helix or beta sheet)  tertiary structure — 3D structure of protein determining biological function – homology-based approach used to determine the tertiary structure by primary sequence analysis of related proteins Sequence Analysis (2)  What can be done? – Identification of protein primary sequence from DNA sequence – searching for DBs for similar sequences  DNA sequences: DDBJ, EMBL, GenBank – for rapid search for a query sequence: BLAST and FASTA  protein sequences: SwissProt, PIR – calculation of sequence alignments for evolutionary inferences and to aid in structural and functional analysis Pairwise Sequence Alignment  Two quantitative measures – similarity (the larger the better) – distance (the smaller the better)  Edit operations by introducing a gap character “-” – match, replacement, insertion/deletion (“indel”)  The unit cost model – The cost of an alignment of two sequences s and t is the sum of the cost of all the edit operations that lead from s to t. – An optimal alignment is one with the minimum cost. – The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function w denoted by dw(s, t). Pairwise Alignment via Dynamic Programming (1)   Recursion step dw(0:s:i, 0:t:j ) = min {dw(0:s:(i-1), 0:t:(j-1) ) + w(si, tj), dw(0:s:(i-1), 0:t:j ) + w(si,- ), dw(0:s:i, 0:t:(j-1) ) + w(-, tj)} for i, j  1 Base dw(0:s:0, 0:t:0 ) = 0 dw(0:s:i, 0:t:0 ) = dw(0:s:(i-1), 0:t:0 ) + w(si,- ) for i = 1, …, m dw(0:s:0, 0:t:j ) = dw(0:s:0, 0:t:(j-1) ) + w(-, tj) for j = 1, …, n Pairwise Alignment via Dynamic Programming (2)     The edit distances of all prefixes define an (m+1) (n+1) distance marix D = (di, j) with di, j = dw(0:s:i, 0:t:j). Pattern of dependencies between matrix elements di-1, j-1 di-1, j di, j-1 di, j The bottom right corner contains the desired result: dmn = dw(0:s:m, 0:t:n) = dw(s, t). A path through the distance matrix indicating how to align – A diagonal line means replacement/match – A vertical line means deletion – A horizontal line means insertion  The most common order of calculation is line by line (each line from left to right), or column by column (each column from top to bottom). On Scoring Functions  Different words all attributing a numeric value to a pair of sequences – “distance” values are never negative; should be minimized – “cost” implies positive values, with the overall cost to be minimized – “weights” and “scores” can be positive or negative; the optimal alignments maximize scores – “similarity” implies large values are good; should be maximized  If relating sequences of different length, length-relative scores make sense. Realistic Gap Models  No-gap alignment – using matches/replacements only in some regions (i.e., sites of protein-protein interaction) – DP algorithm geared to do this by setting costs for indel to infinity (or something close to it)  Block-indel – charging a certain set-up cost for introducing the gap, whereas extending the gap is less expensive – DP algorithm adapted without much effect on its efficiency Variations of Pairwise Alignment (1)  Local alignment (approximate pattern matching) – where s is relatively short with respect to t and we seek that subunit of t which s aligns best with: Given 0:s:m and 0:t:n, find i:t:j such that dw(s, i:t:j ) is minimal among all choices of 0 i j n.  Local alignment recursion – no cost for deletion of a prefix 0:t:i, – no cost for deletion of a suffix j:t:n, – dmn gives the cost of the optimal local alignment, i:t:j is found by: j = min {k|dm,k = dm, n} i is the point where the optimal path leading to dm, j starts from the 1st row Variations of Pairwise Alignment (2)  Local similarity – asking for those subunits of s and t that exhibit most similarity – using a similarity rather than a distance measure w(a, b) > 0, if a, b are similar, w(a, b) < 0, if a, b are not similar w(a, -) < 0, and w(-, b) < 0, in particular – score 0 as a cut-off value between subsequences with/without similarity   long stretches of dissimilarity shown as regions of zeroes in the matrix stretches of local similarity rising as islands of positive values Heuristic Methods  Edit distance calculation complexity – for input sequences 0:s:m,and 0:t:n, DP calculates m n matrix entries; time complexity is O(m n) – to only get the edit distance, only one column (or one row) of the matrix needs to be stored; space complexity is O(m) or O(n) – to retrace optimal path, the whole matrix needs to be stored; space complexity is also O(m n)  Heuristic methods approximate optimal alignment in a time complexity close to O(m+n) – trading speed for precision Multiple Alignment    Helpful for protein structure prediction and evolutionary history inference A multiple alignment of k sequences is a rectangular array of k rows which resemble the corresponding sequences when ignoring the gap character, with each column containing at lease one character different from “-”. Two ways to formulate a cost/weight function – colomuns-first – pairs-first  based on SP-cost (“sum-of-pairs”) An optimal multiple alignment is one with minimum overall cost, or maximal overall similarity. MSA by Standard DP and Heuristics  DP matrix DP hyperlattice – taking time in O(2ki=1,…,k  |si|) and space in O(i=1,…,k  |si|) – NP-hard with regard to the number of sequences with the SP measure  Alignment along a phylogenetic tree – tree generation through all optimal pairwise alignments – most similar pairs aligned first before aligning alignments – not necessarily optimal due to error accumulation – “sequences” “sum-of-pairs” “profiles” “scoring along a tree” More Interesting Topics  Phylogenetic tree  Genetic algorithms and protein folding  RNA secondary structure prediction  Protein structures  Finding instances of known/unknown sites  etc, … References  Online Lectures on Bioinformatics http://lectures.molgen.mpg.de/online_lectures.html  Biocomputing Hypertext Coursebook http://www.techfak.unibielefeld.de/bcd/Curric/welcome.html  Lecture Notes on Biological Sequence Analysis http://www.cs.uml.edu/bioinformatics/resources/Lectu res/tompa00lecture.pdf

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Sequence Alignment - UTK-EECS