* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Sequence Alignment - UTK-EECS
Bottromycin wikipedia , lookup
List of types of proteins wikipedia , lookup
Western blot wikipedia , lookup
Community fingerprinting wikipedia , lookup
Protein (nutrient) wikipedia , lookup
Protein moonlighting wikipedia , lookup
Expanded genetic code wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Non-coding RNA wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Protein adsorption wikipedia , lookup
Biochemistry wikipedia , lookup
Proteolysis wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Gene expression wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Biosynthesis wikipedia , lookup
Non-coding DNA wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Molecular evolution wikipedia , lookup
Genetic code wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Point mutation wikipedia , lookup
Protein structure prediction wikipedia , lookup
Some Independent Study on Sequence Alignment — Lan Lin prepared for theory group meeting on July 16, 2003 Biological Background (1) Genetic information is stored in DNA – used to make identical copies – transferred from DNA to RNA to protein DNA is a linear polymer of 4 nucleotides – (A, T, G, C) RNA is a similar polymer – (A, U, G, C) Both can pair one with another — “double helix” – pairing being sequence specific (G-C, A-T/U) – templating resulting in DNA replication and RNA copy of a DNA sequence Biological Background (2) Proteins are variable length linear, mixed polymers of 20 different amino acids – peptides and polypeptides for amino acid polymers – functional property largely determined by the amino acid sequence RNA protein by translation of a code consisting of 3 nucleotides into 1 amino acid – one amino acid encoded by 1 ~ 6 different triplet codes – 3 stop codons specifying “end of peptide sequence” – 3 reading frames for a DNA sequence, 6 for one with its (inferred) complementary strand Sequence Analysis (1) Some difficulties – Where the code for a protein starts and stops? – DNA frequently scattered in separate “exons”, not continuous – RNAs up- and down-stream of the coding region, non-coding regions can be quite large; not all RNAs encode proteins Inferring structure and function from a protein sequence is even harder! – 3 levels of protein structure primary structure — sequence of amino acids in the protein secondary structure — polypeptide chains folding into regular structures (i.e., alpha helix or beta sheet) tertiary structure — 3D structure of protein determining biological function – homology-based approach used to determine the tertiary structure by primary sequence analysis of related proteins Sequence Analysis (2) What can be done? – Identification of protein primary sequence from DNA sequence – searching for DBs for similar sequences DNA sequences: DDBJ, EMBL, GenBank – for rapid search for a query sequence: BLAST and FASTA protein sequences: SwissProt, PIR – calculation of sequence alignments for evolutionary inferences and to aid in structural and functional analysis Pairwise Sequence Alignment Two quantitative measures – similarity (the larger the better) – distance (the smaller the better) Edit operations by introducing a gap character “-” – match, replacement, insertion/deletion (“indel”) The unit cost model – The cost of an alignment of two sequences s and t is the sum of the cost of all the edit operations that lead from s to t. – An optimal alignment is one with the minimum cost. – The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function w denoted by dw(s, t). Pairwise Alignment via Dynamic Programming (1) Recursion step dw(0:s:i, 0:t:j ) = min {dw(0:s:(i-1), 0:t:(j-1) ) + w(si, tj), dw(0:s:(i-1), 0:t:j ) + w(si,- ), dw(0:s:i, 0:t:(j-1) ) + w(-, tj)} for i, j 1 Base dw(0:s:0, 0:t:0 ) = 0 dw(0:s:i, 0:t:0 ) = dw(0:s:(i-1), 0:t:0 ) + w(si,- ) for i = 1, …, m dw(0:s:0, 0:t:j ) = dw(0:s:0, 0:t:(j-1) ) + w(-, tj) for j = 1, …, n Pairwise Alignment via Dynamic Programming (2) The edit distances of all prefixes define an (m+1) (n+1) distance marix D = (di, j) with di, j = dw(0:s:i, 0:t:j). Pattern of dependencies between matrix elements di-1, j-1 di-1, j di, j-1 di, j The bottom right corner contains the desired result: dmn = dw(0:s:m, 0:t:n) = dw(s, t). A path through the distance matrix indicating how to align – A diagonal line means replacement/match – A vertical line means deletion – A horizontal line means insertion The most common order of calculation is line by line (each line from left to right), or column by column (each column from top to bottom). On Scoring Functions Different words all attributing a numeric value to a pair of sequences – “distance” values are never negative; should be minimized – “cost” implies positive values, with the overall cost to be minimized – “weights” and “scores” can be positive or negative; the optimal alignments maximize scores – “similarity” implies large values are good; should be maximized If relating sequences of different length, length-relative scores make sense. Realistic Gap Models No-gap alignment – using matches/replacements only in some regions (i.e., sites of protein-protein interaction) – DP algorithm geared to do this by setting costs for indel to infinity (or something close to it) Block-indel – charging a certain set-up cost for introducing the gap, whereas extending the gap is less expensive – DP algorithm adapted without much effect on its efficiency Variations of Pairwise Alignment (1) Local alignment (approximate pattern matching) – where s is relatively short with respect to t and we seek that subunit of t which s aligns best with: Given 0:s:m and 0:t:n, find i:t:j such that dw(s, i:t:j ) is minimal among all choices of 0 i j n. Local alignment recursion – no cost for deletion of a prefix 0:t:i, – no cost for deletion of a suffix j:t:n, – dmn gives the cost of the optimal local alignment, i:t:j is found by: j = min {k|dm,k = dm, n} i is the point where the optimal path leading to dm, j starts from the 1st row Variations of Pairwise Alignment (2) Local similarity – asking for those subunits of s and t that exhibit most similarity – using a similarity rather than a distance measure w(a, b) > 0, if a, b are similar, w(a, b) < 0, if a, b are not similar w(a, -) < 0, and w(-, b) < 0, in particular – score 0 as a cut-off value between subsequences with/without similarity long stretches of dissimilarity shown as regions of zeroes in the matrix stretches of local similarity rising as islands of positive values Heuristic Methods Edit distance calculation complexity – for input sequences 0:s:m,and 0:t:n, DP calculates m n matrix entries; time complexity is O(m n) – to only get the edit distance, only one column (or one row) of the matrix needs to be stored; space complexity is O(m) or O(n) – to retrace optimal path, the whole matrix needs to be stored; space complexity is also O(m n) Heuristic methods approximate optimal alignment in a time complexity close to O(m+n) – trading speed for precision Multiple Alignment Helpful for protein structure prediction and evolutionary history inference A multiple alignment of k sequences is a rectangular array of k rows which resemble the corresponding sequences when ignoring the gap character, with each column containing at lease one character different from “-”. Two ways to formulate a cost/weight function – colomuns-first – pairs-first based on SP-cost (“sum-of-pairs”) An optimal multiple alignment is one with minimum overall cost, or maximal overall similarity. MSA by Standard DP and Heuristics DP matrix DP hyperlattice – taking time in O(2ki=1,…,k |si|) and space in O(i=1,…,k |si|) – NP-hard with regard to the number of sequences with the SP measure Alignment along a phylogenetic tree – tree generation through all optimal pairwise alignments – most similar pairs aligned first before aligning alignments – not necessarily optimal due to error accumulation – “sequences” “sum-of-pairs” “profiles” “scoring along a tree” More Interesting Topics Phylogenetic tree Genetic algorithms and protein folding RNA secondary structure prediction Protein structures Finding instances of known/unknown sites etc, … References Online Lectures on Bioinformatics http://lectures.molgen.mpg.de/online_lectures.html Biocomputing Hypertext Coursebook http://www.techfak.unibielefeld.de/bcd/Curric/welcome.html Lecture Notes on Biological Sequence Analysis http://www.cs.uml.edu/bioinformatics/resources/Lectu res/tompa00lecture.pdf