Application of Algorithm Research to Molecular Biology
R. C. T. Lee
Dept. of Computer Science
National Chi Nan University

• There is one peculiar characteristic of all living organisms: we can reproduce ourselves.
• Yet it is important that what we reproduce is the same as we are.
• That is, wild flowers produce the same kind of wild flowers and birds reproduce the same kind of birds.

• Information about ourselves must be passed to our descendants.
• Question: How is this done?
• Answer: Through DNA.

First of all, we need a language to pass on the information about heredity. This language has existed for 3 billion years; it is the oldest language in the world. Its alphabet consists of 4 letters: A, G, C and T.

We need a mechanism to represent the letters. This is done by using chemical compounds.
A: adenine
G: guanine
C: cytosine
T: thymine

Nature has used DNA to pass the heredity information to our descendants. A DNA strand is a sequence of chemical compounds. From our point of view, a DNA strand is a sequence of A, G, C and T.

• DNA (Deoxyribonucleic Acid) can be viewed as two strands of nucleic acids formed as a double helix.

• Each strand of a DNA is a sequence of A, G, C and T.
• Each A in one strand is paired with a T in the other strand.
• Similarly, G is paired with C.

Human Mitochondrial DNA Control Region
TTCTTTCATGGGGAAGCAAA
AAGAAAGTACCCCTTCGTTT

• DNA exists in cells.
• Each living organism has many different kinds of cells. For instance, human beings have muscle cells, blood cells, neural cells, etc.
• How can different cells perform different functions?

Genes
• In each DNA sequence, there are subsequences which are called genes.
• Each gene corresponds to a distinct protein, and it is the protein which determines the function of the cell.
• For instance, red blood cells must contain the oxygen-carrying protein haemoglobin, and the production of this protein is controlled by a certain gene.
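The pairing rule stated above (A with T, G with C) can be checked against the mitochondrial control-region example. The following Python lines are only a minimal sketch; the names `PAIR` and `complement` are hypothetical and do not come from the slides.

```python
# Base pairing for DNA as stated in the slides: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base paired strand (hypothetical helper)."""
    return "".join(PAIR[base] for base in strand)


upper = "TTCTTTCATGGGGAAGCAAA"   # first strand from the slide
print(complement(upper))         # AAGAAAGTACCCCTTCGTTT
```

The printed string is the second strand shown in the control-region example, so one strand is enough to determine the other.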
Proteins
• Each protein consists of amino acids.
• There are 20 different amino acids.

The Relationship between a Gene and its Corresponding Protein
• Each amino acid is coded by a triplet of bases. For instance, TTC denotes PHE (Phenylalanine).
• Each triplet is called a codon.
• Three codons, namely TAA, TGA and TAG, represent “end of gene”.

• Protein RNase A: KETAAAKFER
• Its corresponding DNA sequence is: AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT

How Is a Protein Produced?
• RNA (Ribonucleic Acid)
• Each cell is able to recognize all of the starting points of the genes relevant to the proteins important to the functions of the cell.

• The RNA system scans a gene. For each codon being scanned, it produces the corresponding amino acid.
• After all codons have been scanned, the corresponding protein is produced.

• AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT
• KETAAAKFER
• Note that codon AAA corresponds to amino acid K and CGT corresponds to R.
• Remember that TAA, TGA and TAG signify “end of gene”.

Problems
1. String Matching Problem
2. Sequence Alignment Problem
3. Evolution Tree Problem
4. RNA Secondary Structure Prediction Problem
5. Protein Structure Problem
6. Physical Mapping Problem
7. Genome Rearrangement Problem

Exact String Matching Problems
• Exact String Matching Problems
– Instance: A text T of length n and a pattern P of length m, where n > m.
– Question: Find all occurrences of P in T.
– Example: If T = “ttaptaap” and P = “ap”, then P occurs in T starting at positions 3 and 7.
• Linear time (O(n+m) time) Algorithms
– Knuth-Morris-Pratt (KMP) algorithm
– Boyer-Moore algorithm

Approximate String Matching Problems
• Approximate String Matching Problems
– Instance: A text T of length n, a pattern P of length m and a maximal number k of errors allowed.
– Question: Find all text positions where the pattern matches the text with at most k errors, where an error is the substitution, deletion, or insertion of a character.
– Example:
  • Let T = “pttapa”, P = “patt” and k = 2.
  • The substrings T[1..2], T[1..3], T[1..4] and T[5..6] match P with at most 2 errors.
• Algorithms
– Dynamic Programming approach
– NFA approach

Sequence Alignment Problem
• ATTCATTACAACCGCTATG
  ACCCATCAACAACCGCTATG
• It appears that these two sequences are quite different.
• An alignment will produce the following:
  ATTCATTA-CAACCGCTATG
  ACCCATCAACAACCGCTATG

• Given two sequences, any alignment will have a corresponding score.
• For each exact match, the score is equal to 2.
• For each mismatch, the score is equal to -1. In the example, a position aligned with a gap is also scored as -1.
• Two alignments of AGC and AAAC:
  AGC-    AG-C
  AAAC    AAAC
  Score of the first: 2 - 3 = -1
  Score of the second: 2x2 + 2x(-1) = 2

• The sequence alignment problem: Given two sequences, find an alignment which produces the highest score.
• Approach: Dynamic Programming
• The multiple sequence alignment problem is NP-hard.

Before alignment:
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAATAA TTTCTATAAA TTTTATATAT ATATTTTATA
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATTTTATATA TATATTTTAT
TTAAAAATAA GAAATTTTTT TTTTTAAATT AAATTTCTAT CAATTTTATA TATTTTTTAT
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TATTTTATAA
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TATATTATAA
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TTTTTTATAA
TTAAAAATTA GAAATTTTAT TTTTTAAAAT TTCTATTAAA ATTTATATAT ATATTTTTTT
TTAAAAATGA GAAATTTTTA TAAAAAAATT TCTTTAAATT TTATATATTT TATAAATATA
TTAATAATAA GAAATTTTTT TATTTTTTAA ATAAAAAATT CTTTAAATTT TATATATATA

After alignment:
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT GTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT GTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT GTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~~TAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTT~~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTT~~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTTT~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTT~~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA ATTAAA~~TT TCTAT~CAAT TTTATATATT
TTAAAAATGA GAAATTTTTA T~~~~~~~AA A~AAAA~~TT TCTTT~AAAT TTTATATATT
TTAATAATAA GAAATTTTTT TATTTTTTAA A~TAAAAAAT TCTTT~AAAT TTTATATATA

The Evolution Tree Problem

• The evolution tree problem: Given a distance matrix of n species, find an evolution tree under some criterion.
• Usually, the criteria are such that all of the tree distances reflect the original distances.
• That is, when two species are close to each other in the distance matrix, they should be close in the evolution tree.

• Each criterion corresponds to a distinct evolution tree problem.
• Most of them are NP-complete.
• Algorithms which produce optimal evolution trees in polynomial time are mostly based upon the minimum spanning tree approach.

A Partial Evolution Tree of Homo sapiens (Intelligent Human Beings, also Modern Men)
Our ancestors are from Africa.

Secondary Structure of RNA
• Due to hydrogen bonds, the primary structure of an RNA can fold back on itself to form its secondary structure.
• Base pairs (formed by hydrogen bonds):
1. AU (Watson-Crick base pair)
2. CG (Watson-Crick base pair)
3. GU (Wobble base pair)

(Figure: RNA Secondary Structure without Pseudoknots)

Given an RNA sequence, there may be several secondary structures without pseudoknots.
(Figure: several pseudoknot-free secondary structures of one RNA sequence)

An optimal RNA secondary structure is one with the maximum number of base pairs.

(Figure: RNA Secondary Structure with Simple Pseudoknots)

2D & 3D Structures of Yeast Phenylalanyl-Transfer RNA
(Figure: 2D Structure and 3D Structure)

Secondary Structure Prediction Problem
• Given an RNA sequence, determine the secondary structure of minimum free energy for this sequence.
• Approach: Dynamic Programming

Protein Structure Problem
• Each amino acid of a protein can be classified into one of the following two types:
– H (hydrophobic, non-polar) (hating water)
– P (hydrophilic, polar) (loving water)
• Then the amino acid sequence of a protein can be viewed as a binary sequence of H’s (1’s) and P’s (0’s).

Example
• Instance: 011001001110010
(Figure: two lattice embeddings of the instance, one with Score = 3 and one with Score = 5)

H-P Model
• Instance: A sequence of 1’s (H’s) and 0’s (P’s).
• Question: Find a self-avoiding path embedded in either a 2D or 3D lattice which maximizes the score, where the score is the number of pairs of 1’s that are adjacent in the lattice without being adjacent in the sequence.
• NP-complete even for the 2D lattice.

Physical Mapping Problem
• 10^8 bp: C, the full DNA.
• Physical mapping: Cut C and clone into overlapping YAC clones (10^6 bp).
• Physical mapping: Cut the DNA in each YAC clone and clone into overlapping cosmid clones (10^4 bp). Select a subset of cosmid clones of minimum total length that covers the YAC DNA.
• Fragment assembling: Duplicate the cosmid and then cut the copies randomly. Select and sequence short fragments (10^2 bp) and then reassemble them into a deduced cosmid string.

Shortest Common Superstring
• Input: A collection F of strings.
• Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.
• For example: F = {ACT, CTA, AGT}, S = ACTAGT
• NP-complete

• Suppose the target is too long and its contents are unknown.
• What can we do?
• Cut it with enzymes and measure the fragment lengths:
  Enzyme A: {6, 8, 3, 10}
  Enzyme B: {7, 11, 4, 5}
  Enzymes A and B: {1, 5, 2, 6, 7, 3, 3}

(Figure: fragment orders consistent with the three digests)
A:  3 8 6 10
B:  4 5 11 7
AB: 3 1 5 2 6 3 7
This problem is called the two-digest problem (also known as the double digest problem), which is NP-complete.

A genome is a sequence of genes.
Chloroplast genome of alfalfa: -8, -7, -6, -5, -4, -3, -2, -1, -11, -10, -9
Chloroplast genome of garden pea: -4, +3, -2, +8, +7, -1, -5, -6, -11, +10, +9

Suppose that we can only reverse a substring of genes. A reversal reverses the order of the genes in the substring and flips the sign of each.
-4, +5, -8, -9
After reversal, we have +9, +8, -5, +4.

The sorting by reversal problem: transform one sequence into another using only reversals, in the minimum number of steps.
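The reversal operation on signed genes can be written in a few lines. This is a minimal Python sketch under the assumption that a genome is a list of signed integers; the function name `reverse_segment` and its index convention (half-open range `i:j`) are hypothetical and not taken from the slides.

```python
def reverse_segment(genes, i, j):
    """Reverse genes[i:j] and flip the sign of each gene in it.

    Hypothetical helper: genes is a list of signed integers, and the
    segment is the half-open index range [i, j).
    """
    flipped = [-g for g in reversed(genes[i:j])]
    return genes[:i] + flipped + genes[j:]


# The slide example: reversing the whole substring -4, +5, -8, -9.
print(reverse_segment([-4, 5, -8, -9], 0, 4))  # [9, 8, -5, 4]

# Reversing only an inner segment leaves the rest untouched.
print(reverse_segment([1, -3, -2, 4], 1, 3))   # [1, 2, 3, 4]
```

Sorting by reversals then asks for the shortest sequence of such calls that turns one genome into the other; the sketch covers only the single operation, not the search for a minimum.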
The transformation of worm Ascaris suum mitochondrial DNA into human mitochondrial DNA
12 31 34 28 26 17 29 4 9 36 18 35 19 1 16 14 32 33 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 31 34 28 26 17 29 4 9 36 18 35 19 1 16 14 33 32 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 31 32 33 14 16 1 19 35 18 36 9 4 29 17 26 28 34 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 33 32 31 14 16 1 19 35 18 36 9 4 29 17 26 28 34 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 13 20 5 27 11 15 22 34 28 26 17 29 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 17 26 28 34 22 15 11 27 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 26 17 34 22 15 11 27 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 11 15 22 34 17 26 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 26 17 34 22 15 11 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 26 25 8 21 24 3 6 10 23 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 21 8 3 6 10 23 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 10 6 3 8 21 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 15 11 5 20 13 4 9 36 18 35 19 1 16 14 21 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 14 16 1 19 35 18 36 9 4 13 20 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 13 4 9 36 18 35 19 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 35 18 36 9 4 13 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 35 36 9 4 13 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 34 10 6 3 8 15 11 5 14 16 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 14 5 11 15 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 11 5 14 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 5 11 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 1 34 10 6 3 8 11 5 4 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 10 6 3 8 11 5 4 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 10 6 3 4 5 11 8 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 11 5 4 3 6 10 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 2 10 6 3 4 5 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 6 10 2 3 4 5 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 6 5 4 3 2 10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 2 3 4 5 6 7 8 9 36 35 34 10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 9 8 7 6 5 4 3 2 1 10 11
1 2 3 4 5 6 7 8 9 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 10 11
1 2 3 4 5 6 7 8 9 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

• TAA, TGA, or TAG.
• Do you know what they mean?
• End of Gene.
• Thank you for your patience. Have a good conference.