Download Application of Algorithm Research to Molecular Biology

Document related concepts
no text concepts found
Transcript
Application of Algorithm
Research to Molecular Biology
R. C. T. Lee
Dept. Of Computer Science
National Chinan University
1
• There is one peculiar characteristics of all
living organisms: We can reproduce
ourselves.
• Yet, it is important that what we reproduce
have to be the same as we are.
• That is, wild flowers produce the same kind
of wild flowers and birds reproduce the
same kind of birds.
2
• Information about ourselves must be passed
to our descendants.
• Question: How is this done?
• Answer: Through DNA.
3
First of all, we need a language to pass
the information about heredity. This
language has existed for 3 billion
years, the oldest language in the
world.
This language consists of 4 alphabets:
A, G, C and T.
4
We need a mechanism to represent the
alphabets. This is done by using
chemical compounds.
A: adenine
G: guanine
C: cytosine
T: thymine
5
Nature has used DNA to pass the
heredity information to our descendants.
A DNA strand is a sequence of chemical
compounds.
From our point of view, a DNA strand
is a sequence of A, G, C and T.
6
• DNA(Deoxyribonucleic Acid) can
be viewed as two strands of nucleic
acids formed as a double helix.
7
8
• Each strand of a DNA is a sequence of A, G,
C and T.
• Yet, in each strand, A is paired with T in the
other strand.
• Similarly, G is paired with C.
9
Human Mitochondrial
DNA Control Region
TTCTTTCATGGGGAAGCAAA
AAGAAAGTACCCCTTCGTTT
10
• DNA exists in cells.
• For each living organism, there are a lot of
different kinds of cells. For instance, in
human beings, we have muscle cells, blood
cells, neural cells etc.
• How can different cells perform different
functions?
11
Genes
• In each DNA sequence, there are
subsequences which are called genes.
• Each gene corresponds to a distinct protein
and it is the protein which determines the
function of the cell.
• For instance, in red blood cells, there must
be oxygen carrying protein haemoglobin
and the production of this protein is
controlled by a certain gene.
12
Proteins
• Each protein consists of amino acids.
• There are 20 different amino acids
13
14
The Relationship between a Gene
and its Corresponding Protein
15
• As shown above, each amino acid is coded
by a triplet. For instance, TTC denotes
PHE(Phenylalanine).
• Each triplet is called a codon.
• There are three codons, namely TAA, TGA
and TAG which represent “end of gene”.
16
• Protein Rnase A:
KETAAAKFER
• Its corresponding DNA sequence is:
AAA GAA ACT GCT GCT GCT AAA TTT
GAA CGT
17
How Is a Protein Produced?
• RNA (Ribonucleic Acid)
• Each cell is able to recognize all of the
starting points of genes relevant to the
proteins important to the functions of the
cell.
18
• The RNA system scans a gene. For each
codon being scanned, it produces a
corresponding amino acid.
• After all codons have been scanned, the
corresponding protein is produced.
19
20
• AAA GAA ACT GCT GCT GCT AAA TTT GAA
CGT
• KETAAAKFER
• Note that codon AAA corresponds to amino acid
K and CGT corresponds to R.
• Remember TAA, TGA and TAG signify “end of
gene”.
21
Problems
1.
2.
3.
4.
String Matching Problem
Sequence Alignment Problem
Evolution Tree Problem
RNA Secondary Structure Prediction
Problem
5. Protein Structure Problem
6. Physical Mapping Problem
7. Genome Rearrangement Problem
22
Exact String Matching Problems
• Exact String Matching Problems
– Instance: A text T of length n and a pattern P of length m,
where n > m.
– Question: Find all occurrences of P in T.
– Example: If T = “ttaptaap” and P = “ap”, then P occurs in T
starting at 3 and 7.
• Linear time (O(n+m) time) Algorithms
– Knuth-Morris-Pratt (KMP) algorithm
– Boyer-Moore algorithm
23
Approximate String Matching
Problems
• Approximate String Matching Problems
– Instance: A text T of length n, a pattern P of length m and a
maximal number of errors allowed k
– Question: Find all text positions where the pattern matches
the text up to k errors, where errors can be substituting,
deleting, or inserting a character.
– Example:
• Let T = “pttapa”, P = “patt” and k = 2.
• The substrings T[1..2], T[1..3], T[1..4] and T[5..6] are up
to 2 errors with P.
• Algorithms
– Dynamic Programming approach
24
– NFA approach
Sequence Alignment Problem
• ATTCATTACAACCGCTATG
ACCCATCAACAACCGCTATG
• It appears that these two sequences are quite
different.
• An alignment will produce the following:
ATTCATTA-CAACCGCTATG
ACCCATCAACAACCGCTATG
25
• Given two sequences, any alignment will
have a corresponding score.
• For each exact match, the score is equal to 2.
• For each mismatch, the score is equal to -1.
• AGCAG-C
AAAC
AAAC
2-3=-1
2x2-2x(-1)=2
26
• The sequence alignment problem: Given
two sequences, find an alignment which
produces the highest score.
• Approach: Dynamic Programming
• The multiple sequence alignment problem is
NP-hard
27
Before alignment:
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAATAA TTTCTATAAA TTTTATATAT ATATTTTATA
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATTTTATATA TATATTTTAT
TTAAAAATAA GAAATTTTTT TTTTTAAATT AAATTTCTAT CAATTTTATA TATTTTTTAT
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TATTTTATAA
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TATATTATAA
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TTTTTTATAA
TTAAAAATTA GAAATTTTAT TTTTTAAAAT TTCTATTAAA ATTTATATAT ATATTTTTTT
TTAAAAATGA GAAATTTTTA TAAAAAAATT TCTTTAAATT TTATATATTT TATAAATATA
TTAATAATAA GAAATTTTTT TATTTTTTAA ATAAAAAATT CTTTAAATTT TATATATATA
28
After alignment:
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT GTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT GTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT GTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~~TAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATAA GAAATTATTT T~TT~A~~AA A~ATAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTT~~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTT~~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA A~~AAA~~TT TCTAT~AAAT TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTTT~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATTA GAAATTTTAT T~TTT~~~AA A~~A~~~~TT TCTATTAAAA TTTATATATA
TTAAAAATAA GAAATTTTTT T~TTTT~~AA ATTAAA~~TT TCTAT~CAAT TTTATATATT
TTAAAAATGA GAAATTTTTA T~~~~~~~AA A~AAAA~~TT TCTTT~AAAT TTTATATATT
TTAATAATAA GAAATTTTTT TATTTTTTAA A~TAAAAAAT TCTTT~AAAT TTTATATATA
29
The Evolution Tree Problem
30
31
• The evolution tree problem: Given a
distance matrix of n species, find an
evolution tree under some criterion.
• Usually, the criteria are such that all of the
tree distances reflect the original distances.
• That is, when two species are close to each
other in the distance matrix, they should be
close in the evolution tree.
32
• Each criterion corresponds to a distinct
evolution tree problem.
• Most of them are NP-complete.
• Algorithms which produce optimal
evolution trees in polynomial time are
mostly based upon the minimal spanning
tree approach.
33
A Partial Evolution Tree of the Homo Sapien
(Intelligent Human Beings, also Modern Men)
Our ancestors are from Africa.
34
Secondary Structure of RNA
• Due to hydrogen bonds, the primary
structure of a RNA can fold back on
itself to form its secondary structure.
• Base pairs (formed by hydrogen bonds):
1. AU (Watson-Crick base pair)
2. CG (Watson-Crick base pair)
3. GU (Wobble base pair)
35
G
A
G
A
C
A
U
A
A
U
C
G
U
U
A
C
C
C U U C A U C A G G A A A U G A C
RNA Secondary Structure without Pseudoknots
36
Given an RNA sequence, there may be several secondary
structures without pseudoknots, as shown below:
C
C
C
U
C
U
G
U
G
U
G
C
U
C
A
C
G
C
C
U
A
A
G
C
G
U
C
U
U
C
C
U
U
C
C
U
C
C
A
U
C
G
G
U
C
U
G
U
C
A
G
C
G
C
A
U
U
G
C
37
An optimal RNA secondary structure is one with the
maximum number of base pairs.
38
C
j1
U
A
C
G
A
A
U
C
G
U
G
U
A
C
A
A
C U U C A
j1
U C A G G A A A
j2
U G A C
j2
RNA Secondary Structure with Simple Pseudoknots
39
2D & 3D Structures of Yeast
Phenylalanyl-Transfer RNA
2D Structure
3D Structure
40
Secondary Structure Prediction
Problem
• Given an RNA sequence, determine the
secondary structure of the minimum
free energy from this sequence.
• Approach: Dynamic Programming
41
Protein Structure Problem
• Each amino acid of a protein can be classified
into either of the following two types:
– H (hydrophobic, non-polar) (hating water)
– P (hydrophilic, polar) (loving water)
• Then the amino acid sequence of a protein can
be viewed as a binary sequence of H’s (1’s) and
P’s (0’s).
42
Example
• Instance: 011001001110010
0
1
1
0
0
1
1
0
0
1
1
0
0
1
1
0
1
1
0
1
1
1
0
0
0
0
0
0
1
0
Score = 3
Score = 5
43
H-P Model
• Instance: A sequence of 1’s (H’s) and 0’s
(P’s).
• Question: To find a self-avoiding paths
embedded in either a 2D or 3D lattice which
maximizes score, where the score is the
number of pairs of 1’s that are adjacent in
the lattice without being adjacent in the
sequence.
• NP-complete even for 2D lattice.
44
Physical Mapping Problem
108 bp
C: Full DNA
Physical
mapping
Physical
mapping
106 bp
Cut C and clone
into overlapping
YAC clones.
Cut the DNA in each YAC clone and
clone into overlapping cosmid clones.
104 bp
Select a subset of cosmid clones of minimum
total length that covers the YAC DNA.
Fragment
assembling
102 bp
Duplicate the cosmid and then cut the copies randomly.
Select and sequence short fragments and then reassemble
them into a deduced cosmid string.
45
Shortest Common Superstring
• Input: A collection F of strings.
• Output: A shortest possible string S such
that for every f  F, S is a superstring of f.
• For example:
ACT
F
CTA
AGT
S ACTAGT
• NP-complete
46
• Suppose the target is too long and its
contents are unknown.
• What can we do?
• Enzyme A  {6, 8, 3, 10}
Enzyme B  {7, 11, 4, 5}
Enzymes A and B  {1, 5, 2, 6, 7, 3, 3}
47
A
3
B
AB
8
6
4
5
11
3 1
5
2
6
10
7
3
7
This problem is called the two digest
problem which is NP-complete.
48
A genome is a sequence of genes.
Chloroplast genome of Alfafa:
-8, -7, -6, -5, -4, -3, -2, -1, -11, -10, -9
Chloroplast genome of garden pea:
-4, +3, -2, +8, +7, -1, -5, -6, -11, +10, +9
49
Suppose that we can only reverse a
substring of genes.
-4, +5, -8, -9
After reversal, we have
+9, +8, -5, +4.
50
The sorting by reversal problem: The
problem of transforming one sequence
to another only by reversals in the
minimum number of steps.
51
The transformation of worm Ascaris Suum
mitochondrial DNA into human
mitochondrial DNA
12 31 34 28 26 17 29 4 9 36 18 35 19 1 16 14 32 33 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 31 34 28 26 17 29 4 9 36 18 35 19 1 16 14 33 32 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 31 32 33 14 16 1 19 35 18 36 9 4 29 17 26 28 34 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 33 32 31 14 16 1 19 35 18 36 9 4 29 17 26 28 34 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 13 20 5 27 11 15 22 34 28 26 17 29 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 17 26 28 34 22 15 11 27 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 26 17 34 22 15 11 27 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 11 15 22 34 17 26 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 26 17 34 22 15 11 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 26 25 8 21 24 3 6 10 23 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 21 8 3 6 10 23 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 10 6 3 8 21 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 15 11 5 20 13 4 9 36 18 35 19 1 16 14 21 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 14 16 1 19 35 18 36 9 4 13 20 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 13 4 9 36 18 35 19 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 35 18 36 9 4 13 1 16 14 5 11 15 8 3 6 10 34 17 2 7
52
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 35 36 9 4 13 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 34 10 6 3 8 15 11 5 14 16 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 14 5 11 15 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 11 5 14 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 5 11 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 1 34 10 6 3 8 11 5 4 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 10 6 3 8 11 5 4 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 10 6 3 4 5 11 8 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 11 5 4 3 6 10 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 2 10 6 3 4 5 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 6 10 2 3 4 5 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 6 5 4 3 2 10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 2 3 4 5 6 7 8 9 36 35 34 10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 9 8 7 6 5 4 3 2 1 10 11
1 2 3 4 5 6 7 8 9 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 10 11
1 2 3 4 5 6 7 8 9 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
53
• TAA, TGA, or TAG.
• Do you know what they mean?
• End of Gene.
• Thank you for your patience. Have a good
conference.
54
Related documents