Chapter 1
DNA, Proteins, Genes and RNA
We human beings are living things. We grow, we eat, we reproduce and we die. These
characteristics make us tremendously different from non-living things such as rocks and metal. We
are not the only living things on earth; the beautiful wild flowers on the moors are also living things.
They also grow, reproduce and die. Perhaps we take one phenomenon for granted: our offspring
are all like us. In fact, they all belong to the same species as we do. That is, dogs produce dogs
and wild flowers produce the same kind of flowers. It is only logical for us to conclude that some
kind of information must have been passed from us to our offspring. The question is: how is it
passed? The answer lies in DNA, which will be introduced in the next section.
1.1 DNA
The hereditary information of living things is contained in deoxyribonucleic acid (DNA), which
exists in the chromosomes and in the mitochondria or chloroplasts of cells. Human beings have 46
chromosomes. They come in pairs; the sex chromosomes are the exception, as a male has one X
chromosome and one Y chromosome while a female has two X chromosomes. The basic units of DNA
are nucleotides, and each nucleotide carries one of the following four bases: adenine (A), guanine
(G), cytosine (C) and thymine (T). DNA can therefore be written as a sequence of A’s, G’s, C’s and
T’s. The particular order of the A’s, G’s, C’s and T’s is extremely important, as we shall see later.
The order underlies all of life’s diversity, even dictating whether a living organism is a human or
another species, such as yeast, bacteria, rice or fruit fly. In the late 1940s, Erwin Chargaff noted
an important phenomenon: the amount of adenine in DNA molecules is equal to that of thymine and
the amount of guanine is equal to that of cytosine (A = T and G = C). In 1953, based on the X-ray
diffraction data of Rosalind Franklin and Maurice Wilkins, James Watson and Francis Crick proposed
a model for DNA structure. The Watson-Crick model states that the DNA molecule is a double helix
(two strands twisted together). The only two base pairs that are possible between the strands are
A-T and C-G. That is, whenever there is an A in one strand, there is a corresponding T in the other
strand, and whenever there is a C in one strand, there is a corresponding G in the other. This is
why we have A = T and G = C. In human cells, there are about 3 billion such pairs. For this
groundbreaking work, Watson and Crick shared the Nobel Prize in 1962 with Maurice Wilkins.
Consider the following sequence, which is a part of the human mitochondrial DNA control sequence:
TTCTTTCATGGGGAAGCAAA
For this sequence, there is a counterpart sequence in the other strand, as follows:
AAGAAAGTACCCCTTCGTTT
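The pairing rule can be made concrete with a short Python sketch. The function name complement is our own choice for illustration; it simply applies the A-T and C-G pairing position by position and writes the counterpart in the same left-to-right order as the example above.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the counterpart strand, position by position."""
    return "".join(PAIR[base] for base in strand)

strand = "TTCTTTCATGGGGAAGCAAA"
partner = complement(strand)
print(partner)  # AAGAAAGTACCCCTTCGTTT

# Chargaff's observation holds for the two strands taken together.
both = strand + partner
print(both.count("A") == both.count("T"))  # True
print(both.count("G") == both.count("C"))  # True
```

Since every A faces a T and every C faces a G, the equal counts follow directly from the pairing and do not depend on the particular sequence.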
Figure 1.1 illustrates the double helix structure of DNA.
Figure 1.1: The Double Helix Structure of DNA
Cells are different; this is common sense. A blood cell is of course different from, say, a cell in
the fingernail. We can now ask an interesting question: Is the DNA different in different cells? It
turns out that the DNA molecules are the same in all of the cells of an organism. Consequently, we
can ask another piercing question: What makes cells different? We can roughly answer the question
as follows: Different cells have different proteins. It is the proteins which make the difference.
1.2 Proteins
In our cells, 90% of the substances are water. Of the remaining molecules, 50% are proteins.
Proteins determine the functions of cells. Red blood cells, for example, have to be able to carry
oxygen. They can do so because they produce the protein haemoglobin, which transports oxygen.
Some proteins are structural, as they determine the structures of cells. For instance, proteins make
up most of the hair, fingernails and skin. On the other hand, many proteins are enzymes, that is,
catalysts that speed up the chemical processes which are necessary for living organisms to function.
The building blocks of proteins are the amino acids. Only 20 different amino acids make up
the diverse array of proteins found in living things. Table 1.1 summarizes these 20 common amino
acids. Each protein differs according to the amount, type and arrangement of the amino acids that
make up its structure.
Amino acid       Three-letter code   One-letter code
Alanine          ALA                 A
Arginine         ARG                 R
Aspartic Acid    ASP                 D
Asparagine       ASN                 N
Cysteine         CYS                 C
Glutamic Acid    GLU                 E
Glutamine        GLN                 Q
Glycine          GLY                 G
Histidine        HIS                 H
Isoleucine       ILE                 I
Leucine          LEU                 L
Lysine           LYS                 K
Methionine       MET                 M
Phenylalanine    PHE                 F
Proline          PRO                 P
Serine           SER                 S
Threonine        THR                 T
Tryptophan       TRP                 W
Tyrosine         TYR                 Y
Valine           VAL                 V
Table 1.1: The 20 Common Amino Acids.
Now, as we pointed out before, different cells contain exactly the same DNA, but they contain
different proteins. Research has revealed that for each protein, there is a certain contiguous part
of DNA, called a gene, which corresponds to this protein. Genes control which proteins are to be
made. Thus, it is appropriate to say that genes control the function of cells. Perhaps we can view
DNA as an exceedingly large computer program, containing 3 billion instructions where each
instruction is A, G, C or T. A gene is a small part of the large program which performs a particular
function, namely, to control the production of a protein. The unique environment of a cell triggers a
particular gene to work. But how does DNA specify the particular protein for a cell? Certainly,
DNA must have a coding method to specify proteins. Note that proteins are built from 20 amino
acids, while DNA has only four bases. A single base can distinguish only 4 amino acids and a pair of
bases only 4x4 = 16, so at least three bases are needed for each amino acid. DNA therefore employs
a triplet scheme as illustrated in Table 1.2.
First                    Second Position of Codon                      Third
Position   T              C              A              G              Position
T          TTT Phe [F]    TCT Ser [S]    TAT Tyr [Y]    TGT Cys [C]    T
           TTC Phe [F]    TCC Ser [S]    TAC Tyr [Y]    TGC Cys [C]    C
           TTA Leu [L]    TCA Ser [S]    TAA Ter [end]  TGA Ter [end]  A
           TTG Leu [L]    TCG Ser [S]    TAG Ter [end]  TGG Trp [W]    G
C          CTT Leu [L]    CCT Pro [P]    CAT His [H]    CGT Arg [R]    T
           CTC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]    C
           CTA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]    A
           CTG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]    G
A          ATT Ile [I]    ACT Thr [T]    AAT Asn [N]    AGT Ser [S]    T
           ATC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]    C
           ATA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]    A
           ATG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]    G
G          GTT Val [V]    GCT Ala [A]    GAT Asp [D]    GGT Gly [G]    T
           GTC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]    C
           GTA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]    A
           GTG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]    G
Table 1.2: The Genetic Code Is Read in Triplets.
Remember that DNA has 4 nucleic acid bases: T, G, C and A. Every combination of three bases is
called a codon. There are 4x4x4 = 64 distinct codons but only 20 amino acids. Therefore there is a
many-to-one relation between codons and amino acids. For instance, both TTT and TTC correspond
to amino acid F, and TCT, TCC, TCA and TCG all correspond to amino acid S. Three codons, namely
TAA, TAG and TGA, do not correspond to any amino acid. They are terminal (ending) codons which
act as control signals.
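The counting argument can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are our own.

```python
from itertools import product

BASES = "TGCA"

# Enumerate every combination of three bases.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))  # 64

# Shortest codon length whose number of combinations covers 20 amino acids.
length = next(n for n in range(1, 10) if len(BASES) ** n >= 20)
print(length)  # 3
```

With one or two bases per codon there would be only 4 or 16 combinations, which is fewer than 20, so three is the shortest length that suffices.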
Let us consider the protein RNase A (ribonuclease A), an enzyme whose function is to break down
RNA. A part of this protein is as follows.
KETAAAKFER
Because several codons can specify the same amino acid, more than one DNA sequence encodes
this fragment. One such sequence is
AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT
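The correspondence between the DNA sequence and the protein fragment can be verified with a small Python sketch that looks up each codon in Table 1.2. The compact string AMINO lists the one-letter codes in the order TTT, TTC, TTA, TTG, TCT, and so on, with "*" standing for a terminal codon; this encoding and the function name translate_dna are our own assumptions for illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for all 64 codons of Table 1.2; "*" is a terminal codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate_dna(dna):
    """Translate a DNA coding sequence codon by codon; spaces are ignored."""
    dna = dna.replace(" ", "")
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate_dna("AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT"))  # KETAAAKFER
# A different DNA sequence encoding the same fragment (many-to-one relation).
print(translate_dna("AAG GAG ACC GCC GCA GCG AAG TTC GAG AGA"))  # KETAAAKFER
```

The second call uses a different codon at every position and still yields the same ten amino acids, which shows why a protein fragment does not determine its DNA sequence uniquely.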
Having introduced the coding scheme employed by DNA, we are now ready to answer another
question: Which chemical material brings this information to the place where the protein is to be
produced? This chemical compound is called RNA.
1.3 RNA
Ribonucleic acid (RNA) is quite similar to DNA. There are three major differences:
1. In RNA, there is no thymine (T) which exists in DNA. Instead, uracil (U) is found in RNA.
In other words, RNA is constructed out of A, G, C and U.
2. DNA has a double helix structure while RNA has only one strand.
3. Unlike DNA, there are different kinds of RNA performing different functions, which will be
explained later.
RNA plays an important role in the production of the particular protein which a cell needs. In
the following, we shall give a general outline of how the genetic code in DNA finally produces the
protein. The first part of the protein synthesis process is called transcription, consisting of the
following steps:
1. For every gene in the DNA, there is a region before it which is called a promoter. A
promoter indicates that there will be a gene ahead.
2. Each cell has a mechanism which recognizes this promoter.
3. The codon ATG (specifying the amino acid methionine) signals the beginning of a gene.
4. The cell makes an RNA copy of the gene, which results in a messenger RNA (mRNA). This
mRNA carries one strand of the original gene, except that T is now replaced by U.
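Steps 3 and 4 can be sketched in Python as follows. This is a deliberate simplification: promoter recognition is not modelled, the DNA string is a made-up example rather than a real gene, and the function name transcribe is our own.

```python
def transcribe(dna):
    """Return the mRNA copy of a gene: start at the first ATG and replace T by U.

    Returns an empty string when no ATG is present.
    """
    start = dna.find("ATG")
    if start == -1:
        return ""
    return dna[start:].replace("T", "U")

# Hypothetical stretch of DNA: three leading bases, then a short gene.
dna = "CCGATGAAAGAAACTTAAGG"
print(transcribe(dna))  # AUGAAAGAAACUUAAGG
```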
The second part of the protein synthesis process is called translation, which consists of the
following steps:
1. Ribosomal RNA (rRNA) molecules combine with tens of specific proteins to form a
ribosome, where the protein synthesis takes place. In the process of protein synthesis, the
ribosome moves along the mRNA. As it does, each codon of the mRNA is read and a
transfer RNA (tRNA) carrying the amino acid corresponding to this codon is called. For
instance, when the codon ACU (ACT in the DNA) is read by the ribosome, a tRNA carrying
the amino acid threonine is called.
2. An enzyme adds this amino acid to the partially made protein and releases the tRNA.
The above steps are repeated until an ending codon (UAA, UAG or UGA in the mRNA,
corresponding to TAA, TAG or TGA in the DNA) is encountered. The protein synthesis process is
then terminated and the mRNA is released. This mRNA will be recycled for later use. The entire
process is illustrated in Figure 1.2.
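The reading loop of translation can be sketched in Python as below. The ribosome and the tRNAs are reduced to a table lookup, the codon table is that of Table 1.2 written with U in place of T, and the mRNA string is a made-up example; all names are our own assumptions.

```python
from itertools import product

BASES = "UCAG"
# Table 1.2 in mRNA notation (U instead of T); "*" marks an ending codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Read the mRNA codon by codon until an ending codon is encountered."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]  # the tRNA "called" for this codon
        if amino == "*":                    # UAA, UAG or UGA: stop
            break
        protein.append(amino)               # extend the partially made protein
    return "".join(protein)

print(translate("AUGAAAGAAACUUAAGG"))  # MKET
```

The bases after the ending codon are never read, which mirrors the release of the mRNA once synthesis terminates.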
Figure 1.2: Protein Production Process
The entire DNA and gene system can be likened to a computer. The analogies are as follows:
1. The part of the gene copied into the mRNA is like a program, as it contains the instructions
for producing a particular protein.
2. The ribosome is like a CPU, as it reads the program carried by the mRNA and executes it.
3. The tRNA is like the I/O system.
4. The input is the set of amino acids needed to produce the protein.
5. The output is the protein which the gene specifies.
1.4 A Preview of the Book
Throughout this book, we shall consider many molecular biology problems from the
algorithmic point of view. Therefore in Chapter 2, we shall have an introduction to algorithms. We
shall introduce the concepts of time-complexity, NP-completeness and approximation algorithms.
All of these concepts are needed for later discussions. In Chapter 3, we shall discuss the string
matching problem. As can be easily imagined, we may have a large collection of DNA or protein
sequences and be given a new DNA or protein sequence. We are asked to find out whether this new
sequence exists in our data or not and, if it exists, where it occurs.
For reasons which we shall explain later, we shall use the term “string” instead of “sequence”.
For example, suppose that we have a string S = GACTTACT and X = ACT. Then X occurs
twice in S, starting at locations 2 and 6 respectively.
This problem is called the exact string matching problem and we will introduce efficient
algorithms to solve it in Chapter 3.
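To make the problem statement concrete, here is a naive Python sketch that tries every starting location in turn. It is not one of the efficient algorithms referred to above, only the most direct way of stating the problem in code; the function name is our own.

```python
def find_occurrences(text, pattern):
    """Return the 1-based starting locations of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1
            for i in range(len(text) - m + 1)
            if text[i:i + m] == pattern]

print(find_occurrences("GACTTACT", "ACT"))  # [2, 6]
```

The naive scan compares up to len(pattern) characters at each of the len(text) starting locations, which is what the efficient algorithms improve upon.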
Since exact matching imposes too strict a condition, we shall also introduce another problem,
called the approximate string matching problem. That is, we will now allow errors within a certain
bound. Suppose that S = ATTCAC and X = ACTT. The approximate string matching problem is to find
all occurrences of X in S with up to k errors, where errors can be caused by substituting, deleting
or inserting a character. In our case, if k = 2, the substrings AT, ATT, ATTC and AC are all
solutions. Efficient algorithms will be introduced for this problem.
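The number of errors between X and a candidate substring can be computed with the standard edit distance dynamic program. The Python sketch below is our own illustration, not necessarily the algorithm presented later in the book; it confirms that each of the four substrings named above is within k = 2 errors of X.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

X, k = "ACTT", 2
for candidate in ["AT", "ATT", "ATTC", "AC"]:
    d = edit_distance(X, candidate)
    print(candidate, d, d <= k)
```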
In Chapter 4, we shall introduce the sequence alignment problem. Let us consider the
following two sequences:
ATTCATTACAACCGCTATG
and
ACCCATCAACAACCGCTATG
It may appear that these two sequences are quite different. Yet a proper alignment will produce
the following:
ATTCATTA-CAACCGCTATG
ACCCATCAACAACCGCTATG
After the alignment is made, we can now see that these two sequences are quite similar to
each other.
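How similar the aligned sequences are can be quantified by comparing them column by column. The following Python sketch only evaluates the alignment shown above; it does not search for an optimal alignment, and the function name is our own.

```python
def compare_alignment(row1, row2, gap="-"):
    """Count matching columns, mismatching columns and gap columns of an alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps

print(compare_alignment("ATTCATTA-CAACCGCTATG",
                        "ACCCATCAACAACCGCTATG"))  # (16, 3, 1)
```

Of the 20 columns, 16 agree, 3 differ and 1 contains a gap.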
There are many reasons for us to perform sequence alignment and, given two sequences,
there are many different alignments. The sequence alignment problem is to find an optimal
alignment. We will also discuss the multiple sequence alignment problem.
In Chapter 4, we shall introduce the evolution tree problem. Let us consider the matrix in
Table 1.3. This is a distance matrix. Each entry of the matrix indicates the distance between two
species. The smaller the distance is, the more similar they are to each other. In Figure 1.3, there
is a tree, which is called an evolution tree.
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20  Species
 1    0  1 13 17 16 13 12 12 17 16 18 18 19 20 31 33 36 63 56 66  Man
 2       0 12 16 15 12 11 13 16 15 17 17 18 21 32 32 35 62 57 65  Monkey
 3          0 10  8  4  6  7 12 12 14 14 13 30 29 24 28 64 61 66  Dog
 4             0  1  5 11 11 16 16 16 17 16 32 27 24 33 64 60 68  Horse
 5                0  4 10 12 15 15 15 16 15 31 26 25 32 64 59 67  Donkey
 6                   0  6  7 13 13 13 14 13 30 25 26 31 64 59 67  Pig
 7                      0  7 10  8 11 11 11 25 26 23 29 62 59 67  Rabbit
 8                         0 14 14 15 13 14 30 27 26 31 66 58 68  Kangaroo
 9                            0  3  3  3  7 24 26 25 29 61 62 66  Duck
10                               0  4  4  8 24 27 26 30 59 62 66  Pigeon
11                                  0  2  8 28 26 26 31 61 62 66  Chicken
12                                     0  8 28 27 28 30 62 61 65  Penguin
13                                        0 30 27 30 33 65 64 67  Turtle
14                                           0 38 40 41 61 61 69  Rattlesnake
15                                              0 34 41 72 66 69  Tuna
16                                                 0 16 58 63 65  Screw worm
17                                                    0 59 60 61  Moth
18                                                       0 57 61  Neurospora
19                                                          0 41  Saccharomyces
20                                                             0  Candida
Table 1.3: Distance Matrix of Species
This tree appropriately reflects the distances in Table 1.3. For instance, the distance between man
and monkey is small in the distance matrix. They are also close to each other in the evolution tree.
Evolution trees can be constructed under different criteria. For each criterion, given a distance
matrix, there is an optimal evolution tree. The techniques of constructing these evolution trees
will be discussed in that chapter.
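As one illustration of how a tree can be derived from a distance matrix, the Python sketch below repeatedly merges the two closest clusters, using the average distance between their members (average-linkage clustering, often called UPGMA). This is only one possible criterion, chosen here for brevity, and not necessarily the one used to draw Figure 1.3. The data are the entries of Table 1.3 for the first five species; the nested-tuple representation of the tree is our own assumption.

```python
from itertools import combinations

NAMES = ["Man", "Monkey", "Dog", "Horse", "Donkey"]
# Upper-triangular distances for the first five species of Table 1.3.
ROWS = [
    [1, 13, 17, 16],  # Man    vs Monkey, Dog, Horse, Donkey
    [12, 16, 15],     # Monkey vs Dog, Horse, Donkey
    [10, 8],          # Dog    vs Horse, Donkey
    [1],              # Horse  vs Donkey
]
DIST = {}
for i, row in enumerate(ROWS):
    for offset, d in enumerate(row, start=1):
        DIST[frozenset((NAMES[i], NAMES[i + offset]))] = d

def average_distance(leaves_a, leaves_b):
    """Average distance over all pairs with one species from each cluster."""
    total = sum(DIST[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))

def build_tree(names):
    """Merge the two closest clusters until a single tree remains."""
    # Each cluster is (tree, leaves); a single species is its own tree.
    clusters = [(name, (name,)) for name in names]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]][1],
                                                  clusters[p[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

print(build_tree(NAMES))
# (('Man', 'Monkey'), ('Dog', ('Horse', 'Donkey')))
```

Man and monkey are joined first, as are horse and donkey, which agrees with their small distances in the matrix.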
DNA has a double helix structure which is relatively stable. As we explained in Section 1.3,
RNA does not have such a structure; it has a single strand. Among the bases of RNA, the following
bonds are possible:
1. A ≡ U (Watson-Crick base pair)
2. C = G (Watson-Crick base pair)
3. G - U (Wobble base pair)
Let us consider the following RNA sequence:
AGGCCUUCCU
Figure 1.3: An Evolution Tree of Species of Table 1.3
Figure 1.4 shows six possible structures, and the structure in Figure 1.4 (f) is the best one. All
of the structures in Figure 1.4 are called secondary structures. The RNA secondary structure
prediction problem will be discussed in Chapter 5.
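The pairing rules listed above can be written down as a small Python check. The sketch tests only two conditions: every bonded pair must be one of the three allowed pairs, and every base may take part in at most one bond. Further constraints that a prediction method would normally impose are not modelled, and the pairings tested below are hypothetical examples, not the structures of Figure 1.4.

```python
# Allowed pairs: Watson-Crick (A-U, C-G) and wobble (G-U), in either order.
ALLOWED = {frozenset("AU"), frozenset("CG"), frozenset("GU")}

def can_pair(x, y):
    """True if bases x and y may form a bond."""
    return frozenset((x, y)) in ALLOWED

def is_valid_structure(seq, pairs):
    """pairs holds 1-based position pairs (i, j); each base may bond at most once."""
    used = set()
    for i, j in pairs:
        if i == j or i in used or j in used:
            return False
        if not can_pair(seq[i - 1], seq[j - 1]):
            return False
        used.update((i, j))
    return True

seq = "AGGCCUUCCU"
print(is_valid_structure(seq, [(1, 10), (2, 9), (3, 8)]))  # True: A-U, G-C, G-C
print(is_valid_structure(seq, [(4, 5)]))                   # False: C-C is not allowed
```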
Similar to the RNA secondary structure prediction problem, there is a protein structure
prediction problem which will be introduced in Chapter 6.
The DNA sequence is an extremely long one, consisting of 3 billion A’s, G’s, C’s and T’s. When
we try to read this long sequence, we have to break it into small pieces and then try to piece them
together. Let us imagine that we have the following strings: AGT, ACT and CTA. Based upon
these three strings, we can build one string
S = ACTAGT.
Figure 1.4: Six Possible Secondary Structures of RNA Sequence A–G–G–C–C–U–U–C–C–U (The
Dashed Lines Denote the Hydrogen Bonds)
The string S has a special property which can be illustrated as follows:
   AGT     S1
ACT        S2
 CTA       S3
ACTAGT     S
As can be seen, each Si occurs inside S; that is, S is a superstring of every Si. The problem of
building such a string from the pieces is called a physical mapping problem. There are many
physical mapping problems and they will all be discussed in Chapter 7.
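One simple way to build a superstring from the pieces is to merge, again and again, the two strings with the largest overlap. The Python sketch below implements this greedy heuristic for illustration only; it is not guaranteed to give the shortest superstring in general, it assumes that no piece is contained inside another, and it is not necessarily the method discussed in Chapter 7.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(pieces):
    """Repeatedly merge the ordered pair of strings with the largest overlap."""
    pieces = list(pieces)
    while len(pieces) > 1:
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(pieces)
                      for j, b in enumerate(pieces) if i != j)
        merged = pieces[i] + pieces[j][k:]
        pieces = [p for n, p in enumerate(pieces) if n not in (i, j)] + [merged]
    return pieces[0]

pieces = ["AGT", "ACT", "CTA"]
S = greedy_superstring(pieces)
print(S)                             # ACTAGT
print(all(p in S for p in pieces))   # True
```

Here ACT and CTA are merged first because they overlap in two characters, giving ACTA, which then overlaps AGT in one character.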
1.5 Websites, Books and Journals
Since we are entering the network era, perhaps it is suitable for us to give a list of websites related
to this field, as follows:
1. MIT Biology Hypertextbook:
http://esg-www.mit.edu:8001/esgbio/
2. The International Society for Computational Biology:
http://www.iscb.org/
3. Bioinformatics and Computational Biology-Related Books:
http://www.bioinform.com/framesets/fsbookstore.htm
4. Bioinformatics and Computational Biology-Related Journals:
http://www.iscb.org/journals.html
5. Bioinformatics and Computational Biology-Related Conferences:
http://www.iscb.org/conferences.html
6. Bioinformatics and Computational Biology Centers:
http://www.iscb.org/bioinf.html
7. National Center for Biotechnology Information (NCBI, NLM, NIH):
http://www.ncbi.nlm.nih.gov/
8. European Bioinformatics Institute (EBI):
http://www.ebi.ac.uk/
9. DNA Data Bank of Japan (DDBJ):
http://www.ddbj.nig.ac.jp/
There are not many textbooks in the field of computational biology. The following are two
books which are suitable for reference:
1. Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology, Cambridge University Press, New York, 1997.
2. Joao Carlos Setubal and Joao Meidanis, Introduction to Computational Molecular Biology,
PWS Publishing Company, 1996.
The following is a list of journals which will be useful for further research:
1. Bioinformatics
2. Bioinform Newsletter
3. Biotechnology Software & Internet Journal
4. Computers and Biomedical Research
5. Genome Research
6. Human Genome News
7. Journal of Computational Biology
8. Journal of Molecular Biology
9. Journal of Molecular Modeling
10. Nature Biotechnology
11. Nature Genetics
12. Nature Structural Biology
13. Nucleic Acids Research
14. Protein Engineering
15. Protein Science
16. Science
17. The Scientist (“The News Journal for the Life Scientist”)