Chapter 1 DNA, Proteins, Genes and RNA

We human beings are so-called living things. We grow, we eat, we reproduce and we die. These characteristics make us tremendously different from non-living things such as rocks and metal. We are not the only living things on earth; the beautiful wild flowers on the moors are also living things. They also grow, reproduce and die.

Perhaps we take one phenomenon for granted: our offspring are all like us. In fact, they all belong to the same species as we do. That is, dogs produce dogs and the beautiful wild flowers produce the same kind of flowers. It is logical to conclude that some kind of information must have been passed from us to our offspring. The question is: how is it passed? The answer lies in DNA, which is introduced in the next section.

1.1 DNA

The hereditary information of living things is contained in deoxyribonucleic acid (DNA), which exists in the chromosomes and in the mitochondria or chloroplasts of cells. Human beings have 46 chromosomes arranged in 23 pairs, one of which is the pair of sex chromosomes: the male has an X chromosome and a Y chromosome, and the female has two X chromosomes.

The basic units of DNA are nucleotides, and each nucleotide is one of the following four types: adenine (A), guanine (G), cytosine (C) and thymine (T). DNA is a sequence of A's, G's, C's and T's. The particular order of the A's, G's, C's and T's is extremely important, as we shall see later. The order underlies all of life's diversity, even dictating whether a living organism is human or another species, such as yeast, bacteria, rice or fruit fly.

In the late 1940s, Erwin Chargaff noted an important phenomenon: the amount of adenine in DNA molecules is equal to that of thymine, and the amount of guanine is equal to that of cytosine (A = T and G = C). In 1953, based on the X-ray diffraction data of Rosalind Franklin and Maurice Wilkins, James Watson and Francis Crick proposed a model for DNA structure.
The Watson-Crick model states that the DNA molecule is a double helix (two strands twisted together). The only two possible base pairs in DNA are A-T and C-G. That is, whenever there is an A in one strand, there is a corresponding T in the other strand, and whenever there is a C in one strand, there is a corresponding G in the other. This is why we have A = T and G = C. In human cells, there are 3 billion such pairs. For this groundbreaking theory, Watson and Crick, together with Wilkins, shared the Nobel Prize in 1962.

Consider the following sequence, which is a part of the human mitochondrial DNA control sequence:

TTCTTTCATGGGGAAGCAAA

For this sequence, there is another sequence, its counterpart, which exists in the other strand:

AAGAAAGTACCCCTTCGTTT

Figure 1.1 illustrates the double helix structure of DNA.

Figure 1.1: The Double Helix Structure of DNA

Cells are different. This is common sense. A blood cell is of course different from, say, a cell in the fingernail. We can now ask an interesting question: is the DNA different in different cells? It turns out that the DNA molecules are exactly the same in all of the cells of an organism. Consequently, we can ask another piercing question: what makes cells different? We can roughly answer the question as follows: different cells have different proteins. It is the proteins which make the difference.

1.2 Proteins

In our cells, 90 percent of the substances are water. Of the remaining molecules, 50 percent are proteins. Proteins determine the functions of cells. Red blood cells, for example, have to be able to carry oxygen. They can do so because they produce the protein haemoglobin, which transports oxygen. Some proteins are structural, as they determine the structures of cells. For instance, proteins make up most of the hair, fingernails and skin. On the other hand, many proteins are enzymes, chemical materials that speed up the chemical processes which living organisms need in order to function.

The building blocks of proteins are the amino acids.
Only 20 different amino acids make up the diverse array of proteins found in living things. Table 1.1 summarizes these 20 common amino acids. Each protein differs according to the amount, type and arrangement of the amino acids that make up its structure.

Amino acid       Three-letter code   One-letter code
Alanine          ALA                 A
Arginine         ARG                 R
Aspartic Acid    ASP                 D
Asparagine       ASN                 N
Cysteine         CYS                 C
Glutamic Acid    GLU                 E
Glutamine        GLN                 Q
Glycine          GLY                 G
Histidine        HIS                 H
Isoleucine       ILE                 I
Leucine          LEU                 L
Lysine           LYS                 K
Methionine       MET                 M
Phenylalanine    PHE                 F
Proline          PRO                 P
Serine           SER                 S
Threonine        THR                 T
Tryptophan       TRP                 W
Tyrosine         TYR                 Y
Valine           VAL                 V

Table 1.1: The 20 Common Amino Acids.

Now, as we pointed out before, different cells contain exactly the same DNA, yet they contain different proteins. Research has revealed that for each protein, there is a certain contiguous part of the DNA, called a gene, which corresponds to this protein. Genes control which proteins are to be made. Thus, it is appropriate to say that genes control the function of cells.

Perhaps we can view DNA as an exceedingly large computer program containing 3 billion instructions, where each instruction is A, G, C or T. A gene is a small part of the large program which performs a particular function, namely, to control the production of a protein. The unique environment of a cell triggers a particular gene to work.

But how does DNA produce the particular protein for a cell? Certainly, DNA must have a coding method to specify proteins. Note that proteins are built from 20 amino acids. A single base can distinguish only 4 amino acids and a pair of bases only 4 x 4 = 16, which is fewer than 20. DNA therefore employs a triplet scheme, as illustrated in Table 1.2.
                          Second Position of Codon
First
Position   T              C              A              G              Third Position
T          TTT Phe [F]    TCT Ser [S]    TAT Tyr [Y]    TGT Cys [C]    T
           TTC Phe [F]    TCC Ser [S]    TAC Tyr [Y]    TGC Cys [C]    C
           TTA Leu [L]    TCA Ser [S]    TAA Ter [end]  TGA Ter [end]  A
           TTG Leu [L]    TCG Ser [S]    TAG Ter [end]  TGG Trp [W]    G
C          CTT Leu [L]    CCT Pro [P]    CAT His [H]    CGT Arg [R]    T
           CTC Leu [L]    CCC Pro [P]    CAC His [H]    CGC Arg [R]    C
           CTA Leu [L]    CCA Pro [P]    CAA Gln [Q]    CGA Arg [R]    A
           CTG Leu [L]    CCG Pro [P]    CAG Gln [Q]    CGG Arg [R]    G
A          ATT Ile [I]    ACT Thr [T]    AAT Asn [N]    AGT Ser [S]    T
           ATC Ile [I]    ACC Thr [T]    AAC Asn [N]    AGC Ser [S]    C
           ATA Ile [I]    ACA Thr [T]    AAA Lys [K]    AGA Arg [R]    A
           ATG Met [M]    ACG Thr [T]    AAG Lys [K]    AGG Arg [R]    G
G          GTT Val [V]    GCT Ala [A]    GAT Asp [D]    GGT Gly [G]    T
           GTC Val [V]    GCC Ala [A]    GAC Asp [D]    GGC Gly [G]    C
           GTA Val [V]    GCA Ala [A]    GAA Glu [E]    GGA Gly [G]    A
           GTG Val [V]    GCG Ala [A]    GAG Glu [E]    GGG Gly [G]    G

Table 1.2: The Genetic Code Is Read in Triplets.

Remember that DNA has 4 nucleic acid bases: T, G, C and A. Every combination of three bases is called a codon. There are 4 x 4 x 4 = 64 distinct codons and there are only 20 amino acids. Therefore there is a many-to-one relation between codons and amino acids. For instance, both TTT and TTC correspond to amino acid F, and TCT, TCC, TCA and TCG all correspond to amino acid S. Three codons, namely TAA, TAG and TGA, do not correspond to any amino acid. They are terminal (ending) codons which act as control signals.

Let us consider the protein RNase A, a ribonuclease enzyme whose function is to break down RNA. A part of this protein is as follows:

KETAAAKFER

Because the code is many-to-one, several DNA sequences can encode this fragment. One of them is:

AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT

Having introduced the coding scheme employed by DNA, we are now ready to answer another question: which chemical material brings this information to the place where the protein is to be produced? This chemical compound is called RNA.
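The triplet scheme of Table 1.2 can be sketched in a few lines of Python. This is only an illustration: the names BASES, AMINO_ACIDS, CODON_TABLE and translate are chosen here and do not come from the text, and the 64-letter string is simply Table 1.2 written out row by row, with '*' standing for a terminal codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # same ordering as the rows and columns of Table 1.2

# One letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...;
# '*' marks the terminal codons TAA, TAG and TGA.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at a terminal codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "AAAGAAACTGCTGCTGCTAAATTTGAACGT"
print(translate(dna))  # KETAAAKFER

# Count how many codons map to each amino acid (the many-to-one relation).
codons_per_amino_acid = Counter(a for a in CODON_TABLE.values() if a != "*")
print(len(CODON_TABLE), len(codons_per_amino_acid))  # 64 20
```

Building the table from a string rather than typing 64 dictionary entries keeps the sketch short; the counter shows, for example, that serine and leucine are each encoded by six codons.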
1.3 RNA

Ribonucleic acid (RNA) is quite similar to DNA. There are three major differences:

1. In RNA, there is no thymine (T), which exists in DNA. Instead, uracil (U) is found in RNA. In other words, RNA is constructed out of A, G, C and U.
2. DNA has a double helix structure while RNA has only one strand.
3. Unlike DNA, there are different kinds of RNA performing different functions, which will be explained later.

RNA plays an important role in the production of the particular protein which a cell needs. In the following, we give a general outline of how the genetic code in DNA finally produces the protein. The first part of the protein synthesis process is called transcription and consists of the following steps:

1. For every gene in the DNA, there is a region before it which is called a promoter. A promoter indicates that there is a gene ahead.
2. Each cell has a mechanism which recognizes this promoter.
3. The codon ATG (specifying the amino acid methionine) signals the beginning of a gene.
4. An RNA copy of the gene is made, which results in a messenger RNA (mRNA). This mRNA carries one strand of the original gene, except that T is now replaced by U.

The second part of the protein synthesis is called translation and consists of the following steps:

1. Ribosomal RNA (rRNA) molecules combine with tens of specific proteins to form a ribosome, where the protein synthesis takes place. In the process of protein synthesis, the ribosome moves along the mRNA. As it does, each codon of the mRNA is read and a transfer RNA (tRNA) carrying the amino acid corresponding to this codon is called. For instance, when the codon ACU (ACT in the DNA) is read by the ribosome, a tRNA carrying the amino acid threonine is called.
2. An enzyme adds this amino acid to the partially made protein and releases the tRNA.

The above steps are repeated until an ending codon (UAA, UAG or UGA in the mRNA, that is, TAA, TAG or TGA in the DNA) is encountered. The protein synthesis process is then terminated and the mRNA is released.
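The two stages described above, transcription and translation, can be sketched as follows. This is a simplified illustration under stated assumptions: the gene toy_gene is hypothetical, the lookup PARTIAL_CODE covers only the codons that this toy gene uses, and promoter recognition, tRNA and enzymes are not modelled at all.

```python
# Partial codon lookup in the mRNA alphabet; it covers only the codons
# of the hypothetical toy gene below, not the full genetic code.
PARTIAL_CODE = {"AUG": "Met", "AAA": "Lys", "GAA": "Glu", "ACU": "Thr"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def transcribe(gene):
    """Copy one DNA strand into mRNA, replacing T by U."""
    return gene.replace("T", "U")


def synthesize(mrna):
    """Read the mRNA codon by codon, as the ribosome does, until a stop codon."""
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # ending codon: synthesis terminates
        chain.append(PARTIAL_CODE[codon])
    return chain


toy_gene = "ATGAAAGAAACTTAA"  # hypothetical: ATG start, three codons, TAA stop
mrna = transcribe(toy_gene)
print(mrna)              # AUGAAAGAAACUUAA
print(synthesize(mrna))  # ['Met', 'Lys', 'Glu', 'Thr']
```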
The released mRNA will be recycled for later use. The entire process is illustrated in Figure 1.2.

Figure 1.2: Protein Production Process

It can be said that the entire DNA and gene system is like a computer. Let us give the analogies:

1. The copy of the gene carried by the mRNA is like a program, as it contains the instructions for how a particular protein is to be produced.
2. The ribosome is like a CPU, as it reads the copy of the gene carried by the mRNA, which can be considered a program, and also executes it.
3. The tRNA is like the I/O system.
4. The input is the set of amino acids needed to produce the protein.
5. The output is the protein which the gene specifies.

1.4 A Preview of the Book

In this section, we give a preview of the book. Throughout this book, we consider many molecular biology problems from the algorithmic point of view. Therefore in Chapter 2, we give an introduction to algorithms. We introduce the concepts of time complexity, NP-completeness and approximation algorithms. All of these concepts are needed for later discussions.

In Chapter 3, we discuss the string matching problem. As can be easily imagined, it may happen that we have a large collection of DNA sequences or proteins and are given a new DNA sequence or a new protein. We are asked to find out whether this DNA sequence, or protein sequence, exists in our data and, if it does, where it occurs. For reasons which we shall explain later in the book, we use the term "string" instead of "sequence". For example, suppose that we have a string S = GACTTACT and X = ACT. Then X occurs twice in S, starting at locations 2 and 6 respectively. This problem is called the exact string matching problem, and we will introduce efficient algorithms to solve it in that chapter. Since exact matching imposes too strict a condition, we shall also introduce another problem, called the approximate string matching problem.
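The exact string matching example above can be checked with a direct, brute-force search. This is only a sketch to make the problem statement concrete; it is not one of the efficient algorithms that Chapter 3 refers to, and the function name find_occurrences is chosen here for illustration.

```python
def find_occurrences(s, x):
    """Return the 1-based start positions of every occurrence of x in s."""
    positions = []
    for start in range(len(s) - len(x) + 1):
        # Compare x against the window of s beginning at this position.
        if s[start:start + len(x)] == x:
            positions.append(start + 1)  # the text counts locations from 1
    return positions


print(find_occurrences("GACTTACT", "ACT"))  # [2, 6]
```

This approach compares X against every window of S, so in the worst case it takes time proportional to the product of the two lengths.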
In approximate string matching, we allow errors within a certain bound. Suppose that S = ATTCAC and X = ACTT. The approximate string matching problem is to find all occurrences of X in S with up to k errors, where an error is the substitution, deletion or insertion of a character. In our case, if k = 2, the substrings AT, ATT, ATTC and AC are all among the solutions. Efficient algorithms will be introduced for this problem.

In Chapter 4, we shall introduce the sequence alignment problem. Let us consider the following two sequences:

ATTCATTACAACCGCTATG
ACCCATCAACAACCGCTATG

It may appear that these two sequences are quite different. Yet a proper alignment produces the following:

ATTCATTA-CAACCGCTATG
ACCCATCAACAACCGCTATG

After the alignment is made, we can see that these two sequences are quite similar to each other. There are many reasons for us to perform sequence alignment and, given two sequences, there are many different alignments. The sequence alignment problem is to find an optimal alignment. We will also discuss the multiple sequence alignment problem.

In Chapter 4, we shall introduce the evolution tree problem. Let us consider the matrix in Table 1.3. This is a distance matrix. Each entry of the matrix indicates the distance between two species. The smaller the distance is, the more similar they are to each other. In Figure 1.3, there is a tree, which is called an evolution tree.
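Returning to the approximate string matching example above (S = ATTCAC, X = ACTT, k = 2), the notion of "up to k errors" can be made concrete with the standard edit distance, which counts substitutions, deletions and insertions. The sketch below is a brute-force check under that assumption, not the efficient method the book has in mind; the names edit_distance and approximate_matches are illustrative.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def approximate_matches(s, x, k):
    """Return the distinct substrings of s within edit distance k of x."""
    return {
        s[i:j]
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
        if edit_distance(s[i:j], x) <= k
    }


matches = approximate_matches("ATTCAC", "ACTT", 2)
print(sorted(matches))
print({"AT", "ATT", "ATTC", "AC"} <= matches)  # True
```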
#  Species        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
1  Man            0  1 13 17 16 13 12 12 17 16 18 18 19 20 31 33 36 63 56 66
2  Monkey            0 12 16 15 12 11 13 16 15 17 17 18 21 32 32 35 62 57 65
3  Dog                  0 10  8  4  6  7 12 12 14 14 13 30 29 24 28 64 61 66
4  Horse                   0  1  5 11 11 16 16 16 17 16 32 27 24 33 64 60 68
5  Donkey                     0  4 10 12 15 15 15 16 15 31 26 25 32 64 59 67
6  Pig                           0  6  7 13 13 13 14 13 30 25 26 31 64 59 67
7  Rabbit                           0  7 10  8 11 11 11 25 26 23 29 62 59 67
8  Kangaroo                            0 14 14 15 13 14 30 27 26 31 66 58 68
9  Duck                                   0  3  3  3  7 24 26 25 29 61 62 66
10 Pigeon                                    0  4  4  8 24 27 26 30 59 62 66
11 Chicken                                      0  2  8 28 26 26 31 61 62 66
12 Penguin                                         0  8 28 27 28 30 62 61 65
13 Turtle                                             0 30 27 30 33 65 64 67
14 Rattlesnake                                           0 38 40 41 61 61 69
15 Tuna                                                     0 34 41 72 66 69
16 Screw worm                                                  0 16 58 63 65
17 Moth                                                           0 59 60 61
18 Neurospora                                                        0 57 61
19 Saccharomyces                                                        0 41
20 Candida                                                                 0

Table 1.3: Distance Matrix of Species

Figure 1.3: An Evolution Tree of Species of Table 1.3

This tree appropriately reflects the distances in Table 1.3. For instance, the distance between man and monkey is small in the distance matrix. They are also close to each other in the evolution tree. Evolution trees can be constructed under different criteria. For each criterion, given a distance matrix, there is an optimal evolution tree. The techniques for constructing these evolution trees will be discussed in that chapter.

DNA has a double helix structure which is relatively stable. As we mentioned earlier in this chapter, RNA does not have such a structure; it has a single strand. Among the bases of RNA, several kinds of bonds are possible:

1. A ≡ U (Watson-Crick base pair)
2. C = G (Watson-Crick base pair)
3. G - U (Wobble base pair)

Let us consider the following RNA sequence:

AGGCCUUCCU

Figure 1.4 shows six possible structures, and the structure in Figure 1.4 (f) is the best one. All of the structures in Figure 1.4 are called secondary structures.
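The three kinds of base pairs listed above can be turned into a small program that counts how many pairs a folded RNA strand can form at most. The text does not say by which criterion structure (f) is "best", so the sketch below rests on an assumption: it maximizes the number of non-crossing base pairs, using a well-known dynamic programming scheme (often attributed to Nussinov), which may differ from the method of Chapter 5. The names can_pair and max_base_pairs and the parameter min_loop (the minimum number of unpaired bases enclosed by a pair) are illustrative choices.

```python
# Pairs allowed by the list above: A-U and C-G (Watson-Crick), G-U (wobble).
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    """Return True if bases x and y can form one of the listed base pairs."""
    return (x, y) in ALLOWED_PAIRS


def max_base_pairs(rna, min_loop=3):
    """Maximum number of non-crossing base pairs (assumed optimality criterion)."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the answer for the substring rna[i..j]; shorter spans stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if can_pair(rna[k], rna[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3
print(max_base_pairs("AGGCCUUCCU"))
```

Counting pairs is only one possible criterion; energy-based criteria weigh the pairs and loops differently and can prefer a different structure.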
Figure 1.4: Six Possible Secondary Structures of RNA Sequence A–G–G–C–C–U–U–C–C–U (The Dashed Lines Denote the Hydrogen Bonds)

The RNA secondary structure prediction problem will be discussed in Chapter 5. Similar to the RNA secondary structure prediction problem, there is a protein structure prediction problem, which will be introduced in Chapter 6.

The DNA sequence is an extremely long one, consisting of 3 billion A's, G's, C's and T's. When we try to read this long sequence, we have to break it into small pieces and then try to piece them together. Let us imagine that we have the following sequences: AGT, ACT and CTA. Based upon these three strings, we can build one string ACTAGT. This string S has a special property, which can be illustrated as follows:

S1   AGT
S2   ACT
S3   CTA
S    ACTAGT

As can be seen, S is a superstring of each Si. This problem is called a physical mapping problem. There are many physical mapping problems and they will all be discussed in Chapter 7.

1.5 Websites, Books and Journals

Since we are entering the network era, it is suitable for us to give a list of websites related to this field:

1. MIT Biology Hypertextbook: http://esg-www.mit.edu:8001/esgbio/
2. The International Society for Computational Biology: http://www.iscb.org/
3. Bioinformatics and Computational Biology-Related Books: http://www.bioinform.com/framesets/fsbookstore.htm
4. Bioinformatics and Computational Biology-Related Journals: http://www.iscb.org/journals.html
5. Bioinformatics and Computational Biology-Related Conferences: http://www.iscb.org/conferences.html
6. Bioinformatics and Computational Biology Centers: http://www.iscb.org/bioinf.html
7. National Center for Biotechnology Information (NCBI, NLM, NIH): http://www.ncbi.nlm.nih.gov/
8. European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/
9. DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/

There are not many textbooks in the field of computational biology.
The following are two books which are suitable for reference:

1. Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, 1997.
2. Joao Carlos Setubal and Joao Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1996.

The following is a list of journals which will be useful for further research:

1. Bioinformatics
2. Bioinform Newsletter
3. Biotechnology Software & Internet Journal
4. Computers and Biomedical Research
5. Genome Research
6. Human Genome News
7. Journal of Computational Biology
8. Journal of Molecular Biology
9. Journal of Molecular Modeling
10. Nature Biotechnology
11. Nature Genetics
12. Nature Structural Biology
13. Nucleic Acids Research
14. Protein Engineering
15. Protein Science
16. Science
17. The Scientist ("The News Journal for the Life Scientist")