CS 294-8 Computational Biology for Computer Scientists, Spring 2003
Lecture 2: January 23
Lecturer: Gene Myers
Scribe: Ruchira Datta

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

We will begin with a brief survey of molecular biology. It is important for you as computer scientists working in this field to be able to converse with biologists. This way you will be able to formulate your own problems. By the time a problem in computational biology has been formulated in a mathematical way, someone else (the person who formulated it) will already be working on it.

The class website is http://inst.eecs.berkeley.edu/~cs294-8. Links to various resources are on that website. The book Biological Sequence Analysis [DEKM98] is one of the best, but we will not be following any particular textbook in this course.

2.1 The Central Dogma

The central dogma can be stated as follows: Genes are units perpetuating themselves and functioning through their expression in the form of proteins. Now we elaborate the biochemical basis of each part of this statement.

2.2 DNA

Genes are physically embodied in DNA (deoxyribonucleic acid), a long polymer. It consists of two strands winding around each other in a double helix. The outside of each strand is a backbone, constructed of deoxyribose (a type of sugar) and phosphate. On the interior, each unit contains one of the bases: adenine, guanine, cytosine, or thymine (usually referred to by their initials: A, G, C, T). Each base on one strand binds to the corresponding base on the other: adenine binds to thymine, and cytosine binds to guanine. A single monomer of one strand is called a “nucleotide”.

The backbone has a definite orientation. The sugar contains five carbons, which are numbered sequentially 1′, 2′, etc.
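
The pairing rule described above (A with T, C with G) can be sketched as a simple lookup table. This is only an illustration for readers who think in strings: the names PAIR, paired_strand, and is_paired are invented for this sketch, and strand direction is deliberately left out.

```python
# Base pairing as stated in the notes: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def paired_strand(strand: str) -> str:
    """Return the base facing each position of `strand`, position by position.

    Hypothetical helper: strand direction is not modeled here.
    """
    return "".join(PAIR[base] for base in strand.upper())


def is_paired(a: str, b: str) -> bool:
    """True if every base of `a` faces its pairing partner in `b`."""
    return len(a) == len(b) and paired_strand(a) == b.upper()


print(paired_strand("GATTACA"))  # CTAATGT
```

Applying paired_strand twice returns the original string, which reflects the fact that either strand determines the other.
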
Within the backbone, the phosphate group binds to the 5′ carbon of the sugar, and the hydroxyl group binds to the 3′ carbon. So the ends of the strand are identified according to the orientation of the backbone as the 5′ end and the 3′ end. The “forward” direction is 5′ to 3′, by convention. The two complementary strands run in opposite directions.

The bases are aromatic rings, which lie in a single plane. So the bases stack on top of each other like plates. Furthermore, the outside (the backbone) is hydrophilic, and the interior (the bases) is hydrophobic. (Hydrophilic means attracted to water, literally “water-loving”, and hydrophobic means repelled by water, literally “water-fearing”. This characteristic is very significant in chemistry.) So the bonds between the complementary bases (the weakest part of the polymer) are folded inside and protected.

Furthermore, van der Waals forces also stabilize the molecule. This refers to weak electrostatic attraction between hydrogen and other atoms. Basically, when hydrogen is participating in a covalent bond with, say, nitrogen, the bonding electrons are “shared” between the two atoms. But they are not shared equally; they are somewhat closer to the nitrogen than the hydrogen. The nitrogen is “electronegative”, and the hydrogen is “electropositive”. So the hydrogen is weakly attracted to other electronegative atoms, for example a different nitrogen atom participating in other covalent bonds.

All these factors mean that DNA is a very stable molecule. Thus it is a suitable repository for the storage of genetic information. Indeed, one must heat it to 130–140 °F to break it down. Alternatively, one can use a soapy solution to denature the DNA. Soap has a hydrophilic end and a hydrophobic end, so it is capable of dissolving hydrophobic molecules. The hydrophilic end is attracted to water, and the hydrophobic end is attracted to the hydrophobic molecule.

2.3 Genomes

Genomes are usually measured in base pairs (bp).
“Base pair” refers to the complementary nucleotides which make up one unit of the double strand. The smallest genome, which was the first to be sequenced, is that of the λ virus, about 50 kbp (kilobase pairs) long. The genome of the Epstein-Barr virus, a member of the herpesvirus family, is about 120 kbp long. The genome of the small bacterium Mycoplasma genitalium is about 500 kbp long; this is the smallest sequenced genome of an independent organism. The following table summarizes the sizes of several of the known genomes.

Size             Genome
50 kbp           λ (virus)
120 kbp          Epstein-Barr virus
500 kbp          Mycoplasma genitalium
1.8 Mbp          Haemophilus influenzae (a bacterium)
4.7 Mbp          Escherichia coli
10 Mbp           Yeast
30 Mbp           Fungi
120 Mbp          Drosophila
500 Mbp          Rice, Chicken
2.75–2.8 Gbp     Mouse, Human
50 Gbp           Algae

2.4 RNA

RNA (ribonucleic acid) is somewhat similar to DNA, but has a different base, uracil, in place of thymine. It is not as stable as DNA. The stability of DNA makes it suitable for storing the genetic information, whereas RNA is suitable for actually using the genetic information to carry out the functions of the cell. One particular kind of RNA that takes part in translating genetic information into protein, called tRNA (“transfer RNA”), usually comes in short segments, 80–90 nucleotides long.

RNA does not generally consist of two strands running linearly together. A single strand may connect back to itself, forming a hairpin. In this way a secondary structure is formed. This in turn can fold up more, forming complicated tertiary structures. RNA is a very versatile molecule, and is thought to be the initial basis of life.

2.5 Protein

Protein consists of polypeptide chains, i.e., chains of amino acids. Each amino acid consists of an amino group; a carbon bound to a hydrogen atom and to another group, called the “residue”, which varies from one amino acid to another; and a carboxyl group. The carboxyl group of one amino acid binds to the amino group of the next, forming a chain.
The residue and the carbon to which it attaches can rotate about their bond axis with respect to each other. The bond between the carboxyl group of one amino acid and the amino group of the next is rigid and planar. The amino group and the middle carbon-residue group can rotate about their bond axis with respect to each other; this angle is denoted φ. The middle carbon-residue group and the carboxyl group can also rotate about their bond axis with respect to each other; this angle is denoted ψ. A plot of the φ-ψ angles, called a Ramachandran plot, reveals a few types of zones. In an α-helix, amino groups from one turn of the helix bind to carboxyl groups from the next turn. In a β-sheet, the chain folds boustrophedon (back and forth) to form a sheet. Any other structure is called a “coil”. A diagram called a Richardson plot, after Jane Richardson, shows only this structure: the α-helices, β-sheets, and coils.

There are 20 possible residues. These determine how the protein folds up, based on various properties, such as: whether they’re charged or not; whether they’re hydrophilic or hydrophobic; whether they’re relatively large or small; and whether they contain sulfur, which is particularly structurally significant. For example, in 1985 Chris Sander showed how to determine whether a chain will form an α-helix or a β-sheet, based on the hydrogen bonding properties of the residues.

2.6 Transcription

To implement the central dogma, DNA is transcribed into RNA, some of which is then translated into protein. Transcription is begun by an enzyme called RNA polymerase. A portion of the double-stranded DNA molecule is unzipped, and the polymerase runs along a single strand forming a complementary RNA strand. It begins transcription from a promoter site, and ends at a terminator site. The level of transcription may be enhanced or inhibited based on several other sites, known as enhancers. These are studied, for instance, by Mike Levine and Mike Eisen. Much is known about the promoter site, but almost nothing is known about the terminator site.
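
The step in which the polymerase forms a complementary RNA strand can be sketched as a per-base mapping, with uracil taking the place of thymine as noted in the RNA section. This is a minimal sketch under stated assumptions: the name transcribe and the table DNA_TO_RNA are invented here, and promoters, terminators, enhancers, and strand direction are all ignored.

```python
# DNA base on the template strand -> RNA base placed opposite it.
# RNA uses uracil (U) where DNA would use thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template: str) -> str:
    """Build the RNA strand complementary to a DNA template, base by base.

    Hypothetical helper: regulation and strand direction are not modeled.
    """
    return "".join(DNA_TO_RNA[base] for base in template.upper())


print(transcribe("TACGGT"))  # AUGCCA
```
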
Termination could be stochastic; that is, transcription ends whenever the polymerase falls off. The action of the enhancers could be quite complex. It could depend on whether the portion of the DNA where the enhancer lies is even unpacked and accessible or not. It is likely to be much more complicated than Boolean circuits, which are sometimes used to model genetic regulatory networks.

The transcribed piece of RNA includes regions called introns, which are snipped out. The remaining pieces, called exons, are spliced together and kept. The 5′ end of an exon always ends in a particular 2 bp sequence, the 5′ splice site, and the 3′ end of an exon always ends in another particular 2 bp sequence, the 3′ splice site. The splicing process takes place in the spliceosome and involves snRNPs (small nuclear ribonucleoproteins), complexes of RNA and protein, such as U1, U2, U4, U5, and U6. The molecular machinery grabs a particular site in the middle of the intron, called the lariat site, and pulls the intron out into a horseshoe. Also, the 5′ splice site of one exon and the 3′ splice site of another exon are brought together. The intron forms a loop which is cut off, and the exons are spliced. Some introns seem to splice themselves out. The splicing mechanism is still under study.

2.7 Translation

RNA is translated into protein in the ribosomes. The strand of RNA to be translated begins with a ribosomal binding site whose starting nucleotides are TATA. Hence it is known as the TATA box. This is followed by a sequence of 3 bp codons. Each codon is translated to a particular amino acid in the polypeptide chain. A stop codon does not code for an amino acid but signals the end of the polypeptide chain. After the stop codon is a long string of A’s (AAA...). These were not present in the original DNA from which the RNA strand was transcribed, but were added later.

We can see why the codons are 3 bp long. We have 4 bases, or “letters”, in our alphabet.
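
The counting behind the codon length can be checked with a few lines of code. This is a small sketch, assuming only what the notes state: an alphabet of 4 letters, 20 residues, and one additional “stop” signal. The function name min_codon_length is invented for the illustration.

```python
def min_codon_length(alphabet_size: int, meanings: int) -> int:
    """Smallest word length n such that alphabet_size ** n >= meanings."""
    if alphabet_size < 2:
        raise ValueError("need at least two letters to encode anything")
    n = 1
    while alphabet_size ** n < meanings:
        n += 1
    return n


# 20 residues plus one "stop" signal, written with 4 letters.
print(min_codon_length(4, 20 + 1))  # 3

# Number of distinct words for each candidate codon length.
for n in (1, 2, 3):
    print(n, 4 ** n)  # 1 4, then 2 16, then 3 64
```
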
We have to code for 20 different residues (as well as for “stop”), so we need at least 3 bp, since 2 bp could only code for 4^2 = 16 possibilities, while 3 bp give 4^3 = 64. However, most residues are in effect encoded by only 2 bp. That is, the first 2 bp determine the residue, and the last position is a “don’t care”. So in these cases four different codons code for the same residue.

2.8 Other Issues

We have been describing the cellular machinery of eukaryotes, that is, those organisms whose cells have a “true nucleus”. Prokaryotes (such as bacteria) do not have nuclei. Moreover, their DNA includes no introns. This may help them evolve faster by causing more variation among their genotypes. Eukaryotic genomes, by contrast, are more robust and stable.

A particular region of DNA does not always break down into exons in the same way. There may be alternate splicings of the same region. For example, a stretch of DNA may consist of regions I, II, III, and IV, separated by introns. Sometimes regions I, II, and IV might be spliced together, to make up form 1. At other times regions II, III, and IV might be spliced together to make up form 2. So far it appears that the average number of alternate splicings of any given region is three or four. Some regions can only be spliced in one way. The maximum number of alternate splicings of a given region that has been observed so far is twelve. Often the alternate splicings consist of different suffixes.

The existence of alternate splicings makes it somewhat problematic to define genes. If a region has two alternate splicings, does it contain one gene or two? Furthermore, the mechanism for choosing among alternate splicings is unknown. It may be through the promoters.

About 97% of the human genome is intergenic, that is, it consists of DNA between the regions which are transcribed. The function of this portion of the DNA is not yet clear. It doesn’t fit neatly within the framework of the Central Dogma, which has led some to call it “junk”.
Alternatively, it might be used for signaling and control of gene expression. Or it might be present to promote variation. The basic mechanism of evolution is variation and selection, so this “extra” genetic material may exist to promote evolvability.

The process of translation is carried out in the ribosome, which is basically a protein factory. It is a molecular machine: the tRNA fits in one hole, and the other part of the ribosome walks along forming the polypeptide chain. This is the fundamental example of nanotechnology in action.

Experiment has shown that if a protein is heated to the point where it becomes denatured (that is, it loses its shape and becomes a long strand), and then allowed to cool, it folds back into the correct shape. Thus, the sequence of residues determines the shape of the protein, and therefore its function. This is an important property for us as computer scientists, since we prefer to think of the protein as a string of “letters” from the amino acid “alphabet”. Thus we can apply the familiar tools of combinatorics and discrete mathematics. However, this is only true of about 95% of proteins. A few proteins require the presence of other “chaperone” proteins in order to fold into their correct shape. As is often the case in biology, we must always keep in mind that the assumptions we make for convenience of mathematical modeling do not hold in all cases.

By weight, the cell is 70% water. But it is not a sack or a test tube through which everything can diffuse uniformly! It contains many internal structures which play an important role in impeding, enhancing, or constraining transport.
The remaining 30% (the dry weight of the cell) is constituted as follows:

Share    Component
 1%      small metabolites (e.g., sugar, fuel)
20%      cell wall, internal membrane (ER: endoplasmic reticulum)
42%      soluble protein
 3%      tRNA
30%      ribosome
 2%      mRNA
 2%      DNA

Notice that fully 33%, or one third, of the dry weight of the cell is devoted to tRNA and the ribosomes, that is, to the machinery for making proteins. So the cell is always dynamic, never quiescent.

Next time we will discuss recombinant DNA techniques. We will show how various enzymes can be used to operate on DNA.

References

[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge University Press, Cambridge, 1998.