CS 294-8 Computational Biology for Computer Scientists
Spring 2003
Lecture 2: January 23
Lecturer: Gene Myers
Scribe: Ruchira Datta
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
We will begin with a brief survey of molecular biology. It is important for you as computer scientists working
in this field to be able to converse with biologists. This way you will be able to formulate your own problems.
By the time a problem in computational biology has been formulated in a mathematical way, someone else
(the person who formulated it) will already be working on it.
The class website is http://inst.eecs.berkeley.edu/~cs294-8. Links to various resources are on that
website. The book Biological Sequence Analysis [DEKM98] is one of the best, but we will not be following
any particular textbook in this course.
2.1 The Central Dogma
The central dogma can be stated as follows:
Genes are units perpetuating themselves and functioning through their expression in the form of proteins.
Now we elaborate the biochemical basis of each part of this statement.
2.2 DNA
Genes are physically embodied in DNA (deoxyribonucleic acid), a long polymer. It consists of two strands
winding around each other in a double helix. The outside of each strand is a backbone, constructed of deoxyribose
(a type of sugar) and phosphate. On the interior, each unit contains one of four bases: adenine,
guanine, cytosine, or thymine (usually referred to by their initials: A, G, C, T). Each base on one
strand binds to the corresponding base on the other: adenine binds to thymine, and cytosine binds
to guanine. A single monomer of one strand is called a “nucleotide”.
The backbone has a definite orientation. Deoxyribose has five carbons, which are numbered
sequentially 1′, 2′, etc. The phosphate group binds to the 5′ carbon, and the hydroxyl group binds to the
3′ carbon. So the ends of the strand are identified according to the orientation of the backbone as the 5′ end
and the 3′ end. The “forward” direction is 5′ to 3′, by convention. The two complementary strands run in
opposite directions.
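The pairing and orientation rules above can be sketched in a few lines of Python. This is only an illustration, not something presented in the lecture; the name reverse_complement and the sample sequence are hypothetical.

```python
# Illustrative sketch: base pairing plus the opposite orientation of the two strands.
# The function name and the sample sequence are hypothetical.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand):
    """Return the complementary strand, written 5' to 3'."""
    # Complement every base, and reverse the order because the
    # complementary strand runs in the opposite direction.
    return "".join(COMPLEMENT[base] for base in reversed(strand))

forward = "ATGCGT"                   # written 5' to 3'
print(reverse_complement(forward))   # ACGCAT
```

Applying the function twice returns the original strand, which is one quick way to check the pairing table.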
The bases contain aromatic rings, which lie in a single plane. So the bases stack on top of each other like plates.
Furthermore, the outside (the backbone) is hydrophilic, and the interior (the bases) is hydrophobic.
(Hydrophilic means attracted to water, literally “water-loving”, and hydrophobic means repelled by water,
literally “water-fearing”. This characteristic is very significant in chemistry.) So the bonds between the
complementary bases (the weakest part of the polymer) are folded inside and protected.
Furthermore, weak electrostatic attractions between hydrogen and other atoms (hydrogen bonds) also stabilize
the molecule. Basically, when hydrogen is participating in a covalent bond with,
say, nitrogen, a pair of electrons is “shared” between the two atoms. But it is not shared equally; it is
somewhat closer to the nitrogen than the hydrogen. The nitrogen is “electronegative”, and the hydrogen is
“electropositive”. So the hydrogen is weakly attracted to other electronegative atoms, for example a different
nitrogen atom participating in other covalent bonds.
All these factors mean that DNA is a very stable molecule. Thus it is a suitable repository for the storage
of genetic information. Indeed, one must heat it to 130–140 °F to break it down. Alternatively, one can use
a soapy solution to denature the DNA. Soap has a hydrophilic end and a hydrophobic end, so it is capable
of dissolving hydrophobic molecules. The hydrophilic end is attracted to water, and the hydrophobic end is
attracted to the hydrophobic molecule.
2.3 Genomes
Genomes are usually measured in base pairs (bp). “Base pair” refers to the complementary nucleotides
which make up one unit of the double strand. The smallest genome, which was the first to be sequenced, is
that of the λ virus, about 50 kbp (kilobasepairs) long. The genome of the Epstein-Barr virus, which causes
herpes, is about 120 kbp long. The genome of the small bacterium Mycoplasma genitalium is about 500 kbp
long; this is the smallest sequenced genome of an independent organism. The following table summarizes
the sizes of several of the known genomes.
50 kbp          λ (virus)
120 kbp         Epstein-Barr virus
500 kbp         Mycoplasma genitalium
1.8 Mbp         Haemophilus influenzae (bacterium)
4.7 Mbp         Escherichia coli
10 Mbp          Yeast
30 Mbp          Fungi
120 Mbp         Drosophila
500 Mbp         Rice, Chicken
2.75–2.8 Gbp    Mouse, Human
50 Gbp          Algae
2.4 RNA
RNA (ribonucleic acid) is somewhat similar to DNA, but has a different base, uracil, in place of
thymine. It is not as stable as DNA. The stability of DNA makes it suitable for storing the genetic information, whereas RNA is suitable for actually using the genetic information to carry out the functions of the
cell. The kind of RNA which is translated into protein is called mRNA (“messenger RNA”). Another kind, tRNA (“transfer RNA”), which carries amino acids during translation, usually
comes in short segments, 80–90 nucleotides long.
RNA does not generally consist of two strands running linearly together. A single strand may connect back
to itself, forming a hairpin. In this way a secondary structure is formed. This in turn can fold up more,
forming complicated tertiary structures. RNA is a very versatile molecule, and is thought to be the initial
basis of life.
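A hairpin can be illustrated with a small Python sketch that walks inward from both ends of a strand and counts how many bases pair up. It is a toy check, not a structure-prediction method: it assumes only A-U and G-C pairs, and the function name and the minimum loop length of 3 are hypothetical choices made for the example.

```python
# Toy sketch of a hairpin: the 5' end of a strand pairs with its own 3' end.
# Assumptions: only A-U and G-C pairs count; min_loop=3 is an arbitrary choice.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def hairpin_stem_length(rna, min_loop=3):
    """Count the paired bases in the stem formed by the two ends of rna."""
    stem = 0
    i, j = 0, len(rna) - 1
    # Walk inward while the end bases are complementary and at least
    # min_loop unpaired bases would remain between them to form the loop.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        stem += 1
        i += 1
        j -= 1
    return stem

print(hairpin_stem_length("GGGAAAUCCC"))  # 3
```

Here GGG pairs with CCC and the four bases AAAU are left over as the loop.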
2.5 Protein
Protein consists of polypeptide chains, i.e., chains of amino acids. Each amino acid consists of an amino
group; a central carbon bound to a hydrogen atom and to another group, called the “residue” (or side chain), which varies from one
amino acid to another; and a carboxyl group. The carboxyl group of one amino acid binds to the amino
group of the next, forming a chain.
The residue and the carbon to which it attaches can rotate about their bond axis with respect to each
other. The amino-carboxyl binding (the peptide bond) is rigid and planar. The amino group and the central carbon-residue
group can rotate about their bond axis with respect to each other; this angle is denoted φ. The central
carbon-residue group and the carboxyl group can also rotate about their bond axis with respect to each other; this
angle is denoted ψ.
A plot of the φ-ψ angles, called a Ramachandran plot, reveals a few types of zones. In an α-helix, amino
groups from one turn of the helix form hydrogen bonds with carbonyl groups from the next turn. In a β-sheet, the chain folds
boustrophedon to form a sheet. Any other structure is called a “coil”. A diagram called a Richardson plot,
after Jane Richardson, shows only this structure: the α-helices, β-sheets, and coils.
There are 20 possible residues. These determine how the protein folds up, based on various properties, such
as: whether they’re charged or not; whether they’re hydrophilic or hydrophobic; whether they’re relatively
large or small; and whether they contain sulfur, which is particularly structurally significant. For example,
in 1985 Chris Sanders showed how to determine whether a chain will form an α-helix or a β-sheet, based on
the hydrogen bonding properties of the residues.
2.6 Transcription
To implement the central dogma, DNA is transcribed into RNA, some of which is then translated into protein.
Transcription is carried out by an enzyme called RNA polymerase. A portion of the double-stranded DNA molecule
is unzipped, and the polymerase runs along a single strand, forming a complementary RNA strand. It begins
transcription from a promoter site, and ends at a terminator site. The level of transcription may be enhanced
or inhibited based on several other sites, known as enhancers. These are studied, for instance, by Mike Levine
and Mike Eisen.
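As a minimal sketch of the complementarity involved, the following Python maps a DNA template strand to the RNA strand that would be formed along it. The name transcribe and the sample template are hypothetical, and promoters, terminators, and enhancers are ignored entirely.

```python
# Minimal sketch of transcription as complementary copying.
# Assumption: the template is written 3' to 5', so the RNA comes out 5' to 3'.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template):
    """Return the RNA strand complementary to a DNA template strand."""
    # RNA uses uracil (U) where DNA would use thymine (T).
    return "".join(DNA_TO_RNA[base] for base in template)

template = "TACCGGATT"        # hypothetical template, written 3' to 5'
print(transcribe(template))   # AUGGCCUAA
```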
Much is known about the promoter site, but almost nothing is known about the terminator site. It could
be stochastic—that is, transcription ends whenever the polymerase falls off. The action of the enhancers
could be quite complex. It could depend on whether the portion of the DNA where the enhancer lies is even
unpacked and accessible or not. It is likely to be much more complicated than Boolean circuits, which are
sometimes used to model genetic regulatory networks.
The transcribed piece of RNA includes regions called introns, which are snipped out. The remaining pieces,
called exons, are spliced together and kept. The 5′ end of an intron almost always begins with a particular 2 bp sequence,
the 5′ splice site, and the 3′ end of an intron almost always ends in another particular 2 bp sequence, the 3′ splice
site.
The splicing process takes place in the spliceosome and involves snRNPs (small nuclear ribonucleoproteins),
complexes of RNA and protein, such as U1, U2, U4, U5, and U6. The molecular machinery grabs a particular
site in the middle of the intron, called the lariat site, and pulls the intron out into a horseshoe. Also, the
5′ splice site at one end of the intron and the 3′ splice site at the other end are brought together. The intron forms a
loop which is cut off, and the exons are spliced. Some introns seem to splice themselves out. The splicing
mechanism is still under study.
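Viewed purely as a string operation, splicing removes the intron intervals and joins what is left. The sketch below assumes the intron positions are already known; the function name splice and the sample sequences (including an intron that starts with GU and ends with AG) are hypothetical, and nothing here models the spliceosome itself.

```python
# String-level sketch of splicing: cut out known introns, join the exons.
# The function name, coordinates, and sequences are hypothetical.
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) index pairs, end exclusive."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)

exon1, intron, exon2 = "AUGGCC", "GUAAGUUUCAG", "AAAUAA"
pre_mrna = exon1 + intron + exon2
start = len(exon1)
mature = splice(pre_mrna, [(start, start + len(intron))])
print(mature)  # AUGGCCAAAUAA
```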
2.7 Translation
RNA is translated into protein in the ribosomes. The strand of RNA to be translated begins with a ribosomal
binding site. (The TATA box, named for its starting nucleotides TATA, is a site in the DNA promoter rather than in the RNA.) The binding site is followed by
a sequence of 3 bp codons. Each codon is translated to a particular amino acid in the polypeptide chain.
A stop codon does not code for an amino acid but signals the end of the polypeptide chain. After the stop
codon is a long string of A's (AAA...). These were not present in the original DNA from which the RNA strand
was transcribed, but were added later.
We can see why the codons are 3 bp long. We have 4 bases, or “letters” in our alphabet. We have to
code for 20 different residues (as well as for “stop”), so we need at least 3 bp, since 2 bp could only code for
4² = 16, whereas 3 bp can code for 4³ = 64. However, many residues are effectively encoded by only 2 bp. That is, the first 2 bp determine the residue,
and the last position is a “don’t care”. In those cases four different codons code for the same residue.
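The counting argument can be checked directly in Python. The notes do not list the codon table, so the sketch below supplies the standard genetic code as an assumption, written as one letter per codon in UCAG order with "*" for stop; the names translate and dont_care_prefixes are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Assumption: the standard genetic code, one letter per codon in UCAG order.
# "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(rna):
    """Translate codon by codon from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

def dont_care_prefixes():
    """Two-base prefixes whose four codons all code for the same residue."""
    return [a + b for a, b in product(BASES, repeat=2)
            if len({CODON_TABLE[a + b + c] for c in BASES}) == 1]

print(4 ** 2, 4 ** 3)             # 16 64
print(translate("AUGGCCUAA"))     # MA
print(len(dont_care_prefixes()))  # 8
```

Under this table, 8 of the 16 possible two-base prefixes fix the residue regardless of the third position.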
2.8 Other Issues
We have been describing the cellular machinery of eukaryotes, that is, those organisms whose cells have
a “true nucleus”. Prokaryotes (such as bacteria) do not have nuclei. Moreover, their DNA
includes no introns. This may help them evolve faster by causing more variation among their genotypes.
Eukaryotic genomes, by contrast, are more robust and stable.
A particular region of DNA does not always break down into exons in the same way. There may be alternate
splicings of the same region. For example, a stretch of DNA may consist of regions I, II, III, and IV, separated
by introns. Sometimes regions I, II, and IV might be spliced together, to make up form 1. At other times
regions II, III, and IV might be spliced together to make up form 2. So far it appears that the average
number of alternate splicings of any given region is three or four. Some regions can only be spliced in one
way. The maximum number of alternate splicings of a given region that has been observed so far is twelve.
Often the alternate splicings consist of different suffixes.
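The two forms in the example above can be written out as a short Python sketch. The sequences assigned to regions I through IV are hypothetical placeholders, as is the name splice_form; only the choice of regions for each form comes from the example.

```python
# Sketch of the alternate splicing example: the same four regions,
# joined in two different ways. All sequences here are hypothetical.
REGIONS = {"I": "AUGGCU", "II": "GAAUUC", "III": "CCGGGA", "IV": "UUGUAA"}
ORDER = ["I", "II", "III", "IV"]

def splice_form(included):
    """Join the chosen regions in their order along the DNA."""
    return "".join(REGIONS[name] for name in ORDER if name in included)

form1 = splice_form({"I", "II", "IV"})
form2 = splice_form({"II", "III", "IV"})
print(form1)  # AUGGCUGAAUUCUUGUAA
print(form2)  # GAAUUCCCGGGAUUGUAA
```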
The existence of alternate splicings makes it somewhat problematic to define genes. If a region has two
alternate splicings, does it contain one gene or two? Furthermore, the mechanism for choosing among
alternate splicings is unknown. It may be through the promoters.
About 97% of the human genome is intergenic, that is, it consists of DNA between the regions which are
transcribed. The function of this portion of the DNA is not yet clear. It doesn’t fit neatly within the
framework of the Central Dogma, which has led some to call it “junk”. Alternatively, it might be used for
signaling and control of gene expression. Or it might be present to promote variation. The basic mechanism
of evolution is variation and selection, so this “extra” genetic material may exist to promote evolvability.
The process of translation is carried out in the ribosome, which is basically a protein factory. It is a molecular
machine: the tRNA fits in one hole, and the other part of the ribosome walks along forming the polypeptide
chain. This is the fundamental example of nanotechnology in action.
Experiment has shown that if a protein is heated to the point where it becomes denatured (that is, it loses
its shape and becomes a long strand), and then allowed to cool, it folds back into the correct shape. Thus,
the sequence of residues determines the shape of the protein, and therefore its function. This is an important
property for us as computer scientists, since we prefer to think of the protein as a string of “letters” from the
amino acid “alphabet”. Thus we can apply the familiar tools of combinatorics and discrete mathematics.
However, this is only true of about 95% of proteins. A few proteins require the presence of other “chaperone”
proteins in order to fold into their correct shape. As is often the case in biology, we must always keep in
mind that the assumptions we make for convenience of mathematical modeling do not hold in all cases.
By weight, the cell is 70% water. But it is not a sack or a test tube through which everything can diffuse
uniformly! It contains many internal structures which play an important role in impeding, enhancing, or
constraining transport. The remaining 30% (the dry weight of the cell) is constituted as follows:
1%     small metabolites (e.g., sugar, fuel)
20%    cell wall, internal membrane (ER: endoplasmic reticulum)
42%    soluble protein
3%     tRNA
30%    ribosome
2%     mRNA
2%     DNA
Notice that fully 33%, or one third, of the dry weight of the cell is devoted to tRNA and the ribosomes, that
is, to the machinery for making proteins. So the cell is always dynamic, never quiescent.
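The arithmetic behind these figures is easy to verify. The short Python check below uses only the percentages given above; the variable names are hypothetical.

```python
# Check of the dry-weight breakdown, using the percentages from the notes.
DRY_WEIGHT_PERCENT = {
    "small metabolites": 1,
    "cell wall, internal membrane": 20,
    "soluble protein": 42,
    "tRNA": 3,
    "ribosome": 30,
    "mRNA": 2,
    "DNA": 2,
}

total = sum(DRY_WEIGHT_PERCENT.values())
protein_machinery = DRY_WEIGHT_PERCENT["tRNA"] + DRY_WEIGHT_PERCENT["ribosome"]
print(total)              # 100
print(protein_machinery)  # 33

# Dry weight is 30% of the cell, so as a share of total cell weight:
print(30 * protein_machinery / 100)  # 9.9
```

So the protein-making machinery is about a third of the dry weight, or roughly a tenth of the whole cell by weight.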
Next time we will discuss recombinant DNA techniques. We will show how various enzymes can be used to
operate on DNA.
References
[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge University Press, Cambridge, 1998.