Bacterial Genetics and Molecular Biology - a Genomics Perspective (Ch. 1)
Trudy M. Wassenaar, David W. Ussery

Chapter 1. What is DNA?
It is widely accepted that DNA represents a blueprint for life. Indeed, DNA stores much of the
information a living cell requires to function, and that information is stored as physical and
chemical structures. The biological information of DNA comes from the sequence of the four
building blocks that contain the bases A, C, G, and T. The order of the bases determines
which structures are or can be formed, and it is these structures that in turn dictate biological
function. DNA is not a complicated, miraculous substance, but rather a relatively simple
polymer, with humble origins and some interesting properties that make life possible.
1.1. The ancient and humble origin of DNA
In the beginning, there was the big bang. From physical evidence it can be deduced that this
represents the creation of the universe in which we live. At first, the products of this
unimaginable event were mostly (75%) hydrogen, and the remainder (about 25%) helium
isotopes, the first and only atoms to be formed at this stage. The extreme temperatures of the
young universe did not allow more complex atoms to be stable. Eventually, atoms began to
clump and condense due to gravity, causing an implosion of gaseous stars that, in turn, by
way of nuclear fusion, created the heavier atoms which are known today. Included in these
heavier atoms are (in decreasing order of abundance) oxygen, carbon, neon and nitrogen.
Carbon may be present as carbon monoxide (CO), carbon dioxide (CO2), acetylene (C2H2) or
methane (CH4) gas; nitrogen can be found as nitrogen gas (N2) or ammonia gas (NH3),
whereas hydrogen and oxygen combine to form water (H2O). Under the influence of light,
methane and nitrogen gas can spontaneously form hydrogen cyanide (2CH4 + N2 → 2HCN +
3H2). These gases are ubiquitous in the universe, as we know from their telltale spectra,
detected by radio frequency spectroscopy.
From these simple gases, more complex molecules can be produced during reactions that
are driven by high energy and/or light. In a famous experiment performed in the 1950s,
Stanley Miller showed that a mixture of methane, ammonia and water, exposed to a high
voltage discharge (simulating lightning) could produce organic compounds such as formic
acid, lactic acid, adenine, glycine and alanine. All of these are common components in a
living cell, but as Miller showed, they can also be produced chemically, and it is likely that
such reactions took place on Earth when the planet was still young. In fact, the amino acid
glycine is quite common in the universe.
More complex chemicals can also be found in the universe. The mixture of primeval gases
of a planet can react together, under the influence of light from the stars, to form purines such
as adenine, a chemical with alkaline properties in solution, meaning that it is a base. Adenine
can be produced from hydrogen cyanide in a multistep reaction in the presence of ammonia,
which serves as a catalyst. It is a planar molecule, as can be seen in Figure 1.1. Reactive gases
can form substantive quantities of adenine, in the presence of light. Likewise, pyrimidines,
sugars, and simple lipids can all be formed chemically. Formaldehyde (CH2O) is one of the
crucial reactants to produce amino acids, and formaldehyde is still being produced by
photochemical reactions in our present atmosphere, which is why it can be detected at low
concentrations in rain.
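The two chemical statements made so far can be checked by simple atom bookkeeping. The following Python sketch is only an illustration: it verifies that the hydrogen cyanide reaction quoted above (2CH4 + N2 → 2HCN + 3H2) is balanced, and that five HCN molecules contain exactly the atoms of one adenine. The molecular formula of adenine, C5H5N5, is not given in the text and is supplied here as an assumption of the example; the function names are made up for this illustration.

```python
import re
from collections import Counter


def count_atoms(formula, multiplier=1):
    """Count the atoms in a simple formula such as 'CH4' or 'C5H5N5' (no brackets)."""
    atoms = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        atoms[element] += (int(number) if number else 1) * multiplier
    return atoms


def side_total(side):
    """Add up the atoms of one side, given as (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        total += count_atoms(formula, coefficient)
    return total


def is_balanced(reactants, products):
    """True if both sides of the reaction contain the same atoms."""
    return side_total(reactants) == side_total(products)


# 2CH4 + N2 -> 2HCN + 3H2, as quoted in the text
hcn_formation = is_balanced([(2, "CH4"), (1, "N2")], [(2, "HCN"), (3, "H2")])

# 5 HCN -> adenine; C5H5N5 is an assumption of this example, not from the text
adenine_from_hcn = is_balanced([(5, "HCN")], [(1, "C5H5N5")])

print(hcn_formation, adenine_from_hcn)
```

Both checks succeed, which shows that no atoms other than those of hydrogen cyanide are needed to account for adenine. The bookkeeping says nothing about the intermediate steps of the reaction.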
Figure 1.1. Schematic representation of some important events following the Big Bang. At
the top, the formation of the first two elements, hydrogen and helium is shown. Heavier
elements were formed by nuclear fusion during implosion of stars. The abundant elements
shown (H, C, O, and N) can spontaneously form simple molecules. At the bottom, the
formation of adenine from five hydrogen cyanide molecules is shown, a reaction that takes
place in the presence of ammonia. A schematic of five stacked adenine molecules in water is
also shown; adenine spontaneously forms a helix in solution.
Water vapor is commonly found in the Universe, and on planets with permissive temperatures
and gravitational forces, which are required to prevent evaporation into space, it may even be
present in liquid form. This is important as the origin of life depends on a solvent. The
reactive gases that can build complex molecules can also destroy them. However, when these
complex compounds dissolve in a solvent, they are protected against such destruction and
their concentration may build up. Take, for instance, adenine, which is not very soluble in
water. In liquid water, these hydrophobic planar base molecules tend to combine together by
stacking one on top of the other. The most stable structure for such stacks is formed when
each molecule is slightly turned compared to the one below. When adenine is mixed in water,
a helical structure forms spontaneously that reminds us of the DNA helix, which of course is
formed by the stacking of adenine, guanine, cytosine, and thymine, all planar bases. In DNA,
these four bases are stacked just like multiple loose adenine molecules, though they do so in
pairs: a purine is always paired with a pyrimidine, so that adenine is always combined with
thymine and guanine always pairs with cytosine. Moreover, in DNA these pairs of bases are
not just loose stacks, but are connected to each other via a backbone composed of a ribose
sugar and phosphate. Still, the spontaneous formation of adenine, and its ability to form
helical stacks in liquid water, illustrates how the basic building blocks for life can be found in
the universe, produced from abiotic gases that obey the laws of chemistry and physics.
All cells are made up of four types of biopolymers
All living cells contain carbohydrates (sugars), amino acids, lipids, and nucleotides, and these
in turn form polymeric structures, or biopolymers, of which all cells are composed. Box 1.1
explains why the molecules of living organisms are mainly made of only a few elements.
Proteins are made up of chains of amino acids, some of which are spontaneously formed in a
pre-biotic world. The sugar molecules in a cell are also frequently interconnected into chains
of carbohydrates, the second type of polymers. Sugars are also frequently attached to other
compounds, such as proteins or lipids. The lipids are mainly found in the membrane, which
separates the inside of a cell from the outside world.
Lipids, the third type of biopolymer, are not very soluble in water, and can form
membrane-like structures if mixed with water. Usually the lipids will form film layers, but
fatty acids (hydrocarbons with a carboxyl-group, COOH, on one end) will spontaneously
form vesicles in water: multiple molecules align into a spherical structure so that the
carboxyl-group faces out and the hydrophobic lipid tail is shielded away from the polar
water molecules. A cell membrane is composed of two such layers, since both inside and
outside of a cell are watery environments. Such vesicles may have captured organic
molecules, forming a structure that starts to resemble a cell. However, the jump from these
abiotic vesicles to self-replicating, living cells is a giant step that hasn't been fully understood.
Probably the precursors of present-day cells were much simpler. The fourth type of
biopolymer is nucleic acids, to which DNA and RNA belong. These are chains of nucleotides.
Information Box 1.1: Why are the molecules of life built of so few different elements?
All living organisms are made of similar chemical building blocks, which are mainly composed
of six elements: carbon and hydrogen form hydrocarbons; add oxygen and carbohydrates
(Cn(H2O)n) can be formed. These three elements together with nitrogen can build most (but not
all) amino acids, along with the bases of nucleic acids. Add phosphate, a component of the
DNA and RNA backbone, and sulfur, which is needed for the amino acids methionine and
cysteine, and DNA, RNA, and most proteins can be made. Other elements, mostly metals, are
found in trace amounts only.
The position of the elements in the periodic table predicts their chemical and physical properties
such as the number of atomic bonds their atoms can form, and also the strength of these bonds.
In the periodic table, the noble gases (starting with helium, neon and argon) are in the right-hand
column. Their chemical properties are dictated by the outer orbitals of their atoms, which
are fully occupied with electrons. As a result, they rarely react with other atoms. The elements
of the second row in the periodic table that are abundant in biological polymers - carbon,
nitrogen, and oxygen - are also abundant in the universe: the third and fourth most abundant
elements (after hydrogen and helium) are oxygen and carbon. Nitrogen is the 7th most abundant
element in the universe, and also a main ingredient in our atmosphere, making up more than
75% of the air we breathe. Carbon (with 4 outer electrons) can form very strong bonds to itself
or to other molecules, leading to long and stable chains of C-C bonds (as in hydrocarbons), and
is found in millions of different types of compounds. In contrast, silicon (located in the row
beneath carbon, also with 4 outer electrons) is mostly found in the form of silicon dioxide
(present in rocks, sand or glass), which is quite stable and unreactive, and can't form polymers
that are stable.
Sheer abundance and reactivity are two reasons why biological molecules are mainly composed
of the elements from the top of the periodic table: H, C, N, and O. There are exceptions to this
rule: Phosphorus is an essential element in biological material but it is positioned in the periodic
table under N; phosphate bonds are often highly reactive, which is why their formation or
breakage is essential in the formation or degradation of many biological molecules. How
phosphorus got incorporated into early organic molecules is still a mystery, although it should
be noted that phosphate occurs in many mineral compounds, and might have been involved in
early proto-cells. Sulfur is also relatively abundant in the earth, and plays an important role in
some biochemical reactions.
In addition to the six main elements, other elements are important for life, in particular the
cations (positively-charged ions) sodium and potassium. Lithium and beryllium, though
positioned in the periodic table in the same row as C, N and O, are rarely found in biological
material. Instead, sodium (Na+, for natrium), and potassium (K+, for kalium) take the place of
lithium cations, whereas magnesium (Mg2+) and calcium (Ca2+) are typically used for bivalent
cations. Aluminum (Al3+) and chlorine (Cl-) are favored over boron and fluorine, respectively.
The reason is the abundance of these elements; aluminum is the third most abundant element in
the earth's crust; in comparison, lithium, beryllium, boron and fluorine are relatively rare on the
earth. The high abundance of iron may be the reason why this metal is also commonly attached
to biopolymers.
A final criterion is solubility: life depends on a solvent, which on Earth is water. Silicon is
insoluble in water. One might hypothesize life forms being formed in other solvents, such as
liquid hydrogen fluoride (HF), methane or ammonia. However, since fluorine is rare, liquid HF
is not common on planets. Charged ions are not very soluble in methane, which furthermore has
a limited ability to store heat, two factors that disqualify this abundant solvent as a support for
life. Water has a very high heat-storage capacity, which dampens extreme temperature swings
on our planet surface, a factor that probably strongly contributed to the formation of life. The
only solvent with similar properties to water is ammonia, although it has a narrower temperature
range for its liquid phase.
DNA is an optimal information carrier
There are three fundamental characteristics essential to all modern living cells: (1.) All living
cells contain DNA as the genetic information storage molecule, and replicate this DNA to
donate a copy to their offspring. (2.) All living cells produce proteins from DNA via the
intermediate molecule RNA. (3.) All living cells mainly rely on proteins to carry out
necessary chemical reactions, including the production of DNA and RNA. Proteins, in the
form of enzymes, are also responsible for the biosynthesis or modification of lipids and
carbohydrates.
In a cell, DNA replication depends on proteins just as protein synthesis depends on DNA.
Both require RNA molecules as essential components of these production processes.
Likewise, RNA can only be produced in the presence of both DNA and proteins. In all living
cells, DNA is the molecule that stores the genetic information.
The interdependent production of these three essential cellular components must have
evolved from simpler stages. The first precursors of life may have been mineral complexes,
perhaps clay templates, combined with some of the organic compounds mentioned above.
That would allow primitive replication of a simple information carrier and, thereby, a means
of preserving favorable changes. The preservation of favorable changes is essential for
evolution. Evolution allows optimization of biomolecules over time, so that from simpler
precursors ever more complex products can eventually be formed. The first information
carrier molecules are not known, but during the evolution of life, RNA probably entered the
stage before DNA did. Which final steps brought on the interaction and interdependence of
DNA, RNA and protein, however, remains unknown. The result of these early developments
is that all living organisms rely on DNA as the genetic information carrier, for which it is
optimally suited.
1.2. The building blocks of DNA
The long-term storage of genetic information is performed by deoxyribonucleic acid,
abbreviated as DNA. The building blocks of DNA are called nucleotides, which contain a
deoxyribose sugar, to which one of the four possible bases is connected. These bases are
either purines, which have two carbon rings with nitrogen incorporated (adenine, abbreviated
A, and guanine, G) or pyrimidines, which have only one ring (thymine, T, and cytosine, C). A single
phosphate group connects each nucleotide to the next to form the ribose-phosphate backbone of
DNA.
The chemical structure of the building blocks of DNA is shown in Figure 1.2, together
with the structure of nucleotides that produce RNA, the sister molecule of DNA. In RNA, the
pyrimidine uracil (abbreviated U) instead of thymine is used. A further difference is that the
sugar of RNA is ribose, which contains two free hydroxyl groups, whereas the sugar part of
DNA lacks the 2'-hydroxyl group, as pointed out in the figure (note that all carbon atoms are
numbered, whereby those of the sugar group are primed to distinguish them from the base
carbons). Although the ribose of RNA has two hydroxyl groups, only the one attached to C-3'
is used to connect multiple RNA nucleotides, so that the structure of RNA is similar to that of
DNA.
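At the level of the written sequence, the only difference between DNA and RNA described here is the replacement of thymine by uracil. A minimal Python sketch of that notational change follows; the function name is made up for this example, and it deliberately says nothing about the chemistry of the sugar or about how a cell produces RNA.

```python
def dna_to_rna(sequence):
    """Rewrite a DNA sequence in RNA letters by replacing T with U."""
    sequence = sequence.upper()
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"not a DNA base: {sorted(invalid)}")
    return sequence.replace("T", "U")


print(dna_to_rna("ATGTGCTAA"))
```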
The difference of a single hydroxyl (-OH) group might seem trivial, but in the case of
DNA and RNA, this has major implications for the structure and stability of the polymers. In
the case of the ribose sugar, present in RNA, the extra hydroxyl group results in a less stable
phosphate backbone, through interactions of the -OH group and the phosphate. This is
particularly true for single-stranded RNA, such as most messenger RNAs, which usually have
short lifetimes, in the order of minutes. This instability makes RNA less suitable as a reliable
information storage molecule. In contrast to this, DNA is almost always found as a double-stranded
helix, and can form very long polymers of millions of nucleotides that are very
stable. In fact, DNA is so stable that it can be isolated from samples dating back more than a
million years. Ancient DNA may have suffered from frequent strand breaks, so that presently,
only fragments of a few hundred basepairs are left. But RNA would never have survived that
long; obviously, DNA is a better polymer for long-term storage of information.
Margin box: DNA and cola
DNA and cola share the same major ingredients. The main ingredient in both is water. Pure
DNA would be useless to the cell as it would be crystalline; in the cell, it is dissolved and
hydrated. The second ingredient is sugar. Cola contains roughly 10% sugar. The sugar
component of the backbone of DNA is important for its solubility. Phosphate is the third
ingredient, present as phosphoric acid in cola (which has a pH of around 2.5), and as the
phosphate backbone in DNA. Again, phosphate is soluble in water, though not as much as
sugar. Their solubility dictates that the phosphate and sugar are on the outside of the DNA helix.
The fourth and last ingredient of both cola and DNA consists of flat, planar bases. In cola it is
caffeine, which is not very soluble in water ('hydrophobic'), and is present in very small
amounts only. However, if enough were added, caffeine molecules would stack on top of each
other, just like adenine does. This stacking forms a single stranded helix in solution. Likewise,
the planar bases in DNA form a helical structure by stacking on top of each other.
Three phosphorylation forms of nucleotides exist
Nucleotides exist in free form in the cell and are also incorporated into DNA. They can be
present in the form shown in Figure 1.2, with one phosphate group attached to the C-5' of the
sugar. Alternatively, a second or even third phosphate is attached to the first one. In the case
of three phosphates being present, the nucleotides are maximally charged with chemical
energy; these forms, with three phosphates attached, are called deoxyadenosine triphosphate,
or dATP, deoxyguanosine triphosphate (dGTP), deoxycytidine triphosphate (dCTP) and
deoxythymidine triphosphate (dTTP). (The general term 'nucleotide' implies that at least one
phosphate is attached; the term 'nucleoside' is reserved for molecules without any phosphate
group.) When the combination of all four deoxy-nucleotides is meant, the general term dNTP
is commonly used, whereby N is the symbol for a nucleotide that can be either G, A, T or C.
The configurations with two phosphates (deoxythymidine diphosphate, dTDP, for instance)
are intermediates of nucleotide biosynthesis, but the monophosphate nucleotides, dAMP,
dGMP, dTMP, and dCMP are produced when DNA is degraded. The building blocks of RNA
are named with similar terminology, but the 'deoxy' in their name is dropped (since the ribose
has both hydroxyl groups), so now we have ATP, GTP, UTP, and CTP, and then ADP, GMP,
and so on. The nucleotide ATP serves several roles in a cell. It is not only the building block
of RNA but it also functions as the general energy carrier of all cells. Since adenine is the
base most readily formed by natural processes in the universe, it is not surprising that
this is also most commonly used in biological systems.
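The naming scheme described above is regular enough to be written down as a small rule: an optional 'd' for deoxy, the letter of the base, and MP, DP or TP for one, two or three phosphates. The Python sketch below illustrates this; the function and constant names are invented for the example, and it assumes the simplification used in the text that thymine occurs only in the deoxy forms and uracil only in the ribose forms.

```python
PHOSPHATE_SUFFIX = {1: "MP", 2: "DP", 3: "TP"}


def nucleotide_abbreviation(base, phosphates, deoxy=True):
    """Build abbreviations such as dATP, dTDP or UTP from their parts."""
    base = base.upper()
    allowed = set("ACGT") if deoxy else set("ACGU")
    if base not in allowed:
        raise ValueError(f"unexpected base for this sugar: {base!r}")
    if phosphates not in PHOSPHATE_SUFFIX:
        # zero phosphates would be a nucleoside, not a nucleotide
        raise ValueError("a nucleotide carries one, two or three phosphates")
    prefix = "d" if deoxy else ""
    return prefix + base + PHOSPHATE_SUFFIX[phosphates]


# the four dNTPs mentioned in the text
print([nucleotide_abbreviation(base, 3) for base in "AGCT"])
```

Calling the function with `deoxy=False` gives the RNA building blocks, for instance `nucleotide_abbreviation("U", 3, deoxy=False)` for UTP.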
Figure 1.2. The building blocks of the nucleic acids DNA and RNA. At the top, a building
block of DNA (left) and RNA (right) is shown. These are composed of a variable base
attached to a ribose sugar, to which a phosphate group is bound at the C-5'. Note that the
deoxyribose of DNA lacks the hydroxyl group at C-2', which is present in the ribose of RNA,
as indicated by the arrows. In the middle, the variable bases are shown. Two purines and two
pyrimidines are found in DNA and RNA, but whereas DNA contains the pyrimidine thymine,
RNA contains uracil. At the bottom, the basepairing is shown between adenine and thymine
(left), which are held together by two hydrogen bonds, or guanine and cytosine (right), which
are connected by three hydrogen bonds. Note that the width of both the AT and GC basepairs
is the same.
1.3. The two strands of DNA form a helix
DNA can exist as very long polymers, containing millions of nucleotides. A single strand of
DNA is composed of nucleotide bases that are connected through their ribose sugars, each
separated by a single phosphate. This phosphate is bound, via an oxygen atom, to the C-3' of
one nucleotide and the C-5' of the next; a chain is thus formed that connects all nucleotides
like beads on a string. Two such strands are held together by the hydrogen bonds that
spontaneously form between opposite purine and pyrimidine bases, provided their sequences
match. Thus, where adenine (A) is present on one strand, thymine (T) is present on the other,
and likewise guanine (G) will find cytosine (C) opposite. These purine-pyrimidine pairs are
called the basepairs of double-stranded DNA. The result will be a helical ladder where the
ribose-phosphate backbone forms the rope of the ladder and the planar basepairs, sandwiched
and slightly twisted one on top of another, form the steps. The basepairs between guanine (G)
and cytosine (C) are formed by three hydrogen bonds, whereas the basepairs of adenine (A)
and thymine (T) are held together by two hydrogen bonds only. One hydrogen bond requires
approximately 1 kcal/mol of energy to break, so it appears that each GC basepair stabilizes
the double strand by 3 kcal/mol, while AT basepairs contribute only 2
kcal/mol. The breakage of all base pairs of a stretch of DNA, which would separate it into
two single strands, is called DNA melting or DNA denaturation.
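The hydrogen-bond arithmetic above can be sketched in a few lines of Python. The example counts two bonds for every A or T and three for every G or C in one strand of a perfectly paired duplex, and multiplies by the approximate figure of 1 kcal/mol per bond quoted in the text. The names are made up for this illustration, and the result is only the hydrogen-bond contribution: it ignores the stacking of the bases and should not be read as a melting energy.

```python
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}
KCAL_PER_HYDROGEN_BOND = 1.0  # approximate value quoted in the text


def hydrogen_bond_count(sequence):
    """Number of hydrogen bonds in a fully paired duplex with this strand."""
    return sum(HYDROGEN_BONDS[base] for base in sequence.upper())


def hydrogen_bond_energy(sequence):
    """Rough hydrogen-bond contribution in kcal/mol (stacking is ignored)."""
    return hydrogen_bond_count(sequence) * KCAL_PER_HYDROGEN_BOND


for duplex in ("TATA", "ATGC", "GCGC"):
    print(duplex, hydrogen_bond_count(duplex), hydrogen_bond_energy(duplex))
```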
Destacking energy and hydrogen bonds keep the two strands of DNA together
It is easy to see why most biologists think that AT regions of DNA will melt more readily
than GC regions, but this view is too simplistic. The most important stabilizing force in
DNA is, in fact, the stacking of the bases on top of each other, and the hydrogen bonds
between the bases are only contributing part of the stabilization. As stated earlier, DNA bases
will spontaneously stack on top of each other, stabilized by resonance from the conjugated pi
systems of the electrons in the flat, planar molecules. This forms a stable hydrophobic
structure in solution; it takes a fair amount of energy to separate the two strands in solution.
How much so-called 'destacking' energy is required depends on the combination of bases. For
instance, guanine stacked on top of cytosine (in other words, a GC basepair stacked on top of
a CG pair in double-helical DNA) requires around 14 kcal/mol to be separated into the two
strands, much more than the energy required to simply break the H-bonds between the bases.
For a pyrimidine followed by a purine, the stacking energy is not nearly as large; in the case
of a TA step (also written as TpA: a TA basepair followed by an AT pair), this is less than 4
kcal/mol. So it is not just the separation of the H-bonds between the bases that is important to
open up the helix: the stacking energy of the nucleotides is a major determinant, too. Most
biologists would expect a sequence like 'TATA' to melt as rapidly as 'TTTT', and more readily
than GCGC. However, a stretch of T's on one strand (A's on the other), will require more
energy to destack than a TATA repeat, so melting of the DNA helix is strongly dependent on
the order of the bases, not just AT content.
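That two sequences with the same AT content can differ in their stacking can be made concrete by listing the dinucleotide steps of a strand. The Python sketch below computes the AT fraction and picks out the pyrimidine-purine steps, which the text describes as the weakly stacked ones. It assigns no energies, since only two values are quoted above, and all names are invented for this example.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def at_fraction(sequence):
    """Fraction of bases in the strand that are A or T."""
    sequence = sequence.upper()
    return sum(base in "AT" for base in sequence) / len(sequence)


def steps(sequence):
    """All dinucleotide steps of a strand, read 5' to 3'."""
    sequence = sequence.upper()
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]


def pyrimidine_purine_steps(sequence):
    """Steps in which a pyrimidine is followed by a purine, such as TA."""
    return [s for s in steps(sequence)
            if s[0] in PYRIMIDINES and s[1] in PURINES]


for strand in ("TATA", "TTTT", "GCGC"):
    print(strand, at_fraction(strand), pyrimidine_purine_steps(strand))
```

TATA and TTTT both consist entirely of AT basepairs, yet only TATA contains TA steps, which is why the two are not expected to destack equally easily.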
Since it takes energy to separate two DNA strands, it should come as no surprise that two
single-strand DNA molecules mixed in solution will spontaneously find each other to produce
double-stranded DNA, at least when sufficient 'matches' for basepairing are present. An
occasional mismatch, of an incorrect base that can't pair with its opposite partner, will
introduce an irregularity in the double-helix (the ribose-phosphate backbone will bulge out
when two large purines face each other, or bulge inwards when two small pyrimidine bases
are combined) but it will not destroy the overall basepairing. How many mismatches are
required to avoid base pairing, and keep the two DNA strands separated, depends on the
temperature (at higher temperature, both the stacking interactions and the hydrogen bonds
become less stable), salt concentration (lower salt concentration destabilizes basepairing)
and, as discussed above, the DNA sequence.
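Finding the mismatches between two strands is a matter of lining them up in opposite directions and comparing each base with its partner. The following Python sketch does this for two strands of equal length, both written from 5' to 3', and labels each mismatch by the kind of bases that face each other, following the description above. It is an illustration with made-up names, and it does not attempt to predict whether the two strands would actually hybridize.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = set("AG")


def find_mismatches(strand, other):
    """List mismatches between two equally long strands, both written 5' to 3'.

    The second strand is read backwards because the strands are anti-parallel.
    """
    strand, other = strand.upper(), other.upper()
    if len(strand) != len(other):
        raise ValueError("this sketch only compares strands of equal length")
    mismatches = []
    for position, (base, partner) in enumerate(zip(strand, reversed(other))):
        if COMPLEMENT[base] != partner:
            if base in PURINES and partner in PURINES:
                kind = "purine-purine"          # backbone bulges outwards
            elif base not in PURINES and partner not in PURINES:
                kind = "pyrimidine-pyrimidine"  # backbone bulges inwards
            else:
                kind = "purine-pyrimidine"
            mismatches.append((position, base, partner, kind))
    return mismatches


print(find_mismatches("ATGC", "GCAT"))
print(find_mismatches("ATGC", "GCTT"))
```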
When two single-stranded DNA molecules come together to produce a double-stranded
structure, the term hybridization is used. For the reverse event, of a double-stranded DNA
molecule separating into two single strands, various terms exist: 'destacking' is used when
the energy of the reaction is described, 'melting' as in melting temperature or melting curve,
and 'denaturation' as the more general term. DNA can be denatured by increasing the
temperature or by lowering the salt concentration. (In analogy to this general term, DNA
hybridization is also sometimes described as 'renaturation'.)
Three different DNA helix structures exist
The most common form of DNA is double-stranded DNA, which forms a helix structure. The
right-handed DNA helix that is usually shown (mostly highly simplified) is only one of
various possible structures of double-stranded DNA. It is more accurately called B-DNA, or
relaxed double-helical DNA. It is the average configuration of DNA in solution, and it is the
most commonly found conformation of DNA in a living cell. In the DNA helix a minor
groove and a major groove alternate, as shown in Figure 1.3; the complete stretch of one major
and one minor groove in relaxed B-DNA covers about 10.5 basepairs, on average. However,
most DNA in cells is more than a thousand-fold compacted, and exists as superhelical DNA,
also called supercoiled DNA, in which the DNA is wrapped tightly around proteins and is
twisted. In this configuration there can be more, or fewer, basepairs to a complete turn,
depending on the direction of supercoiled twists. Moreover, the twists can again be twisted,
held together by proteins, to produce a very condensed structure. Hence, superhelical DNA is
more compact than a linear form of B-DNA. How supercoiled DNA is formed and maintained
will be further explained in the next chapter.
Figure 1.3. Different structures of the DNA helix. To the left, a simplified schematic of the
helical structure of B-DNA is shown, as proposed by Watson and Crick in 1953. To the right,
B-DNA, A-DNA and Z-DNA are shown, projected from the side of the helix (top) and looking
down the helix axis (bottom). Modified after W. Saenger: Principles of Nucleic Acid
Structure, with permission (Springer-Verlag, New York, 1983; ISBN 0-387-90761-0, ISBN 3-540-90761-0).
An alternative helix formed by DNA, but more frequently observed with double-stranded
RNA, is A-DNA (because this was the first structure to be discovered, by Rosalind Franklin,
its name received the 'A-status' whereas the more commonly known B-DNA was made
famous by James Watson and Francis Crick). A-DNA is also a right-handed helix but the
helix is shorter and fatter than B-DNA, and forms a hollow tube as shown in Figure 1.3. It
also contains a major and a minor groove, this time with a periodicity of 11 basepairs
per turn of the helix. DNA-RNA hybrid helices are usually of the A-type. The nucleotide
sequence to some degree also affects the structure of a DNA molecule. Purine stretches
(multiple A or G nucleotides on one strand) tend to form localized A-DNA helices - such
structures can be involved in recombination; for instance, the long terminal repeats of
transposable elements are usually purine stretches. The major groove of DNA will allow
another DNA strand to fold into it, allowing for triple-stranded helices, which are
intermediates in recombination. In contrast, DNA with short phased A-tracts of about 4
nucleotides long can be curved; this is caused by a change in tilt and pitch of the bases
stacking on top of each other, and these changes, if phased with the pitch of the DNA helix,
can result in curves. Curved DNA is common in the cell and it serves a purpose: a stretch of
curved DNA can bring separated sequences into each other's vicinity.
Finally, a third structure of DNA is known as Z-DNA. This is a left-handed helix that is
uncommon in most bacterial DNA, but it is found in sequences of G-C repeats that frequently
occur in eukaryotic DNA. Some bacteria, for instance Burkholderia species (they belong to
the Beta-Proteobacteria), also contain these GC-stretches and part of their DNA could form
Z-DNA. The two strands of Z-DNA tend to melt more easily, which provides another example
of DNA with more G-C than A-T basepairs melting more easily, since alternating C's and G's
can have a destabilizing effect on the double helix. Figure 1.3 illustrates the differences
between A-DNA, B-DNA and Z-DNA.
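The periodicities mentioned for B-DNA (about 10.5 basepairs per turn) and A-DNA (11 basepairs per turn) translate directly into the number of helical turns in a stretch of DNA. The Python sketch below performs this division; the names are made up for the example, the values are averages for relaxed DNA, and no figure is included for Z-DNA because the text does not give one.

```python
BASEPAIRS_PER_TURN = {"B": 10.5, "A": 11.0}  # average values quoted in the text


def helical_turns(basepairs, form="B"):
    """Approximate number of complete helical turns in relaxed DNA."""
    if form not in BASEPAIRS_PER_TURN:
        raise ValueError(f"no periodicity known in this sketch for form {form!r}")
    return basepairs / BASEPAIRS_PER_TURN[form]


print(helical_turns(1050))        # 1050 basepairs of B-DNA
print(helical_turns(1050, "A"))   # the same stretch in the A conformation
```

Since supercoiling changes the number of basepairs per turn, the result applies only to the relaxed, linear case.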
The shape of a DNA helix can thus be right- or left-handed or curved, and it can even fold
into three-stranded helices. All of these different conformations are dictated by the DNA
sequence, and repeats of various sequences are frequently involved in the formation of such
three-dimensional structures. For example, Z-DNA can be formed by CG repeats, an A-type
helix can be favoured by short repeats of purines or pyrimidines on the same strand, and
curved DNA can be formed by phased A-tracts repeated within the pitch of the helix.
DNA strands are directional
The DNA strands that together form the helix both have a direction, which is simpler to see
when the helix is flattened out, as in Figure 1.4. Now it is obvious that the two strands run in
opposite directions: DNA contains two anti-parallel strands. One can follow one strand from
the phosphate group, which is attached to the C-5' of the first nucleotide's deoxyribose, to the
C-3' hydroxyl group of the last nucleotide depicted. The other strand goes in the other direction,
even if one turns the molecule upside down. This is important to notice: the information
stored in a DNA sequence is coded by the order, or sequence, of the nucleotides, but one has
to read these in the correct direction. A DNA strand is biologically meaningful in the direction
of 5'-phosphate to 3'-OH, and this is also the direction in which we write a DNA sequence, by
default.
The molecule schematically drawn in Figure 1.4 can thus be written as 5'-ATGC-3', or
ATGC for short, but equally accurate is GCAT: it only depends on which strand is written
down. These two sequences are complementary, and together represent one and the same
double-stranded DNA molecule. To write down a complementary sequence from a given, or
'template' sequence, one has to read from right-to-left and exchange every nucleotide with its
complementary pair. Thus, ATGTGCTAA will become TTAGCACAT. Box 1.2 describes some
more chemical and physical properties of DNA.
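The rule for writing down the complementary strand can be sketched in a few lines of Python. This is only an illustration: the function name reverse_complement is our own choice, and the sketch assumes the sequence contains only the letters A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'.

    Each nucleotide is exchanged for its partner, and the result is
    reversed, which corresponds to reading the template right-to-left.
    """
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))       # GCAT
print(reverse_complement("ATGTGCTAA"))  # TTAGCACAT
```

Applying the function twice returns the original sequence, which reflects that both ways of writing describe the same double-stranded molecule.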
Figure 1.4. The direction of the two strands in a DNA molecule. The two arrows indicate
the direction of the ribose–phosphate backbone from the 5'-phosphate to the 3'-OH end for
each strand. The sugar and phosphate groups are only schematically drawn. The bases are
colored grey (no side groups are shown), and the dotted lines connecting them represent the
hydrogen bonds that keep the two strands together. Notice how the connections between bases
and backbone are twisted, as a result of this planar representation. The sequence shown here
either reads ATGC or GCAT, depending on which strand is written down. Both ways of
writing describe the same molecule.
Information Box 1.2: The chemical and physical properties of DNA
Pure DNA is a crystalline white powder. Dissolved in water it can become viscous, with
the viscosity depending on the concentration and also on the length of the DNA strands.
The DNA inside a bacterial cell is quite viscous and would have the consistency of
'snotty gel', like a 0.8% agarose gel. As the name suggests, DNA is an acid, though it
contains purines and pyrimidines, which are both weakly alkaline. The acid nature of
DNA is due to its phosphates, which also give the backbone an overall negative charge
when DNA is dissolved in water. In solution this negative charge can be neutralized
with sodium (Na+) or magnesium (Mg2+) ions, though in the cell the positively charged
polyamines (spermine and spermidine) as well as chromatin proteins rich in positively
charged arginine and lysine, neutralize the negative charge of DNA.
The DNA helix takes a fair amount of energy to melt due to the stabilization from the
stacking of the bases on top of each other. A force of 65 picoNewton (pN) has been
found to be enough to mechanically separate the two strands of double-stranded DNA,
though this value depends on the DNA sequence, the temperature and the
presence of stabilizing cations. Adenine stacks spontaneously form a helical structure in
single-stranded poly-A DNA that has elastic properties when stretched; it can withstand
forces of 115 pN before it completely relaxes. The ribose–phosphate backbone can
withstand forces much stronger than that, though it is sensitive to shearing and chemical
degradation. One of the early methods of DNA sequencing was based on chemical
degradation of the backbone, using hydrazine (a rocket fuel). Potassium permanganate
(KMnO4, a strong oxidizing agent) can also be used to break the phosphate backbone.
DNA is stable at a wide range of pH values, although the two strands of the double helix
will separate at higher pH, depending on the salt concentration. DNA in solution can
also withstand a wide range of temperatures, from below freezing to boiling, though the
two strands would again separate at high temperatures. The deoxyribose-backbone
cannot easily be broken, even by boiling (the ribose-backbone of RNA is far more
fragile).
The most astonishing physical property of DNA is its size. A human chromosome is a
single DNA molecule; when stretched out as a B-DNA helix, it would measure only 2
nanometers wide but, on average, it would have a length of almost 5 cm (or 25,000,000
times longer than wide). If all 46 chromosomes of a single diploid human cell were thus
stretched out and aligned head-to-tail they would cover a distance of over 2 meters. A
bacterial chromosome thus stretched would measure between 0.054 mm (54 microns)
and 5.35 mm (5,350 microns), depending on the species, though most of these
molecules are circular.
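The arithmetic behind these sizes can be checked with a short Python sketch. The values below are assumptions made for illustration and are not taken from this box: an axial rise of 0.34 nm per base pair for B-DNA, roughly 6.4 billion base pairs for a diploid human cell, and roughly 4.6 million base pairs for an E. coli chromosome. The function names are our own.

```python
RISE_NM_PER_BP = 0.34  # assumed axial rise per base pair of B-DNA, in nm
HELIX_WIDTH_NM = 2.0   # width of the B-DNA helix as quoted in the text

def stretched_length_m(base_pairs: float) -> float:
    """Length in metres of a fully stretched B-DNA helix."""
    return base_pairs * RISE_NM_PER_BP * 1e-9

def length_to_width_ratio(length_m: float, width_nm: float = HELIX_WIDTH_NM) -> float:
    """How many times longer than wide a stretched molecule is."""
    return length_m / (width_nm * 1e-9)

# A 5 cm chromosome that is 2 nm wide: 25 million times longer than wide.
print(round(length_to_width_ratio(0.05)))

# Assumed diploid human genome of 6.4e9 base pairs: a little over 2 metres.
print(stretched_length_m(6.4e9))

# Assumed E. coli chromosome of 4.6e6 base pairs, in millimetres: about 1.6 mm.
print(stretched_length_m(4.6e6) * 1e3)
```

Under these assumptions the E. coli value falls inside the range given for bacterial chromosomes.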
DNA in the cell is often methylated
DNA in the cell can be chemically modified by specific enzymes. Typically, methyl groups
are attached to the bases of DNA, resulting in the structures shown in Figure 1.5. E. coli and
other Gamma-proteobacteria can attach a methyl group (-CH3) to the N6 of adenine, to
produce N6-methyladenine (6-meA); some archaea also methylate their DNA this way. Some
E. coli strains can additionally methylate C5 of cytosine to produce C5-methylcytosine (5-meC). Other species produce N4-methylcytosine. The responsible enzymes in E. coli are
called Dam (for DNA adenine methylase) and Dcm (DNA cytosine methylase). Not every
adenine (or cytosine) in the DNA of cells containing these enzymes is methylated: the
addition of a methyl group is sequence-dependent. Dam methylates the A in the sequence
GATC and Dcm methylates the second (internal) C in CCAGG and CCTGG. Even so, less than 1% of all
copies of these recognition sequences that are present in the genome are normally methylated.
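Because methylation is sequence-dependent, candidate sites can be located by searching a sequence for the recognition motifs. The Python sketch below is a minimal illustration with function names of our own choosing. It reports only where the Dam (GATC) and Dcm (CCAGG or CCTGG) recognition sequences start on the given strand, and says nothing about whether a site is actually methylated in the cell.

```python
import re

DAM_SITE = re.compile(r"GATC")     # recognition sequence of Dam
DCM_SITE = re.compile(r"CC[AT]GG")  # CCAGG or CCTGG, recognized by Dcm

def dam_sites(seq: str) -> list:
    """0-based start positions of every GATC in seq."""
    return [m.start() for m in DAM_SITE.finditer(seq.upper())]

def dcm_sites(seq: str) -> list:
    """0-based start positions of every CCAGG or CCTGG in seq."""
    return [m.start() for m in DCM_SITE.finditer(seq.upper())]

example = "AAGATCTTCCAGGTTCCTGGGATC"  # made-up sequence for illustration
print(dam_sites(example))  # [2, 20]
print(dcm_sites(example))  # [8, 15]
```

Both motifs read the same on the complementary strand, so scanning one strand is enough to find the sites on both.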
Methylation of DNA by Dam serves multiple roles in E. coli (the function of Dcm is
unknown). The expression of some genes is regulated by GATC methylation. The cell also
recognizes newly produced DNA that has not yet been methylated, and this is a way to
regulate the start of DNA replication (the production of a chromosome copy); methylation is
also used to repair mistakes in newly made DNA copies. Finally, DNA methylation is a way to mark
the DNA as 'self', so that it can be distinguished from foreign, incoming DNA, which can be
attacked while the cell's own methylated DNA is protected. However, not all bacteria methylate
their DNA, and even different strains within a species can produce differently methylated
DNA.
Figure 1.5. Methylated bases are the products of different DNA methylases.
1.4. DNA contains biological information beyond its sequence
We represent DNA sequences as letters and although DNA sequences don't contain empty
spaces, we can interpret them as if they are forming words, and words can form sentences.
For this analogy we could consider DNA 'words' to be functional domains within proteins,
and a sentence could serve as a complete protein. Non-coding sequences that separate genes
would be the equivalent of the periods that separate sentences in printed texts. All text that
together might produce a book could then be a genome. Distinct chapters or volumes could
even represent different chromosomes. Indeed, the sequence of the human genome was
presented to the public as the 'blueprint of life' and its size was described as a book of 23
volumes (for the 23 chromosomes) that together would easily fill 150,000 pages. If the
sequence were printed in small print, it would indeed fill that much paper. However, text
is only a symbolic representation of DNA, and other ways to represent DNA can be just as
informative as text, or even more informative (though less practical), as described in Box 1.3.
Information Box 1.3. Representing DNA sequences
DNA consists of only four building blocks, which are combined into strings to produce a
particular sequence. We are used to representing these as four letters in the form of text. But
that is only one of many possible representations, and though it is the most practical way, there
are alternatives that could be considered.
• DNA can be represented as thin vertical lines of four different colors. That would allow
many more nucleotides on a line of text without losing readability. Using lines 1/3
millimetre wide and 2 mm high, 500 nucleotides would easily fit a line, and 250 lines
would make a page containing 125,000 nucleotides. The book of the human genome would
reduce from 150,000 pages of text to 'only' 20,000 pages of colour, whereby regions
enriched for a particular base, or repeat patterns, would be easily visible. Try that with a
book containing DNA letters!
• In a DNA atlas, we represent the entire chromosome as one circular line, averaging
structural and compositional information along the chromosome by colours. This can
visualize a base skew in bacterial chromosomes, for instance, or local variation in base
content.
• DNA nucleotides can also be represented by four tones of different pitch. A human ear
trained to recognize melody would be able to recognize repeats, and a single nucleotide
stretch would increase the length of the note. However, if a genome were played in this
way, the 'concert of nature' would not be very appealing, and music is not a suitable
medium to work with.
• The representation of beads on a string would more closely resemble the linear nature of
DNA. Using coloured beads as thin as a hair you could hold the human genome in your
hand, and maybe a synthetic thread could be developed that aligns to its complementary
sequence, to give double-strand DNA and envisage DNA melting. But a genome
represented like that would quickly get tangled and, unlike in the cell, that tangling would
not be ordered.
Whatever representation we choose, it doesn't capture the true amount of information
stored in a DNA sequence, because that is more than just the sequential order of nucleotides.
The most immediate shortcoming of DNA as mere letters is that we only show one strand,
while DNA in nature is nearly always found as double-strand DNA. Since both strands
usually contain genes, it means that maybe half of all genes are represented as their
complementary sequence in a genome DNA file. When searching for a gene in a database,
this isn't immediately obvious since a single gene is usually reported with its coding strand,
but when searching for a gene in a complete genome's DNA sequence, one has to look for the
complementary sequence for all genes located on the strand that is not given by the sequence.
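A search that covers both strands can be sketched in Python as follows. This is a simplified illustration rather than a description of how genome databases work: the function names and the tiny 'genome' are made up, and the search is an exact text match on a linear sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_on_both_strands(genome: str, gene: str) -> list:
    """Return (position, strand) for every exact occurrence of gene.

    '+' means the gene is found in the sequence as given; '-' means its
    complementary sequence is found, so the gene lies on the other strand.
    Positions are 0-based and refer to the strand given in the file.
    """
    queries = [("+", gene)]
    rc = reverse_complement(gene)
    if rc != gene:  # avoid reporting a palindromic query twice
        queries.append(("-", rc))
    hits = []
    for strand, query in queries:
        start = genome.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(query, start + 1)
    return sorted(hits)

genome = "CCATGAAATAGGGCTATTTCATCC"  # made-up sequence
print(find_on_both_strands(genome, "ATGAAATAG"))  # [(2, '+'), (13, '-')]
```

The second hit would be missed by a search that only looks for the gene as written.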
The order of nucleotides affects the three-dimensional structure of a DNA molecule, and
that conveys information to the regulatory proteins of the cell. We have no way to represent
that information in text files, and other visualization means are required for this.
DNA modification, most frequently by base methylation, is another way in which biological
information is stored, especially if that information has to be erasable (in contrast, DNA
sequences are permanent, ignoring the odd mutation). But again, DNA modification is not
represented by a DNA sequence given as letters. Proteins that are bound to DNA are part of
the temporary biological message of a chromosome, too. A DNA sequence can be temporarily
'hidden' from the cell by proteins that shield it or fold it away, so that the information stored in
that sequence is currently not available. Conversely, proteins can open up the DNA helix
locally, to stimulate gene expression. Again, there is no way to show this in a DNA sequence.
1.5. Repeats code for structures in DNA
Stretches of nucleotides that repeat themselves in a molecule are frequent in Nature, and they
can affect the structure of DNA. The simplest kind of repeat would be a stretch of repeated
single nucleotides (GGGGG or TTTTTTT) where the repeat unit would be one nucleotide. This
is also called a homonucleotide stretch, and such stretches are quite common in DNA.
Another simple repeat consists of a unit of two nucleotides, also called dinucleotide repeats,
for instance GAGAGAGA. Repeats with a unit of two or more nucleotides are called ‘tandem repeats’, and
if the unit is repeated more than 10 times, they are called ‘micro-satellites’ (which can be repeated up to
60 times). These are particularly common in eukaryotic DNA. For example, (CG) repeats
(also called 'CpG islands') are involved in chromatin structures, and their methylation can
result in epigenetic effects. In bacteria, CpG islands are uncommon, though a so-called CpG
motif (composed of two purines, followed by C and G and then two pyrimidines) is
commonly found in bacterial DNA. When unmethylated, this motif has immunological
properties and is sometimes loosely described as 'CpG DNA'. In bacteria, true CpG islands
have so far only been described for Burkholderia species, but they are quite common in
eukaryotes. Triplet repeats (e.g., CGGCGGCGG) in eukaryotic DNA can 'expand', due to slip-paired
structures, and in some cases this can lead to disease in humans. The CGG triplet repeat shown
here can cause fragile X syndrome, which is the most common inherited form of intellectual
disability in humans.
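Simple repeats such as homonucleotide stretches, dinucleotide repeats and triplet repeats can be found with a regular expression that asks for a short unit followed by further copies of itself. The Python sketch below is a heuristic illustration. The function names and the thresholds (units of up to 6 nucleotides, at least 3 copies) are arbitrary choices of ours, and overlapping repeats may be reported only once.

```python
import re

def is_primitive(unit: str) -> bool:
    """True if unit is not itself built from copies of a shorter unit."""
    return unit not in (unit + unit)[1:-1]

def find_tandem_repeats(seq: str, max_unit: int = 6, min_copies: int = 3) -> list:
    """Return (start, unit, copies) for simple repeats in seq.

    For each unit length, the pattern matches a unit followed by at least
    min_copies - 1 further copies of the same unit.
    """
    hits = []
    for unit_len in range(1, max_unit + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # skip e.g. 'GG', already found as 'G'
                hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)

example = "ATGGGGGTTGAGAGAGACCGGCGGCGGTA"  # made-up sequence
for start, unit, copies in find_tandem_repeats(example):
    print(start, unit, copies)
```

For the example sequence this reports a stretch of five G's, four copies of GA and three copies of CGG.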
A repeat unit can of course also be longer than a few nucleotides, and can even comprise
complete genes. Moreover, the repeat unit can be found on the same strand, or on the
complementary strand of double-strand DNA. A few examples of different types of repeat
sequences are presented in Figure 1.6. A particular sequence that is repeated on the same
strand is called a direct repeat. These frequently cause duplications or deletions during DNA
replication. When a DNA sequence is found once on one strand and repeated on the other,
this is called an inverted repeat. An inverted repeat without a spacer is called a palindrome. In
this case the sequence reads the same for both strands (provided they are both read from 5' to
3', as is the convention for reading DNA sequences). Short palindromic sequences are
frequently recognition sites for restriction enzymes. A mirror repeat, on the other hand,
produces the same sequence only if the repeated unit would be read from 3' to 5', which
biologically doesn't make sense. Mirror repeats are less common than inverted repeats in
bacteria, and the same applies for everted repeats, where the mirrored repeat is found on the
complementary strand. A mirror repeat, especially when it is purine-rich on one strand, can
fold back on itself, leaving half of one strand single-stranded and forming a region of triplex DNA in which
the third strand is wound into the major groove of the helix.
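The difference between a palindrome and a mirror repeat can be made concrete with two small tests in Python. The function names are our own. GAATTC, the recognition site of the restriction enzyme EcoRI, is used as a familiar palindrome that is not taken from this chapter, and AGGAAGGA is a made-up mirror repeat.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindrome(seq: str) -> bool:
    """Inverted repeat without a spacer: both strands read the same 5' to 3'."""
    return seq == reverse_complement(seq)

def is_mirror_repeat(seq: str) -> bool:
    """The same strand reads identically from 5' to 3' and from 3' to 5'."""
    return seq == seq[::-1]

print(is_palindrome("GAATTC"), is_mirror_repeat("GAATTC"))      # True False
print(is_palindrome("AGGAAGGA"), is_mirror_repeat("AGGAAGGA"))  # False True
```

A DNA palindrome is therefore not a palindrome in the everyday sense of the word: it is the complementary strand, not the same strand read backwards, that repeats the sequence.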
Figure 1.6. Different kinds of DNA repeats. A direct repeat (A) consists of a sequence
repeated on the same strand. In the figure, a direct repeat with a unit consisting of 15
nucleotides is shown with a spacer of four undefined nucleotides. In the case of an inverted
repeat (B), the repeat unit is found on the complementary strand. An inverted repeat without a
spacer (C) produces a palindromic sequence, in which both strands read the same sequence.
Mirror repeats (D) and everted repeats (E) are less common. In these cases, the repeat
sequence is read in the 'wrong' direction, from 3' to 5', which is indicated by the grey arrows.
(F) shows an imperfect direct repeat, in which a slight variation is found in the repeat unit.
Structures in DNA due to repeat sequences
Inverted repeats can form hairpin structures, whereby the two repeat units of one strand basepair with each other, as shown in Figure 1.7. Hairpins in single-strand molecules are common
in RNA molecules. When inverted repeats form hairpins in double-strand DNA, a cruciform
is the result, as shown in the middle of the figure. Finally, two inverted repeats (or
palindromes) that are again repeated can result in slipped-strand structures, as shown in the
figure. Such structures can result in duplication or deletion, when they form during DNA
synthesis in the cell.
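Sequences that could form a hairpin can be located by looking for a short stretch that is followed, after a spacer, by its own complementary sequence. The Python sketch below is a rough illustration only. The function names, the stem length of 6 and the loop size of 3 to 8 nucleotides are arbitrary assumptions of ours, and an exact match says nothing about whether the hairpin is stable in the cell.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_inverted_repeats(seq: str, stem: int = 6,
                          min_loop: int = 3, max_loop: int = 8) -> list:
    """Return (left_start, right_start, left_unit) for candidate hairpins.

    A candidate is a unit of `stem` nucleotides followed, after a spacer of
    min_loop to max_loop nucleotides, by its exact complementary sequence.
    """
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if seq[j:j + stem] == target:
                hits.append((i, j, left))
    return hits

example = "TTGCCGATAAAAATCGGCTT"  # made-up sequence: GCCGAT ... ATCGGC
print(find_inverted_repeats(example))  # [(2, 12, 'GCCGAT')]
```

In the example, the two repeat units can base-pair to form the stem, while the four A's between them would form the loop.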
Figure 1.7. Structures in DNA. In (A) a hairpin structure is shown, which is the result of a
single-strand molecule basepairing with an inverted repeat. One repeat unit is shown by an
arrow. In (B) two hairpins in a double-strand DNA molecule build a cruciform. In (C), two
inverted repeats have resulted in a slipped-strand structure.
1.6. Concluding remarks
DNA is the carrier of genetic information in all living cells. It exists as a double helix, of
which there are different structural variants. DNA consists of the four nucleotides A, G, C and
T, which build two strands. The two strands are kept together by the bases of the nucleotides
that pair with each other. These strands have a direction, and are paired antiparallel. The
structure of DNA, which is sequence-dependent, mostly dictates its function. Thus, sequence,
that is the order of nucleotides, dictates structure, and structure dictates function. However,
when we 'read' DNA as sequences of letters, we miss a lot of the information available to the
cell. This is one of the major shortcomings of the way we symbolize DNA.
Recommended reading:
DNA Mystique: The Gene as a Cultural Icon. Nelkin D and Lindee MS. W.H. Freeman and
Company, New York, 1995.
Astrobiology: a brief introduction. Plaxco KW and Gross M. The Johns Hopkins University
Press, Baltimore, MD, USA, 2006.
DNA structure: A-, B- and Z-DNA helix families. Ussery DW. Encyclopedia of Life
Sciences. Macmillan Publishers Ltd, Nature Publishing Group, 2002.
Bias of purine stretches in sequenced chromosomes. Ussery DW, Soumpasis DM, Brunak S,
Staerfeldt HH, Worning P, Krogh A. 2002. Comput Chem. 26:531-541.
Physical maps of chromosomes. Ussery DW. 2009. In: Encyclopedia of Life Sciences. John
Wiley & Sons, Ltd: Chichester. DOI: 10.1002/9780470015902.a0001425.
Strand misalignments lead to quasipalindrome correction. Van Noort V, Worning P, Ussery
DW, Rosche WA, Sinden RR. 2003. Trends Genet. 19:365-369.