Bacterial Genetics and Molecular Biology - a Genomics Perspective (Ch. 1)
Trudy M. Wassenaar, David W. Ussery

Chapter 1. What is DNA?

It is widely accepted that DNA represents a blueprint for life. Indeed, DNA stores much of the information a living cell requires to function, and that information is stored as physical and chemical structures. The biological information of DNA comes from the sequence of the four building blocks that contain the bases A, C, G, and T. The order of the bases determines which structures are or can be formed, and it is these structures that in turn dictate biological function. DNA is not a complicated, miraculous substance, but rather a relatively simple polymer, with humble origins and some interesting properties that make life possible.

1.1. The ancient and humble origin of DNA

In the beginning, there was the big bang. From physical evidence it can be deduced that this represents the creation of the universe in which we live. At first, the products of this unimaginable event were mostly hydrogen (about 75%), with the remainder (about 25%) helium isotopes; these were the first and only atoms to be formed at this stage. The extreme temperatures of the young universe did not allow more complex atoms to be stable. Eventually, atoms began to clump and condense due to gravity, causing an implosion of gaseous stars that, in turn, by way of nuclear fusion, created the heavier atoms that are known today. Included in these heavier atoms are (in decreasing order of abundance) oxygen, carbon, neon and nitrogen. Carbon may be present as carbon monoxide (CO), carbon dioxide (CO2), acetylene (C2H2) or methane (CH4) gas; nitrogen can be found as nitrogen gas (N2) or ammonia gas (NH3), whereas hydrogen and oxygen combine to form water (H2O). Under the influence of light, methane and nitrogen gas can spontaneously form hydrogen cyanide (2CH4 + N2 → 2HCN + 3H2).
These gases are ubiquitous in the universe, as we know from their telltale spectra, detected by radio-frequency spectroscopy. From these simple gases, more complex molecules can be produced in reactions that are driven by high energy and/or light. In a famous experiment performed in the 1950s, Stanley Miller showed that a mixture of methane, ammonia and water, exposed to a high-voltage discharge (simulating lightning), could produce organic compounds such as formic acid, lactic acid, adenine, glycine and alanine. All of these are common components in a living cell, but as Miller showed, they can also be produced chemically, and it is likely that such reactions took place on Earth when the planet was still young. In fact, the amino acid glycine is quite common in the universe. More complex chemicals can also be found in the universe. The mixture of primeval gases of a planet can react together, under the influence of light from the stars, to form purines such as adenine, a chemical with alkaline properties in solution, meaning that it is a base. Adenine can be produced from hydrogen cyanide in a multistep reaction in the presence of ammonia, which serves as a catalyst. It is a planar molecule, as can be seen in Figure 1.1. Reactive gases can form substantial quantities of adenine in the presence of light. Likewise, pyrimidines, sugars, and simple lipids can all be formed chemically. Formaldehyde (CH2O) is one of the crucial reactants for producing amino acids, and formaldehyde is still being produced by photochemical reactions in our present atmosphere, which is why it can be detected at low concentrations in rain.

Figure 1.1. Schematic representation of some important events following the Big Bang. At the top, the formation of the first two elements, hydrogen and helium, is shown.
Heavier elements were formed by nuclear fusion during implosion of stars. The abundant elements shown (H, C, O, and N) can spontaneously form simple molecules. At the bottom, the formation of adenine from five hydrogen cyanide molecules is shown, a reaction that takes place in the presence of ammonia. A schematic of five stacked adenine molecules in water is also shown; adenine spontaneously forms a helix in solution.

Water vapor is commonly found in the universe, and on planets with permissive temperatures and gravitational forces, which are required to prevent evaporation into space, it may even be present in liquid form. This is important as the origin of life depends on a solvent. The reactive gases that can build complex molecules can also destroy them. However, when these complex compounds dissolve in a solvent, they are protected against such destruction and their concentration may build up. Take, for instance, adenine, which is not very soluble in water. In liquid water, these hydrophobic planar base molecules tend to combine by stacking one on top of the other. The most stable structure for such stacks is formed when each molecule is slightly turned compared to the one below. When adenine is mixed with water, a helical structure forms spontaneously that is reminiscent of the DNA helix, which of course is formed by the stacking of adenine, guanine, cytosine, and thymine, all planar bases. In DNA, these four bases are stacked just like multiple loose adenine molecules, though they do so in pairs: a purine is always paired with a pyrimidine, so that adenine always pairs with thymine and guanine always pairs with cytosine. Moreover, in DNA these pairs of bases are not just loose stacks, but are connected to each other via a backbone composed of a ribose sugar and phosphate.
Still, the spontaneous formation of adenine, and its ability to form helical stacks in liquid water, illustrates how the basic building blocks for life can be found in the universe, produced from abiotic gases that obey the laws of chemistry and physics.

All cells are made up of four types of biopolymers

All living cells contain carbohydrates (sugars), amino acids, lipids, and nucleotides, and these in turn form polymeric structures, or biopolymers, of which all cells are composed. Box 1.1 explains why the molecules of living organisms are mainly made of only a few elements. Proteins are made up of chains of amino acids, some of which are spontaneously formed in a pre-biotic world. The sugar molecules in a cell are also frequently interconnected into chains of carbohydrates, the second type of polymer. Sugars are also frequently attached to other compounds, such as proteins or lipids. The lipids are mainly found in the membrane, which separates the inside of a cell from the outside world. Lipids, the third type of biopolymer, are not very soluble in water, and can form membrane-like structures if mixed with water. Usually the lipids will form film layers, but fatty acids (hydrocarbons with a carboxyl group, COOH, on one end) will spontaneously form vesicles in water: multiple molecules align into a spherical structure so that the carboxyl group faces out and the hydrophobic lipid tail is shielded from the polar water molecules. A cell membrane is composed of two such layers, since both the inside and the outside of a cell are watery environments. Such vesicles may have captured organic molecules, forming a structure that starts to resemble a cell. However, the jump from these abiotic vesicles to self-replicating, living cells is a giant step that is not yet fully understood. Probably the precursors of present-day cells were much simpler.
The fourth type of biopolymer is nucleic acids, to which DNA and RNA belong. These are chains of nucleotides.

Information Box 1.1: Why are the molecules of life built of so few different elements?

All living organisms are made of similar chemical building blocks, which are mainly composed of six elements: carbon and hydrogen form hydrocarbons; add oxygen and carbohydrates (Cn(H2O)n) can be formed. These three elements together with nitrogen can build most (but not all) amino acids, along with the bases of nucleic acids. Add phosphorus (as phosphate, a component of the DNA and RNA backbone) and sulfur, which is needed for the amino acids methionine and cysteine, and DNA, RNA, and most proteins can be made. Other elements, mostly metals, are found in trace amounts only. The position of the elements in the periodic table predicts their chemical and physical properties, such as the number of bonds their atoms can form and the strength of these bonds. In the periodic table, the noble gases (starting with helium, neon and argon) are in the right-hand column. Their chemical properties are dictated by the outer orbitals of their atoms, which are fully occupied with electrons. As a result, they rarely react with other atoms. The elements of the second row in the periodic table that are abundant in biological polymers - carbon, nitrogen, and oxygen - are also abundant in the universe: the third and fourth most abundant elements (after hydrogen and helium) are oxygen and carbon. Nitrogen is the 7th most abundant element in the universe, and also a main ingredient of our atmosphere, making up more than 75% of the air we breathe. Carbon (with 4 outer electrons) can form very strong bonds to itself or to other atoms, leading to long and stable chains of C-C bonds (as in hydrocarbons), and is found in millions of different types of compounds.
In contrast, silicon (located in the row beneath carbon, also with 4 outer electrons) is mostly found in the form of silicon dioxide (present in rocks, sand or glass), which is quite stable and unreactive, and cannot form stable polymers. Sheer abundance and reactivity are two reasons why biological molecules are mainly composed of the elements from the top of the periodic table: H, C, N, and O. There are exceptions to this rule: phosphorus is an essential element in biological material but it is positioned in the periodic table under N; phosphate bonds are often highly reactive, which is why their formation or breakage is essential in the formation or degradation of many biological molecules. How phosphorus got incorporated into early organic molecules is still a mystery, although it should be noted that phosphate occurs in many mineral compounds, and might have been involved in early proto-cells. Sulfur is also relatively abundant on Earth, and plays an important role in some biochemical reactions. In addition to the six main elements, other elements are important for life, in particular the cations (positively charged ions) sodium and potassium. Lithium and beryllium, though positioned in the periodic table in the same row as C, N and O, are rarely found in biological material. Instead, sodium (Na+, for natrium) and potassium (K+, for kalium) take the place of lithium cations, whereas magnesium (Mg2+) and calcium (Ca2+) are typically used for divalent cations. Aluminum (Al3+) and chlorine (Cl-) are favored over boron and fluorine, respectively. The reason is the abundance of these elements: aluminum is the third most abundant element in the Earth's crust; in comparison, lithium, beryllium, boron and fluorine are relatively rare on Earth.
The high abundance of iron may be the reason why this metal is also commonly attached to biopolymers. A final criterion is solubility: life depends on a solvent, which on Earth is water. Silicon is insoluble in water. One might hypothesize life forms being formed in other solvents, such as liquid hydrogen fluoride (HF), methane or ammonia. However, since fluorine is rare, liquid HF is not common on planets. Charged ions are not very soluble in methane, which furthermore has a limited ability to store heat, two factors that disqualify this abundant solvent as a support for life. Water has a very high heat-storage capacity, which dampens extreme temperature swings on our planet's surface, a factor that probably strongly contributed to the formation of life. The only solvent with similar properties to water is ammonia, although it has a narrower temperature range for its liquid phase.

DNA is an optimal information carrier

There are three fundamental characteristics essential to all modern living cells:
(1.) All living cells contain DNA as the genetic information storage molecule, and replicate this DNA to donate a copy to their offspring.
(2.) All living cells produce proteins from DNA via the intermediate molecule RNA.
(3.) All living cells mainly rely on proteins to carry out necessary chemical reactions, including the production of DNA and RNA. Proteins, in the form of enzymes, are also responsible for the biosynthesis or modification of lipids and carbohydrates.

In a cell, DNA replication depends on proteins just as protein synthesis depends on DNA. Both require RNA molecules as essential components of these production processes. Likewise, RNA can only be produced in the presence of both DNA and proteins. In all living cells, DNA is the molecule that stores the genetic information. The interdependent production of these three essential cellular components must have evolved from simpler stages.
The first precursors of life may have been mineral complexes, perhaps clay templates, combined with some of the organic compounds mentioned above. That would allow primitive replication of a simple information carrier and, thereby, a means of preserving favorable changes. The preservation of favorable changes is essential for evolution. Evolution allows optimization of biomolecules over time, so that from simpler precursors ever more complex products can eventually be formed. The first information carrier molecules are not known, but during the evolution of life, RNA probably entered the stage before DNA did. Which final steps brought on the interaction and interdependence of DNA, RNA and protein, however, remains unknown. The result of these early developments is that all living organisms rely on DNA as the genetic information carrier, for which it is optimally suited.

1.2. The building blocks of DNA

The long-term storage of genetic information is performed by deoxyribonucleic acid, abbreviated as DNA. The building blocks of DNA are called nucleotides, which contain a deoxyribose sugar to which one of the four possible bases is connected. These bases are either purines, which have two carbon rings with nitrogen incorporated (adenine, abbreviated A, and guanine, G), or pyrimidines, with only one ring (thymine, T, and cytosine, C). A single phosphate group connects multiple nucleotides to form the ribose-phosphate backbone of DNA. The chemical structure of the building blocks of DNA is shown in Figure 1.2, together with the structure of the nucleotides that make up RNA, the sister molecule of DNA. In RNA, the pyrimidine uracil (abbreviated U) is used instead of thymine.
A further difference is that the sugar of RNA is ribose, which contains two free hydroxyl groups, whereas the sugar part of DNA lacks the 2'-hydroxyl group, as pointed out in the figure (note that all carbon atoms are numbered, whereby those of the sugar group are primed to distinguish them from the base carbons). Although the ribose of RNA has two hydroxyl groups, only the one attached to C-3' is used to connect multiple RNA nucleotides, so that the structure of RNA is similar to that of DNA. The difference of a single hydroxyl (-OH) group might seem trivial, but in the case of DNA and RNA, this has major implications for the structure and stability of the polymers. In the case of the ribose sugar, present in RNA, the extra hydroxyl group results in a less stable phosphate backbone, through interactions of the -OH group and the phosphate. This is particularly true for single-stranded RNA, such as most messenger RNAs, which usually have short lifetimes, on the order of minutes. This instability makes RNA less suitable as a reliable information storage molecule. In contrast, DNA is almost always found as a double-stranded helix, and can form very long polymers of millions of nucleotides that are very stable. In fact, DNA is so stable that it can be isolated from samples dating back more than a million years. Ancient DNA may have suffered from frequent strand breaks, so that presently only fragments of a few hundred basepairs are left. But RNA would never have survived that long; clearly, DNA is a better polymer for long-term storage of information.

Margin box: DNA and cola

DNA and cola share the same major ingredients. The main ingredient in both is water. Pure DNA would be useless to the cell as it would be crystalline; in the cell, it is dissolved and hydrated. The second ingredient is sugar. Cola contains almost 20% sugar. The sugar component of the backbone of DNA is important for its solubility.
Phosphate is the third ingredient, present as phosphoric acid in cola (which has a pH of around 2.5), and as the phosphate backbone in DNA. Again, phosphate is soluble in water, though not as much as sugar. Their solubility dictates that the phosphate and sugar are on the outside of the DNA helix. The fourth and last ingredient in both cola and DNA consists of flat, planar bases. In cola it is caffeine, which is not very soluble in water ('hydrophobic'), and is present in very small amounts only. However, if enough were added, caffeine molecules would stack on top of each other, just like adenine does. This stacking forms a single-stranded helix in solution. Likewise, the planar bases in DNA form a helical structure by stacking on top of each other.

Three phosphorylation forms of nucleotides exist

Nucleotides exist in free form in the cell and are also incorporated into DNA. They can be present in the form shown in Figure 1.2, with one phosphate group attached to the C-5' of the sugar. Alternatively, a second or even third phosphate is attached to the first one. When three phosphates are present, the nucleotides are maximally charged with chemical energy; these forms are called deoxyadenosine triphosphate (dATP), deoxyguanosine triphosphate (dGTP), deoxycytidine triphosphate (dCTP) and deoxythymidine triphosphate (dTTP). (The general term 'nucleotide' implies that at least one phosphate is attached; the term 'nucleoside' is reserved for molecules without any phosphate group.) When the combination of all four deoxynucleotides is meant, the general term dNTP is commonly used, whereby N is the symbol for a nucleotide that can be either G, A, T or C. The configurations with two phosphates (deoxythymidine diphosphate, dTDP, for instance) are intermediates of nucleotide biosynthesis, whereas the monophosphate nucleotides dAMP, dGMP, dTMP, and dCMP are produced when DNA is degraded.
The building blocks of RNA are named with similar terminology, but the 'deoxy' in their name is dropped (since the ribose has both hydroxyl groups), so now we have ATP, GTP, UTP, and CTP, and then ADP, GMP, and so on. The nucleotide ATP serves several roles in a cell. It is not only a building block of RNA but it also functions as the general energy carrier of all cells. Since adenine is the base most readily formed by natural processes in the universe, it is not surprising that it is also the one most commonly used in biological systems.

Figure 1.2. The building blocks of the nucleic acids DNA and RNA. At the top, a building block of DNA (left) and RNA (right) is shown. These are composed of a variable base attached to a ribose sugar, to which a phosphate group is bound at the C-5'. Note that the deoxyribose of DNA lacks the hydroxyl group at C-2', which is present in the ribose of RNA, as indicated by the arrows. In the middle, the variable bases are shown. Two purines and two pyrimidines are found in DNA and RNA, but whereas DNA contains the pyrimidine thymine, RNA contains uracil. At the bottom, the basepairing is shown between adenine and thymine (left), which are held together by two hydrogen bonds, and between guanine and cytosine (right), which are connected by three hydrogen bonds. Note that the width of both the AT and GC basepairs is the same.

1.3. The two strands of DNA form a helix

DNA can exist as very long polymers, containing millions of nucleotides. A single strand of DNA is composed of nucleotide bases that are connected through their ribose sugars, each separated by a single phosphate.
This phosphate is bound, via an oxygen atom, to the C-3' of one nucleotide and the C-5' of the next; a chain is thus formed that connects all nucleotides like beads on a string. Two such strands are held together by the hydrogen bonds that spontaneously form between opposite purine and pyrimidine bases, provided their sequences match. Thus, where adenine (A) is present on one strand, thymine (T) is present on the other, and likewise guanine (G) will find cytosine (C) opposite. These purine-pyrimidine pairs are called the basepairs of double-stranded DNA. The result is a helical ladder in which the ribose-phosphate backbone forms the ropes of the ladder and the planar basepairs, sandwiched and slightly twisted one on top of another, form the steps. The basepairs between guanine (G) and cytosine (C) are formed by three hydrogen bonds, whereas the basepairs of adenine (A) and thymine (T) are held together by two hydrogen bonds only. One hydrogen bond requires approximately 1 kcal/mol of energy to break, so it appears that each GC basepair stabilizes the double strand by 3 kcal/mol, while AT basepairs contribute only 2 kcal/mol. The breakage of all basepairs of a stretch of DNA, which would separate it into two single strands, is called DNA melting or DNA denaturation.

Destacking energy and hydrogen bonds keep the two strands of DNA together

It is easy to see why most biologists think that AT regions of DNA will melt more readily than GC regions, but this view is too simplistic. The most important stabilizing force in DNA is, in fact, the stacking of the bases on top of each other, and the hydrogen bonds between the bases contribute only part of the stabilization. As stated earlier, DNA bases will spontaneously stack on top of each other, stabilized by resonance from the conjugated pi systems of the electrons in the flat, planar molecules.
This forms a stable hydrophobic structure in solution; it takes a fair amount of energy to separate the two strands in solution. How much so-called 'destacking' energy is required depends on the combination of bases. For instance, guanine stacked on top of cytosine (in other words, a GC basepair stacked on top of a CG pair in double-helical DNA) requires around 14 kcal/mol to be separated into the two strands, much more than the energy required to simply break the H-bonds between the bases. For a pyrimidine followed by a purine, the stacking energy is not nearly as strong; in the case of a TA step (also written as TpA: a TA basepair followed by an AT pair), this is less than 4 kcal/mol. So it is not just the breaking of the H-bonds between the bases that is important to open up the helix: the stacking energy of the nucleotides is a major determinant, too. Most biologists would expect a sequence like 'TATA' to melt as rapidly as 'TTTT', and more readily than 'GCGC'. However, a stretch of T's on one strand (A's on the other) will require more energy to destack than a TATA repeat, so melting of the DNA helix is strongly dependent on the order of the bases, not just on AT content. Since it takes energy to separate two DNA strands, it should come as no surprise that two single-stranded DNA molecules mixed in solution will spontaneously find each other to produce double-stranded DNA, at least when sufficient 'matches' for basepairing are present. An occasional mismatch, an incorrect base that cannot pair with its opposite partner, will introduce an irregularity in the double helix (the ribose-phosphate backbone will bulge out when two large purines face each other, or bulge inwards when two small pyrimidine bases are combined), but it will not destroy the overall basepairing.
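To make the comparison of 'TATA', 'TTTT' and 'GCGC' concrete, the following minimal sketch counts hydrogen bonds using only the numbers quoted in the text (two bonds per AT pair, three per GC pair, roughly 1 kcal/mol per bond) and lists the stacked dinucleotide steps of each sequence. The function names are illustrative, and no stacking energies are computed, because the text gives values for only two steps.

```python
# Hydrogen bonds per base pair, as stated in the text:
# A pairs with T via two bonds, G pairs with C via three.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}
KCAL_PER_H_BOND = 1.0  # approximate value quoted in the text


def hydrogen_bond_energy(seq):
    """Energy (kcal/mol) attributed to hydrogen bonds alone, one strand given."""
    return sum(H_BONDS[base] for base in seq.upper()) * KCAL_PER_H_BOND


def dinucleotide_steps(seq):
    """Consecutive stacked steps, e.g. 'TATA' -> ['TA', 'AT', 'TA']."""
    seq = seq.upper()
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


for s in ("TATA", "TTTT", "GCGC"):
    print(s, hydrogen_bond_energy(s), dinucleotide_steps(s))
# TATA 8.0 ['TA', 'AT', 'TA']
# TTTT 8.0 ['TT', 'TT', 'TT']
# GCGC 12.0 ['GC', 'CG', 'GC']
```

Hydrogen-bond counting gives 'TATA' and 'TTTT' the same score, although their stacked steps differ. This is why a model based on base composition alone cannot reproduce the sequence dependence of melting described above.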
How many mismatches are required to prevent basepairing, and keep the two DNA strands separated, depends on the temperature (at higher temperature, both the stacking interactions and the hydrogen bonds become less stable), the salt concentration (lower salt concentration destabilizes basepairing, because fewer cations are available to shield the negatively charged backbones from each other) and, as discussed above, the DNA sequence. For two single-stranded DNA molecules that come together to produce a double-stranded structure, the term hybridization is used. For the reverse event, a double-stranded DNA molecule separating into two single strands, various terms exist. It is called 'destacking' when the energy of the reaction is described; otherwise 'melting' (as in melting temperature, or melting curve) or the more general term denaturation is used. DNA can be denatured by increasing the temperature or lowering the salt concentration. (In analogy to this general term, DNA hybridization is also sometimes described as 'renaturation'.)

Three different DNA helix structures exist

The most common form of DNA is double-stranded DNA, which forms a helix structure. The right-handed DNA helix that is usually shown (mostly highly simplified) is only one of various possible structures of double-stranded DNA. It is more accurately called B-DNA, or relaxed double-helical DNA. It is the average configuration of DNA in solution, and it is the most commonly found conformation of DNA in a living cell. In the DNA helix a minor groove and a major groove alternate, as shown in Figure 1.3; one complete turn of the helix, spanning one major and one minor groove, covers about 10.5 basepairs on average in relaxed B-DNA. However, most DNA in cells is more than a thousand-fold compacted, and exists as superhelical DNA, also called supercoiled DNA, in which the DNA is wrapped tightly around proteins and is twisted.
In this configuration there can be more, or fewer, basepairs to a complete turn, depending on the direction of the supercoiled twists. Moreover, the twists can again be twisted, held together by proteins, to produce a very condensed structure. Hence, superhelical DNA is more compact than a linear form of B-DNA. How supercoiled DNA is formed and maintained will be further explained in the next chapter.

Figure 1.3. Different structures of the DNA helix. To the left, a simplified schematic of the helical structure of B-DNA is shown, as proposed by Watson and Crick in 1953. To the right, B-DNA, A-DNA and Z-DNA are shown, projected from the side of the helix (top) and looking down the helix axis (bottom). Modified after W. Saenger: Principles of Nucleic Acid Structure, with permission (Springer-Verlag, New York, 1983, ISBN 0-387-90761-0, ISBN 3-540-90761-0).

An alternative helix formed by DNA, but more frequently observed with double-stranded RNA, is A-DNA (because this was the first structure to be discovered, by Rosalind Franklin, its name received the 'A-status', whereas the more commonly known B-DNA was made famous by James Watson and Francis Crick). A-DNA is also a right-handed helix, but the helix is shorter and fatter than B-DNA, and forms a hollow tube as shown in Figure 1.3. It also contains a major and a minor groove, this time with a periodicity of 11 basepairs per turn of the helix. DNA-RNA hybrid helices are usually of the A-type. The nucleotide sequence to some degree also affects the structure of a DNA molecule. Purine stretches (multiple A or G nucleotides on one strand) tend to form localized A-DNA helices; such structures can be involved in recombination. For instance, the long terminal repeats of transposable elements are usually purine stretches.
The major groove of DNA will allow another DNA strand to fold into it, allowing for triple-stranded helices, which are intermediates in recombination. In contrast, DNA with short phased A-tracts of about 4 nucleotides long can be curved; this is caused by a change in tilt and pitch of the bases stacking on top of each other, and these changes, if phased with the pitch of the DNA helix, can result in curves. Curved DNA is common in the cell and it serves a purpose: a stretch of curved DNA can bring separated sequences into each other's vicinity. Finally, a third structure of DNA is known as Z-DNA. This is a left-handed helix that is uncommon in most bacterial DNA, but it is found in sequences of G-C repeats that frequently occur in eukaryotic DNA. Some bacteria, for instance Burkholderia species (which belong to the Beta-Proteobacteria), also contain these GC stretches, and part of their DNA could form Z-DNA. The two strands of Z-DNA tend to melt more easily, which provides another example of DNA with more G-C than A-T basepairs melting more easily, since alternating C's and G's can have a destabilizing effect on the double helix. Figure 1.3 illustrates the differences between A-DNA, B-DNA and Z-DNA. The shape of a DNA helix can thus be right- or left-handed or curved, and it can even fold into three-stranded helices. All of these different conformations are dictated by the DNA sequence, and repeats of various sequences are frequently involved in the formation of such three-dimensional structures. For example, Z-DNA can be formed by CG repeats, an A-type helix can be favored by short repeats of purines or pyrimidines on the same strand, and curved DNA can be formed by phased A-tracts repeated within the pitch of the helix.

DNA strands are directional

The DNA strands that together form the helix both have a direction, which is simpler to see when the helix is flattened out, as in Figure 1.4.
Now it is obvious that the two strands run in opposite directions: DNA contains two anti-parallel strands. One can follow one strand from the phosphate group, which is attached to the C-5' of the first nucleotide's deoxyribose, to the C-3' hydroxyl group of the last nucleotide depicted. The other strand goes in the other direction, even if one turns the molecule upside down. This is important to notice: the information stored in a DNA sequence is coded by the order, or sequence, of the nucleotides, but one has to read these in the correct direction. A DNA strand is biologically meaningful in the direction of 5'-phosphate to 3'-OH, and this is also the direction in which we write a DNA sequence, by default. The molecule schematically drawn in Figure 1.4 can thus be written as 5'-ATGC-3', or ATGC for short, but equally accurate is GCAT: it only depends on which strand is written down. These two sequences are complementary, and together represent one and the same double-stranded DNA molecule. To write down a complementary sequence from a given, or 'template', sequence, one has to read from right to left and exchange every nucleotide for its complementary partner. Thus, ATGTGCTAA will become TTAGCACAT. Box 1.2 describes some more chemical and physical properties of DNA.

Figure 1.4. The direction of the two strands in a DNA molecule. The two arrows indicate the direction of the ribose-phosphate backbone from the 5'-phosphate to the 3'-OH end for each strand. The sugar and phosphate groups are only schematically drawn. The bases are colored grey (no side groups are shown), and the dotted lines connecting them represent the hydrogen bonds that keep the two strands together. Notice how the connections between bases and backbone are twisted, as a result of this planar representation.
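The rule for writing down the complementary strand (read right to left and swap each base for its partner) can be sketched in a few lines of Python. This is a minimal illustration that assumes an uppercase or lowercase sequence containing only A, C, G and T; the function name is illustrative.

```python
# Translation table that swaps each base for its pairing partner (A<->T, G<->C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' -> 3'.
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGTGCTAA"))  # TTAGCACAT
print(reverse_complement("ATGC"))       # GCAT
```

Applying the function twice returns the original sequence, which reflects the point made in the text: both ways of writing describe the same double-stranded molecule.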
The sequence shown here reads either ATGC or GCAT, depending on which strand is written down. Both ways of writing describe the same molecule.

Information Box 1.2: The chemical and physical properties of DNA

Pure DNA is a crystalline white powder. Dissolved in water it can become viscous, with the viscosity depending on the concentration and also on the length of the DNA strands. The DNA inside a bacterial cell is quite viscous and would have the consistency of 'snotty gel', like a 0.8% agarose gel. As the name suggests, DNA is an acid, though it contains purines and pyrimidines, which are both weakly alkaline. The acid nature of DNA is due to its phosphates, which also give the backbone an overall negative charge when DNA is dissolved in water. In solution this negative charge can be neutralized with sodium (Na+) or magnesium (Mg2+) ions, though in the cell the positively charged polyamines (spermine and spermidine), as well as chromatin proteins rich in positively charged arginine and lysine, neutralize the negative charge of DNA. The DNA helix takes a fair amount of energy to melt due to the stabilization from the stacking of the bases on top of each other. A force of 65 piconewtons (pN) has been found to be enough to mechanically separate the two strands of double-stranded DNA, though this value depends on the DNA sequence, on the temperature and on the presence of stabilizing cations. Adenine stacks spontaneously form a helical structure in single-stranded poly-A DNA that has elastic properties when stretched; it can withstand forces of 115 pN before it completely relaxes. The deoxyribose-phosphate backbone can withstand forces much stronger than that, though it is sensitive to shearing and chemical degradation. One of the early methods of DNA sequencing was based on chemical degradation of the backbone, using hydrazine (a rocket fuel).
Potassium permanganate (KMnO4, a strong oxidizing agent) can also be used to break the phosphate backbone. DNA is stable at a wide range of pH values, although the two strands of the double helix will separate at higher pH, depending on the salt concentration. DNA in solution can also withstand a wide range of temperatures, from below freezing to boiling, though the two strands would again separate at high temperatures. The deoxyribose backbone cannot easily be broken, even by boiling (the ribose backbone of RNA is far more fragile). The most astonishing physical property of DNA is its size. A human chromosome is a single DNA molecule; when stretched out as a B-DNA helix, it would measure only 2 nanometers wide but, on average, it would have a length of almost 5 cm (or 25,000,000 times longer than it is wide). If all 46 chromosomes of a single diploid human cell were thus stretched out and aligned head-to-tail, they would cover a distance of over 2 meters. A bacterial chromosome thus stretched would measure between 0.054 mm (54 microns) and 5.35 mm (5,350 microns), depending on the species, though most of these molecules are circular.

DNA in the cell is often methylated

DNA in the cell can be chemically modified by specific enzymes. Typically, methyl groups are attached to the bases of DNA, resulting in the structures shown in Figure 1.5. E. coli and other Gamma-Proteobacteria can attach a methyl group (-CH3) to the N6 of adenine, to produce N6-methyladenine (6-meA); some archaea also methylate their DNA this way. Some E. coli strains can additionally methylate C5 of cytosine to produce C5-methylcytosine (5-meC). Other species produce N4-methylcytosine. The responsible enzymes in E. coli are called Dam (for DNA adenine methylase) and Dcm (DNA cytosine methylase). Not every adenine (or cytosine) in the DNA of cells containing these enzymes is methylated: the addition of a methyl group is sequence-dependent.
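Because methylation is sequence-dependent, candidate sites can be located by scanning a sequence for the recognition sequences of the two E. coli enzymes discussed in this section (GATC for Dam; CCAGG and CCTGG for Dcm). The sketch below is only an illustration: the function name and the example sequence are invented here, and a match marks a potential site, not evidence that the site is actually methylated in a cell.

```python
import re

# Recognition sequences as named in the text for the E. coli enzymes.
RECOGNITION_SITES = {"Dam": ["GATC"], "Dcm": ["CCAGG", "CCTGG"]}


def find_candidate_sites(seq):
    """Return 0-based start positions of recognition sequences per enzyme.

    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    hits = {}
    for enzyme, motifs in RECOGNITION_SITES.items():
        positions = []
        for motif in motifs:
            # A lookahead reports overlapping occurrences as well.
            positions += [m.start() for m in re.finditer(f"(?={motif})", seq)]
        hits[enzyme] = sorted(positions)
    return hits


example = "TTGATCAACCAGGTTGATCCCTGGA"  # made-up sequence
print(find_candidate_sites(example))
# {'Dam': [2, 15], 'Dcm': [8, 19]}
```

Scanning one strand is sufficient here: GATC reads the same on both strands, and CCAGG on one strand is CCTGG on the other.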
Dam methylates the A in the sequence GATC and Dcm methylates the second (internal) C in CCAGG and CCTGG. Even so, less than 1% of all copies of these recognition sequences that are present in the genome are normally methylated. Methylation of DNA by Dam serves multiple roles in E. coli (the function of Dcm is unknown). The expression of some genes is regulated by GATC methylation. The cell also recognizes newly produced DNA that has not yet been methylated, and this is a way to regulate the start of DNA replication (the production of a chromosome copy); methylation is also used to repair mistakes in newly made DNA copies. Finally, DNA methylation is a way to mark the DNA as 'self', so that it can be distinguished from foreign, incoming DNA, which can be attacked while the cell's own methylated DNA is protected. However, not all bacteria methylate their DNA, and even different strains within a species can produce differently methylated DNA.

Figure 1.5. Methylated bases are the products of different DNA methylases.

1.4. DNA contains biological information beyond its sequence

We represent DNA sequences as letters, and although DNA sequences don't contain empty spaces, we can interpret them as if they form words, and words can form sentences. For this analogy we could consider DNA 'words' to be functional domains within proteins, and a sentence could serve as a complete protein. Non-coding sequences that separate genes would be the equivalent of the periods that separate sentences in printed texts. All text that together might produce a book could then be a genome. Distinct chapters or volumes could even represent different chromosomes. Indeed, the sequence of the human genome was presented to the public as the 'blueprint of life' and its size was described as a book of 23 volumes (for the 23 chromosomes) that together would easily fill 150,000 pages.
If the sequence were printed in small print, it would indeed fill that much paper. However, text is only a symbolic representation of DNA, and other ways to represent DNA can be just as informative as text, or even more informative (though less practical), as described in Box 1.3.

Information Box 1.3. Representing DNA sequences

DNA consists of only four building blocks, which are combined into strings to produce a particular sequence. We are used to representing these as four letters in the form of text. But that is only one of many possible representations, and though it is the most practical way, there are alternatives that could be considered.
• DNA can be represented as thin vertical lines of four different colors. That would allow many more nucleotides on a line of text without losing readability. Using lines 1/3 millimetre wide and 2 mm high, 500 nucleotides would easily fit on a line, and 250 lines would make a page containing 125,000 nucleotides. The book of the human genome would shrink from 150,000 pages of text to 'only' 20,000 pages of colour, in which regions enriched for a particular base, or repeat patterns, would be easily visible. Try that with a book containing DNA letters!
• In a DNA atlas, we represent the entire chromosome as one circular line, averaging structural and compositional information along the chromosome by colours. This can visualize a base skew in bacterial chromosomes, for instance, or local variation in base content.
• DNA nucleotides can also be represented by four tones of different pitch. A human ear trained to recognize melody would be able to recognize repeats, and a single-nucleotide stretch would increase the length of the note. However, if a genome were played this way, the 'concert of nature' would not be very appealing, and music is not a suitable medium to work with.
• The representation of beads on a string would more closely resemble the linear nature of DNA.
Using coloured beads as thin as a hair, you could hold the human genome in your hand, and perhaps a synthetic thread could be developed that aligns to its complementary sequence, to give double-strand DNA and illustrate DNA melting. But a genome represented like that would quickly get tangled and, unlike in the cell, that tangling would not be ordered.

Whatever representation we choose, it doesn't capture the true amount of information stored in a DNA sequence, because that is more than just the sequential order of nucleotides. The most immediate shortcoming of DNA as mere letters is that we only show one strand, while DNA in nature is nearly always found as double-strand DNA. Since both strands usually contain genes, this means that perhaps half of all genes are represented as their complementary sequence in a genome DNA file. When searching for a gene in a database, this isn't immediately obvious since a single gene is usually reported with its coding strand, but when searching for a gene in a complete genome's DNA sequence, one has to look for the complementary sequence for all genes located on the strand that is not given by the sequence. The order of nucleotides affects the three-dimensional structure of a DNA molecule, and that conveys information to the regulatory proteins of the cell. We have no way to represent that information in text files, and other means of visualization are required for this. DNA modification, most frequently by base methylation, is another way in which biological information is stored, especially if that information has to be erasable (in contrast, DNA sequences are permanent, ignoring the odd mutation). But again, DNA modification is not represented by a DNA sequence given as letters. Proteins that are bound to DNA are part of the temporary biological message of a chromosome, too.
A DNA sequence can be temporarily 'hidden' from the cell by proteins that shield it or fold it away, so that the information stored in that sequence is currently not available. Conversely, proteins can open up the DNA helix locally, to stimulate gene expression. Again, there is no way to show this in a DNA sequence.

1.5. Repeats code for structures in DNA

Stretches of nucleotides that repeat themselves in a molecule are frequent in nature, and they can affect the structure of DNA. The simplest kind of repeat is a stretch of repeated single nucleotides (GGGGG or TTTTTTT), where the repeat unit is one nucleotide. This is also called a homonucleotide stretch, and such stretches are quite common in DNA. Another simple repeat consists of a unit of two nucleotides, also called a dinucleotide repeat, for instance GAGAGAGA. Repeats with a unit of two or more nucleotides are called 'tandem repeats', and if the unit is repeated more than 10 times, they are called 'micro-satellites' (the unit can be repeated up to 60 times). These are particularly common in eukaryotic DNA. For example, (CG) repeats (also called 'CpG islands') are involved in chromatin structures, and their methylation can result in epigenetic effects. In bacteria, CpG islands are uncommon, though a so-called CpG motif (composed of two purines, followed by C and G and then two pyrimidines) is commonly found in bacterial DNA. When unmethylated, this motif has immunological properties and is sometimes loosely described as 'CpG DNA'. In bacteria, true CpG islands have so far only been described for Burkholderia species, but they are quite common in eukaryotes. Triplet repeats (e.g., CGGCGGCGG) in eukaryotic DNA can 'expand', due to slipped-strand structures, and in some cases this can lead to disease in humans. Expansion of the CGG triplet repeat shown here can cause fragile X syndrome, which is the most common inherited form of intellectual disability in humans.
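The simple repeat classes described above (homonucleotide stretches, dinucleotide repeats and triplet repeats) can be located with a regular expression that captures one unit and then requires it to recur head to tail. The following is a minimal sketch under stated assumptions: the function name, the default threshold and the example sequence are our own, and only the letters A, C, G and T are considered.

```python
import re


def find_tandem_repeats(seq, unit_len, min_copies=3):
    """Return (start, unit, copies) for each run of a repeated unit.

    Hypothetical helper; positions are 0-based.
    """
    # Capture one unit of the requested length, then demand further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    found = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Skip units that are themselves a shorter repeat (e.g. GG or GAGA).
        if unit in (unit + unit)[1:-1]:
            continue
        found.append((match.start(), unit, len(match.group(0)) // unit_len))
    return found


# Made-up example containing a T stretch, a GA repeat and a CGG repeat.
example = "A" + "T" * 7 + "CC" + "GA" * 4 + "TT" + "CGG" * 3 + "AT"
print(find_tandem_repeats(example, 1, min_copies=5))  # [(1, 'T', 7)]
print(find_tandem_repeats(example, 2))                # [(10, 'GA', 4)]
print(find_tandem_repeats(example, 3))                # [(20, 'CGG', 3)]
```

The sketch only finds perfect repeats; imperfect repeats, in which the unit varies slightly between copies, would need an approximate matching method.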
A repeat unit can of course also be longer than a few nucleotides, and can even comprise complete genes. Moreover, the repeat unit can be found on the same strand, or on the complementary strand of double-strand DNA. A few examples of different types of repeat sequences are presented in Figure 1.6. A particular sequence that is repeated on the same strand is called a direct repeat. These frequently cause duplications or deletions during DNA replication. When a DNA sequence is found once on one strand and repeated on the other, this is called an inverted repeat. An inverted repeat without a spacer is called a palindrome. In this case the sequence reads the same for both strands (provided they are both read from 5' to 3', as is the convention for reading DNA sequences). Short palindromic sequences are frequently recognition sites for restriction enzymes. A mirror repeat, on the other hand, produces the same sequence only if the repeated unit is read from 3' to 5', which biologically doesn't make sense. Mirror repeats are less common than inverted repeats in bacteria, and the same applies to everted repeats, where the mirrored repeat is found on the complementary strand. A mirror repeat, especially when it is purine-rich on one strand, can fold back on itself, forming a single-stranded half and a region of triplex DNA in which the third strand is wound into the major groove of the helix.

Figure 1.6. Different kinds of DNA repeats. A direct repeat (A) consists of a sequence repeated on the same strand. In the figure, a direct repeat with a unit consisting of 15 nucleotides is shown with a spacer of four undefined nucleotides. In the case of an inverted repeat (B), the repeat unit is found on the complementary strand. An inverted repeat without a spacer (C) produces a palindromic sequence, in which both strands read the same sequence.
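The difference between a palindrome and a mirror repeat can be checked mechanically: a palindrome equals its own reverse complement, whereas a mirror repeat equals its plain reversal on the same strand. The sketch below is illustrative only; the function names are our own, and the test sequences are GAATTC (the recognition site of the restriction enzyme EcoRI) and a made-up purine-rich mirror repeat.

```python
# Translation table that swaps each base for its pairing partner.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def is_palindrome(seq):
    """True if both strands read the same 5' to 3' (hypothetical helper)."""
    seq = seq.upper()
    return seq == seq.translate(COMPLEMENT_TABLE)[::-1]


def is_mirror_repeat(seq):
    """True if the strand reads the same when read backwards, 3' to 5'."""
    seq = seq.upper()
    return seq == seq[::-1]


print(is_palindrome("GAATTC"))     # True
print(is_mirror_repeat("GAATTC"))  # False
print(is_palindrome("AAGGAA"))     # False
print(is_mirror_repeat("AAGGAA"))  # True
```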
Mirror repeats (D) and everted repeats (E) are less common. In these cases, the repeat sequence is read in the 'wrong' direction, from 3' to 5', which is indicated by the grey arrows. (F) shows an imperfect direct repeat, in which a slight variation is found in the repeat unit.

Structures in DNA due to repeat sequences

Inverted repeats can form hairpin structures, whereby the two repeat units of one strand basepair with each other, as shown in Figure 1.7. Hairpins in single-strand molecules are common in RNA molecules. When inverted repeats form hairpins in double-strand DNA, a cruciform is the result, as shown in the middle of the figure. Finally, two inverted repeats (or palindromes) that are again repeated can result in slipped-strand structures, as shown in the figure. Such structures can result in duplication or deletion, when they form during DNA synthesis in the cell.

Figure 1.7. Structures in DNA. In (A) a hairpin structure is shown, which is the result of a single-strand molecule basepairing with an inverted repeat. One repeat unit is shown by an arrow. In (B) two hairpins in a double-strand DNA molecule build a cruciform. In (C), two inverted repeats have resulted in a slipped-strand structure.

1.6. Concluding remarks

DNA is the carrier of genetic information in all living cells. It exists as a double helix, of which there are different structural variants. DNA consists of the four nucleotides A, G, C and T, which build two strands. The two strands are kept together by the bases of the nucleotides that pair with each other. These strands have a direction, and are paired antiparallel. The structure of DNA, which is sequence-dependent, mostly dictates its function. Thus, sequence, that is the order of nucleotides, dictates structure, and structure dictates function.
However, when we 'read' DNA as sequences of letters, we miss a lot of the information available to the cell. This is one of the major shortcomings of our way of symbolizing DNA.

Recommended reading:
DNA Mystique: The Gene as a Cultural Icon. Nelkin D and Lindee MS. W.H. Freeman and Company, New York, 1995.
Astrobiology: a brief introduction. Plaxco KW and Gross M. The Johns Hopkins University Press, Baltimore, MD, USA, 2006.
DNA structure: A-, B- and Z-DNA helix families. Ussery DW. Encyclopedia of Life Sciences. Macmillan Publishers Ltd, Nature Publishing Group, 2002.
Bias of purine stretches in sequenced chromosomes. Ussery DW, Soumpasis DM, Brunak S, Staerfeldt HH, Worning P, Krogh A. 2002. Comput Chem. 26:531-541.
Physical maps of chromosomes. Ussery DW. 2009. In: Encyclopedia of Life Sciences. John Wiley & Sons, Ltd: Chichester. DOI: 10.1002/9780470015902.a0001425.
Strand misalignments lead to quasipalindrome correction. Van Noort V, Worning P, Ussery DW, Rosche WA, Sinden RR. 2003. Trends Genet. 19:365-369.