Thinking of Biology

Decoding "coding": information and DNA

by Sahotra Sarkar

December 1996

Most biologists would probably be more bored by than surprised at, and certainly not impressed by the originality of, the claim that the central core of the conceptual structure of contemporary molecular biology can be encapsulated in the following three precepts:

• All hereditary information resides in the DNA sequences of organisms.
• This information is transferred from DNA to RNA through the process of transcription, and from RNA to protein through translation.
• This information is never transferred from protein to nucleic acid sequences.

Following Crick (1958), the last precept is usually called the "Central Dogma" of molecular biology. These three precepts are so universally accepted that they usually find their way into introductory biology texts. Nevertheless, they are at best misleading, and at worst simply vacuous, in the sense that they can play no significant explanatory role in molecular biology. The reason for this is that there is no clear technical notion of "information" in molecular biology. "Information" is little more than a metaphor that masquerades as a technical concept and leads to a misleading picture of the conceptual structure of molecular biology.

Metaphors are ubiquitous in science. When they provide succinct or easily comprehensible accounts of complex technical concepts, they are particularly useful for communicative or didactic purposes. However, when they serve only as surrogates for nonexistent technical concepts, their influence is less than benign. In molecular biology, "information" and associated terms (especially "coding") belong to the latter class of metaphors. This claim is, no doubt, a bald one: to develop the arguments in its favor requires an excursion into the history of molecular biology, a discussion of how the term "information" came to be introduced, and why, for all its initial plausibility as a useful theoretical concept, there is little to recommend its continued use today.

Information and molecular biology

Historically, the term "information" entered molecular biology as part of a putative theory of biological specificity. By around 1930, it had become clear that molecular interactions in living organisms are highly specific in the sense that particular molecules interact with exactly one, or at most a few, reagents. Enzymes act on specific substrates. Living organisms produce antibodies that are highly specific, not only to naturally occurring antigens, but also to artificial antigens (Landsteiner 1936). Even the action of genes was sometimes described using "specificity": genes specified precise phenotypes to different degrees of accuracy called their "specificity" (Timofeeff-Ressovsky and Timofeeff-Ressovsky 1926). In genetics, the ultimate exemplar of specificity became the gene-enzyme relationship (Beadle and Tatum 1941): "one gene-one enzyme" was perhaps the most important organizing hypothesis of early molecular biology.

By the end of the 1930s, a highly successful theory of specificity (and one that remains central to molecular biology) emerged. Due primarily to Linus Pauling (e.g., Pauling 1940), although with many antecedents, this theory claimed that the behavior of a macromolecule is determined by its conformation, and that what mediates biological interactions is a precise lock-and-key fit between the shapes of molecules. In the 1940s, when no three-dimensional structure of a biological macromolecule had yet been determined, the conformational theory of specificity was speculative. The demonstration of its approximate truth for a wide variety of interactions, which came in the late 1950s and 1960s, was one of molecular biology's most significant triumphs. Just as "one gene-one enzyme" was the archetypal slogan of early molecular biology, "structure determines function" came to be the dominating principle of the field during its triumphant 1960s.

Meanwhile, back in 1944, in What Is Life?, Erwin Schrödinger had introduced a conceptual scheme that raised the possibility of a startlingly different source of specificity. Schrödinger asked how so tiny an object as the nucleus of a fertilized cell could contain all of the specifications (i.e., instructions) necessary for normal development of an adult organism. He stated that there existed in the nucleus some structure whose organization was interpreted as "an elaborate code-script," which he compared to the Morse code (Schrödinger 1944). Although he was willing to countenance codes in more than one dimension, even a linear code based on a 5-letter alphabet and words of up to 25 letters could generate more than 10^17 patterns. Thus the arrangement of the units rather than their physical shape became the source of specificity in Schrödinger's model. In the postwar era, when many scientists who were initially trained in physics turned their attention to biology for the first time, What Is Life? was influential in setting the agenda of a new biology (Sarkar 1991).

The 1940s also saw an explosive growth of microbial genetics, starting with Luria and Delbrück's (1943) demonstration of spontaneous mutagenesis in bacteria; continuing, especially, with Avery and colleagues' (1944) demonstration of DNA as the likely genetic material; and culminating with Joshua Lederberg's discovery of recombination in bacteria (Lederberg and Tatum 1946a, b). "Transformation," "induction," and "transduction" were some of the new terms introduced to describe these phenomena (Ephrussi et al. 1953). In an attempt to navigate through this terminological morass, Ephrussi et al. (1953) suggested that the term "interbacterial information" replace them all.
This was the first modern use of "information" in genetics. Ephrussi et al. (1953, p. 701) emphasized that the use of this term "does not necessarily imply the transfer of material substances, and [that they] recognize the possible future importance of cybernetics at the bacterial level."

Immediately after the publication of Ephrussi et al. (1953) (in fact, in the next issue of Nature), Watson and Crick (1953a) published the double helix model of DNA. The base pairing (A:T and C:G) that they proposed showed a possible way in which the specificities between the two helices could be involved in the formation of exact replicas. Moreover, in their second paper on the model (Watson and Crick 1953b, p. 964), they went on to use "information" explicitly, and defined it implicitly as what the "code" carried: "The phosphate-sugar backbone in our model is completely regular but any sequence of the pairs of bases can fit into the structure. It follows that in a long molecule many different permutations are possible, and it therefore seems likely that the precise sequence of bases is the code which carries the genetic information."

"Information" was finally defined explicitly by Crick in 1958, who identified it with the specification of a protein sequence. Crick's (1958) concern was the synthesis of proteins. There were three separate factors involved, he argued: "the flow of energy, the flow of matter, and the flow of information" (Crick 1958). The former two exhausted the physics and chemistry of the situation, whereas information was peculiar to biological systems. Crick (1958, p. 144) defined "information" with more care than ever before in this context: "By information I mean the specification of the amino acid sequence of the protein." He took it for granted that the genetic information was encoded in a DNA sequence. The physics and chemistry of folding of a protein, Crick hypothesized, were purely a result of its amino acid sequence.
This is the well-known "sequence hypothesis" (which remains unproved because of the continued insolvability of the protein folding problem). Finally, he put this formalized notion of information to additional use:

"The Central Dogma ... states that once 'information' has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid, is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein." (Crick 1958, p. 153; italics in the original)

This assumption about the one-way transfer of information did not arise from physical considerations. Rather, it was Crick's way to give a molecular characterization of neo-Darwinism. "[I]t can be argued," he explicitly observed, "that [the protein] sequences are the most delicate expression possible of the phenotype of an organism" (Crick 1958, p. 142). Therefore, the one-way transfer of information ensured that changes that occur initially at the phenotypic level cannot induce genotypic changes and be inherited.

Crick implicitly distinguished two different types of specificity: that of each DNA sequence for its complementary strand, as modulated through base pairing; and that of the relationship between DNA and protein. The latter was modulated by genetic information. This notion of information was combinatorial: all that was required was that the code perform its function from the sequence of bases in a DNA segment. Schrödinger's (1944) "arrangement" came to "encode information," resulting in a new theory of specificity, distinct from the conformational theory.

The theory of the genetic code

Even before Crick's (1958) explicit identification of "information" with "specificity," the same idea was being used systematically.
For instance, in 1955, during a symposium on enzymes, Mazia (1956) argued that the role of RNA was to carry "information" from the nuclear DNA to the cytoplasm for the synthesis of proteins. At the same conference, Spiegelman (1956) argued that the "informational complexity" required for the formation of proteins made RNA and DNA the only two plausible candidates for being templates for protein formation. Lederberg (1956) noted that "information" was what "specificity" was "called nowadays."

However, what fully embedded "information" into the conceptual framework of molecular biology was the idea of a genetic code, that is, the idea that the relationship between DNA and protein should be conceived of as one of coding. After the formulation of the double helix model in 1953, this idea entered molecular biology largely through the work of George Gamow and his collaborators, who attempted to deduce properties of the genetic code from plausible theoretical assumptions about information. (See Gamow et al. 1955 for a review, and Judson 1979 and Sarkar 1989, 1996, for critical discussions.) Broadly speaking, these assumptions were about the efficiency of information transfer and storage using DNA sequences. The basic problem was conceived of as that of determining, first, what sets of DNA nucleotides coded for individual amino acid residues and, second, whether these sets had any overlap when a long stretch of DNA coded for an amino acid residue sequence. Gamow formulated a variety of coding schemes from different sets of assumptions. By the late 1950s, these schemes were all shown to be incorrect.

BioScience Vol. 46 No. 11

The most interesting and, in some ways, influential of the theoretical attempts to establish the nature of the genetic code was the "comma-free" coding scheme of Crick et al. (1957), who introduced relatively sophisticated ideas about information storage and transmission.
Their only experimental criterion for judging success was the "magic number," 20, of amino acid residues that occurred naturally in proteins and, therefore, must be coded for by DNA. They rejected overlapping codes on experimental grounds and argued that it was natural to restrict attention to triplet codes because doublet ones would allow only 16 coding units, whereas triplet ones, allowing 64, were clearly sufficient.

From their point of view, there were two problems to be solved. First, there was the problem of potential degeneracy: if 64 triplets code for only 20 residues, then in some cases several triplets would code for a single residue. Although there was no experimental ground to reject a degenerate code (such as the one that is now accepted), they believed that it was undesirable. Second, there was the problem of synchronization: if A, C, G, and T are the four nucleotide base types, is the sequence ACCGTAGT to be read as ACC, GTA, ..., as CCG, TAG, ..., or as CGT, AGT, ...? Their solution to both problems was ingenious. By attempting to solve only the synchronization problem, they also removed degeneracy and, incidentally, obtained the magic number, 20.

Their solution to the synchronization problem began with the assumption that only some triplets had "sense," that is, could code for residues. Of the 64 possible triplets, the 4 with one base type, such as AAA, had to be rejected immediately: otherwise a sequence such as AAAACGA could potentially be read ambiguously as AAA, ACG, ... or as AAA, CGA, .... This left 60 possibly meaningful triplets. These segregate into 20 sets of three, each set consisting of a triplet and its two cyclic permutations (e.g., ACG, CGA, and GAC). If the possibility of ambiguity is to be avoided, only one triplet from each set can be meaningful. For instance, if ACG and CGA were both meaningful, ACGACGT could potentially be read as ACG, ACG, ... or as CGA, CGT, .... Thus, at most 20 meaningful triplets were possible.
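The counting argument above can be checked mechanically. The following is a minimal sketch, not a reconstruction of anything in Crick et al. (1957): the function names are hypothetical, and the 20-word set it builds (triplets xyz with x < y >= z under the arbitrary ordering A < C < G < T) is one convenient illustrative construction, not necessarily one of the sets those authors exhibited. It lists the three reading frames of the example sequence, counts the cyclic-permutation classes, and tests comma-freedom by brute force.

```python
from itertools import product

BASES = "ACGT"  # alphabetical order doubles as the (arbitrary) ranking used below


def reading_frames(seq, k=3):
    """Split seq into k-letter words starting at each of the k possible offsets."""
    return [[seq[i:i + k] for i in range(off, len(seq) - k + 1, k)]
            for off in range(k)]


def cyclic_classes(k=3):
    """Partition all k-letter words into sets of cyclic permutations."""
    seen, classes = set(), []
    for word in map("".join, product(BASES, repeat=k)):
        if word not in seen:
            cls = {word[i:] + word[:i] for i in range(k)}
            seen |= cls
            classes.append(cls)
    return classes


def is_comma_free(code):
    """True if no out-of-frame triplet spanning two adjacent code words is a code word."""
    code = set(code)
    for w1, w2 in product(code, repeat=2):
        pair = w1 + w2
        if pair[1:4] in code or pair[2:5] in code:
            return False
    return True


def example_code():
    """Illustrative set: triplets xyz with x < y >= z (assumed ordering A < C < G < T)."""
    rank = {b: i for i, b in enumerate(BASES)}
    return {x + y + z for x, y, z in product(BASES, repeat=3)
            if rank[x] < rank[y] >= rank[z]}


if __name__ == "__main__":
    print(reading_frames("ACCGTAGT"))             # the three candidate readings
    full = [c for c in cyclic_classes() if len(c) == 3]
    print(len(full))                              # classes of three cyclic permutations
    code = example_code()
    print(len(code), is_comma_free(code))         # size of the set and the brute-force check
    print(is_comma_free({"ACG", "CGA"}))          # two words from one class clash
```

Under these assumptions the sketch finds 20 classes of three cyclic permutations, and the illustrative set contains exactly one triplet from each class, so it reaches the upper bound of 20 derived in the text.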
Crick et al. (1957) went on to show that a solution with exactly 20 triplets was possible by explicitly constructing such a set. That solution was not unique. Crick et al. managed to find 288 different solutions, and there are in fact as many as 408 (Golomb 1962). Crick et al. (1957) went on to discuss a possible physical interpretation of the code that is historically important because it suggested that intermediate molecular complexes are involved in the translation of a nucleic acid sequence to a polypeptide one. This suggestion was one of the first statements of the "adaptor hypothesis," the adaptors eventually being identified with transfer RNA (tRNA). However, these physical considerations did not form the basis for their coding scheme. That basis was provided by the two desiderata of solving the degeneracy and synchronization problems. There were no experimental grounds for these; the only experimental constraint was the magic number of 20. The desiderata were based on two claims about the nature of biological information. The first was a claim of a certain kind of simplicity: a degenerate code is not as simple as a nondegenerate one. The second was an assumption about efficiency: if synchronization is not automatically determined by the nature of the code, then ambiguous translation can occur by a shift of the reading frame, resulting in errors. The comma-free code satisfied both desiderata.

But why settle for synchronization alone? In 1958, Max Delbrück pointed out that genetic information ultimately resides in double-stranded DNA rather than in only one strand (Golomb et al. 1958). Consequently, synchronization must simultaneously hold for both strands. The "dictionary," Delbrück insisted, would be better off if this additional constraint on synchronization were also satisfied. This constraint was a natural extension of comma-freedom for a single strand and was called "transposability."
Once it is imposed, triplet codes can no longer code for the 20 amino acid residue types known to occur in proteins. Triplet codes can now code for at most 16 residue types because the "converse" of each meaningful triplet, as determined by base pairing, must, when read from right to left, also belong to the set of coding triplets. This rules out many of the meaningful triplets of the original comma-free codes. Attention, therefore, shifted to quadruplet codes, and in 1962 Samuel Golomb reported from computer searches that there can be at least 57 quadruplet codons. Pursuing coding as a mathematical idea, Golomb (1962) introduced another scheme, that of "biorthogonal codes," based on the mathematical theory of Hadamard matrices, which had six nucleotides code for each amino acid residue. Its biological motivation remains mysterious. Golomb provided no reason for its introduction but was apparently sufficiently convinced of the biological potential of all of these new formal schemes to observe: "It will be interesting to see how much of the final solution [of the coding problem] will be proposed by the mathematicians before the experimentalists find it, and how much the experimenters will be ahead of the mathematicians" (Golomb 1962, p. 106). Subsequent developments, as it turned out, did not show much respect for the mathematicians.

Theory and experiment

What Golomb was apparently unaware of is that by 1961 it had become clear that the code was triplet (Crick et al. 1961), showing that his speculations, and those of Delbrück, had, for all their analytic sophistication, little relevance to biology. That same year, the first codon was experimentally deciphered by Matthaei and Nirenberg (1961a, b), using a cell-free system and RNA sequences. They determined that UUU, a triplet not permitted to be meaningful by the comma-free code, codes for phenylalanine.
As this result was verified, and other results began to come in, it became clear that the genetic code was not remotely comma-free. It was, in fact, highly degenerate. Colinearity of the code was demonstrated in 1964 (Yanofsky et al. 1964), and by 1966 the entire genetic code was established (Woese 1963, Ycas 1969). Synchronization turned out to be controlled by a variety of mechanisms, none of which could reasonably have been predicted from a consideration of constraints on the flow or storage of information. Failures in synchronization, as exemplified by the expression of frame-shifted sequences, were still possible. What the mechanisms controlling synchronization usually did was to prevent frame-shifted polypeptide formation, although this did not become clear until the 1980s (Atkins et al. 1991).

What had gone wrong with the comma-free code is that none of the elegant properties that were imposed on the code from considerations about information are revered in the living organism. The attempts to decipher the code in the 1950s using these ideas were an unmitigated failure. Even the ideas that the code was fully synchronized and fully sequential eventually came to be modified through the discovery of frameshift mutations and noncoding regions of the genome. The only idea to be validated, besides the uniformity of codon length, was Schrödinger's original one: merely that of the existence of a genetic code. But even the simplest form of this idea comes with two attendant metaphors: that of information and, perhaps even more perniciously, that of DNA as language, with the code used to interpret DNA symbols into meaningful proteins. The latter metaphor deserves more systematic analysis than is possible here. It is particularly harmful because it disregards the fact that, ultimately, DNA is a molecule interacting with other molecules through a complex set of mechanisms.
DNA is not just some text to be interpreted, and to regard it as such is an inaccurate simplification.

The theoretical schemes for the genetic code are particularly important because the success of any of them would have provided at least a rudimentary theory of information for molecular biology. However, despite their failure, the ideas of coding and information in molecular biology persisted. The reason for this is the unusual simplicity of prokaryotic genomes. Much of molecular biology, especially in the 1960s, was established through the study of only a single species, Escherichia coli, whose genome was particularly well behaved. Every DNA segment has either a coding or a regulatory function. For coding regions, transcription results in a complementary mRNA molecule that, with no further modification, is translated at the ribosome. If, following Crick, the specificity of a DNA sequence is identified with the information encoded in it, then the three precepts stated at the beginning of this article become plausible.

However, Crick's definition of information remains quirky: for instance, longer segments of DNA, unless they encode more genes, cannot be said to carry more information than shorter ones. Regulatory sequences do not have any coding role: they cannot be said to contain information in any straightforward sense. Worse, because the code is so arbitrary (i.e., there are no theories that explain why a particular codon codes for its amino acid residue), the concept of information cannot be invoked in any explanatory role: there is no potential for explaining novel biological phenomena by appeals to some property of information. Nevertheless, these problems become insignificant once attention shifts to eukaryotic genetics.

The unexpected complexity of eukaryotic genetics

Monod (1971) is responsible for what should perhaps be called the "Central Myth" of molecular genetics: that what is true for E. coli is true for elephants.
Were this correct, then the notion of information formulated by Crick (1958), and the framework of coding that emerged from it, would not only make the three precepts at the beginning of this article true, but would also make the linguistic view of genetics palatable. However, developments since the early 1970s have shown Monod's claim of absolute universality to be incorrect. When molecular biologists turned to eukaryotic genetics in the late 1960s, they were in for surprise after surprise, so much so that Watson et al. (1983) entitled a chapter on eukaryotic genetics, "The Unexpected Complexity of Eukaryotic Genes." Precisely because of its bewildering complexity, eukaryotic genetics has no simple precepts like those that apply to prokaryotic genetics. Four sets of developments show how the simple information/coding picture inherited from E. coli begins to fall apart:

• The genetic code is not universal, although the amount of known variation is not great (see Fox 1987 for a review). At present, the most extensive variations have been found in mitochondrial DNA, in which, for instance, across all the major kingdoms UGA codes for tryptophan rather than causing translation to terminate as it does in the usual code. It can be argued that mitochondrial DNA is "special" because mitochondria probably arose as independent organisms that were subsequently incorporated into eukaryotic cells. However, in the nuclear DNA of at least four species of protozoa, UAA and UAG can code for glutamine rather than terminating translation. Moreover, in many species, UGA codes for amino acid residues that do not belong to the standard set of 20. In some viral DNA sequences UGA and UAG are sometimes, but not always, read through, that is, ignored both as termination signals and as codons (Fox 1987). Even in the same RNA sequence, these codons sometimes result in termination and are sometimes ignored. For example, the virus Qβ has a coat protein that is usually produced by having UGA read as a termination codon. However, 2% of the time it is ignored, resulting in a longer functional protein (Fox 1987).

• The discovery of frameshift mutations has destroyed any residual belief in a natural synchronization of the genetic code. The extent to which frame shifts are present in organisms is largely a matter of conjecture (Atkins et al. 1991). Sometimes, frame shifts at the DNA level are used to transcribe an RNA segment that is translated into a different protein than the standard one (Fox 1987).

• Not all DNA segments have a coding or regulatory function. In human genomes, as much as 95% of the DNA may have no function. Inside a segment that, as a whole, has a coding function, coding regions, called "exons," are interspersed with noncoding regions, called "introns." In almost all eukaryotes, the portions of RNA corresponding to the introns are spliced out after transcription. Moreover, alternative splicing (the production from the same transcript of different RNA segments, coding for different proteins) has also been found (Smith et al. 1989). Moreover, there are large segments of nonfunctional DNA between genes (i.e., between segments with known coding or regulatory roles). The existence of introns and other nonfunctional DNA segments makes it impossible to simply read off a DNA sequence and predict an amino acid sequence (even when all regulatory regions are known).

• Besides splicing, several types of mRNA editing are also now known (Cattaneo 1991). DNA segments producing transcripts that are subsequently so edited are called "cryptic genes." For instance, in mammalian intestinal cells, a certain C nucleotide in the mRNA for apolipoprotein becomes deaminated, converting it to a U and creating a stop codon. Deamination of C to U, and the reverse process (U → C amination), occur in several plant mitochondrial mRNA transcripts as well. Moreover, even more unusual behaviors have been observed with mitochondrial RNAs in which bases can be deleted or inserted. The latter, especially, leads to a situation that can be interpreted as the formation of proteins for which there are no genes. In an extreme case, in the human parasite Trypanosoma brucei, as many as 551 U's are inserted throughout the transcript coding for NADH dehydrogenase subunit 7, and 88 are deleted (Koslowski et al. 1990). In this case, the DNA segment encoding the primary transcript can hardly be considered a gene for NADH dehydrogenase subunit 7. By looking at the DNA sequence it would be impossible to predict beforehand that this was the protein that would eventually be produced. Moreover, in almost all eukaryotes bases are added as "tails" and "caps" to the RNA after transcription.

In the present context, RNA editing is the most interesting facet of eukaryotic genetics: it already shows how the first precept described at the outset of this article (that all information resides in the DNA sequences of the genome) cannot be universally true. Acceptance of the idea that not all information resides in DNA sequences implies the acceptance of the idea that not all information proceeds as a transfer from DNA to RNA to protein, at least through the conventional coding relationship. Then the Central Dogma becomes dubious.

It would be unreasonable to criticize molecular biologists for not predicting these complexities in the 1960s, long before there was any experimental evidence for them. Nevertheless, these complexities show that the information/coding picture inherited from E. coli should no longer be regarded as the conceptual core of molecular biology. In particular, one should wonder whether the idea of a genetic code captures something important about biological systems or whether it is simply a metaphor that has epiphenomenally emerged from the accident that both nucleic acids and proteins happen physically to be linear molecules.

Cybernetics and information theory

The argument that has been developed so far assumes that "information" should be basically understood as Crick (1958) construed it, that is, what is specified by a DNA sequence through the genetic code. The problems with that interpretation nevertheless leave open the possibility that there is some other interpretation of that term that will allow its recovery as an interesting theoretical concept of molecular biology. Historically, there have been two such interpretations, and these have to be disposed of to complete the argument of this article.

The first of these was already alluded to by Ephrussi et al.'s (1953) comment that understanding information transfer may involve exploring "cybernetics at the bacterial level." Cybernetics, from their and other similar points of view, would provide the theory for the use of information in molecular biology. However, nobody seems to be certain about what constitutes cybernetics (see Pierce 1962). The term "cybernetics" was popularized with messianic fervor by Norbert Wiener (1948). However, all that is clear from Wiener's work is that cybernetics is a theory of regulated and, especially, self-regulating systems. Regulation was posited to occur through "feedback," a concept that had entered biology long before the invention of cybernetics, but had been co-opted into the cybernetic framework (Keller 1995, Sarkar 1996). Feedback provides the information for regulation. In cybernetics, the concept of information gets no more explicit than the indication that "information" is that which enables regulation.
The value of cybernetics in molecular biology is doubtful, although putative cybernetic interpretations of genetics began as early as 1950 (see Kalmus 1950). If the published record is taken as evidence, these interpretations had negligible impact during the 1950s and early 1960s, when the conceptual framework of molecular biology came to be established. They were given a new lease of life by Monod (1971) in Chance and Necessity, when he reinterpreted much of his earlier work, including the model of allosteric regulation of proteins and the operon model of bacterial cell regulation, as examples of cybernetic systems (Sarkar 1996). Whatever plausibility this interpretation may have had in 1971, it falls apart, once again because of the unexpected complexity of the eukaryotic genome. Eukaryotic gene regulation is not well understood even today, but it is clear that no model similar to the operon can account for the regulation of eukaryotic genes (see Sarkar 1996 for detail). Cybernetics appears to have been little more than a diversion in the development of molecular biology, but even if it is somehow reinstated (although it is hard to see how), its associated concept of information (that which enables regulation) cannot enable a recovery of the three precepts mentioned at the beginning of this article: "information" as "feedback" is hardly what resides in DNA, passes from DNA to RNA to protein, or, for that matter, can make the Central Dogma true.

A second alternative interpretation of information emerged from the mathematical theory of communication, which eventually came to be called information theory (Shannon 1948). In information theory, the amount of information is measured by the logarithm of the relative number of choices available during a communication process. Information connotes uncertainty; formally, its numerical value is determined by an entropy function that is similar to the usual entropy of statistical mechanics. In the 1950s,
there were many attempts to apply this notion of information to molecular biology. Branson (1953) calculated the information content of polypeptide sequences using empirical frequencies of the various residues to calculate the uncertainty at each position of a sequence. In a similar manner, Linschitz (1953) attempted to calculate the information content of a bacterial cell. However, by 1956, even its staunchest proponent, Quastler (1958, p. 399), conceded at least temporary defeat:

Information theory is very strong on the negative side, i.e. in demonstrating what cannot be done; on the positive side its application to the study of living things has not produced many results so far; it has not led to the discovery of new facts, nor has its application to known facts been tested in critical experiments. To date, a definitive judgment of the value of information theory in biology is not possible.

Sporadic attempts to apply information theory directly to molecular biology continue, but the results are less than exciting. For instance, a major result of Yockey's (1992) attempt to apply information theory to molecular biology is that polypeptides may not code for DNA sequences (which is Yockey's version of the Central Dogma). The basis for this "theorem" is the degeneracy of the genetic code: a given polypeptide sequence can be encoded by different DNA sequences. The conclusion is correct. What is mysterious is why information theory, or any abstract theoretical framework, has to be invoked to make so trivial a point. It is a trivial combinatorial fact that was known by Gamow, Crick, or anyone else who had ever thought about the relation between DNA and protein.

Recently, Thomas Schneider and his collaborators (starting with Schneider et al. 1986) have made promising use of information theory to find the most functionally relevant parts of long DNA sequences when these are all that are available. The basic idea, which goes back to Kimura (1961), is that functional portions of sequences are most likely to be conserved through natural selection. These will therefore have low information content (in Shannon's sense). Whether Schneider's methods will live up to their initial promise remains to be seen. Nevertheless, for conceptual reasons alone, this notion of "information" (i.e., Shannon information) is irrelevant in the present context. According to this notion, for DNA sequences the "information" content is a property of a set of sequences: the more varied a set, the greater the "information" content at individual positions of the DNA sequence. But "information" in this scheme is not actually what an individual DNA sequence contains, that is, not what would be decoded by the cellular organelles. Worse, what Kimura's (1961) argument suggests is that what should be regarded as biologically informative (functional sequences) are exactly those that have low "information" content.

Conclusions

Thus, neither cybernetics nor formal information theory can rescue the concept of information for molecular biology in such a way as to permit the recovery of the conventional picture of DNA sequences encoding information to be decoded using the genetic code. The natural conclusion is that the conventional picture should be abandoned. However, because the concepts of information and coding have been central to how molecular biology is currently understood, abandoning these concepts will have important consequences. There are at least five such consequences, and a little reflection shows that they are, in fact, desirable:

• If biological "information" is not DNA sequence alone, other features of an organism can also contain information. This is precisely what recent discoveries indicate. In particular, the developmental fate of a cell might be largely a result of features such as methylation patterns of DNA, which are not even ultimately determined only by DNA sequences (see Jablonka and Lamb 1995). These "epigenetic" patterns can be inherited for several cell generations. Different cells in the same organism, presumably with identical DNA sequences, can have different epigenetic patterns. These differences can result in cell specialization and differentiation, the usual prelude to developmental changes. Epigenetic specifications are also critical in generating differences in offspring (of sexually reproducing organisms), depending on whether an allele is inherited from the mother or the father. Epigenetic specifications are sometimes transmitted across organismic generations. If "information" is to have any plausible biological significance, it would be odd not to regard the transfer of these specifications as transfers of information. The conventional DNA-based concept of information precludes this possibility.

• The Central Dogma of molecular biology is false if it is construed as a universal biological law. However, a less grandiose claim, that protein sequences do not directly specify nucleic acid sequences in the way in which the latter specify the former, remains true. This humbler claim does not have the majestic rhetorical power of the Central Dogma, but does this retreat really undermine some putative insight that is enshrined in that dogma? The usual defense of the general biological importance of the Central Dogma is that it is a statement at the molecular level of the noninheritance of acquired characteristics (see, for example, Crick 1958 and Maynard Smith 1989). However, this interpretation of the Central Dogma is entirely unjustified. Acquired characteristics are occasionally inherited, although usually not (Jablonka and Lamb 1995, Landman 1991).
What ensures that even those acquired characteristics that involve changes in DNA are not inherited in higher animals is the segregation of the germline from the soma. But plants have no germline, and the extent of its segregation in animals varies greatly across phyla (Buss 1987). Nevertheless, whatever the relation between nucleic acid and protein, that relation shows no such variability across the phyla: ipso facto the Central Dogma, even if it were true, could not be either an explanation or an alternative synonymous statement of the alleged noninheritance of acquired characteristics. There is certainly something peculiar, and extremely interesting, about how DNA resists easy change across the phyla. But this observation is something to be studied and understood, not something to be explained away on the basis of some alleged law about some incoherent notion of information.

• Many influential contemporary discussions of the origin of life have concentrated on the origin of information, in which information is construed simply to be nucleic acid sequences (e.g., Eigen 1992). Implicit in these discussions is the assumption that nucleic acid sequences ultimately encode all that is necessary for the genesis of living forms and, therefore, that a solution to the problem of the initial generation of these sequences will solve the problem of the origin of life. The move away from sequences would put these efforts in proper perspective: to explain the possible origin of persistent segments of DNA does not suffice as an explanation of the origin of living cells.

• The emphasis on DNA sequences that marks contemporary molecular biology is misplaced. Therefore, the sorts of arguments that were mustered to initiate the Human Genome Project (HGP, a crash program to sequence DNA blindly, that is, without first determining the functional roles of the segments to be sequenced) are less than compelling. This is not a new point.
It has previously been made, on the basis of other considerations, by many critics of the HGP (Davis 1992, Lederberg 1993, Lewontin 1992, Sarkar 1992). These arguments, taken together, strongly suggest that the HGP should be limited to the mapping of all known genetic loci to specific positions on chromosomes and the sequencing of only those segments that are found to have some functional interest. There is little scientific rationale for the blind sequencing of DNA, and the shift of scarce resources to it is unjustified. Human and other genomes will ultimately be sequenced, with or without the HGP, but should such sequencing proceed at a normal pace, not only would such a shift of resources not occur but there would be more time to prepare for the well-known social and ethical problems that the HGP raises (see, for example, Holtzman 1989).

• Abandoning the coding metaphor will also do much to liberate biology from the unfortunate linguistic metaphor of an organism's (or a cell's) DNA sequence being a message in some language to be decoded. Despite the immense popularity of this metaphor (see, for example, Wills 1991 and Pollack 1994), at the technical level the linguistic metaphor is at best only as helpful in understanding biology as the concept of coding. The complexities of eukaryotic genetics show that the code is of only limited use in the transition from DNA to an organism's biology. Given a DNA sequence, simply to read off an amino acid sequence requires that it be known whether any nonstandard coding is being used, what reading frame is to be used, that all gene-non-gene and intron-exon boundaries are known, and what kinds of RNA editing will take place. Even at the metaphorical level, it is unlikely that these complexities can all be treated as questions of language: after all, natural languages do not contain large segments of meaningless signs interspersed with occasional bits of meaningful symbols.
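Two of the requirements just listed, the choice of reading frame and the possibility of nonstandard coding, can be made concrete with a minimal sketch. Everything named here is an illustrative assumption and not part of the original article: the function `translate`, the nine-base example string, and `MITO_VARIANT`, which changes only one codon (TGA read as tryptophan, as in vertebrate mitochondria) and is therefore a deliberately incomplete stand-in for a real variant code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical, simplified variant: only TGA differs (stop -> tryptophan).
MITO_VARIANT = {**CODON_TABLE, "TGA": "W"}


def translate(dna, frame=0, table=CODON_TABLE):
    """Read `dna` in triplets starting at offset `frame` (0, 1, or 2).

    A trailing partial codon is ignored; '*' marks a stop codon.
    """
    seq = dna.upper()[frame:]
    return "".join(table[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


dna = "ATGGCCTGA"  # illustrative sequence, not taken from any real gene
for frame in range(3):
    print(frame, translate(dna, frame))
print("variant", translate(dna, 0, MITO_VARIANT))
```

Under these assumptions the same nine bases read as `MA*`, `WP`, or `GL` depending on the frame, and as `MAW` under the variant table. The sketch ignores gene boundaries, introns, and RNA editing entirely, which are the further complications the text points to.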
Of course, even with an amino acid sequence, biology has barely begun: one then faces the problem of going to higher levels of organization, and in the absence of a solution to the protein folding problem there is little prospect for doing that if one really starts from a DNA "text." In any case, the sterility of the informational picture of molecular biology is a much-needed reminder that DNA is, ultimately, a molecule and not a language.

Acknowledgments

Parts of this article also appear in Sarkar (1996), which provides a more detailed treatment of many of the issues discussed here. Thanks are due to Angela Creager, Larry Holmes, Manfred Laubichler, Lily Kay, Evelyn Fox Keller, Joshua Lederberg, Richard Lewontin, William C. Wimsatt, and two anonymous referees for extensive discussions and comments on an earlier version of this article. Work on this article was partly funded by a fellowship at the Dibner Institute at the Massachusetts Institute of Technology.

References cited

Atkins JF, Weiss RB, Thompson S, Gesteland RF. 1991. Towards a genetic dissection of the basis of triplet decoding, and its natural subversion: programmed reading frame shifts and hops. Annual Review of Genetics 25: 201-228.
Avery OT, MacLeod CM, McCarty M. 1944. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a deoxyribonucleic acid fraction isolated from pneumococcus III. Journal of Experimental Medicine 79: 137-157.
Beadle GW, Tatum E. 1941. Genetic control of biochemical reactions in Neurospora. Proceedings of the National Academy of Sciences of the United States of America 27: 499-506.
Branson HR. 1953. A definition of information from the thermodynamics of irreversible processes. Pages 25-40 in Quastler H, ed. Essays on the use of information theory in biology. Urbana (IL): University of Illinois Press.
Buss L. 1987. The evolution of individuality. Princeton (NJ): Princeton University Press.
Cattaneo R. 1991.
Different types of messenger RNA editing. Annual Review of Genetics 25: 71-88.
Crick FHC. 1958. On protein synthesis. Symposium of the Society for Experimental Biology 12: 138-163.
Crick FHC, Griffith JS, Orgel LE. 1957. Codes without commas. Proceedings of the National Academy of Sciences of the United States of America 43: 416-421.
Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ. 1961. General nature of the genetic code for proteins. Nature 192: 1227-1232.
Davis BD. 1992. Sequencing the human genome: a faded goal. Bulletin of the New York Academy of Medicine 68: 115-145.
Eigen M. 1992. Steps towards life: a perspective on evolution. Oxford (UK): Oxford University Press.
Ephrussi B, Leopold U, Watson JD, Weigle JJ. 1953. Terminology in bacterial genetics. Nature 171: 701.
Fox TD. 1987. Natural variation in the genetic code. Annual Review of Genetics 21: 67-91.
Gamow G, Rich A, Ycas M. 1955. The problem of information transfer from the nucleic acids to proteins. Advances in Biological and Medical Physics 4: 23-68.
Golomb SW. 1962. Efficient coding for the desoxyribonucleic acid channel. Proceedings of the Symposium for Applied Mathematics 14: 87-100.
Golomb SW, Welch LR, Delbrück M. 1958. Construction and properties of comma-free codes. Biologiske Meddelelser Kongelige Danske Videnskabernes Selskab 23(9): 1-34.
Holtzman N. 1989. Proceed with caution. Baltimore (MD): The Johns Hopkins University Press.
Jablonka E, Lamb MJ. 1995. Epigenetic inheritance and evolution: the Lamarckian dimension. Oxford (UK): Oxford University Press.
Judson HF. 1979. The eighth day of creation. New York: Simon and Schuster.
Kalmus H. 1950. A cybernetical aspect of genetics. Journal of Heredity 41: 19-22.
Keller EF. 1995. Refiguring life: metaphors of twentieth century biology. New York: Columbia University Press.
Kimura M. 1961. Natural selection as a process of accumulating genetic information in adaptive evolution. Genetical Research 2: 127-140.
Koslowski DJ, Bhat GJ, Perrollaz AL, Feagin JE, Stuart K. 1990. The MURF3 gene of T. brucei contains multiple domains of extensive editing and is homologous to a subunit of NADH dehydrogenase. Cell 62: 901-911.
Landman OE. 1991. The inheritance of acquired characteristics. Annual Review of Genetics 25: 1-20.
Landsteiner K. 1936. The specificity of serological reactions. Springfield (IL): C. C. Thomas.
Lederberg J. 1956. Comments on the gene-enzyme relationship. Pages 161-169 in Gaebler OH, ed. Enzymes: units of biological structure and function. New York: Academic Publishers.
___. 1993. What the double helix has meant for basic biomedical science: a personal commentary. Journal of the American Medical Association 269: 1981-1985.
Lederberg J, Tatum EL. 1946a. Gene recombination in Escherichia coli. Nature 158: 558.
___. 1946b. Novel genotypes in mixed cultures of biochemical mutants of bacteria. Cold Spring Harbor Symposia on Quantitative Biology 11: 113-114.
Lewontin RC. 1992. Biology as ideology: the doctrine of DNA. New York: Harper-Perennial.
Linschitz H. 1953. The information content of a bacterial cell. Pages 251-262 in Quastler H, ed. Essays on the use of information theory in biology. Urbana (IL): University of Illinois Press.
Luria SE, Delbrück M. 1943. Mutations of bacteria from virus sensitivity to virus resistance. Genetics 28: 491-511.
Matthaei JH, Nirenberg MW. 1961a. Characterization and stability of DNAse-sensitive protein synthesis in E. coli extracts. Proceedings of the National Academy of Sciences of the United States of America 47: 1580-1588.
___. 1961b. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proceedings of the National Academy of Sciences of the United States of America 47: 1588-1594.
Maynard Smith J. 1989. Evolutionary genetics. Oxford (UK): Oxford University Press.
Mazia D. 1956. Nuclear products and nuclear reproduction. Pages 261-278 in Gaebler OH, ed.
Enzymes: units of biological structure and function. New York: Academic Publishers.
Monod J. 1971. Chance and necessity: an essay on the natural philosophy of modern biology. New York: Knopf.
Pauling L. 1940. A theory of the structure and process of formation of antibodies. Journal of the American Chemical Society 62: 2643-2657.
Pierce JR. 1962. Symbols, signals and noise. New York: Harper and Brothers.
Pollack R. 1994. Signs of life: the language and meanings of DNA. Boston: Houghton Mifflin.
Quastler H, ed. 1958. The status of information theory in biology: a round-table discussion. Pages 399-402 in Yockey HP, ed. Symposium on information theory in biology. New York: Pergamon Press.
Sarkar S. 1989. Reductionism and molecular biology: a reappraisal. [Ph.D. dissertation.] Department of Philosophy, University of Chicago, Chicago, IL.
___. 1991. What is life? Revisited. BioScience 41: 631-634.
___. 1992. Para que sirve el proyecto Genoma Humano. La Jornada Semanal 180: 29-39.
___. 1996. Biological information: a skeptical look at some central dogmas of molecular biology. Pages 187-231 in Sarkar S, ed. The philosophy and history of molecular biology: new perspectives. Dordrecht (the Netherlands): Kluwer.
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. 1986. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188: 415-431.
Schrödinger E. 1944. What is life? The physical aspect of the living cell. Cambridge (UK): Cambridge University Press.
Shannon CE. 1948. A mathematical theory of communication. Bell System Technical Journal 27: 379-423, 623-656.
Smith CW, Patton JG, Nadal-Ginard B. 1989. Alternative splicing in the control of gene expression. Annual Review of Genetics 23: 527-577.
Spiegelman S. 1956. On the nature of the enzyme-formation system. Pages 67-92 in Gaebler OH, ed. Enzymes: units of biological structure and function. New York: Academic Publishers.
Timofeeff-Ressovsky HA, Timofeeff-Ressovsky NW. 1926.
Über das phänotypische manifestieren des genotyps. II. Über idiosomatische variationsgruppen bei Drosophila funebris. Roux Archiv für Entwicklungsmechanik der Organismen 108: 146-170.
Watson JD, Crick FHC. 1953a. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171: 737-738.
___. 1953b. Genetical implications of the structure of deoxyribonucleic acid. Nature 171: 964-967.
Watson JD, Tooze J, Kurtz DT. 1983. Recombinant DNA: a short course. New York: W. H. Freeman and Co.
Wiener N. 1948. Cybernetics. Cambridge (MA): MIT Press.
Wills C. 1991. Exons, introns, and talking genes: the science behind the human genome project. New York: Basic Books.
Woese CR. 1963. The genetic code-1963. ICSU Review of World Science 5: 210-252.
Yanofsky C, Carlton BC, Guest JR, Helinski DR, Henning U. 1964. On the colinearity of gene structure and protein structure. Proceedings of the National Academy of Sciences of the United States of America 51: 266-272.
Ycas M. 1969. The biological code. Amsterdam (the Netherlands): North-Holland.
Yockey HP. 1992. Information theory and molecular biology. Cambridge (UK): Cambridge University Press.

Sahotra Sarkar is an associate professor in the Department of Philosophy, McGill University, Montreal, Quebec H3A 2T7, Canada. He is currently a Fellow of the Wissenschaftskolleg zu Berlin, Wallotstrasse 19, D-14193 Berlin, Germany. © 1996 American Institute of Biological Sciences.
BioScience Vol. 46 No. 11, December 1996