Thinking of Biology
Decoding "coding"-information and DNA
by Sahotra Sarkar
Most biologists would probably be more bored by than surprised at, and certainly not impressed by the originality of, the claim that the central core of the conceptual structure of contemporary molecular biology can be encapsulated in the following three precepts:
• All hereditary information resides in the DNA sequences of organisms.
• This information is transferred from DNA to RNA through the process of transcription, and from RNA to protein through translation.
• This information is never transferred from protein to nucleic acid sequences.
Following Crick (1958), the last precept is usually called the "Central Dogma" of molecular biology.
These three precepts are so universally accepted that they usually find their way into introductory biology texts. Nevertheless, they are at best misleading, and at worst simply vacuous, in the sense that they can play no significant explanatory role in molecular biology. The reason for this is that there is no clear technical notion of "information" in molecular biology. "Information" is little more than a metaphor that masquerades as a technical concept and leads to a misleading picture of the conceptual structure of molecular biology. Metaphors are ubiquitous in science. When they provide succinct or easily comprehensible accounts of complex technical concepts, they are particularly useful for communicative or didactic purposes. However, when they serve only as surrogates for nonexistent technical concepts, their influence is less than benign. In molecular biology, "information" and associated terms (especially "coding") belong to the latter class of metaphors. This claim is, no doubt, a bald one: to develop the arguments in its favor requires an excursion into the history of molecular biology, a discussion of how the term "information" came to be introduced, and why, for all its initial plausibility as a useful theoretical concept, there is little to recommend its continued use today.
Information and molecular biology
Historically, the term "information" entered molecular biology as part of a putative theory of biological specificity. By around 1930, it had become clear that molecular interactions in living organisms are highly specific in the sense that particular molecules interact with exactly one, or at most a few, reagents. Enzymes act on specific substrates. Living organisms produce antibodies that are highly specific, not only to naturally occurring antigens, but also to artificial antigens (Landsteiner 1936). Even the action of genes was sometimes described using "specificity": genes specified precise phenotypes to different degrees of accuracy called their "specificity" (Timofeeff-Ressovsky and Timofeeff-Ressovsky 1926). In genetics, the ultimate exemplar of specificity became the gene-enzyme relationship (Beadle and Tatum 1941): "one gene-one enzyme" was perhaps the most important organizing hypothesis of early molecular biology.
By the end of the 1930s, a highly successful theory of specificity (and one that remains central to molecular biology) emerged. Due primarily to Linus Pauling (e.g., Pauling 1940), although with many antecedents, this theory claimed that the behavior of a macromolecule is determined by its conformation, and that what mediates biological interactions is a precise lock-and-key fit between the shapes of molecules. In the 1940s, when no three-dimensional structure of a biological macromolecule had yet been determined, the conformational theory of specificity was speculative. The demonstration of its approximate truth for a wide variety of interactions, which came in the late 1950s and 1960s, was one of molecular biology's most significant triumphs. Just as "one gene-one enzyme" was the archetypal slogan of early molecular biology, "structure determines function" came to be the dominating principle of the field during its triumphant 1960s.
Meanwhile, back in 1944, in What Is Life?, Erwin Schrödinger had introduced a conceptual scheme that raised the possibility of a startlingly different source of specificity. Schrödinger asked how so tiny an object as the nucleus of a fertilized cell could contain all of the specifications (i.e., instructions) necessary for normal development of an adult organism. He stated that there existed in the nucleus some structure whose organization was interpreted as "an elaborate code-script," which he compared to the Morse code (Schrödinger 1944). Although he was willing to countenance codes in more than one dimension, even a linear code based on a 5-letter alphabet and words of up to 25 letters could generate more than 10^17 patterns. Thus the arrangement of the units rather than their physical shape became the source of specificity in Schrödinger's model. In the postwar era, when many scientists who were initially trained in physics turned their attention to biology for the first time, What Is Life? was influential in setting the agenda of a new biology (Sarkar 1991).
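Schrödinger's figure for a 5-letter alphabet can be checked by direct counting. The sketch below assumes that a "pattern" is any string of 1 to 25 symbols drawn from 5 letters; the function name count_words is introduced here purely for illustration and does not come from the article.

```python
def count_words(alphabet_size: int, max_length: int) -> int:
    """Count all strings of length 1..max_length over an alphabet of the given size."""
    return sum(alphabet_size ** k for k in range(1, max_length + 1))


# Assumption: a "pattern" is any word of 1 to 25 letters over 5 symbols.
total = count_words(5, 25)
print(total)             # number of distinct words
print(total > 10 ** 17)  # compare with the bound quoted in the text
```

Under this reading the count is roughly 3.7 x 10^17, so the quoted lower bound holds with room to spare.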
December 1996
The 1940s also saw an explosive growth of microbial genetics, starting with Luria and Delbrück's (1943) demonstration of spontaneous mutagenesis in bacteria; continuing, especially, with Avery and colleagues' (1944) demonstration of DNA as the likely genetic material; and culminating with Joshua Lederberg's discovery of recombination in bacteria (Lederberg and Tatum 1946a, b). "Transformation," "induction," and "transduction" were some of the new terms introduced to describe these phenomena (Ephrussi et al. 1953). In an attempt to navigate through this terminological morass, Ephrussi et al. (1953) suggested that the term "interbacterial information" replace them all. This was the first modern use of "information" in genetics. Ephrussi et al. (1953, p. 701) emphasized that the use of this term "does not necessarily imply the transfer of material substances, and [that they] recognize the possible future importance of cybernetics at the bacterial level."
Immediately after the publication of Ephrussi et al. (1953), in fact in the next issue of Nature, Watson and Crick (1953a) published the double helix model of DNA. The base pairing (A:T and C:G) that they proposed showed a possible way in which the specificities between the two helices could be involved in the formation of exact replicas. Moreover, in their second paper on the model (Watson and Crick 1953b, p. 964), they went on to use "information" explicitly, and defined it implicitly as what the "code" carried: "The phosphate-sugar backbone in our model is completely regular but any sequence of the pairs of bases can fit into the structure. It follows that in a long molecule many different permutations are possible, and it therefore seems likely that the precise sequence of bases is the code which carries the genetic information."
"Information" was finally defined explicitly by Crick in 1958, who identified it with the specification of a protein sequence. Crick's (1958) concern was the synthesis of proteins. There were three separate factors involved, he argued: "the flow of energy, the flow of matter, and the flow of information" (Crick 1958). The former two exhausted the physics and chemistry of the situation, whereas information was peculiar to biological systems. Crick (1958, p. 144) defined "information" with more care than ever before in this context: "By information I mean the specification of the amino acid sequence of the protein." He took it for granted that the genetic information was encoded in a DNA sequence. The physics and chemistry of folding of a protein, Crick hypothesized, were purely a result of its amino acid sequence. This is the well-known "sequence hypothesis" (which remains unproved because of the continued insolvability of the protein folding problem). Finally, he put this formalized notion of information to additional use:
The Central Dogma ... states that
once 'information' has passed into
protein it cannot get out again. In
more detail, the transfer of information from nucleic acid to
nucleic acid, or from nucleic acid
to protein may be possible, but
transfer from protein to protein,
or from protein to nucleic acid, is
impossible. Information means
here the precise determination of
sequence, either of bases in the
nucleic acid or of amino acid
residues in the protein. (Crick
1958, p. 153; italics in the original)
This assumption about the one-way transfer of information did not arise from physical considerations. Rather, it was Crick's way to give a molecular characterization of neo-Darwinism. "[I]t can be argued," he explicitly observed, "that [the protein] sequences are the most delicate expression possible of the phenotype of an organism" (Crick 1958, p. 142). Therefore, the one-way transfer of information ensured that changes that occur initially at the phenotypic level cannot induce genotypic changes and be inherited.
Crick implicitly distinguished two different types of specificity: that of each DNA sequence for its complementary strand, as modulated through base pairing; and that of the relationship between DNA and protein. The latter was modulated by genetic information. This notion of information was combinatorial: all that was required was that the code perform its function from the sequence of bases in a DNA segment. Schrödinger's (1944) "arrangement" came to "encode information," resulting in a new theory of specificity, distinct from the conformational theory.
The theory of the genetic code
Even before Crick's (1958) explicit identification of "information" with "specificity," the same idea was being used systematically. For instance, in 1955, during a symposium on enzymes, Mazia (1956) argued that the role of RNA was to carry "information" from the nuclear DNA to the cytoplasm for the synthesis of proteins. At the same conference, Spiegelman (1956) argued that the "informational complexity" required for the formation of proteins made RNA and DNA the only two plausible candidates for being templates for protein formation. Lederberg (1956) noted that "information" was what "specificity" was "called nowadays."
However, what fully embedded "information" into the conceptual framework of molecular biology was the idea of a genetic code, that is, the idea that the relationship between DNA and protein should be conceived of as one of coding. After the formulation of the double helix model in 1953, this idea entered molecular biology largely through the work of George Gamow and his collaborators, who attempted to deduce properties of the genetic code from plausible theoretical assumptions about information. (See Gamow et al. 1955 for a review, and Judson 1979 and Sarkar 1989, 1996, for critical discussions.) Broadly speaking, these assumptions were about the efficiency of information transfer and storage using DNA sequences. The basic problem was conceived of as that of determining, first, what sets of DNA nucleotides coded for individual amino acid residues and, second, whether these sets had any overlap when a long stretch of DNA coded for an amino acid residue sequence. Gamow formulated a variety of coding schemes from different sets of assumptions. By the late 1950s, these schemes were all shown to be incorrect.
BioScience Vol. 46 No. 11
The most interesting and, in some ways, influential of the theoretical attempts to establish the nature of the genetic code was the "comma-free" coding scheme of Crick et al. (1957), who introduced relatively sophisticated ideas about information storage and transmission. Their only experimental criterion for judging success was the "magic number," 20, of amino acid residues that occurred naturally in proteins and, therefore, must be coded for by DNA. They rejected overlapping codes on experimental grounds and argued that it was natural to restrict attention to triplet codes because doublet ones would allow only 16 coding units, whereas triplet ones, allowing 64, were clearly sufficient.
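The arithmetic behind the restriction to triplets can be reproduced by enumeration. The following sketch lists every word of a given length over the four bases; the helper name coding_units is an illustrative assumption, not a term from the article.

```python
from itertools import product

BASES = "ACGT"


def coding_units(length: int) -> list[str]:
    """Enumerate every word of the given length over the four bases."""
    return ["".join(p) for p in product(BASES, repeat=length)]


for n in (1, 2, 3):
    # A code needs at least 20 units to cover the 20 residue types.
    print(n, len(coding_units(n)), len(coding_units(n)) >= 20)
```

Doublets give 16 units, which is too few for 20 residue types, while triplets give 64, which is why triplets are the shortest uniform word length that suffices.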
From their point of view, there were two problems to be solved. First, there was the problem of potential degeneracy: if 64 triplets code for only 20 residues, then in some cases several triplets would code for a single residue. Although there was no experimental ground to reject a degenerate code (such as the one that is now accepted), they believed that it was undesirable. Second, there was the problem of synchronization: if A, C, G, and T are the four nucleotide base types, is the sequence ACCGTAGT to be read as ACC, GTA, ..., as CCG, TAG, ..., or as CGT, AGT, ...? Their solution to both problems was ingenious. By attempting to solve only the synchronization problem, they also removed degeneracy and, incidentally, obtained the magic number, 20.
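The synchronization problem can be made concrete by splitting the example sequence at each of the three possible starting offsets. This is a minimal sketch; the function read_frame and its parameters are hypothetical names chosen for illustration.

```python
def read_frame(sequence: str, offset: int, size: int = 3) -> list[str]:
    """Split a sequence into complete words of `size` letters, starting at `offset`."""
    return [sequence[i:i + size]
            for i in range(offset, len(sequence) - size + 1, size)]


seq = "ACCGTAGT"
frames = {offset: read_frame(seq, offset) for offset in range(3)}
for offset, words in frames.items():
    print(offset, words)
```

Nothing in the sequence itself says which of the three readings is intended; a comma-free code is one in which at most one of them consists entirely of meaningful words.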
Their solution to the synchronization problem began with the assumption that only some triplets had "sense," that is, could code for residues. Of the 64 possible triplets, the 4 with one base type, such as AAA, had to be rejected immediately: otherwise a sequence such as AAAACGA could potentially be read ambiguously as AAA, ACG, ... or as AAA, CGA, .... This left 60 possibly meaningful triplets. These segregate into 20 sets of three, each set consisting of a triplet and its two cyclic permutations (e.g., ACG, CGA, and GAC). If the possibility of ambiguity is to be avoided, only one triplet from each set can be meaningful. For instance, if ACG and CGA were both meaningful, ACGACGT could potentially be read as ACG, ACG, ... or as CGA, CGT, .... Thus, at most 20 meaningful triplets were possible. Crick et al. (1957) went on to show that a solution with exactly 20 triplets was possible by explicitly constructing such a set. That solution was not unique. Crick et al. managed to find 288 different solutions, and there are in fact as many as 408 (Golomb 1962).
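The counting argument above can be checked mechanically. The sketch below groups the 60 non-uniform triplets into cyclic classes and tests one candidate set for comma-freedom. The candidate set is an assumption made for illustration: it takes the triplets whose middle base is strictly later in the order A < C < G < T than the first base and no earlier than the third. It is not claimed to be the particular set that Crick et al. printed, and the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]


def cyclic_class(triplet: str) -> frozenset[str]:
    """Return a triplet together with its two cyclic permutations."""
    return frozenset(triplet[i:] + triplet[:i] for i in range(3))


def is_comma_free(code: set[str]) -> bool:
    """True if no out-of-frame triplet inside any two adjacent code words is a code word."""
    for first in code:
        for second in code:
            pair = first + second
            if pair[1:4] in code or pair[2:5] in code:
                return False
    return True


# Drop the 4 single-base triplets, then group the remaining 60 cyclically.
candidates = [t for t in TRIPLETS if len(set(t)) > 1]
classes = {cyclic_class(t) for t in candidates}

# Illustrative construction (an assumption, see the paragraph above).
example_code = {t for t in TRIPLETS if t[0] < t[1] >= t[2]}

print(len(candidates), len(classes), len(example_code))
print(is_comma_free(example_code))
```

The check confirms the shape of the argument: 60 candidates fall into 20 cyclic classes, and a comma-free set can take at most one word from each class, which is where the number 20 comes from.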
Crick et al. (1957) went on to discuss a possible physical interpretation of the code that is historically important because it suggested that intermediate molecular complexes are involved in the translation of a nucleic acid sequence to a polypeptide one. This suggestion was one of the first statements of the "adaptor hypothesis," the adaptors eventually being identified with transfer RNA (tRNA). However, these physical considerations did not form the basis for their coding scheme. That basis was provided by the two desiderata of solving the degeneracy and synchronization problems. There were no experimental grounds for these; the only experimental constraint was the magic number of 20. The desiderata were based on two claims about the nature of biological information. The first was a claim of a certain kind of simplicity: a degenerate code is not as simple as a nondegenerate one. The second was an assumption about efficiency: if synchronization is not automatically determined by the nature of the code, then ambiguous translation can occur by a shift of the reading frame, resulting in errors. The comma-free code satisfied both desiderata.
But why settle for synchronization alone? In 1958, Max Delbrück pointed out that genetic information ultimately resides in double-stranded DNA rather than in only one strand (Golomb et al. 1958). Consequently, synchronization must simultaneously hold for both strands. The "dictionary," Delbrück insisted, would be better off if this additional constraint on synchronization were also satisfied. This constraint was a natural extension of comma-freedom for a single strand and was called "transposability." Once it is imposed, triplet codes can no longer code for the 20 amino acid residue types known to occur in proteins. Triplet codes can now code for at most 16 residue types because the "converse" of each meaningful triplet, as determined by base pairing, must, when read from right to left, also belong to the set of coding triplets. This rules out many of the meaningful triplets of the original comma-free codes. Attention, therefore, shifted to quadruplet codes, and in 1962 Samuel Golomb reported from computer searches that there can be at least 57 quadruplet codons.
Pursuing coding as a mathematical idea, Golomb (1962) introduced another scheme, that of "biorthogonal codes," based on the mathematical theory of Hadamard matrices, which had six nucleotides code for each amino acid residue. Its biological motivation remains mysterious. Golomb provided no reason for its introduction but was apparently sufficiently convinced of the biological potential of all of these new formal schemes to observe: "It will be interesting to see how much of the final solution [of the coding problem] will be proposed by the mathematicians before the experimentalists find it, and how much the experimenters will be ahead of the mathematicians" (Golomb 1962, p. 106). Subsequent developments, as it turned out, did not show much respect for the mathematicians.
Theory and experiment
What Golomb was apparently unaware of is that by 1961 it had become clear that the code was triplet (Crick et al. 1961), showing that his speculations, and those of Delbrück, had, for all their analytic sophistication, little relevance to biology. That same year, the first codon was experimentally deciphered by Matthaei and Nirenberg (1961a, b), using a cell-free system and RNA sequences. They determined that UUU, a triplet not permitted to be meaningful by the comma-free code, codes for phenylalanine. As this result was verified, and other results began to come in, it became clear that the genetic code was not remotely comma-free. It was, in fact, highly degenerate. Colinearity of the code was demonstrated in 1964 (Yanofsky et al. 1964), and by 1966 the entire genetic code was established (Woese 1963, Ycas 1969). Synchronization turned out to be controlled by a variety of mechanisms, none of which could reasonably have been predicted from a consideration of constraints on the flow or storage of information. Failures in synchronization, as exemplified by the expression of frame-shifted sequences, were still possible. What the mechanisms controlling synchronization usually did was to prevent frame-shifted polypeptide formation, although this did not become clear until the 1980s (Atkins et al. 1991).
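The degeneracy of the accepted code can be illustrated with a short tabulation. The sketch below is not taken from the article: it assumes the standard genetic code as commonly tabulated, written with RNA bases in the order U, C, A, G, one-letter amino acid symbols, and '*' for a termination codon. The names CODON_TABLE and degeneracy are illustrative.

```python
from collections import Counter
from itertools import product

# Assumption: the standard genetic code, bases ordered U, C, A, G,
# one-letter amino acid symbols, '*' marking termination codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {"".join(codon): residue
               for codon, residue in zip(product(BASES, repeat=3), AMINO)}

# Number of codons assigned to each residue (or to termination).
degeneracy = Counter(CODON_TABLE.values())

print(CODON_TABLE["UUU"])  # the single-base triplet that a comma-free code must exclude
for residue, count in sorted(degeneracy.items()):
    print(residue, count)
```

All 64 triplets are in use, including the four single-base ones, and most residues are specified by more than one codon, so neither of the two desiderata behind the comma-free scheme is met.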
What had gone wrong with the comma-free code is that none of the elegant properties that were imposed on the code from considerations about information are revered in the living organism. The attempts to decipher the code in the 1950s using these ideas were an unmitigated failure. Even the ideas that the code was fully synchronized and fully sequential eventually came to be modified through the discovery of frameshift mutations and noncoding regions of the genome. The only idea to be validated, besides the uniformity of codon length, was Schrödinger's original one: merely that of the existence of a genetic code. But even the simplest form of this idea comes with two attendant metaphors: that of information and, perhaps even more perniciously, that of DNA as language, with the code used to interpret DNA symbols into meaningful proteins. The latter metaphor deserves more systematic analysis than is possible here. It is particularly harmful because it disregards the fact that, ultimately, DNA is a molecule interacting with other molecules through a complex set of mechanisms. DNA is not just some text to be interpreted, and to regard it as such is an inaccurate simplification.
The theoretical schemes for the genetic code are particularly important because the success of any of them would have provided at least a rudimentary theory of information for molecular biology. However, despite their failure, the ideas of coding and information in molecular biology persisted. The reason for this is the unusual simplicity of prokaryotic genomes. Much of molecular biology, especially in the 1960s, was established through the study of only a single species, Escherichia coli, whose genome was particularly well behaved. Every DNA segment has either a coding or a regulatory function. For coding regions, transcription results in a complementary mRNA molecule that, with no further modification, is translated at the ribosome. If, following Crick, the specificity of a DNA sequence is identified with the information encoded in it, then the three precepts stated at the beginning of this article become plausible.
However, Crick's definition of information remains quirky: for instance, longer segments of DNA, unless they encode more genes, cannot be said to carry more information than shorter ones. Regulatory sequences do not have any coding role: they cannot be said to contain information in any straightforward sense. Worse, because the code is so arbitrary (i.e., there are no theories that explain why a particular codon codes for its amino acid residue), the concept of information cannot be invoked in any explanatory role: there is no potential for explaining novel biological phenomena by appeals to some property of information. Nevertheless, these problems become insignificant once attention shifts to eukaryotic genetics.
The unexpected complexity of eukaryotic genetics
Monod (1971) is responsible for what should perhaps be called the "Central Myth" of molecular genetics: that what is true for E. coli is true for elephants. Were this correct, then the notion of information formulated by Crick (1958), and the framework of coding that emerged from it, would not only make the three precepts at the beginning of this article true, but would also make the linguistic view of genetics palatable. However, developments since the early 1970s have shown Monod's claim of absolute universality to be incorrect. When molecular biologists turned to eukaryotic genetics in the late 1960s, they were in for surprise after surprise, so much so that Watson et al. (1983) entitled a chapter on eukaryotic genetics, "The Unexpected Complexity of Eukaryotic Genes." Precisely because of its bewildering complexity, eukaryotic genetics has no simple precepts like those that apply to prokaryotic genetics. Four sets of developments show how the simple information/coding picture inherited from E. coli begins to fall apart:
• The genetic code is not universal, although the amount of known variation is not great (see Fox 1987 for a review). At present, the most extensive variations have been found in mitochondrial DNA, in which, for instance, across all the major kingdoms UGA codes for tryptophan rather than causing translation to terminate as it does in the usual code. It can be argued that mitochondrial DNA is "special" because mitochondria probably arose as independent organisms that were subsequently incorporated into eukaryotic cells. However, in the nuclear DNA of at least four species of protozoa, UAA and UAG can code for glutamine rather than terminating translation. Moreover, in many species, UGA codes for amino acid residues that do not belong to the standard set of 20. In some viral DNA sequences UGA and UAG are sometimes, but not always, read through, that is, ignored both as termination signals and as codons (Fox 1987). Even in the same RNA sequence, these codons sometimes result in termination and are sometimes ignored. For example, the virus Qβ has a coat protein that is usually produced by having UGA read as a termination codon. However, 2% of the time it is ignored, resulting in a longer functional protein (Fox 1987).
• The discovery of frameshift mutations has destroyed any residual belief in a natural synchronization of the genetic code. The extent to which frame shifts are present in organisms is largely a matter of conjecture (Atkins et al. 1991). Sometimes, frame shifts at the DNA level are used to transcribe an RNA segment that is translated into a different protein than the standard one (Fox 1987).
• Not all DNA segments have a coding or regulatory function. In human genomes, as much as 95% of the DNA may have no function. Inside a segment that, as a whole, has a coding function, coding regions, called "exons," are interspersed with noncoding regions, called "introns." In almost all eukaryotes, the portions of RNA corresponding to the introns are spliced out after transcription. Moreover, alternative splicing (the production from the same transcript of different RNA segments, coding for different proteins) has also been found (Smith et al. 1989). Moreover, there are large segments of nonfunctional DNA between genes (i.e., between segments with known coding or regulatory roles). The existence of introns and other nonfunctional DNA segments makes it impossible to simply read off a DNA sequence and predict an amino acid sequence (even when all regulatory regions are known).
• Besides splicing, several types of mRNA editing are also now known (Cattaneo 1991). DNA segments producing transcripts that are subsequently so edited are called "cryptic genes." For instance, in mammalian intestinal cells, a certain C nucleotide in the mRNA for apolipoprotein becomes deaminated, converting it to a U and creating a stop codon. Deamination of C to U, and the reverse process (U to C amination), occur in several plant mitochondrial mRNA transcripts as well. Moreover, even more unusual behaviors have been observed with mitochondrial RNAs in which bases can be deleted or inserted. The latter, especially, leads to a situation that can be interpreted as the formation of proteins for which there are no genes. In an extreme case, in the human parasite Trypanosoma brucei, as many as 551 U's are inserted throughout the transcript coding for NADH dehydrogenase subunit 7, and 88 are deleted (Koslowski et al. 1990). In this case, the DNA segment encoding the primary transcript can hardly be considered a gene for NADH dehydrogenase subunit 7. By looking at the DNA sequence it would be impossible to predict beforehand that this was the protein that would eventually be produced. Moreover, in almost all eukaryotes bases are added as "tails" and "caps" to the RNA after transcription.
In the present context, RNA editing is the most interesting facet of eukaryotic genetics: it already shows how the first precept described at the outset of this article (that all information resides in the DNA sequences of the genome) cannot be universally true. Acceptance of the idea that not all information resides in DNA sequences implies the acceptance of the idea that not all information proceeds as a transfer from DNA to RNA to protein, at least through the conventional coding relationship. Then the Central Dogma becomes dubious.
It would be unreasonable to criticize molecular biologists for not predicting these complexities in the 1960s, long before there was any experimental evidence for them. Nevertheless, these complexities show that the information/coding picture inherited from E. coli should no longer be regarded as the conceptual core of molecular biology. In particular, one should wonder whether the idea of a genetic code captures something important about biological systems or whether it is simply a metaphor that has epiphenomenally emerged from the accident that both nucleic acids and proteins happen physically to be linear molecules.
Cybernetics and information theory
The argument that has been developed so far assumes that "information" should be basically understood as Crick (1958) construed it, that is, what is specified by a DNA sequence through the genetic code. The problems with that interpretation nevertheless leave open the possibility that there is some other interpretation of that term that will allow its recovery as an interesting theoretical concept of molecular biology. Historically, there have been two such interpretations, and these have to be disposed of to complete the argument of this article.
The first of these was already alluded to by Ephrussi et al.'s (1953) comment that understanding information transfer may involve exploring "cybernetics at the bacterial level." Cybernetics, from their and other similar points of view, would provide the theory for the use of information in molecular biology. However, nobody seems to be certain about what constitutes cybernetics (see Pierce 1962). The term "cybernetics" was popularized with messianic fervor by Norbert Wiener (1948). However, all that is clear from Wiener's work is that cybernetics is a theory of regulated and, especially, self-regulating systems. Regulation was posited to occur through "feedback," a concept that had entered biology long before the invention of cybernetics, but had been co-opted into the cybernetic framework (Keller 1995, Sarkar 1996). Feedback provides the information for regulation. In cybernetics, the concept of information gets no more explicit than the indication that "information" is that which enables regulation.
The value of cybernetics in molecular biology is doubtful, although putative cybernetic interpretations of genetics began as early as 1950 (see Kalmus 1950). If the published record is taken as evidence, these interpretations had negligible impact during the 1950s and early 1960s, when the conceptual framework of molecular biology came to be established. They were given a new lease of life by Monod (1971) in Chance and Necessity, when he reinterpreted much of his earlier work, including the model of allosteric regulation of proteins and the operon model of bacterial cell regulation, as examples of cybernetic systems (Sarkar 1996).
Whatever plausibility this interpretation may have had in 1971, it falls apart, once again because of the unexpected complexity of the eukaryotic genome. Eukaryotic gene regulation is not well understood even today, but it is clear that no model similar to the operon can account for the regulation of eukaryotic genes (see Sarkar 1996 for detail). Cybernetics appears to have been little more than a diversion in the development of molecular biology, but even if it is somehow reinstated (although it is hard to see how), its associated concept of information (that which enables regulation) cannot enable a recovery of the three precepts mentioned at the beginning of this article: "information" as "feedback" is hardly what resides in DNA, passes from DNA to RNA to protein, or, for that matter, can make the Central Dogma true.
A second alternative interpretation of information emerged from the mathematical theory of communication, which eventually came to be called information theory (Shannon 1948). In information theory, the amount of information is measured by the logarithm of the relative number of choices available during a communication process. Information connotes uncertainty; formally, its numerical value is determined by an entropy function that is similar to the usual entropy of statistical mechanics. In the 1950s, there were many attempts to apply this notion of information to molecular biology. Branson (1953) calculated the information content of polypeptide sequences using empirical frequencies of the various residues to calculate the uncertainty at each position of a sequence. In a similar manner, Linschitz (1953) attempted to calculate the information content of a bacterial cell. However, by 1956, even its staunchest proponent, Quastler (1958, p. 399) conceded at least temporary defeat:
Information theory is very strong
on the negative side, i.e. in demonstrating what cannot be done;
on the positive side its application to the study of living things
has not produced many results so
far; it has not led to the discovery
of new facts, nor has its application to known facts been tested in
critical experiments. To date, a
definitive judgment of the value
of information theory in biology
is not possible.
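The per-position uncertainty calculation attributed above to Branson (1953) can be illustrated with a minimal sketch. It assumes a set of equal-length, already aligned sequences and estimates the probability of each residue at a position from its empirical frequency. The function name `positional_entropy` and the four toy sequences are hypothetical; they are not taken from Branson's or Quastler's work.

```python
from collections import Counter
from math import log2


def positional_entropy(sequences):
    """Return the Shannon uncertainty (in bits) at each aligned position.

    Assumption: all sequences have the same length and are already aligned.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must all have the same length")

    entropies = []
    for column in zip(*sequences):  # one tuple of residues per position
        counts = Counter(column)
        total = len(column)
        # H = sum over residues of p * log2(1 / p), with p the empirical frequency
        h = sum((n / total) * log2(total / n) for n in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical toy data: positions 1 and 2 never vary, position 4 always does.
toy_sequences = ["ACGT", "ACGA", "ACTC", "ACAG"]
print(positional_entropy(toy_sequences))  # [0.0, 0.0, 1.5, 2.0]
```

A position at which every sequence agrees carries zero uncertainty, whereas a position at which all four residues are equally frequent reaches the maximum of two bits for a four-letter alphabet. The value is a property of the set of sequences, not of any single sequence.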
Sporadic attempts to apply information theory directly to molecular
biology continue, but the results are
less than exciting. For instance, a
major result of Yockey's (1992) attempt
to apply information theory
to molecular biology is that polypeptides may not code for DNA sequences (which is Yockey's version
of the Central Dogma). The basis for
this "theorem" is the degeneracy of
the genetic code: a given polypeptide sequence can be encoded by
different DNA sequences. The conclusion is correct. What is mysterious is why information theory, or any abstract theoretical framework, has to be invoked to make so trivial a point. It is a trivial combinatorial fact that was known by Gamow, Crick, or anyone else who had ever thought about the relation between DNA and protein.
Recently, Thomas Schneider and his collaborators (starting with Schneider et al. 1986) have made promising use of information theory to find the most functionally relevant parts of long DNA sequences when these are all that are available. The basic idea, which goes back to Kimura (1961), is that functional portions of sequences are most likely to be conserved through natural selection. These will therefore have low information content (in Shannon's sense). Whether Schneider's methods will live up to their initial promise remains to be seen. Nevertheless, for conceptual reasons alone, this notion of "information" (i.e., Shannon information) is irrelevant in the present context. According to this notion, for DNA sequences the "information" content is a property of a set of sequences: the more varied a set, the greater the "information" content at individual positions of the DNA sequence. But "information" in this scheme is not actually what an individual DNA sequence contains, that is, not what would be decoded by the cellular organelles. Worse, what Kimura's (1961) argument suggests is that what should be regarded as biologically informative, namely functional sequences, are exactly those that have low "information" content.
Conclusions
Thus, neither cybernetics nor formal information theory can rescue the concept of information for molecular biology in such a way as to permit the recovery of the conventional picture of DNA sequences encoding information to be decoded using the genetic code. The natural conclusion is that the conventional picture should be abandoned. However, because the concepts of information and coding have been central to how molecular biology is currently understood, abandoning these concepts will have important consequences. There are at least five such consequences, and a little reflection shows that they are, in fact, desirable:
• If biological "information" is not DNA sequence alone, other features of an organism can also contain information. This is precisely what recent discoveries indicate. In particular, the developmental fate of a cell might be largely a result of features such as methylation patterns of DNA, which are not even ultimately determined only by DNA sequences (see Jablonka and Lamb 1995). These "epigenetic" patterns can be inherited for several cell generations. Different cells in the same organism, presumably with identical DNA sequences, can have different epigenetic patterns. These differences can result in cell specialization and differentiation, the usual prelude to developmental changes. Epigenetic specifications are also critical in generating differences in offspring (of sexually reproducing organisms), depending on whether an allele is inherited from the mother or the father. Epigenetic specifications are sometimes transmitted across organismic generations. If "information" is to have any plausible biological significance, it would be odd not to regard the transfer of these specifications as transfers of information. The conventional DNA-based concept of information precludes this possibility.
• The Central Dogma of molecular biology is false if it is construed as a universal biological law. However, a less grandiose claim, that protein sequences do not directly specify nucleic acid sequences in the way in which the latter specify the former, remains true. This humbler claim does not have the majestic rhetorical power of the Central Dogma, but does this retreat really undermine some putative insight that is enshrined in that dogma? The usual defense of the general biological importance of the Central Dogma is that it is a statement at the molecular level of the noninheritance of acquired characteristics (see, for example, Crick 1958 and Maynard Smith 1989). However, this interpretation of the Central Dogma is
entirely unjustified. Acquired characteristics are occasionally inherited,
although usually not (Jablonka and
Lamb 1995, Landman 1991). What
ensures that even those acquired
characteristics that involve changes
in DNA are not inherited in higher
animals is the segregation of the
germline from the soma. But plants
have no germline, and the extent of
its segregation in animals varies
greatly across phyla (Buss 1987).
Nevertheless, whatever the relation
between nucleic acid and protein,
that relation shows no such variability across the phyla: ipso facto the
Central Dogma, even if it were true,
could not be either an explanation
or an alternative synonymous statement of the alleged noninheritance
of acquired characteristics. There is
certainly something peculiar, and
extremely interesting, about how
DNA resists easy change across the
phyla. But this observation is something to be studied and understood,
not something to be explained away
on the basis of some alleged law
about some incoherent notion of information.
• Many influential contemporary
discussions of the origin of life have
concentrated on the origin of information, in which information is construed simply to be nucleic acid sequences (e.g., Eigen 1992). Implicit
in these discussions is the assumption that nucleic acid sequences ultimately encode all that is necessary
for the genesis of living forms and,
therefore, that a solution to the problem of the initial generation of these
sequences will solve the problem of
the origin of life. The move away
from sequences would put these efforts in proper perspective: to explain the possible origin of persistent segments of DNA does not
suffice as an explanation of the origin of living cells.
• The emphasis on DNA sequences
that marks contemporary molecular
biology is misplaced. Therefore, the
sorts of arguments that were mustered to initiate the Human Genome
Project (HGP, a crash program to
sequence DNA blindly, that is, without first determining the functional
roles of the segments to be sequenced)
are less than compelling. This is not
a new point. It has previously been
made, on the basis of other considerations,
by many critics of the HGP
(Davis 1992, Lederberg 1993,
Lewontin 1992, Sarkar 1992). These
arguments, taken together, strongly
suggest that the HGP should be limited to the mapping of all known
genetic loci to specific positions on
chromosomes and the sequencing of
only those segments that are found
to have some functional interest.
There is little scientific rationale for
the blind sequencing of DNA, and
the shift of scarce resources to it is
unjustified. Human and other genome sequences will ultimately be
sequenced, with or without the HGP,
but should such sequencing proceed
at a normal pace, not only would
such a shift of resources not occur
but there would be more time to
prepare for the well-known social
and ethical problems that the HGP
raises (see, for example, Holtzman
1989).
• Abandoning the coding metaphor will also do much to liberate biology from the unfortunate linguistic metaphor of an organism's (or a cell's) DNA sequence being a message in some language to be decoded. Despite the immense popularity of this metaphor (see, for example, Wills 1991 and Pollack 1994), at the technical level the linguistic metaphor is at best only as helpful in understanding biology as the concept of coding. The complexities of eukaryotic genetics show that the code is of only limited use in the transition from DNA to an organism's biology. Given a DNA sequence, simply to read off an amino acid sequence requires knowing whether any nonstandard coding is being used, what reading frame is to be used, where all gene-nongene and intron-exon boundaries lie, and what kinds of RNA editing will take place. Even at the metaphorical level, it is unlikely that these complexities can all be treated as questions of language: after all, natural languages do not contain large segments of meaningless signs interspersed with occasional bits of meaningful symbols. Of course, even with an amino acid sequence, biology has barely begun: one then faces the problem of going to higher levels of organization, and in the absence of a solution to the protein folding problem there is little prospect for doing that if one really starts from a DNA "text." In any case, the sterility of the informational picture of molecular biology is a much-needed reminder that DNA is, ultimately, a molecule and not a language.
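The dependence of the read-off on the reading frame and on the coding table in use, noted in the last point above, can be made concrete with a short sketch. It assumes the standard genetic code written in the conventional TCAG ordering. The function `translate`, the toy sequence, and the one-entry variant table (TGA read as tryptophan, as in vertebrate mitochondria) are illustrative assumptions, and gene boundaries, introns, and RNA editing are ignored entirely.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical partial variant for illustration: only the TGA entry is changed.
VARIANT_CODE = dict(STANDARD_CODE, TGA="W")


def translate(dna, frame=0, code=STANDARD_CODE):
    """Read off one-letter amino acids from `dna`, starting at offset `frame`."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(code[codon] for codon in codons)


toy_dna = "ATGTGATAA"  # hypothetical sequence, not from the article
for frame in range(3):
    print(frame, translate(toy_dna, frame))
# 0 M**
# 1 CD
# 2 VI
print(translate(toy_dna, 0, VARIANT_CODE))  # MW*
```

The same nine bases yield three unrelated read-offs depending on the frame, and a different one again when a single table entry is changed, so the sequence alone does not fix the amino acid sequence.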
Acknowledgments
Parts of this article also appear in
Sarkar (1996), which provides a more
detailed treatment of many of the
issues discussed here. Thanks are
due to Angela Creager, Larry
Holmes, Manfred Laubichler, Lily
Kay, Evelyn Fox Keller, Joshua
Lederberg, Richard Lewontin, William C. Wimsatt, and two anonymous referees for extensive discussions and comments on an earlier
version of this article. Work on this
article was partly funded by a fellowship at the Dibner Institute at the
Massachusetts Institute of Technology.
References cited
Atkins JF, Weiss RB, Thompson S, Gesteland RF. 1991. Towards a genetic dissection of the basis of triplet decoding, and its natural subversion: programmed reading frame shifts and hops. Annual Review of Genetics 25: 201-228.
Avery OT, MacLeod CM, McCarty M. 1944. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a deoxyribonucleic acid fraction isolated from pneumococcus III. Journal of Experimental Medicine 79: 137-157.
Beadle GW, Tatum E. 1941. Genetic control of biochemical reactions in Neurospora. Proceedings of the National Academy of Sciences of the United States of America 27: 499-506.
Branson HR. 1953. A definition of information from the thermodynamics of irreversible processes. Pages 25-40 in Quastler H, ed. Essays on the use of information theory in biology. Urbana (IL): University of Illinois Press.
Buss L. 1987. The evolution of individuality. Princeton (NJ): Princeton University Press.
Cattaneo R. 1991. Different types of messenger RNA editing. Annual Review of Genetics 25: 71-88.
Crick FHC. 1958. On protein synthesis. Symposium of the Society for Experimental Biology 12: 138-163.
Crick FHC, Griffith JS, Orgel LE. 1957. Codes without commas. Proceedings of the National Academy of Sciences of the United States of America 43: 416-421.
Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ. 1961. General nature of the genetic code for proteins. Nature 192: 1227-1232.
Davis BD. 1992. Sequencing the human genome: a faded goal. Bulletin of the New York Academy of Medicine 68: 115-145.
Eigen M. 1992. Steps towards life: a perspective on evolution. Oxford (UK): Oxford University Press.
Ephrussi B, Leopold U, Watson JD, Weigle JJ. 1953. Terminology in bacterial genetics. Nature 171: 701.
Fox TD. 1987. Natural variation in the genetic code. Annual Review of Genetics 21: 67-91.
Gamow G, Rich A, Ycas M. 1955. The problem of information transfer from the nucleic acids to proteins. Advances in Biological and Medical Physics 4: 23-68.
Golomb SW. 1962. Efficient coding for the desoxyribonucleic acid channel. Proceedings of the Symposium for Applied Mathematics 14: 87-100.
Golomb SW, Welch LR, Delbrück M. 1958. Construction and properties of comma-free codes. Biologiske Meddelelser Kongelige Danske Videnskabernes Selskab 23(9): 1-34.
Holtzman N. 1989. Proceed with caution. Baltimore (MD): The Johns Hopkins University Press.
Jablonka E, Lamb MJ. 1995. Epigenetic inheritance and evolution: the Lamarckian dimension. Oxford (UK): Oxford University Press.
Judson HF. 1979. The eighth day of creation. New York: Simon and Schuster.
Kalmus H. 1950. A cybernetical aspect of genetics. Journal of Heredity 41: 19-22.
Keller EF. 1995. Refiguring life: metaphors of twentieth century biology. New York: Columbia University Press.
Kimura M. 1961. Natural selection as a process of accumulating genetic information in adaptive evolution. Genetical Research 2: 127-140.
Koslowski DJ, Bhat GJ, Preollaz AL, Feagin JE, Stuart K. 1990. The MURF3 gene of T. brucei contains multiple domains of extensive editing and is homologous to a subunit of NADH dehydrogenase. Cell 62: 901-911.
Landman OE. 1991. The inheritance of acquired characteristics. Annual Review of Genetics 25: 1-20.
Landsteiner K. 1936. The specificity of serological reactions. Springfield (IL): C. C. Thomas.
Lederberg J. 1956. Comments on the gene-enzyme relationship. Pages 161-169 in Gaebler OH, ed. Enzymes: units of biological structure and function. New York: Academic Publishers.
___. 1993. What the double helix has meant for basic biomedical science: a personal commentary. Journal of the American Medical Association 269: 1981-1985.
Lederberg J, Tatum EL. 1946a. Gene recombination in Escherichia coli. Nature 158: 558.
___. 1946b. Novel genotypes in mixed cultures of biochemical mutants of bacteria. Cold Spring Harbor Symposia on Quantitative Biology 11: 113-114.
Lewontin RC. 1992. Biology as ideology: the doctrine of DNA. New York: Harper-Perennial.
Linschitz H. 1953. The information content of a bacterial cell. Pages 251-262 in Quastler H, ed. Essays on the use of information theory in biology. Urbana (IL): University of Illinois Press.
Luria SE, Delbrück M. 1943. Mutations of bacteria from virus sensitivity to virus resistance. Genetics 28: 491-511.
Matthaei JH, Nirenberg MW. 1961a. Characterization and stability of DNAse-sensitive protein synthesis in E. coli extracts. Proceedings of the National Academy of Sciences of the United States of America 47: 1580-1588.
___. 1961b. The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proceedings of the National Academy of Sciences of the United States of America 47: 1588-1594.
Maynard Smith J. 1989. Evolutionary genetics. Oxford (UK): Oxford University Press.
Mazia D. 1956. Nuclear products and nuclear reproduction. Pages 261-278 in Gaebler OH, ed. Enzymes: units of biological structure and function. New York: Academic Publishers.
Monod J. 1971. Chance and necessity: an essay on the natural philosophy of modern biology. New York: Knopf.
Pauling L. 1940. A theory of the structure and process of formation of antibodies. Journal of the American Chemical Society 62: 2643-2657.
Pierce JR. 1962. Symbols, signals and noise. New York: Harper and Brothers.
Pollack R. 1994. Signs of life: the language and meanings of DNA. Boston: Houghton Mifflin.
Quastler H, ed. 1958. The status of information theory in biology: a round-table discussion. Pages 399-402 in Yockey HP, ed. Symposium on information theory in biology. New York: Pergamon Press.
Sarkar S. 1989. Reductionism and molecular biology: a reappraisal. [Ph.D. dissertation.] Department of Philosophy, University of Chicago, Chicago, IL.
___. 1991. What is life? Revisited. BioScience 41: 631-634.
___. 1992. Para que sirve el proyecto Genoma Humano. La Jornada Semanal 180: 29-39.
___. 1996. Biological information: a skeptical look at some central dogmas of molecular biology. Pages 187-231 in Sarkar S, ed. The philosophy and history of molecular biology: new perspectives. Dordrecht (the Netherlands): Kluwer.
Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. 1986. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188: 415-431.
Schrödinger E. 1944. What is life? The physical
aspect of the living cell. Cambridge (UK):
Cambridge University Press.
Shannon CE. 1948. A mathematical theory of communication. Bell System Technical Journal 27: 379-423, 623-656.
Smith CW, Patton JG, Nadal-Ginard B. 1989. Alternative splicing in the control of gene expression. Annual Review of Genetics 23: 527-577.
Spiegelman S. 1956. On the nature of the enzyme-formation system. Pages 67-92 in Gaebler OH, ed. Enzymes: units of biological structure and function. New York: Academic Publishers.
Timofeeff-Ressovsky HA, Timofeeff-Ressovsky NW. 1926. Über das phänotypische manifestieren des genotyps. II. Über idio-somatische variationsgruppen bei Drosophila funebris. Roux Archiv für Entwicklungsmechanik der Organismen 108: 146-170.
Watson JD, Crick FHC. 1953a. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171: 737-738.
___. 1953b. Genetical implications of the structure of deoxyribonucleic acid. Nature 171: 964-967.
Watson JD, Tooze J, Kurtz DT. 1983. Recombinant DNA: a short course. New York: W. H. Freeman and Co.
Wiener N. 1948. Cybernetics. Cambridge (MA): MIT Press.
Wills C. 1991. Exons, introns, and talking genes: the science behind the human genome project. New York: Basic Books.
Woese CR. 1963. The genetic code-1963. ICSU Review of World Science 5: 210-252.
Yanofsky C, Carlton BC, Guest JR, Helinski DR, Henning U. 1964. On the colinearity of gene structure and protein structure. Proceedings of the National Academy of Sciences of the United States of America 51: 266-272.
Ycas M. 1969. The biological code. Amsterdam (the Netherlands): North-Holland.
Yockey HP. 1992. Information theory and molecular biology. Cambridge (UK): Cambridge University Press.
Sahotra Sarkar is an associate professor
in the Department of Philosophy,
McGill University, Montreal, Quebec
H3A 2T7, Canada. He is currently a
Fellow of the Wissenschaftskolleg zu
Berlin, Wallotstrasse 19, D-14193 Berlin, Germany. © 1996 American Institute of Biological Sciences.
BioScience Vol. 46 No. 11