Thinking of Biology
An introduction to molecular linguistics

by Patricia Bralley
Biology is filled with linguistic
metaphors. The punctuated
yet commaless genetic code is
transcribed and translated. Gene
transcripts are edited, and cells are
said to communicate. Recently the
newspapers carried the story of the
discovery of a gene, a "molecular
spell checker," whose malfunction
causes colon cancer (Saltus 1994).
Immunologist Niels Jerne in his 1984
Nobel Prize acceptance speech suggested that the variable region of an
antibody might correspond to a sentence or phrase and the vast repertoire of immune response might be
considered a vocabulary of sentences
(Jerne 1988). Why are there so many
metaphors? Is there more here than
analogy?
This article addresses two questions: How deeply does the biological-linguistic analogy run, and to
what uses can it be put today? One
discovers that historically the analogy keeps reemerging: from Darwin's (1859) comparison of a species to a natural language to Gamov's
(1954) suggestion that the synthesis
of proteins from DNA be considered a problem in coding. To Gamov,
DNA was a long, four-digit number that had been translated into
words spelled from the 20-letter
amino acid alphabet. Crick (1959)
viewed the genetic coding problem
as that of "translating one language
to another." The idea of the analogical use of language, letters, and
translation, which may now seem
useful but hardly profound, was
stimulating and controversial when
first proposed. It formed the basis of
the theory driving early experiments
deciphering the genetic code. Today, techniques from formal language theory are used to analyze the
data from genome sequencing
projects, while the analogy suggests
that new ways of understanding the
evolution of complex systems await
discovery.
Analogy with natural language
The oldest biological-linguistic analogy can be traced to Darwin's On
the Origin of Species (Darwin 1859).
Darwin suggested that a species is
like a language: They both evolve
through time and under geographical constraints, and they both undergo a process of evolution based
on a low level of constant change.
Darwin even compared vestigial organs to the unpronounced letters in
words.
The data from molecular biology
have suggested a second form of the
analogy: The cell's molecules correspond to different objects found in
natural languages. Küppers (1990)
delineates possible correspondences:
a nucleotide corresponds to a letter,
a codon to either a phoneme (the
smallest unit of sound) or a morpheme (the smallest unit of meaning), a gene to a word or simple
sentence, an operon to a complex
sentence, a replicon to a paragraph,
and a chromosome to a chapter. The
genome becomes a complete text.
Küppers (1990) emphasizes the
thoroughness of the mapping and
notes that it presents a hierarchical
organization of symbols. Like human language, molecular language
possesses a syntax. Just as the syntax of natural language imposes a
grammatical structure that allows
words to relate to one another in
only specific ways, biological symbols combine in a specific structural
manner.
A closer examination of the
molecular-linguistics analogy is
achieved by considering four essential design features (Lyons 1977)
that enable language to function as
a signaling, or semiotic, system: discreteness, arbitrariness, duality, and
productivity. Jiménez-Montaño
(1989) examines how biological
molecules satisfy several of these
criteria. First, language is composed
of discrete and arbitrary elements: phonemes (in speech), alphabet letters (in writing), or discrete motions
(in sign language). These discrete
elements are linearly arrayed and
numerable, as are the bases of DNA
or the amino acids of a protein.
Arbitrariness refers to the fact that
these basic units have no a priori
meaning. A phoneme such as th no
more suggests the meaning of a word
than a single codon suggests the function of the protein for which it codes.
The fact that phonemes have no
inherent meaning yet are noninterchangeable (just as within a gene a
thymine cannot be replaced by a
cytosine without risk of fatal mutation) gives rise to the linguistic property of duality. Duality uses two
discrete combinatorial systems: One
combines meaningless sounds into
meaningful morphemes, while the
second combines meaningful morphemes into words and ultimately
sentences. Because each discrete
combinatorial system combines a
finite set of elements into any number of larger structures, duality is an
economical and powerful way to
produce an infinity of meaningful
forms from a few elements. It is a
strategy also used by the cell: four nucleotides combine into 64 codons;
codons combine into many different
genes.
The power of duality finds its
expression in productivity or creativity. Human linguistic competence involves the ability to create
an infinite variety of sentences.
Chomsky emphasized the creative
nature of language when formulating his theory of generative grammar (Chomsky in Radford 1981).
Similarly, the biological diversity of
living systems presents a seemingly
inexhaustible number of possibilities. Or, as Nobel laureate and geneticist Barbara McClintock said,
"Anything you can think of [in biology J you will find" (McClintock
qunted in Keller 1983, p. 199).
In addition to discreteness, arbitrariness, duality, and productivity,
all natural languages require redundancy. A design feature engineered
into any reliable communication
system, redundancy is found in cells
and languages at many levels. Nearly
every passage of prose contains
words that could be left out without
affecting comprehension. Words too
can be recognized even when letters
are deleted (Campbell 1982). This
redundancy is a safeguard against
errors. It allows parts of a message
to be statistically related and thus
assures that a message gets through,
despite inevitable production errors,
such as mispronunciations, slips of
the tongue, incorrect grammar, or
speaking too softly or quickly to be
heard. Similarly, redundancy in the
genetic code protects against genetic
errors. The redundancy of having
61 codons translate to only 20 amino
acids helps minimize the effects of
deleterious mutations. Changes in
the third position of a codon often
give rise to the same or chemically
similar amino acid.
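The arithmetic of this buffering is easy to check directly. The short Python sketch below uses only a fragment of the standard genetic code (the full table has 64 entries) to list what happens to two codons when their third position mutates.

# A fragment of the standard genetic code (RNA codons), enough to
# illustrate third-position redundancy.
CODON_TABLE = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GAU": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
}

def third_position_mutants(codon):
    # All codons differing from the given codon only in the third position.
    return [codon[:2] + base for base in "ACGU" if base != codon[2]]

for codon in ("GCA", "GAU"):
    outcomes = {CODON_TABLE[m] for m in third_position_mutants(codon)}
    print(codon, CODON_TABLE[codon], "->", outcomes)

For the fourfold-degenerate alanine codon GCA, every third-position change is silent; for GAU, the possible changes yield either aspartate again or the chemically similar glutamate.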
Figure 1. A parsing of the sentence "The cat saw a mouse" reveals the hierarchical organization and syntactical components: NP, noun phrase; VP, verb phrase; Det, determiner; N, noun; V, verb; and Art, article. Parse trees can also be viewed as representing a dynamic process of generating a number of similarly constructed sentences from a small number of rules.

All natural languages also possess a grammar: rules for combining phonemes, morphemes, words, and phrases. Syntax establishes "legal" structures; semantics establishes meaning. Chomsky's (1972, p. 14) famous sentence, "Colorless green ideas sleep furiously," while syntactically correct, fails on the semantic
level. Phonological rules allow that
clonch, while not an actual English
word, seems more acceptable than
grzyb (Polish for mushroom). The
rules of grammar ensure that if words
are randomly chosen and strung together, they will probably not form
a legal sentence. Similarly, polymerizing randomly chosen nucleotides
or amino acids fails to produce a
functional gene or protein.
The fact that not all combinations of phonemes or words occur
allows definition of larger elements
such as words and sentences. The
general goal of structural linguistics, distinguishing how differences in syntactic structure reflect differences in meaning, addresses the same problems facing molecular biologists. What are the grammatical rules for forming protein structures such as an α-helix, a β-ribbon, or a three-dimensional conformation?
Where are the meaningful regularities in a sequence of an operator,
enhancer, gene, or operon?
Searching for the structural
grammar of amino acids
Which amino acids in a linear sequence of a protein determine the
final conformation? The problem,
which King (1989) called the
"search for the grammar of amino
acids," is one area suited to linguistic techniques.
Proteins are hierarchical in structure. They have a linear primary
structure, which determines the secondary structures of helices, turns,
and sheets. These secondary structures fold into a three-dimensional
tertiary structure-the functional
protein.
Language is also hierarchical.
While words are linearly arrayed
into sentences, they are not strung
together like beads on a string. There
are syntactic frames, analogous to
secondary helices and sheets. Words
or whole phrases are assembled to
create a sentence: Toby/ran lazily/
down the street. Additionally, nonadjacent words are dependent on
each other. Consider: The lioness
hurt herself. The word lioness determines the selection of herself
rather than himself.
The protein folding problem suggests the necessity of finding the
molecular equivalent of linguistic
phrases and dependencies. Linguistically, this problem is solved by the
process of parsing, the orderly and
systematic arranging of phrases.
Parsing is visualized in a tree diagram (figure 1), which shows the
hierarchical organization of the sentence (S) into its component phrases: noun phrase (NP) and verb phrase (VP), composed of article (Art), determiner (Det), noun (N), or verb (V).

Figure 2. A partial parsing of the Escherichia coli protein isoleucyl-tRNA synthetase (residues 48-99) comprising the mononucleotide binding fold (MBF) subdomain. Parsing proceeds from the primary sequence of amino acids (represented by their single-letter abbreviations), to secondary structures of helix, turn, and β-strand, to the final tertiary structure of the MBF subdomain (Lathrop et al. 1993).
Tree diagrams can also be viewed
as representing a dynamic process, a
way of generating a large number of
sentences with similar structures
from a small number of rules. One
rule might be that a sentence consists of a noun phrase and a verb
phrase. There are also rules for the
substructures of phrases. A noun
phrase might be composed of a noun
and an article or of another noun
phrase and an adjectival phrase.
These rules of composition are what
the biologist needs to define for protein folding.
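The generative reading of a parse tree is concrete enough to execute. The Python sketch below implements a toy grammar in the spirit of Figure 1; the particular rules and vocabulary are invented for illustration, not taken from any published grammar.

import random

# A toy generative grammar: nonterminals (S, NP, VP, ...) are rewritten
# until only terminal words remain.
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"]],
    "VP":  [["V", "NP"]],
    "Art": [["the"], ["a"]],
    "N":   [["cat"], ["mouse"], ["dog"]],
    "V":   [["saw"], ["chased"]],
}

def generate(symbol="S"):
    # Expand a symbol by recursively applying a randomly chosen rewrite rule.
    if symbol not in RULES:  # a terminal word is emitted as-is
        return [symbol]
    expansion = random.choice(RULES[symbol])
    return [word for part in expansion for word in generate(part)]

for _ in range(3):
    print(" ".join(generate()))  # e.g., "the cat chased a mouse"

A handful of rules thus generates dozens of similarly constructed sentences, which is the dynamic process the tree diagram summarizes.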
Figure 2 shows a typical protein
parsing done on the mononucleotide
binding fold of isoleucyl-tRNA synthetase from Escherichia coli
(Lathrop et al. 1993). The analysis
is used within the artificial intelligence program ARIADNE to predict
protein structure.
The generative grammar of a
genetic sentence
Collado-Vides (1991) uses a generative grammar model to analyze gene
regulation. Here rules are applied
recursively to generate the complete
sentence. His so-called sentences are
various transcription units such as
the lac operon of E. coli. Lexical
categories (e.g., noun and verb) are
created by promoters, operators, and
structural genes. A single parse tree
may generate a number of different
sentences depending on the actual
DNA sequence in each lexical category. For example, the promoter
may be either the lac promoter, the
ara operon promoter, or a phage
promoter.
A chain of lexical units is grammatically correct if RNA polymerase
can read a translatable sentence such
as "promoter, start codon, gene, stop
codon," rather than a garbled "stop
codon, start codon, gene, promoter."
However, this regulatability is not
synonymous with physiological usefulness. A gene for tryptophan synthesis regulated by a lactose-regulated operator and promoter makes
no sense physiologically and is analogous to a sentence being syntactically correct but semantically incorrect. Thus physiological usefulness
is likely to require another grammatical component corresponding to the
semantic.
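The syntactic half of such a check can be caricatured with a regular expression, although a regular grammar is far weaker than the generative grammar Collado-Vides actually uses; the category names below are illustrative placeholders, not his notation.

import re

# A toy syntax check for a transcription unit: the "sentence" is legal
# only if its lexical categories appear in a readable order.
LEGAL_SENTENCE = re.compile(r"^promoter (operator )?start_codon gene stop_codon$")

def grammatical(units):
    return bool(LEGAL_SENTENCE.match(" ".join(units)))

print(grammatical(["promoter", "start_codon", "gene", "stop_codon"]))  # True
print(grammatical(["stop_codon", "start_codon", "gene", "promoter"]))  # False

As the text notes, such a check captures only syntax; nothing in it would flag the physiologically senseless pairing of a tryptophan gene with a lactose-regulated promoter.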
Assessing the analogy
Most biologists coming to an understanding of molecular linguistics
probably do so by making intuitive
analogies with natural languages.
However, lay definitions and intuitive understandings can mislead
technical discussions. None of the
analogies presented here agrees on
correspondences. Crick (1959) uses
DNA bases as letters, then translates
codons into amino acids, the so-called letters for Gamov's (1954) protein words. Küppers's (1990) so-called word is a gene (DNA), and the
cell is a book, not a person as Sereno
(1991) suggests. Collado-Vides
(1991) has DNA (promoters, operators) functioning as a word, a gene
as a sentence. Jiménez-Montaño's
(1989) analysis of redundancy actually generalizes from information
theory, not language.
Language is not equivalent to
communication or speech (Wilkins
and Wakefield 1995). Spelling, letter frequencies, and diphthongal
associations of letters are not parts
of grammar, because writing systems are artifacts distinct from language (Pinker 1994). To critically assess the correctness and depth of
the biological-linguistic analogy, one
needs to decompose and analyze it
for parallel relationships between
the hierarchically organized biological and linguistic components.
Sereno (1991), in a thoughtful
critique, schematizes prominent
forms the analogy has historically
taken (Figure 3). For instance, evolutionary epistemologists compare
a species to a scientific discipline
and the evolution of an organism to
the evolution of concepts (the so-called organism-concept analogy);
this analogy connects three different hierarchical levels in an inconsistent manner (Figure 3).
Darwin's species-language analogy holds
on two levels. The population-level
alignment, in which a species corresponds to a language, implies that
an organism is analogous to a
speaker of a language. Thus, an
organism's embryological development corresponds to a speaker's
development of linguistic skills, and
heritability corresponds to transmittal of language to offspring. However, there are some places where
the analogy breaks down. Spoken
languages (except for scientific jargons) do not become more adapted
with time. Middle English was just
as effective a medium as modern
English (Sereno 1991). Lower levels
BioScience Vol. 46 No.2
of correspondences become less clear
also. In a cell it is the DNA sequence
of the gene that mutates. In language the set of phonemes changes
with time, as well as the set of morphemes and the meanings of words
themselves. It is not clear which of
these changes corresponds to the
nucleotide, nor is it clear what is equivalent to the cell in a language speaker.
Sequence differences are dealt
with in a more consistent manner by
aligning a cell to the conceptual system of a person (the so-called cell-person analogy). New DNA sequences produced by mutation
correspond to novel combinations
of phonemes heard from day to day.
The cell-person analogy emphasizes
parallels between the symbolic-representation systems of both cell and
person with detailed correspondences. For example, on the molecular level, mRNA corresponds to
the activity pattern of the secondary
auditory cortex caused by the sounds
of several sentences. Protein folding
corresponds to the comprehension
of word-sound sequences (see Table
1). The analogy relates three levels
of organization in the most consistent manner: An organism corresponds to a community, a cell to a
person, and biomolecules to neural
network firing patterns (Figure 3).
This analogy holds, however, only for
language comprehension, not production. It is as if the cell has the ability
to listen to and comprehend its own
internal chatter. To produce language (as a speaker does) the cell
would have to unfold a protein, turn
it into DNA, and inject that DNA
into another cell.
Figure 3. A comparison of three different forms of the biological-linguistic analogy (the organism-concept, species-language, and cell-person) across different levels of organization. The organism-concept analogy maps correspondences inconsistently to different levels. In the species-language analogy, objects in one system map to objects of approximately the same size in the other system. In the cell-person analogy, small biological objects map to larger, linguistic/cultural objects.

Table 1. Summary of the cell-person analogy.

Cell | Person
DNA nucleotide | Phoneme
DNA codon | Sounds in a word
RNA nucleotide | Auditory cortex activity for phoneme
mRNA | Auditory cortex activity for sounds of several sentences
Ribosome | Auditory cortex activity that assembles unit meaning patterns into chain
Amino acid | A word meaning caused by activity in secondary visual cortex
Functional domain of an enzyme | Structure in short-term memory upon hearing discourse of clauses
Enzyme substrates (e.g., amino acids, proteins, and carbohydrates) | Mental "objects" (e.g., single words, discourses, emotions, and images)
Substances of prebiotic soup | Prelinguistic firing patterns in primate brain

Formal language theory

A mathematically precise method for dealing with languages is provided by formal language theory. Formal language theory shows that biomolecules can actually be considered languages and provides methods for dealing with current problems in computational biology, creating the new discipline of molecular linguistics.

Chomsky introduced formal language theory in the 1950s to explain the productivity of natural languages. However, its relevance to the artificial languages of computer science was quickly realized. Today, many linguists consider both natural and artificial languages as the output of formal systems comprising sets of objects (words), axioms, and rules for changing sequences of words into other words. Formal language theory has an abstract similarity to processes of logic and deals in theorems, proofs, and hypotheses. It is at this most mathematical level that DNA, RNA, and proteins appear most languagelike. The question then becomes: What is the nature of this language? Is it more like an artificial computer language or more like natural human languages?

A language, as defined by formal language theory, has three components (Hopcroft and Ullman 1979). First, there is an alphabet composed of a set of indivisible elements called terminal symbols. Terminal refers to the fact that these symbols are allowed in the final form of a sentence. This alphabet might be composed of 2 symbols (0, 1), 4 letters (A, C, G, T), 20 amino acids, 26 letters, or a lexicon of English words. Second, there is a set of nonterminal symbols that are used in the production rules of grammar. These nonterminals are intermediate constructs such as the noun phrase and verb phrase seen earlier in the parse trees of sentences. Among the nonterminals is also a start symbol, S, which stands for sentence. Third, there is a set of rules, called rewrite rules, for combining these symbols into strings (sentences) and deciding which strings are legal.
Thus the start symbol, S, could form the left side of the grammatical rule S → NP, V (which means "The nonterminal, S, can be rewritten with the nonterminals NP and V"). Other rewriting rules might be:

N → cat
V → frisked
NP → Art, N
Art → the

such that the nonterminals are eventually rewritten with terminal symbols to produce: The cat frisked. Together, the nonterminals and rules make up a grammar. A formal language is thus a subset of legal strings produced from all the possible combinations of symbols in an alphabet. Depending upon the alphabet, these strings may constitute an English sentence, a protein sequence, a gene sequence, or simply 000110011000111. The resulting language may appear to be natural or decidedly artificial.

Figure 4. The long-distance dependencies of a protein suggest that the linguistic sophistication of biomolecules is comparable to that of natural languages. Amino acid residues interacting through polar and hydrogen bonding establish the tertiary structure of a protein (a). If the protein is linearized, these interactions are revealed as crossed (A, B, C) and nested (D, E, F) dependencies. Such linguistic structures in natural-language sentences (b) require context-sensitive and context-free grammars, respectively. Crossed dependencies: "Bill, Alice, and Ted are a cook, a chef, and a dishwasher respectively." Nested dependencies: "The reaction the enzyme the gene encoded catalyzed stopped."

Automata, grammars, and ribosomes

In formal language theory, grammar is viewed as a generator of language. Its purpose is to take a string of symbols and produce a new string
in an explicit, rule-governed, and
thus essentially mechanical process.
Consequently, it becomes possible
to envision a grammar as a mechanical device, a virtual or imaginary
machine, called an automaton. This
intimate relationship between grammars and automata has allowed computer scientists to use the often more
easily intuited behavior of machines
to solve linguistic questions. Describing an automaton is equivalent to
describing a grammar and thus defining a language (Kain 1972).
An automaton can be considered
a black-box computer, called a finite controller, that is fed a sequence
of symbols on a tape. The tape contains a linear array of boxes, or
cells, with each cell containing a
symbol from the language's alphabet. In the beginning, the automaton is in the start state. As the tape
is read, symbol by symbol, the state
of the automaton may change, because each new state is uniquely
determined by the input symbol and
the current state. In some states an
indicator light goes on, in others it
does not. When all of the tape has
been read, the machine stops. If the
light is on, the tape has been accepted, and the language is declared
recognized.
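The tape-and-controller picture translates almost directly into code. The Python sketch below implements a toy automaton over the DNA alphabet whose light is on exactly when the sequence read so far ends in AT; the states and the accepted pattern are arbitrary choices made for illustration.

# A minimal finite automaton: a transition table, a start state, and a
# set of accepting states (those in which the indicator light is on).
TRANSITIONS = {
    ("start", "A"): "sawA",  ("start", "T"): "start",
    ("start", "G"): "start", ("start", "C"): "start",
    ("sawA", "A"): "sawA",   ("sawA", "T"): "sawAT",
    ("sawA", "G"): "start",  ("sawA", "C"): "start",
    ("sawAT", "A"): "sawA",  ("sawAT", "T"): "start",
    ("sawAT", "G"): "start", ("sawAT", "C"): "start",
}
ACCEPTING = {"sawAT"}

def accepts(tape):
    state = "start"              # the automaton begins in the start state
    for symbol in tape:          # read the tape, symbol by symbol
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING    # is the light on when the tape ends?

print(accepts("GCAT"))   # True
print(accepts("GCATG"))  # False

Writing down this table is writing down a (regular) grammar: the strings the machine accepts are exactly the language it defines.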
It is easy to imagine different
variations of this basic design. Input
tapes may be finite or infinite in
length. The reading head, which
scans the sequences of the tape into
the finite controller, may or may not
also have the ability to erase or
write on the tape. The tape may
move to the left, or it may be able to
move both right and left. There could
be more than one tape, or more than
one reading head. Different ways of
constructing automata allow them
to carry out computations of varying complexity and create more powerful or less powerful grammars.
These grammars are defined hierarchically, ranging from the least powerful regular grammars through the context-free and context-sensitive grammars to the most powerful phrase-structure grammars, embodied by Turing machines.
The Turing machine is the virtual, universal computer that can
recognize all languages. The reading head of a Turing machine can
write on the tape, which makes it
possible to create a new string of
symbols called the output tape.
Yockey (1992) asserts that the logic
of the Turing machine is isomorphic
to that of the cell's genetic information system. Here, DNA is the input
tape, protein the output tape. The
ribosome acts as the reading head,
while the internal states of the controller are the tRNA, mRNA, and
enzymes involved in protein synthesis. This isomorphism between a
Turing machine and the ribosome
illustrates one way in which formal
language theory moves molecular
linguistics beyond metaphor to identity.
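A deliberately crude sketch of this reading-head picture can be written down, with DNA standing in directly for the messenger RNA intermediate and only five of the 64 codons included; it is offered as an illustration of the tape metaphor, not of Yockey's formal argument.

# DNA as input tape, a reading head advancing codon by codon, protein
# as output tape. The codon table fragment below is illustrative only.
GENETIC_CODE = {"ATG": "M", "AAA": "K", "GAA": "E", "TTT": "F", "TAA": "*"}

def translate(dna_tape):
    protein_tape = []
    for i in range(0, len(dna_tape) - 2, 3):   # the head reads one codon per step
        residue = GENETIC_CODE.get(dna_tape[i:i + 3], "X")
        if residue == "*":                     # a stop codon halts the machine
            break
        protein_tape.append(residue)
    return "".join(protein_tape)

print(translate("ATGAAAGAATTTTAA"))  # -> MKEF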
The linguistic sophistication
of biomolecules
If one formally accepts that biomolecules are languages, are they
closer in design to artificial or natural languages? In most natural languages there are long-distance dependencies; some element of the
sentence is constrained by features
of another part. Consider, for example, the sentence: The reaction
the enzyme the gene encoded catalyzed stopped. This sentence is structured in nested dependencies that
can, in principle, be any number of
layers deep. Although multiple layers make the sentence difficult to
understand, it is grammatically correct. Its context-free grammar requires the computational machinery of a pushdown stack automaton.
Expecting a sentence to conform to
the common rule, S → NP, VP, we read
NP, The reaction. But when there is
no VP immediately following, we
move the reaction to the pushdown
stack. (A pushdown stack automaton receives and stores symbols just
like the spring-loaded device for
stacking dishes in a cafeteria. The
first plate into the stack is pushed to
the bottom as additional plates are
added-the first symbol in is the last
taken out and computed.) The NP
the enzyme meets a similar fate, and
goes into the stack on top of the
reaction. Finally, the gene fulfills
expectations and is followed by the
VP encoded. Now freed, the processor returns to the stack. The enzyme
is paired with catalyzed, and the
reaction is then paired with stopped.
This set of nested dependencies is
similar to the set of interactions created by hydrogen and polar bonding between residues in the tertiary
structure of folded proteins (Figure
4). That a pushdown stack automaton recognizes nested dependencies
means that at least a grammar with
context-free sophistication is required to capture this particular feature of protein structure.
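The stack bookkeeping just described is compact enough to write out. The sketch below replays the example sentence only; it is not a general parser, and the phrase lists are supplied by hand rather than discovered.

# Nested dependencies resolved with a pushdown stack: noun phrases are
# pushed as they arrive, and each verb pairs with the most recently
# stacked noun phrase (last in, first out).
noun_phrases = ["the reaction", "the enzyme", "the gene"]
verbs = ["encoded", "catalyzed", "stopped"]

stack = []
for np in noun_phrases:
    stack.append(np)       # no verb phrase yet: push and keep reading

for verb in verbs:
    print(stack.pop(), "<->", verb)
# the gene <-> encoded
# the enzyme <-> catalyzed
# the reaction <-> stopped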
Figure 4 also illustrates the possibility of having crossed dependencies within proteins. The sentence,
"Bill, Alice, and Ted are a cook, a
chef, and a dishwasher respectively,"
is a construct that cannot be handled
by context-free grammars. A more
powerful, context-sensitive grammar
is required.
Tandem and direct repeats and
RNA pseudoknots with nested and
crossed dependencies require at least
a context-free grammar for expression. Searls (1993) showed that attenuators and double-inverted repeats require a context-sensitive
grammar to reflect their ambiguity
of structure. Ambiguity has long
been recognized as a quality of natural languages. Humans find it acceptable, if not useful or even poetic, to be able to create sentences
like: Everyone has read two books.
The ambiguity as to whether two
books means "any two books" or
"the Bible and Gone with the
Wind" is easily tolerated. The artificial languages of computers cannot
tolerate such ambiguity. The presence of ambiguity in biomolecules
reflects the linguistic sophistication
of biological systems. The linguistic
power required to capture molecular structures arguably appears similar to that required for natural languages.
Applications in
computational biology
Interpreting the genome is actually
a problem in pattern recognition: specifically, recognizing sequence
patterns of genes, regulatory elements, protein folding, and structural determinants. Pattern recognition is a well-researched problem in
computer science applied to such
diverse fields as speech recognition,
medical diagnostics, and remote
sensing. Two different approaches
have been developed: the statistical
and syntactic (Fu 1982).
Statistical pattern recognition
extracts from an image a set of characteristic measurements called features. Consider the problem of trying to distinguish a short text of
English from similar passages of
Romanian, Polish, and Italian when
all the consonants are changed to
"c" and all the vowels are changed
to "v" (Goldenberg and Feurzeig
1987). The problem is not unlike
that of the biologist trying to discriminate between protein families
or coding and noncoding regions. It
can be solved by a statistical description of the language. How so-called vowelish is English? What are
the probabilities of vowels and letters and their combinations? If too
many words end with a vowel, the
language cannot be English, yet it
may be Romanian or Italian. If some
words are a single consonant, the
language is not English, but may be
Polish.
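The exercise is simple to reproduce. In the sketch below, the two sample phrases are invented stand-ins for real corpus data, and the single statistic computed (the fraction of words ending in a vowel) is only one of many that could be used.

def cv_mask(text):
    # Reduce a word to its consonant/vowel skeleton, e.g., "gene" -> "cvcv".
    return "".join("v" if ch in "aeiou" else "c"
                   for ch in text.lower() if ch.isalpha())

def vowel_final_fraction(words):
    masked = [cv_mask(w) for w in words if cv_mask(w)]
    return sum(m.endswith("v") for m in masked) / len(masked)

english = "the cat saw a mouse near the old barn".split()
italian = "il gatto vide un topo vicino alla vecchia casa".split()
print(round(vowel_final_fraction(english), 2))  # lower
print(round(vowel_final_fraction(italian), 2))  # higher

Even on these tiny samples the statistic separates the two languages; this is exactly the kind of shallow signature that statistical discrimination exploits.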
Programs such as GeneMark
(Borodovsky et al. 1994) similarly
analyze statistical properties of DNA
sequences. GeneMark assumes that
frequencies in position "0" depend
on the nucleotide in position
"-1" or positions "-1" and "-2."
The program slides a window along
the sequence in discrete steps and
calculates the probability that the
DNA conforms to either a model of
a coding or noncoding region. Because there are significant differences between frequency models of
coding and noncoding sequences, it
can detect an open reading frame.
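The logic, though none of the statistical detail, can be sketched as follows. The transition probabilities below are invented for illustration; GeneMark's actual models are higher-order, frame-aware Markov chains estimated from annotated genomes.

import math

# Score a sliding window by how well its dinucleotide transitions fit a
# "coding" versus a "noncoding" first-order Markov model (log-odds).
CODING    = {"AA": 0.30, "AT": 0.20, "TA": 0.10, "TT": 0.40}
NONCODING = {"AA": 0.25, "AT": 0.25, "TA": 0.25, "TT": 0.25}

def log_odds(window):
    score = 0.0
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        score += math.log(CODING[pair] / NONCODING[pair])
    return score  # positive scores favor the coding model

sequence = "AATTTTAATATT"
for start in range(0, len(sequence) - 5, 3):  # slide the window in steps of 3
    window = sequence[start:start + 6]
    print(start, window, round(log_odds(window), 2))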
Typically, the analysis of a new
sequence begins with a similarity
search to align the new sequence to
the database (Tatusov et al. 1994).
These programs first find locally
similar regions between two sequences, then they try to extend the
matches. There are many different
ways to measure similarity between
matched residues, all of which create a substitution matrix to compensate for evolutionary distances,
frame shifts, and other factors that
allow related proteins to vary in
sequence. The matrix used and the
values assigned to its variables can
have a great effect on the search
results. Once a significant homology is found, alignment methods are
used to find common structural
motifs associated with particular
protein function. The new gene sequence then has a deduced function.
Statistical approaches do have
drawbacks. Homology programs
assign to each alignment a certain
probability of occurrence. This approach works satisfactorily only
when the frequencies of amino acids
do not differ greatly from the frequencies of the database, a condition that is not always true. Homology searches also work only if
homologies exist. Today, 40%-50%
of new gene products are not homologous to any other protein currently in the database.
The statistical approach also lacks
depth. Recognizing so-called vowelishness can successfully discriminate
a sequence as English, but it obviously reflects a superficial understanding of structural or syntactic
complexities. This superficiality cannot be overcome by simply using
more sophisticated statistics. GeneMark uses sophisticated Markov
chaining models, which simulate
DNA sequences by calculating the
151
probabilities of oligonucleotides
according to correlations between
base frequencies in different positions of the sequence. However,
Pinker (1994) shows how Markov
chaining models are deeply and fundamentally wrong as models of language: sentences are not formed by
simply choosing each successive
word from one of a few lists of words
according to prespecified probabilities. Markov chaining models are formally unable to handle the long-distance, nested, and crossed dependencies illustrated in Figure 4.
Molecular linguistics avoids these
problems by using methods based
on syntactic pattern recognition and
formal language theory. Syntactic
pattern recognition, drawing upon
an analogy with the syntax of language, expresses a pattern as a composition of its subpatterns and pattern primitives, and thus captures
structural complexities. Developing
a hidden Markov model similar to
those used in speech recognition,
Krogh et al. (1994) identify the core
structural elements in families of
homologous proteins. Their method
differs markedly from conventional
global alignment techniques, which
align just two sequences at a time
and use substitution matrices that
assign the same penalty for nonidentical pairs found in all regions
of the sequence. Ideally, differences
would be penalized in the conserved
regions but tolerated in the variable
regions found in all homologous
proteins. Krogh et al. (1994) are
able to allow variable, position-dependent penalties while aligning
multiple sequences simultaneously.
Sakakibara et al. (1994) generalize
on this model to predict tRNA structures. Other molecular linguistic
methods use formal language theory
to analyze the database. For example, using a context-free grammar with added features giving
power similar to a Turing machine,
Searls (1993) has scanned more than
70,000 base pairs in under five minutes, successfully recognizing mouse
and human alpha-like globins while
ignoring pseudogenes.
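The heart of the Krogh et al. improvement, position-dependent scoring, can be illustrated apart from the hidden Markov machinery. In the sketch below the consensus, the conserved-column flags, and the penalty values are all invented; a real profile model estimates such parameters from the aligned family itself.

# Mismatches cost more in conserved columns than in variable ones,
# unlike a single substitution matrix applied uniformly everywhere.
CONSENSUS = "GKTFLHG"
CONSERVED = [True, True, False, False, False, True, True]

def score(candidate, match=1.0, mismatch_conserved=-3.0, mismatch_variable=-0.5):
    total = 0.0
    for a, b, is_conserved in zip(CONSENSUS, candidate, CONSERVED):
        if a == b:
            total += match
        else:
            total += mismatch_conserved if is_conserved else mismatch_variable
    return total

print(score("GKTFLHG"))  # 7.0, a perfect match
print(score("GKSYIHG"))  # 2.5, mismatches confined to variable columns
print(score("AKTFLHA"))  # -1.0, mismatches in conserved columns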
Using the analogy
Disregarding mathematical formalisms, Sereno (1991) uses the anal-
ogy to make predictions about the
neurophysiology of language. He
sees the major advance of both life
and language as the ability to control the assembly of preexisting units
of meaning, rather than the invention of the units themselves. Amino
acids exist as individual units with
general chemical properties. A cell
can polymerize these residues (units
of meaning) into a protein in which
each amino acid takes on new and
specific meanings depending upon
the specific context of each sequence.
Similarly, pygmy chimpanzees have the prelinguistic ability to identify unitary concepts or words referring
to actions, objects, and locations
encountered in life, yet the chimps
seem unable to create sentences of
more than two or three words
(Sereno 1991). Combining words
into long sentences in which they
take on new meanings appears to be
a unique ability of humankind. From
these basics, Sereno (1991) extrapolates the analogy to use in theories
of aphasia, word recognition, sentence assembly, and language comprehension.
The biological-linguistic analogy
invites innovative ways of conceptualizing biology. Genome sequencing projects represent reductionism
taken to its conclusion. Innovative
ways are needed to integrate the
sequence information upwards, to
macroscopic morphology and behavior. Biology has long been reduced to chemistry; now chemistry
has become a symbol-manipulating
process. The difference has consequences, bringing new appreciations
and surprising associations.
A recent review in Nature of the
major evolutionary transitions by
Szathmary and Maynard Smith
(1995) raised such unaccustomed
associations that the authors believed it necessary to defend "writing an article concerning topics as
diverse as the origins of genes, of
cells and language." When viewed
as changes in information processing, there is sufficient formal similarity between the various evolutionary transitions to hope that
progress in understanding any one
step will illuminate the others. The
question then arises: What do the
origin of life and the origin of language have in common?
Why is life linguistic?
Pinker (1994) points out that the
use of discrete combinatorial systems is rare in nature. Geology,
weather, cooking, sound, and light
are blending (and inanimate) systems in which the properties of the
system are a mixture of the elements. Blending systems are relatively circumscribed. The pink created from mixing red and white can
be no redder than red, no whiter
than white. Pinker (1994) suggests
that life and mind, the two systems most impressive for their open-ended design, necessarily required the creative power of a discrete combinatorial system.
Sereno (1991) argues that the cell-person analogy stems from the fact
that the origin of life and the origin
of language solved the same problem: how to escape determinism.
The prebiotic and prelinguistic states
were both complex, interacting systems, or so-called soups, that evolved
deterministically. The step to a new,
more intentional system of preferentially selected reactions required the ability to encode, use, and reproduce information. Information
had to be protected from the dissipative attack of the soup; it had to
be in essence hidden, yet remain
available enough for use. The solution was to create a symbol-based
system. What better way to hide
while remaining available than to
use a symbol, "that which stands
for something else" (Webster's Third
International Dictionary 1986).
These new symbol systems had to
escape design constraints common
to both prebiotic and prelinguistic
states. Each needed specific reaction-controlling devices to simultaneously control many different reactions. These controlling devices
had to be built from stable, concatenated units and assembled locally,
one at a time, in a serial fashion.
It is easy to recognize how replicators (Dawkins 1989), or so-called
living molecules such as the hypothetical RNA replicase of an RNA
world, fulfill these criteria. It is perhaps less apparent how the neural
networks responsible for language
do also. Arguably, the ontogeny of
language in the child may recapitulate the emergence of language from
the prelinguistic soup. During development, undifferentiated, diffuse,
short-range neural networks are
pruned back and replaced by a
smaller set of less diffuse, long-distance connections. Alternatively, the
parallel event may be the differentiation of the common neural substrate supporting both manual manipulation of objects and linguistic
functions seen as the child matures
and acquires language (Greenfield
1991).
The cell-person analogy, with its
emphasis on the symbolic representation systems of both the cell and
the human brain, reflects the terminology used to describe complex
adaptive systems. The origin of life
and the origin of language were two
emergent events in a hierarchy of
evolving complex systems ranging
from metabolism to economies. Such
systems routinely develop an internalized representation of their external environment. By filtering out
environmental noise and compressing perceived regularities, complex
adaptive systems create schema.
These schema are then tested for
fitness by selective pressures (Cowan
et al. 1994). A better understanding
of the principles governing complex
adaptive systems may reveal features shared with other systems as
well as the crucial differences that
caused a discrete combinatorial system to emerge only twice: at the
origin of life and the origin of language.
Acknowledgments
I thank W. William Walthall and
Duane M. Rumbaugh for their comments and am especially indebted to
Evelyn R. Strauss for her valuable
criticisms of early drafts of this
manuscript. I am also grateful to
Marc Weissburg and five anonymous reviewers whose criticisms
made the final version a better article.
References cited

Borodovsky M, Koonin EV, Rudd KE. 1994. New genes in old sequence: a strategy for finding genes in the bacterial genome. Trends in Biochemical Sciences 19: 309-313.

Campbell J. 1982. Grammatical man: information, entropy, language and life. New York: Simon and Schuster.

Chomsky N. 1972. Syntactic structures. The Hague (the Netherlands): Mouton & Co.

Collado-Vides J. 1991. A syntactic representation of the units of genetic information: a syntax of units of genetic information. Journal of Theoretical Biology 128: 401-429.

Cowan GA, Pines D, Meltzer D, eds. 1994. Complexity: metaphors, models, and reality. Reading (MA): Addison-Wesley Publishing Co.

Crick FHC. 1959. The present position of the coding problem. Structure and Function of Genetic Elements: Brookhaven Symposium on Biology 12: 35-37. Upton (NY): Brookhaven National Laboratory.

Darwin C. 1859. On the origin of species, a facsimile of the first edition. Cambridge (MA): Harvard University Press.

Dawkins R. 1989. The selfish gene. Oxford (UK): Oxford University Press.

Fu KS. 1982. Syntactic pattern recognition and applications. Englewood Cliffs (NJ): Prentice-Hall.

Gamov G. 1954. Possible relation between deoxyribonucleic acid structure and protein structure. Nature 173: 318.

Goldenberg PE, Feurzeig W. 1987. Exploring language with Logo. Cambridge (MA): MIT Press.

Greenfield PM. 1991. Language, tools and brain: the ontogeny and phylogeny of hierarchically organized sequential behavior. Behavioral and Brain Sciences 14: 531-595.

Hopcroft JE, Ullman JD. 1979. Introduction to automata theory, languages and computation. Reading (MA): Addison-Wesley.

Jerne N. 1988. The generative grammar of the immune system. Science 229: 1057-1059.

Jiménez-Montaño MA. 1989. Formal languages and theoretical molecular biology. Pages 199-210 in Goodwin B, Saunders P, eds. Theoretical biology: epigenetic and evolutionary order from complex systems. Edinburgh (UK): Edinburgh University Press.

Kain RY. 1972. Automata theory: machines and languages. New York: McGraw-Hill.

Keller EF. 1983. A feeling for the organism: the life and times of Barbara McClintock. San Francisco (CA): W.H. Freeman.

King J. 1989. Deciphering the rules of protein folding. Chemical Engineering News 67(15): 32-54.

Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. 1994. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology 235: 1501-1531.

Küppers B-O. 1990. Information and the origin of life. Cambridge (MA): MIT Press.

Lathrop R, Webster T, Smith R, Winston P, Smith T. 1993. Integrating AI with sequence analysis. Pages 212-258 in Hunter L, ed. Artificial intelligence and molecular biology. Menlo Park (CA): AAAI Press.

Lyons J. 1977. Semantics. Vol. 1. Cambridge (UK): Cambridge University Press.

Pinker S. 1994. The language instinct. New York: William Morrow.

Radford A. 1981. Transformational syntax: a student's guide to Chomsky's extended standard theory. Cambridge (UK): Cambridge University Press.

Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, Haussler D. 1994. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research 22: 5112-5120.

Saltus R. 1994. Second culprit gene in inherited rectal cancer identified. Boston Globe March 17: 7.

Searls DB. 1993. The computational linguistics of biological sequences. Pages 47-120 in Hunter L, ed. Artificial intelligence and molecular biology. Cambridge (MA): MIT Press.

Sereno MI. 1991. Four analogies between biological and cultural/linguistic evolution. Journal of Theoretical Biology 151: 467-507.

Szathmáry E, Maynard Smith J. 1995. The major evolutionary transitions. Nature 374: 227-232.

Tatusov RL, Altschul SF, Koonin EV. 1994. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Sciences of the United States of America 91: 12091-12095.

Wilkins WK, Wakefield J. 1995. Brain evolution and neurolinguistic preconditions. Behavioral and Brain Sciences 18: 161-226.

Yockey HP. 1992. Information theory and molecular biology. Cambridge (UK): Cambridge University Press.
Patricia Bralley is a doctoral candidate in molecular genetics in the Department of Biology at Georgia State University, Atlanta, GA 30303. She is researching control of the lysis-lysogeny decision in phage P1. She has published several literary essays on the nature of life and consciousness. © 1996 American Institute of Biological Sciences.