Thinking of Biology
An introduction to molecular linguistics

by Patricia Bralley
Biology is filled with linguistic
metaphors. The punctuated
yet commaless genetic code is
transcribed and translated. Gene
transcripts are edited, and cells are
said to communicate. Recently the
newspapers carried the story of the
discovery of a gene, a "molecular
spell checker," whose malfunction
causes colon cancer (Saltus 1994).
Immunologist Niels Jerne in his 1984
Nobel Prize acceptance speech suggested that the variable region of an
antibody might correspond to a sentence or phrase and the vast repertoire of immune response might be
considered a vocabulary of sentences
(Jerne 1988). Why are there so many
metaphors? Is there more here than
analogy?
This article addresses two questions: How deeply does the biological-linguistic analogy run, and to
what uses can it be put today? One
discovers that historically the analogy keeps reemerging: from Darwin's (1859) comparison of a species to a natural language to Gamov's
(1954) suggestion that the synthesis
of proteins from DNA be considered a problem in coding. To Gamov,
DNA was a long, four-digit number that had been translated into
words spelled from the 20-letter
amino acid alphabet. Crick (1959)
viewed the genetic coding problem
as that of "translating one language
to another." The idea of the analogical use of language, letters, and
translation, which may now seem
useful but hardly profound, was
stimulating and controversial when
first proposed. It formed the basis of
the theory driving early experiments
deciphering the genetic code. Today, techniques from formal language theory are used to analyze the
data from genome sequencing
projects, while the analogy suggests
that new ways of understanding the
evolution of complex systems await
discovery.
Analogy with natural language
The oldest biological-linguistic analogy can be traced to Darwin's On
the Origin of Species (Darwin 1859).
Darwin suggested that a species is
like a language: They both evolve
through time and under geographical constraints, and they both undergo a process of evolution based
on a low level of constant change.
Darwin even compared vestigial organs to the unpronounced letters in
words.
The data from molecular biology
have suggested a second form of the
analogy: The cell's molecules correspond to different objects found in
natural languages. Küppers (1990)
delineates possible correspondences:
a nucleotide corresponds to a letter,
a codon to either a phoneme (the
smallest unit of sound) or a morpheme (the smallest unit of meaning), a gene to a word or simple
sentence, an operon to a complex
sentence, a replicon to a paragraph,
and a chromosome to a chapter. The
genome becomes a complete text.
Küppers (1990) emphasizes the
thoroughness of the mapping and
notes that it presents a hierarchical
organization of symbols. Like human language, molecular language
possesses a syntax. Just as the syntax of natural language imposes a
grammatical structure that allows
words to relate to one another in
only specific ways, biological symbols combine in a specific structural
manner.
A closer examination of the
molecular-linguistics analogy is
achieved by considering four essential design features (Lyons 1977)
that enable language to function as
a signaling, or semiotic, system: discreteness, arbitrariness, duality, and
productivity. Jiménez-Montaño
(1989) examines how biological
molecules satisfy several of these
criteria. First, language is composed
of discrete and arbitrary elements: phonemes (in speech), alphabet letters (in writing), or discrete motions
(in sign language). These discrete
elements are linearly arrayed and
numerable, as are the bases of DNA
or the amino acids of a protein.
Arbitrariness refers to the fact that
these basic units have no a priori
meaning. A phoneme such as th no
more suggests the meaning of a word
than a single codon suggests the function of the protein for which it codes.
The fact that phonemes have no
inherent meaning yet are noninterchangeable (just as within a gene a
thymine cannot be replaced by a
cytosine without risk of fatal mutation) gives rise to the linguistic property of duality. Duality uses two
discrete combinatorial systems: One
combines meaningless sounds into
meaningful morphemes, while the
second combines meaningful morphemes into words and ultimately
sentences. Because each discrete
combinatorial system combines a
finite set of elements into any number of larger structures, duality is an
economical and powerful way to
produce an infinity of meaningful
forms from a few elements. It is a
strategy also used by the cell: four nucleotides combine into 64 codons;
codons combine into many different
genes.
The power of duality finds its
expression in productivity or creativity. Human linguistic competence involves the ability to create
an infinite variety of sentences.
Chomsky emphasized the creative
nature of language when formulating his theory of generative grammar (Chomsky in Radford 1981).
Similarly, the biological diversity of
living systems presents a seemingly
inexhaustible number of possibilities. Or, as Nobel laureate and geneticist Barbara McClintock said,
"Anything you can think of [in biology J you will find" (McClintock
qunted in Keller 1983, p. 199).
In addition to discreteness, arbitrariness, duality, and productivity,
all natural languages require redundancy. A design feature engineered
into any reliable communication
system, redundancy is found in cells
and languages at many levels. Nearly
every passage of prose contains
words that could be left out without
affecting comprehension. Words too
can be recognized even when letters
are deleted (Campbell 1982). This
redundancy is a safeguard against
errors. It allows parts of a message
to be statistically related and thus
assures that a message gets through,
despite inevitable production errors,
such as mispronunciations, slips of
the tongue, incorrect grammar, or
speaking too softly or quickly to be
heard. Similarly, redundancy in the
genetic code protects against genetic
errors. The redundancy of having
61 codons translate to only 20 amino
acids helps minimize the effects of
deleterious mutations. Changes in
the third position of a codon often
give rise to the same or chemically
similar amino acid.
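The arithmetic of this buffering is easy to check directly. The short Python sketch below uses only a fragment of the standard genetic code (the full table has 64 entries) to list what happens to two codons when their third position mutates.

# A fragment of the standard genetic code (RNA codons), enough to
# illustrate third-position redundancy.
CODON_TABLE = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GAU": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
}

def third_position_mutants(codon):
    # All codons differing from the given codon only in the third position.
    return [codon[:2] + base for base in "ACGU" if base != codon[2]]

for codon in ("GCA", "GAU"):
    outcomes = {CODON_TABLE[m] for m in third_position_mutants(codon)}
    print(codon, CODON_TABLE[codon], "->", outcomes)

For the fourfold-degenerate alanine codon GCA, every third-position change is silent; for GAU, the possible changes yield either aspartate again or the chemically similar glutamate.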
Figure 1. A parsing of the sentence "The cat saw a mouse" reveals the hierarchical organization and syntactical components: NP, noun phrase; VP, verb phrase; Det, determiner; N, noun; V, verb; and Art, article. Parse trees can also be viewed as representing a dynamic process of generating a number of similarly constructed sentences from a small number of rules.

All natural languages also possess a grammar: rules for combining phonemes, morphemes, words, and phrases. Syntax establishes "legal" structures; semantics establishes meaning. Chomsky's (1972, p. 14) famous sentence, "Colorless green ideas sleep furiously," while syntactically correct, fails on the semantic
level. Phonological rules allow that
clonch, while not an actual English
word, seems more acceptable than
grzyb (Polish for mushroom). The
rules of grammar ensure that if words
are randomly chosen and strung together, they will probably not form
a legal sentence. Similarly, polymerizing randomly chosen nucleotides
or amino acids fails to produce a
functional gene or protein.
The fact that not all combinations of phonemes or words occur
allows definition of larger elements
such as words and sentences. The
general goal of structural linguistics, distinguishing how differences in syntactic structure reflect differences in meaning, addresses the same problems facing molecular biologists. What are the grammatical rules for forming protein structures such as an α-helix, a β-ribbon, or a three-dimensional conformation?
Where are the meaningful regularities in a sequence of an operator,
enhancer, gene, or operon?
Searching for the structural
grammar of amino acids
Which amino acids in a linear sequence of a protein determine the
final conformation? The problem,
which King (1989) called the
"search for the grammar of amino
acids," is one area suited to linguistic techniques.
Proteins are hierarchical in structure. They have a linear primary
structure, which determines the secondary structures of helices, turns,
and sheets. These secondary structures fold into a three-dimensional
tertiary structure-the functional
protein.
Language is also hierarchical.
While words are linearly arrayed
into sentences, they are not strung
together like beads on a string. There
are syntactic frames, analogous to
secondary helices and sheets. Words
or whole phrases are assembled to
create a sentence: Toby/ran lazily/
down the street. Additionally, nonadjacent words are dependent on
each other. Consider: The lioness
hurt herself. The word lioness determines the selection of herself
rather than himself.
The protein folding problem suggests the necessity of finding the
molecular equivalent of linguistic
phrases and dependencies. Linguistically, this problem is solved by the
process of parsing, the orderly and
systematic arranging of phrases.
Parsing is visualized in a tree diagram (figure 1), which shows the
hierarchical organization of the sentence (S) into its component phrases: noun phrase (NP) and verb phrase (VP), composed of article (Art), determiner (Det), noun (N), or verb (V).

Figure 2. A partial parsing of the Escherichia coli protein isoleucyl-tRNA synthetase (residues 48-99) comprising the mononucleotide binding fold (MBF) subdomain. Parsing proceeds from the primary sequence of amino acids (represented by their single-letter abbreviations), to secondary structures of helix, turn, and β-strand, to the final tertiary structure of the MBF subdomain (Lathrop et al. 1993).
Tree diagrams can also be viewed
as representing a dynamic process, a
way of generating a large number of
sentences with similar structures
from a small number of rules. One
rule might be that a sentence consists of a noun phrase and a verb
phrase. There are also rules for the
substructures of phrases. A noun
phrase might be composed of a noun
and an article or of another noun
phrase and an adjectival phrase.
These rules of composition are what
the biologist needs to define for protein folding.
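The generative reading of a parse tree is concrete enough to execute. The Python sketch below implements a toy grammar in the spirit of Figure 1; the particular rules and vocabulary are invented for illustration, not taken from any published grammar.

import random

# A toy generative grammar: nonterminals (S, NP, VP, ...) are rewritten
# until only terminal words remain.
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"]],
    "VP":  [["V", "NP"]],
    "Art": [["the"], ["a"]],
    "N":   [["cat"], ["mouse"], ["dog"]],
    "V":   [["saw"], ["chased"]],
}

def generate(symbol="S"):
    # Expand a symbol by recursively applying a randomly chosen rewrite rule.
    if symbol not in RULES:  # a terminal word is emitted as-is
        return [symbol]
    expansion = random.choice(RULES[symbol])
    return [word for part in expansion for word in generate(part)]

for _ in range(3):
    print(" ".join(generate()))  # e.g., "the cat chased a mouse"

A handful of rules thus generates dozens of similarly constructed sentences, which is the dynamic process the tree diagram summarizes.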
Figure 2 shows a typical protein
parsing done on the mononucleotide
binding fold of isoleucyl-tRNA synthetase from Escherichia coli
(Lathrop et al. 1993). The analysis
is used within the artificial intelligence program ARIADNE to predict
protein structure.
The generative grammar of a
genetic sentence
Collado-Vides (1991) uses a generative grammar model to analyze gene
regulation. Here rules are applied
recursively to generate the complete
sentence. His so-called sentences are
various transcription units such as
the lac operon of E. coli. Lexical
categories (e.g., noun and verb) are
created by promoters, operators, and
structural genes. A single parse tree
may generate a number of different
sentences depending on the actual
DNA sequence in each lexical category. For example, the promoter
may be either the lac promoter, the
ara operon promoter, or a phage
promoter.
A chain of lexical units is grammatically correct if RNA polymerase
can read a translatable sentence such
as "promoter, start codon, gene, stop
codon," rather than a garbled "stop
codon, start codon, gene, promoter."
However, this regulatability is not
synonymous with physiological usefulness. A gene for tryptophan synthesis regulated by a lactose-regulated operator and promoter makes
no sense physiologically and is analogous to a sentence being syntactically correct but semantically incorrect. Thus physiological usefulness
is likely to require another grammatical component corresponding to the
semantic.
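The syntactic half of such a check can be caricatured with a regular expression, although a regular grammar is far weaker than the generative grammar Collado-Vides actually uses; the category names below are illustrative placeholders, not his notation.

import re

# A toy syntax check for a transcription unit: the "sentence" is legal
# only if its lexical categories appear in a readable order.
LEGAL_SENTENCE = re.compile(r"^promoter (operator )?start_codon gene stop_codon$")

def grammatical(units):
    return bool(LEGAL_SENTENCE.match(" ".join(units)))

print(grammatical(["promoter", "start_codon", "gene", "stop_codon"]))  # True
print(grammatical(["stop_codon", "start_codon", "gene", "promoter"]))  # False

As the text notes, such a check captures only syntax; nothing in it would flag the physiologically senseless pairing of a tryptophan gene with a lactose-regulated promoter.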
Assessing the analogy
Most biologists coming to an understanding of molecular linguistics
probably do so by making intuitive
analogies with natural languages.
However, lay definitions and intuitive understandings can mislead
technical discussions. None of the
analogies presented here agrees on
correspondences. Crick (1959) uses
DNA bases as letters, then translates
codons into amino acids, the so-called letters for Gamov's (1954) protein words. Küppers's (1990) so-called word is a gene (DNA), and the
cell is a book, not a person as Sereno
(1991) suggests. Collado-Vides
(1991) has DNA (promoters, operators) functioning as a word, a gene
as a sentence. Jiménez-Montaño's
(1989) analysis of redundancy actually generalizes from information
theory, not language.
Language is not equivalent to
communication or speech (Wilkins
and Wakefield 1995). Spelling, letter frequencies, and diphthongal
associations of letters are not parts
of grammar, because writing systems are artifacts distinct from language (Pinker 1994). To critically assess the correctness and depth of
the biological-linguistic analogy, one
needs to decompose and analyze it
for parallel relationships between
the hierarchically organized biological and linguistic components.
Sereno (1991), in a thoughtful
critique, schematizes prominent
forms the analogy has historically
taken (Figure 3). For instance, evolutionary epistemologists compare
a species to a scientific discipline
and the evolution of an organism to
the evolution of concepts (the so-called organism-concept analogy);
this analogy connects three different hierarchical levels in an inconsistent manner (Figure 3).
Darwin's species-language analogy holds
on two levels. The population-level
alignment, in which a species corresponds to a language, implies that
an organism is analogous to a
speaker of a language. Thus, an
organism's embryological development corresponds to a speaker's
development of linguistic skills, and
heritability corresponds to transmittal of language to offspring. However, there are some places where
the analogy breaks down. Spoken
languages (except for scientific jargons) do not become more adapted
with time. Middle English was just
as effective a medium as modern
English (Sereno 1991). Lower levels
BioScience Vol. 46 No.2
of correspondences become less clear
also. In a cell it is the DNA sequence
of the gene that mutates. In language the set of phonemes changes
with time, as well as the set of morphemes and the meanings of words
themselves. It is not clear which of
these changes corresponds to the
nucleotide, nor is it clear what is equivalent to the cell in a language speaker.
Sequence differences are dealt
with in a more consistent manner by
aligning a cell to the conceptual system of a person (the so-called cell-person analogy). New DNA sequences produced by mutation
correspond to novel combinations
of phonemes heard from day to day.
The cell-person analogy emphasizes
parallels between the symbolic-representation systems of both cell and
person with detailed correspondences. For example, on the molecular level, mRNA corresponds to
the activity pattern of the secondary
auditory cortex caused by the sounds
of several sentences. Protein folding
corresponds to the comprehension
of word-sound sequences (see Table
1). The analogy relates three levels
of organization in the most consistent manner: An organism corresponds to a community, a cell to a
person, and biomolecules to neural
network firing patterns (Figure 3).
This analogy holds, however, only for
language comprehension, not production. It is as if the cell has the ability
to listen to and comprehend its own
internal chatter. To produce language (as a speaker does) the cell
would have to unfold a protein, turn
it into DNA, and inject that DNA
into another cell.
Figure 3. A comparison of three different forms of the biological-linguistic analogy (the organism-concept, species-language, and cell-person) across different levels of organization. The organism-concept analogy maps correspondences inconsistently to different levels. In the species-language analogy, objects in one system map to objects of approximately the same size in the other system. In the cell-person analogy, small biological objects map to larger, linguistic/cultural objects.

Table 1. Summary of the cell-person analogy.

Cell | Person
DNA nucleotide | Phoneme
DNA codon | Sounds in a word
RNA nucleotide | Auditory cortex activity for phoneme
mRNA | Auditory cortex activity for sounds of several sentences
Ribosome | Auditory cortex activity that assembles unit meaning patterns into chain
Amino acid | A word meaning caused by activity in secondary visual cortex
Functional domain of an enzyme | Structure in short-term memory upon hearing discourse of clauses
Enzyme substrates (e.g., amino acids, proteins, and carbohydrates) | Mental "objects" (e.g., single words, discourses, emotions, and images)
Substances of prebiotic soup | Prelinguistic firing patterns in primate brain

Formal language theory

A mathematically precise method for dealing with languages is provided by formal language theory. Formal language theory shows that biomolecules can actually be considered languages and provides methods for dealing with current problems in computational biology, creating the new discipline of molecular linguistics.

Chomsky introduced formal language theory in the 1950s to explain the productivity of natural languages. However, its relevance to the artificial languages of computer science was quickly realized. Today, many linguists consider both natural and artificial languages as the output of formal systems comprising sets of objects (words), axioms, and rules for changing sequences of words into other words. Formal language theory has an abstract similarity to processes of logic and deals in theorems, proofs, and hypotheses. It is at this most mathematical level that DNA, RNA, and proteins appear most languagelike. The question then becomes: What is the nature of this language? Is it more like an artificial computer language or more like natural human languages?

A language, as defined by formal language theory, has three components (Hopcroft and Ullman 1979). First, there is an alphabet composed of a set of indivisible elements called terminal symbols. Terminal refers to the fact that these symbols are allowed in the final form of a sentence. This alphabet might be composed of 2 symbols (0, 1), 4 letters (A, C, G, T), 20 amino acids, 26 letters, or a lexicon of English words. Second, there is a set of nonterminal symbols that are used in the production rules of grammar. These nonterminals are intermediate constructs such as the noun phrase and verb phrase seen earlier in the parse trees of sentences. Among the nonterminals is also a start symbol, S, which stands for sentence. Third, there is a set of rules, called rewrite rules, for combining these symbols into strings (sentences) and deciding which strings are legal.
Thus the start symbol, S, could form the left side of the grammatical rule S → NP, V (which means "The nonterminal, S, can be rewritten with the nonterminals NP and V"). Other rewriting rules might be:

N → cat
V → frisked
NP → Art, N
Art → the

such that the nonterminals are eventually rewritten with terminal symbols to produce: The cat frisked. Together, the nonterminals and rules make up a grammar. A formal language is thus a subset of legal strings produced from all the possible combinations of symbols in an alphabet. Depending upon the alphabet, these strings may constitute an English sentence, a protein sequence, a gene sequence, or simply 000110011000111. The resulting language may appear to be natural or decidedly artificial.

Figure 4. The long-distance dependencies of a protein suggest that the linguistic sophistication of biomolecules is comparable to that of natural languages. Amino acid residues interacting through polar and hydrogen bonding establish the tertiary structure of a protein (a). If the protein is linearized, these interactions are revealed as crossed (A, B, C) and nested (D, E, F) dependencies. Such linguistic structures in natural-language sentences (b) require context-sensitive and context-free grammars, respectively. Crossed dependencies: "Bill, Alice, and Ted are a cook, a chef, and a dishwasher respectively." Nested dependencies: "The reaction the enzyme the gene encoded catalyzed stopped."

Automata, grammars, and ribosomes

In formal language theory, grammar is viewed as a generator of language. Its purpose is to take a string of symbols and produce a new string
in an explicit, rule-governed, and
thus essentially mechanical process.
Consequently, it becomes possible
to envision a grammar as a mechanical device, a virtual or imaginary
machine, called an automaton. This
intimate relationship between grammars and automata has allowed computer scientists to use the often more
easily intuited behavior of machines
to solve linguistic questions. Describing an automaton is equivalent to
describing a grammar and thus defining a language (Kain 1972).
An automaton can be considered
a black-box computer, called a finite controller, that is fed a sequence
of symbols on a tape. The tape contains a linear array of boxes, or
cells, with each cell containing a
symbol from the language's alphabet. In the beginning, the automaton is in the start state. As the tape
is read, symbol by symbol, the state
of the automaton may change, because each new state is uniquely
determined by the input symbol and
the current state. In some states an
indicator light goes on, in others it
does not. When all of the tape has
been read, the machine stops. If the
light is on, the tape has been accepted, and the language is declared
recognized.
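The tape-and-controller picture translates almost directly into code. The Python sketch below implements a toy automaton over the DNA alphabet whose light is on exactly when the sequence read so far ends in AT; the states and the accepted pattern are arbitrary choices made for illustration.

# A minimal finite automaton: a transition table, a start state, and a
# set of accepting states (those in which the indicator light is on).
TRANSITIONS = {
    ("start", "A"): "sawA",  ("start", "T"): "start",
    ("start", "G"): "start", ("start", "C"): "start",
    ("sawA", "A"): "sawA",   ("sawA", "T"): "sawAT",
    ("sawA", "G"): "start",  ("sawA", "C"): "start",
    ("sawAT", "A"): "sawA",  ("sawAT", "T"): "start",
    ("sawAT", "G"): "start", ("sawAT", "C"): "start",
}
ACCEPTING = {"sawAT"}

def accepts(tape):
    state = "start"              # the automaton begins in the start state
    for symbol in tape:          # read the tape, symbol by symbol
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING    # is the light on when the tape ends?

print(accepts("GCAT"))   # True
print(accepts("GCATG"))  # False

Writing down this table is writing down a (regular) grammar: the strings the machine accepts are exactly the language it defines.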
It is easy to imagine different
variations of this basic design. Input
tapes may be finite or infinite in
length. The reading head, which
scans the sequences of the tape into
the finite controller, may or may not
also have the ability to erase or
write on the tape. The tape may
move to the left, or it may be able to
move both right and left. There could
be more than one tape, or more than
one reading head. Different ways of
constructing automata allow them
to carry out computations of varying complexity and create more powerful or less powerful grammars.
These grammars are defined hierarchically, ranging from the least powerful regular grammars through the context-free and context-sensitive grammars to the most powerful phrase-structure grammars, embodied by Turing machines.
The Turing machine is the virtual, universal computer that can
recognize all languages. The reading head of a Turing machine can
write on the tape, which makes it
possible to create a new string of
symbols called the output tape.
Yockey (1992) asserts that the logic
of the Turing machine is isomorphic
to that of the cell's genetic information system. Here, DNA is the input
tape, protein the output tape. The
ribosome acts as the reading head,
while the internal states of the controller are the tRNA, mRNA, and
enzymes involved in protein synthesis. This isomorphism between a
Turing machine and the ribosome
illustrates one way in which formal
language theory moves molecular
linguistics beyond metaphor to identity.
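A deliberately crude sketch of this reading-head picture can be written down, with DNA standing in directly for the messenger RNA intermediate and only five of the 64 codons included; it is offered as an illustration of the tape metaphor, not of Yockey's formal argument.

# DNA as input tape, a reading head advancing codon by codon, protein
# as output tape. The codon table fragment below is illustrative only.
GENETIC_CODE = {"ATG": "M", "AAA": "K", "GAA": "E", "TTT": "F", "TAA": "*"}

def translate(dna_tape):
    protein_tape = []
    for i in range(0, len(dna_tape) - 2, 3):   # the head reads one codon per step
        residue = GENETIC_CODE.get(dna_tape[i:i + 3], "X")
        if residue == "*":                     # a stop codon halts the machine
            break
        protein_tape.append(residue)
    return "".join(protein_tape)

print(translate("ATGAAAGAATTTTAA"))  # -> MKEF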
The linguistic sophistication
of biomolecules
If one formally accepts that biomolecules are languages, are they
closer in design to artificial or natural languages? In most natural languages there are long-distance dependencies; some element of the
sentence is constrained by features
of another part. Consider, for example, the sentence: The reaction
the enzyme the gene encoded catalyzed stopped. This sentence is structured in nested dependencies that
can, in principle, be any number of
layers deep. Although multiple layers make the sentence difficult to
understand, it is grammatically correct. Its context-free grammar requires the computational machinery of a pushdown stack automaton.
Expecting a sentence to conform to
the common rule, S → NP, VP, we read
NP, The reaction. But when there is
no VP immediately following, we
move the reaction to the pushdown
stack. (A pushdown stack automaton receives and stores symbols just
like the spring-loaded device for
stacking dishes in a cafeteria. The
first plate into the stack is pushed to
the bottom as additional plates are
added-the first symbol in is the last
taken out and computed.) The NP
the enzyme meets a similar fate, and
goes into the stack on top of the
reaction. Finally, the gene fulfills
expectations and is followed by the
VP encoded. Now freed, the processor returns to the stack. The enzyme
is paired with catalyzed, and the
reaction is then paired with stopped.
This set of nested dependencies is
similar to the set of interactions created by hydrogen and polar bonding between residues in the tertiary
structure of folded proteins (Figure
4). That a pushdown stack automaton recognizes nested dependencies
means that at least a grammar with
context-free sophistication is required to capture this particular feature of protein structure.
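The stack bookkeeping just described is compact enough to write out. The sketch below replays the example sentence only; it is not a general parser, and the phrase lists are supplied by hand rather than discovered.

# Nested dependencies resolved with a pushdown stack: noun phrases are
# pushed as they arrive, and each verb pairs with the most recently
# stacked noun phrase (last in, first out).
noun_phrases = ["the reaction", "the enzyme", "the gene"]
verbs = ["encoded", "catalyzed", "stopped"]

stack = []
for np in noun_phrases:
    stack.append(np)       # no verb phrase yet: push and keep reading

for verb in verbs:
    print(stack.pop(), "<->", verb)
# the gene <-> encoded
# the enzyme <-> catalyzed
# the reaction <-> stopped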
Figure 4 also illustrates the possibility of having crossed dependencies within proteins. The sentence,
"Bill, Alice, and Ted are a cook, a
chef, and a dishwasher respectively,"
is a construct that cannot be handled
by context-free grammars. A more
powerful, context-sensitive grammar
is required.
Tandem and direct repeats and
RNA pseudoknots with nested and
crossed dependencies require at least
a context-free grammar for expression. Searls (1993) showed that attenuators and double-inverted repeats require a context-sensitive
grammar to reflect their ambiguity
of structure. Ambiguity has long
been recognized as a quality of natural languages. Humans find it acceptable, if not useful or even poetic, to be able to create sentences
like: Everyone has read two books.
The ambiguity as to whether two
books means "any two books" or
"the Bible and Gone with the
Wind" is easily tolerated. The artificial languages of computers cannot
tolerate such ambiguity. The presence of ambiguity in biomolecules
reflects the linguistic sophistication
of biological systems. The linguistic
power required to capture molecular structures arguably appears similar to that required for natural languages.
Applications in
computational biology
Interpreting the genome is actually
a problem in pattern recognition: specifically, recognizing sequence
patterns of genes, regulatory elements, protein folding, and structural determinants. Pattern recognition is a well-researched problem in
computer science applied to such
diverse fields as speech recognition,
medical diagnostics, and remote
sensing. Two different approaches
have been developed: the statistical
and syntactic (Fu 1982).
Statistical pattern recognition
extracts from an image a set of characteristic measurements called features. Consider the problem of trying to distinguish a short text of
English from similar passages of
Romanian, Polish, and Italian when
all the consonants are changed to
"c" and all the vowels are changed
to "v" (Goldenberg and Feurzeig
1987). The problem is not unlike
that of the biologist trying to discriminate between protein families
or coding and noncoding regions. It
can be solved by a statistical description of the language. How so-called vowelish is English? What are
the probabilities of vowels and letters and their combinations? If too
many words end with a vowel, the
language cannot be English, yet it
may be Romanian or Italian. If some
words are a single consonant, the
language is not English, but may be
Polish.
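The exercise is simple to reproduce. In the sketch below, the two sample phrases are invented stand-ins for real corpus data, and the single statistic computed (the fraction of words ending in a vowel) is only one of many that could be used.

def cv_mask(text):
    # Reduce a word to its consonant/vowel skeleton, e.g., "gene" -> "cvcv".
    return "".join("v" if ch in "aeiou" else "c"
                   for ch in text.lower() if ch.isalpha())

def vowel_final_fraction(words):
    masked = [cv_mask(w) for w in words if cv_mask(w)]
    return sum(m.endswith("v") for m in masked) / len(masked)

english = "the cat saw a mouse near the old barn".split()
italian = "il gatto vide un topo vicino alla vecchia casa".split()
print(round(vowel_final_fraction(english), 2))  # lower
print(round(vowel_final_fraction(italian), 2))  # higher

Even on these tiny samples the statistic separates the two languages; this is exactly the kind of shallow signature that statistical discrimination exploits.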
Programs such as GeneMark
(Borodovsky et al. 1994) similarly
analyze statistical properties of DNA
sequences. GeneMark assumes that
frequencies in position "0" depend
on the nucleotide in position
"-1" or positions "-1" and "-2."
The program slides a window along
the sequence in discrete steps and
calculates the probability that the
DNA conforms to either a model of
a coding or noncoding region. Because there are significant differences between frequency models of
coding and noncoding sequences, it
can detect an open reading frame.
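The logic, though none of the statistical detail, can be sketched as follows. The transition probabilities below are invented for illustration; GeneMark's actual models are higher-order, frame-aware Markov chains estimated from annotated genomes.

import math

# Score a sliding window by how well its dinucleotide transitions fit a
# "coding" versus a "noncoding" first-order Markov model (log-odds).
CODING    = {"AA": 0.30, "AT": 0.20, "TA": 0.10, "TT": 0.40}
NONCODING = {"AA": 0.25, "AT": 0.25, "TA": 0.25, "TT": 0.25}

def log_odds(window):
    score = 0.0
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        score += math.log(CODING[pair] / NONCODING[pair])
    return score  # positive scores favor the coding model

sequence = "AATTTTAATATT"
for start in range(0, len(sequence) - 5, 3):  # slide the window in steps of 3
    window = sequence[start:start + 6]
    print(start, window, round(log_odds(window), 2))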
Typically, the analysis of a new
sequence begins with a similarity
search to align the new sequence to
the database (Tatusov et al. 1994).
These programs first find locally
similar regions between two sequences, then they try to extend the
matches. There are many different
ways to measure similarity between
matched residues, all of which create a substitution matrix to compensate for evolutionary distances,
frame shifts, and other factors that
allow related proteins to vary in
sequence. The matrix used and the
values assigned to its variables can
have a great effect on the search
results. Once a significant homology is found, alignment methods are
used to find common structural
motifs associated with particular
protein function. The new gene sequence then has a deduced function.
Statistical approaches do have
drawbacks. Homology programs
assign to each alignment a certain
probability of occurrence. This approach works satisfactorily only
when the frequencies of amino acids
do not differ greatly from the frequencies of the database, a condition that is not always true. Homology searches also work only if
homologies exist. Today, 40%-50%
of new gene products are not homologous to any other protein currently in the database.
The statistical approach also lacks
depth. Recognizing so-called vowelishness can successfully discriminate
a sequence as English, but it obviously reflects a superficial understanding of structural or syntactic
complexities. This superficiality cannot be overcome by simply using
more sophisticated statistics. GeneMark uses sophisticated Markov
chaining models, which simulate
DNA sequences by calculating the
151
probabilities of oligonucleotides
according to correlations between
base frequencies in different positions of the sequence. However,
Pinker (1994) shows how Markov
chaining models are deeply and fundamentally wrong as models of language: sentences are not formed by
simply choosing each successive
word from one of a few lists of words
according to prespecified probabilities. Markov chaining models are formally unable to handle the long-distance, nested, and crossed dependencies illustrated in Figure 4.
Molecular linguistics avoids these
problems by using methods based
on syntactic pattern recognition and
formal language theory. Syntactic
pattern recognition, drawing upon
an analogy with the syntax of language, expresses a pattern as a composition of its subpatterns and pattern primitives, and thus captures
structural complexities. Developing
a hidden Markov model similar to
those used in speech recognition,
Krogh et al. (1994) identify the core
structural elements in families of
homologous proteins. Their method
differs markedly from conventional
global alignment techniques, which
align just two sequences at a time
and use substitution matrices that
assign the same penalty for nonidentical pairs found in all regions
of the sequence. Ideally, differences
would be penalized in the conserved
regions but tolerated in the variable
regions found in all homologous
proteins. Krogh et al. (1994) are
able to allow variable, position-dependent penalties while aligning
multiple sequences simultaneously.
Sakakibara et al. (1994) generalize
on this model to predict tRNA structures. Other molecular linguistic
methods use formal language theory
to analyze the database. For example, using a context-free grammar with added features giving
power similar to a Turing machine,
Searls (1993) has scanned more than
70,000 base pairs in under five minutes, successfully recognizing mouse
and human alpha-like globins while
ignoring pseudogenes.
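The heart of the Krogh et al. improvement, position-dependent scoring, can be illustrated apart from the hidden Markov machinery. In the sketch below the consensus, the conserved-column flags, and the penalty values are all invented; a real profile model estimates such parameters from the aligned family itself.

# Mismatches cost more in conserved columns than in variable ones,
# unlike a single substitution matrix applied uniformly everywhere.
CONSENSUS = "GKTFLHG"
CONSERVED = [True, True, False, False, False, True, True]

def score(candidate, match=1.0, mismatch_conserved=-3.0, mismatch_variable=-0.5):
    total = 0.0
    for a, b, is_conserved in zip(CONSENSUS, candidate, CONSERVED):
        if a == b:
            total += match
        else:
            total += mismatch_conserved if is_conserved else mismatch_variable
    return total

print(score("GKTFLHG"))  # 7.0, a perfect match
print(score("GKSYIHG"))  # 2.5, mismatches confined to variable columns
print(score("AKTFLHA"))  # -1.0, mismatches in conserved columns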
Using the analogy
Disregarding mathematical formalisms, Sereno (1991) uses the anal-
ogy to make predictions about the
neurophysiology of language. He
sees the major advance of both life
and language as the ability to control the assembly of preexisting units
of meaning, rather than the invention of the units themselves. Amino
acids exist as individual units with
general chemical properties. A cell
can polymerize these residues (units
of meaning) into a protein in which
each amino acid takes on new and
specific meanings depending upon
the specific context of each sequence.
Similarly, pygmy chimpanzees have the prelinguistic ability to identify unitary concepts or words referring
to actions, objects, and locations
encountered in life, yet the chimps
seem unable to create sentences of
more than two or three words
(Sereno 1991). Combining words
into long sentences in which they
take on new meanings appears to be
a unique ability of humankind. From
these basics, Sereno (1991) extrapolates the analogy to use in theories
of aphasia, word recognition, sentence assembly, and language comprehension.
The biological-linguistic analogy
invites innovative ways of conceptualizing biology. Genome sequencing projects represent reductionism
taken to its conclusion. Innovative
ways are needed to integrate the
sequence information upwards, to
macroscopic morphology and behavior. Biology has long been reduced to chemistry; now chemistry
has become a symbol-manipulating
process. The difference has consequences, bringing new appreciations
and surprising associations.
A recent review in Nature of the
major evolutionary transitions by
Szathmary and Maynard Smith
(1995) raised such unaccustomed
associations that the authors believed it necessary to defend "writing an article concerning topics as
diverse as the origins of genes, of
cells and language." When viewed
as changes in information processing, there is sufficient formal similarity between the various evolutionary transitions to hope that
progress in understanding any one
step will illuminate the others. The
question then arises: What do the
origin of life and the origin of language have in common?
Why is life linguistic?
Pinker (1994) points out that the
use of discrete combinatorial systems is rare in nature. Geology,
weather, cooking, sound, and light
are blending (and inanimate) systems in which the properties of the
system are a mixture of the elements. Blending systems are relatively circumscribed. The pink created from mixing red and white can
be no redder than red, no whiter
than white. Pinker (1994) suggests
that life and mind, the two systems most impressive for their open-ended design, necessarily required the creative power of a discrete combinatorial system.
Sereno (1991) argues that the cell-person analogy stems from the fact
that the origin of life and the origin
of language solved the same problem: how to escape determinism.
The prebiotic and prelinguistic states
were both complex, interacting systems, or so-called soups, that evolved
deterministically. The step to a new,
more intentional system of preferentially selected reactions required the ability to encode, use, and reproduce information. Information
had to be protected from the dissipative attack of the soup; it had to
be in essence hidden, yet remain
available enough for use. The solution was to create a symbol-based
system. What better way to hide
while remaining available than to
use a symbol, "that which stands
for something else" (Webster's Third
International Dictionary 1986).
These new symbol systems had to
escape design constraints common
to both prebiotic and prelinguistic
states. Each needed specific reaction-controlling devices to simultaneously control many different reactions. These controlling devices
had to be built from stable, concatenated units and assembled locally,
one at a time, in a serial fashion.
It is easy to recognize how replicators (Dawkins 1989), or so-called
living molecules such as the hypothetical RNA replicase of an RNA
world, fulfill these criteria. It is perhaps less apparent how the neural
networks responsible for language
do also. Arguably, the ontogeny of
language in the child may recapitulate the emergence of language from
the prelinguistic soup. During development, undifferentiated, diffuse,
short-range neural networks are
pruned back and replaced by a
smaller set of less diffuse, long-distance connections. Alternatively, the
parallel event may be the differentiation of the common neural substrate supporting both manual manipulation of objects and linguistic
functions seen as the child matures
and acquires language (Greenfield
1991).
The cell-person analogy, with its
emphasis on the symbolic representation systems of both the cell and
the human brain, reflects the terminology used to describe complex
adaptive systems. The origin of life
and the origin of language were two
emergent events in a hierarchy of
evolving complex systems ranging
from metabolism to economies. Such
systems routinely develop an internalized representation of their external environment. By filtering out
environmental noise and compressing perceived regularities, complex
adaptive systems create schema.
These schema are then tested for
fitness by selective pressures (Cowan
et al. 1994). A better understanding
of the principles governing complex
adaptive systems may reveal features shared with other systems as
well as the crucial differences that
caused a discrete combinatorial system to emerge only twice: at the
origin of life and the origin of language.
Acknowledgments
I thank W. William Walthall and
Duane M. Rumbaugh for their comments and am especially indebted to
Evelyn R. Strauss for her valuable
criticisms of early drafts of this
manuscript. I am also grateful to
Marc Weissburg and five anonymous reviewers whose criticisms
made the final version a better article.
References cited

Borodovsky M, Koonin EV, Rudd KE. 1994. New genes in old sequence: a strategy for finding genes in the bacterial genome. Trends in Biochemical Sciences 19: 309-313.

Campbell J. 1982. Grammatical man: information, entropy, language and life. New York: Simon and Schuster.

Chomsky N. 1972. Syntactic structures. The Hague (the Netherlands): Mouton & Co.

Collado-Vides J. 1991. A syntactic representation of the units of genetic information: a syntax of units of genetic information. Journal of Theoretical Biology 128: 401-429.

Cowan GA, Pines D, Meltzer D, eds. 1994. Complexity: metaphors, models, and reality. Reading (MA): Addison-Wesley Publishing Co.

Crick FHC. 1959. The present position of the coding problem. Structure and Function of Genetic Elements: Brookhaven Symposium on Biology 12: 35-37. Upton (NY): Brookhaven National Laboratory.

Darwin C. 1859. On the origin of species, a facsimile of the first edition. Cambridge (MA): Harvard University Press.

Dawkins R. 1989. The selfish gene. Oxford (UK): Oxford University Press.

Fu KS. 1982. Syntactic pattern recognition and applications. Englewood Cliffs (NJ): Prentice-Hall.

Gamov G. 1954. Possible relation between deoxyribonucleic acid structure and protein structure. Nature 173: 318.

Goldenberg PE, Feurzeig W. 1987. Exploring language with Logo. Cambridge (MA): MIT Press.

Greenfield PM. 1991. Language, tools and brain: the ontogeny and phylogeny of hierarchically organized sequential behavior. Behavioral and Brain Sciences 14: 531-595.

Hopcroft JE, Ullman JD. 1979. Introduction to automata theory, languages and computation. Reading (MA): Addison-Wesley.

Jerne N. 1988. The generative grammar of the immune system. Science 229: 1057-1059.

Jiménez-Montaño MA. 1989. Formal languages and theoretical molecular biology. Pages 199-210 in Goodwin B, Saunders P, eds. Theoretical biology: epigenetic and evolutionary order from complex systems. Edinburgh (UK): Edinburgh University Press.

Kain RY. 1972. Automata theory: machines and languages. New York: McGraw-Hill.

Keller EF. 1983. A feeling for the organism: the life and times of Barbara McClintock. San Francisco (CA): W.H. Freeman.

King J. 1989. Deciphering the rules of protein folding. Chemical Engineering News 67(15): 32-54.

Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. 1994. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology 235: 1501-1531.

Küppers B-O. 1990. Information and the origin of life. Cambridge (MA): MIT Press.

Lathrop R, Webster T, Smith R, Winston P, Smith T. 1993. Integrating AI with sequence analysis. Pages 212-258 in Hunter L, ed. Artificial intelligence and molecular biology. Menlo Park (CA): AAAI Press.

Lyons J. 1977. Semantics. Vol. 1. Cambridge (UK): Cambridge University Press.

Pinker S. 1994. The language instinct. New York: William Morrow.

Radford A. 1981. Transformational syntax: a student's guide to Chomsky's extended standard theory. Cambridge (UK): Cambridge University Press.

Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, Haussler D. 1994. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research 22: 5112-5120.

Saltus R. 1994. Second culprit gene in inherited rectal cancer identified. Boston Globe March 17: 7.

Searls DB. 1993. The computational linguistics of biological sequences. Pages 47-120 in Hunter L, ed. Artificial intelligence and molecular biology. Cambridge (MA): MIT Press.

Sereno MI. 1991. Four analogies between biological and cultural/linguistic evolution. Journal of Theoretical Biology 151: 467-507.

Szathmáry E, Maynard Smith J. 1995. The major evolutionary transitions. Nature 374: 227-232.

Tatusov RL, Altschul SF, Koonin EV. 1994. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proceedings of the National Academy of Sciences of the United States of America 91: 12091-12095.

Wilkins WK, Wakefield J. 1995. Brain evolution and neurolinguistic preconditions. Behavioral and Brain Sciences 18: 161-226.

Yockey HP. 1992. Information theory and molecular biology. Cambridge (UK): Cambridge University Press.
Patricia Bralley is a doctoral candidate in molecular genetics in the Department of Biology at Georgia State University, Atlanta, GA 30303. She is researching control of the lysis-lysogeny decision in phage P1. She has published several literary essays on the nature of life and consciousness. © 1996 American Institute of Biological Sciences.