Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CABIOS Vol. 12 no. 5 1996 Pages 423-429 Interfering contexts of regulatory sequence elements E.N.Trifonov Abstract Motivation: Although one would normally expect a given regulatory element to perform best when it fully matches its consensus sequence, this is generally far from being the case. Usually, almost none of the actual sites fits the consensus exactly, and some of those that do fit do not perform well. The main reason for that is the very nature of the sequences and the messages (codes) they contain. Normally, any given stretch of the sequence with one or another regulatory site not only carries this regulatory message, but several more messages of various types as well. These messages overlap with the regulatory element in such a way that the letter (base) which actually appears in any given sequence position simultaneously belongs to one or more additional codes. Apart from numerous individual codes (sequence patterns) specific for a given species or gene, there are many different general (universal) sequence codes all interacting with one another. These are the classical triplet code, DNA shape code, chromatin code, gene splicing code, modulation code and many more, including those that have not yet been discovered. Examples of overlapping of different codes and their interaction are discussed, as well as the role of degeneracy of the codes and the sequence complexity as a function of code density. Contact: E-mail: [email protected] Introduction Recognition sites for various transcription factors in the DNA sequences as well as, generally, specific sites of all types appear as islands in the sea of an irrelevant 'random' background, different for every individual site. Since many positions in the consensus sequence of the site are frequently degenerate, even the signal sequence itself is variable, the variations being, again, seemingly random. The natural DNA sequences (as well as RNA and protein sequences), however, are far from being random. What was considered in earlier days as the sea of 'junk' sequences is today shrinking to remaining lakes and puddles of seemingly featureless and functionless sequences which, most likely, also carry some messages that are just not yet recognized. Department of Structural Biology, The Weizmann Institute of Science, Rehovot 76100, Israel © Oxford University Press As the reader will see below, many of those natural sequences which are found to be functional and recognizable, after further scrutiny, turn out to carry several interfering (overlapping) messages simultaneously. This paper describes various kinds of these interfering contexts and upgrades some of them to the category of 'codes' of which the classical triplet code is one veteran representative. There is something else in the same sequence Do we get right what the consensus sequences tell? Common belief is that having enough versions of a given type of the recognition sequence, one can derive a consensus with a typical or even exemplary performance as the recognition site. Surprisingly, however, there are several hard-to-believe cases when some rather different sequences turn out to perform better than the consensus. Two such examples emerged from experiments with randomization of the recognition sequences in a search for a better one. In both cases, some of the sequences selected from random sets were found to be superior in some important aspects as compared to well-established consensus sequences. The tet promoter of pBR322, TTGACA TTTAAT, with an almost ideal TATA box and ideal -35 sequence, was replaced by the sequence TTTCTT TAAGCT with only marginal resemblance to the wild type; yet 8-fold more mRNA was produced with this artificial 'promoter' (Horwitz and Loeb, 1988). Obviously, the promoter consensus reflects not only sequence requirements for good mRNA production, but also some other features that it is important for natural promoters to have. In other words, the consensus tells something else, not just what our limited knowledge about promoter structure would suggest. Another case like that is a study on RNA ligands of T4 DNA polymerase (Tuerk and Gold, 1990). The consensus ligand AAUAACUC was, indeed, fished out from the random sequences as one of the best performers, but not the best. The sequence AGCAACCU showed, actually, even stronger binding, although it was only a 50% match to the natural consensus. Again, it appears that the consensus sequence is not only about the binding. It has to satisfy some other sequence requirements escaping our straightforward understanding. All this suggests that reading the sequences in only one way, i.e. expecting one message only, is generally wrong. Not 423 E.N.Trifonov infrequently some other messages hide both in the consensus and in the contexts of the individual sites. This is not unlike the following analogy (not to be taken as more than that), where the 'messages' have even no connection with their formal consensus. The words COOKY, MANGO, MELON, HONEY, SWEET all suggest something sweet or sweet-sour and could be considered, thus, as recognition sequences for the 'sweet' quality. Their consensus sequence, however, conveys a rather different message: MONEY. The third positions do have something to say The third, degenerate, positions of the codons in the mRNA are normally given an unimpressive role of nearly neutral drifting. Indeed, if two very closely related gene sequences are taken, the difference between them resides almost exclusively in the third positions, leaving the encoded amino acid sequences unchanged. However, there are several observations indicating that the third position is not quite neutral. For example, a certain selection pressure in favor of local G + C composition is well documented (D'Onofrio and Bernardi, 1992). Analysis of the occurrences of four bases in the three codon positions of mRNA also indicates that the distribution of bases in the almost 'silent' third position displays a significant degree of species specificity (Zhang and Chou, 1994). Yet another bias is the frequent preference for U in the third positions, a reflection of the hidden mRNA consensus (GCU)n presumably responsible for keeping the ribosome in the correct reading frame (Lagunez-Otero and Trifonov, 1992). In other words, the third positions, indeed, seem to carry some functional load or rather several different loads, not necessarily one at a time. The degeneracy of the triplet code (alternative bases in the third positions and alternative triplets for leucine, serine and arginine) is not wasted. Below, some particularly striking examples of the utilization of the degeneracy are described. The largest subunit of RNA polymerase II contains a Cterminal domain with tandem repeats of the highly conserved seven amino acid sequence YSPTSPS. In the corresponding cDNA, all six possible codons for serine are used (Corden et al, 1985). One would expect that if the selection pressure were only exerted on the amino acid sequence, thus allowing mutations of the genomic sequence to all possible triplets for serine. Close inspection of the distribution of the AGY serine triplets along the cDNA reveals, however, that the most frequent distances between the AGY triplets are not 6, 9, 12, 15 or 21 bases as the amino acid sequence would suggest but, very selectively, 21 (also 42 and 63) residues. In other words, the AGY triplets appear periodically, with the period 21 bases. Obviously, this has nothing to do with the selection pressure on the amino acid sequence. The AGY periodicity is needed for whatever purpose by the DNA (or RNA) sequence. Perhaps, this is some message carried by DNA 424 for itself, since the period 21 bases corresponds closely to two helical repeats of DNA. A similar example, but apparently with a different type of DNA level message, is provided by the sequence of one of the MHC genes (Levi-Strauss et al, 1988). Here as well the region 725-880 of the sequence codes for a repeating amino acid sequence with strictly alternating positively and negatively charged amino acids, R, K and E, D, respectively. Four different triplets for arginine are present in this sequence: CGA, CGG, AGA and AGG; and, again, the degeneracy of these triplets and of the third positions of other triplets is highly non-random. The positions of purine bases strictly follow a consensus (GA)n, such that all AG dinucleotides occupy only odd positions in the sequence (727 and other odd coordinates), while the GA dinucleotides are placed in even positions (728 and other even coordinates). Perhaps at some point in evolution this sequence was just an alternation of A and G, and this alternation is still kept wherever the degeneracy of the triplet code allows this. Why the twobase periodical alternation of A and G is almost as important as six-base periodical alternation of triplets for positively and negatively charged amino acids remains to be seen. Interestingly, the presumed ancestral repeating sequence (AGAGAG)n coding for the repeat (RE)n would ideally satisfy both alternations. Two other charged amino acids, K and D, apparently, had to be introduced, compromising the original simple pattern, but perhaps bringing in some important advantage. Another very interesting example is hypervariability of the V3 region of the HIV-1 envelope gene. Here the changes occur in the first two positions of the codons, while the third positions are conserved (!): of 11 documented point changes in this region, only one hits the third position (J.Hirshon and J.Goudsmit, personal communication). This conservation, obviously, is unrelated to the encoded protein and keeps intact some additional message. It remains to be found what that message means. Codes other than triplet code The code is any pattern or bias in the sequence which corresponds to one or another specific biological (biomolecular) function or interaction (Trifonov, 1989). The codes could be specific or general, depending on the respective functions. The sequence of a regulatory site operating with only a limited group of genes or common to only a few species would be an example of a specific code. Many transcription factor binding sites are of this category. The universal triplet code and, say, signals for transcription initiation and termination are examples of general codes common either to all species or at least to large kingdoms. This section is a brief overview of current knowledge about various general codes starting with those known sequence patterns which are Interacting sequence contexts normally not considered as codes, although they are according to the above definition. The following codes are very briefly described below: transcription codes, gene splicing code, translation pausing code, DNA structure codes, chromatin code, translation framing code, modulation (adaptation) code and gene segmentation code. Transcription codes include promoters and terminators, and are rather universal though different in prokaryotes and in eukaryotes. Most of the promoter sequences include the TATA box or a close version of it (Pribnow, 1975; Goldberg, 1979). Prokaryotic promoter sequences are rather thoroughly characterized and the algorithms are available to recognize them in the sequences (e.g. Alexandrov and Mironov, 1990; Demeler and Zhou, 1991). Prokaryotic terminators are characterized as well (Brendel et ai, 1986; d'Aubenton Carafa et ai, 1990; Yager and von Hippel, 1991). Eukaryotic promoters are well reviewed (Bucher and Trifonov, 1986; Bucher, 1996) and several algorithms to locate them in the sequences have been attempted [Prestige (1995) and references therein]. The gene splicing code for the processing of nuclear premRNA is largely undeciphered. Its main components are obligatory GU- and AG-ends of introns (Breathnach and Chambon, 1981), as well as rather conserved consensus sequences and other sequence features around the ends (Mount, 1982; Soloviev et ai, 1994). This information, unfortunately, is insufficient to locate the splice sites in the genomic sequences unequivocally. Apparent high precision of the gene splicing would suggest that there are some other important features of the excised and spliced sequences, which remain to be revealed in future studies. Translation pausing important for the regulation of translation is encoded by clusters of rare triplets for which the respective aminoacyl-tRNAs are in limited supply (Varenne and Lazdunski, 1986; Kurland, 1991). A higher potential of hairpin formation immediately downstream from the rare triplets (Shpaer, 1985) also contributes to the translation pausing pattern, providing a transient obstacle for the ribosome. Since the rare codons are also sites of more frequent translational errors, the translation pausing code may also be called a 'fidelity code' (Shpaer, 1985). The last two decades of studies on DNA structure revealed that DNA is not a monotonous double helix all the same for whatever sequence of base pairs is hidden in its interior. Rather, it is a spectrum of local variations of the shape dictated by the nucleotide sequence. In particular, the molecule is frequently curved, which was originally revealed from the DNA sequence analysis showing some dinucleotides reappearing with the DNA helical repeat (Trifonov, 1980; Trifonov and Sussman, 1980). The sequence-dependent local shape of DNA is a crucial component of the protein-DNA recognition. The corresponding sequence code (DNA shape code) is universal and can be expressed in the form of a set of 26 angular parameters characterizing geometries of all 10 possible base pair stacks in DNA. The angles—wedge roll, wedge tilt, and twist—have been estimated by computational fitting to results of solution experiments with many DNA fragments of different nucleotide sequences (Kabsch et ai, 1982; Bolshoy et ai, 1991). A computer algorithm exists which calculates the DNA shape from its nucleotide sequence based on the above 26 angles (Shpigelman et ai, 1993). There are also several alternative models describing the sequencedependent DNA shape [Dickerson et ai (1996) and references therein]. These models, however, predict DNA curvatures with average misfit >2 SD away from the experimental estimates (Bolshoy et ai, 1991). A gross change in DNA helical shape is caused by transition from one DNA form to another. Each form can be characterized by A-, B- or Z-propensity energies calculated for all trinucleotides (Basham et ai, 1995). This constitutes DNA form codes for all three basic DNA helical structures. Chromatin code describes those sequence features that direct the histone octamer's binding to DNA and formation of the nucleosomes. The pattern reflects deformational anisotropy of DNA (Trifonov, 1980), and consists primarily of periodically alternating AA and TT dinucleotides with the period ~10.4 bases (Mengeritsky and Trifonov, 1983; Ioshikhes et ai, 1992). The latest analysis (Ioshikhes et ai, 1996) of compiled experimentally mapped nucleosome DNA sequences (Ioshikhes and Trifonov, 1993) confirmed earlier estimates of the phase shift between AA and TT dinucleotides in the nucleosomal pattern (~5 bases), and allowed the location of the AA dinucleotides preferentially on the histone octamer surface (Ioshikhes et ai, 1996). This places the complementary TT dinucleotides on the outer face of the nucleosomal DNA, in accordance with UV-irradiation experiments (Gale et ai, 1987; Pehrson, 1989). An alternative nucleosome DNA pattern is also suggested where the periodical AA and TT dinucleotides are in-phase (Satchwell et ai, 1986). An important part of the chromatin code is information about orientations of neighboring nucleosomes in space indispensable for the description of local architecture of the chromatin fiber. This information is, presumably, expressed in the lengths of the linkers connecting the nucleosomes (Noll et ai, 1980; Ulanovsky and Trifonov, 1986). Overlapping with the triplet code is the translation framing code (Trifonov, 1987). The correct reading frame during translation is, apparently, secured by complementarity of the 'consensus' mRNA pattern, (Gcu)n (Trifonov, 1987; LagunezOtero and Trifonov, 1992), with the (Cxx)n mRNA contact sites in the small subunit ribosomal RNA (Trifonov, 1987; McCarthy and Brimacombe, 1994). The frame monitoring by this scheme is reflected in several known cases of translational frameshifting where the change in the (Gcu)n frame, apparently, causes the jump of the ribosome to the new frame 425 E.N.Trifonov (Trifonov, 1987). The presence of the G-periodical pattern in the protein-coding sequences should be important, especially for long polypeptide chains, to make the probability of the ribosome jumping sufficiently low so that the translation would faithfully proceed to the completion of the synthesis of the entire long chain. The most conspicuous of all sequence types are tandemly repeating sequences, especially simple repeats. At first sight, the tandem repeats seem to be meaningless and dispensable since the copy numbers of the repeats are highly variable. Analysis of the literature on the occurrences and functional involvement of the repeats leads to the conclusion that the very copy number matters no less than the repeat sequence itself (Trifonov, 1989, 1990). It is suggested that the variable copy number of a repeat serves as an adjustable variable to modulate expression of the nearby gene or other biological (biomolecular) function influenced by the repeats—modulation code (Trifonov, 1989, 1990). The particular mechanism of the influence, which usually becomes a matter of major interest, is actually irrelevant as soon as the influence is sensitive to the tuner—copy number of the repeats. Numerous 'triplet expansion' diseases [Perutz et al. (1994) and references therein] are a good illustration of tuners that went beyond their normal limits. The number three of the repeat unit size is not, of course, a magic number, and one would expect that 'expansion' diseases exist as well with other repeat sizes. For example, the alcoholic syndrome in mice is accompanied by a change in the copy number of the (RY)n repeat sequence in the first intron of the alcohol dehydrogenase gene (Zhang et al, 1987). It is speculated that the tuning mechanism—fast changes in the copy numbers of the tuning tandem repeats—also participates in the fast adaptation to a changing environment (Trifonov, 1989,1990). Indeed, the unequal crossing-over and replication slippage on the repeating sequences, both causing changes in the copy numbers, are rather frequent events compared to rare point mutational changes in the protein-coding sequences. The given gene's activity could be changed much faster by the tuning mechanism than by the point mutations. The modulation code as described above could thus also be considered as an adaptation code. One of the emerging new codes is the genome segmentation code, i.e. the genomes appear to be built of rather standard size units of an unknown number of types fused together in various orders, very much like shuffled packs of cards. The first evidence comes from the size distributions of the natural polypeptide chains. Eukaryotic proteins, for example, show preference for the size 123 amino acid residues and multiples of this unit size (Berman et al, 1994). Such a distribution could result from many acts of combinatorial fusion of different small genes, all of the same unit size, at some early stage of evolution (Trifonov, 1995). One prediction would be that at the points of fusion, i.e. every 426 120-125 residues, there would be some excess of methionines, formerly initiation residues of the fused small genes. This expectation is well met, indeed (Kolker and Trifonov, 1995), although most of these boundary methionines disappeared, apparently, due to mutations. The corresponding DNA unit size, ~370 bp, is also observed in the size distributions of DNA mobile elements (E.N.Trifonov, in preparation), in particular, in the sizes of extrachromosomal circular DNA (Gaubatz, 1990). What the sequence features of the 'genome units' or the fusion sites are remains to be found. Undoubtedly, there are many more codes hidden in the sequences. Some of them could only be guessed, some already disclose themselves by one or another sequence feature. For example, the G+C content of isochores seems to be maintained largely by the composition bias in third positions of the triplets—'genomic code' (D'Onofrio and Bernardi, 1992). There has to be some code responsible for folding of mRNA and heterogeneous nuclear RNA. Surely, the code should involve sequence complementarity, perfect or imperfect (Trifonov, 1985), and, perhaps, mirror symmetry elements presumably located in the loop regions—'RNA loop code' (Beckmann et al, 1986; Trifonov, 1988). There should be some pattern in the genomic sequences responsible for mRNA editing [Maslov et al. (1994) and references therein] and many more, including those which correspond to function not even yet discovered. Superposition and interaction of the codes One of the most remarkable properties of genetic sequences is that the codes they carry overlap, i.e. a given base in a given nucleotide sequence may well belong to two or more different codes simultaneously. This phenomenon, unique for the genetic sequences, was predicted in 1968 by Holliday who suggested, in particular, that the signals responsible for recombination may well reside within protein-coding sequences. The general idea was developed further in later works (Schaap, 1971; Zuckerkandl, 1976a,b; Trifonov, 1981) and soon confirmed by numerous experimental data, as reviewed by Normark et al. (1983). The idea on the multiplicity of the codes and their overlapping also appeared in later independent studies (Caporale, 1984; Kypr, 1986; Koningsef a/., 1987). Essentially, the genetic sequences appear as texts designed for several different reading devices, each one seeing in it a message of its own kind. This is possible, of course, primarily due to degeneracy of the superimposed codes which allows a given letter (base, amino acid) serving a given message to be replaced by another letter. As a result, some other message is also satisfied, written by a different code. Such a design is most natural in terms of selection: if, for example, in a given sequence after some mutational changes a certain pattern emerged which causes some useful interaction (protein-DNA, Interacting sequence contexts protein-RNA, protein-protein), then this pattern would become a subject of positive selection pressure, together with the whole sequence context with the patterns selected earlier. Or if the sequence responsible for one function happened to be good for something else, it would serve both functions simultaneously, and there is no reason why it should not. A good descriptive term for such sequence behavior is 'molecular opportunism', used by Doolittle (1988) for one of the most striking examples of overlapping codes: crystallins, eye lens proteins, serve as very specific enzymes in different tissues of the same species, which have nothing to do with their physical and optical properties suitable for the eye lens. A theoretical treatment of the multicode information systems with overlapping is not available. Such systems have no precedent either in information theory or in theories on codes. The main issue here is, perhaps, the relationship between the number of codes present in the sequence and the code degeneracies. The overlapping codes have to be degenerate to allow the adjustment of the superimposed sequence patterns to a common sequence which, thus, represents all the interacting codes simultaneously. The number of different codes overlapping in the given sequence should be a function of their degeneracy: fewer codes of low degeneracy and more codes of high degeneracy. Lowdegeneracy codes, obviously, would leave only a little possibility for compromise, and the sequence would not, thus, accommodate much of other codes. The superposition of the codes should enrich the vocabulary of the sequences, i.e. to make a broader spectrum of the oligonucleotides (oligopeptides) present. Indeed, the combined vocabularies of sufficiently different overlapping codes are certainly richer than the component vocabularies. On the other hand, the condition of compromise would force the interacting codes to use all the wealth of their 'linguistic' arsenals, again making the final vocabulary rich. The richness of the vocabulary can be expressed as the sequence complexity, which can be defined in several different ways (Konopka, 1990; Trifonov, 1990). Contrary to the complexity approach, classical measures of information based on statistical non-uniformity of the sequences are not applicable to the multicode systems. Indeed, a highly complex sequence which carries many overlapping codes is as far off the statistical mean as a simple repeating sequence with only one code in it. The linguistic consideration finds some confirmation in the analysis of the complexity of protein-coding and non-coding sequences (Konopka, 1990; Trifonov, 1990), and in complexity comparisons of protein sequences with human texts (Popov et al., 1996). In particular, the prokaryotic sequences, basically open for all kinds of interaction at all three levels (DNA, RNA and protein), would, perhaps, carry many overlapping sequence patterns all over their genomes, including the rather conserved protein-coding regions [as originally suggested by Holliday (1968)]. These regions would be closer to the exhaustion of their capacity to accommodate additional codes and should, therefore, have higher linguistic complexity as compared to adjacent 'noncoding' sequences. Indeed, this is found to be the case (Konopka, 1990; Trifonov, 1990). The human texts are relevant in the context of sequence complexity because they provide the only example available of the sequences carrying only one code (read in only one way, consecutively letter by letter). Such an example cannot be taken from biological sequences, since they all are likely to carry many overlapping messages. Comparison of the sequence complexities of the human texts and of natural protein sequences, both in 20letter alphabets, shows that as one would expect human texts are simpler (Popov et al., 1996). This, however, can only be used as an illustration of the point on higher complexity of multicode sequences, rather than proof, because the relative simplicity of the human texts may also reflect imperfection of human languages at this stage of their evolution. The interaction between the codes can be studied by singling out two interacting codes of interest and analyzing the changes in the corresponding sequence patterns which are caused by the superposition. One example of such a study is known: analysis of the interaction of the gene splicing code with the triplet code (Gelfand, 1992). It is found that the codons flanking the splice junctions are biased to satisfy the sequence requirements of the splicing signals. Interestingly, the bias is different depending on the frame of the interruption by the intervening sequence. One more study case is interaction of the gene splicing code with the chromatin code. The interaction is not yet described at the sequence level, but well demonstrated by other means. The connection chromatin-gene splicing has been originally suggested as the very essence of the phenomenon of gene splicing (Zuckerkandl, 1981). In the terminology of the interacting codes, the suggestion was that the intervening sequences had to be introduced to separate two otherwise interfering messages: triplet code and chromatin code. The exons were given primarily the protein-coding function, while the introns were loaded by the sequence requirements of the chromatin structure. The connection splicing-chromatin soon found numerous supporting evidence. The typical size of the intron-exon pairs was found to be ~205 bp which is close to the nucleosome 'repeat' size (Beckmann and Trifonov, 1991). The linguistic sequence complexity of exons turned out to be lower than that for prokaryotic protein-coding sequences (Konopka and Owens, 1990; Trifonov, 1991), as one would expect, indeed, in the case of spatial separation of the interacting messages in eukaryotes. The introns show higher affinity to the nucleosomes (Soloviev and Kolchanov, 1986; D.Denisov et al, in preparation) and the splice junctions are preferentially located in the central positions of the nucleosomes (D.Denisov et al, in preparation). Finally, experimental studies clearly demonstrate 427 E.N.Trifonov that the chromatin reconstituted on cDNA (without introns) has no unique structure, contrary to the chromatin reconstituted on the genomic DNA (with introns) (Liu et al., 1995). Yet another case of apparent interaction of different sequence patterns is periodic positional distribution of transcription factor binding sites CCAAT and GGGCGG in eukaryotic promoters (P.Bucher and E.N.Trifonov, in preparation). Since the period observed, 10.2 ± 0.2 bases, is close to the sequence period of the nucleosome DNA (loshikhes et al., 1992; Ioshikhes et al, 1996), the periodicity, quite likely, reflects the presence of the nucleosome immediately upstream from the TATA box. The binding sites, then, would be preferentially oriented in a specific way relative to the surface of the histone octamer, e.g. to favor (or disfavor) accessibility of the site for interaction with the transcription factor. Concluding remarks The regulatory sequence elements and transcription signals, in particular, should not be torn out of their contexts. The sequence they are located in carries numerous messages both related and unrelated to the transcription function. The choice of the sequence elements belonging to a given binding site is a compromise between several overlapping codes specific to this sequence region. The consensus sequence of the transcription signal is only a first approximation in the description of the signal. A more appropriate, although still approximate, description would be a matrix presentation in which preferences of various signal elements (mono-, di-, trinucleotides) for different positions along the sequence are listed (Trifonov, 1983). In most cases, none of the elements individually is necessary or sufficient for the recognition, very much like individual voters in elections. The recognition is distributional (Trifonov, 1983), such that the decision is made after the weights of different signal elements present in a given site are somehow combined. The distribution of various signal components along the sequence and the use of alternative components at the same position reflects the multiplicity of the codes overlapping with the given recognition sequence and the compromise, yielding to one another in order to co-exist, as much as their degeneracies allow. References Alexandrov.N.N. and Mironov.A.A. (1990) Application of a new method of pattern recognition in DNA sequence analysis: a study of E. coli promoters. Nucleic Acids Res., 18, 1847-1852. Basham,B., Schroth.G.P. and Ho,P.S. (1995) An A-DNA triplet code: thermodynamic rules for predicting A- and B-DNA. Proc. NatlAcad. Set. USA, 92, 6464-6468. Beckmann,J.S. and Trifonov.E.N. (1991) Splice junctions follow a 205-base ladder. Proc. NatlAcad. Set. USA, 88, 2380-2383. Beckmann,J.S., Brendel.V. and Trifonov.E.N. (1986) Intervening sequences exhibit distinct vocabulary. J. Biomol. Struct. Dyn., 4, 391-400. 428 Berman,A.L., Kolker,E. and Trifonov.E.N. (1994) Underlying order in protein sequence organization. Proc. NatlAcad. Sci. USA, 91, 4044-4047. BolshoyA, McNamara,P., Harrington.R.E. and Trifonov.E.N. (1991) Curved DNA without AA: experimental estimation of all 16 wedge angles. Proc. NatlAcad. Sci. USA, 88, 2312-2316. Breathnach,R. and Chambon,P. (1981) Organization and expression of eukaryotic split genes coding for proteins. Annu. Rev. Biochem., 50, 349383. Brendel,V., Hamm.G.H. and Trifonov.E.N. (1986) Terminators of transcription with RNA polymerase from Eschertchia coli: what they look like and how to find them. J. Biomol. Struct. Dyn., 3, 705-723. Bucher.P. (1996) The Eukaryotic Promoter Database, PDB. EMBL Nucleotide Sequence Data Library. Bucher.P. and Trifonov.E.N. (1986) Compilation and analysis of eukaryotic Pol II promoter sequences. Nucleic Acids Res., 14, 10009-10026. Caporale.L.H. (1984) Is there a higher level genetic code that directs evolution? Mo/. Cell. Biochem., 64, 5-13. Corden.J.L., Cadena.D.L., Ahearn,J.M. and Dahmus.M.E. (1985) A unique structure at the carboxyl terminus of the largest subunit of eukaryotic RNA-polymerase II. Proc. NatlAcad. Sci. USA, 82, 7934-7938. d'Aubenton Carafa.Y., Brody.E. and Thermes.C. (1990) Prediction of rhoindependent Eschenchia coli transcription terminators. A statistical analysis of their RNA stem-loop structures. J. Mol. Bioi, 216, 835-858. Demeler.B. and Zhou.G. (1991) Neural network optimization for E. coli promoter prediction. Nucleic Acids Res., 19, 1593-1599. Dickerson.R.E., Goodsell.D. and Kopka.M.L. (1996) MPD and DNA bending in crystals and in solution. J. Mol. Bioi, 256, 108-125. D'Onofrio.G. and Bernardi.G. (1992) A universal compositional correlation among codon positions. Gene, 110, 81-88. Doolittle,R.F. (1988) More molecular opportunism. Nature, 336, 18. Gale,J.M., Nissen.K.A. and Smerdon,M.J. (1987) UV-induced formation of pyrimidine dimers in nucleosome core DNA is strongly modulated with a period of 10.3 bases. Proc. NatlAcad. Sci. USA, 84, 6644-6648. Gaubatz.J.W. (1990) Extrachromosomal circular DNAs and genomic sequence plasticity in eukaryotic cells. Mutat. Res., 237, 271-292. Gelfand,M.S. (1992) Statistical analysis and prediction of the exonic structure of human genes. J. Mol. Evoi, 35, 239—252. Goldberg,M. (1979) PhD Thesis, Stanford University. Holliday.R. (1968). Genetic recombination in fungi. In Peacock,W.J. and Brock,R.D. (eds), Replication and Recombination of Genetic Material. Australian Academy of Science, Canberra, pp. 157—174. Horwitz.M.S.Z. and Loeb,L.A. (1988) DNA sequences of random origin as probes of Escherichia coli promoter architecture. J. Bioi. Client., 263, 14724-14731. Ioshikhes,I. and Trifonov.E.N. (1993) Nucleosoma! DNA sequence database. Nucleic Acids Res., 21, 4857-4859. Ioshikhes.I., Bolshoy.A. and Trifonov.E.N. (1992) Preferred positions of AA and TT dinucleotides in aligned nucleosomal DNA sequences. J. Biomol. Struct. Dyn., 9, 1111-1117. Ioshikhes.I., Bolshoy,A., Derenshteyn.K., Borodovsky,M. and Trifonov.E.N. (1996) Nucleosome DNA sequence pattern revealed by multiple alignment of experimentally mapped sequences. J. Mol. Bioi, 262, 129-138. Kabsch.W., Sander,C. and Trifonov.E.N. (1982) The ten helical twist angles of B-DNA. Nucleic Acids Res., 10, 1097-1104. Kolker.E. and Trifonov.E.N. (1995) Periodic recurrence of methionines: fossil of gene fusion? Proc. NatlAcad. Sa. USA, 92, 557-560. Konings.D.A.M., Hogeweg.P. and Hesper.B. (1987) Evolution of the primary and secondary structures of the Ela mRNAs of the adenovirus. Mol. Bioi Evoi, 4, 300-314. Konopka,A.K. (1990) Towards mapping functional domains in indiscriminately sequenced nucleic acids: a computational approach. In Sarma.R.H. and Sarma.M.H. (eds), Structure and Methods, Volume I: Human Genome Initiative and DNA Recombination. Adenine Press, New York, pp. 113-125. Konopka,A.K. and Owens.J. (1990) Complexity charts can be used to map functional domains in DNA. Gene Anal. Tech. Appi, 7, 35-38. Kurland.CG. (1991) Codon bias and gene expression. FEBS Lett., 285, 165169. Kypr.J. (1986) A part of codon bias in genes protects protein spatial structures from destabilization by random single point mutations. Biochem. Biophys. Res. Commun., 139, 1094-1097. Interacting sequence contexts Lagunez-Otero.J. and Trifonov.E.N. (1992) mRNA periodical infrastructure complementary to the proof-reading site in the ribosome. J. Biomol. Struct. Dyn., 10, 455-464. Levi-Strauss,M., Carroll.M.C, Steinmetz,M. and Meo,T. (1988) A previously undetected MHC gene with an unusual periodic structure. Science, 240, 201-204. Liu,K., Sandgren,E.P., Palmiter.R.D. and Stein,A. (1995) Rat growth hormone gene introns stimulate nucleosome alignment in vitro and in transgenic mice. Proc. NatlAcad. Sci. USA, 92, 7724-7728. Maslov.D.A., Avila.H.A., Lake,J.A. and Simpson.L. (1994) Evolution of RNA editing in kinetoplastid protozoa. Nature, 368, 345-348. McCarthy,J.E.G. and Brimacombe.R. (1994) Prokaryotic translation: the interacting pathway leading to initiation. Trends Genet., 10, 402—407. Mengeritsky.G. and Trifonov,E.N. (1983) Nucleotide sequence-directed mapping of the nucleosomes. Nucleic Acids Res., 11, 3833-3851. Mount.S.M. (1982) A catalogue of splice junction sequences. Nucleic Acids Res., 10, 459-472. Noll.M., Zimmer,S., Engel.A. and Dubochet,J. (1980) Self-assembly of single and closely spaced nucleosome core particles. Nucleic Acids Res., 8, 21-42. Normark,S., Bergstrom,S., Edlund.T., Grundstrom.T., Jaurin,B., Lindberg,F.P. and Olsson,O. (1983) Overlapping genes. Annu. Rev. Genet.,11, 499-525. Pehrson,J.R. (1989) Thymine dimer formation as a probe of the path of DNA in and between nucleosomes in intact chromatin. Proc. Natl Acad. Sci. USA, 86, 9149-9153. Perutz,M.F., Johnson,T., Suzuki,M. and Finch,J.T. (1994) Glutamine repeats as polar zippers: their possible role in inherited neurodegenerative diseases. Proc. NatlAcad. Sci. USA, 91, 5355-5358. Popov,O., Segal.D.M. and Trifonov.E.N. (1996) Linguistic complexity of protein sequences as compared to texts of human languages. BioSystems, 38, 65-74. Prestige,D.S. (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Bioi, 249, 923-932. Pribnow,D. (1975) Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc. Natl Acad. Sci. USA, 72, 784-788. Satchwell,S.C, Drew,H.R. and Travers,A.A. (1986) Sequence periodicities in chicken nucleosome core DNA. J. Mol. Bioi, 191, 659-675. Schaap.T. (1971) Dual information in DNA and the evolution of the genetic code. J. Theor. Bioi., 32, 293-298. Shpaer,E.G. (1985) The secondary structure of mRNAs from Escherichia coli: its possible role in increasing the accuracy of translation. Nucleic Acids Res., 13, 275-288. Shpigelman,E.S., Trifonov.E.N. and Bolshoy,A. (1993) CURVATURE: software for the analysis of curved DNA. Comput. Applic. Biosci., 9, 435-440. Soloviev,V.V. and Kolchanov,N.A. (1986) The exon-intron structure of the genes of eukaryotes may be determined by the nucleosomal organization of the chromatin and associated peculiarities of the regulation of gene expression. Dokl. Biochem., 284, 286-290. Soloviev.V.V., Salamov,A.A. and Lawrence.C.B. (1994) Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res., 22, 5156-5163. Trifonov.E.N. (1980) Sequence-dependent deformational anisotropy of chromatin DNA. Nucleic Acids Res., 8, 4041-4053. Trifonov.E.N. (1981) Structure of DNA in chromatin. In Schweiger,H. (ed), International Cell Biology 1980-1981. Springer-Verlag, Berlin, pp. 128138. Trifonov.E.N. (1983) Sequence-dependent variations of B-DNA structure and protein-DNA recognition. Cold Spring Harbor Symp. Quant. Bioi, 41, 271-278. Trifonov,E.N. (1985) Imperfect complementarity and RNA structure. J. Biosci., 8(Suppl.), 781-790. Trifonov.E.N. (1987) Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16s rRNA nucleotide sequences. J. Mol. Bioi, 194, 643-652. Trifonov.E.N. (1988) Codes of nucleotide sequences. Math. Biosci., 90, 507517. Trifonov.E.N. (1989) The multiple codes of nucleotide sequences. Bull. Math. Bioi., 51, 417-432. Trifonov.E.N. (1990) Making sense of the human genome. In Sarma.R.H. and Sarma.M.H. (eds), Structure and Methods, Volume 1: Human Genome Initiative and DNA Recombination. Adenine Press, New York, pp. 69-77. Trifonov.E.N. (1991) Informational structure of genetic sequences and nature of gene splicing. In Lavery.R., Rivail,J.-L. and Smith,J. (eds), Advances in Biomolecular Simulations. AIP Conference Proceedings 239. AIP Press, New York, pp. 329-338. Trifonov,E.N. (1995) Segmented structure of protein sequences and early evolution of genome by combinatorial fusion of DNA elements. J. Mol. Evoi, 40, 337-342. Trifonov.E.N. and Sussman,J.L. (1980) The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. NatlAcad. Sci. USA, 77, 38163820. Tuerk.C. and Gold.L. (1990) Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriphage T4 DNA polymerase. Science, 249, 505-510. Ulanovsky.L.E. and Trifonov.E.N. (1986) A different view point on the chromatin higher order structure: steric exclusion effects. In Sarma.R.H. and Sarma,M.H. (eds), Biomolecular Stereodynamics III. Adenine Press, New York, pp. 35-44. Varenne.S. and Lazdunski.C. (1986) Effect of distribution of unfavourable codons on the maximum rate of gene expression by an heterologous organism./ Theor. Bioi, 129, 99-110. Yager.T.D. and von Hippel.P.H. (1991) A thermodynamic analysis of RNA transcript elongation and termination in Escherichia coli. Biochemistry, 30, 1097-1118. Zhang,C.-T. and Chou,K.-C. (1994) A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. J. Mol. Bioi, 238, 1-8. Zhang.K., Bosron.W.F. and Edenberg.H.J. (1987) Structure of the mouse Adh-1 gene and identification of a deletion in a long alternating purinepyrimidine sequence in the first introns of strains expressing low alcohol dehydrogenase activity. Gene, 57, 27-36. Zuckerkandl.E. (1976a) Evolutionary process and evolutionary noise at the molecular level. I. Functional density in proteins. J. Mol. Evoi 7,167-183. Zuckerkandl.E. (1976b) Evolutionary process and evolutionary noise at the molecular level. II. A selectionist model for random fixations in proteins. J. Mol. Evoi, 7, 269-312. Zuckerkandl.E. (1981) A general function of noncoding polynucleotide sequences. Mol. Bioi Rep., 7, 149-158. 429