1996 Oxford University Press. Nucleic Acids Research, 1996, Vol. 24, No. 9, 1676–1681

Lack of biological significance in the ‘linguistic features’ of noncoding DNA: a quantitative analysis

C. A. Chatzidimitriou-Dreismann1,3,*, R. M. F. Streffer1 and D. Larhammar2,3

1Iwan N. Stranski Institute for Physical and Theoretical Chemistry, Technical University of Berlin, Strasse des 17. Juni 112, D-10623 Berlin, Germany; 2Department of Medical Pharmacology, Uppsala University, Box 593, S-75124 Uppsala, Sweden and 3Institute for Basic Research, Castello Principe Pignatelli, I-86075 Monteroduni (IS), Molise, Italy

Received January 19, 1996; Revised and Accepted March 14, 1996

ABSTRACT

Recently, the application of two statistical methods (related to Zipf’s distribution and Shannon’s redundancy), called ‘linguistic’ tests, to the primary structure of DNA sequences of living organisms has excited considerable interest. Of particular importance is the claim that noncoding DNA sequences in eukaryotes display specific ‘linguistic’ features, being reminiscent of natural languages. Furthermore, this implies that noncoding regions of DNA may carry some new, thus far unknown, biological information which is revealed by these tests. In this paper these claims are tested quantitatively. With the aid of computer simulations of natural DNA sequences, and by applying the same ‘linguistic’ tests to both natural and artificial sequences, we investigate in detail the reasons of the appearance of the claimed ‘linguistic’ features and the associated differences between coding and noncoding DNAs. The presented results show quantitatively that the ‘linguistic’ tests failed to reveal any new biological information in (noncoding or coding) DNA.

INTRODUCTION

Recently it was announced (1), reported (2) and commented (see, for example, 3) that natural DNA sequences, and especially noncoding DNAs, appear to have many statistical features in common with natural languages.
In the original paper (2), Mantegna et al. performed certain mathematical investigations, called ‘linguistic tests’, on DNA sequences, which are related to Zipf’s distribution (4) and Shannon’s information theory and redundancy analysis (5,6). In short, these tests seem to reveal significant differences between coding and noncoding parts of natural DNA sequences.

The first of these linguistic tests is related to the so-called Zipf plot, i.e. the relation between the relative occurrence of all oligonucleotides of a given length n [which we will call ‘words’ as in the paper of Mantegna et al. (2)] in a specific DNA sequence (called a ‘text’) and their rank. It was claimed (2) that: (i) in a double-logarithmic plot, the graphs of the aforementioned relation for different DNA sequences are linear, which implies that Zipf’s law applies to the present case, and (ii) the slopes of these graphs for coding and noncoding DNAs differ significantly.

The second linguistic test uses the information-theoretical ‘entropy’ H(n) (5,6) of a DNA sequence (i.e. a ‘text’) when it is viewed as a collection of n-tuple words, as well as the associated ‘redundancy’ defined by Shannon (see below). This ‘redundancy’ may be considered as a property of natural languages, whose purpose is to preserve the meaning of a word even in the case of ‘typographical errors’. As a result of these investigations, it was claimed (2) that, in clear contrast with protein coding DNA segments, the noncoding DNA parts are associated with a considerable amount of redundancy.

Very recently, however, these findings and/or claims have been strongly criticised by Konopka and Martindale (7), who stressed, among others, the following points. (i) Statistical differences between coding and noncoding DNA have been known at least since 1981 and are even used in routine methods for discriminating between them; therefore, the claimed novelty of the results was not appreciated.
(ii) The oligonucleotide frequency distribution in noncoding DNA does not appear to fit Zipf’s law any better than does the distribution in coding regions; additionally the presented log–log plots display a nonlinear, rather than a linear, trend. (iii) It was concluded that both coding and noncoding DNA regions fit Zipf’s law rather poorly, if at all (7). Nevertheless, these criticisms, being formulated qualitatively, may be subject to dispute.

Since the aforementioned findings and/or claims (2) seem to be interesting and to have a potential, thus far unknown, biological significance, we examined them in detail. In order to be as precise and concrete as possible, we concentrated on a quantitative analysis, the results of which are presented below. Our main conclusion is that the ‘linguistic’ tests do not reveal any new biological information in DNA.

* To whom correspondence should be addressed at: Iwan N. Stranski Institute for Physical and Theoretical Chemistry, Technical University of Berlin, Strasse des 17. Juni 112, D-10623 Berlin, Germany

THEORY

In order to apply the aforementioned ‘linguistic’ tests to DNA sequences, the concept of ‘word’ must be introduced. Of course, in the case of coding sequences, the biologically relevant ‘words’ are the well-known 3-tuples, or codons, which code for amino acids according to the (almost) universal genetic code. For noncoding regions of DNA, however, biologically relevant ‘words’ are not known. Therefore, Mantegna et al. (2) considered n-tuples, where n is a free parameter between 3 and 8. To obtain the different n-tuples needed to perform the ‘linguistic’ analyses, one progressively shifts a ‘reading window’ of length n, one base at a time, along the DNA sequence of interest. Note that there are 4^n different n-tuples, since there are four ‘letters’ (i.e. A, G, C and T) in the ‘alphabet’ used by DNA.
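The reading-window procedure can be illustrated with a minimal Python sketch. The function names `count_ntuples` and `possible_words` are hypothetical and chosen only for this illustration; the sketch assumes the sequence is given as a plain string over the letters A, C, G and T, and it is not the program used by the authors.

```python
from collections import Counter


def count_ntuples(sequence: str, n: int) -> Counter:
    """Count overlapping n-tuples ('words') in a DNA string.

    A reading window of length n is shifted by one base at a time,
    so a sequence of length L yields L - n + 1 words.
    """
    seq = sequence.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def possible_words(n: int) -> int:
    """Number of distinct n-tuples over the four-letter alphabet: 4^n."""
    return 4 ** n


if __name__ == "__main__":
    words = count_ntuples("ACGTACGT", 3)
    print(words.most_common(3))
    print(possible_words(6))
```

For the word length n = 6 used in most of the figures, `possible_words(6)` gives 4096 possible words, which is why short sequences cannot populate all ranks of a Zipf plot.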
To implement the ‘linguistic’ test as given by the Zipf analysis (4), one has to rank all the ‘words’ (in the present case: of a given length, i.e. the n-tuples) in the order of their actual frequency of occurrence in a given DNA sequence. It is then convenient to make a histogram, plotting the logarithm of the frequency of occurrence of an n-tuple against the logarithm of its rank. (This is called, for short, a log–log plot.) According to the claims of Mantegna et al. (2), it appears then, surprisingly, that the produced graph is linear over a significant range of the rank (e.g. if n = 6, the linearity should extend from rank 1 to rank ∼1000). The ‘word’ lengths used were between 3 and 8. This linearity is considered to be the characteristic feature of the so-called Zipf’s law (4). The slope of the graph (if it is linear; see Results below) is called the Zipf exponent.

The same n-tuples are also needed for the second ‘linguistic’ test of Mantegna et al. (2), which is based on Shannon’s information-theoretical concept of entropy (5). According to Shannon, the entity ‘information’ is directly associated with ‘reduction of entropy’. Related to this reduction is also another quantity of information theory, called redundancy. In simple terms, redundancy is the degree to which a given text, which represents an ‘information’, can be understood even when letters are missing and/or incorrect. Therefore this quantity is also a measure of the flexibility of a ‘language’ or a ‘code’.

The mathematically precise definitions of these quantities are as follows (5). The entropy (or better: the n-entropy) H(n) is given by

$$H(n) = -\sum_{j=1}^{4^n} p_j \log_2 p_j \qquad (1)$$

where n is the (constant) length of all ‘words’ and p_j is the relative frequency of occurrence of the j-th n-tuple. The redundancy Re is defined through a limit, i.e.

$$Re = \lim_{n \to \infty} Re(n) \quad \text{with} \quad Re(n) = 1 - \frac{H(n)}{kn} \qquad (2)$$

where, by convention, k = log_2 4 = 2 (see for example ref. 2). The maximum value of n for which it is possible to determine the n-entropy appears to be n = 6.
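As a hedged illustration of the n-entropy and of Re(n) as defined above, the following Python sketch estimates both from overlapping n-tuple counts. The function names `n_entropy` and `redundancy` are hypothetical; the probabilities p_j are assumed to be estimated by the observed relative frequencies, and n-tuples that never occur contribute nothing to the sum.

```python
import math
from collections import Counter


def n_entropy(sequence: str, n: int) -> float:
    """Estimate H(n) = -sum_j p_j * log2(p_j) over the observed n-tuples."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    # Words with zero count are absent from `counts` and add nothing.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(sequence: str, n: int, k: float = 2.0) -> float:
    """Re(n) = 1 - H(n) / (k * n), with k = log2(4) = 2 for DNA."""
    return 1.0 - n_entropy(sequence, n) / (k * n)


if __name__ == "__main__":
    text = "ACGT" * 50
    for n in range(1, 4):
        print(n, round(n_entropy(text, n), 3), round(redundancy(text, n), 3))
```

A sequence in which all four bases are equally frequent has H(1) = 2 bits and therefore Re(1) = 0, while a sequence consisting of a single repeated base has H(1) = 0 and Re(1) = 1.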
For larger n-values too many possible words are rarely present, i.e. they exhibit extremely bad statistics which obscure the numerical values of H(n) and Re(n) (cf. ref. 2).

As mentioned above, it was claimed (2) that these two ‘linguistic’ tests reveal significant differences between coding and noncoding parts of natural DNAs. Furthermore, it was found that the analysed noncoding DNA sequences exhibit larger values of redundancy than did the coding DNAs, which suggests, as Mantegna et al. (2) put it, “the possible existence of one (or more than one) structured biological language present in noncoding DNA sequences”.

RESULTS

(a) Linearity of Zipf graphs

To check this claim, i.e. the Zipf-like scaling behaviour, we calculated and displayed graphically the Zipf plot of different coding and noncoding DNA sequences. The main features of these graphs are qualitatively in agreement with those displayed in Figures 1 and 2 of the original paper of Mantegna et al. (2). To be concrete, we present here (in Fig. 1a) the corresponding graphs, for 6-tuples, of the human sequence HSRETBLAS (1.5% coding) and the Escherichia coli sequence ECUW89 (82.1% coding) DNAs, as also studied by Mantegna et al. (2). (The mentioned acronyms are the identification codes of the EMBL database.) However, one should observe that these plots are double-logarithmic, which makes it very difficult to assess quantitatively whether the slope is really constant or not. Therefore, we also calculated numerically the slopes of these graphs, which are now displayed in a linear scale in Figure 1b, together with the corresponding Zipf plots (Fig. 1a). It is seen that these slopes, instead of being constant, are clearly curved and monotonically increasing. Similar results were obtained for almost all DNA sequences we analysed. In summary, we failed to find constant slopes of the claimed extension (2), i.e. about three orders of magnitude of the ‘word’ rank.
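The local-slope check can be sketched in Python as follows. This is a minimal illustration, assuming (as stated in the legend of Figure 1) that each slope value is an ordinary least-squares fit over a block of 100 consecutive ranked ‘words’ in the log–log plot; the function names and the use of base-10 logarithms are assumptions made here, not details taken from the original program.

```python
import math
from collections import Counter


def zipf_points(sequence: str, n: int) -> list:
    """Return (log10 rank, log10 relative frequency) for all observed n-tuples."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    return [(math.log10(rank), math.log10(f))
            for rank, f in enumerate(freqs, start=1)]


def least_squares_slope(points: list) -> float:
    """Slope of the ordinary least-squares line through the given points."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def local_slopes(points: list, block: int = 100) -> list:
    """Fit one slope per consecutive block of `block` ranked words."""
    if block < 2:
        raise ValueError("block must contain at least two points")
    return [least_squares_slope(points[start:start + block])
            for start in range(0, len(points) - block + 1, block)]
```

If Zipf’s law held exactly, `local_slopes(zipf_points(sequence, 6))` would return nearly identical values for every block; a systematic drift across blocks indicates a curved graph.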
(b) Zipf graphs of coding and noncoding DNAs

The DNA sequences we studied show the following qualitative difference: the graphs of the noncoding sequences are usually ‘steeper’ than those of the (mostly) coding ones. This result supports qualitatively the finding (2) that the Zipf exponent is larger, by ∼50%, for the noncoding sequences. In order to quantify this finding we applied the chi-square test to the sequences comparing them with the mean of five highly coding sequences as well as the mean of five nearly noncoding ones (cf. Fig. 2). The chi-square test results in a value of 0.005 if sequences are compared with a similar coding part, but if the coding part differs significantly, the chi-square test will result in values >0.015 (see Table 1). A distinction between highly coding sequences and nearly noncoding sequences seems therefore to be fairly easy. But we found also an exception, as the Zipf graph of the herpesvirus saimiri DNA sequence (HSGEND, 94% coding), shown in Figure 2, clearly demonstrates. The overall behaviour of this graph is identical to that of a typical (mostly) noncoding sequence, which is also reflected in the values of the chi-square test: 0.042 with the mean of highly coding sequences. This proves that the claimed difference (2) between coding and noncoding DNA sequences is not universal.

Table 1. Chi-square test

Sequence (a)    Coding part, % (b)    Chi-square vs. large coding part (c)
HSGEND (d)      94                    0.042
EBV             12                    0.020
RNIGF2          23                    0.020
HSGHCSA          5                    0.018
HSHBB            4                    0.015
AD2             78                    0.001
HEVZXX          89                    0.002
LAMBDA          84                    0.001

(a) EMBL-ID of the sequence. Acronyms are explained in the figure legends.
(b) Coding percentage of the sequence.
(c) Result of the chi-square test comparing the Zipf plot of the sequence with the mean Zipf plot of five mostly coding sequences (cf. Fig. 2).
(d) Note the exceptional behaviour of the highly coding sequence HSGEND, which behaves like a ‘noncoding’ DNA-sequence; cf. the text.

Figure 1. (a) 6-tuple Zipf plot of the human sequence HSRETBLAS (solid line) and E.coli sequence ECUW89 (dashed line), which are studied in (Mantegna et al., 1994). (b) Local slope of the Zipf plots in (a) proving them to be curved. The slope has the tendency to decrease with increasing ‘word’ rank. Every data point represents the mean slope of 100 ‘words’ in the Zipf plot, which are fitted according to the standard linear regression method. The lines are guides to visualise the tendency in slope. Separate symbols mark ECUW89 and HSRETBLAS. Acronyms are EMBL database identification names.

Figure 2. Averaged Zipf graphs of five typical mostly coding (thin solid line) and five typical mostly noncoding (thin dashed line) sequences. In addition the Zipf plot of the herpesvirus saimiri (HSGEND) is presented, which is highly coding (94%). The figure reveals a difference in averaged Zipf plots of coding and noncoding sequences, but disproves the difference to be general for individual DNAs. The used ‘word’ length is 6 bp.

(c) Zipf tests of natural DNAs and computer simulated sequences

Nevertheless, in most cases we studied, there are indeed visible differences between the averaged slopes of the Zipf graphs: coding graphs appear to exhibit a smaller slope, on average, than noncoding DNAs, in agreement with Mantegna et al. (2); for some examples see Figure 3. This qualitative finding motivates the questions whether the observed differences have a biological significance, and how they can be quantified properly.

Trying to clarify these questions concretely and quantitatively, we made the same ‘linguistic’ Zipf analysis also on a large number of artificial, computer-generated sequences using random number generators. For details of calculation of such sequences see refs 8–10. We found that the Zipf graphs of these sequences exhibit a qualitatively similar form as those of natural DNAs. In the light of this qualitative finding, one may wonder about the ‘reasons’ and/or ‘origins’ of the observed specific form of the Zipf graphs discussed above: can we associate some ‘biological information’ with the observed forms of the Zipf graphs, as suggested by Mantegna et al. (2)? Or are these graphs related to some, thus far unknown, ‘numerical artefacts’?

Figure 3. 6-tuple Zipf plots of different highly coding (thick lines) and mostly noncoding (thin lines) sequences. Shown are the following sequences: Epstein–Barr virus (EBV, solid thin, 12%), rat insulin-like growth factor (RNIGF2, dashed thin, 23%), human growth hormone (HSGHCSA, dotted thin, 5%), human β-globin (HSHBB, dashed-dotted thin, 4%), adenovirus (AD2, solid thick, 78%), herpesvirus (HEVZVXX, dashed thick, 89%) and λ-phage (LAMBDA, dotted thick, 84%). In parentheses are given first the EMBL identification name, second the line shape in the figure and third the coding percentage of the sequence.

Figure 5. Difference in quality of Zipf plot simulation in dependence of the simulation parameter Di. Presented are the Zipf plots of the human β-globin sequence HSHBB (Di = 100 bp; dashed) with an unequal base composition and of the λ-phage genome (Di = 15 bp; solid) with an equal base composition. To achieve a good simulation a much shorter Di has to be chosen for a sequence with an equal base composition than for a sequence with an unequal one. Graphs of natural DNAs are thick, those of artificial sequences thin.

Figure 4. 6-tuple Zipf plot of natural DNAs (thick lines) and associated artificial sequences (thin lines). (a) Yeast chromosome III (SCHRIII, 67% coding), and an associated artificial sequence with simulation parameter Di = 200 bp. (b) ECUW89 (dashed lines, 82% coding) and HSRETBLAS (solid lines, 1.5% coding). The parameter Di in the generation of associated artificial sequences equal 20 bp in the case of ECUW89 and 100 bp for HSRETBLAS.
All Zipf plots of the associated artificial sequence are nearly indistinguishable from those of the natural sequences. Note the different values of the parameter Di. For the explanation see Results subsection (d).

In order to clarify these kinds of questions, we furthermore studied, with the same Zipf analysis, a large number of computer generated artificial sequences (each associated with a given natural DNA) of the following specific kind: every produced artificial sequence has the same length and the same bp composition as the associated natural DNA. Moreover, in every chosen interval Di (with a typical length of, say, 100 bp; see below) around any base position i, both natural and artificial DNAs have almost the same bp composition, i.e. the same composition, up to the natural statistical deviations caused by the finite length of the chosen interval Di; cf. refs 8–10.

The most surprising feature of our computer-simulation results is demonstrated in Figure 4a and b. In Figure 4a, the Zipf analysis of the complete yeast chromosome III sequence [also studied by Mantegna et al. (2)] is presented. It can be seen immediately that the Zipf analyses of both natural and associated artificial sequences (for, say, Di = 200 bp), using 6-tuples as ‘words’, produced essentially indistinguishable graphs. We obtained essentially the same ‘negative’ result for many different natural DNAs, among others: (i) the human HSRETBLAS (cf. above), and (ii) the E.coli sequence ECUW89 (cf. above); see Figure 4b.

These results demonstrate quantitatively that the Zipf analysis (2) is unable to discriminate natural DNAs from the associated computer generated sequences, which furthermore strongly indicates that the Zipf analysis under consideration does not reveal new biological information being coded in DNA (cf. Discussion).
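The generation of an associated artificial sequence, as outlined in the Appendix, can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors’ program: the function name `artificial_sequence` and the `seed` parameter are hypothetical, the input is assumed to contain only the letters A, C, G and T, and Python’s built-in generator (`random.Random`) is swapped in for the routines ran1, ran2, ran3, urng and the Weyl-sequence algorithm cited in the Appendix.

```python
import random

BASES = "ACGT"


def artificial_sequence(natural: str, interval: int, seed=None) -> str:
    """Build a random sequence that mimics the local base composition.

    The natural sequence is cut into consecutive pieces of length
    `interval` (the parameter Di); each piece is replaced by bases drawn
    independently with the relative base frequencies of that piece.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    natural = natural.upper()
    if set(natural) - set(BASES):
        raise ValueError("sequence must contain only A, C, G and T")
    rng = random.Random(seed)  # assumption: not one of the generators of refs 14-16
    pieces = []
    for start in range(0, len(natural), interval):
        piece = natural[start:start + interval]  # last piece may be shorter
        weights = [piece.count(base) for base in BASES]
        pieces.append("".join(rng.choices(BASES, weights=weights, k=len(piece))))
    return "".join(pieces)


if __name__ == "__main__":
    natural = "ATATATATGCGCGCGCAAAATTTT" * 10
    print(artificial_sequence(natural, interval=20, seed=1)[:60])
```

Because the bases inside each piece are drawn independently, any ordering of letters on scales shorter than the interval is destroyed, while the composition is preserved up to statistical fluctuations.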
(d) Dependence on base composition

Although the Zipf graph of every natural DNA can be sufficiently well approximated with the Zipf graphs of associated artificial sequences, as described above, it could be that noncoding and coding DNAs exhibit quantitative differences in the quality of this approximation. Namely, one easily recognises that some DNAs can be approximated (in the considered manner) using larger Di values than others. See for example Figure 5, where it is shown that for the ‘approximation’ of the λ-phage (84% coding) one needs a much smaller Di value than in the case of the human β-globin DNA HSHBB (4% coding).

However, further analyses revealed the following unexpected feature: it is not the coding or noncoding character of a natural DNA which is directly related with the quality of its approximation with artificial sequences, but simply its base composition! More concretely, all DNAs studied thus far revealed that natural DNAs which have unequal mean base composition (i.e., the frequencies of the occurrences of A, G, C and T in the DNA are not 25% each) can be approximated with artificial sequences choosing relatively large Di values (say, some hundreds). On the contrary, if each base has relative occurrence of ∼25% in a natural DNA (which we may call an ‘equal base composition’), one has to proceed to much smaller Di values (say, some tens) until the aforementioned ‘approximation’ becomes satisfying.

Furthermore, application of the Zipf analysis to a large number of artificial sequences of different lengths revealed that the typical form of the Zipf graphs (cf. the aforementioned figures) is mainly determined by the base composition of the sequence. In these investigations, the base composition along each of these artificial sequences is held constant. [This means that no patchiness (11,12) appears in these artificial sequences.] In Figure 6 some Zipf graphs of such sequences are presented. Here one clearly sees that the Zipf graphs of artificial sequences with unequal mean occurrences of the four bases exhibit the typical form and/or curvature of natural DNAs. A more complete discussion of the base composition dependence of the Zipf graphs can be found in Bonhoeffer et al. (Phys. Rev. Lett, in press).

These computer simulation results imply that the forms of the Zipf graphs under consideration are due to plain ‘numerics’, rather than due to ‘biological information’ (cf. Discussion).

Figure 6. 6-tuple Zipf analysis of artificial sequences with a constant base composition along the sequence, but different length. Shown are sequences of length 50 kbp (solid lines), 500 kbp (dashed lines) and 2000 kbp (dotted lines). In addition there are two different base compositions chosen, first an equal base composition (thick lines), i.e. all four bases have a relative occurrence of 25%, and second an unequal base composition (thin lines) with A (10%), T (20%), C (30%) and G (40%). The slope of the Zipf plots shows a strong dependence on the base composition of the sequence.

(e) Shannon’s redundancy analysis

Our numerical investigations based on Shannon’s redundancy Re(n) concept (see Theory, above) produced graphs being similar to those presented by Mantegna et al. (2). As an example, see Figure 7. However, also in the present case, the natural DNA sequences and our associated computer generated sequences yield essentially indistinguishable graphs, although the chosen values of the intervals are much smaller here than those mentioned above in (c). As examples, see Figure 8, where results on a mostly coding DNA (adenovirus, AD2, 78% coding) and a mostly noncoding DNA (human β-globin, HSHBB, 4% coding) are graphically presented.

Moreover, the following observation is, from the biological viewpoint, crucial: the 3-tuples are already known to be the ‘relevant’ words (i.e.
codons) in coding DNA sequences, since they have a well established biological meaning related to amino acid coding. Therefore, one naturally may demand that a successful ‘linguistic’ test clearly shows the specific character of 3-tuples, as compared with 2-tuples, 4-tuples etc., in the case of mostly coding DNAs. An inspection of the redundancy graphs presented by Mantegna et al. (2) and of Figure 8, however, does not satisfy this demand. To be more specific, all Re(n) graphs are just smooth monotonic functions of the word length n, and they exhibit no specific feature at n = 3.

These two findings indicate, in contrast with the claims of Mantegna et al. (2), that the presently considered quantity Re(n) is not appropriate to reveal any new biological information in noncoding DNA sequences (cf. Discussion).

Figure 7. Shannon’s ‘word’ redundancy test for highly coding (full marks) and mostly noncoding (open marks) sequences. Note that one finds no specific features for ‘word’-length 3 even in highly coding sequences, which could be expected, due to the universal genetic code. Sequences shown: ECUW89, LAMBDA, AD2, HSADAG, HSRETBLAS and HSHBB. Acronyms are defined in previous figures and the text.

Figure 8. Shannon’s ‘word’ redundancy test for associated artificial sequences of the AD2 and HSHBB genomes. Shown are analyses of artificial sequences with different simulation parameter Di indicated with different line shapes (Di = 10 bp, dashed dotted; Di = 20 bp, dashed; Di = 50 bp, dotted). The graphs of the artificial sequences are getting nearer to the graphs of the natural DNAs (thick lines) with decreasing Di. Di is much larger than the ‘word’ length n, smearing out information possibly coded with small n-tuples in natural DNA.

DISCUSSION

In this paper the focus is on the quantitative tests of the main claims of Mantegna et al.
(2) concerning the ‘linguistic structure’ of (especially noncoding) DNA sequences of living organisms. The main idea underlying our tests is, simply, to produce artificial (computer generated) sequences having similar bp composition as a natural DNA, and then to perform the same ‘linguistic’ analysis on both natural DNA and associated artificial sequences. Of course, since the artificial sequences are produced with the aid of random number generators, there is absolutely no biological information in these sequences—at least such information coded with oligonucleotides, for instance 6-tuples. Therefore, a successful ‘linguistic’ test must at least fulfil the following criterion: it must be able to discriminate between natural DNAs and their associated artificial sequences. But if, on the contrary, a ‘linguistic’ test does not fulfil this criterion, then we ought to conclude that: (i) the results of this test cannot have any biological significance, i.e. there may be mathematical artefacts, and (ii) the ‘linguistic’ test is not appropriate. In the light of this consideration, the results presented in the Results section demonstrate that the investigated two ‘linguistic’ tests are not successful, since both fail to discriminate between natural DNAs and their associated artificial sequences. In particular, we demonstrated this fact with respect to the Zipf test in Results subsection (c), and with respect to Shannon’s redundancy in subsection (e). Note also that the chosen lengths of the intervals Di are clearly larger than the longest considered ‘words’, i.e. the n-tuples. This remark is important, since the distribution of the ‘letters’ (i.e. the bases) in every interval of length Di of any artificial sequence is completely random. In other words, in these sequences there is certainly no biological information ‘coded’ with n-tuples shorter than Di . Some additional results, which require comment, are as follows. 
(i) The missing linearity of the Zipf graphs, as mentioned already by Konopka and Martindale (7), has been confirmed.
(ii) We revealed an unexpected dependence of the quality of the simulation (or approximation) of natural DNAs on their mean base composition, independently of their coding or noncoding character. This finding indicates that the natural patchiness of DNA (11,12) may also contribute to the appearance of different mean Zipf slopes for coding and noncoding DNAs.
(iii) With an explicit counterexample we proved that the claimed difference between the Zipf slopes of coding and noncoding DNA sequences is not universal.
(iv) The serious weakness and/or missing biological significance of Shannon’s redundancy, in the biological context under consideration, has been demonstrated by the fact that the redundancy graphs of partially or mainly coding DNAs exhibit absolutely no specific feature for codons.

According to the considerations made in subsections (c) and (d), the base distribution seems to be more uneven in noncoding DNA than in coding DNA parts. One speculative explanation might be that noncoding DNA is not subject to any selection pressures and hence its base composition should more or less reflect the availability of nucleotides inside the cell. Coding sequences, however, are subject to selection pressures, because they have to encode certain biological information. This restricts their choice of amino acids and hence must influence the base composition. Our results appear to be in agreement with those of Bonhoeffer et al. recently mentioned in ref. 13.

Summarising, we may conclude that both ‘linguistic’ tests (2; see also 1,3) failed to reveal any new biological information in natural, noncoding or coding, DNA sequences.

ACKNOWLEDGEMENTS

Financial support of the Fonds der Chemischen Industrie (Frankfurt/Main), DAAD (Bonn) and Svenska Institutet (Stockholm) is gratefully acknowledged.
APPENDIX

Here we briefly describe the computer-generation procedure of the artificial sequences, which have the same length and nearly the same base distribution as the corresponding natural DNA sequence (8–10).

(i) One chooses an appropriate interval width Di, which is kept constant during the whole generation of the artificial sequence. Typical Di values vary between 20 and 1000 bp.
(ii) The natural DNA sequence is divided into consecutive DNA pieces of length Di.
(iii) Determine the relative occurrence of all four bases (A, T, C and G) in one DNA piece of length Di.
(iv) The base distribution acts as input for a computer program, which creates a randomly generated nucleotide series of length Di. The main part of the program is a random number generator; see below.
(v) Repeat steps (iii) and (iv) for all DNA pieces of length Di.
(vi) Concatenate all generated nucleotide series in the correct order to give the artificial sequence.

Different random number generators can be found in standard references like (14). The results of this paper are obtained with the routines called ran1, ran2 and ran3 of ref. 14 as well as the routine called urng of ref. 15 and the shuffled nested Weyl-sequence algorithm of ref. 16. All these random number generators produced almost identical results in the present study.

REFERENCES

1 Flam, F. (1994) Science 266, 1320.
2 Mantegna, R. N., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Peng, C.-K., Simons, M. and Stanley, H. E. (1994) Phys. Rev. Lett. 73, 3169–3172.
3 Buchbinder, H. (1995) The Sciences, (issue of May/June) 8–9.
4 Zipf, G. K. (1949) Human Behavior and the Principle of Least Effort, Addison-Wesley, Cambridge, MA.
5 Shannon, C. E. (1948) Bell Syst. Tech. J. 27, 379–423 and 623–655.
6 Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, IL.
7 Konopka, A. K. and Martindale, C. (1995) Science 268, 789.
8 Chatzidimitriou-Dreismann, C. A., Streffer, R. M. F. and Larhammar, D. (1994) Biochim. Biophys. Acta 1217, 181–187.
9 Chatzidimitriou-Dreismann, C. A., Streffer, R. M. F. and Larhammar, D. (1994) Eur. J. Biochem. 224, 365–371.
10 Streffer, R. M. F. (1994) Numerische Untersuchung der fraktalen Korrelationen langer Reichweite in der DNA lebender Organismen und ihre mögliche biologische Information, Diploma thesis, Technische Universität Berlin.
11 Karlin, S. and Brendel, V. (1993) Science 259, 677–680.
12 Larhammar, D. and Chatzidimitriou-Dreismann, C. A. (1993) Nucleic Acids Res. 21, 5167–5170.
13 Bonhoeffer, S., Herz, A. V. M., Boerlijst, M. C., Nee, S., Nowak, M. A. and May, R. M. (1996) Science 271, 14–15.
14 Press, W. H., Flannery, B. P., Teukolsky, S. A. and Vetterling, W. T. (1986) Numerical Recipes, Cambridge University Press, Cambridge.
15 Morris, A. H. (1993) NSWC Library of Mathematics Subroutines, Naval Surface Warfare Center, Dahlgren/Virginia.
16 Holian, B. L., Percus, O. E., Warnock, T. T. and Whitlock, P. A. (1994) Phys. Rev. E 50, 1607–1615.