© 1996 Oxford University Press
1676–1681 Nucleic Acids Research, 1996, Vol. 24, No. 9
Lack of biological significance in the ‘linguistic
features’ of noncoding DNA—a quantitative analysis
C. A. Chatzidimitriou-Dreismann1,3,*, R. M. F. Streffer1 and D. Larhammar2,3
1Iwan N. Stranski Institute for Physical and Theoretical Chemistry, Technical University of Berlin, Strasse des 17. Juni
112, D-10623 Berlin, Germany; 2Department of Medical Pharmacology, Uppsala University, Box 593, S-75124
Uppsala, Sweden and 3Institute for Basic Research, Castello Principe Pignatelli, I-86075 Monteroduni (IS),
Molise, Italy
Received January 19, 1996; Revised and Accepted March 14, 1996
ABSTRACT
Recently, the application of two statistical methods
(related to Zipf’s distribution and Shannon’s redundancy), called ‘linguistic’ tests, to the primary structure
of DNA sequences of living organisms has excited
considerable interest. Of particular importance is the
claim that noncoding DNA sequences in eukaryotes
display specific ‘linguistic’ features, being reminiscent
of natural languages. Furthermore, this implies that
noncoding regions of DNA may carry some new, thus
far unknown, biological information which is revealed
by these tests. In this paper these claims are tested
quantitatively. With the aid of computer simulations of
natural DNA sequences, and by applying the same
‘linguistic’ tests to both natural and artificial sequences,
we investigate in detail the reasons for the appearance
of the claimed ‘linguistic’ features and the associated
differences between coding and noncoding DNAs. The
presented results show quantitatively that the ‘linguistic’
tests failed to reveal any new biological information in
(noncoding or coding) DNA.
INTRODUCTION
Recently it was announced (1), reported (2) and commented (see,
for example, 3) that natural DNA sequences, and especially
noncoding DNAs, appear to have many statistical features in
common with natural languages. In the original paper (2), Mantegna
et al. performed certain mathematical investigations—called
‘linguistic tests’—on DNA sequences, which are related to Zipf’s
distribution (4) and Shannon’s information theory and redundancy
analysis (5,6). In short, these tests seem to reveal significant
differences between coding and noncoding parts of natural DNA
sequences.
The first of these linguistic tests is related to the so-called Zipf plot,
i.e. the relation between the relative occurrence of all oligonucleotides of a given length n [which we will call ‘words’ as in the paper
of Mantegna et al. (2)] in a specific DNA sequence (called a ‘text’).
It was claimed (2) that: (i) in a double-logarithmic plot, the graphs
of the aforementioned relation for different DNA sequences are
linear, which implies that Zipf’s law applies to the present case, and
(ii) the slopes of these graphs for coding and noncoding DNAs differ
significantly.
The second linguistic test uses the information-theoretical
‘entropy’ H(n) (5,6) of a DNA sequence (i.e. a ‘text’) when it is
viewed as a collection of n-tuple words, as well as the associated
‘redundancy’ defined by Shannon (see below). This ‘redundancy’
may be considered as a property of natural languages, the purpose
of which being to preserve the meaning of a word also in the case
of ‘typographical errors’. As a result of these investigations, it was
claimed (2) that, in clear contrast with protein coding DNA
segments, the noncoding DNA parts are related with a considerable
amount of redundancy.
Very recently, however, these findings and/or claims have been
strongly criticised by Konopka and Martindale (7) by stressing,
among others, the following points. (i) Statistical differences of
coding and noncoding DNA are known at least since 1981, which
are used even in routine methods for discrimination between them;
therefore, the claimed novelty of the results was not appreciated.
(ii) The oligonucleotide frequency distribution in noncoding DNA
does not appear to fit Zipf’s law any better than does the distribution
in coding regions; additionally the presented log–log plots display a
nonlinear, rather than a linear, trend. (iii) It was concluded that both
coding and noncoding DNA regions fit Zipf’s law rather poorly, if
at all (7). Nevertheless, these criticisms, being formulated qualitatively, may be subject to dispute.
Since the aforementioned findings and/or claims (2) seem to be
interesting and to have a potential, thus far unknown, biological
significance, we examined them in detail. In order to be as precise
and concrete as possible, we concentrated on a quantitative analysis,
the results of which are presented below. Our main conclusion is that
the ‘linguistic’ tests do not reveal any new biological information
in DNA.
THEORY
In order to apply the aforementioned ‘linguistic’ tests to DNA
sequences, the concept of ‘word’ must be introduced. Of course, in
the case of coding sequences, the biologically relevant ‘words’ are
the well-known 3-tuples, or codons, which code for amino acids
according to the (almost) universal genetic code. For noncoding
regions of DNA, however, biologically relevant ‘words’ are not
known.
* To whom correspondence should be addressed at: Iwan N. Stranski Institute for Physical and Theoretical Chemistry, Technical University of Berlin, Strasse
des 17. Juni 112, D-10623 Berlin, Germany
Therefore, Mantegna et al. (2) considered n-tuples, where n is a
free parameter between 3 and 8. To obtain the different n-tuples
needed to perform the ‘linguistic’ analyses, one shifts progressively
by one base a ‘reading window’ of length n along the DNA sequence
of interest. Note that there are 4^n different n-tuples, since there are
four ‘letters’ (i.e. A, G, C and T) in the ‘alphabet’ used by DNA.
To implement the ‘linguistic’ test as given by the Zipf analysis (4),
one has to rank all the ‘words’ (in the present case: of a given length,
i.e. the n-tuples) in the order of their actual frequency of occurrence
in a given DNA sequence. It is then convenient to make a histogram,
plotting the logarithm of the frequency of occurrence of an n-tuple
against the logarithm of its rank. (This is shortly called a log–log
plot.)
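The ranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `count_ntuples` and `zipf_points` and the toy sequence are invented here and are not taken from Mantegna et al. (2). Words that never occur have zero frequency and are simply left out, since their logarithm is undefined.

```python
from collections import Counter
import math


def count_ntuples(sequence: str, n: int) -> Counter:
    """Count all overlapping n-tuples, shifting the reading window by one base."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def zipf_points(sequence: str, n: int) -> list[tuple[float, float]]:
    """Return (log10 rank, log10 relative frequency) pairs, most frequent word first."""
    counts = count_ntuples(sequence, n)
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    return [(math.log10(rank), math.log10(c / total))
            for rank, c in enumerate(ranked, start=1)]


toy = "ATGCGATATATGCGCGATATAT"  # invented toy sequence, not real data
counts = count_ntuples(toy, 3)
points = zipf_points(toy, 3)
```

Plotting the second coordinate of `points` against the first gives the log-log plot; a real analysis would of course use sequences many orders of magnitude longer than this toy string.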
According to the claims of Mantegna et al. (2), it appears then,
surprisingly, that the produced graph is linear over a significant
range of the rank (e.g. if n = 6, the linearity should extend from rank
1 to rank ∼1000). The used ‘word’ lengths were between 3 and 8.
This linearity is considered to be the characteristic feature of the
so-called Zipf’s law (4). The slope of the graph (if it is linear; see
Results below) is called the Zipf exponent.
The same n-tuples are also needed for the second ‘linguistic’ test
of Mantegna et al. (2), which is based on Shannon’s information-theoretical concept of entropy (5). According to Shannon, the entity
‘information’ is directly associated with ‘reduction of entropy’.
Related to this reduction is also another quantity of information
theory, called redundancy. In simple terms, redundancy is the degree
to which a given text, which represents an ‘information’, can be
understood even when letters are missing and/or incorrect. Therefore
this quantity is also a measure of the flexibility of a ‘language’ or
a ‘code’.
The mathematically precise definitions of these quantities are as
follows (5). The entropy (or better: the n-entropy) H(n) is given by
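\[ H(n) = -\sum_{j=1}^{4^n} p_j \log_2 p_j \tag{1} \]
where p_j is the relative frequency of the j-th n-tuple and n is the
(constant) length of all ‘words’. The redundancy Re is defined as a limit, i.e.
\[ Re = \lim_{n \to \infty} Re(n) \quad \text{with} \quad Re(n) = 1 - \frac{H(n)}{kn} \tag{2} \]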
where, by convention, k = log_2 4 = 2 (see for example ref. 2). The
maximum value of n for which it is possible to determine the
n-entropy appears to be n = 6. For larger n-values too many possible
words are rarely present, i.e. they exhibit extremely bad statistics
which obscure the numerical values of H(n) and Re(n), (cf. ref. 2).
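As a rough illustration of these definitions, the following Python sketch estimates the n-entropy and Re(n) from the observed overlapping n-tuples of a sequence. The function names are hypothetical, and the probabilities are simply the observed relative frequencies, which is an assumption of this sketch; as noted above, such estimates become unreliable for larger n because many words are rarely present.

```python
from collections import Counter
import math


def n_entropy(sequence: str, n: int) -> float:
    """H(n) = -sum_j p_j log2 p_j over the observed overlapping n-tuples."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    # Unobserved words have p_j = 0 and contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(sequence: str, n: int, k: float = 2.0) -> float:
    """Re(n) = 1 - H(n) / (k n), with k = log2(4) = 2 for the DNA alphabet."""
    return 1.0 - n_entropy(sequence, n) / (k * n)
```

For the four-letter string "ACGT" with n = 1, every base has frequency 1/4, so H(1) = 2 bits and Re(1) = 0; a string consisting of a single repeated base gives H(n) = 0 and Re(n) = 1.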
As mentioned above, it was claimed (2) that these two ‘linguistic’
tests reveal significant differences between coding and noncoding
parts of natural DNAs. Furthermore, it was found that the analysed
noncoding DNA sequences exhibit larger values of redundancy than
did the coding DNAs, which suggests, as Mantegna et al. (2) put it,
“the possible existence of one (or more than one) structured
biological language present in noncoding DNA sequences”.
RESULTS
(a) Linearity of Zipf graphs
To check this claim, i.e. the Zipf-like scaling behaviour, we
calculated and displayed graphically the Zipf plot of different coding
and noncoding DNA sequences. The main features of these graphs
are qualitatively in agreement with those displayed in Figures 1 and
2 of the original paper of Mantegna et al. (2). To be concrete, we
present here (in Fig. 1a) the corresponding graphs, for 6-tuples, of
the human sequence HSRETBLAS (1.5% coding) and the
Escherichia coli sequence ECUW89 (82.1% coding) DNAs, as also
studied by Mantegna et al. (2). (The mentioned acronyms are the
identification codes of the EMBL database.) However, one should
observe that these plots are double-logarithmic, which makes it very
difficult to assess quantitatively whether the slope is really constant
or not. Therefore, we calculated also numerically the slopes of these
graphs, which are now displayed in a linear scale in Figure 1b,
together with the corresponding Zipf plots (Fig. 1a). It is seen that
these slopes, instead of being constant, are clearly curved and
monotonically increasing. Similar results were obtained for almost
all DNA sequences we analysed. In summary, we failed to find
constant slopes of the claimed extension (2), i.e. about three orders
of magnitude of the ‘word’ rank.
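A local-slope calculation of this kind can be sketched as follows. The legend of Figure 1 states that every data point is the mean slope over 100 ‘words’, fitted by standard linear regression; the use of consecutive, non-overlapping blocks and the function names below are assumptions of this sketch rather than details given in the text.

```python
import math


def linear_regression_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def local_slopes(ranked_freqs, block=100):
    """Slope of log10(frequency) vs log10(rank) in consecutive blocks of ranks."""
    slopes = []
    for start in range(0, len(ranked_freqs) - block + 1, block):
        ranks = range(start + 1, start + block + 1)
        xs = [math.log10(r) for r in ranks]
        ys = [math.log10(f) for f in ranked_freqs[start:start + block]]
        slopes.append(linear_regression_slope(xs, ys))
    return slopes


# Synthetic check: an exact power law f = rank**-0.5 has the same slope everywhere.
power_law = [rank ** -0.5 for rank in range(1, 1001)]
slopes = local_slopes(power_law)
```

If Zipf's law held exactly, the list of local slopes would be constant, as it is for the synthetic power law used here; a systematic drift of the local slope with rank indicates a curved log-log graph.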
(b) Zipf graphs of coding and noncoding DNAs
The DNA sequences we studied show the following qualitative
difference: the graphs of the noncoding sequences are usually
‘steeper’ than those of the (mostly) coding ones. This result supports
qualitatively the finding (2) that the Zipf exponent is larger, by
∼50%, for the noncoding sequences. In order to quantify this finding
we applied the chi-square test to the sequences comparing them with
the mean of five highly coding sequences as well as the mean of five
nearly noncoding ones (cf. Fig. 2). The chi-square test results in a
value of 0.005 if sequences are compared with a similar coding
part, but if the coding part differs significantly, the chi-square test
will result in values >0.015 (see Table 1). A distinction between
highly coding sequences and nearly noncoding sequences seems
therefore to be fairly easy.
However, we also found an exception, as the Zipf graph of the
herpesvirus saimiri DNA sequence (HSGEND, 94% coding),
shown in Figure 2, clearly demonstrates. The overall behaviour of
this graph is identical to that of a typical (mostly) noncoding
sequence, which is also reflected in the value of the chi-square test:
0.042 with the mean of highly coding sequences. This proves that
the claimed difference (2) between coding and noncoding DNA
sequences is not universal.
Table 1. Chi-square test

Sequence (a)    Coding part (b) (%)    Large coding part (c)
HSGEND (d)      94                     0.042
EBV             12                     0.020
RNIGF2          23                     0.020
HSGHCSA          5                     0.018
HSHBB            4                     0.015
AD2             78                     0.001
HEVZXX          89                     0.002
LAMBDA          84                     0.001

(a) EMBL-ID of the sequence. Acronyms are explained in the figure legends.
(b) Coding percentage of the sequence.
(c) Result of the chi-square test comparing the Zipf plot of the sequence with the
mean Zipf plot of five mostly coding sequences (cf. Fig. 2).
(d) Note the exceptional behaviour of the highly coding sequence HSGEND,
which behaves like a ‘noncoding’ DNA-sequence; cf. the text.
Figure 2. Averaged Zipf graphs of five typical mostly coding (thin solid line)
and five typical mostly noncoding (thin dashed line) sequences. In addition the
Zipf plot of the herpesvirus saimiri (HSGEND) is presented, which is highly
coding (94%). The figure reveals a difference in averaged Zipf plots of coding
and noncoding sequences, but disproves the difference to be general for
individual DNAs. The used ‘word’ length is 6 bp.
Figure 1. (a) 6-tuple Zipf plot of the human sequence HSRETBLAS (solid line)
and E.coli sequence ECUW89 (dashed line), which are studied in (Mantegna
et al., 1994). (b) Local slope of the Zipf plots in (a) proving them to be curved.
The slope has the tendency to decrease with increasing ‘word’ rank. Every data
point represents the mean slope of 100 ‘words’ in the Zipf plot, which are fitted
according to the standard linear regression method. The lines are guides to
visualise the tendency in slope. , ECUW89; , HSRETBLAS. Acronyms
are EMBL database identification names.
(c) Zipf tests of natural DNAs and computer simulated
sequences
Nevertheless, in most cases we studied, there are indeed visible
differences between the averaged slopes of the Zipf graphs: coding
graphs appear to exhibit a smaller slope, on average, than noncoding
DNAs, in agreement with Mantegna et al. (2); for some examples
see Figure 3. This qualitative finding motivates the questions
whether the observed differences have a biological significance, and
how they can be quantified properly.
Trying to clarify these questions concretely and quantitatively, we
made the same ‘linguistic’ Zipf analysis also on a large number of
artificial, computer-generated sequences using random number
generators. For details of calculation of such sequences see refs
8–10. We found that the Zipf graphs of these sequences exhibit a
form qualitatively similar to that of natural DNAs. In the light of
this qualitative finding, one may wonder about the ‘reasons’ and/or
‘origins’ of the observed specific form of the Zipf graphs discussed
above: can we associate some ‘biological information’ with the
observed forms of the Zipf graphs, as suggested by Mantegna
et al. (2)? Or are these graphs related to some, thus far unknown,
‘numerical artefacts’?
Figure 3. 6-tuple Zipf plots of different highly coding (thick lines) and mostly
noncoding (thin lines) sequences. Shown are the following sequences:
Epstein–Barr virus (EBV, solid thin, 12%), rat insulin-like growth factor
(RNIGF2, dashed thin, 23%), human growth hormone (HSGHCSA, dotted
thin, 5%), human β-globin (HSHBB, dashed-dotted thin, 4%), adenovirus
(AD2, solid thick, 78%), herpesvirus (HEVZVXX, dashed thick, 89%) and
λ-phage (LAMBDA, dotted thick, 84%). In parentheses are given first the
EMBL identification name, second the line shape in the figure and third the
coding percentage of the sequence.
Figure 5. Difference in quality of Zipf plot simulation in dependence of the
simulation parameter Di . Presented are the Zipf plots of the human β-globin
sequence HSHBB (Di = 100 bp; dashed) with an unequal base composition and
of the λ-phage genome (Di = 15 bp; solid) with an equal base composition. To
achieve a good simulation a much shorter Di has to be chosen for a sequence
with an equal base composition than for a sequence with an unequal one.
Graphs of natural DNAs are thick, those of artificial sequences thin.
Figure 4. 6-tuple Zipf plot of natural DNAs (thick lines) and associated
artificial sequences (thin lines). (a) Yeast chromosome III (SCHRIII, 67%
coding), and an associated artificial sequence with simulation parameter
Di = 200 bp. (b) ECUW89 (dashed lines, 82% coding) and HSRETBLAS (solid
lines, 1.5% coding). The parameter Di in the generation of associated artificial
sequences equals 20 bp in the case of ECUW89 and 100 bp for HSRETBLAS.
All Zipf plots of the associated artificial sequence are nearly indistinguishable from
those of the natural sequences. Note the different values of the parameter Di . For
the explanation see Results subsection (d).
In order to clarify these kinds of questions, we studied
furthermore, with the same Zipf analysis, a large number of
computer generated artificial sequences—being associated with a
given natural DNA—of the following specific kind: every
produced artificial sequence has the same length and the same bp
composition as the associated natural DNA. Moreover, in every
chosen interval Di (with a typical length of, say 100 bp; see below)
around any base position i, both natural and artificial DNAs have
almost the same bp composition, i.e. the same composition, up to
the natural statistical deviations caused by the finite length of the
chosen interval Di ; cf. refs 8–10.
The most surprising feature of our computer-simulation results
is demonstrated in Figure 4a and b. In Figure 4a, the Zipf analysis
of the complete yeast chromosome III sequence [also studied by
Mantegna et al. (2)] is presented. It can be seen immediately that the
Zipf analyses of both natural and associated artificial sequences (for,
say, Di = 200 bp), using 6-tuples as ‘words’ produced essentially
indistinguishable graphs. We obtained essentially the same ‘negative’ result for many different natural DNAs, among others: (i) the
human HSRETBLAS (cf. above), and (ii) the E.coli sequence
ECUW89 (cf. above); see Figure 4b.
These results demonstrate quantitatively that the Zipf analysis (2)
is unable to discriminate natural DNAs from the associated
computer generated sequences, which furthermore strongly indicates that the Zipf analysis under consideration does not reveal new
biological information being coded in DNA (cf. Discussion).
(d) Dependence on base composition
Although the Zipf graph of every natural DNA can be sufficiently
well approximated with the Zipf graphs of associated artificial
sequences, as described above, it could be that noncoding and
coding DNAs exhibit quantitative differences in the quality of this
approximation. Namely, one easily recognises that some DNAs can
be approximated (in the considered manner) using larger Di values
than others. See for example Figure 5, where it is shown that for the
‘approximation’ of the λ-phage (84% coding) one needs a much
smaller Di value than in the case of the human β-globin DNA
HSHBB (4% coding).
However, further analyses revealed the following unexpected
feature: it is not the coding or noncoding character of a natural
DNA which is directly related to the quality of its approximation with artificial sequences, but simply its base composition!
More concretely, all DNAs studied thus far revealed that natural
DNAs which have unequal mean base composition (i.e., the
frequencies of the occurrences of A, G, C and T in the DNA are
not 25% each) can be approximated with artificial sequences
choosing relatively large Di values (say, some hundreds). On the
contrary, if each base has relative occurrence of ∼25% in a natural
DNA (which we may call an ‘equal base composition’), one has
to proceed to much smaller Di values (say, some tens) until the
aforementioned ‘approximation’ becomes satisfactory.
Figure 6. 6-tuple Zipf analysis of artificial sequences with a constant base
composition along the sequence, but different length. Shown are sequences of
length 50 kbp (solid lines), 500 kbp (dashed lines) and 2000 kbp (dotted lines).
In addition there are two different base compositions chosen, first an equal base
composition (thick lines), i.e. all four bases have a relative occurrence of 25%,
and second an unequal base composition (thin lines) with A (10%), T (20%),
C (30%) and G (40%). The slope of the Zipf plots shows a strong dependence
on the base composition of the sequence.
Furthermore, application of the Zipf analysis to a large number of
artificial sequences of different lengths revealed that the typical form
of the Zipf graphs (cf. the aforementioned figures) is mainly
determined by the base composition of the sequence. In these
investigations, the base composition along each of these artificial
sequences is held constant. [This means that no patchiness (11,12)
appears in these artificial sequences.] In Figure 6 some Zipf graphs
of such sequences are presented. Here one clearly sees that the Zipf
graphs of artificial sequences with unequal mean occurrences of the
four bases, exhibit the typical form and/or curvature of natural
DNAs. A more complete discussion of the base composition
dependence of the Zipf graphs can be found in Bonhoeffer et al.
(Phys. Rev. Lett, in press).
These computer simulation results imply that the forms of the Zipf
graphs under consideration are due to plain ‘numerics’, rather than
due to ‘biological information’ (cf. Discussion).
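Why base composition alone can shape a Zipf graph may be illustrated with a small idealised calculation. If the bases are drawn independently with fixed probabilities (an assumption of this sketch, corresponding to an infinitely long artificial sequence without patchiness), the expected frequency of an n-tuple is just the product of its base probabilities. The function name `expected_zipf` is hypothetical; the unequal composition is the one quoted in the legend of Figure 6.

```python
from itertools import product
import math


def expected_zipf(base_probs: dict[str, float], n: int) -> list[float]:
    """Expected n-tuple probabilities, sorted descending, for independent bases."""
    probs = [math.prod(base_probs[b] for b in word)
             for word in product(base_probs, repeat=n)]
    return sorted(probs, reverse=True)


equal = expected_zipf({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, 6)
unequal = expected_zipf({"A": 0.10, "T": 0.20, "C": 0.30, "G": 0.40}, 6)

# Ratio between the most and least probable 6-tuple: 1 for equal composition,
# (0.4 / 0.1) ** 6 = 4096 for the unequal composition of Figure 6.
spread_equal = equal[0] / equal[-1]
spread_unequal = unequal[0] / unequal[-1]
```

With equal composition all 4096 expected frequencies coincide, so any structure in the ranked graph of a finite sequence comes from sampling fluctuations only; with unequal composition the expected frequencies already span more than three orders of magnitude, which yields a steeply falling ranked graph without any ‘language’ being involved.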
(e) Shannon’s redundancy analysis
Our numerical investigations based on Shannon’s redundancy Re(n)
concept (see Theory, above) produced graphs being similar to those
presented by Mantegna et al. (2). As an example, see Figure 7.
However, also in the present case, the natural DNA sequences and
our associated computer generated sequences yield essentially
indistinguishable graphs, although the chosen values of the intervals
are much smaller here than those mentioned above in (c).
As examples, see Figure 8, where results on a mostly coding DNA
(adenovirus, AD2, 78% coding) and a mostly noncoding DNA
(human β-globin, HSHBB, 4% coding) are graphically presented.
Moreover, the following observation is, from the biological
viewpoint, crucial: the 3-tuples are already known to be the
‘relevant’ words (i.e. codons) in coding DNA sequences, since they
have a well established biological meaning related to amino acid
coding. Therefore, one naturally may demand that a successful
‘linguistic’ test clearly shows the specific character of 3-tuples, as
compared with 2-tuples, 4-tuples etc., in the case of mostly coding
DNAs. An inspection of the redundancy graphs presented by
Mantegna et al. (2) and of Figure 8, however, does not satisfy this
demand. To be more specific, all Re(n) graphs are just smooth
monotonic functions of the word length n, and they exhibit no
specific feature at n = 3.
Figure 7. Shannon’s ‘word’ redundancy test for highly coding (full marks) and
mostly noncoding (open marks) sequences. Note that one finds no specific
features for ‘word’-length 3 even in highly coding sequences, which could be
expected, due to the universal genetic code. J, ECUW89; F, LAMBDA;z,
AD2; f, HSADAG; j, HSRETBLAS; ∆, HSHBB. Acronyms are defined in
previous figures and the text.
Figure 8. Shannon’s ‘word’ redundancy test for associated artificial sequences
of the AD2 (J,j) and HSHBB (F,f) genome. Shown are analyses of
artificial sequences with different simulation parameter Di indicated with
different line shapes. (Di = 10 bp, dashed dotted; Di = 20 bp, dashed; Di = 50
bp, dotted). The graphs of the artificial sequences are getting nearer to the
graphs of the natural DNAs (thick lines) with decreasing Di . Di is much larger
than the ‘word’ length n, smearing out information possibly coded with small
n-tuples in natural DNA.
These two findings indicate—in contrast with the claims of
Mantegna et al. (2)—that the presently considered quantity Re(n) is
not appropriate to reveal any new biological information in
noncoding DNA sequences (cf. Discussion).
DISCUSSION
In this paper the focus is on the quantitative tests of the main claims
of Mantegna et al. (2) concerning the ‘linguistic structure’ of
(especially noncoding) DNA sequences of living organisms. The
main idea underlying our tests is, simply, to produce artificial
(computer generated) sequences having a bp composition similar to that of
a natural DNA, and then to perform the same ‘linguistic’ analysis on
both natural DNA and associated artificial sequences. Of course,
since the artificial sequences are produced with the aid of random
number generators, there is absolutely no biological information
in these sequences—at least such information coded with
oligonucleotides, for instance 6-tuples.
Therefore, a successful ‘linguistic’ test must at least fulfil the
following criterion: it must be able to discriminate between natural
DNAs and their associated artificial sequences. But if, on the
contrary, a ‘linguistic’ test does not fulfil this criterion, then we ought
to conclude that: (i) the results of this test cannot have any biological
significance, i.e. there may be mathematical artefacts, and (ii) the
‘linguistic’ test is not appropriate.
In the light of this consideration, the results presented in the
Results section demonstrate that the investigated two ‘linguistic’
tests are not successful, since both fail to discriminate between
natural DNAs and their associated artificial sequences. In particular,
we demonstrated this fact with respect to the Zipf test in Results
subsection (c), and with respect to Shannon’s redundancy in
subsection (e). Note also that the chosen lengths of the intervals Di
are clearly larger than the longest considered ‘words’, i.e. the
n-tuples. This remark is important, since the distribution of the
‘letters’ (i.e. the bases) in every interval of length Di of any artificial
sequence is completely random. In other words, in these sequences
there is certainly no biological information ‘coded’ with n-tuples
shorter than Di . Some additional results, which require comment,
are as follows. (i) The missing linearity of the Zipf graphs, as
mentioned already by Konopka and Martindale (7) has been
confirmed. (ii) We revealed an unexpected dependence of the
quality of the simulation (or approximation) of natural DNAs on
their mean base composition, independently of their coding or
noncoding character. This finding indicates that the natural
patchiness of DNA (11,12) may also contribute to the appearance
of different mean Zipf slopes for coding and noncoding DNAs.
(iii) With an explicit counterexample we proved that the claimed
difference between the Zipf slopes of coding and noncoding DNA
sequences is not universal. (iv) The serious weakness and/or missing
biological significance of Shannon’s redundancy, in the biological
context under consideration, has been demonstrated by the fact that
the redundancy graphs of partially or mainly coding DNAs exhibit
absolutely no specific feature for codons.
According to the considerations made in subsections (c) and (d),
the base distribution seems to be more uneven in noncoding DNA
than in coding DNA parts. One speculative explanation might be
that noncoding DNA is not subject to any selection pressures and
hence its base composition should more or less reflect the
availability of nucleotides inside the cell. Coding sequences however
are subject to selection pressures, because they have to encode
certain biological information. This restricts their choice of amino
acids and hence must influence the base composition.
Our results appear to be in agreement with those of Bonhoeffer et
al. recently mentioned in ref. 13.
Summarising, we may conclude that both ‘linguistic’ tests (2; see
also 1,3) failed to reveal any new biological information in natural,
noncoding or coding, DNA sequences.
ACKNOWLEDGEMENTS
Financial support of the Fonds der Chemischen Industrie
(Frankfurt/Main), DAAD (Bonn) and Svenska Institutet (Stockholm) is gratefully acknowledged.
APPENDIX
Here we will briefly describe the computer-generation procedure of
the artificial sequences, which have the same length and nearly the
same base distribution as the corresponding natural DNA sequence
(8–10).
(i) One chooses an appropriate interval width Di, which is kept
constant during the whole generation of the artificial sequence.
Typical Di values vary between 20 and 1000 bp.
(ii) The natural DNA sequence is divided into consecutive DNA pieces of length Di.
(iii) Determine the relative occurrence of all four bases (A, T, C and
G) in one DNA piece of length Di.
(iv) The base distribution acts as input for a computer program,
which creates a randomly generated nucleotide series of length Di.
The main part of the program is a random number generator; see
below.
(v) Repeat steps (iii) and (iv) for all DNA pieces of length Di.
(vi) Concatenate all generated nucleotide series in the correct order
to give the artificial sequence.
Different random number generators can be found in standard
references like (14). The results of this paper were obtained with the
routines called ran1, ran2 and ran3 of ref. 14 as well as the routine
called urng of ref. 15 and the shuffled nested Weyl-sequence
algorithm of ref. 16. All these random number generators produced
almost identical results in the present study.
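Steps (i) to (vi) can be sketched in Python as follows. This is a minimal illustration, not the original program: the function name `artificial_sequence` is hypothetical, Python's built-in Mersenne Twister generator is swapped in for the routines of refs 14–16, and the input is assumed to contain only the letters A, T, C and G.

```python
import random
from collections import Counter


def artificial_sequence(natural: str, d_i: int, seed: int = 0) -> str:
    """Random sequence whose base composition follows `natural` piece by piece.

    Assumes `natural` contains only A, T, C and G.
    """
    rng = random.Random(seed)  # Mersenne Twister, standing in for ran1/ran2/ran3
    bases = "ATCG"
    pieces = []
    for start in range(0, len(natural), d_i):
        piece = natural[start:start + d_i]    # step (ii); the last piece may be shorter
        counts = Counter(piece)               # step (iii): base occurrences in this piece
        weights = [counts[b] for b in bases]
        # Step (iv): draw len(piece) bases at random with these relative occurrences.
        pieces.append("".join(rng.choices(bases, weights=weights, k=len(piece))))
    return "".join(pieces)                    # step (vi): concatenate in order


toy = "A" * 10 + "C" * 10 + "G" * 10 + "T" * 10  # invented toy input
art = artificial_sequence(toy, 10)
```

Because each piece is sampled rather than shuffled, the composition within an interval matches that of the natural piece only up to statistical deviations, in line with the description above; in the degenerate toy input every piece contains a single base, so the artificial sequence reproduces it exactly.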
REFERENCES
1 Flam, F. (1994) Science 266, 1320.
2 Mantegna, R. N., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Peng, C.-K.,
Simons, M. and Stanley, H. E. (1994) Phys. Rev. Lett. 73, 3169–3172.
3 Buchbinder, H. (1995) The Sciences, (issue of May/June) 8–9.
4 Zipf, G. K. (1949) Human Behavior and the Principle of Least Effort,
Addison-Wesley, Cambridge, MA.
5 Shannon, C. E. (1948) Bell Syst. Tech. J. 27, 379–423 and 623–655.
6 Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of
Communication, Univ. of Illinois Press, Urbana, IL.
7 Konopka, A. K. and Martindale, C. (1995) Science 268, 789.
8 Chatzidimitriou-Dreismann, C. A., Streffer, R. M. F. and Larhammar, D.
(1994) Biochim. Biophys. Acta 1217, 181–187.
9 Chatzidimitriou-Dreismann, C. A., Streffer, R. M. F. and Larhammar, D.
(1994) Eur. J. Biochem. 224, 365–371.
10 Streffer, R. M. F. (1994) Numerische Untersuchung der fraktalen
Korrelationen langer Reichweite in der DNA lebender Organismen und
ihre mögliche biologische Information, Diploma thesis, Technische
Universität Berlin.
11 Karlin, S. and Brendel, V. (1993) Science 259, 677–680.
12 Larhammar, D. and Chatzidimitriou-Dreismann, C. A. (1993) Nucleic
Acids Res. 21, 5167–5170.
13 Bonhoeffer, S., Herz, A. V. M., Boerlijst, M. C., Nee, S., Nowak, M. A.,
May, R. M. (1996) Science, 271, 14–15.
14 Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling, W. T. (1986)
Numerical Recipes, Cambridge University Press, Cambridge.
15 Morris, A. H. (1993) NSWC Library of Mathematics Subroutines, Naval
Surface Warfare Center, Dahlgren/Virginia.
16 Holian, B. L., Percus, O. E., Warnock, T. T. and Whitlock, P. A. (1994)
Phys. Rev. E 50, 1607–1615.