ABSTRACT
Recently, the application of two statistical methods (related to Zipf's distribution and Shannon's redundancy), called `linguistic' tests, to the primary structure of DNA sequences of living
organisms has excited considerable interest. Of particular importance is the
claim that noncoding DNA sequences in eukaryotes display specific `linguistic'
features, being reminiscent of natural languages. Furthermore, this implies
that noncoding regions of DNA may carry some new, thus far unknown, biological
information which is revealed by these tests. In this paper these claims are
tested quantitatively. With the aid of computer simulations of natural DNA sequences, and by applying the same `linguistic' tests to both
natural and artificial sequences, we investigate in detail the reasons for the appearance of the claimed
`linguistic' features and the associated differences between coding and
noncoding DNAs. The presented results show quantitatively that the `linguistic' tests failed to reveal any new biological information in (noncoding or coding)
DNA.
Recently it was announced (1), reported (2) and commented on (see, for example,
3) that natural DNA sequences, and especially noncoding DNAs, appear to have many
statistical features in common with natural languages. In the original paper (2),
Mantegna et al. performed certain mathematical investigations, called `linguistic
tests', on DNA sequences, which are related to Zipf's distribution (4) and
Shannon's information theory and redundancy analysis (5,6). In short, these tests
seem to reveal significant differences between coding and noncoding parts of
natural DNA sequences.
The first of these linguistic tests is related to the so-called Zipf plot, i.e.
the relation between the relative occurrence of all oligonucleotides of a given
length n [which we will call `words' as in the paper of Mantegna et al. (2)] in a
specific DNA sequence (called a `text') and their rank. It was claimed (2) that:
(i) in a double-logarithmic plot, the graphs of the aforementioned relation for
different DNA sequences are linear, which implies that Zipf's law applies to the
present case, and (ii) the slopes of these graphs for coding and noncoding DNAs
differ significantly.
The second linguistic test uses the information-theoretical `entropy' H(n) (5,6)
of a DNA sequence (i.e. a `text') when it is viewed as a collection of n-tuple
words, as well as the associated `redundancy' defined by Shannon (see below).
This `redundancy' may be considered a property of natural languages, the purpose
of which is to preserve the meaning of a word even in the case of `typographical
errors'. As a result of these investigations, it was claimed (2) that, in clear
contrast with protein-coding DNA segments, the noncoding DNA parts are associated
with a considerable amount of redundancy.
Very recently, however, these findings and/or claims have been strongly
criticised by Konopka and Martindale (7), who stressed, among others, the
following points. (i) Statistical differences between coding and noncoding DNA
have been known at least since 1981 and are even used in routine methods for
discriminating between them; therefore, the claimed novelty of the results was
disputed. (ii) The oligonucleotide frequency distribution in noncoding DNA does
not appear to fit Zipf's law any better than does the distribution in coding
regions; additionally, the presented log-log plots display a nonlinear, rather
than a linear, trend. (iii) It was concluded that both coding and noncoding DNA
regions fit Zipf's law rather poorly, if at all (7). Nevertheless, these
criticisms, being formulated qualitatively, may be subject to dispute.
Since the aforementioned findings and/or claims (2) seem to be interesting and to
have a potential, thus far unknown, biological significance, we examined them in
detail. In order to be as precise and concrete as possible, we concentrated on a
quantitative analysis, the results of which are presented below. Our main
conclusion is that the `linguistic' tests do not reveal any new biological
information in DNA.
In order to apply the aforementioned `linguistic' tests to DNA sequences, the
concept of `word' must be introduced. Of course, in the case of coding
sequences, the biologically relevant `words' are the well-known 3-tuples, or
codons, which code for amino acids according to the (almost) universal genetic
code. For noncoding regions of DNA, however, biologically relevant `words' are
not known. Therefore, Mantegna et al. (2) considered n-tuples, where n is a free
parameter between 3 and 8. To obtain the different n-tuples needed to perform the
`linguistic' analyses, one shifts progressively by one base a `reading window' of
length n along the DNA sequence of interest. Note that there are $4^n$ different
n-tuples, since there are four `letters' (i.e. A, G, C and T) in the
`alphabet' used by DNA.
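The sliding `reading window' described above can be illustrated with a minimal Python sketch. The function name count_ntuples and the toy sequence are illustrative assumptions and are not taken from the original study.

```python
from collections import Counter


def count_ntuples(sequence: str, n: int) -> Counter:
    """Count all overlapping n-tuples ('words') in a DNA sequence.

    The reading window of length n is shifted by one base at a time,
    so a sequence of length L yields L - n + 1 words.
    """
    sequence = sequence.upper()
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Toy example (hypothetical sequence): six overlapping 3-tuples.
counts = count_ntuples("ATGCGATG", 3)
```

For this toy input the windows are ATG, TGC, GCG, CGA, GAT and ATG, so the word ATG is counted twice and every other word once.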
To implement the `linguistic' test as given by the Zipf analysis (4), one has to
rank all the `words' (in the present case: of a given length, i.e. the n-tuples)
in the order of their actual frequency of occurrence in a given DNA sequence. It
is then convenient to make a histogram, plotting the logarithm of the frequency
of occurrence of an n-tuple against the logarithm of its rank. (This is called a
log-log plot for short.) According to the claims of Mantegna et al. (2), it then
appears, surprisingly, that the produced graph is linear over a significant range
of the rank (e.g. if n = 6, the linearity should extend from rank 1 to rank
~1000). The `word' lengths used were between 3 and 8. This linearity is
considered to be the characteristic feature of the so-called Zipf's law (4). The
slope of the graph (if it is linear; see Results below) is called the
Zipf exponent.
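The ranking and the log-log plot described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the use of an ordinary least-squares fit over all ranks is an assumption, since the fitting procedure is not specified here. The fitted slope is negative for a decreasing graph; quoting its magnitude as the exponent is a matter of convention.

```python
import math
from collections import Counter


def zipf_points(sequence: str, n: int) -> list:
    """Return (log10 rank, log10 relative frequency) pairs for all n-tuples."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    return [(math.log10(rank), math.log10(f))
            for rank, f in enumerate(freqs, start=1)]


def least_squares_slope(points: list) -> float:
    """Slope of the straight line fitted to the points by least squares."""
    if len(points) < 2:
        raise ValueError("at least two points are needed to fit a slope")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic check: an exact power law f ~ rank^(-0.5) has slope -0.5.
power_law = [(math.log10(r), math.log10(r ** -0.5)) for r in range(1, 101)]
slope = least_squares_slope(power_law)
```

A single fitted slope is only meaningful if the graph is actually linear, which is the very point examined in the Results.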
The same n-tuples are also needed for the second `linguistic' test of Mantegna
et al. (2), which is based on Shannon's information-theoretical concept of
entropy (5). According to Shannon, the entity `information' is directly
associated with `reduction of entropy'. Related to this reduction is also another
quantity of information theory, called redundancy. In simple terms, redundancy is
the degree to which a given text, which represents an `information', can be
understood even when letters are missing and/or incorrect. Therefore this
quantity is also a measure of the flexibility of a `language' or a `code'.
The mathematically precise definitions of these quantities are as follows (5).
The entropy (or better: the n-entropy) H(n) is given by
$$H(n) = -\sum_{j=1}^{4^n} p_j \log_2 p_j \qquad (1)$$
where n is the (constant) length of all `words' and $p_j$ is the probability
(estimated by the relative frequency) of the j-th n-tuple. The redundancy Re is
defined through a limit, i.e.
$$\mathit{Re} = \lim_{n \to \infty} \mathit{Re}(n) \quad \text{with} \quad \mathit{Re}(n) = 1 - \frac{H(n)}{kn} \qquad (2)$$
where, by convention, $k = \log_2 4 = 2$ (see for example ref. 2). The maximum
value of n for which it is possible to determine the n-entropy appears to be
n = 6. For larger n-values too many possible words are rarely present, i.e. they
exhibit extremely bad statistics which obscure the numerical values of H(n) and
Re(n) (cf. ref. 2).
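Equations 1 and 2 can be evaluated for a finite n with a few lines of Python. This is a minimal sketch, assuming that the word probabilities are estimated by the relative frequencies of the overlapping n-tuples; the function names are illustrative. Words that never occur contribute nothing to the sum, so summing over the observed words is equivalent to summing over all $4^n$ words.

```python
import math
from collections import Counter


def n_entropy(sequence: str, n: int) -> float:
    """n-entropy H(n) of equation 1, in bits, from observed n-tuple frequencies."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(sequence: str, n: int, k: float = 2.0) -> float:
    """Re(n) = 1 - H(n) / (k n) of equation 2, with k = log2(4) = 2."""
    return 1.0 - n_entropy(sequence, n) / (k * n)


# Four equally frequent letters give the maximal 1-entropy of 2 bits,
# and therefore zero redundancy at n = 1.
h1 = n_entropy("ACGT", 1)
re1 = redundancy("ACGT", 1)
```

At the other extreme, a sequence consisting of a single repeated letter has H(1) = 0 and Re(1) = 1.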
As mentioned above, it was claimed (2) that these two `linguistic' tests reveal
significant differences between coding and noncoding parts of natural DNAs.
Furthermore, it was found that the analysed noncoding DNA sequences exhibit
larger values of redundancy than did the coding DNAs, which suggests, as Mantegna
et al. (2) put it, ``the possible existence of one (or more than one) structured
biological language present in noncoding DNA sequences".
To check this claim, i.e. the Zipf-like scaling behaviour, we calculated and
displayed graphically the Zipf plots of different coding and noncoding DNA
sequences. The main features of these graphs are qualitatively in agreement with
those displayed in Figures 1 and 2 of the original paper of Mantegna et al. (2).
To be concrete, we present here (in Fig. 1a) the corresponding graphs, for
6-tuples, of the human sequence HSRETBLAS (1.5% coding) and the Escherichia coli
sequence ECUW89 (82.1% coding), as also studied by Mantegna et al. (2). (The
acronyms mentioned are the identification codes of the EMBL database.)
However, one should observe that these plots are double-logarithmic, which makes
it very difficult to assess quantitatively whether the slope is really constant
or not. Therefore, we also calculated numerically the slopes of these graphs,
which are displayed on a linear scale in Figure 1b, together with the
corresponding Zipf plots (Fig. 1a). It is seen that these slopes, instead of
being constant, are clearly curved and monotonically increasing. Similar results
were obtained for almost all DNA sequences we analysed. In summary, we failed to
find constant slopes of the claimed extension (2), i.e. about three orders of
magnitude of the `word' rank.
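One simple way to examine whether the slope of a Zipf graph is constant is to estimate it locally by finite differences between ranks r and 2r. The following sketch is only an assumption about how such a numerical slope could be obtained; the exact procedure used for Figure 1b is not specified here, and the function name local_slopes is hypothetical.

```python
import math


def local_slopes(frequencies: list, factor: int = 2) -> list:
    """Local slopes of log10(frequency) versus log10(rank).

    `frequencies` must be positive and sorted in descending order (rank 1
    first). The slope is estimated between rank r and rank r * factor,
    for r = 1, factor, factor**2, ...
    """
    slopes = []
    r = 1
    while r * factor <= len(frequencies):
        r2 = r * factor
        dy = math.log10(frequencies[r2 - 1]) - math.log10(frequencies[r - 1])
        dx = math.log10(r2) - math.log10(r)
        slopes.append((r, dy / dx))
        r = r2
    return slopes


# For an exact power law f ~ 1/rank every local slope equals -1.
ideal = [1.0 / rank for rank in range(1, 65)]
slopes = local_slopes(ideal)
```

A sequence obeying Zipf's law would give (nearly) equal local slopes over the whole range, whereas a curved graph gives values that change systematically with the rank.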
The DNA sequences we studied show the following qualitative difference: the
graphs of the noncoding sequences are usually `steeper' than those of the
(mostly) coding ones. This result supports qualitatively the finding (2) that the
Zipf exponent is larger, by ~50%, for the noncoding sequences. In order to
quantify this finding we applied the chi-square test to the sequences, comparing
them with the mean of five highly coding sequences as well as with the mean of
five nearly noncoding ones (cf. Fig. 2). The chi-square test results in a value
of <=0.005 if sequences are compared with a similar coding part, but if the
coding part differs significantly, the chi-square test results in values >0.015
(see Table 1). A distinction between highly coding sequences and nearly noncoding
sequences therefore seems to be fairly easy.
Trying to clarify these questions concretely and quantitatively, we also
performed the same `linguistic' Zipf analysis on a large number of artificial,
computer-generated sequences using random number generators. For details of the
generation of such sequences see refs 8-10. We found that the Zipf graphs of
these sequences exhibit a form qualitatively similar to that of natural DNAs. In
the light of this qualitative finding, one may wonder about the `reasons' and/or
`origins' of the observed specific form of the Zipf graphs discussed above: can
we associate some `biological information' with the observed forms of the Zipf
graphs, as suggested by Mantegna et al. (2)? Or are these graphs related to some,
thus far unknown, `numerical artefacts'?
In order to clarify these kinds of questions, we furthermore studied, with the
same Zipf analysis, a large number of computer-generated artificial sequences,
each associated with a given natural DNA, of the following specific kind: every
produced artificial sequence has the same length and the same bp composition as
the associated natural DNA. Moreover, in every chosen interval Di (with a typical
length of, say, 100 bp; see below) around any base position i, both natural and
artificial DNAs have almost the same bp composition, i.e. the same composition up
to the natural statistical deviations caused by the finite length of the chosen
interval Di; cf. refs 8-10.
The most surprising feature of our computer-simulation results is demonstrated in
Figure 4a and b. In Figure 4a, the Zipf analysis of the complete yeast chromosome
III sequence [also studied by Mantegna et al. (2)] is presented. It can be seen
immediately that the Zipf analyses of both natural and associated artificial
sequences (for, say, Di = 200 bp), using 6-tuples as `words', produced
essentially indistinguishable graphs. We obtained essentially the same `negative'
result for many different natural DNAs, among others: (i) the human HSRETBLAS
(cf. above), and (ii) the E.coli sequence ECUW89 (cf. above); see Figure 4b.
These results demonstrate quantitatively that the Zipf analysis (2) is unable to
discriminate natural DNAs from the associated computer-generated sequences, which
furthermore strongly indicates that the Zipf analysis under consideration does
not reveal new biological information coded in DNA (cf. Discussion).
Although the Zipf graph of every natural DNA can be sufficiently well
approximated with the Zipf graphs of associated artificial sequences, as
described above, it could be that noncoding and coding DNAs exhibit quantitative
differences in the quality of this approximation. Namely, one easily recognises
that some DNAs can be approximated (in the considered manner) using larger Di
values than others. See for example Figure 5, where it is shown that for the
`approximation' of the [lambda]-phage (84% coding) one needs a much smaller Di
value than in the case of the human [beta]-globin DNA HSHBB (4% coding).
However, further analyses revealed the following unexpected feature: it is not
the coding or noncoding character of a natural DNA which is directly related to
the quality of its approximation with artificial sequences, but simply its base
composition! More concretely, all DNAs studied thus far revealed that natural
DNAs which have unequal mean base composition (i.e. the frequencies of occurrence
of A, G, C and T in the DNA are not 25% each) can be approximated with artificial
sequences choosing relatively large Di values (say, some hundreds). On the
contrary, if each base has a relative occurrence of ~25% in a natural DNA (which
we may call an `equal base composition'), one has to proceed to much smaller Di
values (say, some tens) until the aforementioned `approximation' becomes
satisfactory.
Furthermore, application of the Zipf analysis to a large number of artificial
sequences of different lengths revealed that the typical form of the Zipf graphs
(cf. the aforementioned figures) is mainly determined by the base composition of
the sequence. In these investigations, the base composition along each of these
artificial sequences is held constant. [This means that no patchiness (11,12)
appears in these artificial sequences.] In Figure 6 some Zipf graphs of such
sequences are presented. Here one clearly sees that the Zipf graphs of artificial
sequences with unequal mean occurrences of the four bases exhibit the typical
form and/or curvature of natural DNAs. A more complete discussion of the base
composition dependence of the Zipf graphs can be found in Bonhoeffer et al.
(Phys. Rev. Lett., in press).
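The role of the base composition can be made plausible with a small calculation. In the idealised case of independently drawn bases with constant composition (an assumption made for this sketch only), the expected probability of an n-tuple is the product of the probabilities of its bases. The function name expected_zipf_curve and the example compositions below are hypothetical.

```python
import math
from itertools import product


def expected_zipf_curve(base_probs: dict, n: int) -> list:
    """Expected n-tuple probabilities, sorted by rank (largest first).

    Assumes every base is drawn independently with the given probabilities,
    so the probability of a word is the product over its bases.
    """
    return sorted(
        (math.prod(base_probs[b] for b in word)
         for word in product(base_probs, repeat=n)),
        reverse=True,
    )


# Equal base composition: all 64 3-tuples are equally probable.
flat = expected_zipf_curve({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, 3)

# Unequal base composition (hypothetical values): probabilities spread out.
skewed = expected_zipf_curve({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}, 2)
```

Under these assumptions an equal base composition gives a flat expected rank-frequency curve, whereas an unequal composition spreads the word probabilities over a wide range and thus produces a decreasing graph without any `biological information' being present. In a finite sequence, sampling fluctuations are superimposed on this expected curve.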
These computer simulation results imply that the forms of the Zipf graphs under
consideration are due to plain `numerics', rather than due to `biological
information' (cf. Discussion).
Our numerical investigations based on Shannon's redundancy Re(n) concept (see
Theory, above) produced graphs similar to those presented by Mantegna et al. (2).
As an example, see Figure 7. However, also in the present case, the natural DNA
sequences and our associated computer-generated sequences yield essentially
indistinguishable graphs, although the chosen values of the intervals are much
smaller here than those mentioned above in (c).
As examples, see Figure 8, where results on a mostly coding DNA (adenovirus, AD2,
78% coding) and a mostly noncoding DNA (human [beta]-globin, HSHBB, 4% coding)
are graphically presented.
Moreover, the following observation is, from the biological viewpoint, crucial:
the 3-tuples are already known to be the `relevant' words (i.e. codons) in coding
DNA sequences, since they have a well-established biological meaning related to
amino acid coding. Therefore, one may naturally demand that a successful
`linguistic' test clearly shows the specific character of 3-tuples, as compared
with 2-tuples, 4-tuples etc., in the case of mostly coding DNAs. An inspection of
the redundancy graphs presented by Mantegna et al. (2) and of Figure 8, however,
does not satisfy this demand. To be more specific, all Re(n) graphs are just
smooth monotonic functions of the word length n, and they exhibit no specific
feature at n = 3.
These two findings indicate, in contrast with the claims of Mantegna et al. (2),
that the presently considered quantity Re(n) is not appropriate to reveal any new
biological information in noncoding DNA sequences (cf. Discussion).
In this paper the focus is on the quantitative tests of the main claims of
Mantegna et al. (2) concerning the `linguistic structure' of (especially
noncoding) DNA sequences of living organisms. The main idea underlying our tests
is, simply, to produce artificial (computer-generated) sequences having a bp
composition similar to that of a natural DNA, and then to perform the same
`linguistic' analysis on both natural DNA and associated artificial sequences. Of
course, since the artificial sequences are produced with the aid of random number
generators, there is absolutely no biological information in these sequences, at
least no such information coded with oligonucleotides, for instance 6-tuples.
Therefore, a successful `linguistic' test must at least fulfil the following
criterion: it must be able to discriminate between natural DNAs and their
associated artificial sequences. But if, on the contrary, a `linguistic' test
does not fulfil this criterion, then we ought to conclude that: (i) the results
of this test cannot have any biological significance, i.e. they may be
mathematical artefacts, and (ii) the `linguistic' test is not appropriate.
In the light of this consideration, the results presented in the Results section
demonstrate that the two investigated `linguistic' tests are not successful,
since both fail to discriminate between natural DNAs and their associated
artificial sequences. In particular, we demonstrated this fact with respect to
the Zipf test in Results subsection (c), and with respect to Shannon's redundancy
in subsection (e). Note also that the chosen lengths of the intervals Di are
clearly larger than the longest considered `words', i.e. the n-tuples. This
remark is important, since the distribution of the `letters' (i.e. the bases) in
every interval of length Di of any artificial sequence is completely random. In
other words, in these sequences there is certainly no biological information
`coded' with n-tuples shorter than Di. Some additional results, which require
comment, are as follows. (i) The missing linearity of the Zipf graphs, as
mentioned already by Konopka and Martindale (7), has been confirmed. (ii) We
revealed an unexpected dependence of the quality of the simulation (or
approximation) of natural DNAs on their mean base composition, independently of
their coding or noncoding character. This finding indicates that the natural
patchiness of DNA (11,12) may also contribute to the appearance of different mean
Zipf slopes for coding and noncoding DNAs. (iii) With an explicit counterexample
we proved that the claimed difference between the Zipf slopes of coding and
noncoding DNA sequences is not universal. (iv) The serious weakness and/or
missing biological significance of Shannon's redundancy, in the biological
context under consideration, has been demonstrated by the fact that the
redundancy graphs of partially or mainly coding DNAs exhibit absolutely no
specific feature for codons.
According to the considerations made in subsections (c) and (d), the base
distribution seems to be more uneven in noncoding DNA than in coding DNA parts.
One speculative explanation might be that noncoding DNA is not subject to any
selection pressures and hence its base composition should more or less reflect
the availability of nucleotides inside the cell. Coding sequences, however, are
subject to selection pressures, because they have to encode certain biological
information. This restricts their choice of amino acids and hence must
influence the base composition.
Our results appear to be in agreement with those of Bonhoeffer et al. recently
mentioned in ref. 13.
Summarising, we may conclude that both `linguistic' tests (2; see also 1,3)
failed to reveal any new biological information in natural, noncoding or coding,
DNA sequences.
Financial support of the Fonds der Chemischen Industrie (Frankfurt/Main), DAAD (Bonn) and Svenska Institutet (Stockholm) is gratefully acknowledged.
Here we briefly describe the computer-generation procedure of the artificial
sequences, which have the same length and nearly the same base distribution as
the corresponding natural DNA sequence (8-10).
(i) One chooses an appropriate interval width Di, which is kept constant during
the whole generation of the artificial sequence. Typical Di values vary between
20 and 1000 bp.
(ii) The natural DNA sequence is divided into consecutive DNA pieces of length
Di.
(iii) Determine the relative occurrence of all four bases (A, T, C and G) in one
DNA piece of length Di.
(iv) The base distribution acts as input for a computer program, which creates a
randomly generated nucleotide series of length Di. The main part of the program
is a random number generator; see below.
(v) Repeat steps (iii) and (iv) for all DNA pieces of length Di.
(vi) Concatenate all generated nucleotide series in the correct order to give
the artificial sequence.
Different random number generators can be found in standard references like (14).
The results of this paper were obtained with the routines called ran1, ran2 and
ran3 of ref. 14 as well as the routine called urng of ref. 15 and the shuffled
nested Weyl-sequence algorithm of ref. 16. All these random number generators
produced almost identical results in the present study.
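Steps (i) to (vi) of the generation procedure can be sketched in Python as follows. This is an illustration only: it uses the generator of Python's standard random module instead of the routines named above, the function name artificial_sequence is hypothetical, and a final piece shorter than the interval width is simply treated like any other piece (an assumption, since this case is not discussed in the text).

```python
import random
from collections import Counter


def artificial_sequence(natural: str, width: int, rng=None) -> str:
    """Generate an artificial sequence with the local base composition of `natural`.

    The natural sequence is cut into consecutive pieces of length `width`;
    for each piece, bases are drawn independently at random according to
    the relative occurrence of the bases in that piece.
    """
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(natural), width):
        piece = natural[start:start + width]          # step (ii)
        counts = Counter(piece)                       # step (iii)
        bases = list(counts)
        weights = [counts[b] for b in bases]
        generated = rng.choices(bases, weights=weights, k=len(piece))  # step (iv)
        pieces.append("".join(generated))
    return "".join(pieces)                            # step (vi)


# Toy example (hypothetical input): a poly-A stretch followed by mixed bases.
toy = "A" * 50 + "ACGT" * 25
surrogate = artificial_sequence(toy, width=50, rng=random.Random(0))
```

Because the bases within each piece are drawn independently, the artificial sequence reproduces the local base composition only up to statistical fluctuations, and it carries no information in the order of the bases within a piece.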





REFERENCES

