Prediction and Entropy of Printed English

By C. E. SHANNON

(Manuscript received Sept. 15, 1950)
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.
1. INTRODUCTION
In a previous paper¹ the entropy and redundancy of a language have been defined. The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language. The redundancy, on the other hand, measures the amount of constraint imposed on a text in the language due to its statistical structure, e.g., in English the high frequency of the letter E, the strong tendency of H to follow T or of U to follow Q. It was estimated that when statistical effects extending over not more than eight letters are considered the entropy is roughly 2.3 bits per letter, the redundancy about 50 per cent.
Since then a new method has been found for estimating these quantities, which is more sensitive and takes account of long range statistics, influences extending over phrases, sentences, etc. This method is based on a study of the predictability of English; how well can the next letter of a text be predicted when the preceding N letters are known. The results of some experiments in prediction will be given, and a theoretical analysis of some of the properties of ideal prediction. By combining the experimental and theoretical results it is possible to estimate upper and lower bounds for the entropy and redundancy. From this analysis it appears that, in ordinary literary English, the long range statistical effects (up to 100 letters) reduce the entropy to something of the order of one bit per letter, with a corresponding redundancy of roughly 75%. The redundancy may be still higher when structure extending over paragraphs, chapters, etc. is included. However, as the lengths involved are increased, the parameters in question become more erratic and uncertain, and they depend more critically on the type of text involved.
¹ C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, v. 27, pp. 379-423, 623-656, July, October, 1948.
2. ENTROPY CALCULATION FROM THE STATISTICS OF ENGLISH
One method of calculating the entropy H is by a series of approximations F_0, F_1, F_2, ..., which successively take more and more of the statistics of the language into account and approach H as a limit. F_N may be called the N-gram entropy; it measures the amount of information or entropy due to statistics extending over N adjacent letters of text. F_N is given by:
F_N = -\sum_{i,j} p(b_i, j) \log_2 p_{b_i}(j)
    = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_{i} p(b_i) \log_2 p(b_i)        (1)

in which: b_i is a block of N-1 letters [an (N-1)-gram];
j is an arbitrary letter following b_i;
p(b_i, j) is the probability of the N-gram b_i, j;
p_{b_i}(j) is the conditional probability of letter j after the block b_i, and is given by p(b_i, j)/p(b_i).

The equation (1) can be interpreted as measuring the average uncertainty (conditional entropy) of the next letter j when the preceding N-1 letters are known. As N is increased, F_N includes longer and longer range statistics and the entropy, H, is given by the limiting value of F_N as N → ∞:

H = \lim_{N \to \infty} F_N.        (2)
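Equation (1) lends itself to a direct frequency-counting estimate. The following minimal sketch is not from the paper; the function name, the toy sample, and the treatment of the text as a plain string of letters and spaces are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

def ngram_entropy(text: str, n: int) -> float:
    """Estimate F_N = -sum over (b_i, j) of p(b_i, j) * log2 p_{b_i}(j)."""
    # Count N-grams (a block of N-1 letters plus the following letter) and the
    # (N-1)-letter blocks that are actually followed by another letter.
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    blocks = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    f_n = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total              # p(b_i, j)
        p_cond = count / blocks[gram[:-1]]   # p_{b_i}(j) = p(b_i, j) / p(b_i)
        f_n -= p_joint * log2(p_cond)
    return f_n

# Toy usage; serious estimates need long texts or published frequency tables.
sample = "the room was not very light a small oblong reading lamp on the desk"
for n in (1, 2, 3):
    print(n, round(ngram_entropy(sample, n), 2))
```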
The N-gram entropies F_N for small values of N can be calculated from standard tables of letter, digram and trigram frequencies.² If spaces and punctuation are ignored we have a twenty-six letter alphabet and F_0 may be taken (by definition) to be log_2 26, or 4.7 bits per letter. F_1 involves letter frequencies and is given by

F_1 = -\sum_{i=1}^{26} p(i) \log_2 p(i) = 4.14 bits per letter.        (3)

The digram approximation F_2 gives the result

F_2 = -\sum_{i,j} p(i, j) \log_2 p_i(j)
    = -\sum_{i,j} p(i, j) \log_2 p(i, j) + \sum_{i} p(i) \log_2 p(i)
    = 7.70 - 4.14 = 3.56 bits per letter.        (4)

² Fletcher Pratt, "Secret and Urgent," Blue Ribbon Books, 1942.
The trigram entropy is given by

F_3 = -\sum_{i,j,k} p(i, j, k) \log_2 p_{ij}(k)
    = -\sum_{i,j,k} p(i, j, k) \log_2 p(i, j, k) + \sum_{i,j} p(i, j) \log_2 p(i, j)
    = 11.0 - 7.7 = 3.3 bits per letter.        (5)

In this calculation the trigram table² used did not take into account trigrams bridging two words, such as WOW and OWO in TWO WORDS. To compensate partially for this omission, corrected trigram probabilities p(i, j, k) were obtained from the probabilities p'(i, j, k) of the table by the following rough formula:

p(i, j, k) = (2.5/4.5) p'(i, j, k) + (1/4.5) r(i) p(j, k) + (1/4.5) p(i, j) s(k)

where r(i) is the probability of letter i as the terminal letter of a word and s(k) is the probability of k as an initial letter. Thus the trigrams within words (an average of 2.5 per word) are counted according to the table; the bridging trigrams (one of each type per word) are counted approximately by assuming independence of the terminal letter of one word and the initial digram in the next or vice versa. Because of the approximations involved here, and also because of the fact that the sampling error in identifying probability with sample frequency is more serious, the value of F_3 is less reliable than the previous numbers.
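The correction amounts to a mixture of the three ways a trigram can sit relative to word boundaries, weighted by how often each case occurs per word. A hedged sketch follows; the dictionary-based table format and the function name are assumptions, not anything specified in the paper.

```python
def corrected_trigram(i: str, j: str, k: str,
                      p_within: dict,   # p'(i, j, k): within-word trigram table
                      r: dict,          # r(i): probability that i ends a word
                      s: dict,          # s(k): probability that k begins a word
                      p_digram: dict) -> float:
    """Rough correction for trigrams that bridge two words, as in the formula above."""
    within = p_within.get((i, j, k), 0.0)                         # trigram inside one word
    bridge_after_i = r.get(i, 0.0) * p_digram.get((j, k), 0.0)    # word break after i
    bridge_before_k = p_digram.get((i, j), 0.0) * s.get(k, 0.0)   # word break before k
    return (2.5 * within + bridge_after_i + bridge_before_k) / 4.5
```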
Since tables of N-gram frequencies were not available for N > 3, F_4, F_5, etc. could not be calculated in the same way. However, word frequencies have been tabulated³ and can be used to obtain a further approximation. Figure 1 is a plot on log-log paper of the probabilities of words against frequency rank. The most frequent English word "the" has a probability .071 and this is plotted against 1. The next most frequent word "of" has a probability of .034 and is plotted against 2, etc. Using logarithmic scales both for probability and rank, the curve is approximately a straight line with slope -1; thus, if p_n is the probability of the nth most frequent word, we have, roughly

p_n = .1/n.        (6)

Zipf⁴ has pointed out that this type of formula, p_n = k/n, gives a rather good approximation to the word probabilities in many different languages.
Fig. 1-Relative frequency against rank for English words.

The formula (6) clearly cannot hold indefinitely since the total probability \sum p_n must be unity, while \sum .1/n is infinite. If we assume (in the absence of any better estimate) that the formula p_n = .1/n holds out to the n at which the total probability is unity, and that p_n = 0 for larger n, we find that the critical n is the word of rank 8,727. The entropy is then:

-\sum_{n=1}^{8727} p_n \log_2 p_n = 11.82 bits per word,        (7)

or 11.82/4.5 = 2.62 bits per letter since the average word length in English is 4.5 letters. One might be tempted to identify this value with F_{4.5}, but actually the ordinate of the F_N curve at N = 4.5 will be above this value. The reason is that F_4 or F_5 involves groups of four or five letters regardless of word division.

³ G. Dewey, "Relative Frequency of English Speech Sounds," Harvard University Press, 1923.
⁴ G. K. Zipf, "Human Behavior and the Principle of Least Effort," Addison-Wesley Press, 1949.
A word is a cohesive group of letters with strong internal statistical influences, and consequently the N-grams within words are more restricted than those which bridge words. The effect of this is that we have obtained, in 2.62 bits per letter, an estimate which corresponds more nearly to, say, F_5 or F_6.
A similar set of calculations was carried out including the space as an additional letter, giving a 27 letter alphabet. The results of both 26- and 27-letter calculations are summarized below:

            F_0    F_1    F_2    F_3    F_word
26 letter   4.70   4.14   3.56   3.3    2.62
27 letter   4.76   4.03   3.32   3.1    2.14

The estimate of 2.3 for F_8, alluded to above, was found by several methods, one of which is the extrapolation of the 26-letter series above out to that point. Since the space symbol is almost completely redundant when sequences of one or more words are involved, the values of F_N in the 27-letter case will be 4.5/5.5 or .818 of F_N for the 26-letter alphabet when N is reasonably large.
3. PREDICTION OF ENGLISH

The new method of estimating entropy exploits the fact that anyone speaking a language possesses, implicitly, an enormous knowledge of the statistics of the language. Familiarity with the words, idioms, cliches and grammar enables him to fill in missing or incorrect letters in proof-reading, or to complete an unfinished phrase in conversation. An experimental demonstration of the extent to which English is predictable can be given as follows: Select a short passage unfamiliar to the person who is to do the predicting. He is then asked to guess the first letter in the passage. If the guess is correct he is so informed, and proceeds to guess the second letter. If not, he is told the correct first letter and proceeds to his next guess. This is continued through the text. As the experiment progresses, the subject writes down the correct text up to the current point for use in predicting future letters. The result of a typical experiment of this type is given below. Spaces were included as an additional letter, making a 27 letter alphabet. The first line is the original text; the second line contains a dash for each letter correctly guessed. In the case of incorrect guesses the correct letter is copied in the second line.

(1) THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
(2) ----ROO------NOT-V-----I------SM----OBL----
(1) READING LAMP ON THE DESK SHED GLOW ON
(2) REA----------O------D----SHED-GLO--O--
(1) POLISHED WOOD BUT LESS ON THE SHABBY RED CARPET
(2) P-L-S------O---BU--L-S--O-------SH-----RE--C------        (8)
.~ ---~.---~----~~.~-..~
COMPARISON
mer
PREDICTION OF ENGLISH
Fig. 2-Communi
(8)
l
The need for an identical 1
eliminated as follows. In gener
. edge of more than N preceding
only a finite nUlllber· of possih
subject to guess the next letter
plete list of these predictions
reduced text from the original "
To put this another way, tl
encoded form of the original, tl
a- reversible transd ncer. In fa(
structed in which only the red
the other. This could be set up
diction devices.
An extension of the .above
cerning the predictability of Er
up to the current point and is a
he is told so and asked to gues
Corred letter. A typical result
T.RNAL, JA,}..""(JARY
1951
N-grams within words are more
he effect qf this is that we have
: which corresponds more nearly
~
out including the space as an
et. The results of both 26- and
w:
F,
F,
F,.ord
3.56
3.3
3.1
2.62
2.14
3.32
~, was found by several methods,
5-letter series above out to that
completely redundant when sethe values of F lot in the 27-letter
" alpbabet when N is reasonably
exploits the fact that anyone
an enormous knowledge of the
:1 the words, idioms, cliches and
ncorrect letters in proof-reading,
~rsation. An experimental demon'edictable can be given as follows:
rson who is to do the predicting.
he passage. If the guess is correct
le second letter. If not, he is told
is next guess. This is continued
$ses, the subject writes dowp. the
, in predicting future letters. The
~ is given below. Spaces were in7 letter alphabet. Tbe first line is
; a dash for each letter correctly
,he correct letter is copied in the
SHED GLOW ON
SHEO-GLD--O-THE SHABBY RED CARl'ET
·-----SH-----RE --C ------
-
TEXT.'.I
1
:HT A SMALL OBLOHG
------SM----OBL-.--
Of a total of 129 letters, 89 or 69% were guessed correctly. The errors, as
would be expected, occur most frequently at the beginning of words and
syllables where the line of thought has more possibility of branching out. It
might be thought that the second line in (8), which we will call the reduced
text, contains much less information than the first. Actually, both lines contain the same information in the sense that it is possible, at least in principle, to recover the first line from the second. To accomplish this we need
an identical twin of the individual who produced the sequence. The twin (who must be mathematically, not just biologically identical) will respond in
the same way when faced with the same problem. Suppose, now, we have
only the reduced text of (8). We ask the twin to guess the passage. At each
point we will know whether his guess is correct, since he is guessing the same
as the first twin and the presence of a dash in the reduced text corresponds
to a correct guess. The letters he guesses wrong are also available, so that at
each stage he can be supplied with precisely the same information the first
twin had available.
Fig. 2-Communication system using reduced text.
The need for an identical twin in this conceptual experiment can be
eliminated as follows. In general, good prediction does not require knowledge of more than N preceding letters of text, with N fairly small. There are
only a finite number of possible sequences of N letters. We could ask the subject to guess the next letter for each of these possible N-grams. The complete list of these predictions could then be used both for obtaining the
reduced text from the original and for the inverse reconstruction process.
To put this another way, the reduced text can be considered to be an
encoded form of the original, the result of passing the original text through
a reversible transducer. In fact, a communication system could be constructed in which only the reduced text is transmitted from one point to
the other. This could be set up as shown in Fig. 2, with two identical prediction devices.
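The reversibility argument is easy to demonstrate with any deterministic predictor that both ends share. The sketch below is illustrative only; the crude predictor stands in for the subject and the twin, and is an assumption rather than anything from the experiments.

```python
def predict(history: str) -> str:
    """A stand-in for the subject or the twin: any rule works, provided it is
    deterministic and depends only on the text seen so far."""
    if history and history[-1] != " ":
        return " "          # crude rule: guess a space after any letter
    return "t"              # otherwise guess 't'

def encode(text: str) -> str:
    """Reduced text: a dash where the prediction is right, the true letter otherwise."""
    return "".join("-" if predict(text[:i]) == c else c for i, c in enumerate(text))

def decode(reduced: str) -> str:
    """The 'twin' recovers the original from the reduced text with the same predictor."""
    out = ""
    for symbol in reduced:
        out += predict(out) if symbol == "-" else symbol
    return out

original = "the room was not very light"
assert decode(encode(original)) == original   # exact recovery, however poor the predictor
```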
An extension of the above experiment yields further information concerning the predictability of English. As before, the subject knows the text up to the current point and is asked to guess the next letter. If he is wrong, he is told so and asked to guess again. This is continued until he finds the correct letter. A typical result with this experiment is shown below.
The first line is the original text and the numbers in the second line indicate the guess at which the correct letter was obtained.

(1) THERE IS NO REVERSE ON A MOTORCYCLE A FRIEND OF MINE FOUND THIS OUT RATHER DRAMATICALLY THE OTHER DAY
(2) [the number of the guess at which each letter was obtained]        (9)
Out of 102 symbols the subject guessed right on the first guess 79 times,
on the second guess 8 times, on the third guess 3 times, the fourth and fifth
guesses 2 each and only eight times required more than five guesses. Results
of this order are typical of prediction by a good subject with ordinary literary
English. Newspaper writing, scientific work and poetry generally lead to
somewhat poorer scores.
The reduced text in this case also contains the same information as the
original. Again utilizing the identical twin we ask him at each stage to guess
as many times as the number given in the reduced text and recover in this
way the original. To eliminate the human element here we must ask our
subject, for each possible N-gram of text, to guess the most probable next letter, the second most probable next letter, etc. This set of data can then serve both for prediction and recovery.
Just as before, the reduced text can be considered an encoded version of
the original. The original language, with an alphabet of 27 symbols, A,
B, ..., Z, space, has been translated into a new language with the alphabet 1, 2, ..., 27. The translating has been such that the symbol 1 now has an extremely high frequency. The symbols 2, 3, 4 have successively smaller frequencies and the final symbols 20, 21, ..., 27 occur very rarely. Thus the
translating has simplified to a considerable extent the nature of the statistical structure involved. The redundancy which originally appeared in complicated constraints among groups of letters, has, by the translating process,
been made explicit to a large extent in the very unequal probabilities of the
new symbols. It is this, as will appear later, which enables one to estimate
the entropy from these experiments.
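The translation into the 1-to-27 alphabet can be sketched in the same spirit. Again the predictor here is a toy assumption (symbols ranked by their frequency in the text seen so far); the point is only that the mapping is reversible and that the resulting symbol frequencies are very unequal.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def ranking(history: str) -> list:
    """Order the 27 symbols from most to least probable given the history.
    Toy rule: by frequency in the history so far, ties broken alphabetically."""
    counts = Counter(history)
    return sorted(ALPHABET, key=lambda c: (-counts[c], c))

def to_ranks(text: str) -> list:
    """Map each letter to the number of the guess at which it would be found."""
    return [ranking(text[:i]).index(c) + 1 for i, c in enumerate(text)]

def from_ranks(ranks: list) -> str:
    out = ""
    for r in ranks:
        out += ranking(out)[r - 1]
    return out

text = "the room was not very light a small oblong reading lamp on the desk"
ranks = to_ranks(text)
assert from_ranks(ranks) == text          # reversible, as with the identical twin
print(Counter(ranks).most_common(5))      # see how unevenly the 27 rank symbols are used
```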
In order to determine how predictability depends on the number N of preceding letters known to the subject, a more involved experiment was
carried out. One hundred samples of English text were selected at random
from a book, each fifteen letters in length. The subject was required to guess
the text, letter by letter, for each sample as in the preceding experiment.
Thus one hundred samples were obtained in which the subject had available
0, 1, 2, 3, ... , 14 preceding letters. To aid in prediction the subject made
such use as he wished of various statistical tables, letter, digram and trigram
tables, a table of the frequencies of initial letters in words, a list of the frequencies of common words and a dictionary. The samples in this experiment were from "Jefferson the Virginian" by Dumas Malone. These results, together with a similar test in which 100 letters were known to the subject, are summarized in Table I. The column corresponds to the number of preceding letters known to the subject plus one; the row is the number of the guess. The entry in column N at row S is the number of times the subject guessed the right letter at the Sth guess when (N-1) letters were known.
TABLE I

[Each entry is the number of times, out of one hundred samples, that the subject found the right letter at the Sth guess (rows S = 1 to 27) when N-1 letters were known (columns N = 1, 2, 3, ..., 15, and 100). Columns 1 and 2, computed from the letter and digram frequencies rather than from the experiment, are:

S        1     2     3     4     5     6     7     8     9     10    11    12    13    14
Col. 1   18.2  10.7  8.6   6.7   6.5   5.8   5.6   5.2   5.0   4.3   3.1   2.8   2.4   2.3
Col. 2   29.2  14.8  10.0  8.6   7.1   5.5   4.5   3.6   3.0   2.6   2.2   1.9   1.5   1.2

S        15    16    17    18    19    20    21    22    23    24    25    26    27
Col. 1   2.1   2.0   1.6   1.6   1.6   1.3   1.2   .8    .3    .1    .1    .1    .1
Col. 2   1.0   .9    .7    .5    .4    .3    .2    .1    .1    .0

The remaining columns give the corresponding experimental counts; column 3 begins 36, 20, 12, ..., column 4 begins 47, 18, 14, ..., column 5 begins 51, 13, 8, ..., and column 6 begins 58, 19, 5, ....]
For example, the entry 19 in column 6, row 2, means that with five letters known the correct letter was obtained on the second guess nineteen times out of the hundred. The first two columns of this table were not obtained by the experimental procedure outlined above but were calculated directly from the known letter and digram frequencies. Thus with no known letters the most probable symbol is the space (probability .182); the next guess, if this is wrong, should be E (probability .107), etc. These probabilities are the frequencies with which the right guess would occur at the first, second, etc., trials with best prediction. Similarly, a simple calculation from the digram table gives the entries in column 2 when the subject uses the table to best
advantage. Since the frequency tables are determined from long samples of
English, these two columns are subject to less sampling error than the others.
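These two columns follow from a simple mechanical rule: guess symbols in order of decreasing probability, unconditionally for column 1 and conditioned on the known letter for column 2. The sketch below is illustrative only; the three-symbol language is invented, and real entries would use the full 27-symbol tables.

```python
def column_one(letter_probs: dict) -> list:
    """Best-prediction frequencies with no letters known: the sorted letter probabilities."""
    return sorted(letter_probs.values(), reverse=True)

def column_two(digram_probs: dict) -> list:
    """Best-prediction frequencies with one letter known; digram_probs maps
    (preceding letter, next letter) to the joint probability p(i, j)."""
    rows = {}
    for (i, j), p in digram_probs.items():
        rows.setdefault(i, []).append(p)
    ranked = [sorted(row, reverse=True) for row in rows.values()]
    depth = max(len(row) for row in ranked)
    # The chance of being right at guess S is the total mass of the S-th largest
    # entry in each row (ordering by joint or conditional probability is the same
    # within a row).
    return [sum(row[s] for row in ranked if s < len(row)) for s in range(depth)]

# Invented three-symbol language, not real English statistics.
letters = {"a": 0.50, "b": 0.30, "c": 0.20}
digrams = {("a", "a"): 0.30, ("a", "b"): 0.15, ("a", "c"): 0.05,
           ("b", "b"): 0.20, ("b", "a"): 0.05, ("b", "c"): 0.05,
           ("c", "c"): 0.12, ("c", "a"): 0.05, ("c", "b"): 0.03}
print([round(q, 2) for q in column_one(letters)])   # [0.5, 0.3, 0.2]
print([round(q, 2) for q in column_two(digrams)])   # [0.62, 0.25, 0.13]: one known letter helps
```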
It will be seen that the prediction gradually improves, apart from some
statistical fluctuation, with increasing knowledge of the past as indicated
by the larger numbers of correct first guesses and the smaller numbers of
high rank guesses.
One experiment was carried out with "reverse" prediction, in which the subject guessed the letter preceding those already known. Although the task is subjectively much more difficult, the scores were only slightly poorer. Thus, with two 101 letter samples from the same source, the subject obtained the following results:
No. of guess    1    2    3    4    5    6    7    8    >8
Forward        70   10    7    2    2    3    3    0    4
Reverse        66    7    4    4    6    2    1    2    9
Incidentally, the N-gram entropy F_N for a reversed language is equal to that for the forward language as may be seen from the second form in equation (1). Both terms have the same value in the forward and reversed cases.

4. IDEAL N-GRAM PREDICTION
The data of Table I can be used to obtain upper and lower bounds to the N-gram entropies F_N. In order to do this, it is necessary first to develop some general results concerning the best possible prediction of a language when the preceding N letters are known. There will be for the language a set of conditional probabilities p_{i_1, i_2, ..., i_{N-1}}(j). This is the probability when the (N-1) gram i_1, i_2, ..., i_{N-1} occurs that the next letter will be j. The best guess for the next letter, when this (N-1) gram is known to have occurred, will be that letter having the highest conditional probability. The second guess should be that with the second highest probability, etc. A machine or person guessing in the best way would guess letters in the order of decreasing conditional probability. Thus the process of reducing a text
with such an ideal predictor consists of a mapping of the letters into the
numbers from 1 to 27 in such a way that the most probable next letter
[conditional on the known preceding (N-1) gram] is mapped into 1, etc.
The frequency of 1's in the reduced text will then be given by
q_1^N = \sum p(i_1, i_2, \ldots, i_{N-1}, j)        (10)
where the sum is taken over all (N-1) grams i_1, i_2, ..., i_{N-1}, the j being the one which maximizes p for that particular (N-1) gram. Similarly, the frequency of 2's, q_2^N, is given by the same formula with j chosen to be that letter having the second highest value of p, etc.
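Equation (10) and its analogues for the later guesses can be computed by sorting each row of the probability table and summing column by column. A small sketch follows, with an invented two-context table standing in for the 27^(N-1)-row table of a real language.

```python
def reduced_symbol_frequencies(p_table: dict) -> list:
    """q_i^N of eq. (10): p_table maps (block, next letter) to the joint
    probability p(b, j), where block is the preceding (N-1)-gram."""
    rows = {}
    for (block, j), p in p_table.items():
        rows.setdefault(block, []).append(p)
    ranked = [sorted(row, reverse=True) for row in rows.values()]
    depth = max(len(row) for row in ranked)
    # q_1 sums the largest entry of every row, q_2 the second largest, and so on.
    return [sum(row[i] for row in ranked if i < len(row)) for i in range(depth)]

# Invented two-context example (N = 2, so the blocks are single letters).
table = {("t", "h"): 0.30, ("t", "o"): 0.15, ("t", "e"): 0.05,
         ("e", " "): 0.25, ("e", "r"): 0.15, ("e", "n"): 0.10}
print([round(q, 2) for q in reduced_symbol_frequencies(table)])   # [0.55, 0.3, 0.15]
```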
On the basis of N-grams, a different set of probabilities for the symbols in the reduced text, q_1^{N+1}, q_2^{N+1}, ..., q_27^{N+1}, would normally result. Since this prediction is on the basis of a greater knowledge of the past, one would expect the probabilities of low numbers to be greater, and in fact one can prove the following inequalities:
\sum_{i=1}^{S} q_i^{N+1} \ge \sum_{i=1}^{S} q_i^{N}, \qquad S = 1, 2, \ldots        (11)

This means that the probability of being right in the first S guesses when the preceding N letters are known is greater than or equal to that when only (N-1) are known, for all S. To prove this, imagine the probabilities p(i_1, i_2, ..., i_N, j) arranged in a table with j running horizontally and all the N-grams vertically. The table will therefore have 27 columns and 27^N rows. The term on the left of (11) is the sum of the S largest entries in each row, summed over all the rows. The right-hand member of (11) is also a sum of entries from this table in which S entries are taken from each row but not necessarily the S largest. This follows from the fact that the right-hand member would be calculated from a similar table with (N-1) grams rather than N-grams listed vertically. Each row in the N-1 gram table is the sum of 27 rows of the N-gram table, since:

p(i_2, i_3, \ldots, i_N, j) = \sum_{i_1=1}^{27} p(i_1, i_2, \ldots, i_N, j).        (12)

The sum of the S largest entries in a row of the N-1 gram table will equal the sum of the 27S selected entries from the corresponding 27 rows of the N-gram table only if the latter fall into S columns. For the equality in (11) to hold for a particular S, this must be true of every row of the N-1 gram table. In this case, the first letter of the N-gram does not affect the set of the S most probable choices for the next letter, although the ordering within the set may be affected. However, if the equality in (11) holds for all S, it follows that the ordering as well will be unaffected by the first letter of the N-gram. The reduced text obtained from an ideal N-1 gram predictor is then identical with that obtained from an ideal N-gram predictor.

Since the partial sums

\sum_{i=1}^{S} q_i^{N}, \qquad S = 1, 2, \ldots        (13)

are monotonic increasing functions of N, < 1 for all N, they must all approach limits as N → ∞. Their first differences must therefore also approach limits as N → ∞, i.e., the q_i^N approach limits, q_i^∞. These may be interpreted as the relative frequency of correct first, second, ..., guesses with knowledge of the entire (infinite) past history of the text.
The ideal N-gram predictor can be considered, as has been pointed out, to
be a transducer which operates on the language translating it into a sequence
of numbers running from 1 to 27. As such it has the following two properties:
1. The output symbol is a function of the present input (the predicted next letter when we think of it as a predicting device) and the preceding (N-1) letters.
2. It is instantaneously reversible. The original input can be recovered by a suitable operation on the reduced text without loss of time. In fact, the inverse operation also operates on only the (N-1) preceding symbols of the reduced text together with the present output.
The above proof that the frequencies of output symbols with an N-1 gram predictor satisfy the inequalities:

\sum_{i=1}^{S} q_i^{N+1} \ge \sum_{i=1}^{S} q_i^{N}, \qquad S = 1, 2, \ldots, 27        (14)

can be applied to any transducer having the two properties listed above. In fact we can imagine again an array with the various (N-1) grams listed vertically and the present input letter horizontally. Since the present output is a function of only these quantities there will be a definite output symbol which may be entered at the corresponding intersection of row and column. Furthermore, the instantaneous reversibility requires that no two entries in the same row be the same. Otherwise, there would be ambiguity between the two or more possible present input letters when reversing the translation. The total probability of the S most probable symbols in the output, say \sum_{i=1}^{S} r_i, will be the sum of the probabilities for S entries in each row, summed over the rows, and consequently is certainly not greater than the sum of the S largest entries in each row. Thus we will have

\sum_{i=1}^{S} q_i^{N} \ge \sum_{i=1}^{S} r_i, \qquad S = 1, 2, \ldots, 27        (15)

In other words ideal prediction as defined above enjoys a preferred position among all translating operations that may be applied to a language and which satisfy the two properties above. Roughly speaking, ideal prediction collapses the probabilities of various symbols to a small group more than any other translating operation involving the same number of letters which is instantaneously reversible.

Sets of numbers satisfying the inequalities (15) have been studied by Muirhead in connection with the theory of algebraic inequalities.⁵ If (15) holds when the q_i^N and r_i are arranged in decreasing order of magnitude, and also

\sum_{i=1}^{27} q_i^{N} = \sum_{i=1}^{27} r_i
(this is true here since the total probability in each case is 1), then the first set, q_i^N, is said to majorize the second set, r_i. It is known that the majorizing property is equivalent to either of the following properties:

1. The r_i can be obtained from the q_i^N by a finite series of "flows." By a flow is understood a transfer of probability from a larger q to a smaller one, as heat flows from hotter to cooler bodies but not in the reverse direction.

2. The r_i can be obtained from the q_i^N by a generalized "averaging" operation. There exists a set of non-negative real numbers, a_{ij}, with \sum_i a_{ij} = \sum_j a_{ij} = 1 and such that

r_i = \sum_j a_{ij} q_j^N.        (16)

⁵ Hardy, Littlewood and Polya, "Inequalities," Cambridge University Press, 1934.
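The majorizing property can be checked numerically by comparing partial sums of the two sets after sorting them in decreasing order. A brief sketch, with invented example numbers, is given below for illustration only.

```python
def majorizes(q: list, r: list) -> bool:
    """True if q majorizes r: same total, and every partial sum of the sorted q's
    is at least the corresponding partial sum of the sorted r's (same length assumed)."""
    q_sorted = sorted(q, reverse=True)
    r_sorted = sorted(r, reverse=True)
    if abs(sum(q_sorted) - sum(r_sorted)) > 1e-12:
        return False
    q_partial = r_partial = 0.0
    for q_i, r_i in zip(q_sorted, r_sorted):
        q_partial += q_i
        r_partial += r_i
        if q_partial < r_partial - 1e-12:
            return False
    return True

# Ideal-predictor frequencies majorize those of a cruder reversible translation:
print(majorizes([0.62, 0.25, 0.13], [0.50, 0.30, 0.20]))   # True
```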
5. ENTROPY BOUNDS FROM PREDICTION FREQUENCIES

If we know the frequencies of symbols in the reduced text with the ideal N-gram predictor, q_i^N, it is possible to set both upper and lower bounds to the N-gram entropy, F_N, of the original language. These bounds are as follows:

\sum_{i=1}^{27} i(q_i^N - q_{i+1}^N) \log i \le F_N \le -\sum_{i=1}^{27} q_i^N \log q_i^N.        (17)
The upper bound follows immediately from the fact that the maximum possible entropy in a language with letter frequencies q_i^N is -\sum q_i^N \log q_i^N. Thus the entropy per symbol of the reduced text is not greater than this. The N-gram entropy of the reduced text is equal to that for the original language, as may be seen by an inspection of the definition (1) of F_N. The sums involved will contain precisely the same terms although, perhaps, in a different order. This upper bound is clearly valid, whether or not the prediction is ideal.
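Both bounds in (17) are elementary functions of the guess frequencies and are easy to evaluate. The sketch below is illustrative (base-2 logarithms and the list representation of the q's are assumptions); applied to the no-context frequencies of column 1 of Table I it comes out within a few per cent of the 4.03 and 3.19 quoted for that column in Section 6, the table entries being rounded.

```python
from math import log2

def upper_bound(q: list) -> float:
    """-sum q_i log2 q_i: the entropy of the reduced-text symbol frequencies."""
    return -sum(p * log2(p) for p in q if p > 0)

def lower_bound(q: list) -> float:
    """Sum of i (q_i - q_{i+1}) log2 i, with the q's in decreasing order and q_{n+1} = 0."""
    q = sorted(q, reverse=True) + [0.0]
    return sum(i * (q[i - 1] - q[i]) * log2(i) for i in range(1, len(q)))

# Column 1 of Table I (values in per cent), i.e. the sorted letter frequencies.
col1 = [18.2, 10.7, 8.6, 6.7, 6.5, 5.8, 5.6, 5.2, 5.0, 4.3, 3.1, 2.8, 2.4, 2.3,
        2.1, 2.0, 1.6, 1.6, 1.6, 1.3, 1.2, .8, .3, .1, .1, .1, .1]
q = [v / 100 for v in col1]
print(round(upper_bound(q), 2), round(lower_bound(q), 2))   # compare with 4.03 and 3.19
```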
The lower bound is more difficult to establish. It is necessary to show that with any selection of N-gram probabilities p(i_1, i_2, ..., i_N), we will have

\sum_{i=1}^{27} i(q_i^N - q_{i+1}^N) \log i \le F_N.        (18)

The left-hand member of the inequality can be interpreted as follows: Imagine the q_i^N arranged as a sequence of lines of decreasing height (Fig. 3). The actual q_i^N can be considered as the sum of a set of rectangular distributions as shown. The left member of (18) is the entropy of this set of distributions. Thus, the ith rectangular distribution has a total probability of
i(q_i^N - q_{i+1}^N). The entropy of the distribution is log i. The total entropy is then

\sum_{i=1}^{27} i(q_i^N - q_{i+1}^N) \log i.
The problem, then, is to show that any system of probabilities p(i_1, ..., i_N), with best prediction frequencies q_i, has an entropy F_N greater than or equal to that of this rectangular system, derived from the same set of q_i.
Fig. 3--Rectangular decomposition of a monotonic distribution.
The q_i as we have said are obtained from the p(i_1, ..., i_N) by arranging
each row of the table in decreasing order of magnitude and adding vertically.
Thus the q_i are the sum of a set of monotonic decreasing distributions. Replace each of these distributions by its rectangular decomposition. Each one is replaced then (in general) by 27 rectangular distributions; the q_i are the sum of 27 x 27^{N-1} rectangular distributions, of from 1 to 27 elements, and all starting at the left column. The entropy for this set is less than or equal to that of the original set of distributions since a termwise addition of two or more distributions always increases entropy. This is actually an application
of the general theorem that H_y(x) \le H(x) for any chance variables x and y.
The equality holds only if the distributions being added are proportional.
Now we may add the different components of the same width without
changing the entropy (since in this case the distributions are proportional).
The result is that we have arrived at the rectangular decomposition of the
q i, by a series of processes which decrease or leave constant the entropy,
starting with the original N-gram probabilities. Consequently the entropy
of the original system F N is greater than or equal to that of the rectangular
decomposition of the q_i. This proves the desired result.
It will be noted that the lower bound is definitely less than F N unless each
row of the table has a rectangular distribution. This requires that for each
possible (N-1) gram there is a set of possible next letters each with equal probability, while all other next letters have zero probability.

It will now be shown that the upper and lower bounds for F_N given by (17) are monotonic decreasing functions of N. This is true of the upper bound since the q_i^{N+1} majorize the q_i^N and any equalizing flow in a set of probabilities increases the entropy. To prove that the lower bound is also monotonic decreasing we will show that the quantity

U = \sum_i i(q_i - q_{i+1}) \log i        (20)

is increased by an equalizing flow among the q_i. Suppose a flow occurs from q_i to q_{i+1}, the first decreased by \Delta q and the latter increased by the same amount. Then three terms in the sum change and the change in U is given by

\Delta U = [-(i - 1) \log (i - 1) + 2i \log i - (i + 1) \log (i + 1)] \Delta q        (21)
The term in brackets has the form -f(x - 1) + 2f(x) - f(x + 1) where f(x) = x log x. Now f(x) is a function which is concave upward for positive x, since f''(x) = 1/x > 0. The bracketed term is twice the difference between the ordinate of the curve at x = i and the ordinate of the midpoint of the chord joining i - 1 and i + 1, and consequently is negative. Since \Delta q also is negative, the change in U brought about by the flow is positive. An even simpler calculation shows that this is also true for a flow from q_1 to q_2 or from q_26 to q_27 (where only two terms of the sum are affected). It follows that the lower bound based on the N-gram prediction frequencies q_i^N is greater than or equal to that calculated from the N + 1 gram frequencies q_i^{N+1}.
6. EXPERIMENTAL BOUNDS FOR ENGLISH
Working from the data of Table I, the upper and lower bounds were calculated from relations (17). The data were first smoothed somewhat to overcome the worst sampling fluctuations. The low numbers in this table are the least reliable and these were averaged together in groups. Thus, in column 4, the 47, 18 and 14 were not changed but the remaining group totaling 21 was divided uniformly over the rows from 4 to 20. The upper and lower bounds given by (17) were then calculated for each column giving the following results:

Column   1     2     3    4    5    6    7    8    9    10   11   12   13   14   15   100
Upper    4.03  3.42  3.0  2.6  2.7  2.2  2.8  1.8  1.9  2.1  2.2  2.3  2.1  1.7  2.1  1.3
Lower    3.19  2.50  2.1  1.7  1.7  1.3  1.8  1.0  1.0  1.0  1.3  1.3  1.2  .9   1.2  .6
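The column-4 entry can be traced step by step with the same bound functions as in the earlier sketch. This is illustrative only; the exact smoothing used for the published figures may have differed slightly, so the result only approximates the quoted 2.6 and 1.7.

```python
from math import log2

def upper_bound(q):
    return -sum(p * log2(p) for p in q if p > 0)

def lower_bound(q):
    q = sorted(q, reverse=True) + [0.0]
    return sum(i * (q[i - 1] - q[i]) * log2(i) for i in range(1, len(q)))

# Column 4 of Table I: keep 47, 18 and 14, spread the remaining 21 counts
# uniformly over rows 4 to 20, and leave rows 21 to 27 empty, as described above.
column4 = [47, 18, 14] + [21 / 17] * 17 + [0] * 7
q = [v / 100 for v in column4]
print(round(upper_bound(q), 2), round(lower_bound(q), 2))   # roughly 2.7 and 1.75
```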
It is evident that there is still considerable sampling error in these figures due to identifying the observed sample frequencies with the prediction probabilities. It must also be remembered that the lower bound was proved only for the ideal predictor, while the frequencies used here are from human prediction. Some rough calculations, however, indicate that the discrepancy between the actual F_N and the lower bound with ideal prediction (due to the failure to have rectangular distributions of conditional probability) more than compensates for the failure of human subjects to predict in the ideal manner. Thus we feel reasonably confident of both bounds apart from sampling errors. The values given above are plotted against N in Fig. 4.

Fig. 4-Upper and lower experimental bounds for the entropy of 27-letter English.
ACKNOWLEDGMENT
The writer is indebted to Mrs. Mary E. Shannon and to Dr. B. M. Oliver
for help with the experimental work and for a number of suggestions and criticisms concerning the theoretical aspects of this paper.