A. G. OETTINGER, Editor

A Stochastic Approach to the Grammatical Coding of English

WALTER S. STOLZ, PERCY H. TANNENBAUM AND FREDERICK V. CARSTENSEN
University of Wisconsin, Madison, Wisconsin

This research was conducted under Grant GS-296 from the National Science Foundation to Dr. Tannenbaum, who is Director of the Mass Communications Research Center at the University of Wisconsin, where Mr. Carstensen is a project assistant. Dr. Stolz is currently an NSF post-doctoral fellow at Harvard University, Center for Cognitive Studies. The use of the facilities of the Wisconsin Computing Center greatly abetted this work.
A computer program is described which will assign each word in an English text to its form class or part of speech. The program operates at relatively high speed in only a limited storage space. About half of the word-events in a corpus are identified through the use of a small dictionary of function words and frequently occurring lexical words. Some suffix tests and logical-decision rules are employed to code additional words. Finally, the remaining words are assigned to one class or another on the basis of the most probable form classes to occur within the already identified contexts. The conditional probabilities used as a basis for this coding were empirically derived from a separate hand-coded corpus. On preliminary trials, the accuracy of the coder was 91% to 93%, with obvious ways of improving the algorithm being suggested by an analysis of the results.
1. Introduction
In recent years there has been an increasing interest in the role of syntax in language behavior (cf. Miller [4]) and in various mechanical language processing activities (e.g., Oettinger [5] on language translation). In many analyses of the syntactic structure of language there is often involved the task of allocating each word of a language corpus to its respective grammatical form class or part of speech. For rather obvious reasons--e.g., relative unavailability of trained human coders, large amounts of text, heavy investments of time, etc.--such grammatical coding is a rather uneconomical undertaking, and many investigators have quite naturally turned to the use of computers to perform the coding operation. Traditionally, this has been handled through use of a large dictionary containing the words to be encountered during the text processing. More recently, a straight dictionary approach has been supplemented through the use of computational decision procedures. The present paper reports on one such computational system, WISSYN, in which decisions about how to code certain words are based on conditional probabilities of various form classes occurring in given syntactic environments.
Dictionary Approach. Given a set of words and a set of grammatical classes, one can map the former into the latter through a set of one-to-one or one-to-many relations in the form of a dictionary lookup procedure. One such program is limited to a set of 800 words of basic English (Lindsay [3]) while others use much more extensive dictionaries, sometimes exceeding 75,000 words (e.g., Kuno and Oettinger [2]).
The use of such dictionaries has several apparent shortcomings. Most obvious, of course, is the fact that if a word in the text is not included in the original dictionary, it cannot be coded. In principle, then, every word which could possibly be encountered in any application must be initially accommodated. Moreover, the dictionary entry for each word must contain all the possible grammatical classes in which that word could have membership. Since a great many English words have multiple form class membership, this introduces a substantial degree of ambiguity into the analysis. Finally, from a purely practical point of view, the immense size of any dictionary which would be needed to process a comprehensive range of English text makes such a program laborious to construct and most unwieldy to utilize--if, indeed, it does not completely overtax the capacity of a given computer system.
Computational Approach. Given such inherent disadvantages, it was to be expected that some attempt would be made to substitute, or at least to supplement, the dictionary approach with some type of estimation procedure designed to make the program construction less laborious and permit the grammatical coding of words not included in the original dictionary [1, 7].
An example of this approach is the Computational Grammatical Coder (CGC) devised by Klein and Simmons [1]. This algorithm includes a relatively small dictionary (approximately 400 items) of frequently occurring words which are unambiguous with respect to form class, and a formal decision procedure for estimating the allocation for all remaining words. To accommodate these infrequent and/or ambiguous words, Klein and Simmons attempt to take advantage of the known syntactic organization of the English language. They have constructed, rather painstakingly, a set of logical tables which predict all those form classes which could possibly occur in a particular syntactic context in terms of known or assumed distribution restrictions. The procedure is for the initial dictionary operations to define the form class membership of a substantial proportion of the word-events. This, then, not only serves to code these particular words, but also provides information on the syntactic environment of the remaining words. Given a certain environment thus identified, the other words are allocated to their various form classes according to the formal tables provided in the program.
It should be noted that while CGC has obvious advantages over a direct dictionary lookup, it still allows for a certain degree of ambiguity of allocation. That is, there may often be more than one form class which could legitimately fit within a given syntactic environment.
Stochastic Approach. Originally developed totally independently of the Klein and Simmons procedure, the present system has much in common with CGC in terms of underlying rationale and basic purpose. It too employs a limited dictionary approach at the initial stages to identify the most commonly occurring words, and this information is then used to estimate the remaining words in the corpus. The main distinction between the two procedures is that while CGC estimates the grammatical classes which could possibly occur in a given context, WISSYN predicts the one most probable class.
Of course, probabilities are an aid to estimating the syntactic category of a given word only insofar as the largest conditional probability accounts for a major proportion of the word-occurrences in that environment. If situations frequently arise where several syntactic classes are nearly equiprobable, then many errors will result. Note, however, that the distributions of the conditional probabilities involved are difficult or impossible to estimate without empirical investigation--whence comes one motivation for our study.
WISSYN includes four distinct phases--the first three accomplish the identification of the more frequently occurring words, and the last performs the prediction of the remaining words. The first phase is a simple, relatively small word-to-class dictionary. The second phase focuses not on the word as such, but on predetermined morphological characteristics which serve to identify certain form classes. Since these two phases do not always allocate a word to a single grammatical class, but sometimes merely identify it as a possible member of one of several classes, some degree of ambiguity is retained among words which have been found in these two dictionaries. A third, more-or-less ad hoc phase was introduced, therefore, to unambiguously code these words through the application of some structural rules taking advantage of the already identified context.
This still leaves a significant proportion of words to be coded in the fourth section. This is done by using empirically derived sets of conditional probabilities to predict the grammatical form class of words from given syntactic environments. Thus, instead of relying on the prescriptive rules of the CGC system, WISSYN primarily utilizes the normative patterns of stochastic organization inherent in English syntax to predict the grammatical class membership of the remaining uncoded words.
The Ambiguity Problem. The approach of WISSYN to syntactic ambiguity is to assume that none exists in the input text. Thus it produces one and only one string of syntactic categories corresponding to each sentence of input. In the earlier phases, the choice between possible alternatives is occasionally "forced" in the sense that if an ambiguity cannot be resolved, a choice is made anyway on a rather arbitrary, deterministic basis. Here, WISSYN is similar to a parsing system devised by Salton and Thorpe [7] which also eliminates ambiguity by arbitrarily choosing between possible syntactic codings. However, in neither program is there any evidence that these choices are made in an optimal way, since the decision procedure is based on the programmer's intuition rather than on empirical data.
In the last section of WISSYN, such a forced choice is not necessary because the chance that two classes will be exactly equiprobable in a given context is virtually nonexistent. That is, after conditional probabilities have been computed, only one coding for a given word will be most probable and so no ambiguity is retained.
In the remainder of this paper, errors in the coder's
performance are recorded as all disagreements between the
program and a human coder who makes full use of all
semantic and pragmatic factors to select a single "correct"
syntactic interpretation. The issue of whether or not an
error made by the computer is a syntactically possible
coding of the sentence is not considered since it bears little
on the practical usefulness of the coding algorithm.
2. The Set of Grammatical Categories
A critical decision in any grammatical coding procedure is the selection of an adequate and inclusive set of grammatical categories. There are a variety of such classification schemes available, along with a corresponding variety in ways of defining any given class within a total scheme. The basic procedure is to group words into classes on the basis of distributional similarity--i.e., words which tend to occur in similar grammatical frames are placed together in a single class. Theoretically, two words should be placed in the same class only if they are syntactically interchangeable throughout the language; however, in practice, a small number of test frames are generally used to define a class, and all words which fit into one or all (depending on the system) of the frames are considered as members of the class. The "correct" number of classes which a system should have is, of course, a function of its desired descriptive power--sufficient to accurately make the required grammatical distinctions, but not so large and diverse as to make the system unparsimonious or unmanageable. To obtain a relatively parsimonious but still distributionally defined set of categories, we adopted the system suggested by Roberts [6]. Certain minor modifications were made, and a final set of 18 classes was used.
Table 1 presents a brief description of each of the 18 categories. It will be noted that these include 16 formal word classes and two punctuation categories. The former are each defined in terms of one or more test frames, or in some cases by a simple enumeration of the possible members of that class. The punctuation classes, within and between sentences, were included since they serve as syntactic markers and, as such, should serve to enhance the predictability of word classes in their immediate environments.¹
TABLE 1. LIST OF THE GRAMMATICAL CATEGORIES ASSIGNED BY WISSYN
(Definitive frames and members of the categories are in italics.)

Code  Name                    Description
N     Nouns                   All place and person names. As in: We saw the ---.
V     Verbs                   All transitive and intransitive verbs, as in: Let's ---, or Let's --- it.
A     Adjectives              All words which are used to modify nouns (including other nouns when they are used in that capacity).
B     Adverbs                 Modifiers of verbs, as in: He walked ---.
H     Pronouns                I, you, he, him, myself, those, no one, etc.
D     Determiners             As in: --- man shot --- wolf. E.g., the, my, no, any, most, etc.
L     Linking verbs           am, is, are, become, seem, smell, remain, etc.
X     Auxiliaries             As in: He --- go.
I     Intensifiers            very, somewhat, quite, etc.
P     Prepositions            E.g., at, by, over, for, in, on, etc. (Prepositions always have nouns as their objects; otherwise they are considered adverbs.)
E     Relative pronouns       As in: the dog --- I bought. E.g., who, which, that, etc.
S     Subordinators           E.g., because, although, where, until, after, whether, etc.
C     Connectors              E.g., and, or, but, so, however, etc.
G     Negative                not, n't
U     Internal punctuation    ( ) , ; :
T     Terminal punctuation    . ? !
Z     Preverb                 The word to used as infinitive signal, or prepositions having gerunds as objects. As in: He got home --- walking.
Y     Exclamations            One-word utterances such as OK, yes, well, etc.

¹ Subsequent analysis has revealed that the inclusion of such gross punctuation categories in a parsing scheme of this sort does indeed increase the predictability of various form classes to a significant degree.
In most cases, the classification system operates at the individual word level--a word being defined as an event occurring between two blank spaces--within its own sentence context. There are, however, some exceptions. For example, contractions are separated into their component words and coded as such (e.g., the negative contraction shouldn't into should and not). Conversely, some word pairs are combined into a single syntactic unit (e.g., verb forms such as have to or ought to are coded as a single verb).
3. Description of the WISSYN System
The WISSYN grammatical coder, as presently constructed, is written in CDC FORTRAN 60 and is operational on the CDC 1604. Processing speed is still to be precisely determined, but at present it is approximately 2500 words per minute, including tape input and output. The full program, including probability tables, is contained in less than 6500 48-bit words of storage, and the procedure can readily be subdivided into four component phases: the dictionary, morphology, ad hoc and probability phases.²

² A full description and operator's manual for the WISSYN system can be obtained from the Mass Communication Research Center, University of Wisconsin, Madison, Wis. 53711.
Dictionary Phase. The original intention behind the
inclusion of this phase in the program was to quickly identify those words which were relatively frequent in occurrence and relatively unambiguous with respect to form
class membership. As our work progressed on compiling
such a list, we found that it included a large proportion of
the members of the 12 categories other than the four major
lexical classes noun, verb, adjective and adverb. It was
therefore decided to attempt to exhaustively list the
members of these 12 categories in the dictionary, and then
to assume that all words not identified by this dictionary
were not of these categories, that is, that they were lexical
words. At the present time, this dictionary contains approximately 300 words, and it is being gradually extended
as we gain additional experience with the system as a
whole.
The input text is read from card-images, and each word
of input is checked against the dictionary entries. If a word
is located in the dictionary, a code indicating its class or
classes of membership is retrieved and placed in a developing "syntactic image" (SI) of the sentence--this SI being
an ordered set of storage locations with one cell corresponding to each word or punctuation mark in the input sentence. If a particular word is not located in the dictionary
lookup, an "unknown" code is placed in its corresponding
SI cell.
It is interesting to note that in various applications of this program to date, an average of 60%, at times as much as 70%, of the words in a passage have been identified by this simple scanning procedure.
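In outline, the phase behaves like the Python sketch below; the tiny FUNCTION_WORDS table is a hypothetical stand-in for the program's 300-word dictionary, and the category codes are those of Table 1.

# Dictionary phase sketch: map each token to its class code(s), or mark
# it "x" (unknown) in the developing syntactic image (SI).
FUNCTION_WORDS = {
    "the": ["D"], "a": ["D"], "of": ["P"], "in": ["P", "B"],  # "in" is ambiguous
    "is": ["L"], "and": ["C"], "not": ["G"], ".": ["T"], ",": ["U"],
}

def dictionary_phase(tokens):
    # One SI cell per token; unresolved words are flagged "x".
    return [FUNCTION_WORDS.get(t.lower(), ["x"]) for t in tokens]

si = dictionary_phase([".", "The", "man", "walked", "in", "the", "park", "."])
print(si)  # [['T'], ['D'], ['x'], ['x'], ['P', 'B'], ['D'], ['x'], ['T']]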
Morphology Phase. While English is not a language marked by a relatively high degree of morphological inflection, there is a high correlation between certain word-suffixes and certain grammatical classes. Klein and Simmons recognized this fact and included such a morphological identification procedure in the CGC system, and we have done the same here.
If a given word of input is not in the first dictionary phase, it is then looked up in a series of suffix dictionaries of six or fewer final letters. For example, one such suffix test scans the word for its last four letters to determine if they match the -ship suffix, the -ment suffix, or any of a number of other four-letter endings. When a match is found, that word is coded appropriately in the corresponding SI cell. There is a total of 63 such suffixes, and some of the more common ones are indicated in Table 2.
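The lookup itself can be sketched as follows, trying the longest ending first; only a handful of the 63 endings are shown, with class assignments taken from Table 2, and the names are illustrative.

# Morphology phase sketch: test endings from six letters down to two and
# record the possible classes of the first suffix that matches.
SUFFIX_CLASSES = {
    "nesses": ["N"], "ship": ["N"], "ment": ["N"],
    "ous": ["A"], "ics": ["N"], "er": ["A", "N"],
}

def suffix_phase(word):
    for k in range(6, 1, -1):                 # longest ending first
        classes = SUFFIX_CLASSES.get(word[-k:].lower())
        if classes:
            return classes
    return ["x"]                              # still unknown

print(suffix_phase("friendship"))   # ['N']
print(suffix_phase("walked"))       # ['x'] (no listed ending matches)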
At this point in the program, some rather complicated
decision routines could be employed to ferret out the more
subtle morphological devices of English. These routines
would include quite elaborate context-sensitive rules which
would detect and code various suffixes only when they were
included in assorted higher-order phrasal units. While a
few such routines are employed at a rudimentary level in
the ad hoc phase, they are not the primary focus of this
study and thus have been largely omitted. This section,
then, can be thought of as taking advantage of only the
most obvious and easily detected morphological information.
Even among the suffixes which are tested, however, there are several that are not perfect indicators of a single syntactic category. Thus, words may be encountered which have characteristic endings but which are not members of the classes predicted by their endings (e.g., family has the -ly ending but it is not an adverb). This potential source of error can be attenuated by the compilation of these words into an "exception list" which is then incorporated into the dictionary phase. Through this procedure the troublesome words are correctly identified before the morphology section is encountered. Again, however, it seems to be an uneconomical task to find every word in English that is an exception to our morphology tables; so only the most frequent of such words have been listed. As a beginning, the most frequent 2000 words of the Lorge-Thorndike word list [10] were searched, and about 60 exceptions to our rules were discovered and incorporated into the first phase of WISSYN.
TABLE 2. A SAMPLE OF SOME WORD-ENDINGS TESTED FOR IN THE MORPHOLOGY SECTION

Ending    Possible Classes       Ending     Possible Classes
-ee       N                      -ism       N
-er       A, N                   -ive       A, N
-ic       A, N                   -man       N
-ly       B                      -ors       N
-'s       A                      -ous       A
-ace      N                      -able      A
-ent      A, N                   -ance      N
-ics      N                      -ship      N
-ine      A, N                   -nesses    N
The first two phases of the coder operate on each word
of the input sequentially as it is isolated by the read scanner. However, when a terminal punctuation is encountered,
scanning stops and the sentence as an entity is then operated upon. At this point, the SI contains a partial
coding of the words in the sentence. Some of the cells in
the SI contain flags indicating that the word has been
unambiguously coded (e.g., the word the, which can only
be a determiner). The contents of other cells indicate that
the word has been identified but with some ambiguity of
allocation still present. The remaining cells are still in the
"unknown" category, and are set aside for processing in
the probability phase.
Ad Hoc Phase. This routine was introduced into the program to attempt clarification of some of those words identified in either of the first two phases but which remain ambiguous. For example, the word that, being a function word, is in the initial dictionary phase, but happens to have multiple class membership in different contexts (e.g., as in that dog, that the dog jumped, the dog that jumped, etc.).
There are currently eight such types of ambiguities which are resolved by the ad hoc phase. In each case, the words already unambiguously identified provide the critical environmental information for the final allocation of these ambiguous words to one single grammatical class or another. As such, this phase of the system is similar in principle to that employed by Klein and Simmons more generally, in that a specified set of frames is provided as diagnostic for particular identifications. Examples of such decision-making in WISSYN include: (a) a verb processor to determine whether forms of to be and other such verbs should be classified as auxiliaries, linking verbs or main verbs; (b) a gerund processor which focuses on the -ing suffix words to determine whether they are nouns, gerunds, verbs or adjectives; and (c) a routine which processes certain preposition-adverbs to determine their exact usage either as prepositions (e.g., in in in the house) or as adverbs (come in from the cold).
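As one illustration, a rule of type (c) might look like the sketch below. The paper does not spell out the exact test, so this version, which simply asks whether a noun object (possibly preceded by a determiner) follows in the already coded SI, is only a plausible reconstruction based on the definitions in Table 1.

# Ad hoc phase sketch: resolve a preposition-adverb such as "in" to
# P (preposition) or B (adverb) from its already identified right context.
def resolve_prep_adverb(si, i):
    nxt = si[i + 1:i + 3]                    # the next two SI cells
    codes = [c[0] for c in nxt if c]         # first (or only) code of each
    if codes[:1] == ["N"] or codes[:2] == ["D", "N"]:
        return ["P"]                         # has a noun object
    return ["B"]                             # no object: adverbial use

si = [["T"], ["x"], ["P", "B"], ["D"], ["N"], ["T"]]  # ". come in the house ."
si[2] = resolve_prep_adverb(si, 2)
print(si[2])  # ['P']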
There has been a considerable range in the number of words identified by this phase in our applications to date, with the average being about 10%.
Probability Phase. At this stage, the SI of a sentence is such that a number of words have been firmly coded (usually 60% or better) by the previous phases, and the problem is to identify the remaining words. Because it is assumed that the earlier processing has identified all of the other 12 classes, the probability phase is left with predicting the remaining words in terms of only the four major lexical categories: noun, verb, adjective and adverb.³ At this point, the original set of words comprising the input sentence is no longer necessary and only the SI is retained.

³ Actually, in some cases, the morphology and dictionary sections identify a lexical word by eliminating one or two of the lexical classes from the range of possibilities, and thus for these words the prediction is only among two or three classes rather than four.
Estimations for the remaining words are obtained, as we have indicated, by accessing empirically derived tables of conditional probabilities of individual word classes in various syntactic strings. These probability tables were generated separately by a set of computer programs operating on text which had previously been grammatically coded by a number of human experts. The corpus used was collected in an experiment by Stolz [8], and consisted of two protocols from each of 32 undergraduate subjects elicited in response to Thematic Apperception Test (TAT) pictorial stimuli. All 64 of the passages were encoded in typewritten form and included a total of 28,476 words in 1470 sentences.
The programs used tabulated all syntactic strings up to five items in length, but always occurring within a sentence. Conditional probabilities of the various members of the strings were then computed using one, two, three and four items as predictors, with the predictors antecedent to the predicted unit, or subsequent to it, or bridging it. Because examination of the probability tables indicated no real increase in predictability when three- or four-word strings were used, only the probabilities involving up to three successive predictors were used in the WISSYN coder.
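Generating such tables amounts to counting code n-grams over the hand-coded corpus and normalizing, roughly as in the sketch below. Only antecedent and subsequent predictors are shown; the bridging (bilateral) strings are handled analogously, and all names and data are illustrative.

# Sketch of building the conditional-probability tables: count how often
# each lexical class fills the "x" slot of a predictor string, normalize.
from collections import Counter, defaultdict

LEXICAL = {"N", "V", "A", "B"}

def build_tables(coded_sentences, max_predictors=3):
    counts = defaultdict(Counter)
    for codes in coded_sentences:
        for i, c in enumerate(codes):
            if c not in LEXICAL:
                continue
            for n in range(1, max_predictors + 1):
                if i - n >= 0:                              # antecedent string
                    counts[tuple(codes[i - n:i]) + ("x",)][c] += 1
                if i + n < len(codes):                      # subsequent string
                    counts[("x",) + tuple(codes[i + 1:i + n + 1])][c] += 1
    return {pred: {cls: k / sum(ctr.values()) for cls, k in ctr.items()}
            for pred, ctr in counts.items()}

# Toy "corpus" of two sentences already coded with Table 1 symbols:
tables = build_tables([list("TDNVPDNT"), list("TDNLDANT")])
print(tables[("D", "x")])  # class distribution after a determiner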
The operation of the probability phase of the coder is perhaps best illustrated with an example. Suppose that upon entry to this phase, the SI for a given sentence is

T D N x1 P D N P x2 x3 T

where the letters are as identified in Table 1 and x1, x2 and x3 indicate the words to be predicted. (Note that each sentence is identified as beginning and ending with a terminal punctuation mark.)
The program first locates the uncoded items and then isolates the already identified context in their immediate environments, up to three units on each side. For blank x1, the tables available to the program would include conditional probabilities of the four lexical classes occurring in the following contexts:

(1) the three antecedent strings Nx (i.e., a noun immediately preceding the word for which a form class is to be predicted), DNx and TDNx;

(2) the three subsequent strings xP (i.e., a preposition immediately following the word to be coded), xPD and xPDN; and

(3) the two bilateral predictor strings NxP and DNxPD.
Since other research (Stolz [8]) had indicated that the longer predictor strings in these tables generally have greater predictive power than the shorter ones, the system first focuses on the longest possible predictors available. Thus in the case of the x1 blank above, it would select TDNx as the initial antecedent predictor, xPDN as the subsequent predictor and DNxPD as the bilateral predictor. It should be noted that in the interest of parsimony, the predictor tables used do not always include all possible strings but rather the most frequent 150 or so predictor strings for each of the possible predictor configurations.
Thus, it is possible that a particular predictor string would not be included in the tables. In such instances, the tables are searched for the next longest string. For example, if, in the above case, the TDNx predictor were not available in the tables, the search would look for the DNx predictor, and so on until the best available predictor was located.

Assuming that each of the three longest isolated predictors would be available in the above case, the tables would provide four probability figures, one for each of the four main lexical classes, for each predictor string. The system then computes a joint probability index for each of the four possible alternatives by multiplying their respective probabilities across the three predictor strings. The lexical item with the highest such probability index is then selected as the predicted form class. Table 3 presents such a sample computation for the x1 blank. In this case, it is clear that the most likely form class is a verb and it would be coded accordingly.
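In code, the selection logic for a single blank might look like the following sketch. The probability values are those of Table 3; the backoff through shorter strings is simplified, and the function names are hypothetical.

# Prediction sketch for one unknown word: take the longest available
# antecedent, subsequent and bilateral predictor strings, multiply the
# class probabilities across them, and choose the largest joint product.
TABLES = {  # predictor string -> P(class | string); values from Table 3
    "TDNx":  {"N": .046, "V": .819, "A": .013, "B": .122},
    "xPDN":  {"N": .438, "V": .359, "A": .068, "B": .135},
    "DNxPD": {"N": .017, "V": .591, "A": .078, "B": .314},
}

def best_available(candidates):
    for s in candidates:          # back off from longest to shorter strings
        if s in TABLES:
            return TABLES[s]
    return None

def predict(antecedents, subsequents, bilaterals):
    joint = {"N": 1.0, "V": 1.0, "A": 1.0, "B": 1.0}
    for cands in (antecedents, subsequents, bilaterals):
        dist = best_available(cands)
        if dist:
            for cls in joint:
                joint[cls] *= dist[cls]
    return max(joint, key=joint.get), joint

cls, joint = predict(["TDNx", "DNx", "Nx"], ["xPDN", "xPD", "xP"],
                     ["DNxPD", "NxP"])
print(cls, round(joint[cls], 5))  # V 0.17377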
Where single blanks exist in given environments, the system proceeds as above, always selecting the best available antecedent, subsequent and bilateral predictor strings to make the prediction. Special problems, however, present themselves in the event that two or more unknowns occur in sequence--as in the case of blanks x2 and x3 in the above example--and a somewhat different procedure is called for.

It is apparent that in such cases there is no bilateral environment available at the outset. The general logic of the system, then, is to select for the first of the two blanks (i.e., x2) the best available antecedent string--here the DNPx predictor--and for the second blank (x3) the best available subsequent predictor--xT, in this case. As before, this procedure yields four probabilities for each blank, and the system then proceeds to select the one highest figure from among these eight available probabilities. If that one happened to be for the x2 blank, then it is assigned to the indicated form class; if it happens to be for the x3 blank, then the word in that location is coded first. In either case, once the first allocation is made simply on the basis of the highest single probability, the newly chosen code then becomes available for creating new predictor strings for the remaining blank, which is then treated in precisely the same manner as was x1 above.
TABLE 3. A SAMPLE COMPUTATION LEADING TO THE CHOICE OF VERB AS THE CODE FOR THE WORD x IN THE SEQUENCE TDNxPDN

                      Conditional Probabilities
Predictor        noun      verb      adjective   adverb
TDNx             .046      .819      .013        .122
xPDN             .438      .359      .068        .135
DNxPD            .017      .591      .078        .314
Joint Product    .00034    .17377    .00007      .00517
Where more than two blanks occur together, the same general procedure is used: the most peripheral of the string of unknowns (e.g., the first and the third in a sequence of three successive unknowns) are first estimated on the basis of the available unilateral predictor strings, and this procedure is repeated until only a single blank remains to be predicted.
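A compact sketch of that procedure, under the same assumptions as the earlier fragments: on each pass every remaining blank is guessed from whatever context is coded so far, the single most probable guess is committed, and the process repeats so that later passes see more context. The predictor here is a dummy standing in for the table lookups described above.

# Consecutive-unknowns sketch: commit the most confident guess first,
# then re-predict the remaining blanks with the newly available context.
def fill_blanks(si, predict_one):
    # predict_one(si, i) -> (best_class, probability) for the blank at i.
    while True:
        blanks = [i for i, c in enumerate(si) if c == "x"]
        if not blanks:
            return si
        guesses = {i: predict_one(si, i) for i in blanks}
        i = max(guesses, key=lambda j: guesses[j][1])   # highest probability
        si[i] = guesses[i][0]                           # commit that one first

# Dummy predictor: always guesses noun, more confident near coded cells.
def toy_predict(si, i):
    left_known = i > 0 and si[i - 1] != "x"
    right_known = i + 1 < len(si) and si[i + 1] != "x"
    return ("N", 0.5 + 0.2 * left_known + 0.2 * right_known)

print(fill_blanks(list("TDNxPDNPxxT"), toy_predict))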
4. Evaluation of Performance
Generality of the Probability Tables. As is the case with any estimation procedure, the main criterion for evaluating its performance is whether it works or not. Accordingly, some attempts toward empirical assessment of the present WISSYN system have been undertaken. Before considering these, however, it is important to note that since a substantial degree of the success of such a scheme depends upon the accuracy of the probability tables, their generality of application is a vital factor. That is, if the same predictor strings yield highly variable conditional probabilities across different samples of English text, their predictive utility in a particular subsample would be impaired, and thus the practical value of the entire system would be seriously diminished.
Obviously, not all possible or available English texts can be evaluated in this regard; however, we have collected some evidence, and this points to more generality than specificity in predictive patterns. For example, Stolz [8] found essentially equivalent sets of conditional probabilities for various strings in spoken vs. typewritten messages from the same subjects. Similarly, Tannenbaum and Brewer [9], although they utilized an earlier and somewhat more primitive version of the present system, found no differences in the probability patterns for different types of newspaper content areas. In another study, they also found a high degree of syntactic similarity in encoding behavior under negative vs. positive evaluative feedback conditions. While far from conclusive, to be sure, results such as these do suggest that the available conditional probabilities are quite characteristic of the nature of the language code, and hence relatively invariant across different language specimens.
Evaluation of Coder's Performance. To date, we have conducted several empirical evaluations of the WISSYN system, but all on a fairly minor scale. A main reason is that in order to assess the system's performance one must have a standard for comparison, usually a hand-parsed corpus which utilizes the identical grammatical classification scheme used here. Such materials are not easy to come by, and hence our evaluations have not been on a large scale.

One of our first attempts was to determine how well the system functioned on a subset of the same messages which generated the basic set of probability tables. To do this, four messages were selected at random from the original set of 64 typewritten TAT responses. There was a total of 1916 words to be parsed in the four passages, and WISSYN, taking less than one minute to perform all the operations, correctly allocated 92.8% of the words.⁴

⁴ It should be remembered that only a single syntactic interpretation was assumed to be correct for each sentence. Thus, if the program coded a sentence in a syntactically possible but not semantically or pragmatically reasonable way, errors were recorded.
Since it could be argued that this figure was somewhat inflated in that the same materials used in the test were involved in the original probability generation, we then turned to a related set of materials. In the original study (Stolz [8]), each of the 32 subjects involved actually encoded four separate messages to four TAT stimuli. Two of these messages were typewritten, and these were used to generate the probability tables. Two others, however, were spoken. For assessing the overall system, four of the 64 spoken messages were used, involving a total of 1964 words.
TABLE 4. PERFORMANCE OF WISSYN ON EIGHT TAT PROTOCOLS (FOUR SPOKEN AND FOUR WRITTEN) PRODUCED BY EIGHT DIFFERENT COLLEGE SENIORS

Phase         Number of Words   Percent of   Number of   Percent
              Identified        Corpus       Errors      Error
Dictionary    2632              67.7         9           0.2
Morphology    119               3.1          21          0.5
Ad hoc        321               8.3          26          0.7
Probability   814               20.9         223*        5.7
Total         3886              100.0        279         7.1

* 33 of these errors (or 0.8%) were caused by errors in earlier phases.
TABLE 5. PERFORMANCE OF WISSYN ON SIX NEWS STORIES AND A SEGMENT OF ONE ARTICLE FROM Scientific American

Phase         Number of Words   Percent of   Number of   Percent
              Identified        Corpus       Errors      Error
Dictionary    937               54.7         2           0.1
Morphology    114               6.7          5           0.3
Ad hoc        273               15.9         19          1.1
Probability   389               22.7         122*        7.1
Total         1713              100.0        148         8.6

* 27 of these errors (or 1.6%) were due to errors made in earlier phases.
TABLE 6. SEVERAL REPRESENTATIVE ERRORS MADE BY THE PROBABILITY SECTION
(Starred words are coded erroneously. Category symbols are identified in Table 1.)

1.  T  D    N    B         L   D
    .  Our  tax  *system*  is  a ...

2.  E    X     G    B       V    D
    who  does  n't  *vote*  has  no ...

3.  U  P      D  N         V          P   D    N
    ,  after  a  *closed*  *meeting*  of  the  committee ...

4.  C    X      G    V     B          T
    but  would  not  give  *details*  .

5.  U  V       N    U  C
    ,  *tear*  gas  ,  and ...
Here again the performance was quite high and almost identical to that for the written TAT response data, with 92.9% of all the words correctly labeled.

We then turned to quite different materials to further assess the system's performance. Six general newspaper stories were randomly selected from a one-day sample of wire-service copy. These involved a total of 1146 words. Performance here lagged slightly behind the above TAT materials, with an error rate of 8.6%, but the fact that well over 90% of the words were properly allocated into their respective form classes bodes well for the system.

In a further, relatively minor undertaking, a single article from the Scientific American--presumably using a somewhat more complex scientific style--was used. A total of 567 words was involved, and again the program operated at better than 90% accuracy.
While these results are encouraging in pointing to a fairly efficient system as of now, they have the additional feature of constituting feedback for making the system even more accurate. Accordingly, we have been examining various aspects of the program to determine specific loci of error occurrence, and a summary of our findings appears in Tables 4 and 5. In general, we have found the dictionary phase to perform with a very high degree of accuracy (less than 1% error). The morphology phase, however, has not functioned as effectively as was originally hoped for. This appears to be largely due to specific words which represent exceptions to our morphological rules. As previously indicated, the most frequent words of this type have been included in the dictionary phase; however, this exception list may be expanded in the future. In general, the ad hoc phase has functioned well, although here again a few specific words have been giving undue trouble. These, too, can be readily accommodated in future applications. Indeed, we are constantly modifying these first three phases as new data inform us of certain problems not previously anticipated.
As expected, the highest error rate does occur in the probability phase. Part of this error (15%-25%) is obviously due to whatever inaccuracies accrue from the preceding phases; i.e., an error in a predictor string will very likely lead to errors in its prediction of unknown words. Further analysis has indicated that the error rate here is relatively low in the case where single unknown words are guessed (about 15%-20%) but increases substantially in the case of multiple consecutive blanks. Our present system, of course, contributes to this in that if the first guess is incorrect, it lowers the chances of the second and other subsequent estimations being correct. Here, again, we are contemplating certain changes in the basic program to reduce such error. One possibility is to have the program estimate a pair of words at a time when two consecutive blanks are encountered. Some preliminary data do indicate that this procedure of estimating several unknowns at once may reduce the error rate substantially--largely because it allows for the introduction of bilateral predictor contexts, which are generally superior to either of the unilateral ones in predictive power. To give a firsthand picture of the kinds of errors which are typical of the probability section, several sentence-fragments in which "typical" errors occurred are reproduced in Table 6.
5. Summary and Conclusion
This investigation has provided evidence on several points relevant to grammatical coding of English words by computer. First, it has shown that words can be coded with considerable success using empirically derived probabilities as a partial basis for their allocation to syntactic classes. Second, it has indicated that this can be done not only with relatively little error, but also at a rapid rate and using only a fraction of the core storage normally available in large-scale computers. Third, it has shown that the error rate of the coder is highly sensitive to the number of words correctly coded by the earlier, deterministic phases of the algorithm. That is, the more already-coded context which is available to the probability tables, the better are the chances that the code for an unknown or ambiguous word will be estimated correctly.
It also seems clear that probabilities are not the whole
answer to grammatical coding but that they might be
most useful as an auxiliary device in a more traditional
coder such as the CGC. In such a program, the role of the
probability section would be chiefly to aid in the final
resolution of ambiguities and to code otherwise unidentifiable words.
RECEIVED DECEMBER, 1964; REVISED MARCH, 1965
REFERENCES
1. KLEIN, S., AND SIMMONS, R. F. A computational approach to grammatical coding of English words. J. ACM 10 (July 1963), 334-347.
2. KUNO, S., AND OETTINGER, A. G. Syntactic structure and ambiguity in English. Proc. 1963 Fall Joint Comput. Conf., Spartan Books, Baltimore, 1963, 397-418.
3. LINDSAY, R. K. The reading machine problem. Unpublished doctoral dissertation, Carnegie Inst. Tech., Pittsburgh, Pa., 1960.
4. MILLER, G. A. Some psychological studies of grammar. Amer. Psychol. 17 (Nov. 1962), 748-762.
5. OETTINGER, A. G. Automatic Language Translation. Harvard U. Press, Cambridge, Mass., 1960.
6. ROBERTS, P. Patterns of English. Harcourt, Brace, and World, Inc., New York, 1956.
7. SALTON, G., AND THORPE, R. W. An approach to the segmentation problem in speech analysis and language translation. In 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Vol. II; Natl. Physical Laboratory Symp. 13, H. M. Stat. Off., London, 1962, 704-724.
8. STOLZ, W. S. Syntactic constraint in spoken and written English. Unpublished doctoral dissertation, U. of Wisconsin, Madison, Wis., 1964.
9. TANNENBAUM, P. H., AND BREWER, R. K. Consistency of syntactic structure as a factor in journalistic style. Journalism Quart., in press.
10. THORNDIKE, E. L., AND LORGE, I. The Teachers' Word Book of 30,000 Words. Teachers College, Columbia U., 1944.