A. G. OETTINGER, Editor

A Stochastic Approach to the Grammatical Coding of English

WALTER S. STOLZ, PERCY H. TANNENBAUM AND FREDERICK V. CARSTENSEN
University of Wisconsin, Madison, Wisconsin

This research was conducted under Grant GS-296 from the National Science Foundation to Dr. Tannenbaum, who is Director of the Mass Communications Research Center at the University of Wisconsin, where Mr. Carstensen is a project assistant. Dr. Stolz is currently an NSF post-doctoral fellow at Harvard University, Center for Cognitive Studies. The use of the facilities of the Wisconsin Computing Center greatly abetted this work.
A computer program is described which will assign each word in an English text to its form class or part of speech. The program operates at relatively high speed in only a limited storage space. About half of the word-events in a corpus are identified through the use of a small dictionary of function words and frequently occurring lexical words. Some suffix tests and logical-decision rules are employed to code additional words. Finally, the remaining words are assigned to one class or another on the basis of the most probable form classes to occur within the already identified contexts. The conditional probabilities used as a basis for this coding were empirically derived from a separate hand-coded corpus. On preliminary trials, the accuracy of the coder was 91% to 93%, with obvious ways of improving the algorithm being suggested by an analysis of the results.
1. Introduction
In recent years there has been an increasing interest in the role of syntax in language behavior (cf. Miller [4]) and in various mechanical language processing activities (e.g., Oettinger [5] on language translation). In many analyses of the syntactic structure of language there is often involved the task of allocating each word of a language corpus to its respective grammatical form class or part of speech. For rather obvious reasons--e.g., relative unavailability of trained human coders, large amounts of text, heavy investments of time, etc.--such grammatical coding is a rather uneconomical undertaking, and many investigators have quite naturally turned to the use of computers to perform the coding operation. Traditionally, this has been handled through use of a large dictionary containing the words to be encountered during the text processing. More recently, a straight dictionary approach has been supplemented through the use of computational decision procedures. The present paper reports on one such computational system, WISSYN, in which decisions about how to code certain words are based on conditional probabilities of various form classes occurring in given syntactic environments.
Dictionary Approach. Given a set of words and a set of grammatical classes, one can map the former into the latter through a set of one-to-one or one-to-many relations in the form of a dictionary lookup procedure. One such program is limited to a set of 800 words of basic English (Lindsay [3]) while others use much more extensive dictionaries, sometimes exceeding 75,000 words (e.g., Kuno and Oettinger [2]).
The use of such dictionaries has several apparent shortcomings. Most obvious, of course, is the fact that if a word in the text is not included in the original dictionary, it cannot be coded. In principle, then, every word which could possibly be encountered in any application must be initially accommodated. Moreover, the dictionary entry for each word must contain all the possible grammatical classes in which that word could have membership. Since a great many English words have multiple form class membership, this introduces a substantial degree of ambiguity into the analysis. Finally, from a purely practical point of view, the immense size of any dictionary which would be needed to process a comprehensive range of English text makes such a program laborious to construct and most unwieldy to utilize--if, indeed, it does not completely overtax the capacity of a given computer system.
Computational Approach. Given such inherent disadvantages, it was to be expected that some attempt would be made to substitute, or at least to supplement, the dictionary approach with some type of estimation procedure designed to make the program construction less laborious and permit the grammatical coding of words not included in the original dictionary [1, 7].
An example of this approach is the Computational Grammatical Coder (CGC) devised by Klein and Simmons [1]. This algorithm includes a relatively small dictionary (approximately 400 items) of frequently occurring words which are unambiguous with respect to form class, and a formal decision procedure for estimating the allocation for all remaining words. To accommodate these infrequent and/or ambiguous words, Klein and Simmons attempt to take advantage of the known syntactic organization of the English language. They have constructed, rather painstakingly, a set of logical tables which predict all those form classes which could possibly occur in a particular syntactic context in terms of known or assumed distribution restrictions. The procedure is for the initial dictionary operations to define the form class membership of a substantial proportion of the word-events. This, then, not only serves to code these particular words, but also provides information on the syntactic environment of the remaining words. Given a certain environment thus identified, the other words are allocated to their various form classes according to the formal tables provided in the program.
It should be noted that while CGC has obvious advantages over a direct dictionary lookup, it still allows for a certain degree of ambiguity of allocation. That is, there may often be more than one form class which could legitimately fit within a given syntactic environment.
Stochastic Approach. Originally developed totally independently of the Klein and Simmons procedure, the present system has much in common with CGC in terms of underlying rationale and basic purpose. It too employs a limited dictionary approach at the initial stages to identify the most commonly occurring words, and this information is then used to estimate the remaining words in the corpus. The main distinction between the two procedures is that while CGC estimates the grammatical classes which could possibly occur in a given context, WISSYN predicts the one most probable class.
Of course, probabilities are an aid to estimating the syntactic category of a given word only insofar as the largest conditional probability accounts for a major proportion of the word-occurrences in that environment. If situations frequently arise where several syntactic classes are nearly equiprobable, then many errors will result. Note, however, that the distributions of the conditional probabilities involved are difficult or impossible to estimate without empirical investigation--whence comes one motivation for our study.
WISSYN includes four distinct phases--the first three accomplish the identification of the more frequently occurring words, and the last performs the prediction of the remaining words. The first phase is a simple, relatively small word-to-class dictionary. The second phase focuses not on the word as such, but on predetermined morphological characteristics which serve to identify certain form classes. Since these two phases do not always allocate a word to a single grammatical class, but sometimes merely identify it as a possible member of one of several classes, some degree of ambiguity is retained among words which have been found in these two dictionaries. A third, more-or-less ad hoc phase was introduced, therefore, to unambiguously code these words through the application of some structural rules taking advantage of the already identified context.
This still leaves a significant proportion of words to be coded in the fourth section. This is done by using empirically derived sets of conditional probabilities to predict the grammatical form class of words from given syntactic environments. Thus, instead of relying on the prescriptive rules of the CGC system, WISSYN primarily utilizes the normative patterns of stochastic organization inherent in English syntax to predict the grammatical class membership of the remaining uncoded words.
The Ambiguity Problem. The approach of WISSYN to syntactic ambiguity is to assume that none exists in the input text. Thus it produces one and only one string of syntactic categories corresponding to each sentence of input. In the earlier phases, the choice between possible alternatives is occasionally "forced" in the sense that if an ambiguity cannot be resolved, a choice is made anyway on a rather arbitrary, deterministic basis. Here, WISSYN is similar to a parsing system devised by Salton and Thorpe [7] which also eliminates ambiguity by arbitrarily choosing between possible syntactic codings. However, in neither program is there any evidence that these choices are made in an optimal way, since the decision procedure is based on the programmer's intuition rather than on empirical data.
In the last section of WISSYN, such a forced choice is not necessary because the chance that two classes will be exactly equiprobable in a given context is virtually nonexistent. That is, after conditional probabilities have been computed, only one coding for a given word will be most probable and so no ambiguity is retained.
In the remainder of this paper, errors in the coder's
performance are recorded as all disagreements between the
program and a human coder who makes full use of all
semantic and pragmatic factors to select a single "correct"
syntactic interpretation. The issue of whether or not an
error made by the computer is a syntactically possible
coding of the sentence is not considered since it bears little
on the practical usefulness of the coding algorithm.
2. The Set of Grammatical Categories
A critical decision in any grammatical coding procedure is the selection of an adequate and inclusive set of grammatical categories. There are a variety of such classification schemes available, along with a corresponding variety in ways of defining any given class within a total scheme. The basic procedure is to group words into classes on the basis of distributional similarity--i.e., words which tend to occur in similar grammatical frames are placed together in a single class. Theoretically, two words should be placed in the same class only if they are syntactically interchangeable throughout the language; however, in practice, a small number of test frames are generally used to define a class, and all words which fit into one or all (depending on the system) of the frames are considered as members of the class. The "correct" number of classes which a system should have is, of course, a function of its desired descriptive power--sufficient to accurately make the required grammatical distinctions, but not so large and diverse as to make the system unparsimonious or unmanageable. To obtain a relatively parsimonious but still distributionally defined set of categories, we adopted the system suggested by Roberts [6]. Certain minor modifications were made, and a final set of 18 classes was used.
Table 1 presents a brief description of each of the 18 categories. It will be noted that these include 16 formal word classes and two punctuation categories. The former are each defined in terms of one or more test frames, or in some cases by a simple enumeration of the possible members of that class. The punctuation classes, within and between sentences, were included since they serve as syntactic markers and, as such, should serve to enhance the predictability of word classes in their immediate environments.¹
TABLE 1. LIST OF THE GRAMMATICAL CATEGORIES ASSIGNED BY WISSYN
(Definitive frames and members of the categories are in italics.)

Code  Name                    Description
N     Nouns                   All place and person names. As in: We saw the ---.
V     Verbs                   All transitive and intransitive verbs, as in: Let's ---, or Let's --- it.
A     Adjectives              All words which are used to modify nouns (including other nouns when they are used in that capacity).
B     Adverbs                 Modifiers of verbs, as in: He walked ---.
H     Pronouns                I, you, he, him, myself, those, no one, etc.
D     Determiners             As in: --- man shot --- wolf. E.g., the, my, no, any, most, etc.
L     Linking verbs           am, is, are, become, seem, smell, remain, etc.
X     Auxiliaries             As in: He --- go.
I     Intensifiers            very, somewhat, quite, etc.
P     Prepositions            E.g., at, by, over, for, in, on, etc. (Prepositions always have nouns as their objects; otherwise they are considered adverbs.)
E     Relative pronouns       As in: the dog --- I bought. E.g., who, which, that, etc.
S     Subordinators           E.g., because, although, where, until, after, whether, etc.
C     Connectors              E.g., and, or, but, so, however, etc.
G     Negative                not, n't
U     Internal punctuation    ( ) , ; :
T     Terminal punctuation    . ? !
Z     Preverb                 The word to used as infinitive signal, or prepositions having gerunds as objects. As in: He got home --- walking.
Y     Exclamations            One-word utterances such as OK, yes, well, etc.

¹ Subsequent analysis has revealed that the inclusion of such gross punctuation categories in a parsing scheme of this sort does indeed increase the predictability of various form classes to a significant degree.
In most cases, the classification system operates at the individual word level--a word being defined as an event occurring between two blank spaces--within its own sentence context. There are, however, some exceptions. For example, contractions are separated into their component words and coded as such (e.g., the negative contraction shouldn't into should and not). Conversely, some word pairs are combined into a single syntactic unit (e.g., verb forms such as have to or ought to are coded as a single verb).
3. Description of the WISSYN System
The WISSYN grammatical coder, as presently constructed, is written in CDC FORTRAN 60 and is operational on the CDC 1604. Processing speed is still to be precisely determined, but at present it is approximately 2500 words per minute, including tape input and output. The full program, including probability tables, is contained in less than 6500 48-bit words of storage, and the procedure can readily be subdivided into four component phases: the dictionary, morphology, ad hoc and probability phases.²

² A full description and operator's manual for the WISSYN system can be obtained from the Mass Communication Research Center, University of Wisconsin, Madison, Wis. 53711.
Dictionary Phase. The original intention behind the
inclusion of this phase in the program was to quickly identify those words which were relatively frequent in occurrence and relatively unambiguous with respect to form
class membership. As our work progressed on compiling
such a list, we found that it included a large proportion of
the members of the 12 categories other than the four major
lexical classes noun, verb, adjective and adverb. It was
therefore decided to attempt to exhaustively list the
members of these 12 categories in the dictionary, and then
to assume that all words not identified by this dictionary
were not of these categories, that is, that they were lexical
words. At the present time, this dictionary contains approximately 300 words, and it is being gradually extended
as we gain additional experience with the system as a
whole.
The input text is read from card-images, and each word
of input is checked against the dictionary entries. If a word
is located in the dictionary, a code indicating its class or
classes of membership is retrieved and placed in a developing "syntactic image" (SI) of the sentence--this SI being
an ordered set of storage locations with one cell corresponding to each word or punctuation mark in the input sentence. If a particular word is not located in the dictionary
lookup, an "unknown" code is placed in its corresponding
SI cell.
It is interesting to note that in various applications of this program to date, an average of 60%, at times as much as 70%, of the words in a passage have been identified by this simple scanning procedure.
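In outline, the phase behaves like the Python sketch below; the tiny FUNCTION_WORDS table is a hypothetical stand-in for the program's 300-word dictionary, and the category codes are those of Table 1.

# Dictionary phase sketch: map each token to its class code(s), or mark
# it "x" (unknown) in the developing syntactic image (SI).
FUNCTION_WORDS = {
    "the": ["D"], "a": ["D"], "of": ["P"], "in": ["P", "B"],  # "in" is ambiguous
    "is": ["L"], "and": ["C"], "not": ["G"], ".": ["T"], ",": ["U"],
}

def dictionary_phase(tokens):
    # One SI cell per token; unresolved words are flagged "x".
    return [FUNCTION_WORDS.get(t.lower(), ["x"]) for t in tokens]

si = dictionary_phase([".", "The", "man", "walked", "in", "the", "park", "."])
print(si)  # [['T'], ['D'], ['x'], ['x'], ['P', 'B'], ['D'], ['x'], ['T']]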
Morphology Phase. While English is not a language marked by a relatively high degree of morphological inflection, there is a high correlation between certain word-suffixes and certain grammatical classes. Klein and Simmons recognized this fact and included such a morphological identification procedure in the CGC system, and we have done the same here.
If a given word of input is not in the first dictionary phase, it is then looked up in a series of suffix dictionaries of six or fewer final letters. For example, one such suffix test scans the word for its last four letters to determine if they match the -ship suffix, the -ment suffix, or any of a number of other four-letter endings. When a match is found, that word is coded appropriately in the corresponding SI cell. There is a total of 63 such suffixes, and some of the more common ones are indicated in Table 2.
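The lookup itself can be sketched as follows, trying the longest ending first; only a handful of the 63 endings are shown, with class assignments taken from Table 2, and the names are illustrative.

# Morphology phase sketch: test endings from six letters down to two and
# record the possible classes of the first suffix that matches.
SUFFIX_CLASSES = {
    "nesses": ["N"], "ship": ["N"], "ment": ["N"],
    "ous": ["A"], "ics": ["N"], "er": ["A", "N"],
}

def suffix_phase(word):
    for k in range(6, 1, -1):                 # longest ending first
        classes = SUFFIX_CLASSES.get(word[-k:].lower())
        if classes:
            return classes
    return ["x"]                              # still unknown

print(suffix_phase("friendship"))   # ['N']
print(suffix_phase("walked"))       # ['x'] (no listed ending matches)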
At this point in the program, some rather complicated
decision routines could be employed to ferret out the more
subtle morphological devices of English. These routines
would include quite elaborate context-sensitive rules which
would detect and code various suffixes only when they were
included in assorted higher-order phrasal units. While a
few such routines are employed at a rudimentary level in
the ad hoc phase, they are not the primary focus of this
study and thus have been largely omitted. This section,
then, can be thought of as taking advantage of only the
most obvious and easily detected morphological information.
Even among the suffixes which are tested, however, there are several that are not perfect indicators of a single syntactic category. Thus, words may be encountered which have characteristic endings but which are not members of the classes predicted by their endings (e.g., family has the -ly ending but it is not an adverb). This potential source of error can be attenuated by the compilation of these words into an "exception list" which is then incorporated into the dictionary phase. Through this procedure the troublesome words are correctly identified before the morphology section is encountered. Again, however, it seems to be an uneconomical task to find every word in English that is an exception to our morphology tables; so only the most frequent of such words have been listed. As a beginning, the most frequent 2000 words of the Lorge-Thorndike word list [10] were searched, and about 60 exceptions to our rules were discovered and incorporated into the first phase of WISSYN.
TABLE 2. A SAMPLE OF SOME WORD-ENDINGS TESTED FOR IN THE MORPHOLOGY SECTION

Ending    Possible Classes       Ending     Possible Classes
-ee       N                      -ism       N
-er       A, N                   -ive       A, N
-ic       A, N                   -man       N
-ly       B                      -ors       N
-'s       A                      -ous       A
-ace      N                      -able      A
-ent      A, N                   -ance      N
-ics      N                      -ship      N
-ine      A, N                   -nesses    N
The first two phases of the coder operate on each word
of the input sequentially as it is isolated by the read scanner. However, when a terminal punctuation is encountered,
scanning stops and the sentence as an entity is then operated upon. At this point, the SI contains a partial
coding of the words in the sentence. Some of the cells in
the SI contain flags indicating that the word has been
unambiguously coded (e.g., the word the, which can only
be a determiner). The contents of other cells indicate that
the word has been identified but with some ambiguity of
allocation still present. The remaining cells are still in the
"unknown" category, and are set aside for processing in
the probability phase.
Ad Hoc Phase. This routine was introduced into the program to attempt clarification of some of those words identified in either of the first two phases but which remain ambiguous. For example, the word that, being a function word, is in the initial dictionary phase, but happens to have multiple class membership in different contexts (e.g., as in that dog, that the dog jumped, the dog that jumped, etc.).
There are currently eight such types of ambiguities which are resolved by the ad hoc phase. In each case, the words already unambiguously identified provide the critical environmental information for the final allocation of these ambiguous words to one single grammatical class or another. As such, this phase of the system is similar in principle to that employed by Klein and Simmons more generally, in that a specified set of frames is provided as diagnostic for particular identifications. Examples of such decision-making in WISSYN include: (a) a verb processor to determine whether forms of to be and other such verbs should be classified as auxiliaries, linking verbs or main verbs; (b) a gerund processor which focuses on the -ing suffix words to determine whether they are nouns, gerunds, verbs or adjectives; and (c) a routine which processes certain preposition-adverbs to determine their exact usage either as prepositions (e.g., in in in the house) or as adverbs (come in from the cold).
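As one illustration, a rule of type (c) might look like the sketch below. The paper does not spell out the exact test, so this version, which simply asks whether a noun object (possibly preceded by a determiner) follows in the already coded SI, is only a plausible reconstruction based on the definitions in Table 1.

# Ad hoc phase sketch: resolve a preposition-adverb such as "in" to
# P (preposition) or B (adverb) from its already identified right context.
def resolve_prep_adverb(si, i):
    nxt = si[i + 1:i + 3]                    # the next two SI cells
    codes = [c[0] for c in nxt if c]         # first (or only) code of each
    if codes[:1] == ["N"] or codes[:2] == ["D", "N"]:
        return ["P"]                         # has a noun object
    return ["B"]                             # no object: adverbial use

si = [["T"], ["x"], ["P", "B"], ["D"], ["N"], ["T"]]  # ". come in the house ."
si[2] = resolve_prep_adverb(si, 2)
print(si[2])  # ['P']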
There has been a considerable range in the number of words identified by this phase in our applications to date, with the average being about 10%.
Probability Phase. At this stage, the SI of a sentence is such that a number of words have been firmly coded (usually 60% or better) by the previous phases, and the problem is to identify the remaining words. Because it is assumed that the earlier processing has identified all of the other 12 classes, the probability phase is left with predicting the remaining words in terms of only the four major lexical categories: noun, verb, adjective and adverb.³ At this point, the original set of words comprising the input sentence is no longer necessary and only the SI is retained.

³ Actually, in some cases, the morphology and dictionary sections identify a lexical word by eliminating one or two of the lexical classes from the range of possibilities, and thus for these words the prediction is only among two or three classes rather than four.
Estimations for the remaining words are obtained, as we have indicated, by accessing empirically derived tables of conditional probabilities of individual word classes in various syntactic strings. These probability tables were generated separately by a set of computer programs operating on text which had previously been grammatically coded by a number of human experts. The corpus used was collected in an experiment by Stolz [8], and consisted of two protocols from each of 32 undergraduate subjects elicited in response to Thematic Apperception Test (TAT) pictorial stimuli. All 64 of the passages were encoded in typewritten form and included a total of 28,476 words in 1470 sentences.
The programs used tabulated all syntactic strings up to five items in length, but always occurring within a sentence. Conditional probabilities of the various members of the strings were then computed using one, two, three and four items as predictors, with the predictors antecedent to the predicted unit, or subsequent to it, or bridging it. Because examination of the probability tables indicated no real increase in predictability when three- or four-word strings were used, only the probabilities involving up to three successive predictors were used in the WISSYN coder.
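Generating such tables amounts to counting code n-grams over the hand-coded corpus and normalizing, roughly as in the sketch below. Only antecedent and subsequent predictors are shown; the bridging (bilateral) strings are handled analogously, and all names and data are illustrative.

# Sketch of building the conditional-probability tables: count how often
# each lexical class fills the "x" slot of a predictor string, normalize.
from collections import Counter, defaultdict

LEXICAL = {"N", "V", "A", "B"}

def build_tables(coded_sentences, max_predictors=3):
    counts = defaultdict(Counter)
    for codes in coded_sentences:
        for i, c in enumerate(codes):
            if c not in LEXICAL:
                continue
            for n in range(1, max_predictors + 1):
                if i - n >= 0:                              # antecedent string
                    counts[tuple(codes[i - n:i]) + ("x",)][c] += 1
                if i + n < len(codes):                      # subsequent string
                    counts[("x",) + tuple(codes[i + 1:i + n + 1])][c] += 1
    return {pred: {cls: k / sum(ctr.values()) for cls, k in ctr.items()}
            for pred, ctr in counts.items()}

# Toy "corpus" of two sentences already coded with Table 1 symbols:
tables = build_tables([list("TDNVPDNT"), list("TDNLDANT")])
print(tables[("D", "x")])  # class distribution after a determiner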
The operation of the probability phase of the coder is perhaps best illustrated with an example. Suppose that upon entry to this phase, the SI for a given sentence is

T D N x1 P D N P x2 x3 T

where the letters are as identified in Table 1 and x1, x2 and x3 indicate the words to be predicted. (Note that each sentence is identified as beginning and ending with a terminal punctuation mark.)
The program first locates the uncoded items and then isolates the already identified context in their immediate environments, up to three units on each side. For blank x1, the tables available to the program would include conditional probabilities of the four lexical classes occurring in the following contexts:

(1) the three antecedent strings Nx (i.e., a noun immediately preceding the word for which a form class is to be predicted), DNx and TDNx;

(2) the three subsequent strings xP (i.e., a preposition immediately following the word to be coded), xPD and xPDN; and

(3) the two bilateral predictor strings NxP and DNxPD.
Since other research (Stolz [8]) had indicated that the longer predictor strings in these tables generally have greater predictive power than the shorter ones, the system first focuses on the longest possible predictors available. Thus in the case of the x1 blank above, it would select TDNx as the initial antecedent predictor, xPDN as the subsequent predictor and DNxPD as the bilateral predictor. It should be noted that in the interest of parsimony, the predictor tables used do not always include all possible strings but rather the most frequent 150 or so predictor strings for each of the possible predictor configurations.
Thus, it is possible that a particular predictor string would not be included in the tables. In such instances, the tables are searched for the next longest string. For example, if, in the above case, the TDNx predictor were not available in the tables, the search would look for the DNx predictor, and so on until the best available predictor was located.

Assuming that each of the three longest isolated predictors would be available in the above case, the tables would provide four probability figures, one for each of the four main lexical classes, for each predictor string. The system then computes a joint probability index for each of the four possible alternatives by multiplying their respective probabilities across the three predictor strings. The lexical item with the highest such probability index is then selected as the predicted form class. Table 3 presents such a sample computation for the x1 blank. In this case, it is clear that the most likely form class is a verb and it would be coded accordingly.
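In code, the selection logic for a single blank might look like the following sketch. The probability values are those of Table 3; the backoff through shorter strings is simplified, and the function names are hypothetical.

# Prediction sketch for one unknown word: take the longest available
# antecedent, subsequent and bilateral predictor strings, multiply the
# class probabilities across them, and choose the largest joint product.
TABLES = {  # predictor string -> P(class | string); values from Table 3
    "TDNx":  {"N": .046, "V": .819, "A": .013, "B": .122},
    "xPDN":  {"N": .438, "V": .359, "A": .068, "B": .135},
    "DNxPD": {"N": .017, "V": .591, "A": .078, "B": .314},
}

def best_available(candidates):
    for s in candidates:          # back off from longest to shorter strings
        if s in TABLES:
            return TABLES[s]
    return None

def predict(antecedents, subsequents, bilaterals):
    joint = {"N": 1.0, "V": 1.0, "A": 1.0, "B": 1.0}
    for cands in (antecedents, subsequents, bilaterals):
        dist = best_available(cands)
        if dist:
            for cls in joint:
                joint[cls] *= dist[cls]
    return max(joint, key=joint.get), joint

cls, joint = predict(["TDNx", "DNx", "Nx"], ["xPDN", "xPD", "xP"],
                     ["DNxPD", "NxP"])
print(cls, round(joint[cls], 5))  # V 0.17377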
Where single blanks exist in given environments, the system proceeds as above, always selecting the best available antecedent, subsequent and bilateral predictor strings to make the prediction. Special problems, however, present themselves in the event that two or more unknowns occur in sequence--as in the case of blanks x2 and x3 in the above example--and a somewhat different procedure is called for.

It is apparent that in such cases there is no bilateral environment available at the outset. The general logic of the system, then, is to select for the first of the two blanks (i.e., x2) the best available antecedent string--here the DNPx predictor--and for the second blank (x3) the best available subsequent predictor--xT, in this case. As before, this procedure yields four probabilities for each blank, and the system then proceeds to select the one highest figure from among these eight available probabilities. If that one happened to be for the x2 blank, then it is assigned to the indicated form class; if it happens to be for the x3 blank, then the word in that location is coded first. In either case, once the first allocation is made simply on the basis of the highest single probability, the newly chosen code then becomes available for creating new predictor strings for the remaining blank, which is then treated in precisely the same manner as was x1 above.
TABLE 3. A SAMPLE COMPUTATION LEADING TO THE CHOICE OF VERB AS THE CODE FOR THE WORD x IN THE SEQUENCE TDNxPDN

                      Conditional Probabilities
Predictor        noun      verb      adjective   adverb
TDNx             .046      .819      .013        .122
xPDN             .438      .359      .068        .135
DNxPD            .017      .591      .078        .314
Joint Product    .00034    .17377    .00007      .00517
Where more than two blanks occur together, the same general procedure is used: the most peripheral of the string of unknowns (e.g., the first and the third in a sequence of three successive unknowns) are first estimated on the basis of the available unilateral predictor strings, and this procedure is repeated until only a single blank remains to be predicted.
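A compact sketch of that procedure, under the same assumptions as the earlier fragments: on each pass every remaining blank is guessed from whatever context is coded so far, the single most probable guess is committed, and the process repeats so that later passes see more context. The predictor here is a dummy standing in for the table lookups described above.

# Consecutive-unknowns sketch: commit the most confident guess first,
# then re-predict the remaining blanks with the newly available context.
def fill_blanks(si, predict_one):
    # predict_one(si, i) -> (best_class, probability) for the blank at i.
    while True:
        blanks = [i for i, c in enumerate(si) if c == "x"]
        if not blanks:
            return si
        guesses = {i: predict_one(si, i) for i in blanks}
        i = max(guesses, key=lambda j: guesses[j][1])   # highest probability
        si[i] = guesses[i][0]                           # commit that one first

# Dummy predictor: always guesses noun, more confident near coded cells.
def toy_predict(si, i):
    left_known = i > 0 and si[i - 1] != "x"
    right_known = i + 1 < len(si) and si[i + 1] != "x"
    return ("N", 0.5 + 0.2 * left_known + 0.2 * right_known)

print(fill_blanks(list("TDNxPDNPxxT"), toy_predict))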
4. Evaluation of Performance
Generality of the Probability Tables. As is the case with any estimation procedure, the main criterion for evaluating its performance is whether it works or not. Accordingly, some attempts toward empirical assessment of the present WISSYN system have been undertaken. Before considering these, however, it is important to note that since a substantial degree of the success of such a scheme depends upon the accuracy of the probability tables, their generality of application is a vital factor. That is, if the same predictor strings yield highly variable conditional probabilities across different samples of English text, their predictive utility in a particular subsample would be impaired, and thus the practical value of the entire system would be seriously diminished.
Obviously, not all possible or available English texts can be evaluated in this regard; however, we have collected some evidence, and this points to more generality than specificity in predictive patterns. For example, Stolz [8] found essentially equivalent sets of conditional probabilities for various strings in spoken vs. typewritten messages from the same subjects. Similarly, Tannenbaum and Brewer [9], although they utilized an earlier and somewhat more primitive version of the present system, found no differences in the probability patterns for different types of newspaper content areas. In another study, they also found a high degree of syntactic similarity in encoding behavior under negative vs. positive evaluative feedback conditions. While far from conclusive, to be sure, results such as these do suggest that the available conditional probabilities are quite characteristic of the nature of the language code, and hence relatively invariant across different language specimens.
Evaluation of Coder's Performance. To date, we have conducted several empirical evaluations of the WISSYN system, but all on a fairly minor scale. A main reason is that in order to assess the system's performance one must have a standard for comparison, usually a hand-parsed corpus which utilizes the identical grammatical classification scheme used here. Such materials are not easy to come by, and hence our evaluations have not been on a large scale.

One of our first attempts was to determine how well the system functioned on a subset of the same messages which generated the basic set of probability tables. To do this, four messages were selected at random from the original set of 64 typewritten TAT responses. There was a total of 1916 words to be parsed in the four passages, and WISSYN, taking less than one minute to perform all the operations, correctly allocated 92.8% of the words.⁴

⁴ It should be remembered that only a single syntactic interpretation was assumed to be correct for each sentence. Thus, if the program coded a sentence in a syntactically possible but not semantically or pragmatically reasonable way, errors were recorded.
Since it could be argued that this figure was somewhat inflated in that the same materials used in the test were involved in the original probability generation, we then turned to a related set of materials. In the original study (Stolz [8]), each of the 32 subjects involved actually encoded four separate messages to four TAT stimuli. Two of these messages were typewritten, and these were used to generate the probability tables. Two others, however, were spoken. For assessing the overall system, four of the 64 spoken messages were used, involving a total of 1964 words.
TABLE 4. PERFORMANCE OF WISSYN ON EIGHT TAT PROTOCOLS (FOUR SPOKEN AND FOUR WRITTEN) PRODUCED BY EIGHT DIFFERENT COLLEGE SENIORS

Phase         Number of Words   Percent of   Number of   Percent
              Identified        Corpus       Errors      Error
Dictionary    2632              67.7         9           0.2
Morphology    119               3.1          21          0.5
Ad hoc        321               8.3          26          0.7
Probability   814               20.9         223*        5.7
Total         3886              100.0        279         7.1

* 33 of these errors (or 0.8%) were caused by errors in earlier phases.
TABLE 5. PERFORMANCE OF WISSYN ON SIX NEWS STORIES AND A SEGMENT OF ONE ARTICLE FROM Scientific American

Phase         Number of Words   Percent of   Number of   Percent
              Identified        Corpus       Errors      Error
Dictionary    937               54.7         2           0.1
Morphology    114               6.7          5           0.3
Ad hoc        273               15.9         19          1.1
Probability   389               22.7         122*        7.1
Total         1713              100.0        148         8.6

* 27 of these errors (or 1.6%) were due to errors made in earlier phases.
TABLE 6. SEVERAL REPRESENTATIVE ERRORS MADE BY THE PROBABILITY SECTION
(Starred words are coded erroneously. Category symbols are identified in Table 1.)

1.  T  D    N    B         L   D
    .  Our  tax  *system*  is  a ...

2.  E    X     G    B       V    D
    who  does  n't  *vote*  has  no ...

3.  U  P      D  N         V          P   D    N
    ,  after  a  *closed*  *meeting*  of  the  committee ...

4.  C    X      G    V     B          T
    but  would  not  give  *details*  .

5.  U  V       N    U  C
    ,  *tear*  gas  ,  and ...
Here again the performance was quite high and almost identical to that for the written TAT response data, with 92.9% of all the words correctly labeled.

We then turned to quite different materials to further assess the system's performance. Six general newspaper stories were randomly selected from a one-day sample of wire-service copy. These involved a total of 1146 words. Performance here lagged slightly behind the above TAT materials, with an error rate of 8.6%, but the fact that well over 90% of the words were properly allocated into their respective form classes bodes well for the system.

In a further, relatively minor undertaking, a single article from the Scientific American--presumably using a somewhat more complex scientific style--was used. A total of 567 words was involved, and again the program operated at better than 90% accuracy.
While these results are encouraging in pointing to a fairly efficient system as of now, they have the additional feature of constituting feedback for making the system even more accurate. Accordingly, we have been examining various aspects of the program to determine specific loci of error occurrence, and a summary of our findings appears in Tables 4 and 5. In general, we have found the dictionary phase to perform with a very high degree of accuracy (less than 1% error). The morphology phase, however, has not functioned as effectively as was originally hoped for. This appears to be largely due to specific words which represent exceptions to our morphological rules. As previously indicated, the most frequent words of this type have been included in the dictionary phase; however, this exception list may be expanded in the future. In general, the ad hoc phase has functioned well, although here again a few specific words have been giving undue trouble. These, too, can be readily accommodated in future applications. Indeed, we are constantly modifying these first three phases as new data inform us of certain problems not previously anticipated.
As expected, the highest error rate does occur in the probability phase. Part of this error (15%-25%) is obviously due to whatever inaccuracies accrue from the preceding phases; i.e., an error in a predictor string will very likely lead to errors in its prediction of unknown words. Further analysis has indicated that the error rate here is relatively low in the case where single unknown words are guessed (about 15%-20%) but increases substantially in the case of multiple consecutive blanks. Our present system, of course, contributes to this in that if the first guess is incorrect, it lowers the chances of the second and other subsequent estimations being correct. Here, again, we are contemplating certain changes in the basic program to reduce such error. One possibility is to have the program estimate a pair of words at a time when two consecutive blanks are encountered. Some preliminary data do indicate that this procedure of estimating several unknowns at once may reduce the error rate substantially--largely because it allows for the introduction of bilateral predictor contexts, which are generally superior to either of the unilateral ones in predictive power. To give a firsthand picture of the kinds of errors which are typical of the probability section, several sentence-fragments in which "typical" errors occurred are reproduced in Table 6.
5. Summary and Conclusion
This investigation has provided evidence on several points relevant to grammatical coding of English words by computer. First, it has shown that words can be coded with considerable success using empirically derived probabilities as a partial basis for their allocation to syntactic classes. Second, it has indicated that this can be done not only with relatively little error, but also at a rapid rate and using only a fraction of the core storage normally available in large-scale computers. Third, it has shown that the error rate of the coder is highly sensitive to the number of words correctly coded by the earlier, deterministic phases of the algorithm. That is, the more already-coded context which is available to the probability tables, the better are the chances that the code for an unknown or ambiguous word will be estimated correctly.
It also seems clear that probabilities are not the whole
answer to grammatical coding but that they might be
most useful as an auxiliary device in a more traditional
coder such as the CGC. In such a program, the role of the
probability section would be chiefly to aid in the final
resolution of ambiguities and to code otherwise unidentifiable words.
RECEIVED DECEMBER, 1964; REVISED MARCH, 1965
REFERENCES
1. KLEIN, S., AND SIMMONS, R. F. A computational approach to grammatical coding of English words. J. ACM 10 (July 1963), 334-347.
2. KUNO, S., AND OETTINGER, A. G. Syntactic structure and ambiguity in English. Proc. 1963 Fall Joint Comput. Conf., Spartan Books, Baltimore, 1963, 397-418.
3. LINDSAY, R. K. The reading machine problem. Unpublished doctoral dissertation, Carnegie Inst. Tech., Pittsburgh, Pa., 1960.
4. MILLER, G. A. Some psychological studies of grammar. Amer. Psychol. 17 (Nov. 1962), 748-762.
5. OETTINGER, A. G. Automatic Language Translation. Harvard U. Press, Cambridge, Mass., 1960.
6. ROBERTS, P. Patterns of English. Harcourt, Brace, and World, Inc., New York, 1956.
7. SALTON, G., AND THORPE, R. W. An approach to the segmentation problem in speech analysis and language translation. In 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Vol. II; Natl. Physical Laboratory Symp. 13, H. M. Stat. Off., London, 1962, 704-724.
8. STOLZ, W. S. Syntactic constraint in spoken and written English. Unpublished doctoral dissertation, U. of Wisconsin, Madison, Wis., 1964.
9. TANNENBAUM, P. H., AND BREWER, R. K. Consistency of syntactic structure as a factor in journalistic style. Journalism Quart., in press.
10. THORNDIKE, E. L., AND LORGE, I. The Teachers' Word Book of 30,000 Words. Teachers College, Columbia U., 1944.