General Word Sense Disambiguation Method Based on a Full Sentential Context

Jiri Stetina, Sadao Kurohashi, Makoto Nagao
Graduate School of Informatics, Kyoto University
Yoshida-honmachi, Sakyo, Kyoto, 606-8501, Japan
{stetina, kuro, nagao}@kuee.kyoto-u.ac.jp

Abstract

This paper presents a new general supervised word sense disambiguation method based on a relatively small syntactically parsed and semantically tagged training corpus. The method exploits a full sentential context and all the explicit semantic relations in a sentence to identify the senses of all of that sentence's content words. In spite of a very small training corpus, we report an overall accuracy of 80.3% (85.7, 63.9, 83.6 and 86.5% for nouns, verbs, adjectives and adverbs, respectively), which exceeds the accuracy of a statistical sense-frequency based semantic tagging, the only really applicable general disambiguating technique.

1 Introduction

Identification of the right sense of a word in a sentence is crucial to any successful Natural Language Processing system. The same word can have different meanings in different contexts. The task of Word Sense Disambiguation is to determine the correct sense of a word in a given context. In most cases the correct word sense can be identified using only the words co-occurring in the same sentence. However, very often we also need to use the context of words that appear outside the given sentence. For this reason we distinguish two types of contexts: the sentential context and the discourse context. The sentential context is given by the words which co-occur with the word in a sentence and by their relations to this word, while the discourse context is given by the words outside the sentence and their relations to the word.
The problem that arises here is that most of the co-occurring words are also polysemous, and unless disambiguated they cannot fully contribute to the process of disambiguation. The senses of these words, however, also depend on the sense of the disambiguated word, and therefore there is a reciprocal dependency which we will try to resolve by the algorithm described in this paper.

2 The Task Specification

For our work, we used the word sense definitions as given in WordNet (Miller, 1990), which is comparable to a good printed dictionary in its coverage and distinction of senses. Since WordNet only provides definitions for content words (nouns, verbs, adjectives and adverbs), we are only concerned with identifying the correct senses of the content words.

Both for the training and for the testing of our algorithm, we used the syntactically analysed sentences of the Brown Corpus (Marcus, 1993), which have been manually semantically tagged (Miller et al., 1993) into semantic concordance files (SemCor). These files combine 103 passages of the Brown Corpus with the WordNet lexical database in such a way that every content word in the text carries both a syntactic tag and a semantic tag pointing to the appropriate sense of that word in WordNet. Passages in the Brown Corpus are approximately 2,000 words long, and each contains approximately 1,000 content words. The percentages of the nouns, verbs, adjectives and adverbs in the semantically tagged corpus, together with their average number of WordNet senses, are given in Table 1.

Table 1: Percentage of nouns, verbs, adjectives and adverbs and average number of senses

Category      Number     %       Average # of senses
NOUNS         48,534     45.5    5.4
VERBS         26,674     25.0    10.5
ADJECTIVES    19,743     18.5    5.5
ADVERBS       11,804     11.0    3.?
TOTAL         106,755    100.0   5.8

Although most of the words in a dictionary are monosemous, it is the polysemous words that occur most frequently in speech and text.
For example, over 80% of words in WordNet are monosemous, but almost 78% of the content words in the tested corpus had more than one sense, as shown in Table 2.

Table 2: Percentage of polysemous words in the corpus

Category      Number     Polysemous   %
NOUNS         48,534     38,279       78.9
VERBS         26,674     24,845       93.1
ADJECTIVES    19,743     13,315       67.4
ADVERBS       11,804     6,715        56.9
TOTAL         106,755    83,154       77.9

Assigning the most frequent sense (as defined by WordNet) to every content word in the corpus would result in an accuracy of 75.2%. Our aim is to create a word sense disambiguation system for identifying the correct senses of all content words in a given sentence, with an accuracy higher than would be achieved solely by use of the most frequent sense.

3 General Word Sense Disambiguation

The aim of the system described here is to take any syntactically analysed sentence on the input and assign each of its content words a pointer to an appropriate sense in WordNet. Because the words in a sentence are bound by their syntactic relations, all the words' senses are determined by their most probable combination in all the syntactic relations derived from the parse structure of the given sentence. It is assumed here that each phrase has one central constituent (head), and all other constituents in the phrase modify the head (modifiers). It is also assumed that there is no relation between the modifiers. The relations are explicitly present in the parse tree, where head words propagate up through the tree, each parent receiving its head word from its head-child. Every syntactic relation can also be viewed as a semantic relationship between the concepts represented by the participating words. Consider, for example, sentence (1), whose syntactic structure is given in Figure 1.

(1) The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place.

Each word in the above sentence is bound by a number of syntactic relations which determine the correct sense of the word. For example, the sense of the verb produced is constrained by the subject-verb relation with the noun investigation, by the verb-object relation with the noun evidence and by the subordinate clause relation with the verb said. Similarly, the verb said is constrained by its relations with the words Jury, Friday and produced; the sense of the noun investigation is constrained by the relation with the head of its prepositional phrase, election, and by the subject-verb relation with the verb produced, and so on.

The key to extraction of the relations is that any phrase can be substituted by the corresponding tree head-word (links marked bold in Figure 1). To determine the tree head-word we used a set of rules similar to that described by (Magerman, 1995)(Jelinek et al., 1994) and also used by (Collins, 1996), which we modified in the following way:

• The head of a prepositional phrase (PP → IN NP) was substituted by a function the name of which corresponds to the preposition, and its sole argument corresponds to the head of the noun phrase NP.

• The head of a subordinate clause was changed to a function named after the head of the first element in the subordinate clause (usually 'that' or a 'NULL' element) and its sole argument corresponds to the head of its second element (usually the head of a sentence).

Because we assumed that the relations within the same phrase are independent, all the relations are between the modifier constituents and the head of a phrase only. This is not necessarily true in some situations, but for the sake of simplicity we took the liberty to assume so. A complete list of applicable relations for sentence (1) is given in (2).
(2)
NP(NNP(County), NNP(Jury))
NP(NNP(Grand), NNP(Jury))
NP(NP(Atlanta), NP(election))
NP(JJ(recent), NP(election))
NP(JJ(primary), NN(election))
NP(NN(investigation), PP(of(election)))
S(NP(irregularities), VP(took))
VP(VBD(took), NP(place))
NP(NN(evidence), SBAR(that(took)))
S(NP(investigation), VP(produced))
VP(VBD(produced), NP(evidence))
VP(VBD(said), NP(Friday))
VP(VBD(said), SBAR(0(produced)))
S(NP(Jury), VP(said))

Figure 1: Example parse tree

Each of the extracted syntactic relations has a certain probability for each combination of the senses of its arguments. This probability is derived from the probability of the semantic relation of each combination of the sense candidates of the related content words. Therefore, the approach described here consists of two phases:

1. learning the semantic relations, and
2. disambiguation through the probability evaluation of relations.

4 Learning

First, every content word in every sentence in the training set was tagged by an appropriate pointer to a sense in WordNet. Second, using the parse trees of all the corpus sentences, all the syntactic relations present in the training corpus were extracted and converted into the following form:

(4) rel(PNT, MNT, HNT, MS, HS, RP)

where PNT is the phrase parent non-terminal, MNT the modifier non-terminal, HNT the head non-terminal, MS the semantic content (see below) of the modifier constituent, HS the semantic content of the head constituent and RP the relative position of the modifier and the head (RP = 1 indicates that the modifier precedes the head, while for RP = 2 the head precedes the modifier). Relations involving non-content modifiers were ignored. Synsets of the words not present in WordNet were substituted by the words themselves. The semantic content was either a WordNet sense identifier (synset) or, in the case of prepositional and subordinate phrases, a function of the preposition (or a null element) and the sense identifier of the second phrase constituent.

5 Disambiguation Algorithm

As mentioned above, we assumed that all the content words in a sentence are bound by a number of syntactic relations. Every content word can have several meanings, but each of these meanings has a different probability, which is given by the set of semantic relations in which the word participates. Because every relation has two arguments (a head and its modifier), the probability of each sense also depends on the probability of the sense of the other participant in the relation. The task is to select such a combination of senses for all the content words that the overall relational probability is maximal. If, for any given sentence, we had extracted N syntactic relations $R_i$, the overall relational probability for the combination of senses X would be:

(5) $ORP(X) = \prod_{i=1}^{N} p(R_i \mid X)$

where $p(R_i \mid X)$ is the probability of the i-th relation given the combination of senses X. If we consider that the average word sense ambiguity in the corpus is 5.8 senses, a sentence with 10 content words would have $5.8^{10}$ possible sense combinations, leading to a combinatorial explosion of over 43,080,420 overall probability combinations, which is not feasible. Also, with a very small training corpus, it is not possible to estimate the sense probabilities very accurately.
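To make the cost argument concrete, the following minimal Python sketch evaluates (5) by exhaustive search on a toy sentence and reproduces the $5.8^{10}$ figure. The encoding of a relation as a (modifier index, head index, probability table) triple, the function names and the toy numbers are assumptions made purely for illustration; they are not taken from the paper.

```python
from itertools import product
from math import prod


def count_combinations(avg_senses, n_words):
    """Number of sense combinations for n_words words with avg_senses senses each."""
    return avg_senses ** n_words


def orp(assignment, relations):
    """Overall relational probability (5): product of p(R_i | X) over all relations.

    `relations` uses a hypothetical encoding: (modifier_index, head_index, table),
    where table[modifier_sense][head_sense] is that relation's probability estimate.
    """
    return prod(table[assignment[m]][assignment[h]] for m, h, table in relations)


def best_combination(sense_counts, relations):
    """Exhaustively score every sense combination X and return the best one."""
    candidates = product(*(range(k) for k in sense_counts))
    return max(candidates, key=lambda x: orp(x, relations))


# Toy sentence: word 1 is the head, words 0 and 2 modify it (made-up numbers).
toy_relations = [
    (0, 1, [[0.1, 0.4], [0.3, 0.2]]),
    (2, 1, [[0.5, 0.1], [0.1, 0.3]]),
]
best = best_combination([2, 2, 2], toy_relations)
print(best, orp(best, toy_relations))

# Size of the search space for 10 content words with 5.8 senses on average.
print(round(count_combinations(5.8, 10)))
```

The search visits every element of the Cartesian product of the sense sets, so its cost grows exponentially with sentence length, which is the reason given above for avoiding it.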
Therefore, we have opted for a hierarchical disambiguation approach based on similarity measures between the tested and the training relations, which we will describe in Section 5.2. First, however, we describe the part of the probabilistic model which assigns probability estimates to the individual sense combinations based on the semantic relations acquired in the learning phase.

5.1 Relational Probability Estimate

Consider, for example, the syntactic relation between a head noun and its adjectival modifier derived from NP → JJ NN. Let us assume that the number of senses in WordNet is k for the adjective and l for the noun. The number of possible sense combinations is therefore m = k · l. The probability estimate of a sense combination (i,j) in the relation R, where i is the sense of the modifier (the adjective in this example) and j is the sense of the head (the noun in this example), is calculated as follows:

(6) $p_R(i,j) = \frac{f_R(i,j)}{\sum_{o=1}^{k} \sum_{p=1}^{l} f_R(o,p)}$

where $f_R(i,j)$ is a score of co-occurrences of the modifier sense i with the head word sense j, among the same semantic relations R extracted during the learning phase. Please note that because $f_R(i,j)$ is not a count but rather a score of co-occurrences (defined below), $p_R(i,j)$ is not a real probability but rather its approximation. Because the occurrence count is replaced by a similarity score, the sparse data problem of a small training corpus is substantially reduced. The score of co-occurrences is defined as a sum of hits of similar pairs, where a hit is a multiplication of the similarity measures, sim(i,x) and sim(j,y), between both participants, i.e.:

(7) $f_R(i,j) = \sum_{(x,y) \in R} sim(i,x) \cdot sim(j,y)$

where the sum runs over the r relations (x,y) of the same type (for the above example R = rel(NP,ADJ,NOUN,x,y,1)) found in the training corpus. To emphasise the sense-restricting contribution of each example found, every pair (x,y) is restricted to contributing to only one sense combination (i,j): every example pair (x,y) contributes only to such a combination for which sim(i,x) · sim(j,y) is maximal. $f_R(i,j)$ represents a sum of all hits in the training corpus for the sense combination (i,j). Because the similarity measure (see below) has a value between 0 and 1 and each hit is a multiplication of two similarities, its value is also between 0 and 1.

The reason why we used a multiplication of similarities was to eliminate the contributions of examples in which one participant belonged to a completely different semantic class. For example, the training pair new airport makes no contribution to the probability estimate of any sense combination of new management, because neither of the two senses of the noun management (group or human activity) belongs to the same semantic class as airport (entity). On the other hand, new airport would contribute to the probability estimate of the sense combination of modern building, because one sense of the adjective modern is synonymous with one sense of the adjective new, and one sense of the noun building belongs to the same conceptual class (entity) as the noun airport. The situation is analogous for all other relations. The reason why we used a count modified by the semantic distances, rather than a count of exact matches only, was to avoid situations where no match would be found due to the sparse data, a problem of many small training corpora.

Every semantic relation can be represented by a relational matrix, which is a matrix whose first coordinate represents the sense of the modifier, the second coordinate represents the sense of the head, and the value at the coordinate position (i,j) is the estimate of the probability of the sense combination (i,j) computed by (6). An example of a relational matrix for the adjective-noun relation modern building based on two training examples (new airport and classical music) is given in Figure 2. Naturally, the more examples there are, the more fields of the matrix get filled. The training examples have an accumulative effect on the matrix, because the sense probabilities in the matrix are calculated as a sum of 'similarity based frequency scores' of all examples (7) divided by the sum of all matrix entries (6). The most likely sense combination scores the highest value in the matrix. Each semantic relation has its own matrix. The way all the relations are combined is described in Section 5.2.

5.1.1 Semantic Similarity

We base the definition of the semantic similarity between two concepts (concepts are defined by their WordNet synsets a, b) on their semantic distance, as follows:

(8) $sim(a,b) = 1 - sd(a,b)^2$

The semantic distance sd(a,b) is squared in the above formula in order to give a bigger weight to closer matches. The semantic distance is calculated as follows.

Semantic Distance for Nouns and Verbs

$sd(a,b) = \frac{1}{2}\left(\frac{D_1 - D}{D_1} + \frac{D_2 - D}{D_2}\right)$

where $D_1$ is the depth of synset a, $D_2$ is the depth of synset b, and D is the depth of their nearest common ancestor in the WordNet hierarchy. If a and b have no common ancestor, sd(a,b) = 1.

If any of the participants in the semantic distance calculation is a function (derived from a prepositional phrase or subordinate clause), the distance is equal to the distance of the function arguments for the same functor, or equals 1 for different functors. For example, sd(of(sense1), of(sense2)) = sd(sense1, sense2), while sd(of(sense1), about(sense2)) = 1, no matter what sense1 and sense2 are.

Semantic Distance for Adjectives

sd(a,b) = 0 for the same adjectival synsets (incl. synonymy),
sd(a,b) = 0 for the synsets in antonymy relations, i.e. for ant(a,b),
sd(a,b) = 0.5 for the synsets in the same similarity cluster,
sd(a,b) = 0.5 if a belongs to the same similarity cluster as c and b is the antonym of c (indirect antonymy),
sd(a,b) = 1 for all other synsets.

Semantic Distance for Adverbs

sd(a,b) = 0 for the same synsets (incl. synonymy),
sd(a,b) = 0 for the synsets in antonymy relation ant(a,b),
sd(a,b) = 1 for all other synsets.

Figure 2: Relational matrix based on two training examples

Example to disambiguate: MODERN(X) BUILDING(Y): rel(NP,ADJ,NOUN,X,Y,1)
Example training set:
  NEW(9) AIRPORT: rel(NP,ADJ,NOUN,3006112602,102055456,1)
  CLASSICAL(1) MUSIC(3): rel(NP,ADJ,NOUN,300306289,100313161,1)
Semantic distances:
  sd(AIRPORT,BUILDING(1)) = 1/2(3/6 + 2/5) = 0.45
  sd(AIRPORT,BUILDING(2)) = 1.0
  sd(AIRPORT,BUILDING(3)) = 1.0
  sd(BUILDING(2),BUILDING(3)) = 1/2(4/5 + 4/5) = 0.8
  sd(BUILDING(2),MUSIC(3)) = 1/2(3/5 + 2/4) = 0.55
Similarities:
  sim(AIRPORT,BUILDING(1)) = 1 - 0.45^2 = 0.8
  sim(MUSIC(3),BUILDING(2)) = 1 - 0.55^2 = 0.7
  sim(CLASSICAL(1),MODERN(3)) = 1 - 0.5^2 = 0.75
Scores and probability estimates:
  fR(5,1) = sim(NEW(9),MODERN(5)) · sim(AIRPORT,BUILDING(1)) = 1.0 · 0.8 = 0.8
  fR(3,2) = sim(CLASSICAL(1),MODERN(3)) · sim(MUSIC(3),BUILDING(2)) = 0.75 · 0.7 = 0.53
  pR(5,1) = fR(5,1)/sum(fR(i,j)) = 0.8/(0.8 + 0.53) = 0.6
  pR(3,2) = fR(3,2)/sum(fR(i,j)) = 0.53/(0.8 + 0.53) = 0.4
Relational matrix: pR(5,1) = 0.6 and pR(3,2) = 0.4; all other entries are 0.0.

5.2 Hierarchical Disambiguation

This section describes the main part of the algorithm, i.e. the disambiguation process based on the overall probability estimate of sentential relations. As we have outlined above, for computational reasons it is not feasible to evaluate overall probabilities for all the sense combinations. Instead, we take advantage of the hierarchical structure of each sentence and arrive at the optimum combination of its word senses in a process which has two parts:

1. bottom-up propagation of the head word sense scores, and
2. top-down disambiguation.

5.2.1 Bottom-up head word sense score propagation

In compliance with our assumption that all the semantic relations are only between a head word and its modifiers at any syntactic level, the modifiers do not participate in any relation with an element outside their parent phrase. As depicted in the example in Figure 1, it is only the head word concepts that propagate through the parse tree and that participate in semantic relations with concepts on other levels of the parse tree. The modifiers (which are heads themselves at lower tree levels), however, play an important role in constraining the head-word senses. The number of relations derived at each level of the tree depends on the number of concepts that modify the head. Each of these relations contributes to the score of each sense of the head word.

We define the sense score vector of a word w as a vector of scores of each WordNet sense of the word w. The initial sense score vector of the word w is given by its contextually independent sense distribution in the whole training corpus. Because the training corpus is relatively small, and because it always excludes the tested file, an appropriate sense of the word w may not be present in it at all. Therefore, each sense i of the word w is always given a non-zero initial score $p_i(w)$ (9a):

(9a) $p_i(w) = \frac{count(w)_i + 1}{\sum_{j=1}^{n} (count(w)_j + 1)}$

where $count(w)_i$ is the number of occurrences of the sense i of the word w in the entire training corpus, and n is the number of different WordNet senses of the word w.

The sense score vectors of head words propagate up the tree. At each level, they are modified by all the semantic relations with their modifiers which occur at that level. Also, the sense score vectors of head words are used to calculate the matrices of the sense score vectors of the modifiers. This is done as follows:

Let $H = [h_1, h_2, ..., h_k]$ be the sense score vector of the head word h. Let $T = [R_1, R_2, ..., R_n]$ be a set of relations between the head word h and its modifiers.

1. For each semantic relation $R_i \in T$ between the head word h and a modifier $m_i$ with sense score vector $M_i = [o_{i1}, o_{i2}, ..., o_{il}]$, do:
   1.1 Using (6), calculate the relational matrix $R_i(m,h)$ of the relation $R_i$.
   1.2 For each $o_{iu} \in M_i$, multiply all the elements of $R_i(m,h)$ whose modifier sense m is u by $o_{iu}$, yielding $Q_i$, the sense score matrix of the modifier $m_i$.
2. The new sense score vector of the head word h is now $G = [g_1, g_2, ..., g_k]$, where

(10) $g_j = h_j \cdot \frac{L_j}{L}$

$L_j$ represents the score of the head word sense j based on the matrices Q calculated in step 1, i.e.:

(11) $L_j = \sum_{i=1}^{n} \max_u x_i(j,u)$

where $x_i(j,u) \in Q_i$ and $\max_u x_i(j,u)$ is the highest score in the line of the matrix $Q_i$ which corresponds to the head word sense j, n is the number of modifiers of the head word h at the current tree level, and

$L = \sum_{j=1}^{k} L_j$

where k is the number of senses of the head word h.
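Steps 1.1 to 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equation (10) is read as $g_j = h_j \cdot L_j / L$, matrices are stored as nested lists indexed `[modifier_sense][head_sense]`, the relational matrices are taken as given rather than computed from (6), and leaving the head vector unchanged when no relation scores a hit is a choice made here, not something the paper specifies. The function names and the input vectors are hypothetical; the two non-zero matrix entries are the values 0.6 and 0.4 from the Figure 2 example.

```python
def sense_score_matrix(rel_matrix, modifier_scores):
    """Step 1.2: scale each modifier-sense row of R_i by that sense's score, giving Q_i."""
    return [[p * score for p in row]
            for row, score in zip(rel_matrix, modifier_scores)]


def propagate(head_scores, modifiers):
    """One bottom-up step for a head word.

    head_scores: sense score vector H of the head.
    modifiers:   list of (modifier_scores M_i, relational_matrix R_i) pairs.
    Returns the new head vector G and the matrices Q_i (kept for the top-down pass).
    """
    q_matrices = [sense_score_matrix(r, m) for m, r in modifiers]
    k = len(head_scores)
    # (11): for every head sense j, add up the best entry of each Q_i for that sense.
    l_scores = [sum(max(row[j] for row in q) for q in q_matrices) for j in range(k)]
    total = sum(l_scores)
    if total == 0:
        # Assumption: with no hit at all, leave the head vector unchanged.
        return list(head_scores), q_matrices
    # (10), as read here: rescale each head score by its share of the relation scores.
    return [h * lj / total for h, lj in zip(head_scores, l_scores)], q_matrices


# "modern building": 5 modifier senses, 3 head senses (zero-based indices).
relational = [[0.0, 0.0, 0.0] for _ in range(5)]
relational[4][0] = 0.6   # pR(5,1) in Figure 2
relational[2][1] = 0.4   # pR(3,2) in Figure 2
building = [0.5, 0.3, 0.2]   # hypothetical initial scores from (9a)
modern = [0.2] * 5           # hypothetical uniform modifier scores

new_building, q = propagate(building, [(modern, relational)])
print(new_building)
```

Returning the matrices $Q_i$ alongside G mirrors the description above, in which the sense score matrices of the modifiers are computed during the bottom-up pass and kept for later use.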
The reason why $g_j$ (10) is calculated as a sum of the best scores (11), rather than by using the traditional maximum likelihood estimate (Berger et al., 1996)(Gale et al., 1993), is to minimise the effect of the sparse data problem. Imagine, for example, the phrase VP → VB NP PP, where the head verb VB is in the object relation with the head of the noun phrase NP and also in the modifying relation with the head of the prepositional phrase PP. Let us also assume that the correct sense of the verb VB is a. Even if the verb-object relation provided a strong selectional support for the sense a, if there was no example in the training set for the second relation (between VB and PP) which would score a hit for the sense a, multiplying the scores of that sense derived from the first and from the second relation respectively would give a zero probability for this sense and thus prevent its correct assignment.

The newly created head word sense score vector G propagates upwards in the parse tree and the same process repeats at the next syntactic level. Note that at the higher level, depending on the head extraction rules described in Section 3, the roles may be changed and the former head word may become a modifier of a new head (and participate in the above calculation as a modifier). The process repeats itself until the root of the tree is reached. The word sense score vector which has reached the root represents a vector of scores of the senses of the main head word of the sentence (the verb said in the example in Figure 1), which is based on the whole syntactic structure of that sentence. The sense with the highest score is selected and the sentence head is disambiguated.

5.2.2 Top-down Disambiguation

Having ascertained the sense of the sentence head, the process of top-down disambiguation begins. The top-down disambiguation algorithm, which starts with the sentence head, can be described recursively as follows:

Let l be the sense of the head word h on the input. Let $M = [m_1, m_2, ..., m_x]$ be the set of the modifiers of the head word h. For every modifier $m_i \in M$, do:

1. In the sense score matrix $Q_i$ of the modifier $m_i$ (calculated in step 1.2 of the bottom-up phase), find all the elements $x(k_i, l)$, where l is the sense of the head h.
2. Assign the modifier $m_i$ such a sense k = k' for which the value $x(k_i, l)$ is maximum. In the case of a draw, choose the sense which is listed as more frequent in WordNet.
3. If the modifier $m_i$ has descendants in the parse tree, call the same algorithm again with $m_i$ being the head and k being its sense, else end.

The disambiguation of the modifiers (which become heads at lower levels of the parse tree) is based solely on those lines of their sense score matrices which correspond to the sense of the head they are in relation with. This is possible because of our assumption that the modifiers are related only to their head words, and that there is no relation among the modifiers themselves. To what extent this assumption holds in real life sentences, however, has yet to be investigated.

6 Discourse Context

(Yarowsky, 1995) pointed out that the sense of a target word is highly consistent within any given document (one sense per discourse). Because our algorithm does not consider the context given by the preceding sentences, we have conducted the following experiment to see to what extent the discourse context could improve the performance of the word sense disambiguation: using the semantic concordance files (Miller et al., 1993), we have counted the occurrences of content words which previously appear in the same discourse file. The experiment indicated that the "one sense per discourse" hypothesis works fairly well for nouns; however, the evidence is much weaker for verbs, adverbs and adjectives. Table 3 shows the numbers of content words which appear previously in the same discourse with the same meaning (same synset), and those which appear previously with a different meaning. The experiment also confirmed our expectation that the ratio of words with the same sense to those with a different sense depends on the distance of sentences in which the same words appear (distance 1 indicates that the same word appeared in the previous sentence, distance 2 that the same word was present 2 sentences before, etc.).

Table 3: Number of words with the same and different sense as its previous occurrence in the same discourse (shortened)

                    Has predecessor with the same sense    Has predecessor with a different sense
Distance            NOUNS    VERBS   ADJs    ADVs          NOUNS   VERBS   ADJs   ADVs
anywhere            15,373   6,923   5,523   3,812         2,057   5,227   933    830
<10                 9,474    3,697   2,733   1,672         649     2,521   258    214
<5                  6,892    2,426   1,834   1,000         355     1,561   104    135
(label missing)     5,964    2,065   1,566   841           290     1,269   104    82
<3                  4,797    1,578   1,219   614           208     929     83     55
<2                  3,039    986     733     348           103     555     42     27

Table 4: Resulting accuracy [%]

CONTEXT        NOUNS   VERBS   ADJs   ADVs   TOTAL
First sense    77.8    61.7    81.9   84.5   75.2
Sentence       84.2    63.6    82.9   86.3   79.4
+Discourse     85.7    63.9    83.6   86.5   80.3

We have modified the disambiguation algorithm to make use of the information gained by the above experiment in the following way: all the disambiguated words and their senses are stored. The words of all the input sentences are first compared with the set of these stored word-sense pairs. If the same word is found in the set, the initial sense score assigned to it by (9a) is modified using Table 3, so that the sense which has been previously assigned to the word gets higher priority. The calculation of the initial sense score (9a) is thus replaced by (9b):

(9b) $p_i(w) = e(POS,SN) \cdot \frac{count(w)_i + 1}{\sum_{j=1}^{n} (count(w)_j + 1)}$

where e(POS,SN) is the probability that the word with syntactic category POS which already occurred SN sentences before has the same sense as its previous occurrence.
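As an illustration of (9a) and (9b), the sketch below derives e(POS,SN) from the Table 3 counts and applies it to an initial sense score vector, following the paper's description that the previously assigned sense is weighted by e(POS,SN) and every other sense by 1 - e(POS,SN). The dictionary layout, function names and the sense counts passed in are hypothetical; only the two noun counts for the shortest distance (3,039 and 103) come from Table 3.

```python
# Counts for nouns whose previous occurrence was in the preceding sentence (Table 3).
SAME_SENSE = {("NOUN", 1): 3039}
DIFFERENT_SENSE = {("NOUN", 1): 103}


def e(pos, sn):
    """Estimated probability that a word keeps the sense it had `sn` sentences earlier."""
    same = SAME_SENSE[(pos, sn)]
    different = DIFFERENT_SENSE[(pos, sn)]
    return same / (same + different)


def initial_scores(counts, previous_sense=None, weight=1.0):
    """Add-one smoothed sense scores (9a), optionally reweighted by discourse context (9b).

    counts:         occurrences of each sense of the word in the training corpus.
    previous_sense: index of the sense assigned earlier in the discourse, if any.
    weight:         e(POS, SN) for that earlier occurrence.
    """
    total = sum(c + 1 for c in counts)
    base = [(c + 1) / total for c in counts]
    if previous_sense is None:
        return base  # no earlier occurrence: the factor is 1 for every sense
    return [score * (weight if i == previous_sense else 1.0 - weight)
            for i, score in enumerate(base)]


noun_weight = e("NOUN", 1)
plain = initial_scores([3, 1, 0])                      # hypothetical sense counts
boosted = initial_scores([3, 1, 0], previous_sense=1, weight=noun_weight)
print(round(noun_weight, 3), plain, boosted)
```

With these inputs the second sense, although less frequent in training, outranks the first once its previous occurrence in the discourse is taken into account.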
If, for example, the same noun has occurred in the previous sentence (SN = 1) where it was assigned sense n, the probability of sense n of the same noun in the current sentence is multiplied by e(NOUN,1) = 3,039/(3,039 + 103) = 0.967, while all the probabilities of its remaining senses are multiplied by 1 - 0.967 = 0.033. If no match is found, i.e. the word has not previously occurred in the discourse, e(POS,SN) is set to 1 for all senses.

7 Evaluation

To evaluate the algorithm, we randomly selected 15 files (with a total of 18,413 content words tagged in SemCor) from the set of 103 files of the sense tagged section of the Brown Corpus. Each tested file was removed from the set and the remaining 102 files were used for learning (Section 4). Every sense assigned by the hierarchical disambiguation algorithm (Section 5) was compared with the sense from the corresponding semantic concordance file. Table 4 shows the achieved accuracy compared with the accuracy which would be achieved by a simple use of the most frequent sense.

As the above table shows, the accuracy of the word sense disambiguation achieved by our method was better than using the first sense for all lexical categories. In spite of a very small training corpus, the overall word sense accuracy exceeds 80%.

8 Related Work

To our knowledge, there is no current method which attempts to identify the senses of all words in whole sentences, so we cannot make a practical comparison. Similarly to our work, (Resnik, 1995)(Agirre and Rigau, 1996) challenge the fine-grainedness of WordNet, but their work is limited to nouns only. (Agirre and Rigau, 1996) report coverage 86.2%, precision 71.2% and recall 61.4% for nouns in four randomly selected semantic concordance files. From among the methods based on semantic distance, (Resnik, 1993)(Sussna, 1993) use a similar semantic distance measure for two concepts in WordNet, but they also focus on a selected group of nouns only.
(Karov and Edelman, 1996) use an interesting iterative algorithm and attempt to solve the sparse data bottleneck by using a graded measure of contextual similarity. They achieve 90.5, 92.5, 94.8 and 92.3 percent accuracy in distinguishing between two senses of the nouns drug, sentence, suit and player, respectively. (Yarowsky, 1995), whose training corpus for the noun drug was 9 times bigger than that of Karov and Edelman, reports 91.4% correct performance, improved to an impressive 93.9% when using the "one sense per discourse" constraint. These methods, however, focus on only two senses of a very limited number of nouns and therefore are not comparable with our approach.

9 Conclusion

This paper presents a new general approach to word sense disambiguation. Unlike most of the existing methods, it identifies the senses of all content words in a sentence based on an estimation of the overall probability of all semantic relations in that sentence. By using the semantic distance measure, our method reduces the sparse data problem, since the training examples and their contexts do not have to match the disambiguated words exactly. All the semantic relations in a sentence are combined according to the syntactic structure of the sentence, which makes the method particularly suitable for integration with a statistical parser into a powerful Natural Language Processing system. The method is designed to work with any type of common text and is capable of distinguishing among many word senses. It has a very wide scope of applicability and is not limited to only one part-of-speech.

References

E. Agirre and G. Rigau. 1996. Word sense disambiguation using conceptual density. In Proc. of COLING, pages 16-22.

A. Berger, V. Pietra, and S. Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-72.

M. Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proc. of the 34th Annual Meeting of the ACL, pages 184-191.

W. Gale, K. Church, and D. Yarowsky. 1993. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.

F. Jelinek, J. Lafferty, D. Magerman, R. Mercer, A. Ratnaparkhi, and S. Roukos. 1994. Decision tree parsing using a hidden derivation model. In Proc. of the ARPA Human Language Technology Workshop, pages 272-277.

Y. Karov and S. Edelman. 1996. Learning similarity-based word sense disambiguation from sparse data. In Proc. of the 3rd Workshop on Very Large Corpora, pages 42-55.

D. Magerman. 1995. Statistical decision-tree models for parsing. In Proc. of the 33rd Annual Meeting of the ACL, pages 276-283.

M. Marcus. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

G. Miller, C. Leacock, and R. Tengi. 1993. A semantic concordance. In Proc. of the ARPA Human Language Technology Workshop, pages 303-308.

G. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-312.

P. Resnik. 1993. Semantic classes and syntactic ambiguity. In Proc. of the ARPA Human Language Technology Workshop, pages 278-283.

P. Resnik. 1995. Disambiguating noun groupings with respect to WordNet senses. In Proc. of the 3rd Workshop on Very Large Corpora.

M. Sussna. 1993. Word sense disambiguation for free-text indexing using a massive semantic network. In Proc. of the Second International Conference on Information and Knowledge Management, pages 67-74.

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of the 33rd Annual Meeting of the ACL.