A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text
by Kenneth Ward Church
Presented by HyungSuk Won, NLP Lab., CSE, POSTECH, 1998-10-01

Introduction

Test: on a 400-word sample, the tagger is 99.5% correct.
- Most errors are attributable to defects in the lexicon; remarkably few errors are related to the inadequacies of the extremely over-simplified grammar (a trigram model).
- One might have thought that n-gram models were not adequate for the task, since it is well known that they are inadequate for determining grammaticality (e.g., long-distance dependencies). But for the tagging application the n-gram approximation may be acceptable, since long-distance dependencies do not seem to be very important.
- Leech, Garside and Atwell: 96.7% on the LOB corpus, using a bigram model modified with heuristics to cope with more important trigrams.

1. How Hard is Lexical Ambiguity?

Ordinary people who do not work in computational linguistics have a strong intuition that lexical ambiguity is unimportant; conversely, CL experts consider lexical ambiguity a major issue.
- "Time flies like an arrow."
- "Flying planes can be dangerous."
Practically any content word can be used as a noun, verb, or adjective, and local context is not always adequate to disambiguate.
However, as Marcus observed, most texts are not actually that hard in practice.
"Garden paths": "The horse raced past the barn fell."
After seeing sentences like these, one might conclude that assigning a single part of speech to each word is hopeless.

2. Lexical Disambiguation Rules

Fidditch's lexical disambiguation rule:
(defrule n+prep! ">[**n+prep]!=n[npstarters]")
; a preposition is more likely than a noun before a noun phrase
This type of rule can be captured with bigram and trigram statistics, which are much easier to obtain than Fidditch-type disambiguation rules.
If a parser does not use frequency information, then every possibility in the dictionary must be given equal weight, and parsing becomes very difficult. (ex.)
"the Holy See": a dictionary tends to focus on what is possible, not on what is likely (according to Webster's Seventh New Collegiate Dictionary, practically every word is ambiguous).

(ex.) "I see a bird"
[NP [N city] [N school] [N committee] [N meeting]]
[NP [N I] [N see] [N a] [N bird]]
[S [NP [N I] [N see] [N a]] [VP [V bird]]]

3. The Proposed Method

Word  | Part-of-speech possibilities (frequency)
I     | PPSS 5837, NP 1
see   | VB 771, UH 1
a     | AT 23013, IN 6 (French), NN 26
bird  | NN 26
(PPSS: pronoun, NP: proper noun, VB: verb, UH: interjection, IN: preposition, AT: article, NN: noun)

lexical probability:     prob(PPSS | "I") = freq(PPSS & "I") / freq("I")
contextual probability:  prob(VB | AT, NN) = freq(VB, AT, NN) / freq(AT, NN)

A search is performed to find the assignment of part-of-speech tags to words that optimizes the product of the lexical and contextual probabilities.

Process (expanding candidate tag sequences from right to left):
("NN")
("AT" "NN") ("IN" "NN")
("VB" "AT" "NN") ("VB" "IN" "NN") ("UH" "AT" "NN") ("UH" "IN" "NN")
=> PPSS VB IN NN, NP VB IN NN, PPSS UH IN NN, NP UH IN NN score less well than the sequences below and can be pruned, since the contextual scoring function has a limited window of three parts of speech.
("PPSS" "VB" "AT" "NN") ("NP" "VB" "AT" "NN") ("PPSS" "UH" "AT" "NN") ("NP" "UH" "AT" "NN")
=> pruned for the same reason
("" "PPSS" "VB" "AT" "NN") ("" "NP" "VB" "AT" "NN")
finally, ("" "" "PPSS" "VB" "AT" "NN")

4. Parsing Simple Non-Recursive Noun Phrases Stochastically

Similar stochastic methods are applied to locate simple noun phrases with very high accuracy.
Stochastic parser:
- input: a sequence of parts of speech
- processing: insert brackets corresponding to the beginning and end of noun phrases
- output: [A/AT former/AP top/NN aide/NN] to/IN [Attorney/NP ...] ...
(ex.) NN VB has the alternative bracketings: NN VB, [NN] VB, [NN VB], [NN] [VB], NN [VB]
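The search described above can be sketched in a few lines of Python. The lexical counts are the ones given for "I see a bird"; the contextual trigram probabilities below are made-up placeholders (the real model estimates them from a tagged corpus), and the brute-force enumeration stands in for the paper's pruned right-to-left search.

```python
# Sketch of Church-style tagging: choose the tag sequence maximizing the
# product of lexical and contextual (trigram) probabilities.
# Lexical counts come from the slide; contextual() uses ILLUSTRATIVE
# placeholder values, not real corpus estimates.
from itertools import product

# freq(tag & word) for each word, from the slide
LEXICON = {
    "I":    {"PPSS": 5837, "NP": 1},
    "see":  {"VB": 771, "UH": 1},
    "a":    {"AT": 23013, "IN": 6, "NN": 26},
    "bird": {"NN": 26},
}

def lexical(word, tag):
    counts = LEXICON[word]
    return counts[tag] / sum(counts.values())

def contextual(t1, t2, t3):
    # prob(t1 | t2, t3): placeholder values favouring the
    # pronoun-verb-article-noun reading, for illustration only.
    favoured = {("PPSS", "VB", "AT"), ("VB", "AT", "NN"), ("AT", "NN", "")}
    return 0.5 if (t1, t2, t3) in favoured else 0.01

def best_tagging(words):
    best, best_score = None, 0.0
    for tags in product(*(LEXICON[w] for w in words)):
        score = 1.0
        for w, t in zip(words, tags):
            score *= lexical(w, t)
        padded = list(tags) + ["", ""]   # pad the right context, as in the paper
        for i in range(len(words)):
            score *= contextual(padded[i], padded[i + 1], padded[i + 2])
        if score > best_score:
            best, best_score = tags, score
    return best

print(best_tagging(["I", "see", "a", "bird"]))
# -> ('PPSS', 'VB', 'AT', 'NN')
```

Even with crude placeholder context probabilities, the heavy lexical weight of a/AT (23013 vs. 6 for a/IN) already pushes the search toward the PPSS VB AT NN reading, which matches the final sequence on the slide.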
Probability of starting a noun phrase between a pair of adjacent tags (row = preceding tag, column = following tag):

        AT     NN     NNS    VB     IN
AT      0      0      0      0      0
NN      0.99   0.01   0      0      0
NNS     1      0.02   0.11   0      0
VB      1      1      1      0      0
IN      1      1      1      0      0

Probability of ending a noun phrase between a pair of adjacent tags:

        AT     NN     NNS    VB     IN
AT      0      0      0      0      0
NN      1      0.01   0      0      1
NNS     1      0.02   0.11   1      1
VB      0      0      0      0      0
IN      0      0      0      0      0.02

(AT: article, NN: singular noun, NNS: non-singular noun, VB: uninflected verb, IN: preposition)
These probabilities were estimated from about 40,000 words (11,000 noun phrases) of training material selected from the Brown Corpus.

5. Smoothing Issues

Zipf's Law: frequency ∝ 1 / rank, so many words appear rarely or never in the training material.
To alleviate the case of words that never appear, use a conventional dictionary: add 1 to the frequency count of each part-of-speech possibility listed in the dictionary.
Proper nouns and capitalized words: capitalized words with small frequency counts (< 20) were thrown out of the lexicon (ex.: Act/NP).
1. Add 1 for the proper-noun possibility:
(ex.) fall ((1 "JJ") (65 "VB") (72 "NN"))
      Fall ((1 "NP") (1 "JJ") (65 "VB") (72 "NN"))
2. Prepass: label words as proper nouns if they are "adjacent to" other capitalized words (ex.: White House, States of the Union).
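The bracketing pass can be sketched as follows. The START and END entries are the non-zero values from the slide's two tables; thresholding each boundary at 0.5 and the handling of the sentence start/end (where the tables say nothing) are illustrative simplifications of the paper's search over whole bracketings.

```python
# Sketch of the stochastic NP bracketer: between each pair of adjacent tags,
# consult the "start NP" and "end NP" probability tables and insert a bracket
# when the probability exceeds a threshold.  Table entries follow the slide;
# the threshold rule and sentence-boundary handling are simplifications.
START = {("NN","AT"):0.99, ("NN","NN"):0.01, ("NNS","AT"):1, ("NNS","NN"):0.02,
         ("NNS","NNS"):0.11, ("VB","AT"):1, ("VB","NN"):1, ("VB","NNS"):1,
         ("IN","AT"):1, ("IN","NN"):1, ("IN","NNS"):1}
END = {("NN","AT"):1, ("NN","NN"):0.01, ("NN","IN"):1, ("NNS","AT"):1,
       ("NNS","NN"):0.02, ("NNS","NNS"):0.11, ("NNS","VB"):1, ("NNS","IN"):1,
       ("IN","IN"):0.02}

def bracket(tags, threshold=0.5):
    if not tags:
        return ""
    out, inside = [], False
    # assumption: an NP may open at the very start of the sentence
    if tags[0] in ("AT", "NN", "NNS"):
        out.append("[")
        inside = True
    for left, right in zip(tags, tags[1:]):
        out.append(left)
        if inside and END.get((left, right), 0) > threshold:
            out.append("]")
            inside = False
        if not inside and START.get((left, right), 0) > threshold:
            out.append("[")
            inside = True
    out.append(tags[-1])
    if inside:
        out.append("]")   # close an NP left open at sentence end
    return " ".join(out)

print(bracket(["AT", "NN", "IN", "AT", "NN"]))
# -> [ AT NN ] IN [ AT NN ]
```

On the tag sequence AT NN IN AT NN ("a former aide to the attorney"-style input), the table lookups close the first noun phrase before the preposition and open a new one after it, matching the bracketed output shown in section 4.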