Natural Language Processing: Lexical Acquisition
Lecture 8
Pusan National University
Minho Kim
([email protected])
Contents
• Introduction
• Evaluation Measures
• Verb Subcategorization
• Attachment Ambiguity
• Selectional Preferences
• Semantic Similarity
Lexical acquisition
• Develop algorithms and statistical techniques for filling the holes in existing dictionaries and lexical resources by looking at the occurrences of patterns of words in large text corpora
– Collocations
– Selectional preferences
– Subcategorization
– Semantic categorization
General dictionary – 표준국어대사전 (Standard Korean Language Dictionary)
• Headword (표제어)
• Pronunciation information (발음 정보)
• Conjugation information (활용 정보)
• Definition (정의문)
• Example sentences (예문)
Electronic dictionary – 세종전자사전 (Sejong Electronic Dictionary)
The limits of hand-encoded lexical resources
• Manual construction of lexical resources is very costly
• Because language keeps changing, these resources have to be continuously updated
• Quantitative information (e.g., frequencies, counts) has to be computed automatically anyway
Lexical acquisition
• Examples:
– “insulin” and “progesterone” are in WordNet 2.1 but “leptin” and “pregnenolone” are not.
– “HTML” and “SGML”, but not “XML” or “XHTML”.
• We need some notion of word similarity to know where to locate a new word in a lexical resource
Evaluation Measures
Performance of a system
• Information retrieval engine
– A user wants to find documents related to “NLP”
– There are in fact many documents related to “NLP”
– The search returns a single document
– That document is indeed related to “NLP”
– Is this a good retrieval system?
• Speller
– A user types five words
– Three of the five words are misspelled
– The system flags all five words as errors
– Is this a good spelling checker?
• Word sense disambiguation system
– The system always decides that the ambiguous word “사과” means “apple” (it can also mean “apology”)
– Is this a good system?
Evaluation: precision and recall
[Diagram: the selected set and the target set overlap in the true positives (tp); the rest of the selected set is false positives (fp), the rest of the target set is false negatives (fn), and everything outside both sets is true negatives (tn).]
• For many problems, we have a set of targets
– Information retrieval: relevant documents
– Spelling error correction: error words
– Word sense disambiguation: sense of the ambiguous word
• Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness.
Evaluation: precision and recall
• Used in information retrieval
– A perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved)
– A perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved)
– Precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved
– Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved)
Evaluation: precision and recall
• Joint distribution of the variables:

                        Actual: target           Actual: ¬target
  System: selected      tp (true positives)      fp (false positives)
  System: ¬selected     fn (false negatives)     tn (true negatives)

• False positives = Type I errors
• False negatives = Type II errors
• In these terms: precision P = tp / (tp + fp), recall R = tp / (tp + fn)
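These definitions are easy to state in code. A minimal Python sketch (not from the original slides), using the tp/fp/fn cells of the table above and the speller example from the earlier slide:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of selected items that are correct: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of target items that were selected: tp / (tp + fn)."""
    return tp / (tp + fn)

# Speller example: 3 of 5 typed words are errors, the system flags all 5.
tp, fp, fn = 3, 2, 0
print(precision(tp, fp))  # 0.6 -- only 3 of the 5 flagged words are real errors
print(recall(tp, fn))     # 1.0 -- every real error was flagged
```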
Computing Recall/Precision

 n   doc #   relevant
 1    588       x
 2    589       x
 3    576
 4    590       x
 5    986
 6    592       x
 7    984
 8    988
 9    578
10    985
11    103
12    591
13    772       x
14    990

Let the total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
One relevant document is missing from the ranking, so we never reach 100% recall.
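A short Python sketch (not from the slides) that reproduces the recall/precision points above; the doc ID 999 is a hypothetical stand-in for the one relevant document the system never retrieves:

```python
ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772, 999}  # 6 relevant docs; 999 is never retrieved

hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        r = hits / len(relevant)   # recall after this hit
        p = hits / rank            # precision after this hit
        print(f"R={hits}/{len(relevant)}={r:.3f}; P={hits}/{rank}={p:.2f}")
# R=1/6=0.167; P=1/1=1.00
# R=2/6=0.333; P=2/2=1.00
# R=3/6=0.500; P=3/4=0.75
# R=4/6=0.667; P=4/6=0.67
# R=5/6=0.833; P=5/13=0.38
```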
Precision and recall: trade-off
[Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). The ideal system sits at the top-right corner (P = 1, R = 1). A system in the top-left returns only relevant documents but misses many useful ones; a system in the bottom-right returns most relevant documents but includes lots of junk.]
F measure
• Combine precision and recall into a single measure of overall performance:

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}

P : precision
R : recall
α : a factor which determines the weighting of P and R
α = 0.5 is often chosen for equal weighting, which gives the harmonic mean F = 2PR / (P + R)
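A minimal sketch of this formula in Python; the example values (P = 0.6, R = 1.0) are the speller numbers from the earlier slide:

```python
def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall:
    F = 1 / (alpha/P + (1 - alpha)/R)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

print(f_measure(0.6, 1.0))             # 0.75 (alpha = 0.5: F = 2PR/(P+R))
print(f_measure(0.6, 1.0, alpha=0.9))  # 0.625 -- weights precision more heavily
```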
Verb Subcategorization
Finding collocations
• Frequency
– If two words occur together a lot, that may be evidence that they have a special function
– But if we sort pairs by frequency C(w1, w2), then “of the” is the most frequent pair
– Filter by POS patterns: A N (linear function), N N (regression coefficients), etc.
• Mean and variance of the distance between the words
– For non-contiguous collocations:
• She knocked at his door (d = 2)
• A man knocked on the metal front door (d = 4)
• Hypothesis testing (see Stat NLP, page 162)
– How do we know it’s really a collocation?
– A low mean distance can be accidental (“new company”)
– We need to know whether two words occur together by chance or not (because they are a collocation)
Finding collocations
• Mutual information measure
– A measure of how much a word tells us about the other, i.e. the reduction in uncertainty of one word due to knowing about the other
– 0 when the two words are independent

I(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}

– (see Stat NLP, pages 66 and 178)
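A small Python sketch of this measure, estimating the probabilities from corpus counts; the counts below are illustrative, not from the slides:

```python
import math

def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information I(x, y) = log2 [ p(x,y) / (p(x) p(y)) ]."""
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts from a corpus of N bigrams.
N = 14_307_668
c_x, c_y, c_xy = 42, 20, 20            # C(w1), C(w2), C(w1 w2)
print(pmi(c_xy / N, c_x / N, c_y / N))  # ~18.4: strong association
```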
Main verbs
• Transitive
– requires a direct object (found with questions: what? or whom?)
• ?The child broke.
• The child broke a glass.
• Intransitive
– does not require a direct object.
• The train arrived.
• Some verbs can be both transitive and intransitive
– The ship sailed the seas. (transitive)
– The ship sails at noon. (intransitive)
– I met my friend at the airport. (transitive)
– The delegates met yesterday. (intransitive)
Verb Phrases
• VP --> head-verb complements adjuncts
• Some VPs:
– Verb: eat.
– Verb NP: leave Montreal.
– Verb NP PP: leave Montreal in the morning.
– Verb PP: leave in the morning.
– Verb S: think I would like the fish.
– Verb VP: want to leave. / want to leave Montreal. / want to leave Montreal in the morning. / want to want to leave Montreal in the morning.
Verb Subcategorization
• Verb subcategorization: verbs subcategorize for different syntactic categories.
• Verb subcategorization frame: a particular set of syntactic categories that a verb can appear with is called a subcategorization frame.
Subcategorization frames
• Some verbs can take complements that others cannot
I want to fly.
* I find to fly.
• Verbs are subcategorized according to the complements they can take --> subcategorization frames
– traditionally: transitive vs intransitive
– nowadays: up to 100 subcategories / frames

Frame          Verb            Example
empty          eat, sleep      I eat.
NP             prefer, find    I prefer apples.
NP NP          show, give      Show me your hand.
PPfrom PPto    fly, travel     I fly from Montreal to Toronto.
VPto           prefer, want    I prefer to leave.
S              mean            Does this mean you are going to leave me?
Subcategorization frames: parsing
• Subcategorization is important for parsing
a) She told the man where Peter grew up.
b) She found the place where Peter grew up.
– tell → NP S: “the man” is the object and the where-clause is a complement of tell
– find → NP: the where-clause is a relative clause modifying “the place”
Brent’s Lerner system
• The algorithm has two steps:
① Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty.
– Certainty is formalized as a probability of error
– For each cue cj we define an error rate εj
② Hypothesis testing is done by contradiction:
– We assume that the frame is not appropriate for the verb and call this H0 (the null hypothesis).
– We reject the hypothesis if cj indicates with high probability that our H0 is wrong.
Cues
• Cue for frame “NP NP”:
(OBJ | SUBJ_OBJ | CAP) (PUNC | CC)
– OBJ = object personal pronouns (me, him)
– SUBJ_OBJ = personal pronouns usable as subject or object (you, it)
– CAP = capitalized word
– PUNC = punctuation mark
– CC = subordinating conjunction (if, before)
• Example: […] greet-V Peter-CAP ,-PUNC […]
Hypothesis testing

p_E = P(v_i(f_j) = 0 \mid C(v_i, c_j) \ge m) = \sum_{r=m}^{n} \binom{n}{r} \varepsilon_j^{r} (1 - \varepsilon_j)^{n-r}

n : number of times vi occurs in the corpus
m : number of times vi occurs together with cue cj
vi(fj) = 0 : verb vi does not permit frame fj
C(vi, cj) : number of times that vi occurs with cue cj
εj : error rate for cue cj

• If pE < α, then we reject H0
• Precision: close to 100% (when α = 0.02)
• Recall: 47–100%
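A sketch of this binomial tail in Python. The counts and error rate below are hypothetical, chosen only to show how the test behaves; only the α = 0.02 threshold comes from the slide:

```python
from math import comb

def cue_error_prob(n: int, m: int, eps: float) -> float:
    """p_E = probability of seeing cue c_j at least m times in n occurrences
    of the verb if the verb does NOT permit the frame (binomial tail with
    per-occurrence error rate eps)."""
    return sum(comb(n, r) * eps**r * (1 - eps)**(n - r) for r in range(m, n + 1))

# Hypothetical verb: seen 100 times, 7 of them with the "NP NP" cue,
# cue error rate 2%.
p_e = cue_error_prob(n=100, m=7, eps=0.02)
print(p_e)           # ~0.004
print(p_e < 0.02)    # True -> reject H0: conclude the verb takes the frame
```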
Attachment Ambiguity
Attachment ambiguity
• (8.14) The children ate the cake with a spoon.
• I saw the man with a telescope.
• (Log) likelihood ratio [a common and good way of comparing two exclusive alternatives]:

\lambda(v, n, p) = \log \frac{P(p \mid v)}{P(p \mid n)}

• Problem: ignores the preference for attaching phrases “low” in the parse tree
Simple model
• Chrysler confirmed that it would end its troubled venture with Maserati.
– (Here the PP “with Maserati” attaches to the noun “venture”, not to the verb “end”.)
Hindle and Rooth (1)
• Event space: all V NP PP sequences
• How likely is a preposition to attach to the verb or to the noun?
– VAp: is there a PP headed by p which attaches to v?
– NAp: is there a PP headed by p which attaches to n?
• Both can be 1:
– He put the book on World War II on the table.
– She sent him into the nursery to gather up his toys.

\lambda(v, n, p) = \log_2 \frac{P(\mathrm{Attach}(p) = v \mid v, n)}{P(\mathrm{Attach}(p) = n \mid v, n)} = \log_2 \frac{P(VA_p = 1 \mid v)\, P(NA_p = 0 \mid n)}{P(NA_p = 1 \mid n)}

• Maximum likelihood estimates:

P(VA_p = 1 \mid v) = \frac{C(v, p)}{C(v)} \qquad P(NA_p = 1 \mid n) = \frac{C(n, p)}{C(n)}
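A minimal Python sketch of this decision rule under the maximum-likelihood estimates above; the counts are illustrative, not Hindle and Rooth's actual figures:

```python
import math

def attachment_score(c_vp: int, c_v: int, c_np: int, c_n: int) -> float:
    """Hindle & Rooth's lambda(v, n, p):
    log2 [ P(VAp=1|v) * P(NAp=0|n) / P(NAp=1|n) ].
    Positive -> attach the PP to the verb; negative -> to the noun."""
    p_va = c_vp / c_v          # P(VAp = 1 | v) = C(v, p) / C(v)
    p_na = c_np / c_n          # P(NAp = 1 | n) = C(n, p) / C(n)
    return math.log2(p_va * (1 - p_na) / p_na)

# Hypothetical counts for "send ... into": the verb strongly licenses the
# preposition, the noun rarely does.
print(attachment_score(c_vp=86, c_v=300, c_np=3, c_n=1000))  # ~6.6: verb attachment
```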
Hindle and Rooth (2)
General remarks on PP attachment
• Model’s limitations
– Only considers the identity of the preposition, the noun, and the verb
– Considers only the most basic case: a PP immediately after an NP object, modifying either the immediately preceding n or v. Real sentences stack many attachment decisions:
The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].
• Other attachment issues
– Attachment ambiguity in noun compounds:
(a) [[Door bell] manufacturer] : left-branching
(b) [Woman [aid worker]] : right-branching
Selectional Preferences
Selectional Preferences (1)
• Selectional preference (or selectional restriction)
– Most verbs prefer arguments of a particular type.
– These are preferences rather than hard rules (preference ↔ rule): “eat” prefers a food argument, but non-food arguments do occur.
Example) eating one’s words.
Selectional Preferences (2)
• Acquisition of selectional preferences is important in statistical NLP for a number of reasons:
– If “durian” is missing from the dictionary, we can infer part of its meaning from selectional restrictions.
– Another important use is ranking the parses of a sentence: give high scores to parses where the verb has natural arguments.
Selectional Preferences (3)
• Resnik’s model (1993, 1996)
1. Selectional preference strength
– How strongly the verb constrains its direct object
– Two assumptions:
① Take only the head noun
② Generalize over classes of nouns

S(v) = D(P(C \mid v) \,\|\, P(C)) = \sum_{c} P(c \mid v) \log \frac{P(c \mid v)}{P(c)}

– P(C) : overall probability distribution of noun classes
– P(C|v) : probability distribution of noun classes in the direct object position of v
Selectional Preferences (4)
• Table 8.5 Selectional preference strength

Noun class c   P(c)   P(c|eat)   P(c|see)   P(c|find)
people         0.25   0.01       0.25       0.33
furniture      0.25   0.01       0.25       0.33
food           0.25   0.97       0.25       0.33
action         0.25   0.01       0.25       0.01
SPS S(v)              1.76       0.00       0.35
Selectional Preferences (5)
• Resnik’s model (cont’d)
2. Selectional association between a verb v and a class c:

A(v, c) = \frac{P(c \mid v) \log \frac{P(c \mid v)}{P(c)}}{S(v)}

• A rule for assigning association strength to nouns:

A(v, n) = \max_{c \in \mathrm{classes}(n)} A(v, c)

Ex) (8.31) Susan interrupted the chair.

A(\mathrm{interrupt}, \mathrm{chair}) = \max_{c \in \mathrm{classes}(\mathrm{chair})} A(\mathrm{interrupt}, c) = A(\mathrm{interrupt}, \mathrm{people})
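Both quantities can be computed directly from the Table 8.5 distributions. A Python sketch (assuming base-2 logarithms, which reproduces the SPS row of the table); treating “chair” as a member of the people and furniture classes is an illustrative assumption:

```python
import math

p_c = {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25}
p_c_given_v = {
    "eat":  {"people": 0.01, "furniture": 0.01, "food": 0.97, "action": 0.01},
    "see":  {"people": 0.25, "furniture": 0.25, "food": 0.25, "action": 0.25},
    "find": {"people": 0.33, "furniture": 0.33, "food": 0.33, "action": 0.01},
}

def sps(v):
    """Selectional preference strength S(v) = D(P(C|v) || P(C))."""
    return sum(p * math.log2(p / p_c[c]) for c, p in p_c_given_v[v].items())

def assoc(v, c):
    """Selectional association A(v, c) (undefined when S(v) = 0, as for 'see')."""
    p = p_c_given_v[v][c]
    return p * math.log2(p / p_c[c]) / sps(v)

def assoc_noun(v, noun_classes):
    """A(v, n) = max over the classes that contain the noun n."""
    return max(assoc(v, c) for c in noun_classes)

for v in ("eat", "see", "find"):
    print(v, round(sps(v), 2))              # eat 1.76, see 0.0, find 0.35
print(assoc_noun("eat", {"people", "furniture"}))  # negative: chair is a dispreferred object of eat
```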
Selectional Preferences (6)
• Estimating the probability P(c|v):

P(c \mid v) = \frac{P(v, c)}{P(v)}, \qquad P(v) = \frac{C(v)}{\sum_{v'} C(v')}

P(v, c) = \frac{1}{N} \sum_{n \in \mathrm{words}(c)} \frac{1}{|\mathrm{classes}(n)|}\, C(v, n)

N : total number of verb-object pairs in the corpus
words(c) : set of all nouns in class c
|classes(n)| : number of noun classes that contain n as a member
C(v, n) : number of verb-object pairs with v as the verb and n as the head of the object NP
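A sketch of this estimator on a toy corpus of verb-object pairs. The pairs and class memberships are hypothetical; a real system would take the classes from WordNet:

```python
from collections import Counter

pairs = [("eat", "cake"), ("eat", "apple"), ("eat", "chicken"),
         ("see", "movie"), ("find", "apple")]
classes_of = {"cake": {"food"}, "apple": {"food"}, "movie": {"action"},
              "chicken": {"food", "animal"}}   # "chicken" is in 2 classes

N = len(pairs)                       # total verb-object pairs
c_vn = Counter(pairs)                # C(v, n)
c_v = Counter(v for v, _ in pairs)   # C(v)

def p_vc(v, c):
    """P(v, c) = (1/N) * sum over nouns n in c of C(v, n) / |classes(n)|."""
    return sum(cnt / len(classes_of[n])
               for (verb, n), cnt in c_vn.items()
               if verb == v and c in classes_of[n]) / N

def p_c_given_v(c, v):
    """P(c | v) = P(v, c) / P(v), with P(v) = C(v) / sum_v' C(v')."""
    return p_vc(v, c) / (c_v[v] / N)

print(p_c_given_v("food", "eat"))   # (1 + 1 + 0.5)/5 / (3/5) = 0.833...
```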
Selectional Preferences (7)
• Resnik’s experiments on the Brown corpus (1996): Table 8.6
– Left half: typical objects
– Right half: atypical objects
– For most verbs, association strength predicts which object is typical
– Most errors the model makes are due to the fact that it performs a form of disambiguation, by choosing the highest A(v, c) for A(v, n)
• Implicit object alternation
a. Mike ate the cake.
b. Mike ate.
– The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction
– Selectional preference strength (SPS) is seen as the more basic phenomenon, which explains the occurrence of implicit objects as well as association strength does
Selectional Preferences (8)
Semantic Similarity
Semantic Similarity (1)
• Lexical acquisition is ultimately the acquisition of meaning
• Semantic similarity
– Automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is
– Most often used for generalization, under the assumption that semantically similar words behave similarly
ex) Susan had never eaten a fresh durian before.
• Similarity-based generalization vs. class-based generalization
– Similarity-based generalization: consider the closest neighbors
– Class-based generalization: consider the whole class
• Uses of semantic similarity
– Query expansion: astronaut → cosmonaut
– k nearest neighbors classification
Semantic Similarity (2)
• A notion of semantic similarity
– An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode
– Two words are from the same domain or topic
ex) doctor, nurse, fever, intravenous
– Judgements of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991)
– Ambiguity presents a problem for all notions of semantic similarity: when applied to ambiguous words, “semantically similar” usually means ‘similar to the appropriate sense’
ex) litigation ≒ suit (≠ clothes)
• Similarity measures
– Vector space measures
– Probabilistic measures
Semantic Similarity (3)
Vector space measures (1)
• The two words whose semantic similarity we want to compute are represented as vectors in a multi-dimensional space.
1. A document-by-word matrix A (Figure 8.3)
– Entry a_ij contains the number of times word j occurs in document i.
2. A word-by-word matrix B (Figure 8.4)
– Entry b_ij contains the number of times word j co-occurs with word i.
3. A modifier-by-head matrix C (Figure 8.5)
– Entry c_ij contains the number of times that head j is modified by modifier i.
• Different spaces get at different types of semantic similarity
– Document-word and word-word spaces capture topical similarity
– Modifier-head space captures more fine-grained similarity
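A small sketch (not from the slides) of building the word-by-word matrix B with numpy, using the whole document as the co-occurrence window; the toy documents echo the cosmonaut/astronaut example used later:

```python
import numpy as np

docs = [["cosmonaut", "Soviet", "spacewalking"],
        ["astronaut", "American", "spacewalking"],
        ["Soviet", "cosmonaut"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# b[i, j] = number of times word j co-occurs with word i in a document.
B = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    for w1 in doc:
        for w2 in doc:
            if w1 != w2:
                B[idx[w1], idx[w2]] += 1

print(vocab)
print(B[idx["cosmonaut"]])  # co-occurrence counts for "cosmonaut"
```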
Vector space measures (2)
• Similarity measures for binary vectors (Table 8.7)
– Matching coefficient simply counts the number of dimensions on which both vectors are non-zero:
|X ∩ Y|
– Dice coefficient normalizes for the length of the vectors and the total number of non-zero entries:
2|X ∩ Y| / (|X| + |Y|)
– Jaccard (or Tanimoto) coefficient penalizes a small number of shared entries more than the Dice coefficient does:
|X ∩ Y| / |X ∪ Y|
Vector space measures (3)
• Similarity measures for binary vectors (cont’d)
– Overlap coefficient has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector:
|X ∩ Y| / min(|X|, |Y|)
– Cosine penalizes less in cases where the numbers of non-zero entries are very different:
|X ∩ Y| / √(|X| · |Y|)
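These binary measures fall out naturally if each vector is represented as the set of its non-zero dimensions. A Python sketch of Table 8.7 (the example vectors are illustrative):

```python
def matching(X: set, Y: set) -> int:
    return len(X & Y)

def dice(X: set, Y: set) -> float:
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X: set, Y: set) -> float:
    return len(X & Y) / len(X | Y)

def overlap(X: set, Y: set) -> float:
    return len(X & Y) / min(len(X), len(Y))

def cosine_binary(X: set, Y: set) -> float:
    return len(X & Y) / (len(X) * len(Y)) ** 0.5

X, Y = {"Soviet", "spacewalking"}, {"American", "spacewalking"}
print(dice(X, Y), jaccard(X, Y), cosine_binary(X, Y))  # 0.5 0.333... 0.5
```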
• Real-valued vector space
– A more powerful representation for linguistic objects
– The length of a vector: |\vec{x}| = \sqrt{\sum_{i=1}^{n} x_i^2}
Vector space measures (4)
• Real-valued vector space (cont’d)
– The dot product between two vectors: \vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i
– The cosine measure:

\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}

– The Euclidean distance: |\vec{x} - \vec{y}| = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
• The advantages of vector spaces as a representational medium:
– Simplicity
– Computational efficiency
• The disadvantages of vector spaces:
– Most measures operate only on binary data; cosine is the exception
– Cosine has its own problem: it assumes a Euclidean space, which is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts
Probabilistic measures (1)
• Transform semantic similarity into the similarity of two probability distributions
• Transform the matrices of counts in Figures 8.3, 8.4, and 8.5 into matrices of conditional probabilities
Ex) (American, astronaut) → P(American | astronaut) = 1/2 = 0.5
• Measures of (dis-)similarity between probability distributions (Table 8.9)
• Three measures of dissimilarity between probability distributions investigated by Dagan et al. (1997):
1. KL divergence

D(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}

– Measures how much information is lost if we assume distribution q when the true distribution is p
– Two problems for practical applications:
» Takes the value infinity when q_i = 0 and p_i ≠ 0
» Asymmetric: D(p||q) ≠ D(q||p)
Probabilistic measures (2)
• Measures of similarity between probability distributions (cont’d)
2. Information radius (IRad)

IRad(p, q) = D\left(p \,\middle\|\, \frac{p + q}{2}\right) + D\left(q \,\middle\|\, \frac{p + q}{2}\right)

– Symmetric, and no problem with infinite values
– Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution
3. L1 norm

L_1(p, q) = \frac{1}{2} \sum_i |p_i - q_i|

– A measure of the expected proportion of events that are going to be different between the distributions p and q
Probabilistic measures (3)
• Measures of similarity between probability distributions (cont’d)
• L1 norm example (from Figure 8.5):
p1 = P(Soviet | cosmonaut) = 0.5
p2 = P(American | cosmonaut) = 0
p3 = P(spacewalking | cosmonaut) = 0.5
q1 = P(Soviet | astronaut) = 0
q2 = P(American | astronaut) = 0.5
q3 = P(spacewalking | astronaut) = 0.5

\frac{1}{2} \sum_i |p_i - q_i| = \frac{1}{2}(0.5 + 0.5 + 0) = 0.5
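A Python sketch of the three measures, using base-2 logarithms and the cosmonaut/astronaut distributions above; it reproduces the 0.5 value for the L1 norm:

```python
import math

def kl(p, q):
    """D(p || q); infinite if some q_i = 0 where p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    """Information radius: D(p || (p+q)/2) + D(q || (p+q)/2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def l1(p, q):
    """(1/2) * sum_i |p_i - q_i|, as on the slide above."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.0, 0.5]   # P(Soviet|cosmonaut), P(American|cosmonaut), P(spacewalking|cosmonaut)
q = [0.0, 0.5, 0.5]   # the same context words given "astronaut"
print(l1(p, q))    # 0.5
print(irad(p, q))  # 1.0 -- p and q each lose one bit against the average
# kl(p, q) is infinite here (the code would raise ZeroDivisionError),
# since q_1 = 0 while p_1 > 0.
```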
The Role of Lexical Acquisition in Statistical NLP (1)
• Lexical acquisition plays a key role in statistical NLP because available lexical resources are always lacking in some way:
– The cost of building lexical resources manually
– The quantitative part of lexical acquisition almost always has to be done automatically
– Many lexical resources were designed for human consumption
• The best solution: augmentation of a manual resource by automatic means
– The main reason: the inherent productivity of language
What does the future hold for lexical acquisition?
• Look harder for sources of prior knowledge that can constrain the process of lexical acquisition.
– Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning.
– Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in statistical NLP.
• Dictionaries are only one source of information that can be important in lexical acquisition in addition to text corpora.
– Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc.
• If we succeed in emulating human acquisition of language by tapping into this rich source of information, then a breakthrough in the effectiveness of lexical acquisition can be expected.