Word Sense Disambiguation & Information Retrieval
CMSC 35100 Natural Language Processing
May 20, 2003

Roadmap
• Word Sense Disambiguation
  – Knowledge-based approaches
    • Sense similarity in a taxonomy
  – Issues in WSD
    • Why they work & why they don't
• Information Retrieval
  – Vector space model
    • Computing similarity
    • Term weighting
    • Enhancements: expansion, stemming, synonyms

Resnik's WordNet Labeling: Detail
• Assume a source of clusters
• Assume a KB: word senses in the WordNet IS-A hierarchy
• Assume a text corpus
• Calculate informativeness
  – For each KB node: sum occurrences of it and all its children
  – Informativeness: -log of the node's probability (rarer subsumers are more informative)
• Disambiguate w.r.t. cluster & WordNet
  – Find the most informative subsumer (MIS) for each pair; call its informativeness I
  – For each subsumed sense: Vote += I
  – Select the sense with the highest vote

Sense Labeling Under WordNet
• Use local content words as clusters
  – Biology: plants, animals, rainforests, species, …
  – Industry: company, products, range, systems, …
• Find common ancestors in WordNet
  – Biology: Plant & Animal isa Living Thing
  – Industry: Product & Plant isa Artifact isa Entity
  – Use the most informative ancestor
• Result: correct sense selection

The Question of Context
• Shared intuition: context determines sense
• Area of disagreement: what is context?
  – Wide vs. narrow window
  – Word co-occurrences

Taxonomy of Contextual Information
• Topical content
• Word associations
• Syntactic constraints
• Selectional preferences
• World knowledge & inference

A Trivial Definition of Context
• All words within X words of the target
• Many words: Schutze uses 1000 characters, several sentences
• Unordered "bag of words"
• Information captured: topic & word association
• Limits on applicability
  – Nouns vs. verbs & adjectives
  – Schutze: nouns 92%; "train" as a verb, 69%

Limits of Wide Context
• Comparison of wide-context techniques (LTV '93)
  – Neural net, context vector, Bayesian classifier, simulated annealing
• Results: 2 senses: 90+%; 3+ senses: ~70%
• People: full sentences ~100%; bag of words ~70%
• Conclusion: wide context is inadequate; need narrow context
  – Local constraints override
  – Retain order and adjacency

Surface Regularities = Useful Disambiguators?
• Not necessarily!
• "Scratching her nose" vs. "kicking the bucket" (de Marcken 1995)
• Right for the wrong reason
  – Burglar Rob… Thieves Stray Crate Chase Lookout
• Learning the corpus, not the sense
  – The "Ste." cluster: Dry Oyster Whisky Hot Float Ice
• Learning nothing useful, wrong question
  – Keeping: Bring Hoping Wiping Could Should Some Them Rest

Interactions Below the Surface
• Constraints are not all created equal
  – "The astronomer married the star"
  – Selectional restrictions override topic
• No surface regularities
  – "The emigration/immigration bill guaranteed passports to all Soviet citizens"
  – No substitute for understanding

What is Similar?
• Ad-hoc definitions of sense
  – Cluster in "word space", WordNet sense, "seed sense": circular
• Schutze: vector distance in word space
• Resnik: informativeness of WordNet subsumer + cluster
  – Relation is in the cluster, not the WordNet is-a hierarchy
• Yarowsky: no similarity, only difference
  – Decision lists: one per pair
  – Find discriminants

Information Retrieval
• Query/document similarity
• Query: expression of the user's information need
• Documents: searchable units that encode concepts
  – Paragraphs, encyclopedia entries, web pages, …
• Collection: searchable group of documents
• Terms: elementary units
  – E.g. words, phrases, stems, …
• Bag of words (typically)
  – E.g. man, dog, bit

Vector Space Model
• Represent documents and queries as vectors of term-based features
• Features: tied to occurrence of terms in the collection, e.g.
  d_j = (t_{1,j}, t_{2,j}, …, t_{N,j});  q_k = (t_{1,k}, t_{2,k}, …, t_{N,k})
• Solution 1: binary features: t = 1 if the term is present, 0 otherwise
  – Similarity: number of terms in common
  – Dot product: sim(q_k, d_j) = Σ_{i=1..N} t_{i,k} · t_{i,j}

Vector Space Model II
• Problem: not all terms are equally interesting
  – E.g. "the" vs. "dog" vs. "Levow"
• Solution: replace binary term features with weights
  d_j = (w_{1,j}, w_{2,j}, …, w_{N,j});  q_k = (w_{1,k}, w_{2,k}, …, w_{N,k})
  – Document collection: term-by-document matrix
  – View each document as a vector in a multidimensional space
    • Nearby vectors are related
  – Normalize for vector length

Vector Similarity Computation
• Similarity = dot product:
  sim(q_k, d_j) = q_k · d_j = Σ_{i=1..N} w_{i,k} · w_{i,j}
• Normalization:
  – Normalize weights in advance, or
  – Normalize post hoc (cosine similarity):
    sim(q_k, d_j) = Σ_{i=1..N} w_{i,k} w_{i,j} / ( √(Σ_{i=1..N} w_{i,k}²) · √(Σ_{i=1..N} w_{i,j}²) )

Term Weighting
• "Aboutness"
  – To what degree is this term what the document is about?
  – Within-document measure
  – Term frequency tf_{i,j}: number of occurrences of term i in document j
• "Specificity"
  – How surprised are you to see this term?
  – Collection frequency
  – Inverse document frequency: idf_i = log(N / n_i), where N is the number of documents and n_i the number of documents containing term i
• Combined weight: w_{i,j} = tf_{i,j} × idf_i

Term Selection & Formation
• Selection:
  – Some terms are truly useless: too frequent, no content
    • E.g. the, a, and, …
  – Stop words: ignore such terms altogether
• Creation:
  – Too many surface forms for the same concept
    • E.g. inflections of words: verb conjugations, plurals
  – Stemming: treat all forms as the same underlying term

Query Refinement
• Typical queries are very short and ambiguous
  – "cat": animal or Unix command?
  – Add more terms to disambiguate and improve retrieval
• Relevance feedback
  – Retrieve with the original query; present results
  – Ask the user to tag documents relevant/non-relevant
  – "Push" the query toward relevant vectors, away from non-relevant ones:
    q_{i+1} = q_i + (β/R) Σ_{j=1..R} r_j − (γ/S) Σ_{k=1..S} s_k
  – β + γ = 1 (e.g. 0.75, 0.25); r_j: relevant docs; s_k: non-relevant docs
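
The tf·idf weighting and post-hoc (cosine) normalization described above can be sketched in a few lines of Python. The three-document collection and the query below are made up purely for illustration; only the formulas (w_{i,j} = tf_{i,j} × idf_i, idf_i = log(N/n_i), and the normalized dot product) come from the slides.

```python
import math

def tfidf_vectors(docs):
    """Build tf-idf weight vectors (w_ij = tf_ij * idf_i) for a list of
    tokenized documents, with idf_i = log(N / n_i) as on the slides."""
    N = len(docs)
    df = {}                                   # n_i: docs containing term i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    idf = {t: math.log(N / n) for t, n in df.items()}
    vectors = []
    for doc in docs:
        tf = {}                               # tf_ij: occurrences in this doc
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return vectors, idf

def cosine(v1, v2):
    """Post-hoc normalization from the slides:
    sim = sum_i w_i1*w_i2 / (sqrt(sum w_i1^2) * sqrt(sum w_i2^2))."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Toy collection (hypothetical); query terms weighted by collection idf.
docs = [["man", "bit", "dog"],
        ["dog", "chased", "cat"],
        ["stock", "market", "fell"]]
vectors, idf = tfidf_vectors(docs)
query = {"dog": idf["dog"], "cat": idf["cat"]}
scores = [cosine(query, v) for v in vectors]
```

The second document matches both query terms and scores highest; the third shares no terms with the query and scores zero.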
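
The relevance-feedback "push" (Rocchio update with β + γ = 1) can be sketched as follows. The term weights and the relevant/non-relevant judgments are hypothetical; only the update rule itself is from the slide.

```python
def rocchio_update(query, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """One round of Rocchio relevance feedback:
    q_{i+1} = q_i + (beta/R) * sum(r_j) - (gamma/S) * sum(s_k),
    where vectors are dicts mapping term -> weight."""
    new_q = dict(query)
    R, S = len(relevant), len(nonrelevant)
    for r in relevant:                        # pull toward relevant docs
        for t, w in r.items():
            new_q[t] = new_q.get(t, 0.0) + (beta / R) * w
    for s in nonrelevant:                     # push away from non-relevant docs
        for t, w in s.items():
            new_q[t] = new_q.get(t, 0.0) - (gamma / S) * w
    return new_q

q = {"cat": 1.0}                      # ambiguous: animal or Unix command?
rel = [{"cat": 1.0, "animal": 1.0}]   # user tagged the animal page relevant
nonrel = [{"cat": 1.0, "unix": 1.0}]  # and the Unix man page non-relevant
q2 = rocchio_update(q, rel, nonrel)
```

After one round the query gains weight on "animal" and negative weight on "unix", disambiguating the short query in the direction the user indicated.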
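
Returning to the WSD half of the lecture, Resnik's labeling procedure (informativeness of a node, most informative subsumer, sense voting) can also be sketched in Python. The IS-A hierarchy and corpus counts below are toy stand-ins for WordNet and a real corpus, and the sense names (plant#flora, plant#factory) are illustrative only.

```python
import math

# Toy IS-A hierarchy standing in for WordNet (child -> parent).
parent = {
    "plant#flora": "living_thing", "animal": "living_thing",
    "plant#factory": "artifact", "product": "artifact",
    "living_thing": "entity", "artifact": "entity",
}

def ancestors(node):
    """The node plus all of its IS-A ancestors up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def information_content(node, counts, total):
    """Resnik informativeness: -log P(node), where P(node) sums the
    corpus counts of the node and everything subsumed below it."""
    mass = sum(c for n, c in counts.items() if node in ancestors(n))
    return -math.log(mass / total)

def most_informative_subsumer(s1, s2, counts, total):
    common = [a for a in ancestors(s1) if a in ancestors(s2)]
    return max(common, key=lambda a: information_content(a, counts, total))

# Hypothetical leaf counts from a corpus; disambiguate "plant" against a
# Biology-flavored cluster word, as in the Plants & Animals example.
counts = {"plant#flora": 10, "animal": 20, "plant#factory": 15, "product": 25}
total = sum(counts.values())
senses = ["plant#flora", "plant#factory"]
cluster = ["animal"]
votes = {s: 0.0 for s in senses}
for s in senses:
    for w in cluster:
        mis = most_informative_subsumer(s, w, counts, total)
        votes[s] += information_content(mis, counts, total)
best = max(votes, key=votes.get)
```

The flora sense wins because its MIS with "animal" is Living Thing (informative), while the factory sense shares only the root Entity (informativeness zero).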