QUANTIFYING UNCERTAINTY Heng Ji [email protected] 03/29, 04/01, 2016 Top 1 Proposal Presentation • Multimedia Joint Model: Spencer Whitehead 4.83 • "good idea building on existing system" • "interesting problem, clear schedule" • "very professional and well-reported" • "potentially super useful" • "cool project" • "could use more pictures, good speaking" • "little ambiguous, not entirely" • "very nice" • "interesting research" • "sounds very ambitious" • "need to figure out the layout of texts (some words are covered by pictures)" Top 2 Team • Sudoku Destructifier: Corey Byrne, Garrett Chang, Darren Lin, Ben Litwack 4.82 • "seems like a pretty easy project for 4 people" • "too many slides" • "I like the topic!" • "great idea" • "good luck" • "very interesting modification" • "funny project name" • "uneven distribution of speaking" • "good speaking" • "using the procedural algorithm against itself. nice." Top 3 Team • Protozoan Perception: Ahmed Eleish, Chris Paradis, Rob Russo, Brandon Thorne 4.81 • "interesting problem" • "ambitious but could be a very interesting project" • "good project. interesting problem/approach. Useful." • "good idea" • "specific and narrow goals" • "very clear" • "very unique and interesting idea for a project. good crossover w/ CS + other • • • • • • • disciplines" "too many slides" "a bit long" "cool stuff" "ambitious" "only 2 people talked" "not all members participated" "seems useful for scientists. what's additional in this project that's not in your Top 4 – 7 Teams • Forgery Detection Julia Fantacone, Connor Hadley, Asher Norland, Alexandra Zytek 4.76 • German to English Machine Translation Kevin O’Neill and Mitchell Mellone 4.73 • Bias Check on web documents Chris Higley, Michael Han, Terence Farrell 4.73 • Path-finding Agents with swarm based travel Nicholas Lockwood 4.73 Outline • Probability Basics • Naïve Bayes • Application in Word Sense Disambiguation • Application in Information Retrieval Uncertainty • So far in course, everything deterministic • If I walk with my umbrella, I will not get wet • But: there is some chance my umbrella will break! • Intelligent systems must take possibility of failure into account… • May want to have backup umbrella in city that is often windy and rainy • … but should not be excessively conservative • Two umbrellas not worthwhile for city that is usually not windy • Need quantitative notion of uncertainty Uncertainty • General situation: • Observed variables (evidence): Agent knows certain things about the state of the world (e.g., sensor readings or symptoms) • Unobserved variables: Agent needs to reason about other aspects (e.g. where an object is or what disease is present) • Model: Agent knows something about how the known variables relate to the unknown variables • Probabilistic reasoning gives us a framework for managing our beliefs and knowledge Random Variables • A random variable is some aspect of the world about which we (may) have uncertainty • • • • R = Is it raining? T = Is it hot or cold? D = How long will it take to drive to work? L = Where is the ghost? 
• We denote random variables with capital letters
• Like variables in a CSP, random variables have domains
  • R in {true, false} (often written as {+r, -r})
  • T in {hot, cold}
  • D in [0, ∞)
  • L in possible locations, maybe {(0,0), (0,1), …}

Probability
• Example: roll two dice
• Random variables:
  – X = value of die 1
  – Y = value of die 2
• An outcome is represented by an ordered pair of values (x, y), e.g., (6, 1): X=6, Y=1
• [Table: the 36 outcomes laid out on a 6×6 grid with X = 1..6 on one axis and Y = 1..6 on the other; every cell has probability 1/36]
• An atomic event (sample point) tells us the complete state of the world, i.e., the values of all random variables
• Exactly one atomic event will happen; each atomic event has probability ≥ 0, and the probabilities sum to 1
• An event is a proposition about the state (= a subset of states), e.g., X+Y = 7
• Probability of an event = sum of the probabilities of the atomic events in it

Probability Distributions
• Associate a probability with each value
• Temperature:
    T     P
    hot   0.5
    cold  0.5
• Weather:
    W       P
    sun     0.6
    rain    0.1
    fog     0.3
    meteor  0.0

Probability Distributions
• Unobserved random variables have distributions
• A distribution is a TABLE of probabilities of values
• A probability (lower-case value) is a single number, e.g., P(W=rain) = 0.1
• Must have: P(x) ≥ 0 for every value x, and ∑x P(x) = 1
• Shorthand notation: P(hot), P(rain), … is OK if all domain entries are unique

Joint Distributions
• A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment (or outcome): P(X1=x1, …, Xn=xn)
• Must obey: P(x1, …, xn) ≥ 0, and the sum over all assignments = 1
• Distribution over T, W:
    T     W     P
    hot   sun   0.4
    hot   rain  0.1
    cold  sun   0.2
    cold  rain  0.3
• Size of the distribution if n variables each have domain size d? (d^n entries)
• For all but the smallest distributions, impractical to write out!

Probabilistic Models
• A probabilistic model is a joint distribution over a set of random variables
• Probabilistic models:
  • (Random) variables with domains; assignments are called outcomes
  • Joint distributions: say whether assignments (outcomes) are likely
  • Normalized: sum to 1.0
  • Ideally: only certain variables directly interact
• Constraint satisfaction problems:
  • Variables with domains
  • Constraints: state whether assignments are possible
  • Ideally: only certain variables directly interact
• Distribution over T, W (above) vs. constraint over T, W:
    T     W     allowed?
    hot   sun   T
    hot   rain  F
    cold  sun   F
    cold  rain  T

Events
• An event is a set E of outcomes
• From a joint distribution, we can calculate the probability of any event
  • Probability that it's hot AND sunny?
  • Probability that it's hot?
  • Probability that it's hot OR sunny?
• Typically, the events we care about are partial assignments, like P(T=hot)

Facts about probabilities of events
• If events A and B are disjoint, then P(A or B) = P(A) + P(B)
• More generally: P(A or B) = P(A) + P(B) - P(A and B)
• If events A1, …, An are disjoint and exhaustive (one of them must happen), then P(A1) + … + P(An) = 1
  • Special case: for any random variable, ∑x P(X=x) = 1
• Marginalization: P(X=x) = ∑y P(X=x and Y=y)

Conditional probability
• Probability of cavity given toothache: P(Cavity = true | Toothache = true)
• For any two events A and B: P(A | B) = P(A and B) / P(B) = P(A, B) / P(B)
  (in general P(A and B) ≠ P(A) P(B) unless A and B are independent)

Conditional probability
• We might know something about the world, e.g., "X+Y=6 or X+Y=7"
• Given this (and only this), what is the probability of Y=5?
• Part of the sample space is eliminated; the remaining probabilities are renormalized to sum to 1
• [Table: in the 6×6 grid, only the 11 cells with X+Y=6 or X+Y=7 survive; each gets probability 1/11, all other cells get 0]
• P(Y=5 | (X+Y=6) or (X+Y=7)) = 2/11
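To make the renormalization concrete, here is a small Python sketch (not part of the original slides) that enumerates the 36 atomic events for two dice, computes a marginal by summing atomic-event probabilities, and recovers P(Y=5 | X+Y=6 or X+Y=7) = 2/11 by conditioning.

```python
from fractions import Fraction
from itertools import product

# All 36 atomic events (x, y) for two fair dice, each with probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = {o: Fraction(1, 36) for o in outcomes}

def prob(event):
    """Probability of an event = sum of probabilities of the atomic events in it."""
    return sum(p[o] for o in outcomes if event(o))

# Marginalization: P(Y=5) = sum over x of P(X=x, Y=5)
print(prob(lambda o: o[1] == 5))                                    # 1/6

# Conditioning: P(Y=5 | X+Y in {6,7}) = P(Y=5 and X+Y in {6,7}) / P(X+Y in {6,7})
evidence = lambda o: o[0] + o[1] in (6, 7)
print(prob(lambda o: o[1] == 5 and evidence(o)) / prob(evidence))   # 2/11
```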
Facts about conditional probability
• P(A | B) = P(A and B) / P(B)
• [Venn diagram: a sample space containing events A and B, with their overlap "A and B"]
• P(A | B) P(B) = P(A and B)
• P(A | B) = P(B | A) P(A) / P(B) – Bayes' rule

How can we scale this?
• In principle, we now have a complete approach for reasoning under uncertainty:
  • Specify a probability for every atomic event,
  • Compute probabilities of events simply by summing probabilities of atomic events,
  • Conditional probabilities are specified in terms of probabilities of events: P(A | B) = P(A and B) / P(B)
• If we have n variables that can each take k values, how many atomic events are there? (k^n)

Independence
• Some variables have nothing to do with each other
• Dice: if X=6, it tells us nothing about Y
  • P(Y=y | X=x) = P(Y=y)
  • So: P(X=x and Y=y) = P(Y=y | X=x) P(X=x) = P(Y=y) P(X=x)
  • Usually just write P(X, Y) = P(X) P(Y)
• Only need to specify 6+6=12 values instead of 6*6=36 values
• Independence among 3 variables: P(X,Y,Z) = P(X) P(Y) P(Z), etc.
• Are the events "you get a flush" and "you get a straight" independent?

An example without cards or dice
                      Rain in Durham   Sun in Durham
  Rain in Beaufort    0.2              0.2
  Sun in Beaufort     0.1              0.5
  (disclaimer: no idea if these numbers are realistic)
• What is the probability of
  • Rain in Beaufort? Rain in Durham?
  • Rain in Beaufort, given rain in Durham?
  • Rain in Durham, given rain in Beaufort?
• Rain in Beaufort and rain in Durham are correlated

Conditional independence
• Intuition:
  • the only reason that X told us something about Y
  • is that X told us something about Z,
  • and Z tells us something about Y
• If we already know Z, then X tells us nothing about Y
• P(Y | Z, X) = P(Y | Z), or equivalently
• P(X, Y | Z) = P(X | Z) P(Y | Z)
• "X and Y are conditionally independent given Z"

Context-specific independence
• Recall P(X, Y | Z) = P(X | Z) P(Y | Z) really means: for all x, y, z, P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z)
• But it may not be true for all z
• P(Wet, RainingInLondon | CurrentLocation=New York) = P(Wet | CurrentLocation=New York) P(RainingInLondon | CurrentLocation=New York)
• But not
• P(Wet, RainingInLondon | CurrentLocation=London) = P(Wet | CurrentLocation=London) P(RainingInLondon | CurrentLocation=London)

Expected value
• If Z takes numerical values, then the expected value of Z is E(Z) = ∑z P(Z=z) * z
• Weighted average (weighted by probability)
• Suppose Z is the sum of two dice:
  E(Z) = (1/36)*2 + (2/36)*3 + (3/36)*4 + (4/36)*5 + (5/36)*6 + (6/36)*7 + (5/36)*8 + (4/36)*9 + (3/36)*10 + (2/36)*11 + (1/36)*12 = 7
• Simpler way: E(X+Y) = E(X) + E(Y) (always!)
  • Linearity of expectation
  • E(X) = E(Y) = 3.5

Monty Hall problem
• You're a contestant on a game show. You see three closed doors, and behind one of them is a prize. You choose one door, and the host opens one of the other doors and reveals that there is no prize behind it. Then he offers you a chance to switch to the remaining door. Should you take it?

Monty Hall problem
• With probability 1/3 you picked the correct door, and with probability 2/3 you picked the wrong door. If you picked the correct door and then you switch, you lose. If you picked the wrong door and then you switch, you win the prize.
• Expected payoff of switching: (1/3) * 0 + (2/3) * Prize
• Expected payoff of not switching: (1/3) * Prize + (2/3) * 0
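The switch/stay payoffs are easy to check by simulation. Below is a minimal Monte Carlo sketch (my own, not from the slides) that plays the game many times and estimates the probability of winning under each policy.

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """Play one round; return True if the contestant wins the prize."""
    doors = [0, 1, 2]
    prize = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = random.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def estimate(switch: bool, n: int = 100_000) -> float:
    return sum(monty_hall_trial(switch) for _ in range(n)) / n

print("P(win | switch) ~", estimate(True))    # about 2/3
print("P(win | stay)   ~", estimate(False))   # about 1/3
```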
Product rule
• Definition of conditional probability: P(A | B) = P(A, B) / P(B)
• Sometimes we have the conditional probability and want to obtain the joint:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• The chain rule:
  P(A1, …, An) = P(A1) P(A2 | A1) P(A3 | A1, A2) ⋯ P(An | A1, …, An-1) = ∏_{i=1..n} P(Ai | A1, …, Ai-1)

The Birthday problem
• We have a set of n people. What is the probability that two of them share the same birthday?
• Easier to calculate the probability that n people do NOT share a birthday
• Let P(i | 1, …, i-1) denote the probability of the event that the i-th person does not share a birthday with the previous i-1 people:
  P(i | 1, …, i-1) = (365 - i + 1) / 365
• Probability that n people do not share a birthday (by the chain rule): ∏_{i=1..n} (365 - i + 1) / 365
• Probability that n people do share a birthday: one minus the above

The Birthday problem
• For 23 people, the probability of sharing a birthday is above 0.5!
http://en.wikipedia.org/wiki/Birthday_problem

Bayes Rule (Rev. Thomas Bayes, 1702-1761)
• The product rule gives us two ways to factor a joint distribution:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Therefore, P(A | B) = P(B | A) P(A) / P(B)
• Why is this useful?
  • Can get the diagnostic probability P(cavity | toothache) from the causal probability P(toothache | cavity)
  • Can update our beliefs based on evidence
  • Important tool for probabilistic inference

Bayes Rule example
• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

Bayes Rule example (solution)
  P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)
                    = P(Predict | Rain) P(Rain) / [P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain)]
                    = (0.9 * 0.014) / (0.9 * 0.014 + 0.1 * 0.986) ≈ 0.111

The IR Problem
• Given a query and a collection of documents (doc1, doc2, doc3, …), sort the docs in order of relevance to the query.

Example Query
• Query: The 1929 World Series
• 384,945,633 results in Alta Vista:
  • GNU's Not Unix! - the GNU Project and the Free Software Foundation (FSF)
  • Yahoo! Singapore
  • The USGenWeb Project - Home Page
  • …

Better List (Google)
• TSN Archives: The 1929 World Series
• Baseball Almanac - World Series Menu
• 1929 World Series - PHA vs. CHC - Baseball-Reference.com
• World Series Winners (1903-1929) (Baseball World)

Goal
• Should return as many relevant docs as possible (recall)
• Should return as few irrelevant docs as possible (precision)
• Typically a tradeoff…

Main Insights
• How do we identify "good" docs?
• More words in common is good.
• Rare words more important than common words. • Long documents carry less weight, all other things being equal. Bag of Words Model Just pay attention to which words appear in document and query. Ignore order. Boolean IR "and" all uncommon words Most web search engines. • Altavista: 79,628 hits • fast • not so accurate by itself Example: Biography Science and the Modern World (1925), a series of lectures given in the United States, served as an introduction to his later metaphysics. Whitehead's most important book, Process and Reality (1929), took this theory to a level of even greater generality. http://www-groups.dcs.stand.ac.uk/~history/Mathematicians/Whitehead.html Vector-space Model For each word in common between document and query, compute a weight. Sum the weights. tf = (term frequency) number of times term appears in the document idf = (inverse document frequency) divide by number of times term appears in any document Also various forms of document-length normalization. Example Formula i Insurance Try sumj tfi,j dfi 10440 3997 10422 8760 Weight(i,j) = (1+log(tfi,j)) log N/dfi Unless tfi,j = 0 (then 0). N documents, dfi doc frequency Cosine Normalization Cos(q,d) = sumi qi di / sqrt(sumi qi2) sqrt(sumi di2) Downweights long documents. (Perhaps too much.) Probabilistic Approach Lots of work studying different weighting schemes. Often very ad hoc, empirically motivated. Is there an analog of A* for IR? Elegant, simple, effective? Language Models Probability theory is gaining popularity. Originally speech recognition: If we can assign probabilities to sentence and phonemes, we can choose the sentence that minimizes the chance that we’re wrong… Probability Basics Pr(A): Probability A is true Pr(AB): Prob. both A & B are true Pr(~A): Prob. of not A: 1-Pr(A) Pr(A|B): Prob. of A given B Pr(AB)/Pr(B) Pr(A+B): Probability A or B is true Pr(A) + Pr(B) – Pr(AB) Venn Diagram B A AB Bayes Rule Pr(A|B) = Pr(B|A) Pr(A) / Pr(B) because Pr(AB) = Pr (B) Pr(A|B) = Pr(B|A) Pr(A) The most basic form of “learning”: • picking a likely model given the data • adjusting beliefs in light of new evidence Probability Cheat Sheet Chain rule: Pr(A,X|Y) = Pr(A|Y) Pr(X|A,Y) Summation rule: Pr(X|Y) = Pr(A X | Y) + Pr(~A X | Y) Bayes rule: Pr(A|BX) = Pr(B|AX) Pr(A|X)/Pr(B|X) Classification Example Given a song title, guess if it’s a country song or a rap song. • U Got it Bad • Cowboy Take Me Away • Feelin’ on Yo Booty • When God-Fearin' Women Get The Blues • God Bless the USA • Ballin’ out of Control Probabilistic Classification Language model gives: • Pr(T|R), Pr(T|C), Pr(C), Pr(R) Compare • Pr(R|T) vs. Pr(C|T) • Pr(T|R) Pr(R) / Pr(T) vs. Pr(T|C) Pr(C) / Pr(T) • Pr(T|R) Pr(R) vs. Pr(T|C) Pr(C) Naïve Bayes Pr(T|C) Generate words independently Pr(w1 w2 w3 … wn|C) = Pr(w1|C) Pr(w2|C) … Pr(wn|C) So, Pr(party|R) = 0.02, Pr(party|C) = 0.001 Estimating Naïve Bayes Where would these numbers come from? Take a list of country song titles. First attempt: Pr(w|C) = count(w; C) / sumw count(w; C) Smoothing Problem: Unseen words. Pr(party|C) = 0 Pr(Even Party Cowboys Get the Blues) = 0 Laplace Smoothing: Pr(w|C) = (1+count(w; C)) / sumw (1+count(w; C)) Other Applications Filtering • Advisories Text classification • Spam vs. important • Web hierarchy • Shakespeare vs. Jefferson • French vs. English IR Example Pr(d|q) = Pr(q|d) Pr(d) / Pr(q) Language model Constant Prior belief d is relevant (assume equal) Can view each document like a category for classification. 
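Pulling the last few slides together, here is a small Python sketch of a Naïve Bayes title classifier with Laplace smoothing. The training titles are the ones listed on the classification-example slide, but their country/rap labels and the equal priors are my assumptions for illustration.

```python
import math
from collections import Counter

# Titles from the classification-example slide; the country/rap split is assumed here.
country = ["Cowboy Take Me Away", "When God-Fearin' Women Get The Blues", "God Bless the USA"]
rap     = ["U Got it Bad", "Feelin' on Yo Booty", "Ballin' out of Control"]

def tokenize(title):
    return title.lower().split()

def word_counts(titles):
    """Count how often each word appears in a class's training titles."""
    return Counter(w for t in titles for w in tokenize(t))

def log_p_word(w, counts, vocab_size):
    # Laplace smoothing: Pr(w|C) = (1 + count(w;C)) / (vocab_size + sum_w count(w;C))
    return math.log((1 + counts[w]) / (vocab_size + sum(counts.values())))

def classify(title, class_counts, priors):
    vocab = {w for counts in class_counts.values() for w in counts}
    scores = {}
    for label, counts in class_counts.items():
        # Naive Bayes: log Pr(C) + sum_i log Pr(w_i | C)
        scores[label] = math.log(priors[label]) + sum(
            log_p_word(w, counts, len(vocab)) for w in tokenize(title))
    return max(scores, key=scores.get)

class_counts = {"country": word_counts(country), "rap": word_counts(rap)}
priors = {"country": 0.5, "rap": 0.5}  # assumed equal
print(classify("Even Party Cowboys Get the Blues", class_counts, priors))  # country
```

Because of the smoothing, the unseen words ("even", "party", "cowboys") no longer force a zero probability; the decision is driven by the words each class has actually seen.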
Smoothing Matters
• p(w|d) = ps(w|d) if count(w;d) > 0 (seen), and p(w|collection) if count(w;d) = 0
• ps(w|d): estimated from the document and smoothed
• p(w|collection): estimated from the corpus and smoothed
• Equivalent effect to TF-IDF.

Exercise
• The weather data, with counts:
    outlook:      sunny  yes 2 / no 3,  overcast  yes 4 / no 0,  rainy  yes 3 / no 2
    temperature:  hot    yes 2 / no 2,  mild      yes 4 / no 2,  cool   yes 3 / no 1
    humidity:     high   yes 3 / no 4,  normal    yes 6 / no 1
    windy:        false  yes 6 / no 2,  true      yes 3 / no 3
    play:         yes 9,  no 5
• A new day: outlook=sunny, temperature=cool, humidity=high, windy=true, play=?

61/30 The Central Problem in IR
• [Diagram: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms.]
• Do these represent the same concepts?

62/30 Relevance
• Relevance is a subjective judgment and may include:
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her intended use of the information (information need).

63/30 IR Ranking
• Early IR focused on set-based retrieval
  • Boolean queries, a set of conditions to be satisfied
  • a document either matches the query or not
  • like classifying the collection into relevant / non-relevant sets
  • still used by professional searchers
  • "advanced search" in many systems
• Modern IR: ranked retrieval
  • a free-form query expresses the user's information need
  • rank documents by decreasing likelihood of relevance
  • many studies show it is superior

64/30 A heuristic formula for IR
• Rank docs by similarity to the query
  • suppose the query is "cryogenic labs"
  • Similarity = # query words in the doc
  • favors documents with both "labs" and "cryogenic"
  • mathematically:  sim(D, Q) = ∑_{q∈Q} 1_{q∈D}
• Logical variations (set-based)
  • Boolean AND (require all words):  AND(D, Q) = ∏_{q∈Q} 1_{q∈D}
  • Boolean OR (any of the words):   OR(D, Q) = 1 − ∏_{q∈Q} (1 − 1_{q∈D})

65/30 Term Frequency (TF)
• Observation: key words tend to be repeated in a document
• Modify our similarity measure: give more weight if a word occurs multiple times
  sim(D, Q) = ∑_{q∈Q} tf_D(q)
• Problem: biased towards long documents, spurious occurrences
• Normalize by length:  sim(D, Q) = ∑_{q∈Q} tf_D(q) / |D|

66/30 Inverse Document Frequency (IDF)
• Observation:
  • rare words carry more meaning: cryogenic, apollo
  • frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure: give more weight to rare words … but don't be too aggressive (why?)
  sim(D, Q) = ∑_{q∈Q} [tf_D(q) / |D|] · log(|C| / df(q))
  • |C| … total number of documents
  • df(q) … total number of documents that contain q

67/30 TF normalization
• Observation:
  • D1 = {cryogenic, labs},  D2 = {cryogenic, cryogenic}
  • which document is more relevant?
  • which one is ranked higher? (df(labs) > df(cryogenic))
• Correction:
  • first occurrence more important than a repeat (why?)
  • "squash" the linearity of TF:  tf(q) → tf(q) / (tf(q) + K)   [plot: the squashed curve levels off as tf grows]

68/30 State-of-the-art Formula
  sim(D, Q) = ∑_{q∈Q} [tf_D(q) / (tf_D(q) + K·|D|)] · log(|C| / df(q))
• Repetitions of query words: good (tf in the numerator)
• Common words: less important (the log |C|/df(q) factor)
• More query words: good (the sum over q ∈ Q)
• Very long documents: penalized (|D| in the denominator)
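As an illustration, here is a short Python sketch of the formula above applied to a toy collection; the documents, the query, and the constant K=0.5 are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (invented documents), one string per document.
docs = [
    "the cryogenic labs published new cryogenic results",
    "the apollo program used cryogenic fuel",
    "the cat sat on the mat",
]
tokenized = [d.split() for d in docs]
C = len(tokenized)

# df(q): number of documents that contain q
df = Counter()
for toks in tokenized:
    for term in set(toks):
        df[term] += 1

def sim(doc_tokens, query, K=0.5):
    """sim(D,Q) = sum over q in Q of tf_D(q) / (tf_D(q) + K*|D|) * log(|C| / df(q))"""
    tf = Counter(doc_tokens)
    score = 0.0
    for q in query.split():
        if df[q] == 0:               # query term absent from the whole collection
            continue
        score += tf[q] / (tf[q] + K * len(doc_tokens)) * math.log(C / df[q])
    return score

query = "cryogenic labs"
for i in sorted(range(C), key=lambda i: sim(tokenized[i], query), reverse=True):
    print(round(sim(tokenized[i], query), 3), "-", docs[i])
```

The document containing both query words, including the rarer "labs", ranks first; the document with only "cryogenic" comes second; the unrelated document scores zero.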
69/30 Vector-space approach to IR
• [Diagram: documents and the query represented as vectors of term counts (e.g., along "cat", "pig", and "dog" axes); the angle θ between a document vector and the query vector measures their similarity.]

70/30 Some formulas for Sim
• Let a_i and b_i be the weights of term t_i in document D and query Q:
  • Dot product:  Sim(D, Q) = ∑_i (a_i * b_i)
  • Cosine:       Sim(D, Q) = ∑_i (a_i * b_i) / ( sqrt(∑_i a_i²) * sqrt(∑_i b_i²) )
  • Dice:         Sim(D, Q) = 2 ∑_i (a_i * b_i) / ( ∑_i a_i² + ∑_i b_i² )
  • Jaccard:      Sim(D, Q) = ∑_i (a_i * b_i) / ( ∑_i a_i² + ∑_i b_i² − ∑_i (a_i * b_i) )

71/30 Language-modeling Approach
• the query is a random sample from a "perfect" document
• words are "sampled" independently of each other
• rank documents by the probability of generating the query from D
• e.g., P(query | D) = P(w1|D) P(w2|D) P(w3|D) P(w4|D) = 4/9 * 2/9 * 4/9 * 3/9 in the illustrated example

72/30 PageRank in Google
• [Diagram: pages I1, I2, … link to page A; A links on to B]
  PR(A) = (1 − d) + d · ∑_i PR(I_i) / C(I_i)
• Assign a numeric value to each page
• The more a page is referred to by important pages, the more important this page is
• d: damping factor (0.85); C(I_i): number of outgoing links of page I_i
• Many other criteria: e.g. proximity of query words
  • "…information retrieval …" better than "… information … retrieval …"

74/30 Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  • "restaurant" vs. "café"
  • "PRC" vs. "China"
• May retrieve irrelevant documents that include ambiguous terms.
  • "bat" (baseball vs. mammal)
  • "Apple" (company vs. fruit)
  • "bit" (unit of data vs. act of eating)

75/30 Query Expansion
• http://www.lemurproject.org/lemur/IndriQueryLanguage.php
• Most errors caused by vocabulary mismatch
  • query: "cars", document: "automobiles"
  • solution: automatically add highly-related words
• Thesaurus / WordNet lookup:
  • add semantically-related words (synonyms)
  • cannot take context into account:
  • "rail car" vs. "race car" vs. "car and cdr"
• Statistical Expansion:
  • add statistically-related words (co-occurrence)
  • very successful

76/30 Document indexing
• Goal = find the important meanings and create an internal representation
• Factors to consider:
  • Accuracy in representing meanings (semantics)
  • Exhaustiveness (cover all the contents)
  • Facility for the computer to manipulate
• What is the best representation of contents?
  • Char. string (char trigrams): not precise enough
  • Word: good coverage, not precise
  • Phrase: poor coverage, more precise
  • Concept: poor coverage, precise
• [Diagram: moving from string to word to phrase to concept representations trades coverage (recall) for accuracy (precision).]

77/30 Indexer steps
• Sequence of (Modified token, Document ID) pairs.
• Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
• Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
• Token stream: (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)
78/30
• Multiple term entries in a single document are merged.
• Frequency information is added.
Term Doc # ambitious 2 be 2 brutus 1 brutus 2 capitol 1 caesar 1 caesar 2 caesar 2 did 1 enact 1 hath 1 I 1 I 1 i' 1 it 2 julius 1 killed 1 killed 1 let 2 me 1 noble 2 so 2 the 1 the 2 told 2 you 2 was 1 was 2 with 2 Term Doc # ambitious be brutus brutus capitol caesar caesar did enact hath I i' it julius killed let me noble so the the told you was was with 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 Term freq 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 79/30 Stopwords / Stoplist • function words do not bear useful information for IR of, in, about, with, I, although, … • Stoplist: contain stopwords, not to be used as index • Prepositions • Articles • Pronouns • Some adverbs and adjectives • Some frequent words (e.g. document) • The removal of stopwords usually improves IR effectiveness • A few “standard” stoplists are commonly used. 80/30 Stemming • Reason: • Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them • Stemming: • Removing some endings of word computer compute computes computing computed computation comput 81/30 Lemmatization • transform to standard form according to syntactic category. E.g. verb + ing verb noun + s noun • Need POS tagging • More accurate than stemming, but needs more resources • crucial to choose stemming/lemmatization rules noise v.s. recognition rate • compromise between precision and recall light/no stemming -recall +precision severe stemming +recall -precision What to Learn IR problem and TF-IDF. Unigram language models. Naïve Bayes and simple Bayesian classification. Need for smoothing. 83/34 Another Example on WSD • Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. • Sense Inventory usually comes from a dictionary or thesaurus. • Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches 84/34 Ambiguity for a Computer • The fisherman jumped off the bank and into the water. • The bank down the street was robbed! • Back in the day, we had an entire bank of computers devoted to this problem. • The bank in that road is entirely too steep and is really dangerous. • The plane took a bank to the left, and then headed off towards the mountains. 85/34 Early Days of WSD • Noted as problem for Machine Translation (Weaver, 1949) • A word can often only be translated if you know the specific sense intended (A bill in English could be a pico or a cuenta in Spanish) • Bar-Hillel (1960) posed the following: • Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy. • Is “pen” a writing instrument or an enclosure where children play? • …declared it unsolvable, left the field of MT! 86/34 Since then… • 1970s - 1980s • Rule based systems • Rely on hand crafted knowledge sources • 1990s • Corpus based approaches • Dependence on sense tagged text • (Ide and Veronis, 1998) overview history from early days to 1998. 
• 2000s • Hybrid Systems • Minimizing or eliminating use of sense tagged text • Taking advantage of the Web 87/34 Machine Readable Dictionaries • In recent years, most dictionaries made available in Machine Readable format (MRD) • Oxford English Dictionary • Collins • Longman Dictionary of Ordinary Contemporary English (LDOCE) • Thesauruses – add synonymy information • Roget Thesaurus • Semantic networks – add more semantic relations • WordNet • EuroWordNet 88/34 Lesk Algorithm • • (Michael Lesk 1986): Identify senses of words in context using definition overlap Algorithm: • Retrieve from MRD all sense definitions of the words to be disambiguated • Determine the definition overlap for all possible sense combinations • Choose senses that lead to highest overlap Example: disambiguate PINE CONE • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees Pine#1 Cone#1 = 0 Pine#2 Cone#1 = 0 Pine#1 Cone#2 = 1 Pine#2 Cone#2 = 0 Pine#1 Cone#3 = 2 Pine#2 Cone#3 = 0 89/34 Heuristics: One Sense Per Discourse • • A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowksy 1992) What does this mean? E.g. The ambiguous word PLANT occurs 10 times in a discourse all instances of “plant” carry the same meaning • • Evaluation: • 8 words with two-way ambiguity, e.g. plant, crane, etc. • 98% of the two-word occurrences in the same discourse carry the same meaning The grain of salt: Performance depends on granularity • (Krovetz 1998) experiments with words with more than two senses • Performance of “one sense per discourse” measured on SemCor is approx. 70% 90/34 Heuristics: One Sense per Collocation • • • • A word tends to preserver its meaning when used in the same collocation (Yarowsky 1993) • Strong for adjacent collocations • Weaker as the distance between words increases An example The ambiguous word PLANT preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs Evaluation: • 97% precision on words with two-way ambiguity Finer granularity: • (Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses • 70% precision on SemCor words 91/34 What is Supervised Learning? • Collect a set of examples that illustrate the various possible classifications or outcomes of an event. • Identify patterns in the examples associated with each particular class of the event. • Generalize those patterns into rules. • Apply the rules to classify a new event. Learn from these examples : “when do I go to the store?” Day CLASS Go to Store? F1 Hot Outside? F2 Slept Well? F3 Ate Well? 1 YES YES NO NO 2 NO YES NO YES 3 YES NO NO NO 4 NO NO NO 92/34 YES Learn from these examples : “when do I go to the store?” Day CLASS Go to Store? F1 Hot Outside? F2 Slept Well? F3 Ate Well? 1 YES YES NO NO 2 NO YES NO YES 3 YES NO NO NO 4 NO NO NO 93/34 YES 94/34 Task Definition: Supervised WSD • Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques. 
• Resources • Sense Tagged Text • Dictionary (implicit source of sense inventory) • Syntactic Analysis (POS tagger, Chunker, Parser, …) • Scope • Typically one target word per context • Part of speech of target word resolved • Lends itself to “targeted word” formulation • Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs Sense Tagged Text Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers My bank/1 charges too much for an overdraft. I went to the bank/1 to deposit my check and get a new ATM card. The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River. My grandfather planted his pole in the bank/2 and got a great big catfish! The bank/2 is pretty muddy, I can’t walk there. 95/34 Two Bags of Words (Co-occurrences in the “window of context”) FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West 96/34 97/34 Simple Supervised Approach Given a sentence S containing “bank”: For each word Wi in S If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1; If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1; If Sense_1 > Sense_2 then print “Financial” else if Sense_2 > Sense_1 then print “River” else print “Can’t Decide”; 98/34 Supervised Learning (Cont’) Training Text Instance Vector (Feature) Training (Model Estimator) Model Test Test Text Feature Output 99/34 Supervised Methodology for WSD • Create a sample of training data where a given target word is • • • • • manually annotated with a sense from a predetermined set of possibilities. • One tagged word per instance/lexical sample disambiguation Select a set of features with which to represent context. • co-occurrences, collocations, POS tags, verb-obj relations, etc... Convert sense-tagged training instances to feature vectors. Apply a machine learning algorithm to induce a classifier. • Form – structure or relation among features • Parameters – strength of feature interactions Convert a held out sample of test data into feature vectors. • “correct” sense tags are known but not used Apply classifier to test instances to assign a sense tag. 100/34 From Text to Feature Vectors • My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1) • The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2) S1 S2 P-2 P-1 P+1 P+2 fish check river interest SENSE TAG adv det prep det Y N Y N SHORE det verb det N Y N Y FINANCE 101/34 Supervised Learning Algorithms • Once data is converted to feature vector form, any supervised learning algorithm can be used. 
Many have been applied to WSD with good results:
  • Support Vector Machines
  • Nearest Neighbor Classifiers
  • Decision Trees
  • Decision Lists
  • Naïve Bayesian Classifiers
  • Perceptrons
  • Neural Networks
  • Graphical Models
  • Log Linear Models

102/34 Naïve Bayesian Classifier
• The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997)
  • …Word Sense Disambiguation is no exception
• Assumes conditional independence among features, given the sense of a word.
  • The form of the model is assumed, but the parameters are estimated from training instances
• When applied to WSD, the features are often "a bag of words" that come from the training data
  • Usually thousands of binary features that indicate whether a word is present in the context of the target word (or not)

103/34 Bayesian Inference
  p(S | F1, F2, F3, …, Fn) = p(F1, F2, F3, …, Fn | S) * p(S) / p(F1, F2, F3, …, Fn)
• Given the observed features, what is the most likely sense?
• Estimate the probability of the observed features given the sense
• Estimate the unconditional probability of the sense
• The unconditional probability of the features is a normalizing term; it doesn't affect sense classification

104/34 Naïve Bayesian Model
• [Diagram: the sense S is the parent of the features F1, F2, F3, F4, …, Fn]
  P(F1, F2, …, Fn | S) = p(F1 | S) * p(F2 | S) * … * p(Fn | S)

105/34 The Naïve Bayesian Classifier
• If V is a vector representing the frequencies of the different words in context, and F1…Fn are elements of this vector, we select the sense
  s = argmax_s p(s | V) = argmax_s p(V | s) p(s)
• Assuming the probabilities of the F1…Fn are independent:
  sense = argmax_{S ∈ senses} p(F1 | S) * … * p(Fn | S) * p(S)

106/34 The Naïve Bayesian Classifier
  sense = argmax_{S ∈ senses} p(F1 | S) * … * p(Fn | S) * p(S)
• Given 2,000 instances of "bank", 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
  • P(S=1) = 1,500/2,000 = .75
  • P(S=2) = 500/2,000 = .25
• Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:
  • P(F1="credit") = 204/2,000 = .102
  • P(F1="credit" | S=1) = 200/1,500 = .133
  • P(F1="credit" | S=2) = 4/500 = .008
• Given a test instance that has one feature, "credit":
  • P(S=1 | F1="credit") = .133 * .75 / .102 = .978
  • P(S=2 | F1="credit") = .008 * .25 / .102 = .020

107/34 Comparative Results
• (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line…
• (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line…
• (Pedersen, 1998) compared Naïve Bayes with a Decision Tree, a Rule Based Learner, a Probabilistic Model, etc. when disambiguating line and 12 other words…
• …All found that the Naïve Bayesian Classifier performed as well as any of the other methods!

108/34 Supervised WSD with Individual Classifiers
• Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation; most work reasonably well.
• (Witten and Frank, 2000) is a great intro to supervised learning.
• Features tend to differentiate among methods more than the learning algorithms.
• Good sets of features tend to include: • Co-occurrences or keywords (global) • Collocations (local) • Bigrams (local and global) • Part of speech (local) • Predicate-argument relations • Verb-object, subject-verb, • Heads of Noun and Verb Phrases 109/34 Convergence of Results • Accuracy of different systems applied to the same data tends to converge on a particular value, no one system shockingly better than another. • Senseval-1, a number of systems in range of 74-78% accuracy for English Lexical Sample task. • Senseval-2, a number of systems in range of 61-64% accuracy for English Lexical Sample task. • Senseval-3, a number of systems in range of 70-73% accuracy for English Lexical Sample task… • What to do next? • Minimally supervised WSD (later in semi-supervised learning lectures) (Hearst, 1991; Yarowsky, 1995) • Can use bi-lingual parallel corpora (Ng et al., ACL 2003)
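To tie the feature discussion back to the earlier "From Text to Feature Vectors" slide, here is a small Python sketch (mine, not from the slides) that extracts two of the feature types listed above from a sense-tagged sentence: local collocations at positions -2…+2 around the target word, and global bag-of-words co-occurrences within a window. The part-of-speech and predicate-argument features would additionally require a POS tagger or parser, so they are omitted here.

```python
def extract_features(tokens, target_index, window=10):
    """Return a dict of binary context features for the target word."""
    feats = {}
    # Local collocational features: the words at positions P-2, P-1, P+1, P+2
    for offset in (-2, -1, 1, 2):
        i = target_index + offset
        if 0 <= i < len(tokens):
            feats[f"P{offset:+d}={tokens[i].lower()}"] = 1
    # Global co-occurrence (bag-of-words) features within the window
    lo, hi = max(0, target_index - window), min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index:
            feats[f"has_{tokens[i].lower()}"] = 1
    return feats

# Sentence adapted from the "From Text to Feature Vectors" slide; the target word is "banks".
sent = "My grandfather used to fish along the banks of the Mississippi River".split()
print(extract_features(sent, sent.index("banks")))
# {'P-2=along': 1, 'P-1=the': 1, 'P+1=of': 1, 'P+2=the': 1, 'has_my': 1, 'has_grandfather': 1, ...}
```

Feature vectors built this way can be fed to any of the supervised learners listed earlier (Naïve Bayes, decision trees, SVMs, …).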