QUANTIFYING UNCERTAINTY
Heng Ji
[email protected]
03/29, 04/01, 2016
Top 1 Proposal Presentation
• Multimedia Joint Model: Spencer Whitehead 4.83
• "good idea building on existing system"
• "interesting problem, clear schedule"
• "very professional and well-reported"
• "potentially super useful"
• "cool project"
• "could use more pictures, good speaking"
• "little ambiguous, not entirely"
• "very nice"
• "interesting research"
• "sounds very ambitious"
• "need to figure out the layout of texts (some words are covered by
pictures)"
Top 2 Team
• Sudoku Destructifier: Corey Byrne, Garrett Chang, Darren Lin,
Ben Litwack 4.82
• "seems like a pretty easy project for 4 people"
• "too many slides"
• "I like the topic!"
• "great idea"
• "good luck"
• "very interesting modification"
• "funny project name"
• "uneven distribution of speaking"
• "good speaking"
• "using the procedural algorithm against itself. nice."
Top 3 Team
• Protozoan Perception: Ahmed Eleish, Chris Paradis, Rob
Russo, Brandon Thorne 4.81
• "interesting problem"
• "ambitious but could be a very interesting project"
• "good project. interesting problem/approach. Useful."
• "good idea"
• "specific and narrow goals"
• "very clear"
• "very unique and interesting idea for a project. good crossover w/ CS + other
•
•
•
•
•
•
•
disciplines"
"too many slides"
"a bit long"
"cool stuff"
"ambitious"
"only 2 people talked"
"not all members participated"
"seems useful for scientists. what's additional in this project that's not in your
Top 4 – 7 Teams
• Forgery Detection Julia Fantacone, Connor Hadley, Asher
Norland, Alexandra Zytek 4.76
• German to English Machine Translation Kevin O’Neill and
Mitchell Mellone 4.73
• Bias Check on web documents Chris Higley, Michael Han,
Terence Farrell 4.73
• Path-finding Agents with swarm based travel Nicholas
Lockwood 4.73
Outline
• Probability Basics
• Naïve Bayes
• Application in Word Sense Disambiguation
• Application in Information Retrieval
Uncertainty
• So far in course, everything deterministic
• If I walk with my umbrella, I will not get wet
• But: there is some chance my umbrella will break!
• Intelligent systems must take possibility of failure into
account…
• May want to have backup umbrella in city that is often windy
and rainy
• … but should not be excessively conservative
• Two umbrellas not worthwhile for city that is usually not windy
• Need quantitative notion of uncertainty
Uncertainty
• General situation:
• Observed variables (evidence): Agent
knows certain things about the state of
the world (e.g., sensor readings or
symptoms)
• Unobserved variables: Agent needs to
reason about other aspects (e.g. where
an object is or what disease is present)
• Model: Agent knows something about
how the known variables relate to the
unknown variables
• Probabilistic reasoning gives us a
framework for managing our beliefs
and knowledge
Random Variables
• A random variable is some aspect of the
world about which we (may) have
uncertainty
• R = Is it raining?
• T = Is it hot or cold?
• D = How long will it take to drive to work?
• L = Where is the ghost?
• We denote random variables with capital
letters
• Like variables in a CSP, random variables
have domains
• R in {true, false} (often write as {+r, -r})
• T in {hot, cold}
• D in [0, ∞)
• L in possible locations, maybe {(0,0), (0,1), …}
Probability
• Example: roll two dice
• Random variables:
  – X = value of die 1
  – Y = value of die 2
• Outcome is represented by an ordered pair of values (x, y)
  – E.g., (6, 1): X=6, Y=1
  [Grid: X = 1..6 across, Y = 1..6 up; each of the 36 outcomes has probability 1/36]
• An atomic event (sample point) tells us the complete state of the world, i.e., the values of all random variables
• Exactly one atomic event will happen; each atomic event has a ≥ 0 probability; the probabilities sum to 1
• An event is a proposition about the state (= subset of states), e.g., X+Y = 7
• Probability of an event = sum of the probabilities of the atomic events in it
Probability Distributions
• Associate a probability with each value
• Temperature:
  T     P
  hot   0.5
  cold  0.5
• Weather:
  W       P
  sun     0.6
  rain    0.1
  fog     0.3
  meteor  0.0
Probability Distributions
• Unobserved random variables have distributions
  T     P           W       P
  hot   0.5         sun     0.6
  cold  0.5         rain    0.1
                    fog     0.3
                    meteor  0.0
• Shorthand notation: P(hot) for P(T = hot), etc. – OK if all domain entries are unique
• A distribution is a TABLE of probabilities of values
• A probability (lower case value) is a single number
• Must have: P(X = x) ≥ 0 for every x, and ∑x P(X = x) = 1
Joint Distributions
• A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment (or outcome):
  P(X1 = x1, …, Xn = xn)
• Must obey: P(x1, …, xn) ≥ 0, and the probabilities over all assignments sum to 1
  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3
• Size of distribution if n variables with domain sizes d?
• For all but the smallest distributions, impractical to write out!
Probabilistic Models
• A probabilistic model is a joint distribution over a set of random variables
• Probabilistic models:
  • (Random) variables with domains
  • Assignments are called outcomes
  • Joint distributions: say whether assignments (outcomes) are likely
  • Normalized: sum to 1.0
  • Ideally: only certain variables directly interact
• Constraint satisfaction problems:
  • Variables with domains
  • Constraints: state whether assignments are possible
  • Ideally: only certain variables directly interact

Distribution over T, W:
  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3

Constraint over T, W:
  T     W     P
  hot   sun   T
  hot   rain  F
  cold  sun   F
  cold  rain  T
Events
• An event is a set E of outcomes
• From a joint distribution, we can calculate the probability of any event
  • Probability that it's hot AND sunny?
  • Probability that it's hot?
  • Probability that it's hot OR sunny?
• Typically, the events we care about are partial assignments, like P(T=hot)
  T     W     P
  hot   sun   0.4
  hot   rain  0.1
  cold  sun   0.2
  cold  rain  0.3
Facts about probabilities of events
• If events A and B are disjoint, then
• P(A or B) = P(A) + P(B)
• More generally:
• P(A or B) = P(A) + P(B) - P(A and B)
• If events A1, …, An are disjoint and exhaustive (one of them
must happen) then P(A1) + … + P(An) = 1
• Special case: for any random variable, ∑x P(X=x) = 1
• Marginalization: P(X=x) = ∑y P(X=x and Y=y)
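A minimal Python sketch of these facts, using the joint distribution over T and W from the earlier slides (the variable names are mine, for illustration only):

joint = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
         ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

# Marginalization: P(T=hot) = sum over w of P(T=hot, W=w)
p_hot = sum(p for (t, w), p in joint.items() if t == 'hot')        # 0.5
p_sun = sum(p for (t, w), p in joint.items() if w == 'sun')        # 0.6

# Probability of an event = sum of the probabilities of its atomic events
p_hot_or_sun = sum(p for (t, w), p in joint.items()
                   if t == 'hot' or w == 'sun')                    # 0.7

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
assert abs(p_hot_or_sun - (p_hot + p_sun - joint[('hot', 'sun')])) < 1e-9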
Conditional probability
• Probability of cavity given toothache:
P(Cavity = true | Toothache = true)
• For any two events A and B,
  P(A | B) = P(A ∩ B) / P(B) = P(A, B) / P(B)
  [Venn diagram: regions P(A) and P(B) with overlap P(A ∩ B)]
Conditional probability
• We might know something about the world – e.g., “X+Y=6 or X+Y=7”
– given this (and only this), what is the probability of Y=5?
• Part of the sample space is eliminated; probabilities are renormalized
to sum to 1
[Left grid: the joint distribution over (X, Y); each of the 36 outcomes has probability 1/36.
 Right grid: after conditioning on "X+Y=6 or X+Y=7", only the 11 outcomes on those two diagonals remain, each with probability 1/11; every other outcome has probability 0.]
• P(Y=5 | (X+Y=6) or (X+Y=7)) = 2/11
Facts about conditional probability
• P(A | B) = P(A and B) / P(B)
[Venn diagram: sample space containing regions A and B with overlap "A and B"]
• P(A | B)P(B) = P(A and B)
• P(A | B) = P(B | A)P(A)/P(B)
– Bayes’ rule
How can we scale this?
• In principle, we now have a complete
approach for reasoning under uncertainty:
• Specify probability for every atomic event,
• Can compute probabilities of events simply by summing probabilities of
atomic events,
• Conditional probabilities are specified in terms of probabilities of events:
P(A | B) = P(A and B) / P(B)
• If we have n variables that can each take k values, how many
atomic events are there?
Independence
• Some variables have nothing to do with each other
• Dice: if X=6, it tells us nothing about Y
• P(Y=y | X=x) = P(Y=y)
• So: P(X=x and Y=y) = P(Y=y | X=x)P(X=x) =
P(Y=y)P(X=x)
• Usually just write P(X, Y) = P(X)P(Y)
• Only need to specify 6+6=12 values instead of 6*6=36
values
• Independence among 3 variables: P(X,Y,Z)=P(X)P(Y)P(Z), etc.
• Are the events “you get a flush” and “you get a
straight” independent?
An example without cards or dice
                     Rain in Durham   Sun in Durham
  Rain in Beaufort        .2               .2
  Sun in Beaufort         .1               .5
  (disclaimer: no idea if these numbers are realistic)
• What is the probability of
  • Rain in Beaufort? Rain in Durham?
  • Rain in Beaufort, given rain in Durham?
  • Rain in Durham, given rain in Beaufort?
• Rain in Beaufort and rain in Durham are correlated
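A short Python sketch of the computations this slide asks for, using the (admittedly made-up) numbers from the table:

p = {('rain_b', 'rain_d'): 0.2, ('rain_b', 'sun_d'): 0.2,
     ('sun_b', 'rain_d'): 0.1, ('sun_b', 'sun_d'): 0.5}

p_rain_beaufort = p[('rain_b', 'rain_d')] + p[('rain_b', 'sun_d')]   # 0.4
p_rain_durham = p[('rain_b', 'rain_d')] + p[('sun_b', 'rain_d')]     # 0.3

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_rb_given_rd = p[('rain_b', 'rain_d')] / p_rain_durham              # 0.2/0.3 ≈ 0.67
p_rd_given_rb = p[('rain_b', 'rain_d')] / p_rain_beaufort            # 0.2/0.4 = 0.5

# Correlated: P(rain in Beaufort | rain in Durham) > P(rain in Beaufort)
print(p_rb_given_rd, p_rain_beaufort)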
Conditional independence
• Intuition:
• the only reason that X told us something about Y,
• is that X told us something about Z,
• and Z tells us something about Y
• If we already know Z, then X tells us nothing about Y
• P(Y | Z, X) = P(Y | Z) or
• P(X, Y | Z) = P(X | Z)P(Y | Z)
• “X and Y are conditionally independent given Z”
Context-specific independence
• Recall P(X, Y | Z) = P(X | Z)P(Y | Z) really means: for all x, y,
z, P(X=x, Y=y | Z=z) = P(X=x | Z=z)P(Y=y | Z=z)
• But it may not be true for all z
• P(Wet, RainingInLondon | CurrentLocation=New York) =
P(Wet | CurrentLocation=New York)P(RainingInLondon |
CurrentLocation=New York)
• But not
• P(Wet, RainingInLondon | CurrentLocation=London) = P(Wet
| CurrentLocation=London)P(RainingInLondon |
CurrentLocation=London)
Expected value
• If Z takes numerical values, then the expected
value of Z is E(Z) = ∑z P(Z=z)*z
• Weighted average (weighted by probability)
• Suppose Z is sum of two dice
• E(Z) = (1/36)*2 + (2/36)*3 + (3/36)*4 + (4/36)*5 +
(5/36)*6 + (6/36)*7 + (5/36)*8 + (4/36)*9 + (3/36)*10
+ (2/36)*11 + (1/36)*12 = 7
• Simpler way: E(X+Y)=E(X)+E(Y) (always!)
• Linearity of expectation
• E(X) = E(Y) = 3.5
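A quick Python check of this calculation (a sketch, not part of the slides):

from itertools import product
from fractions import Fraction

# E(Z) for Z = X + Y with two fair dice: average of x+y weighted by 1/36
e_z = sum(Fraction(1, 36) * (x + y) for x, y in product(range(1, 7), repeat=2))
print(e_z)                                             # 7

# Linearity of expectation: E(X + Y) = E(X) + E(Y)
e_x = sum(Fraction(1, 6) * x for x in range(1, 7))     # 7/2 = 3.5
assert e_z == e_x + e_x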
Monty Hall problem
• You’re a contestant on a game show. You see three
closed doors, and behind one of them is a prize. You
choose one door, and the host opens one of the other
doors and reveals that there is no prize behind it. Then he
offers you a chance to switch to the remaining door.
Should you take it?
Monty Hall problem
• With probability 1/3, you picked the correct door,
and with probability 2/3, picked the wrong door. If
you picked the correct door and then you switch,
you lose. If you picked the wrong door and then
you switch, you win the prize.
• Expected payoff of switching:
(1/3) * 0 + (2/3) * Prize
• Expected payoff of not switching:
(1/3) * Prize + (2/3) * 0
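A Monte Carlo sketch of the argument above (illustrative Python; not from the slides):

import random

def monty_hall(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize, choice = random.randrange(3), random.randrange(3)
        # Host opens a door that is neither your choice nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(monty_hall(switch=True))    # ≈ 2/3
print(monty_hall(switch=False))   # ≈ 1/3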
Product rule
• Definition of conditional probability:
  P(A | B) = P(A, B) / P(B)
• Sometimes we have the conditional probability and want to obtain the joint:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• The chain rule:
  P(A1, …, An) = P(A1) P(A2 | A1) P(A3 | A1, A2) ⋯ P(An | A1, …, An−1)
               = ∏ i=1..n P(Ai | A1, …, Ai−1)
The Birthday problem
• We have a set of n people. What is the probability that two of
them share the same birthday?
• Easier to calculate the probability that no two of the n people share a birthday
• Let P(i | 1, …, i–1) denote the probability of the event that the ith person does not share a birthday with the previous i–1 people:
  P(i | 1, …, i–1) = (365 – i + 1)/365
• Probability that no two of the n people share a birthday:
  ∏ i=1..n (365 – i + 1)/365
• Probability that at least two of the n people share a birthday: one minus the above
The Birthday problem
• For 23 people, the probability of sharing a birthday is
above 0.5!
http://en.wikipedia.org/wiki/Birthday_problem
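A few lines of Python reproduce this number (a sketch, under the usual assumption of 365 equally likely birthdays):

def p_shared_birthday(n):
    # Probability that no two of the n people share a birthday
    p_distinct = 1.0
    for i in range(1, n + 1):
        p_distinct *= (365 - i + 1) / 365
    return 1 - p_distinct

print(p_shared_birthday(22))   # ≈ 0.476
print(p_shared_birthday(23))   # ≈ 0.507, the first n above 0.5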
Bayes Rule
Rev. Thomas Bayes
(1702-1761)
• The product rule gives us two ways to factor a joint
distribution:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Therefore,
  P(A | B) = P(B | A) P(A) / P(B)
• Why is this useful?
• Can get diagnostic probability P(cavity | toothache) from causal
probability P(toothache | cavity)
• Can update our beliefs based on evidence
• Important tool for probabilistic inference
Bayes Rule example
• Marie is getting married tomorrow, at an outdoor ceremony in
the desert. In recent years, it has rained only 5 days each
year (5/365 = 0.014). Unfortunately, the weatherman has
predicted rain for tomorrow. When it actually rains, the
weatherman correctly forecasts rain 90% of the time. When it
doesn't rain, he incorrectly forecasts rain 10% of the time.
What is the probability that it will rain on Marie's wedding?
  P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)
                    = P(Predict | Rain) P(Rain) / [ P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain) ]
                    = (0.9 * 0.014) / (0.9 * 0.014 + 0.1 * 0.986)
                    ≈ 0.111
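The same calculation as a Python sketch (the probabilities are the ones given in the problem):

p_rain = 5 / 365                # prior for rain, ≈ 0.014
p_dry = 1 - p_rain
p_predict_given_rain = 0.9      # weatherman forecasts rain when it rains
p_predict_given_dry = 0.1       # weatherman forecasts rain when it doesn't

# Bayes rule, with the denominator expanded by the law of total probability
p_predict = p_predict_given_rain * p_rain + p_predict_given_dry * p_dry
p_rain_given_predict = p_predict_given_rain * p_rain / p_predict
print(round(p_rain_given_predict, 3))   # ≈ 0.111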
The IR Problem
query
• doc1
• doc2
• doc3
...
Sort docs in order of relevance to query.
Example Query
Query: The 1929 World Series
384,945,633 results in Alta Vista
• GNU's Not Unix! - the GNU Project and the Free
Software Foundation (FSF)
• Yahoo! Singapore
• The USGenWeb Project - Home Page
•…
Better List (Google)
• TSN Archives: The 1929 World Series
• Baseball Almanac - World Series Menu
• 1929 World Series - PHA vs. CHC - Baseball-
Reference.com
• World Series Winners (1903-1929) (Baseball World)
Goal
• Should return as many relevant docs as possible → recall
• Should return as few irrelevant docs as possible → precision
• Typically a tradeoff…
Main Insights
How identify “good” docs?
• More words in common is good.
• Rare words more important than common words.
• Long documents carry less weight, all other things being
equal.
Bag of Words Model
Just pay attention to which words appear in document and
query.
Ignore order.
Boolean IR
"and" all uncommon words
Most web search engines.
• Altavista: 79,628 hits
• fast
• not so accurate by itself
Example: Biography
Science and the Modern World (1925), a series of
lectures given in the United States, served as an
introduction to his later metaphysics.
Whitehead's most important book, Process and
Reality (1929), took this theory to a level of even
greater generality.
http://www-groups.dcs.stand.ac.uk/~history/Mathematicians/Whitehead.html
Vector-space Model
For each word in common between document and
query, compute a weight. Sum the weights.
tf = (term frequency) number of times term
appears in the document
idf = (inverse document frequency) divide by
number of times term appears in any document
Also various forms of document-length
normalization.
Example Formula
  term i      Σj tf(i,j)   df(i)
  insurance   10440        3997
  try         10422        8760
Weight(i,j) = (1 + log(tf(i,j))) * log(N/df(i)), unless tf(i,j) = 0 (then 0).
N = number of documents; df(i) = document frequency of term i
Cosine Normalization
Cos(q, d) = Σi qi di / ( sqrt(Σi qi²) * sqrt(Σi di²) )
Downweights long documents.
(Perhaps too much.)
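A small Python sketch of this style of weighting plus cosine scoring (the function names, the corpus size n_docs, and the toy counts are mine, for illustration):

import math
from collections import Counter

def weight(tf, df, n_docs):
    # Weight(i,j) = (1 + log tf) * log(N / df), and 0 when the term is absent
    return 0.0 if tf == 0 else (1 + math.log(tf)) * math.log(n_docs / df)

def cosine(q_vec, d_vec):
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    norm_q = math.sqrt(sum(w * w for w in q_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

doc_tf = Counter(["insurance", "insurance", "try"])   # toy document
query_tf = Counter(["insurance"])                     # toy query
df, n_docs = {"insurance": 3997, "try": 8760}, 100_000
d_vec = {t: weight(c, df[t], n_docs) for t, c in doc_tf.items()}
q_vec = {t: weight(c, df[t], n_docs) for t, c in query_tf.items()}
print(cosine(q_vec, d_vec))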
Probabilistic Approach
Lots of work studying different weighting schemes.
Often very ad hoc, empirically motivated.
Is there an analog of A* for IR? Elegant, simple, effective?
Language Models
Probability theory is gaining popularity. Originally speech
recognition:
If we can assign probabilities to sentence and phonemes,
we can choose the sentence that minimizes the chance
that we’re wrong…
Probability Basics
Pr(A): Probability A is true
Pr(AB): Prob. both A & B are true
Pr(~A): Prob. of not A: 1 – Pr(A)
Pr(A|B): Prob. of A given B: Pr(AB)/Pr(B)
Pr(A+B): Probability A or B is true: Pr(A) + Pr(B) – Pr(AB)
[Venn diagram: regions A and B with overlap AB]
Bayes Rule
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
because
Pr(AB) = Pr (B) Pr(A|B) = Pr(B|A) Pr(A)
The most basic form of “learning”:
• picking a likely model given the data
• adjusting beliefs in light of new evidence
Probability Cheat Sheet
Chain rule:
Pr(A,X|Y) = Pr(A|Y) Pr(X|A,Y)
Summation rule:
Pr(X|Y) = Pr(A X | Y) + Pr(~A X | Y)
Bayes rule:
Pr(A|BX) = Pr(B|AX) Pr(A|X)/Pr(B|X)
Classification Example
Given a song title, guess if it’s a country song or a rap
song.
• U Got it Bad
• Cowboy Take Me Away
• Feelin’ on Yo Booty
• When God-Fearin' Women Get The Blues
• God Bless the USA
• Ballin’ out of Control
Probabilistic Classification
Language model gives:
• Pr(T|R), Pr(T|C), Pr(C), Pr(R)
Compare
• Pr(R|T) vs. Pr(C|T)
• Pr(T|R) Pr(R) / Pr(T) vs.
Pr(T|C) Pr(C) / Pr(T)
• Pr(T|R) Pr(R) vs. Pr(T|C) Pr(C)
Naïve Bayes
Pr(T|C)
Generate words independently
Pr(w1 w2 w3 … wn|C)
= Pr(w1|C) Pr(w2|C) … Pr(wn|C)
So, Pr(party|R) = 0.02,
Pr(party|C) = 0.001
Estimating Naïve Bayes
Where would these numbers come from?
Take a list of country song titles.
First attempt:
Pr(w|C) = count(w; C) / Σw count(w; C)
Smoothing
Problem: Unseen words. Pr(party|C) = 0
Pr(Even Party Cowboys Get the Blues) = 0
Laplace Smoothing:
Pr(w|C) = (1 + count(w; C)) / Σw (1 + count(w; C))
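A minimal Python sketch of these two estimates (the toy "training titles" below are placeholders, not real data):

from collections import Counter

country_titles = ["cowboy take me away", "god bless the usa"]
counts = Counter(w for title in country_titles for w in title.split())
vocab = set(counts)        # in practice: the vocabulary across all classes

def p_word_given_country(w, laplace=True):
    if laplace:
        # Pr(w|C) = (1 + count(w;C)) / sum_w (1 + count(w;C))
        return (1 + counts[w]) / (sum(counts.values()) + len(vocab))
    # First attempt (unsmoothed): unseen words get probability 0
    return counts[w] / sum(counts.values())

print(p_word_given_country("party"))                  # small but nonzero
print(p_word_given_country("party", laplace=False))   # 0.0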
Other Applications
Filtering
• Advisories
Text classification
• Spam vs. important
• Web hierarchy
• Shakespeare vs. Jefferson
• French vs. English
IR Example
Pr(d|q) = Pr(q|d) Pr(d) / Pr(q)
• Pr(q|d): language model
• Pr(d): prior belief d is relevant (assume equal)
• Pr(q): constant
Can view each document like a category for classification.
Smoothing Matters
p(w|d) =
ps(w|d) if count(w;d)>0 (seen)
p(w|collection) if count(w;d)=0
ps(w|d): estimated from document and smoothed
p(w|collection): estimated from corpus and smoothed
Equivalent effect to TF-IDF.
Exercise
The weather data, with counts:
  outlook    yes  no        temperature  yes  no        humidity  yes  no
  sunny       2    3        hot           2    2        high       3    4
  overcast    4    0        mild          4    2        normal     6    1
  rainy       3    2        cool          3    1

  windy   yes  no        play:  yes = 9, no = 5
  false    6    2
  true     3    3

A new day:
  outlook   temperature   humidity   windy   play
  sunny     cool          high       true    ?
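One way to work the exercise with Naïve Bayes, as a Python sketch (counts are taken from the table above; no smoothing, for simplicity):

counts = {
    "yes": {"outlook=sunny": 2, "temp=cool": 3, "humidity=high": 3, "windy=true": 3, "total": 9},
    "no":  {"outlook=sunny": 3, "temp=cool": 1, "humidity=high": 4, "windy=true": 3, "total": 5},
}
new_day = ["outlook=sunny", "temp=cool", "humidity=high", "windy=true"]

scores = {}
for label, c in counts.items():
    score = c["total"] / 14                 # prior: P(play=yes)=9/14, P(play=no)=5/14
    for f in new_day:
        score *= c[f] / c["total"]          # e.g. P(outlook=sunny | yes) = 2/9
    scores[label] = score

total = sum(scores.values())
print({k: round(v / total, 3) for k, v in scores.items()})   # "no" comes out more likely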
The Central Problem in IR
[Diagram: the information seeker expresses concepts as query terms; authors express concepts as document terms]
Do these represent the same concepts?
Relevance
• Relevance is a subjective judgment and may include:
• Being on the proper subject.
• Being timely (recent information).
• Being authoritative (from a trusted source).
• Satisfying the goals of the user and his/her intended use of the
information (information need).
IR Ranking
• Early IR focused on set-based retrieval
• Boolean queries, set of conditions to be satisfied
• document either matches the query or not
• like classifying the collection into relevant / non-relevant sets
• still used by professional searchers
• “advanced search” in many systems
• Modern IR: ranked retrieval
• free-form query expresses user’s information need
• rank documents by decreasing likelihood of relevance
• many studies prove it is superior
A heuristic formula for IR
• Rank docs by similarity to the query
• suppose the query is “cryogenic labs”
• Similarity = # query words in the doc
• favors documents with both “labs” and “cryogenic”
• mathematically:
  sim(D, Q) = Σ_{q∈Q} 1[q ∈ D]
• Logical variations (set-based)
  • Boolean AND (require all words): AND(D, Q) = Π_{q∈Q} 1[q ∈ D]
  • Boolean OR (any of the words): OR(D, Q) = 1 − Π_{q∈Q} (1 − 1[q ∈ D])
Term Frequency (TF)
• Observation:
  • key words tend to be repeated in a document
• Modify our similarity measure:
  • give more weight if a word occurs multiple times:
    sim(D, Q) = Σ_{q∈Q} tf_D(q)
• Problem:
  • biased towards long documents
  • spurious occurrences
  • normalize by length:
    sim(D, Q) = Σ_{q∈Q} tf_D(q) / |D|
Inverse Document Frequency (IDF)
• Observation:
  • rare words carry more meaning: cryogenic, apollo
  • frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure:
  • give more weight to rare words … but don't be too aggressive (why?)
    sim(D, Q) = Σ_{q∈Q} ( tf_D(q) / |D| ) * log( |C| / df(q) )
  • |C| … total number of documents
  • df(q) … total number of documents that contain q
TF normalization
• Observation:
  • D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
  • which document is more relevant?
  • which one is ranked higher? (df(labs) > df(cryogenic))
• Correction:
  • first occurrence more important than a repeat (why?)
  • "squash" the linearity of TF:  tf(q) / (tf(q) + K)
  [Plot: tf(q)/(tf(q)+K) as a function of tf, flattening out as tf grows]
State-of-the-art Formula
  sim(D, Q) = Σ_{q∈Q} [ tf_D(q) / ( tf_D(q) + K*|D| ) ] * log( |C| / df(q) )
• Repetitions of query words → good
• More query words → good
• Common words → less important
• Very long documents → penalized
Vector-space approach to IR
[Diagram: documents such as "cat cat", "cat cat cat", "cat pig", "pig cat", and "cat cat pig dog dog" plotted as vectors in a space with axes cat, pig, dog; θ is the angle between a document vector and the query vector]
Some formulas for Sim
(ai and bi are the weights of term i in D and Q)
• Dot product:  Sim(D, Q) = Σi (ai * bi)
• Cosine:       Sim(D, Q) = Σi (ai * bi) / ( sqrt(Σi ai²) * sqrt(Σi bi²) )
• Dice:         Sim(D, Q) = 2 Σi (ai * bi) / ( Σi ai² + Σi bi² )
• Jaccard:      Sim(D, Q) = Σi (ai * bi) / ( Σi ai² + Σi bi² − Σi (ai * bi) )
[Diagram: D and Q as vectors over terms t1, t2]
Language-modeling Approach
• query is a random sample from a "perfect" document
• words are "sampled" independently of each other
• rank documents by the probability of generating the query, e.g. for a 9-word document D and a 4-word query:
  P(query | D) = P(w1 | D) P(w2 | D) P(w3 | D) P(w4 | D) = 4/9 * 2/9 * 4/9 * 3/9
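A query-likelihood sketch in Python; the mixture weight lam and the toy texts are mine, and the collection back-off follows the "Smoothing Matters" idea from earlier:

from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    d, c = Counter(doc), Counter(collection)
    score = 1.0
    for w in query:
        p_doc = d[w] / len(doc)                   # p(w|d) estimated from the document
        p_col = c[w] / len(collection)            # p(w|collection) as back-off
        score *= lam * p_doc + (1 - lam) * p_col  # one common way to mix the two
    return score

doc = "cat cat pig dog dog cat pig cat fish".split()
collection = doc + "dog dog bird fish cat horse".split()
print(query_likelihood("cat dog".split(), doc, collection))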
PageRank in Google
[Diagram: pages I1 and I2 link to page A; A links to B]
  PR(A) = (1 − d) + d * Σi PR(Ii) / C(Ii)
• Assign a numeric value to each page
• The more a page is referred to by important pages, the more this page is important
• d: damping factor (0.85)
• C(Ii): number of outgoing links of page Ii
• Many other criteria: e.g. proximity of query words
  • "…information retrieval …" better than "… information … retrieval …"
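A tiny power-iteration sketch of this formula (illustrative Python; the three-page link graph is made up):

def pagerank(links, d=0.85, iters=50):
    # links: page -> list of pages it links to; C(i) = len(links[i])
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        pr = {p: (1 - d) + d * sum(pr[i] / len(links[i])
                                   for i in pages if p in links[i])
              for p in pages}
    return pr

links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(links))   # A scores highest: both B and C point to it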
Problems with Keywords
• May not retrieve relevant documents that include
synonymous terms.
• “restaurant” vs. “café”
• “PRC” vs. “China”
• May retrieve irrelevant documents that include
ambiguous terms.
• “bat” (baseball vs. mammal)
• “Apple” (company vs. fruit)
• “bit” (unit of data vs. act of eating)
Query Expansion
• http://www.lemurproject.org/lemur/IndriQueryLanguage.php
• Most errors caused by vocabulary mismatch
• query: “cars”, document: “automobiles”
• solution: automatically add highly-related words
• Thesaurus / WordNet lookup:
• add semantically-related words (synonyms)
• cannot take context into account:
• “rail car” vs. “race car” vs. “car and cdr”
• Statistical Expansion:
• add statistically-related words (co-occurrence)
• very successful
Document indexing
• Goal = Find the important meanings and create an internal
representation
• Factors to consider:
• Accuracy to represent meanings (semantics)
• Exhaustiveness (cover all the contents)
• Facility for computer to manipulate
• What is the best representation of contents?
• Char. string (char trigrams): not precise enough
• Word: good coverage, not precise
• Phrase: poor coverage, more precise
• Concept: poor coverage, precise
[Chart: moving from string to word to phrase to concept representations trades coverage (recall) against accuracy (precision)]
Indexer steps
• Sequence of (Modified token, Document ID) pairs.
  Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
  [Table of (term, doc #) pairs in document order: (I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), (I, 1), (was, 1), (killed, 1), (i', 1), (the, 1), (capitol, 1), (brutus, 1), (killed, 1), (me, 1), (so, 2), (let, 2), (it, 2), (be, 2), (with, 2), (caesar, 2), (the, 2), (noble, 2), (brutus, 2), (hath, 2), (told, 2), (you, 2), (caesar, 2), (was, 2), (ambitious, 2)]
• The pairs are then sorted by term (and by Document ID within a term).
• Multiple term entries in a single document are merged.
• Frequency information is added.
  Term       Doc #   Term freq
  ambitious  2       1
  be         2       1
  brutus     1       1
  brutus     2       1
  capitol    1       1
  caesar     1       1
  caesar     2       2
  did        1       1
  enact      1       1
  hath       2       1
  I          1       2
  i'         1       1
  it         2       1
  julius     1       1
  killed     1       2
  let        2       1
  me         1       1
  noble      2       1
  so         2       1
  the        1       1
  the        2       1
  told       2       1
  you        2       1
  was        1       1
  was        2       1
  with       2       1
Stopwords / Stoplist
• function words do not bear useful information for IR
of, in, about, with, I, although, …
• Stoplist: contain stopwords, not to be used as index
• Prepositions
• Articles
• Pronouns
• Some adverbs and adjectives
• Some frequent words (e.g. document)
• The removal of stopwords usually improves IR effectiveness
• A few “standard” stoplists are commonly used.
Stemming
• Reason:
• Different word forms may bear similar meaning (e.g. search,
searching): create a “standard” representation for them
• Stemming:
  • Removing some endings of words:
    computer, compute, computes, computing, computed, computation → comput
Lemmatization
• transform to standard form according to syntactic category, e.g.:
  verb + ing → verb
  noun + s → noun
• Need POS tagging
• More accurate than stemming, but needs more resources
• crucial to choose stemming/lemmatization rules: noise vs. recognition rate
• compromise between precision and recall:
  • light/no stemming: −recall, +precision
  • severe stemming: +recall, −precision
What to Learn
IR problem and TF-IDF.
Unigram language models.
Naïve Bayes and simple Bayesian classification.
Need for smoothing.
Another Example on WSD
• Word sense disambiguation is the problem of selecting a
sense for a word from a set of predefined possibilities.
• Sense Inventory usually comes from a dictionary or
thesaurus.
• Knowledge intensive methods, supervised learning, and
(sometimes) bootstrapping approaches
Ambiguity for a Computer
• The fisherman jumped off the bank and into the water.
• The bank down the street was robbed!
• Back in the day, we had an entire bank of computers
devoted to this problem.
• The bank in that road is entirely too steep and is really
dangerous.
• The plane took a bank to the left, and then headed off
towards the mountains.
Early Days of WSD
• Noted as problem for Machine Translation (Weaver, 1949)
• A word can often only be translated if you know the specific
sense intended (A bill in English could be a pico or a cuenta
in Spanish)
• Bar-Hillel (1960) posed the following:
• Little John was looking for his toy box. Finally, he found it.
The box was in the pen. John was very happy.
• Is “pen” a writing instrument or an enclosure where children
play?
• …declared it unsolvable, left the field of MT!
Since then…
• 1970s - 1980s
• Rule based systems
• Rely on hand crafted knowledge sources
• 1990s
• Corpus based approaches
• Dependence on sense tagged text
• (Ide and Veronis, 1998) overview history from early days to
1998.
• 2000s
• Hybrid Systems
• Minimizing or eliminating use of sense tagged text
• Taking advantage of the Web
Machine Readable Dictionaries
• In recent years, most dictionaries made available in
Machine Readable format (MRD)
• Oxford English Dictionary
• Collins
• Longman Dictionary of Ordinary Contemporary English
(LDOCE)
• Thesauruses – add synonymy information
• Roget Thesaurus
• Semantic networks – add more semantic relations
• WordNet
• EuroWordNet
Lesk Algorithm
• (Michael Lesk 1986): Identify senses of words in context using definition overlap
• Algorithm:
  • Retrieve from MRD all sense definitions of the words to be disambiguated
  • Determine the definition overlap for all possible sense combinations
  • Choose senses that lead to highest overlap
Example: disambiguate PINE CONE
• PINE
  1. kinds of evergreen tree with needle-shaped leaves
  2. waste away through sorrow or illness
• CONE
  1. solid body which narrows to a point
  2. something of this shape whether solid or hollow
  3. fruit of certain evergreen trees
  Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
  Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
  Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0
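A simplified Lesk sketch in Python (the glosses are copied from the example; a fuller implementation would drop stopwords before counting overlaps, which is why raw counts here can differ slightly from the slide):

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

def overlap(gloss_a, gloss_b):
    # Number of word types shared by the two sense definitions
    return len(set(gloss_a.lower().split()) & set(gloss_b.lower().split()))

best = max(((p, c) for p in PINE for c in CONE),
           key=lambda pc: overlap(PINE[pc[0]], CONE[pc[1]]))
print(best)   # (1, 3): the evergreen-tree sense of pine with the fruit sense of cone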
Heuristics: One Sense Per Discourse
• A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
• What does this mean?
  E.g. the ambiguous word PLANT occurs 10 times in a discourse: all instances of "plant" carry the same meaning
• Evaluation:
  • 8 words with two-way ambiguity, e.g. plant, crane, etc.
  • 98% of the two-word occurrences in the same discourse carry the same meaning
• The grain of salt: performance depends on granularity
  • (Krovetz 1998) experiments with words with more than two senses
  • Performance of "one sense per discourse" measured on SemCor is approx. 70%
Heuristics: One Sense per Collocation
• A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
  • Strong for adjacent collocations
  • Weaker as the distance between words increases
• An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation "industrial plant", regardless of the context where this collocation occurs
• Evaluation:
  • 97% precision on words with two-way ambiguity
• Finer granularity:
  • (Martinez and Agirre 2000) tested the "one sense per collocation" hypothesis on text annotated with WordNet senses
  • 70% precision on SemCor words
What is Supervised Learning?
• Collect a set of examples that illustrate the various
possible classifications or outcomes of an event.
• Identify patterns in the examples associated with each
particular class of the event.
• Generalize those patterns into rules.
• Apply the rules to classify a new event.
Learn from these examples: "when do I go to the store?"
  Day   CLASS: Go to Store?   F1: Hot Outside?   F2: Slept Well?   F3: Ate Well?
  1     YES                   YES                NO                NO
  2     NO                    YES                NO                YES
  3     YES                   NO                 NO                NO
  4     NO                    NO                 NO                YES
Task Definition: Supervised WSD
• Supervised WSD: Class of methods that induces a classifier from
manually sense-tagged text using machine learning techniques.
• Resources
• Sense Tagged Text
• Dictionary (implicit source of sense inventory)
• Syntactic Analysis (POS tagger, Chunker, Parser, …)
• Scope
• Typically one target word per context
• Part of speech of target word resolved
• Lends itself to “targeted word” formulation
• Reduces WSD to a classification problem where a target word is
assigned the most appropriate sense from a given set of possibilities
based on the context in which it occurs
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I
think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new
ATM card.
The University of Minnesota has an East and a West
Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a
great big catfish!
The bank/2 is pretty muddy, I can’t walk there.
Two Bags of Words (Co-occurrences in
the “window of context”)
FINANCIAL_BANK_BAG:
a an and are ATM Bonnie card charges check Clyde
criminals deposit famous for get I much My new
overdraft really robbers the they think to too two went
were
RIVER_BANK_BAG:
a an and big campus cant catfish East got
grandfather great has his I in is Minnesota Mississippi
muddy My of on planted pole pretty right River The the
there University walk West
Simple Supervised Approach
Given a sentence S containing "bank" (the slide's pseudocode, written as runnable Python):

def disambiguate(sentence_words):
    sense_1 = sense_2 = 0
    for w in sentence_words:
        if w in FINANCIAL_BANK_BAG:   # bag of words for the financial sense
            sense_1 += 1
        if w in RIVER_BANK_BAG:       # bag of words for the river sense
            sense_2 += 1
    if sense_1 > sense_2:
        print("Financial")
    elif sense_2 > sense_1:
        print("River")
    else:
        print("Can't Decide")
Supervised Learning (Cont’)
[Diagram: Training text → instance/feature vectors → Training (Model Estimator) → Model; Test text → feature vectors → Model → Output]
Supervised Methodology for WSD
• Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.
  • One tagged word per instance/lexical sample disambiguation
• Select a set of features with which to represent context.
  • co-occurrences, collocations, POS tags, verb-obj relations, etc...
• Convert sense-tagged training instances to feature vectors.
• Apply a machine learning algorithm to induce a classifier.
  • Form – structure or relation among features
  • Parameters – strength of feature interactions
• Convert a held out sample of test data into feature vectors.
  • "correct" sense tags are known but not used
• Apply classifier to test instances to assign a sense tag.
From Text to Feature Vectors
• My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1)
• The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)
      P-2   P-1   P+1    P+2   fish  check  river  interest  SENSE TAG
  S1  adv   det   prep   det   Y     N      Y      N         SHORE
  S2  –     det   verb   det   N     Y      N      Y         FINANCE
Supervised Learning Algorithms
• Once data is converted to feature vector form, any
supervised learning algorithm can be used. Many have
been applied to WSD with good results:
• Support Vector Machines
• Nearest Neighbor Classifiers
• Decision Trees
• Decision Lists
• Naïve Bayesian Classifiers
• Perceptrons
• Neural Networks
• Graphical Models
• Log Linear Models
Naïve Bayesian Classifier
• Naïve Bayesian Classifier well known in Machine
Learning community for good performance across a range
of tasks (e.g., Domingos and Pazzani, 1997)
• …Word Sense Disambiguation is no exception
• Assumes conditional independence among features,
given the sense of a word.
• The form of the model is assumed, but parameters are
estimated from training instances
• When applied to WSD, features are often “a bag of words”
that come from the training data
• Usually thousands of binary features that indicate if a word is
present in the context of the target word (or not)
Bayesian Inference
  p(S | F1, F2, F3, …, Fn) = p(F1, F2, F3, …, Fn | S) * p(S) / p(F1, F2, F3, …, Fn)
• Given observed features, what is most likely sense?
• Estimate probability of observed features given sense
• Estimate unconditional probability of sense
• Unconditional probability of features is a normalizing term, doesn't affect sense classification
Naïve Bayesian Model
[Graph: the sense S is the parent of features F1, F2, F3, F4, …, Fn]
  P(F1, F2, …, Fn | S) = p(F1 | S) * p(F2 | S) * … * p(Fn | S)
The Naïve Bayesian Classifier
• if V is a vector representing the frequencies of the different words in context, and F1…Fn are elements of this vector, we select the sense s
  = argmax_s p(s | V)
  = argmax_s p(V | s) p(s)
• assuming the probabilities of the F1…Fn are independent:
  sense = argmax_{s ∈ S} p(F1 | s) * … * p(Fn | s) * p(s)
The Naïve Bayesian Classifier
  sense = argmax_{s ∈ S} p(F1 | s) * … * p(Fn | s) * p(s)
• Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense)
and 500 for bank/2 (river sense)
• P(S=1) = 1,500/2000 = .75
• P(S=2) = 500/2,000 = .25
• Given “credit” occurs 200 times with bank/1 and 4 times with
bank/2.
• P(F1=“credit”) = 204/2000 = .102
• P(F1=“credit”|S=1) = 200/1,500 = .133
• P(F1=“credit”|S=2) = 4/500 = .008
• Given a test instance that has one feature “credit”
• P(S=1|F1=“credit”) = .133*.75/.102 = .978
• P(S=2|F1=“credit”) = .008*.25/.102 = .020
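The same bookkeeping as a Python sketch, simply replaying the counts on this slide:

counts = {1: 1500, 2: 500}        # training instances of bank/1 and bank/2
credit = {1: 200, 2: 4}           # co-occurrences of "credit" with each sense
n = sum(counts.values())

p_s = {s: counts[s] / n for s in counts}                        # .75 and .25
p_credit_given_s = {s: credit[s] / counts[s] for s in counts}   # .133 and .008
p_credit = sum(p_credit_given_s[s] * p_s[s] for s in counts)    # ≈ .102

posterior = {s: p_credit_given_s[s] * p_s[s] / p_credit for s in counts}
print({s: round(p, 3) for s, p in posterior.items()})           # {1: 0.98, 2: 0.02}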
Comparative Results
• (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line…
• (Mooney, 1996) compared Naïve Bayes with a Neural Network,
Decision Tree/List Learners, Disjunctive and Conjunctive Normal
Form learners, and a perceptron when disambiguating six senses of
line…
• (Pedersen, 1998) compared Naïve Bayes with Decision Tree, Rule
Based Learner, Probabilistic Model, etc. when disambiguating line
and 12 other words…
• …All found that Naïve Bayesian Classifier performed as well as any
of the other methods!
Supervised WSD with Individual
Classifiers
• Many supervised Machine Learning algorithms have been applied to Word
Sense Disambiguation, most work reasonably well.
• (Witten and Frank, 2000) is a great intro. to supervised learning.
• Features tend to differentiate among methods more than the learning
algorithms.
• Good sets of features tend to include:
• Co-occurrences or keywords (global)
• Collocations (local)
• Bigrams (local and global)
• Part of speech (local)
• Predicate-argument relations
• Verb-object, subject-verb,
• Heads of Noun and Verb Phrases
Convergence of Results
• Accuracy of different systems applied to the same data tends to
converge on a particular value, no one system shockingly better than
another.
• Senseval-1, a number of systems in range of 74-78% accuracy for
English Lexical Sample task.
• Senseval-2, a number of systems in range of 61-64% accuracy for
English Lexical Sample task.
• Senseval-3, a number of systems in range of 70-73% accuracy for
English Lexical Sample task…
• What to do next?
• Minimally supervised WSD (later in semi-supervised learning
lectures) (Hearst, 1991; Yarowsky, 1995)
• Can use bi-lingual parallel corpora (Ng et al., ACL 2003)