I256
Applied Natural Language
Processing
Fall 2009
Lecture 5
• Word Sense Disambiguation (WSD)
• Intro on Probability Theory
• Graphical Models
• Naïve Bayes
• Naïve Bayes for WSD
Barbara Rosario
Word Senses
• Words have multiple distinct meanings, or senses:
– Plant: living plant, manufacturing plant, …
– Title: name of a work, ownership document, form of address,
material at the start of a film, …
• Many levels of sense distinctions
– Homonymy: totally unrelated meanings (river bank, money bank)
– Polysemy: related meanings (star in sky, star on tv, title)
– Systematic polysemy: productive meaning extensions
(metonymy such as organizations to their buildings) or metaphor
– Sense distinctions can be extremely subtle (or not)
• Granularity of senses needed depends a lot on the task
Taken from Dan Klein’s cs 288 slides
Word Sense Disambiguation
• Determine which of the senses of an ambiguous word is invoked in
a particular use of the word
• Example: living plant vs. manufacturing plant
• How do we tell these senses apart?
– “Context”
• The manufacturing plant which had previously sustained the town’s
economy shut down after an extended labor strike.
– Maybe it’s just text categorization
– Each word sense represents a topic
• Why is it important to model and disambiguate word senses?
– Translation
• Bank → banca or riva
– Parsing
• For PP attachment, for example
– Information retrieval
• To return documents with the right sense of bank
Adapted from Dan Klein’s cs 288 slides
Resources
• WordNet
– Hand-built (but large) hierarchy of word senses
– Basically a hierarchical thesaurus
• SensEval
– A WSD competition
– Training / test sets for a wide range of words, difficulties, and
parts-of-speech
– Bake-off where lots of labs tried lots of competing approaches
• SemCor
– A big chunk of the Brown corpus annotated with WordNet senses
• Other Resources
– The Open Mind Word Expert
– Parallel texts
Taken from Dan Klein’s cs 288 slides
Features
• Bag-of-words (use words around with no order)
– The manufacturing plant which had previously
sustained the town’s economy shut down after an
extended labor strike.
– Bags of words = {after, manufacturing, which, labor, ..}
• Bag-of-words classification works ok for noun
senses
– 90% on classic, shockingly easy examples (line,
interest, star)
– 80% on senseval-1 nouns
– 70% on senseval-1 verbs
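As a concrete illustration (not from the slides), here is a minimal Python sketch of the bag-of-words representation; the whitespace tokenization and punctuation stripping are simplifications, and the function name is ours:

```python
# Minimal sketch: bag-of-words features for the context of an ambiguous word.
# All order and structure are thrown away; only word identity is kept.

def bag_of_words(context, target):
    """Lowercase, strip surrounding punctuation, drop the target word itself,
    and return the remaining context words as a set."""
    tokens = [w.strip(".,;:'\"").lower() for w in context.split()]
    return {w for w in tokens if w and w != target}

context = ("The manufacturing plant which had previously sustained the "
           "town's economy shut down after an extended labor strike.")
print(sorted(bag_of_words(context, "plant")))
# ['after', 'an', 'down', 'economy', 'extended', 'had', ...]
```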
Verb WSD
• Why are verbs harder?
– Verbal senses less topical
– More sensitive to structure, argument choice
– Better disambiguated by their argument (subject-object): importance of
local information
– For nouns, a wider context is likely to be useful
• Verb Example: “Serve”
– [function] The tree stump serves as a table
– [enable] The scandal served to increase his popularity
– [dish] We serve meals for the homeless
– [enlist] She served her country
– [jail] He served six years for embezzlement
– [tennis] It was Agassi's turn to serve
– [legal] He was served by the sheriff
• Different types of information may be appropriate for different parts of speech
Adapted from Dan Klein’s cs 288 slides
Better features
• There are smarter features:
– Argument selectional preference:
• serve NP[meals] vs. serve NP[papers] vs. serve NP[country]
• Subcategorization:
– [function] serve PP[as]
– [enable] serve VP[to]
– [tennis] serve <intransitive>
– [food] serve NP {PP[to]}
– Can capture these poorly (but robustly) with local windows… but we can also use a parser and get these features explicitly (a rough feature sketch follows this slide)
• Other constraints (Yarowsky 95)
– One-sense-per-discourse
– One-sense-per-collocation (pretty reliable when it kicks in: manufacturing plant, flowering plant)
Taken from Dan Klein’s cs 288 slides
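A rough sketch of the local-window idea from the previous slide, assuming pre-tokenized input and no parser; the feature names (word_at_-1 and so on) are illustrative, not a standard scheme:

```python
# Rough sketch: words immediately around the target as features.
# These often capture collocations such as "manufacturing plant" or "serve as".

def local_window_features(tokens, target_index, window=2):
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue                      # skip the target word itself
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            feats[f"word_at_{offset:+d}"] = tokens[pos].lower()
    return feats

tokens = "It was Agassi's turn to serve".split()
print(local_window_features(tokens, tokens.index("serve")))
# {'word_at_-2': 'turn', 'word_at_-1': 'to'}
```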
Various Approaches to WSD
• Unsupervised learning
– We don’t know/have the labels
– Rather than disambiguation, the task is discrimination
• Cluster into groups and discriminate between these groups
without giving labels
• Clustering
– Example: EM (expectation-maximization),
Bootstrapping (seeded with some labeled data)
• Indirect supervision (see Section 7.3 of the Stat NLP book)
– From thesauri
– From WordNet
– From parallel corpora
• Supervised learning
Adapted from Dan Klein’s cs 288 slides
Supervised learning
• Supervised learning
– When we know the truth (the true senses), which is not always available or easy to obtain
– Classification task
– Most systems do some kind of supervised learning
– Many competing classification technologies perform
about the same (it’s all about the knowledge sources
you tap)
– Problem: training data available for only a few words
– Examples: Bayesian classification
• Naïve Bayes (simplest example of Graphical models)
– (We’ll talk more about supervised
learning/classification during the course)
Adapted from Dan Klein’s cs 288 slides
Today
• Introduction to probability theory
• Introduction to graphical models
– Probability theory plus graph theory
• Naïve Bayes (simple graphical model)
– Naïve Bayes for WSD (classification task)
Why Probability?
• Statistical NLP aims to do statistical
inference for the field of NLP
• Statistical inference consists of taking
some data (generated in accordance with
some unknown probability distribution) and
then making some inference about this
distribution.
Why Probability?
• Examples of statistical inference are WSD, the task of language modeling (e.g., how to predict the next word given the previous words), topic classification, etc.
• In order to do this, we need a model of the language.
• Probability theory helps us find such a model
Probability Theory
• How likely it is that something will happen
• Sample space Ω is the set of all possible outcomes of an experiment
– Sample space can be continuous or discrete
– For language applications it's discrete (e.g., words)
• Event A is a subset of Ω
• Probability function (or distribution)
P : Ω → [0, 1]
[Slides 14-17: material from http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf]
Prior Probability
• Prior probability: the probability before we
consider any additional knowledge
P(A)
Conditional probability
• Sometimes we have partial knowledge
about the outcome of an experiment
• Conditional (or Posterior) Probability
• Suppose we know that event B is true
• The probability that A is true given the
knowledge about B is expressed by
P(A | B)
P(A | B) = P(A, B) / P(B)
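A tiny numeric illustration of this definition, using made-up document counts (hypothetical data, only to show how the ratio works):

```python
# P(A | B) = P(A, B) / P(B), illustrated with invented counts.
# A = "document mentions 'bank'", B = "document is about finance".
n_total = 1000            # documents in a hypothetical corpus
n_B = 200                 # documents about finance
n_A_and_B = 150           # finance documents that also mention "bank"

p_B = n_B / n_total               # P(B)     = 0.20
p_A_and_B = n_A_and_B / n_total   # P(A, B)  = 0.15
p_A_given_B = p_A_and_B / p_B     # P(A | B) = 0.75
print(p_A_given_B)
```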
[Slide 20: material from http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf]
Conditional probability (cont)
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Note: P(A, B) = P(A ∩ B)
• Chain Rule
• P(A, B) = P(A|B) P(B) = the probability that A and B both happen is the probability that B happens times the probability that A happens, given B has occurred.
• P(A, B) = P(B|A) P(A) = the probability that A and B both happen is the probability that A happens times the probability that B happens, given A has occurred.
• Multi-dimensional table with a value in every cell giving the probability of that specific state occurring
Chain Rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A)
P(A, B, C, D, …) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) …
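A quick numeric check of the two-variable chain rule on a toy joint distribution (the numbers are invented):

```python
# Chain rule check: P(A, B) should equal P(A|B) P(B) and P(B|A) P(A).
joint = {  # P(A=a, B=b) for binary A and B
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}
p_A1 = joint[(1, 0)] + joint[(1, 1)]    # P(A=1) = 0.5
p_B1 = joint[(0, 1)] + joint[(1, 1)]    # P(B=1) = 0.6
p_A1_given_B1 = joint[(1, 1)] / p_B1    # P(A=1 | B=1)
p_B1_given_A1 = joint[(1, 1)] / p_A1    # P(B=1 | A=1)
print(p_A1_given_B1 * p_B1, p_B1_given_A1 * p_A1, joint[(1, 1)])  # all ≈ 0.4
```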
Chain Rule → Bayes' rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)
Bayes' rule
Useful when one quantity is easier to calculate than the other; a trivial consequence of the definitions we saw, but extremely useful
Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)
Bayes' rule translates causal knowledge into diagnostic knowledge.
For example, if A is the event that a patient has a disease, and B is the
event that she displays a symptom, then P(B | A) describes a causal
relationship, and P(A | B) describes a diagnostic one (that is usually
hard to assess).
If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B)
for free.
Example
• S:stiff neck, M: meningitis
• P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
• I have stiff neck, should I worry?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
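The same computation written out in Python:

```python
# Stiff neck (S) and meningitis (M): Bayes' rule with the numbers above.
p_S_given_M = 0.5          # P(S | M)
p_M = 1 / 50_000           # P(M)
p_S = 1 / 20               # P(S)

p_M_given_S = p_S_given_M * p_M / p_S
print(p_M_given_S)         # 0.0002 -- a stiff neck alone is weak evidence
```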
(Conditional) independence
• Two events A and B are independent of each
other if
P(A) = P(A|B)
• Two events A and B are conditionally
independent of each other given C if
P(A|C) = P(A|B,C)
Back to language
• Statistical NLP aims to do statistical inference for
the field of NLP
– Topic classification
• P( topic | document )
– Language models
• P (word | previous word(s) )
– WSD
• P( sense | word)
• Two main problems
– Estimation: P is unknown; estimate P
– Inference: we estimated P; now we want to find (infer) the topic of a document, or the sense of a word
Language Models (Estimation)
• In general, for language events, P is
unknown
• We need to estimate P (or a model M of the language)
• We'll do this by looking at evidence about what P must be, based on a sample of data
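The simplest such estimate is relative frequency (maximum likelihood): count and normalize. A minimal sketch on a made-up sample:

```python
# Relative-frequency (maximum likelihood) estimate of P from a toy sample.
from collections import Counter

sample = "the plant shut down the plant reopened the town".split()  # made-up data
counts = Counter(sample)
total = sum(counts.values())
p_hat = {w: c / total for w, c in counts.items()}
print(p_hat["the"], p_hat["plant"])   # 3/9 and 2/9
```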
Estimation of P
• Frequentist statistics
– Parametric
– Non-parametric (distribution free)
• Bayesian statistics
– Bayesian statistics measures degrees of belief
– Degrees are calculated by starting with prior beliefs
and updating them in face of the evidence, using
Bayes theorem
• 2 different approaches, 2 different philosophies
Inference
• The central problem of computational Probability
Theory is the inference problem:
• Given a set of random variables X1, … , Xk and
their joint density P(X1, … , Xk), compute one or
more conditional densities given observations.
– Compute
• P(X1 | X2, …, Xk)
• P(X3 | X1)
• P(X1, X2 | X3, X4)
• Etc.
• Many problems can be formulated in these
terms.
Bayes decision rule
• w: ambiguous word
• S = {s1, s2, …, sn } senses for w
• C = {c1, c2, …, cn } context of w in a corpus
• V = {v1, v2, …, vj } words used as contextual features for disambiguation
• Bayes decision rule
– Decide sj if P(sj | c) > P(sk | c) for all sk ≠ sj
• We want to assign w to the sense s’ where
s’ = argmaxsk P(sk | c)
31
Bayes classification for WSD
• We want to assign w to the sense s’ where
s’ = argmaxsk P(sk | c)
• We usually do not know P(sk | c) but we can
compute it using Bayes rule
P(sk|c) = P(c, sk) / P(c) = P(c|sk) P(sk) / P(c)

s' = argmax_sk P(sk|c) = argmax_sk P(c|sk) P(sk) / P(c) = argmax_sk P(c|sk) P(sk)
Naïve Bayes classifier
s' arg max P(c|sk)P(sk)
sk
• Naïve Bayes classifier widely used in
machine learning
• Estimate P(c | sk) and P(sk)
Naïve Bayes classifier
s' arg max P(c|sk)P(sk)
sk
• Estimate P(c | sk) and P(sk)
• w: ambiguous word
• S = {s1, s2, …, sn } senses for w
• C = {c1, c2, …, cn } context of w in a corpus
• V = {v1, v2, …, vj } words used as contextual features for
disambiguation
• Naïve Bayes assumption:
P(c|sk) = P({vj | vj in c} | sk) ≈ ∏_{vj in c} P(vj | sk)
Naïve Bayes classifier
P(c|sk) = P({vj | vj in c} | sk) ≈ ∏_{vj in c} P(vj | sk)
• Naïve Bayes assumption:
– Two consequences
– All the structure and linear ordering of words within the context is ignored → bag-of-words model
– The presence of one word in the model is
independent of the others
• Not true but model “easier” and very “efficient”
• “easier” “efficient” mean something specific in the
probabilistic framework
– We’ll see later (but easier to estimate parameters and more
efficient inference)
– Naïve Bayes assumption is inappropriate if there are strong dependencies, but often it does very well (partly because the decision may be optimal even if the assumption is not correct)
Naïve Bayes for WSD
s' arg max P(c|sk)P(sk)
Bayes decision rule
sk
P(c|sk)  P( {vj|vj in c} | sk)   P(vj | sk)
Naïve Bayes assumption
vj in c
s'  arg max  P(vj | sk)P(sk)
sk
vj in c
P(vj | sk) 
C (vj , sk)
C ( sk)
Count of vj when sk
Estimation
C ( sk)
P(sk) 
C ( w)
Prior probability of sk
36
Naïve Bayes Algorithm for WSD
• TRAINING (aka Estimation)
• For all senses sk of w do
– For all words vj in the vocabulary calculate
P(vj | sk) = C(vj, sk) / C(sk)
– end
• end
• For all senses sk of w do
P(sk) = C(sk) / C(w)
• end
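A minimal Python sketch of this training step on hypothetical toy data. Two assumptions not in the slides: C(sk) is read as "context words seen with sense sk" in the likelihood and as "occurrences of w with sense sk" in the prior, and no smoothing is applied (a real system would smooth the counts):

```python
# TRAINING sketch: collect C(vj, sk), C(sk), C(w) and normalize. Toy data only.
from collections import Counter, defaultdict

labeled_contexts = [  # (sense of "plant", context words) -- made-up examples
    ("living",        ["the", "flowering", "plant", "grew", "in", "the", "garden"]),
    ("manufacturing", ["the", "plant", "shut", "down", "after", "the", "strike"]),
    ("living",        ["water", "the", "plant", "every", "day"]),
]

cooc = defaultdict(Counter)     # C(vj, sk): occurrences of vj with sense sk
words_per_sense = Counter()     # context words seen with sense sk
contexts_per_sense = Counter()  # occurrences of w labeled with sense sk

for sense, context in labeled_contexts:
    contexts_per_sense[sense] += 1
    words_per_sense[sense] += len(context)
    cooc[sense].update(context)

n_occurrences = sum(contexts_per_sense.values())                          # C(w)
p_sense = {s: c / n_occurrences for s, c in contexts_per_sense.items()}   # P(sk)
p_word_given_sense = {s: {v: c / words_per_sense[s] for v, c in cnt.items()}
                      for s, cnt in cooc.items()}                         # P(vj | sk)
```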
Naïve Bayes Algorithm for WSD
• TESTING (aka Inference or Disambiguation)
• For all senses sk of w do
– For all words vj in the context window c calculate
score(sk) = P(sk|c) ∝ P(c|sk) P(sk)
– end
• end
• Choose s' = argmax_sk score(sk)
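A matching sketch of the disambiguation step, reusing cooc, words_per_sense, and p_sense from the training sketch above. Scores are computed in log space to avoid underflow, and add-one smoothing (an addition, not in the slides) keeps context words unseen with a sense from zeroing it out:

```python
# TESTING sketch: pick the sense maximizing log P(sk) + sum_j log P(vj | sk).
import math

vocab = {v for counts in cooc.values() for v in counts}   # words seen in training

def disambiguate(context, p_sense, cooc, words_per_sense, vocab):
    best_sense, best_score = None, float("-inf")
    for sense, prior in p_sense.items():
        score = math.log(prior)                            # log P(sk)
        for v in context:
            # add-one smoothed estimate of P(vj | sk)
            p = (cooc[sense][v] + 1) / (words_per_sense[sense] + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(disambiguate(["the", "strike", "shut", "down", "the", "plant"],
                   p_sense, cooc, words_per_sense, vocab))  # 'manufacturing'
```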
Next week
• Introduction to Graphical Models
• Part of speech tagging
• Readings:
– Chapter 5 of the NLTK book
– Chapter 10 of Foundations of Statistical NLP