KEY CONCEPTS IN PROBABILITY:
SMOOTHING, MLE, AND MAP
Outline
• MAPs and MLEs
– catchup from last week
• Joint Distributions
– a new learner
• Naïve Bayes
– another new learner
Administrivia
• Homeworks:
– Due tomorrow
– Hardcopy and Autolab submission (see wiki)
• Texts
– Mitchell or Murphy are optional
• this week – an update from Tom Mitchell’s long-expected new edition
– Bishop is also excellent if you prefer
• but a little harder to skip around in
– pick one or the other (both is overkill)
– main differences are not content but notation: for
instance…
Some practical problems
I bought a loaded d20 on eBay… but it didn’t come with any useful specs. How can I find out how it behaves?
[Histogram: Frequency (0–6) of each Face Shown (1–20) over the observed rolls]
1. Collect some data (20 rolls)
2. Estimate Pr(i)=C(rolls of i)/C(any roll)
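A minimal sketch of steps 1–2 in Python; the list of rolls is invented for illustration:

```python
from collections import Counter

# 1. Collect some data: 20 hypothetical rolls of the loaded d20
rolls = [20, 3, 20, 17, 20, 1, 20, 11, 20, 6,
         20, 20, 14, 20, 2, 20, 9, 20, 20, 5]

counts = Counter(rolls)          # C(rolls of i)
total = len(rolls)               # C(any roll)

# 2. Estimate Pr(i) = C(rolls of i) / C(any roll)
mle = {i: counts.get(i, 0) / total for i in range(1, 21)}
print(mle[20], mle[13])          # 0.55 and 0.0 -- unseen faces get probability zero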
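```

Any face that never shows up in the 20 rolls is estimated to have probability zero, which motivates the smoothed estimator below.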
A better solution
I bought a loaded d20 on eBay… but it didn’t come with any specs. How can I find out how it behaves?
[Histogram: Frequency (0–6) of each Face Shown (1–20)]
0. Imagine some data (20 rolls, each i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate Pr(i)=C(rolls of i)/C(any roll)
A better solution?
Q: What if I used m rolls with a
probability of q=1/20 of rolling any i?
C (i)  1
P̂r(i) 
C ( ANY )  C ( IMAGINED)
C (i )  mq
P̂r(i ) 
C ( ANY )  m
I can use this formula with m>20, or
even with m<20 … say with m=1
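The same estimator as a small Python sketch; m and q are the knobs from the formula above, and the rolls are again invented:

```python
from collections import Counter

def smoothed_estimate(rolls, m=1, q=1/20, faces=range(1, 21)):
    """Smoothed (MAP-style) estimate: Pr(i) = (C(i) + m*q) / (C(ANY) + m)."""
    counts = Counter(rolls)
    total = len(rolls)
    return {i: (counts.get(i, 0) + m * q) / (total + m) for i in faces}

# m = 20 with q = 1/20 is exactly "one imaginary roll of each face";
# m = 1 adds only a twentieth of an imaginary roll per face.
est = smoothed_estimate([20, 20, 3, 7, 20], m=20)
```

Because the mq pseudo-counts sum to m, the estimates still sum to 1, and no face ever gets probability zero.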
Terminology – more later
This is called a uniform Dirichlet prior.
C(i), C(ANY) are sufficient statistics.

P̂r(i) = (C(i) + mq) / (C(ANY) + m)

MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate
(Tom’s notes are different.)
Some differences….
William: Estimate each probability Pr(i) associated with a multinomial, with the MLE as:

P̂r(i) = C(i) / C(ANY)

for C(i) = count of times you saw i, and estimate the ith MAP as:

P̂r(i) = (C(i) + mq) / (C(ANY) + m)

Tom: estimate Θ = P(heads) for a binomial, with the MLE as:

Θ = #heads / (#heads + #tails)

and with the MAP as:

Θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)
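To make the correspondence concrete, here is a small sketch of the binomial estimators in Python; the example counts are invented:

```python
def mle_binomial(heads, tails):
    # MLE: theta = #heads / (#heads + #tails)
    return heads / (heads + tails)

def map_binomial(heads, tails, imaginary_heads, imaginary_tails):
    # MAP: add the imaginary (pseudo-data) counts before dividing
    return (heads + imaginary_heads) / (heads + tails + imaginary_heads + imaginary_tails)

# With 7 real heads, 3 real tails, and 5 imaginary of each:
print(mle_binomial(7, 3))          # 0.7
print(map_binomial(7, 3, 5, 5))    # 0.6 -- pulled toward the prior's 0.5
```

This is the two-outcome special case of the multinomial formula above, with m = #imaginary heads + #imaginary tails and q = #imaginary heads / m.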
Some apparent differences….

William: P̂r(i) = (C(i) + mq) / (C(ANY) + m)

Tom: MLE Θ = #heads / (#heads + #tails), and MAP Θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)

The two notations line up:
• C(i) = α1
• C(ANY) = α0 + α1
• m = (γ0 + γ1): the confidence in the prior; a larger m emphasizes the pseudo-data
• q = γ1 / (γ0 + γ1): emphasizes the prior

[Plots: imagined m=60 samples with q = 0.3; m=60 samples with q = 0.4; m=120 samples with q = 0.3; m=120 samples with q = 0.4]
Why we call this a MAP
• Simple case: replace the die with a coin
– Now there’s one parameter: q=P(H)
– I start with a prior over q, P(q)
– I get some data: D={D1=H, D2=T, ….}
– I compute the maximum of the posterior of q

MAP estimate: argmax_q P(q | D) = argmax_q [ P(D | q) P(q) / P(D) ] = argmax_q P(D | q) P(q)

MLE estimate: argmax_q P(D | q)
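A quick numerical sketch of the difference, maximizing the likelihood and the (unnormalized) posterior over a grid of q values; the coin data and the prior shape are invented for illustration:

```python
import numpy as np

heads, tails = 7, 3                          # hypothetical coin flips
q = np.linspace(0.001, 0.999, 999)           # grid of candidate parameter values

likelihood = q**heads * (1 - q)**tails       # P(D | q)
prior = q**4 * (1 - q)**4                    # a prior peaked at q = 0.5

mle = q[np.argmax(likelihood)]               # argmax_q P(D | q)
map_ = q[np.argmax(likelihood * prior)]      # argmax_q P(D | q) P(q); P(D) doesn't change the argmax

print(round(mle, 2), round(map_, 2))         # ~0.70 (MLE) vs ~0.61 (MAP, pulled toward 0.5)
```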
Why we call this a MAP
• Simple case: replace the die with a coin
– Now there’s one parameter: q=P(H)
– I start with a prior over q, P(q)
– I get some data: D={D1=H, D2=T, ….}
– I compute the posterior of q
• The math works if the pdf of the prior P(q) is P(x) ∝ x^α (1−x)^β
• α+1, β+1 are counts of imaginary pos/neg examples
Why we call this a MAP
• The math works if the pdf P(x) ∝ x^α (1−x)^β
[Plot: the resulting densities over q ∈ [0, 1] for several parameter settings]
Why we call this a MAP
• This is called a beta distribution
• The generalization to multinomials is called a
Dirichlet distribution
• Parameters are α1, …, αK
• f(x1,…,xK) ∝ x1^α1 · … · xK^αK, with αk+1 imaginary examples of outcome k (as in the beta case)
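A small sketch of both densities in plain Python, using the standard beta/Dirichlet parameterization (so the exponents are a−1 rather than the slide's counting convention):

```python
import numpy as np
from math import gamma

def beta_pdf(x, a, b):
    """Beta(a, b) density: proportional to x^(a-1) * (1-x)^(b-1)."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x**(a - 1) * (1 - x)**(b - 1)

def dirichlet_pdf(xs, alphas):
    """Dirichlet density: the multinomial generalization, proportional
    to the product of x_k^(alpha_k - 1) over the K outcomes."""
    norm = gamma(sum(alphas)) / np.prod([gamma(a) for a in alphas])
    return norm * np.prod([x**(a - 1) for x, a in zip(xs, alphas)])

# The mode of Beta(a, b) is (a-1)/(a+b-2); observing real heads and tails
# adds to a and b, and the mode of that posterior is the MAP estimate of q.
print(beta_pdf(0.5, 5, 5))
print(dirichlet_pdf([0.2, 0.3, 0.5], [2.0, 3.0, 5.0]))
```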
KEY CONCEPTS IN PROBABILITY:
THE JOINT DISTRIBUTION
Some practical problems
• I have 1 standard “fair” d6 die and 2 loaded d6 dice: one loaded high, one loaded low.
• Loaded high: P(X=6)=0.50   Loaded low: P(X=1)=0.50
• Experiment: pick two of the three d6 uniformly at random (call the pair A) and roll them. What is more likely – rolling a seven or rolling doubles?

Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D | A=HL)*P(A=HL) + P(D | A=HF)*P(A=HF) + P(D | A=FL)*P(A=FL)
A brute-force solution

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk:

A    Roll 1   Roll 2   P                  Comment
FL   1        1        1/3 * 1/6 * 1/2    doubles
FL   1        2        1/3 * 1/6 * 1/10
FL   …        …        …
FL   1        6        …                  seven
FL   2        1        …
…    …        …        …
FL   6        6        …                  doubles
HL   1        1        …                  doubles
HL   1        2        …
…    …        …        …
HF   1        1        …                  doubles

With this you can compute any P(E) where E is any boolean combination of the primitive events (Xi=xi), e.g.
• P(doubles)
• P(seven or eleven)
• P(total is higher than 5)
• …
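A brute-force enumeration of this joint in Python, which answers the question directly. One assumption not spelled out on the slide: the non-loaded faces of each loaded die are taken to share the remaining probability equally (1/10 each), which matches the 1/10 factor in the table above.

```python
from itertools import product
from fractions import Fraction as F

# Per-face distributions for the three d6 (assumption: non-loaded faces get 1/10 each)
fair = {i: F(1, 6) for i in range(1, 7)}
high = {i: (F(1, 2) if i == 6 else F(1, 10)) for i in range(1, 7)}
low  = {i: (F(1, 2) if i == 1 else F(1, 10)) for i in range(1, 7)}

# A = which pair of dice was picked, each with probability 1/3
pairs = {'HL': (high, low), 'HF': (high, fair), 'FL': (fair, low)}

p_doubles = p_seven = F(0)
for die1, die2 in pairs.values():
    for r1, r2 in product(range(1, 7), repeat=2):
        p = F(1, 3) * die1[r1] * die2[r2]      # P(A) * P(Roll1 | A) * P(Roll2 | A)
        p_doubles += p if r1 == r2 else 0
        p_seven += p if r1 + r2 == 7 else 0

print(p_doubles, p_seven)   # 71/450 ≈ 0.158 vs 19/90 ≈ 0.211 -- seven is more likely
```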
The Joint Distribution
Example: Boolean variables
A, B, C
Recipe for making a joint distribution of M
variables:
The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

A  B  C
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1
1  1  0
1  1  1
The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
Estimating The Joint Distribution
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, estimate how probable it is from data.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
Pros and Cons of the Joint
Distribution
• You can do a lot with it!
– Answer any query Pr(Y1,Y2,..|X1,X2,…)
• It takes up a lot of room!
• It takes a lot of data to train!
• It can be expensive to use
– The big question: how do you simplify
(approximate, compactly store,…) the joint
and still be able to answer interesting
queries?
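A sketch of how the joint table above answers arbitrary queries, using the A, B, C numbers from the slides; the event functions are just illustrative lambdas:

```python
# The A, B, C joint distribution from the slides
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(event): sum the rows of the joint where the event holds."""
    return sum(p for row, p in joint.items() if event(*row))

def cond(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

print(prob(lambda a, b, c: a or b))                          # P(A or B)  ≈ 0.65
print(cond(lambda a, b, c: c == 1, lambda a, b, c: a == 1))  # P(C=1|A=1) ≈ 0.40
```

This is why the joint is so powerful, and also why it is expensive: the table has 2^M rows, and a query may touch a large fraction of them.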
Density Estimation
• Our Joint Distribution learner is our first
example of something called Density
Estimation
• A Density Estimator learns a mapping from a set of attribute values to a Probability

Input Attributes → Density Estimator → Probability

Copyright © Andrew W. Moore
Density Estimation – looking ahead
• Compare it to two other major kinds of
models:
Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)

Input Attributes → Density Estimator → Probability

Input Attributes → Regressor → Prediction of real-valued output
Copyright © Andrew W. Moore
Another example
• Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a
corpus of 1M English books
• 30Gb compressed, 250-300Gb uncompressed
• Each 5-gram contains a frequency distribution over years (which I ignored)
– Pulled out counts for all 5-grams
(A,B,C,D,E) where C=affect or C=effect
and turned this into a joint probability
table
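A toy sketch of that last step, turning raw 5-gram counts into a joint probability table; the counts dictionary below is a tiny invented stand-in for the multi-gigabyte Google Books files:

```python
def build_joint(ngram_counts):
    """Keep 5-grams whose middle word is a form of 'affect'/'effect' and
    normalize their counts into a joint probability table P(A,B,C,D,E)."""
    kept = {ng: c for ng, c in ngram_counts.items()
            if ng[2].lower() in ('affect', 'effect')}
    total = sum(kept.values())
    return {ng: c / total for ng, c in kept.items()}

# Hypothetical counts standing in for the Google 5-gram data
counts = {('is', 'the', 'effect', 'of', 'the'): 36,
          ('does', 'not', 'affect', 'the', 'general'): 20,
          ('jumped', 'over', 'the', 'lazy', 'dog'): 99}   # filtered out
joint = build_joint(counts)
```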
Some of the Joint Distribution

A      B        C       D    E          Prob
is     the      effect  of   the        0.00036
is     the      effect  of   a          0.00034
.      The      effect  of   this       0.00034
to     this     effect  :    “          0.00034
be     the      effect  of   the        …
…      …        …       …    …          …
…      the      effect  of   any        0.00024
…      …        …       …    …          …
does   not      affect  the  general    0.00020
does   not      affect  the  question   0.00020
any    manner   affect  the  principle  0.00018
…      …        …       …    …          …

… about 50k more rows... that summarize 90M 5-gram instances in text
Example queries
Pr(C) ?

c          Pr(C=c)
C=effect   0.94628
C=affect   0.04725
C=Effect   0.00575
C=EFFECT   0.00067
C=effecT   …
Example queries
Pr(B|C=affect) ?

b         Pr(B=b|C=affect)
B=not     0.61357
B=to      0.11483
B=may     0.03267
B=they    0.02738
B=which   …
Example queries
Pr(C|B=not,D=the) ?

c          Pr(C=c|B=not,D=the)
C=affect   0.99644
C=effect   0.00356
Density Estimation As a Classifier

Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)

Input Attributes → Density Estimator → Probability P(X1=x1,…,Xn=xn)

Input Attributes + Class Y → Density Estimator → Probability P(Y=y1|X1=x1,…,Xn=xn) … P(Y=yk|X1=x1,…,Xn=xn)

Predict: f(X1=x1,…,Xn=xn) = argmax_yi P(Y=yi|X1=x1,…,Xn=xn)
Copyright © Andrew W. Moore
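A minimal sketch of that prediction rule; `density` is a hypothetical function returning the estimated joint P(Y=y, X=x):

```python
def classify(x, class_values, density):
    """Predict the class y maximizing P(Y=y | X=x).  Since P(X=x) is the
    same for every candidate y, maximizing the joint P(Y=y, X=x) returned
    by `density` maximizes the conditional as well."""
    return max(class_values, key=lambda y: density(y, x))
```

Whatever density estimator you plug in (the brute-force joint here, or Naïve Bayes later), the prediction rule stays the same.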
An experiment: how useful is the
brute-force joint classifier?
• Test set: extracted all uses of affect or effect in a 20k document newswire corpus:
– about 723 n-grams, 661 distinct
• Tried to predict the center word C with:
– argmax_c Pr(C=c|A=a,B=b,D=d,E=e)
using the joint estimated from the Google n-gram data
Poll time…
• https://piazza.com/class/ij382zqa2572hc
Example queries
How many errors would I expect in 100 trials if my classifier always just guesses the most frequent class?
https://piazza.com/class/ij382zqa2572hc

c          Pr(C=c)
C=effect   0.94628
C=affect   0.04725
C=Effect   0.00575
C=EFFECT   0.00067
C=effecT   …
Performance summary

Pattern        Used   Errors
P(C|A,B,D,E)   101    1

But: no counts at all for (a,b,d,e) for 622 of the 723 instances!
Slightly fancier idea….
• Tried to predict the center word by backing off to shorter contexts (see the sketch after the examples below):
– Pr(C|A=a,B=b,D=d,E=e)
– then P(C|A,B,D) if there’s no data for that
– then P(C|B,D) if there’s no data for that
– then P(C|B) …
– then P(C)

EXAMPLES
– “The cumulative _ of the” → effect (1.0)
– “Go into _ on January” → effect (1.0)
– “From cumulative _ of accounting” not present in train data
• Nor is “From cumulative _ of _”
• But “_ cumulative _ of _” → effect (1.0)
– “Would not _ Finance Minister” not present
• But “_ not _ _ _” → affect (0.9625)
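A minimal sketch of this back-off scheme; `cond_prob` is a hypothetical helper that returns {c: P(C=c | context)} estimated from the training counts, or None when the context never occurred:

```python
def backoff_predict(a, b, d, e, cond_prob):
    """Try the longest context first, then back off until one has data."""
    for context in [(a, b, d, e), (a, b, d), (b, d), (b,), ()]:
        dist = cond_prob(context)            # None if no training counts for this context
        if dist:
            return max(dist, key=dist.get)   # most probable center word under that context
```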
Performance summary

Pattern        Used   Errors
P(C|A,B,D,E)   101    1
P(C|A,B,D)     157    6
P(C|B,D)       163    13
P(C|B)         244    78
P(C)           58     31
Total          723    129

(Error-rate annotations on the slide: 3% error, 5% error, 15% error.)