Statistical learning
Chapter 20, Sections 1–4
Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
– ML parameter learning with complete data
– linear regression
♦ Expectation-Maximization (EM) algorithm
♦ Instance-based learning
Full Bayesian learning
View learning as Bayesian updating of a probability distribution
over the hypothesis space
H is the hypothesis variable, values h1, h2, . . ., prior P(H)
jth observation dj gives the outcome of random variable Dj
training data d = d1, . . . , dN
Given the data so far, each hypothesis has a posterior probability:
P (hi|d) = αP (d|hi)P (hi)
where P (d|hi) is called the likelihood
Predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = Σi P(X|d, hi)P (hi|d) = Σi P(X|hi)P (hi|d)
No need to pick one best-guess hypothesis!
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?
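A minimal Python sketch of this Bayesian updating for the five bag hypotheses above; it assumes, purely for illustration, that every observed candy turns out to be lime.

# Minimal sketch of full Bayesian learning for the candy example.
# Assumption (not stated on the slide): the observed candies are all lime.

priors = {"h1": 0.10, "h2": 0.20, "h3": 0.40, "h4": 0.20, "h5": 0.10}
p_lime = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}

def posterior(observations):
    """P(hi|d) = alpha * P(d|hi) * P(hi) for i.i.d. observations."""
    unnorm = {}
    for h, prior in priors.items():
        likelihood = 1.0
        for obs in observations:
            likelihood *= p_lime[h] if obs == "lime" else 1.0 - p_lime[h]
        unnorm[h] = likelihood * prior
    alpha = 1.0 / sum(unnorm.values())
    return {h: alpha * v for h, v in unnorm.items()}

def predict_lime(observations):
    """P(next is lime|d) = sum_i P(lime|hi) * P(hi|d)."""
    post = posterior(observations)
    return sum(p_lime[h] * post[h] for h in priors)

# With lime-only data the posterior concentrates on h5 and
# P(next candy is lime | d) rises from 0.5 toward 1.
for n in range(11):
    print(n, round(predict_lime(["lime"] * n), 3))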
Posterior probability of hypotheses
[Figure: posterior probabilities P(h1 | d), . . . , P(h5 | d) (0 to 1) plotted against the number of samples in d (0 to 10)]
Prediction probability
[Figure: P(next candy is lime | d) plotted against the number of samples in d (0 to 10)]
MAP approximation
Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)
I.e., maximize P (d|hi)P (hi) or log P (d|hi) + log P (hi)
Log terms can be viewed as (negative of)
bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning
For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
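A short sketch of MAP selection for the same candy hypotheses, choosing the hypothesis that maximizes log P(d|hi) + log P(hi); the all-lime data is again an illustrative assumption.

import math

priors = {"h1": 0.10, "h2": 0.20, "h3": 0.40, "h4": 0.20, "h5": 0.10}
p_lime = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}

def log_posterior(h, observations):
    """log P(d|h) + log P(h), up to the normalizing constant."""
    ll = 0.0
    for obs in observations:
        p = p_lime[h] if obs == "lime" else 1.0 - p_lime[h]
        if p == 0.0:
            return float("-inf")   # hypothesis inconsistent with the data
        ll += math.log(p)
    return ll + math.log(priors[h])

d = ["lime"] * 3
h_map = max(priors, key=lambda h: log_posterior(h, d))
print(h_map)   # h5 once enough limes have been observed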
ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi)
I.e., simply get the best fit to the data; identical to MAP for uniform prior
(which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method
ML parameter learning in Bayes nets
Bag from a new manufacturer; fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
[Figure: one-node Bayes net with node Flavor and parameter P(F = cherry) = θ]
Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so
P(d|hθ) = Π_{j=1}^{N} P(dj|hθ) = θ^c · (1 − θ)^ℓ
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(d|hθ) = log P(d|hθ) = Σ_{j=1}^{N} log P(dj|hθ) = c log θ + ℓ log(1 − θ)
dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N
Seems sensible, but causes problems with 0 counts!
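A tiny sketch of this closed-form estimate; the add-one smoothing at the end is one common remedy for the zero-count problem, not something prescribed by the slide.

# Minimal sketch: ML estimate of theta from observed counts.
def ml_theta(c, l):
    """theta = c / (c + l) = c / N."""
    return c / (c + l)

print(ml_theta(3, 7))   # 0.3

# The zero-count problem: with c = 0 the estimate is theta = 0, i.e. cherry
# candies are judged impossible after a few lime-only draws. A common remedy
# (an aside, not from the slide) is add-one (Laplace) smoothing:
def smoothed_theta(c, l):
    return (c + 1) / (c + l + 2)

print(smoothed_theta(0, 5))   # 1/7 instead of 0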
Multiple parameters
Red/green wrapper depends probabilistically on flavor:
[Figure: two-node Bayes net Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) = θ1 for cherry, θ2 for lime]
Likelihood for, e.g., a cherry candy in a green wrapper:
P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1)
N candies, rc red-wrapped cherry candies, gc green-wrapped cherry candies, etc.:
P(d|hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^{rc} (1 − θ1)^{gc} · θ2^{rℓ} (1 − θ2)^{gℓ}
L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]
Multiple parameters contd.
Derivatives of L contain only the relevant parameter:
∂L/∂θ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)
With complete data, parameters can be learned separately
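A brief sketch of this decoupling: each parameter is estimated from its own counts, which are made up here for illustration.

# Sketch: with complete data the three ML estimates decouple into simple ratios.
c, l = 60, 40          # cherry / lime candies
rc, gc = 45, 15        # red- / green-wrapped cherries
rl, gl = 10, 30        # red- / green-wrapped limes

theta  = c  / (c  + l)     # P(F = cherry)
theta1 = rc / (rc + gc)    # P(W = red | F = cherry)
theta2 = rl / (rl + gl)    # P(W = red | F = lime)

print(theta, theta1, theta2)   # 0.6 0.75 0.25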
Example: linear Gaussian model
[Figure: left, the conditional density P(y|x) of a linear Gaussian model; right, data points (x, y) with the best-fit line]
Maximizing P(y|x) = (1/(√(2π) σ)) e^{−(y − (θ1x + θ2))²/(2σ²)} w.r.t. θ1, θ2
= minimizing E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution
for a linear fit assuming Gaussian noise of fixed variance
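A sketch of this equivalence using NumPy: np.polyfit minimizes exactly the sum of squared errors E above, so it recovers the ML parameters; the synthetic data and noise level are illustrative choices.

import numpy as np

# ML fit of y = theta1 * x + theta2 under fixed-variance Gaussian noise
# is the least-squares line. The data below is synthetic.
rng = np.random.default_rng(0)
true_theta1, true_theta2, sigma = 2.0, 0.5, 0.1

x = rng.uniform(0.0, 1.0, size=200)
y = true_theta1 * x + true_theta2 + rng.normal(0.0, sigma, size=200)

# np.polyfit minimizes sum_j (y_j - (theta1*x_j + theta2))^2, i.e. E above.
theta1_hat, theta2_hat = np.polyfit(x, y, deg=1)
print(theta1_hat, theta2_hat)   # close to 2.0 and 0.5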
Expectation Maximization (EM)
When to use:
• Data is only partially observable
• Unsupervised clustering (target value unobservable)
• Supervised learning (some instance attributes unobservable)
Some uses:
• Train Bayesian Belief Networks
• Unsupervised clustering (AUTOCLASS)
• Learning Hidden Markov Models
Generating Data from Mixture of k Gaussians
[Figure: density p(x) of a mixture of Gaussians over x]
Each instance x generated by
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian
EM for Estimating k Means
Given:
• Instances from X generated by mixture of k Gaussian distributions
• Unknown means ⟨µ1, . . . , µk⟩ of the k Gaussians
• Don’t know which instance xi was generated by which Gaussian
Determine:
• Maximum likelihood estimates of ⟨µ1, . . . , µk⟩
Think of the full description of each instance as yi = ⟨xi, zi1, zi2⟩, where
• zij is 1 if xi was generated by the jth Gaussian
• xi observable
• zij unobservable
EM for Estimating k Means contd.
EM Algorithm: Pick random initial h = ⟨µ1, µ2⟩, then iterate
E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨µ1, µ2⟩ holds:
E[zij] = p(x = xi | µ = µj) / Σ_{n=1}^{2} p(x = xi | µ = µn)
       = e^{−(xi − µj)²/(2σ²)} / Σ_{n=1}^{2} e^{−(xi − µn)²/(2σ²)}
M step: Calculate a new maximum likelihood hypothesis h′ = ⟨µ′1, µ′2⟩, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated above. Replace h = ⟨µ1, µ2⟩ by h′ = ⟨µ′1, µ′2⟩:
µj ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]
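A minimal sketch of this two-mean EM loop with a known, shared σ; the synthetic data, initialization, and iteration count are illustrative assumptions.

import numpy as np

# Two-mean EM as above: known common sigma, unknown mu1, mu2.
rng = np.random.default_rng(1)
sigma = 1.0
x = np.concatenate([rng.normal(-2.0, sigma, 100), rng.normal(3.0, sigma, 100)])

mu = np.array([0.0, 1.0])              # initial hypothesis h = <mu1, mu2>
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(mu)   # approximately [-2, 3]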
EM Algorithm
Converges to local maximum likelihood h
and provides estimates of hidden variables zij
In fact, local maximum in E[ln P (Y |h)]
• Y is the complete data (observable plus unobservable variables)
• Expected value is taken over possible values of unobserved variables in Y
General EM Problem
Given:
• Observed data X = {x1, . . . , xm}
• Unobserved data Z = {z1, . . . , zm}
• Parameterized probability distribution P (Y |h), where
– Y = {y1, . . . , ym} is the full data yi = xi ∪ zi
– h are the parameters
Determine: h that (locally) maximizes E[ln P (Y |h)]
Many uses:
• Train Bayesian belief networks
• Unsupervised clustering (e.g., k means)
• Hidden Markov Models
General EM Method
Define a likelihood function Q(h′|h) which calculates Y = X ∪ Z using the observed X and current parameters h to estimate Z:
Q(h′|h) ← E[ln P(Y|h′) | h, X]
EM Algorithm:
Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h
and the observed data X to estimate the probability distribution over Y.
Q(h′|h) ← E[ln P(Y|h′) | h, X]
Maximization (M) step: Replace hypothesis h by the hypothesis h′ that
maximizes this Q function.
h ← argmax_{h′} Q(h′|h)
Instance-Based Learning
Key idea: just store all training examples ⟨xi, f(xi)⟩
Nearest neighbor:
• Given query instance xq, first locate the nearest training example xn, then estimate f̂(xq) ← f(xn)
k-Nearest neighbor:
• Given xq, take a vote among its k nearest nbrs (if discrete-valued target function)
• Take the mean of the f values of the k nearest nbrs (if real-valued):
f̂(xq) ← (Σ_{i=1}^{k} f(xi)) / k
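A small sketch of k-nearest-neighbour prediction along these lines; the helper name knn_predict and the toy data are illustrative, not from the slides.

import numpy as np

# Store the training set; answer queries by averaging (real-valued f)
# or voting (discrete f) over the k nearest stored examples.
def knn_predict(X_train, f_train, x_q, k=3, discrete=False):
    dists = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                 # indices of k nearest
    values = f_train[nearest]
    if discrete:
        labels, counts = np.unique(values, return_counts=True)
        return labels[np.argmax(counts)]            # majority vote
    return values.mean()                            # mean of f values

# Usage with made-up data:
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
f = np.array([1.0, 1.2, 0.9, 10.0])
print(knn_predict(X, f, np.array([0.2, 0.1]), k=3))   # about 1.03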
When To Consider Nearest Neighbor
• Instances map to points in ℝⁿ
• Less than 20 attributes per instance
• Lots of training data
Advantages:
• Training is very fast
• Learn complex target functions
• Don’t lose information
Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes