Machine Learning
Expectation Maximization and Gaussian Mixtures
CSE 473, Chapter 20.3
Feedback in Learning
• Supervised learning: correct answers for each example
• Unsupervised learning: correct answers not given
• Reinforcement learning: occasional rewards
The problem of finding labels for unlabeled data
So far we have solved “supervised” classification problems where a teacher
told us the label of each example. In nature, items often do not come with
labels. How can we learn labels without a teacher?
[Figure: two scatter plots in the (x1, x2) plane, one of unlabeled data and one of the same data with labels. From Shadmehr & Diedrichsen.]
Example: image segmentation
Identify pixels that are white matter, gray matter, or outside of the brain.
[Figure: histogram of normalized pixel values with three modes labeled "Outside the brain", "Gray matter", and "White matter". From Shadmehr & Diedrichsen.]
Raw Sensor Data
Measured distances for expected distance of 300 cm.
[Figure: histograms of sonar and laser range measurements.]
ANEMIA PATIENTS AND CONTROLS
[Figure: scatter plot of Red Blood Cell Hemoglobin Concentration (3.7 to 4.4) vs. Red Blood Cell Volume (3.3 to 4.0), unlabeled.]
Gaussians
Univariate: p(x) ~ N(μ, σ²):
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Multivariate: p(x) ~ N(μ, Σ):
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{t}\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu})}$$
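To make these formulas concrete, here is a minimal Python sketch (added for illustration, not part of the original slides) that evaluates both densities directly from the definitions; the function names are my own, and in practice scipy.stats.norm and scipy.stats.multivariate_normal do the same job.

```python
import numpy as np

def univariate_gaussian(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def multivariate_gaussian(x, mu, Sigma):
    """Density of N(mu, Sigma) at a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(univariate_gaussian(0.0, 0.0, 1.0))                          # ~0.3989
print(multivariate_gaussian(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592
```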
Fitting a Gaussian PDF to Data
[Figure: two panels, "Good fit" and "Poor fit", each showing data with a Gaussian density curve overlaid. From Russell.]
Fitting a Gaussian PDF to Data
• Suppose y = y1, …, yn, …, yN is a set of N data values
• Given a Gaussian PDF with mean μ and variance σ², define:
$$p(y \mid \mu, \sigma) = \prod_{n=1}^{N} p(y_n \mid \mu, \sigma) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_n - \mu)^2}{2\sigma^2}}$$
• How do we choose μ and σ to maximise this probability?
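As a concrete illustration of this definition (my addition, not from the slides), the sketch below evaluates the likelihood in log space, which avoids numerical underflow when N is large; the data values are made up.

```python
import numpy as np

def gaussian_log_likelihood(y, mu, sigma):
    """log p(y | mu, sigma) = sum_n log N(y_n | mu, sigma^2)."""
    y = np.asarray(y, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (y - mu) ** 2 / (2 * sigma**2))

y = np.array([1.2, 0.7, 1.9, 1.4, 0.3])
print(gaussian_log_likelihood(y, mu=1.0, sigma=0.5))   # better fit, higher value
print(gaussian_log_likelihood(y, mu=5.0, sigma=0.5))   # poor fit, much lower value
```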
Maximum Likelihood Estimation
• Define the best-fitting Gaussian to be the one such that p(y | μ, σ) is maximised.
• Terminology:
- p(y | μ, σ), thought of as a function of y, is the probability (density) of y
- p(y | μ, σ), thought of as a function of μ, σ, is the likelihood of μ, σ
• Maximising p(y | μ, σ) with respect to μ, σ is called Maximum Likelihood (ML) estimation of μ, σ
ML estimation of μ, σ
• Intuitively:
- The maximum likelihood estimate of μ should be the average value of y1, …, yN (the sample mean)
- The maximum likelihood estimate of σ² should be the variance of y1, …, yN (the sample variance)
• This turns out to be true: p(y | μ, σ) is maximised by setting:
$$\mu = \frac{1}{N} \sum_{n=1}^{N} y_n, \qquad \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu)^2$$
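A quick numerical check of these formulas (an illustrative sketch, not from the slides; the data are synthetic): the closed-form ML estimates are exactly the sample mean and the 1/N (biased) sample variance that NumPy computes with ddof=0.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data

mu_ml = np.sum(y) / len(y)                      # (1/N) sum_n y_n
var_ml = np.sum((y - mu_ml) ** 2) / len(y)      # (1/N) sum_n (y_n - mu)^2

print(mu_ml, np.mean(y))          # identical
print(var_ml, np.var(y, ddof=0))  # identical
```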
[Figure: top panel, two Gaussian component densities ("Component 1", "Component 2") plotted as p(x) over x from -5 to 10; bottom panel, the resulting "Mixture Model" density.]
[Figure: the same two-panel layout with different component parameters.]
[Figure: the same two-panel layout; top panel labeled "Component Models" with narrower, taller components.]
Mixtures
If our data is not labeled, we can hypothesize that:
1. There are exactly m classes in the data: y ∈ {1, 2, …, m}
2. Each class y occurs with a specific frequency: P(y)
3. Examples of class y are governed by a specific distribution: p(x | y)
According to our hypothesis, each example x(i) must have been generated from a specific "mixture" distribution:
$$p(x) = \sum_{j=1}^{m} P(y = j)\, p(x \mid y = j)$$
We might hypothesize that the distributions are Gaussian:
$$p(x \mid \theta) = \sum_{j=1}^{m} P(y = j)\, N(x \mid \mu_j, \sigma_j)$$
where the parameters of the distributions are θ = {P(y=1), μ1, σ1, …, P(y=m), μm, σm}, the P(y=j) are the mixing proportions, and N is the normal distribution.
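A minimal sketch of this mixture density for the univariate Gaussian case (added for illustration; the mixing proportions and component parameters below are invented):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_j P(y=j) N(x | mu_j, sigma_j)."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

# Hypothetical two-component mixture:
weights = [0.3, 0.7]    # mixing proportions P(y = j); must sum to 1
mus     = [-1.0, 3.0]
sigmas  = [0.5, 1.0]
print(mixture_pdf(0.0, weights, mus, sigmas))
```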
Graphical Representation of Gaussian Mixtures
[Diagram: hidden variable y with prior P(y); measured variable x with conditional density p(x | y), here with three components p(x | y=1, μ1, σ1), p(x | y=2, μ2, σ2), p(x | y=3, μ3, σ3).]
$$p(x) = \sum_{i=1}^{3} p(y = i)\, p(x \mid y = i, \mu_i, \sigma_i)$$
Learning of mixture models
Learning Mixtures from Data
• Consider fixed K = 2, e.g., unknown parameters Θ = {μ1, σ1, μ2, σ2, α1}
• Given data D = {x1, …, xN}, we want to find the parameters Θ that "best fit" the data
Early Attempts
Weldon’s data, 1893
- n=1000 crabs from Bay of Naples
- Ratio of forehead to body length
- Suspected existence of 2 separate species
Early Attempts
Karl Pearson, 1894:
- JRSS paper
- proposed a mixture of 2 Gaussians
- 5 parameters Θ = {μ1, σ1, μ2, σ2, α1}
- parameter estimation -> method of moments
- involved solution of 9th-order equations!
(see Chapter 10, Stigler (1986), The History of Statistics)
“The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task…. But I fear he will have few successors…..”
(Charlier, 1906)
Maximum Likelihood Principle
• Fisher, 1922
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely
1977: The EM Algorithm
• Dempster, Laird, and Rubin
• General framework for likelihood-based parameter estimation with missing data:
- start with initial guesses of parameters
- E-step: estimate memberships given params
- M-step: estimate params given memberships
- repeat until convergence
• Converges to a (local) maximum of likelihood
• E-step and M-step are often computationally simple
• Generalizes to maximum a posteriori (with priors)
EM for Mixture of Gaussians
• E-step: Compute the probability that point xj was generated by component i:
$$p_{ij} = P(C = i \mid x_j) = \alpha\, P(x_j \mid C = i)\, P(C = i), \qquad p_i = \sum_j p_{ij}$$
• M-step: Compute new mean, covariance, and component weights:
$$\mu_i = \sum_j p_{ij}\, x_j \,/\, p_i, \qquad \sigma_i^2 = \sum_j p_{ij}\,(x_j - \mu_i)^2 \,/\, p_i, \qquad w_i = p_i$$
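The two steps translate almost line for line into code. Below is a minimal EM sketch for a univariate K-component Gaussian mixture (my own illustration of the update rules above, not the authors' code); the only liberty taken is normalizing the component weights by N so they sum to one.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def em_gmm_1d(x, k, n_iter=50, seed=0):
    """EM for a univariate Gaussian mixture with k components (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    mus = rng.choice(x, size=k, replace=False)   # initial guesses of the parameters
    sigmas = np.full(k, np.std(x))
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: p_ij = P(C=i | x_j), proportional to P(x_j | C=i) P(C=i)
        p = np.array([w * gaussian_pdf(x, m, s)
                      for w, m, s in zip(weights, mus, sigmas)])   # shape (k, n)
        p /= p.sum(axis=0, keepdims=True)        # normalize over components
        p_i = p.sum(axis=1)                      # p_i = sum_j p_ij

        # M-step: new means, variances, and component weights
        mus = (p @ x) / p_i
        sigmas = np.sqrt((p * (x[None, :] - mus[:, None]) ** 2).sum(axis=1) / p_i)
        weights = p_i / n                        # slide's w_i <- p_i, normalized by N

    return weights, mus, sigmas

# Usage on synthetic data drawn from two Gaussians:
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 0.7, 200)])
print(em_gmm_1d(data, k=2))
```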
ANEMIA PATIENTS AND CONTROLS
[Figure: the unlabeled anemia data, Red Blood Cell Hemoglobin Concentration vs. Red Blood Cell Volume.]
EM ITERATIONS 1, 3, 5, 10, 15, 25
[Figures: the same scatter plot at EM iterations 1, 3, 5, 10, 15, and 25, with the fitted mixture components overlaid.]
LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS
[Figure: log-likelihood (y-axis, roughly 400 to 490) plotted against EM iteration (x-axis, 0 to 25).]
ANEMIA DATA WITH LABELS
[Figure: the same scatter plot with the true labels shown, a Control Group and an Anemia Group, Red Blood Cell Hemoglobin Concentration vs. Red Blood Cell Volume.]
Mixture Density
$$P(z \mid x, m) = \begin{pmatrix} \alpha_{\text{hit}} \\ \alpha_{\text{unexp}} \\ \alpha_{\text{max}} \\ \alpha_{\text{rand}} \end{pmatrix}^{T} \cdot \begin{pmatrix} P_{\text{hit}}(z \mid x, m) \\ P_{\text{unexp}}(z \mid x, m) \\ P_{\text{max}}(z \mid x, m) \\ P_{\text{rand}}(z \mid x, m) \end{pmatrix}$$
How can we determine the model parameters?
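As an illustration of how this weighted combination is evaluated (a sketch only: the weights and the component densities below are made up, whereas the real P_hit, P_unexp, P_max, and P_rand depend on the map m and the expected range):

```python
import numpy as np

def beam_model_density(z, weights, component_pdfs):
    """P(z | x, m) as a convex combination of four component densities."""
    densities = np.array([pdf(z) for pdf in component_pdfs])
    return float(np.dot(weights, densities))

# Hypothetical mixing weights (alpha_hit, alpha_unexp, alpha_max, alpha_rand), summing to 1:
weights = np.array([0.7, 0.1, 0.1, 0.1])

z_expected, z_max = 300.0, 500.0
component_pdfs = [
    lambda z: np.exp(-0.5 * ((z - z_expected) / 10.0) ** 2) / (10.0 * np.sqrt(2 * np.pi)),  # hit
    lambda z: 0.01 * np.exp(-0.01 * z) if z < z_expected else 0.0,                          # unexpected obstacle
    lambda z: 1.0 if abs(z - z_max) < 1.0 else 0.0,                                         # max range
    lambda z: 1.0 / z_max,                                                                  # random
]
print(beam_model_density(290.0, weights, component_pdfs))
```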
Raw Sensor Data
Measured distances for expected distance of 300 cm.
[Figure: histograms of sonar and laser range measurements.]
Approximation Results
[Figure: fitted mixture model compared with the laser and sonar measurements, for expected distances of 300 cm and 400 cm.]
Predict Goal and Path
[Figure: a map with the predicted goal and predicted path marked.]
Learned Transition Parameters
[Figure: two maps, "Going to the workplace" and "Going home", with learned transition modes marked by color: bus, car, and foot.]