Lecture Slides for
Introduction to Machine Learning, 2nd Edition
ETHEM ALPAYDIN
© The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Parametric Estimation
 X = {x^t}_{t=1}^N where x^t ~ p(x)
 Here x is one-dimensional and the densities are univariate.
 Parametric estimation:
   Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
   e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Estimation
 Likelihood of θ given the sample X
   l(θ|X) ≡ p(X|θ) = ∏_{t=1}^N p(x^t|θ)
 Log likelihood
   L(θ|X) ≡ log l(θ|X) = ∑_{t=1}^N log p(x^t|θ)
 Maximum likelihood estimator (MLE)
   θ* = arg max_θ L(θ|X)
Examples: Bernoulli Density
 Two states, failure/success, x in {0, 1}
   P(x) = p^x (1 – p)^(1 – x)
   l(p|X) = ∏_{t=1}^N p^(x^t) (1 – p)^(1 – x^t)
   L(p|X) = log l(p|X)
          = log ∏_t p^(x^t) (1 – p)^(1 – x^t)
          = (∑_t x^t) log p + (N – ∑_t x^t) log(1 – p)
 MLE: p̂ = (∑_t x^t) / N
(MLE: Maximum Likelihood Estimation)
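A minimal numerical sketch of this estimator (not from the slides; the sample size and values are illustrative): the mean of the 0/1 outcomes is the maximum likelihood estimate of p.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=1000)   # N Bernoulli draws x^t in {0, 1}

p_mle = x.sum() / len(x)                 # p-hat = (sum_t x^t) / N
print(p_mle)                             # close to p_true
```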
Examples: Multinomial Density
 K > 2 states, x_i in {0, 1}
   P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
   L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
   where x_i^t = 1 if experiment t chooses state i, and x_i^t = 0 otherwise
 MLE: p̂_i = (∑_t x_i^t) / N
 PS. The K states are mutually exclusive.
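The same idea sketched for the multinomial case (sample size and probabilities are illustrative): the MLE of each p_i is the fraction of experiments that chose state i.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = [0.2, 0.5, 0.3]                    # K = 3 mutually exclusive states
X = rng.multinomial(1, p_true, size=1000)   # rows are one-hot indicators x_i^t

p_mle = X.sum(axis=0) / len(X)              # p-hat_i = (sum_t x_i^t) / N
print(p_mle)
```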
Gaussian (Normal) Distribution
 p(x) = N(μ, σ²):
   p(x) = [1 / (√(2π) σ)] exp[ –(x – μ)² / (2σ²) ]
 Given a sample X = {x^t}_{t=1}^N with x^t ~ N(μ, σ²), the log likelihood of a Gaussian sample is
   L(μ, σ|X) = –(N/2) log(2π) – N log σ – ∑_t (x^t – μ)² / (2σ²)
(Why? N samples?)
Gaussian (Normal) Distribution
 MLE for μ and σ²:
   p(x) = [1 / (√(2π) σ)] exp[ –(x – μ)² / (2σ²) ]
   m = (∑_t x^t) / N   estimates μ
   s² = ∑_t (x^t – m)² / N   estimates σ²
(Exercise!)
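A short sketch of these two estimates on synthetic data (the sample and its parameters are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # x^t ~ N(mu, sigma^2)

m  = x.sum() / len(x)                 # m  = (sum_t x^t) / N
s2 = ((x - m) ** 2).sum() / len(x)    # s^2 = sum_t (x^t - m)^2 / N  (the biased MLE)
print(m, s2)                          # close to 2.0 and 1.5^2 = 2.25
```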
Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] – θ
Variance: E[(d – E[d])²]
If b_θ(d) = 0, then d is an unbiased estimator of θ
If E[(d – E[d])²] → 0 as N → ∞, then d is a consistent estimator of θ
Expected value
• If the probability distribution of X admits a probability density function f(x), then the expected value can be computed as
   E[X] = ∫ x f(x) dx
• It follows directly from the discrete-case definition that if X is a constant random variable, i.e. X = b for some fixed real number b, then the expected value of X is also b.
• The expected value of an arbitrary function of X, g(X), with respect to the probability density function f(x) is given by the inner product of f and g:
   E[g(X)] = ∫ g(x) f(x) dx
http://en.wikipedia.org/wiki/Expected_value
Bias and Variance
For example:
   E[m] = E[(∑_t x^t) / N] = (1/N) ∑_t E[x^t] = Nμ / N = μ
   → m is an unbiased estimator of μ
   Var[m] = Var[(∑_t x^t) / N] = (1/N²) ∑_t Var[x^t] = Nσ² / N² = σ² / N
   Var[m] → 0 as N → ∞
   → m is also a consistent estimator
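A quick simulation sketch of these two facts (the constants are arbitrary, not from the slides): averaging the sample mean m over many samples recovers μ, and its variance behaves like σ²/N.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, M = 2.0, 1.5, 50, 20000      # M independent samples, each of size N

m = rng.normal(mu, sigma, size=(M, N)).mean(axis=1)   # sample mean of each sample

print(m.mean())                 # ~ mu            (E[m] = mu, so m is unbiased)
print(m.var(), sigma**2 / N)    # ~ sigma^2 / N   (Var[m] -> 0 as N grows)
```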
Bias and Variance
For example (see pp. 65-66):
   E[s²] = ((N – 1) / N) σ² ≠ σ²
   → s² is a biased estimator of σ²
   → (N / (N – 1)) s² is an unbiased estimator of σ²
Mean square error:
   r(d, θ) = E[(d – θ)²]   (see p. 66 and the next slide)
           = (E[d] – θ)² + E[(d – E[d])²]
           = Bias² + Variance
Standard Deviation
• In statistics, the standard deviation is often estimated from a random sample drawn from the population.
• The most common measure used is the sample standard deviation, which is defined by
   s = √[ (1 / (N – 1)) ∑_{t=1}^N (x^t – m)² ]
  where x^1, ..., x^N is the sample (formally, realizations from a random variable X) and m is the sample mean.
http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
Bayes’ Estimator
 Treat θ as a random variable with prior p(θ)
 Bayes’ rule:
   p(θ|X) = p(X|θ) p(θ) / p(X)
 Maximum a Posteriori (MAP):
   θ_MAP = arg max_θ p(θ|X)
 Maximum Likelihood (ML):
   θ_ML = arg max_θ p(X|θ)
 Bayes’ estimator:
   θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
MAP vs. ML
 If p(θ) is a uniform distribution, then
   θ_MAP = arg max_θ p(θ|X)
         = arg max_θ p(X|θ) p(θ) / p(X)
         = arg max_θ p(X|θ)
         = θ_ML
   i.e., θ_MAP = θ_ML, since p(θ) / p(X) is a constant in θ
Bayes’ Estimator: Example
 If p(θ|X) is normal, then θ_ML = m and θ_Bayes = θ_MAP
 Example: Suppose x^t ~ N(θ, σ²) and θ ~ N(μ₀, σ₀²)
   p(X|θ) = [1 / ((2π)^(N/2) σ^N)] exp[ –∑_{t=1}^N (x^t – θ)² / (2σ²) ]
   θ_ML = m
   p(θ) = [1 / (√(2π) σ₀)] exp[ –(θ – μ₀)² / (2σ₀²) ]
   θ_Bayes = E[θ|X] = [ (N/σ²) / (N/σ² + 1/σ₀²) ] m + [ (1/σ₀²) / (N/σ² + 1/σ₀²) ] μ₀
 The Bayes’ estimator is a weighted average of the prior mean μ₀ and the sample mean m.
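A small sketch of the weighted-average formula above (the sample, prior parameters, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0      # known noise std; prior theta ~ N(mu0, sigma0^2)
x = rng.normal(1.2, sigma, size=25)     # x^t ~ N(theta, sigma^2), true theta = 1.2

N, m = len(x), x.mean()                 # theta_ML = m
w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)
theta_bayes = w * m + (1 - w) * mu0     # weighted average of m and the prior mean
print(m, theta_bayes)
```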
Parametric Classification
 The discriminant function
   g_i(x) = p(x|C_i) P(C_i)
 or, equivalently,
   g_i(x) = log p(x|C_i) + log P(C_i)
 Assume that the p(x|C_i) are Gaussian:
   p(x|C_i) = [1 / (√(2π) σ_i)] exp[ –(x – μ_i)² / (2σ_i²) ]
   g_i(x) = –(1/2) log 2π – log σ_i – (x – μ_i)² / (2σ_i²) + log P(C_i)
(cf. the log likelihood of a Gaussian sample)
 Given the sample X = {x^t, r^t}_{t=1}^N, where
   r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i
 Maximum Likelihood (ML) estimates are
   P̂(C_i) = (∑_t r_i^t) / N
   m_i = (∑_t x^t r_i^t) / (∑_t r_i^t)
   s_i² = (∑_t (x^t – m_i)² r_i^t) / (∑_t r_i^t)
 The discriminant becomes
   g_i(x) = –(1/2) log 2π – log s_i – (x – m_i)² / (2s_i²) + log P̂(C_i)
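A compact sketch of these estimates and the resulting discriminant on synthetic one-dimensional data (the class parameters and helper names are illustrative assumptions):

```python
import numpy as np

def fit_gaussian_classes(x, r):
    """ML estimates of the priors, class means, and class variances from 1-D inputs x and one-hot labels r."""
    prior = r.sum(axis=0) / len(x)                                   # P-hat(C_i) = sum_t r_i^t / N
    m = (x[:, None] * r).sum(axis=0) / r.sum(axis=0)                 # m_i
    s2 = (((x[:, None] - m) ** 2) * r).sum(axis=0) / r.sum(axis=0)   # s_i^2
    return prior, m, s2

def discriminant(x, prior, m, s2):
    """g_i(x) = -0.5*log(2*pi) - log(s_i) - (x - m_i)^2 / (2*s_i^2) + log P-hat(C_i)."""
    x = np.asarray(x)[:, None]
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
            - (x - m) ** 2 / (2 * s2) + np.log(prior))

# two synthetic classes with different means and variances
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(2.0, 0.5, 50)])
r = np.zeros((100, 2)); r[:50, 0] = 1; r[50:, 1] = 1

prior, m, s2 = fit_gaussian_classes(x, r)
pred = discriminant(x, prior, m, s2).argmax(axis=1)   # choose the C_i maximizing g_i(x)
```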
 The first term is a constant and, if the priors are equal, those terms can be dropped.
 If we further assume that the variances are equal, then
   g_i(x) = –(1/2) log 2π – log s_i – (x – m_i)² / (2s_i²) + log P̂(C_i)
 becomes
   g_i(x) = –(x – m_i)²
 Choose C_i if |x – m_i| = min_k |x – m_k|
Equal variances: a single boundary halfway between the means.
[Figure] Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.
Variances are different: two boundaries.
[Figure] Likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points.
Exercise
 For a two-class problem, generate normal samples for two classes with different variances, then use parametric classification to estimate the discriminant points. Compare these with the theoretical values.
 PS. You can use normal sample generation tools.
Regression
   r = f(x) + ε
   estimator: g(x|θ)
   ε ~ N(0, σ²)
   p(r|x) ~ N(g(x|θ), σ²)
   p(r|x): the probability of the output given the input
 Given a sample X = {x^t, r^t}_{t=1}^N, the log likelihood is
   L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
          = ∑_{t=1}^N log p(r^t|x^t) + ∑_{t=1}^N log p(x^t)
(Regression assumes zero-mean Gaussian noise added to the model; here, the model is linear.)
Regression: From LogL to Error
 Ignore the second term (it does not depend on our estimator):
   L(θ|X) = log ∏_{t=1}^N [1 / (√(2π) σ)] exp[ –(r^t – g(x^t|θ))² / (2σ²) ]
          = –N log(√(2π) σ) – (1 / (2σ²)) ∑_{t=1}^N [r^t – g(x^t|θ)]²
 Maximizing this is equivalent to minimizing
   E(θ|X) = (1/2) ∑_{t=1}^N [r^t – g(x^t|θ)]²
 → the least squares estimate
Example: Linear Regression
 Let g(x^t | w_1, w_0) = w_1 x^t + w_0
 Minimize E(θ|X) = (1/2) ∑_{t=1}^N [r^t – g(x^t|θ)]²
 Setting the derivatives with respect to w_0 and w_1 to zero, we obtain (Exercise!!)
   ∑_t r^t = N w_0 + w_1 ∑_t x^t
   ∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²
 In matrix form, A w = y, where
   A = [ N         ∑_t x^t
         ∑_t x^t   ∑_t (x^t)² ],
   w = [ w_0
         w_1 ],
   y = [ ∑_t r^t
         ∑_t r^t x^t ]
 and w = A⁻¹ y
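A minimal sketch that forms A and y from synthetic data and solves for w (the data-generating line and noise level are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 40)
r = 3.0 + 2.0 * x + rng.normal(0, 0.5, 40)    # r^t = w0 + w1*x^t + noise

A = np.array([[len(x),        x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)                # solves A w = y (preferable to forming A^-1)
print(w0, w1)                                 # close to (3.0, 2.0)
```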
Example: Polynomial Regression
 g(x^t | w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0
 Again A w = y. Let
   D = [ 1   x^1   (x^1)²   ...   (x^1)^k
         1   x^2   (x^2)²   ...   (x^2)^k
         :
         1   x^N   (x^N)²   ...   (x^N)^k ],
   r = [ r^1
         r^2
         :
         r^N ]
 We can obtain A = Dᵀ D, y = Dᵀ r, and w = (Dᵀ D)⁻¹ Dᵀ r   (see page 75)
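A short sketch of this design-matrix solution (the target function, noise level, and order are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 40)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 0.3, 40)

k = 3                                       # polynomial order
D = np.vander(x, k + 1, increasing=True)    # rows [1, x^t, (x^t)^2, ..., (x^t)^k]
w = np.linalg.solve(D.T @ D, D.T @ r)       # w = (D^T D)^-1 D^T r
print(w)
```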
Other Error Measures
 Square Error:
   E(θ|X) = (1/2) ∑_{t=1}^N [r^t – g(x^t|θ)]²
 Relative Square Error:
   E(θ|X) = ∑_{t=1}^N [r^t – g(x^t|θ)]² / ∑_{t=1}^N [r^t – r̄]²
 Absolute Error:
   E(θ|X) = ∑_t |r^t – g(x^t|θ)|
 ε-sensitive Error:
   E(θ|X) = ∑_t 1(|r^t – g(x^t|θ)| > ε)
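These measures translate directly into code; a sketch (the function name and ε value are illustrative):

```python
import numpy as np

def error_measures(r, g, eps=0.1):
    """Square, relative square, absolute, and eps-sensitive errors of predictions g against targets r."""
    sq  = 0.5 * np.sum((r - g) ** 2)
    rel = np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)
    ab  = np.sum(np.abs(r - g))
    eps_sens = np.sum(np.abs(r - g) > eps)   # counts residuals larger than eps
    return sq, rel, ab, eps_sens
```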
Bias and Variance
The expected square error at a particular point x, with respect to a fixed g(x) and variations in r based on p(r|x), is
   E[(r – g(x))² | x] = E[(r – E[r|x])² | x] + (E[r|x] – g(x))²
   (estimate of the error at point x = noise + squared error; see Eq. 4.17)
Now note that g(·) is a random variable (a function of the sample S):
   E_S[ E[(r – g(x))² | x] ] = E[(r – E[r|x])² | x] + E_S[(E[r|x] – g(x))²]
   (expectation, over sample variation, of our estimate of the error at point x)
Bias and Variance
Now note that g(·) is a random variable (a function of the sample S):
   E_S[ E[(r – g(x))² | x] ] = E[(r – E[r|x])² | x] + E_S[(E[r|x] – g(x))²]
The expected value of the squared-error term (averaged over samples X, all of size N and drawn from the same joint density p(x, r); see Eq. 4.11 and pages 66 and 76) is
   E_S[(E[r|x] – g(x))² | x] = (E[r|x] – E_S[g(x)])² + E_S[(g(x) – E_S[g(x)])²]
                             = bias² + variance
 squared error = bias² + variance
Estimating Bias and Variance
 Samples X_i = {x^t_i, r^t_i}, i = 1, ..., M, t = 1, ..., N, are used to fit g_i(x), i = 1, ..., M
   ḡ(x) = (1/M) ∑_i g_i(x)
   Bias²(g) = (1/N) ∑_t [ḡ(x^t) – f(x^t)]²
   Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) – ḡ(x^t)]²
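A simulation sketch of this procedure, assuming the f(x) = 2 sin(1.5x) example used on a later slide and polynomial fits of a fixed, arbitrarily chosen order:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * np.sin(1.5 * x)             # true function
x_eval = np.linspace(0, 5, 20)                # evaluation points x^t
M, N, order = 100, 20, 3                      # M samples of size N; polynomial order

G = np.empty((M, len(x_eval)))
for i in range(M):
    xt = rng.uniform(0, 5, N)
    rt = f(xt) + rng.normal(0, 1.0, N)
    w = np.polyfit(xt, rt, order)             # fit g_i on sample X_i
    G[i] = np.polyval(w, x_eval)

g_bar = G.mean(axis=0)                        # average fit over the M samples
bias2 = np.mean((g_bar - f(x_eval)) ** 2)     # (1/N) sum_t (g_bar(x^t) - f(x^t))^2
variance = np.mean((G - g_bar) ** 2)          # (1/(N M)) sum_t sum_i (g_i(x^t) - g_bar(x^t))^2
print(bias2, variance)
```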
Bias/Variance Dilemma
 Examples:
   g_i(x) = 2 has no variance and high bias
   g_i(x) = ∑_t r^t_i / N has lower bias but nonzero variance
 As we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data)
 Bias/Variance dilemma (Geman et al., 1992)
[Figure] f(x) = 2 sin(1.5x), the individual fits g_i, their average ḡ, and the resulting bias and variance.
Polynomial Regression
[Figure] Polynomial fits of increasing order: the best fit minimizes the error; lower orders underfit and higher orders overfit.
[Figure] Best fit (see Fig. 4.7).
Model Selection (1)
 Cross-validation:
   Measure generalization accuracy by testing on data unused during training
   Used to find the optimal complexity (see the sketch after this list)
 Regularization (adjustment):
   Penalize complex models
   E’ = error on data + λ · model complexity
 Structural risk minimization (SRM):
   Find the model that is simplest in terms of order and best in terms of empirical error on the data
   Model complexity measures: polynomials of increasing order, VC dimension, ...
 Minimum description length (MDL):
   The Kolmogorov complexity of a data set is defined as the shortest description of the data
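A minimal hold-out validation sketch for choosing a polynomial order (the data-generating function, split, and candidate orders are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 60)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 0.4, 60)

# hold out a validation set; pick the order with the lowest validation error
x_tr, r_tr, x_va, r_va = x[:40], r[:40], x[40:], r[40:]
val_err = {}
for k in range(1, 8):
    w = np.polyfit(x_tr, r_tr, k)                           # train on the training split only
    val_err[k] = np.mean((r_va - np.polyval(w, x_va)) ** 2)
best_k = min(val_err, key=val_err.get)
print(best_k, val_err)
```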
Model Selection (2)
 Bayesian model selection:
   Prior on models, p(model)
   p(model | data) = p(data | model) p(model) / p(data)
 Discussion:
   When the prior is chosen such that we give higher probabilities to simpler models, the Bayesian approach, regularization, SRM, and MDL are equivalent.
   Cross-validation is the best approach if there is a large enough validation dataset.
Regression example
Coefficients increase in magnitude as the order increases:
   1: [-0.0769, 0.0016]
   2: [0.1682, -0.6657, 0.0080]
   3: [0.4238, -2.5778, 3.4675, -0.0002]
   4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Please compare with Fig. 4.5, p. 78.
Regularization:
   E(w|X) = (1/2) ∑_{t=1}^N [r^t – g(x^t|w)]² + λ ∑_i w_i²
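A minimal sketch of minimizing the penalized error above in closed form for a polynomial model (λ and the data are illustrative; setting the gradient to zero gives w = (DᵀD + 2λI)⁻¹ Dᵀr):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 40)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 0.3, 40)

k, lam = 5, 1.0                                  # polynomial order and penalty weight
D = np.vander(x, k + 1, increasing=True)         # design matrix
# minimizing (1/2) sum (r - D w)^2 + lam * sum w_i^2  =>  w = (D^T D + 2*lam*I)^-1 D^T r
w = np.linalg.solve(D.T @ D + 2 * lam * np.eye(k + 1), D.T @ r)
print(np.abs(w))                                 # penalized coefficients stay small in magnitude
```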