Kernel Methods
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Kernel Methods: Key Points
• Essentially a local regression (function estimation/fitting) technique
• Only the observations (training set) close to the query point are
considered for regression computation
• While regressing, an observation point gets a weight that decreases
as its distance from the query point increases
• The resulting regression function is smooth
• All these features of this regression are made possible by a function
called a kernel
• Requires very little training (i.e., not many parameters to compute
offline from the training set, not much offline computation needed)
• This kind of regression is known as a memory-based technique because it
requires the entire training set to be available at prediction time
One-Dimensional Kernel Smoothers
• We have seen that the k-nearest-neighbor average directly estimates the regression function E(Y | X = x)
• k-nn assigns equal weight to
all points in neighborhood
• The average curve is bumpy
and discontinuous
• Rather than give equal weight,
assign weights that decrease
smoothly with distance from
the target points
fˆ x  Ave yi | xi  Nk x
Nadaraya-Watson Kernel-Weighted Average
• N-W kernel-weighted average:
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad \text{where } K_\lambda(x_0, x) = D\!\left(\frac{\lvert x - x_0\rvert}{h_\lambda(x_0)}\right)$$
• K is a kernel function:
Any smooth function K such that
K  x   0,
 K x dx  1,
2


xK
x
dx

0
and
x

 K x dx  0
Typically K is also symmetric about 0
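A small NumPy sketch of the Nadaraya-Watson average, using the Epanechnikov kernel from the next slide with constant width h_lambda(x0) = lambda; the function names and the fallback to the global mean for an empty window are my choices, not part of the notes.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, and 0 otherwise."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam, D=epanechnikov):
    """Kernel-weighted average: sum_i K_lam(x0, xi) yi / sum_i K_lam(x0, xi)."""
    w = D(np.abs(x - x0) / lam)                  # weights K_lambda(x0, xi)
    s = w.sum()
    return (w @ y) / s if s > 0 else y.mean()    # guard against an empty window
```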
Some Points About Kernels
• hλ(x0) is a width function also dependent on λ
• For the N-W kernel average hλ(x0) = λ
• For k-nn average hλ(x0) = |x0-x[k]|, where x[k] is
the kth closest xi to x0
• λ determines the width of local neighborhood
and degree of smoothness
• λ also controls the tradeoff between bias and
variance
– Larger λ gives lower variance but higher bias (why?)
• λ is computed from the training data (how? e.g., by cross-validation, as in the sketch below)
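One common (though not the only) way to choose λ from the training data is leave-one-out cross-validation. The sketch below assumes a smoother with the hypothetical interface `smoother(x0, x, y, lam)`, such as the Nadaraya-Watson function sketched earlier.

```python
import numpy as np

def loocv_bandwidth(x, y, candidate_lams, smoother):
    """Leave-one-out CV: for each candidate width, predict each yi from the
    remaining N-1 points and keep the lambda with the smallest mean squared
    prediction error."""
    best_lam, best_err = None, np.inf
    n = len(x)
    for lam in candidate_lams:
        sq_errs = []
        for i in range(n):
            mask = np.arange(n) != i                      # leave point i out
            pred = smoother(x[i], x[mask], y[mask], lam)
            sq_errs.append((y[i] - pred) ** 2)
        err = np.mean(sq_errs)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```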
Example Kernel Functions
• Epanechnikov quadratic kernel (used in the N-W method):
$$K_\lambda(x_0, x) = D\!\left(\frac{\lvert x - x_0\rvert}{\lambda}\right), \qquad D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } \lvert t\rvert \le 1; \\ 0 & \text{otherwise.} \end{cases}$$
• Tri-cube kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{\lvert x - x_0\rvert}{\lambda}\right), \qquad D(t) = \begin{cases} (1 - \lvert t\rvert^3)^3 & \text{if } \lvert t\rvert \le 1; \\ 0 & \text{otherwise.} \end{cases}$$
• Gaussian kernel:
$$K_\lambda(x_0, x) = \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)$$
Kernel Characteristics
• Compact – vanishes beyond a finite range (e.g., Epanechnikov, tri-cube)
• Everywhere differentiable (e.g., Gaussian, tri-cube)
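For completeness, illustrative implementations of the tri-cube and Gaussian kernels (the Epanechnikov case appears in the Nadaraya-Watson sketch earlier); the function names are mine.

```python
import numpy as np

def tricube(t):
    """Tri-cube: D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0 (compact support)."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian_kernel(x0, x, lam):
    """Gaussian: K_lam(x0, x) = exp(-(x - x0)^2 / (2 lam^2)) / (sqrt(2 pi) lam);
    not compact, but everywhere differentiable."""
    return np.exp(-((x - x0) ** 2) / (2 * lam**2)) / (np.sqrt(2 * np.pi) * lam)
```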
Local Linear Regression
• In kernel-weighted average method estimated function value
has a high bias at the boundary
• This high bias is a result of the asymmetry at the boundary
• The bias can also be present in the interior when the x values
in the training set are not equally spaced
• Fitting straight lines rather than constants locally helps us to
remove bias (why?)
Locally Weighted Linear Regression
• Least-squares solution:
$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - \alpha(x_0) - \beta(x_0)\,x_i\bigr]^2$$
$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\,x_0 = b(x_0)^T \bigl(B^T W(x_0) B\bigr)^{-1} B^T W(x_0)\,y = \sum_{i=1}^{N} l_i(x_0)\,y_i$$
where
– vector-valued function: $b(x)^T = (1, x)$
– $N \times 2$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• Note that the estimate is linear in the $y_i$
• The weights $l_i(x_0)$ are sometimes referred to as the equivalent kernel
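A short sketch of this locally weighted linear fit; the names are illustrative, and `D` is any of the kernel profiles above.

```python
import numpy as np

def local_linear(x0, x, y, lam, D):
    """f_hat(x0) = b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0) y with b(x) = (1, x)."""
    w = D(np.abs(x - x0) / lam)                  # diagonal of W(x0)
    B = np.column_stack([np.ones_like(x), x])    # N x 2 regression matrix
    BtW = B.T * w                                # B^T W(x0)
    coef = np.linalg.solve(BtW @ B, BtW @ y)     # (alpha_hat(x0), beta_hat(x0))
    return np.array([1.0, x0]) @ coef            # b(x0)^T coefficients
```

Since the same row vector $b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)$ multiplies $y$, its entries are exactly the equivalent-kernel weights $l_i(x_0)$.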
Bias Reduction in Local Linear Regression
• Local linear regression automatically modifies the kernel to
correct the bias exactly to the first order
Write a Taylor series expansion of $f(x_i)$ about $x_0$:
$$E\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, f(x_i) = f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$= f(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$\text{bias} = E\hat f(x_0) - f(x_0) = \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
since
$$\sum_{i=1}^{N} l_i(x_0) = 1 \quad \text{and} \quad \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$$
(see Ex. 6.2 in [HTF]); here $R$ collects the third- and higher-order terms of the expansion.
Local Polynomial Regression
• Why have a polynomial for the local fit? What would be
the rationale?
• Least-squares solution:
$$\min_{\alpha(x_0),\,\beta_j(x_0),\,j=1,\dots,d} \;\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\Bigl[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j\Bigr]^2$$
$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^j = b(x_0)^T \bigl(B^T W(x_0) B\bigr)^{-1} B^T W(x_0)\,y = \sum_{i=1}^{N} l_i(x_0)\,y_i$$
where
– vector-valued function: $b(x)^T = (1, x, \dots, x^d)$
– $N \times (d+1)$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• We will gain on bias; however we will pay the price in
terms of variance (why?)
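The local linear sketch above generalizes to a degree-d local polynomial by swapping the basis; the sketch below is illustrative, using np.vander to build the rows b(x_i) = (1, x_i, ..., x_i^d).

```python
import numpy as np

def local_poly(x0, x, y, lam, D, d=2):
    """Degree-d local polynomial fit at x0; d=1 recovers the local linear fit."""
    w = D(np.abs(x - x0) / lam)                     # diagonal of W(x0)
    B = np.vander(x, N=d + 1, increasing=True)      # ith row: (1, xi, ..., xi^d)
    BtW = B.T * w
    coef = np.linalg.solve(BtW @ B, BtW @ y)
    b0 = x0 ** np.arange(d + 1)                     # b(x0) = (1, x0, ..., x0^d)
    return b0 @ coef
```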
Bias and Variance Tradeoff
• As the degree of local polynomial regression increases, bias
decreases and variance increases
• Local linear fits can help reduce bias significantly at the boundaries
at a modest cost in variance
• Local quadratic fits tend to be most helpful in reducing bias due to
curvature in the interior of the domain
• So, would it be helpful to have a mixture of linear and quadratic local
fits?
Local Regression in Higher Dimensions
• We can extend 1D local regression to higher dimensions:
$$\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - b(x_i)^T \beta(x_0)\bigr]^2, \qquad K_\lambda(x_0, x) = D\!\left(\frac{\lVert x - x_0\rVert}{\lambda}\right)$$
$$\hat f(x_0) = b(x_0)^T \hat\beta(x_0) = b(x_0)^T \bigl(B^T W(x_0) B\bigr)^{-1} B^T W(x_0)\,y = \sum_{i=1}^{N} l_i(x_0)\,y_i$$
For $p$ dimensions and degree $d$:
– vector-valued function $b(x)$ of all polynomial terms of total degree up to $d$ in the $p$ coordinates, including the constant term; let its length be $H$
– $N \times H$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• Standardize each coordinate in the kernel, because the Euclidean (squared)
norm is affected by scaling
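A sketch of the p-dimensional local linear fit (degree d = 1) on standardized coordinates with a radial Gaussian weight; the interface and the choice of Gaussian profile are mine, not from the notes.

```python
import numpy as np

def local_linear_pdim(x0, X, y, lam):
    """Local linear fit in p dimensions: b(x) = (1, x_1, ..., x_p),
    weight K_lam(x0, x) = exp(-||x - x0||^2 / (2 lam^2)) on standardized inputs."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z, z0 = (X - mu) / sd, (x0 - mu) / sd        # standardize each coordinate
    w = np.exp(-0.5 * (np.linalg.norm(Z - z0, axis=1) / lam) ** 2)
    B = np.column_stack([np.ones(len(Z)), Z])    # N x (p + 1) regression matrix
    BtW = B.T * w                                # B^T W(x0)
    coef = np.linalg.solve(BtW @ B, BtW @ y)
    return np.concatenate(([1.0], z0)) @ coef    # b(z0)^T coefficients
```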
Local Regression: Issues in Higher Dimensions
• The boundary poses even a greater problem in higher
dimensions
– Many training points are required to reduce the bias; the sample
size must grow exponentially in p to maintain the same
performance.
• Local regression becomes less useful when dimensions
go beyond 2 or 3
• It’s impossible to maintain localness (low bias) and
sizeable samples (low variance) at the same time
Combating Dimensions: Structured Kernels
• In high dimensions, input variables (i.e., x variables)
could be very much correlated. This correlation could be
a key to reduce the dimensionality while performing
kernel regression.
• Let A be a positive semidefinite matrix (what does that
mean?). Let’s now consider a kernel that looks like:
$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\,(x - x_0)}{\lambda}\right)$$
• If $A = \Sigma^{-1}$, the inverse of the covariance matrix of the input
variables, then the correlation structure is captured
• Further, one can take only a few principal components of
A to reduce the dimensionality
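A minimal sketch of such a structured kernel; A would typically be (an estimate of) the inverse input covariance, and D any decreasing profile — the Gaussian-like choice here is just for illustration.

```python
import numpy as np

def structured_kernel(x0, x, A, lam):
    """K_{lam, A}(x0, x) = D((x - x0)^T A (x - x0) / lam), with D(t) = exp(-t / 2)."""
    d = x - x0
    return np.exp(-0.5 * (d @ A @ d) / lam)

# Example: A as the inverse sample covariance of the inputs X (an N x p array)
# A = np.linalg.inv(np.cov(X, rowvar=False))
```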
Combating Dimensions: Low Order Additive Models
• ANOVA (analysis of variance) decomposition:
$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j) + \sum_{k<l} g_{kl}(x_k, x_l) + \cdots$$
• If only the first-order terms are kept, one-dimensional local regression is all that is needed (a backfitting sketch follows):
$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j)$$
Probability Density Function Estimation
• In many classification or regression problems we want to estimate
probability densities – recall the earlier instances
• So can we not estimate a probability density directly, given
some samples xi from it?
• Local method of density estimation:
$$\hat f(x_0) = \frac{\#\{x_i \in \mathrm{Nbhood}(x_0)\}}{N\lambda}$$
where $\mathrm{Nbhood}(x_0)$ is a small neighborhood of width $\lambda$ around $x_0$
• This estimate is typically bumpy, non-smooth (why?)
Smooth PDF Estimation using Kernels
• Parzen method:
$$\hat f(x_0) = \frac{1}{N}\sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
• Gaussian kernel:
$$K_\lambda(x_0, x_i) = \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(x_0 - x_i)^2}{2\lambda^2}\right)$$
• In p dimensions:
$$\hat f_X(x_0) = \frac{1}{N\,(2\pi\lambda^2)^{p/2}}\sum_{i=1}^{N} \exp\!\left(-\tfrac{1}{2}\bigl(\lVert x_i - x_0\rVert/\lambda\bigr)^2\right)$$
[Figure: kernel density estimation]
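A sketch of the p-dimensional Gaussian Parzen estimate above; the function name is illustrative.

```python
import numpy as np

def parzen_gaussian(x0, X, lam):
    """f_hat(x0) = 1 / (N (2 pi lam^2)^(p/2)) * sum_i exp(-||xi - x0||^2 / (2 lam^2))."""
    N, p = X.shape
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-0.5 * sq_dist / lam**2).sum() / (N * (2 * np.pi * lam**2) ** (p / 2))
```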
Using Kernel Density Estimates in Classification
Posterior probability density:
$$\hat P(G = j \mid X = x_0) = \frac{\hat\pi_j \hat f_j(x_0)}{\sum_{l=1}^{K} \hat\pi_l \hat f_l(x_0)}$$
In order to estimate this density, we can estimate the class-conditional densities using the Parzen method, where $f_j(x) = p(x \mid G = j)$ is the jth class-conditional density and $\hat\pi_j$ is the estimated prior probability of class $j$.
[Figure: class-conditional densities]
Ratio of posteriors:
$$\frac{\hat P(G = 1 \mid X = x)}{\hat P(G = 2 \mid X = x)} = \frac{\hat\pi_1 \hat f_1(x)}{\hat\pi_2 \hat f_2(x)}$$
Naive Bayes Classifier
• In Bayesian classification we need to estimate the class-conditional densities:
$$f_j(x) = p(x \mid G = j)$$
• What if the input space x is multidimensional?
• If we apply kernel density estimates, we will run into the same problems that we faced in high dimensions
• To avoid these difficulties, assume that the class-conditional density factorizes:
$$f_j(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid G = j)$$
• In other words, we are assuming here that the features are independent – the Naïve Bayes model
• Advantages:
– Each class density for each feature can be estimated separately (low variance)
– If some of the features are continuous and some are discrete, this method can seamlessly handle the situation
• The Naïve Bayes classifier works surprisingly well for many problems (why?)
• The discriminant function is now a generalized additive model (in the log posterior odds)
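A sketch of a Naïve Bayes classifier whose one-dimensional factors p(x_i | G = j) are Gaussian-kernel Parzen estimates; it assumes all features are continuous and works in log space for numerical stability. The names are illustrative, not from the notes.

```python
import numpy as np

def naive_bayes_kde(x0, X, g, lam):
    """Posterior scores under the factorized model f_j(x) = prod_i p(x_i | G = j),
    each factor a 1-D Gaussian Parzen estimate."""
    classes = np.unique(g)
    log_post = []
    for j in classes:
        Xj = X[g == j]
        lp = np.log(np.mean(g == j))                       # log prior pi_j_hat
        for i in range(X.shape[1]):                        # independent features
            k = np.exp(-0.5 * ((Xj[:, i] - x0[i]) / lam) ** 2) / (np.sqrt(2 * np.pi) * lam)
            lp += np.log(k.mean() + 1e-300)                # 1-D Parzen density of feature i
        log_post.append(lp)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())               # normalize in log space
    return classes, post / post.sum()
```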
Key Points
• Local assumption
• Usually bandwidth (λ) selection is more important than
kernel function selection
• Low bias, low variance usually not guaranteed in high
dimensions
• Little training and high online computational complexity
– Use sparingly: only when really required, such as in the high-confusion zone
– Use when the model may not be used again: no need for a
training phase