Kernel Methods
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Kernel Methods: Key Points
• Essentially a local regression (function estimation/fitting) technique
• Only the observations (training set) close to the query point are
considered for regression computation
• While regressing, an observation point gets a weight that decreases
as its distance from the query point increases
• The resulting regression function is smooth
• All these features of this regression are made possible by a function
called kernel
• Requires very little training (i.e., not many parameters to compute
offline from the training set, not much offline computation needed)
• This kind of regression is known as a memory-based technique, as it requires the entire training set to be available at the time of regression
One-Dimensional Kernel Smoothers
• We have seen that the k-nearest-neighbor average directly estimates E(Y | X = x)
• k-nn assigns equal weight to
all points in neighborhood
• The average curve is bumpy
and discontinuous
• Rather than give equal weight,
assign weights that decrease
smoothly with distance from
the target points
$$\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$
Nadaraya-Watson Kernel-Weighted Average
• N-W kernel-weighted average:
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
where
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
• K is a kernel function: any smooth function K such that
$$K(x) \ge 0, \quad \int K(x)\,dx = 1, \quad \int x\,K(x)\,dx = 0, \quad \int x^2 K(x)\,dx > 0$$
Typically K is also symmetric about 0
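As a concrete illustration, here is a minimal NumPy sketch of the N-W estimator, assuming a Gaussian kernel and a fixed bandwidth lam; the function names and the toy data are illustrative, not part of the notes:

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x) for a Gaussian kernel with bandwidth lam."""
    return np.exp(-0.5 * ((x - x0) / lam) ** 2) / (np.sqrt(2 * np.pi) * lam)

def nadaraya_watson(x0, x_train, y_train, lam=0.2):
    """Kernel-weighted average of the training responses at the query point x0."""
    w = gaussian_kernel(x0, x_train, lam)
    return np.sum(w * y_train) / np.sum(w)

# Toy usage: noisy sine data, estimate f at a single query point.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 100)
y_train = np.sin(4 * x_train) + 0.3 * rng.standard_normal(100)
print(nadaraya_watson(0.5, x_train, y_train, lam=0.1))
```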
Some Points About Kernels
• hλ(x0) is a width function also dependent on λ
• For the N-W kernel average hλ(x0) = λ
• For k-nn average hλ(x0) = |x0-x[k]|, where x[k] is
the kth closest xi to x0
• λ determines the width of local neighborhood
and degree of smoothness
• λ also controls the tradeoff between bias and
variance
– Larger λ gives lower variance but higher bias (why?)
• λ is computed from training data (how?)
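One standard answer to the "(how?)" above is cross-validation. Below is a hedged sketch that picks λ by leave-one-out cross-validation of the N-W fit; the Gaussian weights and the candidate grid are assumptions of this sketch, not prescribed by the notes:

```python
import numpy as np

def loo_cv_error(x, y, lam):
    """Leave-one-out squared error of the N-W smoother for a given bandwidth."""
    err = 0.0
    for i in range(len(x)):
        mask = np.arange(len(x)) != i               # drop the i-th observation
        w = np.exp(-0.5 * ((x[mask] - x[i]) / lam) ** 2)
        err += (y[i] - np.sum(w * y[mask]) / np.sum(w)) ** 2
    return err / len(x)

def select_bandwidth(x, y, grid=(0.05, 0.1, 0.2, 0.4, 0.8)):
    """Return the candidate bandwidth with the smallest leave-one-out error."""
    return min(grid, key=lambda lam: loo_cv_error(x, y, lam))
```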
Example Kernel functions
• Epanechnikov quadratic kernel (used in the N-W method):
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases}$$
• Tri-cube kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases}$$
• Gaussian kernel:
$$K_\lambda(x_0, x) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)$$
Kernel characteristics:
• Compact – vanishes beyond a finite range (e.g., Epanechnikov, tri-cube)
• Everywhere differentiable (e.g., Gaussian, tri-cube)
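The three D(t) profiles above, written out as a small vectorized sketch (function names are illustrative):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) on |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 on |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t) ** 3) ** 3, 0.0)

def gaussian(t):
    """Standard normal density; K_lambda(x0, x) then carries the 1/lambda factor."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

# For the compact kernels: K_lambda(x0, x) = D(abs(x - x0) / lam).
```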
Local Linear Regression
• In the kernel-weighted average method, the estimated function value has a high bias at the boundary
• This high bias is a result of the asymmetry at the boundary
• The bias can also be present in the interior when the x values
in the training set are not equally spaced
• Fitting straight lines rather than constants locally helps us to
remove bias (why?)
Locally Weighted Linear Regression
• Least squares solution:
$$\min_{\alpha(x_0),\, \beta(x_0)} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \left[\, y_i - \alpha(x_0) - \beta(x_0)\, x_i \,\right]^2$$
$$\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)\, x_0 = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where:
– vector-valued function $b(x)^T = (1, x)$
– $N \times 2$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• Note that the estimate is linear in the yi
• The weights li(x0) are sometimes referred to as the equivalent kernel
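A minimal sketch of the closed form above at a single query point x0, assuming an Epanechnikov kernel and enough training points inside the kernel window (otherwise the weighted system is singular); names are illustrative:

```python
import numpy as np

def local_linear(x0, x_train, y_train, lam=0.3):
    """Fit a kernel-weighted straight line around x0 and evaluate it at x0."""
    t = np.abs(x_train - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)          # Epanechnikov weights
    B = np.column_stack([np.ones_like(x_train), x_train])    # rows b(x_i)^T = (1, x_i)
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y_train)   # (B^T W B)^{-1} B^T W y
    return np.array([1.0, x0]) @ beta                        # b(x0)^T beta_hat
```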
Bias Reduction in Local Linear Regression
• Local linear regression automatically modifies the kernel to correct the bias exactly to first order
$$E\hat{f}(x_0) = \sum_{i=1}^{N} l_i(x_0)\, f(x_i)$$
Writing a Taylor series expansion of f(xi) about x0:
$$= f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$\text{bias} = E\hat{f}(x_0) - f(x_0) = \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
since $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$ (Ex. 6.2 in [HTF])
Local Polynomial Regression
• Why have a polynomial for the local fit? What would be the rationale?
$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i) \left[\, y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j \,\right]^2$$
$$\hat{f}(x_0) = \hat{\alpha}(x_0) + \sum_{j=1}^{d} \hat{\beta}_j(x_0)\, x_0^j = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where:
– vector-valued function $b(x)^T = (1, x, \dots, x^d)$
– $N \times (d+1)$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• We will gain on bias; however, we will pay the price in terms of variance (why?)
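The same machinery with a degree-d local fit can be sketched with np.polyfit, whose w argument weights the residuals, so the square root of the kernel weights reproduces the kernel-weighted least squares above. The kernel choice, defaults, and names are assumptions; at least d+1 points with nonzero weight are required:

```python
import numpy as np

def local_poly(x0, x_train, y_train, lam=0.3, d=2):
    """Kernel-weighted degree-d polynomial fit, evaluated at the query point x0."""
    t = np.abs(x_train - x0) / lam
    w = np.where(t <= 1, 0.75 * (1 - t ** 2), 0.0)   # Epanechnikov weights
    keep = w > 0                                      # only points inside the window
    coeffs = np.polyfit(x_train[keep], y_train[keep], deg=d, w=np.sqrt(w[keep]))
    return np.polyval(coeffs, x0)
```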
Bias and Variance Tradeoff
• As the degree of local polynomial regression increases, bias
decreases and variance increases
• Local linear fits can help reduce bias significantly at the boundaries
at a modest cost in variance
• Local quadratic fits tend to be most helpful in reducing bias due to
curvature in the interior of the domain
• So, would it be helpful to have a mixture of linear and quadratic local fits?
Local Regression in Higher Dimensions
• We can extend 1D local regression to higher dimensions:
$$\min_{\beta(x_0)} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \left(\, y_i - b(x_i)^T \beta(x_0) \,\right)^2, \qquad K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right)$$
$$\hat{f}(x_0) = b(x_0)^T \hat{\beta}(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where, in p dimensions with degree d:
– $1 \times H$ vector-valued function $b(x)$ collecting all polynomial terms of degree at most d (including the constant)
– $N \times H$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• Standardize each coordinate in the kernel, because the Euclidean (squared) norm is affected by scaling
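A sketch in p dimensions with d = 1, so b(x) = (1, x1, ..., xp), with each coordinate standardized before taking the Euclidean norm as suggested above; the Gaussian radial kernel and the names are assumptions:

```python
import numpy as np

def local_linear_pd(x0, X, y, lam=1.0):
    """Local linear fit at x0 for inputs X of shape (N, p)."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z, z0 = (X - mu) / sd, (x0 - mu) / sd             # standardize each coordinate
    w = np.exp(-0.5 * (np.linalg.norm(Z - z0, axis=1) / lam) ** 2)
    B = np.column_stack([np.ones(len(Z)), Z])         # rows b(x_i)^T = (1, x_i)
    WB = w[:, None] * B                               # W(x0) B without forming W explicitly
    beta = np.linalg.solve(B.T @ WB, B.T @ (w * y))
    return np.concatenate(([1.0], z0)) @ beta         # b(x0)^T beta_hat
```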
Local Regression: Issues in Higher Dimensions
• The boundary poses an even greater problem in higher dimensions
– Many training points are required to reduce the bias; the sample size must grow exponentially in p to match the same performance
• Local regression becomes less useful when dimensions
go beyond 2 or 3
• It’s impossible to maintain localness (low bias) and sizeable samples (low variance) at the same time
Combating Dimensions: Structured Kernels
• In high dimensions, the input variables (i.e., the x variables) can be highly correlated. This correlation can be exploited to reduce the effective dimensionality when performing kernel regression.
• Let A be a positive semidefinite matrix (what does that mean?). Let’s now consider a kernel that looks like:
$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{\sqrt{(x - x_0)^T A\, (x - x_0)}}{\lambda}\right)$$
• If $A = \Sigma^{-1}$, the inverse of the covariance matrix of the input variables, then the correlation structure is captured
• Further, one can take only a few principal components of
A to reduce the dimensionality
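A sketch of the structured kernel with A taken as the inverse sample covariance of the inputs (a Mahalanobis-type distance); the Gaussian choice of D and the function name are assumptions:

```python
import numpy as np

def structured_kernel(x0, X, lam=1.0):
    """K_{lambda,A}(x0, x_i) for all rows x_i of X, with A = inverse covariance."""
    A = np.linalg.inv(np.cov(X, rowvar=False))        # A = Sigma^{-1}
    diff = X - x0
    d2 = np.einsum('ij,jk,ik->i', diff, A, diff)      # (x_i - x0)^T A (x_i - x0)
    return np.exp(-0.5 * d2 / lam ** 2)               # Gaussian D applied to the distance
```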
Combating Dimensions: Low Order Additive Models
• ANOVA (analysis of variance) decomposition:
$$f(x_1, x_2, \dots, x_p) = \sum_{j=1}^{p} g_j(x_j) + \sum_{k < l} g_{kl}(x_k, x_l) + \dots$$
• For the first-order (additive) model, one-dimensional local regression is all that is needed:
$$f(x_1, x_2, \dots, x_p) = \sum_{j=1}^{p} g_j(x_j)$$
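A minimal backfitting sketch for the additive model above, fitting each g_j with a 1D N-W smoother. The intercept alpha, the Gaussian kernel, the fixed bandwidth, and the fixed iteration count are assumptions of this sketch:

```python
import numpy as np

def nw_smooth(x_query, x, r, lam):
    """1D Nadaraya-Watson smooth of responses r, evaluated at the points x_query."""
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / lam) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit_additive(X, y, lam=0.5, n_iter=10):
    """Fit f(x) = alpha + sum_j g_j(x_j) by cycling 1D smoothers over the features."""
    N, p = X.shape
    alpha = y.mean()
    g = np.zeros((N, p))                                   # g_j evaluated at the training points
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - g.sum(axis=1) + g[:, j]  # residual with g_j removed
            g[:, j] = nw_smooth(X[:, j], X[:, j], partial, lam)
            g[:, j] -= g[:, j].mean()                      # center each g_j for identifiability
    return alpha, g
```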
Probability Density Function Estimation
• In many classification or regression problems we desperately want to estimate probability densities – recall the earlier instances
• So can we not estimate a probability density directly, given some samples xi from it?
• A local method of density estimation:
$$\hat{f}(x_0) = \frac{\#\{x_i \in \text{Nbhood}(x_0)\}}{N\lambda}$$
where Nbhood(x0) is a small neighborhood of width λ around x0
• This estimate is typically bumpy, non-smooth (why?)
Smooth PDF Estimation using Kernels
• Parzen method:
$$\hat{f}(x_0) = \frac{1}{N} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
• Gaussian kernel:
$$K_\lambda(x_0, x_i) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{(x_0 - x_i)^2}{2\lambda^2}\right)$$
• In p dimensions:
$$\hat{f}_X(x_0) = \frac{1}{N\,(2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}\left(\|x_i - x_0\| / \lambda\right)^2}$$
[Figure: kernel density estimation]
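A sketch of the p-dimensional Parzen estimate above (the function name and bandwidth default are illustrative):

```python
import numpy as np

def parzen_gaussian_kde(x0, X, lam=0.5):
    """Smooth density estimate at x0 from samples X of shape (N, p)."""
    N, p = X.shape
    d2 = np.sum((X - x0) ** 2, axis=1)                 # ||x_i - x0||^2
    norm = N * (2 * np.pi * lam ** 2) ** (p / 2)
    return np.sum(np.exp(-0.5 * d2 / lam ** 2)) / norm
```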
Using Kernel Density Estimates in Classification
Posterior probability density: in order to estimate this density, we can estimate the class conditional densities using the Parzen method
$$\hat{P}(G = j \mid X = x_0) = \frac{\hat{\pi}_j\, \hat{f}_j(x_0)}{\sum_{l=1}^{K} \hat{\pi}_l\, \hat{f}_l(x_0)}$$
where $\hat{f}_j(x) \approx p(x \mid G = j)$ is the jth class conditional density and $\hat{\pi}_j$ is the class prior
[Figure: class conditional densities and the resulting ratio of posteriors]
Ratio of posteriors:
$$\frac{\hat{P}(G = 1 \mid X = x)}{\hat{P}(G = 2 \mid X = x)} = \frac{\hat{\pi}_1\, \hat{f}_1(x)}{\hat{\pi}_2\, \hat{f}_2(x)}$$
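A sketch of the posterior estimate above: one Parzen estimate per class combined with empirical class priors. It reuses parzen_gaussian_kde from the previous sketch; the names and the shared bandwidth are assumptions:

```python
import numpy as np

def kde_posterior(x0, X, labels, lam=0.5):
    """P_hat(G = j | X = x0) from class-wise Parzen estimates and class priors."""
    classes = np.unique(labels)
    priors = np.array([np.mean(labels == c) for c in classes])                   # pi_hat_j
    dens = np.array([parzen_gaussian_kde(x0, X[labels == c], lam) for c in classes])
    post = priors * dens
    return classes, post / post.sum()                                            # normalized posteriors
```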
Naive Bayes Classifier
• In Bayesian classification we need to estimate the class conditional densities: $f_j(x) = p(x \mid G = j)$
• What if the input space x is multidimensional? If we apply kernel density estimates, we will run into the same problems that we faced in high dimensions
• To avoid these difficulties, assume that the class conditional density factorizes:
$$f_j(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid G = j)$$
• In other words, we are assuming here that the features are independent – the Naïve Bayes model
• Advantages:
– Each one-dimensional class density for each feature can be estimated separately (low variance)
– If some of the features are continuous and some are discrete, this method can handle the situation seamlessly
• The Naïve Bayes classifier works surprisingly well for many problems (why?)
• The discriminant function is now a generalized additive model
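A sketch of the factorized version: each class conditional density is a product of p one-dimensional Parzen estimates, which sidesteps the high-dimensional KDE problem. The function names, shared bandwidth, and log-space normalization are assumptions of this sketch:

```python
import numpy as np

def kde_1d(x0, x, lam):
    """One-dimensional Gaussian Parzen estimate at the scalar point x0."""
    return np.mean(np.exp(-0.5 * ((x - x0) / lam) ** 2)) / (np.sqrt(2 * np.pi) * lam)

def naive_bayes_kde(x0, X, labels, lam=0.5):
    """Class posteriors under the factorized density f_j(x) = prod_k p(x_k | G = j)."""
    classes = np.unique(labels)
    log_post = []
    for c in classes:
        Xc = X[labels == c]
        log_prior = np.log(np.mean(labels == c))
        log_lik = sum(np.log(kde_1d(x0[k], Xc[:, k], lam)) for k in range(X.shape[1]))
        log_post.append(log_prior + log_lik)
    post = np.exp(np.array(log_post) - np.max(log_post))   # stabilize before normalizing
    return classes, post / post.sum()
```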
Key Points
• Local assumption
• Usually, bandwidth (λ) selection is more important than kernel function selection
• Low bias, low variance usually not guaranteed in high dimensions
• Little training and high online computational complexity
– Use sparingly: only when really required, such as in the high-confusion zone
– Use when model may not be used again: No need for the
training phase