Kernel Methods
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Kernel Methods: Key Points
• Essentially a local regression (function estimation/fitting) technique
• Only the observations (training set) close to the query point are
considered for regression computation
• While regressing, an observation point gets a weight that decreases
as its distance from the query point increases
• The resulting regression function is smooth
• All these features of this regression are made possible by a function
called a kernel
• Requires very little training (i.e., not many parameters to compute
offline from the training set, not much offline computation needed)
• This kind of regression is known as a memory-based technique because it
requires the entire training set to be available at prediction time
One-Dimensional Kernel Smoothers
• We have seen that the k-nearest-neighbor average directly estimates the regression function E(Y | X = x)
• k-nn assigns equal weight to
all points in neighborhood
• The average curve is bumpy
and discontinuous
• Rather than give equal weight,
assign weights that decrease
smoothly with distance from
the target points
fˆ x  Ave yi | xi  Nk x
Nadaraya-Watson Kernel-Weighted Average
• N-W kernel-weighted average:
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad \text{where } K_\lambda(x_0, x) = D\!\left(\frac{\lvert x - x_0\rvert}{h_\lambda(x_0)}\right)$$
• K is a kernel function:
Any smooth function K such that
K  x   0,
 K x dx  1,
2


xK
x
dx

0
and
x

 K x dx  0
Typically K is also symmetric about 0
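A small NumPy sketch of the Nadaraya-Watson average, using the Epanechnikov kernel from the next slide with constant width h_lambda(x0) = lambda; the function names and the fallback to the global mean for an empty window are my choices, not part of the notes.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, and 0 otherwise."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam, D=epanechnikov):
    """Kernel-weighted average: sum_i K_lam(x0, xi) yi / sum_i K_lam(x0, xi)."""
    w = D(np.abs(x - x0) / lam)                  # weights K_lambda(x0, xi)
    s = w.sum()
    return (w @ y) / s if s > 0 else y.mean()    # guard against an empty window
```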
Some Points About Kernels
• hλ(x0) is a width function also dependent on λ
• For the N-W kernel average hλ(x0) = λ
• For k-nn average hλ(x0) = |x0-x[k]|, where x[k] is
the kth closest xi to x0
• λ determines the width of local neighborhood
and degree of smoothness
• λ also controls the tradeoff between bias and
variance
– Larger λ gives lower variance but higher bias (why?)
• λ is computed from the training data (how? e.g., by cross-validation, as in the sketch below)
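One common (though not the only) way to choose λ from the training data is leave-one-out cross-validation. The sketch below assumes a smoother with the hypothetical interface `smoother(x0, x, y, lam)`, such as the Nadaraya-Watson function sketched earlier.

```python
import numpy as np

def loocv_bandwidth(x, y, candidate_lams, smoother):
    """Leave-one-out CV: for each candidate width, predict each yi from the
    remaining N-1 points and keep the lambda with the smallest mean squared
    prediction error."""
    best_lam, best_err = None, np.inf
    n = len(x)
    for lam in candidate_lams:
        sq_errs = []
        for i in range(n):
            mask = np.arange(n) != i                      # leave point i out
            pred = smoother(x[i], x[mask], y[mask], lam)
            sq_errs.append((y[i] - pred) ** 2)
        err = np.mean(sq_errs)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```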
Example Kernel Functions
• Epanechnikov quadratic kernel (used in the N-W method):
$$K_\lambda(x_0, x) = D\!\left(\frac{\lvert x - x_0\rvert}{\lambda}\right), \qquad D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } \lvert t\rvert \le 1; \\ 0 & \text{otherwise.} \end{cases}$$
• Tri-cube kernel:
$$K_\lambda(x_0, x) = D\!\left(\frac{\lvert x - x_0\rvert}{\lambda}\right), \qquad D(t) = \begin{cases} (1 - \lvert t\rvert^3)^3 & \text{if } \lvert t\rvert \le 1; \\ 0 & \text{otherwise.} \end{cases}$$
• Gaussian kernel:
$$K_\lambda(x_0, x) = \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)$$
Kernel Characteristics
• Compact – vanishes beyond a finite range (e.g., Epanechnikov, tri-cube)
• Everywhere differentiable (e.g., Gaussian, tri-cube)
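For completeness, illustrative implementations of the tri-cube and Gaussian kernels (the Epanechnikov case appears in the Nadaraya-Watson sketch earlier); the function names are mine.

```python
import numpy as np

def tricube(t):
    """Tri-cube: D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0 (compact support)."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def gaussian_kernel(x0, x, lam):
    """Gaussian: K_lam(x0, x) = exp(-(x - x0)^2 / (2 lam^2)) / (sqrt(2 pi) lam);
    not compact, but everywhere differentiable."""
    return np.exp(-((x - x0) ** 2) / (2 * lam**2)) / (np.sqrt(2 * np.pi) * lam)
```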
Local Linear Regression
• In kernel-weighted average method estimated function value
has a high bias at the boundary
• This high bias is a result of the asymmetry at the boundary
• The bias can also be present in the interior when the x values
in the training set are not equally spaced
• Fitting straight lines rather than constants locally helps us to
remove bias (why?)
Locally Weighted Linear Regression
• Least-squares solution:
$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - \alpha(x_0) - \beta(x_0)\,x_i\bigr]^2$$
$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\,x_0 = b(x_0)^T \bigl(B^T W(x_0) B\bigr)^{-1} B^T W(x_0)\,y = \sum_{i=1}^{N} l_i(x_0)\,y_i$$
where
– vector-valued function: $b(x)^T = (1, x)$
– $N \times 2$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• Note that the estimate is linear in the $y_i$
• The weights $l_i(x_0)$ are sometimes referred to as the equivalent kernel
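A short sketch of this locally weighted linear fit; the names are illustrative, and `D` is any of the kernel profiles above.

```python
import numpy as np

def local_linear(x0, x, y, lam, D):
    """f_hat(x0) = b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0) y with b(x) = (1, x)."""
    w = D(np.abs(x - x0) / lam)                  # diagonal of W(x0)
    B = np.column_stack([np.ones_like(x), x])    # N x 2 regression matrix
    BtW = B.T * w                                # B^T W(x0)
    coef = np.linalg.solve(BtW @ B, BtW @ y)     # (alpha_hat(x0), beta_hat(x0))
    return np.array([1.0, x0]) @ coef            # b(x0)^T coefficients
```

Since the same row vector $b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0)$ multiplies $y$, its entries are exactly the equivalent-kernel weights $l_i(x_0)$.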
Bias Reduction in Local Linear Regression
• Local linear regression automatically modifies the kernel to
correct the bias exactly to the first order
Write a Taylor series expansion of $f(x_i)$ about $x_0$:
$$E\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, f(x_i) = f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$= f(x_0) + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$\text{bias} = E\hat f(x_0) - f(x_0) = \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
since
$$\sum_{i=1}^{N} l_i(x_0) = 1 \quad \text{and} \quad \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$$
(see Ex. 6.2 in [HTF]); here $R$ collects the third- and higher-order terms of the expansion.
Local Polynomial Regression
• Why have a polynomial for the local fit? What would be
the rationale?
• Least-squares solution:
$$\min_{\alpha(x_0),\,\beta_j(x_0),\,j=1,\dots,d} \;\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\Bigl[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j\Bigr]^2$$
$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^j = b(x_0)^T \bigl(B^T W(x_0) B\bigr)^{-1} B^T W(x_0)\,y = \sum_{i=1}^{N} l_i(x_0)\,y_i$$
where
– vector-valued function: $b(x)^T = (1, x, \dots, x^d)$
– $N \times (d+1)$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• We will gain on bias; however we will pay the price in
terms of variance (why?)
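The local linear sketch above generalizes to a degree-d local polynomial by swapping the basis; the sketch below is illustrative, using np.vander to build the rows b(x_i) = (1, x_i, ..., x_i^d).

```python
import numpy as np

def local_poly(x0, x, y, lam, D, d=2):
    """Degree-d local polynomial fit at x0; d=1 recovers the local linear fit."""
    w = D(np.abs(x - x0) / lam)                     # diagonal of W(x0)
    B = np.vander(x, N=d + 1, increasing=True)      # ith row: (1, xi, ..., xi^d)
    BtW = B.T * w
    coef = np.linalg.solve(BtW @ B, BtW @ y)
    b0 = x0 ** np.arange(d + 1)                     # b(x0) = (1, x0, ..., x0^d)
    return b0 @ coef
```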
Bias and Variance Tradeoff
• As the degree of local polynomial regression increases, bias
decreases and variance increases
• Local linear fits can help reduce bias significantly at the boundaries
at a modest cost in variance
• Local quadratic fits tend to be most helpful in reducing bias due to
curvature in the interior of the domain
• So, would it be helpful to have a mixture of linear and quadratic local
fits?
Local Regression in Higher Dimensions
• We can extend 1D local regression to higher dimensions:
$$\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\bigl[y_i - b(x_i)^T \beta(x_0)\bigr]^2, \qquad K_\lambda(x_0, x) = D\!\left(\frac{\lVert x - x_0\rVert}{\lambda}\right)$$
$$\hat f(x_0) = b(x_0)^T \hat\beta(x_0) = b(x_0)^T \bigl(B^T W(x_0) B\bigr)^{-1} B^T W(x_0)\,y = \sum_{i=1}^{N} l_i(x_0)\,y_i$$
For $p$ dimensions and degree $d$:
– vector-valued function $b(x)$ of all polynomial terms of total degree up to $d$ in the $p$ coordinates, including the constant term; let its length be $H$
– $N \times H$ regression matrix $B$ with $i$th row $b(x_i)^T$
– $N \times N$ diagonal matrix $W(x_0)$ with $i$th diagonal element $K_\lambda(x_0, x_i)$
• Standardize each coordinate in the kernel, because the Euclidean (squared)
norm is affected by scaling
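A sketch of the p-dimensional local linear fit (degree d = 1) on standardized coordinates with a radial Gaussian weight; the interface and the choice of Gaussian profile are mine, not from the notes.

```python
import numpy as np

def local_linear_pdim(x0, X, y, lam):
    """Local linear fit in p dimensions: b(x) = (1, x_1, ..., x_p),
    weight K_lam(x0, x) = exp(-||x - x0||^2 / (2 lam^2)) on standardized inputs."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z, z0 = (X - mu) / sd, (x0 - mu) / sd        # standardize each coordinate
    w = np.exp(-0.5 * (np.linalg.norm(Z - z0, axis=1) / lam) ** 2)
    B = np.column_stack([np.ones(len(Z)), Z])    # N x (p + 1) regression matrix
    BtW = B.T * w                                # B^T W(x0)
    coef = np.linalg.solve(BtW @ B, BtW @ y)
    return np.concatenate(([1.0], z0)) @ coef    # b(z0)^T coefficients
```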
Local Regression: Issues in Higher Dimensions
• The boundary poses even a greater problem in higher
dimensions
– Many training points are required to reduce the bias; the sample
size must grow exponentially in p to maintain the same
performance.
• Local regression becomes less useful when dimensions
go beyond 2 or 3
• It’s impossible to maintain localness (low bias) and
sizeable samples (low variance) at the same time
Combating Dimensions: Structured Kernels
• In high dimensions, input variables (i.e., x variables)
could be very much correlated. This correlation could be
a key to reduce the dimensionality while performing
kernel regression.
• Let A be a positive semidefinite matrix (what does that
mean?). Let’s now consider a kernel that looks like:
$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\,(x - x_0)}{\lambda}\right)$$
• If $A = \Sigma^{-1}$, the inverse of the covariance matrix of the input
variables, then the correlation structure is captured
• Further, one can take only a few principal components of
A to reduce the dimensionality
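A minimal sketch of such a structured kernel; A would typically be (an estimate of) the inverse input covariance, and D any decreasing profile — the Gaussian-like choice here is just for illustration.

```python
import numpy as np

def structured_kernel(x0, x, A, lam):
    """K_{lam, A}(x0, x) = D((x - x0)^T A (x - x0) / lam), with D(t) = exp(-t / 2)."""
    d = x - x0
    return np.exp(-0.5 * (d @ A @ d) / lam)

# Example: A as the inverse sample covariance of the inputs X (an N x p array)
# A = np.linalg.inv(np.cov(X, rowvar=False))
```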
Combating Dimensions: Low Order Additive Models
• ANOVA (analysis of variance) decomposition:
$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j) + \sum_{k<l} g_{kl}(x_k, x_l) + \cdots$$
• If only the first-order terms are kept, one-dimensional local regression is all that is needed (a backfitting sketch follows):
$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j)$$
Probability Density Function Estimation
• In many classification or regression problems we want to estimate
probability densities – recall the earlier instances
• So can we not estimate a probability density directly, given
some samples xi from it?
• Local method of density estimation:
$$\hat f(x_0) = \frac{\#\{x_i \in \mathrm{Nbhood}(x_0)\}}{N\lambda}$$
where $\mathrm{Nbhood}(x_0)$ is a small neighborhood of width $\lambda$ around $x_0$
• This estimate is typically bumpy, non-smooth (why?)
Smooth PDF Estimation using Kernels
• Parzen method:
$$\hat f(x_0) = \frac{1}{N}\sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
• Gaussian kernel:
$$K_\lambda(x_0, x_i) = \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(x_0 - x_i)^2}{2\lambda^2}\right)$$
• In p dimensions:
$$\hat f_X(x_0) = \frac{1}{N\,(2\pi\lambda^2)^{p/2}}\sum_{i=1}^{N} \exp\!\left(-\tfrac{1}{2}\bigl(\lVert x_i - x_0\rVert/\lambda\bigr)^2\right)$$
[Figure: kernel density estimation]
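A sketch of the p-dimensional Gaussian Parzen estimate above; the function name is illustrative.

```python
import numpy as np

def parzen_gaussian(x0, X, lam):
    """f_hat(x0) = 1 / (N (2 pi lam^2)^(p/2)) * sum_i exp(-||xi - x0||^2 / (2 lam^2))."""
    N, p = X.shape
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-0.5 * sq_dist / lam**2).sum() / (N * (2 * np.pi * lam**2) ** (p / 2))
```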
Using Kernel Density Estimates in Classification
Posterior probability density:
$$\hat P(G = j \mid X = x_0) = \frac{\hat\pi_j \hat f_j(x_0)}{\sum_{l=1}^{K} \hat\pi_l \hat f_l(x_0)}$$
In order to estimate this density, we can estimate the class-conditional densities using the Parzen method, where $f_j(x) = p(x \mid G = j)$ is the jth class-conditional density and $\hat\pi_j$ is the estimated prior probability of class $j$.
[Figure: class-conditional densities]
Ratio of posteriors:
$$\frac{\hat P(G = 1 \mid X = x)}{\hat P(G = 2 \mid X = x)} = \frac{\hat\pi_1 \hat f_1(x)}{\hat\pi_2 \hat f_2(x)}$$
Naive Bayes Classifier
• In Bayesian classification we need to estimate the class-conditional densities:
$$f_j(x) = p(x \mid G = j)$$
• What if the input space x is multidimensional?
• If we apply kernel density estimates, we will run into the same problems that we faced in high dimensions
• To avoid these difficulties, assume that the class-conditional density factorizes:
$$f_j(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid G = j)$$
• In other words, we are assuming here that the features are independent – the Naïve Bayes model
• Advantages:
– Each class density for each feature can be estimated separately (low variance)
– If some of the features are continuous and some are discrete, this method can seamlessly handle the situation
• The Naïve Bayes classifier works surprisingly well for many problems (why?)
• The discriminant function is now a generalized additive model (in the log posterior odds)
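A sketch of a Naïve Bayes classifier whose one-dimensional factors p(x_i | G = j) are Gaussian-kernel Parzen estimates; it assumes all features are continuous and works in log space for numerical stability. The names are illustrative, not from the notes.

```python
import numpy as np

def naive_bayes_kde(x0, X, g, lam):
    """Posterior scores under the factorized model f_j(x) = prod_i p(x_i | G = j),
    each factor a 1-D Gaussian Parzen estimate."""
    classes = np.unique(g)
    log_post = []
    for j in classes:
        Xj = X[g == j]
        lp = np.log(np.mean(g == j))                       # log prior pi_j_hat
        for i in range(X.shape[1]):                        # independent features
            k = np.exp(-0.5 * ((Xj[:, i] - x0[i]) / lam) ** 2) / (np.sqrt(2 * np.pi) * lam)
            lp += np.log(k.mean() + 1e-300)                # 1-D Parzen density of feature i
        log_post.append(lp)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())               # normalize in log space
    return classes, post / post.sum()
```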
Key Points
• Local assumption
• Usually bandwidth (λ) selection is more important than
kernel function selection
• Low bias, low variance usually not guaranteed in high
dimensions
• Little training and high online computational complexity
– Use sparingly: only when really required, such as in the high-confusion zone
– Use when the model may not be used again: no need for a
training phase