Kernel Methods
Lecture Notes for CMPUT 466/551
Nilanjan Ray

Kernel Methods: Key Points
• Essentially a local regression (function estimation/fitting) technique
• Only the observations (training set) close to the query point are considered in the regression computation
• While regressing, an observation point gets a weight that decreases as its distance from the query point increases
• The resulting regression function is smooth
• All these features of this regression are made possible by a function called a kernel
• Requires very little training (i.e., not many parameters to compute offline from the training set, not much offline computation needed)
• This kind of regression is known as a memory-based technique because it requires the entire training set to be available while regressing

One-Dimensional Kernel Smoothers
• We have seen that k-nearest-neighbor directly estimates E(Y | X = x) via the neighborhood average
$$\hat f(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$
• k-NN assigns equal weight to all points in the neighborhood
• The resulting average curve is bumpy and discontinuous
• Rather than give equal weight, assign weights that decrease smoothly with distance from the target point

Nadaraya-Watson Kernel-Weighted Average
• N-W kernel-weighted average:
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
• K is a kernel function: any smooth function K such that
$$K(x) \ge 0, \quad \int K(x)\,dx = 1, \quad \int x\,K(x)\,dx = 0, \quad \int x^2 K(x)\,dx > 0$$
• Typically K is also symmetric about 0

Some Points About Kernels
• hλ(x0) is a width function that also depends on λ
• For the N-W kernel average, hλ(x0) = λ
• For the k-NN average, hλ(x0) = |x0 − x[k]|, where x[k] is the k-th closest xi to x0
• λ determines the width of the local neighborhood and the degree of smoothness
• λ also controls the trade-off between bias and variance
  – A larger λ gives lower variance but higher bias (why?)
• λ is computed from the training data (how?)

Example Kernel Functions
• Epanechnikov quadratic kernel (used in the N-W method):
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \tfrac{3}{4}\left(1 - t^2\right) & \text{if } |t| \le 1,\\ 0 & \text{otherwise.} \end{cases}$$
• Tri-cube kernel:
$$D(t) = \begin{cases} \left(1 - |t|^3\right)^3 & \text{if } |t| \le 1,\\ 0 & \text{otherwise.} \end{cases}$$
• Gaussian kernel:
$$K_\lambda(x_0, x) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)$$
• Kernel characteristics:
  – Compact support: vanishes beyond a finite range (Epanechnikov, tri-cube)
  – Everywhere differentiable (Gaussian, tri-cube)

Local Linear Regression
• In the kernel-weighted average method, the estimated function value has a high bias at the boundary
• This high bias is a result of the asymmetry of the kernel at the boundary
• The bias can also be present in the interior when the x values in the training set are not equally spaced
• Fitting straight lines rather than constants locally helps remove this bias (why?)

Locally Weighted Linear Regression
• Least-squares solution (a small sketch appears after the bias-reduction discussion below):
$$\min_{\alpha(x_0),\, \beta(x_0)} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - \alpha(x_0) - \beta(x_0)\, x_i\right]^2$$
$$\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0 = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where b(x) = (1, x)^T is a vector-valued function, B is the N × 2 regression matrix with i-th row b(xi)^T, and W(x0) is the N × N diagonal matrix with i-th diagonal element Kλ(x0, xi).
• Note that the estimate is linear in the yi
• The weights li(x0) are sometimes referred to as the equivalent kernel

Bias Reduction in Local Linear Regression
• Local linear regression automatically modifies the kernel so that the bias is corrected exactly to first order. Writing a Taylor series expansion of f(xi) about x0:
$$E\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, f(x_i) = f(x_0) \sum_{i=1}^{N} l_i(x_0) + f'(x_0) \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$\text{bias} = E\hat f(x_0) - f(x_0) = \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R,$$
since $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$ (Ex. 6.2 in [HTF]).
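Below is a minimal numpy sketch of the N-W average and the locally weighted linear fit described above, using the Epanechnikov kernel. The function names, the toy data, and the fixed bandwidth `lam` are illustrative choices, not anything prescribed in these notes.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 * (1 - t^2) for |t| <= 1, and 0 otherwise."""
    t = np.abs(t)
    return np.where(t <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def nw_average(x0, x, y, lam):
    """Nadaraya-Watson kernel-weighted average at the query point x0."""
    w = epanechnikov((x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    """Locally weighted linear fit at x0:
    f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y with b(x) = (1, x)^T."""
    w = epanechnikov((x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])          # N x 2 regression matrix
    W = np.diag(w)                                     # N x N diagonal weight matrix
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)   # (alpha(x0), beta(x0))
    return np.array([1.0, x0]) @ coef

# Toy data: noisy sine curve on [0, 1]
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + rng.normal(0.0, 0.3, 100)

grid = np.linspace(0.0, 1.0, 50)
f_nw = np.array([nw_average(x0, x, y, lam=0.2) for x0 in grid])
f_ll = np.array([local_linear(x0, x, y, lam=0.2) for x0 in grid])
# Near the boundaries (x0 close to 0 or 1), f_ll is noticeably less biased than f_nw.
```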
Local Polynomial Regression
• Why have a polynomial for the local fit? What would be the rationale?
• Fit a local polynomial of degree d:
$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j = 1, \dots, d} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\Big[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^{\,j}\Big]^2$$
$$\hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0)\, x_0^{\,j} = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where b(x) = (1, x, ..., x^d)^T is a vector-valued function, B is the N × (d+1) regression matrix with i-th row b(xi)^T, and W(x0) is the N × N diagonal matrix with i-th diagonal element Kλ(x0, xi).
• We gain on bias; however, we pay the price in terms of variance (why?)

Bias and Variance Trade-off
• As the degree of the local polynomial increases, bias decreases and variance increases
• Local linear fits can help reduce bias significantly at the boundaries at a modest cost in variance
• Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain
• So, would it be helpful to have a mixture of linear and quadratic local fits?

Local Regression in Higher Dimensions
• We can extend 1-D local regression to higher dimensions:
$$\min_{\beta(x_0)} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left[y_i - b(x_i)^T \beta(x_0)\right]^2, \qquad K_\lambda(x_0, x) = D\!\left(\frac{\lVert x - x_0 \rVert}{\lambda}\right)$$
$$\hat f(x_0) = b(x_0)^T \hat\beta(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
where, in p dimensions with local polynomial degree d, b(x) is the vector of all polynomial terms in x of degree at most d (e.g., b(x) = (1, x1, ..., xp)^T for d = 1), B is the regression matrix whose i-th row is b(xi)^T, and W(x0) is the N × N diagonal matrix with i-th diagonal element Kλ(x0, xi).
• Standardize each coordinate in the kernel, because the Euclidean (squared) norm is affected by scaling

Local Regression: Issues in Higher Dimensions
• The boundary poses an even greater problem in higher dimensions
  – Many training points are required to reduce the bias; the sample size should increase exponentially in p to match the same performance
• Local regression becomes less useful when the dimension goes beyond 2 or 3
• It is impossible to maintain localness (low bias) and sizeable samples (low variance) at the same time

Combating Dimensions: Structured Kernels
• In high dimensions the input variables (i.e., the x variables) can be highly correlated. This correlation can be a key to reducing the dimensionality while performing kernel regression.
• Let A be a positive semidefinite matrix (what does that mean?). Now consider a kernel of the form:
$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{(x - x_0)^T A\, (x - x_0)}{\lambda}\right)$$
• If A = Σ⁻¹, the inverse of the covariance matrix of the input variables, then the correlation structure is captured
• Further, one can keep only a few principal components of A to reduce the dimensionality

Combating Dimensions: Low-Order Additive Models
• ANOVA (analysis of variance) decomposition:
$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j) + \sum_{k < l} g_{kl}(x_k, x_l) + \cdots$$
• If only the one-dimensional (main-effect) terms are kept, one-dimensional local regression is all that is needed:
$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j)$$

Probability Density Function Estimation
• In many classification or regression problems we badly want to estimate probability densities; recall the earlier instances
• So can we not estimate a probability density directly, given some samples xi from it?
• Local method of density estimation, with Nbhood(x0) a small metric neighborhood of width λ around x0 (see the sketch below):
$$\hat f(x_0) = \frac{\#\{x_i \in \mathrm{Nbhood}(x_0)\}}{N\lambda}$$
• This estimate is typically bumpy and non-smooth (why?)
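A minimal sketch of this count-based local estimate, assuming a symmetric window of width `lam` centred at the query point; the toy data and grid are illustrative assumptions.

```python
import numpy as np

def local_count_density(x0, x, lam):
    """Naive local density estimate: the fraction of samples falling in a
    window of width lam centred at x0, divided by the window width."""
    count = np.sum(np.abs(x - x0) <= lam / 2.0)
    return count / (len(x) * lam)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)                 # samples from a standard normal

grid = np.linspace(-3.0, 3.0, 200)
f_hat = np.array([local_count_density(x0, x, lam=0.25) for x0 in grid])
# f_hat is piecewise constant: every sample entering or leaving the window
# changes the estimate by 1/(N*lam), which is why the curve looks bumpy.
```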
Smooth PDF Estimation Using Kernels
• Parzen method:
$$\hat f(x_0) = \frac{1}{N} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
• Gaussian kernel:
$$K_\lambda(x_0, x_i) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{(x_0 - x_i)^2}{2\lambda^2}\right)$$
• In p dimensions:
$$\hat f_X(x_0) = \frac{1}{N\,(2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}\left(\lVert x_i - x_0 \rVert / \lambda\right)^2}$$
[Figure: kernel density estimation]

Using Kernel Density Estimates in Classification
• Posterior probability: to estimate it, we can estimate the class-conditional densities using the Parzen method,
$$\hat P(G = j \mid X = x_0) = \frac{\hat\pi_j\, \hat f_j(x_0)}{\sum_{l} \hat\pi_l\, \hat f_l(x_0)},$$
where f̂j(x) estimates the j-th class-conditional density p(x | G = j) and π̂j is the class prior.
[Figures: class-conditional densities and ratio of posteriors]
• Ratio of posteriors:
$$\frac{\hat P(G = 1 \mid X = x)}{\hat P(G = 2 \mid X = x)} = \frac{\hat\pi_1\, \hat f_1(x)}{\hat\pi_2\, \hat f_2(x)}$$

Naive Bayes Classifier
• In Bayesian classification we need to estimate the class-conditional densities fj(x) = p(x | G = j)
• What if the input space x is multidimensional? If we apply kernel density estimates, we run into the same problems that we faced in high dimensions
• To avoid these difficulties, assume that the class-conditional density factorizes:
$$f_j(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid G = j)$$
• In other words, we are assuming that the features are independent within each class: the naïve Bayes model
• Advantages:
  – Each one-dimensional class-conditional density for each feature can be estimated separately (low variance)
  – If some of the features are continuous and some are discrete, the method can seamlessly handle the situation
• The naïve Bayes classifier works surprisingly well for many problems (why?)
• The discriminant (log-odds) function now has the form of a generalized additive model
• A sketch combining per-feature Parzen estimates with the naïve Bayes rule appears after the key points below

Key Points
• Local assumption
• Usually bandwidth (λ) selection is more important than kernel function selection
• Low bias and low variance are usually not guaranteed in high dimensions
• Little training but high online computational complexity
  – Use sparingly: only when really required, e.g., in the high-confusion zone
  – Use when the model may not be used again: no need for a training phase
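To make the last two topics concrete, here is a minimal sketch, assuming Gaussian kernels, of a univariate Parzen estimate and of a naïve Bayes classifier that multiplies one per-feature Parzen estimate per class. The two-class toy data, the priors, and the bandwidth `lam` are illustrative assumptions, not part of these notes.

```python
import numpy as np

def parzen_1d(x0, x, lam):
    """Parzen estimate of a 1-D density at x0 with a Gaussian kernel."""
    k = np.exp(-0.5 * ((x0 - x) / lam) ** 2) / (np.sqrt(2.0 * np.pi) * lam)
    return np.mean(k)

def naive_bayes_posterior(x0, X_by_class, priors, lam):
    """Posterior P(G = j | X = x0) with each class-conditional density
    factorized into a product of per-feature Parzen estimates (naive Bayes)."""
    scores = []
    for Xj, prior in zip(X_by_class, priors):
        dens = np.prod([parzen_1d(x0[d], Xj[:, d], lam) for d in range(Xj.shape[1])])
        scores.append(prior * dens)
    scores = np.array(scores)
    return scores / scores.sum()

# Two-class toy problem in two dimensions
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))   # class 1 samples
X2 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))   # class 2 samples

post = naive_bayes_posterior(np.array([1.0, 1.0]), [X1, X2], priors=[0.5, 0.5], lam=0.5)
print(post)   # posterior probabilities of the two classes at the query point
```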