The Elements of Statistical Learning
Thomas Lengauer, Christian Merkwirth
using the book by Hastie, Tibshirani, Friedman

The Elements of Statistical Learning
- Prerequisites:
  - Vordiplom in mathematics or computer science or equivalent
  - Linear algebra
  - Basic knowledge in statistics
- Time:
  - Lecture: Wed 11-13, HS024, Building 46 (MPI)
  - Tutorial: Fri 14-16, Rm. 15, Building 45 (CS Dep.), biweekly, starting Oct. 31
- Credits: Übungsschein, based on
  - at least 50% of the points in the homework
  - a final exam, probably oral
- Good for
  - Bioinformatics
  - CS Theory or Applications

1. Introduction

Applications of Statistical Learning
- Medical: Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. Data: demographic, diet, clinical measurements
- Business/Economics: Predict the price of a stock 6 months from now. Data: company performance, economic data
- Vision: Identify hand-written ZIP codes. Data: model hand-written digits
- Medical: Amount of glucose in the blood of a diabetic. Data: infrared absorption spectrum of a blood sample
- Medical: Risk factors for prostate cancer. Data: clinical, demographic

Types of Data
- Two basically different types of data:
  - Quantitative (numerical): e.g. stock price
  - Categorical (discrete, often binary): cancer/no cancer
- Data are predicted on the basis of a set of features (e.g. diet or clinical measurements), from a set of (observed) training data on these features, for a set of objects (e.g. people).
- Inputs for the problems are also called predictors or independent variables.
- Outputs are also called responses or dependent variables.
- The prediction model is called a learner or estimator (Schätzer).
  - Supervised learning: learn on outcomes for observed features
  - Unsupervised learning: no outcome values available

Example 1: Email Spam
- 4601 email messages, each labeled email (+) or spam (-)
- Data: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message
- Prediction goal: label future messages email (+) or spam (-)
- Supervised learning problem on categorical data: a classification problem

Words with the largest difference between spam and email:

          spam   email
  george  0.00   1.27
  you     2.26   1.27
  your    1.38   0.44
  hp      0.02   0.90
  free    0.52   0.07
  hpl     0.01   0.43
  !       0.51   0.11
  our     0.51   0.18
  re      0.13   0.42
  edu     0.01   0.29
  remove  0.28   0.01

Example 1: Email Spam (continued)
- Examples of rules for prediction:
  - If (%george < 0.6) and (%you > 1.5) then spam else email
  - If (0.2·%you - 0.3·%george) > 0 then spam else email
- Tolerance to errors:
  - Tolerant to letting through some spam (false positives)
  - No tolerance towards throwing out email (false negatives)

Example 2: Prostate Cancer
- Data (by Stamey et al. 1989). Given:
  - lcavol: log cancer volume
  - lweight: log prostate weight
  - age
  - lbph: log benign prostatic hyperplasia amount
  - svi: seminal vesicle invasion
  - lcp: log capsular penetration
  - gleason: Gleason score
  - pgg45: percent of Gleason scores 4 or 5
- Predict: PSA (prostate specific antigen) level
- Supervised learning problem on quantitative data: a regression problem

Example 2: Prostate Cancer (continued)
- The figure shows scatter plots of the input data, projected onto pairs of variables.
- The first row shows the outcome of the prediction plotted against each input variable.
- The variables svi and gleason are categorical.
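The two threshold rules quoted for the spam example can be written down directly as a toy classifier. A minimal sketch, purely for illustration: the feature dictionary and the example word frequencies are hypothetical, only the thresholds come from the slide.

```python
# Minimal sketch of the hand-made spam rules quoted above.
# The feature dictionary and example values are hypothetical;
# "george" and "you" stand for %george and %you, the relative
# frequencies of those words in a message.

def rule_1(features):
    """If (%george < 0.6) and (%you > 1.5) then spam else email."""
    return "spam" if features["george"] < 0.6 and features["you"] > 1.5 else "email"

def rule_2(features):
    """If (0.2*%you - 0.3*%george) > 0 then spam else email."""
    return "spam" if 0.2 * features["you"] - 0.3 * features["george"] > 0 else "email"

message = {"george": 0.0, "you": 2.3}    # hypothetical word frequencies
print(rule_1(message), rule_2(message))  # both rules label this message "spam"
```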
Example 3: Recognition of Handwritten Digits
- Data: images of single digits, 16x16 8-bit gray-scale, normalized for size and orientation
- Task: classify newly written digits
- Non-binary classification problem
- Low tolerance to misclassifications

Example 4: DNA Expression Microarrays
- Data: color intensities signifying the abundance levels of mRNA for a number of genes (6830) in several (64) different cell states (samples), arranged as a genes-by-samples matrix
  - Red: over-expressed gene
  - Green: under-expressed gene
  - Black: normally expressed gene (according to some predefined background)
- Predict:
  - Which genes show similar expression over the samples
  - Which samples show similar expression over the genes (unsupervised learning problem)
  - Which genes are highly over- or under-expressed in certain cancers (supervised learning problem)

2. Overview of Supervised Learning

2.2 Notation
- Inputs:
  - X is an input variable; X_j is the j-th element of the vector X
  - p = #inputs, N = #observations
  - The matrix of observed inputs is written in bold: X
  - Vectors are written in bold (x_i) if they have N components and thus summarize all observations on variable X_i
  - Vectors are assumed to be column vectors
  - Discrete inputs are often described by characteristic vectors (dummy variables)
- Outputs:
  - quantitative: Y
  - qualitative: G (for group)
- Observed variables are written in lower case: the i-th observed value of X is x_i and can be a scalar or a vector
- Main question of this lecture: given the value of an input vector X, make a good prediction Ŷ of the output Y
- The prediction should be of the same kind as the searched output (categorical vs. quantitative)
  - Exception: binary outputs can be approximated by values in [0,1], which can be interpreted as probabilities; this generalizes to k-level outputs

2.3.1 Simple Approach 1: Least Squares
- Given inputs $X = (X_1, X_2, \ldots, X_p)$
- Predict the output Y via the model
  $\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$,
  where $\hat\beta_0$ is the bias
- Including the constant variable 1 in X, this becomes $\hat Y = X^T\hat\beta$
- Here Ŷ is a scalar (if Y is a K-vector, then β is a p×K matrix)
- In the (p+1)-dimensional input-output space, (X, Ŷ) represents a hyperplane
- If the constant is included in X, then the hyperplane goes through the origin
- $f(X) = X^T\beta$ is a linear function; its gradient $f'(X) = \beta$ is a vector that points in the steepest uphill direction

2.3.1 Simple Approach 1: Least Squares (continued)
- Training procedure: the method of least squares, with N = #observations
- Minimize the residual sum of squares
  $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2$,
  or equivalently $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$
- This quadratic function always has a global minimum, but it may not be unique
- Differentiating w.r.t. β yields the normal equations
  $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$
- If $\mathbf{X}^T\mathbf{X}$ is nonsingular, the unique solution is
  $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
- The fitted value at input x is $\hat y(x) = x^T\hat\beta$
- The entire fitted surface is characterized by $\hat\beta$

2.3.1 Simple Approach 1: Least Squares (continued)
- Example: data on two inputs X_1 and X_2
- The output variable has values GREEN (coded 0) and RED (coded 1); 100 points per class
- The decision boundary is the line $x^T\hat\beta = 0.5$: points with $x^T\hat\beta > 0.5$ are classified RED, points with $x^T\hat\beta < 0.5$ GREEN
- Easy, but many misclassifications if the problem is not linear
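To make the least-squares recipe concrete, here is a minimal numpy sketch that solves the normal equations $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ on synthetic two-class data and classifies with the 0.5 threshold; the two Gaussian clouds are invented stand-ins for the GREEN/RED example, not the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the GREEN (0) / RED (1) example: 100 points per class.
X_green = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(100, 2))
X_red   = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(100, 2))
X = np.vstack([X_green, X_red])
y = np.r_[np.zeros(100), np.ones(100)]

# Include the constant variable 1 in X, then solve the normal equations.
Xb = np.column_stack([np.ones(len(X)), X])
beta_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # (X^T X)^{-1} X^T y

# Classify via the fitted hyperplane: RED if x^T beta_hat > 0.5, else GREEN.
y_hat = (Xb @ beta_hat > 0.5).astype(int)
print("training error rate:", np.mean(y_hat != y))
```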
2.3.2 Simple Approach 2: Nearest Neighbors
- Uses those observations in the training set that are closest to the given input:
  $\hat Y(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$
- $N_k(x)$ is the set of the k points of the training sample T that are closest to x
- Average the outcomes of the k closest training sample points; the decision boundary is $\hat Y(x) = 0.5$
- 15-nearest-neighbor averaging: fewer misclassifications than the linear fit
- 1-nearest-neighbor averaging: no misclassifications on the training data, i.e. overtraining

2.3.3 Comparison of the Two Approaches
- Least squares:
  - p parameters, p = #features
  - Low variance (robust)
  - High bias (rests on strong assumptions)
  - Good for Scenario 1: training data in each class generated from a two-dimensional Gaussian; the two Gaussians are independent and have different means
- K-nearest neighbors:
  - Apparently one parameter k; in fact effectively N/k parameters, N = #observations
  - High variance (not robust)
  - Low bias (rests only on weak assumptions)
  - Good for Scenario 2: training data in each class from a mixture of 10 low-variance Gaussians, with means themselves distributed as a Gaussian (first choose the Gaussian, then choose the point according to that Gaussian)

2.3.3 Origin of the Data
- Mixture of the two scenarios:
  - Step 1: Generate 10 means m_k from the bivariate Gaussian distribution N((1,0)^T, I) and label this class GREEN
  - Step 2: Similarly, generate 10 means from the bivariate Gaussian distribution N((0,1)^T, I) and label this class RED
  - Step 3: For each class, generate 100 observations as follows: for each observation, pick an m_k at random with probability 1/10, then generate a point according to N(m_k, I/5)
- Similar to Scenario 2
- (The figure shows the result of 10,000 classifications.)

2.3.3 Variants of These Simple Methods
- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the 0/1 cutoff used in nearest-neighbor methods
- In high-dimensional spaces, the distance kernels are modified to emphasize some variables more than others
- Local regression fits linear models (by least squares) locally rather than fitting constants locally
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models
- Projection pursuit and neural network models are sums of nonlinearly transformed linear models
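A minimal sketch of k-nearest-neighbor averaging with the 0/1 coding and the 0.5 threshold used above; the training arrays are assumed to be X_train of shape (N, p) and y_train of 0/1 labels, e.g. as generated in the least-squares sketch earlier.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=15):
    """k-nearest-neighbor averaging: mean of the y-values of the k
    training points closest (in Euclidean distance) to the query x0."""
    dist = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dist)[:k]          # indices of N_k(x0)
    return y_train[neighbors].mean()          # \hat Y(x0)

def knn_classify(X_train, y_train, x0, k=15):
    """Classification rule for 0/1-coded classes: 1 (RED) if the average exceeds 0.5."""
    return int(knn_predict(X_train, y_train, x0, k) > 0.5)
```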
2.4 Statistical Decision Theory
- Random input vector $X \in \mathbb{R}^p$, random output variable $Y \in \mathbb{R}$, joint distribution Pr(X, Y)
- We are looking for a function f(x) for predicting Y given the values of the input X
- The loss function L(Y, f(X)) shall penalize errors; squared error loss: $L(Y, f(X)) = (Y - f(X))^2$
- Expected prediction error (EPE):
  $\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2 = \int (y - f(x))^2 \,\Pr(dx, dy)$
- Since Pr(X, Y) = Pr(Y|X) Pr(X), EPE can also be written as
  $\mathrm{EPE}(f) = \mathrm{E}_X \mathrm{E}_{Y|X}\!\left([Y - f(X)]^2 \mid X\right)$
- Thus it suffices to minimize EPE pointwise:
  $f(x) = \arg\min_c \mathrm{E}_{Y|X}([Y - c]^2 \mid X = x)$
- The solution is the regression function $f(x) = \mathrm{E}(Y \mid X = x)$

2.4 Statistical Decision Theory (continued)
- Nearest-neighbor methods try to implement this recipe directly:
  $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
- Several approximations are involved:
  - since there are typically no duplicate observations, the expectation at a point is replaced by an expectation over a neighborhood
  - the expectation is approximated by averaging over observations
- With increasing k and number of observations the average gets (provably) more stable
  - but often we do not have large samples; by making assumptions (linearity) we can greatly reduce the number of required observations
  - with increasing dimension the neighborhood grows exponentially, so the rate of convergence to the true estimator (with increasing k) decreases

2.4 Statistical Decision Theory (continued)
- Linear regression assumes that the regression function is approximately linear: $f(x) \approx x^T\beta$
  - this is a model-based approach
- Plugging this expression into EPE and differentiating w.r.t. β, we can solve for β:
  $\mathrm{EPE}(f) = \mathrm{E}\big[(Y - X^T\beta)^T(Y - X^T\beta)\big] \;\Rightarrow\; \beta = [\mathrm{E}(X X^T)]^{-1}\mathrm{E}(X Y)$
- Again, linear regression replaces the theoretical expectation by averaging over the observed data:
  $\mathrm{RSS}(\beta) = \sum_{i=1}^{N}(y_i - x_i^T\beta)^2, \qquad \hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
- Summary:
  - least squares assumes that f(x) is well approximated by a globally linear function
  - nearest neighbors assumes that f(x) is well approximated by a locally constant function

2.4 Statistical Decision Theory (continued)
- Additional methods in this book are often model-based but more flexible than the linear model, e.g. additive models
  $f(X) = \sum_{j=1}^{p} f_j(X_j)$, where each f_j is arbitrary
- What happens if we use another loss function, e.g. $L_1(Y, f(X)) = |Y - f(X)|$?
  - in this case $\hat f(x) = \mathrm{median}(Y \mid X = x)$
  - more robust than the conditional mean, but the L1 criterion is not differentiable
  - squared error loss remains the most popular

2.4 Statistical Decision Theory (continued)
- Procedure for a categorical output variable G taking values in a set $\mathcal{G}$ of K classes
- The loss function is a K×K matrix L that is zero on the diagonal; L(k, ℓ) is the price paid for misclassifying an element from class $\mathcal{G}_k$ as belonging to class $\mathcal{G}_\ell$
- Expected prediction error (EPE): $\mathrm{EPE} = \mathrm{E}[L(G, \hat G(X))]$, with the expectation taken w.r.t. the joint distribution Pr(G, X)
- Conditioning yields
  $\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L[\mathcal{G}_k, \hat G(X)]\,\Pr(\mathcal{G}_k \mid X)$
- Again, pointwise minimization suffices:
  $\hat G(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g)\,\Pr(\mathcal{G}_k \mid X = x)$
- Frequently the 0-1 loss function $L(k, \ell) = 1 - \delta_{k\ell}$ is used; the minimizer is then the Bayes classifier
  $\hat G(x) = \mathcal{G}_k$ if $\Pr(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x)$,
  and the resulting boundary is the Bayes-optimal decision boundary
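For the simulated Gaussian-mixture data described earlier, the Bayes classifier under 0-1 loss simply picks the class with the larger posterior. Below is a sketch under the assumptions of equal class priors and known mixture means (exactly the oracle knowledge the Bayes classifier is granted in the simulation); the random seed and the test point are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mixture means for the GREEN and RED classes, drawn as in the
# data-generating recipe above (10 means per class). The Bayes
# classifier is assumed to know them, which makes it an oracle.
means_green = rng.multivariate_normal([1, 0], np.eye(2), size=10)
means_red   = rng.multivariate_normal([0, 1], np.eye(2), size=10)

def class_density(x, means, cov_scale=1/5):
    """Density of an equal-weight mixture of N(m_k, I/5) components at x."""
    sq = np.sum((x - means) ** 2, axis=1) / cov_scale
    return np.mean(np.exp(-0.5 * sq) / (2 * np.pi * cov_scale))

def bayes_classify(x):
    """0-1 loss: pick the class with the larger posterior; with equal
    priors this is the class with the larger density at x."""
    return "GREEN" if class_density(x, means_green) > class_density(x, means_red) else "RED"

print(bayes_classify(np.array([1.0, 0.0])))  # typically GREEN (closer to the GREEN means)
```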
2.5 Local Methods in High Dimensions
- Curse of dimensionality: local neighborhoods become increasingly global as the number of dimensions increases
- Example: points uniformly distributed in a p-dimensional unit hypercube. A hypercubical neighborhood in p dimensions that captures a fraction r of the data has edge length $e_p(r) = r^{1/p}$
  - $e_{10}(0.01) = 0.63$, $e_{10}(0.1) \approx 0.80$
  - to cover 1% of the data we must cover 63% of the range of each input variable
- Reducing r reduces the number of observations in the neighborhood and thus the stability of the estimate

2.5 Local Methods in High Dimensions (continued)
- In high dimensions, all sample points are close to the edge of the sample
- For N data points uniformly distributed in a p-dimensional unit ball centered at the origin, the median distance from the origin to the closest data point is (homework)
  $d(p, N) = \left(1 - \tfrac{1}{2}^{1/N}\right)^{1/p}$
  - $d(10, 500) \approx 0.52$, i.e. more than half the way to the boundary
- The sampling density is proportional to $N^{1/p}$: if $N_1 = 100$ is a dense sample for one input, then $N_{10} = 100^{10}$ is an equally dense sample for 10 inputs

2.5 Local Methods in High Dimensions (continued)
- Another example: T is a set of training points x_i generated uniformly in $[-1, 1]^p$ (no measurement error), with the functional relationship
  $Y = f(X) = e^{-8\|X\|^2}$
- We study the error of the 1-nearest-neighbor rule in estimating f(0) from 10 training points: the prediction is the value of f at the closest training point
- The problem is deterministic, so the prediction error is the mean squared error for estimating f(0):
  $\mathrm{MSE}(x_0) = \mathrm{E}_T[f(x_0) - \hat y_0]^2 = \mathrm{E}_T[\hat y_0 - \mathrm{E}_T(\hat y_0)]^2 + [\mathrm{E}_T(\hat y_0) - f(x_0)]^2 = \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}_T^2(\hat y_0)$
- (Figure: in one dimension the nearest neighbor tends to lie close to 0, so the downward bias of the prediction is small.)

Side Calculation: Bias-Variance Decomposition
$\mathrm{MSE}(x_0) = \mathrm{E}_T[(f(x_0) - \hat y_0)^2]$
$\quad = \mathrm{E}_T[(\hat y_0 - \mathrm{E}_T(\hat y_0) + \mathrm{E}_T(\hat y_0) - f(x_0))^2]$  (telescoping)
$\quad = \mathrm{E}_T[\hat y_0 - \mathrm{E}_T(\hat y_0)]^2 + [\mathrm{E}_T(\hat y_0) - f(x_0)]^2 + 2\,\mathrm{E}_T\big[(\hat y_0 - \mathrm{E}_T(\hat y_0))\,(\mathrm{E}_T(\hat y_0) - f(x_0))\big]$
$\quad = \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}_T^2(\hat y_0)$,
since the second factor of the cross term is a constant and $\mathrm{E}_T[\hat y_0 - \mathrm{E}_T(\hat y_0)] = 0$.
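Both formulas are easy to check numerically; a small sketch reproducing the quoted values:

```python
import numpy as np

def edge_length(r, p):
    """Edge length of a hypercube neighborhood capturing a fraction r of
    uniformly distributed data in the p-dimensional unit cube."""
    return r ** (1.0 / p)

def median_closest_distance(p, N):
    """Median distance from the origin to the closest of N points uniform
    in the p-dimensional unit ball."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, 10))            # ~0.63
print(edge_length(0.10, 10))            # ~0.79
print(median_closest_distance(10, 500)) # ~0.52
```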
2.5 Local Methods in High Dimensions (continued)
- 1-d vs. 2-d: in two dimensions the nearest neighbor tends to be farther from the target point, so the bias of the 1-NN estimate of f(0) increases (figure: 1-d bias vs. 2-d bias in the decomposition $\mathrm{MSE}(x_0) = \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}_T^2(\hat y_0)$)
- The case of N = 1000 training points:
  - the average bias increases with dimension, since the distance to the nearest neighbour increases
  - the variance does not increase, since the function is symmetric around 0
- Yet another example: $Y = f(X) = \tfrac{1}{2}(X_1 + 1)^3$
  - the variance increases, since the function is not symmetric around 0
  - the bias increases only moderately, since the function is monotonic

2.5 Local Methods in High Dimensions (continued)
- Assume now a linear relationship with measurement error:
  $Y = X^T\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$
- We fit the model by least squares; for an arbitrary test point $x_0$,
  $\hat y_0 = x_0^T\hat\beta = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\,\varepsilon_i$,
  where $\ell_i(x_0)$ is the i-th element of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$
- Expected prediction error at $x_0$:
  $\mathrm{EPE}(x_0) = \mathrm{E}_{y_0|x_0}\mathrm{E}_T(y_0 - \hat y_0)^2 = \mathrm{Var}(y_0|x_0) + \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}^2(\hat y_0) = \sigma^2 + \mathrm{E}_T[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0]\,\sigma^2 + 0^2$
  - the additional variance σ² appears because the output is nondeterministic
  - the variance of the estimate depends on x_0; there is no bias
- If N is large we get $\mathrm{E}_{x_0}\mathrm{EPE}(x_0) \approx \sigma^2(p/N) + \sigma^2$
  - the variance term is negligible for large N or small σ²
  - the curse of dimensionality is controlled

2.5 Local Methods in High Dimensions (continued)
- More generally: $Y = f(X) + \varepsilon$, X uniform, $\varepsilon \sim N(0, 1)$, sample size N = 500
- Linear case $f(x) = x_1$:
  - EPE(least squares) is slightly above 1 and has no bias
  - EPE(1-NN) is always above 2 and grows slowly as the nearest training point strays from the target
- Cubic case $f(x) = (x_1 + 1)^3/2$:
  - EPE(least squares) is biased, so the EPE ratio 1-NN/least squares is smaller

2.6 Statistical Models
- NN methods are the direct implementation of $f(x) = \mathrm{E}(Y \mid X = x)$
- But they can fail in two ways:
  - with high dimensions, the nearest neighbors need not be close to the target point
  - if special structure exists in the problem, it can be exploited to reduce both variance and bias

2.6.1 Additive Error Model
- Assume the additive error model $Y = f(X) + \varepsilon$, with $\mathrm{E}(\varepsilon) = 0$ and ε independent of X
- Then Pr(Y|X) depends on X only through the conditional mean f(x)
- This model is a good approximation in many cases
- In many cases f(x) is deterministic and error enters through uncertainty in the input; this can often be mapped onto uncertainty in the output with a deterministic input

2.6.2 Supervised Learning
- The learning algorithm modifies its input/output relationship in dependence on the observed errors $y_i - \hat f(x_i)$
- This can be a continuous process
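As a toy illustration of such error-driven learning, the sketch below repeatedly nudges the parameters of a linear learner in the direction that shrinks the observed errors $y_i - \hat f(x_i)$ (plain gradient descent on the squared error); the data-generating coefficients, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Additive error model: Y = f(X) + eps with a linear f (illustrative choice).
N, p = 200, 3
X = rng.uniform(-1, 1, size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.1, size=N)

# Error-driven updates: nudge beta to shrink the observed errors y - f_hat(x).
beta = np.zeros(p)
step = 0.1
for _ in range(500):
    errors = y - X @ beta                 # observed errors y_i - f_hat(x_i)
    beta += step * X.T @ errors / N       # gradient step on the mean squared error
print(beta)                               # approaches beta_true
```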
2.6.3 Function Approximation
- Data: pairs (x_i, y_i), viewed as points in (p+1)-dimensional space, with $y_i = f(x_i) + \varepsilon_i$ and $f\colon \mathbb{R}^p \to \mathbb{R}$; more general input spaces are possible
- Goal: a good approximation of f(x) in some region of the input space, given the training set T
- Many models have parameters θ, e.g. β for the linear model $f(x) = x^T\beta$
- Linear basis expansions have the more general form
  $f_\theta(x) = \sum_{k=1}^{K} h_k(x)\,\theta_k$
- Examples:
  - polynomial expansions: $h_k(x) = x_1 x_2^2$
  - trigonometric expansions: $h_k(x) = \cos(x_1)$
  - sigmoid expansion: $h_k(x) = \dfrac{1}{1 + \exp(-x^T\beta_k)}$
- Approximate f by minimizing the residual sum of squares
  $\mathrm{RSS}(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$
- Intuition: f is a surface in (p+1)-space of which we observe noisy realizations; we want the fitted surface to be as close to the observed points as possible, with closeness measured by RSS
- Methods:
  - closed form: if the basis functions have no hidden parameters
  - iterative: otherwise

2.6.3 Function Approximation (continued)
- Approximate f by maximizing the likelihood: assume an independently drawn random sample $y_i$, i = 1, ..., N, from a probability density $\Pr_\theta(y)$. The log-probability of observing the sample is
  $L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i)$;
  set θ to maximize L(θ)
- Least squares with the additive error model $Y = f_\theta(X) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, is equivalent to maximum likelihood with the likelihood $\Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2)$, because in this case the log-likelihood
  $L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_\theta(x_i))^2$
  is, up to constants, proportional to $-\mathrm{RSS}(\theta)$
- Approximating the regression function Pr(G|X) by maximizing the likelihood for a qualitative output G: with conditional class probabilities
  $\Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x)$, k = 1, ..., K,
  the log-likelihood, also called the cross-entropy, is
  $L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i)$

2.7 Structured Regression Models
- Problem with regression: minimizing $\mathrm{RSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2$ over all functions f has infinitely many (interpolating) solutions
- If we have repeated outcomes at each point, we can use them to decrease the variance by better estimating the average
- Otherwise we must restrict the set of admissible functions to "smooth" functions
  - the choice of this set is a model choice and a major topic of this course
- Restricting function spaces: choose a function space of low complexity
  - close to constant, linear or low-order polynomial in small neighborhoods
  - VC dimension is a relevant complexity measure in this context
  - the estimator does averaging or local polynomial fitting
  - the larger the neighborhood, the stronger the constraint
  - the metric used is important, whether explicitly or implicitly defined
  - all such methods run into problems in high dimensions, so we need metrics that allow neighborhoods to be large in at least some dimensions
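A minimal sketch of fitting a linear basis expansion $f_\theta(x) = \sum_k h_k(x)\theta_k$ by least squares, with a one-dimensional polynomial basis as an arbitrarily chosen example; since these basis functions have no hidden parameters, the fit is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy observations of an unknown curve (illustrative choice of f).
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=100)

# Basis functions h_k(x): here 1, x, x^2, x^3 (a polynomial expansion).
H = np.column_stack([x**k for k in range(4)])

# Closed-form least-squares estimate of theta (normal equations via lstsq).
theta, *_ = np.linalg.lstsq(H, y, rcond=None)

def f_hat(x_new):
    """Fitted basis expansion sum_k h_k(x) theta_k."""
    return np.column_stack([x_new**k for k in range(4)]) @ theta

print(theta)                      # fitted coefficients
print(f_hat(np.array([0.0, 0.5])))
```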
2.8 Classes of Restricted Estimators
- Roughness penalty and Bayesian methods: RSS is penalized with a roughness penalty
  $\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f)$,
  where J(f) is large for ragged functions
- E.g. the cubic smoothing spline is the solution of the penalized least-squares problem
  $\mathrm{PRSS}(f; \lambda) = \sum_{i=1}^{N}(y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\,dx$,
  i.e. a large second derivative is penalized
- Introducing penalty functions is a type of regularization: it works against overfitting and it implements beliefs about unseen parts of the problem
- In Bayesian terms (joint probabilities $\Pr(X, Y) = \Pr(Y|X)\Pr(X) = \Pr(X|Y)\Pr(Y)$; Bayes formula $\Pr(Y|X) = \Pr(X|Y)\Pr(Y)/\Pr(X)$, with prior Pr(Y) and posterior Pr(Y|X)):
  - the penalty J is the log-prior (probability distribution)
  - PRSS is the log-posterior (probability distribution)

2.8.2 Kernel Methods and Local Regression
- Kernel functions $K_\lambda(x_0, x)$ model the local neighborhoods used in NN methods and define the function space used for approximation
- Gaussian kernel:
  $K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\!\left(-\frac{\|x - x_0\|^2}{2\lambda}\right)$
  - assigns weights to points that die exponentially with the squared distance from the point x_0
  - λ controls the variance (width) of the kernel
- Simplest kernel estimate: the Nadaraya-Watson weighted average
  $\hat f(x_0) = \dfrac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$
- The general local regression estimate of f(x_0) is $f_{\hat\theta}(x_0)$, where $\hat\theta$ minimizes
  $\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$
  and $f_\theta$ is a simple function such as a low-order polynomial:
  - $f_\theta(x) = \theta_0$: the Nadaraya-Watson estimate
  - $f_\theta(x) = \theta_0 + \theta_1 x$: the local linear regression model
- NN methods can be regarded as kernel methods with a special metric:
  $K_k(x, x_0) = I\big(\|x - x_0\| \le \|x_{(k)} - x_0\|\big)$,
  where $x_{(k)}$ is the training sample point ranked k-th in distance from $x_0$ and I is the indicator function

2.8.3 Basis Functions and Dictionary Methods
- Include linear and polynomial expansions and more; general form
  $f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)$, linear in θ
- Examples:
  - Splines: the parameters are the points of attachment of the polynomial pieces (knots)
  - Radial basis functions:
    $f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\,\theta_m$,
    with parameters the centroids $\mu_m$ and the scales $\lambda_m$
  - Neural networks:
    $f_\theta(x) = \sum_{m=1}^{M} \beta_m\,\sigma(\alpha_m^T x + b_m)$, with $\sigma(x) = 1/(1 + e^{-x})$,
    where the $\beta_m$ weight the neuron outputs and $\alpha_m$, $b_m$ are the neuron weights and biases
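A minimal numpy sketch of the Nadaraya-Watson weighted average with the Gaussian kernel above; the one-dimensional data and the bandwidth λ are placeholder choices.

```python
import numpy as np

def gaussian_kernel(x0, x, lam=0.2):
    """Gaussian kernel K_lambda(x0, x): weights dying exponentially with
    the squared distance from the target point x0."""
    return np.exp(-np.sum((x - x0)**2, axis=-1) / (2 * lam)) / lam

def nadaraya_watson(x0, X_train, y_train, lam=0.2):
    """Kernel-weighted average of the training responses at x0."""
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)

# Placeholder one-dimensional data for illustration.
rng = np.random.default_rng(4)
X_train = rng.uniform(-1, 1, size=(100, 1))
y_train = np.sin(3 * X_train[:, 0]) + rng.normal(0, 0.2, size=100)
print(nadaraya_watson(np.array([0.0]), X_train, y_train))  # estimate of f(0), roughly 0
```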
2.9 Model Selection
- Smoothing and complexity parameters:
  - the coefficient λ of the penalty term
  - the width of the kernel
  - the number of basis functions
- The setting of these parameters implements a tradeoff between bias and variance
- Example: k-NN methods. Assume $Y = f(X) + \varepsilon$ with $\mathrm{E}(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and that the values of the $x_i$ are fixed in advance. The generalization error at $x_0$ is
  $\mathrm{EPE}_k(x_0) = \mathrm{E}\big[(Y - \hat f_k(x_0))^2 \mid X = x_0\big] = \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma^2}{k}$
  - the first term is the irreducible error; the squared bias and the variance together form the mean squared estimation error
  - small k: low bias, high variance (overfit region); large k: high bias, low variance (underfit region)
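The tradeoff can be made concrete with a small simulation that evaluates the bias and variance terms of $\mathrm{EPE}_k(x_0)$ for several k; the target function, noise level, and design are placeholder choices, and the $x_i$ are held fixed across simulated training sets as assumed above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Fixed design, as assumed above; placeholder 1-d target and noise level.
f = lambda x: np.sin(3 * x)
sigma = 0.3
x_train = np.sort(rng.uniform(-1, 1, size=100))
x0 = 0.0
order = np.argsort(np.abs(x_train - x0))      # neighbors of x0, nearest first

for k in (1, 5, 15, 50):
    nbrs = order[:k]
    # k-NN estimate over many simulated training sets (same x_i, new noise).
    est = np.array([np.mean(f(x_train[nbrs]) + rng.normal(0, sigma, size=k))
                    for _ in range(2000)])
    bias2 = (f(x0) - np.mean(f(x_train[nbrs])))**2   # squared bias term
    var = sigma**2 / k                               # variance term sigma^2 / k
    print(f"k={k:3d}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"simulated MSE={np.mean((est - f(x0))**2):.4f}")
```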