CSE546: Linear Regression and the Bias/Variance Tradeoff
Winter 2012, Luke Zettlemoyer
Slides adapted from Carlos Guestrin

Prediction of continuous variables
• Billionaire says: Wait, that's not what I meant!
• You say: Chill out, dude.
• He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA.
• You say: I can regress that…

Linear Regression
[Figure: scatter plot of salary vs. GPA with a fitted line and its predictions.]

Ordinary Least Squares (OLS)
[Figure: observations vs. predictions; the gap between an observation and its prediction is the error, or residual.]

The regression problem
• Instances: <x_j, t_j>
• Learn: a mapping from x to t(x)
• Hypothesis space:
– Given basis functions {h_1, …, h_k}
– Find coefficients w = {w_1, …, w_k}
– Why is this usually called linear regression?
• The model is linear in the parameters
• Can we estimate functions that are not lines??? (Yes: the basis functions h_i can be arbitrary nonlinear features of x.)
• Precisely: minimize the residual squared error

  w^* = \arg\min_w \sum_{j=1}^N \Big( t_j - \sum_{i=1}^k w_i h_i(x_j) \Big)^2

Regression: matrix notation
• Stack everything into matrices and minimize ||Hw − t||², where
– t is the N×1 vector of observed outputs (N data points)
– H is the N×k matrix of basis-function measurements, H_{ji} = h_i(x_j) (k basis functions)
– w is the k×1 vector of weights

Regression solution: simple matrix math

  \hat{w} = (H^T H)^{-1} H^T t

where HᵀH is a k×k matrix for k basis functions and Hᵀt is a k×1 vector.

But, why?
• Billionaire (again) says: Why sum squared error???
• You say: Gaussians, Dr. Gateson, Gaussians…
• Model: the prediction is a linear function plus Gaussian noise

  t(x) = \sum_i w_i h_i(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

• Learn w using MLE

Maximizing log-likelihood
• Warm-up, MLE for a Gaussian: setting the derivatives of the log-likelihood to zero gives

  0 = -\sum_{i=1}^N \frac{x_i - \mu}{\sigma^2} \;\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_{i=1}^N x_i

  0 = -\frac{N}{\sigma} + \sum_{i=1}^N \frac{(x_i - \mu)^2}{\sigma^3} \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2

• For regression with Gaussian noise:

  \arg\max_w \ln \prod_{j=1}^N \frac{1}{\sigma\sqrt{2\pi}} \exp\Big( \frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2\sigma^2} \Big)
  = \arg\max_w \sum_{j=1}^N \frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2\sigma^2}
  = \arg\min_w \sum_{j=1}^N \Big[ t_j - \sum_i w_i h_i(x_j) \Big]^2

• Maximizing with respect to w: least-squares linear regression is MLE for Gaussians!!!

Bias-variance tradeoff – intuition
• Model too simple: does not fit the data well
– A biased solution
[Figure: M = 0 polynomial fit, a flat line that underfits the data.]
• Model too complex: small changes to the data change the solution a lot
– A high-variance solution
[Figure: M = 9 polynomial fit, oscillating wildly between the data points.]

(Squared) bias of learner
• Given: a dataset D with m samples
• Learn: for different datasets D, you will get different functions h(x)
• Expected prediction (averaged over hypotheses): E_D[h(x)]
• Bias: the difference between the expected prediction and the truth
– Measures how well you expect to represent the true solution
– Decreases with a more complex model

Variance of learner
• Given: a dataset D with m samples
• Learn: for different datasets D, you will get different functions h(x)
• Expected prediction (averaged over hypotheses): E_D[h(x)]
• Variance: the difference between what you expect to learn and what you learn from a particular dataset
– Measures how sensitive the learner is to the specific dataset
– Decreases with a simpler model

Bias-variance decomposition of error
• Consider a simple regression problem f: X → T
  f(x) = g(x) + ε, with g(x) deterministic and noise ε ~ N(0, σ²)
• Collect some data, and learn a function h(x)
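The closed-form solution ŵ = (HᵀH)⁻¹Hᵀt above can be sketched numerically. This is a minimal illustration, not from the slides: the sine target, noise level, and cubic polynomial basis are my own choices.

```python
import numpy as np

# Minimal sketch of least-squares regression with a polynomial basis.
# All concrete choices (sine target, noise level, degree) are illustrative.
rng = np.random.default_rng(0)
N, k = 20, 4                                        # N data points, k basis functions
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, N)   # t(x) = g(x) + Gaussian noise

# Design matrix H with H[j, i] = h_i(x_j) = x_j**i (basis: 1, x, x^2, x^3)
H = np.vander(x, k, increasing=True)

# Closed-form solution: solve the k x k normal equations (H^T H) w = H^T t
w = np.linalg.solve(H.T @ H, H.T @ t)

# Equivalent, more numerically stable route
w_lstsq, *_ = np.linalg.lstsq(H, t, rcond=None)

print(np.allclose(w, w_lstsq))                      # both give the same minimizer
```

Solving the normal equations directly mirrors the slide's formula; in practice `np.linalg.lstsq` is preferred because it avoids explicitly forming HᵀH, whose condition number is the square of H's.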
• What are the sources of prediction error?

Sources of error 1 – noise
f(x) = g(x) + ε
• What if we have a perfect learner and infinite data?
– Then our learned solution h(x) satisfies h(x) = g(x)
– We still have a remaining, unavoidable error of σ² due to the noise ε

Sources of error 2 – finite data
f(x) = g(x) + ε
• What if we have an imperfect learner, or only m training examples?
• What is our expected squared error per example?
– Expectation taken over random training sets D of size m, drawn from the distribution P(X, T)

Bias-variance decomposition of error (Bishop, Chapter 3)
• Assume the target function is t(x) = g(x) + ε
• Then the expected squared error over fixed-size training sets D drawn from P(X, T) can be expressed as the sum of three components:

  expected error = bias² + variance + unavoidable noise σ²

where bias² = (E_D[h(x)] − g(x))² and variance = E_D[(h(x) − E_D[h(x)])²], averaged over x.

Bias-variance tradeoff
• The choice of hypothesis class introduces a learning bias
– More complex class → less bias
– More complex class → more variance

Training set error
• Given a dataset (training data)
• Choose a loss function
– e.g., squared error (L2) for regression
• Training error: for a particular set of parameters w, the loss on the training data

  error_train(w) = \frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \Big( t_j - \sum_i w_i h_i(x_j) \Big)^2

[Figure: training error as a function of model complexity; it only decreases as the model gets more complex.]

Prediction error
• Training set error can be a poor measure of the "quality" of the solution
• Prediction error (true error): we really care about the error over all possibilities

  error_true(w) = \int \Big( t(x) - \sum_i w_i h_i(x) \Big)^2 p(x)\, dx

[Figure: prediction error as a function of model complexity; it falls, then rises again as the model overfits.]

Computing prediction error
• To compute the prediction error exactly:
– Hard integral!
– We may not know t(x) for every x, and may not know p(x)
• Monte Carlo integration (sampling approximation):
– Sample a set of i.i.d. points {x_1, …, x_M} from p(x)
– Approximate the integral with the sample average

  error_true(w) ≈ \frac{1}{M} \sum_{j=1}^{M} \Big( t(x_j) - \sum_i w_i h_i(x_j) \Big)^2

Why doesn't training set error approximate prediction error?
• Sampling approximation of the prediction error: an average over fresh samples from p(x)
• Training error: an average over the training samples
• Very similar equations!!!
– So why is the training set a bad measure of the prediction error???
• Because you cheated!!!
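The "cheating" can be seen in a small simulation: fit w on a training set, then estimate the true prediction error by the sampling approximation on fresh points. Every concrete setting below (target, noise level, degree, sample sizes) is my own illustrative choice, not from the slides.

```python
import numpy as np

# Sketch: optimize w on m training points, then Monte Carlo estimate of the
# prediction-error integral using fresh samples from the same distribution.
rng = np.random.default_rng(1)
g = lambda x: np.sin(2 * np.pi * x)            # deterministic part g(x)
sigma, m, degree = 0.2, 15, 5                  # illustrative choices

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, g(x) + rng.normal(0, sigma, n)   # t(x) = g(x) + eps

def basis(x):
    return np.vander(x, degree + 1, increasing=True)

x_tr, t_tr = sample(m)
w, *_ = np.linalg.lstsq(basis(x_tr), t_tr, rcond=None)  # w tuned on training data

train_err = np.mean((basis(x_tr) @ w - t_tr) ** 2)

# Sampling approximation of the prediction-error integral
x_mc, t_mc = sample(100_000)
pred_err = np.mean((basis(x_mc) @ w - t_mc) ** 2)

print(train_err, pred_err)   # training error is the (optimistically) smaller one
```

The two averages are the "very similar equations" from the slide; the only difference is that `w` was chosen to make the first one small, which is exactly why it underestimates the second.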
– Training error is a good estimate for a single fixed w, but you optimized w with respect to the training error and found a w that is good for this particular set of samples
– Training error is therefore an (optimistically) biased estimate of the prediction error

Test set error
• Given a dataset, randomly split it into two parts:
– Training data: {x_1, …, x_Ntrain}
– Test data: {x_1, …, x_Ntest}
• Use the training data to optimize the parameters w
• Test set error: for the final solution w*, evaluate the error using

  error_test(w^*) = \frac{1}{N_{test}} \sum_{j=1}^{N_{test}} \Big( t_j - \sum_i w_i^* h_i(x_j) \Big)^2

[Figure: test set error as a function of model complexity; it tracks the prediction error.]

Overfitting: this slide is so important we are looking at it again!
• Assume:
– Data generated from a distribution D(X, Y)
– A hypothesis space H
• Define errors for a hypothesis h ∈ H:
– Training error: error_train(h)
– Data (true) error: error_true(h)
• We say h overfits the training data if there exists an h' ∈ H such that:

  error_train(h) < error_train(h')  and  error_true(h) > error_true(h')

Summary: error estimators
• Gold standard: the true prediction error
• Training error: optimistically biased
• Test error: our final measure; unbiased?

[Figure: error as a function of the number of training examples, for a fixed model complexity, from little data to infinite data.]

Summary: error estimators
• Be careful!!! The test set is only unbiased if you never, never, ever, ever do any learning on the test data
• For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!! (We will address this problem later in the semester.)

What you need to know
• Regression
– Basis functions = features
– Optimizing the sum of squared errors
– The relationship between regression and Gaussians
• The bias-variance tradeoff
• Play with the applet
– http://mste.illinois.edu/users/exner/java.f/leastsquares/
• True error, training error, test error
– Never learn on the test data
• Overfitting
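The bias-variance tradeoff from the lecture can also be estimated empirically: draw many training sets D, fit h_D on each, and compare E_D[h(x)] to g(x). A sketch, with every concrete setting (target function, sizes, degrees M = 0 and M = 9) chosen purely for illustration:

```python
import numpy as np

# Estimate bias^2 and variance of polynomial regression by simulating many
# training sets D of size m. All settings are illustrative choices.
rng = np.random.default_rng(2)
g = lambda x: np.sin(2 * np.pi * x)
sigma, m, runs = 0.2, 25, 500
x_grid = np.linspace(0, 1, 50)                # test points for the estimates

def bias2_and_variance(degree):
    preds = np.empty((runs, x_grid.size))
    for r in range(runs):
        x = rng.uniform(0, 1, m)              # a fresh dataset D
        t = g(x) + rng.normal(0, sigma, m)
        H = np.vander(x, degree + 1, increasing=True)
        w, *_ = np.linalg.lstsq(H, t, rcond=None)
        preds[r] = np.vander(x_grid, degree + 1, increasing=True) @ w
    expected = preds.mean(axis=0)             # E_D[h(x)]
    bias2 = np.mean((expected - g(x_grid)) ** 2)
    variance = np.mean((preds - expected) ** 2)
    return bias2, variance

b0, v0 = bias2_and_variance(0)   # M = 0: too simple -> high bias, low variance
b9, v9 = bias2_and_variance(9)   # M = 9: too complex -> low bias, high variance

print("M=0:", b0, v0)
print("M=9:", b9, v9)
```

The constant model (M = 0) has large squared bias and tiny variance; the degree-9 model reverses the picture, matching the M = 0 / M = 9 intuition slides.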