CSE546: Linear Regression, Bias / Variance Tradeoff, Winter 2012
Luke Zettlemoyer (slides adapted from Carlos Guestrin)

Prediction of continuous variables
•  Billionaire says: Wait, that's not what I meant!
•  You say: Chill out, dude.
•  He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
•  You say: I can regress that…

Linear Regression
[Figure: scatter plot of data with a fitted regression line; the line gives the prediction at each input]

Ordinary Least Squares (OLS)
[Figure: fitted line showing an observation, its prediction, and the error (residual) between them]
The regression problem
•  Instances: <xj, tj>
•  Learn: mapping from x to t(x)
•  Hypothesis space:
   –  Given basis functions {h1,…,hk}
   –  Find coeffs w = {w1,…,wk}
•  Why is this usually called linear regression?
   –  The model is linear in the parameters
   –  Can we estimate functions that are not lines??? (Yes: the basis functions hi(x) may be nonlinear in x.)
•  Precisely, minimize the residual squared error:
   w* = argminw ∑j=1..N [tj − ∑i wi hi(xj)]²

Regression: matrix notation
•  Stack the N data points into the N×K matrix H with entries Hji = hi(xj) (N data points, K basis functions), the N observed outputs (measurements) into the vector t, and the weights into the vector w; the model is then t ≈ Hw.

Regression solution: simple matrix math
•  w* = (HᵀH)⁻¹ Hᵀ t, where HᵀH is a k×k matrix (for k basis functions) and Hᵀt is a k×1 vector.
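As a concrete illustration of the matrix solution (a minimal sketch; the polynomial basis and the toy data below are assumptions for the example, not from the slides), here is how H can be built and w* solved for with NumPy. In practice np.linalg.lstsq is preferred over explicitly forming (HᵀH)⁻¹, for numerical stability.

```python
import numpy as np

def polynomial_basis(x, k):
    """Design matrix H with H[j, i] = x_j**i for i = 0..k-1 (k basis functions)."""
    return np.vander(x, N=k, increasing=True)

# Illustrative data: N = 50 noisy observations of a cubic trend.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(scale=0.1, size=x.shape)

H = polynomial_basis(x, k=4)                     # N x k design matrix
w_normal_eq = np.linalg.solve(H.T @ H, H.T @ t)  # w* = (H^T H)^{-1} H^T t
w_lstsq, *_ = np.linalg.lstsq(H, t, rcond=None)  # numerically preferable

print(np.allclose(w_normal_eq, w_lstsq))         # True: same least-squares solution
```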
But, why?
•  Billionaire (again) says: Why sum squared error???
•  You say: Gaussians, Dr. Gateson, Gaussians…
•  Model: the prediction is a linear function plus Gaussian noise
   –  t(x) = ∑i wi hi(x) + ε
•  Learn w using MLE, by maximizing the log-likelihood of the data.

(Review, MLE for a Gaussian: setting the derivatives of the log-likelihood to zero gives
   ∑i=1..N (xi − µ)/σ² = 0  ⇒  µMLE = (1/N) ∑i xi
   −N/σ + ∑i=1..N (xi − µ)²/σ³ = 0  ⇒  σ²MLE = (1/N) ∑i (xi − µ)².)

Maximizing log-likelihood (maximize wrt w):

   argmaxw ln ∏j=1..N (1/(σ√2π)) exp(−[tj − ∑i wi hi(xj)]² / (2σ²))
   = argmaxw ∑j=1..N −[tj − ∑i wi hi(xj)]² / (2σ²)
   = argminw ∑j=1..N [tj − ∑i wi hi(xj)]²

Least-squares Linear Regression is MLE for Gaussians!!!
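A quick numerical sanity check of this equivalence (an illustrative sketch with made-up data, not from the slides): the Gaussian negative log-likelihood of the residuals is just the sum of squared errors scaled by 1/(2σ²) plus a constant, so both objectives pick out the same w.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
t = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=x.shape)
H = np.column_stack([np.ones_like(x), x])        # basis functions h1(x)=1, h2(x)=x
sigma = 0.2                                      # noise level, assumed known here

def sse(w):
    """Sum of squared errors for weights w."""
    return np.sum((t - H @ w) ** 2)

def neg_log_likelihood(w):
    """Gaussian NLL: N * ln(sigma * sqrt(2*pi)) + SSE(w) / (2*sigma^2)."""
    return len(t) * np.log(sigma * np.sqrt(2 * np.pi)) + sse(w) / (2 * sigma ** 2)

# Evaluate both objectives over a grid of candidate weight vectors:
# the w minimizing the SSE also minimizes the negative log-likelihood.
candidates = [np.array([a, b]) for a in np.linspace(0.0, 1.0, 11)
              for b in np.linspace(1.0, 3.0, 11)]
best_sse = min(candidates, key=sse)
best_nll = min(candidates, key=neg_log_likelihood)
print(np.allclose(best_sse, best_nll))           # True
```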
Bias-Variance tradeoff – Intuition
•  Model too simple: does not fit the data well
   –  A biased solution
   [Figure: polynomial fit with M = 0 to data t vs. x]
•  Model too complex: small changes to the data, and the solution changes a lot
   –  A high-variance solution
   [Figure: polynomial fit with M = 9 to data t vs. x]
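To see the high-variance behaviour concretely, the sketch below (illustrative only; the sine target, noise level, and grid are assumptions, not from the slides) fits M = 0 and M = 9 polynomials to two datasets that differ only in their noise, and measures how far apart the two fitted curves end up.

```python
import numpy as np

rng = np.random.default_rng(5)
g = lambda x: np.sin(2 * np.pi * x)            # assumed true curve (illustrative)
x = np.linspace(0, 1, 10)

# Two training sets that differ only in the noise realization.
t1 = g(x) + rng.normal(scale=0.2, size=x.shape)
t2 = g(x) + rng.normal(scale=0.2, size=x.shape)

xs = np.linspace(0, 1, 200)                    # dense grid for comparing the fits
for M in (0, 9):                               # too simple vs. too complex
    f1 = np.polyval(np.polyfit(x, t1, deg=M), xs)
    f2 = np.polyval(np.polyfit(x, t2, deg=M), xs)
    # M = 0: the two fits barely differ (low variance, high bias);
    # M = 9: the two fits can differ wildly (high variance).
    print(M, round(float(np.max(np.abs(f1 - f2))), 2))
```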
(Squared) Bias of learner
•  Given: dataset D with m samples
•  Learn: for different datasets D, you will get different functions h(x)
•  Expected prediction (averaged over hypotheses): ED[h(x)]
•  Bias: difference between the expected prediction and the truth
   –  Measures how well you expect to represent the true solution
   –  Decreases with a more complex model

Variance of learner
•  Given: dataset D with m samples
•  Learn: for different datasets D, you will get different functions h(x)
•  Expected prediction (averaged over hypotheses): ED[h(x)]
•  Variance: difference between what you expect to learn and what you learn from a particular dataset
   –  Measures how sensitive the learner is to the specific dataset
   –  Decreases with a simpler model

Bias-Variance decomposition of error
•  Consider a simple regression problem f: X → T
   f(x) = g(x) + ε, where g(x) is deterministic and the noise is ε ~ N(0, σ)
•  Collect some data, and learn a function h(x)
•  What are the sources of prediction error?

Sources of error 1 – noise
   f(x) = g(x) + ε
•  What if we have a perfect learner and infinite data?
   –  If our learning solution h(x) satisfies h(x) = g(x)
   –  We still have a remaining, unavoidable error of σ² due to the noise ε

Sources of error 2 – finite data
   f(x) = g(x) + ε
•  What if we have an imperfect learner, or only m training examples?
•  What is our expected squared error per example?
   –  Expectation taken over random training sets D of size m, drawn from the distribution P(X,T)

Bias-Variance Decomposition of Error (Bishop, Chapter 3)
•  Assume target function: t(x) = g(x) + ε
•  Then the expected squared error over fixed-size training sets D drawn from P(X,T) can be expressed as the sum of three components:
   ED[(t(x) − hD(x))²] = bias² + variance + noise
Where:
   bias² = (g(x) − ED[hD(x)])²
   variance = ED[(hD(x) − ED[hD(x)])²]
   noise = σ²
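A small simulation can make the decomposition tangible (an illustrative sketch; the sine target g, the noise level, the training-set size m, and the polynomial learner are all assumptions for the example): draw many random training sets D of size m, fit h_D, and estimate bias², variance, and noise at a single test point.

```python
import numpy as np

rng = np.random.default_rng(2)
g = lambda x: np.sin(2 * np.pi * x)     # assumed deterministic target g(x) (illustrative)
sigma, m, degree = 0.3, 25, 3           # noise level, training-set size, model complexity
x0 = 0.4                                # test point at which we measure the error

preds, sq_errors = [], []
for _ in range(2000):                   # many random training sets D of size m
    x = rng.uniform(0, 1, size=m)
    t = g(x) + rng.normal(scale=sigma, size=m)
    w = np.polyfit(x, t, deg=degree)    # least-squares polynomial fit h_D
    h_x0 = np.polyval(w, x0)            # h_D(x0)
    preds.append(h_x0)
    t_x0 = g(x0) + rng.normal(scale=sigma)       # a fresh noisy target at x0
    sq_errors.append((t_x0 - h_x0) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - g(x0)) ** 2   # (E_D[h(x0)] - g(x0))^2
variance = preds.var()                  # E_D[(h(x0) - E_D[h(x0)])^2]
noise = sigma ** 2
# The average squared error is approximately bias^2 + variance + noise.
print(np.mean(sq_errors), bias_sq + variance + noise)
```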
Bias-Variance Tradeoff
•  Choice of hypothesis class introduces learning bias
   –  More complex class → less bias
   –  More complex class → more variance

Training set error
•  Given a dataset (training data)
•  Choose a loss function
   –  e.g., squared error (L2) for regression
•  Training error: for a particular set of parameters, the loss function evaluated on the training data:
   errortrain(w) = (1/Ntrain) ∑j=1..Ntrain [tj − ∑i wi hi(xj)]²

Training error as a function of model complexity

Prediction error
•  Training set error can be a poor measure of the "quality" of the solution
•  Prediction error (true error): we really care about the error over all possibilities:
   errortrue(w) = ∫ [t(x) − ∑i wi hi(x)]² p(x) dx

Prediction error as a function of model complexity

Computing prediction error
•  To correctly compute the prediction error:
•  Hard integral!
•  May not know t(x) for every x, may not know p(x)
•  Monte Carlo integration (sampling approximation)
•  Sample a set of i.i.d. points {x1,…,xM} from p(x)
•  Approximate integral with sample average
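The sampling approximation can be written in a few lines of NumPy (an illustrative sketch; the true function g, the uniform input distribution p(x), and the fitted polynomial are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.sin(2 * np.pi * x)            # assumed true function (illustrative)
sigma = 0.3                                    # assumed noise level

# Fit some model h(x) on a small training set (illustrative cubic polynomial).
x_tr = rng.uniform(0, 1, size=20)
t_tr = g(x_tr) + rng.normal(scale=sigma, size=x_tr.shape)
w = np.polyfit(x_tr, t_tr, deg=3)
h = lambda x: np.polyval(w, x)

# Monte Carlo approximation of the prediction error:
# sample x1, ..., xM i.i.d. from p(x), observe t(x), and take the sample average.
M = 100_000
x_mc = rng.uniform(0, 1, size=M)               # samples from p(x) (uniform here)
t_mc = g(x_mc) + rng.normal(scale=sigma, size=M)
prediction_error = np.mean((t_mc - h(x_mc)) ** 2)
print(prediction_error)
```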
Why doesn't training set error approximate prediction error?
•  Sampling approximation of prediction error:
   errortrue(w) ≈ (1/M) ∑j=1..M [tj − ∑i wi hi(xj)]², with x1,…,xM sampled i.i.d. from p(x)
•  Training error:
   errortrain(w) = (1/Ntrain) ∑j=1..Ntrain [tj − ∑i wi hi(xj)]²
•  Very similar equations!!!
   –  Why is the training set a bad measure of prediction error???
   –  Because you cheated!!! Training error is a good estimate for a single, fixed w. But you optimized w with respect to the training error, and found a w that is good for this particular set of samples. Training error is therefore an (optimistically) biased estimate of the prediction error.

Test set error
•  Given a dataset, randomly split it into two parts:
   –  Training data – {x1,…, xNtrain}
   –  Test data – {x1,…, xNtest}
•  Use the training data to optimize the parameters w
•  Test set error: for the final solution w*, evaluate the error using:
   errortest(w*) = (1/Ntest) ∑j=1..Ntest [tj − ∑i w*i hi(xj)]²

Test set error as a function of model complexity (a numerical sketch follows the overfitting definition below)

Overfitting: this slide is so important we are looking at it again!
•  Assume:
   –  Data generated from a distribution D(X,Y)
   –  A hypothesis space H
•  Define: errors for a hypothesis h ∈ H
   –  Training error: errortrain(h)
   –  Data (true) error: errortrue(h)
•  We say h overfits the training data if there exists an h' ∈ H such that:
   errortrain(h) < errortrain(h')  and  errortrue(h) > errortrue(h')
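The sketch below ties these two slides together (illustrative only; the sine target, the noise level, the 30/10 split, and the polynomial degrees are assumptions): it splits one dataset into training and test parts, fits models of increasing complexity on the training part only, and prints both errors.

```python
import numpy as np

rng = np.random.default_rng(4)
g = lambda x: np.sin(2 * np.pi * x)            # assumed true function (illustrative)
sigma = 0.3

# Randomly split one dataset into training and test parts.
x = rng.uniform(0, 1, size=40)
t = g(x) + rng.normal(scale=sigma, size=x.shape)
idx = rng.permutation(len(x))
train, test = idx[:30], idx[30:]

for degree in (0, 1, 3, 9):                    # increasing model complexity
    w = np.polyfit(x[train], t[train], deg=degree)   # optimize w on training data only
    err_train = np.mean((t[train] - np.polyval(w, x[train])) ** 2)
    err_test = np.mean((t[test] - np.polyval(w, x[test])) ** 2)
    # Training error keeps shrinking with complexity; test error eventually grows (overfitting).
    print(degree, round(float(err_train), 3), round(float(err_test), 3))
```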
Summary: error estimators
•  Gold Standard: the true prediction error
•  Training: optimistically biased
•  Test: our final measure, unbiased?

Error as a function of the number of training examples, for a fixed model complexity
[Figure: error curves ranging from little data to infinite data]

Summary: error estimators (revisited)
•  Gold Standard: the true prediction error
•  Training: optimistically biased
•  Test: our final measure, unbiased? Be careful!!!
   –  The test set is only unbiased if you never never ever ever do any any any any learning on the test data
   –  For example, if you use the test set to select the degree of the polynomial… it is no longer unbiased!!! (We will address this problem later in the semester.)

What you need to know
•  Regression
   –  Basis function = features
   –  Optimizing the sum squared error
   –  Relationship between regression and Gaussians
•  Bias-Variance trade-off
•  Play with the applet
   –  http://mste.illinois.edu/users/exner/java.f/leastsquares/
•  True error, training error, test error
   –  Never learn on the test data
•  Overfitting