CSE546: Linear Regression, Bias / Variance Tradeoff, Winter 2012
Luke Zettlemoyer (slides adapted from Carlos Guestrin)

Prediction of continuous variables
•  Billionaire says: Wait, that's not what I meant!
•  You say: Chill out, dude.
•  He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
•  You say: I can regress that…

Linear Regression
[Figure: scatter plot of data with a fitted regression line; the line gives the prediction at each input]

Ordinary Least Squares (OLS)
[Figure: fitted line showing an observation, its prediction, and the error (residual) between them]
The regression problem
•  Instances: <xj, tj>
•  Learn: mapping from x to t(x)
•  Hypothesis space:
   –  Given basis functions {h1,…,hk}
   –  Find coeffs w = {w1,…,wk}
•  Why is this usually called linear regression?
   –  The model is linear in the parameters
   –  Can we estimate functions that are not lines??? (Yes: the basis functions hi(x) may be nonlinear in x.)
•  Precisely, minimize the residual squared error:
   w* = argminw ∑j=1..N [tj − ∑i wi hi(xj)]²

Regression: matrix notation
•  Stack the N data points into the N×K matrix H with entries Hji = hi(xj) (N data points, K basis functions), the N observed outputs (measurements) into the vector t, and the weights into the vector w; the model is then t ≈ Hw.

Regression solution: simple matrix math
•  w* = (HᵀH)⁻¹ Hᵀ t, where HᵀH is a k×k matrix (for k basis functions) and Hᵀt is a k×1 vector.
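As a concrete illustration of the matrix solution (a minimal sketch; the polynomial basis and the toy data below are assumptions for the example, not from the slides), here is how H can be built and w* solved for with NumPy. In practice np.linalg.lstsq is preferred over explicitly forming (HᵀH)⁻¹, for numerical stability.

```python
import numpy as np

def polynomial_basis(x, k):
    """Design matrix H with H[j, i] = x_j**i for i = 0..k-1 (k basis functions)."""
    return np.vander(x, N=k, increasing=True)

# Illustrative data: N = 50 noisy observations of a cubic trend.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(scale=0.1, size=x.shape)

H = polynomial_basis(x, k=4)                     # N x k design matrix
w_normal_eq = np.linalg.solve(H.T @ H, H.T @ t)  # w* = (H^T H)^{-1} H^T t
w_lstsq, *_ = np.linalg.lstsq(H, t, rcond=None)  # numerically preferable

print(np.allclose(w_normal_eq, w_lstsq))         # True: same least-squares solution
```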
But, why?
•  Billionaire (again) says: Why sum squared error???
•  You say: Gaussians, Dr. Gateson, Gaussians…
•  Model: the prediction is a linear function plus Gaussian noise
   –  t(x) = ∑i wi hi(x) + ε
•  Learn w using MLE, by maximizing the log-likelihood of the data.

(Review, MLE for a Gaussian: setting the derivatives of the log-likelihood to zero gives
   ∑i=1..N (xi − µ)/σ² = 0  ⇒  µMLE = (1/N) ∑i xi
   −N/σ + ∑i=1..N (xi − µ)²/σ³ = 0  ⇒  σ²MLE = (1/N) ∑i (xi − µ)².)

Maximizing log-likelihood (maximize wrt w):

   argmaxw ln ∏j=1..N (1/(σ√2π)) exp(−[tj − ∑i wi hi(xj)]² / (2σ²))
   = argmaxw ∑j=1..N −[tj − ∑i wi hi(xj)]² / (2σ²)
   = argminw ∑j=1..N [tj − ∑i wi hi(xj)]²

Least-squares Linear Regression is MLE for Gaussians!!!
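A quick numerical sanity check of this equivalence (an illustrative sketch with made-up data, not from the slides): the Gaussian negative log-likelihood of the residuals is just the sum of squared errors scaled by 1/(2σ²) plus a constant, so both objectives pick out the same w.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
t = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=x.shape)
H = np.column_stack([np.ones_like(x), x])        # basis functions h1(x)=1, h2(x)=x
sigma = 0.2                                      # noise level, assumed known here

def sse(w):
    """Sum of squared errors for weights w."""
    return np.sum((t - H @ w) ** 2)

def neg_log_likelihood(w):
    """Gaussian NLL: N * ln(sigma * sqrt(2*pi)) + SSE(w) / (2*sigma^2)."""
    return len(t) * np.log(sigma * np.sqrt(2 * np.pi)) + sse(w) / (2 * sigma ** 2)

# Evaluate both objectives over a grid of candidate weight vectors:
# the w minimizing the SSE also minimizes the negative log-likelihood.
candidates = [np.array([a, b]) for a in np.linspace(0.0, 1.0, 11)
              for b in np.linspace(1.0, 3.0, 11)]
best_sse = min(candidates, key=sse)
best_nll = min(candidates, key=neg_log_likelihood)
print(np.allclose(best_sse, best_nll))           # True
```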
Bias-Variance tradeoff – Intuition
•  Model too simple: does not fit the data well
   –  A biased solution
   [Figure: polynomial fit with M = 0 to data t vs. x]
•  Model too complex: small changes to the data, and the solution changes a lot
   –  A high-variance solution
   [Figure: polynomial fit with M = 9 to data t vs. x]
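To see the high-variance behaviour concretely, the sketch below (illustrative only; the sine target, noise level, and grid are assumptions, not from the slides) fits M = 0 and M = 9 polynomials to two datasets that differ only in their noise, and measures how far apart the two fitted curves end up.

```python
import numpy as np

rng = np.random.default_rng(5)
g = lambda x: np.sin(2 * np.pi * x)            # assumed true curve (illustrative)
x = np.linspace(0, 1, 10)

# Two training sets that differ only in the noise realization.
t1 = g(x) + rng.normal(scale=0.2, size=x.shape)
t2 = g(x) + rng.normal(scale=0.2, size=x.shape)

xs = np.linspace(0, 1, 200)                    # dense grid for comparing the fits
for M in (0, 9):                               # too simple vs. too complex
    f1 = np.polyval(np.polyfit(x, t1, deg=M), xs)
    f2 = np.polyval(np.polyfit(x, t2, deg=M), xs)
    # M = 0: the two fits barely differ (low variance, high bias);
    # M = 9: the two fits can differ wildly (high variance).
    print(M, round(float(np.max(np.abs(f1 - f2))), 2))
```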
(Squared) Bias of learner
•  Given: dataset D with m samples
•  Learn: for different datasets D, you will get different functions h(x)
•  Expected prediction (averaged over hypotheses): ED[h(x)]
•  Bias: difference between the expected prediction and the truth
   –  Measures how well you expect to represent the true solution
   –  Decreases with a more complex model

Variance of learner
•  Given: dataset D with m samples
•  Learn: for different datasets D, you will get different functions h(x)
•  Expected prediction (averaged over hypotheses): ED[h(x)]
•  Variance: difference between what you expect to learn and what you learn from a particular dataset
   –  Measures how sensitive the learner is to the specific dataset
   –  Decreases with a simpler model

Bias-Variance decomposition of error
•  Consider a simple regression problem f: X → T
   f(x) = g(x) + ε, where g(x) is deterministic and the noise is ε ~ N(0, σ)
•  Collect some data, and learn a function h(x)
•  What are the sources of prediction error?

Sources of error 1 – noise
   f(x) = g(x) + ε
•  What if we have a perfect learner and infinite data?
   –  If our learning solution h(x) satisfies h(x) = g(x)
   –  We still have a remaining, unavoidable error of σ² due to the noise ε

Sources of error 2 – finite data
   f(x) = g(x) + ε
•  What if we have an imperfect learner, or only m training examples?
•  What is our expected squared error per example?
   –  Expectation taken over random training sets D of size m, drawn from the distribution P(X,T)

Bias-Variance Decomposition of Error (Bishop, Chapter 3)
•  Assume target function: t(x) = g(x) + ε
•  Then the expected squared error over fixed-size training sets D drawn from P(X,T) can be expressed as the sum of three components:
   ED[(t(x) − hD(x))²] = bias² + variance + noise
Where:
   bias² = (g(x) − ED[hD(x)])²
   variance = ED[(hD(x) − ED[hD(x)])²]
   noise = σ²
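A small simulation can make the decomposition tangible (an illustrative sketch; the sine target g, the noise level, the training-set size m, and the polynomial learner are all assumptions for the example): draw many random training sets D of size m, fit h_D, and estimate bias², variance, and noise at a single test point.

```python
import numpy as np

rng = np.random.default_rng(2)
g = lambda x: np.sin(2 * np.pi * x)     # assumed deterministic target g(x) (illustrative)
sigma, m, degree = 0.3, 25, 3           # noise level, training-set size, model complexity
x0 = 0.4                                # test point at which we measure the error

preds, sq_errors = [], []
for _ in range(2000):                   # many random training sets D of size m
    x = rng.uniform(0, 1, size=m)
    t = g(x) + rng.normal(scale=sigma, size=m)
    w = np.polyfit(x, t, deg=degree)    # least-squares polynomial fit h_D
    h_x0 = np.polyval(w, x0)            # h_D(x0)
    preds.append(h_x0)
    t_x0 = g(x0) + rng.normal(scale=sigma)       # a fresh noisy target at x0
    sq_errors.append((t_x0 - h_x0) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - g(x0)) ** 2   # (E_D[h(x0)] - g(x0))^2
variance = preds.var()                  # E_D[(h(x0) - E_D[h(x0)])^2]
noise = sigma ** 2
# The average squared error is approximately bias^2 + variance + noise.
print(np.mean(sq_errors), bias_sq + variance + noise)
```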
Bias-Variance Tradeoff
•  Choice of hypothesis class introduces learning bias
   –  More complex class → less bias
   –  More complex class → more variance

Training set error
•  Given a dataset (training data)
•  Choose a loss function
   –  e.g., squared error (L2) for regression
•  Training error: for a particular set of parameters, the loss function evaluated on the training data:
   errortrain(w) = (1/Ntrain) ∑j=1..Ntrain [tj − ∑i wi hi(xj)]²

Training error as a function of model complexity

Prediction error
•  Training set error can be a poor measure of the "quality" of the solution
•  Prediction error (true error): we really care about the error over all possibilities:
   errortrue(w) = ∫ [t(x) − ∑i wi hi(x)]² p(x) dx

Prediction error as a function of model complexity

Computing prediction error
•  To correctly compute the prediction error:
•  Hard integral!
•  May not know t(x) for every x, may not know p(x)
•  Monte Carlo integration (sampling approximation)
•  Sample a set of i.i.d. points {x1,…,xM} from p(x)
•  Approximate integral with sample average
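The sampling approximation can be written in a few lines of NumPy (an illustrative sketch; the true function g, the uniform input distribution p(x), and the fitted polynomial are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.sin(2 * np.pi * x)            # assumed true function (illustrative)
sigma = 0.3                                    # assumed noise level

# Fit some model h(x) on a small training set (illustrative cubic polynomial).
x_tr = rng.uniform(0, 1, size=20)
t_tr = g(x_tr) + rng.normal(scale=sigma, size=x_tr.shape)
w = np.polyfit(x_tr, t_tr, deg=3)
h = lambda x: np.polyval(w, x)

# Monte Carlo approximation of the prediction error:
# sample x1, ..., xM i.i.d. from p(x), observe t(x), and take the sample average.
M = 100_000
x_mc = rng.uniform(0, 1, size=M)               # samples from p(x) (uniform here)
t_mc = g(x_mc) + rng.normal(scale=sigma, size=M)
prediction_error = np.mean((t_mc - h(x_mc)) ** 2)
print(prediction_error)
```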
Why doesn't training set error approximate prediction error?
•  Sampling approximation of prediction error:
   errortrue(w) ≈ (1/M) ∑j=1..M [tj − ∑i wi hi(xj)]², with x1,…,xM sampled i.i.d. from p(x)
•  Training error:
   errortrain(w) = (1/Ntrain) ∑j=1..Ntrain [tj − ∑i wi hi(xj)]²
•  Very similar equations!!!
   –  Why is the training set a bad measure of prediction error???
   –  Because you cheated!!! Training error is a good estimate for a single, fixed w. But you optimized w with respect to the training error, and found a w that is good for this particular set of samples. Training error is therefore an (optimistically) biased estimate of the prediction error.

Test set error
•  Given a dataset, randomly split it into two parts:
   –  Training data – {x1,…, xNtrain}
   –  Test data – {x1,…, xNtest}
•  Use the training data to optimize the parameters w
•  Test set error: for the final solution w*, evaluate the error using:
   errortest(w*) = (1/Ntest) ∑j=1..Ntest [tj − ∑i w*i hi(xj)]²

Test set error as a function of model complexity (a numerical sketch follows the overfitting definition below)

Overfitting: this slide is so important we are looking at it again!
•  Assume:
   –  Data generated from a distribution D(X,Y)
   –  A hypothesis space H
•  Define: errors for a hypothesis h ∈ H
   –  Training error: errortrain(h)
   –  Data (true) error: errortrue(h)
•  We say h overfits the training data if there exists an h' ∈ H such that:
   errortrain(h) < errortrain(h')  and  errortrue(h) > errortrue(h')
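The sketch below ties these two slides together (illustrative only; the sine target, the noise level, the 30/10 split, and the polynomial degrees are assumptions): it splits one dataset into training and test parts, fits models of increasing complexity on the training part only, and prints both errors.

```python
import numpy as np

rng = np.random.default_rng(4)
g = lambda x: np.sin(2 * np.pi * x)            # assumed true function (illustrative)
sigma = 0.3

# Randomly split one dataset into training and test parts.
x = rng.uniform(0, 1, size=40)
t = g(x) + rng.normal(scale=sigma, size=x.shape)
idx = rng.permutation(len(x))
train, test = idx[:30], idx[30:]

for degree in (0, 1, 3, 9):                    # increasing model complexity
    w = np.polyfit(x[train], t[train], deg=degree)   # optimize w on training data only
    err_train = np.mean((t[train] - np.polyval(w, x[train])) ** 2)
    err_test = np.mean((t[test] - np.polyval(w, x[test])) ** 2)
    # Training error keeps shrinking with complexity; test error eventually grows (overfitting).
    print(degree, round(float(err_train), 3), round(float(err_test), 3))
```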
Summary: error estimators
•  Gold Standard: the true prediction error
•  Training: optimistically biased
•  Test: our final measure, unbiased?

Error as a function of the number of training examples, for a fixed model complexity
[Figure: error curves ranging from little data to infinite data]

Summary: error estimators (revisited)
•  Gold Standard: the true prediction error
•  Training: optimistically biased
•  Test: our final measure, unbiased? Be careful!!!
   –  The test set is only unbiased if you never never ever ever do any any any any learning on the test data
   –  For example, if you use the test set to select the degree of the polynomial… it is no longer unbiased!!! (We will address this problem later in the semester.)

What you need to know
•  Regression
   –  Basis function = features
   –  Optimizing the sum squared error
   –  Relationship between regression and Gaussians
•  Bias-Variance trade-off
•  Play with the applet
   –  http://mste.illinois.edu/users/exner/java.f/leastsquares/
•  True error, training error, test error
   –  Never learn on the test data
•  Overfitting