CSE546: Linear Regression, Bias / Variance Tradeoff, Winter 2012
Luke Zettlemoyer
Slides adapted from Carlos Guestrin

Prediction of continuous variables
• Billionaire says: Wait, that's not what I meant!
• You say: Chill out, dude.
• He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA.
• You say: I can regress that…

Linear Regression
[Figures: scatter plots of observed data with the fitted regression line, labeled "Prediction"]
Ordinary Least Squares (OLS)
[Figure: fitted line with one Observation marked, its Prediction on the line, and the Error or residual between them]
The regression problem
• Instances: <xj, tj>
• Learn: Mapping from x to t(x)
• Hypothesis space:
  – Given, basis functions {h1,…,hk}
  – Find coeffs w = {w1,…,wk}
• Why is this usually called linear regression?
  – model is linear in the parameters
• Can we estimate functions that are not lines???
• Precisely, minimize the residual squared error:

  w* = argminw ∑j=1..N [tj − ∑i wi hi(xj)]2

Regression: matrix notation
Stacking the N data points and K basis functions:

  t = H w + ε

where t is the N×1 vector of observed outputs (measurements), H is the N×K matrix of basis-function values with Hji = hi(xj), and w is the K×1 vector of weights.

Regression solution: simple matrix math

  w* = (HT H)−1 HT t

where HT H is a k×k matrix (for k basis functions) and HT t is a k×1 vector.
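A minimal sketch of the closed-form solution above in NumPy. The polynomial basis hi(x) = x^i and the toy salary-from-GPA data are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy data: predict a continuous t from a continuous x (e.g., salary from GPA).
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=50)
t = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=50)

# Basis functions: an illustrative polynomial basis h_i(x) = x^i.
K = 3
H = np.vander(x, K, increasing=True)   # N x K matrix, H[j, i] = h_i(x_j)

# Closed-form OLS solution w = (H^T H)^{-1} H^T t.
# np.linalg.solve is preferred over forming the inverse explicitly.
w = np.linalg.solve(H.T @ H, H.T @ t)
print("weights:", w)

# Predictions and residuals (observation minus prediction).
pred = H @ w
residuals = t - pred
print("sum of squared residuals:", np.sum(residuals ** 2))
```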
But, why?
• Billionaire (again) says: Why sum squared error???
• You say: Gaussians, Dr. Gateson, Gaussians…
• Model: prediction is linear function plus Gaussian noise
  – t(x) = ∑i wi hi(x) + ε
• Learn w using MLE:

Maximizing log-likelihood
Recall, setting derivatives of the Gaussian log-likelihood to zero (for estimating a Gaussian's parameters from samples x1,…,xN):

  ∂/∂µ: ∑i=1..N (xi − µ)/σ2 = 0  ⇒  −∑i=1..N xi + Nµ = 0
  ∂/∂σ: −N/σ + ∑i=1..N (xi − µ)2/σ3 = 0

Maximize wrt w:

  w* = argmaxw ln ∏j=1..N (1/(σ√2π)) exp(−[tj − ∑i wi hi(xj)]2 / 2σ2)
     = argmaxw ∑j=1..N −[tj − ∑i wi hi(xj)]2 / 2σ2
     = argminw ∑j=1..N [tj − ∑i wi hi(xj)]2
Least-squares Linear Regression is MLE for Gaussians!!!
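A small numeric check of this equivalence, sketched with NumPy/SciPy. Maximizing the Gaussian log-likelihood over w (here via scipy.optimize.minimize, with σ fixed to 1, an assumption for simplicity, since σ only scales the objective) recovers the same weights as the least-squares closed form:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
t = 1.0 - 2.0 * x + rng.normal(0, 0.3, size=40)
H = np.vander(x, 2, increasing=True)   # basis: h_0(x) = 1, h_1(x) = x

# Negative Gaussian log-likelihood as a function of w (sigma held fixed).
def neg_log_lik(w, sigma=1.0):
    resid = t - H @ w
    return np.sum(resid ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_lik, x0=np.zeros(2)).x      # MLE by numeric optimization
w_ols = np.linalg.solve(H.T @ H, H.T @ t)            # least-squares closed form
print(w_mle, w_ols)   # the two solutions agree up to optimizer tolerance
```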
Bias-Variance tradeoff – Intuition
• Model too simple: does not fit the data well
  – A biased solution
  [Figure: M = 0 polynomial fit, t vs. x]
• Model too complex: small changes to the data, solution changes a lot
  – A high-variance solution
  [Figure: M = 9 polynomial fit, t vs. x]
(Squared) Bias of learner
• Given: dataset D with m samples
• Learn: for different datasets D, you will get different functions h(x)
• Expected prediction (averaged over hypotheses): ED[h(x)]
• Bias: difference between expected prediction and truth
  – Measures how well you expect to represent true solution
  – Decreases with more complex model

Variance of learner
• Given: dataset D with m samples
• Learn: for different datasets D, you will get different functions h(x)
• Expected prediction (averaged over hypotheses): ED[h(x)]
• Variance: difference between what you expect to learn and what you learn from a particular dataset
  – Measures how sensitive learner is to specific dataset
  – Decreases with simpler model

Bias-Variance decomposition of error
• Consider simple regression problem f: X → T
  f(x) = g(x) + ε, with g deterministic and noise ε ~ N(0,σ)
• Collect some data, and learn a function h(x)
• What are sources of prediction error?

Sources of error 1 – noise
f(x) = g(x) + ε
• What if we have perfect learner, infinite data?
  – If our learning solution h(x) satisfies h(x) = g(x)
  – Still have remaining, unavoidable error of σ2 due to noise ε

Sources of error 2 – Finite data
f(x) = g(x) + ε
• What if we have imperfect learner, or only m training examples?
• What is our expected squared error per example?
  – Expectation taken over random training sets D of size m, drawn from distribution P(X,T)

Bias-Variance Decomposition of Error (Bishop Chapter 3)
Assume target function: t(x) = g(x) + ε
• Then expected squared error over fixed size training sets D drawn from P(X,T) can be expressed as sum of three components:

  expected loss = (bias)2 + variance + noise

Where:

  (bias)2 = ∫ (ED[h(x)] − g(x))2 p(x) dx
  variance = ∫ ED[(h(x) − ED[h(x)])2] p(x) dx
  noise = σ2
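A simulation in the spirit of this decomposition, sketched in NumPy. The sine target, noise level, and sample sizes are all illustrative assumptions; it fits polynomials of a given degree to many random training sets and estimates bias² and variance of the learned h(x) over a grid of test points:

```python
import numpy as np

rng = np.random.default_rng(2)
g = np.sin                       # deterministic target g(x); illustrative choice
sigma, m, trials = 0.3, 20, 500  # noise level, training set size, number of datasets D
x_grid = np.linspace(0, np.pi, 100)

def bias2_and_variance(degree):
    preds = np.empty((trials, x_grid.size))
    for k in range(trials):
        x = rng.uniform(0, np.pi, size=m)
        t = g(x) + rng.normal(0, sigma, size=m)   # t(x) = g(x) + eps
        w = np.polyfit(x, t, degree)              # least-squares polynomial fit
        preds[k] = np.polyval(w, x_grid)
    mean_pred = preds.mean(axis=0)                # estimate of E_D[h(x)]
    bias2 = np.mean((mean_pred - g(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

# Simple models: high bias, low variance. Complex models: the reverse.
for d in (0, 1, 3, 9):
    b2, v = bias2_and_variance(d)
    print(f"degree {d}: bias^2 = {b2:.3f}, variance = {v:.3f}")
```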
Bias-Variance Tradeoff
• Choice of hypothesis class introduces learning bias
  – More complex class → less bias
  – More complex class → more variance

Training set error
• Given a dataset (Training data)
• Choose a loss function
  – e.g., squared error (L2) for regression
• Training error: For a particular set of parameters, loss function on training data:

  errortrain(w) = (1/Ntrain) ∑j=1..Ntrain [tj − ∑i wi hi(xj)]2

Training error as a function of model complexity

Prediction error
• Training set error can be poor measure of "quality" of solution
• Prediction error (true error): We really care about error over all possibilities:

  errortrue(w) = ∫ [t(x) − ∑i wi hi(x)]2 p(x) dx

Prediction error as a function of model complexity

Computing prediction error
• To correctly predict error
• Hard integral!
• May not know t(x) for every x, may not know p(x)
• Monte Carlo integration (sampling approximation)
• Sample a set of i.i.d. points {x1,…,xM} from p(x)
• Approximate integral with sample average
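A sketch of this sampling approximation, reusing the assumed sine target and noise from the earlier examples: draw fresh i.i.d. points from p(x) and average the squared error, rather than computing the integral:

```python
import numpy as np

rng = np.random.default_rng(3)
g = np.sin
sigma = 0.3

# Fit a model on one training set.
x_train = rng.uniform(0, np.pi, size=20)
t_train = g(x_train) + rng.normal(0, sigma, size=20)
w = np.polyfit(x_train, t_train, 3)

# Monte Carlo approximation of prediction error:
# sample {x_1, ..., x_M} i.i.d. from p(x), average the squared error.
M = 100_000
x_mc = rng.uniform(0, np.pi, size=M)
t_mc = g(x_mc) + rng.normal(0, sigma, size=M)
pred_error = np.mean((t_mc - np.polyval(w, x_mc)) ** 2)
print("estimated prediction error:", pred_error)
```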
Why training set error doesn't approximate prediction error?
• Sampling approximation of prediction error:

  errortrue(w) ≈ (1/M) ∑k=1..M [t(xk) − ∑i wi hi(xk)]2

• Training error:

  errortrain(w) = (1/Ntrain) ∑j=1..Ntrain [tj − ∑i wi hi(xj)]2

• Very similar equations!!!
  – Why is training set a bad measure of prediction error???

Why training set error doesn't approximate prediction error?
• Sampling approximation of prediction error: Because you cheated!!!
  Training error is a good estimate for a single w, but you optimized w with respect to the training error, and found a w that is good for this set of samples.
• Training error: Training error is an (optimistically) biased estimate of prediction error.
• Very similar equations!!!
  – Why is training set a bad measure of prediction error???

Test set error
• Given a dataset, randomly split it into two parts:
  – Training data – {x1,…, xNtrain}
  – Test data – {x1,…, xNtest}
• Use training data to optimize parameters w
• Test set error: For the final solution w*, evaluate the error using:

  errortest(w*) = (1/Ntest) ∑j=1..Ntest [tj − ∑i wi* hi(xj)]2

Test set error as a function of model complexity

Overfitting: this slide is so important we are looking at it again!
• Assume:
  – Data generated from distribution D(X,Y)
– A hypothesis space H
• Define: errors for hypothesis h ∈ H
– Training error: errortrain(h)
– Data (true) error: errortrue(h)
• We say h overfits the training data if there exists an h' ∈ H such that:
  errortrain(h) < errortrain(h')
  and
  errortrue(h) > errortrue(h')
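A sketch that makes this definition concrete (the split sizes and data-generating function are assumptions): compare training and test error as model complexity grows; past some degree, training error keeps falling while held-out error rises:

```python
import numpy as np

rng = np.random.default_rng(4)
g = np.sin
x = rng.uniform(0, np.pi, size=60)
t = g(x) + rng.normal(0, 0.3, size=60)

# Random train/test split: learn w on train, evaluate the final w* on test.
idx = rng.permutation(60)
train, test = idx[:40], idx[40:]

for degree in (0, 1, 3, 9):
    w = np.polyfit(x[train], t[train], degree)
    err_train = np.mean((t[train] - np.polyval(w, x[train])) ** 2)
    err_test = np.mean((t[test] - np.polyval(w, x[test])) ** 2)
    print(f"degree {degree}: train {err_train:.3f}, test {err_test:.3f}")
```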
Summary: error estimators
• Gold Standard: prediction (true) error
• Training: optimistically biased
• Test: our final measure, unbiased?

Error as a function of number of training examples for a fixed model complexity
[Figure: error curves, from little data to infinite data]
Summary: error estimators
• Gold Standard: prediction (true) error
• Training: optimistically biased
• Test: our final measure, unbiased?
  – Be careful!!! Test set only unbiased if you never never ever ever do any any any any learning on the test data.
  – For example, if you use the test set to select the degree of the polynomial… no longer unbiased!!!
    (We will address this problem later in the semester)
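A sketch of the pitfall just described. The three-way split is an assumed remedy (the slides defer the full treatment to later in the semester): if you pick the polynomial degree by test error, that same test error is no longer an unbiased estimate, so hold out a separate validation set for selection instead:

```python
import numpy as np

rng = np.random.default_rng(5)
g = np.sin
x = rng.uniform(0, np.pi, size=90)
t = g(x) + rng.normal(0, 0.3, size=90)

idx = rng.permutation(90)
train, valid, test = idx[:50], idx[50:70], idx[70:]

def mse(w, rows):
    return np.mean((t[rows] - np.polyval(w, x[rows])) ** 2)

# Select the degree on the validation set, never on the test set...
fits = {d: np.polyfit(x[train], t[train], d) for d in range(8)}
best = min(fits, key=lambda d: mse(fits[d], valid))

# ...so the test error of the chosen model remains an unbiased estimate.
print("chosen degree:", best, "test error:", mse(fits[best], test))
```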
What you need to know
• Regression
  – Basis function = features
  – Optimizing sum squared error
  – Relationship between regression and Gaussians
• Bias-Variance trade-off
• Play with Applet
  – http://mste.illinois.edu/users/exner/java.f/leastsquares/
• True error, training error, test error
  – Never learn on the test data
• Overfitting