Regression
Jordan Boyd-Graber
University of Colorado Boulder
LECTURE 11
Slides adapted from Lauren Hannah
Linear Regression
[Figure: scatter plot of the data in the x–y plane]
Data are the set of inputs and outputs, D = {(xi, yi)}_{i=1}^n
Linear Regression
[Figure: the same scatter plot in the x–y plane]
In linear regression, the goal is to predict y from x using a linear
function
Linear Regression
[Figure: the scatter plot in the x–y plane]
Examples of linear regression:
• given a child’s age and gender, what is his/her height?
• given unemployment, inflation, number of wars, and economic
growth, what will the president’s approval rating be?
• given a browsing history, how long will a user stay on a page?
Linear Regression
[Figure: scatter plot with a data point labeled (xi, yi) and a fitted line f(x) = β0 + β1 x]
Multiple Covariates
Often, we have a vector of inputs where each represents a different
feature of the data
x = (x1 , . . . , xp )
The function fitted to the response is a linear combination of the covariates:

f(x) = β0 + Σ_{j=1}^p βj xj
Multiple Covariates
• Often, it is convenient to represent x as (1, x1 , . . . , xp )
• In this case x is a vector, and so is β (we’ll represent them in bold
face)
• This is the dot product between these two vectors
• This then becomes a sum (this should be familiar!)
βx = β0 + Σ_{j=1}^p βj xj
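A one-line check of the dot product in R (the numbers here are made up purely for illustration):

beta <- c(2.0, 0.5, -1.0)   # (beta0, beta1, beta2)
x <- c(1.0, 3.0, 4.0)       # (1, x1, x2), with the leading 1 multiplying the intercept
sum(beta * x)               # beta0 + beta1*x1 + beta2*x2 = 2.0 + 1.5 - 4.0 = -0.5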
Hyperplanes: Linear Functions in Multiple Dimensions
[Figure: fitted hyperplane over covariates x1 and x2 with response y]
Covariates
• Do not need to be the raw values of x1, x2, . . .
• Can be any feature or function of the data (see the R sketch below):
◦ Transformations like x2 = log(x1) or x2 = cos(x1)
◦ Basis expansions like x2 = x1², x3 = x1³, x4 = x1⁴, etc.
◦ Indicators of events like x2 = 1{−1 ≤ x1 ≤ 1}
◦ Interactions between variables like x3 = x1 x2
• Because of its simplicity and flexibility, it is one of the most widely
implemented regression techniques
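Since these covariates are just extra columns in the design, they can all be passed through R's formula interface. A minimal sketch, assuming a hypothetical data frame df with columns y, x1, and x2 (the simulated data below exists only to make the call runnable):

# Hypothetical data
set.seed(1)
df <- data.frame(x1 = runif(100, -3, 3), x2 = rnorm(100))
df$y <- 1 + 0.5 * df$x1^2 + df$x1 * df$x2 + rnorm(100)

# Transformation, basis expansion, indicator, and interaction as covariates
fit <- lm(y ~ cos(x1) + I(x1^2) + I(-1 <= x1 & x1 <= 1) + x1:x2, data = df)
summary(fit)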
Fitting a Regression
Plan
Fitting a Regression
Example
Regularized Regression
Wrapup
Fitting a Regression
Fitting a Linear Regression
[Figure: scatter plot with a fitted line and one residual (yi − β1 xi) marked]
Idea: minimize the Euclidean distance between data and fitted line
RSS(β) = (1/2) Σ_{i=1}^n (yi − βxi)²
Fitting a Regression
How to Find β
• Use calculus to find the value of β that minimizes the RSS
• The optimal value is
β̂ = (Σ_{i=1}^n yi xi) / (Σ_{i=1}^n xi²)
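A minimal R sketch of this estimator on simulated data (the data and names below are illustrative only); it matches the slope reported by lm once the intercept is dropped:

# Simulated data through the origin
set.seed(42)
x <- runif(50, -4, 4)
y <- 0.8 * x + rnorm(50)

beta.hat <- sum(y * x) / sum(x^2)   # closed-form least squares estimate
beta.hat
coef(lm(y ~ x + 0))                 # same value; the "+ 0" drops the intercept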
[Figure: scatter plot with the least-squares line]
Fitting a Regression
Prediction
• After finding β̂, we would like to predict an output value for a new
set of covariates
• We just find the point on the line that corresponds to the new
input:
ŷ = β0 + β1 x
For example, if the fitted line is ŷ = 1.0 + 0.5x and the new input is x = 5.0, the prediction is ŷ = 1.0 + 0.5 · 5 = 3.5.
Fitting a Regression
Probabilistic Interpretation
• Our analysis so far has not included any probabilities
• Linear regression does have a probabilistic (probability model-based) interpretation
[Figure: the scatter plot in the x–y plane]
Fitting a Regression
Probabilistic Interpretation
• Linear regression assumes that response values have a Gaussian
distribution around the linear mean function,
Yi | xi, β ∼ N(xi β, σ²)
• This is a discriminative model, where inputs x are not modeled
[Figure: the scatter plot in the x–y plane]
• Minimizing RSS is equivalent to maximizing conditional likelihood
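To see why, write out the conditional log likelihood:

log p(y1, ..., yn | x, β, σ²) = Σ_{i=1}^n log N(yi | xi β, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (yi − xi β)²

Only the last term depends on β, and it equals −RSS(β)/σ², so maximizing the conditional likelihood over β is the same as minimizing RSS(β).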
Example
Plan
Fitting a Regression
Example
Regularized Regression
Wrapup
Example
Example: Old Faithful
Example
Example: Old Faithful
We will predict the time that we will have to wait to see the next
eruption given the duration of the current eruption
[Figure: scatter plot of Waiting Time (min) against Current Eruption Time (min) for Old Faithful]
Example
Example: Old Faithful
We can fit the model, plot our data, and make a function for new predictions
> # Fit a linear model to the built-in faithful data
> fit.lm <- lm(waiting ~ eruptions, data = faithful)
>
> # Plot the data and add the fitted line
> plot(faithful$eruptions, faithful$waiting)
> abline(fit.lm, col = "red", lwd = 3)
>
> # Make a function for prediction
> fit.lm$coefficients[1]
(Intercept)
33.4744
> fit.lm$coefficients[2]
eruptions
10.72964
> faithful.fit <- function(x) fit.lm$coefficients[1] +
+     fit.lm$coefficients[2] * x
> x.pred <- c(2.0, 2.7, 3.8, 4.9)
> faithful.fit(x.pred)
[1] 54.93368 62.44443 74.24703 86.04964
Example
Example: Old Faithful
[Figure: the Old Faithful scatter plot with the fitted regression line overlaid]
Regularized Regression
Plan
Fitting a Regression
Example
Regularized Regression
Wrapup
Regularized Regression
Multivariate Linear Regression
Example: p = 1, have 2 points
[Figure: two panels showing a line fit through two data points]
• With p + 1 or fewer points, the line hits all of them (or p points, with mean-0 data)
• With slightly more than p + 1 points, the line still goes close to all of them
Regularized Regression
Noise, Bias, Variance Tradeoff
• Noise: Lower bound on
performance
• Bias: Error as a result of choosing the wrong model
• Variance: Variation due to
training sample and
randomization
• No model is perfect
• More complex models are more susceptible to errors due to variance
Regularized Regression
Multivariate Linear Regression
[Figure: illustrations of low vs. high bias and low vs. high variance in θ]
Why linear regression:
• has few parameters to estimate (p)
• really restrictive model: low variance, higher bias
• should be good for data with few observations and a large number of covariates...
• ...but we can't use it in this situation
Regularized Regression
Multivariate Linear Regression
Idea: if we have a large number of covariates compared to
observations, say n < 2p, best to estimate most coefficients as 0!
• not enough info to determine all coefficients
• try to estimate ones with strong signal
• set everything else to 0 (or close)
Coefficients of 0 may not be a bad assumption...
If we have 1,000s of coefficients, are they all equally important?
Regularized Regression
Gene Expression
Example: microarray gene expression data
• gene expression: want to measure the level at which information in
a gene is used in the synthesis of a functional gene product
(usually protein)
• can use gene expression data to determine subtype of cancer (e.g.
which type of Lymphoma B?) or predict recurrence, survival time,
etc
• problem: thousands of genes, hundreds of patients, p > n!
Intuition: only a handful of genes should affect outcomes
Regularized Regression
Gene Expression
• gene expression levels are continuous values
• data: observation i is gene expression levels from patient i,
attached to outcome for patient (survival time)
• covariates: expression levels for p genes
Regularized Regression
Gene Expression
• collinearity: does it matter which gene is selected for prediction?
No!
• overfitting: now fitting p′ non-zero coefficients to n observations with p′ ≪ n means less fitting of noise
Regularized Regression
Regularized Linear Regression
Regularization:
• still minimize the RSS
• place a penalty on large values for β1 , ..., βp (why not β0 ? can
always easily estimate mean)
• add this penalty to the objective function
• solve for β̂!
New objective function:

β̂ = argmin_β (1/2) Σ_{i=1}^n (yi − xi β)² + λ Σ_{j=1}^p penalty(βj)

λ acts as a weight on the penalty: low values mean few coefficients near 0, high values mean many coefficients near 0
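As a concrete instance, with a squared penalty on the coefficients (ridge regression, introduced below) the penalized objective has a closed-form minimizer, (X^T X + λI)^{-1} X^T y. A minimal R sketch on simulated data (names and data are illustrative, and the intercept is ignored for brevity):

# Simulated data: n observations, p covariates, only the first covariate matters
set.seed(7)
n <- 30; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] + rnorm(n)

lambda <- 1.0
beta.ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # (X'X + lambda I)^{-1} X'y
round(drop(beta.ridge), 2)   # most coefficients are shrunk toward 0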
Regularized Regression
Regularized Linear Regression
Regularization: what is a good penalty function?
Same as penalties used to fit errors:
• Ridge regression (squared penalty):

β̂_Ridge = argmin_β (1/2) Σ_{i=1}^n (yi − xi β)² + λ Σ_{j=1}^p βj²

• Lasso regression (absolute value penalty):

β̂_Lasso = argmin_β (1/2) Σ_{i=1}^n (yi − xi β)² + λ Σ_{j=1}^p |βj|
Regularized Regression
Comparing Ridge and Lasso
Ridge
• Objective: (1/2) Σ_{i=1}^n (yi − xi β)² + λ Σ_{j=1}^p βj²
• Estimator: closed form, (X^T X + λI)^{-1} X^T y
• Coefficients: most are close to 0
• Stability: robust to changes in X, y

Lasso
• Objective: (1/2) Σ_{i=1}^n (yi − xi β)² + λ Σ_{j=1}^p |βj|
• Estimator: not closed form
• Coefficients: most are exactly 0
• Stability: not robust to changes in X, y
Regularized linear regression is fantastic for low-signal datasets or those with p ≫ n
• Ridge: good when many coefficients affect the response but none are large (gene expression)
• Lasso: good when you want an interpretable estimator
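To try both penalties in R, one common option (an assumption here, not something used in these slides) is the glmnet package, where alpha = 0 gives the ridge penalty and alpha = 1 the lasso penalty. A minimal sketch, reusing the simulated X and y from the ridge sketch above:

library(glmnet)                        # assumes glmnet is installed
ridge.fit <- glmnet(X, y, alpha = 0)   # squared penalty
lasso.fit <- glmnet(X, y, alpha = 1)   # absolute value penalty
coef(lasso.fit, s = 0.1)               # lasso coefficients at lambda = 0.1; most are exactly 0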
Regularized Regression
Choosing λ
Both Ridge and Lasso have a tunable parameter, λ
• use cross validation to find best λ
λ̂ = argmin_λ Σ_{i=1}^n (yi − xi β̂_{−i,λ})²

where β̂_{−i,λ} is the estimate fit with penalty λ and observation i held out
• try out many values
• see how well it works on “development” data
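Continuing the hypothetical glmnet sketch above, its built-in k-fold cross validation automates this kind of search over λ:

cv.fit <- cv.glmnet(X, y, alpha = 1)   # cross validation over a grid of lambda values
cv.fit$lambda.min                      # the lambda with the lowest held-out error
coef(cv.fit, s = "lambda.min")         # coefficients at that lambda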
Wrapup
Plan
Fitting a Regression
Example
Regularized Regression
Wrapup
Wrapup
Regression
• Workhorse technique of data analysis
• Fundamental tool that we saw before (“Logistic Regression”)
• Important to understand interpretation of regression parameters