Outline

Ordinary least squares regression

Ridge regression
Ordinary least squares regression (OLS)

Model:
$y = \beta_0 + \beta^T X + \text{error}$
$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \text{error}$

[Diagram: inputs $x_1, x_2, \ldots, x_p$ feeding into the response $y$]

Terminology:
$\beta_0$: intercept (or bias)
$\beta_1, \ldots, \beta_p$: regression coefficients (or weights)

The response variable responds directly and linearly to changes in the inputs.
Least squares regression

Assume that we have observed a training set of data:

Case   X1      X2      ...   Xp      Y
1      x_11    x_21    ...   x_p1    y_1
2      x_12    x_22    ...   x_p2    y_2
3      x_13    x_23    ...   x_p3    y_3
...
N      x_1N    x_2N    ...   x_pN    y_N

Estimate the $\beta$ coefficients by minimizing the residual sum of squares

$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$
Matrix formulation of OLS regression

Differentiating the residual sum of squares

$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$

and setting the first derivatives equal to zero, we obtain

$X^T (y - X\beta) = 0$

where

$X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1N} & x_{2N} & \cdots & x_{pN} \end{pmatrix}$
and
$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$
Parameter estimates and predictions

Least squares estimates of the parameters minimizing

$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$

are given by

$\hat\beta = (X^T X)^{-1} X^T y$

Predicted values:

$\hat y = X\hat\beta = X (X^T X)^{-1} X^T y = Hy$
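As an illustration (not part of the original lecture, which uses SAS), a minimal Python/NumPy sketch of these closed-form expressions on simulated data; the variable names and the simulated data are hypothetical.

import numpy as np

# Minimal sketch: closed-form OLS estimate and the hat matrix H = X (X^T X)^{-1} X^T
rng = np.random.default_rng(0)                               # simulated (hypothetical) data
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # first column of ones for the intercept
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
y_hat = H @ y                                  # identical to X @ beta_hat
rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
print(beta_hat, rss)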
Different sources of inputs

- Quantitative inputs
- Transformations of quantitative inputs
- Numeric or dummy coding of the levels of qualitative inputs
- Interactions between variables (e.g. X3 = X1·X2)

Example of dummy coding (month of the year):
X1 = 1 if Jan, 0 otherwise
X2 = 1 if Feb, 0 otherwise
...
X11 = 1 if Nov, 0 otherwise
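A minimal Python/pandas sketch of this month coding (not from the lecture; the sample data are hypothetical, and December is taken as the reference level):

import pandas as pd

# Hypothetical month labels for six cases
months = pd.Series(["Jan", "Feb", "Nov", "Dec", "Jan", "Mar"], name="month")

# One 0/1 indicator column per observed level; dropping "Dec" makes it the
# reference level, mirroring the X1 (Jan), ..., X11 (Nov) coding above
dummies = pd.get_dummies(months).drop(columns=["Dec"])
print(dummies)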
An example of multiple linear regression

Response variable:
Requested price of used Porsche cars (1000 SEK)

Inputs:
X1 = Manufacturing year
X2 = Mileage (km)
X3 = Model (0 or 1)
X4 = Equipment (1, 2, 3)
X5 = Colour (Red, Black, Silver, Blue, White, Green)
Price of used Porsche cars

Response variable:
Requested price of used Porsche cars (1000 SEK)

Inputs:
X1 = Manufacturing year
X2 = Mileage (km)

Inputs          Estimated model                                     RSS
Year            Price = -76829 + 38.6·Year                          113030
Mileage         Price = 430.7 - 0.001862·Mileage                    230212
Year, Mileage   Price = -63809 + 32.1·Year - 0.000789·Mileage        92541
Interpretation of multiple regression coefficients

Assume that

$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$

and that the regression coefficients are estimated by ordinary least squares regression.

Then the multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$, after $x_j$ has been adjusted for $x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$.
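A minimal Python/NumPy sketch of this interpretation (not from the lecture; simulated data): the multiple-regression coefficient of x2 equals the simple-regression coefficient of y on the residual of x2 after x2 has been adjusted for the intercept and x1.

import numpy as np

rng = np.random.default_rng(1)               # simulated (hypothetical) data
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)           # correlated inputs
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # full multiple regression

# Adjust x2 for the intercept and x1, then regress y on the residual
Z = np.column_stack([np.ones(n), x1])
x2_res = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]
beta2_partial = (x2_res @ y) / (x2_res @ x2_res)

print(beta[2], beta2_partial)                # the two estimates coincide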
Confidence intervals for regression parameters

Assume that

$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$

where the X-variables are fixed and the error terms are i.i.d. and $N(0, \sigma^2)$.

Then

$\beta_j \in \hat\beta_j \pm t_{0.05}(N - p - 1) \sqrt{v_j}\,\hat\sigma$   (95%)

where $v_j$ is the jth diagonal element of $(X^T X)^{-1}$.
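A minimal Python sketch of this interval (not from the lecture; simulated data, and assuming the slide's $t_{0.05}(N-p-1)$ denotes the two-sided 95% critical value, i.e. the 97.5th percentile of the t distribution):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)                               # simulated (hypothetical) data
N, p = 60, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept column + p inputs
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))   # estimate of sigma
v = np.diag(XtX_inv)                               # v_j, diagonal of (X^T X)^{-1}

t_crit = stats.t.ppf(0.975, df=N - p - 1)          # two-sided 95% critical value
lower = beta_hat - t_crit * np.sqrt(v) * sigma_hat
upper = beta_hat + t_crit * np.sqrt(v) * sigma_hat
print(np.column_stack([lower, beta_hat, upper]))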
Interpretation of software outputs

Regression of the price of used Porsche cars vs mileage (km) and manufacturing year

Predictor      Coef        SE Coef     T       P
Constant       430.69      17.42       24.72   0.000
Milage (km)    -0.0018621  0.0002959   -6.29   0.000

Predictor      Coef        SE Coef     T       P
Constant       -63809      6976        -9.15   0.000
Milage (km)    -0.0007894  0.0002222   -3.55   0.001
Year           32.103      3.486        9.21   0.000

Adding new independent variables to a regression model alters at least one of the old regression coefficients unless the columns of the X-matrix are orthogonal, i.e.

$\sum_{i=1}^{N} x_{ij} x_{ik} = 0$
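A minimal Python/NumPy sketch of this point (not from the lecture; simulated data): a new column that has been made orthogonal to the existing columns of X leaves the old coefficients unchanged.

import numpy as np

rng = np.random.default_rng(3)               # simulated (hypothetical) data
N = 100
x1 = rng.normal(size=N)
y = 2.0 + 1.5 * x1 + rng.normal(size=N)

X_old = np.column_stack([np.ones(N), x1])
beta_old = np.linalg.lstsq(X_old, y, rcond=None)[0]

# A new column made orthogonal to both existing columns via projection
x2 = rng.normal(size=N)
x2_orth = x2 - X_old @ np.linalg.lstsq(X_old, x2, rcond=None)[0]

beta_new = np.linalg.lstsq(np.column_stack([X_old, x2_orth]), y, rcond=None)[0]
print(beta_old)        # intercept and slope for x1
print(beta_new[:2])    # unchanged, because x2_orth is orthogonal to X_old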
Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Step                 1         2         3         4
Constant        -76829    -63809    -53285    -52099

Year              38.6      32.1      26.8      26.2
T-Value          11.87      9.21      7.00      6.88
P-Value          0.000     0.000     0.000     0.000

Milage (km)              -0.00079  -0.00066  -0.00062
T-Value                     -3.55     -3.08     -2.88
P-Value                     0.001     0.003     0.006

Model                                     37        27
T-Value                                 2.72      1.83
P-Value                                0.009     0.073

Equipment                                         11.0
T-Value                                           1.52
P-Value                                          0.135

S                 44.1      40.3      38.2      37.8
R-Sq             70.82     76.11     78.89     79.74
R-Sq(adj)        70.32     75.27     77.76     78.27
Mallows Cp        23.8      11.3       5.7       5.4
Classical statistical model selection techniques are model-based. In data mining, the model selection is data-driven.

The p-value refers to a t-test of the hypothesis that the regression coefficient of the last entered x-variable is zero.
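A minimal Python sketch (not from the lecture, which uses Minitab/SAS) of the forward half of such a procedure: at each step the candidate with the smallest t-test p-value is entered, as long as it falls below alpha-to-enter. The alpha-to-remove (backward) check is omitted for brevity, and the function and argument names are hypothetical.

import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y: pd.Series, alpha_enter: float = 0.15):
    # Greedy forward selection based on the p-value of the last entered variable
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for c in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]          # t-test p-value of candidate c
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

With a DataFrame of candidate inputs (e.g. Year, Milage, Model, Equipment) and the price as y, this returns the entered variables in order.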
Stepwise Regression: Price (1000SEK) versus Year, Milage (km), ...
- model validation by visual inspection of residuals

Residual = Observed - Predicted

[Figure: two residual plots for the response Price (1000SEK) - residuals versus fitted values, and residuals versus Milage (km).]
The Gram-Schmidt procedure for regression by successive orthogonalization and simple linear regression

1. Initialize $z_0 = x_0 = 1$
2. For $j = 1, \ldots, p$, compute
   $z_j = x_j - \sum_{k=0}^{j-1} \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle} z_k = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k$,
   where $\langle \cdot, \cdot \rangle$ denotes the inner product (the sum of coordinate-wise products)
3. Regress $y$ on $z_p$ to obtain the multiple regression coefficient $\hat\beta_p$
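A minimal Python/NumPy sketch of this procedure (not from the lecture; simulated data), checking that the coefficient from step 3 matches the last coefficient of the ordinary least squares fit.

import numpy as np

rng = np.random.default_rng(4)               # simulated (hypothetical) data
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# Step 1: z_0 = x_0 = 1 (the intercept column)
Z = [np.ones(N)]

# Step 2: orthogonalize each x_j against z_0, ..., z_{j-1}
for j in range(p):
    xj = X[:, j]
    zj = xj - sum((zk @ xj) / (zk @ zk) * zk for zk in Z)
    Z.append(zj)

# Step 3: regress y on z_p to get the multiple regression coefficient beta_hat_p
zp = Z[-1]
beta_p_gs = (zp @ y) / (zp @ zp)

# Compare with the ordinary least squares solution
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
print(beta_p_gs, beta_ols[-1])   # the two values agree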
Prediction of a response variable using correlated explanatory variables
- daily temperatures in Stockholm, Göteborg, and Malmö

[Figure: pairwise scatter plots of daily temperatures in Stockholm, Göteborg, and Malmö plotted against each other.]
Absorbance records for ten samples of chopped meat

[Figure: absorbance versus channel (1-100) for Sample_1 through Sample_10.]

1 response variable (protein)

100 predictors (absorbance at 100 wavelengths or channels)

The predictors are strongly correlated to each other.
Absorbance records for 240 samples of chopped meat

[Figure: Protein (%) versus absorbance in channel 50.]

The target is poorly correlated to each predictor.
Ridge regression

The ridge regression coefficients minimize a penalized residual sum of squares:

$\hat\beta^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{1i} - \ldots - \beta_p x_{pi} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

or

$\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_{1i} - \ldots - \beta_p x_{pi} \right)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$

Normally, inputs are centred prior to the estimation of regression coefficients.
Matrix formulation of ridge regression for centred inputs

$RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$

$\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$

If the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates: $\hat\beta^{ridge} = \gamma \hat\beta$, where $0 \le \gamma \le 1$.

Shrinking enables estimation of regression coefficients even if the number of parameters exceeds the number of cases.

Figure 3.7
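A minimal Python/NumPy sketch of the closed form above (not from the lecture; simulated data): inputs and response are centred, so the intercept is handled separately as the mean of y.

import numpy as np

rng = np.random.default_rng(5)                    # simulated (hypothetical) data
N, p = 50, 5
X = rng.normal(size=(N, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=N)     # two strongly correlated inputs
y = 3.0 + X @ np.array([1.0, 1.0, 0.0, -2.0, 0.5]) + rng.normal(size=N)

Xc = X - X.mean(axis=0)                           # centre the inputs
yc = y - y.mean()                                 # centre the response

lam = 10.0
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept = y.mean()                              # intercept for centred inputs

beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
print(beta_ols)     # OLS coefficients (can be erratic for the nearly collinear columns)
print(beta_ridge)   # ridge coefficients, shrunk toward zero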
Ridge regression – pros and cons

Ridge regression is particularly useful if the explanatory variables are strongly correlated to each other.

The variance of the estimated regression coefficients is reduced at the expense of (slightly) biased estimates.
The Gauss-Markov theorem

Consider a linear regression model in which:
– the inputs are regarded as fixed
– the error terms are i.i.d. with mean 0 and variance $\sigma^2$.

Then the least squares estimator of a parameter $a^T\beta$ has variance no bigger than that of any other linear unbiased estimator of $a^T\beta$.

Biased estimators may have smaller variance and mean squared error!
SAS code for an ordinary least squares regression
proc reg data=mining.dailytemperature outest = dtempbeta;
model daily_consumption = stockholm g_teborg malm_;
run;
SAS code for ridge regression
proc reg data=mining.dailytemperature outest = dtempbeta ridge=0 to 10 by 1;
model daily_consumption = stockholm g_teborg malm_;
proc print data=dtempbeta;
run;
_TYPE_  _DEPVAR_            _RIDGE_    _RMSE_   Intercept   STOCKHOLM   G_TEBORG     MALM_
PARMS   Daily_Consumption         .   30845.8    480268.9     -5364.6     -548.3   -3598.2
RIDGE   Daily_Consumption         0   30845.8    480268.9     -5364.6     -548.3   -3598.2
RIDGE   Daily_Consumption         1   36314.6    462824.0     -2327.8    -2357.6   -2512.6
RIDGE   Daily_Consumption         2   43008.7    450349.7     -1830.1    -1899.4   -2011.6
RIDGE   Daily_Consumption         3   48325.9    442054.5     -1514.3    -1584.8   -1674.9
RIDGE   Daily_Consumption         4   52401.2    436146.6     -1292.7    -1358.6   -1434.4
RIDGE   Daily_Consumption         5   55571.5    431726.2     -1128.0    -1188.6   -1254.1
RIDGE   Daily_Consumption         6   58092.1    428294.6     -1000.8    -1056.3   -1114.1
RIDGE   Daily_Consumption         7   60138.0    425553.4      -899.4     -950.4   -1002.1
RIDGE   Daily_Consumption         8   61829.0    423313.5      -816.7     -863.8    -910.6
RIDGE   Daily_Consumption         9   63248.9    421448.8      -747.9     -791.7    -834.4
RIDGE   Daily_Consumption        10   64457.3    419872.4      -689.8     -730.6    -770.0