Week 4
Multiple regression analysis
More general regression model
Consider one Y variable and several independent variables Xi, e.g. X1, X2, X3.
- Data consist of n tuples (yi, xi1, xi2, xi3).
- Scatter plots show a linear association between Y and the X-variables.
- The observations on Y can be assumed to satisfy the following model:
  yi = β0 + β1 x1i + β2 x2i + β3 x3i + ei    for i = 1, ..., n
where ei is the error term; the data are used to estimate the model, and the fitted model is used for prediction.
Assumptions on the regression model
1. The relationship between the Y-variable and the X-variables is linear.
2. The error terms ei (measured by the residuals):
   - have zero mean (E(ei) = 0)
   - have the same standard deviation σe for each fixed x
   - are approximately normally distributed (typically true for large samples!)
   - are independent (true if the sample is a S.R.S.)
Such assumptions are necessary to derive the inferential methods for testing and prediction (to be seen later)!
WARNING: if the sample size is small (n < 50) and the errors are not normal, you can’t use these regression methods!
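One way to check these assumptions in practice (a minimal sketch, not part of the original notes; it assumes the cpu dataset and the variables time, linet, step and device from the CPU example later on) is to save the residuals from PROC REG and inspect them:

/* Sketch: save residuals and fitted values, then inspect them */
proc reg data=cpu;
  model time=linet step device;
  output out=diag r=resid p=pred;   /* residuals and predicted values */
run; quit;
/* Residuals vs fitted values: look for zero mean and constant spread */
proc gplot data=diag;
  plot resid*pred / vref=0;
run; quit;
/* Normality check of the residuals (normality tests and a QQ-plot) */
proc univariate data=diag normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;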
Parameter estimates
Suppose we have a random sample of n observations on Y and on k independent X-variables.
How do we estimate the values of the coefficients β's in the regression model of Y versus the X-variables?
Regression Parameter estimates
The parameter estimates are those values for the β's that minimize the sum of the squared errors:
  Σi (yi − ŷi)² = Σi [ yi − (β̂0 + β̂1 x1i + ... + β̂k xki) ]²
Thus the parameter estimates β̂ are those values for the β's that will make the model residuals as small as possible!
The fitted model to compute predictions for Y is
  ŷ = β̂0 + β̂1 x1 + ... + β̂k xk
Using Linear Algebra for model estimation
(section 12.9)
Let β = (β0, β1, ..., βk)^T be the parameter vector for the regression model.
The parameter estimates can be efficiently found using linear algebra as
  β̂ = (X^T X)^{-1} X^T Y
where X is the data matrix for the X-variables (with a leading column of 1s for the intercept), Y is the data vector for the response variable, X^T is the transpose of X, and (X^T X)^{-1} is the inverse of the matrix X^T X.
Hard to compute by hand – better use a computer!
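As an illustration of this formula (a sketch, not part of the original notes), β̂ = (X^T X)^{-1} X^T Y can be computed with SAS/IML; the dataset and variable names (cpu, time, linet, step, device) are those of the CPU usage example that follows:

/* Sketch: compute beta_hat = (X'X)^{-1} X'Y with SAS/IML */
proc iml;
  use cpu;
  read all var {linet step device} into Xv;  /* predictor columns */
  read all var {time} into Y;                /* response column */
  close cpu;
  n = nrow(Xv);
  X = j(n, 1, 1) || Xv;                      /* prepend a column of 1s for the intercept */
  beta_hat = inv(X`*X) * X` * Y;             /* (X'X)^{-1} X'Y */
  print beta_hat;
quit;

(In practice solve(X`*X, X`*Y) is numerically preferable to an explicit inverse, but inv() mirrors the formula above.)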
EXAMPLE: CPU usage
A study was conducted to examine which factors affect CPU usage. A set of 38 processes written in a programming language was considered. For each process, data were collected on
Y = CPU usage (time), in seconds,
X1 = the number of lines (linet), in thousands, generated by the process execution,
X2 = the number of programs (step) forming the process,
X3 = the number of mounted computer devices (device).
Problem: Estimate the regression model of Y on X1, X2 and X3:
  ŷ = β̂0 + β̂1 LINET + β̂2 STEP + β̂3 DEVICE
I) Exploratory data step: Are the associations between
Y and the x-variables linear?
Draw the scatter plot for each pair (Y, Xi)
[Scatter plots of CPU time vs. lines executed in the process, number of programs, and mounted devices. Do the plots show linearity?]
PROC REG - SAS OUTPUT
The REG Procedure
Parameter Estimates

Variable    DF   Parameter   Standard   t Value   Pr > |t|
                 Estimate    Error
Intercept    1    0.00147     0.01071     0.14     0.8920
Linet        1    0.02109     0.00271     7.79     <.0001
step         1    0.00924     0.00210     4.41     <.0001
device       1    0.01218     0.00288     4.23     0.0002
The fitted regression model is
  ŷ = 0.0014 + 0.021 LINET + 0.009 STEP + 0.012 DEVICE
Fitted model
The fitted regression model is
  ŷ = 0.0014 + 0.021 LINET + 0.009 STEP + 0.012 DEVICE
- The estimated β values measure the changes in Y for changes in the X's.
- For instance, for each increase of 1000 lines executed by the process (keeping the other variables fixed), the CPU usage time will increase by 0.021 seconds.
- Fixing the other variables, what happens to the CPU time if I add another device?
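A worked answer to the last question: with the other variables fixed, adding one mounted device changes the prediction by the device coefficient only, so the predicted CPU time increases by β̂3 × 1 ≈ 0.012 seconds.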
Interpretation of model parameters
In multiple regression
  yi = β0 + β1 x1i + β2 x2i + β3 x3i + ei    for i = 1, ..., n
the coefficient of an X-variable measures the predicted change in Y for a unit increase in that X-variable while the other independent variables stay constant.
For instance: β2 measures the change in Y for a unit increase of the variable X2 if the other x-variables X1 and X3 are fixed.
Are the estimated values
accurate?
Residual Standard Deviation (pg. 632)
Testing effects of individual variables (pg. 652-655)
How do we measure the accuracy of the
estimated parameter values? (page 632)
For a simple linear regression with one X, the standard deviations of the parameter estimates are defined as:
  σ(β̂1) = σe / √( Σ(xi − x̄)² )
  σ(β̂0) = σe · √( 1/n + x̄² / Σ(xi − x̄)² )
They are both functions of the error standard deviation σe, which can be regarded as a sort of standard deviation (spread) of the points around the line!
The error standard deviation is estimated by the residual standard deviation se (a.k.a. root mean square error):
  se = √( Σ(yi − ŷi)² / (n − 2) )
where the yi − ŷi are the residuals.
How do we interpret residual standard
deviation?
- Used as a coarse approximation of the prediction error for new y-values.
  The probable error in new predictions is +/- 2 se.
- se is also used in the formulas for the standard errors of the parameter estimates:
sˆ  se
0
1
x2

n  ( xi  x ) 2
sˆ  se
1
1
2
(
x

x
)
 i
These can be computed from the data and measure the noise in the parameter estimates.
For general regression models with k x-variables
- For k predictors, the standard errors of the parameter estimates have a complicated form... but they still depend on the error standard deviation σe!
- The residual standard deviation or root mean square error is defined as
  se = √MS(Residual) = √( SS(Residual) / (n − (k+1)) ) = √( Σ(yi − ŷi)² / (n − (k+1)) )
where k + 1 = number of parameters (β's).
This measures the precision of our predictions!
The REG Procedure
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       0.59705          0.19902       166.38    <.0001
Error              34       0.04067          0.00120
Corrected Total    37       0.63772

Root MSE           0.03459    R-Square    0.9362
Dependent Mean     0.15710    Adj R-Sq    0.9306
Coeff Var         22.01536
The root mean square error for the CPU usage regression model is computed above (Root MSE). It gives an estimate of the error standard deviation:
  se = 0.03459
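As a worked check using the Analysis of Variance table: se = √( SS(Residual) / (n − (k+1)) ) = √( 0.04067 / 34 ) ≈ 0.0346, which agrees with the Root MSE reported by SAS. The probable error in a new prediction is then roughly ±2 se ≈ ±0.07 seconds of CPU time.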
Inference about regression parameters!
- Regression estimates are affected by random error.
- The concepts of hypothesis testing and confidence intervals apply!
- The t-distribution is used to construct significance tests and confidence intervals for the “true” parameters of the population.
- Tests are often used to select those x-variables that have a significant effect on Y.
Tests on the slope for straight line regression
Consider the simple straight line case.
A common test on the slope is the test of the hypothesis “X has no effect on Y”, i.e. that the slope is equal to zero!
Or in statistical terms:
  Ho: β1 = 0  vs  Ha: β1 < 0   (X has a negative effect)
                  Ha: β1 ≠ 0   (X has a significant effect)
                  Ha: β1 > 0   (X has a positive effect)
The test is given by the t-statistic
  t = (β̂1 − 0) / s.e.(β̂1) = β̂1 / ( se / √( Σ(xi − x̄)² ) )
which has a t-distribution with n − 2 degrees of freedom!
Tests on the parameters in multiple regression
Assumptions on the data:
1. e1, e2, ..., en are independent of each other.
2. The ei are normally distributed with mean zero and have common variance σe².
Significance tests on parameter βj
We test the hypothesis “Xj has no effect on Y”, or in statistical terms:
  Ho: βj = 0  vs  Ha: βj < 0   (Xj has a negative effect on Y)
                  Ha: βj ≠ 0   (Xj has a significant effect on Y)
                  Ha: βj > 0   (Xj has a positive effect on Y)
The test is given by the t-statistic
  t = β̂j / s.e.(β̂j)
which has a t-distribution with n − (k+1) degrees of freedom.
Computed by SAS
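As a worked check of the degrees of freedom: in the CPU example n = 38 processes and k = 3 predictors, so the t-tests use n − (k+1) = 38 − 4 = 34 degrees of freedom, the same value that appears as the Error DF in the Analysis of Variance table.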
Tests in SAS
The test p-values for regression coefficients are computed by PROC REG.
SAS will produce the two-sided p-value.
If your alternative hypothesis is one-sided (either > or <), then find the one-sided p-value by dividing the p-value computed by SAS by 2:
  one-sided p-value = (two-sided p-value) / 2
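For instance, the two-sided p-value for device in the output below is 0.0002; if the alternative of interest is Ha: β3 > 0 (consistent with the positive estimate 0.01218), the one-sided p-value is 0.0002 / 2 = 0.0001.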
SAS Output
The REG Procedure
Parameter Estimates

Variable    DF   Parameter   Standard   t Value   Pr > |t|
                 Estimate    Error
Intercept    1    0.00147     0.01071     0.14     0.8920
Linet        1    0.02109     0.00271     7.79     <.0001
step         1    0.00924     0.00210     4.41     <.0001
device       1    0.01218     0.00288     4.23     0.0002

The "t Value" column gives the t-statistic value and the "Pr > |t|" column gives the p-value.
T-tests on each parameter value show that all the x-variables in the model are significant at the 5% level (p-values < 0.05).
The null hypothesis of no effect can be rejected, and we
conclude that there is a significant association between Y and
each x-variable.
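As a worked check, each t value in the table is the parameter estimate divided by its standard error; for linet, t = 0.02109 / 0.00271 ≈ 7.8, matching the reported 7.79.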
Test on the intercept β0
The REG Procedure
Parameter Estimates

Variable    DF   Parameter   Standard   t Value   Pr > |t|
                 Estimate    Error
Intercept    1    0.00147     0.01071     0.14     0.8920
Linet        1    0.02109     0.00271     7.79     <.0001
step         1    0.00924     0.00210     4.41     <.0001
device       1    0.01218     0.00288     4.23     0.0002
The test on the intercept says that the null hypothesis β0 = 0 cannot be rejected. The test p-value is 0.8920.
This suggests that the model could be fitted with no intercept!
This is not recommended, though, unless you know that Y = 0 when all the x-variables are equal to zero.
What do we do if a model parameter is not
significant?
- If the t-test on a parameter βj shows that the parameter value is not significantly different from zero, we should refit the regression model without the x-variable corresponding to βj (see the sketch below).
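For instance (a hypothetical sketch, not from the original notes): if the t-test had shown that device was not significant, the model would be refitted with the remaining predictors only:

/* Hypothetical refit: drop the non-significant x-variable (here device) */
proc reg data=cpu;
  model time=linet step;
run; quit;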
SAS Code for the CPU usage data
data cpu;
infile "C:\week5\cpudat.txt";
input time line step device;
linet=line/1000;
label time="CPU time in seconds" line="lines in program execution"
step="number of computer programs" device="mounted devices"
linet="lines in program (thousand)";
/*Exploratory data analysis */
/* computes correlation values between all variables in
dataset */
proc corr data=cpu;
run;
/* creates scatterplots between time vs linet, time vs step and
time vs device, respectively */
proc gplot data=cpu;
plot time*(linet step device);
run;
/* Regression analysis: fits model to predict time using
linet, step and device*/
proc reg data=cpu;
model time=linet step device;
plot time*linet /nostat ;
run; quit;
If you want to fit a model with no intercept use the following model statement:
model time=linet step device / noint;