Stat 112 Notes 6
• Today:
– Chapter 4.1 (Introduction to Multiple Regression)
Multiple Regression
• In multiple regression analysis, we consider more than one explanatory variable, $X_1, \ldots, X_K$. We are interested in the conditional mean of Y given $X_1, \ldots, X_K$, $E(Y \mid X_1, \ldots, X_K)$.
• Two motivations for multiple regression:
– We can obtain better predictions of Y by using information on $X_1, \ldots, X_K$ rather than just $X_1$.
– We can control for lurking variables.
Automobile Example
• A team charged with designing a new automobile
is concerned about the gas mileage (gallons per
1000 miles on a highway) that can be achieved.
The design team is interested in two things:
(1) Which characteristics of the design are likely
to affect mileage?
(2) A new car is planned to have the following
characteristics: weight – 4000 lbs, horsepower –
200, length – 200 inches, seating – 5 adults.
Predict the new car’s gas mileage.
• The team has available information about gallons
per 1000 miles and four design characteristics
(weight, horsepower, length, seating) for a
sample of cars made in 2004. Data is in
car04.JMP.
Multivariate
Correlations
              GP1000M_Hwy  Weight(lb)  Horsepower  Length  Seating
GP1000M_Hwy        1.0000      0.8575      0.6120  0.3912   0.3993
Weight(lb)         0.8575      1.0000      0.6434  0.7023   0.5858
Horsepower         0.6120      0.6434      1.0000  0.4910   0.0642
Length             0.3912      0.7023      0.4910  1.0000   0.6010
Seating            0.3993      0.5858      0.0642  0.6010   1.0000
20 rows not used due to missing or excluded values or frequency or weight variables missing, negative or less than one.
Scatterplot Matrix
[JMP scatterplot matrix: pairwise scatterplots of GP1000M_Hwy, Weight(lb), Horsepower, Length, and Seating.]
Best Single Predictor
• To obtain the correlation matrix and pairwise scatterplots, click Analyze, Multivariate Methods, Multivariate.
• If we use simple linear regression with each of the four explanatory variables, which provides the best predictions?
Best Single Predictor
• Answer: The simple linear regression that has the highest $R^2$ gives the best predictions, because recall that
$R^2 = 1 - \dfrac{SSE}{SST}$
• Weight gives the best predictions of GP1000M_Hwy based on simple linear regression.
• But we can obtain better predictions by using more than one of the independent variables (see the sketch below).
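A minimal sketch of how this comparison could be reproduced outside JMP, assuming car04.JMP has been exported to a CSV file; the file name car04.csv and the exact column names are assumptions, not part of the original notes:

```python
# Sketch only: reproduce the correlation matrix and compare single predictors.
# Assumes car04.JMP was exported to "car04.csv" with the column names below.
import pandas as pd

cols = ["GP1000M_Hwy", "Weight(lb)", "Horsepower", "Length", "Seating"]
df = pd.read_csv("car04.csv")[cols].dropna()  # JMP dropped 20 incomplete rows

print(df.corr())  # pairwise correlation matrix, as in the Multivariate output

# In simple linear regression, R^2 is the squared correlation between Y and X,
# so the predictor most strongly correlated with GP1000M_Hwy predicts best.
r2 = df.corr()["GP1000M_Hwy"].drop("GP1000M_Hwy") ** 2
print(r2.sort_values(ascending=False))  # Weight(lb) comes out on top (about 0.735)
```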
Multiple Linear Regression Model
$E(Y \mid X_1, \ldots, X_K) = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K$
$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + e_i$
For each possible value of $(X_1, \ldots, X_K)$, there is a subpopulation.
Assumptions of the multiple linear regression model:
(1) Linearity: the means of the subpopulations are a linear function of $(X_1, \ldots, X_K)$, i.e., $E(Y \mid X_1, \ldots, X_K) = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K$ for some $(\beta_0, \ldots, \beta_K)$.
(2) Constant variance: the subpopulation standard deviations are all equal (to $\sigma_e$).
(3) Normality: The subpopulations are normally distributed.
(4) Independence: The observations are independent.
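As a rough illustration of what assumptions (1)-(4) say about how the data arise, here is a small simulation sketch; the sample size, coefficients, predictor ranges, and error standard deviation below are all made up for illustration, not estimates from the car data:

```python
# Sketch only: simulate data satisfying the multiple regression assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 200
beta = np.array([40.0, 0.01, -0.2])                  # (beta_0, beta_1, beta_2), made up
X = rng.uniform([2000.0, 140.0], [5000.0, 210.0], size=(n, 2))  # two explanatory variables
sigma_e = 3.0                                        # constant variance (assumption 2)
e = rng.normal(0.0, sigma_e, size=n)                 # normal, independent errors (3), (4)
y = beta[0] + X @ beta[1:] + e                       # linear mean function (assumption 1)
```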
Point Estimates for Multiple Linear Regression Model
• We use the same least squares procedure as for simple linear regression.
• Our estimates $b_0, \ldots, b_K$ of $\beta_0, \ldots, \beta_K$ are the coefficients that minimize the sum of squared prediction errors:
$(b_0, \ldots, b_K) = \arg\min_{b_0^*, \ldots, b_K^*} \sum_{i=1}^{n} \left( y_i - b_0^* - b_1^* x_{i1} - \cdots - b_K^* x_{iK} \right)^2$
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_K x_K$
• Least Squares in JMP: Click Analyze, Fit Model, put the dependent variable into Y and add the independent variables to the construct model effects box (a sketch of the same fit outside JMP follows below).
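For readers without JMP, a minimal sketch of the same least squares fit using numpy and pandas, again assuming the hypothetical car04.csv export and column names:

```python
# Sketch only: least squares fit of GP1000M_Hwy on the four design variables.
import numpy as np
import pandas as pd

predictors = ["Weight(lb)", "Seating", "Horsepower", "Length"]
df = pd.read_csv("car04.csv")[["GP1000M_Hwy"] + predictors].dropna()

y = df["GP1000M_Hwy"].to_numpy()
X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy()])  # intercept column first

b, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared prediction errors
print(dict(zip(["Intercept"] + predictors, b.round(4))))  # compare to Parameter Estimates

y_hat = X @ b  # predicted values b0 + b1*x1 + ... + bK*xK
```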
Response GP1000M_Hwy
Summary of Fit
RSquare                        0.834148
RSquare Adj                    0.831091
Root Mean Square Error         3.082396
Mean of Response               39.75907
Observations (or Sum Wgts)     222
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept    42.198338    3.300533     12.79     <.0001
Weight(lb)    0.0102748   0.00052      19.77     <.0001
Seating       0.2748828   0.254288      1.08     0.2809
Horsepower    0.0189373   0.00524       3.61     0.0004
Length       -0.244818    0.02358     -10.38     <.0001
Root Mean Square Error
• Estimate of $\sigma_e$:
$s_e = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - K - 1}}$
• $s_e$ = Root Mean Square Error in JMP.
• For simple linear regression of GP1000M_Hwy on Weight, RMSE ≈ 3.87. For multiple linear regression of GP1000M_Hwy on weight, horsepower, length, and seating, RMSE ≈ 3.08 (see the sketch below).
• The multiple regression improves the predictions.
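A small helper matching the formula above; the function name is illustrative only:

```python
# Sketch only: root mean square error, s_e = sqrt( SSE / (n - K - 1) ).
import numpy as np

def root_mean_square_error(y, y_hat, K):
    """RMSE with K explanatory variables (K + 1 estimated coefficients)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    return np.sqrt(np.sum((y - y_hat) ** 2) / (n - K - 1))

# Used with the fit sketched earlier, root_mean_square_error(y, y_hat, K=4)
# should come out close to 3.08.
```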
Residuals and Root Mean Square Errors
• $\hat{E}(Y \mid X_1 = x_1, \ldots, X_K = x_K) = b_0 + b_1 x_1 + \cdots + b_K x_K$
• Residual for observation i = prediction error for observation i =
$Y_i - \hat{E}(Y \mid X_1 = x_{i1}, \ldots, X_K = x_{iK}) = Y_i - b_0 - b_1 x_{i1} - \cdots - b_K x_{iK}$
• Root mean square error = Typical size of absolute value of
prediction error
• As with simple linear regression model, if multiple linear regression
model holds
– About 95% of the observations will be within two RMSEs of their
predicted value
• For car data, about 95% of the time, the actual GP1000M will be
within 2*3.08=6.16 GP1000M of the predicted GP1000M of the car
based on the car’s weight, horsepower, length and seating.
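Continuing the hypothetical numpy sketches above (which defined y, y_hat, and the RMSE helper), this rule of thumb can be checked directly:

```python
# Sketch only: fraction of cars whose actual GP1000M_Hwy is within 2 RMSEs of
# the value predicted from weight, horsepower, length, and seating.
# Continues the earlier sketches, which defined y, y_hat, and root_mean_square_error.
import numpy as np

s_e = root_mean_square_error(y, y_hat, K=4)
within_two_rmse = np.abs(y - y_hat) <= 2 * s_e
print(within_two_rmse.mean())  # roughly 0.95 if the model assumptions hold
```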
Residual Example
BMW 745i
Weight = 4376
Seating = 5
Horsepower = 325
Length = 198
$\hat{E}(Y \mid X_1, \ldots, X_4) = 42.19 + 0.01027 \times 4376 + 0.2749 \times 5 + 0.0189 \times 325 - 0.2448 \times 198 \approx 46.22$
Actual Y (GP1000M) for BMW 745i = 38.46
Residual = 38.46 - 46.22 = -7.76
The BMW is more fuel efficient (lower GP1000M) than we would
expect based on its weight, seating, horsepower and length.
The residuals and predicted values can be saved by clicking the red triangle next to Response after Fit Model, then clicking Save Columns and clicking Predicted Values and Residuals.
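The same BMW 745i calculation written as a short sketch, with the coefficients typed in from the Parameter Estimates table:

```python
# Sketch only: predicted GP1000M_Hwy and residual for the BMW 745i.
b = {"Intercept": 42.198338, "Weight(lb)": 0.0102748, "Seating": 0.2748828,
     "Horsepower": 0.0189373, "Length": -0.244818}
bmw = {"Weight(lb)": 4376, "Seating": 5, "Horsepower": 325, "Length": 198}

predicted = b["Intercept"] + sum(b[name] * value for name, value in bmw.items())
actual = 38.46
print(round(predicted, 2), round(actual - predicted, 2))  # about 46.22 and -7.76
```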
Interpretation of Regression Coefficients
• Gas mileage regression from
car04.JMP
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept    42.198338    3.300533     12.79     <.0001
Weight(lb)    0.0102748   0.00052      19.77     <.0001
Seating       0.2748828   0.254288      1.08     0.2809
Horsepower    0.0189373   0.00524       3.61     0.0004
Length       -0.244818    0.02358     -10.38     <.0001
Interpretation of coefficient $b_{\text{weight}} = 0.0103$: The mean of GP1000M_Hwy is estimated to increase by 0.0103 for a one-pound increase in weight, holding seating, horsepower, and length fixed.
$E(Y \mid X_1 = x_1 + 1, X_2 = x_2, \ldots, X_K = x_K) - E(Y \mid X_1 = x_1, \ldots, X_K = x_K) = (\beta_0 + \beta_1 (x_1 + 1) + \cdots + \beta_K x_K) - (\beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K) = \beta_1$
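As a quick numerical illustration of this identity, using the estimated coefficient in place of $\beta_1$: holding seating, horsepower, and length fixed, a 100-pound increase in weight changes the estimated mean highway fuel consumption by
$100 \times b_{\text{weight}} \approx 100 \times 0.0103 = 1.03$ gallons per 1000 miles.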