Class 16: Thursday, Nov. 4
• Note: I will e-mail you some info on the final project this weekend and will discuss it in class on Tuesday.
Predicting Emergency Calls to the
AAA Club
Response Calls

Summary of Fit
RSquare                       0.692384
RSquare Adj                   0.584719
Root Mean Square Error        1735.151
Mean of Response              4318.75
Observations (or Sum Wgts)    28

Parameter Estimates
Term                  Estimate     Std Error   t Ratio   Prob>|t|
Intercept             3628.7902    2153.788     1.68     0.1076
Average Temperature    -35.63182     51.52383  -0.69     0.4972
Range                  133.30434     50.85675   2.62     0.0164
Rain forecast          429.70588   1211.933     0.35     0.7266
Snow forecast          548.80038   1342.27      0.41     0.6870
Weekday              -1603.1        876.7378   -1.83     0.0824
Sunday               -1847.152     1212.612    -1.52     0.1433
Subzero               3857.6004    1489.803     2.59     0.0175
R-Squared

Summary of Fit
RSquare                       0.692384
RSquare Adj                   0.584719
Root Mean Square Error        1735.151
Mean of Response              4318.75
Observations (or Sum Wgts)    28
• R-squared: As in simple linear regression, it measures the proportion of variability in Y explained by the regression of Y on these X's. It is between 0 and 1; values nearer to 1 indicate that more variability is explained.
• Don't get excited just because R-squared increases when you add more variables to the model. Adding another explanatory variable will always increase R-squared. The right question is not whether R-squared has increased when we add an explanatory variable, but whether it has increased by a useful amount. The t-statistic and the associated p-value of the t-test for each coefficient answer this question.
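For readers following along outside JMP, here is a minimal sketch of where RSquare, RSquare Adj, and the per-coefficient t-tests come from, using Python's statsmodels. The AAA dataset is not included in these notes, so the columns below are simulated stand-ins (names such as AvgTemp and Rain are ours, not JMP's).

```python
# Minimal sketch: RSquare, RSquare Adj, and per-coefficient t-tests with
# statsmodels. The real Calls data are not included, so we simulate stand-ins.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 28
df = pd.DataFrame({
    "AvgTemp": rng.uniform(0, 50, n),
    "Range":   rng.uniform(5, 40, n),
    "Rain":    rng.integers(0, 2, n).astype(float),
    "Snow":    rng.integers(0, 2, n).astype(float),
    "Weekday": rng.integers(0, 2, n).astype(float),
    "Sunday":  rng.integers(0, 2, n).astype(float),
    "Subzero": rng.integers(0, 2, n).astype(float),
})
df["Calls"] = 3600 - 35 * df["AvgTemp"] + 130 * df["Range"] + rng.normal(0, 800, n)

X = sm.add_constant(df.drop(columns="Calls"))  # add the intercept column
fit = sm.OLS(df["Calls"], X).fit()

print(fit.rsquared, fit.rsquared_adj)  # RSquare, RSquare Adj
print(fit.summary())                   # estimates, std errors, t ratios, Prob>|t|
```

Adding any extra column to X can only increase fit.rsquared, while fit.rsquared_adj may fall, which is exactly the caution in the bullet above.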
Overall F-test
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       7        135532366      19361767    6.4309     0.0005
Error      20         60214949     3010747.4
C. Total   27        195747315
• Test of whether any of the predictors are useful: H_0: β_1 = ... = β_p = 0 vs. H_a: at least one of β_1, ..., β_p does not equal zero. Tests whether the model provides better predictions than the sample mean of Y.
• p-value for the test: Prob > F in the Analysis of Variance table.
• p-value = 0.0005: strong evidence that at least one of the predictors is useful for predicting ERS calls for the New York AAA club.
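• Where the F ratio comes from: each mean square is its sum of squares divided by its degrees of freedom (Model df = p = 7; Error df = n − p − 1 = 28 − 7 − 1 = 20), so F = MS(Model)/MS(Error) = (135532366/7)/(60214949/20) = 19361767/3010747.4 ≈ 6.43, matching the table.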
Assumptions of Multiple Linear
Regression Model
1. Linearity: E(Y | X_1, ..., X_p) = β_0 + β_1 X_1 + ... + β_p X_p
2. Constant variance: The standard deviation of Y for the subpopulation of units with X_1 = x_1, ..., X_p = x_p is the same for all subpopulations.
3. Normality: The distribution of Y for the subpopulation of units with X_1 = x_1, ..., X_p = x_p is normal for all subpopulations.
4. The observations are independent.
Assumptions for linear regression
and their importance to inferences
• Point prediction, point estimation: linearity, independence.
• Confidence interval for slope, hypothesis test for slope, confidence interval for mean response: linearity, constant variance, independence, normality (only if n < 30).
• Prediction interval: linearity, constant variance, independence, normality.
Checking Linearity
• Plot residuals versus each of the
explanatory variables. Each of these plots
should look like random scatter, with no
pattern in the mean of the residuals.
[Plots: Bivariate Fit of Residual Calls By Average Temperature; Bivariate Fit of Residual Calls By Range]
If residual plots show a problem, then we could try to transform the x-variable and/or
the y-variable.
Residual Plots in JMP
• After Fit Model, click red triangle next to
Response, click Save Columns and click
Residuals.
• Use Fit Y by X with Y=Residuals and X the
explanatory variable of interest. Fit Line will
draw a horizontal line with intercept zero. It is a
property of the residuals from multiple linear
regression that a least squares regression of the
residuals on an explanatory variable has slope
zero and intercept zero.
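A rough Python analogue of these JMP steps, assuming the `fit` and `df` objects from the earlier statsmodels sketch:

```python
# Sketch: residuals versus each explanatory variable (assumes `fit` and `df`
# from the earlier sketch). Each plot should show random scatter.
import matplotlib.pyplot as plt

for var in ["AvgTemp", "Range"]:  # repeat for each explanatory variable
    plt.scatter(df[var], fit.resid)
    plt.axhline(0)                # reference line: slope 0, intercept 0
    plt.xlabel(var)
    plt.ylabel("Residual Calls")
    plt.show()
```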
Residual by Predicted Plot

[Plot: Calls Residual versus Calls Predicted]
• Fit Model displays the Residual by Predicted Plot automatically in its output. The plot is a plot of the residuals versus the predicted Y's, Ŷ_i = Ê(Y_i | X_1 = x_i1, ..., X_p = x_ip).
• We can think of the predicted Y's as summarizing all the information in the X's. As usual, we would like this plot to show random scatter.
• Pattern in the mean of the residuals as the predicted Y's increase: indicates a problem with linearity. Look at residual plots versus each explanatory variable to isolate the problem and consider transformations.
• Pattern in the spread of the residuals: indicates a problem with constant variance.
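The same plot can be sketched in Python (again assuming the earlier `fit` object):

```python
# Sketch of the Residual by Predicted Plot (assumes `fit` from the earlier sketch).
import matplotlib.pyplot as plt

plt.scatter(fit.fittedvalues, fit.resid)  # predicted Y's vs. residuals
plt.axhline(0)
plt.xlabel("Calls Predicted")
plt.ylabel("Calls Residual")
plt.show()
```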
Checking Normality
• As with simple linear regression, make
histogram of residuals and normal quantile
plot of residuals.
Distributions: Residual Calls
[Histogram of residuals and normal quantile plot with confidence bands]
Normality appears to be violated: several points are outside the confidence bands, and the distribution of the residuals is skewed to the right.
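A sketch of the same checks in Python (assuming the earlier `fit` object):

```python
# Sketch of the normality checks (assumes `fit` from the earlier sketch).
import matplotlib.pyplot as plt
from scipy import stats

plt.hist(fit.resid, bins=10)  # histogram of residuals
plt.show()

stats.probplot(fit.resid, dist="norm", plot=plt)  # normal quantile plot
plt.show()
```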
Transformations to Remedy
Constant Variance and Normality
Nonconstant Variance
• When the variance of Y | Ŷ increases with Ŷ, try transforming Y to log Y or to √Y.
• When the variance of Y | Ŷ decreases with Ŷ, try transforming Y to 1/Y or to Y².
Nonnormality
• When the distribution of the residuals is skewed to the right, try transforming Y to log Y.
• When the distribution of the residuals is skewed to the left, try transforming Y to Y².
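As a sketch of the right-skew remedy above (assuming the earlier `df` and `X`, and that the response is positive):

```python
# Sketch: refit with log(Y) when the residuals are right-skewed
# (assumes `df` and `X` from the earlier sketch and that Calls > 0).
import numpy as np
import statsmodels.api as sm

fit_log = sm.OLS(np.log(df["Calls"]), X).fit()
```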
Influential Points, High Leverage
Points, Outliers
• As in simple linear regression, we identify high leverage
and high influence points by checking the leverages and
Cook’s distances (Use save columns to save Cook’s D
Influence and Hats).
• High influence points: Cook's distance > 1.
• High leverage points: a point whose hat value is greater than 3(p + 1)/n, where p is the number of explanatory variables.
• Use same guidelines for dealing with influential
observations as in simple linear regression.
• Point that has unusual Y given its explanatory variables:
point with a residual that is more than 3 RMSEs away
from zero.
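A sketch of these rules in Python, assuming the `fit` object from the earlier sketch:

```python
# Sketch of the influence checks above (assumes `fit` from the earlier sketch,
# with n = 28 observations and 7 explanatory variables).
import numpy as np

infl = fit.get_influence()
cooks_d = infl.cooks_distance[0]             # Cook's distances
hats = infl.hat_matrix_diag                  # leverages (hat values)
rmse = np.sqrt(fit.mse_resid)                # root mean square error

print((cooks_d > 1).sum())                   # high influence: Cook's D > 1
print((hats > 3 * (7 + 1) / 28).sum())       # high leverage: hat > 3(p+1)/n
print((np.abs(fit.resid) > 3 * rmse).sum())  # unusual Y: |residual| > 3 RMSE
```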
Scatterplot Matrix
• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that are needed, as well as potential outliers and influential points.
• Scatterplot matrix in JMP: Click Analyze, Multivariate Methods, Multivariate, and then put the response variable first in the Y, Columns box, followed by the explanatory variables.
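A rough Python analogue (assuming the earlier simulated `df`; pandas' scatter_matrix plays the role of JMP's Multivariate platform here):

```python
# Python analogue of JMP's scatterplot matrix (assumes `df` from the earlier
# sketch), with the response variable listed first.
import pandas as pd
import matplotlib.pyplot as plt

cols = ["Calls", "AvgTemp", "Range", "Rain", "Snow", "Weekday", "Sunday", "Subzero"]
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10))
plt.show()
```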
Scatterplot Matrix

[Scatterplot matrix of Calls, Average Temperature, Range, Rain forecast, Snow forecast, Weekday, Sunday]
• In order to evaluate the benefits of a proposed irrigation scheme in Egypt, the relation of yield Y of wheat to rainfall is investigated over several years (see rainfall.JMP).
• How can regression analysis help?
Year   Yield (Bu./Acre), Y   Total Spring Rainfall, R   Average Spring Temperature, T
1963   60                    8                          56
1964   50                    10                         47
1965   70                    11                         53
1966   70                    10                         53
1967   80                    9                          56
1968   50                    9                          47
1969   60                    12                         44
1970   40                    11                         44
Simple Linear Regression of Yield
on Rainfall
Bivariate Fit of Yield By Total Spring Rainfall
[Plot: Yield versus Total Spring Rainfall with the least squares line]

Linear Fit
Yield = 76.666667 - 1.6666667 Total Spring Rainfall

Rainfall reduces yield!? Is irrigation a bad idea?
Linear Fit
Yield = 76.666667 - 1.6666667 Total Spring Rainfall
• Interpretation of coefficient of rainfall: The
change in the mean yield that is
associated with a one inch increase in
rainfall. Other important variables (lurking
variables) are not held fixed and might
tend to change as rainfall increases.
Bivariate Fit of Average Spring Temperature By Total Spring Rainfall
[Plot: Average Spring Temperature versus Total Spring Rainfall]

Temperature tends to decrease as rainfall increases.
Controlling for Known Lurking
Variables: Multiple Regression
• To evaluate the benefits of the irrigation scheme, we want to know how changes in rainfall are associated with changes in yield when all other important variables (lurking variables), such as temperature, are held fixed.
• Multiple regression provides this.
• Coefficient on rainfall in the multiple regression
of yield on rainfall and temperature = change in
the mean yield that is associated with a one inch
increase in rainfall when temperature is held
fixed.
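Since the data appear in full in the table above, both regressions can be reproduced directly. A sketch with statsmodels (the rainfall.JMP file is not attached to these notes, so the values are typed in):

```python
# Sketch reproducing the rainfall example with the data from the table above
# (rainfall.JMP is not attached, so the values are typed in by hand).
import pandas as pd
import statsmodels.formula.api as smf

df_rain = pd.DataFrame({
    "Yield":    [60, 50, 70, 70, 80, 50, 60, 40],  # Bu./Acre
    "Rainfall": [ 8, 10, 11, 10,  9,  9, 12, 11],  # total spring rainfall
    "Temp":     [56, 47, 53, 53, 56, 47, 44, 44],  # average spring temperature
})

# Simple regression: the slope on rainfall is about -1.67 (rainfall looks harmful).
print(smf.ols("Yield ~ Rainfall", data=df_rain).fit().params)

# Multiple regression: holding temperature fixed, the rainfall slope is about
# +5.71 (rainfall is beneficial once the lurking variable is controlled for).
print(smf.ols("Yield ~ Rainfall + Temp", data=df_rain).fit().params)
```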
Multiple Regression Analysis
Response Yield
Summary of Fit
RSquare                       0.790476
RSquare Adj                   0.706667
Root Mean Square Error        7.091242
Mean of Response              60
Observations (or Sum Wgts)    8

Parameter Estimates
Term                          Estimate     Std Error   t Ratio   Prob>|t|
Intercept                    -144.7619     55.8499     -2.59     0.0487
Total Spring Rainfall           5.7142857   2.680238    2.13     0.0862
Average Spring Temperature      2.952381    0.692034    4.27     0.0080
• Rainfall is estimated to be beneficial once
temperature is held fixed.