Stat 112 Notes 9
• Today:
– Multicollinearity (Chapter 4.6)
– Multiple regression and causal inference
Assessing Quality of Prediction
(Chapter 3.5.3)
• R squared is a measure of a fit of the
regression to the sample data. It is not
generally considered an adequate measure
of the regression’s ability to predict the
responses for new observations.
• One method of assessing the ability of the
regression to predict the responses for new
observations is data splitting.
• We split the data into two groups: a training sample and a holdout sample (also
called a validation sample). We fit the regression model to the training sample and
then assess the quality of the model's predictions on the holdout sample.
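For concreteness, here is a minimal sketch of data splitting in Python with pandas and statsmodels (not part of the original notes; the file name data.csv, the response column y, the predictors x1 and x2, and the 60/40 split are all hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names, used only to illustrate the idea.
data = pd.read_csv("data.csv")

# Split the data into a training sample and a holdout (validation) sample.
train = data.iloc[:60]    # rows used to fit the model
holdout = data.iloc[60:]  # rows held out to assess predictions

# Fit the multiple regression to the training sample only.
X_train = sm.add_constant(train[["x1", "x2"]])
model = sm.OLS(train["y"], X_train).fit()

# Predict the responses for the holdout sample.
X_holdout = sm.add_constant(holdout[["x1", "x2"]])
y_hat = model.predict(X_holdout)
```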
Measuring Quality of Predictions
Let $n_2$ be the number of points in the holdout sample.
Let $X_1, \ldots, X_{n_2}$ be the points in the holdout sample.
Let $\hat{Y}_1, \ldots, \hat{Y}_{n_2}$ be the predictions of $Y$ for the points in the holdout sample based on the model fit on the training sample.

Mean Squared Deviation (MSD) $= \frac{\sum_{i=1}^{n_2} (Y_i - \hat{Y}_i)^2}{n_2}$

Root Mean Squared Deviation (RMSD) $= \sqrt{\frac{\sum_{i=1}^{n_2} (Y_i - \hat{Y}_i)^2}{n_2}}$
Root Mean Squared Deviation is comparable to Root Mean Square Error, but Root Mean
Squared Deviation will generally be larger because it takes into account both the fact
that $Y$ often deviates from $E(Y \mid X)$ (i.e., the fact that there are disturbances in
the regression equation) and the fact that the least squares estimates have errors and
do not equal the true slope coefficients.
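Continuing the hypothetical sketch above, MSD and RMSD on the holdout sample follow directly from these definitions:

```python
import numpy as np

# Squared deviations between observed and predicted responses in the holdout sample.
squared_deviations = (holdout["y"] - y_hat) ** 2

n2 = len(holdout)                    # number of points in the holdout sample
msd = squared_deviations.sum() / n2  # Mean Squared Deviation
rmsd = np.sqrt(msd)                  # Root Mean Squared Deviation
```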
Root Mean Squared Deviation for state.JMP data set
on state average SAT scores
Our training sample was the states Kansas-Wyoming
and our validation sample was the states Alabama-Iowa.
Root Mean Squared Deviation $= \sqrt{\frac{\sum_{i=1}^{n_2} (y_i - \hat{y}_i)^2}{n_2}}$
Root Mean Squared Deviation
Multiple Regression: 34.04
Elena/Leah: 35.15
Joanna/Mark/Shannon: 36.08
Kathryn/Kendall/Carly: 59.00
Renee/Amy/Tatiana: 89.59
Multicollinearity
• DATA: A real estate agent wants to develop a model to predict the selling price
of a home. The agent takes a random sample of 100 homes that were recently
sold and records the selling price (y), the number of bedrooms (x1), the size in
square feet (x2), and the lot size in square feet (x3). The data are in houseprice.JMP.
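The notes fit this model in JMP; a rough Python equivalent with statsmodels might look like the sketch below (assuming houseprice.JMP has been exported to a CSV file houseprice.csv with the column names used here):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical CSV export of houseprice.JMP; column names assumed.
houses = pd.read_csv("houseprice.csv")

X = sm.add_constant(houses[["Bedrooms", "House Size", "Lot Size"]])
fit = sm.OLS(houses["Price"], X).fit()

# Prints R-squared, the overall F-test, and the individual t-tests,
# analogous to JMP's Summary of Fit, Analysis of Variance, and
# Parameter Estimates reports.
print(fit.summary())
```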
Scatterplot Matrix
[Scatterplot matrix of Price, Bedrooms, House Size, and Lot Size]
Response Price

Summary of Fit
RSquare: 0.559998
RSquare Adj: 0.546248
Root Mean Square Error: 25022.71
Mean of Response: 154066
Observations (or Sum Wgts): 100

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       3   7.65017e10       2.5501e10     40.7269   <.0001
Error      96   6.0109e+10       626135896
C. Total   99   1.36611e11
Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    37717.595   14176.74    2.66      0.0091
Bedrooms     2306.0808   6994.192    0.33      0.7423
House Size   74.296806   52.97858    1.40      0.1640
Lot Size     -4.363783   17.024      -0.26     0.7982
There is strong evidence that the predictors are useful (the p-value for the F-test is
<.0001 and $R^2 = .560$), but the t-tests for the individual coefficients are not
significant. This is indicative of multicollinearity.
Note: These results illustrate how the F test is more powerful than the individual
t tests for testing whether a group of slopes in multiple regression are all zero.
Multicollinearity
• Multicollinearity: Explanatory variables are
highly correlated with each other. It is often
hard to determine their individual regression
coefficients.
Multivariate Correlations
             Bedrooms   House Size   Lot Size
Bedrooms     1.0000     0.8465       0.8374
House Size   0.8465     1.0000       0.9936
Lot Size     0.8374     0.9936       1.0000
• There is very little information in the data set
to find out what would happen if we fix house
size and change lot size.
• Since house size and lot size are highly
correlated, for fixed house size, lot sizes do
not change much.
• The standard error for estimating the coefficient of lot size is large. Consequently,
the coefficient may not be significant.
• Similarly for the coefficient of house size.
• So, while it seems that at least one of the coefficients is significant (see the
ANOVA table), you cannot tell which one is the useful one.
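A short sketch of computing these pairwise correlations with pandas, assuming the hypothetical houses DataFrame from the earlier sketch:

```python
# Pairwise correlations among the explanatory variables; large off-diagonal
# values (e.g., 0.99 between House Size and Lot Size) suggest multicollinearity.
predictors = houses[["Bedrooms", "House Size", "Lot Size"]]
print(predictors.corr())
```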
Consequences of Multicollinearity
• Standard errors of regression coefficients are
large. As a result t statistics for testing the
population regression coefficients are small.
• Regression coefficient estimates are unstable. Signs of coefficients may be the
opposite of what is intuitively reasonable (e.g., the negative sign on lot
size). Dropping or adding one variable in the regression causes large changes in the
estimates of the coefficients of the other variables.
Detecting Multicollinearity
1. Pairwise correlations between
explanatory variables are high.
2. Large overall F-statistic for testing
usefulness of predictors but small t
statistics.
3. Variance inflation factors
Variance Inflation Factors
• Variance inflation factor (VIF): Let $R_j^2$ denote the $R^2$ for the multiple
regression of $x_j$ on the other x-variables. Then
$$VIF_j = \frac{1}{1 - R_j^2}.$$
• Fact:
$$SD(\hat{\beta}_j) = \sqrt{\frac{MSE}{(n-1)\, s_{x_j}^2}\, VIF_j}$$
• VIFj for variable xj: Measure of the increase in the variance of the
coefficient on xj due to the correlation among the explanatory variables
compared to what the variance of the coefficient on xj would be if xj were
independent of the other explanatory variables.
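A sketch of computing VIFs directly from this definition, by regressing each explanatory variable on the others (again using the hypothetical houses DataFrame):

```python
import statsmodels.api as sm

predictors = houses[["Bedrooms", "House Size", "Lot Size"]]

for col in predictors.columns:
    # Regress x_j on the other explanatory variables and record its R-squared.
    others = sm.add_constant(predictors.drop(columns=col))
    r2_j = sm.OLS(predictors[col], others).fit().rsquared
    vif_j = 1.0 / (1.0 - r2_j)  # VIF_j = 1 / (1 - R_j^2)
    print(col, round(vif_j, 2))
```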
Using VIFs
• To obtain VIFs, after Fit Model, go to
Parameter Estimates, right click, click
Columns and click VIFs.
• Detecting multicollinearity with VIFs:
– Any individual VIF greater than 10
indicates multicollinearity.
Summary of Fit
RSquare: 0.559998

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|   VIF
Intercept    37717.595   14176.74    2.66      0.0091     .
Bedrooms     2306.0808   6994.192    0.33      0.7423     3.5399784
House Size   74.296806   52.97858    1.40      0.1640     83.066839
Lot Size     -4.363783   17.024      -0.26     0.7982     78.841292
Multicollinearity and Prediction
• If interest is in predicting y, then as long as the pattern of multicollinearity
continues for the observations where forecasts are desired (e.g., house size
and lot size are either both high, both medium, or both small), multicollinearity
is not particularly problematic.
• If interest is in predicting y for observations where the pattern of
multicollinearity is different from that in the sample (e.g., large house size,
small lot size), there is no good solution (this would be extrapolation).
Problems caused by
multicollinearity
• If interest is in predicting y, then as long as the pattern of multicollinearity
continues for the observations where forecasts are desired (e.g., house size and lot
size are either both high, both medium, or both small), multicollinearity is not
particularly problematic.
• If interest is in obtaining individual regression coefficients, there is no good
solution in the face of multicollinearity.
• If interest is in predicting y for observations where the pattern of
multicollinearity is different from that in the sample (e.g., large house size,
small lot size), there is no good solution (this would be extrapolation).
Dealing with Multicollinearity
• Suffer: If prediction within the range of the data
is the only goal, not the interpretation of the
coefficients, then leave the multicollinearity
alone.
• Omit a variable. Multicollinearity can be reduced
by removing one of the highly correlated
variables. However, if one wants to estimate the
partial slope of one variable holding fixed the
other variables, omitting a variable is not an
option, as it changes the interpretation of the
slope.
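For example, a sketch of the "omit a variable" option, refitting the hypothetical house price model without Lot Size:

```python
# Refit the regression after dropping one of the highly correlated variables.
# Note that the interpretation of the remaining coefficients changes:
# they are no longer partial slopes holding Lot Size fixed.
X_reduced = sm.add_constant(houses[["Bedrooms", "House Size"]])
reduced_fit = sm.OLS(houses["Price"], X_reduced).fit()
print(reduced_fit.summary())
```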
California Test Score Data
• The California Standardized Testing and
Reporting (STAR) data set californiastar.JMP
contains data on test performance, school
characteristics and student demographic
backgrounds from 1998-1999.
• Average Test Score is the average of the
reading and math scores for a standardized test
administered to 5th grade students.
• One interesting question: What would be the
causal effect of decreasing the student-teacher
ratio by one student per teacher?
Multiple Regression and Causal Inference
• Goal: Figure out what the causal effect on
average test score would be of decreasing
student-teacher ratio and keeping everything else
in the world fixed.
• Lurking variable: A variable that is associated with
both average test score and student-teacher ratio.
• In order to figure out whether a drop in the student-teacher ratio causes higher test
scores, we want to compare mean test scores among schools with different
student-teacher ratios but the same values of the lurking variables, i.e., we want to
hold the values of the lurking variables fixed.
• If we include all of the lurking variables in the
multiple regression model, the coefficient on
student-teacher ratio represents the change in the
mean of test scores that is caused by a one unit
increase in student-teacher ratio.
Omitted Variables Bias
Response Average Test Score
Parameter Estimates
Term                    Estimate    Std Error   t Ratio   Prob>|t|
Intercept               698.93295   9.467491    73.82     <.0001
Student Teacher Ratio   -2.279808   0.479826    -4.75     <.0001
Response Average Test Score
Parameter Estimates
Term                          Estimate    Std Error   t Ratio   Prob>|t|
Intercept                     686.03225   7.411312    92.57     <.0001
Student Teacher Ratio         -1.101296   0.380278    -2.90     0.0040
Percent of English Learners   -0.649777   0.039343    -16.52    <.0001
• Schools with many English learners tend to have worse resources.
The multiple regression, which shows how the mean test score changes
when the student-teacher ratio changes but the percent of English learners
is held fixed, gives a better idea of the causal effect of the student-teacher
ratio than the simple linear regression, which does not hold the
percent of English learners fixed.
• Omitted variables bias: bias in estimating the causal effect of a
variable from omitting a lurking variable from the multiple regression.
• Omitted variables bias of omitting percentage of English learners =
-2.28 - (-1.10) = -1.18.
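A sketch of reproducing this comparison with statsmodels, assuming californiastar.JMP has been exported to a CSV with the hypothetical column names AvgTestScore, StudentTeacherRatio, and PctEnglishLearners:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical CSV export of californiastar.JMP; column names assumed.
star = pd.read_csv("californiastar.csv")

# Short regression: omits percent of English learners.
X_short = sm.add_constant(star[["StudentTeacherRatio"]])
short_fit = sm.OLS(star["AvgTestScore"], X_short).fit()

# Long regression: holds percent of English learners fixed.
X_long = sm.add_constant(star[["StudentTeacherRatio", "PctEnglishLearners"]])
long_fit = sm.OLS(star["AvgTestScore"], X_long).fit()

# Omitted variables bias = short-regression coefficient minus long-regression coefficient.
bias = short_fit.params["StudentTeacherRatio"] - long_fit.params["StudentTeacherRatio"]
print(bias)  # roughly -2.28 - (-1.10) = -1.18 with the notes' data
```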
Key Warning About Using Multiple
Regression for Causal Inference
• Even if we have included many lurking
variables in the multiple regression, we
may have failed to include one or not have
enough data to include one. There will
then be omitted variables bias.
• The best way to study causal effects is to
do a randomized experiment.
Path Diagram
[Path diagram relating Student-Teacher Ratio, Average Test Score, and the lurking variables: Calworks %, Percent English Learners, and other lurking variables]