Lab 6
Objective
To gain more experience with hypothesis testing. We will also interpret beta coefficients in multiple regression. Continuing with regression, we will perform hypothesis tests and briefly discuss the large p, small n problem.
Questions:
Read in the dataset useconstat from the website, then answer the following questions:
1.) Regress GDP (market prices, volume) on Long-term interest rates, Private consumption (volume), Government consumption (volume), Export prices (goods and services), and Import prices (goods and services). Include a summary of your regression from R here:
Call:
lm(formula = USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 9] + USecon[, 53] +
USecon[, 54])
Residuals:
   Min     1Q Median     3Q    Max
-59686 -19326   6681  14724  38183
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.096e+05 2.803e+05 -0.391 0.70224
USecon[, 7] 2.187e+04 9.729e+03 2.248 0.04259 *
USecon[, 8] 1.508e+00 4.643e-02 32.486 7.82e-14 ***
USecon[, 9] 2.280e-02 3.463e-01 0.066 0.94851
USecon[, 53] -1.170e+04 4.915e+03 -2.381 0.03323 *
USecon[, 54] 9.703e+03 3.006e+03 3.228 0.00661 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30920 on 13 degrees of freedom
Multiple R-squared:  0.9991,    Adjusted R-squared:  0.9987
F-statistic: 2759 on 5 and 13 DF, p-value: < 2.2e-16
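For reference, here is a minimal sketch of the R commands that would produce a summary of this form. The file name and the read.csv() call are assumptions, since the lab only says to read the dataset from the website; the column numbers are the ones used in the call above.

# Read the US economic statistics dataset (file name and format assumed here)
USecon <- read.csv("useconstat.csv")

# Regress GDP (column 22) on long-term interest rates (7), private consumption (8),
# government consumption (9), export prices (53), and import prices (54)
full_fit <- lm(USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 9] +
               USecon[, 53] + USecon[, 54])
summary(full_fit)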
1.b) Perform model selection using individual t (or F) tests. Do this until all variables left in the model are significant. Include all relevant information about why you chose to omit a given predictor at each step. For this problem, I want you to consider the Intercept as a predictor: e.g., you can remove it if the data tells you it is not significantly different from 0.
Okay, so we will be performing multiple tests in this question. Normally, in real research, you would correct for multiple testing by testing each hypothesis at a smaller alpha level; for example, with the Bonferroni correction you would use alpha/k, where k is the number of tests you would like to perform. There are many other correction methods, but they get complicated.
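As a small illustration (not required for this lab), the Bonferroni threshold is easy to compute, and R's p.adjust() implements the equivalent adjustment on the p values themselves:

alpha <- 0.05
k <- 6              # number of tests planned (example value)
alpha / k           # Bonferroni-adjusted significance threshold

# Equivalently, inflate the raw p values and compare them to the original alpha:
# p.adjust(pvals, method = "bonferroni"), where pvals is a vector of raw p values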
Remember that by default R tests the null hypothesis that there is no relationship between the response and a specific covariate in the presence of the other covariates. The p values for these tests are given in the column on the far right of the summary. Remember that we fail to reject the null hypothesis when the p value is > alpha, and we reject when the p value is <= alpha.
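In R you can pull these p values out of the summary and apply the decision rule programmatically; here full_fit is assumed to hold the full regression from question (1.) (see the sketch there):

pvals <- summary(full_fit)$coefficients[, "Pr(>|t|)"]  # far-right column of the summary
pvals > 0.05   # TRUE means we fail to reject H0 for that coefficient at alpha = 0.05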
With this in mind, refer back to the summary which I pasted for the first question (1.). We can see that the Intercept and USecon[, 9] (Government Consumption, volume) have p values greater than 0.05. So at the 95% confidence level, we would fail to reject, for either of these covariates, the null hypothesis that it has no linear relationship with GDP in the presence of the other covariates. Pick one of them to remove; here I will pick Gov't Consumption. When I remove this covariate from the model, I get the following new summary:
Call:
lm(formula = USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 53] + USecon[, 54])
Residuals:
   Min     1Q Median     3Q    Max
-59793 -18916   5952  14866  38164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.778e+04 2.079e+05 -0.470 0.64538
USecon[, 7] 2.143e+04 6.856e+03 3.126 0.00744 **
USecon[, 8] 1.509e+00 4.338e-02 34.791 5.38e-15 ***
USecon[, 53] -1.160e+04 4.464e+03 -2.597 0.02109 *
USecon[, 54] 9.714e+03 2.893e+03 3.358 0.00469 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29800 on 14 degrees of freedom
Multiple R-squared:  0.9991,    Adjusted R-squared:  0.9988
F-statistic: 3713 on 4 and 14 DF, p-value: < 2.2e-16
We can see here that the Intercept parameter still has a p value greater than 0.05, so if we made another test for no relationship in the presence of the other variables, we would again fail to reject the null hypothesis, and therefore would conclude that it should be dropped from the model. Dropping it, the summary now looks like this:
Call:
lm(formula = USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 53] + USecon[, 54] - 1)
Residuals:
   Min     1Q Median     3Q    Max
-57190 -17544   6102  11096  42109
Coefficients:
Estimate Std. Error t value Pr(>|t|)
USecon[, 7] 1.973e+04 5.673e+03 3.478 0.00337 **
USecon[, 8] 1.509e+00 4.224e-02 35.734 6.24e-16 ***
USecon[, 53] -1.241e+04 4.004e+03 -3.100 0.00731 **
USecon[, 54] 9.667e+03 2.815e+03 3.434 0.00369 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29010 on 15 degrees of freedom
Multiple R-squared:      1,    Adjusted R-squared:      1
F-statistic: 1.995e+05 on 4 and 15 DF, p-value: < 2.2e-16
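As a side note, the same sequence of refits can be written with update(), which drops terms from an existing fit. This is only a convenience; it assumes the full model from question (1.) is stored as full_fit:

fit2 <- update(full_fit, . ~ . - USecon[, 9])   # drop Gov't Consumption
fit3 <- update(fit2, . ~ . - 1)                 # then drop the intercept
summary(fit3)                                   # matches the summary above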
All of the variables now have p values less than 0.05, so we cannot remove any more covariates from the model using this method.
As an important note, in general the order in which you remove the covariates matters and can lead to different conclusions. This is a weak point of using this sequence of tests to perform model selection (model selection is the act of figuring out which covariates are actually important or relevant for your response variable).
1.c) What can you say about the type 1 error rate of the sequence of tests you made?
Here I am asking for the familywise error rate, or the probability of at least one type 1 error in
the sequence of tests that we performed. The statement here should be that the type 1 error rate
of the sequence of tests that we made is greater than alpha=0.05 for the above example.
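To make that statement concrete: if the two tests were independent (they are not, so this is only illustrative), the probability of at least one type 1 error would be

1 - (1 - 0.05)^2    # = 0.0975, already larger than alpha = 0.05 for just two tests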
1.d) Perform the F test for the hypothesis that the coefficients which you removed from the full
model were simultaneously 0.
Here we need to compare the full model to the model without an intercept and without the Gov't Consumption covariate.
SSE in the full model is: 12426170106
SSE in the reduced model without the Intercept and Gov't Consumption is: 12626701098
The error df in the full model is 19 - 5 - 1 = 13 (n = 19 observations, 5 slope coefficients, and the intercept).
We are dropping 2 coefficients (the Intercept and Gov't Consumption), so the other df parameter is 2.
With this information we can calculate the F statistic which is: 0.1048957
(See the formula from notes for explanation)
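Written out in R, using the SSE values above (anova() on the nested fits gives the same test, assuming the full and final fits are stored as full_fit and fit3, as in the update() sketch above):

sse_full    <- 12426170106
sse_reduced <- 12626701098
df_full     <- 13     # error df in the full model: 19 - 5 - 1
df_dropped  <- 2      # number of coefficients removed

F_stat <- ((sse_reduced - sse_full) / df_dropped) / (sse_full / df_full)
F_stat                # approximately 0.105

# Equivalent nested-model comparison:
# anova(fit3, full_fit)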
As a note for error checking, the F stat is ALWAYS non-negative! If you get a negative F stat, you have made a mistake!
The critical value is qf(0.95, 2, 13) = 3.805565.
Since the F statistic is smaller than the critical value, we fail to reject the null hypothesis that the two dropped coefficients are simultaneously 0. We conclude that, in the presence of the other variables, there is no clear evidence of a linear contribution of the Intercept and Gov't Consumption to GDP. Therefore, we may abandon the larger model for the reduced model in which they are dropped from the set of covariates.
1.e) What can you say about the type 1 error rate for this test?
Since we performed a single test to assess whether the coefficients on the Intercept and on Gov't Consumption were simultaneously 0, the type 1 error rate of this test is alpha, which we set to be 0.05 in our analysis.
1.f) Perform the F test to test whether or not there is a regression relation between GDP and the
full set of covariates initially given in this problem.
Here, the reduced model is the model with only the intercept parameter.
SSE in the full model is the same as before.
The df in the full model is the same as before.
SSE in the reduced model is: 1.320014e+13
We are dropping 5 covariates from the model, so the other df parameter is 5.
With this information we can calculate the F statistic which is: 2759.342
The critical value is qf(0.95, 5, 13) = 3.025438.
Since the F statistic is larger than the critical value, we reject the null hypothesis at the 95% level and conclude that there is a linear relationship between GDP and at least one of the covariates. That is, not all of the coefficients on the covariates are simultaneously 0.
We could also have carried out this test directly from the summary of the original regression. Just like with the individual predictors, R by default tests the hypothesis that there is no regression relation when it calculates the F stat. As evidence of this, observe that the F stat calculated by R in the first summary I pasted for question (1.) was “F-statistic: 2759 on 5 and 13 DF”. The slight difference from our hand calculation is due to rounding when R reports its numbers to you.
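For example, the overall F statistic, its degrees of freedom, and the corresponding p value can be pulled straight out of the fitted model object (again assuming the full fit is stored as full_fit):

fstat <- summary(full_fit)$fstatistic                   # value, numerator df, denominator df
fstat                                                   # roughly 2759 on 5 and 13 df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)    # the p value R reports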
2.a) In the full regression model which you initially computed in question 1, interpret the beta
coefficient on each predictor.
For each of the covariates, the interpretation is that the coefficient estimates the amount by which GDP tended to change in the observed data for a one-unit increase in that covariate, holding all other covariates fixed.
The 'holding all other covariates fixed' statement is very important here, as changing one covariate may naturally influence other covariates. Take long-term and short-term interest rates, for example: one might expect the two to be related on economic grounds, and the data support the assertion that they are in fact related.
For this question, I wanted you to write out this logic for each of the covariates.
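If you wanted to check the claim above that long-term and short-term interest rates are related, a quick correlation would do it. The short-term rate column is not identified in this solution, so its index below is a placeholder:

# short_term_col is hypothetical; replace it with the actual column index for short-term rates
# cor(USecon[, 7], USecon[, short_term_col], use = "complete.obs")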
2.b) Interpret the intercept coefficient.
We usually interpret the intercept coefficient as our estimate of the response variable, GDP, when all of the covariates are 0. That doesn't make sense economically in this example. Furthermore, even setting economic theory aside (and the fact that the scenario is absurd!), we would be extrapolating if we tried to interpret the intercept as our guess of GDP when all of the covariates are 0, since we have no data anywhere near this scenario.
3.) Answer one question you find interesting about economics using the dataset and regression
techniques you've learned. If you have trouble coming up with a good question ask a classmate
who studies economics for some ideas!