Lab 6

Objective: To gain more experience with hypothesis testing and with interpreting beta coefficients in multiple regression. Continuing with regression, we will also perform hypothesis tests and briefly discuss the large p, small n problem.

Questions: Read in the dataset useconstat from the website, then answer the following questions:

1.) Regress GDP (market prices, volume) on Long Term interest rates, Private consumption (volume), Government Consumption (volume), Export price goods and services, and Import price goods and services. Include a summary of your regression from R here:

Call:
lm(formula = USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 9] + USecon[, 53] + USecon[, 54])

Residuals:
   Min     1Q Median     3Q    Max
-59686 -19326   6681  14724  38183

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.096e+05  2.803e+05  -0.391  0.70224
USecon[, 7]   2.187e+04  9.729e+03   2.248  0.04259 *
USecon[, 8]   1.508e+00  4.643e-02  32.486 7.82e-14 ***
USecon[, 9]   2.280e-02  3.463e-01   0.066  0.94851
USecon[, 53] -1.170e+04  4.915e+03  -2.381  0.03323 *
USecon[, 54]  9.703e+03  3.006e+03   3.228  0.00661 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30920 on 13 degrees of freedom
Multiple R-squared: 0.9991, Adjusted R-squared: 0.9987
F-statistic: 2759 on 5 and 13 DF, p-value: < 2.2e-16

1.b) Perform model selection by performing individual T (or F) tests. Do this until all variables left in the model are significant. Include all relevant information about why you chose to omit a given predictor at each step. For this problem, I want you to consider the Intercept as a predictor: e.g. you can remove it if the data tells you it's not significantly different from 0.

Okay, so we will be performing multiple tests in this question.
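As a sanity check on the coefficient table above, the t value and Pr(>|t|) columns can be reproduced from the Estimate and Std. Error columns. A minimal sketch in Python (the lab itself uses R; scipy.stats.t.sf stands in here for R's pt), using the USecon[, 7] row:

```python
from scipy import stats

# Values copied from the USecon[, 7] row of the summary above.
estimate = 2.187e+04
std_error = 9.729e+03
df_resid = 13  # residual degrees of freedom reported by R

# t value = Estimate / Std. Error; p value is two-sided under t(df_resid).
t_value = estimate / std_error                    # ~2.248
p_value = 2 * stats.t.sf(abs(t_value), df_resid)  # ~0.043
```

The small discrepancy from R's printed 0.04259 comes from working with the rounded Estimate and Std. Error rather than R's full-precision internals.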
Normally in real research you would correct for multiple testing by testing each hypothesis at a smaller alpha level; for example, with the Bonferroni correction you would use alpha/k, where k is the number of tests you would like to perform. There are many other corrections, but they get complicated. Remember that by default R tests the null hypothesis that a specific covariate has no relationship with the response in the presence of the other covariates. The p values for these tests are given in the far-right column of the summary. Remember that we fail to reject the null hypothesis when the p value is > alpha, and we reject when the p value is <= alpha.

With this in mind, refer back to the summary which I pasted for the first question (1.)). We can see that the Intercept and USecon[, 9] (which is Government Consumption, volume) have p values greater than 0.05. So at the 95% confidence level, for either of these covariates we would fail to reject the null hypothesis of no linear relationship with GDP in the presence of the other covariates. Pick one of them to remove; here I will pick Gov't Consumption. When I remove this covariate from the model, I get the following new summary:

Call:
lm(formula = USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 53] + USecon[, 54])

Residuals:
   Min     1Q Median     3Q    Max
-59793 -18916   5952  14866  38164

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.778e+04  2.079e+05  -0.470  0.64538
USecon[, 7]   2.143e+04  6.856e+03   3.126  0.00744 **
USecon[, 8]   1.509e+00  4.338e-02  34.791 5.38e-15 ***
USecon[, 53] -1.160e+04  4.464e+03  -2.597  0.02109 *
USecon[, 54]  9.714e+03  2.893e+03   3.358  0.00469 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29800 on 14 degrees of freedom
Multiple R-squared: 0.9991, Adjusted R-squared: 0.9988
F-statistic: 3713 on 4 and 14 DF, p-value: < 2.2e-16

We can see here that the Intercept parameter still has a p value greater than 0.05, so if we made another test for no relationship in the presence of the other variables, we would again fail to reject the null hypothesis, and therefore would conclude that it should be dropped from the model. Dropping it, the summary now looks like this:

Call:
lm(formula = USecon[, 22] ~ USecon[, 7] + USecon[, 8] + USecon[, 53] + USecon[, 54] - 1)

Residuals:
   Min     1Q Median     3Q    Max
-57190 -17544   6102  11096  42109

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
USecon[, 7]   1.973e+04  5.673e+03   3.478  0.00337 **
USecon[, 8]   1.509e+00  4.224e-02  35.734 6.24e-16 ***
USecon[, 53] -1.241e+04  4.004e+03  -3.100  0.00731 **
USecon[, 54]  9.667e+03  2.815e+03   3.434  0.00369 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29010 on 15 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.995e+05 on 4 and 15 DF, p-value: < 2.2e-16

(As a side note, when the intercept is dropped, R computes R-squared against a baseline of 0 rather than the mean of the response, so the reported R-squared of 1 is not comparable to the values in the earlier summaries.)

All of the variables have p values which are less than 0.05, so we cannot remove any more covariates from the model using this method. As an important note, in general the order in which you remove the covariates is very important, and can lead to possibly different conclusions. This is a weak point of using this method to attempt to perform model selection (model selection is the act of finding out which covariates are actually important or relevant for your response variable).

1.c) What can you say about the type 1 error rate of the sequence of tests you made? Here I am asking for the familywise error rate, or the probability of at least one type 1 error in the sequence of tests that we performed.
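For reference, if the k tests were independent and each were run at level alpha, the familywise error rate would be 1 - (1 - alpha)^k. A minimal numeric sketch (Python; the k values are illustrative, and the sequential tests above are not actually independent, so this is only a reference point):

```python
alpha = 0.05

# Probability of at least one type 1 error across k INDEPENDENT tests,
# each run at level alpha. Grows quickly with k.
for k in (1, 2, 3, 6):
    fwer = 1 - (1 - alpha) ** k
    print(k, round(fwer, 4))

# Bonferroni: testing each hypothesis at alpha / k keeps the familywise
# error rate at or below alpha, whatever the dependence structure.
k = 6
bonferroni_level = alpha / k  # ~0.00833
```

Even at k = 6 independent tests, the chance of at least one false rejection is already above 26%, which is why the familywise rate exceeds the nominal alpha.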
The statement here should be that the type 1 error rate of the sequence of tests that we made is greater than alpha = 0.05 for the above example.

1.d) Perform the F test for the hypothesis that the coefficients which you removed from the full model were simultaneously 0.

Here we need to compare the full model to the model without an intercept and also without the Gov't Consumption covariate. The SSE in the full model is: 12426170106. The SSE in the reduced model, without the Intercept and Gov't Consumption, is: 12626701098. The df in the full model is 19-5-1=13. We are dropping 2 covariates from the model, so the other df parameter is 2. With this information we can calculate the F statistic from the formula in the notes, F = ((SSE_reduced - SSE_full)/q) / (SSE_full/df_full), where q is the number of coefficients dropped; here it is 0.1048957. As a note for error checking, the F stat is ALWAYS non-negative!!!!! If you get a negative F stat you have made a mistake! The critical value is qf(0.95,2,13)=3.805565. Since the F statistic is smaller than the critical value, we fail to reject the null hypothesis. We conclude that, in the presence of the other variables, there is not definitively a linear relationship between the Intercept and Gov't Consumption and GDP. Therefore, we may abandon the larger model for the reduced model in which they are dropped from the set of covariates.

1.e) What can you say about the type 1 error rate for this test?

Since we performed a single test to assess whether or not the coefficients on the Intercept and on Gov't Consumption were simultaneously 0, the type 1 error rate of this test is alpha, which we set to be 0.05 in our analysis.

1.f) Perform the F test to test whether or not there is a regression relation between GDP and the full set of covariates initially given in this problem.

Here, the reduced model is the model with only the intercept parameter. The SSE in the full model is the same as before. The df in the full model is the same as before.
The SSE in the reduced model is: 1.320014e+13. We are dropping 5 covariates from the model, so the other df parameter is 5. With this information we can calculate the F statistic, which is: 2759.342. The critical value is qf(0.95,5,13)=3.025438. Since the F statistic is larger than the critical value, we reject the null hypothesis at the 95% level and conclude that there is a linear relationship between GDP and at least one of the covariates; that is, not all of the coefficients on the covariates are simultaneously 0.

We could also have carried out this test directly from the summary of the original regression. Just like with individual predictors, R by default tests the hypothesis that there is no regression relation when it calculates the F stat. For proof of this, observe that the F stat calculated by R in the first summary I pasted in question (1.)) was "F-statistic: 2759 on 5 and 13 DF". The slight difference is due to rounding when R reports its numbers to you.

2.a) In the full regression model which you initially computed in question 1, interpret the beta coefficient on each predictor.

For each of the covariates, the interpretation is that the coefficient indicates the amount by which GDP tended to increase, in the data we observed, for a one-unit change in that covariate, holding all other covariates fixed. The 'holding all other covariates fixed' statement is very important here, as changing one covariate may naturally influence other covariates. Take long-term and short-term interest rates, for example: one might expect the two to be related due to economic theory, and the data supports the assertion that they are in fact related. For this question, I wanted you to write out this logic for each of the covariates.

2.b) Interpret the intercept coefficient.
We usually interpret the intercept coefficient as our estimate of the response variable, GDP, when all of the other covariates are 0. This, however, doesn't make sense economically in this example. Furthermore, even setting aside economic theory (and the fact that it's absurd!), we would be extrapolating if we tried to interpret the intercept coefficient as our guess of GDP when all of the other covariates are 0, since we have no data for this scenario.

3.) Answer one question you find interesting about economics using the dataset and regression techniques you've learned. If you have trouble coming up with a good question, ask a classmate who studies economics for some ideas!
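As a closing numeric check, the two F statistics from questions 1.d) and 1.f) can be reproduced from the SSE values reported there. A minimal sketch in Python (the lab itself uses R; scipy.stats.f.ppf plays the role of R's qf):

```python
from scipy import stats

sse_full = 12426170106  # SSE of the full model; df = 19 - 5 - 1 = 13
df_full = 13

# 1.d) Reduced model drops the Intercept and Gov't Consumption (q = 2).
sse_reduced = 12626701098
q = 2
f_partial = ((sse_reduced - sse_full) / q) / (sse_full / df_full)  # ~0.1049
crit_partial = stats.f.ppf(0.95, q, df_full)                       # ~3.8056

# 1.f) Reduced model is intercept-only, so all 5 slopes are dropped (q = 5).
sse_intercept_only = 1.320014e+13
q = 5
f_overall = ((sse_intercept_only - sse_full) / q) / (sse_full / df_full)  # ~2759.3
crit_overall = stats.f.ppf(0.95, q, df_full)                              # ~3.0254
```

Both statistics match the values worked out above, and the critical values agree with R's qf(0.95, 2, 13) and qf(0.95, 5, 13).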