Significance Tests for Regression Analysis

A. Testing the Significance of Regression Models

The first important significance test is for the regression model as a whole. In this case, it is a test of the model

Y = f(X)

The null hypothesis is:

H0: β = 0.0

Here, the test is redundant. Since there is only one independent variable in the model (X), it duplicates the information provided by the t-test. In the case of multiple regression, it becomes much more important for evaluating models with several independent variables. In that case, the model is

Y = f(X1, X2, X3, . . ., Xk)

where the null hypothesis is

H0: β1 = β2 = β3 = . . . = βk = 0.0

Notice that this is similar to the null hypothesis in the analysis of variance with multiple treatment groups. All the information we require is already available in the analysis of variance summary table. In our little time/temperature example, we have the two mean squares, for the model (98.637) and for the error term (0.030). The ratio of the two is F = 98.637 / 0.030 = 3287.90, clearly greater than one. Making the usual significance test with 1 and 1 degrees of freedom at the 0.05 level, we find the critical value of F to be 161.40 (Appendix 3, p. 544). Since 3287.90 is greater than 161.40, the F-ratio lies inside the region of rejection. Hence, we REJECT the null hypothesis that none of the regression coefficients in the model is greater than zero in favor of the alternate hypothesis that at least ONE of the regression coefficients in our model is greater than zero.

B. Testing the Significance of the Regression Coefficient

The null hypothesis in the significance test for the regression coefficient (i.e., the slope) is:

H0: β = 0.0

This simple symbolic expression says more than might first appear. It says: if we begin by assuming that there is no relationship between X and Y in general (i.e., in the universe from which our sample data come), then how likely is it that we would find a regression coefficient for our sample that is DIFFERENT FROM 0.0?
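The overall F-test in section A amounts to one division and one table lookup. A minimal Python sketch (the mean squares and the Appendix 3 critical value are taken from the text above):

```python
# F-test for the overall regression model (time/temperature example).
ms_regression = 98.637   # mean square for the model, from the ANOVA table
ms_error = 0.030         # mean square for the error term

f_ratio = ms_regression / ms_error
print(round(f_ratio, 2))   # 3287.9

# Critical value of F for 1 and 1 degrees of freedom at alpha = 0.05 (Appendix 3)
f_critical = 161.40
print(f_ratio > f_critical)   # True: reject the null hypothesis
```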
Put the other way around, if we find a relationship between X and Y in the sample data, can we infer that there is a relationship between X and Y in general? To test this null hypothesis, we use our old friend the t-test:

t = b / σ̂b

where the standard error is:

σ̂b = sqrt[ MSError / (sX² (N − 1)) ]

This is the standard deviation of the sampling distribution of all theoretically possible regression coefficients for samples of the same size drawn randomly from the same universe. Recall that the mean of this sampling distribution has a value equal to the population characteristic (parameter), in this case the value of the regression coefficient in the universe. Under the null hypothesis, we initially assume that this value is 0.0.

To test the significance of the regression coefficient (and the model as a whole), we need the statistical information found in the usual analysis of variance summary table. We already have most of this information from our previous example. Recall that R²YX, the coefficient of determination, was found from

R²YX = SSRegression / SSTotal

From our time/temperature example, remember that R²YX was 0.9997. The total sum of squares can be found from

SSTotal = sY² (N − 1)

From our previous calculations, remember that sY² was 49.333. Thus, SSTotal is

SSTotal = (49.333)(3 − 1)
SSTotal = (49.333)(2)
SSTotal = 98.667

By rearranging the formula for R²YX, we get

SSRegression = (R²YX)(SSTotal)

With our sample data,

SSRegression = (0.9997)(98.667)
SSRegression = 98.637

Now, because of the identity among the three sums of squares, we can find the sum of squares for the error term (residual) by subtraction:

SSError = SSTotal − SSRegression
SSError = 98.667 − 98.637
SSError = 0.030

All we need to do now is to determine the various numbers of degrees of freedom. Because we have three observations, we know that we have two total degrees of freedom. The degrees of freedom for the model is the number of independent variables in the model, in this case one.
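These sums-of-squares calculations are easy to check with a few lines of Python (a minimal sketch using the rounded values quoted above, so the results agree with the text up to rounding):

```python
# Sum-of-squares decomposition for the time/temperature example.
n = 3
var_y = 49.333       # sample variance of Y, from the text
r_squared = 0.9997   # coefficient of determination

ss_total = var_y * (n - 1)             # ~98.667
ss_regression = r_squared * ss_total   # ~98.637
ss_error = ss_total - ss_regression    # ~0.030

print(round(ss_total, 3), round(ss_regression, 3), round(ss_error, 3))
```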
Because of the identity among the three values of degrees of freedom, the total less the model gives us the error degrees of freedom, in this case 2 − 1, or one degree of freedom for the error term. Now we can complete an analysis of variance summary table for the regression example.

Table 1. Analysis of Variance Summary Table for Time-Temperature Example.
==========================================================
Source                    SS      df   Mean Square        F
----------------------------------------------------------
Regression (Between)   98.637      1      98.637     3287.90
Error (Within)          0.030      1       0.030
Total                  98.667      2
----------------------------------------------------------

Now we can return to the task of testing the significance of the regression coefficient. First, we need to estimate the standard error of b. This is

σ̂b = sqrt[ MSError / (sX² (N − 1)) ]

In our example, the variance of X, time, was 5.083. Thus,

σ̂b = sqrt[ 0.030 / ((5.083)(3 − 1)) ]
σ̂b = sqrt(0.00295)
σ̂b = 0.0543

Now we have the value of our "currency conversion" factor which allows us to convert the difference between our sample regression coefficient and the mean of the sampling distribution into Student's t values that lie on the underlying x-axis. Recall that our regression coefficient had a value of −3.115. Remember also that, under the null hypothesis, β = 0.0. Thus, our t-statistic is

t = (b − β) / σ̂b
t = (−3.115 − 0.0) / 0.0543

This, for practical purposes, is the same as

t = −3.115 / 0.0543

Thus, the value of the t-statistic is

t = −57.342

Since we have not specified in advance whether our sample regression coefficient would have a positive or a negative value, we should perform a two-tailed test of significance. Let's again set alpha to 0.05. The appropriate sampling distribution of Student's t for this test is the one defined by the degrees of freedom for the error term, because we use MSError in calculating the value of the t-test. From Appendix 2, p. 543, we find the critical value to be 12.706 (row df = 1, two-tailed test, column 0.05). Because this is a two-tailed test, we have two critical values, +12.706 and −12.706. Since the t-statistic of −57.342 lies beyond the critical value −12.706 (that is, farther from zero), we know that it lies within the region of rejection, and therefore we REJECT the null hypothesis. We conclude that the sample regression coefficient is statistically significant at the 0.05 level. This means that the association between time of first sun and afternoon high temperature probably holds in general, not just in our sample.

C. Significance Test for the Correlation Coefficient

We could calculate a critical value of rXY based on the general relationship between rXY and F, which is:

F = r² (n − 2) / (1 − r²)

The critical value can be found by:

rcritical = sqrt[ Fcritical / ((n − 2) + Fcritical) ]

Alternatively, we could use a table such as Appendix 5, p. 548. In the present case, the correlation coefficient is:

rXY = −0.9996

At α = 0.05 with df = 1 (for df = n − 2), the critical value of rXY for a two-tailed test from Appendix 5 is 0.997. Since the absolute value of −0.9996 exceeds 0.997, the correlation coefficient IS statistically significant at the 0.05 level.
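The section B and C calculations can be checked together in Python (a minimal sketch; the critical values 12.706 and 161.40 come from the t and F tables cited above):

```python
import math

# B. t-test for the regression coefficient (time/temperature example)
ms_error = 0.030   # from the ANOVA table
var_x = 5.083      # variance of X (time)
n = 3
b = -3.115         # sample regression coefficient

se_b = math.sqrt(ms_error / (var_x * (n - 1)))   # ~0.0543
t_stat = b / se_b                                # ~-57.34
print(abs(t_stat) > 12.706)   # True: beyond the two-tailed critical value

# C. critical value of r derived from the critical value of F (df = n - 2 = 1)
f_critical = 161.40
r_critical = math.sqrt(f_critical / ((n - 2) + f_critical))
print(round(r_critical, 3))   # 0.997, matching Appendix 5
```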
Simple Regression Analysis Example
PPD 404

Model: MODEL1
Dependent Variable: TEMP

Analysis of Variance

Source    DF   Sum of Squares   Mean Square    F Value   Prob>F
Model      1         98.63388      98.63388   3008.333   0.0116
Error      1          0.03279       0.03279
C Total    2         98.66667

Root MSE    0.18107    R-square   0.9997
Dep Mean   82.66667    Adj R-sq   0.9993
C.V.        0.21904

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|
INTERCEP    1           107.065574       0.45696262                 234.298       0.0027
TIME        1            -3.114754       0.05678855                 -54.848       0.0116

Time and Temperature Example
Correlation Analysis

2 'VAR' Variables: TIME TEMP

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3

           TIME       TEMP
TIME    1.00000   -0.99983
         0.0        0.0116
TEMP   -0.99983    1.00000
         0.0116     0.0

Significance Tests Exercise

A regression analysis produced the following analysis of variance summary table as well as the following regression results: the regression coefficient had a value of 3.14. The variance of the independent variable (X) was 1.92. Complete the computations in the ANOVA summary table. Use Appendix 3 for the F-test and Appendix 2 for the t-test. Assume that α = 0.05. NOTE: Be sure to perform a two-tailed t-test for the regression coefficient.

==================================================
Source         SS      df   Mean Square        F
--------------------------------------------------
Regression   655.52     1
Error        195.80    29
Total        851.32    30
--------------------------------------------------

1. What is the critical value of F? ______________
2. Is the model statistically significant? ______________
3. What is the value of the standard error for b? ______________
4. What is the value of the t statistic? ______________
5. What is the critical value of t? ______________
6. Is the regression coefficient statistically significant? ______________

Significance Tests Exercise Answers

A regression analysis produced the following analysis of variance summary table as well as the following regression results: the regression coefficient had a value of 3.14.
The variance of the independent variable (X) was 1.92. Complete the computations in the ANOVA summary table. Use Appendix 3 for the F-test and Appendix 2 for the t-test. Assume that α = 0.05. NOTE: Be sure to perform a two-tailed t-test for the regression coefficient.

==================================================
Source         SS      df   Mean Square        F
--------------------------------------------------
Regression   655.52     1      655.520     97.089
Error        195.80    29        6.752
Total        851.32    30
--------------------------------------------------

1. What is the critical value of F? 4.18
2. Is the model statistically significant? Yes
3. What is the value of the standard error for b? 0.342
4. What is the value of the t statistic? 9.17
5. What is the critical value of t? 2.045
6. Is the regression coefficient statistically significant? Yes
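As a check, the exercise quantities can be recomputed in Python (a minimal sketch; the Appendix critical values 4.18 and 2.045 are taken from the text, and N is inferred from the 30 total degrees of freedom):

```python
import math

# ANOVA pieces from the exercise table
ss_regression, df_regression = 655.52, 1
ss_error, df_error = 195.80, 29
b = 3.14        # regression coefficient
var_x = 1.92    # variance of X
n = 31          # total df = 30, so N - 1 = 30 and N = 31

ms_regression = ss_regression / df_regression   # 655.520
ms_error = ss_error / df_error                  # ~6.752
f_ratio = ms_regression / ms_error              # ~97.09

# Standard error of b and the t statistic
se_b = math.sqrt(ms_error / (var_x * (n - 1)))  # ~0.342
t_stat = b / se_b                               # ~9.17

print(f_ratio > 4.18)        # True: model significant (F critical, 1 and 29 df)
print(abs(t_stat) > 2.045)   # True: coefficient significant (t critical, 29 df)
```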