Significance Tests for Regression Analysis
A. Testing the Significance of Regression Models
The first important significance test is for the regression
model as a whole. In this case, it is a test of the model
Y = f(X)
The null hypothesis is:
H0: β = 0.0
In simple regression, this test is redundant: since there
is only one variable in the model (X), it duplicates the
information provided by the t-test for the slope.
In the case of multiple regression, it becomes much
more important for evaluating models with several
independent variables. In that case, the model is
Y = f(X1, X2, X3, . . . , Xk)
where the null hypothesis is
H0: 1 = 2 = 3 = . . . K = 0.0
Notice that this is similar to the null hypothesis in the
analysis of variance with multiple treatment groups.
All the information we require is already available in the
analysis of variance summary table. In our little
time/temperature example, we have the two mean
squares, for the model (98.637) and for the error term
(0.030). The ratio of the two is 3287.90, clearly greater
than one. However, making the usual significance test
with one and one degrees of freedom at the 0.05 level,
we find the critical value of F is 161.40 (Appendix 3, p.
544). Since 3287.90 is greater than 161.40, the F-ratio
lies inside the region of rejection. Hence, we REJECT
the null hypothesis that all of the regression
coefficients in the model equal zero in favor
of the alternate hypothesis that at least ONE of the
regression coefficients in our model differs from
zero.
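As a quick check, the F-ratio computation and decision rule above can be sketched in a few lines of Python, using only the values quoted in the text (no statistics library is assumed):

```python
# F-test for the overall regression model (time/temperature example).
# The mean squares and the critical value are taken from the text.
ms_model = 98.637    # mean square for regression
ms_error = 0.030     # mean square for error
f_critical = 161.40  # Appendix 3: F(1, 1) at alpha = 0.05

f_ratio = ms_model / ms_error
print(round(f_ratio, 1))     # 3287.9
print(f_ratio > f_critical)  # True, so we reject H0
```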
B. Testing the Significance of the Regression
Coefficient
The null hypothesis in the significance test for the
regression coefficient (i.e., slope) is:
H0: β = 0.0
This simple symbolic expression says more than
might first appear. It says: If we begin by assuming
that there is no relationship between X and Y in
general (i.e., in the universe from which our sample
data come), then how likely is it that we would find a
regression coefficient for our sample to be
DIFFERENT FROM 0.0? Put the other way around, if
we find a relationship between X and Y in the sample
data, can we infer that there is a relationship between
X and Y in general?
To test this null hypothesis, we use our old friend the
t-test:
t = b / σ̂b
where the standard error is:
σ̂b = √[ MSError / (sX²(N − 1)) ]
This is the standard deviation of the sampling
distribution of all theoretically possible regression
coefficients for samples of the same size drawn
randomly from the same universe. Recall that the
mean of this sampling distribution has a value equal to
the population characteristic (parameter), in this case
the value of the regression coefficient in the universe.
Under the null hypothesis, we initially assume that this
value is 0.0.
To test the significance of the regression coefficient
(and the model as a whole), we need the statistical
information found in the usual analysis of variance
summary table. We already have most of this
information for our previous example.
Recall that R²YX, the Coefficient of Determination, was
found from
R²YX = SSRegression / SSTotal
From our time/temperature example, remember that
R²YX was 0.9997. Total sum of squares can be found
from
SSTotal = sY²(N − 1)
From our previous calculations, remember that sY² was
49.333. Thus, SSTotal is
SSTotal = (49.333) (3 - 1)
SSTotal = (49.333) (2)
SSTotal = 98.667
By rearranging the formula for R²YX, we get
SSRegression = (R²YX)(SSTotal)
With our sample data,
SSRegression = (0.9997)(98.667)
SSRegression = 98.637
Now, because of the identity among the three sums of
squares, we can find the sum of squares for the error
term (residual) by subtraction,
SSError = SSTotal - SSRegression
SSError = 98.667 - 98.637
SSError = 0.030
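The sums-of-squares bookkeeping above can be sketched as follows; the values are the ones given in the text, and small rounding differences (e.g., 98.666 versus 98.667) come from carrying sY² to only three decimals:

```python
# Partitioning the total sum of squares (time/temperature example).
s2_y = 49.333       # variance of Y (temperature)
n = 3               # number of observations
r_squared = 0.9997  # coefficient of determination

ss_total = s2_y * (n - 1)             # ~98.667
ss_regression = r_squared * ss_total  # ~98.637
ss_error = ss_total - ss_regression   # ~0.030
```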
All we need to do now is to determine the various
numbers of degrees of freedom. Because we have
three observations, we know that we still have two total
degrees of freedom. Degrees of freedom for the model
is the number of independent variables in the model, in
this case one. Because of the identity between the three
values of degrees of freedom, the total less the model
gives us the error degrees of freedom, in this case 2 – 1,
or one degree of freedom for the error term. Now we
can complete an analysis of variance summary table for
the regression example.
Table 1. Analysis of Variance Summary Table for Time-Temperature
Example.
==========================================================
Source                    SS      df    Mean Square        F
----------------------------------------------------------
Regression (Between)  98.637       1       98.637     3287.90
Error (Within)         0.030       1        0.030
Total                 98.667       2
----------------------------------------------------------
Now we can return to the task of testing the significance
of the regression coefficient. First, we need to
estimate the standard error of b. This is
σ̂b = √[ MSError / (sX²(N − 1)) ]
In our example, the variance of X, time, was 5.083.
σ̂b = √[ 0.030 / (5.083)(3 − 1) ]
σ̂b = √0.00295
σ̂b = 0.0543
Now we have the value of our "currency conversion"
factor, which allows us to convert the difference
between our sample regression coefficient and the mean of the
sampling distribution into Student's t values that lie on
the underlying x-axis. Recall that our regression
coefficient had a value of −3.115. Remember also that,
under the null hypothesis, β = 0.0. Thus, our t-statistic
is
t = (b − β) / σ̂b
t = (−3.115 − 0.0) / 0.0543
This, for practical purposes, is the same as
t = −3.115 / 0.0543
Thus, the value of the t-statistic is
t = −57.342
Since we have not specified in advance whether our
sample regression coefficient would have a positive or a
negative value, we should perform a two-tailed test of
significance. Let's again set alpha to be 0.05. The
appropriate sampling distribution of Student's t for this
test is the one defined by degrees of freedom for the
error term because we use the MSError in calculating the
value of the t-test. From Appendix 2, p. 543, we find the
critical value to be 12.706 (row df = 1, two-tailed test
column 0.05). Because this is a two-tailed test, we
have two critical values, + 12.706 and - 12.706.
Since the t-statistic of −57.342 exceeds the critical
value of −12.706 in absolute value, we know that it lies within the
region of rejection, and therefore we REJECT the null
hypothesis. We conclude that the sample regression
coefficient is statistically significant at the 0.05 level.
This means that the association between time of first
sun and afternoon high temperature probably holds in
general, not just in our sample.
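Putting the pieces of section B together, the two-tailed t-test can be sketched as follows (the values are those used in the text; the small difference from the SAS output's t of −54.848 reflects the rounded mean squares used here):

```python
# Two-tailed t-test for the regression coefficient.
import math

b = -3.115           # sample regression coefficient
ms_error = 0.030     # mean square error
s2_x = 5.083         # variance of X (time)
n = 3                # number of observations
t_critical = 12.706  # Appendix 2: df = 1, two-tailed, alpha = 0.05

se_b = math.sqrt(ms_error / (s2_x * (n - 1)))
t_stat = b / se_b                     # ~ -57.34
reject_h0 = abs(t_stat) > t_critical  # True
```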
C. Significance Test for the Correlation Coefficient
We could calculate a critical value of rXY based on
the general relationship between rXY and F, which
is:
F = r²(n − 2) / (1 − r²)
The critical value can be found by:
rcritical = √[ Fcritical / (n − 2 + Fcritical) ]
Alternatively, we could use a table such as
Appendix 5, p. 548:
In the present case, the correlation coefficient is:
rXY = −0.9996
At α = 0.05 with df = 1 (for df = n − 2), the critical value
of rXY for a two-tailed test from Appendix 5 is ± 0.997.
Therefore, the correlation coefficient IS statistically
significant at the 0.05 level.
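The relationship between the critical F and the critical r can be verified directly; a sketch using the critical F from Appendix 3 reproduces the tabled value of 0.997:

```python
# Critical value of r derived from the critical value of F.
import math

n = 3                # sample size, so df = n - 2 = 1
f_critical = 161.40  # Appendix 3: F(1, 1) at alpha = 0.05

r_critical = math.sqrt(f_critical / (n - 2 + f_critical))
print(round(r_critical, 3))  # 0.997
```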
Simple Regression Analysis Example
PPD 404

Model: MODEL1
Dependent Variable: TEMP

Analysis of Variance

Source      DF    Sum of Squares    Mean Square     F Value    Prob>F
Model        1          98.63388       98.63388    3008.333    0.0116
Error        1           0.03279        0.03279
C Total      2          98.66667

Root MSE     0.18107      R-square    0.9997
Dep Mean    82.66667      Adj R-sq    0.9993
C.V.         0.21904

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    T for H0: Parameter=0    Prob > |T|
INTERCEP     1            107.065574        0.45696262                  234.298        0.0027
TIME         1             -3.114754        0.05678855                  -54.848        0.0116

Time and Temperature Example
Correlation Analysis

2 'VAR' Variables: TIME TEMP

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3

             TIME        TEMP
TIME      1.00000    -0.99983
           0.0         0.0116
TEMP     -0.99983     1.00000
           0.0116      0.0
Significance Tests Exercise
A regression analysis produced the following analysis of variance summary table as well as the
following regression results: the regression coefficient had a value of 3.14. The variance of the
independent variable (X) was 1.92. Complete the computations in the ANOVA summary table. Use
Appendix 3 for the F-test and Appendix 2 for the t-test. Assume that α = 0.05. NOTE: Be sure to
perform a two-tailed t-test for the regression coefficient.
==================================================================================================
Source          SS        df     Mean Square              F
--------------------------------------------------------------------------------------------------
Regression    655.52       1     ___________    ___________
Error         195.80      29     ___________
Total         851.32      30
--------------------------------------------------------------------------------------------------
1. What is the critical value of F?                          ______________
2. Is the model statistically significant?                   ______________
3. What is the value of the standard error for b?            ______________
4. What is the value of the t statistic?                     ______________
5. What is the critical value of t?                          ______________
6. Is the regression coefficient statistically significant?  ______________
Significance Tests Exercise Answers
A regression analysis produced the following analysis of variance summary table as well as the
following regression results: the regression coefficient had a value of 3.14. The variance of the
independent variable (X) was 1.92. Complete the computations in the ANOVA summary table. Use
Appendix 3 for the F-test and Appendix 2 for the t-test. Assume that α = 0.05. NOTE: Be sure to
perform a two-tailed t-test for the regression coefficient.
==================================================================================================
Source          SS        df     Mean Square          F
--------------------------------------------------------------------------------------------------
Regression    655.52       1        655.520       97.089
Error         195.80      29          6.752
Total         851.32      30
--------------------------------------------------------------------------------------------------
1. What is the critical value of F?                          4.18
2. Is the model statistically significant?                   Yes
3. What is the value of the standard error for b?            0.342
4. What is the value of the t statistic?                     9.17
5. What is the critical value of t?                          2.045
6. Is the regression coefficient statistically significant?  Yes
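The exercise answers can be checked with the same formulas used earlier in the document. The sketch below recomputes them from the given sums of squares; note that with b = 3.14 and SE(b) ≈ 0.342, the t statistic works out to about 9.17:

```python
# Checking the exercise answers from the ANOVA table and the given values.
import math

ss_regression, df_regression = 655.52, 1
ss_error, df_error = 195.80, 29
b = 3.14     # regression coefficient
s2_x = 1.92  # variance of X
n = 31       # total df = 30, so N - 1 = 30

ms_regression = ss_regression / df_regression  # 655.520
ms_error = ss_error / df_error                 # ~6.752
f_ratio = ms_regression / ms_error             # ~97.09

se_b = math.sqrt(ms_error / (s2_x * (n - 1)))  # ~0.342
t_stat = b / se_b                              # ~9.17
```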