UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Regression--Testing the Normality Assumption
Recall that the OLS Regression method assumes that the errors (the “e’s”) in the population regression equation
have a normal distribution with a mean of zero. However, we usually cannot observe the e’s directly, because we
don’t have data on all individuals in the population. Instead, we only have data for a sample of individuals, and
we use the data from the sample to calculate the residuals (the “𝑒̂ ’s”) which are our estimates of the e’s. So, we
use our information about the residuals (the 𝑒̂ ’s) to test the assumption about the e’s. Therefore, we examine the
residuals (the 𝑒̂ ’s) instead of the true errors (the e’s) and try to determine whether the residuals (the 𝑒̂ ’s) have a
normal distribution with a mean of zero.
Effects of Non-Normal Residuals
So, why do we care? What happens if the assumption is violated, and the distribution of the residuals is not
normal? Well, luckily, the estimates of the 𝛽̂'s in the regression equation remain unbiased, and the estimates of
the s.e.'s of the 𝛽̂'s remain unbiased. However, this "good luck" doesn't matter much, because when the residuals
are not normal, the t-tests that we use to test the significance of the 𝛽̂'s and their associated X's are no longer
valid: the t-distribution may no longer apply, and we don't even know what other distribution might apply
instead! We don't know whether we should use a t-test, an F-test, a chi-square test, or some test based on some
other distribution to test the significance of the 𝛽̂'s, because we don't know which distribution applies to the
situation. We are left unable to do hypothesis tests of the 𝛽̂'s, so we don't know which X's have a statistically
significant effect on Y and which do not. That is, we can still estimate the 𝛽̂ for each X, but we don't know
whether the 𝛽̂ is statistically different from zero; it might be zero, in which case its X has no effect on Y. So, in
the end, we are left not knowing which X's affect Y and which don't! Yikes!
Detecting Non-Normality of the Residuals
Many methods and tests have been devised to detect non-normality of the residuals in various situations. We’ll
discuss three of the simplest and most commonly used methods here and save the other methods for more
advanced courses.
1. Plot a histogram of the residuals, and look at it to determine whether the residuals look like they have a
normal distribution. Remember that a normal distribution is a “bell-shaped” distribution. Also, look to
see whether the distribution appears to be centered on zero. If the distribution of the residuals is not bell-shaped, or if it is not centered near zero, or both, then you have the non-normality problem.
To do this in SAS, when you use PROC REG to do a regression, use an “output” command to save the
residuals. Give the residuals a name like “ehat”. Then, use PROC CHART to make a histogram of the
ehats, as shown below. Check the histogram of the ehats to determine whether it is bell-shaped and
centered on zero.
proc reg data=dataset02;
   model Y = X1 X2;
   output out=dataset03 r=ehat;   /* save the residuals as "ehat" in dataset03 */
run;
proc chart data=dataset03;
   vbar ehat / levels=13;   /* vertical-bar histogram of the residuals with 13 bars */
run;
2. Examine the mean, median, skewness and kurtosis of the residuals. If the distribution of the residuals is
bell-shaped, then (1) the mean will be approximately equal to the median, (2) the skewness will be
approximately zero, and (3) kurtosis will be approximately 3. These three things need to be true for the
distribution to be a normal-shaped distribution. If one or more of these things is not true, then the
distribution is not a normal distribution. To determine whether the distribution is centered on zero, all
you need to do is to check that the mean is approximately zero. If either the distribution is not normal, or
if it is not centered on zero, or both, then you have the non-normality problem.
To do this in SAS, when you use PROC REG to do a regression, use an "output" command to save the
residuals. Give the residuals a name like "ehat". Then, use PROC MEANS to calculate the mean,
median, skewness and kurtosis of the ehat's, as shown below. (One caution: SAS reports excess kurtosis,
which is kurtosis minus 3, so for a bell-shaped distribution the kurtosis value printed by PROC MEANS
should be approximately 0, not 3.) Check the results from PROC MEANS to determine whether the
distribution of the ehats meets the tests described above.
proc reg data=dataset02;
   model Y = X1 X2;
   output out=dataset03 r=ehat;   /* save the residuals as "ehat" in dataset03 */
run;
proc means data=dataset03 vardef=df maxdec=3 mean median skew kurt;
   var ehat;   /* "kurt" here is excess kurtosis: approximately 0 for a normal distribution */
run;
3. Use a Jarque-Bera (JB) Test¹ to determine whether the residuals have a normal distribution centered on
zero. The JB test is a test of the following hypotheses:
H0: the residuals are normally-distributed and centered on zero
H1: the residuals are not normally-distributed and/or not centered on zero
The JB test compares a JBtest number (from a formula) against a JBcritical number (from the chi-square (χ²)
table) to determine whether H0 is accepted or rejected. The idea is that, as the distribution of residuals
moves away from normal, then JBtest becomes bigger, and if it becomes bigger than JBcritical, then the
decision is made that the distribution of residuals is not normal.
The formula for JBtest is:
JBtest = [ (n − k + 1) / 6 ] ∙ [ (skewness)² + (kurtosis − 3)² / 4 ]
where n is the sample size and k is the number of β's in the regression equation.
The JBcritical number is found in the chi-square (χ²) table, using d.f. = 2.
This is a one-sided test, so use α, not α/2.
Finally, as usual with hypothesis tests, compare JBtest to JBcritical:
If JBtest > JBcritical, then reject H0 and accept H1 ===> you have a non-normality problem
If JBtest < JBcritical, then accept H0 and reject H1 ===> you don’t have a non-normality problem
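One way to compute the JB test in SAS is sketched below. This is a minimal sketch, not part of the standard handout examples: it assumes the residuals were saved as "ehat" in dataset03 (as in the earlier examples), that the model has k = 3 β's (an intercept plus X1 and X2), and that α = 0.05. The dataset and variable names (jbstats, jbtest, skw, krt, nonnormal) are illustrative. Because SAS reports excess kurtosis (kurtosis minus 3), the reported kurtosis value is used directly in place of (kurtosis − 3).

proc means data=dataset03 noprint;
   var ehat;
   output out=jbstats n=n skew=skw kurt=krt;   /* krt is excess kurtosis (kurtosis - 3) */
run;

data jbtest;
   set jbstats;
   k = 3;                                        /* assumed number of beta's: intercept, X1, X2 */
   JBtest = ((n - k + 1) / 6) * (skw**2 + (krt**2) / 4);
   JBcritical = cinv(0.95, 2);                   /* chi-square critical value, alpha = 0.05, d.f. = 2 */
   nonnormal = (JBtest > JBcritical);            /* 1 = reject H0: you have a non-normality problem */
run;

proc print data=jbtest noobs;
   var n JBtest JBcritical nonnormal;
run;

With α = 0.05 and d.f. = 2, cinv(0.95, 2) returns approximately 5.99, the same JBcritical value you would read from the χ² table.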
¹ Bowman, K. O. and L. R. Shenton, 1975. Omnibus contours for normality based on √b1 and b2. Biometrika, 62, 243-250.
Bera, A. K. and C. M. Jarque, 1982. Model specification tests: A simultaneous approach. Journal of Econometrics, 20, 59-82.
Jarque, C. M. and A. K. Bera, 1987. A test for normality of observations and regression residuals. International Statistical Review, 55, 163-172.
Thadewald, T. and H. Büning, 2004. Jarque-Bera test and its competitors for testing normality - A power comparison. Discussion Paper Economics 2004/9, School of Business and Economics, Free University of Berlin.
Causes of the Non-Normality Problem
The non-normality problem is usually caused by either (1) outlier data points or (2) a model specification
problem, in which the Y variable or one or more X variables has been included in the model in the wrong
functional form (for example, the variable should enter the regression model in logged or squared form).
Remedies for the Non-Normality Problem
If the non-normality problem is caused by outliers, then one of the remedies discussed in the Outliers handout can
be used to detect and remove/modify the outliers.
If the non-normality problem is caused by a model specification problem, then the procedures discussed in the
Functional Form handout can be used to detect which variables might need transformation and how to conduct
any needed transformations. In particular, if the distribution of a Y or X variable is skewed to the right (you can
check the skewness of each variable from PROC MEANS output), then logging the variable before including it in
the regression analysis may reduce any non-normality that shows up in the residuals.
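A minimal sketch of this log-transform remedy is below. It assumes Y is strictly positive and right-skewed; the dataset names (dataset04, dataset05) and the variable name logY are illustrative, following the earlier examples.

data dataset04;
   set dataset02;
   logY = log(Y);   /* natural log; requires Y > 0 */
run;

proc reg data=dataset04;
   model logY = X1 X2;
   output out=dataset05 r=ehat;   /* re-check these residuals for normality as above */
run;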
Other Tests for Non-Normality (not required, just FYI)
There are a number of other tests for non-normality, including kernel density plots (similar to histograms, but,
unlike histograms, the results don't depend on the choice of origin or the number of bars chosen for the
histogram), quantile-quantile plots (good for detecting non-normality in the tails of the distribution),
normal-probability plots (good for detecting non-normality near the center of the distribution), the Shapiro-Wilk
test (for sample sizes up to 50), the Anderson-Darling, Martinez-Iglewicz, or D'Agostino tests (for sample sizes
between 50 and 2000), and the Kolmogorov test (for sample sizes greater than 2000).
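Several of these are available in SAS through PROC UNIVARIATE; a brief sketch is below, again assuming the residuals were saved as "ehat" in dataset03 as in the earlier examples. The NORMAL option prints the Shapiro-Wilk, Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling normality tests, and the QQPLOT statement draws a quantile-quantile plot.

proc univariate data=dataset03 normal;
   var ehat;
   histogram ehat / normal;                  /* histogram with a fitted normal curve overlaid */
   qqplot ehat / normal(mu=est sigma=est);   /* Q-Q plot against a normal with estimated mean and s.d. */
run;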
When the sample size is small, most normality tests have small statistical power (probability of detecting
non-normal data). If the null hypothesis of normality is rejected, the data are definitely non-normal. But if the test
fails to reject the null hypothesis, the conclusion is uncertain: all you know is that there was not enough evidence
to reject the normality assumption. Hair, Anderson, Tatham and Black (1992) suggest that with small sample
sizes it is safer to use both a normal-probability plot and test statistics to detect non-normality.
Although many authors (Looney, 1995) recommend using skewness and kurtosis for examining normality, as in
the JB test, Wilkinson (1999) argued that skewness and kurtosis often fail to detect distributional irregularities in
the residuals. By this argument, the JB and D'Agostino tests may be less useful than other tests.