18.4 Regression Diagnostics - II
Econometric Problems
• Econometric problems: detection and fixes
• Hands-on problems
Regression Diagnostics
• The required conditions for the model assessment to apply must be checked.
– Is the error variance constant?
– Are the errors independent?
– Is the error variable normally distributed?
– Is multicollinearity a problem?
Econometric Problems
– Heteroskedasticity (standard errors not reliable)
– Non-normal distribution of the error term (t- and F-stats not reliable)
– Autocorrelation or serial correlation (standard errors not reliable)
– Multicollinearity (t-stats may be biased downward)
• Is the error variance constant? (Homoskedasticity)
• When the requirement of a constant variance is met we have homoskedasticity.

[Figure: residuals plotted against the predicted value ŷ. The spread of the data points does not change much.]
• Heteroskedasticity
– When the requirement of a constant variance is violated we have heteroskedasticity. The plot of the residuals vs. the predicted value of Y will exhibit a cone shape.

[Figure: residuals plotted against the predicted value ŷ. The spread increases with ŷ.]
Heteroskedasticity
• When the variance of the error term is different for different values of X you have heteroskedasticity.
• Problem: The OLS estimators for the β's are no longer minimum variance. You can no longer be sure that the value you get for bi lies close to the true βi.
Detection/Fix for Heteroskedasticity
• Detection: A plot of residuals vs. predicted Y exhibits a cone or megaphone shape.
• Advanced test: White test (uses a chi-square stat).
• Fix: White correction (still uses OLS, but keeps heteroskedasticity from making the variances of the OLS estimators swell in size).
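As an illustration (not part of the original slides), here is a minimal sketch of the White correction in Python with statsmodels; the file name data.csv and the columns y, x1, x2 are placeholders.

```python
# Minimal sketch: OLS with White (heteroskedasticity-robust) standard errors.
# "data.csv" and the column names y, x1, x2 are placeholders, not from the slides.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")

ols = smf.ols("y ~ x1 + x2", data=df).fit()                    # plain OLS
robust = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC0")   # White correction

print(ols.bse)     # conventional standard errors
print(robust.bse)  # heteroskedasticity-robust standard errors
```

The point estimates are identical under both fits; only the standard errors (and hence the t-stats) change.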
Heteroskedasticity EX1 (Xr19-06)
• Univ GPA = β0 + β1(HS GPA) + β2(SAT) + β3(Activities) + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.5368707
R Square           0.2882302
Adjusted R Square  0.2659873
Standard Error     2.0302333
Observations       100

ANOVA
            df   SS            MS         F
Regression   3   160.2370587   53.41235   12.95835312
Residual    96   395.6973413   4.121847
Total       99   555.9344

            Coefficients  Standard Error  t Stat
Intercept   0.7211046     1.869815022     0.385656
HS GPA      0.610872      0.100749211     6.063293
SAT         0.0027085     0.002873196     0.942677
Activities  0.0462535     0.064049816     0.722149
Heteroskedasticity EX1
No megaphone pattern => no heteroskedasticity.

[Figure: residuals plotted against the predicted GPA at University.]
White Test
• H0: No heteroskedasticity
  H1: Heteroskedasticity is present
White's chi-square stat is N*R-square from the regression of the squared residuals (ei)^2 against X1, ..., Xk, their squares, and their cross-products:

et^2 = β0 + β1X1 + β2X2 + β3X1X2 + β4X1^2 + β5X2^2
White Test
• If White's test stat > chi-square from the table (with d.f. = # of slope coefficients)
  => reject homoskedasticity; you have a problem with heteroskedasticity.
• If White's test stat < chi-square from the table (with d.f. = # of slope coefficients)
  => fail to reject homoskedasticity; you do not have a problem with heteroskedasticity.
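For reference, a minimal sketch of White's test with statsmodels' het_white; the data file and column names are placeholders.

```python
# Minimal sketch: White's test for heteroskedasticity (statsmodels).
# "data.csv" and the column names y, x1, x2 are placeholders.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_white

df = pd.read_csv("data.csv")
fit = smf.ols("y ~ x1 + x2", data=df).fit()

# het_white regresses the squared residuals on the regressors, their
# squares, and their cross-products; the LM statistic is N*R-square.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(fit.resid, fit.model.exog)
print(lm_stat, lm_pvalue)   # small p-value => reject homoskedasticity
```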
• 12.9 < 16.92 (chi-square, 95% confidence, 9 d.f.) => fail to reject homoskedasticity.

White Heteroskedasticity Test:
F-statistic     1.485871    Probability
Obs*R-squared   12.93651    Probability

Test Equation:
Dependent Variable: RESID^2
Method: Least Squares
Date: 09/10/03  Time: 15:21
Sample: 1 100
Included observations: 100

Variable           Coefficient  Std. Error  t-Statistic
C                  24.26312     30.8596     0.786242
HSGPA              3.928636     2.374181    1.654733
HSGPA^2            -0.07362     0.090684    -0.81185
HSGPA*SAT          -0.0049      0.003259    -1.50251
HSGPA*ACTIVITIES   -0.04933     0.072608    -0.67935
SAT                -0.13973     0.091374    -1.52916
SAT^2              0.000168     7.41E-05    2.269871
SAT*ACTIVITIES     -0.00084     0.001889    -0.44207
ACTIVITIES         0.597115     1.337251    0.446524
ACTIVITIES^2       0.033446     0.050154    0.666867
Heteroskedasticity EX2 (Xr19-10)
• Internet = β0 + β1(AGE) + β2(INCOME) + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.445484192
R Square           0.198456165
Adjusted R Square  0.190318665
Standard Error     4.190258376
Observations       200

ANOVA
            df    SS            MS            F
Regression    2   856.4167439   428.2083719   24.38785
Residual    197   3458.978256   17.55826526
Total       199   4315.395

           Coefficients   Standard Error  t Stat         P-value
Intercept  13.02717915    2.542151536     5.124469946    7.09E-07
Age        -0.279139752   0.044842815     -6.224848997   2.85E-09
Income     0.093837098    0.031237341     3.004004037    0.00301
Heteroskedasticity EX2

[Figure: residuals plotted against the predicted Internet use.]
White Test EX2
33.75 > 11.07 (chi-square, 95% confidence, d.f. = 5) => heteroskedasticity is present.

White Heteroskedasticity Test:
F-statistic     7.876243    Probability  0.000001
Obs*R-squared   33.7484     Probability  0.000003

Test Equation:
Dependent Variable: RESID^2
Method: Least Squares
Date: 09/10/03  Time: 15:38
Sample: 1 200
Included observations: 200

Variable     Coefficient  Std. Error  t-Statistic  Prob.
C            -67.6927     96.59632    -0.70078     0.4843
AGE          4.211074     3.113861    1.352364     0.1778
AGE^2        -0.03078     0.03072     -1.00181     0.3177
AGE*INCOME   -0.049       0.028515    -1.7185      0.0873
INCOME       0.10313      1.91587     0.053829     0.9571
INCOME^2     0.021946     0.012718    1.725546     0.086
Non-normality of the Error Term
• If the assumption that ε is normally distributed is called into question, we cannot use any of the t-tests, F-tests, or the R-square, because these tests are based on the assumption that ε is normally distributed. The results of these tests become meaningless.
Non-normality of the Error Term
• Indications of a non-normal distribution:
• The histogram of the residuals does not look normal.
• Formal test: the Jarque-Bera test.

JB = (n/6) * [Skew^2 + (Kurt - 3)^2 / 4]
Non-normality of the Error Term
• If the JB stat is smaller than a chi-square with 2 degrees of freedom (5.99 at the 95% confidence level), then you can relax: your error term follows a normal distribution.
• If not, get more data or transform the dependent variable: log(Y), Y^2, square root of Y, 1/Y, etc.
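A minimal sketch of the JB computation in Python follows; the residuals here are stand-ins for the residuals of an actual regression.

```python
# Minimal sketch: Jarque-Bera test on residuals (statsmodels).
import numpy as np
from statsmodels.stats.stattools import jarque_bera

# Stand-in residuals; in practice pass fit.resid from your regression.
rng = np.random.default_rng(0)
resid = rng.standard_normal(200)

jb, p_value, skew, kurt = jarque_bera(resid)
# JB = (n/6) * (skew**2 + (kurt - 3)**2 / 4); compare with 5.99 (chi-square, 2 d.f.).
print(jb, p_value)
```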
Jarque-Bera Test for Non-normality of the Error Term
• JB stat = 45.88 > 5.99 (chi-square, 95% confidence, d.f. = 2) => the error term is not normally distributed.

[Figure: histogram of the residuals.]

Series: Residuals
Sample: 1 200
Observations: 200

Mean          2.04E-16
Median        1.667127
Maximum       5.349932
Minimum       -11.31068
Std. Dev.     4.169149
Skewness      -1.169375
Kurtosis      3.190836
Jarque-Bera   45.88472
Probability   0.000000
Additional Fixes for Normality
• Bootstrapping – ask your advisor if this approach is right for you (a sketch follows below).
• Appeal to large-sample (asymptotic) theory: if your sample is large enough, the t- and F-tests are approximately valid.
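As one common variant, here is a minimal sketch of a pairs bootstrap for a slope coefficient; data.csv and the columns y, x are placeholders, and whether this approach is appropriate depends on your setting.

```python
# Minimal sketch: pairs bootstrap for a regression slope (illustrative only).
# "data.csv" and the column names y, x are placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")
rng = np.random.default_rng(0)

slopes = []
for _ in range(1000):
    boot = df.sample(n=len(df), replace=True, random_state=rng)  # resample rows
    slopes.append(smf.ols("y ~ x", data=boot).fit().params["x"])

# Percentile interval that does not lean on normality of the error term.
print(np.percentile(slopes, [2.5, 97.5]))
```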
SERIAL CORRELATION or AUTOCORRELATION
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.

[Figure: two plots of residuals over time. Left: note the runs of positive residuals, replaced by runs of negative residuals. Right: note the oscillating behavior of the residuals around zero.]
Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).

[Figure: residuals over time showing positive first order autocorrelation.]

Negative first order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).

[Figure: residuals over time showing negative first order autocorrelation.]
Autocorrelation or Serial Correlation
• The Durbin-Watson Test
– This test detects first order auto-correlation between consecutive residuals in a time series.
– If autocorrelation exists, the error variables are not independent.

d = Σ(i=2..n) (ei - ei-1)^2 / Σ(i=1..n) ei^2

The range of d is 0 ≤ d ≤ 4.
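The statistic is straightforward to compute; a minimal sketch with stand-in residuals follows.

```python
# Minimal sketch: the Durbin-Watson statistic from a series of residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Stand-in residuals, in time order; in practice use your regression residuals.
rng = np.random.default_rng(0)
e = rng.standard_normal(20)

d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # sum of (e_i - e_{i-1})^2 over sum of e_i^2
print(d, durbin_watson(e))                     # the two agree; 0 <= d <= 4
```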
• One-tail test for positive first order auto-correlation
– If d < dL, there is enough evidence to show that positive first-order correlation exists.
– If d > dU, there is not enough evidence to show that positive first-order correlation exists.
– If d is between dL and dU, the test is inconclusive.
• One-tail test for negative first order auto-correlation
– If d > 4-dL, negative first order correlation exists.
– If d < 4-dU, negative first order correlation does not exist.
– If d falls between 4-dU and 4-dL, the test is inconclusive.
• Two-tail test for first order auto-correlation
– If d < dL or d > 4-dL, first order auto-correlation exists.
– If d falls between dL and dU, or between 4-dU and 4-dL, the test is inconclusive.
– If d falls between dU and 4-dU, there is no evidence of first order auto-correlation.

[Diagram: the d scale from 0 to 4. First order correlation exists for d < dL or d > 4-dL; the test is inconclusive between dL and dU and between 4-dU and 4-dL; first order correlation does not exist between dU and 4-dU.]
• Example
– How does the weather affect the sales of lift tickets at a ski resort?
– Data on the past 20 years of ticket sales, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
– The model hypothesized was
TICKETS = β0 + β1(SNOWFALL) + β2(TEMPERATURE) + ε
– Regression analysis yielded the following results:
SUMMARY OUTPUT

The model seems to be very poor:
• The fit is very low (R-square = 0.12).
• It is not valid (Significance F = 0.33).
• No variable is linearly related to Sales.

Regression Statistics
Multiple R         0.3464529
R Square           0.1200296
Adjusted R Square  0.0165037
Standard Error     1711.6764
Observations       20

ANOVA
            df   SS          MS          F        Significance F
Regression   2   6793798.2   3396899.1   1.1594   0.3372706
Residual    17   49807214    2929836.1
Total       19   56601012

             Coefficients  Standard Error  t Stat      P-value  Lower 95%  Upper 95%
Intercept    8308.0114     903.7285        9.1930391   5E-08    6401.3083  10214.715
Snowfall     74.593249     51.574829       1.4463111   0.1663   -34.22028  183.40678
Temperature  -8.753738     19.704359       -0.444254   0.6625   -50.32636  32.818884

Diagnosis of the required conditions resulted in the following findings:
[Figure: three diagnostic plots. Residual vs. predicted: the error variance is constant. Histogram of the errors: the errors may be normally distributed. Residuals over time: the errors are not independent.]
Test for positive first order auto-correlation:
n = 20, k = 2. From the Durbin-Watson table we have dL = 1.10, dU = 1.54. The statistic d = 0.59.
Conclusion: Because d < dL, there is sufficient evidence to infer that positive first order auto-correlation exists.

Using the computer - Excel
Tools > Data Analysis > Regression (check the residual option and then OK)
Tools > Data Analysis Plus > Durbin-Watson Statistic > highlight the range of the residuals from the regression run > OK

Durbin-Watson Statistic: d = 0.5931
The residuals: -2793.99, -1723.23, -2342.03, -956.955, -1963.73, ...

[Figure: plot of the residuals over time.]
MANUAL D-W CALCULATION
RESIDUAL OUTPUT

Observation  Predicted Tickets  Residual e(i)  e(i)-e(i-1)  (e(i)-e(i-1))^2  (e(i))^2
 1           9628.992039        -2793.992039                                 7806391.5
 2           9593.231176        -1723.231176   1070.761     1146528.8        2969525.7
 3           8515.029601        -2342.029601   -618.798     382911.49        5485102.7
 4           8935.954928        -956.9549279   1385.075     1918431.9        915762.73
 5           9602.730825        -1963.730825   -1006.78     1013597.7        3856238.8
 6           8632.272672        -1465.272672   498.4582     248460.53        2147024
 7           9533.073879        -1439.073879   26.19879     686.37677        2070933.6
 8           9488.932234        414.0677664    1853.142     3434134          171452.12
 9           9423.092722        364.907278     -49.1605     2416.7536        133157.32
10           9659.816599        -102.8165988   -467.724     218765.62        10571.253
11           9734.036892        49.96310771    152.7797     23341.639        2496.3121
12           10251.62629        1823.373709    1773.411     3144985.2        3324691.7
13           8207.902866        920.0971341    -903.277     815908.57        846578.74
14           9707.402723        -660.402723    -1580.5      2497979.8        436131.76
15           8115.429185        2515.570815    3175.974     10086808         6328096.5
16           10080.74193        2482.258074    -33.3127     1109.7387        6161605.1
17           9668.943293        1343.056707    -1139.2      1297779.8        1803801.3
18           8672.596882        1368.403118    25.34641     642.44054        1872527.1
19           9594.350043        334.6499568    -1033.75     1068645.6        111990.59
20           9259.843227        1831.156773    1496.507     2239532.7        3353135.1

SUM: Σ(e(i)-e(i-1))^2 = 29542666; Σ(e(i))^2 = 49807214
D-W: d = 29542666 / 49807214 = 0.5931403
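The manual table above can be replicated in a few lines; the residuals below are the 20 values from the table, rounded to two decimals.

```python
# Minimal sketch: replicating the manual D-W computation with numpy.
import numpy as np

# The 20 residuals from the ski-ticket regression, in time order (rounded).
e = np.array([-2793.99, -1723.23, -2342.03, -956.95, -1963.73,
              -1465.27, -1439.07, 414.07, 364.91, -102.82,
              49.96, 1823.37, 920.10, -660.40, 2515.57,
              2482.26, 1343.06, 1368.40, 334.65, 1831.16])

d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d, 4))   # ~0.5931, matching the table above
```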
The Modified Regression Model
The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem:
TICKETS = β0 + β1(SNOWFALL) + β2(TEMPERATURE) + β3(YEARS) + ε
• All the required conditions are met for this model.
• The fit of this model is high: R^2 = 0.74.
• The model is useful: Significance F = 5.93E-05.
• SNOWFALL and YEARS are linearly related to ticket sales.
• TEMPERATURE is not linearly related to ticket sales.
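A minimal sketch of this fix, assuming a hypothetical lift_tickets.csv with the columns named below:

```python
# Minimal sketch: adding a time-trend variable to the ski-ticket model.
# "lift_tickets.csv" and the column names are placeholders.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("lift_tickets.csv")
df["YEARS"] = range(1, len(df) + 1)    # 1, 2, ..., n in time order

fit = smf.ols("TICKETS ~ SNOWFALL + TEMPERATURE + YEARS", data=df).fit()
print(fit.rsquared, fit.f_pvalue)      # fit and overall significance
print(durbin_watson(fit.resid))        # re-check d after adding the trend
```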
Multicollinearity
• When two or more X's are correlated you have multicollinearity.
• Symptoms of multicollinearity include insignificant t-stats (due to inflated standard errors of the coefficients) despite a good R-square.
• Test: Run a correlation matrix of all X variables.
• Fix: More data, or combine variables.
Multicollinearity Ex: Xm 19-02

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.748329963
R Square           0.559997733
Adjusted R Square  0.546247662
Standard Error     25022.70761
Observations       100

ANOVA
            df   SS            MS         F         Significance F
Regression   3   76501718347   2.55E+10   40.7269   4.57E-17
Residual    96   60109046053   6.26E+08
Total       99   1.36611E+11

            Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%
Intercept   37717.59451   14176.74195     2.660526   0.009145  9576.963   65858.23
Bedrooms    2306.080821   6994.19244      0.329714   0.742335  -11577.3   16189.45
House Size  74.29680602   52.97857934     1.402393   0.164023  -30.8649   179.4585
Lot Size    -4.36378288   17.0240013      -0.25633   0.798244  -38.1562   29.42862
Multicollinearity Ex: Xm 19-02
• Diagnostic correlation matrix
• A correlation coefficient over 0.5 => problem with multicollinearity.

            Bedrooms     House Size  Lot Size
Bedrooms    1
House Size  0.846453504  1
Lot Size    0.837429579  0.993615    1
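A diagnostic matrix like the one above can be produced in one line with pandas; the file name is a placeholder for the Xm 19-02 data.

```python
# Minimal sketch: correlation matrix of the explanatory variables (pandas).
# "houses.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("houses.csv")
print(df[["Bedrooms", "House Size", "Lot Size"]].corr())
# Off-diagonal values above ~0.5 flag a potential multicollinearity problem.
```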