Lecture 8 – Normal Correlation Models

Chapter 3: Diagnostics
One cannot be certain, before performing a linear regression, that the technique is
appropriate. We need to check whether the model is suitable for the given data. That
means we need to check the assumptions associated with linear regression before we
make inferences based on our results.
Simple Diagnostics for X and Y – Graphing
Graphing assists in determining the range and checking for outliers.
1. Stem and leaf plot
2. Histograms
3. Box Plots
Diagnostics for Residuals
Since the Y values are a function of the level of the X variable, diagnostic plots on Y are
typically not useful. So instead diagnostics for Y are often carried out by looking at the
residuals. A box plot of residuals shows any outliers based on the IQR method. Another
way of identifying residual outliers is by using Semi-Studentized residuals.
Semistudentized (or Standardized) Residuals: The term "standardized" is technically not
correct because √MSE is only an estimate of the standard deviation of the residuals, so
the term semistudentized is used instead. MTB does not provide an option for this
calculation, so you need to divide the residuals by S (the square root of MSE) in the output:

e* = e / √MSE

and the general rule is that observations with |e*| ≥ 4 are considered outliers.
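Outside Minitab, the same calculation is easy to sketch. The following is a minimal illustration in Python (assuming numpy is available); the small Sales/Adv toy data set matches the worked example later in these notes:

```python
import numpy as np

# Toy data (the Sales/Adv example used later in these notes)
x = np.array([1, 2, 2, 3, 4, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 2, 3, 4], dtype=float)

b1, b0 = np.polyfit(x, y, 1)        # slope and intercept by least squares
resid = y - (b0 + b1 * x)

n = len(y)
mse = np.sum(resid**2) / (n - 2)    # MSE with n - 2 df for simple regression
s = np.sqrt(mse)                    # this is "S" in the Minitab output

semi = resid / s                    # semistudentized residuals e* = e / sqrt(MSE)
outliers = np.abs(semi) >= 4        # rule of thumb from the notes
```

With this data no residual comes close to the |e*| ≥ 4 cutoff, which is typical for small, well-behaved samples.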
Model Departures Studied by Residuals
1. The regression function is not linear
2. The error terms do not have constant variance (heteroscedasticity)
3. The error terms are not independent
4. The model fits all but one or a few outlier observations
5. The error terms are not normally distributed
6. One or more important predictor variables have been omitted from the model
Residual Diagnostics to Study Departures
A. Plot of residuals against predictor variable(s) [1, 4]
B. Plot of residuals against fitted values (same as A when simple linear model) [1,4]
C. Plot of absolute or squared residuals against predictor variable(s) [2]
D. Plot of residuals against time [3]
E. Plots of residuals against omitted variables [6]
F. Box Plot, dot plots, etc. of residuals [4,5]
G. Normal probability plot of residuals. [5]
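Several of the plots above (A/B, C, and G) can be sketched outside Minitab as well. The following is one possible Python version, assuming numpy, scipy, and matplotlib are available; the data and the output file name are purely illustrative:

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")                 # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data for illustration
x = np.array([1, 2, 2, 3, 4, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 2, 3, 4], dtype=float)
b1, b0 = np.polyfit(x, y, 1)
fits = b0 + b1 * x
resid = y - fits

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(x, resid)             # A: residuals vs predictor
axes[0].axhline(0, ls="--")
axes[0].set(xlabel="X", ylabel="Residual", title="Residuals vs X")
axes[1].scatter(fits, np.abs(resid))  # C: absolute residuals vs fitted values
axes[1].set(xlabel="Fitted value", ylabel="|Residual|", title="Abs residuals vs fits")
stats.probplot(resid, plot=axes[2])   # G: normal probability plot
out_file = "residual_diagnostics.png"
fig.savefig(out_file)
```

A fanning pattern in the first two panels suggests non-constant variance; strong curvature in the probability plot suggests non-normal errors.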
Tests - NOTE: the tests in 2 below may differ from what is done in software packages
1. Normality – Stat > Basic Stat > Normality Test or Graph > Probability Plot for A-D
a. Kolmogorov-Smirnov
b. Anderson-Darling (MINITAB)
Hypotheses:
Ho: Data follow a normal distribution Ha: not normal
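The same two normality tests can be run outside Minitab; here is a sketch using scipy (assumed available). The residuals here are simulated stand-ins, not data from these notes. Note that the standard Kolmogorov-Smirnov test assumes the normal parameters are known; estimating them from the data makes the test conservative (the Lilliefors variant corrects for this):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid = rng.normal(0, 1, 50)          # stand-in residuals for illustration

# Anderson-Darling (the default normality test in Minitab)
ad = stats.anderson(resid, dist="norm")
normal_ok = ad.statistic < ad.critical_values[2]   # compare at the 5% level

# Kolmogorov-Smirnov against a normal with estimated mean and sd
ks_stat, ks_p = stats.kstest(resid, "norm",
                             args=(resid.mean(), resid.std(ddof=1)))
```

In both cases a small p-value (or a statistic above the critical value) rejects Ho: "Data follow a normal distribution."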
2. Constancy of Error – MTB has no automatic feature for any of these:
a. White's Test - Regress Y on predictor(s) and store residuals and fits. Get
squared terms of both. Regress squared residuals on both fits and squared fits. Use the
F-test in the ANOVA output to test Ho: All slopes are zero.
b. Breusch- Pagan – Regress Y on predictor(s) and store residuals. Get squared
residuals. Regress squared residuals on the predictor(s). Use the F-test in the ANOVA
output to test Ho: All slopes are zero. If more than one predictor and F-test is significant
you can use the individual t-tests of this regression to identify which predictor(s) indicate
possible non-constant variance
c. Modified Levene Test – split data by making a Group variable then Stat >
Basic Statistics > ANOVA > Test of Equal Variances and test Ho: Error variance is
constant
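Since MTB has no automatic feature for these, the auxiliary-regression recipes in (a) and (b) can be coded directly. The following is a sketch in Python (numpy and scipy assumed available) that implements the F-test versions described above; `anova_f_pvalue` is a hypothetical helper name, and the data is the Sales/Adv toy example from later in these notes:

```python
import numpy as np
from scipy import stats

def anova_f_pvalue(x_cols, y):
    """Overall F-test (Ho: all slopes zero) for OLS of y on the columns of x_cols."""
    X = np.column_stack([np.ones(len(y))] + list(x_cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    sse = np.sum((y - fitted) ** 2)
    ssr = np.sum((fitted - y.mean()) ** 2)
    p = X.shape[1] - 1                       # number of slopes
    f_star = (ssr / p) / (sse / (len(y) - p - 1))
    return f_star, stats.f.sf(f_star, p, len(y) - p - 1)

# Toy data for illustration
x = np.array([1, 2, 2, 3, 4, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 2, 3, 4], dtype=float)
b1, b0 = np.polyfit(x, y, 1)
fits = b0 + b1 * x
resid = y - fits

# Breusch-Pagan: regress squared residuals on the predictor(s)
f_bp, p_bp = anova_f_pvalue([x], resid ** 2)

# White's test: regress squared residuals on the fits and squared fits
f_w, p_w = anova_f_pvalue([fits, fits ** 2], resid ** 2)
```

A significant F-test in either auxiliary regression indicates possible non-constant error variance.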
3. F Test Lack of Fit – under "Options" tab in Regression
a) Pure Error and
b) Data Subsetting
Pure Error assumes the Y values are independent, normally distributed, and have constant
variance. You need replicates for some values of X, but not necessarily all. If there are
no replicates, use the Data Subsetting option.
F* = MSLF/MSPE where MSLF = SSLF/(c - 2) and MSPE = SSPE/(n - c)

SSLF = Σ_j Σ_i ( Ȳ_j − Ŷ_ij )²  with DF = c − 2   >>> Lack of Fit Sum of Squares
SSPE = Σ_j Σ_i ( Y_ij − Ȳ_j )²  with DF = n − c   >>> Pure Error Sum of Squares

where C is the number of unique values of X and j runs from 1 to C.
The basic premise of the Lack of Fit test is that if the linear regression function is
appropriate, then the group means Ȳ_j will be near the fitted values Ŷ_j calculated from
the linear regression model. Therefore, Ho is “The regression function is linear” versus
Ha: “The regression function is not linear”. That means Ha includes all possible
regression functions that are non-linear.
Example:

Sales  Adv  FITS1  RESI1
  1     1    0.81   0.19
  1     2    1.48  -0.48
  2     2    1.48   0.52
  2     3    2.14  -0.14
  2     4    2.81  -0.81
  3     4    2.81   0.19
  4     5    3.48   0.52

Group means of Y at each level of X:

X    1    2    3    4    5
Ȳ    1   1.5   2   2.5   4

C = 5 and N = 7
Regression Analysis: Sales versus Adv

Predictor      Coef  SE Coef     T      P
Constant     0.1429   0.5216  0.27  0.795
Adv          0.6667   0.1594  4.18  0.009

S = 0.552052   R-Sq = 77.8%   R-Sq(adj) = 73.3%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  5.3333  5.3333  17.50  0.009
Residual Error   5  1.5238  0.3048
  Lack of Fit    3  0.5238  0.1746   0.35  0.798
  Pure Error     2  1.0000  0.5000
Total            6  6.8571

3 rows with no replicates
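The lack-of-fit numbers in this output can be reproduced directly from the formulas above. Here is a sketch in Python (numpy and scipy assumed available) using the example data:

```python
import numpy as np
from scipy import stats

# Data from the example above
adv   = np.array([1, 2, 2, 3, 4, 4, 5], dtype=float)
sales = np.array([1, 1, 2, 2, 2, 3, 4], dtype=float)

b1, b0 = np.polyfit(adv, sales, 1)
resid = sales - (b0 + b1 * adv)
sse = np.sum(resid ** 2)

# Pure error: squared deviations of Y around its group mean at each X level
# (levels with no replicates contribute zero)
sspe = sum(np.sum((sales[adv == xv] - sales[adv == xv].mean()) ** 2)
           for xv in np.unique(adv))
sslf = sse - sspe                     # SSE decomposes as SSLF + SSPE

n, c = len(sales), len(np.unique(adv))
mslf = sslf / (c - 2)
mspe = sspe / (n - c)
f_star = mslf / mspe
p_value = stats.f.sf(f_star, c - 2, n - c)
```

This recovers SSLF = 0.5238, SSPE = 1.0000, F* = 0.35, and p ≈ 0.798, matching the Minitab output; the large p-value means we do not reject Ho that the regression function is linear.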
Transformations
1. Non-linear Relation – look for the best transformation on the Xs. We use the Xs in this
case, especially when the residuals are assumed to follow a normal distribution and have
constant error variance. Transforming the Ys in such a situation could change the
assumptions regarding the residuals, since the residuals are dependent on the Ys.
2. Nonnormality or Unequal Error Variances – Start with transformations on the Ys, but
keep in mind that similar transformations on X may need to be performed to help keep
the integrity of the linear relationship
A common transformation approach is to use the Box-Cox method. In Minitab, go to
Stat > Control Charts > Box-Cox Transformation. Select “All the observations for a
chart in one column” select the variable and for group size enter the number of
observations. The resulting graph will provide the best lambda transformation value.
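The same Box-Cox estimate can be obtained outside Minitab. Here is a minimal sketch with scipy (assumed available); the right-skewed response below is purely hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed response; Box-Cox requires strictly positive data
y = np.exp(np.linspace(0.1, 2.0, 30))

y_transformed, lam = stats.boxcox(y)   # maximum-likelihood estimate of lambda

# In practice, round lambda to a convenient value before refitting:
# e.g. 0 -> log, 0.5 -> square root, -1 -> reciprocal
```

The transformed response should be noticeably more symmetric than the original, which is the point of the transformation.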
NOTE: if you transform a variable(s) then the regression must be performed using these
transformed values and the diagnostics need to be re-evaluated. Just as important, any
reporting must be based on the transformed variable and not the original variable. For
instance, if the X variable is transformed using a log, then the reporting of a significant
relationship between X and Y must be stated as a significant relationship between logX
and Y.
Chapter 3 – Lab
1. Create boxplots of Y and X
2. Check for duplicates of X by Stat > Tables > Tally Individual Variables
3. Regress Y on X and store residuals and fits
4. Make scatterplot of residuals vs X
5. Calculate semi-studentized residuals. Go to Calc > Calculator and type Semi in 'Store result'
and in the expression window create expression that takes the residuals divided by the value of S
in the output. Check if any of these new 'semi' values are outside -4 to 4
6. Use Graph > Probability Plot to check normality assumption of residuals
7. Conduct the three tests of constant variance.
1. Modified Levene:
a. Create 'group' by Calc > Calculator. Type Group in 'Store result' and in the
expression window enter c2 <= medi(c2).
b. Go to Stat > ANOVA > Test Equal Variances. Enter the residuals in the Response
and Group in the Factor
2. Breusch-Pagan:
a. Create 'sqres' by Calc > Calculator. Type sqres in 'Store result' and in the expression
window enter an expression that squares the residuals
b. Regress sqres on X. Check F-test of ANOVA output
3. White's Test:
a. Create 'sqres' by Calc > Calculator. Type sqres in 'Store result' and in the expression
window enter an expression that squares the residuals
b. Create 'sqfits' by Calc > Calculator. Type sqfits in 'Store result' and in the expression
window enter an expression that squares the fits
c. Regress sqres on fits and sqfits (i.e. enter both fits and sqfits together in the Predictor
window). Check the F-test in the ANOVA output
8. Check Lack of Fit using both Pure Error and Data Subsetting. In the Regression window click
Options and check the boxes for both under Lack of Fit
9. Check using Box-Cox method to see if a transformation of Y is needed.
- Stat > Control Charts > Box Cox Transformations
- From the dropdown box select “All observations for a chart are in one column” (should
be default) and in the window below that enter the variable of interest, Y.
- For subgroup size enter the size of the sample, 10.
- The resulting graph provides a legend with the best estimate and rounded value. Use
the rounded value.