Lecture outline AGR 206
Revised: 5/6/2017
Chapter 8. Regression Diagnostics
8:1 MLR Assumptions.
1. Normality of errors.
2. Homogeneity of variance of errors.
3. Independence of errors.
4. Linearity.
Although they are not really assumptions, two additional problem areas are listed here because they are
usually diagnosed and dealt with at the same stage as the assumptions.
5. Outliers.
6. Collinearity.
We present this topic using the Spartina dataset as an example. Only five X variables are considered in the
example: salinity, pH, K, Na, and Zn. This example is presented in detail in the RP book, pp. 377-396.
In MLR most of the diagnostics are based on the residuals. Although we have many predictors, we still have
only one residual on the Y dimension. Therefore, all the methods used for simple linear regression apply to MLR
without change, except that now there are more X's against which to plot the residuals, and we need to consider the concept
of partial regression plots. Partial regression plots show the relationship between Y and each X after correcting
for the effects of all other X’s.
Most of the assumptions that need to be checked refer to the errors. The first step is to obtain the
studentized residuals and inspect them. Residuals are plotted against the predicted Y and each predictor to look
for patterns that might suggest non-linearity or outliers.
8:1.1 Obtaining diagnostic statistics.
Open the Spartina JMP table and regress biomass on salinity, pH, K, Na, and Zn using the Fit Model
platform (Figure 1). Then save the studentized residuals, hats, and Cook’s D using the Save Columns command
in the drop-down menu from the red triangle in the upper left corner of the output window. Using the same drop-down menu, select Row Diagnostics and then Durbin-Watson Test and PRESS.
The first output window shows the model with all variables (Figure 2). The only estimated regression
coefficient that is significantly different from zero after all other variables are considered is the one for pH. The
rest of the diagnostic statistics can be obtained from this window. Right-click on the Parameter Estimates to add
the VIF column.
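The same quantities can be reproduced outside JMP. The sketch below uses Python with statsmodels and assumes, purely for illustration, that the Spartina data are in a hypothetical file named spartina.csv with columns bmss, sal, pH, K, Na, and Zn; it is not the JMP procedure, just a rough equivalent of the saved columns and row diagnostics.

# A minimal sketch of the same calculations in Python/statsmodels, assuming a
# hypothetical file "spartina.csv" with columns bmss, sal, pH, K, Na, and Zn.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("spartina.csv")
X = sm.add_constant(df[["sal", "pH", "K", "Na", "Zn"]])
fit = sm.OLS(df["bmss"], X).fit()
print(fit.summary())                                  # parameter estimates, t tests, R2

infl = OLSInfluence(fit)
df["stud_resid"] = infl.resid_studentized_internal    # studentized residuals
df["hat"] = infl.hat_matrix_diag                      # leverages (hats)
df["cooks_d"] = infl.cooks_distance[0]                # Cook's D per observation
print("Durbin-Watson:", durbin_watson(fit.resid))     # first-order autocorrelation check
print("PRESS:", (infl.resid_press ** 2).sum())        # prediction sum of squares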
[Figure 1: Fit Model launch for the Spartina regression. Figure 2: output for the model with all five predictors.]
8:1.2 Normality.
Normality is checked exactly in the same manner as for simple linear regression. Save the residuals and
then use the Analyze, Distribution platform, select Fit Distribution Normal, and finally select Goodness of Fit.
A significant value of the Shapiro-Wilk statistic would indicate that the data are not normally distributed.
In the Spartina example normality is rejected, because the distribution shows a concentration of values on
the left side and a few very large values. This will require a remedial measure such as a transformation. Before
deciding on a remedial action, the rest of the diagnostics should be checked. Frequently, a transformation or
other remedial measures will correct more than one problem at once.
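For illustration, the same check can be sketched with scipy's Shapiro-Wilk test applied to the residuals from the Python sketch above:

# Shapiro-Wilk test of normality on the residuals (continuing the sketch above).
from scipy import stats

W, p = stats.shapiro(fit.resid)
print(f"Shapiro-Wilk W = {W:.3f}, p = {p:.4f}")
# A small p-value (for example p < 0.05) rejects normality of the errors,
# as happens for the untransformed Spartina biomass.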
[Figure 3]
8:1.3 Homogeneity of variance.
Homogeneity of variance should be assessed visually by plotting residuals against the predicted values and
each of the predictors. This can be done in one step by using the Fit Y by X platform as shown in Figure 4. No
obvious trends in the spread of the errors are seen, except that there are five observations that have very high
biomass and pH. One of these observations has a very low residual. These observations will be flagged when
the hats and the studentized residuals are considered. Observations in the scatter plot can be easily identified
by moving the cursor over them. When the cursor is over an observation, its row number is displayed. It is also
possible to label the observations by selecting Row Label and Column Label, or one can simply click on the
observation on the plot and it will become selected in the data table. Several observations can be selected at
once by Shift-clicking.
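Roughly the same panel of plots can be sketched with matplotlib (continuing the hypothetical Python example):

# Studentized residuals against the predicted values and against each predictor.
import matplotlib.pyplot as plt

panels = ["predicted", "sal", "pH", "K", "Na", "Zn"]
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, name in zip(axes.ravel(), panels):
    x = fit.fittedvalues if name == "predicted" else df[name]
    ax.scatter(x, df["stud_resid"], s=15)
    ax.axhline(0, color="grey", linewidth=1)
    ax.set_xlabel(name)
    ax.set_ylabel("studentized residual")
plt.tight_layout()
plt.show()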
Because the data set has predetermined groups, homogeneity of variance among groups can be studied by
plotting the standard deviation vs. the average biomass for each group. Homogeneity of variance among groups
can also be tested with the Fit Y by X platform. In order to build the plot of standard deviation against average
biomass for each of the 9 location-by-type groups, use the Table Summary command (Figure 5). Select typ and loc as the grouping variables and in the lower box enter Mean(bmss) and Std Dev(bmss). A new table is created
that depends on the original table, and contains the summary by groups. While the new summary table is the
active window, select Fit Y by X and plot and regress standard deviation vs. average biomass (Figure 6). The
figure shows a clear increase in variance as biomass increases, typical of experimental data. This lack of
homogeneity of variance is corroborated by the tests performed in each of the location and type groups (Figure
7). The first choice of remedial action is to apply a transformation on the response variable. A logarithmic
transformation is suggested by the shape of the distribution and by the increasing variance with increasing Yhat.
The Levene test for homogeneity of variance is based on the F value for the group effect when the variable Zij = |Yij - Ybar.j| is analyzed by ANOVA with group as the effect. The variable Zij is the absolute value of the deviation of each observation from the mean of its group.
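A rough equivalent of the group summary and of a Levene-type test in Python, continuing the sketch and assuming the grouping columns are named loc and typ in the hypothetical file:

# Mean and SD of biomass per location-by-type group, plus a Levene test based
# on Zij = |Yij - Ybar.j| (center="mean" matches that definition).
from scipy import stats

groups = df.groupby(["loc", "typ"])["bmss"]
print(groups.agg(["mean", "std"]))                 # table of group means and SDs

samples = [g.values for _, g in groups]
F, p = stats.levene(*samples, center="mean")
print(f"Levene F = {F:.2f}, p = {p:.4f}")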
[Figure 4: residuals plotted against the predicted values and each predictor. Figure 5: Summary dialog for the group means and standard deviations. Figure 6: standard deviation vs. average biomass for the location-by-type groups. Figure 7: tests of unequal variances within each location-by-type group.]
8:1.4 Independence of errors.
Independence of errors has to be tested if there is information on the spatial or temporal order of samples.
The Durbin-Watson statistic requested before will print the first order autocorrelation among errors, assuming
that samples are ordered by time and that the intervals between observations are constant. The first order
autocorrelation is the correlation between error of row i and error of row i-1. When errors are independent, this
correlation should not differ from zero. Using the red triangle to the left of the Durbin-Watson title, one can
obtain the probability level for the correlation. If the Durbin-Watson statistic is significant, it indicates that errors
are correlated and require remedial action. The first action is to try to identify a new X variable that is causing
the autocorrelation and include it in the model. Alternatively, time series analysis should be performed. Usually,
the consequence of ignoring a correlation among errors is that the apparent precision of the model is inflated, thus giving
the impression that the model fits better than it actually does.
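The Durbin-Watson statistic requested in the first sketch is tied to the lag-1 autocorrelation of the residuals, roughly DW ≈ 2(1 - r1). A small illustration, assuming the rows are in sampling order:

# Lag-1 autocorrelation of the residuals and the Durbin-Watson statistic.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = fit.resid.values
r1 = np.corrcoef(e[:-1], e[1:])[0, 1]
print(f"lag-1 autocorrelation r1 = {r1:.3f}, DW = {durbin_watson(e):.3f}")
# DW near 2 is consistent with independent errors; values well below 2 suggest
# positive autocorrelation and call for remedial action.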
8:1.5 Linearity.
In order to assess linearity, JMP automatically performs a lack-of-fit test if there are replicate observations for the same values of the X's. Replicates are usually not available in multiple linear regression with observational data. When replicates are not available, in simple linear regression the relationship between Y and X
can be explored visually by plotting Y vs. X and errors vs. X. MLR requires that we consider the partial
regression plots for the same purpose.
The term partial refers to the fact that the plots show the relationship between Y and each X after the effects
of all other X’s have been removed. Analogously, partial correlation is the linear correlation between two
variables controlling (statistically) for the effects of all other variables. The partial regression plot between any X
(say pH) and bmss is obtained as follows:
1. Regress bmss on all variables except pH. Save the residuals and call them eY|other X's.
2. Regress pH on all other X's (sal, Na, K, Zn). Save the residuals and call them epH|other X's.
3. Use Fit Y by X to plot eY|other X's vs. epH|other X's.
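The three steps can be sketched in Python as below (statsmodels also offers a built-in partial regression plot, but the manual construction mirrors the recipe above):

# Partial regression plot for pH, following steps 1-3 above.
import matplotlib.pyplot as plt
import statsmodels.api as sm

Xo = sm.add_constant(df[["sal", "K", "Na", "Zn"]])   # all X's except pH

e_y = sm.OLS(df["bmss"], Xo).fit().resid             # step 1: eY|other X's
e_ph = sm.OLS(df["pH"], Xo).fit().resid              # step 2: epH|other X's

plt.scatter(e_ph, e_y, s=15)                         # step 3
plt.xlabel("e(pH | other X's)")
plt.ylabel("e(bmss | other X's)")
plt.show()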
[Figure 8: partial regression plot for pH.]
Figure 8 shows the partial regression plot for pH. The scatter plot is inspected to look for non-linearity,
outliers, etc. In this case, the relationship appears to be linear, and there are no major problems with the scatter,
except perhaps that the distribution of the deviations around the line seems to be asymmetric. This will probably
be taken care of by a transformation. Inspection of the rest of the partial plots is left as an exercise. They should
not show any grave lack of linearity, although the plots will show some problematic observations which will be
fully explored as outliers.
8:1.6 Outliers.
There are two types of outliers, just as in the simple regression situation: outliers in the Y dimension and outliers in the X dimension. In MLR the X dimension actually encompasses several dimensions, but the concept is the same. Outliers in the Y dimension are identified by studying the residuals. Outliers in the X dimension are identified by studying the hats. Observations can be outliers in either or both types, but they
may not be influential. Non-influential outliers are usually not a concern. To determine if an outlier is influential
we study the Cook’s D statistic for each observation.
In MLR we have only one set of random variables that are assumed to have identical properties, the errors.
Therefore, it is not necessary to study multivariate outliers in the strict sense. When we study true multivariate
statistics, such as in MANOVA, it will be necessary to use the Mahalanobis distance as described in Lecture 4.
8:1.6.1 Outliers in the Y dimension.
Using the Table … Sort command (Figure 9), sort the table in decreasing order of studentized residuals (str).
The sorting is just to facilitate the location of outliers, but it is not necessary. The str’s should be distributed like
a t distribution. We can calculate the P-value for each str by creating a new column with the following formula:
T Distribution[Studentized Resid bmss, 39]
[Figure 9: Sort dialog for the studentized residuals.]
This formula (Figures 10 and 11) gives the probability of obtaining a residual of the observed value, using a t
distribution with 39 degrees of freedom, which is the dfe for the model with all variables. There are 5
observations with probability values greater than 0.975 or smaller than 0.025. The number expected by chance is 45*0.05 = 2.25, so there are more outliers than expected by chance. However, using a cutoff probability of 0.001, no
outliers are found. In any case, after all the observations with an extreme probability are checked for
correctness, no specific problems of coding or other errors are found, so there is no reason to eliminate any
observations.
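Continuing the Python sketch, the same flagging can be done with scipy's t distribution and the 39 error degrees of freedom:

# Cumulative t probability for each studentized residual (39 df), as in the
# JMP formula above, and two-sided flagging at the 0.05 level.
from scipy import stats

p_cum = stats.t.cdf(df["stud_resid"], 39)
flagged = df[(p_cum > 0.975) | (p_cum < 0.025)]
print(len(flagged), "observations flagged; expected by chance:", 0.05 * len(df))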
Further analysis of the influence of each suspect observation can be done by inspecting the Cook’s D
statistic for each observation. Cook’s D is a combined measure of the impact that each observation has on all
regression coefficients. An observation may be overly influential if its D > F(0.50; p, n-p), the 50th percentile of an F distribution with p and n-p degrees of freedom, where p is the number of parameters and n is the number of observations. In the Spartina dataset with the 5 variables considered, the
critical value is F(0.5, 6, 39)=0.907. On this basis, there are no influential observations, because Cook’s D is
smaller than the critical value for all observations.
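Continuing the sketch, the comparison against the 50th percentile of F(6, 39) looks like this:

# Compare each Cook's D with the 50th percentile of F(p, n - p).
from scipy import stats

n, p = len(df), 6                          # 45 observations, 6 parameters
d_crit = stats.f.ppf(0.50, p, n - p)       # about 0.907 for F(6, 39)
print("critical D =", round(d_crit, 3))
print("influential observations:", (df["cooks_d"] > d_crit).sum())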
[Figure 10 and Figure 11: formula editor entries for the p-value of each studentized residual.]
8:1.7 Collinearity.
The first thing to check is the VIF. The VIF was obtained by right-clicking on the parameter estimates table
and choosing Columns… VIF. A VIF greater than 10 is interpreted as an indicator of severe collinearity and
excessive inflation of variance of the estimated betas. In this Spartina example all values are below 10, so no
serious problem is present (Figure 12). As seen in the example of variable selection, out of the five variables
considered, a good choice of model would include only pH and Na. If that model were used, collinearity would
practically disappear (Figure 13). Whether one chooses to reduce collinearity beyond that present in the model
with all 5 variables depends on the goals of the regression. If good, low-variance estimates of the regression
coefficients are necessary, one can use a model with 2 or 3 variables, or perform biased regression using
Principal Components Regression or Ridge Regression.
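A rough equivalent of the VIF column, continuing the Python sketch:

# Variance inflation factors for the five predictors (VIF > 10 flags severe collinearity).
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["sal", "pH", "K", "Na", "Zn"]
Xd = sm.add_constant(df[predictors]).values
for i, name in enumerate(predictors, start=1):       # column 0 is the intercept
    print(name, round(variance_inflation_factor(Xd, i), 2))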
[Figure 12: VIF values for the model with all five predictors. Figure 13: VIF values for the model with only pH and Na.]
8:1.7.1 Outliers in the X dimension.
In order to detect outliers in the X dimension, we explore the hats. These values come from the main
diagonal of the hat matrix and depend only on the values of the predictor or X variables. Values that are greater
than 2p/n (where p is the number of parameters and n is the number of observations) are considered as having
a large leverage and can potentially affect the model negatively. For example, an observation with an extreme
hat value can dramatically and artificially reduce the error of the estimates and predictions.
In the Spartina example, the critical value for the hats is 2*6/45=0.27. There are two observations that
exceed this value, and only one is much larger. Cross-checking its Cook's D value, we see that the observation
does not have much influence, so no further action is required.
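Continuing the Python sketch, the leverage check can be written as:

# Flag observations whose hat value exceeds 2p/n (here 2 * 6 / 45, about 0.27),
# and cross-check their Cook's D.
n, p = len(df), 6
hat_crit = 2 * p / n
high_leverage = df[df["hat"] > hat_crit]
print("critical hat =", round(hat_crit, 2))
print(high_leverage[["hat", "cooks_d"]])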
8:1.8 Summary of diagnostic and remedial measures.
The complete diagnostics resulted in the identification of non-normality of the errors and lack of homogeneity
of variance. In addition, we found a few observations that are borderline outliers, and a little collinearity.
The first action is to apply a transformation that we hope will fix both problems at once and will reduce the
outliers. Based on the distribution of the residuals, we guess that a square root or a logarithmic transformation
may work. The residuals after performing each transformation are presented in Figure 14.
[Figure 14 and Figure 15: residuals after each candidate transformation.]
Neither transformation fixed the non-normality, so we try additional transformations. The strength of the
transformation can be regulated by rescaling and translating the range of the data according to a linear formula
Y’=f(a+bY) where a and b are arbitrary coefficients selected to meet normality, and F is either the square root or
the log transformation. It turns out that Y’=sqrt(bmss-200) has residuals that are not different from normal. The
regression of Y’ on the predictors gives a similar overall R2 and has not outliers. At this point, the process of
checking the rest of the assumptions should be repeated, to make sure that the new variable meets all
assumptions. If some assumptions cannot be met, the method is robust and can still be applied. An
alternative is to use bootstrapping to generate confidence intervals and to test hypotheses based on the model.
Bootstrapping will be considered in detail in a future lecture. Here, it is sufficient to understand that it is a
method to make inferences that has a minimum of assumptions.
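Continuing the Python sketch, the transformation and the normality recheck can be written as below; the shift of 200 is the value reported above, and the code assumes every biomass value is at least 200:

# Refit with Y' = sqrt(bmss - 200) and recheck normality of the residuals.
import numpy as np
import statsmodels.api as sm
from scipy import stats

y_prime = np.sqrt(df["bmss"] - 200)
fit2 = sm.OLS(y_prime, X).fit()            # same five predictors as before
W2, p2 = stats.shapiro(fit2.resid)
print(f"R2 = {fit2.rsquared:.3f}, Shapiro-Wilk p = {p2:.4f}")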
8:2 Model Validation.
After all transformations and remedial measures have been applied, and after final variables to keep in the
model have been selected, the model has to be validated. Validation is necessary because in the course of
building the model and applying remedial measures, the probability levels provided by the analysis lose their
meaning. In general, after all the manipulations the model obtained will look much better than it actually is.
Model validation provides a means to check how good the model actually is, and to correct the probability
levels.
There are three ways to validate and assess models:
1. Collect new data and apply the model to the new data.
2. Compare the model to previous empirical results, theories, or simulation results.
3. Check validity with part of the data, using the model developed independently with another part of the data.
In options 1 and 3 the original model can be assessed by calculating the Mean Squared Prediction Residual
(MSPR):
MSPR = Σ (Yi - Yhati)^2 / n*, summed over i = 1, ..., n*,
where n* is the number of observations in the validation data set, and Yi is the observed value for the ith
observation in the validation data set, and Yhati is the predicted value for the same observation, but based on
the model derived with the model-building or “training” data set. If MSPR is not too much larger than the original
MSE, then the model is valid. If the MSPR is much larger, the model should be improved. In the meantime, the MSPR, and not the MSE, should be used as the true estimate of the variance of predictions.
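Continuing the Python sketch, MSPR could be computed as below, assuming a hypothetical hold-out file validation.csv with the same columns:

# Mean squared prediction residual on a hold-out (validation) data set.
import pandas as pd
import statsmodels.api as sm

valid = pd.read_csv("validation.csv")                        # hypothetical file
X_valid = sm.add_constant(valid[["sal", "pH", "K", "Na", "Zn"]])
y_hat = fit.predict(X_valid)                                 # model fit on the training data
mspr = ((valid["bmss"] - y_hat) ** 2).mean()
print("MSPR:", mspr, " training MSE:", fit.mse_resid)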
The third option can be performed as data splitting, with double cross-validation, if the number of
observations is large relative to the number of variables (more than 20 observations per variable). The data are
randomly split into two halves and each half is used to test the validity of the model developed with the other
half. The PRESS statistic, the MSPR, and direct comparisons between the estimated coefficients can be used to check whether the models are similar.
See page 229 of RP book.
Test for significance of bias
This is another test that can be used to determine whether the model is valid.