Chapter 4. Preparing Data and Checking Assumptions.
4:1 Data screening before analyses
As eager scientists, we are anxious to “see” the results of our labor and answer the questions we set out to
answer when we started the research. However, it is imperative that we check the data for correctness and
other potential problems before we get too far into using them to answer questions. As seen in previous
chapters, the logical foundation that supports statistical hypothesis testing contains assumptions about the
populations and the data. If the assumptions are not supported by the data themselves, the original plan for
testing cannot be used. The goal of screening and checking the data prior to proceeding to final testing and
interpretation is to check that the data are correct, that assumptions are met, and to become familiar with the
overall “picture” presented by the data.
You have heard and read it before, and here it is again: plot your data. More specifically, plot the residuals against
all X and Y variables. The reason for this is clearly illustrated by a dataset created by Anscombe (F. J.
Anscombe. 1973. American Statistician 27:17-21). Four samples yield the same statistical results, but obviously
represent very different phenomena (Figure 4-1).
Figure 4-1. These four datasets yield exactly the same statistical results for the regression of Y on X, but they
are obviously very different. Data are in file Anscomb.jmp.
Although there are no fixed rules for data screening, there are guidelines that should be followed. The first
guideline is that all data manipulation, particularly the identification and fate of outliers, should be fully reported
in the results. The order of the different steps can affect the results. It is recommended that distributional
properties and transformations be considered first, before proceeding with the identification and handling of
outliers. Because any modification such as a transformation will change all results, after each modification the
data should be analyzed again and the screening should be repeated until all assumptions are met.
Screening of the data and checking for assumptions can be the step in data analysis that takes the most
time. Once everything is checked, one proceeds to the final analysis that will be interpreted and potentially
published.
4:1.1 Correspondence between sample and population.
A fundamental consideration in statistical analysis is the correspondence between the sample and the
population it describes. The inferences that can be made based on the sample are only applicable to the
population from which the sample was randomly taken. In defining this population, it is very important to
consider any restriction that may have influenced the sample, because the results will only be valid for the same
sort of conditions.
Figure 4-2. Relationship between plant growth and temperature determined by an observational experiment in a set of greenhouses. (Fictitious data.)
For example, the application of linear regression and the conclusions of the analysis are only applicable
within the range of values of X that were feasible in the population. Consider the case of the inexperienced but
eager agronomist who went to work in a company that grows plants in greenhouses. In an attempt to show
statistical prowess, the agronomist conducted an observational experiment in which greenhouse temperature
and plant growth rate were measured. The agronomist found no relationship whatsoever, and prepared a
scatterplot and a PowerPoint presentation to show the results to the board of directors (Figure 4-2).
"What are you talking about?!" was the response of one of the old timers at the presentation. "Temperature
has such a large effect on growth that we have to control it very carefully to maintain it within the optimum
range" he exclaimed while pulling an old chart covered with dust (Figure 4-3).
Figure 4-3. Relationship between plant growth and greenhouse temperature when temperature is not restricted to a narrow range but is manipulated to cover a wide one.
4:1.2 Missing data.
Missing data can be a problem from at least two points of view. First, sample size can be reduced
dramatically, particularly when many variables are involved in the analysis and the missing values for different
variables are in different cases or observations. For example, in a data set with 5 variables and 30 observations, 4 missing values in each variable, each in a different observation, will reduce the complete-case sample size to 10. Estimation of the missing values may be necessary in such a case. One of the best ways to do this is to generate a random observation that has the properties expected on the basis of the rest of the sample. Details of the procedure should be reported in the Methods and Results.
Second, missing values may be non-random, and related to the treatments applied or the conditions in
observational experiments. Suppose that you study plant population growth rate under natural conditions as a function of soil chemical properties. It is likely that there will be many missing values when the soil characteristics approach the extremes of the ecological niche of the species under consideration. In this case, the frequency of missing values is itself a source of information.
The randomness of missing values in a variable Y can be tested by creating a dummy or grouping variable (nominal) X that takes a value of (for example) 0 if Y is missing and 1 if Y is not missing. Then, an ANOVA or a MANOVA can be applied to test for differences in the other variables (excluding Y) between the two groups. A significant result indicates that the missing values are not random. In this case, it is important to make sure that the analyses are not biased, and observations with missing values must be kept in the data set, for example, by estimating the missing data on the basis of the rest of the data set and including a random component in them.
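As an illustration of this check (not part of the original text), here is a minimal Python sketch, assuming a pandas DataFrame with hypothetical columns growth, soil_pH, and soil_P, where growth has missing values:

    # Sketch of the missingness check described above; column and file names are hypothetical.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("plants.csv")                       # hypothetical data file

    # Dummy grouping variable: 0 if growth is missing, 1 if it is present
    df["growth_present"] = df["growth"].notna().astype(int)

    # One-way ANOVA (a t test when there are two groups) on each other variable,
    # comparing cases with and without missing growth values
    for var in ["soil_pH", "soil_P"]:
        groups = [g[var].dropna() for _, g in df.groupby("growth_present")]
        f, p = stats.f_oneway(*groups)
        print(f"{var}: F = {f:.2f}, P = {p:.3f}")
    # Small P values suggest the missing values are not random with respect to that variable.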
4:1.3 Normality.
In JMP, normality of a variable is tested through the ANALYZE – DISTRIBUTION platform. Usually, we are
interested in testing normality of errors, but in some analyses (e.g., PCA) we look at the distribution of the
variables themselves. Because errors or residuals are estimated from a sample by imposing a model that is
being tested, the distribution of errors depends on the model. When we change the model we need to check the
distribution of the errors again.
To illustrate the process in JMP we use the file xmpl_Pyield.jmp, which contains one random sample
obtained with the file xmpl_PfertParSim.xls. The true model used to get the sample was a quadratic
polynomial with the following parameter values:

β0 = 200,  β1 = 20,  β2 = 2,  σ² = 3600
The procedure requires the use of two Analyze platforms: Fit Y by X and Distribution. In Fit Y by X we fit a simple linear model and save the residuals to the data table. In Distribution we analyze the distribution of the errors and how well they fit a normal distribution.

Step 1.
- Open the xmpl_Pyield.jmp table. Verify that it has a column labeled P and another labeled Yield.
- Click on Analyze and drag down to Fit Y by X and release the mouse button.

Step 2.
- In the dialog box, click on P once to select it and then click on X, Factor to place P in the X box.
- Then, select Yield and place it in the Y, Response box.
- Click OK.

Step 3.
- In the results window that appears you will see a scatterplot of the data. Click on the red triangle to the left of the title "Bivariate Fit of Yield By P" and drag down to select Fit Line.
- The results for SLR appear below the scatter plot.

Step 4.
- Locate and click on the red triangle to the left of "Linear Fit" at the lower left of the scatter plot.
- Move down to select Save Residuals.
- In this menu you can also experiment to see what happens when you select Confidence Curves (Fit and Indiv), Plot Residuals, and Save Predicteds.

Step 5.
- The Save Residuals command creates a new column in the data table. This column is named Residuals Yield. It contains the estimated errors for each observation.
- Now you are ready to analyze the residuals.
- Note that in order to get the residuals you had to apply a model, so the estimated residuals are indeed dependent on the model imposed.

Step 6.
- Click on the Analyze menu and select the Distribution platform. A dialog window will appear to let you select the variables whose distributions you wish to analyze. In this case, of course, we are interested in Residuals Yield.

Step 7.
- Following the same idea as in step 2, select Residuals Yield in the left box and click on Y, Columns to apply the analysis to the residuals of Yield.

Step 8.
- The new window has a histogram for the residuals and a few other descriptive statistics.
- Click on the red triangle at the left of Residuals Yield near the top, and select Fit Distribution, Normal.
- This step fits a normal distribution to the residuals, and shows the results both graphically and in a table.

Step 9.
- Although not related to testing normality, at this point you can explore modifications to the histogram by changing the location and width of the bins. This is achieved by selecting the "hand" tool and click-dragging on the histogram. Dragging horizontally changes the bin width. Dragging vertically changes the bin locations.

Step 10.
- Click on the red triangle to the left of Fitted Normal and select Goodness of Fit. This performs a Shapiro-Wilk test of normality of the residuals.

Step 11.
- The value and probability level of the Shapiro-Wilk statistic are displayed.
- The probability is far from 0.01 or even 0.05, so we cannot reject the hypothesis that the distribution is normal.
Conceptually, the Shapiro-Wilk statistic is based on the regression of the ordered standardized residuals on the expected quantiles from a normal distribution. If the residuals are perfectly normal, the regression has a slope of 1.0 and a null intercept.
In SAS, normality can be assessed by analyzing the proper variable with PROC UNIVARIATE and specifying the "normal" option. This results in a report of the Shapiro-Wilk statistic and its probability level. If the probability is lower than a critical level, the distribution is significantly different from normal. In this case, the relation between the level of α and the rigorousness of the test is reversed: a large α will result in a greater probability of rejecting the assumption of normality, and thus in a more conservative test. I recommend an α value of 0.01 for this test for sample sizes greater than 30. Testing for normality with small samples is not useful, because the power of the test is very low. On the other hand, very large samples are extremely sensitive to deviations from normality. In most cases, the principle of asymptotic normality will make most tests valid, even if normality of residuals is rejected formally. In such cases, it is not necessary to address non-normality, but the decision must be made by a person who understands the use of statistics. This is another clear example of the fact that following statistical procedures blindly may lead to incorrect conclusions.
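For readers working outside JMP or SAS, the same check can be sketched in Python; the file name below is a hypothetical export of the example data, and statsmodels and scipy are assumed to be available:

    # Minimal sketch: fit the SLR of Yield on P, obtain the residuals, and apply Shapiro-Wilk.
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    df = pd.read_csv("xmpl_Pyield.csv")          # hypothetical export of xmpl_Pyield.jmp

    fit = smf.ols("Yield ~ P", data=df).fit()    # simple linear regression
    resid = fit.resid                            # estimated errors (model dependent!)

    w, p = stats.shapiro(resid)                  # Shapiro-Wilk test of normality
    print(f"Shapiro-Wilk W = {w:.4f}, P = {p:.4f}")
    # With alpha = 0.01 (recommended above for n > 30), reject normality only if P < 0.01.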
582739608
7
Revised: 5/3/2017
582739608
If non-normality is detected, a transformation should be used according to the recommendations given
below in the section about transformations. Non-normality can result from the presence of outliers, as well as
from lack of homogeneity of variance. Although Tabachnick and Fidell (1996) seem to prefer transformations to
elimination of outliers, in some cases it is clear that one or two outliers are the source of the problem. In that
case, deletion of outliers may be preferable to a transformation. Unfortunately, data screening and testing of
assumptions is an iterative process with flexible rules, and it is not possible to give a recipe that is always valid
to determine a unique set of steps.
Multivariate normality is more difficult to test. Tests are based on the fact that if all variables are normal, then the squared Mahalanobis distance* (D²; http://www.isical.ac.in/prof.html) for each observation should have a χ² distribution. The test is based on determining the goodness of fit to a χ² by plotting the observed values of D² against the quantiles of a χ² with p degrees of freedom (p = number of variables). An example is presented under the PCA topic.
[P. C. Mahalanobis. On Tests and Measures of Group Divergence I. Journal of the Asiatic Society of Bengal, 26:541, 1930.]
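A minimal Python sketch of this graphical check, using illustrative simulated data (numpy, scipy, and matplotlib assumed):

    # Compute squared Mahalanobis distances and compare them with chi-square quantiles (a Q-Q plot).
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    Y = rng.multivariate_normal([0, 0, 0], np.eye(3), size=100)   # illustrative (n x p) data

    p = Y.shape[1]
    center = Y.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Y, rowvar=False))
    dev = Y - center
    d2 = np.einsum("ij,jk,ik->i", dev, S_inv, dev)                # squared Mahalanobis D2

    d2_sorted = np.sort(d2)
    q = stats.chi2.ppf((np.arange(1, len(d2) + 1) - 0.5) / len(d2), df=p)

    plt.scatter(q, d2_sorted)                 # points should fall close to the 1:1 line
    plt.plot(q, q, "k--")
    plt.xlabel("Chi-square quantiles (p df)")
    plt.ylabel("Ordered squared Mahalanobis D2")
    plt.show()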
4:1.4 Linearity and lack of fit.
Most analyses assume that the relationship among variables is linear. Lack of linearity can be determined by
studying scatterplots or by analysis of lack of fit. Scatterplots can also be used to determine what type of
transformation may be necessary, as indicated in the section on transformations.
When replicate observations are available for at least some of the levels of X, it is possible to test the
hypothesis that a particular model, such as the linear model, is not a good fit to the data. This analysis of lack of
fit can also be performed when "near" replicates are available, where observations can be subdivided into many
groups such that most of the variance in X is among groups, and little variance in X is observed within groups.
The analysis of lack of fit is based on further subdividing the SSE into a portion called Pure Error and a portion
due to Lack of Fit. The variation due to pure error (SSPE) is the sum of squares of deviations of observations
about the average level of Y for each level of X. The variation due to lack of fit (SSLF) is the sum of squares of
the deviations of the average Y for each level of X about the value predicted by the model. Note that this
partition of SS can be applied to any model, not just the linear one.
eij = Yij − Ŷij = (Yij − Ȳj) + (Ȳj − Ŷij)

SSE = Σj Σi (Yij − Ȳj)² + Σj Σi (Ȳj − Ŷij)² = SSPE + SSLF,  where the sums run over i = 1, …, nj and j = 1, …, c.
The subscript j refers to the levels of X and it ranges from 1 to c; c being the number of different levels of X
present. This partition of the SSE also reflects the Ho and Ha for the test of lack of fit. Ho states that the
expected value of Y is a linear function of X, whereas Ha states that the expected values of Y for each X do not
necessarily fall on the line.
Ho: E{Y} = β0 + β1X, or equivalently, μj = β0 + β1Xj
Ha: E{Y} = μj, without any additional constraints on the relationship between μj and X.
If Ho is true, two independent estimates of the variance of the error can be obtained: one based on the deviations of the observations around the average for each level of X, and one based on the deviations of the Ȳj around the line. These are the mean square of the pure error (MSPE) and the mean square of the lack of fit (MSLF), respectively. If Ho is not true, then MSLF will tend to be greater than MSPE, and their ratio will be significantly
greater than the expected value of the F statistic. The analysis can be performed in the usual ANOVA form,
where SSE has n-2 degrees of freedom as usual (total number of observations minus one df for each
parameter), SSLF has c-2 degrees of freedom (number of levels of X minus the number of parameters
estimated), and SSPE has n-c degrees of freedom (number of observations minus c means estimated, one for
each level of X).
Source          SS                         df     MS
Regression      SSR  = Σ(Ŷij − Ȳ)²         1      MSR = SSR/1
Error           SSE  = Σ(Yij − Ŷij)²       n−2    MSE = SSE/(n−2)
  Lack of Fit   SSLF = Σ(Ȳj − Ŷij)²        c−2    MSLF = SSLF/(c−2)
  Pure Error    SSPE = Σ(Yij − Ȳj)²        n−c    MSPE = SSPE/(n−c)
Total           SSTO = Σ(Yij − Ȳ)²         n−1
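The partition above can also be computed directly. The following Python sketch assumes a DataFrame exported from the example data (hypothetical file name) with replicate observations at each level of P:

    # Lack-of-fit F test for the linear model, computed from SSPE and SSLF as defined above.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    df = pd.read_csv("xmpl_Pyield.csv")                         # hypothetical export
    fit = smf.ols("Yield ~ P", data=df).fit()                   # the linear model being checked
    df["yhat"] = fit.fittedvalues
    df["ybar_j"] = df.groupby("P")["Yield"].transform("mean")   # mean Y at each level of P

    n = len(df)
    c = df["P"].nunique()
    SSPE = ((df["Yield"] - df["ybar_j"]) ** 2).sum()            # pure error
    SSLF = ((df["ybar_j"] - df["yhat"]) ** 2).sum()             # lack of fit

    F = (SSLF / (c - 2)) / (SSPE / (n - c))
    P_value = stats.f.sf(F, c - 2, n - c)
    print(f"F(lack of fit) = {F:.2f}, P = {P_value:.4f}")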
JMP automatically produces a test and report for lack of fit if the data contain more than one observation with the same values for all X variables used in the model. In the Pyield example the result appears in the xmpl_Pyield.jmp: Bivariate window, but it is collapsed. The results can be displayed by clicking on the gray triangle to the left of the title "Lack of Fit."
The Lack of Fit analysis shows the partition of the total error into two components, Lack of Fit and Pure Error. The F-ratio tests whether the variance of the average yield for each level of P around the value predicted by the line is significantly greater than the variance of yield around the average for each level of P. In this example, there is no significant lack of fit (P > 0.05), so we cannot reject the linear model, in spite of the fact that we know that the correct model is quadratic.
The Max RSq indicates the maximum proportion of the total variance in yield that could be explained by a
model that goes exactly through the average yield for each level of P. Such a model is the one used when one
considers each level of P as a discrete treatment in an ANOVA, as shown in the following results obtained after
changing P from a continuous variable to a nominal one.
The variable type is changed by clicking on the blue c to the left of P and selecting Nominal from the drop-down menu, as shown in the figure.
4:1.4.1.1 General linear test
It is worth noting that the null hypothesis postulates a model that is more restrictive than the alternative
hypothesis. The model under Ho is also called "reduced" model, because it has fewer parameters (only two)
than the model under Ha, which is called the "full" model and includes as many parameters as possible (one
mean for each level of X). Usually, as parameters are added to a model, the SSR increases, but the df
decrease. Thus, the whole test of lack of fit can also be thought of as checking whether the improvement in the SSR achieved by adding parameters is worth the associated loss of df.
- When yield is analyzed as a function of nominal P, the error term in the ANOVA is equal to the Pure Error term in the analysis of lack of fit.
- The R-square is the same as the Max RSq in the analysis of lack of fit.
- These facts are pointed out to emphasize the relationship among analyses, but are not involved in any testing or estimation directly.
- From a practical point of view, it is obvious that there is a curvilinear relationship between yield and P. Unless we have a priori reasons to keep the linear model, we should try a different model even though linearity was not rejected.
This alternative point of view happens to be the most general version of the vast majority of F tests we
encounter in statistics, and is called the "General Linear Test." In order to compare any two models, one of
which is a reduced version of the other, the general linear test calculates an F value as follows.
Ho: Reduced model (R)
Ha: Full model (F)
F = [(SSER − SSEF) / (dfeR − dfeF)] / [SSEF / dfeF] = [(SSER − SSEF) / (dfeR − dfeF)] / MSEF
This F is compared with the table value with (dfeR-dfeF) degrees of freedom in the numerator and dfeF
degrees of freedom in the denominator. If the calculated value is greater than the table value, Ho is rejected.
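A Python sketch of the general linear test for the yield example, where the reduced model is the straight line and the full model fits one mean per level of P (file name hypothetical, statsmodels assumed):

    # General linear test: compare nested models via the F ratio defined above.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    df = pd.read_csv("xmpl_Pyield.csv")                    # hypothetical export

    reduced = smf.ols("Yield ~ P", data=df).fit()          # Ho: E{Y} = b0 + b1*P
    full = smf.ols("Yield ~ C(P)", data=df).fit()          # Ha: one mean per level of P

    num = (reduced.ssr - full.ssr) / (reduced.df_resid - full.df_resid)
    den = full.ssr / full.df_resid                         # = MSE of the full model
    F = num / den
    P_value = stats.f.sf(F, reduced.df_resid - full.df_resid, full.df_resid)
    print(f"F = {F:.2f}, P = {P_value:.4f}")               # equals the lack-of-fit F here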
4:1.5 Homogeneity of variance-covariance matrices.
In univariate cases the assumption of homogeneity of variance-covariance matrices is equivalent to
assuming homogeneity of variance. Homogeneity of variance in grouped data is assessed among groups. For
ungrouped data, the error terms can be partitioned into two groups, one for low levels of X and the other for high
levels of X. Then, Levene's test for homogeneity of variance is applied as with grouped data. This test can be
requested with the HOVTEST option of PROC ANOVA in SAS (for details, search for the terms "Levene" and "HOVTEST" in SAS Help). Lack of constant variance in linear regression can be detected by examination of scatterplots of errors vs. predicted Y and vs. X (Figure 4-4). A fan-shaped plot indicates the need for a variance-stabilizing transformation or for weighted regression.
Figure 4-4. Scatter plots (panels A, B, and C) showing heterogeneous variance and nonlinearity.
In JMP, you can test for homogeneity of variance in a SLR situation by separating the data into groups by level of X, for example, low, medium, and high. Each group receives a different value of a new nominal variable, say "group," and the homogeneity of variance among groups is tested with the Fit Y by X platform. Select group as the X variable and the residuals of the model (previously saved) as the Y variable. Once the analysis window appears, click on the red triangle at the top and select Unequal Variances. A series of statistics is calculated to test the homogeneity of variance. As an exercise, select the JMP Help menu and search for Homogeneity of Variance Tests. The test is explained in detail in the help page.
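The same grouping approach can be sketched in Python with scipy's Levene test; the file name and the choice of three equal-width groups are illustrative assumptions:

    # Levene's test on SLR residuals after grouping observations by level of X.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    df = pd.read_csv("xmpl_Pyield.csv")                      # hypothetical export
    df["resid"] = smf.ols("Yield ~ P", data=df).fit().resid  # saved SLR residuals

    # Nominal grouping variable: low / medium / high thirds of the X range
    df["group"] = pd.cut(df["P"], bins=3, labels=["low", "medium", "high"])

    groups = [g["resid"].values for _, g in df.groupby("group", observed=True)]
    stat, p = stats.levene(*groups)                          # test of equal variances
    print(f"Levene W = {stat:.2f}, P = {p:.4f}")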
In true multivariate situations, the analysis of homogeneity of variance-covariance matrices means that the patterns of variances and covariances among variables are the same in each group considered. In other words, the scatterplots should be similar in all groups. However, the examination of scatterplots is only feasible for simple groupings and no more than 3 variables, by using the 3-D scatterplot option under GRAPH – SPINNING PLOT in JMP or "Interactive Analyses" in SAS 7. Statistical tests, such as Box's M, tend to be very conservative and very sensitive to outliers. This should be weighed against the fact that lack of homogeneity of variance-covariance matrices weakens, but does not invalidate, the analyses. SAS offers these tests through options, such as the POOL=TEST option in discriminant analysis.
4:1.6 Outliers.
Outliers are observations that are not consistent with the rest of the data. This does not mean that they are
automatically removed from the data set, but that the cause and effects of these observations should be
carefully considered.
There may be outliers in the Y or in the X dimensions. An outlier in the Y dimension is a value that falls outside the reasonably expected range of Y, of a combination of Y's (for true multivariate analyses), or of the error. An outlier in the X dimension is an extreme value that deviates greatly from the average for the predictor. In SLR, a value that is an outlier only in Y tends to increase the variance of the parameters, without a big impact on parameter estimates. A value that is an outlier only in the X dimension tends to reduce the variance of the parameters, without much impact on parameter estimates, whereas an observation that is an outlier in both X and Y changes the parameter estimates and increases their variance.
Often, outliers result from incorrectly coded observations and incorrect data entry. Both of these cases have clear solutions. Outliers can also appear because they do not belong to the same population as the one addressed by the rest of the sample. Suppose that you are trying to determine the average size of a species of aphid in a field where many species are present. Extremely large or small aphids may come from a different species in the same field. Because aphid species are hard to determine, one cannot be certain whether an outlier is or is not from the target species.
Finally, outliers may be caused by the population having a distribution that is not normal, where extreme
values are more common than in the normal distribution. In this case, deleting the outliers can weaken the link
between sample and population. Consideration of transformations is a more desirable solution.
The first step in the analysis is to identify outliers. Then one proceeds to apply transformations or delete
those outliers that are extreme, depending on the situation. Typically, the methods and equations used for
multivariate outliers also apply to univariate situations, because univariate situations can be considered to be
special cases of multivariate ones. However, for the sake of clarity, we present them separately.
4:1.6.1 Univariate outliers
When there is only one variable under consideration, such as the errors in simple linear regression, one can
apply univariate methods to detect outliers in the Y or in the X dimension. Because in SLR X is not considered
to be a random variable, but a set of fixed values, we do not consider both X and Y at the same time, and no
probability is associated with the identification of outliers in the X dimension.
Values in the Y dimension that have a low probability are identified as outliers. For univariate outliers in SLR, the deleted studentized residual ti of each observation is used:
ti = di / s{di}

where

di = Yi − Ŷ(i) = ei / (1 − hii)

s²{di} = MSE(i) / (1 − hii)

MSE(i) = [(n − p) MSE − ei² / (1 − hii)] / (n − p − 1)
In these equations, the subscript (i) indicates the value of the statistic calculated while holding the
observation under consideration out of the sample; n is the sample size, and p is the total number of
independent parameters in the model. Thus, di is the difference between the observed value for Y and the
expected value predicted by a model obtained by temporarily holding observation i out of the sample. The value
hii is the leverage of each observation, and it comes from the main diagonal of the H or Hat matrix. As shown in
the equation, di can be calculated from the regular error and the leverage for observation i. The variance for di is
a function of the MSE obtained with observation i held out. This variance can also be calculated, as shown in
the equations above, as a function of the regular MSE, the error and the leverage for each observation. In JMP,
we calculate the studentized deleted residuals by creating a new column and entering the formulas from above.
Figure 4-5. This figure has simulated data for yield response to P application. Three points that do
not belong to the original population have been added to demonstrate the detection of outliers
and its interaction with choice of model.
Under the assumption that the errors are normally distributed with homogeneous variance, the deleted studentized residual should have a t distribution with n−p−1 degrees of freedom. The degrees of freedom are one less than usual because MSE(i) is calculated with n−1 observations, given that observation i is held out. Tabachnick and Fidell (1996) recommend that any observation that has a ti greater than the two-tailed table value for P = 0.001 is an outlier and should be considered for deletion. Neter et al. (1996) recommend using a Bonferroni correction for the test with α = 0.10. The Bonferroni correction reduces the actual α such that the probability of erroneously identifying a point as an outlier in the whole sample remains close to the nominal value. This correction is necessary because as sample size increases, the probability that at least one point will deviate greatly from the expected value also increases, even if no true outliers are present. The correction is achieved by dividing the nominal value of α by n, and then using the new probability level to look up the critical value in the table. The fixed value of α = 0.001 is equivalent to using the Bonferroni correction with α = 0.10 when n = 100. For smaller sample sizes the Bonferroni criterion is more liberal; for larger n's, it is more conservative.
In SAS, the deleted studentized residuals are requested in the MODEL statement with the option "INFLUENCE." The critical level according to the most conservative criterion is t(0.9995, 33) = 3.61, which points to observation number 35 as a clear outlier (Figure 4-5). This observation is also an outlier in the X dimension, because its leverage hii is 0.289. The critical value for the leverage when X is not random is 2(m+1)/n, where m is the number of X variables. In the example, m = 1 and n = 36, so any leverage over 0.111 identifies an outlier in the X dimension. This test can also be used in situations where there are many predictors, or X variables that are considered to have fixed values.
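A Python sketch of this univariate screen, using statsmodels to obtain the deleted studentized residuals and leverages and applying the Bonferroni criterion described above (file name hypothetical):

    # Identify Y outliers (large |ti|) and X outliers (large leverage) after an SLR fit.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf
    from statsmodels.stats.outliers_influence import OLSInfluence

    df = pd.read_csv("pyield_with_outliers.csv")          # hypothetical file
    fit = smf.ols("Yield ~ P", data=df).fit()
    infl = OLSInfluence(fit)

    t_i = infl.resid_studentized_external                 # deleted studentized residuals
    h_ii = infl.hat_matrix_diag                           # leverages

    n = len(df)
    p = 2                                                 # parameters in SLR (intercept, slope)
    alpha = 0.10
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)  # Bonferroni critical value
    h_crit = 2 * p / n                                    # 2(m+1)/n with m = 1 predictor

    print("Y outliers:", list(df.index[abs(t_i) > t_crit]),
          " X outliers:", list(df.index[h_ii > h_crit]))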
4:1.6.2 Multivariate outliers
Figure 4-6. Multivariate outliers are identified by examining their distance to the centroid for the distribution.
The Euclidean distance is corrected by the pattern of covariation among variables. Point b is further from
the centroid than a, but it is statistically closer. Whereas b is not an outlier, a is clearly outside the
expected distribution.
Multivariate outliers can be detected by calculating the Mahalanobis distance D for each observation. This distance is a measure of the "statistical" distance between each observation and the centroid of the group being considered. Suppose you are performing a MANOVA where the Y's are seed number and weight per seed of a species of interest (Figure 4-6), and the X variable is the level of soil fertility and water availability. In this example, X is a categorical or "class" variable that, say, takes three values: low, medium, and high. In order to detect outliers in the Y dimensions, a centroid or vector of average values for each Y is calculated for each group. The centroids are the best estimates of the expected value of the vector of random variables, and serve the same function as Ŷ in the univariate case, where we considered deviations about a straight line. After calculating the centroids, the deviations of each observation from its group's centroid are calculated. Analogously to the deleted residuals, a robust or "jackknifed" squared Mahalanobis distance is calculated for each observation while holding that observation out of the sample. This prevents potential outliers from distorting the very detection of outliers.
Figure 4-6 shows the simulated data for the group of medium fertility and water availability. The centroid for this group is the point (972, 1173). The Euclidean or geometric distances from each of two potential outliers, observations a and b, are represented by the lines from each point to the centroid. Clearly, point b is further away from the centroid than a. The jackknifed squared Mahalanobis distances for points a and b are 31 and 10, respectively, which is counter to the ordering of the Euclidean distances. The difference is due to the fact that the two variables, seed weight and seed number, exhibit a strong positive covariance within this group. As is intuitively clear from the scatterplot, given the dispersion and correlation between the variables, point a is a lot less likely than point b. This is reflected in the Mahalanobis distance.
Assuming multivariate normality for the random vector of Y variables (in the example the vector is {seed number, seed weight}), the squared Mahalanobis distance and its jackknifed version should follow a χ² distribution with 2 degrees of freedom (df = number of variables). This distribution can be used in the same way the t distribution was used in the univariate situation. A critical value of χ² is determined either by a set probability α = 0.001 or by using the Bonferroni correction with α = 0.10.
Outliers are identified by testing the following hypothesis for each observation:
Ho: Yi follows the same multivariate normal distribution as the rest of the sample.
Ha: Yi does not follow the same multivariate normal distribution as the rest of the sample.
Let Y1 and Y2 be random variables that have a bivariate normal distribution within a group, and let Yi = (Y1i, Y2i) be a random observation, i = 1, …, n. The jackknifed squared Mahalanobis distance is defined as

d²(i) = (Yi − Ȳ(i))' S⁻¹ (Yi − Ȳ(i)),

which should follow a χ² distribution with df = 2.
The simulated seed weight example has 250 observations. Observations a and b were not considered in the sample to simplify the calculations in xmpl_MVoutl.xls, but a strict calculation should have included point b when calculating d² for a, and vice versa. Using the Bonferroni approach, the critical level for the Mahalanobis distance is the χ² with 2 degrees of freedom for α = 0.10/250 = 0.0004. The test is one-tailed, because the squared distance can only be positive, and values close to zero indicate that the observations are very much within the range expected under the assumptions and Ho. The critical value is 15.65, indicating that observation a is an outlier but observation b is not.
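A Python sketch of the jackknifed distance calculation; the centroid comes from the example above, but the covariance matrix and the simulated data are purely illustrative:

    # Jackknifed squared Mahalanobis distance for one group: each observation is held out
    # when computing the centroid and covariance matrix used for its own distance.
    import numpy as np
    from scipy import stats

    def jackknifed_d2(Y):
        n, p = Y.shape
        d2 = np.empty(n)
        for i in range(n):
            others = np.delete(Y, i, axis=0)                  # hold observation i out
            center = others.mean(axis=0)
            S_inv = np.linalg.inv(np.cov(others, rowvar=False))
            dev = Y[i] - center
            d2[i] = dev @ S_inv @ dev
        return d2

    rng = np.random.default_rng(1)
    Y = rng.multivariate_normal([972, 1173], [[8000, 6000], [6000, 8000]], size=250)  # illustrative

    d2 = jackknifed_d2(Y)
    crit = stats.chi2.ppf(1 - 0.10 / len(Y), df=Y.shape[1])   # Bonferroni, alpha = 0.10 (= 15.65 here)
    print("critical value:", round(crit, 2), " outliers:", np.where(d2 > crit)[0])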
Through the Multivariate platform in JMP one can obtain the Mahalanobis distance (D) and the jackknifed distance. The plot also shows a dotted line that represents the critical D value for α = 0.05. This distance is calculated as F times the number of variables, where F is the table value for the desired probability level, with df in the numerator equal to the number of variables (nvars) and df in the denominator equal to n − nvars − 1.
A question that is relevant at this point is: why did we use the Bonferroni correction with n = 250 when in fact we only tested 2 observations? This is because the 2 observations were picked after looking at the scatterplot. The only time that the n for the Bonferroni correction equals the actual number of tests performed is when the identity of the observations to be tested is established a priori, before obtaining or looking at the results. In such a case, SAS may print out all distances anyway, and one can catch a glimpse of a significant distance that was not in the list prepared a priori. The proper critical value for testing that observation has to be determined with n = total sample size.
4:1.7 Transformations.
Transformations can be used to address the following problems:
1. Lack of normality
2. Lack of linearity
3. Heterogeneous variance
4. Outliers
When errors (or Y's) are not normally distributed, a transformation can fix the problem. Figure 4-7, based on Tabachnick and Fidell (2001), gives guidelines for choosing a transformation.
Figure 4-7. Original distributions (pdf’s) and common transformations to achieve normality.
The Log transformation is particularly useful to produce normality in skewed distributions and in stabilizing
variance. The strength of the log transformation over different ranges of the variable can be regulated by
applying a linear transformation of the original variable before taking the log,
Y'=Ln(c0+c1Y)
where c0 and c1 are coefficients that can be adjusted by trial and error. The first coefficient "moves" the
whole distribution to different locations of the log transformation, modifying the "average" intensity of the
transformation. The second coefficient modulates the spread of the distribution, thus regulating the difference in
intensity of the log transformation between the low and the high range of Y.
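A small Python sketch of this trial-and-error tuning, using illustrative right-skewed data and arbitrary coefficient choices:

    # Compare the skewness of the original and shifted-log-transformed variable
    # for a few trial values of c0 and c1 in Y' = ln(c0 + c1*Y).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    Y = rng.lognormal(mean=2.0, sigma=0.8, size=200)         # a right-skewed variable

    print("original skewness:", round(stats.skew(Y), 2))
    for c0, c1 in [(0.0, 1.0), (10.0, 1.0), (1.0, 0.2)]:     # trial-and-error choices
        Yp = np.log(c0 + c1 * Y)
        print(f"c0={c0}, c1={c1}: skewness = {stats.skew(Yp):.2f}")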
Figure 4-8 shows typical patterns of errors. The first graph shows errors that have constant variance and no
need for nonlinear terms or transformations. The second plot shows a case where there is confounding between
the effects of X and time or spatial sequence. This may have resulted from a poorly planned sampling scheme.
In this case, it is necessary to remove the effects of time before analyzing the effects of X. The third plot shows
a case in which there is a clear curvilinear effect. In this case, the addition of a quadratic X term can fix the
model. The addition of a quadratic X term has similar effects to applying a square root transformation on the Y
variable, except that a transformation of Y can change its distribution in an undesirable way. Finally, the last plot
shows a case of decreasing variance which has to be addressed by using a transformation of the Y variable.
Transformations can be applied to X, to Y, or to both. Moreover, multiple transformations can be applied to the same variable. As a general rule, when the distribution of the errors or of Y is normal, it is better to address lack of linearity by transforming the X variable. Transformations to fix non-linearity are suggested by the shape of the scatterplot in Figure 4-9.
Figure 4-8. Typical problems indicated by the distribution of points in a scatter plot.
Figure 4-9. Recommended transformations of X to correct nonlinearity, suggested by the shape of the scatterplot: X' = Ln(X), X' = sqrt(X), X' = X², X' = exp(X), X' = 1/X, X' = exp(−X).
When the relationship is nonlinear and the variance of the error appears to increase with increasing values of predicted Y, as is the case in all graphs of Figure 4-7, a transformation of Y can fix the problem. One can try the log, inverse, and square root transformations and select the one that yields the best results. The Guided Analysis feature of SAS 7 automatically performs a likelihood test to suggest the transformation that will have the greatest chance of fixing the problem.
JMP offers the possibility of calculating Box-Cox transformations. These constitute a family of power transformations that have the following functional form, where λ is a parameter adjusted by maximum likelihood:

Y' = Y^λ   if λ ≠ 0
Y' = ln(Y)  if λ = 0
Note that Y squared, the square root of Y, and 1/Y are all members of this family. These transformations facilitate the correction of problems with the assumptions. In general, a precise value of λ is not necessary, so it is recommended that one select a value that is easier to interpret and that is close to the fitted one. As an exercise, use JMP to determine the best transformation for the data in homework 01, which appeared to exhibit non-linearity and heterogeneity of variance. Try to interpret the selected λ.
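Outside JMP, a maximum-likelihood estimate of λ can be obtained, for example, with scipy; a minimal sketch with illustrative data:

    # Box-Cox transformation with lambda estimated by maximum likelihood.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    y = rng.lognormal(mean=1.0, sigma=0.6, size=150)     # skewed, strictly positive data

    y_transformed, lam = stats.boxcox(y)                 # ML estimate of lambda
    print("fitted lambda:", round(lam, 2))
    # In practice, round to an interpretable value near the fit, e.g. 0 (log),
    # 0.5 (square root), or -1 (inverse), as recommended above.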
4:1.8 Multicollinearity and singularity.
Regression and linear models in general require that the matrix X'X be inverted, where X is the design or data matrix. In SLR the matrix X'X is small and simple (2x2), and it always has rows and columns that are linearly independent. However, in multiple linear regression (MLR) there are situations when the columns of X are related to each other. When a column of X can be expressed as an exact linear combination of the other columns, the X'X matrix cannot be inverted because its determinant is 0. The matrix is said to be "singular." In most cases, X's are not perfect linear combinations of other X's, but they may be close. As the variance in any X can be explained more and more by the other X's, the determinant of the matrix tends to zero and causes problems when inverting X'X. For example, the determinant can be so close to zero that just the usual rounding off that computers perform can have major impacts on the results.
This is known as collinearity or multicollinearity, and it is a major problem in studying relationships among variables, particularly in observational experiments. Multicollinearity can prevent us from determining the true effects of factors on responses, and it is more a problem than an assumption. Identification and measurement of multicollinearity will be addressed in detail under the subjects of PCA and MLR.
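A small numeric illustration (not from the original text) of how near-collinearity drives the determinant of X'X toward zero and inflates its condition number:

    # As one column of X becomes nearly a linear combination of another, det(X'X) approaches
    # zero and the inversion of X'X becomes numerically unstable. Purely illustrative data.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    x1 = rng.normal(size=n)

    for noise in [1.0, 0.1, 0.001]:
        x2 = x1 + rng.normal(scale=noise, size=n)       # x2 increasingly collinear with x1
        X = np.column_stack([np.ones(n), x1, x2])       # design matrix with intercept
        XtX = X.T @ X
        print(f"noise={noise}: det(X'X) = {np.linalg.det(XtX):.3g}, "
              f"condition number = {np.linalg.cond(XtX):.3g}")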