Chapter 4. Preparing Data and Checking Assumptions.
4:1 Data screening before analyses
As eager scientists, we are anxious to “see” the results of our labor and answer the questions we set out to
answer when we started the research. However, it is imperative that we check the data for correctness and
other potential problems before we get too far into using them to answer questions. As seen in previous
chapters, the logical foundation that supports statistical hypothesis testing contains assumptions about the
populations and the data. If the assumptions are not supported by the data themselves, the original plan for
testing cannot be used. The goal of screening and checking the data prior to proceeding to final testing and
interpretation is to check that the data are correct, that assumptions are met, and to become familiar with the
overall “picture” presented by the data.
You have heard and read it before, and here it is again: plot your data. More specifically, plot the residuals against
all X and Y variables. The reason for this is clearly illustrated by a dataset created by Anscombe (F. J.
Anscombe. 1973. American Statistician 27:17-21). Four samples yield the same statistical results, but obviously
represent very different phenomena (Figure 4-1).
Figure 4-1. These four datasets yield exactly the same statistical results for the regression of Y on X, but they
are obviously very different. Data are in file Anscomb.jmp.
Although there are no fixed rules for data screening, there are guidelines that should be followed. The first
guideline is that all data manipulation, particularly the identification and fate of outliers, should be fully reported
in the results. The order of the different steps can affect the results. It is recommended that distributional
properties and transformations be considered first, before proceeding with the identification and handling of
outliers. Because any modification such as a transformation will change all results, after each modification the
data should be analyzed again and the screening should be repeated until all assumptions are met.
Screening of the data and checking for assumptions can be the step in data analysis that takes the most
time. Once everything is checked, one proceeds to the final analysis that will be interpreted and potentially
published.
4:1.1 Correspondence between sample and population.
A fundamental consideration in statistical analysis is the correspondence between the sample and the
population it describes. The inferences that can be made based on the sample are only applicable to the
population from which the sample was randomly taken. In defining this population, it is very important to
consider any restriction that may have influenced the sample, because the results will only be valid for the same
sort of conditions.
Figure 4-2. Relationship between plant growth and temperature determined by an observational experiment in a set of greenhouses. (Fictitious data.)
For example, the application of linear regression and the conclusions of the analysis are only applicable
within the range of values of X that were feasible in the population. Consider the case of the inexperienced but
eager agronomist who went to work in a company that grows plants in greenhouses. In an attempt to show
statistical prowess, the agronomist conducted an observational experiment in which greenhouse temperature
and plant growth rate were measured. The agronomist found no relationship whatsoever, and prepared a
scatterplot and a PowerPoint presentation to show the results to the board of directors (Figure 4-2).
"What are you talking about?!" was the response of one of the old timers at the presentation. "Temperature
has such a large effect on growth that we have to control it very carefully to maintain it within the optimum
range" he exclaimed while pulling an old chart covered with dust (Figure 4-3).
Figure 4-3. Relationship between plant growth and greenhouse temperature when temperature is not restricted to a narrow range but is manipulated to cover a wide one.
4:1.2 Missing data.
Missing data can be a problem from at least two points of view. First, sample size can be reduced
dramatically, particularly when many variables are involved in the analysis and the missing values for different
variables are in different cases or observations. For example, in a data set with 5 variables and 30 observations, 4 missing values in each variable, each in a different observation, will reduce the complete-case sample size to 10. Estimation of the missing values may be necessary in such a case. One of the best ways to do this is to generate a random observation that has the properties expected on the basis of the rest of the sample. Details of the procedure should be reported in the Methods and Results.
Second, missing values may be non-random, and related to the treatments applied or the conditions in
observational experiments. Suppose that you study plant population growth rate under natural conditions as a function of soil chemical properties. It is likely that there will be many missing values when the soil characteristics approach the extremes of the ecological niche of the species under consideration. In this case, the frequency of missing values is itself a source of information.
The randomness of missing values in a variable Y can be tested by creating a dummy or grouping variable (nominal) X that takes a value of (for example) 0 if Y is missing and 1 if Y is not missing. Then, an ANOVA or a MANOVA can be applied to test for differences in the other variables (excluding Y) between the two groups. A significant result indicates that the missing values are not random. In this case, it is important to make sure that the analyses are not biased, and observations with missing values must be kept in the data set, for example, by estimating the missing data on the basis of the rest of the data set and including a random component in them.
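As an illustration of this check (not part of the original text), here is a minimal Python sketch, assuming a pandas DataFrame with hypothetical columns growth, soil_pH, and soil_P, where growth has missing values:

    # Sketch of the missingness check described above; column and file names are hypothetical.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("plants.csv")                       # hypothetical data file

    # Dummy grouping variable: 0 if growth is missing, 1 if it is present
    df["growth_present"] = df["growth"].notna().astype(int)

    # One-way ANOVA (a t test when there are two groups) on each other variable,
    # comparing cases with and without missing growth values
    for var in ["soil_pH", "soil_P"]:
        groups = [g[var].dropna() for _, g in df.groupby("growth_present")]
        f, p = stats.f_oneway(*groups)
        print(f"{var}: F = {f:.2f}, P = {p:.3f}")
    # Small P values suggest the missing values are not random with respect to that variable.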
4:1.3 Normality.
In JMP, normality of a variable is tested through the ANALYZE – DISTRIBUTION platform. Usually, we are
interested in testing normality of errors, but in some analyses (e.g., PCA) we look at the distribution of the
variables themselves. Because errors or residuals are estimated from a sample by imposing a model that is
being tested, the distribution of errors depends on the model. When we change the model we need to check the
distribution of the errors again.
To illustrate the process in JMP we use the file xmpl_Pyield.jmp, which contains one random sample
obtained with the file xmpl_PfertParSim.xls. The true model used to get the sample was a quadratic
polynomial with the following parameter values:

β0 = 200,  β1 = 20,  β2 = 2,  σ² = 3600
The procedure requires the use of two Analyze platforms: Fit Y by X and Distribution. In Fit Y by X we fit a simple linear model and save the residuals to the data table. In Distribution we analyze the distribution of the errors and how well they fit a normal distribution.

Step 1.
- Open the xmpl_Pyield.jmp table. Verify that it has a column labeled P and another labeled Yield.
- Click on Analyze and drag down to Fit Y by X and release the mouse button.

Step 2.
- In the dialog box, click on P once to select it and then click on X, Factor to place P in the X box.
- Then, select Yield and place it in the Y, Response box.
- Click OK.

Step 3.
- In the results window that appears you will see a scatterplot of the data. Click on the red triangle to the left of the title "Bivariate Fit of Yield By P" and drag down to select Fit Line.
- The results for SLR appear below the scatter plot.

Step 4.
- Locate and click on the red triangle to the left of "Linear Fit" at the lower left of the scatter plot.
- Move down to select Save Residuals.
- In this menu you can also experiment to see what happens when you select Confidence Curves (Fit and Indiv), Plot Residuals, and Save Predicteds.

Step 5.
- The Save Residuals command creates a new column in the data table. This column is named Residuals Yield. It contains the estimated errors for each observation.
- Now you are ready to analyze the residuals.
- Note that in order to get the residuals you had to apply a model, so the estimated residuals are indeed dependent on the model imposed.

Step 6.
- Click on the Analyze menu and select the Distribution platform. A dialog window will appear to let you select the variables whose distributions you wish to analyze. In this case, of course, we are interested in Residuals Yield.

Step 7.
- Following the same idea as in step 2, select Residuals Yield in the left box and click on Y, Columns to apply the analysis to the residuals of Yield.

Step 8.
- The new window has a histogram for the residuals and a few other descriptive statistics.
- Click on the red triangle at the left of Residuals Yield near the top, and select Fit Distribution, Normal.
- This step fits a normal distribution to the residuals, and shows the results both graphically and in a table.

Step 9.
- Although not related to testing normality, at this point you can explore modifications to the histogram by changing the location and width of the bins. This is achieved by selecting the "hand" tool and click-dragging on the histogram. Dragging horizontally changes the bin width. Dragging vertically changes the bin locations.

Step 10.
- Click on the red triangle to the left of Fitted Normal and select Goodness of Fit. This performs a Shapiro-Wilk test of normality of the residuals.

Step 11.
- The value and probability level of the Shapiro-Wilk statistic are displayed.
- The probability is far from 0.01 or even 0.05, so we cannot reject the hypothesis that the distribution is normal.
Conceptually, the Shapiro-Wilk statistic is based on the regression of the ordered standardized residuals on the expected quantiles from a normal distribution. If the residuals are perfectly normal, the regression has a slope of 1.0 and a null intercept.
In SAS, normality can be assessed by analyzing the proper variable with PROC UNIVARIATE and specifying the "normal" option. This results in a report of the Shapiro-Wilk statistic and its probability level. If the probability is lower than a critical level, the distribution is significantly different from normal. In this case, the relation between the level of α and the rigorousness of the test is reversed: a large α will result in a greater probability of rejecting the assumption of normality, and thus in a more conservative test. I recommend an α value of 0.01 for this test for sample sizes greater than 30. Testing for normality with small samples is not useful, because the power of the test is very low. On the other hand, very large samples are extremely sensitive to deviations from normality. In most cases, the principle of asymptotic normality will make most tests valid, even if normality of residuals is rejected formally. In such cases, it is not necessary to address non-normality, but the decision must be made by a person who understands the use of statistics. This is another clear example of the fact that following statistical procedures blindly may lead to incorrect conclusions.
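For readers working outside JMP or SAS, the same check can be sketched in Python; the file name below is a hypothetical export of the example data, and statsmodels and scipy are assumed to be available:

    # Minimal sketch: fit the SLR of Yield on P, obtain the residuals, and apply Shapiro-Wilk.
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    df = pd.read_csv("xmpl_Pyield.csv")          # hypothetical export of xmpl_Pyield.jmp

    fit = smf.ols("Yield ~ P", data=df).fit()    # simple linear regression
    resid = fit.resid                            # estimated errors (model dependent!)

    w, p = stats.shapiro(resid)                  # Shapiro-Wilk test of normality
    print(f"Shapiro-Wilk W = {w:.4f}, P = {p:.4f}")
    # With alpha = 0.01 (recommended above for n > 30), reject normality only if P < 0.01.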
582739608
7
Revised: 5/3/2017
582739608
If non-normality is detected, a transformation should be used according to the recommendations given
below in the section about transformations. Non-normality can result from the presence of outliers, as well as
from lack of homogeneity of variance. Although Tabachnick and Fidell (1996) seem to prefer transformations to
elimination of outliers, in some cases it is clear that one or two outliers are the source of the problem. In that
case, deletion of outliers may be preferable to a transformation. Unfortunately, data screening and testing of
assumptions is an iterative process with flexible rules, and it is not possible to give a recipe that is always valid
to determine a unique set of steps.
Multivariate normality is more difficult to test. Tests are based on the fact that if all variables are normal, then the squared Mahalanobis distance* (D²; http://www.isical.ac.in/prof.html) for each observation should have a χ² distribution. The test is based on determining the goodness of fit to a χ² by plotting the observed values of D² against the quantiles of a χ² with p degrees of freedom (p = number of variables). An example is presented under the PCA topic.
[P. C. Mahalanobis. On Tests and Measures of Group Divergence I. Journal of the Asiatic Society of Bengal, 26:541, 1930.]
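A minimal Python sketch of this graphical check, using illustrative simulated data (numpy, scipy, and matplotlib assumed):

    # Compute squared Mahalanobis distances and compare them with chi-square quantiles (a Q-Q plot).
    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    Y = rng.multivariate_normal([0, 0, 0], np.eye(3), size=100)   # illustrative (n x p) data

    p = Y.shape[1]
    center = Y.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Y, rowvar=False))
    dev = Y - center
    d2 = np.einsum("ij,jk,ik->i", dev, S_inv, dev)                # squared Mahalanobis D2

    d2_sorted = np.sort(d2)
    q = stats.chi2.ppf((np.arange(1, len(d2) + 1) - 0.5) / len(d2), df=p)

    plt.scatter(q, d2_sorted)                 # points should fall close to the 1:1 line
    plt.plot(q, q, "k--")
    plt.xlabel("Chi-square quantiles (p df)")
    plt.ylabel("Ordered squared Mahalanobis D2")
    plt.show()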
4:1.4 Linearity and lack of fit.
Most analyses assume that the relationship among variables is linear. Lack of linearity can be determined by
studying scatterplots or by analysis of lack of fit. Scatterplots can also be used to determine what type of
transformation may be necessary, as indicated in the section on transformations.
When replicate observations are available for at least some of the levels of X, it is possible to test the
hypothesis that a particular model, such as the linear model, is not a good fit to the data. This analysis of lack of
fit can also be performed when "near" replicates are available, where observations can be subdivided into many
groups such that most of the variance in X is among groups, and little variance in X is observed within groups.
The analysis of lack of fit is based on further subdividing the SSE into a portion called Pure Error and a portion
due to Lack of Fit. The variation due to pure error (SSPE) is the sum of squares of deviations of observations
about the average level of Y for each level of X. The variation due to lack of fit (SSLF) is the sum of squares of
the deviations of the average Y for each level of X about the value predicted by the model. Note that this
partition of SS can be applied to any model, not just the linear one.
eij = Yij − Ŷij = (Yij − Ȳj) + (Ȳj − Ŷij)

SSE = Σj Σi (Yij − Ȳj)² + Σj Σi (Ȳj − Ŷij)² = SSPE + SSLF,  where the sums run over i = 1, …, nj and j = 1, …, c.
The subscript j refers to the levels of X and it ranges from 1 to c; c being the number of different levels of X
present. This partition of the SSE also reflects the Ho and Ha for the test of lack of fit. Ho states that the
expected value of Y is a linear function of X, whereas Ha states that the expected values of Y for each X do not
necessarily fall on the line.
Ho: E{Y} = β0 + β1X, or equivalently, μj = β0 + β1Xj
Ha: E{Y} = μj, without any additional constraints on the relationship between μj and X.
If Ho is true, two independent estimates of the variance of the error can be obtained: one based on the deviations of the observations around the average for each level of X, and one based on the deviations of the Ȳj around the line. These are the mean square of the pure error (MSPE) and the mean square of the lack of fit (MSLF), respectively. If Ho is not true, then MSLF will tend to be greater than MSPE, and their ratio will be significantly
greater than the expected value of the F statistic. The analysis can be performed in the usual ANOVA form,
where SSE has n-2 degrees of freedom as usual (total number of observations minus one df for each
parameter), SSLF has c-2 degrees of freedom (number of levels of X minus the number of parameters
estimated), and SSPE has n-c degrees of freedom (number of observations minus c means estimated, one for
each level of X).
Source          SS                         df     MS
Regression      SSR  = Σ(Ŷij − Ȳ)²         1      MSR = SSR/1
Error           SSE  = Σ(Yij − Ŷij)²       n−2    MSE = SSE/(n−2)
  Lack of Fit   SSLF = Σ(Ȳj − Ŷij)²        c−2    MSLF = SSLF/(c−2)
  Pure Error    SSPE = Σ(Yij − Ȳj)²        n−c    MSPE = SSPE/(n−c)
Total           SSTO = Σ(Yij − Ȳ)²         n−1
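The partition above can also be computed directly. The following Python sketch assumes a DataFrame exported from the example data (hypothetical file name) with replicate observations at each level of P:

    # Lack-of-fit F test for the linear model, computed from SSPE and SSLF as defined above.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    df = pd.read_csv("xmpl_Pyield.csv")                         # hypothetical export
    fit = smf.ols("Yield ~ P", data=df).fit()                   # the linear model being checked
    df["yhat"] = fit.fittedvalues
    df["ybar_j"] = df.groupby("P")["Yield"].transform("mean")   # mean Y at each level of P

    n = len(df)
    c = df["P"].nunique()
    SSPE = ((df["Yield"] - df["ybar_j"]) ** 2).sum()            # pure error
    SSLF = ((df["ybar_j"] - df["yhat"]) ** 2).sum()             # lack of fit

    F = (SSLF / (c - 2)) / (SSPE / (n - c))
    P_value = stats.f.sf(F, c - 2, n - c)
    print(f"F(lack of fit) = {F:.2f}, P = {P_value:.4f}")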
JMP automatically produces a test and report for lack of fit if the data contain more than one observation with the same values for all X variables used in the model. In the Pyield example the result appears in the xmpl_Pyield.jmp: Bivariate window, but it is collapsed. The results can be displayed by clicking on the gray triangle to the left of the title "Lack of Fit."
The Lack of Fit analysis shows the partition of the total error into two components, Lack of Fit and Pure Error. The F-ratio tests whether the variance of the average yield for each level of P around the value predicted by the line is significantly greater than the variance of yield around the average for each level of P. In this example, there is no significant lack of fit (P > 0.05), so we cannot reject the linear model, in spite of the fact that we know that the correct model is quadratic.
The Max RSq indicates the maximum proportion of the total variance in yield that could be explained by a
model that goes exactly through the average yield for each level of P. Such a model is the one used when one
considers each level of P as a discrete treatment in an ANOVA, as shown in the following results obtained after
changing P from a continuous variable to a nominal one.
The variable type is changed by clicking on the blue c to the left of P and selecting Nominal from the drop-down menu, as shown in the figure.
4:1.4.1.1 General linear test
It is worth noting that the null hypothesis postulates a model that is more restrictive than the alternative
hypothesis. The model under Ho is also called "reduced" model, because it has fewer parameters (only two)
than the model under Ha, which is called the "full" model and includes as many parameters as possible (one
mean for each level of X). Usually, as parameters are added to a model, the SSR increases, but the df
decrease. Thus, the whole test of lack of fit can also be thought of as checking whether the improvement in the SSR achieved by adding parameters is worth the associated loss of df.
- When yield is analyzed as a function of nominal P, the error term in the ANOVA is equal to the Pure Error term in the analysis of lack of fit.
- The R-square is the same as the Max RSq in the analysis of lack of fit.
- These facts are pointed out to emphasize the relationship among analyses, but are not involved in any testing or estimation directly.
- From a practical point of view, it is obvious that there is a curvilinear relationship between yield and P. Unless we have a priori reasons to keep the linear model, we should try a different model even though linearity was not rejected.
This alternative point of view happens to be the most general version of the vast majority of F tests we
encounter in statistics, and is called the "General Linear Test." In order to compare any two models, one of
which is a reduced version of the other, the general linear test calculates an F value as follows.
Ho: Reduced model (R)
Ha: Full model (F)
F = [(SSER − SSEF) / (dfeR − dfeF)] / [SSEF / dfeF] = [(SSER − SSEF) / (dfeR − dfeF)] / MSEF
This F is compared with the table value with (dfeR-dfeF) degrees of freedom in the numerator and dfeF
degrees of freedom in the denominator. If the calculated value is greater than the table value, Ho is rejected.
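A Python sketch of the general linear test for the yield example, where the reduced model is the straight line and the full model fits one mean per level of P (file name hypothetical, statsmodels assumed):

    # General linear test: compare nested models via the F ratio defined above.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    df = pd.read_csv("xmpl_Pyield.csv")                    # hypothetical export

    reduced = smf.ols("Yield ~ P", data=df).fit()          # Ho: E{Y} = b0 + b1*P
    full = smf.ols("Yield ~ C(P)", data=df).fit()          # Ha: one mean per level of P

    num = (reduced.ssr - full.ssr) / (reduced.df_resid - full.df_resid)
    den = full.ssr / full.df_resid                         # = MSE of the full model
    F = num / den
    P_value = stats.f.sf(F, reduced.df_resid - full.df_resid, full.df_resid)
    print(f"F = {F:.2f}, P = {P_value:.4f}")               # equals the lack-of-fit F here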
4:1.5 Homogeneity of variance-covariance matrices.
In univariate cases the assumption of homogeneity of variance-covariance matrices is equivalent to
assuming homogeneity of variance. Homogeneity of variance in grouped data is assessed among groups. For
ungrouped data, the error terms can be partitioned into two groups, one for low levels of X and the other for high
levels of X. Then, Levene's test for homogeneity of variance is applied as with grouped data. This test can be
requested with the HOVTEST option of PROC ANOVA in SAS (for details, search for the terms "Levene" and "HOVTEST" in SAS Help). Lack of constant variance in linear regression can be detected by examination of scatterplots of errors vs. predicted Y and vs. X (Figure 4-4). A fan-shaped plot indicates the need for a variance-stabilizing transformation or for weighted regression.
Figure 4-4. Scatter plots (panels A, B, and C) showing heterogeneous variance and nonlinearity.
In JMP, you can test for homogeneity of variance in a SLR situation by separating the data into groups by level of X, for example, low, medium, and high. Each group receives a different value of a new nominal variable, say "group," and the homogeneity of variance among groups is tested with the Fit Y by X platform. Select group as the X variable and the residuals of the model (previously saved) as the Y variable. Once the analysis window appears, click on the red triangle at the top and select Unequal Variances. A series of statistics is calculated to test the homogeneity of variance. As an exercise, select the JMP Help menu and search for Homogeneity of Variance Tests. The test is explained in detail in the help page.
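The same grouping approach can be sketched in Python with scipy's Levene test; the file name and the choice of three equal-width groups are illustrative assumptions:

    # Levene's test on SLR residuals after grouping observations by level of X.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    df = pd.read_csv("xmpl_Pyield.csv")                      # hypothetical export
    df["resid"] = smf.ols("Yield ~ P", data=df).fit().resid  # saved SLR residuals

    # Nominal grouping variable: low / medium / high thirds of the X range
    df["group"] = pd.cut(df["P"], bins=3, labels=["low", "medium", "high"])

    groups = [g["resid"].values for _, g in df.groupby("group", observed=True)]
    stat, p = stats.levene(*groups)                          # test of equal variances
    print(f"Levene W = {stat:.2f}, P = {p:.4f}")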
In true multivariate situations, the analysis of homogeneity of variance-covariance matrices means that the patterns of variances and covariances among variables are the same in each group considered. In other words, the scatterplots should be similar in all groups. However, the examination of scatterplots is only feasible for simple groupings and no more than 3 variables, by using the 3-D scatterplot option under GRAPH – SPINNING PLOT in JMP or "Interactive Analyses" in SAS 7. Statistical tests, such as Box's M, tend to be very conservative and very sensitive to outliers. This should be weighed against the fact that lack of homogeneity of variance-covariance matrices weakens, but does not invalidate, the analyses. SAS offers these tests through options, such as the POOL=TEST option in discriminant analysis.
4:1.6 Outliers.
Outliers are observations that are not consistent with the rest of the data. This does not mean that they are
automatically removed from the data set, but that the cause and effects of these observations should be
carefully considered.
There may be outliers in the Y or in the X dimensions. An outlier in the Y dimension is a value that falls outside the reasonably expected range of Y, of a combination of Y's (for true multivariate analyses), or of the error. An outlier in the X dimension is an extreme value that deviates greatly from the average for the predictor. In SLR, a value that is an outlier only in Y tends to increase the variance of the parameters, without a big impact on parameter estimates. A value that is an outlier only in the X dimension tends to reduce the variance of the parameters, without much impact on parameter estimates, whereas an observation that is an outlier in both X and Y changes the parameter estimates and increases their variance.
Often, outliers result from incorrectly coded observations and incorrect data entry. Both of these cases have clear solutions. Outliers can also appear because they do not belong to the same population as the one addressed by the rest of the sample. Suppose that you are trying to determine the average size of a species of aphid in a field where many species are present. Extremely large or small aphids may come from a different species in the same field. Because aphid species are hard to determine, one cannot be certain whether an outlier is or is not from the target species.
Finally, outliers may be caused by the population having a distribution that is not normal, where extreme
values are more common than in the normal distribution. In this case, deleting the outliers can weaken the link
between sample and population. Consideration of transformations is a more desirable solution.
The first step in the analysis is to identify outliers. Then one proceeds to apply transformations or delete
those outliers that are extreme, depending on the situation. Typically, the methods and equations used for
multivariate outliers also apply to univariate situations, because univariate situations can be considered to be
special cases of multivariate ones. However, for the sake of clarity, we present them separately.
4:1.6.1 Univariate outliers
When there is only one variable under consideration, such as the errors in simple linear regression, one can
apply univariate methods to detect outliers in the Y or in the X dimension. Because in SLR X is not considered
to be a random variable, but a set of fixed values, we do not consider both X and Y at the same time, and no
probability is associated with the identification of outliers in the X dimension.
Values in the Y dimension that have a low probability are identified as outliers. For univariate outliers in SLR, the deleted studentized residual ti of each observation is used:
ti = di / s{di}

where

di = Yi − Ŷ(i) = ei / (1 − hii)

s²{di} = MSE(i) / (1 − hii)

MSE(i) = [(n − p) MSE − ei² / (1 − hii)] / (n − p − 1)
In these equations, the subscript (i) indicates the value of the statistic calculated while holding the
observation under consideration out of the sample; n is the sample size, and p is the total number of
independent parameters in the model. Thus, di is the difference between the observed value for Y and the
expected value predicted by a model obtained by temporarily holding observation i out of the sample. The value
hii is the leverage of each observation, and it comes from the main diagonal of the H or Hat matrix. As shown in
the equation, di can be calculated from the regular error and the leverage for observation i. The variance for di is
a function of the MSE obtained with observation i held out. This variance can also be calculated, as shown in
the equations above, as a function of the regular MSE, the error and the leverage for each observation. In JMP,
we calculate the studentized deleted residuals by creating a new column and entering the formulas from above.
Figure 4-5. This figure has simulated data for yield response to P application. Three points that do
not belong to the original population have been added to demonstrate the detection of outliers
and its interaction with choice of model.
Under the assumption that the errors are normally distributed with homogeneous variance, the deleted studentized residual should have a t distribution with n−p−1 degrees of freedom. The degrees of freedom are one less than usual because MSE(i) is calculated with n−1 observations, given that observation i is held out. Tabachnick and Fidell (1996) recommend that any observation that has a ti greater than the two-tailed table value for P = 0.001 is an outlier and should be considered for deletion. Neter et al. (1996) recommend using a Bonferroni correction for the test with α = 0.10. The Bonferroni correction reduces the actual α such that the probability of erroneously identifying a point as an outlier in the whole sample remains close to the nominal value. This correction is necessary because as sample size increases, the probability that at least one point will deviate greatly from the expected value also increases, even if no true outliers are present. The correction is achieved by dividing the nominal value of α by n, and then using the new probability level to look up the critical value in the table. The fixed value of α = 0.001 is equivalent to using the Bonferroni correction with α = 0.10 when n = 100. For smaller sample sizes the Bonferroni criterion is more liberal; for larger n's, it is more conservative.
In SAS, the deleted studentized residuals are requested in the MODEL statement with the option "INFLUENCE." The critical level according to the most conservative criterion is t(0.9995, 33) = 3.61, which points to observation number 35 as a clear outlier (Figure 4-5). This observation is also an outlier in the X dimension, because its leverage hii is 0.289. The critical value for the leverage when X is not random is 2(m+1)/n, where m is the number of X variables. In the example, m = 1 and n = 36, so any leverage over 0.111 identifies an outlier in the X dimension. This test can also be used in situations where there are many predictors, or X variables that are considered to have fixed values.
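A Python sketch of this univariate screen, using statsmodels to obtain the deleted studentized residuals and leverages and applying the Bonferroni criterion described above (file name hypothetical):

    # Identify Y outliers (large |ti|) and X outliers (large leverage) after an SLR fit.
    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf
    from statsmodels.stats.outliers_influence import OLSInfluence

    df = pd.read_csv("pyield_with_outliers.csv")          # hypothetical file
    fit = smf.ols("Yield ~ P", data=df).fit()
    infl = OLSInfluence(fit)

    t_i = infl.resid_studentized_external                 # deleted studentized residuals
    h_ii = infl.hat_matrix_diag                           # leverages

    n = len(df)
    p = 2                                                 # parameters in SLR (intercept, slope)
    alpha = 0.10
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)  # Bonferroni critical value
    h_crit = 2 * p / n                                    # 2(m+1)/n with m = 1 predictor

    print("Y outliers:", list(df.index[abs(t_i) > t_crit]),
          " X outliers:", list(df.index[h_ii > h_crit]))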
4:1.6.2 Multivariate outliers
Figure 4-6. Multivariate outliers are identified by examining their distance to the centroid for the distribution.
The Euclidean distance is corrected by the pattern of covariation among variables. Point b is further from
the centroid than a, but it is statistically closer. Whereas b is not an outlier, a is clearly outside the
expected distribution.
Multivariate outliers can be detected by calculating the Mahalanobis distance D for each observation. This distance is a measure of the "statistical" distance between each observation and the centroid of the group being considered. Suppose you are performing a MANOVA where the Y's are seed number and weight per seed of a species of interest (Figure 4-6), and the X variable is the level of soil fertility and water availability. In this example, X is a categorical or "class" variable that, say, takes three values: low, medium, and high. In order to detect outliers in the Y dimensions, a centroid or vector of average values for each Y is calculated for each group. The centroids are the best estimates of the expected value of the vector of random variables, and serve the same function as Ŷ in the univariate case, where we considered deviations about a straight line. After calculating the centroids, the deviations of each observation from its group's centroid are calculated. Analogously to the deleted residuals, a robust or "jackknifed" squared Mahalanobis distance is calculated for each observation while holding that observation out of the sample. This prevents potential outliers from distorting the very detection of outliers.
Figure 4-6 shows the simulated data for the group of medium fertility and water availability. The centroid for this group is the point (972, 1173). The Euclidean or geometric distances from each of two potential outliers, observations a and b, are represented by the lines from each point to the centroid. Clearly, point b is further away from the centroid than a. The jackknifed squared Mahalanobis distances for points a and b are 31 and 10, respectively, which is counter to the ordering of the Euclidean distances. The difference is due to the fact that the two variables, seed weight and seed number, exhibit a strong positive covariance within this group. As is intuitively clear from the scatterplot, given the dispersion and correlation between the variables, point a is a lot less likely than point b. This is reflected in the Mahalanobis distance.
Assuming multivariate normality for the random vector of Y variables (in the example the vector is {seed number, seed weight}), the squared Mahalanobis distance and its jackknifed version should follow a χ² distribution with 2 degrees of freedom (df = number of variables). This distribution can be used in the same way the t distribution was used in the univariate situation. A critical value of χ² is determined either by a set probability α = 0.001 or by using the Bonferroni correction with α = 0.10.
Outliers are identified by testing the following hypothesis for each observation:
Ho: Yi follows the same multivariate normal distribution as the rest of the sample.
Ha: Yi does not follow the same multivariate normal distribution as the rest of the sample.
Let Y1 and Y2 be random variables that have a bivariate normal distribution within a group, and let Yi = (Y1i, Y2i) be a random observation, i = 1, …, n. The jackknifed squared Mahalanobis distance is defined as

d²(i) = (Yi − Ȳ(i))' S⁻¹ (Yi − Ȳ(i)),

which should follow a χ² distribution with df = 2.
The simulated seed weight example has 250 observations. Observations a and b were not considered in the sample to simplify the calculations in xmpl_MVoutl.xls, but a strict calculation should have included point b when calculating d² for a, and vice versa. Using the Bonferroni approach, the critical level for the Mahalanobis distance is the χ² with 2 degrees of freedom for α = 0.10/250 = 0.0004. The test is one-tailed, because the squared distance can only be positive, and values close to zero indicate that the observations are very much within the range expected under the assumptions and Ho. The critical value is 15.65, indicating that observation a is an outlier but observation b is not.
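A Python sketch of the jackknifed distance calculation; the centroid comes from the example above, but the covariance matrix and the simulated data are purely illustrative:

    # Jackknifed squared Mahalanobis distance for one group: each observation is held out
    # when computing the centroid and covariance matrix used for its own distance.
    import numpy as np
    from scipy import stats

    def jackknifed_d2(Y):
        n, p = Y.shape
        d2 = np.empty(n)
        for i in range(n):
            others = np.delete(Y, i, axis=0)                  # hold observation i out
            center = others.mean(axis=0)
            S_inv = np.linalg.inv(np.cov(others, rowvar=False))
            dev = Y[i] - center
            d2[i] = dev @ S_inv @ dev
        return d2

    rng = np.random.default_rng(1)
    Y = rng.multivariate_normal([972, 1173], [[8000, 6000], [6000, 8000]], size=250)  # illustrative

    d2 = jackknifed_d2(Y)
    crit = stats.chi2.ppf(1 - 0.10 / len(Y), df=Y.shape[1])   # Bonferroni, alpha = 0.10 (= 15.65 here)
    print("critical value:", round(crit, 2), " outliers:", np.where(d2 > crit)[0])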
Through the Multivariate platform in JMP one can obtain the Mahalanobis distance (D) and the jackknifed distance. The plot also shows a dotted line that represents the critical D value for α = 0.05. This distance is calculated as F times the number of variables, where F is the table value for the desired probability level, with df in the numerator equal to the number of variables (nvars) and df in the denominator equal to n − nvars − 1.
A question that is relevant at this point is: why did we use the Bonferroni correction with n = 250 when in fact we only tested 2 observations? This is because the 2 observations were picked after looking at the scatterplot. The only time that the n for the Bonferroni correction equals the actual number of tests performed is when the identity of the observations to be tested is established a priori, before obtaining or looking at the results. In such a case, SAS may print out all distances anyway, and one can catch a glimpse of a significant distance that was not in the list prepared a priori. The proper critical value for testing that observation has to be determined with n = total sample size.
4:1.7 Transformations.
Transformations can be used to address the following problems:
1. Lack of normality
2. Lack of linearity
3. Heterogeneous variance
4. Outliers
When errors (or Y's) are not normally distributed, a transformation can fix the problem. Figure 4-7, based on Tabachnick and Fidell (2001), gives guidelines for choosing a transformation.
Figure 4-7. Original distributions (pdf’s) and common transformations to achieve normality.
The Log transformation is particularly useful to produce normality in skewed distributions and in stabilizing
variance. The strength of the log transformation over different ranges of the variable can be regulated by
applying a linear transformation of the original variable before taking the log,
Y'=Ln(c0+c1Y)
where c0 and c1 are coefficients that can be adjusted by trial and error. The first coefficient "moves" the
whole distribution to different locations of the log transformation, modifying the "average" intensity of the
transformation. The second coefficient modulates the spread of the distribution, thus regulating the difference in
intensity of the log transformation between the low and the high range of Y.
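A small Python sketch of this trial-and-error tuning, using illustrative right-skewed data and arbitrary coefficient choices:

    # Compare the skewness of the original and shifted-log-transformed variable
    # for a few trial values of c0 and c1 in Y' = ln(c0 + c1*Y).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    Y = rng.lognormal(mean=2.0, sigma=0.8, size=200)         # a right-skewed variable

    print("original skewness:", round(stats.skew(Y), 2))
    for c0, c1 in [(0.0, 1.0), (10.0, 1.0), (1.0, 0.2)]:     # trial-and-error choices
        Yp = np.log(c0 + c1 * Y)
        print(f"c0={c0}, c1={c1}: skewness = {stats.skew(Yp):.2f}")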
Figure 4-8 shows typical patterns of errors. The first graph shows errors that have constant variance and no
need for nonlinear terms or transformations. The second plot shows a case where there is confounding between
the effects of X and time or spatial sequence. This may have resulted from a poorly planned sampling scheme.
In this case, it is necessary to remove the effects of time before analyzing the effects of X. The third plot shows
a case in which there is a clear curvilinear effect. In this case, the addition of a quadratic X term can fix the
model. The addition of a quadratic X term has similar effects to applying a square root transformation on the Y
variable, except that a transformation of Y can change its distribution in an undesirable way. Finally, the last plot
shows a case of decreasing variance which has to be addressed by using a transformation of the Y variable.
Transformations can be applied to X, to Y, or to both. Moreover, multiple transformations can be applied to the same variable. As a general rule, when the distribution of the errors or of Y is normal, it is better to address lack of linearity by transforming the X variable. Transformations to fix non-linearity are suggested by the shape of the scatterplot in Figure 4-9.
Figure 4-8. Typical problems indicated by the distribution of points in a scatter plot.
Figure 4-9. Recommended transformations of X to correct nonlinearity, suggested by the shape of the scatterplot: X' = Ln(X), X' = sqrt(X), X' = X², X' = exp(X), X' = 1/X, X' = exp(−X).
When the relationship is nonlinear and the variance of the error appears to increase with increasing values of predicted Y, as is the case in all graphs of Figure 4-7, a transformation of Y can fix the problem. One can try the log, inverse, and square root transformations and select the one that yields the best results. The Guided Analysis feature of SAS 7 automatically performs a likelihood test to suggest the transformation that will have the greatest chance of fixing the problem.
JMP offers the possibility of calculating Box-Cox transformations. These constitute a family of power transformations that have the following functional form, where λ is a parameter adjusted by maximum likelihood:

Y' = Y^λ   if λ ≠ 0
Y' = ln(Y)  if λ = 0
Note that Y squared, the square root of Y, and 1/Y are all members of this family. These transformations facilitate the correction of problems with the assumptions. In general, a precise value of λ is not necessary, so it is recommended that one select a value that is easier to interpret and that is close to the fitted one. As an exercise, use JMP to determine the best transformation for the data in homework 01, which appeared to exhibit non-linearity and heterogeneity of variance. Try to interpret the selected λ.
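Outside JMP, a maximum-likelihood estimate of λ can be obtained, for example, with scipy; a minimal sketch with illustrative data:

    # Box-Cox transformation with lambda estimated by maximum likelihood.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    y = rng.lognormal(mean=1.0, sigma=0.6, size=150)     # skewed, strictly positive data

    y_transformed, lam = stats.boxcox(y)                 # ML estimate of lambda
    print("fitted lambda:", round(lam, 2))
    # In practice, round to an interpretable value near the fit, e.g. 0 (log),
    # 0.5 (square root), or -1 (inverse), as recommended above.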
4:1.8 Multicollinearity and singularity.
Regression and linear models in general require that the matrix X'X be inverted, where X is the design or data matrix. In SLR the matrix X'X is small and simple (2x2), and it always has rows and columns that are linearly independent. However, in multiple linear regression (MLR) there are situations when the columns of X are related to each other. When a column of X can be expressed as an exact linear combination of the other columns, the X'X matrix cannot be inverted because its determinant is 0. The matrix is said to be "singular." In most cases, X's are not perfect linear combinations of other X's, but they may be close. As the variance in any X can be explained more and more by the other X's, the determinant of the matrix tends to zero and causes problems when inverting X'X. For example, the determinant can be so close to zero that just the usual rounding off that computers perform can have major impacts on the results.
This is known as collinearity or multicollinearity, and it is a major problem in studying relationships among variables, particularly in observational experiments. Multicollinearity can prevent us from determining the true effects of factors on responses, and it is more a problem than an assumption. Identification and measurement of multicollinearity will be addressed in detail under the subjects of PCA and MLR.
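A small numeric illustration (not from the original text) of how near-collinearity drives the determinant of X'X toward zero and inflates its condition number:

    # As one column of X becomes nearly a linear combination of another, det(X'X) approaches
    # zero and the inversion of X'X becomes numerically unstable. Purely illustrative data.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    x1 = rng.normal(size=n)

    for noise in [1.0, 0.1, 0.001]:
        x2 = x1 + rng.normal(scale=noise, size=n)       # x2 increasingly collinear with x1
        X = np.column_stack([np.ones(n), x1, x2])       # design matrix with intercept
        XtX = X.T @ X
        print(f"noise={noise}: det(X'X) = {np.linalg.det(XtX):.3g}, "
              f"condition number = {np.linalg.cond(XtX):.3g}")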