MULTIVARIATE NORMAL PLOTTING USING ORDERED MAHALANOBIS DISTANCES

Namjun Kang, Syracuse University

I. Introduction

The assumption of multivariate normality underlies much of the standard multivariate statistical methodology. The effects of departures from normality are not easily and clearly understood. Thus, it would be useful to verify the reasonableness of assuming normality for a given body of multivariate data. Such a check would be helpful in guiding the subsequent analysis of the data, by suggesting the need for transformation of the data to make them more nearly normally distributed.

The methods for assessing normality can be grouped into two genres: i) single-statistic-based formal tests such as multivariate skewness and kurtosis (Mardia, 1970; Kang & Kalinoski, 1987); ii) graphical techniques using a probability plot. Although a probability plot does not provide a formal test, probability plotting techniques have proved very valuable for the detection of systematic non-normality and the individual checking of outlying values. Marked skewness, such as might suggest the use of a transformation of the variables, is shown up by simple curvature of the plot, and the presence of kurtosis or of outlying values also might be indicated (Healy, 1968).

The purpose of this paper is to develop an easily implementable SAS/IML program that can provide the multivariate normality testing probability plot. There are several graphical techniques available for evaluating multivariate normality (Gnanadesikan, 1977, pp. 168-175). The graphical technique proposed here is based on the distribution of the ordered Mahalanobis distances of the points from their mean, and involves plotting these distances against chi-square percentiles.

Why this particular probability plot? This plot does have the endorsement of several statisticians (Healy, 1968; Johnson and Wichern, 1982) and is easy to use, which means there is a good chance applied researchers will use it. Also, in this paper, the use of the Pearson product moment correlation coefficient is examined as a technique for constructing a test statistic based on the information contained in probability plots (Filliben, 1975; Looney and Gulledge, 1985).

II. Probability Plots and Plotting Positions

For evaluating multivariate normality, Andrews et al. (1973) have suggested an informal graphical procedure that utilizes a distance-from-mean representation of multivariate data. The distance-from-mean in multivariate data, or Mahalanobis distance,

$D_i^2 = (Y_i - \bar{Y})' S^{-1} (Y_i - \bar{Y})$,

where $\bar{Y}$ is the sample mean vector and $S$ is the sample covariance matrix, will have approximately a chi-squared distribution with v degrees of freedom in the v-variate case. The exact marginal distribution of $D_i^2$ is known to be a constant multiple of a beta rather than a chi-square distribution (Gnanadesikan and Kettenring, 1972). But when the population is multivariate normal and both the sample size (n) and (n-v) are greater than about 25, the difference between using the beta and the chi-squared approximation appears to be insignificant (Johnson and Wichern, 1982).

Often it will be informative to supplement the information about the distances of the individuals from the mean by some consideration of angular position (Gnanadesikan, 1977, pp. 172-174). However, if v>3, it is extremely difficult to calculate the angles of each observation. If v=3, we might use cylindrical or spherical co-ordinates. Thus the angular plot is not considered in this paper.

In general, to construct a probability plot, the ordered sample statistic $Y_i$ is plotted (usually on the vertical axis) against $X_i = F^{-1}(P_i)$ (usually on the horizontal axis), where $P_i$ is an estimate of the plotting position and $F^{-1}$ is the inverse of a distribution function. In the Mahalanobis probability plot, $F^{-1}$ is the inverse of the chi-square distribution. The plotting formula has been described by Blom (1958) as

$P_i = (i - c)/(n + 1 - 2c)$,

where c is a function of the distribution being sampled, and $0 \le c \le 1$.

In practice, many authors simplify the use of the above formula by assuming c to be a constant. For example, Wilk & Gnanadesikan (1968) and Stevenson (1982) suggested using

$P_i = (i - .5)/n$

by setting c=.5. Also, Filliben (1975) proposed

$P_i = (i - .3175)/(n + .365)$

by using c=.3175. Although the different constants (c) give similar plotting positions for order statistics $Y_i$ near i=n/2, they can lead to quite different plotting positions for the extreme values near i=1 and i=n, especially with small samples (Mage, 1982). After reviewing different plotting positions, Kimball (1960) recommended that an approximation of $P_i$ developed by Blom (1958, p. 71) be used as a plotting position:

$P_i = (i - .375)/(n + .25)$

This plotting position, which corresponds to c=.375, has seen increasing acceptance among practitioners in recent years; for example, the normal probability plot produced by PROC UNIVARIATE of SAS is based on this plotting position (SAS Statistics Version 5, 1985, p. 1188). Thus, in this paper, the plotting position proposed by Blom (1958) will be used in probability plots.

The Mahalanobis distance chi-square probability plot is constructed as follows:

1) The distances are ordered from smallest to largest as $D_{(1)}^2 \le D_{(2)}^2 \le \dots \le D_{(n)}^2$.

2) Then graph the pairs $(D_{(i)}^2, \chi_{P_i}^2)$, where $\chi_{P_i}^2$ is the $P_i$ percentile of the chi-square distribution with 'df' degrees of freedom.

III. The Probability Plot Correlation Test for Normality

The use of a probability plot for providing a qualitative estimate of the goodness of fit to normality has a major disadvantage. As we have mentioned, if the hypothesized normal distribution is the correct one, then the plot of $Y_i$ against $X_i = F^{-1}(P_i)$ will be approximately linear. However, there is no simple objective way to judge how well the data points conform to the straight line (Mage, 1982). This lack of objectivity may be confusing to the users of probability plots.

Therefore, Filliben (1975) and Looney & Gulledge (1985) suggested that one use the Pearson product moment correlation between $Y_i$ and $X_i$ to measure the linearity of the probability plot, because the correlation is a simple and straightforward measure of linearity between any two variables. Since the $Y_i$ are highly correlated and heteroscedastic, however, the usual distributional results for the correlation coefficient do not apply. Instead, empirical sampling methods must be used to determine the null distribution of the test statistic. Filliben (1975) and Looney & Gulledge (1985) have already tabulated a normality test statistic for the probability plot correlation when a least squares line is computed.

Following Filliben's and Looney & Gulledge's lead, the correlation coefficient from a plot will be used as an aid in interpreting the linearity of the probability plot. Here, Looney & Gulledge's table will be used because the plotting position recommended by Blom (1958) was adopted in tabulating it.

IV. Description of Program

The SAS/IML code for generating the Mahalanobis chi-square probability plot is presented in the Appendix. The program uses the graphic routines in SAS/IML to divide the screen into 4 subplots. The first plot (upper-left) represents the chi-square plot for all observations. After removing the observation that has the largest Mahalanobis distance, the second plot (upper-right) is created using (n-1) observations. Again, after deleting the observation that has the largest value among the (n-1) distances, the third plot is drawn in the lower-left region. In the third plot, the number of observations is reduced to (n-2). The fourth plot (lower-right) is plotted by removing the observation having the largest distance on the third plot. On each plot, the correlation coefficient between the ordered Mahalanobis distances and the order statistics based on the chi-square distribution is printed. The program also prints the original observation numbers of the four largest Mahalanobis distances.

V. Application of Program

The data in Figure 1 have 100 observations from a 5-variate independent normal distribution. Even the first (upper-left) plot appears to be reasonably linear, exhibiting no marked departures of the Mahalanobis distances from null expectation. The reported correlation coefficient for the first plot is .9895. From the normality test table for the correlation tabulated by Looney and Gulledge (1985), it is seen that .9895 is above the 5% critical value; in fact the observed correlation falls between the 10% and 25% points of the null distribution. On the basis of the correlation test, there is no evidence to contradict the hypothesis of normality.

After removing observation number 58, which has the largest Mahalanobis distance, the linearity of the plot is slightly improved, as indicated by the correlation coefficient reported on the second (upper-right) plot (.9924). Deleting observation number 92 on the second plot degrades the linearity of the plot; that is, from .9924 to .9894. The same decreasing pattern holds on the fourth plot after removing observation number 66 from the third plot.

Figure 2 is drawn using mildly non-normal data. Among the five variables, one is a Cauchy random variate with location parameter 0 and scale parameter 1. Under the null hypothesis of normality, the plot should have a reasonably linear form. All plots in Figure 2, however, appear quite non-linear, especially at the upper end. The skewness of the data is clearly evident in the plots. Also, the correlation test shows a significant departure from normality in these data; the observed correlation is far below the 5% cut-off. After removing seemingly outlying observations, the linearity decreases rapidly: from .9214 to .8067 to .7883 to .7463. Thus it is quite reasonable to reject the multivariate normality hypothesis on the grounds of both the non-linear configuration of the plots and the normality test of the correlation coefficient.

VI. Discussion

Instead of using SAS/IML code to calculate Mahalanobis distances and to draw a chi-square plot, PROC REG and PROC GPLOT with the ANNOTATE option in SAS/GRAPH can be used to generate the same plots proposed here. The Mahalanobis distance is computed using the following equation:

$D_i^2 = (n - 1)(h_{ii} - 1/n)$

In PROC REG, the $h_{ii}$ (diagonals of the HAT matrix) can be easily output to a new data set using the OUTPUT option.

One problem with the Mahalanobis chi-square probability plot and the accompanying correlation test is that it may not identify those Mahalanobis distances that are distorting the property of multivariate normal distribution. Extreme values with large Mahalanobis distances may still fall close to the best fitted regression line on the plot, thereby fitting in consistently with the correlation. In this vein, Comrey (1985) proposed a different distance measure to eliminate such a problem. The probability plot based on this measure may be easily implemented.

Acknowledgement:

The author would like to thank Dr. Ronald Kalinoski for his encouragement and helpful comments.

SAS, SAS/IML and SAS/GRAPH are registered trademarks of SAS Institute Inc., Cary, NC, U.S.A.

Bibliography

Andrews, D. F., Gnanadesikan, R., and Warner, J. L., "Methods for assessing multivariate normality," in Multivariate Analysis III, NY: Academic Press, 1973. 95-116.

Blom, G., Statistical Estimates and Transformed Beta Variables, NY: John Wiley, 1958.

Comrey, A. L., "A method for removing outliers to improve factor analytic results," Multivariate Behavioral Research, Vol. 20, 1985. 273-281.

Filliben, J. J., "The probability plot correlation coefficient test for normality," Technometrics, Vol. 17, No. 1, 1975. 111-117.

Gnanadesikan, R., Methods for Statistical Data Analysis of Multivariate Observations, NY: John Wiley and Sons, 1977.

Gnanadesikan, R., and Kettenring, J. R., "Robust estimates, residuals, and outlier detection with multiresponse data," Biometrics, Vol. 28, 1972. 81-124.

Healy, M. J. R., "Multivariate normal plotting," Applied Statistics, Vol. 17, 1968. 157-161.

Johnson, R., and Wichern, D., Applied Multivariate Statistical Analysis, Englewood Cliffs, N.J.: Prentice Hall, 1982.

Kang, N., and Kalinoski, R., "Measures of multivariate skewness and kurtosis," SUGI 12 Proceedings, 1987. 1178-1183.

Kimball, B. F., "On the choice of plotting positions on probability paper," Journal of the American Statistical Association, Vol. 55, 1960. 546-560.

Looney, S. W., and Gulledge, T. R., Jr., "Use of the correlation coefficient with normal probability plots," The American Statistician, Vol. 39, 1985. 75-79.

Mage, D. T., "An objective graphical method for testing normal distributional assumptions using probability plots," The American Statistician, Vol. 36, 1982. 116-120.

Mardia, K. V., "Measures of multivariate skewness and kurtosis," Biometrika, Vol. 57, 1970. 519-530.

Figure 1. TEST OF MULTIVARIATE NORMALITY: four chi-square probability plots of ordered Mahalanobis distances (vertical axis: DISTANCE; horizontal axis: CHISQ), with R=0.9895, R=0.9924, R=0.9894, and R=0.9819 printed on the panels.

Figure 2. TEST OF MULTIVARIATE NORMALITY: four chi-square probability plots of ordered Mahalanobis distances (vertical axis: DISTANCE; horizontal axis: CHISQ), with R=0.9213, R=0.8067, R=0.7883, and R=0.7463 printed on the panels.
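
The construction described in Sections II through IV (Mahalanobis distances, Blom plotting positions with c=.375, chi-square percentiles, the plot correlation, and the successive removal of the largest distance) can be sketched outside SAS as follows. This is a minimal illustration in Python using only the standard library; it is not the SAS/IML program from the Appendix. All function names (`blom_positions`, `chi2_cdf`, `chi2_quantile`, `solve`, `mahalanobis_sq`, `pearson`, `chi2_plot`, `peel`) are hypothetical, the chi-square percentile is obtained by a simple series and bisection rather than a library routine, and the sketch assumes that distances are recomputed from the reduced sample after each removal.

```python
import math
import random


def blom_positions(n, c=0.375):
    """Plotting positions P_i = (i - c) / (n + 1 - 2c) for i = 1..n."""
    return [(i - c) / (n + 1 - 2 * c) for i in range(1, n + 1)]


def chi2_cdf(x, df):
    """Chi-square CDF from the series for the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 0
    while term > total * 1e-15:
        k += 1
        term *= z / (a + k)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection; p must lie strictly between 0 and 1."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis_sq(data):
    """D_i^2 = (y_i - ybar)' S^{-1} (y_i - ybar), S with divisor n - 1."""
    n, v = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(v)]
    dev = [[row[j] - mean[j] for j in range(v)] for row in data]
    s = [[sum(d[j] * d[k] for d in dev) / (n - 1) for k in range(v)]
         for j in range(v)]
    return [sum(di * xi for di, xi in zip(d, solve(s, d))) for d in dev]


def pearson(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chi2_plot(data, c=0.375):
    """Return ordered distances, chi-square percentiles, and their correlation."""
    d2 = sorted(mahalanobis_sq(data))
    q = [chi2_quantile(p, len(data[0])) for p in blom_positions(len(d2), c)]
    return d2, q, pearson(d2, q)


def peel(data, steps=4):
    """Correlation and 1-based number of the largest-distance case per panel."""
    rows, panels = list(range(len(data))), []
    for _ in range(steps):
        sub = [data[i] for i in rows]  # assumption: distances recomputed on the reduced sample
        d2 = mahalanobis_sq(sub)
        worst = rows[max(range(len(d2)), key=d2.__getitem__)]
        panels.append((chi2_plot(sub)[2], worst + 1))
        rows.remove(worst)
    return panels


if __name__ == "__main__":
    random.seed(1)  # simulated 5-variate independent normal data, n = 100
    sample = [[random.gauss(0, 1) for _ in range(5)] for _ in range(100)]
    for r, obs in peel(sample):
        print(f"R={r:.4f}  largest distance: observation {obs}")
```

Each tuple returned by `peel` corresponds to one of the four panels: the correlation printed on the plot and the original observation number of the case with the largest distance in that panel. The resulting correlation would still have to be compared with the tabulated critical values of Looney and Gulledge (1985), which are not reproduced in this sketch.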