Using SAS/IML® to Create a Chi-Squared Plot for Checking Normality Assumptions in Multivariate Data

Steve Hoff, Graduate Student in Applied Statistics and Research Methods
University of Northern Colorado, Greeley CO

ABSTRACT

The chi-squared plot takes multivariate data of any dimension, computes the squared generalized distance for each sample observation, and pairs the ordered distances with the corresponding chi-squared quantiles. The resulting plot will be approximately linear if the data are multivariate normal. This provides a visual assumption check that may be performed before further statistical procedures are carried out. The coding is done in SAS/IML®.

INTRODUCTION

How many times have I run multiple regressions in SAS, a MANOVA, or a factor analysis, to name a few, without considering the assumptions necessary to run these procedures? Often the assumption left unchecked is that of multivariate normality. "Despite (or possibly because of) the availability of dozens of procedures that can test a data set for multivariate normality, this assumption often goes untested" (Mecklin, 2).

Sometimes we are even encouraged not to perform normality checks. In an example involving univariate procedures that require normality of the data, a well-known textbook states, "since robustness studies have found that violations of this assumption have inconsequential effects on the accuracy of probability statements, it [a test of normality] is rarely used for that purpose today" (Glass & Hopkins, 333). Robustness means that procedures remain effective even if assumptions are not met. Robustness is welcome, but by checking assumptions we ensure that the results of statistical analyses are accurate and properly arrived at. Specifically, checking the multivariate normal assumption "would be helpful in guiding the subsequent analysis of the data, perhaps by suggesting the need for and the nature of a transformation of the data to make them more nearly normally distributed, or perhaps by indicating appropriate modifications of the models and methods for analyzing the data" (Gnanadesikan, 161).

METHODS

We are familiar with univariate normal probability plots (also called Q-Q plots), in which the ordered observations are plotted against the associated standard normal quantiles. This plot will be approximately linear if the data are sampled from a normal distribution. With multivariate data there is a similar plot, of ordered squared generalized distances against quantiles from an appropriately chosen chi-squared distribution; this plot will be approximately linear if the data are sampled from a multivariate normal distribution. The squared generalized distances (Johnson & Wichern, 184), also called squared Mahalanobis distances (Jobson, 150), are calculated by

    d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar),

where x_i is a specific p-dimensional observation vector, xbar is the sample mean vector, and S^{-1} is the inverse of the sample covariance matrix. If the random sample is indeed from a multivariate normal distribution, then the squared distances have "a chi-squared distribution with p degrees of freedom" (Everitt & Dunn, 43).
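The distance-and-quantile computation above is easy to sketch outside SAS as well. The following Python/NumPy fragment (an illustrative translation, not the paper's SAS/IML program) computes the squared generalized distances and the chi-squared plotting quantiles for a simulated trivariate normal sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.standard_normal((n, p))          # trivariate standard normal sample

xbar = X.mean(axis=0)                    # sample mean vector
S = np.cov(X, rowvar=False)              # sample covariance (divisor n-1)
Sinv = np.linalg.inv(S)

R = X - xbar                             # centered observations
d2 = np.einsum("ij,jk,ik->i", R, Sinv, R)   # squared generalized distances

d2_sorted = np.sort(d2)                  # ordered distances (y-axis)
q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)  # quantiles (x-axis)

# For multivariate normal data, the pairs (q, d2_sorted) fall near the
# 45-degree line through the origin, so their correlation is close to 1.
print(np.corrcoef(q, d2_sorted)[0, 1])
```

A handy sanity check on the distances: with the sample mean and covariance plugged in, the squared distances always sum to exactly (n - 1)p, whatever the underlying distribution.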
I wrote a program in SAS/IML® to compute the squared generalized distances for all sample observations (y-axis) and the corresponding chi-squared quantiles for the ordered data (x-axis), and to plot these ordered pairs together with a reference line of slope one through the origin. Graphical procedures are subjective, so I wanted to see the shape of the graph under specified conditions. I generated trivariate (p = 3) data with a sample size of 100 for the following situations:

1. Multivariate normal (mean = 0, variance = 1) trivariate data.
2. Multivariate normal (mean = 0, variance = 1) trivariate data, with an outlier five standard deviations from the mean.
3. Multivariate Kappa (kappa parameter = 2) trivariate data, with heavy tails. This is identical to a t-distribution with 2 degrees of freedom.
4. Multivariate Kappa (kappa parameter = 8) trivariate data, with light tails. This distribution looks flat and falls off like a Colorado mesa.

The Kappa distribution is a symmetric distribution centered at zero; by changing its single parameter one can simulate heavy tails or light tails. The density and distribution functions are given in Al-Ghamedi, 2002, Chapter 4. The distribution function can be inverted in closed form, so continuous Uniform(0, 1) random variables can be substituted into the inverse distribution function to obtain Kappa random variables.

SAS/IML CODE

dm log 'clear'; dm output 'clear';

proc iml;
  /* sample observations, one row per observation; the intervening
     rows of the 100 x 3 data set are not reproduced here */
  INDATA={ 0.229387  0.464256  1.431067,
           1.830501 -0.276291 -0.758097,
          -0.388587  0.925243  1.230341,
           0.718354  1.532996 -2.461299};
  X=t(INDATA);                         /* columns are observations       */
  n=ncol(X); p=nrow(X);
  O=j(n,1,1);
  XMEANS=(1/n)*X*O*t(O);               /* each column is the mean vector */
  XRESID=X-XMEANS;
  S=(1/(n-1))*(X-XMEANS)*t(X-XMEANS);  /* sample covariance matrix       */
  SINV=inv(S);
  BIG=t(XRESID)*SINV*XRESID;
  V=j(n,1,0); Q=j(n,1,0);
  V=vecdiag(BIG);                      /* squared generalized distances  */
  TMP=V;
  V[rank(V),]=TMP;                     /* sort the distances ascending   */
  MAX=0;
  do i=1 to n;
    Q[i]=cinv((i-.5)/n,p);             /* chi-squared quantiles          */
    if Q[i]>MAX then MAX=Q[i];
  end;
  print Q V;
  vmax=V[n];
  if vmax>MAX then MAX=vmax;
  print MAX;
  MAX=95/MAX;                          /* rescale to the plotting window */
  V=MAX*V; Q=MAX*Q;
  call gstart;
  xbox={0 100 100 0}; ybox={0 0 100 100};
  call gopen;
  call gpoly(xbox,ybox);
  call gscript(55,5,"Multivariate Normal with outlier");
  call gpoint(Q,V);
  call gdrawl({0 0},{100 100});        /* reference line of slope one    */
  call gshow;
quit;
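The inverse-CDF recipe described above for generating Kappa variates is easy to demonstrate. The Kappa inverse distribution function itself is given in Al-Ghamedi (2002, Chapter 4) and is not reproduced here, so the sketch below applies the identical technique to the logistic distribution, chosen purely as a stand-in because its CDF F(x) = 1/(1 + exp(-x)) also inverts in closed form:

```python
import numpy as np

# Inverse-CDF sampling: feed Uniform(0, 1) draws through the inverse of
# the target distribution function.  The logistic inverse CDF is
# F^{-1}(u) = log(u / (1 - u)); substituting the Kappa inverse CDF from
# Al-Ghamedi (2002, Ch. 4) would yield Kappa variates instead.
rng = np.random.default_rng(1)
u = rng.uniform(0.0, 1.0, size=10_000)   # continuous Uniform(0, 1) draws
x = np.log(u / (1.0 - u))                # logistic random variables

# The resulting sample is symmetric about zero, as a Kappa sample would
# be: roughly half the draws fall below zero.
print(np.mean(x < 0.0))
```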
CONCLUSION

The plots show the following patterns:

1. Multivariate normal: the points stay close to the reference line (Figure 1).
2. Multivariate normal with an outlier: the points stay close to the reference line, and the outlier stands out (Figure 2).
3. Heavy-tailed distribution: reading left to right, the points bow from below the line to above it (Figure 3).
4. Light-tailed distribution: the points bow from above the line to below it (Figure 4).

I did not generate skewed multivariate data, though this situation is encountered frequently in practice. When that piece of the puzzle is added, and the resulting graphs are examined for moderately and heavily skewed data, they can be cataloged for use alongside the situations described above.

ACKNOWLEDGEMENT

In his helpful book, Friendly (1991) has a pertinent Section 9.3, Detecting Multivariate Outliers. It invokes an OUTLIER macro that is well documented. I recommend using his code if you can make it work. I developed my program independently.

REFERENCES

Al-Ghamedi, A. A. (2002). Robust Estimation and Testing of Location for Symmetric Stable Distributions. Doctoral dissertation, Colorado State University.

Everitt, B. S., & Dunn, G. (2001). Applied Multivariate Data Analysis. London: Arnold.

Friendly, M. (1991). SAS® System for Statistical Graphics. Cary, NC: SAS Institute Inc.

Glass, G. V., & Hopkins, K. D. (1996). Statistical Methods in Education and Psychology (3rd ed.). Boston: Allyn and Bacon.

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. New York: Wiley.

Jobson, J. D. (1992). Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods. New York: Springer-Verlag.

Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice Hall.

Mecklin, C. J. (2000). A Comparison of the Power of Classical and Newer Tests of Multivariate Normality. Doctoral dissertation, University of Northern Colorado.

CONTACT INFORMATION

Steve Hoff, 518 McKee Hall, University of Northern Colorado, Greeley CO 80639.
Phone: (970) 351-2807
Email: [email protected]

Figure 1: Multivariate normal data.

Figure 2: Multivariate normal data, with outlier.

Figure 3: Heavy-tailed data (plot title: t-Distribution with 2 d.f.).

Figure 4: Light-tailed data.
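As a closing illustration (mine, not reproduced from the paper's figures), the heavy-tailed pattern of Figure 3 can be checked numerically: simulate trivariate t data with 2 degrees of freedom, matching situation 3, and compare the largest ordered distance with the largest chi-squared quantile. The Python sketch below assumes the standard construction of a multivariate t variate as a normal vector divided by the square root of an independent scaled chi-squared variate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, df = 200, 3, 2
Z = rng.standard_normal((n, p))
W = rng.chisquare(df, size=n)
X = Z / np.sqrt(W / df)[:, None]         # trivariate t, 2 d.f. (heavy tails)

xbar = X.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X, rowvar=False))
R = X - xbar
d2 = np.sort(np.einsum("ij,jk,ik->i", R, Sinv, R))  # ordered distances

q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
# Heavy tails: the largest ordered distances overshoot the largest
# chi-squared quantiles, so the right end of the plot rises above the
# reference line, as in Figure 3.
print(d2[-1], q[-1])
```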