Using SAS/IML® to Create a Chi-Squared Plot for Checking
Normality Assumptions in Multivariate Data
Steve Hoff, Graduate Student in Applied Statistics and Research Methods
University of Northern Colorado, Greeley, CO
ABSTRACT
The Chi-Squared plot takes multivariate data of any dimension and computes squared generalized distances for the sample observations, together with the corresponding chi-squared quantiles for the ordered data. The resulting plot will be approximately linear if the data is multivariate normal. This provides a visual assumption check that may be performed before further statistical procedures are accomplished. The coding is done in SAS/IML®.

INTRODUCTION
How many times have I run multiple regressions in SAS, a MANOVA, or a factor analysis, to name some, and not considered the assumptions necessary to run these procedures? Often the assumption not checked is that of multivariate normality. "Despite (or possibly because of) the availability of dozens of procedures that can test a data set for multivariate normality, this assumption often goes untested" (Mecklin, 2). Sometimes, we are encouraged not to perform normality checks. In an example of univariate procedures that require normality of the data, a well-known textbook states, "since robustness studies have found that violations of this assumption have inconsequential effects on the accuracy of probability statements, it [a test of normality] is rarely used for that purpose today" (Glass & Hopkins, 333). Robustness means that procedures are effective even if assumptions are not met. Many procedures are robust, and that is good, but by checking assumptions we ensure that results from statistical analyses are accurate and properly arrived at. Specifically, checking the multivariate normal assumption "would be helpful in guiding the subsequent analysis of the data, perhaps by suggesting the need for and the nature of a transformation of the data to make them more nearly normally distributed, or perhaps by indicating appropriate modifications of the models and methods for analyzing the data" (Gnanadesikan, 161).
METHODS
We are familiar with univariate normal probability plots (also called Q-Q plots), where the ordered observations are plotted against the associated standard normal quantiles. This plot will be approximately linear if the data is sampled from a normal distribution. With multivariate data there is a similar plot, of ordered squared generalized distances plotted against quantiles from an appropriately selected chi-squared distribution, and the plot will be approximately linear if the data is sampled from a multivariate normal distribution. The squared generalized distances (Johnson & Wichern, 184) are also called Mahalanobis distances (Jobson, 150), and the squared Mahalanobis distances are calculated by
d²ᵢ = (xᵢ − x̄)ᵀ S⁻¹ (xᵢ − x̄),

where xᵢ is a specific p-dimensional observation vector, x̄ is the mean vector for the sample, and S⁻¹ is the inverse of the sample covariance matrix. If the random sample is indeed from a multivariate normal distribution, then the squared distances have "a chi-squared distribution with p degrees of freedom" (Everitt & Dunn, 43).

I wrote a program in SAS/IML®, designed to compute the squared generalized distances for all sample observations (y-axis) and the corresponding chi-squared quantiles for the ordered data (x-axis), and plotted these ordered pairs, with a reference line through the origin of slope one. Graphical procedures are subjective, so I wanted to see the shape of the graph under specified conditions. I generated multivariate data with p = 3 variables (trivariate data) with a sample size of 100, for the following situations.
1. Multivariate normal (mean=0, variance=1) trivariate data.
2. Multivariate normal (mean=0, variance=1) trivariate data, with an outlier five standard deviations from the mean.
3. Multivariate Kappa (kappa parameter=2) trivariate data, with heavy tails. This is identical to a t-distribution with 2 degrees of freedom.
4. Multivariate Kappa (kappa parameter=8) trivariate data, with light tails. This distribution looks flat, and falls off like a Colorado mesa.
The Kappa distribution is a symmetric distribution centered at zero. By changing the single parameter one can simulate heavy tails or light tails. The density and distribution functions are given in Al-Ghamedi, 2002, Chapter 4. The inverse of the distribution function can be solved for, so continuous uniform (0, 1) random variables can be substituted into the inverse of the distribution function to obtain Kappa random variables.

SAS IML CODE
dm log 'clear';
dm output 'clear';
proc iml;
/* Input data: observations in rows, variables in columns.   */
/* Only four rows of the 100-observation trivariate sample   */
/* survived extraction; the remaining rows are omitted here. */
INDATA={
0.229387 0.464256 1.431067,
1.830501 -0.276291 -0.758097,
-0.388587 0.925243 1.230341,
0.718354 1.532996 -2.461299};
X=t(INDATA);                        /* X is p x n: variables in rows */
n=ncol(X);
p=nrow(X);
ONE=j(n,1,1);                       /* column vector of ones */
XMEANS=(1/n)*X*ONE*t(ONE);          /* p x n matrix; each row repeats that variable's mean */
XRESID=X-XMEANS;                    /* mean-centered data */
S=(1/(n-1))*(X-XMEANS)*t(X-XMEANS); /* sample covariance matrix */
SINV=inv(S);
BIG=t(XRESID)*SINV*XRESID;
V=j(n,1,0);
Q=j(n,1,0);
V=vecdiag(BIG);                     /* squared generalized distances */
TMP=V;
V[rank(V),]=TMP;                    /* sort the distances in ascending order */
MAX=0;
do i=1 to n;
  Q[i]=cinv((i-.5)/n,p);            /* chi-squared quantiles, p degrees of freedom */
  if Q[i]>MAX then MAX=Q[i];
end;
print Q V;
vmax=V[n];
if vmax>MAX then MAX=vmax;
print MAX;
MAX=95/MAX;                         /* scale both axes into the 0-100 graphics window */
V=MAX*V;
Q=MAX*Q;
call gstart;
xbox={0 100 100 0};
ybox={0 0 100 100};
call gopen;
call gpoly(xbox,ybox);
call gscript(55,5,"Multivariate Normal with outlier");
call gpoint(Q,V);
call gdrawl({0 0},{100 100});       /* reference line of slope one */
call gshow;
quit;
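For readers outside SAS, the computation the IML program performs can be sketched in Python with NumPy and SciPy. This is an illustrative translation, not the author's code: the function name `chi_squared_plot_points` is my own, and `scipy.stats.chi2.ppf` plays the role of IML's `cinv`.

```python
import numpy as np
from scipy.stats import chi2

def chi_squared_plot_points(X):
    """Return (chi-squared quantiles, ordered squared distances) for a
    chi-squared plot.  X is an (n, p) array: rows are observations."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # sample covariance, divisor n-1
    Sinv = np.linalg.inv(S)
    resid = X - xbar                       # mean-centered data
    # squared Mahalanobis distances d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar)
    d2 = np.einsum('ij,jk,ik->i', resid, Sinv, resid)
    d2_sorted = np.sort(d2)                # ordered distances (y-axis)
    # chi-squared quantiles for the ordered data (x-axis),
    # matching cinv((i-.5)/n, p) in the IML code
    q = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
    return q, d2_sorted
```

Plotting `d2_sorted` against `q` with a 45-degree reference line through the origin then gives the same picture the IML graphics calls produce.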
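The Kappa variates used in the simulations come from inverse-transform sampling: uniform (0, 1) draws pushed through the inverse of the distribution function. Al-Ghamedi's general Kappa inverse is not reproduced here, but for kappa parameter = 2 the text notes the distribution is identical to t with 2 degrees of freedom, whose quantile function has a well-known closed form, so that special case can be sketched (function names are mine):

```python
import numpy as np

def t2_inverse_cdf(u):
    """Quantile function of the t distribution with 2 degrees of freedom:
    F^{-1}(u) = (2u - 1) / sqrt(2 u (1 - u)).  The Kappa(2) distribution
    discussed in the text is identical to t with 2 d.f."""
    u = np.asarray(u, dtype=float)
    return (2.0 * u - 1.0) / np.sqrt(2.0 * u * (1.0 - u))

def sample_heavy_tailed(n, p, rng):
    """Inverse-transform sampling: uniform(0,1) draws pushed through the
    inverse CDF give an (n, p) heavy-tailed sample, one draw per cell."""
    return t2_inverse_cdf(rng.uniform(size=(n, p)))
```

Feeding such a sample through the chi-squared plot computation shows the bowed, heavy-tailed pattern described in the conclusion.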
CONCLUSION
The plots show the following patterns.
1. Multivariate normal: points stay close to the reference line (Figure 1).
2. Multivariate normal with outlier: points stay close to the reference line, and the outlier stands out (Figure 2).
3. Heavy-tailed distribution: points bowed from below to above the line, looking left to right (Figure 3).
4. Light-tailed distribution: points bowed from above to below the line (Figure 4).
I did not generate multivariate data that was skewed, though this situation is encountered frequently in practice. When this piece of the puzzle is added, and the resulting graphs checked for moderately and heavily skewed data, they can be cataloged for use with the situations described above.

ACKNOWLEDGEMENT
In his helpful book, Friendly, 1991, has a pertinent Section 9.3, Detecting Multivariate Outliers. It invokes an OUTLIER macro that is well documented. I recommend using his code if you can make it work. I developed my program independently.

CONTACT INFORMATION
Steve Hoff, 518 McKee Hall, University of Northern Colorado, Greeley CO 80639.
Phone: (970) 351-2807
Email: [email protected]

REFERENCES
Al-Ghamedi, A. A. (2002). Robust Estimation and Testing of Location for Symmetric Stable Distributions. Published Doctor of Philosophy dissertation, Colorado State University.
Everitt, B. S., & Dunn, G. (2001). Applied Multivariate Data Analysis. London: Arnold.
Friendly, M. (1991). SAS® System for Statistical Graphics. Cary, NC: SAS Institute Inc.
Glass, G. V., & Hopkins, K. D. (1996). Statistical Methods in Education and Psychology (3rd ed.). Boston: Allyn and Bacon.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. New York: Wiley.
Jobson, J. D. (1992). Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods. New York: Springer-Verlag.
Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice Hall.
Mecklin, C. J. (2000). A Comparison of the Power of Classical and Newer Tests of Multivariate Normality. Published Doctor of Philosophy dissertation, University of Northern Colorado.
Figure 1: Multivariate normal data.
Figure 2: Multivariate normal data, with outlier.
Figure 3: Heavy-tailed data (t-distribution with 2 d.f.).
Figure 4: Light-tailed data.