MULTIVARIATE NORMAL PLOTTING USING
ORDERED MAHALANOBIS DISTANCES
Namjun Kang
Syracuse University
I. Introduction

The assumption of multivariate normality underlies much of the standard multivariate statistical methodology. The effects of departures from normality on the methods are not easily and clearly understood. Thus, it would be useful to verify the reasonableness of assuming normality for a given body of multivariate data. Such a check would be helpful in guiding the subsequent analysis of the data, by suggesting the need for transformation of the data to make them more nearly normally distributed.

The methods for assessing normality can be grouped into two genres: i) single-statistic-based formal tests such as multivariate skewness and kurtosis (Mardia, 1970; Kang & Kalinoski, 1987); ii) graphical techniques using a probability plot. Although a probability plot does not provide a formal test, probability plotting techniques have proved very valuable for the detection of systematic non-normality and of outlying values. Marked skewness, such as might suggest the use of a transformation of the variables, is shown up by simple curvature of the plot, and the presence of kurtosis or of outlying values also might be indicated (Healy, 1968).

The purpose of this paper is to develop an easily implementable SAS/IML program that can provide the multivariate normality testing probability plot. There are several graphical techniques available for checking multivariate normality (Gnanadesikan, 1977, pp. 168-175). The graphical technique proposed here is based on the distribution of the ordered Mahalanobis distances of the individual points from their mean, and involves plotting these distances against chi-square percentiles. Why this particular probability plot? This plot does have the endorsement of several statisticians (Healy, 1968; Johnson and Wichern, 1982) and is easy to use, which means there is a good chance applied researchers will use it. Also, in this paper, the use of the Pearson product moment correlation coefficient is examined as a technique for constructing a test statistic based on the information contained in probability plots (Filliben, 1975; Looney and Gulledge, 1985).

II. Probability Plots and Plotting Positions

For evaluating multivariate normality, Andrews et al. (1973) have suggested an informal graphical procedure that utilizes a distance-from-mean representation of multivariate data. The distance-from-mean in multivariate data, or Mahalanobis distance,

$$D_i^2 = (Y_i - \bar{Y})' S^{-1} (Y_i - \bar{Y}),$$

where $\bar{Y}$ is the sample mean vector and $S$ is the sample covariance matrix, will have approximately a chi-squared distribution with $v$ degrees of freedom in the $v$-variate case. The exact marginal distribution of $D_i^2$ is known to be a constant multiple of a beta rather than a chi-square distribution (Gnanadesikan and Kettenring, 1972). But, when the population is multivariate normal and both the sample size $n$ and $n-v$ are greater than about 25, the difference between using the beta and the chi-squared approximation appears to be insignificant (Johnson and Wichern, 1982).

Often it will be informative to supplement the information about the distances of the individuals from the mean by some consideration of angular position (Gnanadesikan, 1977, pp. 172-174). However, if $v>3$, it is extremely difficult to calculate the angles of each observation. If $v=3$, we might use the cylindrical or spherical co-ordinates. Thus the angular plot is not considered in this paper.

In general, to construct a probability plot, the ordered sample statistic $Y_i$ is plotted (usually on the vertical axis) against $X_i = F^{-1}(P_i)$ (usually on the horizontal axis), where $P_i$ is an estimate of the plotting position and $F^{-1}$ is the inverse of a distribution function. In the Mahalanobis probability plot, $F^{-1}$ is the inverse of the chi-square distribution function.

The plotting formula has been described by Blom (1958) as

$$P_i = (i-c)/(n+1-2c),$$

where $c$ is a function of the distribution being sampled, and $0 \le c \le 1$. In practice, many authors simplify the use of the above formula by assuming $c$ to be a constant. For example, Wilk & Gnanadesikan (1968) and Stevenson (1982) suggested using

$$P_i = (i-.5)/n$$

by setting $c=.5$. Also, Filliben (1975) proposed

$$P_i = (i-.3175)/(n+.365)$$

by using $c=.3175$. Although the different constants $c$ give similar plotting positions for order statistics $Y_i$ near $i=n/2$, they can lead to quite different plotting positions of the extreme values near $i=1$ and $i=n$, especially with small samples (Mage, 1982).

After reviewing different plotting positions, Kimball (1960) recommended that an approximation of $P_i$ developed by Blom (1958, p. 71) be used as a plotting position:

$$P_i = (i-.375)/(n+.25)$$

This plotting position has seen increasing acceptance among practitioners in recent years; for example, the normal probability plot produced by PROC UNIVARIATE of SAS is based on this plotting position (SAS Statistics Version 5, 1985, p. 1188). Thus, in this paper, the plotting position proposed by Blom (1958) will be used in probability plots.

The Mahalanobis distance chi-square probability plot is constructed as follows:

1) The distances are ordered from smallest to largest as $D_{(1)}^2 \le D_{(2)}^2 \le \dots \le D_{(n-1)}^2 \le D_{(n)}^2$.
2) Then graph the pairs $(D_{(i)}^2, \chi_{(i)}^2)$, where $\chi_{(i)}^2$ is the $100P_i$-th percentile of the chi-square distribution with df $= v$ degrees of freedom.

III. The Probability Plot Correlation Test for Normality

The use of a probability plot for providing a qualitative estimate of the goodness of fit to normality has a major disadvantage. As we have mentioned, if the hypothesized normal distribution is the correct one, then the plot of $Y_i$ against $X_i = F^{-1}(P_i)$ will be approximately linear. However, there is no simple objective way to judge how well the data points conform to the straight line (Mage, 1982). This lack of objectivity may be confusing to the users of probability plots. Therefore, Filliben (1975) and Looney & Gulledge (1985) suggested that one use the Pearson product moment correlation between $Y_i$ and $X_i$ to measure the linearity of the probability plot, because the correlation is a simple and straightforward measure of linearity between any two variables. Since the $Y_i$ are highly correlated and heteroscedastic, however, the usual distributional results for the correlation coefficient do not apply. Instead, empirical sampling methods must be used to determine the null distribution of the test statistic. Filliben (1975) and Looney & Gulledge (1985) have already tabulated a normality test statistic for the probability plot correlation when a least squares line is computed. Following Filliben's and Looney & Gulledge's lead, the correlation coefficient from a plot will be used as an aid in interpreting the linearity of the probability plot. Here, Looney & Gulledge's table will be used because the plotting position recommended by Blom (1958) was adopted in constructing that table.

IV. Description of Program

The SAS/IML code for generating the Mahalanobis chi-square probability plot is presented in the Appendix. The program uses the graphic routines in SAS/IML to divide the screen into 4 subplots. The first plot (upper-left) represents the chi-square plot for all observations. After removing the observation that has the largest Mahalanobis distance, the second plot (upper-right) is created using the remaining $n-1$ observations. Again, after deleting the observation that has the largest value among the $n-1$ distances, the third plot is drawn in the lower-left region. In the third plot, the number of observations is reduced to $n-2$. The fourth plot (lower-right) is plotted after removing the observation having the largest distance on the third plot. On each plot, the correlation coefficient between the ordered Mahalanobis distances and the corresponding order statistics based on the chi-square distribution is printed. The program also prints the original observation numbers of the four largest Mahalanobis distances.
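The correlation printed on each panel combines the Blom plotting positions of Section II with chi-square percentiles and the Pearson correlation of Section III. The following is a minimal Python sketch of that calculation, not the author's SAS/IML program. The function names (`blom_positions`, `chi2_cdf`, `chi2_quantile`, `pearson_r`, `chi2_plot_correlation`) are hypothetical, the chi-square percentile is obtained by a simple series-plus-bisection approximation chosen only to keep the example self-contained, and the input distances are illustrative rather than data from the paper.

```python
import math


def blom_positions(n, c=0.375):
    """Plotting positions P_i = (i - c) / (n + 1 - 2c); c = 0.375 is Blom's choice."""
    return [(i - c) / (n + 1 - 2 * c) for i in range(1, n + 1)]


def chi2_cdf(x, df):
    """Chi-square CDF from the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term, total, k = 1.0, 1.0, 0
    while term > 1e-15 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection (assumes 0 < p < 1)."""
    lo, hi = 0.0, max(1.0, float(df))
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def pearson_r(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chi2_plot_correlation(d2, df):
    """Correlation between ordered squared distances and chi-square percentiles."""
    ordered = sorted(d2)
    quantiles = [chi2_quantile(p, df) for p in blom_positions(len(ordered))]
    return pearson_r(quantiles, ordered)


# Illustrative input only: distances placed exactly on the chi-square
# percentiles for v = 5, so the plot is a straight line.
example_d2 = [chi2_quantile(p, 5) for p in blom_positions(20)]
r = chi2_plot_correlation(example_d2, 5)
```

Passing a different `c` to `blom_positions` (for example 0.5 or 0.3175) reproduces the other plotting positions discussed in Section II, which makes it easy to see how much the extreme positions move for small n.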
V. Application of Program

The data in Figure 1 are 100 observations from a 5-variate independent normal distribution. Even the first (upper-left) plot appears to be reasonably linear, exhibiting no marked departures of the Mahalanobis distances from null expectation. The reported correlation coefficient for the first plot is .9895. From the normal test table for the correlation tabulated by Looney and Gulledge (1985), it is seen that .9895 is above the 5% critical value; in fact the observed correlation falls between the 10% and 25% points of the null distribution. On the basis of the correlation test, there is no evidence to contradict the hypothesis of normality. After removing observation number 58, which has the largest Mahalanobis distance, the linearity of the plot is slightly improved, as indicated by the correlation coefficient reported on the second (upper-right) plot (.9924). Deleting observation number 92 on the second plot degrades the linearity of the plot; that is, from .9924 to .9894. The same decreasing pattern holds on the fourth plot after removing observation number 66 from the third plot.

Figure 2 is drawn using mildly nonnormal data. Among the five variables, one is a Cauchy random variate with location parameter 0 and scale parameter 1. Under the null hypothesis of normality, the plot should have a reasonably linear form. All plots in Figure 2, however, appear quite non-linear, especially at the upper end. The skewness of the data is clearly evident in the plots. Also, the correlation test shows a significant departure from normality in these data: the observed percentage point is far below the 5% cut-off. After removing seemingly outlying observations, the linearity decreases rapidly, from .9214 to .8067 to .7883 to .7463. Thus it is quite reasonable to reject the multivariate normality hypothesis on the grounds of both the nonlinear configuration of the plots and the normal test of the correlation coefficient.

VI. Discussion

Instead of using SAS/IML code to calculate Mahalanobis distances and to draw a chi-square plot, PROC REG and PROC GPLOT with the ANNOTATE option in SAS/GRAPH can be used to generate the same plots proposed here. The Mahalanobis distance is computed using the following equation:

$$D_i^2 = (n-1)(h_{ii} - 1/n)$$

In PROC REG, $h_{ii}$ (the diagonals of the HAT matrix) can be easily output to a new data set using the OUTPUT option.

One problem with the Mahalanobis chi-square probability plot and the accompanying correlation coefficient test is that it may not identify those Mahalanobis distances that are distorting the property of multivariate normal distribution. Extreme values with large Mahalanobis distances may still fall close to the best fitted regression line on the plot, thereby fitting in consistently with the correlation. In this vein, Comrey (1985) proposed a different distance measure to eliminate such a problem. The probability plot based on this measure may be easily implemented.

Acknowledgement:

The author would like to thank Dr. Ronald Kalinoski for his encouragement and helpful comments.

SAS, SAS/IML and SAS/GRAPH are registered trademarks of SAS Institute Inc., Cary, NC, U.S.A.

Bibliography

Andrews, D. F., Gnanadesikan, R., and Warner, J. L., "Methods for assessing multivariate normality," in Multivariate Analysis III, NY: Academic Press, 1973. 95-116.
Blom, G., Statistical Estimates and Transformed Beta Variables, NY: John Wiley, 1958.
Comrey, A. L., "A method for removing outliers to improve factor analytic results," Multivariate Behavioral Research, Vol. 20, 1985. 273-281.
Filliben, J. J., "The probability plot correlation coefficient test for normality," Technometrics, Vol. 17, No. 1, 111-117. 1975.
Gnanadesikan, R., Methods for Statistical Data Analysis of Multivariate Observations, NY: John Wiley and Sons, 1977.
Gnanadesikan, R., and Kettenring, J. R., "Robust estimates, residuals, and outlier detection with multiresponse data," Biometrics, Vol. 28, 1972. 81-124.
Healy, M. J. R., "Multivariate normal plotting," Applied Statistics, Vol. 17, 1968. 157-161.
Johnson, R., and Wichern, D., Applied Multivariate Statistical Analysis, Englewood Cliffs, N.J.: Prentice Hall, 1982.
Kang, N., and Kalinoski, R., "Measures of multivariate skewness and kurtosis," SUGI 12 Proceedings, 1987, 1178-1183.
Kimball, B. F., "On the choice of plotting positions on probability paper," Journal of American Statistical Association, Vol. 55, 546-560. 1960.
Mardia, K. V., "Measures of multivariate skewness and kurtosis," Biometrika, Vol. 57, 519-530. 1970.
Looney, S. W., and Gulledge, T. R. Jr., "Use of the correlation coefficient with normal probability plots," The American Statistician, Vol. 39, 75-79. 1985.
Mage, D. T., "An objective graphical method for testing normal distributional assumptions using probability plots," The American Statistician, Vol. 36, 116-120. 1982.
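As a supplementary note to the Discussion, the relation $D_i^2 = (n-1)(h_{ii} - 1/n)$ between the Mahalanobis distances and the hat-matrix diagonals can be checked numerically. The sketch below is written in Python rather than SAS and assumes that the hat matrix comes from a regression on the $v$ variables with an intercept term and that $S$ uses the divisor $n-1$. The helper names (`solve`, `mahalanobis_sq`, `hat_diagonal`) and the six data points are hypothetical and serve only as an illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * w for u, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each row from the sample mean."""
    n, v = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(v)]
    centered = [[row[j] - mean[j] for j in range(v)] for row in data]
    # Sample covariance matrix S with divisor n - 1.
    s = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(v)]
         for j in range(v)]
    return [sum(ci * zi for ci, zi in zip(c, solve(s, c))) for c in centered]


def hat_diagonal(data):
    """Diagonal of H = X (X'X)^{-1} X', where X has a leading column of ones."""
    x = [[1.0] + list(row) for row in data]
    p = len(x[0])
    xtx = [[sum(r[j] * r[k] for r in x) for k in range(p)] for j in range(p)]
    return [sum(ri * zi for ri, zi in zip(r, solve(xtx, r))) for r in x]


# Illustrative bivariate data, not taken from the paper.
data = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 5.5), (6.0, 4.0)]
n = len(data)
d2 = mahalanobis_sq(data)
from_hat = [(n - 1) * (h - 1.0 / n) for h in hat_diagonal(data)]
max_gap = max(abs(a - b) for a, b in zip(d2, from_hat))
```

Here `max_gap` measures the largest disagreement between the two routes to $D_i^2$. The response variable of the regression plays no role, because the hat matrix depends only on the predictors.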
[Figure 1. TEST OF MULTIVARIATE NORMALITY. Four chi-square probability plots of ordered Mahalanobis distance (DISTANCE, vertical axis) against chi-square percentile (CHISQ, horizontal axis), with the largest distances labeled by observation number. Panel correlations: upper-left R = 0.9895, upper-right R = 0.9924, lower-left R = 0.9894, lower-right R = 0.9819.]
[Figure 2. TEST OF MULTIVARIATE NORMALITY. Four chi-square probability plots of ordered Mahalanobis distance (DISTANCE, vertical axis) against chi-square percentile (CHISQ, horizontal axis), with the largest distances labeled by observation number. Panel correlations as printed on the plots: R = 0.9213, R = 0.8067, R = 0.7883, R = 0.7463.]