MULTIVARIATE DATA ANALYSIS USING THE SAS SYSTEM
E. James Harner, West Virginia University
Abstract

The uses of CANDISC, DISCRIM, STEPDISC, and other multivariate SAS® Procedures are illustrated using a bird habitat data set. The objective of the analysis is to discriminate among four sparrow species based on eight habitat variables. Emphasis is given to assessing the assumptions of the discriminant model. DISCRIM and PRINCOMP are used to examine the assumption of equality of covariance matrices. Collinearity is evaluated by PRINCOMP and STEPDISC. Outliers are determined from the Mahalanobis distances, which are computed from the output of DISCRIM. Normality is then assessed by constructing gamma probability plots from these Mahalanobis distances. Both weighted (based on Mahalanobis distances) and unweighted canonical discriminant analyses are performed using CANDISC. The important variables are selected by invoking STEPDISC.

Introduction

Examining the underlying relationships among many variables was limited until the advent of high-speed computers. Now multivariate modeling is entering a new era in which the data analyst is guided by an embedded expert system (Hahn 1985). Increasingly, the SAS System is incorporating features to expand its capabilities and to simplify its use. The SAS Macro Language can now interact with the user and thus offers the possibility of being an "intelligent" statistical software product. However, expert guidance does not obviate the need to understand statistical principles. This tutorial is tailored to explain the statistical ideas in several of the SAS multivariate procedures.

The context of this discussion is a bird habitat multivariate data set. During 1976-80, Whitmore (1979) collected sparrow habitat data on "reclaimed" strip mines in northern West Virginia. The 1980 data used here contains eight vegetation variables measured on the territories of 74 male sparrows identified on the Great Mine (47.5 ha) in Preston County, West Virginia. The four species (SP) found were: field sparrows (FS; Spizella pusilla; n=16), grasshopper sparrows (GS; Ammodramus savannarum; n=25), savannah sparrows (SS; Passerculus sandwichensis; n=13), and vesper sparrows (VS; Pooecetes gramineus; n=20). The eight habitat variables represented four types of quantities: basal area cover (BAC), litter cover (LC), forb cover (FC), and scrub cover (SC), measured as percents; horizontal diversity (HH) and vertical diversity (VH), computed by the Shannon-Weaver index H'; mean vegetation height (MH), measured in cm; and vertical density (VD), a count.

Developing a model is an iterative process involving interactive assessments of the model form, distributional assumptions, outliers, and collinearity. The interrelationships of these concepts are what makes modeling a creative, but difficult, task. A starting place is to examine the univariate graphical presentations (stem-and-leaf, box, and probability plots) and numerical summaries (e.g., means, medians, standard deviations, and pseudo-standard deviations). Since species differences would distort the distribution of values for any variable, the analysis should be done within each species. The SAS statements are:

PROC UNIVARIATE DATA=IN.SPARROW PLOT NORMAL;
BY SP;
VAR BAC LC FC SC MH VD HH VH;

The PLOT option causes a stem-and-leaf, box, and normal probability plot to be printed, whereas the NORMAL option requests a test of normality.

The univariate summaries give excellent information for assessing normality and the occurrence of outliers. By inspecting the box plot, the stem-and-leaf plot, and the listings of the extremes from PROC UNIVARIATE, mild outliers (probability <0.05 of occurring in a normal distribution) and extreme outliers (probability <0.005) are identified. They are marked on the data listing of Table 1. Of the twelve outliers, four are extreme, all of which are scrub cover values. The remaining mild outliers are scattered as to species and variables.

Normality, as tested by the Shapiro-Wilk statistic, is generally satisfied. The power of rejecting normality is low, however, since the sample sizes are small. Scrub cover is significantly non-normal (P < 0.01) for each of the four species. A large number of zero percentages are accompanied by an occasional large percentage (Table 1). The influence of these outliers should be monitored during the model development. A square root transformation would decrease the influence of the extreme values, but would not distribute the probability spike associated with 0%.
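The remark about the square root transformation can be illustrated with a small Python sketch. The percentages below are hypothetical values chosen to mimic many zeros plus one large value; they are not taken from Table 1, and the helper names are illustrative only.

```python
import math

# Hypothetical scrub-cover percentages: many zeros plus one large value.
# Illustrative numbers only; these are NOT the values listed in Table 1.
sc = [0.0, 0.0, 0.0, 0.3, 0.0, 3.4, 0.0, 16.6]


def sqrt_transform(values):
    """Return the square root of each percentage."""
    return [math.sqrt(v) for v in values]


def zero_count(values):
    """Count the observations sitting exactly at zero (the spike at 0%)."""
    return sum(1 for v in values if v == 0.0)


def max_to_mean(values):
    """Ratio of the largest value to the mean: a crude measure of how
    far the extreme value stands out from the rest."""
    return max(values) / (sum(values) / len(values))


sc_root = sqrt_transform(sc)

# The extreme value is pulled in, but the zeros stay exactly where they were.
print("zeros before:", zero_count(sc), " zeros after:", zero_count(sc_root))
print("max/mean before: %.2f  after: %.2f"
      % (max_to_mean(sc), max_to_mean(sc_root)))
```

In this toy example the largest value stands out less after the transformation, while the number of observations at zero is unchanged, which is the behavior described above.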
Table 1. Sparrow Data Listing, Mahalanobis Squared Distances,
and Weight Values
5£ IlJB[lllAC £C
F5
FS
F5
FS
F5
F5
F5
F5
F5
F5
FS
FS
F5
The objective of this study is to discriminate among the four sparrow species using the habitat variables. The principal SAS Procedures invoked to carry out the analysis are CANDISC and STEPDISC. In addition, UNIVARIATE, PLOT, PRINCOMP, and DISCRIM are used to help assess the assumptions of the discriminant model. CANCORR offers an alternative method of analyzing this data.
1
2
3
4
5
6
7
8
9
10
27.6
217
44.4
19.7
171
332
45.1
49.4
446
30.6
11 32.8
12 297
13 35.7
30.6
6.3
13.6
145
364
19.4
155
14.2
145
27.0
8.3
276
258
IJI:i
1.407
1.495
1.599
1.411
1.800
1.656
1.983
1.608
1.540
1526
1.516
1.629
1.406
J.L t1I:! 51: 'lll
'IJ:j
tIAtl lI'lil
1.0 11.4 1.315 8.54 0.88
S7.7tl 43 69 83 0.850 7.71 0.93
96.2 76 98 I1.B 1.276 303 1.00
81.2 58 7.7 780.715 981 0.B3
92.2 62 5.1 147 157011.37 0.77
81.3 68 70 143 1.728 719096
97.2 950 170 170 1.483 10.39 0.80
100.0 82 00 13.8 1.415 4.50 1.00
924 69 99 127 1.625 7.23 096
97.2 65 03 10.4 1.459 436 1.00
689 46 0.7 80 0.587 809 0.91
91.0 52 0.0 6.2 1.009 6.45 1.00
948 63 0.0 10.1 1.245 414 1.00
90.4 65
F5
F5
F5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
G5
55
55
55
55
55
55
55
55
55
Litter cover also has P-values <0.05 for testing normality, except for that of vesper sparrows. The LC sample distributions are mildly skewed to the left, i.e., towards moderate to low percentages. A scattering of other variables are also significantly non-normal, but the departure is not serious.
14 8.8 14.6 1.963 73.1 7358.8'
15 17.9 40.7 1.755 93.5 64 9.9
1628.9 17.5 1.629 82.957 0.8
I 30.9 13.8 1.265 68.5 30 0.0
2 28.3 18.0 1.293 98.1 68 9.6
3 28.4 21.7 1.601 94.1 46 5.3
4 28.7 15.4 1.219 90.6 66 7.3
5 29. I 13.2 1.303 79. I 34 0.0
6 10722.7 1.410 74648 0.0
7 12.2 31.4 0.896' 59.4 27 0.2
8 40.6 19.9 1.588 98.673 7.4
9 18.5 35.0 1.252 90.8 5 I 0.0
10 25.0 45.9' 1.432 90.5 64 0.0
11 25.0 22.5 1.134 82.4 37 0.0
12 41.9 22.9 1.778100.0 77 14
13 28.3 23.9 1.595 95.1 74 6.5
1424.5 16.2 1.346 98.643 1.0
15 27.9 40.6 1.434 96.8 64 0.0
16 15.0 21.0 1.384 60.445 1.2
17240270 1.023·80942 2.6
18 22.6 26.3 1.409 82.1 56 11.0
19 186 245 1.220 83.4 45 0.0
20 16.2 16.5 1.441 61.9 43 0.0
21 16.820.7 1.357 44.1'24 00
22 32.0 30.2 1.495 947 63 5.4
23 42.8 194 1470 926 71 0.0
24 34.4 15.2 1.517 81.7 59 0.4
25 23.7 35.2 1.464 98.9 59 3.6
I 24.4 16.6 1324 82.6 58 0.0
2 18.4 18.3 1.292 63.2 52 04
3 349 194 1.228 85.0 46 0.0
4 21.1 28.0 1.408 70.9 42 0.3
5 241 13.6 1.438 77.8 46 0.0
6 68.8'32.9 1.652100.082 0.0
7 28.940.8 1.531 98.668 0.0
8 38 I 32.7 1.654 99.9 77 0.2
9 21.5 25.9 1723 79.7 62 0.2
6.3 1.3041301 0.72
9.6 1.527 6.06 1.00
5.00.749 8.11 0.91
3. I 1.062 8.65 0.88
9.5 1.263 743 0.95
0.4 0.862 12.16 0.74
11.1 0.801 7.85 0.92
5.8 0.441 7.24 0.96
31 0669 8080.91
2.2 0 185 8.58 088
15.8 1.1651060 0.79
3.9 0.621 468 1.00
11.9 1.55012.03 0.75
3.4 0.678 3.33 1.00
103 1.305 755 094
9.2 1.265 3.50 1.00
7.8 1.15311.80075
6.0 1.026 6.56 1.00
3.6 0.000 9.940.82
410593 5.10 100
0.5 1.222 13.79° 0.70
29 0.745 3.37 1.00
3.6 0.535 5.39 1.00
1.9063310.790.79
11.7 1453 482 1.00
10.1 1.250 9.85 082
9.5 1.279 4.97 1.00
76 I 175 3.94 1.00
6.2 0.943 441 1.00
44 1.453 9.52 0.84
3.3 1.292 9.340.85
5.6 1.249 603 1.00
5.1 0.880 6.45 1.00
160'153210340.80
7.1 1051 625 1.00
72 I 203 4.31 1.00
6.2 1040 442 1.00
SS 1030.522.2 1137 94.767 0.5 7.6 O,104e \0_93 0.78
55 II 30621.8 1015 97952 46' 67 1123 886087
55 12 45.6 23.7 1.600 975 79 0.0 9.2 1275 436 1.00
The above univariate analyses do not expose serious distributional problems, except for SC and possibly LC. However, marginal normality does not imply joint normality and outliers often hide themselves in high-dimensional spaces. Bivariate scatter plots are useful, but not definitive, in locating anomalies in the data. Scatter plots and correlations were obtained by the following SAS statements:

PROC CORR DATA=IN.SPARROW;
BY SP;
VAR BAC LC FC SC MH VD HH VH;
PROC PLOT DATA=IN.SPARROW;
BY SP;
PLOT BAC*(LC FC SC MH VD HH VH);
PLOT LC*(FC SC MH VD HH VH);
PLOT FC*(SC MH VD HH VH);
PLOT SC*(MH VD HH VH);
PLOT MH*(VD HH VH);
PLOT VD*(HH VH);
PLOT HH*VH;

These plots quickly overwhelm the analyst, however. The g groups and p variables generate gp(p-1)/2 plots. I would not advise plotting all groups, even if uniquely identified, on the same scatter plot, since variable relationships within a group are obscured.

Several birds have variable values which are distant from the bivariate "ellipse" of points, but are not univariate outliers. These include: field sparrow #5, grasshopper sparrow #16, savannah sparrow #10, and vesper sparrow #9.
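To make the plot count gp(p-1)/2 concrete, the following Python sketch enumerates the variable pairs requested by the PLOT statements above. The variable and species codes come from the text; the function name is illustrative only.

```python
from itertools import combinations

# Variable and species codes as used in the SAS statements above.
variables = ["BAC", "LC", "FC", "SC", "MH", "VD", "HH", "VH"]  # p = 8
groups = ["FS", "GS", "SS", "VS"]                              # g = 4


def scatter_plot_pairs(names):
    """Return every unordered pair of variables, one pair per scatter plot."""
    return list(combinations(names, 2))


pairs = scatter_plot_pairs(variables)
plots_per_group = len(pairs)                 # p(p-1)/2 plots within a group
total_plots = len(groups) * plots_per_group  # g * p(p-1)/2 plots overall

p, g = len(variables), len(groups)
# The enumeration agrees with the closed-form count quoted in the text.
assert total_plots == g * p * (p - 1) // 2

print(plots_per_group, total_plots)
```

With eight variables and four species this gives 28 plots per species and 112 plots in all, which is why the display quickly becomes unmanageable.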
SS 13375 114 1.495100,081 13.8 0 8.8 133010780.79
VS 1 33.3 12.2 1.366 88.7 47 16.6° 4.2 1 401 17.55 11 a 62
V5
V5
v5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
V5
2 358
3 300
4 29.4
5 29.8
6 34.0
7 30.2
8 13.1
9 5.6
10 27.0
11 10.2
12 23.1
13 11.7
14 16.3
15 6.8
16 16.6
17 16.6
18 19.7
19 20.3
20 4.7
20.2 1428
240 I 593
10.6 I 705
12.4 1.392
12.0 1.702
11.7 1.404
204 1.359
16.6 1.926
33.5 1.523
10. I 0.935
27.5 1.217
17.2 1.201
28.3 0.888
13.2 0.705
18.70.814
22.3 0.868
16.7 1.002
12.6 1.433
243 0.740
95. I 65
970 74
857 77
59.8 49
84.7 58
84.747
40.1 24
26.7 30
79.9 56
40.8 19
79.2 32
55.8 14
69.9 20
30.9 13
61.724
65. I 23
64332
56.6 36
49.1 19
30
28
30
0.0
0.3
3.4
0.0
00
0.3
0.2
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
As we expand our "window of the data," structures hidden by low-dimension projections reveal themselves. The key is to find meaningful projections and comprehensive summary statistics.
9.0 1.233 3.80 1.00
13.9 1373 7.19 0.96
'27 1.292 9.26 0.85
5.2 1.175 10.23 0.8 I
11.5 1300 7.19 0.96
12.8 1.521 10.60 0.79
2.6 0.490 647 1.00
1.2 0.5661247 0.73
6.6 0.963 8.41 0.89
2.0 0.500 4.75 I 00
3.2 0.642 5.28 1.00
1.00.325 7.25 0.96
47 0.745 7.62 0.94
1.0 0.000 9.13 0.86
2.80.154 6.850.99
2.8 0.409 1.89 1.00
4.1 0.751 2.73 1.00
1.80.787 6.27 1.00
1.9 0.206 7.06 0.97
a A mild outlier
b An extreme outlier
c Significant at α=0.05
d Significant at α=0.10

Joint normality and the influence of multivariate outliers (Gnanadesikan 1977) are assessed by examining the Mahalanobis squared distances:

$$D_i^2 = (y_i - \bar{y})' S^{-1} (y_i - \bar{y}),$$

where $y_i$ is the ith response vector, $\bar{y}$ is the group mean, and $S$ is the covariance matrix. These squared distances must be computed separately for each group, i.e., the grand mean and the pooled covariance matrix from the combined samples are not used.

Obtaining the $D_i^2$ from SAS is not straightforward. The posterior classification probabilities from DISCRIM are based on the $D_i^2$, but these quantities are not printed. However, the output data set does contain the statistics necessary to compute the Mahalanobis squared distances. If the POOL=NO option is specified, the means, the standard deviations, and the inverse of the correlation matrix are given for each group. The $D_i^2$ can then be computed from these statistics. These computations are done by a macro written by Daniel Chilko (1985); alternatively, PROC MATRIX could be used. The Mahalanobis squared distances are listed in Table 1 with the data listing.
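As a rough illustration of the within-group Mahalanobis squared distances, here is a minimal pure-Python sketch. It is not the macro referred to in the text; the data are a made-up group of six observations on two variables, and all function names are hypothetical.

```python
def mean_vector(rows):
    """Column means of a list of equal-length observation vectors."""
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def covariance_matrix(rows):
    """Sample covariance matrix of one group (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - m[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return [[s[a][b] / (n - 1) for b in range(p)] for a in range(p)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_squared(rows):
    """D_i^2 = (y_i - ybar)' S^{-1} (y_i - ybar), using the group's own
    mean and covariance matrix (no pooling across groups)."""
    m = mean_vector(rows)
    s = covariance_matrix(rows)
    out = []
    for r in rows:
        d = [r[j] - m[j] for j in range(len(m))]
        x = solve(s, d)  # x = S^{-1} d without forming the inverse
        out.append(sum(di * xi for di, xi in zip(d, x)))
    return out


# Hypothetical group of six observations on two variables (not sparrow data).
group = [[2.0, 1.0], [3.0, 4.0], [5.0, 3.0],
         [6.0, 7.0], [8.0, 6.0], [9.0, 9.0]]
d2 = mahalanobis_squared(group)
print([round(v, 3) for v in d2])
```

A convenient check on any such computation is that, with the divisor n - 1, the squared distances within a group sum to (n - 1)p; for this toy group that is 10.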
The $D_i^2$s are useful for assessing normality and for identifying outliers. The squared lengths have an approximate chi-square distribution with p (the number of variables) degrees of freedom. Thus if $D_i^2 > \chi^2(\alpha, p)$, observation i should be labelled as a possible outlier if $\alpha$ is sufficiently small ($\chi^2(\alpha, p)$ is the $1-\alpha$ quantile of a chi-square distribution with p degrees of freedom). In our case $\chi^2(0.10, 8) = 13.36$ and $\chi^2(0.05, 8) = 15.5$. Only two potential outliers, grasshopper sparrow #18 and vesper sparrow #1, are found using α=0.10 and only the vesper sparrow is an outlier for α=0.05. The grasshopper sparrow is not identified as an outlier by either the univariate or bivariate analyses. The "outlying" $D_i^2$s are not extreme quantiles of the chi-square distribution. Thus these observations should not cause a serious degradation of the discriminant model. Since outlier and collinearity problems are related, the $D_i^2$ should also be computed after a variable selection.

Joint normality can also be assessed by using the $D_i^2$s. A chi-square distribution with p degrees of freedom is a special case of a gamma distribution with scale parameter ($\lambda$) 2 and shape parameter ($\eta$) p/2. Therefore, if the underlying distribution is multivariate normal, the gamma probability plot should be a straight line. If the data set corresponding to Table 1 is called D1, the SAS code to carry out the analysis is:

PROC RANK DATA=D1 OUT=D2;
BY SP;
VAR MAH;
RANKS RMAH;
DATA D2;
SET;
IF SP='FS' THEN GMAH=GAMINV((RMAH-0.5)/16,4);
IF SP='GS' THEN GMAH=GAMINV((RMAH-0.5)/25,4);
IF SP='SS' THEN GMAH=GAMINV((RMAH-0.5)/13,4);
IF SP='VS' THEN GMAH=GAMINV((RMAH-0.5)/20,4);
PROC PLOT;
BY SP;
PLOT MAH*GMAH='*';

The plotting positions (i-1/2)/n are used in the inverse gamma function. The gamma probability plot for the field sparrows closely fits a straight line. The remaining plots have slight-to-moderate curvature. The savannah sparrow plot is S-shaped, indicating a truncated (uniform) distribution. It must be remembered, however, that the sample size for savannah sparrows is small. Overall, the assumption of joint normality is not contradicted by these probability plots.

Equality of population covariance matrices is another important assumption to examine. The multivariate generalization of Bartlett's test is performed by DISCRIM if the POOL=TEST option is specified. In our case, using the chi-square approximation with (g-1)p(p+1)/2 degrees of freedom results in a P-value of <0.0001. Therefore, the assumption of equality of covariance matrices is not tenable.

Characterizing the differences among the covariance structures is difficult. Most approaches explore the covariance matrices individually or in pairs (Dempster 1969). An effective method has not been found to describe simultaneously the differences among the g(>2) covariance matrices.

A simple approach for comparing covariance structures is to do a principal component analysis on each covariance or correlation matrix. If the covariance matrix is used, the principal variables corresponding to the largest eigenvalues are usually dominated by the variables with the largest variances. This is scale dependent; therefore, the correlation matrix is preferred. The principal variables are found by solving the following eigenvalue problems:

$$R_k a_i^{(k)} = c_i^{(k)} a_i^{(k)},$$

where $R_k$ is the correlation matrix for group k. Then $z_i^{(k)} = a_i^{(k)\prime} y_k^*$ is the ith principal variable for the kth group with $\mathrm{var}(z_i^{(k)}) = c_i^{(k)}$, where $y^*$ is the standardized variable $y^* = D(1/s_j)(y - \bar{y})$.

The differences among the correlation matrices can be characterized (approximately) by the angles among the $z_i^{(k)}$ and their variances $c_i^{(k)}$, for each fixed i. The eigenvalues for the first three and last components are given in Table 2. The last component(s) is important since singularities are associated with $z_i$ for which $\mathrm{var}(z_i) \approx 0$.

Table 2. Selected Eigenvalues from the Principal Component Analysis of Each Species

                    Component
Species       1        2        3        8
FS          3.446    2.042    1.458    0.026
GS          4.330    1.191    0.878    0.096
SS          3.722    1.578    0.655    0.029
VS          4.967    1.201    0.914    0.024

The ellipsoid associated with vesper sparrows is most elongated, whereas the ellipsoid for field sparrows is the least elongated. The possible collinearity defined by the last principal variable is least pronounced for the grasshopper sparrows.

The cosines of the angles for the first and last components are given in Table 3. These are computed from the eigenvectors which are printed by PRINCOMP. The first principal variables for the four species are similar, particularly those for GS and VS (an angle of 6.5°). An examination of the eigenvectors reveals that these principal variables measure "vegetation presence" with all variables being weighted except for FC for the grasshopper and vesper sparrows and SC for the field and savannah sparrows.

Table 3. Cosines of the Angles Among the First Principal Components (Upper Triangular Part) and the Last Principal Components (Lower Triangular Part)

                    Species
Species      FS       GS       SS       VS
FS          1.000    0.953    0.928    0.934
GS          0.096    1.000    0.930    0.994
SS          0.777   -0.189    1.000    0.902
VS          0.743    0.424    0.539    1.000
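The cosines in Table 3 are inner products of unit-length eigenvectors from different species. A minimal Python sketch of that calculation follows; the two vectors are hypothetical stand-ins, not the PRINCOMP eigenvectors, and the function names are illustrative only.

```python
import math


def cosine_between(a, b):
    """Cosine of the angle between two vectors (normalized inner product)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angle_degrees(cosine):
    """Angle in degrees for a given cosine. The sign of an eigenvector is
    arbitrary, so the absolute value is used, clamped to 1 for safety."""
    return math.degrees(math.acos(min(1.0, abs(cosine))))


# Hypothetical first eigenvectors for two groups (illustrative values only).
a_first = [0.5, 0.5, 0.5, 0.5]
b_first = [0.6, 0.4, 0.5, 0.5]

cos_ab = cosine_between(a_first, b_first)
print(round(cos_ab, 3), round(angle_degrees(cos_ab), 1))

# Converting a tabled cosine back to an angle:
print(round(angle_degrees(0.994), 1))
```

A tabled cosine of 0.994 converts to an angle of a little over 6 degrees, which agrees with the 6.5° quoted for GS and VS once rounding of the cosine to three decimals is allowed for.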
An analysis of the second components was not done, but by inspecting their eigenvectors, these principal variables differ substantially. Field and savannah sparrows have high coefficients for SC, whereas grasshopper and vesper sparrows score high on FC. The last components indicate the nature of near singularities in the data. The cosines among these components (Table 3) indicate that grasshopper sparrows differ the most from the other species. This species also has the largest last eigenvalue (Table 2). The possible collinearity for FS, SS, and VS is defined roughly in terms of LC versus the other cover variables.

The SAS code to generate the principal component information is:

PROC PRINCOMP DATA=IN.SPARROW OUT=SCORE;
BY SP;
VAR BAC LC FC SC MH VD HH VH;
PROC CORR;
BY SP;
VAR BAC LC FC SC MH VD HH VH;
WITH PRIN1-PRIN8;
PROC PRINT;
BY SP;
PROC PLOT;
BY SP;
PLOT PRIN2*PRIN1='*';

The correlations between the original and ith principal variables are found by $a_i \sqrt{c_i}$, i.e., they are proportional to the eigenvectors. However, the correlations are necessary to interpret the principal variables if the covariance matrix is analyzed, since the eigenvectors are scale dependent.

In summary, the major orientations of the four ellipsoids are similar, but differences then begin to appear. Consequently, using the pooled covariance matrix for a distance metric is questionable.

The major objective of this study is to explain the nature of the differences among the species with respect to their habitat variables. Therefore, the preceding covariance analysis and a canonical discriminant analysis are more meaningful than a classification analysis. PROC DISCRIM was run, however, to gain insight into group separation. The SAS statements to generate the analysis are:

PROC DISCRIM LIST POOL=NO PCORR OUT=DIST DATA=IN.SPARROW;
CLASS SP;
VAR BAC LC FC SC MH VD HH VH;

The option with POOL=YES was also run. The statistics in the output data set are used to compute the Mahalanobis distances.

The classification percentages for both the linear (POOL=YES) and quadratic (POOL=NO) classification models are given in Table 4. Overall, 77.0% of the sparrows are classified correctly by the quadratic model and only 51.4% are classified correctly by the linear model. The individual percentages indicate that grasshopper sparrows are the most difficult to distinguish from the other sparrow species.

Table 4. Classification Percentages for Not Pooling/Pooling the Covariance Matrices Using PROC DISCRIM

                      Classified into species
From species      FS           GS           SS           VS
FS             81.2/50.0    18.8/31.2     0.0/ 6.2     0.0/12.5
GS              8.0/12.0    64.0/44.0    12.0/20.0    16.0/24.0
SS              0.0/ 0.0     7.7/30.8    84.6/61.5     7.7/ 7.7
VS             10.0/20.0     5.0/15.0     0.0/10.0    85.0/55.0

A canonical discriminant analysis was run to characterize the mean differences among the groups. The canonical variables, which define these differences, are affected by the inequality of covariance matrices. A canonical correlation analysis is also run between the original variables and the dummy variables generated from SP. This shows that discriminant analysis is a special case of canonical correlation analysis. The SAS statements for carrying out this analysis are:

PROC CANDISC DATA=IN.SPARROW OUT=SCORE NCAN=3;
CLASS SP;
VAR BAC LC FC SC MH VD HH VH;
PROC PLOT;
PLOT CAN2*CAN1=SP;
DATA GEN;
SET IN.SPARROW; /* added so that SP is available; the input data set is an assumption */
D1=0; D2=0; D3=0;
IF SP='FS' THEN D1=1;
IF SP='GS' THEN D2=1;
IF SP='SS' THEN D3=1;
PROC CANCORR ALL;
VAR BAC LC FC SC MH VD HH VH;
WITH D1-D3;

This canonical discriminant analysis is computed from a pooled within-group and a between-group covariance matrix. Classically, these are defined by:

$$W = [1/(n-g)] \sum_i (n_i - 1) S_i,$$

where $S_i$ is the estimated covariance matrix for group i and $\sum_i n_i = n$, and

$$B = [1/(g-1)] \sum_i n_i (\bar{y}_i - \bar{y})(\bar{y}_i - \bar{y})',$$

where $\bar{y}_i$ is the mean of the ith group and $\bar{y} = \sum_i n_i \bar{y}_i / n$. We then want to find the discriminant variable $z = a'y$ which maximally separates the groups in the sense that $a'Ba / a'Wa$ is maximized. This is equivalent to solving the generalized eigenvalue problem

$$Ba = cWa.$$

Actually, there are $t = \min(g-1, p)$ discriminant variables, i.e., $z_1, z_2, \ldots, z_t$ corresponding to the eigenvalues $c_1 \ge c_2 \ge \cdots \ge c_t$. The separating ability is indicated by the magnitude of the eigenvalues and thus $z_1$ is the maximal discriminant.

The hypothesis

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_g$$

is tested by various symmetric functions of the eigenvalues. Two important test statistics are the likelihood ratio test given by $\Lambda = \prod_i 1/(1 + d_i)$ and Roy's maximum root defined by $\theta_1 = d_1/(1 + d_1)$, where $d_i = [(g-1)/(n-g)] c_i$.
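The two test statistics follow directly from the eigenvalues once they are rescaled. The Python sketch below assumes a set of made-up eigenvalues $c_i$ of $Ba = cWa$; only g = 4 and n = 74 come from the text, and the function names are hypothetical.

```python
def scaled_roots(c, g, n):
    """d_i = [(g - 1)/(n - g)] c_i for each eigenvalue c_i of Ba = cWa."""
    factor = (g - 1) / (n - g)
    return [factor * ci for ci in c]


def wilks_lambda(d):
    """Likelihood ratio statistic: product over i of 1 / (1 + d_i)."""
    lam = 1.0
    for di in d:
        lam *= 1.0 / (1.0 + di)
    return lam


def roys_root(d):
    """Roy's maximum root: d_1 / (1 + d_1), with d_1 the largest d_i."""
    d1 = max(d)
    return d1 / (1.0 + d1)


# Hypothetical eigenvalues (illustrative only, not the sparrow results).
c = [30.0, 10.0, 2.0]
g, n = 4, 74  # four species and 74 birds, as in the text

d = scaled_roots(c, g, n)
print(round(wilks_lambda(d), 4), round(roys_root(d), 4))
```

Small values of $\Lambda$ and large values of $\theta_1$ both point toward group differences; the significance levels themselves require the appropriate reference distributions, which this sketch does not attempt.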
These statistics are given by CANDISC. In addition, the within canonical structure, i.e., the correlations between the canonical (discriminant) variables and the original variables, is given to aid in interpreting these variables. These are defined by

$$\mathrm{corr}(z, y) = A'W D(1/\sqrt{w_{ii}}),$$

where A has as columns the scaled eigenvectors ($a_i' W a_j = \delta_{ij}$) and $D$ is a diagonal matrix with $1/\sqrt{w_{ii}}$ as the ith diagonal element, where $w_{ii}$ is the ith diagonal element of W.

Wilks' $\Lambda$ test statistic is 0.388 (P=0.00002) and Roy's greatest root is 0.583 (P=0.0001). These are highly significant, indicating at least some group differences. SAS gives pairwise Mahalanobis distances based on group means and pairwise mean difference tests (Hotelling's $T^2$s). Pairwise, all species are significantly different (and generally P<0.01) except for grasshopper versus savannah sparrows.

The nature of group differences is characterized by the within canonical structure. Table 5 gives these correlations. Canonical variable 1 is most highly associated with BAC, LC, FC, and MH, whereas canonical variable 2 is associated with all variables except FC. An examination of the plot of the first two canonical variables shows substantial overlap among the four species. An increasing score on the abscissa (CAN1) corresponds to increasing BAC, LC, FC, and MH and to decreasing SC, as determined from the raw canonical coefficients. The ordinate (CAN2) corresponds to increasing LC, SC, VD, and HH, and decreasing BAC, MH, and VH. The field sparrows are in the upper left, indicating high SC and low BAC and MH values. The grasshopper sparrows are in the center of the graph, whereas the savannah sparrows are in the center right. The vesper sparrows occupy the lower right, corresponding to low LC values. The interpretations must be made cautiously, however, since the covariance structures are different and classification is only moderate.

Table 5. Correlations Between the Three Canonical Variables and the Original Variables

                  Canonical Variables
Variable      CAN1      CAN2      CAN3
BAC          0.386     0.409     0.384
LC           0.572     0.566    -0.177
FC           0.357     0.041    -0.370
SC          -0.121     0.542     0.017
MH           0.502     0.782     0.292
VD           0.032     0.726     0.270
HH           0.112     0.765     0.187
VH           0.191     0.567     0.410

The canonical correlation analysis using CANCORR gives raw canonical coefficients which are proportional to the raw canonical coefficients from CANDISC. Thus CANCORR can be used to perform a canonical discriminant analysis, but certain important output is not given.

In order to assess the influence of outliers on the analysis, a robust discriminant analysis was done. The Mahalanobis distances are used to compute weights for CANDISC.

A more general procedure is now described (Randles et al. 1978). For each group compute

$$\bar{y}^* = \sum_i w_i y_i / \sum_i w_i$$

and

$$S^* = \sum_i w_i^2 (y_i - \bar{y}^*)(y_i - \bar{y}^*)' / \sum_i w_i^2,$$

where

$$w_i = \begin{cases} 1 & \text{if } D_i \le h, \\ h/D_i & \text{otherwise.} \end{cases}$$

$D_i$ is the Mahalanobis distance and h is a tuning constant generally set equal to $\sqrt{\chi^2(\alpha, p)}$. The calculations are usually repeated until the weights converge. This would be difficult to do in SAS unless PROC MATRIX is used. Therefore, a 1-step process is used, i.e., the weights are computed for each observation and the WEIGHT statement is added to the procedure statements of CANDISC.

The weighted discriminant analysis is similar to the unweighted analysis. This is due to our previous observation that outliers are not a serious problem for this data.

An initial screening of the data reduced the number of variables from 13 to 8. A stepwise selection procedure is now used to determine which variables are important to the model. The SAS statements are:

PROC STEPDISC DATA=IN.SPARROW SW SLE=0.15 SLS=0.30;
CLASS SP;
VAR BAC LC FC SC MH VD HH VH;

The entry level (P-value) of 0.15 and the remove level of 0.30 for the F test are those recommended by Afifi and Clark (1984). Table 6 gives a summary of the stepwise discriminant analysis. MH is the most important discriminator as judged by the F-criterion. However, with all the variables in the model, VD has the largest partial F. SC is the only cover variable which is clearly important to the model. LC is marginal but is left in the model for further analysis.

Table 6. Summary Statistics for the Variables Selected by PROC STEPDISC

                     Sequential              Partial
Step   Entered      F       P-value       F       P-value
1.     MH        10.152     0.0001      4.144     0.0094
2.     VD         4.704     0.0049      6.769     0.0005
3.     SC         3.774     0.0144      3.147     0.0303
4.     LC         2.140     0.1019      2.140     0.1019

A weighted and unweighted CANDISC analysis was run on this four-variable model. Wilks' $\Lambda$ is 0.453 (P=0.0000002) and Roy's greatest root is 0.542 (P=0.000005) for the unweighted case. These results are more highly significant than the eight-variable model, but only 48.6% of the observations are classified correctly using the quadratic classification functions.

The within canonical structure is given in Table 7. As with the eight-variable model, LC and MH dominate CAN1; all variables are important to CAN2. The locations of the species in habitat space and the interpretations of the canonical variables are similar to those of the eight-variable model.

Table 7. Correlations between the Canonical Variables and the Original Variables

Variable      CAN1      CAN2      CAN3
LC           0.716     0.379    -0.441
SC           0.019     0.613    -0.283
MH           0.699     0.679     0.157
VD           0.217     0.780     0.042
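The one-step weighting used for the robust analysis can be sketched in a few lines of Python. The sketch assumes the weight function $w_i = 1$ for $D_i \le h$ and $w_i = h/D_i$ otherwise; the tuning constant, distances, and observations below are hypothetical, and the function names are illustrative only.

```python
import math


def robust_weight(d_squared, h):
    """Weight for one observation from its Mahalanobis SQUARED distance:
    1 if D <= h, otherwise h / D (assumed form of the weight function)."""
    d = math.sqrt(d_squared)
    return 1.0 if d <= h else h / d


def weighted_mean(rows, weights):
    """Weighted group mean: sum(w_i * y_i) / sum(w_i), variable by variable."""
    total = sum(weights)
    p = len(rows[0])
    return [sum(w * r[j] for w, r in zip(weights, rows)) / total
            for j in range(p)]


h = 2.5                        # hypothetical tuning constant
d2 = [4.0, 6.25, 25.0]         # hypothetical squared distances
rows = [[1.0, 2.0], [3.0, 4.0], [11.0, 0.0]]  # hypothetical observations

w = [robust_weight(v, h) for v in d2]
center = weighted_mean(rows, w)

# The distant third observation is down-weighted; the other two keep weight 1.
print(w, center)
```

In a fully iterated version the distances would be recomputed from the weighted mean and covariance matrix and the weights updated until they stabilize; the one-step version stops after the first pass, as described in the text.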
We have used SAS to gain an understanding of the variable structure underlying four sparrow species. The SAS procedures for analyzing this data are generally complete, particularly CANDISC and STEPDISC. The major shortcoming is the lack of proper diagnostics, especially those based on the Mahalanobis distances. It is possible to obtain these statistics from the output data set of DISCRIM, but not easily.

References

Afifi, A.A. and Clark, V. (1984), Computer-Aided Multivariate Analysis, Belmont, CA: Lifetime Learning Publications.
Chilko, D.M. (1985), Personal Communication.
Dempster, A.P. (1969), Elements of Continuous Multivariate Analysis, Reading, MA: Addison-Wesley.
Gnanadesikan, R. (1977), Methods for Statistical Data Analysis of Multivariate Observations, New York: John Wiley & Sons.
Hahn, G.J. (1985), "More Intelligent Statistical Software and Statistical Expert Systems: Future Directions," The American Statistician, 39, no. 1, 1-8.
Randles, R.H., et al. (1978), "Generalized Linear and Quadratic Discriminant Functions Using Robust Estimates," JASA, 73, 564-568.
Whitmore, R.C. (1979), "Temporal Variation in the Selected Habitats of a Guild of Grassland Sparrows," Wilson Bull., 91, 592-598.

®SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.

E. James Harner
Professor of Statistics
Department of Statistics and Computer Science
207 Knapp Hall
West Virginia University
Morgantown, West Virginia 26506
(304) 293-3507