MULTIVARIATE DATA ANALYSIS USING THE SAS SYSTEM

E. James Harner, West Virginia University

Abstract

The uses of CANDISC, DISCRIM, STEPDISC, and other multivariate SAS procedures are illustrated using a bird habitat data set. The objective of the analysis is to discriminate among four sparrow species based on eight habitat variables. Emphasis is given to assessing the assumptions of the discriminant model. DISCRIM and PRINCOMP are used to examine the assumption of equality of covariance matrices. Collinearity is evaluated by PRINCOMP and STEPDISC. Outliers are determined from the Mahalanobis distances, which are computed from the output of DISCRIM. Normality is then assessed by constructing gamma probability plots from these Mahalanobis distances. Both weighted (based on Mahalanobis distances) and unweighted canonical discriminant analyses are performed using CANDISC. The important variables are selected by invoking STEPDISC.

Introduction

Examining the underlying relationships among many variables was limited until the advent of high-speed computers. Now multivariate modeling is entering a new era in which the data analyst is guided by an embedded expert system (Hahn 1985).

Increasingly, the SAS System is incorporating features to expand its capabilities and to simplify its use. The SAS Macro Language can now interact with the user and thus offers the possibility of being an "intelligent" statistical software product. However, expert guidance does not obviate the need to understand statistical principles. This tutorial is tailored to explain the statistical ideas in several of the SAS multivariate procedures.

The context of this discussion is a multivariate bird habitat data set. During 1976-80, Whitmore (1979) collected sparrow habitat data on "reclaimed" strip mines in northern West Virginia. The 1980 data used here contain eight vegetation variables measured on the territories of 74 male sparrows identified on the Great Mine (47.5 ha) in Preston County, West Virginia. The four species (SP) found were: field sparrows (FS; Spizella pusilla; n=16), grasshopper sparrows (GS; Ammodramus savannarum; n=25), savannah sparrows (SS; Passerculus sandwichensis; n=13), and vesper sparrows (VS; Pooecetes gramineus; n=20).

The eight habitat variables represent four types of quantities: basal area cover (BAC), litter cover (LC), forb cover (FC), and scrub cover (SC), measured as percents; horizontal diversity (HH) and vertical diversity (VH), computed by the Shannon-Weaver index H'; mean vegetation height (MH), measured in cm; and vertical density (VD), a count.

The objective of this study is to discriminate among the four sparrow species using the habitat variables. The principal SAS procedures invoked to carry out the analysis are CANDISC and STEPDISC. In addition, UNIVARIATE, PLOT, PRINCOMP, and DISCRIM are used to help assess the assumptions of the discriminant model. CANCORR offers an alternative method of analyzing these data.

Developing a model is an iterative process involving interactive assessments of the model form, distributional assumptions, outliers, and collinearity. The interrelationships of these concepts are what make modeling a creative, but difficult, task. A starting place is to examine the univariate graphical presentations (stem-and-leaf, box, and probability plots) and numerical summaries (e.g., means, medians, standard deviations, and pseudo-standard deviations). Since species differences would distort the distribution of values for any variable, the analysis should be done within each species. The SAS statements are:

    PROC UNIVARIATE DATA=IN.SPARROW PLOT NORMAL;
      BY SP;
      VAR BAC LC FC SC MH VD HH VH;

The PLOT option causes a stem-and-leaf, box, and normal probability plot to be printed, whereas the NORMAL option requests a test of normality.

The univariate summaries give excellent information for assessing normality and the occurrence of outliers. By inspecting the box plot, the stem-and-leaf plot, and the listings of the extremes from PROC UNIVARIATE, mild outliers (probability <0.05 of occurring in a normal distribution) and extreme outliers (probability <0.005) are identified. They are marked on the data listing of Table 1. Of the twelve outliers, four are extreme, all of which are scrub cover values. The remaining mild outliers are scattered as to species and variables.

Table 1. Sparrow Data Listing, Mahalanobis Squared Distances, and Weight Values
[Listing with one row per bird: species (SP), bird number, the eight habitat variables (BAC, LC, HH, FC, MH, SC, VD, VH), the Mahalanobis squared distance, and the weight. Footnotes: a, mild outlier; b, extreme outlier; c, significant at α=0.05; d, significant at α=0.10.]

Normality, as tested by the Shapiro-Wilk statistic, is generally satisfied. The power for rejecting normality is low, however, since the sample sizes are small. Scrub cover is significantly non-normal (P < 0.01) for each of the four species. A large number of zero percentages are accompanied by an occasional large percentage (Table 1). The influence of these outliers should be monitored during the model development. A square root transformation would decrease the influence of the extreme values, but would not distribute the probability spike associated with 0%.
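The remark about the square root transformation can be checked with a small sketch. The percentages below are hypothetical values chosen to mimic the pattern described for scrub cover (many zeros plus an occasional large value); they are not the study data, and the two helper functions are illustrative names only.

```python
import math

# Hypothetical scrub-cover percentages: mostly zeros plus one large value.
sc = [0.0, 0.0, 0.0, 0.0, 0.3, 1.2, 2.6, 58.8]
sc_sqrt = [math.sqrt(v) for v in sc]


def zero_fraction(values):
    """Share of observations that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


def max_to_median_nonzero(values):
    """Ratio of the largest value to the median of the nonzero values."""
    nonzero = sorted(v for v in values if v > 0.0)
    mid = len(nonzero) // 2
    if len(nonzero) % 2:
        median = nonzero[mid]
    else:
        median = (nonzero[mid - 1] + nonzero[mid]) / 2
    return max(nonzero) / median


# The spike at zero is untouched by the transformation ...
print(zero_fraction(sc), zero_fraction(sc_sqrt))
# ... while the extreme value is pulled in relative to the other nonzero values.
print(max_to_median_nonzero(sc), max_to_median_nonzero(sc_sqrt))
```

Under these assumed values the share of zeros is the same before and after the transformation, while the largest value moves much closer to the rest of the nonzero observations, which is the behavior the text describes.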
Litter cover also has P-values <0.05 for testing normality, except for that of vesper sparrows. The LC sample distributions are mildly skewed to the left, i.e., towards moderate to low percentages. A scattering of other variables are also significantly non-normal, but the departure is not serious.

The above univariate analyses do not expose serious distributional problems, except for SC and possibly LC. However, marginal normality does not imply joint normality, and outliers often hide themselves in high-dimensional spaces.
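The point that outliers can hide from univariate checks can be illustrated with a two-variable sketch. The mean vector, covariance matrix, and test point below are assumptions made up for the illustration; they are not estimates from the sparrow data.

```python
import math


def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance for a two-variable observation."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Inverse of a 2x2 matrix applied by hand: (1/det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square with 2 df (closed form)."""
    return math.exp(-x / 2.0)


mean = (0.0, 0.0)
cov = ((1.0, 0.9), (0.9, 1.0))  # assumed: unit variances, correlation 0.9
point = (1.5, -1.5)             # each coordinate is only 1.5 SD from its mean

d2 = mahalanobis_sq_2d(point, mean, cov)
tail = chi2_upper_tail_2df(d2)
print(d2, tail)
```

Each coordinate of the assumed point is unremarkable on its own, yet the point runs against the strong positive correlation, so its squared distance is about 45 and its chi-square tail probability is far below the 0.005 level used above for extreme outliers.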
Bivariate scatter plots are useful, but not definitive, in locating anomalies in the data. Scatter plots and correlations were obtained by the following SAS statements:

    PROC CORR DATA=IN.SPARROW;
      BY SP;
      VAR BAC LC FC SC MH VD HH VH;
    PROC PLOT DATA=IN.SPARROW;
      BY SP;
      PLOT BAC*(LC FC SC MH VD HH VH);
      PLOT LC*(FC SC MH VD HH VH);
      PLOT FC*(SC MH VD HH VH);
      PLOT SC*(MH VD HH VH);
      PLOT MH*(VD HH VH);
      PLOT VD*(HH VH);
      PLOT HH*VH;

These plots quickly overwhelm the analyst, however. The g groups and p variables generate gp(p-1)/2 plots. I would not advise plotting all groups, even if uniquely identified, on the same scatter plot, since variable relationships within a group are obscured.

Several birds have variable values which are distant from the bivariate "ellipse" of points, but are not univariate outliers. These include: field sparrow #5, grasshopper sparrow #16, savannah sparrow #10, and vesper sparrow #9.

As we expand our "window of the data," structures hidden by low-dimension projections reveal themselves. The key is to find meaningful projections and comprehensive summary statistics.
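The count of scatter plots implied by gp(p-1)/2 can be verified by enumerating the variable pairs. This is only a counting sketch; the variable and species codes are those used in the paper, and nothing is plotted.

```python
from itertools import combinations

variables = ["BAC", "LC", "FC", "SC", "MH", "VD", "HH", "VH"]
species = ["FS", "GS", "SS", "VS"]

# One scatter plot per unordered pair of variables within a species.
pairs = list(combinations(variables, 2))
plots_per_species = len(pairs)                  # p(p-1)/2
total_plots = len(species) * plots_per_species  # g * p(p-1)/2

print(plots_per_species, total_plots)  # 28 112
```

With p = 8 and g = 4 this gives 28 plots per species, matching the 7+6+5+4+3+2+1 plots requested by the seven PLOT statements, and 112 plots in all.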
Joint normality and the influence of multivariate outliers (Gnanadesikan 1977) are assessed by examining the Mahalanobis squared distances:

    $D_i^2 = (y_i - \bar{y})' S^{-1} (y_i - \bar{y})$

where $y_i$ is the ith response vector, $\bar{y}$ is the group mean, and $S$ is the covariance matrix. These squared distances must be computed separately for each group, i.e., the grand mean and the pooled covariance matrix from the combined samples are not used.

Obtaining the $D_i^2$ from SAS is not straightforward. The posterior classification probabilities from DISCRIM are based on the $D_i^2$, but these quantities are not printed. However, the output data set does contain the statistics necessary to compute the Mahalanobis squared distances. If the POOL=NO option is specified, the means, the standard deviations, and the inverse of the correlation matrix are given for each group. Then $D_i^2 = y_i^{*\prime} R^{-1} y_i^{*}$, where $y_i^{*}$ is the deviation of $y_i$ from its group mean divided elementwise by the group standard deviations and $R$ is the group correlation matrix. These computations are done by a macro written by Daniel Chilko (1985); alternatively, PROC MATRIX could be used. The Mahalanobis squared distances are listed in Table 1 with the data listing.

The $D_i^2$s are useful for assessing normality and for identifying outliers. The squared lengths have an approximate chi-square distribution with p (the number of variables) degrees of freedom. Thus, if $D_i^2 > \chi^2(\alpha, p)$, observation i should be labelled as a possible outlier if $\alpha$ is sufficiently small ($\chi^2(\alpha, p)$ is the $1-\alpha$ quantile of a chi-square distribution with p degrees of freedom). In our case $\chi^2(0.10, 8) = 13.36$ and $\chi^2(0.05, 8) = 15.5$. Only two potential outliers, grasshopper sparrow #18 and vesper sparrow #1, are found using $\alpha = 0.10$, and only the vesper sparrow is an outlier for $\alpha = 0.05$. The grasshopper sparrow is not identified as an outlier by either the univariate or bivariate analyses. The "outlying" $D_i^2$s are not extreme quantiles of the chi-square distribution. Thus these observations should not cause a serious degradation of the discriminant model. Since outlier and collinearity problems are related, the $D_i^2$ should also be computed after a variable selection.

Joint normality can also be assessed by using the $D_i^2$s. A chi-square distribution with p degrees of freedom is a special case of a gamma distribution with scale parameter $\lambda = 2$ and shape parameter $\eta = p/2$. Therefore, if the underlying distribution is multivariate normal, the gamma probability plot should be a straight line. If the data set corresponding to Table 1 is called D1, the SAS code to carry out the analysis is:

    PROC RANK DATA=D1 OUT=D2;
      BY SP;
      VAR MAH;
      RANKS RMAH;
    DATA D2;
      SET;
      /* gamma quantiles with shape p/2 = 4 at the plotting positions */
      IF SP='FS' THEN GMAH=GAMINV((RMAH-0.5)/16,4);
      IF SP='GS' THEN GMAH=GAMINV((RMAH-0.5)/25,4);
      IF SP='SS' THEN GMAH=GAMINV((RMAH-0.5)/13,4);
      IF SP='VS' THEN GMAH=GAMINV((RMAH-0.5)/20,4);
    PROC PLOT;
      BY SP;
      PLOT MAH*GMAH='*';

The plotting positions $(i - 1/2)/n$ are used in the inverse gamma function. The gamma probability plot for the field sparrows closely fits a straight line. The remaining plots have slight-to-moderate curvature. The savannah sparrow plot is S-shaped, indicating a truncated (uniform) distribution. It must be remembered, however, that the sample size for savannah sparrows is small. Overall, the assumption of joint normality is not contradicted by these probability plots.

Equality of population covariance matrices is another important assumption to examine. The multivariate generalization of Bartlett's test is performed by DISCRIM if the POOL=TEST option is specified. In our case, using the chi-square approximation with (g-1)p(p+1)/2 degrees of freedom results in a P-value of <0.0001. Therefore, the assumption of equality of covariance matrices is not tenable.

Characterizing the differences among the covariance structures is difficult. Most approaches explore the covariance matrices individually or in pairs (Dempster 1969). An effective method has not been found to describe simultaneously the differences among the g (>2) covariance matrices. A simple approach for comparing covariance structures is to do a principal component analysis on each covariance or correlation matrix. If the covariance matrix is used, the principal variables corresponding to the largest eigenvalues are usually dominated by the variables with the largest variances. This is scale dependent; therefore, the correlation matrix is preferred. The principal variables are found by solving the following eigenvalue problems:

    $R_k a_i^{(k)} = c_i^{(k)} a_i^{(k)}$

where $R_k$ is the correlation matrix for group k. $Z_i^{(k)} = a_i^{(k)\prime} y_k^{*}$ is the ith principal variable for the kth group with $\mathrm{var}(Z_i^{(k)}) = c_i^{(k)}$, where $y^{*}$ is the standardized variable $y^{*} = D(1/s_j)(y - \bar{y})$. The differences among the correlation matrices can be characterized (approximately) by the angles among the $Z_i^{(k)}$ and their variances $c_i^{(k)}$, for each fixed i.

The eigenvalues for the first three and last components are given in Table 2. The last component(s) is important since singularities are associated with $Z_i$ for which $\mathrm{var}(Z_i) \approx 0$.

Table 2. Selected Eigenvalues from the Principal Component Analysis of Each Species

                    Component
    Species      1        2        3        8
    FS         3.446    2.042    1.458    0.026
    GS         4.330    1.191    0.878    0.096
    SS         3.722    1.578    0.655    0.029
    VS         4.967    1.201    0.914    0.024

The ellipsoid associated with vesper sparrows is most elongated, whereas the ellipsoid for field sparrows is the least elongated. The possible collinearity defined by the last principal variable is least pronounced for the grasshopper sparrows.

The cosines of the angles for the first and last components are given in Table 3. These are computed from the eigenvectors which are printed by PRINCOMP. The first principal variables for the four species are similar, particularly those for GS and VS (an angle of 6.5°). An examination of the eigenvectors reveals that these principal variables measure "vegetation presence" with all variables being weighted except for FC for the grasshopper and vesper sparrows and SC for the field and savannah sparrows.

Table 3. Cosines of the Angles Among the First Principal Components (Upper Triangular Part) and the Last Principal Components (Lower Triangular Part)

                          Species
    Species      FS       GS       SS       VS
    FS         1.000    0.953    0.928    0.934
    GS         0.096    1.000    0.930    0.994
    SS         0.777   -0.189    1.000    0.902
    VS         0.743    0.424    0.539    1.000

An analysis of the second components was not done, but by inspecting their eigenvectors, these principal variables differ substantially. Field and savannah sparrows have high coefficients for SC, whereas grasshopper and vesper sparrows score high on FC. The last components indicate the nature of near singularities in the data. The cosines among these components (Table 3) indicate grasshopper sparrows differ the most from the other species. This species also has the largest last eigenvalue (Table 2). The possible collinearity for FS, SS, and VS is defined roughly in terms of LC versus the other cover variables.

The SAS code to generate the principal component information is:

    PROC PRINCOMP DATA=IN.SPARROW OUT=SCORE;
      BY SP;
      VAR BAC LC FC SC MH VD HH VH;
    PROC CORR;
      BY SP;
      VAR BAC LC FC SC MH VD HH VH;
      WITH PRIN1-PRIN8;
    PROC PRINT;
      BY SP;
    PROC PLOT;
      BY SP;
      PLOT PRIN2*PRIN1='*';

The correlations between the original and ith principal variables are found by $a_i \sqrt{c_i}$, i.e., they are proportional to the eigenvectors. However, the correlations are necessary to interpret the principal variables if the covariance matrix is analyzed, since the eigenvectors are scale dependent.

In summary, the major orientations of the four ellipsoids are similar, but differences then begin to appear. Consequently, using the pooled covariance matrix for a distance metric is questionable.

The major objective of this study is to explain the nature of the differences among the species with respect to their habitat variables. Therefore, the preceding covariance analysis and a canonical discriminant analysis are more meaningful than a classification analysis. PROC DISCRIM was run, however, to gain insight into group separation. The SAS statements to generate the analysis are:

    PROC DISCRIM LIST POOL=NO PCORR OUT=DIST DATA=IN.SPARROW;
      CLASS SP;
      VAR BAC LC FC SC MH VD HH VH;

The option with POOL=YES was also run. The statistics in the output data set are used to compute the Mahalanobis distances.

The classification percentages for both the linear (POOL=YES) and quadratic (POOL=NO) classification models are given in Table 4. Overall, 77.0% of the sparrows are classified correctly by the quadratic model and only 51.4% are classified correctly by the linear model. The individual percentages indicate grasshopper sparrows are the most difficult to distinguish from the other sparrow species.

Table 4. Classification Percentages for Not Pooling/Pooling the Covariance Matrices Using PROC DISCRIM

    Actual                    Classified into species
    species        FS            GS            SS            VS
    FS          81.2/50.0     18.8/31.2      0.0/ 6.2      0.0/12.5
    GS           8.0/12.0     64.0/44.0     12.0/20.0     16.0/24.0
    SS           0.0/ 0.0      7.7/30.8     84.6/61.5      7.7/ 7.7
    VS          10.0/20.0      5.0/15.0      0.0/10.0     85.0/55.0

A canonical discriminant analysis was run to characterize the mean differences among the groups. The canonical variables, which define these differences, are affected by the inequality of covariance matrices. A canonical correlation analysis is also run between the original variables and the dummy variables generated from SP. This shows that discriminant analysis is a special case of canonical correlation analysis. The SAS statements for carrying out this analysis are:

    PROC CANDISC DATA=IN.SPARROW OUT=SCORE NCAN=3;
      CLASS SP;
      VAR BAC LC FC SC MH VD HH VH;
    PROC PLOT;
      PLOT CAN2*CAN1=SP;
    DATA GEN;
      SET;
      /* dummy variables for three of the four species */
      D1=0; D2=0; D3=0;
      IF SP='FS' THEN D1=1;
      IF SP='GS' THEN D2=1;
      IF SP='SS' THEN D3=1;
    PROC CANCORR ALL;
      VAR BAC LC FC SC MH VD HH VH;
      WITH D1-D3;

This canonical discriminant analysis is computed from a pooled within-group and a between-group covariance matrix. Classically, these are defined by:

    $W = \frac{1}{n-g} \sum_j (n_j - 1) S_j$

where $S_j$ is the estimated covariance matrix for group j and $\sum_j n_j = n$, and

    $B = \frac{1}{g-1} \sum_j n_j (\bar{y}_j - \bar{y})(\bar{y}_j - \bar{y})'$

where $\bar{y}_j$ is the mean of the jth group and $\bar{y} = \sum_j n_j \bar{y}_j / n$. We then want to find the discriminant variable $z = a'y$ which maximally separates the groups in the sense that $a'Ba / a'Wa$ is maximized. This is equivalent to solving the generalized eigenvalue problem $Ba = cWa$. Actually, there are $t = \min(g-1, p)$ discriminant variables, i.e., $z_1, z_2, \ldots, z_t$ corresponding to the eigenvalues $c_1 \ge c_2 \ge \cdots \ge c_t$. The separating ability is indicated by the magnitude of the eigenvalues and thus $z_1$ is the maximal discriminant.

The hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_g$ is tested by various symmetric functions of the eigenvalues. Two important test statistics are the likelihood ratio test given by $\Lambda = \prod_i 1/(1 + d_i)$ and Roy's maximum root defined by $\theta_1 = d_1/(1 + d_1)$, where $d_i = [(g-1)/(n-g)]\, c_i$. These statistics are given by CANDISC. In addition, the within canonical structure, i.e., the correlations between the canonical (discriminant) variables and the original variables, is given to aid in interpreting these variables. These are defined by

    $\mathrm{corr}(Z, y) = A' W D(1/\sqrt{w_{ii}})$

where $A$ has as columns the scaled eigenvectors ($a_i' W a_j = \delta_{ij}$) and $D$ is a diagonal matrix with $1/\sqrt{w_{ii}}$ as the ith diagonal element, where $w_{ii}$ is the ith diagonal element of $W$.

Wilks' $\Lambda$ test statistic is 0.388 (P=0.00002) and Roy's greatest root is 0.583 (P=0.0001). These are highly significant, indicating at least some group differences. SAS gives pairwise Mahalanobis distances based on group means and pairwise mean difference tests (Hotelling's $T^2$s). Pairwise, all species are significantly different (and generally P<0.01) except for grasshopper versus savannah sparrows.

The nature of group differences is characterized by the within canonical structure. Table 5 gives these correlations. Canonical variable 1 is most highly associated with BAC, LC, FC, and MH, whereas canonical variable 2 is associated with all variables except FC.

Table 5. Correlations Between the Three Canonical Variables and the Original Variables

                              Variable
              BAC      LC      FC      SC      MH      VD      HH      VH
    CAN1     0.386   0.572   0.357  -0.121   0.502   0.032   0.112   0.191
    CAN2     0.409   0.566   0.041   0.542   0.782   0.726   0.765   0.567
    CAN3     0.384  -0.177  -0.370   0.017   0.292   0.270   0.187   0.410

An examination of the plot of the first two canonical variables shows substantial overlap among the four species. An increasing score on the abscissa (CAN1) corresponds to increasing BAC, LC, FC, and MH and to decreasing SC, as determined from the raw canonical coefficients. The ordinate (CAN2) corresponds to increasing LC, SC, VD, and HH, and decreasing BAC, MH, and VH. The field sparrows are in the upper left, indicating high SC and low BAC and MH values. The grasshopper sparrows are in the center of the graph, whereas the savannah sparrows are in the center right. The vesper sparrows occupy the lower right, corresponding to low LC values. The interpretations must be made cautiously, however, since the covariance structures are different and classification is only moderate.

The canonical correlation analysis using CANCORR gives raw canonical coefficients which are proportional to the raw canonical coefficients from CANDISC. Thus CANCORR can be used to perform a canonical discriminant analysis, but certain important output is not given.

In order to assess the influence of outliers on the analysis, a robust discriminant analysis was done. The Mahalanobis distances are used to compute weights for CANDISC. A more general procedure is now described (Randles et al. 1978). For each group compute

    $\bar{y}^{*} = \sum_i w_i y_i / \sum_i w_i$

and

    $S^{*} = \sum_i w_i^2 (y_i - \bar{y}^{*})(y_i - \bar{y}^{*})' / \sum_i w_i^2$

where $w_i = 1$ if $D_i \le h$ and $w_i = h/D_i$ otherwise. $D_i$ is the Mahalanobis distance and h is a tuning constant generally set equal to $\sqrt{\chi^2(\alpha, p)}$. The calculations are usually repeated until the weights converge. This would be difficult to do in SAS unless PROC MATRIX is used. Therefore, a 1-step process is used, i.e., the weights are computed for each observation and the WEIGHT statement is added to the procedure statements of CANDISC.

The weighted discriminant analysis is similar to the unweighted analysis. This is due to our previous observation that outliers are not a serious problem for this data.

An initial screening of the data reduced the number of variables from 13 to 8. A stepwise selection procedure is now used to determine which variables are important to the model. The SAS statements are:

    PROC STEPDISC DATA=IN.SPARROW SW SLE=0.15 SLS=0.30;
      CLASS SP;
      VAR BAC LC FC SC MH VD HH VH;

The entry level (P-value) of 0.15 and the remove level of 0.30 for the F test are those recommended by Afifi and Clark (1984). Table 6 gives a summary of the stepwise discriminant analysis. MH is the most important discriminator as judged by the F-criterion. However, with all the variables in the model, VD has the largest partial F. SC is the only cover variable which is clearly important to the model. LC is marginal but is left in the model for further analysis.

Table 6. Summary Statistics for the Variables Selected by PROC STEPDISC

                           Sequential              Partial
    Step    Entered      F       P-value        F       P-value
    1       MH         10.152    0.0001       4.144     0.0094
    2       VD          4.704    0.0049       6.769     0.0005
    3       SC          3.774    0.0144       3.147     0.0303
    4       LC          2.140    0.1019       2.140     0.1019

A weighted and an unweighted CANDISC analysis were run on this four-variable model. Wilks' $\Lambda$ is 0.453 (P=0.0000002) and Roy's greatest root is 0.542 (P=0.000005) for the unweighted case. These results are more highly significant than those of the eight-variable model, but only 48.6% of the observations are classified correctly using the quadratic classification functions. The within canonical structure is given in Table 7. As with the eight-variable model, LC and MH dominate CAN1; all variables are important to CAN2. The locations of the species in habitat space and the interpretations of the canonical variables are similar to those of the eight-variable model.

Table 7. Correlations between the Canonical Variables and the Original Variables

                        Variable
               LC       SC       MH       VD
    CAN1      0.716    0.019    0.699    0.217
    CAN2      0.379    0.613    0.679    0.780
    CAN3     -0.441   -0.283    0.157    0.042

We have used SAS to gain an understanding of the variable structure underlying four sparrow species. The SAS procedures for analyzing this data are generally complete, particularly CANDISC and STEPDISC. The major shortcoming is the lack of proper diagnostics, especially those based on the Mahalanobis distances. It is possible to obtain these statistics from the output data set of DISCRIM, but not easily.

References

Afifi, A.A. and Clark, V.C. (1984), Computer-Aided Multivariate Analysis, Belmont, CA: Lifetime Learning Publications.
Chilko, D.M. (1985), Personal Communication.
Dempster, A.P. (1969), Elements of Continuous Multivariate Analysis, Reading, MA: Addison-Wesley.
Gnanadesikan, R. (1977), Methods for Statistical Data Analysis of Multivariate Observations, New York: John Wiley & Sons.
Hahn, G.J. (1985), "More Intelligent Statistical Software and Statistical Expert Systems: Future Directions," The American Statistician, 39, no. 1, 1-8.
Randles, R.H., et al. (1978), "Generalized Linear and Quadratic Discriminant Functions Using Robust Estimates," JASA, 73, 564-568.
Whitmore, R.C. (1979), "Temporal Variation in the Selected Habitats of a Guild of Grassland Sparrows," Wilson Bull., 91, 592-598.

SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.

E. James Harner
Professor of Statistics
Department of Statistics and Computer Science
207 Knapp Hall
West Virginia University
Morgantown, West Virginia 26506
(304) 293-3507
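As a supplementary sketch, the chi-square cutoffs used in the Mahalanobis outlier screen (p = 8 variables) can be reproduced without statistical libraries, because the chi-square distribution function has a closed form for even degrees of freedom. The function names are illustrative, and the two squared distances are values as best read from the source listing for the two flagged birds; treat them as assumptions rather than verified data.

```python
import math


def chi2_cdf_even_df(x, df):
    """Chi-square CDF for even df: 1 - exp(-x/2) * sum_{k<df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even_df(prob, df, tol=1e-10):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even_df(hi, df) < prob:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


p = 8
cut_10 = chi2_quantile_even_df(0.90, p)  # cutoff for alpha = 0.10
cut_05 = chi2_quantile_even_df(0.95, p)  # cutoff for alpha = 0.05

# Assumed D^2 values for the two birds flagged in the text.
d2 = {"GS #18": 13.79, "VS #1": 17.55}
flags = {bird: (v > cut_10, v > cut_05) for bird, v in d2.items()}

print(round(cut_10, 2), round(cut_05, 2))
print(flags)
```

Under these assumptions the cutoffs agree with the 13.36 and 15.5 quoted in the text, the grasshopper sparrow exceeds only the 0.10 cutoff, and the vesper sparrow exceeds both.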