THE ANALYSIS OF ECOLOGICAL SURVEY DATA WITH SAS AND EAP

Robert W. Smith, Ecological Data Analysis
1151 Avila Drive, Ojai, CA 93023

INTRODUCTION

The Ecological Analysis Package (EAP) is a set of user-written SAS procedures which are useful in the analysis of ecological survey data. These types of data are often collected as part of environmental impact or monitoring studies. Both biological data (usually as species importance values) and environmental data are collected at several pertinent locations.

The first step in the analysis usually consists of finding the biological patterns in the data. Although single species can be studied, the main emphasis here is on biological patterns at the community level.

After the biological patterns are quantified and illustrated, they can be correlated with the environmental measurements. The results of this analysis can lead to hypotheses of cause and effect.

These types of analyses can conveniently be performed with SAS and EAP procedures. As the methods are discussed, the procedures involved will be noted.

METHODS FOR FINDING BIOLOGICAL (COMMUNITY) PATTERNS

Agglomerative Hierarchical Cluster Analysis: This type of cluster analysis consists of two steps. 1) Distances are calculated between all pairs of entities (the units being clustered, which can be observations or variables). These distance values are proportional to the dissimilarity of the entities. 2) The most similar remaining pairs of entities are successively fused to form larger and larger groups until all entities are in a single group. The path and levels of fusion are shown in a tree-like structure called a dendrogram (Fig. 1). All agglomerative clustering methods are similar in this respect, but differ in the manner in which distances between groups of entities are calculated as the groups are built. Some examples of clustering strategies are complete linkage (used by SAS PROC CLUSTER), single linkage, centroid (Sneath and Sokal, 1973), group average (EAP PROC DENDRO) and flexible (Lance and Williams, 1967; EAP PROC DENDRO).

The dendrograms from a sample and a species cluster analysis can be used to construct a two-way coincidence table, which is simply the biological data matrix with the rows (species) and columns (samples) arranged in the same order as they appear on the corresponding dendrograms (EAP PROC TWT). The data values can be standardized and converted to symbols for compactness and ease of interpretation (Fig. 1). The symbols in the two-way tables of Figures 1 and 2 are based on data standardized by the species mean (of values >0), and the values corresponding to each symbol are as follows:

    Symbol    Range of Values
    *         >2
    +         >1 to 2
    -         >.5 to 1
    .         >0 to .5
    blank     0

Such a table is extremely valuable to the ecologist because the biological patterns can easily be seen and interpreted. In addition, when choosing groups from the dendrograms, reference to the table is very useful.

The two-way tables are informative because similar samples and species are grouped; thus they will appear in contiguous positions on the two-way table. However, the specific order of entities along a dendrogram can be quite arbitrary, since any one node on the dendrogram can be rotated 180° without changing any groupings. The clustering algorithm in EAP PROC DENDRO is modified to create a maximally informative order of entities along the dendrogram. To accomplish this, the main trend in the data is found by calculating scores for each entity along an ordination axis (see below). The order of entities along the dendrogram is made to approach the order of entities along the ordination axis by appropriate rotation of the nodes (Fig. 2). Note that the rows and columns of the two-way table in Fig. 2 show continuous biological change. This is not evident from the two-way table in Fig. 1.

Ordination Analysis: Here the relationships between the entities are displayed in a maximally informative subset of a multidimensional space. The entities being studied are represented by points in the space, and the distance between two points should be proportional to the dissimilarity of the corresponding entities (Fig. 3). The dimensions of the space are called axes and the point coordinates are called scores. As with agglomerative cluster analysis, the ordination techniques can be based on inter-entity distances. Principal coordinates analysis (EAP PROC PCOORD) and multidimensional scaling (SAS supplemental PROC ALSCAL) directly use distances, while principal components analysis (SAS PROC PRINCOMP) is indirectly based on Euclidean distances.

When ordination scores are plotted, it is necessary to be able to identify individual points. EAP PROC PLOTM has an option which generates a set of plot symbols which are easily identifiable from an accompanying table (Fig. 4c). In addition, symbols can automatically be generated to distinguish groups (Fig. 4b) or trends of selected variables in the plot (Fig. 4a).

METHODS FOR CORRELATING BIOLOGICAL AND ENVIRONMENTAL PATTERNS

Non-parametric Analysis of Distances: This method can be used to test for community difference between a priori defined groups of samples (Dyer, unpublished; EAP PROC DCOMP). Inter-sample distances calculated from the biological data are divided into within-group distances and between-group distances. A non-parametric test is then made to determine if the between-group distances are significantly larger than the within-group distances. This method takes into account the lack of independence of the distance values. The groupings can reflect some hypothesis concerning biological-environmental relationships. For example, samples taken in an area of impact could be compared with samples in a similar area without the impact.

Multiple Linear Regression: Environmental factors which cause major community changes will be correlated with the first or first few ordination axes. Multiple regression can be used to identify candidate factors (SAS PROC GLM, SAS PROC REG). Here, the dependent variable represents scores for an ordination axis, and the independent variables are the measured environmental variables (Cassie and Michael, 1968; Smith and Greene, 1976). It is also possible to use inter-sample distances as the dependent variable, and corresponding changes in environmental variables as independent variables (Dyer, 1978; EAP PROC REGDIST). With such an analysis, modified calculations are required since distance values are not necessarily independent observations.

Canonical Correlations: Canonical correlations can be used instead of multiple regression to study the correlations between the ordination axes and the environmental variables (SAS PROC CANCORR). One set of variables consists of the ordination axis scores and the other set is the environmental variables.

Discriminant Analysis: SAS PROC DISCRIM is mainly used for classifying observations into a priori defined groups. Alternately, discriminant analysis can be used to study between-group differences (EAP PROC WTDISC). Here, the axes of a defined multidimensional space are positioned to maximally separate the groups. The original dimensions of the multidimensional space could represent the measured environmental variables, and the groups could be defined by a cluster analysis of the samples using biological data (Smith, 1976; Bernstein et al., 1978; Green and Vascotto, 1978). Alternately, the groups can be formed according to the distributions of selected species (Green, 1971). Discriminant coefficients will show which environmental variables are correlated with the discriminant axes.

The results from this type of analysis usually can be improved by weighting the observations according to how well each observation fits into each group. There is one set of weights for each observation for each group (Smith, 1976, 1979). Additional important biological within- and between-group information can be conveyed with the weights. When groups are defined with cluster analysis, the weights can be calculated from the same inter-sample distances used in the clustering (EAP PROC GRSIM).

THE MEASUREMENT OF DISTANCE

Most of the methods mentioned above require the calculation of distances. There are several distance indices from which to choose. When using species importance values, some indices are more appropriate than others. One of the most widely available distance indices is Euclidean. Unfortunately, this index is not well suited for ecological data (Beals, 1973). EAP PROC DENDRO has distance indices which are suitable for these types of data.

Inter-sample Distances: Inter-sample distances are used to measure community changes. When used in this manner, all distance indices have one major shortcoming. As the actual community change increases beyond a moderate level, the distance index values do not increase commensurately. This is because species change takes place in a non-linear, non-monotonic manner (see Fig. 1a), while the distance indices assume linear species change (Swan, 1970). The relatively shorter and moderate distances can be somewhat improved with proper data transformation and standardization (Smith, 1976; Smith, in prep; EAP PROC TSALL). The relatively longer distances can be substantially improved by re-estimation using a "step-across" procedure (Williamson, 1978) modified by Smith (1981) (EAP PROC DENDRO). Here, the shorter distances are used to re-estimate the longer distances.

Inter-species Distances: With most indices, the data must first be standardized by species maximum to remove the irrelevant effects of scale in the calculations (Smith, 1976). The distance values will be inversely proportional to the overlap between the species being compared. These distance values can be adversely affected by uneven sampling of the various habitats in the survey area (Colwell and Futuyma, 1971). This can be partly corrected for with the use of weights in the distance calculations (EAP PROC UNIQWT, TSALL).

Besides species overlap, ecologists are often interested in the relative habitat preferences of the species. For example, the distance between two non-overlapping species will be the maximal distance value, regardless of their habitat preferences. The ecologist may prefer that two non-overlapping species occurring in similar habitats be separated by a shorter distance than two non-overlapping species found in very dissimilar habitats. The distances measuring overlap can be converted to distances reflecting relative habitat preference with the "two-step" method (Belbin, 1980; Austin and Belbin, unpublished; EAP PROC DENDRO).

DISCUSSION

The authors of EAP are active in the management and analysis of ecological survey data. Accordingly, every effort is made to keep the EAP programs user-oriented and the techniques state-of-the-art. Besides the EAP procedures mentioned, there are several other procedures for analysis, display, and data manipulation which can be useful to the analyst.

REFERENCES

Austin, M.P. and L. Belbin, unpublished. A new approach to the inverse classification problem in floristic analysis.

Beals, E.W., 1973. Ordination: Mathematical elegance and ecological naivete. J. Ecol., 61(1), 401-416.

Belbin, L., 1980. TWOSTP: A program incorporating asymmetric comparisons that use two steps to produce a dissimilarity matrix. Technical Memorandum 80/9. CSIRO Institute of Earth Resources, Division of Land Use Research, Canberra, Australia.

Bernstein, B.B., R.R. Hessler, R. Smith, and P.A. Jumars, 1978. Spatial dispersion of benthic Foraminifera in the abyssal central North Pacific. Limnol. Oceanogr. 23(3), 543-556.

Cassie, R.M. and M.D. Michael, 1968. Fauna and sediments of an intertidal mud flat: A multivariate analysis. J. Exp. Mar. Biol. & Ecol., 2: 1-23.

Colwell, R.K., and D.J. Futuyma, 1971. On the measurement of niche breadth and overlap. Ecology 52(4): 567-576.

Dyer, D.P., 1978. An analysis of species dissimilarity using multiple environmental variables. Ecology 59(1): 117-125.

Dyer, D.P., unpublished. A statistical test for dissimilarity and similarity matrices. Available from David P. Dyer, Moorman Mfg. Co., 1000 N. 30th St., Quincy, IL 62301.

Green, R.H., 1971. A multivariate statistical approach to the Hutchinsonian niche: bivalve molluscs of central Canada. Ecology 52(4): 23-35.

Green, R.H. and G.L. Vascotto, 1978. A method for the analysis of environmental factors controlling patterns of species composition in aquatic communities. Water Res. 12: 583-590.

Howard-Williams, C. and B.H. Walker, 1974. The vegetation of a tropical African lake: classification and ordination of the vegetation of Lake Chilwa (Malawi). J. Ecol. 62(3), 831-853.

Lance, G.K. and W.T. Williams, 1967. A general theory of classificatory sorting strategies. I. Hierarchical systems. Computer J. 9: 373-380.

Smith, R.W., 1976. Numerical analysis of ecological survey data. Ph.D. thesis, University of Southern California, LA. 401 pp.

Smith, R.W., 1979. Discriminant analysis. EAP Technical Report No. 1: 53 pp. Available from author at 1151 Avila Drive, Ojai, CA 93023.

Smith, R.W., 1981. The re-estimation of ecological distance values using the step-across procedure. EAP Technical Report No. 2: 19 pp. Available from author at 1151 Avila Drive, Ojai, CA 93023.

Smith, R.W., in preparation. The improvement of ecological distances with transformation and standardization. To be an EAP technical report.

Smith, R.W. and C.S. Greene, 1976. Biological communities near submarine outfall. Journal Water Pollution Control Fed. 48(8): 1894-1912.

Sneath, P.A. and R.R. Sokal, 1973. Numerical Taxonomy. W.H. Freeman and Co., San Francisco: 573 pp.

Swan, J.M.A., 1970. An examination of some ordination problems by use of simulated vegetational data. Ecology 51: 89-102.

Whittaker, R.H., 1973. Direct gradient analysis: Techniques. In Handbook of Vegetation Science, Part 5: Ordination and Classification of Communities. R.H. Whittaker, ed., Dr. W. Junk Publishers, The Hague: 7-31.

Williamson, M.H., 1978. The Ordination of Incidence Data. J. Ecol. 66: 911-920.

Figure 1. Cluster Analysis of Biological Data Along a Moisture Gradient Using Unimproved Clustering Algorithm. Panels: a. Plant Species Densities Along Environmental Moisture Gradient, after Whittaker (1973) (vertical axis: Density; horizontal axis: Samples Along Moisture Gradient, Wet to Dry); b. Cluster Analysis of Samples; c. Cluster Analysis of Species; d. Two-way Coincidence Table. [Figure not reproduced.]
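The two steps of agglomerative clustering described under METHODS FOR FINDING BIOLOGICAL (COMMUNITY) PATTERNS can be illustrated outside SAS with a short Python sketch. It is not the EAP PROC DENDRO implementation: the Bray-Curtis index is an assumed choice of distance (the text does not name the indices DENDRO offers), the abundance values are hypothetical, and the function names are illustrative only. The fusion rule is the group-average strategy named in the text.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 when no species are shared."""
    total = sum(a + b for a, b in zip(u, v))
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def group_average_cluster(samples):
    """Group-average agglomerative clustering.

    Returns the fusions in order as (members_a, members_b, fusion_level),
    which is the information a dendrogram displays.
    """
    n = len(samples)
    # Step 1: distances between all pairs of entities.
    dist = {
        frozenset((i, j)): bray_curtis(samples[i], samples[j])
        for i, j in combinations(range(n), 2)
    }

    def average_distance(a, b):
        # Group-average strategy: mean of all between-group pair distances.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    # Step 2: successively fuse the most similar remaining pair of groups.
    clusters = [(i,) for i in range(n)]
    fusions = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        fusions.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return fusions


# Hypothetical abundances: rows are samples along a gradient, columns are species.
samples = [
    [10, 2, 0, 0],
    [8, 4, 0, 0],
    [0, 6, 6, 0],
    [0, 0, 5, 9],
    [0, 0, 3, 10],
]

fusions = group_average_cluster(samples)
for a, b, level in fusions:
    print(f"fuse {a} + {b} at distance {level:.3f}")
```

Recomputing every group distance from the original pairwise values keeps the sketch short; a production routine would update the distances incrementally after each fusion.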
Figure 2. Cluster Analysis of Biological Data Along a Moisture Gradient Using Improved Clustering Algorithm. Panels: a. Cluster Analysis of Samples; b. Cluster Analysis of Species; c. Two-way Coincidence Table. [Figure not reproduced.]

Figure 3. Ordination of Samples Based on Species Compositions Using Principal Components Analysis. After Howard-Williams and Walker, 1974. Plot regions are labelled by vegetation type (neutral to acidic marsh, alkaline marsh, swamp transition, floodplain, grassland). [Figure not reproduced.]

Figure 4. Output From EAP PROC PLOTM. Point Positions on All Plots Are Identical But Symbols Are Different. Panels: a. Symbols From Moisture Measurements at Sample Locations (1 = lowest moisture, 9 = highest moisture); b. Symbols Signifying Groups of Samples; c. Symbols For Identification of Points Using Symbol Table; d. Symbol Table For Symbols in c (columns OBS, SYM, ID). The horizontal axis of each plot is AXIS1. [Figure not reproduced.]
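The saturation of long inter-sample distances discussed under THE MEASUREMENT OF DISTANCE, and the idea of using shorter distances to re-estimate longer ones, can be shown with a small Python sketch. This is only the generic shortest-path form of a step-across calculation; it is not Smith's (1981) modified procedure or the EAP PROC DENDRO code, whose details are not given here. The Bray-Curtis index, the sample data, and the function names are assumptions made for illustration.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 when no species are shared."""
    total = sum(a + b for a, b in zip(u, v))
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def step_across(dist, saturated=1.0, tol=1e-12):
    """Replace saturated distances by the shortest path through intermediate samples.

    Distances below the saturation value are returned unchanged. A saturated
    pair with no connecting path is returned as infinity.
    """
    n = len(dist)
    inf = float("inf")

    def is_saturated(i, j):
        return i != j and dist[i][j] >= saturated - tol

    # Treat saturated distances as unknown (infinite) links.
    path = [[inf if is_saturated(i, j) else dist[i][j] for j in range(n)] for i in range(n)]

    # Floyd-Warshall shortest paths over the remaining, shorter distances.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = path[i][k] + path[k][j]
                if via_k < path[i][j]:
                    path[i][j] = via_k

    # Only the saturated entries are re-estimated.
    return [
        [path[i][j] if is_saturated(i, j) else dist[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical abundances: rows are samples along a gradient, columns are species.
samples = [
    [10, 2, 0, 0],
    [8, 4, 0, 0],
    [0, 6, 6, 0],
    [0, 0, 5, 9],
    [0, 0, 3, 10],
]

dist = [[bray_curtis(a, b) for b in samples] for a in samples]
improved = step_across(dist)

for i in range(len(samples)):
    for j in range(i + 1, len(samples)):
        if dist[i][j] != improved[i][j]:
            print(f"samples {i},{j}: {dist[i][j]:.3f} re-estimated as {improved[i][j]:.3f}")
```

In this example the samples at opposite ends of the gradient share no species, so their original distances all equal the maximum of 1 and carry no information about how far apart they are. After re-estimation those pairs receive distances greater than 1 that increase with their separation along the gradient.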