Syst. Biol. 45(1):67-78, 1996

FINITE MIXTURE CODING: A NEW APPROACH TO CODING CONTINUOUS CHARACTERS

DAVID S. STRAIT,1-3 MARC A. MONIZ,1 AND PEGGY T. STRAIT2

1Doctoral Program in Anthropological Sciences, State University of New York, Stony Brook, New York 11794-4364, USA
2Department of Mathematics, City University of New York, Flushing, New York 11367, USA

Abstract. Finite mixture coding (FMC) is a new method of coding continuous characters. FMC uses a three-step goodness-of-fit procedure to assign codes. First, for a given measurement, parameters are estimated for a number of density functions that describe a data set either of species means or of measurements of specimens from several species. The density functions represent either a single population or a mixture of populations (e.g., a mixture of two normal distributions). Next, a goodness-of-fit criterion (the Akaike information criterion) is used to determine which of the density functions best describes the data set. The best function indicates the number of populations into which the variates of the data set can be segregated. Finally, each species is assigned to the population for which its probability of membership is highest. Each population is then assigned a code, and species falling within the same population share the same code. Although other coding methods incorporate statistical tests or parameters into the coding process, FMC is the only method that produces codes as the direct output of a statistical procedure. [Code; continuous character; finite mixture analysis; likelihood estimation; cladistics.]

There has been considerable debate concerning methods of coding continuous characters (Mickevich and Johnson, 1976; Simon, 1983; Almeida and Bisby, 1984; Thorpe, 1984; Archie, 1985; Felsenstein, 1988; Chappill, 1989; Farris, 1990; Thiele, 1993).
Because computer-assisted cladistic analyses (e.g., PAUP, Hennig86, PHYLIP) require that character states take the form of discrete codes and because many morphological characters such as size and shape tend to vary continuously, a coding procedure is necessary in many studies. Unfortunately, there is no consensus on which method is best. This paper introduces a new method of coding continuous characters, the finite mixture method. Unlike other methods, finite mixture coding (FMC) produces codes as the direct output of a statistical procedure. Consequently, FMC distinguishes among taxa using standard methods of statistical inference.

CONTINUOUS CHARACTERS

Some authors have suggested that continuously varying characters are of little use in cladistic analysis, either because of difficulties encountered in reducing the range of variation into discrete states or because it is unclear whether metric characters provide valid cladistic information (Pimentel and Riggins, 1987; Cranston and Humphries, 1988; Felsenstein, 1988; Bookstein, 1994). However, continuous characters comprise a large proportion of the morphological features present in organisms, many of them are heritable, and some can be used to discriminate among groups of taxa (Falconer, 1981; Chappill, 1989; Stevens, 1991; Thiele, 1993). Consequently, some of these characters may be useful in estimating phylogeny. To use them, a coding procedure must be employed.

CODING PROCEDURES

With respect to continuous characters, coding procedures are the methods used to determine whether taxa are discernible for a given trait. Disagreement over coding techniques has focused on seven methods: gap coding (Mickevich and Johnson, 1976), segment coding (Simon, 1983; Thorpe, 1984; Chappill, 1989), divergence coding (Thorpe, 1984), homogeneous subset coding (Simon, 1983; see also Archie, 1985), generalized gap coding (Archie, 1985), gap weighting (Thiele, 1993), and the coding
method of Almeida and Bisby (1984). Range coding (Colless, 1980) is omitted from this list because it is actually a weighting technique rather than a coding method. None of these methods produces codes as the direct output of a statistical test. Rather, they use statistical tests or parameters as guidelines for applying other criteria that are used to discriminate taxa.

FINITE MIXTURE CODING

Finite mixture coding adopts conventional notions of statistical inference. In statistics, groups are said to be significantly different if the probability that their observed distributions could have been sampled from a single population is low. However, if this probability is higher than a designated value, then the groups are not meaningfully discernible. The finite mixture method applies these principles to character coding. If with respect to a given measurement the distribution of taxa is such that it is probable that they could have been sampled from a single theoretical statistical population (not a biological population), then the taxa are not meaningfully discernible and are assigned the same code. Furthermore, the number of codes present in a set of taxa is equal to the number of discernible statistical populations of taxa that can be identified. This number is identified using finite mixture analysis and likelihood estimation methods.

Although species are said to share a code when they are distributed as a single statistical population, such populations can be distributed in an infinite number of ways (e.g., normally distributed, Poisson distributed). To identify statistical populations of taxa, it is necessary to have an expectation of the general form of that population's distribution. These expectations vary depending on whether the data being analyzed consist of species means or observations of individual specimens within species.
Finite Mixture Coding of Taxon Means

As defined above, species share a code for a given measurement when they are distributed as a single statistical population. Thus, such species can be treated as samples from a single population. If one assumes that species can be sampled independently, then the means of those species should be distributed normally. The central limit theorem states that when large samples are drawn from a population, the means of those samples will be normally distributed. This statement is true regardless of the actual distribution of the measurement within species. If the distribution within species is not normal, then the distribution of species means will approach normality as sample size increases (e.g., Strait, 1989). Thus, the number of states present in a given set of species should correspond to the number of normal distributions that can be identified.

Finite mixture analysis can be used to identify multiple normal distributions within a data set. Applications of finite mixture modeling were reviewed by Pearson et al. (1992). Most of what follows is based on their discussion, and they presented a more detailed description of the method (see also Everitt, 1985). They also provided a list of computer programs that can perform likelihood estimation.

A data set is said to be mixed if it is composed of variates from more than one population. Each of these populations is a component of the total data set, and thus they are called component distributions. The data set is therefore a mixture of these component distributions. The form of each component distribution can be described by an equation, the component density function. Likewise, the shape of the mixture can be described by a mixture density function. Both types of functions are written in terms of the population parameters (e.g., mean, variance) of the component distributions.
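The central limit theorem argument used above for taxon means can be illustrated with a small simulation. The sketch below is not part of the original analysis: the exponential within-species distribution, the sample sizes, the seed, and the `skewness` helper are all assumptions chosen only to show that means of samples drawn from a strongly skewed population are themselves nearly symmetric.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical within-species distribution: strongly right-skewed (exponential).
specimens = [rng.expovariate(1.0) for _ in range(20000)]

# Means of 2000 hypothetical samples of 50 specimens each.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(50)])
    for _ in range(2000)
]

skew_specimens = skewness(specimens)
skew_means = skewness(sample_means)
print(f"skewness of specimens:    {skew_specimens:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

In theory the skewness of an exponential distribution is 2, and the skewness of the mean of n independent draws is 2 divided by the square root of n (about 0.28 for n = 50), so the sample means should be far closer to symmetric than the specimens themselves. With small samples the residual asymmetry is larger, which is the situation in which a transformation is advisable.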
The number of component distributions in a data set (and hence the number of codes) is identified by what is essentially a complex goodness-of-fit procedure. This procedure has three stages. First, using likelihood estimation methods, several mixture density functions are fit to the data set. Specifically, parameters are estimated for each of the component distributions in a mixture. The general form of these distributions must be specified (e.g., normal, Poisson, chi-square), although in the case of species means it is known that they are normal. Thus, when given a density function that describes a mixture of two normal distributions, a mean and a variance are estimated for each component. The best parameter values are those that maximize the likelihood statistic, L. This procedure is performed on several mixture density functions (e.g., one-normal, two-normal, three-normal). The researcher determines the number of mixtures to be examined.

The second stage of the analysis identifies the mixture whose density function provides the best fit with the data set. Using L, mixtures are compared using the Akaike information criterion (AIC; Akaike, 1974). If, for instance, the two-normal model is better than all others (i.e., one-normal, three-normal, etc.), then two component distributions (and thus two codes) are likely to be present in the data set.

Finally, individual species are assigned codes by calculating the probability of a species mean being drawn from any given component distribution.

One additional step at the beginning of this procedure might be necessary, depending on the distribution of the character within species. The premise behind coding taxon means is the central limit theorem, but a measurement that is asymmetrically distributed requires large samples for the central limit theorem to hold.
Thus, in cases of small sample size, it is advisable to transform the data such that the distribution of the measurement within species is symmetrical (using a log, arcsine, or other transformation; cf. Sokal and Rohlf, 1981).

The general form of a mixture density function (Everitt, 1985; see also Pearson et al., 1992) is

$$f(x) = \sum_{i=1}^{c} p_i \, g(x; \theta_i), \tag{1}$$

where $f(x)$ is the mixture density function, $g(x; \theta_i)$ are the density functions of the component distributions, $c$ is the number of components in the mixture, $\theta_i$ are the parameters of the distributions, and $p_i$ are mixing proportions, i.e., the proportions of the total data set contributed by each component (e.g., distribution A makes up 40% of the data set, and distribution B makes up 60%). Because for the moment we are examining only the distribution of taxon means, we need to know only $g(x; \theta_i)$ for normal distributions. The general form of the density function of a normal distribution is

$$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \tag{2}$$

where $\mu$ is the distribution mean and $\sigma^2$ is the variance. Thus the density function of, for example, a two-normal mixture would be

$$f(x) = \frac{p}{\sqrt{2\pi\sigma_1^2}} \exp\left[-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right] + \frac{1 - p}{\sqrt{2\pi\sigma_2^2}} \exp\left[-\frac{(x - \mu_2)^2}{2\sigma_2^2}\right]. \tag{3}$$

The likelihood function (Pearson et al., 1992), which is maximized, is

$$L = \prod_{j=1}^{n} f(x_j) = f(x_1) f(x_2) \cdots f(x_n), \tag{4}$$

where $x_j$ represents the variates (i.e., observations) of a data set.

A simple visual example explains why the maximum likelihood estimate identifies the best function parameters (Fig. 1). Suppose that one wishes to find the parameters of a normal density function that explains the distribution of a set of data consisting of seven variates, $x_1$-$x_7$. In case A, the data set and density function have the same mean. In case B, the mean of the density function differs from that of the data set. In each case,

$$L = f(x_1) f(x_2) \cdots f(x_7). \tag{5}$$

In case A, the values of $f(x)$ for each of the data points will be high, resulting in a large value of L. In case B, L will be very small, because for each data point the value of $f(x)$ is low. Clearly, the density function of case A matches the data set better than that of case B. Therefore, a high L indicates an accurate estimation of function parameters (in this case, the mean).

FIGURE 1. Example of maximum likelihood estimation. Two probability density functions are fit to a distribution of seven variates. (a) $f(x)$ for each variate is large, meaning that L will be large. This function clearly provides a better explanation of the data set. (b) $f(x)$ for each variate is small, meaning that L will be small.

After parameters have been estimated for several mixture density functions, these functions must be compared to determine which provides the best fit with the data set. The comparison is made using either the AIC (Akaike, 1974) or the log-likelihood ratio test. The AIC is defined as

$$\mathrm{AIC} = -2 \ln L + 2K, \tag{6}$$

where K is the number of independent parameters estimated by the mixture model. The best mixture is the one with the lowest AIC, which favors mixtures with high likelihood statistics and few component distributions.

Alternatively, one can use a log-likelihood ratio test to compare mixtures in a pairwise fashion. However, Everitt (1981) noted that this test is inaccurate when sample sizes are less than 10 times the number of parameters estimated by the model (5 times, according to Pearson et al., 1992), and even with many samples the power of the test is low. The log-likelihood ratio statistic G (Sokal and Rohlf, 1981) is computed as

$$G = -2(\ln L_0 - \ln L_a), \tag{7}$$

where $L_0$ and $L_a$ are the likelihood statistics of the two mixtures being compared. The degrees of freedom for the G-test depend on the type of mixtures being compared. If the mixtures contain the same number of components (as may happen when normal and nonnormal mixtures are compared), then the degrees of freedom are equal to the difference in the numbers of parameters estimated by the two models.
If the mixtures have different numbers of components, then the degrees of freedom are equal to two times the difference in the number of parameters, not including the mixing proportions. The AIC and the G-test should produce comparable results. Noting the criticisms of the G-test, we here use the AIC. However, for purposes of comparison and because many readers are familiar with the G-test, the results of both are presented.

After the best mixture model has been identified, one must specify the distributions to which individual taxa probably belong, which is done by calculating the posterior probabilities w of each taxon; w is simply the probability that a data point belongs to a given distribution (Fig. 2). It is defined as (Pearson et al., 1992)

$$w_{ij} = \Pr(i \mid x_j) = \frac{p_i \, g(x_j; \theta_i)}{f(x_j)} \tag{8}$$

for i = 1 to c and j = 1 to n. Because the component distributions of a mixture may overlap, some taxa may not be assigned to components with absolute certainty. However, posterior probabilities at least allow measurement of the level of uncertainty. As a rule, a taxon is assigned to the component for which w is highest. Obviously, if w is low, then the component assignment is questionable, and the researcher may wish to leave such assignments uncertain (meaning that the taxon will be coded as having an uncertain state).

FIGURE 2. Calculation of posterior probabilities. The probability w that a taxon belongs to any given component distribution is equal to its value as a function of the component distribution (A) divided by its value as a function of the mixture (B).

After membership in component distributions is established, coding is simple. Each component receives a code in ascending order, starting with that containing the smallest species means. Codes may be considered either ordered or unordered.
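The three stages described above (likelihood estimation, model comparison by AIC, and assignment by posterior probability) can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the expectation-maximization (EM) algorithm stands in for whatever estimator the statistics package used, K is counted here as 3c - 1 (c means, c standard deviations, and c - 1 free mixing proportions), the function names and the standard deviation floor are hypothetical, and the data are invented species means.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(data, c, iters=300):
    """Fit a c-normal mixture by EM; return (props, mus, sigmas, ln_l)."""
    xs = sorted(data)
    n = len(xs)
    span = xs[-1] - xs[0]
    # Start each component on an equal-sized chunk of the sorted data.
    chunks = [xs[i * n // c:(i + 1) * n // c] for i in range(c)]
    mus = [sum(ch) / len(ch) for ch in chunks]
    sigmas = [span / (2.0 * c)] * c
    props = [1.0 / c] * c
    floor = 0.01 * span  # assumed guard against a component collapsing to a point
    for _ in range(iters):
        # E-step: posterior probability w[i][j] of component i for variate j.
        w = [[props[i] * normal_pdf(x, mus[i], sigmas[i]) for x in xs]
             for i in range(c)]
        for j in range(n):
            total = sum(w[i][j] for i in range(c)) or 1e-300
            for i in range(c):
                w[i][j] /= total
        # M-step: re-estimate proportions, means, and standard deviations.
        for i in range(c):
            weight = max(sum(w[i]), 1e-12)
            props[i] = weight / n
            mus[i] = sum(wij * x for wij, x in zip(w[i], xs)) / weight
            var = sum(wij * (x - mus[i]) ** 2 for wij, x in zip(w[i], xs)) / weight
            sigmas[i] = max(math.sqrt(var), floor)
    ln_l = sum(
        math.log(max(sum(props[i] * normal_pdf(x, mus[i], sigmas[i])
                         for i in range(c)), 1e-300))
        for x in xs
    )
    return props, mus, sigmas, ln_l


def aic(ln_l, c):
    """AIC = -2 ln L + 2K, with K = 3c - 1 assumed for a c-normal mixture."""
    return -2.0 * ln_l + 2.0 * (3 * c - 1)


def assign_codes(values, props, mus, sigmas):
    """Code each value by the component (in ascending order of mean) with highest w."""
    order = sorted(range(len(mus)), key=lambda i: mus[i])
    codes = []
    for x in values:
        dens = [props[i] * normal_pdf(x, mus[i], sigmas[i]) for i in order]
        total = sum(dens)
        post = [d / total for d in dens]
        codes.append(max(range(len(post)), key=post.__getitem__))
    return codes


def finite_mixture_codes(values, max_components=3):
    """Fit 1..max_components mixtures, keep the lowest AIC, and assign codes."""
    fits = {c: fit_mixture(values, c) for c in range(1, max_components + 1)}
    best = min(fits, key=lambda c: aic(fits[c][3], c))
    props, mus, sigmas, _ = fits[best]
    return best, assign_codes(values, props, mus, sigmas)


# Invented species means forming two well-separated groups.
means = [74, 75, 76, 76, 77, 78, 78, 79, 79, 80, 80, 81, 82, 82, 83, 84,
         124, 126, 127, 128, 129, 129, 130, 131, 132, 134, 136]
best_c, codes = finite_mixture_codes(means)
print(best_c, codes)
```

EM only finds a local maximum of the likelihood, so a careful analysis would restart it from several initial values; the single deterministic start used here keeps the sketch short. Values whose highest posterior probability is low could be coded as uncertain instead of being forced into a component.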
Example 1

The intermembral index of anthropoid primates was coded as an example of the finite mixture method as applied to taxon means. The intermembral index is the ratio of forelimb length to hind limb length × 100. This character was chosen because data are available for a large number of species (91). We do not expect this character to reflect the true anthropoid phylogeny, but the degree of homoplasy present is irrelevant to the coding method. A histogram of anthropoid intermembral indices is shown in Figure 3 (data from Fleagle, 1988). Using the BMDP statistics program, parameters of four different mixtures were estimated for these data (one-normal, two-normal, three-normal, four-normal; see Table 1). The three-normal mixture (Fig. 3) is favored because it has the lowest AIC. In addition, it is significantly better than the one- and two-normal mixtures according to the G-test (Table 2). It is not significantly better than the four-normal mixture because that function has broken the middle component of the three-normal mixture into two overlapping distributions. As a result, the probability density functions of these mixtures are nearly identical in shape, producing nearly identical values for L. The AIC favors the three-normal mixture because it requires fewer parameters and thus is the simpler explanation. Posterior probabilities and codes are presented in Table 3.

FIGURE 3. Distribution of the intermembral index among anthropoids. (a) Histogram of species means. (b) Three-normal mixture density function that describes histogram.

TABLE 1. Estimated function parameters for anthropoid intermembral index data. Parameters and likelihood statistics of four mixture functions were obtained using maximum likelihood estimation.

Parameter    1 normal    2 normal    3 normal    4 normal
μ1             90.9        86.1        78.9        78.9
σ1             17.2        10.2         4.1         4.1
p1             1.00        0.89        0.55        0.56
μ2                        131.7        97.4        96.2
σ2                          7.9         4.8         3.6
p2                         0.11        0.34        0.22
μ3                                    130.9        99.2
σ3                                      8.7         5.6
p3                                     0.11        0.11
μ4                                                131.0
σ4                                                  8.6
p4                                                 0.11
ln L         -388.2      -368.1      -351.6      -351.0

TABLE 2. Significance tests of mixture functions for anthropoid intermembral index data. Results of the AIC and G-test support the three-normal mixture as the model that best explains the distribution of anthropoid species means.

AIC
Mixture model    ln L       AIC
1 normal        -388.2     778.4
2 normals       -368.1     741.2
3 normals       -351.6     711.2
4 normals       -351.0     713.0

G-test
Comparison         G                             df    P
1 vs. 2 normals    -2(-388.2 + 368.1) = 40.2     4     <0.001
2 vs. 3 normals    -2(-368.1 + 351.6) = 33.0     4     <0.001
3 vs. 4 normals    -2(-351.6 + 351.0) = 1.2      4     n.s.
1 vs. 3 normals    -2(-388.2 + 351.6) = 73.2     8     <0.001

TABLE 3. Posterior probabilities and codes of anthropoid intermembral indices. Values are species means of the intermembral index (IMB) and the probabilities that any given species should receive code 0, 1, or 2.

                                        Posterior probabilities
Species                       Mean IMB     Code 0   Code 1   Code 2
Pithecinae
  Pithecia pithecia              75         1.0      0        0
  P. monachus                    77         1.0      0        0
  Chiropotes satanas             83         0.99     0.01     0
  Cacajao calvus                 83         0.99     0.01     0
Aotinae
  Aotus trivirgatus              74         1.0      0        0
  Callicebus moloch              74         1.0      0        0
  C. personatus                  73         1.0      0        0
Cebinae
  Cebus apella                   82         1.0      0        0
  C. albifrons                   82         1.0      0        0
  Saimiri sciureus               80         1.0      0        0
Atelinae
  Alouatta seniculus             97         0        1.0      0
  A. palliata                    98         0        1.0      0
  A. caraya                      97         0        1.0      0
  Lagothrix lagothricha          98         0        1.0      0
  Brachyteles arachnoides       104         0        1.0      0
  Ateles paniscus               105         0        0.99     0.01
  A. geoffroyi                  105         0        0.99     0.01
  A. fusciceps                  103         0        1.0      0
  A. belzebuth                  109         0        0.88     0.12
Callitrichinae
  Callimico goeldi               69         1.0      0        0
  Saguinus fuscicollis           79         1.0      0        0
  S. mystax                      76         1.0      0        0
  S. labiatus                    76         1.0      0        0
  S. imperator                   75         1.0      0        0
  S. midas                       74         1.0      0        0
  S. oedipus                     74         1.0      0        0
  S. leucopus                    74         1.0      0        0
  Leontopithecus rosalia         88         0.53     0.47     0
  Callithrix argentata           76         1.0      0        0
  C. jacchus                     75         1.0      0        0
  C. penicillata                 75         1.0      0        0
  Cebuella pygmaea               82         1.0      0        0
Cercopithecinae
  Macaca nemestrina              92         0.02     0.98     0
  M. tonkeana                    95         0        1.0      0
  M. ochreata                   100         0        1.0      0
  M. brunescens                  99         0        1.0      0
  M. hecki                       93         0.01     0.99     0
  M. nigra                       94         0        1.0      0
  M. assamensis                  96         0        1.0      0
  M. thibetana                   95         0        1.0      0
  M. fascicularis                93         0.01     0.99     0
  M. mulatta                     93         0.01     0.99     0
  M. arctoides                   98         0        1.0      0
  Cercocebus albigena       [illegible]     1.0      0        0
  C. galeritus                   84         0.98     0.02     0
  C. torquatus                   83         0.99     0.01     0
  Papio hamadryas                95         0        1.0      0
  P. anubis                      97         0        1.0      0
  P. cynocephalus                96         0        1.0      0
  Mandrillus sphinx              95         0        1.0      0
  Theropithecus gelada          100         0        1.0      0
  Cercopithecus mitis            82         1.0      0        0
  C. nictitans                   82         1.0      0        0
  C. ascanius                    79         1.0      0        0
  C. cephus                      81         1.0      0        0
  C. mona                        86         0.88     0.12     0
  C. diana                       79         1.0      0        0
  C. neglectus                   82         1.0      0        0
  C. aethiops                    83         0.99     0.01     0
  Allenopithecus nigroviridis    84         0.98     0.02     0
  Miopithecus talapoin           83         0.99     0.01     0
  Erythrocebus patas             92         0.02     0.98     0
Colobinae
  Colobus guereza                79         1.0      0        0
  C. polykomos                   79         1.0      0        0
  Piliocolobus badius            87         0.74     0.26     0
  Procolobus verus               80         1.0      0        0
  Presbytis entellus             83         0.99     0.01     0
  P. johnii                      80         1.0      0        0
  P. melalophos                  78         1.0      0        0
  P. aygula                      76         1.0      0        0
  P. rubicunda                   76         1.0      0        0
  P. frontata                    76         1.0      0        0
  P. hosei                       75         1.0      0        0
  P. obscura                     83         0.99     0.01     0
  P. cristata                    82         1.0      0        0
  P. pileata                     82         1.0      0        0
  Nasalis larvatus               94         0        1.0      0
  Pygathrix nemaeus              94         0        1.0      0
Hominoidea
  Symphalangus syndactylus      147         0        0        1.0
  Hylobates concolor            140         0        0        1.0
  H. hoolock                    129         0        0        1.0
  H. klossi                     126         0        0        1.0
  H. lar                        130         0        0        1.0
  H. agilis                     129         0        0        1.0
  H. moloch                     127         0        0        1.0
  H. muelleri                   129         0        0        1.0
  Pongo pygmaeus                139         0        0        1.0
  Pan troglodytes               105         0        0.99     0.01
  P. paniscus                   102         0        1.0      0
  Gorilla gorilla               116         0        0.01     0.99
  Homo sapiens                   72         1.0      0        0

Finite Mixture Coding of Individual Specimens

In the case of taxon means, FMC operates on the expectation that sample means taken from species that share a code will be normally distributed. When applied to individual specimens, this expectation changes. If a given measurement has a characteristic distribution within species and if specimens are drawn from species that share a code, then the pooled distribution of all of those specimens should be of the same general form as that observed within species. Thus, unlike taxon means, individuals will not always be normally distributed, and in the mixture density function, $g(x; \theta)$ may not be normal. If $g(x; \theta)$ is not normal, then either its general form must be specified (or at least approximated; e.g., Pearson et al. [1992] provided a density function for skewed distributions) or the data must be transformed such that there is an expectation of normality (cf. Sokal and Rohlf, 1981). Following transformation or the specification of $g(x; \theta)$, FMC proceeds as before.

The final step of the analysis is the assignment of codes. Ideally, one would like to assign codes based on the placement of individual specimens within component distributions. However, this coding criterion is complicated by the fact that component distributions may overlap, meaning that specimens from a single species may belong to more than one component (overlap among species is not unique to FMC; all coding methods allow overlap among taxa with different codes). The procedure used to assign codes varies depending on the general form of the component distributions. For symmetric or nearly symmetric distributions, taxa are coded in the usual way: a species receives the same code as the component distribution within which its mean falls. Thus, a species mean is treated as if it were a variate, and posterior probabilities are calculated for its value. Barring strong asymmetry, this procedure will ensure that most of the individuals in a species will belong to the same component distribution. If the component distributions are known to be strongly asymmetrical, then posterior probabilities are calculated for all of the specimens in a species and the species receives the code of the component to which the majority of its specimens belong.

Example 2

Hominoid facial shape was coded as an example of how the finite mixture method can be applied to individual specimens. Facial shape, as defined here, is a ratio of breadth across the orbits to facial height. This ratio was coded for samples of Hylobates lar, Pongo pygmaeus, Pan troglodytes, Gorilla gorilla, and Homo sapiens (data from Chamberlain, 1987). Each species was represented by 10 males and 10 females. Ratios are not always normally distributed. Thus, before FMC could be used, the probability density function of the ratio needed to be specified or the data set needed to be transformed.

The logarithmic transformation is the most commonly used method of normalizing data. A histogram of log-transformed facial shape ratios among hominoids is shown in Figure 4 (data from Chamberlain, 1987). Parameters of four different mixtures were estimated for these data (one-normal, two-normal, three-normal, four-normal; see Table 4). As before, the three-normal mixture (Fig. 4) is favored because it has the lowest AIC. It is significantly better than the one- and two-normal mixtures according to the G-test (Table 5). Posterior probabilities and codes are presented in Table 6.

FIGURE 4. Distribution of facial shape ratio among hominoids. (a) Histogram of hominoid individuals. (b) Three-normal mixture density function that describes histogram. The horizontal axis is the log of the facial shape ratio.

TABLE 4. Estimated function parameters for log-transformed facial shape data. Parameters and likelihood statistics of four mixture functions were obtained using maximum likelihood estimation.

Parameter    1 normal    2 normal    3 normal    4 normal
μ1            0.125       0.047       0.061       0.043
σ1            0.011       0.003       0.004       0.003
p1            1.000       0.576       0.657       0.400
μ2                        0.230       0.206       0.090
σ2                        0.003       0.00009     0.006
p2                        0.424       0.154       0.300
μ3                                    0.280       0.206
σ3                                    0.001       0.00008
p3                                    0.189       0.131
μ4                                                0.280
σ4                                                0.001
p4                                                0.169
ln L          82.9        92.0        98.6        98.6

TABLE 6. Posterior probabilities and codes of log-transformed hominoid facial shape data. Values are species means of facial shape and the probabilities that any given species should receive code 0, 1, or 2.

                                      Posterior probabilities
Species            Log facial shape    Code 0   Code 1   Code 2
Gorilla gorilla         0.027           1.0      0        0
Pongo pygmaeus          0.033           1.0      0        0
Pan troglodytes         0.089           1.0      0        0
Homo sapiens            0.200           0.07     0.91     0.02
Hylobates lar           0.274           0.01     0        0.99

An alternative to transforming the data is to specify the probability density function of a ratio of two normally distributed measurements. A simplified version of this function is (Strait, in prep.):

$$f(z) = \ldots \tag{9}$$

where $z = x/y$ is a ratio of two normally distributed measurements; $a = \sigma_y^2 z^2 - 2\rho\sigma_x\sigma_y z + \sigma_x^2$; $b = \mu_y(-\sigma_y^2 k z + \rho\sigma_x\sigma_y z + \rho\sigma_x\sigma_y k - \sigma_x^2)$; $c = \mu_y^2(\sigma_y^2 k^2 - 2\rho\sigma_x\sigma_y k + \sigma_x^2)$; $\rho$ is a coefficient of correlation between x and y; and $k = \mu_x/\mu_y$. This simplification is valid if the following conditions are met:

$$\ldots \tag{10}$$

If the density function is entered in this form, likelihood estimation will not be possible because for any given value of z there are an infinite number of values of x and y. Thus, there are an infinite number of values for $\mu_y$, and the likelihood function cannot be maximized at any given set of parameters. To account for this, all of the values of x and y are scaled such that $\mu_y$ for each species is equal to 1. This scaling procedure is valid because taxa that are similar with respect to z need not be similar with respect to x and y (i.e., taxa may be similar in shape but be different in size). Thus, the parameters of this density function have no real meaning except as scaled variables. Furthermore, by setting $\mu_y$ equal to 1, the relative values of $k$, $\sigma_x$, $\sigma_y$, and $\rho$ do not change. If $\mu_y$ is set at 1, then it drops out of the function and

$$a = \sigma_y^2 z^2 - 2\rho\sigma_x\sigma_y z + \sigma_x^2$$
$$b = -\sigma_y^2 k z + \rho\sigma_x\sigma_y z + \rho\sigma_x\sigma_y k - \sigma_x^2$$
$$c = \sigma_y^2 k^2 - 2\rho\sigma_x\sigma_y k + \sigma_x^2 \tag{11}$$

Finally, $\rho$ must be estimated in advance because $\rho$ interacts with $\sigma_x$ and $\sigma_y$ to determine the peakedness of the function (as $\rho$ approaches 1, the function will become more peaked). Thus, the likelihood function may be maximized equally well at several values of $\rho$, $\sigma_x$, and $\sigma_y$, which prevents the computer program from choosing any one set of values. Estimation of $\rho$ in advance should be valid if correlation coefficients are not too variable within the taxa of interest. To be conservative, $\rho$ should be estimated as the maximum, the minimum, and the average observable values within the taxa being studied. If equivalent codes are obtained using each of these values for $\rho$, then the results will be reliable.

Mixtures of ratio density functions were fit to the untransformed facial shape data set. $\rho$ was estimated at 0.2, 0.5, and 0.7. For each value of $\rho$, codes were equivalent to those obtained using the log-transformed data set.

DISCUSSION

Comparisons between FMC and Other Coding Procedures

In FMC, taxa share a code if it is probable that their distributions could have been sampled from a single statistical population. Thus, in FMC, codes are produced using conventional principles of statistical inference, which is not the case for other coding procedures.

Five methods arguably produce arbitrary codes. Three methods (gap, generalized gap, and segment coding) require the researcher to arbitrarily choose the size of the critical gap or segment length.
A fourth method, gap weighting, produces arbitrary codes in the sense that all continuous characters have the same number of codes, which is equal to the maximum allowable by a given parsimony computer program. The method of Almeida and Bisby (1984) forces the researcher to subjectively identify points along the range of a character in which there is minimal overlap between species.

Divergence coding and homogeneous subset coding (HSC) employ statistical tests and thus do not create arbitrary codes. However, the codes are not directly produced by the tests. For instance, HSC uses posterior comparisons tests to identify statistically homogeneous sets of taxa, but when subsets overlap, taxa in the same subset do not always receive the same code. Rather, taxa share a code when they share a unique pattern of membership in subsets. From a statistical standpoint, it is not clear why the pattern of membership in overlapping subsets is a criterion that should reveal which taxa are meaningfully discernible from each other.

TABLE 5. Significance tests of mixture functions for log-transformed hominoid facial shape data. Results of the AIC and G-test support the three-normal mixture as the model that best explains the distribution of individuals.

AIC
Mixture model    ln L      AIC
1 normal         82.9     -163.8
2 normals        92.0     -179.0
3 normals        98.6     -189.2
4 normals        98.6     -186.2

G-test
Comparison         G                          df    P
1 vs. 2 normals    -2(82.9 - 92.0) = 18.2     4     <0.005
2 vs. 3 normals    -2(92.0 - 98.6) = 13.2     4     <0.01
3 vs. 4 normals    -2(98.6 - 98.6) = 0        4     n.s.
1 vs. 3 normals    -2(82.9 - 98.6) = 31.4     8     <0.001

Sampling Independence

Felsenstein (1985, 1988) and Harvey and Pagel (1991) noted that for purposes of statistical comparison, species cannot be considered independent entities. As a result of phylogenetic hierarchy, species naturally group into clusters. Because most statistical procedures assume independence, conclusions based on analyses that fail to recognize hierarchy may be invalid.
Coding procedures that employ statistical tests or parameters may be subject to these criticisms. Felsenstein (1985, 1988) and Harvey and Pagel (1991) suggested that nonindependence can be accounted for if knowledge about phylogeny is used to correct for the effect of hierarchical clustering. Unfortunately, such an approach cannot be applied to coding methods because by definition the phylogeny is unknown prior to the analysis.

Finite mixture coding of species means also relies on an assumption of sampling independence. The central limit theorem states that when large samples are drawn independently from a population, the means of those samples will be normally distributed. If independence is not allowed, then the central limit theorem is invalid and normality cannot be expected. Because FMC requires an expectation of how species means are distributed, the method would not be applicable. However, although phylogenetic hierarchy is a certainty, phylogenetic inertia is not. Felsenstein (1985:6) noted that the assumption of independence is valid so long as "characters respond essentially instantaneously to natural selection in the current environment, so that phylogenetic inertia is essentially absent." Thus, FMC of taxon means, like other coding methods that require sampling independence, must assume that phylogenetic inertia is absent or slight.

Finite mixture coding of individual specimens does not require such equivocation. Consider an extreme case of phylogenetic inertia in which all descendant species of a given common ancestor inherit identical specieswide distributions for a given measurement. When specimens are sampled from these species, they are taken from nonindependent but identical populations. When pooled, such specimens would be distributed as if they were independent samples taken from a single population, and the form of that distribution would be the same as that observed within species.
This model is precisely the one used by FMC to define species that share a code. Now consider a more realistic case in which the degree of similarity between species varies according to recency of common ancestry (where recency refers to both time and branching pattern). If a set of species were descended from a relatively recent common ancestor, then their within-species distributions for the given measurement would all be very similar (barring natural selection or other factors). Thus, specimens sampled from these species would be taken from nonindependent but nearly identical populations. Such specimens would be distributed approximately as if they were independent samples from a single population, and the form of that distribution would be approximately the same as that observed within species. Thus, if phylogenetic inertia were the only factor influencing the degree of similarity between species, FMC of individual specimens would assign codes to clusters of closely related (and recently divergent) species, particularly clusters separated by relatively long divergence times.

Limitations of FMC

A common criticism of coding methods that employ statistical tests is that they are overly dependent on sample size (Archie, 1985; Felsenstein, 1988). In general, statistical tests are better able to discriminate between groups as sample sizes increase. Thus, for a given set of taxa, the codes produced by a procedure might vary simply as a result of sampling. This criticism applies to FMC, although in a slightly different manner than it does to methods such as HSC. Large sample sizes increase the degrees of freedom of the posterior comparisons tests used by HSC, thereby increasing the ability of those tests to identify statistically significant differences between taxa.
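To see where sample size enters FMC, it helps to have the three steps of the method in miniature: likelihood estimation of mixture parameters, model choice by the AIC, and assignment by highest posterior probability. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes univariate normal components, fits them with a plain EM loop, and uses the common form AIC = -2 lnL + 2p with p = 3k - 1 free parameters; the penalty actually used for Table 5 is not restated in this section. All function names, the initialisation scheme, and the simulated data are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def component_densities(x, weights, means, sds):
    """Weighted density of x under each component of the mixture."""
    return [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]


def fit_mixture(data, k, iterations=200):
    """Step 1: estimate a k-normal mixture by EM.
    Returns (weights, means, sds, log-likelihood)."""
    xs = sorted(data)
    n = len(xs)
    grand_mean = sum(xs) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n)
    # Hypothetical initialisation: cut the sorted data into k equal blocks.
    blocks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [1.0 / k] * k
    means = [sum(b) / len(b) for b in blocks]
    sds = [grand_sd] * k
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in xs:
            dens = component_densities(x, weights, means, sds)
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing component
    lnl = sum(math.log(sum(component_densities(x, weights, means, sds))) for x in xs)
    return weights, means, sds, lnl


def aic(lnl, k):
    """Step 2: AIC with 3k - 1 free parameters (assumed 2p penalty)."""
    return -2.0 * lnl + 2.0 * (3 * k - 1)


def assign_code(x, weights, means, sds):
    """Step 3: code = index of the component with the highest posterior."""
    dens = component_densities(x, weights, means, sds)
    return dens.index(max(dens))


# Hypothetical data: two well-separated populations of 60 values each.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(60)]
        + [rng.gauss(10.0, 1.0) for _ in range(60)])

fits = {k: fit_mixture(data, k) for k in (1, 2)}
best_k = min(fits, key=lambda k: aic(fits[k][3], k))
weights, means, sds, _ = fits[best_k]
codes = [assign_code(x, weights, means, sds) for x in data]
print("components chosen by AIC:", best_k)
```

In this sketch the sample enters only through the log-likelihood, so shrinking the two samples or moving their means closer together reduces the likelihood gain of the two-normal model relative to its AIC penalty. The EM loop has no safeguards beyond a floor on the standard deviation and is meant only for well-behaved illustrative data.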
With respect to FMC, large sample sizes mean that sample distributions will better approximate the population distributions from which they were drawn, so those populations are more likely to be detected by likelihood estimation.

A corollary of the sample size problem is that FMC requires relatively large samples of either taxon means or individual specimens to identify a code. Thus, when examining taxon means, a unique species might not receive its own code even if it is a morphological outlier, because one point along a distribution may not be enough to define a distinct statistical population. Species with unique means can receive their own code if codes are based on a distribution of individual specimens. However, when coding specimens, a similar problem is encountered: a poorly represented but divergent species might not appear distinct. This problem is particularly acute when examining fossil taxa.

When the method is applied to specimens, FMC relies on a reasonable expectation of how a measurement is distributed within species. If the wrong type of distribution is chosen, then the codes may be inappropriate. This problem can be avoided by transforming the data such that they conform to a particular type of distribution (e.g., normal).

CONCLUSION

Finite mixture coding differs from other coding methods in that it produces codes as the direct output of a statistical procedure. This procedure has three stages: (1) parameters are estimated for a number of mixture density functions that describe a given data set, (2) a goodness-of-fit criterion is used to identify the function that best describes the set, and (3) species are assigned to populations based on posterior probabilities, and codes are assigned to populations. Taxa receive the same code as the population within which their species means fall. Finite mixture coding has advantages and disadvantages that are characteristic of many statistical procedures. The method is advantageous in the sense that it differentiates among groups on the basis of nonarbitrary criteria, but it suffers from problems associated with sample size. Unlike some other coding methods, FMC (of specimens) is tenable even when it is assumed that species cannot be independently sampled.

ACKNOWLEDGMENTS

We thank F. E. Grine, T. C. Rae, F. J. Rohlf, W. L. Jungers, S. Leigh, R. Cowan, and the members of the Numerical Taxonomy Discussion Group at SUNY-Stony Brook for providing valuable comments and criticisms. This paper was considerably improved by comments from D. Cannatella and two anonymous reviewers. This research was supported by a National Science Foundation grant (BNS 9120117) to F. E. Grine.

REFERENCES

AKAIKE, H. 1974. A new look at the statistical model identification. Inst. Electr. Electron. Trans. Automatic Control 19:716-723.
ALMEIDA, M. T., AND F. A. BISBY. 1984. A simple method for establishing taxonomic characters from measurement data. Taxon 33:405-409.
ARCHIE, J. W. 1985. Methods for coding variable morphological features for numerical taxonomic analysis. Syst. Zool. 34:326-345.
BOOKSTEIN, F. L. 1994. Can biometrical shape be a homologous character? Pages 198-227 in Homology: The hierarchical basis of comparative biology (B. K. Hall, ed.). Academic Press, San Diego, California.
CHAMBERLAIN, A. T. 1987. A taxonomic review and phylogenetic analysis of Homo habilis. Ph.D. Dissertation, Univ. Liverpool, Liverpool, England.
CHAPPILL, J. A. 1989. Quantitative characters in phylogenetic analysis. Cladistics 5:217-234.
COLLESS, D. H. 1980. Congruence between morphometric and allozyme data for Menidia species: A reappraisal. Syst. Zool. 29:288-299.
CRANSTON, P. S., AND C. J. HUMPHRIES. 1988. Cladistics and computers: A chironomid conundrum? Cladistics 4:72-92.
EVERITT, B. S. 1981. A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivar. Behav. Res. 16:171-180.
EVERITT, B. S. 1985. Mixture distributions. Pages 559-560 in Encyclopedia of statistical sciences. Wiley, New York.
FALCONER, D. S. 1981. Introduction to quantitative genetics. Longman, London.
FARRIS, J. S. 1990. Phenetics in camouflage. Cladistics 6:91-100.
FELSENSTEIN, J. 1985. Phylogenies and the comparative method. Am. Nat. 125:1-15.
FELSENSTEIN, J. 1988. Phylogenies and quantitative characters. Annu. Rev. Ecol. Syst. 19:445-471.
FLEAGLE, J. G. 1988. Primate evolution and adaptation. Academic Press, New York.
HARVEY, P. H., AND M. D. PAGEL. 1991. The comparative method in evolutionary biology. Oxford Univ. Press, Oxford, England.
MICKEVICH, M. F., AND M. F. JOHNSON. 1976. Congruence between morphological and allozyme data in evolutionary inference and character evolution. Syst. Zool. 25:260-270.
PEARSON, J. D., C. H. MORRELL, AND L. J. BRANT. 1992. Mixture models for investigating complex distributions. J. Quant. Anthropol. 3:325-345.
PIMENTEL, R. A., AND R. RIGGINS. 1987. The nature of cladistic data. Cladistics 3:201-209.
SIMON, C. 1983. A new coding procedure for morphometric data with an example from periodical cicada wing veins. Pages 378-382 in Numerical taxonomy (J. Felsenstein, ed.). Springer-Verlag, Berlin.
SOKAL, R. R., AND F. J. ROHLF. 1981. Biometry, 2nd edition. W. H. Freeman, New York.
STEVENS, P. F. 1991. Character states, morphological variation, and phylogenetic analysis: A review. Syst. Bot. 16:553-583.
STRAIT, P. T. 1989. A first course in probability and statistics with applications. Harcourt Brace Jovanovich, New York.
THIELE, K. 1993. The holy grail of the perfect character: The cladistic treatment of morphometric data. Cladistics 9:275-304.
THORPE, R. S. 1984. Coding morphometric characters for constructing distance Wagner networks. Evolution 38:244-255.

Received 30 August 1994; accepted 3 August 1995
Associate Editor: David Cannatella