Syst. Biol. 45(1):67-78, 1996
FINITE MIXTURE CODING: A NEW APPROACH TO CODING
CONTINUOUS CHARACTERS
DAVID S. STRAIT,1-3 MARC A. MONIZ, 1 AND PEGGY T. STRAIT 2
1Doctoral Program in Anthropological Sciences, State University of New York,
Stony Brook, New York 11794-4364, USA
2Department of Mathematics, City University of New York, Flushing, New York 11367, USA
Abstract.—Finite mixture coding (FMC) is a new method of coding continuous characters. FMC
uses a three-step goodness-of-fit procedure to assign codes. First, for a given measurement, parameters are estimated for a number of density functions that describe a data set either of species
means or of measurements of specimens from several species. The density functions represent
either a single population or a mixture of populations (e.g., a mixture of two normal distributions).
Next, a goodness-of-fit criterion (the Akaike information criterion) is used to determine which of
the density functions best describes the data set. The best function indicates the number of populations into which the variates of the data set can be segregated. Finally, each species is assigned to
the population for which its probability of membership is highest. Each population is then assigned a code, and species falling within the same population share the same code. Although
other coding methods incorporate statistical tests or parameters into the coding process, FMC is
the only method that produces codes as the direct output of a statistical procedure. [Code; continuous character; finite mixture analysis; likelihood estimation; cladistics.]
There has been considerable debate concerning methods of coding continuous
characters (Mickevich and Johnson, 1976;
Simon, 1983; Almeida and Bisby, 1984;
Thorpe, 1984; Archie, 1985; Felsenstein,
1988; Chappill, 1989; Farris, 1990; Thiele,
1993). Because computer-assisted cladistic
analyses (e.g., PAUP, Hennig86, PHYLIP)
require that character states take the form
of discrete codes and because many morphological characters such as size and
shape tend to vary continuously, a coding
procedure is necessary in many studies.
Unfortunately, there is no consensus on
which method is best. This paper introduces a new method of coding continuous
characters, the finite mixture method. Unlike other methods, finite mixture coding
(FMC) produces codes as the direct output
of a statistical procedure. Consequently,
FMC distinguishes among taxa using standard methods of statistical inference.
CONTINUOUS CHARACTERS
Some authors have suggested that continuously varying characters are of little use in cladistic analysis, either because of difficulties encountered in reducing the range of variation into discrete states or because it is unclear whether metric characters provide valid cladistic information (Pimentel and Riggins, 1987; Cranston and Humphries, 1988; Felsenstein, 1988; Bookstein, 1994). However, continuous characters comprise a large proportion of the morphological features present in organisms, many of them are heritable, and some can be used to discriminate among groups of taxa (Falconer, 1981; Chappill, 1989; Stevens, 1991; Thiele, 1993). Consequently, some of these characters may be useful in estimating phylogeny. To use them, a coding procedure must be employed.
CODING PROCEDURES
With respect to continuous characters, coding procedures are the methods used to determine whether taxa are discernible for a given trait. Disagreement over coding techniques has focused on seven methods: gap coding (Mickevich and Johnson, 1976), segment coding (Simon, 1983; Thorpe, 1984; Chappill, 1989), divergence coding (Thorpe, 1984), homogeneous subset coding (Simon, 1983; see also Archie, 1985), generalized gap coding (Archie, 1985), gap weighting (Thiele, 1993), and the coding
method of Almeida and Bisby (1984).
Range coding (Colless, 1980) is omitted
from this list because it is actually a
weighting technique rather than a coding
method. None of these methods produce
codes as the direct output of a statistical
test. Rather, they use statistical tests or parameters as guidelines for applying other
criteria that are used to discriminate taxa.
FINITE MIXTURE CODING
Finite mixture coding adopts conventional notions of statistical inference. In statistics, groups are said to be significantly
different if the probability that their observed distributions could have been sampled from a single population is low. However, if this probability is higher than a
designated value, then the groups are not
meaningfully discernible. The finite mixture method applies these principles to
character coding. If with respect to a given
measurement the distribution of taxa is
such that it is probable that they could
have been sampled from a single theoretical statistical population (not a biological
population), then the taxa are not meaningfully discernible and are assigned the
same code. Furthermore, the number of
codes present in a set of taxa is equal to
the number of discernible statistical populations of taxa that can be identified. This
number is identified using finite mixture
analysis and likelihood estimation methods.
Although species are said to share a
code when they are distributed as a single
statistical population, such populations can
be distributed in an infinite number of
ways (i.e., normally distributed, Poisson
distributed, etc.). To identify statistical
populations of taxa, it is necessary to have
an expectation of the general form of that
population's distribution. These expectations vary depending on whether the data
being analyzed consist of species means or
observations of individual specimens within species.
Finite Mixture Coding of Taxon Means
As defined above, species share a code
for a given measurement when they are
distributed as a single statistical population. Thus, such species can be treated as
samples from a single population. If one
assumes that species can be sampled independently, then the means of those species should be distributed normally. The
central limit theorem states that when
large samples are drawn from a population, the means of those samples will be
normally distributed. This statement is
true regardless of the actual distribution of
the measurement within species. If the distribution within species is not normal, then
the distribution of species means will approach normality as sample size increases
(e.g., Strait, 1989). Thus, the number of
states present in a given set of species
should correspond to the number of normal distributions that can be identified. Finite mixture analysis can be used to identify multiple normal distributions within a
data set.
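The central limit theorem argument above can be checked with a small simulation. The following is a minimal sketch, not part of the original analysis: the exponential population, the sample sizes, and the function names are all assumptions chosen for illustration. It draws repeated samples from a strongly right-skewed population and shows that the skewness of the sample means shrinks as sample size grows.

```python
import random

def skewness(values):
    """Third standardized moment of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n

def sample_means(n_per_sample, n_samples, rng):
    """Means of repeated samples from a right-skewed (exponential) population."""
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(n_per_sample)]
        means.append(sum(sample) / n_per_sample)
    return means

rng = random.Random(0)  # fixed seed so the sketch is repeatable
skew_small = skewness(sample_means(2, 4000, rng))   # small samples: means stay skewed
skew_large = skewness(sample_means(50, 4000, rng))  # large samples: means near symmetric
print(f"skewness of means, n = 2:  {skew_small:.2f}")
print(f"skewness of means, n = 50: {skew_large:.2f}")
```

The slow convergence for small samples is the reason a symmetrizing transformation is recommended later in the text when within-species samples are small.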
Applications of finite mixture modeling
were reviewed by Pearson et al. (1992).
Most of what follows is based on their discussion, and they presented a more detailed description of the method (see also
Everitt, 1985). They also provided a list of
computer programs that can perform likelihood estimation.
A data set is said to be mixed if it is
composed of variates from more than one
population. Each of these populations is a
component of the total data set, and thus
they are called component distributions.
The data set is therefore a mixture of these
component distributions. The form of each
component distribution can be described
by an equation, the component density
function. Likewise, the shape of the mixture can be described by a mixture density
function. Both types of functions are written in terms of the population parameters
(e.g., mean, variance) of the component
distributions.
The number of component distributions
in a data set (and hence the number of
codes) is identified by what is essentially
a complex goodness-of-fit procedure. This
procedure has three stages. First, using
likelihood estimation methods, several
mixture density functions are fit to the
data set. Specifically, parameters are estimated for each of the component distributions in a mixture. The general form of
these distributions must be specified (i.e.,
normal, Poisson, chi-square, etc.), although
in the case of species means it is known
that they are normal. Thus, when given a
density function that describes a mixture
of two normal distributions, a mean and a
variance are estimated for each component. The best parameter values are those
that maximize the likelihood statistic, L.
This procedure is performed on several
mixture density functions (e.g., one-normal, two-normal, three-normal). The researcher determines the number of
mixtures to be examined.
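The first stage can be illustrated with a short fitting routine. This is a sketch under stated assumptions: it uses the expectation-maximization (EM) algorithm, which is one common way to maximize a mixture likelihood and is not necessarily the optimizer used by the program the authors ran. The function names, starting values, iteration count, and the synthetic data are all hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_two_normal(data, n_iter=200):
    """Estimate p, means and SDs of a two-normal mixture by EM; also return ln L."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    # Crude starting values (an assumption): split the sorted data in half.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    spread = (xs[-1] - xs[0]) / 4 or 1.0
    sigma = [spread, spread]
    p = 0.5
    for _ in range(n_iter):
        # E-step: probability that each variate belongs to the first component.
        w = []
        for x in xs:
            a = p * normal_pdf(x, mu[0], sigma[0])
            b = (1 - p) * normal_pdf(x, mu[1], sigma[1])
            w.append(a / (a + b))
        # M-step: weighted means, standard deviations, and mixing proportion.
        n1 = sum(w)
        n2 = n - n1
        mu = [sum(wi * x for wi, x in zip(w, xs)) / n1,
              sum((1 - wi) * x for wi, x in zip(w, xs)) / n2]
        var1 = sum(wi * (x - mu[0]) ** 2 for wi, x in zip(w, xs)) / n1
        var2 = sum((1 - wi) * (x - mu[1]) ** 2 for wi, x in zip(w, xs)) / n2
        sigma = [max(math.sqrt(var1), 1e-6), max(math.sqrt(var2), 1e-6)]
        p = n1 / n
    log_l = sum(
        math.log(p * normal_pdf(x, mu[0], sigma[0]) + (1 - p) * normal_pdf(x, mu[1], sigma[1]))
        for x in xs
    )
    return p, mu, sigma, log_l

# Synthetic "species means" from two known populations (hypothetical values).
rng = random.Random(1)
data = [rng.gauss(80, 4) for _ in range(60)] + [rng.gauss(130, 8) for _ in range(40)]
p_hat, mu_hat, sigma_hat, log_l = fit_two_normal(data)
print(f"p = {p_hat:.2f}, means = {mu_hat[0]:.1f}, {mu_hat[1]:.1f}, ln L = {log_l:.1f}")
```

Repeating the fit for one, two, three, or more components yields the set of likelihood statistics that the second stage compares.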
The second stage of the analysis identifies the mixture whose density function
provides the best fit with the data set. Using L, mixtures are compared using the
Akaike information criterion (AIC; Akaike,
1974). If, for instance, the two-normal
model is better than all others (i.e., one-normal, three-normal, etc.), then two component distributions (and thus two codes)
are likely to be present in the data set.
Finally, individual species are assigned
codes by calculating the probability of a
species mean being drawn from any given
component distribution.
One additional step at the beginning of
this procedure might be necessary, depending on the distribution of the character within species. The premise behind
coding taxon means is the central limit
theorem, but a measurement that is asymmetrically distributed requires large samples for the central limit theorem to hold.
Thus, in cases of small sample size, it is
advisable to transform the data such that
the distribution of the measurement within
species is symmetrical (using a log, arcsine, or other transformation; cf. Sokal and
Rohlf, 1981).
The general form of a mixture density function (Everitt, 1985; see also Pearson et al., 1992) is
$$f(x) = \sum_{i=1}^{c} p_i \, g(x; \theta_i), \tag{1}$$
where $f(x)$ is the mixture density function, $g(x; \theta_i)$ are the density functions of the component distributions, $c$ is the number of components in the mixture, $\theta_i$ are the parameters of the distributions, and $p_i$ are mixing proportions, i.e., the proportions of the total data set contributed by each component (e.g., distribution A makes up 40% of the data set, and distribution B makes up 60%). Because for the moment we are examining only the distribution of taxon means, we need to know only $g(x; \theta_i)$ for normal distributions. The general form of the density function of a normal distribution is
$$g(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \tag{2}$$
where $\mu$ is the distribution mean and $\sigma^2$ is the variance. Thus the density function of, for example, a two-normal mixture would be
$$f(x) = p \, \frac{1}{\sigma_1\sqrt{2\pi}} \exp\left[-\frac{(x - \mu_1)^2}{2\sigma_1^2}\right] + (1 - p) \, \frac{1}{\sigma_2\sqrt{2\pi}} \exp\left[-\frac{(x - \mu_2)^2}{2\sigma_2^2}\right]. \tag{3}$$
The likelihood function (Pearson et al., 1992), which is maximized, is
$$L = \prod_{j=1}^{n} f(x_j) = f(x_1) f(x_2) \cdots f(x_n), \tag{4}$$
where $x_j$ represents the variates (i.e., observations) of a data set.
A simple visual example explains why the maximum likelihood estimate identifies the best function parameters (Fig. 1). Suppose that one wishes to find the parameters of a normal density function that explains the distribution of a set of data consisting of seven variates, $x_1$ to $x_7$. In case A, the data set and density function have the same mean. In case B, the mean of the density function differs from that of the data set. In each case,
$$L = f(x_1) f(x_2) \cdots f(x_7). \tag{5}$$
In case A, the values of $f(x)$ for each of the data points will be high, resulting in a large value of L. In case B, L will be very small, because for each data point the value of $f(x)$ is low. Clearly, the density function of case A matches the data set better than that of case B. Therefore, a high L indicates an accurate estimation of function parameters (in this case, the mean).
FIGURE 1. Example of maximum likelihood estimation. Two probability density functions are fit to a distribution of seven variates. (a) f(x) for each variate is large, meaning that L will be large. This function clearly provides a better explanation of the data set. (b) f(x) for each variate is small, meaning that L will be small. [Both panels plot count and f(x).]
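The comparison between cases A and B can be reproduced numerically. The sketch below is illustrative only: the seven variates, the standard deviation, and the 10-unit shift used for case B are hypothetical values, not data from the paper.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """L = f(x_1) f(x_2) ... f(x_n) for a single normal density."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

# Seven hypothetical variates.
variates = [96.0, 98.0, 99.0, 100.0, 101.0, 102.0, 104.0]
data_mean = sum(variates) / len(variates)
sigma = 2.5  # assumed standard deviation, identical in both cases

# Case A: the density function shares the mean of the data set.
l_case_a = likelihood(variates, data_mean, sigma)
# Case B: the density function is centered 10 units away from the data.
l_case_b = likelihood(variates, data_mean + 10.0, sigma)

print(f"case A: L = {l_case_a:.3e}")
print(f"case B: L = {l_case_b:.3e}")
```

With larger data sets the product of many small densities underflows, so in practice the sum of log densities (ln L) is maximized instead; the location of the maximum is unchanged.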
After parameters have been estimated
for several mixture density functions, these
functions must be compared to determine
which provides the best fit with the data
set. The comparison is made using either
the AIC (Akaike, 1974) or the log-likelihood ratio test. The AIC is defined as
$$\mathrm{AIC} = -2 \ln L + 2K, \tag{6}$$
where K is the number of independent parameters estimated by the mixture model.
The best mixture is the one with the lowest
AIC, which favors mixtures with high likelihood statistics and few component distributions. Alternatively, one can use a log-likelihood ratio test to compare mixtures
in a pairwise fashion. However, Everitt
(1981) noted that this test is inaccurate
when sample sizes are less than 10 times
the number of parameters estimated by the
model (5 times, according to Pearson et al.,
1992), and even with many samples the
power of the test is low. The log-likelihood
ratio statistic G (Sokal and Rohlf, 1981) is
computed as
$$G = -2(\ln L_0 - \ln L_a), \tag{7}$$
where $L_0$ and $L_a$ are the likelihood statistics
of the two mixtures being compared. The
degrees of freedom for the G-test depend
on the type of mixtures being compared.
If the mixtures contain the same number
of components (as may happen when normal and nonnormal mixtures are compared), then the degrees of freedom are
equal to the difference in the numbers of
parameters estimated by the two models.
If the mixtures have different numbers of
components, then the degrees of freedom
are equal to two times the difference in the
number of parameters, not including the
mixing proportions.
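Both criteria are simple functions of ln L. The sketch below shows the arithmetic under stated assumptions: the function names are hypothetical, the closed-form chi-square tail is valid only for even degrees of freedom (which covers the 4 and 8 degrees of freedom used in this paper), and the value of K in the AIC call is purely illustrative. The ln L values are those reported in Table 2 for the one- to four-normal mixtures.

```python
import math

def aic(log_l, k):
    """Akaike information criterion: -2 ln L + 2K."""
    return -2.0 * log_l + 2.0 * k

def g_statistic(log_l0, log_la):
    """Log-likelihood ratio statistic: G = -2(ln L0 - ln La)."""
    return -2.0 * (log_l0 - log_la)

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x); this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# ln L values as reported in Table 2 (keys are the number of normal components).
log_l = {1: -388.2, 2: -368.1, 3: -351.6, 4: -351.0}

g_12 = g_statistic(log_l[1], log_l[2])   # one vs. two normals
g_34 = g_statistic(log_l[3], log_l[4])   # three vs. four normals
p_12 = chi2_upper_tail_even_df(g_12, 4)  # df = 2 x 2 parameters (mean, variance)
p_34 = chi2_upper_tail_even_df(g_34, 4)

print(f"1 vs. 2 normals: G = {g_12:.1f}, P = {p_12:.2g}")
print(f"3 vs. 4 normals: G = {g_34:.1f}, P = {p_34:.2g}")
print(f"AIC for ln L = -100 with an illustrative K = 3: {aic(-100.0, 3):.1f}")
```

The pairwise G-test answers whether adding a component improves the fit significantly, whereas the AIC ranks all candidate mixtures at once and penalizes each extra parameter.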
The AIC and the G-test should produce
comparable results. Noting the criticisms
of the G-test, we here use the AIC. However, for purposes of comparison and because many readers are familiar with the
G-test, the results of both are presented.
After the best mixture model has been
identified, one must specify the distributions to which individual taxa probably
belong, which is done by calculating the
posterior probabilities w of each taxon; w
is simply the probability that a data point
belongs to a given distribution (Fig. 2). It
is defined as (Pearson et al., 1992)
$$w_{ij} = \Pr(i \mid x_j) = \frac{p_i \, g(x_j; \theta_i)}{f(x_j)} \tag{8}$$
for $i = 1$ to $c$ and $j = 1$ to $n$. Because the
component distributions of a mixture may
overlap, some taxa may not be assigned to
components with absolute certainty. However, posterior probabilities at least allow
measurement of the level of uncertainty.
As a rule, a taxon is assigned to the component for which w is highest. Obviously,
if w is low, then the component assignment is questionable, and the researcher may wish to leave such assignments uncertain (meaning that the taxon will be coded as having an uncertain state).
FIGURE 2. Calculation of posterior probabilities. The probability w that a taxon belongs to any given component distribution is equal to its value as a function of the component distribution (A) divided by its value as a function of the mixture (B).
After membership in component distributions is established, coding is simple.
Each component receives a code in ascending order, starting with that containing the
smallest species means. Codes may be considered either ordered or unordered.
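The assignment rule can be sketched in a few lines. The parameter values below are the three-normal estimates reported later in Table 1 for the intermembral index; treating the second value of each triple as a standard deviation is an assumption, as are the function names and the optional `min_w` threshold (the text leaves the cutoff for an uncertain assignment to the researcher).

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# (mean, standard deviation, mixing proportion), in ascending order of the mean,
# so that the list index doubles as the code (0, 1, 2).
COMPONENTS = [(78.9, 4.1, 0.55), (97.4, 4.8, 0.34), (130.9, 8.7, 0.11)]

def posterior_probabilities(x, components):
    """w_i = p_i g(x; theta_i) / f(x) for every component i."""
    weighted = [p * normal_pdf(x, mu, sigma) for mu, sigma, p in components]
    mixture_density = sum(weighted)  # f(x)
    return [value / mixture_density for value in weighted]

def assign_code(x, components, min_w=0.0):
    """Code of the most probable component, or None if its w is below min_w."""
    w = posterior_probabilities(x, components)
    best = max(range(len(w)), key=w.__getitem__)
    return best if w[best] >= min_w else None

for mean_value in (72, 88, 102, 139):
    w = posterior_probabilities(mean_value, COMPONENTS)
    print(mean_value, [round(value, 2) for value in w], assign_code(mean_value, COMPONENTS))
```

A species mean of 88 sits between the first two components, so its highest posterior probability is only slightly above one half; calling `assign_code(88, COMPONENTS, min_w=0.9)` returns `None`, which corresponds to coding the state as uncertain.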
Example 1
The intermembral index of anthropoid
primates was coded as an example of the
finite mixture method as applied to taxon
means. The intermembral index is the ratio
of forelimb length to hind limb length ×
100. This character was chosen because
data are available for a large number of
species (91). We do not expect this character to reflect the true anthropoid phylogeny, but the degree of homoplasy present
is irrelevant to the coding method.
A histogram of anthropoid intermembral indices is shown in Figure 3 (data
from Fleagle, 1988). Using the BMDP statistics program, parameters of four different mixtures were estimated for these data
(one-normal, two-normal, three-normal,
four-normal; see Table 1). The three-normal mixture (Fig. 3) is favored because it
has the lowest AIC. In addition, it is significantly better than the one- and two-normal mixtures according to the G-test (Table 2). It is not significantly better than the four-normal mixture because that function has broken the middle component of the three-normal mixture into two overlapping distributions. As a result, the probability density functions of these mixtures are nearly identical in shape, producing nearly identical values for L. The AIC favors the three-normal mixture because it requires fewer parameters and thus is the simpler explanation. Posterior probabilities and codes are presented in Table 3.
FIGURE 3. Distribution of the intermembral index among anthropoids. (a) Histogram of species means. (b) Three-normal mixture density function that describes histogram. [Axes: count and f(x) against anthropoid intermembral index.]
TABLE 1. Estimated function parameters for anthropoid intermembral index data. Parameters and likelihood statistics of four mixture functions were obtained using maximum likelihood estimation.
Parameter   1 normal   2 normal   3 normal   4 normal
μ1          90.9       86.1       78.9       78.9
σ1          17.2       10.2       4.1        4.1
p1          1.00       0.89       0.55       0.56
μ2                     131.7      97.4       96.2
σ2                     7.9        4.8        3.6
p2                     0.11       0.34       0.22
μ3                                130.9      99.2
σ3                                8.7        5.6
p3                                0.11       0.11
μ4                                           131.0
σ4                                           8.6
p4                                           0.11
ln L        -388.2     -368.1     -351.6     -351.0
TABLE 2. Significance tests of mixture functions for anthropoid intermembral index data. Results of the AIC and G-test support the three-normal mixture as the model that best explains the distribution of anthropoid species means.
AIC
Mixture model   ln L      AIC
1 normal        -388.2    778.4
2 normals       -368.1    741.2
3 normals       -351.6    711.2
4 normals       -351.0    713.0
G-test
Comparison        G                                df   P
1 vs. 2 normals   -2(-388.2 + 368.1) = 40.2        4    <0.001
2 vs. 3 normals   -2(-368.1 + 351.6) = 33.0        4    <0.001
3 vs. 4 normals   -2(-351.6 + 351.0) = 1.2         4    n.s.
1 vs. 3 normals   -2(-388.2 + 351.6) = 73.2        8    <0.001
TABLE 3. Posterior probabilities and codes of anthropoid intermembral indices. Values are species means of the intermembral index (IMB) and the probabilities that any given species should receive code 0, 1, or 2.
Species                       Mean IMB      Code 0   Code 1   Code 2
Pithecinae
Pithecia pithecia             75            1.0      0        0
P. monachus                   77            1.0      0        0
Chiropotes satanas            83            0.99     0.01     0
Cacajao calvus                83            0.99     0.01     0
Aotinae
Aotus trivirgatus             74            1.0      0        0
Callicebus moloch             74            1.0      0        0
C. personatus                 73            1.0      0        0
Cebinae
Cebus apella                  82            1.0      0        0
C. albifrons                  82            1.0      0        0
Saimiri sciureus              80            1.0      0        0
Atelinae
Alouatta seniculus            97            0        1.0      0
A. palliata                   98            0        1.0      0
A. caraya                     97            0        1.0      0
Lagothrix lagothricha         98            0        1.0      0
Brachyteles arachnoides       104           0        1.0      0
Ateles paniscus               105           0        0.99     0.01
A. geoffroyi                  105           0        0.99     0.01
A. fusciceps                  103           0        1.0      0
A. belzebuth                  109           0        0.88     0.12
Callitrichinae
Callimico goeldi              69            1.0      0        0
Saguinus fuscicollis          79            1.0      0        0
S. mystax                     76            1.0      0        0
S. labiatus                   76            1.0      0        0
S. imperator                  75            1.0      0        0
S. midas                      74            1.0      0        0
S. oedipus                    74            1.0      0        0
S. leucopus                   74            1.0      0        0
Leontopithecus rosalia        88            0.53     0.47     0
Callithrix argentata          76            1.0      0        0
C. jacchus                    75            1.0      0        0
C. penicillata                75            1.0      0        0
Cebuella pygmaea              82            1.0      0        0
Cercopithecinae
Macaca nemestrina             92            0.02     0.98     0
M. tonkeana                   95            0        1.0      0
M. ochreata                   100           0        1.0      0
M. brunescens                 99            0        1.0      0
M. hecki                      93            0.01     0.99     0
M. nigra                      94            0        1.0      0
M. assamensis                 96            0        1.0      0
M. thibetana                  95            0        1.0      0
M. fascicularis               93            0.01     0.99     0
M. mulatta                    93            0.01     0.99     0
M. arctoides                  98            0        1.0      0
Cercocebus albigena           [illegible]   1.0      0        0
C. galeritus                  84            0.98     0.02     0
C. torquatus                  83            0.99     0.01     0
Papio hamadryas               95            0        1.0      0
P. anubis                     97            0        1.0      0
P. cynocephalus               96            0        1.0      0
Mandrillus sphinx             95            0        1.0      0
Theropithecus gelada          100           0        1.0      0
Cercopithecus mitis           82            1.0      0        0
C. nictitans                  82            1.0      0        0
C. ascanius                   79            1.0      0        0
C. cephus                     81            1.0      0        0
C. mona                       86            0.88     0.12     0
C. diana                      79            1.0      0        0
C. neglectus                  82            1.0      0        0
C. aethiops                   83            0.99     0.01     0
Allenopithecus nigroviridis   84            0.98     0.02     0
Miopithecus talapoin          83            0.99     0.01     0
Erythrocebus patas            92            0.02     0.98     0
Colobinae
Colobus guereza               79            1.0      0        0
C. polykomos                  79            1.0      0        0
Piliocolobus badius           87            0.74     0.26     0
Procolobus verus              80            1.0      0        0
Presbytis entellus            83            0.99     0.01     0
P. johnii                     80            1.0      0        0
P. melalophos                 78            1.0      0        0
P. aygula                     76            1.0      0        0
P. rubicunda                  76            1.0      0        0
P. frontata                   76            1.0      0        0
P. hosei                      75            1.0      0        0
P. obscura                    83            0.99     0.01     0
P. cristata                   82            1.0      0        0
P. pileata                    82            1.0      0        0
Nasalis larvatus              94            0        1.0      0
Pygathrix nemaeus             94            0        1.0      0
Hominoidea
Symphalangus syndactylus      147           0        0        1.0
Hylobates concolor            140           0        0        1.0
H. hoolock                    129           0        0        1.0
H. klossi                     126           0        0        1.0
H. lar                        130           0        0        1.0
H. agilis                     129           0        0        1.0
H. moloch                     127           0        0        1.0
H. muelleri                   129           0        0        1.0
Pongo pygmaeus                139           0        0        1.0
Pan troglodytes               105           0        0.99     0.01
P. paniscus                   102           0        1.0      0
Gorilla gorilla               116           0        0.01     0.99
Homo sapiens                  72            1.0      0        0
Finite Mixture Coding of Individual Specimens
In the case of taxon means, FMC operates on the expectation that sample means taken from species that share a code will be normally distributed. When applied to individual specimens, this expectation changes. If a given measurement has a characteristic distribution within species and if specimens are drawn from species that share a code, then the pooled distribution of all of those specimens should be of the same general form as that observed within species. Thus, unlike taxon means, individuals will not always be normally distributed, and in the mixture density function, $g(x; \theta)$ may not be normal. If $g(x; \theta)$ is not normal, then either its general form must be specified (or at least approximated; e.g., Pearson et al. [1992] provided a density function for skewed distributions) or the data must be transformed such that there is an expectation of normality (cf. Sokal and Rohlf, 1981). Following transformation or the specification of $g(x; \theta)$, FMC proceeds as before.
The final step of the analysis is the assignment of codes. Ideally, one would like to assign codes based on the placement of individual specimens within component distributions. However, this coding criterion is complicated by the fact that component distributions may overlap, meaning that specimens from a single species may belong to more than one component (overlap among species is not unique to FMC; all coding methods allow overlap among taxa with different codes). The procedure used to assign codes varies depending on the general form of the component distributions. For symmetric or nearly symmetric distributions, taxa are coded in the usual way: a species receives the same code as the component distribution within which its mean falls. Thus, a species mean is treated as if it were a variate, and posterior probabilities are calculated for its value. Barring strong asymmetry, this procedure will ensure that most of the individuals in a species will belong to the same component distribution. If the component distributions are known to be strongly asymmetrical, then posterior probabilities are calculated for all of the specimens in a species and the species receives the code of the component to which the majority of its specimens belong.
Example 2
Hominoid facial shape was coded as an example of how the finite mixture method can be applied to individual specimens. Facial shape, as defined here, is a ratio of breadth across the orbits to facial height. This ratio was coded for samples of Hylobates lar, Pongo pygmaeus, Pan troglodytes, Gorilla gorilla, and Homo sapiens (data from Chamberlain, 1987). Each species was represented by 10 males and 10 females. Ratios are not always normally distributed. Thus, before FMC could be used, the probability density function of the ratio needed
to be specified or the data set needed to be transformed.
The logarithmic transformation is the most commonly used method of normalizing data. A histogram of log-transformed facial shape ratios among hominoids is shown in Figure 4 (data from Chamberlain, 1987). Parameters of four different mixtures were estimated for these data (one-normal, two-normal, three-normal, four-normal; see Table 4). As before, the three-normal mixture (Fig. 4) is favored because it has the lowest AIC. It is significantly better than the one- and two-normal mixtures according to the G-test (Table 5). Posterior probabilities and codes are presented in Table 6.
FIGURE 4. Distribution of facial shape ratio among hominoids. (a) Histogram of hominoid individuals. (b) Three-normal mixture density function that describes histogram. [Axes: count and f(x) against log of facial shape ratio.]
An alternative to transforming the data is to specify the probability density function of a ratio of two normally distributed measurements. A simplified version of this function is (Strait, in prep.):
[Equation 9, the expression for f(z), is not legible in the source.]
where $z = x/y$ is a ratio of two normally distributed measurements; $a = \sigma_y^2 z^2 - 2\rho\sigma_x\sigma_y z + \sigma_x^2$; $b = \mu_y(-\sigma_y^2 k z + \rho\sigma_x\sigma_y z + \rho\sigma_x\sigma_y k - \sigma_x^2)$; $c = \mu_y^2(\sigma_y^2 k^2 - 2\rho\sigma_x\sigma_y k + \sigma_x^2)$; $\rho$ is a coefficient of correlation between x and y; and $k = \mu_x/\mu_y$. This simplification is valid if the following conditions are met:
[Equation 10, the list of conditions, is not legible in the source.]
If the density function is entered in this form, likelihood estimation will not be possible because for any given value of z there are an infinite number of values of x and y. Thus, there are an infinite number of values for $\mu_y$, and the likelihood function cannot be maximized at any given set of parameters. To account for this, all of the values of x and y are scaled such that $\mu_y$ for each species is equal to 1. This scaling procedure is valid because taxa that are similar with respect to z need not be similar with respect to x and y (i.e., taxa may be similar in shape but be different in size). Thus, the parameters of this density function have no real meaning except as scaled variables. Furthermore, by setting $\mu_y$ equal to 1, the relative values of $k$, $\sigma_x$, $\sigma_y$, and $\rho$ do not change. If $\mu_y$ is set at 1, then it drops out of the function and
$$a = \sigma_y^2 z^2 - 2\rho\sigma_x\sigma_y z + \sigma_x^2, \quad b = -\sigma_y^2 k z + \rho\sigma_x\sigma_y z + \rho\sigma_x\sigma_y k - \sigma_x^2, \quad c = \sigma_y^2 k^2 - 2\rho\sigma_x\sigma_y k + \sigma_x^2. \tag{11}$$
Finally, $\rho$ must be estimated in advance because $\rho$ interacts with $\sigma_x$ and $\sigma_y$ to determine the peakedness of the function (as $\rho$ approaches 1, the function will become more peaked). Thus, the likelihood function may be maximized equally well at several values of $\rho$, $\sigma_x$, and $\sigma_y$, which prevents the computer program from choosing any one set of values. Estimation of $\rho$ in advance should be valid if correlation coefficients are not too variable within the taxa of interest. To be conservative, $\rho$ should be estimated as the maximum, the
minimum, and the average observable values within the taxa being studied. If equivalent codes are obtained using each of these values for ρ, then the results will be reliable.
Mixtures of ratio density functions were fit to the untransformed facial shape data set. ρ was estimated at 0.2, 0.5, and 0.7. For each value of ρ, codes were equivalent to those obtained using the log-transformed data set.
TABLE 4. Estimated function parameters for log-transformed facial shape data. Parameters and likelihood statistics of four mixture functions were obtained using maximum likelihood estimation.
Parameter   1 normal   2 normal   3 normal   4 normal
μ1          0.125      0.047      0.061      0.043
σ1          0.011      0.003      0.004      0.003
p1          1.000      0.576      0.657      0.400
μ2                     0.230      0.206      0.090
σ2                     0.003      0.00009    0.006
p2                     0.424      0.154      0.300
μ3                                0.280      0.206
σ3                                0.001      0.00008
p3                                0.189      0.131
μ4                                           0.280
σ4                                           0.001
p4                                           0.169
ln L        82.9       92.0       98.6       98.6
TABLE 5. Significance tests of mixture functions for log-transformed hominoid facial shape data. Results of the AIC and G-test support the three-normal mixture as the model that best explains the distribution of individuals.
AIC
Mixture model   ln L    AIC
1 normal        82.9    -163.8
2 normals       92.0    -179.0
3 normals       98.6    -189.2
4 normals       98.6    -186.2
G-test
Comparison        G                          df   P
1 vs. 2 normals   -2(82.9 - 92.0) = 18.2     4    <0.005
2 vs. 3 normals   -2(92.0 - 98.6) = 13.2     4    <0.01
3 vs. 4 normals   -2(98.6 - 98.6) = 0        4    n.s.
1 vs. 3 normals   -2(82.9 - 98.6) = 31.4     8    <0.001
TABLE 6. Posterior probabilities and codes of log-transformed hominoid facial shape data. Values are species means of facial shape and the probabilities that any given species should receive code 0, 1, or 2.
Species           Log facial shape   Code 0   Code 1   Code 2
Gorilla gorilla   0.027              1.0      0        0
Pongo pygmaeus    0.033              1.0      0        0
Pan troglodytes   0.089              1.0      0        0
Homo sapiens      0.200              0.07     0.91     0.02
Hylobates lar     0.274              0.01     0        0.99
DISCUSSION
Comparisons between FMC and Other Coding Procedures
In FMC, taxa share a code if it is probable that their distributions could have been sampled from a single statistical population. Thus, in FMC, codes are produced using conventional principles of statistical inference, which is not the case for other coding procedures. Five methods arguably produce arbitrary codes. Three methods (gap, generalized gap, and segment coding) require the researcher to arbitrarily choose the size of the critical gap or segment length. A fourth method, gap weighting, produces arbitrary codes in the sense that all continuous characters have the same number of codes, which is equal to the maximum allowable by a given parsimony computer program. The method of Almeida and Bisby (1984) forces the researcher to subjectively identify points along the range of a character in which there is minimal overlap between species.
Divergence coding and homogeneous subset coding (HSC) employ statistical tests and thus do not create arbitrary codes. However, the codes are not directly produced by the tests. For instance, HSC uses posterior comparisons tests to identify statistically homogeneous sets of taxa, but when subsets overlap, taxa in the same subset do not always receive the same code. Rather, taxa share a code when they share a unique pattern of membership in subsets. From a statistical standpoint, it is not clear why the pattern of membership in overlapping subsets is a criterion that should reveal which taxa are meaningfully discernible from each other.
Sampling Independence
Felsenstein (1985, 1988) and Harvey and
Pagel (1991) noted that for purposes of statistical comparison, species cannot be considered independent entities. As a result of
phylogenetic hierarchy, species naturally
group into clusters. Because most statistical procedures assume independence, conclusions based on analyses that fail to recognize hierarchy may be invalid. Coding
procedures that employ statistical tests or
parameters may be subject to these criticisms. Felsenstein (1985, 1988) and Harvey
and Pagel (1991) suggested that nonindependence can be accounted for if knowledge about phylogeny is used to correct
for the effect of hierarchical clustering. Unfortunately, such an approach cannot be
applied to coding methods because by definition the phylogeny is unknown prior to
the analysis.
Finite mixture coding of species means
also relies on an assumption of sampling
independence. The central limit theorem
states that when large samples are drawn
independently from a population, the
means of those samples will be normally
distributed. If independence is not allowed, then the central limit theorem is invalid and normality cannot be expected.
Because FMC requires an expectation of
how species means are distributed, the
method would not be applicable. However,
although phylogenetic hierarchy is a certainty, phylogenetic inertia is not. Felsenstein (1985:6) noted that the assumption of
independence is valid so long as "characters respond essentially instantaneously to
natural selection in the current environment, so that phylogenetic inertia is essentially absent." Thus, FMC of taxon means,
like other coding methods that require
sampling independence, must assume that
phylogenetic inertia is absent or slight.
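The role of the central limit theorem in this argument can be illustrated by simulation. The sketch below is an assumption-laden example rather than part of the method: the exponential population, the sample sizes, and the seed are arbitrary choices, and the helper `skewness` is a hypothetical name. It shows only the case of independent sampling and says nothing about means that are correlated through phylogeny.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: third central moment divided by the cubed SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

rng = random.Random(1996)          # fixed seed so the run is repeatable
n_samples, sample_size = 2000, 50  # arbitrary illustrative choices

# Draw independent samples from a strongly right-skewed (exponential) population.
samples = [[rng.expovariate(1.0) for _ in range(sample_size)]
           for _ in range(n_samples)]
raw_values = [v for sample in samples for v in sample]
sample_means = [statistics.fmean(sample) for sample in samples]

skew_raw = skewness(raw_values)
skew_means = skewness(sample_means)
print(f"skewness of raw values:   {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The means are far more symmetric than the raw values, which is the behavior that FMC of species means relies upon when independence holds.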
Finite mixture coding of individual
specimens does not require such equivocation. Consider an extreme case of phylogenetic inertia in which all descendant
species of a given common ancestor inherit
identical specieswide distributions for a
given measurement. When specimens are
sampled from these species, they are taken
from nonindependent but identical populations. When pooled, such specimens
would be distributed as if they were independent samples taken from a single
population, and the form of that distribution would be the same as that observed
within species. This model is precisely the
one used by FMC to define species that
share a code.
Now consider a more realistic case in
which the degree of similarity between
species varies according to recency of common ancestry (where recency refers to
both time and branching pattern). If a set
of species were descended from a relatively recent common ancestor, then their
within-species distributions for the given
measurement would all be very similar
(barring natural selection or other factors).
Thus, specimens sampled from these species would be taken from nonindependent
but nearly identical populations. Such
specimens would be distributed approximately as if they were independent samples from a single population, and the
form of that distribution would be approximately the same as that observed within
species. Thus, if phylogenetic inertia were
the only factor influencing the degree of
similarity between species, FMC of individual specimens would assign codes to
clusters of closely related (and recently divergent) species, particularly clusters separated by relatively long divergence times.
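The pooling argument can be made concrete with a small simulation. This is a sketch under stated assumptions: the species means, the shared within-species standard deviation, and the sample sizes are hypothetical, and within-species distributions are taken to be normal. It compares the spread of pooled specimens when species are identical, nearly identical, or include one divergent species.

```python
import random
import statistics

rng = random.Random(45)  # fixed seed so the run is repeatable
WITHIN_SD = 1.0          # hypothetical within-species standard deviation
N_PER_SPECIES = 500      # hypothetical number of specimens per species

def pooled_sd(species_means):
    """Pool simulated specimens from species sharing WITHIN_SD; return their SD."""
    pooled = [rng.gauss(mu, WITHIN_SD)
              for mu in species_means
              for _ in range(N_PER_SPECIES)]
    return statistics.pstdev(pooled)

sd_identical = pooled_sd([10.0, 10.0, 10.0])  # extreme phylogenetic inertia
sd_similar = pooled_sd([9.9, 10.0, 10.1])     # recent common ancestry
sd_divergent = pooled_sd([10.0, 10.0, 14.0])  # one long-diverged species

print(f"identical species:   pooled SD = {sd_identical:.2f}")
print(f"similar species:     pooled SD = {sd_similar:.2f}")
print(f"one divergent taxon: pooled SD = {sd_divergent:.2f}")
```

In the first two cases the pooled spread stays close to the within-species value, so the pooled specimens resemble a single population; in the third case it does not.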
Limitations of FMC
A common criticism of coding methods
that employ statistical tests is that they are
overly dependent on sample size (Archie,
1985; Felsenstein, 1988). In general, statistical tests are better able to discriminate
between groups as sample sizes increase.
Thus, for a given set of taxa, the codes produced by a procedure might vary simply
as a result of sampling. This criticism applies to FMC, although in a slightly different manner than it does to methods such
as HSC. Large sample sizes increase the
degrees of freedom of the posterior comparisons tests used by HSC, thereby increasing the ability of those tests to identify statistically significant differences
between taxa. With respect to FMC, large
sample sizes mean that sample distributions will better approximate the population distributions from which they were
drawn, meaning that those populations are
more likely to be detected by likelihood estimation.
A corollary of the sample size problem
is that FMC requires relatively large samples of either taxon means or individual
specimens to identify a code. Thus, when
examining taxon means, a unique species
might not receive its own code even if it is
a morphological outlier because one point
along a distribution may not be enough to
define a distinct statistical population. Species with unique means can receive their
own code if codes are based on a distribution of individual specimens. However,
when coding specimens, a similar problem
is encountered; a poorly represented but
divergent species might not appear distinct. This problem is particularly acute
when examining fossil taxa.
When the method is applied to specimens, FMC relies on a reasonable expectation of how a measurement is distributed
within species. If the wrong type of distribution is chosen, then the codes may be
inappropriate. This problem can be avoided by transforming the data such that they
conform to a particular type of distribution (e.g., normal).
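As an illustration of such a transformation, the sketch below applies a logarithmic transformation to a simulated right-skewed measurement. The choice of a log transformation, the lognormal parameters, and the helper `skewness` are assumptions made for this example; the appropriate transformation for real data depends on how the measurement is actually distributed.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: third central moment divided by the cubed SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

rng = random.Random(1995)  # fixed seed so the run is repeatable

# Hypothetical right-skewed measurement: lognormal, log-scale mean 2 and SD 0.5.
measurements = [rng.lognormvariate(2.0, 0.5) for _ in range(5000)]
log_measurements = [math.log(v) for v in measurements]

skew_before = skewness(measurements)
skew_after = skewness(log_measurements)
print(f"skewness before transformation: {skew_before:.2f}")
print(f"skewness after log transform:   {skew_after:.2f}")
```

After transformation the values are close to symmetric, so a mixture of normals becomes a more reasonable set of candidate density functions.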
CONCLUSION
Finite mixture coding differs from other coding methods in that it produces codes as the direct output of a statistical procedure. This procedure has three stages: (1) parameters are estimated for a number of mixture density functions that describe a given data set, (2) a goodness-of-fit criterion is used to identify the function that best describes the set, and (3) species are assigned to populations based on posterior probabilities, and codes are assigned to populations. Each taxon receives the same code as the population within which its species mean falls.
Finite mixture coding has advantages and disadvantages that are characteristic of many statistical procedures. The method is advantageous in the sense that it differentiates among groups on the basis of nonarbitrary criteria, but it suffers from problems associated with sample size. Unlike some other coding methods, FMC (of specimens) is tenable even when it is assumed that species cannot be independently sampled.
ACKNOWLEDGMENTS
We thank F. E. Grine, T. C. Rae, F. J. Rohlf, W. L. Jungers, S. Leigh, R. Cowan, and the members of the Numerical Taxonomy Discussion Group at SUNY-Stony Brook for providing valuable comments and criticisms. This paper was considerably improved by comments from D. Cannatella and two anonymous reviewers. This research was supported by a National Science Foundation grant (BNS 9120117) to F. E. Grine.
REFERENCES
AKAIKE, H. 1974. A new look at the statistical model identification. Inst. Electr. Electron. Trans. Automatic Control 19:716-723.
ALMEIDA, M. T., AND F. A. BISBY. 1984. A simple method for establishing taxonomic characters from measurement data. Taxon 33:405-409.
ARCHIE, J. W. 1985. Methods for coding variable morphological features for numerical taxonomic analysis. Syst. Zool. 34:326-345.
BOOKSTEIN, F. L. 1994. Can biometrical shape be a homologous character? Pages 198-227 in Homology: The hierarchical basis of comparative biology (B. K. Hall, ed.). Academic Press, San Diego, California.
CHAMBERLAIN, A. T. 1987. A taxonomic review and phylogenetic analysis of Homo habilis. Ph.D. Dissertation, Univ. Liverpool, Liverpool, England.
CHAPPILL, J. A. 1989. Quantitative characters in phylogenetic analysis. Cladistics 5:217-234.
COLLESS, D. H. 1980. Congruence between morphometric and allozyme data for Menidia species: A reappraisal. Syst. Zool. 29:288-299.
CRANSTON, P. S., AND C. J. HUMPHRIES. 1988. Cladistics and computers: A chironomid conundrum? Cladistics 4:72-92.
EVERITT, B. S. 1981. A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivar. Behav. Res. 16:171-180.
EVERITT, B. S. 1985. Mixture distributions. Pages 559-560 in Encyclopedia of statistical sciences. Wiley, New York.
FALCONER, D. S. 1981. Introduction to quantitative genetics. Longman, London.
FARRIS, J. S. 1990. Phenetics in camouflage. Cladistics 6:91-100.
FELSENSTEIN, J. 1985. Phylogenies and the comparative method. Am. Nat. 125:1-15.
FELSENSTEIN, J. 1988. Phylogenies and quantitative characters. Annu. Rev. Ecol. Syst. 19:445-471.
FLEAGLE, J. G. 1988. Primate evolution and adaptation. Academic Press, New York.
HARVEY, P. H., AND M. D. PAGEL. 1991. The comparative method in evolutionary biology. Oxford Univ. Press, Oxford, England.
MICKEVICH, M. F., AND M. F. JOHNSON. 1976. Congruence between morphological and allozyme data in evolutionary inference and character evolution. Syst. Zool. 25:260-270.
PEARSON, J. D., C. H. MORRELL, AND L. J. BRANT. 1992. Mixture models for investigating complex distributions. J. Quant. Anthropol. 3:325-345.
PIMENTEL, R. A., AND R. RIGGINS. 1987. The nature of cladistic data. Cladistics 3:201-209.
SIMON, C. 1983. A new coding procedure for morphometric data with an example from periodical cicada wing veins. Pages 378-382 in Numerical taxonomy (J. Felsenstein, ed.). Springer-Verlag, Berlin.
SOKAL, R. R., AND F. J. ROHLF. 1981. Biometry, 2nd edition. W. H. Freeman, New York.
STEVENS, P. F. 1991. Character states, morphological variation, and phylogenetic analysis: A review. Syst. Bot. 16:553-583.
STRAIT, P. T. 1989. A first course in probability and statistics with applications. Harcourt Brace Jovanovich, New York.
THIELE, K. 1993. The holy grail of the perfect character: The cladistic treatment of morphometric data. Cladistics 9:275-304.
THORPE, R. S. 1984. Coding morphometric characters for constructing distance Wagner networks. Evolution 38:244-255.
Received 30 August 1994; accepted 3 August 1995
Associate Editor: David Cannatella