Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Psychology Science, Volume 46, 2004 (1), p. 150-170 Base models for Configural Frequency Analysis 1 ALEXANDER VON EYE Abstract In this article, a grouping of base models for Configural Frequency Analysis (CFA) is proposed. Four groups are discussed. The first involves log-linear models and estimates expected cell frequencies using information provided by the observed cell frequencies. The second group uses a priori probabilities. These probabilities can exist in the form of, for instance, population parameters. The third group uses a priori probabilities that conform to substantive models. The fourth group of base models represents hypotheses concerning the multivariate distribution in cross-classifications. In this group, CFA methods are used to determine whether and where deviations from distributional assumptions exist. Data examples are provided from the areas of politics, birth statistics, and adolescent aggressive behavior. In the discussion, the relation between log-linear modeling and CFA is specified. It is stated that the foundation of CFA goes far beyond what one would assume if CFA were equivalent to residual analysis in log-linear modeling. Key words: Configural Frequency Analysis, base model, log-linear model, crossclassification 1 Alexander von Eye, Michigan State University, Department of Psychology, 119 Snyder Hall, East Lansing, MI 48824-1117, USA; E-mail: [email protected] Base models for Configural Frequency Analysis 151 Base models for Configural Frequency Analysis In the garden of classification methods, Configural Frequency Analysis (CFA; Lienert, 1968; Lautsch & von Weber, 1995; von Eye, 2002a) plays a particular role. In contrast to such methods as cluster analysis or latent class analysis, both of which create a priori unknown groups from raw data, CFA asks whether clusters of existing groups contain more or fewer cases than expected. CFA shares the characteristic of analyzing existing groups with discriminant analysis and with logistic regression. However, in contrast to these two methods, the typical application of CFA is exploratory in nature, as is the case for many other methods of classification. There exist confirmatory methods of CFA. However, these have found only few applications (later in this article, confirmatory applications are discussed). CFA has been developed intensively since its first presentation in 1968 (see Lienert, 1969), and is now among the more popular methods of data analysis. One of the reasons for the increasing popularity of CFA is that the results of this method of data analysis are deemed easy to interpret. Specifically, CFA allows researchers to explore whether for a particular cell of a cross-classification, the null hypothesis H0: E[Ni] = Ei can be rejected, where E indicates the expectancy, Ni is the observed frequency for Cell i, Ei is the estimated expected frequency for Cell i, and i indexes all cells in a cross-classification. If a significance test suggests that E[Ni] > Ei, the cell, also called Configuration i, is said to constitute a CFA type. If E[Ni] < Ei, Cell i is said to constitute a CFA antitype. If neither is the case, that is, if E[Ni] = Ei, Cell i is said to follow the distribution proposed in the CFA base model. In different words, CFA identifies those cells as type-constituting that contain more cases than expected. Cells that contain fewer cases than expected constitute antitypes. Cells with observed frequencies that do not differ from their expected counterparts beyond chance are typically not the focus of interpretational efforts. Reference to the base model that was used to estimate the expected cell frequencies is rarely made. In this article, we discuss base models that can be used for CFA. Specifically, we discuss four groups of base models. First, we discuss log-linear models. These are the most frequently used base models. Indeed, the original CFA base model (Lienert, 1969) is a loglinear main effect model. Second, we discuss models in which cell probabilities are determined based on population parameters. Third, we consider models that determine cell probabilities based on a priori considerations (see, for instance, von Eye, 2002a, Ch. 8). Fourth, we discuss base models that reflect assumptions concerning the distributions of the variables under study (von Eye & Gardiner, 2003). We also present examples and arguments that can be used for the selection of base models. In the next section, we first present a brief overview of the method of CFA. The following sections discuss the four groups of base models. 1. The five steps of Configural Frequency Analysis CFA is performed in five steps (von Eye, 2002a). The first step involves selecting a base model. This article is concerned with the nature of base models. The selection of base models also requires that theoretical considerations concerning the scale characteristics of the variables in the cross-classification be taken into account. In addition, the sampling scheme for data collection determines which kind of base model is admissible (von Eye & Schuster, 1998; von Eye, Schuster, & Gutiérrez-Peña, 2000). Neither of these two issues is of concern 152 A. von Eye here. Instead, the characteristics and applications of four groups of base models are discussed. The second step of CFA involves the selection of a concept of deviation from independence. This issue is relatively new in the discussion of CFA (von Eye, Spiel, & Rovine, 1995). It is based on the idea that a number of definitions of deviations from some base model exist that reflect different data characteristics (Goodman, 1991). The concept of deviation that is used in routine CFA applications is that of the standardized residual. This concept is marginal-dependent. Other concepts are marginal-free, and measures of deviation can behave differently, depending on whether they are marginal-free or marginal-dependent (von Eye & Mun, 2003). At the moment of this writing, there exist no CFA applications that consider these characteristics of measures and tests of deviation, and there exists no theoretical work that discusses these characteristics in CFA models other than 2-sample CFA. The third step involves the selection of a significance test. A large number of tests has been proposed for use in CFA. Recent work has shown that these tests differ dramatically in their characteristics. For example, von Eye (2002b) has shown that, depending on sample size, some of the more popular tests of CFA tend to identify more types, whereas others tend to identify more antitypes. von Weber, Lautsch, and von Eye (2003) showed that the $-error of CFA tests can differ by large quantities. The fourth step involves estimating (or determining) the expected cell frequencies, performing the significance tests, and identifying those configurations that constitute types or antitypes. The fifth step involves interpreting types and antitypes. A complete example of CFA follows. Data example. In a study published by DiFrancisco and Critelman (1984; from Lawal, 2003, p. 251), respondents from 5 nations indicated their level of education and whether they followed politics regularly. The five nations were 1 = USSR, 2 = USA, 3 = UK, 4 = Italy, and 5 = Mexico. The three levels of education were 1 = primary education, 2 = secondary education, and 3 = college education or higher. A score of 1 on the third variable indicates that a respondent follows politics regularly, a 2 indicates that a respondent does not follow politics regularly. The cross-classification of these three variables appears in Table 1, along with CFA results. The standard normal z-test was used, and the significance level " was Bonferroni-adjusted to be "* = 0.001667. For the present analyses, we used the original CFA base model, that is, the log-frequency model log Ei = λ + λ Country + λkEduc + λlPolit j where the subscripts to the 8 terms index the estimated parameters, and the superscripts index the variables. Obviously, this is the model of independence among the three variables under study. Types and antitypes emerge if these variables covary. This base model of CFA is also known as first order CFA. Table 1 shows that there is a large number of types and antitypes that describe patterns of political interest that deviate significantly from the assumption that the three variables Country, Level of education, and Interest in politics are independent from each other. In the following paragraphs, we present sample interpretations of types and antitypes as well as patterns of types and antitypes. Base models for Configural Frequency Analysis 153 Table 1: CFA of the cross-classification of the three variables Country (C), Educational level (E), and Interest in politics (I) CEI 111 112 121 122 131 132 211 212 221 222 231 232 311 312 321 322 331 332 411 412 421 422 431 432 511 512 521 522 531 532 Ni 94 84 318 120 473 72 227 112 371 71 180 8 356 144 256 76 22 2 166 256 142 103 47 7 447 430 78 25 22 2 Êi 387.578 183.188 261.063 123.391 139.735 66.046 323.482 152.893 217.890 102.985 116.627 55.123 285.759 135.063 192.480 90.975 103.026 48.695 240.692 113.763 162.124 76.628 86.778 41.015 335.166 158.416 225.760 106.705 120.839 57.114 statistic -14.912 -7.328 3.524 -.305 28.193 .733 -5.364 -3.307 10.373 -3.152 5.868 -6.347 4.155 .769 4.578 -1.570 -7.983 -6.692 -4.814 13.336 -1.581 3.013 -4.270 -5.311 6.109 21.578 -9.834 -7.910 -8.991 -7.293 p .00000000 .00000000 .00021266 .38009021 .00000000 .23187817 .00000004 .00047128 .00000000 .00081139 .00000000 .00000000 .00001626 .22096045 .00000234 .05820057 .00000000 .00000000 .00000074 .00000000 .05699597 .00129473 .00000978 .00000005 .00000000 .00000000 .00000000 .00000000 .00000000 .00000000 Antitype Antitype Type Type Antitype Antitype Type Antitype Type Antitype Type Type Antitype Antitype Antitype Type Type Antitype Antitype Type Type Antitype Antitype Antitype Antitype The first antitype in Table 1, constituted by Cell 111, suggests that fewer individuals with primary education from the USSR than expected follow politics regularly. The second antitype, constituted by Cell 112, suggests that also fewer individuals with primary education from the USSR than expected do not follow politics regularly. This result may sound implausible. However, these two antitypes, taken together, indicate that there are fewer individuals with primary education from the USSR in this sample (for details on aggregation of results from CFA see von Eye, 2002a, Ch. 10.8). Therefore, these antitypes reflect the association between the variables Country and Level of Education rather than political interest. These two antitypes, and the corresponding two antitypes in the individuals from the USA, are counterbalanced by the types constituted by Cells 511 and 512. These types sug- A. von Eye 154 gest that there are more Mexicans than expected from the assumption of variable independence who have primary education. Other types and antitypes may be interpretable as indicating whether individuals do or do not follow political events more or less regularly than expected. For example, the type constituted by Cell 311 suggests that more individuals with primary education from the UK than expected indicate that they follow politics regularly. Accordingly, fewer individuals with secondary education from the US indicate that they don’t follow politics regularly. Readers are invited to interpret additional types and antitypes as well as patterns of types and antitypes. The fact that some of the type and antitype patterns reflect particular variable associations can give rise to follow-up data analyses. In the present data example, researchers may consider an alternative base model. This could be the base model in which the variables Country and Level of Education are allowed to interact but are proposed to be independent of the variable Political Interest. This is the base model of a Prediction CFA (P-CFA; Heilmann, Lienert, & Maly, 1979; cf. von Eye & Schuster, 1998). The types and antitypes that result from P-CFA in the present data example can be interpreted such that particular patterns of Country and Level of Education allow one to predict Political Interest. The log-linear representation of this base model is log ei = λ + λ Country + λkEduc + λlPolit + λ CountryxEduc . j jk Alternatively, if the grouping of variables in predictors and criteria is not desired, one may consider performing a second order CFA, that is, an analysis with a base model that takes all pairwise associations into account. The log-linear representation of this base model is log Ei = λ + λ Country + λkEduc + λlPolit + λ CountryxEduc + λ CountryxPolit + λlkEducxPolit . j jk jl (This base model is employed in Section 2.1, below). Both of these base models as well as the one used for the analysis of the data in Table 1 are log-linear models. In the following sections, we describe four groups of base models, one of which includes log-linear models. 2. Four groups of base models for CFA In this section, we present four groups of base models for CFA. These are (1) log-linear models, (2) models based on population parameters, (3) models with a priori determined probabilities, and (4) models based on distributional assumptions. In either case, a CFA base model must meet the following three criteria (von Eye & Schuster, 1998): 1. Uniqueness of interpretation of types and antitypes: there must be only one reason for discrepancies between observed and expected cell frequencies. Sample cases of such reasons include the presence of higher order interactions, main effects, and predictor-criterion relationships. Base models for Configural Frequency Analysis 2. 3. 155 Parsimony: base models must be as simple as possible. That is, base models must include as few terms as possible and terms of the lowest possible order. Methods have been proposed to simplify CFA base models (Schuster & von Eye, 2000). Consideration of sampling scheme: the sampling scheme of all variables must be considered. Particular sampling schemes can have the effect that the selection of base models is limited. For example, product-multinomial sampling precludes using base models of zero order CFA. From a more general perspective, CFA base models play the role of a superordinate null hypothesis. The observed data are assumed to have been drawn from a population in which the base model provides a valid description of the frequency distribution in the crosstabulation. Types and antitypes suggest that, at least in specific sectors of the table, this null hypothesis must be rejected. 2.1 Log-linear base models for CFA Log-linear models allow one to describe relationships among variables. For instance, loglinear models can be used to depict association structures, dependency structures, or joint variable distributions (Goodman, 1984). In the context of CFA, the first two possibilities have been used. The concept behind using log-linear models lies in the definition of a CFA base model given above. The base model includes all relationships that researchers do not wish to be the reason for the emergence of types and antitypes. Types and antitypes can then emerge only if the relationships exist that are not part of the base model. Such relationships can either be global, that is, involve the entire range of variable categories, or local, that is, involve only a selection of variable categories (Havránek & Lienert, 1984). For example, if researchers wish that types and antitypes reflect nothing but variable associations, the main effects must be part of the base model. An example of such a base model is the original CFA base model used by Lienert (1969). In contrast to log-linear modeling, the base model of a CFA is not altered in the presence of rejected null hypotheses of model fit. Rather, researchers interpret the types and antitypes as indicated in the data example above, and attempt to explain the discrepancies from the base model. As was illustrated in the first data example, this interpretation can focus on individual types and antitypes as well as on groups of types and antitypes. Log-linear base models can be classified in two groups (von Eye, 2002a). The first group is that of global models. These are models in which all variables have the same status. There is no distinction between, for example, predictors and criteria. Within the group of global models, there exists a hierarchy. At the bottom of this hierarchy, there is the base model of zero order CFA. This model proposes that the cross-classification under study reflects no effects whatsoever. The resulting expected frequencies are thus uniformly distributed. Types and antitypes can emerge if effects exist, any effects. At the second level of this hierarchy, we find first order CFA. This is the base model that proposes that main effects exist but no variable associations. This model is also called the model of variable independence or the base model of classical CFA. Types and antitypes will emerge if variable associations exist. A. von Eye 156 At the subsequent higher levels, increasingly higher order interactions are taken into account. Types and antitypes emerge if associations exist at the levels not considered in the base model. For d variables, the highest possible order of a base model takes the d-1st order interactions into account. Types and antitypes can then emerge only if interactions of the dth order exist. In different words, the highest possible order of CFA base model identifies types and antitypes only if a saturated hierarchical log-linear model is needed to provide a satisfactory description of the observed frequency distribution. The second group of log-linear CFA base models is that of regional models. These are models that group variables. Examples of such models include Prediction CFA which distinguishes between predictors and criteria; Interaction Structure Analysis (ISA; Lienert & Krauth, 1973) which distinguishes between two groups of variables of equal status; and ksample CFA which includes one or more classification variables and one or more variables that are used to distinguish between these groups. Models of CFA that distinguish among more than two groups of variables have been discussed (Lienert & von Eye, 1988), but have found limited application. The following paragraphs present examples of log-linear base models. For these examples, we assume that all variables have been observed under a multinomial sampling scheme such that there are no constraints on the selection of base models. For the examples, we assume that the four variables, A, B, C, and D are included in the analyses. Examples of global base models. Zero order base model: log Ei = λ First order base model: log Ei = λ + λ jA + λkB + λlC + λmD Second order base model: log Ei = λ + γ Aj + γ lB + γ lC + γ mD + AB AD BC BD CD + γ jk + γ AC jl + γ jm + γ kl + γ km + γ lm Examples of regional base models. ISA in which the first variable group contains variables A and B, and the second variable group contains variables C and D: log Ei = λ + γ Aj + γ lB + γ lC + γ mD + CD + γ AB jk + γ lm Base models for Configural Frequency Analysis 157 P-CFA in which variable D is predicted from variables A, B, and C: log Ei = λ + γ Aj + γ lB + γ lC + γ mD AB ABC + γ jk + γ jlAC + γ klBC + γ jkl Data example. In the following paragraphs, we re-analyze the data from the first example. As was indicated in the discussion of the results of this example, some of the types and antitypes reflect the association between Country and Level of Education or, in general, first order associations. Thus, if researchers ask the question whether types and antitypes result at the level of the remaining second order association, a second order CFA can be performed. In the typical case, the number of types and antitypes in higher order CFA applications is smaller than in lower order applications, because more effects are taken into account. The Table 2: Second order CFA of the data in Table 1 Configuration 111 112 121 122 131 132 211 212 221 222 231 232 311 312 321 322 331 332 411 412 421 422 431 432 511 512 521 522 531 532 Ni 94 84 318 120 473 72 227 112 371 71 180 8 356 144 256 76 22 2 166 256 142 103 47 7 447 430 78 25 22 2 Êi 94.261 83.739 310.282 127.718 480.457 64.543 235.001 103.999 366.791 75.209 176.208 11.792 339.334 160.666 272.270 59.730 22.396 1.604 167.386 254.614 143.713 101.287 43.902 10.098 454.019 422.981 71.944 31.056 21.036 2.964 statistic -.027 .028 .438 -.683 -.340 .928 -.522 .785 .220 -.485 .286 -1.104 .905 -1.315 -.986 2.105 -.084 .313 -.107 .087 -.143 .170 .468 -.975 -.329 .341 .714 -1.087 .210 -.560 p .48929071 .48863800 .33064370 .24733225 .36685148 .17665137 .30085750 .21634957 .41301778 .31370450 .38757386 .13474938 .18279871 .09427841 .16206149 .01763773 .46661597 .37710080 .45735818 .46540306 .44319895 .43243690 .32003912 .16479129 .37092086 .36644349 .23763608 .13859947 .41680103 .28783443 158 A. von Eye following analyses will show whether this is the case here too. To make results comparable, 2 the following CFA also uses the z-test and the Bonferroni-adjusted "* = 0.0016667. Results are displayed in Table 2. The results in Table 2 show clearly that our earlier interpretations were correct. The types and antitypes in Table 1 resulted because first order associations exist. At the level of the second order association, not a single type or antitype emerges. 2.2 CFA base models that are based on a population parameter Application of log-linear base models presupposes in the typical case that the cell probabilities are estimated from sample information. In particular instances, however, population parameters are known or can be derived from population statistics. Examples of such parameters include gender distributions, cohort size, family characteristics such as number of children in a family, or number of children raised in single parent households. An early example of a CFA based on population parameters can be found in Spiel and von Eye (1993). In a similar fashion, the number of cases that are expected to fall in a particular range of such characteristics as intelligence or extraversion can be derived from test norms that are supposed to be valid for entire populations. When population parameters are known, the expected cell probabilities can be calculated as specified in a base model. Consider the case in which the three variables A, B, and C are observed, each having two categories. The univariate marginal probabilities for these three variables are π 1A , π 2A , π 1B , π 2B , π 1C , and π 2C . The bivariate marginal probabilities are A.C . . BC π ijAB . , π i.k , and π . jk . To illustrate, consider the base model of first order CFA which implies only main effects, that is, variable independence. For this model the expected cell probability is ABC π ijk = π iAπ Bj π kC , with i, j , k = {1, 2} . The corresponding expected cell frequency is calculated to be ABC . Eˆ ijk = N π ijk A more complex base model proposes, for example that A and B are unrelated, but A and C as well as B and C can be related. The expected cell probability for this base model can be calculated by π ijk = 2 π iA.k.Cπ ..BC jk π kC Readers recalculating these results using SPSS or SYSTAT will notice that different expected cell frequencies result. Readers recalculating these results using Rindskopf’s (1987) log-linear modeling program will notice that this program produces the same expected cell frequencies as reported here. The reasons for this discrepancy need to be explored. Base models for Configural Frequency Analysis 159 (Bishop, Fienberg, & Holland, 1975) and the expected cell frequency is as before. Most if not all CFA base models are simple models, that is, the expected frequencies can be estimated from the marginals using closed form equations. Thus, for most models, equations of the kind given here can be derived. Data example. For the following example, we use data from a study by Spiel and von Eye (1992). In this study, the authors asked whether Austrian students of psychology and journalism are representative of families in Austria in regard to the size of their families of origin. 244 students of psychology and 179 students of journalism indicated, among other information, the number of children in their families of origin. Table 3 displays the relative frequencies of numbers of children, pooled over student gender and major, and the population probabilities. The population probabilities were taken from the Austrian census. Table 3: Probabilities and relative frequencies of numbers of children in families in Austria Number of children Austria (Census) Student sample (Spiel & von Eye, 1992) 1 .4829 .1844 2 .3466 .4350 3 .1194 .2128 >3 .0510 .1678 As Table 3 indicates, the distribution of numbers of children is quite different in the pooled sample of students of psychology and journalism than in the population. We now ask whether particular patterns of Major (M; 1 = psychology, 2 = journalism), number of children (C; labels as in Table 3), and Gender (G; 1 = females, 2 = males) deviate from the population such that types or antitypes result. We use the model of variable independence as the base model. That is, we propose that major, number of children in the family of origin, and gender are unrelated. We analyze the M x C x G cross-classification in two ways. First, we employ standard first order CFA in which we estimate the expected cell frequencies from the observed univariate marginal frequencies. The expected frequencies for this analysis can be estimated using the log-linear model C G log Ei = λ + λ M j + λk + λl . 2 This translates into the well known X formula for the estimation of expected cell frequencies, N j.. N.k . N..l . Eˆ ijk = N2 Second, we substitute the probabilities from Table 3 for the marginal relative frequencies for numbers of children. This translates into the following equation for the estimation of expected cell frequencies: 160 A. von Eye Eˆ ijk = N j..π k N..l / N . For both analyses, we use the z-test and the Bonferroni-adjusted "* = 0.003125. Table 4 displays results from both analyses. The two panels in Table 4 display two configural analyses that differ only in the reference population that is used to estimate the expected cell frequencies. In the reference population of Austrian psychology and journalism majors, the three variables Major, Number of Children, and Gender are independent and, consequently, not a single type or antitype emerges. In contrast, when Austria’s general population is used as a reference, three antitypes and six types emerge. The three antitypes suggest that female students of psychology and journalism are less likely than expected from general population parameters to be the sole children of their families of origin. In addition, male psychology majors are also less likely than expected children from one-child families. The types of the psychology majors suggest that male students are more likely than expected to have either two siblings or more than three. Female journalism majors stem with increased likelihood from families with two, Table 4: Two CFA analyses of the cross-classification of Major (M), Number of Children in the family of origin (C), and Gender of student (G) Patterns Standard first order CFA CFA taking a priori probabilities into account a ˆ ˆ z p(z) z p(z) MCG Ni Type/ Ei Ei Antitype? 111 34 31.70 .42 .335 82.92 -5.73 < "* A 112 14 13.30 .20 .422 34.78 -3.53 < "* A 121 76 74.77 .16 .437 59.58 2.13 .017 122 23 31.36 -1.55 .060 24.99 -.40 .345 131 43 36.57 1.11 .133 20.52 4.96 < "* T 132 10 15.34 -1.39 .082 8.61 .47 .319 141 37 28.85 1.57 .058 8.77 9.54 < "* T 142 7 12.10 -1.49 .069 3.68 1.73 .042 211 15 23.25 -1.76 .039 60.84 -5.88 < "* A 212 15 9.75 1.70 .045 25.52 -2.08 .019 221 52 54.85 -.41 .340 43.71 1.25 .106 222 33 23.01 2.14 .016 18.34 3.43 < "* T 231 22 26.83 -.96 .168 15.06 1.79 .037 232 15 11.25 1.13 .129 6.32 3.46 < "* T 241 19 21.17 -.48 .315 2.70 4.96 < "* T 242 8 8.88 -.30 .383 2.70 3.23 < "* T a ”< "*” indicates that the tail probability is smaller than can be expressed with three decimals, and that a type or antitype exists Base models for Configural Frequency Analysis 161 three, or more children. Male journalism majors are with increased likelihood children from families with more than three children. In all, with reference to the population of psychology and journalism majors, the three variables Major in College, Number of children in family of origin, and Gender are not associated to the extent that types and antitypes emerge. In contrast, with reference to the general Austrian population, psychology and journalism majors are less likely to have no siblings, and they are more likely to have two or more siblings. 2.3 CFA base models based on a priori probabilities A particular group of CFA base models results from theoretical models and assumptions. Consider an experiment in which coins are tossed and dies are rolled. For coins, one assumes that heads and tails appear at equal rates, that is, with probability 1/2. For dies, one assumes that each of the six faces appears with probability 1/6. Similar assumptions can be made for diamonds in urns, for cards in standard decks, and for the rates with which the numbers in the lotto game 6 out of 49 are drawn. Using CFA under the usual base models, one can test whether a coin is even, whether Donata is able to roll more 6es than Alex, both using the same die, whether beginners have more luck playing poker, or whether alcohol consumption affects the rates of heads and tails. The sample equations given in Section 2.2 can be used for these purposes in a parallel fashion. More complex situations exist. For example, von Eye (2002a; Ch. 8) showed that the probabilities for positive first, second, and higher order differences can be described using a probability model that suggests a probability pattern that differs greatly from a uniform distribution. Univariate deviations from such patterns suggest main effects that can lead to the emergence of types and antitypes that need explanation. This applies accordingly when such differences are studied in a multivariate context. An additional group of base models may be considered. In mathematical Psychology, researchers derive mathematical descriptions of processes that culminate in testable models. Alternative models are often formulated and plotted against each other. For example, Erdfelder and Buchner (1998) derived and tested several competing models of the hindsight bias. Evidence of hindsight bias is typically observed in situations in which subjective probabilities are estimated in retrospect with outcome information available. Raters are asked to indicate subjective probabilities as if the outcome information were unavailable. Hindsight is observed if subjective probabilities are biased in the direction of the outcome. This has also been called the knew-it-all-along effect (Wood, 1978). Erdfelder and Buchner (1998) derived a multinomial model of the cognitive processes involved in the hindsight bias phenomenon (see also Winman, Juslin, & Björkman, 1998). Sample equations appear, for instance in Erdfelder and Buchner’s Table 3 (1998, p. 396). Now, from the perspective of CFA applications, it is interesting to note that models as the ones derived by Erdfelder and Buchner allow one to predict the probabilities of particular outcome patterns. These probabilities can be compared with observed frequencies using goodness-of-fit tests. It can thus be attempted to decide which of a number of competing models describes the observed frequency distribution better. In addition, and this leads to an application of CFA, one can ask which observed frequencies in particular deviate from the predicted ones such that CFA types or antitypes result. A. von Eye 162 In the present article, we give no example for this kind of model. Examples of analyses that take the a priori probabilities of first and second differences into account can be found in von Eye (2002a). 2.4 CFA models that are based on distributional assumptions As was indicated in Section 2.1, log-linear models are typically used to explore (1) variable associations, (2) dependency structures, or (3) joint variable distributions. The CFA models discussed in the first two groups of base models allow one to perform analyses that reflect specific forms of variable associations and dependency structures. These are associations or dependency structures that result in types and antitypes and are not part of a base model. In this section, we discuss a new field of application for CFA, that is, the examination of distributional characteristics of cross-classifications (for more detail see von Eye & Gardiner, 2003; von Eye, 2003). Consider a multivariate space that is spanned by d continuous variables. The random vector x (d x 1) follows a d-dimensional normal distribution if its density can be described by f ( x) = (2π )− d / 2 | Σ |−0.5 e−0.5( x− µ ) / Σ −1 ( x−µ ) , where µ is the mean and the covariance matrix Σ is positive definite and has rank d. The characteristics of the multivariate normal distribution imply that if X follows this distribution, each subset of X will also follow this distribution. In particular, each univariate distribution will be normal too. The reverse is known not to be true. Variables that are normally distributed individually, do not necessarily display a multivariate normal distribution in general, unless, for example, independence is assumed. A large number of tests has been proposed to determine whether a sample can be assumed to stem from a multinormal distribution. Among these tests are graphical methods 2 which first calculate the squared multivariate Mahalanobis distances r of the individual data points, xi, from their means, then rank order these distances, and third plot them against the χ12−αi ;2 distribution, with 1 - αi = (i - 0.5)/N (see Jobson, 1992). Most popular are Mardia’s tests (1970, 1980) which use the concepts of multivariate skewness and kurtosis. The currently known tests of multinormality propose hypotheses that are compatible with predictions based on multinormality. However, they do not test multinormality directly. Von Eye and Gardiner (2003) proposed a procedure that tests multinormality directly. This procedure can be seen as a multivariate extension of the well known χ 2 -test of univariate normality. Specifically, the procedure of von Eye and Gardiner proceeds in the following steps: 1. Split the scales that are used to observe the variables under study in segments. Two approaches have been proposed. One approach involves creating equal-length segments on the scales. The other approach involves creating equal-probability segments on the scales. Equal-length segments imply that the probability of segments decreases as their distance from the mean increases. Equal-probability segments im- Base models for Configural Frequency Analysis 2. 3. 163 ply that their length increases with their distance from the mean. Let the number of segments of scale j be cj. Cross the segmented scales. The crossed segmented scales create a d-dimensional space with C = Π jc j sectors. Calculate for each of the C sectors the probability under the assumption of a dvariate normal distribution. The methods proposed by Genz (1992) can be used to calculate these probabilities. These probabilities are 2 z1i +1 z j +1 zkd+1 z1i z 2j zkd p( z1i − z1i +1 , z 2j − z 2j +1 ,..., zkd − zkd+1 ) = ∫ ∫ ... ∫ Ψ ( z1 , z 2 ,..., z d ) dz1dz 2 ...dz d , 4. 5. where the subscripts index the segments and the superscripts indicate the variables. Calculate the expected frequency for Sector sij...,k as Êi,j, ...,k = Npi,j,...,k. The following steps identify von Eye and Gardiner’s sector test as a CFA method. Determine for each sector whether the observed frequency, Ni,j,...,k differs from the expected frequency. The null hypothesis for these sector-specific tests is E[ N i , j ,...,k ] = Eˆ i , j ,...,k If this comparison suggests that a sector contains significantly more or fewer objects than expected based on the joint density function of the d variables under study, this sector evinces a violation of multivariate normality. Therefore, the assumption of multivariate normality must be rejected at least for this sector. The tests that can be used to examine the sectors are known from CFA. In addition to the CFA-like sector tests, von Eye and Gardiner proposed an overall good2 ness-of-fit test that is based on the sum of the X - components. Data example. In a study on the development of aggressive behaviors in adolescence, Finkelstein, von Eye, and Preece (1994) presented 114 adolescents about 12 years of age with a questionnaire that included questions about verbal aggression against adults (VAAA), physical aggression against peers, (PAAP), and aggressive impulses (AI). The respondents processed the instrument in 1983, 1985, and 1987. One condition that must be met for proper application of parametric multivariate statistical procedures is that of multinormality. In the present example, we examine the three variables from the wave in 1983, VAAA83, PAAP83, and AI83. Before testing multinormality, we ask whether these three variables can be considered normally distributed when analyzed at the univariate level. Descriptive statistics appear in Table 5. Table 5 suggests that the univariate skewness and kurtosis values of the three variables are not worrisome. The standard deviations are somewhat small, which can be explained by a relatively large number of ties near the scale mean. Figure 1 displays the probability plots of the three variables. The probability plots of the three variables in Figure 1 show that the variables are, at the level of univariate analysis, very near the normal distribution. We thus retain the hypothesis that these three variables are, in the population, normally distributed. We now ask whether these three variables can be assumed to stem from a multinormal population. In a first step, we inspect the 3D scatterplot of these three variables. Figure 2 displays the scatterplot. A. von Eye 164 Table 5: Descriptive statistics for the three self report variables VAAA83, PAAP83, and AI83 VAAA83 114 7.000 36.000 19.208 5.661 0.417 0.226 0.222 0.449 N of cases Minimum Maximum Mean Standard Dev Skewness(G1) SE Skewness Kurtosis(G2) SE Kurtosis PAAP83 114 8.000 44.000 21.291 8.254 0.606 0.226 -0.471 0.449 AI83 114 5.000 29.000 16.890 5.432 0.100 0.226 -0.442 0.449 3 2 1 0 -1 -2 -3 0 10 3 2 1 0 -1 -2 -3 0 20 VAAA83 Expected Value for Normal Distribution Expected Value for Normal Distribution Expected Value for Normal Distribution Figure 1: Probability plots of the variables VAAA83, PAAP83, and AI83 10 20 30 PAAP83 40 50 30 40 3 2 1 0 -1 -2 -3 0 10 20 AI83 30 Base models for Configural Frequency Analysis 165 Figure 3: 3D scatterplot of the variables VAAA83, PAAP83, and AI83 The visual inspection of the three-dimensional distribution of these three variables is certainly inconclusive. Obviously, the variables are strongly correlated (rVAAA83-PAAP83 = 0.606 rVAAA83-AI83 = 0.495, rAI83-PAAP83 = 0.372). However, it is hard to make out whether the deviations from multinormality are so strong that the null hypothesis must be rejected. Employing Mardia’s tests of multivariate skewness and kurtosis, we obtain skew = 0.163 with p = 0.9790 and kurtosis = 11.152 with p < 0.001, suggesting that the three variables do not conform with hypotheses that are compatible with the assumption of multinormality. This information, although useful, does not provide us with hints at where in the threedimensional space deviations are strongest. Therefore, we perform a CFA of the threedimensional space. Specifically, we create three equi-probable segments for each of the three indicators of aggression. Segment 1 of each variable contains the low-scoring adolescents, Segment 2 contains the adolescents with scores about the mean, and Segment 3 contains the high-scoring adolescents, that is, the adolescents with above average aggression scores. Then, we cross the thus categorized variables to obtain a 3 x 3 x 3 cross-classification. To calculate the expected cell frequencies, we use the method proposed by Genz (1992). This calculation takes the correlations among the three variables into account. We then perform a CFA of this cross-classification, using the z-test and the Bonferroni-adjusted "* = 0.00185. Table 6 displays results of the configural analysis. A. von Eye 166 Table 6: CFA of the three-dimensional distribution of the variables VAAA83, PAAP83, and AI83, performed with the goal of testing multinormality Configuration 111 112 113 121 122 123 131 132 133 211 212 213 221 222 223 231 232 233 311 312 313 321 322 323 331 332 333 Êi Ni 15 4 2 7 2 2 2 2 0 11 5 6 2 9 3 1 4 4 2 1 2 0 1 4 1 8 14 8.664 5.252 2.131 5.170 3.918 1.908 1.489 1.371 .792 3.657 4.005 2.690 4.803 6.148 4.815 2.874 4.342 4.040 .756 1.323 1.458 1.913 3.949 5.295 2.528 6.569 12.040 2 statistic 2.153 -.546 -.090 .805 -.969 .066 .419 .537 -.890 3.840 .497 2.018 -1.279 1.150 -.827 -1.105 -.164 -.020 1.431 -.281 .449 -1.383 -1.484 -.563 -.961 .558 .565 p .01567217 .29237026 .46416812 .21052392 .16625234 .47357112 .33758449 .29551351 .18674901 .00006162 .30947955 .02180589 .10043393 .12507364 .20403187 .13451753 .43488734 .49200377 .07618921 .38929046 .32688650 .08329751 .06889284 .28677873 .16831158 .28832552 .28608148 Type/ Antitype? Type The overall goodness-of-fit X for the model in Table 6 is 40.05 (df = 17; p = 0.00127). This significant value suggests that the null hypothesis of multinormality must be rejected. The sector-specific CFA tests will indicate whether there are particular sectors with too many or too few cases, compared to the expected frequencies that were calculated based on the assumption of multinormality. The sector-specific results in Table 6 suggest that there is one sector with more cases than expected based on the hypothesis of multinormality. In Sector 211, we find 11 adolescents, but less than 4 were expected for this sector. These are respondents who are about average in verbal aggression against adults, and below average both in physical aggression against peers, and aggressive impulses. Given this result, researchers can respond by discussing the assumption that the three aggression variables follow a multinormal distribution. There may be reasons why this assump- Base models for Configural Frequency Analysis 167 tion fails to carry. Alternatively, or in addition to discussing this assumption, the researchers may consider selective resampling, with the goal of completing the sample. For reasons of comparison, we present in Table 7 results from standard, first order CFA, that is, from a CFA of the cross-classification displayed in Table 6. In this CFA, only the main effects of the three variables are taken into account. 2 The overall Pearson goodness-of-fit X for the first order CFA base model is 96.12 (df = 20; p < 0.01). This value suggests significant deviations from the assumption of variable independence. Two types emerge which carry this result. The first type is constituted by Configuration 111. It contains 9 adolescents. Only about 3 had been expected. These are adolescents that score below average in all three aggression scales. The second type is constituted by Configuration 333. This type suggests that more adolescents with high scores on all three variables than expected based on the hypothesis of variable independence were found. Obviously, this result differs from the one presented in Table 6. Table 7: First order CFA of the variables VAAA83, PAAP83, and AI83, performed with the goal of testing deviations from independence Configuration 111 112 113 121 122 123 131 132 133 211 212 213 221 222 223 231 232 233 311 312 313 321 322 323 331 332 333 Ni Êi 15 4 2 7 2 2 2 2 0 11 5 6 2 9 3 1 4 4 2 1 2 0 1 4 1 8 14 5.452 4.787 4.920 3.407 2.992 3.075 4.089 3.590 3.690 6.814 5.983 6.150 4.259 3.740 3.843 5.111 4.488 4.612 4.997 4.388 4.510 3.123 2.742 2.819 3.748 3.291 3.382 statistic 4.090 -.360 -1.316 1.946 -.573 -.613 -1.033 -.839 -1.921 1.603 -.402 -.060 -1.095 2.720 -.430 -1.818 -.230 -.285 -1.341 -1.617 -1.182 -1.767 -1.052 .704 -1.419 2.596 5.773 p .00002162 .35958151 .09403134 .02580282 .28320486 .26995857 .15081630 .20068376 .02737383 .05442260 .34383465 .47595022 .13684194 .00326195 .33350828 .03450379 .40898962 .38780038 .08999708 .05290488 .11864076 .03859086 .14636441 .24080438 .07788918 .00471728 .00000000 Type/ Antitype? Type Type A. von Eye 168 3. Discussion Repeatedly, CFA has been labeled as nothing but a re-packaged residual analysis for loglinear models. Indeed, a good number of CFA applications uses the routine base model of CFA which is the first order model, a log-linear model of variable independence. Lehmacher (2000) and von Eye (2002) have raised the argument that even though many CFA base models use the same methods as log-linear modeling for the estimation of expected cell frequencies and for the analysis of the cell-wise deviances, there exist fundamental differences in the goals pursued with these two methods. Log-linear modeling is used to devise a model that describes the data well. If there are significant deviations from the frequencies proposed by a model, researchers try to modify the model such that it describes the data without significant deviations. In contrast, CFA is used to identify and explain configurations that constitute types and antitypes, that is, significant deviations from some suitably specified base model. The present article shows that the labeling of CFA as a method of log-linear residual analysis can be disputed from an additional perspective. This perspective considers the base models that can be specified for CFA application. Four groups of base models have been identified in this article. Only the first group includes log-linear models. As was mentioned above, the best known CFA base model in this group is the model of variable independence, used in Lienert’s (1969) original CFA. As was indicated above, this group of base models is very flexible. It allows one to consider a very large number of base models, including global and regional models, and models with and without covariates. The second group of base models uses population parameters instead of estimating the expected cell frequencies from the data. The third group of models is fueled by a priori considerations. These considerations reflect, for instance, processes as described by mathematical psychological theories, or theoretical propositions such as the a priori probability that can be determined for events as draws from urns or coin toss outcomes. The most frequently discussed base model in this group specifies the a priori probability of sign patterns for differences between adjacent scores in, for instance, time series. The fourth group of models is parametric in nature. This group allows one to test hypotheses concerning the multivariate distribution of categorized and cross-classified continuous variables. This group is new (von Eye & Gardiner, 2003). It uses Genz’ (1992) method which allows one to estimate the probability of rectangular sectors in a multivariate space under the assumption of multinormality. Methods of CFA are used in this context first to specify these sectors, and second to determine whether individual sectors contain more or fewer cases than expected based on the assumption of multinormality. Further developments of this approach will allow researchers to test additional distributional assumptions. Alternative definitions of sectors will be proposed in the future, for example, convex sectors, identified by methods of, for example, cluster analysis. The selection of a base model for a particular application of CFA is obviously not arbitrary. The following criteria can be used for selection: 1. 2. A base model must conform to the specification of CFA base models given above (von Eye & Schuster, 1998): uniqueness of interpretation, parsimony, and consideration of sampling scheme. A base model must also take into account the available information about the phenomenon under study. For example, population parameters must be taken into ac- Base models for Configural Frequency Analysis 3. 169 count if available. This applies accordingly to theoretical considerations concerning a priori probabilities, or to models that describe the process under study. A base model can be a hypothesis about a probability distribution. von Eye and Gardiner (2003) proposed using CFA methods to test hypotheses concerning multinormality. We thus conclude that CFA is a method with a much broader foundation than considered by arguments that reduce this method to a derivative of log-linear modeling. CFA is of interest whenever researchers ask questions concerning the frequencies in individual cells and groups of cells. References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. Bishop, Y.M.M., Fienberg, S.E., & Holland, P.W. (1975). Discrete multivariate analysis. Cambridge, MA: Cambridge University Press. DiFrancisco, W., & Critelman, Z. (1984). Soviet political culture and covert participation in policy implementation. American Political Science Review, 78, 603 - 621. Erdfelder, E., & Buchner, A. (1998). Decomposing the hindsight bias: A multinomial processing tree model for separating recollection and reconstruction in hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 387 - 414. Finkelstein, J. W., von Eye, A., & Preece, M. A. (1994). The relationship between aggressive behavior and puberty in normal adolescents: A longitudinal study. Journal of Adolescent Health, 15, 319 - 326. Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1, 141 - 149. Goodman, L.A. (1984). The analysis of cross-classified data having ordered categories. Cambridge, MA: Harvard University Press. Goodman, L.A. (1991). Measures, models, and graphical displays in the analysis of crossclassified data. Journal of the American Statistical Association, 86, 1085 - 1111. Havránek, T., & Lienert, G.A. (1984). Local and regional versus global contingency testing. Biometrical Journal, 26, 483 - 494. Heilmann, W.-R., Lienert, G.A., & Maly, V. (1979). Prediction models in Configural Frequency Analysis. Biometrical Journal, 24, 79 - 86. Jobson, J.D. (1992). Applied multivariate data analysis. Volume II: Categorical and multivariate analysis. New York: Springer. Lautsch, E., & von Weber, S. (1995). Methoden und Anwendungen der Konfigurationsfrequenzanalyse. Weinhein: Beltz. Lehmacher, W. (2000). Die Konfigurationsfrequenzanalyse als Komplement des log-linearen Modells. Psychologische Beiträge, 42, 418 - 427. Lawal, B. (2003). Categorical data analysis with SAS and SPSS applications. Mahwah, NJ: Lawrence Erlbaum. Lienert, G.A. (1969). Die Konfigurationsfrequenzanalyse als Klassifikationsmittel in der klinischen Psychologie, In: M. Irle (Hrsg.), Bericht über den 26. Kongress der Deutschen Gesellschaft für Psychologie 1968 in Tübingen (pp. 244 - 255). Göttingen: Hogrefe. Lienert, G.A., & Krauth, J. (1973). Die Konfigurationsfrequenzanalyse V. Kontingenz- und Interaktionsstrukturanalyse multinär skalierter Merkmale. Zeitschrift für Klinische Psychologie und Psychotherapie, 21, 26 - 39. 170 A. von Eye 16. Lienert, G. A., & von Eye, A. (1988). Syndromaufklärung mittels generalisierter ISA. Zeitschrift für Klinische Psychologie, Psychopathologie, und Psychotherapie, 36, 25-33. 17. Mardia, K.V. (1970). Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 519 - 530. 18. Mardia, K.V. (1980). Tests of univariate and mutivariate normality. In P.R. Krishnaiah (ed.), Handbook of statistics (vol. 1; pp 279 - 320). Amsterdam: North Holland. 19. Rindskopf, D. (1987). A compact BASIC program for log-linear models. In R.M. Heiberger (ed.), Computer science and statistics (pp. 381 - 386). Alexandria, VA: American Statistical Association. 20. Schuster, C., & von Eye, A. (2000). Using log-linear modeling to increase power in two-sample Configural Frequency Analysis. Psychologische Beiträge, 42, 273 - 284. 21. Spiel, C, & von Eye, A. (1992). Wen bilden wir aus--Oder stammen Österreichs Psychologiestudenten aus dieser Welt?--eine Anwendung der parametrischen KFA. Wien: Universität, Institut für Psychologie: unpublished paper. 22. Spiel, C., & von Eye, A. (1993). Configural Frequency Analysis as a parametric method for the search for types and antitypes. Biometrical Journal, 35, 151-164. 23. von Eye, A. (2002). Configural Frequency Analysis - Methods, Models, and Applications. Mahwah, NJ: Lawrence Erlbaum. (a) 24. von Eye, A. (2002). The odds favor antitypes - A comparison of tests for the identification of configural types and antitypes. Methods of Psychological Research - online, 7, 1-29. (b) 25. von Eye, A. (2003). Using methods of Configural Frequency Analysis to test hypotheses concerning multinormality. (In preparation). 26. von Eye, A., & Gardiner, J.C. (2003). Locating deviations from multivariate normality. (Under editorial review) 27. von Eye, A., & Mun, E.-Y. (2003). Characteristics of measures for 2 x 2 tables. Understanding Statistics, 2, 243 - 266. 28. von Eye, A., & Mun, E.Y. (2004). Modeling rater agreement - manifest variable approaches. Mahwah, NJ: Lawrence Erlbaum. 29. von Eye, A., & Schuster, C. (1998). On the specification of models for Configural Frequency Analysis - Sampling schemes in Prediction CFA. Methods of Psychological Research - online, 3, 55 - 73. 30. von Eye, A., Schuster, C., & Gutiérrez-Peña, E. (2000). Configural Frequency Analysis under retrospective and prospective sampling schemes - frequentist and Bayesian approaches. Psychologische Beiträge, 42, 428 - 447. 31. von Eye, A., Spiel, C., & Rovine, M. J. (1995). Concepts of nonindependence in Configural Frequency Analysis. Journal of Mathematical Sociology, 20, 41 - 54. 32. von Weber, S., Lautsch, E., & von Eye, A. (2003). Table-specific continuity corrections for Configural Frequency Analysis. Psychology Science, 45, 339 - 354. 33. Winman, A., Juslin, P., & Björkman, M. (1998). The confidence-hindsight mirror effect in judgement: An accuracy-assessment model for the knew-it-all-along phenomenon. Journal of Experimental Psychology: Learning Memory, and Cognition, 24, 415 - 431. 34. Wood, G. (1978). The knew-it-all-along effect. Journal of Experimental Psychology: Human Perception and Performance, 4, 345 - 353.