From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis

L. BILLARD and E. DIDAY

Increasingly, datasets are so large they must be summarized in some fashion so that the resulting summary dataset is of a more manageable size, while still retaining as much knowledge inherent to the entire dataset as possible. One consequence of this situation is that the data may no longer be formatted as single values such as is the case for classical data, but rather may be represented by lists, intervals, distributions, and the like. These summarized data are examples of symbolic data. This article looks at the concept of symbolic data in general, and then attempts to review the methods currently available to analyze such data. It quickly becomes clear that the range of methodologies available draws analogies with developments before 1900 that formed a foundation for the inferential statistics of the 1900s, methods largely limited to small (by comparison) datasets and classical data formats. The scarcity of available methodologies for symbolic data also becomes clear and so draws attention to an enormous need for the development of a vast catalog (so to speak) of new symbolic methodologies along with rigorous mathematical and statistical foundational work for these methods.

KEY WORDS: Clustering; Concepts; Descriptive statistics; Principal components; Symbolic data.

1. INTRODUCTION

With the advent of computers, very large datasets have become routine. What is not so routine is how to analyze the attendant data and/or how to glean useful information from within their massive confines. It is evident, however, that even in those situations where in theory available methodology might seem to apply, routine use of such statistical techniques is often inappropriate. The reasons for this are many.
Broadly, one reason surrounds the issue of whether or not the dataset really is a sample from some population, because often the data constitute the “whole,” as in, for example, the record of all credit transactions of all card users. A related question pertains to whether the data at a specified point in time can be viewed as being from the same population at another time; for example, will the credit card dataset have the same pattern the next “week” or when the next week’s transactions are added to the data already collected?

Another major broad reason that known techniques fail revolves around the issue of the sheer size of the dataset. Therefore, although traditional methods have served well on the smaller datasets that dominated in the past, it now behooves us as data analysts to develop procedures that work well on the large modern datasets. One approach is to summarize large datasets in such a way that the resulting summary dataset is of a manageable size and yet retains as much of the knowledge in the original dataset as possible. Thus in the credit card example, instead of hundreds of specific transactions for each person (or credit card) over time, a summary of the transactions per card (or per unit time, such as 1 week) can be made. One such summary format could be a range of transactions by, for instance, dollars spent (e.g., $10–$982), by type of purchase (e.g., gas, clothes, food), by type and expenditure (e.g., gas, $10–$30; food, $15–$95). In each of these examples, the data are no longer single values as in traditional data (e.g., $1,524 for total credit card expenditure, 37 total number of transactions). Instead, the summarized data constitute ranges, lists, and the like and thus are examples of symbolic data. Symbolic data have their own internal structure (not present, nor possible, in classical data) and as such should be analyzed using symbolic data analysis techniques.

L. Billard is University Professor & Professor of Statistics, Department of Statistics, University of Georgia, Athens, GA 30602. E. Diday is Professor of Computer Science, Centre d’Etudes et Recherches en Mathématiques et Aides à la Décision (CEREMADE), Université de Paris 9 Dauphine 75775, Paris Cedex 16, France. This work was partially supported by the National Science Foundation and INRIA France.

Whereas the summarization of very large datasets can produce smaller datasets consisting of symbolic data, symbolic data are distinctive in their own right on any size datasets. For small symbolic datasets, the question is how the analysis proceeds. For large datasets, the first question is what approach is being adopted to summarize the data into a (necessarily) smaller dataset. Buried behind any summarization is the notion of a symbolic concept, with any one aggregation being tied necessarily to the concept relating to a specific aim of an ensuing analysis.

In this work we review concepts and methods developed under the heading of symbolic data analysis. These methods so far have tended to be limited to developing methodologies to organize the data into meaningful and manageable formats, somewhat akin to the developments leading to frequency histograms and other basic descriptive statistics efforts before 1900, which themselves formed a foundation for the inferential statistics of the 1900s. A brief review of existing symbolic statistical methods is included herein. An extensive coverage of earlier results has been given by Bock and Diday (2000). What quickly becomes clear is that so far very little statistical methodology has been developed for the resulting symbolic data formats.

Symbolic data are defined and contrasted with classical data in Section 2.
The construction of classes of symbolic objects (a necessary precursor to statistical analyses when the size of the original dataset is too large for classical analyses, or where knowledge, in the form of classes, concepts, taxonomies, and so forth, is given as input instead of standard data) is discussed in Section 3. Available methods of symbolic data analysis are briefly described in Section 4, and some of these are discussed in more detail in subsequent sections. What becomes apparent is the inevitability of an increasing prevalence of symbolic data, and hence the attendant need to develop statistical methodologies to analyze such data. It will also be apparent that few methods currently exist, and even for those that do exist the need remains to establish mathematical underpinning and rigor, including statistical properties of the results of these procedures. Typically, point estimators have been developed, but there are still essentially no results governing their properties, such as standard errors and distribution theory. These remain open problems.

© 2003 American Statistical Association. Journal of the American Statistical Association, June 2003, Vol. 98, No. 462, Review Article. DOI 10.1198/016214503000242

2. SYMBOLIC DATASETS

A dataset may from its outset be structured as a symbolic dataset. Alternatively, it may be structured as a classical dataset, but when aggregated to establish it in a more manageable fashion, it becomes symbolic data. In this section we present examples of both classical and symbolic data and introduce notation describing symbolic datasets for analysis.

Suppose that we have a dataset comprising the medical records of individuals in a country. Suppose that for each individual there will be a record of geographic location variables, such as region (north, northeast, south, ...), city (Boston, Atlanta, ...), urban/rural (yes/no), and so on.
There will be demographic variables, such as gender, marital status, age, information on parents and siblings (alive or not), number of children, employer, and health provider. Basic medical variables could include weight, pulse rate, blood pressure, and other variables. Other health variables (for which the list of possible variables is endless) could include incidences of certain ailments and diseases; likewise, for a given incidence or prognosis, treatments and other related variables associated with that disease are recorded. A typical such dataset may follow the lines of that shown in Table 1.

Let $p$ be the number of variables for each individual $i \in \Omega = \{1, \ldots, n\}$, where clearly $p$ and $n$ can be extremely large, and let $Y_j$, $j = 1, \ldots, p$, represent the $j$th variable. Let $Y_j = x_{ij}$ be the particular value assumed by the variable $Y_j$ for the $i$th individual in the classical setting, and write $X = (x_{ij})$ as the $n \times p$ matrix of the entire dataset. Let the domain of $Y_j$ be $\mathcal{Y}_j$; so $X = (Y_1, \ldots, Y_p)$ takes values in $\mathcal{X} = \times_{j=1}^{p} \mathcal{Y}_j$. (Because the presence or absence of missing values is of no importance for the present, let us assume all values exist, even though this is highly unlikely for large datasets.)

Variables can be quantitative; for example, age with $\mathcal{Y}_{\text{age}} = \{x \ge 0\} = \Re^{+}$ as a continuous random variable or with $\mathcal{Y}_{\text{age}} = \{0, 1, 2, \ldots\} = N_0$ as a discrete random variable. Variables can be categorical; for example, city with $\mathcal{Y}_{\text{city}} = \{\text{Atlanta}, \text{Boston}, \ldots\}$ or coded $\mathcal{Y}_{\text{city}} = \{1, 2, \ldots\}$. Disease variables can be recorded as categories (coded or not) of a single variable with domain $\mathcal{Y} = \{\text{heart}, \text{stroke}, \text{cancer}, \text{cirrhosis}, \ldots\}$ or as an indicator variable (e.g., $Y = \text{cancer}$ with domain $\mathcal{Y} = \{\text{no}, \text{yes}\}$ or $\{0, 1\}$ or with other coded levels indicating stages of disease). Likewise, for a recording of the many possible types of cancers, each type may be represented by a variable, $Y$, or by a category of the cancer variable.

Table 1. Sample Dataset: Classical Data

i     City      Gender   Age (years)   Systolic pressure (mm Hg)   Diastolic pressure (mm Hg)   Cancer
1     Boston    M        24            120                         79                           No
2     Boston    M        56            130                         90                           No
3     Chicago   M        48            126                         82                           Lung
4     El Paso   F        47            121                         86                           Yes
...   ...       ...      ...           ...                         ...                          ...

The precise nature of the description of the variables is not critical. What is crucial in the classical setting is that for each $x_{ij}$ in $X$, there is precisely one possible realized value; for example, an individual’s $Y_{\text{age}} = 24$, $Y_{\text{city}} = \text{Boston}$, $Y_{\text{cancer}} = \text{yes}$, $Y_{\text{pulse}} = 64$, and so on. Thus a classical data point is a single point in the $p$-dimensional space $\mathcal{X}$.

In contrast, a symbolic data point can be a hypercube in $p$-dimensional space or a Cartesian product of distributions. Entries in a symbolic dataset (denoted by $\xi_{ij}$) are not restricted to a single specific value. Thus age could be recorded as being in an interval, for example, $[0, 10), [10, 20), [20, 30), \ldots$. This could occur when the data point represents the age of a family or group of individuals whose collective ages fall in an interval (such as $[20, 30)$ years, say), or the data may correspond to a single individual whose precise age is unknown but is known to be within an interval range, or whose age has varied over time in the course of the experiment that generated the data; or combinations and variations thereof, producing interval-ranged data. In a different direction, it may not be possible to measure some characteristic accurately as a single value (e.g., pulse rate is 64), but rather the variable is measured as an $(x \pm \delta)$ value (e.g., pulse rate is $64 \pm 2$). A person’s weight may fluctuate between $(130, 135)$ over a weekly period. The blood pressure variable may be recorded by its [low, high] values, for example, $\xi_{ij} = [78, 120]$. These variables are interval-valued symbolic variables.
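As a rough illustration of how interval-valued entries of this kind might be held in code, the following Python sketch builds intervals from a measurement with a tolerance ($x \pm \delta$) and from the values observed for a group. The `Interval` class and the helper names are hypothetical and not part of any symbolic data analysis package; closed intervals are assumed throughout, whereas the text also uses half-open ones such as $[20, 30)$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [low, high]; a hypothetical container for one interval-valued entry."""
    low: float
    high: float

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("low must not exceed high")

    def contains(self, x: float) -> bool:
        return self.low <= x <= self.high


def from_tolerance(x: float, delta: float) -> Interval:
    """Measurement recorded as x +/- delta, e.g. a pulse rate of 64 +/- 2."""
    return Interval(x - delta, x + delta)


def from_group(values) -> Interval:
    """Smallest closed interval covering the values observed for a group."""
    values = list(values)
    return Interval(min(values), max(values))


pulse = from_tolerance(64, 2)            # Interval(low=62, high=66)
family_ages = from_group([22, 25, 29])   # illustrative ages, not from the paper
blood_pressure = Interval(78, 120)       # the [low, high] example in the text
```

The choice of a frozen dataclass is only a convenience: it makes an entry hashable and prevents accidental modification once a summary has been built.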
A different type of variable is a cancer variable that may have a domain $\mathcal{Y} = \{\text{lung}, \text{bone}, \text{breast}, \text{liver}, \text{lymphoma}, \text{prostate}, \ldots\}$ listing all possible cancers, with a specific individual having the particular values $\xi_{ij} = \{\text{lung}, \text{liver}\}$. In another example, suppose the variable $Y_j$ represents type of automobile owned by a household, with domain $\mathcal{Y}_j = \{\text{Chevrolet}, \text{Ford}, \text{Toyota}, \text{Volvo}, \ldots\}$. A particular household $i$ may have the value $\xi_{ij} = \{\text{Toyota}, \text{Volvo}\}$. Such variables are called multivalued variables.

A third type of symbolic variable is a modal variable. Modal variables are multistate variables with a frequency, probability, or weight attached to each specific value in the data; that is, the modal variable $Y$ is a mapping $Y(i) = \{U(i), \pi_i\}$ for $i \in \Omega$, where $\pi_i$ is a nonnegative measure or a distribution on the domain $\mathcal{Y}$ of possible observation values and $U(i) \subseteq \mathcal{Y}$ is the support of $\pi_i$. For example, if three of an individual’s siblings are diabetic and one is not, then the variable describing propensity to diabetes could take the particular value $\xi_{ij} = \{3/4 \text{ diabetes}, 1/4 \text{ nondiabetes}\}$. More generally, $\xi_{ij}$ may be, for example, a histogram, an empirical distribution function, a probability distribution, or a model. Indeed, Schweizer (1984) opined that “distributions are the numbers of the future.” While in this example the weights $(3/4, 1/4)$ might represent relative frequencies, other kinds of weights, such as “capacities,” “credibilities,” “necessities,” or “possibilities” may be used (see Diday 1995).

In general, then, unlike classical data for which each data point consists of a single (categorical or quantitative) value, symbolic data can contain internal variation and can be structured. It is the presence of this internal variation that necessitates new techniques for analysis that generally will differ from those for classical data.
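The multivalued and modal entries described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a multivalued entry is taken to be a set of categories, and a modal entry a mapping from each category in the support $U(i)$ to a relative-frequency weight. The function names `multivalued` and `modal` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def multivalued(values):
    """Multivalued entry: the set of categories observed, e.g. {lung, liver}."""
    return frozenset(values)


def modal(observations):
    """Modal entry: each observed category paired with its relative frequency."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: Fraction(n, total) for category, n in counts.items()}


cancers = multivalued(["lung", "liver"])
cars = multivalued(["Toyota", "Volvo"])

# Three diabetic siblings and one nondiabetic sibling, as in the text.
siblings = ["diabetes", "diabetes", "diabetes", "nondiabetes"]
propensity = modal(siblings)   # weights 3/4 and 1/4
support = set(propensity)      # U(i): the categories carrying positive weight
```

Relative frequencies are only one kind of weight; the other kinds mentioned in the text (capacities, credibilities, and so on) would need a different construction than the counting used here.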
Note, however, that classical data represent special cases (e.g., the classical point $x = a$ is equivalent to the symbolic interval $\xi = [a, a]$).

Notationally, we have a basic set of objects that are elements or entities, $E = \{1, \ldots, N\}$, called the object set. This object set can represent a universe of individuals $E = \Omega$ (as earlier), in which case $N = n$; or if $N \le n$, any one object set is a subset of $\Omega$. Also, as frequently occurs in symbolic analyses, the objects $u$ in $E$ are classes $C_1, \ldots, C_m$ of individuals in $\Omega$, with $E = \{C_1, \ldots, C_m\}$, and $N = m$. Thus, for example, class $C_1$ may consist of all those individuals in $\Omega$ who have had cancer.

Each object $u \in E$ is described by $p$ symbolic variables $Y_j$, $j = 1, \ldots, p$, with domain $\mathcal{Y}_j$ and with $Y_j$ being a mapping from the object set $E$ to a range $\mathcal{B}_j$ that depends on the type of variable $Y_j$ is. Thus if $Y_j$ is a classical quantitative variable, then the domain $\mathcal{B}_j$ is a subset of the real line $\Re$ (i.e., $\mathcal{B}_j \subseteq \Re$); if $Y_j$ is an interval variable, then $\mathcal{B}_j = \{[\alpha, \beta],\ -\infty < \alpha, \beta < \infty\}$; if $Y_j$ is categorical (nominal, ordinal, subsets of a finite domain $\mathcal{Y}_j$), then $\mathcal{B}_j = \{B \mid B \subseteq \{\text{(e.g., list of cancers)}\}\}$; and if $Y_j$ is a modal variable, then $\mathcal{B}_j = M(\mathcal{Y}_j)$, where $M(\mathcal{Y})$ is the family of all nonnegative measures on $\mathcal{Y}$.

Then the symbolic data for the object set $E$ are represented by the $N \times p$ matrix $X = (\xi_{uj})$, where $\xi_{uj} = Y_j(u) \in \mathcal{B}_j$ is the observed symbolic value for the variable $Y_j$, $j = 1, \ldots, p$, for the object $u \in E$. The row $x_u'$ of $X$ is called the “symbolic description” of the object $u$. Thus for the data in Table 2, the first row,

$$x_1' = \{[20, 30],\ [79, 120],\ \text{Boston},\ \{\text{brain tumor}\},\ \{\text{male}\},\ [170, 180]\},$$

represents a male in his 20s who has a brain tumor, a blood pressure of 120/79 mm Hg, weighs between 170 and 180 pounds, and lives in Boston.
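To make the notion of a symbolic description concrete, the sketch below writes the row $x_1'$ as a Python dictionary and shows the classical point $x = a$ embedded as the degenerate interval $[a, a]$. The representation (tuples for intervals, frozensets for multivalued entries) and the helper names `as_interval` and `covers` are assumptions made for illustration only; intervals are treated as closed.

```python
def as_interval(value):
    """Embed a classical number a as the degenerate interval (a, a); pass intervals through."""
    if isinstance(value, tuple):
        low, high = value
        return (low, high)
    return (value, value)


def covers(interval, other):
    """True if the closed interval contains the (possibly degenerate) interval other."""
    return interval[0] <= other[0] and other[1] <= interval[1]


# Symbolic description x_1' as given in the text (weight included as stated there).
x1 = {
    "age": (20, 30),
    "blood_pressure": (79, 120),
    "city": "Boston",
    "cancer": frozenset({"brain tumor"}),
    "gender": frozenset({"male"}),
    "weight": (170, 180),
}

# Classical ages from rows 1 and 2 of Table 1, re-expressed as degenerate intervals.
age_row1 = as_interval(24)   # (24, 24)
age_row2 = as_interval(56)   # (56, 56)

row1_fits = covers(x1["age"], age_row1)   # True: 24 lies in [20, 30]
row2_fits = covers(x1["age"], age_row2)   # False: 56 does not
```

Treating classical values as degenerate intervals lets one routine handle both kinds of entry, which is the practical content of the remark that classical data are a special case.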
The object $u$ associated with this $x_u'$ may be a specific male individual followed over a 10-year period whose weight has fluctuated between 170 and 180 pounds over that interval, or $u$ could be a collection of individuals whose ages range from 20 to 30 and who have the characteristics described by $x_u'$. The data $x_4'$ in Table 2 may represent the same individual as that represented by the $i = 4$th individual of Table 1 but where it is known only that she has either breast cancer (with probability $p$) or lung cancer (with probability $1 - p$), but it is not known which. On the other hand, it could represent the set of 47-year-old women from El Paso of whom a proportion $p$ have breast cancer and proportion $(1 - p)$ have lung cancer; or it could represent individuals who have both lung and breast cancer, and so on. At some stage, whether the variable (here, type of cancer) is categorical, a list, modal, or whatever would have to be explicitly defined.

Table 2. Sample Dataset: Symbolic Data

u     Age        Blood pressure (mm Hg)   City      Type of cancer              Gender
1     [20, 30)   (79/120)                 Boston    {Brain tumor}               {Male}
2     [50, 60)   (90/130)                 Boston    {Lung, liver}               {Male}
3     [45, 55)   (80/130)                 Chicago   {Prostate}                  {Male}
4     [47, 47)   (86/121)                 El Paso   {Breast p, lung (1 - p)}    {Female}
...   ...        ...                      ...       ...                         ...

Another issue relates to dependent variables, which for symbolic data implies logical dependence, hierarchical dependence, taxonomic, or stochastic dependence. Logical dependence is as the word implies; for example, if $[\text{age} \le 10]$, then $[\text{number of children} = 0]$. Hierarchical dependence occurs when the outcome of one variable (e.g., $Y_2$ = treatment for cancer) depends on the actual outcome realized for other variables (e.g., $Y_1$ = has cancer with $\mathcal{Y}_1 = \{\text{no}, \text{yes}\}$). If $Y_1$ has the value $\{\text{yes}\}$, then $Y_2 = \{\text{say, chemotherapy}\}$; whereas if $Y_1 = \{\text{no}\}$, then clearly $Y_2$ is not applicable. (We assume for illustrative purposes here that the individual does not have chemotherapy treatment for some other reason.)
In these cases, the nonapplicable variable $Z$ is defined with domain $\mathcal{Z} = \{\text{NA}\}$. Such variables are also called mother ($Y_1$)–daughter ($Y_2$) variables. Other variables may exhibit a taxonomic dependency; for example, $Y_1$ = region and $Y_2$ = city can take values, say, if $Y_1$ = northeast, then $Y_2$ = Boston, or if $Y_1$ = south, then $Y_2$ = Atlanta.

3. CLASSES AND THEIR CONSTRUCTION: SYMBOLIC OBJECTS

Suppose that we have a dataset $\Omega$ consisting of $n$ individuals $i \in \Omega = \{1, \ldots, n\}$, and that we wish to aggregate these into $m$ classes. For example, one set of classes, $C_1, \ldots, C_m$, may represent individuals categorized according to $m$ different types of primary diseases, whereas another analysis may necessitate a class structure by, say, city, gender, age, or gender and age. This stage of the symbolic data analysis is designed so as to elicit more manageable formats before any statistical analysis. Note that this construction may (but need not) be distinct from classes obtained from a clustering procedure.

This leads us to the concept of a symbolic object developed in a series of articles by Diday and colleagues (e.g., Diday 1987, 1989, 1990; Bock and Diday 2000; Stéphan, Hébrail, and Lechevallier 2000). We introduce this here through some motivating examples, and then at the end of this section present a more rigorous formal definition.

3.1 Some Examples

Suppose that we are interested in the concept “northeasterner.” Thus we have a description $d$ representing the particular values $\{\text{Boston}, \ldots, \text{other northeast cities}, \ldots\}$ in the domain $\mathcal{Y}_{\text{city}}$; and we have a relation $R$ (here $\in$) linking the variable $Y_{\text{city}}$ with the particular description of interest. We write this as $[Y_{\text{city}} \in \{\text{Boston}, \ldots, \text{other northeast cities}, \ldots\}] = a$, say. Then each individual $i$ in $\Omega = \{1, \ldots, n\}$ is either a northeasterner or is not. That is, $a$ is a mapping from $\Omega \to \{0, 1\}$, where for an individual $i$ who lives in the northeast, $a(i) = 1$; otherwise, $a(i) = 0$, $i \in \Omega$.
Thus if an individual $i$ lives in Boston [i.e., $Y_{\text{city}}(i) = \text{Boston}$], then we have $a(i) = [\text{Boston} \subseteq \{\text{Boston}, \ldots, \text{other northeast cities}, \ldots\}] = 1$. The set of all $i \in \Omega$ for which $a(i) = 1$ is called the extent of $a$ in $\Omega$. The triple $s = (a, R, d)$ is a symbolic object, where $R$ is a relation between the description $Y(i)$ of the (silent) variable $Y$ and a description $d$, and $a$ is a mapping from $\Omega$ to $L$ that depends on $R$ and $d$. (In the northeasterner example, $L = \{0, 1\}$.) The description $d$ is an intentional description; for example, as the name suggests, we intend to find the set of individuals in $\Omega$ who live in the “northeast.”

Symbolic objects play a role in one of three major ways within the scope of symbolic data analyses. First, a symbolic object may represent a concept by its intent (i.e., its description and a way for calculating its extent) and can be used as the input of a symbolic data analysis. Thus the concept “northeasterner” can be represented by a symbolic object whose intent is defined by a characteristic description and a way to find its extent, which is the set of people who live in the northeast. A set of such regions and their associated symbolic objects can constitute the input of a symbolic data analysis. Second, it can be used as output from a symbolic data analysis, as when a clustering analysis suggests that northeasterners belong to a particular cluster, where the cluster itself can be considered a concept and be represented by a symbolic object. The third situation is when we have a new individual $(i')$ who has description $d'$, and we want to know whether this individual $(i')$ matches the symbolic object whose description is $d$; that is, we compare $d$ and $d'$ by $R$ to give $[d' R d] \in L = \{0, 1\}$, where $[d' R d] = 1$ means that there is a connection between $d'$ and $d$.
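The triple $s = (a, R, d)$ and its extent can be sketched in Python as follows. This is a minimal illustration restricted to a single variable, not the general construction: the class name `SymbolicObject`, the relation `is_member`, and the tiny dataset are all hypothetical, and because the text leaves the list of northeastern cities elided, only Boston is used as a stand-in for it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Individual = Dict[str, Any]


@dataclass(frozen=True)
class SymbolicObject:
    """Boolean symbolic object s = (a, R, d) on one variable (a simplification)."""
    variable: str
    relation: Callable[[Any, Any], bool]   # the relation R
    description: Any                       # the description d

    def a(self, individual: Individual) -> int:
        """The mapping a: Omega -> {0, 1}, determined by R and d."""
        return int(self.relation(individual[self.variable], self.description))

    def extent(self, omega: List[Individual]) -> List[Individual]:
        """All individuals i in omega with a(i) = 1."""
        return [i for i in omega if self.a(i) == 1]


def is_member(value, description) -> bool:
    """The relation used in the 'northeasterner' example: membership."""
    return value in description


# Placeholder for the elided list of northeastern cities (assumption: Boston only).
NORTHEAST = frozenset({"Boston"})
northeasterner = SymbolicObject("city", is_member, NORTHEAST)

omega = [
    {"id": 1, "city": "Boston"},
    {"id": 2, "city": "Boston"},
    {"id": 3, "city": "Chicago"},
    {"id": 4, "city": "El Paso"},
]

extent_ids = [i["id"] for i in northeasterner.extent(omega)]   # [1, 2]

# Third use: does a new individual i' with description d' match the object?
new_individual = {"id": 5, "city": "Atlanta"}
matches = northeasterner.a(new_individual)                     # 0
```

Keeping $R$ as a separate callable mirrors the definition: the same description $d$ could be paired with a different relation (equality, inclusion, a distribution comparison) to give a different mapping $a$.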
This “new” individual may be an “old” individual but with updated data, or it may be a new individual being added to the database who may or may not “fit into” one of the classes of symbolic objects already present (e.g., should this person be provided with specific insurance coverage?).

In the context of the aggregation of our data into a smaller number of classes, were we to aggregate the individuals in $\Omega$ by city (i.e., by the value of the variable $Y_{\text{city}}$), then the respective classes $C_u$, $u \in \{1, \ldots, m\}$, constitute those individuals in $\Omega$ who are in the extent of the corresponding mapping $a(u)$, say. Subsequent statistical analysis can take one of two broad directions. Either we analyze, separately for each class, the classical or symbolic data for the individuals in $C_u$ as a sample of $n_u$ observations as appropriate, or we summarize the data for each class to yield a new dataset with one “observation” per class. In the latter case, the dataset values will be symbolic data regardless of whether the original values were classical or symbolic data. For example, even though each individual in $\Omega$ is recorded as having or not having cancer ($Y_{\text{cancer}}$ = no, yes), that is, as a classical data value, this variable when related to the class for, say, city will become, for example, $\{\text{yes } (.1), \text{no } (.9)\}$; that is, 10% have had cancer and 90% have not. Thus the variable $Y_{\text{cancer}}$ is now a symbolic modal-valued variable.

Likewise, a class constructed as the extent of a symbolic object is typically described by a symbolic dataset. For example, suppose that our interest lies with “those who live in Boston” (i.e., $a = [Y_{\text{city}} = \text{Boston}]$), and suppose that the variable $Y_{\text{child}}$ is the number of children that each individual $i \in \Omega$ has, with possible values $\{0, 1, 2, \ge 3\}$. (The adjustment for a symbolic data value such as individual $i$ has 1 or 2 children, that is, $\xi_i = \{1, 2\}$, readily follows.)
Then the object representing all those who live in Boston will now have the symbolic variable $Y_{\text{child}}$ with particular value

$$Y_{\text{child}} = \{(0, f_0), (1, f_1), (2, f_2), (\ge 3, f_3)\},$$

where $f_i$, $i = 0, 1, 2, \ge 3$, is the relative frequency of individuals in this class who have $i$ children.

A special case of a symbolic object is an assertion. Assertions, also called queries, are particularly important when aggregating individuals into classes from an initial (relational) database. Let us denote by $z = (z_1, \ldots, z_p)$ the required description of interest of an individual $i$ or of a concept $w$. Here $z_j$ can be a classical single-valued entity, $x_j$, or a symbolic entity, $\xi_j$. Thus, for example, suppose that we are interested in the symbolic object representing those who live in the northeast; then $z_{\text{city}}$ is the set of northeastern cities. We formulate this as the assertion

$$a = [Y_{\text{city}} \in \{\text{Boston}, \ldots, \text{other northeast cities}, \ldots\}], \tag{1}$$

where $a$ is a mapping from $\Omega$ to $\{0, 1\}$ such that for individual or object $w$, $a(w) = 1$ if $Y_{\text{city}}(w) \in \{\text{Boston}, \ldots, \text{other northeast cities}, \ldots\}$. In general, an assertion takes the form

$$a = [Y_{j_1} R_{j_1} z_{j_1}] \wedge [Y_{j_2} R_{j_2} z_{j_2}] \wedge \cdots \wedge [Y_{j_v} R_{j_v} z_{j_v}] \tag{2}$$

for $1 \le j_1, \ldots, j_v \le p$, where “$\wedge$” indicates the logical multiplicative “and” and $R$ represents the specified relationship between the symbolic variable $Y_j$ and description value $z_j$. For each individual $i \in \Omega$, $a(i) = 1$ (or 0) when the assertion is true (or not) for that individual. More precisely, an assertion is a conjunction of $v$ events, $[Y_k R_k z_k]$, $k = 1, \ldots, v$. For example, the assertions

$$a = [Y_{\text{age}} \ge 60]; \quad a = [Y_{\text{cancer}} = \text{yes}] \wedge [Y_{\text{age}} \ge 60]; \quad a = [Y_{\text{age}} < 20] \wedge [Y_{\text{age}} > 70] \tag{3}$$

represent all individuals age 60 and over, all cancer patients age 60 and over, and all individuals under age 20 and over 70. In each case, we are dealing with a concept “those over age 60,” “those over age 60 with cancer,” and so on, and $a$ maps the particular individuals present in $\Omega$ onto the space $\{0, 1\}$.
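An assertion of the form (2), its use to build a class, and the summary of that class by relative frequencies can be put together in a short Python sketch. Everything here is illustrative: the helper names (`event`, `assertion`, `extent`, `relative_frequencies`) are hypothetical, and the four individuals are invented data, not taken from Table 1.

```python
import operator
from collections import Counter
from fractions import Fraction

# Invented individuals, used only to exercise the functions below.
omega = [
    {"id": 1, "city": "Boston", "age": 24, "cancer": "no", "children": 0},
    {"id": 2, "city": "Boston", "age": 66, "cancer": "yes", "children": 2},
    {"id": 3, "city": "Chicago", "age": 71, "cancer": "yes", "children": 3},
    {"id": 4, "city": "Boston", "age": 62, "cancer": "no", "children": 2},
]


def event(variable, relation, z):
    """One event [Y R z], returned as a predicate on an individual."""
    return lambda i: relation(i[variable], z)


def assertion(*events):
    """Conjunction of events as in (2): a(i) = 1 only if every event holds."""
    return lambda i: int(all(e(i) for e in events))


def extent(a, individuals):
    """Individuals for which the assertion evaluates to 1."""
    return [i for i in individuals if a(i) == 1]


def relative_frequencies(individuals, variable):
    """Summarise a class by the relative frequency of each observed value."""
    counts = Counter(i[variable] for i in individuals)
    n = sum(counts.values())
    return {value: Fraction(count, n) for value, count in counts.items()}


aged_60_plus = assertion(event("age", operator.ge, 60))
cancer_60_plus = assertion(event("cancer", operator.eq, "yes"),
                           event("age", operator.ge, 60))
boston = assertion(event("city", operator.eq, "Boston"))

ids_60_plus = [i["id"] for i in extent(aged_60_plus, omega)]            # [2, 3, 4]
ids_cancer_60_plus = [i["id"] for i in extent(cancer_60_plus, omega)]   # [2, 3]

# Modal value of the 'children' variable for the class "lives in Boston".
boston_children = relative_frequencies(extent(boston, omega), "children")
```

With these invented data, the Boston class has one individual with no children and two with two children, so the modal value is {0: 1/3, 2: 2/3}. The grouping of counts of three or more into a single category, as in the text, is omitted for brevity.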
The assertion $a = [Y_{\text{cancer}} \in \{\text{lung}, \text{liver}\}]$ describes the class of individuals who have either lung cancer or liver cancer or both, whereas

$$a = [Y_{\text{age}} \subseteq [20, 30]] \wedge [Y_{\text{city}} \in \{\text{Boston}\}] \tag{4}$$

describes those who live in Boston and are in their 20s. The relations $R$ can take any of the forms $=, \ne, \in, \le, \subseteq$, and so on, or can be a matching relationship (e.g., a comparison of probability distributions), a structured sequence, or some other form.

The domain of the symbolic object can be written as

$$D = D_1 \times \cdots \times D_p,$$

where $D_j \subseteq \mathcal{Y}_j$. The $p$-tuple $(D_1, \ldots, D_p)$ of sets is called a description system, and each subset $D$ is a description set consisting of description vectors $z = (z_1, \ldots, z_p)$, whereas a combination of elements $z_j \in \mathcal{Y}_j$ and sets $D_j \subseteq \mathcal{Y}_j$ as in (4) is simply a description. When there are constraints on any of the variables (such as when logical dependencies exist), then the space $D$ has a “hole” in it corresponding to those values that match the constraints. The totality of all descriptions $D$ is the description space $\mathcal{D}$. Hence an assertion can be written as

$$a = [Y \in D] \equiv \bigwedge_{k=1}^{v} [Y_{j_k} R_{j_k} z_{j_k}] = [Y R z],$$

where $R = \bigwedge_{k=1}^{v} R_{j_k}$ is called the product relation. Note that implicitly, if an assertion does not involve a particular variable, say, $Y_j$, then the domain relating to that variable remains unchanged at $\mathcal{Y}_j$. For example, if $p = 3$, and $Y_1$ = city, $Y_2$ = age, and $Y_3$ = weight, then the assertion $a = [Y_1 \in \{\text{Boston}, \text{Atlanta}\}]$ has domain $D = \{\text{Boston}, \text{Atlanta}\} \times \mathcal{Y}_2 \times \mathcal{Y}_3$ and seeks all those in Boston and Atlanta only (but regardless of age and weight).

3.2 Formal Definitions

Let us now formally define some concepts.

Definition 1. When an assertion is asked of any particular object $i \in \Omega$, it assumes the value true ($a = 1$) if that assertion holds for that object, or false ($a = 0$) if it does not. We write

$$a(i) = [Y(i) \in D] = \begin{cases} 1, & \text{if } Y(i) \in D \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$
The function $a(i)$, called a truth function, represents a mapping of $\Omega$ onto $\{0, 1\}$. The set of all $i \in \Omega$ for which the assertion holds is called the extension in $\Omega$, denoted by $\mathrm{Ext}(a)$ or $Q$,

$$\mathrm{Ext}(a) = Q = \{i \in \Omega \mid Y(i) \in D\} = \{i \in \Omega \mid a(i) = 1\}. \tag{6}$$

Formally, the mapping $a : \Omega \to \{0, 1\}$ is called an extension mapping.

Typically, a class would be identified by the symbolic object that described it. For example, the assertion (4) corresponds to the symbolic object “20 to 30-year-olds living in Boston.” The extension of this assertion, $\mathrm{Ext}(a)$, produces the class consisting of all individuals $i \in \Omega$ that match this description $a$, that is, those $i$ for whom $a(i) = 1$.

A class may be constructed as in the foregoing by seeking the extension of an assertion $a$. Alternatively, it may be that one desires to find all those (other) individuals in $\Omega$ who match the description $Y_j(u)$ of a given individual $u \in \Omega$. Thus in this case, the assertion is $a(i) = \bigwedge_{j=1}^{p} [Y_j = Y_j(u)]$, where now $z_j = Y_j(u)$, and has extension $\mathrm{Ext}(a_i) = \{i \in \Omega \mid Y_j(i) = Y_j(u),\ j = 1, \ldots, p\}$.

More generally, an assertion may hold with some intermediate value (e.g., a probability), that is, $0 \le a(i) \le 1$, representing a degree of matching of an object $i$ with an assertion $a$. Flexible matching is also possible. In these cases, the mapping $a$ is onto the interval $[0, 1]$. Then the extension of $a$ has level $\alpha$, $0 \le \alpha \le 1$, where now

$$\mathrm{Ext}_{\alpha}(a) = Q_{\alpha} = \{i \in \Omega \mid a(i) \ge \alpha\}.$$

Diday and Emilion (1996) and Diday, Emilion, and Hillali (1996) studied modal symbolic objects and also provided some of their theoretical underpinning; see also Section 9.

We now have the formal definition of a symbolic object, as follows.

Definition 2. A symbolic object is the triple $s = (a, R, d)$, where $a$ is a mapping $a : \Omega \to L$ that maps individuals $i \in \Omega$ onto the space $L$ depending on the relation $R$ between descriptions and the description $d$. When $L = \{0, 1\}$, that is, when $a$ is a binary mapping, $s$ is a Boolean symbolic object.
When $L = [0, 1]$, $s$ is a modal symbolic object.

Finally, computational implementation of the generation of appropriate classes can be executed by queries (assertions) used in search engines, such as the exhaustive search, genetic, or hill climbing algorithms, and/or by the use of available software, such as the structured query language (SQL) package or other “object-oriented” language, such as C++ or Java. Stéphan et al. (2000) provided some detailed examples illustrating the use of SQL. Another package specifically written for symbolic data is the SODAS (Symbolic Official Data Analysis System) software. Gettler-Summa (1999, 2000) developed the MGS (marking and generalization by symbolic descriptions algorithm) for building symbolic descriptions, starting with classical nominal data. Thus the outputs (ready for symbolic data analysis) are symbolic objects that have modal or multivalued variables and that identify logical links between the variables. Csernel (1997), inspired by Codd’s (1972) methods of dealing with relational databases with functional dependencies between variables, developed a normalization of Boolean symbolic objects taking into account constraints on the variables.

Although we have not addressed this application explicitly, the concepts described in this section can also be applied to merged or linked datasets. For example, suppose that we also had a dataset consisting of meteorological and environmental variables (measured as classical or symbolic data) for U.S. cities. Then if the data are merged by city, it is easy to obtain relevant meteorological and environmental data for each individual identified in Table 1. Thus, for example, questions relating to environmental measures and cancer incidences could be considered.

4. SYMBOLIC DATA ANALYSES

Given a symbolic dataset, the next step is to conduct statistical analyses as appropriate. As for classical data, the possibilities are endless.
Unlike classical data, for which a century of effort has produced a considerable library of analytical/statistical methodologies, symbolic statistical analyses are new, and available methodologies are still only few in number. In the following sections, we review some of these as they pertain to symbolic data. We consider descriptive statistics in Sections 5 and 6. We review principal component methods for interval-valued data in Section 7, and treat clustering methods in Section 8. We restrict attention to the more fully developed criterion-based divisive clustering approach for categorical and interval-valued variables. We also provide a summary of other methods limited largely to specific special cases. Much of the current progress in symbolic data analyses revolves around cluster-related questions. We cover theoretical underpinning, where it exists, in Section 9.

Except for the specific distance measures developed for symbolic clustering analysis (see Sec. 8), herein we do not attempt to review the literature on symbolic similarity and dissimilarity measures. Classical measures include the Minkowski or $L_q$ distance (with $q = 2$ being the Euclidean distance measure), Hamming distance, Mahalanobis distance, and the Gower–Legendre family, plus the large class of dissimilarity measures for probability distributions from classical probability and statistical theory, including the familiar chi-squared measure, Bhattacharyya distance, and Chernoff’s distance, to name but a very few. Symbolic analogs have been proposed for specific cases. Gowda and Diday (1991) first introduced a basic dissimilarity measure for Boolean symbolic objects. Later, Ichino and Yaguchi (1994), using Cartesian operators, developed a symbolic generalized Minkowski distance of order $q \ge 1$. This was extended by DeCarvalho (1994, 1998) to include Boolean symbolic objects constrained by logical dependencies between variables.
Matching of Boolean symbolic objects was considered by Esposito, Malerba, and Semeraro (1991). These measures invoke the concepts of Cartesian join and meet. Some are developed along the lines of distance functions (as described in Sec. 8; see, e.g., Chavent 1998); many are limited to univariate cases. Because such measures vary widely, and the choices of what to use are generally dependent on the task at hand, we do not expand on their details here.

In a different direction, Morineau, Sammartino, Gettler-Summa, and Pardoux (1994), Loustaunau, Pardoux, and Gettler-Summa (1997), and Gettler-Summa and Pardoux (2000) considered three-way tables, and Billard and Diday (2002a) considered regression for some special cases. A more detailed and exhaustive description of symbolic methodologies up to the time of its publication was given by Bock and Diday (2000). Billard and Diday (2002b) also provided expanded coverage with some additional illustrations.

Billard and Diday: Symbolic Data Analysis 475

5. DESCRIPTIVE UNIVARIATE STATISTICS

5.1 Some Preliminaries

Basic descriptive statistics include frequency histograms and sample means and variances. We consider symbolic data analogs of these statistics, for multivalued and interval-valued variables with rules and modal variables. We consider univariate statistics in this section and bivariate statistics in Section 6. For integer-valued and interval-valued variables, we follow the approach adopted by Bertrand and Goupil (2000). DeCarvalho (1994, 1995) and Chouakria, Diday, and Cazes (1999) used different but equivalent methods to find the histogram and interval probabilities [see eq. (16)] for interval-valued variables.

Before describing these quantities, we first introduce the concept of virtual extensions. Recall from Section 3 that the symbolic description of an object $u \in E$ (or, equivalently, $i \in \Omega$) was given by the description vector $d_u = (\xi_{u1}, \dots, \xi_{up})$, $u = 1, \dots, m$; or, more generally, $d \in (D_1, \dots, D_p)$ in the space $\mathcal{D} = \times_{j=1}^{p} D_j$, where in any particular case the realization of $Y_j$ may be an $x_j$ as for classical data or a $\xi_j$ of symbolic data. Individual descriptions, denoted by $x$, are those for which each $D_j$ is a set of single (classical) values only, that is, $x \equiv d = (\{x_1\}, \dots, \{x_p\})$, $x \in \mathcal{X} = \times_{j=1}^{p} \mathcal{Y}_j$.

The calculation of the symbolic frequency histogram involves a count of the number of individual descriptions that match certain implicit logical dependencies in the data. A logical dependency can be represented by a rule $v$,

$$v : [x \in A] \Rightarrow [x \in B], \tag{7}$$

for $A \subseteq \mathcal{D}$, $B \subseteq \mathcal{D}$, and $x \in \mathcal{X}$, and where $v$ is a mapping of $\mathcal{X}$ onto $\{0, 1\}$ with $v(x) = 0$ $(1)$ if the rule is not (is) satisfied by $x$. For example, suppose that $x = (x_1, x_2) = (10, 0)$ is an individual description of $Y_1$ = age and $Y_2$ = number of children for $i \in \Omega$, and suppose that $A = \{\text{age} \le 12\}$ and $B = \{0\}$. Then the rule that an individual under age 12 implies that he or she has no children is logically true, whereas an individual under age 12 with 2 children is not logically true. It follows that an individual description vector $x$ satisfies the rule $v$ if and only if $x \in A \cap B$ or $x \notin A$. This formulation of the logical dependency rule is sufficient for the purposes of establishing basic descriptive statistics. Verde and DeCarvalho (1999) discussed a variety of related rules (e.g., logical equivalence, logical implication, multiple dependencies, hierarchical dependencies). We have the following formal definition.

Definition 3. The virtual description of the description vector $d$ is the set of all individual description vectors $x$ that satisfy all of the (logical dependency) rules $v$ in $\mathcal{X}$, written as, for $V_x$ the set of all rules $v$ operating on $x$,

$$\operatorname{vir}(d) = \{x \in \mathcal{D};\ v(x) = 1, \text{ for all } v \text{ in } V_x\}. \tag{8}$$

In effect, this operation is mathematically cleaning the data by identifying only those values that make logical sense. For small datasets, it may be possible to "correct" the data visually (or some such variation thereof). For very large datasets, this is not always possible; hence a logical dependency rule to do so mathematically/computationally is essential.

5.2 Multivalued Variables

Suppose that we want to find the frequency distribution for the particular multivalued symbolic variable $Y_j \equiv Z$ that takes possible particular values $\xi \in \mathcal{Z}$. These can be categorical values (e.g., types of cancer) or any form of discrete random variable. We define the observed frequency of $\xi$ as

$$O_Z(\xi) = \sum_{u \in E} \pi_Z(\xi; u), \tag{9}$$

where the summation is over $u \in E = \{1, \dots, m\}$ and where

$$\pi_Z(\xi; u) = \frac{|\{x \in \operatorname{vir}(d_u) \mid x_Z = \xi\}|}{|\operatorname{vir}(d_u)|} \tag{10}$$

is the proportion of the individual description vectors in $\operatorname{vir}(d_u)$ such that $x_Z = \xi$, and where $|A|$ is the number of individual descriptions in the space $A$. In the summation in (9), any $u$ for which $\operatorname{vir}(d_u)$ is empty is ignored. We note that this observed frequency is a positive real number and not necessarily an integer as for classical data. In the classical case, $|\operatorname{vir}(d_u)| = 1$, and so the classical frequency is a special case of (10). We can easily show that $\sum_{\xi \in \mathcal{Z}} O_Z(\xi) = m'$, where $m' = (m - m_0)$, with $m_0$ the number of $u$ for which $|\operatorname{vir}(d_u)| = 0$.

For a multivalued symbolic variable $Z$ taking values $\xi \in \mathcal{Z}$, the empirical frequency distribution is the set of pairs $[\xi, O_Z(\xi)]$ for $\xi \in \mathcal{Z}$, and the relative frequency distribution or frequency histogram is the set $[\xi, (m')^{-1} O_Z(\xi)]$. The following definitions follow readily.
The empirical distribution function of $Z$ is given by

$$F_Z(\xi) = \frac{1}{m'} \sum_{\xi_k \le \xi} O_Z(\xi_k). \tag{11}$$

When the possible values $\xi$ for the multivalued symbolic variable $Y_j = Z$ are quantitative, the symbolic sample mean and the symbolic sample variance are defined as

$$\bar{Z} = \frac{1}{m'} \sum_{\xi_k} \xi_k\, O_Z(\xi_k) \quad \text{and} \quad S_Z^2 = \frac{1}{m'} \sum_{\xi_k} O_Z(\xi_k)\,[\xi_k - \bar{Z}]^2. \tag{12}$$

Detailed worked examples have been given by Bock and Diday (2000).

476 Journal of the American Statistical Association, June 2003

Finally, we observe that we can calculate weighted frequencies, weighted means, and weighted variances by replacing (9) with

$$O_Z(\xi) = \sum_{u \in E} w_u\, \pi_Z(\xi; u), \tag{13}$$

with $w_u \ge 0$ and $\sum w_u = 1$. For example, if the objects $u \in E$ are classes $C_u$ comprising individuals from the set of individuals $\Omega = \{1, \dots, n\}$ (i.e., $C_u \subseteq \Omega$), then a possible weight is $w_u = |C_u| / |\Omega|$, corresponding to the relative sizes of the classes $C_u$, $u = 1, \dots, m$. Or, more generally, if we consider each individual description vector $x$ as an elementary unit of the object $u$, then we can use weights, for $u \in E$,

$$w_u = \frac{|\operatorname{vir}(d_u)|}{\sum_{u \in E} |\operatorname{vir}(d_u)|}. \tag{14}$$

For example, suppose that the data of Table 3 represent the outcomes relative to the presence of cancer $Y_1$, with $\mathcal{Y}_1 = \{\text{no} = 0, \text{yes} = 1\}$, and the number of cancer-related treatments $Y_2$, with $\mathcal{Y}_2 = \{0, 1, 2, 3\}$, with a logical dependency rule $v : y_1 \in \{0\} \Rightarrow y_2 = \{0\}$. Suppose that the $u$ of Table 3 represent classes $C_u$ of size $|C_u|$, with $|\Omega| = 1{,}000$, as shown in Table 3. Then, if we use the weights $w_u = |C_u| / |\Omega|$, we can show that rel freq$(Y_1)$: $[(0, .229), (1, .771)]$ and rel freq$(Y_2)$: $[(0, .2540), (1, .0970), (2, .2395), (3, .4095)]$.

5.3 Interval-Valued Variables

Suppose that we are interested in the particular interval-valued variable $Y_j \equiv Z$, and suppose the observation value for object $u$ is the interval $Z(u) = [a_u, b_u]$, for $u \in E = \{1, \dots, m\}$. Assume the individual description vectors $x \in \operatorname{vir}(d_u)$ are uniformly distributed over the interval $Z(u)$.
The individual description vector $x$ takes values globally in $\bigcup_{u \in E} \operatorname{vir}(d_u)$. Further, each object is assumed to be equally likely to be observed with probability $1/m$. Therefore, the empirical distribution function, $F_Z(\xi)$, is the distribution function of a mixture of $m$ uniform distributions $\{Z(u), u = 1, \dots, m\}$. It can then be shown (see Bertrand and Goupil 2000) that the empirical density function of $Z$ is

$$f(\xi) = \frac{1}{m} \sum_{u \in E} \frac{I_u(\xi)}{\|Z(u)\|}, \quad \xi \in \Re, \tag{15}$$

where $I_u(\cdot)$ is the indicator function that $\xi$ is or is not in the interval $Z(u)$ and where $\|Z(u)\|$ is the length of that interval. Note that the summation in (15) is only over those objects $u$ for which $\xi \in Z(u)$.

Table 3. Symbolic Data: Multi-Valued Variables

u   Y1      Y2      vir(d_u)                 |vir(d_u)|   |C_u|
1   {0,1}   {2}     {(1,2)}                  1            128
2   {0,1}   {0,1}   {(0,0), (1,0), (1,1)}    3            75
3   {0,1}   {3}     {(1,3)}                  1            249
4   {0,1}   {2,3}   {(1,2), (1,3)}           2            113
5   {0}     {1}     ∅                        0            2
6   {0}     {0,1}   {(0,0)}                  1            204
7   {1}     {2,3}   {(1,2), (1,3)}           2            87
8   {1}     {1,2}   {(1,1), (1,2)}           2            23
9   {1}     {1,3}   {(1,1), (1,3)}           2            121

To construct a histogram, let $I = [\min_{u \in E} a_u, \max_{u \in E} b_u]$ be the interval that spans all the observed values of $Z$ in $\mathcal{X}$, and suppose that we partition $I$ into $r$ subintervals $I_g = [\xi_{g-1}, \xi_g)$, $g = 1, \dots, r - 1$, and $I_r = [\xi_{r-1}, \xi_r]$. Then the histogram for $Z$ is the graphical representation of the frequency distribution $\{(I_g, p_g), g = 1, \dots, r\}$, where

$$p_g = \frac{1}{m} \sum_{u \in E} \frac{\|Z(u) \cap I_g\|}{\|Z(u)\|}; \tag{16}$$

that is, $p_g$ is the probability that an arbitrary individual description vector $x$ lies in the interval $I_g$. If we want to plot the histogram with height $f_g$ on the interval $I_g$ so that the "area" is $p_g$, then $p_g = (\xi_g - \xi_{g-1}) \times f_g$.
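The histogram probabilities of (16) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `interval_histogram` and its arguments are hypothetical, every interval is assumed to have positive length, and the input is the systolic pressure column of Table 8 (objects u = 1, ..., 10).

```python
def interval_histogram(intervals, edges):
    """Return [p_1, ..., p_r] of eq. (16) for the cells defined by `edges`.

    intervals: list of (a_u, b_u) with b_u > a_u (assumed non-degenerate).
    edges:     increasing cut points xi_0, xi_1, ..., xi_r.
    """
    m = len(intervals)
    probs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        total = 0.0
        for a, b in intervals:
            # Length of Z(u) intersected with I_g, divided by the length of Z(u).
            overlap = max(0.0, min(b, hi) - max(a, lo))
            total += overlap / (b - a)
        probs.append(total / m)
    return probs


# Systolic pressure intervals of Table 8, u = 1, ..., 10.
systolic = [(90, 110), (90, 130), (140, 180), (110, 142), (90, 100),
            (134, 142), (130, 160), (110, 190), (138, 180), (110, 150)]
edges = list(range(90, 200, 10))  # 90, 100, ..., 190 gives r = 10 cells

p = interval_histogram(systolic, edges)
# Heights f_g chosen so that the area of each bar equals p_g.
heights = [pg / (hi - lo) for pg, lo, hi in zip(p, edges[:-1], edges[1:])]
print([round(pg, 5) for pg in p])
```

Under these assumptions the first probability is 1.75/10 = .175, which agrees with the marginal total for the cell [90, 100) in Table 9, and the probabilities sum to 1 because the cells span all of the observed intervals.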
The symbolic sample mean for an interval-valued variable $Z$ and the symbolic sample variance are given by

$$\bar{Z} = \frac{1}{2m} \sum_{u \in E} (b_u + a_u) \quad \text{and} \quad S^2 = \frac{1}{3m} \sum_{u \in E} (b_u^2 + b_u a_u + a_u^2) - \frac{1}{4m^2} \left[ \sum_{u \in E} (b_u + a_u) \right]^2. \tag{17}$$

As for multivalued variables, if an object $u$ has some internal inconsistency relative to a logical rule (i.e., if $u$ is such that $|\operatorname{vir}(d_u)| = 0$), then the summations in (15)–(17) are over only those $u$ for which $|\operatorname{vir}(d_u)| \ne 0$. This has been illustrated with worked examples by Bock and Diday (2000) and Billard and Diday (2002b). Henceforth, we use $m$ and $E$ to refer to those $u$ for which these rules hold.

5.4 Multivalued Modal Variables

Of the many types of modal-valued variables, we briefly consider only two types, one each of multivalued (in this section) and interval-valued variables (in Sec. 5.5).

Suppose that we have data from a categorical variable $Y_j$ taking possible values $\xi_{jk}$, with relative frequencies $p_k$, $k = 1, \dots, s$, with $\sum p_k = 1$. Suppose that we are interested in the particular symbolic random variable $Y_j \equiv Z$, in the presence of the dependency rule $v$. Then we define the observed frequency that $Z = \xi_k$, $k = 1, \dots, s$, as

$$O_Z(\xi_k) = \sum_{u \in E} \pi_Z(\xi_k; u), \tag{18}$$

where the summation is over all $u \in E$ and where

$$\pi_Z(\xi_k; u) = P(Z = \xi_k \mid v(x) = 1, u) = \frac{\sum_x P(x = \xi_k \mid v(x) = 1, u)}{\sum_x \sum_{k=1}^{s} P(x = \xi_k \mid v(x) = 1, u)}, \tag{19}$$

where for each $u$, $P(x = \xi_k \mid v(x) = 1, u)$ is the probability that a particular description $x$ $(\equiv x_j)$ has the value $\xi_k$ and that the logical dependency rule $v$ holds. If for a specific object $u$ there are no description vectors satisfying the rule $v$ [i.e., if the denominator of (19) is 0], then that $u$ is omitted in the summation in (18). Note that $\sum_{k=1}^{s} O_Z(\xi_k) = m$.

Example.
Suppose that households are recorded as having one of the possible central heating fuel types $Y_1$ taking values in $\mathcal{Y}_1$ = {gas, solid fuel, electricity, other}, and suppose that $Y_2$ is an indicator variable taking values in $\mathcal{Y}_2$ = {no, yes} depending on whether a household does not (does) have central heating installed. For illustrative convenience, it is assumed that the values for $Y_1$ are conditional on there being central heating present in the household. Suppose that aggregation by geographical region produced the data of Table 4. The original (classical) dataset consisted of census data from the U.K. Office for National Statistics on 34 demographic–socio-economic variables on individual households in 374 parts of the country. The data given in Table 4 are those obtained for two specific variables after aggregation into 25 regions and were obtained by applying the SODAS software to the original classical dataset. Thus region one, represented by the object $u = 1$, is such that 87% of the households with central heating are fueled by gas, 7% by solids, 5% by electricity, and 1% by some other type of fuel, and that 9% of households did not have central heating and 91% did.

Let us suppose that interest centers only on the variable $Y_1$ (without regard to any other variable) and that there is no rule $v$ to be satisfied. Then it is readily seen that the observed frequencies that $Z = \xi_1$ = gas, $\xi_2$ = solid, $\xi_3$ = electricity, and $\xi_4$ = other fuels are $O_{Y_1}(\xi_1) = 18.05$, $O_{Y_1}(\xi_2) = 1.67$, $O_{Y_1}(\xi_3) = 3.51$, and $O_{Y_1}(\xi_4) = 1.77$. Hence the relative frequencies are rel freq$(Y_1)$: $[(\xi_1, .722), (\xi_2, .067), (\xi_3, .140), (\xi_4, .071)]$. Similarly, we can show that for the $Y_2$ variable, rel freq$(Y_2)$: $[(\text{no}, .156), (\text{yes}, .844)]$.

Suppose, however, that there is the logical rule that a household must have one of $\xi_k$, $k = 1, \dots, 4$, if it has central heating. For convenience, we denote $Y_1 = \xi_0$ if there is no fuel type used (i.e., $Y_1 = \xi_0$ corresponds to $Y_2$ = no).
Then this logical rule can be written as the pair $v = (v_1, v_2)$,

$$v = \begin{cases} v_1 : Y_2 \in \{\text{no}\} \Rightarrow Y_1 \in \{\xi_0\} \\ v_2 : Y_2 \in \{\text{yes}\} \Rightarrow Y_1 \in \{\xi_1, \dots, \xi_4\}. \end{cases} \tag{20}$$

Table 4. Symbolic Data: Multi-Valued Modal Variables

      Y1                                     Y2
u     Gas    Solid   Electricity   Other     No     Yes
1     {.87   .07     .05           .01}      {.09   .91}
2     {.71   .11     .10           .08}      {.12   .88}
3     {.83   .08     .09           .00}      {.23   .77}
4     {.76   .06     .11           .07}      {.19   .81}
5     {.78   .06     .09           .07}      {.12   .88}
6     {.90   .01     .08           .01}      {.22   .78}
7     {.87   .01     .10           .02}      {.22   .78}
8     {.78   .02     .13           .07}      {.13   .87}
9     {.91   .00     .09           .00}      {.24   .76}
10    {.73   .08     .11           .08}      {.14   .86}
11    {.59   .07     .17           .17}      {.10   .90}
12    {.90   .01     .08           .01}      {.19   .71}
13    {.84   .00     .14           .02}      {.09   .91}
14    {.82   .00     .11           .07}      {.17   .83}
15    {.88   .00     .09           .03}      {.12   .88}
16    {.85   .01     .10           .04}      {.09   .91}
17    {.71   .03     .17           .09}      {.16   .84}
18    {.87   .09     .04           .00}      {.13   .87}
19    {.32   .24     .24           .20}      {.25   .75}
20    {.50   .12     .28           .10}      {.14   .86}
21    {.69   .13     .18           .00}      {.12   .88}
22    {.79   .01     .20           .00}      {.21   .79}
23    {.72   .05     .19           .04}      {.07   .93}
24    {.43   .00     .43           .14}      {.28   .72}
25    {.00   .41     .14           .45}      {.09   .91}

Suppose that the relative frequencies for both $Y_1$ and $Y_2$ pertain as in Table 4 (i.e., with the $\xi_0$ values having zero relative frequencies), except that the values for the original object $u = 25$ are replaced by the new relative frequencies for $Y_1$ of (.25, .00, .41, .14, .20) for $(\xi_0, \xi_1, \dots, \xi_4)$. We refer to this as object $u = 25^*$. We consider $Y_1$ and $u = 25^*$. In the presence of the rule $v$, only the particular descriptions $x = (Y_1, Y_2) \in \{(\xi_0, \text{no}), (\xi_1, \text{yes}), (\xi_2, \text{yes}), (\xi_3, \text{yes}), (\xi_4, \text{yes})\}$ can occur. Thus the relative frequencies for each of the possible $\xi_k$ values must be adjusted. Table 5 displays the apparent relative frequencies for each of the possible description vectors before any rules have been invoked. We consider the relative frequency for $\xi_2$ (solid fuels).
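The adjustment for object u = 25* can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `adjust_for_rule` and `rule_v` are hypothetical, and the apparent joint relative frequencies are taken to be products of the two marginal frequencies, which is an assumption that reproduces the entries of Table 5 (e.g., .25 × .09 = .0225).

```python
def adjust_for_rule(freq_y1, freq_y2, is_valid):
    """Rescale the apparent joint frequencies to the rule-consistent pairs.

    Returns (adjusted_y1, adjusted_y2), or None when no pair satisfies the rule.
    """
    # Apparent joint relative frequencies (assumed: products of the marginals).
    joint = {(y1, y2): p1 * p2
             for y1, p1 in freq_y1.items()
             for y2, p2 in freq_y2.items()}
    valid = {pair: p for pair, p in joint.items() if is_valid(*pair)}
    mass = sum(valid.values())
    if mass == 0:
        return None  # such an object is omitted from the summation in (18)
    adj_y1 = {y1: sum(p for (a, _), p in valid.items() if a == y1) / mass
              for y1 in freq_y1}
    adj_y2 = {y2: sum(p for (_, b), p in valid.items() if b == y2) / mass
              for y2 in freq_y2}
    return adj_y1, adj_y2


def rule_v(y1, y2):
    """Rule (20): 'no' pairs only with xi0, 'yes' only with xi1, ..., xi4."""
    return (y1 == "xi0") == (y2 == "no")


# Object u = 25*: relative frequencies for xi0 and the four fuel types.
y1_25 = {"xi0": .25, "gas": .00, "solid": .41, "electricity": .14, "other": .20}
y2_25 = {"no": .09, "yes": .91}

adj_y1, adj_y2 = adjust_for_rule(y1_25, y2_25, rule_v)
print({k: round(val, 4) for k, val in adj_y1.items()})
print({k: round(val, 4) for k, val in adj_y2.items()})
```

With these inputs the rule-consistent pairs carry total mass .7050, so the adjusted value for solid fuels is .3731/.7050 = .5292, and the remaining adjusted values match the "Total | v(x) = 1" margins of Table 5.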
Under the rule $v$, the adjusted relative frequency for object $u = 25^*$ is

$$\pi_{Y_1}(\xi_2; 25^*) = (.3731) / [.0225 + \cdots + .1820] = .5292.$$

Hence the observed frequency for $\xi_2$ over all objects under the rule $v$ is

$$O_{Y_1}(\xi_2) = .0637 + \cdots + .0000 + .5292 = 1.5902.$$

Similarly, we calculate under $v$ that $O_{Y_1}(\xi_0) = 3.8519$, $O_{Y_1}(\xi_1) = 15.2229$, $O_{Y_1}(\xi_3) = 2.9761$, and $O_{Y_1}(\xi_4) = 1.3589$. Likewise, we can find the observed frequencies for $Y_2$. Hence the relative frequencies are rel freq$(Y_1)$: $[(\xi_0, .154), (\xi_1, .609), (\xi_2, .064), (\xi_3, .119), (\xi_4, .054)]$ and rel freq$(Y_2)$: $[(\text{no}, .154), (\text{yes}, .846)]$. Therefore, over all regions, 60.9% of households use gas, 6.4% use solid fuel, 11.9% use electricity, and 5.4% use other fuels to operate their central heating systems, whereas 15.4% do not have central heating.

Table 5. Apparent Relative Frequencies for u = 25*

Y2 \ Y1              ξ0       ξ1       ξ2       ξ3       ξ4       Total   Total | v(x) = 1
No                   .0225    .0000a   .0369a   .0126a   .0180a   .0900   .0319
Yes                  .2275a   .0000    .3731    .1274    .1820    .9100   .9681
Total                .2500    .0000    .4100    .1400    .2000
Total | v(x) = 1     .0319    .0000    .5292    .1807    .2582

* See text.
a Invalidated when rule v is invoked.

5.5 Interval-Valued Modal Variables

Suppose that the random variable of interest, $Z \equiv Y$ for object $u$, $u = 1, \dots, m$, takes values on the intervals $\xi_{uk} = [a_{uk}, b_{uk})$ with probabilities $p_{uk}$, $k = 1, \dots, s_u$. One manifestation of such data is when (alone, or after aggregation) an object $u$ assumes a histogram as its observed value of $Z$. An example of such a dataset is that of Table 6, in which the data (such as might be obtained from Table 1 after aggregation) display the histogram of weight of women by age group (those in their 20s to those in their 80s and older). Thus, for example, we observe that for women in their 30s (object $u = 2$), 40% weigh between 116 and 124 pounds.

Table 6. Histogram for Weight by Age Group

Age (years)   u   Weights
20–29         1   {[70–84), .02; [84–96), .06; [96–108), .24; [108–120), .30; [120–132), .24; [132–144), .06; [144–160], .08}
30–39         2   {[100–108), .02; [108–116), .06; [116–124), .40; [124–132), .24; [132–140), .24; [140–150], .04}
40–49         3   {[110–125), .04; [125–135), .14; [135–145), .20; [145–155), .42; [155–165), .14; [165–175), .04; [175–185], .02}
50–59         4   {[100–114), .04; [114–126), .06; [126–138), .20; [138–150), .26; [150–162), .28; [162–174), .12; [174–190], .04}
60–69         5   {[125–136), .04; [136–144), .14; [144–152), .38; [152–160), .22; [160–168), .16; [168–180], .06}
70–79         6   {[135–144), .04; [144–150), .06; [150–156), .24; [156–162), .26; [162–168), .22; [168–174), .14; [174–180], .04}
80–89         7   {[100–120), .02; [120–135), .12; [135–150), .16; [150–165), .24; [165–180), .32; [180–195), .10; [195–210], .04}

By analogy with ordinary interval-valued variables, it is assumed that within each interval $[a_{uk}, b_{uk})$ each individual description vector $x \in \operatorname{vir}(d_u)$ is uniformly distributed across that interval. Therefore, for each $\xi_k$,

$$P\{x \le \xi_k \mid x \in \operatorname{vir}(d_u)\} = \begin{cases} 0, & \xi_k < a_{uk}, \\ \dfrac{\xi_k - a_{uk}}{b_{uk} - a_{uk}}, & a_{uk} \le \xi_k < b_{uk}, \\ 1, & \xi_k \ge b_{uk}. \end{cases} \tag{21}$$

Table 7. Histogram for Weights (over all ages)

g    I_g          Observed frequency   Relative frequency   (d)
1    [60–75)      .00714               .0010
2    [75–90)      .04286               .0061
3    [90–105)     .24179               .0345                .0215
4    [105–120)    .72488               .1036                .1133
5    [120–135)    1.27470              .1821                .1890
6    [135–150)    1.67364              .2391                .2411
7    [150–165)    1.97500              .2821                .2835
8    [165–180)    .88500               .1264                .1264
9    [180–195)    .13500               .0193                .0193
10   [195–210]    .04000               .0057                .0057

Note: Column (d) gives the relative frequencies on the r = 8 intervals [100–105), [105–120), ..., [195–210] discussed in the text; its first entry refers to [100–105).

A histogram of all of the observed histograms can be constructed as follows. Let $I = [\min_{k, u \in E} a_{uk}, \max_{k, u \in E} b_{uk}]$ be the interval that spans all of the observed values of $Z$ in $\mathcal{X}$, and let $I$ be partitioned into $r$ subintervals, $I_g = [\xi_{g-1}, \xi_g)$, $g = 1, \dots, r - 1$, and $I_r = [\xi_{r-1}, \xi_r]$.
Then the observed frequency for the interval $I_g$ is

$$O_Z(g) = \sum_{u \in E} \pi_Z(g; u), \tag{22}$$

where

$$\pi_Z(g; u) = \sum_{k \in Z(g)} \frac{\|Z(k; u) \cap I_g\|}{\|Z(k; u)\|}\, p_{uk}, \tag{23}$$

where the set $Z(g)$ represents all those intervals $Z(k; u) \equiv [a_{uk}, b_{uk})$ that overlap with $I_g$, for a given $u$. Thus each term in the summation in (23) represents that portion of the interval $Z(k; u)$ that is spanned by $I_g$ and hence that proportion of its observed relative frequency $(p_{uk})$ that pertains to the overall histogram interval $I_g$. Note that when $\|Z(k; u) \cap I_g\| / \|Z(k; u)\| = \|0\| / \|0\|$ (as, e.g., when an interval is a classical point value), this term takes the value 1; Section 6.3 provides an example. It follows that $\sum_{g=1}^{r} O_Z(g) = m$. Hence the relative frequency for the interval $I_g$ is $p_g = O_Z(g) / m$. The set of values $\{(p_g, I_g), g = 1, \dots, r\}$ together represents the relative frequency histogram for the combined set of observed histograms.

The empirical density function can be derived from

$$f(\xi) = \frac{1}{m} \sum_{u \in E} \sum_{k=1}^{s_u} \frac{I_{uk}(\xi)}{\|Z(k; u)\|}\, p_{uk}, \quad \xi \in \Re, \tag{24}$$

where $I_{uk}(\cdot)$ is the indicator function that $\xi$ is in the interval $Z(k; u)$. The symbolic sample mean becomes

$$\bar{Z} = \frac{1}{2m} \sum_{u \in E} \left\{ \sum_{k=1}^{s_u} (b_{uk} + a_{uk})\, p_{uk} \right\}, \tag{25}$$

and the symbolic sample variance is

$$S^2 = \frac{1}{3m} \sum_{u \in E} \left\{ \sum_{k=1}^{s_u} (b_{uk}^2 + b_{uk} a_{uk} + a_{uk}^2)\, p_{uk} \right\} - \frac{1}{4m^2} \left[ \sum_{u \in E} \sum_{k=1}^{s_u} (b_{uk} + a_{uk})\, p_{uk} \right]^2. \tag{26}$$

To illustrate, we take the data of Table 6 and construct the histogram on the $r = 10$ intervals [60–75), [75–90), ..., [195–210]. Take the $g = 5$th interval $I_5$ = [120–135). Then, from Table 6, it follows from (23) that

$$\pi(5; 1) = \left( \frac{132 - 120}{132 - 120} \right)(.24) + \left( \frac{135 - 132}{144 - 132} \right)(.06) = .255$$

and, likewise, $\pi(5; 2) = .530$, $\pi(5; 3) = .153$, $\pi(5; 4) = .180$, $\pi(5; 5) = .036$, $\pi(5; 6) = .000$, and $\pi(5; 7) = .120$. Hence, from (22), the "observed relative frequency" for the interval $I_5$ is $O(5) = 1.2747$, and the relative frequency is $p_5 = .1821$. The complete set of histogram values, $\{p_g, I_g; g = 1, \dots, 10\}$, is given in Table 7. Likewise, from (25) and (26), we have $\bar{Z} = 143.9$ and $S^2 = 447.5$.

It is implicit in the formulas (21)–(26) that the rules $v$ hold. Suppose that the data of Table 6 represented (instead of weights) costs (in $) involved for a certain procedure at seven different hospitals. Suppose further that there is a minimum charge of $100. This translates into a rule $v : \{Y < 100\} \Rightarrow \{p_k = 0\}$. The data for all hospitals (objects) except the first one satisfy this rule. However, some values for the first hospital do not. Hence, an adjustment to the overall relative frequencies is needed to accommodate this rule. Let us assume that the histogram now spans $r = 8$ intervals [100–105), [105–120), ..., [195–210]. The relative frequencies for these data after this rule is taken into account are given in Table 7, column (d).

6. DESCRIPTIVE BIVARIATE STATISTICS

Many of the principles developed for the univariate case can be expanded to a general $p$-variate case, $p > 1$. In particular, this permits derivation of dependence measures. Thus, for example, calculation of the covariance matrix aids the development of methodologies in, for example, principal component analysis, discriminant analysis, and cluster analysis. Recently, Billard and Diday (2000, 2002a) extended these methods to fit multiple linear regression models to interval-valued data and histogram data. Different but related ideas can be extended to enable the derivation of resemblance measures such as Euclidean distances, Minkowski or $L_q$ distances, and Mahalanobis distances.

6.1 Multivalued Variables

Some Definitions. We first consider multivalued variables. Analogously to the definition of the observed frequency of specific values for a single variable given in (9), we define the observed frequency that $(Z_1 = \xi_1, Z_2 = \xi_2)$ by

$$O_{Z_1, Z_2}(\xi_1, \xi_2) = \sum_{u \in E} \pi_{Z_1, Z_2}(\xi_1, \xi_2; u), \tag{27}$$

where

$$\pi_{Z_1, Z_2}(\xi_1, \xi_2; u) = \frac{|\{x \in \operatorname{vir}(d_u) \mid x_{Z_1} = \xi_1, x_{Z_2} = \xi_2\}|}{|\operatorname{vir}(d_u)|} \tag{28}$$

is the proportion of the individual description vectors $x = (x_1, x_2)$ in $\operatorname{vir}(d_u)$ for which $(Z_1 = \xi_1, Z_2 = \xi_2)$. Note that $\pi(\cdot)$ is a real number on $\Re$, in contrast to its being a positive integer for classical data. We can show that $\sum_{\xi_1 \in \mathcal{Z}_1, \xi_2 \in \mathcal{Z}_2} O_{Z_1, Z_2}(\xi_1, \xi_2) = m$.

We now define the empirical joint frequency distribution of $Z_1$ and $Z_2$ as the set of pairs $[\xi, O_{Z_1 Z_2}(\xi)]$, where $\xi = (\xi_1, \xi_2)$ for $\xi \in \mathcal{Z} = \mathcal{Z}_1 \times \mathcal{Z}_2$, and we define the empirical joint histogram of $Z_1$ and $Z_2$ as the graphical set $[\xi, \frac{1}{m} O_{Z_1 Z_2}(\xi)]$.

When $Z_1$ and $Z_2$ are quantitative symbolic variables, the symbolic sample covariance function between the symbolic quantitative multivalued variables $Z_1$ and $Z_2$ is given by

$$S_{Z_1, Z_2} \equiv S_{12} = \frac{1}{m} \left[ \sum_{\xi_1, \xi_2} (\xi_1 \times \xi_2)\, O_{Z_1, Z_2}(\xi_1, \xi_2) \right] - (\bar{Z}_1)(\bar{Z}_2), \tag{29}$$

where $\bar{Z}_i$, $i = 1, 2$, are the symbolic sample means of the univariate variables $Z_i$, $i = 1, 2$, defined in (12). The symbolic sample correlation function between the multivalued symbolic variables $Z_1$ and $Z_2$ is given by

$$r(Z_1, Z_2) = S_{Z_1, Z_2} \Big/ \sqrt{(S_{Z_1}^2)(S_{Z_2}^2)}, \tag{30}$$

where $S_{Z_i}^2$, $i = 1, 2$, are the respective univariate symbolic sample variances defined in (12).

An Example. We illustrate these statistics with the cancer data given in Table 3. We have that $O_{Z_1, Z_2}(\xi_1 = 0, \xi_2 = 0) \equiv O(0, 0)$, say, is

$$O(0, 0) = \frac{0}{1} + \frac{1}{3} + \frac{0}{1} + \frac{0}{2} + \frac{1}{1} + \frac{0}{2} + \frac{0}{2} + \frac{0}{2} = \frac{4}{3},$$

where, for example, $\pi(0, 0; 1) = 0/1$, because the individual vector (0, 0) occurs no times out of a total of $|\operatorname{vir}(d_1)| = 1$ vectors in $\operatorname{vir}(d_1)$. Likewise, $\pi(0, 0; 2) = 1/3$, because (0, 0) is one of three individual vectors in $\operatorname{vir}(d_2)$, and so on, and where the summation does not include the $u = 5$ term because $|\operatorname{vir}(d_5)| = 0$. Similarly, we can show that $O(0, 1) = 0$, $O(0, 2) = 0$, $O(0, 3) = 0$, $O(1, 0) = 1/3$, $O(1, 1) = 4/3$, $O(1, 2) = 5/2$, and $O(1, 3) = 5/2$. Hence the symbolic sample covariance is, from (29), $S_{12} = .288$. Therefore, the symbolic correlation function is, from (30), $r(Z_1, Z_2) = .748$.

6.2 Interval-Valued Variables

Some Definitions. For interval-valued variables, we suppose that the specific variables of interest, $Z_1$ and $Z_2$, have observations on the rectangle $Z(u) = Z_1(u) \times Z_2(u) = ([a_{1u}, b_{1u}], [a_{2u}, b_{2u}])$ for each $u \in E$. As before, we make the intuitive choice of assuming individual vectors $x \in \operatorname{vir}(d_u)$ are each uniformly distributed over the respective intervals $Z_1(u)$ and $Z_2(u)$. Therefore, the joint distribution of $(Z_1, Z_2)$ is a copula $C(z_1, z_2)$ (see Nelsen 1999; Schweizer and Sklar 1983). An expanded discussion of the role of copulas in symbolic data will be presented elsewhere.

We define the empirical joint density function for $(Z_1, Z_2)$ as

$$f(\xi_1, \xi_2) = \frac{1}{m} \sum_{u \in E} \frac{I_u(\xi_1, \xi_2)}{\|Z(u)\|}, \tag{31}$$

where $I_u(\xi_1, \xi_2)$ is the indicator function that $(\xi_1, \xi_2)$ is or is not in the rectangle $Z(u)$ and where $\|Z(u)\|$ is the area of this rectangle.

Analogously with (16), we can find the joint histogram for $Z_1$ and $Z_2$ by graphically plotting $\{R_{g_1 g_2}, p_{g_1 g_2}\}$ over the rectangles $R_{g_1 g_2} = \{[\xi_{1, g_1 - 1}, \xi_{1 g_1}) \times [\xi_{2, g_2 - 1}, \xi_{2 g_2})\}$, $g_1 = 1, \dots, r_1$, $g_2 = 1, \dots, r_2$, where

$$p_{g_1 g_2} = \frac{1}{m} \sum_{u \in E} \frac{\|Z(u) \cap R_{g_1 g_2}\|}{\|Z(u)\|}; \tag{32}$$

that is, $p_{g_1 g_2}$ is the probability that an arbitrary individual description vector lies in the rectangle $R_{g_1 g_2}$.
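A small Python sketch may help make (32) concrete. Because both the object rectangle and the histogram cell are axis-aligned, the area ratio in (32) factors into a product of two one-dimensional overlap ratios, and the code relies on that observation. The function names are hypothetical, every interval is assumed to have positive length, and the input is taken from the systolic and diastolic pressure intervals of Table 8 for u = 1, ..., 10.

```python
def overlap_fraction(interval, cell):
    """Share of `interval` = (a, b) that falls inside `cell` = (lo, hi)."""
    a, b = interval
    lo, hi = cell
    return max(0.0, min(b, hi) - max(a, lo)) / (b - a)


def joint_cell_probability(rectangles, cell1, cell2):
    """p_{g1 g2} of eq. (32) for the histogram rectangle cell1 x cell2."""
    m = len(rectangles)
    total = 0.0
    for z1, z2 in rectangles:
        # Area ratio of two axis-aligned rectangles = product of side ratios.
        total += overlap_fraction(z1, cell1) * overlap_fraction(z2, cell2)
    return total / m


# (systolic, diastolic) pressure intervals of Table 8, u = 1, ..., 10.
blood_pressure = [
    ((90, 110), (50, 70)), ((90, 130), (70, 90)), ((140, 180), (90, 100)),
    ((110, 142), (80, 108)), ((90, 100), (50, 70)), ((134, 142), (80, 110)),
    ((130, 160), (76, 90)), ((110, 190), (70, 110)), ((138, 180), (90, 110)),
    ((110, 150), (78, 100)),
]

p_64 = joint_cell_probability(blood_pressure, (140, 150), (80, 90))
print(round(p_64, 5))
```

For the cell [140, 150) × [80, 90) only objects u = 4, 6, 7, 8, and 10 contribute, and the result rounds to .04886. Looping the same function over all cells would give a full table of joint probabilities.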
Then, if the "volume" on the rectangle $R_{g_1 g_2}$ of the joint histogram represents the probability, its height would be

$$f_{g_1 g_2} = [(\xi_{1 g_1} - \xi_{1, g_1 - 1})(\xi_{2 g_2} - \xi_{2, g_2 - 1})]^{-1}\, p_{g_1 g_2}. \tag{33}$$

The symbolic sample covariance function is obtained analogously to the derivation of the mean and variance for a univariate interval-valued variable, that is,

$$\begin{aligned} \operatorname{cov}(Z_1, Z_2) = S_{Z_1 Z_2} &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (\xi_1 - \bar{Z}_1)(\xi_2 - \bar{Z}_2)\, f(\xi_1, \xi_2)\, d\xi_1\, d\xi_2 \\ &= \frac{1}{m} \sum_{u \in E} \frac{1}{(b_{1u} - a_{1u})(b_{2u} - a_{2u})} \iint_{(\xi_1, \xi_2) \in Z(u)} \xi_1 \xi_2\, d\xi_1\, d\xi_2 - \bar{Z}_1 \bar{Z}_2 \\ &= \frac{1}{4m} \sum_{u \in E} (b_{1u} + a_{1u})(b_{2u} + a_{2u}) - \frac{1}{4m^2} \left[ \sum_{u \in E} (b_{1u} + a_{1u}) \right] \left[ \sum_{u \in E} (b_{2u} + a_{2u}) \right]. \end{aligned} \tag{34}$$

Hence the symbolic sample correlation function is defined by (30), where $S_{Z_1 Z_2}$ is given in (34) and $S_{Z_i}^2$, $i = 1, 2$, are given in (17). As before, the summation in (31) is only over those $u$ for which $\operatorname{vir}(d_u)$ is not empty.

An Example. We illustrate these concepts with the data of Table 8, from Raju (1997). In particular, we consider the systolic and diastolic blood pressure levels, $Y_2$ and $Y_3$. Suppose further that there is a logical rule specifying that the diastolic blood pressure must be less than the systolic blood pressure, that is, $Y_3(u) \le Y_2(u)$. The observations $u = 1, \dots, 10$ all satisfy this rule. However, had there been another observation represented in the table by the $u = 11$ data point, then the rule $v$ would be violated, because $Y_3(11) > Y_2(11)$. Therefore, $|\operatorname{vir}(d_{11})| = 0$, and so subsequent calculations would not include this observation.

Table 8. Symbolic Data: Interval-Valued Variables

u    Y1, pulse rate (bpm)   Y2, systolic pressure (mm Hg)   Y3, diastolic pressure (mm Hg)
1    [44–68]                [90–110]                        [50–70]
2    [60–72]                [90–130]                        [70–90]
3    [56–90]                [140–180]                       [90–100]
4    [70–112]               [110–142]                       [80–108]
5    [54–72]                [90–100]                        [50–70]
6    [70–100]               [134–142]                       [80–110]
7    [72–100]               [130–160]                       [76–90]
8    [76–98]                [110–190]                       [70–110]
9    [86–96]                [138–180]                       [90–110]
10   [86–100]               [110–150]                       [78–100]
11   [63–75]                [60–100]                        [140–150]

Suppose that we want to find the joint histogram for $Y_2$ and $Y_3$. We want to construct the histogram on the rectangles $R_{g_2 g_3} = \{[\xi_{2, g_2 - 1}, \xi_{2, g_2}) \times [\xi_{3, g_3 - 1}, \xi_{3, g_3})\}$, $g_2 = 1, \dots, 10$, $g_3 = 1, \dots, 6$, where the sides of these rectangles are the intervals given in Table 9. Then, from (32), we can calculate the probability $p_{g_2 g_3}$ that an arbitrary description vector lies in the rectangle $R_{g_2 g_3}$. For example, for $g_2 = 6$, $g_3 = 4$,

$$p_{64} = P\{x \in [140, 150) \times [80, 90)\} = \frac{1}{10} \left\{ 0 + 0 + 0 + \frac{2}{32} \cdot \frac{10}{28} + 0 + \frac{2}{8} \cdot \frac{10}{30} + \frac{10}{30} \cdot \frac{10}{14} + \frac{10}{80} \cdot \frac{10}{40} + 0 + \frac{10}{40} \cdot \frac{10}{22} \right\} = .04886.$$

Table 9 provides all of the probabilities $p_{g_2 g_3}$. The table also gives the marginal totals that correspond to the probabilities $p_{g_j}$, $j = 2, 3$, of the univariate case, obtained from (16). A plot of the joint histogram is given in Figure 1, where the heights $f_{g_2 g_3}$ are calculated from (33).

Table 9. Bivariate Histogram for Blood Pressure Data (entries are $p_{g_2 g_3} \times 10^2$)

                     g3:   1          2          3          4          5           6
g2   I_g2 \ I_g3           [50, 60)   [60, 70)   [70, 80)   [80, 90)   [90, 100)   [100, 110]   p_g2 × 10^2
1    [90, 100)             7.500      7.500      1.250      1.250      0           0            17.500
2    [100, 110)            2.500      2.500      1.250      1.250      0           0            7.500
3    [110, 120)            0          0          1.790      3.815      2.565       1.205        9.375
4    [120, 130)            0          0          1.790      3.815      2.565       1.205        9.375
5    [130, 140)            0          0          1.492      7.446      5.303       3.943        18.185
6    [140, 150)            0          0          1.492      4.886      6.196       2.515        15.089
7    [150, 160)            0          0          1.265      2.693      4.003       1.503        9.464
8    [160, 170)            0          0          .313       .313       4.003       1.503        6.131
9    [170, 180)            0          0          .313       .313       4.003       1.503        6.131
10   [180, 190]            0          0          .313       .313       .313        .313         1.250
     p_g3 × 10^2           10.000     10.000     11.266     26.093     28.950      13.690

Figure 1. Bivariate Histogram: Blood Pressures.

6.3 Modal-Valued Variables

Some Definitions. We develop corresponding results for quantitative modal-valued variables where the data are recorded as histograms. Suppose that interest centers on the two specific variables $Z_1$ and $Z_2$. Suppose that for each object $u$, each variable $Z_j(u)$ takes values on the subintervals $\xi_{juk} = [a_{juk}, b_{juk})$ with relative frequency $p_{juk}$, $k = 1, \dots, s_{ju}$, $j = 1, 2$, and $u = 1, \dots, m$, with $\sum_{k=1}^{s_{ju}} p_{juk} = 1$. When $s_{ju} = 1$ and $p_{juk} = 1$ for all $j$, $u$, and $k$ values, the data are interval-valued realizations.

Then, by extending the derivations of Section 6.2, we can show that the empirical joint density function for $(Z_1, Z_2)$ at the value $(\xi_1, \xi_2)$ is

$$f(\xi_1, \xi_2) = \frac{1}{m} \sum_{u \in E} \left\{ \sum_{k_1 = 1}^{s_{1u}} \sum_{k_2 = 1}^{s_{2u}} \frac{p_{1uk_1}\, p_{2uk_2}\, I_{uk_1 k_2}(\xi_1, \xi_2)}{\|Z_{k_1 k_2}(u)\|} \right\}, \tag{35}$$

where $\|Z_{k_1 k_2}(u)\|$ is the area of the rectangle $Z_{k_1 k_2}(u) = [a_{1uk_1}, b_{1uk_1}) \times [a_{2uk_2}, b_{2uk_2})$ and $I_{uk_1 k_2}(\xi_1, \xi_2)$ is the indicator variable that the point $(\xi_1, \xi_2)$ is (is not) in the rectangle $Z_{k_1 k_2}(u)$.

Analogously with (23) and (32), the joint histogram for $Z_1$ and $Z_2$ is found by plotting $\{R_{g_1 g_2}, p_{g_1 g_2}\}$ over the rectangles $R_{g_1 g_2} = \{[\xi_{1, g_1 - 1}, \xi_{1 g_1}) \times [\xi_{2, g_2 - 1}, \xi_{2 g_2})\}$, $g_1 = 1, \dots, r_1$, $g_2 = 1, \dots, r_2$, with

$$p_{g_1 g_2} = \frac{1}{m} \sum_{u \in E} \sum_{k_1 \in Z(g_1)} \sum_{k_2 \in Z(g_2)} \frac{\|Z(k_1, k_2; u) \cap R_{g_1 g_2}\|}{\|Z(k_1, k_2; u)\|}\, p_{1uk_1}\, p_{2uk_2}, \tag{36}$$

where $Z(g_j)$ represents all of the intervals $[a_{juk_j}, b_{juk_j})$, $j = 1, 2$, of the rectangles $Z(k_1, k_2; u)$ that overlap with the rectangle $R_{g_1 g_2}$ for each given $u$ value.

If we assume that $(Z_1, Z_2)$ values are uniformly distributed across the rectangles $\{\xi_{1uk} \times \xi_{2uk}\}$, we can then derive the symbolic sample covariance function as

$$\begin{aligned} \operatorname{cov}(Z_1, Z_2) ={}& \frac{1}{4m} \sum_{u \in E} \left\{ \sum_{k_1 = 1}^{s_{1u}} \sum_{k_2 = 1}^{s_{2u}} p_{1uk_1}\, p_{2uk_2}\, (b_{1uk_1} + a_{1uk_1})(b_{2uk_2} + a_{2uk_2}) \right\} \\ &- \frac{1}{4m^2} \left[ \sum_{u \in E} \left\{ \sum_{k_1 = 1}^{s_{1u}} p_{1uk_1} (b_{1uk_1} + a_{1uk_1}) \right\} \right] \times \left[ \sum_{u \in E} \left\{ \sum_{k_2 = 1}^{s_{2u}} p_{2uk_2} (b_{2uk_2} + a_{2uk_2}) \right\} \right]. \end{aligned} \tag{37}$$

The symbolic mean and symbolic variance for each of $Z_1$ and $Z_2$ can be obtained from (25) and (26).

An Example.
The data of Table 10 record histogram hematocrit $(Y_1)$ values and hemoglobin $(Y_2)$ values for each of $m = 5$ objects. Suppose that we want to calculate the joint histogram function over the rectangles $R_{g_1 g_2} = \{[30, 35) \times [11, 12), \dots, [50, 55) \times [16, 17]\}$, that is, $r_1 = 5$ and $r_2 = 6$. The terms $p_{g_1 g_2}$ are calculated from (36); for example,

$$p_{21} = P\{x \in [35, 40) \times [11, 12)\} = \frac{1}{5} \left\{ \left[ \frac{2.8 \times .5}{4.6 \times .6} (.7)(.4) + 0 + \frac{1.8 \times .5}{1.8 \times .6} (.3)(.4) + 0 \right] + 0 + 0 + 0 + 0 \right\} = .04841.$$

The other terms, shown in Table 11, are calculated in the same way.

Table 10. Bivariate Histograms: Hematocrit and Hemoglobin Data

u   Y1, Hematocrit                         Y2, Hemoglobin
1   {[33.2–37.8), .7; [37.8–39.6], .3}     {[11.5–12.1), .4; [12.1–12.8], .6}
2   {[36.3–38.1), .2; [38.1–47.4], .8}     {[12.3–14.2), .5; [14.2–14.9), .4; [14.9–15.2], .1}
3   {[41.5–44.9), .5; [44.9–48.8], .5}     {[14.3–14.8), .5; [14.8–15.5], .4; 15.5, .1}
4   {[44.4–47.1), .4; [47.1–52.5], .6}     {[15.3–15.7), .3; [15.7–16.4), .5; [16.4–16.7], .2}
5   {[39.3–51.5], 1.0}                     {[13.7–14.5), .5; [14.5–15.2), .3; [15.2–16.5], .2}

7. PRINCIPAL COMPONENTS ANALYSIS

A principal components analysis is designed to reduce $p$-dimensional observations into $s$-dimensional (where usually $s < p$) components. Cazes, Chouakria, Diday, and Schektman (1997), Chouakria (1998), and Chouakria et al. (1999) have developed a method of conducting principal components analysis on symbolic data for which each symbolic variable $Y_j$, $j = 1, \dots, p$, takes interval values $\xi_{uj} = [a_{uj}, b_{uj}]$, say, for each object $u = 1, \dots, m$, and where each object represents $n_u$ individuals; for simplicity, we take $n_u = 1$.

Each symbolic data point is represented by a hyperrectangle with $2^p$ vertices. Thus each hyperrectangle can be represented by a $2^p \times p$ matrix $M_u$ with each row containing the coordinate values of a vertex $R_k$, $k = 1, \dots, 2^p$, of the hyperrectangle. Then an $(m \cdot 2^p \times p)$ matrix $M$ is constructed of the $\{M_u, u = 1, \dots, m\}$, that is,

$$M = (M_1, \dots, M_m)' = \left( \begin{bmatrix} a_{11} & \cdots & a_{1p} \\ & \vdots & \\ b_{11} & \cdots & b_{1p} \end{bmatrix}, \dots, \begin{bmatrix} a_{m1} & \cdots & a_{mp} \\ & \vdots & \\ b_{m1} & \cdots & b_{mp} \end{bmatrix} \right)'.$$

For example, if $p = 2$, then the data $\xi_u = ([a_{u1}, b_{u1}], [a_{u2}, b_{u2}])$ are transformed to the $2^p \times p = 2^2 \times 2$ matrix

$$M_u = \begin{bmatrix} a_{u1} & a_{u1} & b_{u1} & b_{u1} \\ a_{u2} & b_{u2} & a_{u2} & b_{u2} \end{bmatrix}',$$

and likewise for $M$.

The matrix $M$ is now treated as though it represents classical $p$-variate data for $n = m \cdot 2^p$ individuals. Chouakria (1998) has shown that the basic theory for a classical analysis carries through; hence a classical principal components analysis can be applied. If observations have weights $p_i \ge 0$, then in this transformation of the symbolic to the classical matrix $M$, each vertex of the hyperrectangle now has the same weight $(p_i 2^{-p})$.

Let $Y_1^*, \dots, Y_s^*$, $s \le p$, denote the first $s$ "numerical" principal components with associated eigenvalues $\lambda_1 \ge \cdots \ge \lambda_s \ge 0$ that result from this analysis. We then construct the interval principal components $Y_1^I, \dots, Y_s^I$ as follows. Let $L_u$ be the set of row indices in $M$ identifying the vertices of the hyperrectangle $R_u$; that is, $L_u$ represents the rows of $M_u$ describing the symbolic data $\xi_{u \cdot}$. For each $k = 1, \dots, 2^p$, in $L_u$, let $y_{kv}$ be the value of the numerical principal component $Y_v^*$, $v = 1, \dots, s$, for that row $k$.
Then the interval principal component $Y_v^I$ for object $u$ is given by $y_{uv} = [y^a_{uv}, y^b_{uv}]$, where $y^a_{uv} = \min_{k \in L_u}(y_{kv})$ and $y^b_{uv} = \max_{k \in L_u}(y_{kv})$. Then plotting gives us the data represented as hyperrectangles of principal components in $s$-dimensional space. Thus, taking $v = 1, 2$, we have each object $u = 1, \ldots, m$ represented by a rectangle with axes corresponding to the first two principal components.

Table 11. Bivariate Joint Histogram: Hematocrit and Hemoglobin Data (entries are $p_{g_1 g_2} \times 10^2$)

g1   I_g1        g2 = 1     2          3          4          5          6          p_g1 × 10²
                 [11, 12)   [12, 13)   [13, 14)   [14, 15)   [15, 16)   [16, 17]
1    [30, 35)    1.826      3.652      0          0          0          0          5.478
2    [35, 40)    4.841      11.020     2.128      4.139      .724       .088       22.940
3    [40, 45)    0          1.585      3.801      14.799     7.155      1.494      28.834
4    [45, 50)    0          .761       2.623      12.310     12.259     6.783      34.736
5    [50, 55)    0          0          .461       1.295      3.371      2.888      8.015
p_g2 × 10²       6.667      17.018     9.013      32.543     23.509     11.253

As an alternative to using the vertices of the hyperrectangles $R_u$ as before, the centers of the hyperrectangles could be used. In this case, each object data point $\xi_u = ([a_{u1}, b_{u1}], \ldots, [a_{up}, b_{up}])$ is transformed to $x^c_u = (x^c_{u1}, \ldots, x^c_{up})$, $u = 1, \ldots, m$, where $x^c_{uj} = (a_{uj} + b_{uj})/2$, $j = 1, \ldots, p$. Thus the symbolic data matrix $X$ has been transformed to a classical $m \times p$ matrix $X^c$ with classical variables $Y_1^c, \ldots, Y_p^c$, say. Then the classical principal components analysis is applied to the classical data $X^c$. The $v$th center principal component for object $u$ is, for $v = 1, \ldots, s$,

$$
y^c_{uv} = \sum_{j=1}^{p} (x^c_{uj} - \bar{x}^c_j) w_{jv},
$$

where the mean of the values for $Y_j^c$ is $\bar{x}^c_j = \frac{1}{m} \sum_{u=1}^{m} x^c_{uj}$ and where $w_v = (w_{1v}, \ldots, w_{pv})$ is the $v$th eigenvector. Because each center coordinate $x^c_{uj}$ lies in the interval $[a_{uj}, b_{uj}]$ and because the principal components are linear functions of $x^c_{uj}$, we can obtain the interval principal components as $[y^{ca}_{uv}, y^{cb}_{uv}]$, where

$$
y^{ca}_{uv} = \sum_{j=1}^{p} \min_{a_{uj} \le x_{uj} \le b_{uj}} (x_{uj} - \bar{x}^c_j) w_{jv}, \qquad
y^{cb}_{uv} = \sum_{j=1}^{p} \max_{a_{uj} \le x_{uj} \le b_{uj}} (x_{uj} - \bar{x}^c_j) w_{jv}.
$$

The methodology for the centers method is illustrated with the interval data of Table 12, extracted from U.S. Census data. There are $m = 9$ objects representing classes/regions of the United States described by $p = 6$ symbolic variables; specifically, $Y_1$ = median household income, $Y_2$ = average income per household member, $Y_3$ = percentage of people covered by health insurance, $Y_4$ = percentage of population over age 25 who have completed high school, $Y_5$ = percentage of population over 25 years who have earned at least a bachelor's degree, and $Y_6$ = travel time (in minutes) to work. For simplicity, we restrict the analysis to the first three variables $Y_j$, $j = 1, 2, 3$, only. The $s = 3$ ($= p$) eigenvalues (together with the percentage of the total sum of squares contributed by the respective principal components) are $\lambda_1 = 1.909$ (63.64%), $\lambda_2 = .914$ (30.46%), and $\lambda_3 = .177$ (5.91%). Thus the first and second components together explain 94.10% of the variation. The results are plotted in Figure 2, and the distinct clusters are evident, with the SouthE and SouthW regions forming one cluster, the Mountain and SouthAtl regions forming a second cluster, New England standing on its own as a third grouping, and these latter two clusters spanned by a cluster consisting of the remaining four regions (MidWestE, MidAtl, MidWestW, and the Pacific). Executing the same analysis but with the vertices approach gives similar results, as shown by Bock and Diday (2000).
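The construction of an interval principal component from a hyperrectangle can be sketched in a few lines. The sketch below is illustrative only: it takes three regions and the first three variables of Table 12, and it assumes that a weight vector `w` is already available from some classical principal components routine. The vector used here is a hypothetical unit vector chosen for the illustration, not an eigenvector from the analysis reported above, and all function names are assumptions of this sketch. It projects every one of the $2^p$ vertices and keeps the smallest and largest score, and it also evaluates the coordinate-wise minimum and maximum that appear in the centers formulas; because the score is linear in each coordinate, the two computations give the same interval for a common centering and a common `w`.

```python
from itertools import product

# First three variables (Y1, Y2, Y3) of Table 12 for three of the nine regions.
boxes = {
    "NewEng":  [(45.72, 47.74), (23.58, 24.84), (88.50, 94.10)],
    "SouthE":  [(33.22, 35.14), (18.74, 19.89), (86.50, 89.70)],
    "Pacific": [(45.22, 46.56), (22.38, 23.21), (76.20, 89.90)],
}

# Hypothetical unit-length weight vector; NOT an eigenvector from the article.
w = (0.6, 0.0, -0.8)


def vertices(box):
    """All 2^p vertices of a hyperrectangle [(a_1, b_1), ..., (a_p, b_p)]."""
    return list(product(*box))


def center(box):
    """Center x^c_u of the hyperrectangle: midpoint of every interval."""
    return [(a + b) / 2 for a, b in box]


def column_means(rows):
    """Mean of each coordinate over a list of points."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def score(x, mean, w):
    """Linear score sum_j (x_j - mean_j) * w_j of a single point."""
    return sum((xj - mj) * wj for xj, mj, wj in zip(x, mean, w))


def interval_from_vertices(box, mean, w):
    """Vertices approach: min and max score over the 2^p vertices."""
    scores = [score(v, mean, w) for v in vertices(box)]
    return min(scores), max(scores)


def interval_from_bounds(box, mean, w):
    """Coordinate-wise min and max of (x_j - mean_j) * w_j over [a_j, b_j]."""
    terms = [((a - mj) * wj, (b - mj) * wj) for (a, b), mj, wj in zip(box, mean, w)]
    return sum(min(t) for t in terms), sum(max(t) for t in terms)


mean = column_means([center(box) for box in boxes.values()])
for name, box in boxes.items():
    lo, hi = interval_from_vertices(box, mean, w)
    print(f"{name}: [{lo:.3f}, {hi:.3f}], center score {score(center(box), mean, w):.3f}")
```

The mean used for centering is the mean of the centers; with equal object weights this coincides with the column mean of the stacked vertex matrix $M$, since each lower and upper bound appears in exactly half of an object's vertices.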
Thus far in this section, we have assumed that the object $u$ is a class of size 1. However, more generally, it may be a class of size $n_u$, with $n_1 + \cdots + n_m = m^*$. The methods carry through readily.

Principal components as a method is designed to reduce the dimension of the data space to $s < p$. Dimensionality reduction methods for interval data have also been considered by Ichino (1988) and Ichino and Yaguchi (1994) using generalized Minkowski metrics, and by Nagabhushan, Gowda, and Diday (1995) using Taylor series ideas.

Currently, these principal components methods are limited to interval-valued symbolic data. Symbolic principal components analysis for other types of symbolic data, such as for modal variables (as would appear, e.g., in pixel satellite or tomography data), remains an outstanding problem.

Table 12. Regional Profiles: Demographics

Region     Y1, MedHouse$    Y2, MeanPerson$  Y3, Health       Y4, EducHigh     Y5, EducBach     Y6, TravelTime
NewEng     [45.72, 47.74]   [23.58, 24.84]   [88.50, 94.10]   [81.30, 90.00]   [24.10, 31.60]   [18.00, 21.90]
MidAtl     [43.69, 45.16]   [22.66, 23.37]   [84.80, 92.40]   [77.20, 87.30]   [15.30, 38.30]   [20.00, 28.60]
MidWestE   [44.03, 45.23]   [21.22, 21.86]   [86.50, 92.90]   [84.60, 90.80]   [17.10, 31.20]   [18.30, 25.10]
MidWestW   [43.85, 45.52]   [22.29, 23.47]   [88.50, 91.30]   [88.10, 90.40]   [24.60, 27.30]   [15.80, 17.20]
SouthAtl   [40.68, 41.60]   [22.34, 23.07]   [82.70, 88.10]   [79.20, 84.00]   [19.00, 23.20]   [19.80, 22.70]
SouthE     [33.22, 35.14]   [18.74, 19.89]   [86.50, 89.70]   [77.50, 79.90]   [20.40, 22.00]   [20.70, 21.50]
SouthW     [35.98, 36.95]   [19.03, 19.82]   [78.50, 89.40]   [79.20, 86.60]   [18.40, 26.20]   [19.00, 22.30]
Mountain   [41.92, 43.41]   [20.21, 21.25]   [81.50, 88.70]   [85.50, 91.80]   [20.00, 34.60]   [13.00, 20.70]
Pacific    [45.22, 46.56]   [22.38, 23.21]   [76.20, 89.90]   [81.20, 91.80]   [19.30, 28.60]   [16.70, 24.60]

Figure 2. Principal Components Analysis: Regional Profiles.

8. SYMBOLIC CLUSTERING

In this section we consider clustering methods for symbolic data. Our aim is to classify the objects in $\Omega$ into clusters (or classes) $C_1, \ldots, C_m$, which are internally as homogeneous as possible and externally as distinct from each other as possible. This process is distinct from the construction of classes procedure discussed in Section 3, in which preassigned criteria were selected for class membership. In contrast, the traditional clustering problem seeks to find "natural" groupings from the data. There is, however, a link between the two procedures, as we discuss later. Here we follow the approach of Chavent (1998) for criterion-based divisive clustering.

Chavent (1998) distinguished between two types of sets of variables. The first type deals with quantitative variables $Y_1, \ldots, Y_p$, in which $Y_j$ takes a single (classical) value or $Y_j$ takes an interval value $Y_j(u) = [a, b] \subset \mathbb{R}$, for $u \in \Omega$. Or the $Y_1, \ldots, Y_p$ can be categorical-type variables; that is, they can be either (classical) ordinal values with $Y_j(u) \in \mathcal{Y}_j$, multivalued variables with $Y_j(u) \subset \mathcal{Y}_j$ and $B_j = P(\mathcal{Y}_j)$, or modal variables where $Y_j(u) = \pi_j$ is a probability or frequency distribution on $\mathcal{Y}_j$, and $B_j$ is the set of all probability distributions on $\mathcal{Y}_j$.
However, the $Y_1, \ldots, Y_p$ cannot consist of a mixture of quantitative-type and categorical-type variables. Let $D = (d_{uv})$ be the $n \times n$ matrix of distance measures between objects $u, v \in \Omega$.

First, we take quantitative-type variables and assume that all $Y_j$ are interval valued. (Adjustment for the case where some $Y_j$ are classical variables is straightforward.) Suppose that we have the intervals $\xi_u = [a_{uj}, b_{uj}]$ and $\xi_v = [a_{vj}, b_{vj}]$, $j = 1, \ldots, p$, $u, v \in \Omega$. We seek a distance function $\delta_j(u, v)$ between objects $u$ and $v$. A number of possible such functions can be used in the clustering process. We can define a symbolic Hausdorff distance for $Y_j$ as

$$
\delta_j(u, v) = \max\{|a_{uj} - a_{vj}|, |b_{uj} - b_{vj}|\}. \tag{38}
$$

If we take a distance function

$$
d(u, v) = \Bigg( \sum_{j=1}^{p} [\delta_j(u, v)]^2 \Bigg)^{1/2} \tag{39}
$$

and use the specific $\delta_j(\cdot)$ of (38), we then have

$$
d(u, v) = \Bigg( \sum_{j=1}^{p} [\max\{|a_{uj} - a_{vj}|, |b_{uj} - b_{vj}|\}]^2 \Bigg)^{1/2}. \tag{40}
$$

Notice that $d(u, v)$ in (39) reduces to the Euclidean distance on $\mathbb{R}^p$ when all $Y_j$ are classical variables. The distance in (38) can be normalized to

$$
d'(u, v) = \Bigg\{ \sum_{j=1}^{p} [m_j^{-1} \delta_j(u, v)]^2 \Bigg\}^{1/2}
$$

for suitable choices of $m_j$. Chavent (1998) suggested

$$
m_j^2 = (2n^2)^{-1} \sum_{u=1}^{n} \sum_{v=1}^{n} [\delta_j(u, v)]^2, \quad \text{or} \quad m_j = \text{length of the domain of } Y_j.
$$

When the variables are categorical, we first construct a symbolic frequency table as follows. Suppose that $Y_j$ can take $p_j$ possible values. If $Y_j$ is modal, then we already have that the probability of obtaining the $k$th possibility is $\pi_{jk}$, $k = 1, \ldots, p_j$, with $\sum_k \pi_{jk} = 1$. Each possible value is "labeled" as a new variable $Y_{jk}$, $k = 1, \ldots, p_j$. For a multivalued variable $Y_j(u)$, we assume that the observed categories are uniformly distributed on $Y_j(u)$ and the probability for those possibilities not observed is 0. The required frequency table then represents an $n \times t$ matrix of classical-type data $X = (f_{uj})$, where $u = 1, \ldots, n$, $j = 1, \ldots, t$, and $t = p_1 + \cdots + p_p$ is the number of "variables." The entries $f_{uj}$ are real numbers and not necessarily integers (as would be the case for classical data). Then the distance between two objects $u$ and $v$ in $\Omega$ is given by

$$
d(u, v) = \Bigg[ \sum_{j=1}^{t} \frac{1}{p_{\cdot j}} \Big( \frac{p_{uj}}{p_{u\cdot}} - \frac{p_{vj}}{p_{v\cdot}} \Big)^2 \Bigg]^{1/2}, \tag{41}
$$

where

$$
p_{uj} = f_{uj}/(np), \qquad p_{u\cdot} = \sum_{j=1}^{t} p_{uj}, \qquad p_{\cdot j} = \sum_{u=1}^{n} p_{uj}.
$$

For example, suppose that we have the symbolic data of Table 13, which shows data for three objects relating to a cancer prognosis $Y_1$ (poor, medium, and good) and level of cigarette smoking $Y_2$ (below average, average, and above average). Notice that object $u = 3$ is in fact a classical data point. Here $Y_1$ is modal, but $Y_2$ is categorical. The symbolic data $X$ of Table 13 is transformed into classical data for six variables, as shown in Table 14. Hence the distance between objects $u = 1$ and $v = 2$ is, from (41), $d(1, 2) = .913$; likewise, $d(1, 3) = 1.118$ and $d(2, 3) = 1.812$.

Table 13. Categorical Symbolic Data

u   Y1, cancer prognosis                   Y2, smoking
1   {Medium (.6), good (.4)}               {Below average, average}
2   {Poor (.2), medium (.6), good (.2)}    {Below average}
3   {Good (1)}                             {Average}

Another cluster partitioning criterion is the following symbolic version of the classical within-class variance criterion, or inertia, given by

$$
I(C_i) = \frac{1}{2\mu_i} \sum_{u \in C_i} \sum_{v \in C_i} w_u w_v [d(u, v)]^2, \tag{42}
$$

where $w_u$ is the weight associated with object $u$ and $\mu_i = \sum_{u \in C_i} w_u$. If we take $w_u = 1/n$, then $\mu_i = n_i/n$ and

$$
I(C_i) = (n n_i)^{-1} \sum_{\substack{u, v \in C_i \\ u > v}} [d(u, v)]^2,
$$

where $n_i = |C_i|$ is the size of the class $C_i$.

Suppose that at the $m$th stage of the clustering process we have clusters $(C_1, \ldots, C_m)$. Thus, initially when $m = 1$, $C_1 \equiv \Omega$. We wish to select which cluster $C_i$ is to be partitioned into two clusters, $C_i^1$ and $C_i^2$, at the next $(m + 1)$th stage. This is achieved by selecting an $i$ that maximizes $\{I(C_i) - I(C_i^1) - I(C_i^2)\}$.
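The distance (41) and the inertia (42) can be checked numerically on the frequency table of Table 14. The following is a minimal sketch under stated assumptions, not a reference implementation: the function names are illustrative, and a category whose column total is zero (here "above average") is simply dropped from the sum, since its term is otherwise undefined. One detail is worth noting: Table 14 prints the column total for "good" as 1.8, whereas the three entries (.4, .2, 1) sum to 1.6. The distances quoted in the text are reproduced when the printed totals are used; with the recomputed totals the values differ slightly.

```python
from math import sqrt

# Frequency table f_uj of Table 14. Rows: objects u = 1, 2, 3. Columns: poor,
# medium, good (Y1), then below average, average, above average (Y2).
F = [
    [0.0, 0.6, 0.4, 0.5, 0.5, 0.0],
    [0.2, 0.6, 0.2, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
]
N_VARS = 2  # p: number of original symbolic variables (Y1 and Y2)

# Column totals f_.j exactly as printed in Table 14 ("good" printed as 1.8).
PRINTED_TOTALS = [0.2, 1.2, 1.8, 1.5, 1.5, 0.0]


def distance(F, u, v, p=N_VARS, col_totals=None):
    """Distance (41) between rows u and v (0-based) of the frequency table."""
    n = len(F)
    if col_totals is None:
        col_totals = [sum(col) for col in zip(*F)]
    scale = n * p                       # p_uj = f_uj / (n p)
    p_u = sum(F[u]) / scale             # p_u.
    p_v = sum(F[v]) / scale             # p_v.
    total = 0.0
    for j, f_col in enumerate(col_totals):
        p_col = f_col / scale           # p_.j
        if p_col == 0:                  # assumption: skip never-observed categories
            continue
        diff = (F[u][j] / scale) / p_u - (F[v][j] / scale) / p_v
        total += diff * diff / p_col
    return sqrt(total)


def inertia(cluster, dist, n):
    """Inertia (42) of a cluster (list of object indices) with weights w_u = 1/n."""
    w = 1.0 / n
    mu = w * len(cluster)
    return sum(w * w * dist(u, v) ** 2 for u in cluster for v in cluster) / (2 * mu)


pairs = [(0, 1), (0, 2), (1, 2)]
print([round(distance(F, u, v, col_totals=PRINTED_TOTALS), 3) for u, v in pairs])
# → [0.913, 1.118, 1.812]  (the values quoted in the text)
print([round(distance(F, u, v), 3) for u, v in pairs])
# → [0.915, 1.135, 1.83]   (column totals recomputed from the entries)
```

With equal weights, `inertia` agrees with the simplified form $(n n_i)^{-1} \sum_{u > v} [d(u, v)]^2$, which gives a quick consistency check on the two expressions for $I(C_i)$.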
The cluster $C_i$ is partitioned into these two subclusters, $C_i^1$ and $C_i^2$, as follows. We choose the best partition where "best" is specified by the distance measure adopted [e.g., minimizing $I(C_i)$]. Let the binary cut be specified by the binary function $q_c$, where we partition $C_i$ into $C_i^1$ and $C_i^2$ according to $C^1: \{u \in C \mid q_c(u) = \text{true}\}$ and $C^2: \{u \in C \mid q_c(u) = \text{false}\}$. For interval data $[a_u, b_u]$, we define $q_c$ for object $u$, for $m_u = (a_u + b_u)/2$, by

$$
q_c(u) = \begin{cases} \text{true}, & \text{if } m_u \le c \\ \text{false}, & \text{if } m_u > c. \end{cases}
$$

For modal variables (including categorical data after transformation to modal variable format), with data value $Y_j(u) = \pi_u$, we define $q_c$ by

$$
q_c(u) = \begin{cases} \text{true}, & \text{if } \sum_{x \le c} \pi_u(x) \ge .5 \\ \text{false}, & \text{if } \sum_{x \le c} \pi_u(x) < .5, \end{cases}
$$

where the summation is over all individual $x$ values, $x \le c$.

Table 14. Classical Transformation of Table 13 Data

        Y1                        Y2
u       Poor    Medium   Good     Below average   Average   Above average   f_u.
1       0       .6       .4       .5              .5        0               2
2       .2      .6       .2       1               0         0               2
3       0       0        1        0               1         0               2
f_.j    .2      1.2      1.8      1.5             1.5       0

This is illustrated with the data of Table 12 using all six variables, and supposing that we wish to use the normalized Hausdorff distance measure of (40), with weight $m_j$ corresponding to the length of the domain of $Y_j$. Thus $m_1 = 14.52$, $m_2 = 6.10$, $m_3 = 17.40$, $m_4 = 14.6$, $m_5 = 23.00$, and $m_6 = 15.60$. We have the three binary functions as $q_1 = [Y_1 \le 41.90]$, $q_2 = [Y_3 \le 89.80]$, and $q_3 = [Y_4 \le 84.375]$. The clusters of Figure 3 emerge. In terms of the queries of Section 3, we observe, for instance, that the cluster $C_3 = \{\text{MidWestE, Mountain, Pacific}\}$ matches the assertion

$$
C_3: [Y_1 > 41.90] \wedge [Y_3 \le 89.80] \wedge [Y_4 > 84.375];
$$

that is, the output of the clustering process gives a class (or cluster) represented by the symbolic object for which the median household income exceeds $41,900, at most 89.80% have health coverage, and more than 84.375% completed high school.
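The ingredients of this example can be sketched directly from Table 12. The code below is a hedged illustration, with function names that are assumptions of this sketch: it implements the Hausdorff distance (38), the domain-length scales $m_j$, the normalized distance $d'(u, v)$, and the midpoint binary questions, and then recovers $C_3$ from the three cut points quoted above. It does not search for the cut points; those are taken as given. With the table as transcribed here, the computed domain length for $Y_3$ is 17.90, whereas the text quotes 17.40; the other five values agree.

```python
from math import sqrt

# Table 12 intervals (Y1..Y6) for each region, as transcribed in this article.
TABLE_12 = {
    "NewEng":   [(45.72, 47.74), (23.58, 24.84), (88.50, 94.10), (81.30, 90.00), (24.10, 31.60), (18.00, 21.90)],
    "MidAtl":   [(43.69, 45.16), (22.66, 23.37), (84.80, 92.40), (77.20, 87.30), (15.30, 38.30), (20.00, 28.60)],
    "MidWestE": [(44.03, 45.23), (21.22, 21.86), (86.50, 92.90), (84.60, 90.80), (17.10, 31.20), (18.30, 25.10)],
    "MidWestW": [(43.85, 45.52), (22.29, 23.47), (88.50, 91.30), (88.10, 90.40), (24.60, 27.30), (15.80, 17.20)],
    "SouthAtl": [(40.68, 41.60), (22.34, 23.07), (82.70, 88.10), (79.20, 84.00), (19.00, 23.20), (19.80, 22.70)],
    "SouthE":   [(33.22, 35.14), (18.74, 19.89), (86.50, 89.70), (77.50, 79.90), (20.40, 22.00), (20.70, 21.50)],
    "SouthW":   [(35.98, 36.95), (19.03, 19.82), (78.50, 89.40), (79.20, 86.60), (18.40, 26.20), (19.00, 22.30)],
    "Mountain": [(41.92, 43.41), (20.21, 21.25), (81.50, 88.70), (85.50, 91.80), (20.00, 34.60), (13.00, 20.70)],
    "Pacific":  [(45.22, 46.56), (22.38, 23.21), (76.20, 89.90), (81.20, 91.80), (19.30, 28.60), (16.70, 24.60)],
}


def hausdorff(iu, iv):
    """Equation (38): Hausdorff distance between two intervals (a, b)."""
    return max(abs(iu[0] - iv[0]), abs(iu[1] - iv[1]))


def domain_lengths(table):
    """m_j = largest upper bound minus smallest lower bound of variable j."""
    columns = list(zip(*table.values()))
    return [max(b for _, b in col) - min(a for a, _ in col) for col in columns]


def normalized_distance(bu, bv, m):
    """Normalized distance d'(u, v): each delta_j is divided by its scale m_j."""
    return sqrt(sum((hausdorff(x, y) / mj) ** 2 for x, y, mj in zip(bu, bv, m)))


def q(j, c):
    """Binary question on variable j (0-based): is the interval midpoint <= c?"""
    return lambda box: (box[j][0] + box[j][1]) / 2 <= c


q1, q2, q3 = q(0, 41.90), q(2, 89.80), q(3, 84.375)

# C3: [Y1 > 41.90] and [Y3 <= 89.80] and [Y4 > 84.375]
C3 = [r for r, box in TABLE_12.items() if not q1(box) and q2(box) and not q3(box)]

m = domain_lengths(TABLE_12)
print([round(x, 2) for x in m])   # → [14.52, 6.1, 17.9, 14.6, 23.0, 15.6]
print(C3)                         # → ['MidWestE', 'Mountain', 'Pacific']
print(normalized_distance(TABLE_12["SouthE"], TABLE_12["SouthW"], m))
```

Note that the questions are evaluated on interval midpoints, so the verbal reading of the assertion ("income exceeds $41,900") is a statement about the midpoint of each region's interval rather than about every value in it.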
Chavent's method is a divisive hierarchical clustering procedure that starts with all the objects in a single cluster and is a form of monothetic divisive clustering method built by considering one variable at a time. Monothetic divisive methods are not dissimilar from discriminant analysis methods, such as the CART algorithm of Breiman, Friedman, Olshen, and Stone (1984) or the ID3 algorithm of Quinlan (1986) (see also Périnel and Lechevallier 2000 for specifics on symbolic discrimination analysis). Monothetic divisive clustering for conceptual objects was first introduced by Michalski, Diday, and Stepp (1981) and Michalski and Stepp (1983).

In contrast with divisive methods, agglomerative methods start with each object in $\Omega$ being a cluster of size 1, with the clustering algorithm developed to merge objects into larger classes. For symbolic objects, merging criteria revolve around finding those symbolic objects that are similar (as measured by an appropriate similarity index) and relate to the assertions described in Section 3. Building on Diday's (1986) development of pyramid clusters for classical data, Brito (1994, 1995, 2000) gave an algorithm for developing pyramid clusters for symbolic data, in which pyramidal clusters are defined as families of nested overlapping classes; that is, a class can belong to two distinct clusters, in contrast to a pure hierarchical cluster in which classes are distinct or are entirely contained within another class. Brito and DeCarvalho (1999) extended this to the case in which specific hierarchical rules exist, and DeCarvalho, Verde, and Lechevallier (1999) extended this to dependency rules. Polaillon (2000) developed pyramidal clusters for interval data using Galois lattice reductions. DeCarvalho et al. (1999) examined dynamic clustering of Boolean symbolic objects based on a content dependent proximity measure. Bock and Diday (2000) provided a comprehensive coverage of these approaches along with elucidating examples.
With the possible exception of Polaillon's (1998) Galois lattice reduction theory, results tend to be limited to the methods themselves, with theoretical justifications remaining as outstanding problems still to be addressed.

Figure 3. Divisive Clustering: Regional Profiles.

9. PROBABILITIES IN SYMBOLIC DATA ANALYSIS

Probabilistic underpinning for a few of these symbolic data analysis methods has been developed by Diday (1995), Diday and Emilion (1996), and Diday et al. (1996), primarily for modal data associated with capacities and credibilities, and by Emilion (2002) for mixture decompositions that underlie much of the histogram methods.

Suppose that in general the data are arrayed in a table with $n$ rows and $p$ columns with entries $Y_{ij}$. The rows are the observations of a sample of size $n$ from a random variable $Y = (Y_1, \ldots, Y_p): \Omega \to X_1 \times \cdots \times X_p$, where $X_j$ is the set of random variables defined on the same domain, such that $Y(w_i) = (Y_{i1}, \ldots, Y_{ip})$ for object $w_i$. If instead of having this general model, we focus on some descriptive property of the $Y_{ij}$ (e.g., a confidence interval or empirical distribution density), we obtain different kinds of nonnumerical data (i.e., we obtain symbolic data). For instance, if we are interested in the probability distributions, then the modal data become

$$
Y = (Y_1, \ldots, Y_p): \Omega \to P(V_1) \times \cdots \times P(V_p),
$$

where $P(V_j)$ is the set of all probability measures on a domain $D_j$ with a $\sigma$-algebra $A$. Hence, if $Y(w_i) = (\mu_{i1}, \ldots, \mu_{ip})$, the stochastic process $X_j(\cdot)(A)$ is a "random distribution," taking the terminology of some authors using Bayesian analyses (see, e.g., Ferguson 1974).

Starting from this probabilistic framework, we have considered several approaches. For instance, suppose that we are interested in the probability of an event $A$ for an easily given random variable $Y_j$ such that $P\{[Y_{1j} = A] \cup \cdots \cup [Y_{pj} = A]\} = F(A)$.
It can be easily shown that $F$ is a capacity. Diday (1995), in addition to capacity, also studied and compared other measures (e.g., credibility, probability). Using special maps, such as t-norms and t-conorms introduced by Schweizer and Sklar (1983), Diday described a histogram of capacities that is subadditive and a histogram of credibilities that is superadditive, in contrast to standard classical histograms of frequencies, which are additive. Diday also proved the existence of a limit on the height of these histograms as the interval lengths supporting the histogram tend to 0. Thus, it is clear that the concepts intent and extent (described in Sec. 3), as well as Galois lattices (not described herein), of symbolic objects can be represented as (partial or full) capacities or credibilities and hence by the relevant respective histograms. These results can also be applied to special cases of principal components (the vertices method), clustering (to pyramid clusters), and decision trees.

Bertrand and Goupil (2000) indicated that for interval-valued symbolic data, using the law of large numbers, the true limiting distribution of $Z$ as $m \to \infty$ is only approximated by the exact distribution $f(\xi)$ in (15), because the derivation depends on the veracity of the assumption of a uniform distribution within each interval.

Recently, clustering methods based on the estimation of mixtures of probability distributions where the observations are themselves probability distributions were considered by Emilion (2002). Using a theorem of Kakutani (see, e.g., Hewitt and Stromberg 1969, p. 453), Emilion proved a convergence result for the three cases of mixture components: a Dirichlet process, a normalized gamma process, and a Kraft process. Another approach for mixture distributions suggested by Diday (2001) is based on copula theory (Schweizer and Sklar 1983; Nelsen 1999) using different copula models, such as those of Frank (1979) and Clayton (1978).
Bertrand (1995) considered structural properties of pyramids. By and large, however, probabilistic and rigorous mathematical underpinning, along with the need to develop relevant convergence and limiting results, remain outstanding problems needing considerable development. This includes deriving relevant standard errors and distribution functions for the respective estimators thus far available, as well as consideration of robustness issues as to the effects of the underlying assumptions on these inference procedures. The field is wide open.

10. CONCLUSIONS

In conclusion, then, given contemporary data formats and dataset sizes, the need to develop statistical methods to analyze them is becoming increasingly important. In addition to these types of data, several attempts have been made to extend data analysis to fuzzy, uncertain, nonprecise data (e.g., Bandemer and Nather 1992; Viertl 1996), or compositional data (e.g., Aitchison 1986). In both cases, by taking care of the variation inside a class of units described by such data, we obtain symbolic data. The statistical world has developed a wealth of methodology throughout the twentieth century, methods that are largely limited to small (by comparison) datasets and limited to classical data formats. Some work has been done on imprecise or fuzzy data but not on symbolic data. In this article we have reviewed the formulation and construction of symbolic objects and classes. We also presented brief overviews of those symbolic data analyses that have emerged as ways to draw statistical inferences in some formats of symbolic data. What the obvious omission of methodologies in any list of such statistical methods makes abundantly clear is the enormous need for the development of a vast catalog of new symbolic methodologies, along with rigorous mathematical and statistical foundations for these methods.

[Received March 2001. Revised August 2002.]
REFERENCES

Aitchison, J. (1986), The Statistical Analysis of Compositional Data, New York: Chapman Hall.
Bandemer, H., and Nather, W. (1992), Fuzzy Data Analysis, Dordrecht: Kluwer Academic.
Bertrand, P. (1995), "Structural Properties of Pyramidal Clustering," in Partitioning Data Sets, eds. I. Cox, P. Hansen, and B. Julesz, Hayward, CA: American Mathematical Society, pp. 352–353.
Bertrand, P., and Goupil, F. (2000), "Descriptive Statistics for Symbolic Data," in Analysis of Symbolic Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 103–124.
Billard, L., and Diday, E. (2000), "Regression Analysis for Interval-Valued Data," in Data Analysis, Classification, and Related Methods, eds. H. A. L. Kiers, J.-P. Rasson, P. J. F. Groenen, and M. Schader, Berlin: Springer-Verlag, pp. 369–374.
Billard, L., and Diday, E. (2002a), "Symbolic Regression Analysis," in Classification, Clustering, and Data Analysis, eds. K. Jajuga, A. Sokolowski, and H.-H. Bock, Berlin: Springer-Verlag, pp. 281–288.
Billard, L., and Diday, E. (2002b), "Symbolic Data Analysis: Definitions and Examples," technical report, available at http://www.stat.uga.edu/faculty/LYNNE/Lynne.html.
Bock, H.-H., and Diday, E. (eds.) (2000), Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, Berlin: Springer-Verlag.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth.
Brito, P. (1994), "Use of Pyramids in Symbolic Data Analysis," in New Approaches in Classification and Data Analysis, eds. E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy, Berlin: Springer-Verlag, pp. 378–386.
Brito, P. (1995), "Symbolic Objects: Order Structure and Pyramidal Clustering," Annals of Operations Research, 55, 277–297.
Brito, P. (2000), "Hierarchical and Pyramidal Clustering With Complete Symbolic Objects," in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 312–324.
Brito, P., and DeCarvalho, F. A. T. (1999), "Symbolic Clustering in the Presence of Hierarchical Rules," in Knowledge Extraction From Statistical Data, Luxembourg: European Commission Eurostat, pp. 119–128.
Cazes, P., Chouakria, A., Diday, E., and Schektman, Y. (1997), "Extensions de l'Analyse en Composantes Principales a des Donnees de Type Intervalle," Revue de Statistique Appliquee, 24, 5–24.
Chavent, M. (1998), "A Monothetic Clustering Algorithm," Pattern Recognition Letters, 19, 989–996.
Chouakria, A. (1998), "Extension des Methodes d'Analyse Factorielle a des Donnees de Type Intervalle," unpublished doctoral thesis, Université Paris Dauphine.
Chouakria, A., Diday, E., and Cazes, P. (1999), "An Improved Factorial Representation of Symbolic Objects," in Knowledge Extraction From Statistical Data, Luxembourg: European Commission Eurostat, pp. 301–305.
Clayton, D. G. (1978), "A Model for Association in Bivariate Life Tables and Its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence," Biometrika, 65, 141–151.
Codd, E. F. (1972), "Further Normalization of the Data Base Relational Model," in Data Base Systems, Courant Computer Science Series, Vol. 6, ed. R. Rustin, Englewood Cliffs, NJ: Prentice-Hall, pp. 33–64.
Csernel, M. (1997), "Normalization of Symbolic Objects," in Symbolic Data Analysis and Its Applications, eds. E. Diday and K. C. Gowda, Dauphine, France: Ceremade, pp. 32–44.
DeCarvalho, F. A. T. (1994), "Proximity Coefficients Between Boolean Symbolic Objects," in New Approaches in Classification and Data Analysis, eds. E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy, Berlin: Springer-Verlag, pp. 387–394.
DeCarvalho, F. A. T. (1995), "Histograms in Symbolic Data Analysis," Annals of Operations Research, 55, 299–322.
DeCarvalho, F. A. T. (1998), "Extension-Based Proximities Between Constrained Boolean Symbolic Objects," in Data Science, Classification and Related Methods, eds. C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H. H. Bock, and Y. Baba, Berlin: Springer-Verlag, pp. 370–378.
DeCarvalho, F. A. T., Verde, R., and Lechevallier, Y. (1999), "A Dynamical Clustering of Symbolic Objects Based on a Content Dependent Proximity Measure," Applied Statistical Models and Data Analysis, 15, 237–242.
Diday, E. (1986), "Orders and Overlapping Clusters by Pyramids," in Multidimensional Data Analysis, eds. J. De Leeuw, W. J. Heiser, J. J. Meulman, and F. Critchley, Leiden, Netherlands: DSWO Press, pp. 201–234.
Diday, E. (1987), "Introduction a l'Approche Symbolique en Analyse des Donnees," Premieres Journees Symbolique-Numerique, Dauphine, France: CEREMADE, pp. 21–56.
Diday, E. (ed.) (1989), Data Analysis: Learning Symbolic and Numerical Knowledge, Antibes: Nova Science.
Diday, E. (1990), "Knowledge Representation and Symbolic Data Analysis," in Knowledge Data and Computer-Assisted Decisions, eds. M. Schader and W. Gaul, Berlin: Springer-Verlag, pp. 17–34.
Diday, E. (1995), "Probabilistic, Possibilistic and Belief Objects for Knowledge Analysis," Annals of Operations Research, 55, 227–276.
Diday, E. (2001), "A Generalization of the Mixture Decomposition Problem in the Symbolic Data Analysis Framework," Dauphine: Ceremade.
Diday, E., and Emilion, R. (1996), "Capacities and Credibilities in Analysis of Probabilistic Objects," in Ordinal and Symbolic Data Analysis, eds. E. Diday, Y. Lechevallier, and O. Opitz, Berlin: Springer-Verlag, pp. 13–30.
Diday, E., Emilion, R., and Hillali, Y. (1996), "Symbolic Data Analysis of Probabilistic Objects by Capacities and Credibilities," Atti Della XXXVIII Riunione Scientifica, pp. 5–22.
Emilion, R. (2002), "Clustering and Mixtures of Stochastic Processes," Compte Rendus Academy Sciences Paris I, 335, 189–193.
Esposito, F., Malerba, D., and Semeraro, G. (1991), "Flexible Matching of Noisy Structural Descriptions," in Conference on Artificial Intelligence, eds. J. Mylopoulos and R. Reiter, London: Morgan Kaufmann, pp. 658–664.
Ferguson, T. S. (1974), "Prior Distributions on Spaces of Probability Measures," Annals of Statistics, 2, 615–629.
Frank, M. J. (1979), "On the Simultaneous Associativity of F(x, y) and x + y − F(x, y)," Aequationes Mathematicae, 19, 194–226.
Gettler-Summa, M. (1999), "Approaches MGS (Marquage et Generalisations Symboliques) Pour le Suivi de Typologies dans le Temps," Journees de Statistique, 31, 833–845.
Gettler-Summa, M. (2000), "Marking and Generalization by Symbolic Objects in the Symbolic Official Data Analysis Software," in Data Analysis, Classification, and Related Methods, eds. H. A. L. Kiers, J.-P. Rasson, P. J. F. Groenen, and M. Schader, Berlin: Springer-Verlag, pp. 417–422.
Gettler-Summa, M., and Pardoux, C. (2000), "Symbolical Approaches for Three-Way Data," in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 342–354.
Gowda, K. C., and Diday, E. (1991), "Symbolic Clustering Using a New Dissimilarity Measure," Pattern Recognition, 24, 567–578.
Hewitt, E., and Stromberg, K. R. (1969), Real and Abstract Analysis: A Modern Treatment of the Theory of Functions of a Real Variable, New York: Springer-Verlag.
Ichino, M. (1988), "General Metrics for Mixed Features—The Cartesian Space Theory for Pattern Recognition," Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 1, 494–497.
Ichino, M., and Yaguchi, H. (1994), "Generalized Minkowski Metrics for Mixed Feature Type Data Analysis," IEEE Transactions on Systems, Man and Cybernetics, 24, 698–708.
Loustaunau, D., Pardoux, C., and Gettler-Summa, M. (1997), "Multidimensional Analysis for the Acquisition and Processing of Expert Knowledge: Concept of On-Site Telecommunication," in Multidimensional Data Analysis, eds. K. Fernandez-Aguirre and A. Morineau, Paris: CISIA-CERESTA, pp. 293–300.
Michalski, R. S., Diday, E., and Stepp, R. E. (1981), "A Recent Advance in Data Analysis: Clustering Objects Into Classes Characterized by Conjunctive Concepts," in Progress in Pattern Recognition, eds. L. Kanal and A. Rosenfeld, Amsterdam: North-Holland, pp. 33–56.
Michalski, R. S., and Stepp, R. E. (1984), "Learning From Observation: Conceptual Clustering," in Machine Learning, eds. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Berlin: Springer-Verlag, pp. 331–363.
Morineau, A., Sammartino, A.-E., Gettler-Summa, M., and Pardoux, C. (1994), "Analyses des Donnees et Modelisation des Series Temporelles," Revue de Statistique Appliquee, 42, 61–81.
Nagabhushan, P., Gowda, K. C., and Diday, E. (1995), "Dimensionality Reduction of Symbolic Data," Pattern Recognition Letters, 16, 219–233.
Nelsen, R. B. (1999), An Introduction to Copulas, New York: Springer-Verlag.
Périnel, E., and Lechevallier, Y. (2000), "Symbolic Discrimination Rules," in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 244–265.
Polaillon, G. (1998), "Organisation et Interpretation par les Treillis de Galois de Donnees de Type Multivalue, Intervalle ou Histogramme," unpublished doctoral thesis, Université Paris Dauphine.
Polaillon, G. (2000), "Pyramidal Classification for Interval Data Using Galois Lattice Reduction," in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 324–340.
Quinlan, J. R. (1986), "Induction of Decision Trees," Machine Learning, 1, 81–106.
Raju, S. R. K. (1997), "Symbolic Data Analysis in Cardiology," in Symbolic Data Analysis and Its Applications, eds. E. Diday and K. C. Gowda, Paris: CEREMADE, pp. 245–249.
Schweizer, B. (1984), "Distributions Are the Numbers of the Future," in Proceedings of The Mathematics of Fuzzy Systems Meeting, eds. A. di Nola and A. Ventre, Naples, Italy: University of Naples, pp. 137–149.
Schweizer, B., and Sklar, A. (1983), Probabilistic Metric Spaces, New York: North-Holland.
Stéphan, V., Hébrail, G., and Lechevallier, Y. (2000), "Generation of Symbolic Objects From Relational Databases," in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 78–105.
Verde, R., and DeCarvalho, F. A. T. (1999), "Dependence Rules Influence on Factorial Representation of Boolean Symbolic Objects," in Knowledge Extraction From Statistical Data, Luxembourg: European Commission Eurostat, pp. 287–300.
Viertl, R. (1996), Statistical Methods for Non-Precise Data, Boca Raton, FL: CRC Press.