Download From the Statistics of Data to the Statistics of Knowledge

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Regression analysis wikipedia , lookup

Forecasting wikipedia , lookup

Data assimilation wikipedia , lookup

Transcript
From the Statistics of Data to the Statistics
of Knowledge: Symbolic Data Analysis
L. B ILLARD and E. D IDAY
Increasingly, datasets are so large they must be summarized in some fashion so that the resulting summary dataset is of a more manageable
size, while still retaining as much knowledge inherent to the entire dataset as possible. One consequence of this situation is that the data may
no longer be formatted as single values such as is the case for classical data, but rather may be represented by lists, intervals, distributions,
and the like. These summarized data are examples of symbolic data. This article looks at the concept of symbolic data in general, and then
attempts to review the methods currently available to analyze such data. It quickly becomes clear that the range of methodologies available
draws analogies with developments before 1900 that formed a foundation for the inferential statistics of the 1900s, methods largely limited
to small (by comparison) datasets and classical data formats. The scarcity of available methodologies for symbolic data also becomes clear
and so draws attention to an enormous need for the development of a vast catalog (so to speak) of new symbolic methodologies along with
rigorous mathematical and statistical foundational work for these methods.
KEY WORDS: Clustering; Concepts; Descriptive statistics; Principal components; Symbolic data.
1. INTRODUCTION
With the advent of computers, very large datasets have become routine. What is not so routine is how to analyze the attendant data and/or how to glean useful information from within
their massive conŽ nes. It is evident, however, that even in those
situations where in theory available methodology might seem
to apply, routine use of such statistical techniques is often inappropriate. The reasons for this are many. Broadly, one reason surrounds the issue of whether or not the dataset really is a
sample from some population,because often the data constitute
the “whole,” as in for example, the record of all credit transactions of all card users. A related question pertains to whether
the data at a speciŽ ed point in time can be viewed as being from
the same population at another time, as, for example, will the
credit card dataset have the same pattern the next “week” or
when the next week’s transactions are added to the data already
collected? Another major broad reason that known techniques
fail revolves around the issue of the sheer size of the dataset.
Therefore, although traditional methods have served well on the
smaller datasets that dominated in the past, it now behooves us
as data analysts to develop procedures that work well on the
large modern datasets.
One approach is to summarize large datasets in such a way
that the resulting summary dataset is of a manageable size and
yet retains as much of the knowledge in the original dataset
as possible. Thus in the credit card example, instead of hundreds of speciŽ c transactions for each person (or credit card)
over time, a summary of the transactions per card (or per unit
time, such as 1 week) can be made. One such summary format
could be a range of transactions by, for instance, dollars spent
(e.g., $10–$982), by type of purchase (e.g., gas, clothes, food),
by type and expenditure (e.g., gas, $10–$30; food, $15–$95). In
each of these examples, the data are no longer single values as
in traditional data (e.g., $1,524 for total credit card expenditure,
37 total number of transactions). Instead, the summarized data
constitute ranges, lists, and the like and thus are examples of
symbolic data. Symbolic data have their own internal structure
L. Billard is University Professor & Professor of Statistics, Department of
Statistics, University of Georgia, Athens, GA 30602. E. Diday is Professor of
Computer Science, Centre d’Etudes et Recherches en Mathematiques et Aides
la Decision (CEREMADE), Universite de Paris 9 Dauphine 75775, Paris Cedex
16, France. This work was partially supported by the National Science Foundation and INRIA France.
(not present, nor possible, in classical data) and as such should
be analyzed using symbolic data analysis techniques.
Whereas the summarization of very large datasets can produce smaller datasets consisting of symbolic data, symbolic
data are distinctive in their own right on any size datasets. For
small symbolic datasets, the question is how the analysis proceeds. For large datasets, the Ž rst question is what approach
is being adopted to summarize the data into a (necessarily)
smaller dataset. Buried behind any summarization is the notion
of a symbolic concept, with any one aggregation being tied necessarily to the concept relating to a speciŽ c aim of an ensuing
analysis.
In this work we review concepts and methods developed under the heading of symbolic data analysis. These methods so far
have tended to be limited to developing methodologiesto organize the data into meaningful and manageable formats, somewhat akin to the developments leading to frequency histograms
and other basic descriptive statistics efforts before 1900, which
themselves formed a foundation for the inferential statistics of
the 1900s. A brief review of existing symbolic statistical methods is included herein. An extensive coverage of earlier results
has been given by Bock and Diday (2000). What quickly becomes clear is that so far very little statistical methodology has
been developed for the resulting symbolic data formats.
Symbolic data are deŽ ned and contrasted with classical data
in Section 2. The construction of classes of symbolic objects—
a necessary precursor to statistical analyses when the size of
the original dataset is too large for classical analyses, or where
knowledge (in the form of classes, concepts, taxonomies, and
so forth) is given as input instead of standard data—is discussed
in Section 3. Available methods of symbolic data analysis are
brie y described in Section 4, and some of these are discussed
in more detail in subsequent sections. What becomes apparent
is the inevitabilityof an increasing prevalence of symbolic data,
and hence the attendant need to develop statistical methodologies to analyze such data. It will also be apparent that few methods currently exist, and even for those that do exist the need remains to establish mathematical underpinningand rigor, including statistical properties of the results of these procedures. Typically, point estimators have been developed, but there are still
470
© 2003 American Statistical Association
Journal of the American Statistical Association
June 2003, Vol. 98, No. 462, Review Article
DOI 10.1198/016214503000242
Billard and Diday: Symbolic Data Analysis
471
essentially no results governing their properties, such as standard errors and distribution theory. These remain open problems.
2. SYMBOLIC DATASETS
A dataset may from its outset be structured as a symbolic
dataset. Alternatively, it may be structured as a classical dataset,
but when aggregated to establish it in a more manageable fashion, it becomes symbolic data. In this section we present examples of both classical and symbolic data and introduce notation
describing symbolic datasets for analysis.
Suppose that we have a dataset comprising the medical
records of individuals in a country. Suppose that for each individual there will be a record of geographic location variables,
such as region (north, northeast, south, : : :), city (Boston, Atlanta, : : :), urban/rural (yes/no), and so on. There will be demographic variables, such as gender, marital status, age, information on parents and siblings (alive or not), number of children, employer, and health provider. Basic medical variables
could include weight, pulse rate, blood pressure, and other variables. Other health variables (for which the list of possible variables is endless) could include incidences of certain ailments
and diseases; likewise, for a given incidence or prognosis, treatments and other related variables associated with that disease
are recorded. A typical such dataset may follow the lines of that
shown in Table 1.
Let p be the number of variables for each individual i 2 Ä D
f1; : : : ; ng, where clearly p and n can be extremely large, and
let Yj , j D 1; : : : ; p, represent the jth variable. Let Yj D xij be
the particular value assumed by the variable Yj for the ith individual in the classical setting, and write X D .xij / as the n £ p
matrix of the entire dataset. Let the domain of Yj be Yj ; so
p
X D .Y1 ; : : : ; Yp / takes values in X D £jD1 Yj . (Because the
presence or absence of missing values is of no importance for
the present, let us assume all values exist, even though this is
highly unlikely for large datasets.)
Variables can be quantitative; for example, age with Yage D
fx ¸ 0g D YC as a continuous random variable or with Yage D
f0; 1; 2; : : :g D N 0 as a discrete random variable. Variables
can be categorical; for example, city with Ycity D fAtlanta,
Boston, : : :g or coded Ycity D f1; 2; : : :g. Disease variables can
be recorded as categories (coded or not) of a single variable
with domain Y D fheart, stroke, cancer, cirrhosis, : : :g or as an
indicator variable, (e.g., Y D cancer with domain Y D fno, yesg
or f0; 1g or with other coded levels indicatingstages of disease).
Likewise, for a recording of the many possible types of cancers,
each type may be represented by a variable, Y, or by a category
of the cancer variable.
Table 1. Sample Dataset: Classical Data
Systolic
pressure
(mm Hg)
Diastolic
pressure
(mm Hg)
Cancer
i
City
Gender
Age
(years)
1
2
3
4
Boston
Boston
Chicago
El Paso
M
M
M
F
24
56
48
47
120
130
126
121
79
90
82
86
No
No
Lung
Yes
::
:
::
:
::
:
::
:
::
:
::
:
::
:
The precise nature of the description of the variables is not
critical. What is crucial in the classical setting is that for each
xij in X, there is precisely one possible realized value; for example, an individual’s Yage D 24, Ycity D Boston, Ycancer D yes,
Ypulse D 64, and so on. Thus a classical data point is a single
point in the p-dimensional space X .
In contrast, a symbolic data point can be a hypercube in
p-dimensional space or a Cartesian product of distributions.Entries in a symbolic dataset (denoted by »ij ) are not restricted to
a single speciŽ c value. Thus age could be recorded as being
in an interval, for example, [0; 10/; [10; 20/; [20; 30/, : : : . This
could occur when the data point represents the age of a family
or group of individuals whose collective ages fall in an interval
(such as [20; 30/ years, say), or the data may correspond to a
single individual whose precise age is unknown but is known
to be within an interval range, or whose age has varied over
time in the course of the experiment that generated the data; or
combinations and variations thereof, producing interval-ranged
data. In a different direction, it may not be possible to measure some characteristic accurately as a single value (e.g., pulse
rate is 64), but rather measures the variable as an .x § ±/ value
(e.g., pulse rate is 64 § 2/. A person’s weight may  uctuate
between .130; 135/ over a weekly period. The blood pressure
variable may be recorded by its [low, high] values, for example, »ij D [78; 120]. These variables are interval-valued symbolic variables.
A different type of variable is a cancer variable that may
have a domain Y D flung, bone, breast, liver, lymphoma,
prostate, : : :} listing all possible cancers with a speciŽ c individual having the particular values »ij D flung, liverg. In another example, suppose the variable Yj represents type of automobile owned by a household, with domain Yj D fChevrolet,
Ford, Toyota, Volvo, : : :g. A particular household i may have the
value »ij D fToyota, Volvog. Such variables are called multivalued variables.
A third type of symbolic variable is a modal variable. Modal
variables are multistate variables with a frequency, probability,
or weight attached to each speciŽ c value in the data; that is,
the modal variable Y is a mapping Y.i/ D fU.i/; ¼i g for i 2 Ä,
where ¼i is a nonnegative measure or a distribution on the domain Y of possible observation values and U.i/ µ Y is the support of ¼ i . For example, if three of an individual’s siblings are
diabetic and one is not, then the variable describing propensity
to diabetes could take the particular value »ij D f3=4 diabetes,
1=4 nondiabetiesg. More generally, »ij may be, for example,
a histogram, an empirical distribution function, a probability
distribution, or a model. Indeed, Schweizer (1984) opined that
“distributions are the numbers of the future.” While in this example the weights .3=4; 1=4/ might represent relative frequencies, other kinds of weights, such as “capacities,” “credibilities,” “necessities,” or “possibilities” may be used (see Diday
1995).
In general, then, unlike classical data for which each data
point consists of a single (categorical or quantitative) value,
symbolic data can contain internal variation and can be structured. It is the presence of this internal variation that necessitates the need for new techniquesfor analysis that generally will
differ from those for classical data. Note, however, that classical data represent special cases (e.g., the classical point x D a is
equivalent to the symbolic interval » D [a; a]).
472
Journal of the American Statistical Association, June 2003
Notationally, we have a basic set of objects that are elements
or entities, E D f1; : : : ; Ng called the object set. This object set
can represent a universe of individuals E D Ä (as earlier), in
which case N D n; or if N · n, any one object set is a subset
of Ä. Also, as frequently occurs in symbolic analyses, the objects u in E are classes C1 ; : : : ; Cm of individuals in Ä, with
E D fC1 ; : : : ; Cm g, and N D m. Thus, for example, class C1
may consist of all those individuals in Ä who have had cancer. Each object u 2 E is described by p symbolic variables Yj
j D 1; : : : ; p, with domain Yj and with Yj being a mapping from
the object set E to a range Yj that depends on the type of variable Yj is. Thus if Yj is a classical quantitative variable, then
the domain Bj is a subset of the real line < (i.e., Bj µ <); if Yj
is an interval variable, then Bj D f[®; ¯], ¡1 < ®; ¯ < 1g; if
Yj is categorical (nominal, ordinal, subsets of a Ž nite domain
Yj ), then Bj D fBjB µ f(e.g., list of cancers)gg; and if Yj is a
modal variable, then Bj D M.Yj /, where M.Y/ is the family of
all nonnegative measures on Y .
Then the symbolic data for the object set E are represented
by the N £ p matrix X D .»uj /, where »uj D Yj .u/ 2 Bj is the
observed symbolic value for the variable Yj , j D 1; : : : ; p, for
the object u 2 E. The row x0u of X is called the “symbolic description” of the object u. Thus for the data in Table 2, the Ž rst
row,
x01 D f[20; 30]; [79; 120]; Boston, fbrain tumorg;
fMaleg; [170; 180]g;
represents a male in his 20s who has a brain tumor, a blood pressure of 120/79 mm Hg, weighs between 170 and 180 pounds,
and lives in Boston. The object u associated with this x0u may
be a speciŽ c male individual followed over a 10-year period
whose weight has  uctuated between 170 and 180 pounds
over that interval, or u could be a collection of individuals
whose ages range from 20 to 30 and who have the characteristics described by x0u . The data x04 in Table 2 may represent the same individual as that represented by the i D 4th
individual of Table 1 but where it is known only that she
has either breast cancer (with probability p) or lung cancer
(with probability 1 ¡ p), but it is not known which. On the
other hand, it could represent the set of 47-year-old women
from El Paso of whom a proportion p have lung cancer
and proportion .1 ¡ p/ have breast cancer; or it could represent individuals who have both lung and breast cancer, and so
on. At some stage, whether the variable (here, type of cancer)
is categorical, a list, modal, or whatever would have to be explicitly deŽ ned.
Another issue relates to dependent variables, which for symbolic data implies logical dependence,hierarchical dependence,
Table 2. Sample Dataset: Symbolic Data
u
Age
Blood
pressure (mm Hg)
1
2
3
4
[20, 30)
[50, 60)
[45, 55)
[47, 47)
(79/120)
(90/130)
(80/130)
(86/121)
::
:
::
:
::
:
City
Type of
cancer
Boston
{Brain tumor}
Boston
{Lung, liver}
Chicago {Prostate}
El Paso {Breast p,
lung (1 ¡ p)}
::
:
::
:
Gender
{Male}
{Male}
{Male}
{Female}
::
:
taxonomic, or stochastic dependence. Logical dependence is as
the word implies; for example, if [age · 10], then [number of
children D 0]. Hierarchical dependence occurs when the outcome of one variable (e.g., Y2 D treatment for cancer) depends
on the actual outcome realized for other variables (e.g., Y1 D
has cancer with Y1 D fno, yesg). If Y1 has the value fyesg, then
Y2 D fsay, chemotherapyg; whereas if Y1 D fnog, then clearly
Y2 is not applicable. (We assume for illustrative purposes here
that the individual does not have chemotherapy treatment for
some other reason.) In these cases, the nonapplicable variable
Z is deŽ ned with domain Z D fNAg. Such variables are also
called mother .Y1 /–daughter .Y2 / variables. Other variables
may exhibit a taxonomic dependency; for example, Y1 D region and Y2 D city can take values, say, if Y1 D northeast, then
Y2 D Boston, or if Y1 D south, then Y2 D Atlanta.
3. CLASSES AND THEIR CONSTRUCTION:
SYMBOLIC OBJECTS
Suppose that we have a dataset Ä consisting of n individuals
i 2 Ä D f1; : : : ; ng, and that we wish to aggregate these into
m classes. For example, one set of classes, C1 ; : : : ; Cm , may
represent individualscategorized according to m different types
of primary diseases, whereas another analysis may necessitate
a class structure by, say, city, gender, age, or gender and age.
This stage of the symbolic data analysis is designed so as to
elicit more manageable formats before any statistical analysis.
Note that this construction may (but need not) be distinct from
classes obtained from a clustering procedure.
This leads us to the concept of a symbolic object developed
in a series of articles by Diday and colleagues (e.g., Diday
1987, 1989, 1990; Bock and Diday 2000; Stéphan, Hébrail, and
Lechevallier 2000). We introduce this here through some motivating examples, and then at the end of this section present a
more rigorous formal deŽ nition.
3.1 Some Examples
Suppose that we are interested in the concept “northeasterner.” Thus we have a description d representing the particular
values fBoston, : : : , other northeast cities, : : :g in the domain
Ycity ; and we have a relation R (here 2) linking the variable
Ycity with the particular description of interest. We write this as
[Ycity 2 fBoston, : : : , other northeast cities, : : :g] D a, say. Then
each individual i in Ä D f1; : : : ; ng is either a northeasterner
or is not. That is, a is a mapping from Ä ! f0; 1g, where for
an individual i who lives in the northeast, a.i/ D 1; otherwise,
a.i/ D 0, i 2 Ä. Thus if an individual i lives in Boston [i.e.,
Ycity .i/ D Boston], then we have a.i/ D [Boston µ fBoston, : : : ,
other northeast cities, : : :g] D 1.
The set of all i 2 Ä for which a.i/ D 1 is called the extent of
a in Ä. The triple s D .a; R; d/ is a symbolic object, where R is
a relation between the description Y.i/ of the (silent) variable Y
and a description d and a is a mapping from Ä to L that depends
on R and d. (In the Northeasterner example, L D f0; 1g.) The
description d is an intentional description; for example, as the
name suggests, we intend to Ž nd the set of individualsin Ä who
live in the “northeast.”
Symbolic objects play a role in one of three major ways
within the scope of symbolic data analyses. First, a symbolic
Billard and Diday: Symbolic Data Analysis
object may represent a concept by its intent (i.e., its description and a way for calculating its extent) and can be used as the
input of a symbolic data analysis. Thus the concept “northeasterner” can be represented by a symbolic object whose intent is
deŽ ned by a characteristic description and a way to Ž nd its extent, which is the set of people who live in the northeast. A set
of such regions and their associated symbolic objects can constitute the input of a symbolic data analysis. Second, it can be
used as output from a symbolic data analysis, as when a clustering analysis suggests that northeasterners belong to a particular cluster, where the cluster itself can be considered a concept
and be represented by a symbolic object. The third situation is
when we have a new individual .i0 / who has description d 0 , and
we want to know whether this individual .i0 / matches the symbolic object whose description is d; that is, we compare d and
d 0 by R to give [d 0 Rd] 2 L D f0; 1g, where [d 0 Rd] D 1 means
that there is a connection between d 0 and d. This “new” individual may be an “old” individual but with updated data, or it
may be a new individual being added to the data base who may
or may not “Ž t into” one of the classes of symbolic objects already present (e.g., should this person be provided with speciŽ c
insurance coverage?).
In the context of the aggregation of our data into a smaller
number of classes, were we to aggregate the individuals in Ä
by city (i.e., by the value of the variable Ycity ) then the respective classes Cu , u 2 f1; : : : ; mg constitute those individualsin Ä
who are in the extent of the corresponding mapping a.u/, say.
Subsequent statistical analysis can take one of two broad directions. Either we analyze, separately for each class, the classical
or symbolic data for the individuals in Cu as a sample of nu
observations as appropriate, or we summarize the data for each
class to yield a new dataset with one “observation” per class. In
the latter case, the dataset values will be symbolic data regardless of whether the original values were classical or symbolic
data. For example, even though each individualin Ä is recorded
as having or not having cancer (Ycancer D no, yes)—that is, as
a classical data value—this variable when related to the class
for, say, city will become, for example, fyes (.1), no (.9)g; that
is, 10% have had cancer and 90% have not. Thus the variable
Ycancer is now a symbolic modal-valued variable.
Likewise, a class constructed as the extent of a symbolic object is typically described by a symbolic dataset. For example,
suppose that our interest lies with “those who live in Boston”
(i.e., a D [Ycity D Boston]), and suppose that the variable Ychild
is the number of children that each individual i 2 Ä has, with
possible values f0; 1; 2; ¸ 3g. (The adjustment for a symbolic
data value such as individual i has 1 or 2 children, that is, »i D
f1; 2g, readily follows.) Then the object representing all those
who live in Boston will now have the symbolic variable Ychild
with particular value Ychild D f.0; f0 /; .1; f1 /; .2; f2 /; .¸3; f3 /g;
where fi , i D 0; 1; 2; ¸ 3, is the relative frequency of individuals in this class who have i children.
A special case of a symbolic object is an assertion. Assertions, also called queries, are particularly important when aggregating individuals into classes from an initial (relational)
database. Let us denote by z D .z1 ; : : : ; zp /, the required description of interest of an individual i or of a concept w. Here
zj can be a classical single-valued entity, xj , or a symbolic entity, »j . Thus, for example, suppose that we are interested in the
473
symbolic object representing those who live in the northeast;
then zcity is the set of northeastern cities. We formulate this as
the assertion
a D [Ycity 2 fBoston; : : : ; other northeast cities; : : :g];
(1)
where a is mapping from Ä to f0; 1g such that for individual
or object w, a.w/ D 1 if Ycity .w/ 2 fBoston, : : : , other northeast
cities, : : :g.
In general, an assertion takes the form
a D [Yj1 Rj1 zj1 ] ^ [Yj2 Rj2 zj2 ] ^ ¢ ¢ ¢ ^ [Yjv Rjv zjv ]
(2)
for 1 · j1 ; : : : ; jv · p, where “^” indicates the logical multiplicative “and” and R represents the speciŽ ed relationship between the symbolic variable Yj and description value zj . For
each individual i 2 Ä; a.i/ D 1 (or 0) when the assertion is true
(or not) for that individual. More precisely, an assertion is a
conjunction of v events, [Yk Rk zk ], k D 1; : : : ; v.
For example, the assertions
a D [Yage ¸ 60];
a D [Ycancer D yes] ^ [Yage ¸ 60];
a D [Yage < 20] ^ [Yage > 70]
(3)
represent all individualsage 60 and over, all cancer patients age
60 and over, and all individuals under age 20 and over 70. In
each case, we are dealing with a concept “those over age 60,”
“those over age 60 with cancer,” and so on, and a maps the
particular individuals present in Ä onto the space f0; 1g. The
assertion a D [Ycancer 2 flung, liverg] describes the class of individuals who have either lung cancer or liver cancer or both,
whereas
a D [Yage µ [20; 30]] ^ [Ycity 2 fBostong]
(4)
describes those who live in Boston and are in their 20s.
The relations R can take any of the forms D; 6D; 2; ·; µ, and
so on, or can be a matching relationship (e.g., a comparison of
probability distributions), a structured sequence, or some other
form. The domain of the symbolic object can be written as D D
D1 £ ¢ ¢ ¢ £ Dp ; where Dj µ Yj . The p-tuple .D1 ; : : : ; Dp / of sets
is called a description system, and each subset D is a description
set consisting of description vectors z D .z1 ; : : : ; zp /, whereas a
combination of elements zj 2 Yj and sets Dj µ Yj as in (4) is
simply a description. When there are constraints on any of the
variables (such as when logical dependencies exist), then the
space D has a “hole” in it corresponding to those values that
match the constraints. The totality of all descriptions, D, is the
description space, D. Hence an assertion can be written as
a D [Y 2 D] ´
v
^
kD1
[Yjk Rjk zjk ] D [YRz];
Vv
where R D kD1 Rjk is called the product relation.
Note that implicitly, if an assertion does not involve a particular variable, say, Yj , then the domain relating to that variable remains unchanged at Yj . For example, if p D 3, and Y1 D
city, Y2 D age, and Y3 D weight, then the assertion a D [Y1 2
fBoston, Atlantag] has domain D D fBoston, Atlantag £ Y2 £
Y3 and seeks all those in Boston and Atlanta only (but regardless of age and weight).
474
Journal of the American Statistical Association, June 2003
3.2 Formal DeŽ nitions
Let us now formally deŽ ne some concepts.
DeŽ nition 1. When an assertion is asked of any particular
object i 2 Ä, it assumes that a value is true .a D 1/ if that assertion holds for that object, or false .a D 0/ if it is not. We
write
»
1; if Y.i/ 2 D
a.i/ D [Y.i/ 2 D] D
(5)
0; otherwise.
The function a.i/, called a truth function, represents a mapping
of Ä onto f0; 1g. The set of all i 2 Ä for which the assertion
holds is called the extension in Ä, denoted by Ext.a/ or Q,
Ext.a/ D Q D fi 2 ÄjY.i/ 2 Dg D fi 2 Äja.i/ D 1g:
(6)
Formally, the mapping a : Ä ! f0; 1g is called an extension
mapping.
Typically, a class would be identiŽ ed by the symbolic object
that described it. For example, the assertion (4) corresponds to
the symbolic object “20 to 30-year-olds living in Boston.” The
extension of this assertion, Ext.a/, produces the class consisting
of all individuals i 2 Ä that match this description a, that is,
those i for whom a.i/ D 1.
A class may be constructed as in the foregoing by seeking
the extension of an assertion a. Alternatively, it may be that
one desires to Ž nd all those (other) individuals in Ä who match
the description Yj .u/ of a given
Vp individual u 2 Ä. Thus in this
case, the assertion is a.i/ D jD1 [Yj D Yj .u/], where now zj D
Yj .u/ and has extension Ext.ai / D fi 2 ÄjYj .i/ D Yj .u/; j D
1; : : : ; pg.
More generally, an assertion may hold with some intermediate value (e.g., a probability), that is, 0 · a.i/ · 1, representing
a degree of matching of an object i with an assertion a. Flexible matching is also possible. In these cases, the mapping a
is onto the interval [0; 1]. Then the extension of a has level ®,
0 · ® · 1, where now Ext® .a/ D Q ® D fi 2 Äja.i/ ¸ ®g: Diday and Emilion (1996) and Diday, Emilion, and Hillali (1996)
studied modal symbolic objects and also provided some of its
theoretical underpinning;see also Section 9.
We now have the formal deŽ nition of a symbolic object, as
follows.
DeŽ nition 2. A symbolic object is the triple s D .a; R; d/,
where a is a mapping a: Ä ! L that maps individualsi 2 Ä onto
the space L depending on the relation R between descriptions
and the description d. When L D f0; 1g, that is, when a is a binary mapping, s is a Boolean symbolic object. When L D [0; 1],
s is a modal symbolic object.
Finally, computational implementation of the generation of
appropriate classes can be executed by queries (assertions) used
in search engines, such as the exhaustive search, genetic, or
hill climbing algorithms, and/or by the use of available software, such as the standard query language (SQL) package
or other “object-oriented” language, such as CCC or JAVA.
Stephan et al. (2000) provided some detailed examples illustrating the use of SQL. Another package speciŽ cally written for
symbolic data is the SODAS (Symbolic OfŽ cial Data Analysis
System) software. Gettler-Summa (1999, 2000) developed the
MGS (marking and generalization by symbolic descriptions algorithm) for building symbolic descriptions, starting with classical nominal data. Thus the outputs (ready for symbolic data
analysis) are symbolic objects that have modal or multivalued
variables and that identify logical links between the variables.
Csernel (1997), inspired by Codd’s (1972) methods of dealing
with relational databases with functional dependenciesbetween
variables, developed a normalization of Boolean symbolic objects taking into account constraints on the variables.
Although we have not addressed this application explicitly,
the concepts described in this section can also be applied to
merged or linked datasets. For example, suppose that we also
had a dataset consisting of meteorological and environmental variables (measured as classical or symbolic data) for U.S.
cities. Then if the data are merged by city, it is easy to obtain
relevant meteorological and environmental data for each individual identiŽ ed in Table 1. Thus, for example, questions relating to environmental measures and cancer incidences could be
considered.
4. SYMBOLIC DATA ANALYSES
Given a symbolic dataset, the next step is to conduct statistical analyses as appropriate. As for classical data, the possibilities are endless. Unlike classical data, for which a century of effort has produced a considerable library of analytical/statistical
methodologies,symbolic statistical analyses are new, and available methodologies are still only few in number. In the following sections, we review some of these as they pertain to symbolic data.
Therefore, we consider descriptive statistics in Sections 5
and 6. We review principal component methods for intervalvalued data in Section 7, and treat clustering methods in
Section 8. We restrict attention to the more fully developed
criterion-based divisive clustering approach for categorical and
interval-valued variables. We also provide a summary of other
methods limited largely to speciŽ c special cases. Much of
the current progress in symbolic data analyses revolve around
cluster-related questions. We cover theoretical underpining,
where it exists, in Section 9.
Except for the speciŽ c distance measures developed for symbolic clustering analysis (see Sec. 8), herein we do not attempt
to review the literature on symbolic similarity and dissimilarity
measures. Classical measures include the Minkowski or Lq distance (with q D 2 being the Euclidean distance measure), Hamming distance, Mahalanobis distance, and the Gower–Legendre
family, plus the large class of dissimilarity measures for probability distributions from classical probability and statistical theory, including the familiar chi-squared measure, Bhattacharyya
distance, and Chernoff’s distance, to name but a very few. Symbolic analogs have been proposed for speciŽ c cases. Gowda
and Diday (1991) Ž rst introduced a basic dissimilarity measure for Boolean symbolic objects. Later, Ichino and Yaguchi
(1994), using Cartesian operators, developed a symbolic generalized Minkowski distance of order q ¸ 1. This was extended
by DeCarvalho (1994, 1998) to include Boolean symbolic objects constrained by logical dependencies between variables.
Matching of Boolean symbolic objects was considered by Esposito, Malerba, and Semeraro (1991). These measures invoke
the concepts of Cartesian join and meet. Some are developed
Billard and Diday: Symbolic Data Analysis
475
along the lines of distance functions (as described in Sec. 8; see,
e.g., Chavent 1998); many are limited to univariate cases. Because such measures vary widely, and the choices of what to use
are generally dependent on the task at hand, we do not expand
on their details here. In a different direction, Morineau, Sammartino, Gettler-Summa, and Pardoux (1994), Loustaunau,Pardoux, and Gettler-Summa (1997), and Gettler-Summa and Pardoux (2000) considered three-way tables, and Billard and Diday
(2002a) considered regression for some special cases. A more
detailed and exhaustive description of symbolic methodologies
up to the time of its publication was given by Bock and Diday (2000). Billard and Diday (2002b) also provided expanded
coverage with some additional illustrations.
vir.d/ D fx 2 DI v.x/ D 1; for all v in Vx g:
(8)
In effect, this operation is mathematically cleaning the data
by identifying only those values that make logical sense. For
small datasets, it may be possible to “correct” the data visually
(or some such variation thereof). For very large datasets, this is
not always possible; hence a logical dependency rule to do so
mathematically/computationallyis essential.
5.2 Multivalued Variables
5. DESCRIPTIVE UNIVARIATE STATISTICS
5.1 Some Preliminaries
Basic descriptive statistics include frequency histograms
and sample means and variances. We consider symbolic data
analogs of these statistics, for multivalued and interval-valued
variables with rules and modal variables. We consider univariate statistics in this section and bivariate statistics in Section 6.
For integer-valued and interval-valued variables, we follow the
approach adopted by Bertrand and Goupil (2000). DeCarvalho
(1994, 1995) and Chouakria, Diday, and Cazes (1999) used different but equivalent methods to Ž nd the histogram and interval
probabilities [see eq. (16)] for interval-valued variables.
Before describing these quantities, we Ž rst introduce the concept of virtual extensions. Recall from Section 3 that the symbolic description of an object u 2 E (or, equivalently, i 2 Ä)
was given by the description vector d u D .»u1 ; : : : ; »up /, u D
1; : : : ; m; or, more generally, d 2 .D 1 ; : : : ; Dp / in the space
p
D D £jD1 Dj , where in any particular case the realization of
Yj may be an xj as for classical data or an »j of symbolic data.
Individual descriptions, denoted by x, are those for which each
Dj is a set of single (classical) values only, that is, x ´ d D
p
.fx1 g; : : : ; fxp g/; x 2 X D £jD1 Yj .
The calculation of the symbolic frequency histogram involves a count of the number of individual descriptions that
match certain implicit logical dependencies in the data. A logical dependency can be represented by a rule v,
v : [x 2 A] ) [x 2 B];
DeŽ nition 3. The virtual description of the description vector d is the set of all individual description vectors x that satisfy
all of the (logical dependency) rules v in X , written as, for Vx
the set of all rules v operating on x,
(7)
for A µ D, B µ D, and x 2 X and where v is a mapping of
X onto f0; 1g with v.x/ D 0.1/ if the rule is not (is) satisŽ ed
by x. For example, suppose that x D .x1 ; x2 / D .10; 0/ is an individual description of Y1 D age and Y2 D number of children
for i 2 Ä, and suppose that A D fage · 12g and B D f0g. Then
the rule that an individual under age 12 implies that he or she
has no children is logically true, whereas an individual under
age 12 with 2 children is not logically true. It follows that an
individual description vector x satisŽ es the rule v if and only
= A. This formulation of the logical depenif x 2 A \ B or x 2
dency rule is sufŽ cient for the purposes of establishing basic
descriptive statistics. Verde and DeCarvalho (1999) discussed a
variety of related rules (e.g., logical equivalence, logical implication, multiple dependencies, hierarchical dependencies). We
have the following formal deŽ nition.
Suppose that we want to Ž nd the frequency distribution for
the particular multivalued symbolic variable Yj ´ Z that takes
possible particular values » 2 Z. These can be categorical values (e.g., types of cancer) or any form of discrete random variable. We deŽ ne the observed frequency of » as
X
OZ .» / D
¼Z .» I u/;
(9)
u2E
where the summation is over u 2 E D f1; : : : ; mg and where
¼Z .» I u/ D
jfx 2 vir.du /jxZ D » gj
j vir.du /j
(10)
is the percentage of the individual description vectors in vir.du /
such that xZ D » , and where jAj is the number of individual descriptions in the space A. In the summation in (9), any u for
which vir.du / is empty is ignored. We note that this observed
frequency is a positive real number and not necessarily an integer as for classical data. In the classical case, j vir.du /j D 1
and
case of (10). We can easily show that
P so is a special
0
0
» 2Z OZ .» / D m , where m D .m ¡ m0 /, with m 0 the number
of u for which j vir.du /j D 0.
For a multivalued symbolic variable Z taking values » 2 Z ,
the empirical frequency distributionis the set of pairs [»; OZ .» /]
for » 2 Z, and the relative frequency distribution or frequency
histogram is the set [»; .m0 /¡1 OZ .» /]: The following deŽ nitions follow readily.
The empirical distribution function of Z is given by
FZ .» / D
1 X
OZ .»k /:
m0
(11)
»k ·»
When the possible values » for the multivalued symbolic
variable Yj D Z are quantitative, the symbolic sample mean and
the symbolic sample variance are deŽ ned as
S
ZD
S2Z
1 X
»k O Z .»k /
m0
and
»k
1 X
D 0
OZ .»k /[»k ¡ S
Z ]2 :
m
(12)
»k
Detailed worked examples have been given by Bock and Diday
(2000).
476
Journal of the American Statistical Association, June 2003
Finally, we observe that we can calculate weighted frequencies, weighted means, and weighted variances by replacing (9)
with
X
OZ .» / D
wu ¼Z .» I u/;
(13)
u2E
P
wu D 1. For example, if the objects u 2 E
with wu ¸ 0 and
are classes Cu comprising individuals from the set of individuals Ä D f1; : : : ; ng (i.e., Cu µ Ä) then a possible weight is
wu D jCu j=jÄj, corresponding to the relative sizes of the class
Cu ; u D 1; : : : ; m. Or, more generally, if we consider each individual description vector x as an elementary unit of the object u,
then we can use weights for u 2 E,
wu D P
j vir.d u /j
:
u2E j vir.du /j
(14)
For example, suppose that the data of Table 3 represent
the outcomes relative to the presence of cancer Y1 , with Y1 D
fno D 0; yes D 1g, and the number of cancer-related treatments
Y2 , with Y2 D f0; 1; 2; 3g, with a logical dependency rule, v :
y1 2 f0g ) y2 D f0g. Suppose that the u of Table 3 represent
classes Cu of size jCu j, with jÄj D 1;000, as shown in Table 3.
Then, if we use the weights wu D jCu j=jÄj, we can show that
rel freq.Y1 / : [.0; :229/; .1; :771/]
and
rel freq.Y2 / : [.0; :2540/; .1; :0970/; .2; :2395/; .3; :4095/]:
5.3 Interval-Valued Variables
Suppose that we are interested in the particular intervalvalued variable Yj ´ Z, and suppose the observation value for
object u is the interval Z.u/ D [au ; bu ], for u 2 E D f1; : : : ; mg.
Assume the individual description vectors x 2 vir.d u / are uniformly distributed over the interval Z.u/.
S The individual description vector x takes values globally in
u2E vir.du /. Further, each object is assumed to be equally
likely to be observed with probability 1=m. Therefore, the empirical distribution function, FZ .» /, is the distribution function
of a mixture of m uniform distributions fZ.u/; u D 1; : : : ; mg.
Therefore, it can be shown that (see Bertrand and Goupil 2000)
the empirical density function of Z is
1 X I u .» /
;
» 2 <;
f .» / D
(15)
m
kZ.u/k
u2E
where I u .¢/ is the indicator function that » is or is not in the
interval Z.u/ and where kZ.u/k is the length of that interval.
Table 3. Symbolic Data: Multi-Valued Variables
u
Y1
Y2
vir(du )
|vir (du )|
|Cu |
1
2
3
4
5
6
7
8
9
{0,1}
{0,1}
{0,1}
{0,1}
{0}
{0}
{1}
{1}
{1}
{2}
{0,1}
{3}
{2,3}
{1}
{0,1}
{2,3}
{1,2}
{1,3}
{(1,2)}
{(0,0), (1,0), (1,1)}
{(1,3)}
{(1,2), (1,3)}
1
3
1
2
0
1
2
2
2
128
75
249
113
2
204
87
23
121
Á
{(0,0)}
{(1,2), (1,3)}
{(1,1), (1,2)}
{(1,1), (1,3)}
Note that the summation in (15) is only over those objects u for
which » 2 Z.u/.
To construct a histogram, let I D [minu2E au ; maxu2E b u ] be
the interval that spans all the observed values of Z in X , and
suppose that we partition I into r subintervals Ig D [»g¡1 ; »g /,
g D 1; : : : ; r ¡ 1, and Ir D [»r¡1 ; »r ]. Then the histogram for
Z is the graphical representation of the frequency distribution
f.I g ; pg /; g D 1; : : : ; rg, where
pg D
1 X kZ.u/ \ Ig k
I
m u2E kZ.u/k
(16)
that is, pg is the probability an arbitrary individual description
vector x lies in the interval I g . If we want to plot the histogram
with height fg on the interval ig so that the “area” is p g , then
pg D .»g ¡ »g¡1 / £ fg .
The symbolic sample mean for an interval-valued variable Z
and the symbolic sample variance are given by
S
ZD
S2 D
1 X
.bu C au /
2m u2E
and
µ
¶2 (17)
1 X 2
1 X
.ba C bu au C a2u / ¡
.b
C
a
/
:
u
u
3m u2E
4m2 u2E
As for multivalued variables, if an object u has some internal
inconsistency relative to a logical rule (i.e., if u is such that
j vir.d u /j D 0) then the summations in (15)–(17) are over only
those u for which j vir.du /j 6D 0. This has been illustrated with
worked examples by Bock and Diday (2000) and Billard and
Diday (2002b). Henceforth, we use m and E to refer to those u
for which these rules hold.
5.4 Multivalued Modal Variables
Of the many types of modal-valued variables, we brie y consider only two types, one each of multivalued (in this section)
and interval-valued variables (in Sec. 5.5).
Suppose that we have data from a categorical variable Yj
taking possible
Pvalues »jk , with relative frequencies pk , k D
pk D 1. Suppose that we are interested in the
1; : : : ; s, with
particular symbolic random variable Yj ´ Z, in the presence of
the dependency rule v. Then we deŽ ne the observed frequency
that Z D »k , k D 1; : : : ; s, as
X
OZ .»k / D
¼Z .»k I u/;
(18)
u2E
where the summation is over all u 2 E and where
¼Z .»k I u/ D P.Z D »k jv.x/ D 1; u/
P
P.x D »k jv.x/ D 1; u/
D P Pxs
;
x
kD1 P.x D »k jv.x/ D 1; u/
(19)
where for each u, P.x D »k jv.x/ D 1; u/ is the probability that
a particular description x (´xj ) has the value »k and that the
logical dependency rule v holds. If for a speciŽ c object u there
are no description vectors satisfying the rule v, [i.e., if the denominator of (19)P
is 0] then that u is omitted in the summation
in (18). Note that skD1 OZ .»k / D m:
Billard and Diday: Symbolic Data Analysis
477
Example. Suppose that households are recorded as having
one of the possible central heating fuel types Y1 taking values in
Y1 D fgas, solid fuel, electricity, otherg and suppose that Y2 is
an indicator variable taking values in Y2 D fno, yesg depending
on whether a household does not (does) have central heating
installed. For illustrative convenience, it is assumed the values
for Y1 are conditional on there being central heating present
in the household. Suppose that aggregation by geographical
region produced the data of Table 4. The original (classical)
dataset consisted of census data from the U.K. OfŽ ce for National Statistics on 34 demographic–socio-economic variables
on individual households in 374 parts of the country. The data
given in Table 4 are those obtained for two speciŽ c variables
after aggregation into 25 regions and were obtained by applying the SODAS software to the original classical dataset. Thus
region one, represented by the object u D 1, is such that 87%
of the households with central heating are fueled by gas, 7%
by solids, 5% by electricity, and 1% by some other type of fuel,
and that 9% of householdsdid not have central heating and 91%
did. Let us suppose that interest centers only on the variable Y1
(without regard to any other variable) and that there is no rule v
to be satisŽ ed. Then it is readily seen that the observed frequencies that Z D »1 D gas, »2 D solid, »3 D electricity, and »r D
other fuels are OY1 .»1 / D 18:05; OY1 .»2 / D 1:67; O Y1 .»3 / D
3:51; and OY1 .»4 / D 1:77: Hence the relative frequencies are
rel freq.Y1 / : [.»1 ; :722/; .»2 ; :067/; .»3 ; :140/; .»4 ; :071/]: Similarly, we can show that for the Y2 variable, rel freq.Y2 / :
[.no; :156/; .yes; :844/].
Suppose, however, that there is the logical rule that a household must have one of »k , k D 1; : : : ; 4, if it has central heating.
For convenience,we denote Y1 D »0 if there is no fuel type used
(i.e., Y1 D »0 correspondsto Y2 D no). Then this logical rule can
be written as the pair v D .v1 ; v2 /,
»
v1 : Y2 2 fnog ) Y1 2 f»0 g
vD
(20)
v2 : Y2 2 fyesg ) Y1 2 f»1 ; : : : ; »4 g:
Table 4. Symbolic Data: Multi-Valued Modal Variables
Y1
Y2
u
Gas
Solid
Electricity
Other
No
Yes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
{.87
{.71
{.83
{.76
{.78
{.90
{.87
{.78
{.91
{.73
{.59
{.90
{.84
{.82
{.88
{.85
{.71
{.87
{.32
{.50
{.69
{.79
{.72
{.43
{.00
.07
.11
.08
.06
.06
.01
.01
.02
.00
.08
.07
.01
.00
.00
.00
.01
.03
.09
.24
.12
.13
.01
.05
.00
.41
.05
.10
.09
.11
.09
.08
.10
.13
.09
.11
.17
.08
.14
.11
.09
.10
.17
.04
.24
.28
.18
.20
.19
.43
.14
.01}
.08}
.00}
.07}
.07}
.01}
.02}
.07}
.00}
.08}
.17}
.01}
.02}
.07}
.03}
.04}
.09}
.00}
.20}
.10}
.00}
.00}
.04}
.14}
.45}
{.09
{.12
{.23
{.19
{.12
{.22
{.22
{.13
{.24
{.14
{.10
{.19
{.09
{.17
{.12
{.09
{.16
{.13
{.25
{.14
{.12
{.21
{.07
{.28
{.09
.91}
.88}
.77}
.81}
.88}
.78}
.78}
.87}
.76}
.86}
.90}
.71}
.91}
.83}
.88}
.91}
.84}
.87}
.75}
.86}
.88}
.79}
.93}
.72}
.91}
Suppose that the relative frequencies for both Y1 and Y2 pertain as in Table 4 (i.e., with the »0 values having zero relative frequencies), except that the values for the original object
u D 25 are replaced by the new relative frequencies for Y1 of
(.25, .00, .41, .14, .20). We refer to this as object u D 25¤ . We
consider Y1 and u D 25¤ . In the presence of the rule v, the particular descriptions x D .Y1 ; Y2 / 2 f.»0 , no), (»1 , yes), (»2 , yes),
(»3 , yes), (»4 , yes)g can occur. Thus the relative frequencies for
each of the possible »k values must be adjusted.
Table 5 displays the apparent relative frequencies for each
of the possible description vectors before any rules have been
invoked. We consider the relative frequency for »2 (solid fuels). Under the rule v, the adjusted relative frequency for object u D 25¤ is ¼ Y1 .»2 I 25¤ / D .:3731/=[:0225 C ¢ ¢ ¢ C :1820] D
:5292: Hence the observed frequency for »2 over all objects under the rule v is OY1 .»2 / D :0637 C ¢ ¢ ¢ C :0000 C
:5292 D 1:5902: Similarly, we calculate under v that OY1 .»0 / D
3:8519; O Y1 .»1 / D 15:2229; OY1 .»3 / D 2:9761; and OY1 .»4 / D
1:3589: Likewise, we can Ž nd the observed frequencies for Y2 .
Hence the relative frequencies are rel freq .Y1 / : [.»0 ; :154/,
.»1 ; :609/; .»2; :064/; .»3 ; :119/; .»4 ; :054/] and rel freq .Y2 / :
[.no; :154/; .yes; :846/]. Therefore, over all regions, 60.9% of
households use gas, 6.4% use solid fuel, 11.9% use electricity, and 5.4% use other fuels to operate their central heating
systems, whereas 15.4% do not have central heating.
5.5 Interval-Valued Modal Variables
Suppose that the random variable of interest, Z ´ Y for object
u, u D 1; : : : ; m, takes values on the intervals »uk D [auk ; buk /
with probabilities puk , k D 1; : : : ; su . One manifestation of such
data is when (alone, or after aggregation) an object u assumes
a histogram as its observed value of Z. An example of such a
dataset is that of Table 6, in which the data (such as might be
obtained from Table 1 after aggregation) display the histogram
of weight of women by age group (those in their 20s to those
in their 80s and older). Thus, for example, we observe that for
women in their 30s (object u D 2), 40% weigh between 116 and
124 pounds.
By analogy with ordinary interval-valued variables, it is assumed that within each interval [auk ; buk / each individual description vector x 2 vir.du / is uniformly distributed across that
interval. Therefore, for each »k ,
8
»k < auk
0;
>
< » ¡a
k
uk
Pfx · »k jx 2 vir.du /g D
; auk · »k < b uk (21)
>
: buk ¡ a uk
k ¸ b uk .
1;
A histogram of all of the observed histograms can be constructed as follows. Let I D [mink;u2E aku ; maxk;u2E bku ] be the
Table 5. Apparent Relative Frequencies for u D 25 ¤
Y2 nY1
No
Yes
Total
Total
|v (x)D 1
»0
»1
.0225 .0000a
.2275a .0000
.2500 .0000
.0319 .0000
»2
»3
.0369a .0126a
.3731 .1274
.4100 .1400
.5292 .1807
¤ See text. a Invalidated when rule v is invoked.
»4
Total
.0180a .0900
.1820 .9100
.2000
.2582
Total
|v ( x ) D 1
.0319
.9681
478
Journal of the American Statistical Association, June 2003
Table 6. Histogram for Weight by Age Group
Table 7. Histogram for Weights (over all ages)
Age
(years)
u
Weights
g
Ig
Observed
frequency
Relative
frequency
(d)
20–29
1
30–39
2
40–49
3
50–59
4
60–69
5
1
2
3
4
5
6
7
8
9
10
[60–75)
[75–90)
[90–105)
[105–120)
[120–135)
[135–150)
[150–165)
[165–180)
[180–195)
[195–210]
:00714
:04286
:24179
:72488
1:27470
1:67364
1:97500
:88500
:13500
:04000
.0010
.0061
.0345
.1036
.1821
.2391
.2821
.1264
.0193
.0057
.0215
.1133
.1890
.2411
.2835
.1264
.0193
.0057
70–79
6
80–89
7
{[70 – 84), .02; [84 – 96), .06; [96 – 108), .24; [108 – 120),
.30; [120 – 132), .24; [132 – 144), .06; [144 – 160], .08}
{[100 – 108), .02; [108 – 116), .06; [116 – 124), .40;
[124 – 132), .24; [132 – 140), .24; [140 – 150], .04}
{[110 – 125), .04; [125 – 135), .14; [135 – 145), .20;
[145 – 155), .42; [155 – 165), .14; [165 – 175), .04;
[175 – 185], .02}
{[100 – 114), .04; [114 – 126), .06; [126 – 138), .20;
[138 – 150), .26; [150 – 162), .28; [162 – 174), .12;
[174 – 190], .04}
{[125 – 136), .04; [136 – 144), .14; [144 – 152), .38;
[152 – 160), .22; [160 – 168), .16; [168 – 180], .06}
{[135 – 144), .04; [144 – 150), .06; [150 – 156), .24;
[156 – 162), .26; [162 – 168), .22; [168 – 174), .14;
[174 – 180], .04}
{(100 – 120), .02; [120 – 135), .12; [135 – 150), .16;
[150 – 165), .24; [165 – 180), .32; [180 – 195), .10;
[195 – 210], .04}
interval that spans all of the observed values of Z in X , and
let I be partitioned into r subintervals, Ig D [»g¡1 ; »g /, g D
1; : : : ; r ¡ 1, and I r D [»r¡1 ; »r ]. Then the observed frequency
for the interval Ig is
X
O Z .g/ D
¼Z .gI u/;
(22)
u2E
where
¼Z .gI u/ D
X kZ.kI u/ \ Ig k
p uk ;
kZ.kI u/k
(23)
k2Z.g/
where the set Z.g/ represents all those intervals Z.kI u/ ´
[a uk ; buk / that overlap with Ig , for a given u. Thus each term
in the summation in (23) represents that portion of the interval
Z.kI u/ that is spanned by I g and hence that proportion of its observed relative frequency .puk / that pertains to the overall histogram interval I g . Note that when kZ.kI u/ \ Ig k=kZ.kI u/k D
k0k=k0k (as, e.g., when an interval is a classical point value)
this term takes the value
Pr 1; Section 6.3 provides an example.
It follows that
gD1 OZ .g/ D m: Hence the relative frequency for the interval Ig is pg D OZ .g/=m: The set of values f.p g ; Ig /; g D 1; : : : ; rg together represents the relative frequency histogram for the combined set of observed histograms.
The empirical density function can be derived from
s
u
Iuk .» /
1 XX
f .» / D
p uk ; » 2 <;
m
kZ.uI k/k
(24)
u2E kD1
where Iuk .¢/ is the indicator function that » is in the interval
Z.uI k/. The symbolic sample mean becomes
( s
)
u
1 X X
S
ZD
.buk C auk /p uk ;
(25)
2m u2E
kD1
and the symbolic sample variance is
(s
)
u
1 X X 2
2
2
S D
.b uk C buk a uk C auk /p uk
3m u2E
kD1
¡
1
4m 2
(
su
XX
u2E kD1
and, likewise, ¼.5I 2/ D :530; ¼.5I 3/ D :153; ¼.5I 4/ D :180,
¼.5I 5/ D :036; ¼.5I 6/ D :000; and ¼.5I 7/ D :120: Hence,
from (22), the “observed relative frequency” for the interval
I5 is O.5/ D 1:2747, and the relative frequency is p5 D :1821:
The complete set of histogram values, fpg ; Ig ; g D 1; : : : ; 10g,
is given in Table 7. Likewise, from (25) and (26), we have
S
Z D 143:9 and S 2 D 447:5.
It is implicit in the formula (21)–(26) that the rules v
hold. Suppose that the data of Table 6 represented (instead
of weights) costs (in $) involved for a certain procedure at
seven different hospitals. Suppose further that there is a minimum charge of $100. This translates into a rule v : fY <
100g ) fpk D 0g: The data for all hospitals (objects) except the Ž rst one satisfy this rule. However, some values for
the Ž rst hospital do not. Hence, an adjustment to the overall relative frequencies is needed to accommodate this rule.
Let us assume that the histogram now spans r D 8 intervals
[100–105/; [105–120/; : : : ; [195–210]. The relative frequencies
for these data after this rule is taken into account are given in
Table 7, column(d).
6. DESCRIPTIVE BIVARIATE STATISTICS
Many of the principles developed for the univariate case can
be expanded to a general p-variate case, p > 1. In particular,
this permits derivation of dependence measures. Thus, for example, calculation of the covariance matrix aids the development of methodologies in, for example, principal component
analysis, discriminant analysis, and cluster analysis. Recently,
Billard and Diday (2000, 2002a) extended these methods to Ž t
multiple linear regression models to interval-valued data and
histogram data. Different but related ideas can be extended to
enable the derivation of resemblance measures such as Euclidean distances, Minkowski or Lq distances, and Mahalanobis
distances.
6.1 Multivalued Variables
)2
.buk C auk /puk
To illustrate, we take the data of Table 6 and construct
the histogram on the r D 10 intervals [60–75/; [75–90/; : : : ,
[195–210]. Take the g D 5th interval I5 D [120–135/. Then,
from Table 6, it follows from (23) that
³
´
³
´
132 ¡ 130
135 ¡ 132
¼.5I 1/ D
.:24/ C
.:06/ D :255
132 ¡ 120
144 ¡ 132
: (26)
Some DeŽ nitions. We Ž rst consider multivalued variables.
Analogously to the deŽ nition of the observed frequency of spe-
Billard and Diday: Symbolic Data Analysis
479
ciŽ c values for a single variable given in (9), we deŽ ne the observed frequency that .Z1 D »1 ; Z2 D »2 / by
X
OZ1 ;Z2 .»1 ; »2 / D
¼Z1 ;Z2 .»1 ; »2 I u/;
(27)
u2E
where
discussion of the role of copulas in symbolic data will be presented elsewhere.
We deŽ ne the empirical joint density function for .Z1 ; Z2 / as
1 X I u .»1 ; »2 /
f .»1 ; »2 / D
;
(31)
m
kZ 0 .u/k
u2E
jfx 2 vir.du /jxZ1 D »1 ; xZ2 D »2 gj
¼Z1 ;Z2 .»1 ; »2 I u/ D
j vir.du /j
(28)
is the percentage of the individual description vectors x D
.x1 ; x2 / in vir.du / for which .Z1 D »1 ; Z2 D »2 /. Note that ¼.¢/
is a real number on <, in contrast to its being
P a positive integer for classical data. We can show that »1 2Z1 ;»2 2Z2 OZ1 ;Z2
.»1 ; »2 / D m.
We now deŽ ne the empirical joint frequency distribution of
Z1 and Z2 as the set of pairs [» ; OZ1 Z2 .» /], where » D .»1 ; »2 /
for » 2 Z D Z1 £ Z2 , and we deŽ ne the empirical joint histogram Z1 and Z2 as the graphical set [» ; m1 OZ1 Z2 .» /].
When Z1 and Z2 are quantitativesymbolic variables, the symbolic sample covariance function between the symbolic quantitative multivalued variables Z1 and Z2 is given by
µ X
¶
1
SZ1 ;Z2 ´ S12 D
.»1 £ »2 /OZ1 ;Z2 .»1 ; »2 / ¡ .S
Z1 /.S
Z2 /;
m
»1 ;»2
(29)
Zi , i D 1; 2; are the symbolic sample means of the uniwhere S
variate variables Zi , i D 1; 2; deŽ ned in (12). The symbolic sample correlation function between the multivalued symbolic variables Z1 and Z2 is given by
¯q 2
r.Z1 ; Z2 / D S Z1 ;Z2
.S Z1 /.S 2Z2 /;
(30)
where S 2Z1 , i D 1; 2, are the respective univariate symbolic sample variances deŽ ned in (12).
An Example. We illustrate these statistics with the cancer
data given in Table 3. We have that OZ1 ;Z2 .»1 D 0; »2 D 0/ ´
O.0; 0/, say, is
0 1 0 0 1 0 0 0 4
C C C C C C C D ;
1 3 1 2 1 2 2 2 3
where, for example, ¼.0; 0I 1/ D 0=1, because the individual
vector (0,0) occurs no times out of a total of j vir.d 1 /j D 1
vectors in vir.d 1 /. Likewise, ¼.0; 0I 2/ D 1=3, because (0,0)
is one of three individual vectors in vir.d2 /, and so on, and
where the summation does not include the u D 5 term because j vir.d5 /j D 0. Similarly, we can show that O.0; 1/ D
0, O.0; 2/ D 0, O.0; 3/ D 0, O.1; 0/ D 1=3, O.1; 1/ D 4=3,
O.1; 2/ D 5=2; and O.1; 3/ D 5=2: Hence the symbolic sample
covariance is, from (29), S12 D :288: Therefore, the symbolic
correlation function is, from (30), r.Z1 ; Z2 / D :748:
O.0; 0/ D
6.2 Interval-Valued Variables
Some DeŽ nitions. For interval-valued variables, we suppose that the speciŽ c variables of interest, Z1 and Z2 , have observations on the rectangle Z.u/ D Z1 .u/ £ Z2 .u/ D .[a1u ; b 1u ],
[a2u ; b2u ]/ for each u 2 E. As before, we make the intuitive
choice of assuming individual vectors x 2 vir.du / are each uniformly distributed over the respective intervals Z1 .u/ and Z2 .u/.
Therefore, the joint distribution of .Z1 ; Z2 / is a copula C.z1 ; z2 /
(see Nelsen 1999; Schweizer and Sklar 1983). An expanded
where Iu .»1 ; »2 / is the indicator function that .»1 ; »2 / is or is
not in the rectangle Z 0 .u/ and where kZ 0 .u/k is the area of
this rectangle. Analogously with (16), we can Ž nd the joint
histogram for Z1 and Z2 by graphically plotting fRg1 g 2 ; pg 1 g2 g
over the rectangles Rg1 g2 D f[»1;g1 ¡1 ; »1g1 / £ [»2;g2 ¡1 ; »2g2 /g,
g1 D 1; : : : ; r1 , g 2 D 1; : : : ; r2 , where
1 X kZ 0 .u/ \ Rg1 g2 k
pg1 ;g2 D
I
(32)
m
kZ 0 .u/k
u2E
that is, pg1 g2 is the probability an arbitrary individual description vector lies in the rectangle Rg1 g2 . Then, if the “volume” on
the rectangle Rg1 g 2 of the joint histogram represents the probability, its height would be
fg1 g2 D [.»1g1 ¡ »1;g1 ¡1 /.»2g2 ¡ »2;g2 ¡1 /]¡1 pg1 g2 :
(33)
The symbolic sample covariance function is obtained analogously to the derivation of the mean and variance for a univariate interval-valued variable, that is,
Z 1
.»1 ¡ S
Z1 /.»2 ¡ S
Z2 /f .»1 ; »2 / d»1 d»2
cov.Z1 ; Z2 / D SZ1 Z1 D
¡1
1
1X
D
m
.b1u ¡ a1u /.b2u ¡ a2u /
u2E
ZZ
£
»1 »2 d»1 d»2 ¡ S
Z1 S
Z2
.» 1 ;»2 /2Z.u/
1 X
D
.b1u C a1u /.b2u C a 2u /
4m u2E
µ
¶µX
¶
1 X
¡
.b1u C a1u /
.b2u C a2u / :
4m2 u2E
u2E
(34)
Hence the symbolic sample correlation function is deŽ ned by
(30), where SZ1 Z2 is given in (34) and S 2Z1 , i D 1; 2, are given
in (17). As before, the summation in (31) is only over those u
for which vir.du / is not empty.
An Example. We illustrate these concepts with the data of
Table 8, from Raju (1997). In particular, we consider the systolic and diastolic blood pressure levels, Y2 and Y3 . Suppose
further that there is a logical rule specifying that the diastolic blood pressure must be less than the systolic blood pressure, that is, Y2 .u/ · Y1 .u/. The observations u D 1; : : : ; 10,
all satisfy this rule. However, had there been another observation represented in the table by the u D 11 data point, then the
rule v would be violated, because Y2 .11/ > Y1 .11/. Therefore,
j vir.d 11 /j D 0, and so subsequent calculations would not include this observation.
Suppose that we want to Ž nd the joint histogram for Y1
and Y2 . We want to construct the histogram on the rectangles Rg2 g3 D f[»2;g2 ¡1 ; »2;g2 / £ [»3;g3 ¡1 ; »3;g3 /g, g2 D 1; : : : ; 10,
480
Journal of the American Statistical Association, June 2003
Table 8. Symbolic Data: Interval-Valued Variables
value .»1 ; »2 / is
u
Y1 , pulse
rate (bpm)
Y2 , systolic
pressure (mm Hg)
Y3 , diastolic
pressure (mm Hg)
1
2
3
4
5
6
7
8
9
10
11
[44–68]
[60–72]
[56–90]
[70–112]
[54–72]
[70–100]
[72–100]
[76–98]
[86–96]
[86–100]
[63–75]
[90–110]
[90–130]
[140–180]
[110–142]
[90–100]
[134–142]
[130–160]
[110–190]
[138–180]
[110–150]
[60–100]
[50–70]
[70–90]
[90–100]
[80–108]
[50–70]
[80–110]
[76–90]
[70–110]
[90–110]
[78–100]
[140–150]
g3 D 1; : : : ; 6, where the sides of these rectangles are the intervals given in Table 9.
Then, from (32), we can calculate the probability pg2 g3
that an arbitrary description vector lies in the rectangle Rg2 g3 .
For example, for g2 D 6, g3 D 4,
p64 D Pfx 2 [140; 150/ £ [80; 90/g
»
1
2 10 10 10
2 10
D
¢
C0C ¢
C
¢
0C0C0C
10
32 28
8 30 30 14
¼
10 10
10 10
C
¢
C0C
¢
D :04886:
80 40
40 22
Table 9 provides all of the probabilities pg2 g3 . The table also
gives the marginal totals that correspond to the corresponding
probabilities pgj , j D 2; 3, of the univariate case, obtained from
(16). A plot of the joint histogram is given in Figure 1, where
the heights fg2 g3 are calculated from (33).
(s
)
s 2u
1u X
p1uk1 p2uk2 Iuk1 ;k2 .»1 ; »2 /
1X X
f .»1 ; »2 / D
; (35)
m
kZk1 k2 .u/k
u2E
where kZk1 k 2 .u/k is the area of the rectangle Zk 1 k2 .u/ D
[a 1uk1 ; b1uk1 / £ [a2uk2 ; b 2uk2 / and Iuk1 k2 .»1 ; »2 / is the indicator variable that the point .»1 ; »2 / is (is not) in the rectangle
Zk1 k2 .u/. Analogously with (23) and (32), the joint histogram
for Z1 and Z2 is found by plotting fRg1 g2 ; p g1 g2 g over the rectangles Rg1g2 D f[»1; g1¡1 , »1g1 / £ [»2; g2 ¡1 , »2g2 /g, g1 D 1; : : : ; r1 ;
g2 D 1; : : : ; r2 with
p g1 g2 D
1 X X
m
X
u2E k1 2Z.g1 / k2 2Z.g2 /
kZ.k1 ; k2 I u/ \ Rg1 g2 k
kZ.k1 ; k2 I u/k
£ p1uk1 p2uk2 ; (36)
where Z.gj / represents all of the intervals Z.k1 ; k2 I u/ ´ [ajukj ;
bjukj /, j D 1; 2, which overlap with the rectangle Rg1 g2 for each
given u value.
If we assume that .Z1 ; Z2 / values are uniformly distributed
across the rectangles f»1uk £ »2uk g, we can then derive the symbolic sample covariance function as
(s
s2u
1u X
1 X X
p1uk1 p2uk2 .b1uk1 C a1uk1 /
cov.Y1 ; Y2 / D
4m
u2E k1 D1 k2 D1
)
£ .b2uk2 C a2uk2 /
6.3 Modal-Valued Variables
Some DeŽ nitions. We develop corresponding results for
quantitativemodal-valuedvariables where the data are recorded
as histograms. Suppose that interest centers on the two speciŽ c
variables Z1 and Z2 . Suppose that for each object u, each variable Zj .u/ takes values on the subintervals »juk D [ajuk ; b juk /
with relative frequency pjuk , k D 1; : : : ; sju , j D 1; 2, and u D
Psju
pjuk D 1. When sju D 1 and p juk D 1 for all
1; : : : ; m, with kD1
j, u, and k values, the data are interval-valued realizations.
Then, by extending the derivations of Section 6.2, we can
show that the empirical joint density function for .Z1 ; Z2 / at the
k1 D1 k 2 D1
" (s
)#
1u
1 X X
¡
p1uk1 .b 1uk1 C a1uk1 /
4m2 u2E k D1
1
£
" (s
2u
X X
u2E
k1 D2
)#
p2uk2 .b 2uk2 C a2uk2 /
:
(37)
The symbolic mean and symbolic variance for each of Z1 and
Z2 can be obtained from (25) and (26).
Table 9. Bivariate Histogram for Blood Pressure Data
g2
1
2
3
4
5
6
7
8
9
10
g3
Ig2 nIg3
[90, 100)
[100, 110)
[110, 120)
[120, 130)
[130, 140)
[140, 150)
[150, 160)
[160, 170)
[170, 180)
[180, 190]
pg3 £ 102
1
[50, 60)
2
[60, 70)
3
[70, 80)
7:500
2:500
0
0
0
0
0
0
0
0
10:000
7:500
2:500
0
0
0
0
0
0
0
0
10:000
1:250
1:250
1:790
1:790
1:492
1:492
1:265
:313
:313
:313
11:266
pg2 g3 £ 10 2
4
[80, 90)
5
[90, 100)
6
[100, 110]
1:250
1:250
3:815
3:815
7:446
4:886
2:693
:313
:313
:313
26:093
0
0
2:565
2:565
5:303
6:196
4:003
4:003
4:003
:313
28:950
0
0
1:205
1:205
3:943
2:515
1:503
1:503
1:503
:313
13:690
pg2 £ 10 2
17:500
7:500
9:375
9:375
18:185
15:089
9:464
6:131
6:131
1:250
Billard and Diday: Symbolic Data Analysis
481
Figure 1. Bivariate Histogram: Blood Pressures.
An Example. The data of Table 10 record histogram hematocrit .Y1 / values and hemoglobin .Y2 / values for each of
m D 5 objects. Suppose that we want to calculate the joint
histogram function over the rectangles Rg1 g2 D f[30; 35/ £
[11; 12/; : : :; [50; 55/ £[16; 17]g, that is, r1 D 5 and r2 D 6. The
terms p g1g 2 are calculated from (36); for example,
p21 D Pfx 2 [35; 40/ £ [11; 12/g
»µ
¶
1 2:8 £ :5
1:8 £ :5
D
.:7/.:4/ C 0 C
.:3/.:4/ C 0
5 4:6 £ :6
1:8 £ :6
¼
C0C0C0C0
D :04841:
This also applies to the other terms shown in Table 11.
7. PRINCIPAL COMPONENTS ANALYSIS
A principal components analysis is designed to reduce
p-dimensional observations into s-dimensional (where usually
s < p) components. Cazes, Chouakria, Diday, and Schektman
(1997), Chouakria (1998), and Chouakria et al. (1999) have developed a method of conducting principal components analysis on symbolic data for which each symbolic variable Yj ,
j D 1; : : : ; p, takes interval values »u D [a uj; b uj ], say, for each
object u D 1; : : : ; m, and where each object represents nu individuals; for simplicity, we take n u D 1.
Each symbolic data point is represented by a hyperrectangle
with 2p vertices. Thus each hyperrectangle can be represented
Table 10. Bivariate Histograms: Hematocrit and Hemoglobin Data
u
Y1 , Hematocrit
Y2 , Hemoglobin
1
2
{[33.2–37.8), .7; [37.8–39.6], .3}
{[36.3–38.1), .2; [38.1–47.4], .8}
3
{41.5–44.9), .5; [44.9–48.8], .5}
4
{[44.4–47.1), .4; [47.1–52.5], .6}
5
{[39.3–51.5], 1.0}
{[11.5–12.1), .4; [12.1–12.8], .6}
{[12.3–14.2), .5; [14.2–14.9), .4;
[14.9–15.2], .1}
{[14.3–14.8), .5; [14.8–15.5], .4;
15.5, .1}
{[15.3–15.7), .3; [15.7–16.4), .5;
[16.4–16.7], .2}
{[13.7–14.5), .5; [14.5–15.2), .3;
[15.2–16.5], .2}
by a 2 p £ p matrix M u with each row containing the coordinate
values of a vertex Rk , k D 1; : : : ; 2p , of the hyperrectangle.Then
a .m ¢ 2p £p/ matrix M is constructed of the fMu ; u D 1; : : : ; mg,
that is,
M D .M1 ; : : : ; Mm /0
02a
: : : a1p 3
11
::
5
D @4
:
b 11
:::
b1p
2a
m1
::: 4
bm1
:::
::
:
:::
a mp 3 1 0
5A :
b mp
For example, if p D 2, then the data »u D .[au1 ; bu1 ], [au2 ; b u2 ]/
are transformed to the 2 p £ p D 22 £ 2 matrix
"
#0
au1 au1 bu1 bu1
;
Mu D
au2 bu2 au2 bu2
and likewise for M.
The matrix M is now treated as though it represents classical p-variate data for n D m ¢ 2p individuals. Chouakria (1998)
has shown that the basic theory for a classical analysis carries
through; hence a classical principal components analysis can be
applied. If observations have weights pi ¸ 0, then in this transformation of the symbolic to the classical matrix M, each vertex of the hyperrectangle now has the same weight .pi 2¡p /. Let
Y1¤ ; : : : ; Ys¤ , s · p, denote the Ž rst “numerical” principal components with associated eigenvalues ¸1 ¸ ¢ ¢ ¢ ¸ ¸s ¸ 0 that result from this analysis. We then construct the interval principal
components Y1I ; : : : ; YsI as follows.
Let Lu be the set of row indices in M identifying the vertices of the hyperrectangle Ru ; that is, Lu represents the rows
of Mu describing the symbolic data »u¢ . For each k D 1; : : : ; 2p ,
in Lu , let ykv be the value of the numerical principal component Yu¤ , v D 1; : : : ; s, for that row k. Then the interval principal
component YvI for object u, is given by yuv D [yauv ; ybuv ], where
yauv D mink2Lu .ykv / and ybuv D maxk2Lu .ykv /.
Then plotting gives us the data represented as hyperrectangles of principal components in s-dimensional space. Thus, taking v D 1; 2, we have each object u D 1; : : : ; m, represented by
a rectangle with axes corresponding to the Ž rst two principal
components.
482
Journal of the American Statistical Association, June 2003
Table 11. Bivariate Joint Histogram: Hematocrit and Hemoglobin Data
g1
1
2
3
4
5
g2
Ig1 n Ig2
[30–35)
[35–40)
[40–45)
[45–50)
[50–55)
pg2 £ 102
pg1 g2 £ 10 2
1
[11, 12)
2
[12, 13)
3
[13, 14)
4
[14, 15)
1:826
4:841
0
0
0
6:667
3:652
11:020
1:585
:761
0
17:018
0
2:128
3:801
2:623
:461
9:013
0
4:139
14:799
12:310
1:295
32:543
As an alternative to using the vertices of the hyperrectangles Ru as before, the centers of the hyperrectangles could be
used. In this case, each object data point »u D .[au1 ; bu1 ]; : : : ;
[aup ; bup ]/ is transformed to xcu D .xcu1 ; : : : , xcup /, u D 1; : : : ; m,
where xcuj D .auj C b uj /=2, j D 1; : : : ; p. Thus the symbolic data
matrix X has been transformed to a classical m £ p matrix Xc
with classical variables Y1c ; : : : ; Ypc , say.
Then the classical principal components analysis is applied
to the classical data Xc . The vth center
Ppprincipal component for
object u is, for v D 1; : : : ; s, ycuv D jD1 .xcuj ¡ xN cj /wuv , where
P
c
the mean of the values for Yj is xN cj D m1 . m
uD1 xuj / and where
wv D .w1v ; : : : ; wpv / is the vth eigenvector. Because each centered coordinate xcuj lies in the interval [auj ; buj ] and because the
principal components are linear functions of xcuj , we can obtain
cb
the interval principal components as [yca
uv ; yuv ], where
yca
uv D
ycb
uv D
p
X
jD1
p
X
jD1
min
auj ·xrij ·buj
.xruj ¡ xN cj /wuv ;
max .xruj ¡ xN cj /wuv :
auj ·xrij ·buj
The methodology for the centers method is illustrated with
the interval data of Table 12, extracted from U.S. Census data.
There are m D 9 objects representing classes/regions of the
United States described by p D 6 symbolic variables; specifically, Y1 D median household income, Y2 D average income
per household member, Y3 D percentage of people covered by
health insurance, Y4 D percentage of population over age 25
who have completed high school, Y5 D percentage of population over 25 years who have earned at least a bachelor’s degree,
and Y6 D travel time (in minutes) to work. For simplicity, we
restrict the analysis to the Ž rst three variables, Yj , j D 1; 2; 3;
only. The s D 3 .Dp/ eigenvalues (together with the percentage
5
[15, 16)
0
:724
7:155
12:259
3:371
23:509
6
[16, 17]
0
:088
1:494
6:783
2:888
11:253
pg1 £ 10 2
5:478
22:940
28:834
34:736
8:015
of the total sum of squares contributed by the respective principal components)are ¸1 D 1:909 (63.64%), ¸2 D :914 (30.46%),
and ¸3 D :177 (5.91%). Thus the Ž rst and second components
together explain 94.10% of the variation. The results are plotted in Figure 2, and the distinct clusters are evident, with the
SouthE and SouthW regions forming one cluster, the Mountain
and SouthAtl regions forming a second cluster, New England
standing on its own as a third grouping, and these latter two
clusters spanned by a cluster consisting of the remaining four
regions (MidWestE, MidAtl, MidWestW, and the PaciŽ c). Executing the same analysis but with the vertices approach gives
similar results, as shown by Bock and Diday (2000).
Thus far in this section, we have assumed that the object u is
a class of size 1. However, more generally, it may be a class of
size nu , with nu C ¢ ¢ ¢ C nm D m¤ . The methods carry through
readily.
Principal components as a method is designed to reduce
the dimension of the data space to s < p. Dimensionalityreduction methods for interval data have also been considered
by Ichino (1988) and Ichino and Yaguchi (1994) using generalized Minkowsky metrics, and by Nagabushan, Gowda, and
Diday (1995) using Taylor series ideas.
Currently, these principal components methods are limited to
interval-valued symbolic data. Symbolic principal components
analysis for other types of symbolic data, such as for modal
variables (as would appear, e.g., in pixel satellite or tomography
data) remains an outstanding problem.
8. SYMBOLIC CLUSTERING
In this section we consider clustering methods for symbolic
data. Our aim is to classify the objects in Ä into clusters (or
classes) C1 ; : : : ; Cm , which are internally as homogeneous as
possible and externally as distinct from each other as possible.
This process is distinct from the construction of classes procedure discussed in Section 3, in which preassigned criteria were
Table 12. Regional ProŽ les: Demographics
Region
NewEng
MidAtl
MidWestE
MidWestW
SouthAtl
SouthE
SouthW
Mountain
PaciŽ c
Y1 , MedHouse$
Y2 , MeanPerson$
Y3 , Health
Y4 , EducHigh
Y5 , EducBach
Y6 , TravelTime
[45.72, 47.74]
[43.69, 45.16]
[44.03, 45.23]
[43.85, 45.52]
[40.68, 41.60]
[33.22, 35.14]
[35.98, 36.95]
[41.92, 43.41]
[45.22, 46.56]
[23.58, 24.84]
[22.66, 23.37]
[21.22, 21.86]
[22.29, 23.47]
[22.34, 23.07]
[18.74, 19.89]
[19.03, 19.82]
[20.21, 21.25]
[22.38, 23.21]
[88.50, 94.10]
[84.80, 92.40]
[86.50, 92.90]
[88.50, 91.30]
[82.70, 88.10]
[86.50, 89.70]
[78.50, 89.40]
[81.50, 88.70]
[76.20, 89.90]
[81.30, 90.00]
[77.20, 87.30]
[84.60, 90.80]
[88.10, 90.40]
[79.20, 84.00]
[77.50, 79.90]
[79.20, 86.60]
[85.50, 91.80]
[81.20, 91.80]
[24.10, 31.60]
[15.30, 38.30]
[17.10, 31.20]
[24.60, 27.30]
[19.00, 23.20]
[20.40, 22.00]
[18.40, 26.20]
[20.00, 34.60]
[19.30, 28.60]
[18.00, 21.90]
[20.00, 28.60]
[18.30, 25.10]
[15.80, 17.20]
[19.80, 22.70]
[20.70, 21.50]
[19.00, 22.30]
[13.00, 20.70]
[16.70, 24.60]
Billard and Diday: Symbolic Data Analysis
483
Figure 2. Principal Components Analysis: Regional ProŽ les.
selected for class membership. In contrast, the traditional clustering problem seeks to Ž nd “natural” groupings from the data.
There is, however, a link between the two procedures, as we
discuss later. Here we follow the approach of Chavent (1998)
for criterion-based divisive clustering.
Chavent (1998) distinguished between two types of sets
of variables. The Ž rst type deals with quantitative variables
Y1 ; : : : ; Yk , in which Yj takes a single (classical) value or Yj
takes an interval value Yj .u/ D [a; b] ½ <, for u 2 Ä. Or the
Y1 ; : : : ; Yp can be categorical-type variables; that is, they can
be either (classical) ordinal values with Yj .u/ 2 Yj , multivalued
variable with Yj .u/ ½ Yj and Bj D P.Yj /, or modal variables
where Yj .u/ D ¼j is a probability or frequency distribution on
Yj , and Bj is the set of all probability distributions on Yj . However, the Y1 ; : : : ; Yp cannot consist of a mixture of quantitativetype and categorical-type variables.
Let D D .duv / be the n £ n matrix of distance measures between objects u, v 2 Ä. First, we take quantitative-type variables and assume that all Yj are interval valued. (Adjustment
for the case where some Yj are classical variables is straightforward.) Suppose that we have the intervals »u D [auj ; b uj ] and
»v D [a vj ; bvj ], j D 1; : : : ; p, u; v 2 Ä. We seek a distance function ±j .u; v/ between objects u and v. A number of possible such
functions can be used in the clustering process.
We can deŽ ne a symbolic Hausdorff distance for Yj as
±j .u; v/ D maxfjauj ¡ a vj j; jb uj ¡ bvj jg:
(38)
If we take a distance function,
Á p
!1=2
X
2
d.u; v/ D
;
[± j .u; v/]
be normalized to
0
d .u; v/ D
( p
X
)1=2
2
[m¡1
j ±j .u; v/]
for suitable choices of mj . Chavent (1998) suggested
m2j D .2n2 /¡1
n X
n
X
[±j .u; v/]2 ; or
uD1 vD1
mj D length of the domain of Yj :
When the variables are categorical, we Ž rst construct a symbolic frequency table as follows. Suppose that Yj can take pj
possible values. If Yj is modal, then we already have that the
probability
of obtaining the kth possibility is ¼jk , k D 1; : : : ; p,
P
¼
D
1.
Each possible value is “labeled” as a new variable
k jk
Yjk , k D 1; : : : ; p j . For a multivalued variable Yj .u/, we assume
that the observed categories are uniformly distributed on Yj .u/
and the probability for those possibilities not observed is 0.
The required frequency table then represents an n £ t matrix of
classical-type data X D .fuj /, where u D 1; : : : ; n, j D 1; : : : ; t,
where t D p 1 C ¢ ¢ ¢ C pp is the number of “variables.” The entries fuj are real numbers and not necessarily integers (as would
be the case for classical data).
Then the distance between two objects u and v in Ä is given
by
#
" t
X 1 ³ puj pvj ´2 1=2
d.u; v/ D
¡
;
(41)
p:j pu: pv:
jD1
where
jD1
and use the speciŽ c ±j .¢/ of (38), we then have
Á p
!1=2
X
2
d.u; v/ D
:
[maxfjauj ¡ b vj j; jb uj ¡ bvj jg]
(40)
jD1
puj D fuj =np;
(39)
jD1
Notice that d.u; v/ in (39) reduces to the Euclidean distance on
<p when all Yj are classical variables. The distance in (38) can
pu: D
t
X
jD1
puj ;
p:j D
n
X
puj:
uD1
For example, suppose that we have the symbolic data of Table 13, which shows data for three objects relating to a cancer
prognosis Y1 (poor, medium, and good) and level of cigarette
smoking Y2 (below average, average, and above average). Notice that object u D 3 is in fact a classical data point. Here Y1 is
484
Journal of the American Statistical Association, June 2003
Table 13. Categorical Symbolic Data
u
Y1 , cancer prognosis
Y2 , smoking
1
2
3
{Medium (.06), good (.4)}
{Poor (.2), medium (.6), good (.2)}
{Good (1)}
{Below average, average}
{Below average}
{Average}
modal, but Y2 is categorical. The symbolic data X of Table 13
is transformed into classical data for six variables, as shown
in Table 14. Hence the distance between objects u D 1 and
v D 2 is, from (41), d.1; 2/ D :913; likewise, d.1; 3/ D 1:118
and d.2; 3/ D 1:812.
Another cluster partitioningcriteria is the following symbolic
version of the classical within-class variance criteria, or inertia,
given by
1 XX
I.Ci / D
wu wv [d.u; v/]2 ;
(42)
2¹i
u2Ci v2Ci
wu is the weight associated with object u and ¹i D
where
P
u2Ci wu . If we take wu D 1=n, then
X X
I.Ci / D .mn i /¡1
[d.u; v/] 2 ;
u; v2Ci u>v
where ni D jCi j is the size of the class Ci .
Suppose that at the mth stage of the clustering process
we have clusters .C1 ; : : : ; Cm /. Thus, initially when m D 1,
C1 ´ Ä. We wish to select which cluster Ci is to be partitioned
in two clusters, Ci1 and Ci2 , at the next .m C 1/th stage. This
is achieved by selecting an i that maximizes fI.Ci / ¡ I.Ci1 / ¡
I.Ci2 /g.
The cluster Ci is partitioned into these two subclusters, Ci1
and Ci2 , as follows. We choose the best partition where “best”
is speciŽ ed by the distance measure adopted [e.g., minimizing
I.Ci /]. Let the binary cut be speciŽ ed by the binary function qc ,
where we partition Ci into Ci1 and Ci2 according to C 1 : fu 2
Cjqc .u/ D trueg and C 2 : fu 2 Cjqc .u/ D falseg.
For interval data [au ; bu ], we deŽ ne qc for object u, for mu D
.au C bu /=2, by
»
true; if m u · c
qc .u/ D
false; if m u > c.
For modal variables (including categorical data after transformation to modal variable format), with data value Yj .u/ D ¼u ,
we deŽ ne q c by
X
8
¼u .x/ ¸ :5
true; if
>
>
<
x·c
qc .u/ D
X
>
>
¼u .x/ < :5,
: false; if
x·c
where the summation is over all individual x values, x · c.
Table 14. Classical Transformation of Table 13 Data
fuj
Poor
0
:2
f:j
0
:2
Y1 ,
medium
:6
:6
0
1 :2
Good
:4
:2
1
1:8
Below
average
Y2 ,
average
Above
average
:5
1
0
1 :5
:5
0
1
1:5
0
0
0
0
f u:
2
2
2
This is illustrated with the data of Table 12 using all six variables, and supposing that we wish to use the normalized Hausdorff distance measure of (40), with weight mj corresponding
to the length of the domain Yj . Thus m1 D 14:52, m 2 D 6:10,
m3 D 17:40, m 4 D 14:6, m5 D 23:00, and m 6 D 15:60. We have
the three binary functions as q 1 D Y1 · 41:90, q2 D Y3 · 89:80,
and q3 D Y4 · 84:375. The clusters of Figure 3 emerge.
In terms of the queries of Section 3, we observe, for instance,
that the cluster C3 D fMidWestE, Mountain, PaciŽ cg matches
the assertion that
C3 : [Y1 > 41:90] ^ [Y3 · 89:80] ^ [Y4 > 84:375]I
that is, the output of the clustering process gives a class (or
cluster) represented by the symbolic object for which the median household income exceeds $41,900, fewer than 89.80%
have health coverage and at least 84.375% completed high
school.
Chavent’s method is a divisive hierarchical clustering procedure that starts with all the objects in a single cluster and is
a form of monothetic divisive clustering method built by considering one variable at a time. Monothetic divisive methods
are not dissimilar from discriminant analysis methods, such as
the CART algorithm of Brieman, Friedman, Olshen, and Stone
(1984) or the ID3 algorithm of Quinlan (1986) (see also Périnel
and Lechevallier 2000 for speciŽ cs on symbolic discrimation
analysis). Monothetic divisive clustering for conceptual objects
was Ž rst introduced by Michalski, Diday, and Stepp (1981) and
Michalski and Stepp (1983).
In contrast with divisive methods, agglomerative methods
start with each object in Ä being a cluster of size 1, itself; with
the clustering algorithm developed to merge objects into large
classes. For symbolic objects, merging criteria revolve around
Ž nding those symbolic objects that are similar (as measured
by an appropriate similarity index) and relate to the assertions
described in Section 3. Building on Diday’s (1986) development of pyramid clusters for classical data, Brito (1994, 1995,
2000) gave an algorithm for developing pyramid clusters for
symbolic data, in which pyramidal clusters are deŽ ned as families of nested overlapping classes; that is, a class can belong
to two distinct clusters, in contrast to a pure hierarchical cluster
in which classes are distinct or are entirely contained within
another class. Brito and DeCarvalho (1999) extended this to
the case in which speciŽ c hierarchical rules exist, and DeCarvalho, Verde, and Lechevallier (1999) extended this to dependency rules. Polaillon (2000) developed pyramidal clusters for
interval data using Galois lattice reductions. DeCarvalho et al.
(1999) examined dynamic clustering of Boolean symbolic objects based on a content dependent proximity measure. Bock
and Diday (2000) provided a comprehensive coverage of these
approaches along with elucidating examples. With the possible
exception of Polaillon’s (1998) Galois lattice reduction theory,
results tend to be limited to the methods themselves, with theoretical justiŽ cations remaining as outstanding problems still to
be addressed.
9. PROBABILITIES IN SYMBOLIC DATA ANALYSIS
Probabilistic underpinning for a few of these symbolic data
analysis methods has been developed by Diday (1995), Di-
Billard and Diday: Symbolic Data Analysis
485
Figure 3. Divisive Clustering: Regional ProŽ les.
day and Emilion (1996), and Diday et al. (1996), primarily for
modal data associated with capacities and credibilities, and by
Emilion (2002) for mixture decompositions that underlie much
of the histogram methods.
Suppose that in general the data are arrayed in a table with
n rows and p columns with entries Yij . The rows are the
observations of a sample of size n from a random variable
Y D .Y1 ; : : : ; Yp / : Ä ! X 1 £ ¢ ¢ ¢ £ X p , where X j is the set
of random variables deŽ ned on the same domain, such that
Y.wi / D .Yi1 ; : : : ; Yip / for object wi . If instead of having this
general model, we focus on some descriptive property of the Yij
(e.g., a conŽ dence interval or empirical distribution density),
we obtain different kinds of nonnumerical data (i.e., we obtain
symbolic data). For instance, if we are interested in the probability distributions, then the modal data become
Y D .Y1 ; : : : ; Yp / : Ä ! P.V1 / £ ¢ ¢ ¢ £ P.Vp /;
where P.Vj / is the set of all probability measures on a domain
Dj with a ¾ algebra A. Hence, if Y.wi / D .¹i1 ; : : : ; ¹ip /, the
stochastic process Xj .¢/.A/ is a “random distribution,” taking
the terminology of some authors using Bayesian analyses (see,
e.g., Ferguson 1974).
Starting from this probabilistic framework, we have considered several approaches. For instance, suppose that we are interested in the probability of an event A for an easily given random variable Yj such that Pf[Y1j D A] [ ¢ ¢ ¢ [ [Ypj D A]g D F.A/.
It can be easily shown that F is a capacity. Diday (1995), in
addition to capacity, also studied and compared other measures (e.g., credibility, probability). Using special maps, such
as T norms and t conorms introduced by Schweizer and Sklar
(1983), Diday described a histogram of capacities that is subadditive and a histogram of credibilities that is superadditive,in
contrast to standard classical histograms of frequencies, which
are additive. Diday also proved the existence of a limit on the
height of these histograms as the interval lengths supporting
the histogram tend to 0. Thus, it is clear that the concepts intent
and extent (described in Sec. 3), as well as Galois lattices (not
described herein), of symbolic objects can be represented as
(partial or full) capacities or credibilities and hence by the relevant respective histograms. These results can also be applied
to special cases of principal components (the vertices method),
clustering (to pyramid clusters), and decision trees.
Bertrand and Goupil (2000) indicated that for interval-valued
symbolic data, using the law of large numbers, the true limiting
distribution of Z as m ! 1 is only approximated by the exact distribution f .» / in (15), because the derivation depends on
the veracity of the uniform distribution within each interval assumption.
Recently, clustering methods based on the estimation of mixtures of probability distribution where the observations are
probability distributions were considered by Emilion (2002).
Using a theorem of Kakutani (see, e.g., Hewitt and Stromberg
1969, p. 453), Emilion proved a convergenceresult for the three
cases of mixture components; a Dirichlet process, a normalized
gamma process, and a Kraft process. Another approach for mixture distributions suggested by Diday (2001) is based on copula
theory (Schweizer and Sklar 1983; Nelsen 1999) using different copula models, such as those of Frank (1979) and Clayton (1978). Bertrand (1995) considered structural properties of
pyramids.
By and large, however, probabilistic and rigorous mathematical underpinning, along with the need to develop relevant
convergence and limiting results, remain outstanding problems
needing considerable development.This includes deriving relevant standard errors and distribution functions for the respective
estimators thus far available, as well as consideration of robust-
486
Journal of the American Statistical Association, June 2003
ness issues as to the effects of the underlying assumptions on
these inference procedures. The Ž eld is wide open.
10. CONCLUSIONS
In conclusion, then, given contemporary data formats and
dataset sizes, the need to develop statistical methods to analyze them is becoming increasingly important. In addition to
these types of data, several attempts have been made to extend
data analysis to fuzzy, uncertain, nonprecise data (e.g., Bandemer and Nather 1992; Viertl 1996), or compositional data (e.g.,
Aitchison 1986). In both cases by taking care of the variation inside a class of units described by such data, we obtain symbolic
data. The statistical world has developed a wealth of methodology throughout the twentieth century, methods that are largely
limited to small (by comparison) datasets and limited to classical data formats. Some work has been done on imprecise or
fuzzy data but not on symbolic data. In this article we have reviewed the formulation and construction of symbolic objects
and classes. We also presented brief overviews of those symbolic data analyses, which have emerged as ways to draw statistical inferences in some formats of symbolic data. What the
obvious omission of methodologiesin any list of such statistical
methods makes abundantly clear is the enormous need for the
development of a vast catalog of new symbolic methodologies,
along with rigorous mathematical and statistical foundationsfor
these methods.
[Received March 2001. Revised August 2002.]
REFERENCES
Aitchison, J. (1986), The Statistical Analysis of Compositional Data, New York:
Chapman Hall.
Bandemer, H., and Nather, W. (1992), Fuzzy Data Analysis, Dordrecht: Kluwer
Academic.
Bertrand, P. (1995), “Structural Properties of Pyramidal Clustering,” in Partitioning Data Sets, eds. I. Corc, P. Hansen, and B. Julesz, Hayward, CA:
American Mathematical Society, pp. 352–353.
Bertrand, P., and Goupil, F. (2000), “Descriptive Statistics for Symbolic Data,”
in Analysis of Symbolic Data, eds. H.-H. Bock and E. Diday, Berlin: SpringerVerlag, pp. 103–124.
Billard, L., and Diday, E. (2000), “Regression Analysis for Interval-Valued
Data,” in Data Analysis, ClassiŽ cation, and Related Methods, eds. H. A. L.
Kiers, J.-P. Rassoon, P. J. F. Groenen, and M. Schader, Berlin: SpringerVerlag, pp. 369–374.
(2002a), “Symbolic Regression Analysis,” in ClassiŽ cation, Clustering, and Data Analysis, eds. K. Jajuga, A. Sokolowski, and H.-H. Bock,
Berlin: Springer-Verlag, pp. 281–288.
(2002b), “Symbolic Data Analysis: DeŽ nitions and Examples,” technical report, available at hhttp://www.stat.uga.edu/faculty/LYNNE/ Lynne.htmli.
Bock, H.-H., and Diday, E. (eds.) (2000), Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex
Data, Berlin: Springer-Verlag.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), ClassiŽ cation and Regression Trees, Belmont, CA: Wadsworth.
Brito, P. (1994), “Use of Pyramids in Symbolic Data Analysis,” in New Approaches in ClassiŽ cation and Data Analysis, eds. E. Diday, Y. Lechevallier,
M. Schader, P. Bertrand, and B. Burtschy, Berlin: Springer-Verlag, pp. 378–
386.
(1995), “Symbolic Objects: Order Structure and Pyramidal Clustering,” Annals of Operations Research, 55, 277–297.
(2000), “Hierarchical and Pyramidal Clustering With Complete Symbolic Objects,” in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and E.
Diday, Berlin: Springer-Verlag, pp. 312–324.
Brito, P., and DeCarvalho, F. A. T. (1999), “Symbolic Clustering in the Presence of Hierarchical Rules,” in Knowledge Extraction from Statistical Data,
Luxembourg: European Commission Eurostat, pp. 119–128.
Cazes, P., Chouakria, A., Diday, E., and Schektman, Y. (1997), “Extensions
de l’Analyse en Composantes Principales a des Donnees de Type Intervalle,”
Revue de Statistique Appliquee, 24, 5–24.
Chavent, M. (1998), “A Monothetic Clustering Algorithm,” Pattern Recognition Letters, 19, 989–996.
Chouakria, A. (1998), “Extension des Methodes d’Analyse Factorielle a des
Donees de Type Intervalle,” unpublished doctoral thesis, Université Paris
Dauphine.
Chouakria, A., Diday, E., and Cazes, P. (1999), “An Improved Factorial Representation of Symbolic Objects,” in Knowledge Extraction From Statistical
Data, Luxembourg: European Commission Eurostat, pp. 301–305.
Clayton, D. G. (1978), “A Model for Association in Bivariate Life Tables and
Its Application in Epidemiological Studies of Familial Tendency in Chronic
Disease Incidence,” Biometrika, 65, 141–151.
Codd, E. F. (1972), “Further Normalization of the Data Base Relational Model,”
in Data Base Systems Courant Computer Science Series, Vol. 6, ed. R. Rustin,
Englewood Cliffs, NJ: Prentice-Hall, pp. 33–64.
Csernel, M. (1997), “Normalization of Symbolic Objects,” in Symbolic Data
Analysis and its Applications, eds. E. Diday and K. C. Gowda, Dauphine,
France: Ceremade, 32–44.
DeCarvalho, F. A. T. (1994), “Proximity CoefŽ cients Between Boolean Symbolic Objects,” in New Approaches in ClassiŽ cation and Data Analysis, eds.
E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy, Berlin:
Springer-Verlag, pp. 387–394.
(1995), “Histograms in Symbolic Data Analysis,” Annals of Operations Research, 55, 299–322.
(1998), “Extension-Based Proximities Between Contrained Boolean
Symbolic Objects,” in Data Science, ClassiŽ cation and Related Methods, eds.
C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H. H. Bock, and Y. Baba,
Berlin: Springer-Verlag, pp. 370–378.
DeCarvalho, F. A. T., Verde, R., and Lechevallier, Y. (1999), “A Dynamical
Clustering of Symbolic Objects Based on a Content Dependent Proximity
Measure,” Applied Statistical Models and Data Analysis, 15, 237–242.
Diday, E. (1986), “Orders and Overlapping Clusters by Pyramids,” in Multidimensional Data Analysis, eds. J. De Leeuw, W. J. Heisen, J. J. Meulman, and
F. Critchley, Leiden, Netherlands: DSWO Press, pp. 201–234.
(1987), “Introduction a l’Approche Symbolique en Analyse des Donnees,” Premiere Jouneles Symbolique-Numerique, Dauphine, France: CEREMADE, pp. 21–56.
(ed.) (1989), Data Analysis: Learning Symbolic and Numerical
Knowledge, Antibes: Nova Science.
(1990), “Knowledge Representation and Symbolic Data Analysis,” in
Knowledge Data and Computer-Assisted Decisions, eds. M. Schader and W.
Gaul, Berlin: Springer-Verlag, pp. 17–34.
(1995), “Probabilistic, Possibilistic and Belief Objects for Knowledge
Analysis,” Annals of Operations Research, 55, 227–276.
(2001), “A Generalization of the Mixture Decomposition Problem in
the Symbolic Data Analysis Framework,” Dauphine: Ceremade.
Diday, E., and Emilion, R. (1996), “Capacities and Credibilities in Analysis of
Probabilistic Objects,” in Ordinal and Symbolic Data Analysis, eds. E. Diday,
Y. Lechevallier, and O. Opitz, Berlin: Springer-Verlag, pp. 13–30.
Diday, E., Emilion, R., and Hillali, Y. (1996), “Symbolic Data Analysis of Probabilistic Objects by Capacities and Credibilities,” Atti Della XXXVIII Riunione ScientiŽ a, pp. 5–22.
Emilion, R. (2002), “Clustering and Mixtures of Stochastic Processes,” Compte
Rendus Academy Sciences Paris I, 335, 189–193.
Esposito, F., Malerba, D., and Semeraro, G. (1991), “Flexible Matching of
Noisy Structural Descriptions,” in Conference on ArtiŽ cal Intelligence, eds.
J. Mylopoulos and R. Reiter, London: Morgan Kaufman, pp. 658–664.
Ferguson, T. S. (1974), “Prior Distributions on Spaces of Probability Measures,” Annals of Statistics, 2, 615–629.
Frank, M. J. (1979), “On the Simultaneous Associativity of F.x; y/ and x C y ¡
F.x; y/,” Aequationes Mathematics, 19, 194–226.
Gettler-Summa, M. (1999), “Approaches MGS (Marquage et Generalisations
Symboliques) Pour le Suivi de Typologies dans le Temps,” Journees de Statistique, 31, 833–845.
(2000), “Marking and Generalization by Symbolic Objects in the Symbolic OfŽ cial Data Analysis Software,” in Data Analysis, ClassiŽ cation, and
Related Methods, eds. H. A. L. Kiers, J.-P Rasson, P. J. F. Groenen, and M.
Schader, Berlin: Springer-Verlag, pp. 417–422.
Gettler-Summa, M., and Pardoux, C. (2000), “Symbolical Approaches for
Three-Way Data,” in Analysis of Symbolic Data: Exploratory Methods for
Extracting Statistical Information From Complex Data, eds. H.-H. Bock and
E. Diday, Berlin: Springer-Verlag, pp. 342–354.
Gowda, K. C., and Diday, E. (1991), “Symbolic Clustering Using a New Dissimilarity Measure,” Pattern Recognition, 24, 567–578.
Hewitt, E., and Stromberg, K. R. (1969), Real and Abstract Analysis: A Modern
Treatment of the Theory of Functions of a Real Variable, New York: SpringerVerlag.
Billard and Diday: Symbolic Data Analysis
Ichino, M. (1988), “General Metrics for Mixed Features—The Cartesian Space
Theory for Pattern Recognition,” Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 1, 494–497.
Ichino, M., and Yaguchi, H. (1994), “Generalized Minkowski Metrics for
Mixed Feature Type Data Analysis,” IEEE Transactions on Systems, Man and
Cybernetics, 24, 698–708.
Loustaunau, D., Pardoux, C., and Gettler-Summa, M. (1997), “Multidimensional Analysis for the Acquisition and Processing of Expert Knowledge:
Concept of On-Site Telecommunication,” in Multidimensional Data Analysis, eds. K. Fernandez-Aguirre and A. Morineau, Paris: CISIA-CERESTA,
pp. 293–300.
Michalski, R. S., Diday, E., and Stepp, R. E. (1981), “A Recent Advance in Data
Analysis: Clustering Objects Into Classes Characterized by Conjunctive Concepts,” in Progress in Pattern Recognition, eds. L. Kanal and A. Rosenfeld,
Amsterdam: North-Holland, pp. 33–56.
Michalski, R. S., and Stepp, R. E. (1984), “Learning from Observation:
Conceptual Clustering,” in Machine Learning, eds. R. S. Michalski, J. G.
Carbonell, and T. M. Mitchell, Berlin: Springer-Verlag, pp. 331–363.
Morineau, A., Sammartino, A.-E., Gettler-Summa, M., and Pardoux, C. (1994),
“Analyses des Donnels et Modelisation des Series Temporelles,” Review Statistique Appliquee, 42, 61–81.
Nagabhushan, P., Gowda, K. C., and Diday, E. (1995), “Dimensionality Reduction of Symbolic Data,” Pattern Recognition Letters, 16, 219–233.
Nelsen, R. B. (1999), An Introduction to Copulas, New York: Springer-Verlag.
Périnel, E., and Lechevallier, Y. (2000), “Symbolic Discrimination Rules,” in
Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical
Information From Complex Data, eds. H.-H. Bock and E. Diday, Berlin:
Springer-Verlag, pp. 244–265.
487
Polaillon, G. (1998), “Organisation et Interpretation par les Treillis de Galois
de Donnees de Type Multivalue, Intervalle ou Histogramme,” unpublised doctoral thesis, Université Paris Dauphine.
(2000), “Pyramidal ClassiŽ cation for Internal Data Using Galois Lattice Reduction,” in Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data, eds. H.-H. Bock and
E. Diday, Berlin: Springer-Verlag, pp. 324–340.
Quinlan, J. R. (1986), “Introduction of Decision Trees,” Machine Learning, 1,
81–106.
Raju, S. R. K. (1997), “Symbolic Data Analysis in Cardiology,” in Symbolic
Data Analysis and Its Applications, eds. E. Diday and K. C. Gowda, Paris:
CEREMADE, pp. 245–249.
Schweizer, B. (1984), “Distributions Are the Numbers of the Future,” in Proceedings of The Mathematics of Fuzzy Systems Meeting, eds. A. di Nola and
A. Ventre, Naples, Italy: University of Naples, 137–149.
Schweizer, B., and Sklar, A. (1983), Probabilistic Metric Spaces, New York:
North-Holland.
Stéphan, V., Hébrail, G., and Lechevallier, Y. (2000), “Generation of Symbolic Objects From Relational Databases,” in Analysis of Symbolic Data:
Exploratory Methods for Extracting Statistical Information From Complex
Data, eds. H.-H. Bock and E. Diday, Berlin: Springer-Verlag, pp. 78–105.
Verde, R., and DeCarvalho, F. A. T. (1999), “Dependence Rules In uence on
Factorial Representation of Boolean Symbolic Objects,” in Knowledge Extraction From Statistical Data, Luxembourg: European Commission Eurostat,
pp. 287–300.
Viertl, R. (1996), Statistical Methods for Non-Precise Data, Boca Raton, FL:
CRC Press.