A Database Clustering Methodology and Tool
Tae-Wan Ryu
Department of Computer Science
California State University, Fullerton
Fullerton, California 92834
[email protected]
Christoph F. Eick
Department of Computer Science
University of Houston
Houston, Texas 77204-3010
[email protected]
Abstract
Clustering is a popular data analysis and data mining technique. However, applying traditional clustering
algorithms directly to a database is not straightforward due to the fact that a database usually consists of
structured and related data; moreover, there might be several object views of the database to be clustered,
depending on a data analyst’s particular interest. Finally, in many cases, there is a data model
discrepancy between the format used to store the database to be analyzed and the representation format
that clustering algorithms expect as their input. These discrepancies have been mostly ignored by current
research.
This paper focuses on identifying those discrepancies and on analyzing their impact on the
application of clustering techniques to databases. We are particularly interested in the question of how
clustering algorithms can be generalized to become more directly applicable to real-world databases. The
paper introduces methodologies, techniques, and tools that serve this purpose. We propose a data set
representation framework for database clustering that characterizes the objects to be clustered through sets of
tuples, and introduce preprocessing techniques and tools to generate object views based on this
framework. Moreover, we introduce bag-oriented similarity measures and clustering algorithms that are
suitable for the proposed data set representation framework, and we demonstrate that bag-oriented
clustering is capable of dealing with the relationship information commonly found in databases.
Finally, we argue that our bag-oriented data representation framework is more suitable for
database clustering than the commonly used flat file format and produces better-quality clusters.
Keywords and Phrases: database clustering, preprocessing in KDD, data mining,
data model discrepancy, similarity measures for bags.
1 Introduction
Current technologies for collecting data, such as scanners and other automated collection tools, have
generated huge amounts of data, and the volume of data is growing rapidly every year. Database
systems provide tools and an environment for managing and accessing this large volume of data
systematically and efficiently. However, extracting useful knowledge from databases is very
difficult without additional computer assistance and more powerful analytical tools.

[To appear in Information Science, Spring 2005.]

In general, there is a significant gap between data generation and data understanding. Consequently,
automatic, powerful analytical tools for discovering useful and interesting patterns in databases
are desirable. Knowledge discovery in databases (KDD) is such a generic approach: it analyzes and
extracts useful knowledge from databases using fully automated techniques. Many
techniques and tools [HK01] have recently been proposed for this purpose. Popular KDD tasks include
classification, data summarization, dependency modeling, and deviation detection.
The focus of this paper is database clustering. The goal of database clustering is to take a
database that stores information concerning a particular type of objects (e.g., customers or
purchases) and identify subgroups of those objects, such that objects belonging to the same
subgroup are very similar to each other, and such that objects belonging to different subgroups
are quite different from each other.
[Figure: a pipeline from a restaurant database through preprocessing to an object view for
clustering, then through clustering to a set of similar object clusters, and finally through
summarization to three characterized groups: "Young at midnight", "White collar for dinner",
and "Retired for lunch".]
Figure 1. Example of Database Clustering
Suppose that a restaurant owner has a database that contains customer information and he wants
to obtain a better understanding of his main customer groups for marketing purposes. In order to
accomplish this goal, as depicted in Figure 1, the restaurant database is first preprocessed
for clustering, and a clustering algorithm is then applied to the preprocessed data set; for example, the
algorithm might reveal that there are three clusters in the customer database. Finally,
characteristic knowledge that summarizes each cluster can be generated, telling the restaurant
owner that his major customer groups are young people that come at midnight, white collar
people who come for dinner, and retirees who come for lunch. This knowledge will definitely be
useful for marketing purposes and for designing the menu.
The paper is organized as follows. Section 2 introduces the different steps that have to be
taken when clustering a database, and explains how database clustering is different from
traditional flat file data clustering. Based on the discussion of Section 2, Section 3 introduces a
“new” data set representation framework for database clustering that characterizes objects
through sets of tuples. Moreover, preprocessing techniques for generating object views based on
this framework are introduced. In Section 4 similarity measures for our bag-oriented knowledge
representation framework are introduced. Section 5 introduces the architecture and the
components of a database clustering environment we developed. Moreover, the problems of
generalizing traditional clustering algorithms for database clustering will be addressed in this
section. Section 6 reviews the related literature and Section 7 summarizes the findings and
contributions of the paper.
2 Database Clustering
2.1 Steps of Database Clustering
Because database clustering has not been discussed much in the literature, we think it is
useful to discuss the different steps of database clustering first. In general, we consider
database clustering to be an activity that is conducted by passing through the following seven
steps:
(1) Define Object-View
(2) Select Relevant Attributes
(3) Generate Suitable Input Format for the Clustering Tool
(4) Define Similarity Measure
(5) Select Parameter Settings for the Chosen Clustering Algorithm
(6) Run Clustering Algorithm
(7) Characterize the Computed Clusters
The first three steps of the suggested database clustering methodology center on preprocessing
the database and on generating a data set that can be processed by the employed clustering
algorithm(s). In these steps, a decision has to be made as to which objects in the database (usually
databases contain multiple types of objects) and which of their properties will be used for the
purpose of clustering; moreover, the relevant information has to be converted to a format that can
be processed by the selected clustering tool(s). In the fourth step similarity measures for the
objects to be clustered have to be defined. Finally, in steps 5-7 the clustering algorithm has to be
run, and summaries of the obtained clusters are generated.
2.2 Differences between Database Clustering and Ordinary Clustering
Data collections are stored in many different formats, such as flat files and relational or
object-oriented databases. The flat file format is the simplest and most frequently used format in
traditional data analysis. In the flat file format, data objects (e.g., records, cases,
examples) are represented as vectors in an n-dimensional space: each vector describes
an object, and the object is characterized by n attributes, each of which has a single value. Almost
all existing data analysis and data mining tools, such as clustering tools, inductive learning tools,
and statistical analysis tools, assume that the data sets to be analyzed are represented in flat file
format. The well-known inductive learning environment C4.5 [Quin93] and similar decision-tree-based
rule induction algorithms [Domi96], conceptual clustering algorithms such as COBWEB
[Fish87], AutoClass [Chee96], and ITERATE [Bisw95], statistical packages, etc. all make this
assumption.
Due to the fact that databases are more complex than flat files, database clustering faces
additional problems that do not exist when clustering flat files; these problems include:

- Databases contain objects that belong to different types; consequently, it has to be defined
  which objects in the database need to be clustered.
- Databases contain 1:1, 1:n, and n:m relationships between objects of the same and of different
  types.
- The definition of object similarity is more complex due to the presence of bags of values (or
  related information) that characterize an object.
- Attributes of objects have different types, which makes the selection of an appropriate
  similarity measure more difficult.
The first two problems will be analyzed in more detail in the next two subsections; the third and
fourth problems will be addressed in Section 4.
2.3 Support for Object Views for Database Clustering
Because databases usually contain objects belonging to different classes, there can be several
ways of viewing a database depending on what classes of objects need to be clustered. To
illustrate the problems of database clustering, let us use the following simple relational database
that consists of a Customer and a Purchase table; a particular state of this database is shown in
Figure 2 (a). The underlined attributes in each relation represent the primary key in the relation.
It is not possible to directly apply a clustering algorithm to a relational database, such as the
one that is depicted in Figure 2 (a). Before a clustering algorithm can be applied to a database it
has to be determined what classes of objects should be clustered: should customers or purchases
be clustered? After it has been decided which objects have to be clustered, in the next step
relevant attributes have to be associated with the particular objects to be clustered. The
availability of preprocessing tools that facilitate the generation of such object-views is highly
desirable for database clustering, because generating such object-views manually can be quite
time consuming.
2.4 Problems with Relationships
In general, a relational database usually consists of several related relations (or related classes
when using the object-oriented model), which frequently describe many-to-one and many-to-many
relationships between objects. For example, let us assume that we are interested in
clustering the customers belonging to the relational database that was depicted in Figure 2 (a). It
is obvious that the attributes found in the Customer relation alone are not sufficient to
accomplish this goal, because many important characteristics of persons are found in other
“related” relations, such as the Purchase relation. Prior to clustering customers, the relevant
information has to be extracted from the relational database and associated with each customer
object. We call a data structure that stores the results of this process an object view. An example
of such an object view is depicted in Figure 2 (c). The depicted data set was generated by
grouping related tuples into a unique object (based on cid). The attributes p.pid, p.ptype,
and p.amount are called related attributes with respect to the Customer relation because
they had to be imported from a foreign relation, the Purchase relation in this particular case.
(a) A data collection consisting of two relations, Customer and Purchase. The underlined attributes are the keys in
each relation. cid (customer id) is a foreign key in the relation Purchase. oid is an order id, pgid is a product group
id, ptype is a payment type (e.g., 1 for cash, 2 for credit card, and 3 for check). The cardinality ratio between the two
relations is 1:n.

Customer
cid  name   age  gender
1    Johny  43   M
2    Andy   21   F
3    Post   67   M
4    Jenny  35   F

Purchase
oid  pgid  cid  ptype  amount  date
1    p1    1    1      400     02-10-96
1    p2    1    1      70      02-10-96
1    p3    1    1      200     02-10-96
2    p2    2    2      390     02-23-96
3    p3    2    3      100     03-03-96
4    p1    3    1      30      03-03-96
(b) A single-valued data set created by performing an outer join on cid. The related attributes (from the Purchase
relation) have the prefix p. For example, p.pgid is the pgid in the Purchase relation.

cid  name   age  gender  p.oid  p.pgid  p.ptype  p.amount  date
1    Johny  43   M       1      p1      1        400       02-10-96
1    Johny  43   M       1      p2      1        70        02-10-96
1    Johny  43   M       1      p3      1        200       02-10-96
2    Andy   21   F       2      p2      2        390       02-23-96
2    Andy   21   F       3      p3      3        100       03-03-96
3    Post   67   M       4      p1      1        30        03-03-96
4    Jenny  35   F       null   null    null     null      null
(c) A multi-valued data set created by grouping related tuples into an object. For example, the three tuples that
characterize Johny are grouped into one "Johny" object, using separate bags for his product groups, payment
types, and the amounts spent on each product group.

cid  name   age  gender  p.pid       p.ptype  p.amount
1    Johny  43   M       {p1,p2,p3}  {1,1,1}  {400,70,200}
2    Andy   21   F       {p2,p3}     {2,3}    {390,100}
3    Post   67   M       p1          1        30
4    Jenny  35   F       null        null     null
(d) A single-valued data set created by averaging the multi-valued attributes in (c). For the symbolic multi-valued
attributes, such as p.pgid and p.ptype, we picked the first value in the bag (arbitrarily), since averages cannot be
calculated for them.

cid  name   age  gender  p.pgid  p.ptype  p.amount
1    Johny  43   M       p1      1        223
2    Andy   21   F       p2      2        245
3    Post   67   M       p1      1        30
4    Jenny  35   F       null    null     null

Figure 2. Various representations of a data set consisting of two related relations
In general, as the example shows, object views frequently contain bags of values if the
relationship cardinality between the two relations is 1:n. Note that in a relational database, n:m
relationships are typically decomposed into two 1:n relationships. Unlike a set, a bag allows
duplicate elements, but the elements must take values from the same domain. For example, the bag
{400, 70, 200} for the amount attribute might represent three purchases of 400, 70, and 200 dollars
by the customer "Johny". Ryu and Eick [Ryu98c] call a data set such as the one in Figure 2 (c) a
multi-valued data set and use the term single-valued data set for traditional flat files such as those
in (a) or (b). They use curly brackets to represent a bag of values whose cardinality is greater than one
(e.g., {1,2,3}), write null to denote an empty bag, and give just the element if the bag has one element.
Most traditional similarity measures for single-valued attributes cannot deal with multi-valued
attributes such as p.pid, p.ptype, and p.amount. Measuring similarity between bags
of values requires group similarity measures. For example, how do we compute the similarity
between the objects "Andy" and "Post" for the multi-valued attribute p.amount,
{390,100} : 30, or between "Andy" and "Johny", {390,100} : {400,70,200}? One simple idea is
to replace the bag of values of a multi-valued attribute by a single value obtained by applying some
aggregate function (e.g., average, sum, or count), as depicted in Figure 2 (d). Another alternative
is to use an outer join on the cid attribute to obtain a single-valued data set, as
depicted in Figure 2 (b).
The problem with the first approach is that applying the aggregate function frequently loses
valuable information. For example, if the average purchase amount is used to replace
the bag of individual purchase amounts, this approach ignores other potentially
relevant information, such as the total amount and the number of purchases, when computing similarity.
Another problem is that aggregate functions are only applicable to numerical attributes. Using
aggregate functions for symbolic attributes, such as the attribute pgid or ptype in the example
database, does not make sense at all. In summary, the approach of replacing a bag of values by a
single value faces serious technical difficulties.
If we look at the single-valued data set in Figure 2 (b), which was generated using an
outer join, we observe a different problem. A clustering algorithm would treat each
tuple in the obtained single-valued data set as a separate object (e.g., Johny's 3 purchases would
be considered different objects rather than data related to the customer "Johny"),
which means that the 4 customers would no longer be clustered, but rather the 7 purchase tuples;
obviously, if our goal is to cluster customers, clustering purchases instead is quite
confusing.
3 A Data Set Representation Framework for Database Clustering
In the following a data set representation framework for database clustering is proposed;
similarity measures that are suitable in the context of the proposed framework will then be
introduced in Section 4. In general, the framework consists of the following mechanisms:

- An object identification mechanism that defines what class of objects will be clustered
  and how those objects will be uniquely identified.
- Mechanisms to define modular units of object similarity; each modular unit represents a
  particular perspective on the objects to be clustered, and the similarity of different modular
  units is measured independently. In the context of the relational data model, modular units
  are defined as procedures that associate a bag of tuples with a given object. Using this
  framework, objects to be clustered are characterized by a set of bags of tuples, one bag for
  each modular unit.
- The similarity between two objects is measured as a weighted sum of the similarities of its
  modular units. To this end, a weight and a (bag) similarity measure have to be provided for
  each modular unit.
cid 1:  [Age 43, Gender M];  purchases {(P1, 400), (P2, 70), (P3, 200)};  daily spending {(Sum(amount) 670, Date 2/10/96)}
cid 2:  [Age 21, Gender F];  purchases {(P2, 390), (P3, 100)};            daily spending {(390, 2/23/96), (100, 3/3/96)}
cid 3:  [Age 67, Gender M];  purchases {(P1, 30)};                        daily spending {(30, 3/3/96)}
cid 4:  [Age 35, Gender F];  purchases {};                                daily spending {}

Figure 3. An example of the bag-oriented clustering framework: each customer is characterized by
three bags of tuples, one per modular unit (age/gender, amount per product group, and total spending per day)
To illustrate this framework, let us assume that we are still interested in clustering customers. In
this case, the attribute cid of the relation Customer, which uniquely identifies customers, serves as
our object identification mechanism. After the object identification mechanism has been selected,
the attributes relevant for defining similarity between customers have to be selected. In this particular
case, we assume that the customer's age/gender information, the amount of money customers
spend on various product groups, and the customer's daily spending pattern are relevant
for defining customer similarity. In the next step, modular units to measure customer similarity
have to be defined. In this particular example, we identify three modular units, each of which
characterizes customers through a set of tuples. For example, the customer with cid 1 is
characterized as a 43-year-old male who spent 400, 70, and 200 dollars on product groups p1,
p2, and p3, and who purchased all his goods on a single day of the reporting period, spending a
total of 670 dollars. There are different approaches to defining modular units. When the relational
data model is used, modular units can be defined using SQL queries that associate customers
(using cid) with a set of tuples that are specific to the modular unit. In the example depicted in
Figure 3, the following three SQL queries associate customers with the characteristic knowledge
with respect to each modular unit:
Modular Unit 1 := SELECT cid, age, gender
                  FROM Customer;

Modular Unit 2 := SELECT Customer.cid, pgid, amount
                  FROM Customer, Purchase
                  WHERE Customer.cid = Purchase.cid;

Modular Unit 3 := SELECT Customer.cid, sum(amount), date
                  FROM Customer, Purchase
                  WHERE Customer.cid = Purchase.cid
                  GROUP BY Customer.cid, date;
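
To make the output of these queries concrete, the following Python sketch (our illustration, not part of the original tool) shows the bag-of-tuples object view that the three modular units induce for the data of Figures 2 and 3; the dictionary layout and the unit names are assumptions made for exposition:

# A hypothetical in-memory form of the object view of Figure 3: each
# customer (keyed by cid) carries one bag of tuples per modular unit.
object_view = {
    1: {"unit1": [(43, "M")],
        "unit2": [("p1", 400), ("p2", 70), ("p3", 200)],
        "unit3": [(670, "02-10-96")]},
    2: {"unit1": [(21, "F")],
        "unit2": [("p2", 390), ("p3", 100)],
        "unit3": [(390, "02-23-96"), (100, "03-03-96")]},
    3: {"unit1": [(67, "M")],
        "unit2": [("p1", 30)],
        "unit3": [(30, "03-03-96")]},
    4: {"unit1": [(35, "F")], "unit2": [], "unit3": []},
}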
As we have seen throughout the discussion of the last two sections, many different object
views can be constructed from a given database. There are "simple" object views based on the flat
file format, such as those in Figure 2 (b) and Figure 2 (d); in this section, a more sophisticated
scheme for defining object views has been introduced that characterizes objects through sets of
bags of tuples. We claim that this data set representation framework is more suitable for database
clustering, and will present arguments to support this claim in Section 5.

When following the proposed methodology, object views based on the definition of modular
units are constructed. In the next step, similarity measures have to be defined with respect to the
chosen object view; this is the subject of the next section.
4 Similarity Measures for Database Clustering
In the previous section, we introduced a data set representation framework for database
clustering. In this section, we will introduce several similarity measures that are suitable for the
proposed framework.
As discussed earlier, in the proposed framework each object to be clustered is described
through a set of bags of tuples—one bag for each modular unit. In the case of single-valued data
sets each bag degenerates to a single tuple. When defining object similarity for this framework
we assume that a similarity measure is used to evaluate object similarity with respect to a
particular modular unit. Object-similarity itself is measured as the weighted sum of the similarity
of its modular units. More formally:
Let
  O be the set of objects to be clustered,
  a, b ∈ O,
  m_i : O → X be a function that computes the bag of tuples of the i-th modular unit,
  σ_i be the similarity function for the i-th modular unit,
  w_i be the weight for the i-th modular unit.

Based on these definitions, the similarity between two objects a and b can be defined as follows:

  σ(a, b) = Σ_{i=1}^{n} w_i · σ_i(m_i(a), m_i(b)) / Σ_{i=1}^{n} w_i    (0)

where n is the number of modular units.
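
As an illustration, the following Python sketch computes formula (0); the function name and the (weight, extractor, similarity) triple layout are our assumptions, not part of the paper's tool:

def object_similarity(a, b, units):
    # units: list of (w_i, m_i, sigma_i) triples, one per modular unit;
    # m_i maps an object to its bag of tuples, sigma_i compares two bags.
    numerator = sum(w * sigma(m(a), m(b)) for (w, m, sigma) in units)
    denominator = sum(w for (w, _, _) in units)
    return numerator / denominator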
Figure 4 illustrates how the similarity measure is computed between two objects, Objecta and
Objectb with modular units in our similarity framework.
[Figure: Objecta and Objectb are each decomposed into modular units 1..n; corresponding units are
compared by the similarity functions σ1, …, σn and combined with the weights w1, …, wn.]
Figure 4. Similarity framework
Many similarity metrics and concepts have been proposed in the literature, from a variety of
disciplines including engineering and science [Ande73, Ever93, Jain88, Wils97] as well as psychology
[Ashb88, Shep62]. In this paper, we broadly categorize attributes into quantitative and
qualitative types, introduce existing similarity measures for these two types, and
generalize them to cope with the special characteristics of our framework.
4.1 Similarity Measures for Quantitative Types
A class of distance functions known as the Minkowski metric is the most popular
dissimilarity function for quantitative attributes. It is defined as follows:

  d_r(a, b) = (Σ_{i=1}^{m} |a_i − b_i|^r)^{1/r},  r ≥ 1    (1)

where a and b are two objects with m quantitative attributes, a = (a_1, …, a_m) and b =
(b_1, …, b_m). For r = 2, it is the Euclidean metric, d_2(a, b) = (Σ_{i=1}^{m} (a_i − b_i)^2)^{1/2};
for r = 1, it is the city-block (also known as taxicab or Manhattan) metric,
d_1(a, b) = Σ_{i=1}^{m} |a_i − b_i|; and for r = ∞, it is the dominance metric,
d_∞(a, b) = max_{1≤i≤m} |a_i − b_i|. The Euclidean metric is the most commonly used of the
Minkowski metrics. Wilson and Martinez [Wils97] discuss many other distance functions and their properties.
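
A minimal Python sketch of the Minkowski family (1), ours rather than the paper's, covering the three special cases named above:

def minkowski(a, b, r):
    # a, b: sequences of quantitative attribute values of equal length m
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if r == float("inf"):
        return max(diffs)                  # dominance metric
    return sum(d ** r for d in diffs) ** (1.0 / r)

# r = 1 gives the city-block metric, r = 2 the Euclidean metric, e.g.
# minkowski((43, 400), (21, 390), 2) == ((43-21)**2 + (400-390)**2) ** 0.5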
One simple way to measure the similarity between modular units in our similarity framework is
to substitute group means for the i-th attribute of an object in the formulae for inter-object
measures such as Euclidean distance, city-block distance, or the squared Mahalanobis distance
[Jain88]. For example, suppose that group A has the mean vector x̄_A = (x̄_{a1}, x̄_{a2}, …, x̄_{am})
and group B has the mean vector x̄_B = (x̄_{b1}, x̄_{b2}, …, x̄_{bm}); then the Euclidean distance
between the two groups can be defined as

  d(A, B) = (Σ_{i=1}^{m} (x̄_{ai} − x̄_{bi})^2)^{1/2}    (2)
The other approach is to measure the distance between the closest or furthest members, one
from each group, which is known as nearest-neighbor or furthest-neighbor distance [Ever93].
This approach is used in hierarchical clustering algorithms such as single-linkage and complete-linkage.
The main problems with these two approaches are that the similarity is insensitive to the
quantitative variance within a group and does not account for the cardinality (number of elements) of a group.
Another approach, known as group average, can be used to measure inter-group similarity.
In this approach, the similarity between groups is measured by averaging all inter-object
measures for those pairs of objects in which the two objects come from different groups.
For example, the average dissimilarity between groups A and B can be defined as

  d(A, B) = [Σ_{i=1}^{n_a} Σ_{j=1}^{n_b} d(a_i, b_j)] / n    (3)

where n = n_a · n_b is the total number of object pairs, n_a and n_b are the numbers of objects in
groups A and B, respectively, and d(a_i, b_j) is the dissimilarity function for a pair of objects
a_i ∈ A and b_j ∈ B. Note that a dissimilarity function (usually a distance function) can easily be
converted into a similarity function by taking its reciprocal.
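
As a concrete illustration, the following Python sketch (ours, not the paper's implementation) computes the group-average measure (3) for two bags, with the inter-object dissimilarity d supplied by the caller:

def group_average(A, B, d):
    # A, B: bags (lists) of objects; d: inter-object dissimilarity
    if not A or not B:                     # an empty bag contributes no pairs
        return None
    total = sum(d(a, b) for a in A for b in B)
    return total / (len(A) * len(B))       # n = n_a * n_b pairs

# The Section 2.4 example {390,100} vs. {400,70,200} with d = |x - y|:
# group_average([390, 100], [400, 70, 200], lambda x, y: abs(x - y))
# averages all 6 pairwise distances, yielding 950 / 6.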
4.2 Similarity Measures for Qualitative Types
Two coefficients, the Matching coefficient and Jaccard's coefficient, are the most commonly
used similarity measures for qualitative attributes [Ever93, Jain88]. The Matching
coefficient is the ratio of the number of features the two objects have in common to the total
number of features. Jaccard's coefficient is the Matching coefficient with negative matches
excluded. For example, let m be the total number of features, m11 be the number of positive
matches (features present in both objects), m00 be the number of negative matches (features
absent from both objects), and m01 and m10 be the numbers of features present in one object but
not the other. Then the Matching coefficient and Jaccard's coefficient are defined as (m00 + m11)/m
and m11/(m − m00), respectively. Other variants of these coefficients weight matching or
mismatching features differently, depending on the accepted practice.
The above coefficient measures can be extended to multi-valued qualitative attributes.
Restle [Rest59] has investigated the concepts of distance and ordering on sets. There are several
other proposed set-theoretical models of similarity [Ashb88, Tver77]. Tversky [Tver77]
proposed his contrast model and ratio model, which generalize several set-theoretical similarity
models proposed at that time. Tversky considers objects as sets of features instead of geometric
points in a metric space. To illustrate his models, let a and b be two objects, and let m_a and m_b
denote the sets of features associated with the objects a and b, respectively. Tversky proposed the
following similarity measure, called the contrast model:

  S(a, b) = θ·f(m_a ∩ m_b) − α·f(m_a − m_b) − β·f(m_b − m_a)    (4)

for some θ, α, β ≥ 0; f is a set operator (usually the set cardinality is used). Here, m_a ∩ m_b
represents the features that are common to both a and b; m_a − m_b, the features that belong to a but
not to b; and m_b − m_a, the features that belong to b but not to a. In previous models, the similarity
between objects was determined only by their common features, or only by their distinctive
features. In the contrast model, the similarity of a pair of objects is instead expressed as a linear
combination, i.e., a weighted difference, of the measures of the common and the distinctive
features. The following similarity measure represents the ratio model:

  S(a, b) = f(m_a ∩ m_b) / [f(m_a ∩ m_b) + α·f(m_a − m_b) + β·f(m_b − m_a)],  α, β ≥ 0    (5)

In the ratio model, the similarity value is normalized to the range 0 to 1. The ratio model
generalizes a wide variety of similarity models that are based on matching coefficients for
qualitative attributes, as well as several other set-theoretical models of similarity [Eisl59].
For example, if α = β = 1, then S(a, b) becomes Jaccard's coefficient, f(m_a ∩ m_b)/f(m_a ∪ m_b),
discussed above. Note that the sets in Tversky's model are crisp sets. Santini et al.
[Santi96] extend Tversky's model to cope with fuzzy sets.
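
A minimal Python sketch of the ratio model (5), assuming crisp feature sets and f = set cardinality (our illustration):

def tversky_ratio(ma, mb, alpha=1.0, beta=1.0):
    # ma, mb: crisp feature sets; f is taken to be set cardinality
    common = len(ma & mb)
    a_only = len(ma - mb)
    b_only = len(mb - ma)
    denom = common + alpha * a_only + beta * b_only
    return common / denom if denom else 0.0

# With alpha = beta = 1 this reduces to Jaccard's coefficient:
# tversky_ratio({"p1", "p2", "p3"}, {"p2", "p3"}) == 2 / 3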
Wilson and Martinez [Wils97] discuss the Value Difference Metric (VDM) introduced by
Stanfill and Waltz (1986) and propose the Heterogeneous Value Difference Metric (HVDM) for
handling nominal attributes. Gibson et al. [Gibs98] introduce a sophisticated approach to
similarity measures arising from the co-occurrence of values in a data set, using an iterative
method for assigning and propagating weights on the qualitative values. Their approach can
handle a limited form of transitive similarity; e.g., if Oa is similar to Ob and Ob is similar to Oc,
then Oa is considered to be similar to Oc.
4.3 Similarity Measures for Mixed Types
In many real-world problems, we often encounter data sets with a mixture of attribute types.
Specifically, if algorithms are to be applied to databases, it may not be sensible to assume a
single attribute type, since the data can come from multiple tables with different properties
in a given database.
A similarity measure proposed by Gower [Gowe71] is particularly useful for data with mixed
attribute types. This measure is defined as:

  S(a, b) = Σ_{i=1}^{m} w_i · s_i(a_i, b_i) / Σ_{i=1}^{m} w_i    (6)

where a and b are two objects with m attributes, a = (a_1, …, a_m) and b = (b_1, …, b_m). In
this formula, s_i(a_i, b_i) is the normalized similarity index in the range 0 to 1 between the
objects a and b as measured by the function s_i for the i-th attribute, and w_i is the weight of the
i-th attribute. The similarity index s_i(a_i, b_i) can be any appropriate function among the
similarity measures defined in Sections 4.1 and 4.2, depending on the attribute types or the application.
Higher weights are assigned to more important attributes. As the reader may already have observed,
our approach to assessing object similarity, defined in formula (0), relies on Gower's similarity
measure and associates the similarity measures σ_i with modular units that represent different facets
of objects.
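
The following Python sketch (our illustration) implements Gower's measure (6); the per-attribute similarity functions and the attribute ranges in the usage comment are assumptions:

def gower(a, b, attrs):
    # attrs: list of (w_i, s_i) pairs; each s_i returns a value in [0, 1]
    numerator = sum(w * s(x, y) for (w, s), x, y in zip(attrs, a, b))
    denominator = sum(w for (w, _) in attrs)
    return numerator / denominator

# e.g., for a Year attribute (range assumed to span 99 years) and a Director attribute:
# attrs = [(0.2, lambda x, y: 1 - abs(x - y) / 99.0),
#          (0.6, lambda x, y: 1.0 if x == y else 0.0)]
# gower((1998, "D:A.Tucker"), (1999, "D:A.Tucker"), attrs)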
Wilson and Martinez [Wils97] introduce comprehensive similarity measures, called HVDM,
IVDM, and WVDM, for handling mixed attribute types. Gower's similarity framework for mixed
attribute types can be extended toward Wilson and Martinez's framework by adding the appropriate
similarity measure for each attribute type defined in HVDM.
4.4 Support for the Contextual Assessment of Similarity
The similarity measures introduced so far do not take into consideration that attributes
frequently have to be interpreted in the context of other attributes. For example, consider the data
set in Figure 2, in which customer "Johny" purchased three product groups, p1 for $400, p2
for $70, and p3 for $200, and customer "Andy" spent $390 on product group p2 and $100 on
product group p3. If the product ids p1, p2, and p3 stand for "TV", "fruit", and "jewelry",
respectively, it might not be sensible to compute the similarity of the purchase amount attribute
between "Johny" and "Andy" without considering the type of product they bought, because
purchases of fruit might not be considered similar to purchases of TV-related products, even if the
amounts spent on the purchases are similar. That is, the similarity of the amount
attribute needs to be evaluated in the context of the product attribute. In the following, we
introduce a new similarity measure for this purpose.
Let us assume that the similarity of attribute β has to be evaluated in the context of attribute
α, which we denote by β|α. Then we can define the similarity between two objects having
attributes α and β as the similarity of the β attribute with respect to the α attribute. The new
similarity function is defined as follows:

  s_{β|α}(a, b) = Σ_k δ(k)·s(k) / Σ_k δ(k)    (7)

where δ is a matching function for the attribute α, s is a similarity function for the attribute
β, and k ranges over the elements in a bag. For a qualitative attribute α, the value of δ is 1 if both
objects take the same value for the attribute, and 0 otherwise (i.e., no matching values). For a
quantitative attribute α, the value of δ is between 0 and 1 (i.e., a normalized distance value) and
represents the degree of relevancy between the two objects for the attribute α. Note that the
contextual relationship between β and α is not commutative (i.e., β|α ≠ α|β). In addition, β and α
can in principle be expanded to a conjunctive or disjunctive list of attributes. Accordingly, the
general form of β|α can be:

  β1 β2 … βp | α1 op α2 op … op αn,

where op is either ∧ or ∨, and p and n are the numbers of attributes used in the similarity
computation between two objects. However, since the similarity between two objects is computed
attribute by attribute over the selected list of attributes, this can be rewritten as β | α1 op α2 op …
op αn. Some examples of contextual relationships are β | α, β | α1 ∧ α2 ∧ … ∧ αn,
β | α1 ∨ α2 ∨ … ∨ αn, β | α1 ∧ α2 ∨ α3 … ∧ αn, and so on. For the case β | α1 ∧ α2, the value of δ
is 1 for a qualitative attribute when both objects take the same values for both attributes α1 and α2.
In this definition, the information from the related multi-valued attributes is combined in an orderly
way to give a similarity value. This similarity measure is embedded into our similarity framework.
Figure 5 illustrates how the similarity is computed with contextual assessment; it shows how
δ(k) and s(k) are used when computing s_{amount|pgid} between the example objects "Johny" and
"Andy".
Objects  Product Id (α)  Amount (β)
Johny    TV (p1)         400
         Fruit (p2)      70
         Jewelry (p3)    200
Andy     Fruit (p2)      390
         Jewelry (p3)    100

δ(1): TV = 0,  δ(2): Fruit = 1,  δ(3): Jewelry = 1

Assuming the city-block metric, the normalized similarity indexes can be computed as follows:
  s(1) = 0.0,  s(2) = 0.18,  s(3) = 0.5

The similarity between Johny and Andy for amount in the context of product id is then:
  s_{amount|pgid}(Johny, Andy) = (0·0.0 + 1·0.18 + 1·0.5) / (0 + 1 + 1) = 0.34

Figure 5. Two objects with attributes that carry contextual information, and an example similarity
computation
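
The Figure 5 computation can be reproduced with the following Python sketch of formula (7) (ours; the normalized indexes s(k) are taken as given):

def contextual_similarity(pairs):
    # pairs: one (delta_k, s_k) per bag element, per formula (7)
    numerator = sum(d * s for d, s in pairs)
    denominator = sum(d for d, _ in pairs)
    return numerator / denominator if denominator else 0.0

# Johny vs. Andy: delta = (0, 1, 1) for (TV, Fruit, Jewelry) and
# s = (0.0, 0.18, 0.5), so the result is (0.18 + 0.5) / 2 = 0.34.
print(contextual_similarity([(0, 0.0), (1, 0.18), (1, 0.5)]))   # 0.34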
Note that the proposed contextual similarity is not designed to find sequential patterns, as
PrefixSpan [Pei01] does, or to measure transitive similarity [Gibs98], but to take valid contextual
information into account in the similarity computation.
5 Architecture of Database Clustering System
Figure 6 depicts the architecture of the database clustering system we are currently
developing. The system consists of three major tools: a data preparation tool, a clustering tool,
and a similarity measure tool. The data preparation tool is used to generate an object view from a
relational database based on the user's requirements. The clustering tool guides the user in
choosing an appropriate clustering algorithm for an application from a library of clustering
algorithms that contains various algorithms, such as nearest-neighbor and hierarchical clustering.
Once a clustering algorithm has been selected, the similarity measure tool assists the user in
constructing an appropriate similarity measure for his/her application and the chosen clustering
algorithm. While the user constructs the similarity measure, the system inquires about the
types, weights, and other characteristics of the attributes, offering alternatives and choices to the
user if more than one similarity measure seems appropriate.
[Figure: through a user interface, the data preparation tool draws on the DBMS to produce an
object view; the clustering tool draws on a library of clustering algorithms and returns a set of
clusters; the similarity measure tool draws on a library of similarity measures, default choices and
domain information, and type and weight information to produce a similarity measure.]
Figure 6. Architecture for database clustering
In case the user does not provide the necessary information, default assumptions are
made based on the attribute types (e.g., the Euclidean distance is chosen for quantitative types,
and Tversky's ratio model is our default choice for qualitative types). The value range
information for quantitative attributes can easily be retrieved from a given data set by
scanning the column vectors of the quantitative attributes. The range information is used to
normalize the similarity index; normalizing the similarity index is important for combining
similarity values across attributes of possibly different types. Finally, the clustering tool takes
the constructed similarity measure and the object view as its input and returns a set (or a
hierarchy) of object clusters as its output.
5.1 A Framework to Generate Object Views from Databases
[Figure: based on the user's interests and objectives (database name, data set of interest, object
attribute(s), and selected attributes) entered through a user interface, a structured database is
preprocessed into either a bag-based object view, for our generalized clustering algorithms, or a
flat-file-based object view, for conventional clustering algorithms.]
Figure 7. A framework for generating object views
Figure 7 illustrates the proposed framework for generating object views from a database. One of
the key ideas of the proposed research for dealing with the problems raised in Section 2.2 is to
develop a semi-automatic data preparation tool that generates object views from a (relational)
database based on the user's interests and objectives. The tool automates the first three
steps of the database clustering methodology introduced in Section 2.1. The tool is
interactive, so that the user can define his/her object view and the relevant attributes; based on
these inputs, an object view is automatically generated by the tool. In order to generate an
object view from a database, our approach is first to enter a database name, and then to select a
table called the data set of interest, the object attribute(s), and the selected attributes. The data set
of interest is an anchor table for the other related tables the user is interested in for clustering. The
object attribute(s) (usually a key attribute of the data set of interest) define the object view of the
particular clustering task. An object in a relational database is defined as the collection of tuples
that have the same values for all object attribute(s); this set of tuples is viewed as describing the
same object. Consequently, when generating the object view, information in tuples that agree on
the object attributes is combined into a single object in the format shown in Figure 2 (c), whereas
tuples that do not agree are represented as different objects in the generated object view. The
selected attributes are attributes from all the related tables the user has chosen. Although the tool
can generate an object view in the conventional flat-file format for conventional clustering
algorithms, the main object-view format in our approach is bag-based.
Figure 9 shows the implemented interface of the data preparation tool that generates an
object view from a relational database. We used Visual Basic to implement this tool. Using the
information provided by the user through the interface, the algorithm to generate an object view
works as follows: once the database name and the data set of interest are given, the attributes of
the data set of interest are first extracted from the database; next, the related attributes in related
tables are collected by joining (usually outer-joining) with the related tables; finally, the object
attribute(s) are selected from the attributes, and the object view is created by grouping the tuples
with the same values for the object attribute(s) into one object with bags of values for the
related attributes [Ryu98c and Zehu98 give a more detailed description of the algorithm].
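
The grouping step just described can be sketched as follows in Python (our illustration of the algorithm, not the Visual Basic tool itself; the row format and attribute names follow Figure 2 (b)):

from collections import defaultdict

def build_object_view(rows, object_attr, related_attrs):
    # rows: dicts produced by the (outer) join, as in Figure 2 (b);
    # tuples that agree on object_attr are collapsed into one object
    # whose related attributes become bags (duplicates are kept).
    objects = defaultdict(lambda: defaultdict(list))
    for row in rows:
        oid = row[object_attr]
        for attr in related_attrs:
            if row[attr] is not None:       # null join results add nothing
                objects[oid][attr].append(row[attr])
    return {oid: dict(bags) for oid, bags in objects.items()}

# e.g., build_object_view(rows, "cid", ["p.pgid", "p.ptype", "p.amount"])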
Figure 9. Interface for data preparation tool
5.2 Features of the Clustering Tool
Figure 10 shows the class diagram of our clustering tool in UML (the Unified Modeling Language,
a notational language for software design and architecture [Mart97]). The class diagram
describes the developed classes, their attributes and operations, and the relationships among the classes.
The GetAnalysisInfo class receives basic information from the user, such as the name of the selected
data set, the attributes of interest, the data types of the attributes, and the similarity measure to be
applied to the selected data set. The ReadDataSetObjects class reads the selected data set. The
Similarity Measure class defines our similarity measure; in this implementation, we chose the
average dissimilarity measure for quantitative attributes and Tversky's ratio model for qualitative
attributes, taking the contextual assessment of similarity into account. The Clustering class defines
a clustering algorithm that uses the similarity measure defined in the Similarity Measure class.
Figure 10. Class diagram for the clustering tool
For the clustering algorithm, we chose the nearest-neighbor algorithm, a partitioning
clustering method. In the nearest-neighbor algorithm, two objects are considered similar and are
put in the same cluster if they are neighbors or share neighbors. The algorithm partitions a data set
D = {o1, o2, o3, …, on} into K clusters as follows. The first object, o1, is assigned to cluster C1.
For each subsequent object oi, the nearest neighbor of oi among the objects already assigned to
clusters is found; oi is assigned to that neighbor's cluster CJ if the distance between oi and the
nearest neighbor is ≤ t (where t is a threshold of the nearest-neighbor algorithm, selected by the
user). Otherwise, the object oi is assigned to a new cluster CR. This step is repeated until all
objects in the analyzed data set are assigned to clusters. When using the nearest-neighbor
algorithm, the user must provide the threshold value used in the clustering process. The threshold
value sets the condition under which two objects can be grouped together in the same cluster;
consequently, it affects the number of generated clusters. As the value of the threshold increases,
fewer clusters are generated.
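
The following Python sketch reflects our reading of the description above (it is not the tool's Visual Basic code); dist is the object dissimilarity and t the user-selected threshold:

def nearest_neighbor_clustering(objects, dist, t):
    clusters = []                           # each cluster is a list of objects
    for o in objects:
        best, best_d = None, None
        for c in clusters:                  # nearest already-assigned neighbor
            d = min(dist(o, member) for member in c)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= t:
            best.append(o)                  # join the nearest neighbor's cluster
        else:
            clusters.append([o])            # otherwise open a new cluster
    return clusters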
We selected the nearest-neighbor algorithm because it is directly applicable to clustering
the proposed bag-based object view. Other algorithms that compute similarity directly between
two objects are also applicable within our framework. However, generalizing a clustering
algorithm such as K-means is not trivial, because of the difficulty of computing centroids for
clusters of objects that are characterized by sets of bags of values.
Figure 11. Clustering tool
Figure 11 illustrates the implemented interface of the clustering tool. In order to cluster an
object view generated by the data preparation tool, the user selects a tab for a flat-file-based or
bag-based data set, the attributes, the attribute types, the corresponding weights, the threshold
value, and the output of the clustering.
5.3 Database Clustering Examples
We used two relational databases: the Movies database, available in the UC Irvine data set archive
[UCIML04], and an online customer database obtained from a local company [Ryu02]. In these
experiments, we generated three different data sets for each database using three different data
representation formats: a single-valued data set and an average-valued data set for the conventional
representation, and a multi-valued data set for the proposed representation, following the formats
shown in Figure 2, in order to see whether the data set representation affects the quality of the
clusters. For the clustering algorithm, we chose the nearest-neighbor algorithm. For the similarity
measure, we used formula (0) based on Gower's similarity function (6). For the multi-valued data set,
the function uses formula (3) for quantitative attributes, (5) for qualitative attributes,
and (7) for the contextual assessment. For the single-valued and average-valued data sets,
the function uses formula (1) (the Euclidean metric) for quantitative attributes and (5) for
qualitative attributes, but not formula (7). Other similarity measures can also be
incorporated into the proposed framework, depending on the user's choice.
The Movies database holds information about movies made since 1900, such as their titles,
types, directors, producers, and years of release. Table 1 lists the selected attributes, their types,
and the assigned weights for the Movies database.
Attribute  Properties                   Assigned Weight
Film_Id    Single-valued, Qualitative   0
Year       Single-valued, Quantitative  0.2
Director   Single-valued, Qualitative   0.6
Category   Multi-valued, Qualitative    0.7
Awards     Multi-valued, Qualitative    0.5

Table 1. Movies database
The key attribute of this data set is Film_Id. All the attributes in this data set except
Year (we are not very interested in the year information) are of qualitative type. The
attributes Year and Director are single-valued; Category and Awards are multi-valued. The
empirically selected threshold value for the clustering algorithm in this experiment is 0.36
[Sala00].
The number of clusters generated by each technique is shown in Table 2. The same
clustering algorithm with the same similarity framework (with slightly different similarity
formulas) was applied to the different data sets. We did not generate the average-valued data set
for the Movies database, because most attributes in the database are symbolic and cannot easily be
converted into a meaningful quantitative scale.
Approach                 Number of Clusters
Single-valued approach   136
Average-valued approach  N/A
Multi-valued approach    130

Table 2. Number of clusters generated by each approach
Both the single-valued approach and the multi-valued approach produced the same clusters for
the single-valued objects. This is not surprising, since the multi-valued approach is essentially a
generalization of the single-valued approach. The number of clusters generated by the multi-valued
approach is smaller than that generated by the single-valued approach.
Film_id                             category    director    Awards  Year
Asf10,T:I'll be Home for Christmas  Comd        D:Sanford   Null    1998
Asf8,T:A Very Brady Sequel          Comd        D:Sanford   Null    1996
Atk10,T:Hilary and Jackie           BioP, Comd  D:A.Tucker  Null    1998
Atk12,T:Map of Love                 BioP, Comd  D:A.Tucker  Null    1999

Table 3. Some objects in cluster-A from the multi-valued approach
Moreover, as we expected, in the clustering result of the single-valued approach, objects with
multi-valued attributes were grouped into different clusters. Obviously, there is no such
problem in the multi-valued approach. Some objects of a cluster generated by the multi-valued
approach are shown in Table 3. In this cluster, there are four different objects with similar
properties. Note that the attribute category has the highest weight. As Tables 4 and 5 illustrate,
some objects appear in two different clusters generated by the single-valued approach even though
they are in the same cluster in Table 3.
Film_id                    category  Director    awards  year
Atk10,T:Hilary and Jackie  BioP      D:A.Tucker  Null    1998
Atk12,T:Map of Love        BioP      D:A.Tucker  Null    1999

Table 4. Some objects in cluster-B from the single-valued approach
film_id                             category  Director    awards  Year
Asf10,T:I'll be Home for Christmas  Comd      D:Sanford   Null    1998
Asf8,T:A Very Brady Sequel          Comd      D:Sanford   Null    1996
Atk10,T:Hilary and Jackie           Comd      D:A.Tucker  Null    1998
Atk12,T:Map of Love                 Comd      D:A.Tucker  Null    1999

Table 5. Some objects in cluster-C from the single-valued approach
For example, the two objects "Atk10,T:Hilary and Jackie" and "Atk12,T:Map of Love" are
grouped into different clusters by the single-valued approach. Note that these objects appear in
both clusters under the single-valued approach. Such a clustering result may confuse data analysts.
We could not compare the quality of the clusters for each data set, since no class information is
available for the Movies database.
In the second experiment, we used an online customer database of a local Internet
company that sells popular climate-control products, such as portable heaters and window air
conditioners. The size of this data set is 25,221 records after eliminating redundant,
incomplete, and inconsistent data. Ryu and Chang [Ryu02] have studied this database to identify
the characteristics of customers using decision tree [Quin93] and association rule mining
[Agra93] approaches. They found three major groups of customers for the company, as shown in
Figure 12.
[Figure: a bar chart (0% to 60%) comparing the percentage of buyers in each group with the
average; the three groups are East Coast Immigrants, High Rise Renters, and Young Immigrant
Families.]
Figure 12. Three customer groups with a higher buying tendency. The vertical axis represents the
percentage of buyers.
The attributes and the weights shown in Table 6 were selected and assigned based on the
analysis results of Ryu and Chang [Ryu02].
Attribute     Properties                   Assigned Weight
CustID        Single-valued, Qualitative   0
Age           Single-valued, Quantitative  0.7
Ethnic group  Single-valued, Qualitative   0.7
Amount        Multi-valued, Quantitative   0.6
PayType       Multi-valued, Qualitative    0.6
City          Single-valued, Qualitative   0.8
State         Single-valued, Qualitative   0.8

Table 6. Online customer database
Again, in this experiment, we want to see whether the clusters generated by each data set
representation approach are compatible with the previous analysis results. The clustering result is
shown in Table 7. The multi-valued approach generates fewer clusters than the other
approaches do. However, the number of clusters generated by each approach is much larger than
the three groups shown in Figure 12.
Approach                 Number of Clusters
Single-valued approach   251
Average-valued approach  95
Multi-valued approach    81

Table 7. Number of clusters generated by each approach
We therefore manually examined the contents of each cluster and found that many clusters can
eventually be merged into the three groups shown in Figure 12. This job was much easier for the
clusters generated by the multi-valued approach. For the clusters generated by the
single-valued approach, however, it was very difficult, since the same objects with multi-valued
attributes appear in different clusters. For example, Table 8 shows some objects in a cluster-A
generated by the single-valued approach.
CustID  age  Ethnic group  Amount  payType  City         State
12001   27   A             25      Credit   Brooklyn     NY
13100   30   A             30      Credit   Newark       NJ
12200   33   A             50      Credit   Los Angeles  CA
13200   29   B             55      Credit   Bronx        NY

Table 8. Some objects in cluster-A generated by the single-valued approach
Table 9 shows some objects assigned to other clusters generated by the single-valued approach. As
we can see, the customers 12001 and 13200 in Table 9 are represented as two different objects
and assigned to different clusters; they should have been grouped into either cluster-A or cluster-B.
CustID  age  Ethnic group  Amount  payType  City       State
12001   27   A             280     Paypal   Brooklyn   NY
12005   30   B             125     Paypal   Sunnyside  NY
13200   29   B             280     Paypal   Bronx      NY
13200   29   B             235     Paypal   Bronx      NY

Table 9. Some objects in cluster-B generated by the single-valued approach
There are 157 customers (out of 5,271 customers) grouped into more than one cluster, like the
customers 12001 and 13200. The average-valued approach and the multi-valued approach do not
create this type of confusion. However, the clustering result of the average-valued approach is not
as accurate as that of the multi-valued approach, or even that of the single-valued approach. One
possible reason is that, in the average-valued approach, the mapping from qualitative to
quantitative data, or the representative values for qualitative attributes (the first values picked
from an object's tuples when the mapping is not possible), might be inappropriate (see the example
format in Figure 2 (d)). In summary, the overall quality of the clusters generated by the multi-valued
approach is better than that of the other approaches. In addition, analyzing the clustering result
generated by the multi-valued approach is much easier.
Intuitively, one might expect the run-time of the multi-valued approach to be longer than that of
the other approaches because of the additional computation. However, the overall run-times,
including preprocessing, did not differ much across the approaches. One likely reason is that the
multi-valued approach deals with far fewer records during clustering than the single-valued
approach, while the average-valued approach requires additional preprocessing time.
6 Related Work on Structural Data Analysis
In this section, we review the literature on approaches for dealing with structural data sets. We
categorize those approaches into two general groups: the data set conversion approach, which
converts a database to a single flat data set without modifying the data mining methods, and the
method generalization approach, which generalizes data mining algorithms so that they can be
applied directly to structured objects. We also discuss previously proposed database clustering
algorithms.
6.1 Data Set Conversion Approach
In order to convert a structured data set into a single flat file, related data sets are usually joined,
and various aggregate functions and/or generalization operators are applied to remove
multi-valued attributes (for example, by averaging sets of values or by storing generalizations).
Conventional data mining techniques are then applied to the "flattened" data set without any need
for generalization. Many existing statistical analysis and data mining techniques employ this
approach [Agra93, Shek96, Bisw95, Han96, Haim97].
Nishio et al. [Nish93] proposed generalization operators that convert a set of values into a
higher-level concept description that encompasses the set of values, for data mining in an
object-oriented database framework. For example, the set of values {tennis, soccer, volleyball}
can be generalized to the single higher-level concept "sports". They categorize attributes into
several types, such as single-valued, set-valued, list-valued, and structure-valued attributes, and
they propose generalization mechanisms for each category of attribute.
Applying a generalization operator to the related values may be a reasonable idea, since the
generalized values of an attribute may preserve the related information. However, it may not
always be possible to generalize a set of values into a correct and consistent high-level concept
description, particularly for quantitative attributes, since there can be several ways to generalize
the same set of values. Moreover, in many application domains, suitable generalization
hierarchies for symbolic attributes are not available. Gibson [Gib00], and similarly Ganti [Gan99],
introduce novel formalizations of a cluster for categorical attributes and propose clustering
algorithms for data sets with categorical attributes.
DuMouchel et al. [DuMo99] proposed a methodology that squashes flat files using
statistical approaches, mainly to address the scalability problem of data mining. The methodology
consists of three steps: grouping, momentizing, and generating. These steps form a
squashing pipeline in which the original data set is first sectioned off into mutually exclusive groups;
within each group, a series of low-order moments is computed; finally, these moments are
passed to a routine that generates pseudo-data that accurately reproduce the moments. They
claim that the squashed data set preserves the structure of the original data set.
6.2 Method Generalization Approach
The other way to cope with structured data sets is to generalize existing data mining methods so
that they can perform data mining tasks in structured domains. A few approaches that directly
represent structured data sets using more complex data structures, and that generalize data
mining techniques for those data structures, have been proposed in the literature [Gold95, Haye78,
Step86, Mich83, Wass85, Thom91, Kett95, Mana91, Mcke96, Hold94, Biss92, Kiet94, Kauf96].
We review only the approaches we consider most relevant to database clustering.
Goldberg and Senator [Gold95] restructure databases for data mining by consolidation and
link formation. Consolidation relates identifiers present in a database to a set of real-world
entities (RWEs) that are not uniquely identified in the database. This process can be viewed as
a transformation of representation from the identifiers present in the original database to the
RWEs. Link formation constructs structured relationships between consolidated RWEs through
identifiers and events explicitly represented in the database. Both consolidation and link
formation may be interpreted as transformations of representation from the identifications
originally present in a database to the RWEs of interest. McKearney and Roberts [Mcke96]
produce a single data set for data mining by generating a query after analyzing the dependencies
between attributes and the relationships between data sets (e.g., relations or classes). This is
somewhat similar to our approach, except that our approach employs a set of queries (not a
single query) to construct modular units for similarity assessment.
LABYRINTH [Thom91] is a system that extends the well-known conceptual clustering
system COBWEB [Fish87] to structured domains. It forms concept hierarchies incrementally
and integrates many features of earlier systems, such as incremental, probabilistic, and
unsupervised learning, as well as the handling of relationships and components. For example, it
learns probabilistic concepts and decomposes objects into sets of components to constrain
matching, as in MERGE [Wass85]. LABYRINTH can form effective generalizations by using a
more powerful structured representation language.
Ketterlin et al. [Kett95] also generalize COBWEB to cope with complex (or composite)
objects, i.e., objects that are related to many other objects (components) through 1:n or n:m
relationships in a structured database. The basic idea of their system is to find a characterization
of a cluster of composite objects using component clustering; that is, components are clustered
first, leading to a component-cluster hierarchy, and then the composite objects themselves are
clustered.
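The following Python sketch conveys this two-stage idea: components are clustered first, and
each composite object is then re-described by the bag of component clusters it contains. The
one-dimensional bucketing used for the component clustering, and all data, are hypothetical
simplifications, not the actual algorithm of [Kett95]:

from collections import Counter

def cluster_components(components, width=10.0):
    """Toy component clustering: bucket one-dimensional component values."""
    return {cid: int(v // width) for cid, v in components.items()}

def describe_composites(composites, labels):
    """Re-describe each composite object as a bag (Counter) of the
    clusters that its components fall into."""
    return {name: Counter(labels[c] for c in comps)
            for name, comps in composites.items()}

components = {"a": 1.0, "b": 2.5, "c": 11.0, "d": 12.0}
composites = {"X": ["a", "b"], "Y": ["c", "d"], "Z": ["a", "c"]}
labels = cluster_components(components)
print(describe_composites(composites, labels))
# X -> {0: 2}, Y -> {1: 2}, Z -> {0: 1, 1: 1}; the composite objects can
# now be clustered by comparing these bags.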
The systems KBG [Biss92] and KLUSTER [Kiet94] both employ high-level languages
(respectively first-order logic and description logic). Both systems build a DAG (directed acyclic
graph) of clusters, instead of a hierarchy.
Ribeiro et al. [Ribe95, Kauf96] extend the discovery system INLEN [Kauf91] to discover
knowledge in multiple data sets. In order to discover knowledge across multiple data sets, they
include information on primary and foreign keys for the target data sets (e.g., relations); that is,
keys serve as the links across the data sets, as is the case in our approach. INLEN first discovers
knowledge in each database or relation; then the discovered knowledge is associated with related
information using foreign-key information; finally, all the knowledge discovered for each
database is integrated into a single knowledge base.
Gibson et al. [Gibs98] proposed an approach for clustering categorical data based on
dynamical systems. The approach handles the similarity arising from the co-occurrence of values
in a data set using an iterative method for assigning and propagating weights on the categorical
values. Ganti et al. [Gant99] proposed an improved approach called CACTUS for categorical
data clustering, based on inter-attribute and intra-attribute summaries that are used to compute
"candidate" clusters, which can then be validated to determine the actual set of clusters.
6.3 Database Clustering Algorithms
Several clustering algorithms for large databases have been proposed, such as CLARANS
[Ng94], DBSCAN [Este96], BIRCH [Zhan96], and STING [Wang97]. However, most of these
algorithms target spatial databases, not structured databases such as business-oriented relational
databases or object-oriented databases. Moreover, like many conventional clustering algorithms,
they also make the one-tuple-one-object assumption.
Bradley et al. [Brad98] proposed a scalable clustering framework for large databases based
on identifying regions of the data that are compressible, regions that must be maintained in
memory, and regions that are discardable. The framework focuses on the scalability of database
clustering algorithms.
7 Summary and Conclusion
In this paper, methodologies, techniques, and tools for clustering databases were introduced. One
critical problem of database clustering is the data model discrepancy between the representation
format used to store the target data and the input data format that clustering algorithms expect. In
most databases, data are stored in several tables or classes, and related information is represented
as relationships among those tables or classes, while most traditional clustering algorithms
assume that input data are stored in a single flat file. Based on this observation, we showed that
the traditional flat file format is not appropriate for storing related information, since it restricts
each attribute in a data set to a single value, whereas once related objects in related tables or
classes are combined, objects are frequently characterized by bags of
values (or tuples). We proposed a better data representation format that relies on bags of tuples
and modular units for database clustering, and introduced similarity measures that are suitable for
our proposed representation framework. Moreover, we reviewed the features of a database
clustering environment that employs the proposed representation framework.
We proposed a unified framework of similarity measures to cope with mixed-type attributes
that may have sets or bags of values, and with the object relationships commonly found in
databases. Most of the similarity measures that we recommend were introduced in the literature
long ago; however, we also introduced a new similarity measure that allows attribute similarity
to be defined in the context of other attributes.
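As one concrete illustration of a bag-oriented measure consistent with this framework, the
following Python sketch computes a multiset generalization of the Jaccard coefficient; it conveys
the flavor of the measures discussed, and is not the exact formula proposed in this paper:

from collections import Counter

def bag_similarity(bag1, bag2):
    """Similarity of two bags: |intersection| / |union|, where multiset
    intersection and union take the min and max of multiplicities."""
    c1, c2 = Counter(bag1), Counter(bag2)
    inter = sum((c1 & c2).values())  # min of multiplicities
    union = sum((c1 | c2).values())  # max of multiplicities
    return inter / union if union else 1.0

# Two customers characterized by bags of purchased product categories:
print(bag_similarity(["food", "food", "toys"], ["food", "toys", "toys"]))
# 0.5 -- one shared 'food' and one shared 'toys' out of four in the union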
We performed experiments in which different types of data sets were clustered using the
nearest-neighbor algorithm, and analyzed the results. In these experiments, we conducted a
cluster analysis for a data set represented in various formats, including a single-valued data set,
an average-valued data set, and a multi-valued data set, to evaluate the effectiveness of the
proposed framework. Based on our analysis, we found that the proposed multi-valued data
representation approach produced clearer (higher-quality) clustering results than the traditional
data representation approaches (single-valued or average-valued data sets). Interpreting the
clustering results generated by the proposed approach is also easier. In the clustering results
generated by the traditional approaches, as we expected, the same objects with multiple values
were grouped into different clusters, which may confuse data analysts.
Future Work and Issues
Although we claim that the introduced knowledge representation framework is useful for
representing related information, there are still several issues that have not yet been analyzed or
understood in sufficient detail. In general, the proposed representation framework is well suited
to clustering algorithms that compute the similarity directly between pairs of objects;
nearest-neighbor clustering and DBSCAN [Este96] belong to this category. For other clustering
algorithms, such as K-means and COBWEB, a major modification would be required in order to
cluster object views based on bags of tuples.
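To illustrate why such algorithms fit the framework, the following Python sketch implements a
simple threshold-based sequential nearest-neighbor clustering that requires only a pairwise
similarity function and therefore works directly on bag-based object views; the similarity
function and threshold are illustrative assumptions, not the exact configuration used in our
experiments:

from collections import Counter

def bag_sim(a, b):
    """Multiset Jaccard similarity between two bags."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0

def nn_clustering(objects, sim, threshold=0.3):
    """Assign each object to the cluster containing its most similar
    already-clustered object if that similarity exceeds the threshold;
    otherwise start a new cluster."""
    clusters = []
    for obj in objects:
        best, best_sim = None, threshold
        for cluster in clusters:
            for member in cluster:
                s = sim(obj, member)
                if s > best_sim:
                    best, best_sim = cluster, s
        if best is not None:
            best.append(obj)
        else:
            clusters.append([obj])
    return clusters

bags = [["food", "toys"], ["food", "food"], ["tools"], ["tools", "tools"]]
print(nn_clustering(bags, bag_sim))
# [[['food', 'toys'], ['food', 'food']], [['tools'], ['tools', 'tools']]]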
Generalizing decision tree algorithms such as C4.5 [Quin93] to cope with structural
information seems to be another challenging problem. One way to approach this problem would
be to generalize the decision tree learning algorithm so that it can be applied to bags of values
(or tuples). Such a generalization could reuse the preprocessing techniques described in this
paper. This would make it possible to apply concept learning algorithms directly to databases
that consist of multiple related data sets, such as relational databases, which is currently not
possible.
Another subject that needs to be investigated further is the scalability of the nearest-neighbor
clustering framework we proposed. When running our prototype system, we observed that as the
number of objects and/or the amount of information associated with a particular object grows,
the construction of the object similarity matrix becomes a performance bottleneck for our
clustering algorithm. We believe that, in order to obtain better scalability of our methods, special
efficient
data structures for object views need to be developed that facilitate the construction of object
views and the similarity computations for pairs of objects.
Bibliography
[Ande73] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[Agra93] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, In Proc. of ACM SIGMOD, pp. 207-216, 1993.
[Ashb88] F.G. Ashby, N.A. Perrin, Toward a unified theory of similarity and recognition, Psychological Review 95(1), pp. 124-150, 1988.
[Biss92] G. Bisson, Conceptual clustering in a first order logic representation, In Proc. of the Tenth European Conference on Artificial Intelligence, John Wiley & Sons, 1992.
[Bisw95] G. Biswas, J. Weinberg, C. Li, ITERATE: A conceptual clustering method for knowledge discovery in databases, In Innovative Applications of Artificial Intelligence in the Oil and Gas Industry, B. Braunschweig, R. Day (Ed.), 1995.
[Brad98] P.S. Bradley, U. Fayyad, C. Reina, Scaling clustering algorithms to large databases, In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, 1998.
[Chee96] P. Cheeseman, J. Stutz, Bayesian classification (AutoClass): Theory and results, In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Ed.), AAAI/MIT Press, Cambridge, MA, pp. 153-180, 1996.
[Domi96] P. Domingos, Linear-time rule induction, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[DuMo99] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, In Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, California, 1999.
[Este96] M. Ester, H-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proc. of the Second Knowledge Discovery and Data Mining Conference, Portland, Oregon, 1996.
[Ever93] B.S. Everitt, Cluster Analysis, 3rd edition, Edward Arnold, co-published by Halsted Press, an imprint of John Wiley & Sons Inc., 1993.
[Fish87] D. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2, pp. 139-172, 1987.
[Gib00] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: An approach based on dynamical systems, VLDB Journal 8(3-4), pp. 222-236, 2000.
[Gant99] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS - Clustering categorical data using summaries, In Proc. of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, California, pp. 73-83, 1999.
[Gibs98] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: An approach based on dynamical systems, In Proc. of the 24th International Conference on Very Large Databases, New York, 1998.
[Gowe71] J.C. Gower, A general coefficient of similarity and some of its properties, Biometrics 27, pp. 857-872, 1971.
[Haim97] I.J. Haimowitz, O. Gur-Ali, H. Schwarz, Integrating and mining distributed customer databases, In Proc. of the 3rd Int'l Conf. on Knowledge Discovery and Data Mining, Newport Beach, California, 1997.
[Han96] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O.R. Zaiane, DBMiner: A system for mining knowledge in large relational databases, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Hart75] J.A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., 1975.
[Haye78] F. Hayes-Roth, J. McDermott, An interference matching technique for inducing abstractions, Communications of the ACM 21, pp. 401-410, 1978.
[Han01] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[Hold94] L.B. Holder, D.J. Cook, S. Djoko, Substructure discovery in the SUBDUE system, In Proc. of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, Washington, 1994.
[UCIML04] http://www.ics.uci.edu/AI/ML/Machine-Learning.html, 2004.
[Jain88] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[Jarv73] R.A. Jarvis, E.A. Patrick, Clustering using a similarity measure based on shared near neighbors, IEEE Transactions on Computers C-22, pp. 1025-1034, 1973.
[Kauf91] K.A. Kaufman, R.S. Michalski, L. Kerschberg, Mining for knowledge in databases: Goals and general description of the INLEN system, In Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[Kauf96] K.A. Kaufman, R.S. Michalski, A method for reasoning with structured and continuous attributes in the INLEN-2 multistrategy knowledge discovery system, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Kett95] A. Ketterlin, P. Gancarski, J.J. Korczak, Conceptual clustering in structured databases: A practical approach, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec, 1995.
[Kiet94] J.-U. Kietz, K. Morik, A polynomial approach to the constructive induction of structural knowledge, Machine Learning 14, pp. 193-217, 1994.
[Lu78] S.Y. Lu, K.S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Transactions on Systems, Man and Cybernetics SMC-8, pp. 381-389, 1978.
[Mana91] M. Manago, Y. Kodratoff, Induction of decision trees from complex structured data, In Knowledge Discovery in Databases, AAAI/The MIT Press, pp. 289-306, 1991.
[Mart97] M. Fowler, K. Scott, UML Distilled: Applying the Standard Object Modeling Language, Addison Wesley Longman Inc., 1997.
[Mich83] R.S. Michalski, R.E. Stepp, Learning from observation: Conceptual clustering, In Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA, pp. 331-363, 1983.
[Ng94] R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, In Proc. of the 20th Int'l Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155, 1994.
[Nish93] S. Nishio, H. Kawano, J. Han, Knowledge discovery in object-oriented databases: The first step, In Proc. of the AAAI-93 Workshop on Knowledge Discovery in Databases (KDD-93), Washington, 1993.
[Pei01] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M-C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, In Proc. of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2001.
[Quin93] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[Ribe95] J.S. Ribeiro, K. Kaufman, L. Kerschberg, Knowledge discovery from multiple databases, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec, Canada, 1995.
[Ryu98a] T.W. Ryu, C.F. Eick, Discovering discriminant characteristic queries from databases through clustering, In Proc. of the Fourth International Conference on Computer Science and Informatics (CS&I'98), Research Triangle Park, NC, 1998.
[Ryu98b] T.W. Ryu, Discovery of Characteristic Knowledge in Databases using Cluster Analysis and Genetic Programming, Ph.D. Dissertation, Department of Computer Science, University of Houston, Houston, 1998.
[Ryu98c] T.W. Ryu, C.F. Eick, Similarity measures for multi-valued attributes for database clustering, In Proc. of the Conference on Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Data Mining and Rough Sets (ANNIE'98), St. Louis, Missouri, 1998.
[Ryu02] T.W. Ryu, W-Y. Chang, Customer analysis using decision tree and association rule mining, In Proc. of the International Conference on Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Artificial Life, and Data Mining (ANNIE'02), ASME Press, St. Louis, Missouri, 2002.
[Sala00] H. Salameh, Nearest-Neighbor Clustering Algorithm for Relational Databases, Master of Science Thesis, Department of Computer Science, California State University, Fullerton, 2000.
[Shek96] E.C. Shek, R.R. Muntz, E. Mesrobian, K. Ng, Scalable exploratory data mining of distributed geoscientific data, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Shep62] R.N. Shepard, The analysis of proximities: Multidimensional scaling with an unknown distance function, Part I, Psychometrika 27, pp. 125-140, 1962.
[Stan86] C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM 29, pp. 1213-1228, 1986.
[Step86] R.E. Stepp, R.S. Michalski, Conceptual clustering: Inventing goal-oriented classifications of structured objects, In Machine Learning: An Artificial Intelligence Approach 2, Morgan Kaufmann, San Mateo, CA, pp. 471-498, 1986.
[Thom91] K. Thompson, P. Langley, Concept formation in structured domains, In Concept Formation: Knowledge and Experience in Unsupervised Learning, D.H. Fisher, M. Pazzani, P. Langley (Ed.), Morgan Kaufmann, 1991.
[Tver77] A. Tversky, Features of similarity, Psychological Review 84(4), pp. 327-352, 1977.
[Wang97] W. Wang, J. Yang, R.R. Muntz, STING: A statistical information grid approach to spatial data mining, In Proc. of VLDB, pp. 186-195, 1997.
[Wass85] K. Wasserman, Unifying Representation and Generalization: Understanding Hierarchically Structured Objects, Doctoral Dissertation, Department of Computer Science, Columbia University, New York, 1985.
[Wils97] D.R. Wilson, T.R. Martinez, Improved heterogeneous distance functions, Journal of Artificial Intelligence Research 6, pp. 1-34, 1997.
[Zhan96] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: An efficient database clustering method for very large databases, In Proc. of the ACM-SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pp. 103-114, 1996.
[Zehu98] W. Zehua, Design and Implementation of a Tool to Extract Structural Information from Relational Databases, Master of Science Thesis, Department of Computer Science, University of Houston, Houston, 1998.