IEEE TRANSACTIONS ON TKDE, MANUSCRIPT ID
A Unifying Domain-Driven Framework for
Clustering with Plug-In Fitness Functions and
Region Discovery
Christoph F. Eick, Oner U. Celepcikay, Rachsuda Jiamthapthaksin and
Vadeerat Rinsurongkawong
Abstract—The main challenge in developing methodologies for domain-driven data mining is incorporating domain knowledge and domain-specific evaluation measures into data mining algorithms and tools, so that "actionable knowledge" can be discovered. In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific plug-in fitness functions that are maximized by the clustering algorithm. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed. The framework also incorporates domain knowledge through preprocessing and post-processing steps and parameter selections. This paper introduces the framework in detail and illustrates it through demonstrations and case studies that center on spatial clustering and region discovery. Moreover, the paper introduces an ontology and a theoretical foundation for clustering with fitness functions in general, and region discovery in particular. Finally, intensional clustering algorithms that operate on cluster models are introduced.
Index Terms—Clustering, Data Mining, Spatial Databases and GIS, Domain-driven Data Mining
1 INTRODUCTION
To extract knowledge from the immense amounts of data generated by advances in data acquisition technologies has been a major focus of data mining research over the last 20 years. However, it has been observed that knowledge obtained from traditional, data-driven data mining algorithms in domain-specific applications is not really actionable [1], because the extracted knowledge does not capture what domain experts are interested in. This observation can be explained by two limitations of traditional data mining: 1) traditional data mining algorithms insufficiently incorporate domain intelligence to aid the mining process, and 2) the algorithms use technical significance as their sole evaluation measure.
As far as the first limitation is concerned, domain intelligence includes the involvement of domain knowledge, domain-specific constraints, and experts. Consider a situation in which a clustering algorithm is used to identify clusters in a specific domain. Different clustering algorithms have their own assumptions about clustering criteria, e.g. tightness, connectivity, separation, and so on. Because clustering is NP-hard, clustering algorithms focus their search efforts on clusters that maximize those criteria, frequently generating "optimal" but uninteresting clusters. Clustering with constraints attempts to alleviate this problem by incorporating must-link and cannot-link constraints to better guide the search for good clusters [2]. The second limitation occurs because in traditional data mining, the actionability of knowledge is determined solely by technical significance based on domain-independent criteria [1]; this type of measure usually differs from domain-specific expectations and measures of interestingness. To address this problem, both technical and domain-specific significance should be considered when assessing cluster quality. Consequently, the main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into data mining algorithms and tools so that actionable knowledge can be discovered.
In this paper, we propose a unifying domain-driven
clustering framework that provides families of clustering
algorithms with plug-in fitness functions capable of discovering actionable knowledge. The fitness function is the
core component of the framework, as it captures the domain expert’s notion of interestingness. The fitness function is specifically designed to be externally plugged-in to
provide extensibility and flexibility; the fitness function
component of the framework is independent of the clustering algorithms employed.
In general, families of task- and/or domain-specific
fitness functions are employed to capture the domain
interestingness and to incorporate domain knowledge.
————————————————
 Authors are with the Department of Computer Science, University of Houston, Houston, TX, 77204.
 Emails: (ceick, onerulvi, rachsuda, vadeerat)@cs.uh.edu
Manuscript received (03/31/2009).

For example, let us consider a data mining task in which geologists are interested in discovering hotspots in geographical space where deep earthquakes are in close proximity to shallow earthquakes. That is, they are interested in identifying contiguous regions in an earthquake data set for which the variance of the variable earthquake_depth is high. When using our framework, the geologist's notion of interestingness is captured in the form of a High Variance fitness function, formally defined in Section 3. The domain expert additionally selects parameters to instruct the clustering algorithm in what patterns they are really interested in: an earthquake-depth variance threshold and a parameter that controls the granularity and size of the spatial clusters discovered.
Next, a clustering algorithm is run with the parameterized High Variance fitness function, and high-variance earthquake-depth hotspots are obtained, as displayed in Fig. 1.
Fig. 1. Examples of interesting regions discovered by a domain-driven clustering algorithm using a High Variance fitness function
Our framework incorporates domain knowledge not only through domain-specific fitness functions, but also through preprocessing and post-processing steps, fitness function parameter selections including seed patterns, threshold parameter values that are suitable for a specific domain, and desired cluster granularities. The family of clustering algorithms supported by the framework includes divisive, grid-based, prototype-based, and agglomerative clustering algorithms, all of which support plug-in fitness functions.
The first high-level domain-driven data mining framework was introduced by Cao and Zhang [1]. In this framework, domain intelligence is incorporated into the KDD process towards actionable knowledge discovery, and the framework has been illustrated through mining activity patterns in social security data. They also proposed criteria to measure the actionability of the discovered knowledge. Yang [3] introduced a framework with two techniques to produce actionable output from traditional KDD output models. The first technique uses an algorithm for extracting actions from decision trees such that each test instance falls in a desirable state. The second technique uses an algorithm that can learn relational action models from frequent item sets. This technique is applied to automatic planning systems, and Yang's Action-Relation Modeling System (ARMS) automatically acquires action models from recorded user plans.
One subcomponent of the domain knowledge that must be incorporated into any domain-driven data mining framework is human intelligence, and Multiaspect Data Analysis (MDA) is an important Brain Informatics methodology. Brain Informatics considers the brain as an information-processing system in order to understand its mechanisms for analyzing and managing data. But since brain researchers cannot use MDA results directly, Zhong [4] proposes a methodology that employs an explanation-based reasoning process that combines multiple source data into more general results to form actionable knowledge. Zhong's framework basically takes traditional KDD output as input to an explanation-based reasoning process that generates actionable output. The concept of moving from method-driven or data-driven data mining to domain-driven data mining has recently been proposed and is featured in [5]. The authors describe four aspects of moving data mining from a method-driven approach to a process that focuses on domain knowledge. In general, the use of plug-in fitness functions is not very common in traditional clustering; the only exception is the CHAMELEON [6] clustering algorithm. However, fitness functions play a more important role in semi-supervised and supervised clustering [7] and in adaptive clustering [8].
The main contributions of this paper are that it:
1. Introduces a unifying domain-driven clustering framework for actionable knowledge discovery.
2. Proposes a novel domain-specific fitness function model that is plugged into clustering algorithms externally to capture domain interestingness.
3. Presents a set of fitness functions capable of serving clustering tasks in various domains.
4. Introduces a family of clustering algorithms, most of which have been developed in our previous work, as part of the framework, and introduces novel intensional clustering algorithms that directly manipulate cluster models.
5. Illustrates the deployment of the proposed framework and its benefits in challenging real-world case studies.
The remainder of this paper is organized as follows: In Section 2, we formally present our domain-driven clustering framework. Section 3 provides a detailed discussion of domain-specific plug-in fitness functions, including three examples. Section 4 introduces the family of clustering algorithms provided in our framework, and Section 5 illustrates the framework through demonstrations and case studies. Section 6 concludes the paper.
2 SPATIAL CLUSTERING WITH PLUG-IN FITNESS FUNCTIONS
2.1 Preview
As mentioned in the introduction, the goal of this paper is to introduce a highly generic clustering framework that supports plug-in fitness functions to capture domain interestingness. As we will discuss later, the framework is very general and can be used for traditional clustering. However, because almost all of our applications involve spatial data mining, the remainder of this paper will mostly focus on spatial clustering and on region discovery in particular. The goal of spatial clustering is to identify interesting groups of objects in the subspace of the spatial attributes. Region discovery is a special type of spatial clustering that focuses on finding interesting places in spatial datasets. Moreover, in this section and in Section 4 a theoretical foundation and ontology for clustering with plug-in fitness functions is introduced. Finally, novel intensional clustering algorithms are introduced.
2.2 An Architecture for Region Discovery
As depicted in Figure 2, the proposed region discovery
framework consists of three key components. The first
two components are families of clustering algorithms and
fitness functions that play a major role in discovering interesting regions and their associated patterns. As we will discuss in more detail shortly, the framework uses clustering algorithms that support plug-in fitness functions to find interesting regions in spatial datasets. Decoupling cluster evaluation from the search for good clusters creates flexibility in using any clustering algorithm with any fitness function. The role of the third component is to manage and integrate datasets residing in several repositories; it will not be discussed further in this paper.
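The decoupling described above can be illustrated with a minimal sketch (hypothetical names; the framework's actual interfaces are not shown in this paper): the search procedure receives the fitness function q as a callable and treats it as a black box, so any fitness function with the same signature can be plugged in unchanged.

```python
from typing import Callable, List

# A clustering is a list of clusters; each cluster is a list of object indices.
Clustering = List[List[int]]
FitnessFunction = Callable[[Clustering], float]

def greedy_merge_clustering(n_objects: int, q: FitnessFunction) -> Clustering:
    """Toy search that treats q as a black box: start with singleton
    clusters and greedily merge the pair that most improves q(X)."""
    clustering: Clustering = [[i] for i in range(n_objects)]
    improved = True
    while improved and len(clustering) > 1:
        improved = False
        base = q(clustering)
        best_gain, best_pair = 0.0, None
        for i in range(len(clustering)):
            for j in range(i + 1, len(clustering)):
                merged = [c for k, c in enumerate(clustering) if k not in (i, j)]
                merged.append(clustering[i] + clustering[j])
                gain = q(merged) - base
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is not None:
            i, j = best_pair
            new_clustering = [c for k, c in enumerate(clustering) if k not in (i, j)]
            new_clustering.append(clustering[i] + clustering[j])
            clustering, improved = new_clustering, True
    return clustering

# Example plug-in fitness function (our own choice): reward large clusters.
q_size = lambda X: sum(len(c) ** 1.5 for c in X)
```

Note that the search code never inspects the internals of q; this is the sense in which cluster evaluation is decoupled from the search for good clusters.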
Fig. 2. Region Discovery Framework

2.3 Goals and Objectives of Region Discovery
As mentioned earlier, the goal of region discovery is to find interesting places in spatial datasets. Our work assumes that the region discovery algorithms we develop operate on datasets containing objects o1,..,on:
O={o1,..,on}⊆F where F is a relational database schema and the objects belonging to O are tuples that are characterized by the attributes S∪N, where:
S={s1,…,sq} is a set of spatial attributes.
N={n1,..,np} is a set of non-spatial attributes.
Dom(S) and Dom(N) describe the possible values the attributes in S and N can take; that is, each object o∈O is characterized by a single tuple that takes values from Dom(S)×Dom(N)1.
In general, clustering algorithms can be subdivided into intensional and extensional clustering algorithms: extensional clustering algorithms just create clusters for the data set O, partitioning O into subsets, but do nothing else. Intensional clustering algorithms, on the other hand, create a clustering model based on O and other inputs. Most popular clustering algorithms have been introduced as extensional clustering algorithms, but it is not too difficult to generalize most extensional clustering algorithms so that they become intensional, as we present in Section 5.
Extensional clustering algorithms create clusterings X of O that are sets of disjoint subsets of O:
X={c1,...,ck} with ci⊆O (i=1,…,k) and ci∩cj=∅ (i≠j)
Intensional clustering algorithms create a set of disjoint regions Y in F:
Y={r1,...,rk} with ri⊆F (i=1,…,k) and ri∩rj=∅ (i≠j)
In the case of spatial clustering and region discovery, cluster models have a peculiar structure in that they seek regions in the subspace Dom(S) and not in F itself: a region discovery model is a function2 m: Dom(S)→{1,…,k}∪{⊥} that assigns a region m(p) to a point p in Dom(S), assuming that there are k regions in the spatial dataset; the number of regions k is chosen by the region discovery algorithm that creates the model. Models support the notion of outliers; that is, a point p' can be an outlier that does not belong to any region: in this case, m(p')=⊥.
Intensional region discovery algorithms obtain a clustering Y in Dom(S) that is defined as a set of disjoint regions in Dom(S)3:
Y={r1,...,rk} with ri⊆F[S] (i=1,…,k) and ri∩rj=∅ (i≠j)

1 If S is empty we call the problem a traditional clustering problem. One key characteristic of spatial clustering is that spatial and non-spatial attributes play different roles in the clustering process, which is not the case in traditional clustering.
Moreover, regions r belonging to Y are described as functions over tuples in Dom(S): ψr: Dom(S)→{t,f}, indicating whether a point p∈Dom(S) belongs to r: ψr(p)=t. ψr is called the intension of r. ψr can easily be constructed from the model m of a clustering Y. Moreover, the extension of a region r is defined as follows:
ext(r)={o∈O | ψr(o[S])=t}
In the above definition, o[S] denotes the projection of o on its spatial attributes.
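The intension/extension distinction can be sketched as follows (illustrative names and data layout only): the intension is a predicate over the spatial domain Dom(S), and the extension is obtained by filtering the dataset O through that predicate.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]          # a value in Dom(S), here 2-D space
Intension = Callable[[Point], bool]  # psi_r: Dom(S) -> {t, f}

def extension(O: List[Dict], psi_r: Intension) -> List[Dict]:
    """Extension of region r: all objects o in O whose spatial
    projection o[S] satisfies the region's intension."""
    return [o for o in O if psi_r(o["s"])]

def disk(center: Point, radius: float) -> Intension:
    """A circular region in Dom(S), expressed as an intension."""
    cx, cy = center
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2

# Tiny earthquake-style dataset: "s" is the spatial projection o[S].
O = [{"s": (0.0, 0.0), "depth": 10.0},
     {"s": (1.0, 0.0), "depth": 600.0},
     {"s": (5.0, 5.0), "depth": 30.0}]
r = disk((0.0, 0.0), 2.0)   # intension of a region
members = extension(O, r)   # its extension: the objects inside the disk
```

The intension is defined on the whole spatial domain, so it classifies points that are not in O at all; the extension depends on the dataset.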
Our approach requires discovered regions to be contiguous. To cope with this constraint in extensional clustering, we assume that we have a neighbor relationship no() between the objects in O and a cluster neighbor relationship nc() between regions in X defined with respect to O: if no(o,o') holds, objects o and o' are neighboring; if nc(r,r') holds, regions r and r' are neighboring.
no ⊆ O×O
nc ⊆ 2^O × 2^O
Moreover, neighboring relationships are solely determined by the attributes in S; that is, the temporal and spatial attributes in S are used to determine which objects and clusters are neighboring.
and clusters are neighboring. A region r is contiguous if
for each pair of points u and v in r there is a path between
u and v that solely traverses r and no other regions. More formally, contiguity4 is defined as a predicate over subsets c of O:
contiguous(c) ⇔ ∀w∈c ∀v∈c ∃m≥2 ∃x1,…,xm∈c: w=x1 ∧ v=xm ∧ no(xi, xi+1) (i=1,…,m−1)
contiguous(X) ⇔ ∀c∈X: contiguous(c)
Our approach employs arbitrary plug-in, reward-based fitness functions to evaluate the quality of a given set of regions. The goal of region discovery is to find a set of regions X that maximizes an externally given fitness function q(X); moreover, q is assumed to have the following structure:

q(X) = Σc∈X reward(c) = Σc∈X i(c)·|c|^β    (1)

where i(c) is the interestingness of a region c, a quantity designed by a domain expert to reflect the degree to which regions are "newsworthy". The number of objects in O belonging to a region c is denoted by |c|, and the quantity i(c)·|c|^β can be considered a "reward" given to region c; we seek X such that the sum of the rewards over all of its constituent regions is maximized. The premium put on the size of a region is controlled by the parameter β (β>1). A region's reward is proportional to its interestingness, but larger regions receive a higher reward than smaller regions having the same interestingness value, to reflect a preference for larger regions. Furthermore, the fitness function q is assumed to be additive: the reward associated with X is the sum of the rewards of its constituent regions.
The reader might ask why we restrict the form of the fitness functions in our proposed framework. The main reason is our desire to develop efficient clustering algorithms for region discovery. Restricting the form of the fitness functions supported allows us to use knowledge about the structure of the fitness function to obtain faster clustering algorithms that employ pruning, incremental updating, and sophisticated search strategies. This topic will be revisited in Section 4, where specific clustering algorithms are introduced.
Given a spatial dataset O, there are many possible clustering algorithms to seek interesting regions in O with respect to a plug-in fitness function q. In general, the objective of region discovery with plug-in fitness functions is:
Given: O, q, and possibly other input parameters
Find: regions r1,...,rk that maximize q({r1,...,rk}) subject to the following constraints:
(1a) ri⊆O (i=1,…,k) for extensional clustering
(1b) ri⊆F[S] (i=1,…,k) for intensional clustering
(2) contiguous(ri) (i=1,..,k)
(3) ri∩rj=∅ (i≠j)
It should be emphasized that the number of regions k is not an input parameter in the proposed framework; that is, region discovery algorithms are assumed to seek for the optimal number of regions k.

2 ⊥ denotes "undefined".
3 F[S] denotes the projection of F on the attributes in S.
4 Other alternative definitions of contiguity exist, but will not be discussed in this paper due to lack of space.

3 DOMAIN-SPECIFIC PLUG-IN FITNESS FUNCTIONS
The fitness function, whose general form was given in equation (1), is the core component of our framework for capturing the domain's notion of interestingness. The main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into the data mining task so that "actionable knowledge" can be discovered. For example, in region discovery, the framework searches for interesting subspaces and then extracts regional knowledge from the obtained subspaces, which provides crucial knowledge for domain experts.
The fitness function is specifically designed to be externally plugged in to provide extensibility and flexibility. The fitness function component of the framework is independent of the clustering algorithm employed, and for each domain a domain-specific fitness function is designed to capture the domain's interestingness and incorporate domain knowledge. Because the fitness function is external and encapsulated from the rest of the framework, any change in the framework, such as a parameter change or a change in the clustering algorithm, will not affect the fitness function. Likewise, changes to the fitness function that come from domain requirements will not affect the clustering algorithm, and so on. This design makes the framework flexible and extensible to meet domain needs and requirements.
In order to illustrate how the notion of domain interestingness and domain-specific fitness functions are used in domain-driven data mining and in discovering actionable knowledge, we provide several examples of such fitness functions in the remainder of this section.
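The additive reward structure of equation (1) can be sketched as follows (a minimal illustration under our own assumptions; the interestingness function i and the value of β are placeholders that a domain expert would choose):

```python
from typing import Callable, List, Sequence

def fitness_q(X: List[Sequence[float]],
              i: Callable[[Sequence[float]], float],
              beta: float = 1.01) -> float:
    """q(X) = sum over regions c of i(c) * |c|**beta, as in equation (1).
    beta > 1 puts a premium on larger regions."""
    assert beta > 1.0
    return sum(i(c) * len(c) ** beta for c in X)

# Placeholder interestingness: mean of the attribute values in a region.
mean_i = lambda c: sum(c) / len(c)

# Two candidate clusterings of the same six attribute values:
X1 = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]   # fewer, larger regions
X2 = [[1.0, 5.0], [1.0, 5.0], [1.0, 5.0]] # more, smaller regions
```

Because q is additive, the reward of a clustering can be computed region by region, which is what makes the incremental updating and pruning mentioned above possible.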
3.1 PCA-based Fitness Function
Finding interesting regional correlation patterns that help summarize the characteristics of a region is important to domain and business people, since many patterns only exist at a regional level, but not at the global level. Moreover, using regional patterns, which are normally hidden globally, domain or business people can understand the structure of the data and make business or domain decisions by analyzing these correlation patterns. For example, a strong correlation between a fatal disease and a set of chemical concentrations in Texas water wells might not be detectable throughout Texas, but a strong correlation pattern might exist regionally, which is also a reflection of Simpson's paradox [9]. This type of regional knowledge is crucial for domain experts who seek to understand the causes of such diseases and predict future cases. Identifying a sub-region in South Texas with 35 water wells that demonstrates a unique and strong correlation between arsenic, another chemical in the water of those wells, and a high occurrence of the disease in this region might suggest to domain experts the possible existence of nearby toxic waste, and provide valuable actionable knowledge that helps them understand the cause of the dangerous amounts of arsenic in the water wells, then develop a solution to this problem and prevent future incidents. An example of discovered regions along with highly correlated attribute sets is given in Fig. 3. This is an application of our framework using the PCA-based fitness function on Texas Water Wells data [10]; the fact that the correlation sets for each region show significant differences emphasizes the importance of regional pattern discovery.
Fig. 3. An Example of Regional Correlation Patterns for Chemical
Concentrations in Texas
In order to discover regions where sets of attributes are highly correlated, we need a fitness function that rewards high correlation and enables our framework to discover such regions. Principal Component Analysis (PCA) is a good candidate, since the directions identified by PCA are the eigenvectors of the correlation matrix, and each eigenvector has an associated eigenvalue that is a measure of the corresponding variance. The Principal Components (PCs) are ordered in descending order with respect to the variance associated with each component. The eigenvectors of the PCs can help to reveal correlation patterns among sets of attributes.
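For the two-attribute case the connection between correlation and eigenvalues is easy to see: the correlation matrix [[1, ρ], [ρ, 1]] has eigenvalues 1+|ρ| and 1−|ρ|, so strong correlation concentrates the variance in the first PC. A minimal sketch (pure Python, two attributes only; the framework itself runs full PCA on the correlation matrix):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two attribute columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def eigenvalues_2d(rho):
    """Eigenvalues of the 2x2 correlation matrix [[1, rho], [rho, 1]]."""
    return 1 + abs(rho), 1 - abs(rho)

# Perfectly correlated attributes inside a hypothetical region:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
lam1, lam2 = eigenvalues_2d(pearson(xs, ys))  # -> (2.0, 0.0)
```

With a strong correlation, λ1 is large and λ2 is small, which is exactly the eigenvalue profile that the PCA-based interestingness defined below rewards.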
Ideally, it is desirable to have high eigenvalues for the
first k PCs, since this means that a smaller number of PCs
will be adequate to account for the threshold variance
which overall suggests that a strong correlation among
variables exists [11]. The PCA-based fitness function is
defined next.
Let λ1, λ2,…,λk be the eigenvalues of the first k PCs, with k being a parameter. PCA-based interestingness is estimated using formula (2):

iPCA(r) = (λ1² + … + λk²)/k    (2)

The PCA-based fitness function then becomes:

qPCA(R) = Σj=1,…,k iPCA(rj)·size(rj)^β    (3)

The fitness function rewards high eigenvalues for the first k PCs. By taking the square of each eigenvalue we ensure that regions with a higher spread in their eigenvalues obtain higher rewards, reflecting the higher importance assigned in PCA to higher-ranked principal components.
Moreover, a generic pre-processing technique to select the best k value for the PCA-based fitness function is based on a variance threshold that decides how many PCs to retrieve. This variance threshold is also domain-specific and is set based on the available domain knowledge, to ensure that an appropriate k value is selected for each dataset from different domains, reflecting the concerns and constraints implied by domain knowledge.
The PCA-based fitness function repeatedly applies PCA during the search for the optimal set of regions, maximizing the eigenvalues of the first k PCs in each region. Having an externally plugged-in PCA-based fitness function enables the clustering algorithm to probe for an optimal partitioning, and encourages the merging of two regions that exhibit structural similarities in their correlation patterns. This approach is more advantageous than applying PCA just once or multiple times on the data using other tools, since the PCA-based fitness function is applied repeatedly to candidate regions to explore each possible region combination.

3.2 Co-location Fitness Function
Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more patterns co-locate in spatial proximity. In the following we introduce an interestingness function for co-location sets involving objects that are characterized by continuous attributes (see also [12] for background on the described approach). The pattern A↑ denotes that attribute A has high values, and the pattern A↓ indicates that attribute A has low values. For example, the pattern {A↑, B↓, D↑} describes that high values of A are co-located with low values of B and high values of D.
Let O be a dataset,
c be a region,
o∈O be an object in the dataset O,
N={A1,…,Aq} be the set of non-geo-referenced continuous attributes in the dataset O,
Q={A1↑, A1↓,…, Aq↑, Aq↓} be the set of possible base co-location patterns, and
B⊆Q be a set of co-location patterns.
Let z-score(A,o) be the z-score of object o's value of attribute A:

z↑(A,o) = z-score(A,o) if z-score(A,o) > 0; 0 otherwise    (4)

z↓(A,o) = −z-score(A,o) if z-score(A,o) < 0; 0 otherwise    (5)
The interestingness of an object o with respect to a co-location set B⊆Q is measured as the product of the z-values of the patterns in the set B. It is defined as follows:

i(B,o) = Πp∈B z(p,o)    (6)

where z(p,o) is called the z-value of base pattern p∈Q for object o.
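The clipped z-values of formulas (4) and (5) can be sketched as follows (an illustration only, operating on a precomputed z-score):

```python
def z_up(z_score: float) -> float:
    """z-up: keep only above-average values, as in formula (4)."""
    return z_score if z_score > 0 else 0.0

def z_down(z_score: float) -> float:
    """z-down: keep only below-average values, made positive, as in formula (5)."""
    return -z_score if z_score < 0 else 0.0
```

Both functions are non-negative, so the product in formula (6) is positive exactly when every base pattern in B is present for the object.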
In general, the interestingness of a region can be straightforwardly computed by using the average interestingness of the objects belonging to the region. However, using this approach, some very large products might dominate the interestingness computations. For some domain experts, just finding a few objects with very high products in close proximity to each other is important, even if the remaining objects in the region deviate from the observed pattern. In other cases, domain experts are more interested in patterns with highly regular products, so that all or almost all objects in a region share the pattern, and are less interested in a few very high products. To satisfy the needs of different domains, our approach additionally considers purity when computing region interestingness,
where purity(B,c) denotes the percentage of objects o∈c for which i(B,o)>0. In summary, the interestingness of a region c with respect to a co-location set B, denoted by φ(B,c), is computed as follows:

φ(B,c) = (Σo∈c i(B,o) / |c|) · purity(B,c)^θ    (7)

The parameter θ∈[0,∞) controls the importance attached to purity in interestingness computations; θ=0 implies that purity is ignored, and larger values increase the importance of purity.
Fig. 6 depicts regions r in Texas with their highest-valued co-location sets B; that is, the depicted co-location set B has the highest value for φ(B,r).
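Formulas (6) and (7) can be combined in a short sketch (hypothetical data layout: each object is represented as a dict of per-pattern z-values; θ and the purity handling follow equation (7)):

```python
from math import prod  # Python 3.8+

def i_obj(B, o):
    """i(B, o): product of the z-values of the base patterns in B (formula (6))."""
    return prod(o[p] for p in B)

def purity(B, region):
    """Fraction of objects in the region with strictly positive i(B, o)."""
    return sum(1 for o in region if i_obj(B, o) > 0) / len(region)

def phi(B, region, theta=1.0):
    """phi(B, c): average object interestingness weighted by purity**theta
    (formula (7))."""
    avg = sum(i_obj(B, o) for o in region) / len(region)
    return avg * purity(B, region) ** theta

# A region where high A co-locates with low B in 2 of 3 objects:
region = [{"A_up": 2.0, "B_down": 1.5},
          {"A_up": 1.0, "B_down": 2.0},
          {"A_up": 0.0, "B_down": 1.0}]
B = ("A_up", "B_down")
```

Raising θ shrinks the interestingness of regions in which only a few objects carry the pattern, matching the "highly regular products" preference described above.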
3.3 Variance Fitness Function
High Variance Fitness Function is a fitness function to
discover regions where there is high contrast in value of
attribute of interest. For example, in studying of earthquake as discussed in more details in a case study in section 5.2, where attribute of interest is the depth of earthquakes, the domain expert may use High Variance Fitness
Function to find regions where shallow earthquakes are
in close proximity with deep earthquake. The interestingness of a region r, i(r), is defined as follows:

0

i (r )  


 Var (r , attr)  th  * r 

 Var (O, attr)

Var (r , attr)

 th
Var (O, attr)


otherwise 

1
 (attr(o)   attr (r ))2
r  1 or
4 CLUSTERING ALGORITHMS WITH PLUG-IN
FITNESS FUNCTIONS
Another key component of proposed framework is a family of clustering algorithms that allows domain experts to
instruct clustering algorithms to seek clusters that satisfy
their specific requirements. To achieve this flexible clustering capability, several clustering algorithms were designed and implemented that support externally-given
fitness functions that are maximized during the clustering
process. Using different plug-in fitness functions in the
algorithms results in obtaining different, alternative clusters for the same data set. Existing clustering paradigms
have been extended to support plug-in fitness functions,
namely representative-based clustering, agglomerative
clustering, divisive clustering, and grid-based clustering.
Three such clustering algorithms CLEVER [12], MOSAIC
[13], and SCMRG [14] will be briefly introduced and formally described by extending the formal framework that
was introduced in Section 2. Different clustering paradigms are superior with respect to different aspects of
clustering. For example, grid-based clustering algorithms
are able to cluster large datasets quickly, whereas representative-based clustering algorithms discover clusters of
better quality. Finally, agglomerative clustering algorithms are capable of identifying arbitrary shape clusters
which is particularly important in spatial data mining.
They can also be employed as a post processing technique
to enhance the quality of clusters that were obtained by
running a representative-based clustering algorithm.
For example, a variance-based interestingness function can be defined as follows:

i(r) = Var(r, attr) / Var(O, attr)    (8)

where Var(r, attr) denotes the variance of the values attr(o) of the objects o belonging to region r.
The interestingness function parameters β and th are determined in close collaboration with the domain experts. attr is the attribute of interest, and in the formula attr(o) denotes the value of attr for object o. The interestingness function computes the ratio of the region's variance with respect to attr to the dataset's variance; regions whose ratio is above a given threshold th receive rewards.
Figure 1 in Section 1 shows the result of using the above variance interestingness function for an earthquake dataset with earthquake depth being the attribute of interest. The polygons in Figure 1 indicate regions with positive interestingness; usually, those regions will be further ranked by region reward to sort regions from most interesting to least interesting, providing search-engine-type capabilities to scientists who are interested in finding interesting places in spatial datasets.

4.1 CLEVER—A Representative-based Clustering Algorithm

Representative-based clustering algorithms, sometimes called prototype-based clustering algorithms in the literature, construct clusters by seeking a set of representatives; clusters are then created by assigning the objects in the dataset to their closest representative. In general, they compute the following function Ψ:

Ψ: O × q × d × {other parameters} → 2^Dom(S)    (9)

Ψ takes O, q, a distance function d over Dom(S), and possibly other parameters as input and seeks an "optimal set"5 of representatives in Dom(S), such that the clustering X obtained by assigning the objects in O to their closest representative in Ψ(O,q,d,…) maximizes the fitness function q(X). Moreover, it should be noted that clustering is done in the spatial attribute space S, and not in F; the attributes in N are only used by the fitness function q when evaluating clusters.

5 In general, prototype-based clustering is NP-hard. Therefore, most representative-based clustering algorithms will only be able to find a suboptimal clustering X and not the global maximum of q.
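As a concrete illustration of such a plug-in fitness component, the variance interestingness function discussed above can be sketched in a few lines of Python. The function name and the exact reward form (the ratio of region variance to dataset variance, thresholded at th) are illustrative assumptions, not the paper's implementation:

```python
import statistics

def variance_interestingness(region_values, dataset_values, th=1.0):
    """Reward a region whose variance in the attribute of interest
    exceeds th times the dataset-wide variance; otherwise return 0."""
    ratio = statistics.pvariance(region_values) / statistics.pvariance(dataset_values)
    return ratio - th if ratio > th else 0.0

# Earthquake depths: a region mixing shallow and deep quakes has high variance
depths = [5, 6, 7, 8, 9, 10, 300, 310, 320, 305]
print(variance_interestingness([5, 6, 310, 320], depths) > 0)  # → True
```

A region containing only shallow quakes (e.g., depths 5-8) has a variance ratio far below the threshold and receives zero interestingness, so it would be weeded out by the clustering algorithm.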
CLEVER is an example of a representative-based clustering algorithm that uses randomized hill climbing and larger neighborhood sizes6 to battle premature convergence when greedily searching for the best set of representatives. Initially, the algorithm randomly selects k' representatives from O. In the iterative process, CLEVER samples and evaluates p solutions in the neighborhood of the current solution; if the best one improves the fitness, it becomes the current solution. The neighboring solutions are created by applying one of the following operators to a representative of the current solution: Insert, Delete, and Replace. Each operator has a certain selection probability, and the representatives to be manipulated are chosen at random. Moreover, to battle premature convergence, CLEVER re-samples p'>p solutions before terminating. The pseudocode of CLEVER is given in Fig. 4.

6 It modifies the current set of representatives by applying more than one operator to it; e.g., modifying the current set of representatives by replacing two representatives and inserting a new representative.

Fig. 4. Pseudo-code of CLEVER

The cluster model θ for the result obtained by running a representative-based clustering algorithm can be constructed as follows. Let

Ψ(O,q,d,…) = {rep1,…, repk} ⊆ Dom(S)

that is, the representative-based clustering algorithm returned R={rep1,…, repk}. Then the model θ can be defined as follows:

∀p∈S: θ(p)=m ⟺ d(p,repm) ≤ d(p,repj) for j=1,…,k

that is, θ assigns p to the cluster associated with the closest representative7.

7 Our formulation ignores the problem of ties when finding the closest representative; in general, our representative-based clustering algorithms break ties randomly.

Because representative-based clustering algorithms assign objects to clusters using 1-nearest-neighbor queries, the spatial extent of the regions ri ⊆ Dom(S) can be constructed by computing Voronoi diagrams; this implies that the shape of the regions obtained by representative-based clustering algorithms is limited to convex polygons in Dom(S). Neighboring relationships no() between objects in O and nc() between clusters obtained by a representative-based clustering algorithm can be constructed by computing the Delaunay triangulation for R. Moreover, representative-based clustering algorithms do not support the concept of outliers; therefore, representative-based models have to assign a cluster to every point p in S.

4.2 MOSAIC—An Agglomerative Clustering Algorithm

The agglomerative clustering problem can be defined as follows:
Given: O, F, S, N, a fitness function q, and an initial clustering X with contiguous(X).
Find: X'={c'1,…,c'h} that maximizes q(X'), where all clusters in X' have been constructed as unions of neighboring clusters in X:
∀ci∈X': ci=ci1∪…∪cij with ci1,…,cij∈X ∧ nc(cik,cik+1) (for k=1,…,j-1) ∧ ci∩cj=∅ (for i≠j)
Because the above definition assumes that only neighboring clusters are merged, contiguous(X') trivially holds.
In the following, we view the results obtained by agglomerative methods as a meta-clustering X' over an initial clustering X of O; X' over X is defined as an exhaustive set of contiguous, disjoint subsets of X. More formally, the objectives of agglomerative clustering can be reformulated as follows:
Find: X'={x1,…,xr} with xi⊆X (i=1,…,r) maximizing q(X'), subject to the following constraints:
(1) x1∪…∪xr = X
(2) xi∩xj = ∅ (i≠j)
(3) contiguous(xi) (for i=1,…,r)
(4) ∀x∈X' ∃m≥1 ∃x'1,…,x'm∈X: x = x'1∪…∪x'm
We use the term meta-clustering because it is a clustering of clusters and not of objects, as is the case with traditional clustering. It should be noted that agglomerative clusters are exhaustive subsets of an initial clustering X; that is, we assume that outliers are not removed by the agglomerative clustering algorithm itself, but rather by the algorithm that constructs the input X for the agglomerative clustering algorithm. In general, an agglomerative clustering algorithm is composed of two algorithms:
1. a preprocessing algorithm that constructs the clustering X
2. the agglomerative clustering algorithm itself that derives X' from X.
The preprocessing algorithm is frequently degenerate; for example, its input could consist of single-object clusters, or X could be constructed based on a grid structure; however, for many applications it is beneficial to use a full-fledged clustering algorithm for the preprocessing step.
An agglomerative clustering algorithm, MOSAIC [13], has been introduced in previous work. MOSAIC takes the clustering X obtained by running a representative-based region discovery algorithm as its input and merges neighboring regions greedily as long as merging enhances q(X). For efficiency reasons, MOSAIC uses Gabriel graphs [15], which are subgraphs of Delaunay graphs, to compute nc; nc is then used to identify merge candidates for MOSAIC, which are pairs of neighboring clusters whose merging enhances q(X); nc is updated incrementally as clusters are merged. Finally, when clusters are merged, q(X) is updated incrementally, taking advantage of the fact that our framework assumes that q is additive. Fig. 5 gives the pseudo-code for MOSAIC.

Moreover, models for the clusters obtained by an agglomerative region discovery algorithm can easily be constructed from the models of the input clusters in X that have been merged to obtain the region in question. Let us assume r has been obtained as r=r1∪…∪rm; in this case the model for r can be defined as θr(p) = θr1(p) ∨ … ∨ θrm(p).

In the case of MOSAIC, θr(p) is implemented by characterizing MOSAIC clusters by sets of representatives8; new points are then assigned to the cluster whose set of representatives contains the representative that is closest to p. Basically, MOSAIC constructs regions as unions of Voronoi cells, and the above construction takes advantage of this property.

Fig. 5. Pseudo-code of MOSAIC
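The greedy merge loop of MOSAIC can be sketched as follows. This is an illustration rather than the paper's implementation: the neighboring relation nc is passed in directly as a set of cluster-id pairs instead of being derived from a Gabriel graph, and q is a toy additive reward on single clusters:

```python
def mosaic_sketch(clusters, nc, q):
    """Greedily merge neighboring clusters while merging improves q.
    clusters: dict id -> list of objects; nc: set of frozenset id-pairs."""
    nc = set(nc)
    improved = True
    while improved:
        improved = False
        best_pair, best_gain = None, 0.0
        for pair in nc:
            a, b = tuple(pair)
            # gain of merging, exploiting additivity of q
            gain = q(clusters[a] + clusters[b]) - (q(clusters[a]) + q(clusters[b]))
            if gain > best_gain:
                best_pair, best_gain = (a, b), gain
        if best_pair:
            a, b = best_pair
            clusters[a] = clusters[a] + clusters.pop(b)
            # update nc incrementally: b's neighbors become a's neighbors
            new_nc = set()
            for pair in nc:
                pair = {a if x == b else x for x in pair}
                if len(pair) == 2:
                    new_nc.add(frozenset(pair))
            nc = new_nc
            improved = True
    return clusters

clusters = {0: [1, 2], 1: [2, 3], 2: [50]}
nc = {frozenset((0, 1)), frozenset((1, 2))}
q = lambda c: len(c) ** 2 if max(c) - min(c) <= 3 else 0  # toy reward: compact clusters
merged = mosaic_sketch(clusters, nc, q)
print(sorted(len(c) for c in merged.values()))  # → [1, 4]
```

The two compact clusters are merged because the merged reward (16) exceeds the sum of the parts (8), while merging with the far-away cluster would destroy the reward, so it is left alone.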
4.3 SCMRG—A Divisive Grid-based Clustering Algorithm

The divisive clustering problem can be defined as follows:
Given: O, F, S, N, a fitness function q, and an initial clustering X={x1,…,xh} with contiguous(X).
Find: X'={c'1,…,c'k} that maximizes q(X') and has been obtained from X.
Procedure: Initially, X' is set to X. Then X' is modified to increase q(X') by recursively replacing a region x∈X' with sub-regions x'1,…,x'p, as long as q(X') improves and the following conditions are satisfied:
(1) x'j⊆x (j=1,…,p)
(2) x'j∩x'i=∅ (for j≠i)
(3) contiguous(x'j) (j=1,…,p)
(4) reward(x) < reward(x'1)+…+reward(x'p)

8 If r in X' has been constructed using r=r1∪…∪rm from X, r is characterized by the representatives of the regions r1,…,rm.
Region x is only replaced by regions at a lower level of resolution if the sum of the rewards of the regions at the lower level of resolution is higher than x's reward. It should be emphasized that the splitting procedure employs a variable number of decompositions; e.g., one region might not be split at all, another region might be split into just four regions, whereas a third region might be split into 17 subregions. Moreover, the splitting procedure is not assumed to be exhaustive; that is, x can be split into y1, y2, y3 with y1∪y2∪y3⊂x; in other words, the above specification allows divisive region discovery algorithms to discard outliers when seeking interesting regions; basically, the objects belonging to the residual region x\(y1∪y2∪y3) in the above example are considered to be outliers.
SCMRG (Supervised Clustering using Multi-Resolution Grids) [14] is a divisive, grid-based region discovery algorithm that was developed in our past work. SCMRG partitions the spatial space Dom(S) of the dataset into grid cells. Each grid cell at a higher level is partitioned further into a number of smaller cells at the lower level, and this process continues if the sum of the rewards of the lower-level cells is greater than the reward of the higher-level cell. The regions returned by SCMRG usually have different sizes, because they were obtained at different levels of resolution. Moreover, a cell is drilled down only if it is promising (if its fitness improves at a lower level of resolution). SCMRG uses a look-ahead splitting procedure that splits a cell into 4, 16, and 64 cells, respectively, and analyzes whether there is an improvement in fitness in any of these three splits; if this is not the case and the original cell receives a reward, this cell is included in the region discovery result; however, cells that neither themselves nor their successors at lower levels of resolution receive any rewards will be treated as outliers and discarded from the final clustering X'.
SCMRG employs a queue to store cells that need further processing. SCMRG starts at a user-defined level of resolution and puts the cells associated with this level on the queue. Next, SCMRG generates a clustering from the cells in the queue by traversing the hierarchical structure and examining those cells, considering the following three cases when processing a cell:
Case 1. If the cell c receives a reward, and its reward is greater than the sum of the rewards of its children and the sum of the rewards of its grandchildren, respectively, this cell is returned as a cluster by the algorithm.
Case 2. If the cell c does not receive a reward, and neither do its children and grandchildren, then neither the cell nor any of its descendants will be further processed or labeled as a cluster.
Case 3. Otherwise, if the cell c does not receive a reward but its children receive rewards, all the children of cell c are put into the queue for further processing.
Finally, all cells that have been labeled as clusters (Case 1) are returned as the final result of SCMRG.
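The drill-down logic of Cases 1-3 can be sketched with a quadtree over 2-D points; the cell geometry, the purity-style reward function, and the depth limit below are illustrative assumptions rather than the framework's actual components:

```python
def scmrg_sketch(points, cell, reward, depth=0, max_depth=4):
    """Return reward-receiving cells as clusters, drilling down while the
    children's summed reward beats the parent's (Cases 1-3 in the text).
    cell = (x0, y0, x1, y1); points: list of (x, y, label)."""
    x0, y0, x1, y1 = cell
    inside = [p for p in points if x0 <= p[0] < x1 and y0 <= p[1] < y1]
    if not inside:
        return []
    here = reward(inside)
    if depth == max_depth:
        return [inside] if here > 0 else []
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    children = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                (x0, ym, xm, y1), (xm, ym, x1, y1)]
    child_results = [scmrg_sketch(points, c, reward, depth + 1, max_depth)
                     for c in children]
    child_reward = sum(reward(cl) for res in child_results for cl in res)
    # Case 1: the cell is rewarded at least as much as its refinement -> keep it
    if here > 0 and here >= child_reward:
        return [inside]
    # Case 3: otherwise keep the rewarded cells found below
    # (Case 2: if nothing is rewarded anywhere, the objects end up as outliers)
    return [cl for res in child_results for cl in res]

points = [(0.1, 0.1, 'a'), (0.2, 0.2, 'a'), (0.8, 0.8, 'b'),
          (0.9, 0.9, 'b'), (0.1, 0.9, 'a'), (0.9, 0.1, 'b')]
pure_size = lambda cl: len(cl) if len({p[2] for p in cl}) == 1 else 0
clusters = scmrg_sketch(points, (0.0, 0.0, 1.0, 1.0), pure_size)
print(len(clusters))  # → 4: the mixed top-level cell is drilled down into pure cells
```

The mixed top-level cell earns no reward, so the sketch descends; each quadrant is pure, and further drilling would not beat the quadrant's own reward, so the recursion stops there, mirroring the look-ahead test.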
5 CASE STUDIES
5.1 Co-location Mining of Risk Patterns of Arsenic and Associated Chemicals in the Texas Water Supply
In this case study, we apply our domain-driven clustering framework to discover interesting regions where two or more attributes are co-located, together with their associated patterns. The employed procedure is summarized in Fig. 6 and explained step by step below:
Fig. 6. The procedure for applying the domain-driven clustering framework for actionable region discovery with the involvement of domain experts
Step 1. Define the problem: Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more features co-locate in spatial proximity. For this case study, hydrologists helped us select subsets of chemicals and some external factors suspected of generating high levels of Arsenic concentrations. The interesting pattern set B is defined as follows:
Given:
• N={A1,…,Aq}, the set of non-spatial continuous attributes that measure chemical concentrations in Texas water wells.
• Q={A1↑, A1↓,…, Aq↑, Aq↓}, the set of base co-location patterns; in this case study, the domain expert is interested in finding associations of high/low concentrations (denoted by '↑' and '↓', respectively) with high/low concentrations of other chemicals.
• B ⊆ Q, a set of co-location patterns.
• P(B), a predicate over B that restricts the co-location sets considered, e.g. P(B) = (As↑∈B) ("only look for co-location sets involving high concentrations of Arsenic").
Step 2. Create/select a fitness function: First, the hydrologists formulate a measure of their interestingness in the form of a reward-based fitness function. Fitness functions express extrinsic characteristics, which vary across problems and domains. In our framework, the fitness function is a generic component, so hydrologists can define several fitness functions, some of which might be small variations of each other reflecting their diverse interests. The simplified version of the fitness function applied in co-location mining, called the z-value, was given in Section 3.2.
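A sketch of such a z-value-style interestingness measure is given below; the exact formula in Section 3.2 differs, and the function names and data here are purely illustrative. The idea is that the product of z-scores, averaged over the objects in a region, is large and positive where the pattern's attributes are jointly high (or jointly low):

```python
import statistics

def zscores(values):
    """Standardize a column of attribute values."""
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def colocation_interestingness(region_idx, attr_columns):
    """Average, over the objects in a region, of the product of the
    z-scores of the attributes participating in the pattern."""
    z_cols = [zscores(col) for col in attr_columns]
    products = [1.0] * len(region_idx)
    for z in z_cols:
        products = [p * z[i] for p, i in zip(products, region_idx)]
    return sum(products) / len(products)

# Wells 0-2 have jointly high arsenic and boron concentrations:
arsenic = [9.0, 8.5, 9.5, 1.0, 1.2, 0.8]
boron = [7.0, 7.5, 6.8, 0.9, 1.1, 1.0]
print(colocation_interestingness([0, 1, 2], [arsenic, boron]) > 0)  # → True
```

Note that a region of jointly low concentrations also scores positively (a product of two negative z-scores), which is exactly the co-location behavior the base patterns Aq↑ and Aq↓ are designed to capture.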
Step 3. Select a clustering algorithm: The framework provides many algorithms that exemplify different clustering paradigms, e.g. representative-based clustering, divisive grid-based clustering, and agglomerative clustering. For this case study, CLEVER (CLustEring using representatiVEs and Randomized hill climbing) is employed to identify regions and associated co-location patterns.
Step 4. Select parameters of the fitness function and the clustering algorithm: Tuning the parameters of the fitness function and the region discovery framework, such as β, helps obtain better results, or extends the search to focus on alternative patterns or patterns at different levels of granularity. For the particular fitness function employed, the parameter β controls the importance of the purity of a pattern in interestingness computations; the larger β is, the more emphasis is put on the purity of a pattern. Besides those parameters, hydrologists can also specify seed patterns, e.g. making As a mandatory item in the co-location patterns considered. Later on, they can simply change the seed patterns to force the co-location mining algorithm to seek alternative patterns, e.g. patterns that are co-located with both {As,F}. This bridges the gap between hydrologists' expectations and the results of clustering algorithms, permitting hydrologists to tune these parameters in order to derive actionable patterns.
Step 5. Run the clustering algorithm to discover interesting regions and associated patterns: The results (a set of clusters) obtained from the clustering algorithm are ranked either by reward or by interestingness. An example of the experimental results is given in Fig. 7. For instance, the first-ranked pattern indicates that a high level of Arsenic co-locates with high levels of Boron, Chloride, and Total Dissolved Solids in southern Texas.
Fig. 7. Example of Top 5 regions ranked by interestingness
Step 6. Analyze the results: By the nature of the fitness functions, the clustering algorithm weeds out the many regions having zero interestingness. The experimental results show the ability of the framework to identify interesting regions and associated patterns, exemplified in Fig. 7, which are comparable to the regions of high Arsenic concentration obtained from TCEQ, as depicted in Fig. 8. Steps 4-6 are usually repeated several times in order to enhance the results or explore alternative regions and patterns.
Fig. 9. A procedure of applying domain-driven clustering framework
in change analysis
First, geologists sample two datasets corresponding to two different time frames.
Second, the domain-driven clustering framework is used to separately identify the interesting regions of each time frame; a fitness function rewards high variance of earthquake depth. To generate intensional clusterings from the results of CLEVER, we construct Voronoi cells, in which polygons represent cluster models. Then, change analysis techniques are applied in Steps 3-6 in order to detect and identify different change patterns in those regions.
Third, users select relevant change predicates to compare changes between the two intensional clusterings; the predicates also have thresholds that are controlled externally. Then changes between the two clusterings are instantiated with respect to the predicate thresholds. Finally, emergent patterns are summarized and further analyzed.
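The change predicates of the third step can be sketched with a simple overlap test; for illustration the regions are 1-D intervals rather than polygons, and the predicate "novel" with its externally controlled threshold th is an assumed example:

```python
def overlap(a, b):
    """Length of the intersection of two 1-D intervals (stand-ins for polygons)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def novel(region_new, regions_old, th=0.1):
    """Change predicate: a new region is 'novel' if at most a th-fraction
    of it is covered by any old region."""
    length = region_new[1] - region_new[0]
    return all(overlap(region_new, r) / length <= th for r in regions_old)

old = [(0.0, 2.0), (5.0, 6.0)]
new = [(8.0, 9.0), (0.1, 1.9)]
print([novel(r, old) for r in new])  # → [True, False]
```

A "disappearance" predicate would be the mirror image: an old region is a disappearance if at most a th-fraction of it is covered by any new region.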
Fig. 8. Arsenic pollution map
In contrast to traditional clustering, our framework offers search-engine-type capabilities to domain experts to help them identify patterns they are interested in. Domain experts assist and incorporate their knowledge in several mining phases, especially before the clustering phase. By expressing their interestingness in the form of fitness functions, the clustering algorithms are able to seek clusters with extrinsic characteristics. Therefore, the clusters and associated patterns obtained represent actionable knowledge.
5.2 Change Analysis in Earthquake Data
A change analysis framework is developed on top of our framework to analyze how interesting regions change between two different time frames; for instance, analyzing changes in places where deep earthquakes are in close proximity to shallow earthquakes. Fig. 9 summarizes the approach of change analysis, and the steps are explained next.
Fig. 10. An overlay of interesting regions discovered in Oold and Onew
Fig. 10 illustrates an overlay of the interesting regions discovered in Oold and Onew; the red regions belong to the early time frame (labeled Regionold), whereas the blue regions belong to the late time frame (labeled Regionnew). Examples of the relationships discovered between the two clusterings are also given: regions 5 and 10 in Fig. 11 are considered novel, whereas regions 0, 2, 3, and 7 in Fig. 12 are considered disappearances.
Fig. 11. Novelty areas of regions in Onew data

Fig. 12. Disappearance areas of regions in Oold data

5.3 Other Applications of the Framework for Actionable Regional Knowledge Discovery
Besides the use of the domain-driven clustering framework to discover actionable knowledge in the two aforementioned case studies, the framework can also be applied to aid knowledge discovery in other real applications. The first application, similar to the first case study, is co-location mining in planetary science [16]; we are interested in mining feature-based hotspots on Mars where extreme densities of deep ice and shallow ice co-locate; the fitness function employed is the absolute value of the product of the z-scores of the continuous non-spatial features in the spatial dataset. Outcomes of the framework are regions exhibiting either very high co-location or very high anti-co-location.
The second application is regional correlation pattern discovery using PCA in hydrology [10]. Finding regional patterns in spatial datasets is an important data mining task. A PCA-based fitness function is used to discover regional correlation patterns. This approach is more effective than solely applying PCA once or multiple times to the data, since PCA is applied repeatedly to candidate regions to explore each possible region combination. This case study uses a PCA-based fitness function that maximizes the eigenvalues of the first k principal components; it rewards regions with high correlation, since highly correlated attribute sets result in higher eigenvalues; in other words, more variance is captured.
The third application is multi-objective clustering, whose goal is to seek a set of clusters that individually satisfy multiple objectives. For example, hydrologists are also interested in identifying regions which satisfy multiple patterns of chemical contamination in the water supply. We apply multi-run clustering as a tool to gather multi-objective clusters simultaneously. Multi-run clustering reduces extensive human effort by searching for and enhancing novel, high-quality clusters in an automated fashion. Since multi-run clustering is developed on top of the domain-driven clustering framework, it conforms to the framework and inherits its capability to plug in different clustering algorithms and fitness functions. Therefore, results obtained from multi-run clustering are also considered actionable.

6 CONCLUSION

In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific, plug-in fitness functions that are maximized by clustering algorithms. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Moreover, an ontology and a theoretical foundation for clustering with fitness functions in general, and for region discovery in particular, are introduced. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed.
The framework was evaluated on different region discovery tasks in several case studies. The framework treats the region discovery problem as a clustering problem in which a given, plug-in fitness function has to be maximized. By integrating domain knowledge and domain-specific evaluation measures into parameterized, plug-in fitness functions, together with externally controlled thresholds, the framework is able to obtain actionable regional knowledge and associated patterns satisfying domain-specific needs.
The case studies demonstrate the capability of the framework to integrate domain intelligence and effectively steer clustering tasks by incorporating domain requirements into the clustering algorithms in the form of a fitness function that guides clustering. To the best of our knowledge, this capability has been little explored by past research in the field of clustering, and we are optimistic that our proposed framework will foster novel applications of domain-driven clustering.
REFERENCES
[1] L. Cao and C. Zhang, "The Evolution of KDD: Towards Domain-Driven Data Mining," Int'l J. Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692, 2007.
[2] I. Davidson and S.S. Ravi, "Clustering under Constraints: Feasibility Issues and the k-means Algorithm," Proc. Fifth SIAM Data Mining Conf., 2005.
[3] Q. Yang, K. Wu, and Y. Jiang, "Learning Action Models from Plan Examples Using Weighted MAX-SAT," Artificial Intelligence, vol. 171, no. 2-3, pp. 107-143, 2007.
[4] N. Zhong, "Actionable Knowledge Discovery: A Brain Informatics Perspective," Special Trends and Controversies Department on Domain-Driven, Actionable Knowledge Discovery, IEEE Intelligent Systems, vol. 22, no. 4, pp. 85-86, 2007.
[5] W. Graco, T. Semenova, and E. Dubossarsky, "Toward Knowledge-Driven Data Mining," Proc. Domain Driven Data Mining Workshop, 2007.
[6] G. Karypis, E.H. Han, and V. Kumar, "Chameleon: Hierarchical Clustering Using Dynamic Modeling," IEEE Computer, vol. 32, no. 8, pp. 68-75, 1999.
[7] C.F. Eick, N. Zeidat, and Z. Zhao, "Supervised Clustering: Algorithms and Benefits," Proc. Int'l Conf. Tools with AI, 2004.
[8] A. Bagherjeiran, C.F. Eick, C.-S. Chen, and R. Vilalta, "Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience," Proc. Fifth IEEE Int'l Conf. Data Mining, 2005.
[9] E.H. Simpson, "The Interpretation of Interaction in Contingency Tables," J. Royal Statistical Soc., ser. B, vol. 13, pp. 238-241, 1951.
[10] O.U. Celepcikay and C.F. Eick, "A Regional Pattern Discovery Framework Using Principal Component Analysis," Proc. Int'l Conf. Multivariate Statistical Modeling & High Dimensional Data Mining, 2008.
[11] I.T. Jolliffe, Principal Component Analysis, Springer, 1986.
[12] C.F. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, "Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets," Proc. 16th ACM SIGSPATIAL Int'l Conf. Advances in GIS, 2008.
[13] J. Choo, R. Jiamthapthaksin, C.-S. Chen, O. Celepcikay, C. Giusti, and C.F. Eick, "MOSAIC: A Proximity Graph Approach to Agglomerative Clustering," Proc. Ninth Int'l Conf. Data Warehousing and Knowledge Discovery, 2007.
[14] C.F. Eick, B. Vaezian, D. Jiang, and J. Wang, "Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering," Proc. 10th European Conf. Principles and Practice of Knowledge Discovery in Databases, 2006.
[15] K. Gabriel and R. Sokal, "A New Statistical Approach to Geographic Variation Analysis," Systematic Zoology, vol. 18, pp. 259-278, 1969.
[16] W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. Stepinski, and C.F. Eick, "Towards Region Discovery in Spatial Datasets," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2008.
Christoph F. Eick received his PhD from the University of Karlsruhe
in Germany. He is currently an Associate Professor in the Department of Computer Science at the University of Houston. He is the
Co-Director of the UH Data Mining and Machine Learning Group. His
research interests include data mining, machine learning, evolutionary computing, and artificial intelligence. He has published more than 95 papers in these and related areas. He serves on the program committee of the IEEE International Conference on Data Mining (ICDM) and other major data mining and machine learning conferences.
Oner Ulvi Celepcikay is a senior PhD candidate in the Computer Science Department at the University of Houston. He received his bachelor's degree in Electrical Engineering in 1997 from Istanbul University, Istanbul, Turkey, and his M.S. degree in Computer Science from the University of Houston in 2003. He worked in the University of Houston Educational Technology Outreach (ETO) Department from 2000 to 2007. He has published a number of papers in his research fields, including cluster analysis, multivariate statistical analysis, and spatial data mining. He served as a session chair at the International Conference on Multivariate Statistical Modeling & High Dimensional Data Mining in 2008 and has served as a non-PC reviewer for many conferences.
Rachsuda Jiamthapthaksin received her bachelor's degree in Computer Science in 1997 and her master's degree in Computer Science, with Honors and the Dean's Prize for Outstanding Performance, in 1999 from Assumption University, Bangkok, Thailand. She was a faculty member in the Computer Science Department at Assumption University from 1997 to 2004. She is now a PhD candidate in Computer Science at the University of Houston, Texas. She has published papers in the areas of her research interests, including intelligent agents, fuzzy systems, cluster analysis, data mining, and knowledge discovery. She has served as a non-PC reviewer for many conferences and as a volunteer staff member in the organization of the 2005 IEEE ICDM Conference, November 2005, Houston, Texas.
Vadeerat Rinsurongkawong is a PhD candidate in Computer Science at the University of Houston. She received her M.S. degree in Information Technology from Assumption University, Thailand, and her B.Eng. degree in Electrical Engineering from Chulalongkorn University, Thailand. She has work experience in electrical engineering, information technology, and computer science. Her areas of interest are data mining and databases.