IEEE TRANSACTIONS ON TKDE, MANUSCRIPT ID
A Unifying Domain-Driven Framework for
Clustering with Plug-In Fitness Functions and
Region Discovery
Christoph F. Eick, Oner U. Celepcikay, Rachsuda Jiamthapthaksin and
Vadeerat Rinsurongkawong
Abstract—The main challenge in developing methodologies for domain-driven data mining is incorporating domain knowledge and domain-specific evaluation measures into data mining algorithms and tools, so that "actionable knowledge" can be discovered. In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific plug-in fitness functions that are maximized by the clustering algorithm. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed. The framework also incorporates domain knowledge through preprocessing and post-processing steps and parameter selections. This paper introduces the framework in detail and illustrates it through demonstrations and case studies that center on spatial clustering and region discovery. Moreover, the paper introduces an ontology and a theoretical foundation for clustering with fitness functions in general, and region discovery in particular. Finally, intensional clustering algorithms that operate on cluster models are introduced.
Index Terms—Clustering, Data Mining, Spatial Databases and GIS, Domain-driven Data Mining
1 INTRODUCTION
To extract knowledge from the immense amounts of data generated by advances in data acquisition technologies has been a major focus of data mining research over the last 20 years. However, it has been observed that knowledge obtained from traditional, data-driven data mining algorithms in domain-specific applications is not really actionable [1], because the extracted knowledge does not capture what domain experts are interested in. This observation can be explained by two limitations of traditional data mining: 1) traditional data mining algorithms insufficiently incorporate domain intelligence to aid the mining process, and 2) the algorithms use technical significance as their sole evaluation measure.
As far as the first limitation is concerned, domain intelligence includes the involvement of domain knowledge, domain-specific constraints, and experts. Consider a situation in which a clustering algorithm is used to identify clusters in a specific domain. Different clustering algorithms have their own assumptions about clustering criteria, e.g. tightness, connectivity, separation, and so on. Because clustering is NP-hard, clustering algorithms focus their search efforts on clusters that maximize those criteria, frequently generating "optimal" but uninteresting clusters. Clustering with constraints attempts to alleviate this problem by incorporating must-link and cannot-link constraints to better guide the search for good clusters [2]. The second limitation occurs because in traditional data mining, the actionability of knowledge is determined solely by technical significance based on domain-independent criteria [1]; this type of measure usually differs from domain-specific expectations and measures of interestingness. To address this problem, both technical and domain-specific significance should be considered when assessing cluster quality. Consequently, the main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into data mining algorithms and tools so that actionable knowledge can be discovered.
In this paper, we propose a unifying domain-driven
clustering framework that provides families of clustering
algorithms with plug-in fitness functions capable of discovering actionable knowledge. The fitness function is the
core component of the framework, as it captures the domain expert’s notion of interestingness. The fitness function is specifically designed to be externally plugged-in to
provide extensibility and flexibility; the fitness function
component of the framework is independent of the clustering algorithms employed.
In general, families of task- and/or domain-specific
fitness functions are employed to capture the domain
interestingness and to incorporate domain knowledge.
————————————————
 Authors are with the Department of Computer Science, University of Houston, Houston, TX, 77204.
 Emails: (ceick, onerulvi, rachsuda, vadeerat)@cs.uh.edu
Manuscript received (03/31/2009).

For example, let us consider a data mining task in which geologists are interested in discovering hotspots in geographical space where deep earthquakes are in close proximity to shallow earthquakes. That is, they are interested in identifying contiguous regions in an earthquake data set for which the variance of the variable earthquake_depth is high. When using our framework, the geologist's notion of interestingness is captured in the form of a High Variance fitness function, formally defined in Section 3. The domain expert additionally selects parameters to instruct the clustering algorithm in what patterns they are really interested in: an earthquake-depth variance threshold and a parameter that controls the granularity and size of the spatial clusters discovered.
Next, a clustering algorithm is run with the parameterized High Variance fitness function, and high-variance earthquake-depth hotspots are obtained, as displayed in Fig. 1.
Fig. 1. Examples of interesting regions discovered by a domain-driven clustering algorithm using a High Variance fitness function
Our framework incorporates domain knowledge not only through domain-specific fitness functions, but also through preprocessing and post-processing steps, fitness function parameter selections including seed patterns, threshold parameter values that are suitable for a specific domain, and desired cluster granularities. The family of clustering algorithms supported by the framework includes divisive, grid-based, prototype-based, and agglomerative clustering algorithms, all of which support plug-in fitness functions.
The first high-level domain-driven data mining framework was introduced by Cao and Zhang [1]. In this framework, domain intelligence is incorporated into the KDD process towards actionable knowledge discovery, and the framework has been illustrated through mining activity patterns in social security data. They also proposed criteria to measure the actionability of the discovered knowledge. Yang [3] introduced a framework with two techniques to produce actionable output from traditional KDD output models. The first technique uses an algorithm for extracting actions from decision trees such that each test instance falls in a desirable state. The second technique uses an algorithm that can learn relational action models from frequent item sets. This technique is applied to automatic planning systems, and Yang's Action-Relation Modeling System (ARMS) automatically acquires action models from recorded user plans.
One subcomponent of the domain knowledge that must be incorporated into any domain-driven data mining framework is human intelligence, and Multiaspect Data Analysis (MDA) is an important Brain Informatics methodology. Brain Informatics considers the brain as an information-processing system in order to understand its mechanisms for analyzing and managing data. But since brain researchers cannot use MDA results directly, Zhong [4] proposes a methodology that employs an explanation-based reasoning process that combines multiple source data into more general results to form actionable knowledge. Zhong's framework basically takes traditional KDD output as input to an explanation-based reasoning process that generates actionable output. The concept of moving from method-driven or data-driven data mining to domain-driven data mining has recently been proposed and is featured in [5]. The authors describe four aspects of moving data mining from a method-driven approach to a process that focuses on domain knowledge. In general, the use of plug-in fitness functions is not very common in traditional clustering; the only exception is the CHAMELEON [6] clustering algorithm. However, fitness functions play a more important role in semi-supervised and supervised clustering [7] and in adaptive clustering [8].
The main contributions of this paper are that it:
1. Introduces a unifying domain-driven clustering framework for actionable knowledge discovery.
2. Proposes a novel domain-specific fitness function model that is plugged into clustering algorithms externally to capture domain interestingness.
3. Presents a set of fitness functions capable of serving clustering tasks in various domains.
4. Introduces a family of clustering algorithms, most of which have been developed in our previous work, as part of the framework, and introduces novel intensional clustering algorithms that directly manipulate cluster models.
5. Illustrates the deployment of the proposed framework and its benefits in challenging real-world case studies.
The remainder of this paper is organized as follows: In Section 2, we formally present our domain-driven clustering framework. Section 3 provides a detailed discussion of domain-specific plug-in fitness functions, including three examples. Section 4 introduces the family of clustering algorithms provided in our framework, and Section 5 illustrates the framework through demonstrations and case studies. Section 6 concludes the paper.
2 SPATIAL CLUSTERING WITH PLUG-IN FITNESS FUNCTIONS
2.1 Preview
As mentioned in the introduction, the goal of this paper is to introduce a highly generic clustering framework that supports plug-in fitness functions to capture domain interestingness. As we will discuss later, the framework is very general and can be used for traditional clustering. However, because almost all of our applications involve spatial data mining, the remainder of this paper will mostly focus on spatial clustering and on region discovery in particular. The goal of spatial clustering is to identify interesting groups of objects in the subspace of the spatial attributes. Region discovery is a special type of spatial clustering that focuses on finding interesting places in spatial datasets. Moreover, in this section and in Section 4 a theoretical foundation and ontology for clustering with plug-in fitness functions is introduced. Finally, novel intensional clustering algorithms are introduced.
2.2 An Architecture for Region Discovery
As depicted in Figure 2, the proposed region discovery
framework consists of three key components. The first
two components are families of clustering algorithms and
fitness functions that play a major role in discovering interesting regions and their associated patterns. As we will discuss in more detail shortly, the framework uses clustering algorithms that support plug-in fitness functions to find interesting regions in spatial datasets. Decoupling cluster evaluation from the search for good clusters creates flexibility in using any clustering algorithm with any fitness function. The role of the third component is to manage and integrate datasets residing in several repositories; it will not be discussed further in this paper.
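The decoupling described above can be illustrated with a minimal sketch (hypothetical names; the framework's actual interfaces are not shown in this paper): the search procedure receives the fitness function q as a callable and treats it as a black box, so any fitness function with the same signature can be plugged in unchanged.

```python
from typing import Callable, List

# A clustering is a list of clusters; each cluster is a list of object indices.
Clustering = List[List[int]]
FitnessFunction = Callable[[Clustering], float]

def greedy_merge_clustering(n_objects: int, q: FitnessFunction) -> Clustering:
    """Toy search that treats q as a black box: start with singleton
    clusters and greedily merge the pair that most improves q(X)."""
    clustering: Clustering = [[i] for i in range(n_objects)]
    improved = True
    while improved and len(clustering) > 1:
        improved = False
        base = q(clustering)
        best_gain, best_pair = 0.0, None
        for i in range(len(clustering)):
            for j in range(i + 1, len(clustering)):
                merged = [c for k, c in enumerate(clustering) if k not in (i, j)]
                merged.append(clustering[i] + clustering[j])
                gain = q(merged) - base
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is not None:
            i, j = best_pair
            new_clustering = [c for k, c in enumerate(clustering) if k not in (i, j)]
            new_clustering.append(clustering[i] + clustering[j])
            clustering, improved = new_clustering, True
    return clustering

# Example plug-in fitness function (our own choice): reward large clusters.
q_size = lambda X: sum(len(c) ** 1.5 for c in X)
```

Note that the search code never inspects the internals of q; this is the sense in which cluster evaluation is decoupled from the search for good clusters.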
Fig. 2. Region Discovery Framework

2.3 Goals and Objectives of Region Discovery
As mentioned earlier, the goal of region discovery is to find interesting places in spatial datasets. Our work assumes that the region discovery algorithms we develop operate on datasets containing objects o1,..,on:
O={o1,..,on}⊆F where F is a relational database schema and the objects belonging to O are tuples that are characterized by the attributes S∪N, where:
S={s1,…,sq} is a set of spatial attributes.
N={n1,..,np} is a set of non-spatial attributes.
Dom(S) and Dom(N) describe the possible values the attributes in S and N can take; that is, each object o∈O is characterized by a single tuple that takes values from Dom(S)×Dom(N)1.
In general, clustering algorithms can be subdivided into intensional and extensional clustering algorithms: extensional clustering algorithms just create clusters for the data set O, partitioning O into subsets, but do nothing else. Intensional clustering algorithms, on the other hand, create a clustering model based on O and other inputs. Most popular clustering algorithms have been introduced as extensional clustering algorithms, but it is not too difficult to generalize most extensional clustering algorithms so that they become intensional, as we present in Section 5.
Extensional clustering algorithms create clusterings X of O that are sets of disjoint subsets of O:
X={c1,...,ck} with ci⊆O (i=1,…,k) and ci∩cj=∅ (i≠j)
Intensional clustering algorithms create a set of disjoint regions Y in F:
Y={r1,...,rk} with ri⊆F (i=1,…,k) and ri∩rj=∅ (i≠j)
In the case of spatial clustering and region discovery, cluster models have a peculiar structure in that they seek regions in the subspace Dom(S) and not in F itself: a region discovery model is a function2 m: Dom(S)→{1,…,k}∪{⊥} that assigns a region m(p) to a point p in Dom(S), assuming that there are k regions in the spatial dataset; the number of regions k is chosen by the region discovery algorithm that creates the model. Models support the notion of outliers; that is, a point p' can be an outlier that does not belong to any region: in this case, m(p')=⊥.
Intensional region discovery algorithms obtain a clustering Y in Dom(S) that is defined as a set of disjoint regions in Dom(S)3:
Y={r1,...,rk} with ri⊆F[S] (i=1,…,k) and ri∩rj=∅ (i≠j)

1 If S is empty we call the problem a traditional clustering problem. One key characteristic of spatial clustering is that spatial and non-spatial attributes play different roles in the clustering process, which is not the case in traditional clustering.
Moreover, regions r belonging to Y are described as functions over tuples in Dom(S): ψr: Dom(S)→{t,f}, indicating whether a point p∈Dom(S) belongs to r: ψr(p)=t. ψr is called the intension of r. ψr can easily be constructed from the model m of a clustering Y. Moreover, the extension of a region r is defined as follows:
ext(r)={o∈O | ψr(o[S])=t}
In the above definition, o[S] denotes the projection of o on its spatial attributes.
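The intension/extension distinction can be sketched as follows (illustrative names and data layout only): the intension is a predicate over the spatial domain Dom(S), and the extension is obtained by filtering the dataset O through that predicate.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]          # a value in Dom(S), here 2-D space
Intension = Callable[[Point], bool]  # psi_r: Dom(S) -> {t, f}

def extension(O: List[Dict], psi_r: Intension) -> List[Dict]:
    """Extension of region r: all objects o in O whose spatial
    projection o[S] satisfies the region's intension."""
    return [o for o in O if psi_r(o["s"])]

def disk(center: Point, radius: float) -> Intension:
    """A circular region in Dom(S), expressed as an intension."""
    cx, cy = center
    return lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= radius ** 2

# Tiny earthquake-style dataset: "s" is the spatial projection o[S].
O = [{"s": (0.0, 0.0), "depth": 10.0},
     {"s": (1.0, 0.0), "depth": 600.0},
     {"s": (5.0, 5.0), "depth": 30.0}]
r = disk((0.0, 0.0), 2.0)   # intension of a region
members = extension(O, r)   # its extension: the objects inside the disk
```

The intension is defined on the whole spatial domain, so it classifies points that are not in O at all; the extension depends on the dataset.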
Our approach requires discovered regions to be contiguous. To cope with this constraint in extensional clustering, we assume that we have a neighbor relationship no() between the objects in O and a cluster neighbor relationship nc() between regions in X defined with respect to O: if no(o,o') holds, objects o and o' are neighboring; if nc(r,r') holds, regions r and r' are neighboring.
no ⊆ O×O
nc ⊆ 2^O × 2^O
Moreover, neighboring relationships are solely determined by the attributes in S; that is, the temporal and spatial attributes in S are used to determine which objects and clusters are neighboring.
and clusters are neighboring. A region r is contiguous if
for each pair of points u and v in r there is a path between
u and v that solely traverses r and no other regions. More formally, contiguity4 is defined as a predicate over subsets c of O:
contiguous(c) ⇔ ∀w∈c ∀v∈c ∃m≥2 ∃x1,…,xm∈c: w=x1 ∧ v=xm ∧ no(xi, xi+1) (i=1,…,m−1)
contiguous(X) ⇔ ∀c∈X: contiguous(c)
Our approach employs arbitrary plug-in, reward-based fitness functions to evaluate the quality of a given set of regions. The goal of region discovery is to find a set of regions X that maximizes an externally given fitness function q(X); moreover, q is assumed to have the following structure:

q(X) = Σc∈X reward(c) = Σc∈X i(c)·|c|^β    (1)

where i(c) is the interestingness of a region c, a quantity designed by a domain expert to reflect the degree to which regions are "newsworthy". The number of objects in O belonging to a region c is denoted by |c|, and the quantity i(c)·|c|^β can be considered a "reward" given to region c; we seek X such that the sum of the rewards over all of its constituent regions is maximized. The premium put on the size of a region is controlled by the parameter β (β>1). A region's reward is proportional to its interestingness, but larger regions receive a higher reward than smaller regions having the same interestingness value, to reflect a preference for larger regions. Furthermore, the fitness function q is assumed to be additive: the reward associated with X is the sum of the rewards of its constituent regions.
The reader might ask why we restrict the form of the fitness functions in our proposed framework. The main reason is our desire to develop efficient clustering algorithms for region discovery. Restricting the form of the fitness functions supported allows us to use knowledge about the structure of the fitness function to obtain faster clustering algorithms that employ pruning, incremental updating, and sophisticated search strategies. This topic will be revisited in Section 4, where specific clustering algorithms are introduced.
Given a spatial dataset O, there are many possible clustering algorithms to seek interesting regions in O with respect to a plug-in fitness function q. In general, the objective of region discovery with plug-in fitness functions is:
Given: O, q, and possibly other input parameters
Find: regions r1,...,rk that maximize q({r1,...,rk}) subject to the following constraints:
(1a) ri⊆O (i=1,…,k) for extensional clustering
(1b) ri⊆F[S] (i=1,…,k) for intensional clustering
(2) contiguous(ri) (i=1,..,k)
(3) ri∩rj=∅ (i≠j)
It should be emphasized that the number of regions k is not an input parameter in the proposed framework; that is, region discovery algorithms are assumed to seek for the optimal number of regions k.

2 ⊥ denotes "undefined".
3 F[S] denotes the projection of F on the attributes in S.
4 Other alternative definitions of contiguity exist, but will not be discussed in this paper due to lack of space.

3 DOMAIN-SPECIFIC PLUG-IN FITNESS FUNCTIONS
The fitness function, whose general form was given in equation (1), is the core component of our framework for capturing the domain's notion of interestingness. The main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into the data mining task so that "actionable knowledge" can be discovered. For example, in region discovery, the framework searches for interesting subspaces and then extracts regional knowledge from the obtained subspaces, which provides crucial knowledge for domain experts.
The fitness function is specifically designed to be externally plugged in to provide extensibility and flexibility. The fitness function component of the framework is independent of the clustering algorithm employed, and for each domain a domain-specific fitness function is designed to capture the domain's interestingness and incorporate domain knowledge. Because the fitness function is external and encapsulated from the rest of the framework, any change in the framework, such as a parameter change or a change in the clustering algorithm, will not affect the fitness function. Likewise, changes to the fitness function that come from domain requirements will not affect the clustering algorithm, and so on. This design makes the framework flexible and extensible to meet domain needs and requirements.
In order to illustrate how the notion of domain interestingness and domain-specific fitness functions are used in domain-driven data mining and in discovering actionable knowledge, we provide several examples of such fitness functions in the remainder of this section.
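The additive reward structure of equation (1) can be sketched as follows (a minimal illustration under our own assumptions; the interestingness function i and the value of β are placeholders that a domain expert would choose):

```python
from typing import Callable, List, Sequence

def fitness_q(X: List[Sequence[float]],
              i: Callable[[Sequence[float]], float],
              beta: float = 1.01) -> float:
    """q(X) = sum over regions c of i(c) * |c|**beta, as in equation (1).
    beta > 1 puts a premium on larger regions."""
    assert beta > 1.0
    return sum(i(c) * len(c) ** beta for c in X)

# Placeholder interestingness: mean of the attribute values in a region.
mean_i = lambda c: sum(c) / len(c)

# Two candidate clusterings of the same six attribute values:
X1 = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]   # fewer, larger regions
X2 = [[1.0, 5.0], [1.0, 5.0], [1.0, 5.0]] # more, smaller regions
```

Because q is additive, the reward of a clustering can be computed region by region, which is what makes the incremental updating and pruning mentioned above possible.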
3.1 PCA-based Fitness Function
Finding interesting regional correlation patterns that help summarize the characteristics of a region is important to domain and business people, since many patterns only exist at a regional level, but not at the global level. Moreover, using regional patterns, which are normally hidden globally, domain or business people can understand the structure of the data and make business or domain decisions by analyzing these correlation patterns. For example, a strong correlation between a fatal disease and a set of chemical concentrations in Texas water wells might not be detectable throughout Texas, but a strong correlation pattern might exist regionally, which is also a reflection of Simpson's paradox [9]. This type of regional knowledge is crucial for domain experts who seek to understand the causes of such diseases and predict future cases. Identifying a sub-region in South Texas with 35 water wells that demonstrates a unique and strong correlation between arsenic, another chemical in the water of those wells, and a high occurrence of the disease in this region might suggest to domain experts the possible existence of nearby toxic waste, and provide valuable actionable knowledge that helps them understand the cause of the dangerous amounts of arsenic in the water wells, then develop a solution to this problem and prevent future incidents. An example of discovered regions along with highly correlated attribute sets is given in Fig. 3. This is an application of our framework using the PCA-based fitness function on Texas Water Wells data [10]; the fact that the correlation sets for each region show significant differences emphasizes the importance of regional pattern discovery.
Fig. 3. An Example of Regional Correlation Patterns for Chemical
Concentrations in Texas
In order to discover regions where sets of attributes are highly correlated, we need a fitness function that rewards high correlation and enables our framework to discover such regions. Principal Component Analysis (PCA) is a good candidate, since the directions identified by PCA are the eigenvectors of the correlation matrix, and each eigenvector has an associated eigenvalue that is a measure of the corresponding variance. The Principal Components (PCs) are ordered in descending order with respect to the variance associated with each component. The eigenvectors of the PCs can help to reveal correlation patterns among sets of attributes.
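For the two-attribute case the connection between correlation and eigenvalues is easy to see: the correlation matrix [[1, ρ], [ρ, 1]] has eigenvalues 1+|ρ| and 1−|ρ|, so strong correlation concentrates the variance in the first PC. A minimal sketch (pure Python, two attributes only; the framework itself runs full PCA on the correlation matrix):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two attribute columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def eigenvalues_2d(rho):
    """Eigenvalues of the 2x2 correlation matrix [[1, rho], [rho, 1]]."""
    return 1 + abs(rho), 1 - abs(rho)

# Perfectly correlated attributes inside a hypothetical region:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
lam1, lam2 = eigenvalues_2d(pearson(xs, ys))  # -> (2.0, 0.0)
```

With a strong correlation, λ1 is large and λ2 is small, which is exactly the eigenvalue profile that the PCA-based interestingness defined below rewards.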
Ideally, it is desirable to have high eigenvalues for the
first k PCs, since this means that a smaller number of PCs
will be adequate to account for the threshold variance
which overall suggests that a strong correlation among
variables exists [11]. The PCA-based fitness function is
defined next.
Let λ1, λ2,…,λk be the eigenvalues of the first k PCs, with k being a parameter. PCA-based interestingness is estimated using formula (2):

iPCA(r) = (λ1² + … + λk²)/k    (2)

The PCA-based fitness function then becomes:

qPCA(R) = Σj=1,…,k iPCA(rj)·size(rj)^β    (3)

The fitness function rewards high eigenvalues for the first k PCs. By taking the square of each eigenvalue we ensure that regions with a higher spread in their eigenvalues obtain higher rewards, reflecting the higher importance assigned in PCA to higher-ranked principal components.
Moreover, a generic pre-processing technique to select the best k value for the PCA-based fitness function is based on a variance threshold that decides how many PCs to retrieve. This variance threshold is also domain-specific and is set based on the available domain knowledge, to ensure that an appropriate k value is selected for each dataset from different domains, reflecting the concerns and constraints implied by domain knowledge.
The PCA-based fitness function repeatedly applies PCA during the search for the optimal set of regions, maximizing the eigenvalues of the first k PCs in each region. Having an externally plugged-in PCA-based fitness function enables the clustering algorithm to probe for an optimal partitioning, and encourages the merging of two regions that exhibit structural similarities in their correlation patterns. This approach is more advantageous than applying PCA just once or multiple times on the data using other tools, since the PCA-based fitness function is applied repeatedly to candidate regions to explore each possible region combination.

3.2 Co-location Fitness Function
Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more patterns co-locate in spatial proximity. In the following we introduce an interestingness function for co-location sets involving objects that are characterized by continuous attributes (see also [12] for background on the described approach). The pattern A↑ denotes that attribute A has high values, and the pattern A↓ indicates that attribute A has low values. For example, the pattern {A↑, B↓, D↑} describes that high values of A are co-located with low values of B and high values of D.
Let O be a dataset,
c be a region,
o∈O be an object in the dataset O,
N={A1,…,Aq} be the set of non-geo-referenced continuous attributes in the dataset O,
Q={A1↑, A1↓,…, Aq↑, Aq↓} be the set of possible base co-location patterns, and
B⊆Q be a set of co-location patterns.
Let z-score(A,o) be the z-score of object o's value of attribute A:

z↑(A,o) = z-score(A,o) if z-score(A,o) > 0; 0 otherwise    (4)

z↓(A,o) = −z-score(A,o) if z-score(A,o) < 0; 0 otherwise    (5)
The interestingness of an object o with respect to a co-location set B⊆Q is measured as the product of the z-values of the patterns in the set B. It is defined as follows:

i(B,o) = Πp∈B z(p,o)    (6)

where z(p,o) is called the z-value of base pattern p∈Q for object o.
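The clipped z-values of formulas (4) and (5) can be sketched as follows (an illustration only, operating on a precomputed z-score):

```python
def z_up(z_score: float) -> float:
    """z-up: keep only above-average values, as in formula (4)."""
    return z_score if z_score > 0 else 0.0

def z_down(z_score: float) -> float:
    """z-down: keep only below-average values, made positive, as in formula (5)."""
    return -z_score if z_score < 0 else 0.0
```

Both functions are non-negative, so the product in formula (6) is positive exactly when every base pattern in B is present for the object.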
In general, the interestingness of a region can be straightforwardly computed by using the average interestingness of the objects belonging to the region. However, using this approach, some very large products might dominate the interestingness computations. For some domain experts, just finding a few objects with very high products in close proximity to each other is important, even if the remaining objects in the region deviate from the observed pattern. In other cases, domain experts are more interested in patterns with highly regular products, so that all or almost all objects in a region share the pattern, and are less interested in a few very high products. To satisfy the needs of different domains, our approach additionally considers purity when computing region interestingness,
where purity(B,c) denotes the percentage of objects o∈c for which i(B,o)>0. In summary, the interestingness of a region c with respect to a co-location set B, denoted by φ(B,c), is computed as follows:

φ(B,c) = (Σo∈c i(B,o) / |c|) · purity(B,c)^θ    (7)

The parameter θ∈[0,∞) controls the importance attached to purity in interestingness computations; θ=0 implies that purity is ignored, and larger values increase the importance of purity.
Fig. 6 depicts regions r in Texas with their highest-valued co-location sets B; that is, the depicted co-location set B has the highest value for φ(B,r).
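Formulas (6) and (7) can be combined in a short sketch (hypothetical data layout: each object is represented as a dict of per-pattern z-values; θ and the purity handling follow equation (7)):

```python
from math import prod  # Python 3.8+

def i_obj(B, o):
    """i(B, o): product of the z-values of the base patterns in B (formula (6))."""
    return prod(o[p] for p in B)

def purity(B, region):
    """Fraction of objects in the region with strictly positive i(B, o)."""
    return sum(1 for o in region if i_obj(B, o) > 0) / len(region)

def phi(B, region, theta=1.0):
    """phi(B, c): average object interestingness weighted by purity**theta
    (formula (7))."""
    avg = sum(i_obj(B, o) for o in region) / len(region)
    return avg * purity(B, region) ** theta

# A region where high A co-locates with low B in 2 of 3 objects:
region = [{"A_up": 2.0, "B_down": 1.5},
          {"A_up": 1.0, "B_down": 2.0},
          {"A_up": 0.0, "B_down": 1.0}]
B = ("A_up", "B_down")
```

Raising θ shrinks the interestingness of regions in which only a few objects carry the pattern, matching the "highly regular products" preference described above.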
3.3 Variance Fitness Function
High Variance Fitness Function is a fitness function to
discover regions where there is high contrast in value of
attribute of interest. For example, in studying of earthquake as discussed in more details in a case study in section 5.2, where attribute of interest is the depth of earthquakes, the domain expert may use High Variance Fitness
Function to find regions where shallow earthquakes are
in close proximity with deep earthquake. The interestingness of a region r, i(r), is defined as follows:

0

i (r )  


 Var (r , attr)  th  * r 

 Var (O, attr)

Var (r , attr)

 th
Var (O, attr)


otherwise 

1
 (attr(o)   attr (r ))2
r  1 or
4 CLUSTERING ALGORITHMS WITH PLUG-IN
FITNESS FUNCTIONS
Another key component of proposed framework is a family of clustering algorithms that allows domain experts to
instruct clustering algorithms to seek clusters that satisfy
their specific requirements. To achieve this flexible clustering capability, several clustering algorithms were designed and implemented that support externally-given
fitness functions that are maximized during the clustering
process. Using different plug-in fitness functions in the
algorithms results in obtaining different, alternative clusters for the same data set. Existing clustering paradigms
have been extended to support plug-in fitness functions,
namely representative-based clustering, agglomerative
clustering, divisive clustering, and grid-based clustering.
Three such clustering algorithms CLEVER [12], MOSAIC
[13], and SCMRG [14] will be briefly introduced and formally described by extending the formal framework that
was introduced in Section 2. Different clustering paradigms are superior with respect to different aspects of
clustering. For example, grid-based clustering algorithms
are able to cluster large datasets quickly, whereas representative-based clustering algorithms discover clusters of
better quality. Finally, agglomerative clustering algorithms are capable of identifying arbitrary shape clusters
which is particularly important in spatial data mining.
They can also be employed as a post processing technique
to enhance the quality of clusters that were obtained by
running a representative-based clustering algorithm.
For example, a variance-based interestingness function can be defined as follows:

i(r) = Var(r, attr) / Var(O, attr)    (8)

where Var(r, attr) denotes the variance of the values attr(o) of the objects o belonging to region r.
The interestingness function parameters β and th are determined in close collaboration with the domain experts. attr is the attribute of interest, and in the formula attr(o) denotes the value of attr for object o. The interestingness function computes the ratio of the region's variance with respect to attr to the dataset's variance; regions whose ratio is above a given threshold th receive rewards.
Figure 1 in Section 1 shows the result of using the above variance interestingness function for an earthquake dataset with earthquake depth being the attribute of interest. The polygons in Figure 1 indicate regions with positive interestingness; usually, those regions will be further ranked by region reward to sort regions from most interesting to least interesting, providing search-engine-type capabilities to scientists who are interested in finding interesting places in spatial datasets.

4.1 CLEVER—A Representative-based Clustering Algorithm

Representative-based clustering algorithms, sometimes called prototype-based clustering algorithms in the literature, construct clusters by seeking a set of representatives; clusters are then created by assigning the objects in the dataset to their closest representative. In general, they compute the following function Ψ:

Ψ: O × q × d × {other parameters} → 2^Dom(S)    (9)

Ψ takes O, q, a distance function d over Dom(S), and possibly other parameters as input and seeks an "optimal set"5 of representatives in Dom(S), such that the clustering X obtained by assigning the objects in O to their closest representative in Ψ(O,q,d,…) maximizes the fitness function q(X). Moreover, it should be noted that clustering is done in the spatial attribute space S, and not in F; the attributes in N are only used by the fitness function q when evaluating clusters.

5 In general, prototype-based clustering is NP-hard. Therefore, most representative-based clustering algorithms will only be able to find a suboptimal clustering X and not the global maximum of q.
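As a concrete illustration of such a plug-in fitness component, the variance interestingness function discussed above can be sketched in a few lines of Python. The function name and the exact reward form (the ratio of region variance to dataset variance, thresholded at th) are illustrative assumptions, not the paper's implementation:

```python
import statistics

def variance_interestingness(region_values, dataset_values, th=1.0):
    """Reward a region whose variance in the attribute of interest
    exceeds th times the dataset-wide variance; otherwise return 0."""
    ratio = statistics.pvariance(region_values) / statistics.pvariance(dataset_values)
    return ratio - th if ratio > th else 0.0

# Earthquake depths: a region mixing shallow and deep quakes has high variance
depths = [5, 6, 7, 8, 9, 10, 300, 310, 320, 305]
print(variance_interestingness([5, 6, 310, 320], depths) > 0)  # → True
```

A region containing only shallow quakes (e.g., depths 5-8) has a variance ratio far below the threshold and receives zero interestingness, so it would be weeded out by the clustering algorithm.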
CLEVER is an example of a representative-based clustering algorithm that uses randomized hill climbing and larger neighborhood sizes6 to battle premature convergence when greedily searching for the best set of representatives. Initially, the algorithm randomly selects k' representatives from O. In the iterative process, CLEVER samples and evaluates p solutions in the neighborhood of the current solution; if the best one improves the fitness, it becomes the current solution. The neighboring solutions are created by applying one of the following operators to a representative of the current solution: Insert, Delete, and Replace. Each operator has a certain selection probability, and the representatives to be manipulated are chosen at random. Moreover, to battle premature convergence, CLEVER re-samples p'>p solutions before terminating. The pseudocode of CLEVER is given in Fig. 4.

6 It modifies the current set of representatives by applying more than one operator to it; e.g., modifying the current set of representatives by replacing two representatives and inserting a new representative.

Fig. 4. Pseudo-code of CLEVER

The cluster model θ for the result obtained by running a representative-based clustering algorithm can be constructed as follows. Let

Ψ(O,q,d,…) = {rep1,…, repk} ⊆ Dom(S)

that is, the representative-based clustering algorithm returned R={rep1,…, repk}. Then the model θ can be defined as follows:

∀p∈S: θ(p)=m ⟺ d(p,repm) ≤ d(p,repj) for j=1,…,k

that is, θ assigns p to the cluster associated with the closest representative7.

7 Our formulation ignores the problem of ties when finding the closest representative; in general, our representative-based clustering algorithms break ties randomly.

Because representative-based clustering algorithms assign objects to clusters using 1-nearest-neighbor queries, the spatial extent of the regions ri ⊆ Dom(S) can be constructed by computing Voronoi diagrams; this implies that the shape of the regions obtained by representative-based clustering algorithms is limited to convex polygons in Dom(S). Neighboring relationships no() between objects in O and nc() between clusters obtained by a representative-based clustering algorithm can be constructed by computing the Delaunay triangulation for R. Moreover, representative-based clustering algorithms do not support the concept of outliers; therefore, representative-based models have to assign a cluster to every point p in S.

4.2 MOSAIC—An Agglomerative Clustering Algorithm

The agglomerative clustering problem can be defined as follows:
Given: O, F, S, N, a fitness function q, and an initial clustering X with contiguous(X).
Find: X'={c'1,…,c'h} that maximizes q(X'), where all clusters in X' have been constructed as unions of neighboring clusters in X:
∀ci∈X': ci=ci1∪…∪cij with ci1,…,cij∈X ∧ nc(cik,cik+1) (for k=1,…,j-1) ∧ ci∩cj=∅ (for i≠j)
Because the above definition assumes that only neighboring clusters are merged, contiguous(X') trivially holds.
In the following, we view the results obtained by agglomerative methods as a meta-clustering X' over an initial clustering X of O; X' over X is defined as an exhaustive set of contiguous, disjoint subsets of X. More formally, the objectives of agglomerative clustering can be reformulated as follows:
Find: X'={x1,…,xr} with xi⊆X (i=1,…,r) maximizing q(X'), subject to the following constraints:
(1) x1∪…∪xr = X
(2) xi∩xj = ∅ (i≠j)
(3) contiguous(xi) (for i=1,…,r)
(4) ∀x∈X' ∃m≥1 ∃x'1,…,x'm∈X: x = x'1∪…∪x'm
We use the term meta-clustering because it is a clustering of clusters and not of objects, as is the case with traditional clustering. It should be noted that agglomerative clusters are exhaustive subsets of an initial clustering X; that is, we assume that outliers are not removed by the agglomerative clustering algorithm itself, but rather by the algorithm that constructs the input X for the agglomerative clustering algorithm. In general, an agglomerative clustering algorithm is composed of two algorithms:
1. a preprocessing algorithm that constructs the clustering X
2. the agglomerative clustering algorithm itself that derives X' from X.
The preprocessing algorithm is frequently degenerate; for example, its input could consist of single-object clusters, or X could be constructed based on a grid structure; however, for many applications it is beneficial to use a full-fledged clustering algorithm for the preprocessing step.
An agglomerative clustering algorithm, MOSAIC [13], has been introduced in previous work. MOSAIC takes the clustering X obtained by running a representative-based region discovery algorithm as its input and merges neighboring regions greedily as long as merging enhances q(X). For efficiency reasons, MOSAIC uses Gabriel graphs [15], which are subgraphs of Delaunay graphs, to compute nc; nc is then used to identify merge candidates for MOSAIC, which are pairs of neighboring clusters whose merging enhances q(X); nc is updated incrementally as clusters are merged. Finally, when clusters are merged, q(X) is updated incrementally, taking advantage of the fact that our framework assumes that q is additive. Fig. 5 gives the pseudo-code for MOSAIC.

Moreover, models for the clusters obtained by an agglomerative region discovery algorithm can easily be constructed from the models of the input clusters in X that have been merged to obtain the region in question. Let us assume r has been obtained as r=r1∪…∪rm; in this case the model for r can be defined as θr(p) = θr1(p) ∨ … ∨ θrm(p).

In the case of MOSAIC, θr(p) is implemented by characterizing MOSAIC clusters by sets of representatives8; new points are then assigned to the cluster whose set of representatives contains the representative that is closest to p. Basically, MOSAIC constructs regions as unions of Voronoi cells, and the above construction takes advantage of this property.

Fig. 5. Pseudo-code of MOSAIC
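The greedy merge loop of MOSAIC can be sketched as follows. This is an illustration rather than the paper's implementation: the neighboring relation nc is passed in directly as a set of cluster-id pairs instead of being derived from a Gabriel graph, and q is a toy additive reward on single clusters:

```python
def mosaic_sketch(clusters, nc, q):
    """Greedily merge neighboring clusters while merging improves q.
    clusters: dict id -> list of objects; nc: set of frozenset id-pairs."""
    nc = set(nc)
    improved = True
    while improved:
        improved = False
        best_pair, best_gain = None, 0.0
        for pair in nc:
            a, b = tuple(pair)
            # gain of merging, exploiting additivity of q
            gain = q(clusters[a] + clusters[b]) - (q(clusters[a]) + q(clusters[b]))
            if gain > best_gain:
                best_pair, best_gain = (a, b), gain
        if best_pair:
            a, b = best_pair
            clusters[a] = clusters[a] + clusters.pop(b)
            # update nc incrementally: b's neighbors become a's neighbors
            new_nc = set()
            for pair in nc:
                pair = {a if x == b else x for x in pair}
                if len(pair) == 2:
                    new_nc.add(frozenset(pair))
            nc = new_nc
            improved = True
    return clusters

clusters = {0: [1, 2], 1: [2, 3], 2: [50]}
nc = {frozenset((0, 1)), frozenset((1, 2))}
q = lambda c: len(c) ** 2 if max(c) - min(c) <= 3 else 0  # toy reward: compact clusters
merged = mosaic_sketch(clusters, nc, q)
print(sorted(len(c) for c in merged.values()))  # → [1, 4]
```

The two compact clusters are merged because the merged reward (16) exceeds the sum of the parts (8), while merging with the far-away cluster would destroy the reward, so it is left alone.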
4.3 SCMRG—A Divisive Grid-based Clustering Algorithm

The divisive clustering problem can be defined as follows:
Given: O, F, S, N, a fitness function q, and an initial clustering X={x1,…,xh} with contiguous(X).
Find: X'={c'1,…,c'k} that maximizes q(X') and has been obtained from X.
Procedure: Initially, X' is set to X. Then X' is modified to increase q(X') by recursively replacing a region x∈X' with sub-regions x'1,…,x'p, as long as q(X') improves and the following conditions are satisfied:
(1) x'j⊆x (j=1,…,p)
(2) x'j∩x'i=∅ (for j≠i)
(3) contiguous(x'j) (j=1,…,p)
(4) reward(x) < reward(x'1)+…+reward(x'p)

8 If r in X' has been constructed using r=r1∪…∪rm from X, r is characterized by the representatives of the regions r1,…,rm.
Region x is only replaced by regions at a lower level of resolution if the sum of the rewards of the regions at the lower level of resolution is higher than x's reward. It should be emphasized that the splitting procedure employs a variable number of decompositions; e.g., one region might not be split at all, another region might be split into just four regions, whereas a third region might be split into 17 subregions. Moreover, the splitting procedure is not assumed to be exhaustive; that is, x can be split into y1, y2, y3 with y1∪y2∪y3⊂x; in other words, the above specification allows divisive region discovery algorithms to discard outliers when seeking interesting regions; basically, the objects belonging to the residual region x\(y1∪y2∪y3) in the above example are considered to be outliers.
SCMRG (Supervised Clustering using Multi-Resolution Grids) [14] is a divisive, grid-based region discovery algorithm that was developed in our past work. SCMRG partitions the spatial space Dom(S) of the dataset into grid cells. Each grid cell at a higher level is partitioned further into a number of smaller cells at the lower level, and this process continues if the sum of the rewards of the lower-level cells is greater than the reward of the higher-level cell. The regions returned by SCMRG usually have different sizes, because they were obtained at different levels of resolution. Moreover, a cell is drilled down only if it is promising (if its fitness improves at a lower level of resolution). SCMRG uses a look-ahead splitting procedure that splits a cell into 4, 16, and 64 cells, respectively, and analyzes whether there is an improvement in fitness in any of these three splits; if this is not the case and the original cell receives a reward, this cell is included in the region discovery result; however, cells that neither themselves nor their successors at lower levels of resolution receive any rewards will be treated as outliers and discarded from the final clustering X'.
SCMRG employs a queue to store cells that need further processing. SCMRG starts at a user-defined level of resolution and puts the cells associated with this level on the queue. Next, SCMRG generates a clustering from the cells in the queue by traversing the hierarchical structure and examining those cells, considering the following three cases when processing a cell:
Case 1. If the cell c receives a reward, and its reward is greater than the sum of the rewards of its children and the sum of the rewards of its grandchildren, respectively, this cell is returned as a cluster by the algorithm.
Case 2. If the cell c does not receive a reward, and neither do its children and grandchildren, then neither the cell nor any of its descendants will be further processed or labeled as a cluster.
Case 3. Otherwise, if the cell c does not receive a reward but its children receive rewards, all the children of cell c are put into the queue for further processing.
Finally, all cells that have been labeled as clusters (Case 1) are returned as the final result of SCMRG.
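The drill-down logic of Cases 1-3 can be sketched with a quadtree over 2-D points; the cell geometry, the purity-style reward function, and the depth limit below are illustrative assumptions rather than the framework's actual components:

```python
def scmrg_sketch(points, cell, reward, depth=0, max_depth=4):
    """Return reward-receiving cells as clusters, drilling down while the
    children's summed reward beats the parent's (Cases 1-3 in the text).
    cell = (x0, y0, x1, y1); points: list of (x, y, label)."""
    x0, y0, x1, y1 = cell
    inside = [p for p in points if x0 <= p[0] < x1 and y0 <= p[1] < y1]
    if not inside:
        return []
    here = reward(inside)
    if depth == max_depth:
        return [inside] if here > 0 else []
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    children = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                (x0, ym, xm, y1), (xm, ym, x1, y1)]
    child_results = [scmrg_sketch(points, c, reward, depth + 1, max_depth)
                     for c in children]
    child_reward = sum(reward(cl) for res in child_results for cl in res)
    # Case 1: the cell is rewarded at least as much as its refinement -> keep it
    if here > 0 and here >= child_reward:
        return [inside]
    # Case 3: otherwise keep the rewarded cells found below
    # (Case 2: if nothing is rewarded anywhere, the objects end up as outliers)
    return [cl for res in child_results for cl in res]

points = [(0.1, 0.1, 'a'), (0.2, 0.2, 'a'), (0.8, 0.8, 'b'),
          (0.9, 0.9, 'b'), (0.1, 0.9, 'a'), (0.9, 0.1, 'b')]
pure_size = lambda cl: len(cl) if len({p[2] for p in cl}) == 1 else 0
clusters = scmrg_sketch(points, (0.0, 0.0, 1.0, 1.0), pure_size)
print(len(clusters))  # → 4: the mixed top-level cell is drilled down into pure cells
```

The mixed top-level cell earns no reward, so the sketch descends; each quadrant is pure, and further drilling would not beat the quadrant's own reward, so the recursion stops there, mirroring the look-ahead test.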
5 CASE STUDIES
5.1 Co-location Mining of Risk Patterns of Arsenic and Associated Chemicals in the Texas Water Supply
In this case study, we apply our domain-driven clustering framework to discover interesting regions where two or more attributes are co-located, together with their associated patterns. The employed procedure is summarized in Fig. 6 and explained step by step below:
Fig. 6. The procedure for applying the domain-driven clustering framework for actionable region discovery with the involvement of domain experts
Step 1. Define the problem: Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more features co-locate in spatial proximity. For this case study, hydrologists helped us select subsets of chemicals and some external factors suspected of generating high levels of Arsenic concentrations. The interesting pattern set B is defined as follows:
Given:
• N={A1,…,Aq}, the set of non-spatial continuous attributes that measure chemical concentrations in Texas water wells.
• Q={A1↑, A1↓,…, Aq↑, Aq↓}, the set of base co-location patterns; in this case study, the domain expert is interested in finding associations of high/low concentrations (denoted by '↑' and '↓', respectively) with high/low concentrations of other chemicals.
• B ⊆ Q, a set of co-location patterns.
• P(B), a predicate over B that restricts the co-location sets considered, e.g. P(B) = (As↑∈B) ("only look for co-location sets involving high concentrations of Arsenic").
Step 2. Create/select a fitness function: First, the hydrologists formulate a measure of their interestingness in the form of a reward-based fitness function. Fitness functions express extrinsic characteristics, which vary across problems and domains. In our framework, the fitness function is a generic component, so hydrologists can define several fitness functions, some of which might be small variations of each other reflecting their diverse interests. The simplified version of the fitness function applied in co-location mining, called the z-value, was given in Section 3.2.
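A sketch of such a z-value-style interestingness measure is given below; the exact formula in Section 3.2 differs, and the function names and data here are purely illustrative. The idea is that the product of z-scores, averaged over the objects in a region, is large and positive where the pattern's attributes are jointly high (or jointly low):

```python
import statistics

def zscores(values):
    """Standardize a column of attribute values."""
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def colocation_interestingness(region_idx, attr_columns):
    """Average, over the objects in a region, of the product of the
    z-scores of the attributes participating in the pattern."""
    z_cols = [zscores(col) for col in attr_columns]
    products = [1.0] * len(region_idx)
    for z in z_cols:
        products = [p * z[i] for p, i in zip(products, region_idx)]
    return sum(products) / len(products)

# Wells 0-2 have jointly high arsenic and boron concentrations:
arsenic = [9.0, 8.5, 9.5, 1.0, 1.2, 0.8]
boron = [7.0, 7.5, 6.8, 0.9, 1.1, 1.0]
print(colocation_interestingness([0, 1, 2], [arsenic, boron]) > 0)  # → True
```

Note that a region of jointly low concentrations also scores positively (a product of two negative z-scores), which is exactly the co-location behavior the base patterns Aq↑ and Aq↓ are designed to capture.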
Step 3. Select a clustering algorithm: The framework provides many algorithms that exemplify different clustering paradigms, e.g. representative-based clustering, divisive grid-based clustering, and agglomerative clustering. For this case study, CLEVER (CLustEring using representatiVEs and Randomized hill climbing) is employed to identify regions and associated co-location patterns.
Step 4. Select parameters of the fitness function and the clustering algorithm: Tuning the parameters of the fitness function and the region discovery framework, such as β, helps obtain better results, or extends the search to focus on alternative patterns or patterns at different levels of granularity. For the particular fitness function employed, the parameter β controls the importance of the purity of a pattern in interestingness computations; the larger β is, the more emphasis is put on the purity of a pattern. Besides those parameters, hydrologists can also specify seed patterns, e.g. making As a mandatory item in the co-location patterns considered. Later on, they can simply change the seed patterns to force the co-location mining algorithm to seek alternative patterns, e.g. patterns that are co-located with both {As,F}. This bridges the gap between hydrologists' expectations and the results of clustering algorithms, permitting hydrologists to tune these parameters in order to derive actionable patterns.
Step 5. Run the clustering algorithm to discover interesting regions and associated patterns: The results (a set of clusters) obtained from the clustering algorithm are ranked either by reward or by interestingness. An example of the experimental results is given in Fig. 7. For instance, the first-ranked pattern indicates that a high level of Arsenic co-locates with high levels of Boron, Chloride, and Total Dissolved Solids in southern Texas.
Fig. 7. Example of Top 5 regions ranked by interestingness
Step 6. Analyze the results: By the nature of the fitness functions, the clustering algorithm weeds out the many regions having zero interestingness. The experimental results show the ability of the framework to identify interesting regions and associated patterns, exemplified in Fig. 7, which are comparable to the regions of high Arsenic concentration obtained from TCEQ, as depicted in Fig. 8. Steps 4-6 are usually repeated several times in order to enhance the results or explore alternative regions and patterns.
Fig. 9. A procedure of applying domain-driven clustering framework
in change analysis
First, geologists sample two datasets corresponding to two different time frames.
Second, the domain-driven clustering framework is used to separately identify the interesting regions of each time frame; a fitness function rewards high variance of earthquake depth. To generate intensional clusterings from the results of CLEVER, we construct Voronoi cells, in which polygons represent cluster models. Then, change analysis techniques are applied in Steps 3-6 in order to detect and identify different change patterns in those regions.
Third, users select relevant change predicates to compare changes between the two intensional clusterings; the predicates also have thresholds that are controlled externally. Then changes between the two clusterings are instantiated with respect to the predicate thresholds. Finally, emergent patterns are summarized and further analyzed.
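The change predicates of the third step can be sketched with a simple overlap test; for illustration the regions are 1-D intervals rather than polygons, and the predicate "novel" with its externally controlled threshold th is an assumed example:

```python
def overlap(a, b):
    """Length of the intersection of two 1-D intervals (stand-ins for polygons)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def novel(region_new, regions_old, th=0.1):
    """Change predicate: a new region is 'novel' if at most a th-fraction
    of it is covered by any old region."""
    length = region_new[1] - region_new[0]
    return all(overlap(region_new, r) / length <= th for r in regions_old)

old = [(0.0, 2.0), (5.0, 6.0)]
new = [(8.0, 9.0), (0.1, 1.9)]
print([novel(r, old) for r in new])  # → [True, False]
```

A "disappearance" predicate would be the mirror image: an old region is a disappearance if at most a th-fraction of it is covered by any new region.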
Fig. 8. Arsenic pollution map
In contrast to traditional clustering, our framework offers search-engine-type capabilities to domain experts to help them identify patterns they are interested in. Domain experts assist and incorporate their knowledge in several mining phases, especially before the clustering phase. By expressing their interestingness in the form of fitness functions, the clustering algorithms are able to seek clusters with extrinsic characteristics. Therefore, the clusters and associated patterns obtained represent actionable knowledge.
5.2 Change Analysis in Earthquake Data
A change analysis framework is developed on top of our framework to analyze how interesting regions change between two different time frames; for instance, analyzing changes in places where deep earthquakes are in close proximity to shallow earthquakes. Fig. 9 summarizes the approach of change analysis, and the steps are explained next.
Fig. 10. An overlay of interesting regions discovered in Oold and Onew
Fig. 10 illustrates an overlay of the interesting regions discovered in Oold and Onew; the red regions belong to the early time frame (labeled Regionold), whereas the blue regions belong to the late time frame (labeled Regionnew). Examples of the relationships discovered between the two clusterings are also given: regions 5 and 10 in Fig. 11 are considered novel, whereas regions 0, 2, 3, and 7 in Fig. 12 are considered disappearances.
Fig. 11. Novelty areas of regions in Onew data

Fig. 12. Disappearance areas of regions in Oold data

5.3 Other Applications of the Framework for Actionable Regional Knowledge Discovery
Besides the use of the domain-driven clustering framework to discover actionable knowledge in the two aforementioned case studies, the framework can also be applied to aid knowledge discovery in other real applications. The first application, similar to the first case study, is co-location mining in planetary science [16]; we are interested in mining feature-based hotspots on Mars where extreme densities of deep ice and shallow ice co-locate; the fitness function employed is the absolute value of the product of the z-scores of the continuous non-spatial features in the spatial dataset. Outcomes of the framework are regions exhibiting either very high co-location or very high anti-co-location.
The second application is regional correlation pattern discovery using PCA in hydrology [10]. Finding regional patterns in spatial datasets is an important data mining task. A PCA-based fitness function is used to discover regional correlation patterns. This approach is more effective than solely applying PCA once or multiple times to the data, since PCA is applied repeatedly to candidate regions to explore each possible region combination. This case study uses a PCA-based fitness function that maximizes the eigenvalues of the first k principal components; it rewards regions with high correlation, since highly correlated attribute sets result in higher eigenvalues; in other words, more variance is captured.
The third application is multi-objective clustering, whose goal is to seek a set of clusters that individually satisfy multiple objectives. For example, hydrologists are also interested in identifying regions which satisfy multiple patterns of chemical contamination in the water supply. We apply multi-run clustering as a tool to gather multi-objective clusters simultaneously. Multi-run clustering reduces extensive human effort by searching for and enhancing novel, high-quality clusters in an automated fashion. Since multi-run clustering is developed on top of the domain-driven clustering framework, it conforms to the framework and inherits its capability to plug in different clustering algorithms and fitness functions. Therefore, results obtained from multi-run clustering are also considered actionable.

6 CONCLUSION

In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific, plug-in fitness functions that are maximized by clustering algorithms. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Moreover, an ontology and a theoretical foundation for clustering with fitness functions in general, and for region discovery in particular, are introduced. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed.
The framework was evaluated on different region discovery tasks in several case studies. The framework treats the region discovery problem as a clustering problem in which a given, plug-in fitness function has to be maximized. By integrating domain knowledge and domain-specific evaluation measures into parameterized, plug-in fitness functions, together with externally controlled thresholds, the framework is able to obtain actionable regional knowledge and associated patterns satisfying domain-specific needs.
The case studies demonstrate the capability of the framework to integrate domain intelligence and effectively steer clustering tasks by incorporating domain requirements into the clustering algorithms in the form of a fitness function that guides clustering. To the best of our knowledge, this capability has been little explored by past research in the field of clustering, and we are optimistic that our proposed framework will foster novel applications of domain-driven clustering.
REFERENCES
[1] L. Cao and C. Zhang, "The Evolution of KDD: Towards Domain-Driven Data Mining," Int'l J. Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692, 2007.
[2] I. Davidson and S.S. Ravi, "Clustering under Constraints: Feasibility Issues and the k-means Algorithm," Proc. Fifth SIAM Data Mining Conf., 2005.
[3] Q. Yang, K. Wu, and Y. Jiang, "Learning Action Models from Plan Examples Using Weighted MAX-SAT," Artificial Intelligence, vol. 171, no. 2-3, pp. 107-143, 2007.
[4] N. Zhong, "Actionable Knowledge Discovery: A Brain Informatics Perspective," Special Trends and Controversies Department on Domain-Driven, Actionable Knowledge Discovery, IEEE Intelligent Systems, vol. 22, no. 4, pp. 85-86, 2007.
[5] W. Graco, T. Semenova, and E. Dubossarsky, "Toward Knowledge-Driven Data Mining," Proc. Domain Driven Data Mining Workshop, 2007.
[6] G. Karypis, E.H. Han, and V. Kumar, "Chameleon: Hierarchical Clustering Using Dynamic Modeling," IEEE Computer, vol. 32, no. 8, pp. 68-75, 1999.
[7] C.F. Eick, N. Zeidat, and Z. Zhao, "Supervised Clustering: Algorithms and Benefits," Proc. Int'l Conf. Tools with AI, 2004.
[8] A. Bagherjeiran, C.F. Eick, C.-S. Chen, and R. Vilalta, "Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience," Proc. Fifth IEEE Int'l Conf. Data Mining, 2005.
[9] E.H. Simpson, "The Interpretation of Interaction in Contingency Tables," J. Royal Statistical Soc., ser. B, vol. 13, pp. 238-241, 1951.
[10] O.U. Celepcikay and C.F. Eick, "A Regional Pattern Discovery Framework Using Principal Component Analysis," Proc. Int'l Conf. Multivariate Statistical Modeling & High Dimensional Data Mining, 2008.
[11] I.T. Jolliffe, Principal Component Analysis, Springer, 1986.
[12] C.F. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, "Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets," Proc. 16th ACM SIGSPATIAL Int'l Conf. Advances in GIS, 2008.
[13] J. Choo, R. Jiamthapthaksin, C.-S. Chen, O. Celepcikay, C. Giusti, and C.F. Eick, "MOSAIC: A Proximity Graph Approach to Agglomerative Clustering," Proc. Ninth Int'l Conf. Data Warehousing and Knowledge Discovery, 2007.
[14] C.F. Eick, B. Vaezian, D. Jiang, and J. Wang, "Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering," Proc. 10th European Conf. Principles and Practice of Knowledge Discovery in Databases, 2006.
[15] K. Gabriel and R. Sokal, "A New Statistical Approach to Geographic Variation Analysis," Systematic Zoology, vol. 18, pp. 259-278, 1969.
[16] W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. Stepinski, and C.F. Eick, "Towards Region Discovery in Spatial Datasets," Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 2008.
Christoph F. Eick received his PhD from the University of Karlsruhe
in Germany. He is currently an Associate Professor in the Department of Computer Science at the University of Houston. He is the
Co-Director of the UH Data Mining and Machine Learning Group. His
research interests include data mining, machine learning, evolutionary computing, and artificial intelligence. He has published more than 95 papers in these and related areas. He serves on the program committee of the IEEE International Conference on Data Mining (ICDM) and other major data mining and machine learning conferences.
Oner Ulvi Celepcikay is a senior PhD candidate in the Computer Science Department at the University of Houston. He received his bachelor's degree in Electrical Engineering in 1997 from Istanbul University, Istanbul, Turkey, and his M.S. degree in Computer Science from the University of Houston in 2003. He worked in the University of Houston Educational Technology Outreach (ETO) Department from 2000 to 2007. He has published a number of papers in his research fields, including cluster analysis, multivariate statistical analysis, and spatial data mining. He served as a session chair at the International Conference on Multivariate Statistical Modeling & High Dimensional Data Mining in 2008 and has served as a non-PC reviewer for many conferences.
Rachsuda Jiamthapthaksin received her bachelor's degree in Computer Science in 1997 and her master's degree in Computer Science, with Honors and the Dean's Prize for Outstanding Performance, in 1999 from Assumption University, Bangkok, Thailand. She was a faculty member in the Computer Science Department at Assumption University from 1997 to 2004. She is now a PhD candidate in Computer Science at the University of Houston, Texas. She has published papers in the areas of her research interests, including intelligent agents, fuzzy systems, cluster analysis, data mining, and knowledge discovery. She has served as a non-PC reviewer for many conferences and as a volunteer staff member in the organization of the 2005 IEEE ICDM Conference, November 2005, Houston, Texas.
Vadeerat Rinsurongkawong is a PhD candidate in Computer Science at the University of Houston. She received her M.S. degree in Information Technology from Assumption University, Thailand, and her B.Eng. degree in Electrical Engineering from Chulalongkorn University, Thailand. She has work experience in electrical engineering, information technology, and computer science. Her areas of interest are data mining and databases.