Using AK-Mode Algorithm to Cluster OLAP Requirements

1 Nouha Arfaoui, 2 Jalel Akaichi
1,2 BIRT - Institut Supérieur de Gestion, 41, Avenue de la liberté, Cité Bouchoucha, Le Bardo 2000, Tunisia
1 [email protected], 2 [email protected]

Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 9, September 2012, ISSN 2079-8407, http://www.cisjournal.org

ABSTRACT

Data warehousing is becoming increasingly important for strategic decision making through its capacity to integrate heterogeneous data from multiple information sources into a common storage space for querying and analysis. Since designing a data warehouse is not an easy task, we propose exploiting the OLAP requirements to construct the schemas of the data marts, which are then used to build the schema of the data warehouse. The data marts are constructed by clustering the schemas that correspond to the OLAP requirements. In this work we focus on the clustering step and propose AK-Mode, an extension of the k-mode algorithm. AK-Mode integrates an ontology in order to take the semantic aspect of our data into consideration.

Keywords: AK-Mode, Clustering, Simple Matching, Ontology, Data Warehouse Schema, Data Mart Schema.

1. INTRODUCTION

Data warehousing is becoming increasingly important for strategic decision making through its capacity to integrate heterogeneous data from multiple information sources into a common storage space for querying and analysis. In order to fully exploit the data warehouse (DW), a good design is essential, so that the specified needs are satisfied and a complete, centralized view of all existing data is provided. The design is not an easy task because of the knowledge that must be acquired about both the application domain and the design techniques; mastering them demands effort and time, especially as the domain continuously changes and evolves. This calls for design methods such as top-down, bottom-up and middle-out.

The idea is to construct the data warehouse schema from the OLAP requirements. To facilitate this task, we propose the use of clustering, a data mining technique, to group the schemas that result from transforming the requirements. As a result we get schemas grouped according to their departments or business functions; for each cluster we then construct the corresponding data mart schema.

In this work we focus on clustering the schemas, and we start by defining this notion. Clustering is the unsupervised classification of patterns into groups called clusters [3]; it involves dividing a set of data points into non-overlapping groups of points [35]. The objects of one cluster are more similar to each other than to the objects of other clusters, so clustering aims at maximizing the homogeneity within each group, with the notion of similarity determined by some measure of proximity. In the literature, clustering was first used for numerical data, and many algorithms have been proposed for it; at the base of each algorithm are coefficients that compute the similarity or dissimilarity between objects. With the emergence of categorical data and its use in real databases, it became important to look for algorithms that cluster this kind of data, and many new ones have been proposed.
The challenge, at this level, is the similarity measure [6]. Traditional measures used with numerical data cannot be applied directly; they must be modified to take the specific characteristics of categorical data into consideration, and new measures have consequently been proposed for clustering categorical data. The problem in our case is that the schemas carry additional information that has to be taken into consideration when making the comparison, and this information can influence the clustering result: for example, several words may denote the same thing. If we use the traditional measures, the result of the comparison will not reflect reality, and the traditional version of k-mode ignores this level entirely. To overcome this problem, we propose AK-Mode, which extends the Simple Matching (SM) dissimilarity measure with an ontology and thereby improves the effectiveness of this measure.

The outline of this paper is as follows: in section 2, we detail some works that have combined the OLAP and data mining technologies. In section 3, we propose the use of a multidimensional table to express the OLAP requirements; the information is visualized using an intermediate schema. Section 4 describes the AK-Mode algorithm, argues for this choice and explains the different modifications. We finish with the conclusion.

2. STATE OF THE ART

In this section we review works that have combined the OLAP and data mining technologies to take advantage of both.

In [25], Ben Messaoud et al. propose OpAC (Operator for Aggregation by Clustering), a new operator for multidimensional on-line analysis. It uses agglomerative hierarchical clustering to achieve a semantic aggregation on the attributes of a data cube dimension. The authors take advantage of both OLAP and data mining to obtain an analysis process that provides exploration, explication and prediction capabilities.

Another data mining system, DBMiner, was presented in [10]. It integrates different data mining functions such as characterization, comparison, association, classification, prediction and clustering, as well as database and OLAP technologies. It has the advantage of mining various kinds of knowledge at multiple levels of abstraction from different databases as well as from data warehouses. It offers an SQL-like data mining query language, DMQL, and a graphical user interface that facilitates interactive mining.

In the same context, Goil [30] proposes PARSIMONY, a parallel and scalable OLAP and data mining framework for large datasets. The system partitions the datasets into chunks, so the data can be stored either as dense blocks or in a sparse representation. There is also iDiff [33], an operator that automates manual discovery processes using mining technology; iDiff returns summarized reasons for drops or increases observed at an aggregated level, in a single step.

In [24], Chen et al. propose a scalable DW- and OLAP-based engine for analyzing web log records.
The proposed framework supports the typical OLAP operations as well as DM operations such as extended multilevel and multidimensional association rules, with the OLAP server used as a computation engine for the DM operations.

Data mining can also be applied to detect outliers. In this field, [32] implements an OLAP-outlier-based data association method combining OLAP and data mining: the method integrates the outlier detection concept from DM with ideas from the OLAP field, and it is used to solve the data association problem. Such a combination can also serve to discover causal relations among heterogeneous databases, as presented in [16], which proposes a computer software agent combining data warehouse, OLAP and KDD functionalities in order to support knowledge discovery tasks. The solution consists in developing IIMiner (Integrated Interactive data Miner), which provides convenient ways for the user to interact with the KDD processes using OLAP and DW techniques. In [29], Goil and Choudhary present a parallel multidimensional framework for large data sets in OLAP; it has been integrated with the mining of association rules to facilitate handling a large number of dimensions and large datasets.

3. OLAP REQUIREMENTS

This section presents the OLAP requirements and their modeling, first through the multidimensional table and then through an intermediate schema.

The requirements play a crucial role in the DW design process: since they constitute the first step, faulty requirements can cause the failure of the whole project. Despite this importance, not much attention has been paid to this phase, with the consequence that 85% of DW projects fail to meet business objectives and 40% of DW projects are never completed; poor communication between the different stakeholders is considered a main reason for these failures [11]. According to [9], this phase is used to specify "what data should be available and how it should be organized as well as what queries are of interest", and it serves to extract the elements of the multidimensional schema (facts, measures, dimensions, hierarchies); this extraction helps to manipulate and compute the data.

3.1 The Multidimensional Table

Different works in the literature propose the use of the n-dimensional table (also called multidimensional table) as a way to express the needs of decision makers. Since decision makers are generally not computer scientists, they can find it difficult to express their needs using SQL queries (especially with the GROUP BY and/or HAVING clauses); as a solution, we propose the use of the n-dimensional table [18], [8]. It is a tabular representation that shows the fact of interest to the decision maker; this fact and its measures can be analyzed according to dimensions and their granularity levels. This choice is made because the table is easy for a non-computer-scientist to use, it allows seeing the values of certain attributes as a function of others, and the representation is close to the decision makers' vision of the data [8].

We propose the use of a multidimensional table (MT) to visualize the structure of the data, i.e., the fact, the dimensions and the measures. Fig. 1 shows the model we adopt for our MT, which has the following structure (a minimal data-structure sketch is given at the end of this section):

- "Dom": the domain of analysis.
- "F": the fact corresponding to the analyzed subject.
- "M": the measures, defined through aggregation functions "f": {f1(m1), ..., fn(mn)}.
- "CalculFun": the function used to calculate the measure.
- "D": the dimensions related to the subject of analysis.
- "L": the levels.
- "HStar": HD -> HL1 x ... x HLn, a function that associates the different levels with their linked dimension instance.

Fig 1: The model of the multidimensional table

Fig. 2 presents an example of an MT with three dimensions, "Customer", "Supplier" and "List Book", one fact, "Sale", and the measure "Benefit", which is calculated using the "Sum" function. For the dimension "Customer" we take two levels into consideration, "Customer ID" and "Quantity".

Fig 2: Example of Multidimensional Table (MT)

3.2 The Schema of the OLAP Requirement

Using the database and the multidimensional table, we propose the visualization of an intermediate schema. This schema allows the users to validate their requirements themselves. It is then transformed into an XML file to facilitate its manipulation during the following stage. Applying this to our example, we get Fig. 3, the intermediate schema corresponding to the MT of Fig. 2.

Fig 3: The intermediate schema corresponding to the MT (Fig. 2)
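To fix ideas before moving to the algorithm, the following is a minimal Java sketch of how one OLAP-Requirement Schema extracted from a multidimensional table could be represented: the fact, its measures with their aggregation functions, and the dimensions with their ordered levels. All class and field names here are illustrative assumptions, not taken from the paper's implementation.

import java.util.List;
import java.util.Map;

// Illustrative container for one OLAP-Requirement Schema (ORS):
// the fact ("F"), its measures ("M") with their aggregation
// functions ("CalculFun"), and the dimensions ("D") with their
// ordered levels ("L", organized by "HStar").
public class OlapRequirementSchema {
    final String domain;                        // "Dom": domain of analysis
    final String fact;                          // e.g. "Sale"
    final Map<String, String> measures;         // measure -> aggregation, e.g. "Benefit" -> "Sum"
    final Map<String, List<String>> dimensions; // dimension -> ordered levels

    public OlapRequirementSchema(String domain, String fact,
                                 Map<String, String> measures,
                                 Map<String, List<String>> dimensions) {
        this.domain = domain;
        this.fact = fact;
        this.measures = measures;
        this.dimensions = dimensions;
    }

    // Example mirroring Fig. 2: fact "Sale", measure "Benefit" summed,
    // dimension "Customer" with levels "Customer ID" and "Quantity".
    // The domain name and the levels of the other two dimensions are
    // not detailed in the paper, so they are left as placeholders.
    public static OlapRequirementSchema fig2Example() {
        return new OlapRequirementSchema(
            "Sales", "Sale",
            Map.of("Benefit", "Sum"),
            Map.of("Customer", List.of("Customer ID", "Quantity"),
                   "Supplier", List.of(),
                   "List Book", List.of()));
    }
}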
4. THE AK-MODE ALGORITHM

In this section, we propose AK-Mode, an extension of the k-mode algorithm in which an ontology is used to calculate the dissimilarity distance.

Data mining (DM) is "the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner" [7]. Many techniques and algorithms are used, among them clustering, classification and prediction. In our case, we propose the use of clustering, the process of partitioning a given population of events or items into sets of similar elements, so that items within a cluster have high similarity to one another but are very dissimilar to items in other clusters [1].

Clustering can be applied to various types of data: continuous numerical variables [26], [21], binary variables [12] and categorical variables [21]. We propose its use to cluster OLAP-Requirement Schemas (ORS); each ORS is composed of a fact together with a set of dimensions, measures and levels. For categorical data, the available algorithms include K-MODE [19], ROCK [31], QROCK [17], CACTUS [28], COOLCAT [5], CLICK [20], LIMBO [23], MULIC [4], etc. Our choice is driven by time complexity; Table 1 presents the time complexity of the different algorithms.

Table 1: The algorithms that are used to cluster categorical data

Algorithm   Complexity     Coefficient
K-MODE      O(n)           Simple Matching
ROCK        O(kn^2)        Links
QROCK       O(n^2)         Threshold
CACTUS      Scalable       Support
COOLCAT     O(n^2)         Entropy
CLICK       Scalable       Co-occurrence
LIMBO       O(n log n)     Information Bottleneck
MULIC       O(n^2)         Hamming measure

We can notice that k-mode has the lowest complexity, O(n), but it cannot deal with our data because it does not take the semantic aspect of the elements into consideration. We therefore extend it to our case and propose AK-Mode.

4.1 The Algorithm

The new algorithm uses an extension of "Simple Matching" as well as an extension of the mode update. The rest of the algorithm is kept as in the original and is described as follows [36] (a sketch of this loop is given after the steps):

a) Select k initial modes.
b) Allocate each object to the cluster whose mode is nearest to it, using formula (1):

d(A, B) = Σ δ(a_i, b_i) over the n attributes, where δ(a_i, b_i) = 0 if a_i = b_i and δ(a_i, b_i) = 1 if a_i ≠ b_i.   (1)

Update the mode of the cluster after each allocation.
c) After all objects have been allocated to their respective clusters, retest the objects against the new modes and update the clusters.
d) Repeat steps (b) and (c) until there is no change in the clusters.
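To make the loop above concrete, here is a compact Java sketch of the plain k-mode procedure with the simple matching distance of formula (1), before any ontology extension. Objects are fixed-length arrays of categorical values, and the mode of a cluster takes the most frequent value per attribute (the frequency-based update mentioned in the conclusion). This is a generic illustration of [36], not the authors' code; for simplicity the modes are refreshed after each full pass rather than after every single allocation.

import java.util.*;

public class KModes {
    // Formula (1): count of attribute positions where the objects differ.
    static int distance(String[] a, String[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) if (!a[i].equals(b[i])) d++;
        return d;
    }

    // Frequency-based mode update: most frequent value per attribute.
    static String[] mode(List<String[]> cluster, int nAttrs) {
        String[] m = new String[nAttrs];
        for (int i = 0; i < nAttrs; i++) {
            Map<String, Integer> freq = new HashMap<>();
            for (String[] obj : cluster) freq.merge(obj[i], 1, Integer::sum);
            m[i] = Collections.max(freq.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
        return m;
    }

    // 'modes' holds the k initial modes of step (a); returns cluster indices.
    static int[] cluster(List<String[]> objects, String[][] modes) {
        int[] assign = new int[objects.size()];
        boolean changed = true;
        while (changed) {                                   // step (d): repeat until stable
            changed = false;
            for (int o = 0; o < objects.size(); o++) {      // step (b): nearest mode
                int best = 0;
                for (int k = 1; k < modes.length; k++)
                    if (distance(objects.get(o), modes[k]) < distance(objects.get(o), modes[best]))
                        best = k;
                if (assign[o] != best) { assign[o] = best; changed = true; }
            }
            for (int k = 0; k < modes.length; k++) {        // step (c): update the modes
                List<String[]> members = new ArrayList<>();
                for (int o = 0; o < objects.size(); o++)
                    if (assign[o] == k) members.add(objects.get(o));
                if (!members.isEmpty()) modes[k] = mode(members, modes[k].length);
            }
        }
        return assign;
    }
}

AK-Mode keeps this overall loop and replaces the attribute-equality test and the distance with the ontology-aware versions described next.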
4.2 The Ontology

The ontology is used to resolve heterogeneity problems [34]; in our case it can also improve quality through hierarchical knowledge [2], [14]. We propose in this section a way to improve the simple matching dissimilarity measure. The traditional measures, whether for numerical, categorical or even heterogeneous data, ignore semantic knowledge, which negatively influences the quality of the interpretations [15], especially when it is possible to add semantic information about the domain [27].

In our work, we propose the use of two kinds of ontology: the WordNet ontology and a domain ontology.

WordNet ontology: a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept; synsets are interlinked by means of conceptual-semantic and lexical relations [13].

Domain ontology: it contains information about the different classes as well as the relationships between them. We define the following most general concepts:

- Domain: indicates the domain to which the schema belongs.
- Schema: groups the facts, dimensions, measures, hierarchies and levels that belong to the same schema; this concept includes the different terms used to designate the same meaning.
- Fact: corresponds to the subject of analysis; it includes all the different ways used to describe one fact.
- Dimension: corresponds to an axis of analysis; it groups the different ways to describe one specific dimension.
- Measure: every fact has one or more numerical measures; we keep information about the different words used to describe a specific measure.
- Hierarchy: a logical structure used to order levels as a means of organizing data.
- Level: a position in a hierarchy; we keep information about the different terms used to describe the same level.

Concerning the relationships, we have:

- is-Schema(Si, Dj): "Si" is a schema that belongs to the domain "Dj".
- is-Fact(Fi, Sj): "Fi" is a fact that belongs to the schema "Sj".
- is-Dimension(Di, Fj): "Di" is a dimension that belongs to the fact "Fj".
- is-Measure(Mi, Fj): "Mi" is a measure that belongs to the fact "Fj".
- is-Hierarchy(Hi, Dj): "Hi" is a hierarchy that characterizes the dimension "Dj".
- is-Level(Li, Hj): "Li" is a level that exists in the hierarchy "Hj".

When comparing terms, the ontology must take the following cases into consideration:

- Partial-Name: different words designate the same meaning but have been pre- or post-fixed. Example: "Tab Product", "Product" and "Product Table".
- Levenshtein Name: there are misspellings, e.g., "Customer" and a misspelled variant such as "Custommer"; here we need to calculate the degree of similarity of the words.
- Synonymous: different words are used for the same thing. Example: "Customer" and "Client".

4.3 The Extension of Simple Matching Based on Ontology

We start this part with an example that clarifies the importance of adding the ontology; then we present the new simple matching dissimilarity measure.

Running example. The simple matching coefficient for categorical data is calculated using formula (1). This coefficient cannot be applied directly to calculate the dissimilarity between two schemas, which is why we propose the algorithm of Fig. 4. It takes the 'Mode' and an 'ORS' as input and produces the simple matching coefficient CoefSM, with:

- CoefD: the number of similar dimensions.
- CoefM: the number of similar measures.
- CoefL: the number of similar level names.

To compute CoefSM we also need MaxD (the maximum number of existing dimensions), MaxM (the maximum number of existing measures) and MaxL (the maximum number of existing level names).

Input: ORS, Mode
Output: CoefSM
Begin
  CoefD = SimilarityFunctionD(ORS, Mode)
  CoefM = SimilarityFunctionM(ORS, Mode)
  CoefL = SimilarityFunctionL(ORS, Mode)
  CoefSM = [(MaxD - CoefD)/MaxD] + [(MaxM - CoefM)/MaxM] + [(MaxL - CoefL)/MaxL]
End

Fig 4: The "Simple Matching" algorithm

We consider the two following schemas, each corresponding to an OLAP requirement schema. The first schema (Fig. 5) is composed of one fact table, "Sales", and four dimensions: "Customer", "Time", "Product" and "Seller". "Customer" contains one key, "Id-Customer", and two attributes, "FN-Customer" (FN = First Name) and "LN-Customer" (LN = Last Name). "Time" is composed of one key, "Id-Time", and four attributes: "Month", "Week", "Day" and "Hour". "Seller" has one key, "Id-Seller", and two attributes, "FN-Seller" and "LN-Seller". "Product" is defined by a key, "Id-Product", and two attributes, "Name-Product" and "Category-Product".

Fig 5: First example of OLAP requirement schema

The second schema (Fig. 6) is composed of one fact table, "Sales", and four dimensions: "Customer", "Salesman", "Date" and "Product". "Customer" contains one key, "Customer-ID", and two attributes, "First Name" and "Last Name". "Salesman" is composed of the key "Salesman-ID" and two attributes, "First Name" and "Last Name". "Date" has one key, "Date-ID", and four attributes: "Month", "Week", "Day" and "Hour". "Product" is defined by one key, "Product-ID", and two attributes, "Product-Name" and "Product-Category".

Fig 6: Second example of OLAP requirement schema

Applying the simple matching of Fig. 4 to calculate the dissimilarity between the two schemas (Fig. 5 and Fig. 6), we get:

CoefD = 2; CoefM = 2; CoefL = 4;
CoefSM(1) = [(4 - 2)/4] + [(2 - 2)/2] + [(14 - 4)/14] = 0.5 + 0 + 0.714 = 1.214

But this coefficient does not reflect reality, since "Salesman" and "Seller" mean the same thing, as do many other pairs such as "FN-Customer" and "First Name", "LN-Customer" and "Last Name", etc. If we take the semantics of the words into consideration we get the following values:

CoefD = 4; CoefM = 2; CoefL = 14;
CoefSM(2) = [(4 - 4)/4] + [(2 - 2)/2] + [(14 - 14)/14] = 0 + 0 + 0 = 0

According to CoefSM(2), the two schemas are similar.
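As a small illustration, the Fig. 4 computation can be transcribed directly into Java; the three similarity counts are assumed to have already been produced by the ontology-aware comparison functions described next, and the method name is ours.

// Fig. 4 as code: each term is the fraction of the mode's dimensions,
// measures and level names that found no counterpart in the compared
// ORS, so 0 means the schemas match completely.
static double coefSM(int coefD, int maxD, int coefM, int maxM,
                     int coefL, int maxL) {
    return (double) (maxD - coefD) / maxD
         + (double) (maxM - coefM) / maxM
         + (double) (maxL - coefL) / maxL;
}

On the running example this reproduces both values: coefSM(2, 4, 2, 2, 4, 14) ≈ 1.214 without the ontology and coefSM(4, 4, 2, 2, 14, 14) = 0 with it.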
The extension of Simple Matching. Based on the example presented above, we can conclude that simple matching as used in other works does not correspond to our need; to improve this coefficient, we propose modifying the three functions "Similarity Function D" (Fig. 7), "Similarity Function M" and "Similarity Function L". The "OntoTerm" function takes into consideration the different cases explained in section 4.2.

Input: Schema, Mode
Output: CoefSD
Begin
  For (each Schema.Dimension)
    If (Mode.Dimension.Equals(OntoTerm(Schema.Dimension)))
      CoefSD++
    End If
  End For
End

Fig 7: "Similarity Function D" algorithm
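The paper does not give the body of OntoTerm, but based on the three cases of section 4.2 a plausible Java sketch looks as follows. The synonym table, which in the paper's setting would be fed by WordNet and the domain ontology, and the Levenshtein threshold are assumptions made purely for illustration.

import java.util.*;

public class OntoTerm {
    // Hypothetical synonym table; in the paper this knowledge would come
    // from WordNet and the domain ontology, not a hard-coded map.
    static final Map<String, Set<String>> SYNONYMS = Map.of(
        "seller", Set.of("salesman", "vendor"),
        "customer", Set.of("client"));

    // True when the two terms should be treated as the same concept.
    static boolean matches(String a, String b) {
        String x = a.toLowerCase().trim(), y = b.toLowerCase().trim();
        if (x.equals(y)) return true;                    // exact match
        if (x.contains(y) || y.contains(x)) return true; // Partial-Name: pre/post-fixed variants
        if (levenshtein(x, y) <= 2) return true;         // Levenshtein Name: misspellings (threshold assumed)
        Set<String> sx = SYNONYMS.getOrDefault(x, Set.of());
        Set<String> sy = SYNONYMS.getOrDefault(y, Set.of());
        return sx.contains(y) || sy.contains(x);         // Synonymous
    }

    // Standard dynamic-programming Levenshtein edit distance.
    static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
        return d[s.length()][t.length()];
    }
}

With such a predicate, "Salesman" and "Seller" match through the synonym table, "Tab Product" and "Product" through the Partial-Name rule, and "Customer" and "Custommer" through the edit-distance rule, which is what turns CoefSM(1) into CoefSM(2) in the running example.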
5. IMPLEMENTATION

In this section, we present the implementation of our system. We used Eclipse as the Java editor, SQL Server 2008, and different libraries such as Jena to deal with the XML files. Fig. 8 presents the structure of the database we use to store the information extracted from the schemas; it contains the tables "Schema", "Dimension", "Level Name", "Measure" and "Mode".

Fig 8: The structure of the tables of our database

Fig. 9 presents the interface of our system. The user starts by specifying the paths of the XML files representing the OLAP requirement schemas. Once he/she validates the selection, our system extracts the different elements, including schemas, dimensions, levels and measures, and stores them in the tables. The existing schemas are displayed in a list from which the user can initialize the modes; the number of selected modes corresponds to the number of clusters. For example, in Fig. 9 the user chooses 3 modes, so k = 3. Once the specification is finished, he/she clicks the "Cluster" button to start the clustering process.

Fig 9: The interface of our application

Fig. 10 presents an example of the result of clustering the schemas. We present the elements as a graph so that the clusters and their content can be seen. Within one cluster we can distinguish the different schemas; for each schema we can determine the dimensions and the measures, and for each dimension we can see the levels.

Fig 10: Example of the result of the clustering

6. CONCLUSION

In this work we proposed AK-Mode, an extension of k-mode used to cluster the schemas extracted from the OLAP requirements. The proposed algorithm integrates an ontology to take the semantic aspect into consideration when comparing the different schemas. The goal is to obtain a set of clusters, each containing schemas belonging to the same domain, which facilitates the construction of the data mart schemas that will in turn be used to build the schema of the data warehouse. As perspectives, we propose the use of a union-based algorithm instead of the frequency-based algorithm to improve the update of the mode, and we propose matching and mapping techniques to ensure the fusion of the schemas within one cluster into the corresponding data mart schema.

REFERENCES

[1] A. Omari, M. B. Lamine, and S. Conrad, "On Using Clustering and Classification During the Design Phase to Build Well-Structured Retail Websites", IADIS European Conference on Data Mining, Amsterdam, The Netherlands, 2008, pp. 51-59.
[2] A. Hotho, S. Staab, and G. Stumme, "WordNet improves Text Document Clustering", in Proceedings of the Semantic Web Workshop at SIGIR-2003, 26th Annual International ACM SIGIR Conference, Toronto, Canada, 2003.
[3] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, 1999, pp. 264-323.
[4] B. Andreopoulos, A. An, and X. Wang, "MULIC: Multi-Layer Increasing Coherence Clustering of Categorical Data Sets", Technical Report CS-2004-07, York University, 2004.
[5] D. Barbara, J. Couto, and Y. Li, "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering", in Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002, pp. 582-589.
[6] D. Chen, D. W. Cui, C. X. Wang, and Z. R. Wang, "A Rough Set-Based Hierarchical Clustering Algorithm for Categorical Data", International Journal of Information Technology, Vol. 12, 2006.
[7] D. Hand, H. Mannila, and P. Smyth, "Principles of Data Mining", MIT Press, Cambridge, MA, 2001.
[8] E. Annoni, F. Ravat, O. Teste, and G. Zurfluh, "Towards Multidimensional Requirement Design", in DaWaK 2006, LNCS, Vol. 4081, 2006, pp. 75-84.
[9] E. Zimányi and E. Malinowski, "Advanced Data Warehouse Design", Springer, 2008.
[10] J. Han, J. Y. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. B. Xia, O. R. Zaiane, S. Zhang, and H. Zhu, "DBMiner: A System for Data Mining in Relational Databases and Data Warehouses", in Proceedings of CASCON'97: Meeting of Minds, Toronto, Canada, 1997.
[11] J. Schiefer, B. List, and R. M. Bruckner, "A Holistic Approach for Managing Requirements of Data Warehouse Systems", in Proceedings of the 8th Americas Conference on Information Systems, 2002.
[12] H. Rezankova, "Cluster Analysis and Categorical Data", Professional Publishing, Vysoka Skola Ekonomicka v Praze, Praha, 2009.
[13] WordNet, http://wordnet.princeton.edu/
[14] L. Jing, L. Zhou, M. K. Ng, and J. Z. Huang, "Ontology-Based Distance Measure for Text Clustering", in Proceedings of the Fourth Workshop on Text Mining, Sixth SIAM International Conference on Data Mining, Bethesda, Maryland, 2006.
[15] M. Batet, A. Valls, and K. Gibert, "Improving Classical Clustering with Ontologies", in Proceedings of IASC08, Japan, 2008.
[16] M. Chen, Q. Zhu, and Z. Chen, "An Integrated Interactive Environment for Knowledge Discovery from Heterogeneous Data Sources", Information and Software Technology, Vol. 43, 2001, pp. 487-496.
[17] M. Dutta, A. K. Mahanta, and A. K. Pujari, "QROCK: A Quick Version of the ROCK Algorithm for Clustering of Categorical Data", Pattern Recognition Letters, 2005, pp. 2364-2373.
[18] M. Gyssens and L. V. S. Lakshmanan, "A Foundation for Multi-Dimensional Databases", in Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), 1997, pp. 106-115.
[19] M. K. Ng, M. J. Li, J. Z. Huang, and Z. He, "On the Impact of Dissimilarity Measure in k-modes Clustering Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, pp. 503-507.
[20] M. Peters and M. J. Zaki, "CLICK: Clustering Categorical Data using K-partite Maximal Cliques", International Conference on Data Engineering (ICDE), 2005.
[21] M. Yan, "Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion", PhD thesis, Blacksburg, Virginia, November 2005.
[22] O. M. San, V. N. Huynh, and Y. Nakamori, "An Alternative Extension of the k-Means Algorithm for Clustering Categorical Data", Journal of Applied Mathematics and Computer Science, 2004, pp. 241-247.
[23] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik, "LIMBO: Scalable Clustering of Categorical Data", in Proceedings of the 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, 2004, pp. 123-146.
[24] Q. Chen, U. Dayal, and M. Hsu, "An OLAP-based Scalable Web Access Analysis Engine", in Proceedings of CASCON'97: Meeting of Minds, Toronto, Canada, 1997.
[25] R. Ben Messaoud, S. Rabaséda, O. Boussaid, and F. Bentayeb, "OpAC: A New OLAP Operator Based on a Data Mining Method", Sixth International Baltic Conference on Databases and Information Systems (DB&IS 04), Riga, Latvia, 2004.
[26] R. Shahid, S. Bertazzon, M. L. Knudtson, and W. A. Ghali, "Comparison of Distance Measures in Spatial Analytical Modeling for Health Service Planning", BMC Health Services Research, 2009.
[27] R. Studer, V. R. Benjamins, and D. Fensel, "Knowledge Engineering: Principles and Methods", IEEE Transactions on Data and Knowledge Engineering, 1998, pp. 161-197.
[28] S. Aranganayagi and K. Thangavel, "Clustering Categorical Data Using Bayesian Concept", International Journal of Computer Theory and Engineering, 2009, pp. 119-125.
[29] S. Goil and A. Choudhary, "High Performance Multidimensional Analysis and Data Mining", in International Database Engineering and Application Symposium, 1999.
[30] S. Goil, "PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining", Journal of Parallel and Distributed Computing, 2001, pp. 285-321.
[31] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Information Systems, Elsevier Science Ltd, UK, 2000, pp. 345-366.
[32] S. Lin and D. E. Brown, "Outlier-based Data Association: Combining OLAP and Data Mining", Technical Report, Department of Systems Engineering, University of Virginia, 2002.
[33] S. Sarawagi, "iDiff: Informative Summarization of Differences in Multidimensional Aggregates", Data Mining and Knowledge Discovery, 2001, pp. 255-276.
[34] V. Alexiev, M. Breu, J. Bruijn, D. Fensel, R. Lara, and H. Lausen, "Information Integration with Ontologies: Experiences from an Industrial Showcase", John Wiley & Sons, Ltd, 2005.
[35] V. Faber, "Clustering and the Continuous k-means Algorithm", Los Alamos Science, 1994, pp. 138-144.
[36] Z. Huang, "A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining", in Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.