Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
IGI PUBLISHING ITB13885 701 E.Peleks, Chocolate&Avenue, Suite 200, Hershey PA 17033-1240, USA Ntouts, Theodords Tel: 717/533-8845; Fax 717/533-8661; URL-http://www.igi-pub.com This paper appears in the publication, Research and Trends in Data Mining Technologies and Applications edited by D. Taniar © 2007, IGI Global Chapter.IV Pattern.Comparison.in. Data.Mining: A.Survey Irene Ntouts, Unversty of Praeus, Greece Nkos Peleks, Unversty of Praeus, Greece Yanns Theodords, Unversty of Praeus, Greece Abstract Many patterns are available nowadays due to the widespread use of knowledge discovery in databases (KDD), as a result of the overwhelming amount of data. This “flood” of patterns imposes new challenges regarding their management. Pattern comparison, which aims at evaluating how close to each other two patterns are, is one of these challenges resulting in a variety of applications. In this chapter we investigate issues regarding the pattern comparison problem and present an overview of the work performed so far in this domain. Due to heterogeneity of data mining patterns, we focus on the most popular pattern types, namely frequent itemsets and association rules, clusters and clusterings, and decision trees. Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. Pattern Comparson n Data Mnng: A Survey Introduction Nowadays a large quantity of raw data is collected from different application domains (business, science, telecommunication, health care systems, etc.). According to Lyman and Varian (2003), “The world produces between 1 and 2 exabytes of unique information per year, which is roughly 250 megabytes for every man, woman, and child on earth”. Due to their quantity and complexity, it is impossible for humans to thoroughly investigate these data collections directly. Knowledge discovery in databases (KDD) and data mining (DM) provide a solution to this problem by generating compact and rich semantics representations of raw data, called patterns (Rizzi et al, 2003). With roots in machine learning, statistics, and pattern recognition, KKD aims at extracting valid, novel, potentially useful, and ultimately understandable patterns from data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Several pattern types exist in the literature mainly due to the wide heterogeneity of data and the different techniques for pattern extraction as a result of the different goals that a mining process tries to achieve (i.e., what data characteristics the mining process highlights). Frequent itemsets (and their extension, association rules), clusters (and their grouping, clusterings), and decision trees are among the most well-known pattern types in data mining. Due to the current spreading of DM technology, even the amount of patterns extracted from heterogeneous data sources is large and hard to be managed by humans. Of course, patterns do not raise from the DM field only; signal processing, information retrieval, and mathematics are among the fields that also “yield” patterns. The new reality imposes new challenges and requirements regarding the management of patterns in correspondence to the management of traditional raw data. These requirements have been recognized by both the academic and the industrial parts that try to deal with the problem of efficient and effective pattern management (Catania & Maddalena, 2006), including, among others, modeling, querying, indexing, and visualization issues. Among the several interesting operations on patterns, one of the most important is that of comparisonthat is, evaluating how similar two patterns are. As an application example, consider a supermarket that is interested in discovering changes in its customers’ behavior over the last two months. For the supermarket owner, it is probably more important to discover what has changed over time in its customers’ behavior rather than to preview some more association rules on this topic. This is the case in general: the more familiar an expert becomes with data mining, the more interesting it becomes for her to discover changes rather than already-known patterns. A similar example also stands in the case of a distributed data mining environment, where one might be interested in discovering what differentiates the distributed branches with respect to each other or, in grouping together branches of similar patterns. From the latter, another application of similarity arises, that of exploiting similarity between patterns for meta-pattern managementthat is, Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited. 33 more pages are available in the full version of this document, which may be purchased using the "Add to Cart" button on the publisher's webpage: www.igi-global.com/chapter/pattern-comparison-datamining/28422 Related Content Comparison of Linguistic Summaries and Fuzzy Functional Dependencies Related to Data Mining Miroslav Hudec, Miljan Vueti and Mirko Vujoševi (2014). Biologically-Inspired Techniques for Knowledge Discovery and Data Mining (pp. 174-203). www.irma-international.org/chapter/comparison-of-linguistic-summaries-andfuzzy-functional-dependencies-related-to-data-mining/110459/ Big Data Management in Financial Services Javier Vidal-García and Marta Vidal (2016). Effective Big Data Management and Opportunities for Implementation (pp. 217-230). www.irma-international.org/chapter/big-data-management-in-financialservices/157694/ Algebraic and Graphic Languages for OLAP Manipulations Ravat Franck, Teste Olivier, Tournier Ronan and Zurfluh Gilles (2010). Strategic Advancements in Utilizing Data Mining and Warehousing Technologies: New Concepts and Developments (pp. 60-90). www.irma-international.org/chapter/algebraic-graphic-languages-olapmanipulations/40398/ An Efficient Method for Discretizing Continuous Attributes Kelley M. Engle and Aryya Gangopadhyay (2012). Exploring Advances in Interdisciplinary Data Mining and Analytics: New Trends (pp. 69-90). www.irma-international.org/chapter/efficient-method-discretizing-continuousattributes/61169/ A Novel Multi-Secret Sharing Approach for Secure Data Warehousing and OnLine Analysis Processing in the Cloud Varunya Attasena, Nouria Harbi and Jérôme Darmont (2015). International Journal of Data Warehousing and Mining (pp. 22-43). www.irma-international.org/article/a-novel-multi-secret-sharing-approach-forsecure-data-warehousing-and-on-line-analysis-processing-in-the-cloud/125649/