Download Knowledge Discovery and Data Mining

Forac Summer School, University Laval, Quebec, Canada, May-June 2004

Data Mining Applications Overview
Elena Irina Neaga
Forac Research Consortium, Laval University, Québec City, Canada
E-mail: [email protected]

Outline
• Foundation
• Motivation: Why Data Mining?
• Definitions
• Current State-of-the-Art
• General Applications
• Examples
• Industry and Business Application Areas
• Selected Algorithms and Methods
• Distributed Data Mining using Intelligent Agents
• Commercial Software Systems
• Methodologies, Projects and Standards
• Main References and Web-Resources
• PolyAnalyst™ software demonstration

Background
"Knowledge is Power" Francis Bacon
• The conventional model to turn data into information and further to knowledge and probably wisdom is defined as follows: data ==> information ==> knowledge ==> wisdom
• Knowledge discovery (KD) and data mining (DM) are interdisciplinary areas based on statistical analysis, database approaches and artificial intelligence (AI), especially machine learning.
• KD and DM incorporate complex algorithms from statistics and AI, including imaginative and intuitive processing.

Data, Information, Knowledge, Wisdom
"Yesterday's Data are today's Information, and tomorrow's Knowledge." I. Spiegler
© Spiegler, I. - Technology and knowledge: bridging a "generating" gap, Information & Management 40 (2003) 533-539, Elsevier Science B.V.

Data is a collection of unanalyzed observations of worldly events.
Information is a summary and communication of the main components and relationships contained within the data and presented within a specific context.
Knowledge is an interrelated collection of procedures for acting toward particular results in the world with associated references for when each is applicable along with its range of effectiveness.
© Pyle, D. "Business Modeling and Data Mining", Morgan Kaufmann, 2003.
Motivation
"We are drowning in information, but starved for knowledge." John Naisbitt
The amount of data generated by applications has increased dramatically, and this data is a valuable source for the discovery of new information and knowledge. The explosion of data has caused a comparable growth in the need to analyze it, and increased computational power now makes feasible analyses that were once too computationally expensive.

Motivation (continued)
"In an economy where the only certainty is uncertainty, the one sure source of lasting competitive advantage is knowledge." Ikujiro Nonaka
• Organizations have huge databases whose data could be a source of new information and knowledge;
• Business and marketing databases potentially constitute a valuable resource for business and market intelligence applications;
• Enterprises also rely on vast amounts of data and information located in large databases. The value of this information can be increased if additional knowledge can be gained from it.

Definition
Knowledge Discovery from Databases (KDD) is the nontrivial process of identifying valid, previously unknown, potentially useful and ultimately understandable patterns in data [Fayyad et al., 1996].
The whole KDD process may include, but is not limited to, the following steps:
• data selection;
• data cleaning;
• data preprocessing, including reduction and transformation;
• data mining for identifying interesting patterns in datasets;
• data interpretation and evaluation;
• application.

Patterns, in the context of knowledge discovery and data mining, are defined as similar structures in a file or a database that are relevant and repetitive.
A model is an abstraction that captures the essential and global aspects of complex real-world systems and/or sub-systems. The model may include the definition of an information structure in order to store, process, analyze and use the associated data.
In the context of DM, the distinction between pattern and model is arbitrary [Hand, 1998].

Discovery vs. Invention
• Discovery Science (DS): the 5th conference was held in Lübeck, Germany, in 2002.
• Knowledge is a topic which belongs to science and philosophy.
• Francis Bacon (1610) stated that knowledge is obtained from experience, and that Nature is ruled by laws and theories which scientists have the main task of discovering and describing through models. According to him, science is an inductive process.
• On the other hand, science may be defined as a process of inventing theories which are checked against experience. This view is illustrated by the invention of non-Euclidean geometry in the 19th century and by the theories of relativity.
• This is still an open debate!

KDD vs. KM
• KM supports knowledge creation; KDD leads to knowledge.
• KM typically deals with the managerial procedures for producing and using knowledge within an organization, such as individual and collective learning and knowledge transfer.
• KDD is focused on automated or semi-automated knowledge generation from raw data, based on machine learning.
• Distinct definitions for KDD and KM are difficult to formulate because of a paradox: knowledge resides in the human mind, yet it may be captured, generated, stored, processed and reported using information technologies.

Polanyi (1962, 1966) defines two types of knowledge that are generally accepted in the field of KM and that some KDD approaches also attempt to consider:
• Tacit knowledge: implicit, mental models, and experiences of individuals.
• Explicit knowledge: formal models, rules, and procedures.
An open debate concerns the relation between human knowledge and computer knowledge approaches such as knowledge discovery, knowledge engineering (acquisition, knowledge-based/expert systems) and some areas of knowledge management.
• Knowledge about the past, which is stable, voluminous and accurate;
• Knowledge about the present, which is unstable, compact and may be inaccurate;
• Knowledge about the future, which is hypothetical.

DM vs. Operations Research
Combining OR and data mining may be very useful in decision-making because:
• A discovered pattern is interesting only to the extent to which it can be used in the decision-making process of an enterprise.
• Generally, OR deals with searching for the best solutions to decision problems using mathematical techniques.
• Optimization solvers may be complemented and refined with data mining algorithms.
• Optimization algorithms are applied to data imported from a DBMS and/or the Internet, but they may also process data from a data warehouse and/or the patterns discovered in the data.
• The potential of applied DM and neural networks for OR is increasing.
• SAS/Operations Research and SAS/Enterprise Miner may be used in the same environment.

Related Definitions
KD and DM are defined in several ways, but from the perspective of computer science the best known definitions are:
• The process of searching, retrieving or visualizing valuable information and new knowledge in large volumes of data.
• The exploration and analysis, by automatic or semi-automatic means, of large quantities of data usually stored in databases.
• The discovery of new correlations, hidden knowledge, unexpected information, patterns and new rules from large databases.
It is also possible to consider DM more as a set of organized activities than as a set of methods of its own, because the main algorithms are borrowed from neighbouring areas such as statistics and/or artificial intelligence.

Related Definitions (continued)
DM is the key element, or the core, of the whole process of Knowledge Discovery in Databases (KDD), and it deals with several processing techniques for data held mainly in large databases and data warehouses.
A data warehouse is a central store of data extracted from operational data.
Cristofor (2002) clearly specifies that there is no restriction on the types of data that can be used as input for DM. The input data can be a relational or object-oriented database, a data warehouse, a web server log or a text file.
DM is associated with large amounts of data, but for research and testing applications the test data sets are of limited size and are usually flat files.

Related Definitions (continued)
Several research projects are inter- or cross-disciplinary with respect to data mining as well as to business, finance, marketing and other areas. These approaches define data mining as follows [Berry, Linoff, 2000], [Berson et al., 2000], [Helberg, 2002], [Pyle, 2003]:
• The process of utilizing the results of data exploration to adjust or enhance business strategies and performance. The information produced by DM engines requires intelligent review by human experts.
• A technique which helps uncover trends in time to make the knowledge actionable.
• Every organization holds data which can describe its past performance through KD and DM.

Related Definitions (continued)
DM finds patterns and relationships in data by using sophisticated techniques to build models. A model is an abstract representation of reality which is useful for understanding and analyzing it in order to make decisions.
There are two main kinds of models in data mining:
• Predictive models can be used to forecast explicit values, based on patterns determined from known results. They could predict financial trends, market evolution, customer behaviour, etc.
• Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups.
General Applications
• Marketing (Direct Marketing, Market Basket Analysis)
• Banking and Finance
• Telecommunication
• Engineering
• Environmental and Molecular Sciences
• Medicine
• Computer/Digital Art

Examples
• Analysis of transactional data stored in a supermarket database in order to improve the way in which products are arranged on shelves.
• Exploring a supermarket database in order to determine patterns in the way people buy: which products they buy together, and at what time.
• Predicting customer demand for a specific product.
• Data analysis of a promotional campaign, e.g. who is most likely to reply to a direct-mail promotional campaign.

Industry and Business Application Areas
• Customer Relationship Management;
• Supply Chain Management;
• Enterprise Resource Planning;
• E-Business and E-Commerce;
• Demand Management (forecasting);
• Etc.

[Figure: the data mining cycle linking enterprise objectives and strategies with enterprise databases. Objectives: improving the quality of products and services; improving business performance; improving the position in the marketplace; improving customer satisfaction and loyalty; etc. Elements shown: data mining processing; data management and selection; data aggregation and integration; data visualisation; data segmentation; data modeling; prediction and forecasting based on new information; presentation and interpretation of the results; knowledge communication; modify the objective. © 1999 Michel Jambu, "Introduction au data mining - Analyse intelligente des données"]

DM using a Data Warehouse
[Figure: databases DB1, DB2, ... and a legacy system feed a data warehouse and central repository; data mining produces models M1, M2, ..., Mn, which are used for retrieving, using and visualizing new information, knowledge and patterns.]

Integrated DM, DW and OLAP
[Figure: a data mining server with data mining tools, a data warehouse and several data marts, fed by CRM, SCM, ERP and other data from extended enterprise databases, alongside DSS tools, an OLAP database, OLAP tools and the enterprise database administrator.]

Data warehousing (DW) is defined as the extraction and integration of data from multiple sources and legacy systems in an effective and efficient manner.
Usually a DW is obtained from operational data, and the information in a DW is subject-oriented, non-volatile, integrated and time-dependent [Adriaans, Zantinge, 1996].
A DW contains large datasets which are organized using metadata, which describes the properties and characteristics of the data and information stored in a central repository. Metadata has become a topic in its own right, dealing with the intensive study of data and its behaviour.
Data marts (DMs) are subsets of data focused on selected subjects; e.g. a marketing data mart may include customer, product and sales information.

Data Mining vs. Statistics
"To statisticians, the data mining conveys the sense of naive hope vainly struggling against the cold realities of chance." D.J. Hand
DM and Statistics do not overlap, and the main differences are presented below [Pyle, 1999]:
• Statistics assumes a pattern and its algorithms attempt to prove it; DM describes a kind of pattern and its algorithms find instances of it;
• DM processes data which is usually given as a database or a large flat file; statistics is applied to small and clean datasets;
• The objective of DM is to find patterns, knowledge and valuable new information in data, whereas in statistical analysis data is processed according to a defined objective of analysis;
• Statistics considers data variation, but this is not considered in DM;
• In DM residual data is useful and is processed, while in statistics it is removed from the original data set.
Data Mining vs. Statistics (continued)
• DM is very much an inductive process, as opposed to the hypothetico-deductive approach often seen as the paradigm for how modern science progresses [Hand, 1998].
• Statistics deals with primary data analysis. DM is entirely concerned with secondary data analysis.
• Classical statistics deals with numeric data. DM is applied to image data, audio data, text data, and geographical data. Mining the web has become a distinct topic.

However, DM applied to several real-world problems, such as supply chain optimization and process and quality control, may not provide solutions beyond the use of statistics, probability theory, evolutionary computation (ANN, fuzzy logic) and operations research.
Several DM algorithms have their roots in statistical analysis.
DM is not new: it joins several mathematical and artificial intelligence problem-solving techniques and methods, usually applied to large amounts of historical data.

Selected Algorithms and Methods
"It is by intuition that we discover and by logic we prove." Henri Poincaré
• Regression;
• Classification;
• Association Rules;
• Clustering;
• Sequential Analysis/Pattern Finding;
• Combined Methods;
• On-Line Analytical Processing (OLAP);
• Others.

Regression
Linear and non-linear regression are widely used for correlating data. Statistical regression requires the specification of a function to which the data is fitted. In order to specify the function, it is necessary to know the form of the equations governing the correlation for a data set [Wang, 1999].
Even though regression is considered to be a statistical technique, the distinction is arbitrary because DM deals with predictive modeling, and regression does exactly the same [Berry, Linoff, 2000].
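As a minimal sketch of the simplest case, the function specified in advance can be a straight line fitted by ordinary least squares. The data (advertising spend against units sold) and the names fit_line and predict are hypothetical and chosen only for illustration; they do not come from the cited sources.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: advertising spend and units sold (exactly 2 * x + 1).
spend = [1.0, 2.0, 3.0, 4.0]
sold = [3.0, 5.0, 7.0, 9.0]

model = fit_line(spend, sold)
forecast = predict(model, 5.0)
print(model, forecast)
```

A non-linear regression would follow the same pattern with a different function family, which is why the form of the governing equations has to be known beforehand.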
There are many applications of regression, for example predicting customer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable.

Classification
Classification, also known as segmentation, is the process of examining known groups of data to determine which characteristics can be used to identify (or predict) group membership.
Examples include the classification of trends in financial markets, grouping customers based on their past transactions, and predicting their response to a particular product promotion [Fayyad et al., 1996], [Helberg, 2002].

Association Rules
Association rules were introduced by R. Agrawal, T. Imielinski and A. Swami in 1993, and the most widely used algorithm, Apriori, was introduced by R. Agrawal and R. Srikant in 1994.
The basic idea of association rules is to search the data for patterns of the following form:
IF (some conditions are true) THEN (some other conditions are probably true)
Each such pattern extracted from the data is called an association rule, or simply a rule.
Association rules generate rule-based models.

Association Rules (continued)
Association rules have two main characteristics that measure their value:
• Coverage describes how much evidence there is in the training data set to back up the rule. It ranges between 0 and 1 (0% and 100%).
• Confidence describes how likely the rule is to give a correct prediction. It also ranges between 0 and 1 (0% and 100%).
In addition, association rule algorithms use the support of a rule, which is the number of records or transactions that confirm the rule [Cristofor, 2002].
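The three measures can be sketched on a toy transaction list as follows. This is only an illustration: the transactions and function names are hypothetical, and coverage is assumed here to mean the fraction of transactions that satisfy the IF part of the rule, which is one common reading of the definition above.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def rule_metrics(transactions, antecedent, consequent):
    """Coverage, confidence and support of the rule IF antecedent THEN consequent."""
    n = len(transactions)
    matches_if = support_count(transactions, antecedent)
    matches_rule = support_count(transactions, set(antecedent) | set(consequent))
    # Assumption: coverage = share of transactions matching the IF part.
    coverage = matches_if / n
    # Confidence = share of those transactions that also match the THEN part.
    confidence = matches_rule / matches_if if matches_if else 0.0
    # Support = number of transactions that confirm the whole rule.
    return coverage, confidence, matches_rule


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

coverage, confidence, support = rule_metrics(baskets, {"bread"}, {"milk"})
print(coverage, confidence, support)
```

For the rule IF bread THEN milk, three of the four baskets contain bread and two of those also contain milk, so the rule is confirmed by two transactions.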
Association Rules (continued)
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items.
Let $D$ be a database, usually of transactions, where each transaction $T \subseteq I$.
For a given itemset (a non-empty set of items) $X \subseteq I$ and a given transaction $T$, if $X \subseteq T$ then $T$ contains $X$.
The support count $\sigma_X$ of an itemset $X$ is the number of transactions in $D$ that contain $X$, and $X$ is a large itemset with respect to a support $s$ if $\sigma_X \geq s \times |D|$, where $|D|$ is the number of transactions in $D$.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$.
The association rule $X \Rightarrow Y$ has the confidence $c$ if $\sigma_{X \cup Y} / \sigma_X = c$.
The rule $X \Rightarrow Y$ has the support $s$ in $D$ if $\sigma_{X \cup Y} = s \times |D|$.
Thus, if $s$ is the given support, mining association rules starts by finding the set $L = \{X \mid X \subseteq I \wedge \sigma_X \geq s \times |D|\}$.

Clustering
Clustering, like segmentation, identifies groups of similar cases, but it does not predict outcomes or target categories [Helberg, 2002].
Clustering algorithms are also called unsupervised classification, and they group a set of physical or abstract objects into classes of similar objects.
Clustering analysis supports the construction of meaningful partitions of a large set of objects based on the divide-and-conquer methodology, which decomposes a large-scale system into smaller components to simplify design and implementation.
An example is identifying customers that would make good targets for a new product marketing promotion.

Clustering (continued)
The clustering methods are divided into:
• Hierarchical clustering, which combines cases and clusters that are similar to each other, one pair at a time.
• K-Means clustering, which is based on the assumption that the data falls into a known number (K) of clusters. This method starts by defining initial profiles, called cluster centers, for the K clusters, sometimes using random values for the clustering characteristics and sometimes using dissimilar cases from the data set.
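The K-Means procedure can be sketched as below. This is a minimal illustration rather than a reference implementation: it assumes squared Euclidean distance, takes the initial cluster centers as an argument (here two dissimilar cases from a hypothetical data set), and repeats the assign-and-update loop until the centers stop changing.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and center update until the centers are stable."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical two-dimensional cases forming two visible groups.
cases = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = k_means(cases, initial_centers=[(1.0, 1.0), (9.0, 8.0)])
print(centers)
print(clusters)
```

The result depends on the initial centers, which is why the choice between random values and dissimilar cases matters in practice.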
K-Means Clustering
In the K-Means algorithm, each object $x_i$ is assigned to a cluster $j$ according to its distance $d(x_i, m_j)$ from a value $m_j$ representing the cluster itself. $m_j$ is called the representative of the cluster.
Given a set of objects $D = \{x_1, \ldots, x_n\}$, a clustering problem is to find a partition $C = \{C_1, \ldots, C_k\}$ of $D$ such that:
1. Each $C_i$ is associated with a representative $m_i$;
2. $x_i \in C_j$ if $d(x_i, m_j) \leq d(x_i, m_l)$ for $1 \leq l \leq k$, $j \neq l$;
3. The partition $C$ minimizes $\sum_{i=1}^{k} \sum_{x_j \in C_i} d^2(x_j, m_i)$.

Sequential Patterns
Sequential patterns are part of sequential analysis.
The main goal of this algorithm is to find all sequential patterns that reach a pre-defined minimum support in the data sequences.
The input data is a list of sequential transactions, and there is often an associated transaction time.

Combined Methods
• Combination of different algorithms for the knowledge extraction process, based on rules together with neural networks (NN) and Case-Based Reasoning (CBR):
  - CBR is the process of acquiring knowledge represented by cases, using reasoning by analogy.
  - NNs are computer models based on the architecture of the human brain, consisting of multiple simple processing units connected by adaptive weights.
• Combination of clustering and neural networks (NN);
• Combination of classification and NN.

Combined Methods (continued)
[Figure: Knowledge Extraction by Neural Networks and Rules. A database (DB) feeds NN model generation, which yields NN models, and rule-based model generation, which yields a rule base; both support knowledge extraction.]

On-line Analytical Processing
OLAP and DM are considered to be two complementary techniques for analyzing large amounts of data in databases and/or data warehousing environments.
OLAP is a way of performing multi-dimensional analysis on relational databases.
DM is more powerful than OLAP because it goes beyond the multi-dimensional processing of a database: new knowledge and hidden information can be extracted through DM.
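To make the idea of multi-dimensional analysis concrete, the sketch below aggregates a measure over chosen dimensions, the operation usually called a roll-up. It is an in-memory illustration only, not how an OLAP server works internally; the fact rows, the sales measure and the function name roll_up are hypothetical.

```python
from collections import defaultdict


def roll_up(facts, dimensions, measure):
    """Sum `measure` over all fact rows that share the same values
    for the chosen `dimensions`."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical fact rows with three dimensions and one measure.
facts = [
    {"city": "London", "product": "yy", "year": 2002, "sales": 100.0},
    {"city": "London", "product": "zz", "year": 2002, "sales": 50.0},
    {"city": "Paris", "product": "yy", "year": 2002, "sales": 70.0},
    {"city": "London", "product": "yy", "year": 2003, "sales": 30.0},
]

by_city = roll_up(facts, ["city"], "sales")
by_city_year = roll_up(facts, ["city", "year"], "sales")
print(by_city)
print(by_city_year)
```

Such a roll-up only summarizes what is already stored along known dimensions, whereas DM aims to extract patterns that were not specified in advance.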
A multi-dimensional representation related to a product family is shown in the following figure.

OLAP (continued)
[Figure: a multi-dimensional record and its dimension attributes: City = London, Company = xx, Product = yy, Category = aa, Industry = Food, Year = 2002, Profit = 56%.]

Distributed Data Mining using Intelligent Agents
Intelligent agents support distributed and collaborative KD&DM systems:
• Each agent is responsible for a different step in the KD&DM process, such as pre-processing, DM, and evaluation of the results;
• Agents that specialize in a pre-determined task can use the services of other agents, e.g. a classification agent uses the services of a pre-processing agent;
• The agents usually interact through a communication language or messages;
• The cooperative DM agents run concurrently, and they can be driven by an agent manager;
• The mining agent systems can be flexibly integrated with other agent systems.

XLMiner™
• It is an extension of Microsoft Excel™;
• It helps to start DM quickly on spreadsheets and Excel files;
• It has extensive coverage of statistical and machine learning techniques for classification, prediction, affinity analysis, data exploration and reduction.

SAS Enterprise Miner™
• It is supported by the SEMMA (sampling, exploration, modification, modeling and assessment) methodology;
• It combines data warehousing, data mining and OLAP technologies;
• It defines a comprehensive solution that addresses the whole KDD process;
• It integrates advanced models and algorithms including clustering, decision trees, neural networks, memory-based reasoning, linear and logistic regression and associations;
• It also provides powerful statistical analysis capabilities;
• It uses advanced modeling techniques;
• It generates code in the SAS internal language as well as C and Java;
• It has a component for text mining.
SAS Enterprise Miner™ (continued)
It has been successfully used for a wide range of CRM and e-commerce applications such as:
• direct-mail, telephone, e-mail and Internet-delivered promotion campaigns;
• customer profiling;
• identifying the most profitable customers and the underlying reasons for their loyalty;
• identifying fraudulent behaviour on an e-commerce site.
Its GUI makes it very easy to use. A business analyst with little statistical expertise can quickly and easily navigate through the SEMMA process, while data mining experts can examine the analytical process in depth.

SPSS Clementine™
• It is a DM workbench that makes it possible to quickly develop predictive models and deploy them into business operations to improve decision making;
• It delivers the maximum return on investment in the minimum amount of time;
• It supports the entire DM process to shorten time-to-solution;
• It is designed around the de facto industry standard methodology, the CRoss-Industry Standard Process for Data Mining (CRISP-DM);
• It uses Clementine Application Templates (CATs), which follow the CRISP-DM methodology and draw on previous real-world application experience so that a new project benefits from a proven methodology and best practices.

SEMMA Methodology
The SEMMA (Sample, Explore, Modify, Model, Assess) methodology was elaborated by SAS Institute Inc. and is applied successfully with SAS Enterprise Miner™. The steps of this methodology are as follows:
• Sample the data by extracting a portion of a large data set that contains enough significant information but is small enough to be manipulated quickly.
• Explore the data by searching for unanticipated trends and anomalies in order to understand the data set.
• Modify the data by creating, selecting and transforming the variables to focus the model selection process.
• Model the data by allowing the system to search automatically for a combination of data that reliably predicts a desired outcome.
• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.

Projects and Standards
[Figure: Overview of Main Projects and Standards: CWM, JDM, PMML, CRISP, SOLEUNET, SQL/MM, KDE, OLE DB for DM.]

Projects & Standards (continued)
• ISO: SQL/MM is a collection of SQL user-defined types and routines to define and apply DM models.
• DM Group: Predictive Model Markup Language (PMML) is an open, XML-based standard for exchanging DM models between applications.
• OMG: Common Warehouse Metamodel (CWM) is a Unified Modeling Language/XML specification for DM metadata.
• Microsoft: OLE DB for DM is a major step toward the standardization of DM primitives, and it defines a DM object model for relational databases.
• Oracle9i DM is an extension to Oracle9i Database Enterprise Edition that embeds DM algorithms for classification, prediction and association rules. All models and functions are accessible through Java-based Application Programming Interfaces called Java Data Mining (JDM).

Projects & Standards (continued)
• CRISP-DM is a project which has defined and validated a standard DM process that is applicable in diverse industry sectors, and it attempts to make any DM project faster, cheaper, more reliable and more manageable.
• SolEuNet has the main aim of applying DM and Decision Support (DS) systems in order to enhance the efficiency, effectiveness and quality of operations in business and industry. A virtual enterprise model has been proposed as a dynamic problem-solving link between advanced DM and DS systems.
• The Kensington Enterprise DM project (Imperial College, Dept. of Computing, London, UK) has developed Kensington Discovery Edition (KDE), an enterprise-wide platform that supports the entire KD process, including dynamic information integration, knowledge discovery and management.
CRISP-DM: Cross-Industry Standard Process for DM
[Figure: CRISP-DM Project Description [Helberg, 2002]]

Defining a DM Project
"Make it as simple as possible, but no simpler." Albert Einstein
[Figure: Essential Activities in a Data Mining Project: project definition; data identification and experimental design; data pre-processing; data mining; evaluation of results; factors affecting the adoption of DW and DM.]

Main References
Adriaans, P., Zantinge, D. "Data Mining", Addison-Wesley, 1996.
Berry, M., Linoff, G.S. "Mastering Data Mining: The Art and Science of Customer Relationship Management", John Wiley & Sons Inc., 2000.
Berson et al. "Building Data Mining Applications for CRM", McGraw-Hill, USA, 2000.
Bramer, M.A. (editor) "Knowledge Discovery and Data Mining", IEE, 1999.
Chen, Z. "An integrated architecture for OLAP and data mining", in Knowledge Discovery and Data Mining, Bramer, M.A. (editor), IEE, 1999.
Cristofor, L. "Mining Rules in Single-table, and Multiple-table Databases", PhD Thesis, CS Dept. of Univ. of Massachusetts, Boston, USA, 2002.
Goglin, J.F. "La construction du datawarehouse du datamart au dataweb", 2e édition revue, Hermes Science Publication, Paris, 1998, 2001.

Main References (continued)
Fayyad et al. (eds) "Advances in Knowledge Discovery and Data Mining", AAAI Press/The MIT Press, 1996.
Han, J., Kamber, M. "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
Hand, D.J. "Data Mining: Statistics and More", The American Statistician, Vol. 52, No. 2, 1998.
Helberg, C. "Data Mining with Confidence", 2nd edition, SPSS Inc., 2002.
Jambu, M. "Introduction au data mining - Analyse intelligente des données", Eyrolles, Paris, 1999.
Lange, S., Satoh, K., Smith, C.H. (eds.) "Discovery Science, 5th International Conference, DS2002, Lübeck, Germany, Proceedings", Berlin: Springer-Verlag, 2002.
Klosgen, W., Zytkow, J.M. (editors) "Handbook of Data Mining and Knowledge Discovery", Oxford University Press, 2002.
Pyle, D. "Data Preparation for Data Mining", Morgan Kaufmann, 1999.
Pyle, D. "Business Modeling and Data Mining", Morgan Kaufmann, 2003.

Web-Resources
• http://www.dmreview.com/
• http://www.andypryke.com/university/sites.html
• http://www.modelandmine.com
• http://www.kdnuggets.com/
• http://www.sas.com/technologies/analytics/data mining/miner/index.html
• http://www.sas.com/operationsresearch
• http://www.spss.com/spssbi/clementine/

Web-Resources (continued)
• http://www.thearling.com/dmintro/dmintro.htm
• http://www.crisp-dm.org/
• http://soleunet.ijs.si/website/html/euproject.html
• http://www.dmg.org
• http://kmcenter.free.fr
• http://www.megaputer.com

"Discovery consists of seeing what everybody has seen and thinking what nobody has thought." Albert von Szent-Gyorgyi

THANK YOU
