Data mining and the knowledge discovery process
Institute for Knowledge and Agent Technology, MICC, Universiteit Maastricht
Summer Course 2007
H.H.L.M. Donkers

Content
• Opening / acquaintance
• What is data mining
• Data mining methodology
• Course perspective
• Course contents

Data - Information - Knowledge
• Data: symbols
• Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
• Knowledge: application of data and information; answers "how" questions
• Understanding: appreciation of "why"
• Wisdom: evaluated understanding
(Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)

What is Data Mining – Traditionally
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
Witten & Frank (2000). Data Mining.

“The application of specific algorithms for extracting patterns from data, it is a part of knowledge discovery from databases”
Fayyad (1997). From data mining to knowledge discovery in databases.

“Data mining is a process, not just a series of statistical analyses.”
SAS Institute (2003). Finding the solution to data mining.
What is Data Mining – Traditionally
Computer Science
• (Semi-)automated application of algorithms for pattern discovery
• Algorithms developed in the field of Artificial Intelligence (machine learning)
• Part of the process of knowledge discovery

Data mining = Statistics + Marketing

Statistics
• Process of discovering patterns in data
• (Manual) application of a series of statistical techniques (among which machine learning)
• Incorporates
  – Exploration
  – Sampling
  – Modeling
  – Validation

What is Data Mining – A Fusion
“An analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal is prediction.”
Statsoft (2003). Data Mining Techniques.

“An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results.”
Rudjer Boskovic Institute (2001). DMS Tutorial.
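The Statsoft definition above describes detecting patterns on one part of the data and then validating them on new subsets. A minimal sketch of that idea in Python follows. It assumes a made-up data set of (value, label) pairs and a deliberately trivial pattern, a single threshold rule. The names split, learn_threshold and accuracy are illustrative only and do not come from the course material or from any tool mentioned in it.

```python
import random

def split(records, n_test, seed=0):
    """Shuffle a copy of the records and hold out n_test of them.

    Returns (train, test). The fixed seed only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(records, threshold):
    """Fraction of records for which the rule 'x >= threshold' matches the label."""
    hits = sum((x >= threshold) == label for x, label in records)
    return hits / len(records)

def learn_threshold(train):
    """Pick the threshold (among observed values) that best fits the training subset."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(train, t))

# Hypothetical toy data: the label is True exactly when the value is at least 50.
data = [(x, x >= 50) for x in range(0, 100, 5)]

train, test = split(data, n_test=6)       # explore and model on one subset ...
threshold = learn_threshold(train)
train_acc = accuracy(train, threshold)
test_acc = accuracy(test, threshold)      # ... validate on data not used for learning

print(f"threshold={threshold} train={train_acc:.2f} test={test_acc:.2f}")
```

Accuracy on the held-out subset can be lower than on the training subset; exposing that gap is the purpose of the validation step.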
Data Mining in this Course
We use the book of Witten & Frank
• Computer science (machine learning) approach
Emphasis on algorithms for pattern discovery and rule extraction
  – What are the underlying models
  – What are the properties of the algorithms
  – When to use (for which tasks)
  – How to apply and to tune
  – How to interpret and assess the results

Data Mining Process
• These algorithms are only part of a process that computer scientists call Knowledge Discovery and statisticians call Data Mining
• The process starts with the recognition of a problem and ends with the control of a deployed solution
• The whole process needs to be supported for a successful application

Methodologies for Data Mining
As Data Mining is coming of age, several methodologies have been developed, each with its own perspective. We will discuss three of them:
• Fayyad et al. (Computer science)
  – E.g., WEKA
• SEMMA (SAS) (Statistics)
  – SAS Enterprise Miner, R
• CRISP-DM (SPSS, OHRA, a.o.) (Business)
  – SPSS Clementine

Fayyad’s KDD Methodology
[Figure: process diagram] data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge

SEMMA Methodology
Supported by SAS Enterprise Miner environment
• SAMPLE: Input data, Sampling, Data partition
• EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection
• MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen
• MODEL: Regression, Tree, Neural Network, Ensemble
• ASSESS: Assessment, Score, Report

CRISP-DM Methodology
• Developed by data-mining companies (SPSS, NCR, OHRA, DaimlerChrysler), funded by the European Commission
• Tool-independent / industry-independent
• Hierarchical process model
  – 1 Generic phases
  – 2 Generic tasks
  – 3 Specific tasks
  – 4 Task instances
• Supported by SPSS Clementine environment

CRISP-DM Methodology: tasks per phase
• Business understanding: Business objective, Assess situation, Data mining goals, Project plan
• Data understanding: Collect data, Describe data, Explore data, Verify data quality
• Data preparation: Select data, Clean data, Construct data, Integrate data, Format data
• Modeling: Select modeling techniques, Design the test, Build model, Assess model
• Evaluation: Evaluate results, Review process, Determine next steps
• Deployment: Plan deployment, Plan monitoring and maintenance, Final report, Review project

A Comparison
[Figure: the Fayyad KDD steps, the CRISP-DM phases and the SEMMA stages shown side by side]

A Small Poll (July 2002)
Which DM Methodology do you use?
[Figure: bar chart; categories: None, Other, My own, My organisation's, SEMMA, CRISP-DM; horizontal axis 0 to 100]
Source: http://www.kdnuggets.com/polls/2002/methodology.htm

Poll repeated (2004)
Which DM Methodology do you use?
[Figure: bar chart; categories: None, Other, My own, My organisation's, SEMMA, CRISP-DM; horizontal axis 0 to 80]
Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm

Course perspective and goal
• The perspective is from computer science (machine learning): Fayyad’s approach
• The emphasis is on techniques for the automated discovery of patterns in data and the automated extraction of rules (the model phase of SEMMA and CRISP)
• The goal is to get acquainted with these techniques, so you can use them in the methodology of your choice

Course contents
Data preparation (Tuesday)
• Selection, preprocessing, transformation
Techniques, algorithms and models
• Decision trees (Monday)
• Instance based and Bayesian learning (Wednesday)
• Neural networks (Wednesday)
• Association rules (Thursday)
• Clustering (Thursday)
• Support Vector Machines (Friday)
Evaluation of learned models (Tuesday)

Course contents
For each technique you learn
• For which tasks it is suitable
  – Classification, rules, prediction, …
  – Restrictions on input data (numerical, symbolic, etc.)
• What algorithms are available
• What parameters should be tuned
• How to interpret the results
• How to evaluate the model
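The course takes Fayyad's perspective, in which the data mining algorithm is one step in a chain: selection, preprocessing, transformation, data mining, and interpretation. A rough sketch of that chain in Python is given below. Everything in it is an assumption made for illustration: the toy records, the field names, the age cut-off of 40, and the "majority outcome per group" pattern, which stands in for the real algorithms treated during the week.

```python
from collections import Counter

def select(records, fields):
    """Selection: keep only the fields relevant to the question (target data)."""
    return [{f: r.get(f) for f in fields} for r in records]

def preprocess(records):
    """Preprocessing and cleaning: here, simply drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: turn numeric age into a symbolic band (assumed cut-off: 40)."""
    return [{**r, "age": "young" if r["age"] < 40 else "old"} for r in records]

def mine(records):
    """Data mining: a stand-in pattern finder, the majority outcome per age band."""
    counts = {}
    for r in records:
        counts.setdefault(r["age"], Counter())[r["buys"]] += 1
    return {band: c.most_common(1)[0][0] for band, c in counts.items()}

def interpret(patterns):
    """Interpretation: present the patterns as readable rules."""
    return [f"IF age = {band} THEN buys = {outcome}"
            for band, outcome in sorted(patterns.items())]

# Hypothetical raw data; record "e" has a missing age and is removed during cleaning.
raw = [
    {"name": "a", "age": 25, "buys": "yes"},
    {"name": "b", "age": 31, "buys": "yes"},
    {"name": "c", "age": 38, "buys": "no"},
    {"name": "d", "age": 52, "buys": "no"},
    {"name": "e", "age": None, "buys": "yes"},
    {"name": "f", "age": 61, "buys": "no"},
    {"name": "g", "age": 47, "buys": "yes"},
]

target = select(raw, ["age", "buys"])
cleaned = preprocess(target)
transformed = transform(cleaned)
patterns = mine(transformed)
rules = interpret(patterns)

for rule in rules:
    print(rule)
```

Keeping each step as a separate function mirrors the point made earlier in the slides that the mining algorithm is only one part of the process; any of the steps can be swapped out without touching the others.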