Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science
European Social Fund, Prague & EU: Investing in your future

MI-PDD – Data Preprocessing module (2010/2011)
Lecture 1: Introduction, CRISP-DM, Visualization
Pavel Kordík, FIT, Czech Technical University in Prague

Module organization
- 13 lectures, 6 exercises
- Exercises (min 30/60 b): test (20 b), semestral project (40 b)
- Exam (min 20/40 b): test (40 b), optional examination (+5 b max)

Knowledge engineering master specialization (module, dimension, completion, type of module, lecturer, recommended year):
- Statistics for Informatics (MIE-SPI): 4+1, z,zk, p-sz, prof. Blažek, year 1
- Parallel Computer Architectures (MIE-PAR): 3+1, z,zk, p-sz, prof. Tvrdík, year 1
- Systems Theory (MIE-TES): 2+1, z,zk, p-sz, prof. Moos, year 1
- Data Preprocessing (MIE-PDD): 2+1, z,zk, pv-ob, Kordík, Ph.D., year 1
- Pattern Recognition (MIE-ROZ): 2+1, z,zk, pv-ob, doc. Haindl, year 1
- Cybernality (MIE-KYB): 2+0, zk, p-hu, doc. Jirovský, year 1
- Mathematics for Informatics (MIE-MPI): 4+1, z,zk, p-sz, doc. Šolcová, year 1
- Functional and Logical Programming (MIE-FLP): 2+1, z,zk, pv-ob, Janoušek, Ph.D., year 1
- Advanced Database Systems (MIE-PDB): 2+1, z,zk, pv-ob, Valenta, Ph.D., year 1
- Advanced Information Systems (MIE-PIS): 2+1, z,zk, pv-ob, prof. Mišovič, year 1
- Project Management (MIE-PRM): 1+2, z, p-em, Vala, year 1
- two elective modules: 2+1, z,zk, v, year 1
- Problems and Algorithms (MIE-PAA): 3+1, z,zk, pv-ob, Schmidt, Ph.D., year 2
- Computational Intelligence Methods (MIE-MVI): 2+1, z,zk, pv-ob, Kordík, Ph.D., year 2
- Knowledge Discovery from Databases (MIE-KDD): 2+1, z,zk, pv-ob, doc. Rauch, year 2
- Master Project (MIE-MPR): z, p-pr, year 2
- Information Security (MIE-IBE): 2+0, zk, p-em, Čermák, CSc., year 2
- IT Support to Business and CIO Role (MIE-CIO): 3+0, zk, p-em, prof. Dohnal, year 2
- obligatory humanity module: zk, pv-hu, year 2
- elective modules: 2+1, z,zk, v, year 2
- Master Thesis (MIE-DIP): z, p-pr, year 2

Recommended elective modules: get specialized! Even more modules are available.

Lectures program
- Introduction, CRISP-DM, visualization.
- Data exploration, exploratory analysis techniques, descriptive statistics.
- Methods to determine the relevance of features.
- Problems with data: dimensionality, noise, outliers, inconsistency, missing values, non-numeric data.
- Data cleaning, transformation, imputing, discretization, binning.
- Reduction of data dimension.
- Reduction of data volume, class balancing.
- Feature extraction from text.
- Feature extraction from documents and the web.
- Preprocessing of structured data.
- Feature extraction from time series.
- Feature extraction from images.
- Data preparation case studies.
- Automation of data preprocessing.

Software
- Pentaho Data Integration (Kettle)
- Matlab
- FAKE GAME

Books
- Dorian Pyle: Data Preparation for Data Mining. Morgan Kaufmann, 1999.
- Mamdouh Refaat: Data Preparation for Data Mining Using SAS. Morgan Kaufmann, 2006.
- Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning. Wiley, 2003.

Cross-Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM slides adapted from

CRISP-DM: phases and tasks (in the curriculum, Business Understanding and Evaluation are covered by MI-KDD, Data Understanding and Data Preparation by MI-PDD, and Modeling by MI-ROZ and MI-MVI)
- Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
- Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
- Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
- Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
- Evaluation: Evaluate Results; Review Process; Determine Next Steps
- Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

Phase 1.
Business Understanding
- Statement of business objective
- Statement of data mining objective
- Statement of success criteria
Focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Determine business objectives
- thoroughly understand, from a business perspective, what the client really wants to accomplish
- uncover, at the beginning, important factors that can influence the outcome of the project
- neglecting this step means expending a great deal of effort producing the right answers to the wrong questions

Assess situation
- carry out more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered
- flesh out the details

Determine data mining goals
- a business goal states objectives in business terminology; a data mining goal states project objectives in technical terms
- example of a business goal: "Increase catalog sales to existing customers."
- corresponding data mining goal: "Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city) and the price of the item."

Produce project plan
- describe the intended plan for achieving the data mining goals, and thereby the business goals
- the plan should specify the anticipated set of steps to be performed during the rest of the project, including an initial selection of tools and techniques

Phase 2.
Data Understanding
- Explore the data, verify its quality, find outliers.
Starts with an initial data collection and proceeds with activities aimed at getting familiar with the data, identifying data quality problems, discovering first insights into the data, and detecting interesting subsets that form hypotheses about hidden information.

Collect initial data
- acquire the data listed in the project resources
- includes data loading if necessary for data understanding
- possibly leads to initial data preparation steps
- if multiple data sources are acquired, integration is an additional issue, either here or in the later data preparation phase

Describe data
- examine the "gross" or "surface" properties of the acquired data
- report on the results

Explore data
- tackles data mining questions that can be addressed using querying, visualization and reporting, including: distribution of key attributes, results of simple aggregations, relations between pairs or small numbers of attributes, properties of significant sub-populations, and simple statistical analyses
- may address the data mining goals directly
- may contribute to or refine the data description and quality reports
- may feed into the transformation and other data preparation steps needed

Verify data quality
- examine the quality of the data, addressing questions such as: "Is the data complete?", "Are there missing values in the data?"

Phase 3. Data Preparation
Usually takes over 70% of the time: collection, assessment, consolidation and cleaning, data selection, transformations. Covers all activities needed to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order.
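The data-quality questions above ("Is the data complete?", "Are there missing values?") can be answered with a very small audit before preparation begins. The sketch below is a hypothetical pure-Python illustration; the attribute names and sample records are invented for the example, not taken from the lecture:

```python
# Minimal data-quality audit: count missing values per attribute
# in a small table of records (hypothetical sample data).

def audit_quality(records, attributes):
    """Return {attribute: number of missing (None) values} for a list of dicts."""
    report = {}
    for attr in attributes:
        missing = sum(1 for r in records if r.get(attr) is None)
        report[attr] = missing
    return report

# Hypothetical customer records with gaps, as a data-understanding sample.
customers = [
    {"age": 34, "salary": 52000, "city": "Prague"},
    {"age": None, "salary": 61000, "city": "Brno"},
    {"age": 41, "salary": None, "city": None},
]

report = audit_quality(customers, ["age", "salary", "city"])
print(report)  # {'age': 1, 'salary': 1, 'city': 1}
```

Attributes with a high missing count would then be candidates for the cleaning and imputation steps of the data preparation phase.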
Tasks include table, record and attribute selection, as well as transformation and cleaning of data for modeling tools.

Select data
- decide on the data to be used for analysis
- criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types
- covers selection of attributes as well as selection of records in a table

Clean data
- raise the data quality to the level required by the selected analysis techniques
- may involve selecting clean subsets of the data, inserting suitable defaults, or more ambitious techniques such as estimating missing data by modeling

Construct data
- constructive data preparation operations such as producing derived attributes, entire new records, or transformed values for existing attributes

Integrate data
- methods whereby information is combined from multiple tables or records to create new records or values

Format data
- primarily syntactic modifications made to the data that do not change its meaning but might be required by the modeling tool

Phase 4. Modeling
- Select the modeling technique (based upon the data mining objective).
- Build the model (parameter settings).
- Assess the model (rank the models).
Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have specific requirements on the form of data; therefore, stepping back to the data preparation phase is often necessary.

Phase 4.
Modeling

Select modeling technique
- select the actual modeling technique to be used, e.g. decision tree or neural network
- if multiple techniques are applied, perform this task for each technique separately

Generate test design
- before actually building a model, generate a procedure or mechanism to test the model's quality and validity
- example: in classification it is common to use error rates as quality measures for data mining models; therefore, the dataset is typically separated into a train set and a test set, the model is built on the train set, and its quality is estimated on the separate test set

Build model
- run the modeling tool on the prepared dataset to create one or more models

Assess model
- interpret the models according to domain knowledge, the data mining success criteria and the desired test design
- judge the success of the application of modeling and discovery techniques in technical terms
- contact business analysts and domain experts later in order to discuss the data mining results in the business context
- consider only the models here, whereas the evaluation phase also takes into account all other results produced in the course of the project

Phase 5. Evaluation
- Evaluation of the model: how well it performed on test data.
- Methods and criteria: depend on the model type.
- Interpretation of the model: important or not, easy or hard, depends on the algorithm.
Thoroughly evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Phase 5.
Evaluation

Evaluate results
- assess the degree to which the model meets the business objectives
- seek to determine whether there is some business reason why the model is deficient
- test the model(s) in the real application if time and budget constraints permit
- also assess the other data mining results generated
- unveil additional challenges, information or hints for future directions

Review process
- do a more thorough review of the data mining engagement in order to determine whether any important factor or task has somehow been overlooked
- review quality assurance issues, e.g. "Did we correctly build the model?"

Determine next steps
- decide how to proceed at this stage
- decide whether to finish the project and move on to deployment if appropriate, or whether to initiate further iterations or set up new data mining projects
- include analyses of remaining resources and budget, which influence the decisions

Phase 6. Deployment
- Determine how the results need to be utilized: Who needs to use them? How often do they need to be used?
- Deploy data mining results by scoring a database, utilizing results as business rules, or interactive on-line scoring.
The knowledge gained needs to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

Phase 6.
Deployment

Plan deployment
- take the evaluation results and conclude a strategy for deploying the data mining result(s) into the business
- document the procedure for later deployment

Plan monitoring and maintenance
- important if the data mining results become part of the day-to-day business and its environment
- helps to avoid unnecessarily long periods of incorrect usage of data mining results
- needs a detailed monitoring process
- takes into account the specific type of deployment

Produce final report
- the project leader and his team write up a final report
- may be only a summary of the project and its experiences, or a final and comprehensive presentation of the data mining result(s)

Review project
- assess what went right and what went wrong, what was done well and what needs to be improved

Knowledge discovery from databases - the process
- Understand the opportunity: identify and define the business opportunity.
- Prepare data (typically about 75% of the process): profile and understand data, derive attributes, transform data, create the case set.
- Build models: train models, assess model performance.
- Use models: deploy the model, monitor model performance.

Example: transforming source data
[Figure: a source table with one row per purchase (purchase date/time, amount, store, account number), covering billions of purchases across millions of accounts, is joined with per-account data (age, credit limit, tenure, purchase history, item summaries, fraud flag). The purchases are aggregated and pivoted per account into attributes such as purchase counts, minimum/maximum amounts, and 0/1 indicators per item category (Elec, Vid, Jewl).]

Aggregate and pivot; the result is a modeling matrix with hundreds of attributes.
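The aggregate-and-pivot step above (one row per purchase in, one aggregated row per account out, with per-category indicator columns) can be sketched minimally. This is a hypothetical pure-Python illustration: the field names and the tiny sample are invented, and a real system would run the same idea at database scale in a tool such as Pentaho Kettle or SAS:

```python
from collections import defaultdict

# Hypothetical transaction table: one row per purchase.
purchases = [
    {"account": "8376636636", "amount": 88.38, "category": "Elec"},
    {"account": "8376636636", "amount": 121.33, "category": "Vid"},
    {"account": "3866493657", "amount": 19.99, "category": "Elec"},
    {"account": "3866493657", "amount": 144.00, "category": "Jewl"},
]

CATEGORIES = ["Elec", "Vid", "Jewl"]

def aggregate_and_pivot(rows):
    """Collapse transactions to one row per account: purchase count,
    min/max amount (aggregate), and 0/1 per-category indicators (pivot)."""
    by_account = defaultdict(list)
    for row in rows:
        by_account[row["account"]].append(row)

    matrix = []
    for account, txns in by_account.items():
        amounts = [t["amount"] for t in txns]
        cats = {t["category"] for t in txns}
        record = {
            "account": account,
            "n_purchases": len(txns),
            "min_amt": min(amounts),
            "max_amt": max(amounts),
        }
        for c in CATEGORIES:  # pivot: one indicator column per category
            record[c] = 1 if c in cats else 0
        matrix.append(record)
    return matrix

matrix = aggregate_and_pivot(purchases)
```

Each output record is one row of the modeling matrix; in practice hundreds of such derived attributes are produced per account.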
The modeling matrix contains a mix of fraud and no-fraud purchases.

Example: modeling in SAS Enterprise Miner

Example: model export
- the score converter node generates Java model code
- the reporter node exports the code and an HTML report to the project directory

Example: model deployment
[Architecture diagram: a Web browser client (HTML) talks to a Web application on a Web server; the deployment tool copies model information to the Data Store (NonStop SQL/MX, accessed via JDBC); models are exported/registered from SAS Enterprise Miner and Mining Mart through a SAS Open Metadata server, with file/registry access to a File/SAS server.]

Example: real-time scoring
[Diagram: an Interaction Manager delivers offers/advice; a rules engine combines business rules with model scores; a scoring engine applies deployed models to model aggregates; an aggregation engine derives those aggregates from customer data according to aggregate definitions.]

Visualization
- Extremely useful in all CRISP-DM stages.
- Raw data visualization: detect problems (data inconsistency, outliers, errors) in large, multivariate data.
- Model behavior visualization.
- Results visualization.
- Often the best interface for domain experts.
- Examples follow.

Visual data exploration

Glyphs
http://www.ii.uib.no/vis/publications/publication/2009/vids/lie09glyphBased3Dvisualization.html

Dedicated infrastructures
EVEREST (Exploratory Visualization Environment for REsearch in Science and Technology) is a large-scale venue for data exploration and analysis. Its main feature is a 27-projector PowerWall: the projectors are arranged in a 9×3 array, each providing 3,500 lumens for a very bright display. Displaying 11,520 by 3,072 pixels, a total of 35 million pixels, the wall offers a tremendous amount of visual detail. The wall is integrated with the rest of the computing center, creating a high-bandwidth data path between large-scale high-performance computing and large-scale data visualization.

Next lecture
- Visual data exploration
- Statistical data description