Chapter 2: Data Mining Processes and Knowledge Discovery
Identify actionable results

Contents
• Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
• Discusses each phase in detail
• Gives an example illustration
• Discusses a knowledge discovery process

CRISP-DM
• Cross-Industry Standard Process for Data Mining
• One of the first comprehensive attempts toward a standard process model for data mining
• Independent of industry sector & technology

CRISP-DM Phases
1. Business (or problem) understanding
2. Data understanding
   • A systematic process to try to make sense of the massive amounts of data generated from daily operations
3. Data preparation
   • Transform & create data set for modeling
4. Modeling
5. Evaluation
   • Check good models, evaluate to assure nothing is missing
6. Deployment

Business Understanding
• Solve a specific problem
• Determine business objectives, assess the current situation, establish data mining goals, and develop a project plan
• Clear definition helps
• Measurable success criteria
• Convert business objectives to a set of data mining goals
• What to achieve in technical terms, such as:
   • What types of customers are interested in each of our products?
   • What are typical profiles of customers …

Data Understanding
• Initial data collection, data description, data exploration, and the verification of data quality
• Three issues considered in data selection:
   1. Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify the spending behaviors of female shoppers who purchase seasonal clothes.
   2. Identify the relevant data for the problem description, such as demographic, credit card transactional, and financial data …
   3. Select the variables that are relevant and important for the project.

Data Understanding (cont.)
• Data types:
   • Demographic data (income, education, age, …)
   • Socio-graphic data (hobby, club membership, …)
   • Transactional data (sales records, credit card spending, …)
   • Quantitative data: measurable using numerical values
   • Qualitative data: also known as categorical data; contains both nominal and ordinal data (see also page 22)
• Related data can come from many sources:
   • Internal: ERP (or MIS), data warehouse
   • External: government data, commercial data
   • Created: research

Data Preparation
• Once the available data sources are identified, the data need to be selected, cleaned, built into the desired form, and formatted
• Clean data: formats, gaps, filtering outliers & redundancies (see page 22)
• Unified numerical scales
   • Nominal data: code (such as gender data, male and female)
   • Ordinal data: nominal code or scale (excellent, fair, poor)
   • Cardinal data (categorical; A, B, C levels)

Types of Data
Type         Features    Synonyms
Numerical    Continuous  Range
Numerical    Integer     Range
Binary       Yes/No      Flag
Categorical  Finite      Set
Date/Time                Range
String                   Typeless
Text                     String
• Range: numeric values (integer, real, or date/time)
• Set: data with multiple distinct values (numeric, string, or date/time)
• Typeless: for other types of data

Data Preparation (cont.)
• Several statistical methods and visualization tools can be used to preprocess the selected data
   • Statistics such as max, min, mean, and mode can be used to aggregate or smooth the data
   • Scatter plots and box plots can be used to filter outliers
• More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may be applied in data preprocessing
• In some cases, data preprocessing can take over 50% of the time of the entire data mining process
• Shortening data processing time can reduce much of the total computation time in data mining
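
The box-plot outlier screening and the simple summaries mentioned above can be sketched as follows. This is a minimal illustration only: the slides do not say which rule to use, so the sketch assumes the common box-plot convention of fences at 1.5 × IQR beyond the quartiles. The function name `iqr_outlier_filter` and the spending figures are hypothetical.

```python
import statistics


def iqr_outlier_filter(values, k=1.5):
    """Split values into (kept, outliers) using box-plot fences.

    Assumption: fences sit at Q1 - k*IQR and Q3 + k*IQR with k = 1.5,
    the usual box-plot convention; the slides do not fix a rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical spending figures with one suspicious entry.
spending = [120, 135, 128, 140, 132, 128, 138, 900]
kept, outliers = iqr_outlier_filter(spending)

# Simple summaries of the cleaned data (max, min, mean, mode).
summary = {
    "min": min(kept),
    "max": max(kept),
    "mean": statistics.mean(kept),
    "mode": statistics.mode(kept),
}
print(outliers)
print(summary)
```

A different fence multiplier, or a scatter-plot inspection, may suit a particular data set better; the point is only that the filter runs before aggregation.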
Data Preparation: Data Transformation
• Data transformation uses simple mathematical formulations or learning curves to convert different measurements of the selected, cleaned data into a unified numerical scale for data analysis
• Data transformation can be used to:
   1. Transform from numerical to numerical scales, to shrink or enlarge the given data. For example, (x - min)/(max - min) shrinks the data into the interval [0, 1].
   2. Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, …). For example, 1 = yes, 0 = no. See page 24 for more details.

Modeling
• Data modeling is where the data mining software is used to generate results for various situations
• Data visualization and cluster analysis are useful for initial analysis
• Depending on the data type:
   1. If the task is to group data, discriminant analysis is applied.
   2. If the purpose is estimation, regression is appropriate if the data are continuous (and logistic regression if not).
   3. Neural networks could be applied for both tasks.
• Data treatment
   • Training set for development of the model
   • Test set for testing the model that is built
   • Maybe others for refining the model

Data Mining Techniques
• Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also page 25 for an example.
• Classification: the methods are intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and prediction power (C4.5). Mathematical models often used to construct classification methods are binary decision trees (CART), neural networks (nonlinear), linear programming (boundary), and statistics. See also pages 25 and 26 for more explanation.

Data Mining Techniques (cont.)
• Clustering: takes ungrouped data and uses automatic techniques to put the data into groups. Clustering is unsupervised and does not require a learning set. (Chapter 5)
• Prediction: related to regression techniques; discovers the relationship between the dependent and independent variables.
• Sequential patterns: seeks to find similar patterns in data transactions over a business period. The mathematical models behind sequential patterns are logic rules, fuzzy logic, and so on.
• Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods.

Evaluation
• Does the model meet business objectives?
• Are any important business objectives not addressed?
• Does the model make sense?
• Is the model actionable?

Deployment
• DM can be used to verify previously held hypotheses or for knowledge discovery
• DM models can be applied to business purposes, including prediction or identification of key situations
• Ongoing monitoring & maintenance
   • Evaluate performance against success criteria
   • Market reaction & competitor changes (remodeling or fine-tuning)

Example
• Training set for computer purchase
   • 16 records
   • 5 attributes
• Goal
   • Find classifier for consumer behavior

Database (1st half)
Case  Age    Income   Student  Credit     Gender  Buy?
A1    31-40  High     No       Fair       Male    Yes
A2    >40    Medium   No       Fair       Female  Yes
A3    >40    Low      Yes      Fair       Female  Yes
A4    31-40  Low      Yes      Excellent  Female  Yes
A5    ≤30    Low      Yes      Fair       Female  Yes
A6    >40    Medium   Yes      Fair       Male    Yes
A7    ≤30    Medium   Yes      Excellent  Male    Yes
A8    31-40  Medium   No       Excellent  Male    Yes

Database (2nd half)
Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   ≤30    High     No       Fair       Male    No
A11   ≤30    High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   ≤30    Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   ≤30    Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No

Data Selection
• Gender has a weak relationship with purchase
   • Based on correlation
• Drop gender
• Selected attribute set: {Age, Income, Student, Credit}

Data Preprocessing
• Income unknown in Case 15
• Credit not available in Case 16
• Drop these noisy cases

Data Transformation
• Assign numerical values to each attribute
   • Age: ≤30 = 3, 31-40 = 2, >40 = 1
   • Income: High = 3, Medium = 2, Low = 1
   • Student: Yes = 2, No = 1
   • Credit: Excellent = 2, Fair = 1

Data Mining
• Categorize output
   • Buys = C1
   • Doesn't buy = C2
• Conduct analysis
   • Model says A8, A10 don't buy; rest do
   • Of the actual yes, 7 correct and 1 not
   • Of the actual no, 2 correct
• Confusion matrix

Data Interpretation and Test Data Set
• Test on independent data
Case  Actual  Model  Note
B1    Yes     Yes    (1)
B2    Yes     Yes    (2)
B3    Yes     Yes    (3)
B4    Yes     Yes    (4)
B5    Yes     Yes    (5)
B6    Yes     Yes    (6)
B7    Yes     Yes    (7)
B8    No      No     (do not)
B9    No      Yes
B10   No      No     (do not)

Confusion Matrix
            Model Buy  Model Not  Totals
Actual Buy      7          0         7
Actual Not      1          2         3
Totals          8          2        10

Measures
• Correct classification rate: 9/10 = 0.90
• Cost function (cost of error):
   • Model says buy, actual no: $20
   • Model says no, actual buy: $200
   • 1 x $20 + 0 x $200 = $20

Goals
• Avoid broad concepts:
   • Gain insight; discover meaningful patterns; learn interesting things
   • Can't measure attainment
• Narrow and specify:
   • Identify customers likely to renew; reduce churn
   • Rank order by propensity (favor) to …

Goals
• Description: what is
   • understand
   • explain
   • discover knowledge
• Prescription: what should be done
   • classify
   • predict

Goal
• Method A: four rules, explains 70%
• Method B: fifty rules, explains 72%
• BEST?
• Gain understanding: Method A better
   • Minimum description length (MDL)
• Reduce cost of mailing: Method B better

Measurement
• Accuracy
   • How well does the model describe observed data?
   • Confidence levels: a proportion of the time between lower and upper limits
• Comprehensibility
   • Whole or parts?

Measuring Predictive
• Classification & prediction:
   • error rate = incorrect/total
   • requires evaluation set be representative
• Estimators
   • predicted - actual (MAD, MSE, MAPE)
   • variance = sum (predicted - actual)^2
   • standard deviation = square root of variance
   • distance: how far off

Statistics
• Population: entire group studied
• Sample: subset from population
• Bias: difference between sample average & population average
• mean, median, mode
• distribution
• significance
• correlation, regression (Hamming distance)

Classification Models
• LIFT = probability in class by sample divided by probability in class by population
   • If population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5
• Best lift not necessarily best
   • Need sufficient sample size as confidence increases

Lift Chart
[Figure: lift chart. Horizontal axis: % mailed (0 to 100); vertical axis: % responded (0 to 100); plotted series: % mailed and % responded.]

Measuring Impact
• Ideal: $ (NPV) because of expenditure
• Mass mailing may be better
• Depends on:
   • fixed cost
   • cost per recipient
   • cost per respondent
   • value of positive response

Bottom Line
• Return on investment

Example Application
• Telephone industry
• Problem: unpaid bills
• Data mining used to develop models to predict nonpayment as early as possible
• See page 27

Knowledge Discovery Process
1. Data Selection
   • Learning the application domain
   • Creating target data set
2. Data Preprocessing
   • Data cleaning & preprocessing
3. Data Transformation
   • Data reduction & projection
4. Data Mining
   • Choosing function
   • Choosing algorithms
   • Data mining
5. Data Interpretation
   • Interpretation
   • Using discovered knowledge

1: Business Understanding
• Predict which customers would be insolvent
   • In time for the firm to take preventive measures (and avert losing good customers)
• Hypothesis: insolvent customers would change calling habits & phone usage during a critical period before & immediately after termination of the billing period

2: Data Understanding
• Static customer information available in files
   • Bills, payments, usage
• Used data warehouse to gather & organize data
   • Coded to protect customer privacy

Creating Target Data Set
• Customer files
   • Customer information
   • Disconnects
   • Reconnections
• Time-dependent data
   • Bills
   • Payments
   • Usage
• 100,000 customers over a 17-month period
• Stratified (hierarchical) sampling to assure all groups are appropriately represented

3: Data Preparation
• Filtered out incomplete data
• Deleted inexpensive calls
   • Reduced data volume about 50%
• Low number of fraudulent cases
• Cross-checked with phone disconnects
• Lagged data made synchronization necessary

Data Reduction & Projection
• Information grouped by account
• Customer data aggregated by 2-week periods
• Discriminant analysis on 23 categories
• Calculated average owed by category (significant)
• Identified extra charges (significant)
• Investigated payment by installments (not significant)

Choosing Data Mining Function
• Classes:
   • Most possibly solvent (99.3%)
   • Most possibly insolvent (0.7%)
• Costs of error widely different
• New data set created through stratified sampling
   • Retained all insolvent
   • Altered distribution to 90% solvent
   • Used 2,066 cases total
• Critical period identified
   • Last 15 two-week periods before service interruption
• Variables defined by counting measures in two-week periods
   • 46 variables as candidate discriminant factors

4: Modeling
• Discriminant analysis
   • Linear model
   • SPSS: stepwise forward selection
• Decision trees
   • Rule-based classifier, C5, C4.5
• Neural networks
   • Nonlinear model

Data Mining
• Training set about 2/3
   • Rest test
• Discriminant analysis
   • Used 17 variables
   • Equal costs: 0.875 correct
   • Unequal costs: 0.930 correct
• Rule-based: 0.952 correct
• Neural network: 0.929 correct

5: Evaluation
• 1st objective: maximize accuracy of predicting insolvent customers
   • Decision tree classifier best
• 2nd objective: minimize error rate for solvent customers
   • Neural network model close to decision tree
• Used all 3 on a case-by-case basis

Coincidence Matrix: Combined Models
                  Model insolvent  Model solvent  Unclass  Totals
Actual insolvent        19              17           28       64
Actual solvent           1             626           27      654
Totals                  20             643           55      718

6: Implementation
• Every customer examined using all 3 algorithms
   • If all 3 agreed, used that classification
   • If disagreement, categorized as unclassified
• Correct on test data: 0.898
   • Only 1 actually solvent customer would have been disconnected
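
The agreement rule from the implementation slide can be sketched as follows. This is an illustrative sketch, not the study's actual code: the function name `combine_votes` and the label strings are hypothetical. The second half recomputes the reported 0.898 rate from the coincidence matrix, under the assumption that a case counts as correct only when the combined model's class matches the actual class (unclassified cases count as not correct).

```python
def combine_votes(predictions):
    """Return the shared label if every model agrees, else 'unclassified'."""
    first = predictions[0]
    if all(p == first for p in predictions):
        return first
    return "unclassified"


# Hypothetical outputs from the three models for two customers.
print(combine_votes(["solvent", "solvent", "solvent"]))
print(combine_votes(["insolvent", "solvent", "insolvent"]))

# Coincidence matrix from the slide: outer keys are actual classes,
# inner keys are the combined model's output.
matrix = {
    "insolvent": {"insolvent": 19, "solvent": 17, "unclassified": 28},
    "solvent": {"insolvent": 1, "solvent": 626, "unclassified": 27},
}
total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[actual][actual] for actual in matrix)
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.3f}")
```

Requiring unanimous agreement trades coverage for caution: more customers end up unclassified, but only one solvent customer is labeled insolvent.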