Data Mining Overview
James E. Parry, Solution Architect
IBM Business Analytics software
© 2010 IBM Corporation

Introduction to SPSS, an IBM Company

Leadership in Predictive Analytics
– Market leader
– 40+ year heritage in predictive analytic technologies
– Broad product range
  • Statistics, data-mining, data collection and deployment product families

A “first-to-market” deployment methodology
– A methodology for deploying Predictive Analytics across the enterprise
– Provides an incremental, phased approach to the enterprise solution
– Based on the convergence of analytics, architecture and business processes

“Play well with others”
– Non-intrusive integration (Service Oriented Architecture)
– Database-agnostic
– Leverages existing operational software & IT investments

Agenda
What is Predictive Analytics?
Questions Data Mining Can Answer
Statistics vs. Data Mining
Analysis Tools in the Data Mine
– User Driven vs. Data Driven Tools
Supervised vs. Unsupervised Learning
– Supervised: Prediction and Classification
– Unsupervised: Clustering, Association and Anomaly Detection
Text Mining
Deployment Technology: Making Findings Matter
Q&A
© 2009 SPSS Inc.

Predictive Analytics
“Predictive analytics helps connect data to effective action by drawing reliable conclusions about current conditions and future events.”
– Gareth Herschel, Research Director, Gartner Group

Questions Data Mining Can Answer

Commercial Sector | Public Sector
Reducing campaign costs and increasing customer conversions | Reducing recruiting costs and increasing employee retention
Decreasing customer churn | Decreasing institutional attrition
Reducing fraud and improper payments | Reducing fraud and improper payments
Maximizing ROI on direct marketing campaigns | Maximizing ROI on public service campaigns
Improving product offerings by understanding customer needs | Improving public health and safety by understanding constituent needs

What kinds of questions can you answer with Data Mining in Public Sector?
It’s all about propensity . . .
– Propensity to . . .
  • Be a successful employee
  • Network Attack
  • Commit Fraud
  • Quit

In other words…
What are the ?

How do we figure those propensities out again?
– You need predictors…
– You need outcomes…
– But you don’t necessarily need to understand complex equations to get answers!

Aren’t those statistics?
Traditional Statistical Data Analysis
– Descriptive (sample)
– Inferential (population)
Data Mining (and machine learning in general)
– Accuracy of prediction (predicted classification)
– Individual predictions
– Rules of thumb

Aren’t those statistics?
Traditional Statistical Data Analysis – Data Mining (and machine learning in general)
Regardless of whether or not the models are easily explained!

CRISP-DM, the Cross Industry Standard Process for Data Mining
Process Phases
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Data Mining vs. Statistical Analysis
Statistical Analysis (User Driven)
– Confirm Hypotheses
– More Data Requirements
– More Assumptions
– General Population Predictions
– Cumulative Results
Data Mining (Data Driven)
– Generate Hypotheses
– More Exploratory
– Less Data Prep
– Fewer Assumptions
– Individual Predictions
– Results Oriented

In a nutshell…
Data mining works by…
– Clearly defining business goals
– Data exploration and hypothesis generation
– Training
– Refining and . . .
– Validating models
– Deploying production models into operational framework
Statistics are most useful when…
– You plan an experiment
– You need to plan data collection wisely
  • Costly data collection process: minimum cases necessary to find an effect (Power!)
– You need to estimate population parameters
– Confirm or fail to confirm a hypothesis

Analysis tools in the Data Mine
– Query, SQL, Spreadsheets
– On Line Analytical Processing (OLAP)
– Data visualisation
– Statistics
– Rule induction and Segmentation
– Neural networks & Decision Trees

Statistics – Descriptive Analysis
Analytic software:
– Data displays (e.g., frequency distributions)
– Graphic displays of data (e.g., histogram)
– Measures of central tendency (e.g., mean, median)
– Estimates of variance (e.g., standard deviation)
[Figure: histogram of “Satisfaction with service 1-10” (x-axis) against Frequency (y-axis); Mean = 8.3, Std. Dev = 1.65, N = 248]

Statistics – Inferential Analysis
Predicting numerical or categorical outcomes
– Linear regression
– GLM Multivariate/Repeated Measures
– Non-linear regression
– Time Series
– Survival Analysis/Cox regression
– Structural Equation Modeling

Statistics – Inferential Analysis
Used often in experimental design, clinical trials and survey research with complex sampling designs
– N.O.R.C. and Gallup use extensive inferential statistics accurately representing survey data on how people think and feel about the world today.
– NIH uses inferential statistics to analyze experimental data to quantify significant differences in treatments and interventions.
– CDC – extensive epidemiological studies require inferential statistics
Used to create data when you don’t have it.
– Sample size
– Effect size
– Validity of results

Data Mining
Three classes of data mining algorithms (supervised vs. unsupervised)
– Cluster (“Differences”): Group cases that exhibit similar characteristics.
– Associate (“Patterns”): What events occur together? Given a series of actions, what action is likely to occur next?
– Predict (“Relationships”): Predict who is likely to exhibit specific behavior in the future.
The classes are complementary.
© 2007 SPSS Inc.

What is Supervised Learning?
– A technique when we know the output or outputs
– We will “Supervise” the algorithm and tell it what we want to predict.

Supervised Learning: Profile and Predict
Build a predictive profile of the historical outcome using a collection of potential input fields.
Explores all combinations, interactions and contingencies.
Use this profile to understand and predict future cases.

Decision tree for Credit ranking (1=default); each node shows category % (n):
Root: Bad 52.01 (168), Good 47.99 (155), Total 100.00 (323)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
– Weekly pay: Bad 86.67 (143), Good 13.33 (22), Total 51.08 (165)
  Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
  • Young (< 25); Middle (25-35): Bad 90.51 (143), Good 9.49 (15), Total 48.92 (158)
  • Old (> 35): Bad 0.00 (0), Good 100.00 (7), Total 2.17 (7)
– Monthly salary: Bad 15.82 (25), Good 84.18 (133), Total 48.92 (158)
  Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
  • Young (< 25): Bad 48.98 (24), Good 51.02 (25), Total 15.17 (49)
    Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
    · Management; Clerical: Bad 0.00 (0), Good 100.00 (8), Total 2.48 (8)
    · Professional: Bad 58.54 (24), Good 41.46 (17), Total 12.69 (41)
  • Middle (25-35); Old (> 35): Bad 0.92 (1), Good 99.08 (108), Total 33.75 (109)

Profile and Predict
Neural Networks
– A technique for predicting outcomes based on inputs where the inputs are weighted on hidden layers
– Behaves similar to the neurons in your brain
– Powerful general function estimators
– Require minimal statistical or mathematical knowledge
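
As a rough illustration of what "inputs are weighted on hidden layers" means, the sketch below runs one forward pass through a tiny network with two inputs, two hidden units and one output. Everything specific here is an assumption made for illustration: the layer sizes, the sigmoid activation, the weights and biases, and the function names are hypothetical and are not taken from the slides or from any SPSS product.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> weighted hidden layer -> single output."""
    # Each hidden unit takes a weighted sum of all inputs, plus a bias.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output unit takes a weighted sum of the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical weights for 2 inputs, 2 hidden units, 1 output.
HIDDEN_WEIGHTS = [[0.5, -0.4], [0.3, 0.8]]
HIDDEN_BIASES = [0.0, -0.1]
OUTPUT_WEIGHTS = [1.2, -0.7]
OUTPUT_BIAS = 0.05

score = forward([1.0, 0.0], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)
print(f"predicted propensity: {score:.3f}")
```

In a trained network the weights would be learned from historical cases with known outcomes rather than typed in by hand; the arithmetic of a single prediction is the same.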

[Figure: Neural Network Anatomy]
[Figures: Neural Network Output (three slides)]

Neural Network Summary
– Excellent for modeling complex relationships and predicting outcomes
  • Can handle nonlinearity and interactions with ease
– Good for solving many different problem sets (categorical, binary, scale predictors and outcomes)
– Very poor (Black Box) at describing the relationships among predictors and outcomes

Profile and Predict
Decision Trees and Rule Induction
– Classification systems that predict or classify
– Technique that shows the ‘reasoning’ – contrast with Neural Network
– Builds sets of easy to understand ‘If – Then’ Rules
– Eliminates factors that are unimportant

Basic Decision Tree*
[Figure: decision tree with root node “weather” (branches: sunny, rainy, cloudy), internal nodes “Temp > 75” and “windy” (branches: no, yes), and leaves “BBQ” and “Eat in”]
*www.cs.utsa.edu/~kwek/cs6463s05/Classification.ppt

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)*
*www.cs.utsa.edu/~kwek/cs6463s05/Classification.ppt

[Figures: Decision Tree Anatomy, axes X1 and X2 (two slides)]
[Figures: Running a Decision Tree (two slides)]
[Figure: Decision Tree Output, annotated “Start Here: Claim Amount”]

Why not just use Regression?
• OccPrest = 4(Educ) + 10: “For every year of education completed occupational prestige increases by 4 points on average.”
[Figures: four plots on axes X1 and X2, captioned in turn:]
– How do we describe the relationship?
– Does an increase in X2 lead to Green?
– Can a line describe something that is not linear by nature?
– Many phenomena cannot be fit to a straight line.

Decision Trees
– Excellent at uncovering and modeling complex relationships
– Very accurate on even small data sets to inform decision making.
– Can handle nonlinear relationships with complex interactions.
– Very easy to understand and describe to others.
– Time to insight in minutes.

What is Unsupervised Learning?
– A data mining technique when we do not know the output or outputs
– Can be thought of as finding ‘useful’ patterns above and beyond noise…or “fishing” for information

Unsupervised Learning: Clustering and Association
Find emerging patterns and unusual cases.
Use data mining to examine the differences and shifts across all dimensions of the data.
– Select large groups to identify common patterns.
– Select small groups to identify unusual patterns.
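
The idea of sorting records into groups and then looking at group size can be sketched with a minimal k-means loop. This is only one possible clustering method, chosen here for brevity; the slides do not name the algorithm behind their clustering. The points, the starting centroids and all function names below are made up for illustration.

```python
import math
from collections import Counter


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            moved.append(old)  # keep an empty cluster's centroid where it was
            continue
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    centroids = list(centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical records described by two numeric fields.
POINTS = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3), (0.8, 0.9), (1.0, 1.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3),
    (9.0, 1.0),
]
START = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]  # assumed starting centroids

labels, centroids = kmeans(POINTS, START)
sizes = Counter(labels)
for cluster, size in sorted(sizes.items()):
    print(f"cluster {cluster}: {size} records")
```

With these made-up points the groups hold 6, 4 and 1 records. The two larger groups stand for common patterns, and the single-record group is the kind of small group worth a closer look.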

Cluster and Associate
Clustering
– An exploratory data analysis technique
– Reveals natural groups within a data set
– Distance Measure: No prior knowledge about groups or characteristics
– Not always an end in itself
Associations
– Finds things that occur together – ex: events in a crime incident
– Associations can exist between any of the attributes (no single outcome like Decision Trees)
Sequential Associations
– Discovers association rules in time-oriented data
– Find the sequence or order of the events

Anomaly Detection
Anomalies
– Anomaly detection is an exploratory method
– Designed for quick detection of unusual cases or records that should be candidates for further analysis
– These should be regarded as suspected anomalies, which, on closer examination, may or may not turn out to be real

Anomaly Detection: Output
Anomalous Records
– Each record is assigned an anomaly index ($O-AnomalyIndex), which is the ratio of the group deviation index to its average over the cluster that the case belongs to.
– The larger the value of this index, the more the case deviates from the average.
– Under the usual circumstance, cases with anomaly index values less than 1 or even 1.5 would not be considered as anomalies, because the deviation is just about the same or a bit more than the average.
– However, cases with an index value greater than 2 could be good anomaly candidates because the deviation is at least twice the average.

Text Mining

What is Text Mining?
Most data held within an organization is in the form of unstructured text documents or records:
– Emails, communications logs,
– Reports,
– Web pages, blogs, …
Text Mining refers to extracting usable knowledge from unstructured text data, through identification of core concepts, opinions and trends, to drive better business decisions across the enterprise.

Text Mining Timeline: Text Extraction
“Mr. Smith aka Mr. Ahmed was seen on the corner of Church St. and Magnolia Ave. on Nov 13th”
– 70’s: Bag of « Words » extraction
  • Mr. Smith aka was seen with Ahmed on the corner of Church Etc.
– 80’s: Expressions extraction
  • Mr. Smith; was seen; Mr. Ahmed; corner; Church St.; Magnolia Ave.; Nov 13th
– 90’s: Named Entities extraction
  • Mr. Smith -> Person
  • Mr. Ahmed -> Person
  • aka -> Alias
  • was seen -> location
  • Church St. -> Address
  • Magnolia Ave. -> Address
  • Nov 13th -> Date
– Now: Events/Sentiment Extraction, Entity Grouping, Build Categories
  • Mr. Smith (Person) -> aka (Alias) -> Mr. Ahmed (Person) was seen (location) -> Church and Magnolia (address) -> November 13 (Date)
  • Citizens -> us citizens -> civilians -> civilian bus -> …

[Figure: Discover critical information with TM]

The Deployment Technology
In Data Mining, time to insight is half the battle. Time to production is the other half (and much more repetitive).
– Must be able to ‘deploy’ model into operations:
  • Quickly
  • In a standards-based, repeatable fashion
– Must be able to monitor model performance for ‘drift’.
Automating model performance monitoring and model refresh decreases errors because it’s a ‘hands off’ operation: no user intervention required.
Automating model refresh guarantees the most accurate models in the shortest amount of time (time to production).
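
Monitoring a deployed model for drift can be sketched as a comparison between the accuracy measured at deployment time and the accuracy on recent cases whose outcomes are now known. The function names, the 0.05 tolerance and the sample data below are hypothetical, chosen only to illustrate the idea; they do not describe how any SPSS or IBM product implements monitoring or refresh.

```python
def accuracy(predictions, outcomes):
    """Share of predictions that matched the observed outcome."""
    if not outcomes or len(predictions) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return hits / len(outcomes)


def needs_refresh(baseline_accuracy, recent_accuracy, tolerance=0.05):
    """Flag drift when accuracy has dropped by more than the tolerance."""
    return (baseline_accuracy - recent_accuracy) > tolerance


# Hypothetical scored cases: 1 = fraud referred, 0 = not referred.
baseline = accuracy([1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
                    [1, 0, 0, 1, 1, 0, 0, 0, 1, 1])   # 9 of 10 correct
recent = accuracy([1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
                  [0, 0, 1, 1, 1, 0, 0, 0, 1, 1])     # 7 of 10 correct

if needs_refresh(baseline, recent):
    print("accuracy has drifted; schedule a model refresh")
else:
    print("model performance is within tolerance")
```

A scheduled job could run a check like this after each batch of outcomes arrives, which is the sense in which monitoring becomes a hands-off operation.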

Data Mining Considerations
Data
– Available data (structured/unstructured)
– Relevant factors
– Subject matter expertise
Modeling
– Supervised vs. Unsupervised
– Different types of models (NN vs. Rules)
– Combining models (Meta modeling)
Deployment
– Batch vs. Real-time
– Production Automation
– Scheduling
– Champion – Challenger
– Multi-step jobs, conditional logic
Governance
– Version control
– Security and auditing

Managing (many) Models
• How do you keep models secure and keep track of their evolution?
• Where do your models sleep at night?
• Whose model is in production right now?
• Which version is it?
• How do you manage a model once it’s in production?
• How does it perform now?
• How is it likely to perform on new data?
• When is it time to retire this model?

Collaboration & Deployment
Analytic content management repository
– Version control
– Powerful search
  • Analytic awareness
– Security and auditing
Process management
– Multi-step jobs
– Conditional job flow
– Scheduling
– Automated model evaluation
  • Champion - challenger
– Open integration
  • SPSS tools and non-SPSS tools
Integration & delivery interfaces
– Reporting
– Automatic delivery of analytical output
– Multiple IT infrastructure integration options
  • Web services, authentication, and database interfaces

Store, manage, automate, distribute, score ...
Store & manage Modeler artifacts
– Streams, data files, output
– Search on data-mining metadata
Automate Modeler operations
– Stream execution
– Support for remote Modeler servers and clusters
– Scheduling and Automation
Model Management
– Refresh
– Score
– Evaluate
  • Accuracy
  • Gains
  • Accreditation
Store & distribute output from Modeler
– File based output (cou, html, dat, jpg, etc)
– New graph templates from Viz Designer
Batch and Real-time scoring of models

Deployment: Model Refresh
Deployed models can automatically be refreshed using the Champion-Challenger scenario . . .

Deployment Steps
Production considerations . . .
– Models should be easily deployed and managed
– No SQL programming necessary
– No DBA intervention
– Done in a standards-based, replicable fashion (not one-off)

[Figure: Predictions and Confidence]

From Analyst to Production in Minutes
Real Time Prediction

Deployment: Web Application
Deploy into a web application where cases can be scored to find unusual attributes . . .

Deployment
Intelligently route claims and enhance rules engines with output from a deployed data mining model.
Action based on data mining model. Action = REFER

Summary of the Science
• Data Mining is a ‘data’ driven process where control is relinquished to the machine and many hypotheses are generated and explored.
• Leveraging these insights leads to high ROI.
• Supervised and Unsupervised learning techniques complement each other
• Clustering and Anomaly Detection are excellent at typifying data with high dimensionality and finding the ‘needle in the haystack’
• They also serve as great data reduction techniques for data preparation
• Decision Trees and Neural Nets are great at predicting and classifying; Decision Trees are generally easier to interpret
• Deployment technology is key to unleashing predictive analytics.

Questions?