Issues in Data Mining Infrastructure
Authors: Nemanja Jovanovic, Valentina Milenkovic, Prof. Dr. Veljko Milutinovic
http://galeb.etf.bg.ac.yu/~vm

Data Mining in a Nutshell
• Uncovering the hidden knowledge
• Huge NP-complete search space
• Multidimensional interface

A Problem …
You are a marketing manager for a cellular phone company
• Problem: Churn is too high
• Turnover (after contract expires) is 40%
• Customers receive a free phone (cost $125) with the contract
• You pay a sales commission of $250 per contract
• Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)
• Bringing back a customer after quitting is both difficult and expensive

… A Solution
• Three months before a contract expires, predict which customers will leave
• If you want to keep a customer that is predicted to churn, offer them a new phone
• The ones that are not predicted to churn need no attention
• If you don't want to keep the customer, do nothing
• How can you predict future behavior?
    • Tarot Cards?
    • Magic Ball?
    • Data Mining?

Still Skeptical?

The Definition
The automated extraction of predictive information from (large) databases
• Automated
• Extraction
• Predictive
• Databases

History of Data Mining

Repetition in Solar Activity
• 1613 – Galileo Galilei
• 1859 – Heinrich Schwabe

The Return of the Halley Comet
Edmund Halley (1656 - 1742)
Returns shown: 239 BC, 1531, 1607, 1682, 1910, 1986, 2061 ???
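The return years listed for Halley's comet are enough to reproduce the "2061 ???" guess by hand. The following is a minimal sketch, assuming the returns are roughly periodic and that some returns between the listed years may be missing from the slide; the function name `estimate_next_return` is illustrative only.

```python
def estimate_next_return(years):
    """Estimate the period and the next return from sorted observation years.

    Assumes a roughly constant period and tolerates unrecorded returns
    between two observations.
    """
    gaps = [later - earlier for earlier, later in zip(years, years[1:])]
    base = min(gaps)                 # the shortest gap approximates one period
    span = years[-1] - years[0]
    cycles = round(span / base)      # whole periods covered by the full span
    period = span / cycles
    return period, years[-1] + period


observed = [1531, 1607, 1682, 1910, 1986]   # AD returns listed on the slide
period, next_return = estimate_next_return(observed)
print(f"period ~ {period:.1f} years, next return ~ {int(next_return)}")
```

With the slide's years this gives a period of just under 76 years and a next return in 2061, which is the kind of prediction from past data that the rest of the deck automates.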

Data Mining is Not
• Data warehousing
• Ad-hoc query/reporting
• Online Analytical Processing (OLAP)
• Data visualization

Data Mining is
• Automated extraction of predictive information from various data sources
• Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

Data Mining can
• Answer questions that were too time-consuming to resolve in the past
• Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions

Focus of this Presentation
• Data Mining problem types
• Data Mining models and algorithms
• Efficient Data Mining
• Available software

Data Mining Problem Types
• 6 types
• Often a combination solves the problem

Data Description and Summarization
• Aims at a concise description of data characteristics
• Lower end of the scale of problem types
• Provides the user with an overview of the data structure
• Typically a subgoal

Segmentation
• Separates the data into interesting and meaningful subgroups or classes
• Manual or (semi)automatic
• A problem in itself or just a step in solving a problem

Classification
• Assumption: existence of objects with characteristics that belong to different classes
• Building classification models which assign correct labels in advance
• Occurs in a wide range of applications
• Segmentation can provide labels or restrict data sets

Concept Description
• Understandable description of concepts or classes
• Close connection to both segmentation and classification
• Similarities and differences to classification

Prediction (Regression)
• Finds the numerical value of the target attribute for unseen objects
• Similar to classification; the difference: discrete becomes continuous

Dependency Analysis
• Finding the model that describes significant dependencies between data items or events
• Prediction of the value of a data item
• Special case: associations

Data Mining Models

Neural Networks
• Characterizes processed data with a single numeric value
• Efficient modeling of large and complex problems
• Based on biological structures: neurons
• Network consists of neurons grouped into layers

Neuron Functionality
[Diagram: inputs I1, I2, I3, …, In with weights W1, W2, W3, …, Wn feed a function f that produces Output]
Output = f(W1*I1, W2*I2, …, Wn*In)

Training Neural Networks

Neural Networks - Conclusion
• Once trained, Neural Networks can efficiently estimate the value of the output variable for a given input
• Neurons and network topology are essential
• Usually used for prediction or regression problem types
• Difficult to understand
• Data pre-processing often required

Decision Trees
• A way of representing a series of rules that lead to a class or value
• Iterative splitting of data into discrete groups, maximizing the distance between them at each split
• Classification trees and regression trees
• Univariate splits and multivariate splits
• Unlimited growth and stopping rules
• CHAID, CART, QUEST, C5.0

Decision Trees
[Diagram: example tree with splits Balance>10 / Balance<=10, Age<=32 / Age>32, Married=NO / Married=YES]

Rule Induction
• Method of deriving a set of rules to classify cases
• Creates independent rules that are unlikely to form a tree
• Rules may not cover all possible situations
• Rules may sometimes conflict in a prediction

Rule Induction
If balance>100.000 then confidence=HIGH & weight=1.7
If balance>25.000 and status=married then confidence=HIGH & weight=2.3
If balance<40.000 then confidence=LOW & weight=1.9

K-nearest Neighbor and Memory-Based Reasoning (MBR)
• Uses knowledge of previously solved similar problems to solve a new problem
• Assigns the new case to the class to which most of its k "neighbors" belong
• First step: finding a suitable measure for the distance between attributes in the data
• How far is black from green?
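The distance question ("How far is black from green?") can be made concrete with a small sketch. It assumes one simple choice of mixed distance: absolute difference for numeric attributes (already scaled to comparable ranges) and a 0/1 mismatch for categorical ones, so black is exactly as far from green as from any other different color. The records, attribute meanings, class labels and the helper names `distance` and `knn_classify` are hypothetical and not taken from the slides.

```python
from collections import Counter


def distance(a, b):
    """Mixed distance: |x - y| for numbers, 0/1 mismatch for categories."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += abs(x - y)
        else:
            total += 0.0 if x == y else 1.0
    return total


def knn_classify(query, examples, k=3):
    """Return the majority class among the k stored examples nearest to query."""
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical solved cases: (age scaled to 0..1, phone color) -> class
examples = [
    ((0.20, "black"), "stays"),
    ((0.25, "black"), "stays"),
    ((0.30, "green"), "stays"),
    ((0.80, "green"), "churns"),
    ((0.85, "red"), "churns"),
    ((0.90, "green"), "churns"),
]

print(knn_classify((0.22, "black"), examples))
print(knn_classify((0.88, "green"), examples))
```

Nothing is trained here: every stored case is kept and scanned at query time, so the "model" is the data itself and grows with it.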
+ Easy handling of non-standard data types
- Huge models

Data Mining Models and Algorithms
• Many other available models and algorithms
    • Logistic regression
    • Discriminant analysis
    • Generalized Additive Models (GAM)
    • Genetic algorithms
    • Etc…
• Many application-specific variations of known models
• Final implementation usually involves several techniques
• Selection of the solution that gives the best results

Efficient Data Mining

[Flowchart with YES/NO branches; boxes: Is It Working? / Don't Mess With It! / Did You Mess With It? / You Shouldn't Have! / Anyone Else Knows? / You're in TROUBLE! / Hide It / Can You Blame Someone Else? / Will It Explode In Your Hands? / Look The Other Way / NO PROBLEM!]

DM Process Model
• 5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)
• SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)
• CRISP-DM – is becoming a standard

CRISP-DM
• CRoss-Industry Standard Process for DM
• Conceived in 1996 by three companies:

CRISP-DM methodology
Four-level breakdown of the CRISP-DM methodology:
• Phases
• Generic Tasks
• Specialized Tasks
• Process Instances

Mapping generic models to specialized models
• Analyze the specific context
• Remove any details not applicable to the context
• Add any details specific to the context
• Specialize generic contents according to concrete characteristics of the context
• Possibly rename generic contents to provide more explicit meanings

Generalized and Specialized Cooking
Generalized:
• Preparing food on your own
• Find out what you want to eat
• Find the recipe for that meal
• Gather the ingredients
• Prepare the meal
• Enjoy your food
• Clean up everything (or leave it for later)
Specialized:
• Raw steak with vegetables?
• Check the cookbook or call mom
• Defrost the meat (if you had it in the fridge)
• Buy missing ingredients or borrow from the neighbors
• Cook the vegetables and fry the meat
• Enjoy your food or even more
• You were cooking, so convince someone else to do the dishes

CRISP-DM model
• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment

Customizing a Web Page
• User-friendly design
• Prediction of the users' interests
• Reduction of server workload
• Reduction of Web traffic

Business Understanding
• Determine business objectives
• Assess situation
• Determine data mining goals
• Produce project plan

Business Understanding - Outputs
• Background
• Business objectives and success criteria
• Inventory of resources
• Requirements, assumptions, and constraints
• Risks and contingencies
• Terminology
• Costs and benefits
• Data mining goals and success criteria
• Project plan
• Initial assessment of tools and techniques

Customizing a Web Page – Business Understanding Example
• Business objectives
    • Make the users' surfing more comfortable
    • Decrease of overhead for users
    • Reduction of workload and Web traffic
• Assess situation
• Data mining goals
    • Find the patterns in the user behavior
• Project plan

Data Understanding
• Collect initial data
• Describe data
• Explore data
• Verify data quality

Data Understanding - Outputs
• Data collection report
    • Background of data
    • List of data sources
    • For each data source, method of acquisition
    • Problems encountered in data acquisition
• Data description report
    • Detailed description of each data source
    • List of tables or other database objects
    • Description of each field including units, codes, etc.
• Data exploration report
    • Expected regularities or patterns and methods of detection
    • Regularities or patterns found (expected and unexpected)
    • Any other surprises
    • Conclusions for data transformation, data cleaning and any other pre-processing
    • Conclusions related to data mining goals or business objectives
• Data quality report
    • Approach taken to assess data quality
    • Results of data quality assessment

Customizing a Web Page – Data Understanding Example
• Collecting the data
    • Update the server to monitor user behavior
    • Record the users' activities in storage
• Data description
• Results of data exploring
    • Analyze recorded data
    • Decide which data is usable for mining
• Verification of the quality of the data

Data Preparation
• Select data
• Clean data
• Construct data
• Integrate data
• Format data

Data Preparation - Outputs
• Dataset
• Dataset description report
    • Background including broad goals and plan for pre-processing
    • Description of pre-processing
    • Detailed description of resultant datasets
    • Rationale for inclusion/exclusion of attributes
    • Discoveries made during pre-processing and implications for further work

Customizing a Web Page – Data Preparation Example
• Decide the period from which the users' monitored actions will be considered
• Make assumptions about unnecessary monitored data and discard it
• Classify user actions into categories, group interesting links, etc…
• If more information about the user is available from other sources, use it
• Transform data into suitable forms so that several modeling techniques can be applied

Modeling
• Select modeling technique
• Generate test design
• Build model
• Assess model

Modeling - Outputs
• Test design
    • Broad description of the type of model and the training data to be used
    • Explanation of how the model will be tested or assessed
    • Description of any data required for testing
    • Description of any planned examination of models by domain or data experts
• Model description
    • Type of model and relation to data mining goals
    • Parameter settings used to produce the model
    • Detailed description of the model and any special features
    • Conclusions regarding patterns in the data
• Model assessment
    • Overview of assessment process, including deviations from the test plan
    • Detailed assessment of the model
    • Comments on models by domain or data experts
    • Insights into why a certain modeling technique and certain parameter settings lead to good/bad results

Customizing a Web Page – Modeling Example
• The problem is prediction of behavior
• Regression could be a good solution due to the distinct nature of the data
• Create the software according to the project plan
• Observe the behavior of the software
• Tune the model after each evaluation phase if needed

Evaluation
results = models + findings
• Evaluate results
• Review process
• Determine next steps

Evaluation - Outputs
• Assessment of DM results with respect to business success criteria
    • Review of Business Objectives and Business Success Criteria
    • Comparison between success criterion and DM results
    • Conclusion about achievability of success criterion and suitability of data mining process
    • Review of "Project Success"
    • Are there new business objectives?
• Review of process
• List of possible actions
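The comparison between a success criterion and the DM results can be sketched with the churn example from the opening slides. The 40% turnover, the $125 phone and the $250 commission come from those slides; the number of customers, the counts of correctly and wrongly flagged customers, the assumption that a lost customer is replaced by a new contract, the assumption that an offered phone always retains a churner, and the 80% budget criterion are all hypothetical values chosen only for illustration.

```python
PHONE_COST = 125        # from the slides: free phone per contract
COMMISSION = 250        # from the slides: sales commission per contract
CHURN_RATE = 0.40       # from the slides: turnover after contract expiry

# Hypothetical values, not from the slides
customers = 1000                  # contracts expiring in the period
caught, false_alarms = 320, 320   # churners flagged / loyal customers flagged
max_cost_vs_baseline = 0.80       # criterion: spend at most 80% of doing nothing

churners = round(customers * CHURN_RATE)
# Assumption: each lost customer is replaced by a new contract (phone + commission).
replace_cost = PHONE_COST + COMMISSION


def total_cost(phones_given, churners_lost):
    """Cost of retention phones plus the cost of replacing lost customers."""
    return phones_given * PHONE_COST + churners_lost * replace_cost


# Simplifying assumption: every churner who is offered a phone stays.
costs = {
    "do nothing": total_cost(0, churners),
    "phone for everyone": total_cost(customers, 0),
    "phone for predicted churners": total_cost(caught + false_alarms,
                                               churners - caught),
}
budget = max_cost_vs_baseline * costs["do nothing"]
for strategy, cost in costs.items():
    verdict = "meets" if cost <= budget else "misses"
    print(f"{strategy}: ${cost} ({verdict} the criterion)")
```

Under these made-up numbers only the model-driven strategy stays within the budget; with a weaker model (fewer churners caught or more false alarms) the same check would fail, which is the signal to step back to an earlier phase.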

Customizing a Web Page – Evaluation Example
• Observe the model behavior at work
• Collect responses from beta testers
• Check user satisfaction
• Check server and network engagement
• Classify results
• Determine which parameters of the model should be changed
• Present new ideas and modifications
• Step back into previous phases as needed

Deployment
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project

Deployment - Outputs
• Monitoring and maintenance plan
    • Overview of deployment results and indication of which results may require updating
    • Description of how updating will be triggered
    • Description of how updating will be performed
• Final report
    • Summary of Business Understanding (background, objectives and success criteria)
    • Summary of data mining process
    • Summary of data mining results
    • Summary of results evaluation
    • Summary of deployment and maintenance plan
    • Cost/benefit analysis
    • Conclusions for the business
    • Conclusions for future data mining

Customizing a Web Page – Deployment Example
• Make the feature available to all users
• Make a plan for maintenance and user feedback
• Analyze costs and benefits
• Summarize the whole documentation
• Summarize additional network and server activity
• Collect the new ideas
• Award according to results
• Leave room for upgrades

At Last…

Available Software
Discussion of data mining vendors and software is not included in this slide set

Conclusions

WWW.NBA.COM

Se7en

CD – ROM

Credits
Anne Stern, SPSS, Inc.
Djuro Gluvajic, ITE, Denmark
Obrad Milivojevic, PC PRO, Yugoslavia

The END