Data Mining
Dr. Mohsen Kahani
http://www.um.ac.ir/~kahani/

Motivation: “Necessity is the Mother of Invention”
- Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We are drowning in data, but starving for knowledge!

Dr. Kahani - Expert Systems and Knowledge Engineering

Related Fields
Fields related to Data Mining and Knowledge Discovery:
- Machine Learning
- Visualization
- Statistics
- Databases

Knowledge Discovery Process
[Figure: knowledge discovery pipeline. Labels: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]

Data Mining and Business Intelligence
Layers, from top to bottom, with increasing potential to support business decisions toward the top:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Definition of Data Mining
“…The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]

Need for Data Mining
- Data accumulate and double every 9 months.
- There is a big gap from stored data to knowledge, and the transition won’t occur automatically.
- Manual data analysis is not new, but it is a bottleneck.
- Fast-developing computer science and engineering generates new demands.
- Seeking knowledge from massive data

When is DM useful
- Data-rich world
- Large data (dimensionality and size)
  - Image data (size)
  - Gene chip data (dimensionality)
- Little knowledge about data (exploratory data analysis)
- What if we have some knowledge?

Challenges
- Increasing data dimensionality and data size
- Various data forms
- New data types
  - Streaming data, multimedia data
- Efficient search and access to data/knowledge
- Intelligent update and integration
- Privacy concerns

Results of Data Mining Include:
- Forecasting what may happen in the future
- Classifying people or things into groups by recognizing patterns
- Clustering people or things into groups based on their attributes
- Associating what events are likely to occur together
- Sequencing what events are likely to lead to later events

Data Mining versus OLAP
- OLAP: On-line Analytical Processing
- Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening

Data Mining Versus Statistical Analysis
Data Mining
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique
- If it makes sense, then let’s use it
- Does not require assumptions to be made about data
- Can find patterns in very large amounts of data
- Requires understanding of data and business problem
Data Analysis
- Tests for statistical correctness of models
  - Are statistical assumptions of models correct? E.g., is the R-square good?
- Hypothesis testing
  - Is the relationship significant?
  - Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills

Data Mining Taxonomy
- Predictive Method: …predict the value of a particular attribute…
- Descriptive Method: …foundation of human-interpretable patterns that describe the data…

Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Deviation Detection [Predictive]

Data Mining Tasks: Classification
- Learn a method for predicting the instance class from pre-labeled (classified) instances
- Many approaches: Statistics, Decision Trees, Neural Networks, ...

Classification: Linear Regression
- Linear regression: w0 + w1 x + w2 y >= 0
- Regression computes wi from data to minimize squared error to ‘fit’ the data
- Not flexible enough

Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plot with split lines at X = 2, X = 5, and Y = 3]

Example Decision Tree

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Tree (splitting attributes: Refund, MarSt, TaxInc):
Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

The splitting attribute at a node is determined based on the Gini index.
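
The slide names the Gini index but does not show how it is computed. The following is a minimal sketch, assuming the usual definition (one minus the sum of squared class proportions) and three hypothetical binary candidate splits chosen here for illustration; the function names `gini` and `split_gini` and the `CANDIDATE_SPLITS` table are not from the slides. The ten records are the ones in the Example Decision Tree table.

```python
from collections import Counter

# Training records from the Example Decision Tree table:
# (refund, marital_status, taxable_income_in_thousands, cheat)
ROWS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, test):
    """Size-weighted Gini impurity of the two groups produced by a boolean test."""
    passed = [r[3] for r in rows if test(r)]
    failed = [r[3] for r in rows if not test(r)]
    n = len(rows)
    return len(passed) / n * gini(passed) + len(failed) / n * gini(failed)


# Hypothetical binary candidate splits (an assumption for illustration).
CANDIDATE_SPLITS = {
    "Refund == Yes": lambda r: r[0] == "Yes",
    "MarSt == Married": lambda r: r[1] == "Married",
    "TaxInc < 80K": lambda r: r[2] < 80,
}

if __name__ == "__main__":
    print("Gini before splitting:", round(gini([r[3] for r in ROWS]), 4))
    for name, test in CANDIDATE_SPLITS.items():
        print(name, "->", round(split_gini(ROWS, test), 4))
```

A lower weighted value means purer child nodes. Which split a real tree learner picks depends on the candidate splits it enumerates, so this sketch is not a reconstruction of how the tree on the slide was built.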

Classification: Neural Networks
- efficiently model large and complex problems;
- may be used in classification problems or for regressions;
- start with an input layer => hidden layer => output layer
[Figure: six-node network, labeled Inputs, Hidden Layer, Output]

Neural Networks (cont.)
- can be easily implemented to run on massively parallel computers;
- cannot be easily interpreted;
- require an extensive amount of training time;
- require a lot of data preparation (involving very careful data cleansing, selection, preparation, and preprocessing);
- require a sufficiently large data set and a high signal-to-noise ratio.

Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; the Test Set is then applied to the Model]

Classification Application
- Direct Marketing
- Fraud Detection
- Customer Attrition/Churn
- Sky Survey Cataloging

Data Mining Tasks: Clustering
- Goal is to identify categories
- Natural grouping of customers by processing all the available data about them.
- Other applications: market segmentation, discovering affinity groups, and defect analysis

Kohonen Network
Description:
- unsupervised
- seeks to describe dataset in terms of natural clusters of cases

Data Mining Tasks: Association Rule Discovery
- Given a set of records, each of which contains some number of items from a given collection;
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule Discovery Application
- Marketing and Sales Promotion
- Supermarket Shelf Management
- Inventory Management

Deviation Detection & Pattern Discovery
- Deviation Detection: …discovering most significant changes in data from previously measured or normative values… (V. Kumar, M. Joshi, Tutorial on High Performance Data Mining)
- Sequential Pattern Discovery: …process of looking for patterns and rules that predict strong sequential dependencies among different events… (V. Kumar, M. Joshi, Tutorial on High Performance Data Mining)
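
Returning to the five-transaction table in the Association Rule Discovery slide: the slides list the discovered rules but not how a rule is scored. The sketch below assumes the two standard measures, support and confidence, which the slides do not name; the helper names `support` and `confidence` are illustrative only. It evaluates the two rules listed on the slide against the five transactions.

```python
# Transactions from the Association Rule Discovery table (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the fraction also containing consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(set(antecedent) | set(consequent), transactions) / base


if __name__ == "__main__":
    rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
    for lhs, rhs in rules:
        print(
            sorted(lhs), "-->", sorted(rhs),
            "support:", round(support(lhs | rhs, TRANSACTIONS), 3),
            "confidence:", round(confidence(lhs, rhs, TRANSACTIONS), 3),
        )
```

A rule miner would typically keep only rules whose support and confidence exceed chosen thresholds; the thresholds themselves are a modeling decision and are not given in the slides.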

Sequential Patterns
- Identify frequently occurring sequences from given records
- 40 percent of female customers buy a gray skirt six months after buying a red jacket

Data Mining Methodology: SAS
- Sample: extract a portion of the dataset for data mining
- Explore
- Modify: create, select and transform variables with the intention of building a model
- Model: specify a relationship of variables that reliably predicts a desired goal
- Assess: evaluate the practical value of the findings and the model resulting from the data mining effort

Data Mining Methodology: CRISP-DM
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment

CRISP-DM Phases

Phases and Tasks
(each task is followed by its outputs)

Business Understanding
- Determine Business Objectives: Background; Business Objectives; Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report

Data Preparation
- Data Set; Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes; Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data

Modeling
- Select Modeling Technique: Modeling Technique; Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings; Models; Model Description
- Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions; Decision

Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report; Final Presentation
- Review Project: Experience Documentation

Major Application Areas for Data Mining Solutions
- Fraud/Non-Compliance Anomaly detection
  - Isolate the factors that lead to fraud, waste and abuse
  - Target auditing and investigative efforts more effectively
- Credit/Risk Scoring
- Intrusion detection
- Parts failure prediction
- Recruiting/Attracting customers
- Maximizing profitability (cross selling, identifying profitable customers)
- Service Delivery and Customer Retention
  - Build profiles of customers likely to use which services
- Web Mining
- Health Care

Controversial Issues
- Data mining (or simple analysis) on people may come with a profile that would raise controversial issues of
  - Discrimination
  - Privacy
  - Security
- Examples:
  - Should males between 18 and 35 from countries that produced terrorists be singled out for search before flight?
  - Can people be denied mortgage based on age, sex, race?
  - Women live longer. Should they pay less for life insurance?

Data Mining and Discrimination
- Can discrimination be based on features like sex, age, national origin?
- In some areas (e.g. mortgages, employment), some features cannot be used for decision making
- In other areas, these features are needed to assess the risk factors
  - E.g. people of African descent are more susceptible to sickle cell anemia

Data Mining and Privacy
- Can information collected for one purpose be used for mining data for another purpose?
  - In Europe, generally no, without explicit consent
  - In US, generally yes
- Companies routinely collect information about customers and use it for marketing, etc.
- People may be willing to give up some of their privacy in exchange for some benefits
- See Data Mining And Privacy Symposium, www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html

Data Mining and Privacy
- Data Mining looks for patterns, not people!
- Technical solutions can limit privacy invasion
  - Replacing sensitive personal data with anon. ID
  - Give randomized outputs
  - Multi-party computation - distributed data
  - …

The Hype Curve for Data Mining and Knowledge Discovery
[Figure: hype curve showing Expectations and Performance over time (axis ticks 1990, 1998, 2000, 2002). Stage labels: rising expectations, Over-inflated expectations, Disappointment, Growing acceptance and mainstreaming]

Final Remarks
- Data Mining can be utilized for any field that needs to find patterns or relationships in their data.