Introduction
Jun Du
The University of Western Ontario
[email protected]

Outline
• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kinds of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technologies Are Used?
• What Kinds of Applications Are Targeted?
• Trends and Challenges in Data Mining
• Data Mining Resources
• Summary

Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  – Hardware
  – Data collection and data availability
    • Automated data collection tools, database systems, Web
  – Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, …
    • Society and everyone: news, digital cameras, YouTube, Facebook, …
• We have everything ready for data
  – But data is useless unless it becomes knowledge
• We are drowning in data, but starving for knowledge!

What Is Data Mining?
• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, predictive modeling, data science, business intelligence, etc.
• Examples of data mining:
  – Search engines (Google, Bing, Yahoo, …)
  – Online shopping (Amazon, eBay, …)
  – Social networks (Facebook, LinkedIn, …)
  – Email services (uwo, Gmail, Hotmail, …)
  – …

Knowledge Discovery (KDD) Process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data Mining in Business Intelligence
[Figure: layered pyramid; potential to support business decisions increases toward the top]
• Layers, from top to bottom:
  – Decision Making
  – Data Presentation: Visualization Techniques
  – Data Mining: Information Discovery
  – Data Exploration: Statistical Summary, Querying, and Reporting
  – Data Preprocessing/Integration, Data Warehouses
  – Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
• Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
• Data Pre-Processing
  – Data integration
  – Normalization
  – Feature selection
  – Dimension reduction
• Data Mining
  – Pattern discovery
  – Association & correlation
  – Classification
  – Clustering
  – Outlier analysis
  – …
• Post-Processing
  – Pattern evaluation
  – Pattern selection
  – Pattern interpretation
  – Pattern visualization

Multi-Dimensional View of DM
• Data to be mined
  – What kinds of data can be mined?
• Knowledge to be mined
  – What kinds of patterns can be mined?
• Techniques utilized
  – What technologies are used?
• Applications adapted
  – What kinds of applications are targeted?

On What Kinds of Data?
• Most commonly used:
  – Table data (in raw format or in a relational database)
• Advanced data sets
  – Transaction data
  – Data streams and sensor data
  – Time-series data, temporal data, sequence data
  – Structured data, graphs, social networks, and multi-linked data
  – Spatial data and spatiotemporal data
  – Multimedia data
  – Text data
  – The World-Wide Web
• Poll (June 2011)
  – What data types have you analyzed/mined in the past 12 months?

Association Rule
• Given a set of transaction records, each of which contains some items from a given collection:
  – Produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper} --> {Beer}

• Story of “Diaper” and “Beer”

Association Rule Application 1
Marketing and Sales Promotion:
• Let the rule discovered be {Bagels} --> {Potato Chips}
  – If bagels are on sale, potato chips might sell fast as well.
  – If the store stops selling bagels, potato chip sales might be affected.
  – …
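As a rough illustration of how rules such as {Milk} --> {Coke} can be scored against the five transactions on the Association Rule slide, the Python sketch below computes support and confidence. These two measures are not defined on the slides; they are the standard ones from the association-rule literature and are introduced here as an assumption, as are the function names.

```python
# Transactions copied from the Association Rule slide (TID 1..5).
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, records):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for record in records if items <= record)


def support(itemset, records):
    """Fraction of all transactions that contain the whole itemset."""
    return count(itemset, records) / len(records)


def confidence(lhs, rhs, records):
    """Among transactions containing lhs, the fraction that also contain rhs.

    Raises ZeroDivisionError if lhs never occurs in the records.
    """
    both = set(lhs) | set(rhs)
    return count(both, records) / count(lhs, records)


# Score the two rules listed on the slide.
for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper"}, {"Beer"})]:
    s = support(lhs | rhs, TRANSACTIONS)
    c = confidence(lhs, rhs, TRANSACTIONS)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: support={s:.2f}, confidence={c:.2f}")
```

On this table, {Milk} --> {Coke} holds in 3 of the 5 transactions (support 0.6) and in 3 of the 4 transactions that contain Milk (confidence 0.75). A rule miner would typically keep only rules above chosen support and confidence thresholds; the thresholds themselves are a design choice the slides do not specify.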
Association Rule Application 2
Supermarket shelf management
• Let the rule discovered be {Diaper} --> {Beer}
  – Put beer beside diapers, so customers find it convenient;
  – Or put beer far away from diapers, so customers might pick up other items on their way from “diaper” to “beer”;
  – …

Classification & Regression
• Construct models (functions) based on existing data and make predictions on future unseen data

Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set → Learning Algorithm → Model; the Model is applied to the Test Set]

Classification Application 1
Direct Marketing
– Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
  • Use the data for a similar product introduced before.
  • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
  • Collect customer data (demographic, lifestyle, etc.)
  • Use this information as input attributes to learn a classification model.
• New York Times article (Feb, 2012): How Companies Learn Your Secrets

Classification Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  • Use credit card transactions and information about the account holder as attributes.
    – When, what, and where a customer buys, etc.
  • Label past transactions as fraudulent or fair (class attribute).
  • Learn a model for the class of the transactions.
  • Use this model to detect fraudulent transactions.
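The label, learn, and predict steps above can be sketched on the Refund / Marital Status / Taxable Income table from the Classification & Regression slide. The slides do not say which learning algorithm to use; this sketch assumes a 1-nearest-neighbor classifier (K-Nearest Neighbor appears later in the deck's algorithm list) with a hand-picked distance that counts categorical mismatches and adds the income gap divided by 100. The distance, the scale, and all names are illustrative assumptions, not the lecture's method.

```python
# Training set from the slide: (refund, marital_status, income_in_K, cheat).
TRAIN = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test set from the slide; the class label is unknown ("?").
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def distance(a, b, income_scale=100.0):
    """Assumed mixed distance: categorical mismatches plus scaled income gap."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / income_scale


def predict(record, train=TRAIN):
    """Return the class label of the closest training record (1-NN)."""
    nearest = min(train, key=lambda row: distance(record, row))
    return nearest[3]


for record in TEST:
    print(record, "->", predict(record))
```

The "model" here is simply the stored training set plus the distance function. A decision tree or another learner from the deck's algorithm list would fit the same train-then-predict shape, and may give different answers on the same test records.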
Classification Application 3
Customer Attrition/Churn:
– Goal: Predict whether a cell-phone plan customer is likely to be lost to a competitor.
– Approach:
  • Use detailed records of transactions with each past and present customer to find attributes.
    – How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
  • Label the customers as loyal or disloyal.
  • Find a model for loyalty.

Clustering
• Given a set of data points, each having a set of attributes, group the data points into different clusters.
  – Data points in one cluster are more similar to each other.
  – Data points in separate clusters are less similar to each other.

Clustering Application
Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers, which may be selected as market targets
– Approach:
  • Collect different attributes of customers based on their geographical and lifestyle-related information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Outlier Analysis
• Outlier: A data object that does not comply with the general behavior of the data
• Noise or exception?
  – One person’s garbage could be another person’s treasure
• Methods: classification, regression, clustering, …
• Applications:
  – Credit card fraud detection
  – Network intrusion detection

Other Patterns
• Recommendation systems
  – “people you might know” (Facebook)
  – “jobs you might be interested in” (LinkedIn)
  – “people who bought this product also bought” (Amazon)
  – “movies (TV shows) that you might like to watch” (Netflix)
  – …
• Social network analysis
  – A new and very popular area
  – Can be applied to many applications: fraud detection, marketing, terrorism and crime prevention, …
• …
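The clustering idea from the Clustering slide (points in one cluster are more similar to each other than to points in other clusters) can be illustrated with a minimal k-means sketch. The slides do not prescribe an algorithm here; k-means is chosen only because it appears in the deck's algorithm list. The six two-dimensional points, the starting centroids, and the function names are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical customer attributes, e.g. (visits per month, average basket size).
POINTS = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
CENTROIDS, CLUSTERS = kmeans(POINTS, [POINTS[0], POINTS[3]])
for centroid, members in zip(CENTROIDS, CLUSTERS):
    print(centroid, members)
```

Starting centroids are passed in explicitly so the run is deterministic; in practice they are usually chosen at random, and the result can depend on that choice.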
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Visualization, High-Performance Computing, Database Technology, Algorithms, and Applications]

Top 10 Algorithms in DM
• IEEE International Conference on Data Mining (ICDM) 2006
  1. Decision Trees
  2. The K-Means Algorithm
  3. Support Vector Machines
  4. The Apriori Algorithm
  5. The EM Algorithm
  6. PageRank Algorithm
  7. AdaBoost Algorithm
  8. K-Nearest Neighbor Algorithm
  9. Naive Bayes
  10. CART Algorithm

Algorithms in DM
• KDnuggets Poll (Nov, 2011)
  – Algorithms for data analysis / data mining
• Rexer Analytics Survey (2012)

Applications of Data Mining
• KDnuggets Poll (December, 2011):
  – Industries / Fields where you applied Data Mining in 2011
• Rexer Analytics Survey (2012)

10 Challenging Problems in DM
• IEEE International Conference on Data Mining (ICDM) 2005
  1. Developing a Unifying Theory of Data Mining
  2. Scaling Up for High Dimensional Data and High Speed Data Streams
  3. Mining Sequence Data and Time Series Data
  4. Mining Complex Knowledge from Complex Data
  5. Data Mining in a Network Setting
  6. Distributed Data Mining and Mining Multi-agent Data
  7. Data Mining for Biological and Environmental Problems
  8. Data-Mining-Process Related Problems
  9. Security, Privacy and Data Integrity
  10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

Hot Topics and Trends in DM
• KDnuggets Poll (Jan, 2012)
  – Hottest Analytics / Data Mining Topics in 2012
• Rexer Analytics Survey (2012)

Conferences
• Data Mining Conferences
  – ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  – IEEE Int. Conf. on Data Mining (ICDM)
  – SIAM Data Mining Conf. (SDM)
  – European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  – Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other Related Conferences
  – DB conferences: ACM SIGMOD, VLDB, ICDE
  – Web and IR conferences: WWW, SIGIR, CIKM
  – ML conferences: ICML, NIPS
  – AI conferences: IJCAI, AAAI

Journals and Online Resources
• Data Mining Journals
  – Data Mining and Knowledge Discovery (DMKD)
  – IEEE Trans. on Knowledge and Data Eng. (TKDE)
  – KDD Explorations
  – ACM Trans. on KDD
• Online Resources
  – KDnuggets
  – Kaggle
  – UCI Machine Learning Repository
  – …

Software
• KDnuggets poll (May 2012)
  – What Analytics, Data mining, Big Data software you used in the past 12 months for a real project?
• Rexer Analytics Survey (2012)

Programming Languages
• KDnuggets poll (August 2012)
  – Programming languages for analytics / data mining?

Summary
• Data mining: Discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of science and information technology, in great demand, with wide applications
• A KDD process includes data pre-processing, data mining, data post-processing, and knowledge presentation
• Mining can be performed on a variety of data
• Data mining patterns: association, classification, clustering, outlier analysis, recommendation systems, social network analysis, etc.
• A variety of data mining technologies and applications
• Data mining resources