Data Mining

Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques

Examples
• “On Friday evenings, shoppers who buy diapers also buy beer.”
  – Supermarket transaction database
• “People with good credit ratings have fewer accidents.”
  – Insurance database, http://wtonline.com
• “A one-dollar gas station credit-card transaction followed by a large transaction is likely to be indicative of fraud.”
  – Credit card transactions database

More Examples
• Marketing
  – Targeted marketing using decision trees
• Stock selection / Fraud detection
  – Using neural networks
• Telecommunications
  – Churn modeling, identifying valuable customers

Even More Examples
• Healthcare
  – Fish oil and Raynaud’s disease
• Finding communities on the Web
  – Abortion example
• Personalization
  – Recommender systems

Even More More Examples
• Games (e.g. Hollywood Stock Exchange)
  – www.hsx.com
• Viral Marketing
  – Social networks and network mining
• Sports
  – NBA Scout

Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques

What is Data Mining?
• Querying large databases?
• Learning patterns from data?
• Building models from data?

What is Data Mining?
• Learning “structure” from large data
  – “reverse engineering”
  – “structure” could be patterns or models
• How is this different from statistics?

Data mining techniques
• Lots of them exist!
• How to categorize these?
  – Two approaches
    • Description vs prediction
    • RES (Representation, Evaluation, Search) framework

Classification of the main engines/techniques
Main Use: Description, Prediction
Technique:
• OLAP
• Decision Trees
• Support Vector Machines
• Neural Nets
• Rule Discovery Methods
• Clustering Methods
• Genetic Algorithms
• Nearest neighbor
• Expert Systems
• Fuzzy logic systems
• Bayesian Approaches

Representation, Evaluation & Search: Linear Model Example
• Representation
  – Risk = 0.93*prior_default + 0.23*num_cards – 1.3*employed – 0.734
• Evaluation
  – R-squared/degree of fit
• Search
  – How did the technique find the coefficients?

Representation, evaluation and search
• Different techniques represent, evaluate and search for patterns differently.
  – Methods can be characterized based on how they do these things.
• Data mining methods use very different representation schemes, use predictive accuracy as the main evaluation measure, and use heuristic search procedures.
• Strengths: Can build very accurate models and learn interesting patterns in a bottom-up manner
• Weaknesses: Can find false patterns and may “overfit” the learning data
  – How to mitigate these?
• This is one way to think about the difference between DM methods and traditional statistical methods

Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques

The Industry Space
• Data gathering and management
  – External data sources
  – Integrating databases to design unified views
    • For real-time support
    • For historical warehouse-driven apps
• Firms
  – Data vendors, consulting services

Customer Centric Architecture
[Figure labels: channels (email, web, phone, golf course), Action, Database, Other Data Sources]

The Industry Space
• Broad Data Analytics
  – Traditional statistical tools
  – Data mining tools
• Firms
  – www.kdnuggets.com
  – SPSS, SAS, Trajecta, IBM, SGI, Gainsmarts, HNC Software
• Other common sources
  – In-house analytics development and academia

The Industry Space
• Niche Market Analytics and Services
  – Fraud detection
  – Customer Segmentation
  – Direct Marketing
  – Bioinformatics
  – Internet Advertising
  – Personalization
• Firms
  – Examples: Doubletwist, Celera, HNC Software, Knowledge Stream Partners, Adknowledge (acquired by Engage), Epiphany.

The Industry Space
• Broad CRM Technologies and Services
  – General features
    • Some data collection and integration tools
    • Some analytics and profitability analyses
    • Some features to streamline operations
    • Often customizable based on client needs
    • Boils down to client needs
• Firms
  – E.g. Siebel.

Data Mining Revisited
• Smart techniques
  – Data mining
    • Not a problem.
• Engineering
  – Integrating this into an overall data management architecture
    • The more difficult problem
• When and how to use
  – The hard part is figuring out which problem to solve, what data to use, etc.
  – The importance of thinking “bottom up” for solving problems

The Chief Data Officer

Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques

Example DM Models: Neural Networks
Attempt to mimic the way neurons work in translating input data into an output (dependent variable)

Structure of a Neural Network

Surface-fitters or Function Approximators

Example DM Models: OLAP (Online Analytical Processing)
Provides visual tools to slice and dice the data

Browsing a Data Cube

Example: Clustering
• Identify homogeneous and separable groups (“clusters”) so that there is:
  – maximum similarity between points within a group
  – maximum difference between groups
• Applications
  – Group customers into categories useful for targeted marketing.
  – Identify clusters in image data

What clusters can look like

Example: Classification

Example: Nearest neighbor methods
Read “Amazon.com recommendations” paper

Online Recommender Systems
• Opportunities
  – Customized stores and all the associated benefits
  – Easy measurement
  – Permits experimentation
• Challenges
  – Scale (tens of millions of users, and millions of items)
  – Need for real-time results
  – Amount of info on customers varies, but data is often sparse

Simple collaborative filtering
[Matrix: rows are customers C1, C2, C3, …, Cn; columns are items I1, I2, I3, …, Im; entries are zeros and ones]
1. Let C1 be the vector of zeros and ones corresponding to customer 1.
2. Define similarity between customers A and B as cos(A, B) = (A · B) / (||A|| · ||B||)
3. In traditional collaborative filtering, for a given customer find the closest customer and then recommend the other products purchased by this closest customer.
Advantages and Disadvantages?

Content based recommendations
• Treat recommendations as search for related items.
• E.g. if you liked “Men In Black” you may get recommendations for comedy films.
• Advantages and disadvantages?

Item-to-Item Collaborative Filtering
[Matrix: rows are items I1, I2, I3, …, In; columns are customers C1, C2, C3, …, Cm; entries are zeros and ones]
1. For each item, find all similar items in an offline computation
2. Create a similar items table where, for each item, the set of all related items is stored.
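The two offline steps of item-to-item collaborative filtering can be sketched in a few lines of Python. This is a minimal illustration, not the method from the Amazon.com paper: it assumes the cosine similarity defined for customers is reused on item vectors, and the toy matrix, the `top_k` cutoff, and the function names (`cosine`, `build_similar_items_table`, `recommend`) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length 0/1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item nobody bought is similar to nothing
    return dot / (norm_a * norm_b)


def build_similar_items_table(item_vectors, top_k=2):
    """Offline step: for each item, store its top_k most similar other items."""
    table = {}
    for item, vec in item_vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in item_vectors.items()
            if other != item
        ]
        # Keep only items with some overlap; sort by similarity, then by name.
        scored = [(s, o) for s, o in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[item] = [o for _, o in scored[:top_k]]
    return table


def recommend(purchased, table):
    """Online step: look up related items for what the customer already bought."""
    seen = set(purchased)
    recs = []
    for item in purchased:
        for related in table.get(item, []):
            if related not in seen:
                seen.add(related)
                recs.append(related)
    return recs


# Hypothetical toy data: rows are items, columns are customers C1..C4.
ITEM_VECTORS = {
    "I1": [1, 1, 0, 0],
    "I2": [1, 1, 1, 0],
    "I3": [0, 0, 1, 1],
    "I4": [0, 0, 0, 1],
}
SIMILAR = build_similar_items_table(ITEM_VECTORS)
```

With this toy data, `recommend(["I1"], SIMILAR)` only needs a table lookup, which is why the expensive similarity computation can be pushed offline while the online step stays fast.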

Example: Rule discovery methods
Read: “On the discovery of statistical quantitative rules”

On Evaluation
• Apparently I would like watching movies on gang violence in New York theaters.
  – Why?
    • Because…
• Hamburger grills product recommendation
• On evaluation
  – absolutely critical in a world in which more interactions are being structured automatically
  – “evaluation” has multiple aspects, not just how “accurate” a model may seem to be.

Agenda
• Examples
• What is data mining?
• The Industry comments
• Techniques