Theory, Practice & Methodology of Relational Database Design and Programming Copyright © Ellis Cohen 2002-2006 Introduction to Data Mining These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. For more information on how you may use them, please see http://www.openlineconsult.com/db 1 Topics Types of Data Mining Data Mining Activities Estimation: Classification & Approximation Decision Trees Clustering Targeted Clustering & Association Rules Market Basket Analysis Scoring Market Basket Analysis © Ellis Cohen, 2003-2006 2 Types of Data Mining © Ellis Cohen, 2003-2006 3 Goals of Data Mining Find patterns and relationships among objects represented by data points in a dataset Novel / Significant People who buy notebooks buy pencils vs. Men who buy diapers at night also buy beer Understandable / Useful Tradeoff between accuracy and simplicity Causality vs Chance All data has random variations, which can show up as spurious patterns and relationships © Ellis Cohen, 2003-2006 4 Types of Data Mining Covered Estimation (Classification & Approximation) Classify/categorize objects, or approximate some value associated with objects based on their features Clustering Find groups of objects, [some of] whose features are all similar to one another Market Basket Analysis Find collections of items which frequently occur together (in the same basket) & formulate the cause © Ellis Cohen, 2003-2006 5 Objects & Features custid 3043 3174 … age 23 44 … The entire collection of data is called a dataset income 5000 6190 … … Object: Customer (Represented by a Data Point) Features (also called Variables) © Ellis Cohen, 2003-2006 6 Classification Determine target category of an object, based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarType Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars Objects: Customers Note difficult separation Monthly Income x Region x x x x x x # #ox x xx x xx x x o x o x 
xxx x ox # o o x ox o x o # o x # o o # o x o o# o # o o oo x oxo o o x o o o # o x o o o # o # # # # o o o o # # #o # # o # #o # # x # # # # # # # # Draw regions in which one category predominates Age o o Region # Region When a new customer comes along, you can categorize what kind of car they're likely to buy based upon which region they fall into © Ellis Cohen, 2003-2006 7 Approximation Approximate some (continuous) target value of an object based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarCost -- amt spent by customer on a car Objects: customers Find a function f(income,age) that gives a good estimate of the actual value Monthly Income o o oo o oo oo o o o o o o oo o o o o o oo o o o oo o o o o o o o o oooo o o o o o oo o o o o o o oo oooo o o o o o o o o o o o o o o o o o o o o o o o o o o o o o ooo o o o o o o o Age o o o o When a new customer comes along, you can apply f to their income & age to estimate how much they're likely to spend on a car © Ellis Cohen, 2003-2006 8 Applications of Estimation Sales Estimate how much a customer is likely to spend Determine amount of effort/money to put into capturing a potential customer Credit Decide whether to approve a credit application Determine someone's credit limit Evaluate whether a credit card is stolen Medicine Determine whether a patient has a disease based on symptoms and test results Assess which protocol should be used to treat an illness © Ellis Cohen, 2003-2006 9 Clustering Find groups of objects, [some of] whose features are all similar to one another Objects: Corvette Buyers x x x x x x x x x xx x x xxx xx xx x x x x x Monthly Income FEATURES x Identify Cluster e.g. 
by center & radius Age © Ellis Cohen, 2003-2006 10 Applications of Clustering Marketing Target advertising for each cluster; use media whose demographics match cluster Astronomy Find galactic clusters; explore large scale structure of the universe Epidemiology Find things in common among people with the same illness (esp location) Government Policy & Planning Identify regions with similar features (e.g. economic, industrial, geographic) for policies and planning (e.g. land use, economic assistance, office locations, bus routes) © Ellis Cohen, 2003-2006 11 Market Basket Analysis Organize dataset into baskets. Find groups of items which frequently occur together in baskets 11-Feb-99 11-Feb-99 11-Feb-99 … 11-Feb-99 11-Feb-99 … 13-Feb-99 13-Feb-99 Rules capture causality Joe Joe Joe Diapers Formula Beer Simha Pretzels Simha Beer Sasha Diapers Sasha Beer Basket: Daily Shopping by a Customer Diapers and beer occur together frequently Item: Product purchased NO! People who buy beer are not more likely Beer Diapers? to buy diapers YES! People who buy diapers are more likely Diapers Beer? to buy beer (esp men at night) © Ellis Cohen, 2003-2006 12 Applications of Market Basket Analysis Marketing Baskets: Daily Shopping Items: Products Purchased Controlling Customer Traversal in Stores Coupons Recommendations (e.g. 
Amazon) Semantic Nets Baskets: Documents Items: Words/Phrases Use for Semantic Search Plagiarism Detection Baskets: Sentences Items: Documents © Ellis Cohen, 2003-2006 13 Data Mining Approaches Deterministic Heuristic K-Means Clustering Agglomerative CLIQUE Regression Neural Nets Estimation Bayesian Networks Genetic Algorithms K-Nearest Neighbor Decision Trees Mkt Basket Apriori Analysis Produces "best" possible model or prediction Produces reasonably good model or prediction © Ellis Cohen, 2003-2006 14 Data Mining Activities © Ellis Cohen, 2003-2006 15 Data Mining Activities Diagram Forensics Analyzer Anomalies Model Discovery Detector/ Predictor Modelling Detection Prediction Sample Data Live Data © Ellis Cohen, 2003-2006 Predictions 16 Data Mining Activities Discovery/Modeling Using an existing sample dataset to develop a model which characterizes or describes the data Forensics Finding anomalous data points in an existing sample dataset – those that do not match the discovered model, and determining the cause (which may involve another round of Discovery/Modeling) Prediction Using the discovered model to predict an unknown feature value of a new live data point Detection Detect new live data points which are anomalous – those that do not match the discovered model, and determining the cause (more Discovery/Modeling) © Ellis Cohen, 2003-2006 17 Applications of Modeling & Prediction Clustering Model to decide on a targeted advertising program Predict whether a web user is in a target group for a product, and if so, show them the ad prepared for that group. 
Market Basket Analysis Use the model to decide on store layout, sales & promotions Use Predictions to delivery Personalized Coupons at checkout Classification/Approximation Model to decide marketing program for lessfocused mass media advertising Predict how an individual potential customer will behave, and personalize sales approach to that one customer © Ellis Cohen, 2003-2006 18 Data Mining Planning Data, Mining Type & Activity Selection What data do you want to mine and how do you want to mine it? Data Cleaning & Transformation Does the data need to be prepared so the mining will work correctly? Evaluation How will you be able to tell whether the results are good? Visualization & Scoring Mining Type & Activity Details What approach will you use to implement the mining type & activity, and with what control parameters? © Ellis Cohen, 2003-2006 19 Integrated DW/DM Process Data Sources Data Warehouse Data Mining Store More ETL ETL OLAP & Visualization Data Mining • Episodic • Strategic • Continuous © Ellis Cohen, 2003-2006 20 Estimation: Classification & Approximation © Ellis Cohen, 2003-2006 21 Classification Determine target category of an object, based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarType Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars Objects: Customers x x x x x x # #ox x xx x xx x x o x o x xxx x ox # o o x ox o x o # o x # o o # o x o o# o # o o oo x oxo o o x o o o # o x o o o # o # # # # o o o o # # #o # # o # #o # # x # # # # # # # # Draw regions in which one category predominates Note difficult separation Monthly Income Age o x Region o Region # Region When a new customer comes along, you can categorize what kind of car they're likely to buy based upon which region they fall into © Ellis Cohen, 2003-2006 22 Approximation Approximate some (continuous) target value of an object based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarCost -- amt spent by customer on a 
car Objects: customers Find a function f(income,age) that gives a good estimate of the actual value [scatter plot of customers by Age and Monthly Income] When a new customer comes along, you can apply f to their income & age to estimate how much they're likely to spend on a car © Ellis Cohen, 2003-2006 23 Estimation Activities Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions Modeling Come up with a way of estimating the target value of data items (only known for sampled data, not live data) based on other features Forensics Understand why some data items have values which significantly differ from the estimated value Prediction Estimate the (unknown) target value of live data items based on the known features Detection When the live data's unknown target value becomes known, find items whose target value doesn't match the estimated value © Ellis Cohen, 2003-2006 24 Model Characteristics Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions Characteristics of Models: • Transparent (Descriptive): Understandable • Opaque (Predictive): Not particularly understandable; its sole purpose is to drive a predictor/detector/analyzer • Raw Data: The retained sample data is provided directly to the predictor/detector/analyzer © Ellis Cohen, 2003-2006 25 Training & Testing Sample Set Training Set Testing Set Testing Set Testing Set Testing Set Testing Set Use the training set to build the model Use the testing sets to tweak and validate the model © Ellis Cohen, 2003-2006 26 Estimation Approaches Models Raw Data Transparent Opaque Classification Approximation K Nearest Neighbor Bayesian Network Linear Regression Decision Trees Regression Trees Feed-Forward Neural Networks Genetic
Algorithms © Ellis Cohen, 2003-2006 27 Decision Trees © Ellis Cohen, 2003-2006 28 Classification Determine target category of an object, based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarType Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars Objects: Customers x x x x x x # #ox x xx x xx x x o x o x xxx x ox # o o x ox o x o # o x # o o # o x o o# o # o o oo x oxo o o x o o o # o x o o o # o # # # # o o o o # # #o # # o # #o # # x # # # # # # # # Draw regions in which one category predominates Note difficult separation Monthly Income Age o x Region o Region # Region When a new customer comes along, you can categorize what kind of car they're likely to buy based upon which region they fall into © Ellis Cohen, 2003-2006 29 Motivating Decision Trees 1 x x x x xxx xxx x x # x x x x x oxxxx x oooo xxx xo o o x o o o oo o o x xx x x x o o oo# o o o x o # o oo # o o # o## oooo o ooo o x o o oo o # # ### # # # # oo o o # # ### # # # #o ### # o # o # # # # o# # # # # ## x x x x xxx xxx x x # x x x x x oxxxx x oooo xxx xo o o x o o o oo o o x xx x x x o o oo# o o o x o # o oo # o o # o## oooo o ooo o x o o oo o # # ### # # # # oo o o # # ### # # # #o ### # o # o # # # # o# # # # # ## o x x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o x xxxxxo oo o x o o o o o o xoo ooo # # o o o o #o##o## oo # oo o ooo o x o oo o # # ## # oo o o # # #### # #### # # o # # o # ## # ## o o o # # ## # ## © Ellis Cohen, 2003-2006 o x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o o# o o o oo o x xx x o x x o o o xoo ooo # # o o o oo o ooo o x o oo o o # ## o## ### # # # # oo o o # # ### # # # ## o # # o # ## # ## o o # # ## # ## o 2 30 x o o # # o Decision Tree Construction x x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o x xxxxxo oo o x o o o o o o xoo ooo # # o o o o #o##o## oo # oo o ooo o x o oo o # # ## # oo o o # # #### # income #### # # o # # o # ## # ## o o o # # ## # ## age x o o # # o income < 
3000 # income < 6000 age < 35 # # o age < 45 o # x age < 52 o … o … © Ellis Cohen, 2003-2006 age < 54 x o … 31 Decision Tree Algorithm Given a set of variables to consider, in each iteration the algorithm – decides which variable to split – where (at what value) to split it in order to find the best possible separation of tuples © Ellis Cohen, 2003-2006 32 Decision Tree Construction Issues Splitting Which variable to split & where Gini, Twoing, Entropy, Chi-Square, Gain, … Binary or n-ary splits, esp for categorical variables Use multiple approaches and pick best Biased Splitting Non-symmetric misclassification costs safety misclassification (e.g. O rings) Category Importance fraud detection Penalties on Variables acquisition cost, questionable values Case weights different distributions of sample & live data © Ellis Cohen, 2003-2006 33 Linear Combination Splitting x x x x o # x xx x xxxx x x x x x x x x x xxxx oooo x x xx x x o oo o o o o x x x xo o o o o o o x o o o o # o ooo # o o x # # ### o # oo o ooo oo # # # # ## # # ### ## # # # # # # # # # #### o # ## o# ## # # # # ## # ## o income < 40*age + 400 x o x xxxxx xx x x x x x x x x x x x x# xxxx oooo x o x xx x x o o o o o x x x xo o o o o oo o x o o o o # o # o oo o o x# # # ## ### o oo o o o o o #o # # # # # # # ## # # # ## #o # # # ## # # # # # ##o # # # ## # # # # ## o © Ellis Cohen, 2003-2006 34 Overfitting income < 3000 x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o o# o o o oo o x xx x o x x o o o xoo ooo # # o o o oo o ooo o x o oo o o # ## o## # # ## # oo o o # # #### # income ### # # o # # o # ## # ## o o # # ## # ## age … income < 6000 o Prevent overfitting by • Stopping Need criteria Danger of stopping too early • Pruning Build full tree Cut back when testing … o age < 52 … o income < 4500 # o age < 59 # # age < 61 o © Ellis Cohen, 2003-2006 # 35 Classification Rule Extraction x x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o x xxxxxo oo o x o o o o o o xoo ooo # # o o 
o o #o##o## oo # oo o ooo o x o oo o # # ## # oo o o # # #### # income #### # # o # # o # ## # ## o o # # ## # ## age x o o # # o Extract a rule for each region (25 ≤ age ≤ 35) Λ (income < 3000) CarType = 'midrange' (i.e. 'o') Support: 7/94 7.4% Confidence: 6/7 86% Support( A ) = # of objects satisfying A / # of total objects: Does the region determined by A have enough pts to matter? Confidence( A B ) = # satisfying A & B / # satisfying A: How confident are we that a point in the region determined by A also satisfies B? © Ellis Cohen, 2003-2006 36 Classification Ruleset (income < 3000) & (age < 35) '#' (income < 3000) & (35 <= age < 45) 'o' (income < 3000) & (45 <= age) '#' (3000 <= income < 4500) & (age < 20) 'o' … ((income < 3000) & (35 <= age < 45) | (3000 <= income < 4500) & (age < 20)) | … 'o' (income < 3000) & ((age < 35) | (45 <= age)) … '#' © Ellis Cohen, 2003-2006 37 Using Decision Trees Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions The Decision Tree (or the extracted Classification Rules) are a Transparent (understandable) Model. How is the Model used for Forensics, Prediction & Detection? © Ellis Cohen, 2003-2006 38 Clustering © Ellis Cohen, 2003-2006 39 Clustering Find groups of objects, [some of] whose features are all similar to one another Objects: Corvette Buyers x x x x x x x x x xx x x xxx xx xx x x x x x Monthly Income FEATURES x Identify Cluster e.g. by center & radius Age © Ellis Cohen, 2003-2006 40 Clustering Activities Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions Modeling Description of each cluster • Cluster boundaries • For compact clusters: subspaces + centroid + radius Forensics Identify and explain outliers (points not in a cluster) Detection/Predication Does live data cluster in the same way? 
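A centroid/radius cluster model like the one above can be checked directly against live data points. The following is a minimal Python sketch (the function names and example numbers are our own, hypothetical choices), using a standard-deviation-scaled Euclidean distance so that age and income are commensurate:

```python
import math

def scaled_distance(p, q, sigma):
    """Euclidean distance with each dimension divided by its standard deviation."""
    return math.sqrt(sum(((pi - qi) / s) ** 2 for pi, qi, s in zip(p, q, sigma)))

def in_cluster(point, centroid, radius, sigma):
    """A live point matches the cluster model if its scaled distance
    to the centroid is within the cluster radius."""
    return scaled_distance(point, centroid, sigma) <= radius

# Hypothetical cluster model over (age, monthlyIncome)
centroid = (21.0, 5000.0)
sigma = (10.0, 2000.0)   # assumed standard deviations for age and income
radius = 0.5

print(in_cluster((22.0, 5200.0), centroid, radius, sigma))  # True: near the centroid
print(in_cluster((54.0, 6400.0), centroid, radius, sigma))  # False: far away
```

Detection then amounts to flagging live points that fall in no modeled cluster.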
© Ellis Cohen, 2003-2006 41 Modeling Clusters Bounds-Based Cluster 1: [age: 21 ± 2, monthlyIncome: 5000 ± 1000] Cluster 2: [age: 54 ± 3, monthlyIncome: 6400 ± 1200] Centroid/Radius-Based Cluster 1: centroid: [age: 21, monthlyIncome: 5000], radius: .12 Cluster 2: centroid: [age: 54, monthlyIncome: 6400], radius: .15 The Centroid/Radius-Based approach implies that • clusters are circular (too strong) • we need a uniform distance metric (needed anyway) © Ellis Cohen, 2003-2006 42 Distance Metric Clustering requires a distance metric: given 2 data points, pt1 and pt2, compute the distance d( pt1, pt2 ) Distance in a single dimension Easy for quantitative variables (v2 - v1) Harder for categorical variables Hardest for structured variables (e.g. similarity metrics for text, images) Distance over multiple dimensions More than just Pythagoras … © Ellis Cohen, 2003-2006 43 Categorical Variable Distance Ordinal Variables [ordered] v2 - v1 doesn't work Use a lookup table or function f(v1,v2) Nominal Variables [unordered] – Non-hierarchical [e.g. gender] d(v1,v2) = 0 if v1 = v2, 1 otherwise – Hierarchical Use a distance based upon the hierarchy, e.g. d(p1,p2) [p1 and p2 are prodids] = 0, if p1 = p2, else .4, if Prodtyp(p1) = Prodtyp(p2), else .7, if Subcat(p1) = Subcat(p2), else .9, if Category(p1) = Category(p2), else 1, otherwise © Ellis Cohen, 2003-2006 44 Multidimensional Distance x = (x1,x2,…,xn) y = (y1,y2,…,yn) Euclidean Distance d(x,y) = sqrt( Σi (xi - yi)² ) What if dimensions aren't commensurate? Scale all dimensions – use weights based upon importance, or – scale so values lie between 0 and 1, or – use d(x,y) = sqrt( Σi ((xi - yi)/σi)² ), where σi is the standard deviation of the ith dimension © Ellis Cohen, 2003-2006 45 Types of Clustering Partition-Driven (primarily for O.R.)
Partition data points Score: based on compactness Either • Every pt is in a cluster • Minimize # of pt which are not Density-Driven (primarily for Data Mining) Discovering dense collections of data points Find all clusters which have minimum size & density No requirement to include outliers © Ellis Cohen, 2003-2006 46 Targeted Clustering & Association Rules © Ellis Cohen, 2003-2006 47 Exploratory vs Targeted Clustering Exploratory Clustering Find clusters involving an arbitrary set of variables Targeted Clustering Find clusters among a set of variables which include the target variable (possibly restricted to a particular value) © Ellis Cohen, 2003-2006 48 Single Value Targeted Clustering Suppose • Our sample dataset consists of car buyers • we want to find clusters of car buyers who bought luxury cars 1. Restrict the sample dataset to just those tuples where CarType = "luxury" 2. Use clustering among this restricted dataset © Ellis Cohen, 2003-2006 49 Multiple Value Targeted Clustering Suppose • Our sample dataset consists of car buyers • we want to find clusters of car buyers who bought the various categories of cars Use Single Value Targeted Clustering for CarType = "luxury", then again for CarType = "midrange", then again for CarType = "cheap" © Ellis Cohen, 2003-2006 50 Clustering vs Classification Can't we use Multiple Value Targeted Clustering to do Classification? Find the clusters where CarType = "luxury", CarType = "midrange", and CarType = "cheap" Use the clusters as the model and to predict the value of live data. WILL THIS WORK OR NOT? © Ellis Cohen, 2003-2006 51 Clustering Does NOT Classify Clusters do not cover the space Clusters only identify dense regions of objects. 
The bulk of the space that a decision tree would assign to cheap car buyers probably does NOT hold dense clusters of them, so would not be included in the clustered model Clusters for different target values may overlap x o cluster o x x cluster o o x xx x o o o x o xo x o o o x x x xx o oo xo o x x o oo o x o x o x o xx x o o © Ellis Cohen, 2003-2006 52 Association Rules The region corresponding to a cluster may include other data points whose target values differ from the targeted cluster value o cluster o ox o oo x o o ox o oo xo Monthly oo oo x ox o Income Age Cluster: [age: 26 3, monthlyIncome: 3000 1000] Pct of data pts within the cluster region with target value 'midrange' (i.e. with symbol 'o') is the same as the CONFIDENCE of the association rule (23 ≤ age ≤ 29) Λ (2000 ≤ monthlyIncome ≤ 4000) carType = "midrange' © Ellis Cohen, 2003-2006 53 Cluster Confidence Depending upon the data mining application, confidence may or may not matter. Problem: Come up with applications where the confidence of clusters – must be > 80% – must be > 30% – doesn't matter © Ellis Cohen, 2003-2006 54 Requiring Confident Clusters Suppose we are interested in regions where enough data points are clustered together (i.e. with good support), but where a minimum confidence w.r.t. a targeted value is required. Is there any alternative to simply discarding clustered regions with low confidence? © Ellis Cohen, 2003-2006 55 Split Clusters o ox o oo x o o ox o oo xo oo oo x ox o Use decision-tree style splitting of the data points within the cluster to best separate the 'o' valued points from the non-'o' valued points If the 'o' region has too few data points (too little support), tough luck If it has adequate support & confidence, done! 
If it has adequate support, but has inadequate confidence, split it again … (If the non-'o' regions are large enough, they can also potentially be split to find 'o' subregions) © Ellis Cohen, 2003-2006 56 Quantitative Target Variables If a target variable is quantitative, especially if it is continuous, how can targeted clustering be done? For example, how do you find clusters of car buyers who spent approximately the same amount on a car? © Ellis Cohen, 2003-2006 57 Quantitative Targeted Clustering How do you find clusters of car buyers who spent approximately the same amount on a car? Do standard clustering, just require that the variables used always INCLUDE the target variable! © Ellis Cohen, 2003-2006 58 Market Basket Analysis © Ellis Cohen, 2003-2006 59 Market Basket Analysis Organize dataset into baskets. Find groups of items which frequently occur together in baskets 11-Feb-99 11-Feb-99 11-Feb-99 … 11-Feb-99 11-Feb-99 … 13-Feb-99 13-Feb-99 Rules capture causality Joe Joe Joe Diapers Formula Beer Simha Pretzels Simha Beer Sasha Diapers Sasha Beer Basket: Daily Shopping by a Customer Diapers and beer occur together frequently Item: Product purchased NO! People who buy beer are not more likely Beer Diapers? to buy diapers YES! People who buy diapers are more likely Diapers Beer? to buy beer (esp men at night) © Ellis Cohen, 2003-2006 60 Market Basket Activities Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions Modeling Identify good sets of rules (w statistics) Forensics Understand why other groups of items are NOT related Detection/Prediction Does live data follow same rules with same statistics © Ellis Cohen, 2003-2006 61 Baskets In order to use market basket analysis, you must first divide the dataset into baskets. Baskets are specified as a group of variables (possibly derived). The actual baskets are obtained by grouping the dataset by these variables (e.g. date/customer). 
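The grouping step just described can be sketched in a few lines of Python (the rows are hypothetical, echoing the earlier diapers/beer example; the basket key is the pair of grouping variables (date, customer)):

```python
from collections import defaultdict

# Hypothetical purchase rows: (date, customer, item)
purchases = [
    ("11-Feb-99", "Joe",   "Diapers"),
    ("11-Feb-99", "Joe",   "Formula"),
    ("11-Feb-99", "Joe",   "Beer"),
    ("11-Feb-99", "Simha", "Pretzels"),
    ("11-Feb-99", "Simha", "Beer"),
    ("13-Feb-99", "Sasha", "Diapers"),
    ("13-Feb-99", "Sasha", "Beer"),
]

def make_baskets(rows):
    """Group rows into baskets, keyed by the chosen basket variables (date, customer)."""
    baskets = defaultdict(set)
    for date, customer, item in rows:
        baskets[(date, customer)].add(item)
    return dict(baskets)

baskets = make_baskets(purchases)
print(sorted(baskets[("11-Feb-99", "Joe")]))  # ['Beer', 'Diapers', 'Formula']
```

Choosing different grouping variables (e.g. week/customer instead of date/customer) yields different baskets and therefore different discovered rules.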
The first step of market basket analysis is deciding which variables define the baskets. © Ellis Cohen, 2003-2006 62 Items Market Basket analysis looks for groups of items which frequently appear together in a basket. An item is determined by a variable (or set of variables). Each different value for that variable (or variables) determines a different item (e.g. productPurchased). The second step of market basket analysis is determining which variable(s) are used to identify the items. © Ellis Cohen, 2003-2006 63 Market Basket Discovery 1) Find frequent itemsets 2 items that appear together frequently are interesting 3 items that appear together frequently are really interesting { charcoal, chicken, bbq sauce } 4 or more items, really really interesting 2) Find rules that characterize causality Diapers → Beer, but not Beer → Diapers Think in terms of which one you would give a coupon for. © Ellis Cohen, 2003-2006 64 Apriori Algorithm Find all itemsets which appear in at least n baskets Use the Apriori (Monotonicity) principle: If a set of items S is frequent (i.e. appears in at least n baskets), then every subset of S is frequent.
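This monotonicity principle is what makes level-wise search feasible. A minimal Python sketch (function and variable names are our own; real implementations add many optimizations such as hash-based counting):

```python
from itertools import combinations

def frequent_itemsets(baskets, n):
    """Level-wise Apriori: grow candidate itemsets one item at a time,
    keeping only those that appear in at least n baskets."""
    # Level 1: single items appearing in at least n baskets
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    level = [frozenset([i]) for i, c in counts.items() if c >= n]
    frequent = list(level)
    while level:
        # Candidates: unions of two itemsets from the previous level that differ by one item
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # Monotonicity pruning: every (k-1)-subset of a candidate must itself be frequent
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
        # Keep candidates that actually appear in at least n baskets
        level = [c for c in candidates
                 if sum(1 for basket in baskets if c <= basket) >= n]
        frequent.extend(level)
    return frequent

baskets = [{"diapers", "beer", "formula"},
           {"pretzels", "beer"},
           {"diapers", "beer"}]
found = frequent_itemsets(baskets, 2)
print(frozenset({"diapers", "beer"}) in found)  # True: the pair appears in 2 baskets
```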
Call L1 the items which appear in at least n baskets Consider all combos of 2 items {A,B}, both from L1 Call L2 those which appear in at least n baskets Consider all combos of 3 items {A,B,C}, where {A,B}, {A,C} and {B,C} are all in L2 Call L3 those which appear in at least n baskets Consider all combos of 4 items {A,B,C,D}, where {A,B,C}, {A,B,D}, {A,C,D} and {B,C,D} are all in L3 Call L4 those which appear in at least n baskets The frequent itemsets are L1 + L2 + … © Ellis Cohen, 2003-2006 65 DB Implementation of Apriori One scan through the DB to get frequent items CREATE TABLE HotItems AS SELECT item FROM Purchases GROUP BY item HAVING count(*) >= n Another scan through the DB looking for pair itemsets (repeat analogously for size-n itemsets) WITH HotPurchases AS (SELECT * FROM Purchases NATURAL JOIN HotItems) SELECT P1.item, P2.item FROM HotPurchases P1, HotPurchases P2 WHERE P1.basket = P2.basket AND P1.item < P2.item GROUP BY P1.item, P2.item HAVING count(*) >= n © Ellis Cohen, 2003-2006 66 Apriori Scalability Obtaining frequent pair itemsets: If HotItems can be kept in memory allowing rapid lookup (sorted list or hashtable), FP (frequent pair) itemsets can be obtained in one linear pass through the DB. Obtaining frequent size-n itemsets: 1. Use a separate linear scan through the DB for each size up to n. Slow. 2. On the second scan, don't just count pairs; instead build a memory-based FP-Tree, which can be used to find all frequent itemsets of any size. But we often only care about frequent pair itemsets. © Ellis Cohen, 2003-2006 67 Clustering & Market Basket Analysis Market Basket Analysis is a form of clustering • Turn each basket into a single LARGE data item. – Each LARGE data item has a separate boolean variable for each possible item that can be in a basket. For example – Beer, diapers, etc.
are separate variables – A LARGE data item's beer value is TRUE if the basket it came from had a beer • In the original dataset, we look for k-element itemsets which appear in p or more baskets (using apriori) This is equivalent to using the LARGE item dataset and – using the subspace clustering algorithm to look for k-dimensional cells with p or more points © Ellis Cohen, 2003-2006 68 Scoring Market Basket Analysis © Ellis Cohen, 2003-2006 69 Support Cereal Beer Support( s ) = # of baskets containing S / # of total baskets 40 1000 2000 4000 Support { Beer } = 1000/4000 = 25% Support { Cereal } = 2000/4000 = 50% Support { Beer, Cereal } = 40/4000 = 1% Support: How significant is this itemset In a supermarket, anything over .1% might be significant Given the # of total baskets, the minimum interesting support determines n for the Apriori algorithm © Ellis Cohen, 2003-2006 70 Confidence Cereal Confidence( A B ) = Support( A & B ) / Support( A ) Beer 40 1000 2000 4000 Confidence( A B ) = # of baskets containing A & B / # of baskets containing A Confidence( Beer Cereal ) = 40/1000 = 4% Confidence( Cereal Beer ) = 40/2000 = 2% Confidence( A B ): If a basket has A, how likely is it that the basket also will have B (i.e. how confident are we that A predicts B) If this is low (say, less than 30%), it is not very interesting, since the two items don't correlate © Ellis Cohen, 2003-2006 71 High Support & Confidence Milk Beer 400 1000 2000 4000 Support { Beer } = 1000/4000 = 25% Support { Milk } = 2000/4000 = 50% Support { Beer, Milk } = 400/4000 = 10% WOW! 
Confidence( Milk → Beer ) = 400/2000 = 20% Confidence( Beer → Milk ) = 400/1000 = 40% High Confidence, so potentially interesting. BUT 40% < 50%, the pct who buy milk anyway, so giving milk coupons to beer buyers is probably not the most useful thing to do © Ellis Cohen, 2003-2006 72 Lift Lift( A → B ) = Confidence( A → B ) / Support( B ) = Lift( B → A ) [Venn diagram: Milk 2000, Beer 1000, both 400, of 4000 total baskets] Support { Beer } = 1000/4000 = 25% Support { Milk } = 2000/4000 = 50% Support { Beer, Milk } = 400/4000 = 10% WOW! Confidence( Milk → Beer ) = 400/2000 = 20% Confidence( Beer → Milk ) = 400/1000 = 40% OK! Lift( A → B ): How much does A help B? Lift( Beer → Milk ) = 40% / 50% = .8 If lift < 1, then it doesn't help at all! © Ellis Cohen, 2003-2006 73 Good Lift Lift( A → B ) = Confidence( A → B ) / Support( B ) = Lift( B → A ) [Venn diagram: Diapers 200, Beer 1000, both 80, of 4000 total baskets] Support { Beer } = 1000/4000 = 25% Support { Diapers } = 200/4000 = 5% Support { Beer, Diapers } = 80/4000 = 2% OK! Confidence( Beer → Diapers ) = 80/1000 = 8% Confidence( Diapers → Beer ) = 80/200 = 40% OK! Lift( Diapers → Beer ) = 40% / 25% = 1.6 Note: Lift can be useful in clustering situations as well © Ellis Cohen, 2003-2006 74 Support, Confidence & Lift For a rule A → B: Support( A & B ) How important is the rule: what percent of baskets have both A & B? Confidence( A → B ) How likely is it that baskets which contain A also contain B. In general, should be at least 35%. Lift( A → B ) If we know that a basket contains A, how much surer are we that the basket contains B than if we didn't know what else was in the basket. Must be > 1; probably should be at least 1.3. © Ellis Cohen, 2003-2006 75 Hierarchical Categories Do apriori with values at each category level Whole Wheat Bread → Skim Milk, but not Bread → Milk, or vice versa!
For scalability, can initially include only higher-level categories, then split itemsets with high levels of support © Ellis Cohen, 2003-2006 76 Rules for Larger Itemsets For { A, B, C, D } consider A, B, C → D A, B, D → C A, C, D → B B, C, D → A Diapers, ChildrensTylenol → Beer may have less support than Diapers → Beer but may well have higher confidence and higher lift © Ellis Cohen, 2003-2006 77 Incorporating Other Variables Diapers, gender:male, time: [8pm:1am] → Beer will also have less support than Diapers → Beer But will almost certainly have higher confidence & lift Remember, this is just subspace clustering with more variables © Ellis Cohen, 2003-2006 78
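The three scoring measures used throughout this section can be computed directly from baskets. A minimal Python sketch, with synthetic baskets constructed to reproduce the diapers/beer counts from the Good Lift example (80 baskets with both, 1000 with beer, 200 with diapers, 4000 total):

```python
def support(baskets, itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(baskets, a, b):
    """Support(A & B) / Support(A): how often baskets with A also contain B."""
    return support(baskets, a | b) / support(baskets, a)

def lift(baskets, a, b):
    """Confidence(A -> B) / Support(B): values > 1 mean A actually helps predict B."""
    return confidence(baskets, a, b) / support(baskets, b)

# Synthetic baskets matching the slide's counts
baskets = ([{"beer", "diapers"}] * 80 + [{"beer"}] * 920
           + [{"diapers"}] * 120 + [set()] * 2880)

print(support(baskets, {"beer"}))                  # ≈ 0.25
print(confidence(baskets, {"diapers"}, {"beer"}))  # ≈ 0.4
print(lift(baskets, {"diapers"}, {"beer"}))        # ≈ 1.6
```

Note that lift is symmetric, matching the slide's identity Lift( A → B ) = Lift( B → A ).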