Advanced Database Topics Copyright © Ellis Cohen 2002-2005 Data Mining These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. For more information on how you may use them, please see http://www.openlineconsult.com/db 1 Topics Types of Data Mining Data Mining Activities Datasets Estimation: Classification & Approximation Decision Trees Approximation (Regression) Raw Data Modeling with k-Nearest Neighbor Opaque Modeling with Feed-Forward Neural Networks Clustering Partition-Driven Clustering Density-Driven Clustering Subspace Clustering Targeted Clustering & Association Rules Market Basket Analysis Scoring Market Basket Analysis Scoring & Data Preparation Copyright © Ellis Cohen 2002-2005 2 Goals of Data Mining Find patterns and relationships among objects represented by data points in a dataset Novel / Significant People who buy notebooks buy pencils vs. Men who buy diapers at night also buy beer Understandable / Useful Tradeoff between accuracy and simplicity Causality vs Chance All data has random variations, which can show up as spurious patterns and relationships Copyright © Ellis Cohen 2002-2005 3 Types of Data Mining Copyright © Ellis Cohen 2002-2005 4 Types of Data Mining Covered Estimation (Classification & Approximation) Classify/categorize objects, or approximate some value associated with objects based on their features Clustering Find groups of objects, [some of] whose features are all similar to one another Market Basket Analysis Find collections of items which frequently occur together (in the same basket) & formulate the cause Copyright © Ellis Cohen 2002-2005 5 Objects & Features custid 3043 3174 … age 23 44 … The entire collection of data is called a dataset income 5000 6190 … … Object: Customer (Represented by a Data Point) Features (also called Variables) Copyright © Ellis Cohen 2002-2005 6 Classification Determine target category of an object, based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarType Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars Objects: Customers Note difficult separation Monthly Income x Region x x x x x x # #ox x xx x xx x x o x o x xxx x ox # o o x ox o x o # o x # o o # o x o o# o # o o oo x oxo o o x o o o # o x o o o # o # # # # o o o o # # #o # # o # #o # # x # # # # # # # # Draw regions in which one category predominates Age o o Region # Region When a new customer comes along, you can categorize what kind of car they're likely to buy based upon which region they fall into Copyright © Ellis Cohen 2002-2005 7 Approximation Approximate some (continuous) target value of an object based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarCost -- amt spent by customer on a car Objects: customers Find a function f(income,age) that gives a good estimate of the actual value Monthly Income o o oo o oo oo o o o o o o oo o o o o o oo o o o oo o o o o o o o o oooo o o o o o oo o o o o o o oo oooo o o o o o o o o o o o o o o o o o o o o o o o o o o o o o ooo o o o o o o o Age o o o o When a new customer comes along, you can apply f to their income & age to estimate how much they're likely to spend on a car Copyright © Ellis Cohen 2002-2005 8 Applications of Estimation Sales Estimate how much a customer is likely to spend Determine amount of effort/money to put into capturing a potential customer Credit Decide whether to approve a credit application Determine someone's credit limit Evaluate whether a credit card is stolen Medicine Determine 
whether a patient has a disease based on symptoms and test results Assess which protocol should be used to treat an illness Copyright © Ellis Cohen 2002-2005 9 Clustering Find groups of objects, [some of] whose features are all similar to one another Objects: Corvette Buyers x x x x x x x x x xx x x x xxx xx xx x x x x Monthly Income FEATURES x Identify Cluster e.g. by center & radius Age Copyright © Ellis Cohen 2002-2005 10 Applications of Clustering Marketing Target advertising for each cluster; use media whose demographics match cluster Astronomy Find galactic clusters; explore large scale structure of the universe Epidemiology Find things in common among people with the same illness (esp location) Government Policy & Planning Identify regions with similar features (e.g. economic, industrial, geographic) for policies and planning (e.g. land use, economic assistance, office locations, bus routes) Copyright © Ellis Cohen 2002-2005 11 Market Basket Analysis Organize dataset into baskets. Find groups of items which frequently occur together in baskets 11-Feb-99 11-Feb-99 11-Feb-99 … 11-Feb-99 11-Feb-99 … 13-Feb-99 13-Feb-99 Rules capture causality Joe Joe Joe Diapers Formula Beer Simha Pretzels Simha Beer Sasha Diapers Sasha Beer Basket: Daily Shopping by a Customer Diapers and beer occur together frequently Item: Product purchased NO! People who buy beer are not more likely Beer Diapers? to buy diapers YES! People who buy diapers are more likely Diapers Beer? to buy beer (esp men at night) Copyright © Ellis Cohen 2002-2005 12 Applications of Market Basket Analysis Marketing Baskets: Daily Shopping Items: Products Purchased Controlling Customer Traversal in Stores Coupons Recommendations (e.g. Amazon) Semantic Nets Baskets: Documents Items: Words/Phrases Use for Semantic Search Plagiarism Detection Baskets: Sentences Items: Documents Copyright © Ellis Cohen 2002-2005 13 Data Mining Approaches Deterministic Heuristic K-Means Clustering Agglomerative CLIQUE Regression Neural Nets Estimation Bayesian Networks Genetic Algorithms K-Nearest Neighbor Decision Trees Mkt Basket Apriori Analysis Produces "best" possible model or prediction Produces reasonably good model or prediction Copyright © Ellis Cohen 2002-2005 14 Data Mining Activities Copyright © Ellis Cohen 2002-2005 15 Data Mining Activities Diagram Forensics Analyzer Anomalies Model Discovery Detector/ Predictor Modelling Detection Prediction Sample Data Live Data Copyright © Ellis Cohen 2002-2005 Predictions 16 Data Mining Activities Discovery/Modeling Using an existing sample dataset to develop a model which characterizes or describes the data Forensics Finding anomalous data points in an existing sample dataset – those that do not match the discovered model, and determining the cause (which may involve another round of Discovery/Modeling) Prediction Using the discovered model to predict an unknown feature value of a new live data point Detection Detect new live data points which are anomalous – those that do not match the discovered model, and determining the cause (more Discovery/Modeling) Copyright © Ellis Cohen 2002-2005 17 Applications of Modeling & Prediction Clustering Model to decide on a targeted advertising program Predict whether a web user is in a target group for a product, and if so, show them the ad prepared for that group. 
Market Basket Analysis: Use the model to decide on store layout, sales & promotions. Use predictions to deliver personalized coupons at checkout.
Classification/Approximation: Model to decide on a marketing program for less-focused mass-media advertising. Predict how an individual potential customer will behave, and personalize the sales approach to that one customer.
Copyright © Ellis Cohen 2002-2005 18
Data Mining Planning
Data, Mining Type & Activity Selection: What data do you want to mine and how do you want to mine it?
Data Cleaning & Transformation: Does the data need to be prepared so the mining will work correctly?
Evaluation: How will you be able to tell whether the results are good? (Visualization & Scoring)
Mining Type & Activity Details: What approach will you use to implement the mining type & activity, and with what control parameters?
Copyright © Ellis Cohen 2002-2005 19
Integrated DW/DM Process
[diagram: Data Sources → ETL → Data Warehouse → More ETL → Data Mining Store, feeding OLAP & Visualization and Data Mining • Episodic • Strategic • Continuous]
Copyright © Ellis Cohen 2002-2005 20
Datasets
Copyright © Ellis Cohen 2002-2005 21
Dataset: A set of data points (that represent objects), each of which has multiple variables (that represent the object's features).
From a Data Warehouse Perspective
Dataset – Derived from some selected subset of a fact table: • Either an original fact table, or • An aggregated fact table derived from it
Variables – Measure in the fact table – Attribute of one of the dimensions – Derived from the measures & attributes
The # of variables is the dataset's dimensionality (not to be confused with the dimensions of the original fact table)
Copyright © Ellis Cohen 2002-2005 22
Variables
• Quantitative [numeric] – Continuous [float] – Discrete [int]
• Categorical [enumerated] – Ordinal [ordered] – Nominal [unordered, possibly hierarchical]
• Structured – Text, Image, etc. (ignore these)
Quantitative variables have a well-defined distance (v2 - v1) and may have a well-defined ratio (v2/v1), unlike integer-coded categorical variables [e.g. 1:doctor, 2:nurse, 3:other]
Copyright © Ellis Cohen 2002-2005 23
Types of Data Mining Revisited
Clustering: Find groups of data points that are similar to one another in some way
Approximation: Estimate the value of a quantitative target variable based on the values of other variables
Classification: Classify (a.k.a. categorize) a data point (estimate the value of a categorical target variable) based on the values of other variables
Market Basket Analysis: After dividing data points into baskets, find sets of values of the item variable that frequently occur together in a basket
Copyright © Ellis Cohen 2002-2005 24
Example Dataset Definition
Basic Fact Table: Sales( price, units ). Dimensions: Customer, Product, Date, Location
Aggregate, Filter, Select & Transform:
AGGREGATE sum(price * units) as total, total / sum(units) as avgprice
BY Customer.custid, Product.subcatid
WHERE (year = 2002)
VARIABLES on next page
Actually specify this using a DMQL (Data Mining Query Language) -- e.g.
PMML (Predictive Model Markup Language)
Copyright © Ellis Cohen 2002-2005 25
Dataset Variables
total CONTINUOUS
avgprice CONTINUOUS
product.subcatid NOMINAL
product.catid NOMINAL, PARENT OF subcatid
customer.custid ORDINAL (order of date joined)
customer.gender NOMINAL
customer.age DISCRETE
customer.married NOMINAL
customer.numDepend DISCRETE
customer.income CONTINUOUS
customer.familyIncome CONTINUOUS
customer.vacationSpend CONTINUOUS
customer.monthlyRent CONTINUOUS
customer.monthlyMortgage CONTINUOUS
monthlyHousing CONTINUOUS = monthlyRent + monthlyMortgage
allowance CONTINUOUS = CalcAllowance( familyIncome, married, numDepend, monthlyHousing )
famAllowance CONTINUOUS = CalcFamilyAllowance( … )
nearestStoreDistance = CalcNearestStoreDist( customer.x, customer.y )
Copyright © Ellis Cohen 2002-2005 26
Exploratory vs Targeted Data Mining
Exploratory Data Mining: How are the variables in a dataset related to one another?
Targeted Data Mining: How are the target variables in a dataset related to the other variables (sometimes called the predictor variables)? The target variables (e.g. type of car bought) are • Known for the sample dataset • Unknown (at least initially) for live datasets
Estimation: Always Targeted. Clustering: Exploratory or Targeted. Market Basket Analysis: Generally Exploratory.
Copyright © Ellis Cohen 2002-2005 27
Estimation: Classification & Approximation
Copyright © Ellis Cohen 2002-2005 28
Classification: Determine the target category of an object, based on its features.
Predictor Variables: Age, MonthlyIncome. Target Variable: CarType. Category values: x - Luxury Cars, o - Midrange Cars, # - Cheap Cars. Objects: Customers.
[scatter plot of Monthly Income vs. Age with points marked x/o/#; note the difficult separation]
Draw regions in which one category predominates (x Region, o Region, # Region). When a new customer comes along, you can categorize what kind of car they're likely to buy based upon which region they fall into.
Copyright © Ellis Cohen 2002-2005 29
Approximation: Approximate some (continuous) target value of an object based on its features.
Predictor Variables: Age, MonthlyIncome. Target Variable: CarCost -- amt spent by customer on a car. Objects: customers.
Find a function f(income,age) that gives a good estimate of the actual value.
[scatter plot of Monthly Income vs. Age]
When a new customer comes along, you can apply f to their income & age to estimate how much they're likely to spend on a car.
Copyright © Ellis Cohen 2002-2005 30
Estimation Activities
[activity diagram: Sample Data → Discovery/Modeling → Model; Model + Live Data → Detector/Predictor → Predictions; Forensics/Analyzer → Anomalies]
Modeling: Come up with a way of estimating the target value of data items (only known for sampled data, not live data) based on other features.
Forensics: Understand why some data items have values which significantly differ from the estimated value.
Prediction: Estimate the (unknown) target value of live data items based on the known features.
Detection: When the live data's unknown target value becomes known, find items whose target value doesn't match its estimated value.
Copyright © Ellis Cohen 2002-2005 31
Model Characteristics
Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data
Detector/ Predictor Detection Prediction Predictions Characteristics of Models: • Transparent (Descriptive): Understandable • Opaque (Predictive): Not particularly understandable; its sole purpose is to drive a predictor/detector/analyzer • Raw Data: The retained sample data is provided directly to the predictor/detector/analyzer Copyright © Ellis Cohen 2002-2005 32 Training & Testing Sample Set Training Set Testing Set Testing Set Testing Set Testing Set Testing Set Use the training set to build the model Use testing sets to tweak and validate the model Copyright © Ellis Cohen 2002-2005 33 Estimation Approaches Models Raw Data Transparent Opaque Classification Approximation K Nearest Neighbor Bayesian Network Linear Regression Decision Trees Regression Trees Feed-Forward Neural Networks Genetic Algorithms Copyright © Ellis Cohen 2002-2005 34 Decision Trees Copyright © Ellis Cohen 2002-2005 35 Classification Determine target category of an object, based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarType Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars Objects: Customers x x x x x x # #ox x xx x xx x x o x o x xxx x ox # o o x ox o x o # o x # o o # o x o o# o # o o oo x oxo o o x o o o # o x o o o # o # # # # o o o o # # #o # # o # #o # # x # # # # # # # # Draw regions in which one category predominates Note difficult separation Monthly Income Age o x Region o Region # Region When a new customer comes along, you can categorize what kind of car they're likely to buy based upon which region they fall into Copyright © Ellis Cohen 2002-2005 36 Motivating Decision Trees 1 x x x x xxx xxx x x # x x x x x oxxxx x oooo xxx xo o o x o o o oo o o x xx x x x o o oo# o o o x o # o oo # o o # o## oooo o ooo o x o o oo o # # ### # # # # oo o o # # ### # # # #o ### # o # o # # # # o# # # # # ## x x x x xxx xxx x x # x x x x x oxxxx x oooo xxx xo o o x o o o oo o o x xx x x x o o oo# o o o x o # o oo # o o # o## oooo o ooo o x o o oo o # # ### # # # # oo o o # # ### # # # #o ### # o # o # # # # o# # # # # ## o x x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o x xxxxxo oo o x o o o o o o xoo ooo # # o o o o #o##o## oo # oo o ooo o x o oo o # # ## # oo o o # # #### # #### # # o # # o # ## # ## o o o # # ## # ## Copyright © Ellis Cohen 2002-2005 o x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o o# o o o oo o x xx x o x x o o o xoo ooo # # o o o oo o ooo o x o oo o o # ## o## ### # # # # oo o o # # ### # # # ## o # # o # ## # ## o o # # ## # ## o 2 37 x o o # # o Decision Tree Construction x x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o x xxxxxo oo o x o o o o o o xoo ooo # # o o o o #o##o## oo # oo o ooo o x o oo o # # ## # oo o o # # #### # income #### # # o # # o # ## # ## o o o # # ## # ## age x o o # # o income < 3000 # income < 6000 age < 35 # # o age < 45 o # x age < 52 o … o … Copyright © Ellis Cohen 2002-2005 age < 54 x o … 38 Decision Tree Algorithm Given a set of variables to consider, in each iteration the algorithm – decides which variable to split – where (at what value) to split it in order to find the best possible separation of tuples Copyright © Ellis Cohen 2002-2005 39 Decision Tree Construction Issues Splitting Which variable to split & where Gini, Twoing, Entropy, Chi-Square, Gain, … Binary or n-ary splits, esp for categorical variables Use multiple approaches and pick best Biased Splitting Non-symmetric misclassification costs safety misclassification (e.g. 
O rings) Category Importance fraud detection Penalties on Variables acquisition cost, questionable values Case weights different distributions of sample & live data Copyright © Ellis Cohen 2002-2005 40 Linear Combination Splitting x x x x o # x xx x xxxx x x x x x x x x x xxxx oooo x x xx x x o oo o o o o x x x xo o o o o o o x o o o o # o ooo # o o x # # ### o # oo o ooo oo # # # # ## # # ### ## # # # # # # # # # #### o # ## o# ## # # # # ## # ## o income < 40*age + 400 x o x xxxxx xx x x x x x x x x x x x x# xxxx oooo x o x xx x x o o o o o x x x xo o o o o oo o x o o o o # o # o oo o o x# # # ## ### o oo o o o o o #o # # # # # # # ## # # # ## #o # # # ## # # # # # ##o # # # ## # # # # ## o Copyright © Ellis Cohen 2002-2005 41 Overfitting income < 3000 x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o o# o o o oo o x xx x o x x o o o xoo ooo # # o o o oo o ooo o x o oo o o # ## o## # # ## # oo o o # # #### # income ### # # o # # o # ## # ## o o # # ## # ## age … income < 6000 o Prevent overfitting by • Stopping Need criteria Danger of stopping too early • Pruning Build full tree Cut back when testing … o age < 52 … o income < 4500 # o age < 59 # # age < 61 o Copyright © Ellis Cohen 2002-2005 # 42 Classification Rule Extraction x x x x xxx x x x x x x x x x x #oxxxxxx oooo xxx xo o o x o o o x xxxxxo oo o x o o o o o o xoo ooo # # o o o o #o##o## oo # oo o ooo o x o oo o # # ## # oo o o # # #### # income #### # # o # # o # ## # ## o o # # ## # ## age x o o # # o Extract a rule for each region (25 ≤ age ≤ 35) Λ (income < 3000) CarType = 'midrange' (i.e. 'o') Support: 7/94 7.4% Confidence: 6/7 86% Support( A ) = # of objects satisfying A / # of total objects: Does the region determined by A have enough pts to matter? Confidence( A B ) = # satisfying A & B / # satisfying A: How confident are we that a point in the region determined by A also satisfies B? Copyright © Ellis Cohen 2002-2005 43 Classification Ruleset (income < 3000) & (age < 35) '#' (income < 3000) & (35 <= age < 45) 'o' (income < 3000) & (45 <= age) '#' (3000 <= income < 4500) & (age < 20) 'o' … ((income < 3000) & (35 <= age < 45) | (3000 <= income < 4500) & (age < 20)) | … 'o' (income < 3000) & ((age < 35) | (45 <= age)) … '#' Copyright © Ellis Cohen 2002-2005 44 Using Decision Trees Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions The Decision Tree (or the extracted Classification Rules) are a Transparent (understandable) Model. How is the Model used for Forensics, Prediction & Detection? 
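As a concrete illustration of the splitting step described above, the sketch below greedily picks the (variable, threshold) split that minimizes size-weighted Gini impurity and recurses until nodes are pure, in the spirit of the construction on the preceding slides. It is a minimal illustration, not the deck's exact algorithm, and the customer records and column names (age, income, cartype) are made up.

```python
from collections import Counter
from pprint import pprint

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, target, predictors):
    """Find the (variable, threshold) whose binary split minimizes the
    size-weighted Gini impurity of the two sides."""
    best = None  # (weighted_gini, variable, threshold)
    for var in predictors:
        for threshold in sorted({r[var] for r in rows}):
            left = [r[target] for r in rows if r[var] < threshold]
            right = [r[target] for r in rows if r[var] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best

def build_tree(rows, target, predictors, min_size=2):
    """Recursively split until a node is pure or too small (a crude stand-in
    for the stopping/pruning discussion above)."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or len(rows) < min_size:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    split = best_split(rows, target, predictors)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, var, threshold = split
    left = [r for r in rows if r[var] < threshold]
    right = [r for r in rows if r[var] >= threshold]
    return {"split": f"{var} < {threshold}",
            "yes": build_tree(left, target, predictors, min_size),
            "no": build_tree(right, target, predictors, min_size)}

# Hypothetical sample of car buyers: predictors age & income, target cartype.
rows = [
    {"age": 23, "income": 2500, "cartype": "cheap"},
    {"age": 25, "income": 2800, "cartype": "cheap"},
    {"age": 38, "income": 2900, "cartype": "midrange"},
    {"age": 41, "income": 5200, "cartype": "midrange"},
    {"age": 44, "income": 7800, "cartype": "luxury"},
    {"age": 52, "income": 8100, "cartype": "luxury"},
]
pprint(build_tree(rows, "cartype", ["age", "income"]))
```

Reading a classification rule off such a tree is just a matter of conjoining the split conditions along a root-to-leaf path, which is the rule-extraction idea above.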
Copyright © Ellis Cohen 2002-2005 45 Approximation (also called Regression) Copyright © Ellis Cohen 2002-2005 46 Approximation Approximate some (continuous) target value of an object based on its features Predictor Variables: Age, MonthlyIncome Target Variable: CarCost -- amt spent by customer on a car Objects: customers Find a function f(income,age) that gives a good estimate of the actual value Monthly Income o o oo o oo oo o o o o o o oo o o o o o oo o o o oo o o o o o o o o oooo o o o o o oo o o o o o o oo oooo o o o o o o o o o o o o o o o o o o o o o o o o o o o o o ooo o o o o o o o Age o o o o When a new customer comes along, you can apply f to their income & age to estimate how much they're likely to spend on a car Copyright © Ellis Cohen 2002-2005 47 1D Piecewise Linear Regression $ spent on x income income income income Copyright © Ellis Cohen 2002-2005 48 Piecewise Multivariate Linear Regression If there are n predictor dimensions Use multivariate linear regression to fit the best possible n-dimensional hyperplane Find the hyper-plane that has the largest error, and "split" it into 2 (or more) hyper-planes Continue until the error is within desired limits. Use testing sets to prune in case of overfitting Copyright © Ellis Cohen 2002-2005 49 Raw Data Estimation Modeling with k Nearest Neighbor Copyright © Ellis Cohen 2002-2005 50 Raw Data Model Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions Ordinarily, the sample data is used to build a model (e.g. a decision tree, a ruleset, an approximation function) that is used to predict target variable values for live data In raw data models, the model is the sample data (or some subset of it). The predictor uses the raw data directly to predict the target variable values of live data tuples. Copyright © Ellis Cohen 2002-2005 51 K Nearest Neighbor for Approximation k=5 o oo o oo o oo o o o oo o oo o o o o oo oo o oo o o o ooo o oo o oooo o o o o o o oo o o o oo oooo o o o o oo o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o * Estimate value of a new data point as average value of nearest k points Using an average weighted by distance generalizes linear interpolation Copyright © Ellis Cohen 2002-2005 52 K Nearest Neighbor for Classification k=5 x x x x x # #ox x xx x xx x x o x o x ox o xxx x # o o x ox o xo o # o x # o # o x o o# o #o o oo x oxo o o x o o* # o # x o o o o # # # # o o o o # # #o # # o # #o # # x # # # # # # # # x o Classify value of a new data pt by choosing category among k nearest neighbors which has the most weight (possibly taking distance into account) Copyright © Ellis Cohen 2002-2005 53 k Nearest Neighbor Scalability Too expensive when large # of points Pick a number of candidate representative subsample sets instead from a training set Use test sets to determine the quality of each representative subsample set. 
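Before finishing the scalability discussion, here is a minimal sketch of the k-nearest-neighbor predictor itself: average the target value of the k closest sample points for approximation, or take the majority category for classification. The toy sample points and the use of plain (unscaled) Euclidean distance are illustrative assumptions.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Plain Euclidean distance; in practice the dimensions would be scaled."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest(sample, point, k):
    """Return the k sample entries (features, target) closest to `point`."""
    return sorted(sample, key=lambda entry: euclidean(entry[0], point))[:k]

def knn_approximate(sample, point, k=5):
    """Approximation: average the target value of the k nearest neighbors."""
    neighbors = k_nearest(sample, point, k)
    return sum(target for _, target in neighbors) / len(neighbors)

def knn_classify(sample, point, k=5):
    """Classification: pick the most common category among the k nearest."""
    neighbors = k_nearest(sample, point, k)
    return Counter(target for _, target in neighbors).most_common(1)[0][0]

# Hypothetical samples: ((age, monthly_income), car_cost) and ((age, income), car_type).
cost_sample = [((25, 2500), 9000), ((31, 3200), 14000), ((42, 5100), 21000),
               ((47, 6400), 30000), ((55, 8000), 52000), ((29, 2700), 11000)]
type_sample = [((25, 2500), "cheap"), ((31, 3200), "midrange"), ((42, 5100), "midrange"),
               ((47, 6400), "luxury"), ((55, 8000), "luxury"), ((29, 2700), "cheap")]

print(knn_approximate(cost_sample, (35, 4000), k=3))
print(knn_classify(type_sample, (35, 4000), k=3))
```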
Test sets are also used to determine the best value for k. Choose the
• representative subsample set and
• value of k
with the smallest misclassification rate.
Problematic when there are too many variables: the best (even a good) distance metric is hard to determine, and the best distance metric may vary depending upon the region.
Copyright © Ellis Cohen 2002-2005 54
Opaque Estimation Modeling with Feed-Forward Neural Networks
Copyright © Ellis Cohen 2002-2005 55
Opaque Data Model
[activity diagram: Sample Data → Discovery/Modeling → Model; Model + Live Data → Detector/Predictor → Predictions]
In a transparent data model -- a decision tree, a ruleset, or (possibly) an approximation function -- the model can be examined and provides an understandable description & explanation of the patterns in the sample data. An opaque data model does not provide an understandable description of the patterns in the sample data. However, it can be used to configure a "machine" that predicts the target variable value of live data.
Copyright © Ellis Cohen 2002-2005 56
Neural Net Prediction Engine
Input nodes (predictor variables), one or more layers of hidden nodes, and output nodes (target variables). Use weights to combine a node's inputs to produce its output. The node weights are the Model. This is an opaque model: good for prediction, but not for explanation.
Copyright © Ellis Cohen 2002-2005 57
Propagation Rule
in = o1*w1 + o2*w2 + o3*w3 + o4*w4
If wi is positive, the effect is "excitatory"; if wi is negative, the effect is "inhibitory".
Copyright © Ellis Cohen 2002-2005 58
Activation
Activation Function: out = f(in)
Copyright © Ellis Cohen 2002-2005 59
Activation Functions
Linear: Linear Regression
Step: Piecewise Linear Regression
Sigmoid: Approximate Any Function (better than piecewise linear regression with the same # of nodes): out = 1 / (1 + e^(-in))
Copyright © Ellis Cohen 2002-2005 60
Learning & Back Propagation
The big problem is: How do you set the weights? Pick initial weights. Repeatedly
• Provide a data pt from the training set as input
• Compare the predicted response(s) with the actual response(s)
• Use the difference to adjust the weights of the output node
• Use the adjustment in weights of a node to adjust the weights of the nodes that feed into it (back propagation)
Copyright © Ellis Cohen 2002-2005 61
Momentum
If you adjust the weights too fast, the weights can oscillate instead of converging. If you adjust the weights too slowly, the weights converge too slowly. In general, neural nets converge slowly. This can be helped by using other techniques to select the data points in the training set & the order in which they are presented.
Copyright © Ellis Cohen 2002-2005 62
Neural Nets for Classification
Outputs: Luxury, Midrange, Cheap.
If a sample data point has a luxury target value, adjust weights towards Luxury=1, Midrange=0, Cheap=0.
If a sample data point has a midrange target value, adjust weights towards Luxury=0, Midrange=1, Cheap=0.
If a sample data point has a cheap target value, adjust weights towards Luxury=0, Midrange=0, Cheap=1.
Copyright © Ellis Cohen 2002-2005 63
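A minimal sketch of the forward pass just described: each node combines its inputs with weights (the propagation rule) and applies a sigmoid activation. The layer sizes and weight values are arbitrary illustrative numbers; training them would use back propagation as outlined above, which is omitted here.

```python
import math

def sigmoid(x):
    """Sigmoid activation: out = 1 / (1 + e^(-in))."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each node combines its inputs with weights (the propagation rule),
    then applies the activation function."""
    return [sigmoid(sum(i * w for i, w in zip(inputs, node_w)) + b)
            for node_w, b in zip(weights, biases)]

def forward(inputs, network):
    """Propagate the input values through every layer in turn."""
    values = inputs
    for weights, biases in network:
        values = layer(values, weights, biases)
    return values

# Tiny illustrative network: 2 inputs (age and income, scaled to 0..1),
# 3 hidden nodes, 3 output nodes (luxury / midrange / cheap scores).
network = [
    ([[0.8, -0.4], [0.3, 0.9], [-0.6, 0.5]], [0.1, -0.2, 0.0]),                 # hidden layer
    ([[1.2, -0.7, 0.4], [-0.5, 0.9, 0.2], [0.3, -0.8, 1.0]], [0.0, 0.0, 0.0]),  # output layer
]
print(forward([0.4, 0.7], network))   # three scores; the largest is the predicted class
```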
Clustering
Copyright © Ellis Cohen 2002-2005 64
Clustering: Find groups of objects, [some of] whose features are all similar to one another. Objects: Corvette Buyers. Features: Monthly Income, Age.
[scatter plot of Monthly Income vs. Age showing a tight group of x points] Identify the cluster, e.g. by center & radius.
Copyright © Ellis Cohen 2002-2005 65
Clustering Activities
[activity diagram: Sample Data → Discovery/Modeling → Model; Model + Live Data → Detector/Predictor → Predictions; Forensics/Analyzer → Anomalies]
Modeling: Description of each cluster • Cluster boundaries • For compact clusters: subspaces + centroid + radius
Forensics: Identify and explain outliers (points not in a cluster)
Detection/Prediction: Does live data cluster in the same way?
Copyright © Ellis Cohen 2002-2005 66
Modeling Clusters
Bounds-Based
Cluster 1: [age: 21 ± 2, monthlyIncome: 5000 ± 1000]
Cluster 2: [age: 54 ± 3, monthlyIncome: 6400 ± 1200]
Centroid/Radius-Based
Cluster 1: centroid: [age: 21, monthlyIncome: 5000], radius: .12
Cluster 2: centroid: [age: 54, monthlyIncome: 6400], radius: .15
The Centroid/Distance-Based approach implies that • clusters are circular (too strong) • we need a uniform distance metric (needed anyway)
Copyright © Ellis Cohen 2002-2005 67
Distance Metric
Clustering requires a distance metric: given 2 data points pt1, pt2, compute the distance d( pt1, pt2 ).
Distance in a single dimension: easy for quantitative variables (v2 - v1), harder for categorical variables, hardest for structured variables (e.g. similarity metrics for text, images).
Distance over multiple dimensions: more than just Pythagoras …
Copyright © Ellis Cohen 2002-2005 68
Categorical Variable Distance
Ordinal Variables [ordered]: v2 - v1 doesn't work; use a lookup table or function f(v1,v2)
Nominal Variables [unordered]
– Non-hierarchical [e.g. gender]: d(v1,v2) = 0 if v1 = v2, 1 otherwise
– Hierarchical: use a distance based upon the hierarchy, e.g. d(p1,p2) [p1 and p2 are prodids] = 0 if p1 = p2, else .4 if Prodtyp(p1) = Prodtyp(p2), else .7 if Subcat(p1) = Subcat(p2), else .9 if Category(p1) = Category(p2), 1 otherwise
Copyright © Ellis Cohen 2002-2005 69
Multidimensional Distance
x = (x1,x2,…,xn), y = (y1,y2,…,yn)
Euclidean Distance: d(x,y) = sqrt( Σi (xi - yi)^2 )
What if dimensions aren't commensurate? Scale all dimensions
– Use weights based upon importance, or
– scale so values lie between 0 and 1, or
– d(x,y) = sqrt( Σi ((xi - yi)/σi)^2 ), where σi is the standard deviation of the ith dimension
Copyright © Ellis Cohen 2002-2005 70
Types of Clustering
Partition-Driven (primarily for O.R.)
Partition data points Score: based on compactness Either • Every pt is in a cluster • Minimize # of pt which are not Density-Driven (primarily for Data Mining) Discovering dense collections of data points Find all clusters which have minimum size & density No requirement to include outliers Copyright © Ellis Cohen 2002-2005 71 Partition-Driven Clustering Copyright © Ellis Cohen 2002-2005 72 Partition-Driven Clustering Assume fixed # of clusters deal with variable # of clusters later Different objectives and constraints • Every pt in some cluster Warehouse Problem: k warehouses, each delivering to nearby customers Place warehouse at center of assigned clusters • Every pt in some cluster, per-cluster caps Require every warehouse serve same # of customers, or deliver same # of total units • Not every pt needs to be in a cluster, exclusion penalty No delivery to faraway customers; the delivery cost would be more than the cost of attracting a new customer Copyright © Ellis Cohen 2002-2005 73 K-Means Approach Every pt must be in some partition, no per-cluster caps Decide on k points that will be the initial center of each of the k clusters Repeat Assign each pt to the nearest cluster center Recalculate the centers Until initial center points are same as final center points For faster convergence (esp for 1st & probably 2nd iteration) After assigning a point to a cluster, immediately recalculate that cluster's center Problem: Finds Local Minimum, not Global Minimum Use different starting points, solve each, take best overall result (one with min compactness score) Copyright © Ellis Cohen 2002-2005 74 Per-Cluster Caps Decide on k points that will be the initial center of each of the k clusters Repeat Assign the data points to minimize the score and satisfy the caps! Recalculate the centers Until initial center points are same as final center points Assigning the data points – Use integer programming (exact but slow) – Use greedy method, inexact but fast Suppose scorei(pt) is the score if the pt is assigned to cluster i (small score is best) Hash each pt into a bucket based on max(scorei(pt)) - min(scorei(pt)) Starting with the bucket w biggest values, assign each pt in the bucket to the cluster which minimizes scorei(pt) and still has room Copyright © Ellis Cohen 2002-2005 75 Varying k What if the best value for k isn't known? 
Could just put each pt in a separate cluster If score function is based on distance of pt from cluster center, total score would be zero Balance with cost of having too many partitions Two components of total score: Partitioning Score + Complexity Score Partitioning Score: based on compactness Complexity Score: based on k Overall Algorithm Choose likely k Use Cluster Algorithm with k-1, k, and k+1 => Move k in direction of smallest score (or use larger window to avoid local minima) Done when cost for k is smaller than cost for k+1 and k-1 Copyright © Ellis Cohen 2002-2005 76 Density-Driven Clustering Copyright © Ellis Cohen 2002-2005 77 Density-Driven Clustering Goal: Find interesting clusters Discover dense collections of data points Find all clusters which have minimum size & density No requirement to include outliers Kinds of Density-Driven Clustering Fixed Space Clustering Pick a set of dimensions, and find all the clusters in those dimensions Subspace Clustering A cluster in some set of dimensions need not form a cluster in a subset or superset of those dimensions Given a set of dimensions, find all the clusters in all possible subsets of those dimensions Copyright © Ellis Cohen 2002-2005 78 Fixed-Space Clustering Approaches Partition-Based Find partition with best k, but using a scoring not hurt by outliers, then eliminate outliers Division-Based Start with one partition containing entire dataset Iteratively separate non-uniformly dense partitions into subpartitions Agglomeration-Based Divide data set into large # of small partitions (either use partition-based clustering, or each initial partition has exactly one data point) Iteratively combine together dense nearby partitions Grid-Based Agglomeration-Based, where initial partitions are cells of an n-dimensional grid Copyright © Ellis Cohen 2002-2005 79 Partitioning for Density-Driven Clustering Use partitioning approach e.g. Variable k-Means Use different score not dominated by outliers e.g. instead of d2, use S curve -- d2 / ( m2 + d2 ) [fixed m] Throw outliers away After finding clusters Then possibly re-cluster Copyright © Ellis Cohen 2002-2005 80 Misclustering x x xx x x x x xx x xx x x x xx xx x xx x x x x xx x x xx x xx x x x x x x xxxx xx x x x xx x x x xx xx xx x xx x xx x x xx xx x xx xx xx xxx x x x xx xxx What if you have the right k, but the wrong clusters? Use Subclustering Use Agglomeration Subclustering Pick k' initial pts dispersed in cluster (& vary k') Cluster around those. Compare results. In general (but not in this example), adjusting clusters (by subclustering + agglomeration) will change k Leads to more complex algorithm: Intersperses finding k with adjusting clusters Copyright © Ellis Cohen 2002-2005 81 Pure Agglomerative Clustering Initially put each pt in a separate cluster Repeat Find two nearest clusters & merge them Until no clusters are sufficiently close Throw away clusters which are too small (treat as outliers) Needs a notion of inter-cluster distance Copyright © Ellis Cohen 2002-2005 82 Inter-Cluster Distance Metrics Closest Neighbor Find two points, one in each cluster, which are nearest together Can lead to chaining: long narrow clusters. 
This may be acceptable, even desirable Variant: increases compactness k Closest: average of k nearest neighbors k might depend upon size of cluster Farthest Neighbor Find two points, one in each cluster, which are farthest together Leads to very compact clusters Centroid Distance Distance between cluster centroid Keeps reasonably compact Fastest approach Note: May also want to scale the metric based upon density and size of the two clusters Copyright © Ellis Cohen 2002-2005 83 Agglomeration & Scalability Performance of agglomerative clustering doesn't scale well especially expensive if pts don't all fit into main memory Do in 2 steps: 1. Divide into k clusters [pick a large k] Use k-means (where cost not dominated by outliers) 2. Use Agglomerative Clustering starting with those clusters Use centroid inter-cluster distance metric Copyright © Ellis Cohen 2002-2005 84 Grid-based Clustering Divide the space into a grid, with each of the n dimensions divided into d divisions (adjust for non-continuous variables) Control parameters d: number of parts each dimension is divided into p: min # of pts in a cell for it to be an interesting cluster 64 3D cells n = 3, d = 4 1. Divide each space into cells 2. Mark the cells which have at least p points 3. Agglomerate the marked cells 4. (Optional) Use the agglomerated cell clusters as the starting point for another clustering approach. Copyright © Ellis Cohen 2002-2005 85 Subspace Clustering Copyright © Ellis Cohen 2002-2005 86 Clustering & Subspaces Not all clusters involve all variables Cluster 1: [subcatid: Convenience, age: 16 2, married: FALSE] Cluster 2: [subcatid: Convenience, allowance: 500 75, nearestStoreDistance: .1 .1] Cluster 3: [subcatid: Sofas, familyIncome: 6000 1000, vacationSpend: 2000 400] The variables involved in each cluster form a subspace Cluster 1: [ subcatid, age, married ] Cluster 2: [ subcatid, allowance, nearestStoreDistance ] Cluster 3: [ subcatid, familyIncome, vacationSpend ] Fixed space clustering only works if you KNOW, in advance, which group of dimensions you should examine for clusters. But often, you don't know. What do you do? 
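Before turning to subspace clustering, here is a minimal sketch of the pure agglomerative procedure from the preceding slides, using the centroid inter-cluster distance and assuming the dimensions have already been scaled so plain Euclidean distance is meaningful. The points, merge threshold, and minimum cluster size are illustrative.

```python
import math

def centroid(cluster):
    """Mean of the points in a cluster, dimension by dimension."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_distance(c1, c2):
    """Centroid inter-cluster distance metric."""
    a, b = centroid(c1), centroid(c2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, max_distance, min_size=2):
    """Start with one cluster per point; repeatedly merge the two nearest
    clusters until no pair is closer than max_distance; drop tiny clusters
    (treated as outliers)."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # find the closest pair of clusters
        pairs = [(centroid_distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [c for c in clusters if len(c) >= min_size]

# Hypothetical scaled (age, income) points: two visible groups plus one outlier.
points = [(0.20, 0.31), (0.22, 0.30), (0.19, 0.33), (0.75, 0.80),
          (0.78, 0.82), (0.74, 0.79), (0.05, 0.95)]
for cluster in agglomerate(points, max_distance=0.15):
    print(centroid(cluster), len(cluster), "points")
```

Scaling this up is exactly the two-step recipe above: pre-cluster with k-means into many small clusters, then run this agglomeration on the cluster centroids.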
Copyright © Ellis Cohen 2002-2005 87
Subspace Clustering
Consider every possible subspace, and find clusters in each subspace.
Grid-based implementation:
Find all the dense cells in each single dimension separately.
To find all dense cells in n+1 dimension subspaces, combine the dense cells in the n dimension subspaces with the dense cells in each single dimension (this is pretty efficient).
In each subspace with dense cells, agglomerate to find clusters.
Copyright © Ellis Cohen 2002-2005 88
Counting Pts In Subspaces
1 3D space {age, income, housing}: 64 3D cells (n = 3, d = 4)
3 2D subspaces: {age, income} {age, housing} {income, housing}
3 1D subspaces: {age} {income} {housing}
To count the # of pts in a 2D cell, aggregate the counts in the corresponding 3D cells.
Copyright © Ellis Cohen 2002-2005 89
Partial Projection-Based Merging
[plots: 10 1D cells per dimension, 100 2D cells over age and income] Clusters in one space might merge in some subspaces.
Copyright © Ellis Cohen 2002-2005 90
Total Projection-Based Merging
[plots over age and income] Clusters in one space might merge in all subspaces.
Copyright © Ellis Cohen 2002-2005 91
Total Projection-Based Hiding
Clusters in one space might even be hidden in all subspaces. You cannot just look in lower dimension spaces to find clusters in higher dimension spaces.
Copyright © Ellis Cohen 2002-2005 92
Superspace Dispersion
Clusters in a subspace may be dispersed in a superspace. [plots: 10 1D cells per dimension, 100 2D cells] You cannot just look in higher dimension spaces to find clusters in lower dimension spaces.
Copyright © Ellis Cohen 2002-2005 93
Subspaces & Cells
64 3D cells, n = 3, d = 4
3D subspaces with 4^3 cells each: C(3,3) = 1 subspace, 64 cells each, 64 total cells
2D subspaces with 4^2 cells each: C(3,2) = 3 subspaces, 16 cells each, 48 total cells
1D subspaces with 4 cells each: C(3,1) = 3 subspaces, 4 cells each, 12 total cells
Copyright © Ellis Cohen 2002-2005 94
Smaller High Dimensional Cells
n = 20, d = 10
1D subspaces with 10 cells each: C(20,1) = 20 subspaces, 10 cells each, 200 total cells
2D subspaces with 10^2 cells each: C(20,2) = 190 subspaces, 100 cells each, 19,000 total cells
3D subspaces with 10^3 cells each: C(20,3) = 1140 subspaces, 1000 cells each, 1,140,000 total cells
~1 million subspaces total
Copyright © Ellis Cohen 2002-2005 95
Monotonicity
[figure: 64 3D cells, n = 3, d = 4, over age, income, housing]
If a cell in k+1 dimensions contains p points, then every k-dimensional cell which aggregates it must also contain p points. So: to find the 3D cells with p points, you only need to consider cells all of whose 2D aggregates contain p points. E.g. a cell in the subspace {age, income, housing} only contains p points if the corresponding cells in {age, income}, {age, housing} and {housing, income} all contain p points.
Copyright © Ellis Cohen 2002-2005 96
Subspace Clustering Algorithm
In every subspace, find all cluster cells (i.e. dense cells with at least p points)
– Find all 1D cells with at least p points
– Use monotonicity to efficiently find all 2D cells with at least p points
– Continue using monotonicity to efficiently find all cells with at least p points in all subspaces
– Discard cells in low dimensional subspaces which are not sufficiently dense
For each subspace which has cluster cells
– Agglomerate the marked cells
– (Optional) Use the agglomerated cell clusters as the starting point for another clustering approach
Copyright © Ellis Cohen 2002-2005 97
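A rough sketch of the grid-based, bottom-up search just outlined. It counts points per cell in each 1D subspace, then extends only subspaces whose lower-dimensional subspaces already produced dense cells; note that it prunes per subspace rather than per candidate cell, a simplification of the monotonicity-based pruning described above. Dimension names, grid size d, and threshold p are illustrative.

```python
from collections import Counter
from itertools import combinations

def cell_of(point, dims, d=4):
    """The grid cell of `point` restricted to the subspace `dims`:
    a tuple of (dimension, division-index) pairs, assuming values in [0, 1)."""
    return tuple((dim, int(point[dim] * d)) for dim in dims)

def dense_cells(points, dims, p, d=4):
    """All cells in subspace `dims` that contain at least p points."""
    counts = Counter(cell_of(pt, dims, d) for pt in points)
    return {cell for cell, n in counts.items() if n >= p}

def subspace_dense_cells(points, all_dims, p, d=4, max_dim=3):
    """Bottom-up search: only explore a k-dimensional subspace when every one
    of its (k-1)-dimensional subspaces already holds dense cells (monotonicity)."""
    result = {}
    for k in range(1, max_dim + 1):
        for dims in combinations(all_dims, k):
            if k > 1 and any(sub not in result for sub in combinations(dims, k - 1)):
                continue  # some lower-dimensional projection had no dense cells
            cells = dense_cells(points, dims, p, d)
            if cells:
                result[dims] = cells
    return result

# Hypothetical points with coordinates scaled to [0, 1) in three dimensions.
points = [{"age": 0.21, "income": 0.30, "housing": 0.55},
          {"age": 0.22, "income": 0.31, "housing": 0.10},
          {"age": 0.23, "income": 0.29, "housing": 0.90},
          {"age": 0.80, "income": 0.85, "housing": 0.40},
          {"age": 0.81, "income": 0.84, "housing": 0.42},
          {"age": 0.79, "income": 0.86, "housing": 0.41}]
for dims, cells in subspace_dense_cells(points, ("age", "income", "housing"), p=3).items():
    print(dims, cells)
```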
Density Requirements
Suppose we have 10,000 data points and 20 dimensions.
d: 10 (divide each dimension into 10ths)
p: 50 (need 50 data pts in a cell)
density factor: 4 (a cell must be at least 4 times as dense as the average)

# dims                                     1      2      3      4       5
# cells                                   10    100   1000  10000  100000
avg # pts per cell                      1000    100     10      1      .1
req # pts per cell (density factor)     4000    400     40      4       1
req # pts per cell (density factor & p) 4000    400     50     50      50

Copyright © Ellis Cohen 2002-2005 98
Targeted Clustering & Association Rules
Copyright © Ellis Cohen 2002-2005 99
Exploratory vs Targeted Clustering
Exploratory Clustering: Find clusters involving an arbitrary set of variables
Targeted Clustering: Find clusters among a set of variables which include the target variable (possibly restricted to a particular value)
Copyright © Ellis Cohen 2002-2005 100
Single Value Targeted Clustering
Suppose • our sample dataset consists of car buyers • we want to find clusters of car buyers who bought luxury cars
1. Restrict the sample dataset to just those tuples where CarType = "luxury"
2. Use clustering on this restricted dataset
Copyright © Ellis Cohen 2002-2005 101
Multiple Value Targeted Clustering
Suppose • our sample dataset consists of car buyers • we want to find clusters of car buyers who bought the various categories of cars
Use Single Value Targeted Clustering for CarType = "luxury", then again for CarType = "midrange", then again for CarType = "cheap"
Copyright © Ellis Cohen 2002-2005 102
Clustering vs Classification
Can't we use Multiple Value Targeted Clustering to do Classification? Find the clusters where CarType = "luxury", CarType = "midrange", and CarType = "cheap". Use the clusters as the model and to predict the value of live data. WILL THIS WORK OR NOT?
Copyright © Ellis Cohen 2002-2005 103
Clustering Does NOT Classify
Clusters do not cover the space. Clusters only identify dense regions of objects. The bulk of the space that a decision tree would assign to cheap car buyers probably does NOT hold dense clusters of them, so it would not be included in the clustered model. Clusters for different target values may overlap.
[scatter plot: an x cluster and an o cluster overlapping]
Copyright © Ellis Cohen 2002-2005 104
Association Rules
The region corresponding to a cluster may include other data points whose target values differ from the targeted cluster value.
[scatter plot of Monthly Income vs. Age: an o cluster containing a few x points]
Cluster: [age: 26 ± 3, monthlyIncome: 3000 ± 1000]
The pct of data pts within the cluster region with target value 'midrange' (i.e. with symbol 'o') is the same as the CONFIDENCE of the association rule (23 ≤ age ≤ 29) Λ (2000 ≤ monthlyIncome ≤ 4000) → carType = 'midrange'
Copyright © Ellis Cohen 2002-2005 105
Cluster Confidence
Depending upon the data mining application, confidence may or may not matter.
Problem: Come up with applications where the confidence of clusters – must be > 80% – must be > 30% – doesn't matter Copyright © Ellis Cohen 2002-2005 106 Requiring Confident Clusters Suppose we are interested in regions where enough data points are clustered together (i.e. with good support), but where a minimum confidence w.r.t. a targeted value is required. Is there any alternative to simply discarding clustered regions with low confidence? Copyright © Ellis Cohen 2002-2005 107 Split Clusters o ox o oo x o o ox o oo xo oo oo x ox o Use decision-tree style splitting of the data points within the cluster to best separate the 'o' valued points from the non-'o' valued points If the 'o' region has too few data points (too little support), tough luck If it has adequate support & confidence, done! If it has adequate support, but has inadequate confidence, split it again … (If the non-'o' regions are large enough, they can also potentially be split to find 'o' subregions) Copyright © Ellis Cohen 2002-2005 108 Quantitative Target Variables If a target variable is quantitative, especially if it is continuous, how can targeted clustering be done? For example, how do you find clusters of car buyers who spent approximately the same amount on a car? Copyright © Ellis Cohen 2002-2005 109 Quantitative Targeted Clustering How do you find clusters of car buyers who spent approximately the same amount on a car? Do standard clustering, just require that the variables used always INCLUDE the target variable! Copyright © Ellis Cohen 2002-2005 110 Market Basket Analysis Copyright © Ellis Cohen 2002-2005 111 Market Basket Analysis Organize dataset into baskets. Find groups of items which frequently occur together in baskets 11-Feb-99 11-Feb-99 11-Feb-99 … 11-Feb-99 11-Feb-99 … 13-Feb-99 13-Feb-99 Rules capture causality Joe Joe Joe Diapers Formula Beer Simha Pretzels Simha Beer Sasha Diapers Sasha Beer Basket: Daily Shopping by a Customer Diapers and beer occur together frequently Item: Product purchased NO! People who buy beer are not more likely Beer Diapers? to buy diapers YES! People who buy diapers are more likely Diapers Beer? to buy beer (esp men at night) Copyright © Ellis Cohen 2002-2005 112 Market Basket Activities Forensics Analyzer Anomalies Model Discovery Modeling Sample Data Live Data Detector/ Predictor Detection Prediction Predictions Modeling Identify good sets of rules (w statistics) Forensics Understand why other groups of items are NOT related Detection/Prediction Does live data follow same rules with same statistics Copyright © Ellis Cohen 2002-2005 113 Baskets In order to use market basket analysis, you must first divide the dataset into baskets. Baskets are specified as a group of variables (possibly derived). The actual baskets are obtained by grouping the dataset by these variables (e.g. date/customer). The first step of market basket analysis is deciding which variables define the baskets. Copyright © Ellis Cohen 2002-2005 114 Items Market Basket analysis looks for groups of items which frequently appear together in a basket. An item is determined by a variable (or set of variables). Each different value for that variable (or variables) determines a different item (e.g. productPurchased). 
The second step of market basket analysis is determining which variable(s) are used to identify the items.
Copyright © Ellis Cohen 2002-2005 115
Market Basket Discovery
1) Find frequent itemsets
2 items that appear together frequently are interesting
3 items that appear together frequently are really interesting { charcoal, chicken, bbq sauce }
4 or more items, really really interesting
2) Find rules that characterize causality
Diapers → Beer, but not Beer → Diapers. Think in terms of which item you would give a coupon for.
Copyright © Ellis Cohen 2002-2005 116
Apriori Algorithm
Find all itemsets which appear in at least n baskets.
Use the Apriori or Monotonicity principle: If a set of items S is frequent (i.e. appears in at least n baskets), then every subset of S is frequent.
Call L1 the items which appear in at least n baskets.
Consider all combos of 2 items {A,B}, both from L1. Call L2 those which appear in at least n baskets.
Consider all combos of 3 items {A,B,C}, where {A,B}, {A,C}, and {B,C} are in L2. Call L3 those which appear in at least n baskets.
Consider all combos of 4 items {A,B,C,D}, where {A,B,C}, {A,B,D}, {A,C,D} and {B,C,D} are in L3. Call L4 those which appear in at least n baskets.
The frequent itemsets are L1 + L2 + …
Copyright © Ellis Cohen 2002-2005 117
DB Implementation of Apriori
One scan through the DB to get the frequent items:
CREATE TABLE HotItems AS
  SELECT item FROM Purchases
  GROUP BY item HAVING count(*) >= n
Another scan through the DB looking for pair itemsets (repeat for size-n itemsets):
WITH HotPurchases AS
  (SELECT * FROM Purchases NATURAL JOIN HotItems)
SELECT P1.item, P2.item
FROM HotPurchases P1, HotPurchases P2
WHERE P1.basket = P2.basket AND P1.item < P2.item
GROUP BY P1.item, P2.item
HAVING count(*) >= n
Copyright © Ellis Cohen 2002-2005 118
Apriori Scalability
Obtaining frequent pair itemsets: If HotItems can be kept in memory allowing rapid lookup (sorted list or hashtable), FP (frequent pair) itemsets can be obtained in one linear pass through the DB.
Obtaining frequent size-n itemsets:
1. Use a separate linear scan through the DB for each size up to n. Slow.
2. On the second scan, don't just count pairs; instead build a memory-based FP-Tree, which can be used to find all frequent itemsets of any size. But we often only care about frequent pair itemsets.
Copyright © Ellis Cohen 2002-2005 119
Clustering & Market Basket Analysis
Market Basket Analysis is a form of clustering
• Turn each basket into a single LARGE data item.
– Each LARGE data item has a separate boolean variable for each possible item that can be in a basket. For example – Beer, diapers, etc.
are separate variables – A LARGE data item's beer value is TRUE if the basket it came from had a beer • In the original dataset, we look for k-element itemsets which appear in p or more baskets (using apriori) This is equivalent to using the LARGE item dataset and – using the subspace clustering algorithm to look for k-dimensional cells with p or more points Copyright © Ellis Cohen 2002-2005 120 Scoring Market Basket Analysis Copyright © Ellis Cohen 2002-2005 121 Support Cereal Beer Support( s ) = # of baskets containing S / # of total baskets 40 1000 2000 4000 Support { Beer } = 1000/4000 = 25% Support { Cereal } = 2000/4000 = 50% Support { Beer, Cereal } = 40/4000 = 1% Support: How significant is this itemset In a supermarket, anything over .1% might be significant Given the # of total baskets, the minimum interesting support determines n for the Apriori algorithm Copyright © Ellis Cohen 2002-2005 122 Confidence Cereal Confidence( A B ) = Support( A & B ) / Support( A ) Beer 40 1000 2000 4000 Confidence( A B ) = # of baskets containing A & B / # of baskets containing A Confidence( Beer Cereal ) = 40/1000 = 4% Confidence( Cereal Beer ) = 40/2000 = 2% Confidence( A B ): If a basket has A, how likely is it that the basket also will have B (i.e. how confident are we that A predicts B) If this is low (say, less than 30%), it is not very interesting, since the two items don't correlate Copyright © Ellis Cohen 2002-2005 123 High Support & Confidence Milk Beer 400 1000 2000 4000 Support { Beer } = 1000/4000 = 25% Support { Milk } = 2000/4000 = 50% Support { Beer, Milk } = 400/4000 = 10% WOW! Confidence( Milk Beer ) = 400/2000 = 20% Confidence( Beer Milk ) = 400/1000 = 40% High Confidence, so potentially interesting BUT 40% < 50% the pct who buy milk anyway So giving milk coupons to beer buyers is probably not the most useful thing to do Copyright © Ellis Cohen 2002-2005 124 Lift Milk Lift( A B ) = Confidence( A B ) / Support( B ) = Lift( B A ) Beer 400 1000 2000 4000 Support { Beer } = 1000/4000 = 25% Support { Milk } = 2000/4000 = 50% Support { Beer, Milk } = 400/4000 = 10% WOW! Confidence( Milk Beer ) = 400/2000 = 20% Confidence( Beer Milk ) = 400/1000 = 40% OK! Lift( A B ): How much does A help B Lift( Beer Milk ) = 40% / 50% = .8 If lift < 1, then it doesn't help at all! Copyright © Ellis Cohen 2002-2005 125 Good Lift Diapers Lift( A B ) = Confidence( A B ) / Support( B ) = Lift( B A ) Beer 80 1000 200 4000 Support { Beer } = 1000/4000 = 25% Support { Diapers } = 200/4000 = 5% Support { Beer, Diapers } = 80/4000 = 2% OK! Confidence( Beer Diapers ) = 80/1000 = 8% Confidence( Diapers Beer ) = 80/200 = 40% OK! Lift( Diapers Beer ) = 40% / 25% = 1.6 Note: Lift can be useful in clustering situations as well Copyright © Ellis Cohen 2002-2005 126 Support, Confidence & Lift AB Support( A & B ) How important is the rule: What percent of baskets have both A & B? Confidence( A B ) How likely is it that baskets which contains A also contains B. In general, should be at least 35%. Lift( A B ) If we know that a basket contains A, how much surer are we that the basket contains B than if we didn't know what else what in the basket. Must be > 1; probably should be at least 1.3. Copyright © Ellis Cohen 2002-2005 127 Hierarchical Categories Do apriori with values at each category level Whole Wheat Bread Skim Milk, but not Bread Milk or vice versa! 
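The support, confidence, and lift scores from the preceding slides follow directly from basket counts. The sketch below computes them from a small made-up list of baskets using exactly those formulas (frequent itemsets would be found first with the Apriori scan described earlier).

```python
def support(baskets, itemset):
    """Support(S) = # of baskets containing every item in S / # of total baskets."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, a, b):
    """Confidence(A -> B) = Support(A & B) / Support(A)."""
    return support(baskets, set(a) | set(b)) / support(baskets, a)

def lift(baskets, a, b):
    """Lift(A -> B) = Confidence(A -> B) / Support(B); symmetric in A and B."""
    return confidence(baskets, a, b) / support(baskets, b)

# Hypothetical daily-shopping baskets (sets of purchased items).
baskets = [{"diapers", "formula", "beer"}, {"pretzels", "beer"},
           {"diapers", "beer"}, {"milk", "cereal"},
           {"milk", "beer"}, {"milk", "bread"},
           {"cereal", "bread"}, {"diapers", "beer", "milk"}]

print("support({beer, diapers}) =", support(baskets, {"beer", "diapers"}))
print("confidence(diapers -> beer) =", confidence(baskets, {"diapers"}, {"beer"}))
print("confidence(beer -> diapers) =", confidence(baskets, {"beer"}, {"diapers"}))
print("lift(diapers -> beer) =", lift(baskets, {"diapers"}, {"beer"}))
```

With these toy baskets the diapers → beer rule scores much better than beer → diapers, mirroring the asymmetry discussed on the slides.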
For scalability, you can initially include only higher-level categories, then split the itemsets with high levels of support.
Copyright © Ellis Cohen 2002-2005 128
Rules for Larger Itemsets
For { A, B, C, D } consider
A, B, C → D
A, B, D → C
A, C, D → B
B, C, D → A
Diapers, ChildrensTylenol → Beer may have less support than Diapers → Beer, but may well have higher confidence and higher lift.
Copyright © Ellis Cohen 2002-2005 129
Incorporating Other Variables
Diapers, gender:male, time:[8pm:1am] → Beer will also have less support than Diapers → Beer, but will almost certainly have higher confidence & lift. Remember, this is just subspace clustering with more variables.
Copyright © Ellis Cohen 2002-2005 130
Scoring & Data Preparation
Copyright © Ellis Cohen 2002-2005 131
Model Scoring
Descriptive Scoring: How good is the resulting model w.r.t. the sample data? E.g. estimation: does the model function do a good job of actually estimating the target variable of the sample data? E.g. clustering: Do the clusters overlap? Are there unidentified subclusters?
Predictive Scoring: How good is the resulting model w.r.t. live data? E.g. estimation: Does the model function do a good job of actually estimating the target variable of live data? May not know right away, or ever …
Copyright © Ellis Cohen 2002-2005 132
Causes of Scoring Failure
Descriptive Scoring Failures
Noisy, approximate or erroneous data
Curse of Dimensionality: too many variables
Variables inadequate for finding good patterns • Clustering: Limited clustering with the given variables • Classification: Limited separation relative to the given variables • Mkt Basket: Baskets are too coarse or too fine
Approach doesn't match the sample data
Predictive Scoring Failures
Environment Change • Calls for continuous mining & trend analysis
Overfitting • The model does not just identify the persistent trends embodied in the data • The model also captures idiosyncratic random variations of the raw data
Copyright © Ellis Cohen 2002-2005 133
Data Cleaning
Missing Small Amounts of Data: Not a problem for many approaches. Use estimation to fill in missing values.
Noisy or Approximate Data: Use smoothing. You lose the ability to discover some real patterns & relationships, but it avoids overfitting & the discovery of non-causal relationships.
Erroneous Data: Forensic activity to discover outliers. Distinguish (if possible) between anomalous and erroneous outliers; remove/fix the erroneous ones.
Copyright © Ellis Cohen 2002-2005 134
Curse of Dimensionality
Too many dimensions – can cause scoring failures – can make data mining significantly more expensive
Choose a subset of the possible dimensions – For targeted data mining: choose the ones which contribute most to the value of the target variable
Copyright © Ellis Cohen 2002-2005 135
Attribute Importance
Auto Sales [scatter plot of Monthly Income vs. Age; the size of each O is proportional to the price paid]
Suppose you wanted to generate an estimator for the price a new customer will pay for a car. Your data mining tool can define an estimator that only works given a single variable (unlikely, but bear with me …). Would you prefer to estimate price as f1(age) or f2(monthlyIncome)?
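One simple way to rank attribute importance for a quantitative target, not necessarily what any particular tool does, is to compare how strongly each predictor correlates with the target; a predictor with near-zero correlation is a poor basis for a single-variable estimator. The customer rows below are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical customers: price paid tracks income much more closely than age.
rows = [
    {"age": 23, "income": 2500, "price": 9000},
    {"age": 51, "income": 2700, "price": 10500},
    {"age": 34, "income": 5200, "price": 21000},
    {"age": 28, "income": 6400, "price": 27000},
    {"age": 60, "income": 8100, "price": 35000},
    {"age": 45, "income": 3900, "price": 16000},
]
target = [r["price"] for r in rows]
for var in ("age", "income"):
    values = [r[var] for r in rows]
    print(var, round(pearson(values, target), 3))
# income should rank far above age, answering the f1(age) vs f2(monthlyIncome) question
```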
Copyright © Ellis Cohen 2002-2005 136 Feature Selection Cumulative attribute importance monthlyIncome 42% vacationSpend 64% numDepends 72% dateOfBirth 81% gender 90% … … … 99% … … … 100% age 100% Feature selection can itself be considered a type of data mining Orders predictor variables in terms of the amount to which they contribute to values of target variables Users can do data mining with just attributes which cover some % of importance Can lose patterns, esp cluster-related Extend feature selection to take changes of density into account. Copyright © Ellis Cohen 2002-2005 137 Principal Components Analysis Combination of – Feature Selection – Transformation (Linear Combination) Defines a new set of variables Each is linear combination of predictor variables e.g. v1 = .05*allowance + .01*vacationSpend Replace the original predictor variables Goal Choose variable definitions to get the best possible effect from feature selection A small number of the new variables almost completely determines the values of the target variables Pro: fewer dimensions Con: less understandable system Copyright © Ellis Cohen 2002-2005 138 Goals of Data Mining Find patterns and relationships among objects represented by data points in a dataset Novel / Significant People who buy notebooks buy pencils vs. Men who buy diapers at night also buy beer Understandable / Useful Tradeoff between accuracy and simplicity Causality vs Chance All data has random variations, which can show up as spurious patterns and relationships Copyright © Ellis Cohen 2002-2005 139
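Finally, a compact sketch of the principal-components idea from the Scoring & Data Preparation discussion: the eigenvectors of the predictors' covariance matrix define new variables (linear combinations of the original predictors) such that a few of them capture most of the variance. The data is illustrative and uses numpy; a real analysis would typically standardize the variables first.

```python
import numpy as np

# Hypothetical predictor matrix: one row per customer,
# columns = (allowance, vacationSpend, monthlyHousing).
X = np.array([
    [900.0, 1800.0, 650.0],
    [1200.0, 2600.0, 700.0],
    [400.0, 900.0, 620.0],
    [1500.0, 3100.0, 780.0],
    [700.0, 1500.0, 640.0],
])

Xc = X - X.mean(axis=0)                 # center each variable
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the predictors
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix, eigenvalues ascending
order = np.argsort(eigvals)[::-1]       # largest-variance components first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print("variance explained by each component:", np.round(explained, 3))

# New variables = linear combinations of the original predictors;
# keeping just the first column or two replaces the original three variables.
components = Xc @ eigvecs
print(np.round(components[:, :2], 1))
```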