Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Knowledge discovery & data mining Clustering & customer segmentation Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000 Module outline The clustering task Application to a case-study in customer segmentation Quick overview of main clustering techniques EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 2 Problem 1: customer segmentation Given: Large data base of customer data containing their properties and past buying records Goal: Find groups of customers with similar behavior Find customers with unusual behavior EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 3 Problem 2: CAD databases Given: Large data base of CAD data containing abstract feature vectors Goal: Find homogeneous groups of similar CAD parts Determine standard parts for each group Use standard parts instead of special parts reduction of the number of parts to be produced EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 4 Clustering: general problem description Given: A data set with N d-dimensional data items. Find: a natural partitioning of the data set into a number of clusters (k) and noise. Clusters should be such that: items in same cluster are similar intra-cluster similarity is maximized items from different clusters are different inter-cluster similarity is minimized EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 5 Clustering = unsupervised classification No predefined classes! Used either as a stand-alone tool to get insight into data distribution or as a preprocessing step for other algorithms. Help to understand how objects in a data set tend to naturally group together Cluster: a number of things of the same kind being close together in a group (Longman dictionary of contemporary English) EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 6 Research perspective From the past … clustering is a well-known problem in statistics more recent research in machine learning databases visualization … to the future effective and efficient clustering algorithms for large high-dimensional data sets with high noise requires scalability with respect to the number of data points (N) the number of dimensions (d) the noise level EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 7 Basic methods Partitioning methods k-means, k-medoids Hierarchical methods Agglomerative/divisive, BIRCH, CURE Linkage-based methods Density-based methods DBSCAN, DENCLUE Statistical methods IBM-IM demographic clustering, COBWEB EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 8 A small example Source: DM@CINECA http://open.cineca.it/datamining/dmCineca/index.htm Domain: technology watch - a.k.a. competitive intelligence Which are the emergent technologies? Which of my competitors are investing on them? In which area are my competitors active? Which area will my competitor drop in the near future? Data sources: public (on-line) databases EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 9 The Derwent database Contains all patents filed worldwide in last 10 years Searching this database by keywords may yield thousands of documents Derwent documents are semi-structured: many long text fields Goal: analyze Derwent documents to build a model of competitors’ strategy EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 10 Structure of Derwent documents EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 11 Example dataset Patents in the area: patch technology (medical plaster) 105 companies from 12 countries 94 classification codes 52 Derwent codes EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 12 Clustering output EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 13 Zoom on cluster 2 EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 14 Zoom on cluster 2 - profiling competitors EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 15 Activity of competitors in the clusters EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 16 Temporal analysis of clusters EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 17 Module outline The clustering task Application to a case-study in customer segmentation Quick overview of main clustering techniques EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 18 A case-study on customer segmentation Source: Gary Saarenvirta, “Mining customer data”. DB2 magazine on line, 1998 http://www.db2mag.com/98fsaar.html Gratefully acknowledged! EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 19 Customer segmentation Task: use customer-purchase transaction data to track buying behavior create strategic business initiatives divide customers into segments based on shareholder value variables: customer profitability, measure of risk, measure of the lifetime value of a customer, retention probability. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 20 Attractive customer segments high-profit, high-value, and low-risk customers highly attractive: typically 10% to 20% of customers who create 50% to 80% of a company's profits strategic initiative for the segment is retention low-profit, high-value, and low-risk customers attractive strategic initiative for the segment is to increase profitability cross-selling (selling new products) up-selling (selling more of what customers currently buy) EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 21 Behavioral vs. demographic segments Within behavioral segments: demographic subsegments. Customer demographic data are not used to create segments. Demographic (sub)segmenting is used to select appropriate tactics (advertising, marketing channels, and campaigns) to implement the strategic initiatives. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 22 AIR MILES is a Canadian reward program for a coalition of more than 125 companies in all industry sectors finance, credit card, retail, grocery, gas, telecom. 60% of Canadian households enrolled is a frequent-shopper program: consumers collect bonuses that can be redeemed for rewards (air travel, hotel accommodation, theatre/sport tickets, …) EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 23 Data capture The coalition partners capture consumer transactions and transmit them to AIR MILES, which stores these transactions and uses the data for database marketing initiatives. AIR MILES data warehouse contained (at the date of the project) more than 6.3 million household records 1 billion transaction records. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 24 Before data mining Analytical techniques used: Recency, Frequency, Monetary value (RFM) analysis online analytic processing tools linear statistical methods to analyze the success of the various marketing initiatives undertaken by the coalition. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 25 Data mining project at AMRP Goal: create a customer segmentation using a data mining tool and compare the results to an existing segmentation developed using RFM analysis. data mining platform IBM - DB2 Universal Database Enterprise parallelized over a five-node RS/6000 SP system. IBM - Intelligent Miner for Data EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 26 Dataset ~ 50,000 customers and their associated transactions for a 12month period. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 27 Data preparation “shareholder value” variables revenue customer tenure number of sponsor companies shopped at over the customer tenure number of sponsor companies shopped at over the last 12 months, recency (in months) of the last transaction calculated by aggregating the transaction data and then adding then to each customer record 84 variables = 14 categories of sponsor companies 3 variables per category 2 quarters (first two quarters of 1997) EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 28 Data cleaning - missing values demographic data is usually categorical has a high % of missing values the missing values can be set to either unknown or unanswered (if result of unanswered questions) if a large portion of the attribute is missing, it may be discarded in the case study, missing numeric values set to 0 EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 29 Data transformation Ratio variables. E.g.: profitability = profit / tenure Time-derivative variables. E.g.: profit 2nd quarter - profit 1st quarter Discretization using quantiles. E.g., break points at 10, 25, 50, 75, and 90. Discretization using predefined ranges. E.g., those used in census Log transforms. E.g., for very skewed distributions EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 30 Distribution of original data EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 31 Distribution of discretized data EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 32 Before/after discretization EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 33 Clustering/segmentation methodology EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 34 IBM-IM demographic clustering Designed for categorical variables Similarity index: increases with number of common values on same attribute decreases with number of different values on same attribute # of clusters is not fixed a priori only upper bound set EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 35 Demographic clustering: data structure EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 36 Demographic clustering: parameters EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 37 Demographic clustering: similarity index proportional to 1-1 (# of common 1’s) inversely proportional to 0-1 and 1-0 unaffected by 0-0 (# of common 0’s) Condorcet index= N11 / (N11 + 1/2(N01 + N10)) Dice index= N11 / (N11 + 1/4(N01 + N10)) Dice looser then Condorcet appropriate with highly different objects Indexes generalize from objects to clusters EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 38 Demographic clustering: similarity index Similarity threshold i,j assumed similar if s(i,j)> low values (<0.5) appropriate with highly different objects Weights for attributes importance of attributes in the similarity index may be varied with different weights default weight = 1 EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 39 IM Demographic clustering basic parameters to set: Maximum number of clusters. Maximum number of passes through the data. Accuracy: a stopping criterion for the algorithm. If the change in the Condorcet criterion between data passes is smaller than the accuracy (as %), the algorithm will terminate. In the case study: maximum # of clusters: 9 maximum # of passes: 5 accuracy: 0.1 EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 40 Input dataset dataset: all continuous variables discretized. input variables : # of products purchased over customer’s lifetime # of products purchased in the last 12 months Customer's revenue contribution over lifetime Customer tenure in months Ratio of revenue to tenure Ratio of number of products to tenure Region Recency Tenure (# of months since customer first enrolled in the program). EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 41 Input dataset Other discrete and categorical variables and some interesting continuous variables were input as supplementary variables: variables used to profile the clusters but not to define them. easier interpretation of clusters using data other than the input variables. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 42 Output of demographic clustering EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 43 Visualization of clusters horizontal strip = a cluster clusters are ordered from top to bottom in order of size variables are ordered from left to right in order of importance to the cluster, based on a chi-square test between variable and cluster ID. other metrics include entropy, Condorcet criterion, and database order. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 44 Visualization of clusters variables used to define clusters are without brackets, while the supplementary variables appear within brackets. numeric (integer), discrete numeric (small integer), binary, and continuous variables have their frequency distribution shown as a bar graph. red bars = distribution of the variable within the current cluster. gray solid bars = distribution of the variable in the whole universe. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 45 Visualization of clusters Categorical variables are shown as pie charts. inner pie = distribution of the categories for the current cluster outer ring = distribution of the variable for the entire universe. The more different the cluster distribution is from the average, the more interesting or distinct the cluster. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 46 Qualitative characterization of clusters Gold98 is a binary variable that indicates the best customers in the database, created previously by the business using RFM analysis. The clusters have almost all Gold or no Gold customers. Gold segment is confirmed! EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 47 Qualitative characterization of clusters Clustering results validate the existing concept of Gold customers, suggest the introduction of a cluster within the Gold98 segment - a platinum segment Cluster 5: 9 %of the population. revenue, bonus collected are all in the 75th percentile and above, skewed to almost all greater than the 90th percentile. looks like a very profitable cluster EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 48 Detailed view of cluster 5 EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 49 Profiling clusters Goal: assess the potential business value of each cluster quantitatively by profiling the aggregate values of the shareholder value variables by cluster. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 50 Profiling clusters leverage = ratio of revenue to customer. product index = ratio of the average number of products purchased by the customers in the cluster divided by the average number of products purchased overall. as profitability increases, so does the average number of products purchased. customer profitability increases as tenure increases. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 51 Business opportunities Best customers in clusters 2, 5, and 7: indication: retention Clusters 2, 6, and 0 indication: cross-selling by contrasting with clusters 5 and 7. 2, 6, and 0 have a product index close to that of 5 and 7, which have the highest number of products purchased. Try to convert customers from clusters 2, 6, and 0 to clusters 5 and 7. By comparing which products are bought we can find products that are candidates for crossselling. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 52 Business opportunities Clusters 3 and 4 indication: cross-selling to clusters 2, 6, and 0 Cluster 1 indication: wait and see. It appears to be a group of new customers Cluster 8 indication: no waste of marketing dollars EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 53 Follow-up Reactions from the management visualization of results allowed for meaningful and actionable analysis. original segmentation methodology validated, and obtained refinements potentially valuable. decision to undertake further data mining projects, including predictive models for direct mail targeting, further work on segmentation using more detailed behavioral data, opportunity identification using association algorithms within the segments discovered. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 54 Module outline The clustering task Application to a case-study in customer segmentation Quick overview of main clustering techniques EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 55 A taxonomy of main clustering methods Partitioning methods K-means (MacQueen 67), expectation maximization (Lauritzen 95), CLARANS (Ng and Han 94) Hierarchical methods Agglomerative/divisive, BIRCH (Zhang et al 96), CURE (Guha et al 98) Linkage-based methods Density-based methods DBSCAN (Ester et al 96), DENCLUE (Hinnenburg and Keim 98) Statistical methods IBM-IM demographic clustering, COBWEB (Fisher 87), Autoclass (Cheeseman 96) EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 56 Which notion of distance? Given two samples with n continuous attributes X=<x1,…,xn> and Y=<y1,…,yn>, the Minkowski distance between X and Y is: d(X,Y)= (|x1-y1|q + … + |xn-yn|q)1/q Euclidean distance: Manhattan distance: EDBT2000 tutorial - Clust q=2 q=1 Konstanz, 27-28.3.2000 57 K-Means Determine k prototypes (p1, … ,pk) of a given data set Assign data points to nearest prototype Minimize distance stop-criterion: Iterative Algorithm Shift the prototypes towards the mean of their point set Re-assign the data points to the nearest prototype EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 58 K-Means: example 2 3 EDBT2000 tutorial - Clust 2 1 3 1 Konstanz, 27-28.3.2000 59 Expectation Maximization Generalization of k-Means probabilistic assignment of points to clusters Idea: Estimate parameters of k Gaussians Optimize the probability that the mixture of parameterized Gaussians fits the data Iterative algorithm similar to k-Means EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 60 Linkage-based methods (from statistics) Single Linkage Connected components for distance d EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 61 Linkage-based methods Method of Wishart Min. no. of neighboring points: 4 Reduce data set Apply Single Linkage EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 62 Kernel Density Estimation Data Set Influence Function Density Function Influence Function: Influence of a data point in its neighborhood Density Function: Sum of the influences of all data points EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 63 Kernel Density Estimation Influence Function The influence of a data point y at a point x in the data space is modeled by a function Kernel Density Estimation: Density Function The density at a point x in the data space is defined as the sum of influences of all data points: EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 64 KDE and visual clustering EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 65 KDE and visual clustering EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 66 Scalability problem of clustering methods Effectiveness degenerates with dimensionality d with noise level Efficiency degenerates (at least) linearly with no of data points N exponentially with dimensionality d EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 67 References - Clustering L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 1990. J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Synp. On Math., Stat., and Prob., 1:281-297, 1967. S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995. Charu C. Agrawal, Ceilia Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Prk, Fast Algorithms for Projected Clustering, the ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999. Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan, Automatic Subspace Clustering on High Dimensional Data for Data Mining Applications, the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998. P. Cheeseman and J. Stutz. Bayesian Classification (Autoclass): theory and results. In U. M. Fayyad et al (eds.) Advances in Knowledge Discovery and Data Mining, p. 153-180, AAAI/MIT Press, 1996. Martin Ester, Hans-Peter Kriegel, Jorg Sander, Michael Wimmer, and Xiaowei Xu, Incremental Clustering for Mining in a Data Warehousing Environment, the VLDB Conference, New York City, New York, August 1998. Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu, A density-based algorithm for discovering clusters in large spatial database with noise. In Proc. KDD’96 A. Hinnenburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. KDD’98. P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997. D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987. Venkatesh Ganti, Raghu Ramakrishnan, and Johannes Gehrke, Clustering Large Datasets in Arbitrary Metric Spaces, the 15th International Conference on Data Engineering, Sydney, Australia, April 1999. Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, CURE: An Efficient Clustering Algorithm for Large Databases, the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998 Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, the 15th International Conference on Data Engineering, Sydney, Australia, April 1999. Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, New Jersey, 1988. Raymond T. Ng and Jiawei Han, Efficient and effective clustering methods for spatial data mining, the VLDB Conference, Santiago, Chile, September 1994. Tian Zhang, Raghu Ramakrishnan, and Miron, Birch: An efficient data clustering method for very large databases, the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996. EDBT2000 tutorial - Clust Konstanz, 27-28.3.2000 68