MIS 451 Building Business Intelligence Systems
Clustering (2)

Problem
Target Marketing
- Diaper, Baby food, Toys
- Swiss cheese and Belgian chocolate
- French Wine

Clustering
Clustering is a data mining method for grouping objects such that objects within the same cluster are similar and objects in different clusters are dissimilar.
Why clustering
- SQL-based OLAP is not suitable for clustering objects whose attributes have a large number of possible values
- SQL-based OLAP is not suitable for clustering objects with a large number of attributes

Clustering
Steps in clustering objects
1. Compute similarity between objects
2. Cluster the objects based on the similarity between them

Similarity
- An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender, etc.)
- Similarity between objects is measured variable by variable.
- In practice, instead of measuring similarity directly, we use distance to measure dissimilarity between variables: the smaller the distance, the more similar the objects.

Dissimilarity
Continuous variable
- Manhattan distance
- Euclidean distance

Dissimilarity
For two objects X and Y with continuous variables 1, 2, ..., n, Manhattan distance is defined as:

$$d(X, Y) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n|$$

where $x_1, \dots, x_n$ are the values of the variables of object X and $y_1, \dots, y_n$ are the values of the variables of object Y.

Dissimilarity
Example of Manhattan distance

NAME    AGE    SPENDING($)
Sue     21     2300
Carl    27     2600
Tom     45     5400
Jack    52     6000

Dissimilarity
For two objects X and Y with continuous variables 1, 2, ..., n, Euclidean distance is defined as:

$$d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$

where $x_1, \dots, x_n$ are the values of the variables of object X and $y_1, \dots, y_n$ are the values of the variables of object Y.

Dissimilarity
Example of Euclidean distance

NAME    AGE    SPENDING($)
Sue     21     23200
Carl    27     23330
Tom     45     23260
Jack    52     23400

Dissimilarity
Standardize values of a variable
1. Calculate the mean value
2. Calculate the mean absolute deviation
3. Standardize each value of the variable using the formula:
   new value = (old value - mean value) / mean absolute deviation

Dissimilarity
Binary variable
distance = number of mismatched variables / total number of variables
(The share of matched variables is the corresponding similarity; two identical objects have distance 0.)

NAME    Married (Y/N)    Gender    Internet connection at home
Sue     Y                M         Y
Carl    Y                F         Y
Tom     N                M         N
Jack    N                F         N

Clustering based on dissimilarity
After calculating the dissimilarity between objects, a dissimilarity matrix can be created with the objects as row and column indexes and the dissimilarities between objects as elements.

Clustering based on dissimilarity

        Sue    Tom    Carl   Jack   Mary
Sue     0      6      8      2      7
Tom     6      0      1      5      3
Carl    8      1      0      10     9
Jack    2      5      10     0      4
Mary    7      3      9      4      0

Clustering based on dissimilarity
Step 1: Initially, place each object in its own cluster
Step 2: Calculate the dissimilarity between clusters
        The dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster
Step 3: Merge the two clusters with the least dissimilarity
Step 4: Repeat steps 2-3 until all objects are in one cluster
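
The merge procedure in the steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the course: the function names `cluster_dissimilarity` and `single_linkage` are made up for this example, and the data is the five-person dissimilarity matrix shown earlier. The rule "minimum dissimilarity between two objects, one from each cluster" is commonly called single linkage.

```python
from itertools import combinations

# Dissimilarity matrix from the slides (rows and columns in the same order).
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
matrix = [
    [0, 6, 8, 2, 7],
    [6, 0, 1, 5, 3],
    [8, 1, 0, 10, 9],
    [2, 5, 10, 0, 4],
    [7, 3, 9, 4, 0],
]


def cluster_dissimilarity(a, b, d):
    """Step 2: minimum dissimilarity between two objects, one from each cluster."""
    return min(d[i][j] for i in a for j in b)


def single_linkage(d, labels):
    """Merge clusters until one remains; return (merged members, dissimilarity) per merge.

    Hypothetical helper written for this example only.
    """
    # Step 1: every object starts in its own cluster (clusters hold row indexes).
    clusters = [frozenset([i]) for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the pair of clusters with the least dissimilarity.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_dissimilarity(pair[0], pair[1], d),
        )
        dist = cluster_dissimilarity(a, b, d)
        # Replace the two clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(labels[i] for i in a | b), dist))
    return merges


merges = single_linkage(matrix, names)
for members, dist in merges:
    print(f"merge at dissimilarity {dist}: {members}")
```

On this matrix the sketch merges Tom and Carl first (dissimilarity 1), then Sue and Jack (2), then Mary joins Tom and Carl (3), and finally the two remaining clusters merge at dissimilarity 4. Scanning every pair of clusters on each pass is simple but slow for large data sets; it is used here only because it mirrors the steps one to one.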