Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining Techniques: Classification and Clustering Chein-Shung Hwang Today Classification Basic Concepts in Classification Decision Trees ID3/C4.5 Algorithm Bayesian Classification Tools and Software for Classification Clustering Distance Measures Graph-based Techniques K-Means Clustering Tools and Software for Clustering 2 What Is Classification? The goal of data classification is to organize and categorize data in distinct classes A model is first created based on the data distribution The model is then used to classify new data Given the model, a class can be predicted for new data Classification = prediction for discrete and nominal values With classification, I can predict in which bucket to put the ball, but I can’t predict the weight of the ball 3 Prediction, Clustering, Classification What is Prediction? The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes A model is first created based on the data distribution The model is then used to predict future or unknown values Supervised vs. Unsupervised Classification Supervised Classification = Classification We know the class labels and the number of classes Unsupervised Classification = Clustering We do not know the class labels and may not know the number of classes 4 Classification: 3 Step Process 1. Model construction (Learning): Each record (instance) is assumed to belong to a predefined class, as determined by one of the attributes, called the class label The set of all records used for construction of the model is called training set The model is usually represented in the form of classification rules, (IFTHEN statements) or decision trees 2. Model Evaluation (Accuracy): Estimate accuracy rate of the model based on a test set The known label of test sample is compared with the classified result from model Accuracy rate: percentage of test set samples correctly classified by the model Test set is independent of training set otherwise over-fitting will occur 3. Model Use (Classification): The model is used to classify unseen instances (assigning class labels) Predict the value of an actual attribute 5 Model Construction 6 Model Evaluation 7 Model Use: Classification 8 Classification Methods Decision Tree Induction Neural Networks Bayesian Classification Association-Based Classification K-Nearest Neighbor Case-Based Reasoning Genetic Algorithms Fuzzy Sets Many More 9 Decision Trees A decision tree is a flow-chart-like tree structure Internal node denotes a test on an attribute (feature) Branch represents an outcome of the test All records in a branch have the same value for the tested attribute Leaf node represents class label or class label distribution outlook sunny humidity high N overcast rain windy P normal P true N false P 10 Decision Trees Example: “is it a good day to play golf?” a set of attributes and their possible values: outlook sunny, overcast, rain temperature cool, mild, hot humidity high, normal windy true, false A particular instance in the training set might be: <overcast, hot, normal, false>: play In this case, the target class is a binary attribute, so each instance represents a positive or a negative example. 11 Using Decision Trees for Classification Examples can be classified as follows 1. look at the example's value for the feature specified 2. move along the edge labeled with this value 3. if you reach a leaf, return the label of the leaf 4. otherwise, repeat from step 1 Example (a decision tree to decide whether to go on a picnic): outlook sunny humidity high N overcast So a new instance: rain will be classified as “noplay” windy P normal P <rainy, hot, normal, true>: ? true N false P 12 Decision Trees and Decision Rules outlook If attributes are continuous, internal nodes may test against a threshold. sunny overcast rain yes humidity > 75%<= 75% no windy > 20 yes no <= 20 yes Each path in the tree represents a decision rule: Rule1: If (outlook=“sunny”) AND (humidity<=0.75) Then (play=“yes”) Rule3: If (outlook=“overcast”) Then (play=“yes”) Rule2: If (outlook=“rainy”) AND (wind>20) Then (play=“no”) ... 13 Top-Down Decision Tree Generation The basic approach usually consists of two phases: Tree construction At the start, all the training examples are at the root Partition examples are recursively based on selected attributes Tree pruning remove tree branches that may reflect noise in the training data and lead to errors when classifying test data improve classification accuracy Basic Steps in Decision Tree Construction Tree starts a single node representing all data If sample are all same class then node becomes a leaf labeled with class label Otherwise, select feature that best separates sample into individual classes. Recursion stops when: Samples in node belong to the same class (majority) There are no remaining attributes on which to split 14 Trees Construction Algorithm (ID3) Decision Tree Learning Method (ID3) Input: a set of examples S, a set of features F, and a target set T (target class T represents the type of instance we want to classify, e.g., whether “to play golf”) 1. If every element of S is already in T, return “yes”; if no element of S is in T return “no” 2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure); 3. Extend tree from f by adding a new branch for each attribute value 4. Distribute training examples to leaf nodes (so each leaf node S is now the set of examples at that node, and F is the remaining set of features not yet selected) 5. Repeat steps 1-5 for each leaf node Main Question: how do we choose the best feature at each step? Note: ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes 15 Choosing the “Best” Feature Using Information Gain to find the “best” (most discriminating) feature Entropy, E(I) of a set of instance I, containing p positive and n negative examples E(I ) p p n n log2 log2 pn pn pn pn Gain(A, I) is the expected reduction in entropy due to feature (attribute) A Gain( A, I ) E ( I ) descendant j pj nj E(I j ) pn the jth descendant of I is the set of instances with value vj for A S: [9+,5-] Outlook? overcast sunny E = -(9/14).log(9/14) - (5/14).log(5/14) = 0.940 rainy “yes”, since all positive examples [4+,0-] [2+,3-] [3+,2-] 16 Decision Tree Learning - Example Day D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 outlook temp humidity wind sunny sunny overcast rain rain rain overcast sunny sunny rain sunny overcast overcast rain hot hot hot mild cool cool cool mild cool mild mild mild hot mild high high high high normal normal normal high normal normal normal high normal high weak strong weak weak weak strong strong weak weak weak strong strong weak strong play No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No S: [9+,5-] (E = 0.940) humidity? high [3+,4-] (E = 0.985) [6+,1-] (E = 0.592) Gain(S, humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151 S: [9+,5-] (E = 0.940) wind? weak So, classifying examples by humidity provides more information gain than by wind. In this case, however, you can verify that outlook has largest information gain, so it’ll be selected as root normal [6+,2-] (E = 0.811) strong [3+,3-] (E = 1.00) Gain(S, wind) = .940 - (8/14)*.811 - (8/14)*1.0 = .048 17 Decision Tree Learning - Example Partially learned decision tree S: [9+,5-] Outlook sunny {D1, D2, …, D14} overcast rainy ? yes ? [2+,3-] [4+,0-] [3+,2-] {D3, D7, D12, D13} {D4, D5, D6, D10, D14} {D1, D2, D8, D9, D11} which attribute should be tested here? Ssunny = {D1, D2, D8, D9, D11} Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970 Gain(Ssunny, temp) = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570 Gain(Ssunny, wind) = .970 - (2/5)*1.0 - (3/5)*.918 = .019 18 Dealing With Continuous Variables Partition continuous attribute into a discrete set of intervals sort the examples according to the continuous attribute A identify adjacent examples that differ in their target classification generate a set of candidate thresholds midway problem: may generate too many intervals Another Solution: take a minimum threshold M of the examples of the majority class in each adjacent partition; then merge adjacent partitions with the same majority class Example: M = 3 70.5 Temperature 64 65 68 69 70 71 Play? yes no yes yes yes no 77.5 72 72 75 75 80 81 83 85 no yes yes yes no yes yes no Same majority, so they are merged Final mapping: temperature 77.5 ==> “yes”; temperature > 77.5 ==> “no” 19 Improving on Information Gain Info. Gain tends to favor attributes with a large number of values larger distribution ==> lower entropy ==> larger Gain Quinlan suggests using Gain Ratio penalize for large number of values SplitInfo ( A, S ) Si S log i S S GainRatio ( A, S ) Example: “outlook” Gain ( A, S ) SplitInfo ( A, S ) S: [9+,5-] SplitInfo (outlook, S) = -(4/14).log(4/14) - (5/14).log(5/14) - (5/14).log(5/14) = 0.156 Outlook overcast sunny rainy GainRatio (outlook, S) = 0.246 / 0.156 = 1.577 S1: [4+,0-] S2 : [2+,3-] S3 : [3+,2-] 20 Over-fitting in Classification A tree generated may over-fit the training examples due to noise or too small a set of training data Two approaches to avoid over-fitting: (Stop earlier): Stop growing the tree earlier (Post-prune): Allow over-fit and then post-prune the tree Approaches to determine the correct final tree size: Separate training and testing sets or use cross-validation Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve over entire distribution Use Minimum Description Length (MDL) principle: halting growth of the tree when the encoding is minimized. Rule post-pruning (C4.5): converting to rules before pruning 21 Pruning the Decision Tree A decision tree constructed using the training data may need to be pruned over-fitting may result in branches or leaves based on too few examples pruning is the process of removing branches and subtrees that are generated due to noise; this improves classification accuracy Subtree Replacement: merge a subtree into a leaf node Using a set of data different from the training data At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node; label it using the majority class color red blue yes no 1 2 Suppose with test set we find 3 red “no” examples, and 1 blue “yes” example. We can replace the tree with a single “no” node. After replacement there will be only 2 errors instead of 5. 22 Bayesian Classification It is a statistical classifier based on Bayes theorem It uses probabilistic learning by calculating explicit probabilities for hypothesis A naïve Bayesian classifier, that assumes total independence between attributes, is commonly used and performs well with large data sets The model is incremental in the sense that each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data Given a data sample X with an unknown class label, H is the hypothesis that X belongs to a specific class C The conditional probability of hypothesis H given X, Pr(H|X), follows the Bayes theorem: Pr( X | H ) Pr( H ) Pr( H | X ) Pr( X ) Practical difficulty: requires initial knowledge of many probabilities, significant computational cost 23 Naïve Bayesian Classifier Suppose we have n classes C1 , C2 ,…,Cn. Given an unknown sample X, the classifier will predict that X=(x1 ,x2 ,…,xn) belongs to the class with the highest conditional probability: X Ci if Pr(Ci | X ) Pr(C j | X ), for i j n, i j Maximize Pr(X | Ci).Pr(Ci) / Pr(X) => maximize Pr(X | Ci).Pr(Ci) n Note: Pr(C s,Pr(and Pr( X i|)C=i )si / xk | Ci ) where Pr( xk | Ci ) sik / si k 1 Greatly reduces the computation cost, only count the class distribution Naïve: class conditional independence 24 Naïve Bayesian Classifier - Example Given a training set, we can compute the probabilities X = <sunny, mild, high, true> Pr(X | “no”).Pr(“no”) = (3/5 . 2/5 . 4/5 . 3/5) . 5/14 = 0.04 Pr(X | “yes”).Pr(“yes”) = (2/9 . 4/9 . 3/9 . 3/9) . 9/14 = 0.007 25 Classification Example - Bank Data Want to determine likely responders to a direct mail campaign a new product, a "Personal Equity Plan" (PEP) training data include records kept about how previous customers responded and bought the product in this case the target class is “pep” with binary value want to build a model and apply it to new data (a customer list) in which the value of the class attribute is not available id ID12101 ID12102 ID12103 ID12104 ID12105 ID12106 ID12107 ID12108 ID12109 ID12110 ID12111 … age 48 40 51 23 57 57 22 58 37 54 66 … sex FEMALE MALE FEMALE FEMALE FEMALE FEMALE MALE MALE FEMALE MALE FEMALE … region income married children car INNER_CITY 17546 NO 1 NO TOWN 30085.1 YES 3 YES INNER_CITY 16575.4 YES 0 YES TOWN 20375.4 YES 3 NO RURAL 50576.3 YES 0 NO TOWN 37869.6 YES 2 NO RURAL 8877.07 NO 0 NO TOWN 24946.6 YES 0 YES SUBURBAN 25304.3 YES 2 YES TOWN 24212.1 YES 2 YES TOWN 59803.9 YES 0 NO … … … … … save_act current_act mortgage pep NO NO NO YES NO YES YES NO YES YES NO NO NO YES NO NO YES NO NO NO YES YES NO YES NO YES NO YES YES YES NO NO NO NO NO NO YES YES NO NO YES YES NO NO … … … … 26 Data Preparation Several steps for prepare data for Weka and for See5 open training data in Excel, remove the “id” column, save results (as a comma delimited file (e.g., “bank.csv”) do the same with new customer data, but also add a new column called “pep” as the last column; the value of this column for each record should be “?” Weka must convert the the data to ARFF format attribute specification and data are in the same file the data portion is just the comma delimited data file without the label row See5/C5 create a “name” file and a “data” file “name” file contains attribute specification; “data” file is same as above first line of “name” file must be the name(s) of the target class(es) - in this case “pep” 27 Data File Format for Weka Training Data New Cases @relation ’train-bank-data' @attribute 'age' real @attribute 'sex' {'MALE','FEMALE'} @attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'} @attribute 'income' real @attribute 'married' {'YES','NO'} @attribute 'children' real @attribute 'car' {'YES','NO'} @attribute 'save_act' {'YES','NO'} @attribute 'current_act' {'YES','NO'} @attribute 'mortgage' {'YES','NO'} @attribute 'pep' {'YES','NO'} @data 48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES 40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO ... @relation 'new-bank-data' @attribute 'age' real @attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'} ... @attribute 'pep' {'YES','NO'} @data 23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,? 30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,? 28 Data File Format for See5/C5 pep. age: continuous. sex: MALE,FEMALE. region: INNER_CITY,RURAL,TOWN,SUBURBAN. income: continuous. married: YES,NO. children: continuous. car: YES,NO. save_act: YES,NO. current_act: YES,NO. mortgage: YES,NO. pep: YES,NO. 48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES 40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO 51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO ... 23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,? 30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,? 45,FEMALE,RURAL,21881.6,NO,0,YES,YES,YES,NO,? ... Name file for Training Data Data file for Training Data Note: no “name file” is necessary for new cases, but the file name must use the same stem, and a suffix “.cases” 29 C4.5 Implementation in Weka To build a model (decision tree) using the classifiers.j48.J48 class from command line or using the simple java-based command line interface Decision Tree Output (pruned) Command for building the model (additional parameters can be specified for pruning, cross validation, etc.) children <= 2 | children <= 0 | | married = YES | | | mortgage = YES | | | | save_act = YES: NO (16.0/2.0) | | | | save_act = NO: YES (9.0/1.0) | | | mortgage = NO: NO (59.0/6.0) | | married = NO | | | mortgage = YES | | | | save_act = YES: NO (12.0) | | | | save_act = NO: YES (3.0) | | | mortgage = NO: YES (29.0/2.0) | children > 0 | | income <= 29622 | | | children <= 1 | | | | income <= 12640.3: NO (5.0) | | | | income > 12640.3 | | | | | current_act = YES: YES (28.0/1.0) | | | | | current_act = NO | | | | | | income <= 17390.1: NO (3.0) | | | | | | income > 17390.1: YES (6.0) | | | children > 1: NO (47.0/3.0) | | income > 29622: YES (48.0/2.0) children > 2 | income <= 43228.2: NO (30.0/2.0) | income > 43228.2: YES (5.0) > java weka.classifiers.j48.J48 -t <file-path-name>.arff -d <file-path-name>.model 30 C4.5 Implementation in Weka === Error on training data === The rest of the output contains statistical information about the model, including confusion matrix, error rates, etc. Correctly Classified Instances Incorrectly Classified Instances Mean absolute error Root mean squared error Relative absolute error Root relative squared error Total Number of Instances 281 93.6667 % 19 6.3333 % 0.1163 0.2412 23.496 % 48.4742 % 300 === Confusion Matrix === a b <-- classified as 122 13 | a = YES 6 159 | b = NO === Stratified cross-validation === The model is now contained in the (binary) file <file-path-name>.model Correctly Classified Instances Incorrectly Classified Instances Mean absolute error Root mean squared error Relative absolute error Root relative squared error Total Number of Instances 274 91.3333 % 26 8.6667 % 0.1434 0.291 28.9615 % 58.4922 % 300 === Confusion Matrix === a b <-- classified as 118 17 | a = YES 9 156 | b = NO 31 C4.5 Implementation in Weka Applying the model to new cases: java weka.classifiers.j48.J48 -p -l <file-pathnaem>.model -T <newdatafile-pathnaem>.arff The output gives the predicted class for each new instance along with its predicted acuracy. Since we removed the “id” field, we now need to map these predictions to the original “new case” records (e.g. using Excel) 0 NO 0.875 ? 1 NO 1.0 ? 2 YES 0.9310344827586207 ? 3 YES 0.9583333333333334 ? 4 NO 0.8983050847457628 ? 5 YES 0.9642857142857143 ? 6 NO 0.875 ? 7 YES 1.0 ? 8 NO 0.9333333333333333 ? 9 YES 0.9642857142857143 ? 10 NO 0.875 ? . . . 195 YES 0.9583333333333334 ? 196 NO 0.9361702127659575 ? 197 YES 1.0 ? 198 NO 0.8983050847457628 ? 199 NO 0.9361702127659575 ? 32 Classification Using See5/C5 Note: See5/C5 also has options for creating the model as a set of decision rules (as opposed to a decision tree) Either model can be used to classify new unclassified cases. 33 Classification Using See5/C5 Class specified by attribute `pep' Decision tree: income > 30085.1: :...children > 0: YES (43/5) ** This demonstration version cannot process ** : children <= 0: ** more than 200 training or test cases. ** : :...married = YES: NO (19/2) : married = NO: Read 200 cases (11 attributes) from bank-train.data : :...mortgage = YES: NO (3) : mortgage = NO: YES (5) income <= 30085.1: Evaluation on training data (200 cases): :...children > 1: NO (50/4) children <= 1: Decision Tree :...children <= 0: ---------------:...save_act = YES: NO (27 Size Errors : save_act = NO: : :...married = NO: YES 13 19( 9.5%) << : married = YES: : :...mortgage = YES (a) (b) <-classified as : mortgage = NO: ---- ---children > 0: 76 13 (a): class YES :...income <= 12681.9: NO 6 105 (b): class NO income > 12681.9: :...current_act = YES: Time: 0.1 secs current_act = NO: :...car = YES: NO car = NO: YES 34 See5/C5: Applying Model to New Cases Need to use the executable file “sample.exe” (note that source code is available to allow you to build classifier into your applications From the Command Line: prompt> sample -f bank-train > bank-new.out Output classification file for new cases (along with predicted accuracy for each new case). Case No Given Class Predicted Class 1 2 3 4 5 . . . 197 198 199 200 ? ? ? ? ? NO NO NO YES NO [0.79] [0.86] [0.79] [0.87] [0.79] ? ? ? ? NO YES NO NO [0.90] [0.88] [0.79] [0.90] 35 Classification Using See5/C5 Building the model based on decision rules: Rule 1: (31, lift 2.2) income > 12681.9 children > 0 children <= 1 current_act = YES -> class YES [0.970] Rule 2: (20, lift 2.1) income > 12681.9 children > 0 children <= 1 car = NO -> class YES [0.955] Rule 3: (17, lift 2.1) income > 30085.1 married = NO mortgage = NO -> class YES [0.947] Rule 4: (7, lift 2.0) married = NO children <= 0 save_act = NO -> class YES [0.889] Rule 5: (43/5, lift 1.9) income > 30085.1 children > 0 -> class YES [0.867] Rule 6: (7/1, lift 1.7) children <= 0 save_act = NO mortgage = YES -> class YES [0.778] . . . 36 What is Clustering in Data Mining? Clustering is a process of partitioning a set of data (or objects) in a set of meaningful sub-classes, called clusters Helps users understand the natural grouping or structure in a data set Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group but as a collection, they are sufficiently different from other groups Clustering unsupervised classification no predefined classes 37 Requirements of Clustering Methods Scalability Dealing with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters Able to deal with noise and outliers Insensitive to order of input records The curse of dimensionality Interpretability and usability 38 Applications of Clustering Clustering has wide applications in Pattern Recognition Spatial Data Analysis: create thematic maps in GIS by clustering feature spaces detect spatial clusters and explain them in spatial data mining Image Processing Market Research Information Retrieval Document or term categorization Information visualization and IR interfaces Web Mining Cluster Web usage data to discover groups of similar access patterns Web Personalization 39 Clustering Methodologies Two general methodologies Partitioning Based Algorithms Hierarchical Algorithms Partitioning Based divide a set of N items into K clusters (top-down) Hierarchical agglomerative: pairs of items or clusters are successively linked to produce larger clusters divisive: start with the whole set as a cluster and successively divide sets into smaller partitions 40 Distance or Similarity Measures Measuring Distance In order to group similar items, we need a way to measure the distance between objects (e.g., records) Note: distance = inverse of similarity Often based on the representation of objects as “feature vectors” An Employee DB ID 1 2 3 4 5 Gender F M M F M Age 27 51 52 33 45 Salary 19,000 64,000 100,000 55,000 45,000 Term Frequencies for Documents Doc1 Doc2 Doc3 Doc4 Doc5 T1 0 3 3 0 2 T2 4 1 0 1 2 T3 0 4 0 0 2 T4 0 3 0 3 3 T5 0 1 3 0 1 T6 2 2 0 0 4 Which objects are more similar? 41 Distance or Similarity Measures Properties of Distance Measures: for all objects A and B, dist(A, B) 0, and dist(A, B) = dist(B, A) for any object A, dist(A, A) = 0 dist(A, C) dist(A, B) + dist (B, C) Common Distance Measures: Manhattan distance: X x1 , x2 , dist ( X , Y ) x1 y1 x2 y2 , xn x1 y1 2 xn yn Can be normalized to make values fall between 0 and 1. 2 Cosine similarity: dist ( X , Y ) 1 sim( X , Y ) , yn xn yn Euclidean distance: dist ( X , Y ) Y y1 , y2 , sim( X , Y ) ( xi yi ) i xi yi 2 i 2 i 42 Distance or Similarity Measures Weighting Attributes in some cases we want some attributes to count more than others associate a weight with each of the attributes in calculating distance, e.g., dist ( X , Y ) w1 x1 y1 2 wn xn yn 2 Nominal (categorical) Attributes can use simple matching: distance=1 if values match, 0 otherwise or convert each nominal attribute to a set of binary attribute, then use the usual distance measure if all attributes are nominal, we can normalize by dividing the number of matches by the total number of attributes Normalization: want values to fall between 0 an 1: other variations possible x 'i xi min xi max xi min xi 43 Distance or Similarity Measures Example max distance for age: 100000-19000 = 79000 max distance for age: 52-27 = 25 ID 1 2 3 4 5 Gender F M M F M Age 27 51 52 33 45 Salary 19,000 64,000 100,000 55,000 45,000 ID 1 2 3 4 5 x 'i Gender 1 0 0 1 0 xi min xi max xi min xi Age 0.00 0.96 1.00 0.24 0.72 Salary 0.00 0.56 1.00 0.44 0.32 dist(ID2, ID3) = SQRT( 0 + (0.04)2 + (0.44)2 ) = 0.44 dist(ID2, ID4) = SQRT( 1 + (0.72)2 + (0.12)2 ) = 1.24 44 Domain Specific Distance Functions For some data sets, we may need to use specialized functions we may want a single or a selected group of attributes to be used in the computation of distance - same problem as “feature selection” may want to use special properties of one or more attribute in the data Example: Zip Codes distzip(A, B) = 0, if zip codes are identical distzip(A, B) = 0.1, if first 3 digits are identical distzip(A, B) = 0.5, if first digits are identical distzip(A, B) = 1, if first digits are different natural distance functions may exist in the data Example: Customer Solicitation distsolicit(A, B) = 0, if both A and B responded distsolicit(A, B) = 0.1, both A and B were chosen but did not respond distsolicit(A, B) = 0.5, both A and B were chosen, but only one responded distsolicit(A, B) = 1, one was chosen, but the other was not 45 Distance (Similarity) Matrix Similarity (Distance) Matrix based on the distance or similarity measure we can construct a symmetric matrix of distance (or similarity values) (i, j) entry in the matrix is the distance (similarity) between items i and j I1 I2 I1 d 21 I2 d12 In d1n d2n In d n1 d n 2 Note that dij = dji (i.e., the matrix is symmetric. So, we only need the lower triangle part of the matrix. The diagonal is all 1’s (similarity) or all 0’s (distance) dij similarity (or distance) of Di to D j 46 Example: Term Similarities in Documents Doc1 Doc2 Doc3 Doc4 Doc5 T1 0 3 3 0 2 T2 4 1 0 1 2 T3 0 4 0 0 2 T4 0 3 0 3 3 T5 0 1 3 0 1 T6 2 2 0 0 4 T7 1 0 3 2 0 T8 3 1 0 0 2 N sim(Ti , Tj ) ( wik w jk ) k 1 Term-Term Similarity Matrix T2 T3 T4 T5 T6 T7 T8 T1 7 16 15 14 14 9 7 T2 T3 T4 T5 T6 T7 8 12 3 18 6 17 18 6 16 0 8 6 18 6 9 6 9 3 2 16 3 47 Similarity (Distance) Thresholds A similarity (distance) threshold may be used to mark pairs that are “sufficiently” similar T2 T3 T4 T5 T6 T7 T2 T3 T4 T5 T6 T7 T8 T1 7 16 15 14 14 9 7 8 12 3 18 6 17 18 6 16 0 8 6 18 6 9 6 9 3 2 16 3 T2 T3 T4 T5 T6 T7 T2 T3 T4 T5 T6 T7 T8 T1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 Using a threshold value of 10 in the previous example 48 Graph Representation The similarity matrix can be visualized as an undirected graph each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix) T2 T3 T4 T5 T6 T7 T8 T1 T2 T3 T4 T5 T6 T7 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 If no threshold is used, then matrix can be represented as a weighted graph T1 T3 T5 T4 T2 T7 T6 T8 49 Simple Clustering Algorithms If we are interested only in threshold (and not the degree of similarity or distance), we can use the graph directly for clustering Clique Method (complete link) all items within a cluster must be within the similarity threshold of all other items in that cluster clusters may overlap generally produces small but very tight clusters Single Link Method any item in a cluster must be within the similarity threshold of at least one other item in that cluster produces larger but weaker clusters Other methods star method - start with an item and place all related items in that cluster string method - start with an item; place one related item in that cluster; then place anther item related to the last item entered, and so on 50 Simple Clustering Algorithms Clique Method a clique is a completely connected subgraph of a graph in the clique method, each maximal clique in the graph becomes a cluster T1 T3 Maximal cliques (and therefore the clusters) in the previous example are: T5 T4 T2 {T1, T3, T4, T6} {T2, T4, T6} {T2, T6, T8} {T1, T5} {T7} Note that, for example, {T1, T3, T4} is also a clique, but is not maximal. T7 T6 T8 51 Simple Clustering Algorithms Single Link Method selected an item not in a cluster and place it in a new cluster place all other similar item in that cluster repeat step 2 for each item in the cluster until nothing more can be added repeat steps 1-3 for each item that remains unclustered T1 T3 In this case the single link method produces only two clusters: T5 T4 T2 {T1, T3, T4, T5, T6, T2, T8} {T7} Note that the single link method does not allow overlapping clusters, thus partitioning the set of items. T7 T6 T8 52 Clustering with Existing Clusters The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster cluster representatives can be actual items in the cluster or other “virtual” representatives such as the centroid this methodology reduces the number of similarity computations in clustering clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made Partitioning Methods reallocation method - start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning Single pass method - simple and efficient, but produces large clusters, and depends on order in which items are processed Hierarchical Agglomerative Methods starts with individual items and combines into clusters then successively combine smaller clusters to form larger ones grouping of individual items can be based on any of the methods discussed earlier 53 K-Means Algorithm The basic algorithm (based on reallocation method): 1. select K data points as the initial representatives 2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters) 3. for j = 1 to K, recalculate the cluster centroid Cj 4. repeat steps 2 and 3 until these is (little or) no change in clusters Example: Clustering Terms Initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6} Doc1 Doc2 Doc3 Doc4 Doc5 T1 0 3 3 0 2 T2 4 1 0 1 2 T3 0 4 0 0 2 T4 0 3 0 3 3 T5 0 1 3 0 1 Cluster Centroids T6 2 2 0 0 4 T7 1 0 3 2 0 T8 3 1 0 0 2 C1 4/2 4/2 3/2 1/2 4/2 C2 0/2 7/2 0/2 3/2 5/2 C3 2/2 3/2 3/2 0/2 5/2 54 Example: K-Means Example (continued) Now using simple similarity measure, compute the new cluster-term similarity matrix Class1 Class2 Class3 Assign T1 T2 T3 T4 T5 T6 T7 T8 29/2 29/2 24/2 27/2 17/2 32/2 15/2 24/2 31/2 20/2 38/2 45/2 12/2 34/2 6/2 17/2 28/2 21/2 22/2 24/2 17/2 30/2 11/2 19/2 to Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1 Now compute new cluster centroids using the original document-term matrix Doc1 Doc2 Doc3 Doc4 Doc5 T1 0 3 3 0 2 T2 4 1 0 1 2 T3 0 4 0 0 2 T4 0 3 0 3 3 T5 0 1 3 0 1 T6 2 2 0 0 4 T7 1 0 3 2 0 T8 3 1 0 0 2 C1 8/3 2/3 3/3 3/3 4/3 C2 2/4 12/4 3/4 3/4 11/4 C3 0/1 1/1 3/1 0/1 1/1 The process is repeated until no further changes are made to the clusters 55 K-Means Algorithm Strength of the k-means: Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n Often terminates at a local optimum Weakness of the k-means: Applicable only when mean is defined; what about categorical data? Need to specify k, the number of clusters, in advance Unable to handle noisy data and outliers Variations of K-Means usually differ in: Selection of the initial k means Dissimilarity calculations Strategies to calculate cluster means 56 Hierarchical Algorithms Use distance matrix as clustering criteria does not require the number of clusters k as an input, but needs a termination condition 57 Hierarchical Agglomerative Clustering HAC starts with unclustered data and performs successive pairwise joins among items (or previous clusters) to form larger ones this results in a hierarchy of clusters which can be viewed as a dendrogram useful in pruning search in a clustered item set, or in browsing clustering results Some commonly used HACM methods Single Link: at each step join most similar pairs of objects that are not yet in the same cluster Complete Link: use least similar pair between each cluster pair to determine inter-cluster similarity - all items within one cluster are linked to each other within a similarity threshold Ward’s method: at each step join cluster pair whose merger minimizes the increase in total within-group error sum of squares (based on distance between centroids) - also called the minimum variance method Group Average (Mean): use average value of pairwise links within a cluster to determine inter-cluster similarity (i.e., all objects contribute to inter-cluster similarity) 58 Hierarchical Agglomerative Clustering Dendrogram for a hierarchy of clusters A B C D E F G H I 59