Cluster Analysis for Data Mining
* Used for automatic identification of natural groupings of things
* Part of the machine-learning family
* Employs unsupervised learning
* Learns the clusters of things from past data, then assigns new instances
* There is no output variable
* Also known as segmentation

Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall

Cluster Analysis for Data Mining
Clustering results may be used to:
* Identify natural groupings of customers
* Identify rules for assigning new cases to classes for targeting/diagnostic purposes
* Provide characterization, definition, and labeling of populations
* Decrease the size and complexity of problems for other data mining methods
* Identify outliers in a specific domain (e.g., rare-event detection)

Advantages and disadvantages
* Clustering lets a user group the data in order to find patterns in it.
* Advantage: it is useful when the data set is defined and a general pattern needs to be determined from it. You can create a specific number of groups, depending on your business needs.
* One defining benefit of clustering over classification is that every attribute in the data set is used to analyze the data.
* Disadvantage: the user must know ahead of time how many groups to create. For a user without any real knowledge of the data, this might be difficult.

Cluster Analysis for Data Mining: Analysis methods
* Statistical methods
* Neural networks
* Fuzzy logic (e.g., fuzzy c-means algorithm)
* Genetic algorithms
* Divisive: all items start in one cluster and are broken apart.
* Agglomerative: all items start in individual clusters, and the clusters are joined together.
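The agglomerative idea (every item starts in its own cluster, and clusters are then joined) can be illustrated with a minimal Python sketch. It is not taken from the slides: the function name agglomerative, the choice of single linkage (distance between the closest pair of members), Euclidean distance, and the sample points are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative(points, target_clusters):
    """Join the two closest clusters until target_clusters remain.

    Hypothetical helper: single linkage is assumed here; the slides
    do not say which linkage rule to use.
    """
    # Every item starts in its own cluster.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(p, q) for p in a for q in b)

    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Join them; j > i, so removing cluster j does not shift index i.
        clusters[i].extend(clusters.pop(j))
    return clusters


# Made-up 2-D points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0)]
result = agglomerative(points, target_clusters=3)
print(result)
```

With these made-up points, the two tight pairs are each joined into one cluster and (9.0, 1.0) stays on its own. A divisive method would run in the opposite direction, starting from one cluster that holds all five points and splitting it.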
Cluster Analysis for Data Mining: k-Means Clustering Algorithm
k: predetermined number of clusters
Algorithm:
* Step 0: Determine the value of k
* Step 1: Randomly generate k points as initial cluster centers
* Step 2: Assign each point to the nearest cluster center
* Step 3: Recompute the new cluster centers
* Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)

K-Means
* Predetermined number of clusters
* Start with seed clusters of one element
[Figure: Seeds]
[Figure: Assign Instances to Clusters]
[Figure: Find New Centroids]
[Figure: New Clusters]
[Figure: k-Means Clustering Algorithm, Steps 1 to 3]

Example - BMW dealership
* The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases.
* The dealership needs to mine this data by finding patterns and by using clusters to determine whether certain behaviors emerge among its customers.
* There are 100 rows of data in this sample. Each column describes a step that customers reached in their BMW experience, with a 1 (they made it to this step or looked at this car) or a 0 (they didn't reach this step).
[Figure: Example - BMW dealership]
[Figure: Weka results]
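The k-means steps (choose k, seed the centers, assign points, recompute centers, repeat until assignments are stable) can be sketched in plain Python. This is only an illustration under stated assumptions, not the Weka implementation or the dealership's data: the function name k_means, the use of Euclidean distance, seeding with k randomly chosen data points (one reading of "seed clusters of one element"), and the six 0/1 rows are all hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, max_iter=100, seed=0):
    """Hypothetical helper: returns (centers, assignment) for k clusters."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial cluster centers
    # (assumption: seeds are drawn from the data itself).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        # Convergence criterion: the assignment did not change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return centers, assignment


# Made-up rows in the same 0/1 style as the example: each column is a
# step a customer did (1) or did not (0) reach. Not the real data set.
rows = [
    (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1),
    (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0),
]
centers, assignment = k_means(rows, k=2)
for row, cluster in zip(rows, assignment):
    print(row, "-> cluster", cluster)
```

Because the seeds are random, different values of seed can lead to different final clusters, which is one reason the choice of k and of starting points matters in practice.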