Chapter 10. Prediction and Cluster Analysis

Prediction

Definition: prediction models continuous-valued functions, i.e., it predicts unknown or missing values.

Similarities to classification:
- Construct a model
- Use the model to predict a continuous or ordered value for a given input

Differences from classification:
- Classification predicts a categorical class label
- Prediction models continuous-valued functions

The major method for prediction is regression, which models the relationship between one or more independent (predictor) variables and a dependent (response) variable.

Regression analysis:
- Linear and multiple regression
- Non-linear regression
- Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

Linear regression involves a response variable y and a single predictor variable x:

  y = w0 + w1 x

where w0 (the y-intercept) and w1 (the slope) are the regression coefficients. (A short code sketch that computes these coefficients follows the Excel steps below.)

Example: finding the regression equation with Excel:
1. Create a chart from the existing data
2. Right-click the data series in the chart and choose Add Trendline
3. On the Type tab, select the desired regression type
4. On the Options tab, check Display equation on chart
5. Click OK

Adding data to the chart:
1. Enter the data to be added below the existing data
2. Click the data series in the chart
3. Drag the blue outline down until it encloses the new data
4. The chart is redrawn to reflect the added data
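For data outside Excel, the same coefficients can be computed directly from the least-squares formulas: w1 = sum((xi - x̄)(yi - ȳ)) / sum((xi - x̄)^2) and w0 = ȳ - w1 x̄. A minimal Python sketch, assuming NumPy is available and using made-up data points:

```python
import numpy as np

def fit_linear_regression(x, y):
    """Least-squares estimates of w0 (intercept) and w1 (slope)
    for the model y = w0 + w1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Made-up example data (hypothetical, for illustration only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
w0, w1 = fit_linear_regression(x, y)
print(f"y = {w0:.3f} + {w1:.3f} x")
```

This is the same least-squares line that Excel's Display equation on chart option reports for a linear trendline on the same data.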
Cluster Analysis

Definition 1: Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Besides the term data clustering (or just clustering), a number of terms have similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology, and typological analysis.

Definition 2: Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups that both minimizes within-group variation and maximizes between-group variation.

Definition 3: Cluster analysis finds groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: inter-cluster distances are maximized while intra-cluster distances are minimized. (Tan, Steinbach, Kumar, Introduction to Data Mining)

Applications of cluster analysis:
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization: reduce the size of large data sets

What is not cluster analysis?
- Supervised classification: class label information is available
- Simple segmentation: e.g., dividing students into registration groups alphabetically by last name
- Results of a query: the groupings are the result of an external specification

The notion of a cluster can be ambiguous: how many clusters a point set contains depends on the desired resolution. [Figure: the same points grouped into two, four, and six clusters.]

Types of Clusterings

A clustering is a set of clusters. Two broad types are distinguished:
- Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. [Figure: original points and a partitional clustering of them.]
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. [Figure: traditional and non-traditional hierarchical clusterings of points p1-p4, each with its dendrogram.]

Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering

K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified
- The basic algorithm is very simple (a code sketch follows the details below)

K-means Clustering: Details
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster. Example: the data set has three dimensions and a cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). The centroid Z = (z1, z2, z3) is given by z1 = (x1 + y1)/2, z2 = (x2 + y2)/2, and z3 = (x3 + y3)/2.
- "Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc. Euclidean distance is the most common measure: for a pair of cases plotted on two variables x and y, it is the square root of the sum of the squared x difference and the squared y difference.
- K-means will converge for the common similarity measures mentioned above.
- Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to "until relatively few points change clusters".
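To make the loop concrete, here is a minimal K-means sketch in Python, assuming NumPy; the function and data are hypothetical, and in practice a library implementation such as scikit-learn's KMeans would normally be used:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize centroids with k randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its old centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels
```

Because the initial centroids are random, different seeds can yield different clusterings of the same points, which is exactly the behavior illustrated next.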
Two Different K-means Clusterings. [Figure: the same original points clustered in two ways; one run finds the optimal clustering, another run a sub-optimal one.]

Limitations of K-means
- K-means has problems when clusters differ in size or density, or have non-globular shapes. [Figures: original points vs. K-means results for differing sizes (3 clusters), differing density (3 clusters), and non-globular shapes (2 clusters).]
- K-means has problems when the data contains outliers.

Overcoming K-means Limitations
One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put together afterwards. [Figures: original points vs. many-cluster K-means results for the differing-size, differing-density, and non-globular examples.]

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large data sets. Its disadvantage is that it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.

Hierarchical Clustering
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree. It can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits. [Figure: a nested clustering of six points and the corresponding dendrogram.]

Strengths of hierarchical clustering:
- It does not require assuming any particular number of clusters; any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
- The clusters may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction, ...)

Limitation of hierarchical clustering: it is appropriate for smaller samples (typically n < 250). When n is large, the algorithm is very slow to reach a solution.

Two main types of hierarchical clustering:
- Agglomerative (forward clustering): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
- Divisive (backward clustering): start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm
Agglomerative clustering is the more popular hierarchical technique, and the basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat:
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

The key operation is the computation of the proximity of two clusters; the different approaches to defining the distance between clusters are what distinguish the different algorithms.

How to Define Inter-Cluster Similarity (Proximity)
1. The minimum distance between elements of each cluster (single-linkage clustering): the similarity of two clusters is based on the two most similar (closest) points in the different clusters
2. The maximum distance between elements of each cluster (complete-linkage clustering): the similarity of two clusters is based on the two least similar (most distant) points in the different clusters
3. The mean distance between elements of each cluster (average-linkage clustering, used e.g. in UPGMA, the unweighted pair-group method using averages): the proximity of two clusters is the average of the pairwise proximities between points in the two clusters
4. The distance between the cluster centroids

A short sketch contrasting these four linkage criteria is shown below, before the list of applications.
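The four proximity definitions map directly onto the method argument of SciPy's linkage routine. A minimal sketch, assuming SciPy is installed and using made-up toy points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D data: two visually separate groups of three points each
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# 'single', 'complete', 'average', and 'centroid' correspond to
# proximity definitions 1-4 above
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(points, method=method)               # bottom-up merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 clusters
    print(method, labels)
```

On such well-separated data all four criteria agree; they diverge on elongated or unevenly sized clusters, where single linkage tends to chain and complete linkage tends to break large clusters.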
Applications
- Biology
- Market research
- Image segmentation: clustering can be used to divide a digital image into distinct regions for border detection or object recognition
- Data mining: many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above are some examples. Another common application is the division of documents, such as World Wide Web pages, into genres
- Search result grouping: in intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results than those of conventional search engines such as Google. A number of web-based clustering tools exist, such as Clusty