Survey

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Human genetic clustering wikipedia , lookup

K-means clustering wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
```CLUSTERING
PARTITIONING METHODS
Elsayed Hemayed
Data Mining Course
Outline
2





Introduction
Clustering Methods
K-Means Clustering
Selecting K
Acknowledgement: some of the material in these slides are
from [Max Bramer, “Principles of Data Mining”, SpringerVerlag London Limited 2007]
Partitioning Methods
Introduction
3



Extracting information from unlabelled data.
Clustering is concerned with grouping together objects
that are similar to each other and dissimilar to the
objects belonging to other clusters.
Examples:
In an economics application we might be interested in
finding countries whose economies are similar.
 In a financial application we might wish to find clusters of
companies that have similar financial performance.
 In a marketing application we might wish to find clusters of

Partitioning Methods
Clustering Examples
4



In a medical application we might wish to find
clusters of patients with similar symptoms.
In a document retrieval application we might wish to
find clusters of documents with related content.
In a crime analysis application we might look for
clusters of high volume crimes such as burglaries or
try to cluster together much rarer (but possibly
related) crimes such as murders.
Partitioning Methods
Clustering Example
5
Partitioning Methods
Clustering: Application 1
6

Market Segmentation:
 Goal:
subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
 Approach:
 Collect
different attributes of customers based on their
geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.
Partitioning Methods
Clustering: Application 2
7

Document Clustering:
 Goal:
To find groups of documents that are similar to
each other based on the important terms appearing in
them.
 Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
 Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
Partitioning Methods
Clustering Methods
8
Partitioning methods


K-Means
Hierarchical methods



Agglomerative Hierarchical Clustering
Divisive hierarchical clustering
Density-based methods


DBSCAN: a Density-Based Spatial Clustering of Applications with
Noise
Grid-based methods


STING: A Statistical Information Grid Approach to Spatial Data
Mining
High Dimensional Data Clustering


CLIQUE: A Dimension-Growth Subspace Clustering Method
Partitioning Methods
9
Partitioning Methods
K-Means Clustering
Partitioning Methods
K-Means Clustering
10



k-means clustering is an exclusive clustering
algorithm. Each object is assigned to precisely one of
a set of clusters. (There are other methods that
allow objects to be in more than one cluster.)
For this method of clustering we start by deciding
how many clusters k we would like to form from our
data.
The value of k is generally a small integer, such as 2,
3, 4 or 5, but may be larger.
Partitioning Methods
The k-Means Clustering Algorithm
11
1. Choose a value of k.
2. Select k objects in an arbitrary fashion. Use these as
the initial set of k centroids.
3. Assign each of the objects to the cluster for which it
is nearest to the centroid.
4. Recalculate the centroids of the k clusters.
5. Repeat steps 3 and 4 until the centroids no longer
move.
Partitioning Methods
Example (k=3)
12
Partitioning Methods
Initial Clusters
13
Partitioning Methods
Revised Cluster
14
Partitioning Methods
Third Set of Clusters
15
These are the same clusters as before. Their centroids will be the
same as those from which the clusters were generated. Hence the
termination condition of the k-means algorithm has been met and
these are the final clusters produced by the algorithm for the initial
Partitioning Methods
Finding the Best Set of Clusters
16


It can be proved that the k-means algorithm will
always terminate, but it does not necessarily find the
best set of clusters, corresponding to minimising the
value of the objective function.
The initial selection of centroids can significantly
affect the result.
 Solution:
Try different initial selection and take the best
 But what should be k
 Try

different k
But which one to choose
Partitioning Methods
Selecting K
17



The table shown value of
Objective Function For Different
Values of k
These results suggest that the
best value of k is probably 3.
The value of the function for k =
3 is much less than for k = 2, but
only a little better than for k = 4.
We normally prefer to find a fairly small number of
Partitioning Methods
clusters
as far as possible.
K-Selection Strategy
18



We are not trying to find the value of k with the
smallest value of the objective function.
That will occur when the value of k is the same as the
number of objects, i.e. each object forms its own
cluster of one. The objective function will then be
zero, but the clusters will be worthless.
We usually want a fairly small number of clusters
and accept that the objects in a cluster will be
spread around the centroid (but ideally not too far
away).
Partitioning Methods
Summary
19




Introduction
Clustering Methods
K-Means Clustering
Selecting K
Partitioning Methods
```
Related documents