CLUSTERING
Overview
• Definition of Clustering
• Existing clustering methods
• Clustering examples
Definition
• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.
• Clustering is “the process of organizing objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Why clustering?
A few good reasons ...
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process
Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Medical diagnostics
Major existing clustering methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic
Measuring Similarity
• Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
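To make d(i, j) concrete, here is a minimal Python sketch, not part of the original slides; the choice of functions, weights and records is an illustrative assumption.

```python
# A minimal sketch (not from the slides) of two possible distance functions
# d(i, j): a weighted Euclidean distance for interval-scaled variables and a
# simple matching dissimilarity for boolean/categorical variables.
# The weights and records below are hypothetical.
import math

def weighted_euclidean(x, y, weights):
    """d(i, j) for interval-scaled variables, with per-variable weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

def simple_matching(x, y):
    """Dissimilarity for boolean/categorical variables: fraction of mismatches."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

print(weighted_euclidean((20, 170, 80), (30, 160, 120), (1.0, 0.5, 0.25)))
print(simple_matching(("red", True), ("blue", True)))  # 0.5
```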
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (a singleton).
2. Recursively merge the most appropriate clusters (two or more at a time).
3. Stop when k clusters have been obtained.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters have been obtained.
General steps of hierarchical clustering
Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
• Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
• Compute the distances (similarities) between the new cluster and each of the old clusters.
• Repeat steps 2 and 3 until the items are grouped into K clusters.
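As a hedged illustration of this procedure (not part of the original slides), the sketch below runs agglomerative clustering with SciPy on a few hypothetical 2-D points and cuts the tree at K = 2 clusters; method='single' corresponds to the single-linkage rule described later.

```python
# A hedged sketch using SciPy's agglomerative clustering on hypothetical
# 2-D points; it follows the N items -> N singleton clusters -> repeated
# merging procedure described above and cuts the tree at K clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])

# Build the full merge tree (single linkage = closest-pair distance).
Z = linkage(points, method="single", metric="euclidean")

# Cut the tree so that K = 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 2] -- one cluster index per point
```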
Exclusive vs. non-exclusive clustering
• In the first case the data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example is the separation of points by a straight line in a two-dimensional plane.
• The second type, overlapping clustering, instead uses fuzzy sets to cluster the data, so that each point may belong to two or more clusters with different degrees of membership.
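As a small, purely hypothetical illustration (not from the slides), the snippet below contrasts the two kinds of output: exclusive clustering gives each point exactly one label, while overlapping (fuzzy) clustering gives each point a degree of membership in every cluster.

```python
# Hypothetical illustration: exclusive vs. overlapping (fuzzy) cluster output
# for four data points and two clusters.
hard_labels = [0, 0, 1, 1]            # exclusive: one cluster per point

fuzzy_memberships = [                 # overlapping: degrees of membership,
    [0.9, 0.1],                       # each row sums to 1
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
]
for point, row in enumerate(fuzzy_memberships):
    print(f"point {point}: cluster 0 -> {row[0]:.1f}, cluster 1 -> {row[1]:.1f}")
```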
Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of the hierarchical methods).
This recursive relocation yields higher-quality clusters; the K-means algorithm described later is the canonical example of this approach.
Probabilistic clustering
1. The data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution are used as the parameters of a cluster.
3. Each point receives a single cluster membership.
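One common way to do probabilistic clustering in practice is to fit a Gaussian mixture; the sketch below is an assumption rather than part of the slides, and uses scikit-learn's GaussianMixture on synthetic data.

```python
# A hedged sketch of probabilistic clustering with a Gaussian mixture:
# the data are modelled as a mixture of distributions, the mean and
# (co)variance of each component are the cluster parameters, and each
# point finally receives a single cluster membership.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),   # hypothetical cluster 1
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),   # hypothetical cluster 2
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_)              # per-component means (cluster parameters)
print(gmm.covariances_.shape)  # per-component covariances
labels = gmm.predict(data)     # single (hard) cluster membership per point
```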
Single-Linkage Clustering (hierarchical)
• The N*N proximity matrix is D = [d(i,j)]
• The clusterings are assigned sequence numbers 0, 1, ..., (n-1)
• L(k) is the level of the kth clustering
• A cluster with sequence number m is denoted (m)
• The proximity between clusters (r) and (s) is denoted d[(r),(s)]
The algorithm is composed of the
following steps:
• Begin with the disjoint clustering having level
L(0) = 0 and sequence number m = 0.
• Find the least dissimilar pair of clusters in the
current clustering, say pair (r), (s), according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of clusters
in the current clustering.
The algorithm is composed of the
following steps:(cont.)
• Increment the sequence number : m = m +1. Merge
clusters (r) and (s) into a single cluster to form the next
clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
• Update the proximity matrix, D, by deleting the rows and
columns corresponding to clusters (r) and (s) and adding
a row and column corresponding to the newly formed
cluster. The proximity between the new cluster, denoted
(r,s) and old cluster (k) is defined in this way:
d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }
• If all objects are in one cluster, stop. Else, go to step 2.
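A minimal from-scratch sketch of these steps might look as follows; it is not part of the original slides, and the function and variable names are illustrative.

```python
# A hedged, from-scratch sketch of the single-linkage steps above.
# D maps a frozenset of two cluster labels to their proximity;
# levels[m] records L(m), the level of the m-th clustering.

def single_linkage(items, dist):
    """items: list of labels; dist(a, b): distance between two items."""
    clusters = [(x,) for x in items]          # disjoint clustering, m = 0
    D = {frozenset((r, s)): dist(r[0], s[0])
         for i, r in enumerate(clusters) for s in clusters[i + 1:]}
    levels = [0.0]                            # L(0) = 0
    merges = []
    while len(clusters) > 1:
        # Step 2: find the least dissimilar pair of clusters (r), (s).
        r, s = min(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                   key=lambda pair: D[frozenset(pair)])
        # Step 3: m = m + 1, merge (r) and (s), set L(m) = d[(r),(s)].
        level = D[frozenset((r, s))]
        merged = r + s
        levels.append(level)
        merges.append((r, s, level))
        # Step 4: update D with d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
        clusters = [c for c in clusters if c not in (r, s)]
        for k in clusters:
            D[frozenset((k, merged))] = min(D[frozenset((k, r))], D[frozenset((k, s))])
        clusters.append(merged)
    return merges, levels

# Hypothetical usage with four labelled points on a line:
coords = {"A": 0.0, "B": 1.0, "C": 5.0, "D": 6.5}
merges, levels = single_linkage(list(coords), lambda a, b: abs(coords[a] - coords[b]))
print(merges)  # e.g. [(('A',), ('B',), 1.0), (('C',), ('D',), 1.5), ...]
```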
Hierarchical clustering example
• Let’s now see a simple example: a hierarchical clustering
of distances in kilometers between some Italian cities.
The method used is single-linkage.
• Input distance matrix (L = 0 for all the clusters):
• The nearest pair of cities is MI and TO, at distance 138. These
are merged into a single cluster called "MI/TO". The level of
the new cluster is L(MI/TO) = 138 and the new sequence
number is m = 1.
Then we compute the distance from this new compound object
to all other objects. In single link clustering the rule is that
the distance from the compound object to another object is
equal to the shortest distance from any member of the
cluster to the outside object. So the distance from "MI/TO"
to RM is chosen to be 564, which is the distance from MI to
RM, and so on.
• After merging MI with TO we obtain the
following matrix:
• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a
new cluster called NA/RM
L(NA/RM) = 219
m=2
• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM
into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m=3
• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM
and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4
• Finally, we merge the last two clusters at level 295.
• The process is summarized by the following hierarchical tree:
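The tree shown in the slides is a dendrogram. A tree of this kind can be drawn with SciPy's dendrogram routine; the sketch below is only an illustration, and the labels and distances in it are hypothetical, not the city distances from the example.

```python
# A hedged sketch: drawing a single-linkage dendrogram from a distance
# matrix with SciPy. The labels and distances here are hypothetical.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["P1", "P2", "P3", "P4"]
D = np.array([
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 3.0, 5.0],
    [4.0, 3.0, 0.0, 2.0],
    [6.0, 5.0, 2.0, 0.0],
])

Z = linkage(squareform(D), method="single")  # merge levels L(m) are in Z[:, 2]
dendrogram(Z, labels=labels)
plt.show()
```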
K-means algorithm
1. It accepts as input the number of clusters to group the data into and the dataset to cluster.
2. It then creates the first K initial clusters (K = number of clusters needed) from the dataset by choosing K rows of data at random from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset as the initial clusters. Each of the 3 initial clusters formed will have just one row of data.
3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of a cluster with one record is simply the set of values that make up that record. For example, if the dataset we are discussing is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented by a Height, Weight and Age measurement, then P = {Age, Height, Weight}. A record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 centimeters (1.70 meters) and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for John as a member = {20, 170, 80}.
4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan/city-block distance.
5. K-Means then re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, where the record of the set of measurements for John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean is represented as P_mean = {Age_mean, Height_mean, Weight_mean}: Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2 and Weight_mean = (80 + 120)/2. The arithmetic mean of this cluster is therefore {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.
6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means algorithm no longer change the clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed and the K-Means clustering procedure is completed.
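Putting the seven steps together, a compact from-scratch sketch in Python might look as follows; it is not part of the original slides, the sample {Age, Height, Weight} records and K = 2 are illustrative assumptions, and Euclidean distance is used as the similarity measure.

```python
# A hedged from-scratch sketch of the K-Means steps described above.
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def arithmetic_mean(records):
    return tuple(sum(values) / len(records) for values in zip(*records))

def k_means(dataset, k, max_iterations=100):
    # Steps 1-2: take K and the dataset; pick K random records as initial clusters.
    centers = random.sample(dataset, k)
    for _ in range(max_iterations):
        # Steps 4 and 6: assign every record to the nearest cluster center.
        clusters = [[] for _ in range(k)]
        for record in dataset:
            nearest = min(range(k), key=lambda c: euclidean(record, centers[c]))
            clusters[nearest].append(record)
        # Step 5: recompute the arithmetic mean of each (non-empty) cluster.
        new_centers = [arithmetic_mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 7: stop when the centers no longer change (stable clusters).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical {Age, Height, Weight} records in the style of the slides.
records = [(20, 170, 80), (30, 160, 120), (22, 175, 85), (28, 158, 115)]
centers, clusters = k_means(records, k=2)
print(centers)
print(clusters)
```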