What is a cluster
Clustering: partitioning data into similar groupings. A cluster is a grouping of 'similar' items! Clustering is a process that partitions a set of objects into equivalence classes. In clustering, data is partitioned into classes; members of each class are in some way more "similar" to each other than to those belonging to different classes. So
a. intra-class variance is low
b. inter-class variance is high
Applications of clustering:
1. Pattern recognition
2. Data mining
3. Image processing
4. Market research & econometric assessment
5. In the WWW: document classification and search on a similar ontology space
6. Performance modeling
7. Land use registry
8. Insurance
9. Urban and regional planning
Clustering = Unsupervised learning with no notion of predefined classes (what are they, how many of them – no a
priori knowledge)
Data preparation before data mining:
► Normally, data to be mined is noisy, with many unwanted attributes, etc.
► Discretization of continuous data
► Data normalization to the [-1 .. +1] or [0 .. 1] range
► Data smoothing to reduce noise, removal of outliers, etc.
► Relevance analysis: feature selection to ensure a relevant set of wanted features only
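As a minimal illustration of two of these steps, the sketch below rescales each attribute to the [0 .. 1] range and drops gross outliers by a z-score rule. It assumes NumPy, and the function names are illustrative rather than from any particular library.

import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
    return (X - mins) / span

def remove_outliers(X, z_threshold=3.0):
    """Keep only rows whose z-score stays below the threshold in every column."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std = np.where(std > 0, std, 1.0)                # constant columns never flag outliers
    z = np.abs((X - X.mean(axis=0)) / std)
    return X[(z < z_threshold).all(axis=1)]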
Clustering is an unsupervised partition of given data into equivalence classes. Ultimately, clustering is equivalent to classification.
A good clustering would produce a partition with low within-group variance and high inter-group variance. The idea is that the centroid of a cluster (the geographical center of a cluster in spatial data) and its variance about the centroid may be sufficient in most cases to depict the data: many data → few data.
Cluster point or centroid = exemplar
Variance = mushiness of the concept of the centroid
within-group = intra-cluster
between-groups = inter-cluster
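As a minimal sketch of these two summary quantities, the following computes the centroid (exemplar) of a cluster and its variance about the centroid, assuming NumPy and a cluster given as an array of points.

import numpy as np

def centroid(cluster):
    """Geometric center of the points in the cluster."""
    return np.asarray(cluster, dtype=float).mean(axis=0)

def variance_about_centroid(cluster):
    """Mean squared distance of the points from their centroid."""
    pts = np.asarray(cluster, dtype=float)
    c = pts.mean(axis=0)
    return float(np.mean(np.sum((pts - c) ** 2, axis=1)))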
The issue here is "similarity". How do we measure similarity? This is not easy to answer.
Secondly, if there are "hidden" patterns, does the clustering scheme discover them? Requirements of good clustering:
1. Insensitivity to the order of input data
2. Capable of cluster identification on a single pass over the data
3. Works even in the presence of noise and outliers
4. Scalability
5. Low dependence on domain knowledge
6. Ability to deal with different types of input: numerical, ordinal, categorical, etc.
7. Discovery of clusters with arbitrary shape
Two points or patterns are similar if the distance between them is lower than some threshold $\theta$:
$d(A, B) \le \theta \Rightarrow A$ and $B$ are in the same cluster.
A metric space is a set $S = \{x_i\}$ where a generalized distance function of some sort can be defined. In this,
$d(x_i, x_j) = 0$ if $x_i = x_j$
$d(x_i, x_j) = d(x_j, x_i) \ge 0$
$d(x_i, x_j) \le d(x_i, x_k) + d(x_k, x_j)$ (the triangle inequality)
How do we measure the distance? It depends!
Euclidean distance: $d_2(x_i, x_j) = \left[\sum_k (x_{ik} - x_{jk})^2\right]^{1/2}$ as the 2-norm
Manhattan distance: $d(x_i, x_j) = \sum_k |x_{ik} - x_{jk}|$
Bounded distance:
On any given metric space, a measure that never exceeds a threshold, e.g.
$d(x_i, x_j) = \dfrac{D(x_i, x_j)}{1 + D(x_i, x_j)}$
where $D(x_i, x_j)$ is a measure of distance between two points obeying the usual distance properties.
Maximum distance: $d(x_i, x_j) = \max_k |x_{ik} - x_{jk}|$ as the $\infty$-norm
Minimum distance: $d(x_i, x_j) = \min_k |x_{ik} - x_{jk}|$
Minkowski distance: $d_p(x_i, x_j) = \left(\sum_k |x_{ik} - x_{jk}|^p\right)^{1/p}$ as the p-norm
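The point-to-point distances above can be sketched in a few lines. NumPy is assumed, the names are illustrative, and the p-norm routine reduces to the Manhattan distance for p = 1 and the Euclidean distance for p = 2.

import numpy as np

def euclidean(x, y):                 # 2-norm
    return float(np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)))

def manhattan(x, y):                 # 1-norm
    return float(np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float))))

def maximum(x, y):                   # infinity-norm
    return float(np.max(np.abs(np.asarray(x, float) - np.asarray(y, float))))

def minimum(x, y):
    return float(np.min(np.abs(np.asarray(x, float) - np.asarray(y, float))))

def minkowski(x, y, p=2):            # general p-norm
    return float(np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float)) ** p) ** (1.0 / p))

def bounded(x, y, base=euclidean):   # D / (1 + D) never exceeds 1
    D = base(x, y)
    return D / (1.0 + D)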
String distance:
a. Hamming distance: the distance between two strings is the total number of positions at which they differ.
b. Levenshtein distance: given a source string s and a target string t, the minimum number of insertions, deletions and substitutions required to transform s into t. Useful in
• Spell checking
• Speech recognition
• DNA analysis
• Plagiarism detection
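A minimal sketch of these two string distances, in plain Python; the function names are illustrative.

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t,
    computed with the classic dynamic-programming recurrence."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(t)]

For example, levenshtein("kitten", "sitting") returns 3.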
The larger the distance between two data items, the less similar they are.
Also, we need a way to measure the distance between two
clusters. A number of possibilities exist.
Single linkage distance (nearest neighbor):
$D(C_1, C_2) = \min_{x_1 \in C_1,\, x_2 \in C_2} d(x_1, x_2)$
the minimum distance between points in them.
Complete linkage distance (farthest neighbor):
$D(C_1, C_2) = \max_{x_1 \in C_1,\, x_2 \in C_2} d(x_1, x_2)$
Average linkage distance (a compromise):
the average distance among all pairs of points,
$d(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x_1 \in C_1} \sum_{x_2 \in C_2} d(x_1, x_2)$
Centroid linkage (easiest):
$d(C_1, C_2) = d(\bar{x}_1, \bar{x}_2)$, the distance between the two cluster centroids.
■ Single-linkage produces long and stringy clusters
■ Complete linkage produces compact and small clusters
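The four cluster-to-cluster distances can be sketched as below, using the Euclidean point distance; NumPy is assumed, and C1, C2 are arrays of points.

import numpy as np

def _pairwise(C1, C2):
    """Matrix of Euclidean distances between every point of C1 and every point of C2."""
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    return np.sqrt(((C1[:, None, :] - C2[None, :, :]) ** 2).sum(axis=2))

def single_linkage(C1, C2):      # nearest neighbors
    return float(_pairwise(C1, C2).min())

def complete_linkage(C1, C2):    # farthest neighbors
    return float(_pairwise(C1, C2).max())

def average_linkage(C1, C2):     # mean over all |C1||C2| pairs
    return float(_pairwise(C1, C2).mean())

def centroid_linkage(C1, C2):    # distance between the two centroids
    c1 = np.asarray(C1, float).mean(axis=0)
    c2 = np.asarray(C2, float).mean(axis=0)
    return float(np.sqrt(((c1 - c2) ** 2).sum()))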
What should be the criteria for good clustering? Again, it depends.
Classical criterion-set:
1. The number of distinct clusters should be as low as possible and yet accommodate a sense of classification.
2. The distance between two distinct clusters should be larger than some threshold. How big, how small?
3. A decent clustering scheme will yield a minimum total variance of the clusters over all possible schemes.
Not all of these are independent. Finding the ideal clustering is an NP-complete problem, akin to bin-packing. One needs a workable heuristic to approach the problem.
Classification = Clustering = Learning
■ A set S is partitioned into a number of equivalence classes.
Supervised learning → classification (normally)
Unsupervised learning → clustering
Both clustering and classification induce a partition on the given data set.
The clustering schemes:
a. Partitioning algorithms: construct various partitions and then evaluate them by various criteria. If clusters are too close, perhaps they should be coalesced; if they are too voluminous (high variance), perhaps they should be partitioned further.
b. Hierarchical algorithms: agglomerative and divisive approaches
c. Density-based algorithms: based on connectivity and density functions
d. Grid-based algorithms
[Diagram: clustering schemes. A hierarchical scheme is either bottom-up (agglomerative) or top-down (divisive).]
In general, a cluster scheme is agglomerative if new clusters are formed one at a time from the existing cluster set.
Let $P^n = \{v_1, v_2, \ldots, v_n\}$ be an n-cluster partition. We obtain $P^k$ from $P^{k+1}$ as follows:
a. Choose $C_h, C_l \in P^{k+1}$
b. $P^{k+1} \to P^k$ as $C_h$ and $C_l$ are erased from $P^{k+1}$ and the new cluster $C = C_h \cup C_l$ is inserted into the configuration.
The selection of $C_h$ and $C_l$ is predicated by some objective function. Some choices:
a. minimize $\mathrm{var}(C_h \cup C_l) - \mathrm{var}(C_h) - \mathrm{var}(C_l)$
b. minimize $F(C_h, C_l) = \mathrm{diameter}(C_h \cup C_l)$, where $\mathrm{diameter}(C) = \max_{x, y \in C} \min_{P(x,y)} \mathrm{length}(P(x,y))$
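A minimal sketch of one agglomerative pass under objective (a) as reconstructed above: the pair (Ch, Cl) whose merge adds the least variance is fused, shrinking the partition from k+1 clusters to k. NumPy is assumed and clusters is a list of point arrays.

import numpy as np
from itertools import combinations

def _var(cluster):
    """Variance of a cluster about its centroid."""
    pts = np.asarray(cluster, float)
    return float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))

def merge_once(clusters):
    """Fuse the pair minimizing var(Ch u Cl) - var(Ch) - var(Cl)."""
    best_pair, best_cost = None, np.inf
    for h, l in combinations(range(len(clusters)), 2):
        merged = np.vstack([clusters[h], clusters[l]])
        cost = _var(merged) - _var(clusters[h]) - _var(clusters[l])
        if cost < best_cost:
            best_pair, best_cost = (h, l), cost
    h, l = best_pair
    merged = np.vstack([clusters[h], clusters[l]])
    return [c for i, c in enumerate(clusters) if i not in (h, l)] + [merged]

def agglomerate(clusters, target_k):
    """Repeat the merge step until only target_k clusters remain."""
    while len(clusters) > target_k:
        clusters = merge_once(clusters)
    return clusters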
Clustering or classification can be carried out as
► supervised learning (e.g., as in neural nets)
► unsupervised learning (e.g., as in Isodata, ...)
Example: Isodata.
Basic algorithm:
1. Given n patterns (points, objects, signatures, ...) $x_k \in X$.
2. Choose randomly any two points $x_k$ and $x_l$ such that the distance $d(x_k, x_l) \ge \delta$, the minimum inter-cluster distance; these seed the first two clusters.
3. Take one of the remaining points $x_m \in X$. If $d(x_m, C_i) \le d(x_m, C_j)$ for every other cluster $C_j$ (i.e., $C_i$ is the nearest cluster), assign $x_m$ to $C_i$ if $d(x_m, C_i) \le \Delta$, the maximum cluster diameter.
4. If $C_i$ is the nearest cluster but $d(x_m, C_i) > \Delta$, assign $x_m$ to its own cluster $C_m$.
5. For the remaining points go to step 3 until no point remains.
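A minimal sketch of the basic algorithm as reconstructed above. NumPy is assumed; delta and Delta stand for the minimum inter-cluster distance and the maximum cluster diameter, and the distance from a point to a cluster is taken here as the distance to the cluster centroid, which is one common reading of d(xm, C).

import numpy as np

def isodata_basic(X, delta, Delta, seed=None):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    order = rng.permutation(len(X))

    # Step 2: pick two seed points at least delta apart.
    seeds = None
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if np.linalg.norm(X[order[i]] - X[order[j]]) >= delta:
                seeds = (order[i], order[j])
                break
        if seeds is not None:
            break
    if seeds is None:                      # no pair is far enough apart
        return [X]
    clusters = [[X[s]] for s in seeds]

    # Steps 3-5: assign each remaining point to the nearest cluster if it stays
    # within Delta of that cluster's centroid, otherwise open a new cluster.
    for m in (k for k in order if k not in seeds):
        centroids = [np.mean(c, axis=0) for c in clusters]
        dists = [np.linalg.norm(X[m] - c) for c in centroids]
        nearest = int(np.argmin(dists))
        if dists[nearest] <= Delta:
            clusters[nearest].append(X[m])
        else:
            clusters.append([X[m]])        # xm gets its own cluster Cm
    return [np.array(c) for c in clusters]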
Pros and cons of Isodata
Pros:
1. Clustering is not geographically biased to any particular region of the data distribution.
2. A very efficient way of finding inherent clusters in a set.
Cons:
1. The clustering depends on the number of iterations required.
2. One doesn't know a priori the number of distinct clusters.
3. Insensitive to variance/covariance.