Download What is a cluster

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Principal component analysis wikipedia, lookup

Human genetic clustering wikipedia, lookup

Nonlinear dimensionality reduction wikipedia, lookup

K-nearest neighbors algorithm wikipedia, lookup

K-means clustering wikipedia, lookup

Nearest-neighbor chain algorithm wikipedia, lookup

Cluster analysis wikipedia, lookup

Transcript
Clustering: partitioning data into similar
groupings. A cluster is grouping of 'similar' items!
Clustering is a process that partitions a set of
objects into equivalence classes.
In clustering, data is partitioned into classes. Class
members of each class are in some way more
"similar" to each other than with those pertaining
to different classes. So
a. intra-class variance is low
b. inter-class variance is high
Application of clustering:
1. Pattern recognition
2. Data mining
3. Image processing
4. Market research & econometric assessment
5. In WWW: document classification and
search on a similar ontology space
6. Land use registry
7. Insurance
8. Urban and regional planning
Clustering = Unsupervised learning with no
notion of pre-defined classes (what are they, how
many of them – no a priori knowledge)
Data preparation before data mining:
► Normal data to be mined is noisy with many
unwanted attributes, etc.
►Discretization of continuous data
►Data normalization [ -1 .. + 1] or [0 .. 1] range
►Data smoothing to reduce noise, removal of
outliers, etc.
►Relevance analysis: feature selection to ensure
relevant set of wanted features only
Clustering is an unsupervised partition of a given
data into equivalent classes. Ultimately, clustering
is equivalent to classification.
A good clustering would produce a partition with
low within-group variance and high inter-group
variance. The idea is centroid of a cluster (the
geographical center of a cluster in a spatial data)
and its variance about the centroid may be
sufficient in most cases to depict the data.
many data

few data.
Cluster point or centroid = exempler
Variance = mushiness of the concept of the
centroid
within-group = intra-cluster
between-groups = inter cluster
The issue here is "similarity". How do we measure
similarity? This is not easy to answer.
Secondly, if there are "hidden"patterns, does the
clustering scheme discover them? Requirements
of good clustering:
1. Insensitivity to order of input data
2. Capable of cluster identification on a single
pass over data
3. Works even in presence of noise and outliers
4. Scalability
5. low dependence on domain knowledge
6. Ability to deal with different types of input:
numerical, ordinal, categorical, etc.
7. Discovery of clusters with arbitrary shape.
e.g.
Two points or patterns are similar if the distance
between them is lower than some threshold.
d( A,B )  
cluster.

A and B are in the same
A metric space is a set S   xi  where a
generalized distance function of some sort could
be defined. In this,
d( xi ,x j )  0 if xi  x j
d( xi ,x j )  d( x j ,xi )  0
d( xi ,x j )  d( xi ,xk )  d( xk ,x j ) triangular
inequality
How do we measure the distance? It depends! 
Euclidian distance:
d 2 ( xi ,x j )  ( xik  x jk )2 as 2-norm
k
Manhattan distance: d( xi ,x j )  | xik  x jk |
k
Bounded distance:
On any given metric space, a measure that never
exceeds a threshold.
e.g. d( xi ,x j ) 
D( xi ,x j )
where D( xi ,x j ) is a
1  D( xi ,x j )
measure of distance between two points obeying
usual distance properties.
Maximum distance:
d( xi ,x j )  max | xik  x jk | -norm
k
Minimum distance:
d( xi ,x j )  min | xik  x jk |
k
Mahalnabis distance:


d p ( xi ,x j )   |xik  x jk | p 
 k

String distance:
1
p
as p-norm
a. Hamming: Distance between two strings is
the total number of positions they differ.
b. Levenshtein distance: Given a source string s
and a target string t , the minimum number of
insertions, deletions and substitutions require
to transform s into t . Useful in




Spell checking
Speech recognition
DNA analysis
Plagiarism detection
Larger the distance between two data items less
similar they are.
Also, we need a way to measure the distance
between two clusters. A number of possibilities
exist.
Single linkage distance (nearest neighbor):
D( C1 ,C2 )  minx1C1 ,x2C2 d( x1 ,x2 )
minimum distance between points in them.
Complete Linkage distance (farthest neighbors)
D( C1 ,C2 )  maxx1C1 ,x2C2 d( x1 ,x2 )
Average linkage distance (a compromise):
Average distance among all pairs of points
d( C1 ,C2 ) 
1
| C1 || C2 |
  d( x ,x
x1C1 x2 C2
1
2
)
Centroid Linkage (easiest)
d( C1 ,C2 )  d( x1 ,x2 )
■ Single-linkage produces long and stringy
clusters
■ Complete linkage produces compact and small
clusters
What should be the criteria for good clustering?
Again depends.
Classical criterion-set:
1. Number of distinct clusters should be as low as
possible and yet accommodate a sense of
classification.
2. Distance between two distinct clusters should be
larger than some threshold. How big, how small?
3. A decent clustering scheme will yield a
minimum total variance of the clusters over all
possible schemes.
Not all these are independent. Ideal clustering
algorithm is an NP-complete bin-packing
problem. One needs a workable heuristic to
approach the problem.
Classification = Clustering = Learning
■ A set S is partitioned into a number of
equivalent classes.
Supervised learning  classification (normally)
Unsupervised Learning  clustering
Both clustering and classification induce partition
on the given data set .
The clustering schemes:
a. Partitioning algorithms: Construct various
partitions and then evaluate them by various
algorithms.
If clusters are too close, perhaps they should be
coalesced. If they are too voluminous (high
variance) , perhaps they should be partitioned
further.
b. Hierarchical algorithm: agglomerative ad
divisive approach
c. Density-based algorithms: based on
connectivity and density functions.
d. Grid-based algorithms:
Clustering
Scheme
Bottom -Up
Agglomerative
Top-Down
Hierarchical
In general,
A cluster-scheme is agglomerative if new clusters
are formed one at a time from the existing clusterset.
Let P n = { v1 ,v2 ,...vn } be an n-cluster partition. We
obtain
P k from P k 1 as follows:
a. Chose C h , Cl  P k 1
b. Pk 1  Pk as Ch and Cl are erased from
P k 1 and
the new cluster C  Ch Cl is inserted into
the configuration.
The selection of C h and Cl is predicated by some
objective function. Some choices:
a. var( Ch Cl )  var( Ch )  var( Cl )
b. minimize F(Ch ,Cl )  diameter( Ch
diameter( C )  max x ,yC min
P(x,y)
Cl ) where
length (P(x,y))
Clustering or classification can be classified as
► supervised learning (e.g. as in neural nets)
► unsupervised learning (e.g. as in Isodata, ..)
Example Isodata.
Basic algorithm.
1. Given n patterns (points, objects, signatures
..)
xk  X .
2. Choose randomly any two points xk and xl
such that the distance d( xk ,xl )   , the
minimum inter-cluster distance.
3. Take one of the remaining point xm  X . If
 d( xm ,C )  d( xm ,C ) assign xm to C if
d( xm ,C )   , the maximum cluster
diameter
4. If  d( xm ,C )  d( xm ,C ) and
d( xm ,C )   ,assign
xm to is own cluster Cm
5. For the remaining points go to step 3 until
no point is remaining.
Pros and cons of Isodata
Pros.
1. Clustering is not geographically biased to any
particular region of data distribution.
2. A very efficient way of finding inherent
clusters in a set.
Cons.
1. Clustering is based on a number of
iterations required.
2. One doesn’t know a priori number of
distinct clusters.
3. Insensitive to variance/covariance.