Unsupervised Data Mining
(Clustering)
Javier Béjar
KEMLG
December 2012
Javier Béjar (KEMLG)
Unsupervised Data Mining (Clustering)
December 2012
1 / 51
Clustering in Data Mining
Introduction
Clustering in KDD
One of the main tasks in the KDD process is the analysis of data when we do not know its structure
This task is very different from prediction, in which we know the true answer and try to approximate it
A large number of KDD projects involve unsupervised problems (KDnuggets polls rank it among the 2nd-3rd most frequent tasks)
Problems: scalability, arbitrary cluster shapes, new types of data, parameter fitting, ...
Clustering in Data Mining
Introduction
Clustering in DM
[Diagram: overview of clustering in DM: data (unstructured data, structured data, data streams); dimensionality reduction (feature selection, feature transformation); clustering algorithms (model based, density based, grid based, agglomerative, other); scalability strategies (sampling, one pass, data compression, approximation, parallelization)]
Clustering in Data Mining
The data Zoo
All kinds of data
[Diagram: the three kinds of data: unstructured data, structured data, and data streams]
Clustering in Data Mining
The data Zoo
Unstructured data
Only one table of observations
Each example represents an instance of the problem
Each instance is represented by a set of attributes (discrete, continuous)

  A    B     C   ···
  1    3.1   a   ···
  1    5.7   b   ···
  0   -2.2   b   ···
  1   -9.0   c   ···
  0    0.3   d   ···
  1    2.1   a   ···
  ...  ...   ... ···
Clustering in Data Mining
The data Zoo
Structured data
One sequential relation among instances (time, strings)
  Several instances with internal structure (eg: sequences of events)
  Subsequences of unstructured instances (eg: sequences of complex transactions)
  One big instance (eg: time series)
Several relations among instances (graphs, trees)
  Several instances with internal structure (eg: XML documents)
  One big instance (eg: social network)
Clustering in Data Mining
The data Zoo
Data Streams
Endless sequence of data (eg: sensor data)
Several synchronized streams
Unstructured or structured instances
Static or dynamic model
Clustering in Data Mining
Dimensionality Reduction
I am in Hyperspace
[Diagram: dimensionality reduction comprises feature transformation and feature selection]
Clustering in Data Mining
Dimensionality Reduction
The Curse of dimensionality
Two problems arise from the dimensionality of a dataset:
The computational cost of processing the data (scalability of the algorithms)
The quality of the data (higher probability of bad data)
Two elements define the dimensionality of a dataset:
The number of examples
The number of attributes
Having too many examples can sometimes be solved by sampling
Reducing the number of attributes is a more complex problem
Clustering in Data Mining
Dimensionality Reduction
Reducing attributes
Usually the number of attributes of the dataset has an impact on the performance of the algorithms:
Because of their poor scalability (cost is a function of the number of attributes)
Because of their inability to cope with irrelevant/noisy/redundant attributes
Because of the low ratio of instances to dimensions (sparsity of the space)
Two main methodologies to reduce the attributes of a dataset:
Transforming the data to a space of fewer dimensions while preserving the structure of the data (dimensionality reduction / feature extraction)
Eliminating the attributes that are not relevant for the goal task (feature subset selection)
Clustering in Data Mining
Dimensionality Reduction
Dimensionality reduction
Many techniques have been developed for this purpose
Projection to a space that preserves the statistical model of the data (linear: PCA, ICA; non linear: Kernel-PCA)
Projection to a space that preserves distances among the data (linear: multidimensional scaling, random projection, SVD, NMF; non linear: ISOMAP, LLE, t-SNE)
Not all these techniques are scalable per se (O(N^2) to O(N^3) time complexity), but some can be approximated
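As an illustration of linear projection, a minimal PCA sketch in plain Python (a hypothetical `pca_1d` helper, restricted to 2-D input so the leading eigenvector of the 2x2 covariance matrix can be taken analytically; real implementations use a full eigendecomposition or SVD):

```python
import math

def pca_1d(points):
    """Project 2-D points onto their first principal component:
    center the data, build the 2x2 covariance matrix, and take
    its leading eigenvector analytically."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # entries of the 2x2 covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    c = sum(y * y for _, y in centered) / n
    b = sum(x * y for x, y in centered) / n
    # direction of the leading eigenvector
    theta = 0.5 * math.atan2(2 * b, a - c)
    ux, uy = math.cos(theta), math.sin(theta)
    # 1-D coordinates: projection of each centered point on the axis
    return [x * ux + y * uy for x, y in centered]

# points spread along the line y = x: one component captures them
coords = pca_1d([(0, 0), (1, 1), (2, 2), (3, 3)])
```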
Clustering in Data Mining
Dimensionality Reduction
[Figure: a 60-dimensional dataset projected to two dimensions with PCA, Kernel PCA, and ISOMAP]
Clustering in Data Mining
Dimensionality Reduction
Unsupervised Attribute Selection
The goal is to eliminate from the dataset all the redundant or irrelevant attributes
The original attributes are preserved
The methods for unsupervised attribute selection are less developed than those for supervised attribute selection
The problem is that an attribute can be relevant or not depending on the goal of the discovery process
There are mainly two techniques for attribute selection: wrappers and filters
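A minimal sketch of the filter approach, assuming a simple variance criterion (a hypothetical `variance_filter` helper; real filter methods use richer relevance measures, and wrapper methods instead score attribute subsets by re-running the clustering algorithm):

```python
def variance_filter(rows, threshold):
    """Filter-style unsupervised attribute selection: keep only
    the columns whose variance exceeds a threshold."""
    n, m = len(rows), len(rows[0])
    keep = []
    for j in range(m):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > threshold:
            keep.append(j)
    # return the reduced table and the indices of the kept columns
    return [[r[j] for j in keep] for r in rows], keep

# column 0 is constant and column 1 nearly so; only column 2 survives
data = [[1.0, 0.0, 3.1], [1.0, 0.1, 5.7],
        [1.0, -0.1, -2.2], [1.0, 0.0, -9.0]]
reduced, kept = variance_filter(data, threshold=0.5)
```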
Clustering in Data Mining
Clustering Algorithms
All kinds of algorithms
[Diagram: families of clustering algorithms: model based, density based, grid based, agglomerative, and other]
Clustering in Data Mining
Clustering Algorithms
Model/Center Based Algorithms
The algorithm is usually biased towards a preferred model (hyperspheres, hyperellipsoids, ...)
Clustering is defined as an optimization problem
An objective function is optimized (distortion, log-likelihood, ...)
Based on gradient descent-like algorithms
k-means and variants
Probabilistic mixture models/EM algorithms (eg: Gaussian mixtures)
Scalability depends on the number of parameters, the size of the dataset and the iterations until convergence
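The optimization loop of k-means (Lloyd's algorithm) can be sketched as follows; initializing from the first k points is a simplification for brevity (practical implementations use k-means++ or random restarts):

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 2-D points: alternate an assignment
    step and a center-update step until convergence."""
    centers = list(points[:k])   # naive deterministic initialization
    for _ in range(iters):
        # assignment step: each point joins its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda i: (p[0] - centers[i][0]) ** 2
                                   + (p[1] - centers[i][1]) ** 2)
            clusters[best].append(p)
        # update step: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(x for x, _ in cl) / len(cl),
                              sum(y for _, y in cl) / len(cl))
    return centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(data, 2))
```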
Clustering in Data Mining
Clustering Algorithms
Model Based Algorithms
[Figure: a dataset whose points are labelled 1 or 2 according to the model-based cluster assignment]
Clustering in Data Mining
Clustering Algorithms
Agglomerative Algorithms
Based on the hierarchical aggregation of examples
From spherical to arbitrarily shaped clusters
Defined over the affinity/distance matrix
Algorithms based on graphs/matrix algebra
Several criteria for the aggregation
Scalability is cubic in the number of examples (O(n^2 log(n)) using heaps, O(n^2) in special cases)
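The aggregation loop can be sketched naively with the single-link criterion (O(n^3) as written; the complexities quoted above refer to heap-based and specialized versions):

```python
def single_link(points, k):
    """Naive agglomerative clustering: repeatedly merge the two
    clusters with the smallest single-link (minimum pairwise)
    distance until only k clusters remain."""
    clusters = [[p] for p in points]
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link distance between clusters i and j
                d = min(dist2(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

groups = single_link([(0, 0), (0, 1), (5, 5), (5, 6), (0.5, 0.5)], 2)
sizes = sorted(len(c) for c in groups)
```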
Clustering in Data Mining
Clustering Algorithms
Agglomerative Algorithms
[Figure: a 2D dataset (axes x1, x2) of 24 points and the dendrogram produced by hierarchically clustering them]
Clustering in Data Mining
Clustering Algorithms
Density Based Algorithms
Based on the discovery of areas of higher density
Arbitrarily shaped clusters
Tolerant to noise/outliers
Algorithms based on:
Topology (DBSCAN-like: core points, reachability)
Kernel density estimation (DENCLUE, mean shift)
Scalability depends on the size of the dataset and the computation of the density (KDE, k-nearest neighbours); the best complexity is O(n log(n))
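The core-point/reachability idea of DBSCAN-like algorithms can be sketched as follows (naive O(n^2) neighbour search; the eps and min_pts values are chosen for the toy data):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: core points have at least min_pts
    neighbours within eps, clusters grow by density-reachability,
    and the remaining points are labelled noise (-1)."""
    def neighbours(i):
        x, y = points[i]
        return [j for j, (a, b) in enumerate(points)
                if (a - x) ** 2 + (b - y) ** 2 <= eps * eps]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbs = neighbours(i)
        if len(nbs) < min_pts:
            labels[i] = -1           # noise (may become a border point)
            continue
        labels[i] = cid
        seeds = list(nbs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid      # noise reached: border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            nbs_j = neighbours(j)
            if len(nbs_j) >= min_pts:   # j is also a core point: expand
                seeds.extend(nbs_j)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```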
Clustering in Data Mining
Clustering Algorithms
Density Based Algorithms - DBSCAN
Clustering in Data Mining
Clustering Algorithms
Density Based Algorithms - DENCLUE
Clustering in Data Mining
Clustering Algorithms
Grid Based Algorithms
Based on the discovery of areas of higher density by space partitioning
Arbitrarily shaped clusters
Tolerant to noise/outliers
Algorithms based on:
Fixed/adaptive grid partitioning of the features
Histogram mode discovery and dimension merging
Scalability depends on the number of features, the grid granularity and the number of examples
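A minimal sketch of the grid-based idea, assuming a fixed square grid and a flood fill that joins adjacent dense cells (adaptive grids and dimension merging are left out):

```python
from collections import defaultdict

def grid_clusters(points, cell, min_density):
    """Grid-based sketch: partition the space into fixed square
    cells, keep the cells with at least min_density points, and
    join adjacent dense cells into (arbitrarily shaped) clusters."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell), int(y // cell))].append((x, y))
    dense = {c for c, ps in cells.items() if len(ps) >= min_density}
    # merge neighbouring dense cells with a flood fill
    clusters, seen = [], set()
    for c in dense:
        if c in seen:
            continue
        comp, stack = [], [c]
        seen.add(c)
        while stack:
            cx, cy = stack.pop()
            comp.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(comp)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8),
       (5.1, 5.2), (5.3, 5.1), (9.9, 9.9)]   # last point is sparse
clusters = grid_clusters(pts, cell=1.0, min_density=2)
```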
Clustering in Data Mining
Clustering Algorithms
Grid Based Algorithms
Clustering in Data Mining
Clustering Algorithms
Other Algorithms
Based on graph theory
K-way graph partitioning
Max-flow/min-cut algorithms
Spectral clustering
Message passing/belief propagation (affinity propagation)
Evolutionary algorithms
SVM clustering/maximum margin clustering
···
Clustering in Data Mining
Clustering Algorithms
Other Clustering Related Areas
Consensus clustering
Clustering by combining several partitions into a consensus
Semisupervised clustering/clustering with constraints
Clustering with additional information (must-link and cannot-link constraints)
Subspace clustering
Clusters with reduced dimensionality
Clustering in Data Mining
Scalability strategies
Size matters :-)
[Diagram: scalability strategies: sampling, one pass, data compression, approximation, and parallelization]
Clustering in Data Mining
Scalability strategies
Strategies for cluster scalability
One-pass
Process the data as a stream
Summarization/data compression
Compress examples to fit more data in memory
Sampling/batch algorithms
Process a subset of the data and maintain/compute a global model
Approximation
Avoid expensive computations by approximate estimation
Parallelization/distribution
Divide the task into several parts and merge the models
Clustering in Data Mining
Scalability strategies
One pass
This strategy is based on incremental clustering algorithms
They are cheap, but the order of processing greatly affects their quality
They can nevertheless be used as a preprocessing step
Two-step algorithms:
1. A large number of clusters is generated using the one-pass algorithm
2. A more accurate algorithm clusters the preprocessed data
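The one-pass first step can be sketched with a Leader-style algorithm: each example joins the first cluster whose leader is within a distance threshold, otherwise it founds a new cluster (the threshold value below is an illustrative choice):

```python
def leader(points, threshold):
    """One-pass Leader algorithm sketch: cheap (a single pass),
    but the result depends on the order of the data."""
    leaders, members = [], []
    for p in points:
        for i, l in enumerate(leaders):
            if (p[0] - l[0]) ** 2 + (p[1] - l[1]) ** 2 <= threshold ** 2:
                members[i].append(p)   # joins an existing leader
                break
        else:
            leaders.append(p)          # founds a new cluster
            members.append([p])
    return leaders, members

stream = [(0, 0), (0.5, 0), (10, 10), (0.2, 0.4), (10.5, 9.5)]
leaders, members = leader(stream, threshold=2.0)
```

A second, more accurate algorithm (eg: hierarchical clustering) would then be run on the leaders only.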
Clustering in Data Mining
Scalability strategies
Data Compression/Summarization
Discard sets of examples and summarize them by:
Sufficient statistics
Density approximations
Discard data that is irrelevant for the model (does not affect the result)
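Summarization by sufficient statistics can be sketched for 1-D examples: the (count, sum, sum of squares) triple recovers the mean and variance, and triples are additive, so discarded sets can be merged exactly:

```python
def summarize(points):
    """Sufficient statistics of a set of 1-D examples: once the
    triple is stored, the raw examples can be discarded."""
    n = len(points)
    s = sum(points)
    ss = sum(x * x for x in points)
    return (n, s, ss)

def merge(a, b):
    # the statistics are additive: merging two summaries is exact
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_var(stats):
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean    # E[x^2] - E[x]^2

batch1 = summarize([1.0, 2.0, 3.0])
batch2 = summarize([4.0, 5.0])
mean, var = mean_var(merge(batch1, batch2))
```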
Clustering in Data Mining
Scalability strategies
Approximation
Not using all the available information to make decisions
Using k-nearest neighbours (data structures for computing k-neighbours)
Preprocessing the data using a cheaper algorithm
Generating batches using approximate distances (eg: canopy clustering)
Using approximate data structures
Use of hashing or approximate counts for distance and frequency computations
Clustering in Data Mining
Scalability strategies
Batches/Sampling
Process only data that fits in memory
Obtain from the dataset:
Samples (process only a subset of the dataset; determine the size of the sample so that all the clusters are represented)
Batches (process all the dataset)
Clustering in Data Mining
Scalability strategies
Parallelization/Distribution/Divide&Conquer
Parallelization of clustering usually depends on the specific algorithm
Some are difficult to parallelize (eg: hierarchical clustering)
Some have specific parts that can be solved in parallel or by Divide&Conquer:
Distance computations in k-means
Parameter estimation in EM algorithms
Grid density estimations
Space partitioning
Batches and sampling are more general approaches
The problem is how to merge all the different partitions
Clustering in Data Mining
Scalability strategies
Examples - One pass + Summarization
One pass + hierarchical single link
One-pass summarization using the Leader algorithm (many clusters)
Single-link hierarchical clustering of the summaries
Theoretical results about the equivalence to single link at the top levels (from O(N^2) to O(c^2), with c the number of summaries)
Clustering in Data Mining
Scalability strategies
One pass + Single Link
[Figure: 1st phase, Leader algorithm; 2nd phase, hierarchical clustering of the leaders]
Clustering in Data Mining
Scalability strategies
Examples - One pass + Summarization
One pass + model based algorithms
BIRCH algorithm:
One-pass Leader algorithm using an adaptive hierarchical data structure (CF-tree)
Postprocessing for noise filtering
Clustering of the cluster centers from the CF-tree
Clustering in Data Mining
Scalability strategies
One pass + CFTREE (BIRCH)
[Figure: 1st phase, CF-tree construction; 2nd phase, k-means over the CF-tree centers]
Clustering in Data Mining
Scalability strategies
Examples - Batch + (Divide&Conquer / One-pass)
Batch + k-means-like
Small-Space algorithm (hierarchy of centers):
Clustering using a randomized algorithm based on the facility location problem (LSEARCH algorithm)
The data is divided into M batches
Each batch is clustered into k weighted centers
Recursively, the M · k centers are divided into new batches for the next level
The last level is clustered into k centers
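The batch hierarchy described above can be sketched on 1-D data; a toy deterministic k-means stands in for the LSEARCH subroutine, and the per-center weights of the real algorithm are omitted for brevity:

```python
def kmeans1d(xs, k, iters=15):
    """Toy deterministic 1-D k-means used as the per-batch
    clustering subroutine."""
    xs = sorted(xs)
    # deterministic init: k evenly spaced points of the sorted batch
    centers = [xs[i * len(xs) // k] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def divide_and_conquer(data, batch_size, k):
    """Cluster each batch into k centers, pool the M*k centers,
    and cluster them into the final k (one level of the hierarchy)."""
    level = [kmeans1d(data[i:i + batch_size], k)
             for i in range(0, len(data), batch_size)]
    merged = [c for batch in level for c in batch]
    return kmeans1d(merged, k)

data = [0.0, 0.2, 0.1, 9.9, 10.1, 10.0, 0.3, 10.2, 0.15, 9.8]
centers = sorted(divide_and_conquer(data, batch_size=5, k=2))
```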
Clustering in Data Mining
Scalability strategies
Batch + Divide&Conquer
[Figure: M batches of K clusters each produce M·K new examples for the next level]
Clustering in Data Mining
Scalability strategies
Examples - Sampling + Summarization
Random samples + data compression
Scalable k-means:
Retrieve a sample from the database that fits in memory
Cluster it with the current model
Discard the data close to the centers
Compress high density areas using sufficient statistics
Clustering in Data Mining
Scalability strategies
Sampling + Summarization
[Figure: original data; sampling; updating the model; compression; more data is added; updating the model]
Clustering in Data Mining
Scalability strategies
Examples - Sampling + Distribution
CURE
Draws a random sample from the dataset
Partitions the sample into p groups
Executes the clustering algorithm on each partition
Deletes the outliers
Runs the clustering algorithm on the union of all the groups until k groups are obtained
Clustering in Data Mining
Scalability strategies
Sampling + Distribution
[Figure: data; sampling + partition; clustering of each partition; joining the partitions; labelling the data]
Clustering in Data Mining
Scalability strategies
Examples - Approximation + Distribution
Canopy Clustering
Approximate dense areas by probing the space of examples
Batches are formed by extracting spherical volumes (canopies) using a cheap distance
Inner radius/outer radius
Each canopy is clustered separately (disjoint clusters)
Only distances inside the canopy are computed
Several strategies for the classical algorithms
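The canopy-extraction phase can be sketched as follows, with a deterministic center choice for reproducibility (the original algorithm picks centers at random, and a cheap distance would replace the exact one used here):

```python
def canopies(points, t_loose, t_tight):
    """Canopy clustering sketch: pick a point as a canopy center,
    put every point within the loose radius in the canopy, and
    remove the points within the tight radius from further
    consideration.  t_tight < t_loose, so canopies can overlap."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]          # deterministic choice
        def d2(p):
            return (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2
        # the canopy takes points from the whole dataset (overlap)
        canopy = [p for p in points if d2(p) <= t_loose ** 2]
        # only tightly covered points leave the candidate pool
        remaining = [p for p in remaining if d2(p) > t_tight ** 2]
        result.append(canopy)
    return result

pts = [(0, 0), (0.3, 0.1), (1.5, 0), (8, 8), (8.2, 8.1)]
cans = canopies(pts, t_loose=2.0, t_tight=1.0)
```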
Clustering in Data Mining
Scalability strategies
Approximation + Distribution
[Figure: 1st, 2nd and 3rd canopies extracted from the data]
Clustering in Data Mining
Scalability strategies
Examples - Divide&Conquer
K-means + indexing
Build a kd-tree structure over the examples
Determine the example-center assignments in parallel
Each branch of the kd-tree only keeps a subset of the centers
If there is only one candidate center in a branch, no distances are computed
Distances are only computed at the leaves
Effective for low dimensionality
Clustering in Data Mining
Scalability strategies
Divide&Conquer - K-means + KD-tree
[Figure: KD-tree partitioning of the space of examples]
Clustering in Data Mining
Scalability strategies
Examples - Divide&Conquer
Optigrid:
Find projections of the data into lower dimensionality
Determine the best cuts of the projections
Build a grid with the cuts
Recurse on the dense cells (parallelization)
Clustering in Data Mining
Scalability strategies
Divide&Conquer - Grid Clustering
Clustering in Data Mining
Scalability strategies
Examples - Approximation
Quantization + clustering (k-means)
Quantize the space of examples and keep the dense cells
Approximate the cluster assignments using the cells as examples
Recompute the centers with the examples in the cells
Clustering in Data Mining
Scalability strategies
Divide&Conquer + approximation
Clustering in Data Mining
Scalability strategies
The END
THANKS
(Questions, Comments, ...)