T-61.6020 Popular Algorithms in Data Mining and Machine Learning
BIRCH: Balanced Iterative Reducing and
Clustering using Hierarchies
Sami Virpioja
Adaptive Informatics Research Centre
Helsinki University of Technology
April 23, 2008
Outline
1. Introduction
2. Clustering Features
3. CF Tree
4. BIRCH Clustering
5. Performance and Experiments
6. Conclusions
Task
Data clustering with computational constraints:
Given the number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the function.
Limited amount of memory (less than the dataset size).
Minimize the time used for I/O operations, i.e., do not read the data from disk multiple times.
Properties of BIRCH
On-line algorithm: A single scan of the dataset is enough.
Optional refining with additional scans.
I/O cost minimization: Organize the data in an in-memory, height-balanced tree.
Local decisions: Each clustering decision is made without scanning all the points or clusters.
Outlier detection (optional): Points in sparse regions are treated separately as outliers.
Algorithm Phases (1)
[Figure: BIRCH overview, from Zhang et al. 1997]
Algorithm Phases (2)
Two main phases:
Phase 1: Construction of the CF tree using local clustering. Incremental; one scan of the data is enough.
Phase 3: Global clustering of the subclusters from the CF tree, using some standard clustering algorithm (e.g. hierarchical clustering, K-means).
The other phases are optional.
Clustering Features
Clustering Feature: triple $CF = (N, LS, SS)$
$N$ is the number of data points in the cluster (0th order moment).
$LS$ is the linear sum of the data points, $LS = \sum_{i=1}^{N} X_i$ (1st order moment).
$SS$ is the square sum of the data points, $SS = \sum_{i=1}^{N} \|X_i\|^2$ (2nd order moment).
CF Additivity Theorem: Assume two disjoint clusters with $CF_1 = (N_1, LS_1, SS_1)$ and $CF_2 = (N_2, LS_2, SS_2)$. For the merged cluster,
$CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)$.
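To make the CF triple and its additivity concrete, here is a minimal Python sketch (not from the paper; the class and method names are my own):

```python
import numpy as np

class ClusteringFeature:
    """CF triple (N, LS, SS) summarizing a set of d-dimensional points."""

    def __init__(self, d):
        self.N = 0                 # number of points (0th moment)
        self.LS = np.zeros(d)      # linear sum of the points (1st moment)
        self.SS = 0.0              # sum of squared norms of the points (2nd moment)

    def add_point(self, x):
        """Absorb a single data point."""
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.LS += x
        self.SS += float(x @ x)

    def merge(self, other):
        """CF additivity: the CF of the union of two disjoint clusters is the
        component-wise sum of their CF triples."""
        self.N += other.N
        self.LS = self.LS + other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        """R^2 = SS/N - ||LS/N||^2, i.e. mean squared distance to the centroid."""
        c = self.centroid()
        return float(np.sqrt(max(self.SS / self.N - float(c @ c), 0.0)))
```

Merging two clusters or reading off the centroid and radius thus needs no access to the original data points, only to the two triples.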
Representativity of Clustering Features (1)
CF Representativity Theorem: Given the CF entries of subclusters, the following measurements can be computed accurately.
For a single cluster $X$:
Centroid $X_0 = \frac{1}{N} \sum_{i=1}^{N} X_i$
Radius $R = \left( \frac{1}{N} \sum_{i=1}^{N} \|X_i - X_0\|^2 \right)^{1/2}$
Diameter $D = \left( \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \|X_i - X_j\|^2 \right)^{1/2}$
Representativity of Clustering Features (2)
For two clusters $X$ and $Y$:
Euclidean centroid distance $D0 = \|X_0 - Y_0\|$
Manhattan centroid distance $D1 = \|X_0 - Y_0\|_1$
Average inter-cluster distance $D2 = \left( \frac{1}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \|X_i - Y_j\|^2 \right)^{1/2}$
Average intra-cluster distance $D3 = \ldots$
Variance increase distance $D4 = \ldots$
For example:
$\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \|X_i - Y_j\|^2 = \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} (X_i^T X_i - 2 X_i^T Y_j + Y_j^T Y_j) = N_Y\, SS_X + N_X\, SS_Y - 2\, LS_X^T LS_Y$
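As an illustration (a sketch, not the paper's code), the centroid distance D0 and the average inter-cluster distance D2 can be computed from two CF triples alone, using the expansion above:

```python
import numpy as np

def d0(cf_x, cf_y):
    """Euclidean distance between the two cluster centroids (D0)."""
    return float(np.linalg.norm(cf_x.LS / cf_x.N - cf_y.LS / cf_y.N))

def d2(cf_x, cf_y):
    """Average inter-cluster distance (D2), using
    sum_ij ||X_i - Y_j||^2 = N_Y*SS_X + N_X*SS_Y - 2*LS_X.LS_Y."""
    nx, ny = cf_x.N, cf_y.N
    total = ny * cf_x.SS + nx * cf_y.SS - 2.0 * float(cf_x.LS @ cf_y.LS)
    return float(np.sqrt(max(total / (nx * ny), 0.0)))
```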
CF Summary
Thus, instead of storing the set of data points themselves, a cluster can be summarized by a single CF entry.
Such summaries are sometimes called micro-clusters.
Additivity: CF entries can be stored and updated incrementally.
Representativity: Many distances can be calculated without access to the individual data points.
CF Tree
Height-balanced tree with branching factor B, leaf size L, and leaf threshold T.
Non-leaf nodes contain at most B entries of the form [CF_i, child_i].
Each such node represents the cluster made up of all the subclusters of its entries.
Leaf nodes contain at most L entries of the form [CF_i], plus pointers to the previous and next leaf nodes.
Every entry in a leaf node must satisfy the threshold requirement, i.e., the diameter (or radius) of the subcluster must be less than T.
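A sketch of the two node types as plain Python data structures (the names are my own; the page-size bookkeeping is omitted):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LeafNode:
    """At most L CF entries, each a subcluster whose radius/diameter is below T;
    the leaves are chained into a doubly linked list for easy scanning."""
    entries: List["ClusteringFeature"] = field(default_factory=list)
    prev: Optional["LeafNode"] = None
    next: Optional["LeafNode"] = None

@dataclass
class NonLeafNode:
    """At most B entries of the form [CF_i, child_i], where CF_i summarizes
    the whole subtree rooted at child_i."""
    entries: List["ClusteringFeature"] = field(default_factory=list)
    children: list = field(default_factory=list)   # each child is a LeafNode or NonLeafNode
```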
CF Tree Example
[Figure: example CF tree with B = 3 and L = 4, showing non-leaf nodes with [CF_i, child_i] entries and leaf nodes chained together with prev/next pointers]
Size of CF Tree
In BIRCH, each node is required to fit into a page of size P.
Given the dimensionality (and type) of the data, B and L can be determined from P.
Thus, the size of the tree is a function of T:
A larger T allows more points to be absorbed into the same entries.
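For example (with purely hypothetical sizes): for d = 2 and 8-byte floats, a CF entry takes roughly 4 + 2·8 + 8 = 28 bytes, and a non-leaf entry [CF_i, child_i] about 32 bytes with a 4-byte child pointer, so a page of P = 1024 bytes gives roughly B ≈ 32 and L ≈ 36.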
Operations of CF Tree
The structure of the CF tree is similar to a B+-tree (used for representing sorted data).
Basic operation: Insert a new CF entry (a single point or a subcluster) into the tree.
Additional operation: Rebuild the tree with an increased threshold T.
Insertion Algorithm
1. Identifying the appropriate leaf:
Descend the tree by always choosing the closest child node according to the chosen distance metric (D0–D4).
2. Modifying the leaf:
Find the closest entry.
If the threshold is not exceeded, absorb the new entry into it.
Otherwise, if L is not exceeded, add a new entry to the leaf.
Otherwise, split the node: choose the two farthest entries as seeds and redistribute the rest, each to the closer one.
3. Modifying the path to the leaf:
If the child was not split, just update its CF.
If there was a split and B is not exceeded, add the new node to the parent; otherwise, split the parent.
Recurse upwards.
4. A merging refinement:
When the split has propagated up to some non-leaf node, test whether the pair of entries produced by the split are the closest pair in that node.
If not, merge the closest pair of entries and re-split the merged node if needed.
A simplified code sketch of steps 1–2 follows below.
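As referenced above, a simplified sketch of steps 1–2 (leaf identification and leaf modification), building on the ClusteringFeature and node classes sketched earlier. Actual node splitting and the merging refinement are not implemented, and D0 is used as the descent metric:

```python
import numpy as np

def closest_entry(entries, cf):
    """Index of the entry whose centroid is closest to cf's centroid (D0)."""
    c = cf.LS / cf.N
    return int(np.argmin([np.linalg.norm(e.LS / e.N - c) for e in entries]))

def fits_threshold(entry, cf, T):
    """Would absorbing cf into entry keep the merged radius below T?"""
    merged = ClusteringFeature(len(entry.LS))
    merged.merge(entry)
    merged.merge(cf)
    return merged.radius() < T

def insert(node, cf, T, L):
    """Insert a CF entry (a point or a subcluster) into the subtree at `node`.
    Returns True if the node now violates its size limit, i.e. a split is
    needed (the split itself is not implemented in this sketch)."""
    if isinstance(node, NonLeafNode):
        i = closest_entry(node.entries, cf)       # step 1: descend to the closest child
        insert(node.children[i], cf, T, L)
        node.entries[i].merge(cf)                 # step 3 (partly): update CF on the path
        return False
    # step 2: modify the leaf
    if node.entries:
        i = closest_entry(node.entries, cf)
        if fits_threshold(node.entries[i], cf, T):
            node.entries[i].merge(cf)             # absorb into the closest entry
            return False
    node.entries.append(cf)                       # otherwise start a new leaf entry
    return len(node.entries) > L                  # overflow: real BIRCH would split here
```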
Insertion Algorithm (example)
[Figures: step-by-step insertion of a new data point into the example CF tree (B = 3, L = 4): identifying the appropriate leaf, modifying the leaf, modifying the path to the leaf (with node splits), and the merging refinement]
Rebuilding Algorithm
If the selected T led to a tree that is too large for the memory, the tree must be rebuilt with a larger T.
This can be done with a small amount of extra memory and without scanning the data again.
Assume the original tree t_i has threshold T_i, size S_i, and height h_i.
Proceed one path, from a leaf to the root, at a time:
1. Copy the path to the new tree.
2. Insert the leaf entries into the new tree. If they can be absorbed into, or fit in, the closest path of the new tree, insert them there; otherwise insert them into the leaf of the copied path.
3. Remove empty nodes from both the old and the new tree.
For the new tree t_{i+1}: if T_{i+1} ≥ T_i, then S_{i+1} ≤ S_i, and the transformation from t_i to t_{i+1} needs at most h_i extra pages of memory.
A simplified (non-path-reusing) rebuilding sketch follows below.
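As referenced above, the following is a deliberately naive rebuilding sketch: it re-inserts the old leaf entries into a fresh tree with a larger threshold, whereas the paper's algorithm reuses the old paths so that only about one extra root-to-leaf path of memory is needed. The helper names are my own.

```python
def collect_leaf_entries(node):
    """All CF entries stored in the leaves of a (sub)tree."""
    if isinstance(node, LeafNode):
        return list(node.entries)
    entries = []
    for child in node.children:
        entries.extend(collect_leaf_entries(child))
    return entries

def rebuild(tree, larger_T, L):
    """Build a fresh tree with a larger threshold and re-insert the old
    tree's leaf entries; with a larger T more entries get absorbed into
    each other, so the new tree is smaller."""
    new_tree = LeafNode()
    for entry in collect_leaf_entries(tree):
        insert(new_tree, entry, larger_T, L)
    return new_tree
```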
Rebuilding Sketch
[Figure: rebuilding the CF tree, from Zhang et al. 1997]
Anomalies
The limited number of entries per node and a skewed input order can produce some anomalies:
Two subclusters that should be in one cluster are split across nodes.
Two subclusters that should not be in one cluster are kept in the same node.
Two instances of the same data point might end up in distinct leaf entries.
Global clustering (BIRCH Phase 3) solves the first two anomalies.
Further passes over the data are needed for the last one (optional Phase 4).
BIRCH Phases
[Figure: BIRCH phase overview, from Zhang et al. 1997]
BIRCH Phase 1
1. Start an initial in-memory CF tree t_0 with a small threshold T_0.
2. Scan the data and insert the points into the current tree t_i.
3. If memory runs out before all the data is inserted, rebuild a new tree t_{i+1} with T_{i+1} > T_i.
Optionally detect outliers while rebuilding:
If a leaf entry has “far fewer” data points than the average, write it to disk instead of adding it to the tree.
If disk space runs out, go through all outliers and re-absorb those that do not increase the tree size.
4. Otherwise, proceed to the next phase (and, optionally, recheck whether some of the potential outliers can be absorbed).
A sketch of this control flow follows below.
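As referenced above, a sketch of the Phase 1 control flow, reusing the earlier sketches. The memory budget is approximated by a cap on the number of leaf entries, the threshold update rule is a placeholder for the paper's heuristic, and outlier handling is omitted:

```python
def birch_phase1(points, max_leaf_entries, L, T0=1e-6, grow=2.0):
    """Insert the points one at a time; whenever the tree outgrows its
    (approximate) memory budget, rebuild it with a larger threshold."""
    T = T0
    tree = LeafNode()
    for x in points:
        cf = ClusteringFeature(len(x))
        cf.add_point(x)
        insert(tree, cf, T, L)
        while len(collect_leaf_entries(tree)) > max_leaf_entries:
            T *= grow                       # placeholder for the threshold heuristic
            tree = rebuild(tree, T, L)
    return tree, T
```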
BIRCH Phases 1–2
After Phase 1, the clustering should be easier in several ways:
Fast, because no I/O operations are needed and the problem of clustering N points is reduced to clustering M subclusters (with, hopefully, M << N).
Accurate, because outliers are eliminated and the remaining data is represented at the finest granularity the memory limit allows.
Less order-sensitive, because the order of the leaf entries has better data locality than an arbitrary input order.
Phase 2: If the global clustering algorithm in Phase 3 is known to work best for some range of input sizes M, keep reducing the tree until that range is reached.
BIRCH Phase 3
Use a global or semi-global clustering algorithm for the M subclusters (the CF entries of the leaf nodes).
This fixes the anomalies caused by the limited node sizes in Phases 1–2.
Several possibilities for handling the subclusters:
Naive: Use the centroid to represent the subcluster.
Better: Treat the centroid as N data points, i.e., weight it by the subcluster size.
Accurate: As seen above, most pointwise distance and quality metrics can be calculated directly from the CF vectors.
The authors of BIRCH used agglomerative hierarchical clustering with the distance metrics D2 and D4.
A sketch of the weighted variant is given below.
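As referenced above, a sketch of the weighted option (each centroid counted as N points), using scikit-learn's K-means with sample weights; the paper's authors used agglomerative hierarchical clustering with D2/D4 instead, and the function and parameter names here are my own:

```python
import numpy as np
from sklearn.cluster import KMeans

def birch_phase3(tree, K):
    """Cluster the M leaf subclusters globally: represent each leaf CF entry
    by its centroid, weighted by its point count N."""
    entries = collect_leaf_entries(tree)
    centroids = np.array([e.LS / e.N for e in entries])
    weights = np.array([e.N for e in entries], dtype=float)
    km = KMeans(n_clusters=K, n_init=10).fit(centroids, sample_weight=weights)
    return km.cluster_centers_, km.labels_     # seeds for Phase 4, subcluster labels
```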
BIRCH Phase 4
Use the centroids of the clusters from Phase 3 as seeds.
Make an additional scan of the data and redistribute each point to the cluster of the nearest seed.
This fixes the misplacement problems of the earlier phases.
Can be extended with additional passes (cf. K-means).
Optionally discard outliers (i.e., points too far from the closest seed).
A sketch of the redistribution pass follows below.
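As referenced above, a sketch of the redistribution pass (essentially one assignment step of K-means); discarding outliers would just add a distance cutoff:

```python
import numpy as np

def birch_phase4(points, seeds):
    """Assign every original data point to the nearest Phase 3 seed."""
    seeds = np.asarray(seeds)
    labels = []
    for x in points:
        dists = np.linalg.norm(seeds - np.asarray(x), axis=1)   # distance to each seed
        labels.append(int(np.argmin(dists)))
    return np.array(labels)
```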
CPU cost
N points in R^d, memory U bytes, page size P bytes.
Maximum size of the tree: U/P pages.
Phases 1 and 2:
Inserting all data points: $O(N \cdot d \cdot B \cdot \log_B(U/P))$.
Plus a smaller additional cost for the rebuildings.
Plus some I/O if outliers are written to disk.
Phase 3:
Depends on the algorithm and the number of subclusters M.
If M is limited in Phase 2, this cost does not depend on N.
Phase 4:
One iteration is $O(N \cdot K)$.
So the algorithm scales roughly linearly in N.
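As a worked example with purely hypothetical numbers: with U = 80 MB of memory and P = 4 KB pages, the tree has at most U/P ≈ 20000 nodes, so each insertion in Phase 1 touches only O(d · B · log_B 20000) CF entries regardless of N; the only costs that grow with the dataset are the N insertions in Phase 1 and the N nearest-seed searches (N · K distance computations) in Phase 4.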
Experiments
In [Zhang et al. 1996] and [Zhang et al. 1997], the authors compare the performance to K-means and CLARANS.
CLARANS searches a graph of partitionings locally to find a good one.
Random 2-D datasets of K = 100 clusters.
Centers placed either on a grid, on a sine curve, or at random.
About 1000 points per cluster.
Unsurprisingly, BIRCH uses less memory and is faster, less order-sensitive, and more accurate.
Scalability appears quite good:
Linear w.r.t. the number of clusters and the points per cluster.
Slightly worse than linear w.r.t. the data dimensionality.
Application: Filtering Trees from Background
[Figure: images of trees taken in NIR and VIS, from Zhang et al. 1997]
Application: Filtering Trees from Background
[Figure: separating branches, shadows, and sunlit leaves, from Zhang et al. 1997]
Conclusions
“Single-pass, sort-of-linear time algorithm that results in a sort-of-clustering of large number of data points, with most outliers sort-of-thrown-out.” [Fox & Gribble 1997]
There are lots of heuristics in the overall algorithm.
The novelty was in dealing with large datasets by using summary information (CF entries) instead of the data itself.
There are several extensions of the idea: parallel computation, different data types, different summary schemes, different clustering algorithms.
References
Tian Zhang, Raghu Ramakrishnan, and Miron Livny (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, pp. 103–114. ACM Press, New York, NY.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny (1997). BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1(2):141–182. Springer Netherlands.
Armando Fox and Steve Gribble (1997). Databases Paper Summaries. [Online, cited 23 April 2008.] Available at: http://www.cs.berkeley.edu/~fox/summaries/database/