COSC 526 Class 13
Clustering – Part II
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford),
Tan, Steinbach and Kumar
Last Class
• K-means Clustering + Hierarchical Clustering
– Basic Algorithms
– Modifications for streaming data
• More today:
– Modifications for big data!
• DBSCAN Algorithm
• Measures for validating clusters…
2
Project Schedule Updates…
1. Selection of Topics (due 1/27/2015)
2. Project Description & Approach (due 3/3/2015)
3. Initial Project Report (due 3/31/2015)
4. Project Demonstrations (4/16/2015 – 4/19/2015)
5. Project Report (due 4/23/2015)
6. Posters (due 4/23/2015)
• Reports 2-3 will be at most 1-2 pages each
• The final report will typically be 5-6 pages, in NIPS format
• More to come on posters
3
Assignment 1 updates
• Dr. Sukumar will be here on Tue (Mar 3) to give
you an overview of the python tools
• The deadline for the assignment remains Mar 10
• Assignment 2 will go out on Mar 10
4
Density-based Spatial Clustering of
Applications with Noise (DBSCAN)
5
Preliminaries
• Density is defined as the number of points within a radius ε
– In this case, density = 9
• A core point has more than a specified number of points (minPts) within ε
– These points are in the interior of a cluster
• A border point has fewer than minPts points within ε but is in the vicinity of a core point
• A noise point is neither a core point nor a border point
(Figure: core point, border point, and noise point for minPts = 4 and radius ε)
6
DBSCAN Algorithm
7
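The algorithm box from the original slide does not survive the transcript, so here is a minimal Python sketch of DBSCAN built directly from the core/border/noise definitions above; the function and parameter names (dbscan, eps, min_pts) are illustrative, not from any particular library.

```python
import numpy as np

def region_query(X, i, eps):
    """Return the indices of all points within radius eps of point i (including i)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), -1)          # -1 marks noise until proven otherwise
    visited = np.zeros(len(X), dtype=bool)
    cluster_id = 0
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(region_query(X, i, eps))
        if len(neighbors) < min_pts:
            continue                      # not a core point; stays noise for now
        labels[i] = cluster_id            # start a new cluster from this core point
        seeds = neighbors
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:   # j is also a core point, expand from it
                    seeds.extend(j_neighbors)
            if labels[j] == -1:           # border or unclaimed point joins the cluster
                labels[j] = cluster_id
        cluster_id += 1
    return labels
```

For example, dbscan(X, eps=0.5, min_pts=4) returns an array of cluster labels, with -1 marking noise points.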
Illustration of DBSCAN: Assignment of
Core, Border and Noise Points
8
DBSCAN: Finding Clusters
9
Advantages and Limitations
• Resistant to noise
• Can handle clusters of different sizes and shapes
• Eps and MinPts are dependent on each other
– Can be difficult to specify
• Different density clusters within the same class can be difficult to find
10
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
(Figure: original points and the DBSCAN clusterings obtained with MinPts=4, Eps=9.75 and MinPts=4, Eps=9.92)
11
How to determine Eps and MinPts
• For points within a cluster, the kth nearest neighbors are roughly at the same distance
• Noise points are farther away in general
• Plot by sorting the distance of every point to its kth nearest neighbor
12
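A short sketch of the k-distance plot described above, assuming NumPy and Matplotlib are available; k plays the role of MinPts and the names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot the sorted distance of every point to its k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    kth = np.sort(d, axis=1)[:, k]        # column 0 is the point itself (distance 0)
    plt.plot(np.sort(kth))
    plt.xlabel("points sorted by distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```

The sharp knee in this curve is a reasonable choice for Eps: points to the left of it sit inside clusters, points to the right are mostly noise.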
Modifying clustering algorithms to work with large datasets…
13
Last class, MapReduce Approach for K-means…
In the map step:
• Read the cluster centers into memory from a SequenceFile
• Iterate over each cluster center for each input key/value pair
• Measure the distances and save the nearest center (the one with the lowest distance to the vector) in the key
In the reduce step (we get the associated vectors for each center):
• Iterate over each value vector and calculate the average vector (sum the vectors and divide each component by the number of vectors received)
• This is the new center; save it into a SequenceFile
• Check the convergence between the cluster center stored in the key object and the new center
• If they are not equal, increment an update counter
• Write the cluster center with its vector to the filesystem
What do we do if we do not have sufficient memory, i.e., we cannot hold all of the data?
• Practical scenario: we cannot hold a million data points at a given time, each of them being a high-dimensional vector
14
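As a rough illustration of the map and reduce steps just listed, here is a minimal Python sketch; a real job would run on Hadoop with SequenceFiles, and the function names and convergence handling here are illustrative assumptions.

```python
import numpy as np

def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, the point's vector)."""
    distances = [np.linalg.norm(point - c) for c in centers]
    return int(np.argmin(distances)), point

def kmeans_reduce(center_id, vectors, old_centers, tol=1e-6):
    """Reduce step: average the vectors assigned to one center and check convergence."""
    new_center = np.mean(vectors, axis=0)        # sum each vector, divide by the count
    moved = np.linalg.norm(new_center - old_centers[center_id]) >= tol
    return new_center, moved                     # 'moved' plays the role of the update counter
```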
Bradley-Fayyad-Reina (BFR) Algorithm
• BFR is a variant of k-means designed to work with very large, disk-resident datasets
• Assumes clusters are normally distributed around
a centroid in a Euclidean space
– Standard deviation in different dimensions may vary
– Clusters are axis-aligned ellipses
• Efficient way to summarize clusters
– Memory required should be O(clusters) instead of O(data)
15
How does BFR work?
• Rather than keeping the data, BFR just maintains
summary statistics of data:
– Cluster summaries
– Outliers
– Points to be clustered
• This way, the memory needed scales with the number of clusters rather than the number of points
16
Details of the BFR Algorithm
1. Initialize k clusters
2. Load a bag of points from disk
3. Assign new points to one of the k original clusters if they are within some distance threshold of the cluster
4. Cluster the remaining points, and create new clusters
5. Try to merge new clusters from step 4 with any of the
existing clusters
6. Repeat 2-5 until all points have been examined
17
Details, details, details of BFR
• Points are read from disk one main-memory full
at a time
• Most points from previous memory loads are
summarized by simple statistics
• From the initial load, we select k centroids:
– Take k random points
– Take a small random sample and cluster optimally
– Take a sample; pick a random point, and then k-1
more points, each as far as possible from the
previously selected points
18
Three classes of points…
• 3 sets of points, which we can keep track of
• Discard Set (DS):
– Points close enough to a centroid to be summarized
• Compression Set (CS):
– Groups of points that are close together but not close to
any existing centroid
– These points are summarized, but not assigned to a
cluster
• Retained Set (RS):
– Isolated points waiting to be assigned to a compression
set
19
Using the Galaxies Picture to learn BFR
(Figure: points around a centroid, showing the discard set, a compressed set, and retained/"reject" points)
• Discard set (DS): Close enough to a centroid to be summarized
• Compression set (CS): Summarized, but not assigned to a cluster
• Retained set (RS): Isolated points
20
Summarizing the sets of points
• For each cluster, the discard set (DS) is
summarized by:
– The number of points, N
– The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension
– The vector SUMSQ: ith component = sum of squares of
coordinates in ith dimension
21
More details on summarizing the points
• 2d+1 values represent any cluster
– d is the number of dimensions
• Average in each dimension (the centroid) can
be calculated as SUMi / N
– SUMi = ith component of SUM
• Variance of a cluster’s discard set in dimension i
is: (SUMSQi / N) – (SUMi / N)^2
– And standard deviation is the square root of that
22
Note: Dropping the “axis-aligned” clusters assumption would require
storing full covariance matrix to summarize the cluster. So, instead of
SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
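A minimal sketch of the 2d+1 summary (N, SUM, SUMSQ) for one discard set, with the centroid and variance derived exactly as above; NumPy is assumed and the class name is illustrative.

```python
import numpy as np

class DSCluster:
    """Summary of one discard set: N, SUM and SUMSQ (2d+1 values in total)."""
    def __init__(self, d):
        self.n = 0
        self.sum = np.zeros(d)
        self.sumsq = np.zeros(d)

    def add(self, points):
        """Fold a batch of points (shape [m, d]) into the summary; the points can then be discarded."""
        points = np.asarray(points)
        self.n += len(points)
        self.sum += points.sum(axis=0)
        self.sumsq += (points ** 2).sum(axis=0)

    def centroid(self):
        return self.sum / self.n                               # SUMi / N

    def variance(self):
        return self.sumsq / self.n - (self.sum / self.n) ** 2  # SUMSQi/N - (SUMi/N)^2
```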
The “Memory Load” of Points and
Clustering
• Step 3: Find those points that are “sufficiently
close” to a cluster centroid and add those points
to that cluster and the DS
– These points are so close to the centroid that they can
be summarized and then discarded
• Step 4: Use any in-memory clustering algorithm to cluster the remaining points and the old RS
23
More on Memory Load of points
• DS set: Adjust statistics of the clusters to
account for the new points
– Add Ns, SUMs, SUMSQs
• Consider merging compressed sets in the CS
• If this is the last round, merge all compressed
sets in the CS and all RS points into their nearest
cluster
24
A few more details…
• We need a way to decide whether to put a new point into a cluster (and discard it)
• How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
• BFR suggests two ways:
– The Mahalanobis distance is less than a threshold
– High likelihood of the point belonging to the currently nearest centroid
(Slide credit: Jure Leskovec, Stanford C246: Mining Massive Datasets)
25
Second way to define a "close" point
• Normalized Euclidean distance from the centroid
• For point (x1, x2, …, xd) and centroid (c1, c2, …, cd):
– Normalize in each dimension: yi = (xi − ci)/σi, where σi is the standard deviation of the cluster in dimension i
– Take the sum of squares of the yi
– Take the square root
26
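A small sketch of this normalized Euclidean (axis-aligned Mahalanobis) distance, using the per-dimension standard deviation of a cluster; NumPy is assumed and the names are illustrative.

```python
import numpy as np

def mahalanobis(point, centroid, sigma):
    """Normalized Euclidean distance from the centroid (sigma = per-dimension std. dev.)."""
    y = (point - centroid) / sigma      # normalize in each dimension
    return np.sqrt(np.sum(y ** 2))      # sum of squares, then the square root
```

With the DSCluster summary sketched earlier, sigma would be np.sqrt(cluster.variance()); the next slide discusses how large this distance may be before a point is rejected.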
More on Mahalanobis Distance
• If clusters are normally distributed in d dimensions, then after this transformation one standard deviation = sqrt(d)
– i.e., 68% of points will cluster within a distance of sqrt(d)
• Accept a point for a cluster if its M.D. is < some threshold, e.g., 2 standard deviations
(Figure: Euclidean vs. Mahalanobis distance: contours of equidistant points from the origin for uniformly distributed points under Euclidean distance, normally distributed points under Euclidean distance, and normally distributed points under Mahalanobis distance)
(Slide credit: Jure Leskovec, Stanford C246: Mining Massive Datasets)
27
Should 2 CS clusters be combined?
• Compute the variance of the combined
subcluster
– N, SUM, and SUMSQ allow us to make that
calculation quickly
• Combine if the combined variance is below some
threshold
• Many alternatives: Treat dimensions differently,
consider density
28
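A sketch of the merge test described above, computed directly from the N/SUM/SUMSQ summaries (same DSCluster-style fields as before; the single per-dimension threshold is an illustrative choice, and the slide notes there are many alternatives).

```python
import numpy as np

def combined_variance(c1, c2):
    """Per-dimension variance of the union of two summarized mini-clusters."""
    n = c1.n + c2.n
    s = c1.sum + c2.sum
    sq = c1.sumsq + c2.sumsq
    return sq / n - (s / n) ** 2

def should_merge(c1, c2, threshold):
    # Merge only if the combined cluster stays tight in every dimension.
    return bool(np.all(combined_variance(c1, c2) < threshold))
```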
Limitations of BFR…
• Makes strong assumptions about the data:
– Normally distributed data
– Does not work with non-linearly separable data
– Works only with axis-aligned clusters
• Real-world datasets are rarely this way
29
Clustering with Representatives (CURE)
30
CURE: Efficient clustering approach
• Robust clustering approach
• Can handle outliers better
• Employs a hierarchical clustering approach:
– Middle ground between the centroid-based and all-points (MAX) extremes
• Can handle different types of data
31
CURE Algorithm
CURE(points, k)
32
CURE clustering procedure
1. It is similar to the hierarchical clustering approach, but it uses a set of scattered sample points as the cluster representative rather than every point in the cluster.
2. First, set a target sample number c. Then try to select c well-scattered sample points from the cluster.
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1.
4. These points are used as the representatives of the clusters and will be used as the points in the dmin cluster-merging approach.
5. After each merge, c sample points will be selected from the original representatives of the previous clusters to represent the new cluster.
6. Cluster merging stops once the target of k clusters is reached.
(Figure: clusters are repeatedly merged with their nearest cluster)
33
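A minimal sketch of steps 2-3 above: selecting c well-scattered representatives and shrinking them toward the centroid by a fraction alpha. NumPy is assumed, and the greedy farthest-point selection is one common way to realize "well scattered", not necessarily the exact heuristic used in the original CURE paper.

```python
import numpy as np

def cure_representatives(points, c=10, alpha=0.3):
    """Pick c scattered points from a cluster and shrink them toward the centroid."""
    centroid = points.mean(axis=0)
    # Greedy farthest-point selection: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from all representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        dist = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dist)])
    reps = np.array(reps)
    # Shrink by a fraction alpha (0 < alpha < 1) toward the centroid.
    return reps + alpha * (centroid - reps)
```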
Other tweaks in the CURE algorithm
• Use random sampling:
– the data cannot be stored all at once in memory
– it is similar to the core-set idea
• Partition and two-pass clustering:
– Reduces compute time
– First, we divide the n data points into p partitions, each containing n/p data points
– We then pre-cluster each partition until n/(pq) clusters are reached in each partition, for some q > 1
– Each cluster from the first pass is then used as input to the second-pass clustering to form the final clusters
34
Space and Time Complexity
• Worst case time complexity:
– O(n^2 log n)
• Space complexity:
– O(n)
– Using a k-d tree for insertion/updating
35
Example of how it works
36
How to validate clustering approaches?
37
Cluster validity
• For supervised learning:
– we had a class label,
– which meant we could identify how good our training
and testing errors were
– Metrics: Accuracy, Precision, Recall
• For clustering:
– How do we measure the “goodness” of the resulting
clusters?
38
Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some
39
Different aspects of validating clusters
• Determine the clustering tendency of a set of data, i.e.,
whether non-random structure actually exists in the data
(e.g., to avoid overfitting)
• External Validation: Compare the results of a cluster
analysis to externally known class labels (ground truth).
• Internal Validation: Evaluating how well the results of a
cluster analysis fit the data without reference to external
information.
• Compare clusterings to determine which is better.
• Determining the ‘correct’ number of clusters.
40
Measures of cluster validity
• External Index: Used to measure the extent to
which cluster labels match externally supplied
class labels.
– Entropy, Purity, Rand Index
• Internal Index: Used to measure the goodness
of a clustering structure without respect to
external information.
– Sum of Squared Error (SSE), Silhouette coefficient
• Relative Index: Used to compare two different clusterings or clusters.
– Often an external or internal index is used for this function, e.g., SSE or entropy
41
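As an example of an external index, here is a small sketch of purity; the class labels are assumed to be integer-coded, and the function name is illustrative.

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Fraction of points whose cluster's majority class matches their own class."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        correct += np.bincount(members).max()   # credit the most frequent true class
    return correct / len(class_labels)
```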
Measuring Cluster Validity with Correlation
• Proximity matrix vs. incidence matrix:
– The incidence matrix K has Kij = 1 if points i and j belong to the same cluster, and 0 otherwise
• Compute the correlation between the two
matrices:
– Only n(n-1)/2 values to be computed
– High values indicate similarity between points in the
same cluster
• Not suited for density based clustering
42
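A sketch of the correlation check described above, using only the n(n-1)/2 upper-triangular entries; note that with a distance-based proximity matrix the correlation comes out negative, and it is its magnitude that indicates how tightly same-cluster points sit together (NumPy assumed).

```python
import numpy as np

def cluster_correlation(X, labels):
    labels = np.asarray(labels)
    n = len(X)
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # proximity (distance) matrix
    incid = (labels[:, None] == labels[None, :]).astype(float)      # 1 if same cluster, else 0
    iu = np.triu_indices(n, k=1)        # only n(n-1)/2 distinct pairs are needed
    return np.corrcoef(prox[iu], incid[iu])[0, 1]
```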
Another approach: use similarity matrix for
cluster validation
43
Internal Measures: SSE
• SSE is also a good measure to understand how
good the clustering is
– Lower SSE → better clustering
• Can be used to estimate number of clusters
44
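A minimal sketch of SSE for a given clustering (NumPy assumed; names are illustrative).

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of every point to its assigned cluster centroid."""
    total = 0.0
    for c, centroid in enumerate(centroids):
        members = X[labels == c]
        total += np.sum((members - centroid) ** 2)
    return total
```

Plotting SSE against the number of clusters and looking for an elbow is one common way to estimate k.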
More on Clustering a little later…
• We will discuss other forms of clustering in the
following classes
• Next class:
– please bring your brief write up on the two papers
– We will discuss frequent itemset mining and a few
other aspects of clustering
– Move on to Dimensionality Reduction
45