Data Stream Management Systems - Supporting Stream Mining Applications
Carlo Zaniolo
CS240B
1
Motivation for Data Stream Mining
 Most interesting applications come from dynamic environments where
data are collected over time -- e.g., customer transactions, call
records, customer click data.
 In these applications, batch learning is no longer sufficient:
algorithms should be able to incorporate new data incrementally.
• Some algorithms that are incremental by nature, e.g., kNN classifiers
and Naïve Bayes classifiers, can be easily extended to data streams.
• But most algorithms need changes to support incremental induction.
 Algorithms should be able to deal with non-stationary data, by
– Adapting in the presence of concept drift
– Forgetting outdated data and using the most recent state of the
knowledge in the presence of significant changes (concept shift).
2
Motivation
 Experiments at CERN are generating an entire petabyte (1 PB = 10^6 GB)
of data every second as particles fired around the Large Hadron
Collider (LHC) at velocities approaching the speed of light are smashed
together.
 “We don’t store all the data as that would be impractical. Instead,
from the collisions we run, we only keep the few pieces that are of
interest, the rare events that occur, which our filters spot and send on
over the network,” he said.
 This still means CERN is storing 25 PB of data every year -- the same as
1,000 years' worth of DVD-quality video -- which can then be analyzed
and interrogated by scientists looking for clues to the structure and
make-up of the universe.
3
Cluster Analysis:
objective shared by all algorithms
 Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups:
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
(Several of the following slides are adapted from Tan, Steinbach, Kumar: Introduction to Data Mining.)
Cluster Analysis:
Many Different Approaches and Algorithms
 Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
 Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
 Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters
– Can represent multiple classes or ‘border’ points
 Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
 Partial versus complete
– In some cases, we only want to cluster some of the data
Main Static Clustering Algorithms
 K-means and its variants
 Hierarchical clustering
 Density-based clustering
K-means Clustering
 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple (a sketch follows below)
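The basic loop can be written down compactly. Below is a minimal sketch in Python/NumPy (an illustrative implementation, not taken from the slides): pick K initial centroids at random, assign each point to its closest centroid, recompute the centroids, and repeat until they stop moving.

```python
# A minimal sketch of the basic K-means loop, assuming a 2-D NumPy array
# X of points and a user-supplied K; random initialization, Euclidean
# distance. Illustrative only, not the authors' implementation.
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-4, seed=None):
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # 2. Assign every point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # 4. Repeat until the centroids (almost) stop changing
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```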
K-means Clustering – Details
 Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
 K-means will converge for the common similarity measures
mentioned above.
 Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
 Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Two different K-means Clusterings
[Figure: the same set of original points clustered two ways by K-means
(x-y scatter plots): an optimal clustering and a sub-optimal clustering,
depending on the initial centroids.]
Importance of Choosing Initial Centroids
[Figure: K-means result for one choice of initial centroids.]
Different Centroids (Seeds)
[Figure: K-means result for a different choice of initial centroids (seeds).]
Limitations of k-means:
Problems with the algorithm:
1. Result depends on the initial centroids -- no assurance of
optimality. Many runs are used in practice.
2. Much work on good seeding algorithms: k-means++
3. But the user must supply K, or try many values of K to find the
best.
4. Or use a series of Bisecting K-means runs.
12
Limitations of k-Means
Problems with the model:
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
K-means has problems when the data contains outliers.
13
Static Clustering Algorithms
 K-means and its variants: in spite of all
these problems, K-means remains the most
commonly used clustering algorithm!
 Next: Hierarchical clustering
 Density-based clustering
Limitations of k-Means: different sizes
15
Limitations of k-Means: different densities
16
Limitations of k-means: non-globular shapes
17
Hierarchical Clustering
Two main types of hierarchical clustering:
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters,
until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster, until each cluster
contains a single point (or there are k clusters)
 Traditional hierarchical algorithms use a similarity or
distance matrix and merge or split one cluster at a time
-- expensive.
18
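For reference, the agglomerative merge-closest-pair loop described above can be run with a few lines of SciPy. This is an illustrative sketch only; the sample data, the single-linkage choice, and k = 3 are arbitrary.

```python
# A minimal agglomerative (bottom-up) hierarchical clustering run using
# SciPy; start from singleton clusters and repeatedly merge the closest pair.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # 20 random 2-D points (illustrative data)
Z = linkage(X, method="single")                  # build the full merge hierarchy (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into k = 3 clusters
```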
Hierarchical Clustering Algorithms
 Hierarchical clustering algorithms can be used to generate
hierarchically structured clusters such as the one below, or to simply
partition the data into clusters.
[Figure: a traditional hierarchical clustering of points p1, p2, p3, p4
and the corresponding dendrogram.]
19
Hierarchical Clustering Algorithms
 Hierarchical clustering algorithms can also be used to partition the
data into clusters.
 The CLUBS/CLUBS+ algorithm recently developed at UCLA:
1. Uses a divisive phase followed by an agglomerative phase to
build elliptical clusters around centroids, and it is
2. Totally unsupervised (no seeding, no K),
3. Insensitive to noise and outliers: it produces results of
superior quality,
4. Extremely fast -- so much so that it can be used for fast
seeding of K-means.
20
Clustering Algorithms
 K-means and its variants
 Hierarchical clustering
 Density-based clustering: next
DBSCAN
 DBSCAN is a density-based algorithm.
 Density = number of points within a specified radius (Eps)
 A point is a core point if it has more than a specified number of
points (MinPts) within Eps
– Core points are at the interior of a cluster
 A border point has fewer than MinPts within Eps,
but is in the neighborhood of a core point
 A noise point is any point that is neither a core
point nor a border point.
22
DBSCAN: Core, Border, and Noise Points
[Figure: example points labeled as core, border, and noise.]
23
DBSCAN: The Algorithm
Given Eps and MinPts:
Let ClusterCount = 0. For every point p:
1. If p is not a core point, assign a null label
to it [e.g., zero]
2. If p is a core point, a new cluster is formed
[with label ClusterCount := ClusterCount + 1].
Then find all points density-reachable from p
and classify them in the cluster.
[Reassign the zero labels but not the others]
Repeat this process until all of the points have
been visited.
Since all the zero labels of border points have been reassigned
in step 2, the remaining points with zero label are noise.
24
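A compact Python sketch of this labeling scheme follows. It uses a plain O(n^2) neighborhood search instead of a spatial index, and counts a point's Eps-neighborhood including the point itself; it is illustrative only, not the original implementation.

```python
# DBSCAN sketch: 0 = unassigned/null label, positive integers = cluster ids.
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.zeros(n, dtype=int)                   # 0 = null label
    neighbors = [np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
                 for i in range(n)]                   # Eps-neighborhood of each point
    cluster_count = 0
    for p in range(n):
        if labels[p] != 0 or len(neighbors[p]) < min_pts:
            continue                                  # skip non-core or already-labeled points
        cluster_count += 1                            # p is a core point: form a new cluster
        labels[p] = cluster_count
        frontier = list(neighbors[p])
        while frontier:                               # collect all points density-reachable from p
            q = frontier.pop()
            if labels[q] == 0:                        # reassign zero labels, but not the others
                labels[q] = cluster_count
                if len(neighbors[q]) >= min_pts:      # q is itself a core point: keep expanding
                    frontier.extend(neighbors[q])
    return labels                                     # points still labeled 0 are noise
```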
When DBSCAN Works Well
[Figure: original points and the clusters found by DBSCAN.]
• Resistant to noise
• Can handle clusters of different shapes and sizes
25
When DBSCAN Does NOT Work Well
[Figure: original points and two poor DBSCAN results,
(MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92).]
• Varying densities
• High-dimensional data
26
Many stream clustering approaches: a taxonomy
27
Partitioning methods
• Goal: Construct a partition of a set of objects
into k clusters
– e.g., k-Means, k-Medoids
• Two types of methods:
– Adaptive methods:
• Leader (Spath 1980)
• Simple single-pass k-Means (Farnstrom et al., 2000)
• STREAM k-Means (O’Callaghan et al., 2002)
– Online summarization - offline clustering methods:
• CluStream (Aggarwal et al., 2003)
28
Leader [Spath 1980]
• The simplest single-pass partitioning algorithm
• Whenever a new instance p arrives from the stream:
– Find its closest cluster (leader), c_clos
– Assign p to c_clos if their distance is below the threshold d_thresh
– Otherwise, create a new cluster (leader) with p
+ One-pass and fast algorithm
+ No prior information on the number of clusters
– Unstable algorithm:
– It depends on the order of the examples
– It depends on a correct guess of d_thresh
29
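A minimal sketch of the Leader loop just described: one pass, no K, each cluster represented by its leader point. The distance function and the name d_thresh are illustrative choices, not from the original paper.

```python
# Leader clustering sketch: assign to the closest leader if within d_thresh,
# otherwise the new point becomes the leader of a new cluster.
import numpy as np

def leader_clustering(stream, d_thresh):
    leaders = []                  # one representative (leader) point per cluster
    assignments = []              # cluster index chosen for each arriving point
    for p in stream:              # single pass over the stream
        p = np.asarray(p, dtype=float)
        if leaders:
            dists = [np.linalg.norm(p - l) for l in leaders]
            closest = int(np.argmin(dists))
            if dists[closest] < d_thresh:        # close enough: join the existing cluster
                assignments.append(closest)
                continue
        leaders.append(p)                        # otherwise p becomes a new leader
        assignments.append(len(leaders) - 1)
    return leaders, assignments
```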
STREAM k-Means (O’Callaghan et al., 2002)
• An extension of k-Means for streams
– The iterative process of static k-Means cannot be applied to streams
– Use a buffer that fits in memory and apply k-Means locally in the buffer
• The stream is processed in chunks X1, X2, ..., each fitting in
memory
– For each chunk Xi:
o Apply k-Means locally on Xi (retain only the k centers)
o X' <- the i*k weighted centers obtained from chunks X1 ... Xi
o Each center is treated as a point, weighted with the number of points it compresses
o Apply k-Means on X' and output the k centers
...
30
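The chunked scheme above can be sketched with scikit-learn's KMeans as the local clusterer (an illustrative library choice, not the authors' implementation). Each chunk is clustered in memory, only its k weighted centers are retained, and the running set of weighted centers is reclustered to produce the current k output centers.

```python
# STREAM k-Means sketch: cluster each chunk, keep weighted centers,
# recluster the centers to get the current k output centers.
import numpy as np
from sklearn.cluster import KMeans

def stream_kmeans(chunks, k):
    centers, weights = [], []                        # running set of weighted centers
    for X in chunks:                                 # each chunk fits in memory
        km = KMeans(n_clusters=k, n_init=10).fit(X)  # k-Means locally on the chunk
        counts = np.bincount(km.labels_, minlength=k)
        centers.append(km.cluster_centers_)          # retain only the k centers...
        weights.append(counts)                       # ...weighted by the points they compress
        X_prime = np.vstack(centers)                 # the i*k weighted centers so far
        w_prime = np.concatenate(weights).astype(float)
        # cluster the weighted centers to obtain the current k output centers
        final = KMeans(n_clusters=k, n_init=10).fit(X_prime, sample_weight=w_prime)
        yield final.cluster_centers_                 # centers after chunk Xi
```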
CluStream [Aggarwal et al. 2003]
• The stream clustering process is separated into:
– an online micro-cluster component that summarizes the stream locally
as new data arrive over time
o Micro-clusters are stored on disk at snapshots in time that follow a
pyramidal time frame
– an offline macro-cluster component that clusters these summaries into
global clusters
o Clustering is performed upon summaries instead of the raw data
31
CluStream: Micro-cluster Summary Structure
[Figure: the micro-cluster summary, an extended cluster feature vector.]
32
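The figure is not reproduced here. Based on the CluStream paper cited above, the micro-cluster summary is an extended cluster-feature vector holding sums and squared sums of the data values and of the timestamps, plus a point count. The sketch below (field and method names are illustrative) also shows the incrementality, additivity, and subtractivity properties that the following slides rely on.

```python
# Sketch of a CluStream-style micro-cluster summary (extended CF vector).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MicroCluster:
    cf1_x: np.ndarray            # per-dimension sum of the points
    cf2_x: np.ndarray            # per-dimension sum of squares of the points
    cf1_t: float = 0.0           # sum of timestamps
    cf2_t: float = 0.0           # sum of squared timestamps
    n: int = 0                   # number of points absorbed
    ids: list = field(default_factory=list)   # component micro-cluster ids (kept across merges)

    def add(self, x, t):                       # incrementality: absorb one point
        self.cf1_x += x; self.cf2_x += x * x
        self.cf1_t += t; self.cf2_t += t * t
        self.n += 1

    def centroid(self):
        return self.cf1_x / self.n

    def merge(self, other):                    # additivity property
        self.cf1_x += other.cf1_x; self.cf2_x += other.cf2_x
        self.cf1_t += other.cf1_t; self.cf2_t += other.cf2_t
        self.n += other.n; self.ids += other.ids

    def subtract(self, other):                 # subtractivity property (snapshot differencing)
        self.cf1_x -= other.cf1_x; self.cf2_x -= other.cf2_x
        self.cf1_t -= other.cf1_t; self.cf2_t -= other.cf2_t
        self.n -= other.n
```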
CluStream Algorithm
• A fixed number q of micro-clusters is maintained over time
• Initialize: apply k-Means (with k = q) over initPoints; build a summary for each cluster
• Online micro-cluster maintenance as a new point p arrives from the stream:
– Find the closest micro-cluster clu for the new point p
o If p is within the max-boundary of clu, p is absorbed by clu
o Otherwise, a new cluster is created with p
– The number of micro-clusters should not exceed q
o Delete the most obsolete micro-cluster or merge the two closest ones
• Periodic storage of micro-cluster snapshots to disk
– At different levels of granularity depending upon their recency
• Offline macro-clustering
– Input: a user-defined time horizon h and the number of macro-clusters k to be
detected
– Locate the valid micro-clusters during h
– Apply k-Means upon these micro-clusters -> k macro-clusters
33
CluStream: Initialization Step
• Initialization
– Done using an offline process at the beginning
– Wait for the first InitNumber points to arrive
– Apply a standard k-Means algorithm to create q clusters
o For each discovered cluster, assign it a unique ID and create its
micro-cluster summary
• Comments on the choice of q
– much larger than the natural number of clusters
– much smaller than the total number of points arriving
34
CluStream: On-line Step
• A fixed number q of micro-clusters is maintained over time
• Whenever a new point p arrives from the stream:
– Compute the distance between p and each of the q maintained micro-cluster
centroids
– clu <- the closest micro-cluster to p
– Find the max boundary of clu
o It is defined as a factor t of the radius of clu
– If p falls within the maximum boundary of clu:
o p is absorbed by clu
o Update the statistics of clu (incrementality property)
– Else, create a new micro-cluster with p, assign it a new cluster ID, and initialize
its statistics
o To keep the total number of micro-clusters fixed (i.e., q), either:
• Delete the most obsolete micro-cluster,
if it is safe to do so (based on how far in the past the micro-cluster received new points), or
• Merge the two closest ones (additivity property)
• When two micro-clusters are merged, a list of ids is created. This way, we can
identify the component micro-clusters that comprise a merged micro-cluster.
35
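A simplified sketch of this absorb-or-create decision, reusing the MicroCluster sketch from the summary-structure slide. The RMS-radius formula, the recency_threshold test for "safe" deletion, and the id handling are illustrative simplifications of the paper's exact rules (e.g., singleton micro-clusters are not handled specially here).

```python
# CluStream on-line maintenance sketch: absorb p or create a new micro-cluster.
from itertools import combinations, count
import numpy as np

_next_id = count(1)   # fresh ids for new micro-clusters (id management simplified)

def online_step(p, timestamp, micro_clusters, q, t=2.0, recency_threshold=1000.0):
    # micro_clusters is assumed to hold the summaries built at initialization
    p = np.asarray(p, dtype=float)
    dists = [np.linalg.norm(p - mc.centroid()) for mc in micro_clusters]
    clu = micro_clusters[int(np.argmin(dists))]
    # max boundary of clu = factor t of its RMS radius (simplified formula)
    radius = np.sqrt(max(float(np.sum(clu.cf2_x / clu.n - clu.centroid() ** 2)), 0.0))
    if min(dists) <= t * radius:
        clu.add(p, timestamp)                         # p is absorbed (incrementality property)
        return
    if len(micro_clusters) >= q:                      # make room for a new micro-cluster
        stale = min(micro_clusters, key=lambda mc: mc.cf1_t / mc.n)   # oldest mean timestamp
        if timestamp - stale.cf1_t / stale.n > recency_threshold:     # "safe": long inactive
            micro_clusters.remove(stale)              # delete the most obsolete micro-cluster
        else:                                         # otherwise merge the two closest ones
            a, b = min(combinations(micro_clusters, 2),
                       key=lambda pair: np.linalg.norm(pair[0].centroid() - pair[1].centroid()))
            a.merge(b)                                # additivity property; ids are concatenated
            micro_clusters.remove(b)
    micro_clusters.append(MicroCluster(cf1_x=p.copy(), cf2_x=p * p,
                                       cf1_t=float(timestamp), cf2_t=float(timestamp) ** 2,
                                       n=1, ids=[next(_next_id)]))
```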
CluStream: Periodic Micro-cluster Storage
• Micro-cluster snapshots are stored at particular times
• If the current time is tc and the user wishes to find clusters based on a
history of length h:
– Then we use the subtractive property of micro-clusters at
snapshots tc and tc-h
– In order to find the macro-clusters in a history (time horizon)
of length h
• How many snapshots should be stored?
– It is too expensive to store snapshots at every time stamp
– They are stored in a pyramidal time frame
• It is an effective trade-off between the storage requirements and
the ability to recall summary statistics from different time
horizons.
36
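The idea of the pyramidal time frame is to keep snapshots at geometrically spaced granularities, with only a bounded number per granularity, so recent history is dense and older history is sparse. The sketch below is a hedged illustration of that bookkeeping; the parameters alpha and max_per_order and the eviction policy are assumptions, and the exact scheme in the CluStream paper differs in detail.

```python
# Hedged sketch of a pyramidal snapshot index: a snapshot taken at time t
# is filed under the largest "order" i such that alpha^i divides t, and
# only a bounded number of snapshot times is remembered per order.
from collections import defaultdict, deque

class PyramidalStore:
    def __init__(self, alpha=2, max_per_order=3):
        self.alpha = alpha
        self.max_per_order = max_per_order
        self.snapshots = defaultdict(lambda: deque(maxlen=max_per_order))  # order -> times

    def maybe_store(self, t, snapshot_writer):
        if t == 0:
            return
        order = 0                             # highest i such that alpha^i divides t
        while t % (self.alpha ** (order + 1)) == 0:
            order += 1
        self.snapshots[order].append(t)       # older times of this order fall out of the index
        snapshot_writer(t)                    # persist the micro-cluster snapshot at time t

    def closest_before(self, t):
        """Stored snapshot time closest to (and not after) t, for horizon queries."""
        stored = [s for times in self.snapshots.values() for s in times if s <= t]
        return max(stored, default=None)
```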
CluStream: Off-line Step
• The off-line step is applied on demand upon the q maintained micro-clusters
instead of the raw data
• User input: time horizon h, number of macro-clusters k to be detected
• Find the active micro-clusters during h:
– We exploit the subtractivity property to find the active micro-clusters during
h:
o Suppose the current time is tc. Let S(tc) be the set of micro-clusters at tc.
o Find the stored snapshot that occurs just before time tc-h; we can always find
such a snapshot. Call its time tc-h' and let S(tc-h') be its set of micro-clusters.
o For each micro-cluster in the current set S(tc), find its list of ids. For each id in
the list, find the corresponding micro-cluster in S(tc-h').
o Subtract the CF vectors of the corresponding micro-clusters in S(tc-h')
o This ensures that micro-clusters created before the user-specified horizon do
not dominate the result of the clustering process
• Apply k-Means over the active micro-clusters in h to derive the k
macro-clusters
– Initialization: seeds are not picked randomly, but sampled with
probability proportional to the number of points in a given micro-cluster
– Distance is the centroid distance
– The new seed for a given partition is the weighted centroid of the micro-clusters
in that partition
37
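A minimal sketch of this off-line step, reusing the MicroCluster sketch and scikit-learn's KMeans (an illustrative library choice). `current` and `past` stand for the micro-cluster sets S(tc) and S(tc-h'); matching by id lists follows the slide. Note the seeding here uses scikit-learn's default rather than the proportional sampling described above.

```python
# CluStream off-line step sketch: subtract pre-horizon snapshots, then
# run weighted k-Means over the active micro-cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

def offline_step(current, past, k):
    past_by_id = {i: mc for mc in past for i in mc.ids}
    active = []
    for mc in current:
        horizon_mc = MicroCluster(cf1_x=mc.cf1_x.copy(), cf2_x=mc.cf2_x.copy(),
                                  cf1_t=mc.cf1_t, cf2_t=mc.cf2_t, n=mc.n, ids=list(mc.ids))
        # subtract the pre-horizon contribution of the matching past micro-clusters
        matched = {id(past_by_id[i]): past_by_id[i] for i in mc.ids if i in past_by_id}
        for pmc in matched.values():
            horizon_mc.subtract(pmc)
        if horizon_mc.n > 0:                          # keep only micro-clusters active within h
            active.append(horizon_mc)
    centroids = np.array([mc.centroid() for mc in active])
    weights = np.array([mc.n for mc in active], dtype=float)
    # each micro-cluster is a pseudo-point weighted by the points it summarizes
    km = KMeans(n_clusters=k, n_init=10).fit(centroids, sample_weight=weights)
    return km.cluster_centers_
```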
CluStream: Summary
+ CluStream clusters large evolving data streams
+ It views the stream as a changing process over
time, rather than clustering the whole stream
at one time
+ It can characterize clusters over different time
horizons in a changing environment
+ It provides flexibility to an analyst in a real-time
and changing environment
– A fixed number of micro-clusters is maintained
over time
– Sensitive to outliers / noise
38
Density-Based Data Stream Clustering
• We will cover DenStream: Feng Cao, Martin Ester, Weining Qian,
Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream
with Noise”. SDM ’06
• DenStream operates on microclusters using an extension of
DBSCAN: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu:
“A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”. KDD ‘96
39
DBSCAN
 DBSCAN is a density-based algorithm.
 Density = number of points within a specified radius (Eps)
 A point is a core point if it has more than a
specified number of points (MinPts) within Eps
– These are points that are at the interior of a cluster
 A border point has fewer than MinPts within Eps, but
is in the neighborhood of a core point
 A noise point is any point that is not a core point or
a border point.
40
DBSCAN: Core, Border, and Noise Points
[Figure: example points labeled as core, border, and noise.]
41
Density-Reachable and Density-Connected
(w.r.t. Eps, MinPts)
 Let p be a core point; then every point in its Eps-neighborhood is said
to be directly density-reachable from p.
 A point p is density-reachable from a core point q if there is a
chain of points p1, ..., pn, with p1 = q and pn = p, such that each pi+1 is
directly density-reachable from pi.
 A point p is density-connected to a point q if there is a point o such
that both p and q are density-reachable from o.
[Figures: directly density-reachable points p and q; a chain p1, ..., p
density-reachable from q; and points p and q density-connected through o.]
42
DBSCAN: The Algorithm
Given Eps and MinPts:
Let ClusterCount = 0. For every point p:
1. If p is not a core point, assign a null label
to it [e.g., zero]
2. If p is a core point, a new cluster is formed
[with label ClusterCount := ClusterCount + 1].
Then find all points density-reachable from p
and classify them in the cluster.
[Reassign the zero labels but not the others]
Repeat this process until all of the points have
been visited.
Since all the zero labels of border points have been reassigned
in step 2, the remaining points with zero label are noise.
43
DBSCAN
Application examples:
population density, spread of diseases, trajectory tracing
44
DenStream
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: “Density-Based Clustering over an Evolving Data Stream with Noise”. SDM ’06
• Based on DBSCAN
• Core-micro-cluster: CMC(w, c, r) with
weight w > μ, center c, radius r < ε
• Potential (p-) and outlier (o-) micro-clusters
• Online: merge each point into the nearest p- (or o-)
micro-cluster if its new radius r' < ε
– Promote an o-micro-cluster to a p-micro-cluster if w > βμ
– Else create a new o-micro-cluster
• Offline: modified DBSCAN on the micro-clusters (on user
demand)
45
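A hedged sketch of DenStream's online step as summarized above: try to absorb the point into the nearest potential (p-) micro-cluster, then the nearest outlier (o-) micro-cluster, checking that the resulting radius stays below ε; otherwise start a new o-micro-cluster. The exponential decay (fading) of weights that DenStream also applies is omitted here for brevity, and the class and parameter names are illustrative.

```python
# DenStream online-step sketch (no weight decay; names illustrative).
import numpy as np

class DenMicroCluster:
    def __init__(self, x):
        self.w = 1.0                          # weight (point count; no fading in this sketch)
        self.ls = np.array(x, dtype=float)    # linear sum of points
        self.ss = self.ls ** 2                # squared sum of points

    def radius_if_added(self, x):
        w, ls, ss = self.w + 1, self.ls + x, self.ss + x * x
        var = np.sum(ss / w - (ls / w) ** 2)
        return np.sqrt(max(var, 0.0))

    def add(self, x):
        self.w += 1; self.ls += x; self.ss += x * x

def denstream_online(x, p_clusters, o_clusters, eps, beta, mu):
    x = np.asarray(x, dtype=float)
    for pool, is_outlier_pool in ((p_clusters, False), (o_clusters, True)):
        if not pool:
            continue
        nearest = min(pool, key=lambda mc: np.linalg.norm(x - mc.ls / mc.w))
        if nearest.radius_if_added(x) < eps:            # merge keeps the new radius below eps
            nearest.add(x)
            if is_outlier_pool and nearest.w > beta * mu:   # o-micro-cluster grew dense enough
                o_clusters.remove(nearest)
                p_clusters.append(nearest)              # promote it to a p-micro-cluster
            return
    o_clusters.append(DenMicroCluster(x))               # otherwise create a new o-micro-cluster
```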
Conclusion
Much work still needed on data stream
clustering
46