Data Stream Management Systems: Supporting Stream Mining Applications
Carlo Zaniolo, CS240B

Motivation for Data Stream Mining
• Most interesting applications come from dynamic environments where data are collected over time, e.g., customer transactions, call records, customer click data.
• In these applications batch learning is no longer sufficient: algorithms should be able to incorporate new data.
  – Some algorithms are incremental by nature (e.g., kNN classifiers, Naïve Bayes classifiers) and can be easily extended for data streams, but most algorithms need changes to support incremental induction.
• Algorithms should be able to deal with non-stationary data, by
  – Adapting in the presence of concept drift
  – Forgetting outdated data and using the most recent state of the knowledge in the presence of significant changes (concept shift)

Motivation
• Experiments at CERN are generating an entire petabyte (1 PB = 10^6 GB) of data every second as particles fired around the Large Hadron Collider (LHC) at velocities approaching the speed of light are smashed together.
• "We don't store all the data as that would be impractical. Instead, from the collisions we run, we only keep the few pieces that are of interest, the rare events that occur, which our filters spot and send on over the network."
• This still means CERN is storing 25 PB of data every year – the same as 1,000 years' worth of DVD-quality video – which can then be analyzed and interrogated by scientists looking for clues to the structure and make-up of the universe.

Cluster Analysis: objective shared by all algorithms
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups:
  – Intra-cluster distances are minimized
  – Inter-cluster distances are maximized
(Several of the following slides are adapted from Tan, Steinbach, Kumar: Introduction to Data Mining.)

Cluster Analysis: Many Different Approaches and Algorithms
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters
  – Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
• Partial versus complete
  – In some cases, we only want to cluster some of the data

Main Static Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering

K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple

K-means Clustering – Details
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
  – Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to 'until relatively few points change clusters'.
• Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
(A minimal sketch of the basic algorithm follows below.)
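Since the slides only say that the basic algorithm is very simple, a minimal sketch may be useful. The following Python/NumPy version is an illustration of Lloyd-style K-means as described above, not code from the course; the random initialization, the centroid-movement stopping rule, and the handling of empty clusters are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=None):
    """Plain Lloyd-style K-means: assign points to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Assignment step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points in each cluster
        # (keep the old centroid if a cluster went empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids barely move; checking how many points changed
        # clusters (as on the slide) is an equivalent practical stopping rule.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

The per-iteration cost is dominated by the n·K·d distance computation, which over I iterations gives the O(n * K * I * d) complexity quoted on the slide above.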
Two different K-means Clusterings
[Figure: the same set of original points clustered two ways, an optimal clustering and a sub-optimal clustering.]

Importance of Choosing Initial Centroids
[Figure: successive K-means iterations from one choice of initial centroids.]

Different Centroids (Seeds)
[Figure: successive K-means iterations from a different choice of initial centroids.]

Limitations of k-means: problems with the algorithm
1. The result depends on the initial centroids; there is no assurance of optimality. Many runs are used in practice.
2. Much work on good seeding algorithms: K-means++.
3. But the user must supply K, or try many values of K to find the best.
4. Or use a series of Bisecting K-means.

Limitations of k-means: problems with the model
• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.

Static Clustering Algorithms
• K-means and its variants: in spite of all these problems, K-means remains the most commonly used clustering algorithm!
• Next: Hierarchical clustering, Density-based clustering

Limitations of k-means: different sizes
[Figure]
Limitations of k-means: different densities
[Figure]
Limitations of k-means: non-globular shapes
[Figure]

Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time, which is expensive

Hierarchical Clustering Algorithms
• Can be used to generate hierarchically structured clusters, or to simply partition the data into clusters.
[Figure: traditional hierarchical clustering of points p1, p2, p3, p4 and the corresponding traditional dendrogram.]

Hierarchical Clustering Algorithms
• Can also be used to partition the data into clusters.
• The CLUBS/CLUBS+ algorithm, recently developed at UCLA:
  1. Uses a divisive phase followed by an agglomerative phase to build elliptical clusters around centroids, and it is
  2. Totally unsupervised (no seeding, no K),
  3. Insensitive to noise and outliers; it produces results of superior quality,
  4. Extremely fast, so much so that it can be used for fast seeding of K-means.

Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering: next

DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points (MinPts) within Eps
  – These are points that are at the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
• A noise point is any point that is neither a core point nor a border point.

DBSCAN: core, border and noise points
[Figure: core, border, and noise points for a sample dataset.]

DBSCAN: The Algorithm (given Eps and MinPts)
Let ClusterCount = 0. For every point p:
1. If p is not a core point, assign a null label to it [e.g., zero].
2. If p is a core point, a new cluster is formed [with label ClusterCount := ClusterCount + 1].
   Then find all points density-reachable from p and classify them in the cluster. [Reassign the zero labels but not the others.]
Repeat this process until all of the points have been visited.
Since all the zero labels of border points have been reassigned in step 2, the remaining points with zero label are noise.
(A sketch of this labeling scheme follows below.)
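A minimal Python sketch of the labeling scheme just described, assuming a brute-force Eps-neighborhood search; helper names such as region_query are illustrative and not part of the course material.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label-based DBSCAN as on the slide: 0 = null label (noise),
    positive integers = cluster ids."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.zeros(n, dtype=int)            # null label = 0
    visited = np.zeros(n, dtype=bool)
    cluster_count = 0

    def region_query(i):
        # Brute-force Eps-neighborhood (a spatial index would be used in practice).
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            continue                            # not a core point: keep the zero label
        cluster_count += 1                      # p is a core point: start a new cluster
        labels[p] = cluster_count
        seeds = list(neighbors)
        while seeds:                            # expand to everything density-reachable from p
            q = seeds.pop()
            if labels[q] == 0:                  # reassign zero labels but not the others
                labels[q] = cluster_count
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    seeds.extend(q_neighbors)   # q is also a core point: keep expanding
    return labels                               # points still labeled 0 are noise
```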
When DBSCAN Works Well
[Figure: original points and the clusters DBSCAN finds.]
• Resistant to noise
• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well
[Figure: original points and the clusters found with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92).]
• Varying densities
• High-dimensional data

Many stream clustering approaches: a taxonomy
[Figure: taxonomy of stream clustering approaches.]

Partitioning methods
• Goal: construct a partition of a set of objects into k clusters
  – e.g., k-Means, k-Medoids
• Two types of methods:
  – Adaptive methods:
    • Leader (Spath 1980)
    • Simple single-pass k-Means (Farnstrom et al., 2000)
    • STREAM k-Means (O'Callaghan et al., 2002)
  – Online summarization / offline clustering methods:
    • CluStream (Aggarwal et al., 2003)

Leader [Spath 1980]
• The simplest single-pass partitioning algorithm
• Whenever a new instance p arrives from the stream
  – Find its closest cluster (leader), c_clos
  – Assign p to c_clos if their distance is below the threshold d_thresh
  – Otherwise, create a new cluster (leader) with p
+ One-pass and fast algorithm
+ No prior information on the number of clusters
– Unstable algorithm
  – It depends on the order of the examples
  – It depends on a correct guess of d_thresh

STREAM k-Means (O'Callaghan et al., 2002)
• An extension of k-Means for streams
  – The iterative process of static k-Means cannot be applied to streams
  – Use a buffer that fits in memory and apply k-Means locally in the buffer
• The stream is processed in chunks X1, X2, …, each fitting in memory
  – For each chunk Xi
    o Apply k-Means locally on Xi (retain only the k centers)
    o X' = the i*k weighted centers obtained from chunks X1 … Xi
    o Each center is treated as a point, weighted with the number of points it compresses
    o Apply k-Means on X' and output the k centers (a sketch follows below)
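A rough sketch of the two-level STREAM k-Means idea under stated assumptions: chunks arrive as NumPy arrays, the per-chunk clustering reuses a small weighted k-means rather than the LOCALSEARCH routine of the original paper, and all function and parameter names are illustrative.

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    """Helper: Lloyd-style k-means where point X[i] carries weight w[i]."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    # Seed with k distinct points sampled proportionally to their weights.
    centers = X[rng.choice(len(X), size=k, replace=False, p=w / w.sum())]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers, labels

def stream_kmeans(chunks, k):
    """STREAM k-Means sketch: cluster each in-memory chunk, retain only its k
    centers weighted by the number of points they compress, then cluster the
    retained weighted centers to produce the final k centers."""
    retained_centers, retained_weights = [], []
    for chunk in chunks:
        centers, labels = weighted_kmeans(chunk, np.ones(len(chunk)), k)
        counts = np.bincount(labels, minlength=k).astype(float)
        keep = counts > 0                        # drop empty clusters, if any
        retained_centers.append(centers[keep])
        retained_weights.append(counts[keep])
    X_prime = np.vstack(retained_centers)        # the i*k weighted centers so far
    w_prime = np.concatenate(retained_weights)
    final_centers, _ = weighted_kmeans(X_prime, w_prime, k)
    return final_centers
```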
CluStream [Aggarwal et al. 2003]
• The stream clustering process is separated into:
  – an online micro-cluster component that summarizes the stream locally as new data arrive over time
    o Micro-clusters are stored on disk at snapshots in time that follow a pyramidal time frame
  – an offline macro-cluster component that clusters these summaries into global clusters
    o Clustering is performed upon summaries instead of raw data

CluStream: micro-cluster summary structure
[Figure: structure of the micro-cluster summary.]

CluStream Algorithm
• A fixed number q of micro-clusters is maintained over time
• Initialize: apply q-Means over initPoints and build a summary for each cluster
• Online micro-cluster maintenance as a new point p arrives from the stream
  – Find the closest micro-cluster clu for the new point p
    o If p is within the max-boundary of clu, p is absorbed by clu
    o Otherwise, a new cluster is created with p
  – The number of micro-clusters should not exceed q
    o Delete the most obsolete micro-cluster or merge the two closest ones
• Periodic storage of micro-cluster snapshots to disk
  – At different levels of granularity depending upon their recency
• Offline macro-clustering
  – Input: a user-defined time horizon h and the number of macro-clusters k to be detected
  – Locate the valid micro-clusters during h
  – Apply k-Means upon these micro-clusters to obtain k macro-clusters

CluStream: Initialization Step
• Initialization
  – Done using an offline process in the beginning
  – Wait for the first InitNumber points to arrive
  – Apply a standard k-Means algorithm to create q clusters
    o For each discovered cluster, assign it a unique ID and create its micro-cluster summary
• Comments on the choice of q
  – Much larger than the natural number of clusters
  – Much smaller than the total number of points arrived

CluStream: online step
• A fixed number q of micro-clusters is maintained over time
• Whenever a new point p arrives from the stream
  – Compute the distance between p and each of the q maintained micro-cluster centroids
  – clu = the closest micro-cluster to p
  – Find the max boundary of clu
    o It is defined as a factor t of the radius of clu
  – If p falls within the maximum boundary of clu
    o p is absorbed by clu
    o Update the statistics of clu (incrementality property)
  – Else, create a new micro-cluster with p, assign it a new cluster ID, and initialize its statistics
    o To keep the total number of micro-clusters fixed (i.e., q), either:
      • Delete the most obsolete micro-cluster, if it is safe to do so (based on how far in the past the micro-cluster received new points), or
      • Merge the two closest ones (additivity property)
    o When two micro-clusters are merged, a list of ids is created; this way, we can identify the component micro-clusters that comprise a micro-cluster.
(A sketch of the micro-cluster summary and the online step follows below.)
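A hedged sketch of the micro-cluster summary and the absorb-or-create online step. The additive fields (count, linear sum, squared sum, timestamp sum) follow the usual cluster-feature formulation; the max-boundary factor and the recency-based deletion policy shown here are simplified assumptions rather than the exact rules of the CluStream paper.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class MicroCluster:
    """Cluster-feature summary with purely additive fields, so that two
    micro-clusters can be merged by addition (additivity) and an old snapshot
    can be subtracted from a newer one (subtractivity)."""
    n: float = 0.0                          # number of points absorbed
    ls: np.ndarray = None                   # linear sum of the points
    ss: np.ndarray = None                   # sum of squares of the points
    ts_sum: float = 0.0                     # sum of arrival timestamps
    ids: set = field(default_factory=set)   # component micro-cluster ids

    def absorb(self, x, t):
        self.n += 1; self.ls += x; self.ss += x * x; self.ts_sum += t

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        var = self.ss / self.n - self.centroid() ** 2
        return float(np.sqrt(np.clip(var, 0, None).sum()))

    def merge(self, other):                 # additivity property
        self.n += other.n; self.ls += other.ls; self.ss += other.ss
        self.ts_sum += other.ts_sum; self.ids |= other.ids

def online_step(clusters, x, t, q, boundary_factor=2.0):
    """Absorb x into the closest micro-cluster if it falls within the
    max-boundary; otherwise create a new micro-cluster, keeping at most q.
    Assumes each existing micro-cluster already carries at least one id
    assigned during initialization."""
    clu = min(clusters, key=lambda c: np.linalg.norm(x - c.centroid()))
    if np.linalg.norm(x - clu.centroid()) <= boundary_factor * clu.radius():
        clu.absorb(x, t)
        return
    new_id = max(max(c.ids) for c in clusters) + 1
    new = MicroCluster(0.0, np.zeros_like(x, dtype=float),
                       np.zeros_like(x, dtype=float), 0.0, {new_id})
    new.absorb(x, t)
    clusters.append(new)
    if len(clusters) > q:                   # keep exactly q summaries
        # Simplified policy: drop the micro-cluster with the oldest mean timestamp.
        clusters.remove(min(clusters, key=lambda c: c.ts_sum / c.n))
```

Because the fields are purely additive, a snapshot taken just before tc − h can be subtracted from the current summaries, which is what the periodic storage and offline steps below rely on.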
CluStream: periodic micro-cluster storage
• Micro-cluster snapshots are stored at particular times
• If the current time is tc and the user wishes to find clusters based on a history of length h
  – Then we use the subtractive property of micro-clusters at snapshots tc and tc−h
  – In order to find the macro-clusters in a history or time horizon of length h
• How many snapshots should be stored?
  – It is too expensive to store snapshots at every time stamp
  – They are stored in a pyramidal time frame
    o It is an effective trade-off between the storage requirements and the ability to recall summary statistics from different time horizons

CluStream: offline step
• The offline step is applied on demand upon the q maintained micro-clusters instead of the raw data
• User input: time horizon h, number of macro-clusters k to be detected
• Find the active micro-clusters during h, exploiting the subtractivity property:
  o Suppose the current time is tc. Let S(tc) be the set of micro-clusters at tc.
  o Find the stored snapshot which occurs just before time tc−h. We can always find such a snapshot, taken at some time tc−h'. Let S(tc−h') be its set of micro-clusters.
  o For each micro-cluster in the current set S(tc), we find its list of ids. For each of the listed ids, find the corresponding micro-clusters in S(tc−h').
  o Subtract the CF vectors of the corresponding micro-clusters in S(tc−h').
  o This ensures that the micro-clusters created before the user-specified horizon do not dominate the result of the clustering process.
• Apply k-Means over the active micro-clusters in h to derive the k macro-clusters
  – Initialization: seeds are not picked randomly; rather, they are sampled with probability proportional to the number of points in a given micro-cluster
  – Distance is the centroid distance
  – The new seed for a given partition is the weighted centroid of the micro-clusters in that partition

CluStream: summary
+ CluStream clusters large evolving data streams
+ Views the stream as a changing process over time, rather than clustering the whole stream at a time
+ Can characterize clusters over different time horizons in a changing environment
+ Provides flexibility to an analyst in a real-time and changing environment
– Fixed number of micro-clusters maintained over time
– Sensitive to outliers/noise

Density-Based Data Stream Clustering
• We will cover DenStream: Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: "Density-Based Clustering over an Evolving Data Stream with Noise". SDM '06
• DenStream operates on micro-clusters using an extension of DBSCAN: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". KDD '96

DBSCAN (review)
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
• A point is a core point if it has more than a specified number of points (MinPts) within Eps
  – These are points that are at the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
• A noise point is any point that is not a core point or a border point.

DBSCAN: Core, Border, and Noise Points
[Figure]

Density-Reachable and Density-Connected (w.r.t. Eps, MinPts)
• Let p be a core point; then every point in its Eps-neighborhood is said to be directly density-reachable from p.
• A point p is density-reachable from a core point q if there is a chain of points p1, …, pn with p1 = q and pn = p, where each point is directly density-reachable from the previous one.
• A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o.

DBSCAN: The Algorithm (given Eps and MinPts)
Let ClusterCount = 0. For every point p:
1. If p is not a core point, assign a null label to it [e.g., zero].
2. If p is a core point, a new cluster is formed [with label ClusterCount := ClusterCount + 1].
   Then find all points density-reachable from p and classify them in the cluster. [Reassign the zero labels but not the others.]
Repeat this process until all of the points have been visited.
Since all the zero labels of border points have been reassigned in step 2, the remaining points with zero label are noise.

DBSCAN application examples: population density, spreading of diseases, trajectory tracing.

DenStream
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: "Density-Based Clustering over an Evolving Data Stream with Noise". SDM '06
• Based on DBSCAN
• Core-micro-cluster: CMC(w, c, r) with weight w > μ, center c, radius r < ε
• Potential (p) and outlier (o) micro-clusters
• Online: merge a point into the closest p- (or o-) micro-cluster if the new radius r' < ε
  – Promote an o-micro-cluster to p if w > βμ
  – Else create a new o-micro-cluster
• Offline: modified DBSCAN (on user demand)
(A sketch of the online step follows below.)
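A minimal sketch of the DenStream online step, assuming a CF-style summary with exponentially fading weights and a simple try-insert test that checks whether the radius after absorbing the point stays below ε; the parameter names (eps, beta, mu, lambd) and the lazy decay are illustrative choices, not the paper's exact bookkeeping.

```python
import numpy as np

class DenStreamMC:
    """Micro-cluster with a fading weight: w, weighted linear sum, and
    weighted squared sum, all decayed by 2^(-lambda * dt)."""
    def __init__(self, x, t, lambd):
        self.lambd = lambd
        self.t = t
        self.w, self.ls, self.ss = 1.0, x.copy(), x * x

    def decay(self, t):
        f = 2.0 ** (-self.lambd * (t - self.t))
        self.w *= f; self.ls *= f; self.ss *= f
        self.t = t

    def radius_if_added(self, x):
        # Radius the micro-cluster would have after absorbing x with weight 1.
        w, ls, ss = self.w + 1.0, self.ls + x, self.ss + x * x
        var = ss / w - (ls / w) ** 2
        return float(np.sqrt(np.clip(var, 0, None).sum()))

    def add(self, x):
        self.w += 1.0; self.ls += x; self.ss += x * x

def denstream_online(p_mcs, o_mcs, x, t, eps, beta, mu, lambd):
    """Online step sketch: try the nearest p-micro-cluster, then the nearest
    o-micro-cluster; otherwise start a new o-micro-cluster."""
    for mcs in (p_mcs, o_mcs):
        if not mcs:
            continue
        for mc in mcs:
            mc.decay(t)                               # lazy weight decay
        nearest = min(mcs, key=lambda m: np.linalg.norm(x - m.ls / m.w))
        if nearest.radius_if_added(x) <= eps:         # merge keeps the radius small enough
            nearest.add(x)
            if mcs is o_mcs and nearest.w > beta * mu:
                o_mcs.remove(nearest)                 # promote o -> p micro-cluster
                p_mcs.append(nearest)
            return
    o_mcs.append(DenStreamMC(x, t, lambd))            # no fit: new outlier micro-cluster
```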
Conclusion
• Much work is still needed on data stream clustering.