Rheinisch-Westfälische Technische Hochschule Aachen
Lehrstuhl für Datenmanagement und -exploration
Prof. Dr. T. Seidl

Proseminar Data Stream Mining
Basic Methods and Techniques

Martin Matysiak
Fall 2012

Supervisors: Univ.-Prof. Dr. rer. nat. Thomas Seidl, Dipl.-Ing. Marwan Hassani

The material in this paper has not previously been submitted for a degree in any university and, to the best of my knowledge, contains no material previously published or written by another person except where due acknowledgement is made in the paper itself.

Aachen, November 8, 2012

Contents

Abstract
1 Introduction
2 Data Windows
  2.1 Fixed Sliding Window
  2.2 Adaptive Window
  2.3 Landmark Window
  2.4 Damped Window
3 Micro Clusters
  3.1 Structure
  3.2 Properties
  3.3 Applications
4 Time Frames
  4.1 Tilted Time Frames
  4.2 Pyramidal Time Frames
5 Summary
Appendix A Measurements on Pyramidal Time Frames
References

List of Figures

2.1 ADWIN running on a synthetic data set
2.2 Damped data window
3.1 Micro-clustering
4.1 Mean relative error in approximation of time windows

List of Tables

4.1 Snapshots stored at t = 60 for α = β = 2
A.1 Development of the number of snapshots and relative error in approximation
Abstract

The domain of stream data mining poses many new challenges compared to more traditional data mining applications. Large volumes of data arriving over long periods of time render common mining and clustering techniques unusable. This paper gives the reader an overview of available methods for coping with the special properties of data streams and points to various places where they are applied in practice.

Chapter 1
Introduction

Data streaming applications typically deal with large amounts of data over an extended period of time. In most cases, however, the user is only interested in recent data rather than the whole data set. Furthermore, stream data tends to exhibit concept drift, i.e. the data evolves over time. Algorithms that weight the whole data set equally would therefore produce distorted results, because the majority of the processed data would no longer be valid.

This paper intends to give the reader a brief overview of common methods used in stream data mining applications. Most of these methods belong to the domain of ageing techniques, i.e. methods that take the evolutionary nature of data streams into account and try to counter its negative consequences by ageing older data points in an appropriate way.

In chapter 2, data windows are introduced as a way of looking at relevant slices of a data stream. Chapter 3 covers the concept of micro-clusters, which prove to be a space- and time-efficient way of summarizing a stream's current state. In chapter 4 we will take a look at how such summaries can be stored over a long period of time while requiring minimal amounts of space. Lastly, we will recapitulate the discussed methods in chapter 5 and look at how they can be combined to gain optimal results from mining a stream.
Chapter 2
Data Windows

In this chapter we will take a look at several data windowing models which can be used to limit the amount of processed data based on different characteristics, thus improving the results of subsequently executed algorithms.

2.1 Fixed Sliding Window

The easiest way of limiting data is by using a fixed sliding window. Such a window is fixed either by including only the most recent n data points or by showing only the most recent t time units of data (where n and t are constants). While the implementation of this model is very simple, it is prone to errors when a wrong window width is chosen. Too narrow a window produces a very accurate representation of the current state but is heavily affected by noisy data, while too wide a window yields more stable yet equally inaccurate results due to the effects of concept drift [2]. Nevertheless, fixed sliding windows can be found in many applications. An example is [7], where they are used to detect regions of abnormal network activity.

2.2 Adaptive Window

Because of the disadvantages of a fixed window size, Bifet et al. [2] introduced the adaptive windowing technique (short: ADWIN), which dynamically resizes the window based on the incoming data and a user-specifiable confidence value λ. Essentially, resizing is done by looking at possibilities for dividing the current data window W into two consecutive subwindows W1 and W2 such that W1 · W2 = W, and checking whether the means of these two subwindows differ by more than a threshold εcut. If that is the case, the older subwindow W1 is dropped from W. Using this technique, it can be proven that ADWIN maintains an optimal window width throughout the streaming process [2].

Figure 2.1: ADWIN running on a synthetic data set. One can clearly see the window width adapting whenever the data set changes significantly. The shaded area depicts the data window at t = 1000.

An application of adaptive windowing is the MawStream algorithm [9], which clusters data streams and maintains multiple adaptive windows (one for each cluster) in order to keep the detected clusters relevant and to make the addition of new data points computationally fast.

2.3 Landmark Window

Sometimes it is useful to track the evolution of data points starting at a fixed point in time, the so-called landmark. A landmark data window includes all data points starting from that particular landmark [10]. Note that this model is typically not very suitable for streaming applications, because the amount of data inside the window would quickly grow to unprocessable sizes. Still, the model has a few limited applications, for instance on the stock market, where it can be used to observe the average price of a stock in the current month or year.

2.4 Damped Window

In contrast to the other window models, damped windows assign weights to the data points rather than making a binary decision on whether to include a point or not. These weights depend on the age of a data point; frequently, an exponential falloff is used [3]. This ensures that while past data is not completely disregarded, recent data always has a stronger influence on the computation to be performed.

Figure 2.2: Damped windowing on a synthetic data set at t = 1000. The weight drops off exponentially with the age of a data point.

Damped windows are used, for instance, in the domain of finding recent frequent itemsets [4] in order to lessen the contribution of old data points towards the rate at which an itemset appears in the stream, thus giving itemsets with a high count of recent data points a much higher relative importance.
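The splitting idea behind the adaptive window can be sketched in a few lines. This is a deliberately simplified toy version: the real ADWIN derives the cut threshold statistically from the confidence parameter and uses a compressed bucket representation of the window, whereas here a fixed numeric threshold and a plain list are used (all names are illustrative).

```python
from statistics import mean

class SimpleAdaptiveWindow:
    """Toy adaptive window: drop the older part whenever the means of
    two consecutive subwindows differ by more than a fixed eps_cut.
    (ADWIN itself derives eps_cut from a confidence parameter [2].)"""

    def __init__(self, eps_cut=0.3):
        self.eps_cut = eps_cut
        self.window = []

    def add(self, value):
        self.window.append(value)
        # Try every split into an older part W1 and a newer part W2.
        # This naive O(n^2) scan is for clarity only.
        changed = True
        while changed and len(self.window) > 1:
            changed = False
            for split in range(1, len(self.window)):
                w1, w2 = self.window[:split], self.window[split:]
                if abs(mean(w1) - mean(w2)) > self.eps_cut:
                    self.window = w2  # forget the outdated data
                    changed = True
                    break

# A stream whose mean jumps from 0 to 1 at t = 50: the window shrinks
# right after the change and then grows again on the new concept.
w = SimpleAdaptiveWindow(eps_cut=0.3)
for t in range(100):
    w.add(0.0 if t < 50 else 1.0)
print(len(w.window))  # 50 (only points from the new concept remain)
```

After the concept change at t = 50, every split that separates old zeros from new ones exceeds the threshold, so the pre-change data is dropped and only the 50 post-change points remain in the window.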
Chapter 3
Micro Clusters

Clustering streams is a challenging task because of the high amounts of constantly arriving data. Known clustering algorithms are too slow to work in such environments. Therefore, clustering of streams is divided into two phases. First, an online algorithm constantly processes the incoming data and summarizes it into a space-efficient format. Afterwards, a traditional offline algorithm can be executed on the summary in order to perform the actual clustering.

In this chapter we will look at the concept of micro-clustering, which is a fast method to summarize large amounts of incoming data without losing too much granularity. To meet this constraint, the number of micro-clusters is usually several times higher than the number of actual clusters, yet far smaller than the number of single data points, as shown in figure 3.1, thus still having low memory requirements. We will look at micro-clusters based on the definitions by Aggarwal et al. [1], but note that the discussed properties apply to other definitions as well.

3.1 Structure

Let D denote the number of dimensions of the input data. Further, let xi be the i-th D-dimensional data point and ti the timestamp at which it occurred. A micro-cluster is a summary of a set of data points xi1, xi2, ..., xin, n ∈ N. It is based on the original cluster feature vector introduced in [8]. The main difference is that, in addition to a summary of data values, temporal information about the occurrence of these values is stored. This information can be used to track the evolution of micro-clusters throughout the stream.

A micro-cluster can be defined as a 5-tuple (CF2^x, CF1^x, CF2^t, CF1^t, n), with CF2^x and CF1^x being the sum of squares and the simple sum of the values in the cluster (both D-dimensional vectors), n being the number of data points in the cluster, CF2^t = Σ_{i=1..n} ti² being the sum of squares of timestamps, and CF1^t = Σ_{i=1..n} ti the sum of timestamps. Having these values, it is easy to derive other information about the micro-cluster, for instance its average timestamp (CF1^t / n).

Figure 3.1: A sample data set (left) and a corresponding micro-clustering (right), generated using the MOA Framework. The clusters were calculated by the CluStream algorithm, which maintains a fixed number of micro-clusters and assigns incoming data points either to existing micro-clusters (if they fall within a maximum boundary of such a cluster) or to a new micro-cluster which replaces a stale one.

3.2 Properties

Micro-clusters exhibit several useful properties that make them an ideal choice for the online phase of stream clustering. The most important one is their additivity. Let M1 = (CF2^x_1, CF1^x_1, CF2^t_1, CF1^t_1, n1) and M2 = (CF2^x_2, CF1^x_2, CF2^t_2, CF1^t_2, n2) be two micro-clusters over disjoint sets of data points. These micro-clusters can be merged to M1 ∪ M2 by simple addition of their values: M1 ∪ M2 = M1 + M2 = (CF2^x_1 + CF2^x_2, CF1^x_1 + CF1^x_2, CF2^t_1 + CF2^t_2, CF1^t_1 + CF1^t_2, n1 + n2). This property implies that merging two clusters or adding new data points to them is a constant-time operation, thus making micro-cluster maintenance during the online phase very efficient.

Analogous to the additivity, micro-clusters also exhibit a subtractivity property. This property can be used to obtain a summary of the data points that arrived between two timestamps t1 and t2, simply by subtracting M1 from M2 (given that M2 ⊇ M1 ∧ t2 > t1).

3.3 Applications

Micro-clustering is used in a variety of applications. One prominent example is DenStream [3], where, instead of storing simple sums of timestamps in the feature vector, a decay function is applied to the timestamps in order to calculate a weight, essentially combining micro-clustering with damped data windows (see chapter 2.4).
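The feature-vector structure and its additivity can be sketched as follows; this is a minimal illustration of the 5-tuple described above, with class and attribute names chosen for readability rather than taken from any particular implementation.

```python
# Minimal sketch of the micro-cluster 5-tuple and its additivity.
class MicroCluster:
    def __init__(self, dims):
        self.cf2x = [0.0] * dims  # per-dimension sum of squared values
        self.cf1x = [0.0] * dims  # per-dimension sum of values
        self.cf2t = 0.0           # sum of squared timestamps
        self.cf1t = 0.0           # sum of timestamps
        self.n = 0                # number of absorbed data points

    def insert(self, x, t):
        """Absorbing one data point is a constant-time update."""
        for d, v in enumerate(x):
            self.cf2x[d] += v * v
            self.cf1x[d] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other):
        """Additivity: the union of two micro-clusters over disjoint
        point sets is the component-wise sum of their tuples."""
        result = MicroCluster(len(self.cf1x))
        result.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        result.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        result.cf2t = self.cf2t + other.cf2t
        result.cf1t = self.cf1t + other.cf1t
        result.n = self.n + other.n
        return result

    def mean_timestamp(self):
        return self.cf1t / self.n  # derived quantity: CF1^t / n

m1, m2 = MicroCluster(2), MicroCluster(2)
m1.insert([1.0, 2.0], t=1)
m2.insert([3.0, 4.0], t=3)
merged = m1.merge(m2)
print(merged.n, merged.cf1x, merged.mean_timestamp())  # 2 [4.0, 6.0] 2.0
```

Subtractivity would work the same way with component-wise subtraction, provided the subtrahend summarizes a subset of the minuend's points.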
Another example is CluStream [1]. Here, micro-clustering is used mainly in combination with pyramidal time frames, which we will discuss in the following chapter.

Chapter 4
Time Frames

In chapter 3 we discussed a way of creating summaries of a data stream. Using the subtractivity property of micro-clusters, we can generate summaries for any arbitrary time window simply by subtracting two summaries of the stream taken at different timestamps. For this purpose, we need a way of storing several such snapshots of summaries. Even though micro-clusters have very low space requirements, it is still not feasible to store indefinitely many of them. Therefore, we will look at ways of storing snapshots in tilted time frames instead of linearly throughout the streaming process.

4.1 Tilted Time Frames

The general idea of tilted time frames is to store snapshots at different levels of granularity depending on how old these snapshots are. The more time has passed, the larger the gap between two consecutive snapshots becomes. Tilted time frames were introduced in [5], where the pattern of storing snapshots is aligned with natural time: in the most recent quarter-hour we take one snapshot per minute, in the most recent hour one snapshot per quarter-hour, and so on. Older snapshots have to be maintained regularly in this model.

There are several ways of dealing with the transition from one level of granularity to another, depending on how the snapshots are structured. One way is to look at each granularity level independently and drop the oldest snapshot whenever a new snapshot for that particular level arrives. This concept is used in CluStream, which we will discuss in section 4.2. Another way is to merge snapshots: snapshots of a finer granularity are accumulated until they contain enough data to form a snapshot for the next coarser level of detail.
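The merging transition between granularity levels can be sketched as follows, assuming snapshots that are additive summaries (such as micro-clusters; here plain counts stand in for them). The function name and the per-level capacity are illustrative, not taken from any of the cited systems.

```python
# Sketch of merging fine-grained snapshots into the next coarser level.
# Whenever a level holds `capacity` snapshots, they are merged (summed,
# since the summaries are additive) into one snapshot one level up.
def insert_snapshot(levels, snapshot, capacity=2):
    levels.setdefault(0, []).append(snapshot)
    level = 0
    while len(levels[level]) == capacity:
        merged = sum(levels[level])          # summaries are additive
        levels[level] = []
        levels.setdefault(level + 1, []).append(merged)
        level += 1

levels = {}
for second in range(8):
    insert_snapshot(levels, 1)  # one unit of data per time step
print(levels)  # {0: [], 1: [], 2: [], 3: [8]}
```

With a capacity of 2 per level, the gap between snapshots doubles at each level, and each insertion triggers only an amortized constant number of merges.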
Using a logarithmic time window, where the time gap between snapshots increases by a factor of 2 from one level to the next, it can be shown that the amortized number of maintenance operations is limited to O(1) [6].

4.2 Pyramidal Time Frames

Aggarwal et al. [1] used a variation of tilted time frames which maintains snapshots in a pyramidal pattern. Pyramidal time frames are loosely based on logarithmic time frames, but allow the user to customize the logarithm's base and the level of detail by parameters α and β, where α > 1 ∧ α, β ∈ N. The basic rules for maintaining the different levels are simple: at every level i, we store a snapshot whenever the current timestamp is divisible by α^i but not by α^(i+1), to avoid redundancy. At most α^β + 1 snapshots are kept at any level i, by dropping the oldest snapshot whenever a new one arrives. The total number of levels after T time units have elapsed since the beginning equals log_α(T), and the total number of snapshots thus equals at most (α^β + 1) · log_α(T).

Level  Snapshots
0      59 57 55 53 51
1      58 54 50 46 42
2      60 52 44 36 28
3      56 40 24 8
4      48 16
5      32

Table 4.1: Snapshots stored at t = 60 for α = β = 2. One can clearly see the distinctive pyramidal shape with rising level.

Even with such modest space requirements, it can be proven that for any given timestamp t, a snapshot can be found within at most (1 + 1/α^β) · t units of time from the current timestamp [1]. Figure 4.1 demonstrates how the mean relative error of approximating random time windows develops for different values of α and β; appendix A contains the detailed results of this experiment. As we can see, for instance, α = 3, β = 8 needs only about 0.2% of the space compared to not using tilted time frames at all, yet the mean error in approximation is still lower than 0.01%, which should be accurate enough for the majority of applications.
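The storage rules described in this section can be sketched directly; running the sketch for t = 1, ..., 60 with α = β = 2 reproduces the snapshots of table 4.1 (the function name is illustrative).

```python
# Sketch of pyramidal snapshot bookkeeping: which timestamps go to
# which level, and how many snapshots are kept per level.
def store_snapshot(levels, t, alpha=2, beta=2):
    if t == 0:
        return
    # Find the level i such that t is divisible by alpha^i
    # but not by alpha^(i+1).
    i = 0
    while t % (alpha ** (i + 1)) == 0:
        i += 1
    levels.setdefault(i, []).append(t)
    if len(levels[i]) > alpha ** beta + 1:  # keep at most alpha^beta + 1
        levels[i].pop(0)                    # drop the oldest snapshot

levels = {}
for t in range(1, 61):
    store_snapshot(levels, t, alpha=2, beta=2)
for i in sorted(levels):
    print(i, levels[i])
```

For α = β = 2, at most 5 snapshots survive per level; at t = 60 this yields, e.g., the odd timestamps 51, 53, 55, 57, 59 at level 0 and the single timestamp 32 at level 5, matching table 4.1.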
Figure 4.1: Mean relative error in approximation of time windows for different values of α and β.

Chapter 5
Summary

In this paper we have seen several basic methods which can be used to extract useful information from data streams. As the large number of available techniques might suggest, there is no single approach for all types of applications. Choosing the right technique for the right application involves taking into account the various constraints and properties of the specific application.

In general, though, it seems useful to combine some of these techniques in order to track the evolution of a data stream efficiently. Summary techniques (such as data windows or micro-clustering) can be used to get an overview of a stream's state at any given point. These summaries can then be stored periodically (e.g. using tilted time frames) in order to analyze how the state has been changing over time.

Choosing one of the presented methods to gain space and time efficiency necessarily comes with the tradeoff of giving up some of the overall level of detail in one's analysis. However, most of the presented methods are designed in such a way that the tradeoff affects only areas where a detailed view is not required anyway, such as very old data segments. Finally, sometimes the nature of a data stream itself requires giving up a certain amount of precision, because its high volume could not be processed otherwise and one would end up with no information at all.

Appendix A
Measurements on Pyramidal Time Frames

The following numbers are based on a sample stream with a duration of one year at a resolution of one snapshot per second. The reference value (i.e. when not using a tilted time frame) is therefore 365 · 24 · 60 · 60 = 31 536 000 snapshots.
The mean error e_rel was calculated on the basis of approximating I = 1 000 000 random time windows. Let tc denote the current timestamp, wi the desired window width in test i, and ts,i the nearest available snapshot just before tc − wi. It is:

e_rel = (1/I) · Σ_{i=1..I} ((tc − ts,i) − wi) / wi

α  β   e_rel in %    Number of snapshots
2  2   7.7661        117 (0.0004%)
2  3   4.0351        204 (0.0006%)
2  4   2.1063        370 (0.0012%)
2  5   1.0766        687 (0.0022%)
2  6   0.5443        1290 (0.0041%)
2  7   0.2740        2433 (0.0077%)
2  8   0.1372        4593 (0.0146%)
2  9   0.0687        8657 (0.0275%)
2  10  0.0344        16 274 (0.0516%)
3  2   6.1618        146 (0.0005%)
3  3   2.3056        383 (0.0012%)
3  4   0.7760        1043 (0.0033%)
3  5   0.2606        2862 (0.0091%)
3  6   0.0870        7834 (0.0248%)
3  7   0.0290        21 294 (0.0675%)
3  8   0.0097        57 302 (0.1817%)
3  9   0.0032        152 207 (0.4826%)
3  10  0.0011        397 559 (1.2607%)
4  2   5.0743        194 (0.0006%)
4  3   1.3024        680 (0.0022%)
4  4   0.3327        2433 (0.0077%)
4  5   0.0836        8681 (0.0275%)
4  6   0.0209        30 603 (0.0970%)
4  7   0.0052        106 009 (0.3362%)
4  8   0.0013        358 481 (1.1367%)
4  9   0.0003        1 171 767 (3.7156%)
4  10  7.3822e-05    3 638 481 (11.5375%)
5  2   3.9862        250 (0.0008%)
5  3   0.8326        1088 (0.0035%)
5  4   0.1673        4785 (0.0152%)
5  5   0.0335        20 774 (0.0659%)
5  6   0.0067        88 221 (0.2797%)
5  7   0.0013        362 961 (1.1509%)
5  8   0.0003        1 424 166 (4.5160%)
5  9   4.6318e-05    5 167 692 (16.3866%)
5  10  6.0225e-06    16 072 826 (50.9666%)

Table A.1: Development of the number of snapshots and relative error in approximation.

Bibliography

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), pages 81–92. VLDB Endowment, 2003.
[2] A. Bifet and R. Gavalda. Learning from time-changing data with adaptive windowing. 2006.
[3] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 328–339, 2006.
[4] J. H. Chang and W. S. Lee. Finding recent frequent itemsets adaptively over online data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 487–492, New York, NY, USA, 2003. ACM.
[5] Y. Chen, G. Dong, J. Han, B. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), pages 323–334. VLDB Endowment, 2002.
[6] C. Giannella, J. Han, J. Pei, X. Yan, and P. Yu. Mining frequent patterns in data streams at multiple time granularities. Next Generation Data Mining, 212:191–212, 2003.
[7] W. Lee and S. Stolfo. Data mining approaches for intrusion detection. Defense Technical Information Center, 2000.
[8] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD International Conference on Management of Data, pages 103–114, 1996.
[9] H. Zhu, Y. Wang, and Z. Yu. Clustering of evolving data stream with multiple adaptive sliding window. In Data Storage and Data Engineering (DSDE), 2010 International Conference on, pages 95–100, Feb. 2010.
[10] Y. Zhu and D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), pages 358–369. VLDB Endowment, 2002.