Rheinisch-Westfälische Technische Hochschule Aachen
Lehrstuhl für Datenmanagement und -exploration
Prof. Dr. T. Seidl
Proseminar
Data Stream Mining
Basic methods and techniques
Martin Matysiak
Fall 2012
Supervisor:
Univ.-Prof. Dr. rer. nat. Thomas Seidl
Dipl.-Ing. Marwan Hassani
The material in this paper has not previously been submitted for a degree in any university, and to the best of my knowledge contains no material previously published or written by another person except where due acknowledgement is made in the paper itself.
Aachen, November 8, 2012
Contents

Abstract

1 Introduction

2 Data Windows
   2.1 Fixed Sliding Window
   2.2 Adaptive Window
   2.3 Landmark Window
   2.4 Damped Window

3 Micro Clusters
   3.1 Structure
   3.2 Properties
   3.3 Applications

4 Time Frames
   4.1 Tilted Time Frames
   4.2 Pyramidal Time Frames

5 Summary

Appendix A Measurements on Pyramidal Time Frames

References
List of Figures

2.1 ADWIN running on a synthetic data set
2.2 Damped data window
3.1 Micro-clustering
4.1 Mean relative error in approximation of time windows

List of Tables

4.1 Snapshots stored at t = 60 for α = β = 2
A.1 Development of the number of snapshots and relative error in approximation
Abstract

The domain of stream data mining poses many new challenges compared to more traditional data mining applications. Large volumes of data arriving over long periods of time render common mining and clustering techniques unusable. This paper gives the reader an overview of available methods for coping with the special properties of data streams and points to various places where they are applied in practice.
Chapter 1

Introduction

Data streaming applications typically deal with large amounts of data arriving over an extended period of time. In most cases, however, the user is only interested in recent data instead of the whole data set. Furthermore, stream data tends to exhibit concept drift, i.e. the data evolves over time. Algorithms that treat the whole data set with the same importance therefore produce distorted results, because the majority of the processed data is no longer valid.

This paper intends to give the reader a brief overview of common methods used in stream data mining applications. Most of these methods belong to the domain of ageing techniques, i.e. methods that take into account the evolutionary nature of data streams and try to counter its negative consequences by ageing older data points in an appropriate way.

In chapter 2, data windows are introduced as a way of looking at relevant slices of a data stream. Chapter 3 covers the concept of micro-clusters, which prove to be a space- and time-efficient way of summarizing a stream's current state. In chapter 4 we take a look at how such summaries can be stored over a long period of time while requiring minimal amounts of space. Lastly, we recapitulate the discussed methods in chapter 5 and look at how they can be combined to gain optimal results from mining a stream.
Chapter 2

Data Windows

In this chapter we take a look at several data windowing models which can be used to limit the amount of processed data based on different characteristics, thus improving the results of subsequently executed algorithms.
2.1 Fixed Sliding Window

The easiest way of limiting the data is to use a fixed sliding window. Such a window is fixed either by including only the most recent $n$ data points or by covering only the most recent $t$ time units of data (where $n$ and $t$ are constants).

While the implementation of this model is very simple, it is prone to errors when a wrong window width is chosen. Windows that are too narrow produce very accurate representations of the current state but are heavily affected by noisy data, while windows that are too wide yield more stable yet equally inaccurate results due to the effects of concept drift [2].
Nevertheless, fixed sliding windows can be found in many applications. An example is [7], where they are used to detect regions of abnormal network activity.
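To make the two variants concrete, here is a minimal Python sketch (an illustration, not code from [7]) of a count-based and a time-based fixed sliding window:

```python
from collections import deque

class CountWindow:
    """Keep only the most recent n data points."""
    def __init__(self, n):
        self.points = deque(maxlen=n)  # deque drops the oldest point itself

    def add(self, x):
        self.points.append(x)

class TimeWindow:
    """Keep only the data points of the most recent t time units."""
    def __init__(self, t):
        self.t = t
        self.points = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, x):
        self.points.append((timestamp, x))
        # evict everything that has fallen out of the window
        while self.points and self.points[0][0] <= timestamp - self.t:
            self.points.popleft()
```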
2.2 Adaptive Window

Because of the disadvantages of a fixed window size, Bifet et al. [2] introduced the adaptive windowing technique (short: ADWIN), which dynamically resizes the window based on the incoming data and a user-specifiable confidence value λ.
Essentially, resizing is done by looking at possibilities for dividing the current data window $W$ into two consecutive subwindows $W_1$ and $W_2$ such that $W_1 \cdot W_2 = W$, and checking whether the means of these two subwindows differ by more than a threshold $\varepsilon_{cut}$. If that is the case, the older subwindow $W_1$ is dropped from $W$. Using this technique, it can be proven that ADWIN maintains an optimal window width throughout the streaming process [2].

Figure 2.1: ADWIN running on a synthetic data set (value and window width plotted over time). One can clearly see the window width adaptation whenever the data set changes significantly. The shaded area depicts the data window at t = 1000.
An application of adaptive windowing is the MawStream algorithm [9], which clusters data streams and maintains multiple adaptive windows (one for each cluster) in order to keep the detected clusters relevant and to make the addition of new data points computationally fast.
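The following simplified sketch illustrates the core splitting idea. It is a toy O(n²)-per-update variant that stores the window explicitly and assumes stream values scaled to [0, 1]; the actual ADWIN algorithm of Bifet et al. [2] uses exponential histograms to achieve logarithmic memory and time.

```python
import math
from collections import deque

class SimpleAdwin:
    """Toy variant of adaptive windowing: check every split W = W1 . W2
    and drop the older part W1 when the subwindow means differ too much."""

    def __init__(self, delta=0.01):
        self.delta = delta        # confidence parameter (lambda in the text)
        self.window = deque()

    def _eps_cut(self, n1, n2):
        # Hoeffding-style bound on the allowed difference of two means,
        # assuming the stream values lie in [0, 1].
        m = 1.0 / (1.0 / n1 + 1.0 / n2)          # harmonic mean of the sizes
        return math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))

    def add(self, x):
        self.window.append(x)
        shrunk = True
        while shrunk and len(self.window) > 1:
            shrunk = False
            n, total = len(self.window), sum(self.window)
            s1 = 0.0
            for n1 in range(1, n):                # split after n1 points
                s1 += self.window[n1 - 1]
                n2 = n - n1
                if abs(s1 / n1 - (total - s1) / n2) > self._eps_cut(n1, n2):
                    for _ in range(n1):           # drop the older subwindow W1
                        self.window.popleft()
                    shrunk = True
                    break
```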
2.3
Landmark Window
Sometimes it is useful to track the evolution of data points starting at a fixed
point in time, the so called landmark. Landmark data windows will include
all data points starting from that particular landmark [10].
Note that this model is typically not very suitable for streaming applications, because the amount of data inside the window would quickly grow
to unprocessable sizes. Still, the model has a few limited applications, for
instance the stock market, where it can be used to observe the average price
of a stock in the current month or year.
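One reason such examples remain practical is that simple aggregates over a landmark window can be maintained in constant space even though storing all points is infeasible; a small illustrative sketch (not taken from [10]):

```python
class LandmarkAverage:
    """Running average of all values since a fixed landmark timestamp.

    Storing every point since the landmark is infeasible on a stream,
    but an aggregate like the mean needs only constant space."""

    def __init__(self, landmark):
        self.landmark = landmark
        self.total = 0.0
        self.count = 0

    def add(self, timestamp, value):
        if timestamp >= self.landmark:    # ignore data before the landmark
            self.total += value
            self.count += 1

    @property
    def average(self):
        return self.total / self.count if self.count else None

# e.g. average stock price since the start of the month (timestamps in days)
month_avg = LandmarkAverage(landmark=1)
```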
2.4 Damped Window
In contrast to the other window models, damped windows assign weights to the data points rather than making a binary decision on whether to include a point or not.

These weights depend on the age of a data point; frequently, an exponential falloff is used [3]. This ensures that while past data is not completely disregarded, recent data always has a stronger influence on the computation to be performed.
Figure 2.2: Damped windowing on a synthetic dataset at t = 1000 (value and weight plotted over time). The weight drops off exponentially with the age of a data point.
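A common concrete choice of decay function, used for instance in DenStream [3], has the form f(t) = 2^(−λ·t) for a decay rate λ > 0. A small sketch of a damped weighted mean computed this way (the parameter value is purely illustrative):

```python
def weight(age, decay=0.01):
    """Exponential decay weight f(age) = 2**(-decay * age), as in [3]."""
    return 2.0 ** (-decay * age)

def damped_mean(points, t_now, decay=0.01):
    """Weighted mean of (timestamp, value) pairs under a damped window."""
    ws = [weight(t_now - t, decay) for t, _ in points]
    return sum(w * v for w, (_, v) in zip(ws, points)) / sum(ws)

# recent points dominate: value 1.0 now vs. value 0.0 from 500 time units ago
print(damped_mean([(500, 1.0), (0, 0.0)], t_now=500))  # close to 1.0
```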
Damped windows are used, for instance, in the domain of finding recent frequent itemsets [4] in order to lessen the contribution of old data points towards the rate at which an itemset appears in the stream, thus giving itemsets with a high count of recent data points a much higher relative importance.
Chapter 3

Micro Clusters

Clustering streams is a challenging task because of the high amounts of constantly arriving data. Known clustering algorithms are too slow to work in such environments. Therefore, the clustering of streams is divided into two phases. First, an online algorithm constantly processes the incoming data and summarizes it into a space-efficient format. Afterwards, a traditional offline algorithm can be executed on the summary in order to perform the actual clustering.

In this chapter we look at the concept of micro-clustering, which is a fast method to summarize large amounts of incoming data without losing too much granularity. To meet this constraint, the number of micro-clusters is usually several times higher than the number of actual clusters, yet far smaller than the number of single data points, as shown in figure 3.1, thus still keeping memory requirements low. We will look at micro-clusters based on the definitions by Aggarwal et al. [1], but note that the discussed properties apply to other definitions as well.
3.1 Structure

Let $D$ denote the number of dimensions of the input data. Further, let $x_i$ be the $i$-th $D$-dimensional data point and $t_i$ the timestamp at which it occurred. A micro-cluster is a summary of a set of data points $x_{i_1}, x_{i_2}, \ldots, x_{i_n}$, $n \in \mathbb{N}$.

Micro-clusters are based on the original cluster feature vector introduced in [8]. The main difference is that, in addition to a summary of the data values, temporal information about the occurrence of these values is stored. This information can be used to track the evolution of micro-clusters throughout the stream.
Micro-clusters can be defined as a 5-tuple $(CF2^x, CF1^x, CF2^t, CF1^t, n)$, with $CF2^x$ and $CF1^x$ being the sum of squares and the simple sum of the data values in the cluster (both $D$-dimensional vectors), $n$ being the number of data points in the cluster, $CF2^t = \sum_{i=1}^{n} t_i^2$ the sum of squared timestamps, and $CF1^t = \sum_{i=1}^{n} t_i$ the sum of timestamps. Having these values, it is easy to derive further information about the micro-cluster, for instance its average timestamp $CF1^t / n$.

Figure 3.1: A sample data set (left) and a corresponding micro-clustering (right), generated using the MOA Framework. The clusters were calculated by the CluStream algorithm, which maintains a fixed number of micro-clusters and assigns incoming data points either to existing micro-clusters (if they fall within a maximum boundary of such a cluster) or to a new micro-cluster which replaces a stale one.
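Translated directly into code, such a feature vector might look as follows; this is a sketch assuming NumPy vectors as data points, not an implementation from [1]:

```python
import numpy as np

class MicroCluster:
    """Sketch of a micro-cluster feature vector (CF2x, CF1x, CF2t, CF1t, n),
    following the definitions of Aggarwal et al. [1]."""

    def __init__(self, x, t):
        x = np.asarray(x, dtype=float)
        self.cf2x = x * x        # per-dimension sum of squared values
        self.cf1x = x.copy()     # per-dimension sum of values
        self.cf2t = t * t        # sum of squared timestamps
        self.cf1t = float(t)     # sum of timestamps
        self.n = 1               # number of absorbed data points

    def absorb(self, x, t):
        """Add a data point: a constant-time update of all five components."""
        x = np.asarray(x, dtype=float)
        self.cf2x += x * x
        self.cf1x += x
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    @property
    def centroid(self):
        return self.cf1x / self.n

    @property
    def mean_timestamp(self):
        return self.cf1t / self.n
```

Because every component is a simple sum, absorbing a point runs in constant time, which is exactly the property exploited in section 3.2.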
3.2 Properties

Micro-clusters exhibit several useful properties that make them an ideal choice for the online phase of stream clustering. The most important one is their additivity. Let $M_1 = (CF2^x_1, CF1^x_1, CF2^t_1, CF1^t_1, n_1)$ and $M_2 = (CF2^x_2, CF1^x_2, CF2^t_2, CF1^t_2, n_2)$ be two micro-clusters over disjoint sets of data points. These micro-clusters can be merged into $M_1 \cup M_2$ by simple componentwise addition of their values: $M_1 \cup M_2 = M_1 + M_2 = (CF2^x_1 + CF2^x_2,\ CF1^x_1 + CF1^x_2,\ CF2^t_1 + CF2^t_2,\ CF1^t_1 + CF1^t_2,\ n_1 + n_2)$.
This property implies that merging two clusters or adding new data points
to them is a constant-time operation, thus making micro-cluster maintenance
during the online phase very efficient.
Analogous to the additivity, micro-clusters also exhibit a subtractivity property. This property can be used to obtain a summary of the data points that arrived between two timestamps $t_1$ and $t_2$, simply by subtracting $M_1$ from $M_2$ (given that $M_2 \supseteq M_1 \wedge t_2 > t_1$).
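Building on the MicroCluster sketch from section 3.1, both properties reduce to componentwise arithmetic; again only an illustrative sketch:

```python
import copy

def merge(a, b):
    """Additivity: the union of two micro-clusters over disjoint point sets
    is the componentwise sum of their components (constant time)."""
    m = copy.deepcopy(a)
    m.cf2x = a.cf2x + b.cf2x
    m.cf1x = a.cf1x + b.cf1x
    m.cf2t = a.cf2t + b.cf2t
    m.cf1t = a.cf1t + b.cf1t
    m.n = a.n + b.n
    return m

def subtract(m2, m1):
    """Subtractivity: summary of the points in m2 but not in m1, assuming
    m2's point set is a superset of m1's (e.g. snapshots at t2 > t1)."""
    d = copy.deepcopy(m2)
    d.cf2x = m2.cf2x - m1.cf2x
    d.cf1x = m2.cf1x - m1.cf1x
    d.cf2t = m2.cf2t - m1.cf2t
    d.cf1t = m2.cf1t - m1.cf1t
    d.n = m2.n - m1.n
    return d
```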
3.3 Applications

Micro-clustering is used in a variety of applications. One prominent example is DenStream [3], where, instead of storing simple sums of timestamps in the feature vector, a decay function is applied to the timestamps in order to calculate a weight, essentially combining micro-clustering with damped data windows (see section 2.4).

Another one is CluStream [1]. Here, micro-clustering is used mainly in combination with pyramidal time frames, which we will discuss in the following chapter.
Chapter 4

Time Frames

In chapter 3 we discussed a way of creating summaries of a data stream. Using the subtractivity property of micro-clusters, we can generate a summary for any arbitrary time window simply by subtracting two summaries of the stream taken at different timestamps. For this purpose, we need a way of storing several such snapshots of summaries.

Even though micro-clusters have very low space requirements, it is still not feasible to store indefinitely many of them. Therefore, we will look at ways of storing snapshots in tilted time frames instead of linearly throughout the streaming process.
4.1 Tilted Time Frames

The general idea of tilted time frames is to store snapshots at different levels of granularity depending on how old they are: the more time has passed, the larger the gap between two consecutive snapshots. Tilted time frames were introduced in [5], where the pattern of storing snapshots is aligned with natural time, i.e. for the most recent quarter of an hour one snapshot per minute is kept, for the most recent hour one snapshot per quarter of an hour, and so on.

Older snapshots have to be maintained regularly in this model. There are several ways of handling the transition from one level of granularity to another, depending on how the snapshots are structured. One way is to look at each granularity level independently and drop the oldest snapshot whenever a new snapshot for that particular level arrives. This concept is used in CluStream, which we will discuss in section 4.2.
Another way is to merge snapshots: snapshots of a finer granularity are accumulated until they contain enough data to form a snapshot for the next coarser level of detail. Using a logarithmic time window, where the time gap between snapshots increases by a factor of 2 from level to level, it can be shown that the amortized number of maintenance operations is limited to O(1) [6].
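One possible shape of such a merge-based scheme is sketched below, assuming additive summaries such as micro-clusters (plain numbers work too); the two-slots-per-level layout is one common variant and not necessarily the exact structure used in [6]:

```python
class LogTiltedTimeFrame:
    """Sketch of a logarithmic tilted time frame: level i holds summaries
    covering 2**i base intervals; when a level overflows, its two oldest
    summaries are merged and promoted to the next coarser level, which
    gives amortized O(1) merge operations per insertion."""

    def __init__(self):
        self.levels = []   # levels[i] is a list of at most two summaries

    def insert(self, summary):
        carry, i = summary, 0
        while carry is not None:
            if i == len(self.levels):
                self.levels.append([])
            self.levels[i].append(carry)
            if len(self.levels[i]) > 2:
                # merge the two oldest summaries and promote the result
                oldest = self.levels[i].pop(0)
                second = self.levels[i].pop(0)
                carry = oldest + second   # assumes additive summaries
                i += 1
            else:
                carry = None
```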
4.2 Pyramidal Time Frames

Aggarwal et al. [1] used a variation of tilted time frames which maintains snapshots in a pyramidal pattern. Pyramidal time frames are loosely based on logarithmic time frames, but allow the user to customize the logarithm's base and the level of detail through parameters $\alpha$ and $\beta$, where $\alpha > 1 \wedge \alpha, \beta \in \mathbb{N}$.

The basic rules for maintaining the different levels are simple: on every level $i$, we store a snapshot whenever the current timestamp is divisible by $\alpha^i$ but not by $\alpha^{i+1}$, to avoid redundancy. At most $\alpha^\beta + 1$ snapshots are kept on any level $i$ by dropping the oldest snapshot whenever a new one arrives. The total number of levels after $T$ time units have elapsed since the beginning equals $\log_\alpha(T)$, and the total number of snapshots thus equals $(\alpha^\beta + 1) \cdot \log_\alpha(T)$.
Level   Snapshots
0       59 57 55 53 51
1       58 54 50 46 42
2       60 52 44 36 28
3       56 40 24 8
4       48 16
5       32

Table 4.1: Snapshots stored at t = 60 for α = β = 2. One can clearly see the distinctive pyramidal shape with rising level.
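The pattern in table 4.1 can be reproduced by directly simulating the two maintenance rules; the following sketch is illustrative only:

```python
def pyramidal_snapshots(t_now, alpha=2, beta=2):
    """Sketch of the pyramidal snapshot pattern of Aggarwal et al. [1].

    A snapshot taken at time t belongs to level i if t is divisible by
    alpha**i but not by alpha**(i + 1); each level keeps at most
    alpha**beta + 1 of its most recent snapshots."""
    capacity = alpha ** beta + 1
    levels = {}
    for t in range(1, t_now + 1):
        # determine the level of the snapshot taken at time t
        i = 0
        while t % (alpha ** (i + 1)) == 0:
            i += 1
        levels.setdefault(i, []).append(t)
        if len(levels[i]) > capacity:
            levels[i].pop(0)   # drop the oldest snapshot on this level
    return levels

# Reproduces table 4.1: pyramidal_snapshots(60) yields
# {0: [51, 53, 55, 57, 59], 1: [42, 46, 50, 54, 58], 2: [28, 36, 44, 52, 60],
#  3: [8, 24, 40, 56], 4: [16, 48], 5: [32]}
```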
Even with such modest space requirements, it can be proven that for any given timestamp $t$, a snapshot can be found within at most $(1 + \frac{1}{\alpha^\beta}) \cdot t$ units of time from the current timestamp [1]. Figure 4.1 demonstrates how the mean relative error of approximating random time windows develops for different values of $\alpha$ and $\beta$; appendix A contains the detailed results of this experiment. As we can see, for instance, $\alpha = 3, \beta = 8$ needs only about 0.2% of the space compared to using no tilted time frames at all, yet the mean error in approximation is still lower than 0.01%, which should be accurate enough for the majority of applications.
Figure 4.1: Mean relative error in approximation of time windows (average relative error in %, logarithmic scale) plotted against β = 2, …, 10 for α = 2, 3, 4, 5.
Chapter 5

Summary

In this paper we have seen several basic methods which can be used to extract useful information from data streams. As the large number of available techniques suggests, there is no single approach suitable for all types of applications. Choosing the right technique for the right application involves taking into account various constraints and properties of the specific application.

In general, though, it is useful to combine some of these techniques in order to track the evolution of a data stream efficiently. Summary techniques (such as data windows or micro-clustering) can be used to get an overview of a stream's state at any given point. These summaries can then be stored periodically (e.g. using tilted time frames) in order to analyze how the state has been changing over time.

Choosing one of the presented methods to gain space and time efficiency necessarily involves a tradeoff: giving up some of the overall level of detail in one's analysis. However, most of the presented methods are designed in such a way that the tradeoff affects only areas where a detailed view is not required anyway, such as very old data segments.

Finally, sometimes the nature of a data stream itself requires giving up a certain amount of precision, because its high volume could not be processed otherwise and one would end up with no information at all.
Appendix A

Measurements on Pyramidal Time Frames

The following numbers are based on a sample stream with a duration of one year at a resolution of one snapshot per second. The reference value (i.e. when not using a tilted time frame) is therefore 365 · 24 · 60 · 60 = 31 536 000 snapshots. The mean error $e_{rel}$ was calculated on the basis of approximating $I = 1\,000\,000$ random time windows. Let $t_c$ denote the current timestamp, $w_i$ the desired window width in test $i$, and $t_{s,i}$ the nearest available snapshot just before $t_c - w_i$. Then

$$e_{rel} = \frac{1}{I} \sum_{i=1}^{I} \frac{\left| (t_c - t_{s,i}) - w_i \right|}{w_i}.$$
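A single test of this experiment might be evaluated as in the following sketch, which assumes a sorted list of stored snapshot timestamps (a reconstruction of the setup, not the original measurement code):

```python
import bisect

def relative_error(snapshots, t_c, w):
    """Relative error of approximating the window [t_c - w, t_c] by the
    nearest stored snapshot taken at or before t_c - w.

    `snapshots` is a sorted list of snapshot timestamps; at least one
    snapshot at or before t_c - w is assumed to exist."""
    idx = bisect.bisect_right(snapshots, t_c - w) - 1
    t_s = snapshots[idx]               # nearest snapshot just before t_c - w
    return abs((t_c - t_s) - w) / w

# e.g. with the level contents of table 4.1 flattened and sorted:
snaps = sorted([59, 57, 55, 53, 51, 58, 54, 50, 46, 42, 60, 52, 44, 36, 28,
                56, 40, 24, 8, 48, 16, 32])
print(relative_error(snaps, t_c=60, w=25))  # nearest snapshot is 32 -> 0.12
```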
α    β    e_rel in %    Number of snapshots
2    2    7.7661        117 (0.0004%)
2    3    4.0351        204 (0.0006%)
2    4    2.1063        370 (0.0012%)
2    5    1.0766        687 (0.0022%)
2    6    0.5443        1290 (0.0041%)
2    7    0.2740        2433 (0.0077%)
2    8    0.1372        4593 (0.0146%)
2    9    0.06871       8657 (0.0275%)
2   10    0.0344        16 274 (0.0516%)
3    2    6.1618        146 (0.0005%)
3    3    2.3056        383 (0.0012%)
3    4    0.7760        1043 (0.0033%)
3    5    0.2606        2862 (0.0091%)
3    6    0.0870        7834 (0.0248%)
3    7    0.0290        21 294 (0.0675%)
3    8    0.0097        57 302 (0.1817%)
3    9    0.0032        152 207 (0.4826%)
3   10    0.0011        397 559 (1.2607%)
4    2    5.0743        194 (0.0006%)
4    3    1.3024        680 (0.0022%)
4    4    0.3327        2433 (0.0077%)
4    5    0.0836        8681 (0.0275%)
4    6    0.02092       30 603 (0.0970%)
4    7    0.0052        106 009 (0.3362%)
4    8    0.0013        358 481 (1.1367%)
4    9    0.0003        1 171 767 (3.7156%)
4   10    7.3822e-05    3 638 481 (11.5375%)
5    2    3.9862        250 (0.0008%)
5    3    0.8326        1088 (0.0035%)
5    4    0.1673        4785 (0.0152%)
5    5    0.0335        20 774 (0.0659%)
5    6    0.0067        88 221 (0.2797%)
5    7    0.0013        362 961 (1.1509%)
5    8    0.0003        1 424 166 (4.5160%)
5    9    4.6318e-05    5 167 692 (16.3866%)
5   10    6.0225e-06    16 072 826 (50.9666%)

Table A.1: Development of the number of snapshots and relative error in approximation.
Bibliography

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, pages 81–92. VLDB Endowment, 2003.

[2] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. 2006.

[3] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 328–339, 2006.

[4] J. H. Chang and W. S. Lee. Finding recent frequent itemsets adaptively over online data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 487–492, New York, NY, USA, 2003. ACM.

[5] Y. Chen, G. Dong, J. Han, B. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 323–334. VLDB Endowment, 2002.

[6] C. Giannella, J. Han, J. Pei, X. Yan, and P. Yu. Mining frequent patterns in data streams at multiple time granularities. Next Generation Data Mining, 212:191–212, 2003.

[7] W. Lee and S. Stolfo. Data mining approaches for intrusion detection. Defense Technical Information Center, 2000.

[8] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD International Conference on Management of Data, volume 1, pages 103–114, 1996.

[9] H. Zhu, Y. Wang, and Z. Yu. Clustering of evolving data stream with multiple adaptive sliding window. In Data Storage and Data Engineering (DSDE), 2010 International Conference on, pages 95–100, February 2010.

[10] Y. Zhu and D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 358–369. VLDB Endowment, 2002.