Download Mining Data Streams with Periodically changing Distributions

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Mixture model wikipedia , lookup

Transcript
Mining Data Streams with
Periodically changing Distributions
Yingying Tao, Tamer Ozsu CIKM’09
Supervisor Dr Koh
Speaker Nonhlanhla Shongwe
April 26, 2010
Preview
 Introduction
 Challenge
 Method
 DMM framework
 Distance Function Selection
 Experiments
 Conclusion
Introduction
 Mining stream for knowledge mining such as
 Clustering
 Classification
 Frequent patterns discovery , has become important
 Important characteristic of unbounded data stream is that the
underlying distributions can show important changes over time,
leading to dynamic data streams.
Challenge
 The problem of mining dynamic data streams
 Balance accuracy with efficiency – highly accurate mining
techniques are generally computationally expensive
 Some question to ask: for dynamic data streams
 Is the distribution changes entirely random and unpredictable
 Is it possible for the distribution changes to follow certain
patterns?
Method
 Propose a method for mining dynamic data streams
 Where important observed distributions patterns are stored
 And compared new detected changes with these patterns
Method
 Two streams with the same distribution
 Mining results such as
 List of all frequent items / itemset for frequent patterns discovery
 Set of clusters / classes for clustering and classification should be the
same
 If distribution change is detected and a match is found
 Possible to skip the re-mining process
 Directly output the mining results for the archived distribution
 This is called the match-and-reuse strategy
Method
 Issues to be resolved
 Pattern selection
 Selecting and storing important distributions that have a high
probability of occurring in the future
 Pattern representation
 Storing each pattern succinctly
 Matching
 Efficient procedure for rapid data streams with high accuracy
DMM framework
 DMM framework stands for Detect, Match and Mine
 Consist of four sequences
 Choosing representative set
 Change detection
 Pattern matching
 Stream mining
 All processes are independent
DMM framework
 Window model
 Generate reference window (choosing representative set)
 Change detection
 Distribution matching (Pattern matching)

Choosing important distribution
Window model
 Two windows on Stream S

Time-based
 Defines the time intervals
 Denoted by Wt
 Called observation window
 Implemented as a tumbling window
 Moves forward at each clock tick
Window model cont’s
 Count-based

Contains a sub stream with fixed number of elements
 Denoted by Wr
 Called reference window
 The size of the reference window (|Wr|) and time intervals of
Wt are predefined values
DMM framework
 Window model
 Generate reference window (choosing representative set)
 Change detection
 Distribution matching (Pattern matching)
 Choosing important distribution
Generate reference window (choosing
representative set)
 Wr stores a set of data elements that represents a current
distribution of S
 The size needs to be small due to memory limitations
 Inaccurate results if we use a small data set to represent a
distribution if the distribution is complicated
 Due to this problem, use a dynamic reference window (Wr)
Generate reference window (choosing
representative set)
 Merge and select process
 Dynamic reference window (Wr)
 Merge Wt and Wr to get a larger substream |Wr|+|Wt|
 Select |Wr| elements from the merged window (Wr +Wt) and
replace the stream in Wr by the new set
 Merge and select process is triggered every time Wt tumbles
Generate reference window (choosing
representative set)
Generate reference window (choosing
representative set)

Selecting representative set: Two-step sampling approach
 First-step sampling approach
 Estimate the density function of Wr + Wt
 K= kernel function
 h= smoothing parameter (bandwidth)
 si= data element in Wr + Wt
Generate reference window (choosing
representative set)

Selecting representative set: Two-step sampling approach


K is set to (Standard Gaussian function mean = 0 variance = 1)
Then the density function

h=value between 0 or 1
Generate reference window (choosing
representative set)
 Selecting representative set: Two-step sampling approach
 With the density function we are able to estimate the “shape” of
the current distribution
 X-axis is the = value of the data s (v(s)) in Wr +Wt
 Y-axis is the probability (p(v)) for all data values
Generate reference window (choosing
representative set)
 Selecting representative set: Two-step sampling approach
 Second-step sampling approach
Generate reference window (choosing
representative set)
 Selecting representative set: Two-step sampling approach
 Second-step sampling approach
 First calculate the start and end values for each partition
DMM framework
 Window model
 Generate reference window (choosing representative set)
 Change detection
 Distribution matching (Pattern matching)
 Choosing important distribution
Change detection
 Online change detection technique that is not restricted to
specific stream processing application
 Wt tumbles, the change detection procedure is triggered
 Compare the distributions of substreams in both Wr and Wt
windows
 If the distance is greater than the predefined maximum matching
distance, then a distribution change is flagged
DMM framework
 Window model
 Generate reference window (choosing representative set)
 Change detection
 Distribution matching (Pattern matching)
 Choosing important distribution
Distribution matching (Pattern matching)
 We use the appropriate distance measure to check their
similarity
 If a match is found, then the persevered mining results are
outputted
 The maximum predefined maximum matching is important
 Smaller implies a higher accuracy
 Larger increases the possibility of a new distribution to match a
pattern in the preserved set
DMM framework
 Window model
 Generate reference window (choosing representative set)
 Change detection
 Distribution matching (Pattern matching)
 Choosing important distribution
Choosing important distribution
 Use heuristic rules
 Distribution that have occurred in the stream for more times are
more important
 The longer a distribution lasts in the streams lifespan the more
important it is
 Distribution that has mining results with higher accuracy is more
important that a distributions with less accurate mining results
Distance Function Selection
 Dynamic Time Wrapping (DTW), Longest Common Subsequence
(LCSS), Edit Distance on Real Sequence (EDR) and Relativized
Discrepancy (RD)
 A proper distance function that can be used with DMM
 Efficient , with the ability to stretching
Experiments
 Change detection
 Kernel density approach (KD)
 Distance function-based approach(DF)
Experiments
 Distribution matching evaluation
 Data from Tropical Atmosphere Ocean
 Sea surface temperatures.
 12 218 streams each with a length of 962
Experiments
 Efficiency with and without DMM
 Adopt a popular decision tree-based clustering technique VFDT to
cluster the temperatures
 Best decision tree generators for dynamic data streams
 Time is reduced by 31.3%
Conclusion
 Introduced a DMM framework to mine dynamic data streams
 Window model
 Generate reference window (choosing representative set)
 Change detection
 Distribution matching (Pattern matching)
 Choosing important distribution
 Experiments that showing DMM performs better
Thank you for your attention