Data Stream Mining
Tore Risch
Dept. of information technology
Uppsala University
Sweden
2016-02-25
Enormous data growth
• Read the landmark article in The Economist, 2010-02-27:
  http://www.economist.com/node/15557443/
• The traditional Moore’s law:
  – Processor speed doubles every 1.5 years
• Current data growth rate significantly higher
  – Data grows 10-fold every 5 years, which is about the same rate as Moore’s law
• Major opportunities:
  – spot business trends
  – prevent diseases
  – combat crime
  – scientific discoveries, the 4th research paradigm
    (http://research.microsoft.com/en-us/collaboration/fourthparadigm/)
  – data-centered economy
• Major challenges:
  – Information overload
  – Scalable data processing, ‘Big Data management’
  – Data security
  – Data privacy
Too much data to store on disk
⇒ Need to mine streaming data
[Cover image: “Mining a swift data river”, The Economist 14-page thematic issue, 2010-02-27]
New applications
• Data comes as huge data streams, e.g.:
  – Satellite data, e.g. from satellite receivers
  – Scientific instruments, e.g. colliders
  – Social networks, e.g. Twitter
  – Stock data
  – Process industry, e.g. equipment in use
  – Traffic control, e.g. car monitoring
  – Patient monitoring, e.g. EKG, flow models
Data Stream Management Systems (DSMS)
• DataBase Management System (DBMS)
  – General purpose software to handle large volumes of persistent data (usually on disk)
  – Important tool for traditional data mining
• Data Stream Management System (DSMS)
  – General purpose software to handle large volume data streams (often transient data)
  – Important tool for data stream mining
Data Base Management System
[Diagram: SQL queries are sent to the DBMS; its query processor uses a data manager that accesses metadata and stored data on disk]
Data Stream Management System
[Diagram: continuous queries (CQs) are sent to the DSMS; its query processor uses a data stream manager with metadata, with data streams flowing in and result streams flowing out]
Data Stream Management System
[Diagram: as above, but with a combined data & stream manager that accesses metadata and stored data in addition to the incoming and outgoing data streams]
Mining streams vs. Databases and files
• Data streams are dynamic and infinite in size
– Data is continuously generated and changing
– Live streams may have no upper limit
– A live stream can be read only once (“Just One Look”)
– The stream rate may be very high
• Traditional data mining does not work for
streaming data:
– Regular data mining based on finite data collections
stored in files or databases
– Regular data mining done in batch:
• access data in collection several times
• analyze accessed data
• store results in database (or file)
– For example, store for each data object what cluster it belongs to
Data stream mining vs.
traditional data mining
• Live stream mining must be done on-line (in real time)
– Not traditional batch processing
• Live streams require main memory processing
– To keep up with very high data flow rates (speed)
• Live streams must be mined with limited memory
– Not load-and-analyze as traditional data mining
– Iterative processing of statistics
• Stream mining must keep up with very high data flow volumes
– Approximate mining algorithms
– Parallel on-line computations
Requirements for Stream Mining
• Single scan of data (“Just One Look”)
  – because of the very large or infinite size of streams
  – because it may be impossible or very expensive to reread the stream for the computation
• Limited memory and CPU usage
– because the processing should be done in main memory despite
the very large stream volume
• Compact, continuously evolving representation of mined data
  – It is not possible to store the mined data in a database as with traditional data mining
  – A compact main-memory representation of the mined data is needed
Iterative stream processing
• Data stream mining requires iterative stream processing
• Regular load–analyze–store mining does not scale
• Read data tuples iteratively from the input stream(s)
  – E.g. measured values
  – Do not store the read data in a database
• The result of data stream mining is iteratively produced as a derived stream
  – Can be seen as a continuously changing database of mined data (statistics, clusters, association rules, etc.)
Differential aggregation
• Data stream mining requires differential aggregation
• Analyze received tuples in streams differentially
• Continuously keep summary results in main memory
  – E.g. running statistics, such as sum and count:
      Sum:   initially sum = 0; for each received x: sum = sum + x
      Count: initially cnt = 0; for each received x: cnt = cnt + 1
• Continuously emit incrementally analyzed results
  – E.g. a running average by dividing sum by count (see the sketch below):
      Sum:   emit sum
      Count: emit cnt
      Avg:   emit sum/cnt
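As an illustration (not from the lecture), a minimal Python sketch of differential aggregation; the function name running_avg is my own. Only the summaries total and cnt are kept in main memory, and an incremental result is emitted for each received value:

    # Hypothetical sketch of differential aggregation.
    # Only the summaries (total, cnt) are kept in main memory.
    def running_avg(stream):
        total = 0.0                  # running sum
        cnt = 0                      # running count
        for x in stream:
            total += x
            cnt += 1
            yield total / cnt        # emit the incremental result

    for avg in running_avg([3.0, 5.0, 10.0]):
        print(avg)                   # 3.0, 4.0, 6.0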
Incremental stream aggregation
Incremental stream statistics
• Iterative, numerically stable one-pass solution (Welford’s algorithm; a runnable version follows below):

    set N = 0                            // incremental count
    set M = 0                            // incremental mean
    set S = 0                            // incremental sum of squared deviations
    for each number Xi in X do
        set N = N + 1
        set Mnew = M + (Xi - M)/N        // update the mean incrementally
        set S = S + (Xi - M)*(Xi - Mnew) // update the sum of squares
        set M = Mnew
    return sqrt(S/(N-1))                 // sample standard deviation
• Similar numerically stable, incremental algorithms are needed for other stream computations as well.
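A runnable Python version of the recurrence above, as a sketch (the class name RunningStd is my own):

    # Runnable sketch of the one-pass (Welford) recurrence above.
    class RunningStd:
        def __init__(self):
            self.n = 0               # incremental count
            self.m = 0.0             # incremental mean
            self.s = 0.0             # incremental sum of squared deviations

        def add(self, x):
            self.n += 1
            m_new = self.m + (x - self.m) / self.n
            self.s += (x - self.m) * (x - m_new)
            self.m = m_new

        def stdev(self):
            # sample standard deviation; undefined for n < 2
            return (self.s / (self.n - 1)) ** 0.5

    rs = RunningStd()
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        rs.add(x)
    print(rs.m, rs.stdev())          # mean 5.0, sample stdev ~2.14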
Streams vs. regular data
• Stream processing should keep up with data
flow
– Make computations so fast that they keep up with the
flow (on average)
  – It should not be catastrophic if the miner cannot keep up with the flow:
    • Drop input data values
    • Use approximate computations such as sampling (see the sketch below)
• Asynchronous logging to (many) files is often possible
  – At least for a limited time (log files)
  – Perhaps of processed (reduced) streaming data
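One standard approximate computation for streams that outrun the miner is sampling. As a sketch (this particular algorithm is not described in the slides; the function name is my own), reservoir sampling keeps a fixed-size uniform random sample of a stream of unknown, possibly unbounded length:

    import random

    # Sketch: keep k uniformly chosen elements from a stream
    # of unknown length, using O(k) memory.
    def reservoir_sample(stream, k):
        reservoir = []
        for i, x in enumerate(stream):
            if i < k:
                reservoir.append(x)      # fill the reservoir first
            else:
                j = random.randint(0, i) # element kept with probability k/(i+1)
                if j < k:
                    reservoir[j] = x
        return reservoir

    print(reservoir_sample(range(10**6), 5))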
Requirements for Stream Mining
• Allow for continuous evolution of mined data
  – Traditional batch mining is for static data
  – Continuous mining makes the mined data into a stream too
  ⇒ Concept drift
• Often mining over different kinds of windows over streams
  – E.g. sliding or tumbling windows
  – Windows of limited size
  – Often only statistical summaries (synopses, sketches) are needed
Stream windows
• A limited-size section of the stream stored temporarily in the DSMS
  – ’Regular’ database queries can be made over these windows
• Need a window operator to chop the stream into segments
• Window size (sz) based on:
  – Number of elements, a counting window
    • E.g. the last 10 elements, i.e. the window has a fixed size of 10 elements
  – A time window
    • E.g. the elements from the last second, i.e. the window contains all events processed during the last second (see the sketch below)
  – A landmark window
    • All events from time t0 are in the window, cf. a growing log file
  – A decaying window
    • Decrease the importance of a measurement by multiplying with a factor λ
    • Remove it when its importance falls below a threshold
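For illustration, a sketch of a time window in Python (the class and names are my own): elements older than the window range are evicted as new elements arrive.

    from collections import deque

    # Sketch of a time window: keep only (timestamp, value) pairs
    # from the last range_sec seconds.
    class TimeWindow:
        def __init__(self, range_sec):
            self.range_sec = range_sec
            self.buf = deque()

        def insert(self, t, value):
            self.buf.append((t, value))
            while self.buf and self.buf[0][0] <= t - self.range_sec:
                self.buf.popleft()       # evict expired elements

        def contents(self):
            return [v for (_, v) in self.buf]

    w = TimeWindow(1.0)                  # one-second window
    for t, v in [(0.0, 'a'), (0.5, 'b'), (1.2, 'c')]:
        w.insert(t, v)
    print(w.contents())                  # ['b', 'c']: 'a' has expired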
Stream windows
• Windows may also have a stride (str)
  – A rule for how fast the window moves forward, e.g. for a 10-element counting window:
    • A stride of 10 elements gives a tumbling window
    • A stride of 2 elements gives a sliding window
    • A stride of 100 elements gives a sampling window
  (see the sketch below)
• Windows need not always be materialized
  – E.g. often sufficient to keep only statistics materialized
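A sketch of a counting window with a stride (sz and str follow the slide’s names; str is spelled str_ to avoid shadowing Python’s built-in): str_ == sz gives a tumbling window, str_ < sz a sliding window, and str_ > sz a sampling window.

    # Sketch: counting windows of size sz moved forward str_ elements at a time.
    def count_windows(stream, sz, str_):
        buf, skip = [], 0
        for x in stream:
            if skip:                     # in the gap of a sampling window
                skip -= 1
                continue
            buf.append(x)
            if len(buf) == sz:
                yield list(buf)          # emit the current window
                if str_ >= sz:
                    buf, skip = [], str_ - sz
                else:
                    del buf[:str_]       # slide forward by the stride

    for w in count_windows(range(6), sz=4, str_=2):
        print(w)                         # [0, 1, 2, 3] then [2, 3, 4, 5]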
Continuous (standing) queries over
streams from expressways
Schema for stream CarLocStr of tuples:

  CarLocStr(car_id,   /* unique car identifier        */
            speed,    /* speed of the car             */
            exp_way,  /* expressway: 0..10            */
            lane,     /* lane: 0,1,2,3                */
            dir,      /* direction: 0(east), 1(west)  */
            x-pos);   /* coordinate on the expressway */
CQL query to continuously get the cars in a 30-second window of the stream:

  SELECT DISTINCT car_id
  FROM CarLocStr [RANGE 30 SECONDS];
Get the average speed of vehicles per expressway, direction, and segment every 5 minutes:

  SELECT exp_way, dir, seg, AVG(speed) AS speed
  FROM CarSegStr [RANGE 5 MINUTES]
  GROUP BY exp_way, dir, seg;
• Linear Road benchmark: http://www.cs.brandeis.edu/~linearroad/
Denstream
• Streamed DBSCAN
• Published: 2006 SIAM Conf. on Data Mining
  (http://user.it.uu.se/~torer/DenStream.pdf)
• Regular DBSCAN:
  – DBSCAN computes the cluster membership of each object in a static database by scanning the database for pairs of objects close to each other
  – The database is accessed many times
  – For scalable processing, a spatial index must be used to index the points in hyperspace and answer nearest-neighbor queries
Denstream
• Denstream
  – One-pass processing
  – Limited memory
  – Evolving clustering ⇒ concept drift, i.e. no static cluster membership
  – Indefinite stream ⇒ cluster memberships not stored in database objects
  – No assumption about the number of clusters
    • Transient clusters fade in and out of a decaying window
  – Clusters of arbitrary shape allowed
  – Good at handling outliers
Core micro-clusters
• Core point: an ‘anchor’ in a cluster of other points
• Core micro-cluster: an area covering the points close to (epsilon-similar to) a core point
• A cluster is defined as a set of c-micro-clusters
Potential micro-clusters
• Outlier o-micro-cluster
  – A new point not included in any micro-cluster
• Potential p-micro-cluster
  – Several clustered points, not yet large enough to form a core micro-cluster
• When a new data point arrives (see the sketch below):
  1. Try to merge it with the nearest p-micro-cluster
  2. Otherwise, try to merge it with the nearest o-micro-cluster
     • If so (and the o-micro-cluster has grown large enough), convert it to a p-micro-cluster
  3. Otherwise, make a new o-micro-cluster
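As a simplified, runnable sketch of this arrival step (the class, names, and radius formula are mine and abbreviate the paper’s definitions; weight decay is omitted here):

    import math

    class MicroCluster:
        def __init__(self, p):
            self.n = 1                       # weight (decay ignored here)
            self.ls = list(p)                # per-dimension linear sums
            self.ss = [x * x for x in p]     # per-dimension squared sums

        def center(self):
            return [s / self.n for s in self.ls]

        def radius(self):
            # root of the summed per-dimension variances
            var = sum(ss / self.n - (ls / self.n) ** 2
                      for ls, ss in zip(self.ls, self.ss))
            return math.sqrt(max(var, 0.0))

        def add(self, p, sign=1):
            self.n += sign
            for i, x in enumerate(p):
                self.ls[i] += sign * x
                self.ss[i] += sign * x * x

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def try_merge(clusters, p, eps):
        if not clusters:
            return None
        c = min(clusters, key=lambda c: dist(c.center(), p))
        c.add(p)
        if c.radius() <= eps:                # merge keeps the cluster compact
            return c
        c.add(p, sign=-1)                    # undo: radius would grow too large
        return None

    def on_new_point(p, p_clusters, o_clusters, eps, beta_mu):
        if try_merge(p_clusters, p, eps):    # 1. nearest p-micro-cluster
            return
        c = try_merge(o_clusters, p, eps)    # 2. nearest o-micro-cluster
        if c:
            if c.n > beta_mu:                # grown enough: promote to p
                o_clusters.remove(c)
                p_clusters.append(c)
            return
        o_clusters.append(MicroCluster(p))   # 3. new o-micro-cluster

    p_cs, o_cs = [], []
    for pt in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]:
        on_new_point(pt, p_cs, o_cs, eps=0.5, beta_mu=1.5)
    print(len(p_cs), len(o_cs))              # 1 promoted cluster, 1 outlier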
Decaying p-micro-cluster windows
• Maintain a weight Cp per p-micro-cluster
• Periodically (each Tp time period) decrease the weight exponentially by multiplying the old weight with a factor λ < 1
• Weight lower than a threshold ⇒ delete the p-micro-cluster, i.e. a decaying window
• Gives a decaying window of micro-clusters (see the sketch below)
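A minimal sketch of the decay step following the slide’s scheme (multiply each weight by a factor λ < 1 every Tp period and delete clusters that fall below a threshold; the DenStream paper itself expresses this with the fading function 2^(-λt)). The class and names are mine:

    class WeightedCluster:                   # tiny stand-in micro-cluster
        def __init__(self, w):
            self.weight = w

    def decay_step(clusters, lam, threshold):
        survivors = []
        for c in clusters:
            c.weight *= lam                  # exponential decay, lam < 1
            if c.weight >= threshold:
                survivors.append(c)          # below the threshold => deleted
        clusters[:] = survivors

    cs = [WeightedCluster(1.0), WeightedCluster(0.11)]
    for _ in range(3):                       # three Tp periods with lam = 0.5
        decay_step(cs, lam=0.5, threshold=0.1)
    print([c.weight for c in cs])            # [0.125]: the light cluster was pruned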
Dealing with outliers
• o-micro-clusters are important for forming new evolving p-micro-clusters
  ⇒ Keep o-micro-clusters around
• Keeping all o-micro-clusters may be expensive
  ⇒ Delete o-micro-clusters by a special exponential pruning rule (decaying window)
• The decaying window method is proven to make the number of micro-clusters grow logarithmically with the stream size
  – Good, but not sufficient for an indefinite stream
  – Shown to grow very slowly, though
Growth of micro-clusters
[Plot: growth of the number of micro-clusters with stream length]
Forming c-micro-cluster sets
• Regularly (e.g. each time period) the user demands the formation of the current c-micro-clusters from the current p-micro-clusters
• Done by running regular DBSCAN over the p-micro-clusters (see the sketch below)
  – The center of each p-micro-cluster is regarded as a point
  – Two p-micro-clusters are close when they intersect
⇒ Clusters formed
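A sketch of this offline step (assumed representation: each p-micro-cluster as a dict with a center and a radius, names mine): micro-clusters are connected when they intersect, and the connected components of that graph are the resulting clusters.

    import math

    def intersects(a, b):
        d = math.dist(a['center'], b['center'])
        return d <= a['radius'] + b['radius']

    # DFS over the intersection graph; each component is one cluster.
    def form_clusters(mcs):
        clusters, seen = [], set()
        for i in range(len(mcs)):
            if i in seen:
                continue
            comp, stack = [], [i]
            while stack:
                j = stack.pop()
                if j in seen:
                    continue
                seen.add(j)
                comp.append(mcs[j])
                stack.extend(k for k in range(len(mcs))
                             if k not in seen and intersects(mcs[j], mcs[k]))
            clusters.append(comp)
        return clusters

    mcs = [{'center': (0, 0), 'radius': 1.0},
           {'center': (1.5, 0), 'radius': 1.0},   # intersects the first
           {'center': (9, 9), 'radius': 1.0}]
    print([len(c) for c in form_clusters(mcs)])   # [2, 1]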
Bloom-filters
• Problem: testing membership in extremely large data sets
  – E.g. all non-spam e-mail addresses
• No false negatives, i.e. if an address is in the set, then OK is guaranteed
• Few false positives allowed, i.e. a small number of spams may sneak through
See http://infolab.stanford.edu/~ullman/mmds/ch4.pdf, section 4.3.2
Bloom-filters
• Main idea:
  – Assume a bitmap B of size s bits
  – Hash each object x to h in [1,s]
  – Set bit B[h]
• Smaller than a sorted table in memory:
  – 10^9 email addresses of 40 bytes ⇒ 40 GBytes if the set is stored sorted in memory
    • Would be expensive to extend
  – The bitmap could have e.g. 10^9 bits / 8 = 125 MBytes
• May have false positives
  – Since the hash function is not perfect
Lowering false positives
• Small bitmap ⇒ many false positives
• Idea: hash with several independent hash functions h1(x), h2(x), ... and set all the corresponding bits (logical OR)
• For each new x, check that all hi(x) are set (see the sketch below)
  – If so ⇒ match
  – The chance of false positives decreases exponentially with the number of hi
• Assumes independent hi(x)
  – hi(x) and hj(x) have no common factors if i ≠ j
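A minimal runnable Bloom filter sketch (deriving the k hash functions by salting MD5 is an illustrative assumption, not a recommendation from the slides):

    import hashlib

    # Sketch: s-bit Bloom filter with k salted-MD5 hash functions.
    class BloomFilter:
        def __init__(self, s, k):
            self.s, self.k = s, k
            self.bits = bytearray((s + 7) // 8)

        def _hashes(self, x):
            for i in range(self.k):
                h = hashlib.md5(f"{i}:{x}".encode()).digest()
                yield int.from_bytes(h, "big") % self.s

        def add(self, x):
            for h in self._hashes(x):
                self.bits[h // 8] |= 1 << (h % 8)    # set bit B[h]

        def may_contain(self, x):
            # False means definitely absent (no false negatives);
            # True may be a false positive.
            return all(self.bits[h // 8] & (1 << (h % 8))
                       for h in self._hashes(x))

    bf = BloomFilter(s=8 * 1024, k=3)
    bf.add("alice@example.com")
    print(bf.may_contain("alice@example.com"))       # True
    print(bf.may_contain("mallory@example.com"))     # almost surely False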
Articles
L. Golab and M. T. Özsu: "Issues in Data Stream Management", ACM SIGMOD Record, 32(2), June 2003, http://www.acm.org/sigmod/record/issues/0306/1.golabozsu1.pdf

M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy: "Mining Data Streams: A Review", ACM SIGMOD Record, 34(2), June 2005, pp. 18–26.