
Extending DSMS for Data Stream Mining
CS240B Notes
by Carlo Zaniolo, UCLA CSD
1
Data Streams
 Continuous, unbounded, rapid, time-varying
streams of data elements
 Occur in a variety of modern applications
Network monitoring and traffic engineering
Sensor networks, RFID tags
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
 DSMS = Data Stream Management System
2
Many Research Projects …
 Amazon/Cougar (Cornell) – sensors
 Aurora (Brown/MIT) – sensor monitoring, dataflow
 Hancock (AT&T) – Telecom streams
 Niagara (OGI/Wisconsin) – Internet DBs & XML
 OpenCQ (Georgia) – triggers, view maintenance
 Stream (Stanford) – general-purpose DSMS
 Tapestry (Xerox) – publish/subscribe filtering
 Telegraph (Berkeley) – adaptive engine for sensors
 Gigascope: AT&T Labs – Network Monitoring
 Stream Mill (UCLA) - power & extensibility
3
Technology Challenges
 Data Models
Relational Streams--but XML streams important too
Tuple Time-Stamping
Order is important
Windows
 Query Languages: Extensions of SQL or XQuery
 To support continuous (i.e., persistent) queries on transient data: a reversal of the usual roles of queries and data.
 Blocking operators excluded
 Query Plans:
 New execution models (main memory oriented)
 Optimized scheduling for response time or memory
 Quality of Service (QoS) & Approximation
 Synopses
 Sampling
 Load shedding.
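As one illustration of the sampling bullet above, here is a minimal sketch of reservoir sampling, a standard way to keep a fixed-size uniform sample of an unbounded stream in bounded memory. The function name and parameters are illustrative assumptions and are not taken from any DSMS discussed in these notes.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Memory use depends only on k, not on the stream length, which is why a synopsis of this kind suits continuous queries.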
4
Commercial Developments
 Several Startups
Streambase,
Coral8,
Apama, and
Truviso.
 Oracle and DBMS companies
Publish/subscribe
Complex Event Processing (CEP)
 Limitations: only simple applications, e.g. continuous queries expressed in SQL.
 No support for data stream mining queries.
5
Data Stream Mining
 Many applications: click stream analysis, intrusion
detection,...
 Many fast & light algorithms developed for stream mining.
 Ensembles, Moment, SWIM, etc.
 Analyst should be able to focus on high-level mining tasks.
 Leaving QoS and lower-level issues to the system.
 Integration of mining methods into Data Stream
Management Systems (DSMS) is required
 Many research challenges.
 Stream Mill Miner (SMM) is the first DSMS designed
for that.
6
Data Stream Management Systems (DSMS)
 Data stream mining applications so far ignored by
DSMS … although
A. DSMS technology is required for data stream mining
 QoS, query scheduling, synopses, sampling, windows,
...
B. But supporting DM applications is difficult since
current DSMS only support simple query languages
based on SQL.
Conclusion: either a shotgun wedding ... or a research
breakthrough is needed here!
7
A Difficult Problem: the Inductive DBMS
Experience
 Initial attempts to support mining queries in relational
DBMS: Unsuccessful
 OR-DBMS do not fare much better [Sarawagi’ 98].
 In 1996, Imielinski & Mannila proposed the ‘high-road’ approach, calling for a quantum leap in functionality:
 High-level declarative languages for DM .
 Extensions for query processing and optimization.
 The research area of Inductive DBMS was thus born
 Inspired DMQL, Mine Rule, MSQL, etc.
 Suffer from limited generality and performance issues.
8
DBMS Vendors
 Vendors have taken a `low-road’ approach.
 A library of mining functions using a cache-mining approach
 IBM DB2 Intelligent Miner
 Oracle Data Miner
 MS OLE DB for DM: mining models
 Closed systems,
 Lacking in coverage and user-extensibility.
 Not as popular as dedicated, stand-alone mining systems,
such as Weka
9
Weka
 A comprehensive set of mining algorithms, and tools.
 Generic algorithms over arbitrary data sets.
 Independent of the number of columns in tables.
 Open and extensible system based on Java.
These are the features that we want in our SMM—
starting from SQL rather than Java!
Not an easy task ...why?
10
SMM Contributions
 Build on Stream Mill DSMS and its SQL-based continuous
query language and enabling technology.
 Language and System Extensions:
 Genericity,
 Extensibility, and
 Performance
 A suite of stream mining algorithms.
 Existing ones and
 Newly developed in this project—e.g., SWIM.
 High level mining model for better
 Usability
 Control of mining process.
11
From SQL to Online Mining in SMM:
step by step
 Naïve Bayesian Classifier (NBC).
 Important and frequently used.
 Schema-specific NBC: simple to express in SQL with count and sum aggregates, but a generic NBC is still preferable.
 Genericity: one function, independent of the number of columns involved.
 Schema independence in SQL?
12
Genericity
 Weka
 Arrays of type real.
 SMM
 Verticalization.
 Similar arrays, but in tables.
 Built-in table function to reduce any table to this form.
 Thus, generic UDAs work with this schema.
 Further improvements are also supported in SMM.
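The verticalization idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not SMM code: `verticalize` and `GenericNBC` are hypothetical names, rows are assumed to be dicts of categorical values, and Laplace smoothing is an added choice that the slides do not mention. The point is that the learner only sees (column, value, class) triples, so it works unchanged for any number of columns.

```python
from collections import defaultdict


def verticalize(row, class_value):
    """Turn one wide row (dict of column -> value) into (column, value, class) triples."""
    return [(col, val, class_value) for col, val in row.items()]


class GenericNBC:
    """Naive Bayes learner that only ever sees (column, value, class) triples."""

    def __init__(self):
        self.class_counts = defaultdict(int)  # class -> number of training rows
        self.value_counts = defaultdict(int)  # (column, value, class) -> count
        self.values_seen = defaultdict(set)   # column -> distinct values observed

    def learn(self, triples):
        """Absorb the triples of ONE training row (they all share one class)."""
        triples = list(triples)
        if not triples:
            return
        self.class_counts[triples[0][2]] += 1
        for col, val, cls in triples:
            self.value_counts[(col, val, cls)] += 1
            self.values_seen[col].add(val)

    def classify(self, row):
        """Return the most probable class for a row, using Laplace-smoothed counts."""
        total = sum(self.class_counts.values())
        best_cls, best_score = None, -1.0
        for cls, n_cls in self.class_counts.items():
            score = n_cls / total  # class prior
            for col, val in row.items():
                n_vals = len(self.values_seen[col]) or 1
                seen = self.value_counts.get((col, val, cls), 0)
                score *= (seen + 1) / (n_cls + n_vals)
            if score > best_score:
                best_cls, best_score = cls, score
        return best_cls
```

A caller would feed each training row through `verticalize` and pass the result to `learn`; adding or removing a column changes the input data but not the learner.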
13
Extensibility?
 Most mining tasks cannot be implemented in SQL.
 Solution: Define complex functions by User Defined
Aggregates (UDAs)
 Complex mining tasks can be viewed as aggregates
 UDAs Natively defined in SQL make the language
computationally complete [Wang’ 04]
 Turing-complete over static data
 Non-blocking complete over data streams
 Natural extensions to support windows and delta
computations for data streams [Bai’ 06]
 UDAs can be defined in a PL, for better performance
14
Windowed UDA Example – Continuous Sum
WINDOW AGGREGATE sum(val REAL) : REAL {
    TABLE state (tot REAL);
    INITIALIZE: {
        /* the first tuple seeds the running total */
        INSERT INTO state VALUES (val);
    }
    ITERATE: {
        /* each new tuple is added to the total */
        UPDATE state SET tot = tot + val;
    }
    EXPIRE: {
        /* for efficient differential computation: subtract the tuple leaving the window */
        UPDATE state SET tot = tot - oldest().val;
    }
    /* No TERMINATE state */
}
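The same INITIALIZE / ITERATE / EXPIRE pattern can be sketched in Python. This is only an analogy, under the assumption of a count-based window; `WindowedSum` and `push` are hypothetical names, and the class is not part of Stream Mill.

```python
from collections import deque


class WindowedSum:
    """Sliding-window sum over the last `size` tuples, maintained differentially."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # tuples currently inside the window
        self.tot = 0.0         # plays the role of the `state` table

    def push(self, val):
        if not self.window:
            self.tot = val     # INITIALIZE: the first tuple seeds the state
        else:
            self.tot += val    # ITERATE: add the newly arrived tuple
        self.window.append(val)
        if len(self.window) > self.size:
            # EXPIRE: subtract the tuple leaving the window
            self.tot -= self.window.popleft()
        return self.tot
```

Each arrival costs one addition and at most one subtraction, instead of re-summing the whole window.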
15
Online Mining in SMM
 UDAs are invoked with the standard SQL:2003 syntax of OLAP functions.
SELECT learn(ts.Column, ts.Value, t.dec)
    OVER (ROWS 1000 PRECEDING)
FROM trainingstream AS t,
    TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts
 Powerful framework:
 Concept drifts-shifts
 Association rule mining
16
The Slide Construct
 A window can be divided into panes; each pane is called a slide.
 Tumbling windows result when the size of the slide is equal to or larger than that of the window.
 The slide/window combination is well suited to data stream mining.
 A simple construct was added to support slides in UDAs.
 This allowed us to build a flexible and efficient library of data stream mining UDAs.
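A minimal Python sketch of the window/slide combination follows. It assumes count-based windows whose size is a whole multiple of the slide size; the function name `sliding_sums` is hypothetical, and this is not the ESL construct itself. One partial aggregate is kept per slide, so a whole slide expires with a single subtraction and a result is produced once per slide.

```python
from collections import deque


def sliding_sums(stream, window, slide):
    """Yield the sum of the last `window` values once every `slide` values."""
    if window % slide != 0:
        raise ValueError("this sketch assumes the window is a whole number of slides")
    max_panes = window // slide
    panes = deque()  # one partial sum per completed slide still in the window
    current, filled, total = 0.0, 0, 0.0
    for val in stream:
        current += val
        filled += 1
        if filled == slide:             # a slide has just completed
            panes.append(current)
            total += current
            if len(panes) > max_panes:  # the oldest slide expires as a unit
                total -= panes.popleft()
            current, filled = 0.0, 0
            yield total
```

With `slide == window` only one pane is kept, which gives the tumbling-window case. Results emitted before the first window fills cover only the slides seen so far.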
17
Association Rule Mining
 SWIM [Mozafari’ 08]: maintains frequent patterns over large windows with slides.
 Differentially computes frequent patterns as slides enter (or expire out of) the window.
 Uses efficient ‘verifiers’ based on conditional counting.
 Trade-off between delay and performance.
 Performance gain over existing algorithms.
19
SWIM (Sliding Window Incremental Miner)
 If a pattern p is frequent in a window, it must be frequent in at least one of its slides, so SWIM keeps the union of the frequent patterns of all slides in a pattern tree (PT).
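[Figure: window W4 (slides S4 to S6) advances to W5 (slides S5 to S7). Frequencies are counted/updated for the expired slide S4 and the new slide S7; S7 is mined by the mining algorithm, F7 is added to PT, and PT is pruned, so PT changes from F4 ∪ F5 ∪ F6 to F5 ∪ F6 ∪ F7.]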
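The property stated on this slide (a pattern frequent in the window must be frequent in at least one slide) can be illustrated with a deliberately simplified Python sketch. It is not the SWIM algorithm: it recounts candidates over the whole window instead of updating counts differentially, it has no verifiers or delayed reporting, and it uses a brute-force miner limited to itemsets of at most two items. All names are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def frequent_patterns(transactions, min_support, max_len=2):
    """Brute-force miner: itemsets of up to max_len items with support >= min_support."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for n in range(1, max_len + 1):
            counts.update(combinations(items, n))
    threshold = min_support * len(transactions)
    return {p for p, c in counts.items() if c >= threshold}


class SlidingPatternMiner:
    def __init__(self, slides_per_window, min_support):
        self.min_support = min_support
        # Each entry is (transactions, frequent patterns) for one slide;
        # appending to a full deque drops the oldest (expired) slide.
        self.slides = deque(maxlen=slides_per_window)

    def add_slide(self, transactions):
        transactions = [frozenset(t) for t in transactions]
        # Mine only the new slide.
        self.slides.append(
            (transactions, frequent_patterns(transactions, self.min_support))
        )
        # PT: union of the per-slide frequent patterns still inside the window.
        pt = set().union(*(freq for _, freq in self.slides))
        window = [t for txs, _ in self.slides for t in txs]
        threshold = self.min_support * len(window)
        # Check every candidate in PT against the whole window.
        return {
            p for p in pt
            if sum(1 for t in window if t.issuperset(p)) >= threshold
        }
```

Because PT is a superset of the window's frequent patterns, counting only the candidates in PT is enough; no pattern outside PT can reach the window threshold.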
20
Concept Drifts/Shifts—Complex Processes
 Ensemble based methods.
 Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].
 Generic support, e.g. for adaptive boosting.
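To make the ensemble idea concrete, here is a minimal Python sketch of a chunk-based weighted ensemble. It is an illustrative assumption, not the weighting scheme of any cited paper and not adaptive boosting: one toy classifier is trained per chunk, older members are re-weighted by their accuracy on the newest labelled chunk, and the newest member is simply given weight 1.0. All names are hypothetical, and chunks are assumed to be non-empty lists of (x, y) pairs.

```python
from collections import Counter, defaultdict, deque


def train_lookup(chunk):
    """Toy base learner: majority label per input value, with a global fallback."""
    per_x = defaultdict(Counter)
    overall = Counter()
    for x, y in chunk:
        per_x[x][y] += 1
        overall[y] += 1
    table = {x: c.most_common(1)[0][0] for x, c in per_x.items()}
    default = overall.most_common(1)[0][0]
    return lambda x: table.get(x, default)


class WeightedEnsemble:
    def __init__(self, max_members=3):
        # Each member is [classifier, weight]; the oldest is dropped when full.
        self.members = deque(maxlen=max_members)

    def update(self, chunk):
        # Re-weight existing members by their accuracy on the newest chunk,
        # so classifiers trained on an outdated concept lose influence.
        for member in self.members:
            clf = member[0]
            member[1] = sum(clf(x) == y for x, y in chunk) / len(chunk)
        self.members.append([train_lookup(chunk), 1.0])

    def predict(self, x):
        votes = Counter()
        for clf, weight in self.members:
            votes[clf(x)] += weight
        return votes.most_common(1)[0][0]
```

After a concept shift, members trained before the shift score poorly on the new chunk and are effectively outvoted without being retrained.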
21
Built-in Online Mining Algorithms In SMM
 Online classifiers
 Naïve Bayesian
 Decision Tree
 K-nearest Neighbor
 Online clustering
 DBScan [Ester’ 96]
 IncDBScan
 Windowed K-means*
 DenStream* [Cao’ 06]
 CluStream

Already supported
 Association rule mining
 Approximate frequent
items
 SWIM [Mozafari’ 08]
 Moment [Chi’ 04]
 AFPIM
 Time series/sequence
queries
 SQL-TS [Sadri’ 01]
 Many more …

To be supported
22
Usability?
 Complex SQL queries are needed to invoke built-in and user-defined mining algorithms.
 An open and extensible system
 Most analysts would prefer a high-level mining language that
 supports uniform invocation of built-in and user-defined mining algorithms (no SQL required)
 describes the workflow of the mining process
 is also open and extensible, to incorporate newly defined mining algorithms.
24
Example: Defining a Mining Model
CREATE MODEL TYPE NaiveBayesianClassifier {
    SHAREDTABLES (DescriptorTbl),
    Learn (UDA LearnNaiveBayesian,
        WINDOW TRUE,
        PARTABLES(),   % names of param tables required by the method
        PARAMETERS()   % additional parameters to be specified for input
    ),
    Classify (UDA ClassifyNaiveBayesian,
        WINDOW TRUE,
        PARTABLES(),
        PARAMETERS()
    )
};
25
Example: Using a Mining Model
 Creating an instance:
CREATE MODEL INSTANCE NaiveBayesianInstance
AS NaiveBayesianClassifier;
 Uniform invocation of mining tasks:
RUN NaiveBayesianInstance.Learn WITH TrainingSet;
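The model type / model instance / RUN pattern can be mimicked in Python to show what uniform invocation buys. This is only a sketch of the idea, not how SMM implements mining models: `ModelType`, `ModelInstance`, and the trivial majority-class tasks are hypothetical stand-ins, with plain functions playing the role of UDAs and a dict playing the role of the shared tables.

```python
from collections import Counter


class ModelType:
    """Named bundle of mining tasks, each bound to a function standing in for a UDA."""

    def __init__(self, name, **tasks):
        self.name = name
        self.tasks = tasks


class ModelInstance:
    def __init__(self, model_type):
        self.model_type = model_type
        self.shared = {}  # stands in for the SHAREDTABLES of the model

    def run(self, task, data):
        # Uniform invocation: look the task up by name and hand it the shared state.
        return self.model_type.tasks[task](self.shared, data)


def learn_majority(shared, rows):
    """Hypothetical learning task: remember the most common label."""
    shared["majority"] = Counter(label for _, label in rows).most_common(1)[0][0]


def classify_majority(shared, rows):
    """Hypothetical classification task: predict the remembered label for every row."""
    return [shared["majority"] for _ in rows]


majority_type = ModelType(
    "MajorityClassifier", Learn=learn_majority, Classify=classify_majority
)
instance = ModelInstance(majority_type)
instance.run("Learn", [("sunny", "no"), ("rain", "yes"), ("overcast", "yes")])
```

A new algorithm is added by registering another `ModelType`; callers keep using the same `run(task, data)` form, which mirrors the RUN statement above.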
26
Performance
SMM Vs. Weka
NBC and decision tree classifier
 Datasets [UCI]
• Iris: 5 attributes
• Heart disease: 13 attributes
Overhead of integrating algorithms into
SMM
The SWIM algorithm standalone vs. integrated
 Dataset [IBM Quest]
• Trans len 20, Pattern len 5, Tuples 50K
27
Comparison with Weka: NBC-Iris
28
Comparison with Weka: NBC-HD
29
Comparison with Weka: Decision Tree - Iris
30
Integration Overhead: Integrated SWIM vs.
Standalone SWIM
31
The Stream Mill System
 One server, multiple clients
 Server (on Linux): hosts the ESL language and manages storage
and continuous queries
 Client (Java based GUI): allows the user to specify streams,
queries, etc.
32
Conclusion
 SMM integrates new solutions for several difficult
problems:
 Usability by high-level mining models
 Extensibility by user-defined mining models that call on
UDAs with windows
 Suite of built-in data stream mining UDAs
 Generic mining UDAs by Verticalization & other techniques
 Performance
 SMM is the first of its kind: more and better
systems will follow in its footsteps.
33
Future Work
Faster & lighter mining algorithms
E.g. online algorithms for clustering
Integration of other mining algorithms
Data flow in mining models
Similar solution for databases
34
Thank you!
35
References
 [Arasu’ 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous
sliding-window aggregates. In VLDB, pages 336–347, 2004.
 [Babcock’ 02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom.
Models and issues in data stream systems. In PODS, 2002.
 [Bai’ 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo.
A data stream language and system designed for power and extensibility. In
CIKM, pages 337–346, 2006.
 [Cao’ 06] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over
an evolving data stream with noise. To appear in Proceedings of SIAM 2006.
 [Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining
closed frequent itemsets over a stream sliding window. In Proceedings of the
2004 IEEE International Conference on Data Mining (ICDM’04), November 2004.
 [Chu’ 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of
data streams. In PAKDD, volume 3056, 2004.
 [Ester’ 96] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A
density-based algorithm for discovering clusters in large spatial databases with
noise. In Second International Conference on Knowledge Discovery and Data
Mining, pages 226–231, 1996.
 [Forman’ 06] George Forman. Tackling concept drift by temporal inductive
transfer. In SIGIR, pages 252–259, 2006.
 [Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database
perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.
 [Law’ 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and
query language for data streams. In VLDB, pages 492–503, 2004.
 [Mozafari’ 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo.
Verifying and mining frequent patterns from large windows over data
streams. In International Conference on Data Engineering (ICDE), 2008.
 [Sadri’ 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi.
Optimization of sequence queries in database systems. In PODS, Santa
Barbara, CA, May 2001.
 [Sarawagi’ 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
association rule mining with relational database systems: Alternatives and
implications. In SIGMOD, 1998.
 [UCI-MLR] http://archive.ics.uci.edu/ml/datasets.html
 [Wang’ 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting
data streams using ensemble classifiers. In SIGKDD, 2003.
37