SMM: A Data Stream
Management System for
Knowledge Discovery
Hetal Thakkar, Nikolay Laptev, Hamid Mousavi,
Barzan Mozafari, Vincenzo Russo, Carlo Zaniolo
Computer Science Department UCLA
1
Data Stream Management Systems (DSMS)
• DSMS critical in a variety of applications
o Click-stream analysis,
o Algorithmic Trading
o Network monitoring
o Credit card fraud detection …
• Many DSMS Projects and Prototypes:
o STREAM (Stanford), Aurora/Borealis (Brown, MIT),
Telegraph (UCB), Gigascope (AT&T), Stream Mill (UCLA), …
and so on.
• Commercial Startups and vendor extensions:
o StreamBase, Aleri, Coral8, Apama, Truviso,
o DBMS vendors …
• Support for online mining on data streams:
unresolved issue for current systems.
2
Two Main Research Challenges
• Challenge I: Fast and light algorithms are needed
for online mining.
• Challenge II: Online mining and business intelligence
applications require the Quality of Service
(QoS) of a DSMS. Thus these algorithms must
be deployed as part of a DSMS.
• Much research on the first challenge (a stream of
papers in DM conferences), but not on the
second, which is probably even harder.
3
Data Stream Mining & DSMS QoS
• DSMS: Support continuous queries over massive data streams
– with QoS (Quality of Service) guarantees and
– (Quasi) Real-time response through:
o Scheduling, query optimization,
o Windows and other Synopses
o Load shedding …
• But
- Current DSMS focus on simple continuous queries
- Using query languages based on SQL
- Lackluster history of SQL with KDD
- DSMS bring more problems:
e.g. blocking queries not allowed.
4
Knowledge Discovery from DBs (KDD) vs. SQL
• OLAP in Relational DBMS: simple SQL extensions brought
rich payoffs to vendors.
• Extending DBMSs for Data Mining proved much harder:
o Limited expressive power of SQL and OR-DBMS
o Apriori in DB2 [Sarawagi’ 98] proved extremely difficult and not as
efficient as the cache-mining task.
o Imielinski & Mannila [CACM’96]: A call for Declarative DM
• Vendors: Libraries of Mining Methods
o IBM: DB2 Intelligent Miner
o Oracle Data Miner
o OLE DB for DM (DMX)
o Mining Models, Predictive Model Markup Language (PMML)
• Closed proprietary systems, limited extensibility, usability
… compared with open systems such as WEKA.
5
Our Stream Mill Miner (SMM) System
• Efficient support for online mining algorithms: DSMS
QoS, scalability, load shedding, synopses
• Expressive Power & Extensibility: User-Defined
Aggregates (UDAs), with windows and slides.
• Genericity: Mining Algorithms for arbitrary windows
(logical/physical), & tables with any number of columns
• Abstract Mining Models & Mining Workflows: analysts
do not care to see SQL code
• GUI to further enhance ease of use.
6
Expressive Power and Extensibility
by UDA functions
• UDAs are needed to express mining algorithms.
• UDAs can be written in:
o An external PL (as other DBMS & DSMS do), or
o In SQL itself (then SQL becomes Turing-Complete!)
• Using the following template:
o INITIALIZE: the first tuple
o ITERATE: each subsequent tuple
o TERMINATE: after the end of the relation/stream
• UDAs are invoked in the same way as built-in aggregates
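The three-phase template can be mimicked outside SQL. The following is a minimal Python sketch, not SMM code: the class name MyAvg and the driver run_aggregate are hypothetical, and the object attributes stand in for the UDA's state table.

```python
class MyAvg:
    """Average written as a three-phase aggregate (illustrative only)."""

    def initialize(self, value):
        # INITIALIZE: runs on the first tuple only.
        self.tsum = value
        self.cnt = 1

    def iterate(self, value):
        # ITERATE: runs on each subsequent tuple.
        self.tsum += value
        self.cnt += 1

    def terminate(self):
        # TERMINATE: runs once the input is exhausted and yields the result.
        return self.tsum / self.cnt


def run_aggregate(uda, values):
    """Drive a three-phase aggregate over a finite input (hypothetical driver)."""
    first = True
    for v in values:
        if first:
            uda.initialize(v)
            first = False
        else:
            uda.iterate(v)
    # With no input at all there is nothing to terminate on.
    return None if first else uda.terminate()
```

Because the result only appears in terminate, this form needs the input to end, which is why such an aggregate is blocking on an unbounded stream.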
7
UDA Example: same as SQL AVG
AGGREGATE myavg(Next Real) : Real
{
TABLE state(tsum Real, cnt Int);
INITIALIZE: {
INSERT INTO state VALUES (Next, 1);
}
ITERATE: {
UPDATE state SET tsum = tsum+Next, cnt = cnt+1;
}
TERMINATE: {
INSERT INTO RETURN SELECT tsum/cnt FROM state;
}
}
• The ‘state’ table stores the computation state
• INSERT INTO RETURN: the value produced by the UDA.
In TERMINATE this is blocking
• Blocking UDAs can be applied to data streams only if
we use Windows!
8
Windows:
• Windowed Query
SELECT myavg(price) OVER (ROWS 9 PRECEDING)
FROM OpenAuction
• SQL:2003 OLAP functions for built-ins only
o Other DSMSs
• Window types
o Logical, physical, unlimited, partition by,
o (DSMS) slides, tumbles.
9
Window UDAs and Differential Computation
WINDOW AGGREGATE myavg(Next Real) : Real
{
TABLE inwindow(wnext Real);
TABLE state(tsum Real, cnt Int);
INITIALIZE: {
INSERT INTO state VALUES (Next, 1);
INSERT INTO RETURN VALUES (Next);
}
ITERATE: {
UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
INSERT INTO RETURN SELECT tsum/cnt FROM state;
}
EXPIRE: {
UPDATE state SET cnt = cnt-1, tsum=tsum - oldest().wnext;
}
}
• inwindow: the tuples memorized by the system; oldest() returns the oldest of these tuples
• EXPIRE: servicing an expired tuple, possibly asynchronously
• Physical window: for each new tuple one expired tuple
• Logical window: for each new tuple zero or more expired
tuples (same code!)
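The differential computation can be sketched in Python as follows. This is an illustration under assumptions, not SMM's implementation: the class WindowAvg and the two driver functions are hypothetical, the deque plays the role of inwindow, and popleft() plays the role of oldest().

```python
from collections import deque


class WindowAvg:
    """Running average maintained by adding new tuples and subtracting expired ones."""

    def __init__(self):
        self.inwindow = deque()  # tuples currently in the window
        self.tsum = 0.0
        self.cnt = 0

    def iterate(self, value):
        self.inwindow.append(value)
        self.tsum += value
        self.cnt += 1

    def expire(self):
        oldest = self.inwindow.popleft()
        self.tsum -= oldest
        self.cnt -= 1

    def result(self):
        return self.tsum / self.cnt


def physical_window_avg(values, size):
    """Count-based window: once full, each new tuple expires exactly one old tuple."""
    agg, out = WindowAvg(), []
    for v in values:
        agg.iterate(v)
        while agg.cnt > size:
            agg.expire()
        out.append(agg.result())
    return out


def logical_window_avg(timed_values, span):
    """Time-based window: each new tuple may expire zero or more old tuples."""
    agg, times, out = WindowAvg(), deque(), []
    for ts, v in timed_values:
        agg.iterate(v)
        times.append(ts)
        while times[0] <= ts - span:
            times.popleft()
            agg.expire()
        out.append(agg.result())
    return out
```

Both drivers reuse the same iterate and expire code; only the rule that decides when a tuple has expired differs.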
10
Slides for Window UDAs
SELECT myavg(price) OVER (ROWS 99 PRECEDING SLIDE 5)
FROM OpenAuction
• The 100-tuple window is partitioned into 20 panes of 5 tuples each
• Very useful to scale down computation & output
• E.g. mining aggregates: train on data from the last hour but build a new
classifier every 10 minutes.
• When slide ≥ window, the window is called a tumble
• Windows in DBMS (SQL:2003 OLAP functions) and slides in DSMS: only
for built-in aggregates.
• Windows logical, physical, unlimited, partition by, slides, tumbles:
declared in SMM by a base UDA + a window UDA.
• Superior expressive power and extensibility.
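One way panes could be exploited is sketched below in Python. This is an assumption about how a pane-based evaluation might look, not a description of SMM internals; the function name paned_avg is hypothetical. Each pane keeps a partial sum, and a result is produced only when a pane completes.

```python
from collections import deque


def paned_avg(values, window, slide):
    """Average over the last `window` tuples, reported once every `slide` tuples."""
    if window % slide != 0:
        raise ValueError("this sketch assumes the slide divides the window")
    # One (sum, count) entry per completed pane; the deque drops the oldest pane.
    panes = deque(maxlen=window // slide)
    psum, pcnt = 0.0, 0
    out = []
    for v in values:
        psum += v
        pcnt += 1
        if pcnt == slide:  # the current pane is complete
            panes.append((psum, pcnt))
            psum, pcnt = 0.0, 0
            total = sum(s for s, _ in panes)
            count = sum(c for _, c in panes)
            out.append(total / count)
    return out
```

With window=100 and slide=5 the deque holds at most 20 panes, and 200 input tuples produce 40 results instead of 200.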
11
Genericity (of UDAs)
• UDAs with window and slides: expressive power and
extensibility—great for mining algorithms
• SMM UDAs can be declared with an arbitrary but fixed
number of arguments.
• But the same mining algorithm must work on windows and
tables where tuples have an arbitrary number of columns.
Solution:
• For UDAs coded in an external PL, each tuple can be
represented as a self-describing blob.
• For UDAs coded in SQL, use verticalization…
12
Verticalization
Original table:
RID  1:Outlook  2:Temprt  3:Humidity  4:Wind  Play?
1    Sunny      Hot       High        Weak    No
2    Overcast   Hot       High        Weak    Yes

Verticalized table:
RID  Column  Value     Decision
1    1       Sunny     No
1    2       Hot       No
1    3       High      No
1    4       Weak      No
2    1       Overcast  Yes
2    2       Hot       Yes
2    3       High      Yes
2    4       Weak      Yes
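The transformation can be sketched in Python. The function name verticalize and the dictionary representation of rows are assumptions made for illustration only.

```python
def verticalize(rows, label_key):
    """Turn each wide row into one (rid, column number, value, decision) tuple per attribute."""
    out = []
    for rid, row in enumerate(rows, start=1):
        decision = row[label_key]
        attributes = [name for name in row if name != label_key]
        for col, name in enumerate(attributes, start=1):
            out.append((rid, col, row[name], decision))
    return out


rows = [
    {"Outlook": "Sunny", "Temprt": "Hot", "Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Overcast", "Temprt": "Hot", "Humidity": "High", "Wind": "Weak", "Play": "Yes"},
]
vertical = verticalize(rows, "Play")
```

Every output tuple has the same four fields regardless of how many columns the input rows have, which is what lets a UDA with a fixed number of arguments handle arbitrary tables.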
13
Generic NBClassifier (Training)
TABLE DescriptorTbl(Col INT, Val INT, Dec INT, normCnt REAL);
WINDOW AGGREGATE
LearnNB(col INT, val Char, totCols INT, classVal INT) : INT
{
INITIALIZE: ITERATE: {
UPDATE DescriptorTbl SET normCnt = normCnt + 1
WHERE Col = col AND Val = val AND Dec = classVal;
INSERT INTO DescriptorTbl VALUES (col, val, classVal, 1)
WHERE SQLCODE <> 0;
}
EXPIRE: { UPDATE DescriptorTbl SET normCnt = normCnt - 1
WHERE Col = oldest().col AND Val = oldest().val
AND Dec = oldest().classVal }
}
E.g. for DescriptorTbl
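The update-or-insert logic of LearnNB can be sketched in Python. This is a simplified illustration, not the SMM code: the Counter stands in for DescriptorTbl, the method names are hypothetical, and any normalization implied by the column name normCnt is not shown.

```python
from collections import Counter


class LearnNB:
    """Keeps one count per (column, value, class) over the current window."""

    def __init__(self):
        # (col, val, class_val) -> count, standing in for DescriptorTbl
        self.descriptor = Counter()

    def iterate(self, col, val, class_val):
        # A missing key starts at 0, which covers both the UPDATE and the
        # fallback INSERT of the UDA.
        self.descriptor[(col, val, class_val)] += 1

    def expire(self, col, val, class_val):
        # A tuple leaving the window undoes its own contribution.
        self.descriptor[(col, val, class_val)] -= 1
```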
14
Example: NBClassifier (Classifying)
• Assume that the test table has the same verticalized format
as the training table.
• Then we can use a join to find the counts for each tuple and each
class.
• Then we need to multiply the counts for each class. How? Use the
sum aggregate over the logarithms of the counts.
• Then compare the results and select the largest. …
• What about missing values ….?
• What if tuples arrive in a stream?
• NBC is probably the simplest classification method among the
effective ones
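The join-then-sum idea can be sketched in Python. This is a minimal illustration under assumptions: the function name classify_nb is hypothetical, class priors are ignored, the stored counts are used as given, and a class lacking a count for some column is simply dropped, which is only one possible answer to the missing-values question.

```python
import math
from collections import defaultdict


def classify_nb(descriptor, test_tuple):
    """descriptor: {(col, val, cls): count}; test_tuple: list of (col, val) pairs."""
    wanted = set(test_tuple)
    scores = defaultdict(float)
    matched = defaultdict(int)
    for (col, val, cls), count in descriptor.items():
        if (col, val) in wanted and count > 0:
            # Summing logarithms replaces multiplying the counts.
            scores[cls] += math.log(count)
            matched[cls] += 1
    # Only classes with a count for every column are compared.
    complete = {c: s for c, s in scores.items() if matched[c] == len(wanted)}
    return max(complete, key=complete.get) if complete else None
```

Since the logarithm is monotonic, the class with the largest sum of logarithms is also the class with the largest product.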
15
Data Stream Mining: why Mining Models?
Specifying mining tasks involves many details. E.g., a
simple classifier:
• Define a training stream and a testing stream
• A data cleaner/discretizer
• A TrainingUDA that builds a model
• A TestingUDA that uses the model
• How these two communicate: a table holding the model
• Different parameters for each classifier instance
• A workflow to describe the flow of the information between
mining tasks: More critical here than in KDD
SMM’s Mining Models: a declarative, user-definable
framework to achieve all the above.
16
Built-in Mining Algorithms in SMM
• Online classifiers
o Naïve Bayesian
o Decision Tree
o Linear Regression
o Ensemble Methods
o K-nearest Neighbor
• Online clustering
o DBScan [Ester’ 96]
o IncDBScan
o Windowed K-means*
o DenStream* [Cao’ 06]
o CluStream
• Association rule mining
o Approximate frequent items
o SWIM [ICDE’ 08]
o Moment [Chi’ 04]
o AFPIM
• Time Series/Sequences
o SQL-TS [Sadri’ 01]
o K*SQL [VLDB’10]
• Many more …
Legend: o Already supported, O Work in progress
17
Our Stream Mill Miner (SMM) System
1. DSMS performance and QoS
2. Genericity/flexibility
3. Expressive Power & Extensibility
– So performance, flexibility, power, extensibility of mining
algorithms written in our Expressive Stream Language
ESL (an extension of SQL)
– But analysts do not want to see hundreds of lines of ESL
code!
Thus SMM also provides:
• Abstract Mining Models & Mining Workflows, and
• GUI to further enhance ease of use.
18
A More Complex Task: Classifier Ensembles
• Classifier Ensembles for accuracy and concept shift/drift:
Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive
transfer [Forman’ 06].
• Example: Specify UDAs (boxes) and flow for Weighted Bagging
[Figure: workflow boxes BuildEns, Train, Classify, Voting]
19
Ensemble Based Bagging: Flows
MODELTYPE EnsembleBag {
BuildEns (UDA buildEns),
Train (UDA learnDTree),
UpdateEns (UDA updateEnsembles),
Classify (UDA evaluateClassifier),
ManageWeights (UDA updateWeights),
Voting (UDA voting),
SHARETABLES(activeEnsembles, ensClassTbl, ensembleWeights),
Flow Train (
CREATE STREAM buildEnsTrain AS (RUN BuildEns ON INSTREAM);
CREATE STREAM dTreeTrain AS (RUN Train ON buildEnsTrain);
RUN UpdateEns ON dTreeTrain;
CREATE STREAM ensClassiTrainPairs AS
(SELECT a.ensId trainEns,b.ensId,b.id,b.col,b.val,b.lbl,b.numCols
FROM buildEnsTrain AS b, activeEnsembles AS a);
CREATE STREAM evalClassiTrain AS (RUN Classify ON
ensClassiTrainPairs);
INSERT INTO OUTSTREAM RUN ManageWeights ON evalClassiTrain;
),
Flow Test ( … )
}
20
Aggregate buildEns (idi int, coli int, vali int, lbli int, numColsi int,
tWeighti int, ensSize int):
(ensId int, id int, col int, val int, lbl int, numCols int, tWeight int) {
table curEnsId(ensId int) memory;
table curEnsCnt(cnt int) memory;
initialize:
{
insert into curEnsCnt values(1);
insert into curEnsId values(1);
insert into return
select ensId, idi, coli, vali, lbli, numColsi, tWeighti
from curEnsId;
}
iterate:
{
update curEnsCnt set cnt = cnt + 1 where coli = numColsi;
insert into return
select ensId, idi, coli, vali, lbli, numColsi, tWeighti
from curEnsId;
/* indicates end of ens */
insert into return select ensId, -1, 0, 0, 0, 0, 0 from curEnsId
where coli = numColsi and (select cnt from curEnsCnt) = ensSize;
update curEnsId set ensId = ((ensId+1)%20) where (select cnt from
curEnsCnt) = ensSize and coli = numColsi;
update curEnsCnt set cnt = 0 where coli = numColsi and (select cnt
from curEnsCnt) = ensSize;
}
};
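The tagging logic of buildEns can be followed more easily in a Python sketch. This is a literal reading of the aggregate above, written as a generator; the function name build_ens and the parameter max_ens (the constant 20 in the UDA) are naming assumptions.

```python
def build_ens(tuples, ens_size, max_ens=20):
    """Tag each verticalized tuple with the current ensemble id.

    Each input tuple is (rid, col, val, lbl, num_cols, weight).
    """
    ens_id, cnt = 1, 1  # both start at 1, as in the INITIALIZE block
    for i, (rid, col, val, lbl, num_cols, weight) in enumerate(tuples):
        # A row is complete when its last column arrives; the first tuple is
        # handled by INITIALIZE, which does not run this check.
        row_done = i > 0 and col == num_cols
        if row_done:
            cnt += 1
        yield (ens_id, rid, col, val, lbl, num_cols, weight)
        if row_done and cnt == ens_size:
            yield (ens_id, -1, 0, 0, 0, 0, 0)  # end-of-ensemble marker
            ens_id = (ens_id + 1) % max_ens
            cnt = 0
```

The marker tuple with id -1 is what lets a downstream stage such as Train know that one ensemble's training data is complete.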
Picture of the flow definition GUI
29th Oct 08
22
After Functionality & Usability:
Performance

No data stream mining workbench to compare against, so
we compared SMM with WEKA on
• Integration overhead: performance lost because
algorithm is embedded in the system
• Scalability.
– Results obtained using a single-processor machine,
with a Pentium4, 2.4GHz processor, 1GB RAM
– On data preloaded in main memory.
23
Comparison with Weka
• C4.5 was recast as a UDA and incorporated into SMM
• Leftmost bars, Iris-C4.5, HD-C4.5: C4.5 directly on data.
• Middle bars, Iris-SMM(C4.5), HD-SMM(C4.5): C4.5 incorporated into SMM
• Rightmost bars: Weka J48
24
Integration Overhead: Integrated SWIM vs.
Standalone SWIM (Frequent Patterns on DS)
25
Concurrent Queries
26
Conclusions
• SMM main contributions
– Built on a DSMS: efficiency and QoS achieved
– Expressive Power, Generality and User Extensibility
• With an SQL-based continuous query language
• UDAs with windows and slides
• Arbitrary relational/XML data streams
• A suite of fast & light mining algorithms (domestic & imported)
– High-level mining Language
• Defining the mining process and information flow
• New mining models can be defined easily
• High-level abstractions and GUI to match analysts’
requirements.
27
Acknowledging the many
SMM Contributors
• Yijian Bai,
• Stefano Emiliozzi,
• Chang Luo,
• Yan-Nei Law,
• Haixun Wang,
• Kai Zeng,
• Xin Zhou
• Hetal Thakkar,
• Nikolay Laptev,
• Hamid Mousavi,
• Barzan Mozafari,
• Vincenzo Russo,
28
Thank You!
Questions?
29
Example: NBClassifier (Classifying)
AGGREGATE ClassifyNB(col INT, val REAL, totCols INT):INT {
TABLE tmp(column INT, value REAL);
TABLE pred(dec INT, tot REAL);
INITIALIZE: ITERATE: {
INSERT INTO tmp VALUES (col, val);
INSERT INTO pred SELECT d.Dec, sum(abs(log(normCnt)))
FROM DescriptorTbl AS d, tmp AS t
WHERE col = totCols AND d.Val=t.value
AND d.Col=t.column
GROUP BY d.Dec;
INSERT INTO RETURN SELECT dec FROM pred
WHERE col = totCols
AND tot = (SELECT max(tot) FROM pred);
DELETE FROM tmp WHERE col = totCols;
DELETE FROM pred WHERE col = totCols;
}
}
30
Future Work
• Integration of other mining algorithms
• Distributed execution of UDAs, like
MapReduce
• Similar solution for databases
31
Research Challenge I:
Online Mining Algorithms
• Online mining different from static mining
o Changing data characteristics
- Data distribution
- Concept-drifts and shifts
o Data volume
- Existing static data solutions not suitable
- Load shedding and sampling
o Response time constraints
- Sacrifice accuracy
• Fast & light algorithms required
32
Hot Research Topic:
Online Mining Algorithms
• Existing Online algorithms
o Moment [Chi’ 04], AFPIM [Koh’ 04]
o CluStream [Aggarwal’ 03], GenIc [Gupta’ 04]
o Ensemble based bagging [Wang’ 03]
o Adaptive boosting [Chu’ 04]
o Many more opportunities
- E.g. frequent itemset mining over large sliding windows
(SWIM) [ICDE’ 08]
• Do not tackle the system oriented challenges
33
DSMSs
• Commercial Systems (General Purpose)
o Aleri - OLAP queries
o StreamBase - Synopses and pattern matching
o Apama, Coral8, …
o Oracle, IBM
o KX, Vhayu – specialized
o All oriented towards SQL
• Focusing on special purpose queries or simple SQL queries
o Little extensibility
o No mining support
34
Another Example: End-to-end Association
Rule Mining
GUI for defining and using mining models and flows.
35