Download slides - UCLA Computer Science

SMM: A Data Stream Management System for Knowledge Discovery
Hetal Thakkar, Nikolay Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, Carlo Zaniolo
Computer Science Department, UCLA

Data Stream Management Systems (DSMS)
• DSMS are critical in a variety of applications
  o Click-stream analysis, algorithmic trading
  o Network monitoring
  o Credit card fraud detection
  o …
• Many DSMS projects and prototypes:
  o STREAM (Stanford), Aurora/Borealis (Brown, MIT), Telegraph (UCB), Gigascope (AT&T), Stream Mill (UCLA), … and so on.
• Commercial startups and vendor extensions:
  o StreamBase, Aleri, Coral8, Apama, Truviso
  o DBMS vendors …
• Support for online mining on data streams: an unresolved issue for current systems.

Two Main Research Challenges
• Challenge I: Fast and light algorithms are needed for online mining.
• Challenge II: These and business intelligence applications require the Quality of Service (QoS) of a DSMS. Thus these algorithms must be deployed as part of a DSMS.
• Much research addresses the first challenge (a stream of papers in DM conferences) but not the second, which is probably even harder.

Data Stream Mining & DSMS QoS
• DSMS: support continuous queries over massive data streams
  o with QoS (Quality of Service) guarantees and
  o (quasi) real-time response through:
    - Scheduling, query optimization
    - Windows and other synopses
    - Load shedding …
• But:
  o Current DSMS focus on simple continuous queries
  o Using query languages based on SQL
  o Lackluster history of SQL with KDD
  o DSMS bring more problems: e.g. blocking queries are not allowed.

Knowledge Discovery from DBs (KDD) vs. SQL
• OLAP in relational DBMS: simple SQL extensions brought rich payoffs to vendors.
• Extending DBMSs for data mining proved much harder:
  o Limited expressive power of SQL and OR-DBMS
  o Apriori in DB2 [Sarawagi’ 98] proved extremely difficult and not as efficient as the cache-mining task.
• Imielinski & Mannila [CACM’96]: a call for declarative DM
• Vendors: libraries of mining methods
  o IBM: DB2 Intelligent Miner
  o Oracle Data Miner
  o OLE DB for DM (DMX)
• Mining Models, Predictive Model Markup Language (PMML)
• Closed proprietary systems, limited extensibility, usability
• … compared with open systems such as WEKA.

Our Stream Mill Miner (SMM) System
• Efficient support for online mining algorithms: DSMS QoS, scalability, load shedding, synopses
• Expressive power & extensibility: User-Defined Aggregates (UDAs), with windows and slides.
• Genericity: mining algorithms for arbitrary windows (logical/physical), and tables with any number of columns
• Abstract Mining Models & Mining Workflows: analysts do not care to see SQL code
• GUI to further enhance ease of use.

Expressive Power and Extensibility by UDA functions
• UDAs are needed to express mining algorithms.
• UDAs can be written in:
  o An external PL (as other DBMS & DSMS do), or
  o SQL itself (then SQL becomes Turing-complete!)
• Using the following template:
  o INITIALIZE: executed for the first tuple
  o ITERATE: executed for each subsequent tuple
  o TERMINATE: executed after the end of the relation/stream
• UDAs are invoked in the same way as built-in aggregates

UDA Example: same as SQL AVG
  AGGREGATE myavg(Next Real) : Real {
    TABLE state(tsum Real, cnt Int);
    INITIALIZE: {
      INSERT INTO state VALUES (Next, 1)
    }
    ITERATE: {
      UPDATE state SET tsum = tsum+Next, cnt = cnt+1;
    }
    TERMINATE: {
      INSERT INTO RETURN SELECT tsum/cnt FROM state;
    }
  }
• The ‘state’ table stores the computation state.
• Blocking UDAs can be applied to data streams only if we use windows!
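
The INITIALIZE / ITERATE / TERMINATE template can be mimicked outside ESL. The following is a minimal Python sketch, not SMM code: the class name MyAvg and the driver run_blocking are hypothetical names introduced here for illustration. It shows why a result produced in TERMINATE is blocking: nothing is returned until the whole input has been consumed.

```python
from typing import Iterable, Optional


class MyAvg:
    """Python analogue of the myavg UDA (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Plays the role of the 'state' table: a single row (tsum, cnt).
        self.tsum = 0.0
        self.cnt = 0

    def initialize(self, nxt: float) -> None:
        # INITIALIZE: executed for the first tuple.
        self.tsum, self.cnt = nxt, 1

    def iterate(self, nxt: float) -> None:
        # ITERATE: executed for each subsequent tuple.
        self.tsum += nxt
        self.cnt += 1

    def terminate(self) -> float:
        # TERMINATE: executed only after the last tuple, hence blocking.
        return self.tsum / self.cnt


def run_blocking(values: Iterable[float]) -> Optional[float]:
    """Drive the aggregate over a finite input; returns None for empty input."""
    agg = MyAvg()
    seen_first = False
    for v in values:
        if not seen_first:
            agg.initialize(v)
            seen_first = True
        else:
            agg.iterate(v)
    return agg.terminate() if seen_first else None


print(run_blocking([10.0, 20.0, 60.0]))  # 30.0
```

On an unbounded stream the loop in run_blocking would never finish, which is the reason the slide restricts blocking UDAs to windows.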
Notes on the myavg UDA:
• INSERT INTO RETURN: the value produced by the UDA
• In TERMINATE this is blocking

Windows
• Windowed query:
    SELECT myavg(price) OVER (ROWS 9 PRECEDING)
    FROM OpenAuction
• SQL:2003 OLAP functions: for built-ins only
  o Other DSMSs
• Window types
  o Logical, physical, unlimited, partition by
  o (DSMS) slides, tumbles

Window UDAs and Differential Computation
  WINDOW AGGREGATE myavg(Next Real) : Real {
    TABLE inwindow(wnext Real);
    TABLE state(tsum Real, cnt Int);
    INITIALIZE: {
      INSERT INTO state VALUES (Next, 1);
      INSERT INTO RETURN VALUES (Next)
    }
    ITERATE: {
      UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
      INSERT INTO RETURN SELECT tsum/cnt FROM state
    }
    EXPIRE: {
      UPDATE state SET cnt = cnt-1, tsum=tsum - oldest().wnext
    }
  }
• inwindow: the tuples memorized by the system; oldest() returns the oldest of these tuples
• EXPIRE: services an expired tuple, possibly asynchronously
• Physical window: for each new tuple, one expired tuple
• Logical window: for each new tuple, zero or more expired tuples (same code!)

Slides for Window UDAs
    SELECT myavg(price) OVER (ROWS 99 PRECEDING SLIDE 5)
    FROM OpenAuction
• The 100-tuple window is partitioned into 20 panes of 5 tuples each
• Very useful to scale down computation & output
• E.g. mining aggregates: train on data from the last hour but produce a new classifier every 10 minutes.
• When slide ≥ window, the window is called a tumble
• Windows in DBMS (SQL:2003 OLAP functions) and slides in DSMS: only for built-in aggregates.
• Windows (logical, physical, unlimited, partition by, slides, tumbles): declared in SMM by a base UDA plus a window UDA.
• Superior expressive power and extensibility.

Genericity (of UDAs)
• UDAs with windows and slides: expressive power and extensibility, great for mining algorithms
• SMM UDAs can be declared with an arbitrary but fixed number of arguments.
• But the same mining algorithm must work on windows and tables where tuples have an arbitrary number of columns.
Solution:
• For UDAs coded in an external PL, each tuple can be represented as a self-describing blob.
• For UDAs coded in SQL, use verticalization…

Verticalization
  Original table:
    RID  1:Outlook  2:Temprt  3:Humidity  4:Wind  Play?
    1    Sunny      Hot       High        Weak    No
    2    Overcast   Hot       High        Weak    Yes

  Verticalized table:
    RID  Column  Value     Decision
    1    1       Sunny     No
    1    2       Hot       No
    1    3       High      No
    1    4       Weak      No
    2    1       Overcast  Yes
    2    2       Hot       Yes
    2    3       High      Yes
    2    4       Weak      Yes

Generic NBClassifier (Training)
  TABLE DescriptorTbl(Col INT, Val INT, Dec INT, normCnt REAL);
  WINDOW AGGREGATE LearnNB(col INT, val Char, totCols INT, classVal INT) : INT {
    INITIALIZE:
    ITERATE: {
      UPDATE DescriptorTbl SET normCnt = normCnt + 1
        WHERE Col = col AND Val = val AND Dec = classVal;
      INSERT INTO DescriptorTbl VALUES (col, val, classVal, 1)
        WHERE SQLCODE <> 0;
    }
    EXPIRE: {
      UPDATE DescriptorTbl SET normCnt = normCnt - 1
        WHERE Col = oldest().col AND Val = oldest().val AND Dec = oldest().classVal
    }
  }
  [Callout: e.g. for DescriptorTbl]

Example: NBClassifier (Classifying)
• Assume that the test table has the same verticalized format as the training table.
• Then we can use a join to find the counts for each tuple and each class.
• Then we need to multiply the counts for each class. How? Use the sum aggregate over the logarithms of the counts, since log(a·b) = log a + log b.
• Then compare the results and select the larger. …
• What about missing values …?
• What if tuples arrive in a stream?
• NBC is probably the simplest of the effective classification methods.

Data Stream Mining: why Mining Models?
Specifying mining tasks involves many details. E.g., a simple classifier:
• Define a training stream and a testing stream
• A data cleaner/discretizer
• A TrainingUDA that builds a model
• A TestingUDA that uses the model
• How these two communicate: a table holding the model
• Different parameters for each classifier instance
• A workflow to describe the flow of information between mining tasks: more critical here than in KDD
SMM’s Mining Models: a declarative, user-definable framework to achieve all the above.
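
The verticalization and LearnNB slides above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the SMM implementation: verticalize and WindowedNBCounts are hypothetical names, the window is a physical (row-count) window, and entries whose count drops to zero are deleted, which the ESL code on the slide does not do.

```python
from collections import Counter, deque
from typing import Deque, List, Tuple

# One verticalized row: (RID, Column, Value, Decision).
VRow = Tuple[int, int, str, str]
# Key of the count table: (Column, Value, Decision).
Key = Tuple[int, str, str]


def verticalize(rid: int, values: List[str], decision: str) -> List[VRow]:
    """Turn one wide tuple into one row per column, numbering columns from 1."""
    return [(rid, col, val, decision) for col, val in enumerate(values, start=1)]


class WindowedNBCounts:
    """Hypothetical stand-in for LearnNB plus DescriptorTbl over a physical window."""

    def __init__(self, window_rows: int) -> None:
        self.window_rows = window_rows
        self.inwindow: Deque[Key] = deque()  # rows currently inside the window
        self.counts: Counter = Counter()     # (col, val, dec) -> count

    def iterate(self, col: int, val: str, dec: str) -> None:
        # ITERATE: bump the count. Counter creates a missing entry on demand,
        # which covers the UPDATE-then-INSERT pair of the ESL code.
        key = (col, val, dec)
        self.counts[key] += 1
        self.inwindow.append(key)
        # Physical window: a new row beyond the limit expires the oldest one.
        while len(self.inwindow) > self.window_rows:
            self.expire(self.inwindow.popleft())

    def expire(self, oldest: Key) -> None:
        # EXPIRE: undo the contribution of the row leaving the window.
        self.counts[oldest] -= 1
        if self.counts[oldest] == 0:
            del self.counts[oldest]  # assumption: drop empty entries


training = [
    (1, ["Sunny", "Hot", "High", "Weak"], "No"),
    (2, ["Overcast", "Hot", "High", "Weak"], "Yes"),
]
model = WindowedNBCounts(window_rows=4)  # room for one 4-column tuple
for rid, values, decision in training:
    for _, col, val, dec in verticalize(rid, values, decision):
        model.iterate(col, val, dec)

print(sorted(model.counts.items()))
```

With a window of four rows, only the counts contributed by the second training tuple remain after both tuples are fed in; the same code works for any number of columns because each column arrives as its own row.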

Built-in Mining Algorithms in SMM
• Online classifiers
  o Naïve Bayesian
  o Decision Tree
  o Linear Regression
  o Ensemble Methods
  o K-nearest Neighbor
• Online clustering
  o DBScan [Ester’ 96]
  o IncDBScan
  o Windowed K-means*
  o DenStream* [Cao’ 06]
  o CluStream
• Association rule mining
  o Approximate frequent items
  o SWIM [ICDE’ 08]
  o Moment [Chi’ 04]
  o AFPIM
• Time Series/Sequences
  o SQL-TS [Sadri’ 01]
  o K*SQL [VLDB’10]
• Many more …
Legend: already supported; work in progress

Our Stream Mill Miner (SMM) System
1. DSMS performance and QoS
2. Genericity/flexibility
3. Expressive power & extensibility
  o So: performance, flexibility, power, and extensibility of mining algorithms written in our Expressive Stream Language ESL (an extension of SQL)
  o But analysts do not want to see hundreds of lines of ESL code!
Thus SMM also provides:
• Abstract Mining Models & Mining Workflows, and
• GUI to further enhance ease of use.

A More Complex Task: Classifier Ensembles
• Classifier ensembles for accuracy and concept shift/drift: weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].
• Example: Specify UDAs (boxes) and flow for Weighted Bagging
[Figure: flow diagram with boxes BuildEns, Train, Classify, Classify, Voting]

Ensemble Based Bagging: Flows
  MODELTYPE EnsembleBag {
    BuildEns (UDA buildEns),
    Train (UDA learnDTree),
    UpdateEns (UDA updateEnsembles),
    Classify (UDA evaluateClassifier),
    ManageWeights (UDA updateWeights),
    Voting (UDA voting),
    SHARETABLES(activeEnsembles, ensClassTbl, ensembleWeights),
    Flow Train (
      CREATE STREAM buildEnsTrain AS (RUN BuildEns ON INSTREAM);
      CREATE STREAM dTreeTrain AS (RUN Train ON buildEnsTrain);
      RUN UpdateEns ON dTreeTrain;
      CREATE STREAM ensClassiTrainPairs AS
        (SELECT a.ensId trainEns, b.ensId, b.id, b.col, b.val, b.lbl, b.numCols
         FROM buildEnsTrain AS b, activeEnsembles AS a);
      CREATE STREAM evalClassiTrain AS (RUN Classify ON ensClassiTrainPairs);
      INSERT INTO OUTSTREAM RUN ManageWeights ON evalClassiTrain;
    ),
    Flow Test ( … )
  }

  Aggregate buildEns (idi int, coli int, vali int, lbli int, numColsi int, tWeighti int, ensSize int):
      (ensId int, id int, col int, val int, lbl int, numCols int, tWeight int) {
    table curEnsId(ensId int) memory;
    table curEnsCnt(cnt int) memory;
    initialize: {
      insert into curEnsCnt values(1);
      insert into curEnsId values(1);
      insert into return
        select ensId, idi, coli, vali, lbli, numColsi, tWeighti from curEnsId;
    }
    iterate: {
      update curEnsCnt set cnt = cnt + 1 where coli = numColsi;
      insert into return
        select ensId, idi, coli, vali, lbli, numColsi, tWeighti from curEnsId;
      /* indicates end of ens */
      insert into return
        select ensId, -1, 0, 0, 0, 0, 0 from curEnsId
        where coli = numColsi and (select cnt from curEnsCnt) = ensSize;
      update curEnsId set ensId = ((ensId+1)%20)
        where (select cnt from curEnsCnt) = ensSize and coli = numColsi;
      update curEnsCnt set cnt = 0
        where coli = numColsi and (select cnt from curEnsCnt) = ensSize;
    }
  };

[Figure: picture of the flow definition GUI]
29th Oct 08

After Functionality & Usability: Performance
• No data stream mining workbench to compare against
• We compared SMM with WEKA on:
  o Integration overhead:
    performance lost because the algorithm is embedded in the system
  o Scalability
• Results obtained using a single-processor machine with a Pentium 4 2.4 GHz processor and 1 GB RAM
• On data preloaded in main memory.

Comparison with Weka
• C4.5 was recast as a UDA and incorporated into SMM
• Leftmost bars, Iris-C4.5, HD-C4.5: C4.5 directly on data
• Middle bars, Iris-SMM(C4.5), HD-SMM(C4.5): C4.5 incorporated into SMM
• Rightmost bars: Weka J48

Integration Overhead: Integrated SWIM vs. Standalone SWIM (Frequent Patterns on DS)

Concurrent Queries

Conclusions
• SMM main contributions
  o Efficiency and QoS, achieved by building on a DSMS
  o Expressive power, generality and user extensibility
    - With an SQL-based continuous query language
    - UDAs with windows and slides
    - Arbitrary relational/XML data streams
    - A suite of fast & light mining algorithms (domestic & imported)
  o High-level mining language
    - Defining the mining process and information flow
    - New mining models can be defined easily
    - High-level abstractions and GUI to match analysts’ requirements.

Acknowledging the many SMM Contributors
• Yijian Bai
• Stefano Emiliozzi
• Chang Luo
• Yan-Nei Law
• Haixun Wang
• Kai Zeng
• Xin Zhou

• Hetal Thakkar
• Nikolay Laptev
• Hamid Mousavi
• Barzan Mozafari
• Vincenzo Russo

Thank You! Questions?

Example: NBClassifier (Classifying)
  AGGREGATE ClassifyNB(col INT, val REAL, totCols INT) : INT {
    TABLE tmp(column INT, value REAL);
    TABLE pred(dec INT, tot REAL);
    INITIALIZE:
    ITERATE: {
      INSERT INTO tmp VALUES (col, val);
      INSERT INTO pred
        SELECT d.Dec, sum(abs(log(normCnt)))
        FROM DescriptorTbl AS d, tmp AS t
        WHERE col = totCols AND d.Val=t.value AND d.Col=t.column
        GROUP BY d.Dec;
      INSERT INTO RETURN
        SELECT dec FROM pred
        WHERE col = totCols AND tot = (SELECT max(tot) FROM pred GROUP BY dec);
      DELETE FROM tmp WHERE col = totCols;
      DELETE FROM pred WHERE col = totCols;
    }
  }

Future Work
• Integration of other mining algorithms
• Distributed execution of UDAs, like MapReduce
• Similar solution for databases

Research Challenge I: Online Mining Algorithms
• Online mining is different from static mining
  o Changing data characteristics
    - Data distribution
    - Concept drifts and shifts
  o Data volume
    - Existing static data solutions not suitable
    - Load shedding and sampling
  o Response time constraints
    - Sacrifice accuracy
• Fast & light algorithms required

Hot Research Topic: Online Mining Algorithms
• Existing online algorithms
  o Moment [Chi’ 04], AFPIM [Koh’ 04]
  o CluStream [Aggarwal’ 03], GenIc [Gupta’ 04]
  o Ensemble based bagging [Wang’ 03]
  o Adaptive boosting [Chu’ 04]
  o Many more opportunities
    - E.g. frequent itemset mining over large sliding windows (SWIM) [ICDE’ 08]
• Do not tackle the system-oriented challenges

DSMSs
• Commercial systems (general purpose)
  o Aleri - OLAP queries
  o StreamBase - Synopses and pattern matching
  o Apama, Coral8, …
  o Oracle, IBM
  o KX, Vhayu - specialized
  o All oriented towards SQL
• Focusing on special purpose queries or simple SQL queries
  o Little extensibility
  o No mining support

Another Example: End-to-end Association Rule Mining
[Figure: GUI for defining and using mining models and flows]
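
The ClassifyNB aggregate earlier in these backup slides turns a product of per-class counts into a sum of logarithms. A minimal Python sketch of that scoring step follows. It is a simplification under stated assumptions, not the SMM code: the function classify and the sample counts are hypothetical, class priors are ignored, and a (column, value) pair that never occurred with a class simply contributes no term, mirroring the inner join on DescriptorTbl. Handling missing values properly is left open, as on the slides.

```python
import math
from typing import Dict, List, Tuple

# Count table keyed by (Column, Value, Decision), in the spirit of DescriptorTbl.
Counts = Dict[Tuple[int, str, str], int]


def classify(counts: Counts, values: List[str]) -> str:
    """Return the class with the largest sum of log-counts for the given tuple."""
    scores: Dict[str, float] = {}
    for (col, val, dec), cnt in counts.items():
        # Join condition: same column and same value as the test tuple.
        if 1 <= col <= len(values) and values[col - 1] == val and cnt > 0:
            # Summing logs is equivalent to multiplying the counts.
            scores[dec] = scores.get(dec, 0.0) + math.log(cnt)
    if not scores:
        raise ValueError("no matching counts for this tuple")
    return max(scores, key=lambda d: scores[d])


# Made-up counts for two columns, purely for illustration.
sample_counts: Counts = {
    (1, "Sunny", "No"): 3,
    (1, "Sunny", "Yes"): 2,
    (1, "Overcast", "Yes"): 4,
    (2, "Hot", "No"): 2,
    (2, "Hot", "Yes"): 2,
}

print(classify(sample_counts, ["Sunny", "Hot"]))     # No
print(classify(sample_counts, ["Overcast", "Hot"]))  # Yes
```

For the first tuple the scores are log 3 + log 2 for "No" against log 2 + log 2 for "Yes", so "No" wins; for the second, "Yes" scores log 4 + log 2 against log 2 for "No".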
