Supporting Data Stream Mining Applications in DBMS & DSMS
Carlo Zaniolo, UCLA CSD
5/6/2017
http://wis.cs.ucla.edu

DM Experience for DBMS: Dreams vs. Reality
• Decision support and business intelligence:
  – OLAP & data warehouses: resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics)
  – Relational DBMS extensions for DM queries: a flop
  – OR-DBMS do not fare much better [Sarawagi’98]
• Imielinski & Mannila [CACM’96] proposed a ‘high-road’ approach, calling for a quantum leap in functionality based on:
  – Simple declarative extensions of SQL for Data Mining (DM)
  – Efficiency through DM query optimization techniques (yet to be invented)
• The research area of Inductive DBMS was thus born, producing:
  – Interesting language work: DMQL, Mine Rule, MSQL, …
  – Implementation technology that lacks generality and has performance limitations
  – Real questions about whether optimizers will ever take us there

DM Experience for DBMS: Dreams vs. Reality
The Low-Road Approach by Commercial DBMS
• Approaches largely based on:
  – Cache mining
  – Stored procedures and virtual mining views
  – Mining outside the DBMS: data transfer delays
• No move toward standardization
• IBM DB2: Intelligent Miner no longer supported
  http://www-306.ibm.com/software/data/iminer/

Oracle Data Miner
• Algorithms:
  – Adaptive Naïve Bayes
  – SVM regression
  – K-means clustering
  – Association rules, text mining, etc.
• PL/SQL with extensions for mining
• Models as first class objects
  – Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
http://www.oracle.com/technology/products/bi/odm/index.html

MS: OLE DB for DM (DMX): 3 steps
• Model creation

    Create mining model MemCard_Pred (
        CustomerId long key,
        Age long continuous,
        Profession text discrete,
        Income long continuous,
        Risk text discrete predict)
    Using Microsoft_Decision_Trees;

• Training

    Insert into MemCard_Pred
    OpenRowSet("'sqloledb', 'sa', 'mypass'",
        'SELECT CustomerId, Age, Profession, Income, Risk from Customers')

• Prediction Join

    Select C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
    From MemCard_Pred AS MP Prediction Join Customers AS C
    Where MP.Profession = C.Profession
      AND MP.Income = C.Income
      AND MP.Age = C.Age;

MS: Defining a Mining Model
• E.g., a model to predict students’ plans to attend college
• The definition specifies:
  – The format of “training cases” (top-level entity)
  – Attributes, input/output type, distribution
  – Algorithms and parameters
• Example

    CREATE MINING MODEL CollegePlanModel (
        StudentID     LONG KEY,
        Gender        TEXT DISCRETE,
        ParentIncome  LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans  TEXT DISCRETE PREDICT )
    USING Microsoft_Decision_Trees

Training

    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')

Prediction Join

    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ

[Figure: CPModel (ID, Gender, IQ, Plan) prediction-joined with NewStudents (ID, Gender, IQ)]

OLE DB for DM (DMX) (cont.)
• Mining objects as first class objects
• Schema rowsets:
  – Mining_Models
  – Mining_Model_Content
  – Mining_Functions
• Other features:
  – Column value distribution
  – Nested cases
http://research.microsoft.com/dmx/DataMining/

Summary of Vendors’ Approaches
• Built-in library of mining methods
• Script language or GUI tools
• Limitations:
  – Closed systems (internals hidden from users)
  – Adding new algorithms or customizing old ones: difficult
  – Poor integration with SQL
  – Limited interoperability across DBMSs
• Predictive Model Markup Language (PMML) as a palliative

PMML
• Predictive Model Markup Language
• XML-based language for vendor-independent definition of statistical and data mining models
• Share models among PMML-compliant products
• A descriptive language
• Supported by all major vendors

PMML Example
[Figure: PMML example]

The Data Mining World According to The Data Mining Software Vendors

Market Competition

Disclaimer
This presentation contains preliminary information that may be changed substantially prior to final commercial release of the software described herein. The information contained in this presentation represents the current view of Microsoft Corporation on the issues discussed as of the date of the presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of the presentation. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this presentation. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this information does not give you any license to these patents, trademarks, copyrights, or other intellectual property. © 2005 Microsoft Corporation. All rights reserved.

Major Data Mining Vendors
• Platforms: IBM, Oracle, SAS
• Tools: SPSS, Angoss, KXEN, Megaputer, FairIsaac, Insightful

Competition

                | SQL Server 2005 | Oracle 10g | IBM | SAS
Product         | SQL Server Analysis Services | Oracle Data Mining | DB2 Intelligent Miner, WebSphere | Enterprise Miner
Link            | | http://otn.oracle.com/products/bi/odm/odmining.html | http://www-306.ibm.com/software/data/iminer/ | http://www.sas.com/technologies/analytics/datamining/miner/factsheet.pdf
API             | OLEDB/DM, DMX, XMLA, ADOMD.Net | Java DM, PL/SQL | SQL MM/6 based on UDF, SQL SPROC | SAS Script
Algorithms      | 7 (+2) | 8 | 6 | 8+
Text Mining     | Yes | Yes | Yes | Yes
Marketing Pages | N/A | 18 | 10 | Dozens
Distribution    | Included | Additional Package | Additional Packages | Separate Product
Target          | Developers | Developers | DB2 IM Scoring module is for developers; other modules are for analysts | Analysts

Client Tools (all vendors, in source order): Embeddable Viewers, Reporting Services; Analysis tools, Web-based targeted reports; WebSphere Portal (vertical solution); None; Discoverer, Excel AddIn; IM Visualization.

Strengths (all vendors, in source order):
• Powerful yet simple API
• Good credibility with enterprise customers
• Integration with other BI technologies
• New GUI, Leader of JDM API
• Mature, Market Leader. Extensive customization and modelling abilities. Robust, industry tested and accepted algorithms and methodologies. Export to DB2 Scoring.
• New GUI
• CRM Integration
• Mature product (6 years). Good service model. Scoring inside relational engine. Strong partnership with SAS

Weaknesses (all vendors, in source order):
• Not in-process with relational engine
• Lacking statistical functions
• Poor Analyst experience
• API overly complex
• High price. Standard Functionality. Poor API (SQL MM). Confusing product line.
• Expensive. Proprietary. Customer relations range from congenial to hostile.
• Inconsistent

Major DM Vendors
• SAS Institute (Enterprise Miner)
• IBM (DB2 Intelligent Miner for Data)
• Oracle (ODM option to Oracle 10g)
• SPSS (Clementine)
• Unica Technologies, Inc. (Pattern Recognition Workbench)
• Insightful (Insightful Miner)
• KXEN (Analytic Framework)
• Prudsys (Discoverer and its family)
• Microsoft (SQL Server 2005)
• Angoss (KnowledgeServer and its family)
• DBMiner (DBMiner)
• etc…

ORACLE Strengths: Oracle Data Mining (ODM)
• Integrated into relational engine
  – Performance benefits
  – Management integration
  – SQL language integration
• ODM Client
  – “Walks through” the data mining process
  – Data preparation tailored to data mining
  – Generates code
• Integration into Oracle CRM
  – “EZ” data mining for customer churn, other applications
• Full suite of algorithms
  – Typical algorithms, plus text mining and bioinformatics
• Nice marketing/user education

ORACLE Weaknesses
• Additional licensing fees (base $400/user, $20K proc)
• Confusing API story
  – Certain features only work with the Java API
  – Certain features only work with the PL/SQL API
  – Same features work differently with different APIs
• Difficult to use
  – Different modeling concepts for each algorithm
• Poor connectivity
  – ORACLE only

SAS
• Entrenched data mining leader
  – Market share
  – Mind share
• “Best of Breed”
  – Always will attract the top ?% of customers
• Overall poor product
  – Only for the expert user (SAS philosophy)
  – Integration of results generally involves source code
• Integrated with ETL, other SAS tools
• Partnership with IBM
  – Model in SAS, deploy in DB2

My View ...
• DBMS pachyderms have made some progress toward high-level data models and integration with SQL, but they are:
  – Closed systems
  – Lacking in coverage and user-extensibility
  – Not as popular as dedicated, stand-alone, open-software DM systems, such as Weka
• OS experience again?

Weka
• A comprehensive set of DM algorithms and tools.
• Generic algorithms over arbitrary data sets.
  – Independent of the number of columns in tables.
• Open and extensible system based on Java.
* These are the desiderata for a DSMS (or a CEP system) that supports the data stream mining task

References
• [Imielinski’96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.
• T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373–408, 1999.
• J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. In Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), pages 27–33, Montreal, Canada, June 1996.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In VLDB, pages 122–133, Bombay, India, 1996.
• Marco Botta, Jean-Francois Boulicaut, Cyrille Masson, and Rosa Meo. Query languages supporting descriptive rule mining: A comparative study. In Database Support for Data Mining Applications, pages 24–51, 2004.

Road Map for Next Three Weeks
• Fast & light algorithms for mining data streams
  – Classifiers and classifier ensembles, clustering methods, association rules, time series
• Supporting the mining task in a DSMS
  – Data mining query languages and support for the mining process
• Toward a data stream mining workbench

References
• IBM. DB2 Intelligent Miner. http://www-306.ibm.com/software/data/iminer/
• ORACLE. Oracle Data Miner Release 10gr2: http://www.oracle.com/technology/products/bi/odm.
• Z. Tang, J. Maclennan, and P. Kim. Building data mining solutions with OLE DB for DM and XML analysis. SIGMOD Record, 34(2):80–85, 2005.
• Data Mining Group (DMG). Predictive model markup language (PMML). http://sourceforge.net/projects/pmml.
• Carlo Zaniolo: Mining Databases and Data Streams with Query Languages and Rules. Invited Talk, Fourth International Workshop on Knowledge Discovery in Inductive Databases, KDID 2005.
• Hetal Thakkar Mozafari and Carlo Zaniolo: Designing an
• Hetal Thakkar, Nikolay Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, Carlo Zaniolo: SMM: A data stream management system for Knowledge Discovery. ICDE 2011: 757–768.

Thank you!

Supporting DM Tasks and the Process in a DSMS or a CEP System
• I had a dream: WEKA for data streams!
• But with a DSMS we have to start from SQL rather than Java!
• Case study: Naïve Bayesian Classifiers (NBC), arguably the simplest mining algorithm.
  – It is doable in SQL/DBMS. Is it also doable in SQL/DSMS?
• What about the various CEP systems, which claim to be powerful (e.g., they support rules)? Can they support NBC?
• In general, can they be extended to support generic versions of NBC, and perhaps other data stream mining methods?

Assignment: due on May 10th.
Download a DSMS or a CEP system of your choice and (after explaining why you have selected this one and not the others) explore how you can implement the following tasks:
1. Testing of a Naïve Bayesian Classifier: you can assume that the NBC has already been trained and that you can read it from the input, a DB, a file, or memory.
2. Assume now that you also have a stream of pre-classified samples. Use this to determine the accuracy of your current classifier at periodic intervals. Output the accuracy, and if this falls below a certain threshold repeat Step 1.
3. Periodically retrain a new NBC from the stream of pre-classified tuples; then use the newly built classifier to predict the class of unclassified tuples (Step 1).
4. See if you can generalize your software and, e.g., design/develop generic NBCs, ensemble methods, other classifiers, etc.
It is understood that the limitations of DSMS and CEP systems will probably prevent you from completing all these tasks (listed in order of increasing difficulty). So, you should make sure that you (1) download a good system (but not Stream Mill), and (2) write a clear report explaining your efforts and the reasons that prevented you from going further. (For test sets, refer to CS240A.)
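
To make the logic of Steps 1 to 3 concrete before mapping it onto a DSMS or CEP system, here is a minimal sketch in plain Python (not in any stream engine). Everything named here is a hypothetical illustration and not part of the assignment: the `NaiveBayes` class, the `run_stream` function, and the `window` and `threshold` parameters. It assumes categorical attributes, Laplace (add-one) smoothing, a tumbling window of pre-classified tuples, and retraining only when the window's accuracy falls below the threshold, which is one possible way to combine Steps 2 and 3.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes classifier with Laplace (add-one) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attr index, class) -> value counts
        self.values_seen = defaultdict(set)       # attr index -> distinct values seen

    def train(self, samples):
        """samples: iterable of (attrs, label) pairs; returns self for chaining."""
        for attrs, label in samples:
            self.class_counts[label] += 1
            for i, value in enumerate(attrs):
                self.value_counts[(i, label)][value] += 1
                self.values_seen[i].add(value)
        return self

    def predict(self, attrs):
        """Return the class with the highest log posterior (Step 1)."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, value in enumerate(attrs):
                count = self.value_counts[(i, label)][value]
                score += math.log((count + 1) / (n + len(self.values_seen[i])))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def run_stream(events, model, window=4, threshold=0.75):
    """Process (attrs, label) events; label is None for unclassified tuples.

    Yields ("predict", class) for each unclassified tuple, ("accuracy", value)
    after every `window` pre-classified tuples (Step 2), and ("retrained", n)
    when accuracy fell below `threshold` and a new model was built (Step 3).
    """
    batch, hits = [], 0
    for attrs, label in events:
        if label is None:
            yield ("predict", model.predict(attrs))
            continue
        hits += model.predict(attrs) == label
        batch.append((attrs, label))
        if len(batch) == window:
            accuracy = hits / window
            yield ("accuracy", accuracy)
            if accuracy < threshold:
                model = NaiveBayes().train(batch)
                yield ("retrained", len(batch))
            batch, hits = [], 0


# Toy data (hypothetical): attributes are (outlook, windy).
train = [(("sunny", "no"), "play"), (("sunny", "yes"), "play"),
         (("rain", "yes"), "stay"), (("rain", "no"), "stay")]
# The same tuples with the concept reversed, to force a drop in accuracy.
flipped = [(attrs, "stay" if label == "play" else "play") for attrs, label in train]
events = [(("sunny", "no"), None)] + flipped + [(("sunny", "no"), None)]

for event in run_stream(events, NaiveBayes().train(train)):
    print(event)
```

On this toy stream the first unclassified tuple is labeled `play`, the reversed batch drives the accuracy to 0.0, and after retraining the same tuple is labeled `stay`. A DSMS version would have to express the same state (per-class counts, window contents, the current model) with windows and user-defined aggregates rather than Python objects.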