Data Mining Systems and Languages
CS240A Notes

Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

DM Experience for DBMS: Dreams vs. Reality
- Decision support and business intelligence:
  - OLAP and data warehouses: a resounding success for DBMS vendors, achieved via simple extensions of SQL (aggregates and analytics)
- Relational DBMS extensions for DM queries: a flop
  - OR-DBMSs do not fare much better [Sarawagi '98]
- Imielinski & Mannila [CACM '96] proposed a 'high-road' approach, calling for a quantum leap in functionality based on:
  - Simple declarative extensions of SQL for data mining (DM)
  - Efficiency through DM query optimization techniques (yet to be invented)
- The research area of Inductive DBMS was thus born, producing:
  - Interesting language work: DMQL, Mine Rule, MSQL, …
  - Implementation technology that lacks generality and has performance limitations
  - Real questions about whether optimizers will ever take us there

DBMS Limitations
- DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions.
- Extending DBMSs for mining has proven much harder:
  - Limited expressive power and flexibility of the languages
- Apriori in DB2 [Sarawagi '98]:
  - For lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining.
  - Cache mining: move the data from the database to a cache, then use PL algorithms to mine the cache.

Mining Systems Desiderata
- Problem: how to efficiently support the vast variety of online mining algorithms in an integrated framework?
  - Generality over a wide spectrum of mining tasks
  - Ease of use for naïve users, with flexibility and customizability for experts
  - Efficiency and scalability
- Databases are where the data is, but DBMSs do not support KDD tasks well.
- Three approaches:
  1. Inductive DBMS
  2. Commercial DBMS extensions
  3. Dedicated KDD systems with DBMS connections
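The cache-mining approach noted under DBMS Limitations (move the data out of the database, then mine it with ordinary program code) can be sketched roughly as follows. This is only an illustration under stated assumptions: `sqlite3` stands in for the DBMS, the `baskets` table and its rows are hypothetical toy data, and the mining step is plain pair counting rather than the full Apriori algorithm discussed in [Sarawagi '98].

```python
import sqlite3
from collections import Counter
from itertools import combinations


def load_baskets(conn):
    """Step 1: pull (tid, item) rows out of the DBMS into in-memory baskets."""
    baskets = {}
    for tid, item in conn.execute("SELECT tid, item FROM baskets ORDER BY tid"):
        baskets.setdefault(tid, set()).add(item)
    return list(baskets.values())


def frequent_pairs(baskets, min_support):
    """Step 2: count item pairs in the cache; keep pairs found in at least
    min_support baskets."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy transaction table (an assumption, not from the notes).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "bread"), (1, "milk"),
     (2, "bread"), (2, "milk"), (2, "beer"),
     (3, "milk"), (3, "beer"),
     (4, "bread"), (4, "milk")],
)

cache = load_baskets(conn)        # database -> cache
pairs = frequent_pairs(cache, 2)  # mine the cache outside the DBMS
print(pairs)
```

The only SQL involved is a single scan; all counting happens in application memory, which is why the notes list data transfer delays as a cost of vendor tools built this way.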
Inductive DBMSs vs. Vendor Extensions
- Imielinski & Mannila introduced the notion of:
  - A high-level data mining query language for DBMSs
  - Optimization techniques for inductive DBMSs
- A new research field:
  - MSQL, DMQL, Mine Rule: DM query languages
  - Performance and generality remain an open problem
- DBMS vendors:
  - Ad hoc approaches based on mining libraries

DBMS Extensions: DB2 Intelligent Miner
- Model creation
- Training:
    CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS',
        'TASK', 'ID', 'HeartClasTask',
        'IDMMX.CLASSIFMODELS',
        'MODEL', 'MODELNAME', 'HeartClasModel');
- Prediction
- Stored procedures and virtual mining views
- Most of the implementation is outside the DBMS (cache mining)
- Data transfer delays
http://www-306.ibm.com/software/data/iminer/

Oracle Data Miner
- Algorithms:
  - Adaptive Naïve Bayes
  - SVM regression
  - K-means clustering
  - Association rules, text mining, etc.
- PL/SQL with extensions for mining
- Models as first-class objects:
  - Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
http://www.oracle.com/technology/products/bi/odm/index.html
21-Mar-08

OLE DB for DM by Microsoft
- Model creation
- Descriptive phase
- Prediction joins
- Other features:
  - Nested cases
- PMML: a descriptive XML language for exchanging information between systems
http://research.microsoft.com/dmx/DataMining/

OLE DB for DM (DMX) (cont.)
- Mining objects as first-class objects
- Schema rowsets:
  - Mining_Models
  - Mining_Model_Content
  - Mining_Functions
- Other features:
  - Column value distribution
  - Nested cases
http://research.microsoft.com/dmx/DataMining/

OLE DB for DM (DMX): 3 steps
- Model creation:
    Create mining model MemCard_Pred (
        CustomerId long key,
        Age long continuous,
        Profession text discrete,
        Income long continuous,
        Risk text discrete predict)
    Using Microsoft_Decision_Trees;
- Training:
    Insert into MemCard_Pred
    OpenRowSet("'sqloledb', 'sa', 'mypass'",
        'SELECT CustomerId, Age, Profession, Income, Risk from Customers')
- Prediction join:
    Select C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
    From MemCard_Pred AS MP Prediction Join Customers AS C
    Where MP.Profession = C.Profession
        AND MP.Income = C.Income
        AND MP.Age = C.Age;

Defining a Mining Model
- E.g., a model to predict students' plans to attend college
- A definition specifies:
  - The format of "training cases" (top-level entity)
  - Attributes, input/output type, distribution
  - Algorithms and parameters
- Example:
    CREATE MINING MODEL CollegePlanModel (
        StudentID      LONG  KEY,
        Gender         TEXT  DISCRETE,
        ParentIncome   LONG  NORMAL CONTINUOUS,
        Encouragement  TEXT  DISCRETE,
        CollegePlans   TEXT  DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees

Training
    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')

Prediction Join
    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
[Figure: CPModel (ID, Gender, IQ, Plan) prediction-joined with NewStudents (ID, Gender, IQ)]

Summary of Vendors' Approaches
- Built-in library of mining methods
- Script language or GUI tools
- Limitations:
  - Closed systems (internals hidden from users)
  - Adding new algorithms or customizing old ones is difficult
  - Poor integration with SQL
  - Limited interoperability across DBMSs
- Predictive Model Markup Language (PMML) as a palliative

PMML
- Predictive Model Markup Language
- XML-based language for vendor-independent definition of statistical and data mining models
- Share models among PMML-compliant products
- A descriptive language
- Supported by all major vendors

PMML Example
[Figure: PMML example]

Much Competition
- Vendors:
  - SAS Institute (Enterprise Miner)
  - IBM (DB2 Intelligent Miner for Data)
  - Oracle (ODM option to Oracle 10g)
  - SPSS (Clementine)
  - Unica Technologies, Inc. (Pattern Recognition Workbench)
  - Insightful (Insightful Miner)
  - KXEN (Analytic Framework)
  - Prudsys (Discoverer and its family)
  - Microsoft (SQL Server 2005)
  - Angoss (KnowledgeServer and its family)
  - DBMiner (DB2)
- Platforms: IBM, Oracle, SAS
- Tools: SPSS, Angoss, KXEN, Megaputer, Fair Isaac, Insightful

Stand-Alone Systems
- WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.
- It provides many different machine learning algorithms.
- Applicable to generic data described in Attribute-Relation File Format (ARFF)

Weka
- A comprehensive set of DM algorithms and tools
- Generic algorithms over arbitrary data sets
- Independent of the number of columns in tables
- Open and extensible system based on Java
- Also free …
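To make the point about generic data in ARFF concrete, here is a minimal sketch of a reader for the basic ARFF layout: a header of `@relation` and `@attribute` declarations followed by comma-separated rows under `@data`. It is not Weka's own parser; it assumes only numeric and nominal attributes, with no quoting, sparse rows, dates, or string attributes. The sample relation and its values are hypothetical, loosely modeled on the college-plans example earlier in these notes.

```python
NUMERIC_TYPES = ("numeric", "real", "integer")


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    attributes is a list of (name, kind); kind is a type name such as
    "numeric", or a list of allowed values for a nominal attribute.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":               # ARFF marker for a missing value
                    row[name] = None
                elif kind in NUMERIC_TYPES:
                    row[name] = float(value)
                else:
                    row[name] = value
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):           # nominal: {v1, v2, ...}
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical sample data (an assumption, not from the notes).
SAMPLE = """\
% toy training cases
@relation college_plans
@attribute Gender {male, female}
@attribute ParentIncome numeric
@attribute CollegePlans {yes, no}
@data
male, 45000, yes
female, 62000, ?
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes)
print(rows)
```

Because the attribute list travels with the data, the same code handles a table with any number of columns, which is the sense in which the notes call the algorithms independent of the number of columns.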