KDD systems & DBMS

Data Mining Systems and Languages
CS240A Notes (slide footer date: 21-Mar-08)

Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline, from source to result: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

DM Experience for DBMS: Dreams vs. Reality
- Decision support and business intelligence:
  - OLAP & data warehouses: a resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics)
  - Relational DBMS extensions for DM queries: a flop
  - OR-DBMSs do not fare much better [Sarawagi '98].
- Imielinski & Mannila [CACM '96] proposed a 'high-road' approach, calling for a quantum leap in functionality based on:
  - Simple declarative extensions of SQL for Data Mining (DM)
  - Efficiency through DM query optimization techniques (yet to be invented)
- The research area of Inductive DBMS was thus born, producing
  - Interesting language work: DMQL, Mine Rule, MSQL, …
  - Implementation technology that lacks generality and suffers from performance limitations
  - Real questions about whether optimizers will ever take us there.

DBMS Limitations
- DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions
- Extending DBMSs for mining has proven much harder
  - Limited expressive power and flexibility of the languages
- Apriori in DB2 [Sarawagi '98]
  - Because of the lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining
- Cache mining: move data from the database to a cache and then use PL algorithms to mine the cache.

Mining Systems Desiderata
- Problem: How to efficiently support the vast variety of online mining algorithms in an integrated framework?
  - Generality over a wide spectrum of mining tasks
  - Ease of use for naïve users, plus flexibility and customizability for experts
  - Efficiency, scalability
- Databases: where the data is. But DBMSs do not support KDD tasks well.
- Three approaches:
  1. Inductive DBMS
  2. Commercial DBMS extensions
  3. Dedicated KDD systems with DBMS connections.

Inductive DBMSs vs. Vendor Extensions
Imielinski & Mannila introduced the notion of
- A high-level Data Mining Query Language for DBMS
- Optimization techniques for Inductive DBMS
A new research field
- MSQL, DMQL, Mine Rule: DM query languages
- Performance and generality: an open problem.
DBMS vendors
- Ad hoc approaches based on mining libraries

DBMS extensions: DB2 Intelligent Miner
- Model creation
- Training
    CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask',
        'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel');
- Prediction
- Stored procedures and virtual mining views
- Most of the implementation is outside the DBMS (cache mining)
  - Data transfer delays
- http://www-306.ibm.com/software/data/iminer/

Oracle Data Miner
- Algorithms
  - Adaptive Naïve Bayes
  - SVM regression
  - K-means clustering
  - Association rules, text mining, etc.
- PL/SQL with extensions for mining
- Models as first-class objects
  - Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
- http://www.oracle.com/technology/products/bi/odm/index.html

OLE DB for DM by Microsoft
- Model creation
- Descriptive phase
- Prediction joins
- Other features
  - Nested cases
- http://research.microsoft.com/dmx/DataMining/
- PMML: a descriptive XML language for exchanging information between systems

OLE DB for DM (DMX) (cont.)
- Mining objects as first-class objects
- Schema rowsets
  - Mining_Models
  - Mining_Model_Content
  - Mining_Functions
- Other features
  - Column value distribution
  - Nested cases
- http://research.microsoft.com/dmx/DataMining/

OLE DB for DM (DMX): 3 steps
- Model creation
    CREATE MINING MODEL MemCard_Pred (
        CustomerId LONG KEY,
        Age LONG CONTINUOUS,
        Profession TEXT DISCRETE,
        Income LONG CONTINUOUS,
        Risk TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees;
- Training
    INSERT INTO MemCard_Pred
    OPENROWSET("'sqloledb', 'sa', 'mypass'",
        'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')
- Prediction join
    SELECT C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
    FROM MemCard_Pred AS MP
    PREDICTION JOIN Customers AS C
    ON MP.Profession = C.Profession AND MP.Income = C.Income AND MP.Age = C.Age;

Defining a Mining Model
E.g., a model to predict students' plans to attend college
- The format of "training cases" (top-level entity)
- Attributes, input/output type, distribution
- Algorithms and parameters
- Example
    CREATE MINING MODEL CollegePlanModel (
        StudentID LONG KEY,
        Gender TEXT DISCRETE,
        ParentIncome LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees

Training
    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')

Prediction Join
    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
[Figure: the model CPModel (columns ID, Gender, IQ, Plan) joined with the table NewStudents (columns ID, Gender, IQ)]

Summary of Vendors' Approaches
- Built-in library of mining methods
  - Script language or GUI tools
- Limitations
  - Closed systems (internals hidden from users)
  - Adding new algorithms or customizing old ones: difficult
  - Poor integration with SQL
  - Limited interoperability across DBMSs
- Predictive Model Markup Language (PMML) as a palliative

PMML
- Predictive Model Markup Language
  - XML-based language for vendor-independent definition of statistical and data mining models
  - Share models among PMML-compliant products
  - A descriptive language
- Supported by all major vendors

PMML Example
[Figure: PMML example listing]

Much Competition
Vendors
- SAS Institute (Enterprise Miner)
- IBM (DB2 Intelligent Miner for Data)
- Oracle (ODM option to Oracle 10g)
- SPSS (Clementine)
- Unica Technologies, Inc. (Pattern Recognition Workbench)
- Insightful (Insightful Miner)
- KXEN (Analytic Framework)
- Prudsys (Discoverer and its family)
- Microsoft (SQL Server 2005)
- Angoss (KnowledgeServer and its family)
- DBMiner (DB2)
Platforms: IBM, Oracle, SAS
Tools: SPSS, Angoss, KXEN, Megaputer, Fair Isaac, Insightful

Stand-Alone Systems
- WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.
- It provides many different machine learning algorithms
- Applicable to generic data described in Attribute-Relation File Format (ARFF)

Weka
- A comprehensive set of DM algorithms and tools.
- Generic algorithms over arbitrary data sets, independent of the number of columns in tables.
- Open and extensible system based on Java.
* Also free …
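
ARFF sketch
The notes say Weka's algorithms apply to generic data described in ARFF, whatever the number of columns. The sketch below is one way to picture that: a small Python helper that writes an ARFF document (an @relation line, one @attribute line per column, then @data rows). The helper name to_arff, the relation name, and the sample rows are hypothetical and chosen only for illustration; the column names are borrowed from the CollegePlanModel example. It is a simplification, not Weka's own tooling: values containing spaces or commas would need quoting, which is not handled here.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    'numeric' or 'string', or a list of allowed values for a nominal
    attribute. Assumes values need no quoting (no spaces or commas).
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: enumerate the allowed values in braces.
            spec = "{" + ",".join(kind) + "}"
        else:
            spec = kind
        lines.append("@attribute {} {}".format(name, spec))
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row width does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical schema and data, loosely following the CollegePlanModel columns.
ATTRIBUTES = [
    ("Gender", ["male", "female"]),
    ("ParentIncome", "numeric"),
    ("Encouragement", ["encouraged", "not_encouraged"]),
    ("CollegePlans", ["yes", "no"]),
]
ROWS = [
    ("male", 42000, "encouraged", "yes"),
    ("female", 31000, "not_encouraged", "no"),
]

ARFF_TEXT = to_arff("college_plans", ATTRIBUTES, ROWS)
print(ARFF_TEXT)
```

Because the header lists the attributes explicitly, the same helper works for tables of any width, which is the property the Weka slide highlights.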