Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
DAT204 Introduction to Data Mining with SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation Agenda • • • • What is Data Mining The Data Mining Market OLE DB for Data Mining Overview of the Data Mining Features in SQL Server 2000 • Demo • Q&A What Is Data Mining? What is DM? • A process of data exploration and analysis using automatic or semi-automatic means – Techniques origin from Machine Learning, statistics and database – “Exploring data” – scanning samples of known facts about “cases”. – “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules… • Once the “knowledge” is extracted it: – Can be browsed • Provides a very useful insight on the cases behavior – Can be used to predict values of other cases • Can serve as a key element in closed loop analysis What drives high school students to attend college? The deciding factors for high school students to attend college are… All Students Attend College: 55% Yes 45% No IQ ? IQ=High IQ=Low Attend College: 79% Yes 21% No Attend College: 45% Yes 55% No Wealth Parents Encourage? Wealth = True Attend College: 94% Yes 6% No Parents Wealth = False Encourage = Yes Attend College: 69% Yes 21% No Attend College: 70% Yes 30% No Parents Encourage = No Attend College: 31% Yes 69% No Business Oriented DM Problems • Targeted ads – “What banner should I display to this visitor?” • Cross sells – “What other products is this customer likely to buy? • Fraud detection – “Is this insurance claim a fraud?” • Churn analysis – “Who are those customers likely to churn?” • Risk Management – “Should I approve the loan to this customer?” • … Mining Process - Illustrated Mining Model Training Data Data Mining Model To Predict DM Engine DM Engine Mining Model Predicted Data The Data Mining Market The $$$: Market Size • DM Tools Market: – 1999: $341.3M – 2000: $455.1M – 2001: $449.5M * IDC The Players • Leading vendors – – – – – SAS SPSS IBM Angoss Hundreds of smaller vendors offering DM algorithms… • Oracle –Thinking Machines acquisition The Products • End-to-end horizontal DM tools – Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting…. • The customer is the power-analyst – PhD in statistics is usually required… • Closed tools – no standard API – Total vendor lock-in – Limited integration with applications • DM an “outsider” in the Data Warehouse • Extensive consulting required • Sky rocketing prices – $60K+ for a single user license What the analysts say… • “Stand-alone Data Mining Is Dead” Forrester • “The demise of [stand alone] data mining” – Gartner The Microsoft Approach DataPro Users Survey 1999-2001 “Data mining will be the fastestgrowing BI technology…” Market Size of BI * IDC SQL Server 2000 - The Analysis Platform • SQL 2000 provides a complete Analysis Platform – Not an isolated, stand alone DM product • Platform means: – Standard based DM API’s (OLE DB for DM) for applications development – Integrated vision for all technologies, tools – Extensible – Scaleable Data Flow DM Apps DM OLTP DW Reports & Analysis OLAP Analysis Services 2000 – Components Tree View Control DM Wizards Manager UI Cluster View Control DM DTS Task DSO DMM Lift Chart Control OLE DB OLAP DM Sample Query Tool Analysis Server OLAP DM Engine Engine Client OLAP DM Engine Engine (local) (local) OLE DB for Data Mining… Why OLE DB for DM? Make DM a mass market technology by: • Leverage existing technologies and knowledge – SQL and OLE DB • Common industry wide concepts and data presentation • Changing DM market perception from “proprietary” to “open” • Increasing the number of players: – Reduce the cost and risk of becoming a consumer – one tool works with multiple providers – Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering Integration With RDBMS • Customers would like to – Build DM models from within their RDBMS – Train the models directly off their relational tables – Perform predictions as relational queries (tables in, tables out) – Feel that DM is a native part of their database. • Therefore… – Data mining models are relational objects – All operations on the models are relational – The language used is SQL (w/Extensions) • The effect: every DBA and VB developer can become a DM developer Creating a Data Mining Model (DMM) Identifying the “Cases” • DM algorithms analyze “cases” • The “case” is the entity being categorized and classified • Examples – Customer credit risk analysis: Case = Customer – Product profitability analysis: Case = Product – Promotion success analysis: Case = Promotion • Each case encapsulate all we know about the entity A Simple Set of Cases StudentI Gende D r Parent Income 1 Male 23400 2 Female 79200 3 Male 42000 IQ Encouragement 120 Not Encouraged 90 Encouraged 105 Not Encouraged College Plans No Yes Yes More Complicated Cases Cust ID Age Marit al Statu s 1 35 M 2 3 20 57 S M Favorite Movies IQ 2 3 2 Title Score Star Wars 8 Toy Story 9 Terminator 7 Star Wars 7 Braveheart 7 The Matrix 10 Sixth Sense 9 Casablanca 10 A DMM is a Table! • A DMM structure is defined as a table – Training a DMM means inserting data (pattern) into the table – Predicting from a DMM means querying the table • All information describing the case are contained in columns Creating a Mining Model CREATE MINING MODEL [Plans Prediction] ( StudentID LONG KEY, Gender TEXT DISCRETE, ParentIncome LONG CONTINUOUS, IQ DOUBLE CONTINUOUS, Encouragement TEXT DISCRETE, CollegePlans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees Creating a mining model with nested table Create Mining Model MoviePrediction ( CutomerId long key, Age long continuous, Gender discrete, Education discrete, MovieList table predict ( MovieName text key ) ) using microsoft_decision_trees Training a DMM Training a DMM • Training a DMM means passing it data for which the attributes to be predicted are known – Multiple passes are handled internally by the provider! • Use an INSERT INTO statement • The DMM will not persist the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules) INSERT [INTO] <mining model name> [(columns list)] <source data query> INSERT INTO INSERT INTO [Plans Prediction] ( StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans ) SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans] FROM [Students] When Insert Into Is Done… • The DMM is trained – The model can be retrained – Content (rules, trees, formulas) can be explored – OLE DB Schema rowset – SELECT * FROM <dmm>.CONTENT – XML string (PMML) • Prediction queries can be executed Predictions What are Predictions? • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values • Predictions = queries – The syntax is SQL - like – The output is a rowset • In order to predict you need: – Input data set – A trained DMM – Binding (mapping) information between the input data and the DMM The Truth Table Concept Gende r Parent Income IQ Colleg Probabilit Encouragement e y Plans Male 20000 85 Not Encouraged No 85% Male 20000 85 Not Encouraged Yes 15% Male 20000 85 Encouraged No 60% Male 20000 85 Encouraged Yes 40% Male 20000 90 Not Encouraged No 80% Male 20000 90 Not Encouraged Yes 20% Male 20000 90 Encouraged No 58% … Prediction StudentI D Gender Parent Income IQ Encouragement 1 Male 43000 85 Not Encouraged 2 Male 20000 135 Not Encouraged 3 Female 25000 105 Encouraged 4 Male 96000 100 Encouraged 5 Female 56000 125 Not Encouraged 6 Female 46000 90 Not Encouraged It’s a JOIN! Gender Parent Income IQ Encouragement College Plans Probability Male 20000 85 Not Encouraged No 85% Male 20000 85 Not Encouraged Yes 15% Male 20000 85 Encouraged No 60% Male 20000 85 Encouraged Yes 40% Male 20000 90 Not Encouraged No 80% Male 20000 90 Not Encouraged Yes 20% Male 20000 90 Encouraged No 58% Male 20000 90 Encouraged Yes 42% Male 20000 95 Not Encouraged No 78% Male 20000 95 Not Encouraged Yes 22% Male 20000 95 Encouraged No 45% The Prediction Query Syntax SELECT <columns to return or predict> FROM <dmm> PREDICTION JOIN <input data set> ON <dmm column> = <dmm input column>… Example SELECT [New Students].[StudentID], [Plans Prediction].[CollegePlans], PredictProbability([CollegePlans]) FROM [Plans Prediction] PREDICTION JOIN [New Students] ON [Plans Prediction].[Gender] = [New Students].[Gender] AND [Plans Prediction].[IQ] = [New Students].[IQ] AND ... Demo OLE DB for Data Mining Defines API Consumer Consumer … OLE DB for DM (API) Provider Provider Provider OLE DB Cube RDBMS Misc. Data Source … OLEDB for DM Configuration Options Demo Consumers OLEDB for DM Providers MS Analysis Manager 1 MS DM Provider ANGOSS Controls 2 3 4 ANGOSS DM Provider Demo on OLE DB for DM API using Angoss Controls and Provider For more info… • DM URL – www.microsoft.com/data/oledb – www.microsoft.com/data/oledb/DMResKit.htm • News Group: – Microsoft.public.SQLserver.datamining – Communities.msn.com/AnalysisServicesDataMining • White papers: – Performance paper: www.unisys.com/windows2000/default-07.asp www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp Questions ?