Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
欢迎光临 微软SQL数据挖掘/数据仓库 技术研讨会 今日安排 • 微软SQL数据挖掘技术概述 − 左洪 微软公司 • 数据仓库在电信的应用 − 贝志城 明天高科 • 数据挖掘在CRM中的应用 − 王立军 中圣公司 • 灵通IT Service维护管理服务系统 – 邹雄文 广州灵通 Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司 Agenda What is Data Mining The Data Mining Market OLE DB for Data Mining Overview of the Data Mining Features in SQL Server 2000 Q&A What Is Data Mining? What is DM? A process of data exploration and analysis using automatic or semiautomatic means “Exploring data” – scanning samples of known facts about “cases”. “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules… Once the “knowledge” is extracted it: Can be browsed Provides a very useful insight on the cases behavior Can be used to predict values of other cases Can serve as a key element in closed loop analysis What drive high school students to attend college? The deciding factors for high school students to attend college are… All Students Attend College: 55% Yes 45% No IQ ? IQ=High IQ=Low Attend College: 79% Yes 11% No Attend College: 45% Yes 55% No Wealth Parents Encourage? Wealth = True Attend College: 94% Yes 6% No Parents Wealth = False Encourage = Yes Attend College: 69% Yes 21% No Attend College: 70% Yes 30% No Parents Encourage = No Attend College: 31% Yes 69% No Business Oriented DM Problems Targeted ads Cross sells “Who are those customers likely to churn?” Risk Management “Is this insurance claim a fraud?” Churn analysis “What other products is this customer likely to buy? Fraud detection “What banner should I display to this visitor?” … “Should I approve the loan to this customer?” Http://www.tunes.com Mining Process - Illustrated Mining Model Training Data Data Mining Model To Predict DM Engine DM Engine Mining Model Predicted Data The Data Mining Market The $$$: Y2000 Market Size DM Tools Market: $250M 40% - license fees 60% consulting * Gartner The Players Leading vendors SAS SPSS IBM Hundreds of smaller vendors offering DM algorithms… Oracle –Thinking Machines acquisition The Products End-to-end Data Mining tools The customer is the power-analyst PhD in statistics is usually required… Closed tools – no standard API Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting…. Total vendor lock-in Limited integration with applications DM an “outsider” in the Data Warehouse Extensive consulting required Sky rocketing prices $60K+ for a single user license What the analysts say… “Stand-alone Data Mining Is Dead” - Forrester “The demise of [stand alone] data mining” – Gartner The Microsoft Approach DataPro Users Survey 1999-2001 “Data mining will be the fastestgrowing BI technology…” The $$$: 2000 Market Size DM Applications Market Size: $1.5B * IDC SQL Server 2000 - The Analysis Platform SQL 2000 provides a complete Analysis Platform Not an isolated, stand alone DM product Platform means: The infrastructure for applications Not an application by itself Integrated vision for all technologies, tools Standard based API’s (OLE DB for DM) Extensible Scaleable Data Flow DM Apps DM OLTP DW Reports & Analysis OLAP Analysis Services 2000 Architecture DM Wizards Manager UI DM DTS Task DSO DMM OLE DB OLAP DM Analysis Server OLAP DM Engine Engine Ext. Client OLAP DM Engine Engine (local) (local) Ext. OLE DB for Data Mining… Why OLE DB for DM? Make DM a mass market technology by: Leverage existing technologies and knowledge SQL and OLE DB Common industry wide concepts and data presentation Changing DM market perception from “proprietary” to “open” Increasing the number of players: Reduce the cost and risk of becoming a consumer – one tool works with multiple providers Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering Dramatically increase the number of DM developers Integration With RDBMS Customers would like to Therefore… Build DM models from within their RDBMS Train the models directly off their relational tables Perform predictions as relational queries (tables in, tables out) Feel that DM is a native part of their database. Data mining models are relational objects All operations on the models are relational The language used is SQL (w/Extensions) The effect: every DBA and VB developer can become a DM developer Creating a Data Mining Model (DMM) Identifying the “Cases” DM algorithms analyze “cases” The “case” is the entity being categorized and classified Examples Customer credit risk analysis: Case = Customer Product profitability analysis: Case = Product Promotion success analysis: Case = Promotion Each case encapsulate all we know about the entity A Simple Set of Cases StudentI Gender D 1 Male 2 3 Female Male Parent Income 23400 79200 42000 IQ Encouragement 120 Not Encouraged 90 Encouraged 105 Not Encouraged College Plans No Yes Yes More Complicated Cases Cust ID Age 1 35 Marit al Statu s M 2 20 S 3 57 M Favorite Movies IQ Title Score 2 Star Wars Toy Story Terminator 8 9 7 3 Star Wars 7 Braveheart 7 The Matrix 10 Sixth Sense 9 Casablanca 10 2 A DMM is a Table! A DMM structure is defined as a table Training a DMM means inserting data into the table Predicting from a DMM means querying the table All information describing the case are contained in columns Creating a Mining Model CREATE MINING MODEL [Plans Prediction] ( StudentID LONG KEY, Gender TEXT DISCRETE, ParentIncome LONG CONTINUOUS, IQ DOUBLE CONTINUOUS, Encouragement TEXT DISCRETE, CollegePlans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees Creating a mining model with nested table Create Mining Model MoviePrediction ( CutomerId long key, Age long continuous, Gender discrete, Education discrete, MovieList table predict ( MovieName text key ) ) using microsoft_decision_trees Training a DMM Training a DMM Training a DMM means passing it data for which the attributes to be predicted are known Multiple passes are handled internally by the provider! Use an INSERT INTO statement The DMM will not persist the inserted data Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules) INSERT [INTO] <mining model name> [(columns list)] <source data query> INSERT INTO INSERT INTO [Plans Prediction] ( StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans ) SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans] FROM [CollegePlans] When Insert Into Is Done… The DMM is trained The model can be retrained Content (rules, trees, formulas) can be explored OLE DB Schema rowset SELECT * FROM <dmm>.CONTENT XML string (PMML) Prediction queries can be executed Predictions What are Predictions? Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values Predictions = queries The syntax is SQL - like The output is a rowset In order to predict you need: Input data set A trained DMM Binding (mapping) information between the input data and the DMM Specification of what to predict The Truth Table Concept Gende r Parent Income IQ Male 20000 85 Colleg Encouragement e Probability Plans Not Encouraged No 85% Male 20000 85 Not Encouraged Yes 15% Male 20000 85 Encouraged No 60% Male 20000 85 Encouraged Yes 40% Male 20000 90 Not Encouraged No 80% Male 20000 90 Not Encouraged Yes 20% Male 20000 90 Encouraged 58% … No Prediction Student ID Gender Parent Income IQ Encouragement 1 Male 43000 85 Not Encouraged 2 Male 20000 135 Not Encouraged 3 Female 25000 105 Encouraged 4 Male 96000 100 Encouraged 5 Female 56000 125 Not Encouraged 6 Female 46000 90 Not Encouraged It’s a JOIN! Gender Parent Income IQ Encouragement College Plans Probability Male 20000 85 Not Encouraged No 85% Male 20000 85 Not Encouraged Yes 15% Male 20000 85 Encouraged No 60% Male 20000 85 Encouraged Yes 40% Male 20000 90 Not Encouraged No 80% Male 20000 90 Not Encouraged Yes 20% Male 20000 90 Encouraged No 58% Male 20000 90 Encouraged Yes 42% Male 20000 95 Not Encouraged No 78% Male 20000 95 Not Encouraged Yes 22% Male 20000 95 Encouraged No 45% The Prediction Query Syntax SELECT <columns to return or predict> FROM <dmm> PREDICTION JOIN <input data set> ON <dmm column> = <dmm input column>… Example SELECT [New Students].[StudentID], [Plans Prediction].[CollegePlans], PredictProbability([CollegePlans]) FROM [Plans Prediction] PREDICTION JOIN [New Students] ON [Plans Prediction].[Gender] = [New Students].[Gender] AND [Plans Prediction].[IQ] = [New Students].[IQ] AND ... OLE DB DM Sample Provider with Source All required OLE DB objects, such as session, command, and rowset The OLE DB for Data Mining syntax parser Tokenization of input data Query processing engine A sample Naïve Bayes algorithm Model persistence in XML and binary formats Available at www.microsoft.com/data/oledb/DMResKit.htm Integrated OLAP and DM Analysis Why Use DM with OLAP Relational DM is designed for: Reports of patterns Batch predictions fed into an OLTP system Real-time singleton prediction in an operational environment OLAP is designed for interactive analysis by a knowledge worker Consistent and convenient navigational model Pre-aggregations of OLAP allow faster performance Understanding DM Content – Decision Trees Customers having high debt and college education: All Customers Filter([Individual Customers].Members, Credit Risk: Customers.CurrentMember.Properties(“Debt”) = “High” 65% Good And Customers.CurrentMember.Properties(“Education”) = 35% Bad “College”) Debt ? Debt=Low Debt=High Customers having low debt and are self employed: Credit Risk: Good Filter([Individual89% Customers].Members, 11% Bad Customers.CurrentMember.Properties(“Debt”) = Low EmployAnd Customers.CurrentMember.Properties(“Employment -ment Type? Type”) = “Self Employed”) ET = Salaried Credit Risk: 94% Good 6% Bad ET = Self Employed Credit Risk: 79% Good 21% Bad Credit Risk: 45% Good 55% Bad Education= College Credit Risk: 70% Good 30% Bad Education? Education= High School Credit Risk: 31% Good 69% Bad …Equivalent DM Dimension Custom Roll-up Credit Risk - Good = 65%, Bad = 35% Aggregate(Filter( … Good = 89%, Bad = 11% Customers with low debt and self employed Aggregate(Filter( … Good = 79%, Bad = 21% Customers with low debt and salaried Aggregate(Filter( … Good = 94%, Bad = 6% Aggregate(Filter( … Good = 45%, Bad = 55% Customers with high debt and college educationAggregate(Filter( Good = 70%, Bad = 30% Aggregate(Filter( Customers with high debt and high school education … Good = 31%, Bad = 69% All Customers Customers with low debt Customers with high debt … Tree = Dimension Every node on the tree is a dimension member The node statistics are the member properties All members are calculated Formula aggregates the case dimension members that apply to this node The MDX is generated by the DM algorithm Analysis Service will automatically generate the calculated dimension based on the DM content and also a virtual cube Applies to Classification (decision trees) Segmentation (clusters) Browsing the Virtual Cube Pivot the DM dimension: WA All Customers Customers with low debt Customers with high debt OR CA 3200 2500 8000 2320 1503 4300 880 997 4700 Credit Risk: 70% Good, 30% Bad Customers … college 320 450 2310 Customers … high school 560 547 2390 Predictions You might want to view predictions for each case For example: What is the expected profitability of a product? What is the credit risk of a specific customer? What are the products this customer is likely to buy? All of those predictions are available through MDX calculated members Singleton query is created automatically Prediction Calculated Member Measures.[Probability of High Credit Risk]: PREDICT(Customers.CurrentMember, “Credit Risk Model”, “PredictionProbability( PredictionHistogram(“Credit Risk”), ‘High’)“ ) Predictions Example Joe Smith Probability of High Credit Risk 73% Probability of Low Credit Risk 27% John Dow 68% 32% William Clington 45% 55% Robert Maxwell 98% 2% Denis Rodman 81% 19% Questions ? E-Mail: [email protected] http://www.microsoft.com/china/sql