Download Male

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
欢迎光临
微软SQL数据挖掘/数据仓库
技术研讨会
今日安排
• 微软SQL数据挖掘技术概述
− 左洪 微软公司
• 数据仓库在电信的应用
− 贝志城 明天高科
• 数据挖掘在CRM中的应用
− 王立军 中圣公司
• 灵通IT Service维护管理服务系统
– 邹雄文 广州灵通
Introduction to Data Mining
with SQL Server 2000
左洪
高级产品市场经理
微软(中国)有限公司
Agenda





What is Data Mining
The Data Mining Market
OLE DB for Data Mining
Overview of the Data Mining
Features in SQL Server 2000
Q&A
What Is Data Mining?
What is DM?

A process of data exploration and
analysis using automatic or semiautomatic means


“Exploring data” – scanning samples of known
facts about “cases”.
“knowledge”: Clusters, Rules, Decision trees, Equations,
Association rules…

Once the “knowledge” is extracted it:

Can be browsed


Provides a very useful insight on the cases behavior
Can be used to predict values of other cases

Can serve as a key element in closed loop analysis
What drive high school
students to attend college?
The deciding factors for high
school students to attend college
are…
All Students
Attend College:
55% Yes
45% No
IQ ?
IQ=High
IQ=Low
Attend College:
79% Yes
11% No
Attend College:
45% Yes
55% No
Wealth
Parents
Encourage?
Wealth = True
Attend College:
94% Yes
6% No
Parents
Wealth = False Encourage = Yes
Attend College:
69% Yes
21% No
Attend College:
70% Yes
30% No
Parents
Encourage = No
Attend College:
31% Yes
69% No
Business Oriented DM Problems

Targeted ads


Cross sells


“Who are those customers likely to churn?”
Risk Management


“Is this insurance claim a fraud?”
Churn analysis


“What other products is this customer likely to buy?
Fraud detection


“What banner should I display to this visitor?”
…
“Should I approve the loan to this customer?”
Http://www.tunes.com
Mining Process - Illustrated
Mining Model
Training Data
Data
Mining Model To Predict
DM
Engine
DM
Engine
Mining Model
Predicted Data
The Data Mining Market
The $$$: Y2000 Market Size

DM Tools Market: $250M


40% - license fees
60% consulting
* Gartner
The Players

Leading vendors





SAS
SPSS
IBM
Hundreds of smaller vendors offering DM
algorithms…
Oracle –Thinking Machines acquisition
The Products

End-to-end Data Mining tools


The customer is the power-analyst





PhD in statistics is usually required…
Closed tools – no standard API


Extraction, Cleansing, Loading, Modeling, Algorithms (dozens),
Analysts workbench, Reporting, Charting….
Total vendor lock-in
Limited integration with applications
DM an “outsider” in the Data Warehouse
Extensive consulting required
Sky rocketing prices

$60K+ for a single user license
What the analysts say…


“Stand-alone Data Mining Is Dead”
- Forrester
“The demise of [stand alone] data
mining” – Gartner
The Microsoft Approach
DataPro Users Survey
1999-2001
“Data mining will be the fastestgrowing BI technology…”
The $$$: 2000 Market Size

DM Applications Market Size: $1.5B
* IDC
SQL Server 2000 - The
Analysis Platform

SQL 2000 provides a complete Analysis
Platform


Not an isolated, stand alone DM product
Platform means:

The infrastructure for applications





Not an application by itself
Integrated vision for all technologies, tools
Standard based API’s (OLE DB for DM)
Extensible
Scaleable
Data Flow
DM
Apps
DM
OLTP
DW
Reports
&
Analysis
OLAP
Analysis Services 2000 Architecture
DM Wizards
Manager
UI
DM DTS Task
DSO
DMM
OLE DB OLAP DM
Analysis Server
OLAP DM
Engine Engine
Ext.
Client
OLAP DM
Engine Engine
(local) (local)
Ext.
OLE DB for Data Mining…
Why OLE DB for DM?
Make DM a mass market technology by:
 Leverage existing technologies and knowledge




SQL and OLE DB
Common industry wide concepts and data
presentation
Changing DM market perception from
“proprietary” to “open”
Increasing the number of players:



Reduce the cost and risk of becoming a consumer – one tool
works with multiple providers
Reduce the cost and risk of becoming a provider – focus on
expertise and find many partners to complement offering
Dramatically increase the number of DM developers
Integration With RDBMS

Customers would like to





Therefore…




Build DM models from within their RDBMS
Train the models directly off their relational tables
Perform predictions as relational queries (tables
in, tables out)
Feel that DM is a native part of their database.
Data mining models are relational objects
All operations on the models are relational
The language used is SQL (w/Extensions)
The effect: every DBA and VB developer can
become a DM developer
Creating a Data Mining
Model (DMM)
Identifying the “Cases”



DM algorithms analyze “cases”
The “case” is the entity being categorized
and classified
Examples




Customer credit risk analysis: Case = Customer
Product profitability analysis: Case = Product
Promotion success analysis: Case = Promotion
Each case encapsulate all we know about
the entity
A Simple Set of Cases
StudentI
Gender
D
1
Male
2
3
Female
Male
Parent
Income
23400
79200
42000
IQ
Encouragement
120 Not
Encouraged
90 Encouraged
105 Not
Encouraged
College
Plans
No
Yes
Yes
More Complicated Cases
Cust
ID
Age
1
35
Marit
al
Statu
s
M
2
20
S
3
57
M
Favorite Movies
IQ
Title
Score
2
Star Wars
Toy Story
Terminator
8
9
7
3
Star Wars
7
Braveheart
7
The Matrix
10
Sixth Sense
9
Casablanca
10
2
A DMM is a Table!

A DMM structure is defined as a table



Training a DMM means inserting data into
the table
Predicting from a DMM means querying
the table
All information describing the case are
contained in columns
Creating a Mining Model
CREATE MINING MODEL [Plans Prediction]
(
StudentID
LONG
KEY,
Gender
TEXT
DISCRETE,
ParentIncome LONG
CONTINUOUS,
IQ
DOUBLE CONTINUOUS,
Encouragement TEXT
DISCRETE,
CollegePlans TEXT
DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
Creating a mining model
with nested table
Create Mining Model MoviePrediction
(
CutomerId long key,
Age long continuous,
Gender discrete,
Education discrete,
MovieList table predict (
MovieName text key
)
)
using microsoft_decision_trees
Training a DMM
Training a DMM

Training a DMM means passing it data for
which the attributes to be predicted are known




Multiple passes are handled internally by the
provider!
Use an INSERT INTO statement
The DMM will not persist the inserted data
Instead it will analyze the given cases and
build the DMM content (decision tree,
segmentation model, association rules)
INSERT [INTO] <mining model name>
[(columns list)]
<source data query>
INSERT INTO
INSERT INTO [Plans Prediction]
(
StudentID, Gender, ParentIncome, IQ,
Encouragement, CollegePlans
)
SELECT
[StudentID], [Gender],
[ParentIncome], [IQ],
[Encouragement], [CollegePlans]
FROM [CollegePlans]
When Insert Into Is Done…

The DMM is trained






The model can be retrained
Content (rules, trees, formulas) can be
explored
OLE DB Schema rowset
SELECT * FROM <dmm>.CONTENT
XML string (PMML)
Prediction queries can be executed
Predictions
What are Predictions?


Predictions apply the rules of a trained
model to a new set of data in order to
estimate missing attributes or values
Predictions = queries



The syntax is SQL - like
The output is a rowset
In order to predict you need:




Input data set
A trained DMM
Binding (mapping) information between the
input data and the DMM
Specification of what to predict
The Truth Table Concept
Gende
r
Parent
Income
IQ
Male
20000
85
Colleg
Encouragement e
Probability
Plans
Not Encouraged No
85%
Male
20000
85
Not Encouraged Yes
15%
Male
20000
85
Encouraged
No
60%
Male
20000
85
Encouraged
Yes
40%
Male
20000
90
Not Encouraged No
80%
Male
20000
90
Not Encouraged Yes
20%
Male
20000
90
Encouraged
58%
…
No
Prediction
Student
ID
Gender
Parent
Income
IQ
Encouragement
1
Male
43000
85
Not Encouraged
2
Male
20000
135
Not Encouraged
3
Female
25000
105
Encouraged
4
Male
96000
100
Encouraged
5
Female
56000
125
Not Encouraged
6
Female
46000
90
Not Encouraged
It’s a JOIN!
Gender
Parent
Income
IQ
Encouragement
College
Plans
Probability
Male
20000
85
Not Encouraged
No
85%
Male
20000
85
Not Encouraged
Yes
15%
Male
20000
85
Encouraged
No
60%
Male
20000
85
Encouraged
Yes
40%
Male
20000
90
Not Encouraged
No
80%
Male
20000
90
Not Encouraged
Yes
20%
Male
20000
90
Encouraged
No
58%
Male
20000
90
Encouraged
Yes
42%
Male
20000
95
Not Encouraged
No
78%
Male
20000
95
Not Encouraged
Yes
22%
Male
20000
95
Encouraged
No
45%
The Prediction Query
Syntax
SELECT <columns to return or predict>
FROM
<dmm> PREDICTION JOIN
<input data set>
ON <dmm column> = <dmm input
column>…
Example
SELECT [New Students].[StudentID],
[Plans Prediction].[CollegePlans],
PredictProbability([CollegePlans])
FROM
[Plans Prediction] PREDICTION JOIN
[New Students]
ON [Plans Prediction].[Gender] =
[New Students].[Gender] AND
[Plans Prediction].[IQ] =
[New Students].[IQ] AND ...
OLE DB DM Sample
Provider with Source







All required OLE DB objects, such as session,
command, and rowset
The OLE DB for Data Mining syntax parser
Tokenization of input data
Query processing engine
A sample Naïve Bayes algorithm
Model persistence in XML and binary formats
Available at
www.microsoft.com/data/oledb/DMResKit.htm
Integrated OLAP and DM
Analysis
Why Use DM with OLAP

Relational DM is designed for:




Reports of patterns
Batch predictions fed into an OLTP system
Real-time singleton prediction in an
operational environment
OLAP is designed for



interactive analysis by a knowledge worker
Consistent and convenient navigational
model
Pre-aggregations of OLAP allow faster
performance
Understanding DM
Content – Decision Trees
Customers having high debt and college education:
All Customers
Filter([Individual Customers].Members,
Credit Risk:
Customers.CurrentMember.Properties(“Debt”)
= “High”
65% Good
And Customers.CurrentMember.Properties(“Education”)
=
35% Bad
“College”)
Debt ?
Debt=Low
Debt=High
Customers having low debt and are self employed:
Credit Risk:
Good
Filter([Individual89%
Customers].Members,
11% Bad
Customers.CurrentMember.Properties(“Debt”) = Low
EmployAnd Customers.CurrentMember.Properties(“Employment
-ment Type?
Type”) = “Self Employed”)
ET = Salaried
Credit Risk:
94% Good
6% Bad
ET = Self
Employed
Credit Risk:
79% Good
21% Bad
Credit Risk:
45% Good
55% Bad
Education=
College
Credit Risk:
70% Good
30% Bad
Education?
Education=
High School
Credit Risk:
31% Good
69% Bad
…Equivalent DM Dimension
Custom
Roll-up
Credit Risk
-
Good = 65%, Bad =
35%
Aggregate(Filter(
…
Good = 89%, Bad =
11%
Customers with low debt and self employed
Aggregate(Filter(
…
Good = 79%, Bad =
21%
Customers with low debt and salaried
Aggregate(Filter(
…
Good = 94%, Bad =
6%
Aggregate(Filter(
…
Good = 45%, Bad =
55%
Customers with high debt and college educationAggregate(Filter(
Good = 70%, Bad =
30%
Aggregate(Filter(
Customers with high debt and high school education
…
Good = 31%, Bad =
69%
All Customers
Customers with low debt
Customers with high debt
…
Tree = Dimension



Every node on the tree is a dimension member
The node statistics are the member properties
All members are calculated




Formula aggregates the case dimension members
that apply to this node
The MDX is generated by the DM algorithm
Analysis Service will automatically generate the
calculated dimension based on the DM content
and also a virtual cube
Applies to


Classification (decision trees)
Segmentation (clusters)
Browsing the Virtual Cube

Pivot the DM dimension:
WA
All Customers
Customers with low debt
Customers with high debt
OR
CA
3200
2500
8000
2320
1503
4300
880
997
4700
Credit Risk: 70% Good, 30% Bad
Customers … college
320
450
2310
Customers … high school
560
547
2390
Predictions


You might want to view predictions for each
case
For example:





What is the expected profitability of a product?
What is the credit risk of a specific customer?
What are the products this customer is likely to buy?
All of those predictions are available through
MDX calculated members
Singleton query is created automatically
Prediction Calculated
Member
Measures.[Probability of High Credit Risk]:
PREDICT(Customers.CurrentMember,
“Credit Risk Model”,
“PredictionProbability(
PredictionHistogram(“Credit Risk”),
‘High’)“
)
Predictions Example
Joe Smith
Probability of
High Credit
Risk
73%
Probability of
Low Credit
Risk
27%
John Dow
68%
32%
William Clington
45%
55%
Robert Maxwell
98%
2%
Denis Rodman
81%
19%
Questions ?
E-Mail: [email protected]
http://www.microsoft.com/china/sql