Download Encouragement

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
DAT204
Introduction to Data Mining with
SQL Server 2000
ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation
Agenda
•
•
•
•
What is Data Mining
The Data Mining Market
OLE DB for Data Mining
Overview of the Data Mining Features in
SQL Server 2000
• Demo
• Q&A
What Is Data Mining?
What is DM?
• A process of data exploration and analysis
using automatic or semi-automatic means
– Techniques origin from Machine Learning, statistics and
database
– “Exploring data” – scanning samples of known facts
about “cases”.
– “knowledge”: Clusters, Rules, Decision trees, Equations,
Association rules…
• Once the “knowledge” is extracted it:
– Can be browsed
• Provides a very useful insight on the cases behavior
– Can be used to predict values of other cases
• Can serve as a key element in closed loop analysis
What drives high school students
to attend college?
The deciding factors for high school
students to attend college are…
All Students
Attend College:
55% Yes
45% No
IQ ?
IQ=High
IQ=Low
Attend College:
79% Yes
21% No
Attend College:
45% Yes
55% No
Wealth
Parents
Encourage?
Wealth = True
Attend College:
94% Yes
6% No
Parents
Wealth = False Encourage = Yes
Attend College:
69% Yes
21% No
Attend College:
70% Yes
30% No
Parents
Encourage = No
Attend College:
31% Yes
69% No
Business Oriented DM Problems
• Targeted ads
– “What banner should I display to this visitor?”
• Cross sells
– “What other products is this customer likely to buy?
• Fraud detection
– “Is this insurance claim a fraud?”
• Churn analysis
– “Who are those customers likely to churn?”
• Risk Management
– “Should I approve the loan to this customer?”
• …
Mining Process - Illustrated
Mining Model
Training Data
Data
Mining Model To Predict
DM
Engine
DM
Engine
Mining Model
Predicted Data
The Data Mining Market
The $$$: Market Size
• DM Tools Market:
– 1999: $341.3M
– 2000: $455.1M
– 2001: $449.5M
* IDC
The Players
• Leading vendors
–
–
–
–
–
SAS
SPSS
IBM
Angoss
Hundreds of smaller vendors offering DM
algorithms…
• Oracle –Thinking Machines acquisition
The Products
• End-to-end horizontal DM tools
– Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts
workbench, Reporting, Charting….
• The customer is the power-analyst
– PhD in statistics is usually required…
• Closed tools – no standard API
– Total vendor lock-in
– Limited integration with applications
• DM an “outsider” in the Data Warehouse
• Extensive consulting required
• Sky rocketing prices
– $60K+ for a single user license
What the analysts say…
• “Stand-alone Data Mining Is Dead” Forrester
• “The demise of [stand alone] data
mining” – Gartner
The Microsoft Approach
DataPro Users Survey
1999-2001
“Data mining will be the fastestgrowing BI technology…”
Market Size of BI
* IDC
SQL Server 2000 - The Analysis
Platform
• SQL 2000 provides a complete Analysis
Platform
– Not an isolated, stand alone DM product
• Platform means:
– Standard based DM API’s (OLE DB for DM) for
applications development
– Integrated vision for all technologies, tools
– Extensible
– Scaleable
Data Flow
DM
Apps
DM
OLTP
DW
Reports
&
Analysis
OLAP
Analysis Services 2000 –
Components
Tree View Control
DM Wizards
Manager
UI
Cluster View Control
DM DTS Task
DSO
DMM
Lift Chart Control
OLE DB OLAP DM
Sample Query Tool
Analysis Server
OLAP DM
Engine Engine
Client
OLAP DM
Engine Engine
(local) (local)
OLE DB for Data Mining…
Why OLE DB for DM?
Make DM a mass market technology by:
• Leverage existing technologies and knowledge
– SQL and OLE DB
• Common industry wide concepts and data
presentation
• Changing DM market perception from “proprietary” to
“open”
• Increasing the number of players:
– Reduce the cost and risk of becoming a consumer – one tool works
with multiple providers
– Reduce the cost and risk of becoming a provider – focus on expertise
and find many partners to complement offering
Integration With RDBMS
• Customers would like to
– Build DM models from within their RDBMS
– Train the models directly off their relational tables
– Perform predictions as relational queries (tables in,
tables out)
– Feel that DM is a native part of their database.
• Therefore…
– Data mining models are relational objects
– All operations on the models are relational
– The language used is SQL (w/Extensions)
• The effect: every DBA and VB developer can
become a DM developer
Creating a Data Mining Model
(DMM)
Identifying the “Cases”
• DM algorithms analyze “cases”
• The “case” is the entity being categorized and
classified
• Examples
– Customer credit risk analysis: Case = Customer
– Product profitability analysis: Case = Product
– Promotion success analysis: Case = Promotion
• Each case encapsulate all we know about the
entity
A Simple Set of Cases
StudentI Gende
D
r
Parent
Income
1
Male
23400
2
Female
79200
3
Male
42000
IQ
Encouragement
120 Not Encouraged
90 Encouraged
105 Not Encouraged
College
Plans
No
Yes
Yes
More Complicated Cases
Cust
ID
Age
Marit
al
Statu
s
1
35
M
2
3
20
57
S
M
Favorite Movies
IQ
2
3
2
Title
Score
Star Wars
8
Toy Story
9
Terminator
7
Star Wars
7
Braveheart
7
The Matrix
10
Sixth Sense
9
Casablanca
10
A DMM is a Table!
• A DMM structure is defined as a table
– Training a DMM means inserting data (pattern)
into the table
– Predicting from a DMM means querying the
table
• All information describing the case are
contained in columns
Creating a Mining Model
CREATE MINING MODEL [Plans Prediction]
(
StudentID
LONG
KEY,
Gender
TEXT
DISCRETE,
ParentIncome LONG
CONTINUOUS,
IQ
DOUBLE CONTINUOUS,
Encouragement TEXT
DISCRETE,
CollegePlans TEXT
DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
Creating a mining model with
nested table
Create Mining Model MoviePrediction
(
CutomerId long key,
Age long continuous,
Gender discrete,
Education discrete,
MovieList table predict (
MovieName text key
)
)
using microsoft_decision_trees
Training a DMM
Training a DMM
• Training a DMM means passing it data for which the
attributes to be predicted are known
– Multiple passes are handled internally by the provider!
• Use an INSERT INTO statement
• The DMM will not persist the inserted data
• Instead it will analyze the given cases and build the
DMM content (decision tree, segmentation model,
association rules)
INSERT [INTO] <mining model name>
[(columns list)]
<source data query>
INSERT INTO
INSERT INTO [Plans Prediction]
(
StudentID, Gender, ParentIncome, IQ,
Encouragement, CollegePlans
)
SELECT
[StudentID], [Gender],
[ParentIncome], [IQ],
[Encouragement], [CollegePlans]
FROM [Students]
When Insert Into Is Done…
• The DMM is trained
– The model can be retrained
– Content (rules, trees, formulas) can be
explored
– OLE DB Schema rowset
– SELECT * FROM <dmm>.CONTENT
– XML string (PMML)
• Prediction queries can be executed
Predictions
What are Predictions?
• Predictions apply the rules of a trained
model to a new set of data in order to
estimate missing attributes or values
• Predictions = queries
– The syntax is SQL - like
– The output is a rowset
• In order to predict you need:
– Input data set
– A trained DMM
– Binding (mapping) information between the
input data and the DMM
The Truth Table Concept
Gende
r
Parent
Income
IQ
Colleg
Probabilit
Encouragement e
y
Plans
Male
20000
85
Not Encouraged
No
85%
Male
20000
85
Not Encouraged
Yes
15%
Male
20000
85
Encouraged
No
60%
Male
20000
85
Encouraged
Yes
40%
Male
20000
90
Not Encouraged
No
80%
Male
20000
90
Not Encouraged
Yes
20%
Male
20000
90
Encouraged
No
58%
…
Prediction
StudentI
D
Gender
Parent
Income
IQ
Encouragement
1
Male
43000
85
Not Encouraged
2
Male
20000
135
Not Encouraged
3
Female
25000
105
Encouraged
4
Male
96000
100
Encouraged
5
Female
56000
125
Not Encouraged
6
Female
46000
90
Not Encouraged
It’s a JOIN!
Gender
Parent
Income
IQ
Encouragement
College
Plans
Probability
Male
20000
85
Not Encouraged
No
85%
Male
20000
85
Not Encouraged
Yes
15%
Male
20000
85
Encouraged
No
60%
Male
20000
85
Encouraged
Yes
40%
Male
20000
90
Not Encouraged
No
80%
Male
20000
90
Not Encouraged
Yes
20%
Male
20000
90
Encouraged
No
58%
Male
20000
90
Encouraged
Yes
42%
Male
20000
95
Not Encouraged
No
78%
Male
20000
95
Not Encouraged
Yes
22%
Male
20000
95
Encouraged
No
45%
The Prediction Query Syntax
SELECT <columns to return or predict>
FROM
<dmm> PREDICTION JOIN
<input data set>
ON <dmm column> = <dmm input column>…
Example
SELECT [New Students].[StudentID],
[Plans Prediction].[CollegePlans],
PredictProbability([CollegePlans])
FROM
[Plans Prediction] PREDICTION JOIN
[New Students]
ON [Plans Prediction].[Gender] =
[New Students].[Gender] AND
[Plans Prediction].[IQ] =
[New Students].[IQ] AND ...
Demo
OLE DB for Data Mining Defines
API
Consumer
Consumer
…
OLE DB for DM (API)
Provider
Provider
Provider
OLE DB
Cube
RDBMS
Misc.
Data
Source
…
OLEDB for DM Configuration
Options Demo
Consumers
OLEDB for DM
Providers
MS Analysis
Manager
1
MS DM
Provider
ANGOSS
Controls
2 3
4
ANGOSS DM
Provider
Demo on OLE DB for DM API
using Angoss Controls
and Provider
For more info…
• DM URL
– www.microsoft.com/data/oledb
– www.microsoft.com/data/oledb/DMResKit.htm
• News Group:
– Microsoft.public.SQLserver.datamining
– Communities.msn.com/AnalysisServicesDataMining
• White papers:
– Performance paper:
www.unisys.com/windows2000/default-07.asp
www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp
Questions ?