DBMS Support for Data Mining
Advisor: S.-Y. Hwang, Ph.D.
D954020005 Tsung-Hsien Yang
D954020006 Shi-Hwao Wang
1/22/2008
Agenda
Introduction to Data Mining
The Promise of Data Mining
KDD Process
Data Mining Algorithms
Data Mining Modeling and Language
Conclusion
Introduction to Data Mining
The Explosive Growth of Data: from terabytes to petabytes
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
What Is Data Mining?
Data mining: Discovering interesting patterns from large amounts of data
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”? These are not:
Simple search and query processing
(Deductive) expert systems
The Promise of Data Mining
Database analysis and decision support
Market analysis and management
Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (newsgroups, email, documents) and Web analysis
Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data preprocessing
Define a model
Train the model
Test the model
Prediction using the model
[Figure: Training Data and Test Data are fed to the Data Mining Management System (DMMS), which holds the Mining Model; Prediction Input Data is then scored against the model]
Data Mining Algorithms
Decision Trees
Naïve Bayesian
Clustering
Sequence Clustering
Association Rules
Neural Network
Time Series
Support Vector Machines
….
Data Mining Function
Classification (attribute)
Estimation (regression)
Prediction (time series)
Association (cross selling)
Clustering (segmentation)
Data Mining Algorithms
[Table: recommended algorithms for each task, marked as first choice or second choice. Rows: Classification, Regression, Segmentation, Assoc. Analysis, Anomaly Detect., Seq. Analysis, Time series. Individual check marks not recoverable.]
Data Mining Language
New challenges in data mining API
Large spectrum of applications: embedded to interactive BI
Interoperability between different DM providers (engines) and DM consumers (tools)
Data independence between content representation (trees, attributes, networks, etc.) and data mining task (prediction, scoring, etc.)
Requirements:
Algorithm-neutral
Task-oriented (specification of what we need, rather than how to)
Vendor-neutral
Flexible, extensible, declarative/self-contained
Sound familiar?
Yes, SQL
DMX Approach
Data Mining Extensions (DMX) to SQL
Table vs. Mining Model
                  TABLE                    MINING MODEL
Schema            Column definition        Attribute (variable) definition
Contains          Rows                     Patterns, knowledge, cases
DDL operations    Create, drop, alter      Create/drop/alter a model
DML               Insert, delete           Train (populate) a model
Query             Select                   Prediction/browsing a model
Typical DM Process Using DMX
Define a model:
CREATE MINING MODEL ….
Train a model:
INSERT INTO dmm ….
Prediction using a model:
SELECT …
FROM dmm PREDICTION JOIN …
[Figure: Training Data is fed to the Data Mining Management System (DMMS) to build the Mining Model; Prediction Input Data is then joined with the model]
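As a language-neutral illustration of the define / train / predict cycle, here is a minimal Python sketch. The MiningModel class, its method names, and the sample rows are hypothetical and are not part of DMX. The "algorithm" is a simple majority vote per input combination, chosen only to keep the example short.

```python
from collections import Counter, defaultdict


class MiningModel:
    """Toy stand-in for a mining model: majority vote per input combination."""

    def __init__(self, inputs, target):
        # "Define": fix which attributes are inputs and which one is predicted.
        self.inputs = list(inputs)
        self.target = target
        self.counts = defaultdict(Counter)   # input values -> target value counts
        self.overall = Counter()             # fallback for unseen combinations

    def train(self, cases):
        # "Train": consume cases; only the learned counts are kept, not the rows.
        for case in cases:
            key = tuple(case[a] for a in self.inputs)
            self.counts[key][case[self.target]] += 1
            self.overall[case[self.target]] += 1

    def predict(self, case):
        # "Predict": most frequent target value for this input combination.
        key = tuple(case[a] for a in self.inputs)
        seen = self.counts.get(key) or self.overall
        return seen.most_common(1)[0][0]


# Made-up training cases (hypothetical data).
training_data = [
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "Yes"},
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "Yes"},
    {"Gender": "M", "Encouragement": "Low", "CollegePlans": "No"},
    {"Gender": "M", "Encouragement": "High", "CollegePlans": "Yes"},
]

model = MiningModel(inputs=["Gender", "Encouragement"], target="CollegePlans")
model.train(training_data)
prediction = model.predict({"Gender": "M", "Encouragement": "Low"})
print(prediction)
```

The point of the sketch is the separation of concerns: the model keeps patterns (here, counts), not the training rows, which mirrors the table versus mining model distinction.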
Defining a DM Model
Defines
Shape of “training cases” (top-level entity being modeled)
Input/output attributes (variables): type, distribution
Algorithms and parameters
Example
CREATE MINING MODEL CollegePlanModel
(
    StudentID     LONG KEY,
    Gender        TEXT DISCRETE,
    ParentIncome  LONG NORMAL CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans  TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees
(complexity_penalty = 0.5)
Training a DM Model: Simple
INSERT INTO CollegePlanModel
    (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
     FROM CollegePlansTrainData')
Prediction Using a DM Model
PREDICTION JOIN
SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND
   CPModel.IQ = t.IQ
[Figure: the CPModel columns (ID, Gender, IQ, Plan) are matched with the NewStudents columns (ID, Gender, IQ) on Gender and IQ]
Classification
Model Definition
CREATE MINING MODEL CPClass
(
    StudentID     LONG KEY,
    Gender        TEXT DISCRETE,
    ParentIncome  LONG CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans  TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees
Classification (cont)
Find the new students whose predicted class (CollegePlans) is ‘Yes’ with confidence > 0.8
SELECT t.StudentID, PredictProbability(CPClass.CollegePlans)
FROM CPClass PREDICTION JOIN
    OPENROWSET('<provider>', '<connection>',
        'SELECT * FROM NewStudents') AS t
ON t.Gender = CPClass.Gender AND
   t.ParentIncome = CPClass.ParentIncome AND
   t.Encouragement = CPClass.Encouragement
WHERE
    CPClass.CollegePlans = 'Yes' AND
    PredictProbability(CPClass.CollegePlans) > 0.8
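The filter in the WHERE clause can be pictured with a small Python sketch. It assumes a hypothetical model that stores, for each input combination, counts of the CollegePlans values seen in training. The function predict_with_probability and the sample data are illustrative only and do not reproduce how Microsoft_Decision_Trees computes PredictProbability.

```python
from collections import Counter

# Hypothetical learned content: input combination -> counts of CollegePlans values.
learned = {
    ("F", "High"): Counter({"Yes": 9, "No": 1}),
    ("M", "High"): Counter({"Yes": 6, "No": 4}),
    ("M", "Low"): Counter({"Yes": 1, "No": 9}),
}


def predict_with_probability(gender, encouragement):
    """Return (most likely class, its relative frequency) for one input row."""
    counts = learned[(gender, encouragement)]
    label, hits = counts.most_common(1)[0]
    return label, hits / sum(counts.values())


# Made-up input rows standing in for the NewStudents table.
new_students = [
    {"StudentID": 1, "Gender": "F", "Encouragement": "High"},
    {"StudentID": 2, "Gender": "M", "Encouragement": "High"},
    {"StudentID": 3, "Gender": "M", "Encouragement": "Low"},
]

# Analogue of: WHERE predicted class = 'Yes' AND probability > 0.8
confident_yes = []
for row in new_students:
    label, prob = predict_with_probability(row["Gender"], row["Encouragement"])
    if label == "Yes" and prob > 0.8:
        confident_yes.append((row["StudentID"], prob))

print(confident_yes)
```

Student 2 is predicted ‘Yes’ but falls below the threshold, and student 3 is confidently predicted ‘No’, so only student 1 is returned.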
Regression
Model Definition
CREATE MINING MODEL CustCredit
(
    CustID LONG KEY,
    Gender TEXT DISCRETE,
    Age    LONG CONTINUOUS REGRESSOR,
    Income LONG CONTINUOUS REGRESSOR,
    Credit DOUBLE CONTINUOUS PREDICT
) USING Microsoft_Decision_Trees
Regression (cont)
Predict the credit score (and its standard deviation) for new customer data entered from a web form.
SELECT CustCredit.Credit, PredictStdev(CustCredit.Credit)
FROM CustCredit PREDICTION JOIN
    (SELECT 'Female' AS Gender, 30 AS Age, 50000 AS Income) AS t
ON t.Gender = CustCredit.Gender AND
   t.Age = CustCredit.Age AND
   t.Income = CustCredit.Income
Segmentation
Model Definition
CREATE MINING MODEL CPCluster
(
    StudentID     LONG KEY,
    Gender        TEXT DISCRETE,
    ParentIncome  LONG CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans  TEXT DISCRETE
) USING Microsoft_Clustering
Segmentation (cont.)
Find the cluster and its probability for each student
SELECT t.StudentID, $Cluster, ClusterProbability()
FROM CPCluster PREDICTION JOIN
    OPENROWSET('<provider>', '<connection>',
        'SELECT * FROM NewStudents') AS t
ON t.Gender = CPCluster.Gender AND
   t.ParentIncome = CPCluster.ParentIncome AND
   t.Encouragement = CPCluster.Encouragement AND
   t.CollegePlans = CPCluster.CollegePlans
Association Prediction
Model Definition
CREATE MINING MODEL FavMovieModel (
    ID            LONG KEY,
    MaritalStatus TEXT DISCRETE,
    FavMovies     TABLE PREDICT (
        Title TEXT KEY
    )
) USING Microsoft_Decision_Trees
Association Prediction (cont)
As a web application, find the 5 best recommendations for a customer whose shopping cart contains ‘Star Wars’ and ‘Matrix’.
SELECT FLATTENED
    PredictAssociation(FavMovieModel.FavMovies, INCLUDE_STATISTICS, 5)
FROM FavMovieModel NATURAL PREDICTION JOIN
    (SELECT 'Single' AS MaritalStatus,
        (SELECT 'Star Wars' AS Title UNION
         SELECT 'Matrix' AS Title) AS FavMovies) AS t
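To make the recommendation task concrete, the following Python sketch ranks titles by how often they co-occur with the titles already in the cart. This is a plain co-occurrence count over made-up baskets, not the algorithm behind PredictAssociation. The function recommend and the sample data are hypothetical.

```python
from collections import Counter

# Made-up histories; each set is one customer's favourite movies.
baskets = [
    {"Star Wars", "Matrix", "Alien"},
    {"Star Wars", "Matrix", "Blade Runner"},
    {"Star Wars", "Alien"},
    {"Matrix", "Blade Runner", "Alien"},
    {"Titanic", "Notting Hill"},
]


def recommend(cart, baskets, top_n=5):
    """Rank titles not in the cart by co-occurrence with cart items."""
    cart = set(cart)
    scores = Counter()
    for basket in baskets:
        overlap = len(cart & basket)
        if overlap == 0:
            continue  # this customer shares nothing with the cart
        for title in basket - cart:
            scores[title] += overlap
    # Sort by score (descending), then by title, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


recommendations = recommend({"Star Wars", "Matrix"}, baskets)
print(recommendations)
```

Fewer than five titles come back here because only two other titles ever appear alongside the cart items in the sample data.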
Sequence Prediction
Model Definition
CREATE MINING MODEL WebSeqModel (
    Session LONG KEY,
    PageSeq TABLE PREDICT (
        SeqID LONG KEY SEQUENCE,
        Page  TEXT DISCRETE
    )
) USING Microsoft_Sequence_Clustering
Sequence Prediction (cont)
Show the next 2 steps that a web visitor who visited ‘home’ and then ‘news’ is likely to take. For each step, show the top 5 candidate pages with the highest probability.
SELECT FLATTENED
    ( SELECT $Sequence,
          TopCount(PredictHistogram(Page), $Probability, 5)
      FROM PredictSequence(WebSeqModel.PageSeq, 2)
    )
FROM WebSeqModel NATURAL PREDICTION JOIN
    (SELECT
        (SELECT 1 AS SeqID, 'home' AS Page UNION
         SELECT 2 AS SeqID, 'news' AS Page) AS PageSeq
    ) AS t
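One simple way to picture next-step prediction is a first-order Markov chain over page transitions, sketched below in Python. This is an assumption made for illustration: Microsoft_Sequence_Clustering is not claimed to work this way, and the session data and function names are made up. The sketch looks only at the last visited page, so of the history ‘home’, ‘news’ only ‘news’ is used.

```python
from collections import Counter, defaultdict

# Made-up click paths; each list is one session in visit order.
sessions = [
    ["home", "news", "sports", "scores"],
    ["home", "news", "sports", "tickets"],
    ["home", "news", "weather"],
    ["home", "shop", "cart"],
]

# Count page -> next page transitions (first-order Markov assumption).
transitions = defaultdict(Counter)
for path in sessions:
    for current, following in zip(path, path[1:]):
        transitions[current][following] += 1


def next_page_histogram(page, top_k=5):
    """Top-k candidate next pages with their estimated probabilities."""
    counts = transitions.get(page, Counter())
    total = sum(counts.values())
    return [(p, c / total) for p, c in counts.most_common(top_k)]


def predict_steps(last_page, steps=2, top_k=5):
    """For each future step, list candidates, then follow the likeliest one."""
    result = []
    page = last_page
    for _ in range(steps):
        histogram = next_page_histogram(page, top_k)
        if not histogram:
            break  # no outgoing transitions were observed for this page
        result.append(histogram)
        page = histogram[0][0]
    return result


forecast = predict_steps("news", steps=2)
print(forecast)
```

Each element of forecast plays the role of one predicted step with its histogram of candidate pages, loosely mirroring the nested result of the query.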
Time-Series Prediction
Model Definition
CREATE MINING MODEL StockModel (
    Symbol       TEXT KEY,
    DateRecorded DATE KEY TIME,
    OpeningQuote DOUBLE CONTINUOUS,
    ClosingQuote DOUBLE CONTINUOUS
) USING Microsoft_Time_Series
Time-Series Prediction (cont)
Predict the next five days of MSFT stock closing quotes.
SELECT FLATTENED
    PredictTimeSeries(StockModel.ClosingQuote, 5)
FROM StockModel
WHERE StockModel.Symbol = 'MSFT'
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Data Mining Vendors
SAS (Enterprise Miner)
IBM (DB2 Intelligent Miner)
Oracle (ODM option to Oracle 10g)
SPSS (Clementine)
Insightful (Insightful Miner)
KXEN (Analytic Framework)
Prudsys (Discoverer and its family)
Microsoft (SQL Server 2005)
Angoss (KnowledgeServer and its family)
DBMiner (DBMiner)
Many others
Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions from bottom to top. Layers, bottom to top: Data Sources (Paper, Files, Information Providers, Database Systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (Statistical Analysis, Querying and Reporting); Data Mining (Information Discovery); Data Presentation (Visualization Techniques); Making Decisions. Roles, bottom to top: DBA, Data Analyst, Business Analyst, End User]
Data Mining Modeling and Language
Problem Description
Two powerful tools:
Database management systems
Efficient and effective data mining algorithms and frameworks
Generally, this work asks:
“How can we merge the two?”
“How can we integrate data mining more closely with traditional database systems, particularly querying?”
Three Different Answers
MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
MSQL
Focus on Association Rules
Seeks to provide a language both to selectively generate rules and, separately, to query the rule base
Expressive rule generation language, and techniques for optimizing some commands
MSQL
Get-Rules and Select-Rules Queries
The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the “where” clause
[Project Body, Consequent, confidence, support]
GetRules(C) [as R1]
[into <rulebase_name>]
[where <conds>]
[sql-group-by clause]
[using-clause]
MSQL
<conds> may contain a number of conditions, including:
restrictions on the attributes in the body or consequent
“rule.body HAS {(Job = ‘Doctor’)}”
“rule1.consequent IN rule2.body”
“rule.consequent IS {Age = *}”
(IN, HAS, and IS denote rule subset, superset, and equality, respectively)
pruning conditions (restrict by support, confidence, or size)
stratified or correlated subqueries
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
Retrieve all rules that have a descriptor of the form “Age = *” in the body, except those for which another rule meeting the same support and confidence thresholds has a body containing a superset of the descriptors in this rule's body
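The semantics of this query can be sketched in Python over a tiny made-up data set. The sketch enumerates rules by brute force, which is not how MSQL generates rules, and it reads “R2.Body HAS R1.Body” as a strict superset so that a rule does not exclude itself; both points are assumptions made for illustration. All names and records are hypothetical.

```python
from itertools import combinations

# Made-up patient records as sets of (attribute, value) descriptors.
patients = [
    {("Age", "60+"), ("Smoker", "yes"), ("HeartDisease", "yes")},
    {("Age", "60+"), ("Smoker", "yes"), ("HeartDisease", "yes")},
    {("Age", "60+"), ("Smoker", "no"), ("HeartDisease", "no")},
    {("Age", "30-59"), ("Smoker", "no"), ("HeartDisease", "no")},
]


def support(items):
    """Fraction of records containing every descriptor in items."""
    return sum(items <= record for record in patients) / len(patients)


def get_rules(min_support, min_confidence):
    """Enumerate rules (body -> one consequent) that pass both thresholds."""
    descriptors = sorted(set().union(*patients))
    rules = []
    for size in (1, 2):
        for body in combinations(descriptors, size):
            body = frozenset(body)
            if support(body) == 0:
                continue  # body never occurs, so confidence is undefined
            for consequent in descriptors:
                if consequent in body:
                    continue
                sup = support(body | {consequent})
                conf = sup / support(body)
                if sup > min_support and conf > min_confidence:
                    rules.append((body, consequent, sup, conf))
    return rules


all_rules = get_rules(0.05, 0.7)

# Body has {Age = *}: at least one descriptor on the Age attribute.
with_age = [r for r in all_rules if any(attr == "Age" for attr, _ in r[0])]

# "not exists": drop R1 if some rule R2 has a body strictly containing R1's body.
maximal = [r1 for r1 in with_age
           if not any(r2[0] > r1[0] for r2 in all_rules)]

for body, consequent, sup, conf in maximal:
    print(sorted(body), "->", consequent, sup, conf)
```

In this sample, rules whose body is just one Age descriptor are dropped because rules with larger bodies containing that descriptor also pass the thresholds.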
MSQL
Correlated:
GetRules(C) R1
where <pruning-conds>
and not exists ( GetRules(C) R2
where <same pruning-conds>
and R2.Body HAS R1.Body)
Stratified:
GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
and consequent in (SelectRules(R2)
where consequent is {(X=*)})
MSQL
Nested Get-Rules Queries and their optimization
Stratified (non-correlated) queries are evaluated “bottom-up”: the subquery is evaluated first and replaced with its results in the outer query.
Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options
MSQL
Top-Down Evaluation
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
For each rule produced by the outer query, evaluate the inner query:
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
MSQL
Bottom-Up Evaluation
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
For each rule produced by the inner query, evaluate the outer query:
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
DMQL
Commands specify the following:
The set of data relevant to the data mining task (the training set)
The kinds of knowledge to be discovered
Generalized relation
Characteristic rules
Discriminant rules
Classification rules
Association rules
DMQL
Commands specify the following:
Background knowledge
Concept hierarchies based on attribute relationships, etc.
Various thresholds
Minimum support, confidence, etc.
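A concept hierarchy lets the miner replace raw attribute values with more general ones before looking for rules. The Python sketch below shows the idea for an Age attribute. The two hierarchy levels, their cut-offs, and the function names are assumptions made for illustration and do not come from DMQL.

```python
def age_range(age):
    """First level: map an exact age to a ten-year range label."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def age_group(age):
    """Second level: map an exact age to a coarse group (cut-offs are assumed)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


def generalize(records, level):
    """Replace the Age value in each record by its ancestor at the chosen level."""
    mapper = {"range": age_range, "group": age_group}[level]
    return [{**rec, "Age": mapper(rec["Age"])} for rec in records]


# Made-up records (hypothetical data).
records = [{"ID": 1, "Age": 34}, {"ID": 2, "Age": 71}, {"ID": 3, "Age": 38}]
by_range = generalize(records, "range")
by_group = generalize(records, "group")
print(by_range)
print(by_group)
```

After generalization, records 1 and 3 share the same Age value, so a rule about that value can reach a support threshold that neither exact age would reach alone.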
DMQL
Syntax
use database <database_name>
{use hierarchy <hierarchy_name> for <attribute>}
<rule_spec>
related to <attr_or_agg_list>
from <relation(s)>
[where <conditions>]
[order by <order list>]
{with [<kinds of>] threshold = <threshold_value> [for <attribute(s)>]}
Clause annotations:
use hierarchy: specify background knowledge
<rule_spec>: specify rules to be discovered
related to: relevant attributes or aggregations
from / where: collect the set of relevant data to mine
with threshold: specify threshold parameters
DMQL
use database Hospital
find association rules as Heart_Health
related to Salary, Age, Smoker, Heart_Disease
from Patient_Financial f, Patient_Medical m
where f.ID = m.ID and m.age >= 18
with support threshold = .05
with confidence threshold = .7
DMQL
DMQL provides a “display in” command to view the resulting rules, but no advanced way to query them
Suggests that a GUI might aid in presenting these results in different forms (charts, graphs, etc.)
OLE DB for DM
An extension to the OLE DB interface for Microsoft SQL Server
Seeks to support the following ideas:
Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
Populate the model using the training data
Predict attributes for new data using the populated model
Browse the mining model (not fully addressed because it varies a lot by model type)
OLE DB for DM
Defining a Mining Model
Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
Populating the Model
Pull the information into a single rowset using views, and train the model using the data and algorithm specified
OLE DB for DM
Using the mining model to predict
Defines a new operator, the prediction join. A model may be used to make predictions on a data set by taking the prediction join of the mining model and the data set.
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
(
    ID          Int Key,
    Age         Int,
    Smoker      Int,
    Salary      Double discretized,
    HeartAttack Int PREDICT    -- prediction column
)
USING Microsoft_Decision_Trees
Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
    (Age, Smoker, Salary, HeartAttack)
OPENROWSET('<provider>', '<connection>',
    'SELECT Age, Smoker, Salary, HeartAttack
     FROM Patient_Medical M, Patient_Financial F
     WHERE M.ID = F.ID')
The INSERT means each tuple is used to train the model; the tuple itself is not actually inserted into a rowset.
OLE DB for DM
SELECT T.ID, H.HeartAttack
FROM [Heart_Health Prediction] AS H
PREDICTION JOIN
    OPENROWSET('<provider>', '<connection>',
        'SELECT ID, Age, Smoker, Salary
         FROM Patient_Medical M, Patient_Financial F
         WHERE M.ID = F.ID') AS T
ON H.Age = T.Age AND H.Smoker = T.Smoker AND H.Salary = T.Salary
The prediction join connects the model and an actual data table to make predictions
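The role of the ON clause can be illustrated with a short Python sketch: each input row is mapped onto the model's attribute names, and the prediction is attached to the row. The function prediction_join, the column names PatientAge and Smokes, and the hand-written rule that stands in for a trained model are all hypothetical.

```python
def prediction_join(model_predict, rows, on):
    """Pair each input row with the model's prediction.

    model_predict: callable taking a dict of model attribute -> value.
    on: mapping of model attribute name -> input column name (the ON clause).
    """
    for row in rows:
        model_inputs = {attr: row[col] for attr, col in on.items()}
        yield {**row, "Predicted": model_predict(model_inputs)}


def toy_heart_model(inputs):
    """Stand-in for a trained model: a hand-written rule, not a learned tree."""
    return 1 if inputs["Smoker"] == 1 and inputs["Age"] >= 50 else 0


# Made-up input rows whose column names differ from the model's attribute names.
patients = [
    {"ID": 10, "PatientAge": 62, "Smokes": 1},
    {"ID": 11, "PatientAge": 35, "Smokes": 1},
]

scored = list(prediction_join(
    toy_heart_model, patients, on={"Age": "PatientAge", "Smoker": "Smokes"}))
print(scored)
```

Unlike an ordinary join, no rows of the model are matched; the mapping only tells the model which input column supplies each attribute.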
Key Ideas
Important to have an API for creating and manipulating data mining models
The data is already in the DBMS, so it makes sense to do the data mining where the data is
Applications already use SQL, so a SQL extension seems logical
Key Ideas
Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
Need a method of querying the models (MSQL)
Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)