Data Mining with
The SAS System
Dr. John Brocklebank, SAS Institute Inc
Gerhard Held, SAS Institute Europe
Outline
1. Data Mining - Needs and Requirements
2. The SAS Data Mining Solution
3. Conclusion
Data Mining?
Data Mining
• Data Mining is the process of selecting,
exploring and modelling large amounts of data
to uncover previously unknown patterns for
business advantage
Data Mining Applications
• Banking
– Credit Authorization
– Credit Card Fraud Detection
– Portfolio Analysis
– Customer Segmentation
• Insurance & Health Care
– Claim Analysis
– Fraudulent Behavior
• Telecommunications
– Churn Management
– Call Behaviour Analysis
• Retail/Marketing
– Market Basket Analysis
– Database Marketing
– Category Management
– Targeted/cross marketing
• Transportation, Networks, Utilities
– Loading Patterns
• General
– Pricing Analysis
– Associations & Demography
Data Mining - Needs and
Requirements
• Data Mining is a Process
• Data Mining involves close Co-operation of
IT, Business, and Data Miners
The SAS Data Mining
Solution
Business Problem
The SAS Data Mining Solution - Currently
• NNA - Initial Prod. on Win, OS/2, HP-UX,
AIX, SUN, Digital UNIX, ORLANDO I and II
• Tree Menu System (CHAID, later CART)
• Everything else Production Software:
– Exploration: INSIGHT, SPECTRAVIEW,
GIS
– Statistics
– Time Series Forecasting
– Market Research Methods
The SAS Data Mining Solution
New: SAS Enterprise Miner(TM)
A unique and full-scale Business Solution:
• IT: DW Access, Scalability
• Business Users:
Intuitive Interface and
Business Orientation
• Data Miners: Analytical Depth
and Flexibility
The SAS Enterprise Miner(TM)
Environment
• Graphical User Interface: Analytical solutions
based on SEMMA process using process flow
diagrams
• Existing SAS programs and applications can
be easily incorporated
• All ingredients of SAS Enterprise Miner, in
particular the DMDB and all analytical
engines, are available exclusively through this
Data Mining Solution.
Sample
• The Sampling Tool allows users to extract a
sample of their data using:
– simple random
– stratified
– Nth observation
– first N observations
– cluster
• The sampling tool also facilitates construction
of training DMDBs and of validation and test
data sets
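As a rough illustration of the sampling methods listed above, here is a minimal Python sketch. It is not the Sampling Tool itself: the function names, the `rows` list of dictionaries, and the 60/20/20 partition are assumptions made for the example, and cluster sampling is left out for brevity.

```python
import random
from collections import defaultdict


def simple_random(rows, n, seed=0):
    """Draw n rows without replacement."""
    return random.Random(seed).sample(rows, n)


def every_nth(rows, step):
    """Keep every step-th observation, starting with the first."""
    return rows[::step]


def first_n(rows, n):
    """Keep the first n observations."""
    return rows[:n]


def stratified(rows, key, fraction, seed=0):
    """Sample the same fraction from each stratum defined by key(row)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle, then split into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    a = int(len(shuffled) * train)
    b = a + int(len(shuffled) * valid)
    return shuffled[:a], shuffled[a:b], shuffled[b:]


# Hypothetical data: 100 customers, 25 of them in segment "B".
rows = [{"id": i, "segment": "A" if i % 4 else "B"} for i in range(100)]
strat_sample = stratified(rows, key=lambda r: r["segment"], fraction=0.2)
train_set, valid_set, test_set = partition(rows)
```

The stratified draw keeps the segment proportions of the full data, which is the usual reason to prefer it over a simple random sample when one group is rare.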
Data Mining Database
(DMDB)
• PROC DMDB is a procedure that builds a
data mining database (DMDB). The DMDB
consists of two parts:
– an efficient SAS data set (character variables
converted to integers)
– a catalog of meta data information that
characterizes target and input variables
• The DMDB is required before the data mining
analytical modules are run.
Data Mining Database
(DMDB)
• Numeric Variables: Summary statistics are
calculated and stored.
• Character Variables: Values are converted
to integer form and linked to the meta data
layer.
• Classification Variables: Levels and
frequencies for each variable are stored in
the meta data (mapping recipe).
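The two-part idea described above (an integer-coded data set plus a metadata catalog) can be sketched in Python as follows. This is only an illustration of the concept and does not reproduce the actual layout produced by PROC DMDB; `build_dmdb`, the dictionary keys, and the sample rows are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def build_dmdb(rows, numeric, categorical):
    """Return (encoded_rows, metadata) for a list of dict records."""
    meta = {}
    # Numeric variables: store summary statistics.
    for name in numeric:
        values = [r[name] for r in rows]
        meta[name] = {
            "type": "numeric",
            "n": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "std": pstdev(values),
        }
    # Class variables: store levels, frequencies and an integer mapping.
    for name in categorical:
        counts = Counter(r[name] for r in rows)
        levels = sorted(counts)
        meta[name] = {
            "type": "class",
            "levels": levels,
            "codes": {level: i for i, level in enumerate(levels)},
            "freq": dict(counts),
        }
    # Encoded data set: character values replaced by their integer codes.
    encoded = [
        {
            name: meta[name]["codes"][r[name]] if name in categorical else r[name]
            for name in numeric + categorical
        }
        for r in rows
    ]
    return encoded, meta


# Hypothetical input records.
records = [
    {"age": 30, "region": "north"},
    {"age": 40, "region": "south"},
    {"age": 50, "region": "north"},
]
encoded, meta = build_dmdb(records, numeric=["age"], categorical=["region"])
```

Because the mapping from levels to integers lives in the metadata, the compact encoded rows can always be translated back to the original character values.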
Explore / Modify
• Advanced Visualization tools enable users
to explore their data graphically.
• The Outlier Filter tool allows users to
quickly identify and remove outliers from
their data set.
• The Transformation Tool facilitates the
creation of transformed variables to be
used in the construction of the DMDB and
the modeling process.
Model
• The SAS Data Mining Solution provides a
full range of modeling and evaluation
techniques including:
– Data Mining Regression
– Decision Trees
– Neural Networks
– Associations
• All modelling techniques are available as:
– Procedures
– Icons in process flow diagrams
The DMINE Procedure
• Developed by Dr. Jim Goodnight to perform
“true data mining” by:
– providing a fast preliminary variable
assessment
– facilitating quick development of predictive
models with large volumes of data
The DMINE Procedure
• PROC DMINE quickly identifies input
variables useful for predicting target variables
(“model screening”)
• Describes how they fit into a linear model
(regression/ANOVA) framework.
• Results from this procedure can be passed to
the Neural Network and Data Splits tools or to
any other procedure in the SAS System.
The DMINE Procedure
• Supports multiple target variables
• Constructs and evaluates up to two-factor
interactions
• Collapses levels of class variables using a
criterion based on the R-square value
The DMINE Procedure
• First step: a simple linear regression model is
fit for each input variable.
• The input variables are sorted in descending
order by their R-square values.
• Second step: forward selection regression
for all inputs including class, continuous
variables, grouped classes, and ANOVA16
variables
• Variables and factors used and not used in
final model are displayed
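The two steps above can be sketched for interval inputs in plain Python. This is a simplified illustration and not PROC DMINE: class variables, grouped classes, and interactions are omitted, and the function names and the `min_r2` and `min_gain` cutoffs are assumptions chosen for the example.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def remove_component(v, q):
    """Subtract from v its projection onto q."""
    scale = dot(v, q) / dot(q, q)
    return [a - scale * b for a, b in zip(v, q)]


def r_square_simple(x, y):
    """R-square of a one-input linear regression of y on x."""
    xc, yc = center(x), center(y)
    sxx, syy = dot(xc, xc), dot(yc, yc)
    if sxx == 0 or syy == 0:
        return 0.0
    return dot(xc, yc) ** 2 / (sxx * syy)


def screen(inputs, y, min_r2=0.005):
    """Step 1: rank inputs by descending R-square and drop weak ones."""
    scored = sorted(((r_square_simple(x, y), name) for name, x in inputs.items()),
                    reverse=True)
    return [(name, r2) for r2, name in scored if r2 >= min_r2]


def forward_select(inputs, y, candidates, min_gain=0.0005):
    """Step 2: greedily add the input with the largest R-square gain."""
    y_res = center(y)
    syy = dot(y_res, y_res)
    if syy == 0:
        return []
    resid = {name: center(inputs[name]) for name in candidates}
    selected = []
    while resid:
        best, best_gain = None, 0.0
        for name, x in resid.items():
            sxx = dot(x, x)
            if sxx < 1e-12:  # nothing left after removing selected inputs
                continue
            gain = dot(x, y_res) ** 2 / (sxx * syy)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None or best_gain < min_gain:
            break
        q = resid.pop(best)
        selected.append((best, best_gain))
        # Remove the chosen direction from the target and remaining inputs.
        y_res = remove_component(y_res, q)
        resid = {name: remove_component(x, q) for name, x in resid.items()}
    return selected


# Hypothetical data: y depends on x1 and x2; x3 is constant and useless.
inputs = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "x3": [5, 5, 5, 5, 5, 5, 5, 5],
}
y = [2 * a + 3 * b for a, b in zip(inputs["x1"], inputs["x2"])]
ranked = screen(inputs, y)
chosen = forward_select(inputs, y, [name for name, _ in ranked])
```

Each entry in `chosen` pairs a variable with the R-square it added, so the gains sum to the R-square of the final model.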
The DMINE Procedure
• Binary targets: a logistic regression model is
fit to the data using the predictions from the
stepwise OLS run as a covariate.
• No logistic regression model is run for
interval target variables.
Data Mining Regression
• DMREG is a new procedure that allows the
user to access all of the functionality of REG
and LOGISTIC while including additional
functionality for data mining.
Data Mining Regression
• New features include:
– Uses DMDB as an input data source
– Handles training, validation, test and score data
sets
– Accepts both continuous and discrete variables
as inputs
– Accepts binary, continuous, or ordinal variables
as targets
Data Mining Regression
• Statistical methodologies and algorithms
supported include:
– Multiple linear regression
– Logistic regression
– Variable selection methods
– Multiple optimization methods
– Event and Event/Trial coding for classification
target variables
Data Mining Regression
• Model results and assessments are provided
in the form of:
– parameter estimates and related statistics
– goodness of fit statistics
Decision Trees
• DATA SPLITS is a new procedure that allows
users to construct classification and decision
trees.
• This procedure replaces the TREEDISC
macro and the SAS Tree Application.
Decision Trees
• Features include:
– Uses DMDB as an input data source
– Accepts both continuous and discrete input
and target variables
– Incorporates missing values for the inputs
into the modeling process
Decision Trees
• Statistical methodologies and algorithms
supported include:
– Utility functions defined for each alternative
decision
– Different fitting criteria for continuous and
discrete targets, CART, and CHAID
– Manual pruning of the tree in a graphical
environment
Decision Trees
• Model results and assessments are provided
in the form of:
– Utilities for rule assessment on both training
and validation data sets
– Goodness of fit statistics
– Interactive classification tree
– Interactive 3-D tree-ring graph
– Decision rules
Neural Networks
• NEURAL is a new procedure that allows
users to construct and train neural networks.
• This procedure replaces the TNN macros and
the SAS Neural Network Application.
Neural Networks
• Features include:
– Uses DMDB as an input data source
– Accepts both continuous and discrete input
and target variables
– Provides interactive network diagram for
construction of neural networks
Neural Networks
• Statistical methodologies and algorithms
supported include:
– Construction of multi-layer feedforward
networks and radial basis functions
– Multiple training techniques including nonlinear optimization methods and
backpropagation
– User control over selection of activation and
objective functions
Neural Networks
• Model results and assessments are provided
in the form of:
– Goodness of fit statistics including RMSE,
SBC, and AIC
– Misclassification tables for nominal outputs
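For readers unfamiliar with the fit statistics named above, the following Python sketch computes them for an interval target. The slides do not give formulas, so this uses one common least-squares form as an assumption: RMSE with n as the divisor, AIC = n ln(SSE/n) + 2p, and SBC = n ln(SSE/n) + p ln(n), where p is the number of estimated weights. The NEURAL procedure may define them differently.

```python
import math


def fit_statistics(actual, predicted, n_params):
    """RMSE, AIC and SBC under a common least-squares convention (assumed)."""
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    rmse = math.sqrt(sse / n)
    aic = n * math.log(sse / n) + 2 * n_params
    sbc = n * math.log(sse / n) + n_params * math.log(n)
    return {"rmse": rmse, "aic": aic, "sbc": sbc}


# Hypothetical predictions from a model with two estimated weights.
stats = fit_statistics(actual=[1, 2, 3, 4], predicted=[1, 2, 3, 6], n_params=2)
```

Both AIC and SBC penalise extra weights, with SBC penalising them more heavily once n exceeds about 7, which makes them useful for comparing networks of different size.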
Associations
• ASSOC and RULEGEN are new procedures
that allow users to discover associations
among items in a database.
• Possible Applications:
– Market basket analysis
– Analysis of Web usage
– Bank transactions
Associations
• Features include:
– Uses DMDB as an input data source
– Discovers rules of the form:
· if item A is part of an event, then x% of the time,
item B is also part of the event
Associations
• Statistical methodologies and algorithms
supported include:
– Constructs rules containing a left-hand-side
(LHS) and a right-hand-side (RHS) based on
frequency counts for various combinations of
items
Associations
• Model results and assessments are provided
in the form of:
– Association rules
– Information statistics such as confidence,
support, and lift
– A user interface which allows users to sort the
rules by information statistics and to select both
the LHS and RHS rules
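The three statistics named above can be made concrete with a small Python sketch for single-item rules of the form "if A then B". It is an illustration only, not the ASSOC or RULEGEN algorithm; `pair_rules` and the sample baskets are hypothetical.

```python
from itertools import permutations


def pair_rules(transactions, min_support=0.0):
    """Support, confidence and lift for every rule lhs -> rhs between two items."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    count = {item: sum(item in b for b in baskets) for item in items}
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = sum(lhs in b and rhs in b for b in baskets)
        support = both / n                # share of events containing both items
        if both == 0 or support < min_support:
            continue
        confidence = both / count[lhs]    # share of lhs events that also contain rhs
        lift = confidence / (count[rhs] / n)  # confidence relative to rhs base rate
        rules.append({"lhs": lhs, "rhs": rhs, "support": support,
                      "confidence": confidence, "lift": lift})
    # Sort by lift, as a user might do in the rule browser.
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


# Hypothetical market baskets.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread"],
    ["milk"],
]
rules = pair_rules(baskets)
butter_bread = next(r for r in rules if r["lhs"] == "butter" and r["rhs"] == "bread")
```

A lift above 1 means the RHS item appears with the LHS item more often than its overall frequency would suggest.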
Model Management
and Assessment
• Users can assess results of modeling through
interactive assessment graphs, gains charts,
and profit and ROI graphs.
• A common interface for each modeling tool
allows the user to document and manage the
model development process.
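A gains chart of the kind mentioned above can be derived from scored data in a few lines. The sketch below is an assumption about the general technique, not the Enterprise Miner assessment tool: it ranks cases by model score and reports, for each depth of the ranked file, the share of all events captured so far.

```python
def cumulative_gains(scores, actual, bins=10):
    """Return (depth, share of events captured) for each of the given bins."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_events = sum(actual)
    n = len(order)
    gains = []
    for b in range(1, bins + 1):
        cutoff = round(n * b / bins)
        captured = sum(actual[i] for i in order[:cutoff])
        gains.append((b / bins, captured / total_events))
    return gains


# Hypothetical scores and 0/1 outcomes for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
gains = cumulative_gains(scores, actual, bins=5)
```

Plotting the second element against the first gives the cumulative gains curve; a model with no predictive power would follow the diagonal.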
The SAS Enterprise Miner(TM)
Architecture
Client-server Approach:
• Clients: Win 95, Win NT
• Servers: Win NT, all major UNIX
• Mainframe as Data Server, later also
Compute Server
SAS System and Data Mining
Approx. Timeline
[Figure: approximate 1997 timeline (Feb to Dec) for the SAS System for Data Mining and SAS Enterprise Miner, marking the restricted alpha, beta, and production releases alongside the SUGI/CEBIT and SEUGI events.]
Summary
The SAS Data Mining Solution is unique:
• IT: DW Access, Scalability
• Business Users:
Intuitive Interface and
Business Orientation
• Data Miners: Analytical Depth
and Flexibility
Data Mining with
The SAS System
Thank you for your Attention!
Questions?