Data Mining with The SAS System
Dr. John Brocklebank, SAS Institute Inc
Gerhard Held, SAS Institute Europe

Outline
1. Data Mining - Needs and Requirements
2. The SAS Data Mining Solution
3. Conclusion

Data Mining?

Data Mining
• Data Mining is the Process of selecting, exploring and modelling
• large Amounts of Data
• to uncover previously unknown Patterns for Business Advantage

Data Mining Applications
• Banking
  – Credit Authorization
  – Credit Card Fraud Detection
  – Portfolio Analysis
  – Customer Segmentation
• Insurance & Health Care
  – Claim Analysis
  – Fraudulent Behavior
• Telecommunications
  – Churn Management
  – Call Behaviour Analysis
• Retail/Marketing
  – Market Basket Analysis
  – Database Marketing
  – Category Management
  – Targeted/cross marketing
• Transportation, Networks, Utilities
  – Loading Patterns
• General
  – Pricing Analysis
  – Associations & Demography

Data Mining - Needs and Requirements
• Data Mining is a Process
• Data Mining involves close Co-operation of IT, Business, and Data Miners

The SAS Data Mining Solution
Business Problem

The SAS Data Mining Solution - Currently
• NNA - Initial Prod. on Win, OS/2, HP-UX, AIX, SUN, Digital UNIX, ORLANDO I and II
• Tree Menu System (CHAID, later CART)
• Everything else Production Software:
  – Exploration: INSIGHT, SPECTRAVIEW, GIS
  – Statistics
  – Time Series Forecasting
  – Market Research Methods

The SAS Data Mining Solution
New: SAS Enterprise Miner(TM)
A unique and full-scale Business Solution:
• IT: DW Access, Scalability
• Business Users: Intuitive Interface and Business Orientation
• Data Miners: Analytical Depth and Flexibility

The SAS Enterprise Miner(TM) Environment
• Graphical User Interface: Analytical solutions based on SEMMA process using process flow diagrams
• Existing SAS programs and applications can be easily incorporated
• All ingredients of SAS Enterprise Miner, in particular the DMDB and all analytical engines, are exclusively available through this Data Mining Solution.

Sample
• The Sampling Tool allows users to extract a sample of their data using:
  – simple random
  – stratified
  – Nth observation
  – first N observations
  – cluster
• The Sampling Tool also facilitates construction of training DMDBs, validation, and test data sets

Data Mining Database (DMDB)
• PROC DMDB is a procedure that builds a data mining database (DMDB). The DMDB consists of two parts:
  – an efficient SAS data set (character variables converted to integers)
  – a catalog of meta data information that characterizes target and input variables
• The DMDB is required before the data mining analytical modules are run.

Data Mining Database (DMDB)
• Numeric Variables: Summary statistics are calculated and stored.
• Character Variables: Values are converted to integer form and linked to the meta data layer.
• Classification Variables: Levels and frequencies for each variable are stored in the meta data (mapping recipe).

Explore / Modify
• Advanced Visualization tools enable users to explore their data graphically.
• The Outlier Filter tool allows users to quickly identify and remove outliers from their data set.
• The Transformation Tool facilitates the creation of transformed variables to be used in the construction of the DMDB and the modeling process.

Model
• The SAS Data Mining Solution provides a full range of modeling and evaluation techniques including:
  – Data Mining Regression
  – Decision Trees
  – Neural Networks
  – Associations
• All modelling techniques are available as:
  – Procedures
  – Icons in process flow diagrams

The DMINE Procedure
• Developed by Dr. Jim Goodnight to perform “true data mining” by:
  – providing a fast preliminary variable assessment
  – facilitating quick development of predictive models with large volumes of data

The DMINE Procedure
• PROC DMINE quickly identifies input variables useful for predicting target variables (“model screening”)
• Describes how they fit into a linear models (regression/ANOVA) framework.
• Results from this procedure can be passed to the Neural Network and Data Splits tools or to any other procedure in the SAS System.

The DMINE Procedure
• Supports multiple target variables
• Constructs and evaluates up to two-factor interactions
• Collapses levels of class variables using a criterion based on the R-square value

The DMINE Procedure
• First step: a simple linear regression model is fit for each input variable.
• The input variables are sorted in descending order by their R-square values.
• Second step: forward selection regression for all inputs including class, continuous variables, grouped classes, and ANOVA16 variables
• Variables and factors used and not used in the final model are displayed

The DMINE Procedure
• Binary targets: a logistic regression model is fit to the data using the predictions from the stepwise OLS run as a covariate.
• No logistic regression model is run for interval target variables.

Data Mining Regression
• DMREG is a new procedure that allows the user to access all of the functionality of REG and LOGISTIC while including additional functionality for data mining.
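The first DMINE step described in the slides above (fit a simple linear regression for each input, then sort the inputs by R-square) can be illustrated with a minimal sketch. This is plain Python, not PROC DMINE: the function names, variable names, and toy data are hypothetical, only interval inputs are handled, and the forward selection step is not shown.

```python
# Illustrative sketch only: plain Python, not the PROC DMINE implementation.

def r_square(x, y):
    """R-square of the simple linear regression of y on x.

    For a one-input regression with intercept this equals the squared
    Pearson correlation between x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant input (or target) explains nothing
    return (sxy * sxy) / (sxx * syy)


def rank_inputs(inputs, target):
    """Return (name, R-square) pairs sorted in descending order of R-square."""
    scores = [(name, r_square(values, target)) for name, values in inputs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three interval inputs and one interval target.
inputs = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age": [5.0, 3.0, 4.0, 1.0, 2.0],
    "noise": [2.0, 2.0, 2.0, 2.0, 2.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

ranking = rank_inputs(inputs, target)
for name, score in ranking:
    print(f"{name}: R-square = {score:.3f}")
```

In this toy data set "income" predicts the target perfectly, "age" partially, and the constant "noise" column not at all, so the ranking puts them in that order. A screening tool would then carry the highest-ranked inputs forward into the selection step.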

Data Mining Regression
• New features include:
  – Uses DMDB as an input data source
  – Handles training, validation, test and score data sets
  – Accepts both continuous and discrete variables as inputs
  – Accepts binary, continuous, or ordinal variables as targets

Data Mining Regression
• Statistical methodologies and algorithms supported include:
  – Multiple linear regression
  – Logistic regression
  – Variable selection methods
  – Multiple optimization methods
  – Event and Event/Trial coding for classification target variables

Data Mining Regression
• Model results and assessments are provided in the form of:
  – parameter estimates and related statistics
  – goodness of fit statistics

Decision Trees
• DATA SPLITS is a new procedure that allows users to construct classification and decision trees.
• This procedure replaces the TREEDISC macro and the SAS Tree Application.

Decision Trees
• Features include:
  – Uses DMDB as an input data source
  – Accepts both continuous and discrete input and target variables
  – Incorporates missing values for the inputs into the modeling process

Decision Trees
• Statistical methodologies and algorithms supported include:
  – Utility functions defined for each alternative decision
  – Different fitting criteria for continuous and discrete targets, CART, and CHAID
  – Manual pruning of the tree in a graphical environment

Decision Trees
• Model results and assessments are provided in the form of:
  – Utilities for rule assessment on both training and validation data sets
  – Goodness of fit statistics
  – Interactive classification tree
  – Interactive 3-D tree-ring graph
  – Decision rules

Neural Networks
• NEURAL is a new procedure that allows users to construct and train neural networks.
• This procedure replaces the TNN macros and the SAS Neural Network Application.

Neural Networks
• Features include:
  – Uses DMDB as an input data source
  – Accepts both continuous and discrete input and target variables
  – Provides interactive network diagram for construction of neural networks

Neural Network
• Statistical methodologies and algorithms supported include:
  – Construction of multi-layer feedforward networks and radial basis functions
  – Multiple training techniques including nonlinear optimization methods and backpropagation
  – User control over selection of activation and objective functions

Neural Network
• Model results and assessments are provided in the form of:
  – Goodness of fit statistics including RMSE, SBC, and AIC
  – Misclassification tables for nominal outputs

Associations
• ASSOC and RULEGEN are new procedures that allow users to discover associations among items in a data base.
• Possible Applications:
  – Market basket analysis
  – Analysis of Web usage
  – Bank transactions

Associations
• Features include:
  – Uses DMDB as an input data source
  – Discovers rules of the form:
    · if item A is part of an event, then x% of the time, item B is also part of the event

Associations
• Statistical methodologies and algorithms supported include:
  – Constructs rules containing a left-hand-side (LHS) and a right-hand-side (RHS) based on frequency counts for various combinations of items

Associations
• Model results and assessments are provided in the form of:
  – Association rules
  – Information statistics such as confidence, support, and lift
  – A user interface which allows users to sort the rules by information statistics and to select both the LHS and RHS rules

Model Management and Assessment
• Users can assess results of modeling through interactive assessment graphs, gains charts, and profit and ROI graphs.
• A common interface for each modeling tool allows the user to document and manage the model development process.
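To make the association statistics named in the Associations slides concrete, here is a minimal sketch that computes support, confidence, and lift for one rule from frequency counts. It is plain Python, not the ASSOC or RULEGEN procedures. The function name and the toy baskets are hypothetical, and the formulas are the usual textbook definitions, which the slides themselves do not spell out.

```python
# Illustrative sketch only: plain Python, not PROC ASSOC / PROC RULEGEN.

def rule_stats(transactions, lhs, rhs):
    """Support, confidence, and lift for the rule LHS -> RHS.

    Assumed textbook definitions:
      support    = share of transactions containing LHS and RHS together
      confidence = share of LHS transactions that also contain RHS
      lift       = confidence divided by the overall share of RHS
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_rhs = sum(1 for t in transactions if rhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    support = n_both / n
    confidence = n_both / n_lhs if n_lhs else 0.0
    lift = confidence / (n_rhs / n) if n_rhs else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical market baskets (one set of items per event).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

stats = rule_stats(baskets, lhs={"bread"}, rhs={"butter"})
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Here the confidence is the "x% of the time" figure in the rule form quoted above: of the three baskets that contain bread, two also contain butter. A lift above 1 indicates that butter appears with bread more often than its overall frequency alone would suggest.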

The SAS Enterprise Miner(TM) Architecture
Client-server Approach:
• Clients: Win 95, Win NT
• Servers: Win NT, all major UNIX
• Mainframe as Data Server, later also Compute Server

SAS System and Data Mining
Approx. Timeline
[Timeline chart for 1997; axis: Feb, Apr, Jun, Aug, Oct, Dec. Tracks: The SAS System for Data Mining; SAS Enterprise Miner. Labels on the chart: Restr. alpha, beta, prod, SUGI/CEBIT, SEUGI.]

Summary
The SAS Data Mining Solution is unique:
• IT: DW Access, Scalability
• Business Users: Intuitive Interface and Business Orientation
• Data Miners: Analytical Depth and Flexibility

Data Mining with The SAS System
Thank you for your Attention!
Questions?