Enterprise Miner Overview
ECLT5810 E-Commerce Data Mining Techniques
Overview of SAS Enterprise Miner
1. Start Enterprise Miner
I. Select English Windows
II. Under Windows, select Start Menu -> SAS -> SAS 9.3
III. Type “miner” in the command prompt of SAS
2. Enterprise Miner (EM) 4.3
I. SAS 9.3 Interface
Several windows:
1. Log - logging messages of the SAS program
2. Editor - SAS program editor
3. Output - output of the SAS program
4. Explorer
5. Results
II. SAS Enterprise Miner
Some Terminology
- Node: EM tool to perform data mining
- Connection: data flow between nodes
- Project Flow Diagram (PFD): diagram that specifies the data mining process
The Interface
Project Navigator
- Three tabs:
  - Diagrams: lists the current project and diagrams
  - Tools: contains a palette of EM tools for constructing the process flow diagram
  - Reports: contains the HTML report entries generated by the Reporter node
Diagram Workspace
- building, editing, running and saving the process flow diagram
Tools Bar
- contains commonly used EM nodes
- 7 categories:
  1. Sample: e.g. Input Data Source, Sampling, Data Partition
  2. Explore: e.g. Distribution Explorer, Multiplot, Variable Selection
  3. Modify: e.g. Transform Variables, Filter Outliers, Clustering
  4. Model: e.g. Regression, Tree, Neural Network
  5. Assess: e.g. Assessment, Reporter
  6. Scoring: e.g. Score
  7. Utility: e.g. SAS Code
Progress Indicator
- displays the progress of the EM task
Message Panel
- displays messages about the execution of the EM task
Connection Status Indicator
- displays the remote host name and the connection status for a client-server project
Command Prompt
- command line input
Pull-Down Menus
- functions provided by different nodes and windows
- change dynamically with different nodes and windows
Toolbar
- specific functions provided by different nodes and windows
- change dynamically with different nodes and windows
Pop-up Menus
- common operations for the selected nodes (open, run, results, … )
- invoked by clicking the right mouse button on the node
3. Five Processes in Data Mining
SAS defines Data Mining as the following tasks:
1. Sample the data by creating one or more data tables. The samples should be large enough to contain the significant information, yet small enough to process.
   - define data source
2. Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
   - find mining direction / decide mining tools
3. Modify the data by creating, selecting, and transforming the variables to focus the model selection process.
   - preparation before modeling (normalization, missing value handling, …)
4. Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
   - the mining process
5. Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
   - evaluation
-> The SEMMA processes: Sample, Explore, Modify, Model, Assess
4. Creating a New Project
- Project: a collection of Enterprise Miner process flow diagrams and information
- Only one project (but many diagrams) can be edited at the same time
- Different data mining task -> different project
I. Create a new project: Pull-down menus -> File -> New -> Project
II. Project structure
Project Location /
- EMDATA /
- EMPROJ /
- USERS /
- REPORTS /
- *.dmd (process flow diagram)
- project_name.dmp (*.dmp : project file)
EMDATA Directory
- contains data sets and large temporary files created when running processes
EMPROJ Directory
- contains information for each diagram and its nodes (settings and target profile)
USERS subdirectory
- contains files that represent the users currently sharing the project
REPORTS Directory
- HTML reports generated by the Reporter node are stored in this directory. Each report has its own subdirectory. The name of the subdirectory defines the name of the report.
5. Sample
I. Input Data Source
- Specify the source of data and details about the variables in the data set
- Typically the first node of a process flow diagram
- View distribution of each variable
- Create target profile for each target variable

Data Tab
- Specify data set from SAS destination
  - e.g. SAMPSIO.DMAGECR (All: German Credit Data)
- Set data set role / size
  - RAW - default
  - TRAIN - training set, model fitting
  - VALIDATE - assess the adequacy of the model, fine-tune model nodes (Regression, Tree and Neural Network)
  - TEST - obtain a final, unbiased estimate of the error of the model
  - SCORE - data set to be processed by the trained model
Variables Tab
- Define attributes (meta information)
- Set target profile (taught later)
  - Profit / Loss, prior probability, …
- Set model role, measurement, type, format and view distribution of different variables
  - e.g. change the model role of variable A to target
    - right-click the Model Role of A to invoke the pop-up menu
    - select Set Model Role -> target
  - e.g. the attribute GOOD_BAD in the German Credit Data can be set to target
- Common model roles include:
  - target - the variable required to be predicted.
  - input - a variable that is used to predict the target.
  - rejected - a variable that is excluded from the analysis in the process flow.
  - id - an indicator variable for every transaction in the data set. The Associations node requires it for association discovery while other modeling nodes exclude it.
- View distribution of each variable
  - Right-click the selected variable and choose View Distribution
Interval Variables and Class Variables Tabs
- Display summary statistics and distributions for interval (continuous) and class (nominal / ordinal / binary) variables
Import Self-Created Data File
- Sources: Excel, Oracle and Sybase
- In CU, 4 formats:
  - Delimited File (*.*)
  - Comma Separated Value (*.csv) (can be exported from Excel)
  - Tab Delimited File (*.txt) (can be exported from Excel)
  - Excel file (*.xls)
Procedures:
1) Prepare a data file from Excel
   - variable names can be specified in the first row
   - save as one of the above 4 formats, say Tab Delimited
   - remember to close the file before exporting to EM
2) Type “prog” in the command prompt
3) Select Pull-down menu -> File -> Import Data (an Import Wizard will pop up)
4) Select file type, say Tab Delimited -> Next
5) Browse for the input file -> Next
   - Options - specify whether the first row is read as variable names
6) Set SAS Destination - specify Library and Member -> Finish
7) Create an Input Data Source node
   - Select input data
   - Set model role, measurement, type, format, … of all the variables
   - Set target profile for the target variable (if any)
8) View the data: open the node and select Pull-down menu -> File -> Details…
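
To make the "first row is read as variable names" option in step 5 concrete, here is a minimal Python sketch of what reading a Tab Delimited file involves. It is not how the SAS Import Wizard is implemented. The function name `read_tab_delimited`, the placeholder names `VAR1`, `VAR2`, … and the sample values are assumptions for illustration; only `AMOUNT` and `GOOD_BAD` are variable names mentioned in these notes.

```python
import csv
import io

def read_tab_delimited(text, first_row_is_names=True):
    """Parse tab-delimited text into (variable_names, rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return [], []
    if first_row_is_names:
        # The first row supplies the variable names; the rest are data records.
        return rows[0], rows[1:]
    # No header row: every row is data, so generate placeholder names (an assumption).
    names = [f"VAR{i + 1}" for i in range(len(rows[0]))]
    return names, rows

# Made-up sample data in the shape of an Excel "Tab Delimited" export.
sample = "AGE\tAMOUNT\tGOOD_BAD\n35\t1200\tgood\n52\t4800\tbad\n"
names, records = read_tab_delimited(sample)
print(names)    # ['AGE', 'AMOUNT', 'GOOD_BAD']
print(records)  # [['35', '1200', 'good'], ['52', '4800', 'bad']]
```

Every value arrives as text, which is why step 7 still asks you to set the measurement and type of each variable after the import.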
II. Sampling
- Sampling the data set is recommended for extremely large databases because it can greatly decrease model fitting time

Data Tab
- Display data set information and data records
Variables Tab
- Display variable information
General Tab
- Select Sampling Method, Size and Random Seed
  - Simple Random
  - Nth (systematic sampling)
    1. Total: 100; Sampling Size: 5% (5 observations) -> n = 100 / 5 = 20
    2. Randomly select a start position, say 10
    3. Select the 10th, 30th, 50th, 70th and 90th observations
    4. Every nth observation may pick up part of a structure in the data, especially when the data set is sorted
  - Stratified
    1. Specify class variables to form strata (subsets)
    2. Preserve the strata proportions of the original data set
  - First N
    1. Select the first N observations
  - Cluster
    1. Cluster variable: class variable only
    2. Cluster the data set based on the selected cluster variable
    3. Data within the same cluster have the same probability of being selected
Output Tab
- List the output data set
III. Data Partition
- Partition the data set into TRAIN, VALIDATION and TEST subsets
  - TRAIN - used for preliminary model fitting. The analyst attempts to find the best model weights using this data set.
  - VALIDATION - used to assess the adequacy of the model in the Model Manager and in the Assessment node. The validation data set is also used for model fine-tuning in the following nodes:
    - Tree node - to create the best subtree.
    - Neural Network node - to choose among network architectures or for the early stopping of the training algorithm.
    - Regression node - to choose a final subset of predictors from all the subsets computed during stepwise regression.
  - TEST - used to obtain a final, unbiased estimate of the generalization error of the model.

Data and Variables Tabs (same as the ones in Sampling)
Partition Tab
- Specify Partition Method and Random Seed
  - Stratification: maintain the distribution of the original data set
  - User Defined: partition using specific variables with specific values
- Specify Train, Validation and Test data set sizes
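
The Nth (systematic) sampling arithmetic from the Sampling node and the stratified option of the Data Partition node can be sketched in plain Python as follows. This only illustrates the ideas and does not reproduce what Enterprise Miner runs internally. The function names, the 0-based start index (the notes count positions from 1) and the 40/30/30 split are assumptions chosen for the example.

```python
import random
from collections import defaultdict

def nth_sample(records, fraction, seed=0):
    """Systematic (Nth) sampling: keep every n-th record from a random start."""
    if not records:
        return []
    sample_size = max(1, round(len(records) * fraction))
    step = len(records) // sample_size           # e.g. 100 // 5 = 20
    start = random.Random(seed).randrange(step)  # 0-based start inside the first interval
    return records[start::step][:sample_size]

def stratified_partition(records, key, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split records into TRAIN / VALIDATION / TEST, preserving strata proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)             # group records by class level
    train, validation, test = [], [], []
    for members in strata.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_valid = round(len(members) * fractions[1])
        train += members[:n_train]
        validation += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]      # whatever is left goes to TEST
    return train, validation, test

# Made-up data: 100 records, 30 of them "bad" and 70 "good".
data = [(i, "bad" if i % 10 < 3 else "good") for i in range(100)]
print(nth_sample(data, 0.05))                    # 5 records, 20 positions apart
train, validation, test = stratified_partition(data, key=lambda rec: rec[1])
print(len(train), len(validation), len(test))    # 40 30 30
```

Fixing the seed plays the role of the Random Seed setting: the same seed gives the same sample and the same partition on every run.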
6. Explore
I. Distribution Explorer
- Visualize large volumes of data graphically
- Uncover patterns and trends and reveal extreme values
- Generate multidimensional histograms for interval and class variables
- Fully interactive

Variables Tab
- Axis (x-, y- and z-axis) assignment for different variables
- Create a summary report by unselecting the Chart Only check box
X Axis, Y Axis and Z Axis Tabs
- Show distribution of the selected variables
- Set axis for different variables
- Toolbar button: probe the values of the diagram (click and hold, then move the cursor over the diagram)
- Toolbar button: resize the diagram
- Legend for interval variables: specify the no. of bins
- Toolbar button: set the max and min boundary (left-click the mark of the max / min bar and move)
After running the Distribution Explorer node
- View results (Toolbar button)
  - A View pull-down menu is added
  - Dimension and Perspective: drill up / drill down (right click -> View -> Dimension)
- Toolbar button: move the legend (drag and move the legend)
- Toolbar button: spin the diagram (move in any direction)
II. Multiplot
- Visualize large volumes of data graphically
- Uncover patterns and trends and reveal extreme values
- Automatically create bar charts and scatter plots for the input and target variables
- Batch

Variables Tab
- Specify whether the variable is used to plot graphs
Set graph properties
- Pull-down menu -> Tools -> Settings
View results
- The View and Graphs pull-down menus are created
  - Bar Charts:
    - Histogram of each input and target.
    - Bar chart of each input versus each class target.
    - Bar chart of each input grouped by each interval target.
  - Scatter Plots:
    - Plot of each interval input versus the target.
    - Plot of each class input versus the target.
Model Tab
- Check information of the data source
III. Insight
- SAS/INSIGHT
- Analyses data through linked graphs
- Most useful and powerful visualization tool in EM

Data Tab
- Specify whether to process the entire data set or part of it
After running, view results
- The Analyze, View, Tables, Graphs, Curves and Vars pull-down menus are created
- A data window showing the values of all the data will be shown
How to analyze:
- Data Window
  - Select a subset of data
  - Sort data -> select a variable and click the button in the upper left corner
- Histogram
  - Show the distribution of variables (frequency of different values)
  - Pull-down menu -> Analyze -> Histogram/Bar Chart (Y)
    1. Select Y variables from the left window and press Y (use the left mouse button with the Shift / Ctrl key)
       - if more than one Y variable is selected -> the graphs will be shown simultaneously -> relationship between multiple graphs
    2. Things to note:
       - Distribution, Distribution Center
       - Skewing? Outliers?
    3. Select any bar in any graph
       - Value of the bar chart will be displayed
       - Use Shift / Ctrl to select multiple regions
       - The same portion of data will be selected in other graphs and the data window
       - Double-click the point -> show values of the data selected
    4. Brushing observations
       - Select a portion of the observations in any graph (bounded by a rectangle)
       - Drag the rectangle and move -> the selected portion will be changed accordingly
       - Investigating multivariate data
    5. You can group the graphs by values of Class variables (select variable(s) and press Group)
       - e.g. two variables: X = 1, 2; Y = a, b; group X and Y -> 4 graphs:
         1. X = 1, Y = a; 2. X = 1, Y = b; 3. X = 2, Y = a; 4. X = 2, Y = b.
       - Pull-down menus -> Edit -> Windows -> Align -> make the y-axis of all graphs the same scale
    6. Choose a Freq variable to display the sum of the values of a numeric variable instead of the no. of data
       - choose GOOD_BAD as Y and AMOUNT as Freq -> we can see the total amount of loan by bad applicants
    7. Change bin width and starting point (numeric variable only)
       - Pull-down menus -> Edit -> Windows -> Tools
       - Click the hand icon and click on the histogram
       - Width changes when you press and move the mouse vertically
       - Starting point changes when you move horizontally
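
The effect of choosing a Freq variable in the histogram can be illustrated with a small Python sketch: without it, each bar counts records; with it, each bar sums the Freq variable. The loan records below are made-up values, and `bar_heights` is a hypothetical helper, not a SAS/INSIGHT function.

```python
from collections import defaultdict

# Hypothetical loan records as (GOOD_BAD, AMOUNT) pairs; values are invented.
loans = [("good", 1200), ("bad", 4800), ("good", 2500), ("bad", 900), ("good", 3100)]

def bar_heights(records, use_freq=False):
    """Bar height per class level: record count, or sum of the Freq variable."""
    heights = defaultdict(int)
    for level, amount in records:
        heights[level] += amount if use_freq else 1
    return dict(heights)

print(bar_heights(loans))                 # {'good': 3, 'bad': 2}
print(bar_heights(loans, use_freq=True))  # {'good': 6800, 'bad': 5700}
```

With AMOUNT as Freq, the "bad" bar shows the total loan amount held by bad applicants rather than how many bad applicants there are.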
- Create Box Plots
  - Compare distributions of interval data (numeric variable)
  - Pull-down menu -> Analyze -> Box Plot/Mosaic Plot (Y)
    1. Select Y variable (interval variable) and X variable
    2. Select a Label variable to label the data with extreme values on the selected variable
    3. Click the graph to see the range between regions
- Create Scatter Plots
  - Show the relationship between two variables Y and X (2D)
  - Pull-down menu -> Analyze -> Scatter Plot (Y X)
    1. Select X and Y variables
    2. Multiple variables can be selected for both X and Y -> Click OK
    3. Select any bar in any graph -> Variable IDs of the selected data will be displayed
    4. Use Edit -> Windows -> Tools to edit the scatter point style
- Distribution (Y)
  - The Tables, Graphs and Curves pull-down menus are enabled
  - Tables Menu
    1. Select the tables displayed in the distribution output
    2. Default - Moments and Quantiles tables are displayed
  - Graphs Menu
    1. Select the graphs displayed in the distribution output
    2. Default - Box Plots and Histogram
    3. QQ plot can be selected here -> compares ordered values of a variable with quantiles of a specific theoretical distribution. If the data are from the theoretical distribution, the points on the QQ plot lie approximately on a straight line. The normal, lognormal, exponential, and Weibull distributions can be used in the plot.
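
The QQ plot described under the Graphs menu can be sketched by computing its points directly. The example below assumes a normal reference distribution and the plotting position (i + 0.5) / n, which is one common convention and may differ from the one SAS/INSIGHT uses. The function name and the data are illustrative.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each ordered value with the matching standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i + 0.5) / n keeps every probability strictly between 0 and 1.
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

data = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6]  # made-up measurements
for theoretical, observed in normal_qq_points(data):
    print(f"{theoretical:6.2f}  {observed:4.1f}")
```

Plotting the observed values against the theoretical quantiles gives the QQ plot; points falling close to a straight line suggest the data are consistent with the normal distribution.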
Help: Pull-down menu -> Help -> EM Reference, then find “insight”
IV. Variable Selection
- Reduce the number of inputs by setting the status of the input variables that are not related to the target to rejected
  - a target variable should be specified
- Results can be overridden by resetting the status of the variables
- Processes interval or binary targets only; for an ordinal target -> use the Regression node to select variables

Variables Tab
- Select the variables to be processed
Manual Selection Tab
- Override the role of a variable (automatic / rejected / input)
Target Association Tab
- Disabling the Automatic Selection Process
  - deselect the Remove variables unrelated to target check box -> select manually
- Selection Criterion
  1. R-square (Linear and Logistic Regression)
     - Numeric target
     - Fast preliminary variable assessment
     - Settings
       - Squared correlation: cut-off threshold
       - Stepwise R2 Improvement: cut-off threshold of a further forward stepwise regression
       - Include 2-way interactions -> measure the effect of a class variable across all the levels of another class variable
       - Bin interval variables -> bin interval variables into 16 equally-spaced groups
       - Use only grouped class variables -> reduce the levels of class variables
  2. Chi-square (Decision Tree)
     - Binary target variable
     - Settings
       - Bins: no. of bins
       - Chi-square: larger -> fewer splits
       - Passes: no. of passes through the data to determine the optimum splits
General Tab
- Remove variables with a certain amount of missing values
- Remove variables in hierarchies
  - e.g. district (least detailed) and street (most detailed)
7. Modify
I. Data Set Attribute
- Modify data set attributes (name, description and role)
- Modify metadata information
- Similar to the Input Data Source node
II. Transform Variables
- Create new variables transformed from existing variables in the data set
- Transform interval variables only
- Improve the fit of the model to the data
- Stabilize variances, remove nonlinearity, …

Variables Tab
- A Table Editor and 3 Toolbar buttons: transform / create / delete variable
  - select variables in the Table Editor and click a Toolbar button (transform / delete)
  - only created / transformed variables can be deleted
- Three types of transformation
  1. Simple Transformations
     - log
     - square root
     - inverse
     - square
     - exponential
     - standardize
  2. Binning Transformations
     - bucket (equal width)
     - quantile (equal no. of data)
     - optimal binning for relationship to target -> no. of bins and bin range can be specified
  3. Best Power Transformations
     - maximize normality -> use when the distribution is skewed
     - maximize correlation with target -> match interval inputs with interval target
     - equalize spread with target levels -> match interval inputs with different levels in class target
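
The difference between the bucket (equal width) and quantile (equal no. of data) binning transformations listed above is easiest to see on skewed data. The following Python sketch is a simplified illustration with hypothetical function names. It does not reproduce the exact bin boundaries or tie handling of the Transform Variables node.

```python
def bucket_bins(values, n_bins):
    """Equal-width binning: return a bin index (0..n_bins-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Equal-count binning: rank the values and cut the ranks into n_bins groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

amounts = [1, 2, 3, 4, 5, 6, 7, 100]   # made-up values with one extreme observation
print(bucket_bins(amounts, 4))         # [0, 0, 0, 0, 0, 0, 0, 3]
print(quantile_bins(amounts, 4))       # [0, 0, 1, 1, 2, 2, 3, 3]
```

With equal-width buckets the single extreme value leaves two bins empty and crowds everything else into the first one, while quantile binning keeps two observations in each bin.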
III. Replacement
- Replace missing and nonmissing values
- Default: data with missing values are not used in the Variable Selection, Regression and Neural Network nodes
- No single correct answer for what value to replace with
- Value imputation

Imputation can also be done by: Clustering / SOM/Kohonen nodes
Data Tab
- Training Subtab - specify the training set for imputation (i.e. the data set used to calculate mean, median, mode, …)
Defaults Tab
- General Subtab
  - Replace before imputation:
    - Replace nonmissing values before imputation of missing values
    - Specified in the Interval Variables and Class Variables Tabs
  - Replace unknown levels with:
    - Replace unknown values of class variables in the score data set
    - New class variables in the scoring data set
  - Create imputed indicator variables:
    - create a variable to indicate whether the input data has a replaced value or not
- Imputation Method Subtab
  - Specify the imputation method for interval and class variables
  - Interval variable:
    - Mean
    - Median
    - Midrange: (max + min) / 2
    - Distribution-based: random percentiles of the variable’s distribution -> does not change the original distribution
    - Tree imputation -> analyze each input as a target using the remaining input and rejected variables -> may be more accurate
    - Mid-minimum spacing: same as midrange but using a portion of the data
    - Tukey's biweight / Huber's / Andrew's Wave
      - Robust M-estimators of location
      - Minimize functions of deviations of the data
    - Default constant
    - None: no imputation
  - Class variable:
    - Most frequent value (count)
    - Distribution-based
    - Tree imputation
    - Default constant
    - None
Interval Variables Tab
- Customize the interval imputation method specified in the Defaults Tab
- Replace nonmissing values
  - (Replace < / With Value (<)) and (Replace > / With Value (>))
  - Replace extreme values before performing missing value imputation
Class Variables Tab
- Customize the class imputation method specified in the Defaults Tab
- Replace nonmissing values (Replace Value)
- You can change the order of ordinal data here
Tree Imputation Tab
- Specify the predictor variables used in tree imputation
Frequency Tab
- List freq variable(s) of the data set
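
The simplest interval imputation methods of the Replacement node (mean, median and midrange), together with the imputed indicator variable, can be sketched in Python as follows. This is an illustration under simplifying assumptions: missing values are represented by `None`, the same list serves as both training and input data, and `impute_interval` is a hypothetical name rather than an Enterprise Miner function.

```python
from statistics import mean, median

def impute_interval(values, method="mean"):
    """Replace None entries using a statistic of the nonmissing values."""
    known = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(known)
    elif method == "median":
        fill = median(known)
    elif method == "midrange":
        fill = (max(known) + min(known)) / 2
    else:
        raise ValueError(f"unknown method: {method}")
    imputed = [fill if v is None else v for v in values]
    indicator = [int(v is None) for v in values]  # 1 where a value was replaced
    return imputed, indicator

amount = [10, None, 30, 100, None]          # made-up values with two missing entries
print(impute_interval(amount, "median"))    # ([10, 30, 30, 100, 30], [0, 1, 0, 0, 1])
print(impute_interval(amount, "midrange"))  # ([10, 55.0, 30, 100, 55.0], [0, 1, 0, 0, 1])
```

The three methods give noticeably different fill values here (about 46.7 for the mean, 30 for the median and 55 for the midrange), which illustrates why there is no single correct replacement value.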
Note: no imputation is done for freq variables.
8. Model
- The mining process
1. Tree
   - To construct decision tree models for your data (e.g. CHAID, CART, C4.5 and C5.0).
   - The Tree node supports both automatic and interactive training.
2. Regression
   - To use linear and logistic regression models for your data.
   - It supports the stepwise, forward, and backward selection methods.
3. Neural Network
   - To construct, train, and validate multilayer feedforward neural networks.
9. Assess
Assessment
- To compare models and predictions from the Model nodes.
- To make cross-model comparisons and assessments, independent of all other factors (such as sample size, modeling node, and so on).
- The champion model(s) can be exported to a Score node to score new data.
References:
SAS Enterprise Miner. http://support.sas.com/documentation/onlinedoc/miner/