Using SAS Enterprise Miner™ for Forecasting Response and Risk

Kattamuri S. Sarma, Ph.D., White Plains, N.Y.

Abstract:
This paper shows how to organize and execute a data mining project for developing predictive models for direct marketing. The steps involved in developing the project are demonstrated using SAS Enterprise Miner™. A data mining process diagram is included to show the sequence of steps in the project. The data mining process diagram consists of a number of connected nodes (tools), each node performing a particular task and passing its output to the next node. The nodes included in the diagram are: Input Data Source node, Filter Outliers node, Data Partition node, Decision Tree node, Regression node, Assessment node, Score node, SAS Code node, and Insight node. This paper shows the steps involved in developing a response model using SAS Enterprise Miner™. A risk model is also developed, with claim frequency as the target variable. Since the steps involved in both types of models are the same, the diagrams are provided only for the response model. Before developing the models it is necessary to prepare the data. Data preparation involves finding and eliminating errors, filtering outliers, and imputing missing values. This paper shows how these tasks can be performed using various tools provided by SAS Enterprise Miner™. There are several modeling options in the Enterprise Miner™; due to their intuitive appeal, decision tree models are demonstrated in this paper.

Introduction:
The data set used in this paper is hypothetical; it is generated for illustrative purposes only. The examples provided here are generated using SAS Version 8.2 for Microsoft Windows 98. A company may try to boost the efficiency of its marketing campaign by promoting its products or services to those individuals who are most likely to respond.
Response models are used to forecast an individual's probability of response and rank the individuals according to the predicted probability of response. However, certain individuals with a high probability of response may also have a high propensity to generate losses for the company. Therefore, it is also necessary to forecast the potential losses associated with each individual in the target population. Risk models are used to predict these potential losses.

In response models the variable being predicted (i.e., the target variable) is usually binary: it takes the value of 1 if there is a response, and 0 if there is no response. In risk models, the target variable can be binary, ordinal, or continuous. For example, banks offering loans will incur losses if an acquired customer fails to pay the borrowed amount. In auto insurance, losses arise whenever the acquired customer has an insurance claim. In risk models for auto insurance companies, the frequency of claims can be used as an indicator of risk. In this paper a risk model is developed to predict claims per car-year. The target variable, namely the claim frequency, is continuous (interval scaled); by rounding, it can be changed into an ordinal variable.

Response models can be used to calculate a response score, while risk models can be used to calculate a risk score for each individual in the target population. Both of these scores should be used together to achieve optimum selection of individuals (current or prospective customers) for promoting products or services. Customer profitability can also be derived from these scores.

Setting up the Forecasting Project in SAS Enterprise Miner:
To start a new project in the Enterprise Miner we follow these steps: (1) From the menu bar at the top of the SAS window click on Solutions. (2) Select Analysis and Enterprise Miner. (3) From the menu bar select File -> New -> Project. The Create New Project window opens as shown in Diagram 1. In this window you type the name of the project and select "Create." An Enterprise Miner window opens that contains two sub-windows (Diagram 2). The right-most sub-window is the Diagram Workspace; to its left is the Project Navigator. In the Project Navigator you see the project name followed by the names of the diagrams. Initially there is one diagram named "Untitled"; this is changed to "Response 1" in this example. At the bottom of the Project Navigator window there are three tabs: Diagram, Tools, and Reports.

After clicking on the Tools tab, a menu of tools opens up. One can click on any tool and drag it into the Diagram Workspace. With a simple point-and-click action, one can perform complex tasks using these tools. Some of these tools are also on the tool bar. A Data Mining Process Diagram, created for developing the response model, is shown in Diagram 3.

Diagram 1: Creating a new project
Diagram 2: Enterprise Miner window: Project Navigator and Diagram Workspace
Diagram 3: Enterprise Miner window: Data mining process diagram

Input Data Source node:
In the data mining process diagram (Diagram 3) the first node is the Input Data Source node. In this node the source data is specified, and the roles of the variables are defined. SAS creates a data mining database from the input source data. Diagram 4 below shows the Input Data Source node. One can open the Input Data Source node either by double-clicking on it or right-clicking and selecting "Open."

Source Data, Description, and Role:
In order to specify the source data, one first selects the library reference and then the data set name. In this example, the source data is mylib.book1. The role of the data set is "Raw." Other choices for the role are: "Train," "Validate," "Test," or "Score." At the top of the Input Data Source window there are five tabs labeled: Data, Variables, Interval Variables, Class Variables, and Notes. In this window one can also change the size of the metadata sample from its default size of 2000.
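Enterprise Miner uses the metadata sample to propose a measurement level (interval or class) and a model role for each variable. The sketch below is only an illustration of that idea in Python; the thresholding heuristic and the sample data are our own assumptions, not Enterprise Miner's actual rule.

```python
# Illustrative sketch (hypothetical heuristic, not Enterprise Miner's rule):
# guess a measurement level for each variable from a metadata sample.

def infer_level(values, class_threshold=20):
    """Return 'interval' for numeric variables with many distinct values,
    otherwise 'class' (nominal/ordinal)."""
    distinct = set(values)
    numeric = all(isinstance(v, (int, float)) for v in distinct)
    if numeric and len(distinct) > class_threshold:
        return "interval"
    return "class"

# A tiny made-up metadata sample with two variables.
sample = {
    "AGE":    [18 + i % 25 for i in range(50)],            # many distinct numbers
    "GENDER": ["M" if i % 2 else "F" for i in range(50)],  # two categories
}
levels = {name: infer_level(vals) for name, vals in sample.items()}
print(levels)
```

In the real tool this suggestion is only a starting point; the analyst can override the level and role in the Variables tab.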
The Input Data Source window opens, allowing one to specify the source data and the roles of the variables.

Diagram 4: Input Data Source node
Diagram 5: Assigning model role to variables in the Input Data Source node

Assigning "Model Role" to the variables in the Input Data Source node:
You can select the Variables tab to view the variables list and assign model roles. The variables window is shown in Diagram 5. There are 11 variables in the response model data set.

Table I: Variables for the Response Model
AGE: Age of the responder.
CREDIT: An index of credit rating.
MILEAGE: Annual miles driven.
GENDER: Gender.
EMP: Number of jobs held during the last 3 years.
RES: Number of addresses in the last 3 years.
NVEH: Number of vehicles owned.
RESTYPE: Type of residence - private house or other.

Filter Outliers node:
The Filter Outliers node (Diagram 6) is used to clean up the data. One can examine each variable graphically and eliminate extreme values or outliers. When this node is opened the window shown in Diagram 6 appears. At the top of this window there are several tabs: Data, Settings, Class Variables, Interval Variables, Output, and Notes. Suppose you wish to examine the interval variables for outliers. Click on the Interval Variables tab. In Diagram 6 the CREDIT variable is selected for examination. By right-clicking in the column titled "Range to include" and the row corresponding to the CREDIT variable, another window opens, as shown in Diagram 7. In this window there is a histogram of the variable CREDIT. The histogram has two vertical bars (labeled MIN and MAX) with handles, which can be moved horizontally. The position of the left handle defines the minimum value to include, and the position of the right handle defines the maximum value allowed for the variable. Any observation with the variable taking a value outside this range is excluded. In this example, we excluded observations with credit index above 728.859.
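The effect of the MIN/MAX handles can be mimicked in a few lines. The sketch below (plain Python with a made-up record layout) keeps only observations whose CREDIT value falls inside the chosen range, using the 728.859 upper cutoff from the example.

```python
# Sketch of the range-filter idea: keep observations whose CREDIT value
# lies inside the [MIN, MAX] range chosen in the Filter Outliers node.
# The record layout is hypothetical; the 728.859 cutoff is from the example.

records = [
    {"AGE": 34, "CREDIT": 310.2, "MILEAGE": 12000},
    {"AGE": 51, "CREDIT": 845.0, "MILEAGE": 8000},   # outlier: CREDIT too high
    {"AGE": 27, "CREDIT": 728.8, "MILEAGE": 20500},
]

credit_min, credit_max = 0.0, 728.859  # positions of the MIN and MAX handles

filtered = [r for r in records if credit_min <= r["CREDIT"] <= credit_max]
print(len(records) - len(filtered), "observation(s) excluded")  # 1 excluded
```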
MFDU: Dummy variable indicating whether the responder lives in a multifamily dwelling.
RESP: Target variable in the response model. If an individual responds to direct mail then RESP takes the value 1; otherwise it is 0.

Diagram 6: Filter Outliers node
Diagram 7: Filter Outliers node (Select Values window)
Diagram 8: Data Partition node

Data Partition node:
The filtered data set is then passed to the Data Partition node. Here the data set is partitioned into three data sets. The first one is for developing the models; this is called the "Training" data set. The next one is the "Validation" data set, for validating the model, and the third is the "Test" data set. In this example we use the test data set for scoring. These three data sets are passed to the successor nodes. In this example the data sets are first passed to the Decision Tree node, where we develop the model.

Decision Tree Methodology:
A decision tree partitions the observations (records, examples, or cases) of the data set into distinct groups (disjoint subsets), known as leaves, leaf nodes, or terminal nodes. Each leaf has a unique combination of ranges of the input variables. The root node is the first node of the tree, and it contains all the observations of the data set. Starting at the root node, the tree algorithm successively splits the data set into sub-regions or nodes. If a node cannot be partitioned further it becomes a terminal node (leaf, or leaf node). The process of partitioning proceeds in the following way:

Let X1, X2, ..., X100 be the variables in the data set. The tree algorithm examines all candidate splits of the form Xi ≤ C, where C is a real number between the minimum and maximum value of Xi. All records which have Xi ≤ C go to the left node, and those records which have Xi > C go to the right node. The algorithm selects the best split on each variable and then selects the best of these. The process is repeated at each node. In order to determine which split is the best, one can use tests of impurity reduction or Pearson's chi-square test.

Impurity reduction:
If c is the split of node v into two child nodes a and b, and π_a and π_b are the proportions of records from node v going into nodes a and b, then the decrease in impurity is i(v) − π_a i(a) − π_b i(b), where i(v) is the impurity index of node v, and i(a) and i(b) are the impurity indexes of nodes a and b. There are two measures of impurity: the Gini Index and Entropy.

Gini Impurity Index:
If p1 is the proportion of responders in a node, and p0 is the proportion of non-responders, the Gini Impurity Index of that node is defined as i(p) = 1 − p1² − p0². If two records are chosen at random (with replacement) from a node, the probability that both are responders is p1², while the probability that both are non-responders is p0², and the probability that they are either both responders or both non-responders is p1² + p0². Hence 1 − p1² − p0² can be interpreted as the probability that any two elements chosen at random (with replacement) are different. A pure node has a Gini Index of zero.

Entropy:
Entropy is another measure of the impurity of a node. For binary targets it is defined as i(p) = −Σ (i = 0 to 1) pi log2(pi) = −(p1 log2 p1 + p0 log2 p0).

Pearson's chi-square test:
To illustrate the chi-square test, let us consider a simple example. Suppose node v is split into two child nodes a and b, and suppose there are 1000 individuals in node v: 100 responders and 900 non-responders. Suppose in node a there are 400 individuals and 60 responders, and in node b there are 600 individuals and 40 responders. One can construct a 2 x 2 contingency table with rows representing the child nodes a and b, and columns representing the target classes (1 for response and 0 for no response). The chi-square test statistic can be calculated as χ² = Σ (O − E)² / E, where O is the observed frequency of a cell and E is the expected frequency under the null hypothesis that the class proportions are the same in each row. In the root node the proportion of responders is 10%. Under the null hypothesis we therefore expect 40 responders and 360 non-responders in node a, and 60 responders and 540 non-responders in node b. The observed frequencies are 60, 340, 40, and 560 respectively. The χ² statistic is computed using these expected and observed frequencies. Logworth can be calculated from the associated p-value as −log10(p-value); logworth increases as p decreases. If there are 10 candidate splits on an input, the one with the highest logworth is selected.

Stopping rules for limiting tree growth:
Starting at the root node, the algorithm splits each node further into child nodes or offspring nodes. Splitting a node involves examining all candidate splits on each input, selecting the best split on each, and picking the input that has the best of the selected splits. This process is repeated at each node. Any node which cannot be partitioned further becomes a leaf (terminal node, or leaf node). One can stop tree growth (i.e., the process of partitioning the nodes into sub-nodes) by specifying a stopping rule. One stopping rule may be to specify that a node should not be split further if the chi-square statistics are not significant; the level of significance can be specified in the Decision Tree node in the Enterprise Miner™. Tree growth can also be controlled by selecting an appropriate depth of the tree; the maximum depth can be specified in the Basic tab of the Decision Tree node.

Diagram 9: Decision Tree node: Basic Tab
Diagram 10: Decision Tree node: Advanced Tab
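The impurity and chi-square calculations above can be checked numerically. The sketch below (Python; the one-degree-of-freedom chi-square p-value is obtained from the complementary error function, so no stats library is needed) reproduces the worked example: node v with 100 responders out of 1000, split into a (60 of 400) and b (40 of 600).

```python
import math

def gini(p1):
    """Gini impurity 1 - p1^2 - p0^2 for a binary node."""
    p0 = 1.0 - p1
    return 1.0 - p1 * p1 - p0 * p0

def entropy(p1):
    """Binary entropy -sum p_i log2 p_i."""
    p0 = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p1, p0) if p > 0)

# Worked example: node v has 1000 records, 100 responders.
# Child a: 400 records, 60 responders; child b: 600 records, 40 responders.
i_v, i_a, i_b = gini(0.10), gini(60 / 400), gini(40 / 600)
reduction = i_v - 0.4 * i_a - 0.6 * i_b  # i(v) - pi_a*i(a) - pi_b*i(b)

# Chi-square: observed vs expected frequencies under equal class proportions.
observed = [60, 340, 40, 560]
expected = [40, 360, 60, 540]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value for chi-square with 1 df via the complementary error function,
# then logworth = -log10(p-value).
p_value = math.erfc(math.sqrt(chi2 / 2.0))
logworth = -math.log10(p_value)

print(round(i_v, 4), round(reduction, 4), round(chi2, 2), round(logworth, 2))
```

The chi-square statistic comes out near 18.5, far above the 1-df significance thresholds, so this split would easily survive a chi-square stopping rule.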
Response Model with Decision Tree:
In this example the decision tree partitioned the input space into the following disjoint subsets, or leaf nodes. These are also known as the terminal nodes or leaves of the tree.

Table II: The leaf nodes of the Response Model
1. EMP ≥ 3 and CREDIT < 152.5. Response score = 43.8
2. MS = U and EMP < 3 and CREDIT < 152.5. Response score = 17.2
3. MS = M and EMP < 3 and CREDIT < 152.5. Response score = 11.5
4. 152.5 ≤ CREDIT < 297.5 and EMP ≥ 3. Response score = 13.8
5. CREDIT ≥ 297.5 and EMP ≥ 3. Response score = 3.9
6. MILEAGE ≥ 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3. Response score = 3.5
7. MILEAGE ≥ 39,162 and CREDIT ≥ 285.5 and EMP < 3. Response score = 16.7
8. AGE < 27.5 and MILEAGE < 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3. Response score = 5.7
9. AGE ≥ 27.5 and MILEAGE < 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3. Response score = 2.7
10. 285.5 ≤ CREDIT < 347.5 and MILEAGE < 39,162 and EMP < 3. Response score = 1.4
11. CREDIT ≥ 347.5 and MILEAGE < 39,162 and EMP < 3. Response score = 0.7

Table III: The leaf nodes of the Risk Model
1. CREDIT < 66.5. Risk score = 0.259
2. AGE < 19.5 and 66.5 ≤ CREDIT < 293.5. Risk score = 0.273
3. AGE < 45.5 and CREDIT ≥ 293.5. Risk score = 0.089
4. AGE ≥ 45.5 and CREDIT ≥ 293.5. Risk score = 0.05
5. AGE ≥ 19.5 and 66.5 ≤ CREDIT < 197.5. Risk score = 0.155
6. AGE ≥ 19.5 and 197.5 ≤ CREDIT < 293.5. Risk score = 0.094

Note: The risk score is the same as the calculated claim frequency in each node. The target variable in the risk model is the frequency of claims per car-year. Since this is a continuous variable, the decision tree algorithm calculates the mean claim frequency for each leaf node. We call the expected claim frequency the risk score.

Score node:
The models are developed in the Decision Tree node and passed to the Score node. The Score node applies the model to the data set that we want to score. In the case of the response model, the Score node sends each case (record) to one of the 11 leaf nodes shown in Table II above, according to the ranges of the values of the input variables of the record. Accordingly, each record is assigned an expected probability equal to the posterior probability of response of its node.
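Conceptually, scoring just routes each record to a leaf and returns that leaf's score. The sketch below hard-codes the leaf definitions from Tables II and III as nested conditions; the function names are ours, and routing marital-status values other than U or M to the MS = M leaf is an assumption the tables do not specify.

```python
def response_score(age, ms, credit, mileage, emp):
    """Return the response score of the Table II leaf a record falls in."""
    if emp >= 3:
        if credit < 152.5:
            return 43.8
        return 13.8 if credit < 297.5 else 3.9
    # emp < 3
    if credit < 152.5:
        # Table II lists only MS = U and MS = M for this region;
        # sending other values to the MS = M leaf is our assumption.
        return 17.2 if ms == "U" else 11.5
    if credit < 285.5:
        if mileage >= 45900:
            return 3.5
        return 5.7 if age < 27.5 else 2.7
    # credit >= 285.5
    if mileage >= 39162:
        return 16.7
    return 1.4 if credit < 347.5 else 0.7

def risk_score(age, credit):
    """Return the mean claim frequency of the Table III leaf."""
    if credit < 66.5:
        return 0.259
    if credit >= 293.5:
        return 0.089 if age < 45.5 else 0.05
    # 66.5 <= credit < 293.5
    if age < 19.5:
        return 0.273
    return 0.155 if credit < 197.5 else 0.094

# A record with EMP >= 3 and CREDIT < 152.5 lands in the first leaf:
print(response_score(age=30, ms="M", credit=120.0, mileage=15000, emp=3))  # 43.8
print(risk_score(age=50, credit=400.0))  # 0.05
```

In Enterprise Miner this routing is done automatically by the Score node; the point of the sketch is only that a fitted tree reduces to a fixed set of disjoint rules.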
Similarly, in the risk model we connected the Decision Tree node to the Score node. Again, based on the input values, each record is assigned to one of the six terminal nodes shown in Table III, and the estimated mean claim frequency of the node (calculated from the training data set) is assigned to the record. This mean is the predicted claim frequency for all the individuals in the node.

Note: In each node the response score is the same as the proportion of responders; these proportions are also referred to as the posterior probabilities. In the first leaf node all cases have a value greater than or equal to 3 for the variable EMP. The proportion of responders in this node is 43.8%, so we assign a response score of 43.8 to this group. The second leaf node has all cases with marital status (MS) unknown (U) and a credit index less than 152.5. The response rate in this group is 17.2%, so the response score is 17.2.

Risk Model with Regression Tree:
The decision tree partitioned the input space into six disjoint subsets. For each subset a risk score is computed from the predicted mean claim frequency.

SAS Code node:
The SAS Code node enables you to further process the scored data set created in the Score node. Custom graphs and tables can be generated.

Insight node:
The Insight node is used to view the scored data set or any other data set imported from the predecessor node.

References:
(1) Kattamuri S. Sarma. 2001. "Enterprise Miner™ for Forecasting." Paper P250-26, presented at SUGI 26, Long Beach, California, April 24, 2001.
(2) SAS Institute Inc. 2000. Getting Started with Enterprise Miner™ Software, Release 4.1. Cary, NC: SAS Institute Inc.

Kattamuri S. Sarma, Ph.D.
61 Hawthorne Street
White Plains, NY 10603
Phone: (914) 428-8733. Fax: (914) 428-4551.
Email: KSSarma@worldnet.att.net