Using SAS Enterprise Miner™
For Forecasting Response and Risk
Kattamuri S. Sarma, Ph.D.
White Plains, N.Y.
Abstract
This paper shows how to organize and execute a data mining project for developing predictive models for direct marketing. The steps involved in developing the project are demonstrated using SAS Enterprise Miner™. A data mining process diagram is included to show the sequence of steps in the project. The data mining process diagram consists of a number of connected nodes (tools), each node performing a particular task and passing its output to the next node. The nodes included in the diagram are: Input Data Source node, Filter Outliers node, Data Partition node, Decision Tree node, Regression node, Assessment node, Score node, SAS Code node, and Insight node.
This paper shows the steps involved in developing a response model using SAS Enterprise Miner™. A risk model is also developed, with claim frequency as the target variable. Since the steps involved in both types of models are the same, the diagrams are provided only for the response model. Before developing the models it is necessary to prepare the data. Data preparation involves finding and eliminating errors, filtering outliers, and imputing missing values. This paper shows how these tasks can be performed using various tools provided by SAS Enterprise Miner™. There are several modeling options in the Enterprise Miner™. Due to their intuitive appeal, decision tree models are demonstrated in this paper.
Introduction:
The data set used in this paper is hypothetical. It is generated for illustrative purposes only. The examples provided here are generated using SAS Version 8.2 for Microsoft Windows 98.
A company may try to boost the efficiency of its marketing campaign by promoting its products or services to those individuals who are most likely to respond. Response models are used to forecast an individual's probability of response and rank the individuals according to the predicted probability of response. However, certain individuals with a high probability of response may also have a high propensity to generate losses for the company. Therefore, it is also necessary to forecast the potential losses associated with each individual in the target population. Risk models are used to predict these potential losses.
Setting up the Forecasting Project in SAS Enterprise Miner:
To start a new project in the Enterprise Miner we follow these steps:
(1) From the Menu bar at the top of the SAS window click on Solutions.
(2) Select Analysis and Enterprise Miner.
(3) From the Menu bar select File -> New -> Project.
The Create New Project window opens as shown in Diagram 1. In this window you type the name of the project and select "Create." An Enterprise Miner window opens that contains two sub-windows (Diagram 2). The right-most sub-window is the Diagram Workspace. To its left is the Project Navigator. In the Project Navigator you see the project name followed by the names of the diagrams. Initially there is one diagram named "Untitled." This is changed to "Response 1" in this example. At the bottom of the Project Navigator window there are three tabs: Diagram, Tools, and Reports. After clicking on the Tools tab, a menu of tools opens up. One can click on any tool and drag it into the Diagram Workspace. With a simple point-and-click action, one can perform complex tasks using these tools. Some of these tools are also on the tool bar. A Data Mining Process Diagram, created for developing the response model, is shown in Diagram 3.
In response models the variable being predicted (i.e., the target variable) is usually binary. It takes the value 1 if there is a response, and 0 if there is no response. In risk models, the target variable can be binary, ordinal, or continuous. For example, banks offering loans will incur losses if an acquired customer fails to pay the borrowed amount. In auto insurance, losses arise whenever the acquired customer has an insurance claim. In risk models for auto insurance companies, the frequency of claims can be used as an indicator of risk. In this paper a risk model is developed to predict claims per car-year. The target variable, namely the claim frequency, is continuous (interval scaled). By rounding, it can be changed into an ordinal variable.
Response models can be used to calculate a response score, while risk models can be used to calculate a risk score for each individual in the target population. Both of these scores should be used together to achieve optimum selection of names (current or prospective customers) for promoting products or services. Customer profitability can also be derived from these scores.
Diagram 1: Creating a new project
Diagram 2: Enterprise Miner Window: Project Navigator and Diagram Workspace
Diagram 3: Enterprise Miner window: Data mining process diagram
Input Data Source node:
In the data mining process diagram (Diagram 3) the first node is the Input Data Source node. In this node the source data is specified, and the roles of the variables are defined. SAS creates a data mining database from the input source data. Diagram 4 below shows the Input Data Source node. One can open the Input Data Source node either by double-clicking on it or by right-clicking and selecting "Open." The Input Source Data window opens, allowing one to specify the Source Data, Description, and Role. In order to specify the source data, one first selects the library reference and then the data set name. In this example, the source data is mylib.book1. The role of the data set is "Raw." Other choices for the role are: "Train," "Validate," "Test," or "Score." At the top of the input source window there are five tabs labeled: Data, Variables, Interval Variables, Class Variables, and Notes. In this window one can also change the size of the metadata sample from its default size of 2000.
Diagram 4: Input Data Source node
Diagram 5: Assigning model role to variables in the Input Data Source node
152
Assigning "Model Role" to the variables in the Input Data Source node:
You can select the Variables tab to view the variables list and assign model roles. The variables window is shown in Diagram 5. There are 11 variables in the response model dataset.
Filter Outliers node:
The Filter Outliers node (Diagram 6) is used to clean up the data. One can examine each variable graphically and eliminate extreme values or outliers. When this node is opened the window shown in Diagram 6 appears. At the top of this window there are several tabs: Data, Settings, Class Variables, Interval Variables, Output, and Notes.
Suppose you wish to examine the interval variables for outliers. Click on the Interval Variables tab. In Diagram 6 the CREDIT variable is selected for examination. By right-clicking in the column titled "Range to include" and the row corresponding to the CREDIT variable, another window opens, as shown in Diagram 7. In this window there is a histogram of the variable CREDIT. The histogram has two vertical bars (labeled MIN and MAX) with handles, which can be moved horizontally. The position of the left handle defines the minimum value to include and the position of the right handle defines the maximum value allowed for the variable. Any observation with the variable taking a value outside this range is excluded. In this example, we excluded observations with credit index above 728.859.
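The range filter described above can be sketched in plain Python (this is an illustrative stand-in for the Filter Outliers node, not Enterprise Miner code; the record layout is hypothetical, but the 728.859 cutoff comes from the example in the text):

```python
# Sketch of range-based outlier filtering: keep only observations whose
# CREDIT value lies between the MIN and MAX handle positions.

def filter_outliers(records, variable, min_value, max_value):
    """Keep only records whose `variable` lies within [min_value, max_value]."""
    return [r for r in records if min_value <= r[variable] <= max_value]

applicants = [
    {"ID": 1, "CREDIT": 310.0},
    {"ID": 2, "CREDIT": 905.2},   # extreme value; above the MAX handle
    {"ID": 3, "CREDIT": 152.5},
]

# MAX handle set at 728.859, as in the text; MIN left at zero.
clean = filter_outliers(applicants, "CREDIT", 0.0, 728.859)
```

Record 2 falls outside the allowed range and is excluded, mirroring what the MIN/MAX handles do in the node's histogram window.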
Table I: Variables for the Response Model
AGE: Age of the responder.
CREDIT: An index of credit rating.
MILEAGE: Annual miles driven.
GENDER: Gender.
EMP: Number of jobs held during the last 3 years.
RES: Number of addresses in the last 3 years.
NVEH: Number of vehicles owned.
RESTYPE: Type of residence - private house or other.
MFDU: Dummy variable indicating whether the responder lives in a multifamily dwelling.
RESP: Target variable in the response model. If an individual responds to direct mail then RESP takes the value 1, otherwise it is 0.
Diagram 6: Filter Outliers node
Diagram 7: Filter Outliers node (Select values window)
Diagram 8: Data Partition node
Data Partition node:
The filtered data set is then passed to the Data Partition node. Here the data set is partitioned into three data sets. The first one is for developing the models. This is called the "Training" data set. The next one is the "Validation" data set, for validating the model, and the third data set is the "Test" data set. In this example we use the test data set for scoring. These three data sets are passed to the successor nodes. In this example the data sets are first passed to the Decision Tree node, where we develop the model.
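Conceptually, the Data Partition node performs a random three-way split. A minimal Python sketch of that idea follows (the 40/30/30 proportions and the seed are assumptions for illustration; the paper does not state the split ratios it used):

```python
import random

# Hedged sketch of a Training/Validation/Test partition, in the spirit of
# the Data Partition node. Proportions here are illustrative, not the
# paper's actual settings.

def partition(records, train=0.4, validate=0.3, seed=12345):
    rng = random.Random(seed)            # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_valid = int(n * validate)
    return (shuffled[:n_train],                    # Training: develop the model
            shuffled[n_train:n_train + n_valid],   # Validation: validate the model
            shuffled[n_train + n_valid:])          # Test: used here for scoring

train_set, valid_set, test_set = partition(list(range(1000)))
```

Every record lands in exactly one of the three sets, which are then passed on to the successor nodes.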
Decision Tree Methodology:
A decision tree partitions the observations (records, examples, or cases) of the data set into distinct groups (disjoint subsets), known as leaves, leaf nodes, or terminal nodes. Each leaf has a unique combination of ranges of the input variables. The root node is the first node of the tree and it contains all the observations of the data set. Starting at the root node the tree algorithm successively splits the data set into sub-regions or nodes. If a node cannot be partitioned further it becomes a terminal node (leaf, or leaf node). The process of partitioning proceeds in the following way:
Let X1, X2, ..., X100 be the variables in the data set. The tree algorithm examines all candidate splits of the form Xi ≤ C, where C is a real number between the minimum and maximum value of Xi. All records which have Xi ≤ C go to the left node, and those records which have Xi > C go to the right node. The algorithm selects the best split on each variable and then selects the best of these. The process is repeated at each node. In order to determine which split is the best, one can use tests of impurity reduction or Pearson's chi-square test.
Impurity reduction:
If c is the split of node v into two child nodes a and b, and πa and πb are the proportions of records from node v going into nodes a and b, then the decrease in impurity is i(v) − πa i(a) − πb i(b), where i(v) is the impurity index of node v and i(a) and i(b) are the impurity indexes of nodes a and b. There are two measures of impurity: the Gini Index and Entropy.
Gini Impurity Index:
If p1 is the proportion of responders in a node, and p0 is the proportion of non-responders, the Gini Impurity Index of that node is defined as i(p) = 1 − p1² − p0². If two records are chosen at random (with replacement) from a node, the probability that both are responders is p1², while the probability that both are non-responders is p0², and the probability that they are either both responders or both non-responders is p1² + p0². Hence 1 − p1² − p0² can be interpreted as the probability that any two elements chosen at random (with replacement) are different. A pure node has a Gini Index of zero.
Entropy:
Entropy is another measure of the impurity of a node, and for binary targets it is defined as i(p) = − Σ (i = 0 to 1) pi log2(pi).
Pearson's chi-square Test:
To illustrate the chi-square test let us consider a simple example. Suppose node v is split into two child nodes a and b. Suppose there are 1000 individuals in node v. Of these suppose there are 100 responders and 900 non-responders. Suppose in node a there are 400 individuals and 60 responders, and in node b there are 600 individuals and 40 responders. One can construct a 2 × 2 contingency table with rows representing the child nodes a and b, and columns representing the target classes (1 for response and 0 for no response).
The chi-square test statistic can be calculated as χ² = Σ (O − E)² / E, where O is the observed frequency of the cell, and E is the expected frequency under the null hypothesis that the class proportions are the same in each row. In the root node the proportion of responders is 10%. Under the null hypothesis we expect 40 responders and 360 non-responders in node a, and 60 responders and 540 non-responders in node b. The observed frequencies are 60, 340, 40, and 560 respectively. The χ² statistic is computed using these expected and observed frequencies.
Logworth can be calculated from the associated p-value as −log10(p-value). Logworth increases as p decreases. If there are 10 candidate splits on an input, the one with the highest logworth is selected.
Stopping rules for limiting tree growth:
Starting at the root node, the algorithm splits each node further into child nodes or offspring nodes. Splitting a node involves examining all candidate splits on each input and selecting the best split on each, and picking the input that has the best of the selected splits. This process is repeated at each node. Any node which cannot be partitioned further becomes a leaf (terminal node, or leaf node).
One can stop tree growth (i.e., the process of partitioning the nodes into sub-nodes) by specifying a stopping rule. One stopping rule may be to specify that a node should not be split further if the chi-square statistics are not significant. The level of significance can be specified in the Decision Tree node in the Enterprise Miner™.
Tree growth can also be controlled by selecting an appropriate depth of the tree. The maximum depth can be specified in the Basic tab of the Decision Tree node.
Diagram 9: Decision Tree node: Basic Tab
Diagram 10: Decision Tree node: Advanced Tab
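The split measures above can be verified numerically. The sketch below computes the Gini Index, entropy, chi-square statistic, and logworth for the example split in the text (node v: 1000 cases with 100 responders; node a: 400 cases with 60 responders; node b: 600 cases with 40 responders); the use of `math.erfc` for the one-degree-of-freedom p-value is this sketch's choice, not something from the paper:

```python
import math

def gini(p1):
    """Gini impurity i(p) = 1 - p1^2 - p0^2 for a binary target."""
    p0 = 1.0 - p1
    return 1.0 - p1**2 - p0**2

def entropy(p1):
    """Entropy i(p) = -sum p_i log2(p_i) for a binary target."""
    p0 = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p1, p0) if p > 0)

# Observed cell counts (responders, non-responders) for child nodes a and b,
# and the expected counts under the null hypothesis of equal proportions.
observed = [60, 340, 40, 560]
expected = [40, 360, 60, 540]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A 2x2 table gives 1 degree of freedom, so the p-value of the chi-square
# statistic can be computed with the complementary error function.
p_value = math.erfc(math.sqrt(chi_square / 2.0))
logworth = -math.log10(p_value)          # higher logworth = better split

root_gini = gini(0.10)        # 1 - 0.01 - 0.81 = 0.18
root_entropy = entropy(0.10)  # about 0.469 bits
```

With these numbers the chi-square statistic is about 18.52, giving a very small p-value and a logworth near 4.8, so this split would rank highly among candidate splits.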
Response Model with Decision Tree
In this example the decision tree partitioned the input space into the following disjoint subsets or leaf nodes. These are also known as the terminal nodes or leaves of the tree.
Table II: The leaf nodes of the Response Model
1. EMP ≥ 3 and CREDIT < 152.5
   Response score = 43.8
2. MS = U and EMP < 3 and CREDIT < 152.5
   Response score = 17.2
3. MS = M and EMP < 3 and CREDIT < 152.5
   Response score = 11.5
4. 152.5 ≤ CREDIT < 297.5 and EMP ≥ 3
   Response score = 13.8
5. CREDIT ≥ 297.5 and EMP ≥ 3
   Response score = 3.9
6. MILEAGE ≥ 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3
   Response score = 3.5
7. MILEAGE ≥ 39,162 and CREDIT ≥ 285.5 and EMP < 3
   Response score = 16.7
8. AGE < 27.5 and MILEAGE < 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3
   Response score = 5.7
9. AGE ≥ 27.5 and MILEAGE < 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3
   Response score = 2.7
10. 285.5 ≤ CREDIT < 347.5 and MILEAGE < 39,162 and EMP < 3
   Response score = 1.4
11. CREDIT ≥ 347.5 and MILEAGE < 39,162 and EMP < 3
   Response score = 0.7
Note: In each node the response score is the same as the proportion of responders. These proportions are also referred to as the posterior probabilities.
In the first leaf node all cases have a value greater than or equal to 3 for the variable EMP. The proportion of responders in this node is 43.8%. In this example we assign a response score of 43.8 for this group.
The second leaf node has all cases with Marital Status (MS) unknown (U) and Credit index less than 152.5. The response rate in this group is 17.2%. In this case the response score is 17.2.
Risk Model with Regression Tree
The decision tree partitioned the input space into six disjoint subsets. For each subset a risk score is computed from the predicted mean claim frequency.
Table III: The leaf nodes of the Risk Model
1. CREDIT < 66.5
   Risk score = 0.259
2. AGE < 19.5 and 66.5 ≤ CREDIT < 293.5
   Risk score = 0.273
3. AGE < 45.5 and CREDIT ≥ 293.5
   Risk score = 0.089
4. AGE ≥ 45.5 and CREDIT ≥ 293.5
   Risk score = 0.05
5. AGE ≥ 19.5 and 66.5 ≤ CREDIT < 197.5
   Risk score = 0.155
6. AGE ≥ 19.5 and 197.5 ≤ CREDIT < 293.5
   Risk score = 0.094
Note: The risk score is the same as the calculated claim frequency in each node.
The target variable in the risk model is the frequency of claims per car-year. Since this is a continuous variable, the decision tree algorithm calculates the mean claim frequency for each leaf node. We call the expected claim frequency the risk score.
Score node:
The models are developed in the Decision Tree node and passed to the Score node. The Score node applies the model to the data set that we want to score. In the case of the response model, the Score node sends each case (record) to one of the 11 leaf nodes shown in Table II above, according to the ranges of the values of the input variables of the record. Accordingly, each record is assigned an expected probability equal to the posterior probability of response of the node.
Similarly, in the risk model we connected the Decision Tree node to the Score node. Again, based on the input values, each record is assigned to one of the six terminal nodes shown in Table III, and the estimated mean claim frequency of the node (calculated from the training data set) is assigned to the record. This mean is the predicted claim frequency for all the individuals in the node.
SAS Code node:
The SAS Code node enables you to further process the scored data set created in the Score node. Custom graphs and tables can be generated.
Insight node:
The Insight node is used to view the scored data set or any other data set imported from the predecessor node.
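The routing that the Score node performs for the risk model can be sketched directly from the Table III rules: each record is sent through the six terminal-node conditions and receives the node's mean claim frequency as its risk score. This is an illustrative re-expression of those rules in Python, not Enterprise Miner output:

```python
# Sketch of Score-node routing for the risk model: assign each record to
# one of the six Table III leaf nodes and return that node's risk score
# (the mean claim frequency calculated from the training data).

def risk_score(age, credit):
    """Return the predicted claim frequency for one record (Table III rules)."""
    if credit < 66.5:
        return 0.259                               # leaf 1
    if credit < 293.5:                             # 66.5 <= CREDIT < 293.5
        if age < 19.5:
            return 0.273                           # leaf 2
        return 0.155 if credit < 197.5 else 0.094  # leaves 5 and 6
    # CREDIT >= 293.5
    return 0.089 if age < 45.5 else 0.050          # leaves 3 and 4

# Example: a 30-year-old with credit index 250 lands in the leaf
# "AGE >= 19.5 and 197.5 <= CREDIT < 293.5", so the risk score is 0.094.
```

Because the six leaves form a complete partition of the (AGE, CREDIT) space, every record receives exactly one score; the response model's 11-leaf routing works the same way with its own rules.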
References:
(1) Kattamuri S. Sarma. 2001. Enterprise Miner™ for Forecasting. Paper 250-26, presented at SUGI 26, Long Beach, California, April 24, 2001.
(2) SAS Institute Inc. Getting Started with Enterprise Miner™ Software, Release 4.1. Cary, NC: SAS Institute Inc., 2000.
Kattamuri S. Sarma, Ph.D.
61 Hawthorne Street
White Plains, NY 10603
(914) 428-8733. Fax: (914) 428-4551.
Email: KSSarma@worldnet.att.net