Predictive Modeling in Automotive Direct Marketing: Tools,
Experiences and Open Issues
Wendy Gersten, Rüdiger Wirth, Dirk Arndt
DaimlerChrysler AG, Research & Technology
Data Mining Solutions, FT3/AD
PO BOX 2360
89013 Ulm Germany
{wendy.gersten, ruediger.wirth, dirk.arndt}@daimlerchrysler.com
ABSTRACT
Direct marketing is an increasingly popular application of data mining. In this paper we summarize some of our own experiences from various data mining application projects for direct marketing. We focus on a particular project environment and describe tools which address issues across the whole data mining process. These tools include a Quick Reference Guide for the standardization of the process and for user guidance, and a library of re-usable procedures in the commercial data mining tool Clementine. We report experiences with these tools and identify open issues requiring further research. In particular, we focus on evaluation measures for predictive models.

Categories and Subject Descriptors
H.2.8 Database Applications: Data Mining

Keywords
Clementine, CRISP-DM, Data Mining Process, Direct Marketing, Evaluation Measures.

1. INTRODUCTION
There are many opportunities for data mining in the automotive industry. DaimlerChrysler's Data Mining Solutions department works on different projects in fields like finance, manufacturing, quality assurance, and marketing. Here, we focus on direct marketing. Several authors [1, 10, 11] have described problems and solutions for data mining in direct marketing. In this paper we reflect on our own experience working in the automotive industry.

For various reasons, modeling the response to direct marketing actions in the automotive industry is more difficult than for mail order businesses. In this paper, we describe some results from projects with different departments within our company. We present tools we developed to support knowledge transfer and organizational learning.

In the next section, we introduce our application scenario with a focus on technical challenges and requirements for solutions. Then we describe tools we developed while working with several different marketing departments. First, we present a Quick Reference Guide for predictive modeling tasks in marketing applications. This guide standardizes the process and guides end users through data mining projects. The second set of tools consists of re-usable procedures in the commercial data mining tool Clementine. We present practical solutions to common data mining problems like unbalanced training data and the evaluation of models. For each of these tools we discuss our experience and outline open issues for further research.

2. APPLICATION SCENARIO
For the purpose of this paper we restrict ourselves to acquisition campaigns. The challenge is to select prospects from an address list who are likely to buy a Mercedes for the first time. We try to derive profiles of Mercedes customers and use these profiles to identify prospects. Although this sounds like a textbook data mining application, there are many complications.

One of the major challenges is the fact that the buying decision takes a long time and the relationship marketing action can contribute only a small part. This makes the assessment of its success and, consequently, the formulation of the proper prediction task difficult. Furthermore, the process is not as stable and predictable as one might expect. Typically, we cannot develop models of the customers' behavior based on previous examples of the same behavior because in many cases we simply lack the data.

A more fundamental problem is that it is not clear what behavior we want to model. Ideally, we want to increase the acquisition rate, not just the response rate. However, the acquisition rate is difficult to handle. It is not easy to measure, it is captured fairly late (up to 9 months after modeling), and the impact of the predictive model on it is difficult to assess.
Regardless of the success criterion, we face a technical problem during training. Usually, measures like accuracy or lift are used to evaluate models [3], implying that the higher the measure, the better the model. But this is not true: a model with a perfect lift is not necessarily the best one (see section 4.4.1).

In addition, there are many other factors, like the available data or the market situation, that make every predictive modeling project unique in a certain sense. This gets even worse if we want the process to be applicable in different countries with differences in language, culture, laws, and competitive situation.

Nevertheless, we aim for a standard approach and a set of tools which should be usable by marketing people with only basic data mining skills. Although the projects differ in details, there is a large set of common issues which come up in most of them. In order to support the construction of a common experience base, the solutions and the lessons learned need to be documented and made available throughout the organization.

There are two groups of people involved. First, there are marketing people who will do the address selection three to four times a year and must be able to perform this process reliably. They have little time and insufficient skills to experiment with different approaches. Therefore, they need guidance and easy-to-use software tools. Second, there is a data mining team which does the initial case studies, develops and maintains the process, trains the marketing people, and will later support the marketing people with more challenging application scenarios outside the standard procedure. The data mining team needs sophisticated data mining tools and a common framework for documentation and knowledge transfer.

To fulfill the needs of these two groups of people we developed various tools (see Figure 1). This toolbox is organized along the CRISP-DM data mining process model. We created a Quick Reference Guide which guides the marketing people through a data mining project. The guide contains specific advice for the various tasks of the process and points to appropriate tools, like questionnaires for problem understanding and check lists for the evaluation of data providers. Furthermore, it points to a stream library, i.e., executable and adaptable procedures in Clementine. Some of them are meant to be used by marketing people, others facilitate the work of data mining specialists. The experiences gained during the process are archived for later reuse in an experience base.

In the following sections we describe the Quick Reference Guide and the Clementine stream library. We specify our particular requirements, discuss the state of the art, outline our solutions, and identify open issues requiring new research.

Figure 1. Toolbox: Guides (Quick Reference Guide, questionnaires, advice book), Templates (project documentation, data source reports), a Stream Library (streams for data understanding, data preparation, modeling, evaluation, and scoring), and Experiences, organized along the CRISP-DM phases, steps, and tasks.

3. QUICK REFERENCE GUIDE
Basically, the Quick Reference Guide is a specification of the predictive modeling process with detailed advice for various tasks. As such it should ensure a basic level of quality, guide inexperienced users through the predictive modeling process, help them use the data mining tool, and support the evaluation and interpretation of results. In the remainder of this section, we explain how we developed this guide and our experience to date.

The Quick Reference Guide is based on CRISP-DM, a recently proposed standard data mining process model [4]. CRISP-DM breaks a data mining process down into phases and tasks. The phases are shown in Figure 2.
Figure 2. CRISP-DM Process Model (a cycle of the phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, centered on the data).
At the task level, CRISP-DM distinguishes between a generic and
a specific process model (see Figure 3). The generic process
model is supposed to cover any data mining project. As such it
provides an excellent framework and is useful for planning the
projects, for communication both within and outside the project
team, for documentation of results and experiences, and for
quality assurance. However, it is too abstract to describe
repeatable processes for the marketing people. Such processes
typically take place in a certain context and under specific
constraints. To cover this, CRISP-DM allows for instantiations of
the generic process model. The resulting specific process models
are meant to reflect the context, constraints and terminology of the
application. The Quick Reference Guide is a specific CRISP-DM
process model.
When developing the Quick Reference Guide, we ran into the following dilemma: On the one hand, it is obviously impossible - and, after all, not even desirable - to give a detailed and exhaustive process description. This is due to the complexity resulting from the many unknown factors and unforeseeable situations that will always remain. We experienced that even from campaign to campaign in the same country the circumstances may differ fundamentally. On the other hand, the opposite extreme, a very high-level description like the generic CRISP-DM process model, is not a solution either. While it covers the whole process and is useful for experienced people, it is not suitable for the kind of users one is confronted with when moving into normal business processes. The resulting process description should guide the user as much as possible but, at the same time, enable him to handle difficult, unexpected situations. See [12] for more experiences with using CRISP-DM in this application.
Figure 3. The four levels of the CRISP-DM methodology: Phases, Generic Tasks, Specialized Tasks, and Process Instances; the CRISP-DM process model (the upper, generic levels) is mapped onto a concrete CRISP-DM process (the instances).
A major challenge is to put a project management framework around this highly iterative, creative process with many parallel activities. There must be firm deadlines for a task to be completed, to ensure timely completion and proper usage of resources. But when is a task complete in a data mining project? This question cannot be answered easily. So far, we do not have a complete, satisfactory, and operational set of criteria in our application. One approach to defining key performance indicators is described in [6].
Figure 4. Specific Tasks from the Quick Reference Guide. Business Understanding: understand planned marketing action, inventory of resources, situation assessment, specify response modeling goals, initial project plan. Data Understanding: initial data collection report, import data into Clementine, data description, data quality verification, select working data. Data Preparation: select attributes and datasets, data cleaning, derive new attributes, integrate data sources, adjust modeling and scoring data. Modeling: develop initial modeling approach, review modeling approach, generate test design, set up modeling stream(s), assess first model results, fine-tune model parameters, final model assessment, review process plan. Evaluation: evaluate results, quality assurance, determine next steps. Deployment: plan scoring, plan monitoring and maintenance, apply predictive model, run campaign, evaluate outcome of campaign, produce final report, review project.
The starting point for the Quick Reference Guide was the Generic
CRISP-DM User Guide. We deleted tasks and activities that were
not relevant for our application, renamed tasks to make them more
concrete, and added a few tasks at various places. The additions,
however, were not due to omissions in the generic model. It was
more like splitting one abstract task into several more concrete
tasks. The main work was to generate check lists of specific
activities for each task. There, the generic check lists were an
excellent guideline. Figure 4 shows the specific tasks of the Quick
Reference Guide.
Apart from the specification of the process, the Quick Reference
Guide contains pointers to examples from case studies,
documented experiences, and to the library of executable
procedures described in the next section.
So far, the experience with the Quick Reference Guide was mostly
positive. We expected it to be useful for planning, documentation,
and quality assurance, and this turned out to be the case.
However, its use for communication both within and outside the
project was much more advantageous than we originally
anticipated. Presenting the project plan and status reports in terms
of the process model and, of course, the fact that we followed a
process, inspired a lot of confidence in users and sponsors. It also
facilitated status meetings because the process model provided a
clear reference and a common terminology.
In our first case studies, we encountered an unexpected difficulty
with the process model. Although we stated frequently that the
phases and tasks are not supposed to be strictly sequential, the
clear and obvious presentation of the process model inevitably
created this impression in decision makers. Despite our
arguments, we found ourselves forced to very tight deadlines,
which in the end led to sub-optimal solutions. On the other hand,
the process model gave us a structured documentation which
allowed us to justify how much effort was spent in which task.
This made it fairly easy to argue for more realistic resources in
later applications.
Based on this experience, we advise in the Quick Reference Guide to plan explicitly for iterations. As a rough guide, we suggest planning for three iterations, where each iteration is allocated half the
time of the preceding one. It is also stressed that it is never the
case that a phase is completely done before the subsequent phase
starts. The relation between phases is such that a phase cannot
start before the previous one has started. From this, it follows that
the distinction between phases is not sharp. But this is a rather
academic question. In practice, all we need is a sensible grouping
of tasks.
4. CLEMENTINE STREAM LIBRARY
4.1 Overview
As mentioned above, we carry out different projects in the field of marketing. However, they always have some business-related problems as well as repeatable tasks in common. Additionally, we go through several loops between single tasks within one project, as shown in Figure 2. Therefore, we want to automate parts of the data mining process. The basic idea is to develop tailored software procedures supporting specific tasks of the process. These procedures should be modular, easy to handle, and simple to modify. Although they mostly assist within single tasks, they must be able to interact automatically in order to avoid human errors.

For our applications, we chose the data mining tool Clementine from SPSS, mainly because of its breadth of techniques, its process support, and its scripting facilities. Using a visual programming interface, Clementine offers rich facilities for exploring and manipulating data. It also contains several modeling techniques and offers standard graphics for visualization. The single operations are represented by nodes which are linked on a workspace to form a data flow, a so-called stream.

Clementine is geared towards interactive, explorative data mining. A scripting language allows the automation of repetitive tasks, such as repeated experiments with different parameter settings. Therefore, the tool serves our need to automate the data mining process very well. Nevertheless, there are also disadvantages: some useful functionality, like gain charts or various basic statistics, is laborious to realize.

We built a library of streams supporting the process as described in the Quick Reference Guide. The library contains three kinds of streams:

Sample streams: These streams show marketing people how to implement important procedures in Clementine. They illustrate basic techniques on the basis of tasks common in direct marketing projects.

Automation streams: These streams help to automate the data mining process. They are used to repeat one task very often under different conditions, for example, to train a model with changing parameter settings. They can be used by both marketing people and data mining specialists.

Functionality streams: Some functionality is used in most of our projects (e.g., score calculation) but is not implemented in Clementine from the outset. So we added these functions for easy re-use.

The first two CRISP-DM phases supported are data understanding and data preparation. Since these phases are highly explorative, we mostly use sample and functionality streams. These phases are also the most time-consuming ones. There are many traps one can avoid using the systematic discovery procedures provided through the stream library. In our projects we found two main kinds of errors in data.

Logical errors concern the content and meaning of data. Typically, they are caused by false documentation, wrong encoding, or human misunderstanding. For instance, in one project we used an attribute (independent variable) named DealerDistance. This attribute was supposed to measure the distance from a buyer's home to the closest car dealer. First models included rules like "If DealerDistance is larger than 120 miles then the person is a car buyer." These rules contradicted the business knowledge. Closer inspection revealed that, for buyers, DealerDistance contained the distance between their home and the dealer who sold the car, which is not necessarily the nearest one.

Technical errors pertain to the form of data. They are usually caused by faulty technical processing like wrong data formats, incorrect merging of data sources, or false SQL statements. For instance, we once received data from two sources, the customer database and an external provider, both with the same file structure. The modeling algorithm generated rules like "If Wealth-Poverty-Factor is larger than 4.99 and smaller than 5.01 then the person is a car owner." Looking for an explanation, we detected that the export from the data sources was inconsistent: Wealth-Poverty-Factor contained real numbers when exported from the customer database and integers otherwise.

Such errors are not easy to identify. On the one hand, one needs strong business knowledge to detect logical errors like the one described above. On the other hand, one needs to be familiar with data mining tools and techniques. Therefore, data mining specialists and marketing people need to work closely together during data understanding. Examples like the ones above are contained in our toolbox.
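As an illustration of the kind of technical check that catches inconsistencies like the Wealth-Poverty-Factor example, the following Python sketch (using the pandas library) compares the shared attributes of two exports. It is not one of our Clementine streams, and the file names are hypothetical:

import pandas as pd

# Hypothetical exports with the same file structure but possibly inconsistent encodings.
customers = pd.read_csv("customer_db_export.csv")
external = pd.read_csv("external_provider_export.csv")

for col in customers.columns.intersection(external.columns):
    a, b = customers[col].dropna(), external[col].dropna()
    if a.dtype != b.dtype:
        print(f"{col}: type {a.dtype} in one source, {b.dtype} in the other")
    if pd.api.types.is_numeric_dtype(a) and pd.api.types.is_numeric_dtype(b):
        # Flags attributes that are integer-valued in one export and real-valued in the other.
        if bool((a == a.round()).all()) != bool((b == b.round()).all()):
            print(f"{col}: integer-valued in one source, real-valued in the other")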
The examples also illustrate the usefulness of interpretable models. Typically, we build a few models early in the process because they are good indicators for data errors. When we are fairly confident about the correctness of our data, we use automation streams for mass modeling, a procedure described below. Each automation stream can load defined files automatically and save the results for the next step in the process.
Figure 5. Numerical Model Evaluation Stream
Figure 5 shows an example of an automation stream used for numerical model evaluation. The star-like icons are so-called "Super Nodes" and represent procedures from our library (in this case, numerical quality measures for modeling; see section 4.4.3).

In the following sections, we take a closer look at the process from the modeling to the deployment phase. We pick some of the most important issues for predictive modeling in direct marketing. Each issue is introduced first; afterwards, we discuss approaches, technical solutions, experiences, and the need for further research.

4.2 Creation of training and test subsets
Before modeling, we have to define a test design. The most common approach is to split the data into training and test sets. The training set is used to generate the models and the test set serves to measure their quality on unseen data. An alternative approach is to use resampling techniques, especially for small data sets. Here we only discuss the former approach because its lower computational requirements make it more efficient for mass modeling (see section 4.3).

The first decision is to choose the sizes of the sets. In the best case, we have enough data to create sufficiently large training and test sets. In practice, we are often limited somehow. In the case of small data sets we do not have a general suggestion for how large the test set should be. It depends on the total number of records available as well as on the number of records with a positive target variable. A large training set results in more stable models, but there have to be enough records in the test set to evaluate the model quality. Normally, we reserve at least 20% of the sample for testing. The remaining data are used to create various training sets.

A major challenge in our marketing applications is the highly unbalanced distribution of the target variable. Typically, we have an overbalance of records with a negative target variable (i.e., non-buyers) [5, 10]. This makes it difficult to build good models that find characteristics of the minority class. One way to approach this problem is to create more or less balanced training sets. According to our experience, a ratio between 35:65 and 65:35 of buyers to non-buyers is most effective [10]. But it differs between projects and can change dramatically under some circumstances. Of course, the distribution in the test set should be representative for the universe and is not changed.

There are two ways to change the balance within the training set. We can either reduce the negative records or boost the positive ones [8, 9]. This will affect the modeling results because we vary the amount and content of information available for learning. To date, we have not yet sufficiently explored the impact of the two alternatives.

The corresponding stream in our library is designed to handle these issues flexibly. At first, it allows us to choose the percentage of data to be randomly assigned to the test set. Implicitly, we assume that the sample is representative for the universe. This operation is done only once, and the same test set serves to evaluate all models. The remaining data is used for generating multiple training sets, where the stream automatically balances the data according to different properties. With this stream, we set only some parameters and generate multiple training sets with different sizes, ratios, and boosted or reduced records. All the subsets are named systematically and saved for modeling by another stream from the library.
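The stream itself is Clementine-specific, but its logic can be sketched in a few lines of Python with pandas. The sketch below holds out a test set and then creates training sets with buyer-to-non-buyer ratios between 35:65 and 65:35, either by reducing the negatives or by boosting the positives through sampling with replacement; the file and column names are illustrative assumptions, not the names used in our streams:

import pandas as pd

data = pd.read_csv("universe.csv")               # hypothetical input: one record per prospect
test = data.sample(frac=0.20, random_state=1)    # representative hold-out test set, drawn once
rest = data.drop(test.index)
pos = rest[rest["buyer"] == 1]                   # positive target: buyers
neg = rest[rest["buyer"] == 0]                   # negative target: non-buyers

training_sets = {}
for pos_share in (0.35, 0.50, 0.65):             # buyer shares from 35:65 to 65:35
    # Variant 1: reduce the negatives until the desired ratio is reached.
    n_neg = int(len(pos) * (1 - pos_share) / pos_share)
    training_sets[f"reduced_{round(pos_share * 100)}"] = pd.concat(
        [pos, neg.sample(n=min(n_neg, len(neg)), random_state=1)])
    # Variant 2: boost the positives by sampling with replacement.
    n_pos = int(len(neg) * pos_share / (1 - pos_share))
    training_sets[f"boosted_{round(pos_share * 100)}"] = pd.concat(
        [pos.sample(n=n_pos, replace=True, random_state=1), neg])

for name, frame in training_sets.items():        # named systematically for the modeling streams
    frame.sample(frac=1, random_state=1).to_csv(f"train_{name}.csv", index=False)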
In our projects, we noticed that different training sets strongly affect the model quality. But we have not yet discovered a direct correlation between properties of the training sets and the algorithms or their parameter settings. Since we cannot narrow down the most promising training sets early in the process, we have to handle a large number of possible sets while modeling. During model evaluation the situation gets even worse, as we will discuss. Here, we need systematic tests in order to develop approaches for fine-tuning training sets with regard to properties of the input data and the modeling algorithms (including their various parameter settings) before the learning itself begins.

4.3 Mass Modeling
Today there are lots of techniques and software implementations to build effective predictors. The question is not only to choose between different techniques (like decision trees, rule sets, neural networks, regressions, etc.) but also to choose between several algorithms (e.g., CART, CHAID, C5.0, etc.). One also has to consider the parameter settings of the algorithms and the various training sets as discussed before. It is well known that there is no generally best technique or algorithm. The choice depends on many factors like input data, target variable, costs, or the output needed. To date we have no practicable approach to select even the most promising combinations early in the process.

Fine-tuning just one modeling parameter after another makes it really hard to consider all cross influences. Besides, the basic functionality of some algorithms is not published in sufficient detail. Then it is difficult to judge the real influence of some parameters. For these reasons we think it is necessary to experiment with different combinations and, consequently, generate a whole palette of possible models.

For tuning the parameters, we need to automate the modeling process, especially if we also take the various training sets into consideration. For this purpose, we built several streams to generate large numbers of predictive models. Each of the streams deals with one technique and generates models with a whole range of sensible parameter combinations.

Clementine offers several decision trees, rule sets, neural networks, and linear regression as techniques to build predictors. The streams can be run independently of each other. All of them load different training sets, train the algorithm with changing parameter combinations on each set, and save the generated models. This way hundreds or thousands of models are generated in multifarious facets, taking all reasonable possibilities into consideration. We call this procedure mass modeling. Because of the automation we save time (computers work at night), avoid human mistakes, and secure best-practice model quality.
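In our library this loop is realized with Clementine scripts. Purely as an illustration of the idea, the following Python sketch (using scikit-learn, which is not the tool used in our projects) trains one model per training set, algorithm, and parameter combination, and saves every result for later evaluation; the algorithms and parameter values are placeholders:

import glob
import pickle
import pandas as pd
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# One parameter grid per technique; the concrete values are placeholders only.
grids = {
    DecisionTreeClassifier: ParameterGrid({"max_depth": [3, 5, 8],
                                           "min_samples_leaf": [10, 50]}),
    MLPClassifier: ParameterGrid({"hidden_layer_sizes": [(5,), (10,)],
                                  "max_iter": [500]}),
}

for path in glob.glob("train_*.csv"):            # the systematically named training sets
    train = pd.read_csv(path)
    X, y = train.drop(columns="buyer"), train["buyer"]   # assumes numeric input attributes
    for algorithm, grid in grids.items():
        for params in grid:
            model = algorithm(**params).fit(X, y)        # one model per set/algorithm/parameters
            name = f"{path}|{algorithm.__name__}|{params}"
            with open(f"model_{abs(hash(name))}.pkl", "wb") as handle:
                pickle.dump((name, model), handle)       # saved for the evaluation stream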
Mass modeling is a pragmatic approach yielding good results in real-life projects. But there are also some practical problems. Dealing with such a large number of models places high requirements on hardware and software and demands careful preparation. As we will see later, it is also very hard to compare that many models and pick the optimal one reliably. To improve the process, we need to develop procedures to focus only on the most promising algorithm and parameter setting combinations.
Combinations that do not fit the overall project goal, the input data, or other framework conditions should be excluded safely and as early as possible.

4.4 Evaluation
After the models are generated through mass modeling, they must be evaluated.

4.4.1 Requirements of measures
[7] contains a thorough discussion of requirements and properties of evaluation measures. In our projects we found the following characteristics to be practically relevant for model quality:

Predictive accuracy: How accurate is the prediction of the model? The predictive accuracy is high if the model assigns a score of 1 to objects belonging to class 1 and a score of 0 to objects belonging to class 0.

Discriminatory power: This characteristic indicates the model's ability to assign objects clearly to one of the two classes, even if it assigns them to the wrong class. Discriminatory power is high if the model assigns mostly high or low scores to the records, but rarely medium scores. Assigning mostly medium scores indicates a lack of discriminating information or mistakes in the data preparation phase.

Stability: Stability indicates how models vary when generated from different training sets. It has three facets: overall stability, stability per object, and stability of the output form. Overall stability means that models with identical parameter settings generated on different training sets produce similar results. Stability per object requires that the score for every object is similar for models generated on different training sets. By stability of the output form we mean that the form of the models is similar, e.g., different rule sets contain similar rules.

Plausibility: Plausibility of models means that they do not contradict business knowledge and other expectations. It requires the possibility to interpret the results and is necessary for two reasons: first, it increases the acceptance of predictive modeling results, and second, it facilitates individual and organizational learning by recognizing mistakes and inconsistencies (e.g., logical and technical errors). As such, it is a sort of quality assurance. Plausibility can also be enhanced by univariate analyses of important attributes and of segments with high scores.

Accuracy and discriminatory power are often considered to be the most important objectives of a good model. Although both properties are important, they are not sufficient. In our setting, there are many non-customers who resemble our customers but, for various reasons, have not yet got around to buying our products. Nevertheless, they are worth contacting to stimulate a purchase. Therefore, models which perfectly separate customers and non-customers are not sensible.

Now the challenge consists in finding one or several measures to operationalize these characteristics. In our toolbox, we use four measures to evaluate the quality of the generated models. After describing their functionality, we discuss their weaknesses and strengths.

4.4.2 Approaches to measure model quality
The histogram of scores shows the distribution of all records according to their score, overlaid with their real target value (Mercedes buyers or non-buyers). The diagram gives evidence of how strongly the model assigns records to one of the two classes (see Figure 7 in section 4.5). In case of many records with medium scores, the model's discriminating power is very low. However, the histogram tells us nothing about stability and plausibility.

The cumulative gain chart relates the number of objects with a real positive target value, e.g., Mercedes buyers, to all objects in the database (buyers and non-buyers). The objects are sorted in descending order of their score. Ideally, all true positive values should be situated at the left of the graph (see Figure 6).

Figure 6. Cumulative Gain Chart: the number of objects with positive target value (Mercedes buyers) plotted against the percentage of all objects (Mercedes buyers and non-buyers); the gap between the curve for selection with the predictive model and the line for random selection is the advantage of the predictive model.
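The cumulative gain chart of Figure 6 can be computed directly from the scored test set. A minimal Python sketch, assuming a vector of model scores and a 0/1 target vector:

import numpy as np

def cumulative_gains(scores, targets):
    # Returns the x-axis (% of all objects contacted, best scores first)
    # and the y-axis (number of objects with positive target reached so far).
    scores, targets = np.asarray(scores), np.asarray(targets)
    order = np.argsort(-scores)                  # sort descendingly by model score
    hits = np.cumsum(targets[order])
    percent = 100.0 * np.arange(1, len(hits) + 1) / len(hits)
    return percent, hits

# The straight line from (0, 0) to (100, total number of buyers) corresponds
# to random selection; the area between the curves is the model's advantage.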
Taking only objects with a positive target value into account presupposes that predicting a non-buyer as a buyer causes no costs. This is not in concordance with reality. Furthermore, the measure says nothing about the discriminatory power, plausibility, and stability of the model.

Histograms of scores and cumulative gain charts give a good visual insight into model performance. However, it is not feasible to compare hundreds of models this way. We overcome this disadvantage using numerical indicators.

The quadratic predictive error gives evidence of the predictive accuracy over all records and is calculated as the sum of squared deviations between the real target value and the predicted target value [7]. By squaring the deviations, a high predictive error contributes disproportionately more to the resulting measure.
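Written out, the measure is simply the sum of squared differences between the 0/1 target value and the model score; the separate errors for the positive and negative cases mentioned below can be computed in the same way. A minimal sketch:

import numpy as np

def quadratic_predictive_error(targets, scores):
    targets, scores = np.asarray(targets, dtype=float), np.asarray(scores, dtype=float)
    return float(np.sum((targets - scores) ** 2))

def class_wise_errors(targets, scores):
    # Squared errors of the positive and the negative cases, reported separately.
    targets, scores = np.asarray(targets, dtype=float), np.asarray(scores, dtype=float)
    positive = float(np.sum((1.0 - scores[targets == 1]) ** 2))
    negative = float(np.sum((scores[targets == 0]) ** 2))
    return positive, negative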
A strength of this indicator is that false positive and false negative predictive errors are both taken into account and that deviations are registered exactly. But false positive and false negative predictive errors are not weighted differently, even if - in practice - the error
of classifying a Mercedes buyer as a non-buyer is worse than the other way round. Therefore, we also consider the squared errors of positive and negative cases separately. Furthermore, there is no benchmark for what constitutes a good or a bad predictive error.

The weighted lift [10] pursues the same goal as the lift curve. It indicates whether the model assigns high scores to real positive targets. But in contrast to the graphics, this measure weights true positive targets with a high score much more heavily than those with a low score. This has high practical relevance because mostly only records with a high score are selected for actions. The maximum value for the weighted lift is 100. A weighted lift of 50 (in the case of an infinite number of partitions) corresponds to random selection.
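The exact definition of the weighted lift is given in [10]. A reading that is consistent with the values quoted in this paper (a maximum of 100, a random value of 50 for infinitely many partitions and of 60 for five partitions) weights the positives found in the i-th of P score-ordered partitions with (P - i + 1)/P. The following Python sketch is only such an illustration, not our Clementine implementation:

import numpy as np

def weighted_lift(scores, targets, partitions=10):
    # Weighted lift in percent, following our reading of the lift measure in [10].
    scores, targets = np.asarray(scores), np.asarray(targets)
    order = np.argsort(-scores)                                  # best-scored records first
    buckets = np.array_split(targets[order], partitions)
    positives = np.array([b.sum() for b in buckets], dtype=float)
    weights = (partitions - np.arange(partitions)) / partitions  # 1, (P-1)/P, ..., 1/P
    return 100.0 * float(np.dot(weights, positives) / positives.sum())

Under this reading, a random score ordering yields an expected value of 100 (P + 1) / (2P), i.e., 60 for five partitions and 50 in the limit of infinitely many partitions, and the maximum of 100 is reached when all positive targets fall into the first partition.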
One problem is the fact that weighted lifts based on different numbers of partitions cannot be compared. The maximum lift of 100 can only be reached if all positive targets could theoretically fall into the first partition. For instance, if there are 20% positive targets in the set, there should not be more than 5 partitions. But the lower the number of partitions, the higher the random lift and the lower the advantage over random selection. In the case of 5 partitions, the random weighted lift is 60. Besides, the weighted lift tells us nothing about the discriminatory power of the model, its stability, or its plausibility. Furthermore, there is the open question of how many partitions to use. In our applications, we use as many partitions as possible, as long as the size of a partition is larger than the number of true targets, usually between 5 and 10. In summary, we judge the weighted lift as a useful heuristic, but not more.
4.4.3 Realization
We have presented a few measures for model quality. They are applied at different stages of the process. In the early phase, we build a few models and inspect them carefully with regard to potential data errors. For this purpose, we use rule sets and judge their plausibility. Accuracy and discriminatory power are assessed visually. Only when the obvious data errors are repaired do we start mass modeling.

Since it is infeasible to compare hundreds of models visually, we use the numerical model evaluation stream shown in Figure 5. There, we calculate the weighted lift, the quadratic predictive error and, in addition, the positive and negative quadratic predictive errors. The stream loads all models and calculates the measures automatically. The output is a sorted list of model names and corresponding indicators.
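Outside Clementine, the effect of this evaluation stream can be pictured as a loop over all saved models that scores the single hold-out test set, computes the indicators, and prints a ranked list. The Python fragment below is only an illustration and uses hypothetical file and column names:

import glob
import pickle
import numpy as np
import pandas as pd

test = pd.read_csv("test_set.csv")                   # the single hold-out test set
X, y = test.drop(columns="buyer"), test["buyer"].to_numpy()

ranking = []
for path in glob.glob("model_*.pkl"):
    with open(path, "rb") as handle:
        name, model = pickle.load(handle)
    scores = model.predict_proba(X)[:, 1]            # score = probability of being a buyer
    qpe = float(np.sum((y - scores) ** 2))           # weighted lift etc. computed analogously
    ranking.append((qpe, name))

for qpe, name in sorted(ranking):                    # sorted list of model names and indicators
    print(f"{qpe:12.1f}  {name}")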
This way, many models can be compared. How many models can be handled depends on the actual computer. The performance of Clementine Version 5.1 - the version we currently use - is mainly limited by physical memory. This means that the whole process must be broken down into steps so that Clementine can handle it efficiently.

Then, we pick only the best models for closer inspection with streams for the histogram of scores and the cumulative gain chart. Finally, we test the stability of the best few models using resampling (10-fold cross-validation) and, in case of decision trees and rule sets, we check the plausibility of the results by inspection and with univariate analyses of the most relevant attributes. Of course, the evaluation process is not always as linear as described here.

None of the presented measures captures accuracy, discriminatory power, stability, and plausibility at once. Only a combination of various numerical and graphical measures gives us a complete picture of the models' characteristics.
4.4.4 Summary of evaluation
Accuracy is well represented in its diverse facets by all measures. Discriminatory power cannot yet be calculated as an indicator; only the histogram of scores gives us an impression of the separability. But in order to compare hundreds of models easily, there is a strong need for a proper indicator of separability. Until now, we have not been able to measure stability and plausibility sufficiently exactly. Concerning the stability of models, we will try to measure the confidence and the variability of results and integrate these as confidence bounds into the weighted lift and the predictive error. This would probably allow us to decide whether it makes sense to try to improve a model.

So far, we have ignored the costs of misclassifications. Even if costs are relevant, there are high practical barriers to integrating them into model evaluation. The main problem is that they are mostly impractical to estimate reliably, and consequently their impact can only be guessed.

In general, we consider the evaluation of models in settings like ours to be a largely open and challenging research issue. As we mentioned in various places, there is no measure that tells us how well a predictive model will perform in the field.

4.5 Benefits
Before applying a promising model for the selection of addresses to be included in the direct marketing campaign, its benefit must be evaluated. This decides whether to deploy the model or to go back to a previous step. Furthermore, illustrating the benefit of using data mining enhances acceptance and the further distribution of data mining within the enterprise. But what is the benchmark for calculating the benefit of predictive modeling?

Comparing the model to random selection - as is usually done - is practically not very relevant. This would mean that no sensible selection of prospects is done at all. Typically, some intelligent selection according to common business sense is already performed in the enterprise. It is often based on human experience and two or three attributes (like high income, fewer than two children, and middle-aged). This intelligent selection should be chosen as the benchmark. The aim is to find out which objects are assigned to the same class by both the predictive model and the intelligent selection, and which objects are assigned to different classes. Mainly the latter must be examined.

We built a Clementine stream to compare the predictive model with such an intelligent selection. Figure 7 shows the results from one of our business projects. Instead of the 328 buyers correctly recognized by the intelligent selection, our predictive model assigned a high score to 491 buyers. The advantage of the predictive model over the intelligent selection increases further when only objects with a score higher than 0.8 are contacted.
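The comparison behind Figure 7 can be pictured as a cross-tabulation of the two selections on the test data. In the following sketch the "intelligent selection" is a hypothetical common-sense rule on income, number of children, and age, and the model-based selection uses the score threshold of 0.8 mentioned above; all column names and thresholds are assumptions for illustration only:

import pandas as pd

test = pd.read_csv("test_set.csv")                   # hypothetical scored test data

# A hypothetical common-sense rule standing in for the intelligent selection.
intelligent = ((test["income"] > 60000)
               & (test["children"] < 2)
               & test["age"].between(35, 55))
model_based = test["score"] > 0.8                    # contact only high-scored prospects

# Cross-tabulation of buyers and non-buyers by the two selections (cf. Figure 7).
print(pd.crosstab([intelligent, model_based], test["buyer"],
                  rownames=["intelligent selection", "high model score"],
                  colnames=["buyer"]))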
Figure 7. Benefit of Predictive Modeling compared to Intelligent Selection. On the same data set (519 buyers, 2646 non-buyers), the intelligent selection picks 328 buyers and 1273 non-buyers (leaving 191 buyers and 1373 non-buyers unselected), whereas the high-score group of the predictive model contains 491 buyers and 1022 non-buyers (low scores: 28 buyers and 1624 non-buyers).

At this stage, the plausibility of the results comes into play. Although neural nets produce comparable results in terms of the evaluation measures discussed above, we favor decision trees or rule sets. For them, it is relatively easy to demonstrate to business people why they perform better than the common-sense benchmark (besides helping us find errors). Let us illustrate this statement with an example from one of our applications. Usually, we achieve good results with C5.0 using boosting. Typically, the first classifier is very close to the common-sense selection, e.g., rules referring to specific income ranges or socio-demographic types, which are common selection criteria for manual scoring. The subsequent classifiers typically are specializations of these initial rules, e.g., socio-demographic types with additional conditions. This way the boosted model confirms the common-sense selection and refines it.

Although this issue is of immediate practical importance, Clementine does not sufficiently support the illustration of boosted rules, requiring rather tedious manual post-processing of the rules. There is certainly room for improvement.

5. CONCLUSIONS
A data mining process is a living process which will be changed by future experiences. Therefore, all documentation, process models, and software tools must be flexible and living as well. The Quick Reference Guide is an excellent framework for knowledge transfer, documentation, communication, and quality assurance. It structures the process while allowing for the necessary flexibility. Having a standardized process helps to realize economies of scale and scope, supports individual and organizational learning, and speeds up the learning curve.

When choosing a data mining tool, one has to consider the users of the tool. In our case, we chose Clementine because of its process support, its various techniques, and its ability to automate streams via scripting. It was mainly meant to be used by data mining experts: although Clementine is targeted at business end users, it requires quite skilled users. However, our stream library aims to reduce the skills necessary to use the tool and thus supports both data mining experts and marketing people.

The toolbox we use for our data mining applications in direct marketing is pragmatically useful but not yet complete. Currently, we are putting the toolbox and the experience documentation into a hypermedia experience base to make storage and retrieval more flexible and comfortable [2]. We are also developing additional tools and addressing the important open issues of evaluation. However, in this area we require a lot more basic research which also takes the business constraints into account.

Another blank area is the question of what potential is really hidden in the data. [11] describes an interesting heuristic for estimating campaign benefits. As the authors note, the estimation is valid only for applications where customer behavior is predicted using previous examples of exactly the same customer behavior. However, in our applications this assumption is rarely met, especially not for acquisition campaigns, where we rarely know more than simple socio-demographic facts. Estimating the expected lift is a promising approach, but the particular approach of [11] does not exactly fit our experiences and needs. More work is needed along these lines.

As we tried to illustrate in this paper, it is not clear what kind of behavior we need to model in automotive direct marketing. Ideally, we want to model both dialogue affinity and propensity to buy. But we usually cannot get hold of data containing the necessary information. Typically, we have to make approximations. What we then end up with is that we learn a particular behavior of one population and try to transfer it to a different population. It is an open issue how to evaluate models in this setting.

6. REFERENCES
[1] Bhattacharyya, S., Direct Marketing Response Models Using Genetic Algorithms, Proceedings of the 4th International Conference on Knowledge Discovery & Data Mining, pp. 144-148 (1998).
[2] Bartlmae, K., A KDD Experience Factory: Using Textual CBR for Reusing Lessons Learned, Proceedings of the 11th International Conference on Database and Expert Systems Applications (2000).
[3] Berry, M. J. A. and G. Linoff, Data Mining Techniques - For Marketing, Sales and Customer Support, New York et al. (1997).
[4] CRISP-DM: Cross Industry Standard Process Model for Data Mining, http://www.crisp-dm.org/home.html (2000).
[5] Fawcett, T. and F. Provost, Combining Data Mining and Machine Learning for Effective User Profiling, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 8-13 (1996).
[6] Gersten, W., Einbindung des Predictive Modeling-Prozesses in den Customer Relationship Management-Prozess: Formulierung von Key Performance Indicators zur Steuerung und Erfolgsmessung der Database Marketing-Aktionen eines Automobilproduzenten, Diploma Thesis, University of Dresden (1999).
[7] Hand, D. J., Construction and Assessment of Classification Rules, Chichester (1997).
[8] Kubat, M. and S. Matwin, Addressing the Curse of Unbalanced Training Sets, Proceedings of the 14th International Conference on Machine Learning (1997).
[9] Lewis, D. and J. Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning, Proceedings of the 11th International Conference on Machine Learning, pp. 148-156 (1994).
[10] Ling, C. X. and C. Li, Data Mining for Marketing: Problems and Solutions, Proceedings of the 4th International Conference on Knowledge Discovery & Data Mining, pp. 73-79 (1998).
[11] Piatetsky-Shapiro, G. and B. Masand, Estimating Campaign Benefits and Modeling Lift, Proceedings of the 5th International Conference on Knowledge Discovery & Data Mining, pp. 185-193 (1999).
[12] Wirth, R. and J. Hipp, CRISP-DM: Towards a Standard Process Model for Data Mining, Proceedings of the 4th International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39 (2000).