Cost-Sensitive Learning via Priority
Sampling to Improve the Return on
Marketing and CRM Investment
Geng Cui, Man Leung Wong, and Xiang Wan
Geng Cui is a professor of marketing and international business at Lingnan University,
Hong Kong. His research interests include quantitative models in marketing, consumer behavior, and marketing in China, and foreign direct investment strategies and
performance. His work has appeared in leading academic journals such as Management Science, Journal of International Business Studies, and Journal of International
Marketing, and in Journal of World Business.
Man Leung Wong is an associate professor of computing and decision sciences at
Lingnan University, Hong Kong. His research focuses on data mining and knowledge
discovery, machine learning, and evolutionary algorithms. His work has been published
in leading journals such as Management Science, IEEE Intelligent Systems, and Expert
Systems with Applications.
Xiang Wan is an assistant professor of computer science at the Hong Kong University of Science and Technology. His research on bioinformatics with an emphasis on
detection of genetic patterns in complex diseases using statistics and heuristic search
methodology has appeared in American Journal of Human Genetics, Nature Genetics,
Bioinformatics, among others.
Abstract: Because of the unbalanced class and skewed profit distribution in customer
purchase data, the unknown and variant costs of false negative errors are a common
problem for predicting the high-value customers in marketing operations. Incorporating
cost-sensitive learning into forecasting models can improve the return on investment
under resource constraint. This study proposes a cost-sensitive learning algorithm via
priority sampling that gives greater weight to the high-value customers. We apply the
method to three data sets and compare its performance with that of competing solutions. The results suggest that priority sampling compares favorably with the alternative
methods in augmenting profitability. The learning algorithm can be implemented in
decision support systems to assist marketing operations and to strengthen the strategic
competitiveness of organizations.
Key words and phrases: cost-sensitive learning, customer relationship management,
direct marketing, forecasting, priority sampling.
Journal of Management Information Systems / Summer 2012, Vol. 29, No. 1, pp. 335–367.
© 2012 M.E. Sharpe, Inc. All rights reserved. Permissions: www.copyright.com
ISSN 0742–1222 (print) / ISSN 1557–928X (online)
DOI: 10.2753/MIS0742-1222290110

For many binary classification problems in marketing and customer relationship
management (CRM), there is usually a severe unbalanced distribution of classes in
empirical data, that is, the small number of true positives (e.g., 5 percent buyers) versus
the majority of true negatives (95 percent nonbuyers). Moreover, false negative errors
(e.g., loss of subscription or membership fees) are often much more costly than false
positive errors (e.g., the cost of mailing or other ways of contacting customers). A
predictive model lacking sensitivity to the unequal costs of misclassification errors
often results in suboptimal performance in identifying the buyers and augmenting
profit. Other similar situations include (1) upgrading customers—how to provide sizable incentives to those customers who are the most likely to upgrade and contribute
greater profit, (2) modeling customer churn and retention—how to prevent the most
valuable customers from switching to a competitor, and (3) credit default—how to
identify customers who do not pay back their sizable loans [2]. Since customers’
purchase probability and their profit contribution are inherently difficult to predict,
differentiating high-profit customers from low-profit customers is critical for achieving better profit rankings of customers. While rebalanced data via undersampling or
oversampling may improve classification accuracy [3, 8], researchers have proposed
various cost-sensitive learning algorithms such as AdaCost and MetaCost by providing
a cost matrix to an estimator [10, 13]. Such features have been incorporated in popular
software such as IBM SPSS Modeler and SAS Enterprise Miner and discussed in books
on data mining [20]. However, these solutions do not apply to scenarios where the costs
of false negative errors are unknown and variant (i.e., the amount of purchase).
In addition to the between-class imbalance, the within-class imbalance also presents
a significant challenge for cost-sensitive learning in direct marketing forecasting [25].
A small number of customers typically account for a large portion of a company’s
profit or loss in a long-tail distribution. Because a limited marketing budget allows
for contacting only the preset percentage of the most valuable customers (i.e., the top
10 percent or 20 percent), decision makers must rely on forecasting models to select
from the vast list of customers those who are the most likely to respond to a marketing
offer and purchase the greatest amount. Thus, aside from classification accuracy, a
predictive model needs to maximize sales or profit in the top deciles of the
test data. While a number of studies have dealt with the unknown costs of false negatives via resampling [23, 26], treating the false positives and false negatives together
or using the total costs may place too much emphasis on the high-value customers
and lead to overfitting and suboptimal performance.
In the following sections, we first review the literature of direct marketing and
cost-sensitive learning and discuss the research problems dealing with the unknown
and variant costs of false negative errors. Second, we propose a two-step approach to
handle between-class imbalance and within-class imbalance separately to minimize
overfitting. Given sufficient data, we deal with the between-class imbalance problem with random down-sampling of the negative class. To tackle the within-class
imbalance and the unknown and variant costs of false negative errors, we propose a
cost-sensitive algorithm via priority sampling to generate a desired data distribution
that places greater weight on high-value customers. Ensemble learning is adopted to
improve the accuracy of parameter estimates. Third, we apply priority sampling to
three direct marketing and CRM data sets. The results suggest that priority sampling
compares favorably with the alternative methods in augmenting the profitability of
direct marketing operations. Moreover, priority sampling consistently renders superior
performance with data of various degrees of class imbalance and can be implemented in
other classification methods such as naive Bayes, thus providing a robust and general
solution to cost-sensitive learning given the unknown and variant costs of false negative
errors. Last, we explore the theoretical and managerial implications for improving the
return on marketing and CRM investment under resource constraint.
Literature Review
Forecasting Models for Direct Marketing and CRM
While the unbalanced class and cost distributions are common in many business
areas, cost-sensitive learning has been a unique and challenging problem in direct marketing research because of the nature of its operations and distribution of data. Many
CRM activities also use direct marketing channels such as direct mail and telephone
calls. Due to budget constraint, the primary objective of modeling consumer responses
to direct marketing is to identify those customers who are the most likely to respond.
Only the customers with the highest response probabilities, say, the top 20 percent,
will be contacted. Aside from the conventional RFM (recency, frequency, and monetary value) method, researchers can incorporate consumer demographic and psychographic variables and apply
more sophisticated statistical methods such as latent class analysis [3], beta-logistic
models [19], and tree-generating techniques such as CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection) [16].
Despite these improvements, the response rate of most direct marketing campaigns is
usually very low; for instance, around 5 percent for catalog mailings. Thus, improving
the response rate of direct marketing campaigns is a priority issue. Although various
statistical methods have been developed to improve the accuracy of classification, the
unbalanced distribution of classes may be problematic for statistical forecasting models
that focus on minimizing the overall misclassification errors. While such models may have a high overall accuracy of classification by identifying the majority true negatives (nonbuyers), they do not help in predicting the rare class of true
positives (buyers), which are of interest to decision makers (Figure 1). This is because
the small class of positive cases does not lend sufficient opportunities for a model to
learn the underlying structure of the data. To improve the classification accuracy, decision support systems have employed various machine learning methods that are less
subject to the problem of unbalanced class distribution, such as neural networks and
Bayesian networks [9, 29]. These methods often outperform conventional statistical
tools in terms of classification accuracy and managerial insight [15].
Even though these methods can potentially improve the classification accuracy
based on the predicted probabilities of response, they may not help to identify the
most valuable customers. This is because the existing methods suffer from one major
limitation, that is, the lack of sensitivity to the unequal costs of misclassification errors as they assume that false positives and false negative errors are equally costly,
which is true only in certain cases (second column in Table 1). In many cases (third
column in Table 1), false negative errors (loss of potential sales and profit, e.g., $30
in terms of subscription or membership fees) are often much more costly than false
positive errors (e.g., cost of mailing, which usually amounts to $1 per customer). These
two issues together, the unbalanced distribution of classes and the unequal costs of misclassification errors, highlight the cost-sensitivity problem in direct marketing and CRM
operations [8, 30].

Figure 1. The Confusion Matrix for Classifier Performance

|  | Class Positive (C+), e.g., 5% | Class Negative (C–), e.g., 95% |
| --- | --- | --- |
| Prediction Positive (R+) | True Positives (TP) (buyers correctly classified) | False Positives (FP) (nonbuyers misclassified) |
| Prediction Negative (R–) | False Negatives (FN) (buyers misclassified) | True Negatives (TN) (nonbuyers correctly classified) |
Cost-Sensitive Learning
In many real-life situations, the default assumption of equal misclassification costs
underlying most classification and pattern recognition techniques is not tenable. Cost-sensitive learning has been proposed to help make optimal decisions of customer
selection in terms of cost and benefit [12, 24, 26]. In general, decision support systems
have used several strategies to deal with the unequal costs of misclassification errors.
Given the problem of between-class imbalance, the standard industry practice in direct
marketing, referred to as “salting,” is to undersample the nonbuyers to provide a more
balanced class distribution of positive and negative cases [1, 3, 25]. When there are
not enough data in the positive class, one may oversample the positive cases using
strategies such as SMOTE (synthetic minority oversampling technique) [8]. With more balanced training data between the classes, the classification tool may have more opportunities to learn the model structures and
improve classification accuracy.
Alternatively, decision support systems may manipulate the output data from a
forecasting model by adjusting the threshold value. The relative performance of
competing models can be compared using the area of receiver operating characteristic
(ROC) curve [21]. For instance, when logistic regression is used for classification, one
could set the probability at 0.8 instead of 0.5 as the threshold value, which means that
a case could be labeled as positive or “1” if its predicted probability of purchase is
greater than 0.8. Using a different threshold value is suitable for simple classification
tasks, but it does not change the rankings of the predicted output. Thus, neither simple
resampling nor adjusting the threshold value specifically tackles the issue of unequal
costs of misclassification errors.
Table 1. Costs of Misclassification Errors in Data with Unbalanced Class Distribution

| Errors/costs | Equal and known costs | Unequal and known costs | Unequal but unknown costs of false negatives |
| --- | --- | --- | --- |
| False positives (large percentage) | Wrong answer in a test: –1 point/error | Membership/subscription: cost of mailing: $1.00/customer | Catalog direct marketing: cost of mailing or contact: $1.00/customer |
| False negatives (small percentage) | Missed right answer: –1 point/error | Lost membership or subscription: $30.00/customer | Loss of sales/profit: e.g., from $20.00 to $600.00, not known |
| Viable solutions | Uniform sampling, adjusting threshold values | Applying cost matrix or ratios, e.g., AdaCost, MetaCost, C4.5 | Expected cost approach, AdaC2, priority sampling |

Note: The more costly errors typically come from a small minority, such as the buyers and high-value customers.
Some researchers have recommended methods that are less sensitive to the unbalanced class distribution. Joint distribution models such as association rules, naive Bayes, and support
vector machines are less susceptible to the influence of outliers or unbalanced class
distribution. Since minimizing the total misclassification errors remains their focus, these joint distribution models do not directly address the cost-sensitivity problem to help achieve
better profit rankings of customers.
A viable solution is to incorporate the unequal and known costs of misclassification
errors in the training process, usually by providing a cost matrix or ratio in the learning algorithm (third column in Table 1). In this case, a cost matrix plays a key role in
guiding the training process. To date, researchers in statistical learning have developed
a number of cost-sensitive learning algorithms, including bagging [6] such as MetaCost [10] and boosting [14] such as AdaCost. Although both methods combine multiple
models using ensemble learning, bagging does so by generating replicated bootstrap
samples of the data and boosting does so by adjusting the weights of training data.
For cost-sensitive learning, both AdaCost and MetaCost can incorporate a cost matrix
to weight samples and reorder the output [10, 13]. To develop a known cost matrix
or ratio, one must determine the conditional risks first and sort the cases according to
the conditional risks (e.g., 9,500 false positives at $1 each versus 500 true positives
at $30). This approach can be very helpful when the exact costs of misclassification
errors are known and constant within each type of error (third column in Table 1). In
such cases, applying a cost matrix can help to improve the accuracy of classification
models and augment the sales or profitability of direct marketing [20, 24, 30].
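As an illustration of how a known cost matrix can be used to reorder a classifier's output, the following sketch applies the expected-cost (conditional risk) decision rule with a hypothetical 2 × 2 cost matrix built from the $1 mailing cost and $30 subscription loss of Table 1; the data, variable names, and classifier choice are ours, not the cited implementations of MetaCost or AdaCost.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cost matrix C[i, j]: cost of predicting class i when the actual class is j.
# A false negative (missed buyer) costs $30; a false positive (wasted mailing) costs $1.
COSTS = np.array([[0.0, 30.0],    # predict nonbuyer: correct rejection, false negative
                  [1.0,  0.0]])   # predict buyer: false positive, true positive

def cost_sensitive_labels(model, X):
    """Relabel cases by minimizing conditional risk rather than using the 0.5 threshold."""
    proba = model.predict_proba(X)          # columns: P(y = 0 | x), P(y = 1 | x)
    risk = proba @ COSTS.T                  # risk[:, i] = sum_j P(j | x) * C(i, j)
    return risk.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = (rng.random(1_000) < 0.05).astype(int)              # roughly 5 percent buyers
clf = LogisticRegression(max_iter=1000).fit(X, y)
labels = cost_sensitive_labels(clf, X)                  # flags a buyer once P(buyer | x) > 1/31
```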
When the Costs are Unknown and Variant
In many marketing applications, however, the costs of false negative errors are sometimes neither known to the decision makers nor uniform. Decision makers cannot
anticipate whether customers will respond to a promotion or how much they will
purchase (Figure 2 and fourth column in Table 1). When the costs of false negatives
are unknown, it is unrealistic to apply a cost matrix. Moreover, as shown in Figure 2,
the distribution of customer sales and profit data is often highly skewed with a very
long tail, indicating a concentration of profit among a small group of customers [18].
In empirical studies of profit forecasting, the skewed distribution of profit data creates
problems for identifying the small number of high-value customers whose profit may
amount to hundreds or thousands of dollars, whereas most buyers contribute a much
smaller amount (e.g., $10). This has led to an increasing emphasis on CRM, which
requires decision makers to focus on the high-value customers. Researchers have developed various models to maximize the profit of direct marketing [7, 19]. However,
the profit maximization approach to customer selection, which selects those customers
with an expected marginal profit, is not realistic in most direct marketing situations
that do not allow contacting customers beyond a preset percentage. Furthermore, the
profit maximization approach focuses on maximizing the total potential profit of a
direct marketing campaign, but cannot help achieve better profit rankings of customers
or target the most valuable customers at the top two deciles of the testing data.
Much work has been done to deal with the issues of class imbalance and unequal
and known errors [8, 30], but only a few researchers have addressed the issue of cost-sensitivity when
the exact costs of false negative errors are unknown [12]. One simple solution is to
use the expected cost (profit) to rank the customers [26]. In this case, the expected
profit is estimated using only the positive cases and a linear regression model. Then,
Heckman’s two-step procedure is used to correct the sample selection bias. The first
step is to estimate a classification model to generate conditional probabilities of purchase P(j = 1 | x). The second step is to estimate a linear regression model using the
training data containing only the positive cases with j (x) = 1, but also includes the
estimated probabilities in the model. Then the model with the learned parameters is
used to rank the testing data including both positive and negative cases.
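A minimal sketch of this expected-cost ranking, assuming logistic regression for the first step and ordinary least squares for the second; the synthetic data and variable names are ours, and the cited studies may differ in details.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 6))
respond = (rng.random(5_000) < 0.05).astype(int)                  # purchase indicator
profit = np.where(respond == 1, rng.gamma(2.0, 15.0, 5_000), 0.0)

# Step 1: classification model for the conditional purchase probability P(j = 1 | x).
step1 = LogisticRegression(max_iter=1000).fit(X, respond)
p_hat = step1.predict_proba(X)[:, 1]

# Step 2: linear profit model on the positive cases only, with the estimated
# probability added as a covariate to correct for the sample selection bias.
pos = respond == 1
step2 = LinearRegression().fit(np.column_stack([X[pos], p_hat[pos]]), profit[pos])

# The learned model then ranks all cases (positive and negative) by expected profit.
scores = step2.predict(np.column_stack([X, p_hat]))
ranking = np.argsort(-scores)
```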
An alternative approach is the cost-based resampling procedure, such as cost-proportionate sampling [27] and cost-sensitive algorithms such as AdaCost, including AdaC1,
AdaC2, and AdaC3 [23]. Cost-proportionate resampling applies the rejection sampling
strategy and draws a sample with probability c / Z, where c is the case-dependent
cost and Z is a constant (usually the maximum of all costs). As a result, this method
usually generates a very small training data set (by a factor of about N / Z, where N
is the number of cases in the training data) but still achieves a bounded classification
error [27]. However, it puts all the cases (both positive and negative) into sampling
and uses the total costs (sum of all costs of false negatives and false positives) in the
population to draw samples. Because the costs of false positives are very small and
uniform, this approach may overemphasize the positive cases with very high values
and result in overfitting.
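A minimal sketch of cost-proportionate rejection sampling as described above, with each case kept with probability c / Z and Z taken as the maximum cost; the synthetic data and names are ours.

```python
import numpy as np

def rejection_sample(X, y, costs, seed=0):
    """Cost-proportionate rejection sampling: keep each case with probability cost / Z."""
    rng = np.random.default_rng(seed)
    Z = costs.max()                               # Z is usually the maximum of all costs
    keep = rng.random(len(costs)) < costs / Z
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 4))
y = (rng.random(10_000) < 0.05).astype(int)
costs = np.where(y == 1, rng.gamma(2.0, 15.0, 10_000), 1.0)   # variable profit for buyers, $1 contact cost otherwise
X_s, y_s = rejection_sample(X, y, costs)          # a much smaller sample dominated by high-cost cases
```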
Figure 2. The Skewed Long-Tailed Distribution of Customer Profit Data
Sun et al. [23] proposed a number of cost-sensitive algorithms—AdaC1, AdaC2,
and AdaC3—which are extensions to AdaBoost [14]. The AdaBoost algorithm is a
successful ensemble learning approach for improving classification accuracy by applying a sample weighting strategy. The weights of all the training examples are first initialized equally. A baseline model such
as logistic regression is performed on the training examples to generate a component
model. A weight-updating parameter αt is implemented to improve the accuracy of
the component model. Then, the training examples are divided into true positives, true
negatives, false positives, and false negatives. A weight-updating formula is used to
increase the weights of false positives and false negatives by an identical ratio. The
weights of all true positives and true negatives are decreased by another identical
ratio. Another component model is then learned from all training examples with the
modified weights. The above steps are repeated until a specific number of component
models are obtained. To classify an unseen case, the weighted average of the outputs
of the component models are produced using the αt values of different component
models as the weights.
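A minimal sketch of the AdaBoost weight-updating loop described above, using logistic regression as the baseline learner; this is an illustrative simplification, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaboost(X, y, T=25):
    """AdaBoost with logistic regression components; y takes values in {0, 1}."""
    w = np.full(len(y), 1.0 / len(y))                # initialize all example weights equally
    models, alphas = [], []
    for _ in range(T):
        h = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        eps = w[miss].sum()                          # weighted error of the component model
        if eps <= 0.0 or eps >= 0.5:                 # stop if the component is perfect or too weak
            break
        alpha = 0.5 * np.log((1 - eps) / eps)        # weight-updating parameter alpha_t
        w *= np.exp(np.where(miss, alpha, -alpha))   # raise weights of errors, lower weights of correct cases
        w /= w.sum()
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def boosted_proba(models, alphas, X):
    """Weighted average of the component outputs, using the alpha_t values as weights."""
    a = np.array(alphas)
    probs = np.array([m.predict_proba(X)[:, 1] for m in models])
    return (a[:, None] * probs).sum(axis=0) / a.sum()
```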
However, AdaBoost is also accuracy oriented and treats positive and negative examples equally. Thus, it ignores the profit/cost associated with the examples in the
weight-updating formula. To implement cost-sensitive learning for examples with
varying costs, AdaC1, AdaC2, and AdaC3 incorporate the profit/cost items in the
weight-updating formula in different ways. Essentially, the weights of false negatives
with high costs are increased significantly while the weights of false negatives with
small costs are increased slightly. The weights of true positives with high costs are decreased significantly while the weights of true positives with small costs are decreased
slightly. To learn another component model in following iterations, the algorithm can
then concentrate on those false negatives with high profit/cost values.
In summary, the expected cost approach [26] is rather simple and straightforward.
But it may result in poor forecast of consumer purchases as it only includes the positive cases in the training process. The cost-based resampling approach uses all the
data including both positive and negative cases, but existing methods put the greatest
weight on the most profitable cases and exclude most of the low profit and negative
cases [23, 27]. AdaCost includes both positive and negative cases, but it treats the false
positives and false negatives differently according to a given cost matrix and does not
consider the variance among the false negative errors. Thus, the existing approaches
treat the false positives and false negatives together or use the total costs in the resampling process, which may overemphasize
the high-value customers and result in overfitting and suboptimal performance of a
forecasting model. A reasonable approach to improve the sensitivity to the unknown
costs of false negative errors should consider the unbalanced distribution of classes
as well as a better representation of the skewed distribution of customer profit in the
positive class.
Cost-Sensitive Learning via Priority Sampling
Different from previous studies, we consider the unequal costs of misclassification
errors and the unknown costs of false negatives separately to avoid overemphasizing
the positive cases and to minimize the overfitting problem. First, given the unequal
costs of misclassification errors, a learning algorithm needs to be sensitive to the
more costly errors (i.e., false negatives) to improve its predictive accuracy. Given the
between-class imbalance problem and the unequal cost of misclassification errors, a
rebalanced class distribution is necessary to improve the sensitivity to the more costly
errors. In this case, researchers may undersample the negative cases and oversample
the positive cases. This is also true for cost-sensitive learning [11]. When there are not
enough positive cases, up-sampling or oversampling of the minority class can be used
to provide a rebalanced class distribution, for instance, using the SMOTE method [8].
Given sufficient size of the negative class, one can undersample the negative cases by
applying a 1:1 ratio to achieve a symmetric distribution of the two classes. This way
one avoids including a small number of negative cases or overemphasizing the positive
cases in the resampling process, especially the extremely high-cost customers, thus minimizing the overfitting problem. Moreover, since the costs of false positives are
much lower and already known and uniform, it is not necessary to apply differential
weights to them or include them in the resampling process.
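A minimal sketch of the 1:1 random down-sampling of the negative class described above, assuming the negative class is large enough; the names are ours.

```python
import numpy as np

def downsample_negatives(X, y, seed=0):
    """Keep all positive cases and an equal-sized random subset of negatives (a 1:1 class ratio)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)  # assumes enough negatives
    keep = np.concatenate([pos_idx, neg_keep])
    return X[keep], y[keep]
```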
Second, we address the problem of unknown costs of false negatives using a resampling strategy. Given the severe within-class imbalance, the model should place greater
emphasis on high-value customers. We adopt the cost-based weighted resampling approach to address the sensitivity to the false negative errors in the context of skewed
profit distribution [26]. Among the positive cases, we integrate the case-dependent cost
in the learning process and give priority to high-profit customers in the resampling
process. To make such an estimator cost-sensitive, a logical way is to generate a sample from a desired distribution that gives greater importance to the high-profit customers, rather than from the original distribution. In statistics, importance sampling using the normalized
importance weights is a general technique for estimating properties of a particular
distribution using samples generated from a different distribution.
Formally, let X be a random variable in S. Let p be a probability measure on S, and
f some function on S. Then one can formulate the expectation of f under p as
E[f(X) | p] = ∫ f(x)p(x)dx,

where E[·] denotes the expectation, x ranges over S, and dx indicates integration with respect to x. The essence of importance sampling is
to draw from a distribution other than p, say, q, to modify the above formula to get
a consistent estimate of E[f (X) | p]. This procedure helps to reduce the variance of
the estimate of E[f(X) | p] by an appropriate choice of q, as samples from q are more “important” for
the estimation of the integral.
Given the above definition, we have w(x) = p(x)/q(x), where w is known as the importance weight and the distribution q is usually referred to as the sampling or proposal
distribution. With random samples, normalized importance weights can be generated
according to q. Since this method is completely general, the above analysis can be
repeated when it represents a conditional distribution. Monte Carlo simulation has often
been used in importance sampling to determine the choice of the distribution q.
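A minimal numerical sketch of self-normalized importance weighting, estimating E[f(X) | p] from samples drawn from a proposal q; the Gaussian densities are chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = lambda x: x ** 2                                   # estimate E[X^2] under p

# Target p = N(0, 1); proposal q = N(1, 2) shifts probability mass toward a region of interest.
x = rng.normal(loc=1.0, scale=2.0, size=100_000)       # samples drawn from q, not p
w = norm.pdf(x, loc=0.0, scale=1.0) / norm.pdf(x, loc=1.0, scale=2.0)   # w(x) = p(x) / q(x)
w /= w.sum()                                           # normalized importance weights
estimate = np.sum(w * f(x))                            # close to 1, the true value of E[X^2] under p
```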
In the context of cost-sensitive learning, importance sampling focuses on finding a
biasing density function so that the variance of the estimator is less than the variance
of the general Monte Carlo estimate [27]. Although there are many kinds of biasing
methods, a simple and effective biasing technique is to employ the translation of the
density function of a random variable and place much of its probability mass in the
desired region of the rare cases, in this case, the high-value customers. This is consistent
with the purpose of cost-sensitive learning: re-weight the cases in the training data according to their importance to draw the desired samples that give greater
priority to high-value customers. The key to the success of any biasing or translating
function lies in the design of the weighting mechanism.
Priority Sampling
Following the concept of normalized importance weights, we translate customer profit
(or cost) into a probability distribution. Given a data set,
S(m, k) = {(s1, c1), (s2, c2), ..., (sm, cm), (n1, x), (n2, x), ..., (nk, x)},
with m positive cases and k negative cases, where each positive case si is associated
with profit ci and each negative case ni is associated with a constant cost x (such as
mailing cost). In direct marketing, m is much smaller than k, that is, m ≪ k. Uniform sampling draws samples using the probability function P(si) = 1/(m + k) and
P(ni ) = 1/(m + k), which treats each case equally. In priority sampling, we directly use
the profit for each positive case as the cost in the resampling process and assign higher
probabilities to cases with greater profit. We define the probability distribution function
Pd(si) as Pd(si) = ci / Σj cj. For negative cases, which are the majority of samples with
the cost x, we use the uniform sampling with the probability function P(ni ) = 1/k to
draw a subset for training. There are other alternatives for the probability distribution
function Pd (si ). Our preliminary experiments suggest that the linear weight function
described above is sufficient, whereas other transformations of the weight function
(e.g., exponentials or logarithms) may change the weights of cases but do not render
superior results.
In priority sampling, a random number r (si ) is drawn from a uniform distribution
for each positive case si. Next, it is compared with the probability Pd (si ) that gives
priority to the cases with higher profit. If r (si ) ≤ Pd (si ), si is selected for training. For
example, for seven customers with different profit amounts of 4, 3, 2, 2, 1, 1, and 1,
their probabilities Pd(si) of being selected would be approximately 0.29, 0.21, 0.14, 0.14, 0.07, 0.07,
and 0.07. If the generated random numbers r (si ), 1 ≤ i ≤ 7, are, respectively, 0.15, 0.3,
0.2, 0.25, 0.5, 0.4, and 0.01, the first and the last customers will be selected. On the
other hand, if the generated random numbers r (si ), 1 ≤ i ≤ 7, are 0.01, 0.2, 0.1, 0.4,
0.3, 0.35, and 0.1, respectively, the first three customers will be chosen. Thus, the cases
with higher profit have greater chances of being selected. Cases with lower profit but higher frequency also have the chance to appear across multiple runs of sampling.
Thus, the probability for a positive case to be included depends on its profit as well
as its frequency of appearance in the sample. In so doing, we achieve a normalized
distribution of profit among the customers in the training data set.
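A minimal sketch that reproduces the seven-customer example above under the linear weight function Pd(si) = ci / Σj cj:

```python
import numpy as np

profits = np.array([4, 3, 2, 2, 1, 1, 1], dtype=float)
p_d = profits / profits.sum()        # approximately [0.29, 0.21, 0.14, 0.14, 0.07, 0.07, 0.07]

def select_positives(p_d, r):
    """A positive case si is selected whenever its random draw r(si) <= Pd(si)."""
    return np.flatnonzero(r <= p_d)

print(select_positives(p_d, np.array([0.15, 0.3, 0.2, 0.25, 0.5, 0.4, 0.01])))  # [0 6]: the first and last customers
print(select_positives(p_d, np.array([0.01, 0.2, 0.1, 0.4, 0.3, 0.35, 0.1])))   # [0 1 2]: the first three customers
```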
In the meantime, depending on the number of positive cases from the priority sample,
a uniform sampling procedure is applied to draw the same number of cases from
the majority negative class. This way, the training data have a balanced distribution
of positive and negative cases. In essence, using the combination of undersampling
of negative cases and priority sampling of positive cases based on their profit, the
generated data should consist of the desired samples in terms of both response and
profit distribution. From the same original data, different runs of priority sampling
will produce different training samples, and the cases with higher profit will appear
more frequently than those with lower profit. Since such a sample is likely to be
only a small portion of all the training data, it may not lend an opportunity to arrive
at accurate estimates of the parameters. To solve this problem, we use an ensemble
learning approach inspired by bootstrap aggregating [6]. For a two-class classification problem with a training data set S of size n, bootstrap aggregating creates T bootstrap samples St, 1 ≤ t ≤ T, by sampling S uniformly with
replacement. The sizes of St are less than or equal to n. A baseline model such as
logistic regression is then performed on the T bootstrap samples to generate T component models that form the ensemble. To classify an unseen case, the outputs of the
T component models are combined by averaging the results from different models in
the ensemble. Suppose that there are five component models in the ensemble whose response probabilities for a new case are, respectively, 0.6, 0.7, 0.5, 0.6, and 0.7. The
average is (0.6 + 0.7 + 0.5 + 0.6 + 0.7)/5 = 0.62, thus the ensemble determines that
the customer has a 62 percent probability of responding. Both empirical and theoretical
evidence suggests that averaging leads to a decrease in prediction error and can
improve predictive accuracy [6].
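The averaging step can be stated in a single line; the five component probabilities are those of the example above.

```python
import numpy as np

component_probabilities = np.array([0.6, 0.7, 0.5, 0.6, 0.7])   # outputs of the five component models
ensemble_probability = component_probabilities.mean()            # 0.62, i.e., a 62 percent response probability
```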
The proposed priority sampling learning algorithm is included in Figure 3. To illustrate this algorithm, the first 10 positive and all the negative examples of the training
data set from the 1998 Knowledge Discovery and Data Mining competition [27] are
used. In Table 2, the positive examples and their profit values are shown. In Step 2
of the algorithm (Figure 3), the profit values of all positive training examples are
copied to an array W. Then the total profit, which is $90.2, is calculated and stored
in the variable TotalProfit in Step 3. Next, the values in W are normalized using the
total profit. For example, the profit of the first positive example is $3.32, thus W(1) is
normalized to $3.32/$90.2 = 0.0368. The normalized values of W can be found in the
third column of Table 2. The ensemble of models H is initialized in Step 5 and then
T component models are learned in the loop of Step 6.
To learn the T component models, m random numbers are first generated and
stored in the array R in Step 6(a). Second, the positive examples are examined one by
one to determine which one(s) should be selected and stored in S +, the set of selected
positive examples. For example, if R(1) is 0.2, the first positive example is not selected
because R(1) is larger than W(1). On the other hand, the second positive example is
selected if R(2) is 0.06 because R(2) is smaller than W(2). The loop of Step 6(d) is
executed m times until all positive examples have been considered. Suppose that
three positive examples are selected in Step 6(d), the value of NoSelectedPositive is
equal to 3 after completing Step 6(d). Third, negative examples are selected randomly
without replacement in Step 6(e) and the number of selected negative examples is
equal to NoSelectedPositive. Thus, three negative examples are selected in this case.
In Step 6(f), a model is learned from the selected positive and negative examples. In
this case, a logistic regression model is obtained from three positive and three negative examples. Finally, the induced model is stored in the ensemble H in Step 6(g).
The loop of Step 6 is repeated T times until T component models ht, 1 ≤ t ≤ T,
have been induced and stored in the ensemble H. In the last step, the ensemble H is
returned by the algorithm.
In summary, unlike the expected cost approach, priority sampling includes both
positive and negative cases but treats them separately. In contrast to other resampling
methods such as cost-proportionate sampling and AdaCost, our approach has a balanced ratio of positive and negative cases and alleviates the problem of overfitting.
Then, priority sampling is used to draw from the positive cases. Cases of greater profit
will appear more frequently in different samples, and vice versa. In so doing, priority
sampling gives greater weight to more costly false negative errors and can improve
the profit ranking of customers. In the following section, we test the benefits of priority sampling in comparison with the alternative methods using data sets of different
levels of imbalance in class and profit distribution.
PrioritySampling(S, m, k, T)
// S is the training set with m positive and k negative examples
// S = {(s1, c1), (s2, c2), ..., (sm, cm), (n1, x), (n2, x), ..., (nk, x)}
1. Define and initialize variables:
   S+,	// The set of selected positive examples
   S–,	// The set of selected negative examples
   W,	// An array of numbers
   R,	// An array of numbers
   H,	// The ensemble of models
   TotalProfit, NoSelectedPositive;
2. For i = 1 to m do
   W(i) := ci;	// Store the profit ci of si in W(i)
3. TotalProfit := sum of all values in W;
4. For i = 1 to m do
   W(i) := W(i)/TotalProfit;
5. H := ∅;
6. For t = 1 to T do
   (a) Generate m random numbers and store them in R;
   (b) S+ := ∅;	// Initialize S+
   (c) NoSelectedPositive := 0;
   (d) For i = 1 to m do
         If R(i) <= W(i) then
           S+ := S+ ∪ {si};	// Select the positive example si
           NoSelectedPositive := NoSelectedPositive + 1;
         EndIf
       EndFor
   (e) Randomly select NoSelectedPositive negative examples without replacement from S and store them in S–;
   (f) Apply a learning method such as logistic regression to learn a model ht from the selected examples S+ and S–;
   (g) H := H ∪ {ht};	// Add the learned model ht to the ensemble
   EndFor
7. Return H;
Figure 3. Learning Algorithm Based on Priority Ensemble Sampling
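A minimal Python sketch of the algorithm in Figure 3, assuming logistic regression as the baseline learner and an abundant negative class; this is an illustrative translation of the pseudocode, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def priority_sampling_ensemble(X_pos, profits, X_neg, T=25, seed=0):
    """Learn an ensemble of T component models via priority sampling (Figure 3)."""
    rng = np.random.default_rng(seed)
    w = profits / profits.sum()                       # Steps 2-4: normalized profit weights W(i)
    ensemble = []                                     # Step 5: H starts empty
    for _ in range(T):                                # Step 6
        r = rng.random(len(w))                        # (a) one random number per positive case
        pos_sel = np.flatnonzero(r <= w)              # (b)-(d) select si whenever R(i) <= W(i)
        if len(pos_sel) == 0:
            continue                                  # nothing selected in this run; try again
        neg_sel = rng.choice(len(X_neg), size=len(pos_sel), replace=False)   # (e) equal number of negatives
        X_t = np.vstack([X_pos[pos_sel], X_neg[neg_sel]])
        y_t = np.concatenate([np.ones(len(pos_sel)), np.zeros(len(neg_sel))])
        h_t = LogisticRegression(max_iter=1000).fit(X_t, y_t)                # (f) component model
        ensemble.append(h_t)                                                 # (g) add h_t to H
    return ensemble                                                          # Step 7

def ensemble_predict_proba(ensemble, X):
    """Classify unseen cases by averaging the component models' response probabilities."""
    return np.mean([h.predict_proba(X)[:, 1] for h in ensemble], axis=0)
```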
Study One
Data and Methods
We first conduct experiments with a direct marketing data set from a U.S.-based
catalog company. The company sells various product lines of merchandise, including
gifts, apparel, and consumer electronics. This data set contains the records of 106,284
consumers and their purchase history over a 12-year period. The most recent promotion
sent a catalog to every customer in this data set, and achieved a 5.4 percent response
rate with 5,740 buyers. Nine variables are selected using the forward selection criterion (p = 0.05): recency (i.e., the number of months lapsed since the last purchase),
frequency of purchase in the last 36 months, monetary value of purchases in the last
36 months, average order size, lifetime orders (number of orders placed), lifetime contacts (number of mailings sent), and whether a customer typically places telephone orders,
makes cash payments, or uses the “house” credit card from the catalog company.

Table 2. The First 10 Positive Examples of the Training Data Set of the 1998 Knowledge Discovery and Data Mining Competition

| Record number | Profit | W(i) |
| --- | --- | --- |
| 21 | $3.32 | 0.0368 |
| 31 | $6.32 | 0.0701 |
| 46 | $4.32 | 0.0479 |
| 79 | $12.32 | 0.1366 |
| 94 | $9.32 | 0.1033 |
| 102 | $4.32 | 0.0479 |
| 116 | $24.32 | 0.2696 |
| 127 | $9.32 | 0.1033 |
| 193 | $7.32 | 0.0812 |
| 204 | $9.32 | 0.1033 |
| Total profit | $90.2 |  |
Since every customer in this data set received a catalog from the company in the
current mailing, the data do not have the problem of sample selection. However, there
may be an endogeneity bias among the RFM variables, which are based on the previous purchases of consumers. For the endogeneity tests, we employ the asymptotic
t‑test developed by Smith and Blundell [22]. The significant results of the t‑tests reject
the null hypotheses of exogeneity for the RFM variables. To correct the endogeneity
bias, we adopt the “control function” approach developed by Blundell and Powell [5]
to address such problems. We first run a parametric reduced-form logit regression to
compute the estimates of endogenous RFM variables on the whole data set. In the
second stage, the residuals of the reduced-form regressors are included as covariates
in the binary response model.
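A minimal sketch of the two-stage control-function correction, assuming a linear reduced form for each endogenous RFM variable; the paper uses a parametric reduced-form regression, and the covariates and names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def control_function_logit(X_exog, X_endog, y):
    """Stage 1: reduced-form regressions of each endogenous RFM variable on the exogenous
    covariates; Stage 2: binary response model with the stage-1 residuals added as covariates."""
    residuals = np.column_stack([
        col - LinearRegression().fit(X_exog, col).predict(X_exog)
        for col in X_endog.T
    ])
    X_full = np.column_stack([X_exog, X_endog, residuals])
    return LogisticRegression(max_iter=1000).fit(X_full, y)
```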
Logistic regression, without considering the costs of misclassification errors, is not
cost-sensitive and serves as the baseline model for comparison with the cost-sensitive
methods. The logistic regression models use the data with (1) the original imbalanced class distribution and (2) a balanced class distribution obtained by down-sampling the negative
cases. Then, we compare the three cost-sensitive methods: (1) the expected cost
method [27], (2) AdaC2 [23], and (3) priority sampling with logistic regression. As the
expected cost method uses a linear regression model, only positive cases are used in
the training process, and the dependent variable is customer profit. For selection bias
correction using the Heckman procedure, predicted purchase probabilities from a logit
model are added to the linear regression model of customer profit. This approach can
be considered as incorporating the expected cost directly in the training process, while
the other two methods represent the resampling approach to cost-sensitive learning.
The latter two models use logistic regression as a binary classification tool, and the
dependent variable is whether a customer responds to a direct marketing promotion or
not. The AdaC2 and priority sampling methods are based on their respective sampling
procedures to improve their sensitivity to the costs of misclassification errors. To test
the applicability of priority sampling to other classification methods, we also test the naive Bayes approach
using both the original imbalanced class distribution and priority sampling.
Because decision makers usually have a fixed budget and can contact only a small
portion of the potential customers in their database (e.g., 10 percent or
20 percent), overall classification accuracy of models or simple error rates are not
meaningful as criteria for model evaluation or comparison. To support direct marketing decisions, maximizing the number of true positives at the top deciles, that is, the
cumulative lift, is usually the most important criterion for assessing the performance
of classifiers [4, 29]. Cumulative lift is the ratio of the number of true positives identified by a model to that of a random model at a specific decile of the file. The figure
is then multiplied by 100. Thus, a model with a response lift of 200 in the top decile
is said to be twice (200 percent) as good as a random model. Using cumulative lifts
across depths of file (usually 10 deciles) is helpful for comparing the performance of
different models. Since most direct marketing campaigns contact only the top 10 percent or 20 percent of the names in the customer database, researchers usually
compare the cumulative lifts of models at the top two deciles to select a better model.
The cumulative profit lift across the deciles is measured the same way, that is, the
ratio of realized profit by a model to that by a random model based on the testing
data. Lifted profit is the amount of “extra profit” in dollar amount generated by the
new model over that by a random method.
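A minimal sketch of computing cumulative response and profit lifts by decile from model scores on the test data; the variable names are ours.

```python
import numpy as np

def cumulative_lifts(scores, responded, profit):
    """Cumulative response and profit lifts (x100) at deciles 1-10 of the ranked file."""
    order = np.argsort(-scores)                  # rank customers from the highest score down
    resp, prof = responded[order], profit[order]
    n = len(scores)
    cuts = [int(np.ceil(n * d / 10)) for d in range(1, 11)]
    resp_lift = np.array([100 * resp[:c].mean() / responded.mean() for c in cuts])
    prof_lift = np.array([100 * prof[:c].mean() / profit.mean() for c in cuts])
    return resp_lift, prof_lift
```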
Because a single split or hold-out validation is inadequate, we assess the performance
of different models using stratified tenfold cross-validation, which has proven to be
sufficient to produce stable results and has become a popular method of cross-validation
for comparing model performance [17]. We follow the standard practice by splitting
the whole data set into 10 disjoint subsets using stratified random sampling and apply
stratification of the original cases so that each subset of the data has approximately the
same number of responders (574) and nonresponders (10,054). Then, we estimate and
validate a model 10 times, using each of the 10 subsets in turn as the testing data set
and all of the remaining 9 data sets combined as the training data set. The expected
cost approach uses only the positive cases in the training process but includes both
positive and negative cases for validation. Using the same tenfold cross-validation data
sets, AdaC2 and priority sampling further sample the data based on their respective
resampling procedures. For priority sampling, each ensemble contains 25 models; thus, 250 samples were used across the tenfold cross-validation to improve the robustness of the
procedures. However, all the methods have exactly the same 10 testing data sets, each
of which is 10 percent of the entire data. The results for each method are the average
of the tenfold cross-validation.
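A minimal sketch of the stratified tenfold cross-validation setup using scikit-learn's StratifiedKFold, with the baseline logistic regression standing in for any of the compared methods; X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def tenfold_scores(X, y, seed=42):
    """Train on nine folds and score the held-out fold, so every case is scored exactly once."""
    scores = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):   # stratification keeps the responder rate in each fold
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return scores
```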
Comparisons of the Results
First, we use priority sampling elaborated above to draw the training samples. Figure 4
clearly shows that the new sample has a smoother distribution, not as skewed as the
original one. Moreover, the new sample has a greater proportion of high-profit customers than the original data, but fewer of the extremely high-value customers, thus reducing the number of outliers that may lead to overfitting.

Figure 4. Priority Sampling Versus Uniform Sampling

The results of the tenfold cross-validation in Table 3 indicate that the baseline logistic regression models achieve the highest response lifts of 376.4 and 262.4 in the top two deciles using the original data
and 374.5 and 264.9 using the balanced data, followed by AdaC2 (375.0 and 263.1) and
priority sampling (364.9 and 273.5). The expected cost approach produces response
lifts of 187.2 and 182.0 at the top two deciles, which are significantly lower than those
of the other methods. This is not surprising as the expected cost approach uses only
positive cases in the training process. Thus, cost-sensitive learning does not improve
the classification accuracy or produce better probability rankings of customers. The naive Bayes
model also records lower response lifts (280.7 and 220) in the top two deciles, but
priority sampling does improve its predictive accuracy (333.8 and 243.5).
Table 4 includes the cumulative profit lifts of all the methods across the deciles.
On average, the baseline logistic regression model using original data provides profit
lifts of 589.9 and 364.8 in the top two deciles. The rebalanced data improves the performance of the logistic regression model (609.7 and 371.7). By comparison, AdaC2
renders lower profit lifts (593.6 and 367.7). The other two cost-sensitive methods
generate significantly higher profit lifts in the top two deciles. The expected cost
approach produces profit lifts of 620.9 and 385.9 in the top two deciles, despite its
poor performance in response lifts. The priority sampling method, in the meantime,
records the highest average profit lifts in the top two deciles: 622.9 and 388.3. The
average profit lift in the top decile is significantly higher than that of all competing methods except the expected cost approach (the p-values can be found in the Appendices).
Table 3. Average Cumulative Response Lift of Tenfold Cross-Validation for Study One

| Decile | Logistic regression (original) | Logistic regression (balanced) | Expected cost | AdaC2 | Priority sampling for logistic regression | Naive Bayes (original) | Priority sampling for naive Bayes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 376.4 (15.1) | 374.5 (21.6) | 187.2 (13.2) | 375.0 (17.2) | 364.9 (26.8) | 280.7 (19.0) | 333.8 (20.5) |
| 2 | 262.4 (8.5) | 264.9 (8.8) | 182.0 (6.8) | 263.1 (8.5) | 273.5 (11.1) | 220.0 (11.4) | 243.5 (11.6) |
| 3 | 217.2 (6.2) | 220.4 (6.6) | 165.8 (3.9) | 217.0 (6.4) | 220.4 (6.6) | 187.1 (6.9) | 204.0 (7.2) |
| 4 | 185.0 (3.5) | 185.6 (3.9) | 145.8 (3.5) | 184.2 (4.0) | 182.1 (3.6) | 162.5 (5.3) | 180.9 (4.9) |
| 5 | 161.5 (3.6) | 162.9 (3.8) | 130.8 (2.9) | 161.1 (3.4) | 156.7 (3.2) | 146.6 (3.1) | 158.0 (3.1) |
| 6 | 145.2 (2.0) | 145.5 (2.3) | 118.4 (2.6) | 144.8 (2.0) | 138.3 (3.0) | 134.8 (2.2) | 139.2 (3.0) |
| 7 | 130.2 (1.2) | 130.6 (1.2) | 107.9 (2.3) | 129.9 (1.5) | 123.4 (2.6) | 126.8 (1.0) | 122.9 (2.6) |
| 8 | 118.7 (1.1) | 118.9 (1.2) | 99.9 (1.7) | 118.6 (1.0) | 112.6 (1.8) | 117.7 (1.5) | 112.1 (1.7) |
| 9 | 108.7 (0.7) | 108.9 (0.8) | 93.4 (1.4) | 108.6 (0.8) | 104.4 (1.3) | 108.6 (0.5) | 104.9 (1.2) |
| 10 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Table 4. Average Cumulative Profit Lift of Tenfold Cross-Validation for Study One

| Decile | Logistic regression (original) | Logistic regression (balanced) | Expected cost | AdaC2 | Priority sampling for logistic regression | Naive Bayes (original) | Priority sampling for naive Bayes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 589.9 (33.0) | 609.7 (43.2) | 620.9 (51.4) | 593.6 (31.1) | 622.9 (49.3) | 478.2 (44.3) | 575.8 (37.5) |
| 2 | 364.8 (18.1) | 371.7 (18.4) | 385.9 (14.7) | 367.7 (18.0) | 388.3 (15.7) | 326.6 (22.0) | 359.2 (21.5) |
| 3 | 274.5 (11.8) | 279.4 (11.4) | 282.0 (9.3) | 275.3 (11.7) | 282.1 (10.1) | 251.1 (14.4) | 271.9 (11.5) |
| 4 | 221.3 (7.3) | 222.2 (6.8) | 221.3 (6.6) | 220.7 (7.4) | 220.7 (5.9) | 203.4 (11.4) | 222.0 (8.1) |
| 5 | 184.1 (5.7) | 186.3 (6.2) | 182.0 (5.6) | 184.0 (5.5) | 182.0 (5.6) | 173.5 (5.3) | 184.6 (5.5) |
| 6 | 159.3 (3.6) | 160.3 (3.2) | 155.7 (4.9) | 159.2 (3.7) | 155.8 (5.0) | 151.7 (4.6) | 156.6 (4.7) |
| 7 | 139.0 (3.0) | 139.2 (2.3) | 134.9 (4.5) | 138.5 (3.5) | 135.0 (4.3) | 137.2 (2.9) | 134.4 (4.1) |
| 8 | 123.3 (1.6) | 123.5 (1.5) | 120.3 (3.2) | 123.2 (1.3) | 120.6 (3.0) | 123.3 (2.3) | 120.2 (2.7) |
| 9 | 110.4 (1.2) | 110.6 (1.3) | 107.8 (2.7)* | 110.5 (1.3) | 108.2 (2.5) | 111.1 (0.8) | 108.4 (1.3) |
| 10 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Priority sampling also improves the profit lifts of the naive Bayes model (575.8 versus 478.2) in the top decile.
In Table 5, we further examine whether the improvement in the top decile lift produces significant increase in real profit. Logistic regression records relatively lower
lifted profit in the top two deciles ($7,184 and $7,770 using the original data and $7,476
and $7,975 using the balanced data). AdaC2 records lower lifted profit in the top two
deciles ($7,246 and $7,872). The lifted profits are $7,639.6 and $8,389.4 in the top
two deciles for the expected cost method. By comparison, priority sampling using
logistic regression provides the highest lifted profit in the top two deciles ($7,668.6
and $8,461.1). Specifically, it would generate $18,093 more profit than the logistic regression model using balanced data (1,000,000 / 10,628 × (7,668.6 − 7,476.3)) if a campaign had been mailed to 1 million customers. Overall, priority sampling compares
favorably with the competing methods in generating incremental profit.
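The extrapolation above can be reproduced directly; the two figures are the top-decile lifted profits from Table 5, and the small difference from $18,093 is rounding.

```python
fold_test_size = 10_628                       # one tenth of the 106,284 customers
top_decile_gain = 7_668.6 - 7_476.3           # priority sampling vs. balanced logistic regression (Table 5)
campaign_size = 1_000_000
print(campaign_size / fold_test_size * top_decile_gain)   # roughly 18,094; the text reports $18,093
```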
Study Two
Modeling Customer Churn
Direct marketing and CRM data are known to vary greatly in class distribution
and customer value, and the performance of cost-sensitive learning methods may
be subject to the influence of specific problems and data distributions. To assess
the merit of the proposed method, we also apply priority sampling to a well-known
CRM data set. Churn forecasting and customer retention are serious and challenging
problems in the telecommunications and other industries. The customer churn data
set from the 2003 Duke University data mining competition poses a very difficult classification task and has been used in studies on data mining and CRM. It contains the data from a U.S.-based wireless telecommunication
company for predicting customers who switch to a competitor so that the company
can implement some intervention to minimize customer churn. The training data set
“Current Score Data” has 171 predictor variables and the records of 51,306 customers with a true positive rate of 1.81 percent (percentage of churners). Whereas most
customers spend about $60 or less per month, a minority of high-value customers
spend on average around $200 a month, and a few customers’ monthly bills run up
to thousands of dollars. Thus, this data set is even more unbalanced in both class and
customer value distribution. Moreover, resource constraint is also a realistic issue as
companies typically devote a limited amount of manpower and man hours to contact
the small percentage of potential churners for intervention and retention (e.g., only
the top 10 percent of potential churners). The data set contains variables including
(1) customer usage data, (2) monthly spending, and (3) other customer history and
demographic variables. We first select 23 variables from the data set using the forward
method in logistic regression, including monthly spending, calling and roaming time,
customer history, and so on.
Table 5. Average Lifted Profit (Dollars) of Tenfold Cross-Validation for Study One

| Decile | Logistic regression (original) | Logistic regression (balanced) | Expected cost | AdaC2 | Priority sampling for logistic regression | Naive Bayes (original) | Priority sampling for naive Bayes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7,184.6 | 7,476.3 | 7,639.6 | 7,246.3 | 7,668.6 | 5,553.0 | 6,978.4 |
| 2 | 7,770.5 | 7,975.2 | 8,389.4 | 7,872.8 | 8,461.1 | 6,650.9 | 7,608.5 |
| 3 | 7,683.7 | 7,899.2 | 8,012.1 | 7,725.3 | 8,016.3 | 6,662.6 | 7,568.7 |
| 4 | 7,120.2 | 7,177.8 | 7,121.8 | 7,096.1 | 7,087.2 | 6,076.3 | 7,163.7 |
| 5 | 6,171.8 | 6,333.2 | 6,016.0 | 6,160.1 | 6,021.5 | 5,392.7 | 6,209.8 |
| 6 | 5,222.6 | 5,310.0 | 4,902.5 | 5,208.5 | 4,913.6 | 4,546.2 | 4,986.8 |
| 7 | 4,003.2 | 4,027.7 | 3,590.5 | 3,962.4 | 3,594.2 | 3,812.4 | 3,530.5 |
| 8 | 2,732.3 | 2,756.8 | 2,384.6 | 2,717.5 | 2,413.7 | 2,727.6 | 2,372.0 |
| 9 | 1,376.2 | 1,406.6 | 1,024.3 | 1,380.4 | 1,077.7 | 1,464.0 | 1,108.3 |
| 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Since the competition organizer provides the testing data set “Future Score Data” with 100,462 records, we adopt the train-and-test approach to
evaluate the performance of priority sampling and the alternative methods.
Results of Validation
In study two, we follow the same procedures of the respective models as in study one.
The results in Table 6 indicate that logistic regression with balanced data achieves
true positive rate (TPR) lifts of 143.3 in the first decile and 128.8 in the second decile,
below those using the original data (161.1 and 138.0). All other methods report lower
response lifts, with those of the expected cost method below that of a random model.
Despite its weaker performance in TPR lifts, logistic regression with rebalanced data delivers cumulative “profit” lifts of 287.2 and 215.1 in the top two deciles (Table 7).
The expected cost approach, however, does not do as well (251.4 and 207.7). AdaC2
performs relatively better, with profit lifts of 296.7 and 227.6 in the top two deciles. By
comparison, priority sampling renders the highest profit lifts in the top two deciles:
299.7 and 219.1 for the logistic regression model and 301.4 and 226.4 for the naive
Bayes model. In Table 8, we include the lifted profits of all the methods as compared
with a random model. Logistic regression using rebalanced data has a lifted profit of $17,469 in the top decile. By comparison, AdaC2 generates $18,355
extra profit in the top decile. Priority sampling records the highest lifted profit: $18,634
for logistic regression and $18,794 for the naive Bayes model. Thus, even with a data
set of very unbalanced class and profit distributions, priority sampling yields superior results to the competing methods (except for the second decile of AdaC2).
Study Three
Because the above two data sets are not commonly used, it is necessary to compare
the performance of priority sampling with other methods using a popular data set that
is available to the public and used in data mining competitions. The data set from
the 1998 Knowledge Discovery and Data Mining competition is used in many studies [26]. In this task, a charitable organization
sends out regular mailings to its target customers to solicit donations. The training
data contain 95,412 customer records while the validation data have 96,367 cases.
The most recent fund-raising campaign had a response rate of 5.1 percent. This data
set is particularly useful for testing cost-sensitive methods as the organization has
found an inverse correlation between likelihood to respond and the dollar amount of
the gift. Thus, a simple response model will most likely select only very low-dollar
donors. High-dollar donors may fall into the lower deciles and be excluded from future mailings. The lost revenue of these high-value donors may offset any gains due
to the increased response rate of the low-dollar donors. Thus, the organization needs
a forecasting model that can help maximize the net revenue generated from future
mailings.
Table 6. Average Response (TPR) Lift on the Test Data for Study Two

| Decile | Logistic regression (original) | Logistic regression (balanced) | Expected cost | AdaC2 | Priority sampling for logistic regression | Naive Bayes (original) | Priority sampling for naive Bayes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 161.1 | 143.3 | 71.5 | 117.3 | 135.0 | 126.7 | 128.3 |
| 2 | 138.0 | 128.8 | 92.1 | 114.5 | 120.8 | 115.5 | 115.5 |
| 3 | 133.9 | 120.5 | 95.4 | 108.4 | 119.0 | 109.6 | 111.8 |
| 4 | 129.3 | 116.0 | 98.9 | 107.2 | 111.5 | 105.9 | 109.4 |
| 5 | 121.6 | 115.2 | 98.2 | 108.1 | 111.3 | 105.1 | 109.2 |
| 6 | 115.4 | 109.9 | 98.1 | 105.0 | 108.5 | 103.2 | 106.5 |
| 7 | 111.1 | 107.4 | 97.1 | 103.8 | 106.8 | 103.2 | 105.5 |
| 8 | 107.5 | 106.2 | 94.9 | 101.7 | 104.8 | 103.3 | 103.6 |
| 9 | 104.3 | 102.5 | 93.9 | 101.0 | 102.3 | 101.3 | 102.3 |
| 10 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Cost-Sensitive Learning via Priority Sampling
355
7/2/2012 6:01:05 AM
Table 7. Average Profit Lift on the Test Data for Study Two

Decile    LR (original)    LR (balanced)    Expected cost    AdaC2    PS-LR    NB (original)    PS-NB
1         139.2            287.2            251.4            296.7    299.7    78.6             301.4
2         109.7            215.1            207.7            227.6    219.1    68.0             226.4
3         103.0            176.0            176.8            187.4    188.4    64.8             188.7
4         99.7             156.2            156.9            165.1    161.5    64.4             165.2
5         96.5             142.5            139.9            150.4    146.9    67.0             149.4
6         93.4             128.3            128.6            135.5    134.0    68.7             135.1
7         94.2             118.8            118.0            124.6    123.8    75.0             124.6
8         95.3             112.1            110.0            114.5    114.6    82.6             115.2
9         97.7             104.0            104.5            106.8    106.1    87.3             107.0
10        100.0            100.0            100.0            100.0    100.0    100.0            100.0

Note: Abbreviations as in Table 6.
Table 8. Average Lifted Profit (Dollars) on the Test Data for Study Two

Decile    LR (original)    LR (balanced)    Expected cost    AdaC2       PS-LR       NB (original)    PS-NB
1         3,654.0          17,469.5         14,129.7         18,355.9    18,634.5    –2,001.3         18,794.9
2         1,803.5          21,473.5         20,098.3         23,819.0    22,232.4    –5,966.8         23,586.9
3         849.8            21,267.0         21,491.5         24,456.0    24,746.7    –9,843.9         24,828.3
4         –101.1           20,977.1         21,240.3         24,299.6    22,957.0    –13,286.9        24,330.7
5         –1,632.9         19,832.8         18,630.4         23,528.8    21,884.2    –15,400.8        23,068.2
6         –3,666.9         15,824.1         15,989.3         19,890.4    19,015.5    –17,502.1        19,655.3
7         –3,813.3         12,301.6         11,748.7         16,083.2    15,568.2    –16,338.0        16,065.3
8         –3,478.2         9,042.7          7,442.4          10,835.1    10,919.9    –13,004.3        11,373.7
9         –1,964.5         3,394.7          3,774.3          5,711.9     5,134.8     –10,657.0        5,878.7
10        0.0              0.0              0.0              0.0         0.0         0.0              0.0

Note: Abbreviations as in Table 6.
Since validation data are available from the competition organizer, we adopt the
train-and-test approach to evaluate the performance of priority sampling and the
competing methods. First, we use the 44 independent variables suggested in the SAS
Enterprise Miner tutorial for model building.1 These variables and the two dependent
variables are summarized in Appendix C. Second, we repeat the same experiments
for all the methods following the same procedures used in study one.
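To make the train-and-test setup concrete, the following sketch fits a baseline logistic regression on the training file and scores the validation file. The file names kdd98_train.csv and kdd98_valid.csv are placeholders, the five predictors shown are merely a subset of the 44 Appendix C variables, and the simple fillna(0) imputation is our assumption, so this is a schematic of the procedure rather than the authors' exact preprocessing.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Placeholder file names; the actual training and validation files are those
# distributed with the 1998 KDD competition data (95,412 and 96,367 records).
train = pd.read_csv("kdd98_train.csv")
valid = pd.read_csv("kdd98_valid.csv")

# A few of the 44 Appendix C predictors, for illustration only.
predictors = ["LAST_GIFT_AMT", "RECENT_RESPONSE_COUNT", "RECENT_AVG_GIFT_AMT",
              "MONTHS_SINCE_LAST", "NUMBER_PROM_12"]

X_train, y_train = train[predictors].fillna(0), train["TARGET_B"]
X_valid = valid[predictors].fillna(0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Rank validation customers by predicted response probability; TARGET_D
# (the donation amount) supplies the profit used for the lift calculations.
valid["score"] = model.predict_proba(X_valid)[:, 1]
valid = valid.sort_values("score", ascending=False)
```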
Comparing Model Performance
The results in Table 9 indicate that the baseline logistic regression models achieve the
highest response lifts of 199.8 and 172.5 in the top two deciles using the original data
and 146.6 and 135.6 using the rebalanced data, followed by naive Bayes (163.7 and
153.4) and the priority sampling approach for logistic regression (125.5 and 120.6) and for
naive Bayes (116.8 and 117.0). The predictive accuracy of the expected cost approach
(86.1 and 101.6) and of the AdaC2 model (67.3 and 67.0) is equal to or worse than that of the
random model. According to Table 10, logistic regression using balanced data provides
profit lifts of 723.5 and 473.5 in the top two deciles, much higher than those using
the original unbalanced data (242.4 and 206.7). The expected cost method (618.3 and
416.5) and AdaC2 (544.4 and 312.9) provide lower profit lifts. By comparison, the
two models using priority sampling generate the highest profit lifts in the top two
deciles (882.4 and 571.6 for logistic regression and 756.2 and 493.9 for naive Bayes),
highlighting the approach's ability to minimize overfitting. The superior performance of the priority sampling approach is also reflected in the actual lifted profits of the two models,
which are much higher than those of the competing methods (Table 11). Overall, the
net revenue realized by priority sampling with logistic regression is higher than that of the
competition winner ($15,800 versus $14,712 at a mailing depth of 60 percent, or 58,287
customers) and that reported in another study ($15,329) [26].
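The comparison at a fixed mailing depth can be expressed as a small helper; the function name and arguments are ours, and the $0.68 unit mailing cost is the figure used in the 1998 KDD Cup task rather than a number stated in this section.

```python
def net_revenue_at_depth(scored, depth_frac=0.60, mail_cost=0.68,
                         score_col="score", donation_col="TARGET_D"):
    """Net revenue from mailing the top `depth_frac` of customers ranked by
    model score: total donations received minus the cost of the mail pieces.
    The 0.68 default is the per-piece cost used in the 1998 KDD Cup task."""
    ranked = scored.sort_values(score_col, ascending=False)
    n_mailed = int(round(depth_frac * len(ranked)))
    mailed = ranked.head(n_mailed)
    return mailed[donation_col].sum() - mail_cost * n_mailed

# e.g., net_revenue_at_depth(valid) with the scored validation frame above.
```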
Sensitivity Analyses
In real-life situations, the response rates of marketing campaigns may vary greatly.
Sometimes, the response rate may drop below 2 percent, for instance, in credit card
promotions. This may lead to data sets with different degrees of unbalanced class
distribution and unequal costs of misclassification errors. To validate the sensitivity
of priority sampling to data sets with different degrees of skewness in class and profit
distributions and its ability to improve cost-sensitivity while minimizing overfitting,
we conduct additional experiments on the same data using training sets with varying degrees
of unbalanced class distribution. For this sensitivity analysis, positive cases in the
training data are randomly deleted to generate training data sets with a more unbalanced class distribution and a more skewed profit distribution.2 The resulting ratios of positive
to negative cases in the training data range from 5 percent to 1 percent (Table 12). At
the lowest ratios, this leaves a very small number of positive cases. Then, we perform
train-and-test validation for all the methods and compare their profit lift in the top
decile using the testing data with the original class ratio.
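One way to generate the downsampled training sets is sketched below; the random seed, the use of pandas sampling, the placeholder file name, and the exact rounding of the target ratios are our choices, since the text does not spell out the deletion mechanics.

```python
import pandas as pd

def downsample_positives(train, target_ratio, label="TARGET_B", seed=0):
    """Randomly delete positive cases until they make up roughly
    `target_ratio` of the training data (e.g., 0.05 down to 0.01)."""
    pos = train[train[label] == 1]
    neg = train[train[label] == 0]
    # Number of positives needed so that pos / (pos + neg) ~= target_ratio.
    n_pos = int(round(target_ratio * len(neg) / (1.0 - target_ratio)))
    n_pos = min(n_pos, len(pos))
    kept = pos.sample(n=n_pos, random_state=seed)
    return pd.concat([kept, neg]).sample(frac=1.0, random_state=seed)

# Same placeholder training file as in the earlier sketch.
train = pd.read_csv("kdd98_train.csv")
# Training sets with 5, 4, 3, 2, and 1 percent positive cases.
sensitivity_sets = {r: downsample_positives(train, r / 100.0)
                    for r in (5, 4, 3, 2, 1)}
```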
Table 9. Average Response Lift on the Testing Data for Study Three

Decile    LR (original)    LR (balanced)    Expected cost    AdaC2    PS-LR    NB (original)    PS-NB
1         199.8            146.6            86.1             67.3     125.5    163.7            116.8
2         172.5            135.6            101.6            67.0     120.6    153.4            117.0
3         154.6            130.5            107.3            69.6     118.6    144.7            115.1
4         143.4            126.6            107.3            72.7     116.0    137.7            115.9
5         133.5            121.9            105.4            76.2     113.2    128.9            114.2
6         124.8            116.7            102.9            79.5     110.9    122.6            113.1
7         118.1            112.3            100.1            83.1     107.6    116.6            110.4
8         110.6            108.7            97.7             86.5     105.5    110.3            107.1
9         105.2            103.9            94.3             91.3     102.4    104.7            103.7
10        100.0            100.0            100.0            100.0    100.0    100.0            100.0

Note: Abbreviations as in Table 6.
Table 10. Average Profit Lift on the Testing Data for Study Three

Decile    LR (original)    LR (balanced)    Expected cost    AdaC2    PS-LR    NB (original)    PS-NB
1         242.4            723.5            618.3            544.4    882.4    103.4            756.2
2         206.7            473.5            416.5            312.9    571.6    124.2            493.9
3         172.0            370.0            327.6            244.7    435.0    125.5            361.0
4         160.9            306.1            279.9            200.2    352.3    124.4            310.6
5         148.6            256.7            234.3            174.0    288.3    102.2            264.2
6         133.6            209.7            197.8            158.9    246.6    102.7            230.3
7         126.3            177.9            167.3            146.3    204.7    96.9             192.7
8         101.1            152.2            147.2            129.8    172.4    81.0             162.8
9         92.9             119.4            120.4            113.6    133.3    71.3             131.9
10        100.0            100.0            100.0            100.0    100.0    100.0            100.0

Note: Abbreviations as in Table 6.
Table 11. Average Lifted Profit (Dollars) on the Testing Data for Study Three

Decile    LR (original)    LR (balanced)    Expected cost    AdaC2      PS-LR       NB (original)    PS-NB
1         1,504.0          6,584.4          5,473.3          4,699.7    8,262.0     35.4             6,929.9
2         2,253.1          7,889.4          6,684.7          4,501.6    9,959.4     511.2            8,318.8
3         2,280.8          8,553.4          7,209.4          4,590.2    10,614.3    809.3            8,268.0
4         2,574.3          8,705.6          7,598.1          4,238.1    10,655.6    1,031.4          8,896.6
5         2,563.6          8,275.3          7,090.3          3,914.5    9,944.8     114.7            8,667.3
6         2,127.4          6,951.5          6,196.4          3,735.9    9,287.0     172.4            8,256.5
7         1,946.0          5,755.0          4,976.5          3,428.9    7,739.2     –231.5           6,854.6
8         94.3             4,410.8          3,990.3          2,520.6    6,115.8     –1,605.7         5,308.9
9         –671.9           1,846.5          1,936.5          1,298.5    3,163.5     –2,727.9         3,031.0
10        0.0              0.0              0.0              0.0        0.0         0.0              0.0

Note: Abbreviations as in Table 6.
Table 12. Model Performance with Varying Degrees of Class Imbalance for Study Three on Testing Data

Percent            LR (balanced)            Expected cost            AdaC2                    PS-LR                    PS-NB
positive cases     Response    Profit       Response    Profit       Response    Profit       Response    Profit       Response    Profit
                   lift        lift         lift        lift         lift        lift         lift        lift         lift        lift
5                  146.6       723.5        86.1        618.3        67.3        544.4        125.5       882.4        116.8       756.2
4                  147.0       707.6        87.0        588.4        71.2        575.8        125.3       880.5        117.5       739.3
3                  140.7       652.6        83.4        575.0        74.9        602.2        120.9       813.4        116.2       727.2
2                  138.2       634.8        81.0        534.3        74.1        452.2        118.3       840.2        116.0       782.0
1                  128.1       582.0        74.3        535.9        74.9        577.0        111.9       777.0        108.2       750.6
Mean               140.12      660.1        82.36       570.38       72.48       550.32       120.38      838.7        114.94      751.06

Note: Abbreviations as in Table 6. Response lift and profit lift are for the top decile on the testing data.
Overall, logistic regression with rebalanced data has the highest average response
lift in the top decile (mean = 140), while the expected cost approach and AdaC2 are
much worse than the random model (mean = 82 and mean = 72, respectively). In terms
of profit lift, logistic regression using balanced data (mean = 660) also performs better
than these two methods (mean = 570 and mean = 550, respectively), which may suffer
from overfitting. On average, priority sampling records the highest top decile profit
lift (839), which is 52 percent higher than AdaC2, 47 percent higher than the expected
cost method, and 27 percent higher than the logistic regression model. The superior
performance of priority sampling is consistent across data sets with different degrees
of class imbalance, even when the ratio of positive cases is very low (i.e.,
1 percent and 2 percent). The naive Bayes model using priority sampling also records
significantly higher profit lift than the competing methods, and its advantage is even
greater as the class distribution becomes more unbalanced. These results indicate
that priority sampling consistently renders superior performance in providing better
rankings of customer value and in augmenting profitability across data sets of various
degrees of unbalanced class distribution.
Conclusions
The combination of undersampling of negative cases and priority sampling of positive
cases provides a viable and superior solution to the cost-sensitive learning problem
associated with highly unbalanced class distribution and the unknown and skewed
costs of false negative errors. It helps to improve the accuracy of predicting the small
number of high-value customers while minimizing the overfitting problem. Moreover,
priority sampling using the profit quantity as a probability distribution is intuitively
appealing and efficient to implement. Ensemble learning helps to reduce the variation among the subsamples and improve the accuracy of parameter estimates. The
results of validation suggest that priority sampling consistently achieves significant
improvement in augmenting profit over the alternative methods. Its advantages are
apparent even with data of highly unbalanced class distribution and apply to other
classification tools such as naive Bayes. Overall, priority sampling offers a robust and
general solution to cost-sensitive learning with unknown and variant costs of false
negative errors.
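As a concrete, non-authoritative illustration of the approach summarized above, the sketch below combines undersampling of negatives with profit-proportional (priority) sampling of positives inside a bagging-style ensemble of logistic regressions. The subsample sizes, sampling with replacement, the number of ensemble members, the default column names TARGET_B and TARGET_D, and the averaging of predicted probabilities are our assumptions, not the authors' exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def priority_sampling_ensemble(train, predictors, label="TARGET_B",
                               profit="TARGET_D", n_models=25, seed=0):
    """Train an ensemble on balanced subsamples: negatives are undersampled
    uniformly, positives are drawn with probability proportional to their
    profit, so high-value customers appear in more subsamples."""
    rng = np.random.default_rng(seed)
    pos = train[train[label] == 1]
    neg = train[train[label] == 0]
    # Profit-proportional sampling weights over the positive cases.
    weights = pos[profit].clip(lower=0).to_numpy(dtype=float)
    weights = weights / weights.sum()
    models = []
    for _ in range(n_models):
        pos_idx = rng.choice(len(pos), size=len(pos), replace=True, p=weights)
        neg_idx = rng.choice(len(neg), size=len(pos), replace=False)
        sample = pd.concat([pos.iloc[pos_idx], neg.iloc[neg_idx]])
        m = LogisticRegression(max_iter=1000)
        m.fit(sample[predictors].fillna(0), sample[label])
        models.append(m)
    return models

def ensemble_scores(models, X):
    """Average predicted response probabilities over the ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Under these assumptions, scoring a validation frame with ensemble_scores(models, valid[predictors].fillna(0)) would yield profit-sensitive rankings analogous to those evaluated in the lift tables above.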
While the issue of cost-sensitivity arises from the discussion of misclassification
errors, cost-sensitive learning in this case, ironic as it may seem, does not necessarily
improve the classification accuracy or the true positive rate. This is not surprising
because the goal of cost-sensitive learning is not to improve classification accuracy,
but to produce better rankings of customer value. This is consistent with the ultimate
goal of augmenting the return on marketing investment with a budget constraint.
Meanwhile, the findings indicate that although the simple classification accuracy of
a response model can be improved, a model lacking cost-sensitivity results in lower
profit when the class distribution is highly unbalanced and the profit distribution is
highly skewed. Thus, it is essential that decision makers adopt appropriate cost-sensitive
solutions in such cases.
The results also reveal that cost-sensitive methods are themselves “sensitive” to the
distribution of classes and costs. The performance of the competing models differs
depending on how they treat the unbalanced distribution of classes and costs in the
training process. By comparison, priority sampling provides a viable solution to cost-sensitive learning when cost-based learning is not feasible due to the unknown costs
of false negatives. The sensitivity analysis in study three indicates that the benefits
of priority sampling are evident for data sets with more unbalanced class distribution.
Priority sampling consistently achieves significant improvement over the competing
methods in augmenting profitability without overfitting the predictive model. Superior
cost-sensitive models allow for contacting fewer customers yet still achieving higher
returns on investment, thus enabling a more cost-effective use of precious marketing resources. This approach can also help alleviate customer
fatigue, a common problem in direct marketing operations.
Notes
1. The training and testing data sets with the variables can be found at http://cptra.ln.edu.hk/~mlwong/JMIS2012.
2. The training data sets can be found at http://cptra.ln.edu.hk/~mlwong/JMIS2012.
References
1. Baesens, B.; Viaene, S.; Van den Poel, D.; Vanthienen, J.; and Dedene, G. Bayesian
neural network learning for repeat purchase modelling in direct marketing. European Journal
of Operational Research, 138, 1 (2002), 191–211.
2. Bansal, G.; Sinha, A.P.; and Zhao, H. Tuning data mining methods for cost-sensitive regression: A study in loan charge-off forecasting. Journal of Management Information Systems,
25, 3 (Winter 2008–9), 315–336.
3. Berger, P., and Magliozzi, T. The effect of sample size and proportion of buyers in the
sample on the performance of list segmentation equations generated by regression analysis.
Journal of Direct Marketing, 6, 1 (1992), 13–22.
4. Bhattacharya, S. Direct marketing performance modeling using genetic algorithms.
INFORMS Journal on Computing, 11, 3 (1999), 248–257.
5. Blundell, R.W., and Powell, J.L. Endogeneity in semiparametric binary response models.
Review of Economic Studies, 71, 7 (2004), 655–679.
6. Breiman, L. Bagging predictors. Machine Learning, 24, 2 (1996), 123–140.
7. Bult, J.R., and Wansbeek, T. Optimal selection for direct mail. Marketing Science, 14,
4 (1995), 378–394.
8. Chawla, N.; Bowyer, K.; Hall, L.; and Kegelmeyer, W. SMOTE: Synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research, 16 (2002), 341–378.
9. Cui, G.; Wong, M.L.; and Lui, H.-K. Machine learning for direct marketing response
models: Bayesian networks with evolutionary programming. Management Science, 52, 4
(2006), 597–612.
10. Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In
Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. New York: ACM Press, 1999, pp. 155–164.
11. Drummond, C., and Holte, R.C. C4.5, class imbalance, and cost sensitivity: Why under-sampling
beats over-sampling. Paper presented at the Workshop on Learning from Imbalanced Datasets II,
Twentieth International Conference on Machine Learning (ICML), 2003.
12. Elkan, C. The foundations of cost-sensitive learning. In B. Nebel (ed.), Proceedings of the
Seventeenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan
Kaufmann, 2001, pp. 973–978.
13. Fan, W.; Stolfo, S.J.; Zhang, J.; and Chan, P.K. AdaCost: Misclassification cost-sensitive
boosting. In Proceedings of the Sixteenth International Conference on Machine Learning.
Bled, Slovenia, 1999, pp. 97–105.
14. Freund, Y., and Schapire, R.E. A decision-theoretic generalization of on-line learning and
an application to boosting. Journal of Computer and System Sciences, 55, 1 (1997), 119–139.
15. Hastie, T.; Tibshirani, R.; and Friedman, J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2d ed. New York: Springer-Verlag, 2008.
16. Haughton, D., and Oulabi, S. Direct marketing modeling with CART and CHAID. Journal
of Direct Marketing, 11, 4 (1997), 42–52.
17. Mitchell, T. Machine Learning. New York: McGraw-Hill. 1997.
18. Mulhern, F.J. Customer profitability analysis: Measurement, concentration, and research
directions. Journal of Interactive Marketing, 13, 1 (1999), 25–40.
19. Rao, V.R., and Steckel, J.H. Selecting, evaluating, and updating prospects in direct mail
marketing. Journal of Direct Marketing, 9, 2 (1995), 20–31.
20. Shmueli, G.; Patel, N.R.; and Bruce, P.C. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLminer. New York: John
Wiley & Sons. 2007.
21. Sinha, A.P., and May, J.H. Evaluating and tuning predictive data mining models using
receiver operating characteristic curves. Journal of Management Information Systems, 21, 3
(Winter 2004–5), 249–280.
22. Smith, R.J., and Blundell, R.W. An exogeneity test for a simultaneous equation Tobit model
with an application to labor supply. Econometrica, 54 (May 1986), 679–685.
23. Sun, Y.; Kamel, M.S.; Wong, A.K.C.; and Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40, 12 (2007), 3358–3378.
24. Viaene, S., and Dedene, G. Cost-sensitive learning and decision making revisited. European Journal of Operational Research, 166, 1 (2005), 212–220.
25. Weiss, G.M. Mining with rarity: A unifying framework. ACM SIGKDD Explorations
Newsletter, 6, 1 (2004), 7–19.
26. Zadrozny, B., and Elkan, C. Learning and making decisions when costs and probabilities
are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. New York: ACM Press, 2001, pp. 204–213.
27. Zadrozny, B.; Langford, J.; and Abe, N. Cost-sensitive learning by cost-proportionate
example weighting. In Proceedings of the Third IEEE International Conference on Data Mining.
Los Alamitos, CA: IEEE Computer Society, 2003, pp. 435–442.
28. Zahavi, J., and Levin, N. Applying neural computing to target marketing. Journal of
Direct Marketing, 11, 4 (1997), 76–93.
29. Zhao, H.; Sinha, A.P.; and Bansal, G. An extended tuning method for cost-sensitive
regression and forecasting. Decision Support Systems, 51, 3 (2011), 372–383.
Appendix A: p-Values of Comparing Priority Sampling for Logistic
Regression with Other Methods for Study One
Decile    LR (original)    LR (balanced)    Expected cost    AdaC2
1         0.010954         0.00372          0.345592         0.012995
2         6.15E-05         0.381544         0.060218         0.000134
3         0.009198         0.238272         0.441884         0.016545
4         0.365295         0.009497         0.29392          0.489565
5         0.145305         0.032888         0.382724         0.146169
6         0.064375         0.156633         0.137458         0.071175
7         0.029054         0.037921         0.387875         0.048697
8         0.012017         0.338579         0.17795          0.008799
9         0.023987         0.336274         0.017021         0.021667
10        NA               NA               NA               NA

Note: LR = logistic regression; NA = not applicable.
Appendix B: p-Values of Comparing Priority Sampling with Naive
Bayes for Study One
Decile    p-value
1         4.37E-06
2         2.06E-06
3         2.81E-05
4         1.87E-05
5         7.19E-06
6         0.012385
7         0.063732
8         0.028763
9         0.000293
10        NA

Note: NA = not applicable.
Appendix C: Variables Used for Study Three
Variable name                      Meaning
MONTHS_SINCE_ORIGIN                Elapsed time since first donation
IN_HOUSE                           Is in house donor?
OVERLAY_SOURCE                     M = Metromail, P = Polk, B = both
DONOR_AGE                          Age as of June 1997
DONOR_GENDER                       Actual or inferred gender
PUBLISHED_PHONE                    Published telephone listing
HOME_OWNER                         Is home owner?
MOR_HIT                            Mail order response hit rate
CLUSTER                            54 socioeconomic cluster codes
SES                                5 socioeconomic cluster codes
INCOME                             7 income group levels
MED_HOUSEHOLD_INCOME               Median income
PER_CAPITA_INCOME                  Income per capita
WEALTH                             10 wealth rating groups
MED_HOME_VALUE                     Median home value
PCT_OWNER_OCCUPIED                 Percent owner occupied housing
URBANICITY                         U = urban, C = city, S = suburban, T = town, R = rural, M = unknown
PCT_MALE_MILITARY                  Percent male military in block
PCT_MALE_VETERANS                  Percent male veterans in block
PCT_VIETNAM_VETERANS               Percent Vietnam veterans in block
PCT_WWII_VETERANS                  Percent World War II veterans in block
NUMBER_PROM_12                     Number of promotions in the last 12 months
CARD_PROM_12                       Number of card promotions in the last 12 months
FREQ_STATUS_97NK                   Frequency status, June 1997
RECENCY_STATUS_96NK                Recency status, June 1996
LAST_GIFT_AMT                      Amount of the most recent donation
RECENT_RESPONSE_COUNT              Recent response count
RECENT_RESPONSE_PROP               Recent response proportion
RECENT_AVG_GIFT_AMT                Recent average gift amount
RECENT_STAR_STATUS                 Recent STAR status (1 = yes, 0 = no)
CARD_RESPONSE_COUNT                Response count since June 1994
CARD_RESPONSE_PROP                 Response proportion since June 1994
CARD_AVG_GIFT_AMT                  Average gift amount since June 1994
PROM                               Total number of promotions
GIFT_COUNT                         Total number of donations
AVG_GIFT_AMT                       Overall average gift amount
GIFT_AMOUNT                        Total gift amount
MAX_GIFT                           Maximum gift amount
GIFT_RANGE                         Maximum less minimum gift amount
MONTHS_SINCE_FIRST                 First donation date from June 1997
MONTHS_SINCE_LAST                  Last donation date from June 1997
PEP_STAR                           STAR status ever (1 = yes, 0 = no)
CARD_PROM                          Number of card promotions
MIN_GIFT                           Minimum gift amount
TARGET_D (dependent variable)      Response amount to June 1997 solicitation
TARGET_B (dependent variable)      Response to June 1997 solicitation