Cost-Sensitive Learning via Priority Sampling to Improve the Return on Marketing and CRM Investment

Geng Cui, Man Leung Wong, and Xiang Wan

Geng Cui is a professor of marketing and international business at Lingnan University, Hong Kong. His research interests include quantitative models in marketing, consumer behavior, marketing in China, and foreign direct investment strategies and performance. His work has appeared in leading academic journals such as Management Science, Journal of International Business Studies, Journal of International Marketing, and Journal of World Business.

Man Leung Wong is an associate professor of computing and decision sciences at Lingnan University, Hong Kong. His research focuses on data mining and knowledge discovery, machine learning, and evolutionary algorithms. His work has been published in leading journals such as Management Science, IEEE Intelligent Systems, and Expert Systems with Applications.

Xiang Wan is an assistant professor of computer science at the Hong Kong University of Science and Technology. His research on bioinformatics, with an emphasis on the detection of genetic patterns in complex diseases using statistics and heuristic search methodology, has appeared in American Journal of Human Genetics, Nature Genetics, and Bioinformatics, among others.

Abstract: Because of the unbalanced class and skewed profit distributions in customer purchase data, the unknown and variant costs of false negative errors are a common problem in predicting high-value customers in marketing operations. Incorporating cost-sensitive learning into forecasting models can improve the return on investment under resource constraints. This study proposes a cost-sensitive learning algorithm via priority sampling that gives greater weight to high-value customers. We apply the method to three data sets and compare its performance with that of competing solutions.
The results suggest that priority sampling compares favorably with the alternative methods in augmenting profitability. The learning algorithm can be implemented in decision support systems to assist marketing operations and to strengthen the strategic competitiveness of organizations.

Key words and phrases: cost-sensitive learning, customer relationship management, direct marketing, forecasting, priority sampling.

[Journal of Management Information Systems / Summer 2012, Vol. 29, No. 1, pp. 335-367. © 2012 M.E. Sharpe, Inc. ISSN 0742-1222 (print) / ISSN 1557-928X (online). DOI: 10.2753/MIS0742-1222290110]

For many binary classification problems in marketing and customer relationship management (CRM), there is usually a severely unbalanced distribution of classes in empirical data, that is, a small number of true positives (e.g., 5 percent buyers) versus a majority of true negatives (95 percent nonbuyers). Moreover, false negative errors (e.g., loss of subscription or membership fees) are often much more costly than false positive errors (e.g., the cost of mailing or otherwise contacting customers). A predictive model lacking sensitivity to the unequal costs of misclassification errors often performs suboptimally in identifying buyers and augmenting profit. Other similar situations include (1) upgrading customers: how to provide sizable incentives to those customers who are the most likely to upgrade and contribute greater profit; (2) modeling customer churn and retention: how to prevent the most valuable customers from switching to a competitor; and (3) credit default: how to identify customers who do not pay back their sizable loans [2].
Since customers' purchase probability and their profit contribution are inherently difficult to predict, differentiating high-profit customers from low-profit customers is critical for achieving better profit rankings of customers. While rebalancing the data via undersampling or oversampling may improve classification accuracy [3, 8], researchers have also proposed various cost-sensitive learning algorithms, such as AdaCost and MetaCost, that supply a cost matrix to an estimator [10, 13]. Such features have been incorporated in popular software such as IBM SPSS Modeler and SAS Enterprise Miner and discussed in books on data mining [20]. However, these solutions do not apply to scenarios where the costs of false negative errors are unknown and variant (i.e., the amount of purchase). In addition to the between-class imbalance, the within-class imbalance also presents a significant challenge for cost-sensitive learning in direct marketing forecasting [25]. A small number of customers typically account for a large portion of a company's profit or loss in a long-tail distribution. Because a limited marketing budget allows for contacting only a preset percentage of the most valuable customers (i.e., the top 10 percent or 20 percent), decision makers must rely on forecasting models to select from the vast list of customers those who are the most likely to respond to a marketing offer and purchase the greatest amount. Thus, aside from classification accuracy, a predictive model needs to maximize sales or profit in the top deciles of the test data. While a number of studies have dealt with the unknown costs of false negatives via resampling [23, 26], treating the false positives and false negatives together or using the total costs may place too much emphasis on the high-value customers and lead to overfitting and suboptimal performance.
In the following sections, we first review the literature on direct marketing and cost-sensitive learning and discuss the research problems posed by the unknown and variant costs of false negative errors. Second, we propose a two-step approach that handles between-class imbalance and within-class imbalance separately to minimize overfitting. Given sufficient data, we deal with the between-class imbalance problem with random down-sampling of the negative class. To tackle the within-class imbalance and the unknown and variant costs of false negative errors, we propose a cost-sensitive algorithm via priority sampling to generate a desired data distribution that places greater weight on high-value customers. Ensemble learning is adopted to improve the accuracy of parameter estimates. Third, we apply priority sampling to three direct marketing and CRM data sets. The results suggest that priority sampling compares favorably with the alternative methods in augmenting the profitability of direct marketing operations. Moreover, priority sampling consistently renders superior performance with data of various degrees of class imbalance and can be implemented with other classification methods such as naive Bayes, thus providing a robust and general solution to cost-sensitive learning given the unknown and variant costs of false negative errors. Last, we explore the theoretical and managerial implications for improving the return on marketing and CRM investment under resource constraints.

Literature Review

Forecasting Models for Direct Marketing and CRM

While unbalanced class and cost distributions are common in many business areas, cost-sensitive learning has been a unique and challenging problem in direct marketing research because of the nature of its operations and the distribution of its data.
Many CRM activities also use direct marketing channels such as direct mail and telephone calls. Due to budget constraints, the primary objective of modeling consumer responses to direct marketing is to identify those customers who are the most likely to respond. Only the customers with the highest response probabilities, say, the top 20 percent, will be contacted. Aside from the conventional recency, frequency, and monetary value (RFM) method, researchers can incorporate consumer demographic and psychographic variables and apply more sophisticated statistical methods such as latent class analysis [3], beta-logistic models [19], and tree-generating techniques such as classification and regression trees (CART) and chi-squared automatic interaction detection (CHAID) [16]. Despite these improvements, the response rate of most direct marketing campaigns is usually very low; for instance, around 5 percent for catalog mailings. Thus, improving the response rate of direct marketing campaigns is a priority issue. Although various statistical methods have been developed to improve the accuracy of classification, the unbalanced distribution of classes may be problematic for statistical forecasting models that focus on minimizing the overall misclassification errors. While such models may have high overall classification accuracy by identifying the majority true negatives (nonbuyers), they do not help in predicting the rare class of true positives (buyers), which are of interest to decision makers (Figure 1). This is because the small class of positive cases does not lend sufficient opportunities for a model to learn the underlying structure of the data. To improve classification accuracy, decision support systems have employed various machine learning methods that are less subject to the problem of unbalanced class distribution, such as neural networks and Bayesian networks [9, 29]. These methods often outperform conventional statistical tools in terms of classification accuracy and managerial insight [15].
Even though these methods can potentially improve classification accuracy based on the predicted probabilities of response, they may not help to identify the most valuable customers. This is because the existing methods suffer from one major limitation, that is, a lack of sensitivity to the unequal costs of misclassification errors: they assume that false positive and false negative errors are equally costly, which is true only in certain cases (second column in Table 1). In many cases (third column in Table 1), false negative errors (loss of potential sales and profit, e.g., $30 in terms of subscription or membership fees) are often much more costly than false positive errors (e.g., cost of mailing, which usually amounts to $1 per customer). These two issues together, the unbalanced distribution of classes and the unequal costs of misclassification errors, highlight the cost-sensitivity problem in direct marketing and CRM operations [8, 30].

Figure 1. The Confusion Matrix for Classifier Performance

                           Class Positive (C+), e.g., 5%    Class Negative (C-), e.g., 95%
Prediction Positive (R+)   True Positives (TP):             False Positives (FP):
                           buyers correctly classified      nonbuyers misclassified
Prediction Negative (R-)   False Negatives (FN):            True Negatives (TN):
                           buyers misclassified             nonbuyers correctly classified

Cost-Sensitive Learning

In many real-life situations, the default assumption of equal misclassification costs underlying most classification and pattern recognition techniques is not tenable. Cost-sensitive learning has been proposed to help make optimal customer selection decisions in terms of cost and benefit [12, 24, 26]. In general, decision support systems have used several strategies to deal with the unequal costs of misclassification errors.
Given the problem of between-class imbalance, the standard industry practice in direct marketing, referred to as "salting," is to undersample the nonbuyers to provide a more balanced class distribution of positive and negative cases [1, 3, 25]. When there are not enough data in the positive class, one may oversample the positive cases using strategies such as the synthetic minority oversampling technique (SMOTE) [8]. With more balanced training data between the classes, the classification tool may have sufficiently more opportunities to learn the model structure and improve classification accuracy. Alternatively, decision support systems may manipulate the output of a forecasting model by adjusting the threshold value. The relative performance of competing models can be compared using the area under the receiver operating characteristic (ROC) curve [21]. For instance, when logistic regression is used for classification, one could set the probability at 0.8 instead of 0.5 as the threshold value, which means that a case is labeled as positive or "1" if its predicted probability of purchase is greater than 0.8. Using a different threshold value is suitable for simple classification tasks, but it does not change the rankings of the predicted output. Thus, neither simple resampling nor adjusting the threshold value specifically tackles the issue of unequal costs of misclassification errors.

Table 1. Costs of Misclassification Errors in Data with Unbalanced Class Distribution

Errors/costs          Equal and known costs      Unequal and known costs        Unequal but unknown costs
                                                                                of false negatives
False positives       Wrong answer in a test:    Membership/subscription:       Catalog direct marketing:
(large percentage)    -1 point/error             cost of mailing,               cost of mailing or contact,
                                                 $1.00/customer                 $1.00/customer
False negatives       Missed right answer:       Lost membership or             Loss of sales/profit: e.g.,
(small percentage)    -1 point/error             subscription:                  from $20.00 to $600.00,
                                                 $30.00/customer                not known
Viable solutions      Uniform sampling,          Applying cost matrix or        Expected cost approach,
                      adjusting threshold        ratios, e.g., AdaCost,         AdaC2, priority sampling
                      values                     MetaCost, C4.5

Note: The more costly errors typically come from a small minority, such as the buyers and high-value customers.

Some researchers have recommended methods that are less sensitive to the unbalanced class distribution. Joint distribution models such as association rules, naive Bayes, and support vector machines are less susceptible to the influence of outliers or unbalanced class distribution. However, since minimizing the total misclassification errors remains the focus of these models, they do not directly address the cost-sensitivity problem to help achieve better profit rankings of customers. A viable solution is to incorporate the unequal and known costs of misclassification errors in the training process, usually by providing a cost matrix or ratio in the learning algorithm (third column in Table 1). In this case, a cost matrix plays a key role in guiding the training process. To date, researchers in statistical learning have developed a number of cost-sensitive learning algorithms, including bagging-based methods [6] such as MetaCost [10] and boosting-based methods [14] such as AdaCost [13].
Although both methods combine multiple models using ensemble learning, bagging does so by generating replicated bootstrap samples of the data, whereas boosting does so by adjusting the weights of the training data. For cost-sensitive learning, both AdaCost and MetaCost can incorporate a cost matrix to weight samples and reorder the output [10, 13]. To develop a known cost matrix or ratio, one must first determine the conditional risks and sort the cases accordingly (e.g., 9,500 false positives at $1 each versus 500 true positives at $30). This approach can be very helpful when the exact costs of misclassification errors are known and constant within each type of error (third column in Table 1). In such cases, applying a cost matrix can help to improve the accuracy of classification models and augment the sales or profitability of direct marketing [20, 24, 30].

When the Costs Are Unknown and Variant

In many marketing applications, however, the costs of false negative errors are often neither known to decision makers nor uniform. Decision makers cannot anticipate whether customers will respond to a promotion or how much they will purchase (Figure 2 and fourth column in Table 1). When the costs of false negatives are unknown, it is unrealistic to apply a cost matrix. Moreover, as shown in Figure 2, the distribution of customer sales and profit data is often highly skewed with a very long tail, indicating a concentration of profit among a small group of customers [18]. In empirical studies of profit forecasting, the skewed distribution of profit data creates problems for identifying the small number of high-value customers whose profit may amount to hundreds or thousands of dollars, whereas most buyers contribute a much smaller amount (e.g., $10). This has led to an increasing emphasis on CRM, which requires decision makers to focus on the high-value customers.
Researchers have developed various models to maximize the profit of direct marketing [7, 19]. However, the profit maximization approach to customer selection, which selects those customers with a positive expected marginal profit, is not realistic in most direct marketing situations that do not allow contacting customers beyond a preset percentage. Furthermore, the profit maximization approach focuses on maximizing the total potential profit of a direct marketing campaign, but cannot help achieve better profit rankings of customers or target the most valuable customers in the top two deciles of the testing data. Much work has been done to deal with the issues of class imbalance and unequal but known error costs [8, 30], but only a few researchers have addressed the issue of cost-sensitivity when the exact costs of false negative errors are unknown [12]. One simple solution is to use the expected cost (profit) to rank the customers [26]. In this case, the expected profit is estimated using only the positive cases and a linear regression model. Then, Heckman's two-step procedure is used to correct the sample selection bias. The first step is to estimate a classification model to generate conditional probabilities of purchase P(j = 1 | x). The second step is to estimate a linear regression model using the training data containing only the positive cases with j(x) = 1, while also including the estimated probabilities in the model. The model with the learned parameters is then used to rank the testing data, including both positive and negative cases. An alternative approach is cost-based resampling, such as cost-proportionate sampling [27] and cost-sensitive boosting algorithms such as AdaC1, AdaC2, and AdaC3 [23]. Cost-proportionate resampling applies the rejection sampling strategy and draws a sample with probability c / Z, where c is the case-dependent cost and Z is a constant (usually the maximum of all costs).
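The rejection-sampling rule just described can be sketched in a few lines of Python. This is an illustrative sketch under the paper's notation, not the implementation of [27]; the function name is ours.

```python
import random

def cost_proportionate_sample(cases, costs, seed=0):
    """Keep each case with probability c / Z, where Z is the maximum cost,
    so that high-cost cases are retained far more often than low-cost ones."""
    rng = random.Random(seed)
    z = max(costs)
    return [x for x, c in zip(cases, costs) if rng.random() < c / z]
```

Since random() is uniform on [0, 1), a case carrying the maximum cost is always retained, while a zero-cost case is never retained; cases in between survive in proportion to their cost.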
As a result, this method usually generates a very small training data set (smaller by a factor of about N / Z, where N is the number of cases in the training data) but still achieves a bounded classification error [27]. However, it puts all the cases (both positive and negative) into the sampling and uses the total costs (the sum of all costs of false negatives and false positives) in the population to draw samples. Because the costs of false positives are very small and uniform, this approach may overemphasize the positive cases with very high values and result in overfitting.

Figure 2. The Skewed Long-Tailed Distribution of Customer Profit Data

Sun et al. [23] proposed a number of cost-sensitive algorithms (AdaC1, AdaC2, and AdaC3) that are extensions of AdaBoost [14]. The AdaBoost algorithm is a successful ensemble learning approach for improving classification accuracy by applying a sample weighting strategy. Initially, the weights of all the training examples are set equally. A baseline model such as logistic regression is fitted to the training examples to generate a component model, and a weight-updating parameter αt is computed to improve the accuracy of the component model. The training examples are then divided into true positives, true negatives, false positives, and false negatives. A weight-updating formula increases the weights of false positives and false negatives by an identical ratio and decreases the weights of all true positives and true negatives by another identical ratio. Another component model is then learned from all training examples with the modified weights. These steps are repeated until a specified number of component models are obtained.
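The reweighting step described above can be sketched as follows. The exponential update with the parameter alpha follows the standard AdaBoost formulation; since the text gives no formula, this particular form is our assumption.

```python
import math

def adaboost_reweight(weights, correct, alpha):
    """One AdaBoost-style reweighting step: misclassified examples are
    up-weighted by a common factor exp(alpha), correctly classified
    examples are down-weighted by exp(-alpha), and the weights are then
    renormalized to sum to one."""
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    z = sum(updated)
    return [w / z for w in updated]
```

After one such step, the next component model concentrates more on the examples the previous model got wrong, which is the mechanism AdaC1, AdaC2, and AdaC3 later modify with profit/cost terms.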
To classify an unseen case, the weighted average of the outputs of the component models is produced using the αt values of the different component models as the weights. However, AdaBoost is also accuracy oriented and treats positive and negative examples equally. Thus, it ignores the profit/cost associated with the examples in the weight-updating formula. To implement cost-sensitive learning for examples with varying costs, AdaC1, AdaC2, and AdaC3 incorporate the profit/cost items in the weight-updating formula in different ways. Essentially, the weights of false negatives with high costs are increased significantly while the weights of false negatives with small costs are increased slightly. The weights of true positives with high costs are decreased significantly while the weights of true positives with small costs are decreased slightly. When learning another component model in subsequent iterations, the algorithm can then concentrate on those false negatives with high profit/cost values. In summary, the expected cost approach [26] is rather simple and straightforward, but it may result in poor forecasts of consumer purchases as it includes only the positive cases in the training process. The cost-based resampling approach uses all the data, including both positive and negative cases, but existing methods put the greatest weight on the most profitable cases and exclude most of the low-profit and negative cases [23, 27]. AdaCost includes both positive and negative cases, but it treats the false positives and false negatives differently according to a given cost matrix and does not consider the variance among the false negative errors. Thus, the existing approaches treat the false positives and false negatives together or use the total costs in the resampling process, which may overemphasize the high-value customers and result in overfitting and suboptimal performance of a forecasting model. A reasonable approach to improve the sensitivity to the unknown costs of false negative errors should consider the unbalanced distribution of classes as well as a better representation of the skewed distribution of customer profit in the positive class.

Cost-Sensitive Learning via Priority Sampling

Different from previous studies, we consider the unequal costs of misclassification errors and the unknown costs of false negatives separately to avoid overemphasizing the positive cases and to minimize the overfitting problem. First, given the unequal costs of misclassification errors, a learning algorithm needs to be sensitive to the more costly errors (i.e., false negatives) to improve its predictive accuracy. Given the between-class imbalance problem and the unequal costs of misclassification errors, a rebalanced class distribution is necessary to improve the sensitivity to the more costly errors. In this case, researchers may undersample the negative cases and oversample the positive cases. This is also true for cost-sensitive learning [11]. When there are not enough positive cases, up-sampling or oversampling of the minority class can be used to provide a rebalanced class distribution, for instance, using the SMOTE method [8]. Given a sufficiently large negative class, one can undersample the negative cases by applying a 1:1 ratio to achieve a symmetric distribution of the two classes. This way one avoids including too few negative cases or overemphasizing the positive cases in the resampling process, especially the extremely high-cost customers, thus minimizing the overfitting problem. Moreover, since the costs of false positives are much lower, already known, and uniform, it is not necessary to apply differential weights to them or include them in the resampling process.
Second, we address the problem of unknown costs of false negatives using a resampling strategy. Given the severe within-class imbalance, the model should place greater emphasis on high-value customers. We adopt the cost-based weighted resampling approach to address the sensitivity to false negative errors in the context of a skewed profit distribution [26]. Among the positive cases, we integrate the case-dependent cost in the learning process and give priority to high-profit customers in the resampling process. To make such an estimator cost-sensitive, a logical way is to generate a sample from a desired distribution that gives greater importance to the high-profit customers, rather than from the original distribution. In statistics, importance sampling with normalized importance weights is a general technique for estimating properties of a particular distribution using samples generated from a different distribution. Formally, let X be a random variable in S, let p be a probability measure on S, and let f be some function on S. Then the expectation of f under p can be written as E[f(X) | p] = ∫ f(x) p(x) dx, where E denotes the expectation and the integral runs over the realizations x of X in S. The essence of importance sampling is to draw from a distribution other than p, say, q, and modify the above formula to obtain a consistent estimate of E[f(X) | p]. This procedure helps to reduce the variance of the estimate of E[f(X) | p] through an appropriate choice of q, as samples from q are more "important" for the estimation of the integral. Given the above definition, we have w(x) = p(x)/q(x), where w is known as the importance weight and the distribution q is usually referred to as the sampling or proposal distribution. With random samples, normalized importance weights can be generated according to q. Since this method is completely general, the above analysis can be repeated when p represents a conditional distribution.
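The self-normalized form of this estimator can be sketched as follows. This is an illustrative sketch under our own naming: p and q are density functions, and q_sample draws one realization from the proposal q.

```python
import random

def importance_estimate(f, p, q, q_sample, n, seed=0):
    """Estimate E[f(X) | p] from n samples drawn from a proposal q,
    using self-normalized importance weights w(x) = p(x) / q(x)."""
    rng = random.Random(seed)
    xs = [q_sample(rng) for _ in range(n)]
    w = [p(x) / q(x) for x in xs]                      # importance weights
    return sum(wi * f(x) for wi, x in zip(w, xs)) / sum(w)
```

For example, with p and q both uniform on [0, 1], every weight equals 1 and the estimator reduces to a plain Monte Carlo average; choosing a q that concentrates mass where f is large (here, on the high-value customers) is what reduces the variance.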
Monte Carlo simulation has often been used in importance sampling to determine the choice of the distribution q. In the context of cost-sensitive learning, importance sampling focuses on finding a biasing density function so that the variance of the estimator is less than the variance of the general Monte Carlo estimate [27]. Although there are many kinds of biasing methods, a simple and effective biasing technique is to translate the density function of a random variable so as to place much of its probability mass in the desired region of the rare cases, in this case, the high-value customers. This is consistent with the purpose of cost-sensitive learning: reweight the training data according to their importance so as to draw the desired samples that give greater priority to high-value customers. The key to the success of any biasing or translating function lies in the design of the weighting mechanism.

Priority Sampling

Following the concept of normalized importance weights, we translate customer profit (or cost) into a probability distribution. Consider a data set S(m, k) = {(s1, c1), (s2, c2), ..., (sm, cm), (n1, x), (n2, x), ..., (nk, x)} with m positive cases and k negative cases, where each positive case si is associated with profit ci and each negative case ni is associated with a constant cost x (such as the mailing cost). In direct marketing, m is much smaller than k, that is, m << k. Uniform sampling draws samples using the probability functions P(si) = 1/(m + k) and P(ni) = 1/(m + k), which treat each case equally. In priority sampling, we directly use the profit of each positive case as the cost in the resampling process and assign higher probabilities to cases with greater profit. We define the probability distribution function Pd(si) as Pd(si) = ci / Σj cj.
For the negative cases, which are the majority of samples and carry the constant cost x, we use uniform sampling with the probability function P(ni) = 1/k to draw a subset for training. There are other alternatives for the probability distribution function Pd(si). Our preliminary experiments suggest that the linear weight function described above is sufficient, whereas other transformations of the weight function (e.g., exponentials or logarithms) may change the weights of cases but do not render superior results. In priority sampling, a random number r(si) is drawn from a uniform distribution for each positive case si. Next, it is compared with the probability Pd(si), which gives priority to the cases with higher profit. If r(si) ≤ Pd(si), si is selected for training. For example, for seven customers with profit amounts of 4, 3, 2, 2, 1, 1, and 1, their probabilities Pd(si) of being selected would be 0.29, 0.21, 0.14, 0.14, 0.07, 0.07, and 0.07. If the generated random numbers r(si), 1 ≤ i ≤ 7, are, respectively, 0.15, 0.3, 0.2, 0.25, 0.5, 0.4, and 0.01, the first and the last customers will be selected. On the other hand, if the generated random numbers r(si), 1 ≤ i ≤ 7, are 0.01, 0.2, 0.1, 0.4, 0.3, 0.35, and 0.1, respectively, the first three customers will be chosen. Thus, the cases with higher profit have greater chances of being selected. Cases with lower profit but higher frequency also have a chance to appear across multiple runs of sampling. Thus, the probability of a positive case being included depends on its profit as well as its frequency of appearance in the sample. In so doing, we achieve a normalized distribution of profit among the customers in the training data set. Meanwhile, depending on the number of positive cases drawn in the priority sample, a uniform sampling procedure is applied to draw the same number of cases from the majority negative class.
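The selection rule above can be sketched in a few lines of Python (an illustrative sketch; the function names are ours, not part of the original algorithm):

```python
def selection_probabilities(profits):
    """Pd(si) = ci / sum_j cj: normalize profits into selection probabilities."""
    total = sum(profits)
    return [c / total for c in profits]

def priority_select(profits, random_numbers):
    """Select positive case i whenever r(si) <= Pd(si)."""
    probs = selection_probabilities(profits)
    return [i for i, (p, r) in enumerate(zip(probs, random_numbers)) if r <= p]

# The seven-customer example from the text (0-based indices):
profits = [4, 3, 2, 2, 1, 1, 1]
print(priority_select(profits, [0.15, 0.3, 0.2, 0.25, 0.5, 0.4, 0.01]))  # [0, 6]
print(priority_select(profits, [0.01, 0.2, 0.1, 0.4, 0.3, 0.35, 0.1]))   # [0, 1, 2]
```

Each run of priority sampling draws fresh random numbers, so repeated runs favor, but do not guarantee, the high-profit cases.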
This way, the training data have a balanced distribution of positive and negative cases. In essence, by combining undersampling of negative cases with priority sampling of positive cases based on their profit, the generated data should consist of the desired samples in terms of both response and profit distribution. From the same original data, different runs of priority sampling will produce different training samples, and the cases with higher profit will appear more frequently than those with lower profit. Since such a sample is likely to be only a small portion of all the training data, it may not lend an opportunity to arrive at accurate estimates of the parameters. To solve this problem, we use an ensemble learning approach inspired by bootstrap aggregating [6]. Given a two-class classification problem and a training data set S of size n, bootstrap aggregating creates T bootstrap samples St, 1 ≤ t ≤ T, by sampling S uniformly with replacement. The sizes of the St are less than or equal to n. A baseline model such as logistic regression is then fitted to the T bootstrap samples to generate T component models that form the ensemble. To classify an unseen case, the outputs of the T component models are combined by averaging the results of the different models in the ensemble. Suppose there are five component models in the ensemble and their response probabilities for a new case are, respectively, 0.6, 0.7, 0.5, 0.6, and 0.7. The average is (0.6 + 0.7 + 0.5 + 0.6 + 0.7)/5 = 0.62, so the ensemble determines that the customer has a 62 percent probability of responding. Both empirical and theoretical evidence suggests that averaging leads to a decrease in prediction error and can improve predictive accuracy [6]. The proposed priority sampling learning algorithm is shown in Figure 3.
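The averaging step can be expressed as a one-line helper (a trivial sketch for concreteness; the name is ours):

```python
def ensemble_probability(component_probs):
    """Combine component models by averaging their response probabilities."""
    return sum(component_probs) / len(component_probs)

# The five-model example from the text:
print(round(ensemble_probability([0.6, 0.7, 0.5, 0.6, 0.7]), 2))  # 0.62
```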
To illustrate this algorithm, the first 10 positive examples and all the negative examples of the training data set from the 1998 Knowledge Discovery and Data Mining competition [27] are used. Table 2 shows the positive examples and their profit values. In Step 2 of the algorithm (Figure 3), the profit values of all positive training examples are copied to an array W. The total profit, $90.20, is then calculated and stored in the variable TotalProfit in Step 3. Next, the values in W are normalized using the total profit. For example, the profit of the first positive example is $3.32, so W(1) is normalized to 3.32/90.2 = 0.0368. The normalized values of W appear in the third column of Table 2. The ensemble of models H is initialized in Step 5, and then T component models are learned in the loop of Step 6. To learn each component model, m random numbers are first generated and stored in the array R in Step 6(a). Second, the positive examples are examined one by one to determine which should be selected and stored in S+, the set of selected positive examples. For example, if R(1) is 0.2, the first positive example is not selected because R(1) is larger than W(1). On the other hand, the second positive example is selected if R(2) is 0.06, because R(2) is smaller than W(2). The loop of Step 6(d) is executed m times until all positive examples have been considered. Suppose three positive examples are selected in Step 6(d); the value of NoSelectedPositive is then 3 after completing Step 6(d). Third, negative examples are selected randomly without replacement in Step 6(e), the number of selected negative examples being equal to NoSelectedPositive; thus, three negative examples are selected in this case. In Step 6(f), a model is learned from the selected positive and negative examples; in this case, a logistic regression model is obtained from three positive and three negative examples.
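The sampling-and-training loop walked through above can be sketched as follows. This is a compact illustration under stated assumptions: the data are reduced to one numeric feature per customer, and a toy "midpoint" learner stands in for the logistic regression used in the article; the function names and toy values are ours.

```python
import random

def priority_sampling_ensemble(positives, profits, negatives, T, learn, seed=0):
    """Sketch of the Figure 3 loop: T component models, each trained on a
    priority sample of positives plus an equal-size uniform sample of negatives."""
    rng = random.Random(seed)
    total = sum(profits)                        # Step 3: total profit
    w = [p / total for p in profits]            # Step 4: normalized profits
    ensemble = []                               # Step 5: empty ensemble
    for _ in range(T):                          # Step 6: learn T component models
        # Steps 6(a)-(d): keep positive si when a uniform draw is <= W(i).
        s_pos = [x for x, wi in zip(positives, w) if rng.random() <= wi]
        if not s_pos:
            continue  # no positives drawn this run; skip (guard not in Figure 3)
        # Step 6(e): undersample an equal number of negatives without replacement.
        s_neg = rng.sample(negatives, len(s_pos))
        ensemble.append(learn(s_pos, s_neg))    # Steps 6(f)-(g)
    return ensemble

# Hypothetical toy data: one feature per customer; higher values mean responders.
pos, prof = [5.0, 6.0, 7.0, 8.0], [3.0, 6.0, 9.0, 12.0]
neg = [1.0, 2.0, 3.0, 4.0]
learn = lambda p, n: (sum(p) / len(p) + sum(n) / len(n)) / 2  # midpoint "model"
models = priority_sampling_ensemble(pos, prof, neg, T=25, learn=learn)
vote = sum(8.0 > m for m in models) / max(len(models), 1)  # ensemble score
```

In practice each `learn` call would fit a logistic regression, and the component predictions would be averaged as described above.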
Finally, the induced model is stored in the ensemble H in Step 6(g). The loop of Step 6 is repeated T times until T component models ht, 1 ≤ t ≤ T, have been induced and stored in the ensemble H. In the last step, the ensemble H is returned by the algorithm. In summary, unlike the expected cost approach, priority sampling includes both positive and negative cases but treats them separately. In contrast to other resampling methods such as cost-proportionate sampling and AdaCost, our approach maintains a balanced ratio of positive and negative cases and alleviates the problem of overfitting. Priority sampling is then used to draw from the positive cases: cases of greater profit appear more frequently across the different samples, and vice versa. In so doing, priority sampling gives greater weight to the more costly false negative errors and can improve the profit ranking of customers. In the following section, we test the benefits of priority sampling against the alternative methods using data sets with different levels of imbalance in class and profit distribution.

PrioritySampling(S, m, k, T)
// S is the training set with m positive and k negative examples
// S = {(s1, c1), (s2, c2), ..., (sm, cm), (n1, x), (n2, x), ..., (nk, x)}
1. Define and initialize variables:
   S+,   // The set of selected positive examples
   S–,   // The set of selected negative examples
   W,    // An array of numbers
   R,    // An array of numbers
   H,    // The ensemble of models
   TotalProfit,
   NoSelectedPositive;
2. For i = 1 to m do W(i) := ci;   // Store the profit ci of si in W(i)
3. TotalProfit := Sum of all values in W;
4. For i = 1 to m do W(i) := W(i)/TotalProfit;
5. H := ∅;
6. For t = 1 to T do
   (a) Generate m random numbers and store them in R;
   (b) S+ := ∅;   // Initialize S+
   (c) NoSelectedPositive := 0;
   (d) For i = 1 to m do
         If R(i) <= W(i) then
           S+ := S+ ∪ {si};   // Select the positive example si
           NoSelectedPositive := NoSelectedPositive + 1;
         EndIf
       EndFor
   (e) Randomly select NoSelectedPositive negative examples without replacement from S and store them in S–;
   (f) Apply a learning method such as logistic regression to learn a model ht from the selected examples S+ and S–;
   (g) H := H ∪ {ht};   // Add the learned model ht to the ensemble
   EndFor
7. Return H;

Figure 3. Learning Algorithm Based on Priority Ensemble Sampling

Study One

Data and Methods

We first conduct experiments with a direct marketing data set from a U.S.-based catalog company. The company sells various lines of merchandise, including gifts, apparel, and consumer electronics. The data set contains the records of 106,284 consumers and their purchase history over a 12-year period. The most recent promotion sent a catalog to every customer in this data set and achieved a 5.4 percent response rate with 5,740 buyers. Nine variables are selected using the forward selection criterion (p = 0.05): recency (i.e., the number of months elapsed since the last purchase), frequency of purchase in the last 36 months, monetary value of purchases in the last

Table 2.
The First 10 Positive Examples of the Training Data Set of the 1998 Knowledge Discovery and Data Mining Competition

Record number   Profit    W(i)
21              $3.32     0.0368
31              $6.32     0.0701
46              $4.32     0.0479
79              $12.32    0.1366
94              $9.32     0.1033
102             $4.32     0.0479
116             $24.32    0.2696
127             $9.32     0.1033
193             $7.32     0.0812
204             $9.32     0.1033
Total profit    $90.20

36 months, average order size, lifetime orders (number of orders placed), lifetime contacts (number of mailings sent), and whether a customer typically places telephone orders, makes cash payments, or uses the "house" credit card from the catalog company. Since every customer in this data set received a catalog from the company in the current mailing, the data do not have a sample selection problem. However, there may be an endogeneity bias among the RFM variables, which are based on the previous purchases of consumers. For the endogeneity tests, we employ the asymptotic t‑test developed by Smith and Blundell [22]. The significant results of the t‑tests reject the null hypotheses of exogeneity for the RFM variables. To correct for the endogeneity bias, we adopt the "control function" approach developed by Blundell and Powell [5]. We first run a parametric reduced-form logit regression on the whole data set to compute the estimates of the endogenous RFM variables. In the second stage, the residuals of the reduced-form regressors are included as covariates in the binary response model. Logistic regression, which does not consider the costs of misclassification errors, is not cost-sensitive and serves as the baseline model for comparison with the cost-sensitive methods. The logistic regression models use the data with (1) the original imbalanced class distribution and (2) a balanced class distribution obtained by down-sampling the negative cases.
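The second baseline, balancing the classes by down-sampling, can be sketched as follows (a minimal illustration with our own function name, assuming the cases are held in plain lists):

```python
import random

def downsample_balance(positives, negatives, seed=0):
    """Balance the classes by keeping all positives and drawing an equal-size
    uniform sample, without replacement, from the majority negative class."""
    rng = random.Random(seed)
    return list(positives), rng.sample(list(negatives), len(positives))

# E.g., 5 responders against 95 nonresponders yields a 5-versus-5 training set.
pos = list(range(5))
neg = list(range(5, 100))
bal_pos, bal_neg = downsample_balance(pos, neg)
```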
Then, we compare three cost-sensitive methods: (1) the expected cost method [27], (2) AdaC2 [23], and (3) priority sampling with logistic regression. Because the expected cost method uses a linear regression model, only positive cases are used in the training process, and the dependent variable is customer profit. For selection bias correction using the Heckman procedure, predicted purchase probabilities from a logit model are added to the linear regression model of customer profit. This approach can be considered as incorporating the expected cost directly in the training process, whereas the other two methods represent the resampling approach to cost-sensitive learning. The latter two models use logistic regression as a binary classification tool, and the dependent variable is whether or not a customer responds to a direct marketing promotion. The AdaC2 and priority sampling methods rely on their respective sampling procedures to improve sensitivity to the costs of misclassification errors. To test its applicability to other classification methods, we also test the naive Bayes approach using both the original imbalanced class distribution and priority sampling. Because decision makers usually have a fixed budget and can contact only a small portion of the potential customers in their database (e.g., the top 10 or 20 percent), overall classification accuracy and simple error rates are not meaningful criteria for model evaluation and comparison. To support direct marketing decisions, maximizing the number of true positives at the top deciles, that is, the cumulative lift, is usually the most important criterion for assessing the performance of classifiers [4, 29]. Cumulative lift is the ratio of the number of true positives identified by a model to that of a random model at a specific decile of the file; the ratio is then multiplied by 100.
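Concretely, the cumulative lift computation can be sketched as below (our own function name and toy data; scores are the model's predicted response probabilities):

```python
def cumulative_lift(scores, responded, decile):
    """Cumulative response lift at a decile: true positives captured in the
    top-scored fraction of the file versus a random mailing, times 100."""
    n = len(scores)
    cutoff = int(n * decile / 10)                       # cases contacted
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(responded[i] for i in ranked[:cutoff])   # model's true positives
    expected_random = sum(responded) * cutoff / n       # random model's expectation
    return 100.0 * hits / expected_random

# Toy file of 10 customers with 5 responders; the top-scored case responds.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1,   0,   1,   1,   0,   1,   0,   1,   0,   0]
print(cumulative_lift(scores, responded, decile=1))  # 200.0
```

The cumulative profit lift is the analogous ratio with dollar profits summed in place of response counts; at the tenth decile both lifts are 100 by construction.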
Thus, a model with a response lift of 200 in the top decile is twice (200 percent) as good as a random model. Comparing cumulative lifts across depths of file (usually 10 deciles) helps in assessing the performance of different models. Since most direct marketing campaigns contact only the top 10 or 20 percent of the names in the customer database, researchers usually compare the cumulative lifts of models at the top two deciles to select the better model. The cumulative profit lift across the deciles is measured the same way, that is, as the ratio of the profit realized by a model to that realized by a random model on the testing data. Lifted profit is the amount of "extra profit" in dollars generated by a model over that generated by a random method. Because a single split or hold-out validation is inadequate, we assess the performance of the different models using stratified tenfold cross-validation, which has proved sufficient to produce stable results and has become a popular method for comparing model performance [17]. Following standard practice, we split the whole data set into 10 disjoint subsets using stratified random sampling, so that each subset has approximately the same number of responders (574) and nonresponders (10,054). We then estimate and validate a model 10 times, using each of the 10 subsets in turn as the testing data set and the remaining 9 subsets combined as the training data set. The expected cost approach uses only the positive cases in the training process but includes both positive and negative cases for validation. Using the same tenfold cross-validation data sets, AdaC2 and priority sampling further sample the data according to their respective resampling procedures. For priority sampling, each ensemble contains 25 models; thus 250 samples were used across the 10 folds to improve the robustness of the procedure.
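The stratified split described above can be sketched as follows (our own function name; a minimal illustration that spreads each class evenly over 10 folds rather than the authors' exact procedure):

```python
import random

def stratified_tenfold(labels, seed=0):
    """Assign each case index to one of 10 folds so that responders (1) and
    nonresponders (0) are spread evenly across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(10)]
    for label in (1, 0):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % 10].append(i)
    return folds

# Toy file: 20 responders and 180 nonresponders -> 2 responders per fold.
labels = [1] * 20 + [0] * 180
folds = stratified_tenfold(labels)
print([sum(labels[i] for i in f) for f in folds])  # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

Each fold then serves once as the testing set while the other nine are pooled for training.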
However, all the methods share exactly the same 10 testing data sets, each of which is 10 percent of the entire data. The results for each method are the averages over the tenfold cross-validation.

Comparisons of the Results

First, we use the priority sampling procedure elaborated above to draw the training samples. Figure 4 clearly shows that the new sample has a smoother distribution, not as skewed as the original one.

Figure 4. Priority Sampling Versus Uniform Sampling

Moreover, the new sample has a greater proportion of high-profit customers than the original data but fewer of the extreme high-value customers, thus reducing the number of outliers that may lead to overfitting. The results of the tenfold cross-validation in Table 3 indicate that the baseline logistic regression models achieve the highest response lifts of 376.4 and 262.4 in the top two deciles using the original data and 374.5 and 264.9 using the balanced data, followed by AdaC2 (375.0 and 263.1) and priority sampling (364.9 and 273.5). The expected cost approach produces response lifts of 187.2 and 182.0 at the top two deciles, significantly lower than those of the other methods. This is not surprising, as the expected cost approach uses only positive cases in the training process. Thus, cost-sensitive learning does not improve classification accuracy or yield better probability rankings of customers. The naive Bayes model also records lower response lifts (280.7 and 220.0) in the top two deciles, but priority sampling does improve its predictive accuracy (333.8 and 243.5). Table 4 reports the cumulative profit lifts of all the methods across the deciles. On average, the baseline logistic regression model using the original data provides profit lifts of 589.9 and 364.8 in the top two deciles. The rebalanced data improve the performance of the logistic regression model (609.7 and 371.7).
By comparison, AdaC2 renders lower profit lifts (593.6 and 367.7). The other two cost-sensitive methods generate significantly higher profit lifts in the top two deciles. The expected cost approach produces profit lifts of 620.9 and 385.9 in the top two deciles, despite its poor performance in response lifts. The priority sampling method, meanwhile, records the highest average profit lifts in the top two deciles: 622.9 and 388.3. The average profit lift in the top decile is significantly higher than that of all competing methods except the expected cost approach (the p‑values can be found in the Appendices).

Table 3. Average Cumulative Response Lift of Tenfold Cross-Validation for Study One

Decile  LR (orig.)    LR (bal.)     Exp. cost     AdaC2         PS-LR         NB (orig.)    PS-NB
1       376.4 (15.1)  374.5 (21.6)  187.2 (13.2)  375.0 (17.2)  364.9 (26.8)  280.7 (19.0)  333.8 (20.5)
2       262.4 (8.5)   264.9 (8.8)   182.0 (6.8)   263.1 (8.5)   273.5 (11.1)  220.0 (11.4)  243.5 (11.6)
3       217.2 (6.2)   220.4 (6.6)   165.8 (3.9)   217.0 (6.4)   220.4 (6.6)   187.1 (6.9)   204.0 (7.2)
4       185.0 (3.5)   185.6 (3.9)   145.8 (3.5)   184.2 (4.0)   182.1 (3.6)   162.5 (5.3)   180.9 (4.9)
5       161.5 (3.6)   162.9 (3.8)   130.8 (2.9)   161.1 (3.4)   156.7 (3.2)   146.6 (3.1)   158.0 (3.1)
6       145.2 (2.0)   145.5 (2.3)   118.4 (2.6)   144.8 (2.0)   138.3 (3.0)   134.8 (2.2)   139.2 (3.0)
7       130.2 (1.2)   130.6 (1.2)   107.9 (2.3)   129.9 (1.5)   123.4 (2.6)   126.8 (1.0)   122.9 (2.6)
8       118.7 (1.1)   118.9 (1.2)   99.9 (1.7)    118.6 (1.0)   112.6 (1.8)   117.7 (1.5)   112.1 (1.7)
9       108.7 (0.7)   108.9 (0.8)   93.4 (1.4)    108.6 (0.8)   104.4 (1.3)   108.6 (0.5)   104.9 (1.2)
10      100.0         100.0         100.0         100.0         100.0         100.0         100.0

Notes: LR = logistic regression; Exp. cost = expected cost; PS = priority sampling; NB = naive Bayes. Standard deviations across the ten folds are in parentheses.

Table 4. Average Cumulative Profit Lift of Tenfold Cross-Validation for Study One

Decile  LR (orig.)    LR (bal.)     Exp. cost      AdaC2         PS-LR         NB (orig.)    PS-NB
1       589.9 (33.0)  609.7 (43.2)  620.9 (51.4)   593.6 (31.1)  622.9 (49.3)  478.2 (44.3)  575.8 (37.5)
2       364.8 (18.1)  371.7 (18.4)  385.9 (14.7)   367.7 (18.0)  388.3 (15.7)  326.6 (22.0)  359.2 (21.5)
3       274.5 (11.8)  279.4 (11.4)  282.0 (9.3)    275.3 (11.7)  282.1 (10.1)  251.1 (14.4)  271.9 (11.5)
4       221.3 (7.3)   222.2 (6.8)   221.3 (6.6)    220.7 (7.4)   220.7 (5.9)   203.4 (11.4)  222.0 (8.1)
5       184.1 (5.7)   186.3 (6.2)   182.0 (5.6)    184.0 (5.5)   182.0 (5.6)   173.5 (5.3)   184.6 (5.5)
6       159.3 (3.6)   160.3 (3.2)   155.7 (4.9)    159.2 (3.7)   155.8 (5.0)   151.7 (4.6)   156.6 (4.7)
7       139.0 (3.0)   139.2 (2.3)   134.9 (4.5)    138.5 (3.5)   135.0 (4.3)   137.2 (2.9)   134.4 (4.1)
8       123.3 (1.6)   123.5 (1.5)   120.3 (3.2)    123.2 (1.3)   120.6 (3.0)   123.3 (2.3)   120.2 (2.7)
9       110.4 (1.2)   110.6 (1.3)   107.8 (2.7)*   110.5 (1.3)   108.2 (2.5)   111.1 (0.8)   108.4 (1.3)
10      100.0         100.0         100.0          100.0         100.0         100.0         100.0

Notes: LR = logistic regression; Exp. cost = expected cost; PS = priority sampling; NB = naive Bayes. Standard deviations across the ten folds are in parentheses.
Priority sampling also improves the profit lift of the naive Bayes model in the top decile (575.8 versus 478.2). In Table 5, we further examine whether the improvement in the top-decile lift produces a significant increase in real profit. Logistic regression records relatively lower lifted profit in the top two deciles ($7,184.6 and $7,770.5 using the original data and $7,476.3 and $7,975.2 using the balanced data). AdaC2 likewise records lower lifted profit in the top two deciles ($7,246.3 and $7,872.8). The lifted profits are $7,639.6 and $8,389.4 in the top two deciles for the expected cost method. By comparison, priority sampling with logistic regression provides the highest lifted profit in the top two deciles ($7,668.6 and $8,461.1). Specifically, it would generate $18,093 more profit than the logistic regression model using balanced data (1,000,000/10,628 × (7,668.6 − 7,476.3)) if a campaign were mailed to 1 million customers. Overall, priority sampling compares favorably with the competing methods in generating incremental profit.

Study Two

Modeling Customer Churn

Direct marketing and CRM data are known to vary greatly in class distribution and customer value, and the performance of cost-sensitive learning methods may be subject to the influence of specific problems and data distributions. To assess the merit of the proposed method, we also apply priority sampling to a well-known CRM data set. Churn forecasting and customer retention are serious and challenging problems in the telecommunications and other industries. For study two, we use the customer churn data set from the 2003 Duke University data mining competition, which poses a very difficult classification task and has been used in several studies on data mining and CRM.
It contains data from a U.S.-based wireless telecommunications company for predicting which customers will switch to a competitor so that the company can intervene to minimize customer churn. The training data set, "Current Score Data," has 171 predictor variables and the records of 51,306 customers, with a true positive rate of 1.81 percent (the percentage of churners). Whereas most customers spend about $60 or less per month, a minority of high-value customers spend on average around $200 a month, and a few customers' monthly bills run to thousands of dollars. Thus, this data set is even more unbalanced in both class and customer value distribution. Moreover, resource constraint is also a realistic issue, as companies typically devote a limited amount of manpower and hours to contacting the small percentage of potential churners for intervention and retention (e.g., only the top 10 percent of potential churners). The data set contains (1) customer usage data, (2) monthly spending, and (3) other customer history and demographic variables. We first select 23 variables from the data set using the forward method in logistic regression, including monthly spending, calling and roaming time, customer history, and so on. Since the competition organizer provides the testing data set "Future Score Data" with 100,462 records, we adopt the train-and-test approach to evaluate the performance of priority sampling and the alternative methods.

Table 5. Average Lifted Profit (Dollars) of Tenfold Cross-Validation for Study One

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2    PS-LR    NB (orig.)  PS-NB
1       7,184.6     7,476.3    7,639.6    7,246.3  7,668.6  5,553.0     6,978.4
2       7,770.5     7,975.2    8,389.4    7,872.8  8,461.1  6,650.9     7,608.5
3       7,683.7     7,899.2    8,012.1    7,725.3  8,016.3  6,662.6     7,568.7
4       7,120.2     7,177.8    7,121.8    7,096.1  7,087.2  6,076.3     7,163.7
5       6,171.8     6,333.2    6,016.0    6,160.1  6,021.5  5,392.7     6,209.8
6       5,222.6     5,310.0    4,902.5    5,208.5  4,913.6  4,546.2     4,986.8
7       4,003.2     4,027.7    3,590.5    3,962.4  3,594.2  3,812.4     3,530.5
8       2,732.3     2,756.8    2,384.6    2,717.5  2,413.7  2,727.6     2,372.0
9       1,376.2     1,406.6    1,024.3    1,380.4  1,077.7  1,464.0     1,108.3
10      0.0         0.0        0.0        0.0      0.0      0.0         0.0

Notes: LR = logistic regression; Exp. cost = expected cost; PS = priority sampling; NB = naive Bayes.

Results of Validation

In study two, we follow the same procedures for the respective models as in study one. The results in Table 6 indicate that logistic regression with balanced data achieves true positive rate (TPR) lifts of 143.3 in the first decile and 128.8 in the second decile, below those of the model using the original data (161.1 and 138.0). All the other methods report lower response lifts, with those of the expected cost method falling below a random model. Despite trailing in TPR lifts, logistic regression with rebalanced data delivers cumulative "profit" lifts of 287.2 and 215.1 in the top two deciles (Table 7). The expected cost approach does not do as well (251.4 and 207.7). AdaC2 performs relatively better, with profit lifts of 296.7 and 227.6 in the top two deciles. By comparison, priority sampling renders the highest profit lifts in the top two deciles: 299.7 and 219.1 for the logistic regression model and 301.4 and 226.4 for the naive Bayes model. In Table 8, we report the lifted profits of all the methods compared with a random model. Logistic regression using rebalanced data has a lifted profit of $17,469 in the top decile. By comparison, AdaC2 generates $18,355 extra profit in the top decile.
Priority sampling records the highest lifted profit: $18,634 for logistic regression and $18,794 for the naive Bayes model. Thus, even with a data set of very unbalanced class and profit distribution, priority sampling delivers superior results relative to the competing methods (except for the second decile of AdaC2).

Study Three

Because the above two data sets are not commonly used, it is necessary to compare the performance of priority sampling with the other methods using a popular data set that is publicly available and used in data mining competitions. The data set from the 1998 Knowledge Discovery and Data Mining competition has been used in many studies [26]. In this task, a charitable organization sends regular mailings to its target customers to solicit donations. The training data contain 95,412 customer records, while the validation data have 96,367 cases. The most recent fund-raising campaign had a response rate of 5.1 percent. This data set is particularly useful for testing cost-sensitive methods, as the organization has found an inverse correlation between the likelihood to respond and the dollar amount of the gift. Thus, a simple response model will most likely select only very low-dollar donors. High-dollar donors may fall into the lower deciles and be excluded from future mailings. The lost revenue from these high-value donors may offset any gains from the increased response rate of the low-dollar donors. Thus, the organization needs a forecasting model that can help maximize the net revenue generated from future mailings.
Table 6. Average Response (TPR) Lift on the Test Data for Study Two

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2  PS-LR  NB (orig.)  PS-NB
1       161.1       143.3      71.5       117.3  135.0  126.7       128.3
2       138.0       128.8      92.1       114.5  120.8  115.5       115.5
3       133.9       120.5      95.4       108.4  119.0  109.6       111.8
4       129.3       116.0      98.9       107.2  111.5  105.9       109.4
5       121.6       115.2      98.2       108.1  111.3  105.1       109.2
6       115.4       109.9      98.1       105.0  108.5  103.2       106.5
7       111.1       107.4      97.1       103.8  106.8  103.2       105.5
8       107.5       106.2      94.9       101.7  104.8  103.3       103.6
9       104.3       102.5      93.9       101.0  102.3  101.3       102.3
10      100.0       100.0      100.0      100.0  100.0  100.0       100.0

Table 7. Average Profit Lift on the Test Data for Study Two

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2  PS-LR  NB (orig.)  PS-NB
1       139.2       287.2      251.4      296.7  299.7  78.6        301.4
2       109.7       215.1      207.7      227.6  219.1  68.0        226.4
3       103.0       176.0      176.8      187.4  188.4  64.8        188.7
4       99.7        156.2      156.9      165.1  161.5  64.4        165.2
5       96.5        142.5      139.9      150.4  146.9  67.0        149.4
6       93.4        128.3      128.6      135.5  134.0  68.7        135.1
7       94.2        118.8      118.0      124.6  123.8  75.0        124.6
8       95.3        112.1      110.0      114.5  114.6  82.6        115.2
9       97.7        104.0      104.5      106.8  106.1  87.3        107.0
10      100.0       100.0      100.0      100.0  100.0  100.0       100.0

Table 8. Average Lifted Profit (Dollars) on the Test Data for Study Two

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2     PS-LR     NB (orig.)  PS-NB
1       3,654.0     17,469.5   14,129.7   18,355.9  18,634.5  –2,001.3    18,794.9
2       1,803.5     21,473.5   20,098.3   23,819.0  22,232.4  –5,966.8    23,586.9
3       849.8       21,267.0   21,491.5   24,456.0  24,746.7  –9,843.9    24,828.3
4       –101.1      20,977.1   21,240.3   24,299.6  22,957.0  –13,286.9   24,330.7
5       –1,632.9    19,832.8   18,630.4   23,528.8  21,884.2  –15,400.8   23,068.2
6       –3,666.9    15,824.1   15,989.3   19,890.4  19,015.5  –17,502.1   19,655.3
7       –3,813.3    12,301.6   11,748.7   16,083.2  15,568.2  –16,338.0   16,065.3
8       –3,478.2    9,042.7    7,442.4    10,835.1  10,919.9  –13,004.3   11,373.7
9       –1,964.5    3,394.7    3,774.3    5,711.9   5,134.8   –10,657.0   5,878.7
10      0.0         0.0        0.0        0.0       0.0       0.0         0.0

Notes: LR = logistic regression; Exp. cost = expected cost; PS = priority sampling; NB = naive Bayes.

Since validation data are available from the competition organizer, we adopt the train-and-test approach to evaluate the performance of priority sampling and the competing methods. First, we use the 44 independent variables suggested in the SAS Enterprise Miner tutorial for model building.1 These variables and the two dependent variables are summarized in Appendix C. Second, we repeat the same experiments for all the methods following the same procedures used in study one.
Comparing Model Performance

The results in Table 9 indicate that the baseline logistic regression models achieve the highest response lifts, 199.8 and 172.5 in the top two deciles using the original data and 146.6 and 135.6 using the rebalanced data, followed by naive Bayes (163.7 and 153.4) and the priority sampling approach for logistic regression (125.5 and 120.6) and for naive Bayes (116.8 and 117.0). The predictive accuracy of the expected cost approach (86.1 and 101.6) and the AdaC2 model (67.3 and 67.0) is equal to or worse than that of the random model. According to Table 10, logistic regression using balanced data provides profit lifts of 723.5 and 473.5 in the top two deciles, much higher than those using the original unbalanced data (242.4 and 206.7). The expected cost method (618.3 and 416.5) and AdaC2 (544.4 and 312.9) provide lower profit lifts. By comparison, the two models using priority sampling generate the highest profit lifts in the top two deciles (882.4 and 571.6 for logistic regression and 756.2 and 493.9 for naive Bayes), highlighting the approach's ability to minimize overfitting. The superior performance of the priority sampling approach is also reflected in the actual lifted profits of the two models, which are much higher than those of the competing methods (Table 11). Overall, the revenue realized by priority sampling with logistic regression is higher than that of the competition winner ($15,800 versus $14,712 at a mailing depth of 60 percent, or 58,287 customers) and that reported in another study ($15,329) [26].

Sensitivity Analyses

In real-life situations, the response rates of marketing campaigns may vary greatly; the response rate may drop below 2 percent, for instance, in credit card promotions. This leads to data sets with different degrees of unbalanced class distribution and unequal costs of misclassification errors.
To validate the sensitivity of priority sampling to data sets with different degrees of skewness in class and profit distributions, and its ability to improve cost-sensitivity while minimizing overfitting, we conduct further experiments on the same data using training sets with varying degrees of class imbalance. For this sensitivity analysis, positive cases in the training data set are randomly deleted to generate training data sets with a more unbalanced class distribution and a more skewed profit distribution.2 The resulting ratios of positive to negative cases in the training data range from 5 percent down to 1 percent (Table 12); at the low ratios, very few positive cases remain. We then perform train-and-test validation for all the methods and compare their profit lifts in the top decile on the testing data with the original class ratio.

Table 9. Average Response Lift on the Testing Data for Study Three

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2  PS-LR  NB (orig.)  PS-NB
1       199.8       146.6      86.1       67.3   125.5  163.7       116.8
2       172.5       135.6      101.6      67.0   120.6  153.4       117.0
3       154.6       130.5      107.3      69.6   118.6  144.7       115.1
4       143.4       126.6      107.3      72.7   116.0  137.7       115.9
5       133.5       121.9      105.4      76.2   113.2  128.9       114.2
6       124.8       116.7      102.9      79.5   110.9  122.6       113.1
7       118.1       112.3      100.1      83.1   107.6  116.6       110.4
8       110.6       108.7      97.7       86.5   105.5  110.3       107.1
9       105.2       103.9      94.3       91.3   102.4  104.7       103.7
10      100.0       100.0      100.0      100.0  100.0  100.0       100.0

Table 10. Average Profit Lift on the Testing Data for Study Three

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2  PS-LR  NB (orig.)  PS-NB
1       242.4       723.5      618.3      544.4  882.4  103.4       756.2
2       206.7       473.5      416.5      312.9  571.6  124.2       493.9
3       172.0       370.0      327.6      244.7  435.0  125.5       361.0
4       160.9       306.1      279.9      200.2  352.3  124.4       310.6
5       148.6       256.7      234.3      174.0  288.3  102.2       264.2
6       133.6       209.7      197.8      158.9  246.6  102.7       230.3
7       126.3       177.9      167.3      146.3  204.7  96.9        192.7
8       101.1       152.2      147.2      129.8  172.4  81.0        162.8
9       92.9        119.4      120.4      113.6  133.3  71.3        131.9
10      100.0       100.0      100.0      100.0  100.0  100.0       100.0

Table 11. Average Lifted Profit (Dollars) on the Testing Data for Study Three

Decile  LR (orig.)  LR (bal.)  Exp. cost  AdaC2    PS-LR     NB (orig.)  PS-NB
1       1,504.0     6,584.4    5,473.3    4,699.7  8,262.0   35.4        6,929.9
2       2,253.1     7,889.4    6,684.7    4,501.6  9,959.4   511.2       8,318.8
3       2,280.8     8,553.4    7,209.4    4,590.2  10,614.3  809.3       8,268.0
4       2,574.3     8,705.6    7,598.1    4,238.1  10,655.6  1,031.4     8,896.6
5       2,563.6     8,275.3    7,090.3    3,914.5  9,944.8   114.7       8,667.3
6       2,127.4     6,951.5    6,196.4    3,735.9  9,287.0   172.4       8,256.5
7       1,946.0     5,755.0    4,976.5    3,428.9  7,739.2   –231.5      6,854.6
8       94.3        4,410.8    3,990.3    2,520.6  6,115.8   –1,605.7    5,308.9
9       –671.9      1,846.5    1,936.5    1,298.5  3,163.5   –2,727.9    3,031.0
10      0.0         0.0        0.0        0.0      0.0       0.0         0.0

Table 12. Model Performance with Varying Degrees of Class Imbalance for Study Three on Testing Data

Percent          LR (bal.)          Exp. cost          AdaC2              PS-LR              PS-NB
positive cases   Resp.    Profit    Resp.   Profit     Resp.   Profit     Resp.    Profit    Resp.    Profit
5                146.6    723.5     86.1    618.3      67.3    544.4      125.5    882.4     116.8    756.2
4                147.0    707.6     87.0    588.4      71.2    575.8      125.3    880.5     117.5    739.3
3                140.7    652.6     83.4    575.0      74.9    602.2      120.9    813.4     116.2    727.2
2                138.2    634.8     81.0    534.3      74.1    452.2      118.3    840.2     116.0    782.0
1                128.1    582.0     74.3    535.9      74.9    577.0      111.9    777.0     108.2    750.6
Mean             140.12   660.1     82.36   570.38     72.48   550.32     120.38   838.7     114.94   751.06

Notes: LR = logistic regression; Exp. cost = expected cost; PS = priority sampling; NB = naive Bayes; Resp. = response lift; Profit = profit lift. Lifts are for the top decile.

Overall, logistic regression with rebalanced data has the highest average response lift in the top decile (mean = 140), while the expected cost approach and AdaC2 are much worse than the random model (means = 82 and 72, respectively). In terms of profit lift, logistic regression using balanced data (mean = 660) also performs better than these two methods (means = 570 and 550, respectively), which may suffer from overfitting.
On average, priority sampling records the highest top-decile profit lift (839), which is 52 percent higher than AdaC2, 47 percent higher than the expected cost method, and 27 percent higher than the logistic regression model. The superior performance of priority sampling is consistent across data sets with different degrees of class imbalance, even when the ratio of positive cases is very low (e.g., 1 percent and 2 percent). The naive Bayes model using priority sampling also records significantly higher profit lift than the competing methods, and its advantage grows as the class distribution becomes more unbalanced. These results indicate that priority sampling consistently delivers superior performance, providing better rankings of customer value and augmenting profitability across data sets with varying degrees of class imbalance.

Conclusions

The combination of undersampling of negative cases and priority sampling of positive cases provides a viable and superior solution to the cost-sensitive learning problem associated with highly unbalanced class distribution and the unknown and skewed costs of false negative errors. It helps to improve the accuracy of predicting the small number of high-value customers while minimizing the overfitting problem. Moreover, priority sampling using the profit quantity as a probability distribution is intuitively appealing and efficient to implement. Ensemble learning helps to reduce the variation among the subsamples and to improve the accuracy of parameter estimates. The validation results suggest that priority sampling consistently achieves significant improvement in augmenting profit over the alternative methods. Its advantages are apparent even with data of highly unbalanced class distribution and extend to other classification tools such as naive Bayes.
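The procedure summarized above — sampling positive cases with probability proportional to their profit, combining them with an equal-sized random undersample of negatives, and repeating to form subsamples for an ensemble — can be sketched in a few lines. This is a minimal illustration under assumed data structures (plain lists of cases and per-case profits), not the authors' implementation:

```python
import random

def priority_sample(positives, profits, k, rng):
    """Draw k positive cases with replacement, with probability
    proportional to each case's profit (its priority)."""
    total = sum(profits)
    weights = [p / total for p in profits]
    return rng.choices(positives, weights=weights, k=k)

def build_training_sets(positives, profits, negatives, n_sets=10, seed=0):
    """Pair each priority sample of positives with an equal-sized
    random undersample of negatives; repeat n_sets times so an
    ensemble can be trained on the subsamples and averaged."""
    rng = random.Random(seed)
    n = len(positives)
    sets = []
    for _ in range(n_sets):
        pos = priority_sample(positives, profits, n, rng)
        neg = rng.sample(negatives, min(n, len(negatives)))
        sets.append(pos + neg)
    return sets
```

Because high-profit positives are drawn more often, each subsample is tilted toward the customers whose false-negative cost is largest, while the undersampled negatives keep the classes balanced.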
Overall, priority sampling offers a robust and general solution to cost-sensitive learning with unknown and variant costs of false negative errors. Although the issue of cost sensitivity arises from the discussion of misclassification errors, cost-sensitive learning in this case, ironic as it may seem, does not necessarily improve classification accuracy or the true positive rate. This is not surprising, because the goal of cost-sensitive learning is not to improve classification accuracy but to produce better rankings of customer value, consistent with the ultimate goal of augmenting the return on marketing investment under a budget constraint. Meanwhile, the findings indicate that although the simple classification accuracy of a response model can be improved, a model lacking cost sensitivity yields lower profit when the class distribution is highly unbalanced and the profit distribution is heavily skewed. Thus, it is essential that decision makers adopt appropriate cost-sensitive solutions in such cases.

The results also reveal that cost-sensitive methods are themselves "sensitive" to the distribution of classes and costs. The performance of the competing models differs depending on how they treat the unbalanced distribution of classes and costs in the training process. By comparison, priority sampling provides a viable solution to cost-sensitive learning when cost-based learning is not feasible because the costs of false negatives are unknown. The sensitivity analysis in study three indicates that the benefits of priority sampling are especially evident for data sets with more unbalanced class distributions. Priority sampling consistently achieves significant improvement over the competing methods in augmenting profitability without overfitting the predictive model.
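The point that accuracy and value ranking can disagree is easy to demonstrate with a small synthetic example; the customers, scores, and profits below are invented purely for illustration:

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of customers classified correctly at the threshold."""
    return sum((s >= threshold) == bool(y)
               for s, y in zip(scores, labels)) / len(labels)

def top_k_profit(scores, profits, k):
    """Profit captured by contacting the k highest-scored customers."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(profits[i] for i in order[:k])

# Ten customers: labels mark buyers; profits are heavily skewed.
labels  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
profits = [500, 5, 5, 0, 0, 0, 0, 0, 0, 0]

# Model A classifies every case correctly but scores the $500 buyer
# lowest among the buyers; Model B misclassifies two small buyers
# yet ranks the $500 buyer first.
model_a = [0.55, 0.9, 0.8, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4]
model_b = [0.95, 0.4, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.05]

print(accuracy(model_a, labels), top_k_profit(model_a, profits, 2))  # 1.0, 10
print(accuracy(model_b, labels), top_k_profit(model_b, profits, 2))  # 0.8, 505
```

With a budget of two contacts, the less "accurate" model B captures $505 of profit against model A's $10 — the same inversion the findings above describe.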
Superior cost-sensitive models allow marketers to contact fewer customers yet achieve higher returns on investment, enabling a more cost-effective use of precious marketing resources. This approach can also help alleviate customer fatigue, a common problem in direct marketing operations.

Notes

1. The training and testing data sets with the variables can be found at http://cptra.ln.edu.hk/~mlwong/JMIS2012.
2. The training data sets can be found at http://cptra.ln.edu.hk/~mlwong/JMIS2012.

References

1. Baesens, B.; Viaene, S.; Van den Poel, D.; Vanthienen, J.; and Dedene, G. Bayesian neural network learning for repeat purchase modelling in direct marketing. European Journal of Operational Research, 138, 1 (2002), 191–211.
2. Bansal, G.; Sinha, A.P.; and Zhao, H. Tuning data mining methods for cost-sensitive regression: A study in loan charge-off forecasting. Journal of Management Information Systems, 25, 3 (Winter 2008–9), 315–336.
3. Berger, P., and Magliozzi, T. The effect of sample size and proportion of buyers in the sample on the performance of list segmentation equations generated by regression analysis. Journal of Direct Marketing, 6, 1 (1992), 13–22.
4. Bhattacharya, S. Direct marketing performance modeling using genetic algorithms. INFORMS Journal on Computing, 11, 3 (1999), 248–257.
5. Blundell, R.W., and Powell, J.L. Endogeneity in semiparametric binary response models. Review of Economic Studies, 71, 7 (2004), 655–679.
6. Breiman, L. Bagging predictors. Machine Learning, 24, 2 (1996), 123–140.
7. Bult, J.R., and Wansbeek, T. Optimal selection for direct mail. Marketing Science, 14, 4 (1995), 378–394.
8. Chawla, N.; Bowyer, K.; Hall, L.; and Kegelmeyer, W. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16 (2002), 341–378.
9. Cui, G.; Wong, M.L.; and Lui, H.-K.
Machine learning for direct marketing response models: Bayesian networks with evolutionary programming. Management Science, 52, 4 (2006), 597–612.
10. Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 1999, pp. 155–164.
11. Drummond, C., and Holte, R.C. C4.5, class imbalance, and cost sensitivity: Why undersampling beats over-sampling. Paper presented at the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Datasets II, 2003.
12. Elkan, C. The foundations of cost-sensitive learning. In B. Nebel (ed.), Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann, 2001, pp. 973–978.
13. Fan, W.; Stolfo, S.J.; Zhang, J.; and Chan, P.K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning, Bled, Slovenia. San Francisco: Morgan Kaufmann, 1999, pp. 97–105.
14. Freund, Y., and Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 1 (1997), 119–139.
15. Hastie, T.; Tibshirani, R.; and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2d ed. New York: Springer-Verlag, 2008.
16. Haughton, D., and Oulabi, S. Direct marketing modeling with CART and CHAID. Journal of Direct Marketing, 11, 4 (1997), 42–52.
17. Mitchell, T. Machine Learning. New York: McGraw-Hill, 1997.
18. Mulhern, F.J. Customer profitability analysis: Measurement, concentration, and research directions. Journal of Interactive Marketing, 13, 1 (1999), 25–40.
19. Rao, V.R., and Steckel, J.H.
Selecting, evaluating, and updating prospects in direct mail marketing. Journal of Direct Marketing, 9, 2 (1995), 20–31.
20. Shmueli, G.; Patel, N.R.; and Bruce, P.C. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. New York: John Wiley & Sons, 2007.
21. Sinha, A.P., and May, J.H. Evaluating and tuning predictive data mining models using receiver operating characteristic curves. Journal of Management Information Systems, 21, 3 (Winter 2004–5), 249–280.
22. Smith, R.J., and Blundell, R.W. An exogeneity test for a simultaneous equation Tobit model with an application to labor supply. Econometrica, 54 (May 1986), 679–685.
23. Sun, Y.; Kamel, M.S.; Wong, A.K.C.; and Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40, 12 (2007), 3358–3378.
24. Viaene, S., and Dedene, G. Cost-sensitive learning and decision making revisited. European Journal of Operational Research, 166, 1 (2005), 212–220.
25. Weiss, G.M. Mining with rarity: A unifying framework. ACM SIGKDD Explorations Newsletter, 6, 1 (2004), 7–19.
26. Zadrozny, B., and Elkan, C. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2001, pp. 204–213.
27. Zadrozny, B.; Langford, J.; and Abe, N. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the Third IEEE International Conference on Data Mining. Los Alamitos, CA: IEEE Computer Society, 2003, pp. 435–442.
28. Zahavi, J., and Levin, N. Applying neural computing to target marketing. Journal of Direct Marketing, 11, 4 (1997), 76–93.
29. Zhao, H.; Sinha, A.P.; and Bansal, G. An extended tuning method for cost-sensitive regression and forecasting.
Decision Support Systems, 51, 3 (2011), 372–383.

Appendix A: p-Values of Comparing Priority Sampling for Logistic Regression with Other Methods for Study One

Model/decile                           1          2          3          4          5          6          7          8          9    10
Logistic regression (original)   0.010954   6.15E-05   0.009198   0.365295   0.145305   0.064375   0.029054   0.012017   0.023987    NA
Logistic regression (balanced)   0.00372    0.381544   0.238272   0.009497   0.032888   0.156633   0.037921   0.338579   0.336274    NA
Expected cost                    0.345592   0.060218   0.441884   0.29392    0.382724   0.137458   0.387875   0.17795    0.017021    NA
AdaC2                            0.012995   0.000134   0.016545   0.489565   0.146169   0.071175   0.048697   0.008799   0.021667    NA

Note: NA = not applicable.

Appendix B: p-Values of Comparing Priority Sampling with Naive Bayes for Study One

Decile    p-value
1         4.37E-06
2         2.06E-06
3         2.81E-05
4         1.87E-05
5         7.19E-06
6         0.012385
7         0.063732
8         0.028763
9         0.000293
10        NA

Note: NA = not applicable.

Appendix C: Variables Used for Study Three

Variable name                  Meaning
MONTHS_SINCE_ORIGIN            Elapsed time since first donation
IN_HOUSE                       Is in-house donor?
OVERLAY_SOURCE                 M = Metromail, P = Polk, B = both
DONOR_AGE                      Age as of June 1997
DONOR_GENDER                   Actual or inferred gender
PUBLISHED_PHONE                Published telephone listing
HOME_OWNER                     Is home owner?
MOR_HIT                        Mail order response hit rate
CLUSTER                        54 socioeconomic cluster codes
SES                            5 socioeconomic cluster codes
INCOME                         7 income group levels
MED_HOUSEHOLD_INCOME           Median income
PER_CAPITA_INCOME              Income per capita
WEALTH                         10 wealth rating groups
MED_HOME_VALUE                 Median home value
PCT_OWNER_OCCUPIED             Percent owner-occupied housing
URBANICITY                     U = urban, C = city, S = suburban, T = town, R = rural, M = unknown
PCT_MALE_MILITARY              Percent male military in block
PCT_MALE_VETERANS              Percent male veterans in block
PCT_VIETNAM_VETERANS           Percent Vietnam veterans in block
PCT_WWII_VETERANS              Percent World War II veterans in block
NUMBER_PROM_12                 Number of promotions in the last 12 months
CARD_PROM_12                   Number of card promotions in the last 12 months
FREQ_STATUS_97NK               Frequency status, June 1997
RECENCY_STATUS_96NK            Recency status, June 1996
LAST_GIFT_AMT                  Amount of the most recent donation
RECENT_RESPONSE_COUNT          Recent response count
RECENT_RESPONSE_PROP           Recent response proportion
RECENT_AVG_GIFT_AMT            Recent average gift amount
RECENT_STAR_STATUS             Recent STAR status (1 = yes, 0 = no)
CARD_RESPONSE_COUNT            Response count since June 1994
CARD_RESPONSE_PROP             Response proportion since June 1994
CARD_AVG_GIFT_AMT              Average gift amount since June 1994
PROM                           Total number of promotions
GIFT_COUNT                     Total number of donations
AVG_GIFT_AMT                   Overall average gift amount
GIFT_AMOUNT                    Total gift amount
MAX_GIFT                       Maximum gift amount
GIFT_RANGE                     Maximum less minimum gift amount
MONTHS_SINCE_FIRST             First donation date from June 1997
MONTHS_SINCE_LAST              Last donation date from June 1997
PEP_STAR                       STAR status ever (1 = yes, 0 = no)
CARD_PROM                      Number of card promotions
MIN_GIFT                       Minimum gift amount
TARGET_D (dependent variable)  Response amount to June 1997 solicitation
TARGET_B (dependent variable)  Response to June 1997 solicitation