Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Some Applications of Data Mining Tools in Database Marketing customer as a percentage of the total number of customers classified. It is also found that different levels of classification accuracy can be achieved by specifYing different levels of relative misclassification costs in different classes (prospect vs. non-prospect). There exists a tradeoffbetween misclassification rates for the two classes. The subjective judgment is required in determining the optimal point on the trade-off curve under different business circumstances, which will be discussed in greater details later. Jianmm Liu, Computer Information System, Golden Gate University, San Francisco Abstract This paper presents some results of an empirical study on database marketing using CART, artificial intelligence-based algorithm (AI) and Legit model. The robustness of the three different approaches is evaluated using a large dataset. Decision Treefor Customer Response to Electronic Ccmmerce (EC) The classification tree is used to identify the important variables and demographic characteristics in customer response to EC. The CART software is used to implement the nonparametric statistical algorithm developed by Breiman et al. (1984). The following Diagram shows the selected decision tree of CART. Introduction This study attempts to develop an analytical framework for database marketing. The goal is to establish a set of criteria. The selected criteria need to reflect the dynamics of market structure, demand for EC, new technology and, most importantly, the difference in various characteristics between prospect and non-prospect customers ofEC. Using econometric and data mining techniques, it is found that there exist some patterns in customer profile associated with customers' response to EC. Successfully identifYing these patterns is the core of the task. In the Diagram, the ellipse shapes represent terminal nodes of the tree. Rectangle shapes represent non-terminal nodes. There are total of thirteen terminal nodes in this decision tree. The relative misclassification costs are set equal fcir both classes (prospect vs. non-prospect). In other words, the classification errors in both classes are equally serious. There are about 120,000 observations in the data set. This data set is randomly split into two sets, learning set and test set, where 65% of data in the original set is in the learning set to be used for developing the classification tree. The rest of the data in the original data set is in the test set for testing the robustness of the tree and calculate the misclassification rate. In this study various social economic characteristics are investigated. Some characteristics of rmancial products and services used by these customers are also examined. Ttests are conducted on the means of the selected variables of the customer and non-customer ofEC under the null hypothesis that the difference in the value of the means is not significantly different from zero. This null hypothesis is rejected at a 1% significance level, indicating that the two customer groups had some different characteristics. These characteristics are used as explanatory variables in the customer prospect model. The results of CART analysis are compared to those of the artificial intelligence-based algorithm and the conventionallogit model. Simulations of CART and AI analysis are conducted to assess the robustness of the methodologies. It is found that an overall 75% accuracy rate, or 25% misclassification rate, can be achieved applying CART to the data, where the misclassification rate is dermed as thenumber of mis-classified The statistics of misclassification rate of the test set is reported below. The misclassification rate for EC-prospect class for the tree displayed in the Diagram is about 7.8%, while about 35% for nonprospect class. The overall misclassification rate is about 24.2% for the tree. The interpretation of misclassification of prospect is that we can be about 92% sure when we classifY a group of customers as prospects. 7.8% of these classified prospects are mis-classified. Similarly, the interpretation of misclassification of non-prospect is that we can be about 65% sure when we consider a group of customers as non-ECprospects. 35% of these classified non-prospects 232 are mis-classified. In other words, they in fact acceptEC. misclassification of both EC and non-EC prospects is equally costly, the relative misclassification cost ratio should be set to one (for binary case). Ifwe consider the misclassification ofEC prospects is more costly than that of non-EC, than the misclassification cost ratio ofEC over non-EC prospects can be set to greater than one. Relative Misclassification Costs The variable misclassification cost can be altered in order to adjust the misclassification rate of corresponding classes. There exists a trade-off between the misclassification cost in both classes. If the relative misclassification cost ratio for the two classes is altered (deviates away from I), the misclassification rate for each class varies accordingly, and so does the corresponding overall misclassification rate of the tree. Typically, if the relative misclassification cost for one class become higher (more serious), the misclassification rate of this class will decrease and the misclassification rate for the other class will increase. As a result, the overall misclassification rate of the tree will vary as well. If we specifY a greater value to the variable misclassification cost ofnon-EC prospects, therefore allowing the misclassification rate of EC-household class to increase, the misclassification rate ofnon-EC-prospect class will decrease accordingly. Table I shows the trade-off of misclassification rates for EC and nonEC, and the relative cost ratio. Figure I shows the trade-off of the misclassification rate between EC and non-EC prospect graphicalIy. As it can be seen, the misclassification rate varies as the relative-cost ratio varies. The appropriate classification cost specification is based on a subjective judgment. If the TABLE 1. MISCLASSIFICATION RATE AND VARIABLE MIS CLASSIFICATION COST RATIO OF CART ANALYSIS EC MISNON-EC MISEC/NON-EC NON-EC EC COST CLASSIFICATION CLASSIFICATION COST RA COST 0.4565 3.0000 1.0 3.0 0.0193 no 0.0262 0.0265 0.0600 0.0780 0.1160 0.1490 0.1838 0.2760 0.3255 0.4140 0.5230 0.4413 0.4405 0.3800 0.3550 0.3200 0.2918 0.2595 0.2080 0.1766 0.1400 0.1000 2.5000 2.0000 1.5000 1.0000 0.8333 0.5556 0.6667 0.7692 0.5000 0.4000 0.3333 1.0 1.0 1.0 1.0 1.2 1.8 1.5 1.3 2.0 2.5 3.0 2.5 2.0 1.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 misclassification cost ratio of the two classes, CART generates different decision trees with different overall misclassification rates. A series of simulations are conducted on CART under different specifications of relative misclassification cost for EC and non-EC classes. The misclassification rates of both learning set and test set are calculated in CART and the difference between the learning set and the test set is not significant, indicating that the decision tree is quite stable and robust. Results OfAI-based Algorithm (Ill) and Comparison to CART The same data set used in CART analysis is used for IH. The forecasted value for BC and non-Be prospects in the test set is continuous in value within the range of (0, 1), with the low value as the non-EC and high value as EC. The cut-off line is pre-specified based on the subjective judgment. For example, if the value ofp is specified as the cut-off threshold, then if the forecasted value of As mentioned earlier, the selected tree has a misclassification rate of7.9% for EC-prospect class and 35.5% for non-EC class, where the variable misclassification cost ratio of the two classes is one. For different levels of 233 dependent variable, household type (EC vs. nonEC) is equal or greater than p than the corresponding household is classified as an EChousehold. Otherwise, the household is classified as a non-EC prospect. The trade-off of misclassification rates can be obtained by altering the value of cut-off threshold variable,p. The misclassification rates for EC and non-EC are presented in Table 2. The relative cost ratio is calculated as the p/(l-p). From Table 2 it can be seen that when the misclassification rate for EC class is 20.2% the corresponding rate for non-EC is ab<>ut 25.7%, and the corresponding ratio of non-EC to EC is 1.86. When the rate for EC is 13%, the corresponding rate for non-EC is 29.6% and the corresponding relative cost ratio ofnon-EC to EC is 2.33. These misclassification rates vary as the ratio of non-EC relative cost to EC changes. Figure: shows the . trade-off relationship of misclassificstion rate between EC and non-EC prospect class for IH. TABLE 2. VARIABLE MISCLASSIFICATION COST AND MIS CLASSIFICATION COSTS (AI-Based Algorithm) EC 0.0580 0.0740 0.1300 0.2020 0.2390 0.2680 0.3010 0.3350 0.3660 0.3890 0.4030 0.4160 0.4290 NON-EC 0.3250 0.3160 0.2960 0.2570 0.2220 0.1930 0.1580 0.1240 0.0870 0.0590 0.0430 0.0290 0.0190 EC COST (p) 0.2000 0.2500 0.3000 0.3500 0.4000 0.4500 0.5000 0.5500 0.6000 0.6500 0.7000 0.7500 0.8000 NON-EC COST (1-p) 0.80 0.75 0.70 0.65 0.60 0.55 0.50 0.45 0.40 0.35 0.30 0.25 0.20 NON-EC I EC 4.00 3.00 2.33 1.86 1.50 1.22 1.00 0.82 0.67 0.54 0.43 0.33 0.25 No information is available on IH's methodology Comparison to CART besides a very general description. C~T is a • Performance As it can be seen in Figure 2, the trade-off curve of well known and documented method('iogy. misclassification rate of IH is similar to that Results ofLogit Model and Comparison to caT obtained from CART analysis. For example, in CART analysis, when the misclassification rate for Logit regression is a parametric appr"ach which is well understood and widely used in csregorical EC is 18.38%, the rate is about 26% for non-EC. When the misclassification rate for EC is 11.6%, data analysis for its simplicity and be:rer performance over linear probability model and the misclassification rate for non-EC is about discriminant analysis which have some significmt 32%. The performance of CART is slightly better weakness (see McFadden, 1982, MadJala. 1983. than IH based on the simulations conducted in this Chhikara, 1989 for example). In the ronteot of project. customer prospect model, the depencknt variable • Computer Running Time of the logit model takes on only two ,alues, 1 if.< CART runs much faster than IH. IH's optimizer requires overnight run for the data set used, while customer positively responds to EC (tvent), 0 otherwise (non-event). The data set liSed for CART typically requires 10-15 minutes for a run on a Pentium processor. logistic regression is the same as used for CART andIH. • Availability of information on the underlying algorithms 1. Model Specification The model specification of the logistic regression is as follows. P I-P log{--} = ao +a. x Income2k+a2 x Income5k+a3 x /ncome7k+a4 x IncomelOO 234 +a, x Micro02+a. x Micro09+a 7 x MicroI4+a" x Micro23+a9 x DDA+a,o x HHAGE (1), +a" x HHLOR+a'2 x HHLOAN + a13 ,x HHINVT + a 14 x Chiidren +a" x ACCTI+a,. x ACCT4PL where Income2k = I ifHHINCOME ::;; $29,999.00, otherwise 0 Income5k = I if$50,000 ::;; HHINCOME ::;; $74,999.00 otherwise 0 Income7k = I if$75,000 ::;; HHINCOME ::;; $99,999.00 otherwise 0 IncomeJOO = I ifHHINCOME 2! $100,060, otherwise 0 = 1 ifMICROCD = 02, otherwise 0 Micro02 Micro09 = 1 ifMICROCD = 09, otherwise 0 Microi4 = I ifMICROCD = 14, otherwise 0 Children = 1 if a household has Children, Micro23 = 1 ifMICROCD =23, otherwise 0 = 1 ifPtypel = DDA, otherwise 0 DDA otherwise 0 = HHAGE (numerical variable, same = 1 if ACCT = 1, otherwise 0 (ACCT: ACCTJ HHAGE as defined earlier) number of accounts per household) = HHLOR (numerical variable, same HHLOR ACCT4PL = 1 if ACCT ~ 4, otherwise 0 as defined earlier) The estimation algorithm of logit model is HHLOAN = household loan (in $ I 000) maximum likelihood. The estimates of parameters HHINVT = household investment (in $1000) are presented in Table 3. TABLE 3. Analysis of Maximum LikeUhood Estimates of Logistic Regression Parameter Standard Wald Pr > Standardized Variable DF Estimate Error Chi-Square Chi-Square Estimate INTERCPT INCOME2K INCOME5K INCOME7K INCOM100 MICR02 MICR09 MICR014 MICR023 DEPOST4 DEPOST6 DEPOST7 ACCTI DDA HHAGE HHLOR HHLOAN HHINVT CHILDREN 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.0735 -3.2402 -0.3635 0.0258 0.0233 0.0628 0.0247 0.2941 0.0271 0.4316 0.0218 0.3590 0.2000 0.0294 0.0243 0.2422 0.0231 -0.1008 2.0347 0.3156 0.3371 1.7877 0.0822 2.2836 -0.0176 0.0165 0.0544 2.7736 -0.0340 0.000664 -0.0521 0.00160 0.00244 0.000203 0.00099 0.00022 0.0311 0.0165 1942.3988 198.7202 7.2580 142.2358 254.3452 270.0581' 46.3203 99.2254 18.9801 41. 5539 28.1224 771. 0284 1.1361 2602.8287 2622.0069 1058.3629 144.9447 20.2957 3.5631 From Table 3 it can be seen that all the parameter estimates are significant at a 1% level except for ACCTJ. INCOME2Khas a negative sign, indicating that the fact that a household's annual income is equal or less than $2,000 negatively contributes to the probability that this ,household will positively respond to EC. This finding is consistent to what is found in the descriptive statistics of customer profile, where EC prospects tend to have higher level of annual income than non-EC prospects. Similarly, INCOME5K and INCOME7K have a positive sign, indicating that 0.0001 0.0001 0.0071 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.2865 0.0001 0.0001 0.0001 0.0001 0.0001 0.0591 Odds Ratio . -0.078517 0.015041 0.066538 0.085301 0.072711 0.027792 0.041521 -0.018350 0.035512 0.028308 0.565594 -0.004844 0.718159 -0.253968 -0.153035 0.054857 0.023634 0.008066 0.695 1. 065 1.342 1. 540 1.432 1. 221 1.274 0.904 7.650 5.976 9.812 0.983 16.016 0.967 0.949 1.002 1.001 1.032 prospects with higher level of annual income is more likely to accept EC, other things being equal. The adjusted R2 of the logistic regression is 0.47. The misclassification rate for both EC and non-EC class are obtained by matching the estimated event probability to the cut-off value (externally specified, called P-hat). If the estimated probability is greater or equal than P-hat, then the case is classified as the EC, otherwise classified as non-EC prospect. 235 FIGURE 1 MISCLASSIFICATION RATES OF CART 0.5 0.45 0.4 l~ 0.35 .. 15 0.3 0.25 u!E 0.2 w" • III 0.15 15z Ii: 0.1 i 0.05 0 .s .tl 0.1 0 EC-P~pect 0.2 0.4 0.5 0.6 Misclassiflcation Rate FIGURE 2 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.4 0.45 0.5 EC Prospect Misclassification Rate FIGURE 3 MISCLASSIFICATION RATES OF LOGIT MODEL 0.4 Qe35 -S'" S. ~.3 .. 1825 e '" ~1·2 ay t!15 c_ il ~.1 '-os 0.05 0.1 0.15 0.2 0.25 0.3 EC Prospect Misclassiflcation Rate 236 0.35 model. for misclassification rate of 22.9% for EC, It can be seen that the misclassification rate for the corresponding misclassification rate for nonboth classes is a function of cut-off value, P-hat. EC is 35.74%. In CART analysis, when the Table 4 shows the classification rate for EC and misclassification rate forEC is 18.38%, the non-EC and the corresponding values ofP"hat as corresponding rate for non-EC is only 25.9%, well as the ratio ofP-hat over (l-P-hat). Figure 3 about 10% more accurate than logit model. Figure is a graphical presentation of the trade-off curve 3 is the graphical presentation of the relationship between misclassification rates ofEC and non-EC between the misclassification rates ofEC and nonclass. EC class. Comparing Table I to Table 4, it can be found that CART outperforms logit model. With the logit TABLE 4. VARIABLE MISCLASSIFICATION COST AND MISCLASSIFICATION COSTS OF LOGIT MODEL EC 0.4487 0.4319 0.4121 0.4006 0.3882 0.3736 0.3579 0.3373 0.3134 0.2896 0.2641 0.2466 0.2372 0.2290 NON-EC 0.0107 0.0176 0.0342 0.0510 0.0766 0.1083 0.1443 0.1828 0.2238 0.2666 0.3070 0.3377 0.3536 0.3574 P-hat (p) 0.1000 0.2000 0.3000 0.3500 0.4000 0.4500 0.5000 0.5500 0.6000 0.6500 0.7000 0.7500 0.8000 0.8500 RATIO=(1-p)/p 9.0000 4.0000 2.3333 1.8571 1.5000 1.2222 1.0000 0.8182 0.6667 0.5385 0.4286 0.3333 0.2500 0.1765 Conclusion In summary, it is found that CART has the best performance in terms of customer classification and the AI-based algorithm is the second best. The logit model performed poorly compared to the alternatives. However, the superiority of a specific algorithm is case dependent. With a different set of data and variables, the robustness of different algorithms need to be reexamined. Analysis of Discrete Data with Econometric Applications, ed. By C.F. Manski and D. McFadden, MIT Press, 1982 Mezrich, J., "When Is a Tree a Hedge?", Financial Analysts Journal, 50 (6), pp. 75-81,1994 Ripley, B., "Pattern Recognition and Neural Networks," Cambridge University Press, 1996 References Breiman, L., Friedman, J., Olshen, R., Stone, C., Classification and Regression Trees, Wadsworth International Group, Belmont, California, 1984 Chhikara, K., "The State of the Art in Credit Evaluation," American Journal ojAgricultural Economics, Vol. 71, pp. 1138-44, 1989 Grinold, R. and Kahn, R., Active Portfolio Management, IRWIN, 1995 Maddala, G., Limited Dependent and Qualitative Variables in Econometrics, Cambridge University Press, 1983 McFadden, D., "Econometric Analysis of Qualitative Response Models," in Structural The author can be reached at: Contact Jianmin Liu, Ph.D. Vice President Bank of America Mortgage, Unit 13592 50 California St., 14th Floor San Francisco, CA 94111 (415) 445-4407 jianmin [email protected] or [email protected] 237