Some Applications of Data Mining Tools in
Database Marketing
Jianmin Liu, Computer Information System, Golden Gate University,
San Francisco

Abstract
This paper presents some results of an empirical study on database
marketing using CART, an artificial intelligence-based algorithm (AI)
and the logit model. The robustness of the three different approaches
is evaluated using a large dataset.

Introduction
This study attempts to develop an analytical framework for database
marketing. The goal is to establish a set of criteria. The selected
criteria need to reflect the dynamics of market structure, demand for
EC, new technology and, most importantly, the difference in various
characteristics between prospect and non-prospect customers of EC.
Using econometric and data mining techniques, it is found that there
exist some patterns in customer profile associated with customers'
response to EC. Successfully identifying these patterns is the core of
the task.

In this study various socioeconomic characteristics are investigated.
Some characteristics of financial products and services used by these
customers are also examined. T-tests are conducted on the means of the
selected variables for customers and non-customers of EC under the
null hypothesis that the difference in the means is zero. This null
hypothesis is rejected at a 1% significance level, indicating that the
two customer groups have some different characteristics. These
characteristics are used as explanatory variables in the customer
prospect model. The results of CART analysis are compared to those of
the artificial intelligence-based algorithm and the conventional logit
model. Simulations of CART and AI analysis are conducted to assess the
robustness of the methodologies. It is found that an overall 75%
accuracy rate, or 25% misclassification rate, can be achieved by
applying CART to the data, where the misclassification rate is defined
as the number of misclassified customers as a percentage of the total
number of customers classified. It is also found that different levels
of classification accuracy can be achieved by specifying different
levels of relative misclassification costs in different classes
(prospect vs. non-prospect). There exists a trade-off between
misclassification rates for the two classes. Subjective judgment is
required in determining the optimal point on the trade-off curve under
different business circumstances, which will be discussed in greater
detail later.

Decision Tree for Customer Response to Electronic Commerce (EC)
The classification tree is used to identify the important variables
and demographic characteristics in customer response to EC. The CART
software is used to implement the nonparametric statistical algorithm
developed by Breiman et al. (1984). The following Diagram shows the
selected decision tree of CART.

In the Diagram, the ellipse shapes represent terminal nodes of the
tree. Rectangle shapes represent non-terminal nodes. There are a total
of thirteen terminal nodes in this decision tree. The relative
misclassification costs are set equal for both classes (prospect vs.
non-prospect). In other words, the classification errors in both
classes are equally serious. There are about 120,000 observations in
the data set. This data set is randomly split into two sets, a
learning set and a test set, where 65% of the data in the original set
is in the learning set to be used for developing the classification
tree. The rest of the data in the original data set is in the test set
for testing the robustness of the tree and calculating the
misclassification rate.

The misclassification rates on the test set are reported below. The
misclassification rate for the EC-prospect class for the tree
displayed in the Diagram is about 7.8%, while it is about 35% for the
non-prospect class. The overall misclassification rate is about 24.2%
for the tree. The interpretation of the misclassification of prospects
is that we can be about 92% sure when we classify a group of customers
as prospects; 7.8% of these classified prospects are misclassified.
Similarly, the interpretation of the misclassification of
non-prospects is that we can be about 65% sure when we consider a
group of customers as non-EC prospects; 35% of these classified
non-prospects are misclassified. In other words, they in fact accept
EC.

Relative Misclassification Costs
The variable misclassification cost can be altered in order to adjust
the misclassification rate of the corresponding classes. There exists
a trade-off between the misclassification rates of the two classes. If
the relative misclassification cost ratio for the two classes is
altered (deviates away from 1), the misclassification rate for each
class varies accordingly, and so does the corresponding overall
misclassification rate of the tree. Typically, if the relative
misclassification cost for one class becomes higher (more serious),
the misclassification rate of this class will decrease and the
misclassification rate for the other class will increase. As a result,
the overall misclassification rate of the tree will vary as well. If
we assign a greater value to the variable misclassification cost of
non-EC prospects, thereby allowing the misclassification rate of the
EC-household class to increase, the misclassification rate of the
non-EC-prospect class will decrease accordingly. Table 1 shows the
trade-off of misclassification rates for EC and non-EC, and the
relative cost ratio. Figure 1 shows the trade-off of the
misclassification rate between EC and non-EC prospects graphically. As
can be seen, the misclassification rate varies as the relative cost
ratio varies.

The appropriate classification cost specification is based on a
subjective judgment. If the misclassification of both EC and non-EC
prospects is equally costly, the relative misclassification cost ratio
should be set to one (for the binary case). If we consider the
misclassification of EC prospects to be more costly than that of
non-EC prospects, then the misclassification cost ratio of EC over
non-EC prospects can be set to greater than one.
TABLE 1. MISCLASSIFICATION RATE AND VARIABLE
MISCLASSIFICATION COST RATIO OF CART ANALYSIS

EC MIS-          NON-EC MIS-      EC/NON-EC    NON-EC   EC
CLASSIFICATION   CLASSIFICATION   COST RATIO   COST     COST
0.0193           0.4565           3.0000       1.0      3.0
0.0262           0.4413           2.5000       1.0      2.5
0.0265           0.4405           2.0000       1.0      2.0
0.0600           0.3800           1.5000       1.0      1.5
0.0780           0.3550           1.0000       1.0      1.0
0.1160           0.3200           0.8333       1.2      1.0
0.1490           0.2918           0.5556       1.8      1.0
0.1838           0.2595           0.6667       1.5      1.0
0.2760           0.2080           0.7692       1.3      1.0
0.3255           0.1766           0.5000       2.0      1.0
0.4140           0.1400           0.4000       2.5      1.0
0.5230           0.1000           0.3333       3.0      1.0
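The role of the cost ratio in choosing a point on the trade-off curve can be illustrated with a minimal Python sketch. It is not part of the original study. It takes the twelve (EC, non-EC) misclassification-rate pairs from Table 1 and picks the pair with the lowest expected cost. Two assumptions are made purely for illustration: each rate is treated as the error rate within its own class, and the share of EC prospects (`prior_ec`) is set to 0.5 because the class proportions are not reported. The function names are hypothetical.

```python
# Operating points copied from Table 1: (EC rate, non-EC rate).
TABLE_1_POINTS = [
    (0.0193, 0.4565), (0.0262, 0.4413), (0.0265, 0.4405), (0.0600, 0.3800),
    (0.0780, 0.3550), (0.1160, 0.3200), (0.1490, 0.2918), (0.1838, 0.2595),
    (0.2760, 0.2080), (0.3255, 0.1766), (0.4140, 0.1400), (0.5230, 0.1000),
]


def expected_cost(point, cost_ec, cost_non_ec, prior_ec=0.5):
    """Expected cost per customer at one operating point.

    prior_ec is an assumed share of EC prospects (not given in the paper).
    """
    ec_rate, non_ec_rate = point
    return (prior_ec * cost_ec * ec_rate
            + (1.0 - prior_ec) * cost_non_ec * non_ec_rate)


def cheapest_point(points, cost_ec, cost_non_ec, prior_ec=0.5):
    """Return the operating point with the lowest expected cost."""
    return min(points,
               key=lambda pt: expected_cost(pt, cost_ec, cost_non_ec, prior_ec))


if __name__ == "__main__":
    print(cheapest_point(TABLE_1_POINTS, cost_ec=1.0, cost_non_ec=1.0))
    print(cheapest_point(TABLE_1_POINTS, cost_ec=10.0, cost_non_ec=1.0))
```

Under these assumptions, equal costs select the 7.8% / 35.5% row of Table 1, and making EC errors much more costly moves the choice toward rows with a lower EC misclassification rate. A different assumed prior would shift the selected point, which is one way to see why the choice is described as a matter of judgment.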
As mentioned earlier, the selected tree has a misclassification rate
of 7.8% for the EC-prospect class and 35.5% for the non-EC class,
where the variable misclassification cost ratio of the two classes is
one. For different levels of misclassification cost ratio of the two
classes, CART generates different decision trees with different
overall misclassification rates.

A series of simulations are conducted on CART under different
specifications of relative misclassification cost for EC and non-EC
classes. The misclassification rates of both the learning set and the
test set are calculated in CART, and the difference between the
learning set and the test set is not significant, indicating that the
decision tree is quite stable and robust.

Results of AI-based Algorithm (IH) and Comparison to CART
The same data set used in CART analysis is used for IH. The forecasted
value for EC and non-EC prospects in the test set is continuous within
the range (0, 1), with low values indicating non-EC and high values
indicating EC. The cut-off line is pre-specified based on subjective
judgment. For example, if the value p is specified as the cut-off
threshold, then if the forecasted value of the dependent variable,
household type (EC vs. non-EC), is equal to or greater than p, the
corresponding household is classified as an EC household. Otherwise,
the household is classified as a non-EC prospect. The trade-off of
misclassification rates can be obtained by altering the value of the
cut-off threshold variable, p. The misclassification rates for EC and
non-EC are presented in Table 2. The relative cost ratio of non-EC to
EC is calculated as (1-p)/p.

From Table 2 it can be seen that when the misclassification rate for
the EC class is 20.2%, the corresponding rate for non-EC is about
25.7%, and the corresponding cost ratio of non-EC to EC is 1.86. When
the rate for EC is 13%, the corresponding rate for non-EC is 29.6% and
the corresponding relative cost ratio of non-EC to EC is 2.33. These
misclassification rates vary as the ratio of non-EC relative cost to
EC changes. Figure 2 shows the trade-off relationship of the
misclassification rate between the EC and non-EC prospect classes for
IH.
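The cut-off rule described above can be sketched in a few lines of Python. This is only an illustration: the scores and labels are made-up toy values, the function names are hypothetical, and the per-class rate is assumed to be the share of actual class members that receive the wrong label. The cost ratio follows the non-EC / EC column of Table 2, that is (1-p)/p.

```python
def classify(scores, p):
    """Label a forecast as EC when it is >= the cut-off p, else non-EC."""
    return ["EC" if s >= p else "non-EC" for s in scores]


def misclassification_rates(actual, predicted):
    """Per-class share of actual members that received the wrong label."""
    rates = {}
    for cls in ("EC", "non-EC"):
        members = [pred for act, pred in zip(actual, predicted) if act == cls]
        wrong = sum(1 for pred in members if pred != cls)
        rates[cls] = wrong / len(members) if members else float("nan")
    return rates


def non_ec_to_ec_ratio(p):
    """Relative cost ratio of non-EC to EC implied by the cut-off p."""
    return (1.0 - p) / p


# Toy, made-up forecasts and true classes (not data from the study).
scores = [0.10, 0.45, 0.55, 0.80, 0.30, 0.70]
actual = ["non-EC", "EC", "non-EC", "EC", "non-EC", "EC"]
predicted = classify(scores, 0.5)
rates = misclassification_rates(actual, predicted)

if __name__ == "__main__":
    print(predicted, rates)
    print(round(non_ec_to_ec_ratio(0.35), 2))
```

Re-running `classify` over a grid of cut-off values and recording the two rates each time would trace out a trade-off curve of the kind tabulated in Table 2.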
TABLE 2. VARIABLE MISCLASSIFICATION COST AND
MISCLASSIFICATION COSTS (AI-Based Algorithm)

EC       NON-EC   EC COST (p)   NON-EC COST (1-p)   NON-EC / EC
0.0580   0.3250   0.2000        0.80                4.00
0.0740   0.3160   0.2500        0.75                3.00
0.1300   0.2960   0.3000        0.70                2.33
0.2020   0.2570   0.3500        0.65                1.86
0.2390   0.2220   0.4000        0.60                1.50
0.2680   0.1930   0.4500        0.55                1.22
0.3010   0.1580   0.5000        0.50                1.00
0.3350   0.1240   0.5500        0.45                0.82
0.3660   0.0870   0.6000        0.40                0.67
0.3890   0.0590   0.6500        0.35                0.54
0.4030   0.0430   0.7000        0.30                0.43
0.4160   0.0290   0.7500        0.25                0.33
0.4290   0.0190   0.8000        0.20                0.25
Comparison to CART
• Performance
As can be seen in Figure 2, the trade-off curve of the
misclassification rate of IH is similar to that obtained from CART
analysis. For example, in CART analysis, when the misclassification
rate for EC is 18.38%, the rate is about 26% for non-EC. When the
misclassification rate for EC is 11.6%, the misclassification rate for
non-EC is about 32%. The performance of CART is slightly better than
that of IH based on the simulations conducted in this project.
• Computer Running Time
CART runs much faster than IH. IH's optimizer requires an overnight
run for the data set used, while CART typically requires 10-15 minutes
for a run on a Pentium processor.
• Availability of information on the underlying algorithms
No information is available on IH's methodology besides a very general
description. CART is a well known and documented methodology.

Results of Logit Model and Comparison to CART
Logit regression is a parametric approach which is well understood and
widely used in categorical data analysis for its simplicity and better
performance over the linear probability model and discriminant
analysis, which have some significant weaknesses (see McFadden, 1982,
Maddala, 1983, Chhikara, 1989 for example). In the context of the
customer prospect model, the dependent variable of the logit model
takes on only two values: 1 if a customer positively responds to EC
(event), 0 otherwise (non-event). The data set used for logistic
regression is the same as that used for CART and IH.
1. Model Specification
The model specification of the logistic regression is as follows.

\[
\begin{aligned}
\log\left(\frac{P}{1-P}\right) = {} & a_0 + a_1 \times \mathit{Income2k} + a_2 \times \mathit{Income5k} + a_3 \times \mathit{Income7k} + a_4 \times \mathit{Income100} \\
& + a_5 \times \mathit{Micro02} + a_6 \times \mathit{Micro09} + a_7 \times \mathit{Micro14} + a_8 \times \mathit{Micro23} + a_9 \times \mathit{DDA} + a_{10} \times \mathit{HHAGE} \\
& + a_{11} \times \mathit{HHLOR} + a_{12} \times \mathit{HHLOAN} + a_{13} \times \mathit{HHINVT} + a_{14} \times \mathit{Children} + a_{15} \times \mathit{ACCT1} + a_{16} \times \mathit{ACCT4PL}
\end{aligned}
\tag{1}
\]

where
Income2k  = 1 if HHINCOME <= $29,999.00, otherwise 0
Income5k  = 1 if $50,000 <= HHINCOME <= $74,999.00, otherwise 0
Income7k  = 1 if $75,000 <= HHINCOME <= $99,999.00, otherwise 0
Income100 = 1 if HHINCOME >= $100,000, otherwise 0
Micro02   = 1 if MICROCD = 02, otherwise 0
Micro09   = 1 if MICROCD = 09, otherwise 0
Micro14   = 1 if MICROCD = 14, otherwise 0
Micro23   = 1 if MICROCD = 23, otherwise 0
DDA       = 1 if Ptype1 = DDA, otherwise 0
HHAGE     = HHAGE (numerical variable, same as defined earlier)
HHLOR     = HHLOR (numerical variable, same as defined earlier)
HHLOAN    = household loan (in $1000)
HHINVT    = household investment (in $1000)
Children  = 1 if a household has children, otherwise 0
ACCT1     = 1 if ACCT = 1, otherwise 0 (ACCT: number of accounts per
            household)
ACCT4PL   = 1 if ACCT >= 4, otherwise 0

The estimation algorithm of the logit model is maximum likelihood. The
estimates of the parameters are presented in Table 3.
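To make the specification concrete, the following Python sketch shows how the income dummy variables can be derived from HHINCOME and how a fitted log-odds value converts into a response probability. It is a hedged illustration only: the function names are hypothetical, the $100,000 lower bound for Income100 is assumed from the adjacent income bracket, and any coefficient values passed in are up to the caller rather than taken from the paper.

```python
import math


def income_dummies(hhincome):
    """Income indicators following the definitions under equation (1).

    Households between $30,000 and $49,999 get all zeros, which makes that
    bracket the reference category.
    """
    return {
        "Income2k": 1 if hhincome <= 29999 else 0,
        "Income5k": 1 if 50000 <= hhincome <= 74999 else 0,
        "Income7k": 1 if 75000 <= hhincome <= 99999 else 0,
        "Income100": 1 if hhincome >= 100000 else 0,
    }


def response_probability(intercept, coefficients, features):
    """Invert log(P / (1 - P)) = a0 + sum(a_i * x_i) to obtain P."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    x = income_dummies(60000)
    # Hypothetical coefficients, for illustration only.
    demo_coefs = {"Income2k": -0.4, "Income5k": 0.1,
                  "Income7k": 0.3, "Income100": 0.4}
    print(x, response_probability(-1.0, demo_coefs, x))
```

Because the logistic function maps any real-valued linear predictor into (0, 1), the output can be compared directly with a cut-off value when classifying households.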
TABLE 3. Analysis of Maximum Likelihood Estimates of Logistic Regression

                Parameter   Standard   Wald         Pr >         Standardized   Odds
Variable   DF   Estimate    Error      Chi-Square   Chi-Square   Estimate       Ratio
INTERCPT   1    -3.2402     0.0735     1942.3988    0.0001       .
INCOME2K   1    -0.3635     0.0258      198.7202    0.0001       -0.078517       0.695
INCOME5K   1     0.0628     0.0233        7.2580    0.0071        0.015041       1.065
INCOME7K   1     0.2941     0.0247      142.2358    0.0001        0.066538       1.342
INCOM100   1     0.4316     0.0271      254.3452    0.0001        0.085301       1.540
MICRO02    1     0.3590     0.0218      270.0581    0.0001        0.072711       1.432
MICRO09    1     0.2000     0.0294       46.3203    0.0001        0.027792       1.221
MICRO14    1     0.2422     0.0243       99.2254    0.0001        0.041521       1.274
MICRO23    1    -0.1008     0.0231       18.9801    0.0001       -0.018350       0.904
DEPOST4    1     2.0347     0.3156       41.5539    0.0001        0.035512       7.650
DEPOST6    1     1.7877     0.3371       28.1224    0.0001        0.028308       5.976
DEPOST7    1     2.2836     0.0822      771.0284    0.0001        0.565594       9.812
ACCT1      1    -0.0176     0.0165        1.1361    0.2865       -0.004844       0.983
DDA        1     2.7736     0.0544     2602.8287    0.0001        0.718159      16.016
HHAGE      1    -0.0340     0.000664   2622.0069    0.0001       -0.253968       0.967
HHLOR      1    -0.0521     0.00160    1058.3629    0.0001       -0.153035       0.949
HHLOAN     1     0.00244    0.000203    144.9447    0.0001        0.054857       1.002
HHINVT     1     0.00099    0.00022      20.2957    0.0001        0.023634       1.001
CHILDREN   1     0.0311     0.0165        3.5631    0.0591        0.008066       1.032

From Table 3 it can be seen that all the parameter estimates are
significant at a 1% level except for ACCT1 (p = 0.2865) and CHILDREN
(p = 0.0591). INCOME2K has a negative sign, indicating that a
household annual income of $29,999 or less contributes negatively to
the probability that the household will respond positively to EC. This
finding is consistent with what is found in the descriptive statistics
of the customer profile, where EC prospects tend to have a higher
level of annual income than non-EC prospects. Similarly, INCOME5K and
INCOME7K have a positive sign, indicating that prospects with a higher
level of annual income are more likely to accept EC, other things
being equal. The adjusted R2 of the logistic regression is 0.47.

The misclassification rates for both the EC and non-EC classes are
obtained by comparing the estimated event probability to the cut-off
value (externally specified, called P-hat). If the estimated
probability is greater than or equal to P-hat, then the case is
classified as EC; otherwise it is classified as a non-EC prospect.
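Two columns of Table 3 follow from the others by standard logistic-regression arithmetic: the odds ratio is the exponential of the parameter estimate, and the Wald chi-square is the squared ratio of the estimate to its standard error. The short Python sketch below, with hypothetical function names, reproduces a few table entries from the printed estimates. Because the printed estimates and standard errors are rounded, the Wald values match only approximately.

```python
import math


def odds_ratio(estimate):
    """Odds ratio implied by a logit coefficient."""
    return math.exp(estimate)


def wald_chi_square(estimate, standard_error):
    """Wald statistic for testing that a single coefficient is zero."""
    return (estimate / standard_error) ** 2


if __name__ == "__main__":
    # Estimates copied from Table 3.
    for name, est in [("INCOME2K", -0.3635), ("INCOME7K", 0.2941), ("DDA", 2.7736)]:
        print(name, round(odds_ratio(est), 3))
    print(round(wald_chi_square(-0.3635, 0.0258), 1))
```

Read this way, the DDA odds ratio of about 16 says that, other things being equal, the estimated odds of a positive response are roughly sixteen times higher for households where the DDA indicator equals 1.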
[FIGURE 1. MISCLASSIFICATION RATES OF CART. Horizontal axis: EC-prospect misclassification rate.]

[FIGURE 2. Horizontal axis: EC prospect misclassification rate.]

[FIGURE 3. MISCLASSIFICATION RATES OF LOGIT MODEL. Horizontal axis: EC prospect misclassification rate.]
It can be seen that the misclassification rate for both classes is a
function of the cut-off value, P-hat. Table 4 shows the
misclassification rates for EC and non-EC and the corresponding values
of P-hat, as well as the ratio of (1 - P-hat) over P-hat. Figure 3 is
a graphical presentation of the trade-off curve between the
misclassification rates of the EC and non-EC classes.

Comparing Table 1 to Table 4, it can be found that CART outperforms
the logit model. With the logit model, for a misclassification rate of
22.9% for EC, the corresponding misclassification rate for non-EC is
35.74%. In CART analysis, when the misclassification rate for EC is
18.38%, the corresponding rate for non-EC is only 25.9%, about 10
percentage points lower than with the logit model.
TABLE 4. VARIABLE MISCLASSIFICATION COST AND
MISCLASSIFICATION COSTS OF LOGIT MODEL

EC       NON-EC   P-hat (p)   RATIO=(1-p)/p
0.4487   0.0107   0.1000      9.0000
0.4319   0.0176   0.2000      4.0000
0.4121   0.0342   0.3000      2.3333
0.4006   0.0510   0.3500      1.8571
0.3882   0.0766   0.4000      1.5000
0.3736   0.1083   0.4500      1.2222
0.3579   0.1443   0.5000      1.0000
0.3373   0.1828   0.5500      0.8182
0.3134   0.2238   0.6000      0.6667
0.2896   0.2666   0.6500      0.5385
0.2641   0.3070   0.7000      0.4286
0.2466   0.3377   0.7500      0.3333
0.2372   0.3536   0.8000      0.2500
0.2290   0.3574   0.8500      0.1765
Conclusion
In summary, it is found that CART has the best
performance in terms of customer classification
and the AI-based algorithm is the second best. The
logit model performed poorly compared to the
alternatives. However, the superiority of a
specific algorithm is case dependent. With a
different set of data and variables, the robustness
of different algorithms needs to be reexamined.
References
Breiman, L., Friedman, J., Olshen, R., Stone, C.,
Classification and Regression Trees, Wadsworth
International Group, Belmont, California, 1984
Chhikara, K., "The State of the Art in Credit
Evaluation," American Journal of Agricultural
Economics, Vol. 71, pp. 1138-44, 1989
Grinold, R. and Kahn, R., Active Portfolio
Management, IRWIN, 1995
Maddala, G., Limited Dependent and Qualitative
Variables in Econometrics, Cambridge University
Press, 1983
McFadden, D., "Econometric Analysis of
Qualitative Response Models," in Structural
Analysis of Discrete Data with Econometric
Applications, ed. by C.F. Manski and D.
McFadden, MIT Press, 1982
Mezrich, J., "When Is a Tree a Hedge?", Financial
Analysts Journal, 50 (6), pp. 75-81, 1994
Ripley, B., Pattern Recognition and Neural
Networks, Cambridge University Press, 1996
Contact
The author can be reached at:
Jianmin Liu, Ph.D.
Vice President
Bank of America Mortgage, Unit 13592
50 California St., 14th Floor
San Francisco, CA 94111
(415) 445-4407
jianmin [email protected]
or
[email protected]