Data Fusion: A Way to Provide More Data to Mine in?
Peter van der Putten a,b
a Sentient Machine Research
Baarsjesweg 224, 1058 Amsterdam, The Netherlands
[email protected]
Abstract
In everyday data mining practice, the availability of data is often a serious problem. For
instance, in database marketing elementary customer information resides in customer
databases, but market survey data is only available for a subset or even a different
sample of customers. Data fusion provides a way out by combining information from
different sources for each customer. We present a simple data fusion procedure based
on a nearest neighbor algorithm. We suggest different measures to evaluate the quality
of the fusion process. An experiment on real world data is described to illustrate the
added value of our approach.
1. Introduction and motivation
In marketing, direct forms of communication are getting more popular. Instead of
broadcasting a single message to all customers through traditional mass media such as
television and print, the most promising potential customers receive personalized offers
at the most appropriate time and through the most appropriate channels. For this it
becomes more important to gather information about media consumption, attitudes,
product propensity etc. at an individual level.
The amount of data that is collected about customers is generally growing very fast;
however, it is often scattered among a large number of sources. For instance, elementary
customer information resides in customer databases, but market survey data depicting a
richer view of the customer is only available for a small sample. Simply collecting all
this information for the whole customer database in a single source survey is far too
expensive.
b The author is also affiliated with the Leiden Institute of Advanced Computer Science
(LIACS), P.O. Box 9512, 2300 RA Leiden, The Netherlands.
[Figure 1: Data Fusion in a nutshell. Customer database (recipient: 1x10^6 customers, 50 variables, 25 commons) + market survey (donor: 1000 survey respondents, 1500 variables, 25 commons) = fused data, a virtual survey with each customer (1x10^6 customers, 1525 variables, 25 commons).]
A widely accepted alternative within database marketing is to buy external socio-demographic
data that has been collected at a regional level. All customers living in a
single region, for instance in the same zip code area, receive equal values. However, the
kind of information that can be acquired is relatively limited. Furthermore, the
underlying assumption that all customers within a region are equal is at the least
questionable.
Data fusion techniques can provide a way out. Information from different sources is
combined by matching customers on variables that are available in both data sources.
The resulting enriched data set can be used for all kinds of data mining and database
marketing analyses.
In this paper we will give a practical introduction to the application of data fusion for
data mining in a database marketing context (section 2). This will be illustrated with
some preliminary empirical results on a real world data set (section 3). Rather than
presenting a complete solution we argue that data fusion might be a valuable tool in
everyday data mining practice. Furthermore, we aim to demonstrate that the data fusion
problem is far from simple and contains a lot of interesting topics for future algorithmic
and methodological data mining research (section 4).
2. Data fusion
Data fusion is not new. In the 1980s this subject was quite popular, most particularly in
the field of media research [1,2,4,5,7,16] and micro economic analysis [3,10,11]. Up
until today, data fusion is used to reduce the required number of respondents or
questions in a survey. For instance, for the Belgian National Readership Survey,
questions regarding media and questions regarding products are collected in two separate
groups of 10,000 respondents and fused into a single survey, thus reducing costs and
the time a respondent needs to complete a survey [9].
In our research we are ultimately aiming at fusing entire customer databases with
surveys instead of merging surveys with surveys. This implies new ways to exploit
existing survey data. Furthermore, the single source alternative - asking all the questions
to all the customers - might be an option for merging surveys, but in most cases it will
not even be a possibility when merging large customer databases with survey data.
2.1 Core data fusion concepts
The core data fusion concept is illustrated in figure 1. Assume a company has one
million customers. For each customer, 50 variables are stored in the customer database.
Furthermore, there exists a market survey with 1000 respondents, not necessarily
customers of the company, and they were asked questions corresponding to 1500
variables. In this example 25 variables occur in both the database and the survey: these
variables are called common variables. Now assume that we want to transfer the
information from the market survey, the donor, to the customer database, the recipient.
For each record in the customer database the data fusion procedure is used to predict the
most probable answers on the market survey questions, thus creating a virtual survey
with each customer. The variables to be predicted are called fusion variables.
The most straightforward procedure to perform this kind of data fusion is statistical
matching, which can be compared to k-nearest neighbor classification. For each
recipient record those k donor records are selected that have the smallest distance to the
recipient record, with the distance measure defined over the common variables. Based
on this set of k donors the values of the fusion variables are estimated, e.g. by taking the
average for ordinals or the mode for nominals.
Sometimes separate fusions are carried out for groups for which 'mistakes' in the
predictions are unacceptable, e.g. predicting 'pregnant last year' for men. In this case the
gender variable will become a so-called cell variable; the match between recipient and
donor must be 100% on the cell variable, otherwise they won't be matched at all.
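
The matching procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the record layout (lists of dicts), the function name `fuse`, the variable names in the toy data and the choice of squared Euclidean distance are all hypothetical, since the text does not prescribe a particular distance measure.

```python
from collections import Counter
from statistics import mean


def fuse(recipients, donors, commons, fusion_vars, cell_var=None, k=3,
         nominal=()):
    """Attach estimated fusion variables to each recipient record.

    recipients, donors: lists of dicts (hypothetical record layout).
    commons: names of numeric common variables used for the distance.
    fusion_vars: names of donor-only variables to transfer.
    cell_var: optional variable on which donor and recipient must agree.
    nominal: fusion variables estimated by the mode instead of the mean.
    """
    fused = []
    for rec in recipients:
        # Cell variable: only donors with an identical value may be matched.
        pool = [d for d in donors
                if cell_var is None or d[cell_var] == rec[cell_var]]
        # Squared Euclidean distance over the commons (an assumption).
        pool.sort(key=lambda d: sum((d[c] - rec[c]) ** 2 for c in commons))
        nearest = pool[:k]
        out = dict(rec)
        for v in fusion_vars:
            values = [d[v] for d in nearest]
            if not values:
                out[v] = None  # no donor available in this cell
            elif v in nominal:
                out[v] = Counter(values).most_common(1)[0][0]
            else:
                out[v] = mean(values)
        fused.append(out)
    return fused


# Toy data, invented for illustration only.
donors = [
    {"gender": "f", "age": 30, "income": 2.0, "savings_accounts": 1},
    {"gender": "f", "age": 34, "income": 2.5, "savings_accounts": 3},
    {"gender": "m", "age": 31, "income": 2.1, "savings_accounts": 0},
    {"gender": "m", "age": 60, "income": 4.0, "savings_accounts": 2},
]
recipients = [{"gender": "f", "age": 32, "income": 2.2}]

result = fuse(recipients, donors, ["age", "income"], ["savings_accounts"],
              cell_var="gender", k=2)
print(result[0]["savings_accounts"])
```

In this toy example the male donor aged 31 is the closest record overall, but with gender as cell variable it is excluded and the estimate is based on the two female donors only. In practice the commons would usually be scaled or weighted before computing distances; the sketch leaves that out.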
2.2 Data fusion evaluation
An important issue in data fusion is measuring the quality of the fusion; this is far from
straightforward.
The bottom line evaluation is the external evaluation. Assume for instance that we want
to improve the response on mailings for a certain set of products, so this was the reason
why the fusion variables were added in the first place. In this case, one way to evaluate
the external quality is to check whether an improved mail response prediction model can
be built when fused data is included in the input. However, one must take into account
that the added value of socio-demographic and other external variables is often
limited for purely predictive data mining. These variables have more value for
descriptive data mining, e.g. discovering why people are interested in these products
[13,14].
The internal evaluation of the data fusion procedure is simply the a priori evaluation
before external evaluation has taken place. We distinguish between evaluating representativeness
and evaluating predictiveness, although making this distinction formal is an
interesting problem in its own right. One challenge for both the fusion procedure and the
evaluation of representativeness of the fused data is that the donor and the recipient
might be samples from different populations, e.g. a customer database from a bank
versus a national media survey. If both donor and recipient are samples from the same
population, penalty factors can be used to 'punish' winning donors and ensure that
donors are used evenly [11]. Standard statistical tests can be used to check whether there
are significant deviations in frequency distributions for variable values in the fused data
set. An interesting problem when testing predictiveness is that in general there are no
target values available for the recipient, so measures like root mean squared error and
classification error can generally only be calculated for the donor.
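
To make the idea of penalty factors concrete, the sketch below adds a fixed penalty to a donor's distance each time that donor has already been used. This is one simple way to picture the mechanism and is not necessarily the scheme of [11]; the function name, the additive form of the penalty and the toy values are assumptions made for illustration.

```python
import math


def match_with_penalty(recipients, donors, penalty=0.5):
    """Assign one donor index to each recipient (1-nearest neighbor).

    recipients, donors: lists of tuples holding the common variables.
    penalty: amount added to a donor's distance per earlier use
             (hypothetical additive scheme).
    """
    usage = [0] * len(donors)
    matches = []
    for rec in recipients:
        best = min(
            range(len(donors)),
            key=lambda i: math.dist(rec, donors[i]) + penalty * usage[i],
        )
        usage[best] += 1
        matches.append(best)
    return matches


# Toy data: three recipients that all lie closest to donor 0.
donors = [(0.0,), (1.0,)]
recipients = [(0.1,), (0.2,), (0.3,)]

print(match_with_penalty(recipients, donors, penalty=0.0))
print(match_with_penalty(recipients, donors, penalty=0.5))
```

Without a penalty every recipient is matched to the same donor; with the penalty the third recipient is pushed to the second donor, so donors are used more evenly at the cost of a larger matching distance.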
3. Experiments & results
In this section we will describe some preliminary experiments and results with a
standard statistical matching data fusion procedure. We assume the following
hypothetical business case. A bank wants to learn more about its credit card customers
and expand the market for this product. Unfortunately, there is no survey data available
that includes credit cardholdership; this variable is only known for actual customers. Data
fusion is used to enrich a customer database with survey data. The resulting data set
serves as a starting point for further descriptive and predictive data mining analysis.
3.1 The data sets and fusion methodology used
In this experiment we did not use separate donors, but we chose to split up an existing
real world market survey into a donor and a recipient. The recipient contained 2000
records with a cell variable for gender, commons for age, marital status, region, number
of persons in the household and income. Furthermore the recipient contained a unique
variable for credit card ownership, the target variable to model. The donor contained
4880 records, with 36 variables for which we expected that there might be a relation to
credit cardholdership: general household demographics, holiday and spare time
activities, financial product usage and personal attitudes. The original survey contains
over a thousand variables and over 5000 possible variable values.
We fused the donor and the recipient using 4-fold cross-validation on the donor to
determine the optimal k. Only ordinal and binary fusion variables were included, so we
restricted ourselves to predicting averages. Standard root mean squared error was used as a
measure for predictive quality.
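
The selection of k by cross-validation on the donor can be sketched as follows. This is a simplified illustration, assuming a single fusion variable, Euclidean distance over the commons and a deterministic interleaved fold split; the function names and the toy data are hypothetical and do not reproduce the experimental setup.

```python
import math


def knn_predict(train, x, k):
    """Average target of the k training rows closest to x.

    train: list of (features, target) pairs; features is a tuple of numbers.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def cv_rmse(rows, k, folds=4):
    """Root mean squared error of k-NN prediction under n-fold CV."""
    squared_errors = []
    for f in range(folds):
        test = rows[f::folds]  # every folds-th row forms the test fold
        train = [r for i, r in enumerate(rows) if i % folds != f]
        for x, target in test:
            squared_errors.append((knn_predict(train, x, k) - target) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_k(rows, candidates, folds=4):
    """Candidate k with the lowest cross-validated RMSE."""
    return min(candidates, key=lambda k: cv_rmse(rows, k, folds))


# Toy donor: one common variable and one fusion variable, linearly related.
donor_rows = [((float(i),), 0.5 * i) for i in range(20)]

for k in (1, 2, 3):
    print(k, round(cv_rmse(donor_rows, k), 3))
print("chosen k:", best_k(donor_rows, [1, 2, 3]))
```

Because the fusion variable is held out and predicted only within the donor, this measures predictive quality where target values are actually available, in line with the remark in section 2.2.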
3.2 Internal evaluation: representativeness
Apart from the root mean squared error cross validation procedure we restricted
ourselves to representativeness evaluation.
First we compared averages for all variables for the donor and the recipient. As could be
expected from the donor and recipient sizes and the fact that both sets were generated
from the same source there weren't many significant differences between donor and
recipient for the common variables. Within the recipient 'not married' was
overrepresented (30.0% instead of 26.6%), 'married and living together' was
underrepresented (56.1% versus 60.0%) and the countryside and larger families were slightly
overrepresented. More surprisingly (and reassuringly), the average fusion variable values
were very well preserved in the recipient survey compared to the donor survey. Only the
averages of "Way Of Spending The Night during Summer Holiday" and "Number Of
Savings Accounts" differed significantly, respectively by 2.6% and 1.5%.
Apart from general statistics we wanted to evaluate the preservation of relations between
variables, for which we used the following weak measures. For each common variable
we listed the correlation with all fusion variables, the real values for the donor and the
predicted values for the recipient. We then computed the correlation between these lists
and calculated the average over these correlations. The result was an average correlation
of common-fusion relationship between recipient and donor of 0.9 ± 0.028. The mean
difference between common-fusion correlations in the donor versus the recipient was
0.12 ± 0.028. In other words, these correlations were very well preserved. A similar
procedure could be carried out for the fusion variables with respect to each other.
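
The relation-preservation measure described above can be written down as a short sketch. The column-oriented table layout, the function names and the toy values are assumptions for illustration; the code follows the description in the text (per common variable, correlate the list of common-fusion correlations in the donor with the same list in the fused recipient, then average) and additionally reports the mean absolute difference between the two sets of correlations.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def common_fusion_correlations(table, common, fusion_vars):
    """Correlation of one common variable with every fusion variable."""
    return [pearson(table[common], table[f]) for f in fusion_vars]


def relation_preservation(donor, recipient, commons, fusion_vars):
    """Return (average list correlation, mean absolute difference)."""
    list_correlations = []
    abs_differences = []
    for c in commons:
        d = common_fusion_correlations(donor, c, fusion_vars)
        r = common_fusion_correlations(recipient, c, fusion_vars)
        list_correlations.append(pearson(d, r))
        abs_differences.extend(abs(a - b) for a, b in zip(d, r))
    return mean(list_correlations), mean(abs_differences)


# Toy tables (dict of columns); the recipient holds predicted fusion values.
donor = {
    "age": [1, 2, 3, 4, 5], "income": [3, 1, 4, 1, 5],
    "tv_hours": [2, 1, 4, 3, 5], "eating_out": [5, 3, 4, 1, 2],
    "savings": [1, 1, 2, 2, 3],
}
recipient = {
    "age": [1, 2, 3, 4, 5, 6], "income": [2, 1, 4, 2, 5, 3],
    "tv_hours": [1, 2, 3, 3, 5, 4], "eating_out": [5, 4, 3, 3, 1, 2],
    "savings": [1, 2, 1, 2, 3, 3],
}
fusion_vars = ["tv_hours", "eating_out", "savings"]

avg_corr, mean_diff = relation_preservation(
    donor, recipient, ["age", "income"], fusion_vars)
print(round(avg_corr, 3), round(mean_diff, 3))
```

At least three fusion variables are needed for the list correlation to be informative, since two-element lists always correlate at plus or minus one.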
Further work should also be done on the application of penalty factors to improve
representativeness. However, our preliminary experiments have demonstrated that
penalties have a negative effect on the prediction quality (measured in RMSE).
3.3 External value of fused data for prediction tasks
To experiment with the added value of data fusion for further analysis (external
evaluation) we first performed some descriptive data mining to discover relations
between the target variable, credit cardholdership, and the fusion variables using
straightforward univariate techniques. First we selected the top 10 fusion variables with
the highest absolute correlations with the target (see Table I). Note that, in contrast to
standard practice, it is perfectly legal to include dependent fusion variables such as
‘frequency usage credit card’ in the set of input variables for prediction. Smaller effects
included "Need for cognition" (average 1.05 times higher) and fewer "housewives" (0.9
times lower). These results can already offer a lot of insight to a marketer.
The descriptive results were also used to guide the predictive data mining modelling
process. In this case we wanted to investigate whether different computational learning
methods would be able to exploit the added information in the fusion variables. We
included naive Bayes, neural networks, linear regression and a version of
naive Bayes adapted for ordinals (naive Bayes Gaussian). We report results over 10 runs
with train and test sets of equal size. The quality of the models was measured by the
so-called c-index, a rank-based test related to Kendall's tau [12], which measures the
concordance between the ordered lists of real and predicted cardholders (see [15] for
details on the algorithms and the c-index).
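
For readers unfamiliar with the c-index, the sketch below computes a concordance index for a binary target. The exact definition used in the experiments is given in [15]; the version here is the common pairwise formulation (ties count as half), which is an assumption on our part but agrees with the interpretation that c=0.5 corresponds to random and c=1 to perfect prediction.

```python
def c_index(scores, labels):
    """Concordance between predicted scores and a binary target.

    Fraction of (cardholder, non-cardholder) pairs in which the cardholder
    receives the higher score; tied scores count as half a concordant pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Toy example: three of the four cardholder/non-cardholder pairs are ordered
# correctly, so the index is 0.75.
print(c_index([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

The double loop is quadratic in the number of records, which is fine for illustration; a rank-based computation would be preferable for large test sets.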
We compared models which were trained on commons only, for which no fusion was
actually needed, and models on commons plus either all or a selection of correlated
fusion variables (see Table II; c=0.5 means random prediction, c=1 means perfect
prediction). These results indicate that for this data set the models that include the highly
correlated fusion variables outperform the models which were built using commons
only. For linear regression these differences were most significant. Significance was
tested by a one-sided two-sample t-test on the ‘fusion’ runs versus the ‘only commons’
runs.
In figure 2 cumulative response curves are drawn for the linear regression models. The
test recipients are ordered from high score to low score on the x-axis. The data points
correspond to the actual proportion of cardholders up to that percentile. Random
selection of customers results in an average proportion of 32.5% cardholders. We can
see from this figure that credit cardholdership can be predicted quite well. The top 10%
of cardholder prospects according to the prediction models contain around 50-65%
cardholders. The added logarithmic trend lines indicate that the models which include
fusion variables are better in 'creaming the crop', i.e. selecting the top prospects.
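
A cumulative response curve of the kind described above can be computed as in the following sketch. The function name, the chosen percentiles and the toy scores are hypothetical; the code only illustrates the construction (order recipients by score, then take the proportion of cardholders up to each percentile).

```python
def cumulative_response(scores, labels, percentiles=(10, 20, 50, 100)):
    """Percentage of cardholders among the top-scoring part of the list.

    scores: model scores, higher means more likely a cardholder.
    labels: 1 for actual cardholders, 0 otherwise.
    Returns a dict mapping each percentile to the cardholder percentage
    among the records up to that percentile.
    """
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    curve = {}
    for pct in percentiles:
        n = max(1, round(len(ranked) * pct / 100))
        curve[pct] = 100.0 * sum(ranked[:n]) / n
    return curve


# Toy data: ten test recipients, four of which are cardholders.
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

curve = cumulative_response(scores, labels)
print(curve)
```

The value at the 100th percentile equals the overall proportion of cardholders, which is what random selection would achieve on average; a useful model lies above that level for the top percentiles.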
Table I: Fusion variables in recipient strongly correlated with credit card ownership

    Welfare class
    Income household above average
    Is a manager
    Manages which number of people
    Time per day of watching television
    Eating out (privately): money per person
    Frequency usage credit card
    Frequency usage regular customer card
    Statement current income
    Spend more money on investments

[Figure 2: Lift chart of linear regression models for predicting credit card ownership (7 randomly selected runs). Series: "Commons & Correlated Fusion vbls" and "Commons only". X-axis: test recipients ordered from high to low score (percentile); y-axis: proportion of cardholders (%).]

Table II: C indexes

                        Only commons       Commons & correlated fusion vbls   Commons & all fusion vbls
    SCG Neural Network  C=0.692 ± 0.012    C=0.703 ± 0.015 (p=0.041)          C=0.694 ± 0.019 (p=0.38)
    Linear regression   C=0.692 ± 0.014    C=0.724 ± 0.012 (p=2.1e-5)         C=0.713 ± 0.013 (p=0.0017)
    Naïve Bayes         C=0.707 ± 0.015    C=0.712 ± 0.011 (p=0.20)           C=0.704 ± 0.009 (p=0.72)
    Naïve Bayes Gaussian C=0.701 ± 0.015   C=0.720 ± 0.012 (p=0.0034)         C=0.719 ± 0.012 (p=0.0049)
4. Discussion and future research
One could argue that in theory by applying data fusion no information is added to the
recipient survey, because this information is derived directly from the commons.
However, in practice data fusion can still be a valuable tool. For descriptive data mining
tasks, the fusion variables and the patterns derived from these variables can be more
understandable and easier to interpret for an end user than patterns derived solely from
commons. Furthermore it is a well known practical fact that it often makes sense to
include derived variables to improve prediction quality. In this case, fusion can make it
easier for ‘imperfect’ algorithms such as linear regression to discover complex nonlinear relations between commons and target variables, by exploiting the information in
the fusion variables. It is highly recommended to use appropriate variable selection
techniques to remove the noise that is added by ‘irrelevant’ fusion variables (to counter
the ‘curse of dimensionality’).
It goes without saying that evaluating the quality of data fusion is crucial for acceptance.
We hope to have demonstrated that this is not straightforward. A lot of interesting
research can be done in this area, especially in the field of evaluating the recipient fusion
variable predictions, for which no targets are available. Even a relatively simple
question as determining the optimal set of commons has interesting research
dimensions. To structure all these choices we have started to build a data fusion process
model, analogous to the CRISP-DM model for data mining [6].
Also, the core fusion algorithms provide a lot of room for research and improvement.
There is no fundamental reason why the fusion algorithm should be based on k-nearest
neighbor prediction instead of clustering methods, regression, the expectation-maximization
(EM) algorithm or other statistical and machine learning algorithms (see
e.g. [8]). By shifting from fusing surveys to fusing customer databases with surveys an
extra challenge must be faced: scoring millions of customer database records instead of
thousands of surveys. All these efforts work towards a single vision: keeping all
knowledge about a customer up to date, including soft information such as predictions
based on measurements from different sources.
5. Conclusion
The promise of data fusion is indeed attractive: getting insight about individual
customers at a fraction of the price it would have cost to collect all this information
in a single source survey. The application of data fusion will increase the value of data
mining, because there is more integrated data to mine in. However, there is still a lot of
interesting research to be done to evaluate data fusion quality and improve the still
rather straightforward data fusion algorithms.
Acknowledgements
We would like to thank Michel de Ruiter, Martijn Ramaekers, Evelien Langendoen,
Michiel van Wezel and Joost Kok for their comments. Part of this work has been
performed within "The Fusion Factory" project, which is supported by the Dutch
Ministry of Economic Affairs, through the KREDO stimulation initiative for
development of electronic services.
References
[1] Antoine, J. 1985. A Case Study Illustrating the Objectives and Perspectives of
Fusion Techniques. Proceedings of the Salzburg Readership Symposium.
[2] Ken Baker, Paul Harris and John O’Brien. 1989. Data Fusion: An Appraisal and
Experimental Evaluation. Journal of the Market Research Society, 31 (2), 152-212.
[3] Barry, J.T. 1988. An investigation of statistical matching. Journal of Applied
Statistics.
[4] Sarah O’Brien. The role of the data fusion in actionable media targeting in the
1990’s. 1990. ESOMAR, pp 531-548.
[5] A.E. Bronner. Einde van de fusie fobie in Nederland? 1989. In: Jaarboek van de
Nederlandse vereniging van marktonderzoekers 1988/1989, 9-18
[6] Chapman, P., Clinton J., Khabaza T., Reinartz, T., Wirth, R. (1999). The CRISP-DM
Process Model. Draft Discussion paper, Crisp Consortium, March 1999.
http://www.crisp-dm.org/.
[7] Harris, P. and Baker, K. 1998. Data Fusion. Admap, June 1998
[8] W.A. Kamakura and M. Wedel, 1996. Statistical Data-Fusion For Cross-Tabulation.
Research Report SOM Institute, Groningen University, The
Netherlands.
[9] R. Lokker. 1998. Bereikstudies Pers, Bioscoop en PMP. Centrum voor Informatie
over de Media, Brussel, Belgium.
[10] van Noordwijk, A.J. 1983. Technical Notes on a Statistical Matching Experiment.
Chapter 8 in: Koppelling van Databestanden, Sociaal en Cultureel Planbureau,
Rijswijk, the Netherlands.
[11] Paass, G. 1986. Statistical Match: Evaluation of Existing Procedures and
Improvements by Using Additional Information. In: Microanalytic Simulation
Models to Support Social and Financial Policy. Orcutt, G.H. and Merz, K, (eds).
Elsevier Science Publishers BV, North Holland.
[12] Press, W.H, Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P., 1992.
Numerical Recipes in C. The Art of Scientific Computing. Cambridge University
Press, Cambridge MA, 2nd edition
[13] van der Putten, P.,1999. Datamining in Direct Marketing Databases. In: Baets, W.
(ed.). (1999). A Collection of Essays on Complexity and Management. World
Scientific, Singapore.
[14] van der Putten, P. 1999. A Datamining Scenario for Stimulating Credit Card Usage
by Mining Transaction Data. Proceedings of Benelearn-99.
[15] de Ruiter, Michel, 1999. Bayesian classification in data mining: theory and
practice. MSc. Thesis, BWI, Free University of Amsterdam, The Netherlands
[16] Schieler, H.E. and Wiegand, J. 1985. A Report on Experiments in Fusion in the
Official German Media Research. Proceedings of the Salzburg Readership
Symposium.