The Data Mining Process
With CRM becoming more of a business philosophy for most organizations today, data
mining is often viewed as the analytical technology required to achieve a given
solution. But what does this really mean? For many people, the fact that data mining is
viewed as the technological component implies that purchasing the right software and
hardware is the key to effective data mining. Other schools of thought regard the use of
statistics and/or computer programming, along with machine-learning algorithms, as the
data mining component. Yet these preconceived perceptions miss the mark, in the sense
that these notions represent specific components within the overall data mining process.
Data mining is a step-by-step process that requires the human element interacting with
technology in order to produce the best business solution for a given problem. This is
best understood by explaining the process and what is involved within each step.
In the many articles and books that have been written about data mining, authors will
have differing opinions on the number of steps or stages within a given data mining
project. However, as one reads through a number of these books or articles, common
themes emerge concerning the critical junctures within a data mining project. In this
article, we look at the data mining process as comprising four major steps or stages:
1) Identification of the Business Problem or Challenge
2) Creation of the Analytical File
3) Application of The Appropriate Tools and Technology
4) Implementation and Tracking
Identification of the Business Problem or Challenge
You will not get much argument from any of the pundits in recognizing that this is
probably the most important stage within the entire process. In fact, it is rather ironic
that, given the plethora of data mining technology available in the marketplace, this is
the part of the process where the human element matters most. This capability relies on
the expertise to critically assess a given business situation, both quantitatively and
qualitatively, and to determine its importance given the overall business strategy. An
example best illustrates this argument:
An organization’s overall sales had significantly decreased within the last year. It was
also found that 75% of the company’s overall sales resulted from their best 20% of
customers. The analyst and marketer identified that customer attrition had significantly
increased within this last year.
With this information, it was decided that an attrition model to identify high-risk attritors
would enable marketers to allocate more resources to this vulnerable group in the hope of
retaining them. Yet this thinking was flawed on two business fronts. First, analysis should
have been conducted on the high-value segment to determine the extent of the attrition
problem within this group. Given the preliminary information in this specific business
case, it is extremely likely that the attrition problem is highly prevalent within this
segment. The other consideration, before embarking on developing a model, is whether
CRM or data mining has any relevance in this situation at all. For instance, if the reason
for losing sales is that the competition has created new products or services whose price
points and benefits are far superior to the company's, being superior CRM practitioners
through data mining is not going to stop this sales erosion. As you can see in this case,
some upfront data analysis complemented with some market research is the preferred
route to take at this point.
Another important consideration in being able to identify a problem is to understand the
current data environment and its implications for resolving a certain business problem. A
company may want to launch a new product and develop a cross-sell model to target
those customers who are most likely to purchase this product. In building a model, the
analyst needs to understand that there is no prior information or history on this specific
product with which to develop a model. The analyst might then consider whether there
are other similar products with previous history which could be used to develop a broad
profile. If this were not the case, the analyst might want to think of ways to identify
customers who might be early adopters (i.e. the early pioneer purchasers of new
products).
As one can surmise from the above examples, some creative thinking needs to be
employed to ensure that we are identifying the right problem or challenge given the
current business circumstances. Once again, strong collaboration between the marketing
and data mining areas is the key to really harnessing the creative thinking of both
functional areas.
Creation of the Analytical File
Once the business challenge or problem is identified, the analyst needs to understand the
data and information requirements which will enable him/her to conduct the necessary
analytics. This does not mean that the analyst needs to undergo a rigorous data needs
analysis as one would undertake in building or designing a database. The analyst is only
concerned with what is already there and not what should be there. Once the analyst has
identified a file and some of its contents as being potentially relevant for an analysis, the
analyst in most cases would request the entire file as one source of data in the project.
Other files would also be requested depending on their relevance to the project and
whether or not customer-related information can be extracted.
With the source files determined, the analyst then needs to understand the quality of the
data. In other words, are there certain fields that have a large proportion of missing
values?
Tenure       # of Customers    % of Customers
1998                 49,000               14%
1999                 49,000               14%
2000                 59,500               17%
2001                 42,000               12%
Missing             150,500               43%
Total               350,000              100%
In the above case, we can see that 43% of the customer base has no start date as a
customer. A number of techniques can be used to handle these missing values. For
instance, imputing the average or median of the non-missing values is a popular way of
dealing with missing values. A more robust but much more time-consuming approach is
to build a model or algorithm that predicts the value of the variable based on the other
fields or characteristics within the database.
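A minimal sketch of both approaches in Python with pandas and scikit-learn is shown
below; the file and field names are hypothetical and serve only to illustrate the idea.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer file with a partially missing tenure start year.
customers = pd.read_csv("customer_file.csv")

# Simple approach: impute the median of the non-missing start years.
median_year = customers["tenure_start_year"].median()
customers["tenure_imputed"] = customers["tenure_start_year"].fillna(median_year)

# Model-based approach: predict the missing start year from other fields.
predictors = ["total_spend", "num_purchases", "num_products"]   # hypothetical fields
known = customers[customers["tenure_start_year"].notna()]
missing = customers[customers["tenure_start_year"].isna()]

model = LinearRegression().fit(known[predictors], known["tenure_start_year"])
customers.loc[missing.index, "tenure_start_year"] = model.predict(missing[predictors])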
Another problem concerning the quality of data pertains to values within a field that don’t
make sense.
Product Category Code    # of Customers    % of Customers
ABC                             103,810            29.66%
DEF                             118,650            33.90%
GHI                              74,165            21.19%
999                              49,875            14.25%
Total                           350,000               99%
In the above case, we see that most of the product codes consist of letters and in all
likelihood relate to specific product categories. The code ‘999’, however, suggests that
some investigation is required to better understand what it relates to.
After the data quality issues are ironed out, consideration needs to be given as to how
certain fields should be summarized or grouped. This is particularly relevant for purchase
history. For instance, we may want to summarize the spend into yearly buckets with one
bucket looking at the overall lifetime spend of the customer. At the same time, we may
have hundreds of different product purchase codes. The challenge in meaningfully using
this data is to group these product codes into broad categories such that the grouped
information has enough data for any future statistical exercise within a given data mining
project.
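As a rough illustration of this kind of summarization and grouping, the sketch below uses
Python with pandas; the transaction file, field names and category mapping are all
hypothetical.

import pandas as pd

# Hypothetical transaction-level file: one row per purchase.
trans = pd.read_csv("transactions.csv", parse_dates=["purchase_date"])

# Summarize spend into yearly buckets plus an overall lifetime-spend bucket.
trans["year"] = trans["purchase_date"].dt.year
yearly = trans.pivot_table(index="customer_id", columns="year",
                           values="amount", aggfunc="sum", fill_value=0)
yearly["lifetime_spend"] = yearly.sum(axis=1)

# Collapse hundreds of detailed product codes into a few broad categories so
# that each grouped category has enough volume for statistical work.
code_to_category = {"ABC": "apparel", "DEF": "electronics", "GHI": "home"}
trans["category"] = trans["product_code"].map(code_to_category).fillna("other")
category_spend = trans.pivot_table(index="customer_id", columns="category",
                                   values="amount", aggfunc="sum", fill_value=0)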
Once it has been determined how to handle data quality issues as well as how to group or
summarize the data, algorithms are then written to organize the data and information into
one overall analytical file. This stage can in many cases represent the area where the data
miner or analyst best demonstrates his or her worth to the organization. It is at this point
that the analyst utilizes knowledge of the information environment to create meaningful
variables or fields of information which will be most relevant for a given analysis. For
instance, trend variables (growth and decline) as well as purchase variables related to the
time and type of purchase are derived from the analyst's work and are not obtained
directly from the source database files. In fact, the breakdown of sourced variables (taken
directly from the source files) vs. derived variables (created by the analyst) is about
10% vs. 90%.
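For example, a growth or decline trend variable might be derived from the yearly spend
buckets; the sketch below is hypothetical, with illustrative column names.

import pandas as pd

# Hypothetical customer-level summary with yearly spend buckets already built.
yearly = pd.DataFrame({"spend_2000": [120.0, 300.0, 0.0],
                       "spend_2001": [250.0, 150.0, 40.0],
                       "lifetime_spend": [900.0, 1200.0, 40.0]})

# Derived trend variable: year-over-year growth (positive) or decline (negative).
yearly["spend_trend"] = (yearly["spend_2001"] - yearly["spend_2000"]) / (yearly["spend_2000"] + 1)

# Other illustrative derived fields not present on the source files.
yearly["recent_spend_share"] = yearly["spend_2001"] / (yearly["lifetime_spend"] + 1)
yearly["declining_flag"] = (yearly["spend_trend"] < 0).astype(int)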
Application of The Appropriate Tools and Technology
With the completion of the analytical file, the analyst is now in a position to deploy the
appropriate tools, technologies and techniques in order to obtain the right solution. Keep
in mind that not all data mining solutions require tools with statistical analysis. The
notion of using indexes, in particular RFM (recency, frequency, and monetary value),
represents one non-statistical method of targeting customers for a given business
initiative. Yet the use of statistics represents an unbiased and objective means of letting
the science determine the most appropriate characteristics or variables. For solutions that
require some statistical analysis, there are a large number of vendors that provide
products in this area. The two most common vendors in this field are SAS (www.sas.com)
and SPSS (www.spss.com). Both vendors offer a variety of statistical techniques which
can be used depending on the specific solution and tactics that are required.
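As an illustration of the non-statistical RFM approach mentioned above, each customer
can be ranked into quintiles on recency, frequency and monetary value; the sketch below
is a minimal Python version with hypothetical fields.

import pandas as pd

# Hypothetical customer summary: recency, frequency and monetary value.
cust = pd.DataFrame({"days_since_last": [12, 300, 45, 90, 500],
                     "num_purchases":   [14, 1, 6, 3, 2],
                     "total_spend":     [1200.0, 50.0, 400.0, 220.0, 90.0]})

# Score each dimension 1-5 using quintiles (recency is reversed: recent is better).
cust["R"] = pd.qcut(-cust["days_since_last"].rank(method="first"), 5, labels=False) + 1
cust["F"] = pd.qcut(cust["num_purchases"].rank(method="first"), 5, labels=False) + 1
cust["M"] = pd.qcut(cust["total_spend"].rank(method="first"), 5, labels=False) + 1

# A simple combined index; higher scores identify the better targets.
cust["RFM"] = cust["R"] + cust["F"] + cust["M"]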
In applying the right tools and technologies, we need to consider the types of analytics
that we require. For instance, we will need reports that demonstrate the key trends and
behaviours against a given business metric (response rate, retention rate, ROI, etc.).
Therefore, our tools need to provide the capability of creating these reports. These are
often referred to as EDA (exploratory data analysis) reports. Listed below is one example
of an EDA report.
[EDA report: response rate by customer age band (30 to 36, 37 to 41, 42 to 46, 47 to 52,
53 to 60) compared with the overall average]
As you can see from the above report, the analyst can determine that as the age of a
customer increases, the likelihood that they will respond also increases.
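A report of this kind can be produced directly from the analytical file; the sketch below
is a hypothetical Python version, with age bands chosen to mirror the example.

import pandas as pd

# Hypothetical analytical file containing customer age and a response flag (0/1).
df = pd.read_csv("analytical_file.csv")

# Band customer age and report the response rate within each band vs. the average.
bands = [30, 36, 41, 46, 52, 60]
df["age_band"] = pd.cut(df["age"], bins=bands, include_lowest=True)
eda_report = df.groupby("age_band")["responded"].mean().mul(100).round(2)

print(eda_report)                       # response rate (%) by age band
print(df["responded"].mean() * 100)     # overall average response rate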
The statistical tools are required when we need to scientifically determine the key
triggers and behaviours which relate to a given business outcome that we are trying to
optimize. For instance, we may want to identify the top four characteristics associated
with customer fraud within a given credit card database, or the top five characteristics
associated with a customer upgrading to a higher-premium credit card. A variety
of statistical techniques can be employed. Correlation analysis can help determine these
variables if we are unconcerned about the interaction between the variables themselves
which is often referred to as multicollinearity. In other words, we are only looking at the
variables in a univariate way i.e. one variable at a time vs. the desired business metric.
This can be useful if the outcome is to merely rank variables or characteristics against the
desired business metric. However, if the desired outcome is to build a model, then we
need to consider the interactions between variables. A variety of multivariate statistical
techniques such as discriminant analysis ,regression analysis,CHAID can then be used to
build the model. Some of the more advanced modeling technologies have started to
capitalize on the learning and research from those disciplines relating to artificial
intelligence. Techniques such as neural nets and genetic algorithms have adopted this
machine-rule learning in order to optimize the performance of these models. More
discussion about these tools and technologies will be the subject of a subsequent article.
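A rough sketch of the two approaches in Python is shown below; the variables are
hypothetical, and logistic regression simply stands in for whichever multivariate
technique is chosen.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical analytical file: candidate characteristics plus the business
# outcome we are trying to optimize (here, response to a prior campaign).
df = pd.DataFrame({"tenure_years": [1, 5, 3, 8, 2, 7, 4, 6],
                   "num_products": [1, 3, 2, 4, 1, 3, 2, 4],
                   "total_spend":  [80, 600, 250, 900, 120, 700, 300, 650],
                   "responded":    [0, 1, 0, 1, 0, 1, 0, 1]})

# Univariate view: rank each characteristic by its correlation with the outcome,
# ignoring any interaction (multicollinearity) among the characteristics.
ranking = df.drop(columns="responded").corrwith(df["responded"]).abs().sort_values(ascending=False)
print(ranking)

# Multivariate view: a model that estimates the variables jointly.
X, y = df.drop(columns="responded"), df["responded"]
model = LogisticRegression().fit(X, y)
df["score"] = model.predict_proba(X)[:, 1]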
The other tools that we need provide the ability to demonstrate the business impact of
the desired solution. This is best demonstrated by observing how well the model will
perform in a given business situation. In the chart below, we are assessing the business
impact of a cross-sell model that was developed across the customer base. Given a
revenue per order of $125 and a cost per promotion effort of $0.85, we obtain the
following business results.
% of Names    Number of Prospects    Average Response
Mailed        Mailed in Interval     Rate in Interval         ROI
0-10%                  24,344               2.90%         326.14%
10%-20%                24,344               1.44%         111.68%
20%-30%                24,344               1.31%          92.18%
30%-40%                24,344               0.97%          42.05%
40%-50%                24,344               0.97%          42.04%
50%-60%                24,344               0.64%          -5.30%
60%-70%                24,344               0.49%         -27.59%
70%-80%                24,344               0.42%         -38.73%
80%-90%                24,344               0.47%         -30.37%
90%-100%               24,344               0.40%         -41.51%
Total                 243,437               1.00%          47.06%
In the above chart, the model is applied against 243,437 customers, whereby these names
are ranked by descending model score into ten deciles. Since the model was a response
rate model, we want to observe how well it classifies observed or actual response rates.
For this group of 243,437 customers, actual response rates are available since there was
a prior cross-sell campaign to this group. In terms of assessing the model, you can see
that it rank-orders observed response rate quite well, achieving a ratio of 7.25 to 1 when
comparing the response rate of the top decile to that of the bottom decile. From the
response rate performance, we can then translate these numbers into ROI since we know
the cost per promotion as well as the revenue per order. The ROI numbers can then be
used by the decision maker to determine the appropriate quantity of persons who should
be promoted.
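The decile and ROI figures in a chart like the one above can be reproduced along the
following lines; this is a hypothetical sketch, with the $125 revenue per order and $0.85
cost per piece taken from the example.

import pandas as pd

# Hypothetical scored file: model score plus the observed response flag (0/1)
# from the prior cross-sell campaign.
scored = pd.read_csv("scored_customers.csv")

# Rank the names by descending model score into ten deciles.
scored = scored.sort_values("score", ascending=False).reset_index(drop=True)
scored["decile"] = pd.qcut(scored.index, 10, labels=False)

# Translate observed response rates into ROI per decile.
revenue_per_order, cost_per_piece = 125.0, 0.85
summary = scored.groupby("decile").agg(mailed=("responded", "size"),
                                       response_rate=("responded", "mean"))
summary["roi"] = (summary["response_rate"] * revenue_per_order - cost_per_piece) / cost_per_piece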
Implementation and Tracking
With the solution completed, the next step is to action it within some business initiative.
In some cases, these initiatives could be non-marketing related. For example, credit-risk
models could be applied within the operations area, whereby the output delivered to the
account representative consists of risk segments along with their specified courses of
action.
In applying solutions, the most important consideration is to ensure that the solution is
being applied correctly. Initially, this implies that we conduct some data quality checks
on several records to ensure the integrity of the solution on these records.
Another consideration is to ensure that the information environment has not changed
substantially between the time the solution was developed and the time it is applied.
This can be done by creating frequency distributions of key elements within the solution.
For example, a given model could be examined by comparing the score distribution
ranges between time of development and time of implementation. See the example
below:
% of List    Minimum Score    Minimum Score
             (development)    (application)
0-10%             0.08             0.04
10-20%            0.07             0.03
20-30%            0.06             0.02
30-40%            0.05             0.01
40-50%            0.04             0.004
etc...
In the above example, the score ranges have changed quite drastically between the time
of development and the most current application. Forging ahead with this solution
without understanding these score discrepancies is an invitation for failure. Investigation
and analysis need to be conducted on the database to clearly understand why these score
discrepancies exist.
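One rough way to produce such a comparison is to save the score distribution at
development time and recompute it on the file about to be promoted; the sketch below is
hypothetical, using simulated score vectors.

import numpy as np
import pandas as pd

# Hypothetical score vectors: one saved at development time, one computed on
# the current file just before implementation.
rng = np.random.default_rng(0)
dev_scores = pd.Series(rng.uniform(0.0, 0.10, 10_000))
app_scores = pd.Series(rng.uniform(0.0, 0.05, 10_000))

def decile_minimums(scores: pd.Series) -> pd.Series:
    """Minimum score within each decile of the list ranked by descending score."""
    ranked = scores.sort_values(ascending=False).reset_index(drop=True)
    deciles = pd.qcut(ranked.index, 10, labels=False)
    return ranked.groupby(deciles).min()

comparison = pd.DataFrame({"development": decile_minimums(dev_scores),
                           "application": decile_minimums(app_scores)})
print(comparison)   # large gaps between the columns signal that the database has shifted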
Once we are comfortable with the application of the solution, we then need to create a
testing and tracking environment in order to evaluate the impact of this solution within a
live business initiative. Listed below is an example of one test/tracking matrix which
would be used within a marketing campaign.
% of File (Ranked    # of Names
by Model Score)      Mailed       Cell Breakdown
0-5%                 50,000       Test Cell 1: 45,000    Control Cell 2: 5,000
5-10%                50,000       Test Cell 3: 45,000    Control Cell 4: 5,000
10-15%               50,000       Test Cell 5: 45,000    Control Cell 6: 5,000
15%-100%             50,000       Test Cell 7: 45,000    Control Cell 8: 5,000
As you can observe from the matrix, the intent in creating it was to evaluate both the
performance of the model and the effectiveness of a particular new communication piece.
Comparing the performance of the control cells across each model-ranked interval will
indicate how well the model is performing. Comparing the performance of all test cells
vs. the performance of the control cells will indicate the effectiveness of the new
communication piece. At a finer level, we can determine where the sensitivities of
modeling and communication exert their greatest impact within the overall list.
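A hypothetical sketch of setting up such a matrix is shown below, ranking the file by
model score and randomly holding out a control cell within each interval; the interval
boundaries and the roughly 10% holdout follow the example above.

import numpy as np
import pandas as pd

# Hypothetical mailing file already scored by the model.
rng = np.random.default_rng(42)
mail = pd.DataFrame({"customer_id": range(200_000),
                     "score": rng.uniform(size=200_000)})

# Rank by descending score and express each customer's position as a percentile.
mail = mail.sort_values("score", ascending=False).reset_index(drop=True)
mail["pctile"] = mail.index / len(mail) * 100

# Assign model-ranked intervals, then randomly hold out 10% of each interval as
# a control cell that does not receive the new communication piece.
labels = ["0-5%", "5-10%", "10-15%", "15-100%"]
mail["interval"] = pd.cut(mail["pctile"], bins=[0, 5, 10, 15, 100],
                          labels=labels, include_lowest=True)
mail["cell"] = np.where(rng.random(len(mail)) < 0.10, "control", "test")
print(mail.groupby(["interval", "cell"]).size())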
As the above examples show, the implementation process is highly detailed. Time needs
to be devoted to ensuring that the solution is correct and makes sense in today's
information environment. This time commitment also needs to be allotted to setting up
the proper testing and tracking environment for evaluating performance.
Conclusion
As you can see from the above discussion, data mining is much more than just
technology. It is a step-by-step process with the end result being the development and
application of a solution. At the same time, this process should always yield learning
which can potentially be utilized for future campaigns. The development and application
of solutions, together with the acquisition of new learning, provide the ingredients for
continuous business improvement, which is really the long-term goal.
By Richard Boire – Partner, Boire Filler Group