What to consider when purchasing external data for data mining
I’ll begin this article with the usual data mining motherhood statement: the value of any
data mining exercise is contingent on the data. Purchase decisions regarding data sources
can therefore become extremely important in any data mining exercise. Data purchased for
data mining purposes is not meant to replace existing data but rather to supplement the
existing data environment. The question remains whether this new data can provide
significantly better results than what we already have within our existing data
environment. This question is easier asked than resolved, and resolving it involves a
number of considerations by the purchaser. This article will attempt to address some of
them.
Using External Data for Existing Customer Programs
First of all, the purchaser must determine whether the data needs are for acquisition
programs or for existing customer programs. For existing customer programs, most
organizations in today’s CRM-driven environment have databases of existing customers
and at least their purchases. The organization and structure of this data may vary from
company to company depending on its level of database marketing sophistication. Yet,
regardless of how this data is structured and organized, for data mining purposes, the
data is at an individual level. Data miners will always strive to use as much
individual-level data as possible since this will always provide superior results. But these results can be
compromised if there are quality issues with the individual-level data. For example, age
and income might be reported on an individual level. Yet if over 90% of customer
records contain missing values in these fields, the individual-level data on the remaining
10% is not going to be very useful in a data mining project. In this case, one should look at
aggregate type data sources such as Statistics Canada and specifically Stats Can Census
data. Aggregate level data means that customers residing in the same postal area would
have the same Stats Can values while customers residing in different postal areas would
have different Stats Can values. Appending aggregate-level data (Stats Can Census area)
to this customer file would at least provide full age and income information for the 90% of
records which do not have any of this information at all. Using this aggregate-level data for
income and age would yield better results than the status quo individual-level data
where 90% of the information is missing.
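The append described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`postal_area`, `avg_age`, `avg_income`) and all of the values are invented for the example and do not reflect any actual Stats Can file format.

```python
# Hypothetical customer records: individual-level age and income are mostly missing.
customers = [
    {"id": 1, "postal_area": "M5V", "age": 42, "income": 85000},
    {"id": 2, "postal_area": "M5V", "age": None, "income": None},
    {"id": 3, "postal_area": "K1A", "age": None, "income": None},
    {"id": 4, "postal_area": "X0X", "age": None, "income": None},
]

# Hypothetical aggregate-level values keyed by postal area (made-up numbers).
area_stats = {
    "M5V": {"avg_age": 36.0, "avg_income": 72000.0},
    "K1A": {"avg_age": 44.0, "avg_income": 64000.0},
}


def append_area_stats(records, stats):
    """Return copies of the records with area-level fields appended.

    Every customer in the same postal area receives the same values;
    areas missing from the lookup receive None.
    """
    enriched = []
    for rec in records:
        area = stats.get(rec["postal_area"], {})
        row = dict(rec)
        row["area_avg_age"] = area.get("avg_age")
        row["area_avg_income"] = area.get("avg_income")
        enriched.append(row)
    return enriched


def coverage(records, field):
    """Share of records where the given field is populated."""
    return sum(r[field] is not None for r in records) / len(records)


enriched = append_area_stats(customers, area_stats)
print("individual-level age coverage:", coverage(enriched, "age"))
print("area-level age coverage:", coverage(enriched, "area_avg_age"))
```

Comparing the two coverage figures is a quick way to see how much of the file the aggregate-level fields fill in before any modeling is attempted.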
Besides enhancing information where there are data quality issues, aggregate-level data
can also provide breadth of information. For example, ethnicity, occupation, religion,
education, and a range of other Stats Can demographic information are unlikely to be
directly available on any customer database. Appending this type of information to an
existing customer database can add some value to a modeling or profiling exercise.
However, the rich individual-level information related to the purchases or transactions
of the customer will produce the stronger modeling/profiling variables.
Yet this aggregate-level demographic information can have another impact: it can provide
broad-level insights that can be used to develop better communication strategies.
Using Data for Acquisition Programs
Yet the bigger impact of external data is going to be its use in acquisition programs. In
fact, suppliers of data market themselves more on its overall impact in acquiring new
customers. Here, external data represents the actual foundation for building any
data mining solution for acquisition programs rather than supplemental information, as it
is for existing customer programs. Typically, name, address, and postal code represent the
available pieces of data for any acquisition program. With postal code, though, we have
the key link in being able to append Stats Can data to these name and address records.
Stats Can data, though, is offered under two main types of products. The first product is
Stats Can taxfiler data. The data, here, is organized at a postal walk level which
represents approximately 800 households and contains information compiled from annual
tax returns. Such data would contain income-related information such as income earned
from employment, investment income, charitable deductions, etc.
The second file, Stats Can Census data, contains information at an enumeration area level
or approximately 400 to 500 households. Although it contains some income data, it lacks
the other measures of wealth that are contained in the taxfiler data. However, this file is
much richer in demographic information and contains information pertaining to ethnicity,
religion, language, education, occupation, etc. Furthermore, it is much more granular as
data is aggregated for every 400 households as opposed to 800 households which is the
case with the taxfiler data. One limitation to this data, though, is that it is updated only
every 5 years which is when the Stats Can Census survey is conducted.
In determining which of the above Stats Can sources to purchase, the decision
can often depend on the industry sector of the company. Taxfiler data may often
represent the initial purchase of financial institutions while Stats Can Census data may
often represent the initial purchase of retail organizations. Larger organizations in many
cases will purchase both types of data sources as the cost to purchase both data sources is
under 50M. The Census data can be used for 5 years while the taxfiler data, which is
updated annually, could arguably be used for two years before requiring any update.
The Suppliers/Players in Data Enhancement Services
The business of supplying data has grown significantly over the last twenty years. This
growth can be attributed directly to the growth in data mining. Twenty years ago,
besides Statistics Canada, there was one other organization offering data enhancement
services. Within the Toronto area, there are now six organizations with both Statistics
Canada and Canada Post actually developing more service-oriented strategies around data
enhancement.
Originally, service in this area comprised the development of demographic clusters across
Canada. A company could purchase these clusters for marketing purposes. For example,
prospects for an acquisition program would be assigned to cluster codes based on their
postal code. The marketer would then determine the appropriate clusters to target for a
given acquisition program. Marketers could also use this approach in
targeting existing customers. But, as mentioned above, the company’s individual-level
data (customer and transactions) will provide information that will always be superior to
a demographic cluster solution based on aggregate-level data.
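As a rough sketch of the cluster-based targeting just described, prospects can be mapped to cluster codes through their postal code and then filtered to the clusters chosen for a program. The lookup table, cluster codes, names and postal codes below are all hypothetical and stand in for whatever a data supplier would actually provide.

```python
# Hypothetical postal-code-to-cluster lookup (cluster codes are invented).
cluster_by_postal = {
    "M5V2T6": "C01",
    "K1A0B1": "C07",
    "V6B1A1": "C01",
}

# Clusters the marketer has chosen to target for this acquisition program.
target_clusters = {"C01"}

# Prospect file: typically just name, address and postal code.
prospects = [
    {"name": "A. Smith", "postal_code": "M5V 2T6"},
    {"name": "B. Tremblay", "postal_code": "k1a 0b1"},
    {"name": "C. Wong", "postal_code": "V6B1A1"},
    {"name": "D. Singh", "postal_code": "T2P 1J9"},  # not in the lookup
]


def normalize_postal(postal_code):
    """Strip spaces and upper-case so that lookups match consistently."""
    return postal_code.replace(" ", "").upper()


def assign_cluster(prospect, lookup):
    """Return the cluster code for a prospect, or None if the code is unknown."""
    return lookup.get(normalize_postal(prospect["postal_code"]))


selected = [
    p["name"]
    for p in prospects
    if assign_cluster(p, cluster_by_postal) in target_clusters
]
print(selected)
```

Prospects whose postal code is absent from the lookup simply fall out of the selection, which is one reason match rates are worth checking before a cluster product is purchased.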
Historically, clusters were used solely as a means to target prospects. But as the level of
sophistication has increased regarding the use of data, analysts are not only looking at the
clusters but also at the raw source data which was used in building these clusters. The
analytical mindset is to consider a variety of data inputs when building data mining
solutions. With this kind of mindset, the data suppliers themselves have become more
sophisticated by offering data enhancement products beyond just cluster codes. One
company offers data enhancement products which provide increased targeting
capabilities amongst the different ethnic groups. Another organization offers
demographic postal area information that goes beyond the information provided by Stats
Can Census and Taxfiler data. In fact, this organization uses both these sources as well as
a variety of other data sources in being able to provide annual demographic data at a
postal code level. Although this data is much more granular (postal code level) and would
therefore appear to be superior to the more aggregate-level Stats Can data, it is
important to know that this postal code demographic data represents estimates and not
raw or source data. Estimates, as we all know, are derived from some form of mathematics
and of course are going to have some degree of error. Yet, in building data mining
solutions where we are trying to differentiate customers and/or prospects based on a
desired behaviour, the use of this type of data can indeed provide more powerful inputs to
a data mining solution than using the raw Stats Can data. But again, data miners will
always strive to consider using both the derived granular estimates as well as the more
aggregate raw source Stats Can data as data inputs in any data mining solution.
So far, I have discussed data sources that are available at an aggregate level for better
targeting of prospects for acquisition programs. But individual level data for acquisition
programs can also be purchased. One organization has built a massive consumer database
of individual-level data through survey mailings to Canadians. The incentives for
consumers to fill out this survey are coupons and discount offers for a variety of different
products. One argument against using this data is that the
information is self-reported and that what people say does not necessarily reflect what
they will do. Another argument against using this data is that there may
be a responder bias. In other words, people who complete the survey may not be
representative of the population that comprises our initial audience. Yet, even with these
limitations, this information has provided great value in being able to rank order
and differentiate prospects and to ultimately better target prospects for acquisition
programs. Marketers can use this information in two ways. The first way is to actually
rent names based on this self-reported information while the second way is to actually
build models off the self-reported information and to then determine the best list of
prospects from this database. The decision to rent or to model the names will be based on
the type of business and product offers of the company.
In the business to business world, we are also witnessing the growth of vendors offering
data enhancement services as a means to better target companies. Some companies offer
these services in particular sectors while others deal more exclusively in the large
company size sector. But there are two main players that offer data enhancement as well
as list rental services for all companies regardless of industry sector or size. This
information is at an individual company record level and consists of information such as
industry sector, industry sales, employee size, years in business, and a range of other
firmographic information. Marketers again have the choice of renting names that
are specific to a particular initiative or using the firmographic information to model
these names against that same initiative. Once again, the decision to rent or model names
will be determined by the type of business and product offers of the company.
As you can see from the above discussion, there is no definitive right or wrong when
purchasing data. The type of industry and product offers as well as one’s budget will
impact this decision. Presumably, if one’s budget is large enough and data mining is an
intensive activity within the organization, then the direction will be to purchase more
data. However, if decisions have to be prioritized, one can adopt
the discipline of back testing each of these data sources against previously developed
data mining solutions. For example, we could look at a number of previous acquisition
models and then measure how much results improve with a specific data source. The
amount of improvement would then form the basis of our decision in prioritizing
purchases of data sources.
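One way such a back test might be scored is sketched below. It assumes that a past campaign has already been rescored with each candidate data source added, and it uses lift in the top-scored group as the measure of improvement; that metric, the 20% cutoff, the source names and every number are assumptions made for illustration, not a prescribed method.

```python
def top_fraction_response_rate(scores, responded, fraction=0.2):
    """Response rate among the highest-scored fraction of prospects."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    return sum(flag for _, flag in ranked[:cutoff]) / cutoff


def lift(scores, responded, fraction=0.2):
    """Top-group response rate divided by the overall response rate."""
    overall = sum(responded) / len(responded)
    return top_fraction_response_rate(scores, responded, fraction) / overall


# Actual outcomes from a past acquisition campaign (1 = responded); made-up data.
responded = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# Scores from the existing model, built without any purchased data.
base_scores = [0.90, 0.80, 0.30, 0.40, 0.50, 0.20, 0.10, 0.60, 0.70, 0.35]

# Scores from the same model rebuilt with each candidate source added
# (the source names are placeholders).
candidate_scores = {
    "census": [0.90, 0.40, 0.30, 0.85, 0.50, 0.20, 0.10, 0.60, 0.70, 0.35],
    "taxfiler": [0.85, 0.90, 0.30, 0.40, 0.50, 0.20, 0.10, 0.60, 0.70, 0.35],
}

base_lift = lift(base_scores, responded)
gains = {
    source: lift(scores, responded) - base_lift
    for source, scores in candidate_scores.items()
}

# Rank candidate sources by how much they improve on the existing model.
ranking = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for source, gain in ranking:
    print(f"{source}: lift gain {gain:+.2f}")
```

In practice the comparison would be repeated across several past models, as the text suggests, so that a purchase is not prioritized on the strength of a single campaign.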
Besides the impact on marketing, the purchase of external data can be leveraged in other
functional areas of the business. Many of these other functional areas, such as the
credit risk operations area of a company, employ data mining technologies.
Certainly, the number of different functional areas beyond marketing that are employing
data mining technologies simply reinforces the tendency towards purchasing more and
not less data. From a data mining perspective, more is good as our approach is to examine
as much data as possible and to then identify the ‘nuggets of gold’ for a given solution
rather than excluding data upfront and thereby potentially eliminating these so-called
nuggets of gold.