What to consider when purchasing external data for data mining

I’ll begin this article with the usual data mining motherhood statement: ‘The value of any data mining exercise is contingent on the data.’ Having said that, purchase decisions regarding various data sources can become extremely important in any data mining exercise. Data purchased for data mining purposes is used not to replace existing data but to supplement the existing data environment. The question remains whether this new data can provide significantly better results than what we already have within our existing data environment. This question is more easily asked than resolved. Resolving it involves a number of considerations by the purchaser, and this article will attempt to address some of them.

Using external data for Existing Customer Programs

First of all, the purchaser must determine whether the data needs are for acquisition programs or for existing customer programs. For existing customer programs, most organizations in today’s CRM-driven environment have databases of existing customers and at least their purchases. The organization and structure of this data may vary from company to company depending on the level of database marketing sophistication. Yet regardless of how this data is structured and organized, for data mining purposes the data is at an individual level. Data miners will always strive to use as much individual-level data as possible, since this will always provide superior results. But these results can be compromised if there are quality issues with the individual-level data. For example, age and income might be reported at an individual level. Yet if over 90% of customer records contain missing values in these fields, the individual-level data on the remaining 10% is not going to be very useful in a data mining project.
In this case, one should look at aggregate-level data sources such as Statistics Canada, and specifically Stats Can Census data. Aggregate-level data means that customers residing in the same postal area would have the same Stats Can values, while customers residing in different postal areas would have different Stats Can values. Appending aggregate-level data (Stats Can Census area) to this customer file would at least provide age and income information for the 90% of records which do not have any of this information at all. Using this aggregate-level data for income and age would yield superior results to the status quo individual-level data, where 90% of the information is missing.

Besides enhancing information where there are data quality issues, aggregate-level data can also provide breadth of information. For example, areas such as ethnicity, occupation, religion, education, and a range of other Stats Can demographic information are unlikely to be directly available on any customer database. Appending this type of information to an existing customer database can add some value to a modeling or profiling exercise. However, the rich individual-level information related to the purchases or transactions of the customer will produce the stronger modeling/profiling variables. Yet this aggregate-level demographic information can have another impact by providing some broad-level insights that can be used to develop better communication strategies.

Using Data for Acquisition Programs

Yet the bigger impact of external data is going to be its use in acquisition programs. In fact, suppliers of data market themselves more on its overall impact in acquiring new customers.
This need for external data represents the actual foundation in building any data mining solution for acquisition programs, rather than the supplemental information it provides for existing customer programs. Typically, name, address, and postal code represent the available pieces of data for any acquisition program. With postal code, though, we have the key link in being able to append Stats Can data to these name and address records.

Stats Can data is offered under two main types of products. The first product is Stats Can taxfiler data. This data is organized at a postal walk level, which represents approximately 800 households, and contains information compiled from annual tax returns. Such data would contain income-related information such as income earned from employment, investment income, charitable deductions, etc. The second file, Stats Can Census data, contains information at an enumeration area level, or approximately 400 to 500 households. Although it contains some income data, it lacks the other measures of wealth that are contained in the taxfiler data. However, this file is much richer in demographic information and contains information pertaining to ethnicity, religion, language, education, occupation, etc. Furthermore, it is more granular, as data is aggregated for roughly every 400 households as opposed to 800 households, which is the case with the taxfiler data. One limitation of this data, though, is that it is updated only every 5 years, which is when the Stats Can Census survey is conducted.

In comparing and determining which of the above Stats Can sources to purchase, the decision can often depend on the industry sector of the company. Taxfiler data may often represent the initial purchase of financial institutions, while Stats Can Census data may often represent the initial purchase of retail organizations. Larger organizations in many cases will purchase both types of data sources, as the cost to purchase both is under 50M.
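As a rough illustration of how postal code acts as the key link, the sketch below appends area-level values to prospect records that carry only name, address, and postal code. Everything in it is hypothetical: the function name `append_area_data`, the two lookup tables (postal code to area, and area to statistics), and the field names are assumptions made for the example, not the layout of any actual Stats Can product.

```python
from typing import Dict, List, Optional


def append_area_data(
    prospects: List[Dict[str, str]],
    postal_to_area: Dict[str, str],
    area_stats: Dict[str, Dict[str, float]],
) -> List[Dict[str, object]]:
    """Return copies of the prospect records with area-level fields appended.

    Every prospect in the same area receives the same values; prospects whose
    postal code cannot be matched receive None for each appended field.
    """
    # Collect every statistic name once so all output rows share the same fields.
    all_fields = sorted({name for stats in area_stats.values() for name in stats})

    enriched = []
    for record in prospects:
        # Normalise the postal code so "m5v 2t6" and "M5V2T6" match.
        postal = record["postal_code"].replace(" ", "").upper()
        area_id: Optional[str] = postal_to_area.get(postal)
        stats = area_stats.get(area_id, {})

        row: Dict[str, object] = dict(record)  # leave the input record untouched
        row["area_id"] = area_id
        for name in all_fields:
            row[f"area_{name}"] = stats.get(name)
        enriched.append(row)
    return enriched


if __name__ == "__main__":
    # Made-up illustrative records and values.
    prospects = [
        {"name": "Prospect A", "postal_code": "m5v 2t6"},
        {"name": "Prospect B", "postal_code": "K1A 0B1"},
    ]
    postal_to_area = {"M5V2T6": "AREA001"}
    area_stats = {"AREA001": {"median_income": 61000.0}}
    for row in append_area_data(prospects, postal_to_area, area_stats):
        print(row)
```

Keeping unmatched prospects in the output with empty area fields, rather than dropping them, lets the analyst see how much of the file the purchased data actually covers.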
The Census data can be used for 5 years, while the taxfiler data, which is updated annually, could arguably be used for two years before requiring any update.

The Suppliers/Players in Data Enhancement Services

The business of supplying data has grown significantly over the last twenty years. This growth can be attributed directly to the growth in data mining. Twenty years ago, besides Statistics Canada, there was one other organization offering data enhancement services. Within the Toronto area, there are now six organizations, with both Statistics Canada and Canada Post actually developing more service-oriented strategies around data enhancement.

Originally, service in this area comprised the development of demographic clusters across Canada. A company could purchase these clusters for marketing purposes. For example, prospects for an acquisition program would be assigned to cluster codes based on their postal code. The marketer would then determine the appropriate clusters to target for a given acquisition program. Marketers could also use this approach in targeting existing customers. But, as mentioned above, the company’s individual-level data (customer and transactions) will provide information that will always be superior to a demographic cluster solution based on aggregate-level data.

Historically, clusters were used solely as a means to target prospects. But as the level of sophistication regarding the use of data has increased, analysts are looking not only at the clusters but also at the raw source data which was used in building these clusters. The analytical mindset is to consider a variety of data inputs when building data mining solutions. In response to this mindset, the data suppliers themselves have become more sophisticated by offering data enhancement products beyond just cluster codes. One company offers data enhancement products which provide increased targeting capabilities amongst the different ethnic groups.
Another organization offers demographic postal area information that goes beyond the information provided by Stats Can Census and Taxfiler data. In fact, this organization uses both these sources as well as a variety of other data sources to provide annual demographic data at a postal code level. Although this data is much more granular (postal code level) and would therefore appear to be superior to the more aggregate-level Stats Can data, it is important to know that this postal code demographic data represents estimates and not raw or source data. Estimates, as we all know, are derived from some form of mathematics and of course are going to have some degree of error. Yet in building data mining solutions where we are trying to differentiate customers and/or prospects based on a desired behaviour, the use of this type of data can indeed provide more powerful inputs to a data mining solution than using the raw Stats Can data. But again, data miners will always strive to consider both the derived granular estimates and the more aggregate raw source Stats Can data as inputs in any data mining solution.

So far, I have discussed data sources that are available at an aggregate level for better targeting of prospects for acquisition programs. But individual-level data for acquisition programs can also be purchased. One organization has built a massive consumer database of individual-level data through survey mailings to Canadians. The incentives for consumers to fill out this survey are coupons and discount offers for a variety of different products. One argument against using this data is that the information is self-reported and that what people say does not necessarily reflect what they will do. Another argument against using this data is that there may be a responder bias.
In other words, people who complete the survey may not be representative of the population that comprises our initial audience. Yet even with these limitations, this information has provided great value in being able to rank order and differentiate prospects and ultimately to better target prospects for acquisition programs. Marketers can use this information in two ways. The first is to rent names based on this self-reported information, while the second is to build models off the self-reported information and then determine the best list of prospects from this database. The decision to rent or to model the names will be based on the type of business and product offers of the company.

In the business-to-business world, we are also witnessing the growth of vendors offering data enhancement services as a means to better target companies. Some companies offer these services in particular sectors, while others deal more exclusively in the large company size sector. But there are two main players that offer data enhancement as well as list rental services for all companies regardless of industry sector or size. This information is at an individual company record level and consists of information such as industry sector, industry sales, employee size, years in business, and a range of other firmographic information. Marketers again have the choice of renting names that are specific to a particular initiative or using the firmographic information to model these names against that same initiative. Once again, the decision to rent or model names will be determined by the type of business and product offers of the company.

As you can see from the above discussion, there is no definitive right or wrong when purchasing data. The type of industry and product offers, as well as one’s budget, will impact this decision.
Presumably, if one’s budget is large enough and data mining is an intensive activity within the organization, then more data is going to be the direction when making data purchases. However, if decisions have to be prioritized, one can adopt the discipline of back testing each of these data sources against previously developed data mining solutions. For example, we could look at a number of previous acquisition models and then view how much results improve with a specific data source. The amount of improvement would then form the basis of our decision in prioritizing purchases of data sources.

Besides the impact on marketing, the purchase of external data can be leveraged in other functional areas of the business. Many of these other functional areas, such as the credit risk operations area of a company, employ data mining technologies. Certainly, the number of different functional areas beyond marketing that are employing data mining technologies simply reinforces the tendency towards purchasing more and not less data. From a data mining perspective, more is good, as our approach is to examine as much data as possible and then identify the ‘nuggets of gold’ for a given solution rather than excluding data upfront, thereby potentially eliminating these so-called nuggets of gold.
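The back-testing discipline mentioned above can be sketched in a few lines. The example assumes that a past acquisition campaign has already been rescored twice: once by the original model and once by a version of the model rebuilt with each candidate data source. It then compares lift in the top-scored slice of prospects. Lift is only one possible yardstick, chosen here for illustration, and the function names and source names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def top_fraction_lift(
    scores: Sequence[float], responded: Sequence[int], fraction: float = 0.1
) -> float:
    """Response rate among the top-scored fraction divided by the overall rate."""
    if len(scores) != len(responded) or not scores:
        raise ValueError("scores and responded must be non-empty and of equal length")
    overall_rate = sum(responded) / len(responded)
    if overall_rate == 0:
        raise ValueError("no responders in the back-test sample")
    # Rank prospects from highest to lowest score and keep the top slice.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(flag for _, flag in ranked[:top_n]) / top_n
    return top_rate / overall_rate


def rank_sources_by_improvement(
    baseline_scores: Sequence[float],
    scores_by_source: Dict[str, Sequence[float]],
    responded: Sequence[int],
    fraction: float = 0.1,
) -> List[Tuple[str, float]]:
    """Order candidate data sources by their gain in lift over the baseline model."""
    baseline_lift = top_fraction_lift(baseline_scores, responded, fraction)
    gains = {
        source: top_fraction_lift(scores, responded, fraction) - baseline_lift
        for source, scores in scores_by_source.items()
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up back-test sample: 1 means the prospect responded to the past campaign.
    responded = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    baseline = [0.90, 0.80, 0.10, 0.30, 0.20, 0.15, 0.12, 0.11, 0.05, 0.04]
    candidates = {
        "census": [0.90, 0.40, 0.10, 0.85, 0.20, 0.15, 0.12, 0.11, 0.05, 0.04],
        "taxfiler": [0.88, 0.82, 0.10, 0.35, 0.20, 0.15, 0.12, 0.11, 0.05, 0.04],
    }
    for source, gain in rank_sources_by_improvement(baseline, candidates, responded, 0.2):
        print(f"{source}: lift gain {gain:+.2f}")
```

In practice the gains would be averaged over several previous acquisition models, as the text suggests, before being weighed against the cost of each source.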