Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Lecture 4 MARK2039 Winter 2006 George Brown College Wednesday 9-12 Assignment 31) You are given the marketing results of three different initiatives. What would you conclude and why for each initiative listed below? Remember you need to base your conclusion using a 95% confidence interval. Initiative 1: Strategy 1 Strategy 2 Response Rate 5.00% 4.50% Quantity 30000 25000 .13%-Std.dev. 4.24%<=4.50%<=4.76% would do strategy 1 as it is stat. sign.and therefore different from strategy 2. 2 Assignment 3 1) continued Initiative 2 Response Rate 6.00% 3.00% Strategy 3 Strategy 4 Quantity 100 200 Std.dev.=2.37% 1.26%<=6%<=10.74% Would do either strategy as there is no stat. diff. between strategies. Initiative 3 Strategy 4 Strategy 5 Response Rate 45% 40% Quantity 15000 10000 Std.dev.=.489% 39.02%<= 40%<=40.98% would do strategy 4 as it is stat. different and significant from strategy 5. 3 Assignment 3 2) Listed below is a table containing a number of variables. Explain the reason why each variable is useful or not useful in a future analysis. Variables # of records Data Field Format # of unique values # of missing values 1st 3 digits of postal code household size Credit score mortgage account Product code 100000 100000 100000 100000 100000 character numeric numeric character character 1000 10 90000 2 5000 0 20000 95000 0 0 Median Income of Postal Code of record 100000 numeric 20000 0 4 Assignment 3 2)continued 1st 3 digits are not useful. Character-level data will be converted to binary variables(1000) and at this level data is too granular and too sparse for any real impact in a data mining exercise. Household size-useful as % of missing values is below 50%. Credit score: not useful as % of missing values at 95%. Mortgage account: useful-0 missing values and data can be meaningfully grouped for this character type data. Product code: not useful as data is character and will need to be converted to binary variables(5000) . These binary variables will be too granular to find any meaningful patterns. Median Income of Postal Code: useful-data granularity high but ok as variable is numeric and % of missing values is low- 0%. 5 Recap- Data,Data,Data-What Phase of the Data Mining Process are we in? • Data Formats? – Examples? • Data Transformations? What do we mean here? Examples? • In all data mining projects, what must the final data values be? 6 Recap• Data,Data,Data-What Phase of the Data Mining Process are we in? • Data Types? What are they? • What is discrete vs. index vs. continuous and how do they relate to Data Type. – Birthdate -Gender – Product category -Model rank – Spending percentile -Income – Promotion Date -Model Score 7 Recap •Let’s take a look at postal code Postal Code M5A1G5 V6A1A1 ….. H4B2E5 % of Records 0.0020% 0.0035% 0.0029% •How would you use the info here. Create binary variables for every postal code value. Is there another better way to group? 8 Types of Data • Nominal • Ordinal • Interval • Nominal is basically a yes/no variable or variable with outcomes that have no order or sense of magnitude to the numbers – Derived variables are coded as 0,1. – Give me an example of this and how you would create a nominal variable for data mining?. Assume you are analyzing response rate trends for 10000 customers? 9 Types of Data • Ordinal – There is order to the values of the variable – Give me examples of this and what it would like in a data mining exercise. Assume you are analyzing response rate trends for 10000 customers? • Interval – There is a sense of magnitude between two values. – Give me examples of this and what it would like in a data mining exercise. Assume you are analyzing response rate trends for 10000 customers? • How does ordinal and interval differ. Explain it within the context of a data mining exercise where we analyze response? 10 Data Usefulness • When is Data Useful? – Few Missing values – Variable does not consist primarily of one value – Non-Numeric Data consists of only a few values which can be properly grouped into more meaningful categories 11 Examples-Analytical Perspective Variable Income Customer Type Gender Household Size Product Type Customer Name Postal Code # of records 100000 100000 100000 100000 100000 100000 100000 Data Field format numeric character character numeric character character character # of Unique values 50000 4 2 7 3000 100000 50000 # of missing values 2000 10000 50000 90000 5000 0 0 What fields are useful and why? 12 Examples Closer look at income Income % of Records <25000 25000-50000 50000-75000 75000+ Missing 25% 25% 25% 23% 2% Closer look at gender Gender Male Female Missing % of records 23% 27% 50% Create a data mining response rate trend with each variable For both variables, demonstrate how no response rate might exist. 13 Examples • Closer Look at Customer Type Customer Type Gold Bronze Silver Platinum Missing % of records 5% 40% 30% 15% 10% Create a data mining response rate trend with each variable Closer look at Product Type Product Type A001 B001 C003 A010 …. missing Z004 % of records 0.07% 0.08% 0.06% 0.06% Cum. % of records 0.07% 0.15% 0.21% 0.27% 5% 0.08% 99.92% 100% For both variables, demonstrate how no response rate trend might exist. 14 Examples Variables 1st 3 digits of postal code household size Credit score mortgage account Product code Median Income of Postal Code of record # of records Data Field Format # of unique values # of missing values 100000 100000 100000 100000 100000 character numeric numeric character character ? ? ? ? ? 100000 100000 100000 100000 100000 100000 numeric ? 100000 •What variables would be useful here •What would be the number of unique values •What would some of these look like in a data mining response rate analysis exercise? 15 Examples Variables 1st 3 digits of postal code household size Credit score mortgage account Product code Median Income of Postal Code of record # of records Data Field Format # of unique values # of missing values 100000 100000 100000 100000 100000 character numeric numeric character character 100000 100000 100000 100000 100000 0 0 0 0 0 100000 numeric 100000 0 •What variables would be useful here •What would some of these look like in a data mining response rate analysis exercise? 16 Examples-Marketing Perspective • A mortgage company is conducting a campaign to its high value customers. One of the key characteristics of value is high income which is self-reported at time of application. Income < 30000 30000-60000 60000-80000 80000-100000 100000+ missing % of records 5% 5% 20% 10% 10% 50% As a marketer, how will you use this information and what do you need to consider? What might the results be if you applied this learning to a marketing campaign. 17 Examples-Marketing Perspective • An insurance company is marketing an insurance product to people over the age of 60. Listed below is a report indicating the distribution of age. • Age <30 30-40 40-55 55-65 65+ missing % of records 5% 10% 15% 10% 10% 50% As a marketer, how will you use this information? What might the results be if you applied this learning to a marketing campaign. 18 Examples-Marketing Perspective • An retail company has over 1000 product SKU’s. After investigation, it has been determined that the 1st digit represents a broader product category. You have been asked to design the product layout for all stores. Product SKU A000003 A000004 B000005 B000006 …. Z999999 % of records 0.03% 0.02% 0.03% 0.04% Cum. % of records 0.03% 0.05% 0.08% 0.12% 0.02% 100% As a marketer, how will you use this information? 19 Examples-Marketing Perspective • Gender Male Female Missing Income 0-20K 20K-40K 40K-60K 60K-80K 80K+ missing % of records 10% 12% 88% % of records 5% 4% 7% 6% 5% 73% What can be done here, if anything and what else can we consider in terms of using gender and income information? What might it look like in a data mining exercise? 20 Examples-Marketing Perspective • You have postal code information for each customer. You are asked to design customer reports by province.How would you do this? • What would this look like in a data mining response rate analysis exercise? 21 Examples-Data Mining Perspective • You have the following variables and values – Gender: ’M’:Male ‘F’:Female – Income ‘B’: <20 ‘F’: 20-40 – ‘R’:40-60 ‘S’:60-80 ‘T’:80-100 ‘Z’: 100+ • What must be done here? • What would this look like in a data mining response rate analysis exercise? 22 Concepts • Operational Database – Customer DB – Transactional DB • Data Warehouse • Data Mart • Analysis Flat file vs. OLAP • External Data Overlays – Postal Code Overlays – Survey/Registration Data 23 Databases • Operational Databases vs. Data Warehouses vs. Data Marts vs. Analytical File • Operational data consists of information from the source systems – Customer File – Transaction System – Finance System – Operations – Human Resources – Etc. In practice, what do you think an operations database is really dealing with? 24 Databases • Data warehouse – Pulls elements and fields from each source system – May summarize/organize or aggregate information with each system to present the information in a more meaningful way? – Warehouse can comprise information from disparate areas of company – What do we mean by this? 25 Databases • Data mart – Can in many cases be very similar to data warehouses in the way that information is summarized and aggregrated – Pulls elements and fields from each source system – May summarize/organize or aggregate information with each system to present the information in a more meaningful way? – Usually is focussed solely towards one functional area of the company • Marketing data Mart • Let’s think of some information that might be contained in a data mart? 26 Databases Customer Transaction Finance .. Etc. Data Warehouse Data MartMarketing Data MartFinance Data MartEtc. 27 Database Structure • In Database Design, most databases are relational – Creates a key which becomes a database index – This index or key becomes the link between different files Customer Customer Transaction Promotion Customer ID is the link between all the tables Why do we need to think about the notion relational? 28 Database Structure • Relational DB – Database indexes allow very quick processing of data when joining and merging files together • The key in all database design is to create a database that optimizes processing of all information. • In database design, you want the right data to be stored which is useful from a data mining perspective • From a marketing standpoint, can you think of some examples? Why is this important from a data mining standpoint? 29 Database Structure • Other approaches used in speeding up database processing – Inverted flat files • This technology allows each field to be indexed • Very common amongst the leading-edge DB suppliers today. • Is much faster at processing data than traditional relational DB technology • Again, why is this relevant from a data mining perspective? 30 Databases • Analytical File • For most data mining applications, your analytic file needs to be in the format of one record per customer with all known attributes • Generally, the database is not in that format. • ECTL – extraction, clean, transform, load – is the process/methodology for preparing data for data mining Typically a flat file used for analysis What do think is the most important concept for data mining? Databases or Analytical File How do they work together? 31 Databases File 1 -Cust ID -Income -Age -Household Size File 2 -Cust ID -Trans. Type -Trans Date -Trans Amt 32 Databases • In building databases, the notion of continuity management is important • In the context of household or customers on a database, continuity management is the process by which you are able to track customers through events in time. • Why is this important? 33 Analytical file • All data mining algorithms want their input in tabular form – rows & columns as in a spreadsheet or database table Typically, if we saw data like this, what typically needs to be done? Assume reference number is the customer I.D. What does continuity mean here? 34 What the Data Should Look Like • A customer “snapshot” = a single row Each row represents the customer and whatever might be useful for data mining 35 What the Data Should Look Like • The columns – Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of product A, B, C) – Contain the results of calculations referred to as derived variables (e.g., total sales $) Cust Id 123 123 123 456 456 Date Of Purchase jan 4/2006 dec.6/2006 mar.4/2006 apr.6/2006 feb.6/2006 # of Months since last purchase 4 5 2 1 3 Derived variables are Total Price in 1st chart and # of months since last purchase in 2nd chart 36 Sourcing the Data from External Data Sources Typical Data Sources - External • Geo-demographic information – Statistics Canada (aggregated level data) • Census data • Taxfiler data • • Geo-demographic Cluster Codes – Generation 5 – Mosaic – Equifax -Psyte Survey Data – ICOM 37 Sourcing the Data (Extraction) Census data Data collected every 5 years. Enumeration Area level. ~ 250 households on average. ~ 440 households in large urban areas. ~125 households in rural areas. ~ 50,000 EA’s in Canada Can be converted to postal code level and appended to your file. Type of data -immigration/ethnicity/language patterns -occupation -education -income/gender/age/employment -religion 38 Sourcing the Data (Extraction) Taxfiler data. Data collected every year. Postal walk level. ~ 450 households on average. ~ 26,000 Postal Walks in Canada. ·Contains data from previous year tax returns. · Income by source and type. Employment, investment. · RRSP contributions and room. Etc. Can also be appended to your files at postal code level. 39 Sourcing the Data (Extraction) Geo-Demographic cluster codes. • Uses Stats Can data in most cases plus other external data overlays to determine postal code cluster groups – Quebec farm families – Young and Struggling – Empty Nesters – Upper Income Family-Oriented • Equifax – High credit risk – Medium credit Risk – Etc. 40 Sourcing the Data-Stats Can Type Table Postal Area Area 1 Area 1 Area 1 Area 2 Area 2 Area 3 Area 3 Area 3 Median Income Avg. Age 42000 40 42000 40 42000 40 50000 35 50000 35 37000 43 37000 43 37000 43 Avg. Household Size 2 2 2 1 1 3 3 3 % French 10.00% 10.00% 10.00% 85.00% 85.00% 5.00% 5.00% 5.00% 41 Sourcing the Data (Extraction) Typical Data Sources - External • Business to Business “Firmographics” – SIC, Number of Employees, Revenue etc. – Sources: • D&B • CBI / InfoCanada • Scott’s Company XYZ …. Employee Size 1-4 … Industry Classification retail … Sales Size <1 million .. Yrs In business 10 .. 42 Sourcing the Data (Extraction) Typical Data Sources - Survey • Attitudinal- Needs, preferences, social values, opinions • Behavioral- Buying habits, lifestyle, brand usage For most data mining projects, we want to assign a value to all customers; therefore the information used must be available for all customers – survey-based information generally cannot be used as it typically can only be applied a small portion of the database 43 Sourcing the Data (Extraction) Typical Data Sources - Survey • ICOM – Surveys to approx. 10MM Canadians – Fully updated every 2 years – Contains attitude behaviour and purchase behaviours across all industry sectors • What do you think the value is here? 44 Examples • A marketer wants to target high risk cancels for a retention campaign for a Telco. Information is contained in legacy database systems containing a customer file, transaction file, and call detail file. As a marketer and analyst, answer the following requirements – 5 Key Data fields from above files that should be created in analytical exercise – Create a diagram or schema of how this data would be linked into an analytical file – What resources would you need and why? • People • Software 45 Examples • How would the previous example change if the information was available in a data mart or warehouse 46 Examples • • • A university is conducting a fund raising campaign to its alumni(100000 members). On its database, it has the following information: – Age of alumnus – Year graduated – Degree and specialization – Donation value – Current Address It has also collected information from a survey. 10% of members have responded to the survey with the following %’s of members answering the following information: – Current Occupation-5% – Current Income-8% – Why they give?-7% – How much they give As a marketer and analyst, how would you use the information to conduct a campaign to its high value donors 47 Examples • A computer company collects information from all customers who purchase a new product. This new product information is collected through a product registration form which the customer fills in at point of purchase. This information relates to the following: – Product preferences,Income,household size and hobbies All customer tombstone information as well as purchase information related to products bought has been summarized and stored onto a data mart. As a marketer and analyst, how would you use the information to develop a cross-sell campaign. 48 Examples • A credit card company has 100000 customers containing tombstone information and detailed transactional information on their database. 50000 customers have email addresses. 10% of 50000 customers have responded to a survey in which 5% have indicated that they consider themselves loyal customers. Web activity of these loyal customers indicate that many of them have clicked on travel-related packages. • As a marketer and analyst, how would you use this information to sell travel-related insurance. 49