Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Lecture 7 MARK2039 Summer 2006 George Brown College Wednesday 9-12 Exam 1) You are running an analysis to determine the number of customers that are poor credit risk that live in Montreal and that have been promoted in the last month. There are 3 million customers and 50 million promotion records. The analysis has taken over a day. The customer file and promotion file contains the following fields: Answer the following: a) What fields would you pull to do the query. b) Give one suggestion on how would you improve the run time of this query. Customer Account ID Houshold Number Credit Score Postal code Promotion Code Account ID Date of Promotion Promotion Type a)Acct ID, Date of Promotion, credit score, postal code b)Index account ID and make the DB relational 2 Exam 2)Listed below are 3 columns with each column containing 5 valuesi Column A 120 80 40000 140 90 Column B 10 5 20 15 25 Column C 200 20000 18000 22000 24000 Answer the following: 1) What is the mean and median of each column 2) What column contains the normal distribution and why? 3) What would be the better reporting measure for the non normal distribution and why? 1)Col A: mean=8086, median=120 COL. B: mean=15,median=15 Col. C: mean= 16840 median=20000 2)Normal dist. Is B because mean and median are same. 3)Median as it is not skewed by otliers 3 Exam 3) The current expected performance of a given campaign is 4%. Two strategies have been tested with the following results. Strategy Strategy A Strategy B # of names 40000 2000 Response Rate 3.80% 2.00% What would you conclude for each strategy and what would you do for the next campaign based on the learning (Hint: you have to conduct your calculations on both tests) Str. A: std.dev= .00189 – CI: .0361<=.0380<=.03989 Str.B: std.dev=.003 -CI: .014<=.02<=.026 Do not use either strategy and continue with existing strategy 4 Exam 4)Three initiatives are outlined below. Assume that data mining can yield a 15% lift. What initiative would you pursue and why? Show your calculations. -Outbound Telemarketing campaign with an available universe of 75000 names at $3.00 per name -Email Campaign with an available universe of 10000000 names at $.10 per name -Direct Mail Campaign with an available universe of 100000 names at $2.00 per name 75000 86250 Cost Diff: 862 862 $33,750 1.15 1 10000000 115000 11500000 115000 Cost Diff: $150,000 1.15 1 100000 115000 Cost Diff: 1150 1150 1.15 1 $30,000 5 Exam 5)The marketing team wants the flexibility and the ability to conduct its own analysis without I/T or system resources. The customer file and transaction file contains the following fields: Answer the following: a) What type of technology would you use b) Give me a design that contains three dimensions and one measure c) Provide a query that can be conducted based on your above design. a)Cube b)dimensions:product type,1st digit of postal code,payment type Measure: acct Id c)Give me count of all customers who bought prod. A with cash 6) You are given the postal code data of each customer for company XYZ. How might company XYZ use this information to better target prospects to become customers Determine number of customers in postal code, determine number of persons in postal code from Stats Can data. Create penetration index: Number of customers/ number of persons at postal code. Rank postal codes by penetration index and use ranked postal codes to target prospects. 6 Exam 7)You are given a customer file with postal code data only. You can then append Stats Can taxfiler data and Stats Can Census Data. Which data would be richer in terms of providing more granular data and why? What might be the advantage of using Stats Can Taxfiler data. Stats Can Census is richer as it has more records(50000 vs. 28000 for taxfiler Advantage of using Taxfiler data is that data is more recent 8. Answer the following Questions a)What is the last stage of data mining? b)What is more important in data mining-reducing costs or maximizing revenues ? c)What must happen to the data before it gets used in a data mining application? d)What is the metric that allows us to look at how data varies within a population? a) b) c) d) Implementation Reducing costs Must be one to one in analytical file Standard deviation or variation 7 Exam 9) What is a more accurate estimate of weight -Sample A: 150 lbs with std. dev of 5 lbs -Sample B- 25 pounds with std. of 4 lbs. Explain why? Sample A , although std. dev. is larger, if we look at std. dev. on a relative basis when comparing to the range or magnitude of values in the sample, we will observe that we are getting a much tighter bound around A rather than B 10) Give me one example of a legacy type system file. Give me one advantage of why you might build a data mart Legacy: billing or call detail files,external data such as Stats Can Advantage to building data mart is the following: -data aggregated and summarized-easier to use for analysis -Quicker processing -Easier intrpretation as data deals solely with functional area 8 Exam 11) Answer yes or no on whether data mining should be used i) ii) iii) iv) v) Creating a national advertising program Identifying your most profitable customers Trying to maximize the revenue of a campaign. Using Survey Results(10% of customer base) to create a targeted customer list Analyzing the results of a direct marketing campaign. i)No,ii)yes,iii)No,iv)No,v)yes 12) Listed below is a table containing 5 variables. For each variable, do the following a)Indicate if it is nominal, ordinal or interval b)Indicate whether the variable is useful and provide 1 sentence for your reasoning. Variable Promotion Date Promotion Codes Income Number of Children Credit Decile Rank # of records 100000 150000 75000 75000 75000 # of unique values 1 5000 70000 6 10 # of missing values 0 0 70000 10000 0 Prom.Date-interval,not useful,only one value Prom.codes-nominal-not useful-too granular Income-interval-not useful too many missing values Number of children: interval-useful-few missing values Credit decile rank: ordinal-useful-0 missing values 9 Creating the Analytical File-Reviewing Data Dumps Initial dump of 1st few records Account Number 123456 345231 543236 Postal Code M5A3S6 H3A2B4 T5A3S7 etc… Birth Date 07/49 08/54 06/92 Start Behave. Income # in Date Score House 03/91 500 30000 6 04/92 550 42500 1 600 35000 3 543210 Missing values in data are not properly being treated. 10 Creating the Analytical File-Reviewing Data Dumps Initial dump of st 1 few records Proper treatment of missing values results in the following dump: Account Number 123456 345231 543236 543210 Postal Code M5A3S6 H3A2B4 T5A3S7 etc… Birth Date 07/49 08/54 Start Date 03/91 04/92 06/92 Behav. Income # in Score House 500 30000 6 550 42500 1 600 35000 3 Effective programming can ensure that records are being properly loaded into the system. 11 Creating the Analytical File-Reviewing Data Dumps View of the Transaction File A dump of a few records from a billing file revealed the following after sorting by account number Account 123460 123460 456720 456720 333121 333121 789232 789232 Purchase Amount Product Category Date of Purchase $50 $75 $90 $100 $25 $40 $30 $20 ABC123 DEF789 GHI123 ABC456 JKL432 GHI342 GHI261 236phi 19980630 19980703 19980701 19980715 19980315 19980401 19980228 19980307 12 Creating the Analytical File-Reviewing Data Dumps View of the Promo History File A dump of a few promotion history records revealed the following after sorting by account number: Account No. Promotion ID Promotion Date 123460 123460 123460 456720 456720 456720 456720 333121 789232 ABA123 ACB431 AAC221 BAA123 BBA321 BCB330 BAC112 CBA321 BAD333 19970115 19970315 19970618 19970115 19980115 19980315 19980618 19980115 19980415 13 Creating the Analytical File-Reviewing Data Dumps • Using your marketing knowledge, give me examples of variables that we might create from the last three slides – Slide 11 – Slide 12 – Slide 13 • Slide 11: Age, region of country, tenure • Slide 12: Total Amount, Total amount for a given product, and recency of purchase. • Slide 13: Total promotions, Total Promotions by Type and recency of last promotion 14 Creating the Analytical File-Data Hygiene and Cleansing • Once the data has been dumped in order to view records, typically data hygiene and cleansing have to take place • Two key deliverables – Clean name and address information – Standard rules for coding of data values 15 Creating the Analytical File-Data Hygiene and Cleansing • Clean Name and Address Information – Market to right Individual – Create Match keys 16 Creating the Analytical File Name and Address Standardization • Clean Name and Address Information – Market to right Individual – Create Match keys – Name and Address Standardization BankID 987654321 Name JONH SMITH JR. Address1 123 WILLIAMS STRET Address2 2ND FLOOR Address3 TRT., O.N. M5G-1F3 Country CDN UnIndivID 123456789 BankID PreName FirstName Surname PostName Street1 Street2 City Province Postal Code Country UnIndivID Origin 987654321 JONH SMITH JR. 123 WILLIAMS STRET 2ND FLOOR TRT O.N. M5G-1F3 CANADA 123456789 Bank 17 Creating the Analytical File-Name and Address Standardization DATA CLEANING • • • • Address correction Name parsing Genderizing Casing BankID PreName FirstName Surname PostName Street1 Street2 City Province Postal Code Country UnIndivID Origin 987654321 JONH SMITH JR. 123 WILLIAMS STRET 2ND FLOOR TRT O.N. M5G-1F3 CANADA 123456789 Bank BankID PreName FirstName Surname PostName Street1 Street2 City Province Postal Code Country UnIndivID Origin 987654321 Mr. John Smith Jr. 200-123 Williams Street Toronto ON M5G 1F3 Canada 123456789 Bank 18 Creating the Analytical File-Merge Purge of Names • What are the reasons for creating unique match customer keys – Generating a marketing list – Conducting analysis Should the match keys be the same for both above scenarios? No, tighter matchkeys in generating lists and looser matchkeys when conducting analysis What are the situations when match keys that are numeric? When dealing with existing customer data where you are matching Files involving only existing customer data. 19 Creating the Analytical File-Merge Purge of Names Common fields to use in creating Match keys • First Name; • Surname; • Unique Individual ID; • Postal Code • Credit Card Number • Duns Number for Businesses • Phone Number Unique I.D’s or number type I.D’s are the preferred choice when creating match keys • Let’s take a closer look at creating match keys using name and address 20 Creating the Analytical File-Merge Purge of Names • Let’s take a look at 6 records and see what this means. Surname First Name Smith John Smith Brown James Tim Brown Green Green Filler T. Ted Tanya Robert Filler Larry Address 12345 Elm Street 45678 Elm Street 5678 Oak 5678 Oak Road 3478 Pine 3478 Pine 2345 Nurr 5672 Bolton Dr. Postal Code Match Key L1A2A1 L1A2A1SMITHJ L1A2A1 M5A3A2 L1A2A1SMITHJ M5A3A2BROWNT M5A3A2 V6A2A1 V6A2A2 M5A3A2 M5A3A2BROWNT V6A2A1GREENT V6A2A1GREENT M5A3A2FILLERR M6A2A1 M6A2A1FILLERL 21 Creating the Analytical File-Merge Purge of Names • Example: You have one record here: – Richard Boire-4628 Mayfair Ave. H4B2E5 – How would you use the above information for a backend analysis if I were a responder to an acquisition campaign? BOIREH4B2E5 – What about if you were conducting analysis on me as an existing customer who responded to a cross-sell campaign. – Need only customer id – How about if you wanted to send me a direct mail piece – BOIRERICHARDH4B2E54628MAYFAIR 22 Creating the Analytical File- Data standardization • Refers to a process where values from a common variable from different files are mapped to the same value. Some common examples: • SIC Code Industry Classification Table – Industry categories have common set of codes • Postal Code Variable – Postal code has to have 6 digits comprised of alpha,numeric,alpha,numeric,alpha,numeric which exclude the following alphas: D,F,O,Q,U, and Z. • Give me examples of bad postal codes vs. good postal codes. – D4B2E5, H442E6,etc. are bad postal codes. – M5J1A1, A1A1A3,etc. are good postal codes 23 Creating the Analytical File- Data Standardization • Here is an example of how disposition codes for telemarketing outcomes might be handled Code 21 21 21 32 9 U28 B22 B23 Description Do Not Call Do Not Call Do Not Call Do Not Call Do Not Call - Place on “Do Not Call” list permanently Do Not Solicit - Do not call, mail, email or attempt any other form of solicitations to this customer Do Not Mail - Place permanently on “Do Not Mail” list; future calling solicitations ok No sale - Do not sollicitate Never call again, <<Client>> Never call again, general C08 Scrubbed Vendor DNS 20 22 24 Creating the Analytical File- Data Standardization • Postal Code Standardization – Six digit code comprising Alpha,numeric,alpha,numeric,alpha,numeric – 1st letters: A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y • SIC(Standard Industry Code Classification – 4 digit code used to classify all companies into standard set of industries 25 Creating the Analytical File- Data standardization • Example: – You have been asked to build retention model You have two years worth of transaction data. Changes in the product category codes occurred six months ago. Key information that you would look at would be as follows: • Income category • Product Category • Transaction Codes • Transaction Amount • Postal Code • Transaction Date • Gender What would you need to do • Need to map the old product category code definitions from prior to six months ago to the new product category code definitions 26 Creating the Analytical File- Geo-Codingn • Geocoding is the process that assigns a latitude-longitude coordinate to an address. Once a latitude-longitude coordinate is assigned, the address can be displayed on a map or used in a spatial search. • Data miners often use these coordinates to calculate such things as “distance to the nearest store” 27 Demographic Analysis Geo Profile Population Count Age Distribution Average Age Store Location 28 Creating the Analytical File-What is Geocoding? • Let’s look at a sample of what some data might look like? Postal Code A1A5A2 B5V1A2 M6B2A2 T4B1A2 V4H2B5 latitude Longitude 5 10 7 20 10 30 6 40 11 50 How do we use this data to create meaningful variables? -using the pythagorean theorem where distance**2=lat**2+ longitude**2. This is extremely useful in calculating distance type variables between a customer and a given location 29 Creating the Analytical File-What is Geocoding • Example: – A retailer has the following information: • Name and address of its customers • Address of its stores • Stats Can Information – As a marketer, how would you intelligently use this information – Find the distance between the nearest store and a given customer. – Create a trading area around a given store. Find out which stores have the best penetration. At the same time, analyze these best penetration stores and determine some key stats can attributes around these best penetration stores 30 Frequency Distribution • The report below uses first digit of postal code to assign customers to region. • For example, postal codes beginning with ‘G’, ‘H’, or ’J’ represent the Quebec region. Region Prairie Provinces Quebec Ontario West Missing Values Total Customer Profiling # of Customers 25 M 100 M 350 M 25 M 500 M 1 MM % of Total 2.5% 10% 35% 2.5% 50% 100% Frequency Distribution Tenure 1998 1999 2000 2001 Missing Total # of % of Customers Customers 9800 14% 10000 14% 12000 17% 8000 11% 30000 43% 69800 100% This tenure report would tell us that the tenure field was not on this database prior to 1998 and that 30,000 customers began prior to that date. Given the high percent of customers with missing values, we would need to determine whether we could capture tenure from another field in the database or not use 32 Frequency Distribution Type of # of Product/Services Purchased Customers Product A 35000 Product B 40000 Product C 25000 Product D 15000 Other 3000 Total 118000 % of Customers 29.66% 33.90% 21.19% 12.71% 2.54% 100.00% The Product/service field has good coverage and concludes that product B has been the best selling product, followed closely by product A 33 Creating Variables Source/ Raw File Variables # in Household Derived Variables Region of country Income Total spend within certain period Credit score Age Total lifetime spend Tenure Total number of promotions Number of promotions in last year by campaign category •Example of source variables •Example of derived variables 34 More Creations • Other variables – – – – Total spend in certain time periods Total spend by product category in certain time periods Decline in spend-total & by product type Trend variables related to spending and product category: • Median • Mean • Variation – Index Variables • Grouping of variable into meaningful categories where category values are index values • Binary Variables-yes/no type variables such as gender 35 Creating the Analytical File-Reviewing Data Dumps View of the Transaction File A dump of a few records from a billing file revealed the following after sorting by account number Account 123460 123460 456720 456720 333121 333121 789232 789232 Purchase Amount Product Category Date of Purchase $50 $75 $90 $100 $25 $40 $30 $20 ABC123 DEF789 GHI123 ABC456 JKL432 GHI342 GHI261 236phi 19980630 19980703 19980701 19980715 19980315 19980401 19980228 19980307 •What kind of variables can be derived. 36 Creating Binary Groups Income under 20K 20-30K 30-40K 40-55K 55-80K 80K+ Average % of Customers 16% 16% 16% 16% 16% 16% 100% Response Response Income> Rate Index 40K 1.50% 0.43 2.50% 0.71 0 2.00% 0.57 6% 1.71 5% 1.43 1 4% 1.14 3.50% 1.00 37 Creating Indices # of Months Since Last Promotion 1 2 3 4 5 6 Average % of Customers Response Rate Response Index 16% 16% 16% 16% 16% 16% 100% 2.50% 1.50% 3.75% 3.25% 6.00% 4.00% 3.50% 0.71 0.43 1.07 0.93 1.71 1.14 1.00 Months Since Last Promotion 0.57 0.62 1.00 1.43 38 More Variable Creation Spending 0-100 100-200 200-300 300-400 400+ # of customers 1000 1000 1000 1000 1000 Response Rate 1% 0.80% 1.20% 0.90% 0.95% •What would you do here •Is there any trend? Given that there seems to be no trend or impact between spend and response, it is highly unlikely that further information would be derived from this field. 39 More Variable Creation Tenure < 1 year 1-2 yrs 2-3 yrs 3-4 yrs 4yrs+ # of customers 1000 1000 1000 1000 1000 Response Rate 3% 2.00% 1.00% 0.75% 0.30% •What would you do here? •Here, this variable in all likelihood would be useful given its trend with response rate. 40 Stage 3 of Data Mining • What stage are we at: – Application of data mining tools • Give me some examples of what data miners would be doing in stage 3 – Data discovery • Data Audit/Frequency Distribution Analysis, Value Segmentation – – – – Models,profiles,etc. Post Campaign Analysis Reporting i.e such as standard KBM-Key Business Measure Reports AdHoc Reports • Modelling and profiling represent some examples of what we might be doing in this stage. 41 Types of Predictive Models • Examples:Discrete Models – Response Models • Cross Sell • Upsell • Acquisition – Attrition Models – Product Affinity Models – Risk Models 42 Types of Predictive Models • Examples-Continuous Models – Profitability/Value Models – Spending Models • What is the concept of the objective function or dependant variable? – This the variable that we trying to predict • Response,bad credit,defection,spend,etc. – What are we trying to optimize essentially becomes our objective function. – This is the variable we are trying to predict 43