Lecture 3
MARK2039 Winter 2006
George Brown College
Wednesday 9-12

Recap
• What are the four stages of data mining, and who are the stakeholders?
• Data mining measures and metrics
  – Mean
  – Median
  – Mode
  – Standard deviation
• Why are the statistics above important in evaluating numbers?

Boire Filler Group

Recap
• Is the average (mean) appropriate for deriving insight about the behaviour of a group, segment, or sample?
• Why do we need to look at how numbers vary?
• What are some of the measures used to assess variation?

Recap
  Distribution A:  350  500  750  1000  1150
  Distribution B:  700  725  750   775   800
• Consider the two distributions above. What do they mean, and how would you interpret the results?
• Both distributions have the same median and mean.

Recap
  Distribution A:  3  4  5  6  7  8
  Distribution B:  3  4  5  6  7  1000
• What is the problem here?

Recap
• Consider the following two distributions:

           Distribution A   Distribution B
           0                4,500
           1,000            4,600
           2,000            4,700
           3,000            4,800
           4,000            4,900
           5,000            5,000
           6,000            5,100
           7,000            5,200
           8,000            5,300
           9,000            5,400
           10,000           5,500
  Mean     5,000            5,000
  Stdev    3,316.62         331.66

Recap
• For a binomial distribution, such as response, we must use a different formula for the standard deviation: $\sqrt{(p \times q) / N}$, where p is the response rate, q = 1 - p, and N is the sample size.

  Outcome          Value
  Responder        1
  Non-responder    0
  Non-responder    0
  Responder        1
  Non-responder    0
  Responder        1
  Non-responder    0
  Non-responder    0
  Non-responder    0
  Non-responder    0
  Mean             0.300
  Stdev            0.145

Recap
• What are indexes?
• Give me some examples.
• Why are they important in the marketing world?
• What is the most common one used in the marketing world?

Lift
• Lift represents a relative comparison between two numbers. It is a type of index. How is it normally used?
• Typically, it represents the value for a particular group divided by the average (X1/average).
• Example:

  Group           Response Rate
  Target group    2%
  Average         1.50%

Recap-Lift
• Use relative measures and not absolutes.
• The notion of “lift” should be the marketer’s key determinant of success.
• Example:

                Campaign 1         Campaign 2
  Strategy 1    3% resp. rate      23% resp. rate
  Strategy 2    1.5% resp. rate    21.5% resp. rate
  Difference    1.5% resp. rate    1.5% resp. rate

• What is the key learning here?

Assignment 2
1. Answer the following questions on the table listed below:

  Col. A   Col. B   Col. C
  240      4000     300
  250      4000     300
  220      3000     300
  250      2000     400
  240      1000     100
  240      5000     150
  240      3000     150
  260      3000     400
  50       2000     500
  235      3000     2000

Calculate the mean and median for each column of numbers.

            Col. A   Col. B   Col. C
  Mean      222.5    3000     460
  Median    240      3000     300

Assignment 2
What kind of distributions are Col. A and Col. C, and what metric would be best used to communicate to business users?
  Skewed (asymmetric, non-normal). The median is the key measure.
What column would be most reliable in estimating results to a larger population, and why?
  Col. A, as its standard deviation is the smallest, which allows our range around the mean to be much tighter.

Assignment 2 (2 marks)
2. The median height of 65 inches is the same for two classes. Yet the average in one class is 65 inches vs. 70 inches in the other class. What is causing this difference?
  An outlier value (a very tall person) is causing the mean of one class to be 70 inches.

Assignment 2
Calculate the index values for each variable for Customer A. Why are indexes useful in database marketing?
  Spending: .5
  Tenure: .5
  Income: 1.2
  Indexes are useful as relative measures: they compare a value to the average and make it possible to rank-order or prioritize records.

Evaluating test results
• In database marketing, marketers are constantly asked what to conclude from their testing results.
• For instance, are the results of one strategy significantly different from those of another strategy?
• Let’s take a look at some examples.

Evaluating Marketing Test
• Two groups of cells have been tested for different communication strategies. Results are as follows. What would you conclude?

  Strategy   Sample Size   Response Rate
  A          10000         2.30%
  B          5000          2%

Evaluating Marketing Test
• To determine this, you need to do statistical testing, which essentially comprises three factors:
  – Confidence level that you want
  – Actual standard deviation based on the lower sample size
  – Response rate or performance rate
  – For our purposes, we will use a 95% confidence interval, which essentially translates into 2 standard deviations around the mean

Evaluating Marketing Test
• Calculate the following confidence intervals at 95%:
  – 1% with a std. deviation of .1%
  – 2% with a std. deviation of .05%
  – 5% with a std. deviation of .5%
  – 5% with a std. deviation of .3%
• Let’s get back to the problem.

Evaluating Marketing Test
• Two groups of cells have been tested for different communication strategies. Results are as follows. What would you conclude?

  Strategy   Sample Size   Response Rate
  A          10000         2.30%
  B          5000          2%

Evaluating Marketing Test
• Calculate the standard deviation first, using the sample with the lower quantity (Strategy B).
  – Formula: $\sqrt{(p \times q) / N}$
  – Sq. root of (.02 × .98)/5000 = .00198
  – 95% confidence interval:
    • .02 + 2 × .00198 and .02 - 2 × .00198
    • .01604 <= .02 <= .02396
  – Based on this result, what can you conclude about Strategy A versus Strategy B?

Evaluating Marketing Test Results
• Two other groups of cells have been tested for different communication strategies. Results are as follows. What would you conclude?

  Strategy   Sample Size   Response Rate
  A          1000          5.00%
  B          2000          3%

• Suppose A becomes 3.3%. What would you conclude?
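
The standard-deviation and interval arithmetic used in the Strategy A versus Strategy B slides above can be sketched in Python. This is only an illustration: the names `response_rate_interval` and `falls_inside` are hypothetical, and the multiplier of 2 standard deviations follows the slides' shorthand for a 95% confidence level.

```python
import math


def response_rate_interval(rate, sample_size, num_std=2.0):
    """Return (std_dev, lower, upper) for a response rate.

    Uses the slide formula sqrt(p * q / N) with q = 1 - p, and a band of
    num_std standard deviations around the rate (2 is the slides'
    shorthand for 95% confidence).
    """
    std_dev = math.sqrt(rate * (1.0 - rate) / sample_size)
    return std_dev, rate - num_std * std_dev, rate + num_std * std_dev


def falls_inside(other_rate, lower, upper):
    """True when the other strategy's rate lies within the interval."""
    return lower <= other_rate <= upper


# Strategy B from the slides: 2% response on 5,000 records (the smaller cell).
std_b, low_b, high_b = response_rate_interval(0.02, 5000)

# Strategy A responded at 2.3%; check it against Strategy B's interval.
a_inside = falls_inside(0.023, low_b, high_b)

# Second example from the slides: Strategy A at 5% on 1,000 records.
std_a2, low_a2, high_a2 = response_rate_interval(0.05, 1000)
```

Under this approach, a competing rate that falls inside the interval would not be treated as significantly different at the chosen confidence level, while a rate outside it would.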

Evaluating Marketing Test
• Calculate the standard deviation first, using the sample with the lower quantity (Strategy A).
  – Formula: $\sqrt{(p \times q) / N}$
  – Sq. root of (.05 × .95)/1000 = .00689
  – 95% confidence interval:
    • .05 + 2 × .00689 and .05 - 2 × .00689
    • .03622 <= .05 <= .06378
  – Based on this result, what can you conclude about Strategy A versus Strategy B?

Evaluating Marketing Test Results
• Two other groups of cells have been tested for different communication strategies. Results are as follows. What would you conclude?

  Strategy   Sample Size   Response Rate
  A          1000          5.00%
  B          2000          4.0%

• Suppose B becomes 4.0%. What would you conclude?

Evaluating Marketing Test Results
• Calculate the standard deviation first, using the sample with the lower quantity (Strategy A).
  – Formula: $\sqrt{(p \times q) / N}$
  – Sq. root of (.05 × .95)/1000 = .00689
  – 95% confidence interval:
    • .05 + 2 × .00689 and .05 - 2 × .00689
    • .03622 <= .05 <= .06378
  – Based on this result, what can you conclude about Strategy A versus Strategy B?

Evaluating Marketing Test Results
• Having done several of these tests, what will cause your confidence range to narrow?
  – Larger sample size
  – Smaller response rates

Data
Review of Data
Types of Data/Format
• Character-level data
• Numeric data
• Date
• Give me some examples.
• In data mining, what do we have to do with all data before building a solution?

Data Format Examples
• Gender
• Income
• Spending
• Birthdate
• Customer type
• How would you use gender, customer type, and birthdate in a data mining exercise?

Data Transformation
• Gender variable
  – Male = 1, non-male = 0
  – Female = 1, non-female = 0
  – What happens to missing values here?
• Customer type variable
  – Gold member = 1, non-gold member = 0
  – Platinum member = 1, non-platinum member = 0
  – Etc.
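
The 1/0 transformations on the Data Transformation slide above can be sketched as follows. The helper name `to_binary_flags` and the `is_...` flag names are hypothetical, and treating a missing value as "no flag set" is one possible convention rather than the course's stated answer.

```python
def to_binary_flags(value, categories):
    """Return {flag_name: 0 or 1} with one flag per category.

    A missing value (None or an empty string) matches no category,
    so every flag comes back as 0.
    """
    flags = {}
    for category in categories:
        flags[f"is_{category.lower()}"] = 1 if value == category else 0
    return flags


# Gender: one flag for male, one for female.
male = to_binary_flags("Male", ["Male", "Female"])

# A missing gender ends up as non-male AND non-female.
missing = to_binary_flags(None, ["Male", "Female"])

# Customer type: one flag per membership tier of interest.
gold = to_binary_flags("Gold", ["Gold", "Platinum"])
```

Keeping a separate flag per category is what lets a missing value stay distinguishable: it is the only case where both the male and the female flag are 0.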

Data Transformation
• Birthdate
  – Convert birthdate to age
  – Extract the birth year from the birthdate field and subtract it from the current year (e.g. 2005 - 1954)
• Date of last spending activity
  – Create recency of last spend
  – Create tenure variable
  – How would this be done?

Data
• Discrete vs. index vs. continuous
• Discrete
  – Yes/No
  – On/Off
• Convert the above type of data to a 1/0 scenario

Data
• Index-type data

  Customer Type   Average Spend
  Regular         100
  Gold            200
  Platinum        300
  Average         125

  List Source     Average Spend
  A               200
  B               400
  C               600
  Average         400

• We could convert each customer type to a binary value. But what would be a more valuable way to convert or transform this variable?

Data
• Continuous data
  – What are some examples?
• What does it mean when we say that data is continuous?

Data Type
• Looking at data as we have in the last number of slides, we can create what we call data categories:
  – Nominal
  – Ordinal
  – Interval

Data Categories
• Nominal variables are variables where the values do not represent any real order or magnitude.
• Examples:
  – Gender
  – Product category
  – Promotion category

Data Categories
• Ordinal variables represent fields where the values have some order.
• Good examples are:
  – Index-type variables
  – Model rank
  – Etc.

Data Categories
• Interval variables represent fields where the actual values indicate not only order but also magnitude.
  – Income
  – Spend
  – Model score
• What data category is the most granular?
• Which category might you typically expect to be more powerful in a data mining exercise?

Data Usefulness
• When is data useful?
  – Few missing values
  – Variable does not consist primarily of one value
  – Non-numeric data does not consist of too many values that cannot be properly grouped into more meaningful categories

Examples-Analytical Perspective

  Variable         # of records   Data field format   # of unique values   # of missing values
  Income           100000         numeric             50000                2000
  Customer Type    100000         character           4                    10000
  Gender           100000         character           2                    50000
  Household Size   100000         numeric             7                    90000
  Product Type     100000         character           3000                 5000
  Customer Name    100000         character           100000               0
  Postal Code      100000         character           50000                0

• What fields are useful and why?

Examples
• Closer look at income

  Income        % of records
  <25000        25%
  25000-50000   25%
  50000-75000   25%
  75000+        23%
  Missing       2%

• Closer look at gender

  Gender    % of records
  Male      23%
  Female    27%
  Missing   50%

Examples
• Closer look at customer type

  Customer Type   % of records
  Gold            5%
  Bronze          40%
  Silver          30%
  Platinum        15%
  Missing         10%

• Closer look at product type

  Product Type   % of records   Cum. % of records
  A001           0.07%          0.07%
  B001           0.08%          0.15%
  C003           0.06%          0.21%
  A010           0.06%          0.27%
  ….
  missing        5%             99.92%
  Z004           0.08%          100%

Examples

  Variable                                 # of records   Data field format   # of unique values   # of missing values
  1st 3 digits of postal code              100000         character           ?                    100000
  Household size                           100000         numeric             ?                    100000
  Credit score                             100000         numeric             ?                    100000
  Mortgage account                         100000         character           ?                    100000
  Product code                             100000         character           ?                    100000
  Median income of postal code of record   100000         numeric             ?                    100000

• What variables would be useful here?
• What would be the number of unique values?

Examples

  Variable                                 # of records   Data field format   # of unique values   # of missing values
  1st 3 digits of postal code              100000         character           100000               0
  Household size                           100000         numeric             100000               0
  Credit score                             100000         numeric             100000               0
  Mortgage account                         100000         character           100000               0
  Product code                             100000         character           100000               0
  Median income of postal code of record   100000         numeric             100000               0

• What variables would be useful here?

Examples-Marketing Perspective
• A mortgage company is conducting a campaign to its high-value customers. One of the key characteristics of value is high income, which is self-reported at time of application.

  Income         % of records
  <30000         5%
  30000-60000    5%
  60000-80000    20%
  80000-100000   10%
  100000+        10%
  missing        50%

• As a marketer, how will you use this information, and what do you need to consider?

Examples-Marketing Perspective
• An insurance company is marketing an insurance product to people over the age of 60. Listed below is a report indicating the distribution of age.

  Age       % of records
  <30       5%
  30-40     10%
  40-55     15%
  55-65     10%
  65+       10%
  missing   50%

• As a marketer, how will you use this information?

Examples-Marketing Perspective
• A retail company has over 1000 product SKUs. After investigation, it has been determined that the first character of the SKU represents a broader product category. You have been asked to design the product layout for all stores.

  Product SKU   % of records   Cum. % of records
  A000003       0.03%          0.03%
  A000004       0.02%          0.05%
  B000005       0.03%          0.08%
  B000006       0.04%          0.12%
  ….
  Z999999       0.02%          100%

• As a marketer, how will you use this information?
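
One way to act on the product SKU slide above is to roll the individual SKUs up to the broader category carried by the first character. The sketch below assumes that convention; the function name `share_by_category` is hypothetical, and the sample holds only the five SKU rows visible on the slide, so the totals are illustrative rather than the real category shares.

```python
from collections import defaultdict


def share_by_category(sku_shares):
    """Sum the % of records per product category.

    The category is taken to be the first character of each SKU.
    """
    totals = defaultdict(float)
    for sku, pct in sku_shares.items():
        totals[sku[0]] += pct
    return dict(totals)


# Only the rows shown on the slide (values are % of records).
sample = {
    "A000003": 0.03,
    "A000004": 0.02,
    "B000005": 0.03,
    "B000006": 0.04,
    "Z999999": 0.02,
}

category_shares = share_by_category(sample)
```

Collapsing over 1000 sparse SKU values into a few categories is the kind of grouping that turns an otherwise unusable character field into a meaningful one.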

Examples-Marketing Perspective

  Gender    % of records
  Male      10%
  Female    12%
  Missing   88%

  Income    % of records
  0-20K     5%
  20K-40K   4%
  40K-60K   7%
  60K-80K   6%
  80K+      5%
  missing   73%

• What can be done here, if anything, and what else can we consider in terms of using gender and income?

Examples-Marketing Perspective
• You have postal code information for each customer. You are asked to design customer reports by province. How would you do this?

Examples-Data Mining Perspective
• You have the following variables and values:
  – Gender: ‘M’: Male, ‘F’: Female
  – Age: ‘B’: <20M, ‘F’: 20M-40M, ‘R’: 40M-60M, ‘S’: 60M-80M, ‘T’: 80M-100M, ‘Z’: 100M+
• What must be done here?
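
One possible answer to the question above is to recode the character values into numeric fields before modelling. The sketch below is an assumption, not the course's stated solution: gender becomes two 1/0 flags, and the age code becomes an ordinal rank that follows the order in which the codes are listed on the slide. The names `AGE_CODE_ORDER` and `recode_record` are hypothetical.

```python
# Order of the age codes as listed on the slide, lowest band first.
AGE_CODE_ORDER = ["B", "F", "R", "S", "T", "Z"]


def recode_record(gender, age_code):
    """Turn character codes into numeric fields.

    Unknown or missing codes give 0 for both gender flags and None
    for the age rank, so they stay distinguishable from valid values.
    """
    if age_code in AGE_CODE_ORDER:
        age_rank = AGE_CODE_ORDER.index(age_code) + 1
    else:
        age_rank = None
    return {
        "male": 1 if gender == "M" else 0,
        "female": 1 if gender == "F" else 0,
        "age_rank": age_rank,
    }


example = recode_record("M", "R")
unknown = recode_record(None, "Q")
```

The ordinal rank keeps the ordering information in the age bands, which a plain set of binary flags would discard.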