Lecture 7
MARK2039
Summer 2006
George Brown College
Wednesday 9-12
Exam
1) You are running an analysis to determine the number of customers who are poor
credit risks, live in Montreal, and have been promoted in the last month.
There are 3 million customers and 50 million promotion records. The analysis has
taken over a day.
The customer file and promotion file contain the fields listed below.
Answer the following:
a) What fields would you pull to do the query?
b) Give one suggestion on how you would improve the run time of this query.
Customer
Account ID
Household Number
Credit Score
Postal code
Promotion Code
Account ID
Date of Promotion
Promotion Type
a) Account ID, Date of Promotion, Credit Score, Postal Code
b) Index Account ID and make the database relational
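As a rough illustration of answer b), the sketch below builds two small tables in an in-memory SQLite database, indexes the promotion table on account ID, and runs the count. The table names, column names, sample rows, the credit-score cutoff of 500, and the use of a postal code starting with 'H' to stand for Montreal are all assumptions made for the example, not part of the exam.

```python
import sqlite3

def count_promoted_poor_risk(conn, score_cutoff, since_date):
    """Count distinct Montreal customers below score_cutoff promoted on or after since_date."""
    sql = """
        SELECT COUNT(DISTINCT c.account_id)
        FROM customer AS c
        JOIN promotion AS p ON p.account_id = c.account_id
        WHERE c.credit_score < ?
          AND c.postal_code LIKE 'H%'   -- assumption: 'H' postal codes stand for Montreal
          AND p.promo_date >= ?
    """
    return conn.execute(sql, (score_cutoff, since_date)).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (account_id INTEGER PRIMARY KEY, credit_score INTEGER, postal_code TEXT);
    CREATE TABLE promotion (account_id INTEGER, promo_date TEXT, promo_type TEXT);
    -- The index lets the join look up promotions by account instead of scanning every row.
    CREATE INDEX idx_promotion_account ON promotion (account_id);
""")
# Hypothetical sample rows; dates are stored as YYYYMMDD text so they compare correctly.
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, 450, "H3A2B4"),
    (2, 700, "H4B2E5"),
    (3, 430, "M5A3S6"),
    (4, 480, "H2X1Y4"),
])
conn.executemany("INSERT INTO promotion VALUES (?, ?, ?)", [
    (1, "20060615", "DM"),
    (1, "20060620", "TM"),
    (2, "20060618", "DM"),
    (3, "20060619", "DM"),
    (4, "20060301", "DM"),
])
result = count_promoted_poor_risk(conn, 500, "20060601")
print(result)
```

Only the four fields from answer a) are touched by the query, which is the point of pulling a narrow set of fields before joining 50 million promotion records.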
2
Exam
2) Listed below are 3 columns, each containing 5 values:
Column A   Column B   Column C
     120         10        200
      80          5      20000
   40000         20      18000
     140         15      22000
      90         25      24000
Answer the following:
1) What are the mean and median of each column?
2) Which column contains the normal distribution and why?
3) What would be the better reporting measure for the non-normal distributions
and why?
1) Col. A: mean=8086, median=120
Col. B: mean=15, median=15
Col. C: mean=16840, median=20000
2) The normal distribution is B because its mean and median are the same.
3) The median, as it is not skewed by outliers.
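The answers to question 2 can be checked with a few lines of Python. This is only a sketch using the standard library's statistics module on the three columns from the slide; the "looks symmetric" check is simply the mean-equals-median rule of thumb used in the answer, not a formal normality test.

```python
from statistics import mean, median

columns = {
    "A": [120, 80, 40000, 140, 90],
    "B": [10, 5, 20, 15, 25],
    "C": [200, 20000, 18000, 22000, 24000],
}

# Mean and median per column.
summary = {name: (mean(values), median(values)) for name, values in columns.items()}

# Rule of thumb from the answer: mean == median suggests a symmetric distribution.
symmetric = [name for name, (m, md) in summary.items() if m == md]

for name, (m, md) in summary.items():
    print(f"Col. {name}: mean={m}, median={md}")
```

Columns A and C each contain one outlier that pulls the mean far from the median, which is why the median is the better reporting measure for them.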
3
Exam
3) The current expected performance of a given campaign is 4%.
Two strategies have been tested with the following results.
Strategy     # of names   Response Rate
Strategy A        40000           3.80%
Strategy B         2000           2.00%
What would you conclude for each strategy, and what would you do for the next
campaign based on the learning?
(Hint: you have to conduct your calculations on both tests)
Str. A: std. error = sqrt(.038 x .962 / 40000) = .00096; CI (about 2 std. errors): .0361 <= .0380 <= .0399
Str. B: std. error = sqrt(.02 x .98 / 2000) = .003; CI (about 2 std. errors): .014 <= .02 <= .026
Both intervals sit entirely below the expected 4%, so do not use either strategy and continue with the existing strategy.
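The interval calculation for both tests can be sketched as follows. The function name is hypothetical, and the 1.96 multiplier (a 95% interval under the normal approximation) is an assumption; the slide's figures correspond to roughly two standard errors either side of the observed rate.

```python
from math import sqrt

def response_ci(rate, n, z=1.96):
    """Confidence interval for a response rate using the normal approximation."""
    std_error = sqrt(rate * (1 - rate) / n)
    return rate - z * std_error, rate + z * std_error

EXPECTED = 0.04  # current expected campaign performance

ci_a = response_ci(0.038, 40000)
ci_b = response_ci(0.02, 2000)

for name, (low, high) in (("A", ci_a), ("B", ci_b)):
    verdict = "below expected" if high < EXPECTED else "not clearly below expected"
    print(f"Strategy {name}: {low:.4f} to {high:.4f} ({verdict})")
```

Strategy B's interval is much wider because it was tested on only 2000 names, but even its upper bound stays under 4%.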
4
Exam
4) Three initiatives are outlined below. Assume that data mining can yield a 15% lift.
What initiative would you pursue and why? Show your calculations.
- Outbound Telemarketing campaign with an available universe of 75000 names
at $3.00 per name
- Email campaign with an available universe of 10000000 names at $.10 per name
- Direct Mail campaign with an available universe of 100000 names at $2.00 per
name
Initiative               Universe   Universe x 1.15   1% of names   Cost Diff
Outbound Telemarketing      75000             86250           862     $33,750
Email                    10000000          11500000        115000    $150,000
Direct Mail                100000            115000          1150     $30,000
Index with data mining: 1.15; without: 1.
Cost Diff = extra names x cost per name, e.g. (86250 - 75000) x $3.00 = $33,750.
The email campaign shows the largest cost difference.
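The cost-difference arithmetic can be reproduced with a short sketch. It assumes, as the slide's figures suggest, that the value of the 15% lift is measured as the cost of the extra 15% of names that would otherwise have to be contacted; the function and variable names are illustrative.

```python
LIFT = 0.15

# (available universe, cost per name) for each initiative on the slide
initiatives = {
    "Outbound Telemarketing": (75_000, 3.00),
    "Email": (10_000_000, 0.10),
    "Direct Mail": (100_000, 2.00),
}

def cost_difference(universe, cost_per_name, lift=LIFT):
    """Cost of the extra names needed to match the lift without data mining."""
    extra_names = universe * lift
    return extra_names * cost_per_name

cost_diffs = {
    name: round(cost_difference(universe, cost), 2)
    for name, (universe, cost) in initiatives.items()
}
best = max(cost_diffs, key=cost_diffs.get)

for name, diff in cost_diffs.items():
    print(f"{name}: ${diff:,.0f}")
print("Largest cost difference:", best)
```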
5
Exam
5)The marketing team wants the flexibility and the ability to conduct its own analysis
without I/T or system resources.
The customer file and transaction file contains the following fields:
Answer the following:
a) What type of technology would you use
b) Give me a design that contains three dimensions and one measure
c) Provide a query that can be conducted based on your above design.
a) Cube
b) Dimensions: product type, 1st digit of postal code, payment type
Measure: count of account IDs
c) Give me the count of all customers who bought product A with cash
6) You are given the postal code data of each customer for company XYZ. How might
company XYZ use this information to better target prospects to become customers?
Determine the number of customers in each postal code, and determine the number of persons
in each postal code from Stats Can data. Create a penetration index: number of customers /
number of persons at the postal code. Rank postal codes by penetration index and use the
ranked postal codes to target prospects.
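A minimal sketch of the penetration index described above. The postal areas and the customer and population counts are invented for illustration; in practice the population counts would come from Stats Can data.

```python
def penetration_ranking(customers, population):
    """Rank postal areas by customers / persons, highest penetration first."""
    index = {
        area: customers.get(area, 0) / persons
        for area, persons in population.items()
        if persons > 0
    }
    return sorted(index.items(), key=lambda item: item[1], reverse=True)

# Hypothetical counts per postal area
customers = {"M5A": 120, "H3A": 45, "T5A": 10}
population = {"M5A": 2000, "H3A": 500, "T5A": 1000}

ranked = penetration_ranking(customers, population)
for area, index_value in ranked:
    print(f"{area}: {index_value:.3f}")
```

Note that the ranking is driven by the ratio, not the raw customer count: the area with the most customers is not necessarily the one with the best penetration.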
6
Exam
7) You are given a customer file with postal code data only. You can then append Stats
Can Taxfiler data and Stats Can Census data.
Which data would be richer in terms of providing more granular data and why?
What might be the advantage of using Stats Can Taxfiler data?
Stats Can Census is richer as it has more records (50000 vs. 28000 for Taxfiler).
The advantage of using Taxfiler data is that the data is more recent.
8) Answer the following questions:
a) What is the last stage of data mining?
b) What is more important in data mining: reducing costs or maximizing revenues?
c) What must happen to the data before it gets used in a data mining application?
d) What is the metric that allows us to look at how data varies within a population?
a) Implementation
b) Reducing costs
c) Must be one to one in analytical file
d) Standard deviation or variation
7
Exam
9) Which is the more accurate estimate of weight?
- Sample A: 150 lbs with std. dev. of 5 lbs
- Sample B: 25 lbs with std. dev. of 4 lbs
Explain why.
Sample A. Although its std. dev. is larger, if we look at std. dev. on a relative basis,
comparing it to the magnitude of values in the sample (5/150 = 3.3% for A vs. 4/25 = 16% for B),
we observe a much tighter bound around A than around B.
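The relative comparison in question 9 amounts to dividing each standard deviation by its mean. A small sketch, with a hypothetical function name:

```python
def relative_std_dev(mean_value, std_dev):
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean_value

samples = {"A": (150, 5), "B": (25, 4)}  # (mean weight in lbs, std. dev. in lbs)
relative = {name: relative_std_dev(m, s) for name, (m, s) in samples.items()}
tighter = min(relative, key=relative.get)

for name, value in relative.items():
    print(f"Sample {name}: {value:.1%}")
print("Tighter estimate:", tighter)
```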
10) Give me one example of a legacy type system file.
Give me one advantage of building a data mart.
Legacy: billing or call detail files; external data such as Stats Can
Advantages of building a data mart:
- Data is aggregated and summarized, so it is easier to use for analysis
- Quicker processing
- Easier interpretation, as the data deals solely with one functional area
8
Exam
11) Answer yes or no on whether data mining should be used
i) Creating a national advertising program
ii) Identifying your most profitable customers
iii) Trying to maximize the revenue of a campaign
iv) Using survey results (10% of customer base) to create a targeted customer list
v) Analyzing the results of a direct marketing campaign
i) No, ii) Yes, iii) No, iv) No, v) Yes
12) Listed below is a table containing 5 variables. For each variable, do the following:
a) Indicate if it is nominal, ordinal or interval
b) Indicate whether the variable is useful and provide 1 sentence for your reasoning.
Variable             # of records   # of unique values   # of missing values
Promotion Date             100000                    1                     0
Promotion Codes            150000                 5000                     0
Income                      75000                70000                 70000
Number of Children          75000                    6                 10000
Credit Decile Rank          75000                   10                     0
Promotion Date: interval, not useful, only one value
Promotion Codes: nominal, not useful, too granular
Income: interval, not useful, too many missing values
Number of Children: interval, useful, few missing values
Credit Decile Rank: ordinal, useful, 0 missing values
9
Creating the Analytical File-Reviewing Data Dumps
Initial dump of 1st few records
Account   Postal   Birth   Start   Behav.   Income   # in
Number    Code     Date    Date    Score             House
123456    M5A3S6   07/49   03/91   500      30000    6
345231    H3A2B4   08/54   04/92   550      42500    1
543236    T5A3S7   06/92   600     35000    3        543210
etc…
Missing values in the data are not being treated properly.
10
Creating the Analytical File-Reviewing Data Dumps
Initial dump of 1st few records
Proper treatment of missing values results in the following
dump:
Account   Postal   Birth   Start   Behav.   Income   # in
Number    Code     Date    Date    Score             House
123456    M5A3S6   07/49   03/91   500      30000    6
345231    H3A2B4   08/54   04/92   550      42500    1
543236    T5A3S7           06/92   600      35000    3
543210    etc…
Effective programming can ensure that records are being
properly loaded into the system.
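The shifted record on the previous slide is the kind of problem that appears when a loader collapses an empty field. The sketch below contrasts a naive whitespace split with a delimiter-aware parse. The comma-delimited layout and the field names are assumptions for illustration; the slides do not say how the source file is formatted.

```python
import csv
import io

FIELDS = ["account", "postal_code", "birth_date", "start_date",
          "behav_score", "income", "num_in_house"]

# Hypothetical raw line: the birth date (third field) is missing.
raw = "543236,T5A3S7,,06/92,600,35000,3\n"

def load_records(text):
    """Parse delimited text, keeping empty fields in place as None."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}: {row}")
        records.append({name: (value or None) for name, value in zip(FIELDS, row)})
    return records

records = load_records(raw)

# Naive approach: splitting on whitespace drops the empty field,
# so every later value shifts one column to the left.
naive = raw.replace(",", " ").split()

print(records[0])
print(naive)
```

The field-count check is the useful habit here: a record with the wrong number of fields is rejected at load time instead of silently shifting values into the wrong columns.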
11
Creating the Analytical File-Reviewing Data Dumps
View of the Transaction File
A dump of a few records from a billing file revealed the
following after sorting by account number
Account   Purchase   Product    Date of
          Amount     Category   Purchase
123460    $50        ABC123     19980630
123460    $75        DEF789     19980703
456720    $90        GHI123     19980701
456720    $100       ABC456     19980715
333121    $25        JKL432     19980315
333121    $40        GHI342     19980401
789232    $30        GHI261     19980228
789232    $20        236phi     19980307
12
Creating the Analytical File-Reviewing Data Dumps
View of the Promo History File
A dump of a few promotion history records revealed the
following after sorting by account number:
Account No.   Promotion ID   Promotion Date
123460        ABA123         19970115
123460        ACB431         19970315
123460        AAC221         19970618
456720        BAA123         19970115
456720        BBA321         19980115
456720        BCB330         19980315
456720        BAC112         19980618
333121        CBA321         19980115
789232        BAD333         19980415
13
Creating the Analytical File-Reviewing Data Dumps
• Using your marketing knowledge, give me examples of variables that
we might create from the last three slides
– Slide 11
– Slide 12
– Slide 13
• Slide 11: Age, region of country, tenure
• Slide 12: Total Amount, Total amount for a given product, and recency
of purchase.
• Slide 13: Total promotions, Total Promotions by Type and recency of
last promotion
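As an illustration of the Slide 12 variables, the sketch below rolls a few of the billing records up to one row per account with total amount and recency of purchase. The analysis date used to measure recency is a hypothetical choice, and the function name is illustrative.

```python
from collections import defaultdict
from datetime import date, datetime

# (account, purchase amount, product category, date of purchase) from the billing dump
transactions = [
    ("123460", 50, "ABC123", "19980630"),
    ("123460", 75, "DEF789", "19980703"),
    ("456720", 90, "GHI123", "19980701"),
    ("456720", 100, "ABC456", "19980715"),
]

AS_OF = date(1998, 12, 31)  # hypothetical analysis date

def derive_variables(rows, as_of):
    """Summarize transactions to one record per account."""
    total_amount = defaultdict(int)
    last_purchase = {}
    for account, amount, _product, ymd in rows:
        purchased = datetime.strptime(ymd, "%Y%m%d").date()
        total_amount[account] += amount
        if account not in last_purchase or purchased > last_purchase[account]:
            last_purchase[account] = purchased
    return {
        account: {
            "total_amount": total_amount[account],
            "recency_days": (as_of - last_purchase[account]).days,
        }
        for account in total_amount
    }

derived = derive_variables(transactions, AS_OF)
for account, variables in derived.items():
    print(account, variables)
```

The same pattern applies to the promotion history file: count promotions per account and take the most recent promotion date.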
14
Creating the Analytical File-Data Hygiene and
Cleansing
• Once the data has been dumped in order to view records, typically
data hygiene and cleansing have to take place
• Two key deliverables
– Clean name and address information
– Standard rules for coding of data values
15
Creating the Analytical File-Data Hygiene and
Cleansing
• Clean Name and Address Information
– Market to right Individual
– Create Match keys
16
Creating the Analytical File
Name and Address Standardization
• Clean Name and Address Information
– Market to right Individual
– Create Match keys
– Name and Address Standardization
Raw record:
BankID        987654321
Name          JONH SMITH JR.
Address1      123 WILLIAMS STRET
Address2      2ND FLOOR
Address3      TRT., O.N. M5G-1F3
Country       CDN
UnIndivID     123456789
Standardized layout (name fields PreName, FirstName, Surname, PostName not yet parsed):
BankID        987654321
Name          JONH SMITH JR.
Street1       123 WILLIAMS STRET
Street2       2ND FLOOR
City          TRT
Province      O.N.
Postal Code   M5G-1F3
Country       CANADA
UnIndivID     123456789
Origin        Bank
17
Creating the Analytical File-Name and Address
Standardization
DATA CLEANING
• Address correction
• Name parsing
• Genderizing
• Casing
Before cleaning (name fields PreName, FirstName, Surname, PostName not yet parsed):
BankID        987654321
Name          JONH SMITH JR.
Street1       123 WILLIAMS STRET
Street2       2ND FLOOR
City          TRT
Province      O.N.
Postal Code   M5G-1F3
Country       CANADA
UnIndivID     123456789
Origin        Bank
After cleaning:
BankID        987654321
PreName       Mr.
FirstName     John
Surname       Smith
PostName      Jr.
Street1       200-123 Williams Street
Street2
City          Toronto
Province      ON
Postal Code   M5G 1F3
Country       Canada
UnIndivID     123456789
Origin        Bank
18
Creating the Analytical File-Merge Purge of Names
• What are the reasons for creating unique match customer keys
– Generating a marketing list
– Conducting analysis
Should the match keys be the same for both of the above scenarios?
No: use tighter match keys when generating lists and looser match keys
when conducting analysis.
In what situations are match keys numeric?
When matching files that involve only existing customer data.
19
Creating the Analytical File-Merge Purge of Names
Common fields to use in creating Match keys
• First Name;
• Surname;
• Unique Individual ID;
• Postal Code
• Credit Card Number
• Duns Number for Businesses
• Phone Number
Unique IDs or number-type IDs are the preferred choice when
creating match keys
• Let’s take a closer look at creating match keys using name and
address
20
Creating the Analytical File-Merge Purge of Names
• Let’s take a look at 8 records and see what this means.
Surname   First Name   Address          Postal Code   Match Key
Smith     John         12345 Elm Street   L1A2A1      L1A2A1SMITHJ
Smith     James        45678 Elm Street   L1A2A1      L1A2A1SMITHJ
Brown     Tim          5678 Oak           M5A3A2      M5A3A2BROWNT
Brown     T.           5678 Oak Road      M5A3A2      M5A3A2BROWNT
Green     Ted          3478 Pine          V6A2A1      V6A2A1GREENT
Green     Tanya        3478 Pine          V6A2A2      V6A2A1GREENT
Filler    Robert       2345 Nurr          M5A3A2      M5A3A2FILLERR
Filler    Larry        5672 Bolton Dr.    M6A2A1      M6A2A1FILLERL
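The match keys in the table follow the pattern postal code + surname + first initial. A minimal sketch of building such keys and grouping records that share one; the function name is hypothetical and only a subset of the records is used.

```python
from collections import defaultdict

def match_key(surname, first_name, postal_code):
    """Postal code + surname + first initial, upper-cased with spaces removed."""
    return (postal_code.replace(" ", "") + surname + first_name[:1]).upper()

records = [
    ("Smith", "John", "L1A2A1"),
    ("Smith", "James", "L1A2A1"),
    ("Brown", "Tim", "M5A3A2"),
    ("Brown", "T.", "M5A3A2"),
    ("Filler", "Robert", "M5A3A2"),
    ("Filler", "Larry", "M6A2A1"),
]

# Group records by key; any key with more than one record is a candidate duplicate.
groups = defaultdict(list)
for surname, first_name, postal_code in records:
    groups[match_key(surname, first_name, postal_code)].append((surname, first_name))

duplicates = {key: names for key, names in groups.items() if len(names) > 1}
for key, names in duplicates.items():
    print(key, names)
```

John Smith and James Smith share a key even though their street addresses differ, which shows how loose this key is: acceptable for analysis, but too loose for generating a mailing list.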
21
Creating the Analytical File-Merge Purge of Names
• Example: You have one record here:
– Richard Boire, 4628 Mayfair Ave., H4B2E5
– How would you use the above information for a backend analysis
if I were a responder to an acquisition campaign?
– BOIREH4B2E5
– What if you were conducting analysis on me as an existing
customer who responded to a cross-sell campaign?
– Need only customer ID
– How about if you wanted to send me a direct mail piece?
– BOIRERICHARDH4B2E54628MAYFAIR
22
Creating the Analytical File- Data standardization
• Refers to a process where values of a common variable from
different files are mapped to the same value. Some common examples:
• SIC Code Industry Classification Table
– Industry categories have a common set of codes
• Postal Code Variable
– A postal code has to have 6 characters in the pattern
alpha, numeric, alpha, numeric, alpha, numeric, and excludes the
following alphas: D, F, I, O, Q, and U (W and Z are also excluded as the first letter).
• Give me examples of bad postal codes vs. good postal codes.
– D4B2E5, H442E6, etc. are bad postal codes.
– M5J1A1, A1A1A3, etc. are good postal codes.
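The postal code rule lends itself to a simple pattern check. The sketch below is one possible way to do it with a regular expression. It assumes the Canadian convention that D, F, I, O, Q and U never appear and that W and Z never appear as the first letter; the function name is hypothetical.

```python
import re

# Assumption: D, F, I, O, Q, U are never used; W and Z are never used as the first letter.
POSTAL_CODE = re.compile(
    r"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z][0-9][ABCEGHJ-NPRSTV-Z][0-9]$"
)

def standardize_postal_code(raw):
    """Return the 6-character upper-case code, or None if the value is malformed."""
    code = re.sub(r"[\s-]", "", raw).upper()
    return code if POSTAL_CODE.match(code) else None

for value in ["D4B2E5", "H442E6", "m5j 1a1", "A1A1A3"]:
    print(value, "->", standardize_postal_code(value))
```

Stripping spaces and hyphens before validating means differently formatted copies of the same code from different files map to one standard value.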
23
Creating the Analytical File- Data Standardization
• Here is an example of how disposition codes for telemarketing outcomes
might be handled
Code
21
21
21
32
9
U28
B22
B23
Description
Do Not Call
Do Not Call
Do Not Call
Do Not Call
Do Not Call - Place on “Do Not Call” list permanently
Do Not Solicit - Do not call, mail, email or attempt any other form of solicitation to this customer
Do Not Mail - Place permanently on “Do Not Mail” list; future calling solicitations ok
No sale - Do not solicit
Never call again, <<Client>>
Never call again, general
C08
Scrubbed Vendor DNS
20
22
24
Creating the Analytical File- Data Standardization
• Postal Code Standardization
– Six-character code comprising
alpha, numeric, alpha, numeric, alpha, numeric
– 1st letters: A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y
• SIC (Standard Industrial Classification) Code
– 4-digit code used to classify all companies into a standard set of
industries
25
Creating the Analytical File- Data standardization
• Example:
– You have been asked to build a retention model. You have two
years’ worth of transaction data.
– Changes in the product category codes occurred six months
ago. Key information that you would look at would be as follows:
• Income category
• Product Category
• Transaction Codes
• Transaction Amount
• Postal Code
• Transaction Date
• Gender
– What would you need to do?
• Map the old product category code definitions from
prior to six months ago to the new product category code
definitions
26
Creating the Analytical File- Geo-Coding
• Geocoding is the process that assigns a latitude-longitude coordinate
to an address. Once a latitude-longitude coordinate is assigned, the
address can be displayed on a map or used in a spatial search.
• Data miners often use these coordinates to calculate such things as
“distance to the nearest store”
27
Demographic Analysis
[Figure: Geo Profile around a Store Location, showing Population Count, Age Distribution, and Average Age]
28
Creating the Analytical File-What is Geocoding?
• Let’s look at a sample of what some data might look like.
Postal Code   Latitude   Longitude
A1A5A2        5          10
B5V1A2        7          20
M6B2A2        10         30
T4B1A2        6          40
V4H2B5        11         50
How do we use this data to create meaningful
variables?
- Using the Pythagorean theorem on the differences between two points, where
distance**2 = (lat1 - lat2)**2 + (long1 - long2)**2. This is extremely
useful in calculating distance-type
variables between a customer and a given location
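A small sketch of a "distance to the nearest store" variable using the sample coordinates above. The two store locations are invented for illustration, and treating latitude and longitude as flat x/y coordinates is the slide's simplification rather than a true geographic distance.

```python
from math import hypot

# Sample coordinates from the slide: postal code -> (latitude, longitude)
customers = {
    "A1A5A2": (5, 10),
    "B5V1A2": (7, 20),
    "M6B2A2": (10, 30),
    "T4B1A2": (6, 40),
    "V4H2B5": (11, 50),
}

# Hypothetical store locations
stores = {"Store 1": (8, 14), "Store 2": (10, 45)}

def distance(a, b):
    """Pythagorean distance between two (latitude, longitude) points."""
    return hypot(a[0] - b[0], a[1] - b[1])

def nearest_store(point):
    """Return (store name, distance) for the closest store."""
    name = min(stores, key=lambda store: distance(point, stores[store]))
    return name, distance(point, stores[name])

nearest = {code: nearest_store(point) for code, point in customers.items()}
for code, (store, dist) in nearest.items():
    print(f"{code}: {store} at distance {dist:.2f}")
```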
29
Creating the Analytical File-What is Geocoding
• Example:
– A retailer has the following information:
• Name and address of its customers
• Address of its stores
• Stats Can Information
– As a marketer, how would you intelligently use this information?
– Find the distance between a given customer and the nearest store.
– Create a trading area around a given store. Find out which stores
have the best penetration. At the same time, analyze these best
penetration stores and determine some key Stats Can attributes
around them.
30
Frequency Distribution
• The report below uses first digit of postal code to assign
customers to region.
• For example, postal codes beginning with ‘G’, ‘H’, or ’J’ represent
the Quebec region.
Customer Profiling
Region              # of Customers   % of Total
Prairie Provinces   25 M             2.5%
Quebec              100 M            10%
Ontario             350 M            35%
West                25 M             2.5%
Missing Values      500 M            50%
Total               1 MM             100%
Frequency Distribution
Tenure    # of Customers   % of Customers
1998      9800             14%
1999      10000            14%
2000      12000            17%
2001      8000             11%
Missing   30000            43%
Total     69800            100%
This tenure report would tell us that the tenure field was not on this database
prior to 1998 and that 30,000 customers began prior to that date. Given the high
percent of customers with missing values, we would need to determine whether
we could capture tenure from another field in the database or not use the tenure field.
32
Frequency Distribution
Type of Product/Services Purchased   # of Customers   % of Customers
Product A                            35000            29.66%
Product B                            40000            33.90%
Product C                            25000            21.19%
Product D                            15000            12.71%
Other                                3000             2.54%
Total                                118000           100.00%
The Product/Service field has good coverage, and we can conclude that product B
has been the best-selling product, followed closely by product A.
33
Creating Variables
• Example of source variables (Source/Raw File Variables)
– # in Household
– Income
– Credit score
• Example of derived variables (Derived Variables)
– Region of country
– Total spend within certain period
– Age
– Total lifetime spend
– Tenure
– Total number of promotions
– Number of promotions in last year by campaign category
34
More Creations
• Other variables
– Total spend in certain time periods
– Total spend by product category in certain time periods
– Decline in spend: total and by product type
– Trend variables related to spending and product category:
• Median
• Mean
• Variation
– Index Variables
• Grouping of a variable into meaningful categories where category values are
index values
• Binary Variables: yes/no type variables such as gender
35
Creating the Analytical File-Reviewing Data Dumps
View of the Transaction File
A dump of a few records from a billing file revealed the
following after sorting by account number
Account   Purchase   Product    Date of
          Amount     Category   Purchase
123460    $50        ABC123     19980630
123460    $75        DEF789     19980703
456720    $90        GHI123     19980701
456720    $100       ABC456     19980715
333121    $25        JKL432     19980315
333121    $40        GHI342     19980401
789232    $30        GHI261     19980228
789232    $20        236phi     19980307
• What kind of variables can be derived?
36
Creating Binary Groups
Income      % of Customers   Response Rate   Response Index   Income > 40K
under 20K   16%              1.50%           0.43             0
20-30K      16%              2.50%           0.71             0
30-40K      16%              2.00%           0.57             0
40-55K      16%              6%              1.71             1
55-80K      16%              5%              1.43             1
80K+        16%              4%              1.14             1
Average     100%             3.50%           1.00
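The response index in this table is each band's response rate divided by the average response rate, and the binary variable flags the bands above 40K. A short sketch that reproduces the figures, assuming equally sized income bands so that the overall average is a simple mean of the band rates:

```python
# Response rate (%) by income band, from the table
response_rates = {
    "under 20K": 1.5,
    "20-30K": 2.5,
    "30-40K": 2.0,
    "40-55K": 6.0,
    "55-80K": 5.0,
    "80K+": 4.0,
}

# Assumption: bands are the same size, so the average is the simple mean.
average_rate = sum(response_rates.values()) / len(response_rates)

# Index = band response rate relative to the average.
response_index = {
    band: round(rate / average_rate, 2) for band, rate in response_rates.items()
}

# Binary variable: 1 if the band sits above 40K, else 0.
ABOVE_40K = {"40-55K", "55-80K", "80K+"}
income_over_40k = {band: int(band in ABOVE_40K) for band in response_rates}

for band in response_rates:
    print(band, response_index[band], income_over_40k[band])
```

The binary split works here because every band above 40K has an index above 1 and every band below has an index under 1.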
37
Creating Indices
# of Months Since   % of Customers   Response Rate   Response Index
Last Promotion
1                   16%              2.50%           0.71
2                   16%              1.50%           0.43
3                   16%              3.75%           1.07
4                   16%              3.25%           0.93
5                   16%              6.00%           1.71
6                   16%              4.00%           1.14
Average             100%             3.50%           1.00
Grouped index values for the Months Since Last Promotion variable: 0.57, 0.62, 1.00, 1.43
38
More Variable Creation
Spending   # of customers   Response Rate
0-100      1000             1%
100-200    1000             0.80%
200-300    1000             1.20%
300-400    1000             0.90%
400+       1000             0.95%
• What would you do here?
• Is there any trend? Given that there seems to be no
trend or relationship between spend and response, it is highly
unlikely that further information would be derived from this
field.
39
More Variable Creation
Tenure     # of customers   Response Rate
< 1 year   1000             3%
1-2 yrs    1000             2.00%
2-3 yrs    1000             1.00%
3-4 yrs    1000             0.75%
4 yrs+     1000             0.30%
• What would you do here?
• Here, this variable in all likelihood would be
useful given its trend with response rate.
40
Stage 3 of Data Mining
• What stage are we at:
– Application of data mining tools
• Give me some examples of what data miners would be doing in stage 3
– Data discovery
• Data Audit/Frequency Distribution Analysis, Value Segmentation
– Models, profiles, etc.
– Post Campaign Analysis
– Reporting, such as standard KBM (Key Business Measure) Reports
– Ad Hoc Reports
• Modelling and profiling represent some examples of what we might be
doing in this stage.
41
Types of Predictive Models
• Examples: Discrete Models
– Response Models
• Cross Sell
• Upsell
• Acquisition
– Attrition Models
– Product Affinity Models
– Risk Models
42
Types of Predictive Models
• Examples: Continuous Models
– Profitability/Value Models
– Spending Models
• What is the concept of the objective function or dependent variable?
– This is the variable that we are trying to predict
• Response, bad credit, defection, spend, etc.
– What we are trying to optimize essentially becomes our objective
function.
43