Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Objectives
 Name two major types of data mining analyses.
 List techniques for supervised and unsupervised analyses.
Analytical Methodology
A methodology clarifies the purpose and implementation of analytics. The steps form a cycle:
Define or refine business objective → Select data → Explore input data → Prepare and repair data → Transform input data → Apply analysis → Deploy models → Assess results → back to the business objective
Business Analytics and Data Mining
Data mining is a key part of effective business analytics.
Components of data mining:
 data management
 data management
 data management
 customer segmentation
 predictive modeling
 forecasting
 standard and nonstandard statistical modeling practices
What Is Data Mining?
 Information Technology
– complicated database queries
 Machine Learning
– inductive learning from examples
 Statistics
– what we were taught not to do
Translation for This Course
 Segmentation (unsupervised classification)
– cluster analysis
– association rules
– other techniques
 Predictive Modeling (supervised classification)
– linear regression
– logistic regression
– decision trees
– other techniques
Customer Segmentation
Segmentation is a vague term with many meanings. Segments can be based on the following:
 a priori judgment – alike based on business rules, not based on data analysis
 unsupervised classification – alike with respect to several attributes
 supervised classification – alike with respect to a target, defined by a set of inputs
Segmentation: Unsupervised Classification
[Figure: Training data before clustering (case 1 through case 5: inputs, ?) and after clustering (case 1: inputs, cluster 1; case 2: inputs, cluster 3; case 3: inputs, cluster 2; case 4: inputs, cluster 1; case 5: inputs, cluster 2). A new case is assigned to one of the discovered clusters.]
Segmentation: A Selection of Methods
 k-means clustering
 Association rules (market basket analysis), for example:
– Barbie → Candy
– Beer → Diapers
– Peanut butter → Meat
Predictive Modeling: Supervised Classification
[Figure: Training data (case 1 through case 5: inputs → prob → class). The trained model assigns a probability and a class to each new case.]
Predictive Modeling: Supervised Classification
[Figure: Rectangular data layout with cases as rows, inputs as columns, and a target column.]
2.01 Poll
The primary difference between supervised and unsupervised classification is whether a dependent, or target, variable is known.
 Yes
 No

2.01 Poll – Correct Answer
The primary difference between supervised and unsupervised classification is whether a dependent, or target, variable is known.
 Yes (correct)
 No
Types of Targets
 Logistic Regression
– event/no event (binary target)
– class label (multiclass problem)
 Regression
– continuous outcome
 Survival Analysis
– time-to-event (possibly censored)
Discrete Targets
 Health Care
– target = favorable/unfavorable outcome
 Credit Scoring
– target = defaulted/did not default on a loan
 Marketing
– target = purchased product A, B, C, or none
Continuous Targets
 Health Care Outcomes
– target = hospital length of stay, hospital cost
 Liquidity Management
– target = amount of money at an ATM or in a branch vault
 Merchandise Returns
– target = time between purchase and return (censored)
Application: Target Marketing
 Cases = customers, prospects, suspects, households
 Inputs = geographics, demographics, psychometrics, RFM variables
 Target = response to a past or test solicitation
 Action = target high-responding segments of customers in future campaigns
Application: Attrition Prediction/Defection Detection
 Cases = existing customers
 Inputs = payment history, product/service usage, demographics
 Target = churn, brand switching, cancellation, defection
 Action = customer loyalty promotion
Application: Fraud Detection
 Cases = past transactions or claims
 Inputs = particulars and circumstances
 Target = fraud, abuse, deception
 Action = impede or investigate suspicious cases
Application: Credit Scoring
 Cases = past applicants
 Inputs = application information, credit bureau reports
 Target = default, charge-off, serious delinquency, repossession, foreclosure
 Action = accept or reject future applicants for credit
The Fallacy of Univariate Thinking
What is the most important cause of churn?
[Figure: Prob(churn) plotted against international usage and daytime usage; the probability depends on both inputs at once, so no single variable tells the story.]
A Selection of Modeling Methods
 Linear Regression, Logistic Regression
 Decision Trees
Hard Target Search
[Figure: A long list of transactions searched for the rare cases labeled Fraud.]
Undercoverage
[Figure: Applicants fall into Accepted Bad, Accepted Good, and Rejected – No Follow-up. Because rejected applicants are never followed up, their outcomes are missing, and the data undercovers the next generation of applicants.]
2.02 Poll
Impediments to high-quality business data can lie in the very nature of business decision-making: the worst prospects are not marketed to. Therefore, information about the sort of customer that they would be (profitable or unprofitable) is usually unknown, making supervised classification more difficult.
 Yes
 No

2.02 Poll – Correct Answer
Impediments to high-quality business data can lie in the very nature of business decision-making: the worst prospects are not marketed to. Therefore, information about the sort of customer that they would be (profitable or unprofitable) is usually unknown, making supervised classification more difficult.
 Yes (correct)
 No
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Objectives
 Explain the concept of data integration.
 Describe SAS Enterprise Guide and how it fits in with data integration and management for business analytics.
Data Management and Business Analytics
Data management brings together data components that can exist on multiple machines, from different software vendors, throughout the organization.
Data management is the foundation for business analytics. Without correctly consolidated data, those working in the analytics, reporting, and solutions areas might not be working with the most current, accurate data.
[Figure: Pyramid with Reporting at the base, Basic Analytics above it, and Advanced Analytics at the top.]
Managing Data for Business Analytics
 Business analytics requires data management activities such as data access, movement, transformation, aggregation, and augmentation.
 These tasks can involve many different types of data (for example, simple flat files, files with comma-separated values, Microsoft Excel files, SAS tables, and Oracle tables).
 The data likely combines individual transactions, customer summaries, product summaries, or other levels of data granularity – or some combination of those things.
Planning from the Top Down
What mission-critical questions must be answered?
What data will help you answer these questions?
What data do you have that will help you build the needed data?
Implementing from the Bottom Up
1. Identify Source Data
2. Define Target Data
3. Create Reports
Collaboration Is Key to Business Analytics
 Business Expert
 IT Expert
 Analytical Expert
Data Marts: Tying Questions to Data
Stated simplistically, data marts are implemented
at organizations because there are questions that must
be answered.
Data is typically collected in daily operations but might
not be organized in a way that answers the questions.
An IT professional can use the questions and the data
collected from daily operations to construct the tables for
a data warehouse or data mart.
Building a Data Mart
Building the foundation of a data mart consists of three basic steps:
 Identify source tables.
 Identify target tables.
 Create target tables.
Analytic Objective Example
Business: Large financial institution
Objective: From a population of existing clients with sufficient tenure and other qualifications, identify a subset most likely to have interest in an insurance investment product (INS).
Financial Institution's Data
The financial institution has highly detailed data that is challenging to transform into a structure suitable for predictive modeling. As is the case with most organizations, the financial institution has a large amount of data about its customers, products, and employees. Much of this information is stored in transactional systems in various formats.
Using SAS Enterprise Guide, this transactional information is extracted, transformed, and loaded into a data mart for the Marketing Department. You continue to work with this data set for some basic exploratory analysis and reporting.
A Target Star Schema
One goal of creating a data mart is to produce, from the source data, a dimensional data model that is a star schema.
[Figure: Fact Table at the center, joined to Customer, Organization, Time, and Product dimensions.]
Financial Institution Target Star Schema
The analyst can produce, from the financial institution's source data, a dimensional data model that is a star schema.
[Figure: Checking Fact Table at the center, joined to Customer, Credit Bureau, and Insurance dimensions.]
Checking_transactions Table
The checking_transactions table contains the following attributes, one record per fact. Each fact contains some measured or observed variables. The fact table contains the data, and the dimensions identify each tuple in the data.
 CHECKING_ID
 CHKING_TRANS_DT
 CHKING_TRANS_AMT
 CHKING_TRANS_CHANNEL_CD
 CHKING_TRANS_METHOD_CD
 CHKING_TRANS_TYPE_CD
Client Table
The client table contains client information. In practice, this data set could also contain address and other information. For this demonstration, only CLIENT_ID, FST_NM, LST_NM, ORIG_DT, BIRTH_DT, and ZIP_5 are used.
Client_ins_account Table
The client_ins_account table matches client IDs (CLIENT_ID) to INS account IDs (CLIENT_INS_ID).
Ins_account Table
The ins_account table contains the insurance account information. In practice, this data set would contain other fields such as rates, maturity dates, and initial deposit amount. For this demonstration, only INS_ACT_ID and INS_ACT_OPEN_DT are used.
Credit_bureau Table
The credit_bureau table contains credit bureau information. In practice, this data set could contain credit scores from more than one credit bureau and also a history of credit scores. Its columns are CLIENT_ID, TL_CNT, FST_TL_TR, FICO_CR_SCR, and CREDIT_YQ.
Advantages of Data Marts
 There is one version of the truth.
 Downstream tables are updated as source data is updated, so analyses are always based on the latest information.
 The problem of a proliferation of spreadsheets is avoided.
 Information is clearly identified by standardized variable names and data types.
 Multiple users can access the same data.
SAS Enterprise Guide Overview
SAS Enterprise Guide can be used for data management,
as well as a wide variety of other tasks:
 data exploration
 querying and reporting
 graphical analysis
 statistical analysis
 scoring
Example: Financial Institution Data Management
The head of Marketing wants to know which customers have the highest propensity for buying insurance products from the institution. This could present a cross-selling opportunity.
Create part of an analytical data mart by combining information from many tables: checking account data, customer records, insurance data, and credit bureau information.
Input Files
 client_ins_account.sas7bdat
 credit_bureau.sas7bdat
 ins_account.sas7bdat
 client.sas7bdat
A sketch of the join logic follows.
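Under the hood, this kind of data assembly is a set of joins. A minimal PROC SQL sketch, assuming the tables live in a library named FIN, that CLIENT_ID keys the joins, and that a derived INS flag is the target (all names illustrative):

   proc sql;
      create table fin.marketing_mart as
      select c.client_id,
             c.birth_dt,
             c.zip_5,
             cb.fico_cr_scr,
             case when cia.client_ins_id is not null
                  then 1 else 0
             end as ins                      /* 1 = client holds an INS account */
      from fin.client as c
           left join fin.credit_bureau as cb
             on c.client_id = cb.client_id
           left join fin.client_ins_account as cia
             on c.client_id = cia.client_id;
   quit;

The left joins keep every client, including those with no credit bureau record or insurance account, which is what a modeling table needs.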
Final Data
A Data Management Process Using SAS Enterprise Guide
Financial Institution Case Study
Task: Join several SAS tables and use separate sampling to obtain a training data set.
Exploring the Data and Creating a Report
Investigate the distribution of credit scores.
 Create a report of credit scores for customers without insurance and customers with insurance.
Does age have an influence on credit scores? Which customers have the highest credit scores, young customers or older customers?
 Create a graph of credit scores by age.
Exploratory Analysis
Exploring the Data and Creating a Basic Report
Financial Institution Case Study
Task: Investigate the distribution of credit scores by creating a report of credit scores for customers without insurance and customers with insurance.
Graphical Exploration
Financial Institution Case Study
Task: Create a graph of credit scores by age.
Idea Exchange
 What conclusions would you draw from this basic data exploration? Are there additional plots or reports that you would like to explore from the orders data to help you better understand your customers and their propensity to buy insurance?
 What additional data would you need to help you make a case to the head of the Marketing Department that marketing dollars should be spent in a particular way?
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Objectives
 Identify several of the challenges of data mining and present ways to address these challenges.
Initial Challenges in Data Mining
1. What do I want to predict?
 a transaction
 an individual
 a household
 a store
 a sales team
2. What level of granularity is needed to obtain data about the customer?
 transactional
 regional
 daily
 monthly
 other
2.03 Multiple Answer Poll
Which of the following might constitute a case in a predictive model?
a. a household
b. loan amount
c. an individual
d. the number of products purchased
e. a company
f. a ZIP code
g. salary

2.03 Multiple Answer Poll – Correct Answers
Which of the following might constitute a case in a predictive model?
a. a household (correct)
b. loan amount
c. an individual (correct)
d. the number of products purchased
e. a company (correct)
f. a ZIP code (correct)
g. salary
Typical Data Mining Time Line
[Figure: Allotted time split between data preparation and data analysis, compared across the projected, actual, dreaded (dominated by data acquisition), and needed time lines.]
Data Challenges
What identifies a unit?
Cracking the Code
What identifies a unit?
[Table: Sample records with columns ID1, ID2, DATE, JOB, SEX, FIN, PRO3, CR_T, and ERA. ID2 (624-635) is unique on every row, while ID1 repeats (2618 and 2620 each appear on several rows), so it is not obvious which column identifies a unit, and the cryptic codes and missing fields must be deciphered.]
Data Challenges
What should the data look like to perform an analysis?
Data Arrangement
What should the data look like to perform an analysis?
Long-Narrow
   Acct  Type
   2133  MTG
   2133  SVG
   2133  CK
   2653  CK
   2653  SVG
   3544  MTG
   3544  CK
   3544  MMF
   3544  CD
   3544  LOC
Short-Wide
   Acct  CK  SVG  MMF  CD  LOC  MTG
   2133   1    1    0   0    0    1
   2653   1    1    0   0    0    0
   3544   1    0    1   1    1    1
A transposition sketch follows the tables.
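Converting from long-narrow to short-wide is a routine step. A minimal SAS sketch, assuming the long table is WORK.ACCTS_LONG with columns Acct and Type (names illustrative):

   data accts_flagged;
      set accts_long;
      flag = 1;                              /* account holds this product */
   run;

   proc sort data=accts_flagged;
      by acct;
   run;

   proc transpose data=accts_flagged out=accts_wide(drop=_name_);
      by acct;
      id type;                               /* CK, SVG, ... become columns */
      var flag;
   run;

   data accts_wide;                          /* fill unheld products with 0 */
      set accts_wide;
      array p {*} _numeric_;
      do i = 1 to dim(p);
         if p{i} = . then p{i} = 0;
      end;
      drop i;
   run;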
Data Challenges
What variables do I need?
Derived Inputs
What variables do I need?
   Claim    Accident        Delay  Season  Dark
   11nov96  102396/12:38      19   fall      0
   22dec95  012395/01:42     333   winter    1
   26apr95  042395/03:05       3   spring    1
   02jul94  070294/06:25       0   summer    0
   08mar96  123095/18:33      69   winter    0
   15dec96  061296/18:12     186   summer    0
   09nov94  110594/22:14       4   fall      1
A DATA step sketch of these derivations follows.
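A sketch of the derivations, assuming a table CLAIMS with a SAS date claim_dt and a datetime accident_dttm; the darkness rule (before 6 a.m. or after 7 p.m.) is a deliberate simplification:

   data claims_derived;
      set claims;
      length season $ 6;
      accident_dt = datepart(accident_dttm);
      delay = claim_dt - accident_dt;        /* days from accident to claim */
      select (month(accident_dt));
         when (12, 1, 2) season = 'winter';
         when (3, 4, 5)  season = 'spring';
         when (6, 7, 8)  season = 'summer';
         otherwise       season = 'fall';
      end;
      hr = hour(timepart(accident_dttm));
      dark = (hr < 6 or hr >= 19);           /* crude daylight proxy */
      drop hr;
   run;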
Data Challenges
How do I convert my data to the proper level of
granularity?
Roll-Up
How do I convert my data to the proper level of granularity?
Account level:
   HH    Acct  Sales
   4461  2133    160
   4461  2244     42
   4461  2773    212
   4461  2653    250
   4461  2801    122
   4911  3544    786
   5630  2496    458
   5630  2635    328
   6225  4244     27
   6225  4165    759
Household level (how should Acct and Sales be rolled up?):
   HH    Acct  Sales
   4461     ?      ?
   4911     ?      ?
   5630     ?      ?
   6225     ?      ?
A roll-up sketch follows.
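A PROC SQL sketch of the roll-up, assuming the account-level table is WORK.ACCT_LEVEL; counting accounts and summing sales are illustrative aggregation choices:

   proc sql;
      create table hh_level as
      select hh,
             count(acct) as n_accts,         /* accounts in the household */
             sum(sales)  as tot_sales        /* total household sales     */
      from acct_level
      group by hh;
   quit;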
Rolling Up Longitudinal Data
How do I convert my data to the proper level of granularity?
   Flier  Month  Mileage  Frequent Flying VIP Member
   10621  Jan        650  No
   10621  Feb          0  No
   10621  Mar          0  No
   10621  Apr        250  No
   33855  Jan        350  No
   33855  Feb        300  No
   33855  Mar       1200  Yes
   33855  Apr        850  Yes
A pivoting sketch follows.
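Here the roll-up pivots months into columns so that each flier becomes one row. A sketch, assuming the long table is WORK.MILES_LONG (names illustrative):

   proc sort data=miles_long;
      by flier;
   run;

   proc transpose data=miles_long out=miles_wide(drop=_name_) prefix=miles_;
      by flier;
      id month;                    /* Jan, Feb, ... become miles_Jan, miles_Feb, ... */
      var mileage;
   run;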
Data Challenges
What sorts of raw data quality problems can I expect?
Errors, Outliers, and Missings
What sorts of raw data quality problems can I expect?
[Table: Checking account sample with columns cking, #cking, ADB, NSF, dirdep, and SVG bal. It illustrates inconsistent coding (a lowercase "y" among "Y" values), missing values (dots), and outliers (an ADB of 89981.12 among typical balances in the hundreds).]
Missing Value Imputation
What sorts of raw data quality problems can I expect?
[Figure: Data matrix of cases by inputs with scattered missing cells (?) that must be imputed before many modeling techniques can use the rows.]
Data Challenges
Can I (more importantly, should I) analyze all the data
that I have?
All the observations?
All the variables?
Massive Data
Can I (more importantly, should I) analyze all the data that I have?
   Bytes                 Paper
   Kilobyte   2^10       1/2 sheet
   Megabyte   2^20       1 ream
   Gigabyte   2^30       167 feet
   Terabyte   2^40       32 miles
   Petabyte   2^50       32,000 miles
Sampling
Can I (more importantly, should I) analyze all the data
that I have?
Oversampling
Can I (more importantly, should I) analyze all the data that I have?
[Figure: A sample in which the rare Fraud cases are overrepresented relative to the abundant OK cases so that the model sees enough rare events.]
The Curse of Dimensionality
Can I (more importantly, should I) analyze all the data that I have?
[Figure: The same number of points spread across 1-D, 2-D, and 3-D spaces; the data becomes sparser with each added dimension.]
Dimension Reduction
Can I (more importantly, should I) analyze all the data that I have?
[Figure: Redundancy (Input3 carries the same information as Input1) and irrelevancy (an input with no effect on E(Target)) both justify dropping inputs.]
2.04 Multiple Answer Poll
Which of the following statements are true?
a. The more data you can get, the better.
b. Too many variables can make it difficult to detect patterns in data.
c. Too few variables can make it difficult to learn interesting facts about the data.
d. Cases with missing values should generally be deleted from modeling.

2.04 Multiple Answer Poll – Correct Answers
Which of the following statements are true?
a. The more data you can get, the better.
b. Too many variables can make it difficult to detect patterns in data. (correct)
c. Too few variables can make it difficult to learn interesting facts about the data. (correct)
d. Cases with missing values should generally be deleted from modeling.
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Objectives
 Describe the basic navigation of SAS Enterprise Miner.
SAS Enterprise Miner
SAS Enterprise Miner – Interface Tour
 Menu bar and shortcut buttons
 Project panel
 Properties panel
 Help panel
 Diagram workspace
 Process flow
 Node
 SEMMA tools palette
Catalog Case Study
Analysis Goal:
A mail-order catalog retailer wants to save money on mailing and increase revenue by targeting mailed catalogs to customers who are most likely to purchase in the future.
Data set: CATALOG2010
Number of rows: 48,356
Number of columns: 98
Contents: sales figures summarized across departments and quarterly totals for 5.5 years of sales
Targets: RESPOND (binary), ORDERSIZE (continuous)
Catalog Case Study: Basics
Throughout this chapter, you work with data in SAS Enterprise Miner to perform exploratory analysis.
1. Import the CATALOG2010 data.
2. Identify the target variables.
3. Define and transform the variables for use in RFM analysis.
4. Perform graphical RFM analysis in SAS Enterprise Miner.
Later, you use the CATALOG2010 data for predictive modeling and scoring.
Accessing and Importing Data for Modeling
First, get familiar with the data! The data file is a SAS data set.
1. Create a project in SAS Enterprise Miner.
2. Create a diagram.
3. Locate and import the CATALOG2010 data.
4. Define characteristics of the data set, such as the variable roles and measurement levels.
5. Perform a basic exploratory analysis of the data.
Defining a Data Source
[Figure: The CATALOG data flows from the SAS Foundation Server libraries (library ABA1) into a metadata definition.]
Metadata Definition
Select a table. Set the metadata information.
Three purposes for metadata:
 Define variable roles (such as input, target, or ID).
 Define measurement levels (such as binary, interval, or nominal).
 Define table role (such as raw data, transactional data, or scoring data).
Creating Projects and Diagrams in SAS Enterprise Miner
Catalog Case Study
Task: Create a project and a diagram in SAS Enterprise Miner.
Defining a Data Source
Catalog Case Study
Task: Define the CATALOG data source in SAS Enterprise Miner.
Defining Column Metadata
Catalog Case Study
Task: Define column metadata.
Changing the Sampling Defaults in the Explore Window and Exploring a Data Source
Catalog Case Study
Tasks: Change preference settings in the Explore window and explore variable associations.
Idea Exchange
Consider an academic retention example. Freshmen
enter a university in the fall term, and some of them drop
out before the second term begins. Your job is to try to
predict whether a student is likely to drop out after the first
term.
Idea Exchange
 What types of variables would you consider using to assess this question?
 How does time factor into your data collection? Do inferences about students five years ago apply to students today? How do changes in technology, university policies, and teaching trends affect your conclusions?
Idea Exchange
 As an administrator, do you have this information? Could you obtain it? What types of data quality issues do you anticipate?
 Are there any ethical considerations in accessing the information in your study?
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Objectives
 Explain the characteristics of a good predictive model.
 Describe data splitting.
 Discuss the advantages of using honest assessment to evaluate a model and obtain the model with the best prediction.
Predictive Modeling Implementation
 Model Selection and Comparison
– Which model gives the best prediction?
 Decision/Allocation Rule
– What actions should be taken on new cases?
 Deployment
– How can the predictions be applied to new cases?
Getting the "Best" Prediction: Fool's Gold
My model fits the training data perfectly... I've struck it rich!
2.05 Poll
The best model is a model that does a good job of predicting your modeling data.
 Yes
 No

2.05 Poll – Correct Answer
The best model is a model that does a good job of predicting your modeling data.
 Yes
 No (correct)
Model Complexity
[Figure: A sequence of fits to the same data. The overly flexible fits chase the noise; the "just right" fit captures the signal.]
Data Splitting and Honest Assessment
Overfitting
[Figure: A model fit on the training set reproduces its noise; performance degrades on the test set.]
Better Fitting
[Figure: Comparing models on a validation set favors the smoother fit that generalizes beyond the training set.]
Predictive Modeling Implementation
 Model Selection and Comparison
– Which model gives the best prediction?
 Decision/Allocation Rule
– What actions should be taken on new cases?
 Deployment
– How can the predictions be applied to new cases?
Decisions, Decisions
Different probability cutoffs yield different classification tables (actual by predicted counts) and different trade-offs:

Cutoff .08          Predicted 0   Predicted 1
   Actual 0             360           540
   Actual 1              20            80
   Accuracy 44%, sensitivity 80%, lift 1.3

Cutoff .10          Predicted 0   Predicted 1
   Actual 0             540           360
   Actual 1              40            60
   Accuracy 60%, sensitivity 60%, lift 1.4

Cutoff .12          Predicted 0   Predicted 1
   Actual 0             720           180
   Actual 1              60            40
   Accuracy 76%, sensitivity 40%, lift 1.8

The arithmetic for the first cutoff is worked below.
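As a check on where the summary numbers come from, using the cutoff .08 counts (the labels accuracy, sensitivity, and lift are inferred from this arithmetic):
   accuracy    = (360 + 80) / 1000 = 44%
   sensitivity = 80 / (20 + 80) = 80%
   lift        = (80 / 620) / (100 / 1000) ≈ 1.3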
Misclassification Costs
The predicted class determines the action: predict 0 = Accept, predict 1 = Deny.

                     Predicted 0 (Accept)   Predicted 1 (Deny)
   Actual 0 (OK)         True Neg               False Pos
   Actual 1 (Fraud)      False Neg              True Pos

Example cost matrix:

            Accept   Deny
   OK          0       1
   Fraud       9       0

The decision rule that these costs imply is derived below.
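Those costs (illustrative numbers) determine the cutoff directly: deny whenever the expected cost of accepting exceeds the expected cost of denying, that is, whenever 9 × P(fraud) > 1 × (1 − P(fraud)), which simplifies to P(fraud) > 0.1.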
Predictive Modeling Implementation
 Model Selection and Comparison
– Which model gives the best prediction?
 Decision/Allocation Rule
– What actions should be taken on new cases?
 Deployment
– How can the predictions be applied to new cases?
Scoring
[Figure: Model development produces score code; model deployment applies it to new data.]
Scoring Recipe
 The model results in a formula or rules.
 The data requires modifications.
– Derived inputs
– Transformations
– Missing value imputation
 The scoring code is deployed.
– To score, you do not rerun the algorithm; apply the score code (equations) obtained from the final model to the scoring data.
Scorability
[Figure: A tree classifier trained on inputs X1 and X2 partitions the unit square; a new case is scored by the region it falls in.]
Scoring Code
If x1 < .47 and x2 < .18, or x1 > .47 and x2 > .29, then red.
A DATA step version of this rule follows.
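As deployed score code, the rule might look like the following DATA step sketch; the table name and the non-red class label are assumptions:

   data scored;
      set new_cases;
      length class $ 4;
      if (x1 < 0.47 and x2 < 0.18) or (x1 > 0.47 and x2 > 0.29)
         then class = 'red';
         else class = 'blue';                /* assumed other class label */
   run;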
Scoring Pitfalls: Population Drift
[Figure: Time line from data generated to data acquired, data cleaned, data analyzed, and model deployed. By deployment, the population can have drifted from the one that generated the training data.]
The Secret to Better Predictions
[Figure: Distributions of Fraud and OK cases by transaction amount overlap heavily; adding a more informative input (the "Cheatin' Heart" variable) separates the classes.]
Idea Exchange
Think of everything that you have done in the past week. What transactions or actions created data? For example, point-of-sale transactions, Internet activity, surveillance, and questionnaires are all data collection avenues that many people encounter daily.
 How do you think that the data about you will be used?
 How could models be deployed that use data about you?
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Objectives
 Describe a methodology for implementing business analytics through data mining.
 Discuss each of the steps, with examples, in the methodology.
 Create a project and diagram in SAS Enterprise Miner.
Methodology
Data mining is not a linear process. It is a cycle, where later results can lead back to previous steps:
Define or refine business objective → Select data → Explore input data → Prepare and repair data → Transform input data → Apply analysis → Deploy models → Assess results → back to the business objective
Why Have a Methodology?
 To avoid learning things that are not true
 To avoid learning things that are not useful
– results that arise from past marketing decisions
– results that you already know
– results that you already should know
– results that you are not allowed to use
 To create stable models
 To avoid making the mistakes that you made in the past
 To develop useful tips from what you learned
Methodology
1. Define the business objective and state it as a data mining task.
1) Define the Business Objective
 Improve the response rate for a direct marketing campaign.
 Increase the average order size.
 Determine what drives customer acquisition.
 Forecast the size of the customer base in the future.
 Choose the right message for the right groups of customers.
 Target a marketing campaign to maximize incremental value.
 Recommend the next, best product for existing customers.
 Segment customers by behavior.
A lot of good statistical analysis is directed at solving the wrong business problem.
Define the Business Goal
Example: Who is the yogurt lover? What is a yogurt lover?
 One answer prints coupons at the cash register.
 Another answer mails coupons to people's homes.
 Another results in advertising.

Big Challenge: Defining a Yogurt Lover
"Yogurt lover" is not in the data. You can impute it, using business rules:
 Yogurt lovers spend a lot of money on yogurt.
 Yogurt lovers spend a relatively large amount of their shopping dollars on yogurt.
[Figure: Customers plotted by $$ spent on yogurt against yogurt as % of all purchases, each axis binned LOW/MEDIUM/HIGH; yogurt lovers sit in the high/high corner.]
Next Challenge: Profile the Yogurt Lover
You have identified a segment of customers that you believe are yogurt lovers. But who are they? How would I know them in the store?
 Identify them by demographic data.
 Identify them by other things that they purchase (for example, yogurt lovers are people who buy nutrition bars and sports drinks).
What action can I take?
 Set up "yogurt-lover-attracting" displays.
Idea Exchange
If a customer is identified as a yogurt lover, what action
should you take? Should you give yogurt coupons, even
though these individuals buy yogurt anyway? Is there a
cross-sell opportunity? Is there an opportunity to identify
potential yogurt lovers? What would you do?
Profiling in the Extreme: Best Buy
Using analytical methodology, electronics retailer
Best Buy discovered that a small percentage of
customers accounted for a large percentage of revenue.
Over the past several years, the company has adopted a
customer-centric approach to store design and flow,
staffing, and even corporate acquisitions such as the
Geek Squad support team.
The company’s largest competitor has gone bankrupt
while Best Buy has seen growth in market share.
See Gulati (2010)
Define the Business Objective
What is the business objective?
Example: Telco Churn
Initial problem: Assign a churn score to all customers.
 Recent customers with little call history
 Telephones? Individuals? Families?
 Voluntary churn versus involuntary churn
How will the results be used?
Better objective: By September 24, provide a list of the 10,000 elite customers who are most likely to churn in October.
The new objective is actionable.
Define the Business Objective
Example: Credit Churn
How do you define the target? When did a customer leave?
 When she has not made a new charge in six months?
 When she had a zero balance for three months?
 When the balance does not support the cost of carrying the customer?
 When she cancels her card?
 When the contract ends?
[Figure: Churn rate (0.0% to 3.0%) by tenure in months (0 to 15).]
Translate Business Objectives into Data Mining Tasks
Do you already know the answer?
In supervised data mining, the data has examples of what you are looking for, such as the following:
 customers who responded in the past
 customers who stopped
 transactions identified as fraud
In unsupervised data mining, you are looking for new patterns, associations, and ideas.
Data Mining Tasks Lead to Specific Techniques
Objectives: Customer Acquisition, Credit Risk, Pricing, Customer Churn, Fraud Detection, Discovery, Customer Value
Tasks: Exploratory Data Analysis, Binary Response Modeling, Multiple Response Modeling, Estimation, Forecasting, Detecting Outliers, Pattern Detection
Techniques: Decision Trees, Regression, Neural Networks, Survival Analysis, Clustering, Association Rules, Link Analysis, Hypothesis Testing, Visualization
Data Analysis Is Pattern Detection
Patterns might not represent any underlying rule.
Some patterns reflect some underlying reality.
 The party that holds the White House tends to lose seats in Congress during off-year elections.
Others do not.
 When the American League wins the World Series in Major League Baseball, Republicans take the White House.
 Stars cluster in constellations.
Sometimes, it is difficult to tell without analysis.
 In U.S. presidential contests, the taller candidate usually wins.
Example: Maximizing Donations
Example from the KDD Cup, a data mining competition associated with the KDD Conference (www.sigkdd.org):
 Purpose: Maximizing profit for a charity fundraising campaign
 Tested on actual results from mailing (using data withheld from competitors)
Competitors took multiple approaches to the modeling:
 Modeling who will respond
 Modeling how much people will give
 Perhaps more esoteric approaches
However, the top three winners all took the same approach (although they used different techniques, methods, and software).
The Winning Approach: Expected Revenue
Task: Estimate response_person, the probability that a person responds to the mailing (all customers).
Task: Estimate the value of a response, dollars_person (only customers who respond).
Choose prospects with the highest expected value, response_person × dollars_person. A sketch of this step follows.
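Assuming both models' predictions are already attached to each prospect as p_response and pred_dollars (illustrative names), ranking by expected revenue is one more DATA step:

   data ranked;
      set scored_prospects;
      expected_value = p_response * pred_dollars;   /* P(response) * E(dollars | response) */
   run;

   proc sort data=ranked;
      by descending expected_value;
   run;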
An Unexpected Pattern
An unexpected pattern suggests an approach. When people give money frequently, they tend to donate less money each time.
 In most business applications, as people take an action more often, they spend more money.
 Donors to a charity are different.
This suggests that potential donors go through a two-step process:
 Shall I respond to this mailing?
 How much money should I give this time?
Modeling can follow the same logic.
Methodology
2. Select or collect the appropriate data to address the problem. Identify the customer signature.
2) Select Appropriate Data
 What is available?
 What is the right level of granularity?
 How much data is needed?
 How much history is required?
 How many variables should be used?
 What must the data contain?
Assemble results into customer signatures.
Representativeness of the Training Sample
The model set might not reflect the relevant population.
 Customers differ from prospects.
 Survey responders differ from non-responders.
 People who read e-mail differ from people who do not read e-mail.
 Customers who started three years ago might differ from customers who started three months ago.
 People with land lines differ from those without.
Availability of Relevant Data
Elevated printing defect rates might be due to humidity,
but that information is not in press run records.
Poor coverage might be the number one reason for
wireless subscribers canceling their subscriptions, but
data about dropped calls is not in billing data.
Customers might already have potential cross-sell
products from other companies, but that information is not
available internally.
Types of Attributes in Data
Readily Supported:
 Binary
 Categorical (nominal)
 Numeric (interval)
 Date and time
Require More Work:
 Text
 Image
 Video
 Links
Idea Exchange
Suppose that you were in charge of a charity similar to the
KDD example above. What type of data are you likely to
have available before beginning the project? Is there
additional data that you would need?
Do you have to purchase the data, or is it publicly
available for free? How could you make the best use of a
limited budget to acquire high quality data about individual
donation patterns?
The Customer Signature
Each row generally corresponds to a customer.
 The primary key uniquely identifies each row, often corresponding to customer ID.
 The target columns are what you are looking for. Sometimes, the information is in multiple columns, such as a churn flag and churn date.
 A foreign key gives access to data in another table, such as ZIP code demographics.
 Some columns are ignored because the values are not predictive or they contain future information, or for other reasons.
Data Assembly Operations
 Copying
 Pivoting
 Table lookup
 Derivation of new variables
 Summarization of values from data
 Aggregation
Methodology
3. Explore the data. Look for anomalies. Consider time-dependent variables. Identify key relationships among variables.
3) Explore the Data
Examine distributions.
 Study histograms.
 Think about extreme values.
 Notice the prevalence of missing values.
Compare values with descriptions.
Validate assumptions.
Ask many questions.
Ask Many Questions
 Why were some customers active for 31 days in February, but none were active for more than 28 days in January?
 How do some retail card holders spend more than $100,000 in a week in a grocery store?
 Why were so many customers born in 1911? Are they really that old?
 Why do Safari users never make second purchases?
 What does it mean when the contract begin date is after the contract end date?
 Why are there negative numbers in the sale price field?
 How can active customers have a non-null value in the cancellation reason code field?
Be Wary of Changes over Time
Does the same code have the same meaning in historical data?
Did different data elements start being loaded at different points in time?
Did something happen at a particular point in time?
[Figure: Price-related cancelations by month (May through the following June), annotated where a price increase occurs and where the "price complaint" code stops being recorded.]
Methodology
4. Prepare and repair the data. Define metadata correctly. Partition the data and create balanced samples, if necessary.
4) Prepare and Repair the Data
 Set up a proper temporal relationship between the target variable and inputs.
 Create a balanced sample, if possible.
 Include multiple time frames if necessary.
 Split the data into training, validation, and (optionally) test data sets.
Temporal Relationship: Prediction or Profiling?
The same techniques work for both.
In a predictive model, values of explanatory variables are from an earlier time frame than the target variable.
In a profiling model, the explanatory variables and the target variable might all be from the same time frame.
Balancing the Input Data Set
A very accurate model simply predicts that no one wants a brokerage account:
 98.8% accurate
 1.2% error rate
This is useless for differentiating among customers.
Distribution of the brokerage target variable: Brokerage = "Y" for 2,355 customers; Brokerage = "N" for 228,926 customers.
Two Ways to Create Balanced Data
[Figure: Either sample down the common outcome or oversample (replicate or weight) the rare outcome.]
The first route, keeping every rare event while sampling the common class, is sketched below.
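A sketch of that down-sampling, assuming the table BROKERAGE with target BROK_Y coded "Y"/"N"; the 5% rate and the seed are illustrative:

   data balanced;
      set brokerage;
      if brok_y = 'Y' then output;              /* keep every rare event         */
      else if ranuni(12345) < 0.05 then output; /* keep ~5% of the common class  */
   run;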
Data Splitting and Validation
Improving the model causes the error rate to decline on the data used to build it. At the same time, the model becomes more complex.
[Figure: Error rate falls steadily on the training data as models get more complex.]

Validation Data Prevents Overfitting
[Figure: Error rate on validation data falls to a sweet spot and then rises as complexity grows; beyond the sweet spot the model fits noise rather than signal.]
Partitioning the Input Data Set
 Training: Use the training set to find patterns and create an initial set of candidate models.
 Validation: Use the validation set to select the best model from the candidate set of models.
 Test: Use the test set to measure performance of the selected model on unseen data. The test set can be an out-of-time sample of the data, if necessary.
Partitioning data is an allowable luxury because data mining assumes a large amount of data.
Test sets do not help select the final model; they only provide an estimate of the model's effectiveness in the population. Test sets are not always used.
A simple random-split sketch follows.
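A minimal random split with 60/30/10 proportions (data set name, proportions, and seed are illustrative):

   data train validate test;
      set modeling_data;
      u = ranuni(27513);
      if u < 0.60 then output train;
      else if u < 0.90 then output validate;  /* next 30% */
      else output test;                       /* last 10% */
      drop u;
   run;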
Fix Problems with the Data
Data imperfectly describes the features of the real world.
 Data might be missing or empty.
 Samples might not be representative.
 Categorical variables might have too many values.
 Numeric variables might have unusual distributions and outliers.
 Meanings can change over time.
 Data might be coded inconsistently.
No Easy Fix for Missing Values
Throw out the records with missing values?
 No. This creates a bias for the sample.
Replace missing values with a "special" value (-99)?
 No. This resembles any other value to a data mining algorithm.
Replace with some "typical" value?
 Maybe. Replacement with the mean, median, or mode changes the distribution, but predictions might be fine (a median-replacement sketch follows this list).
Impute a value? (Imputed values should be flagged.)
 Maybe. Use the distribution of values to randomly choose a value.
 Maybe. Model the imputed value using some technique.
Use data mining techniques that can handle missing values?
 Yes. One of these, decision trees, is discussed.
Partition records and build multiple models?
 Yes. This action is possible when data is missing for a canonical reason, such as insufficient history.
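A sketch of flagged median replacement using PROC STDIZE's REPONLY option; data set and variable names are illustrative:

   data customers_flagged;
      set customers;
      income_missing = (income = .);          /* flag before replacing */
   run;

   proc stdize data=customers_flagged out=customers_imputed
               method=median reponly;         /* replace only missing values */
      var income;
   run;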
Methodology
5. Transform data. Standardize, bin, combine, replace, impute, log, and so on.
5) Transform Data
 Standardize values into z-scores.
 Change counts into percentages.
 Remove outliers.
 Capture trends with ratios, differences, or beta values.
 Combine variables to bring information to the surface.
 Replace categorical variables with some numeric function of the categorical values.
 Impute missing values.
 Transform using mathematical functions, such as logs.
 Translate dates to durations.
Example: Body Mass Index (kg/m²) is a better predictor of diabetes than either variable separately.
A Selection of Transformations
Standardize numeric values.
 All numeric values are replaced by the notion of "how far is this value from the average?"
 Conceptually, all numeric values are in the same range. (The actual range differs, but the meaning is the same.)
 Although it sometimes has no effect on the results (such as for decision trees and regression), it never produces worse results.
 Standardization is so useful that it is often built into SAS Enterprise Miner modeling nodes.
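Outside the modeling nodes, the same z-score transformation can be sketched with PROC STDIZE (variable names illustrative):

   proc stdize data=customers out=customers_std method=std;
      var income balance tenure;      /* each becomes (x - mean) / std dev */
   run;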
A Selection of Transformations
"Stretching" and "squishing" transformations
 Log, reciprocal, and square root are examples.
Replace categorical values with appropriate numeric values.
 Many techniques work better with numeric values than with categorical values.
 Historical projections (such as handset churn rate or penetration by ZIP code) are particularly useful.
Methodology
6. Apply analysis. Fit many candidate models, try different solutions, try different sets of input variables, select the best model.
6) Apply Analysis
 Regression
 Decision trees
 Cluster detection
 Association rules
 Neural networks
 Memory-based reasoning
 Survival analysis
 Link analysis
 Genetic algorithms
Train Models
Build candidate models by applying a data mining technique (or techniques) to the training data.
[Figure: A set of inputs feeding Model 1, Model 2, and Model 3, each producing an output.]
Assess Models
Assess models by applying the models to the validation data set.
[Figure: The same candidate models scoring the validation data.]
Assess Models
Score the validation data using the candidate models and then compare the results. Select the model with the best performance on the validation data set.
Communicate model assessments through
 quantitative measures
 graphs.
Look for Warnings in Models
Trailing Indicators: Learning Things That Are Not True
What happens in month 8?
[Figure: Minutes of use (0 to 120) by tenure in months (1 to 11); usage drops sharply around month 8.]
Does declining usage in month 8 predict attrition in month 9?
Look for Warnings in Models
Perfect Models: Things That Are Too Good to Be True
100% of customers who spoke to a customer support representative canceled a contract. Eureka! It's all I need to know!
 If a customer cancels, that customer is automatically flagged to get a call from customer support.
 The information is useless in predicting cancellation.
Models that seem too good usually are.
Idea Exchange
What are some other warning signs that you can think of
in modeling? Have you experienced any pitfalls that were
memorable or that changed how you approach the data
analysis objectives?
Methodology
7. Deploy models. Score new observations, make model-based decisions. Gather results of model deployment.
7) Deploy Models and Score New Data
Methodology
8. Assess the usefulness of the model. If the model has gone stale, revise it.
8) Assess Results
 Compare actual results against expectations.
 Compare the challenger's results against the champion's.
 Did the model find the right people?
 Did the action affect their behavior?
 What are the characteristics of the customers most affected by the intervention?
Good Test Design Measures the Impact of Both the Message and the Model
 Target Group (chosen by model; receives message): response measures message with model.
 Control Group (chosen at random; receives message): response measures message without model.
 Modeled Holdout (chosen by model; receives no message): response measures model without message.
 Holdout Group (chosen at random; receives no message): response measures background response.
Comparing the two groups that receive the message shows the impact of the model; comparing the two groups chosen by the model shows the impact of the message on customers with good model scores.
Test Mailing Results
E-mail campaign test results: lift 3.5
[Figure: Bar chart of response rates (0 to 0.8) for the Target Group, Control Group, and Holdout Group.]
Methodology
9. As you learn from earlier model results, refine the business goals to gain more from the data.
9) Begin Again
Revisit business objectives. Define new objectives.
Gather and evaluate new data:
 model scores
 cluster assignments
 responses
Example: A model discovers that geography is a good predictor of churn.
 What do the high-churn geographies have in common?
 Is the pattern your model discovered stable over time?
Lessons Learned
Data miners must be careful to avoid pitfalls, particularly with regard to spurious patterns in the data:
 learning things that are not true or not useful
 confusing signal and noise
 creating unstable models
A methodology is a way of being careful.
Idea Exchange
Outline a business objective of your own in terms of the
methodology described here.
What is your business objective? Can you frame it in
terms of a data mining problem? How will you select the
data? What are the inputs? What do you want to look at to
get familiar with the data?
Idea Exchange
Anticipate any data quality problems that you might
encounter and how you could go about fixing them.
Do any variables require transformation?
Proceed through the remaining steps of the methodology
as you consider your example.
Basic Data Modeling
A common approach to modeling customer value is RFM analysis, so named because it uses three key variables:
 Recency – how long it has been since the customer's last purchase
 Frequency – how many times the customer has purchased something
 Monetary value – how much money the customer has spent
RFM variables tend to predict responses to marketing campaigns effectively.
 RFM is a special case of OLAP.
RFM Cell Approach
[Figure: A cube with Recency, Frequency, and Monetary value axes; each cell is a customer group.]
RFM Cell Approach
A typical approach to RFM analysis is to bin customers into (approximately) equal-sized groups on each of the rank-ordered R, F, and M variables. For example:
 Bin five groups on R (highest bin = most recent)
 Bin five groups on F (highest bin = most frequent)
 Bin five groups on M (highest bin = highest value)
The combination of the bins gives an RFM "score" that can be compared to some target or outcome variable. Customer score 555 = most recent quintile, most frequent quintile, highest spending quintile. A binning sketch follows.
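A sketch of the quintile binning with PROC RANK, assuming recency has already been recoded so that larger values are more recent (table and variable names illustrative):

   proc rank data=customers out=rfm_ranked groups=5;
      var recency frequency monetary;
      ranks r_bin f_bin m_bin;                /* groups are numbered 0-4 */
   run;

   data rfm_scored;
      set rfm_ranked;
      r_bin = r_bin + 1;                      /* renumber bins 1-5 */
      f_bin = f_bin + 1;
      m_bin = m_bin + 1;
      rfm_score = cats(r_bin, f_bin, m_bin);  /* e.g., "555" */
   run;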
Computing Profitability in RFM
Break-even response rate = cost of promotion per dollar of net profit:
   (cost of promotion to an individual) / (average net profit per sale)
Example: It costs $2.00 to print and mail each catalog. Average net profit per transaction is $30.
   2.00 / 30.00 = 0.067
Profitable RFM cells are those with a response rate greater than 6.7%.
RFM Analysis of the Catalog Data
 Recode recency so that the highest values are the most recent.
 Bin the R, F, and M variables into five groups each, numbered 1-5, so that 1 is the least valuable and 5 is the most valuable bin.
 Concatenate the RFM variables to obtain a single RFM "score."
 Graphically investigate the response rates for the different groups.
Performing RFM Analysis of the Catalog Data
Catalog Case Study
Task: Perform RFM analysis on the catalog data.
Performing Graphical RFM Analysis
Catalog Case Study
Task: Perform graphical RFM analysis.
Limitations of RFM
Only uses three variables
 Modern data collection processes offer rich information about preferences, behaviors, attitudes, and demographics.
Scores are entirely categorical
 515, 551, and 155 are equally good, if the RFM variables are of equal importance.
 Sorting by the RFM values is not informative and overemphasizes recency.
So many categories
 The simple example above results in 125 groups.
Not very useful for finding prospective customers
 Statistics are descriptive.
Idea Exchange
Would RFM analysis apply to a business objective that
you are considering? If so, what would be your R, F, and
M variables?
What other basic analytical techniques could you use to
explore your data and get preliminary answers to your
questions?
Exercise Scenario
Practice with a charity direct mail example.
Analysis Goal:
A veterans' organization seeks continued contributions from lapsing donors. Use lapsing donor response from an earlier campaign to predict future lapsing donor response.
Exercise Data (PVA97NK):
 The data is extracted from the previous year's campaign.
 The sample is balanced with regard to response/non-response rate.
 The actual response rate is approximately 5%.
R, F, M Variables in the Charity Data Set
In the data set PVA97NK, the following variables should be used for RFM analysis:
 GiftTimeLast – time since last gift (Recency)
 GiftCntAll – gift count over all months (Frequency)
 Monetary value must be computed as GiftAvgAll*GiftCntAll (average gift amount over lifetime * total gift count)
Use SAS Enterprise Miner to create the RFM variables and bins, and then perform graphical RFM analysis. A sketch of the monetary-value derivation follows.
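Outside SAS Enterprise Miner, the monetary derivation is one assignment (library name illustrative):

   data work.pva97nk_rfm;
      set work.pva97nk;
      monetary = GiftAvgAll * GiftCntAll;     /* lifetime giving */
   run;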
Exercise
This exercise reinforces the concepts discussed
previously.
Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading
Recommended Reading
Davenport, Thomas H., Jeanne G. Harris, and Robert Morison. 2010. Analytics at Work: Smarter Decisions, Better Results. Boston: Harvard Business Press.
 Chapters 2 through 6, the DELTA method
These chapters present a complementary perspective to this chapter on how to integrate analytics at various levels of the organization.