Download Data Preparation for Analytics

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Preparation
for Analytics
Dr. Gerhard Svolba
SAS-Austria – Vienna
http://sascommunity.org/wiki/Gerhard_Svolba
Copyright © 2006, SAS Institute Inc. All rights reserved.
Agenda
ƒ Data Preparation for Analytics – General Thoughts
ƒ Data Structures for Analytics
ƒ Case Study: Building an efficient one-row-persubject data mart
ƒ Data Management with the SAS® Language and
with SAS® Tools
ƒ Closing Thoughts
Copyright © 2006, SAS Institute Inc. All rights reserved.
Some Words on Data Preparation
ƒ Is for techies
ƒ Is just coding
ƒ Is boring
ƒ Consumes up to 80 % of the project
ƒ Was not in the focus of data mining
literature so far
ƒ Is something that SAS can excellently do
ƒ Is vital to the quality of the project
Copyright © 2006, SAS Institute Inc. All rights reserved.
The Analysis Process:
From Raw Data to Actionable Results
Data Sources
Data Preparation
Different Data
Sources
Merges,
Denormalisation
Relational
Models, Star
Schemes
Derived
Variables
Data
Availability
+
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analytic Modeling
Results and Actions
Modeling,
Parameter
Estimation,
Tuning,
Usage of
Results
Interpretations
Transpositions,
Aggregations
Predictions,
Classifications,
Clustering
Adequate
Preparation
Clever
Modeling
Good
Results
+
Profiling
=
Four Dimensions
for Analytic Data Preparation
Business and
Process Knowledge
Analytical
Knowledge
Analytic
Data
Preparaton
Documentation and
Maintainance
Copyright © 2006, SAS Institute Inc. All rights reserved.
Efficient and
Clever
SAS Coding
Four Dimensions
for Analytic Data Preparation
Business and
Process Knowledge
Analytical
Knowledge
Analytic
Data
Preparaton
Documentation and
Maintainance
Copyright © 2006, SAS Institute Inc. All rights reserved.
Efficient and
Clever
SAS Coding
Challenging a Business Question
ƒ „Can you calculate the probability that a
customer will cancel the usage of product X?“
ƒ Questions:
• Cancellation or decline in usage?
• How do we treat usage of more advanced products?
• Include non-volontary cancellation?
• How quickly can retention measures take place?
• Do you want to have probabilites per customer or a list
of the 10.000 high risk customers or risk classes in
general?
Copyright © 2006, SAS Institute Inc. All rights reserved.
Business, Statistican and IT:
The Optimal Triangle !?
I have formulated my
business question. Are
there any reasons, not
to be back with results
in 2 days?
Simon
Retention Manager
Analytic
Data
Preparaton
Name me the list of
attributes and derived
variables that you will use
in your final model and
which I have to deliver
periodically.
Elias
IT Expert
Copyright © 2006, SAS Institute Inc. All rights reserved.
I would like to have an analytic
database with a high number of
attributes and an environment
to perform both, data
management and analysis.
Daniele
Quantitative Expert
Four Dimensions
for Analytic Data Preparation
Business and
Process Knowledge
Analytical
Knowledge
Analytic
Data
Preparaton
Documentation and
Maintainance
Copyright © 2006, SAS Institute Inc. All rights reserved.
Efficient and
Clever
SAS Coding
Business Questions Æ Analytical Models
ƒ Event prediction (Churn, Fraud, Delinquency,
Response, …)
ƒ Value prediction (Purchase Size, Claim Amount, …)
ƒ Clustering (Segmentation, …)
ƒ Market Basket Analysis (Association Analysis, …)
ƒ Time Series Forecasting
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analysis Subjects and Multiple Observations
ƒ Analysis subjects are entities that are being
analyzed and the analysis results are interpreted
in their context.
ƒ Multiple observations per analysis subject
• Repeated measurements over time
• Multiple observations because of hierarchical
relationships
Copyright © 2006, SAS Institute Inc. All rights reserved.
Main Types of Data Marts
One-Row-perSubject Data Mart
Multiple-Row-perSubject Data Mart
Longitudinal
Data Mart
Copyright © 2006, SAS Institute Inc. All rights reserved.
Data Mart Structures
Data Mart Structure for the Analysis
Structure of the source data:
“Multiple observations per
analysis subject exist?”
NO
YES
Copyright © 2006, SAS Institute Inc. All rights reserved.
One-Row-per-Subject
Data Mart
Data Mart with one row per
subject is created.
Information of multiple
observations has to be
aggregated or transposed per
analysis subject
Multiple-Row-per-Subject
Data Mart
(Key-Value Table)
Data Mart with one row per
multiple observations is created.
Variables at the analysis subject
level are duplicated for each
repetition
The One-Row-Per-Subject Data Mart
ƒ Required by many statistical methods
• Regression Analysis, Neural Networks, Decision Trees,
Survival analysis, Cluster analysis, …
ƒ Most prominent data mart structure in data
mining
• Event prediction (Churn, Fraud, Delinquency,
Response, …)
• Value prediction (Purchase Size, Claim Amount, …)
• Segmentation (Clustering, …)
Copyright © 2006, SAS Institute Inc. All rights reserved.
The One-Row-Per-Subject Paradigm
Copyright © 2006, SAS Institute Inc. All rights reserved.
Transposing Data to One-Row-Per-Subject
PROC TRANSPOSE DATA = long
OUT = wide(DROP = _name_)
PREFIX = weight;
BY id ;
VAR weight;
ID time;
RUN;
Copyright © 2006, SAS Institute Inc. All rights reserved.
Clever Aggregations
Interval Data
ƒ Static Aggregation
ƒ Correlation of Values
ƒ Course over Time
ƒ Concentration of Values
Categorical Data
ƒ Frequency Counts
ƒ Concatenated
Frequencies
ƒ Total and Distinct Counts
Copyright © 2006, SAS Institute Inc. All rights reserved.
Correlation of Values
How do the monthly values
per subject correlate with
the overall mean per
month?
Copyright © 2006, SAS Institute Inc. All rights reserved.
Measures for the Course over Time
PROC REG DATA = longitud NOPRINT
OUTEST=Est_LongTerm(KEEP = CustID month
RENAME = (month=LongTerm));
MODEL usage = month;
BY CustID;
RUN;
PROC REG DATA = longitud NOPRINT
OUTEST=Est_ShortTerm(KEEP = CustID month
RENAME = (month=ShortTerm));
MODEL usage = month;
BY CustID;
WHERE month in (5 6);
RUN;
Copyright
© 2006, SAS Institute Inc. All rights reserved.
Concentration of Values
Concentration =
proportion of the sum of the top
50 % sub-hierarchies
/
the total sum over all sub
hierarchies
Copyright © 2006, SAS Institute Inc. All rights reserved.
Categorical Variables:
Frequency Counts
Source Data
Absolute and Relative Frequencies
Counts and Distinct Counts
Copyright © 2006, SAS Institute Inc. All rights reserved.
Categorical Variables:
Concatenated Frequencies
Account_
Cumulative
Cumulative
RowPct
Frequency
Percent
Frequency
Percent
--------------------------------------------------------------0_100_0_0
12832
30.61
12832
30.61
100_0_0_0
9509
22.69
22341
53.30
50_0_0_50
4898
11.69
27239
64.98
33_0_0_67
1772
4.23
29011
69.21
0_0_100_0
1684
4.02
30695
73.23
67_0_0_33
1426
3.40
32121
76.63
0_0_50_50
861
2.05
32982
78.69
50_0_50_0
681
1.62
33663
80.31
Copyright © 2006, SAS Institute Inc. All rights reserved.
Hierarchies:
Aggregating Up, Copying Down
Copyright © 2006, SAS Institute Inc. All rights reserved.
Main Types of Data Marts
One-Row-perSubject Data Mart
Multiple-Row-perSubject Data Mart
Longitudinal
Data Mart
Copyright © 2006, SAS Institute Inc. All rights reserved.
ƒ The standard form for
longitudinal data
Copyright © 2006, SAS Institute Inc. All rights reserved.
ƒ The inter-leaved
longitudinal data mart
The cross-sectional dimension data mart
Copyright © 2006, SAS Institute Inc. All rights reserved.
Longitudinal Data Marts
ƒ Aggregating data at the right level
• Transactional Data
• Finest Granularity
• Most Appropriate Aggregation Level
ƒ Defining cross sectional groups
ƒ Alignment of time values
ƒ Creation of event indicators and input variables
Copyright © 2006, SAS Institute Inc. All rights reserved.
Four Dimensions
for Analytic Data Preparation
Business and
Process Knowledge
Analytical
Knowledge
Analytic
Data
Preparaton
Documentation and
Maintainance
Copyright © 2006, SAS Institute Inc. All rights reserved.
Efficient and
Clever
SAS Coding
Case Study: Business Question
ƒ Predict customers that have a high probability to
leave the company
ƒ Derive target variable „Cancellation YES/NO“
from the monthly value segment history
(Entry „8. LOST“)
ƒ Create a one-row-per-subject data mart for data
mining analysis
Copyright © 2006, SAS Institute Inc. All rights reserved.
Case Study: Data and Data Model
Copyright © 2006, SAS Institute Inc. All rights reserved.
ƒ
Customer data:
demographic and
customer baseline data
ƒ
Account data:
information customer
accounts
ƒ
Leasing data: data on
leasing information
ƒ
Call Center data: data
on Call center contacts
ƒ
Score data: data of
value segment scores
Considerations for Predictive Modeling
Only consider customers
that are still active at this
point in time
Snapshot
Date:
June 30th, 2003
Define Cancellation
YES/NO based on
Do not use input
values in the
data after this
target window
point in time
Observation Window
April
2003
Copyright © 2006, SAS Institute Inc. All rights reserved.
May
2003
June
2003
Offset
Target Window
July
2003
August
2003
Using Data from the
CALLCENTER
Table
%let snapdate = ’30JUN2003’d;
PROC FREQ DATA = callcenter NOPRINT;
TABLE CustID / OUT = CallCenterComplaints
(DROP = Percent RENAME =
(Count = Complaints));
WHERE Category = 'Complaint' and
datepart(date) <= &snapdate;
RUN;
Copyright © 2006, SAS Institute Inc. All rights reserved.
Using Data from the
SCORES-Table
%let snapdate = ’30JUN2003’d;
DATA ScoreFuture(RENAME = (ValueSegment =
FutureValueSegment))
ScoreActual
ScoreLastMonth(RENAME = (ValueSegment =
LastValueSegment));
SET Scores;
DATE = INTNX('MONTH',Date,0,'END');
DROP Date;
IF Date = &snapdate THEN OUTPUT ScoreActual;
ELSE IF Date = INTNX('MONTH',&snapdate,-1)
THEN OUTPUT ScoreLastMonth;
ELSE IF Date = INTNX('MONTH',&snapdate,2)
THEN OUTPUT ScoreFuture;
RUN;
Observation Window
April
2003
Copyright © 2006, SAS Institute Inc. All rights reserved.
May
2003
June
2003
Offset
Target Window
July
2003
August
2003
DATA CustomerMart;
ATTRIB /* Customer Baseline */
CustID
FORMAT = 8.
LABEL = "Customer ID"
Birthdate
FORMAT = DATE9. LABEL = "Date of Birth"
Alter
FORMAT = 8.
LABEL = "Age (years)"
Gender
FORMAT = $6.
LABEL = "Gender"
MaritalStatus FORMAT = $10.
LABEL = "Marital Status"
Title
FORMAT = $10.
LABEL = "Academic Title"
HasTitle
FORMAT = 8.
LABEL = "Has Title? 0/1"
Branch
FORMAT = $5.
LABEL = "Branch Name";
MERGE Customer (IN = InCustomer)
AccountSum (IN = InAccounts)
AccountTypes
LeasingSum (IN = InLeasing)
CallCenterContacts (IN = InCallCenter)
CallCenterComplaints
ScoreFuture
ScoreActual
ScoreLastMonth;
BY CustID;
IF InCustomer;
Copyright © 2006, SAS Institute Inc. All rights reserved.
/* Customer Baseline */
HasTitle = (Title ne "");
Alter = (&Snapdate-Birthdate)/365.25;
CustomerMonths = (&Snapdate- CustomerSince)/(365.25/12);
/* Accounts */
HasAccounts = InAccounts;
LoanPct = Loan / BalanceSum * 100;
SavingAccountPct = SavingAccount / BalanceSum * 100;
FundsPct = Funds / BalanceSum * 100;
/* Leasing */
HasLeasing = InLeasing;
/* Call Center */
HasCallCenter = InCallCenter;
ComplaintPct = Complaints / Calls *100;
/* Value Segment */
Cancel = (FutureValueSegment = '8. LOST');
ChangeValueSegment = (ValueSegment = LastValueSegment);
RUN;
Copyright © 2006, SAS Institute Inc. All rights reserved.
Screenshots of
the Resulting
Data Mart
Copyright © 2006, SAS Institute Inc. All rights reserved.
The Power of the SAS Language
for Data Preparation
Copyright © 2006, SAS Institute Inc. All rights reserved.
SAS Data Step vs. SQL
Transposing a table
PROC TRANSPOSE DATA = sampsio.assocs(obs=21)
OUT = assoc_tp (DROP = _name_);
BY customer;
ID Product;
RUN;
Copyright © 2006, SAS Institute Inc. All rights reserved.
Changing between longitdutinal data structures
Standard
Form
ÅÆ
Interleaved
Copyright © 2006, SAS Institute Inc. All rights reserved.
Selection of the first and last line per customer
CustID
Month
Value
FirstValue
LastValue
1
7
45
1
0
1
8
34
0
0
1
9
5
0
1
2
7
34
1
0
2
8
32
0
0
2
9
44
0
1
3
7
56
1
0
3
8
54
0
0
3
9
32
0
1
Copyright © 2006, SAS Institute Inc. All rights reserved.
Creating a sequential number and summing items
Copyright © 2006, SAS Institute Inc. All rights reserved.
CustID
Date
Points
PurchNo
Cum_Poi
1
10.03.2004
45
1
45
1
04.04.2004
10
2
55
1
20.04.2004
20
3
75
1
16.05.2004
18
4
93
1
01.06.2004
5
5
98
2
01.02.2004
10
1
10
2
19.03.2004
30
2
40
3
05.08.2004
4
1
4
3
16.08.2004
16
2
20
3
31.08.2004
12
3
32
3
10.09.2004
20
4
52
Copying omitted data
CustID
Age
Gender
Month
Value
1
26
m
7
45
8
34
9
5
7
34
8
32
9
44
7
56
8
54
9
32
2
3
37
46
w
m
Copyright © 2006, SAS Institute Inc. All rights reserved.
CustID
Age
Gender
Month
Value
1
26
m
7
45
1
26
m
8
34
1
26
m
9
5
2
37
w
7
34
2
37
w
8
32
2
37
w
9
44
3
46
m
7
56
3
46
m
8
54
3
46
m
9
32
Shift down a column for k positions
Value_
PrevDay
Date
Value
01.01.2004
45
.
01.02.2004
34
45
01.03.2004
5
34
01.04.2004
34
5
01.05.2004
32
34
01.06.2004
44
32
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analytic Data Preparation in
SAS® Enterprise Miner
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analytic Data Preparation in
SAS® Enterprise Miner
ƒ Input Data Source Node
ƒ Filter Node
ƒ Sample Node
ƒ Transform Variables Node
ƒ Data Partition Node
ƒ Impute Node
ƒ Metadata Node
ƒ SAS Code Node
ƒ Principal Components
Node
Copyright © 2006, SAS Institute Inc. All rights reserved.
Working with Multiple-Row-per-Subject Data in
SAS® Enterprise Miner
ƒ Time Series
Node
ƒ Association
Node
ƒ Merge Node
Copyright © 2006, SAS Institute Inc. All rights reserved.
Extending SAS®Enterprise Miner
with Data Preparation for Analytics” Nodes
Extension Nodes
ƒ Correlation
ƒ TrendRegression
ƒ Concentration
ƒ CategoryCount
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analytic Data Preparation in
SAS® Enterprise Miner - Summary
ƒ Documentation of the data preparation process
as a whole in the process flow diagram
ƒ Definition of variables metadata that are used in
the analysis
ƒ Automatic Creation of Dummy-Variables for
categorical data
ƒ Powerful data transformations in the filter node,
transform variables node and impute node.
ƒ Possibility to perform association analysis and
time-series-analysis
ƒ Creation of score code from all nodes in SAS
Enterprise Miner as SAS datastep code
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analytic Data Preparation in
SAS® Forecast Studio
Copyright © 2006, SAS Institute Inc. All rights reserved.
Analytic Data Preparation in
SAS® Forecast Studio
ƒ Convert transactional data into time series data
ƒ Treatment of missing values
ƒ Alignment of date values in intervals
ƒ Aggregation of data on various levels
ƒ Handling of events
• Allows users to create event definitions, assign events
to selected series in the project and delete events
• Users can specify event duration, shape, and
recurrence options.
• Pre-defined common events and holidays are available
for inclusion in the forecasting models
Copyright © 2006, SAS Institute Inc. All rights reserved.
Four Dimensions
for Analytic Data Preparation
Business and
Process Knowledge
Analytical
Knowledge
Analytic
Data
Preparaton
Documentation and
Maintainance
Copyright © 2006, SAS Institute Inc. All rights reserved.
Efficient and
Clever
SAS Coding
SAS® Data Integration Studio
Copyright © 2006, SAS Institute Inc. All rights reserved.
SAS®
Data
Integration
Studio
Copyright © 2006, SAS Institute Inc. All rights reserved.
SAS® Data Integration Studio – General Features
ƒ Comprehensive transformation library
ƒ Graphical user interface, drag & drop, wizard
driven
ƒ Multi-developer support: Check-in/check-out,
change management
ƒ Impact analysis
ƒ Documentation of the data management
process
ƒ Data Mining Model Scoring
• Register Model in SAS metadata repository
• Apply SAS Enterprise Miner model to new data source
• Creates the target table definition in the metadata
Copyright © 2006, SAS Institute Inc. All rights reserved.
SAS® DI Studio – Analytic Features
ƒ Data Mining Model Scoring
• Register Model in SAS metadata repository
• Apply SAS Enterprise Miner model to new data source
• Creates the target table definition in the metadata
ƒ Forecast Analysis Transformation
• Allows to perform time series analysis
• Based on HPF (high performance forecasting) procedure
• Allows to integrate the forecast step into the data flow
process in the metadata
Copyright © 2006, SAS Institute Inc. All rights reserved.
Summary
ƒ Analytic Data Preparation is a discipline,
not a incommodious necessity
ƒ Analytic Data Preparation is more than just coding
ƒ The One-Row-Per-Subject Paradigm
• Central in data mining and predictive modeling
• Do not stop with simple transpose, summing or averaging
• Clever aggregations can be the key success factor
ƒ Predictive Modeling and Historic Data:
Which data are you allowed to use for modeling?
ƒ SAS combines powerful data management functionality
and market leading analytics in one integrated package
ƒ SAS Tools like SAS® DI-Studio, SAS® Enterprise Miner
and SAS® Forecast Studio assist you in analytic data
preparation
Copyright © 2006, SAS Institute Inc. All rights reserved.
The commercial break …
Copyright © 2006, SAS Institute Inc. All rights reserved.
"It is exciting to see a book completely devoted to data
preparation in SAS."
Jin Li
Statistician
Capital One Financial Services
This is a must read for anyone, who prepares data for data
mining and analytics. Not only you receive ideas and
suggestions for important derived variables for your
datamart, you also get a lot of insight on the business
rationale behind the scenes. The author does a great job in
explaining you step by step the world of data preparation for
data mining and shows a lot of example code and macros.
Christine Hallwirth
MHE & Partners
Gerhard Svolba's book "Data Preparation for Analytics" is an
"owner's manual" for data miners, business analysts and all
who prefer to be in charge of and responsible for their own
datasets. This book is for those who are not afraid of data,
who understand data, and for whom rolling up their sleeves
and getting their hands into data is an integral part of
analytics and predictive modeling.
Copyright © 2006, SAS Institute Inc. All rights reserved.
Lessia Shajenko
(American Bank)
Recommended
Reading
Data Preparation for Analytics
by Gerhard Svolba
SAS-Press (#60502)
Business Rationale
Concepts
Coding Examples
Copyright © 2006, SAS Institute Inc. All rights reserved.
Downloads
ƒ Data Preparation for Analytics
•
http://www.sas.com/apps/pubscat/bookdetails.jsp?pc=60502
•
http://support.sas.com/publishing/bbu/companion_site/60502.html
•
http://www2.sas.com/proceedings/sugi31/078-31.pdf
•
http://sascommunity.org/wiki/Data_Preparation_for_Analytics
Copyright © 2006, SAS Institute Inc. All rights reserved.
Questions and Contact
ƒ Dr. Gerhard Svolba
ƒ Email: [email protected]
Copyright © 2006, SAS Institute Inc. All rights reserved.