Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Preparation for Analytics Dr. Gerhard Svolba SAS-Austria – Vienna http://sascommunity.org/wiki/Gerhard_Svolba Copyright © 2006, SAS Institute Inc. All rights reserved. Agenda Data Preparation for Analytics – General Thoughts Data Structures for Analytics Case Study: Building an efficient one-row-persubject data mart Data Management with the SAS® Language and with SAS® Tools Closing Thoughts Copyright © 2006, SAS Institute Inc. All rights reserved. Some Words on Data Preparation Is for techies Is just coding Is boring Consumes up to 80 % of the project Was not in the focus of data mining literature so far Is something that SAS can excellently do Is vital to the quality of the project Copyright © 2006, SAS Institute Inc. All rights reserved. The Analysis Process: From Raw Data to Actionable Results Data Sources Data Preparation Different Data Sources Merges, Denormalisation Relational Models, Star Schemes Derived Variables Data Availability + Copyright © 2006, SAS Institute Inc. All rights reserved. Analytic Modeling Results and Actions Modeling, Parameter Estimation, Tuning, Usage of Results Interpretations Transpositions, Aggregations Predictions, Classifications, Clustering Adequate Preparation Clever Modeling Good Results + Profiling = Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Documentation and Maintainance Copyright © 2006, SAS Institute Inc. All rights reserved. Efficient and Clever SAS Coding Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Documentation and Maintainance Copyright © 2006, SAS Institute Inc. All rights reserved. Efficient and Clever SAS Coding Challenging a Business Question „Can you calculate the probability that a customer will cancel the usage of product X?“ Questions: • Cancellation or decline in usage? • How do we treat usage of more advanced products? • Include non-volontary cancellation? • How quickly can retention measures take place? • Do you want to have probabilites per customer or a list of the 10.000 high risk customers or risk classes in general? Copyright © 2006, SAS Institute Inc. All rights reserved. Business, Statistican and IT: The Optimal Triangle !? I have formulated my business question. Are there any reasons, not to be back with results in 2 days? Simon Retention Manager Analytic Data Preparaton Name me the list of attributes and derived variables that you will use in your final model and which I have to deliver periodically. Elias IT Expert Copyright © 2006, SAS Institute Inc. All rights reserved. I would like to have an analytic database with a high number of attributes and an environment to perform both, data management and analysis. Daniele Quantitative Expert Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Documentation and Maintainance Copyright © 2006, SAS Institute Inc. All rights reserved. Efficient and Clever SAS Coding Business Questions Æ Analytical Models Event prediction (Churn, Fraud, Delinquency, Response, …) Value prediction (Purchase Size, Claim Amount, …) Clustering (Segmentation, …) Market Basket Analysis (Association Analysis, …) Time Series Forecasting Copyright © 2006, SAS Institute Inc. All rights reserved. Analysis Subjects and Multiple Observations Analysis subjects are entities that are being analyzed and the analysis results are interpreted in their context. Multiple observations per analysis subject • Repeated measurements over time • Multiple observations because of hierarchical relationships Copyright © 2006, SAS Institute Inc. All rights reserved. Main Types of Data Marts One-Row-perSubject Data Mart Multiple-Row-perSubject Data Mart Longitudinal Data Mart Copyright © 2006, SAS Institute Inc. All rights reserved. Data Mart Structures Data Mart Structure for the Analysis Structure of the source data: “Multiple observations per analysis subject exist?” NO YES Copyright © 2006, SAS Institute Inc. All rights reserved. One-Row-per-Subject Data Mart Data Mart with one row per subject is created. Information of multiple observations has to be aggregated or transposed per analysis subject Multiple-Row-per-Subject Data Mart (Key-Value Table) Data Mart with one row per multiple observations is created. Variables at the analysis subject level are duplicated for each repetition The One-Row-Per-Subject Data Mart Required by many statistical methods • Regression Analysis, Neural Networks, Decision Trees, Survival analysis, Cluster analysis, … Most prominent data mart structure in data mining • Event prediction (Churn, Fraud, Delinquency, Response, …) • Value prediction (Purchase Size, Claim Amount, …) • Segmentation (Clustering, …) Copyright © 2006, SAS Institute Inc. All rights reserved. The One-Row-Per-Subject Paradigm Copyright © 2006, SAS Institute Inc. All rights reserved. Transposing Data to One-Row-Per-Subject PROC TRANSPOSE DATA = long OUT = wide(DROP = _name_) PREFIX = weight; BY id ; VAR weight; ID time; RUN; Copyright © 2006, SAS Institute Inc. All rights reserved. Clever Aggregations Interval Data Static Aggregation Correlation of Values Course over Time Concentration of Values Categorical Data Frequency Counts Concatenated Frequencies Total and Distinct Counts Copyright © 2006, SAS Institute Inc. All rights reserved. Correlation of Values How do the monthly values per subject correlate with the overall mean per month? Copyright © 2006, SAS Institute Inc. All rights reserved. Measures for the Course over Time PROC REG DATA = longitud NOPRINT OUTEST=Est_LongTerm(KEEP = CustID month RENAME = (month=LongTerm)); MODEL usage = month; BY CustID; RUN; PROC REG DATA = longitud NOPRINT OUTEST=Est_ShortTerm(KEEP = CustID month RENAME = (month=ShortTerm)); MODEL usage = month; BY CustID; WHERE month in (5 6); RUN; Copyright © 2006, SAS Institute Inc. All rights reserved. Concentration of Values Concentration = proportion of the sum of the top 50 % sub-hierarchies / the total sum over all sub hierarchies Copyright © 2006, SAS Institute Inc. All rights reserved. Categorical Variables: Frequency Counts Source Data Absolute and Relative Frequencies Counts and Distinct Counts Copyright © 2006, SAS Institute Inc. All rights reserved. Categorical Variables: Concatenated Frequencies Account_ Cumulative Cumulative RowPct Frequency Percent Frequency Percent --------------------------------------------------------------0_100_0_0 12832 30.61 12832 30.61 100_0_0_0 9509 22.69 22341 53.30 50_0_0_50 4898 11.69 27239 64.98 33_0_0_67 1772 4.23 29011 69.21 0_0_100_0 1684 4.02 30695 73.23 67_0_0_33 1426 3.40 32121 76.63 0_0_50_50 861 2.05 32982 78.69 50_0_50_0 681 1.62 33663 80.31 Copyright © 2006, SAS Institute Inc. All rights reserved. Hierarchies: Aggregating Up, Copying Down Copyright © 2006, SAS Institute Inc. All rights reserved. Main Types of Data Marts One-Row-perSubject Data Mart Multiple-Row-perSubject Data Mart Longitudinal Data Mart Copyright © 2006, SAS Institute Inc. All rights reserved. The standard form for longitudinal data Copyright © 2006, SAS Institute Inc. All rights reserved. The inter-leaved longitudinal data mart The cross-sectional dimension data mart Copyright © 2006, SAS Institute Inc. All rights reserved. Longitudinal Data Marts Aggregating data at the right level • Transactional Data • Finest Granularity • Most Appropriate Aggregation Level Defining cross sectional groups Alignment of time values Creation of event indicators and input variables Copyright © 2006, SAS Institute Inc. All rights reserved. Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Documentation and Maintainance Copyright © 2006, SAS Institute Inc. All rights reserved. Efficient and Clever SAS Coding Case Study: Business Question Predict customers that have a high probability to leave the company Derive target variable „Cancellation YES/NO“ from the monthly value segment history (Entry „8. LOST“) Create a one-row-per-subject data mart for data mining analysis Copyright © 2006, SAS Institute Inc. All rights reserved. Case Study: Data and Data Model Copyright © 2006, SAS Institute Inc. All rights reserved. Customer data: demographic and customer baseline data Account data: information customer accounts Leasing data: data on leasing information Call Center data: data on Call center contacts Score data: data of value segment scores Considerations for Predictive Modeling Only consider customers that are still active at this point in time Snapshot Date: June 30th, 2003 Define Cancellation YES/NO based on Do not use input values in the data after this target window point in time Observation Window April 2003 Copyright © 2006, SAS Institute Inc. All rights reserved. May 2003 June 2003 Offset Target Window July 2003 August 2003 Using Data from the CALLCENTER Table %let snapdate = ’30JUN2003’d; PROC FREQ DATA = callcenter NOPRINT; TABLE CustID / OUT = CallCenterComplaints (DROP = Percent RENAME = (Count = Complaints)); WHERE Category = 'Complaint' and datepart(date) <= &snapdate; RUN; Copyright © 2006, SAS Institute Inc. All rights reserved. Using Data from the SCORES-Table %let snapdate = ’30JUN2003’d; DATA ScoreFuture(RENAME = (ValueSegment = FutureValueSegment)) ScoreActual ScoreLastMonth(RENAME = (ValueSegment = LastValueSegment)); SET Scores; DATE = INTNX('MONTH',Date,0,'END'); DROP Date; IF Date = &snapdate THEN OUTPUT ScoreActual; ELSE IF Date = INTNX('MONTH',&snapdate,-1) THEN OUTPUT ScoreLastMonth; ELSE IF Date = INTNX('MONTH',&snapdate,2) THEN OUTPUT ScoreFuture; RUN; Observation Window April 2003 Copyright © 2006, SAS Institute Inc. All rights reserved. May 2003 June 2003 Offset Target Window July 2003 August 2003 DATA CustomerMart; ATTRIB /* Customer Baseline */ CustID FORMAT = 8. LABEL = "Customer ID" Birthdate FORMAT = DATE9. LABEL = "Date of Birth" Alter FORMAT = 8. LABEL = "Age (years)" Gender FORMAT = $6. LABEL = "Gender" MaritalStatus FORMAT = $10. LABEL = "Marital Status" Title FORMAT = $10. LABEL = "Academic Title" HasTitle FORMAT = 8. LABEL = "Has Title? 0/1" Branch FORMAT = $5. LABEL = "Branch Name"; MERGE Customer (IN = InCustomer) AccountSum (IN = InAccounts) AccountTypes LeasingSum (IN = InLeasing) CallCenterContacts (IN = InCallCenter) CallCenterComplaints ScoreFuture ScoreActual ScoreLastMonth; BY CustID; IF InCustomer; Copyright © 2006, SAS Institute Inc. All rights reserved. /* Customer Baseline */ HasTitle = (Title ne ""); Alter = (&Snapdate-Birthdate)/365.25; CustomerMonths = (&Snapdate- CustomerSince)/(365.25/12); /* Accounts */ HasAccounts = InAccounts; LoanPct = Loan / BalanceSum * 100; SavingAccountPct = SavingAccount / BalanceSum * 100; FundsPct = Funds / BalanceSum * 100; /* Leasing */ HasLeasing = InLeasing; /* Call Center */ HasCallCenter = InCallCenter; ComplaintPct = Complaints / Calls *100; /* Value Segment */ Cancel = (FutureValueSegment = '8. LOST'); ChangeValueSegment = (ValueSegment = LastValueSegment); RUN; Copyright © 2006, SAS Institute Inc. All rights reserved. Screenshots of the Resulting Data Mart Copyright © 2006, SAS Institute Inc. All rights reserved. The Power of the SAS Language for Data Preparation Copyright © 2006, SAS Institute Inc. All rights reserved. SAS Data Step vs. SQL Transposing a table PROC TRANSPOSE DATA = sampsio.assocs(obs=21) OUT = assoc_tp (DROP = _name_); BY customer; ID Product; RUN; Copyright © 2006, SAS Institute Inc. All rights reserved. Changing between longitdutinal data structures Standard Form ÅÆ Interleaved Copyright © 2006, SAS Institute Inc. All rights reserved. Selection of the first and last line per customer CustID Month Value FirstValue LastValue 1 7 45 1 0 1 8 34 0 0 1 9 5 0 1 2 7 34 1 0 2 8 32 0 0 2 9 44 0 1 3 7 56 1 0 3 8 54 0 0 3 9 32 0 1 Copyright © 2006, SAS Institute Inc. All rights reserved. Creating a sequential number and summing items Copyright © 2006, SAS Institute Inc. All rights reserved. CustID Date Points PurchNo Cum_Poi 1 10.03.2004 45 1 45 1 04.04.2004 10 2 55 1 20.04.2004 20 3 75 1 16.05.2004 18 4 93 1 01.06.2004 5 5 98 2 01.02.2004 10 1 10 2 19.03.2004 30 2 40 3 05.08.2004 4 1 4 3 16.08.2004 16 2 20 3 31.08.2004 12 3 32 3 10.09.2004 20 4 52 Copying omitted data CustID Age Gender Month Value 1 26 m 7 45 8 34 9 5 7 34 8 32 9 44 7 56 8 54 9 32 2 3 37 46 w m Copyright © 2006, SAS Institute Inc. All rights reserved. CustID Age Gender Month Value 1 26 m 7 45 1 26 m 8 34 1 26 m 9 5 2 37 w 7 34 2 37 w 8 32 2 37 w 9 44 3 46 m 7 56 3 46 m 8 54 3 46 m 9 32 Shift down a column for k positions Value_ PrevDay Date Value 01.01.2004 45 . 01.02.2004 34 45 01.03.2004 5 34 01.04.2004 34 5 01.05.2004 32 34 01.06.2004 44 32 Copyright © 2006, SAS Institute Inc. All rights reserved. Analytic Data Preparation in SAS® Enterprise Miner Copyright © 2006, SAS Institute Inc. All rights reserved. Analytic Data Preparation in SAS® Enterprise Miner Input Data Source Node Filter Node Sample Node Transform Variables Node Data Partition Node Impute Node Metadata Node SAS Code Node Principal Components Node Copyright © 2006, SAS Institute Inc. All rights reserved. Working with Multiple-Row-per-Subject Data in SAS® Enterprise Miner Time Series Node Association Node Merge Node Copyright © 2006, SAS Institute Inc. All rights reserved. Extending SAS®Enterprise Miner with Data Preparation for Analytics” Nodes Extension Nodes Correlation TrendRegression Concentration CategoryCount Copyright © 2006, SAS Institute Inc. All rights reserved. Analytic Data Preparation in SAS® Enterprise Miner - Summary Documentation of the data preparation process as a whole in the process flow diagram Definition of variables metadata that are used in the analysis Automatic Creation of Dummy-Variables for categorical data Powerful data transformations in the filter node, transform variables node and impute node. Possibility to perform association analysis and time-series-analysis Creation of score code from all nodes in SAS Enterprise Miner as SAS datastep code Copyright © 2006, SAS Institute Inc. All rights reserved. Analytic Data Preparation in SAS® Forecast Studio Copyright © 2006, SAS Institute Inc. All rights reserved. Analytic Data Preparation in SAS® Forecast Studio Convert transactional data into time series data Treatment of missing values Alignment of date values in intervals Aggregation of data on various levels Handling of events • Allows users to create event definitions, assign events to selected series in the project and delete events • Users can specify event duration, shape, and recurrence options. • Pre-defined common events and holidays are available for inclusion in the forecasting models Copyright © 2006, SAS Institute Inc. All rights reserved. Four Dimensions for Analytic Data Preparation Business and Process Knowledge Analytical Knowledge Analytic Data Preparaton Documentation and Maintainance Copyright © 2006, SAS Institute Inc. All rights reserved. Efficient and Clever SAS Coding SAS® Data Integration Studio Copyright © 2006, SAS Institute Inc. All rights reserved. SAS® Data Integration Studio Copyright © 2006, SAS Institute Inc. All rights reserved. SAS® Data Integration Studio – General Features Comprehensive transformation library Graphical user interface, drag & drop, wizard driven Multi-developer support: Check-in/check-out, change management Impact analysis Documentation of the data management process Data Mining Model Scoring • Register Model in SAS metadata repository • Apply SAS Enterprise Miner model to new data source • Creates the target table definition in the metadata Copyright © 2006, SAS Institute Inc. All rights reserved. SAS® DI Studio – Analytic Features Data Mining Model Scoring • Register Model in SAS metadata repository • Apply SAS Enterprise Miner model to new data source • Creates the target table definition in the metadata Forecast Analysis Transformation • Allows to perform time series analysis • Based on HPF (high performance forecasting) procedure • Allows to integrate the forecast step into the data flow process in the metadata Copyright © 2006, SAS Institute Inc. All rights reserved. Summary Analytic Data Preparation is a discipline, not a incommodious necessity Analytic Data Preparation is more than just coding The One-Row-Per-Subject Paradigm • Central in data mining and predictive modeling • Do not stop with simple transpose, summing or averaging • Clever aggregations can be the key success factor Predictive Modeling and Historic Data: Which data are you allowed to use for modeling? SAS combines powerful data management functionality and market leading analytics in one integrated package SAS Tools like SAS® DI-Studio, SAS® Enterprise Miner and SAS® Forecast Studio assist you in analytic data preparation Copyright © 2006, SAS Institute Inc. All rights reserved. The commercial break … Copyright © 2006, SAS Institute Inc. All rights reserved. "It is exciting to see a book completely devoted to data preparation in SAS." Jin Li Statistician Capital One Financial Services This is a must read for anyone, who prepares data for data mining and analytics. Not only you receive ideas and suggestions for important derived variables for your datamart, you also get a lot of insight on the business rationale behind the scenes. The author does a great job in explaining you step by step the world of data preparation for data mining and shows a lot of example code and macros. Christine Hallwirth MHE & Partners Gerhard Svolba's book "Data Preparation for Analytics" is an "owner's manual" for data miners, business analysts and all who prefer to be in charge of and responsible for their own datasets. This book is for those who are not afraid of data, who understand data, and for whom rolling up their sleeves and getting their hands into data is an integral part of analytics and predictive modeling. Copyright © 2006, SAS Institute Inc. All rights reserved. Lessia Shajenko (American Bank) Recommended Reading Data Preparation for Analytics by Gerhard Svolba SAS-Press (#60502) Business Rationale Concepts Coding Examples Copyright © 2006, SAS Institute Inc. All rights reserved. Downloads Data Preparation for Analytics • http://www.sas.com/apps/pubscat/bookdetails.jsp?pc=60502 • http://support.sas.com/publishing/bbu/companion_site/60502.html • http://www2.sas.com/proceedings/sugi31/078-31.pdf • http://sascommunity.org/wiki/Data_Preparation_for_Analytics Copyright © 2006, SAS Institute Inc. All rights reserved. Questions and Contact Dr. Gerhard Svolba Email: [email protected] Copyright © 2006, SAS Institute Inc. All rights reserved.