Everything you ever wanted to know about data mining but were afraid to ask.

David Duling
Data Mining R&D Director
SAS Institute
Copyright © 2006, SAS Institute Inc. All rights reserved.
Company confidential - for internal use only

Abstract

Data mining is the process of systematically sifting through often large databases to identify patterns and trends relevant to solving business problems such as increasing sales and efficiency. Successful examples of data mining can be found in many areas of business and science, including customer relationship management, market basket analysis, human resources, bioinformatics and medicine, fraud detection, and web search. The rapid growth in data mining is largely due to the increased availability of large databases, advances in large-scale computing, and the development of data mining algorithms. This presentation begins with a brief history of data mining, covers current trends with case studies, and finishes with a look into the future of data mining from the SAS perspective.

Q: Where did data mining start?

1841
• Lewis Tappan formed the Mercantile Company to provide creditworthiness reports (i.e., credit scores) to New York merchants.
• Employed a number of ‘correspondents’ in western frontier towns to monitor the behavior of local traders, in addition to pooling merchant records. A huge database was accumulated.
• Enormous success!
1849
• John Bradstreet starts a credit reporting company in Cincinnati, OH.
1859
• The Mercantile Company was sold to Robert Graham Dun.
1933
• The Dun company merged with the Bradstreet company.

Q: How about something with computers?
Date: September 1963
Authors: James Myers and Edward Forgy
Title: The Development of Numerical Credit Evaluation Systems
Publication: Journal of the American Statistical Association

Abstract

Several discriminant and multiple regression analyses were performed on retail credit application data to develop a numerical scoring system for predicting credit risk in a finance company. Results showed that equal weights for all significantly predictive items were as effective as weights from the more sophisticated techniques of discriminant analysis and "stepwise multiple regression." However, a variation of the basic discriminant analysis produced a better separation of groups at the lower score levels, where more potential losses could be eliminated with a minimum cost of potentially good accounts.

Data Mining, circa 1963
• IBM 7090
• 600 applicants
• “Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”

Q: That sounds like statistics, so what’s the difference?

Statistics                          Data mining
Experimental                        Commercial
Prior hypothesis                    Posterior hypothesis
• Idea before data acquisition      • Idea after data acquisition
• Data acquisition planned          • Data acquisition opportunistic
Experimental design                 No experimental design
• Sampling strategies               • Explore data
• Factorial designs                 • Create hypothesis
• Required confidence               • Generate query
• Minimize model terms              • Create models
Inference                           Prediction
• Hypothesis testing                • Lift, Profit, Response
• Prediction                        • Inference

It’s all about the Data

              Experimental          Opportunistic
Purpose       Research              Operational
Value         Scientific            Commercial
Generation    Actively controlled   Passively observed
Size          Small                 Massive
Hygiene       Clean                 Dirty
State         Static                Dynamic

Where does mining data come from?

Data warehouses store detail data on transactions and states.

[Figure: entity-relationship diagram, labeled "Very simple demo example". Tables shown: Geo_Type, Region, County, State, Street_Code, Zip_Code, City, Country, Continent, Customer, Customer_Type, Supplier, Price_List, Promotion, Order, Order_Item, Product, Product_Level, Staff, Organization, Org_Level. The column layout of most tables was scrambled in extraction; the three central tables read as follows.]

Customer
  Customer_Id: INTEGER
  Country: CHARACTER(2)
  Gender: CHARACTER(1)
  Personal_Id: CHARACTER(15)
  Customer_Name: CHARACTER(40)
  Customer_Firstname: CHARACTER(20)
  Customer_Lastname: CHARACTER(30)
  Birthday: DATE
  Customer_Address: CHARACTER(40)
  Street_Id: INTEGER (FK)
  Street_Number: CHARACTER(8)
  Customer_Type_Id: INTEGER (FK)

Order
  Order_Id: INTEGER
  Employee_Id: INTEGER (FK)
  Customer_Id: INTEGER (FK)
  Order_Date: DATE
  Delivery_Date: DATE
  Order_Type: INTEGER

Order_Item
  Order_Id: INTEGER (FK)
  Order_Item_No: INTEGER
  Product_Id: INTEGER (FK)
  Amount: SMALLINT
  Price: DECIMAL(12,2)
  Unit_Cost_Price: DECIMAL(12,2)
  Promotion: DECIMAL(5,2)

Data: majority of time spent on mining

Source: Intelligent Enterprise Magazine, http://www.intelligententerprise.com/030405/606feat2_1.jhtml
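
A warehouse such as the demo schema stores one row per transaction, whereas a modeling table usually needs one row per customer. The following is a minimal sketch of that roll-up, not the presenter's method. The sample rows and the summary fields (n_orders, n_items, total_spend) are invented for illustration, and Price is assumed to be a unit price; only the column names Order_Id, Customer_Id, Product_Id, Amount, and Price come from the schema.

```python
from collections import defaultdict

# Hypothetical sample rows; column names follow the demo schema's
# Order and Order_Item tables, the values are made up.
orders = [
    {"Order_Id": 1, "Customer_Id": 10},
    {"Order_Id": 2, "Customer_Id": 10},
    {"Order_Id": 3, "Customer_Id": 20},
]
order_items = [
    {"Order_Id": 1, "Product_Id": 100, "Amount": 2, "Price": 5.00},
    {"Order_Id": 1, "Product_Id": 101, "Amount": 1, "Price": 20.00},
    {"Order_Id": 2, "Product_Id": 100, "Amount": 3, "Price": 5.00},
    {"Order_Id": 3, "Product_Id": 102, "Amount": 1, "Price": 7.50},
]


def customer_summary(orders, order_items):
    """Roll transaction detail up to one row per customer."""
    # Map each order to its customer so items can be attributed.
    customer_of = {o["Order_Id"]: o["Customer_Id"] for o in orders}
    summary = defaultdict(
        lambda: {"n_orders": 0, "n_items": 0, "total_spend": 0.0}
    )
    for o in orders:
        summary[o["Customer_Id"]]["n_orders"] += 1
    for item in order_items:
        row = summary[customer_of[item["Order_Id"]]]
        row["n_items"] += item["Amount"]
        # Assumption: Price is the unit price, so spend = Amount * Price.
        row["total_spend"] += item["Amount"] * item["Price"]
    return dict(summary)


table = customer_summary(orders, order_items)
for customer_id, row in sorted(table.items()):
    print(customer_id, row)
```

Each resulting row can then be joined with demographic or web data on the customer ID to form the integrated table used for modeling.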
Each data type has its own mix of data prep and mining
• Demographics, personal information
• Market baskets, item sets
• Market baskets with time order
• Web paths: unique sequences
• Time-stamped transactions
• Text normalization

Use integrated data for mining
[Figure: integrated mining table combining web paths, market baskets, demographic and financial data, seasonal indices, interactions, ID columns, and text dimensions]

Example
Quest: Maximize response to this year’s summer promotion.
How: Find those customers most likely to respond.
Use response to last year’s summer promotion as an indicator of response to this year’s promotion. This is the dependent variable.
Use all customer data available before last summer. These are the independent variables.
• Demographics
• Sales item history
• Sales amount history
• Web site history
• Call center records
• ….

Q: Why don’t we just select last year’s customer lists?
Data is non-stationary (remember this point). Customers:
• Move to new locations
• Change jobs
• Income goes up or down
• Debt increases or decreases
• Marital status
• Parental status
• …
The model is a function of the attributes, not the individuals.

Q: What functions are popular?

Associations          unsupervised
Clustering            unsupervised
PCA/SVD               unsupervised
Logistic Regression   supervised
Decision Tree         supervised
Neural Network        supervised
Ensembles             supervised
… and many other forms, variants, and names

How do I find patterns?
Try ASSOCIATIONS and SEQUENCES
Searches for frequent patterns:
• (Car Wreck → Dr. X)
• (Diagnosis Code xxx → MRI)
Confidence: if (A) happens, then (B) happens 80% of the time.
  C = n(A and B) / n(A)
Support: (A) → (B) happens in 10% of all itemsets.
  S = n(A and B) / N
Here n(·) counts the itemsets that contain the given items and N is the total number of itemsets.

Rule Sets show the next most likely action
[Figure: rule set display]

Associations and Sequences / Visualization
[Figure: association and sequence visualization]

What is the most popular data mining function?
Logistic Regression still rules!
• Linear combination of terms (z)
• Relatively easy to compute
• Converges to a solution
• Explainable
[Figure: probability (Prob) plotted against Input]

Q: What is CART and why do I need it?
A decision tree!
Classification and regression; strategy development.
• Hunt 1966: Concept Learning System
• Kass 1980: Chi-squared Automatic Interaction Detection
• Breiman 1984: Classification and Regression Trees
• Quinlan 1993: C4.5 rule sets
• Numerous others…
• Algorithms for efficiently building trees
• Hypothesis tests for finding split points
  − Various measurement scales

Building a Decision Tree
Keep doing that until there are no more beneficial splits...

Recursive Partitioning
[Figure: recursive partitioning of the input space]

Benefits of Trees
Interpretability
• Tree-structured presentation
Mixed measurement scales
• Nominal, ordinal, interval
• Regression trees
Robustness
Missing values
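
The "Building a Decision Tree" step, finding the best split and repeating until no split is beneficial, can be sketched as below. This is an illustrative sketch under stated assumptions rather than any product's algorithm: it handles a single interval input and a binary target, and it scores splits by the reduction in Gini impurity (the criterion associated with CART) in place of the hypothesis tests mentioned on the slide. The ages and responses are made-up data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, gain) for the best split 'value <= threshold'.

    Returns (None, 0.0) when no split reduces impurity.
    """
    n = len(labels)
    parent = gini(labels)
    best_threshold, best_gain = None, 0.0
    # The largest value is skipped: it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


def grow(values, labels, min_gain=1e-9):
    """Recursively partition until no split is beneficial."""
    threshold, gain = best_split(values, labels)
    if threshold is None or gain < min_gain:
        return {"leaf": True, "prob": sum(labels) / len(labels), "n": len(labels)}
    left = [(x, y) for x, y in zip(values, labels) if x <= threshold]
    right = [(x, y) for x, y in zip(values, labels) if x > threshold]
    return {
        "leaf": False,
        "threshold": threshold,
        "left": grow([x for x, _ in left], [y for _, y in left], min_gain),
        "right": grow([x for x, _ in right], [y for _, y in right], min_gain),
    }


# Hypothetical data: customer age and whether they responded (1) or not (0).
ages = [22, 25, 31, 38, 45, 52, 60, 67]
responded = [0, 0, 0, 1, 1, 1, 1, 0]
tree = grow(ages, responded)
print(tree)
```

On this toy data the tree splits first at age 31 and then at age 60, isolating the responders in the middle band. A real tree would also search across many inputs and stop earlier, for example by pruning against validation data.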
…Benefits
Automatically
• Detects interactions (AID)
• Accommodates nonlinearity
• Selects input variables
[Figure: multivariate step function; probability (Prob) plotted against two inputs]

Drawbacks of Trees
• Roughness
• Linear, main effects
• Instability

Q: Why do they call it a ‘neural’ network?
[Figure: a neuron compared with a hidden unit]

Feed Forward Neural Network
[Figure: input layer, hidden layers, output layer]

How does it work?
C = combination(Weights * Inputs)
A = Activation(C)
f(W, I) = A[C] + b -> output
…
y ~ f(W, (s, t))
s = f(S, (p, q, r))
t = f(T, (p, q, r))
p = f(P, X)
q = f(Q, X)
r = f(R, X)
…
y ~ f(W, (f(S, (f(P,X), f(Q,X), f(R,X))), f(T, (f(P,X), f(Q,X), f(R,X)))))
Err = E(Y, y) ~ (Y - y)^2
[Figure: network diagram with labels a through j, hidden units p, q, r and s, t, and output y]

Activation Function
[Figure: activation function; label: Input Layer]

Training
Iterative optimization algorithm
[Figure: error function plotted over Parameter 1 and Parameter 2]

Training history for our example
Error measure goes down with every iteration.
Weights evolve at every iteration.

Neural Pros and Cons
Pros
• Very flexible functions
• Implicit transformation and interactions
• Good algorithms for controlling complexity
Cons
• No inference
• Complex function
• Many possible networks: large search space
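
The nested function on the "How does it work?" slide can be traced with a small forward pass. This is a minimal sketch assuming tanh as the activation and made-up weights; it follows the slide's form f(W, I) = A[C] + b literally, with the bias added after the activation, where many implementations add the bias inside the combination instead. The dictionary keys and the input values are hypothetical.

```python
import math


def unit(weights, bias, inputs):
    """One unit in the slide's form: f(W, I) = A[C] + b."""
    c = sum(w * x for w, x in zip(weights, inputs))  # C = combination(Weights * Inputs)
    return math.tanh(c) + bias                       # A = Activation(C), then add b


def layer(units, inputs):
    """Apply every (weights, bias) unit of a layer to the same inputs."""
    return [unit(w, b, inputs) for w, b in units]


def forward(network, x):
    """y ~ f(W, (s, t)), with s and t built from p, q, r, which are built from X."""
    p_q_r = layer(network["P_Q_R"], x)   # p = f(P, X), q = f(Q, X), r = f(R, X)
    s_t = layer(network["S_T"], p_q_r)   # s = f(S, (p, q, r)), t = f(T, (p, q, r))
    w, b = network["W"]
    return unit(w, b, s_t)               # y ~ f(W, (s, t))


def squared_error(target, prediction):
    """Err = E(Y, y) ~ (Y - y)^2"""
    return (target - prediction) ** 2


# Hypothetical weights for a network with 2 inputs, hidden layers of 3 and 2 units.
network = {
    "P_Q_R": [([0.5, -0.2], 0.0), ([0.1, 0.3], 0.1), ([-0.4, 0.2], 0.0)],
    "S_T": [([0.3, 0.3, -0.1], 0.0), ([-0.2, 0.5, 0.4], 0.0)],
    "W": ([0.7, -0.6], 0.1),
}
y = forward(network, [1.0, 0.5])
err = squared_error(1.0, y)
print(y, err)
```

Training then amounts to adjusting every weight and bias in the network dictionary so that the summed error over the training cases shrinks, which is the iterative optimization the training slides refer to.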
Q: How do I know the model will work on new data?
Make sure that you don’t have a perfect model!
• Real data has multiple forms of the dependent variable effect
Limit exposure to data that changes over time
• Examine distributions of data at several time points
• Select stable data
• Use standardizations
• Use category=other
Backtest
• Use a holdout sample from a later time period
Monitor performance
• Compare actual and expected results
• Compare input term distributions
Don’t fit the noise.

Q: How do I model signal instead of noise?
Limit model complexity by using validation data.
• Decision Tree: pruning
• Neural Network: early stopping

Complexity -> Overfitting
[Figure: fitted model shown on the training set and on the test set]

Better Fitting … with a simpler model
[Figure: fitted model shown on the training set and on the test set]

How do I select the best model?
• ROC for overall model performance: Decision Tree
• Lift for targeted model performance: Neural Network

Computers keep getting so much faster, so why does my neural network take so long to run?

The enemy: growth in data warehouses
“In the aggregate, the 2001 survey pool reported 632 TB of storage. Just two years later, those surveyed were using almost 2 petabytes (2,000 TB) of storage. Based on the number of survey respondents, the average large database, whether used for decision support or transaction processing, increased its storage requirements three and one-half times in just two years.” (DM-REVIEW)

Single disk size growth
[Figure: single disk size growth]

Clock speed vs. disk sizes
[Figure: clock speed compared with disk sizes]

Even worse, it’s all about complexity of data
• 1M rows x 100 columns x 8 bytes = 800 MB
• 1000 rows x 1000 columns x 8 bytes = 8 MB
Which data is more complex?

Q: So how does this make me money?
Models get deployed to operational systems
• New data is acquired
• Each case is scored with the model function
• Action taken on each case:
  − Send promotion or don’t send promotion
  − Select item for cross-sell offer
  − Grant credit or don’t grant credit
  − Alert engineers that a manufacturing defect has been found
Model-driven decisions are nearly always better than intuition …iff… the data miner has accounted for enough sources of variation.

Offline Applications
• ETL for model development and scoring
• Scores generated on nightly basis
• ID and score data pre-loaded into data store
• Score tables pushed to external applications
[Figure: architecture diagram with a scheduled scoring ETL process, ETL engine, model development (data mining), scoring engine, data store and scores, BI application, campaign planning (operations), and campaign execution (information technology)]
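
The deployment loop described above (acquire a case, score it with the model function, take an action) can be sketched for the offline, batch pattern. This is only an illustration: the intercept, coefficients, input names, and 0.5 threshold are made up, and a logistic regression stands in for whatever model would actually be deployed.

```python
import math

# Hypothetical fitted logistic regression; all numbers are invented.
INTERCEPT = -3.0
COEFFICIENTS = {"n_orders": 0.4, "total_spend": 0.01}


def score(case):
    """Model function: probability of response from a linear combination z."""
    z = INTERCEPT + sum(coef * case[name] for name, coef in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def decide(probability, threshold=0.5):
    """Translate a score into an action (threshold is an assumption)."""
    if probability >= threshold:
        return "send promotion"
    return "don't send promotion"


def nightly_batch(cases):
    """Offline pattern: score every case and keep ID, score, and action."""
    table = []
    for case in cases:
        probability = score(case)
        table.append(
            {
                "Customer_Id": case["Customer_Id"],
                "score": probability,
                "action": decide(probability),
            }
        )
    return table


# Hypothetical customer-level inputs.
cases = [
    {"Customer_Id": 10, "n_orders": 2, "total_spend": 45.0},
    {"Customer_Id": 20, "n_orders": 12, "total_spend": 300.0},
]
score_table = nightly_batch(cases)
for row in score_table:
    print(row)
```

In the online and on-demand patterns the same score and decide functions would be called per request instead of in a nightly loop.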
Online Applications
Example: customer call center
• Scores generated on nightly basis
• ID and score data pre-loaded into data store
• Individual score requests contain one or more IDs
• Decision server translates score to action
[Figure: architecture diagram with a scheduled scoring ETL process, ETL engine, model development, scoring engine, data store and scores, BI application, decision server, and front office application]

On-Demand Applications
Examples: fraud detection, money laundering, medical diagnostics
• Model input data pre-loaded into data store
• New data provided by application
• Score engine pulls data by ID from data store and joins it with new data
• Scores generated immediately
• Decision server translates score to action
[Figure: architecture diagram with a scheduled scoring ETL process, ETL engine, model development, scoring engine, decision server, front office application, and automation application]

Q: So what are the cool applications right now?
GOOGLE, YAHOO, ASK, etc…
• Huge model training task: index and summarize the web
• Techniques: text data processing; page rank
• Real-time scoring task: process your query
NETFLIX
• $1M challenge: beat their statisticians
• Huge sparse matrix: fill in the blanks
• Techniques
  − SVD by numerical approximation, aka Hebbian-learning neural net
  − Ensembles
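
One reading of "SVD by numerical approximation" for a sparse ratings matrix is to fit low-rank user and item factors by stochastic gradient descent on the observed cells only, then use the factors to fill in the blanks. The sketch below illustrates that idea under stated assumptions; it is not confirmed to be the method the slide has in mind. The ratings, the rank k, the learning rate, and the epoch count are all made up, and no regularization is applied.

```python
import random


def predict(user_f, item_f, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(p * q for p, q in zip(user_f[u], item_f[i]))


def factorize(ratings, n_users, n_items, k=2, lr=0.01, epochs=500, seed=0):
    """Fit rank-k factors to the observed (user, item, rating) triples."""
    rng = random.Random(seed)
    # Small positive starting values (an assumption, not a requirement).
    user_f = [[rng.uniform(0.05, 0.15) for _ in range(k)] for _ in range(n_users)]
    item_f = [[rng.uniform(0.05, 0.15) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(user_f, item_f, u, i)
            for f in range(k):
                pu, qi = user_f[u][f], item_f[i][f]
                # Gradient step on the squared error of this one cell.
                user_f[u][f] += lr * err * qi
                item_f[i][f] += lr * err * pu
    return user_f, item_f


def rmse(ratings, user_f, item_f):
    """Root mean squared error over the observed cells."""
    total = sum((r - predict(user_f, item_f, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Hypothetical sparse 4 x 4 ratings matrix: 12 of 16 cells are observed.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 1), (3, 2, 4), (3, 3, 5),
]
start = factorize(ratings, 4, 4, epochs=0)   # untrained factors, same seed
fitted = factorize(ratings, 4, 4)
print("RMSE before:", rmse(ratings, *start))
print("RMSE after: ", rmse(ratings, *fitted))
print("Filled-in rating for user 0, item 3:", predict(*fitted, 0, 3))
```

The last line shows the point of the exercise: the cell for user 0 and item 3 was never observed, yet the fitted factors supply an estimate for it.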