Data Mining and The Use of SAS to Deploy Scoring Rules
South Central SAS Users Group Conference, November 7-9, 2004
Neil Fleming, Ph.D., ASQ CQE
2W Systems Co., Inc.
[email protected]
972 733-0588
www.2WSystems.com

Types of Data Mining
• Supervised Classification (target):
  – Logistic regression (discrete outcome)
  – Multiple regression (continuous outcome)
  – Decision trees (discrete outcome)
  – Regression trees (continuous outcome)
  – Neural nets (discrete and continuous outcomes)
• Unsupervised Classification (no target):
  – Cluster analysis (K-means, hierarchical, etc.)
  – Self-organizing maps (SOMs)

The Goal: Prediction Versus Explanation
• What type of action will be taken?
• Regression: Explanation & Prediction
• Decision trees: Explanation & Prediction
• Neural nets: Prediction

Decision Trees
• Finds variables at different levels to best:
  – Maximize heterogeneity between groups
  – Maximize homogeneity within groups
• Non-linear (interaction)
• Merges categories that are the same (no statistically significant difference)
• Discretizes continuous variables (preserving ordinality)
• Uses missing data

Picking a Tool
• Subsidiary of Forrester Research, Inc.
examined four data mining products:
  1) SAS Enterprise Miner (EM)
  2) SPSS Clementine
  3) IBM DB2 Intelligent Miner (IM)
  4) Oracle Data Mining (ODM)
http://www.sas.com/presscenter/analysts/giga_122203.pdf

Decision Tree Deliverables
• Segments data into terminal nodes
• Provides profiles for explanation & prediction
• Creates rules for scoring (prediction)

Decision Tree Algorithms: Goals & Methods
• CHAID (Chi-Square Automatic Interaction Detection)
• CART (Classification & Regression Trees)
• QUEST

Picking the Best Tree
• Training, Testing, and Validation
• Cross-Validation with Hold-out samples
• Metrics: Gains Tables (ROI) & Classification Error

SAS: Data Mining Leader
• SAS was chosen as the leader in functionality for:
  – architecture, algorithms, and data access
• SPSS was chosen as the leader in usability:
  – collaboration between statisticians, data preparers, and business analysts
• SAS was chosen as the leader in support, with a slight edge over SPSS
• IBM was noted for its in-database modeling & deployment of scoring

Price of Server Version: Initial and Renewal (lowest range)
• SAS EM: $119K/$39K, with Base SAS & SAS/STAT needed
• SPSS Clementine: $75K
• IBM DB2 IM: $18,750/$3,750 (probably as an add-on) through Data Warehouse Standard Edition, which includes many other products
• Oracle ODM: $20K/CPU, with different percentages for perpetual licenses

My company is not a Fortune 100….
Another Solution
Dedicated software for decision tree modeling

[Figure: decision tree with terminal nodes 3, 4, 5, and 6]

Gain Summary by Node
Target variable: Has Amex card    Target category: Yes

Nodes   Node: n   Node: %   Gain: n   Gain (%)   Resp: %   Index (%)
5           108      33.4        61       39.1      56.5       116.9
3            86      26.6        39       25.0      45.3        93.9
6            50      15.5        22       14.1      44.0        91.1
4            79      24.5        34       21.8      43.0        89.1
Total       323     100         156                 48.3       100

Gain Summary - In Deciles
Target variable: Has Amex card    Target category: Yes

Nodes   Percentile   Percentile: n   Gain: n   Gain (%)   Resp: %
5               10              32        18       11.6      56.5
5               20              65        37       23.5      56.5
5               30              97        55       35.1      56.5
5;3             40             129        71       45.2      54.7
3               50             162        85       54.8      52.8
3               60             194       100       64.1      51.5
6               70             226       114       73.1      50.5
6;4             80             258       128       82.1      49.6
4               90             291       142       91.2      48.9
4              100             323       156      100.0      48.3

SQL Rules
/* Node 3*/
UPDATE <TABLE>
SET nod_001 = 3, pre_001 = 0, prb_001 = 0.546512
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND ((CLASS IS NULL) OR (CLASS <= 3));

/* Node 4*/
UPDATE <TABLE>
SET nod_001 = 4, pre_001 = 0, prb_001 = 0.569620
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND (NOT(CLASS IS NULL) AND (CLASS > 3));

SQL Rules, Continued
/* Node 5*/
UPDATE <TABLE>
SET nod_001 = 5, pre_001 = 1, prb_001 = 0.564815
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND ((AGE IS NULL) OR (AGE <= 2));

/* Node 6*/
UPDATE <TABLE>
SET nod_001 = 6, pre_001 = 0, prb_001 = 0.560000
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND (NOT(AGE IS NULL) AND (AGE > 2));

[Figure: Gains % chart based on deciles]

Misclassification Matrix
                      Actual Category
Predicted Category    No    Yes   Total
No                   120     95     215
Yes                   47     61     108
Total                167    156     323

Risk Statistics
Risk Estimate         0.439628  = (95+47)/323
SE of Risk Estimate   0.0276172 = Sqrt[(0.439628*(1-0.439628))/323]

SAS Log
libname in 'e:/NOTSUG';
NOTE: Libref IN was successfully assigned as follows:
      Engine:        V8
      Physical Name: e:\NOTSUG
%let dsn=Credit;

Data Assign;
SYMBOLGEN:  Macro variable DSN resolves to Credit
Set in.&dsn; /*SAS Data set
coming in to be segmented*/;
nod_001=.;
pre_001=.;
prb_001=.;
NOTE: There were 323 observations read from the data set IN.CREDIT.
NOTE: The data set WORK.ASSIGN has 323 observations and 8 variables.
NOTE: DATA statement used:
      real time 0.04 seconds
      cpu time  0.04 seconds

Proc SQL;

/* Node 3*/
UPDATE Assign
SET nod_001 = 3, pre_001 = 0, prb_001 = 0.546512
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND ((CLASS IS NULL) OR (CLASS <= 3));
NOTE: 86 rows were updated in WORK.ASSIGN.

/* Node 4*/
UPDATE Assign
SET nod_001 = 4, pre_001 = 0, prb_001 = 0.569620
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND (NOT(CLASS IS NULL) AND (CLASS > 3));
NOTE: 79 rows were updated in WORK.ASSIGN.

/* Node 5*/
UPDATE Assign
SET nod_001 = 5, pre_001 = 1, prb_001 = 0.564815
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND ((AGE IS NULL) OR (AGE <= 2));
NOTE: 108 rows were updated in WORK.ASSIGN.

/* Node 6*/
UPDATE Assign
SET nod_001 = 6, pre_001 = 0, prb_001 = 0.560000
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND (NOT(AGE IS NULL) AND (AGE > 2));
NOTE: 50 rows were updated in WORK.ASSIGN.

NOTE: PROCEDURE SQL used:
      real time 0.19 seconds
      cpu time  0.19 seconds

Data Assign;
Set Assign;
If prb_001=. then Prob=0;
else If pre_001=0 then Prob=1-prb_001;
Else if pre_001=1 then Prob=prb_001;
/* This assigns the Probability for Target Outcome 1 */;
Run;
NOTE: There were 323 observations read from the data set WORK.ASSIGN.
NOTE: The data set WORK.ASSIGN has 323 observations and 9 variables.
NOTE: DATA statement used:
      real time 0.05 seconds
      cpu time  0.05 seconds

proc summary data=assign;
class nod_001;
var Prob;
output out=statb mean=mean_Prob sum=sum_Prob;
run;
NOTE: There were 323 observations read from the data set WORK.ASSIGN.
NOTE: The data set WORK.STATB has 5 observations and 5 variables.
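
The node rules and the probability step shown in the log can also be traced by hand outside SAS. The following is a minimal Python sketch, not the author's code: the function names `assign_node` and `prob_of_yes` are hypothetical, and `None` is assumed to stand in for a SAS missing value. The thresholds and probabilities are copied from the four UPDATE statements.

```python
from typing import Optional, Tuple


def assign_node(pay_week: Optional[float],
                klass: Optional[float],
                age: Optional[float]) -> Tuple[int, int, float]:
    """Return (nod_001, pre_001, prb_001) for one record.

    Hypothetical helper mirroring the four UPDATE rules.
    None plays the role of a SAS missing value (an assumption).
    """
    if pay_week is None or pay_week <= 1:
        # Left branch of the tree: split on CLASS.
        if klass is None or klass <= 3:
            return 3, 0, 0.546512
        return 4, 0, 0.569620
    # Right branch of the tree: split on AGE.
    if age is None or age <= 2:
        return 5, 1, 0.564815
    return 6, 0, 0.560000


def prob_of_yes(pre: int, prb: float) -> float:
    """Probability of target outcome 1, as in the final DATA step.

    prb is the probability of the predicted category, so it is
    flipped when the predicted category is 0.
    """
    return prb if pre == 1 else 1.0 - prb


# Example record: PAY_WEEK = 2, CLASS = 1, AGE missing.
node, pre, prb = assign_node(pay_week=2, klass=1, age=None)
print(node, pre, prob_of_yes(pre, prb))
```

Because the four rules cover every combination of values, every record receives a node in this sketch, so the `prb_001=.` fallback in the DATA step has no counterpart here.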
Analysis of Credit Card Data                    10:32 Monday, April 5, 2004
Segments with Active Cards
Dsn=Credit

Obs   nod_001   _TYPE_   _FREQ_   mean_Prob   sum_Prob
1           5        1      108     0.56482     61.000
2           3        1       86     0.45349     39.000
3           6        1       50     0.44000     22.000
4           4        1       79     0.43038     34.000
5           .        0      323     0.48297    156.000

Conclusion
• Use Dedicated Software product that is affordable
• Combine with SAS SQL for Deploying Scoring Rules
• Create powerful application for Data Mining
• Provide explanation that is ACTIONABLE with prediction
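
The gain summary and risk statistics reported earlier can be re-derived from the node counts alone. The sketch below is a Python cross-check under stated assumptions, not part of the original SAS workflow: the dictionary layout and the names `gain_row`, `risk`, and `se_risk` are illustrative, and the counts are copied from the Gain Summary by Node table.

```python
import math

# node -> (records in node, records with target = Yes),
# copied from the Gain Summary by Node table.
nodes = {5: (108, 61), 3: (86, 39), 6: (50, 22), 4: (79, 34)}

total_n = sum(n for n, _ in nodes.values())        # 323 records
total_yes = sum(yes for _, yes in nodes.values())  # 156 "Yes" records
overall_rate = total_yes / total_n                 # overall response rate


def gain_row(n: int, yes: int) -> dict:
    """Recompute one row of the gain summary, in percent."""
    resp = yes / n
    return {
        "node_pct": 100 * n / total_n,           # share of all records
        "gain_pct": 100 * yes / total_yes,       # share of all "Yes" records
        "resp_pct": 100 * resp,                  # response rate in the node
        "index_pct": 100 * resp / overall_rate,  # lift versus the overall rate
    }


table = {node: gain_row(n, yes) for node, (n, yes) in nodes.items()}

# Only node 5 predicts "Yes" (pre_001 = 1), so the errors are the
# "No" records inside node 5 plus the "Yes" records in the other nodes.
false_yes = nodes[5][0] - nodes[5][1]   # 47
false_no = total_yes - nodes[5][1]      # 95
risk = (false_yes + false_no) / total_n
se_risk = math.sqrt(risk * (1 - risk) / total_n)

for node, row in table.items():
    print(node, {k: round(v, 1) for k, v in row.items()})
print(round(risk, 6), round(se_risk, 7))
```

The standard error here uses the unrounded risk estimate, which is what reproduces the reported 0.0276172.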