Data Mining and The Use of SAS to
Deploy Scoring Rules
South Central SAS Users Group
Conference
Neil Fleming, Ph.D., ASQ CQE
November 7-9, 2004
2W Systems Co., Inc.
[email protected]
972 733-0588
www.2WSystems.com
Types of Data Mining
• Supervised Classification (target):
– Logistic regression (discrete outcome)
– Multiple regression (continuous outcome)
– Decision trees (discrete outcome)
– Regression trees (continuous outcome)
– Neural Nets (discrete and continuous outcomes)
• Unsupervised Classification (no target)
– Cluster analysis (K-Means, hierarchical, etc.)
– Self-organizing maps (SOMs)
The Goal: Prediction Versus Explanation
• What type of action will be taken?
• Regression: Explanation & Prediction
• Decision trees: Explanation & Prediction
• Neural Nets: Prediction
Decision Trees
• Finds variables at different levels to best:
– Maximize heterogeneity between groups
– Maximize homogeneity within groups
• Non-linear (interaction)
• Merges categories that are the same (no statistically significant difference)
• Discretizes continuous variables (preserving ordinality)
• Uses missing data
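
The category-merging idea in the list above can be sketched with a plain Pearson chi-square test on a 2x2 table. This is a simplified illustration in Python, not the procedure of any particular tree product: the counts, the category names A, B, and C, and the 0.05 threshold are all hypothetical, and no continuity correction or multiple-comparison adjustment is applied.

```python
import math

def chi_square_2x2(yes_a, no_a, yes_b, no_b):
    """Pearson chi-square statistic and p-value (1 df) comparing two categories."""
    n_a, n_b = yes_a + no_a, yes_b + no_b
    total = n_a + n_b
    yes_total, no_total = yes_a + yes_b, no_a + no_b
    statistic = 0.0
    for observed, row_n, col_n in (
        (yes_a, n_a, yes_total), (no_a, n_a, no_total),
        (yes_b, n_b, yes_total), (no_b, n_b, no_total),
    ):
        expected = row_n * col_n / total
        statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical (yes, no) target counts for three categories of one predictor.
counts = {"A": (30, 70), "B": (32, 68), "C": (60, 40)}
ALPHA = 0.05  # assumed significance threshold

_, p_ab = chi_square_2x2(*counts["A"], *counts["B"])
_, p_ac = chi_square_2x2(*counts["A"], *counts["C"])
merge_ab = p_ab > ALPHA  # no significant difference: candidates to merge
merge_ac = p_ac > ALPHA  # significant difference: keep separate
print(f"A vs B: p={p_ab:.3f}, merge={merge_ab}")
print(f"A vs C: p={p_ac:.6f}, merge={merge_ac}")
```

Under these made-up counts, A and B have nearly the same response rate and would be merged, while C would stay a separate group.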
Picking a Tool
• A subsidiary of Forrester Research, Inc. examined four data mining products:
1) SAS Enterprise Miner (EM)
2) SPSS Clementine
3) IBM DB2 Intelligent Miner (IM)
4) Oracle Data Mining (ODM)
http://www.sas.com/presscenter/analysts/giga_122203.pdf
Decision Tree Deliverables
• Segments data into terminal nodes
• Provides profiles for explanation & prediction
• Creates rules for scoring (prediction)
Decision Tree Algorithms
Goals & Methods
• CHAID (Chi-Square Automatic Interaction Detection)
• CART (Classification & Regression Trees)
• QUEST (Quick, Unbiased, Efficient Statistical Tree)
Picking the Best Tree
• Training, Testing, and Validation
• Cross-Validation with Hold-out samples
• Metrics: Gains Tables (ROI) & Classification Error
SAS: Data Mining Leader
• SAS was chosen as the leader in functionality for:
– architecture, algorithms, and data access
• SPSS was chosen as the leader in usability
– collaboration between statisticians, data preparers, and business analysts
• SAS was chosen as the leader in support, with a slight edge over SPSS
• IBM was noted for its in-database modeling & deployment of scoring
PRICE of Server Version
Initial and Renewal (lowest range)
• SAS EM: $119K/$39K, with Base SAS & SAS/STAT needed
• SPSS Clementine: $75K
• IBM DB2 IM: $18,750/$3,750 (probably as an add-on) through Data Warehouse Standard Edition, which includes many other products
• Oracle ODM: $20K/CPU, with different percentages for perpetual licenses
My company is not a Fortune 100…
Another Solution
Dedicated software for decision tree modeling
[Figure: decision tree diagram with terminal nodes 3, 4, 5, and 6]
Gain Summary by Node
Target variable: Has Amex card    Target category: Yes

Node    Node: n   Node: %   Gain: n   Gain (%)   Resp: %   Index (%)
5           108      33.4        61       39.1      56.5       116.9
3            86      26.6        39       25.0      45.3        93.9
6            50      15.5        22       14.1      44.0        91.1
4            79      24.5        34       21.8      43.0        89.1
Total       323     100         156                 48.3       100
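
Every percentage column in the node gain summary follows from two counts per node: the number of cases and the number of "Yes" cases. A minimal Python sketch of that arithmetic, using the counts shown on the slide (the variable and key names are illustrative, not taken from the software's output):

```python
# (node id, cases in node, cases with target category "Yes"), from the slide.
nodes = [(5, 108, 61), (3, 86, 39), (6, 50, 22), (4, 79, 34)]

total_n = sum(n for _, n, _ in nodes)      # all cases
total_yes = sum(y for _, _, y in nodes)    # all "Yes" cases
overall_rate = total_yes / total_n         # overall response rate

rows = []
for node, n, yes in nodes:
    rows.append({
        "node": node,
        "node_pct": round(100 * n / total_n, 1),          # share of all cases
        "gain_pct": round(100 * yes / total_yes, 1),      # share of all "Yes" cases
        "resp_pct": round(100 * yes / n, 1),              # response rate in node
        "index_pct": round(100 * (yes / n) / overall_rate, 1),  # lift vs. overall
    })

for row in rows:
    print(row)
```

An index above 100 marks a node that responds better than the file as a whole; only node 5 does so here.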
Gain Summary - In Deciles
Target variable: Has Amex card    Target category: Yes

Nodes   Percentile   Percentile: n   Gain: n   Gain (%)   Resp: %
5               10              32        18       11.6      56.5
5               20              65        37       23.5      56.5
5               30              97        55       35.1      56.5
5;3             40             129        71       45.2      54.7
3               50             162        85       54.8      52.8
3               60             194       100       64.1      51.5
6               70             226       114       73.1      50.5
6;4             80             258       128       82.1      49.6
4               90             291       142       91.2      48.9
4              100             323       156      100.0      48.3
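
The decile figures can be reproduced from the node counts if two things are assumed: the decile cut-off is 323 times the percentile, rounded to the nearest whole case, and a node that is only partly inside a decile contributes "Yes" cases at that node's own response rate. The deck does not document how the software computes the table, so the Python sketch below is a reconstruction under those assumptions rather than the vendor's method.

```python
import math

# (node id, cases, "Yes" cases), in descending order of response rate.
nodes = [(5, 108, 61), (3, 86, 39), (6, 50, 22), (4, 79, 34)]
total_n = sum(n for _, n, _ in nodes)
total_yes = sum(y for _, _, y in nodes)

def cumulative_yes(k):
    """Expected "Yes" count among the first k cases, taking nodes in order."""
    gained = 0.0
    for _, n, yes in nodes:
        take = min(k, n)
        gained += take * yes / n  # partial nodes contribute at the node's rate
        k -= take
        if k == 0:
            break
    return gained

deciles = []
for pct in range(10, 101, 10):
    k = math.floor(total_n * pct / 100 + 0.5)  # assumed rounding of the cut-off
    gained = cumulative_yes(k)
    deciles.append({
        "percentile": pct,
        "n": k,
        "gain_n": math.floor(gained + 0.5),
        "gain_pct": round(100 * gained / total_yes, 1),
        "resp_pct": round(100 * gained / k, 1),
    })

for d in deciles:
    print(d)
```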
SQL Rules
/* Node 3*/
UPDATE <TABLE>
SET nod_001 = 3,
pre_001 = 0, prb_001 = 0.546512
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND
((CLASS IS NULL) OR (CLASS <= 3));
/* Node 4*/
UPDATE <TABLE>
SET nod_001 = 4,
pre_001 = 0, prb_001 = 0.569620
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1))
AND (NOT(CLASS IS NULL)
AND (CLASS > 3));
/* Node 5*/
UPDATE <TABLE>
SET nod_001 = 5,
pre_001 = 1, prb_001 = 0.564815
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1))
AND ((AGE IS NULL) OR (AGE <= 2));
/* Node 6*/
UPDATE <TABLE>
SET nod_001 = 6,
pre_001 = 0, prb_001 = 0.560000
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1))
AND (NOT(AGE IS NULL) AND (AGE > 2));
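
To see how the four rules partition records, including records with missing values, they can be run against a small in-memory table. The Python sketch below uses the standard-library sqlite3 module as a stand-in for PROC SQL; the table name credit and the four sample records are hypothetical, while the WHERE clauses and probabilities are copied from the rules above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit (id INTEGER, PAY_WEEK REAL, CLASS REAL, AGE REAL, "
    "nod_001 INTEGER, pre_001 INTEGER, prb_001 REAL)"
)
# Hypothetical records; None is stored as NULL.
conn.executemany(
    "INSERT INTO credit (id, PAY_WEEK, CLASS, AGE) VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 3), (2, None, 4, 1), (3, 2, 1, None), (4, 2, 1, 3)],
)

# (node, predicted category, probability of the predicted category, WHERE clause)
rules = [
    (3, 0, 0.546512,
     "((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND ((CLASS IS NULL) OR (CLASS <= 3))"),
    (4, 0, 0.569620,
     "((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND (NOT(CLASS IS NULL) AND (CLASS > 3))"),
    (5, 1, 0.564815,
     "(NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND ((AGE IS NULL) OR (AGE <= 2))"),
    (6, 0, 0.560000,
     "(NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND (NOT(AGE IS NULL) AND (AGE > 2))"),
]
for node, pre, prb, where in rules:
    conn.execute(
        f"UPDATE credit SET nod_001 = ?, pre_001 = ?, prb_001 = ? WHERE {where}",
        (node, pre, prb),
    )

assigned = dict(conn.execute("SELECT id, nod_001 FROM credit ORDER BY id"))
print(assigned)
```

Each rule sends missing values down one specific branch (for example, a NULL PAY_WEEK is treated like PAY_WEEK <= 1), so every record lands in exactly one terminal node.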
[Figure: Gains % Chart Based on Deciles]
Misclassification Matrix
                      Actual Category
Predicted Category    No     Yes    Total
No                    120     95      215
Yes                    47     61      108
Total                 167    156      323
Risk Statistics
Risk Estimate          0.439628  = (95+47)/323
SE of Risk Estimate    0.0276172 = Sqrt[(0.439628*(1-0.439628))/323]
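
The risk estimate is the share of misclassified cases, and its standard error is the usual binomial standard error of a proportion. A short Python check of both figures from the matrix counts (the dictionary layout is just one convenient way to hold the matrix):

```python
import math

# Counts from the misclassification matrix: (predicted, actual) -> cases.
matrix = {
    ("No", "No"): 120, ("No", "Yes"): 95,
    ("Yes", "No"): 47, ("Yes", "Yes"): 61,
}

total = sum(matrix.values())
misclassified = sum(n for (pred, actual), n in matrix.items() if pred != actual)

risk = misclassified / total                      # (95 + 47) / 323
se_risk = math.sqrt(risk * (1 - risk) / total)    # binomial standard error

print(f"risk={risk:.6f}  se={se_risk:.7f}")
```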
SAS Log
libname in 'e:/NOTSUG';
NOTE: Libref IN was successfully assigned as follows:
      Engine:        V8
      Physical Name: e:\NOTSUG

%let dsn=Credit;

Data Assign;
SYMBOLGEN: Macro variable DSN resolves to Credit
  Set in.&dsn;   /* SAS data set coming in to be segmented */
  nod_001=.;
  pre_001=.;
  prb_001=.;
NOTE: There were 323 observations read from the data set IN.CREDIT.
NOTE: The data set WORK.ASSIGN has 323 observations and 8 variables.
NOTE: DATA statement used:
      real time           0.04 seconds
      cpu time            0.04 seconds

Proc SQL;

/* Node 3*/
UPDATE Assign
SET nod_001 = 3,
    pre_001 = 0,
    prb_001 = 0.546512
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND
      ((CLASS IS NULL) OR (CLASS <= 3));
NOTE: 86 rows were updated in WORK.ASSIGN.

/* Node 4*/
UPDATE Assign
SET nod_001 = 4,
    pre_001 = 0,
    prb_001 = 0.569620
WHERE ((PAY_WEEK IS NULL) OR (PAY_WEEK <= 1)) AND
      (NOT(CLASS IS NULL) AND (CLASS > 3));
NOTE: 79 rows were updated in WORK.ASSIGN.

/* Node 5*/
UPDATE Assign
SET nod_001 = 5,
    pre_001 = 1,
    prb_001 = 0.564815
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND
      ((AGE IS NULL) OR (AGE <= 2));
NOTE: 108 rows were updated in WORK.ASSIGN.

/* Node 6*/
UPDATE Assign
SET nod_001 = 6,
    pre_001 = 0,
    prb_001 = 0.560000
WHERE (NOT(PAY_WEEK IS NULL) AND (PAY_WEEK > 1)) AND
      (NOT(AGE IS NULL) AND (AGE > 2));
NOTE: 50 rows were updated in WORK.ASSIGN.

NOTE: PROCEDURE SQL used:
      real time           0.19 seconds
      cpu time            0.19 seconds

Data Assign;
  Set Assign;
  If prb_001=. then Prob=0;
  else If pre_001=0 then Prob=1-prb_001;
  Else if pre_001=1 then Prob=prb_001;
  /* This assigns the Probability for Target Outcome 1 */
Run;
NOTE: There were 323 observations read from the data set WORK.ASSIGN.
NOTE: The data set WORK.ASSIGN has 323 observations and 9 variables.
NOTE: DATA statement used:
      real time           0.05 seconds
      cpu time            0.05 seconds

proc summary data=assign;
  class nod_001;
  var Prob;
  output out=statb mean=mean_Prob sum=sum_Prob;
run;
NOTE: There were 323 observations read from the data set WORK.ASSIGN.
NOTE: The data set WORK.STATB has 5 observations and 5 variables.

Analysis of Credit Card Data                10:32 Monday, April 5, 2004
Segments with Active Cards
Dsn=Credit

Obs   nod_001   _TYPE_   _FREQ_   mean_Prob   sum_Prob
1           5        1      108     0.56482     61.000
2           3        1       86     0.45349     39.000
3           6        1       50     0.44000     22.000
4           4        1       79     0.43038     34.000
5           .        0      323     0.48297    156.000
Conclusion
• Use a dedicated software product that is affordable
• Combine it with SAS SQL for deploying scoring rules
• Create a powerful application for data mining
• Provide explanation that is ACTIONABLE with prediction