Download Comparison Of Enterprise Miner And SAS/STAT For Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

K-means clustering wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Paper DM100
Comparison of Enterprise Miner and SAS/Stat for Data Mining
Patricia B. Cerrito, University of Louisville, Louisville, KY
ABSTRACT
There are many definitions of data mining: the discovery of spurious relationships in the data; automated data
analysis; examination of large data sets; exploratory data analysis. Early methods of data mining included stepwise
regression, cluster analysis, and discriminant analysis. Data mining should be thought of as a process that includes
data cleaning, investigation of the data using models, and validation of potential results. In particular, practical
significance should be emphasized over statistical significance. Also, decision making and the cost of
misclassification is important. SAS/Stat contains methods that can be used to investigate data using a data mining
process. These methods can complement those developed specifically for Enterprise Miner, and can be used in
conjunction with Enterprise Miner. This paper will examine data mining in SAS/Stat, contrasting it with Enterprise
Miner.
INTRODUCTION
The term “data mining” has now come into public use with little understanding of what it is or does. In the past,
statisticians have thought little of data mining because data were examined without the final step of model validation.
Data mining differs from standard statistical practice in that the process
Assumes large data samples
Assumes large numbers of variables
Has Validation as a routine, automatic part of the process
Examines the cost of misclassification
Emphasizes decision-making over inference
And yet there remains overlap between statistics and data mining in both technique and practice. Since many of the
techniques, such as cluster analysis, overlap both statistics and data mining, it is the purpose of this paper to
examine some of the differences and similarities in the use of these overlapping techniques from a statistical
perspective versus a data mining perspective.
To demonstrate the different techniques, a dataset consisting of responses to a survey concerning student
expectations in mathematics courses was used throughout. A list of questions asked in the survey is given in the
appendix. The data were coded ordinally and variable names were coded to represent the questions. The data
analysis was performed to investigate the relationship between variables and responses in the dataset. A total of 192
responses were collected from this survey. The basic survey questions are
The areas of mathematics which interest me are (Check all that apply):
_____Pure mathematics
_____Applied mathematics
_____Abstract Algebra
_____Number Theory
_____Topology
_____Real analysis
_____Discrete mathematics
_____Differential equations
_____Actuarial Science
_____Probability
_____Statistics
The students were also asked whether they were graduate or undergraduate, and how many hours per week they
estimated they studied mathematics.
CLUSTER ANALYSIS
As a data mining process, clustering is considered to be unsupervised learning, meaning that there is no specific
outcome variable. Clustering involves the following process:
Group observations or group variables
Choice of type-hierarchical or non-hierarchical
Number of clusters
Cluster identity
Validation
1
Hierarchical clustering involves the grouping of observations based upon the distance between observations.
Distance can be defined using different criteria available in PROC CLUSTER. The clusters are built based on a
hierarchical tree structure. The closest observations are grouped together initially followed by the next closest match.
The other method of clustering is based upon the selection of random seed values as the centers of spheres
containing all observations closest to that center (PROC FASTCLUS). Then the centers are re-defined based upon
the outcomes. In Enterprise Miner, PROC FASTCLUS is used to perform clustering. This is the same procedure
available in SAS/Stat. SAS/Stat has the additional hierarchical clustering techniques available. The variables in the
dataset dealing with preferences for mathematics subject were first clustered in SAS/Stat using the hierarchical
procedure in PROC CLUSTER. Several distance criteria were used for comparison purposes. In addition, course
level (200,300,400, 500 and above) was included.
proc cluster data=sasuser.studentsurvey method=density r=2 ;
var courselevel pure applied abstractalgebra numbertheory topology realanalysis
discretemathematics differentialequations actuarialscience probability statistics;
id id;
run;
proc tree horizontal spaces=2;
id id;
In addition to density, Ward’s method, centroid, and two stage methods were used to determine clusters with very
different results (Figure 1).
Figure 1. Methods of Clustering
Density
Ward’s Method
id
11
4
128
132
2
12
15
124
8
3
25
1
5
6
16
134
7
9
13
135
214
127
10
129
139
126
136
14
137
133
138
130
131
32
29
27
36
22
17
21
43
18
28
20
24
19
26
37
30
31
35
41
44
39
25
40
23
42
34
33
38
95
71
83
99
81
88
90
85
84
72
87
74
79
82
80
98
78
93
92
73
75
77
86
89
97
100
101
91
94
102
76
96
148
45
63
52
50
48
49
47
56
112
147
46
68
118
59
178
182
60
170
58
154
103
163
61
160
168
180
55
166
177
106
174
53
123
153
157
184
158
54
64
65
105
120
143
145
146
151
190
243
185
144
149
181
113
117
189
62
67
104
109
111
150
167
176
187
122
152
159
162
179
107
121
164
171
108
183
66
69
114
116
188
119
141
186
161
115
165
51
57
156
70
173
110
169
172
175
155
id
135
1
2
8
133
4
11
3
6
10
12
214
5
16
1
734
13
126
9
124
14
139
15
129
132
127
136
128
125
137
131
130
138
45
148
155
175
48
70
57
51
110
161
63
169
172
156
173
109
187
121
147
157
174
46
179
154
159
47
49
60
108
178
182
183
118
50
165
52
56
115
106
177
66
114
188
111
163
176
119
141
53
54
65
120
143
145
146
151
185
190
243
64
105
181
186
123
158
153
184
112
144
166
149
55
113
117
160
168
180
189
107
58
164
170
171
59
69
116
61
62
67
104
150
167
68
103
122
152
162
17
34
33
23
36
25
43
20
21
24
27
18
42
19
30
39
22
35
28
31
44
26
37
40
41
29
32
38
71
83
76
95
96
80
81
97
90
99
82
91
101
94
72
73
77
85
74
78
87
92
93
75
84
86
89
100
79
102
88
98
100
90
80
70
60
50
40
30
20
10
0. 0 0
0. 1
0. 2
0. 3
Cl ust er Fusi on Densi t y
Centroid
0. 5
0. 6
0. 7
0. 8
0. 9
Two Stage
id
32
33
27
17
18
24
19
23
43
36
25
30
42
39
37
34
29
26
22
20
28
31
21
35
44
41
40
38
45
148
155
175
110
48
70
46
47
154
60
183
182
108
178
144
153
166
184
159
118
49
161
173
103
152
122
162
52
114
66
111
176
163
119
141
107
55
189
113
117
160
180
168
158
174
50
51
57
56
149
106
53
120
243
65
143
185
54
145
151
190
146
123
181
157
177
115
165
156
59
116
69
188
63
112
147
109
58
170
171
164
64
105
121
179
187
186
169
172
61
167
150
104
62
67
68
71
72
73
77
85
94
76
81
80
90
82
91
101
99
83
95
79
88
97
75
89
100
84
86
98
102
96
74
93
92
87
78
11
4
5
10
2
9
8
16
14
124
1
3
6
7
13
214
135
126
134
139
12
127
133
130
131
132
15
125
129
136
128
137
138
id
135
1
8
2
133
3
10
6
16
134
139
7
13
14
1
26
9
124
12
214
15
129
132
127
136
128
125
137
5
130
138
4
11
131
45
148
155
175
46
154
159
60
108
178
182
183
153
184
123
158
144
166
118
47
49
53
54
65
120
143
145
146
151
185
190
243
149
106
61
62
67
104
150
167
68
59
69
116
181
66
111
163
176
119
141
103
122
152
162
174
55
113
117
160
168
180
189
107
58
164
170
171
179
64
105
109
187
157
186
121
147
114
188
52
56
115
177
48
70
57
51
110
161
50
165
112
63
169
172
156
173
17
34
18
42
19
30
39
20
21
24
44
22
35
28
31
26
37
41
23
36
25
43
27
40
29
32
38
33
71
72
73
77
85
94
80
81
90
99
82
91
101
74
78
87
92
93
75
84
86
89
100
88
98
97
79
102
83
96
76
95
0. 0
0. 4
Sem
i - Par t i al R- Squar ed
0. 1
0. 2
0. 3
0. 4
0. 5
0. 6
0. 7
0. 8
0. 9
1. 0
1. 1
1. 2
1. 3
1. 4
200
Di st ance Bet ween Cl ust er Cent r oi ds
180
160
140
120
100
80
60
40
20
Cl ust er Fusi on Densi t y
The number of clusters defined range from 2 to many. With so many different possible clustering results, it is difficult
to make a judgment as to which is optimal. Instead, the question must become, “Is the result reasonable, and does
the result enhance decision-making?”
2
0
In contrast, k-means clustering, which is used in Enterprise Miner, uses the following code:
proc fastclus data=sasuser.studentsurveyimputed maxclusters=4 list out=sasuser.fastclusresults ;
var courselevel pure applied abstractalgebra numbertheory topology realanalysis
discretemathematics differentialequations actuarialscience probability statistics;
id id;
It is important to save the cluster values. The option out=sasuser.fastclusresults preserves the original data plus the
variables CLUSTER and DISTANCE. CLUSTER gives the cluster number to the observation; DISTANCE gives the
Euclidean distance from the observation to the center of the cluster.
Note that the investigator needs to specify the maximum number of clusters to be examined. It is possible to create a
list identifying which observation is classified into which cluster. A labeled list is not possible with the hierarchical
procedure since the final clustering is uncertain. Once the hierarchy is established and the number of clusters is
chosen, the final clusters can be labeled by the investigator. For the Fastclus results, Table 1 gives the means of
each variable by cluster.
Table 1. Cluster by Variable
Pure
Cluster
.68
1
.28
2
.36
3
1.00
4
Applied
.36
.92
.08
1.00
Algebra
.32
.05
.08
.76
Topology
.02
.08
.04
.56
Probability
.18
.43
.88
.41
Statistics
.06
.69
.92
.35
In this program, the maximum number of possible clusters is given by the user. To validate the results, it is
extremely important to give each cluster a meaningful identity. Given the proportion of values in each cluster, it is
possible to define labels for each cluster (Table 2).
Table 2. Cluster Labels
Cluster
1
2
3
4
Label
Pure Math
Applied Math
Statistics
Math Generally
Table 3. 5 Clusters in FASTCLUS
Pure
Applied Algebra
Cluster
1
2
3
4
5
Cluster
1
2
3
4
5
.36
.30
.05
1.00
1.00
.46
.80
.44
.50
.90
.07
.48
.02
.40
.70
Label
Pure Math
Applied Math
Statistics
Somewhat
Ambivalent
Math Generally
Five clusters are summarized in Table 3.
Topology
.03
.14
.05
.05
.60
Probabilit
y
.09
.22
.65
1.00
.60
Statistic
s
.04
.32
.98
.75
.40
In contrast, Enterprise Miner does not require a specification of a
maximum number of clusters, although it is an available option. It
also provides some graphics to examine the data that are not readily
available in SAS/Stat (Figure 3). The coding for Enterprise Miner is
given in Figure 2. Although defined using icons, the Fastclus
procedure is the same.
Figure 2. Enterprise Miner Clustering
3
Figure 3. Visualization of 5 clusters in Enterprise Miner
In addition, a
multidimensional
scaling (PROC MDS)
is performed in
Enterprise Miner to
examine the
separation between
clusters (Figure 4). The
scaling is based upon
distance within and
between clusters.
Figure 4. Graphical Representation of Distance Between Clusters
The centers of each cluster are well
separated but there is considerable
overlap. Once the clusters are defined,
specific variables can be examined (Table
4).
Note also in Figure 2 that Enterprise Miner will automatically divide the dataset, creating a holdout sample that can
be used to test the results. That capability does not automatically exist in SAS/Stat. It is useful to have such a
holdout sample so that the clustering results can be validated. To create such a sample, the following code is used:
PROC SQL;
CREATE VIEW WORK.SORT6649
AS SELECT * FROM sasuser.REVISEDSTUDENTSURVEY;
QUIT;
PROC SURVEYSELECT DATA=sasuser.surveysample
OUT=SASUSER.RAND8956(LABEL="Random sample of sasuser.revisedstudentsurvey")
METHOD=SRS
RATE=0.2
Creates the random sample
;
of 20% of the data.
RUN;
Adds a pointer to the random
sample, of numeric value 1.
data sasuser.surveysampleadd;
set sasuser.surveysample;
sample=1;
run;
4
proc sort data=Sasuser.Revisedstudentsurvey out=WORK._TABLE1_;
by ID;
run;
proc sort data=Sasuser.Surveysampleadd out=WORK._TABLE2_;
by ID;
run;
data sasuser.merged;
length id 8 Student_Type $ 16 CourseLevel 8 Course 8 Section 8 Pure 8
Applied 8 AbstractAlgebra 8 NumberTheory 8 Topology 8 RealAnalysis 8
DiscreteMathematics 8 DifferentialEquations 8 ActuarialScience 8
Probability 8 Statistics 8 sample 8;
merge WORK._TABLE1_ (in=TABLE1) WORK._TABLE2_ (in=TABLE2) ;
by ID;
run;
Merges the random sample into
the original dataset.
PROC SQL;
CREATE TABLE SASUSER.studentsurveytrain AS SELECT studentsurveymerged.id
FORMAT=BEST12.,
studentsurveymerged.Student_Type FORMAT=$F16.,
studentsurveymerged.Pure FORMAT=BEST12.,
Filters out the random sample
studentsurveymerged.Applied FORMAT=BEST12.,
leaving all values not in the
studentsurveymerged.AbstractAlgebra FORMAT=BEST12.,
sample.
studentsurveymerged.NumberTheory FORMAT=BEST12.,
studentsurveymerged.Topology FORMAT=BEST12.,
studentsurveymerged.RealAnalysis FORMAT=BEST12.,
studentsurveymerged.DiscreteMathematics FORMAT=BEST12.,
studentsurveymerged.DifferentialEquations FORMAT=BEST12.,
studentsurveymerged.ActuarialScience FORMAT=BEST12.,
studentsurveymerged.Probability FORMAT=BEST12.,
studentsurveymerged.Statistics FORMAT=BEST12.,
studentsurveymerged.sample FORMAT=BEST12.
FROM sasuser.studentsurveymerged AS studentsurveymerged
WHERE studentsurveymerged.sample NOT = 1;
QUIT;
In this code, SURVEYSAMPLEADD is used to test the data; STUDENTSURVEYTRAIN is used to define the model.
Once the clusters are defined using the training set, PROC DISCRIM can be used to examine the clusters that would
be defined on the holdout sample:
proc discrim data=sasuser.studentsurveytrain testdata=sasuser.surveysampleadd list testlist;
class student_type;
id id;
var Pure Applied AbstractAlgebra NumberTheory RealAnalysis DiscreteMathematics DifferentialEquations
ActuarialScience Probability Statistics;;
run;
In the above code, data=sasuser.studentsurvey train is used to define the discriminant function (which will fit the
training set almost perfectly). Test data (defined by testdata=sasuser.surveysampleadd) are used to define cluster
values on a set not used to define them. Once these are defined, cluster means can be computed on the test data
set and compared to those in the initial set. Once the clusters are defined, they can be used in other types of
analyses for comparison purposes. Table 4 gives a chi-square analysis to examine clusters by types of students.
Table 4. Clusters by Student Type
Undergraduat Graduate
Cluster
e
53 (41%) 14 (23%)
1
35 (27%)
15
2
(25%)
20 (15%) 23 (38%)
3
16 (12%)
4 (7%)
4
5 (5%)
4 (7%)
5
Undergraduates are more likely than graduate student to
prefer pure mathematics. This trend is even more
noticeable when clusters are examined by course level
(Table 5).
5
Table 5. Clusters by Course Level
200
300
400
Cluster
41
12
6 (21%)
1
(42%)
(36%)
24
11
11
2
(25%)
(33%)
(39%)
14
6 (18%)
8 (28%)
3
(14%)
13
3 (9%)
2 (7%)
4
(13%)
5 (5%)
1 (3%)
1 (4%)
5
In contrast to tables, Enterprise Miner allows this
information to be depicted graphically in the
CLUSTER NODE (Figure 5).
500
8 (25%)
4 (13%)
15
(47%)
2 (6%)
3 (9%)
Figure 5. Enterprise Miner Comparisons
CLASSIFICATION AND PREDICTIVE MODELING
Classification is considered to be supervised data mining since it is possible to compare a predicted value to an
actual value. In SAS/Stat, discriminant analysis and logistic are the primary means of classification. In Enterprise
Miner, neural networks, decision trees, and logistic regression are used. However, the two components of SAS
approach classification with different perspectives. In Enterprise Miner, the dataset is large enough to partition it so
that the classification model can be validated through the use of a holdout sample. Misclassification is one of the
means of determining the strength of the model. Another is to define a profit/loss function to determine the cost (or
benefit) of a correct or incorrect classification.
In SAS/Stat, datasets are often small so that partitioning is not possible. The strength of a logistic regression is
measured by the odds ratios, the receiver operating curve, and the p-value. Similarly, the strength of discriminant
analysis is measured by the proportion of correct classifications without the use of a holdout sample, although there
is an option for cross validation that is not available for logistic regression.
For logistic regression, however, the two data sets should remain merged. However, the coding given in the previous
section should be modified to allow for differences in procedures.
data sasuser.surveysampleadd;
set sasuser.surveysample;
sample=1;
student_type1=student_type;
student_type=’.’
run;
proc sort data=Sasuser.Revisedstudentsurvey out=WORK._TABLE1_;
Merges the random sample into
by ID;
the original dataset.
run;
proc sort data=Sasuser.Surveysampleadd out=WORK._TABLE2_;
by ID;
run;
data sasuser.merged;
length id 8 Student_Type $ 16 student_type1 $ 16 CourseLevel 8 Course 8 Section 8 Pure 8
Applied 8 AbstractAlgebra 8 NumberTheory 8 Topology 8 RealAnalysis 8
6
DiscreteMathematics 8 DifferentialEquations 8 ActuarialScience 8
Probability 8 Statistics 8 sample 8;
merge WORK._TABLE1_ (in=TABLE1) WORK._TABLE2_ (in=TABLE2) ;
by ID;
run;
Then, using the merged datasets, the outcome values are removed from the smaller partition. The logistic procedure
will predict the values that can then be compared to accuracy. In this example, the holdout sample had a 22%
misclassification rate compared to 13% for the initial sample.
Similarly, PROC DISCRIM uses the following code, where the partitioned dataset can be used as a holdout sample.
Note that it is very similar to the code used to validate the clustering in the previous section.
proc discrim data=sasuser.trainingset testdata=sasuser.partition list testlist;
class student_type;
id id;
var pure applied abstractalgebra numbertheory topology realanalysis discretemathematics
differentialequations actuarialscience probability statistics;
run;
The summary table on the test data is given in Table 6 below.
Graduate
Student Type
4 (67%)
Graduate
6 (30%)
Undergraduate
Undergraduate
2 (33%)
14 (70%)
Nonparametric discriminant analysis can also be used:
proc discrim data=sasuser.trainingset testdata=sasuser.partition list testlist
method=npar r=1 kernel=epa;
class student_type;
id id;
var pure applied abstractalgebra numbertheory topology realanalysis discretemathematics
differentialequations actuarialscience probability statistics;
run;
with a strong likelihood of correctly classifying undergraduates (Table 7).
Student Type
Graduate
Undergraduate
Graduate
3 (50%)
4 (20%)
Undergraduate
3 (50%)
16 (80%)
Instead of a holdout sample, cross validation is the standard method to correct for the inflation of results (Table 8):
proc discrim data=sasuser.revisedstudentsurvey list
method=npar k=5 crossvalidate;
class student_type;
id id;
var pure applied abstractalgebra numbertheory topology realanalysis discretemathematics
differentialequations actuarialscience probability statistics;
run;
Student Type
Graduate
Undergraduate
Graduate
28 (47%)
46 (35%)
Undergraduate
31 (52%)
83 (64%)
Note that while the cross validation technique can correctly classify undergraduates 64% of the time, the
classification of graduate students is poor. In comparison, a standard discriminant analysis without cross validation
shows a much more accurate result (Table 9).
Graduate
Undergraduate
Student Type
43 (72%)
17 (28%)
Graduate
43 (33%)
87 (67%)
Undergraduate
In contrast, Enterprise Miner has the capability to partition easily (Figure 6).
7
Figure 6. Code in Enterprise Miner for Classification
Neural networks act like “black boxes” in that the model is not presented in a nice, concise format that is provided by
regression. Its accuracy is examined in a way similar to the diagnostics of the regression curve. The simplest neural
network contains a single input (independent variable) and a single target (dependent variable) with a single output
unit. It increases in complexity with the addition of hidden layers, and additional input variables. With no hidden
layers, the results of a neural network analysis will resemble those of regression. Each input variable is
connected to each variable in the hidden layer, and each hidden variable is connected to each outcome
variable. The hidden units combine inputs, and apply a function to predict outputs. Hidden layers are often
nonlinear.
Decision trees provide a completely different approach to the problem of classification. The decision tree develops a
series of if…then rules. Each rule assigns an observation to one segment of the tree, at which point there is another
if…then rule applied. The initial segment, containing the entire dataset, is the root node for the decision tree. The
final nodes are called leaves. Intermediate nodes (a node plus all its successors) forms a branch of the tree. The
final leaf containing an observation is its predictive value. Unlike neural networks and regression, decision trees will
not work with interval data. It will work with nominal outcomes that have more than two possible results. Decision
trees will also work with ordinal outcome variables.
The Assessment Node is used to compare outcome methods (Table10).
Table 10. Comparison of Methods
Tool
Target Event
Neural Network
Undergraduate
Tree
Undergraduate
Regression
Undergraduate
Misclass
Rate
0.381578947
4
0.342105263
2
0.197368421
1
Valid:
Misclass
Rate
0.333333333
3
0.315789473
7
0.473684210
5
Test:
Misclass
Rate
0.245614035
1
0.280701754
4
0.438596491
2
Interestingly, logistic regression has the lowest misclassification rate on the initial training set but the highest
misclassification rate on the testing set. The results suggest that logistic regression tends to inflate results. Similarly,
the receiver operating curve is given in Figure 7 for three models. Note also that the data sample is partitioned into
three sets instead of two. Predictive modeling in Enterprise Miner is iterative. The initial result is compared to the
validation set and adjustments are made to the model. Once all adjustments are completed, the final testing is used
to validate the results.
8
Figure 7. ROC Curves
According to this measure, logistic regression gives the optimal outcome. However, the misclassification rate on the
testing sample indicates that the decision tree gives a better overall result.
DATA VISUALIZATION
With large datasets, data visualization becomes an important part of exploring in data mining. Version 4.3 of
Enterprise Miner included a node for SAS/Insight, and included all graphics within Insight. Version 5.1 removed the
SAS/Insight Node, adding a Stat/
Explore Node. However, neither yet provides the point-and-click graphics that are readily available in SAS Enterprise
Guide.
There is, however, one set of graphics available in SAS/Stat that are not available in any other component of SAS.
That procedure, PROC KDE, allows the investigator to overlay smoothed histograms to examine data. It is available
for interval data only. In addition to questions concerning preferences for mathematics, students were asked to
estimate the number of hours per week they expected to study for their mathematics courses. In Enterprise Miner, it
is possible to use histograms to examine the data. Figure 8 was provided in Enterprise Miner using the StatExplore
Node. Figure 9 shows the corresponding smoothed histogram as provided by SAS/Stat.
Figure 8. Bar Chart of Hours Studied by Student Type
9
Comparing one to the other is difficult because grouped histograms are side-by-side. PROC KDE allows an overlay
comparison:
Proc sort data=sasuser.studentsurvey;
By student_type;
Proc KDE data=sasuser.studentsurvey;
Univar hours/gridl=0 gridu=25 out=sasuser.kdesurvey;
By student_type;
Run;
Proc GPlot is used to graph the results as given in Figure 9. Although Proc KDE has added graphics in version 9, the
overlay still must be done with GPlot.
Figure 9. Smoothed, Over-layed Histogram by Student Type
With the overlay, it is clear that almost all undergraduate students spend 10 hours or less in study compared to
graduate students who are divided between 0-10 hours and 10-18 hours per week. In a similar comparison given in
Figure 10, students who enjoy pure mathematics spend about the same amount of time in study as do the students
who do not enjoy pure mathematics.
Figure 10. Smoothed Histogram for Those Students Who Favor Pure Mathematics
10
However, with the exception of a small group of students at the 15 hour mark who do not like pure mathematics,
students who enjoy applied mathematics are more likely to study more than 7 hours per week.
Figure 11. Smoothed Histogram for Those Students Who Favor Applied Mathematics
The comparison is not as clear in the histogram provided in Enterprise Miner.
Figure 12. Histogram in Enterprise Miner for Those Students Who Favor Applied Mathematics
CONCLUSION
There are many similar procedures in SAS/Stat and Enterprise Miner that can be used for similar needs. However,
generally the process is different since data are routinely partitioned to validate models, and to optimize the choice of
model. Although SAS/Stat contains many similar models, it does not have the partitioning process readily available,
although it can be coded into the process. There are some exploratory techniques built into SAS/Stat, such as
PROC KDE, that can complement Enterprise Miner. In addition, partitioning and imputation steps can be performed
in Enterprise Miner, and the data analyzed using SAS/Stat techniques.
11
CONTACT INFORMATION
Patricia Cerrito
Department of Mathematics
University of Louisville
Louisville, KY 40292
502-852-6826
502-852-7132 (fax)
[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
12