Case Studies in Data Mining
András Fülöp, e-Ventures Ltd. <[email protected]>
László Gonda, University of Debrecen <[email protected]>
Dr. Márton Ispány, University of Debrecen <[email protected]>
Dr. Péter Jeszenszky, University of Debrecen <[email protected]>
Dr. László Szathmáry, University of Debrecen <[email protected]>
Created by XMLmind XSL-FO Converter.
Case Studies in Data Mining
by András Fülöp, László Gonda, Dr. Márton Ispány, Dr. Péter Jeszenszky, and Dr. László Szathmáry
Publication date 2014
Copyright © 2014 Faculty of Informatics, University of Debrecen
This course material was produced within the framework of the project TÁMOP-4.1.2.A/1-11/1-2011-0103.
Table of Contents
Preface ................................................................................................................................................ ii
1. How to use this material ....................................................................................................... iv
I. Data mining tools ............................................................................................................................ 6
1. Commercial data mining software ........................................................................... 7
2. Free and shareware data mining software .............................................................. 11
II. RapidMiner .................................................................................................................................. 13
3. Data Sources ....................................................................................................................... 16
1. Importing data from a CSV file ................................................................................. 16
2. Importing data from an Excel file .............................................................................. 17
3. Creating an AML file for reading a data file ............................................................. 19
4. Importing data from an XML file .............................................................................. 21
5. Importing data from a database ................................................................................. 23
4. Pre-processing ..................................................................................................................... 25
1. Managing data with issues - Missing, inconsistent, and duplicate values ................. 25
2. Sampling and aggregation ......................................................................................... 27
3. Creating and filtering attributes ................................................................................. 31
4. Discretizing and weighting attributes ........................................................................ 35
5. Classification Methods 1 ..................................................................................................... 41
1. Classification using a decision tree ............................................................................ 41
2. Under- and overfitting of a classification with a decision tree .................................. 46
3. Evaluation of performance for classification by decision tree ................................... 51
4. Evaluation of performance for classification by decision tree 2 ................................ 55
5. Comparison of decision tree classifiers ..................................................................... 58
6. Classification Methods 2 ..................................................................................................... 65
1. Using a rule-based classifier (1) ................................................................................ 65
2. Using a rule-based classifier (2) ................................................................................ 66
3. Transforming a decision tree to an equivalent rule set .............................................. 68
7. Classification Methods 3 ..................................................................................................... 71
1. Linear regression ....................................................................................................... 71
2. Classification by linear regression ............................................................. 73
3. Evaluation of performance for classification by regression model ............................ 76
4. Evaluation of performance for classification by regression model 2 ......................... 79
8. Classification Methods 4 ..................................................................................................... 84
1. Using a perceptron for solving a linearly separable binary classification problem ... 84
2. Using a feed-forward neural network for solving a classification problem ............... 85
3. The influence of the number of hidden neurons on the performance of the feed-forward neural network ............................................................................................... 87
4. Using a linear SVM for solving a linearly separable binary classification problem .. 88
5. The influence of the parameter C on the performance of the linear SVM (1) ............ 90
6. The influence of the parameter C on the performance of the linear SVM (2) ............ 93
7. The influence of the parameter C on the performance of the linear SVM (3) ............ 95
8. The influence of the number of training examples on the performance of the linear SVM ................................................................................................................ 97
9. Solving the two spirals problem by a nonlinear SVM ............................................. 100
10. The influence of the kernel width parameter on the performance of the RBF kernel SVM ............................................................................................................... 101
11. Search for optimal parameter values of the RBF kernel SVM .............................. 103
12. Using an SVM for solving a multi-class classification problem ............................ 105
13. Using an SVM for solving a regression problem ................................................... 106
9. Classification Methods 5 ................................................................................................... 110
1. Introducing ensemble methods: the bagging algorithm ........................................... 110
2. The influence of the number of base classifiers on the performance of bagging ...... 111
3. The influence of the number of base classifiers on the performance of the AdaBoost method ........................................................................................................... 113
4. The influence of the number of base classifiers on the performance of the random forest ............................................................................................................. 115
10. Association rules ............................................................................................. 118
1. Extraction of association rules ................................................................. 118
2. Extraction of association rules from a non-transactional data set ............ 121
3. Evaluation of performance for association rules ..................................... 126
4. Performance of association rules - Simpson's paradox ............................ 131
11. Clustering 1 ..................................................................................................... 135
1. K-means method ...................................................................................... 135
2. K-medoids method .................................................................................. 137
3. The DBSCAN method ............................................................................. 140
4. Agglomerative methods ........................................................................... 142
5. Divisive methods ..................................................................................... 144
12. Clustering 2 ..................................................................................................... 148
1. Support vector clustering ......................................................................... 148
2. Choosing parameters in clustering ........................................................... 151
3. Cluster evaluation .................................................................................... 154
4. Centroid method ...................................................................................... 159
5. Text clustering ......................................................................................... 161
13. Anomaly detection .......................................................................................... 165
1. Searching for outliers ............................................................................... 165
2. Unsupervised search for outliers .............................................................. 167
3. Unsupervised statistics based anomaly detection ..................................... 171
III. SAS® Enterprise Miner™ ....................................................................................... 176
14. Data Sources ................................................................................................... 178
1. Reading a SAS dataset ............................................................................. 178
2. Importing data from a CSV file ............................................................... 180
3. Importing data from an Excel file ............................................................ 183
15. Preprocessing .................................................................................................. 185
1. Constructing metadata and automatic variable selection ......................... 185
2. Visualizing multidimensional data and dimension reduction by PCA ..... 188
3. Replacement and imputation .................................................................... 191
16. Classification Methods 1 ................................................................................. 196
1. Classification by decision tree .................................................................. 196
2. Comparison and evaluation of decision tree classifiers ........................... 200
17. Classification Methods 2 ................................................................................. 208
1. Rule induction for the classification of rare events .................................. 208
18. Classification Methods 3 ................................................................................. 212
1. Logistic regression ................................................................................... 212
2. Prediction of a discrete target by regression models ................................ 217
19. Classification Methods 4 ................................................................................. 221
1. Solution of a linearly separable binary classification task by ANN and SVM 221
2. Using artificial neural networks (ANN) ................................................... 225
3. Using support vector machines (SVM) .................................................... 232
20. Classification Methods 5 ................................................................................. 240
1. Ensemble methods: combination of classifiers ........................................ 240
2. Ensemble methods: bagging ..................................................................... 244
3. Ensemble methods: boosting .................................................................... 249
21. Association mining ......................................................................................... 256
1. Extracting association rules ..................................................................... 256
22. Clustering 1 ..................................................................................................... 260
1. K-means method ...................................................................................... 260
2. Agglomerative hierarchical methods ....................................................... 267
3. Comparison of clustering methods ........................................................... 271
23. Clustering 2 ..................................................................................................... 278
1. Clustering attributes before fitting an SVM ............................................. 278
2. Self-organizing maps (SOM) and vector quantization (VQ) ................... 284
24. Regression for a continuous target .................................................................. 289
1. Logistic regression ................................................................................... 289
2. Prediction of a discrete target by regression models ................................ 294
3. Supervised models for a continuous target .............................................. 297
25. Anomaly detection .......................................................................................... 304
1. Detecting outliers ..................................................................................................... 304
Bibliography ................................................................................................................................... 307
List of Figures
3.1. Metadata of the resulting ExampleSet. ...................................................................................... 16
3.2. A small excerpt of the resulting ExampleSet. ............................................................................ 17
3.3. Metadata of the resulting ExampleSet. ...................................................................................... 18
3.4. A small excerpt of the resulting ExampleSet. ............................................................................ 18
3.5. The resulting AML file. ............................................................................................................. 20
3.6. A small excerpt of The World Bank: Population (Total) data set used in the experiment. ........ 21
3.7. Metadata of the resulting ExampleSet. ...................................................................................... 22
3.8. A small excerpt of the resulting ExampleSet. ............................................................................ 22
3.9. Metadata of the resulting ExampleSet. ...................................................................................... 23
3.10. A small excerpt of the resulting ExampleSet. .......................................................................... 24
4.1. Graphic representation of the global and kitchen power consumption in time .......................... 25
4.2. Possible outliers based on the hypothesized habits of the members of the household ................ 26
4.3. Filtering of the possible values using a record filter .................................................................. 26
4.4. Selection of aggregate functions for attributes .......................................................................... 28
4.5. Preferences for dataset sampling ............................................................................................... 29
4.6. Preferences for dataset filtering ................................................................................................. 29
4.7. Resulting dataset after dataset sampling .................................................................................... 30
4.8. Resulting dataset after dataset filtering ...................................................................................... 30
4.9. Defining a new attribute based on an expression relying on existing attributes ........................ 32
4.10. Properties of the operator used for removing the attributes made redundant .......................... 33
4.11. Selection of the attributes to remain in the dataset with reduced size ...................................... 33
4.12. The appearance of the derived attribute in the altered dataset ................................................. 34
4.13. Selection of the appropriate discretization operator ................................................................ 36
4.14. Setting the properties of the discretization operator ................................................................ 37
4.15. Selection of the appropriate weighting operator ...................................................................... 37
4.16. Defining the weights of the individual attributes ..................................................................... 38
4.17. Comparison of the weighted and unweighted dataset instances .............................................. 39
5.1. Preferences for the building of the decision tree ........................................................................ 41
5.2. Preferences for splitting the dataset into training and test sets .................................................. 42
5.3. Setting the relative sizes of the data partitions ........................................................................... 42
5.4. Graphic representation of the decision tree created ................................................................... 43
5.5. The classification of the records based on the decision tree ...................................................... 44
5.6. Setting a threshold for the maximal depth of the decision tree .................................................. 46
5.7. Graphic representation of the decision tree created ................................................................... 47
5.8. Graphic representation of the classification of the records based on the decision tree ............... 47
5.9. Graphic representation of the decision tree created with the increased maximal depth ............. 48
5.10. Graphic representation of the classification of the records based on the decision tree with increased maximal depth .................................................................................................................. 48
5.11. Graphic representation of the decision tree created with the further increased maximal depth 49
5.12. Graphic representation of the classification of the records based on the decision tree with further increased maximal depth .................................................................................................................. 50
5.13. Preferences for the building of the decision tree ...................................................................... 52
5.14. Graphic representation of the decision tree created ................................................................. 52
5.15. Graphic representation of the classification of the records based on the decision tree ............. 53
5.16. Performance vector of the classification based on the decision tree ........................................ 53
5.17. The modification of preferences for the building of the decision tree. .................................... 53
5.18. Graphic representation of the decision tree created with the modified preferences ................. 54
5.19. Performance vector of the classification based on the decision tree created with the modified
preferences ........................................................................................................................................ 54
5.20. Settings for the sampling done in the validation operator ........................................................ 56
5.21. Subprocesses of the validation operator .................................................................................. 56
5.22. Graphic representation of the decision tree created ................................................................. 56
5.23. Performance vector of the classification based on the decision tree ........................................ 57
5.24. Settings of the cross-validation operator .................................................................................. 57
5.25. Overall performance vector of the classifications done in the cross-validation operator ........ 58
5.26. Overall performance vector of the classifications done in the cross-validation operator in the leave-one-out case ...................................................................................................................... 58
5.27. Preferences for the building of the decision tree based on the Gini-index criterion ................ 59
5.28. Preferences for the building of the decision tree based on the gain ratio criterion .................. 60
5.29. Graphic representation of the decision tree created based on the gain ratio criterion .............. 61
5.30. Performance vector of the classification based on the decision tree built using the gain ratio criterion
........................................................................................................................................................... 62
5.31. Graphic representation of the decision tree created based on the Gini-index criterion ............ 62
5.32. Performance vector of the classification based on the decision tree built using the Gini-index
criterion ............................................................................................................................................. 62
5.33. Settings of the operator for the comparison of ROC curves .................................................... 62
5.34. Subprocess of the operator for the comparison of ROC curves ............................................... 63
5.35. Comparison of the ROC curves of the two decision tree classifiers ........................................ 63
6.1. The rule set of the rule-based classifier trained on the data set. ................................................. 65
6.2. The classification accuracy of the rule-based classifier on the data set. .................................... 65
6.3. The rule set of the rule-based classifier. .................................................................................... 67
6.4. The classification accuracy of the rule-based classifier on the training set. .............................. 67
6.5. The classification accuracy of the rule-based classifier on the test set. ..................................... 67
6.6. The decision tree built on the data set. ....................................................................................... 68
6.7. The rule set equivalent of the decision tree. .............................................................................. 69
6.8. The classification accuracy of the rule-based classifier on the data set. .................................... 69
7.1. Properties of the linear regression operator ............................................................................... 71
7.2. The linear regression model yielded as a result ......................................................................... 72
7.3. The class prediction values calculated based on the linear regression model ............................ 72
7.4. The subprocess of the classification by regression operator ...................................................... 74
7.5. The linear regression model yielded as a result ......................................................................... 74
7.6. The class labels derived from the predictions calculated based on the regression model .......... 75
7.7. The subprocess of the classification by regression operator ...................................................... 77
7.8. The linear regression model yielded as a result ......................................................................... 77
7.9. The performance vector of the classification based on the regression model ............................ 78
7.10. The subprocess of the cross-validation by regression operator ............................................... 80
7.11. The subprocess of the classification by regression operator .................................................... 80
7.12. The linear regression model yielded as a result ....................................................................... 80
7.13. The customizable properties of the cross-validation operator ................................................. 81
7.14. The overall performance vector of the classifications done using the regression model defined in the
cross-validation operator .................................................................................................................. 82
7.15. The overall performance vector of the classifications done using the regression model defined in the
cross-validation operator for the case of using the leave-one-out method ........................................ 82
8.1. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected). ................................................... 84
8.2. The decision boundary of the perceptron. .................................................................................. 85
8.3. The classification accuracy of the perceptron on the data set. ................................................... 85
8.4. The classification accuracy of the neural network on the data set. ............................................ 86
8.5. The average classification error rate obtained from 10-fold cross-validation against the number of
hidden neurons. ................................................................................................................................. 87
8.6. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected). ................................................... 89
8.7. The kernel model of the linear SVM. ........................................................................................ 89
8.8. The classification accuracy of the linear SVM on the data set. ................................................. 90
8.9. A subset of the Wine data set used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected). Note that the classes are not linearly separable. ........................ 91
8.10. The classification error rate of the linear SVM against the value of the parameter C. ............. 91
8.11. The number of support vectors against the value of the parameter C. ..................................... 92
8.12. The average classification error rate of the linear SVM obtained from 10-fold cross-validation
against the value of the parameter C, where the horizontal axis is scaled logarithmically. ............... 94
8.13. The classification error rate of the linear SVM on the training and the test sets against the value of
the parameter C. ................................................................................................................................ 95
8.14. The number of support vectors against the value of the parameter C. ..................................... 96
8.15. The classification error rate of the linear SVM on the training and the test sets against the number of
training examples. ............................................................................................................................. 98
8.16. The number of support vectors against the number of training examples. .............................. 98
8.17. CPU execution time needed to train the SVM against the number of training examples. ....... 98
8.18. The Two Spirals data set ........................................................................................................ 100
8.19. The R code that produces the data set and is executed by the Execute Script (R) operator of the
R Extension. ................................................................................................................................... 100
8.20. The classification accuracy of the nonlinear SVM on the data set. ....................................... 101
8.21. The classification error rates of the SVM on the training and the test sets against the value of the RBF kernel width parameter. .................................................................................................... 102
8.22. The optimal parameter values for the RBF kernel SVM. ...................................................... 104
8.23. The classification accuracy of the RBF kernel SVM trained on the entire data set using the optimal
parameter values. ............................................................................................................................ 104
8.24. The kernel model of the linear SVM. .................................................................................... 105
8.25. The classification accuracy of the linear SVM on the data set. ............................................. 106
8.26. The optimal value of the gamma parameter for the RBF kernel SVM. ................................... 107
8.27. The average RMS error of the RBF kernel SVM obtained from 10-fold cross-validation against the
value of the parameter gamma, where the horizontal axis is scaled logarithmically. ....................... 107
8.28. The kernel model of the optimal RBF kernel SVM. .............................................................. 108
8.29. Predictions provided by the optimal RBF kernel SVM against the observed values of the dependent variable. ......................................................................................................................... 108
9.1. The average classification error rate of a single decision stump obtained from 10-fold cross-validation. ....................................................................................................................................... 110
9.2. The average classification error rate of the bagging algorithm obtained from 10-fold cross-validation,
where 10 decision stumps were used as base classifiers. ................................................................ 110
9.3. The average classification error rate obtained from 10-fold cross-validation against the number of
base classifiers. ............................................................................................................................... 112
9.4. The average classification error rate obtained from 10-fold cross-validation against the number of
base classifiers. ............................................................................................................................... 114
9.5. The average error rate of the random forest obtained from 10-fold cross-validation against the number
of base classifiers. ........................................................................................................................... 116
10.1. List of the frequent item sets generated ................................................................................. 118
10.2. List of the association rules generated ................................................................................... 119
10.3. Graphic representation of the association rules generated ..................................................... 120
10.4. Operator preferences for the necessary data conversion ........................................................ 121
10.5. Converted version of the dataset ............................................................................................ 122
10.6. List of the frequent item sets generated ................................................................................. 122
10.7. List of the association rules generated ................................................................................... 123
10.8. Operator preferences for the appropriate data conversion ..................................................... 123
10.9. The appropriate converted version of the dataset .................................................................. 124
10.10. Enhanced list of the frequent item sets generated ................................................................ 124
10.11. List of the association rules generated ................................................................................. 125
10.12. Graphic representation of the association rules generated ................................................... 125
10.13. Operator preferences for the necessary data conversion ...................................................... 127
10.14. Label role assignment for performance evaluation .............................................................. 128
10.15. Prediction role assignment for performance evaluation ....................................................... 128
10.16. Operator preferences for the data conversion necessary for evaluation ............................... 129
10.17. Graphic representation of the association rules generated regarding survival ..................... 129
10.18. List of the association rules generated regarding survival ................................................... 130
10.19. Performance vector for the application of association rules generated ................................ 130
10.20. List of the association rules generated regarding survival ................................................... 131
10.21. Performance vector for the application of association rules generated ................................ 132
10.22. Contingency table of the dataset .......................................................................................... 132
10.23. Record filter usage ............................................................................................................... 132
10.24. Removal of attributes that become redundant after filtering ................................................ 133
10.25. List of the association rules generated for the subset of adults ............................................ 133
10.26. Performance vector for the application of association rules generated regarding survival for the
subset of adults ............................................................................................................................... 133
10.27. List of the association rules generated for the subset of children ........................................ 133
10.28. Performance vector for the application of association rules generated regarding survival for the
subset of children ............................................................................................................................ 133
11.1. The 7 separate groups ............................................................................................................ 135
11.2. Clustering with default values ............................................................................... 135
11.3. Set the distance function. ....................................................................................... 136
11.4. Clustering with Mahalanobis distance function ..................................................... 136
11.5. The dataset ............................................................................................................. 138
11.6. Setting the parameters of the clustering ................................................................. 138
11.7. The clusters produced by the analysis ................................................................... 139
11.8. The groups with varying density ........................................................................... 140
11.9. The results of the method with default parameters ................................................ 141
11.10. The 15 groups ...................................................................................................... 142
11.11. The resulting dendrogram .................................................................................... 143
11.12. The clustering generated from the dendrogram ................................................... 143
11.13. The 600 two-dimensional vectors ........................................................................ 145
11.14. The subprocess .................................................................................................... 146
11.15. The report generated by the clustering ................................................................. 146
11.16. The output of the analysis .................................................................................... 147
12.1. The two groups ...................................................................................................... 148
12.2. Support vector clustering with polynomial kernel and p=0.21 setup ..................... 149
12.3. Unsuccessful clustering ......................................................................................... 149
12.4. Clustering with RBF kernel ................................................................................... 150
12.5. More promising results .......................................................................................... 150
12.6. The two groups containing 240 vectors ................................................................. 151
12.7. The subprocess of the optimization node ............................................................... 152
12.8. The parameters of the optimization ....................................................................... 152
12.9. The report generated by the process ...................................................................... 153
12.10. Clustering generated with the optimal parameters ............................................... 153
12.11. The 788 vectors ................................................................................................... 155
12.12. The evaluating subprocess ................................................................................... 155
12.13. Setting up the parameters ..................................................................................... 156
12.14. Parameters to log ................................................................................................. 156
12.15. Cluster density against the number of clusters k .................................................. 157
12.16. Item distribution against the number of clusters k ............................................... 157
12.17. The vectors forming 31 clusters ........................................................................... 159
12.18. The extracted centroids ........................................................................................ 160
12.19. The output of the k nearest neighbour method, using the centroids as prototypes .............. 160
12.20. The preprocessing subprocess ............................................................................. 162
12.21. The clustering setup ............................................................................................. 162
12.22. The confusion matrix of the results ..................................................................... 163
13.1. Graphic representation of the possible outliers ...................................................... 165
13.2. The number of outliers detected as the distance limit grows ................................. 166
13.3. Nearest neighbour based operators in the Anomaly Detection package ................ 168
13.4. Settings of LOF. .................................................................................................... 168
13.5. Outlier scores assigned to the individual records based on k nearest neighbours .................. 168
13.6. Outlier scores assigned to the individual records based on LOF ........................... 169
13.7. Filtering the records based on their outlier scores ................................................. 169
13.8. The dataset filtered based on the k-NN score ........................................................ 170
13.9. The dataset filtered based on the LOF score .......................................................... 170
13.10. Global settings for Histogram-based Outlier Score ............................................. 171
13.11. Column-level settings for Histogram-based Outlier Score .................................. 172
13.12. Scores and attribute binning for fixed binwidth and arbitrary number of bins .................... 172
13.13. Graphic representation of outlier scores .............................................................. 173
13.14. Scores and attribute binning for dynamic binwidth and arbitrary number of bins ............. 174
13.15. Graphic representation of the enhanced outlier scores ........................................ 175
14.1. The metadata of the dataset ................................................................................... 178
14.2. Setting the Sample operator ................................................................................... 178
14.3. The metadata of the resulting dataset and a part of the dataset .............................. 179
14.4. The list of files in the File Import operator ......................................................... 180
14.5. The parameters of the File Import operator ....................................................... 181
14.6. A small portion of the dataset ................................................................................ 181
14.7. The metadata of the resulting dataset ..................................................................... 182
14.8. A small portion of the resulting dataset ................................................................. 183
15.1. Metadata produced by the DMDB operator .............................................................................. 185
15.2. The settings of the Variable Selection operator .................................................... 186
15.3. List of variables after the selection ........................................................................................ 186
15.4. Sequential R-square plot ........................................................................................................ 187
15.5. The binary target variables as a function of the two most important input attributes after the variable selection ......................................................................................................................... 187
15.6. Displaying the dataset by parallel axis ................................................................................... 189
15.7. Cumulative explained variance plot of the PCA ..................................................... 190
15.8. Scatterplot of the Iris dataset using the first two principal components ................. 191
15.9. The replacement wizard ......................................................................................................... 192
15.10. The output of imputation ..................................................................................................... 193
15.11. The relationship of an input and the target variable before imputation ............................... 193
15.12. The relationship of an input and the target variable after imputation .................................. 194
16.1. The settings of dataset partitioning ........................................................................................ 196
16.2. The decision tree .................................................................................................................... 197
16.3. The response curve of the decision tree ................................................................................. 198
16.4. Fitting statistics of the decision tree ...................................................................................... 198
16.5. The classification chart of the decision tree ........................................................................... 198
16.6. The cumulative lift curve of the decision tree ........................................................................ 199
16.7. The importance of attributes .................................................................................................. 200
16.8. The settings of parameters in the partitioning step ................................................................ 201
16.9. The decision tree using the chi-square impurity measure ...................................................... 202
16.10. The decision tree using the entropy impurity measure ........................................................ 202
16.11. The decision tree using the Gini-index ................................................................................ 203
16.12. The cumulative response curve of decision trees ................................................................. 204
16.13. The classification plot .......................................................................................................... 205
16.14. Response curve of decision trees ......................................................................................... 205
16.15. The score distribution of decision trees ............................................................................... 206
16.16. The main statistics of decision trees .................................................................................... 206
17.1. The misclassification rate of rule induction ........................................................................... 208
17.2. The classification matrix of rule induction ............................................................................ 208
17.3. The classification chart of rule induction ............................................................................... 209
17.4. The ROC curves of rule inductions and decision tree ........................................................... 209
17.5. The output of the rule induction operator .............................................................................. 210
18.1. Classification matrix of the logistic regression ...................................................................... 212
18.2. Effects plot of the logistic regression ..................................................................................... 213
18.3. Classification matrix of the stepwise logistic regression ....................................................... 213
18.4. Effects plot of the stepwise logistic regression ...................................................................... 214
18.5. Fitting statistics for logistic regression models ...................................................................... 214
18.6. Classification charts of the logistic regression models .......................................................... 215
18.7. Cumulative lift curve of the logistic regression models .......................................... 215
18.8. ROC curves of the logistic regression models ....................................................................... 216
18.9. Classification matrix of the logistic regression ...................................................................... 217
18.10. The classification chart of the logistic regression ................................................................ 218
18.11. The effects plot of the logistic regression ............................................................................ 219
19.1. A linearly separable subset of the Wine dataset .................................................................... 221
19.2. Fitting statistics for perceptron .............................................................................................. 222
19.3. The classification matrix of the perceptron ........................................................................... 222
19.4. The cumulative lift curve of the perceptron ........................................................................... 222
19.5. Fitting statistics for SVM ....................................................................................................... 223
19.6. The classification matrix of SVM .......................................................................................... 223
19.7. The cumulative lift curve of SVM ......................................................................................... 223
19.8. List of the support vectors ..................................................................................................... 224
19.9. Fitting statistics of the multilayer perceptron ........................................................................ 225
19.10. The classification matrix of the multilayer perceptron ........................................................ 226
19.11. The cumulative lift curve of the multilayer perceptron ....................................................... 226
19.12. Weights of the multilayer perceptron .................................................................................. 227
19.13. Training curve of the multilayer perceptron ........................................................................ 227
19.14. Stepwise optimization statistics for DMNeural operator ...................................................... 228
19.15. Weights of the neurons of the network obtained with the AutoNeural operator ....... 228
x
Created by XMLmind XSL-FO Converter.
Case Studies in Data Mining
19.16. Fitting statistics of neural networks ..................................................................................... 229
19.17. Classification charts of neural networks .............................................................................. 230
19.18. Cumulative lift curves of neural networks ........................................................................... 230
19.19. ROC curves of neural networks ........................................................................................... 231
19.20. Fitting statistics for linear kernel SVM ................................................................................ 233
19.21. The classification matrix of linear kernel SVM ................................................................... 233
19.22. Support vectors (partly) of linear kernel SVM .................................................................... 233
19.23. The distribution of Lagrange multipliers for linear kernel SVM ......................................... 234
19.24. The parameters of polynomial kernel SVM ......................................................................... 234
19.25. Fitting statistics for polynomial kernel SVM ....................................................................... 235
19.26. Classification matrix of polynomial kernel SVM ................................................................ 235
19.27. Support vectors (partly) of polynomial kernel SVM ........................................................... 236
19.28. Fitting statistics for SVM's .................................................................................................. 236
19.29. The classification chart of SVM's ........................................................................................ 237
19.30. Cumulative lift curves of SVM's ......................................................................................... 237
19.31. Comparison of cumulative lift curves to the baseline and the optimal one ........................ 238
19.32. ROC curves of SVM's ......................................................................................................... 238
20.1. Fitting statistics of the ensemble classifier ............................................................................ 240
20.2. The classification matrix of the ensemble classifier .............................................................. 240
20.3. The cumulative lift curve of the ensemble classifier ............................................................. 241
20.4. Misclassification rates of the ensemble classifier and the SVM ............................................ 242
20.5. Classification matrices of the ensemble classifier and the SVM ........................................... 242
20.6. Cumulative lift curves of the ensemble classifier and the SVM ............................................ 242
20.7. Cumulative lift curves of the ensemble classifier, the SVM and the best theoretical model . 243
20.8. ROC curves of the ensemble classifier and the SVM ............................................................ 243
20.9. The classification matrix of the bagging classifier ................................................................ 245
20.10. The error curves of the bagging classifier ............................................................................ 246
20.11. Misclassification rates of the bagging classifier and the decision tree ................................ 247
20.12. Classification matrices of the bagging classifier and the decision tree ................................ 247
20.13. Response curves of the bagging classifier and the decision tree .......................................... 247
20.14. Response curves of the bagging classifier and the decision tree comparing the baseline and the
optimal classifiers ........................................................................................................................... 248
20.15. ROC curves of the bagging classifier and the decision tree ................................................. 248
20.16. The classification matrix of the boosting classifier ............................................................. 250
20.17. The error curve of the boosting classifier ............................................................................ 251
20.18. Misclassification rates of the boosting classifier and the SVM ........................................... 252
20.19. Classification matrices for the boosting classifier and the SVM ......................................... 252
20.20. Cumulative response curves of the boosting classifier and the SVM .................................. 252
20.21. Response curves of the boosting classifier and the SVM comparing the baseline and the optimal
classifiers ........................................................................................................................................ 253
20.22. ROC curves of the boosting classifier and the SVM ........................................................... 254
21.1. List of items ........................................................................................................................... 256
21.2. The association rules as a function of the support and the confidence .................... 257
21.3. Graph of lift values ................................................................................................................ 257
21.4. List of association rules ......................................................................................................... 258
22.1. The Aggregation dataset. ....................................................................................................... 260
22.2. The setting of the Cluster operator. ..................................................................................... 261
22.3. The result of K-means clustering when K=7 ......................................................................... 262
22.4. The setting of the MacQueen clustering ................................................................................ 263
22.5. The result of the MacQueen clustering .................................................................................. 263
22.6. The result of the clustering with 8 clusters ............................................................................ 264
22.7. The result display of the Cluster operator ......................................................................... 265
22.8. Scatterplot of the cluster means ............................................................................................. 265
22.9. The decision tree of the clustering ......................................................................................... 266
22.10. Scatterplot of the Maximum Variance (R15) dataset ........................................................... 267
22.11. The result of the average linkage hierarchical clustering ..................................................... 268
22.12. Evaluating of the clustering by 3D bar chart ....................................................................... 269
22.13. The result of Ward clustering .............................................................................................. 269
22.14. CCC plot of automatic clustering ........................................................................................ 270
22.15. Proximity graph of the automatic clustering ........................................................................ 270
22.16. The Maximum Variance (D31) dataset ................................................................................
22.17. The result of automatic clustering ........................................................................................
22.18. The CCC plot of automatic clustering .................................................................................
22.19. Az automatikus klaszterezés proximitási ábrája ..................................................................
22.20. The result of K-means clustering .........................................................................................
22.21. The proximity graph of K-means clustering ........................................................................
22.22. The profile of the segments (clusters) ..................................................................................
23.1. The dendrogram of attribute clustering ..................................................................................
23.2. The graph of clusters and attributes .......................................................................................
23.3. The cluster membership .........................................................................................................
23.4. The correlation plot of the attributes ......................................................................................
23.5. The correlation between clusters and an attribute ..................................................................
23.6. Classification charts of SVM models ....................................................................................
23.7. The response curve of SVM models ......................................................................................
23.8. Az SVM modellek kumulatív lift függvényei ........................................................................
23.9. The ROC curves of SVM models ..........................................................................................
23.10. The scatterplot of the Maximum Variance (R15) dataset .....................................................
23.11. The result of Kohonen's vector quantization .......................................................................
23.12. The pie chart of cluster size .................................................................................................
23.13. Statistics of clusters .............................................................................................................
23.14. Graphical representation of the SOM ..................................................................................
23.15. Scatterplot of the result of SOM ..........................................................................................
24.1. Classification matrix of the logistic regression ......................................................................
24.2. Effects plot of the logistic regression .....................................................................................
24.3. Classification matrix of the stepwise logistic regression .......................................................
24.4. Effects plot of the stepwise logistic regression ......................................................................
24.5. Fitting statistics for logistic regression models ......................................................................
24.6. Classification charts of the logistic regression models ..........................................................
24.7. Cumulativ lift curve of the logistic regression models ..........................................................
24.8. ROC curves of the logistic regression models .......................................................................
24.9. Classification matrix of the logistic regression ......................................................................
24.10. The classification chart of the logistic regression ................................................................
24.11. The effects plot of the logistic regression ............................................................................
24.12. Statistics of the fitted models on the test dataset .................................................................
24.13. Comparison of the fitted models by means of predictions ...................................................
24.14. The observed and predicted means plot ...............................................................................
24.15. The model scores .................................................................................................................
24.16. The decision tree for continuous target ................................................................................
24.17. The weights of the neural network after training ...................................................................
25.1. Statistics before and after filtering outliers ............................................................................
25.2. The predicted mean based on the two decision trees .............................................................
25.3. The tree map of the best model ..............................................................................................
25.4. Comparison of the two fitted decision trees ..........................................................................
Created by XMLmind XSL-FO Converter.
Colophon
TODO
Preface
Data mining is an interdisciplinary area of information technology and one of the most important parts of the so-called KDD (Knowledge Discovery in Databases) process. It consists of computationally intensive algorithms and methods capable of exploring patterns in relatively large datasets, patterns which represent well-interpretable information for further use. The applied algorithms originate from a number of fields, namely artificial intelligence, machine learning, statistics, and database systems. Moreover, data mining combines the results of these areas and continues to evolve in interaction with them today. In contrast to fields that focus merely on data analysis, such as statistics, data mining covers a number of additional elements, including data management and data preprocessing, as well as such post-processing issues as interestingness measures and the suitable visualization of the explored knowledge.
The use of the term data mining has become very fashionable, and many people mistakenly use it for all sorts of information processing involving large amounts of data (e.g., simple information extraction or data warehouse building); it also appears in the context of decision support systems. In fact, the most important feature of data mining is exploration or discovery, that is, producing new, previously unknown, and useful information for the user. The term data mining emerged in the '60s, when statisticians used it in a negative sense for analyzing data without any presupposition. In information technology it first appeared in the database community in the '90s, in the context of describing sophisticated information extraction. Although the term data mining is the one that has spread in business, several synonyms exist, for example, knowledge discovery. It is important to distinguish between data mining and the challenging Big Data problems of today. The solution of Big Data problems usually does not require the development of new theoretical models or methods; the problem is rather that the well-working algorithms of data mining software slow down hopelessly when one wants to process a really large volume of data as a whole instead of a reasonably sized sample. This obviously requires a special attitude and IT infrastructure that are outside the scope of the present curriculum.
Data mining activity, whether automatic or semi-automatic, is integrated into the IT infrastructure of the organization that applies it. This means that data mining tools can provide ever newer information for the users from ever-changing data sources, typically from data warehouses, with relatively limited human intervention. The reason is that the (business) environment is constantly changing, followed by changes in the data warehouse that collects data from that environment. Hence, previously fitted data mining models lose their validity, and new models may be needed for the altered data. Data mining software increasingly supports this approach by being able to operate in very heterogeneous environments. The collaboration between information services and the supporting analytics nowadays allows the development of online systems based on real-time analytics, see, e.g., recommendation systems.
Data mining is organized around the so-called data mining process, which is followed by the majority of data mining software. In general, this is a five-step process with the following steps:
• Sampling, data selection;
• Exploring, preprocessing;
• Modifying, transforming;
• Modelling;
• Interpreting, assessing.
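As a rough illustration, the five steps above can be strung together as a toy pipeline in plain Python. Every function name and the tiny synthetic dataset are invented for this sketch; they do not correspond to operators of any particular data mining software.

```python
import random

def sample(data, fraction=0.5, seed=42):
    """Step 1 - Sampling, data selection: keep a random subset of the records."""
    rng = random.Random(seed)
    return [row for row in data if rng.random() < fraction]

def explore(data):
    """Step 2 - Exploring: report simple descriptive statistics of an attribute."""
    values = [row["x"] for row in data]
    return {"n": len(values), "mean": sum(values) / len(values)}

def modify(data):
    """Step 3 - Modifying, transforming: scale the attribute into (0, 1]."""
    m = max(row["x"] for row in data)
    return [{**row, "x": row["x"] / m} for row in data]

def model(data):
    """Step 4 - Modelling: a trivial 'model' that always predicts the mean target."""
    mean_y = sum(row["y"] for row in data) / len(data)
    return lambda row: mean_y

def assess(fitted, data):
    """Step 5 - Interpreting, assessing: mean squared error of the model."""
    return sum((fitted(row) - row["y"]) ** 2 for row in data) / len(data)

# A synthetic dataset: one input attribute "x" and one target attribute "y".
raw = [{"x": float(i), "y": float(i % 3)} for i in range(1, 101)]
subset = sample(raw)
stats = explore(subset)
prepared = modify(subset)
fitted = model(prepared)
error = assess(fitted, prepared)
```

Real data mining software provides far richer operators for each step; the point of the sketch is only the shape of the process, where each step consumes the output of the previous one.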
Data mining software provides operators for these steps, with which certain operations can be carried out, for example, reading an external file, filtering outliers, or fitting a neural network model. When the data mining process is represented by a diagram in a graphical interface, nodes correspond to these operators. Examples of this process are the SEMMA methodology of SAS Institute Inc.®, a company known for its information delivery software, and the widely used Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which has evolved through the cooperation of many branches of industry, e.g., finance, automotive, and information technology.
During sampling, the target database is formed for the data mining process. The source of the data in most cases is an enterprise (organization) data warehouse or its subject-oriented part, a so-called data mart. Therefore,
the data obtained from here have generally gone through a pre-processing phase when they moved from the operational systems into the data warehouse, and thus they can be considered reliable. If this is not the case, then the data mining software used provides tools for data cleaning, which in that case can already be considered the second step of the process. Sampling can generally be done using an appropriate statistical method, for example, simple random or stratified sampling. Also in this step, the dataset is partitioned into training, validation, and test sets. The data mining model is fitted on the training dataset and its parameters are estimated. The validation dataset is used to stop the convergence of the training process or to compare different models. By this method, using a dataset independent of the training dataset, we obtain a reliable decision on where to stop the training. Finally, the generalization ability of the model can be measured on the test dataset, that is, how it is expected to behave on new records.
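The partitioning described above can be sketched in plain Python. The 60/20/20 proportions and the `partition` helper are illustrative assumptions, not something prescribed by the methodology.

```python
import random

def partition(records, seed=1, weights=(0.6, 0.2, 0.2)):
    """Split records into training, validation, and test sets by
    simple random sampling (shuffle, then cut at the given proportions)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(weights[0] * n)
    n_valid = int(weights[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

records = list(range(100))
train, valid, test = partition(records)
```

Because the three sets are disjoint and cover the whole dataset, a model fitted on `train` can be tuned on `valid` and finally judged on `test` without any record being seen twice.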
The exploring step means getting acquainted with the data, without any preconception if possible. The objective of the exploring step is to form hypotheses in connection with the applicable procedures. The main tools are descriptive statistics and graphical visualization. A data mining software package has a number of graphical tools that exceed those of standard statistical software. Another objective of exploring is to identify any existing errors (noise) and to find the places of missing data.
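A minimal sketch of the exploring step: descriptive statistics that also reveal missing values. The toy `ages` attribute and the convention of `None` marking a missing field are assumptions for illustration only.

```python
import math

def describe(values):
    """Descriptive statistics of one attribute; None marks a missing value."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    var = sum((v - mean) ** 2 for v in present) / n
    return {
        "count": n,
        "missing": len(values) - n,
        "mean": mean,
        "std": math.sqrt(var),
        "min": min(present),
        "max": max(present),
    }

ages = [23, 35, None, 41, 29, None, 52, 33]
stats = describe(ages)
```

The `missing` count locates the gaps to be imputed later, while `min`/`max` and the standard deviation give a first hint of possible outliers and noise.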
The purpose of modifying is the preparation of the data for fitting a data mining model. There may be several reasons for this. One of them is that many methods directly require the modification of the data; for example, in the case of neural networks the attributes have to be standardized before the training of the network. Another is that even when a method does not require the modification of the data, a better-fitting model may be obtained after a suitable modification. An example is the normalization of the attributes by suitable transformations before fitting a regression model, so that the input attributes will be close to the normal distribution. The modification can be carried out at multiple levels: at the level of entire attributes by transforming whole attributes, at the level of records, e.g., by standardizing some records, or at the level of fields by modifying individual data values. The modifying step also includes the elimination of noisy and missing data, the so-called imputation.
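The standardization mentioned above for neural networks can be illustrated with a short sketch in plain Python (no particular software assumed); the `incomes` attribute is a fabricated example.

```python
import math

def standardize(values):
    """Transform an attribute to zero mean and unit variance (z-scores),
    as required e.g. before training a neural network."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

incomes = [1200.0, 1500.0, 900.0, 2100.0, 1800.0]
z = standardize(incomes)
```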
Modeling is the most complex step of the data mining process and it also requires the most knowledge. In essence, after suitable preparation, this is where we solve the data mining task. The typical data mining tasks can be divided into two groups. The first group is known as supervised data mining or supervised learning. In this case, there is an attribute with a special role in the dataset, which is called the target. The target variable has to be indicated in the data mining software used. Our task then is to describe this target variable using the other variables as well as we can. The second group is known as unsupervised data mining or unsupervised learning. In this case, there are no special attributes in the analyzed dataset, in which we want to explore hidden patterns. Within data mining, six task types can be defined, of which classification and regression are supervised data mining, while segmentation, association, sequential analysis, and anomaly detection are unsupervised data mining.
• Classification: modelling known classes (groups) for generalization purposes in order to apply the built model to new records. Example: filtering emails by classifying them into spam and non-spam classes.
• Regression: building a model which approximates a continuous target by a function of the input attributes such that the error of this approximation is as small as possible. Example: estimating customer value from current demographic and historical data.
• Segmenting, clustering: finding groups in the data that are similar in some sense, without taking into account any known existing structure. A typical example is customer segmentation, when a bank or insurance company is looking for groups of clients behaving similarly.
• Association: searching for relationships between attributes. A typical example is basket analysis, when we look at which goods are bought together by the customers in stores and supermarkets.
• Anomaly detection: identifying records which may be interesting or may require further investigation due to an error. An example is searching for clients or users with extreme behaviour.
• Sequential analysis: searching for temporal and spatial relationships between attributes. Examples are the order in which services are taken up by customers, or the examination of gene sequences.
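As a toy illustration of the classification task from the list above, here is a 1-nearest-neighbour sketch that assigns a new record to one of two known classes. The spam-like feature vectors are fabricated solely for this example; real spam filters use far richer features and models.

```python
def nearest_neighbour(train, point):
    """Classify a point with the label of its closest training record
    (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda row: dist2(row[0], point))
    return label

# Hypothetical training records: (feature vector, class label).
train = [
    ((0.9, 0.8), "spam"),
    ((0.8, 0.9), "spam"),
    ((0.1, 0.2), "ham"),
    ((0.2, 0.1), "ham"),
]
prediction = nearest_neighbour(train, (0.85, 0.75))
```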
Assessing the results is the last step of the data mining process, the objective of which is to decide whether truly relevant and useful knowledge has been reached. Namely, it often happens that the improper use of data mining produces a model which has weak generalization ability and works
very poorly on new data. This is the so-called overfitting. In order to avoid overfitting, we should rely on the training, validation, and test datasets. At this step we can also compare our fitted models if there is more than one. In the comparison, various measures, e.g., misclassification rate and mean square error, as well as graphical tools, e.g., lift curve and ROC curve, can be used.
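The misclassification rate mentioned above is easy to illustrate; the actual labels and the two sets of predictions below are fabricated solely to show how two fitted models might be compared on a test set.

```python
def misclassification_rate(actual, predicted):
    """Fraction of test records whose predicted class differs from the actual one."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

actual  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
model_a = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # disagrees with actual in 2 places
model_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]  # disagrees with actual in 4 places
rate_a = misclassification_rate(actual, model_a)
rate_b = misclassification_rate(actual, model_b)
```

On this fabricated test set, model A would be preferred (lower error); in practice the same comparison is complemented by graphical tools such as lift and ROC curves.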
This electronic curriculum aims to provide an introduction to data mining applications by showing these applications in practice through the use of data mining software. Problems requiring data mining can be encountered in many fields of life. Some of these are listed below; the datasets used in the course material also come from these areas.
• Commercial data mining. One of the main driving forces behind the development and application of data mining. Its objective is to analyze the static, historical business data stored in data warehouses in order to explore hidden patterns and trends. Besides the standard ways of collecting data, companies have found several other ways to build more reliable data mining models; for example, this is also one of the main reasons behind the spread of loyalty cards. Among a number of specific application areas we emphasize customer relationship management (CRM): who are our customers and how should we deal with them; churn analysis: which customers are planning to leave us; and cross-selling: which products should be offered together. The algorithms of market basket analysis were born in solving a business problem.
• Scientific data mining. The other main driving force behind the development of data mining. Many data mining methods, for example, neural networks and the self-organizing map, were developed for solving a scientific problem and became methods of data mining only years later. The application areas range from astronomy (galaxy classification and processing various kinds of radiation detected in space), chemistry (forecasting the properties of artificial molecules), and the engineering sciences (materials science, traffic management) to biology (bioinformatics, drug discovery, and genetics). Data mining can help in areas where the problem of the data gap appears, i.e., far more data is generated than the scientists are able to process.
• Mining medical data. The development of health information technology makes it possible for doctors to share their diagnostic results with each other and thus avoid repeating the same examination several times. Moreover, by collecting diagnoses, as the results of examinations, in a common data warehouse, it becomes possible to develop new medical procedures by means of data mining techniques. Data mining is also likely to play an important role in personalized medicine.
• Spatial data mining. Analysis of spatial data with data mining methods, the extension of the traditional
geographic information systems (GIS) with data mining tools. Application areas: climate research, the spread
of epidemics, customer analysis of large multinational companies taking into account the spatial dimension.
An important area in the future will be the processing of data generated by sensor networks, e.g., pollution monitoring over an area.
• Multimedia data mining. Analysis of audio, image, and video files by data mining tools. Data mining can help to find similarities between songs in order to decide copyright issues more objectively. Another application is finding content that conflicts with copyright law or is otherwise illegal in file-sharing systems and at multimedia service providers.
• Web mining. Analysis of web data. Three types of web mining problems are distinguished: web structure mining, web content mining, and web usage mining. Web structure mining means the examination of the structure of the Web, i.e., of the web graph, where the set of vertices consists of the sites and the set of edges consists of the links between the sites. Web content mining means the retrieval of useful information from the contents of the web; the well-known web search engines (Google, AltaVista, etc.) also perform this task. Web usage mining deals with examining what users are searching for on the Internet, using the data gathered by web servers and application servers. These areas are strongly related to Big Data problems because it is often necessary to work over an Internet-scale infrastructure.
• Text mining. Mining of unstructured or semi-structured data. By unstructured data we mean continuous texts (sequences of strings), which may be connected by a theme (e.g., scientific), by a field (e.g., sport), or they can be customers' sentiments recorded at a customer service. Semi-structured data are typically produced by computers, or are files produced for computers, for example, in XML or JSON format. Some specific applications: data mining for security reasons, e.g., searching for terrorists; analytical CRM; sentiment analysis; and academic applications (plagiarism investigation).
1. How to use this material
The RapidMiner and SAS® Enterprise Miner™ workflows presented in this course material are contained in the
file resources/workflows.zip.
Important
Data files used in the experiments must be downloaded by the user from the locations specified in the text. After importing a workflow, file paths must be set to point to the local copies of the data files (absolute paths are required).
Part I. Data mining tools
Introduction
In this part, data mining tools and software are overviewed. There are three necessary conditions of successful data mining. First, we need appropriate datasets on which to perform data mining. In practice, this is often a task-oriented data mart generated from the enterprise data warehouse. In education, and so in this curriculum, the datasets are taken from widely used data repositories. All datasets are attached to this material. Another important condition of data mining is the data mining expert. We hope that this curriculum will be able to contribute to the education of such professionals. Finally, the key is the software with which data mining is performed. Software can be classified on the basis of several criteria, e.g., commercial or free, self-contained or integrated, general or specific, theme-oriented or not. The most up-to-date information on this topic can be found on the KDnuggets™ website. The reader can also get fresh information there on current job openings, courses, conferences, etc.
In the curriculum two software packages are discussed in detail: a leading one among the free data mining packages, RapidMiner 3.5, and one of the most widely used commercial data mining packages, SAS® Enterprise Miner™ Version 7.1. The list of data mining software below is based on the KDnuggets™ portal.
Chapter 1. Commercial data mining
softwares
• AdvancedMiner Professional , (formerly Gornik), provides a wide range of tools for data transformations,
Data Mining models, data analysis and reporting.
• Alteryx, offering Strategic Analytics platform, including a free Project Edition version.
• Angoss Knowledge Studio, a comprehensive suite of data mining and predictive modeling tools;
interoperability with SAS and other major statistical tools.
• BayesiaLab, a complete and powerful data mining tool based on Bayesian networks, including data
preparation, missing values imputation, data and variables clustering, unsupervised and supervised learning.
• BioComp i-Suite, constraint-based optimization, cause and effect analysis, non-linear predictive modeling,
data access and cleaning, and more.
• BLIASoft Knowledge Discovery, for building models from data based mainly on fuzzy logic.
• CMSR Data Miner, built for business data with database focus, incorporating rule-engine, neural network,
neural clustering (SOM), decision tree, hotspot drill-down, cross table deviation analysis, cross-sell analysis,
visualization/charts, and more.
• Data Applied, offers a comprehensive suite of web-based data mining techniques, an XML web API, and rich
data visualizations.
• Data Miner Software Kit, collection of data mining tools, offered in combination with a book: Predictive Data
Mining: A Practical Guide, Weiss and Indurkhya.
• DataDetective, the powerful yet easy to use data mining platform and the crime analysis software of choice
for the Dutch police.
• Dataiku, a software platform for data science, statistics, guided machine learning and visualization
capabilities, built on Open Source, Hadoop integration.
• DataLab, a complete and powerful data mining tool with a unique data exploration process, with a focus on
marketing and interoperability with SAS.
• DBMiner 2.0 (Enterprise), powerful and affordable tool to mine large databases; uses Microsoft SQL Server
7.0 Plato.
• Delta Miner, integrates new search techniques and "business intelligence" methodologies into an OLAP front-end that embraces the concept of Active Information Management.
• ESTARD Data Miner, simple to use, designed both for data mining experts and common users.
• Exeura Rialto™ , provides comprehensive support for the entire data mining and analytics life-cycle at an
affordable price in a single, easy-to-use tool.
• Fair Isaac Model Builder, software platform for developing and deploying analytic models, includes data
analysis, decision tree and predictive model construction, decision optimization, business rules management,
and open-platform deployment.
• FastStats Suite (Apteco), marketing analysis products, including data mining, customer profiling and
campaign management.
• GainSmarts, uses predictive modeling technology that can analyze past purchase, demographic, and lifestyle
data, to predict the likelihood of response and develop an understanding of consumer characteristics.
• Generation5 GenVoy, On-Demand Consumer Analytics.
• GenIQ Model, uses machine learning for regression task; automatically performs variable selection, and new
variable construction, and then specifies the model equation to "optimize the decile table".
• GhostMiner, complete data mining suite, including k-nearest neighbors, neural nets, decision tree,
neurofuzzy, SVM, PCA, clustering, and visualization.
• GMDH Shell, an advanced but easy to use tool for predictive modeling and data mining. Free trial version is
available.
• Golden Helix Optimus RP, uses Formal Inference-based Recursive Modeling (recursive partitioning based on
dynamic programming) to find complex relationships in data and to build highly accurate predictive and
segmentation models.
• IBM SPSS Modeler, (formerly Clementine), a visual and powerful data mining workbench.
• Insights, (formerly KnowledgeMiner) 64-bit parallel software for autonomously building reliable predictive
analytical models from high-dimensional noisy data using outstanding self-organizing knowledge mining
technologies. Model export to Excel. Localized for English, Spanish, German. Free trial version.
• JMP, offers significant visualization and data mining capabilities along with classical statistical analyses.
• K. wiz, from thinkAnalytics - massively scalable, embeddable, Java-based real-time data-mining platform.
Designed for Customer and OEM solutions.
• Kaidara Advisor, (formerly Acknosoft KATE), Case-Based Reasoning (CBR) and data mining engine.
• Kensington Discovery Edition, high-performance discovery platform for life sciences, with multi-source data
integration, analysis, visualization, and workflow building.
• Kepler, extensible, multi-paradigm, multi-purpose data mining system.
• Knowledge Miner, a knowledge mining tool that works with data stored in Microsoft Excel for building
predictive and descriptive models. (MacOS, Excel 2004 or later).
• Kontagent kSuite DataMine, a SaaS User Analytics platform offering real-time behavioral insights for Social,
Mobile and Web, offering SQL-like queries on top of Hadoop deployments.
• KXEN (SAP company), providing Automated Predictive Analytics tools for Big Data.
• LIONsolver (Learning and Intelligent OptimizatioN): modeling and optimization with "on the job learning" for business and engineering by Reactive Search SrL.
• LPA Data Mining tools, support fuzzy, bayesian and expert discovery and modeling of rules.
• Magnify PATTERN, software suite, contains PATTERN:Prepare for data preparation; PATTERN:Model for
building predictive models; and PATTERN:Score for model deployment.
• Mathematica solution for Data Analysis and Mining, from Wolfram.
• MCubiX, a complete and affordable data mining toolbox, including decision tree, neural networks,
associations rules, visualization, and more.
• Microsoft SQL Server, empowers informed decisions with predictive analysis through intuitive data mining,
seamlessly integrated within the Microsoft BI platform, and extensible into any application.
• Machine Learning Framework, provides analysis, prediction, and visualization using fuzzy logic and ML
methods; implemented in C++ and integrated into Mathematica.
• Molegro Data Modeller, a cross-platform application for Data Mining, Data Modelling, and Data
Visualization.
• Nuggets, builds models that uncover hidden facts and relationships, predict for new data, and find key
variables (Windows).
• Oracle Data Mining (ODM), enables customers to produce actionable predictive information and build
integrated business intelligence applications.
• Palisade DecisionTools Suite, complete risk and decision analysis toolkit.
• Partek, pattern recognition, interactive visualization, and statistical analysis, modeling system.
• Pentaho, open-source BI suite, including reporting, analysis, dashboards, data integration, and data mining
based on Weka.
• Polyanalyst, comprehensive suite for data mining, now also including text analysis, decision forest, and link
analysis. Supports OLE DB for Data Mining, and DCOM technology.
• Powerhouse Data Mining, for predictive and clustering modelling, based on Dorian Pyle's ideas on using
Information Theory in data analysis. Most information is in Spanish.
• Predictive Dynamix, integrates graphical and statistical data analysis with modeling algorithms including
neural networks, clustering, fuzzy systems, and genetic algorithms.
• Previa, family of products for classification and forecasting.
• Quadstone DecisionHouse, an agile analytics solution with a complete suite of capabilities to support the end-to-end data mining cycle.
• RapAnalyst™ , uses advanced artificial intelligence to create dynamic predictive models, to reveal
relationships between new and historical data.
• Rapid Insight Analytics, streamlines the predictive modeling and data exploration process, enabling users of
all abilities to quickly build, test, and implement statistical models at lightning speed.
• Reel Two, real-time classification software for structured and unstructured data, as well as entity extraction. From desktop to enterprise.
• Salford Systems Data Mining Suite, CART Decision Trees, MARS predictive modeling, automated
regression, TreeNet classification and regression, data access, preparation, cleaning and reporting modules,
RandomForests predictive modeling, clustering and anomaly detection.
• SAS® Enterprise Miner™ , an integrated suite which provides a user-friendly GUI front-end to the SEMMA
(Sample, Explore, Modify, Model, Assess) process.
• SPAD, provides powerful exploratory analyses and data mining tools, including PCA, clustering, interactive
decision trees, discriminant analyses, neural networks, text mining and more, all via user-friendly GUI.
• Statistica Data Miner, a comprehensive, integrated statistical data analysis, graphics, data base management,
and application development system.
• Synapse, a development environment for neural networks and other adaptive systems, supporting the entire
development cycle from data import and preprocessing via model construction and training to evaluation and
deployment; allows deployment as .NET components.
• Teradata Warehouse Miner and Teradata Analytics, providing analytic services for in-place mining on a
Teradata DBMS.
• thinkCRA , from thinkAnalytics, an integrated suite of Customer Relationship Analytics applications
supporting real-time decisioning.
• TIBCO Spotfire Miner, combining Spotfire visualization, Insightful Miner, and S+ with an intuitive drag-and-drop user interface.
• TIMi Suite: The Intelligent Mining machine, a family of stand-alone, automated, user-friendly GUI tools for
prediction, segmentation and data preparation, with high scalability, speed, ROI, prediction accuracy (a
recurrent top winner at KDD cups).
• Viscovery data mining suite, a unique, comprehensive data mining suite for business applications with
workflow-guided project environment; includes modules for visual data mining, clustering, scoring,
automation and real-time integration.
• Xeno, InfoCentricity powerful, user-friendly online analytic platform, supporting segmentation, clustering,
exploratory data analysis, and the development of highly predictive models.
• XLMiner, Data Mining Add-In For Excel.
• Xpertrule Miner, (Attar Software) features data transformation, Decision Trees, Association Rules and
Clustering on large scale data sets.
Chapter 2. Free and shareware data
mining softwares
• Alteryx Project Edition, free version of Alteryx, delivers the data blending, analytics, and sharing capabilities
of Alteryx with just enough allowed runs of your workflow to solve one business problem or to complete one
project.
• AlphaMiner, open source data mining platform that offers various data mining model building and data
cleansing functionality.
• CMSR Data Miner, built for business data with database focus, incorporating rule-engine, neural network,
neural clustering (SOM), decision tree, hotspot drill-down, cross table deviation analysis, cross-sell analysis,
visualization/charts, and more. Free for academic use.
• CRAN Task View: Machine Learning and Statistical Learning, machine learning and statistical packages in
R.
• Databionic ESOM Tools, a suite of programs for clustering, visualization, and classification with Emergent
Self-Organizing Maps (ESOM).
• ELKI: Environment for DeveLoping KDD-Applications Supported by Index-Structures, a framework in Java
which includes clustering, outlier detection, and other algorithms; allows user to evaluate the combination of
arbitrary algorithms, data types, and distance functions.
• Gnome Data Mining Tools, including apriori, decision trees, and Bayes classifiers.
• jHepWork, an interactive environment for scientific computation, data analysis and data visualization
designed for scientists, engineers and students.
• KEEL, includes knowledge extraction algorithms, preprocessing techniques, evolutionary rule learning,
genetic fuzzy systems, and more.
• KNIME, extensible open source data mining platform implementing the data pipelining paradigm (based on
eclipse).
• Machine Learning in Java (MLJ), an open-source suite of Java tools for research in machine learning. (The
software will not be developed further.)
• MiningMart, a graphical tool for data preprocessing and mining on relational databases; supports
development, documentation, re-use and exchange of complete KDD processes. Free for non-commercial
purposes.
• ML-Flex, an open-source software package designed to enable flexible and efficient processing of disparate
data sets for machine-learning (classification).
• MLC++, a machine learning library in C++.
• Orange, open source data analytics and mining through visual programming or Python scripting. Components
for visualization, rule learning, clustering, model evaluation, and more.
• PredictionIO, an open source machine learning server for software developers and data engineers to create
predictive features, such as personalization, recommendation and content discovery.
• RapidMiner , a leading open-source system for knowledge discovery and data mining.
• Rattle, a data mining suite based on open source statistical language R, includes graphics, clustering,
modeling, and more.
• TANAGRA, offers a GUI interface and methods for data access, statistics, feature selection, classification,
clustering, visualization, association and more.
• Weka, collection of machine learning algorithms for solving real-world data mining problems. It is written in
Java and runs on almost any platform.
Part II. RapidMiner
Table of Contents
3. Data Sources ................................................................................................................................. 16
1. Importing data from a CSV file ........................................................................................... 16
2. Importing data from an Excel file ....................................................................................... 17
3. Creating an AML file for reading a data file ....................................................................... 19
4. Importing data from an XML file ....................................................................................... 21
5. Importing data from a database ........................................................................................... 23
4. Pre-processing .............................................................................................................................. 25
1. Managing data with issues - Missing, inconsistent, and duplicate values ........................... 25
2. Sampling and aggregation ................................................................................................... 27
3. Creating and filtering attributes ........................................................................................... 31
4. Discretizing and weighting attributes .................................................................................. 35
5. Classification Methods 1 .............................................................................................................. 41
1. Classification using a decision tree ..................................................................................... 41
2. Under- and overfitting of a classification with a decision tree ............................................ 46
3. Evaluation of performance for classification by decision tree ............................................ 51
4. Evaluation of performance for classification by decision tree 2 ......................................... 55
5. Comparison of decision tree classifiers ............................................................................... 58
6. Classification Methods 2 .............................................................................................................. 65
1. Using a rule-based classifier (1) .......................................................................................... 65
2. Using a rule-based classifier (2) .......................................................................................... 66
3. Transforming a decision tree to an equivalent rule set ........................................................ 68
7. Classification Methods 3 .............................................................................................................. 71
1. Linear regression ................................................................................................................. 71
2. Classification with linear regression ................................................................................... 73
3. Evaluation of performance for classification by regression model ..................................... 76
4. Evaluation of performance for classification by regression model 2 .................................. 79
8. Classification Methods 4 .............................................................................................................. 84
1. Using a perceptron for solving a linearly separable binary classification problem ............. 84
2. Using a feed-forward neural network for solving a classification problem ........................ 85
3. The influence of the number of hidden neurons to the performance of the feed-forward neural
network ................................................................................................................................... 87
4. Using a linear SVM for solving a linearly separable binary classification problem ........... 88
5. The influence of the parameter C to the performance of the linear SVM (1) ...................... 90
6. The influence of the parameter C to the performance of the linear SVM (2) ...................... 93
7. The influence of the parameter C to the performance of the linear SVM (3) ...................... 95
8. The influence of the number of training examples to the performance of the linear SVM . 97
9. Solving the two spirals problem by a nonlinear SVM ...................................................... 100
10. The influence of the kernel width parameter to the performance of the RBF kernel SVM 101
11. Search for optimal parameter values of the RBF kernel SVM ........................................ 103
12. Using an SVM for solving a multi-class classification problem ..................................... 105
13. Using an SVM for solving a regression problem ............................................................ 106
9. Classification Methods 5 ............................................................................................................ 110
1. Introducing ensemble methods: the bagging algorithm .................................................... 110
2. The influence of the number of base classifiers to the performance of bagging ............... 111
3. The influence of the number of base classifiers to the performance of the AdaBoost method 113
4. The influence of the number of base classifiers to the performance of the random forest 115
10. Association rules ....................................................................................................................... 118
1. Extraction of association rules .......................................................................................... 118
2. Extraction of association rules from a non-transactional dataset ........................................ 121
3. Evaluation of performance for association rules ............................................................... 126
4. Performance of association rules - Simpson's paradox ..................................................... 131
11. Clustering 1 .............................................................................................................................. 135
1. K-means method ............................................................................................................... 135
2. K-medoids method ............................................................................................................ 137
3. The DBSCAN method ...................................................................................................... 140
4. Agglomerative methods .................................................................................................... 142
5. Divisive methods ............................................................................................................... 144
12. Clustering 2 .............................................................................................................................. 148
1. Support vector clustering .................................................................................................. 148
2. Choosing parameters in clustering .................................................................................... 151
3. Cluster evaluation ............................................................................................................. 154
4. Centroid method ................................................................................................................ 159
5. Text clustering ................................................................................................................... 161
13. Anomaly detection .................................................................................................................... 165
1. Searching for outliers ........................................................................................................ 165
2. Unsupervised search for outliers ....................................................................................... 167
3. Unsupervised statistics based anomaly detection .............................................................. 171
Chapter 3. Data Sources
1. Importing data from a CSV file
Description
The process demonstrates how to import data from CSV files using the Read CSV and the Open File operators.
In the experiment we use a real-time earthquake data feed provided by the USGS in CSV format. First, we
download the feed to be able to import it into RapidMiner using the Import Configuration Wizard of the Read
CSV operator. The wizard guides the user step by step through the import process and helps them set the
parameters of the operator correctly. After the local copy of the feed has been successfully imported into RapidMiner,
we can switch to the live feed by adding the Open File operator to the process.
Input
The United States Geological Survey (or USGS for short) provides real-time earthquake data feeds at the
Earthquake Hazards Program website. Data is available in various formats, including CSV. The experiment uses
the feed of the magnitude 1+ earthquakes in the past 30 days in CSV format from the
URL http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php. The feed is updated every 15 minutes.
Output
An ExampleSet that contains data imported from the CSV feed.
Figure 3.1. Metadata of the resulting ExampleSet.
Figure 3.2. A small excerpt of the resulting ExampleSet.
Interpretation of the results
Each time the process is run, it will read live data from the web.
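Outside RapidMiner, the same kind of CSV import can be sketched with Python's standard csv module. The snippet below is an illustrative sketch, not part of the workflow: the two data rows are made up and only mimic a few of the feed's columns (time, latitude, longitude, depth, mag, magType, place); the real process reads the live URL instead of an inline string.

```python
import csv
import io

# A tiny inline sample following the USGS feed's header layout
# (the actual process reads the live feed from the web instead).
feed = io.StringIO(
    "time,latitude,longitude,depth,mag,magType,place\n"
    "2014-01-01T00:10:00Z,38.80,-122.82,2.1,1.5,md,Northern California\n"
    "2014-01-01T01:25:00Z,36.10,-117.85,4.0,2.3,ml,Central California\n"
)

rows = list(csv.DictReader(feed))
# Parse the magnitude column into numbers, as the import wizard's
# type guessing would.
magnitudes = [float(row["mag"]) for row in rows]
print(len(rows), max(magnitudes))  # → 2 2.3
```

Each row becomes a dictionary keyed by the header names, which roughly corresponds to an example of the resulting ExampleSet.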
Video
Workflow
import_exp1.rmp
Keywords
importing data
CSV
Operators
Open File
Read CSV
2. Importing data from an Excel file
Description
The process demonstrates how to import data from Excel files using the Read Excel operator. The Concrete
Compressive Strength data set is used in the experiment. Parameters of the Read Excel operator are set via its
Import Configuration Wizard.
Input
Concrete Compressive Strength [UCI MLR] [Concrete]
Output
An ExampleSet that contains data imported from the Excel file.
Figure 3.3. Metadata of the resulting ExampleSet.
Figure 3.4. A small excerpt of the resulting ExampleSet.
Interpretation of the results
Video
Workflow
import_exp2.rmp
Keywords
importing data
Excel
Operators
Guess Types
Open File
Read Excel
Rename by Generic Names
3. Creating an AML file for reading a data file
Description
The process demonstrates how to create an AML file using the Read AML operator for reading a data file. AML
files are XML documents that provide metadata about attributes, including their names, datatypes, and roles.
Once the AML file has been created, it can be used to read the underlying data file properly.
Input
Pima Indians Diabetes [UCI MLR]
Output
An AML file in the file system and an ExampleSet that contains data imported from the data file using the AML
file.
Figure 3.5. The resulting AML file.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<attributeset default_source="pima-indians-diabetes.data" encoding="UTF-8">
<attribute name="Number of times pregnant " sourcecol="1" valuetype="integer"/>
<attribute name="Plasma glucose concentration" sourcecol="2" valuetype="integer"/>
<attribute name="Diastolic blood pressure" sourcecol="3" valuetype="integer"/>
<attribute name="Triceps skin fold thickness" sourcecol="4" valuetype="integer"/>
<attribute name="2-Hour serum insulin" sourcecol="5" valuetype="integer"/>
<attribute name="Body mass index" sourcecol="6" valuetype="real"/>
<attribute name="Diabetes pedigree function" sourcecol="7" valuetype="real"/>
<attribute name="Age" sourcecol="8" valuetype="integer"/>
<label name="Class" sourcecol="9" valuetype="binominal">
<value>1</value>
<value>0</value>
</label>
</attributeset>
Interpretation of the results
The resulting AML file is intended to be distributed together with the data file.
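Since AML is plain XML, the metadata can also be consumed outside RapidMiner. The sketch below parses an abbreviated version of the AML document shown above with Python's standard XML parser and collects the attribute schema (name, value type, source column):

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the AML file shown above.
aml = """<?xml version="1.0" encoding="UTF-8"?>
<attributeset default_source="pima-indians-diabetes.data" encoding="UTF-8">
  <attribute name="Age" sourcecol="8" valuetype="integer"/>
  <attribute name="Body mass index" sourcecol="6" valuetype="real"/>
  <label name="Class" sourcecol="9" valuetype="binominal"/>
</attributeset>"""

root = ET.fromstring(aml)
# Collect (name, valuetype, column index) triples for the regular
# attributes and the label, in document order.
schema = [(el.get("name"), el.get("valuetype"), int(el.get("sourcecol")))
          for el in root]
print(schema)
```

The schema triples are exactly the information a reader needs to parse the columns of the accompanying data file with the proper types and roles.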
Video
Workflow
import_exp3.rmp
Keywords
importing data
AML
Operators
Read AML
4. Importing data from an XML file
Description
The process demonstrates how to import data from XML documents using the Read XML operator. Parameters
of the Read XML operator are set via its Import Configuration Wizard. Attribute values are extracted from
the XML document using XPath location paths.
Input
The experiment uses population data in XML from the World Bank Open Data website. The data set is available
at http://data.worldbank.org/indicator/SP.POP.TOTL in various formats, including XML.
Figure 3.6. A small excerpt of The World Bank: Population (Total) data set used in the
experiment.
<?xml version="1.0" encoding="utf-8"?>
<Root xmlns:wb="http://www.worldbank.org">
<data>
<record>
<field name="Country or Area" key="ABW">Aruba</field>
<field name="Item" key="SP.POP.TOTL">Population (Total)</field>
<field name="Year">1960</field>
<field name="Value">54208</field>
</record>
<record>
<field name="Country or Area" key="ABW">Aruba</field>
<field name="Item" key="SP.POP.TOTL">Population (Total)</field>
<field name="Year">1961</field>
<field name="Value">55435</field>
</record>
<record>
<field name="Country or Area" key="ABW">Aruba</field>
<field name="Item" key="SP.POP.TOTL">Population (Total)</field>
<field name="Year">1962</field>
<field name="Value">56226</field>
</record>
<!-- ... -->
</Root>
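The extraction the Read XML operator performs with XPath location paths can be sketched with Python's standard XML parser as well. The snippet below reuses the first two records of the excerpt above and turns each record element into one example, selecting field values by their name attribute:

```python
import xml.etree.ElementTree as ET

# The first two records of the World Bank excerpt shown above.
xml = """<Root xmlns:wb="http://www.worldbank.org">
  <data>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Year">1960</field>
      <field name="Value">54208</field>
    </record>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Year">1961</field>
      <field name="Value">55435</field>
    </record>
  </data>
</Root>"""

root = ET.fromstring(xml)
# One example per <record>; field values are selected by name,
# analogously to the XPath location paths set up in the wizard.
examples = [
    {f.get("name"): f.text for f in record.findall("field")}
    for record in root.findall("./data/record")
]
print(examples[0]["Value"])  # → 54208
```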
Output
An ExampleSet that contains data imported from the XML document.
Figure 3.7. Metadata of the resulting ExampleSet.
Figure 3.8. A small excerpt of the resulting ExampleSet.
Interpretation of the results
Video
Workflow
import_exp4.rmp
Keywords
importing data
XML
Operators
Read XML
5. Importing data from a database
Description
The process demonstrates how to connect to an Oracle database and execute an SQL query.
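The faculty's Oracle server and its credentials are not reproducible here, so the sketch below uses an in-memory SQLite database from Python's standard library to illustrate the same connect-query-fetch pattern; the table and query are made-up examples, not the ones used in the workflow.

```python
import sqlite3

# In-memory stand-in for the remote Oracle database used in the process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (station TEXT, value REAL)")
conn.executemany("INSERT INTO measurement VALUES (?, ?)",
                 [("north", 2.0), ("north", 3.0), ("south", 3.5)])

# The result of the SQL query is what becomes the ExampleSet in RapidMiner.
rows = conn.execute(
    "SELECT station, AVG(value) FROM measurement GROUP BY station ORDER BY station"
).fetchall()
print(rows)  # → [('north', 2.5), ('south', 3.5)]
conn.close()
```

With a real Oracle server, only the connection step differs (driver, host, and authentication); the query and result handling follow the same pattern.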
Input
The experiment uses a local database server running at the Faculty of Informatics, University of Debrecen. Note
that connection to the database server requires authentication.
Output
An ExampleSet that contains the result of the SQL query.
Figure 3.9. Metadata of the resulting ExampleSet.
Figure 3.10. A small excerpt of the resulting ExampleSet.
Interpretation of the results
Video
Workflow
import_exp5.rmp
Keywords
importing data
database
SQL
Operators
Read Database
Chapter 4. Pre-processing
1. Managing data with issues - Missing, inconsistent,
and duplicate values
Description
The process shows, using a sample of the Individual household electric power consumption data set, how to
manage data sets in which missing, inconsistent, and/or duplicate values are present. Missing values
can be substituted with a default value or with one computed from the other instances of the field, or, if
necessary, the records containing them can be deleted. After the inconsistent values have been defined, the records
containing them can also be filtered out; however, defining them usually requires some background or domain
knowledge. In contrast, filtering duplicate values is a largely automated task: records identical to each other
can be filtered out easily.
Input
Individual household electric power consumption [UCI MLR]
Output
The dataset used here is a sample from the original dataset, which encompasses a longer time period; it only
contains the energy consumption data from January 2007. Normally the dataset contains a measurement for
every minute, but if a given measurement was not carried out for some reason, the timestamp is present without
data. Such missing values can be substituted with defined values, e.g. with the average of the values present in
the given attribute, or the records containing them can be left out. However, dealing with
inconsistent values is a more complex issue. For some fields, intervals can be defined into which the values of the
attribute must fall, but in other cases, one must rely on other kinds of background information. For example, let
us suppose that the members of the household in which the measurements took place are not night owls.
Based on this, consider the following representation of the data:
Figure 4.1. Graphic representation of the global and kitchen power consumption in time
In the figure, colors are assigned based on the values of the variable Sub_metering_1, which represents the
energy consumption of the appliances in the kitchen. As the x axis is time, it can be seen that some of the
outstanding kitchen consumption values were measured late at night. This can also be seen in the data
view if the data are ordered by the kitchen consumption values:
Figure 4.2. Possible outliers based on the hypothesized habits of the members of the
household
If the members of the household are indeed not night owls, these data can be considered inconsistent
based on our background knowledge, and they can be filtered out as follows. Formulating our condition, let us
assume that if the value of a kitchen measurement exceeds 50 Wh after 10 p.m., it is
considered inconsistent. Based on this, the filtering condition can easily be defined, but first the
time attribute has to be converted, since by default it is stored in a nominal variable in the format hh:mm:ss, which
can only be compared for equality. Using the appropriate operators, it can be split into hour,
minute, and second components, interpreted as numbers. The filtering condition can then be defined using the Time_1 variable,
which contains the hour component:
Figure 4.3. Filtering of the possible values using a record filter
Interpretation of the results
Using such filters, the records belonging to inconsistent data can be filtered out, and also, by using the
appropriate operator, duplicate records can be filtered out as well - it can also be defined based on the equality
of which attribute set they are to be considered duplicate - from the dataset, and after this, the actual processing
of the filtered and/or refined records can begin.
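The three clean-up steps of this process (replacing missing values with the attribute mean, filtering the inconsistent night-time kitchen readings, and removing duplicates) can be sketched in plain Python. The toy records below are made up; only the field names and the 50 Wh / 10 p.m. rule follow the text:

```python
# Toy records mimicking the household dataset; None marks a missing value.
records = [
    {"Time": "21:15:00", "Sub_metering_1": 10.0},
    {"Time": "22:30:00", "Sub_metering_1": 60.0},   # kitchen > 50 Wh after 10 p.m.
    {"Time": "23:05:00", "Sub_metering_1": None},   # missing measurement
    {"Time": "21:15:00", "Sub_metering_1": 10.0},   # duplicate of the first row
]

# 1. Replace missing values with the mean of the attribute.
present = [r["Sub_metering_1"] for r in records if r["Sub_metering_1"] is not None]
mean = sum(present) / len(present)
for r in records:
    if r["Sub_metering_1"] is None:
        r["Sub_metering_1"] = mean

# 2. Filter inconsistent rows: kitchen consumption above 50 Wh after 10 p.m.
def consistent(r):
    hour = int(r["Time"].split(":")[0])    # the Split / Parse Numbers step
    return not (hour >= 22 and r["Sub_metering_1"] > 50)

records = [r for r in records if consistent(r)]

# 3. Remove duplicates (records equal on all attributes).
seen, unique = set(), []
for r in records:
    key = (r["Time"], r["Sub_metering_1"])
    if key not in seen:
        seen.add(key)
        unique.append(r)
print(len(unique))  # → 2
```

Of the four toy rows, one is dropped as inconsistent and one as a duplicate, while the missing value is imputed with the mean, mirroring the operator chain of the workflow.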
Video
Workflow
preproc_exp1.rmp
Keywords
missing data
inconsistent data
data transformation
duplicate data
Operators
Filter Examples
Parse Numbers
Read CSV
Remove Duplicates
Replace Missing Values
Split
2. Sampling and aggregation
Description
The process shows, using a sample of the Individual household electric power consumption dataset, how the
data can be summarized using aggregation, or sampled when not all of the individual records are required in the
given process. Aggregation can be used when the individual data are not needed, only values computed from
the dataset as a whole, while sampling can be used when only a fraction of the dataset is
required and conclusions are to be drawn from this subset of the data.
Input
Individual household electric power consumption [UCI MLR]
Output
When aggregating, all the aggregate functions available in SQL can be used, and using these, basic statistics can
easily be computed for the data of the given dataset.
Figure 4.4. Selection of aggregate functions for attributes
Sampling can be performed by explicitly specifying the size of the sample or by specifying a sampling
probability. A filter can also be used when the parts of the dataset are not to be represented
proportionally in the sample, but rather a given subset of the original dataset is needed for the process. For
example, filtering for the records belonging to a given time of day can be done as follows:
Figure 4.5. Preferences for dataset sampling
Figure 4.6. Preferences for dataset filtering
Interpretation of the results
After performing the aggregation or sampling, the received dataset will only consist of the aggregate values
emerging as a result of the specified operations, or the records that fulfil the specified conditions, respectively:
Figure 4.7. Resulting dataset after dataset sampling
Figure 4.8. Resulting dataset after dataset filtering
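The two operations can be sketched with the Python standard library. This is an illustrative sketch with made-up values: aggregation computes SQL-style summary statistics over an attribute, and sampling draws a fixed-size random subset (seeded here so the run is reproducible):

```python
import random
import statistics

# Made-up power readings standing in for one attribute of the dataset.
power = [1.2, 3.4, 2.2, 5.0, 4.2, 0.8, 2.6, 3.0]

# Aggregation: the usual SQL aggregate functions over the whole attribute.
summary = {
    "count": len(power),
    "average": statistics.mean(power),
    "minimum": min(power),
    "maximum": max(power),
}
print(summary["average"])

# Sampling: an absolute sample size, as in the Sample operator's settings.
random.seed(42)
sample = random.sample(power, 3)
print(sample)
```

Aggregation collapses the attribute into a handful of summary values, whereas sampling keeps individual records but only a fraction of them.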
Video
Workflow
preproc_exp2.rmp
Keywords
aggregation
summation
sampling
data filtering
Operators
Aggregate
Filter Examples
Multiply
Read CSV
Sample
3. Creating and filtering attributes
Description
The process shows, using The Insurance Company Benchmark (COIL 2000) dataset, how new, computed
attributes can be created from existing data when the attributes are not appropriate in their original form
or when some data derived from them is required. The process also shows how individual attributes can be removed from the
dataset: if the raw data that form the basis of a calculation are not needed later on, these columns can be
removed. Naturally, other columns can be filtered out as well if they are not required for the given task or if
their disturbing effect needs to be eliminated.
Input
The Insurance Company Benchmark (COIL 2000) [CoIL Challenge 2000]
Output
The attributes of the dataset beginning with the letter m contain the demographic data of the region belonging to the
zip code of the given potential client, among others the distribution of income groups in
the region. If, for some reason, the original representation is to be compressed, a derived
field can be created from these income attributes using a formula based on some heuristic, for example as
follows:
Figure 4.9. Defining a new attribute based on an expression relying on existing
attributes
After the computed field has been created, it can be decided, depending on the given case, whether the
original fields used in the computation are required later on. It has to be considered whether these
original data could be important for building models in the future, or whether they could have a
disturbing effect. The raw attributes used in the computation, or any other selected
attributes, can be removed from the dataset as follows:
Figure 4.10. Properties of the operator used for removing the attributes made
redundant
Figure 4.11. Selection of the attributes to remain in the dataset with reduced size
Interpretation of the results
After executing these steps, all records will appear in the modified dataset, but with a modified attribute set.
After the computed field has been created, this new attribute appears in every record, while the attributes filtered
out disappear:
Figure 4.12. The appearance of the derived attribute in the altered dataset
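The Generate Attributes / Select Attributes pair can be sketched in plain Python. The records, field names, and the 1*low + 3*high formula below are made-up placeholders, not the actual attributes or expression of the workflow:

```python
# Toy client records with two made-up income attributes.
clients = [
    {"zip": "1012", "income_low": 0.5, "income_high": 0.5},
    {"zip": "4032", "income_low": 0.25, "income_high": 0.75},
]

# Generate Attributes: a new column from an expression over existing ones
# (the weighting 1*low + 3*high is an illustrative placeholder heuristic).
for c in clients:
    c["income_score"] = 1 * c["income_low"] + 3 * c["income_high"]

# Select Attributes: drop the raw columns used in the computation.
clients = [{k: v for k, v in c.items() if k not in ("income_low", "income_high")}
           for c in clients]
print(clients[0])  # → {'zip': '1012', 'income_score': 2.0}
```

As in the figure, every record keeps its identity; only the attribute set changes: the derived column appears and the raw columns disappear.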
Video
Workflow
preproc_exp3.rmp
Keywords
derived attribute
attribute creation
attribute removal
attribute subset
Operators
Generate Attributes
Read AML
Select Attributes
4. Discretizing and weighting attributes
Description
The process shows, using a sample of the Individual household electric power consumption dataset, how an
attribute that takes its values from an interval of real numbers can be discretized, i.e. converted to discrete values
that represent defined subintervals of the real interval. The process also shows how
weights can be assigned to the individual data columns when, in a data mining procedure, it is necessary to
distinguish between attributes regarding their importance, instead of letting all attributes take part in the
data mining algorithm, and in the conclusions based on it, with equal weight.
Input
Individual household electric power consumption [UCI MLR]
Output
In the dataset, the use of discretization is shown on the variable Global_active_power. This variable
represents the total energy consumption of the whole household at a given moment, so its values change
cyclically following the time of day. Thus, if the total consumption is to be
represented with discrete values instead of real numbers so that it can be used in a given method, this column
can be discretized. Discretization can be done using different operators, by defining either the size of the
categories (the number of elements in them) or the number of categories; in the latter case, either
categories of equal width or categories with equal numbers of elements can be created, for example as follows:
Figure 4.13. Selection of the appropriate discretization operator
Figure 4.14. Setting the properties of the discretization operator
Furthermore, to obtain a result or decision that meets the requirements later
on, it has to be defined which attributes have what level of importance; the simplest way to do this is
weighting. To weight the attributes, the weights themselves have to be created first and then
applied to the dataset. For example, weights can be set manually for this dataset to
indicate that the globally measured values are the most important, the sub-meterings are less
importance, and the date and time values are the least important, as follows:
Figure 4.15. Selection of the appropriate weighting operator
Figure 4.16. Defining the weights of the individual attributes
Interpretation of the results
After executing these steps, the value of the variable Global_active_power is modified in all the records
of the dataset. It can be seen that the division into intervals has been carried out, and next to each discrete value,
the interval it represents is displayed as well. In addition,
by comparing the original and the modified dataset (after discretization, the weighted dataset is shown on the left
and the unweighted dataset on the right), it can also
be seen that the numeric values in the individual columns have been altered according to the weighting. Since the
normalize weights option is turned on, the greatest weight is considered to be 1; thus the values of the columns to
which this weight has been assigned do not change, while the values of the columns with smaller
weights decrease proportionally:
Figure 4.17. Comparison of the weighted and unweighted dataset instances
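The two preparation steps can be sketched with plain Python. This is only an illustrative approximation of the operators used, with made-up values: equal-width binning in the spirit of Discretize by Binning, and weight normalization where the greatest weight is treated as 1, as described above.

```python
def discretize(values, bins):
    """Equal-width binning: map each value to the label of its subinterval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        labels.append(f"range{idx + 1}")
    return labels

power = [0.0, 1.0, 2.5, 4.0]
print(discretize(power, 2))  # → ['range1', 'range1', 'range2', 'range2']

# Weight normalization: divide by the largest weight, so the most important
# attribute's values are left unchanged when the weights are applied.
weights = {"Global_active_power": 4.0, "Sub_metering_1": 2.0, "Time": 1.0}
top = max(weights.values())
scaled = {name: w / top for name, w in weights.items()}
print(scaled["Sub_metering_1"])  # → 0.5
```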
Video
Workflow
preproc_exp4.rmp
Keywords
attribute discretization
attribute weighting
weighting
discretization
Operators
Discretize by Binning
Multiply
Read CSV
Scale by Weights
Weight by User Specification
Chapter 5. Classification Methods 1
Decision Trees
1. Classification using a decision tree
Description
The process shows, using the Wine dataset, how classification can be executed by building a decision tree. To
build the decision model, first the dataset has to be split into training and testing sets. After this, the splitting
rules are ordered into a decision tree based on the training set, and afterwards, the model created in such a way
will be used on the test set. Later on, it can be checked what decision conditions the model consists of, based on
the training set, and to which class the records of the test set have been assigned based on these decisions.
Input
Wine [UCI MLR]
Output
Decisions about the individual splits can be made based on measures such as the Gini index or information
gain. For these, and for the confidence level of splits, different parameter values can be defined when creating
the decision tree model. Furthermore, the stopping conditions of splitting can also be defined, either by specifying the
minimal size of record sets that can be split further or by specifying the maximal depth of the tree.
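The Gini index mentioned above can be computed directly: the impurity of a node is 1 minus the sum of squared class proportions, and a split is favoured when the size-weighted impurity of the children falls below the parent's. The class counts below are a made-up example, not taken from the Wine dataset:

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Made-up parent node with two classes, 5 records each.
parent = [5, 5]
# A candidate split producing two child nodes.
left, right = [4, 1], [1, 4]

n = sum(parent)
# Impurity of the split: the children's Gini values weighted by their sizes.
weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
print(gini(parent), weighted)
```

Here the parent's impurity is 0.5 and the split's weighted impurity is 0.32, so this candidate split would be considered an improvement.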
Figure 5.1. Preferences for the building of the decision tree
When splitting the dataset, various sampling methods can be chosen, and the ratio in which it should be split
into training and test sets can also be specified. Splitting can be done simply based on the order of the records,
randomly, or in a stratified fashion, ensuring that the records of each class occur in the training and test sets
in the same ratio as in the original dataset.
Figure 5.2. Preferences for splitting the dataset into training and test sets
Figure 5.3. Setting the relative sizes of the data partitions
Interpretation of the results
After it has been built, the model itself can also be directed to the output, thus it can be checked what decision
tree has been built based on the data of the training set. Based on this, incidental erroneous decisions can be
filtered out using background information or domain knowledge, if any of these is available. If such decisions
are found, the process of building the model can be tuned further. Besides this, by applying the model to the test
set, it can also be seen which classes the records of the test set have been assigned to based on the model trained
using the training set.
Figure 5.4. Graphic representation of the decision tree created
Figure 5.5. The classification of the records based on the decision tree
Video
Workflow
dtree_exp1.rmp
Keywords
classification
decision tree
splitting
Operators
Apply Model
Decision Tree
Multiply
Read AML
Split Data
2. Under- and overfitting of a classification with a
decision tree
Description
The process shows, using the Zoo dataset, under which conditions under- and overfitting can appear when
performing classification using decision trees. If the decision tree that provides the model has too small a
depth, it may not be able to explore the structure of the training set in its entirety, and thus it cannot
carry out the classification properly. This is a case of underfitting. However, if the records are split more
than required, conclusions can be drawn along the decisions that no longer hold, and following this
excess of splitting rules, inappropriate decisions can be made, for example in the case of irregular records. This
is a case of overfitting.
Input
Zoo [UCI MLR]
Output
In this process, operators building similar decision trees are used for the same training set, and in them, only the
stop condition defining the maximal depth of the tree is different. The value of the maximal depth is 3, 6, and 9
respectively.
Figure 5.6. Setting a threshold for the maximal depth of the decision tree
Interpretation of the results
In accordance with this, decision trees of different depths are created, which thus contain different numbers of
splitting conditions, based on which the records of the test set will be classified differently by the different
models. If the maximal depth is 3, the following decision tree is received as a result:
Figure 5.7. Graphic representation of the decision tree created
Figure 5.8. Graphic representation of the classification of the records based on the
decision tree
It can be seen here that using only 2 rules, the 7 possible classes cannot be separated by the model, so this is
a clear case of underfitting. If the maximal depth is increased to 6, the following decision tree is received as a
result:
Figure 5.9. Graphic representation of the decision tree created with the increased
maximal depth
Figure 5.10. Graphic representation of the classification of the records based on the
decision tree with increased maximal depth
In this case, only 3 of the records are classified differently from their original labels. However, if the threshold
for the maximal depth is increased further, the result will not be better; rather, it will worsen, as the additional
rules lead to inappropriate consequences, i.e. this is a case of overfitting. If the maximal depth is increased to
9, the following decision tree is received as a result:
Figure 5.11. Graphic representation of the decision tree created with the further
increased maximal depth
Figure 5.12. Graphic representation of the classification of the records based on the
decision tree with further increased maximal depth
Video
Workflow
dtree_exp2.rmp
Keywords
classification
decision tree
overfitting
underfitting
Operators
Apply Model
Decision Tree
Multiply
Read AML
Split Data
3. Evaluation of performance for classification by
decision tree
Description
The process shows, using the Congressional Voting Records dataset, how the quality of a given classification
can be evaluated. After the decision tree has been built based on the training set, and the test set has been
classified using it, the quality of the classification executed can be examined. Using the evaluation received this
way, it can be decided whether the resulting classification is appropriate for the goals of the process, the existing
model should be improved further, or the existing model is of such poor quality that using a completely new
model is necessary.
Input
Congressional Voting Records [UCI MLR]
Output
The decision tree is built based on the data set using the following settings in the process:
Figure 5.13. Preferences for the building of the decision tree
In this case, the following decision tree emerges:
Figure 5.14. Graphic representation of the decision tree created
Interpretation of the results
Using the decision tree created, the records of the test set can be classified, and after the classification of the
records, the original class labels can be compared to those assigned based on the decision tree, e.g. using the
following figure:
Figure 5.15. Graphic representation of the classification of the records based on the
decision tree
Examining the performance of the classifier, the number of records classified appropriately and inappropriately
can be obtained, and the precision of the classification done by the model can be seen as well, displayed in
percentages for the individual classes and overall:
Figure 5.16. Performance vector of the classification based on the decision tree
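The quantities in such a performance vector can be computed directly. The sketch below (toy labels, not the Congressional Voting Records data) derives the confusion counts, per-class precision, and overall accuracy from true and predicted labels:

```python
# Hedged sketch: computing the kind of performance vector RapidMiner reports.
from collections import Counter

def performance(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))       # (true, predicted) -> count
    accuracy = sum(confusion[(c, c)] for c in classes) / len(y_true)
    precision = {}
    for c in classes:
        predicted_c = sum(confusion[(t, c)] for t in classes)
        precision[c] = confusion[(c, c)] / predicted_c if predicted_c else 0.0
    return confusion, precision, accuracy

y_true = ['rep', 'rep', 'dem', 'dem', 'dem', 'rep']
y_pred = ['rep', 'dem', 'dem', 'dem', 'rep', 'rep']
_, prec, acc = performance(y_true, y_pred)
print(acc)    # 4 of 6 records classified correctly: 0.666...
print(prec)   # per-class precision: {'dem': 0.666..., 'rep': 0.666...}
```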
The question can also be raised in this case whether the performance of the model can be increased further. For
example, the minimal required confidence for splits can be raised as follows:
Figure 5.17. The modification of preferences for the building of the decision tree.
In this case, as a result of the raised value of the required confidence, the structure of the decision tree will be
completely different from that of the original one, and this leads to a change in the numbers and distribution of
the records classified appropriately and inappropriately as well. This model yields a better performance than the
original one, which can also be seen in the figure:
Figure 5.18. Graphic representation of the decision tree created with the modified
preferences
Figure 5.19. Performance vector of the classification based on the decision tree created
with the modified preferences
Video
Workflow
dtree_exp3.rmp
Keywords
classification
decision tree
performance
evaluation
Operators
Apply Model
Decision Tree
Performance (Classification)
Read AML
Split Data
4. Evaluation of performance for classification by
decision tree 2
Description
The process shows, using the Congressional Voting Records dataset, how the quality of a given classification
can be evaluated. After the decision tree has been built based on the training set, and the test set has been
classified using it, the quality of the classification executed can be examined. In some cases, more advanced
levels of validation may be necessary; in these cases, e.g. random subsampling, cross-validation, or a special
case of the latter, the leave-one-out method can be used. Using the evaluation received this way, it can be
decided whether the resulting classification is appropriate for the goals of the process, the existing model should
be improved further, or due to the poor performance of the existing model, it should be replaced with a
completely new model.
Input
Congressional Voting Records [UCI MLR]
Output
Evaluation can be done by using a complex validation operator as well instead of separate operators. In this
case, the split ratio of the dataset, and the form of sampling can be specified:
Figure 5.20. Settings for the sampling done in the validation operator
This is a complex operator that consists of two subprocesses, which can be defined as follows:
Figure 5.21. Subprocesses of the validation operator
Interpretation of the results
This case is completely identical to the process in the previous example: the split into training and test sets is
done, the decision tree built using the training set is applied to the test set, and then its efficiency is evaluated.
Here, the following decision tree emerges, which classifies the records of the test set with the following results:
Figure 5.22. Graphic representation of the decision tree created
Figure 5.23. Performance vector of the classification based on the decision tree
If a deeper examination of the given classifier is necessary, subprocesses identical to the ones above can be
defined in the operator responsible for cross-validation as well. The operator can be tuned using the following
preferences:
Figure 5.24. Settings of the cross-validation operator
Here, it can be defined how many cross-validation iterations should be executed. The dataset is split into as
many subsets of equal size as the number of iterations. Then, each of these splits is selected to be the test set of an
iteration, and the union of all other subsets will serve as the training set of the given iteration. A special case of
this is the leave-one-out method, which can be used by ticking the appropriate checkbox (leave-one-out). When
using this, an iteration is run for each record, in which the given record serves as the test set, and the training set
consists of all other records. As can be seen on the figure, the following average performance values are yielded
by cross-validation with 10 iterations:
Figure 5.25. Overall performance vector of the classifications done in the cross-validation operator
The following average performance values are yielded by the leave-one-out method:
Figure 5.26. Overall performance vector of the classifications done in the cross-validation operator in the leave-one-out case
Note that in this case, the standard deviation of the precision values of the leave-one-out method is remarkably
higher than that of standard cross-validation. This might indicate that irregular records are present whose
classification is not necessarily accurate, even after learning on all other records.
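The splitting scheme described above can be sketched as follows (a hypothetical index-level illustration, not RapidMiner's X-Validation operator):

```python
# Hedged sketch: k-fold cross-validation splits; leave-one-out is the special
# case where k equals the number of records.
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin fold assignment
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n = 6
splits = list(k_fold_indices(n, 3))    # 3-fold: 3 train/test iterations
loo = list(k_fold_indices(n, n))       # leave-one-out: one test record per run
print(len(splits), [test for _, test in loo])
# 3 [[0], [1], [2], [3], [4], [5]]
```

In each iteration one fold serves as the test set and the union of all other folds as the training set, exactly as described above.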
Video
Workflow
dtree_exp4.rmp
Keywords
classification
decision tree
performance
random subsampling
cross-validation
Operators
Apply Model
Decision Tree
Multiply
Performance (Classification)
Read AML
Split Validation
X-Validation
5. Comparison of decision tree classifiers
Description
The process shows, using the Spambase dataset, how the quality and efficiency of multiple classifiers can be
compared. After the decision trees of the classifiers have been built based on the training set, the test set can be
classified using them, and the quality of the individually executed classifications can be examined. This can be
done separately, measuring the precision of the classifiers one by one, or the analyses can be merged, and the
ROC curves of the individual classifiers can be represented on a common figure to better illustrate the
differences between the results. Based on the evaluation received this way, it can be decided which classifier
suits the requirements of the process, whether a given model should be improved, or whether a given model has
to be replaced or removed due to its poor performance.
Input
Spambase [UCI MLR]
Output
Let us create the following two decision tree classifiers based on the training set of the data set:
Figure 5.27. Preferences for the building of the decision tree based on the Gini-index
criterion
Figure 5.28. Preferences for the building of the decision tree based on the gain ratio
criterion
The classifier using the gain ratio builds the following decision tree:
Figure 5.29. Graphic representation of the decision tree created based on the gain ratio
criterion
When applied to the test set, this decision tree yields the following performance values:
Figure 5.30. Performance vector of the classification based on the decision tree built
using the gain ratio criterion
On the contrary, the classifier using the Gini-index builds the following decision tree:
Figure 5.31. Graphic representation of the decision tree created based on the Gini-index
criterion
When applied to the test set, this decision tree yields the following performance values:
Figure 5.32. Performance vector of the classification based on the decision tree built
using the Gini-index criterion
Interpretation of the results
It can be seen that the performance of the classifier utilizing the Gini-index is better than that of the classifier
based on the gain ratio. However, on one hand, the difference between individual models is not this obvious in
all cases, and on the other hand, for the sake of simplification, or to avoid the differences caused by sampling,
the evaluation of the individual models can be merged into a complex operator, and thus their ROC curves can
also be displayed on a single figure, for example as follows:
Figure 5.33. Settings of the operator for the comparison of ROC curves
Figure 5.34. Subprocess of the operator for the comparison of ROC curves
In this case, an arbitrary number of model-building operators can be placed in the complex operator, thus the
precision of all these models can be examined for the same data set at once. However, it is advisable to use a
local random seed to ensure that the comparison is repeatable, as this guarantees that for any execution, the
records will be split into training and test sets in the same manner.
Figure 5.35. Comparison of the ROC curves of the two decision tree classifiers
Based on the ROC curves, it is obvious that the classifier based on the Gini-index has a much higher precision
than that of the classifier using the gain ratio, as its ROC curve is curved more in the direction of the point (0,1)
from the diagonal between the points (0,0) and (1,1).
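The construction of such a curve can be sketched as follows (made-up scores and labels; the sketch assumes distinct confidence scores, since tied scores should be processed together):

```python
# Hedged sketch: ROC curve points from classifier confidence scores.
def roc_points(labels, scores):
    """labels: 1 = positive class, 0 = negative; returns (FPR, TPR) points."""
    pairs = sorted(zip(scores, labels), reverse=True)  # most confident first
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(labels, scores)
print(pts[-1])   # the curve always ends at (1.0, 1.0)
```

Plotting these (FPR, TPR) points gives the curve; the closer it bends toward the point (0,1), the better the classifier.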
Video
Workflow
dtree_exp5.rmp
Keywords
classification
decision tree
performance
comparison
ROC curve
Operators
Apply Model
Compare ROCs
Decision Tree
Multiply
Performance
Read AML
Split Data
Chapter 6. Classification Methods 2
Rule-Based Classifiers
1. Using a rule-based classifier (1)
Description
In this experiment a rule-based classifier is trained on the Zoo data set.
Input
Zoo [UCI MLR]
Output
Figure 6.1. The rule set of the rule-based classifier trained on the data set.
Figure 6.2. The classification accuracy of the rule-based classifier on the data set.
Interpretation of the results
The second figure shows that the rule-based classifier perfectly classifies all training examples.
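A rule set like the one in the first figure is applied as an ordered list: the first matching rule fires, and a default class covers everything else. A minimal sketch (the conditions and classes below are illustrative, not the rule set induced from the Zoo data):

```python
# Hedged sketch: applying an ordered rule list of the kind Rule Induction produces.
rules = [
    (lambda r: r['milk'] == 'true', 'mammal'),
    (lambda r: r['feathers'] == 'true', 'bird'),
    (lambda r: r['fins'] == 'true', 'fish'),
]
default_class = 'invertebrate'                 # fired when no rule matches

def classify(record):
    for condition, cls in rules:               # first matching rule wins
        if condition(record):
            return cls
    return default_class

print(classify({'milk': 'false', 'feathers': 'true', 'fins': 'false'}))  # bird
```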
Video
Workflow
rules_exp1.rmp
Keywords
rule-based classifier
supervised learning
classification
Operators
Apply Model
Map
Performance (Classification)
Read AML
Rule Induction
Subprocess
2. Using a rule-based classifier (2)
Description
This experiment investigates the performance of the rule-based classifier on the Zoo data set. The data set is split
into a training and a test set: half of the examples are used to form the training set, and the rest are used for
testing. The classification accuracies on both the training and the test sets are determined for the rule-based classifier.
Input
Zoo [UCI MLR]
Output
Figure 6.3. The rule set of the rule-based classifier.
Figure 6.4. The classification accuracy of the rule-based classifier on the training set.
Figure 6.5. The classification accuracy of the rule-based classifier on the test set.
Interpretation of the results
The second figure shows that the rule-based classifier perfectly classifies all training examples.
The third figure shows that the rule-based classifier correctly classifies all but 6 of the 50 test examples.
Video
Workflow
rules_exp2.rmp
Keywords
rule-based classifier
supervised learning
classification
Operators
Apply Model
Map
Performance (Classification)
Read AML
Rule Induction
Split Data
Subprocess
3. Transforming a decision tree to an equivalent rule
set
Description
The process demonstrates the use of the Tree to Rules operator that transforms a decision tree to an
equivalent rule-based classifier. The experiment uses a decision tree built on the Zoo data set.
Input
Zoo [UCI MLR]
Output
Figure 6.6. The decision tree built on the data set.
Figure 6.7. The rule set equivalent of the decision tree.
Figure 6.8. The classification accuracy of the rule-based classifier on the data set.
Interpretation of the results
It is apparent in the first and second figures that each rule in the rule set corresponds to a branch of the decision
tree from the root to a leaf node.
The third figure shows that the rule-based classifier (and thus also the decision tree) perfectly classifies all
examples.
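The correspondence between branches and rules can be sketched as follows (a made-up miniature tree, not the one in the figure): each root-to-leaf path becomes one rule whose conditions are the tests along the path.

```python
# Hedged sketch: flattening a tiny binary decision tree into equivalent rules,
# one rule per root-to-leaf path, in the spirit of the Tree to Rules operator.
tree = ('milk = true', 'mammal',
        ('feathers = true', 'bird', 'other'))   # (test, yes-branch, no-branch)

def to_rules(node, conditions=()):
    if isinstance(node, str):                   # leaf: emit one finished rule
        return [(conditions, node)]
    test, yes, no = node
    return (to_rules(yes, conditions + (test,)) +
            to_rules(no, conditions + ('not ' + test,)))

for conds, cls in to_rules(tree):
    print('if', ' and '.join(conds), 'then', cls)
```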
Video
Workflow
rules_exp3.rmp
Keywords
decision tree
rule-based classifier
supervised learning
classification
Operators
Apply Model
Decision Tree
Map
Multiply
Performance (Classification)
Read AML
Subprocess
Tree to Rules
Chapter 7. Classification Methods 3
Regression
1. Linear regression
Description
The process shows, using the Wine dataset, how a regression model can be fitted to a given dataset.
Classification can also be done based on a regression model; however, this process shows that creating the
regression model by itself is insufficient for this purpose. Based on the regression model, approximate values for
numerical labels can be defined, but these values are not yet assigned to concrete class labels. Apart from this,
similarly to other classification methods, the data set has to be split into training and test sets, and
the regression model created using the training set is to be applied to the test set.
Input
Wine [UCI MLR]
Output
When creating the regression model, it can be chosen from among various types of regression, such as linear
regression or logistic regression. From these, linear regression is utilized in the process. In this form, for
example, it can be defined which method should be used for attribute selection, or what the level of minimal
tolerance should be. The thus created linear regression model can be applied to the test set.
Figure 7.1. Properties of the linear regression operator
The following regression model is created based on the data of the training set:
Figure 7.2. The linear regression model yielded as a result
Interpretation of the results
Using the regression model created based on the records of the training set on the test set, approximate values
can be calculated for values of the labels of the individual test records. These approximate values can be seen in
the labelled data set yielded by the model application:
Figure 7.3. The class prediction values calculated based on the linear regression model
It can be seen that most of the approximate values yield a rather good estimation and are close to the original
label, but this by itself is insufficient to complete the classification task. In order to be able to classify records
based on a regression model, its estimates have to be assigned to class labels.
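For a single attribute, the estimation behind such a model reduces to ordinary least squares. A minimal sketch on made-up numbers (the Linear Regression operator fits the multi-attribute analogue):

```python
# Hedged sketch: fitting a least-squares line y = a*x + b to toy data.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx                      # slope and intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))   # 0.94 0.15
```

The fitted line yields a numeric prediction a*x + b for each record, which is exactly the kind of approximate label value discussed above.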
Video
Workflow
regr_exp1.rmp
Keywords
classification
regression
Operators
Apply Model
Linear Regression
Read AML
Split Data
2. Classification with linear regression
Description
The process shows, using the Wine dataset, how a regression model can be fitted to a given dataset, and then
how a classification task can be completed based on the received estimates. Classification can also be done
based on a regression model; in this case, approximate values for numerical labels can be defined based on the
regression model, and afterwards, these values can be assigned to concrete class labels. Similarly to other
classification methods, the data set has to be split into training and test sets, and the regression model created
using the training set is to be applied to the test set.
Input
Wine [UCI MLR]
Output
When creating the regression model, it can be chosen from among various types of regression, such as linear
regression or logistic regression. From these, linear regression is utilized in the process. In order to be able to
use this for classification, it has to be placed into an operator that implements regression-based classification.
Identically to when the operator is used by itself, it can be defined for example which method should be used for
attribute selection, or what the level of minimal tolerance should be. The thus created linear regression model
can be applied to the test set.
Figure 7.4. The subprocess of the classification by regression operator
The following regression model is created based on the data of the training set:
Figure 7.5. The linear regression model yielded as a result
Interpretation of the results
Using the regression model created based on the records of the training set on the test set, confidence values can
be calculated regarding the probabilities of the individual test records belonging to the given groups. These
confidence values, and the class assignments created based on these can be seen in the labelled data set yielded
by the model application:
Figure 7.6. The class labels derived from the predictions calculated based on the
regression model
It can be seen that based on the approximate values, the assignments are done correctly, and are equal to the
original labels in most cases.
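The final assignment step can be sketched simply: each numeric prediction is mapped to the closest class code. This is a simplified nearest-code mapping on toy predictions, not the exact confidence-based mechanism of the Classification by Regression operator:

```python
# Hedged sketch: turning numeric regression outputs into class labels.
classes = [1, 2, 3]                     # e.g. the Wine cultivar codes

def to_class(prediction):
    return min(classes, key=lambda c: abs(c - prediction))

predictions = [1.2, 2.7, 1.9, 3.4]
print([to_class(p) for p in predictions])   # [1, 3, 2, 3]
```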
Video
Workflow
regr_exp2.rmp
Keywords
classification
regression
Operators
Apply Model
Classification by Regression
Linear Regression
Read AML
Split Data
3. Evaluation of performance for classification by
regression model
Description
The process shows, using the Spambase dataset, how the quality and precision of a classification created based
on a regression model fitted to a given data set can be evaluated. After the regression model has been built
based on the training set, and the test set has been classified using it, the quality of the classification
executed can be examined. Using the evaluation received this way, it can be decided whether the resulting
classification is appropriate for the goals of the process, the existing model should be improved further, or the
existing model is of such poor quality that using a completely new model is necessary.
Input
Spambase [UCI MLR]
Output
After creating the regression model, in order to be able to use it for classification, it has to be placed into an
operator that implements regression-based classification. Similarly to when using the operator individually, it
can be defined for example which method should be used for attribute selection, or what the level of minimal
tolerance should be. The thus created linear regression model can be applied to the test set.
Figure 7.7. The subprocess of the classification by regression operator
The following regression model is created based on the data of the training set:
Figure 7.8. The linear regression model yielded as a result
Interpretation of the results
Using the regression model created based on the records of the training set on the test set, confidence values can
be calculated regarding the probabilities of the individual test records belonging to the given groups. Based on
these confidence values, class labels are assigned to the individual records of the test set. Corresponding
to this, it can be evaluated how many records have been classified successfully based on the regression model:
Figure 7.9. The performance vector of the classification based on the regression model
Video
Workflow
regr_exp3.rmp
Keywords
classification
regression
performance
evaluation
Operators
Apply Model
Classification by Regression
Linear Regression
Performance (Classification)
Read AML
Split Data
4. Evaluation of performance for classification by
regression model 2
Description
The process shows, using the Wine dataset, how the quality and precision of a classification created based on a
regression model fitted to a given data set can be evaluated. After the regression model has been built
based on the training set, and the test set has been classified using it, the quality of the classification executed
can be examined. In some cases, more advanced levels of validation may be necessary; in these cases, e.g.
random subsampling, cross-validation, or a special case of the latter, the leave-one-out method can be used.
Using the evaluation received this way, it can be decided whether the resulting classification is appropriate for
the goals of the process, the existing model should be improved further, or the existing model is of such poor
quality that using a completely new model is necessary.
Input
Wine [UCI MLR]
Output
Evaluation can be done by using a complex validation operator as well instead of separate operators. In this
case, as the regression model has to be placed into an operator that implements regression-based classification,
and this operator has to be placed into the operator of complex evaluation, the result is a process that contains
embedded operators on multiple levels:
Figure 7.10. The subprocess of the cross-validation by regression operator
Figure 7.11. The subprocess of the classification by regression operator
Similarly to when using the operator individually, it can be defined for example which method should be used
for attribute selection, or what the level of minimal tolerance should be. The thus created linear regression
model can be applied to the test set. The following regression model is created based on the data of the training
set:
Figure 7.12. The linear regression model yielded as a result
Interpretation of the results
If a deeper examination of the given classifier is necessary, subprocesses identical to the ones above can be
defined in the operator responsible for cross-validation as well. The operator can be tuned using the following
preferences:
Figure 7.13. The customizable properties of the cross-validation operator
Here, it can be defined how many cross-validation iterations should be executed. The dataset is split into as
many subsets of equal size as the number of iterations. Then, each of these splits is selected to be the test set of an
iteration, and the union of all other subsets will serve as the training set of the given iteration. A special case of
this is the leave-one-out method, which can be used by ticking the appropriate checkbox (leave-one-out). When
using this, an iteration is run for each record, in which the given record serves as the test set, and the training set
consists of all other records. As can be seen on the figure, the following average performance values are yielded
by cross-validation with 10 iterations:
Figure 7.14. The overall performance vector of the classifications done using the
regression model defined in the cross-validation operator
The following average performance values are yielded by the leave-one-out method:
Figure 7.15. The overall performance vector of the classifications done using the
regression model defined in the cross-validation operator for the case of using the leave-one-out method
Note that in this case, the standard deviation of the precision values of the leave-one-out method is remarkably
higher than that of standard cross-validation. This might indicate that irregular records are present whose
classification is not necessarily accurate, even after learning on all other records.
Video
Workflow
regr_exp4.rmp
Keywords
classification
regression
performance
cross-validation
Operators
Apply Model
Classification by Regression
Linear Regression
Performance (Classification)
Read AML
X-Validation
Chapter 8. Classification Methods 4
Neural Networks and Support Vector Machines
1. Using a perceptron for solving a linearly separable
binary classification problem
Description
In this experiment a perceptron is trained on a linearly separable two-dimensional data set consisting of two
classes, which is a subset of the Wine data set. The classification accuracy of the perceptron is determined on the
data set.
Input
Figure 8.1. A linearly separable subset of the Wine data set [UCI MLR] used in the
experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).
Output
Figure 8.2. The decision boundary of the perceptron.
Figure 8.3. The classification accuracy of the perceptron on the data set.
Interpretation of the results
The second figure shows that the perceptron perfectly classifies all training examples.
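The learning rule behind the operator can be sketched on made-up linearly separable points (not the Wine subset): whenever an example falls on the wrong side of the current boundary, the weights are nudged toward it, and on separable data this converges to a perfect separator.

```python
# Hedged sketch: the perceptron learning rule on toy 2-D data.
def train_perceptron(points, labels, epochs=100):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):        # y is +1 or -1
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified example
                w[0] += y * x1                         # nudge boundary toward it
                w[1] += y * x2
                b += y
                errors += 1
        if errors == 0:                                # converged
            break
    return w, b

points = [(2.0, 1.0), (3.0, 1.5), (0.5, 3.0), (1.0, 4.0)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(points, labels)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in points]
print(preds == labels)   # True: all training examples classified correctly
```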
Video
Workflow
ann_exp1.rmp
Keywords
perceptron
supervised learning
classification
Operators
Apply Model
Filter Examples
Perceptron
Performance (Classification)
Read CSV
Remove Unused Values
2. Using a feed-forward neural network for solving a
classification problem
Description
In this experiment a two-layer feed-forward neural network with 2 hidden neurons is trained on the Sonar,
Mines vs. Rocks data set.
Input
Sonar, Mines vs. Rocks [UCI MLR]
Output
Figure 8.4. The classification accuracy of the neural network on the data set.
Interpretation of the results
The figure shows that the neural network correctly classifies all but one of the training examples.
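The computation performed by such a network can be sketched as a forward pass; the weights below are arbitrary illustrative values, not the trained Sonar model:

```python
# Hedged sketch: a forward pass through a two-layer network with 2 hidden neurons.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    # Each hidden neuron computes sigmoid(w . x + b); the output neuron
    # combines the hidden activations the same way.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(W_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

x = [0.2, -0.4, 0.7]
W_hidden = [[0.1, -0.3, 0.5], [-0.2, 0.4, 0.1]]   # 2 hidden neurons
b_hidden = [0.0, 0.1]
w_out, b_out = [0.8, -0.6], 0.05
print(forward(x, W_hidden, b_hidden, w_out, b_out) > 0.5)   # thresholded class
```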
Video
Workflow
ann_exp2.rmp
Keywords
feed-forward neural network
supervised learning
classification
Operators
Apply Model
Neural Net
Performance (Classification)
Read CSV
3. The influence of the number of hidden neurons on
the performance of the feed-forward neural network
Description
In this experiment two-layer feed-forward neural networks with different number of hidden neurons are trained
on the Sonar, Mines vs. Rocks data set. The average classification error rate from 10-fold cross-validation is
determined for each neural network.
The main contribution of the experiment is that it shows how to change the value of a list type parameter of an
operator (in our case, the hidden layers parameter of the Neural Net operator) in loops using a macro.
To obtain a reasonable execution time only neural networks with the following number of hidden neurons are
considered: 1, 2, 4, 8, 16.
Input
Sonar, Mines vs. Rocks [UCI MLR]
Output
Figure 8.5. The average classification error rate obtained from 10-fold cross-validation
against the number of hidden neurons.
Interpretation of the results
The figure shows that the best average classification error rate (14.5%) is achieved when the number of hidden
neurons is 8.
Video
Workflow
ann_exp3.rmp
Keywords
feed-forward neural network
supervised learning
error rate
classification
cross-validation
Operators
Apply Model
Guess Types
Log
Log to Data
Loop Values
Neural Net
Performance (Classification)
Print to Console
Provide Macro as Log Value
Read CSV
X-Validation
Execute Script (R) [R Extension]
4. Using a linear SVM for solving a linearly separable
binary classification problem
Description
In this experiment a linear SVM is trained on a linearly separable two-dimensional data set consisting of two
classes, which is a subset of the Wine data set. The classification accuracy of the linear SVM is determined on the
data set.
Input
Figure 8.6. A linearly separable subset of the Wine data set [UCI MLR] used in the
experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).
Output
Figure 8.7. The kernel model of the linear SVM.
Figure 8.8. The classification accuracy of the linear SVM on the data set.
Interpretation of the results
The figure shows that the linear SVM perfectly classifies all training examples.
Video
Workflow
svm_exp1.rmp
Keywords
SVM
supervised learning
classification
Operators
Apply Model
Filter Examples
Performance (Classification)
Read CSV
Remove Unused Values
Support Vector Machine (LibSVM)
5. The influence of the parameter C to the
performance of the linear SVM (1)
90
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Description
The process demonstrates the influence of the parameter C on performance of the linear SVM. Linear SVMs are
trained on a subset of the Wine data set while the value of the parameter C is increased from 0.001 to 100. The
classification error rate on the training set and also the number of support vectors are determined for each SVM.
Input
A subset of the Wine data set [UCI MLR].
Figure 8.9. A subset of the Wine data set used in the experiment (2 of the total of 3
classes and 2 of the total of 13 attributes was selected). Note that the classes are not
linearly separable.
Output
Figure 8.10. The classification error rate of the linear SVM against the value of the
parameter C.
91
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Figure 8.11. The number of support vectors against the value of the parameter C.
Interpretation of the results
The first figure shows that the classification error rate quickly falls below 6% as the value of the parameter C is
increased and then it remains constant.
The second figure shows that the number of support vectors decreases similarly with the increasing value of the
parameter C, although not so rapidly as the classification error rate.
Video
Workflow
svm_exp2.rmp
Keywords
92
Created by XMLmind XSL-FO Converter.
Classification Methods 4
SVM
supervised learning
error rate
classification
Operators
Apply Model
Filter Examples
Log
Loop Parameters
Normalize
Performance (Classification)
Performance (Support Vector Count)
Read CSV
Remove Unused Values
Support Vector Machine (LibSVM)
6. The influence of the parameter C to the
performance of the linear SVM (2)
Description
The process demonstrates the influence of the parameter C on the average classification error rate of the linear
SVM in the case of the Heart Disease data set. We consider linear SVMs with different C values, each of which
is an integer power of 2: C = 2^n, where -13 <= n <= 6. The average classification error rate from 10-fold crossvalidation is determined for each linear SVM.
Input
Heart Disease [UCI MLR]
Note
The data set was donated to the UCI Machine Learning Repository by R. Detrano[Detrano et al.].
Output
93
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Figure 8.12. The average classification error rate of the linear SVM obtained from 10fold cross-validation against the value of the parameter C, where the horizontal axis is
scaled logarithmically.
Interpretation of the results
The figure shows that the average classification error rate is minimal when the value of the parameter C is 2^-8.
Larger values of C result in a slightly worse average classification performance. However, values closer to zero
give worse result.
Thus, the performance of the linear SVM seems not to be sensitive to the value of the parameter C in this case.
Video
Workflow
svm_exp3.rmp
Keywords
SVM
supervised learning
error rate
classification
cross-validation
Operators
Apply Model
Filter Examples
Log
Loop Parameters
Map
Normalize
Performance (Classification)
Read CSV
Support Vector Machine (LibSVM)
94
Created by XMLmind XSL-FO Converter.
Classification Methods 4
X-Validation
7. The influence of the parameter C to the
performance of the linear SVM (3)
Description
In this experiment linear SVMs are trained on the Spambase data set while the value of the parameter C is
varied. We will use integer powers of 2 as value of the parameter: C = 2^n, where -8 <= n <= 5. The data set is
split into a training and a test set, 60% of the examples are used to form a training set, and the rest are for
testing. The classification error rates on both the training and the test sets and also the number of support vectors
are determined for each SVM.
Input
Spambase [UCI MLR]
Output
Figure 8.13. The classification error rate of the linear SVM on the training and the test
sets against the value of the parameter C.
95
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Figure 8.14. The number of support vectors against the value of the parameter C.
Interpretation of the results
The first figure shows that the classification error rate on the training set is decreases with the increase of value
of the parameter C. As the value of the parameter C is increased, the error rate on the test set also decreases, until
C reaches 2. However, further increase of the value of the parameter causes a slight increase in the test error.
The second figure shows that the number of support vectors falls by about 50% while the value of the parameter
C is increased from 2^-8 to 8. Further increase of the value of the parameter causes a slight increase in the
number of support vectors.
Video
96
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Workflow
svm_exp4.rmp
Keywords
SVM
supervised learning
error rate
classification
Operators
Apply Model
Log
Log to Data
Loop Parameters
Normalize
Performance (Classification)
Performance (Support Vector Count)
Read CSV
Split Data
Support Vector Machine (LibSVM)
8. The influence of the number of training examples to
the performance of the linear SVM
Description
The process demonstrates the influence of the number of training examples on the performance of the linear
SVM in the case of the Adult (LIBSVM) data set. The number of training examples is increased in the
experiment, and an SVM is trained in each step. The following performance characteristics are determined for
each of the SVMs:
• the classification error rate on the training set,
• the classification error rate on the corresponding test set,
• the number of support vectors,
• the CPU execution time needed to train the linear SVM.
97
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Input
A discretized and binarized version of the Adult data set [UCI MLR] available at the LIBSVM website
[LIBSVM].
Output
Figure 8.15. The classification error rate of the linear SVM on the training and the test
sets against the number of training examples.
Figure 8.16. The number of support vectors against the number of training examples.
Figure 8.17. CPU execution time needed to train the SVM against the number of
training examples.
98
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Interpretation of the results
The first figure shows that the classification error on the training and test sets are roughly the same,
independently of the number of training examples.
The second and the third figures show that both the number of support vectors and the CPU execution time
increase linearly with the number of training examples.
Video
Workflow
svm_exp5.rmp
Keywords
SVM
supervised learning
error rate
classification
cross-validation
Operators
Apply Model
Extract Macro
Generate Attributes
Log
Log to Data
Loop Files
Normalize
Parse Numbers
Performance (Classification)
Performance (Support Vector Count)
Provide Macro as Log Value
Read Sparse
Remove Duplicates
99
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Sort
Support Vector Machine (LibSVM)
9. Solving the two spirals problem by a nonlinear SVM
Description
In this experiment a nonlinear SVM is trained to solve the Two Spirals problem, that is a linearly non-separable
classification problem developed for benchmarking neural networks. The classification accuracy of the
nonlinear SVM is determined on the data set.
Input
Two Spirals [Two Spirals]
Figure 8.18. The Two Spirals data set
Figure 8.19. The R code that produces the data set and is executed by the
(R) operator of the R Extension.
100
Created by XMLmind XSL-FO Converter.
Execute Script
Classification Methods 4
i <- 0:96
angle <- i * pi / 16
radius <- 6.5 * (104 - i) / 104
x <- radius * cos(angle);
y <- radius * sin(angle);
spirals <- data.frame(
rbind(
cbind(x, y, 0),
cbind(-x, -y, 1)
)
)
names(spirals) <- c("x", "y", "class")
spirals <- transform(spirals, class = factor(class))
spirals.label <- "class"
Note
The R code that produces the data set is based on a SAS code snippet in [Neural Network FAQ].
Output
Figure 8.20. The classification accuracy of the nonlinear SVM on the data set.
Interpretation of the results
The figure shows that the nonlinear SVM perfectly classifies all training examples.
Video
Workflow
svm_exp6.rmp
Keywords
SVM
supervised learning
linearly non-separable
classification
Operators
Apply Model
Performance (Classification)
Support Vector Machine (LibSVM)
Execute Script (R) [R Extension]
10. The influence of the kernel width parameter to the
performance of the RBF kernel SVM
101
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Description
In this experiment RBF kernel SVMs are trained on the Pima Indians Diabetes data set with different kernel
width parameter (gamma) values. The value of this parameter is increased from 0.001 to 5 while the value of the
parameter C is fixed to 1 to obtain comparable results. The data set is split into a training and a test set, 75% of
the examples are used to form a training set, and the rest are for testing. The classification error rates on both the
training and the test sets are determined for each SVM.
Input
Pima Indians Diabetes [UCI MLR]
Output
Figure 8.21. The classification error rates of the SVM on the training and the test sets
against the value of RBF kernel width parameter.
Interpretation of the results
The value of the RBF kernel width parameter can be chosen such that the SVM will perfectly classify all
training examples. Unfortunately, the model does not perform well on the test data. Apparently, overfitting
102
Created by XMLmind XSL-FO Converter.
Classification Methods 4
occurs here. It should be noted that the linear SVM does not perform so well on the training set, its classification
error rate is around 20%.
Video
Workflow
svm_exp7.rmp
Keywords
SVM
supervised learning
error rate
classification
Operators
Apply Model
Log
Loop Parameters
Normalize
Performance (Classification)
Read CSV
Split Data
Support Vector Machine (LibSVM)
11. Search for optimal parameter values of the RBF
kernel SVM
Description
In this experiment RBF kernel SVMs are trained on the Ionosphere data set while the value of the parameter
gamma of the RBF kernel and also the value of the parameter C are changed. The average classification error rate
from 10-fold cross-validation is determined for each SVM. As a result, the values yielding the best average
103
Created by XMLmind XSL-FO Converter.
Classification Methods 4
classification error rate will be returned. The following parameter values will be considered for C and gamma: C
= 2^n, where -5 <= n <= 6, gamma = 2^m, where -10 <= m <= 4. Thus, the total number of parameter value
combinations considered is 180.
Input
Ionosphere [UCI MLR]
Output
Figure 8.22. The optimal parameter values for the RBF kernel SVM.
Figure 8.23. The classification accuracy of the RBF kernel SVM trained on the entire
data set using the optimal parameter values.
Interpretation of the results
The first figure shows that the best average classification error rate is achieved when the value of the parameter
C is 16 and the value of the parameter gamma is 0.015625. Note that these parameter values can not be
considered as the global optimum of the average classification error rate, since they were obtained by
performing a grid search that examines only a few points of the search space.
The second figure shows that the RBF kernel SVM trained on the entire data set using the optimal parameter
values performs very well.
Video
Workflow
svm_exp8.rmp
Keywords
104
Created by XMLmind XSL-FO Converter.
Classification Methods 4
SVM
supervised learning
error rate
classification
cross-validation
parameter optimization
Operators
Apply Model
Log
Multiply
Normalize
Optimize Parameters (Grid)
Performance (Classification)
Read CSV
Set Parameters
Support Vector Machine (LibSVM)
X-Validation
12. Using an SVM for solving a multi-class
classification problem
Description
In this experiment a linear SVM is trained on a data set consisting of three classes. The classification accuracy
of the linear SVM is determined on the data set.
Input
Wine [UCI MLR]
Output
Figure 8.24. The kernel model of the linear SVM.
105
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Figure 8.25. The classification accuracy of the linear SVM on the data set.
Interpretation of the results
The second figure shows that the linear SVM perfectly classifies all training examples.
Video
Workflow
svm_exp9.rmp
Keywords
SVM
supervised learning
classification
Operators
Apply Model
Normalize
Performance (Classification)
Read CSV
Support Vector Machine (LibSVM)
13. Using an SVM for solving a regression problem
106
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Description
The process demonstrates how to use an SVM for solving a regression problem. In this experiment RBF kernel
SVMs are trained on the Concrete Compressive Strength data set while the value of the parameter gamma of the
RBF kernel is changed. To obtain comparable results the value of the parameter C is fixed to 10. The average
RMS error from 10-fold cross-validation is determined for each SVM. As a result, the gamma value yielding the
best average RMS error will be returned. Using this value for the parameter gamma an RBF kernel SVM is
trained on the entire data set that is referred to as the “optimal RBF kernel SVM” below.
Input
Concrete Compressive Strength [UCI MLR] [Concrete]
Output
Figure 8.26. The optimal value of the gamma parameter for the RBF kernel SVM.
Figure 8.27. The average RMS error of the RBF kernel SVM obtained from 10-fold
cross-validation against the value of the parameter gamma, where the horizontal axis is
scaled logarithmically.
107
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Figure 8.28. The kernel model of the optimal RBF kernel SVM.
Figure 8.29. Predictions provided by the optimal RBF kernel SVM against the values of
the observed values of the dependent variable.
108
Created by XMLmind XSL-FO Converter.
Classification Methods 4
Interpretation of the results
The first figure shows that the best average RMS error is achieved when the value of the parameter gamma is 2^2 = 0.25.
The third figure shows that the average RMS error decreases with the increasing value of the parameter gamma
until it reaches its minimum. However, further increase of the value of the parameter gamma results in the
degradation of the performance, i.e., model overfitting occurs.
Video
Workflow
svm_regr_exp1.rmp
Keywords
SVM
supervised learning
RMS error
regression
cross-validation
parameter optimization
Operators
Apply Model
Log
Multiply
Normalize
Optimize Parameters (Grid)
Performance (Regression)
Read Excel
Set Parameters
Support Vector Machine (LibSVM)
X-Validation
109
Created by XMLmind XSL-FO Converter.
Chapter 9. Classification Methods 5
Ensemble Methods
1. Introducing ensemble methods: the bagging
algorithm
Description
The experiment introduces the use of ensemble methods, featuring the Bagging operator. The average
classification error rate from 10-fold cross-validation on the Heart Disease data set is compared for a single
decision stump and an ensemble of 10 decision stumps trained by bagging. The impurity measure used for the
decision stumps is the gain ratio.
Input
Heart Disease [UCI MLR]
Note
The data set was donated to the UCI Machine Learning Repository by R. Detrano[Detrano et al.].
Output
Figure 9.1. The average classification error rate of a single decision stump obtained
from 10-fold cross-validation.
Figure 9.2. The average classification error rate of the bagging algorithm obtained from
10-fold cross-validation, where 10 decision stumps were used as base classifiers.
110
Created by XMLmind XSL-FO Converter.
Classification Methods 5
Interpretation of the results
An ensemble of 10 decision stumps trained by bagging gives an average classification error rate that is about 7%
better that those of a single decision stump.
Video
Workflow
ensemble_exp1.rmp
Keywords
bagging
ensemble methods
supervised learning
error rate
cross-validation
classification
Operators
Apply Model
Bagging
Decision Stump
Map
Multiply
Performance (Classification)
Read CSV
X-Validation
2. The influence of the number of base classifiers to
the performance of bagging
111
Created by XMLmind XSL-FO Converter.
Classification Methods 5
Description
This process demonstrates the influence of the number of base classifiers on the classification error rate of
bagging in the case of the Heart Disease data set. The base classifiers are decision stumps and the impurity
measure used is the gain ratio. The number of base classifiers are increased from 1 to 20 in the experiment, and
the average classification error rate of bagging from 10-fold cross-validation is determined in each step.
Input
Heart Disease [UCI MLR]
Note
The data set was donated to the UCI Machine Learning Repository by R. Detrano[Detrano et al.].
Output
Figure 9.3. The average classification error rate obtained from 10-fold cross-validation
against the number of base classifiers.
112
Created by XMLmind XSL-FO Converter.
Classification Methods 5
Interpretation of the results
The figure shows that the best average classification error rate (21.4%) is achieved when the number of base
classifiers is 14.
Video
Workflow
ensemble_exp2.rmp
Keywords
bagging
ensemble methods
supervised learning
error rate
cross-validation
classification
Operators
Apply Model
Bagging
Decision Stump
Log
Loop Parameters
Map
Performance (Classification)
Read CSV
X-Validation
3. The influence of the number of base classifiers to
the performance of the AdaBoost method
Description
113
Created by XMLmind XSL-FO Converter.
Classification Methods 5
The process demonstrates the influence of the number of base classifiers on the classification error rate of the
AdaBoost method in the case of the Heart Disease data set. The base classifiers are decision stumps and the
impurity measure used is the gain ratio. The number of base classifiers are increased from 1 to 20 in the
experiment, and the average classification error rate of the AdaBoost method from 10-fold cross-validation is
determined in each step.
Note
The experiment is the same as the previous one, the only difference is that the AdaBoost operator is
used instead of the Bagging operator.
Input
Heart Disease [UCI MLR]
Note
The data set was donated to the UCI Machine Learning Repository by R. Detrano[Detrano et al.].
Output
Figure 9.4. The average classification error rate obtained from 10-fold cross-validation
against the number of base classifiers.
Interpretation of the results
The figure shows that the best average classification error rate (22.7%) is achieved when the number of base
classifiers is 3. It is also apparent that increasing the number of base classifiers does not result in the degradation
of the performance, that remains constant instead. Thus, model overfitting surprisingly does not occur.
Note that the best performance obtained is almost identical to those of bagging, but requires lesser base
classifiers. Moreover, performance behaves more predictable than in the case of bagging.
Video
Workflow
114
Created by XMLmind XSL-FO Converter.
Classification Methods 5
ensemble_exp3.rmp
Keywords
AdaBoost
ensemble methods
supervised learning
error rate
cross-validation
classification
Operators
AdaBoost
Apply Model
Decision Stump
Log
Loop Parameters
Map
Performance (Classification)
Read CSV
X-Validation
4. The influence of the number of base classifiers to
the performance of the random forest
Description
The process demonstrates the influence of the number of base classifiers on the classification error rate of the
random forest in the case of the Heart Disease data set. The number of base classifiers (i.e., decision trees) are
increased from 1 to 20 in the experiment, and the average classification error rate of the random forest from 10fold cross-validation is determined in each step. The impurity measure used for the decision trees is the gain
ratio.
Note
The experiment is the same as the previous two, the only difference is that the Random Forest
operator is used here instead of the Bagging and the AdaBoost operators.
Input
115
Created by XMLmind XSL-FO Converter.
Classification Methods 5
Heart Disease [UCI MLR]
Note
The data set was donated to the UCI Machine Learning Repository by R. Detrano[Detrano et al.].
Output
Figure 9.5. The average error rate of the random forest obtained from 10-fold crossvalidation against the number of base classifiers.
Interpretation of the results
The figure shows that the best average classification error rate (19.1%) is achieved when the number of base
classifiers is 10.
Note that the best performance obtained is slightly better than those of AdaBoost (22.7%), but requires more
base classifiers. Moreover, the performance of AdaBoost behaves more predictable than those of the random
forest.
Video
Workflow
ensemble_exp4.rmp
Keywords
random forest
ensemble methods
supervised learning
error rate
cross-validation
classification
Operators
Apply Model
116
Created by XMLmind XSL-FO Converter.
Classification Methods 5
Log
Loop Parameters
Map
Performance (Classification)
Random Forest
Read CSV
X-Validation
117
Created by XMLmind XSL-FO Converter.
Chapter 10. Association rules
1. Extraction of association rules
Description
The process shows, using the Extended Bakery dataset, how association rules can be extracted from a
transactional dataset. The emphasis is on the items that are present in the transactional datasets from the possible
items, i.e. on the items which form part of the given transaction, and not those which are missing from it. If such
a transactional dataset is in an uncompressed sparse matrix representation, so all records contain a binomial
value for each of the possible items, the extraction of association rules can be executed without any complex
transformation, the only thing that has to be kept in mind is that the attributes representing the individual items
should be of a binomial type. Using these, the frequent item sets can be extracted, and based on these, the
association rules valid for the dataset can be extracted.
Input
Extended Bakery [Extended Bakery]
Output
Using the FP-Growth algorithm on the version of the dataset that contains 20000 records, the following frequent
item sets are created:
Figure 10.1. List of the frequent item sets generated
118
Created by XMLmind XSL-FO Converter.
Association rules
Interpretation of the results
Based on these frequent item sets, the appropriate association rules can be created. It can be set the rules
meeting what kind of criteria should be considered valid - by default, a required level of confidence can be set,
but filtering can be done based on other values as well. Using the emerging rules, deeper conclusions can be
drawn regarding the connections between the data. Among other things, the table representation of the rules can
aid this, as in this representation, different kinds of filters can be utilized to filter out the rules considered
interesting, for example by outcome or by confidence level:
Figure 10.2. List of the association rules generated
119
Created by XMLmind XSL-FO Converter.
Association rules
Besides the table representation, a graphic representation can also be used, with available filtering conditions
that are similar to those of the former:
Figure 10.3. Graphic representation of the association rules generated
Video
Workflow
assoc_exp1.rmp
Keywords
frequent item sets
association rules
transactional data
binomial attributes
Operators
120
Created by XMLmind XSL-FO Converter.
Association rules
Create Association Rules
FP-Growth
Numerical to Binominal
Read AML
2. Asszociációs szabályok kinyerése nem tranzakciós
adathalmazból
Description
The process shows, using the Titanic dataset, how association rules can be extracted from a non-transactional
dataset. In order to obtain association rules from such a dataset, it first has to be transformed into a transactional
dataset. In these cases, it depends on the structure of the original database whether the emphasis is only on the
items that are present in the transactional dataset from the possible items, or the 0 values of the variables also
have to be interpreted. These datasets have to be transformed into an uncompressed sparse matrix
representation, in which all records contain a binomial value for each of the possible items. After this, the
extraction of association rules can be executed without any complex transformation. The frequent item sets
occurring in the dataset can be extracted, and based on these, the association rules valid for the dataset can be
extracted as well.
Input
Titanic [Titanic]
Output
Using this dataset, it can be examined whether the age, sex, and class of the passengers of the Titanic had any
influence on their survival chances. As the Class variable is not of a binomial type, this has to be converted into
binomial form first, before the frequent item sets could be extracted:
Figure 10.4. Operator preferences for the necessary data conversion
121
Created by XMLmind XSL-FO Converter.
Association rules
Figure 10.5. Converted version of the dataset
Based on these, the frequent item sets can now be acquired, from which the association rules valid for the
dataset can be generated:
Figure 10.6. List of the frequent item sets generated
122
Created by XMLmind XSL-FO Converter.
Association rules
Figure 10.7. List of the association rules generated
Interpretation of the results
Looking at the frequent item sets and association rules created, it is obvious that the handling of the dataset is
inappropriate. In can be seen in the documentation of the dataset that for each variable, including the binomial
variables, 0 values have a separate meaning (e.g. this represents children at the age variable, or belonging to the
crew at the class variable). In accordance with this, to acquire the appropriate transactional records, these
variables will also have to be split into two separate variables that represent the presence or absence of the two
possible values. In this case, the following dataset is yielded as a result:
Figure 10.8. Operator preferences for the appropriate data conversion
123
Created by XMLmind XSL-FO Converter.
Association rules
Figure 10.9. The appropriate converted version of the dataset
Based on these, the frequent item sets, and using those, the appropriate association rules can be extracted. Using
these emerging rules, deeper conclusions can be drawn regarding the connections between the data, and the
factors influencing the survival chances of the passengers can be filtered out. Among other things, the table
representation of the rules can aid this, as in this representation, different kinds of filters can be utilized to filter
out the rules considered interesting, for example by outcome or by confidence level:
Figure 10.10. Enhanced list of the frequent item sets generated
124
Created by XMLmind XSL-FO Converter.
Association rules
Figure 10.11. List of the association rules generated
Besides the table representation, a graphic representation can also be used, with available filtering conditions
that are similar to those of the former:
Figure 10.12. Graphic representation of the association rules generated
125
Created by XMLmind XSL-FO Converter.
Association rules
Video
Workflow
assoc_exp2.rmp
Keywords
frequent item sets
association rules
non-transactional data
binomial attributes
data transformation
Operators
Create Association Rules
FP-Growth
Multiply
Nominal to Binominal
Read AML
3. Evaluation of performance for association rules
126
Created by XMLmind XSL-FO Converter.
Association rules
Description
The process shows, using the Titanic dataset, how the usability and efficiency of association rules can be
checked if association rules are being extracted for a given dataset. After extracting the association rules, their
support can be evaluated, and similarly to classification tasks, it can be checked to what extent the original
values of the dataset can be predicted based on the rules created. Based on these types of evaluation, conclusions
can be drawn based on which it can be decided whether the resulting association rules are appropriate for the
goals of the process, the existing rules should be improved further, or the existing rules have revealed such poor
connections that using a completely new approach is necessary.
Input
Titanic [Titanic]
Output
Using this dataset, it can be examined whether the age, sex, and class of the passengers of the Titanic had any
influence on their survival chances. After the appropriate conversion of the variables, the dataset can be split
into a training set and a test set, and then, by applying the association rules deduced based on the training set to
the test set, it can be defined to what extent the rules are usable. In order to render the attributes created during
the conversion referable, the following parameter has to be used:
Figure 10.13. Operator preferences for the necessary data conversion
127
Created by XMLmind XSL-FO Converter.
Association rules
After this, in order to evaluate the efficiency of applying the rules using the general performance evaluation
operator, the original and predicted values of the attribute of interest (in this case, the variable Surived_1, which
indicates that the given passenger has survived the shipwreck) have to be converted to nominal types, and also,
it also has to be ensured that their values are coded using the same values:
Figure 10.14. Label role assignment for performance evaluation
Figure 10.15. Prediction role assignment for performance evaluation
128
Created by XMLmind XSL-FO Converter.
Association rules
Figure 10.16. Operator preferences for the data conversion necessary for evaluation
Interpretation of the results
After setting the appropriate roles, the performance measurement operator automatically performs the
comparisons, and based on these, it evaluates the efficiency of the application of the rules. Running the process
yields the following rules regarding the survival of the passengers as a result:
Figure 10.17. Graphic representation of the association rules generated regarding
survival
129
Created by XMLmind XSL-FO Converter.
Association rules
Figure 10.18. List of the association rules generated regarding survival
It can be seen here that although many conclusions have been drawn regarding the survival of the passengers,
the support of the rules is rather low. This leads to the conclusion that the rules can be applied in relatively
special cases, and not generally, thus in some cases, no decision will be possible based on them. This can be
illustrated by the low value appearing in the evaluation of performance as well:
Figure 10.19. Performance vector for the application of association rules generated
One of the reasons for this could be that during the extraction of the association rules, some other factor, that
might affect the connections disclosed by the association rules, was not taken into consideration. After the
discovery of these, a better result might be obtainable in some cases.
Video
Workflow
assoc_exp3.rmp
Keywords
frequent item sets
association rules
performance
support
Operators
Apply Association Rules
Create Association Rules
Discretize by User Specification
FP-Growth
130
Created by XMLmind XSL-FO Converter.
Association rules
Multiply
Nominal to Binominal
Performance
Read AML
Set Role
Split Data
4. Performance of association rules - Simpson's
paradox
Description
The process shows, using the Titanic dataset, how the usability and efficiency of association rules can be
enhanced by splitting the dataset into subsets, based on the connections between its variables, and extracting
the association rules separately for each subset. After extracting the association rules, their support can be
evaluated, and, similarly to classification tasks, it can be checked to what extent the original values of the
dataset can be predicted from the rules created. If these values fail to reach the expected levels, one possible
reason is the so-called Simpson's paradox: due to some hidden factors, the connections between the variables can
weaken, disappear, or even reverse their direction. If such factors are discovered, splitting the dataset along
them can enhance the performance of the association rules.
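Simpson's paradox is easy to reproduce numerically. The counts below are invented purely for illustration (they are not the Titanic figures): group A beats group B inside every age stratum, yet loses after pooling.

```python
# (survivors, total) per (group, age stratum); made-up illustrative counts
data = {
    ("A", "child"): (9, 10),    # 90%
    ("A", "adult"): (30, 100),  # 30%
    ("B", "child"): (80, 100),  # 80%
    ("B", "adult"): (2, 10),    # 20%
}

def rate(group, stratum=None):
    """Survival rate of a group, optionally restricted to one stratum."""
    cells = [v for (g, a), v in data.items()
             if g == group and stratum in (None, a)]
    return sum(s for s, _ in cells) / sum(n for _, n in cells)

# A wins within each stratum ...
print(rate("A", "child") > rate("B", "child"))  # True  (0.90 > 0.80)
print(rate("A", "adult") > rate("B", "adult"))  # True  (0.30 > 0.20)
# ... yet B wins on the pooled data: 39/110 vs 82/110
print(rate("A") < rate("B"))                    # True
```

The reversal comes from the very unequal stratum sizes: pooling mixes a hidden factor (age) into the comparison, which is why splitting the dataset along that factor helps.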
Input
Titanic [Titanic]
Output
Using this dataset, it can be examined whether the age, sex, and class of the passengers of the Titanic had any
influence on their survival chances. After the appropriate conversion of the variables, the dataset can be split
into a training set and a test set, and then, by applying the association rules deduced based on the training set to
the test set, it can be determined to what extent the rules are usable. However, if we do this on the whole
dataset, relatively poor results emerge for support and, consequently, for performance as well:
Figure 10.20. List of the association rules generated regarding survival
Figure 10.21. Performance vector for the application of association rules generated
However, considering the contingency table of the dataset, split for example by the age and class of the
passengers, the conclusion can be drawn that the individual variables have such a strong influence on the
variable of interest (survival) that the effects of the individual classes can neutralize each other in the
dataset as a whole. It can therefore be more advantageous to split the dataset along these variables and extract
the association rules separately in the individual subsets:
Figure 10.22. Contingency table of the dataset
In order to do this, for example if the dataset is to be split based on the age of the passengers, first the
appropriate records have to be filtered, and then the variables used as filtering conditions can be removed,
as in the subsets they carry information that is now redundant:
Figure 10.23. Record filter usage
Figure 10.24. Removal of attributes that become redundant after filtering
Interpretation of the results
After this, the training and test sets are created, the association rules concerning them are extracted, and their
efficiency is evaluated for the separate datasets of adults and children. The subset of adults yielded the following
results:
Figure 10.25. List of the association rules generated for the subset of adults
Figure 10.26. Performance vector for the application of association rules generated
regarding survival for the subset of adults
The subset of children yielded the following results:
Figure 10.27. List of the association rules generated for the subset of children
Figure 10.28. Performance vector for the application of association rules generated
regarding survival for the subset of children
It can be seen that performance can be increased remarkably by such splits of datasets, as by doing this, the
interference between the effects of the groups can be neutralized. For the group of children, the enhancement in
performance is much smaller, but this can be explained by the much smaller record count of the subset.
Video
Workflow
assoc_exp4.rmp
Keywords
frequent item sets
association rules
performance
support
Simpson's paradox
Operators
Apply Association Rules
Create Association Rules
Discretize by User Specification
Filter Examples
FP-Growth
Multiply
Nominal to Binominal
Performance
Read AML
Select Attributes
Set Role
Split Data
Chapter 11. Clustering 1
Standard methods
1. K-means method
Description
The process demonstrates, using the Aggregation dataset, how the K-means clustering algorithm works. It also
shows the importance of choosing the distance function.
Input
Aggregation [SIPU Datasets] [Aggregation]
The dataset consists of 788 two-dimensional vectors, which form 7 separate groups. The task is to discover these
groups - clusters. The difficulty of the task lies in the arrangement of the points, as smaller and larger point
clouds are present at varying distances from each other.
Figure 11.1. The 7 separate groups
Output
Figure 11.2. Clustering with default values
After reading the data, the node of the K-means method is connected, the algorithm is set to search for 7
clusters, and the process is initiated. The upper and right-side point clouds are discovered
successfully; however, the algorithm performs poorly on the lower point cloud.
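The algorithm itself is simple enough to sketch in a few lines of plain Python (Lloyd's iteration with squared Euclidean distance; RapidMiner's k-Means operator differs in initialization and other details):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance; swap this out to change the metric."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's iteration: assign every point to its nearest center, then
    move each center to the mean of its cluster, until nothing changes."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters

# Two well-separated blobs; k=2 recovers them.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Replacing `dist2` with another distance function is exactly the choice exposed by the operator's distance-measure parameter.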
Figure 11.3. Set the distance function.
Let us try out another distance function, the Mahalanobis distance.
Figure 11.4. Clustering with Mahalanobis distance function
It can be seen that, at the cost of minor trade-offs, the result has become more precise; the clustering of the
lower point cloud is now close to a perfect solution.
Interpretation of the results
It can be seen that even the simplest clustering algorithms can discover basic connections, and if the distance
function is chosen correctly, the results can even be made more precise.
Video
Workflow
clust_exp1.rmp
Keywords
K-means method
distance functions
cluster analysis
Operators
k-Means
Read CSV
2. K-medoids method
Description
The process shows, using the Maximum Variance (R15) dataset, how the K-medoids method can be used.
Input
Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]
The dataset contains 600 two-dimensional vectors, which are concentrated into 15 clusters. The points are
arranged around a center with coordinates (10,10), at increasing distances from each other as they get further
from the center. This is the difficulty of the task, as the clusters near the center are close to blending into
each other.
Figure 11.5. The dataset
Output
Figure 11.6. Setting the parameters of the clustering
The difference of the K-medoids method from the K-means method is that the centers of the clusters have to be
existing data points. After setting the distance function and the number of clusters k, and then running the
process, it can be seen that even though a more sophisticated distance function was chosen, the arrangement of
the data did not make a precise analysis of the central clusters possible.
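The restriction that centers must be actual data points is the only change needed relative to a plain k-means sketch. A naive alternating version (the idea behind the k-Medoids operator, not its exact implementation) might look like this:

```python
import random

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmedoids(points, k, iters=20, seed=0):
    """Alternate between assigning points to the nearest medoid and
    re-electing each cluster's medoid: the member that minimises the
    total distance to the rest of its cluster."""
    medoids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, medoids[i]))].append(p)
        new = [min(c, key=lambda m: sum(dist(m, q) for q in c)) if c else medoids[i]
               for i, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
medoids, clusters = kmedoids(pts, 2)
print(all(m in pts for m in medoids))  # True: the centers are existing points
```

Because the medoids are real observations, the method is less sensitive to outliers than the mean-based centers of k-means.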
Figure 11.7. The clusters produced by the analysis
Interpretation of the results
The process has shown that not all datasets lend themselves to arbitrary cluster analysis methods.
Video
Workflow
clust_exp2.rmp
Keywords
K-medoids method
dataset properties
cluster analysis
Operators
k-Medoids
Read CSV
3. The DBSCAN method
Description
The process shows, using the Compound dataset, the advantages of density-based clustering through the DBSCAN
clustering algorithm.
Input
Compound [SIPU Datasets] [Compound]
The dataset consists of 399 two-dimensional vectors belonging to six groups. The groups differ in the density of
their points, and each set can encompass another point set as well.
Figure 11.8. The groups with varying density
Output
Figure 11.9. The results of the method with default parameters
The process yields remarkable results even with default settings; only one out of the six clusters contains an
error. Using the parameters epsilon and min points, the results can be refined further, but as the points in the
above-mentioned cluster are of different densities, a perfect solution cannot be reached.
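A textbook version of the algorithm fits in a few lines (points as tuples; this sketches the generic DBSCAN scheme, not RapidMiner's exact code): points with at least `min_pts` neighbours within `eps` are core points, clusters grow by density-reachability, and whatever is left is labelled noise (-1).

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def dbscan(points, eps, min_pts):
    """Labels: cluster index per point, or -1 for noise."""
    labels = {p: None for p in points}
    def neighbours(p):
        return [q for q in points if dist(p, q) <= eps]
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1              # noise (may later become a border point)
            continue
        labels[p] = cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster     # border point: claimed, not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster
            qn = neighbours(q)
            if len(qn) >= min_pts:      # q is a core point too: keep expanding
                queue.extend(qn)
        cluster += 1
    return labels

pts = ([(x / 10, y / 10) for x in (0, 2) for y in (0, 2)]            # dense blob
       + [(5 + x / 10, 5 + y / 10) for x in (0, 2) for y in (0, 2)]  # dense blob
       + [(100.0, 100.0)])                                           # isolated point
labels = dbscan(pts, eps=0.5, min_pts=3)
print(labels[(100.0, 100.0)])  # -1: noise
```

Since `eps` and `min_pts` are global, a single pair of values cannot fit clusters of very different densities at once, which is exactly the limitation seen on the Compound dataset.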
Interpretation of the results
The operation of the DBSCAN algorithm has been demonstrated on a dataset consisting of groups with
different densities. The deficiencies of the algorithm have also been shown, i.e. that if points of different
densities can be found within a cluster, the precise operation of the algorithm cannot be guaranteed.
Video
Workflow
clust_exp3.rmp
Keywords
DBSCAN method
density function
cluster analysis
Operators
DBSCAN
Read CSV
4. Agglomerative methods
Description
The process shows, using the Maximum Variance (R15) dataset, how to determine the appropriate number of
clusters, and demonstrates the agglomerative hierarchical clustering method.
Input
Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]
The dataset contains 600 two-dimensional vectors, which form 15 separate groups. The task is to determine the
number of groups, and to discover them.
Figure 11.10. The 15 groups
Output
Figure 11.11. The resulting dendrogram
The result of agglomerative clustering is a so-called dendrogram, a tree structure whose leaves are the points
themselves, and whose intermediate nodes (the clusters) result from agglomerating two points or
subtrees (clusters). The method always merges the two points (or clusters) closest to each other, thus building
up the tree, which contains all the points by the end of the process. The length of the edges in the finished
dendrogram is proportional to the distance between the clusters; thus the number of edges on the appropriate
level defines the ideal number of clusters. So, at the beginning of the process, each point forms a cluster on
its own, while by the end, all points are merged into one single cluster.
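The merge loop behind the dendrogram can be sketched directly. This version uses single linkage (distance of the closest pair) and simply stops once k clusters remain, which is what flattening the dendrogram amounts to:

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(points, k):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]
    def linkage(a, b):
        return min(dist(p, q) for p in a for q in b)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
print(sorted(len(c) for c in single_link(pts, 2)))  # [3, 3]
```

Recording the distance at which each merge happens would yield exactly the edge lengths of the dendrogram; complete or average linkage is obtained by replacing `min` in `linkage` with `max` or a mean.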
Figure 11.12. The clustering generated from dendrogram
By using the Flatten Clustering operator, the dendrogram can also be used for clustering by manually specifying
the number of clusters as its single parameter. The figure shows the result of this cluster analysis.
Interpretation of the results
It can be seen that the ideal number of clusters can be determined from the dendrogram, and the cluster
analysis can then be performed based on it.
Video
Workflow
clust_exp4.rmp
Keywords
Agglomerative method
agglomerative hierarchical clustering
cluster analysis
Operators
Agglomerative Clustering
Flatten Clustering
Multiply
Read CSV
5. Divisive methods
Description
The process shows, using the Maximum Variance (R15) dataset, how divisive hierarchical cluster analysis can
be done.
Input
Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]
The dataset contains 600 two-dimensional vectors, which form 15 separate groups. The task is to determine the
ideal number of groups, and to discover them.
Figure 11.13. The 600 two-dimensional vectors
Output
Figure 11.14. The subprocess
In order to perform divisive clustering, an arbitrary clustering method is needed with which the division can be
performed. In the initial state, all points belong to the same cluster; the method then continuously divides the
points into multiple groups until, finally, all points are placed into separate clusters. The operator
determines the ideal number of clusters and assigns the points to the clusters as well.
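The top-down direction can be sketched as repeated bisection. To keep the sketch short, the largest cluster is split here around its two farthest points rather than by running k-means as the operator's inner subprocess does, so this shows the spirit of the method, not its exact mechanics:

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def bisect(cluster):
    """Split a cluster in two around its farthest pair of points."""
    a, b = max(((p, q) for p in cluster for q in cluster),
               key=lambda pq: dist(*pq))
    halves = ([], [])
    for p in cluster:
        halves[0 if dist(p, a) <= dist(p, b) else 1].append(p)
    return halves

def divisive(points, k):
    """Start from one all-inclusive cluster and keep splitting the
    largest one until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        big = max(clusters, key=len)
        clusters.remove(big)
        clusters.extend(h for h in bisect(big) if h)
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
print(sorted(len(c) for c in divisive(pts, 2)))  # [3, 3]
```

Stopping at a fixed k is the simplest criterion; the operator instead decides the number of clusters itself, e.g. by judging whether further splits still improve the clustering.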
Figure 11.15. The report generated by the clustering
In the present case, the procedure determined the number of clusters to be 63.
Figure 11.16. The output of the analysis
Then the points are assigned to these groups.
Interpretation of the results
It can be seen that the method has indeed created a larger number of clusters, but due to this, the central
clusters can be separated from each other better.
Video
Workflow
clust_exp5.rmp
Keywords
Divisive method
divisive hierarchical clustering
cluster analysis
Operators
k-Means
Read CSV
Top Down Clustering
Chapter 12. Clustering 2
Advanced methods
1. Support vector clustering
Description
The process shows, using the Jain dataset, how support vector clustering can be used, and what the effects of its
parameters are.
Input
Jain [SIPU Datasets] [Jain]
The dataset contains 373 two-dimensional vectors, organized into 2 groups. The challenge posed by the point set
is that the clouds of points lie close to each other and have non-spherical shapes.
Figure 12.1. The two groups
Output
During support vector clustering, the data are transformed using kernel functions, and then a circle is enlarged
until all points are located within it. Finally, the boundary curve thus created is transformed back into the
original space along with the data, and thereby the clusters are created. The kernel functions are identical to
the functions described at the support vector machines, and their parameters are the same as well. Support
vector clustering has a unique parameter r, with which the radius of the circle in the transformed space can be
defined.
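The two kernel families used in the process can be written down directly. Parameter conventions vary between implementations; the `gamma`, `degree`, and `c` parameters below follow one common convention and are illustrative only:

```python
import math

def rbf_kernel(p, q, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||p - q||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))

def poly_kernel(p, q, degree=3, c=1.0):
    """Polynomial kernel: (<p, q> + c) ** degree."""
    return (sum(a * b for a, b in zip(p, q)) + c) ** degree

# The RBF kernel behaves as a similarity in (0, 1]: it is 1 for identical
# points and decays smoothly with their distance, which helps with the
# non-spherical groups of this dataset.
print(rbf_kernel((0.0, 0.0), (0.0, 0.0)))         # 1.0
print(rbf_kernel((0.0, 0.0), (3.0, 4.0)) < 1e-6)  # True (distance 5)
```

The polynomial kernel, by contrast, grows with the inner product of the points, which is one intuition for why it struggles on this dataset in the experiment below.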
Figure 12.2. Support vector clustering with polynomial kernel and p=0.21 setup
Firstly, let us test the polynomial kernel, letting the points reach over the boundary curve.
Figure 12.3. Unsuccessful clustering
It can be seen that the result is rather disappointing: the resulting clusters extend into each other, and the
second cluster is considered noise by the method.
Figure 12.4. Clustering with RBF kernel
Switching to the RBF kernel and not allowing the points to reach over the boundary curve, the result is much
more promising. The upper cluster is split into multiple clusters, but the lower one remains in one
piece and is separated from the others.
Figure 12.5. More promising results
Interpretation of the results
Just like with support vector machines, when using SVC the factors that influence the efficiency of the
method the most are choosing the appropriate kernel function and finding the ideal value of the generalization
ability.
Video
Workflow
clust2_exp1.rmp
Keywords
Support vector clustering
SVC
cluster analysis
kernel functions
Operators
Read CSV
Support Vector Clustering
2. Choosing parameters in clustering
Description
The process shows, using the Flame dataset, how the ideal parameters can be found automatically.
Input
Flame [SIPU Datasets] [Flame]
The dataset consists of 240 two-dimensional vectors that belong to two clusters. The clusters lie close
to each other, and one of them has a non-spherical shape.
Figure 12.6. The two groups containing 240 vectors
Output
Figure 12.7. The subprocess of the optimization node
To perform parameter optimization, a performance operator is required; in this case, this will be the node
measuring cluster distance.
Figure 12.8. The parameters of the optimization
The parameters to be optimized and their possible values are chosen in the parameter optimization operator, and
the choice of the ideal values is then entrusted to the system.
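The grid search itself is just a loop over candidate parameter values, scoring each clustering with a performance measure. The sketch below uses a small pure-Python k-means and average within-cluster distance as a stand-in for the Cluster Distance Performance measure:

```python
import random

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means, returning only the clusters."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def avg_within(clusters):
    """Mean distance of points to their cluster centroid (lower = denser)."""
    total = n = 0
    for c in clusters:
        center = tuple(sum(xs) / len(c) for xs in zip(*c))
        total += sum(dist(p, center) for p in c)
        n += len(c)
    return total / n

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
# Grid over the candidate numbers of clusters, keeping the score of each.
scores = {k: avg_within(kmeans(pts, k)) for k in (1, 2, 3)}
print(scores[2] < scores[1])  # True: k=2 is far denser than k=1
```

Note that average within-cluster distance keeps improving as k grows, so in practice the grid results are inspected for an elbow, or a penalized criterion is optimized, rather than taking the raw minimum.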
Figure 12.9. The report generated by the process
In the present case, the best result was yielded by partitioning the task into 10 clusters and measuring the
distance between them with the Euclidean distance.
Figure 12.10. Clustering generated with the optimal parameters
Interpretation of the results
For many parameterized clustering methods, it can be ideal to entrust the determination of the appropriate
number of clusters to a performance measurement operator, and then run the clustering with the obtained values.
Video
Workflow
clust2_exp2.rmp
Keywords
Support vector clustering
SVC
cluster analysis
kernel functions
Operators
Cluster Distance Performance
k-Means
Optimize Parameters (Grid)
Read CSV
3. Cluster evaluation
Description
The process shows, using the Aggregation dataset, how to gather and display cluster metrics.
Input
Aggregation [SIPU Datasets] [Aggregation]
The dataset contains 788 two-dimensional vectors, which form 7 separate groups. In the present case, the aim is
to evaluate the clusters created.
Figure 12.11. The 788 vectors
Output
Figure 12.12. The evaluating subprocess
After reading the data, agglomerative clustering is run with different parameters, and clusters are then created
from the result. A similarity function is created to measure cluster density, and the results of the
measurements are saved for each parameter setting.
Figure 12.13. Setting up the parameters
60 different settings are tested, with the number of clusters ranging from 2 to 20, and all three agglomeration
strategies of the agglomerative clustering are tried out.
Figure 12.14. Parameters to log
The cluster sizes, the cluster densities, the distribution of the points, and the agglomeration strategy are saved
for each setting.
Figure 12.15. Cluster density against k number of clusters
Figure 12.16. Item distribution against k number of clusters
The final result can be acquired by reading the log.
Interpretation of the results
The final result shows that increasing the number of clusters increases cluster density and decreases point
distribution at different paces for the three different strategies. However, the single link strategy falls a
bit behind the complete link and average link methods.
Video
Workflow
clust2_exp3.rmp
Keywords
cluster evaluation
agglomerative clustering
single link
complete link
average link
point density
point distribution
Operators
Agglomerative Clustering
Cluster Density Performance
Data to Similarity
Flatten Clustering
Item Distribution Performance
Log
Log to Data
Loop Parameters
Multiply
Read CSV
4. Centroid method
Description
The process shows, using the Maximum Variance (D31) dataset, that cluster centers are suitable for representing
even the whole of their clusters.
Input
Maximum Variance (D31) [SIPU Datasets] [Maximum Variance]
The dataset contains 3100 two-dimensional vectors, which are concentrated into 31 clusters. Using this dataset,
the aim is to illustrate the generalization power that centroids possess.
Figure 12.17. The vectors forming 31 clusters
Output
Figure 12.18. The extracted centroids
Centroids are obtained after the cluster analysis of the data, and then, to illustrate their representative power,
they are utilized as training data for a k-NN classifier.
Figure 12.19. The output of the k nearest neighbour method, using the centroids as
prototypes
The efficiency of the k-NN classification method primarily depends on the prototypes selected. Based on the
result, it can be seen that the well-chosen points have aided the classification.
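The prototype idea reduces classification to a nearest-centroid lookup. The centroids below are hypothetical stand-ins for the output of Extract Cluster Prototypes, shown only to illustrate the mechanics:

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def nearest_prototype(p, prototypes):
    """1-NN against the cluster centroids: the predicted label is the
    label of the closest centroid."""
    return min(prototypes, key=lambda label: dist(p, prototypes[label]))

# Hypothetical centroids obtained from a previous clustering step:
prototypes = {"cluster_0": (0.1, 0.1), "cluster_1": (5.0, 5.0)}
print(nearest_prototype((0.3, 0.0), prototypes))  # cluster_0
print(nearest_prototype((4.5, 5.2), prototypes))  # cluster_1
```

Instead of keeping all 3100 training vectors, the classifier keeps only the 31 centroids, which is precisely the cut-back of the training set discussed here.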
Interpretation of the results
It can be seen that clustering can be a good starting point for extracting the prototypes of a dataset, which
makes it possible to reduce the training dataset.
Video
Workflow
clust2_exp4.rmp
Keywords
centroids
X-means method
k-NN
Operators
Apply Model
Extract Cluster Prototypes
k-NN
Multiply
Read CSV
Set Role
X-Means
5. Text clustering
Description
The process shows, using the Twenty Newsgroups dataset, how the clustering of documents can be performed.
Input
A subset of the Twenty Newsgroups dataset [UCI MLR].
Note
The data set was donated to the UCI Machine Learning Repository by Tom Mitchell.
The dataset contains about 20,000 news articles belonging to 20 topics. The subset of this dataset utilized here
contains only three of the topics, which are concerned with cars, electronics, and everyday politics.
Output
Figure 12.20. The preprocessing subprocess
The data are read by topic, transformed to lower case, tokenized, stemmed, and then stop words are filtered out.
After this, all that remains is to cluster the documents by their TF-IDF vectors.
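The vectorization and the similarity measure can be sketched in plain Python. This is a toy TF-IDF; real text-mining implementations differ in weighting, smoothing, and normalization details:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy TF-IDF vectorizer: term frequency times the log inverse
    document frequency, one sparse dict per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    return [{t: tf / len(d) * math.log(n / df[t])
             for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["car", "engine", "wheel"],
        ["engine", "car", "fuel"],
        ["election", "vote", "party"]]
vecs = tf_idf(docs)
# The two car documents are more similar to each other than to politics.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Cosine similarity depends only on shared terms, so documents about unrelated topics score 0 here; terms appearing in every document get zero IDF weight and drop out entirely.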
Figure 12.21. The clustering setup
The distance between the document vectors can be measured using the cosine similarity. The cluster labels are
transformed into class labels, and then, it is checked to what extent the clusters cover the individual topics.
Figure 12.22. The confusion matrix of the results
Interpretation of the results
The results show that cars have severely blended with electronics, which is possibly not too far from reality,
as the two fields have numerous points in common.
Video
Workflow
clust2_exp5.rmp
Keywords
K-means method
cosine similarity
text clustering
text mining
Operators
k-Means
Map Clustering on Labels
Performance (Classification)
Filter Stopwords (English) [Text Mining Extension]
Process Documents from Files [Text Mining Extension]
Stem (Snowball) [Text Mining Extension]
Tokenize [Text Mining Extension]
Transform Cases [Text Mining Extension]
Chapter 13. Anomaly detection
1. Searching for outliers
Description
The workflow, using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, shows how to find outliers
based on the distances measured between the data. This can be done either by measuring their distance from
their k nearest neighbours, or by checking whether their distance from some data object is above a given
threshold. The definition of an outlier is relative: it can always be defined in comparison with the distances
between the data objects. Thus, if the distances between the data objects are large in general, a high threshold
has to be set for outliers.
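The distance-based idea can be sketched as scoring each object by its distance to its k-th nearest neighbour and thresholding that score. This is the general scheme behind such operators; the threshold and k below are illustrative:

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_distance_scores(points, k=2):
    """Outlier score of each point: distance to its k-th nearest
    neighbour. Points far from everything get large scores."""
    return {p: sorted(dist(p, q) for q in points if q != p)[k - 1]
            for p in points}

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.0, 10.0)]
scores = knn_distance_scores(pts)
threshold = 1.0   # must be chosen relative to the typical distances
outliers = [p for p, s in scores.items() if s > threshold]
print(outliers)  # [(10.0, 10.0)]
```

Because all other scores here are around 0.1, any threshold between the two score levels separates the outlier cleanly, which mirrors the point above: the threshold is meaningful only relative to the typical distances in the data.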
Input
Wisconsin Diagnostic Breast Cancer (WDBC) [UCI MLR]
Output
It can be seen that outliers can be filtered out using the appropriate settings. As differences ranging in the
hundreds occur even between the individual values of the represented attribute area, this result can be obtained
by setting the threshold for the Euclidean distance to 500.
Figure 13.1. Graphic representation of the possible outliers
Interpretation of the results
Note that due to the great distances between the data objects, the number of outliers detected only decreases to
its true level if the threshold is raised to 500; below a certain value, far too many data objects would be
identified as outliers.
Figure 13.2. The number of outliers detected as the distance limit grows
Video
Workflow
anomaly_exp1.rmp
Keywords
outliers
data preprocessing
data cleansing
Operators
Detect Outlier (Densities)
Detect Outlier (Distances)
Multiply
Read AML
2. Unsupervised search for outliers
Description
The process shows, using a sample of the Individual household electric power consumption dataset, how
incidentally occurring outliers (anomalies) can be found in a dataset with unsupervised methods. Several methods
can be used for unsupervised anomaly detection, e.g., in general cases, methods based on the k nearest
neighbours, which assign an outlier indicator value to each element based on its distance from its k nearest
neighbours. The higher this indicator value is, the more of an outlier the given element is, and the more likely
it is a potential anomaly. However, this scoring may vary depending on the dataset and the method utilized, so
the threshold above which a given element is to be considered an outlier should be set according to the
distances between the data and the methods used.
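The LOF score mentioned below can be computed straight from its definition (k-distance, reachability distance, local reachability density). Scores near 1 indicate inliers; clearly larger values indicate outliers. This is a minimal sketch, not the extension's exact implementation:

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def lof(points, k=2):
    """Local Outlier Factor of every point, from the textbook definition."""
    def knn(p):
        return sorted((q for q in points if q != p),
                      key=lambda q: dist(p, q))[:k]
    kdist = {p: dist(p, knn(p)[-1]) for p in points}   # k-distance
    def reach(p, o):                                    # reachability distance
        return max(kdist[o], dist(p, o))
    # local reachability density: inverse of the mean reachability distance
    lrd = {p: k / sum(reach(p, o) for o in knn(p)) for p in points}
    # LOF: the ratio of the neighbours' densities to the point's own
    return {p: sum(lrd[o] for o in knn(p)) / (k * lrd[p]) for p in points}

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
scores = lof(pts)
print(max(scores, key=scores.get))  # (3.0, 3.0): the isolated point
```

Unlike the plain k-NN distance score, LOF compares each point's density with its neighbours' densities, so it remains usable when different regions of the data have different typical distances.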
Input
Individual household electric power consumption [UCI MLR]
Output
The Anomaly Detection extension that can be installed in RapidMiner offers several possible methods for
detecting anomalies, for example the method based on the k nearest neighbours, and the LOF metric, which relies
on the k nearest neighbours method but also takes density into consideration.
Figure 13.3. Nearest neighbour based operators in the Anomaly Detection package
Figure 13.4. Settings of LOF.
These methods assign different scores to the elements, based on which it can be seen which elements are
outliers. The k nearest neighbours method assigns the following scores to the elements of the dataset:
Figure 13.5. Outlier scores assigned to the individual records based on k nearest
neighbours
The LOF method assigns the following scores to the elements of the dataset:
Figure 13.6. Outlier scores assigned to the individual records based on LOF
Interpretation of the results
Based on the results received, it can be decided which score should serve as the threshold above which an
element is considered an anomaly; the elements with scores above this threshold, i.e. the outliers, can then be
immediately filtered out of the dataset, or a separate dataset can be formed from them:
Figure 13.7. Filtering the records based on their outlier scores
For example, using the k-NN method, the following dataset appears as a result after removing the values rated
as outliers:
Figure 13.8. The dataset filtered based on the k-NN score
Furthermore, the set of elements rated as outliers based on the LOF is the following:
Figure 13.9. The dataset filtered based on the LOF score
Video
Workflow
anomaly_exp2.rmp
Keywords
outliers
anomaly detection
k nearest neighbours
Operators
Filter Examples
Multiply
Read CSV
k-NN Global Anomaly Score [Anomaly Detection]
Local Outlier Factor (LOF) [Anomaly Detection]
3. Unsupervised statistics based anomaly detection
Description
The process shows, using the Flame dataset, how incidentally occurring outliers (anomalies) can be found in a
dataset using statistics-based unsupervised methods. Several methods can be used for unsupervised anomaly
detection, e.g., a statistics-based, histogram-based method. In this case, groups of values are defined for each
attribute with a histogram, and the given value in the given column can be considered an outlier based on its
deviation from these. The overall outlier score of the records is then defined using these per-column scores.
The higher this indicator value is, the more of an outlier the value or record is, and the more likely it is a
potential anomaly. However, this scoring may vary depending on the dataset and the method utilized, so the
threshold above which a given element is to be considered an outlier should be set according to the distances
between the data and the methods used. At the same time, due to this fact, it can be more illustrative to use
colors instead of only values to indicate the outlier scores, which is done automatically by the histogram-based
method.
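The histogram-based score can be sketched per the HBOS idea: build an equal-width histogram per attribute, score each value by the negative log of its (normalized) bin height, and sum over the attributes. Bin count and normalization are simplified here relative to the operator's configurable settings:

```python
import math

def hbos(points, bins=5):
    """Histogram-Based Outlier Score (simplified): rare bins give high
    per-attribute scores; the record score is the sum over attributes."""
    scores = [0.0] * len(points)
    for dim in zip(*points):                      # one attribute at a time
        lo, hi = min(dim), max(dim)
        width = (hi - lo) / bins or 1.0           # guard against zero width
        counts = [0] * bins
        for v in dim:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        tallest = max(counts)
        for i, v in enumerate(dim):
            height = counts[min(int((v - lo) / width), bins - 1)] / tallest
            scores[i] += -math.log(height)        # height > 0: v's own bin
    return scores

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (10.0, 10.0)]
scores = hbos(pts)
print(scores.index(max(scores)))  # 4: the isolated record scores highest
```

Records falling in the tallest bin of every attribute get a score of 0, while values in rare bins accumulate large contributions; switching from fixed to dynamic bin widths changes how those rare bins are formed, which is the refinement examined below.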
Input
Flame [SIPU Datasets] [Flame]
Output
The Anomaly Detection extension that can be installed in RapidMiner offers several possible methods for
detecting anomalies, for example the histogram-based method, which defines the outlier score of the individual
values in each column of the dataset and calculates the final score of the records based on these. The method
can be refined with multiple settings, either on the operator level or on the column level:
Figure 13.10. Global settings for Histogram-based Outlier Score
Figure 13.11. Column-level settings for Histogram-based Outlier Score
Based on the settings, the operator splits the set of values in the individual columns into either a pre-defined
or an arbitrary number of bins, which are either equal or variable in width. Based on these, it assigns color
codes and calculates the record-level score from the scores of the column values. Using a fixed bin width and an
arbitrary number of bins, the following values are returned as a result:
Figure 13.12. Scores and attribute binning for fixed binwidth and arbitrary number of
bins
Interpretation of the results
Based on the results received, it can be decided which score should serve as the threshold above which an
element is considered an anomaly. In this case, however, a more detailed examination is possible as well: based
on the colour codes, it can be seen how probable it is that the individual attribute values are outliers, and
whether these coincide with outlier values of other columns. Based on this, on one hand, the model can be
refined if necessary, and on the other hand, in some cases it can be easier to decide which values should be
considered an anomaly. By checking the graphic representation of the model built from the scores, it can be
seen that there are slightly outlying values that have not been assigned a high score:
Figure 13.13. Graphic representation of outlier scores
Based on this, it might be advisable to alter the model, for example to split the attributes into dynamically sized
bins. This enhances the performance of outlier detection, as can be seen in the following results:
Figure 13.14. Scores and attribute binning for dynamic bin width and arbitrary
number of bins
Figure 13.15. Graphic representation of the enhanced outlier scores
Video
Workflow
anomaly_exp3.rmp
Keywords
outliers
anomaly detection
statistics based anomaly detection
histogram based anomaly detection
bin size
Operators
Read CSV
Histogram-based Outlier Score (HBOS) [Anomaly Detection]
Part III. SAS® Enterprise Miner™
Table of Contents
14. Data Sources ................................................................................................ 178
    1. Reading SAS dataset ............................................................................. 178
    2. Importing data from a CSV file ............................................................. 180
    3. Importing data from an Excel file ......................................................... 183
15. Preprocessing .............................................................................................. 185
    1. Constructing metadata and automatic variable selection ...................... 185
    2. Visualizing multidimensional data and dimension reduction by PCA .. 188
    3. Replacement and imputation ................................................................. 191
16. Classification Methods 1 ............................................................................. 196
    1. Classification by decision tree ............................................................... 196
    2. Comparison and evaluation of decision tree classifiers ........................ 200
17. Classification Methods 2 ............................................................................. 208
    1. Rule induction to the classification of rare events ................................. 208
18. Classification Methods 3 ............................................................................. 212
    1. Logistic regression ................................................................................. 212
    2. Prediction of discrete target by regression models ................................ 217
19. Classification Methods 4 ............................................................................. 221
    1. Solution of a linearly separable binary classification task by ANN and SVM ... 221
    2. Using artificial neural networks (ANN) ................................................ 225
    3. Using support vector machines (SVM) ................................................. 232
20. Classification Methods 5 ............................................................................. 240
    1. Ensemble methods: Combination of classifiers ..................................... 240
    2. Ensemble methods: bagging .................................................................. 244
    3. Ensemble methods: boosting ................................................................. 249
21. Association mining ...................................................................................... 256
    1. Extracting association rules ................................................................... 256
22. Clustering 1 ................................................................................................. 260
    1. K-means method .................................................................................... 260
    2. Agglomerative hierarchical methods ..................................................... 267
    3. Comparison of clustering methods ........................................................ 271
23. Clustering 2 ................................................................................................. 278
    1. Clustering attributes before fitting SVM ............................................... 278
    2. Self-organizing maps (SOM) and vector quantization (VQ) ................. 284
24. Regression for continuous target ................................................................. 289
    1. Logistic regression ................................................................................. 289
    2. Prediction of discrete target by regression models ................................ 294
    3. Supervised models for continuous target .............................................. 297
25. Anomaly detection ....................................................................................... 304
    1. Detecting outliers ................................................................................... 304
Chapter 14. Data Sources
1. Reading SAS dataset
Description
The experiment illustrates how existing SAS datasets can be made available to SAS® Enterprise Miner™
by the Input Data operator. In the experiment, a previously prepared SAS dataset is read. A SAS dataset can
be created by using the SAS® System or SAS® Enterprise Guide™. In order to load the SAS file that we
would like to use, we need to know the path to the file. The file may be on the local machine, but it can also be on
a remote SAS server. The SAS file can be read by using a wizard that guides you through the entire process.
Then the original dataset is sampled by the Sample operator, where a part of the relatively large data file is
selected.
Figure 14.1. The metadata of the dataset
Input
Individual household electric power consumption [UCI MLR]
Output
A dataset which contains 10 percent of the original dataset. For the sampling, either an absolute or a relative
sample size can be chosen. It is also possible to set the Random Seed parameter, which controls the cycle of the
pseudo-random number generator: if the same value is set on different machines, we get the same random
sample. The method of sampling can also be set, e.g. simple random, clustered, or stratified.
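Outside Enterprise Miner, the same sampling ideas can be sketched in Python with pandas (an assumption of this sketch, not part of SAS®; the toy column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the power consumption file.
df = pd.DataFrame({"watts": range(100), "tariff": ["day", "night"] * 50})

# Simple random sample: frac=0.10 is a relative sample size (use n=10 for
# an absolute one); random_state plays the role of the Random Seed
# parameter, so the same seed yields the same sample on any machine.
simple = df.sample(frac=0.10, random_state=42)

# Stratified sample: sampling 10 percent within each tariff class keeps
# the class proportions of the original data.
stratified = df.groupby("tariff", group_keys=False).sample(frac=0.10, random_state=42)

print(len(simple), stratified["tariff"].value_counts().to_dict())
```

Rerunning the script with the same `random_state` reproduces the same rows, which is exactly the role of the Random Seed parameter above.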
Figure 14.2. Setting the Sample operator
Figure 14.3. The metadata of the resulting dataset and a part of the dataset
Interpretation of the results
Whenever we rerun the process, the current state of the data set will be imported to the system, so the Input
Data operator can be used to retrieve data files and to rerun the data mining process based on them, which are
updated constantly by other SAS based systems.
Video
Workflow
sas_import_exp1.xml
Keywords
reading SAS dataset
sampling
Operators
Data Source
Sample
2. Importing data from a CSV file
Description
The process demonstrates how to import data from CSV datasets by the File Import operator. In the
experiment, the Bodyfat dataset of the StatLib data repository is used. In order to open the dataset we would like
to use, we need to know the path to this file which can be on the local machine or on a remote SAS server as
well. This path can be assigned step by step in a menu.
Figure 14.4. The list of files in the File Import operator
Input
Bodyfat [StatLib]
Note
The dataset was donated by Roger W. Johnson to the StatLib.
The process of import can be parametrized in the File Import operator. We can set the maximal number of
records, the maximal number of attributes, and the separator character. It is also possible to define the number of
rows used to determine the file structure.
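For comparison, these import options have rough pandas equivalents (a hedged sketch, not the SAS® operator itself): `sep` is the separator character, `nrows` caps the number of records, and `usecols` limits the attributes.

```python
import io
import pandas as pd

# A small in-memory stand-in for the Bodyfat CSV file.
csv_text = "density,bodyfat,age\n1.0708,12.3,23\n1.0853,6.1,22\n1.0414,25.3,22\n"

# sep: separator character; nrows: maximal number of records;
# usecols: restricts which attributes are imported.
df = pd.read_csv(io.StringIO(csv_text), sep=",", nrows=2, usecols=["bodyfat", "age"])
print(df.shape)
```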
Figure 14.5. The parameters of the File
Import
operator
Output
A dataset which contains the imported data.
Figure 14.6. A small portion of the dataset
Figure 14.7. The metadata of the resulting dataset
Interpretation of the results
Whenever we rerun the process, the current state of the dataset is imported into the system. The File
Import operator can therefore be used to reload data files that are updated constantly by other SAS-based
systems and to rerun the data mining process on them.
Video
Workflow
sas_import_exp2.xml
Keywords
importing data
CSV file
Operators
File Import
Graph Explore
Statistic Explore
3. Importing data from an Excel file
Description
The process illustrates how to import data from an Excel dataset with the help of the File Import operator. In
the experiment the Zoo dataset is used, which was previously saved as an Excel file. In order to open the file that
we would like to use, we need to know the path to this file which can be on the local computer or on a remote
SAS server as well. This path can be assigned step by step by going through the directory tree.
Input
Zoo [UCI MLR]
The process of import can be parametrized in the File Import operator. We can set the maximal number of
records and the maximal number of attributes. It is also possible to define the number of rows used to determine the
file structure.
Output
A dataset which contains the imported data.
Figure 14.8. A small portion of the resulting dataset
Interpretation of the results
Whenever we rerun the process, the system imports the newest version of the dataset. Thus the File Import
operator can be used to load datasets that are updated by operational systems and to rerun data mining
processes based on them. Note that the import procedure works only for Excel 97-2003 files with the xls extension.
Video
Workflow
sas_import_exp3.xml
Keywords
importing data
Excel
Operators
File Import
Graph Explore
Chapter 15. Preprocessing
1. Constructing metadata and automatic variable
selection
Description
The process illustrates, by using the Spambase dataset, how to generate the metadata of a dataset by the DMDB
operator, then how automatic variable selection can be obtained by the Variable Selection operator. The
Spambase dataset contains 58 attributes, one of which is the binary target. In order to visualize a dataset, it may
be necessary to determine the most important input attributes which can be used in the graphical representation.
Input
Spambase [UCI MLR]
Output
The DMDB operator produces metadata (descriptive statistics) such as the mean, variance, minimum, maximum,
skewness, and kurtosis. In the case of discrete attributes these are complemented by the mode.
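The same descriptive statistics can be reproduced with pandas (a sketch on hypothetical toy data, not the DMDB operator's actual output):

```python
import pandas as pd

# Hypothetical miniature stand-in for the Spambase attributes.
df = pd.DataFrame({"word_freq": [0.0, 0.2, 0.4, 1.8],
                   "label": ["spam", "ham", "ham", "ham"]})

# Interval attribute: mean, variance, minimum, maximum, skewness, kurtosis.
meta = {
    "mean": df["word_freq"].mean(),
    "var": df["word_freq"].var(),
    "min": df["word_freq"].min(),
    "max": df["word_freq"].max(),
    "skew": df["word_freq"].skew(),
    "kurt": df["word_freq"].kurt(),
}

# Discrete attribute: the statistics are complemented by the mode.
mode = df["label"].mode()
print(meta, list(mode))
```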
Figure 15.1. Metadata produced by the DMDB operator
The default settings of the Variable Selection operator are applied, except that the minimum R-square is
increased in order to filter out the unnecessary attributes.
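The R-square criterion itself can be sketched directly: compute each input's squared correlation with the target and keep only inputs above the minimum. This is a simplified stand-in for the Variable Selection operator, which also considers binned and interaction effects.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)
inputs = {
    "useful": 2.0 * target + rng.normal(scale=0.1, size=n),  # strongly related
    "noise": rng.normal(size=n),                             # unrelated
}

min_r_square = 0.05  # cut-off, analogous to the Minimum R-Square setting

kept = []
for name, values in inputs.items():
    r = np.corrcoef(values, target)[0, 1]
    if r ** 2 >= min_r_square:
        kept.append(name)  # the variable stays in the data mining process

print(kept)
```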
Figure 15.2. The settings of the Variable Selection operator
The result is, on the one hand, a list containing the decision about each variable, i.e., whether it remains
in the data mining process or not, and, on the other hand, a few graphs showing the importance of the variables.
Figure 15.3. List of variables after the selection
Figure 15.4. Sequential R-square plot
Armed with the important variables, a number of graphical tools of the Enterprise Miner™ can be used to display
the records.
Figure 15.5. The binary target variable as a function of the two most important input attributes after variable selection
Interpretation of the results
The experiment shows how metadata can be extracted from SAS datasets and then passed on to other
operators. Moreover, we demonstrated how variable selection can be performed in the case of a large number of
attributes and how we can work with the important attributes afterwards.
Video
Workflow
sas_preproc_exp1.xml
Keywords
variable selection
metadata
Operators
Data Source
Data Mining DataBase
Graph Explore
Variable Selection
2. Visualizing multidimensional data and dimension
reduction by PCA
Description
The experiment presents visualization and dimension reduction methods with the help of the Fisher-Anderson Iris
dataset. Multidimensional datasets can be visualized by the Graph Explore operator. Dimension reduction can
be performed by the Principal Components operator. After the dimension reduction, it becomes much easier
to display multi-dimensional datasets in the space of principal components.
Input
Fisher-Anderson Iris
Output
The Graph Explore operator provides several graphical tools for displaying multi-dimensional datasets, which
play a key role in the preprocessing step of data mining. Some of these are extensions of well-known tools, such
as two- and three-dimensional scatterplots and bar charts, supplemented by a number of options such as the use
of colors and symbols. Other techniques, such as the parallel axis or the radar plot, are characteristic only of
data mining software tools.
Figure 15.6. Displaying the dataset by parallel axis
Principal Component Analysis (PCA) can be performed by the Principal Components operator. In the
operator the following settings can be defined: the dependency structure (covariance or correlation) and the cut-off
condition (the number of eigenvalues or the cumulative eigenvalue ratio).
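With scikit-learn (an assumption; SAS® uses its own PCA implementation), the two settings map to standardizing the data first (correlation vs. covariance structure) and to the `n_components` cut-off, where a fraction means a cumulative explained variance ratio:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 4-dimensional data with strongly correlated columns,
# loosely mimicking the four Iris measurements.
rng = np.random.default_rng(1)
base = rng.normal(size=(150, 1))
x = np.hstack([base + rng.normal(scale=0.1, size=(150, 1)) for _ in range(4)])

# Standardizing first makes PCA operate on the correlation matrix;
# on raw data it would use the covariance structure instead.
z = StandardScaler().fit_transform(x)

# n_components=0.95 keeps the smallest number of components whose
# cumulative explained variance ratio reaches 95 percent.
pca = PCA(n_components=0.95).fit(z)
scores = pca.transform(z)  # principal-component coordinates of the records
print(pca.n_components_, scores.shape)
```

The `scores` array plays the role of the principal-component coordinates used for plotting below.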
Figure 15.7. Cumulative explained variance plot of the PCA
The main result of principal component analysis is the principal component coordinates of individual records,
which can be used in the further data analysis and visualization.
Figure 15.8. Scatterplot of the Iris dataset using the first two principal components
Interpretation of the results
The experiment shows how we can display high-dimensional datasets and perform dimension reduction. In
our experiment, the original 4-dimensional dataset, which cannot be displayed using a standard scatterplot,
was reduced to 2 dimensions in such a way that 95 percent of the information contained in the data is
preserved.
Video
Workflow
sas_preproc_exp2.xml
Keywords
principal components analysis (PCA)
parallel axis
Operators
Data Source
Graph Explore
Principal Components
3. Replacement and imputation
Description
In this experiment, we demonstrate with the help of the Congressional Voting Records dataset how to modify the
values of attributes by the Replacement operator and then how to impute the missing values by the Impute
operator. The imputation of missing values can be carried out for each variable independently of the others, or
in interaction with the target variable by fitting a decision tree.
Input
Congressional Voting Records [UCI MLR]
Output
By the Replacement operator we can set the substitution of discrete and continuous variables separately.
Figure 15.9. The replacement wizard
A number of imputation methods can be chosen in the Impute operator. We may fill in the missing values with a
constant value, but we can also use a distribution-based value, where a random value is generated by the system, or
a decision tree based method.
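Two of these strategies can be sketched with scikit-learn's SimpleImputer (an assumed substitute, not the Impute operator; tree-based imputation would need a separate model per attribute):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical 0/1-coded votes with missing values.
votes = pd.DataFrame({"bill_a": [1.0, np.nan, 0.0, 1.0],
                      "bill_b": [0.0, 1.0, np.nan, 1.0]})

# Constant-value imputation: every missing entry becomes 0.0.
const = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(votes)

# Mean imputation, a simple distribution-based replacement.
mean = SimpleImputer(strategy="mean").fit_transform(votes)

print(const[1, 0], mean[1, 0])
```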
Figure 15.10. The output of imputation
The results of the imputation, cross-tabulated with the target variable, are shown in the following two bar charts.
Figure 15.11. The relationship of an input and the target variable before imputation
Figure 15.12. The relationship of an input and the target variable after imputation
Interpretation of the results
The experiment shows that if the method of imputation is chosen appropriately, the values obtained in place
of the missing data are not very distorted, and thus, on the larger dataset, we can perform a more reliable
fitting of the model.
Video
Workflow
sas_preproc_exp3.xml
Keywords
replacement
imputation
Operators
Data Source
Graph Explore
Impute
Replacement
Chapter 16. Classification Methods 1
Decision trees
1. Classification by decision tree
Description
The process demonstrates how to classify by the Decision Tree operator in the case when the target is a
nominal attribute. Here the Wine dataset is used, and the target variable has three values. In order to build
a decision tree classifier, it is worth dividing the dataset into training and validation datasets. The current best
splitting rule is then found by the algorithm on the training set, while the growth of the tree is stopped, using the
validation dataset, when the algorithm does not find a significant split. In the partitioning step a test dataset can be
separated as well in order to measure the generalization ability of the resulting tree, but here this is not
recommended due to the limited size of the dataset. The decision tree resulting from the process can be
displayed, showing the decisions at the splits of the model. Using the principle of majority voting,
the algorithm decides which class label should be assigned to each leaf (terminal node).
Input
Wine [UCI MLR]
Output
In the case of a nominal target variable, we can decide about the execution of each split on the basis of various
impurity measures such as the chi-square statistic, the Gini index, or the entropy. For these, and for the reliability of
the splitting, a parameter value can be specified depending on the chosen measure. In addition, the stopping
condition of the splitting can be determined by giving the minimum size of the record sets that can be divided
further, or the maximal depth of the tree. We may also set the maximum number of branches of a tree; the
default is 2, that is, the algorithm builds a binary tree. It is also possible to decide whether we wish to use the missing
values as a possible value in the splits. We can also decide whether the input attributes can be used only once or
several times while the decision tree is produced.
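Several of these settings have counterparts in scikit-learn's decision tree (a sketch, not the SAS® algorithm: there is no chi-square criterion there, and the trees are always binary):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

x, y = load_wine(return_X_y=True)  # three-valued nominal target

# criterion selects the impurity measure (gini or entropy);
# max_depth is the maximal depth of the tree; min_samples_split is the
# minimum size a set of records must have to be divided further.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              min_samples_split=10, random_state=0)
tree.fit(x, y)
print(tree.get_depth(), tree.score(x, y))
```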
Figure 16.1. The settings of dataset partitioning
When partitioning the dataset, different sampling methods can be chosen and the proportions of the
training, validation, and test datasets can be determined. The partitioning can be carried out simply by considering the
order of the records, randomly, or by stratifying with respect to the target variable. Stratified sampling ensures the
same proportion of each class in the training, validation, and test sets.
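A stratified three-way partition can be sketched with two chained scikit-learn splits (an assumed equivalent of the Data Partition operator):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

x, y = load_wine(return_X_y=True)

# 60 percent training; the remaining 40 percent is halved into
# validation and test. stratify keeps the class proportions equal
# in every part, like stratified sampling in the Data Partition node.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, test_size=0.4, stratify=y, random_state=0)
x_valid, x_test, y_valid, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(y_train), len(y_valid), len(y_test))
```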
Figure 16.2. The decision tree
The results of the classification can be seen in the decision tree for both the training and the validation dataset,
including the number of records of each class in each vertex of the tree. On the edges
between vertices, the variables that define the splits and their splitting values are presented. The thickness of
the lines is proportional to the number of records concerned.
Interpretation of the results
The evaluation of the resulting decision tree is supported by numerous statistical indicators and graphical tools.
The most important ones are displayed in multiple windows at a time, where we can make
comparisons; these windows can also be opened one by one from the View menu. With the help of these tools wrong
decisions can be filtered out, and the modeling process can be tuned by using further background information or
domain knowledge. An interactive tree building process also helps here.
Figure 16.3. The response curve of the decision tree
The response curve above shows, for both the training and the validation dataset, what percentage of the
records is classified correctly when the records are ranked according to their score. The curve is
generally monotonically decreasing.
Figure 16.4. Fitting statistics of the decision tree
The Fit Statistics table shows different indicators of the fit of the decision tree classifier produced by
the algorithm. The simplest and most important among them is the misclassification rate (in the red circle),
which shows the proportion of wrongly classified records.
Figure 16.5. The classification chart of the decision tree
On the classification bar chart we can examine in detail for which classes the model works well or poorly.
Figure 16.6. The cumulative lift curve of the decision tree
The figure shows how the resulting decision tree relates to the best possible model based on the
cumulative lift value.
Figure 16.7. The importance of attributes
The variable importance table shows which variables are involved in the decisions of the tree and with what
importance. This is a useful tool for users who possess some domain knowledge.
Video
Workflow
sas_dtree_exp1.xml
Keywords
classification
decision tree
cutting
response curve
misclassification rate
Operators
Data Source
Decision Tree
Data Partition
2. Comparison and evaluation of decision tree
classifiers
Description
The process illustrates how to fit decision trees using different impurity measures and then how to compare
these models on the basis of the Congressional Voting Records dataset. After the decision trees are built on
the training and validation datasets, the best model is selected by the Model Comparison operator, using the
validation dataset in the decision. Finally, the quality of the performed
classification can be studied on the test dataset. Using the resulting model we can perform scoring, i.e., the
evaluation of the test set or of a dataset where we do not know the value of the target variable.
Input
Congressional Voting Records [UCI MLR]
Output
In the process we set the following parameters during the partitioning step.
Figure 16.8. The settings of parameters in the partitioning step
Using the chi-square impurity measure we obtain the following decision tree.
Figure 16.9. The decision tree using the chi-square impurity measure
Using the entropy impurity measure we obtain the following decision tree.
Figure 16.10. The decision tree using the entropy impurity measure
Using the Gini impurity measure we obtain the following decision tree.
Figure 16.11. The decision tree using the Gini-index
Interpretation of the results
The first of the three resulting decision trees is the simplest, and its fit is the worst. The other two are
fairly similar to each other: they use the same input variables in the splits, only the splitting values differ a little.
The three decision trees can be compared in many ways using graphical tools and statistics. For example,
the following chart shows that, on the basis of the cumulative response curve, the decision tree given by the Gini
index is the best model if we are only interested in the first few deciles.
Figure 16.12. The cumulative response curve of decision trees
Regarding the efficiency of the classifier, the number of correctly and incorrectly classified records can be
obtained with respect to the two parties, which can be represented by a bar chart.
Figure 16.13. The classification plot
A detailed comparison is possible by using the response curve, the lift curve, and their variants. The following
figure shows the (non-cumulative) response curve for the three different datasets and the three types of models.
It can be seen how the response curves relate to the best possible and the baseline ones. The bottom right figure
shows that the decision tree based on the Gini index is close to the optimal one on the test dataset.
Figure 16.14. Response curve of decision trees
Another possibility is the examination of the score distributions. The model fits well if the red and the blue
lines are mirror images of each other and their slopes are steep. By virtue of this indicator the entropy-based
decision tree is the best.
Figure 16.15. The score distribution of decision trees
The three trees can be improved further by changing the level of significance. As a consequence, the
decision tree will be built in a different way compared to the original one, and as a result the number and
distribution of correctly and incorrectly classified records will be different. The performance of the
resulting models can be read from the following figure, in which the misclassification rate is underlined as
one of the most important indicators.
Figure 16.16. The main statistics of decision trees
Video
Workflow
sas_dtree_exp2.xml
Keywords
classification
decision tree
performance
evaluation
Operators
Data Source
Decision Tree
Model Comparison
Data Partition
Score
Chapter 17. Classification Methods 2
Rule induction for rare events
1. Rule induction to the classification of rare events
Description
In this experiment, using the Spambase dataset, we show how a baseline classifier can be improved by the
Rule Induction operator for a binary classification task with rare events.
Input
Spambase [UCI MLR]
The corresponding input is prepared using the Sample operator. The number of records taken from the top
of the dataset is chosen such that the proportion of the rare cases is 5 percent. Then we partition the dataset in
the usual way.
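The 5 percent preparation step can be sketched in plain NumPy (synthetic labels stand in for Spambase; the exact Sample-operator mechanics are assumed):

```python
import numpy as np

labels = np.array([1] * 500 + [0] * 500)  # balanced stand-in for Spambase

# Keep all negatives and only as many positives as make them 5 percent
# of the result: n_pos / (n_pos + n_neg) = 0.05.
n_neg = int((labels == 0).sum())
n_pos = int(round(0.05 * n_neg / 0.95))
positives = np.flatnonzero(labels == 1)[:n_pos]
negatives = np.flatnonzero(labels == 0)
sample = np.concatenate([positives, negatives])

rate = float((labels[sample] == 1).mean())
print(len(sample), rate)
```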
Output
Two rule induction models are fitted to the dataset: the former is based on a decision tree model, the latter on
a logistic regression model. The fitted models are compared to a baseline decision tree classifier. The
figures below show the goodness of fit.
Figure 17.1. The misclassification rate of rule induction
Figure 17.2. The classification matrix of rule induction
Figure 17.3. The classification chart of rule induction
On the left side of the model comparison figure a perfect ROC curve can be seen. This curve clearly shows that
the fit is perfect on the training dataset in the case of the second rule induction model.
Figure 17.4. The ROC curves of rule inductions and decision tree
In the next output window, besides the usual information, the number of wrongly classified cases can also be
seen as the output of the Rule Induction operator. This is a very important piece of information in this case.
Figure 17.5. The output of the rule induction operator
Interpretation of the results
The experiment shows that, when the class distribution is very uneven, i.e., the frequency of one class is very low,
significant improvement over the traditional classification models can be achieved by the rule induction method.
Video
Workflow
sas_rules_exp1.xml
Keywords
rule induction
supervised learning
classification
Operators
Data Source
Decision Tree
Model Comparison
Data Partition
Rule Induction
Sample
Chapter 18. Classification Methods 3
Logistic Regression
1. Logistic regression
Description
The process shows, using the Spambase dataset, how a regression model can be fitted to a dataset which has a
binary target. Conventional linear regression is not suitable for this task, even though the Regression
operator offers this option. Instead, we must use the logistic regression method, which is the default option of
this operator. We can choose between the following link functions: the logit, from which the procedure takes its name,
the probit, and the complementary log-log. There is no significant difference among these link functions. The Enterprise
Miner™ provides another operator for fitting regression: by the Dmine Regression operator a forward stepwise
regression can be fitted, where in each step an input variable is selected that contributes most significantly to the
variability of the target.
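A minimal logistic regression fit can be sketched with scikit-learn (an assumption: it supports only the logit link, so the probit and complementary log-log options have no direct counterpart here; the synthetic data stands in for Spambase):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-target data standing in for Spambase.
x, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Logistic regression models the event probability through the logit link.
model = LogisticRegression(max_iter=1000).fit(x, y)
proba = model.predict_proba(x)[:, 1]  # predicted event probabilities
print(model.score(x, y))
```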
Input
Spambase [UCI MLR]
Output
After fitting the logistic regression, standard statistics and graphs are obtained, similarly to other binary
classification tasks. Here only the confusion matrix is shown; the rest of the comparison tools are left to the end of
this experiment.
Figure 18.1. Classification matrix of the logistic regression
In addition to the usual tools, the regression operators also show, using the effects plot, the importance of the
input variables in the regression model built during the process.
Figure 18.2. Effects plot of the logistic regression
In addition to the traditional regression analysis, Enterprise Miner™ provides another operator to fit forward
stepwise regression: the Dmine Regression operator. The results can be seen in the figures below.
Figure 18.3. Classification matrix of the stepwise logistic regression
Figure 18.4. Effects plot of the stepwise logistic regression
The two regressions can be compared in the usual way with the Model Comparison operator. The results of
this comparison are presented in the following figures.
Figure 18.5. Fitting statistics for logistic regression models
Figure 18.6. Classification charts of the logistic regression models
Figure 18.7. Cumulative lift curve of the logistic regression models
Figure 18.8. ROC curves of the logistic regression models
Interpretation of the results
The fit statistics and ROC curves clearly show on the test set that the logistic regression model is better than the
stepwise logistic regression model.
Video
Workflow
sas_regr_exp1.xml
Keywords
classification
binary target
logistic regression
Operators
Data Source
Dmine Regression
Model Comparison
Data Partition
Regression
2. Prediction of discrete target by regression models
Description
The process shows, using the Wine dataset, how we can fit a regression model to a dataset containing a
discrete but non-binary target variable, and how we can perform a classification task using the parameter
estimates obtained from the model. The type of the fitted regression model depends on the measurement scale of
the discrete target variable. If the target is nominal, then the Regression operator fits separate binary logistic models,
in which a selected event of the target is compared to the class of the other values of the discrete target
variable. On the other hand, if the target is ordinal, then a common logistic regression model is fitted, in which
only the constant parameters differ while the parameters of the input variables are shared. (As opposed to the
nominal case, where these coefficients are different too.)
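The nominal case can be sketched with scikit-learn's multinomial logistic regression, which fits one coefficient vector per class value; an ordinal model with shared slopes has no built-in scikit-learn counterpart (an assumption of this sketch):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

x, y = load_wine(return_X_y=True)  # nominal target with three values
x = StandardScaler().fit_transform(x)  # scaling helps convergence

# One coefficient vector per class value, as in the nominal case above.
model = LogisticRegression(max_iter=1000).fit(x, y)
print(model.coef_.shape, model.score(x, y))
```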
Input
Wine [UCI MLR]
Output
A number of model types can be chosen when fitting a regression model, e.g., linear or logistic regression. Among them,
logistic regression is used in the process. It does not even need to be set: the software recognizes the right type of
regression using the metadata of the target. Of course, it is possible to override this option and enforce linear
regression, but this does not make sense, because in that case the Model Comparison operator cannot be used to
compare this model to other discrete supervised models. The fit of the model can be tested by the well-known
statistics and charts.
Figure 18.9. Classification matrix of the logistic regression
The classification chart shows that the fitted model is perfect on the training dataset and has small error on the
validation dataset.
Figure 18.10. The classification chart of the logistic regression
Besides the standard goodness-of-fit tests, the Regression operator presents a bar graph showing the
importance of the input variables in the regression models. The higher the coefficient of an input variable, the
greater its explanatory power with respect to the target variable. In the case of an ordinal target there is only one
such bar graph, while in the case of a nominal target one bar graph is created for each value of the target variable
but one. Since the Class variable has three values, there are two such bar graphs below.
Figure 18.11. The effects plot of the logistic regression
Interpretation of the results
Applying the regression model created on the training set to the validation dataset, the above results show that a
model with relatively high accuracy can be built by the Regression operator for a target variable with multiple
discrete values. Note that the Dmine Regression operator cannot be applied to the problem discussed here.
Video
Workflow
sas_regr_exp2.xml
Keywords
classification
nominal and ordinal target
logistic regression
Operators
Data Source
Data Partition
Regression
Chapter 19. Classification Methods 4
Neural networks and support vector machines
1. Solution of a linearly separable binary classification
task by ANN and SVM
Description
In this experiment, we train a perceptron, as a simple artificial neural network (ANN), and a support vector
machine (SVM) on a linearly separable two-dimensional dataset with two classes. The dataset is a subset of the
Wine dataset. The classification accuracy of both classifiers is determined on this dataset.
Input
Wine [UCI MLR]
Figure 19.1. A linearly separable subset of the Wine dataset
To be used in the experiment, the dataset has to go through significant preprocessing. This involves
choosing 2 of the 13 attributes by using the Drop operator, then deleting the second class of the Class
attribute and changing its measurement level to binary by the Metadata operator.
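The preprocessing and the two models can be sketched with scikit-learn as follows; the two feature indices kept here (alcohol and flavanoids) are an illustrative assumption, not necessarily the attributes retained by the Drop operator in the workflow.

```python
# Two attributes, two classes of Wine: perceptron vs. linear-kernel SVM.
from sklearn.datasets import load_wine
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
mask = y != 1                         # drop the middle class -> binary task
X2, y2 = X[mask][:, [0, 6]], y[mask]  # alcohol and flavanoids (assumed pair)

perc = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X2, y2)
svm = SVC(kernel="linear").fit(X2, y2)
print(perc.score(X2, y2), svm.score(X2, y2))  # training accuracies
print(len(svm.support_))                      # number of support vectors
```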
Output
The goodness of fit of the perceptron model can be checked by the usual statistics (misclassification rate, the
number of incorrectly classified cases) and graphs (response and lift curves).
Figure 19.2. Fitting statistics for perceptron
Figure 19.3. The classification matrix of the perceptron
Figure 19.4. The cumulative lift curve of the perceptron
In the case of support vector machines (SVM), besides the above mentioned diagnostic tools, more details are
given about the support vectors, namely the number of support vectors, the size of the margin and the list of
support vectors.
Figure 19.5. Fitting statistics for SVM
Figure 19.6. The classification matrix of SVM
Figure 19.7. The cumulative lift curve of SVM
Figure 19.8. List of the support vectors
Interpretation of the results
The figures and statistics show that both the perceptron and the support vector machine classify all of the
training cases perfectly.
Video
Workflow
sas_ann_svm_exp1.xml
Keywords
perceptron
supervised learning
classification
Operators
Data Source
Drop
Filter
Graph Explore
Metadata
Neural Network
Support Vector Machine
2. Using artificial neural networks (ANN)
Description
In this experiment, a number of algorithms that can be used for training artificial neural networks are compared
on a binary classification task. The Spambase dataset is used. The classification accuracy of the
resulting classifiers is determined and the interpretation of the related graphs is reviewed.
Input
Spambase [UCI MLR]
Before fitting the models, the dataset is partitioned by the Data Partition operator in a 60/20/20 ratio into
training, validation, and test datasets.
Output
Firstly, a standard artificial neural network is fitted by the Neural Network operator, where the network topology
is defined as a multilayer perceptron with 3 hidden neurons in one hidden layer. The goodness-of-fit of the resulting
model can be verified using standard statistics (misclassification rate, the number of incorrectly classified cases)
and graphs (response and lift curves).
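The same topology can be sketched with scikit-learn's MLPClassifier. Since Spambase is not bundled with the library, a synthetic binary task of similar shape stands in for it here, so the numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for Spambase (4601 records, 57 input attributes).
X, y = make_classification(n_samples=4601, n_features=57, random_state=0)

# 60/20/20 partition into training, validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# One hidden layer with 3 neurons, as in the Neural Network operator setup.
mlp = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print(round(1 - mlp.score(X_valid, y_valid), 3))  # misclassification rate
```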
Figure 19.9. Fitting statistics of the multilayer perceptron
Figure 19.10. The classification matrix of the multilayer perceptron
Figure 19.11. The cumulative lift curve of the multilayer perceptron
In addition to the standard goodness-of-fit tests, we get results that are meaningful only for artificial neural
networks. These include the graph of the neuron weights and the training history, where the misclassification
rate is plotted as a function of the iteration for the training and validation datasets.
Figure 19.12. Weights of the multilayer perceptron
Figure 19.13. Training curve of the multilayer perceptron
Similar graphs are obtained for the other two neural network fitting operators, namely the DMNeural
operator and the AutoNeural operator. For the former, the exception is the following stepwise optimization
statistics.
Figure 19.14. Stepwise optimization statistics for DMNeural operator
Figure 19.15. Weights of the neural network obtained by the AutoNeural operator
Finally, the three models can be compared by the Model Comparison operator. As a result, we obtain the
following statistics and graphs.
Figure 19.16. Fitting statistics of neural networks
Figure 19.17. Classification charts of neural networks
Figure 19.18. Cumulative lift curves of neural networks
Figure 19.19. ROC curves of neural networks
Interpretation of the results
The above statistics and figures clearly show that the best model is the first artificial neural network with a multilayer perceptron architecture, where there is one hidden layer with 3 neurons.
Video
Workflow
sas_ann_svm_exp2.xml
Keywords
artificial neural network
supervised learning
classification
Operators
Data Source
Model Comparison
Neural Network
Data Partition
3. Using support vector machines (SVM)
Description
In this experiment, support vector machines (SVMs) are fitted to solve a binary
classification task using the Spambase dataset. The aim of this experiment is to compare different
kinds of SVMs defined by linear, polynomial, and other kernel functions. The classification accuracy of the resulting
classifiers is determined, and we review the interpretation of the statistics and graphs related to support vector
machines. The model fitting is carried out by the SVM operator.
Input
Spambase [UCI MLR]
Before fitting the models, the dataset is partitioned by the Data Partition operator in a 60/20/20 ratio into
training, validation, and test datasets.
Output
Firstly, a support vector machine is fitted using a linear kernel. The goodness-of-fit of the resulting model can be
checked by standard statistics (e.g. misclassification rate, the number of incorrectly classified cases) and graphs
(e.g. response and lift curves). These tools will be discussed later, during the comparison of the two models.
Besides these results, the SVM operator provides additional output, such as goodness-of-fit measures and the list of
support vectors, which is meaningful only for support vector machines.
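The kernel comparison can be sketched as follows; again a synthetic dataset stands in for Spambase, so the reported rates are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=1)

for kernel in ("linear", "poly"):
    clf = SVC(kernel=kernel, degree=3).fit(X_tr, y_tr)
    # misclassification rate and support-vector count for each kernel
    print(kernel, round(1 - clf.score(X_va, y_va), 3), clf.n_support_.sum())
```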
Figure 19.20. Fitting statistics for linear kernel SVM
Figure 19.21. The classification matrix of linear kernel SVM
Figure 19.22. Support vectors (partly) of linear kernel SVM
Figure 19.23. The distribution of Lagrange multipliers for linear kernel SVM
Secondly, a polynomial kernel SVM is fitted to the dataset and compared with the previous model to decide which one is
better. The parametrization of the SVM operator can be seen below.
Figure 19.24. The parameters of polynomial kernel SVM
Figure 19.25. Fitting statistics for polynomial kernel SVM
Figure 19.26. Classification matrix of polynomial kernel SVM
Figure 19.27. Support vectors (partly) of polynomial kernel SVM
The support vector machines with two different kernels (linear and polynomial) can be compared by the usual
statistical and graphical tools.
Figure 19.28. Fitting statistics for SVMs
Figure 19.29. The classification chart of SVMs
Figure 19.30. Cumulative lift curves of SVMs
Figure 19.31. Comparison of cumulative lift curves to the baseline and the optimal one
Figure 19.32. ROC curves of SVMs
Interpretation of the results
The above figures and statistics clearly show that the polynomial kernel support vector machine improves the
fit of the model over the linear kernel one. The misclassification rate is improved by 2 per cent, and the lift
and ROC curves also show a significant improvement. The cumulative lift curve shows a better model at the second
to third deciles, while the ROC curve also shows an improvement where the specificity is very close to 1.
Video
Workflow
sas_ann_svm_exp3.xml
Keywords
support vector machine (SVM)
supervised learning
classification
Operators
Data Source
Model Comparison
Data Partition
Support Vector Machine
Chapter 20. Classification Methods 5
Ensemble methods
1. Ensemble methods: Combination of classifiers
Description
This experiment presents ensemble classification using the Ensemble operator. With the help of
this operator, a better model can be built from separate supervised data mining models. In the experiment, an
ensemble classifier is constructed from a decision tree, a logistic regression, and a neural network classifier using
the average voting method. The resulting ensemble model is compared with a polynomial kernel SVM derived
by the SVM operator. For the evaluation, the misclassification rate on the Spambase dataset is used.
Input
Spambase [UCI MLR]
Before fitting the models, in the preprocessing step the dataset is partitioned by the Data Partition operator
in a 60/20/20 ratio into training, validation, and test datasets.
Output
Ensemble classifiers can be evaluated with the same tools as other supervised data mining models: statistics
such as the number of incorrectly classified cases and the misclassification rate, and graphs such as lift and
response curves.
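The average voting scheme corresponds to soft voting in scikit-learn; here is a hedged sketch with a synthetic stand-in for Spambase.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=2)

# Soft voting averages the predicted class probabilities of the members,
# mirroring the Ensemble operator's average voting method.
ens = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=5, random_state=2)),
                ("logit", LogisticRegression(max_iter=1000)),
                ("mlp", MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000,
                                      random_state=2))],
    voting="soft").fit(X_tr, y_tr)
print(round(1 - ens.score(X_va, y_va), 3))  # misclassification rate
```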
Figure 20.1. Fitting statistics of the ensemble classifier
Figure 20.2. The classification matrix of the ensemble classifier
Figure 20.3. The cumulative lift curve of the ensemble classifier
The resulting ensemble classifier is compared with a baseline polynomial kernel SVM. The statistics and graphs
of this comparison are summarized below.
Figure 20.4. Misclassification rates of the ensemble classifier and the SVM
Figure 20.5. Classification matrices of the ensemble classifier and the SVM
Figure 20.6. Cumulative lift curves of the ensemble classifier and the SVM
Figure 20.7. Cumulative lift curves of the ensemble classifier, the SVM and the best
theoretical model
Figure 20.8. ROC curves of the ensemble classifier and the SVM
Interpretation of the results
The experiment shows that by combining simple classifiers we can obtain a model competitive with a
supervised model such as the polynomial kernel SVM. The classification matrix clearly shows that the ensemble
model is better than the SVM, especially for the false positive cases. The cumulative lift curves
slightly favor the combined model, and its ROC curve passes slightly above the SVM's.
Video
Workflow
sas_ensemble_exp1.xml
Keywords
ensemble method
supervised learning
SVM
misclassification rate
ROC curve
classification
Operators
Data Source
Decision Tree
Ensemble
Model Comparison
Neural Network
Data Partition
Regression
Support Vector Machine
2. Ensemble methods: bagging
Description
This experiment shows the ensemble method of bagging (bootstrap aggregation), in which a better fitting model
can be built from supervised data mining models. Bagging samples the original training dataset, obtaining
several subsamples by the bootstrap method. On each of these subsamples a supervised model (a decision tree
in this experiment) is fitted, and a new model is obtained by aggregating the fitted models. In the experiment,
the bagging cycle is set to 10, i.e., 10 decision trees are fitted on 10 different subsamples. The results are
compared with a simple decision tree fitted to the entire training dataset. In the bagging method, the base
classifier is determined by the Ensemble operator, which is placed between the Start Groups and End Groups
operators. The size of the bagging cycle is set in the Start Groups operator.
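The scheme above can be sketched with scikit-learn's BaggingClassifier; synthetic data stands in for Spambase, so the scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=3)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=3)

# Reference model: a single tree on the whole training set.
single = DecisionTreeClassifier(random_state=3).fit(X_tr, y_tr)
# Bagging: 10 trees, each fitted on a bootstrap subsample, then aggregated.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=3),
                           n_estimators=10, random_state=3).fit(X_tr, y_tr)
print(round(single.score(X_va, y_va), 3), round(bagged.score(X_va, y_va), 3))
```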
Input
Spambase [UCI MLR]
In the preprocessing step, the dataset is partitioned by the Data Partition operator in a 60/20/20 ratio into
training, validation, and test datasets.
Output
Bagging classifiers can be evaluated with the same tools that are available for other supervised data mining
models: statistics (number of incorrectly classified cases, misclassification rate) and graphs (response and lift
curves). The only additional graph can be seen in the second figure below, where the errors of the 10 classifiers
obtained in the consecutive bagging cycles are plotted.
Figure 20.9. The classification matrix of the bagging classifier
Figure 20.10. The error curves of the bagging classifier
The obtained bagging classifier is compared with a reference decision tree that we fit on the whole training
dataset. The statistical and graphical results obtained are shown below.
Figure 20.11. Misclassification rates of the bagging classifier and the decision tree
Figure 20.12. Classification matrices of the bagging classifier and the decision tree
Figure 20.13. Response curves of the bagging classifier and the decision tree
Figure 20.14. Response curves of the bagging classifier and the decision tree comparing
the baseline and the optimal classifiers
Figure 20.15. ROC curves of the bagging classifier and the decision tree
Interpretation of the results
The experiment shows that the bagging classifier yields a better working model than a simple
decision tree when the models are compared on the first deciles. This is clear from the classification matrix,
the response curve, and the ROC curve.
Video
Workflow
sas_ensemble_exp2.xml
Keywords
ensemble method
supervised learning
bagging
misclassification rate
ROC curve
classification
Operators
Data Source
Decision Tree
End Groups
Model Comparison
Data Partition
Start Groups
3. Ensemble methods: boosting
Description
The experiment shows the ensemble method of boosting, in which a better fitting model can be built from
supervised data mining models. The method is based on the repeated reweighting of the records and the
classifiers in such a way that wrongly classified cases gain more and more importance, so that the method tries
to classify them into the right class. In boosting, a base classifier is selected, which can be a decision tree,
a logistic regression, a neural network, etc., and several copies of it, given by the boosting cycle, are built.
In this experiment, the base classifier is a decision tree, and the boosting cycle is set to 20, i.e.,
20 decision trees are fitted on the whole training dataset. The result is compared to a polynomial kernel
support vector machine (SVM), which is recognized as an effective method for binary classification tasks. In the
boosting method, the base classifier is determined by the Ensemble operator, which is placed between the
Start Groups and End Groups operators. The size of the boosting cycle is set in the Start Groups operator.
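A hedged sketch of the same idea uses AdaBoost, a standard record-reweighting boosting algorithm (the operator's exact algorithm may differ), with synthetic data standing in for Spambase.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=4)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=4)

# 20 reweighted decision trees: each round up-weights the cases the
# previous trees misclassified, as described above.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                           n_estimators=20, random_state=4).fit(X_tr, y_tr)
print(round(1 - boost.score(X_va, y_va), 3))  # misclassification rate
```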
Input
Spambase [UCI MLR]
In the preprocessing step, the dataset is partitioned by the Data Partition operator in a 60/20/20 ratio into
training, validation, and test datasets.
Output
Boosting classifiers can be evaluated with the same tools that are available for other supervised data mining
models: statistics (number of incorrectly classified cases, misclassification rate) and graphs (response and lift
curves). The only additional graph is the second figure, which shows the errors of the resulting classifiers,
20 decision trees in our case.
Figure 20.16. The classification matrix of the boosting classifier
Figure 20.17. The error curve of the boosting classifier
The obtained boosting classifier is compared with a reference polynomial kernel SVM that we fit on the whole
training dataset. The statistical and graphical results obtained are shown below.
Figure 20.18. Misclassification rates of the boosting classifier and the SVM
Figure 20.19. Classification matrices for the boosting classifier and the SVM
Figure 20.20. Cumulative response curves of the boosting classifier and the SVM
Figure 20.21. Response curves of the boosting classifier and the SVM comparing the
baseline and the optimal classifiers
Figure 20.22. ROC curves of the boosting classifier and the SVM
Interpretation of the results
The experiment shows that a classifier obtained by the boosting method is competitive even with a
polynomial kernel support vector machine in the sense that, although its misclassification rate is
worse, it has higher accuracy in the first few deciles. This can be seen clearly on the response and ROC
curves.
Video
Workflow
sas_ensemble_exp3.xml
Keywords
ensemble method
supervised learning
boosting
ROC curve
classification
Operators
Data Source
Decision Tree
End Groups
Model Comparison
Data Partition
Start Groups
Support Vector Machine
Chapter 21. Association mining
1. Extracting association rules
Description
The process presents, on the Extended Bakery dataset, how association rules can be obtained from a
transaction dataset. In transaction datasets, the emphasis is on the items that are part of a transaction
rather than on those that are absent. In Enterprise Miner™, such a dataset must be defined as a
transaction dataset, and it must include an ID variable and a nominal Target variable. The
separate market baskets are formed by the ID variable. Association rule mining (also called market basket
analysis) can be carried out by the Market Basket operator. First, the frequent itemsets are extracted; then, the
significant association rules are discovered based on these itemsets.
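The two phases can be illustrated on a toy transaction set with a brute-force pure-Python miniature (the Market Basket operator uses an efficient algorithm; the item names below are made up):

```python
# Phase 1: frequent itemsets; phase 2: rules filtered by confidence.
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "coffee"},
                {"bread", "milk", "coffee"}, {"milk", "coffee"},
                {"bread", "milk", "coffee"}]
min_support, min_conf = 0.4, 0.7

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset(c) for k in (1, 2, 3)
            for c in combinations(items, k)
            if support(set(c)) >= min_support]

# Rules "antecedent -> consequent" with confidence above the threshold.
rules = [(set(a), set(f - frozenset(a)), support(f) / support(set(a)))
         for f in frequent if len(f) > 1
         for a in combinations(f, len(f) - 1)
         if support(f) / support(set(a)) >= min_conf]
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```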
Input
Extended Bakery [Extended Bakery]
Output
Applying the Market Basket operator to the dataset of 20,000 records gives the following results.
Figure 21.1. List of items
Figure 21.2. The association rules as a function of the support and the confidence
Figure 21.3. Graph of lift values
Interpretation of the results
The significant association rules can be discovered on the basis of the frequent itemsets. We can select which
criteria to consider in order to find the appropriate rules. The default is the confidence of the rules, but other
criteria can also be applied to filter the discovered association rules. Based on the established rules, deeper
conclusions can be drawn about the relationships in the data. Several tools support this inference, e.g. the table
of association rules, in which the relevant rules can be filtered by different evaluation metrics such as support,
confidence, or lift.
Figure 21.4. List of association rules
Video
Workflow
sas_assoc_exp1.xml
Keywords
frequent itemset
association rule
transaction data
Operators
Data Source
Market Basket
Chapter 22. Clustering 1
Standard methods
1. K-means method
Description
The process shows the usage of the K-means clustering algorithm and illustrates the importance of the choice of
its various parameters on the Aggregation dataset. This clustering algorithm can be fitted using the Cluster
operator.
Input
Aggregation [SIPU Datasets] [Aggregation]
The dataset consists of 788 two-dimensional vectors, which form 7 separate groups. The task is to discover
these groups, which are called clusters. The difficulty of the task lies in the arrangement of the points, as
smaller and larger clouds of points are present at varying distances from each other. The visualization is done by
the Graph Explore operator.
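As a rough analogue of the Cluster operator run below, K-means with K=7 can be sketched with scikit-learn; synthetic blobs stand in for the Aggregation data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Seven synthetic two-dimensional groups in place of the Aggregation data.
X, _ = make_blobs(n_samples=788, centers=7, cluster_std=0.6, random_state=5)

km = KMeans(n_clusters=7, n_init=10, random_state=5).fit(X)
print(len(set(km.labels_)))        # number of clusters found
print(km.cluster_centers_.shape)   # 7 centers in 2 dimensions
```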
Figure 22.1. The Aggregation dataset.
Output
After reading the dataset, we drag and drop the Cluster operator and apply the following settings. The
user-specified cluster number option is chosen and, based on the above scatterplot, the number of clusters is set to 7.
Figure 22.2. The setting of the Cluster operator.
The result can also be displayed by the Graph Explore operator. One can see that the algorithm found the
upper and right clusters, but it performs poorly on the points below.
Figure 22.3. The result of K-means clustering when K=7
Let us try a different parameter setting, where the initial cluster centers are chosen by the MacQueen method
and the minimum distance between the cluster centers is set to 9.
Figure 22.4. The setting of the MacQueen clustering
One can see that the result is more accurate with this parameter setting; a larger error remains only at the
bottom-left set of points. This could only be corrected by a significantly more sophisticated clustering method.
Figure 22.5. The result of the MacQueen clustering
Finally, let us look at what happens if we request a slightly larger number of clusters, say 8. In this
case, the algorithm finds the two small clusters at the bottom left, but it cuts the large cluster beside them
into three parts and also splits the upper right cluster.
Figure 22.6. The result of the clustering with 8 clusters
Interpretation of the results
Based on the experiment, we can see that even the simplest clustering algorithm, the K-means method, is able
to extract simple relationships. Moreover, if we choose the parameters of the algorithm well, then the
accuracy of the results can be increased. In addition, the Cluster operator provides several visualization
functions that help to evaluate the results.
Figure 22.7. The result display of the Cluster operator
After selecting the Result menu, the main window shown in the figure above appears, in which a summary of the
results can be seen. At the top left, the cluster (segment) plot is shown as a function of the input attributes.
At the bottom left, a pie chart shows the sizes of the clusters. At the top right, the cluster statistics can be
read, while at the bottom right the output listing is shown. These windows can be enlarged individually. Among the
many other tools available, we want to point out the following two.
Figure 22.8. Scatterplot of the cluster means
The figure above shows the centers of the clusters along with the overall averages of the attributes. Finally, the
figure below shows the decision tree constructed from the clusters, which is obtained by treating the
resulting cluster variable as a classification target and solving the classification task by fitting a
decision tree.
Figure 22.9. The decision tree of the clustering
Video
Workflow
sas_clust_exp1.xml
Keywords
K-means
unsupervised learning
clustering
Operators
Cluster
Data Source
Graph Explore
2. Agglomerative hierarchical methods
Description
The process shows, using the Maximum Variance (R15) dataset, how agglomerative hierarchical clustering
algorithms work. These clustering algorithms can be run by the Cluster operator.
Input
Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]
The dataset contains 600 two-dimensional vectors, which are concentrated into 15 clusters. The points are
arranged around a center with coordinates (10,10), at increasing distances from each other as they get further
from the center. This is the difficulty of the task, as the clusters near the center nearly blend into each
other.
Figure 22.10. Scatterplot of the Maximum Variance (R15) dataset
Output
Firstly, the average linkage hierarchical method is applied. In this case, the distance between the clusters is
calculated by the algorithm as the average of the pairwise distances between the cluster elements. The results
are shown in the following figure.
Figure 22.11. The result of the average linkage hierarchical clustering
The goodness of the clustering can be assessed by plotting the original grouping attribute Class against the
Segment attribute, which contains the cluster membership obtained by the clustering, in a spatial bar chart. It
can be seen that, apart from a permutation, there is a one-to-one correspondence between the rows and the
columns, except for two records.
Figure 22.12. Evaluating of the clustering by 3D bar chart
Another hierarchical clustering method is the Ward method. Using this, we obtain the following results.
Figure 22.13. The result of Ward clustering
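Both linkages can be sketched with scikit-learn's AgglomerativeClustering; synthetic blobs stand in for the R15 data here.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# 15 synthetic two-dimensional groups in place of the R15 data.
X, y = make_blobs(n_samples=600, centers=15, cluster_std=0.4, random_state=6)

for linkage in ("average", "ward"):
    model = AgglomerativeClustering(n_clusters=15, linkage=linkage).fit(X)
    print(linkage, len(set(model.labels_)))  # clusters found per linkage
```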
Interpretation of the results
The process demonstrates that if the number of possible clusters is relatively large, it is worth choosing one
of the automatic clustering procedures. In SAS® Enterprise Miner™, hierarchical clustering is available
for this purpose in several different ways. The experiment also shows that the choice of the agglomerative
method does not always affect the resulting clusters. SAS proposes the number of clusters by investigating the
CCC (cubic clustering criterion) graph; see the figure below.
Figure 22.14. CCC plot of automatic clustering
In addition, a schematic display of the location of the clusters, the so-called proximity diagram, is also obtained,
which is clearly similar to the previously obtained scatterplot of the clusters.
Figure 22.15. Proximity graph of the automatic clustering
Video
Workflow
sas_clust_exp2.xml
Keywords
hierarchical methods
average linkage
Ward method
CCC graph
clustering
Operators
Cluster
Data Source
Graph Explore
3. Comparison of clustering methods
Description
The experiment presents, on the Maximum Variance (D31) dataset, the difference between automatic
clustering and clustering where the number of clusters is specified by the user. The Cluster
operator is used in the experiment.
Input
Maximum Variance (D31)
The dataset consists of 3100 two-dimensional vectors, which are grouped into 31 clusters.
Figure 22.16. The Maximum Variance (D31) dataset
Output
Firstly, automatic clustering is performed with the Class attribute ignored. The algorithm finds 31
clusters, which agrees with the original number of clusters. The resulting clusters are shown in the following figure.
Figure 22.17. The result of automatic clustering
The correctness of the resulting cluster number is clearly shown by the CCC plot.
Figure 22.18. The CCC plot of automatic clustering
The schematic arrangement of the clusters is shown by the proximity graph below.
Figure 22.19. The proximity graph of the automatic clustering
Based on the CCC chart, a cluster model with 9 clusters can also be tried. This can be done by Ward's
version of the K-means algorithm. The resulting scatterplot and proximity graph of the clusters are shown
in the following two figures.
Figure 22.20. The result of K-means clustering
Figure 22.21. The proximity graph of K-means clustering
Then, by so-called segment profiling, the resulting clusters can be investigated from the point of view of
how the input variables determine them.
Figure 22.22. The profile of the segments (clusters)
Interpretation of the results
The experiment shows that automatic clustering is able to find the correct number of clusters even for a
relatively large number of closely spaced but spherical groups. If we consider this number too large, it can be
reduced to a reasonable size by analyzing the CCC graph and searching for a suitable breakpoint.
Video
Workflow
sas_clust_exp3.xml
Keywords
automatic clustering
K-means
cluster profiling
CCC graphs
clustering
Operators
Cluster
Data Source
Graph Explore
MultiPlot
Segment Profile
Chapter 23. Clustering 2
Advanced methods
1. Clustering attributes before fitting SVM
Description
The process demonstrates how to cluster the attributes by using the Variable Clustering operator when there
are many attributes in the dataset. The process uses the Spambase dataset. After clustering the attributes,
further supervised data mining methods can be applied, e.g., we may classify the e-mails into spam and non-spam classes.
Input
Spambase [UCI MLR]
The dataset contains 4601 records and 58 attributes. The records are classified into 2 groups by the Class
variable, which identifies the spam e-mails, i.e., its value equals 1 if the record is spam and 0 otherwise. The
challenge in this dataset is the relatively large number of attributes, which slows down the training
process. The experiment shows that, after a suitable clustering of the attributes, a model can be obtained
that is competitive with models fitted on the whole attribute set.
Output
During attribute clustering, the columns of the dataset are clustered by a hierarchical method to reduce the dimension of the dataset. The most important parameter of the Variable Clustering operator is Maximum Cluster, which can be used to adjust the maximal number of clusters. Similar parameters are the maximal number of eigenvalues and the explained variance. You can also choose between the correlation and the covariance matrix in the analysis. One of the most important results is the dendrogram, which visualizes the process of the hierarchical clustering.
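Outside Enterprise Miner™, the idea behind this step can be sketched in a few lines: attributes are grouped by a hierarchical method using a correlation-based distance, and each group is then replaced by a single representative column. The synthetic data, the two-cluster cut, and the mean-column representatives below are illustrative assumptions, not part of the Spambase experiment.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Illustrative data: six attributes that are noisy copies of two latent factors.
f1, f2 = rng.normal(size=(2, 200))
X = np.column_stack([f1 + 0.1 * rng.normal(size=200) for _ in range(3)] +
                    [f2 + 0.1 * rng.normal(size=200) for _ in range(3)])

# Distance between attributes: highly correlated attributes are "close".
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
condensed = dist[np.triu_indices_from(dist, k=1)]

# Hierarchical clustering of the columns; cut the dendrogram at 2 clusters
# (the analogue of the maximal-number-of-clusters parameter).
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

# Reduce the dimension: one mean column per attribute cluster.
X_reduced = np.column_stack([X[:, labels == k].mean(axis=1)
                             for k in np.unique(labels)])
```

Any supervised model can then be trained on `X_reduced` instead of `X`.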
Figure 23.1. The dendrogram of attribute clustering
The relationship between the original attributes and the obtained clusters is depicted on the following graph.
Figure 23.2. The graph of clusters and attributes
The list of cluster membership, i.e., the set of attributes belonging to clusters, respectively, can be seen on the
following figure.
Figure 23.3. The cluster membership
In creating the clusters, the correlation (or covariance) between the original attributes plays the most important role: attributes that are highly correlated with each other end up in the same cluster. This is displayed in the following figure.
Figure 23.4. The correlation plot of the attributes
It can also be investigated how strongly each original variable correlates with the new cluster variables obtained. The following figure shows the correlation bar chart of the variable representing the dollar special character.
Figure 23.5. The correlation between clusters and an attribute
After the attribute clustering, an SVM model was fitted to the binary Class variable by using the 19 new cluster attributes obtained. Then, the results obtained in this way were compared with a similar model fitted directly to the original 58 attributes. The results below show that the two models have similar performance. The classification bar charts show a similar classification matrix.
Figure 23.6. Classification charts of SVM models
The response curve behaves better in some places on the clustered attributes than on the original ones.
Figure 23.7. The response curve of SVM models
If the cumulative lift functions are compared considering the baseline and the best lift functions, similar
behavior can be seen.
Figure 23.8. Cumulative lift curves of the SVM models
Finally, the ROC curves are very similar to each other.
Figure 23.9. The ROC curves of SVM models
Interpretation of the results
If a supervised data mining model has a large number of input attributes, which makes training very slow, then it is worth reducing the dimension by clustering the input attributes. The explanatory power of the resulting model is usually not much worse than that of the model fitted to the original attributes.
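This conclusion can be reproduced in miniature. In the hedged sketch below the data are synthetic (two blocks of correlated attributes driven by two latent factors) and the cluster attributes are simply block means; the point is only that the reduced model's accuracy stays close to the full model's.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 400
f = rng.normal(size=(n, 2))                    # two informative latent factors
# Ten attributes: five noisy copies of each factor.
X = np.repeat(f, 5, axis=1) + 0.2 * rng.normal(size=(n, 10))
y = (f[:, 0] + f[:, 1] > 0).astype(int)

# Cluster attributes: one mean column per block of correlated inputs.
X_red = np.column_stack([X[:, :5].mean(axis=1), X[:, 5:].mean(axis=1)])

acc = {}
for name, data in [("original", X), ("clustered", X_red)]:
    Xtr, Xte, ytr, yte = train_test_split(data, y, test_size=0.25,
                                          random_state=0)
    acc[name] = SVC(kernel="linear").fit(Xtr, ytr).score(Xte, yte)
```

On data of this kind the two accuracies are close, while the clustered model trains on one fifth of the columns.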
Video
Workflow
sas_clust2_exp1.xml
Keywords
attribute clustering
dendrogram
hierarchical methods
clustering
ROC curve
SVM
Operators
Data Source
Model Comparison
Support Vector Machine
2. Self-organizing maps (SOM) and vector
quantization (VQ)
Description
The process presents Kohonen's vector quantization (VQ) and self-organizing map (SOM) algorithms using the Maximum Variance (R15) dataset. These models can be fitted by the SOM/Kohonen operator.
Input
Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]
The dataset consists of 600 two-dimensional records, which are grouped into 15 groups. The points are located around the point with coordinates (10, 10), and the farther a group lies from this center, the better it is separated from the others. The difficulty of the task is that the groups around the center almost merge. In the figure below these points are depicted by coloring the different groups.
Figure 23.10. The scatterplot of the Maximum Variance (R15) dataset
Output
First, the method of Kohonen's vector quantization is used. This method yielded 10 clusters. The results can be seen in the figure below.
Figure 23.11. The result of Kohonen's vector quantization
The size of clusters can be depicted by a simple pie chart.
Figure 23.12. The pie chart of cluster size
A table displays the statistics that characterize the clusters, among others the cluster frequencies, the within-cluster standard deviations, the maximum distance from the cluster centers, and the index of the nearest cluster together with the distance to it.
Figure 23.13. Statistics of clusters
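Statistics of this kind can be computed directly from any labeled dataset. The numpy sketch below is a hedged illustration on synthetic blobs; the three centers and the fixed labels are assumptions, not the R15 result.

```python
import numpy as np

rng = np.random.default_rng(2)
true_centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
# Three synthetic clusters of 50 points each.
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in true_centers])
labels = np.repeat([0, 1, 2], 50)

centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
stats = {}
for k in range(3):
    pts = X[labels == k]
    d = np.linalg.norm(pts - centers[k], axis=1)       # distances to own center
    other = np.linalg.norm(centers - centers[k], axis=1)
    other[k] = np.inf                                  # ignore the cluster itself
    stats[k] = {
        "frequency": len(pts),
        "std": float(d.std()),
        "max_distance": float(d.max()),
        "nearest_cluster": int(other.argmin()),
        "nearest_distance": float(other.min()),
    }
```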
Then, the batch SOM algorithm is applied to the same dataset. In this case, the numbers of row and column segments must be specified; here 6 was chosen. The results are shown in the following two figures. The first one is the schematic graph of the SOM/Kohonen operator on the resulting net, where the coloring shows the frequency of each cell.
Figure 23.14. Graphical representation of the SOM
The second figure is a scatterplot which displays the resulting clusters in the coordinate system of the original input attributes.
Figure 23.15. Scatterplot of the result of SOM
Interpretation of the results
The experiment shows how to use two unsupervised data mining techniques, vector quantization and self-organizing maps. The two methods are particularly effective for examining 2-dimensional data. However, being important prototype methods, they can greatly simplify further analysis in higher dimensions too.
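The core of the vector quantization step can be sketched without the SOM/Kohonen operator: prototypes are moved toward the winning observation with a decaying learning rate. The three-blob data and the one-prototype-per-region initialization are illustrative assumptions chosen to keep the sketch stable; a production implementation would also handle initialization and dead units.

```python
import numpy as np

rng = np.random.default_rng(3)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
# Toy data: three well-separated Gaussian blobs of 100 points each.
X = np.vstack([c + 0.3 * rng.normal(size=(100, 2)) for c in blob_centers])

# One prototype per region (the first point of each blob) for stability.
codebook = X[[0, 100, 200]].copy()
for epoch in range(20):
    lr = 0.5 / (1 + epoch)                             # decaying learning rate
    for x in rng.permutation(X):
        w = np.argmin(((codebook - x) ** 2).sum(axis=1))   # winning unit
        codebook[w] += lr * (x - codebook[w])          # move the winner toward x
```

After training, each prototype sits near the mean of its blob; a SOM differs mainly in also updating the grid neighbors of the winner.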
Video
Workflow
sas_clust2_exp2.xml
Keywords
vector quantization (VQ)
self-organizing map (SOM)
clustering
Operators
Data Source
Graph Explore
Self-organizing Map
Chapter 24. Regression for
continuous target
1. Logistic regression
Description
The process shows, using the Spambase dataset, how a regression model can be fitted to a dataset which has a binary target. Conventional linear regression is not suitable for this task, even though the Regression operator offers this option. Instead, we must use the logistic regression method, which is the default option of this operator. We can choose between the following link functions: logit, from which the procedure takes its name, probit, and complementary log-log. There is no significant difference among these link functions. Enterprise Miner™ provides another operator for fitting regression: with the Dmine Regression operator a forward stepwise regression can be fitted. In each step, the input variable is selected that contributes most significantly to the variability of the target.
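The logit-link idea can be illustrated with a small sketch outside Enterprise Miner™; the synthetic data and the coefficients 1.5 and −2.0 below are assumptions for illustration, not the Spambase model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 4))
# Binary target generated through the logit link from a linear score.
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 2.0 * X[:, 1])))
y = (rng.uniform(size=n) < p).astype(int)

model = LogisticRegression().fit(X, y)
cm = confusion_matrix(y, model.predict(X))     # the "classification matrix"
acc = model.score(X, y)
```

The confusion matrix `cm` plays the same role as the classification matrix reported by the operator.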
Input
Spambase [UCI MLR]
Output
After fitting the logistic regression, standard statistics and graphs are obtained, similarly to the binary classification tasks. Here, only the confusion matrix is shown; the rest of the comparison tools are left to the end of this experiment.
Figure 24.1. Classification matrix of the logistic regression
In addition to the usual tools, the regression operators also show, via the effects plot, the importance of the input variables in the regression model that was built during the process.
Figure 24.2. Effects plot of the logistic regression
In addition to the traditional regression analysis, Enterprise Miner™ offers another operator to fit forward stepwise regression: the Dmine Regression operator. The results can be seen in the figures below.
Figure 24.3. Classification matrix of the stepwise logistic regression
Figure 24.4. Effects plot of the stepwise logistic regression
The two regressions can be compared by the usual way with the Model Comparison operator. The results of
this comparison are presented in the following figures.
Figure 24.5. Fitting statistics for logistic regression models
Figure 24.6. Classification charts of the logistic regression models
Figure 24.7. Cumulative lift curves of the logistic regression models
Figure 24.8. ROC curves of the logistic regression models
Interpretation of the results
The fit statistics and ROC curves on the test set clearly show that the logistic regression model is better than the stepwise logistic regression model.
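The forward stepwise idea behind the Dmine Regression operator can be sketched as a greedy loop: at each step, add the input variable that most improves the fit, and stop when no candidate helps. The data, the training-accuracy score, and the stopping threshold below are illustrative assumptions; the operator itself uses a significance-based criterion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, d = 400, 6
X = rng.normal(size=(n, d))
# Only attributes 0 and 2 actually drive the binary target.
y = (X[:, 0] - X[:, 2] + 0.3 * rng.normal(size=n) > 0).astype(int)

selected, remaining, best = [], list(range(d)), 0.0
while remaining:
    # Score each candidate by the fit of the enlarged model.
    scores = {j: LogisticRegression()
                 .fit(X[:, selected + [j]], y)
                 .score(X[:, selected + [j]], y)
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best + 1e-3:          # stop: no meaningful gain
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best = scores[j_best]
```

On this toy data the two informative attributes are picked up first.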
Video
Workflow
sas_regr_exp1.xml
Keywords
classification
binary target
logistic regression
Operators
Data Source
Dmine Regression
Model Comparison
Data Partition
Regression
2. Prediction of discrete target by regression models
Description
The process shows, using the Wine dataset, how a regression model can be fitted to a dataset containing a discrete but non-binary target variable, and how a classification task can be performed using the parameter estimates obtained from the model. The type of the fitted regression model depends on the measurement scale of the discrete target variable. If the target is nominal, then the Regression operator fits separate binary logistic models, in which a selected event of the target is compared to the class of the other values of the discrete target variable. On the other hand, if the target is ordinal, then a common logistic regression model is fitted, in which only the constant parameters differ while the parameters of the input variables are shared. (As opposed to the nominal case, where these coefficients are different.)
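The nominal case can be illustrated with a short multinomial sketch: one coefficient vector is estimated per class, as described above for a nominal target. The three-class synthetic data are an assumption, and scikit-learn stands in for the Regression operator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 600
X = rng.normal(size=(n, 2))
# Three nominal classes defined by regions of the plane.
y = np.where(X[:, 0] + X[:, 1] > 0.8, 2,
             np.where(X[:, 0] - X[:, 1] > 0.8, 1, 0))

clf = LogisticRegression(max_iter=1000).fit(X, y)
acc = clf.score(X, y)
n_coef_vectors = clf.coef_.shape[0]    # one coefficient vector per class
```

With three target values there are three coefficient vectors here; an ordinal model would instead share one slope vector and vary only the intercepts.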
Input
Wine [UCI MLR]
Output
A number of model types can be chosen to fit a regression model, e.g., linear or logistic regression. Among them, logistic regression is used in this process. There is no need to set it explicitly: the software recognizes the right type of regression using the metadata of the target. Of course, it is possible to override this option and to enforce linear regression, but this does not make sense, because in that case the Model Comparison operator cannot be used to compare this model to other discrete supervised models. The fit of the model can be tested by the well-known statistics and charts.
Figure 24.9. Classification matrix of the logistic regression
The classification chart shows that the fitted model is perfect on the training dataset and has a small error on the validation dataset.
Figure 24.10. The classification chart of the logistic regression
Besides the standard goodness-of-fit tests, the Regression operator presents a bar graph showing the importance of the input variables in the regression models. The higher the coefficient of an input variable, the greater its explanatory power with respect to the target variable. In the case of an ordinal target there is only one bar graph, while in the case of a nominal target one fewer bar graphs are created than the number of distinct values of the target variable. Since the Class variable has three values, there are two such bar graphs below.
Figure 24.11. The effects plot of the logistic regression
Interpretation of the results
Applying the regression model created on the training set to the validation dataset, the above results show that the Regression operator can build a model with relatively high accuracy for a discrete target variable with multiple values. Note that the Dmine Regression operator cannot be applied to the problem discussed here.
Video
Workflow
sas_regr_exp2.xml
Keywords
classification
nominal and ordinal target
logistic regression
Operators
Data Source
Data Partition
Regression
3. Supervised models for continuous target
Description
The process presents, using the Concrete Compressive Strength dataset, how supervised data mining models can be fitted to datasets with a continuous target. The Regression operator, the Decision Tree operator, and the Neural Network operator can all be used for this task. With the Regression operator a linear regression model can be fitted. Decision trees are given by the Decision Tree operator. Finally, the result of the Neural Network operator is a neural network that minimizes a predefined error function on the validation dataset. In each case, a target variable with continuous level metadata must be selected.
Input
Concrete Compressive Strength [UCI MLR] [Concrete]
In the preprocessing step, the dataset is partitioned into training, validation, and test datasets in a 60/20/20 percent split.
Output
The tools for comparing the resulting models differ significantly from those applied for a discrete or binary target. Among the statistical indicators, we can use various information criteria (e.g., AIC, SBC) or the mean square error and its square root. In this case there are no such graphical tools as the lift curve and the classification chart. Instead, we can use graphs showing the averages of the predictions.
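A root-mean-square-error comparison of the three model types can be sketched as follows; the synthetic nonlinear data and the small model settings are assumptions standing in for the Concrete experiment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
n = 500
X = rng.uniform(-2, 2, size=(n, 3))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=n)   # mildly nonlinear target

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "neural network": MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000,
                                   random_state=0),
}
# RMSE of each model on the held-out test set.
rmse = {name: mean_squared_error(yte, m.fit(Xtr, ytr).predict(Xte)) ** 0.5
        for name, m in models.items()}
```

On data with a squared term, the linear model cannot capture the nonlinearity, so the tree and the network come out ahead.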
Figure 24.12. Statistics of the fitted models on the test dataset
The figure below shows that the neural network and the linear regression behave fairly similarly.
Figure 24.13. Comparison of the fitted models by means of predictions
Figure 24.14. The observed and predicted means plot
The curves below indicate a better fit the closer they lie to the diagonal straight line.
Figure 24.15. The model scores
In addition to the comparisons, we can also examine the individual models one by one. Below you can see the constructed decision tree, which was created using F-statistics.
Figure 24.16. The decision tree for continuous target
For the neural network model the weights of neurons can be visualized.
Figure 24.17. The weights of the neural network after training
Interpretation of the results
Both the statistics and graphics show that the neural network model fits the best.
Video
Workflow
sas_regr_exp3.xml
Keywords
supervised learning
continuous target
decision tree
linear regression
neural network
Operators
Data Source
Decision Tree
Model Comparison
Neural Network
Data Partition
Regression
Chapter 25. Anomaly detection
1. Detecting outliers
Description
The process presents, using the Concrete Compressive Strength dataset, how outliers can be filtered out by the Filter operator according to various criteria. The applied criteria involve the mean square deviation from the mean or the mean absolute deviation; another possibility is the modal center. In the experiment, the records which differ from the mean by more than twice the standard deviation are filtered out.
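The two-standard-deviation rule of the experiment can be sketched in a few lines; the data and the two planted outliers below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
# 200 ordinary values plus two planted outliers.
x = np.concatenate([rng.normal(50.0, 5.0, size=200), [120.0, -40.0]])

mu, sigma = x.mean(), x.std()
mask = np.abs(x - mu) <= 2.0 * sigma    # keep records within two std devs
filtered = x[mask]
```

Note that the mean and standard deviation are computed on the contaminated data, so for heavily contaminated samples a robust location/scale estimate (such as the modal center mentioned above) is preferable.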
Input
Concrete Compressive Strength [UCI MLR] [Concrete]
Output
As shown, the above setting can filter out a significant number of outliers.
Figure 25.1. Statistics before and after filtering outliers
Figure 25.2. The predicted mean based on the two decision trees
Figure 25.3. The tree map of the best model
Interpretation of the results
The following comparison shows that, after the filtering, the error of the fitted decision tree is significantly smaller than the error of the decision tree fitted on the full dataset. Thus, in suitable cases, the removal of outliers can improve the performance of supervised models.
Figure 25.4. Comparison of the two fitted decision trees
Video
Workflow
sas_anomaly_exp1.xml
Keywords
outliers
preprocessing
data cleaning
Operators
Data Source
Decision Tree
Filter
Graph Explore
Model Comparison
Bibliography
Data Sets
[Aggregation] Aristides Gionis, Heikki Mannila, Panayiotis Tsaparas. “Clustering Aggregation”. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):4:1–4:30, 2007.
[Compound] C. T. Zahn. “Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters”. IEEE Transactions on Computers, 20(1):68–86, 1971.
[CoIL Challenge 2000] Peter van der Putten, Maarten van Someren. CoIL Challenge 2000: The Insurance Company Case. Sentient Machine Research, Amsterdam, 2000. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09.
[Concrete] I-Cheng Yeh. “Modeling of strength of high performance concrete using artificial neural networks”. Cement and Concrete Research, 28(12):1797–1808, 1998.
[Detrano et al.] R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher. “International application of a new probability algorithm for the diagnosis of coronary artery disease”. American Journal of Cardiology, 64(5):304–310, 1989.
[Extended Bakery] Alex Dekhtyar, Jacob Verburg. Extended Bakery Dataset. 2009.
[Flame] Limin Fu, Enzo Medico. “FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data”. BMC Bioinformatics, 8:3, 2007.
[Jain] Anil K. Jain, Martin H. C. Law. “Data Clustering: A User's Dilemma”. Lecture Notes in Computer Science, vol. 3776, pp. 1–10. Springer, 2005.
[Maximum Variance] C. J. Veenman, M. J. T. Reinders, E. Backer. “A Maximum Variance Cluster Algorithm”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1273–1280, 2002.
[SIPU Datasets] Clustering datasets. Speech and Image Processing Unit, School of Computing, University of Eastern Finland.
[StatLib] Mike Meyer, Pantelis Vlachos. StatLib—Datasets Archive. 2005.
[Titanic] Robert J. MacG. Dawson. “The ‘Unusual Episode’ Data Revisited”. Journal of Statistics Education, 3(3), 1995.
[Two Spirals] Kevin J. Lang, Michael J. Witbrock. “Learning to Tell Two Spirals Apart”. In David Touretzky, Geoffrey Hinton, Terrence Sejnowski (eds.), Proceedings of the 1988 Connectionist Models Summer School, pp. 52–59. Morgan Kaufmann, 1988.
[UCI MLR] K. Bache, M. Lichman. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013.
Other
[DMBOOK] Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison-Wesley, 2005.
[LIBSVM] Chih-Chung Chang, Chih-Jen Lin. “LIBSVM: A Library for Support Vector Machines”. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[Neural Network FAQ] Warren S. Sarle. Neural Network FAQ, part 3 of 7: Generalization. 1997. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.
[RapidMiner Manual] RapidMiner 5.0 Manual. Rapid-I GmbH, 2010.
[SAS Enterprise Miner] Getting Started with SAS® Enterprise Miner™. SAS Institute Inc., 2011.
[SAS Enterprise Miner Ref] SAS® Enterprise Miner™: Reference Help. SAS Institute Inc., 2011.