Case Studies in Data Mining András Fülöp, e-Ventures Ltd. <[email protected]> László Gonda, University of Debrecen <[email protected]> Dr.. Márton Ispány, University of Debrecen <[email protected]> Dr.. Péter Jeszenszky, University of Debrecen <[email protected]> Dr.. László Szathmáry, University of Debrecen <[email protected]> Created by XMLmind XSL-FO Converter. Case Studies in Data Mining by András Fülöp, László Gonda, Dr.. Márton Ispány, Dr.. Péter Jeszenszky, and Dr.. László Szathmáry Publication date 2014 Copyright © 2014 Faculty of Informatics, University of Debrecen A tananyag a TÁMOP-4.1.2.A/1-11/1-2011-0103 azonosítójú pályázat keretében valósulhatott meg. Created by XMLmind XSL-FO Converter. Table of Contents Preface ................................................................................................................................................ ii 1. How to use this material ....................................................................................................... iv I. Data mining tools ............................................................................................................................ 6 1. Commercial data mining softwares ....................................................................................... 7 2. Free and shareware data mining softwares .......................................................................... 11 II. RapidMiner .................................................................................................................................. 13 3. Data Sources ....................................................................................................................... 16 1. Importing data from a CSV file ................................................................................. 16 2. Importing data from an Excel file .............................................................................. 17 3. 
Creating an AML file for reading a data file ............................................................. 19 4. Importing data from an XML file .............................................................................. 21 5. Importing data from a database ................................................................................. 23 4. Pre-processing ..................................................................................................................... 25 1. Managing data with issues - Missing, inconsistent, and duplicate values ................. 25 2. Sampling and aggregation ......................................................................................... 27 3. Creating and filtering attributes ................................................................................. 31 4. Discretizing and weighting attributes ........................................................................ 35 5. Classification Methods 1 ..................................................................................................... 41 1. Classification using a decision tree ............................................................................ 41 2. Under- and overfitting of a classification with a decision tree .................................. 46 3. Evaluation of performance for classification by decision tree ................................... 51 4. Evaluation of performance for classification by decision tree 2 ................................ 55 5. Comparison of decision tree classifiers ..................................................................... 58 6. Classification Methods 2 ..................................................................................................... 65 1. Using a rule-based classifier (1) ................................................................................ 65 2. Using a rule-based classifier (2) ................................................................................ 66 3. 
Transforming a decision tree to an equivalent rule set .............................................. 68 7. Classification Methods 3 ..................................................................................................... 71 1. Linear regression ....................................................................................................... 71 2. Osztályozás lineáris regresszióval ............................................................................. 73 3. Evaluation of performance for classification by regression model ............................ 76 4. Evaluation of performance for classification by regression model 2 ......................... 79 8. Classification Methods 4 ..................................................................................................... 84 1. Using a perceptron for solving a linearly separable binary classification problem ... 84 2. Using a feed-forward neural network for solving a classification problem ............... 85 3. The influence of the number of hidden neurons to the performance of the feed-forward neural network ............................................................................................................... 87 4. Using a linear SVM for solving a linearly separable binary classification problem .. 88 5. The influence of the parameter C to the performance of the linear SVM (1) ............ 90 6. The influence of the parameter C to the performance of the linear SVM (2) ............ 93 7. The influence of the parameter C to the performance of the linear SVM (3) ............ 95 8. The influence of the number of training examples to the performance of the linear SVM 97 9. Solving the two spirals problem by a nonlinear SVM ............................................. 100 10. The influence of the kernel width parameter to the performance of the RBF kernel SVM 101 11. Search for optimal parameter values of the RBF kernel SVM .............................. 103 12. 
Using an SVM for solving a multi-class classification problem ............................ 105 13. Using an SVM for solving a regression problem ................................................... 106 9. Classification Methods 5 ................................................................................................... 110 1. Introducing ensemble methods: the bagging algorithm ........................................... 110 2. The influence of the number of base classifiers to the performance of bagging ...... 111 3. The influence of the number of base classifiers to the performance of the AdaBoost method ...................................................................................................................................... 113 4. The influence of the number of base classifiers to the performance of the random forest 115 iii Created by XMLmind XSL-FO Converter. Case Studies in Data Mining 10. Association rules ............................................................................................................. 1. Extraction of association rules ................................................................................. 2. Asszociációs szabályok kinyerése nem tranzakciós adathalmazból ........................ 3. Evaluation of performance for association rules ..................................................... 4. Performance of association rules - Simpson's paradox ............................................ 11. Clustering 1 ..................................................................................................................... 1. K-means method ...................................................................................................... 2. K-medoids method .................................................................................................. 3. The DBSCAN method ............................................................................................. 4. 
Agglomerative methods ........................................................................................... 5. Divisive methods ..................................................................................................... 12. Clustering 2 ..................................................................................................................... 1. Support vector clustering ......................................................................................... 2. Choosing parameters in clustering ........................................................................... 3. Cluster evaluation .................................................................................................... 4. Centroid method ...................................................................................................... 5. Text clustering ......................................................................................................... 13. Anomaly detection .......................................................................................................... 1. Searching for outliers ............................................................................................... 2. Unsupervised search for outliers ............................................................................. 3. Unsupervised statistics based anomaly detection .................................................... III. SAS® Enterprise Miner™ ....................................................................................................... 14. Data Sources ................................................................................................................... 1. Reading SAS dataset ............................................................................................... 2. Importing data from a CSV file ............................................................................... 3. 
Importing data from a Excel file .............................................................................. 15. Preprocessing .................................................................................................................. 1. Constructing metadata and automatic variable selection ......................................... 2. Vizualizing multidimensional data and dimension reduction by PCA .................... 3. Replacement and imputation ................................................................................... 16. Classification Methods 1 ................................................................................................. 1. Classification by decision tree ................................................................................. 2. Comparison and evaluation of decision tree classifiers ........................................... 17. Classification Methods 2 ................................................................................................. 1. Rule induction to the classification of rare events ................................................... 18. Classification Methods 3 ................................................................................................. 1. Logistic regression ................................................................................................... 2. Prediction of discrete target by regression models .................................................. 19. Classification Methods 4 ................................................................................................. 1. Solution of a linearly separable binary classification task by ANN and SVM ........ 2. Using artificial neural networks (ANN) .................................................................. 3. Using support vector machines (SVM) ................................................................... 20. 
Classification Methods 5 ................................................................................................. 1. Ensemble methods: Combination of classifiers ....................................................... 2. Ensemble methods: bagging .................................................................................... 3. Ensemble methods: boosting ................................................................................... 21. Association mining .......................................................................................................... 1. Extracting association rules ..................................................................................... 22. Clustering 1 ..................................................................................................................... 1. K-means method ...................................................................................................... 2. Agglomerative hierarchical methods ....................................................................... 3. Comparison of clustering methods .......................................................................... 23. Clustering 2 ..................................................................................................................... 1. Clustering attributes before fitting SVM ................................................................. 2. Self-organizing maps (SOM) and vector quantization (VQ) ................................... 24. Regression for continuous target ..................................................................................... 1. Logistic regression ................................................................................................... 2. Prediction of discrete target by regression models .................................................. 3. Supervised models for continuous target ................................................................. 25. 
Anomaly detection .......................................................................................................... iv Created by XMLmind XSL-FO Converter. 118 118 121 126 131 135 135 137 140 142 144 148 148 151 154 159 161 165 165 167 171 176 178 178 180 183 185 185 188 191 196 196 200 208 208 212 212 217 221 221 225 232 240 240 244 249 256 256 260 260 267 271 278 278 284 289 289 294 297 304 Case Studies in Data Mining 1. Detecting outliers ..................................................................................................... 304 Bibliography ................................................................................................................................... 307 v Created by XMLmind XSL-FO Converter. List of Figures 3.1. Metadata of the resulting ExampleSet. ...................................................................................... 16 3.2. A small excerpt of the resulting ExampleSet. ............................................................................ 17 3.3. Metadata of the resulting ExampleSet. ...................................................................................... 18 3.4. A small excerpt of the resulting ExampleSet. ............................................................................ 18 3.5. The resulting AML file. ............................................................................................................. 20 3.6. A small excerpt of The World Bank: Population (Total) data set used in the exepriment. ........ 21 3.7. Metadata of the resulting ExampleSet. ...................................................................................... 22 3.8. A small excerpt of the resulting ExampleSet. ............................................................................ 22 3.9. Metadata of the resulting ExampleSet. ...................................................................................... 23 3.10. A small excerpt of the resulting ExampleSet. 
.......................................................................... 24 4.1. Graphic representation of the global and kitchen power consumption in time .......................... 25 4.2. Possible outliers based on the hypothesized hbits of the members of the household ................ 26 4.3. Filtering of the possible values using a record filter .................................................................. 26 4.4. Selection of aggregate functions for attributes .......................................................................... 28 4.5. Preferences for dataset sampling ............................................................................................... 29 4.6. Preferences for dataset filtering ................................................................................................. 29 4.7. Resulting dataset after dataset sampling .................................................................................... 30 4.8. Resulting dataset after dataset filtering ...................................................................................... 30 4.9. Defining a new attribute based on an expression relying on existing attributes ........................ 32 4.10. Properties of the operator used for removing the attributes made redundant .......................... 33 4.11. Selection of the attributes to remain in the dataset with reduced size ...................................... 33 4.12. The appearance of the derived attribute in the altered dataset ................................................. 34 4.13. Selection of the appropriate discretization operator ................................................................ 36 4.14. Setting the properties of the discretization operator ................................................................ 37 4.15. Selection of the appropriate weighting operator ...................................................................... 37 4.16. 
Defining the weights of the individual attributes ..................................................................... 38 4.17. Comparison of the weighted and unweighted dataset instances .............................................. 39 5.1. Preferences for the building of the decision tree ........................................................................ 41 5.2. Preferences for splitting the dataset into training and test sets .................................................. 42 5.3. Setting the relative sizes of the data partitions ........................................................................... 42 5.4. Graphic representation of the decision tree created ................................................................... 43 5.5. The classification of the records based on the decision tree ...................................................... 44 5.6. Setting a threshold for the maximal depth of the decision tree .................................................. 46 5.7. Graphic representation of the decision tree created ................................................................... 47 5.8. Graphic representation of he classification of the records based on the decision tree ............... 47 5.9. Graphic representation of the decision tree created with the increased maximal depth ............. 48 5.10. Graphic representation of he classification of the records based on the decision tree with increased maximal depth .................................................................................................................................. 48 5.11. Graphic representation of the decision tree created with the further increased maximal depth 49 5.12. Graphic representation of he classification of the records based on the decision tree with further increased maximal depth .................................................................................................................. 50 5.13. 
Preferences for the building of the decision tree ...................................................................... 52 5.14. Graphic representation of the decision tree created ................................................................. 52 5.15. Graphic representation of he classification of the records based on the decision tree ............. 53 5.16. Performance vector of the classification based on the decision tree ........................................ 53 5.17. The modification of preferences for the building of the decision tree. .................................... 53 5.18. Graphic representation of the decision tree created with the modified preferences ................. 54 5.19. Performance vector of the classification based on the decision tree created with the modified preferences ........................................................................................................................................ 54 5.20. Settings for the sampling done in the validation operator ........................................................ 56 5.21. Subprocesses of the validation operator .................................................................................. 56 5.22. Graphic representation of the decision tree created ................................................................. 56 5.23. Performance vector of the classification based on the decision tree ........................................ 57 5.24. Settings of the cross-validation operator .................................................................................. 57 5.25. Overall performance vector of the classifications done in the cross-validation operator ........ 58 vi Created by XMLmind XSL-FO Converter. Case Studies in Data Mining 5.26. 
Overall performance vector of the classifications done in the cross-validation operator in the leaveone-out case ...................................................................................................................................... 58 5.27. Preferences for the building of the decision tree based on the Gini-index criterion ................ 59 5.28. Preferences for the building of the decision tree based on the gain ratio criterion .................. 60 5.29. Graphic representation of the decision tree created based on the gain ratio criterion .............. 61 5.30. Performance vector of the classification based on the decision tree built using the gain ratio criterion ........................................................................................................................................................... 62 5.31. Graphic representation of the decision tree created based on the Gini-index criterion ............ 62 5.32. Performance vector of the classification based on the decision tree built using the Gini-index criterion ............................................................................................................................................. 62 5.33. Settings of the operator for the comparison of ROC curves .................................................... 62 5.34. Subprocess of the operator for the comparison of ROC curves ............................................... 63 5.35. Comparison of the ROC curves of the two decision tree classifiers ........................................ 63 6.1. The rule set of the rule-based classifier trained on the data set. ................................................. 65 6.2. The classification accuracy of the rule-based classifier on the data set. .................................... 65 6.3. The rule set of the rule-based classifier. .................................................................................... 67 6.4. 
The classification accuracy of the rule-based classifier on the training set. .............................. 67 6.5. The classification accuracy of the rule-based classifier on the test set. ..................................... 67 6.6. The decision tree built on the data set. ....................................................................................... 68 6.7. The rule set equivalent of the decision tree. .............................................................................. 69 6.8. The classification accuracy of the rule-based classifier on the data set. .................................... 69 7.1. Properties of the linear regression operator ............................................................................... 71 7.2. The linear regression model yielded as a result ......................................................................... 72 7.3. The class prediction values calculated based on the linear regression model ............................ 72 7.4. The subprocess of the classification by regression operator ...................................................... 74 7.5. The linear regression model yielded as a result ......................................................................... 74 7.6. The class labels derived from the predictions calculated based on the regression model .......... 75 7.7. The subprocess of the classification by regression operator ...................................................... 77 7.8. The linear regression model yielded as a result ......................................................................... 77 7.9. The performance vector of the classification based on the regression model ............................ 78 7.10. The subprocess of the cross-validation by regression operator ............................................... 80 7.11. The subprocess of the classification by regression operator .................................................... 80 7.12. 
The linear regression model yielded as a result ....................................................................... 80 7.13. The customizable properties of the cross-validation operator ................................................. 81 7.14. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator .................................................................................................................. 82 7.15. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator for the case of using the leave-one-out method ........................................ 82 8.1. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes was selected). ................................................................... 84 8.2. The decision boundary of the perceptron. .................................................................................. 85 8.3. The classification accuracy of the perceptron on the data set. ................................................... 85 8.4. The classification accuracy of the neural network on the data set. ............................................ 86 8.5. The average classification error rate obtained from 10-fold cross-validation against the number of hidden neurons. ................................................................................................................................. 87 8.6. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes was selected). ................................................................... 89 8.7. The kernel model of the linear SVM. ........................................................................................ 89 8.8. 
The classification accuracy of the linear SVM on the data set. ................................................. 90 8.9. A subset of the Wine data set used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes was selected). Note that the classes are not linearly separable. ........................................ 91 8.10. The classification error rate of the linear SVM against the value of the parameter C. ............. 91 8.11. The number of support vectors against the value of the parameter C. ..................................... 92 8.12. The average classification error rate of the linear SVM obtained from 10-fold cross-validation against the value of the parameter C, where the horizontal axis is scaled logarithmically. ............... 94 8.13. The classification error rate of the linear SVM on the training and the test sets against the value of the parameter C. ................................................................................................................................ 95 8.14. The number of support vectors against the value of the parameter C. ..................................... 96 8.15. The classification error rate of the linear SVM on the training and the test sets against the number of training examples. ............................................................................................................................. 98 vii Created by XMLmind XSL-FO Converter. Case Studies in Data Mining 8.16. The number of support vectors against the number of training examples. .............................. 98 8.17. CPU execution time needed to train the SVM against the number of training examples. ....... 98 8.18. The Two Spirals data set ........................................................................................................ 100 8.19. The R code that produces the data set and is executed by the Execute Script (R) operator of the R Extension. 
................................................................................................................................... 100 8.20. The classification accuracy of the nonlinear SVM on the data set. ....................................... 101 8.21. The classification error rates of the SVM on the training and the test sets against the value of RBF kernel width parameter. .................................................................................................................. 102 8.22. The optimal parameter values for the RBF kernel SVM. ...................................................... 104 8.23. The classification accuracy of the RBF kernel SVM trained on the entire data set using the optimal parameter values. ............................................................................................................................ 104 8.24. The kernel model of the linear SVM. .................................................................................... 105 8.25. The classification accuracy of the linear SVM on the data set. ............................................. 106 8.26. The optimal value of the gamma parameter for the RBF kernel SVM. ................................... 107 8.27. The average RMS error of the RBF kernel SVM obtained from 10-fold cross-validation against the value of the parameter gamma, where the horizontal axis is scaled logarithmically. ....................... 107 8.28. The kernel model of the optimal RBF kernel SVM. .............................................................. 108 8.29. Predictions provided by the optimal RBF kernel SVM against the values of the observed values of the dependent variable. ................................................................................................................... 108 9.1. The average classification error rate of a single decision stump obtained from 10-fold crossvalidation. 
....................................................................................................................................... 110 9.2. The average classification error rate of the bagging algorithm obtained from 10-fold cross-validation, where 10 decision stumps were used as base classifiers. ................................................................ 110 9.3. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers. ............................................................................................................................... 112 9.4. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers. ............................................................................................................................... 114 9.5. The average error rate of the random forest obtained from 10-fold cross-validation against the number of base classifiers. ........................................................................................................................... 116 10.1. List of the frequent item sets generated ................................................................................. 118 10.2. List of the association rules generated ................................................................................... 119 10.3. Graphic representation of the association rules generated ..................................................... 120 10.4. Operator preferences for the necessary data conversion ........................................................ 121 10.5. Converted version of the dataset ............................................................................................ 122 10.6. List of the frequent item sets generated ................................................................................. 122 10.7. 
List of the association rules generated ................................................................................... 123 10.8. Operator preferences for the appropriate data conversion ..................................................... 123 10.9. The appropriate converted version of the dataset .................................................................. 124 10.10. Enhanced list of the frequent item sets generated ................................................................ 124 10.11. List of the association rules generated ................................................................................. 125 10.12. Graphic representation of the association rules generated ................................................... 125 10.13. Operator preferences for the necessary data conversion ...................................................... 127 10.14. Label role assignment for performance evaluation .............................................................. 128 10.15. Prediction role assignment for performance evaluation ....................................................... 128 10.16. Operator preferences for the data conversion necessary for evaluation ............................... 129 10.17. Graphic representation of the association rules generated regarding survival ..................... 129 10.18. List of the association rules generated regarding survival ................................................... 130 10.19. Performance vector for the application of association rules generated ................................ 130 10.20. List of the association rules generated regarding survival ................................................... 131 10.21. Performance vector for the application of association rules generated ................................ 132 10.22. Contingency table of the dataset .......................................................................................... 132 10.23. 
Record filter usage .......... 132
10.24. Removal of attributes that become redundant after filtering .......... 133
10.25. List of the association rules generated for the subset of adults .......... 133
10.26. Performance vector for the application of association rules generated regarding survival for the subset of adults .......... 133
10.27. List of the association rules generated for the subset of children .......... 133
10.28. Performance vector for the application of association rules generated regarding survival for the subset of children .......... 133
11.1. The 7 separate groups .......... 135
11.2. Clustering with default values .......... 135
11.3. Set the distance function .......... 136
11.4. Clustering with Mahalanobis distance function .......... 136
11.5. The dataset .......... 138
11.6. Setting the parameters of the clustering .......... 138
11.7. The clusters produced by the analysis .......... 139
11.8. The groups with varying density .......... 140
11.9. The results of the method with default parameters .......... 141
11.10. The 15 groups .......... 142
11.11. The resulting dendrogram .......... 143
11.12. The clustering generated from the dendrogram .......... 143
11.13. The 600 two-dimensional vectors .......... 145
11.14. The subprocess .......... 146
11.15. The report generated by the clustering .......... 146
11.16. The output of the analysis .......... 147
12.1. The two groups .......... 148
12.2. Support vector clustering with polynomial kernel and p=0.21 setup .......... 149
12.3. Unsuccessful clustering .......... 149
12.4. Clustering with RBF kernel .......... 150
12.5. More promising results .......... 150
12.6. The two groups containing 240 vectors .......... 151
12.7. The subprocess of the optimization node .......... 152
12.8. The parameters of the optimization .......... 152
12.9. The report generated by the process .......... 153
12.10. Clustering generated with the optimal parameters .......... 153
12.11. The 788 vectors .......... 155
12.12. The evaluating subprocess .......... 155
12.13. Setting up the parameters .......... 156
12.14. Parameters to log .......... 156
12.15. Cluster density against the number of clusters k .......... 157
12.16. Item distribution against the number of clusters k .......... 157
12.17. The vectors forming 31 clusters .......... 159
12.18. The extracted centroids .......... 160
12.19. The output of the k nearest neighbour method, using the centroids as prototypes .......... 160
12.20. The preprocessing subprocess .......... 162
12.21. The clustering setup .......... 162
12.22. The confusion matrix of the results .......... 163
13.1. Graphic representation of the possible outliers .......... 165
13.2. The number of outliers detected as the distance limit grows .......... 166
13.3. Nearest neighbour based operators in the Anomaly Detection package .......... 168
13.4. Settings of LOF .......... 168
13.5. Outlier scores assigned to the individual records based on k nearest neighbours .......... 168
13.6. Outlier scores assigned to the individual records based on LOF .......... 169
13.7. Filtering the records based on their outlier scores .......... 169
13.8. The dataset filtered based on the k-NN score .......... 170
13.9. The dataset filtered based on the LOF score .......... 170
13.10. Global settings for Histogram-based Outlier Score .......... 171
13.11. Column-level settings for Histogram-based Outlier Score .......... 172
13.12. Scores and attribute binning for fixed binwidth and arbitrary number of bins .......... 172
13.13. Graphic representation of outlier scores .......... 173
13.14. Scores and attribute binning for dynamic binwidth and arbitrary number of bins .......... 174
13.15. Graphic representation of the enhanced outlier scores .......... 175
14.1. The metadata of the dataset .......... 178
14.2. Setting the Sample operator .......... 178
14.3. The metadata of the resulting dataset and a part of the dataset .......... 179
14.4. The list of files in the File Import operator .......... 180
14.5. The parameters of the File Import operator .......... 181
14.6. A small portion of the dataset .......... 181
14.7. The metadata of the resulting dataset .......... 182
14.8. A small portion of the resulting dataset .......... 183
15.1. Metadata produced by the DMDB operator .......... 185
15.2. The settings of the Variable Selection operator .......... 186
15.3. List of variables after the selection .......... 186
15.4. Sequential R-square plot .......... 187
15.5.
The binary target variables as a function of the two most important input attributes after the variable selection .......... 187
15.6. Displaying the dataset by parallel axis .......... 189
15.7. Cumulative explained variance plot of the PCA .......... 190
15.8. Scatterplot of the Iris dataset using the first two principal components .......... 191
15.9. The replacement wizard .......... 192
15.10. The output of imputation .......... 193
15.11. The relationship of an input and the target variable before imputation .......... 193
15.12. The relationship of an input and the target variable after imputation .......... 194
16.1. The settings of dataset partitioning .......... 196
16.2. The decision tree .......... 197
16.3. The response curve of the decision tree .......... 198
16.4. Fitting statistics of the decision tree .......... 198
16.5. The classification chart of the decision tree .......... 198
16.6. The cumulative lift curve of the decision tree .......... 199
16.7.
The importance of attributes .......... 200
16.8. The settings of parameters in the partitioning step .......... 201
16.9. The decision tree using the chi-square impurity measure .......... 202
16.10. The decision tree using the entropy impurity measure .......... 202
16.11. The decision tree using the Gini-index .......... 203
16.12. The cumulative response curve of decision trees .......... 204
16.13. The classification plot .......... 205
16.14. Response curves of decision trees .......... 205
16.15. The score distribution of decision trees .......... 206
16.16. The main statistics of decision trees .......... 206
17.1. The misclassification rate of rule induction .......... 208
17.2. The classification matrix of rule induction .......... 208
17.3. The classification chart of rule induction .......... 209
17.4. The ROC curves of the rule induction and the decision tree .......... 209
17.5. The output of the rule induction operator .......... 210
18.1.
Classification matrix of the logistic regression .......... 212
18.2. Effects plot of the logistic regression .......... 213
18.3. Classification matrix of the stepwise logistic regression .......... 213
18.4. Effects plot of the stepwise logistic regression .......... 214
18.5. Fitting statistics for logistic regression models .......... 214
18.6. Classification charts of the logistic regression models .......... 215
18.7. Cumulative lift curve of the logistic regression models .......... 215
18.8. ROC curves of the logistic regression models .......... 216
18.9. Classification matrix of the logistic regression .......... 217
18.10. The classification chart of the logistic regression .......... 218
18.11. The effects plot of the logistic regression .......... 219
19.1. A linearly separable subset of the Wine dataset .......... 221
19.2. Fitting statistics for the perceptron .......... 222
19.3. The classification matrix of the perceptron .......... 222
19.4. The cumulative lift curve of the perceptron .......... 222
19.5.
Fitting statistics for SVM .......... 223
19.6. The classification matrix of SVM .......... 223
19.7. The cumulative lift curve of SVM .......... 223
19.8. List of the support vectors .......... 224
19.9. Fitting statistics of the multilayer perceptron .......... 225
19.10. The classification matrix of the multilayer perceptron .......... 226
19.11. The cumulative lift curve of the multilayer perceptron .......... 226
19.12. Weights of the multilayer perceptron .......... 227
19.13. Training curve of the multilayer perceptron .......... 227
19.14. Stepwise optimization statistics for the DMNeural operator .......... 228
19.15. Weights of the neurons of the network obtained with the AutoNeural operator .......... 228
19.16. Fitting statistics of neural networks .......... 229
19.17. Classification charts of neural networks .......... 230
19.18. Cumulative lift curves of neural networks .......... 230
19.19. ROC curves of neural networks ..........
231
19.20. Fitting statistics for linear kernel SVM .......... 233
19.21. The classification matrix of linear kernel SVM .......... 233
19.22. Support vectors (partial list) of linear kernel SVM .......... 233
19.23. The distribution of Lagrange multipliers for linear kernel SVM .......... 234
19.24. The parameters of polynomial kernel SVM .......... 234
19.25. Fitting statistics for polynomial kernel SVM .......... 235
19.26. Classification matrix of polynomial kernel SVM .......... 235
19.27. Support vectors (partial list) of polynomial kernel SVM .......... 236
19.28. Fitting statistics for SVMs .......... 236
19.29. The classification chart of SVMs .......... 237
19.30. Cumulative lift curves of SVMs .......... 237
19.31. Comparison of cumulative lift curves to the baseline and the optimal one .......... 238
19.32. ROC curves of SVMs .......... 238
20.1. Fitting statistics of the ensemble classifier .......... 240
20.2. The classification matrix of the ensemble classifier .......... 240
20.3.
The cumulative lift curve of the ensemble classifier ............................................................. 241 20.4. Misclassification rates of the ensemble classifier and the SVM ............................................ 242 20.5. Classification matrices of the ensemble classifier and the SVM ........................................... 242 20.6. Cumulative lift curves of the ensemble classifier and the SVM ............................................ 242 20.7. Cumulative lift curves of the ensemble classifier, the SVM and the best theoretical model . 243 20.8. ROC curves of the ensemble classifier and the SVM ............................................................ 243 20.9. The classification matrix of the bagging classifier ................................................................ 245 20.10. The error curves of the bagging classifier ............................................................................ 246 20.11. Misclassification rates of the bagging classifier and the decision tree ................................ 247 20.12. Classification matrices of the bagging classifier and the decision tree ................................ 247 20.13. Response curves of the bagging classifier and the decision tree .......................................... 247 20.14. Response curves of the bagging classifier and the decision tree comparing the baseline and the optimal classifiers ........................................................................................................................... 248 20.15. ROC curves of the bagging classifier and the decision tree ................................................. 248 20.16. The classification matrix of the boosting classifier ............................................................. 250 20.17. The error curve of the boosting classifier ............................................................................ 251 20.18. 
Misclassification rates of the boosting classifier and the SVM ........................................... 252 20.19. Classification matrices for the boosting classifier and the SVM ......................................... 252 20.20. Cumulative response curves of the boosting classifier and the SVM .................................. 252 20.21. Response curves of the boosting classifier and the SVM comparing the baseline and the optimal classifiers ........................................................................................................................................ 253 20.22. ROC curves of the boosting classifier and the SVM ........................................................... 254 21.1. List of items ........................................................................................................................... 256 21.2. The association rules as a function of the support and the reliability .................................... 257 21.3. Graph of lift values ................................................................................................................ 257 21.4. List of association rules ......................................................................................................... 258 22.1. The Aggregation dataset. ....................................................................................................... 260 22.2. The setting of the Cluster operator. ..................................................................................... 261 22.3. The result of K-means clustering when K=7 ......................................................................... 262 22.4. The setting of the MacQueen clustering ................................................................................ 263 22.5. The result of the MacQueen clustering .................................................................................. 263 22.6. 
The result of the clustering with 8 clusters .......... 264
22.7. The result display of the Cluster operator .......... 265
22.8. Scatterplot of the cluster means .......... 265
22.9. The decision tree of the clustering .......... 266
22.10. Scatterplot of the Maximum Variance (R15) dataset .......... 267
22.11. The result of the average linkage hierarchical clustering .......... 268
22.12. Evaluating the clustering by 3D bar chart .......... 269
22.13. The result of Ward clustering .......... 269
22.14. CCC plot of automatic clustering .......... 270
22.15. Proximity graph of the automatic clustering .......... 270
22.16. The Maximum Variance (D31) dataset .......... 272
22.17. The result of automatic clustering .......... 273
22.18. The CCC plot of automatic clustering .......... 273
22.19. The proximity graph of the automatic clustering .......... 273
22.20. The result of K-means clustering .......... 274
22.21. The proximity graph of K-means clustering .......... 274
22.22. The profile of the segments (clusters) .......... 275
23.1. The dendrogram of attribute clustering .......... 278
23.2. The graph of clusters and attributes .......... 279
23.3. The cluster membership .......... 279
23.4. The correlation plot of the attributes .......... 280
23.5. The correlation between clusters and an attribute .......... 281
23.6. Classification charts of SVM models .......... 282
23.7. The response curve of SVM models .......... 282
23.8. The cumulative lift curves of the SVM models .......... 282
23.9. The ROC curves of SVM models .......... 283
23.10. The scatterplot of the Maximum Variance (R15) dataset .......... 284
23.11. The result of Kohonen's vector quantization .......... 285
23.12. The pie chart of cluster size .......... 285
23.13. Statistics of clusters .......... 286
23.14. Graphical representation of the SOM .......... 286
23.15. Scatterplot of the result of SOM .......... 287
24.1. Classification matrix of the logistic regression .......... 289
24.2. Effects plot of the logistic regression .......... 290
24.3. Classification matrix of the stepwise logistic regression .......... 290
24.4. Effects plot of the stepwise logistic regression .......... 291
24.5. Fitting statistics for logistic regression models .......... 291
24.6. Classification charts of the logistic regression models .......... 292
24.7. Cumulative lift curve of the logistic regression models .......... 292
24.8. ROC curves of the logistic regression models .......... 293
24.9. Classification matrix of the logistic regression .......... 294
24.10. The classification chart of the logistic regression .......... 295
24.11. The effects plot of the logistic regression .......... 296
24.12. Statistics of the fitted models on the test dataset .......... 297
24.13. Comparison of the fitted models by means of predictions .......... 298
24.14. The observed and predicted means plot .......... 298
24.15. The model scores .......... 299
24.16. The decision tree for continuous target .......... 300
24.17. The weights of the neural network after training .......... 301
25.1. Statistics before and after filtering outliers .......... 304
25.2. The predicted mean based on the two decision trees .......... 305
25.3. The tree map of the best model .......... 305
25.4. Comparison of the two fitted decision trees .......... 306

Colophon

TODO

Preface

Data mining is an interdisciplinary area of information technology and one of the most important parts of the so-called KDD (Knowledge Discovery in Databases) process. It consists of computationally intensive algorithms and methods that are capable of exploring patterns in relatively large datasets, patterns which represent well-interpretable information for further use. The applied algorithms originate from a number of fields, namely artificial intelligence, machine learning, statistics, and database systems. Moreover, data mining combines the results of these areas and still evolves in interaction with them today.
In contrast to disciplines that focus merely on data analysis, such as statistics, data mining comprises a number of additional elements, including data management and data preprocessing, as well as post-processing issues such as interestingness measures and the suitable visualization of the discovered knowledge. The term data mining has become very fashionable, and many people mistakenly use it for all sorts of information processing involving large amounts of data (e.g., simple information extraction or data warehouse building); it also appears in the context of decision support systems. In fact, the most important feature of data mining is exploration or discovery: producing new, previously unknown, and useful information for the user. The term data mining emerged in the 1960s, when statisticians used it in a negative sense for analyzing data without any presupposition. In information technology it first appeared in the database community in the 1990s, in the context of sophisticated information extraction. Although data mining is the term that has spread in business, several synonyms exist, for example, knowledge discovery. It is important to distinguish data mining from today's challenging Big Data problems. Solving Big Data problems usually does not require the development of new theoretical models or methods; the problem is rather that the well-working algorithms of data mining software slow down hopelessly when one wants to process a really large volume of data as a whole instead of a sample of reasonable size. This requires a special attitude and IT infrastructure that is outside the scope of the present curriculum. Data mining activity, in an automatic or semi-automatic way, is integrated into the IT infrastructure of the organization that applies it.
This means that data mining tools can provide ever newer information for users from ever-changing data sources, typically data warehouses, with relatively limited human intervention. The reason is that the (business) environment is constantly changing, and the data warehouse that collects data from the environment follows these changes. Hence, previously fitted data mining models lose their validity, and new models may be needed to model the altered data. Data mining software increasingly supports this approach by being able to operate in very heterogeneous environments. The collaboration between information services and the supporting analytics nowadays allows the development of online systems based on real-time analytics, see, e.g., recommendation systems. Data mining is organized around the so-called data mining process, which is followed by the majority of data mining software. In general, this is a five-step process with the following steps:

• Sampling, data selection;
• Exploring, preprocessing;
• Modifying, transforming;
• Modelling;
• Interpreting, assessing.

Data mining software provides operators for these steps, with which certain operations can be carried out, for example, reading an external file, filtering outliers, or fitting a neural network model. When the data mining process is represented by a diagram in a graphical interface, nodes correspond to these operators. Examples of such processes are the SEMMA methodology of SAS Institute Inc.®, known for its information delivery software, and the widely used Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which has evolved through the cooperation of many branches of industry, e.g., finance, automotive, and information technology. During sampling, the target database for the data mining process is formed.
The source of the data in most cases is an enterprise (organizational) data warehouse or its subject-oriented part, a so-called data mart. The data obtained from there have therefore generally gone through a pre-processing phase when moving from the operational systems into the data warehouse, and thus they can be considered reliable. If this is not the case, the data mining software used provides tools for data cleaning, which can then already be considered the second step of the process. Sampling can generally be done using an appropriate statistical method, for example, simple random or stratified sampling. Also in this step the dataset is partitioned into training, validation, and test sets. The data mining model is fitted on the training dataset, and its parameters are estimated there. The validation dataset is used to stop the convergence of the training process or to compare different models. Since this dataset is independent of the training dataset, we obtain a reliable decision on where to stop the training. Finally, the generalization ability of the model can be measured on the test dataset, that is, how the model is expected to behave on new records. The exploring step means becoming acquainted with the data, if possible without any preconception. Its objective is to form hypotheses in connection with the applicable procedures. The main tools are descriptive statistics and graphical visualization; a data mining software has a number of graphical tools that exceed those of standard statistical software. Another objective of exploring is to identify any existing errors (noise) and to find the locations of missing data. The purpose of the modifying step is the preparation of the data for fitting a data mining model. There may be several reasons for this.
One of them is that many methods directly require the modification of the data; for example, in the case of neural networks the attributes have to be standardized before the training of the network. Another is that even if a method does not require modification, a better-fitting model may be obtained after a suitable modification. An example is the normalization of attributes by suitable transformations before fitting a regression model, so that the input attributes will be close to the normal distribution. The modification can be carried out at multiple levels: at the level of entire attributes, by transforming whole attributes; at the level of records, e.g., by standardizing some records; or at the level of fields, by modifying individual data values. The modifying step also includes the elimination of noisy and missing data, the so-called imputation. Modeling is the most complex step of the data mining process, and it requires the most knowledge as well. In essence, after suitable preparation, the data mining task itself is solved here. The typical data mining tasks can be divided into two groups. The first group is known as supervised data mining or supervised learning. In this case, there is an attribute with a special role in the dataset, called the target. The target variable has to be indicated in the data mining software used. The task is then to describe this target variable using the other variables as well as we can. The second group is known as unsupervised data mining or unsupervised learning. In this case, there is no special attribute in the dataset in which we want to explore hidden patterns. Within data mining, six task types can be defined, of which classification and regression are supervised, while segmentation, association, sequential analysis, and anomaly detection are unsupervised.
• Classification: modelling known classes (groups) for generalization purposes, in order to apply the built model to new records. Example: filtering emails by classifying them into spam and non-spam classes.
• Regression: building a model that approximates a continuous target by a function of the input attributes such that the error of the approximation is as small as possible. Example: estimating customer value from current demographic and historical data.
• Segmenting, clustering: finding groups in the data that are similar in some sense, without taking into account any known existing structure. A typical example is customer segmentation, when a bank or insurance company looks for groups of clients who behave similarly.
• Association: searching for relationships between attributes. A typical example is market basket analysis, when we look at which goods are bought together by customers in stores and supermarkets.
• Anomaly detection: identifying records which may be interesting or require further investigation due to an error. An example is searching for clients or users with extreme behaviour.
• Sequential analysis: searching for temporal and spatial relationships between attributes. For example, in which order customers take up services, or examining gene sequences.

Assessing the results is the last step of the data mining process; its objective is to decide whether truly relevant and useful knowledge has been reached. It often happens that the improper use of data mining produces a model which has weak generalization ability and works very poorly on new data. This is the so-called overfitting. In order to avoid overfitting, we should rely on the training, validation, and test datasets. At this step, we can also compare our fitted models if there is more than one.
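In this material the partitioning and model-comparison steps are carried out with RapidMiner and SAS® Enterprise Miner™ operators; purely as a language-neutral illustration, the same idea can be sketched in Python with scikit-learn. The dataset, the split ratios, and the two candidate models below are illustrative assumptions, not part of the course workflows.

```python
# Sketch: partition the data into training, validation, and test sets,
# then compare two classifiers by misclassification rate.
# Illustrative example only; the course uses RapidMiner / SAS EM operators.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 60% training, 20% validation, 20% test (stratified, as in the Sampling step)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training set and choose the one with the lowest
# misclassification rate on the independent validation set ...
best_name = min(models, key=lambda name:
                1 - models[name].fit(X_train, y_train).score(X_valid, y_valid))

# ... then measure generalization ability on the untouched test set.
misclassification_rate = 1 - models[best_name].score(X_test, y_test)
print(best_name, misclassification_rate)
```

Selecting the model on the validation set and reporting the error only on the test set is precisely what guards against the overfitting discussed above.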
In the comparison, various measures, e.g., misclassification rate and mean squared error, and graphical tools, e.g., lift curves and ROC curves, can be used. This electronic curriculum aims to provide an introduction to data mining applications by showing these applications in practice through the use of data mining software. Problems requiring data mining can be encountered in many fields of life. Some of these are listed below; the datasets used in the course material also come from these areas.

• Commercial data mining. One of the main driving forces behind the development and application of data mining. Its objective is to analyze the static, historical business data stored in data warehouses in order to explore hidden patterns and trends. Besides the standard ways of collecting data, companies have found several other ways to build more reliable data mining models; for example, this is one of the main reasons behind the spread of loyalty cards. Among a number of specific application areas we emphasize customer relationship management (CRM): who are our customers and how should we deal with them; churn analysis: which customers are planning to leave us; and cross-selling: which products should be offered together. The algorithms of market basket analysis were born in solving a business problem.

• Scientific data mining. The other main driving force behind the development of data mining. Many data mining methods, for example, neural networks and self-organizing maps, were developed for solving a scientific problem and became methods of data mining only years later. The application areas range from astronomy (galaxy classification and processing various kinds of radiation detected in space), chemistry (forecasting the properties of artificial molecules), and engineering sciences (materials science, traffic management) to biology (bioinformatics, drug discovery, and genetics).
Data mining can help in areas where far more data is generated than scientists are able to process.

• Mining medical data. The development of health information technology makes it possible for doctors to share their diagnostic results with each other, so that an examination need not be repeated several times. Moreover, by collecting diagnoses as examination results in a common data warehouse, it becomes possible to develop new medical procedures by means of data mining techniques. Data mining is also likely to play an important role in personalized medicine.

• Spatial data mining. The analysis of spatial data with data mining methods: the extension of traditional geographic information systems (GIS) with data mining tools. Application areas: climate research, the spread of epidemics, and customer analysis of large multinational companies taking the spatial dimension into account. An important area in the future will be the processing of data generated in sensor networks, e.g., pollution monitoring of an area.

• Multimedia data mining. The analysis of audio, image, and video files by data mining tools. Data mining can help to find similarities between songs in order to decide copyright issues more objectively. Another application is finding contents that are illegal or conflict with copyright law in file-sharing systems and at multimedia service providers.

• Web mining. The analysis of web data. Three types of web mining problems are distinguished: web structure mining, web content mining, and web usage mining. Web structure mining means the examination of the structure of the Web, i.e., of the web graph, where the set of vertices consists of the sites and the set of edges consists of the links between them. Web content mining means the retrieval of useful information from the contents of the web. The well-known web search engines (Google, AltaVista, etc.) also carry out this task.
Web usage mining deals with examining what users are searching for on the Internet, using the data gathered by web servers and application servers. These areas are strongly related to Big Data problems because they often require working over an Internet-scale infrastructure.

• Text mining. The mining of unstructured or semi-structured data. By unstructured data we mean continuous texts (sequences of strings), which may be connected by a theme (e.g., scientific), by a field (e.g., sport), or may be customers' sentiments recorded at a customer service. Semi-structured data are typically produced by computers, or are files produced for computers, for example, in XML or JSON format. Some specific applications: data mining for security reasons, e.g., searching for terrorists, analytical CRM, sentiment analysis, and academic applications (plagiarism investigation).

1. How to use this material

The RapidMiner and SAS® Enterprise Miner™ workflows presented in this course material are contained in the file resources/workflows.zip.

Important: the data files used in the experiments must be downloaded by the user from the locations specified in the text. After importing a workflow, file paths must be set to point to the local copies of the data files (absolute paths are required).

Part I. Data mining tools

Introduction

In this part, data mining tools and software are overviewed. There are three necessary conditions for successful data mining. First, we need appropriate datasets to perform data mining on. In practice, this is often a task-oriented data mart generated from the enterprise data warehouse. In education, and so in this curriculum, the datasets are taken from widely used data repositories. All datasets are attached to this material. Another important condition for data mining is the data mining expert.
We hope that this curriculum will be able to contribute to the education of these professionals. Finally, the key is the software with which data mining is performed. Such software can be classified on the basis of several criteria, e.g., commercial or free, self-contained or integrated, general or specific, theme-oriented or not. The most up-to-date information on this topic can be found on the KDnuggets™ website, where the reader can also get fresh information on current job openings, courses, conferences, etc. In this curriculum two software packages are discussed in detail: a leading free data mining tool, RapidMiner 3.5, and one of the most widely used commercial data mining tools, SAS® Enterprise Miner™ Version 7.1. The list of data mining software below is based on the KDnuggets™ portal.

Chapter 1. Commercial data mining softwares

• AdvancedMiner Professional (formerly Gornik), provides a wide range of tools for data transformations, data mining models, data analysis, and reporting.
• Alteryx, offering a Strategic Analytics platform, including a free Project Edition version.
• Angoss Knowledge Studio, a comprehensive suite of data mining and predictive modeling tools; interoperability with SAS and other major statistical tools.
• BayesiaLab, a complete and powerful data mining tool based on Bayesian networks, including data preparation, missing value imputation, data and variable clustering, and unsupervised and supervised learning.
• BioComp i-Suite, constraint-based optimization, cause and effect analysis, non-linear predictive modeling, data access and cleaning, and more.
• BLIASoft Knowledge Discovery, for building models from data based mainly on fuzzy logic.
• CMSR Data Miner, built for business data with database focus, incorporating rule-engine, neural network, neural clustering (SOM), decision tree, hotspot drill-down, cross table deviation analysis, cross-sell analysis, visualization/charts, and more.
• Data Applied, offers a comprehensive suite of web-based data mining techniques, an XML web API, and rich data visualizations.
• Data Miner Software Kit, collection of data mining tools, offered in combination with a book: Predictive Data Mining: A Practical Guide, Weiss and Indurkhya.
• DataDetective, the powerful yet easy to use data mining platform and the crime analysis software of choice for the Dutch police.
• Dataiku, a software platform for data science, statistics, guided machine learning and visualization capabilities, built on Open Source, Hadoop integration.
• DataLab, a complete and powerful data mining tool with a unique data exploration process, with a focus on marketing and interoperability with SAS.
• DBMiner 2.0 (Enterprise), powerful and affordable tool to mine large databases; uses Microsoft SQL Server 7.0 Plato.
• Delta Miner, integrates new search techniques and "business intelligence" methodologies into an OLAP frontend that embraces the concept of Active Information Management.
• ESTARD Data Miner, simple to use, designed both for data mining experts and common users.
• Exeura Rialto™, provides comprehensive support for the entire data mining and analytics life-cycle at an affordable price in a single, easy-to-use tool.
• Fair Isaac Model Builder, software platform for developing and deploying analytic models, includes data analysis, decision tree and predictive model construction, decision optimization, business rules management, and open-platform deployment.
• FastStats Suite (Apteco), marketing analysis products, including data mining, customer profiling and campaign management.
• GainSmarts, uses predictive modeling technology that can analyze past purchase, demographic, and lifestyle data, to predict the likelihood of response and develop an understanding of consumer characteristics.
• Generation5 GenVoy, On-Demand Consumer Analytics.
• GenIQ Model, uses machine learning for the regression task; automatically performs variable selection and new variable construction, and then specifies the model equation to "optimize the decile table".
• GhostMiner, complete data mining suite, including k-nearest neighbors, neural nets, decision tree, neurofuzzy, SVM, PCA, clustering, and visualization.
• GMDH Shell, an advanced but easy to use tool for predictive modeling and data mining. Free trial version is available.
• Golden Helix Optimus RP, uses Formal Inference-based Recursive Modeling (recursive partitioning based on dynamic programming) to find complex relationships in data and to build highly accurate predictive and segmentation models.
• IBM SPSS Modeler (formerly Clementine), a visual and powerful data mining workbench.
• Insights (formerly KnowledgeMiner), 64-bit parallel software for autonomously building reliable predictive analytical models from high-dimensional noisy data using outstanding self-organizing knowledge mining technologies. Model export to Excel. Localized for English, Spanish, German. Free trial version.
• JMP, offers significant visualization and data mining capabilities along with classical statistical analyses.
• K.wiz, from thinkAnalytics - massively scalable, embeddable, Java-based real-time data-mining platform. Designed for Customer and OEM solutions.
• Kaidara Advisor (formerly Acknosoft KATE), Case-Based Reasoning (CBR) and data mining engine.
• Kensington Discovery Edition, high-performance discovery platform for life sciences, with multi-source data integration, analysis, visualization, and workflow building.
• Kepler, extensible, multi-paradigm, multi-purpose data mining system.
• Knowledge Miner, a knowledge mining tool that works with data stored in Microsoft Excel for building predictive and descriptive models. (MacOS, Excel 2004 or later).
• Kontagent kSuite DataMine, a SaaS User Analytics platform offering real-time behavioral insights for Social, Mobile and Web, offering SQL-like queries on top of Hadoop deployments.
• KXEN (SAP company), providing Automated Predictive Analytics tools for Big Data.
• LIONsolver (Learning and Intelligent OptimizatioN), modeling and optimization with "on the job learning" for business and engineering, by Reactive Search SrL.
• LPA Data Mining tools, support fuzzy, bayesian and expert discovery and modeling of rules.
• Magnify PATTERN, software suite, contains PATTERN:Prepare for data preparation; PATTERN:Model for building predictive models; and PATTERN:Score for model deployment.
• Mathematica solution for Data Analysis and Mining, from Wolfram.
• MCubiX, a complete and affordable data mining toolbox, including decision tree, neural networks, association rules, visualization, and more.
• Microsoft SQL Server, empowers informed decisions with predictive analysis through intuitive data mining, seamlessly integrated within the Microsoft BI platform, and extensible into any application.
• Machine Learning Framework, provides analysis, prediction, and visualization using fuzzy logic and ML methods; implemented in C++ and integrated into Mathematica.
• Molegro Data Modeller, a cross-platform application for Data Mining, Data Modelling, and Data Visualization.
• Nuggets, builds models that uncover hidden facts and relationships, predict for new data, and find key variables (Windows).
• Oracle Data Mining (ODM), enables customers to produce actionable predictive information and build integrated business intelligence applications.
• Palisade DecisionTools Suite, complete risk and decision analysis toolkit.
• Partek, pattern recognition, interactive visualization, and statistical analysis and modeling system.
• Pentaho, open-source BI suite, including reporting, analysis, dashboards, data integration, and data mining based on Weka.
• Polyanalyst, comprehensive suite for data mining, now also including text analysis, decision forest, and link analysis. Supports OLE DB for Data Mining, and DCOM technology.
• Powerhouse Data Mining, for predictive and clustering modelling, based on Dorian Pyle's ideas on using Information Theory in data analysis. Most information is in Spanish.
• Predictive Dynamix, integrates graphical and statistical data analysis with modeling algorithms including neural networks, clustering, fuzzy systems, and genetic algorithms.
• Previa, family of products for classification and forecasting.
• Quadstone DecisionHouse, an agile analytics solution with a complete suite of capabilities to support the end-to-end data mining cycle.
• RapAnalyst™, uses advanced artificial intelligence to create dynamic predictive models, to reveal relationships between new and historical data.
• Rapid Insight Analytics, streamlines the predictive modeling and data exploration process, enabling users of all abilities to quickly build, test, and implement statistical models at lightning speed.
• Reel Two, real-time classification software for structured and unstructured data as well as entity extraction. From desktop to enterprise.
• Salford Systems Data Mining Suite, CART Decision Trees, MARS predictive modeling, automated regression, TreeNet classification and regression, data access, preparation, cleaning and reporting modules, RandomForests predictive modeling, clustering and anomaly detection.
• SAS® Enterprise Miner™, an integrated suite which provides a user-friendly GUI front-end to the SEMMA (Sample, Explore, Modify, Model, Assess) process.
• SPAD, provides powerful exploratory analyses and data mining tools, including PCA, clustering, interactive decision trees, discriminant analyses, neural networks, text mining and more, all via a user-friendly GUI.
• Statistica Data Miner, a comprehensive, integrated statistical data analysis, graphics, database management, and application development system.
• Synapse, a development environment for neural networks and other adaptive systems, supporting the entire development cycle from data import and preprocessing via model construction and training to evaluation and deployment; allows deployment as .NET components.
• Teradata Warehouse Miner and Teradata Analytics, providing analytic services for in-place mining on a Teradata DBMS.
• thinkCRA, from thinkAnalytics, an integrated suite of Customer Relationship Analytics applications supporting real-time decisioning.
• TIBCO Spotfire Miner, combining Spotfire visualization, Insightful Miner, and S+ with an intuitive drag-and-drop user interface.
• TIMi Suite: The Intelligent Mining machine, a family of stand-alone, automated, user-friendly GUI tools for prediction, segmentation and data preparation, with high scalability, speed, ROI, prediction accuracy (a recurrent top winner at KDD cups).
• Viscovery data mining suite, a unique, comprehensive data mining suite for business applications with workflow-guided project environment; includes modules for visual data mining, clustering, scoring, automation and real-time integration.
• Xeno, InfoCentricity's powerful, user-friendly online analytic platform, supporting segmentation, clustering, exploratory data analysis, and the development of highly predictive models.
• XLMiner, Data Mining Add-In For Excel.
• Xpertrule Miner (Attar Software), features data transformation, Decision Trees, Association Rules and Clustering on large scale data sets.

Chapter 2.
Free and shareware data mining softwares

• Alteryx Project Edition, free version of Alteryx, delivers the data blending, analytics, and sharing capabilities of Alteryx with just enough allowed runs of your workflow to solve one business problem or to complete one project.
• AlphaMiner, open source data mining platform that offers various data mining model building and data cleansing functionality.
• CMSR Data Miner, built for business data with database focus, incorporating rule-engine, neural network, neural clustering (SOM), decision tree, hotspot drill-down, cross table deviation analysis, cross-sell analysis, visualization/charts, and more. Free for academic use.
• CRAN Task View: Machine Learning and Statistical Learning, machine learning and statistical packages in R.
• Databionic ESOM Tools, a suite of programs for clustering, visualization, and classification with Emergent Self-Organizing Maps (ESOM).
• ELKI: Environment for DeveLoping KDD-Applications Supported by Index-Structures, a framework in Java which includes clustering, outlier detection, and other algorithms; allows the user to evaluate the combination of arbitrary algorithms, data types, and distance functions.
• Gnome Data Mining Tools, including apriori, decision trees, and Bayes classifiers.
• jHepWork, an interactive environment for scientific computation, data analysis and data visualization designed for scientists, engineers and students.
• KEEL, includes knowledge extraction algorithms, preprocessing techniques, evolutionary rule learning, genetic fuzzy systems, and more.
• KNIME, extensible open source data mining platform implementing the data pipelining paradigm (based on Eclipse).
• Machine Learning in Java (MLJ), an open-source suite of Java tools for research in machine learning. (The software will not be developed further.)
• MiningMart, a graphical tool for data preprocessing and mining on relational databases; supports development, documentation, re-use and exchange of complete KDD processes. Free for non-commercial purposes.
• ML-Flex, an open-source software package designed to enable flexible and efficient processing of disparate data sets for machine learning (classification).
• MLC++, a machine learning library in C++.
• Orange, open source data analytics and mining through visual programming or Python scripting. Components for visualization, rule learning, clustering, model evaluation, and more.
• PredictionIO, an open source machine learning server for software developers and data engineers to create predictive features, such as personalization, recommendation and content discovery.
• RapidMiner, a leading open-source system for knowledge discovery and data mining.
• Rattle, a data mining suite based on the open source statistical language R, includes graphics, clustering, modeling, and more.
• TANAGRA, offers a GUI interface and methods for data access, statistics, feature selection, classification, clustering, visualization, association and more.
• Weka, collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform.

Part II. RapidMiner

Table of Contents

3. Data Sources
   1. Importing data from a CSV file
   2. Importing data from an Excel file
   3.
Creating an AML file for reading a data file
   4. Importing data from an XML file
   5. Importing data from a database
4. Pre-processing
   1. Managing data with issues - Missing, inconsistent, and duplicate values
   2. Sampling and aggregation
   3. Creating and filtering attributes
   4. Discretizing and weighting attributes
5. Classification Methods 1
   1. Classification using a decision tree
   2. Under- and overfitting of a classification with a decision tree
   3. Evaluation of performance for classification by decision tree
   4. Evaluation of performance for classification by decision tree 2
   5. Comparison of decision tree classifiers
6. Classification Methods 2
   1.
Using a rule-based classifier (1)
   2. Using a rule-based classifier (2)
   3. Transforming a decision tree to an equivalent rule set
7. Classification Methods 3
   1. Linear regression
   2. Classification with linear regression
   3. Evaluation of performance for classification by regression model
   4. Evaluation of performance for classification by regression model 2
8. Classification Methods 4
   1. Using a perceptron for solving a linearly separable binary classification problem
   2. Using a feed-forward neural network for solving a classification problem
   3. The influence of the number of hidden neurons on the performance of the feed-forward neural network
   4. Using a linear SVM for solving a linearly separable binary classification problem
   5. The influence of the parameter C on the performance of the linear SVM (1)
   6. The influence of the parameter C on the performance of the linear SVM (2)
   7. The influence of the parameter C on the performance of the linear SVM (3)
   8. The influence of the number of training examples on the performance of the linear SVM
   9. Solving the two spirals problem by a nonlinear SVM
   10. The influence of the kernel width parameter on the performance of the RBF kernel SVM
   11. Search for optimal parameter values of the RBF kernel SVM
   12. Using an SVM for solving a multi-class classification problem
   13. Using an SVM for solving a regression problem
9. Classification Methods 5
   1. Introducing ensemble methods: the bagging algorithm
   2. The influence of the number of base classifiers on the performance of bagging
   3. The influence of the number of base classifiers on the performance of the AdaBoost method
   4. The influence of the number of base classifiers on the performance of the random forest
10. Association rules
   1. Extraction of association rules
   2. Extraction of association rules from a non-transactional data set
   3. Evaluation of performance for association rules
   4. Performance of association rules - Simpson's paradox
11. Clustering 1
   1.
K-means method
   2. K-medoids method
   3. The DBSCAN method
   4. Agglomerative methods
   5. Divisive methods
12. Clustering 2
   1. Support vector clustering
   2. Choosing parameters in clustering
   3. Cluster evaluation
   4. Centroid method
   5. Text clustering
13. Anomaly detection
   1. Searching for outliers
   2. Unsupervised search for outliers
   3.
Unsupervised statistics-based anomaly detection

Chapter 3. Data Sources

1. Importing data from a CSV file

Description

The process demonstrates how to import data from CSV files using the Read CSV and the Open File operators. In the experiment we use a real-time earthquake data feed provided by the USGS in CSV format. First, we download the feed to be able to import it into RapidMiner using the Import Configuration Wizard of the Read CSV operator. The wizard guides the user step by step through the import process and helps to set the parameters of the operator correctly. After the local copy of the feed has been successfully imported into RapidMiner, we can switch to the live feed by adding the Open File operator to the process.

Input

The United States Geological Survey (USGS) provides real-time earthquake data feeds at the Earthquake Hazards Program website. Data is available in various formats, including CSV. The experiment uses the feed of the magnitude 1+ earthquakes in the past 30 days in CSV format from the URL http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php. The feed is updated every 15 minutes.

Output

An ExampleSet that contains data imported from the CSV feed.

Figure 3.1. Metadata of the resulting ExampleSet.

Figure 3.2. A small excerpt of the resulting ExampleSet.

Interpretation of the results

Each time the process is run, it reads live data from the web.

Video

Workflow: import_exp1.rmp

Keywords: importing data, CSV

Operators: Open File, Read CSV

2. Importing data from an Excel file

Description

The process demonstrates how to import data from Excel files using the Read Excel operator. The Concrete Compressive Strength data set is used in the experiment.
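Outside RapidMiner, reading a CSV data set like the ones used in this chapter can be sketched in plain Python. This is a hedged illustration only: the column names follow the publicly documented header of the USGS feed, and the rows below are invented sample data rather than real feed content.

```python
import csv
import io

# Invented sample shaped like the USGS earthquake CSV feed header.
sample_feed = """time,latitude,longitude,depth,mag,place
2014-01-01T00:10:00Z,38.8,-122.8,2.1,1.5,Northern California
2014-01-01T00:25:00Z,36.1,-117.9,5.4,2.3,Central California
2014-01-01T00:40:00Z,19.4,-155.3,1.0,1.1,Island of Hawaii
"""

def read_quakes(text):
    """Parse the CSV text into a list of dicts, converting magnitude to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["mag"] = float(row["mag"])
        rows.append(row)
    return rows

quakes = read_quakes(sample_feed)
print(len(quakes), max(q["mag"] for q in quakes))  # 3 2.3
```

For the live feed, the same parsing would be applied to the text downloaded from the feed URL; the wizard in RapidMiner performs the analogous type-guessing and column mapping interactively.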
Parameters of the Read Excel operator are set via its Import Configuration Wizard.

Input

Concrete Compressive Strength [UCI MLR] [Concrete]

Output

An ExampleSet that contains data imported from the Excel file.

Figure 3.3. Metadata of the resulting ExampleSet.

Figure 3.4. A small excerpt of the resulting ExampleSet.

Interpretation of the results

Video

Workflow: import_exp2.rmp

Keywords: importing data, Excel

Operators: Guess Types, Open File, Read Excel, Rename by Generic Names

3. Creating an AML file for reading a data file

Description

The process demonstrates how to create an AML file using the Read AML operator for reading a data file. AML files are XML documents that provide metadata about attributes, including their names, data types, and roles. Once the AML file has been created, it can be used to read the underlying data file properly.

Input

Pima Indians Diabetes [UCI MLR]

Output

An AML file in the file system and an ExampleSet that contains data imported from the data file using the AML file.

Figure 3.5. The resulting AML file.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<attributeset default_source="pima-indians-diabetes.data" encoding="UTF-8">
    <attribute name="Number of times pregnant" sourcecol="1" valuetype="integer"/>
    <attribute name="Plasma glucose concentration" sourcecol="2" valuetype="integer"/>
    <attribute name="Diastolic blood pressure" sourcecol="3" valuetype="integer"/>
    <attribute name="Triceps skin fold thickness" sourcecol="4" valuetype="integer"/>
    <attribute name="2-Hour serum insulin" sourcecol="5" valuetype="integer"/>
    <attribute name="Body mass index" sourcecol="6" valuetype="real"/>
    <attribute name="Diabetes pedigree function" sourcecol="7" valuetype="real"/>
    <attribute name="Age" sourcecol="8" valuetype="integer"/>
    <label name="Class" sourcecol="9" valuetype="binominal">
        <value>1</value>
        <value>0</value>
    </label>
</attributeset>

Interpretation of the results

The resulting AML file is intended to be distributed together with the data file.

Video

Workflow: import_exp3.rmp

Keywords: importing data, AML

Operators: Read AML

4. Importing data from an XML file

Description

The process demonstrates how to import data from XML documents using the Read XML operator. Parameters of the Read XML operator are set via its Import Configuration Wizard. Attribute values are extracted from the XML document using XPath location paths.

Input

The experiment uses population data in XML from the World Bank Open Data website. The data set is available at http://data.worldbank.org/indicator/SP.POP.TOTL in various formats, including XML.

Figure 3.6. A small excerpt of The World Bank: Population (Total) data set used in the experiment.
<?xml version="1.0" encoding="utf-8"?>
<Root xmlns:wb="http://www.worldbank.org">
    <data>
        <record>
            <field name="Country or Area" key="ABW">Aruba</field>
            <field name="Item" key="SP.POP.TOTL">Population (Total)</field>
            <field name="Year">1960</field>
            <field name="Value">54208</field>
        </record>
        <record>
            <field name="Country or Area" key="ABW">Aruba</field>
            <field name="Item" key="SP.POP.TOTL">Population (Total)</field>
            <field name="Year">1961</field>
            <field name="Value">55435</field>
        </record>
        <record>
            <field name="Country or Area" key="ABW">Aruba</field>
            <field name="Item" key="SP.POP.TOTL">Population (Total)</field>
            <field name="Year">1962</field>
            <field name="Value">56226</field>
        </record>
        <!-- ... -->
    </data>
</Root>

Output

An ExampleSet that contains data imported from the XML document.

Figure 3.7. Metadata of the resulting ExampleSet.

Figure 3.8. A small excerpt of the resulting ExampleSet.

Interpretation of the results

Video

Workflow: import_exp4.rmp

Keywords: importing data, XML

Operators: Read XML

5. Importing data from a database

Description

The process demonstrates how to connect to an Oracle database and execute an SQL query.

Input

The experiment uses a local database server running at the Faculty of Informatics, University of Debrecen. Note that connecting to the database server requires authentication.

Output

An ExampleSet that contains the result of the SQL query.

Figure 3.9. Metadata of the resulting ExampleSet.

Figure 3.10. A small excerpt of the resulting ExampleSet.

Interpretation of the results

Video

Workflow: import_exp5.rmp

Keywords: importing data, database, SQL

Operators: Read Database

Chapter 4. Pre-processing

1.
Managing data with issues - Missing, inconsistent, and duplicate values

Description

The process shows, using a sample of the Individual household electric power consumption data set, how to manage data sets in which missing, inconsistent, and/or duplicate values are present. Missing values can be substituted with a default value, or with one computed from the other instances of the field; if necessary, the records containing them can be deleted instead. After the inconsistent values have been identified, the records containing them can also be filtered out; however, identifying them usually requires some background or domain knowledge. In contrast, filtering duplicate values is a largely automated task: records identical to each other can be filtered out easily.

Input

Individual household electric power consumption [UCI MLR]

Output

The data set used here is a sample of the original data set, which covers a longer time period; it only contains the energy consumption data from January 2007. Normally the data set contains a measurement for every minute, but if a given measurement could not be taken for some reason, the timestamp is present without data. Such missing values can be substituted with defined values, e.g. with the average of the values present in the given attribute, or the records containing them can be left out altogether. Dealing with inconsistent values, however, is a more complex issue. In some fields, intervals can be defined into which the attribute values must fall, but in other cases one must rely on other kinds of background information. For example, let us suppose that the members of the household in which the measurements took place are not night owls. Based on this, consider the following representation of the data:

Figure 4.1. Graphic representation of the global and kitchen power consumption in time
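The mean-substitution strategy mentioned above can be sketched in a few lines of Python. This is a hedged illustration of what RapidMiner's Replace Missing Values operator does for one numeric column; the values below are invented, and `None` stands for a timestamp with no measurement.

```python
# Hedged sketch of mean imputation for a single numeric attribute.
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

# Invented consumption readings; None marks a missing measurement.
global_active_power = [4.0, None, 3.0, None, 5.0]
print(impute_mean(global_active_power))  # [4.0, 4.0, 3.0, 4.0, 5.0]
```

Deleting the affected records instead would simply be `[v for v in values if v is not None]`; which strategy is appropriate depends on how much data can be afforded to lose.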
In the figure, colors are assigned based on the values of the variable Sub_metering_1, which represents the energy consumption of the appliances in the kitchen. Since the x axis is time, it can be seen that some of the outstanding kitchen consumption values were measured late at night. This can also be seen in the data view if the data are ordered by the kitchen consumption values:

Figure 4.2. Possible outliers based on the hypothesized habits of the members of the household

If the members of the household are indeed not night owls, these data can be considered inconsistent based on our background knowledge, and they can be filtered out as follows. Formulating our condition, let us assume that a measurement is considered inconsistent if the kitchen consumption exceeds 50 Wh at a point in time after 10 p.m. Based on this, the filtering condition can easily be defined, but first the time attribute has to be converted: by default, it is stored in a nominal variable in the format hh:mm:ss, which can only be compared for equality. Using the appropriate operators, it can be split into hour, minute, and second components, interpreted as numbers. The filtering condition can then be defined using the Time_1 variable, which contains the hour component:

Figure 4.3. Filtering of the possible values using a record filter

Interpretation of the results

Using such filters, the records carrying inconsistent data can be filtered out. Likewise, with the appropriate operator, duplicate records can be removed from the data set - it can also be specified based on the equality of which attribute set records are to be considered duplicates - and after this, the actual processing of the filtered and/or refined records can begin.
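The time-splitting and filtering steps described above can be sketched in Python. This is a hedged sketch, not the RapidMiner implementation: the record layout is assumed, and the rule mirrors the stated condition (kitchen consumption above 50 Wh after 10 p.m. is inconsistent).

```python
# Hedged sketch of the inconsistency filter described in the text.
def is_consistent(record):
    # Split the nominal "hh:mm:ss" time and interpret the hour as a number,
    # analogous to the Split + Parse Numbers operators.
    hour = int(record["Time"].split(":")[0])
    return not (hour >= 22 and record["Sub_metering_1"] > 50)

# Invented records in an assumed layout.
records = [
    {"Time": "23:15:00", "Sub_metering_1": 72},  # late-night cooking: suspicious
    {"Time": "18:30:00", "Sub_metering_1": 65},  # dinner time: plausible
    {"Time": "23:45:00", "Sub_metering_1": 0},
]
filtered = [r for r in records if is_consistent(r)]
print(len(filtered))  # 2
```

Duplicate removal would follow the same pattern: keep a record only if its chosen attribute tuple has not been seen before.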
Video

Workflow: preproc_exp1.rmp

Keywords: missing data, inconsistent data, data transformation, duplicate data

Operators: Filter Examples, Parse Numbers, Read CSV, Remove Duplicates, Replace Missing Values, Split

2. Sampling and aggregation

Description

The process shows, using a sample of the Individual household electric power consumption data set, how the data can be summed up using aggregation, or sampled, if not all of the individual records are required in the given process. Aggregation can be used when the individual data are not necessary, only values computed from the data set as a whole; sampling can be used when only a fraction of the data set is needed and conclusions are to be drawn from this subset of the data.

Input

Individual household electric power consumption [UCI MLR]

Output

When aggregating, all the aggregate functions available in SQL can be used, and with these, basic statistics can easily be computed for the given data set.

Figure 4.4. Selection of aggregate functions for attributes

If the data set is sampled, this can be done by explicitly specifying the size of the sample or based on probability. A filter can also be used when the parts of the data set should not be represented proportionally in the sample, but rather a given subset of the original data set is needed for the process. For example, filtering for the records belonging to a given time of every day can be done as follows:

Figure 4.5. Preferences for dataset sampling

Figure 4.6. Preferences for dataset filtering
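The two operations of this section can be sketched side by side in Python. This is a hedged illustration with invented measurement values: the aggregates correspond to the SQL-style functions mentioned above, and the sample is drawn with an explicitly specified absolute size, as in RapidMiner's Sample operator.

```python
import random

# Invented measurement values.
measurements = [1.0, 3.0, 0.5, 2.5, 1.0, 4.0]

# Aggregation: SQL-style aggregate functions over the whole data set.
summary = {
    "count": len(measurements),
    "sum": sum(measurements),
    "average": sum(measurements) / len(measurements),
    "minimum": min(measurements),
    "maximum": max(measurements),
}

# Sampling with an absolute sample size; the seed is fixed only so that
# this sketch is reproducible.
random.seed(42)
sample = random.sample(measurements, 3)
print(summary["sum"], len(sample))  # 12.0 3
```

Probability-based sampling would instead keep each record independently with some probability p, so the resulting sample size is random rather than fixed.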
Interpretation of the results

After performing the aggregation or sampling, the resulting data set will consist only of the aggregate values produced by the specified operations, or of the records that fulfil the specified conditions, respectively:

Figure 4.7. Resulting dataset after dataset sampling

Figure 4.8. Resulting dataset after dataset filtering

Video

Workflow: preproc_exp2.rmp

Keywords: aggregation, summation, sampling, data filtering

Operators: Aggregate, Filter Examples, Multiply, Read CSV, Sample

3. Creating and filtering attributes

Description

The process shows, using The Insurance Company Benchmark (COIL 2000) data set, how new, computed attributes can be created from existing data when the attributes are not appropriate in their original form or some data derived from them is required. The process also shows how individual attributes can be removed from the data set: if the raw data that form the basis of a calculation are not needed later on, these columns can be removed. Naturally, other columns can be filtered out as well if they are not required for the given task, or if their disturbing effects need to be eliminated.

Input

The Insurance Company Benchmark (COIL 2000) [CoIL Challenge 2000]

Output

The attributes of the data set which begin with the letter m contain the demographic data of the region belonging to the zip code of the given potential client, among others the distribution of the individual income groups in the given region. If, for some reason, the original representation is to be compressed, a derived field can be created from these income attributes using a given formula, based on some heuristic, for example as follows:

Figure 4.9.
Defining a new attribute based on an expression relying on existing attributes

After the appropriate computed field has been created, it can be decided, depending on the case, whether the original fields used in the computation are needed later on. It has to be considered whether these original data could be important for building models in the future, or whether they could have a disturbing effect. The attributes of the raw data used in the computation, or any other arbitrarily selected attributes, can be removed from the original data set as follows:

Figure 4.10. Properties of the operator used for removing the attributes made redundant

Figure 4.11. Selection of the attributes to remain in the dataset with reduced size

Interpretation of the results

After executing these steps, all records will appear in the modified data set, but with a modified attribute set. The newly computed field appears in every record, while the filtered-out attributes disappear:

Figure 4.12. The appearance of the derived attribute in the altered dataset

Video

Workflow: preproc_exp3.rmp

Keywords: derived attribute, attribute creation, attribute removal, attribute subset

Operators: Generate Attributes, Read AML, Select Attributes

4. Discretizing and weighting attributes

Description

The process shows, using a sample of the Individual household electric power consumption data set, how an attribute that takes its values from an interval of real numbers can be discretized, i.e. converted to discrete values that represent defined subintervals of the real interval.
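The Generate Attributes / Select Attributes pattern of the previous section can be sketched in Python. Everything here is assumed for illustration: the field names (`mlow`, `mmid`, `mhigh`) are hypothetical stand-ins for the m-prefixed income-distribution attributes, and the weighting formula is an invented heuristic, not the one used in the workflow.

```python
# Invented record with hypothetical m* income-distribution fields.
record = {"mlow": 40, "mmid": 35, "mhigh": 25, "zip": "1234"}

# Derived attribute: a weighted income score (the weights 1/2/3 are an
# assumed heuristic for compressing the three fields into one).
record["income_score"] = (
    1 * record["mlow"] + 2 * record["mmid"] + 3 * record["mhigh"]
) / 100

# Attribute removal: keep only what later modelling steps need.
reduced = {k: v for k, v in record.items() if k in ("zip", "income_score")}
print(reduced)
```

As in the text, whether to drop the raw `m*` columns is a modelling decision: once removed, they are no longer available to later operators in the process.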
Furthermore, the process shows how weights can be added to the individual data columns when, in a data mining procedure, it is necessary to distinguish the individual attributes by their importance instead of letting all attributes take part in the data mining algorithm, and in the conclusions based on it, with equal weight.

Input

Individual household electric power consumption [UCI MLR]

Output

In the data set, the use of discretization is demonstrated on the variable Global_active_power. This variable represents the total energy consumption of the whole household at a given moment, so, following the changes of the times of day, these values change in a cyclic fashion. Thus, if the total consumption is to be represented with discrete values instead of real numbers in order to be used in a given method, this column can be discretized. Discretization can be done using different operators, by defining the size of the categories (the number of elements in them) or the number of categories; based on this number, either categories of equal width or of equal element count can be created, for example as follows:

Figure 4.13. Selection of the appropriate discretization operator

Figure 4.14. Setting the properties of the discretization operator

Furthermore, for certain methods it has to be defined which attributes have what level of importance in order to obtain a result or decision that meets the requirements later on - the simplest way to do this is weighting. To be able to weight the attributes, the weights themselves have to be created first, and then they have to be applied to the dataset.
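Equal-width binning and normalized weighting can both be sketched briefly in Python. This is a hedged sketch of the two ideas, not of the RapidMiner operators themselves; the power values and weights below are invented.

```python
# Hedged sketch of equal-width binning (cf. Discretize by Binning).
def discretize(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        # Clamp the maximum value into the last bin.
        i = min(int((v - lo) / width), bins - 1)
        labels.append(f"range{i + 1}")
    return labels

power = [0.5, 1.5, 2.5, 3.5]  # invented Global_active_power readings
print(discretize(power, 2))  # ['range1', 'range1', 'range2', 'range2']

# Normalized weighting (cf. Weight by User Specification with the
# "normalize weights" option): the largest weight becomes 1.
weights = {"Global_active_power": 4.0, "Sub_metering_1": 2.0, "Time": 1.0}
top = max(weights.values())
normalized = {k: w / top for k, w in weights.items()}
print(normalized["Sub_metering_1"])  # 0.5
```

Scaling a column by its normalized weight then simply multiplies every value in that column, which is why the columns carrying the largest weight are left unchanged.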
For example, weights can be set manually for this data set with which it can be indicated that the globally measured values are of the greatest importance, the sub-meterings are of less importance, and the date and time values are of the least importance, as follows:

Figure 4.15. Selection of the appropriate weighting operator

Figure 4.16. Defining the weights of the individual attributes

Interpretation of the results

After executing these steps, the value of the variable Global_active_power will be modified in all records of the data set. It can be seen that the division into intervals has been done, and alongside each discrete value, the interval whose values correspond to it is displayed as well. In addition, by comparing the original and the modified data set (after discretization, the weighted data set can be seen on the left and the unweighted one on the right), it can also be seen that the numeric values in the individual columns have been altered according to the weighting. As the normalize weights option is turned on, the greatest weight is taken to be 1; thus the values of the columns to which this weight has been assigned do not change, while the values of the columns with smaller weights decrease proportionally:

Figure 4.17. Comparison of the weighted and unweighted dataset instances

Video

Workflow: preproc_exp4.rmp

Keywords: attribute discretization, attribute weighting, weighting, discretization

Operators: Discretize by Binning, Multiply, Read CSV, Scale by Weights, Weight by User Specification

Chapter 5. Classification Methods 1

Decision Trees

1.
Classification using a decision tree

Description

The process shows, using the Wine data set, how classification can be performed by building a decision tree. To build the decision model, first the data set has to be split into training and test sets. The splitting rules are then organized into a decision tree based on the training set, and afterwards the model created this way is applied to the test set. Later on, it can be checked which decision conditions the model derived from the training set, and to which class the records of the test set have been assigned based on these decisions.

Input

Wine [UCI MLR]

Output

The decisions about the individual splits can be made based on measures such as the Gini index or information gain. For these, and for the confidence level of splits, different parameter values can be defined when creating the decision tree model. Furthermore, the stop conditions of splitting can also be defined, either by specifying the minimal size of record sets that can be split further or by specifying the maximal depth of the tree.

Figure 5.1. Preferences for the building of the decision tree

When splitting the data set, various sampling methods can be chosen, and the ratio in which it should be split into training and test sets can also be specified. Splitting can be done simply based on the order of the records, randomly, or in a stratified way, so that records of each class occur in the training and test sets in the same ratio as in the original data set.

Figure 5.2. Preferences for splitting the dataset into training and test sets

Figure 5.3. Setting the relative sizes of the data partitions

Interpretation of the results

After it has been built, the model itself can also be directed to the output; thus it can be inspected what decision tree has been built from the data of the training set.
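The Gini index mentioned above can be made concrete with a short Python sketch. This is an illustration of the standard impurity measure, not of RapidMiner's internal implementation; the label lists are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini of a candidate split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(gini(["a", "a", "b", "b"]))  # 0.5 (maximally mixed, two classes)
```

A tree learner evaluates `split_gini` for each candidate split and chooses the one with the lowest weighted impurity; information gain plays the same role with entropy in place of Gini impurity.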
Based on this, incidental erroneous decisions can be filtered out using background information or domain knowledge, if available. If such decisions are found, the model-building process can be tuned further. Besides this, by applying the model to the test set, it can also be seen which classes the records of the test set have been assigned to based on the model trained on the training set.

Figure 5.4. Graphic representation of the decision tree created

Figure 5.5. The classification of the records based on the decision tree

Video

Workflow: dtree_exp1.rmp

Keywords: classification, decision tree, splitting

Operators: Apply Model, Decision Tree, Multiply, Read AML, Split Data

2. Under- and overfitting of a classification with a decision tree

Description

The process shows, using the Zoo data set, under which conditions under- and overfitting can appear when performing classification with decision trees. If the decision tree which provides the model has too small a depth, it may be unable to capture the structure of the training set in its entirety and is thus unsuitable for carrying out the classification properly. This is a case of underfitting. However, if the records are split more than required, conclusions can be drawn along the decisions that no longer hold in general, and following this excess of splitting rules, inappropriate decisions can be made - for example in the case of irregular records. This is considered a case of overfitting.

Input

Zoo [UCI MLR]

Output

In this process, operators building similar decision trees are used on the same training set; they differ only in the stop condition defining the maximal depth of the tree. The value of the maximal depth is 3, 6, and 9, respectively.

Figure 5.6.
Setting a threshold for the maximal depth of the decision tree

Interpretation of the results

In accordance with this, decision trees of different depths are created, which thus contain different numbers of splitting conditions, and based on these the records of the test set are classified differently by the different models. If the value of the maximal depth is 3, the following decision tree is received as a result:

Figure 5.7. Graphic representation of the decision tree created

Figure 5.8. Graphic representation of the classification of the records based on the decision tree

It can be seen here that with only 2 rules the 7 possible classes cannot be separated by the model, so this is a clear case of underfitting. If the value of the maximal depth is 6, the following decision tree is received as a result:

Figure 5.9. Graphic representation of the decision tree created with the increased maximal depth

Figure 5.10. Graphic representation of the classification of the records based on the decision tree with increased maximal depth

In this case, only 3 of the records are classified differently from their original labels. However, if the threshold for the maximal depth is increased further, the result will not improve; rather, it will worsen, as the additional rules lead to inappropriate consequences, i.e. this is a case of overfitting. If the value of the maximal depth is 9, the following decision tree is received as a result:

Figure 5.11. Graphic representation of the decision tree created with the further increased maximal depth

Figure 5.12. Graphic representation of the classification of the records based on the decision tree with further increased maximal depth
Video

Workflow: dtree_exp2.rmp

Keywords: classification, decision tree, overfitting, underfitting

Operators: Apply Model, Decision Tree, Multiply, Read AML, Split Data

3. Evaluation of performance for classification by decision tree

Description

The process shows, using the Congressional Voting Records data set, how the quality of a given classification can be evaluated. After the decision tree has been built based on the training set and the test set has been classified using it, the quality of the classification can be examined. Using the evaluation received this way, it can be decided whether the resulting classification is appropriate for the goals of the process, whether the existing model should be improved further, or whether the existing model is of such poor quality that a completely new model is necessary.

Input

Congressional Voting Records [UCI MLR]

Output

The decision tree is built from the data set using the following settings in the process:

Figure 5.13. Preferences for the building of the decision tree

In this case, the following decision tree emerges:

Figure 5.14. Graphic representation of the decision tree created

Interpretation of the results

Using the decision tree created, the records of the test set can be classified, and after the classification, the original class labels can be compared to those assigned based on the decision tree, e.g. using the following figure:

Figure 5.15. Graphic representation of the classification of the records based on the decision tree

Examining the performance of the classifier, the numbers of correctly and incorrectly classified records can be obtained, and the precision of the classification done by the model can be seen as well, displayed in percentages for the individual classes and overall:

Figure 5.16.
Performance vector of the classification based on the decision tree

The question can also be raised in this case whether the performance of the model can be increased further. For example, the minimal required confidence for splits can be raised as follows:

Figure 5.17. The modification of preferences for the building of the decision tree

In this case, as a result of the raised value of the required confidence, the structure of the decision tree will be completely different from that of the original one, and this leads to a change in the numbers and distribution of the records classified appropriately and inappropriately as well. This model yields a better performance than the original one, which can also be seen in the figure:

Figure 5.18. Graphic representation of the decision tree created with the modified preferences

Figure 5.19. Performance vector of the classification based on the decision tree created with the modified preferences

Video
Workflow: dtree_exp3.rmp
Keywords: classification, decision tree, performance evaluation
Operators: Apply Model, Decision Tree, Performance (Classification), Read AML, Split Data

4. Evaluation of performance for classification by decision tree 2

Description
The process shows, using the Congressional Voting Records dataset, how the quality of a given classification can be evaluated. After the decision tree has been built based on the training set, and the test set has been classified using it, the quality of the executed classification can be examined. In some cases, more advanced levels of validation may be necessary; in these cases, e.g. random subsampling, cross-validation, or a special case of the latter, the leave-one-out method can be used.
Using the evaluation received this way, it can be decided whether the resulting classification is appropriate for the goals of the process, whether the existing model should be improved further, or whether, due to the poor performance of the existing model, it should be replaced with a completely new one.

Input
Congressional Voting Records [UCI MLR]

Output
Evaluation can also be done by using a complex validation operator instead of separate operators. In this case, the split ratio of the dataset and the form of sampling can be specified:

Figure 5.20. Settings for the sampling done in the validation operator

This is a complex operator that consists of two subprocesses, which can be defined as follows:

Figure 5.21. Subprocesses of the validation operator

Interpretation of the results

This case is completely identical to the process in the previous example: the split into training and test sets is done, the decision tree built using the training set is applied to the test set, and then its efficiency is evaluated. Here, the following decision tree emerges, which classifies the records of the test set with the following results:

Figure 5.22. Graphic representation of the decision tree created

Figure 5.23. Performance vector of the classification based on the decision tree

If a deeper examination of the given classifier is necessary, subprocesses identical to the ones above can be defined in the operator responsible for cross-validation as well. The operator can be tuned using the following preferences:

Figure 5.24. Settings of the cross-validation operator

Here, it can be defined how many cross-validation iterations should be executed. The dataset is split into as many subsets of equal size as the number of iterations.
Then, each of these subsets is selected in turn to be the test set of an iteration, and the union of all other subsets serves as the training set of the given iteration. A special case of this is the leave-one-out method, which can be used by ticking the appropriate checkbox (leave-one-out). When using this, an iteration is run for each record, in which the given record serves as the test set, and the training set consists of all other records. As can be seen in the figure, the following average performance values are yielded by cross-validation with 10 iterations:

Figure 5.25. Overall performance vector of the classifications done in the cross-validation operator

The following average performance values are yielded by the leave-one-out method:

Figure 5.26. Overall performance vector of the classifications done in the cross-validation operator in the leave-one-out case

Note that in this case, the standard deviation of the precision values of the leave-one-out method is remarkably higher than that of standard cross-validation. This might indicate that irregular records are present, the classification of which is not necessarily accurate, even after learning on all other records.

Video
Workflow: dtree_exp4.rmp
Keywords: classification, decision tree, performance, random subsampling, cross-validation
Operators: Apply Model, Decision Tree, Multiply, Performance (Classification), Read AML, Split Validation, X-Validation

5. Comparison of decision tree classifiers

Description
The process shows, using the Spambase dataset, how the quality and efficiency of multiple classifiers can be compared. After the decision trees of the classifiers have been built based on the training set, the test set can be classified using them, and the quality of the individually executed classifications can be examined.
This can be done separately, measuring the precision of the classifiers one by one, or the analyses can be merged, and the ROC curves of the individual classifiers can be represented in a common figure for a better picture of the differences between the results. Based on the evaluation received this way, it can be decided which classifier suits the requirements of the process, whether a given model should be improved, or whether a given model has to be replaced or removed due to its poor performance.

Input
Spambase [UCI MLR]

Output
Let us create the following two decision tree classifiers based on the training set of the data set:

Figure 5.27. Preferences for the building of the decision tree based on the Gini-index criterion

Figure 5.28. Preferences for the building of the decision tree based on the gain ratio criterion

The classifier using the gain ratio builds the following decision tree:

Figure 5.29. Graphic representation of the decision tree created based on the gain ratio criterion

When applied to the test set, this decision tree yields the following performance values:

Figure 5.30. Performance vector of the classification based on the decision tree built using the gain ratio criterion

In contrast, the classifier using the Gini-index builds the following decision tree:

Figure 5.31. Graphic representation of the decision tree created based on the Gini-index criterion

When applied to the test set, this decision tree yields the following performance values:

Figure 5.32. Performance vector of the classification based on the decision tree built using the Gini-index criterion

Interpretation of the results

It can be seen that the performance of the classifier utilizing the Gini-index is better than that of the classifier based on the gain ratio.
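The two splitting criteria can be contrasted on a toy split. The sketch below uses illustrative numbers only, not the Spambase data: it computes the weighted Gini index of a candidate split and its gain ratio, i.e. the information gain normalized by the split information.

```python
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_scores(parent, partitions):
    n = len(parent)
    # Weighted Gini index of the children: lower is better.
    gini_split = sum(len(p) / n * gini(p) for p in partitions)
    # Gain ratio: information gain divided by the split information.
    gain = entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions)
    return gini_split, gain / split_info

# Toy class labels: 10 records split into two halves by a binary attribute.
parent = ["spam"] * 5 + ["ham"] * 5
left = ["spam"] * 4 + ["ham"]
right = ["spam"] + ["ham"] * 4
gini_split, gain_ratio = split_scores(parent, [left, right])
print(gini_split, gain_ratio)
```

A decision tree learner would compute such scores for every candidate split and pick the best one; the two criteria can rank candidate splits differently, which is why the resulting trees differ.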
However, on one hand, the difference between individual models is not this obvious in all cases, and on the other hand, for the sake of simplification, or to avoid the differences caused by sampling, the evaluation of the individual models can be merged into a complex operator, and thus the ROC curves of their precision can also be displayed in a single figure, for example as follows:

Figure 5.33. Settings of the operator for the comparison of ROC curves

Figure 5.34. Subprocess of the operator for the comparison of ROC curves

In this case, an arbitrary number of model building operators can be placed in the complex operator, thus the precision of this arbitrary number of models can be examined for the same data set at once. However, it is advisable to use a local random seed to ensure that the comparison is repeatable, as this ensures that for any execution, the records will be split into training and test sets in the same manner.

Figure 5.35. Comparison of the ROC curves of the two decision tree classifiers

Based on the ROC curves, it is obvious that the classifier based on the Gini-index has a much higher precision than the classifier using the gain ratio, as its ROC curve bends more towards the point (0,1) from the diagonal between the points (0,0) and (1,1).

Video
Workflow: dtree_exp5.rmp
Keywords: classification, decision tree, performance comparison, ROC curve
Operators: Apply Model, Compare ROCs, Decision Tree, Multiply, Performance, Read AML, Split Data

Chapter 6. Classification Methods 2
Rule-Based Classifiers

1. Using a rule-based classifier (1)

Description
In this experiment a rule-based classifier is trained on the Zoo data set.

Input
Zoo [UCI MLR]

Output
Figure 6.1. The rule set of the rule-based classifier trained on the data set.

Figure 6.2.
The classification accuracy of the rule-based classifier on the data set.

Interpretation of the results

The second figure shows that the rule-based classifier perfectly classifies all training examples.

Video
Workflow: rules_exp1.rmp
Keywords: rule-based classifier, supervised learning, classification
Operators: Apply Model, Map, Performance (Classification), Read AML, Rule Induction, Subprocess

2. Using a rule-based classifier (2)

Description
This experiment investigates the performance of the rule-based classifier on the Zoo data set. The data set is split into a training and a test set: half of the examples are used to form the training set, and the rest are used for testing. The classification accuracies on both the training and the test sets are determined for the rule-based classifier.

Input
Zoo [UCI MLR]

Output
Figure 6.3. The rule set of the rule-based classifier.

Figure 6.4. The classification accuracy of the rule-based classifier on the training set.

Figure 6.5. The classification accuracy of the rule-based classifier on the test set.

Interpretation of the results

The second figure shows that the rule-based classifier perfectly classifies all training examples. The third figure shows that the rule-based classifier correctly classifies all but 6 of the 50 test examples.

Video
Workflow: rules_exp2.rmp
Keywords: rule-based classifier, supervised learning, classification
Operators: Apply Model, Map, Performance (Classification), Read AML, Rule Induction, Split Data, Subprocess

3. Transforming a decision tree to an equivalent rule set

Description
The process demonstrates the use of the Tree to Rules operator that transforms a decision tree to an equivalent rule-based classifier. The experiment uses a decision tree built on the Zoo data set.

Input
Zoo [UCI MLR]

Output
Figure 6.6.
The decision tree built on the data set.

Figure 6.7. The rule set equivalent of the decision tree.

Figure 6.8. The classification accuracy of the rule-based classifier on the data set.

Interpretation of the results

It is apparent from the first and second figures that each rule in the rule set corresponds to a branch of the decision tree from the root to a leaf node. The third figure shows that the rule-based classifier (and thus also the decision tree) perfectly classifies all examples.

Video
Workflow: rules_exp3.rmp
Keywords: decision tree, rule-based classifier, supervised learning, classification
Operators: Apply Model, Decision Tree, Map, Multiply, Performance (Classification), Read AML, Subprocess, Tree to Rules

Chapter 7. Classification Methods 3
Regression

1. Linear regression

Description
The process shows, using the Wine dataset, how a regression model can be fitted to a given dataset. Classification can also be done based on a regression model; however, this process shows that creating the regression model by itself is insufficient for this. Based on the regression model, approximate values for numerical labels can be computed, but these values are not assigned to concrete class labels. Apart from this, it can be stated that, similarly to other classification methods, the data set has to be split into training and test sets, and the regression model created using the training set is to be applied to the test set.

Input
Wine [UCI MLR]

Output
When creating the regression model, one can choose from among various types of regression, such as linear regression or logistic regression. Of these, linear regression is utilized in the process. In this form, for example, it can be defined which method should be used for attribute selection, or what the level of minimal tolerance should be.
The linear regression model created this way can be applied to the test set.

Figure 7.1. Properties of the linear regression operator

The following regression model is created based on the data of the training set:

Figure 7.2. The linear regression model yielded as a result

Interpretation of the results

Using the regression model created based on the records of the training set, approximate values can be calculated for the labels of the individual test records. These approximate values can be seen in the labelled data set yielded by the model application:

Figure 7.3. The class prediction values calculated based on the linear regression model

It can be seen that most of the approximate values yield a rather good estimate and take a value that is close to the original label, but this by itself is insufficient to complete the classification task. In order to be able to classify records based on a regression model, its estimates have to be assigned to class labels.

Video
Workflow: regr_exp1.rmp
Keywords: classification, regression
Operators: Apply Model, Linear Regression, Read AML, Split Data

2. Classification with linear regression

Description
The process shows, using the Wine dataset, how a regression model can be fitted to a given dataset, and then how a classification task can be completed based on the received estimates. Classification can also be done based on a regression model; in this case, approximate values for numerical labels can be defined based on the regression model, and afterwards, these values can be assigned to concrete class labels. Similarly to other classification methods, the data set has to be split into training and test sets, and the regression model created using the training set is to be applied to the test set.
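The idea behind classification by regression can be sketched in a few lines: fit a least-squares line on a single attribute and then map each numeric prediction to the nearest class label. The data below are made-up values loosely mimicking one Wine attribute against the numeric class label (1 to 3); only the mechanism is meant to carry over, not the numbers.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b with a single attribute.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def predict_class(model, x, classes=(1, 2, 3)):
    a, b = model
    yhat = a * x + b
    # Assign the numeric prediction to the nearest class label.
    return min(classes, key=lambda c: abs(c - yhat))

# Hypothetical attribute values and their numeric class labels.
xs = [12.0, 12.5, 13.0, 13.4, 13.8, 14.2]
ys = [1, 1, 2, 2, 3, 3]
model = fit_line(xs, ys)
print([predict_class(model, x) for x in xs])
```

Rounding the raw prediction to the nearest label is the simplest assignment rule; RapidMiner's Classification by Regression operator generalizes this by deriving confidence values per class.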
Input
Wine [UCI MLR]

Output
When creating the regression model, one can choose from among various types of regression, such as linear regression or logistic regression. Of these, linear regression is utilized in the process. In order to be able to use it for classification, it has to be placed into an operator that implements regression-based classification. Identically to when the operator is used by itself, it can be defined, for example, which method should be used for attribute selection, or what the level of minimal tolerance should be. The linear regression model created this way can be applied to the test set.

Figure 7.4. The subprocess of the classification by regression operator

The following regression model is created based on the data of the training set:

Figure 7.5. The linear regression model yielded as a result

Interpretation of the results

Using the regression model created based on the records of the training set, confidence values can be calculated regarding the probabilities of the individual test records belonging to the given classes. These confidence values, and the class assignments created based on them, can be seen in the labelled data set yielded by the model application:

Figure 7.6. The class labels derived from the predictions calculated based on the regression model

It can be seen that based on the approximate values, the assignments are done correctly, and are equal to the original labels in most cases.

Video
Workflow: regr_exp2.rmp
Keywords: classification, regression
Operators: Apply Model, Classification by Regression, Linear Regression, Read AML, Split Data

3. Evaluation of performance for classification by regression model
Description
The process shows, using the Spambase dataset, how the quality and precision of a classification created based on a regression model fitted to a given data set can be evaluated. After the regression model has been built based on the training set, and the test set has been classified using it, the quality of the executed classification can be examined. Using the evaluation received this way, it can be decided whether the resulting classification is appropriate for the goals of the process, whether the existing model should be improved further, or whether the existing model is of such poor quality that a completely new model is necessary.

Input
Spambase [UCI MLR]

Output
After creating the regression model, in order to be able to use it for classification, it has to be placed into an operator that implements regression-based classification. Similarly to when the operator is used individually, it can be defined, for example, which method should be used for attribute selection, or what the level of minimal tolerance should be. The linear regression model created this way can be applied to the test set.

Figure 7.7. The subprocess of the classification by regression operator

The following regression model is created based on the data of the training set:

Figure 7.8. The linear regression model yielded as a result

Interpretation of the results

Using the regression model created based on the records of the training set, confidence values can be calculated regarding the probabilities of the individual test records belonging to the given classes. Based on these confidence values, class labels are assigned to the individual records of the test set. Accordingly, it can be evaluated how many records have been classified successfully based on the regression model:

Figure 7.9.
The performance vector of the classification based on the regression model

Video
Workflow: regr_exp3.rmp
Keywords: classification, regression, performance evaluation
Operators: Apply Model, Classification by Regression, Linear Regression, Performance (Classification), Read AML, Split Data

4. Evaluation of performance for classification by regression model 2

Description
The process shows, using the Wine dataset, how the quality and precision of a classification created based on a regression model fitted to a given data set can be evaluated. After the regression model has been built based on the training set, and the test set has been classified using it, the quality of the executed classification can be examined. In some cases, more advanced levels of validation may be necessary; in these cases, e.g. random subsampling, cross-validation, or a special case of the latter, the leave-one-out method can be used. Using the evaluation received this way, it can be decided whether the resulting classification is appropriate for the goals of the process, whether the existing model should be improved further, or whether the existing model is of such poor quality that a completely new model is necessary.

Input
Wine [UCI MLR]

Output
Evaluation can also be done by using a complex validation operator instead of separate operators. In this case, as the regression model has to be placed into an operator that implements regression-based classification, and this operator has to be placed into the operator of complex evaluation, the result is a process that contains embedded operators on multiple levels:

Figure 7.10. The subprocess of the cross-validation by regression operator

Figure 7.11.
The subprocess of the classification by regression operator

Similarly to when the operator is used individually, it can be defined, for example, which method should be used for attribute selection, or what the level of minimal tolerance should be. The linear regression model created this way can be applied to the test set. The following regression model is created based on the data of the training set:

Figure 7.12. The linear regression model yielded as a result

Interpretation of the results

If a deeper examination of the given classifier is necessary, subprocesses identical to the ones above can be defined in the operator responsible for cross-validation as well. The operator can be tuned using the following preferences:

Figure 7.13. The customizable properties of the cross-validation operator

Here, it can be defined how many cross-validation iterations should be executed. The dataset is split into as many subsets of equal size as the number of iterations. Then, each of these subsets is selected in turn to be the test set of an iteration, and the union of all other subsets serves as the training set of the given iteration. A special case of this is the leave-one-out method, which can be used by ticking the appropriate checkbox (leave-one-out). When using this, an iteration is run for each record, in which the given record serves as the test set, and the training set consists of all other records. As can be seen in the figure, the following average performance values are yielded by cross-validation with 10 iterations:

Figure 7.14. The overall performance vector of the classifications done using the regression model defined in the cross-validation operator

The following average performance values are yielded by the leave-one-out method:

Figure 7.15.
The overall performance vector of the classifications done using the regression model defined in the cross-validation operator for the case of using the leave-one-out method

Note that in this case, the standard deviation of the precision values of the leave-one-out method is remarkably higher than that of standard cross-validation. This might indicate that irregular records are present, the classification of which is not necessarily accurate, even after learning on all other records.

Video
Workflow: regr_exp4.rmp
Keywords: classification, regression, performance, cross-validation
Operators: Apply Model, Classification by Regression, Linear Regression, Performance (Classification), Read AML, X-Validation

Chapter 8. Classification Methods 4
Neural Networks and Support Vector Machines

1. Using a perceptron for solving a linearly separable binary classification problem

Description
In this experiment a perceptron is trained on a linearly separable two-dimensional data set consisting of two classes, which is a subset of the Wine data set. The classification accuracy of the perceptron is determined on the data set.

Input
Figure 8.1. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).

Output
Figure 8.2. The decision boundary of the perceptron.

Figure 8.3. The classification accuracy of the perceptron on the data set.

Interpretation of the results

The second figure shows that the perceptron perfectly classifies all training examples.

Video
Workflow: ann_exp1.rmp
Keywords: perceptron, supervised learning, classification
Operators: Apply Model, Filter Examples, Perceptron, Performance (Classification), Read CSV, Remove Unused Values

2.
Using a feed-forward neural network for solving a classification problem

Description
In this experiment a two-layer feed-forward neural network with 2 hidden neurons is trained on the Sonar, Mines vs. Rocks data set.

Input
Sonar, Mines vs. Rocks [UCI MLR]

Output
Figure 8.4. The classification accuracy of the neural network on the data set.

Interpretation of the results

The figure shows that the neural network perfectly classifies all but one of the training examples.

Video
Workflow: ann_exp2.rmp
Keywords: feed-forward neural network, supervised learning, classification
Operators: Apply Model, Neural Net, Performance (Classification), Read CSV

3. The influence of the number of hidden neurons on the performance of the feed-forward neural network

Description
In this experiment two-layer feed-forward neural networks with different numbers of hidden neurons are trained on the Sonar, Mines vs. Rocks data set. The average classification error rate from 10-fold cross-validation is determined for each neural network. The main contribution of the experiment is that it shows how to change the value of a list type parameter of an operator (in our case, the hidden layers parameter of the Neural Net operator) in a loop using a macro. To obtain a reasonable execution time, only neural networks with the following numbers of hidden neurons are considered: 1, 2, 4, 8, 16.

Input
Sonar, Mines vs. Rocks [UCI MLR]

Output
Figure 8.5. The average classification error rate obtained from 10-fold cross-validation against the number of hidden neurons.

Interpretation of the results

The figure shows that the best average classification error rate (14.5%) is achieved when the number of hidden neurons is 8.
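How the hidden layer size drives model capacity can be made concrete with a small sketch. This is illustrative only, not the Neural Net operator itself, and the tiny hand-set weights below are hypothetical: the parameter count of a two-layer network grows linearly with the number of hidden neurons, and the forward pass combines sigmoid hidden units.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, output_w):
    # hidden_w and output_w: one (bias, weight list) pair per neuron.
    h = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x))) for b, ws in hidden_w]
    return [sigmoid(b + sum(w * hi for w, hi in zip(ws, h))) for b, ws in output_w]

def param_count(n_in, n_hidden, n_out):
    # Each hidden neuron has n_in weights + 1 bias; each output n_hidden + 1.
    return n_hidden * (n_in + 1) + n_out * (n_hidden + 1)

# Sonar has 60 attributes and 2 classes; hidden layer sizes as in the loop.
for h in (1, 2, 4, 8, 16):
    print(h, param_count(60, h, 2))

# A tiny hand-set network with 2 inputs, 2 hidden neurons, and 1 output.
hidden = [(0.0, [1.0, -1.0]), (0.5, [-1.0, 1.0])]
output = [(0.0, [2.0, -2.0])]
print(forward([1.0, 0.0], hidden, output))
```

The parameter counts make the bias-variance trade-off of the experiment tangible: with 16 hidden neurons the network already has nearly a thousand weights for about 200 Sonar examples, which is one plausible reason why the error rate stops improving beyond 8 hidden neurons.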
Video
Workflow: ann_exp3.rmp
Keywords: feed-forward neural network, supervised learning, error rate, classification, cross-validation
Operators: Apply Model, Guess Types, Log, Log to Data, Loop Values, Neural Net, Performance (Classification), Print to Console, Provide Macro as Log Value, Read CSV, X-Validation, Execute Script (R) [R Extension]

4. Using a linear SVM for solving a linearly separable binary classification problem

Description
In this experiment a linear SVM is trained on a linearly separable two-dimensional data set consisting of two classes, which is a subset of the Wine data set. The classification accuracy of the linear SVM is determined on the data set.

Input
Figure 8.6. A linearly separable subset of the Wine data set [UCI MLR] used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected).

Output
Figure 8.7. The kernel model of the linear SVM.

Figure 8.8. The classification accuracy of the linear SVM on the data set.

Interpretation of the results

The figure shows that the linear SVM perfectly classifies all training examples.

Video
Workflow: svm_exp1.rmp
Keywords: SVM, supervised learning, classification
Operators: Apply Model, Filter Examples, Performance (Classification), Read CSV, Remove Unused Values, Support Vector Machine (LibSVM)

5. The influence of the parameter C on the performance of the linear SVM (1)

Description
The process demonstrates the influence of the parameter C on the performance of the linear SVM. Linear SVMs are trained on a subset of the Wine data set while the value of the parameter C is increased from 0.001 to 100. The classification error rate on the training set and also the number of support vectors are determined for each SVM.

Input
A subset of the Wine data set [UCI MLR].

Figure 8.9.
A subset of the Wine data set used in the experiment (2 of the total of 3 classes and 2 of the total of 13 attributes were selected). Note that the classes are not linearly separable.

Output
Figure 8.10. The classification error rate of the linear SVM against the value of the parameter C.

Figure 8.11. The number of support vectors against the value of the parameter C.

Interpretation of the results

The first figure shows that the classification error rate quickly falls below 6% as the value of the parameter C is increased, and then it remains constant. The second figure shows that the number of support vectors decreases similarly with the increasing value of the parameter C, although not as rapidly as the classification error rate.

Video
Workflow: svm_exp2.rmp
Keywords: SVM, supervised learning, error rate, classification
Operators: Apply Model, Filter Examples, Log, Loop Parameters, Normalize, Performance (Classification), Performance (Support Vector Count), Read CSV, Remove Unused Values, Support Vector Machine (LibSVM)

6. The influence of the parameter C on the performance of the linear SVM (2)

Description
The process demonstrates the influence of the parameter C on the average classification error rate of the linear SVM in the case of the Heart Disease data set. We consider linear SVMs with different C values, each of which is an integer power of 2: C = 2^n, where -13 <= n <= 6. The average classification error rate from 10-fold cross-validation is determined for each linear SVM.

Input
Heart Disease [UCI MLR]

Note
The data set was donated to the UCI Machine Learning Repository by R. Detrano [Detrano et al.].

Output
Figure 8.12.
The average classification error rate of the linear SVM obtained from 10-fold cross-validation against the value of the parameter C, where the horizontal axis is scaled logarithmically.

Interpretation of the results

The figure shows that the average classification error rate is minimal when the value of the parameter C is 2^-8. Larger values of C result in a slightly worse average classification performance; however, values closer to zero give a worse result. Thus, the performance of the linear SVM does not seem to be very sensitive to the value of the parameter C in this case.

Video
Workflow: svm_exp3.rmp
Keywords: SVM, supervised learning, error rate, classification, cross-validation
Operators: Apply Model, Filter Examples, Log, Loop Parameters, Map, Normalize, Performance (Classification), Read CSV, Support Vector Machine (LibSVM), X-Validation

7. The influence of the parameter C on the performance of the linear SVM (3)

Description
In this experiment linear SVMs are trained on the Spambase data set while the value of the parameter C is varied. We use integer powers of 2 as values of the parameter: C = 2^n, where -8 <= n <= 5. The data set is split into a training and a test set: 60% of the examples are used to form the training set, and the rest are used for testing. The classification error rates on both the training and the test sets and also the number of support vectors are determined for each SVM.

Input
Spambase [UCI MLR]

Output
Figure 8.13. The classification error rate of the linear SVM on the training and the test sets against the value of the parameter C.

Figure 8.14. The number of support vectors against the value of the parameter C.

Interpretation of the results

The first figure shows that the classification error rate on the training set decreases as the value of the parameter C increases.
As the value of the parameter C is increased, the error rate on the test set also decreases until C reaches 2. However, increasing the value of the parameter further causes a slight increase in the test error. The second figure shows that the number of support vectors falls by about 50% while the value of the parameter C is increased from 2^-8 to 8. Increasing the value of the parameter further causes a slight increase in the number of support vectors.

Video

Workflow

svm_exp4.rmp

Keywords

SVM
supervised learning
error rate
classification

Operators

Apply Model
Log
Log to Data
Loop Parameters
Normalize
Performance (Classification)
Performance (Support Vector Count)
Read CSV
Split Data
Support Vector Machine (LibSVM)

8. The influence of the number of training examples on the performance of the linear SVM

Description

The process demonstrates the influence of the number of training examples on the performance of the linear SVM in the case of the Adult (LIBSVM) data set. The number of training examples is increased in the experiment, and an SVM is trained in each step. The following performance characteristics are determined for each of the SVMs:

• the classification error rate on the training set,
• the classification error rate on the corresponding test set,
• the number of support vectors,
• the CPU execution time needed to train the linear SVM.

Input

A discretized and binarized version of the Adult data set [UCI MLR] available at the LIBSVM website [LIBSVM].

Output

Figure 8.15. The classification error rate of the linear SVM on the training and the test sets against the number of training examples.

Figure 8.16. The number of support vectors against the number of training examples.

Figure 8.17. CPU execution time needed to train the SVM against the number of training examples.
Interpretation of the results

The first figure shows that the classification error rates on the training and test sets are roughly the same, independently of the number of training examples. The second and the third figures show that both the number of support vectors and the CPU execution time increase linearly with the number of training examples.

Video

Workflow

svm_exp5.rmp

Keywords

SVM
supervised learning
error rate
classification
cross-validation

Operators

Apply Model
Extract Macro
Generate Attributes
Log
Log to Data
Loop Files
Normalize
Parse Numbers
Performance (Classification)
Performance (Support Vector Count)
Provide Macro as Log Value
Read Sparse
Remove Duplicates
Sort
Support Vector Machine (LibSVM)

9. Solving the two spirals problem by a nonlinear SVM

Description

In this experiment a nonlinear SVM is trained to solve the Two Spirals problem, a linearly non-separable classification problem originally developed for benchmarking neural networks. The classification accuracy of the nonlinear SVM is determined on the data set.

Input

Two Spirals [Two Spirals]

Figure 8.18. The Two Spirals data set

Figure 8.19. The R code that produces the data set and is executed by the Execute Script (R) operator of the R Extension.

i <- 0:96
angle <- i * pi / 16
radius <- 6.5 * (104 - i) / 104
x <- radius * cos(angle)
y <- radius * sin(angle)
spirals <- data.frame(rbind(cbind(x, y, 0), cbind(-x, -y, 1)))
names(spirals) <- c("x", "y", "class")
spirals <- transform(spirals, class = factor(class))
spirals.label <- "class"

Note: The R code that produces the data set is based on a SAS code snippet in [Neural Network FAQ].

Output

Figure 8.20. The classification accuracy of the nonlinear SVM on the data set.

Interpretation of the results

The figure shows that the nonlinear SVM perfectly classifies all training examples.
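The nonlinearity of such an SVM comes from its kernel. As a point of reference, the Gaussian (RBF) kernel commonly used with LibSVM can be sketched in a few lines of Python; this illustrates the kernel formula only and is not part of the workflow:

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

# A point is always maximally similar to itself:
rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.5)   # -> 1.0

# Similarity decays with squared distance, controlled by gamma:
near = rbf_kernel((0.0, 0.0), (1.0, 0.0), gamma=0.5)
far = rbf_kernel((0.0, 0.0), (3.0, 0.0), gamma=0.5)
```

With such a kernel, points of the two spirals become separable in the implicit feature space even though they are not separable in the plane.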
Video

Workflow

svm_exp6.rmp

Keywords

SVM
supervised learning
linearly non-separable
classification

Operators

Apply Model
Performance (Classification)
Support Vector Machine (LibSVM)
Execute Script (R) [R Extension]

10. The influence of the kernel width parameter on the performance of the RBF kernel SVM

Description

In this experiment RBF kernel SVMs are trained on the Pima Indians Diabetes data set with different kernel width parameter (gamma) values. The value of this parameter is increased from 0.001 to 5, while the value of the parameter C is fixed at 1 to obtain comparable results. The data set is split into a training and a test set: 75% of the examples are used to form the training set, and the rest are used for testing. The classification error rates on both the training and the test sets are determined for each SVM.

Input

Pima Indians Diabetes [UCI MLR]

Output

Figure 8.21. The classification error rates of the SVM on the training and the test sets against the value of the RBF kernel width parameter.

Interpretation of the results

The value of the RBF kernel width parameter can be chosen such that the SVM perfectly classifies all training examples. Unfortunately, the model does not perform well on the test data: apparently, overfitting occurs here. It should be noted that the linear SVM does not perform as well on the training set; its classification error rate is around 20%.

Video

Workflow

svm_exp7.rmp

Keywords

SVM
supervised learning
error rate
classification

Operators

Apply Model
Log
Loop Parameters
Normalize
Performance (Classification)
Read CSV
Split Data
Support Vector Machine (LibSVM)

11.
Search for optimal parameter values of the RBF kernel SVM

Description

In this experiment RBF kernel SVMs are trained on the Ionosphere data set while both the value of the parameter gamma of the RBF kernel and the value of the parameter C are varied. The average classification error rate from 10-fold cross-validation is determined for each SVM. As a result, the parameter values yielding the best average classification error rate are returned. The following parameter values are considered for C and gamma: C = 2^n, where -5 <= n <= 6, and gamma = 2^m, where -10 <= m <= 4. Thus, the total number of parameter value combinations considered is 180.

Input

Ionosphere [UCI MLR]

Output

Figure 8.22. The optimal parameter values for the RBF kernel SVM.

Figure 8.23. The classification accuracy of the RBF kernel SVM trained on the entire data set using the optimal parameter values.

Interpretation of the results

The first figure shows that the best average classification error rate is achieved when the value of the parameter C is 16 and the value of the parameter gamma is 0.015625. Note that these parameter values cannot be considered the global optimum of the average classification error rate, since they were obtained by performing a grid search that examines only a few points of the search space. The second figure shows that the RBF kernel SVM trained on the entire data set using the optimal parameter values performs very well.

Video

Workflow

svm_exp8.rmp

Keywords

SVM
supervised learning
error rate
classification
cross-validation
parameter optimization

Operators

Apply Model
Log
Multiply
Normalize
Optimize Parameters (Grid)
Performance (Classification)
Read CSV
Set Parameters
Support Vector Machine (LibSVM)
X-Validation

12.
Using an SVM for solving a multi-class classification problem

Description

In this experiment a linear SVM is trained on a data set consisting of three classes. The classification accuracy of the linear SVM is determined on the data set.

Input

Wine [UCI MLR]

Output

Figure 8.24. The kernel model of the linear SVM.

Figure 8.25. The classification accuracy of the linear SVM on the data set.

Interpretation of the results

The second figure shows that the linear SVM perfectly classifies all training examples.

Video

Workflow

svm_exp9.rmp

Keywords

SVM
supervised learning
classification

Operators

Apply Model
Normalize
Performance (Classification)
Read CSV
Support Vector Machine (LibSVM)

13. Using an SVM for solving a regression problem

Description

The process demonstrates how to use an SVM for solving a regression problem. In this experiment RBF kernel SVMs are trained on the Concrete Compressive Strength data set while the value of the parameter gamma of the RBF kernel is varied. To obtain comparable results, the value of the parameter C is fixed at 10. The average RMS error from 10-fold cross-validation is determined for each SVM. As a result, the gamma value yielding the best average RMS error is returned. Using this value for the parameter gamma, an RBF kernel SVM is trained on the entire data set; this model is referred to as the “optimal RBF kernel SVM” below.

Input

Concrete Compressive Strength [UCI MLR] [Concrete]

Output

Figure 8.26. The optimal value of the gamma parameter for the RBF kernel SVM.

Figure 8.27. The average RMS error of the RBF kernel SVM obtained from 10-fold cross-validation against the value of the parameter gamma, where the horizontal axis is scaled logarithmically.

Figure 8.28. The kernel model of the optimal RBF kernel SVM.

Figure 8.29.
Predictions provided by the optimal RBF kernel SVM against the observed values of the dependent variable.

Interpretation of the results

The first figure shows that the best average RMS error is achieved when the value of the parameter gamma is 2^-2 = 0.25. The second figure shows that the average RMS error decreases with the increasing value of the parameter gamma until it reaches its minimum. However, further increase of the value of the parameter gamma results in the degradation of the performance, i.e., model overfitting occurs.

Video

Workflow

svm_regr_exp1.rmp

Keywords

SVM
supervised learning
RMS error
regression
cross-validation
parameter optimization

Operators

Apply Model
Log
Multiply
Normalize
Optimize Parameters (Grid)
Performance (Regression)
Read Excel
Set Parameters
Support Vector Machine (LibSVM)
X-Validation

Chapter 9. Classification Methods 5
Ensemble Methods

1. Introducing ensemble methods: the bagging algorithm

Description

The experiment introduces the use of ensemble methods, featuring the Bagging operator. The average classification error rate from 10-fold cross-validation on the Heart Disease data set is compared for a single decision stump and an ensemble of 10 decision stumps trained by bagging. The impurity measure used for the decision stumps is the gain ratio.

Input

Heart Disease [UCI MLR]

Note: The data set was donated to the UCI Machine Learning Repository by R. Detrano [Detrano et al.].

Output

Figure 9.1. The average classification error rate of a single decision stump obtained from 10-fold cross-validation.

Figure 9.2. The average classification error rate of the bagging algorithm obtained from 10-fold cross-validation, where 10 decision stumps were used as base classifiers.
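The scheme compared here, training each base classifier on a bootstrap replicate of the data and combining predictions by voting, can be sketched in plain Python; the helpers below are illustrative and are not the Bagging operator itself:

```python
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Draw len(examples) items with replacement (one bootstrap replicate)."""
    return [rng.choice(examples) for _ in examples]

def majority_vote(predictions):
    """Combine base-classifier predictions by plurality voting."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))
replicate = bootstrap_sample(data, rng)      # same size, duplicates allowed
vote = majority_vote(["yes", "no", "yes"])   # -> "yes"
```

Each decision stump would be trained on its own replicate, and the ensemble prediction for an example is the majority vote of the stumps.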
Interpretation of the results

An ensemble of 10 decision stumps trained by bagging gives an average classification error rate that is about 7% better than that of a single decision stump.

Video

Workflow

ensemble_exp1.rmp

Keywords

bagging
ensemble methods
supervised learning
error rate
cross-validation
classification

Operators

Apply Model
Bagging
Decision Stump
Map
Multiply
Performance (Classification)
Read CSV
X-Validation

2. The influence of the number of base classifiers on the performance of bagging

Description

This process demonstrates the influence of the number of base classifiers on the classification error rate of bagging in the case of the Heart Disease data set. The base classifiers are decision stumps, and the impurity measure used is the gain ratio. The number of base classifiers is increased from 1 to 20 in the experiment, and the average classification error rate of bagging from 10-fold cross-validation is determined in each step.

Input

Heart Disease [UCI MLR]

Note: The data set was donated to the UCI Machine Learning Repository by R. Detrano [Detrano et al.].

Output

Figure 9.3. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers.

Interpretation of the results

The figure shows that the best average classification error rate (21.4%) is achieved when the number of base classifiers is 14.

Video

Workflow

ensemble_exp2.rmp

Keywords

bagging
ensemble methods
supervised learning
error rate
cross-validation
classification

Operators

Apply Model
Bagging
Decision Stump
Log
Loop Parameters
Map
Performance (Classification)
Read CSV
X-Validation

3. The influence of the number of base classifiers on the performance of the AdaBoost method

Description
The process demonstrates the influence of the number of base classifiers on the classification error rate of the AdaBoost method in the case of the Heart Disease data set. The base classifiers are decision stumps, and the impurity measure used is the gain ratio. The number of base classifiers is increased from 1 to 20 in the experiment, and the average classification error rate of the AdaBoost method from 10-fold cross-validation is determined in each step.

Note: The experiment is the same as the previous one; the only difference is that the AdaBoost operator is used instead of the Bagging operator.

Input

Heart Disease [UCI MLR]

Note: The data set was donated to the UCI Machine Learning Repository by R. Detrano [Detrano et al.].

Output

Figure 9.4. The average classification error rate obtained from 10-fold cross-validation against the number of base classifiers.

Interpretation of the results

The figure shows that the best average classification error rate (22.7%) is achieved when the number of base classifiers is 3. It is also apparent that increasing the number of base classifiers does not degrade the performance, which remains constant instead; thus, somewhat surprisingly, model overfitting does not occur. Note that the best performance obtained is almost identical to that of bagging, but requires fewer base classifiers. Moreover, the performance behaves more predictably than in the case of bagging.

Video

Workflow

ensemble_exp3.rmp

Keywords

AdaBoost
ensemble methods
supervised learning
error rate
cross-validation
classification

Operators

AdaBoost
Apply Model
Decision Stump
Log
Loop Parameters
Map
Performance (Classification)
Read CSV
X-Validation

4.
The influence of the number of base classifiers on the performance of the random forest

Description

The process demonstrates the influence of the number of base classifiers on the classification error rate of the random forest in the case of the Heart Disease data set. The number of base classifiers (i.e., decision trees) is increased from 1 to 20 in the experiment, and the average classification error rate of the random forest from 10-fold cross-validation is determined in each step. The impurity measure used for the decision trees is the gain ratio.

Note: The experiment is the same as the previous two; the only difference is that the Random Forest operator is used here instead of the Bagging and the AdaBoost operators.

Input

Heart Disease [UCI MLR]

Note: The data set was donated to the UCI Machine Learning Repository by R. Detrano [Detrano et al.].

Output

Figure 9.5. The average error rate of the random forest obtained from 10-fold cross-validation against the number of base classifiers.

Interpretation of the results

The figure shows that the best average classification error rate (19.1%) is achieved when the number of base classifiers is 10. Note that the best performance obtained is slightly better than that of AdaBoost (22.7%), but requires more base classifiers. Moreover, the performance of AdaBoost behaves more predictably than that of the random forest.

Video

Workflow

ensemble_exp4.rmp

Keywords

random forest
ensemble methods
supervised learning
error rate
cross-validation
classification

Operators

Apply Model
Log
Loop Parameters
Map
Performance (Classification)
Random Forest
Read CSV
X-Validation

Chapter 10. Association rules

1.
Extraction of association rules

Description

The process shows, using the Extended Bakery dataset, how association rules can be extracted from a transactional dataset. The emphasis is on the items that are present in a given transaction, not on those that are missing from it. If such a transactional dataset is in an uncompressed sparse matrix representation, i.e. every record contains a binomial value for each of the possible items, the extraction of association rules can be executed without any complex transformation; the only thing to keep in mind is that the attributes representing the individual items should be of binomial type. From these, the frequent item sets can be extracted, and based on them, the association rules valid for the dataset can be derived.

Input

Extended Bakery [Extended Bakery]

Output

Using the FP-Growth algorithm on the version of the dataset that contains 20000 records, the following frequent item sets are created:

Figure 10.1. List of the frequent item sets generated

Interpretation of the results

Based on these frequent item sets, the appropriate association rules can be created. The criteria that a rule must meet to be considered valid can be configured: by default, a required level of confidence can be set, but filtering can be based on other measures as well. Using the resulting rules, deeper conclusions can be drawn regarding the connections within the data. Among other things, the table representation of the rules can aid this, as in this representation different kinds of filters can be applied to select the rules considered interesting, for example by outcome or by confidence level:

Figure 10.2. List of the association rules generated
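The two measures behind such rules, support and confidence, can be computed directly; the following Python sketch uses a hypothetical basket list, not the Extended Bakery data:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = support(A union B) / support(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

baskets = [{"bread", "coffee"}, {"bread", "tea"},
           {"bread", "coffee", "cake"}, {"tea"}]
support({"bread"}, baskets)                  # -> 0.75 (3 of 4 baskets)
confidence({"bread"}, {"coffee"}, baskets)   # -> 2/3
```

FP-Growth enumerates the item sets whose support exceeds a threshold efficiently; the rule-generation step then keeps those rules whose confidence (or other chosen measure) is high enough.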
Besides the table representation, a graphical representation can also be used, with filtering conditions similar to those of the former:

Figure 10.3. Graphic representation of the association rules generated

Video

Workflow

assoc_exp1.rmp

Keywords

frequent item sets
association rules
transactional data
binomial attributes

Operators

Create Association Rules
FP-Growth
Numerical to Binominal
Read AML

2. Extraction of association rules from a non-transactional dataset

Description

The process shows, using the Titanic dataset, how association rules can be extracted from a non-transactional dataset. In order to obtain association rules from such a dataset, it first has to be transformed into a transactional dataset. In these cases, it depends on the structure of the original database whether the emphasis is only on the items that are present in a transaction, or the 0 values of the variables also have to be interpreted. These datasets have to be transformed into an uncompressed sparse matrix representation, in which every record contains a binomial value for each of the possible items. After this, the extraction of association rules can be executed without any complex transformation. The frequent item sets occurring in the dataset can be extracted, and based on them, the association rules valid for the dataset can be derived as well.

Input

Titanic [Titanic]

Output

Using this dataset, it can be examined whether the age, sex, and class of the passengers of the Titanic had any influence on their chances of survival. As the Class variable is not of binomial type, it has to be converted into binomial form before the frequent item sets can be extracted:

Figure 10.4. Operator preferences for the necessary data conversion

Figure 10.5.
Converted version of the dataset

Based on these, the frequent item sets can now be acquired, from which the association rules valid for the dataset can be generated:

Figure 10.6. List of the frequent item sets generated

Figure 10.7. List of the association rules generated

Interpretation of the results

Looking at the frequent item sets and association rules created, it is obvious that this handling of the dataset is inappropriate. It can be seen in the documentation of the dataset that for each variable, including the binomial variables, the 0 values have a separate meaning (e.g. 0 represents children for the age variable, or belonging to the crew for the class variable). Accordingly, to acquire the appropriate transactional records, these variables also have to be split into two separate variables that represent the presence or absence of the two possible values. In this case, the following dataset is yielded as a result:

Figure 10.8. Operator preferences for the appropriate data conversion

Figure 10.9. The appropriate converted version of the dataset

Based on these, the frequent item sets, and using those, the appropriate association rules can be extracted. Using the resulting rules, deeper conclusions can be drawn regarding the connections within the data, and the factors influencing the survival chances of the passengers can be identified. Among other things, the table representation of the rules can aid this, as in this representation different kinds of filters can be applied to select the rules considered interesting, for example by outcome or by confidence level:

Figure 10.10. Enhanced list of the frequent item sets generated

Figure 10.11.
List of the association rules generated

Besides the table representation, a graphical representation can also be used, with filtering conditions similar to those of the former:

Figure 10.12. Graphic representation of the association rules generated

Video

Workflow

assoc_exp2.rmp

Keywords

frequent item sets
association rules
non-transactional data
binomial attributes
data transformation

Operators

Create Association Rules
FP-Growth
Multiply
Nominal to Binominal
Read AML

3. Evaluation of performance for association rules

Description

The process shows, using the Titanic dataset, how the usability and efficiency of association rules extracted for a given dataset can be checked. After extracting the association rules, their support can be evaluated, and, similarly to classification tasks, it can be checked to what extent the original values of the dataset can be predicted based on the rules created. Based on these evaluations, it can be decided whether the resulting association rules are appropriate for the goals of the process, whether the existing rules should be improved further, or whether they have revealed such poor connections that a completely new approach is necessary.

Input

Titanic [Titanic]

Output

Using this dataset, it can be examined whether the age, sex, and class of the passengers of the Titanic had any influence on their chances of survival. After the appropriate conversion of the variables, the dataset can be split into a training set and a test set; then, by applying the association rules derived from the training set to the test set, it can be determined to what extent the rules are usable. In order to make the attributes created during the conversion referable, the following parameter has to be used:

Figure 10.13.
Operator preferences for the necessary data conversion

After this, in order to evaluate the efficiency of applying the rules using the general performance evaluation operator, the original and predicted values of the attribute of interest (in this case, the variable Survived_1, which indicates that the given passenger survived the shipwreck) have to be converted to nominal types, and it also has to be ensured that their values are coded using the same values:

Figure 10.14. Label role assignment for performance evaluation

Figure 10.15. Prediction role assignment for performance evaluation

Figure 10.16. Operator preferences for the data conversion necessary for evaluation

Interpretation of the results

After setting the appropriate roles, the performance measurement operator automatically performs the comparisons and, based on these, evaluates the efficiency of the application of the rules. Running the process yields the following rules regarding the survival of the passengers:

Figure 10.17. Graphic representation of the association rules generated regarding survival

Figure 10.18. List of the association rules generated regarding survival

It can be seen here that although many conclusions have been drawn regarding the survival of the passengers, the support of the rules is rather low. This leads to the conclusion that the rules apply in relatively special cases rather than generally; thus, in some cases, no decision will be possible based on them. This is also illustrated by the low value appearing in the performance evaluation:

Figure 10.19.
Performance vector for the application of the association rules generated

One of the reasons for this could be that during the extraction of the association rules some other factor, which might affect the connections disclosed by the rules, was not taken into consideration. After discovering such factors, a better result might be obtainable in some cases.

Video

Workflow

assoc_exp3.rmp

Keywords

frequent item sets
association rules
performance
support

Operators

Apply Association Rules
Create Association Rules
Discretize by User Specification
FP-Growth
Multiply
Nominal to Binominal
Performance
Read AML
Set Role
Split Data

4. Performance of association rules: Simpson's paradox

Description

The process shows, using the Titanic dataset, how the usability and efficiency of association rules can be enhanced by creating subsets of the dataset based on the connections between its data and then creating association rules separately for each subset. After extracting the association rules, their support can be evaluated, and, similarly to classification tasks, it can be checked to what extent the original values of the dataset can be predicted based on the rules created. If these values fail to reach the expected levels, one of the possible reasons is the so-called Simpson's paradox: due to some hidden factor, the connections between the variables can weaken, disappear, or even turn in the opposite direction. If such factors are discovered, splitting the dataset along them can enhance the performance of the association rules.

Input

Titanic [Titanic]

Output

Using this dataset, it can be examined whether the age, sex, and class of the passengers of the Titanic had any influence on their chances of survival.
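Simpson's paradox itself can be demonstrated in a few lines of Python on hypothetical counts (the classic kidney-stone illustration; none of these numbers come from the Titanic data):

```python
from fractions import Fraction

def rate(successes, total):
    """Exact success rate as a fraction."""
    return Fraction(successes, total)

# Hypothetical counts: treatment A wins in BOTH subgroups,
# yet loses once the subgroups are pooled.
a_small, a_large = rate(81, 87), rate(192, 263)
b_small, b_large = rate(234, 270), rate(55, 80)
a_total = rate(81 + 192, 87 + 263)   # 273/350
b_total = rate(234 + 55, 270 + 80)   # 289/350

assert a_small > b_small and a_large > b_large   # A better in each subgroup
assert a_total < b_total                         # ...but worse overall
```

The reversal happens because the hidden factor (subgroup membership) is unevenly distributed; splitting the data along it, as done below for age, removes the interference.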
After the appropriate conversion of the variables, the dataset can be split into a training set and a test set; then, by applying the association rules derived from the training set to the test set, it can be determined to what extent the rules are usable. However, if this is done on the whole dataset, relatively poor results emerge for support and, consequently, for performance as well:

Figure 10.20. List of the association rules generated regarding survival

Figure 10.21. Performance vector for the application of the association rules generated

Considering the contingency table of the dataset, however, for example split by the age of the passengers and by their class, the conclusion can be drawn that the individual variables have such a strong influence on the value of the variable of interest (survival) that the effects of the individual groups can neutralize each other in the whole dataset. It can therefore be more advantageous to split the dataset along these variables and extract the association rules separately in the individual subsets:

Figure 10.22. Contingency table of the dataset

In order to do this, for example if the dataset is to be split based on the age of the passengers, first the appropriate records have to be filtered out, and then the variables used as filtering conditions can be removed, as within the subsets they carry information that can now be considered redundant:

Figure 10.23. Record filter usage

Figure 10.24. Removal of attributes that become redundant after filtering

Interpretation of the results

After this, the training and test sets are created, the association rules concerning them are extracted, and their efficiency is evaluated separately for the datasets of adults and children. The subset of adults yielded the following results:

Figure 10.25.
List of the association rules generated for the subset of adults

Figure 10.26. Performance vector for the application of the association rules generated regarding survival for the subset of adults

The subset of children yielded the following results:

Figure 10.27. List of the association rules generated for the subset of children

Figure 10.28. Performance vector for the application of the association rules generated regarding survival for the subset of children

It can be seen that performance can be increased remarkably by such splits of the dataset, as the interference between the effects of the groups is thereby neutralized. For the group of children, the enhancement in performance is much smaller, but this can be explained by the much smaller record count of the subset.

Video

Workflow

assoc_exp4.rmp

Keywords

frequent item sets
association rules
performance
support
Simpson's paradox

Operators

Apply Association Rules
Create Association Rules
Discretize by User Specification
Filter Examples
FP-Growth
Multiply
Nominal to Binominal
Performance
Read AML
Select Attributes
Set Role
Split Data

Chapter 11. Clustering 1
Standard methods

1. K-means method

Description

The process demonstrates, using the Aggregation dataset, how the K-means clustering algorithm works. It also shows the importance of choosing the distance function.

Input

Aggregation [SIPU Datasets] [Aggregation]

The dataset consists of 788 two-dimensional vectors, which form 7 separate groups. The task is to discover these groups (clusters). The difficulty of the task lies in the alignment of the points, as smaller and larger clouds of points are present with different distances between them.

Figure 11.1. The 7 separate groups

Output

Figure 11.2. Clustering with default values
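One iteration of the K-means update, assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, can be sketched in plain Python; this illustration assumes the default Euclidean distance:

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_step(points, centroids):
    """One K-means iteration: assign points, then move each centroid
    to the mean of the points assigned to it."""
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest(p, centroids)].append(p)
    return [tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)]

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
new_centroids = kmeans_step(pts, [(0, 0), (10, 10)])
# -> [(0.0, 0.5), (10.0, 10.5)]
```

Iterating this step until the centroids stop moving gives the K-means solution; the choice of distance function in the assignment step is exactly what the experiment below varies.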
After reading the data, the K-means operator is connected, the algorithm is set to search for 7 clusters, and the process is run. The result is that the upper and right-side point clouds are discovered successfully; however, the algorithm performed poorly on the lower point cloud.

Figure 11.3. Setting the distance function

Let us try another distance function, the Mahalanobis distance.

Figure 11.4. Clustering with the Mahalanobis distance function

It can be seen that, at the cost of minor sacrifices elsewhere, the result has become more precise; the clustering of the lower point cloud is now close to a perfect solution.

Interpretation of the results

It can be seen that even the simplest clustering algorithms can discover basic connections, and if the distance function is chosen correctly, the results can be made even more precise.

Video

Workflow
clust_exp1.rmp

Keywords
K-means method, distance functions, cluster analysis

Operators
k-Means, Read CSV

2. K-medoids method

Description

The process shows, using the Maximum Variance (R15) dataset, how the K-medoids method can be used.

Input

Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]

The dataset contains 600 two-dimensional vectors, which are concentrated into 15 clusters. The points are aligned around a center with the coordinates (10,10), at increasing distances from each other as they get farther from the center. This is the difficulty of the task, as the clusters near the center are close to blending into each other.

Figure 11.5. The dataset

Output

Figure 11.6. Setting the parameters of the clustering

The K-medoids method differs from the K-means method in that the cluster centers have to be existing data points.
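This constraint can be illustrated with a short Python sketch (a hypothetical, simplified PAM-style implementation written for this text; the actual k-Medoids operator is configured through RapidMiner's GUI):

```python
import random

def k_medoids(points, k, dist, iters=20, seed=42):
    """Naive k-medoids: cluster centers are always actual data points."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        # assign every point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            nearest = min(medoids, key=lambda m: dist(p, m))
            clusters[nearest].append(p)
        # the new medoid of each cluster is the member with the
        # smallest total distance to the other members
        new_medoids = [min(g, key=lambda c: sum(dist(c, q) for q in g))
                       if g else m
                       for m, g in clusters.items()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(k_medoids(pts, 2, euclid)))  # one medoid per point cloud
```

Unlike a K-means centroid, which is an average and usually lies between the data points, each returned medoid is guaranteed to be an element of the input.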
After setting the distance function and the number of clusters k, and then running the process, it can be seen that even though a more sophisticated distance function has been chosen, the alignment of the data did not make a precise analysis of the central clusters possible.

Figure 11.7. The clusters produced by the analysis

Interpretation of the results

The process has shown that not every dataset lends itself to arbitrary cluster analysis methods.

Video

Workflow
clust_exp2.rmp

Keywords
K-medoids method, dataset properties, cluster analysis

Operators
k-Medoids, Read CSV

3. The DBSCAN method

Description

The process shows, using the Compound dataset, the advantages of density-based clustering, using the DBSCAN algorithm.

Input

Compound [SIPU Datasets] [Compound]

The dataset consists of 399 two-dimensional vectors belonging to six groups. The groups differ in the density of their points, and one point set can encompass another point set as well.

Figure 11.8. The groups with varying density

Output

Figure 11.9. The results of the method with default parameters

The process yields remarkable results even with the default settings: only one of the six clusters contains an error. Using the parameters epsilon and min points, the results can be refined further, but as the points in the above-mentioned cluster are of different densities, a perfect solution cannot be reached.

Interpretation of the results

The operation of the DBSCAN algorithm has been demonstrated on a dataset consisting of groups with different densities. The deficiencies of the algorithm have also been shown: if points of different densities can be found within a cluster, the precise operation of the algorithm cannot be guaranteed.
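The core idea behind the epsilon and min points parameters (core points, density reachability, and noise) can be sketched in a few lines of Python; this is a simplified illustration written for this text, not RapidMiner's implementation:

```python
import math

def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: returns a cluster id per point, or -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cid = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:      # not a core point:
            labels[i] = NOISE        # provisionally noise
            continue
        cid += 1                     # start a new cluster at this core point
        labels[i] = cid
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cid      # border point reclaimed from noise
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cid
            if len(neighbours(j)) >= min_pts:
                queue.extend(neighbours(j))   # j is also core: keep expanding
    return labels

# two dense clouds and one isolated noise point
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3, dist=math.dist))
```

Because clusters grow only through core points, a lone far-away point never reaches the min points count and stays labeled as noise.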
Video

Workflow
clust_exp3.rmp

Keywords
DBSCAN method, density function, cluster analysis

Operators
DBSCAN, Read CSV

4. Agglomerative methods

Description

The process shows, using the Maximum Variance (R15) dataset, how the appropriate number of clusters can be determined using the agglomerative hierarchical clustering method.

Input

Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]

The dataset contains 600 two-dimensional vectors, which form 15 separate groups. The task is to determine the number of groups and to discover them.

Figure 11.10. The 15 groups

Output

Figure 11.11. The resulting dendrogram

The result of agglomerative clustering is a so-called dendrogram, a tree structure whose leaves are the points themselves and whose intermediate nodes (the clusters) result from merging two points or subtrees (clusters). The method always merges the two points (or clusters) closest to each other, thus building up the tree, which contains all the points by the end of the process. So, at the beginning of the process, each point forms a cluster on its own, while by the end, all points are placed into one single cluster. The length of the edges in the finished dendrogram is proportional to the distance between the clusters, so the number of edges at the appropriate level defines the ideal number of clusters.

Figure 11.12. The clustering generated from the dendrogram

By using the Flatten Clustering operator, the dendrogram can also be used for clustering, with the number of clusters stated manually as a single parameter. The figure shows the result of this cluster analysis.

Interpretation of the results

It can be seen that the ideal number of clusters can be determined from the dendrogram, and then, based on this, the cluster analysis can be performed as well.
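The merge loop described above can be sketched in plain Python. This simplified single-link version, written for this text, merges straight down to a flat partition of k clusters (RapidMiner instead records the full dendrogram and flattens it afterwards):

```python
def agglomerate(points, k, dist):
    """Single-link agglomerative clustering, merged down to k flat clusters."""
    clusters = [[p] for p in points]   # every point starts as its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into a
        del clusters[b]
    return clusters

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)]
for c in agglomerate(pts, 3, euclid):
    print(sorted(c))
```

Recording the merge distance stored in `best[0]` at every step would reproduce the dendrogram heights from which the ideal k can be read off.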
Video

Workflow
clust_exp4.rmp

Keywords
agglomerative method, agglomerative hierarchical clustering, cluster analysis

Operators
Agglomerative Clustering, Flatten Clustering, Multiply, Read CSV

5. Divisive methods

Description

The process shows, using the Maximum Variance (R15) dataset, how divisive hierarchical cluster analysis can be performed.

Input

Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]

The dataset contains 600 two-dimensional vectors, which form 15 separate groups. The task is to determine the ideal number of groups and to discover them.

Figure 11.13. The 600 two-dimensional vectors

Output

Figure 11.14. The subprocess

To perform divisive clustering, a clustering method with which the division can be carried out is necessary. In the initial state, all points belong to the same cluster; the method then continuously divides the points into multiple groups until, finally, all points are placed into separate clusters. The operator determines the ideal number of clusters and assigns the points to the clusters as well.

Figure 11.15. The report generated by the clustering

In the present case, the procedure determined the number of clusters to be 63.

Figure 11.16. The output of the analysis

The points are then assigned to these groups.

Interpretation of the results

It can be seen that the method has indeed created a larger number of clusters, but due to this, the central clusters can be separated from each other better.

Video

Workflow
clust_exp5.rmp

Keywords
divisive method, divisive hierarchical clustering, cluster analysis

Operators
k-Means, Read CSV, Top Down Clustering

Chapter 12. Clustering 2: Advanced methods

1.
Support vector clustering

Description

The process shows, using the Jain dataset, how support vector clustering can be used, and what the effects of its parameters are.

Input

Jain [SIPU Datasets] [Jain]

The dataset contains 373 two-dimensional vectors, which are organized into 2 groups. The challenge posed by the point set is that the clouds of points are aligned closely to each other, and they have non-spherical shapes.

Figure 12.1. The two groups

Output

During support vector clustering, the data are transformed using kernel functions, and then a circle is enlarged until all points are located within it. Finally, the boundary curve thus created is transformed back into the original space along with the data, and the clusters are thereby created. The kernel functions are identical to those described for support vector machines, and their parameters are the same as well. Support vector clustering has a unique parameter r, with which the radius of the circle in the transformed space can be defined.

Figure 12.2. Support vector clustering with polynomial kernel and p=0.21 setup

First, let us test the polynomial kernel, letting the points reach over the boundary curve.

Figure 12.3. Unsuccessful clustering

It can be seen that the result is rather disappointing: the resulting clusters extend into each other, and the second cluster is considered to be noise by the method.

Figure 12.4. Clustering with RBF kernel

Switching to the RBF kernel, and not allowing the points to reach over the boundary curve, the result is much more promising. The upper cluster is split into multiple clusters, but the lower one remains in one piece and is separated from the other clusters.

Figure 12.5. More promising results

Interpretation of the results
Just as with support vector machines, the factors that most influence the efficiency of SVC are the choice of the appropriate kernel function and finding the ideal degree of generalization.

Video

Workflow
clust2_exp1.rmp

Keywords
support vector clustering, SVC, cluster analysis, kernel functions

Operators
Read CSV, Support Vector Clustering

2. Choosing parameters in clustering

Description

The process shows, using the Flame dataset, how the ideal parameters can be found automatically.

Input

Flame [SIPU Datasets] [Flame]

The dataset consists of 240 two-dimensional vectors, which belong to two clusters. The clusters are aligned close to each other, and one of them has a non-spherical shape.

Figure 12.6. The two groups containing 240 vectors

Output

Figure 12.7. The subprocess of the optimization node

To perform parameter optimization, a performance operator is required, which, in this case, is the node measuring cluster distance.

Figure 12.8. The parameters of the optimization

The parameters to be optimized, and their possible values, are chosen in the parameter optimization operator; the system is then entrusted with choosing the ideal values.

Figure 12.9. The report generated by the process

In the present case, the best result was yielded by partitioning the task into 10 clusters and measuring the distance between them with the Euclidean distance.

Figure 12.10. Clustering generated with the optimal parameters

Interpretation of the results

For many parameterized clustering methods, it can be ideal to entrust the determination of the appropriate number of clusters to a performance measurement operator, and then run the clustering with the obtained values.
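The grid search performed by an operator such as Optimize Parameters (Grid) can be sketched in plain Python. As a stand-in for the cluster distance performance operator, this sketch scores each candidate k with the Davies-Bouldin index (lower is better); the simplified k-means and the farthest-point initialization are written for this text, not taken from RapidMiner:

```python
import math

def init_far(points, k):
    """Deterministic farthest-point initialization."""
    cents = [points[0]]
    while len(cents) < k:
        cents.append(max(points,
                         key=lambda p: min(math.dist(p, c) for c in cents)))
    return cents

def kmeans(points, k, iters=30):
    cents = init_far(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: math.dist(p, cents[i]))].append(p)
        # recompute each centroid as the mean of its group
        cents = [tuple(sum(v) / len(g) for v in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return cents, groups

def davies_bouldin(cents, groups):
    """Average over clusters of the worst (S_i + S_j) / M_ij ratio."""
    S = [sum(math.dist(p, c) for p in g) / len(g) if g else 0.0
         for c, g in zip(cents, groups)]
    k = len(cents)
    return sum(max((S[i] + S[j]) / max(math.dist(cents[i], cents[j]), 1e-12)
                   for j in range(k) if j != i)
               for i in range(k)) / k

# three well-separated clouds: the grid search should pick k = 3
pts = [(0, 0), (0, 1), (1, 0),
       (10, 0), (10, 1), (11, 0),
       (0, 10), (0, 11), (1, 10)]
best_k = min(range(2, 6), key=lambda k: davies_bouldin(*kmeans(pts, k)))
print(best_k)
```

The grid loop in the last line is the whole trick: every candidate value is tried, and the performance measure decides which parameter setting wins.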
Video

Workflow
clust2_exp2.rmp

Keywords
support vector clustering, SVC, cluster analysis, kernel functions

Operators
Cluster Distance Performance, k-Means, Optimize Parameters (Grid), Read CSV

3. Cluster evaluation

Description

The process shows, using the Aggregation dataset, how to gather and display cluster metrics.

Input

Aggregation [SIPU Datasets] [Aggregation]

The dataset contains 788 two-dimensional vectors, which form 7 separate groups. In the present case, the aim is to evaluate the clusters created.

Figure 12.11. The 788 vectors

Output

Figure 12.12. The evaluating subprocess

After reading the data, agglomerative clustering is run with different parameters, and clusters are then created from its output. A similarity function is created to measure cluster density, and the results of the measurements are saved for each parameter setting.

Figure 12.13. Setting up the parameters

60 different settings are tested, with the number of clusters ranging from 2 to 20, and all three agglomeration strategies of the agglomerative clustering are tried out.

Figure 12.14. Parameters to log

The cluster sizes, the cluster densities, the distribution of the points, and the agglomeration strategy are saved for each setting.

Figure 12.15. Cluster density against the number of clusters k

Figure 12.16. Item distribution against the number of clusters k

The final result can be acquired by reading the log.

Interpretation of the results

The final result shows that an increase in the number of clusters leads to an increase in cluster densities and a decrease in point distribution, at different paces for the three different strategies. However, the single link strategy falls somewhat behind the complete link and average link methods.
Video

Workflow
clust2_exp3.rmp

Keywords
cluster evaluation, agglomerative clustering, single link, complete link, average link, point density, point distribution

Operators
Agglomerative Clustering, Cluster Density Performance, Data to Similarity, Flatten Clustering, Item Distribution Performance, Log, Log to Data, Loop Parameters, Multiply, Read CSV

4. Centroid method

Description

The process shows, using the Maximum Variance (D31) dataset, that cluster centers are suitable for representing even the whole of their clusters.

Input

Maximum Variance (D31) [SIPU Datasets] [Maximum Variance]

The dataset contains 3100 two-dimensional vectors, which are concentrated into 31 clusters. This dataset is used to illustrate the generalization power that centroids possess.

Figure 12.17. The vectors forming 31 clusters

Output

Figure 12.18. The extracted centroids

The centroids are obtained from the cluster analysis of the data; then, to illustrate their representative power, they are utilized as training data for a k-NN classifier.

Figure 12.19. The output of the k nearest neighbour method, using the centroids as prototypes

The efficiency of the k-NN classification method primarily depends on the prototypes selected. Based on the result, it can be seen that the well-chosen points have aided the classification.

Interpretation of the results

It can be seen that clustering can be a good starting point for extracting the prototypes of a dataset, which can make it possible to reduce the training dataset.

Video

Workflow
clust2_exp4.rmp

Keywords
centroids, X-means method, k-NN

Operators
Apply Model, Extract Cluster Prototypes, k-NN, Multiply, Read CSV, Set Role, X-Means

5. Text clustering
Description

The process shows, using the Twenty Newsgroups dataset, how the clustering of documents can be performed.

Input

A subset of the Twenty Newsgroups dataset [UCI MLR].

Note
The dataset was donated to the UCI Machine Learning Repository by Tom Mitchell.

The dataset contains about 20000 news articles belonging to 20 topics. The subset utilized here contains only three of the topics, which are concerned with cars, electronics, and everyday politics.

Output

Figure 12.20. The preprocessing subprocess

The data are read by topic, transformed to lower case, tokenized, and stemmed, and then the stop words are filtered out. After this, the only thing that remains to be done is to cluster the TF-IDF vectors by document.

Figure 12.21. The clustering setup

The distance between the document vectors can be measured using the cosine similarity. The cluster labels are transformed into class labels, and it is then checked to what extent the clusters cover the individual topics.

Figure 12.22. The confusion matrix of the results

Interpretation of the results

The results show that cars have blended severely with electronics, which is possibly not too far from reality, as the two fields have numerous points in common.

Video

Workflow
clust2_exp5.rmp

Keywords
K-means method, cosine similarity, text clustering, text mining

Operators
k-Means, Map Clustering on Labels, Performance (Classification), Filter Stopwords (English) [Text Mining Extension], Process Documents from Files [Text Mining Extension], Stem (Snowball) [Text Mining Extension], Tokenize [Text Mining Extension], Transform Cases [Text Mining Extension]

Chapter 13. Anomaly detection

1.
Searching for outliers

Description

The workflow, using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, shows how to find outliers based on the distances measured between the data. This can be done either by measuring the distance of each data object from its k nearest neighbours, or by checking whether its distance from some data object is above a given threshold. The definition of an outlier is relative; it can always be defined in comparison with the distances between the data objects. Thus, if the distances between the data objects are generally large, a high threshold has to be set for outliers.

Input

Wisconsin Diagnostic Breast Cancer (WDBC) [UCI MLR]

Output

It can be seen that outliers can be filtered out using the appropriate settings. Since, for example, differences in the range of hundreds occur between the individual values of the represented attribute (area), this result can be obtained by setting the threshold for the Euclidean distance to 500.

Figure 13.1. Graphic representation of the possible outliers

Interpretation of the results

Note that due to the large distances between the data objects, the number of outliers detected only decreases to its true level when the threshold is raised to 500; below a certain value, far too many data objects would be identified as outliers.

Figure 13.2. The number of outliers detected as the distance limit grows

Video

Workflow
anomaly_exp1.rmp

Keywords
outliers, data preprocessing, data cleansing

Operators
Detect Outlier (Densities), Detect Outlier (Distances), Multiply, Read AML

2. Unsupervised search for outliers

Description

The process shows, using a sample of the Individual household electric power consumption dataset, how incidentally occurring outliers (anomalies) can be found in a dataset with unsupervised methods.
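The threshold-based variant described above can be sketched in a few lines of Python (a simplified, hypothetical stand-in for a distance-based outlier detector, written for this text):

```python
import math

def distance_outliers(points, threshold, min_neighbours=1):
    """Flag points that have fewer than min_neighbours other points
    within the given distance threshold."""
    flagged = []
    for i, p in enumerate(points):
        close = sum(1 for j, q in enumerate(points)
                    if i != j and math.dist(p, q) <= threshold)
        if close < min_neighbours:
            flagged.append(p)
    return flagged

# one point lies hundreds of units away from the rest
pts = [(0, 0), (1, 0), (0, 1), (500, 500)]
print(distance_outliers(pts, 5))    # only the distant point is flagged
print(distance_outliers(pts, 0.5))  # threshold too low: everything is flagged
```

The second call reproduces the effect described in the interpretation: with a threshold that is too small relative to the typical distances, far too many data objects are reported as outliers.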
Several methods can be used for unsupervised anomaly detection, e.g., in the general case, methods based on k nearest neighbours, which assign an outlier indicator value to each element based on its distance from its k nearest neighbours. The higher this indicator value, the more of an outlier the given element is, and the more likely it is to be a potential anomaly. However, this scoring may vary depending on the dataset and the method utilized, so the threshold above which a given element is to be considered an outlier should be set according to the distances between the data and the methods used.

Input

Individual household electric power consumption [UCI MLR]

Output

The Anomaly Detection extension that can be installed in RapidMiner offers several methods for detecting anomalies, for example the method based on k nearest neighbours, and the LOF metric, which relies on the k nearest neighbours method but also takes density into consideration.

Figure 13.3. Nearest neighbour based operators in the Anomaly Detection package

Figure 13.4. Settings of LOF

These methods assign different scores to the elements, from which it can be seen which elements are outliers. The k nearest neighbours method assigns the following scores to the elements of the dataset:

Figure 13.5. Outlier scores assigned to the individual records based on k nearest neighbours

The LOF method assigns the following scores to the elements of the dataset:

Figure 13.6. Outlier scores assigned to the individual records based on LOF

Interpretation of the results

Based on the results received, it can be decided which score should be considered the threshold above which an element counts as an anomaly, and the elements with scores above this threshold, i.e.
the outliers, can also be immediately filtered out of the dataset, or a separate dataset can be formed from them:

Figure 13.7. Filtering the records based on their outlier scores

For example, using the k-NN method, the following dataset appears as a result after removing the values rated as outliers:

Figure 13.8. The dataset filtered based on the k-NN score

Furthermore, the set of elements rated as outliers based on the LOF is the following:

Figure 13.9. The dataset filtered based on the LOF score

Video

Workflow
anomaly_exp2.rmp

Keywords
outliers, anomaly detection, k nearest neighbours

Operators
Filter Examples, Multiply, Read CSV, k-NN, Global Anomaly Score [Anomaly Detection], Local Outlier Factor (LOF) [Anomaly Detection]

3. Unsupervised statistics based anomaly detection

Description

The process shows, using the Flame dataset, how incidentally occurring outliers (anomalies) can be found in a dataset using statistics based unsupervised methods. Several methods can be used for unsupervised anomaly detection, e.g., a statistics based, histogram based method. In this case, groups of values are defined for each attribute with a histogram, and based on the deviation from these, the given value in the given column can be considered an outlier. The overall outlier score of the records is then defined using these scores. The higher this indicator value, the more of an outlier the value or record is, and the more likely it is to be a potential anomaly. However, this scoring may vary depending on the dataset and the method utilized, so the threshold above which a given element is to be considered an outlier should be set according to the distances between the data and the methods used.
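The histogram-based scoring can be sketched in plain Python. This is a simplified illustration of the HBOS idea (equal-width bins, score summing the log inverse bin heights), written for this text rather than taken from the extension's implementation:

```python
import math

def hbos(points, n_bins=5):
    """Simplified HBOS: per attribute, build an equal-width histogram;
    a record's score sums log(1 / height) over the bins its values hit."""
    n, dims = len(points), len(points[0])
    scores = [0.0] * n
    for d in range(dims):
        col = [p[d] for p in points]
        lo = min(col)
        width = (max(col) - lo) / n_bins or 1.0
        bin_of = lambda v: min(int((v - lo) / width), n_bins - 1)
        counts = [0] * n_bins
        for v in col:
            counts[bin_of(v)] += 1
        for i, v in enumerate(col):
            height = counts[bin_of(v)] / n       # relative bin height
            scores[i] += math.log(1.0 / height)  # rare bins score high
    return scores

pts = [(0, 0), (1, 1), (2, 0), (1, 2), (100, 1)]
scores = hbos(pts)
print(scores.index(max(scores)))   # the record (100, 1) scores highest
```

A value sitting in a sparsely populated bin of any attribute contributes a large term, so records combining several rare values accumulate the highest overall scores.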
At the same time, due to this fact, it can be more illustrative to use colors instead of bare values to indicate the outlier scores, which is done automatically by the histogram based method.

Input

Flame [SIPU Datasets] [Flame]

Output

The Anomaly Detection extension that can be installed in RapidMiner offers several methods for detecting anomalies, for example the histogram based method, which defines the outlier score of the individual values in each column of the dataset and, based on these, calculates the final score of the records. This method can be refined with multiple settings, either at the operator level or at the column level:

Figure 13.10. Global settings for Histogram-based Outlier Score

Figure 13.11. Column-level settings for Histogram-based Outlier Score

Based on the settings, the operator splits the set of values in the individual columns into either a pre-defined or an arbitrary number of bins, which are either equal or variable in length. Based on these, it assigns color codes and calculates the record-level score from the scores of the column values. Using a fixed bin width and an arbitrary number of bins, the following values are returned as a result:

Figure 13.12. Scores and attribute binning for fixed bin width and an arbitrary number of bins

Interpretation of the results

Based on the results received, it can be decided which score should be considered the threshold above which an element counts as an anomaly. In this case, however, a more detailed examination is also possible: based on the colour codes, it can be seen how likely the individual attributes are to contain outliers, and whether these coincide with outlier values in other columns.
Based on this, on the one hand, the model can be refined if necessary, and on the other hand, in some cases it can be easier to define which values should be considered an anomaly. By checking the graphic representation of the model built from the scores, it can be seen that there are slightly outlying values that have not been assigned a high score:

Figure 13.13. Graphic representation of outlier scores

Based on this, it might be advisable to alter the model, for example to split the attributes into dynamically sized bins. This enhances the performance of the outlier detection, as can be seen in the following results:

Figure 13.14. Scores and attribute binning for dynamic bin width and an arbitrary number of bins

Figure 13.15. Graphic representation of the enhanced outlier scores

Video

Workflow
anomaly_exp3.rmp

Keywords
outliers, anomaly detection, statistics based anomaly detection, histogram based anomaly detection, bin size

Operators
Read CSV, Histogram-based Outlier Score (HBOS) [Anomaly Detection]

Part III. SAS® Enterprise Miner™

Table of Contents

14. Data Sources
1. Reading SAS dataset
2. Importing data from a CSV file
3. Importing data from an Excel file
15. Preprocessing
1.
Constructing metadata and automatic variable selection
2. Visualizing multidimensional data and dimension reduction by PCA
3. Replacement and imputation
16. Classification Methods 1
1. Classification by decision tree
2. Comparison and evaluation of decision tree classifiers
17. Classification Methods 2
1. Rule induction to the classification of rare events
18. Classification Methods 3
1. Logistic regression
2. Prediction of discrete target by regression models
19. Classification Methods 4
1. Solution of a linearly separable binary classification task by ANN and SVM
2. Using artificial neural networks (ANN)
3. Using support vector machines (SVM)
20. Classification Methods 5
1.
Ensemble methods: Combination of classifiers
2. Ensemble methods: bagging
3. Ensemble methods: boosting
21. Association mining
1. Extracting association rules
22. Clustering 1
1. K-means method
2. Agglomerative hierarchical methods
3. Comparison of clustering methods
23. Clustering 2
1. Clustering attributes before fitting SVM
2. Self-organizing maps (SOM) and vector quantization (VQ)
24. Regression for continuous target
1. Logistic regression
2. Prediction of discrete target by regression models
3.
Supervised models for continuous target
25. Anomaly detection
1. Detecting outliers

Chapter 14. Data Sources

1. Reading SAS dataset

Description

The experiment illustrates how existing SAS datasets can be made available to SAS® Enterprise Miner™ by the Input Data operator. In the experiment, a previously prepared SAS dataset is read. A SAS dataset can be created using the SAS® System or SAS® Enterprise Guide™. In order to load the SAS file that we would like to use, we need to know the path to the file. The file may be on the local machine, but it can also be on a remote SAS server. The SAS file can be read using a wizard that guides you through the entire process. Then, the original dataset is sampled by the Sample operator, with which a part of the relatively large data file is selected.

Figure 14.1. The metadata of the dataset

Input

Individual household electric power consumption [UCI MLR]

Output

A dataset which contains 10 percent of the original dataset. For sampling, either an absolute or a relative sample size can be chosen. It is also possible to set the Random Seed parameter, which controls the cycle of the pseudo-random number generator. If the same value is set on different machines, we get the same random sample. The method of sampling can also be set, e.g. simple random, clustered, or stratified.

Figure 14.2. Setting the Sample operator

Figure 14.3.
The metadata of the resulting dataset and a part of the dataset

Interpretation of the results

Whenever we rerun the process, the current state of the data set is imported into the system. Thus the Input Data operator can be used to retrieve data files which are constantly updated by other SAS-based systems, and to rerun the data mining process based on them.

Video

Workflow
sas_import_exp1.xml

Keywords
reading SAS dataset, sampling

Operators
Data Source
Sample

2. Importing data from a CSV file

Description

The process demonstrates how to import data from CSV datasets by the File Import operator. In the experiment, the Bodyfat dataset of the StatLib data repository is used. In order to open the dataset we would like to use, we need to know the path to the file, which can be on the local machine or on a remote SAS server as well. This path can be assigned step by step in a menu.

Figure 14.4. The list of files in the File Import operator

Input

Bodyfat [StatLib]

Note: The dataset was donated by Roger W. Johnson to StatLib.

The import can be parametrized in the File Import operator. We can set the maximal number of records, the maximal number of attributes, and the separator character. It is also possible to define the number of rows that determines the file structure.

Figure 14.5. The parameters of the File Import operator

Output

A dataset which consists of the imported data.

Figure 14.6. A small portion of the dataset

Figure 14.7. The metadata of the resulting dataset

Interpretation of the results

Whenever we rerun the process, the current state of the data set is imported into the system. Thus the File Import operator can be used to reload data files which are constantly updated by other SAS-based systems, and to rerun the data mining process based on them.
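Outside of Enterprise Miner, the core of this import step (honouring a record limit and a separator setting) can be sketched in plain stdlib Python. The column names below are illustrative only, not the actual Bodyfat schema:

```python
import csv
import io

def import_csv(text, sep=",", max_records=None):
    """Parse delimited text into a header list and a list of row dicts,
    mimicking the File Import operator's record-limit and separator settings."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    rows = []
    for i, rec in enumerate(reader):
        if max_records is not None and i >= max_records:
            break  # honour the "maximal number of records" setting
        rows.append(dict(zip(header, rec)))
    return header, rows

# Illustrative data; the real Bodyfat file has different columns.
data = "density;bodyfat;age\n1.0708;12.3;23\n1.0853;6.1;22\n1.0414;25.3;22\n"
header, rows = import_csv(data, sep=";", max_records=2)
```

With `max_records=2`, only the first two data rows are kept, just as the operator truncates a large file.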
Video

Workflow
sas_import_exp2.xml

Keywords
importing data, CSV file

Operators
File Import
Graph Explore
Statistic Explore

3. Importing data from an Excel file

Description

The process illustrates how to import data from an Excel dataset with the help of the File Import operator. In the experiment the Zoo dataset is used, which was previously saved as an Excel file. In order to open the file we would like to use, we need to know the path to it, which can be on the local computer or on a remote SAS server as well. This path can be assigned step by step by going through the directory tree.

Input

Zoo [UCI MLR]

The import can be parametrized in the File Import operator. We can set the maximal number of records and the maximal number of attributes. It is also possible to define the number of rows that determines the file structure.

Output

A dataset which contains the imported data.

Figure 14.8. A small portion of the resulting dataset

Interpretation of the results

Whenever we rerun the process, the system imports the newest version of the dataset. Thus the File Import operator can be used to load datasets which are updated by operational systems and to rerun data mining processes based on them. Note that the import procedure works only for Excel 97-2003 files with the xls extension.

Video

Workflow
sas_import_exp3.xml

Keywords
importing data, Excel

Operators
File Import
Graph Explore

Chapter 15. Preprocessing

1. Constructing metadata and automatic variable selection

Description

The process illustrates, using the Spambase dataset, how to generate the metadata of a dataset by the DMDB operator, and then how automatic variable selection can be performed by the Variable Selection operator. The Spambase dataset contains 58 attributes, one of which is the binary target.
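The kind of per-attribute metadata the DMDB operator computes (mean, variance, minimum, maximum, skewness, kurtosis) can be sketched with stdlib Python; this is a simplified sketch using population moments, not the operator's exact formulas:

```python
import statistics as st

def dmdb_metadata(values):
    """Descriptive statistics in the spirit of the DMDB operator."""
    n = len(values)
    mean = st.mean(values)
    var = st.pvariance(values)  # population variance
    sd = var ** 0.5
    skew = sum((x - mean) ** 3 for x in values) / (n * sd ** 3)
    kurt = sum((x - mean) ** 4 for x in values) / (n * sd ** 4) - 3  # excess kurtosis
    return {"mean": mean, "variance": var, "min": min(values),
            "max": max(values), "skewness": skew, "kurtosis": kurt}

meta = dmdb_metadata([1.0, 2.0, 2.0, 3.0, 4.0])
```

For a discrete attribute, the mode (e.g. `statistics.mode`) would be added, as the text notes below.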
In order to visualize a dataset, it may be necessary to determine the most important input attributes, which can then be used in the graphical representation.

Input

Spambase [UCI MLR]

Output

The DMDB operator produces such metadata (descriptive statistics) as mean, variance, minimum, maximum, skewness, and kurtosis. In the case of discrete attributes, these are complemented by the mode.

Figure 15.1. Metadata produced by the DMDB operator

The default settings of the Variable Selection operator are applied, except that the minimum R-square is increased in order to filter out unnecessary attributes.

Figure 15.2. The settings of the Variable Selection operator

The result is, on the one hand, a list which contains the decision about each variable, i.e., whether or not it remains in the data mining process, and, on the other hand, a few graphs of the importance of the variables.

Figure 15.3. List of variables after the selection

Figure 15.4. Sequential R-square plot

Using the important variables, a number of graphical tools of Enterprise Miner™ can be used to display the records.

Figure 15.5. The binary target variable as a function of the two most important input attributes after the variable selection

Interpretation of the results

The experiment shows how metadata can be extracted from SAS datasets and then transmitted to other operators. Moreover, we demonstrated how variable selection can be performed in the case of a large number of attributes and how we can work with the important attributes.

Video

Workflow
sas_preproc_exp1.xml

Keywords
variable selection, metadata

Operators
Data Source
Data Mining DataBase
Graph Explore
Variable Selection

2. Visualizing multidimensional data and dimension reduction by PCA
Description

The experiment presents visualization and dimension reduction methods with the help of the Fisher-Anderson Iris dataset. Multidimensional datasets can be visualized by the Graph Explore operator. Dimension reduction can be performed by the Principal Components operator. After the dimension reduction, it becomes much easier to display multidimensional datasets in the space of principal components.

Input

Fisher-Anderson Iris

Output

The Graph Explore operator provides several graphical tools for displaying multidimensional datasets, which play a key role in the preprocessing step of data mining. Some of these are extensions of well-known tools, such as two- and three-dimensional scatterplots and bar charts, supplemented by a number of options such as the use of colors and symbols. Other techniques, such as the parallel axis or the radar plot, however, are characteristic only of data mining software tools.

Figure 15.6. Displaying the dataset by parallel axis

Principal Component Analysis (PCA) can be performed by the Principal Components operator. In the operator the following settings can be defined: the dependency structure (covariance or correlation) and the cut-off condition (the number of eigenvalues or the cumulative eigenvalue ratio).

Figure 15.7. Cumulative explained variance plot of the PCA

The main result of principal component analysis is the principal component coordinates of the individual records, which can be used in further data analysis and visualization.

Figure 15.8. Scatterplot of the Iris dataset using the first two principal components

Interpretation of the results

The experiment shows how we can display high-dimensional data sets and perform dimension reduction.
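The cumulative-eigenvalue-ratio cut-off mentioned above can be sketched in a few lines; the eigenvalues below are approximate, well-known values for the Iris correlation matrix, used here for illustration only:

```python
def components_for_ratio(eigenvalues, target=0.95):
    """Smallest number of leading principal components whose eigenvalues
    explain at least `target` of the total variance (the cumulative
    eigenvalue ratio cut-off of the Principal Components operator)."""
    total = sum(eigenvalues)
    cum = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cum += ev
        if cum / total >= target:
            return k
    return len(eigenvalues)

# Approximate eigenvalues of the Iris correlation matrix.
k = components_for_ratio([2.918, 0.914, 0.147, 0.021], target=0.95)
```

With these eigenvalues, two components already exceed the 95 percent threshold, consistent with the reduction described next.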
In our experiment, the original 4-dimensional data set, which cannot be displayed using a standard scatterplot, was reduced to 2 dimensions in such a way that 95 percent of the information contained in the data is preserved.

Video

Workflow
sas_preproc_exp2.xml

Keywords
principal component analysis (PCA), parallel axis

Operators
Data Source
Graph Explore
Principal Components

3. Replacement and imputation

Description

In this experiment, we demonstrate with the help of the Congressional Voting Records dataset how to modify the values of attributes by the Replacement operator and then how to impute the missing values by the Impute operator. The replacement of missing values can be carried out for each variable independently of the others, or in interaction with the target variable by fitting a decision tree.

Input

Congressional Voting Records [UCI MLR]

Output

In the Replacement operator we can set the substitution of discrete and continuous variables separately.

Figure 15.9. The replacement wizard

A number of imputation methods can be chosen in the Impute operator. We may fill in the missing values with a constant value, but we can also use a distribution-based value, where a random value is generated by the system, or a decision tree based method.

Figure 15.10. The output of imputation

The results of the imputation, correlated with the target variable, are shown in the following two bar charts.

Figure 15.11. The relationship of an input and the target variable before imputation

Figure 15.12.
The relationship of an input and the target variable after imputation

Interpretation of the results

The experiment shows that if the method of imputation is chosen appropriately, the values obtained in place of the missing data are not very distorted, and thus, on a larger dataset, we can perform a more reliable fitting of the model.

Video

Workflow
sas_preproc_exp3.xml

Keywords
replacement, imputation

Operators
Data Source
Graph Explore
Impute
Replacement

Chapter 16. Classification Methods 1

Decision trees

1. Classification by decision tree

Description

The process demonstrates how to classify by the Decision Tree operator in the case when the target is a nominal attribute. Here the Wine dataset is used, whose target variable has three values. In order to build a decision tree classifier, it is worth dividing the dataset into training and validation datasets. The current best splitting rule is then found by the algorithm on the training set, but the growth of the tree is stopped, using the validation dataset, when the algorithm does not find a significant split. In the partitioning step a test dataset can be separated too, in order to measure the generalization ability of the resulting tree, but this is not recommended here due to the limited size of the data set. The decision tree resulting from the process can be displayed, showing the decisions at the splits of the model. Using the principle of majority voting, the algorithm decides which class label should be assigned to each leaf (terminal node).

Input

Wine [UCI MLR]

Output

In the case of a nominal target variable, we can decide about the execution of each split on the basis of various impurity measures, such as the chi-square, the Gini index, or the entropy. For these, and for the reliability of splitting, depending on the chosen measure, a parameter value can be specified.
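Two of the impurity measures just mentioned, the Gini index and the entropy, can be sketched as ordinary functions over the class counts of a node (a minimal sketch, not the operator's exact implementation):

```python
from math import log2

def gini(counts):
    """Gini index of a node's class distribution: 1 - sum(p_i^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy of a node's class distribution: -sum(p_i * log2(p_i))."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# A pure node has zero impurity; an evenly mixed node is maximally impure.
g_pure, g_mixed = gini([10, 0]), gini([5, 5])
e_pure, e_mixed = entropy([10, 0]), entropy([5, 5])
```

A split is chosen so that the weighted impurity of the child nodes decreases the most relative to the parent.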
In addition, the stopping condition of splitting can be determined by giving the minimum size of a set of records that can be divided further, or the maximal depth of the tree. We may also set the maximum number of branches of a node; the default is 2, that is, the algorithm builds a binary tree. It is also possible to decide whether we wish to treat missing values as a possible splitting value. We can also decide whether the input attributes are used only once or several times while the decision tree is produced.

Figure 16.1. The settings of dataset partitioning

When partitioning the dataset, different sampling methods can be chosen and the proportions among the training, validation, and test datasets can be determined. The partitioning can be carried out simply by considering the order of records, randomly, or by stratifying with respect to the target variable. Stratified sampling ensures the same proportion of each class in the training, validation, and test sets.

Figure 16.2. The decision tree

The results of the classification can be seen in the decision tree for both the training and validation datasets, including the number of records of each class in every vertex of the tree. On the edges between vertices, the variables that define the splits and their splitting values are presented. The thickness of the lines is proportional to the number of concerned records.

Interpretation of the results

The evaluation of the resulting decision tree is supported by numerous statistical indicators and graphical tools. The most important ones are displayed in multiple windows at a time, where comparisons can be made. These windows can also be opened one by one from the view menu.
With the help of these tools, wrong decisions can be filtered out and the modeling process can be tuned by using further background information or domain knowledge. An interactive tree building process also helps in this.

Figure 16.3. The response curve of the decision tree

The response curve above shows, for the training and validation datasets, how many percent of the records are classified correctly, based on the ranking of records according to their goodness. The curve is generally monotonically decreasing.

Figure 16.4. Fitting statistics of the decision tree

In the Fit Statistics table, different indicators of the fit of the decision tree classifier produced by the algorithm can be seen. The simplest and most important one among them is the misclassification rate (in the red circle), which shows the proportion of wrong classifications.

Figure 16.5. The classification chart of the decision tree

The classification bar chart details which classes the model handles well or poorly.

Figure 16.6. The cumulative lift curve of the decision tree

From the figure, it can be concluded how the resulting decision tree relates to the best possible model, based on the cumulative lift value.

Figure 16.7. The importance of attributes

The variable importance table shows which variables are involved in the decisions of the decision tree and with what importance. This is a useful tool for users who possess some domain knowledge.

Video

Workflow
sas_dtree_exp1.xml

Keywords
classification, decision tree, pruning, response curve, misclassification rate

Operators
Data Source
Decision Tree
Data Partition

2. Comparison and evaluation of decision tree classifiers
Description

The process illustrates how to fit decision trees using different impurity measures and then how to compare these models, on the basis of the Congressional Voting Records dataset. After the decision trees are built on the training and validation datasets, the best model is selected by the model comparison operator (Model Comparison), using the validation dataset during the decision process. Finally, the quality of the performed classification can be studied on the test dataset. Using the resulting model we can perform scoring, which is the evaluation of the test set or of a data set where the value of the target variable is unknown.

Input

Congressional Voting Records [UCI MLR]

Output

In the process we set the following parameters during the partitioning step.

Figure 16.8. The settings of parameters in the partitioning step

Using the chi-square impurity measure, we obtain the following decision tree.

Figure 16.9. The decision tree using the chi-square impurity measure

Using the entropy impurity measure, we obtain the following decision tree.

Figure 16.10. The decision tree using the entropy impurity measure

Using the Gini impurity measure, we obtain the following decision tree.

Figure 16.11. The decision tree using the Gini index

Interpretation of the results

The first of the three resulting decision trees is the simplest, and its fit is the worst. The other two are fairly similar to each other: they use the same input variables in the splits, and only the splitting values differ a little. The three decision trees can be compared in many ways by using graphical tools and statistics.
For example, the following chart shows that, on the basis of the cumulative response curve, the decision tree given by the Gini index is the best model if we build a model only for the first few deciles.

Figure 16.12. The cumulative response curve of the decision trees

Looking at the efficiency of the classifiers, the number of correctly or incorrectly classified records can be obtained with respect to the two parties, which can be represented by a bar chart.

Figure 16.13. The classification plot

A detailed comparison is possible by using the response curve, the lift curve, and their variants. The following figure shows the (non-cumulative) response curve for the three different datasets and the three types of models. It can be seen how the response curves relate to the best possible and the baseline one. The bottom right figure shows that the decision tree based on the Gini index is close to the optimal one on the test dataset.

Figure 16.14. Response curves of the decision trees

Another possibility is the examination of the score distributions. The model fits well if the red and the blue lines are mirror images of each other and their gradients are high. According to this indicator, the entropy-based decision tree is the best.

Figure 16.15. The score distribution of the decision trees

The three trees can be improved further by changing the level of significance. As a consequence, the decision tree will be built in a different way compared to the original one, and as a result the number of correctly and incorrectly classified records and their distribution will be different. The performance of the resulting models can be read from the following figure, in which the misclassification rate, one of the most important indicators, is underlined.

Figure 16.16.
The main statistics of the decision trees

Video

Workflow
sas_dtree_exp2.xml

Keywords
classification, decision tree, performance evaluation

Operators
Data Source
Decision Tree
Model Comparison
Data Partition
Score

Chapter 17. Classification Methods 2

Rule induction for rare events

1. Rule induction for the classification of rare events

Description

In this experiment, using the Spambase dataset, we show how a baseline classifier can be improved for a binary classification task with rare events by the Rule Induction operator.

Input

Spambase [UCI MLR]

The corresponding input is prepared using the Sample operator. The number of records deposited at the top of the dataset is chosen such that the proportion of positive cases is 5 percent. Then we partition the dataset in the usual way.

Output

Two rule induction models are fitted to the dataset. The former is based on a decision tree model, the latter on a logistic regression model. The fitted models are compared to a baseline decision tree classifier. The figures below show the goodness of fit.

Figure 17.1. The misclassification rate of rule induction

Figure 17.2. The classification matrix of rule induction

Figure 17.3. The classification chart of rule induction

On the left side of the model comparison figure a perfect ROC curve can be seen. This curve clearly shows that the fit is perfect on the training dataset in the case of the second rule induction model.

Figure 17.4. The ROC curves of the rule inductions and the decision tree

In the next output window, besides the usual information, the number of wrongly classified cases can also be seen as the output of the Rule Induction operator. This is very important information in this case.

Figure 17.5.
The output of the rule induction operator

Interpretation of the results

The experiment shows that, when the class distribution is very uneven, i.e. one class frequency is very low, significant improvement can be achieved by the rule induction method compared to the traditional classification models.

Video

Workflow
sas_rules_exp1.xml

Keywords
rule induction, supervised learning, classification

Operators
Data Source
Decision Tree
Model Comparison
Data Partition
Rule Induction
Sample

Chapter 18. Classification Methods 3

Logistic regression

1. Logistic regression

Description

The process shows, using the Spambase dataset, how a regression model can be fitted to a dataset which has a binary target. Conventional linear regression is not suitable for this task, even though the Regression operator offers this option. Instead, we must use the logistic regression method, which is the default option of this operator. We can choose between the following link functions: logit, which gives the procedure its name, probit, and complementary log-log. There is no significant difference among these link functions. Enterprise Miner™ provides another operator for fitting regression: with the Dmine Regression operator, forward stepwise regression can be fitted. In each step, the input variable is selected that contributes most significantly to the variability of the target.

Input

Spambase [UCI MLR]

Output

After fitting the logistic regression, standard statistics and graphs are obtained similarly to other binary classification tasks. Here only the confusion matrix is shown; the rest of the comparison tools are left to the end of this experiment.

Figure 18.1. Classification matrix of the logistic regression
In addition to the usual tools, the regression operators, using the effects plot, also show the importance of the input variables in the regression model built during the process.

Figure 18.2. Effects plot of the logistic regression

In addition to the traditional regression analysis, Enterprise Miner™ provides another operator to fit forward stepwise regression, the Dmine Regression operator. The results can be seen in the figures below.

Figure 18.3. Classification matrix of the stepwise logistic regression

Figure 18.4. Effects plot of the stepwise logistic regression

The two regressions can be compared in the usual way with the Model Comparison operator. The results of this comparison are presented in the following figures.

Figure 18.5. Fitting statistics of the logistic regression models

Figure 18.6. Classification charts of the logistic regression models

Figure 18.7. Cumulative lift curves of the logistic regression models

Figure 18.8. ROC curves of the logistic regression models

Interpretation of the results

The fit statistics and ROC curves clearly show on the test set that the logistic regression model is better than the stepwise logistic regression model.

Video

Workflow
sas_regr_exp1.xml

Keywords
classification, binary target, logistic regression

Operators
Data Source
Dmine Regression
Model Comparison
Data Partition
Regression

2. Prediction of discrete target by regression models

Description

The process shows, using the Wine dataset, how we can fit a regression model to a dataset containing a discrete but non-binary target variable, and how we can perform a classification task using the parameter estimates obtained from the model.
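The link functions available in the Regression operator (logit, probit, complementary log-log) differ only in how a linear predictor z is mapped to a probability. A minimal stdlib sketch of the three inverse link functions:

```python
from math import exp, erf, sqrt

def logit_inverse(z):
    """Logistic (logit link) response: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + exp(-z))

def probit_inverse(z):
    """Probit link response: the standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def cloglog_inverse(z):
    """Complementary log-log response: 1 - exp(-exp(z))."""
    return 1.0 - exp(-exp(z))

# All three map z = 0 into a probability, though not the same one:
p_logit, p_probit, p_cll = logit_inverse(0.0), probit_inverse(0.0), cloglog_inverse(0.0)
```

Logit and probit are symmetric around 0.5, while the complementary log-log link is asymmetric, which is why the choice rarely matters much for well-balanced data.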
The type of the fitted regression model depends on the measurement scale of the discrete target variable. If the target is nominal, then the Regression operator fits separate binary logistic models, in which a selected event of the target is compared to the class of the other values of the discrete target variable. On the other hand, if the target is ordinal, then a common logistic regression model is fitted, wherein only the constant parameters differ while the parameters of the input variables are shared. (As opposed to the nominal case, where these coefficients are different.)

Input

Wine [UCI MLR]

Output

A number of models can be chosen to fit a regression model, e.g., linear or logistic regression. Among them, logistic regression is used in this process. It does not need to be set explicitly: the software recognizes the right type of regression from the metadata of the target. Of course, it is possible to override this option and enforce linear regression, but this does not make sense, because in that case the Model Comparison operator cannot be used to compare the model to other discrete supervised models. The fit of the model can be tested by the well-known statistics and charts.

Figure 18.9. Classification matrix of the logistic regression

The classification chart shows that the fitted model is perfect on the training dataset and has a small error on the validation dataset.

Figure 18.10. The classification chart of the logistic regression

Besides the standard goodness-of-fit tests, the Regression operator presents a bar graph showing the importance of the input variables in the regression models. The higher the coefficient of an input variable, the greater its explanatory power with respect to the target variable.
In the case of an ordinal target there is only one such bar graph, while in the case of a nominal target, the number of bar graphs created is one fewer than the number of different values of the target variable. Since the Class variable has three values, there are two such bar graphs below.

Figure 18.11. The effects plot of the logistic regression

Interpretation of the results

Applying the regression model created on the training set to the validation dataset, the above results show that a model with relatively high accuracy can be built by the Regression operator for a discrete target variable with multiple values. Note that the Dmine Regression operator cannot be applied to the problem discussed here.

Video

Workflow
sas_regr_exp2.xml

Keywords
classification, nominal and ordinal target, logistic regression

Operators
Data Source
Data Partition
Regression

Chapter 19. Classification Methods 4

Neural networks and support vector machines

1. Solution of a linearly separable binary classification task by ANN and SVM

Description

In this experiment, we train a perceptron, as a simple artificial neural network (ANN), and a support vector machine (SVM) on a linearly separable two-dimensional dataset with two classes. The dataset is a subset of the Wine dataset. The classification accuracy of the classifiers is determined on the dataset.

Input

Wine [UCI MLR]

Figure 19.1. A linearly separable subset of the Wine dataset

In order to use the dataset in the experiment, it has to go through significant preprocessing. This involves choosing 2 attributes from the total of 13 by using the Drop operator, then deleting the second class of the Class attribute and changing the measurement level to binary by the Metadata operator.
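The perceptron trained in this experiment can be sketched with the classic error-driven update rule; the points below are illustrative linearly separable data, not the actual Wine values:

```python
def train_perceptron(points, labels, epochs=50, lr=0.1):
    """Train a single perceptron (weights w, bias b) with the classic
    error-driven update rule; labels are +1 / -1."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, p):
    """Sign of the learned linear decision function."""
    return 1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1

# Illustrative linearly separable points (not the actual Wine values).
pts = [(1.0, 1.0), (1.5, 0.5), (4.0, 4.0), (4.5, 3.5)]
ys = [-1, -1, 1, 1]
w, b = train_perceptron(pts, ys)
```

For linearly separable data the perceptron convergence theorem guarantees that this loop eventually stops making updates, which is exactly why both classifiers below reach a perfect training fit.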
Output

The goodness of fit of the perceptron model can be checked by the usual statistics (misclassification rate, the number of incorrectly classified cases) and graphics (response and lift curves).

Figure 19.2. Fitting statistics for the perceptron

Figure 19.3. The classification matrix of the perceptron

Figure 19.4. The cumulative lift curve of the perceptron

In the case of support vector machines (SVM), besides the above mentioned diagnostic tools, more details are given about the support vectors, namely the number of support vectors, the size of the margin, and the list of support vectors.

Figure 19.5. Fitting statistics for the SVM

Figure 19.6. The classification matrix of the SVM

Figure 19.7. The cumulative lift curve of the SVM

Figure 19.8. List of the support vectors

Interpretation of the results

The figures and statistics show that both the perceptron and the support vector machine classify all of the training cases perfectly.

Video

Workflow
sas_ann_svm_exp1.xml

Keywords
perceptron, supervised learning, classification

Operators
Data Source
Drop
Filter
Graph Explore
Metadata
Neural Network
Support Vector Machine

2. Using artificial neural networks (ANN)

Description

In this experiment, a number of algorithms that can be used for training artificial neural networks are compared on a binary classification task. In the experiment the Spambase dataset is used. The classification accuracy of the resulting classifiers is determined, and the interpretation of the related graphs is reviewed.

Input

Spambase [UCI MLR]

Before fitting the models, the dataset is partitioned by the Data Partition operator according to the 60/20/20 ratio among the training, validation, and test datasets.
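The 60/20/20 random partition performed by the Data Partition operator can be sketched with a seeded shuffle (a simple random split; the operator also supports ordered and stratified variants):

```python
import random

def partition(records, ratios=(0.6, 0.2, 0.2), seed=42):
    """Randomly split records into training/validation/test sets
    in the given proportions, with a seed for reproducibility."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train, valid, test = partition(list(range(100)))
```

Fixing the seed makes the split reproducible across reruns, mirroring the Random Seed behaviour described for the Sample operator earlier.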
Output

Firstly, a standard artificial neural network is fitted by the Neural Network operator, where the network topology of a multilayer perceptron is defined with 3 hidden neurons in one hidden layer. The goodness-of-fit of the resulting model can be verified using standard statistics (misclassification rate, the number of incorrectly classified cases) and graphics (response and lift curves).

Figure 19.9. Fitting statistics of the multilayer perceptron

Figure 19.10. The classification matrix of the multilayer perceptron

Figure 19.11. The cumulative lift curve of the multilayer perceptron

In addition to the standard goodness-of-fit tests, we get results which are meaningful for artificial neural networks only. These include the graph of the weights of the neurons and the graph of the training history, where the misclassification rate can be seen as a function of the iteration for the training and validation datasets.

Figure 19.12. Weights of the multilayer perceptron

Figure 19.13. Training curve of the multilayer perceptron

Similar graphs are obtained for the other two neural network fitting operators, namely the DMNeural operator and the AutoNeural operator. For the first, the exception is the following stepwise optimization statistics.

Figure 19.14. Stepwise optimization statistics for the DMNeural operator

Figure 19.15. Weights of the neurons of the network obtained by the AutoNeural operator

Finally, the three models can be compared by the Model Comparison operator. As a result, we obtain the following statistics and graphs.

Figure 19.16. Fitting statistics of the neural networks

Figure 19.17.
Classification charts of the neural networks

Figure 19.18. Cumulative lift curves of the neural networks

Figure 19.19. ROC curves of the neural networks

Interpretation of the results

The above statistics and figures clearly show that the best model is the first artificial neural network with the multilayer perceptron architecture, where there is one hidden layer with 3 neurons.

Video

Workflow

sas_ann_svm_exp2.xml

Keywords

artificial neural network
supervised learning
classification

Operators

Data Source
Model Comparison
Neural Network
Data Partition

3. Using support vector machines (SVM)

Description

In this experiment, support vector machines (SVM) are fitted to solve a binary classification task on the Spambase dataset. The aim of the experiment is the comparison of different kinds of SVMs defined by linear, polynomial, etc. kernel functions. The classification accuracy of the resulting classifiers is determined and we review the interpretation of the statistics and graphs related to support vector machines. The model fitting is carried out by the SVM operator.

Input

Spambase [UCI MLR]

Before fitting the models, the dataset is partitioned by the Data Partition operator at rates of 60/20/20 into training, validation and test datasets.

Output

First, a support vector machine is fitted using a linear kernel. The goodness of fit of the resulting model can be checked by standard statistics (e.g. misclassification rate, number of incorrectly classified cases) and graphs (e.g. response and lift curves). These tools will be discussed later, during the comparison of the two models. Besides these results, the SVM operator provides additional statistics, such as goodness-of-fit measures and the list of support vectors, which are meaningful only for support vector machines.
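The cumulative lift curves that recur throughout these comparisons can be computed directly from scored cases. The following is an illustrative sketch (Enterprise Miner produces these charts automatically; the function name and toy scores are invented):

```python
# Cumulative lift: rank cases by predicted probability of the positive
# class, then compare the positive rate in each top fraction (decile)
# with the overall base rate.

def cumulative_lift(scores, labels, n_bins=10):
    """scores: predicted probability of the positive class; labels: 0/1."""
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda p: -p[0])]
    base_rate = sum(labels) / len(labels)
    lifts = []
    for d in range(1, n_bins + 1):
        top = ranked[: round(len(ranked) * d / n_bins)]
        lifts.append((sum(top) / len(top)) / base_rate)
    return lifts

# Toy scored dataset of 8 cases
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   1,   0]
print(cumulative_lift(scores, labels, n_bins=4))  # [2.0, 1.5, 1.0, 1.0]
```

A lift of 2.0 in the first bin means the top-ranked quarter of the cases contains twice as many positives as a random sample of the same size; the curve always ends at 1.0.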
Figure 19.20. Fitting statistics for the linear kernel SVM

Figure 19.21. The classification matrix of the linear kernel SVM

Figure 19.22. Support vectors (partly) of the linear kernel SVM

Figure 19.23. The distribution of the Lagrange multipliers for the linear kernel SVM

Secondly, a polynomial kernel SVM is fitted to the dataset and compared with the previous model to decide which one is better. The parametrization of the SVM operator can be seen below.

Figure 19.24. The parameters of the polynomial kernel SVM

Figure 19.25. Fitting statistics for the polynomial kernel SVM

Figure 19.26. Classification matrix of the polynomial kernel SVM

Figure 19.27. Support vectors (partly) of the polynomial kernel SVM

The support vector machines with the two different kernels (linear and polynomial) can be compared by the usual statistical and graphical tools.

Figure 19.28. Fitting statistics for the SVMs

Figure 19.29. The classification chart of the SVMs

Figure 19.30. Cumulative lift curves of the SVMs

Figure 19.31. Comparison of the cumulative lift curves to the baseline and the optimal one

Figure 19.32. ROC curves of the SVMs

Interpretation of the results

The above figures and statistics clearly show that the polynomial kernel support vector machine improves the fit of the model compared with the linear kernel one. The misclassification rate is improved by 2 per cent, and the lift and ROC curves also show a significant improvement. The cumulative lift curve shows a better model at the second and third deciles, while the ROC curve also shows an improvement where the specificity is very close to 1.
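The two kernels compared above, and the role of the support vectors and Lagrange multipliers listed in the output, can be sketched as follows. This is an illustrative stand-in, not the SVM operator's implementation; the support vectors, labels and multiplier values below are made up for demonstration.

```python
# Kernel functions and the SVM decision rule:
# sign( sum_i alpha_i * y_i * K(sv_i, x) + b )

def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (linear_kernel(x, z) + c) ** degree

def svm_decision(x, support_vectors, labels, alphas, bias, kernel):
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + bias
    return 1 if score > 0 else -1

# Hypothetical fitted model with two support vectors
svs    = [[1.0, 1.0], [-1.0, -1.0]]
ys     = [1, -1]
alphas = [0.5, 0.5]
print(svm_decision([2.0, 2.0], svs, ys, alphas, 0.0, linear_kernel))    # 1
print(svm_decision([-2.0, -2.0], svs, ys, alphas, 0.0, linear_kernel))  # -1
print(svm_decision([2.0, 2.0], svs, ys, alphas, 0.0,
                   lambda x, z: polynomial_kernel(x, z)))               # 1
```

Only the support vectors (cases with nonzero multipliers) enter the decision function, which is why the operator reports their number and list; swapping the kernel changes the shape of the decision boundary without changing this scoring rule.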
Video

Workflow

sas_ann_svm_exp3.xml

Keywords

support vector machine (SVM)
supervised learning
classification

Operators

Data Source
Model Comparison
Data Partition
Support Vector Machine

Chapter 20. Classification Methods 5

Ensemble methods

1. Ensemble methods: Combination of classifiers

Description

This experiment presents the ensemble classification method using the Ensemble operator. With the help of this operator, a better model can be built from separate supervised data mining models. In the experiment, an ensemble classifier is constructed from a decision tree, a logistic regression and a neural network classifier using the average voting method. The resulting ensemble model is compared with a polynomial kernel SVM derived by the SVM operator. For the evaluation, the misclassification rate is applied on the Spambase dataset.

Input

Spambase [UCI MLR]

Before fitting the models, in the preprocessing step the dataset is partitioned by the Data Partition operator at rates of 60/20/20 into training, validation and test datasets.

Output

The ensemble classifiers can be evaluated by similar tools as other supervised data mining models: statistics like the number of incorrectly classified cases and the misclassification rate, and graphs like lift and response curves.

Figure 20.1. Fitting statistics of the ensemble classifier

Figure 20.2. The classification matrix of the ensemble classifier

Figure 20.3. The cumulative lift curve of the ensemble classifier

The resulting ensemble classifier is compared with a baseline polynomial kernel SVM. The statistics and graphs of this comparison are summarized below.

Figure 20.4. Misclassification rates of the ensemble classifier and the SVM

Figure 20.5. Classification matrices of the ensemble classifier and the SVM

Figure 20.6.
Cumulative lift curves of the ensemble classifier and the SVM

Figure 20.7. Cumulative lift curves of the ensemble classifier, the SVM and the best theoretical model

Figure 20.8. ROC curves of the ensemble classifier and the SVM

Interpretation of the results

The experiment shows that by combining simple classifiers we can obtain a model that is competitive with a supervised model such as the polynomial kernel SVM. The classification matrix clearly shows that the ensemble classification model is better than the SVM, especially on the false positive cases. The cumulative lift curves slightly favor the combined model, and the ROC curve of the combined model lies slightly above that of the SVM.

Video

Workflow

sas_ensemble_exp1.xml

Keywords

ensemble method
supervised learning
SVM
misclassification rate
ROC curve
classification

Operators

Data Source
Decision Tree
Ensemble
Model Comparison
Neural Network
Data Partition
Regression
Support Vector Machine

2. Ensemble methods: bagging

Description

This experiment shows the combined method of bagging. In this method, a better fitting model can be built from supervised data mining models using bootstrap aggregation. Bagging samples the original training dataset, obtaining several subsamples by the bootstrap method. On these subsamples supervised models (decision trees in this experiment) are fitted, and a new model is obtained by aggregating the fitted models. In the experiment, the bagging cycle is set to 10, i.e., 10 decision trees are fitted on 10 different subsamples. The result is compared with a simple decision tree, which is fitted to the entire training dataset.
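The average voting used in the first ensemble experiment above can be sketched in a few lines: the posterior probabilities produced by the component models are averaged, and the case is assigned to the class with the higher average. The probabilities below are invented for illustration.

```python
# Average voting over the posterior probabilities of several classifiers.

def average_vote(posteriors, threshold=0.5):
    """posteriors: per-model probability of the positive class for one case."""
    avg = sum(posteriors) / len(posteriors)
    return ("positive" if avg >= threshold else "negative"), avg

# One case scored by a decision tree, a logistic regression and a neural net
tree_p, logreg_p, nn_p = 0.40, 0.70, 0.65
label, avg = average_vote([tree_p, logreg_p, nn_p])
print(label, round(avg, 4))  # positive 0.5833
```

Note that the tree alone would have voted "negative" here; averaging lets the two more confident models outvote it, which is the intuition behind combining classifiers.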
In the bagging method the basic classifier is determined by the Ensemble operator, which is placed between the Start Groups and End Groups operators. The size of the bagging cycle is set in the Start Groups operator.

Input

Spambase [UCI MLR]

In the preprocessing step the dataset is partitioned by the Data Partition operator at rates of 60/20/20 into training, validation and test datasets.

Output

Similar tools are available to evaluate bagging classifiers as for other supervised data mining models: statistics (number of incorrectly classified cases, misclassification rate) and graphs (response and lift curves). The only additional graph can be seen in the second figure below, where the errors of the 10 classifiers obtained in the consecutive bagging cycles are plotted.

Figure 20.9. The classification matrix of the bagging classifier

Figure 20.10. The error curves of the bagging classifier

The obtained bagging classifier is compared with a reference decision tree fitted on the whole training dataset. The statistical and graphical results obtained are shown below.

Figure 20.11. Misclassification rates of the bagging classifier and the decision tree

Figure 20.12. Classification matrices of the bagging classifier and the decision tree

Figure 20.13. Response curves of the bagging classifier and the decision tree

Figure 20.14. Response curves of the bagging classifier and the decision tree compared with the baseline and the optimal classifiers

Figure 20.15. ROC curves of the bagging classifier and the decision tree
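The bootstrap-and-aggregate mechanism described above can be sketched as follows. This is an illustrative stand-in: the base learner here is a trivial one-dimensional threshold stump instead of the decision trees used in the experiment, and the toy data and seed are invented.

```python
import random

# Bagging: draw bootstrap samples, fit a base learner on each,
# aggregate the predictions by majority vote.

def fit_stump(sample):
    """Pick the threshold on x minimizing training error (brute force)."""
    best = None
    for t in sorted({x for x, _ in sample}):
        err = sum(1 for x, y in sample if (1 if x > t else 0) != y)
        if best is None or err < best[0]:
            best = (err, t)
    return best[1]

def bagging(data, n_models=10, seed=42):
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]   # bootstrap sample
        thresholds.append(fit_stump(boot))
    return thresholds

def predict(thresholds, x):
    votes = sum(1 if x > t else 0 for t in thresholds)
    return 1 if votes * 2 >= len(thresholds) else 0

data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
model = bagging(data)
print([predict(model, x) for x in (0.2, 0.7)])
```

Each bootstrap sample gives a slightly different stump; the majority vote smooths out the variance of the individual learners, which is exactly the effect the experiment measures against a single tree.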
Interpretation of the results

The experiment shows that a better working model can be obtained by taking a bagging classifier rather than a simple decision tree if the models are compared on the first deciles. This is clear from the classification matrix, the response curves and the ROC curve.

Video

Workflow

sas_ensemble_exp2.xml

Keywords

ensemble method
supervised learning
bagging
misclassification rate
ROC curve
classification

Operators

Data Source
Decision Tree
End Groups
Model Comparison
Data Partition
Start Groups

3. Ensemble methods: boosting

Description

The experiment shows the combined method of boosting. In this method, a better fitting model can be built from supervised data mining models. The method is based on the repeated reweighting of the records and the classifiers, in such a way that the wrongly classified cases gain more and more importance, so that we try to classify them into the right class. In the boosting method a basic classifier is selected, which can be a decision tree, a logistic regression, a neural network, etc., of which several copies, given by the boosting cycle, are built up. In this experiment, the basic classifier is a decision tree. The boosting cycle is set to 20, i.e., 20 decision trees are fitted on the whole training dataset. The result is compared with a polynomial kernel support vector machine (SVM), which is recognized as an effective method for binary classification tasks. In the boosting method the basic classifier is determined by the Ensemble operator, which is placed between the Start Groups and End Groups operators. The size of the boosting cycle is set in the Start Groups operator.

Input

Spambase [UCI MLR]

In the preprocessing step the dataset is partitioned by the Data Partition operator at rates of 60/20/20 into training, validation and test datasets.
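The reweighting step at the heart of boosting can be sketched as below. This is an AdaBoost-style illustration; Enterprise Miner's exact weighting scheme may differ, and the weights and error pattern are invented.

```python
import math

# One round of boosting reweighting: misclassified cases gain weight,
# correctly classified cases lose weight; the round's classifier also
# receives a weight (alpha) based on its weighted error.

def reweight(weights, correct, eps=1e-12):
    """weights: current case weights; correct: True where the current
    classifier got the case right."""
    err = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    alpha = 0.5 * math.log((1 - err + eps) / (err + eps))  # model weight
    new_w = [w * math.exp(-alpha if c else alpha)
             for w, c in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]          # one case misclassified
weights, alpha = reweight(weights, correct)
print([round(w, 4) for w in weights])        # the missed case now carries 0.5
```

After one round, the single misclassified case holds half of the total weight, so the next decision tree in the cycle is forced to concentrate on it.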
Output

Similar tools are available to evaluate boosting classifiers as for other supervised data mining models: statistics (number of incorrectly classified cases, misclassification rate) and graphs (response and lift curves). The only additional graph is the second figure, where the errors of the resulting classifiers, 20 decision trees in our case, can be seen.

Figure 20.16. The classification matrix of the boosting classifier

Figure 20.17. The error curve of the boosting classifier

The obtained boosting classifier is compared with a reference polynomial kernel SVM fitted on the whole training dataset. The statistical and graphical results obtained are shown below.

Figure 20.18. Misclassification rates of the boosting classifier and the SVM

Figure 20.19. Classification matrices of the boosting classifier and the SVM

Figure 20.20. Cumulative response curves of the boosting classifier and the SVM

Figure 20.21. Response curves of the boosting classifier and the SVM compared with the baseline and the optimal classifiers

Figure 20.22. ROC curves of the boosting classifier and the SVM

Interpretation of the results

The experiment shows that a classifier obtained by the boosting method is competitive even with a polynomial kernel support vector machine classifier in the sense that, although its misclassification rate is worse, it has higher accuracy in the first few deciles. This can be seen clearly on the response and the ROC curves.

Video
Workflow

sas_ensemble_exp3.xml

Keywords

ensemble method
supervised learning
boosting
ROC curve
classification

Operators

Data Source
Decision Tree
End Groups
Model Comparison
Data Partition
Start Groups
Support Vector Machine

Chapter 21. Association mining

1. Extracting association rules

Description

The process presents, on the Extended Bakery dataset, how association rules can be obtained from a transaction dataset. In transaction datasets, the items that are part of a transaction are recorded, rather than those that are missing. In Enterprise Miner™, such a dataset must be defined as a transaction dataset, and it must include an ID variable and a nominal Target variable. The separate market baskets are formed by the ID variable. Association rule mining (also called market basket analysis) can be carried out by the Market Basket operator. First, the frequent itemsets are extracted; then, the significant association rules are discovered based on these itemsets.

Input

Extended Bakery [Extended Bakery]

Output

Applying the Market Basket operator on the dataset of 20000 records, the following results are obtained.

Figure 21.1. List of items

Figure 21.2. The association rules as a function of the support and the confidence

Figure 21.3. Graph of lift values

Interpretation of the results

The significant association rules can be discovered on the basis of the frequent itemsets. We can select which criteria we would like to consider in order to find the appropriate rules. The default is the confidence of the rules, but other criteria can also be applied to filter the discovered association rules. Based on the established rules, deeper conclusions can be drawn about the relationships in the data.
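The three rule metrics mentioned above can be computed directly from transactions. The following is a minimal sketch on a toy basket set (the transactions and item names are invented; the Market Basket operator computes these over the full dataset):

```python
# Support, confidence and lift of an association rule lhs -> rhs
# over a toy list of market baskets.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    return confidence(lhs, rhs) / support(rhs)

print(support({"bread", "milk"}))                  # 0.5
print(round(confidence({"bread"}, {"milk"}), 4))   # 0.6667
print(round(lift({"bread"}, {"milk"}), 4))         # 0.8889
```

A lift below 1, as for bread → milk here, means the two items co-occur less often than independence would predict, which is why filtering rules by lift rather than confidence alone can change which rules look interesting.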
To support this inference several tools can be applied, e.g. the table of the association rules, where we may filter the relevant rules by choosing different evaluation metrics like support, confidence or lift.

Figure 21.4. List of association rules

Video

Workflow

sas_assoc_exp1.xml

Keywords

frequent itemset
association rule
transaction data

Operators

Data Source
Market Basket

Chapter 22. Clustering 1

Standard methods

1. K-means method

Description

The process shows the usage of the K-means clustering algorithm and illustrates the importance of the choice of its various parameters on the Aggregation dataset. This clustering algorithm can be fitted using the Cluster operator.

Input

Aggregation [SIPU Datasets] [Aggregation]

The dataset consists of 788 two-dimensional vectors, which form 7 separate groups. The task is to discover these groups, which are called clusters. The difficulty of the task lies in the alignment of the points, as smaller and larger clouds of points are present with different distances between them. The visualization is done by the Graph Explore operator.

Figure 22.1. The Aggregation dataset

Output

After reading the dataset, we drag and drop the Cluster operator and apply the following settings. The user-specified cluster number option is chosen and, based on the above scatterplot, the number of clusters is set to 7.

Figure 22.2. The settings of the Cluster operator

The result can also be displayed by the Graph Explore operator. One can see that the algorithm found the upper and right clusters, but it performs weakly on the points below.

Figure 22.3. The result of K-means clustering when K=7
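The K-means procedure behind the Cluster operator alternates between assigning points to the nearest center and recomputing the centers. A minimal sketch of Lloyd's algorithm on toy 2-D points (illustrative only; the initial centers are simply the first K points):

```python
# Minimal K-means (Lloyd's algorithm) on 2-D points.

def kmeans(points, k, iters=20):
    centers = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        centers = [                           # update step: cluster means
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)]
    return centers

points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),    # cloud near the origin
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]    # cloud near (5, 5)
print([tuple(round(c, 2) for c in ctr) for ctr in kmeans(points, 2)])
```

On well-separated spherical clouds like this toy example the centers converge in a couple of iterations; the Aggregation dataset is hard precisely because its clusters are not this well behaved, which is why the parameter choices below matter.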
Let us try a different parameter setting, where the initial cluster centers are chosen by the MacQueen method and the minimum distance between the cluster centers is set to 9.

Figure 22.4. The settings of the MacQueen clustering

One can see that the result is more accurate with this parameter setting; a larger error remains only at the bottom-left set of points. This could only be corrected by a considerably more sophisticated clustering method.

Figure 22.5. The result of the MacQueen clustering

Finally, let us look at what happens if we take a slightly larger number of clusters, say 8. In this case, the algorithm finds the two small clusters in the bottom left, but it cuts the major cluster beside them into three parts and also cuts the upper right cluster.

Figure 22.6. The result of the clustering with 8 clusters

Interpretation of the results

Based on the experiment, we can see that even the simplest clustering algorithm, such as the K-means method, is able to extract simple relationships. Moreover, if we choose the parameters of the algorithm well, the accuracy of the results can be increased. In addition, the Cluster operator provides several visualization functions that help to evaluate the results.

Figure 22.7. The result display of the Cluster operator

After selecting the Results menu, the main window shown in the figure above appears, where you can see the results summary. On the top left, the cluster (segment) plot can be seen as a function of the input attributes. On the bottom left, a pie chart shows the sizes of the clusters.
On the top right, the cluster statistics can be read, while on the bottom right the output list is shown. These windows can be enlarged individually. Among the many other tools available we want to point out the following two.

Figure 22.8. Scatterplot of the cluster means

The figure above shows the centers of the clusters along with the overall averages of the attributes. Finally, the figure below shows the decision tree constructed from the clusters, which is obtained by taking the resulting cluster variable as a classification target variable and solving the classification task by fitting a decision tree.

Figure 22.9. The decision tree of the clustering

Video

Workflow

sas_clust_exp1.xml

Keywords

K-means
unsupervised learning
clustering

Operators

Cluster
Data Source
Graph Explore

2. Agglomerative hierarchical methods

Description

The process shows, using the Maximum Variance (R15) dataset, how agglomerative hierarchical clustering algorithms work. These clustering algorithms can be run by the Cluster operator.

Input

Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]

The dataset contains 600 two-dimensional vectors, which are concentrated into 15 clusters. The points are aligned around a center with coordinates (10,10), at increasing distances from each other as they get further from the center. This is the difficulty of the task, as the clusters near the center are close to blending into each other.

Figure 22.10. Scatterplot of the Maximum Variance (R15) dataset

Output

Firstly, the average linkage hierarchical method is applied. In this case, the distance between the clusters is calculated by the algorithm as the average of the pairwise distances of the cluster elements. The results are shown in the following figure.

Figure 22.11.
The result of the average linkage hierarchical clustering

The goodness of the clustering can be assessed by plotting the original grouping (the Class attribute) against the Segment attribute, which contains the cluster membership obtained after clustering, in a spatial bar chart. It can be seen that, apart from a permutation, there is a one-to-one correspondence between the rows and the columns, except for two records.

Figure 22.12. Evaluation of the clustering by a 3D bar chart

Another hierarchical clustering method is the Ward method. Using this, we obtain the following results.

Figure 22.13. The result of Ward clustering

Interpretation of the results

The process demonstrated that if the number of possible clusters is relatively large, it is worth choosing one of the automatic clustering procedures. In SAS® Enterprise Miner™, hierarchical clustering is available for this purpose in several different ways. The experiment also shows that the choice of the agglomerative method does not always affect the resulting clusters. SAS proposes the number of clusters by investigating the CCC (cubic clustering criterion) graph, see the figure below.

Figure 22.14. CCC plot of the automatic clustering

In addition, a schematic display of the location of the clusters, the so-called proximity diagram, is also obtained, which is clearly similar to the previously obtained scatterplot of the clusters.

Figure 22.15. Proximity graph of the automatic clustering

Video

Workflow

sas_clust_exp2.xml

Keywords

hierarchical methods
average linkage
Ward method
CCC graph
clustering

Operators

Cluster
Data Source
Graph Explore

3. Comparison of clustering methods
Description

The experiment presents the difference between automatic clustering and clustering where the number of clusters is specified by the user, on the Maximum Variance (D31) dataset. In the experiment the Cluster operator is used.

Input

Maximum Variance (D31)

The dataset consists of 3100 two-dimensional vectors, which are grouped into 31 clusters.

Figure 22.16. The Maximum Variance (D31) dataset

Output

Firstly, an automatic clustering is performed where the Class attribute is ignored. The algorithm finds 31 clusters, which agrees with the original number of clusters. The resulting clusters are shown in the following figure.

Figure 22.17. The result of the automatic clustering

The correctness of the resulting cluster number is clearly shown by the CCC plot.

Figure 22.18. The CCC plot of the automatic clustering

The schematic arrangement of the clusters is shown by the proximity graph below.

Figure 22.19. The proximity graph of the automatic clustering

You can also try a cluster model based on the CCC chart, which has 9 clusters. This can be done by the Ward's version of the K-means algorithm. As a result, the scatterplot and the proximity graph of the clusters are shown in the following two figures.

Figure 22.20. The result of K-means clustering

Figure 22.21. The proximity graph of K-means clustering

Then, by the so-called segment profiling, the resulting clusters can be investigated from the point of view of how the input variables determine the clusters.

Figure 22.22. The profile of the segments (clusters)

Interpretation of the results

The experiment shows that the automatic clustering is able to find the correct number of clusters in the case of a relatively large number of closely spaced, but spherical groups.
If the proposed number of clusters is too large, it can be reduced to a reasonable size by analyzing the CCC graph and searching for a suitable breakpoint.

Video

Workflow

sas_clust_exp3.xml

Keywords

automatic clustering
K-means
cluster profiling
CCC graph
clustering

Operators

Cluster
Data Source
Graph Explore
MultiPlot
Segment Profile

Chapter 23. Clustering 2

Advanced methods

1. Clustering attributes before fitting SVM

Description

The process demonstrates how to cluster the attributes by using the Variable Clustering operator when there are a large number of attributes in the dataset. The process uses the Spambase dataset. After clustering the attributes, further supervised data mining methods can be applied; e.g., we may classify the e-mails into spam and non-spam classes.

Input

Spambase [UCI MLR]

The dataset contains 4601 records and 58 attributes. The records are classified into 2 groups by the Class variable, which identifies the spam e-mails, i.e., its value equals 1 if the record is spam and 0 otherwise. The challenge in the dataset is that the relatively large number of attributes slows down the training process. The experiment points out that, after a suitable clustering of the attributes, a model can be obtained that is competitive with models fitted on the whole set of attributes.

Output

During attribute clustering, the columns of the dataset are clustered by a hierarchical method to reduce the dimension of the dataset. The most important parameter of the Variable Clustering operator is the Maximum Clusters, which can be used to adjust the maximal number of clusters. Similar parameters are the maximal number of eigenvalues and the explained variance. You can also choose between the correlation and the covariance matrix in the analysis. One of the most important results is the dendrogram, which visualizes the process of the hierarchical clustering.

Figure 23.1.
The dendrogram of attribute clustering

The relationship between the original attributes and the obtained clusters is depicted in the following graph.

Figure 23.2. The graph of clusters and attributes

The list of cluster memberships, i.e., the sets of attributes belonging to the respective clusters, can be seen in the following figure.

Figure 23.3. The cluster membership

In creating the clusters, the correlation (covariance) between the original attributes plays the most important role. Attributes that are highly correlated with each other end up in the same cluster, as displayed in the following figure.

Figure 23.4. The correlation plot of the attributes

It can also be investigated how high the correlation is between each variable and the new cluster variables obtained. The following figure shows the correlation bar chart of the variable representing the special character dollar.

Figure 23.5. The correlation between clusters and an attribute

After the attribute clustering, an SVM model was fitted to the binary Class variable using the 19 new cluster attributes obtained. Then, the results obtained in this way were compared with a similar model fitted directly to the original 58 attributes. The results below show that the models obtained have similar performance. The classification bar charts show similar classification matrices.

Figure 23.6. Classification charts of the SVM models

The response curve behaves better in some places on the clustered attributes than on the original ones.

Figure 23.7. The response curve of the SVM models

If the cumulative lift functions are compared with the baseline and the best lift functions, similar behavior can be seen.

Figure 23.8. The cumulative lift functions of the SVM models
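The correlation-driven grouping behind attribute clustering can be illustrated with a simple greedy stand-in: attributes whose correlation with a cluster's representative exceeds a threshold are grouped together. This is not the Variable Clustering operator's hierarchical algorithm, only a sketch of the underlying idea; the column names and data are invented.

```python
import math

# Group attributes by pairwise Pearson correlation (greedy illustration).

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cluster_attributes(columns, threshold=0.9):
    clusters = []
    for name in columns:
        for cl in clusters:                    # join the first cluster whose
            rep = cl[0]                        # representative is correlated
            if abs(corr(columns[name], columns[rep])) >= threshold:
                cl.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Toy data: a and b nearly identical, c unrelated (hypothetical columns)
cols = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.1, 2.0, 3.1, 4.0],
    "c": [4.0, 1.0, 3.0, 2.0],
}
print(cluster_attributes(cols))  # [['a', 'b'], ['c']]
```

Replacing each such group by one representative variable is what reduces the 58 Spambase attributes to the 19 cluster attributes used in the SVM comparison above.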
Finally, the ROC curves are very similar to each other.

Figure 23.9. The ROC curves of the SVM models

Interpretation of the results

If there are very many input attributes when training a supervised data mining model, which makes the training very slow, it is worth reducing the dimension by clustering the input attributes. The explanatory power of the resulting model is usually not much worse than that of the one fitted to the original attributes.

Video

Workflow

sas_clust2_exp1.xml

Keywords

attribute clustering
dendrogram
hierarchical methods
clustering
ROC curve
SVM

Operators

Data Source
Model Comparison
Support Vector Machine

2. Self-organizing maps (SOM) and vector quantization (VQ)

Description

The process presents Kohonen's vector quantization (VQ) and the self-organizing map (SOM) algorithms using the Maximum Variance (R15) dataset. These algorithms can be fitted by the SOM/Kohonen operator.

Input

Maximum Variance (R15) [SIPU Datasets] [Maximum Variance]

The dataset consists of 600 two-dimensional records, which are grouped into 15 groups. The points are located around the point with coordinates (10, 10), and their mutual distances increase with their distance from the center. The difficulty of the task is that the groups around the center almost fuse. In the figure below these points are depicted by coloring the different groups.

Figure 23.10. The scatterplot of the Maximum Variance (R15) dataset

Output

First, the method of Kohonen's vector quantization is used. By this method we obtain 10 clusters. The results can be seen in the figure below.

Figure 23.11. The result of Kohonen's vector quantization

The sizes of the clusters can be depicted by a simple pie chart.

Figure 23.12. The pie chart of cluster sizes
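The online update at the heart of Kohonen's vector quantization can be sketched as follows: for each presented data point, the nearest (winning) codebook vector is pulled toward it. This is an illustrative sketch with invented toy data and learning rate, not the SOM/Kohonen operator's implementation (which also applies a neighborhood function in the SOM case).

```python
# Kohonen-style vector quantization: move the winning codebook
# vector toward each presented point.

def vq_train(points, codebook, lr=0.5, epochs=5):
    codebook = [list(c) for c in codebook]
    for _ in range(epochs):
        for p in points:
            d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in codebook]
            w = d.index(min(d))                  # winning unit
            codebook[w] = [ci + lr * (pi - ci)   # pull winner toward p
                           for pi, ci in zip(p, codebook[w])]
    return codebook

points = [(0.0, 0.0), (0.0, 0.2), (4.0, 4.0), (4.2, 4.0)]
codebook = [(1.0, 1.0), (3.0, 3.0)]
trained = vq_train(points, codebook)
print([[round(c, 1) for c in unit] for unit in trained])
```

After a few epochs each codebook vector settles near the mean of the points it wins, so the codebook becomes a small set of prototypes, which is exactly the simplification that makes these methods useful in higher dimensions as well.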
A table displays all the statistics which characterize the clusters, among others the frequency of the clusters, their standard deviations, the maximum distance from the cluster centers, and the number of the adjacent cluster together with the distance to it.

Figure 23.13. Statistics of the clusters

Then, the batch SOM algorithm is applied to the same dataset. In this case, the numbers of row and column segments must be defined; here 6 was chosen. The results are shown in the following two figures. The first one is the schematic graph of the SOM/Kohonen operator on the resulting net, where the coloring shows the frequency of each cell.

Figure 23.14. Graphical representation of the SOM

The second figure is a scatterplot which displays the resulting clusters in the coordinate system of the original input attributes.

Figure 23.15. Scatterplot of the result of the SOM

Interpretation of the results

The experiment shows how to use two unsupervised data mining techniques, vector quantization and self-organizing maps. The two methods are particularly effective for examining 2-dimensional data. However, being important prototype methods, they can greatly simplify further analysis in higher dimensions too.

Video

Workflow

sas_clust2_exp2.xml

Keywords

vector quantization (VQ)
self-organizing map (SOM)
clustering

Operators

Data Source
Graph Explore
Self-organizing Map

Chapter 24. Regression for continuous target

1. Logistic regression

Description

The process shows, using the Spambase dataset, how a regression model can be fitted to a dataset with a binary target. Conventional linear regression is not suitable for this task, even though the Regression operator offers this option. Instead, we must use the logistic regression method, which is the default option of this operator.
We can choose between the following link functions: logit, which gives the method its name, probit, and complementary log-log. There is no significant difference among these link functions. Enterprise Miner™ provides another operator for fitting regression: with the Dmine Regression operator a forward stepwise regression can be fitted. In each step, the input variable is selected that contributes most significantly to the variability of the target.

Input

Spambase [UCI MLR]

Output

After fitting the logistic regression, standard statistics and graphs are obtained, similarly to the binary classification tasks. Here only the confusion matrix is shown; the rest of the comparison tools are left to the end of this experiment.

Figure 24.1. Classification matrix of the logistic regression

In addition to the usual tools, the regression operators also show, using the effects plot, the importance of the input variables in the regression model built during the process.

Figure 24.2. Effects plot of the logistic regression

In addition to the traditional regression analysis, Enterprise Miner™ offers another operator to fit a forward stepwise regression: the Dmine Regression operator. The results can be seen in the figures below.

Figure 24.3. Classification matrix of the stepwise logistic regression

Figure 24.4. Effects plot of the stepwise logistic regression

The two regressions can be compared in the usual way with the Model Comparison operator. The results of this comparison are presented in the following figures.

Figure 24.5. Fitting statistics for logistic regression models

Figure 24.6. Classification charts of the logistic regression models

Figure 24.7. Cumulative lift curve of the logistic regression models
Figure 24.8. ROC curves of the logistic regression models

Interpretation of the results

The fit statistics and ROC curves on the test set clearly show that the logistic regression model is better than the stepwise logistic regression model.

Video

Workflow

sas_regr_exp1.xml

Keywords

classification, binary target, logistic regression

Operators

Data Source, Dmine Regression, Model Comparison, Data Partition, Regression

2. Prediction of discrete target by regression models

Description

The process shows, using the Wine dataset, how we can fit a regression model to a dataset containing a discrete but non-binary target variable and, moreover, how we can perform a classification task using the parameter estimates obtained from the model. The type of the fitted regression model depends on the measurement scale of the discrete target variable. If the target is nominal, the Regression operator fits separate binary logistic models, in which a selected event of the target is compared to the class formed by the other values of the discrete target variable. On the other hand, if the target is ordinal, a common logistic regression model is fitted in which only the constant parameters differ, while the parameters of the input variables are shared. (As opposed to the nominal case, where these coefficients are different.)

Input

Wine [UCI MLR]

Output

A number of model types can be chosen to fit a regression model, e.g., linear or logistic regression. Among them, logistic regression is used in the process. There is no need to set this manually: the software recognizes the right type of regression using the metadata of the target. Of course, it is possible to override this option and to enforce linear regression, but this does not make sense, because in this case the Model Comparison operator cannot be used to compare this model to other discrete supervised models.
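The nominal-target scheme described above, one binary logit model per class value fitted against the rest, can be sketched in pure Python. The 2-D toy data, the names, and the gradient-descent fitter are illustrative assumptions standing in for the Wine attributes and the operator's actual estimation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_binary(points, ys, epochs=2000, lr=0.5):
    """Gradient-descent fit of P(y=1|x) = sigmoid(w1*x1 + w2*x2 + b)."""
    w1 = w2 = b = 0.0
    n = len(points)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(points, ys):
            d = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += d * x1
            g2 += d * x2
            gb += d
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

def fit_one_vs_rest(points, labels):
    """One binary logit model per class value: the class against the rest."""
    return {c: fit_binary(points, [1 if l == c else 0 for l in labels])
            for c in sorted(set(labels))}

def predict(p, models):
    # The predicted class is the one whose model gives the highest probability.
    return max(models,
               key=lambda c: sigmoid(models[c][0] * p[0]
                                     + models[c][1] * p[1] + models[c][2]))

# Invented 2-D stand-in for the Wine data: three well-separated classes.
points = [(0, 0), (0, 1), (1, 0),
          (5, 0), (5, 1), (6, 0),
          (2, 5), (3, 5), (2, 6)]
labels = ['A'] * 3 + ['B'] * 3 + ['C'] * 3
models = fit_one_vs_rest(points, labels)
preds = [predict(p, models) for p in points]
```

Since the target here has three values, three binary models are fitted, matching the "number of values" bookkeeping of the nominal case described above.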
The fit of the model can be tested by the well-known statistics and charts.

Figure 24.9. Classification matrix of the logistic regression

The classification chart shows that the fitted model is perfect on the training dataset and has only a small error on the validation dataset.

Figure 24.10. The classification chart of the logistic regression

Besides the standard goodness-of-fit tests, the Regression operator presents a bar graph showing the importance of the input variables in the regression models. The higher the coefficient of an input variable, the greater its explanatory power with respect to the target variable. In the case of an ordinal target there is only one bar graph, while in the case of a nominal target the number of bar graphs created is the number of different values of the target variable minus one. Since the Class variable has three values, there are two such bar graphs below.

Figure 24.11. The effects plot of the logistic regression

Interpretation of the results

Applying the regression model created on the training set to the validation dataset, the above results show that a model with relatively high accuracy can be built by the Regression operator for a discrete target variable with multiple values. Note that the Dmine Regression operator cannot be applied to the problem discussed here.

Video

Workflow

sas_regr_exp2.xml

Keywords

classification, nominal and ordinal target, logistic regression

Operators

Data Source, Data Partition, Regression

3. Supervised models for continuous target

Description

The process presents, using the Concrete Compressive Strength dataset, how we can fit supervised data mining models to datasets with a continuous target. We can use the Regression operator, the Decision Tree operator, and the Neural Network operator for this task.
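The simplest of these three model types is linear regression. A closed-form least-squares fit, together with the root mean square error used among the comparison statistics of this kind of experiment, can be sketched in pure Python; the five data points are invented, not the Concrete measurements.

```python
import math

def ols_line(xs, ys):
    """Closed-form ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx          # slope
    b = my - a * mx        # intercept
    return a, b

def rmse(xs, ys, a, b):
    """Root mean square error of the fitted line on the given data."""
    return math.sqrt(sum((a * x + b - y) ** 2
                         for x, y in zip(xs, ys)) / len(xs))

# Invented, roughly linear points standing in for one Concrete ingredient
# against compressive strength.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]
a, b = ols_line(xs, ys)
err = rmse(xs, ys, a, b)
```

The same RMSE can then be computed for each candidate model on the held-out test partition, which is exactly how the Model Comparison step below ranks the fitted models.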
By the Regression operator a linear regression model can be fitted. Decision trees are provided by the Decision Tree operator. Finally, the result of the Neural Network operator is a neural network that minimizes a predefined error function on the validation dataset. In each case, a target variable with continuous level metadata must be selected.

Input

Concrete Compressive Strength [UCI MLR] [Concrete]

In the preprocessing step, the dataset is partitioned into training, validation, and test datasets in a 60/20/20 percent ratio, respectively.

Output

The comparison of the resulting models differs significantly from the tools applied for a discrete or binary target. Among the statistical indicators, we can use various information criteria (e.g., AIC, SBC) or the mean square error and its square root. In this case there are no such graphical tools as the lift curve and the classification chart. Instead, we can use graphs showing the averages of the forecasts.

Figure 24.12. Statistics of the fitted models on the test dataset

The figure below shows that the neural network and the linear regression behave fairly similarly.

Figure 24.13. Comparison of the fitted models by means of predictions

Figure 24.14. The observed and predicted means plot

For the following curves, the fit is better the closer they lie to the diagonal straight line.

Figure 24.15. The model scores

In addition to these comparisons, we can also examine the individual models one by one. Below you can see the constructed decision tree, which was created using F-statistics.

Figure 24.16. The decision tree for continuous target

For the neural network model, the weights of the neurons can be visualized.

Figure 24.17.
The weights of the neural network after training

Interpretation of the results

Both the statistics and the graphics show that the neural network model fits best.

Video

Workflow

sas_regr_exp3.xml

Keywords

supervised learning, continuous target, decision tree, linear regression, neural network

Operators

Data Source, Decision Tree, Model Comparison, Neural Network, Data Partition, Regression

Chapter 25. Anomaly detection

1. Detecting outliers

Description

The process presents, using the Concrete Compressive Strength dataset, how we can filter outliers according to various criteria with the Filter operator. The applicable criteria involve the mean square deviation from the mean or the mean absolute deviation; another possibility is the modal center. In the experiment, the records which differ from the mean by more than twice the standard deviation are filtered out.

Input

Concrete Compressive Strength [UCI MLR] [Concrete]

Output

As shown, the above setting filters out a significant number of outliers.

Figure 25.1. Statistics before and after filtering outliers

Figure 25.2. The predicted mean based on the two decision trees

Figure 25.3. The tree map of the best model

Interpretation of the results

The following comparison shows that after the filtering the error of the fitted decision tree is significantly smaller than the error of the decision tree fitted on the full dataset. Thus, in suitable cases, the removal of outliers can improve the performance of supervised models.

Figure 25.4.
Comparison of the two fitted decision trees

Video

Workflow

sas_anomaly_exp1.xml

Keywords

outliers, preprocessing, data cleaning

Operators

Data Source, Decision Tree, Filter, Graph Explore, Model Comparison

Bibliography

Data Sets

[Aggregation] Aristides Gionis, Heikki Mannila, Panayiotis Tsaparas. "Clustering Aggregation". ACM Transactions on Knowledge Discovery from Data (TKDD) 1(1): 4:1–4:30, 2007.

[Compound] C. T. Zahn. "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters". IEEE Transactions on Computers 20(1): 68–86, 1971.

[CoIL Challenge 2000] Peter van der Putten, Maarten van Someren. CoIL Challenge 2000: The Insurance Company Case. Sentient Machine Research, Amsterdam, 2000. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09.

[Concrete] I-Cheng Yeh. "Modeling of strength of high performance concrete using artificial neural networks". Cement and Concrete Research 28(12): 1797–1808, 1998.

[Detrano et al.] R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher. "International application of a new probability algorithm for the diagnosis of coronary artery disease". American Journal of Cardiology 64(5): 304–310, 1989.

[Extended Bakery] Alex Dekhtyar, Jacob Verburg. Extended Bakery Dataset. 2009.

[Flame] Limin Fu, Enzo Medico. "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data". BMC Bioinformatics 8: 3, 2007.

[Jain] Anil K. Jain, Martin H. C. Law. "Data Clustering: A User's Dilemma". Lecture Notes in Computer Science 3776: 1–10, Springer, 2005.

[Maximum Variance] C. J. Veenman, M. J. T. Reinders, E. Backer. "A Maximum Variance Cluster Algorithm". IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9): 1273–1280, 2002.

[SIPU Datasets] Clustering datasets. Speech and Image Processing Unit, School of Computing, University of Eastern Finland.

[StatLib] Mike Meyer,
Pantelis Vlachos. StatLib—Datasets Archive. 2005.

[Titanic] Robert J. MacG. Dawson. "The 'Unusual Episode' Data Revisited". Journal of Statistics Education 3(3), 1995.

[Two Spirals] Kevin J. Lang, Michael J. Witbrock. "Learning to Tell Two Spirals Apart". In David Touretzky, Geoffrey Hinton, Terrence Sejnowski (eds.), Proceedings of the 1988 Connectionist Models Summer School, 52–59. Morgan Kaufmann, 1988.

[UCI MLR] K. Bache, M. Lichman. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013.

Other

[DMBOOK] Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison-Wesley, 2005.

[LIBSVM] Chih-Chung Chang, Chih-Jen Lin. "LIBSVM: A Library for Support Vector Machines". ACM Transactions on Intelligent Systems and Technology 2(3): 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[Neural Network FAQ] Warren S. Sarle. Neural Network FAQ, part 3 of 7: Generalization. 1997. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.

[RapidMiner Manual] RapidMiner 5.0 Manual. Rapid-I GmbH, 2010.

[SAS Enterprise Miner] Getting Started with SAS® Enterprise Miner™. SAS Institute Inc., 2011.

[SAS Enterprise Miner Ref] SAS® Enterprise Miner™: Reference Help. SAS Institute Inc., 2011.