Large synthetic data sets to compare different data mining methods
Victoria Ivanova†, Yaroslav Nalivajko‡
Abstract— Data mining methods are used for knowledge discovery in data. Each data mining method has its own advantages and disadvantages. One approach to evaluating the performance of data mining methods is to use synthetic data with certain characteristics. In this paper we outline the main features of two common data mining methods, Support Vector Machines and Random Forests, and construct suitable datasets to evaluate both methods. We first explain the theory behind each method. Then we describe the artificial data sets we have created. Finally, we compare the methods in their ability to classify the constructed datasets.
Index Terms—Data Mining, Support Vector Machines, Random Forests, Classification, Synthetic
1 Introduction
The evaluation of the performance of a data mining method is an important task [1]. It can help to find an appropriate data mining method for a certain problem. One way to evaluate methods is to compare their performance on synthetic data. Support Vector Machines and Random Forests are common data mining methods which can be used for classification or regression. To evaluate the methods, we have written Python scripts which generate artificial datasets with different characteristics. The scripts allow the user to manipulate some parameters, so that the resulting complexity of the datasets varies. The goal was to construct datasets which would allow us to evaluate the performance of the chosen data mining methods.
The rest of the paper is organized as follows: Section 2 describes the theoretical foundations of the methods. Section 3 gives an overview of common difficulties in real-world datasets. In Section 4 we present the results of classification on some of the generated synthetic datasets. Section 5 summarizes the practical results we achieved. Finally, Section 6 presents the conclusions.
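To illustrate, a generator for the two-dimensional chess dataset used later in the paper could look as follows. This is a minimal sketch in plain Python, not the script actually used; the function name and the `cells` parameter are ours:

```python
import random

def chess_dataset(n_points, n_dims=2, cells=4, seed=0):
    """Sample points uniformly in [0, 1]^d and label them +1/-1
    by the parity of the grid cell they fall into (chessboard pattern)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x = [rng.random() for _ in range(n_dims)]
        parity = sum(int(v * cells) for v in x) % 2
        data.append((x, 1 if parity == 0 else -1))
    return data

points = chess_dataset(1000)
```

Varying `n_points`, `n_dims` and `cells` changes the size and complexity of the resulting dataset.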
Figure 1: Linearly separable training set with the separating hyperplane w^T x + b = 0 and maximal margin ρ [5]
test data, x = (x1, x2, ..., xn), where x ∈ R^n is the vector of attributes, and it is known to which class yi ∈ {1, −1} each sample belongs. The aim is to construct a classifier which optimally separates the data samples. For two linearly separable classes we can find a surface which is equidistant from the class boundaries where the classes are closest to each other. This decision surface is equidistant from the closest points of each class, which are called support vectors, and maximizes the distance between the classes. Such classifiers are called maximum-margin classifiers. We define the algorithm formally as described in [7]. We have:
2 Theoretical foundations of Support Vector Machines and Random Forests
Random Forests and Support Vector Machines are two common data mining methods. Both can be used for classification and regression, and both can be used for binary or multiclass classification. The performance of the methods is often compared and, depending on the problem, one of the methods outperforms the other [2], [3]. In the following, we outline the theory behind each method.
2.1 Support Vector Machines
The Support Vector Machines (SVM) method was first introduced by V. N. Vapnik in 1995 [4]. The SVC (Support Vector Classifier) and the SVR (Support Vector Regressor) are the algorithms for classification and regression, respectively [5]. SVC finds a hyperplane which separates the two classes (binary classification) with a maximal margin. The time complexity is O(m^3) and the space complexity is O(m^2), where m is the training set size [6]. This hyperplane is proven to have generalization ability: the solution also works on data with the same distribution as the test data, which guarantees high predictive ability [6]. There is the
• dot product space R^n
• target function f: R^n → {+1, −1} (binary classification)
• labeled, linearly separable training set: D = {(x̄1, y1), (x̄2, y2), ..., (x̄l, yl)} ⊆ R^n × {+1, −1}, where yi = f(x̄i)
A model f̂: R^n → {+1, −1} should be computed using D such that f̂(x̄) ≅ f(x̄) for all x̄ ∈ R^n. The optimal hyperplane w^T x + b = 0, where w is the weight vector and b is the bias, separates the data, see Figure 1. The aim of the algorithm is to maximize the margin between the hyperplane and the support planes. Without loss of generality the functional margin is fixed to 1 [5]. That results in: w^T xi + b ≥ 1 for yi = 1 and w^T xi + b ≤ −1 for yi = −1. The points which satisfy
Kernel function                     Category
k(x, y) = x · y                     linear
k(x, y) = (γ x · y + c0)^d          polynomial
k(x, y) = exp(−γ ||x − y||^2)       radial (Gauss)
k(x, y) = tanh(γ x · y + c0)        sigmoidal
The optimal hyperplane is computed with the kernel function: Σ_{i=1}^{n} αi* yi φ(xi)^T φ(x) = 0. The advantage of the kernel is that one need not consider the concrete form of the transformation φ, which does not have to be explicitly formulated in the higher-dimensional space. A kernel function must be symmetric, k(x, y) = k(y, x), and positive semidefinite, which means for x, y ∈ X in the interval [a, b] that ∫_a^b ∫_a^b k(x, y) g(x) g(y) dx dy ≥ 0 for all g: X → R. If these conditions hold, the Mercer kernel is defined: K = (k(xi, xj))_{i,j=1}^{n}, and the matrix constructed by the kernel function is symmetric and positive semidefinite. In practice both methods are combined: the kernel trick and the soft margin classifier.
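The Mercer conditions can be checked numerically for a concrete kernel on a finite sample. A small sketch, assuming numpy is available; `rbf_kernel` and `gram_matrix` are illustrative names of our own:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # the radial (Gauss) kernel from the table above
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    """Matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(0).random((6, 2))
K = gram_matrix(X, rbf_kernel)
assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-8    # positive semidefinite
```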
these two equations with equality are called support vectors; we write x* for them. For the geometrically defined distance we have r* = (w^T x* + b)/||w||, so r* = 1/||w|| for yi = 1 and r* = −1/||w|| for yi = −1 [5]. The margin is ρ = 2 · r* = 2/||w||. The aim of SVM is the maximization of the margin ρ = 2/||w|| with respect to w and b such that yi(w^T xi + b) ≥ 1, i = 1, ..., n, which is equivalent to the minimization of (1/2)||w||^2, a convex function [7]. This optimization problem can be solved with the Lagrange multipliers method:
L(w, b, α) = (1/2) w^T w − Σ_{i=1}^{n} αi [yi(w^T xi + b) − 1]
where αi is the Lagrange multiplier and αi ≥ 0 [5]; for every constraint we have one α. The solutions of the Lagrange multipliers method are points that maximize L with respect to α and minimize L with respect to w and b. These solutions are saddle points on the graph of the Lagrangian, and this saddle point is unique [7]. We differentiate the Lagrangian with respect to w and b and set both results to zero. We obtain:
Σ_{i=1}^{n} αi yi = 0  and  w = Σ_{i=1}^{n} αi yi xi
Substituting these equations into the Lagrangian yields the corresponding dual problem: maximize W(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj xi^T xj, subject to Σ_{i=1}^{n} αi yi = 0 and αi ≥ 0 [5]. The Karush-Kuhn-Tucker conditions state that αi [yi(w^T xi* + b) − 1] = 0, where the xi* are the support vectors [7]. So only the support vectors correspond to non-zero αi; all other αi are zero. After the calculation of the Lagrange multipliers we can calculate the weight vector, w* = Σ_i αi yi xi, and the bias, b = 1 − w*^T xs for a support vector xs with ys = 1 [5]. The method described above cannot find a solution if the samples are not completely separable (the margins would be negative), which is often the case in real-world problems. There are two widely adopted approaches to address this problem: the soft-margin classifier and the kernel trick.
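As a tiny worked example of the formulas for w* and b, consider a two-point problem; the α values below are chosen by hand for this configuration, not produced by a solver:

```python
# Two support vectors x1=(1,0) with y1=+1 and x2=(-1,0) with y2=-1
# admit the dual solution alpha1 = alpha2 = 0.5.
support = [((1.0, 0.0), 1, 0.5), ((-1.0, 0.0), -1, 0.5)]

# w* = sum_i alpha_i * y_i * x_i
w = [sum(a * y * x[d] for x, y, a in support) for d in range(2)]

# b = 1 - w*^T x_s for a support vector x_s with y_s = 1
xs = support[0][0]
b = 1 - sum(wd * xd for wd, xd in zip(w, xs))

assert w == [1.0, 0.0] and b == 0.0  # hyperplane x1 = 0, margin 2/||w|| = 2
```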
2.2 Random Forests
The Random Forests (RF) algorithm is commonly used for classification and regression, but can easily be adapted for other tasks. RF is a substantial modification of Bootstrap AGGregatING (or bagging) [8] that builds a large collection of de-correlated trees and then averages them [9]. This allows RF to keep all the advantages of decision tree learning, such as handling continuous and discrete attributes and assessing model quality, while shedding its disadvantages through bagging. As a consequence, Random Forest is a popular data mining method and is implemented in a variety of packages.
2.2.1 Description
To build N trees we take N random subsets of the parameters and N random subsets of the learning set, and use them to build decision trees. Using only random subsets allows us to overcome the overfitting problem commonly met by the decision tree algorithm, and to work with a very large number of predictors and observations [10]. Later, for classification, regression, etc., all the trees vote; the result with the highest number of votes is then chosen.
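The voting step can be sketched as follows; the stub "trees" stand in for real decision trees and are purely illustrative:

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree votes; the class with the most votes wins.
    `trees` is a list of callables mapping a sample to a class label."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# three stub "trees" standing in for real decision trees
trees = [lambda x: 1, lambda x: -1, lambda x: 1]
assert forest_predict(trees, [0.3, 0.7]) == 1
```

For regression, the same structure would average the trees' numeric outputs instead of counting votes.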
2.1.1 Soft margin classifier
The soft margin classifier introduces slack variables ξi. The objective becomes the minimization of (1/2)||w||^2 + C Σ_{i=1}^{n} ξi subject to yi(w^T xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., n. The slack variable characterizes the distance between a misclassified sample and the separating hyperplane. The parameter C defines the compromise between complexity and the number of inseparable points; it is a "regularization" parameter and can be chosen by the user. The dual Lagrangian problem has the same solution in this case; the only difference is that 0 ≤ αi ≤ C. The Karush-Kuhn-Tucker conditions say that
2.2.2 Mechanism of work
The RF algorithm requires a learning set and two main arguments to start: a restriction on the tree depth and the number of trees.
The depth limitation is usually not necessary, because a high number of dimensions in the learning set (which leads to the need to limit the trees) is a rare situation for RF, where the number of dimensions in the learning set has usually already been greatly reduced.
At first, the number of trees can be chosen intuitively and then improved by observing the changes in the out-of-bag error. This value describes the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap training sample. The training and test error tend to level off after some number of trees has been fit. Usually, to
αi [yi(w^T xi + b) + ξi − 1] = 0, i = 1, ..., n, and γi ξi = 0, i = 1, ..., n, where the γi are the Lagrange multipliers for the constraints ξi ≥ 0 [5].
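The soft-margin objective can be evaluated directly for a toy configuration; this sketch (our own helper, not library code) shows how the slack variables and the parameter C enter:

```python
def soft_margin_objective(w, b, data, C):
    """0.5*||w||^2 + C * sum(xi_i), with xi_i = max(0, 1 - y_i*(w.x_i + b))."""
    slacks = [max(0.0, 1 - y * (sum(wd * xd for wd, xd in zip(w, x)) + b))
              for x, y in data]
    return 0.5 * sum(wd * wd for wd in w) + C * sum(slacks)

# the last point violates the margin: with w=(1,0), b=0 its slack is 1.5
data = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
assert soft_margin_objective([1.0, 0.0], 0.0, data, C=1.0) == 2.0
```

A larger C penalizes the same slack more heavily, pushing the optimizer toward fewer margin violations at the cost of a smaller margin.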
2.1.2
Multiclass SVM
In practice, problems with more than two classes are common. As SVM is a two-class classifier, there are different approaches to combining two-class SVMs into a multiclass classifier. The first approach is the one-versus-the-rest approach, which has several disadvantages, e.g. imbalance of the training set. Another approach is often called one-versus-one [8]. The main idea is to train K(K−1)/2 different two-class SVMs, where K is the number of classes, on all pairs of classes. The classification is then decided by the class which is chosen most often [8].
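The one-versus-one voting scheme can be sketched as follows; the stub pairwise classifiers are illustrative only:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(pairwise, x):
    """`pairwise` maps each class pair (a, b) to a binary classifier
    returning a or b; the most-voted class wins."""
    votes = Counter(clf(x) for clf in pairwise.values())
    return votes.most_common(1)[0][0]

classes = [0, 1, 2]                     # K = 3 -> K(K-1)/2 = 3 classifiers
pairwise = {(a, b): (lambda x, a=a: a)  # stub classifiers always voting `a`
            for a, b in combinations(classes, 2)}
assert len(pairwise) == 3
assert one_vs_one_predict(pairwise, None) == 0
```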
2.1.3 Kernel function
A kernel function is based on the inner product; it maps the given data to a feature space of higher (or infinite) dimension, in which the scalar product is transformed nonlinearly: k(x, y) = <Φ(x), Φ(y)>. There are different categories of kernel functions [7]:
build a forest for a starting training set with M samples and N dimensions, bagging uses M samples (with repetition) and √N or log2(N + 1) parameters for classification, and N/3 (with a minimum of five) for regression, randomly chosen for each tree [11]. Then we are free to use one of the decision tree building algorithms, like ID3, C4.5, CART, RI, IndCART, DB-CART, CHAID or MARS. The algorithms differ from each other in the method of choosing the next attribute, the one with maximal gain, as a root node [12], and in finding the optimal question for this parameter. In our case the tests are calculated in WEKA, which uses the RI algorithm [13].
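The parameter-subset sizes quoted above can be encoded in a small helper; this is an illustrative sketch of the rule of thumb, not a fixed prescription:

```python
import math

def feature_subset_size(n_dims, task="classification"):
    """Commonly suggested number of randomly chosen parameters per tree:
    sqrt(N) for classification, N/3 (minimum of five) for regression [11]."""
    if task == "classification":
        return max(1, round(math.sqrt(n_dims)))
    return max(5, n_dims // 3)

assert feature_subset_size(100) == 10
assert feature_subset_size(9, task="regression") == 5
```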
the values of the attributes or through inverting the class label. Noisy data should be preprocessed before classification, but it might be important to see whether a certain data mining method is sensitive to noise or not.
Crosstalk in data means that if there are many relationships within a data set, it might be difficult for a data mining method to identify the strong relationships among many weak ones [1]. This occurs, e.g., if a dataset has some redundant attributes; then it might be difficult for a classifier to detect the important attribute.
Data can be linearly separable or not. For a linear classifier it is easy to classify linearly separable data, but difficult to classify linearly inseparable data. SVM works well on linearly separable data [5]; the performance on not linearly separable data depends on the kernel function and the penalty parameter [16].
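The noise simulation described above can be sketched as follows, e.g. for class noise (illustrative helper, assuming ±1 labels):

```python
import random

def add_class_noise(labels, fraction, seed=0):
    """Flip the class label (+1 <-> -1) of a random fraction of the samples."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(labels)), int(fraction * len(labels)))
    noisy = list(labels)
    for i in idx:
        noisy[i] = -noisy[i]
    return noisy

labels = [1] * 100
noisy = add_class_noise(labels, 0.05)
assert noisy.count(-1) == 5
```

Attribute noise would be simulated analogously, by perturbing a fraction of the attribute values instead of the labels.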
Another problem that may occur in practice is overlapping data. Overlapping means that some samples of different classes have very similar characteristics [17].
Imbalanced data is a classification problem that often occurs in practice. An imbalanced dataset is a dataset in which the number of objects of one class dominates over another class. There are some known techniques to handle this problem: undersampling, oversampling and feature selection [18]. One can either under-sample the dominating class or over-sample the minority class. The imbalance of datasets is a known problem for SVM [19]. The classification of imbalanced data with the Random Forests algorithm has been studied by various researchers [20], [21].
Very large datasets with millions of entries can also be a challenge for a classification method, as can a dataset with too many attributes. The number of attributes critical for classification depends on the method used for classification.
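Random undersampling of the dominating class, one of the techniques mentioned above, can be sketched as (illustrative helper, assuming ±1 labels):

```python
import random

def undersample(data, seed=0):
    """Randomly drop samples of the majority class until both classes
    are equally frequent (one of the techniques mentioned in [18])."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == -1]
    major, minor = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    return rng.sample(major, len(minor)) + minor

data = [([i], 1) for i in range(90)] + [([i], -1) for i in range(10)]
balanced = undersample(data)
assert len(balanced) == 20
```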
2.2.3 Estimation of precision
After all trees are built, we calculate the out-of-bag error. The value of this parameter helps us to estimate how well the number of trees fits the training set. If the out-of-bag error decreases strongly, the number of trees is not sufficient and we can increase it. If the out-of-bag error increases, we have a dataset with a high noise level and can decrease the number of trees, which can improve the results. If the out-of-bag error does not change, the number of trees matches or exceeds the optimal value; we can try to reduce it to speed up the calculations.
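The tuning procedure described above can be summarized as a heuristic; this sketch encodes the three cases from the text (the doubling and halving steps are our own illustrative choice):

```python
def adjust_n_trees(n_trees, oob_before, oob_after, tol=1e-3):
    """Heuristic: grow the forest while the out-of-bag error still drops
    strongly; shrink it if the error rises (noisy data) or has leveled off."""
    if oob_before - oob_after > tol:
        return n_trees * 2           # still improving: add trees
    if oob_after > oob_before + tol:
        return max(1, n_trees // 2)  # noise: fewer trees may help
    return n_trees                   # leveled off: keep (or reduce for speed)

assert adjust_n_trees(100, 0.20, 0.15) == 200
assert adjust_n_trees(100, 0.15, 0.20) == 50
```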
2.2.4 Time Complexity
The time needed to build the model is close to the sum of the complexities of building the individual decision trees. If the trees all have the same complexity, it is the complexity of an individual tree times the number of trees built. If there are n instances and m attributes, the computational cost of building a tree is O(m · n · log(n)) [13]. If M trees are grown, the complexity is O(M · m · n · log(n)) [13].
2.2.5 Space Complexity
The random trees used by Random Forest are, to some extent, binary trees, so the space complexity can be estimated from the number of nodes in a tree. If a class can be decided with h decision nodes (where h represents the height of a tree), the number of nodes can be calculated as 2^h. The whole model's space complexity is O(M · 2^h) [14].
4 Practical results
To compare the performance of the methods, WEKA [13] was used, with its LibSVM and RandomForest implementations for SVM and RF respectively. For SVM the preprocessing of the datasets described in [5] was not done, because the datasets were generated in the appropriate format (arff files) and already had a proper interval ((−1, 1) or (0, 1)), so scaling was not necessary. To evaluate the data mining methods, k-fold cross-validation was used (with the default k = 10). To find appropriate parameters, CVParameterSelection in WEKA was used.
2.2.6 Disadvantages
• Normally random forests do not overfit as more trees are added, but on some datasets, especially noisy data, RF can overfit [15].
• A Random Forest's model description cannot be easily understood, unlike decision trees, where it is easy to comprehend the final structure of the tree.
4.1 Reference dataset
The n-dimensional chess dataset was used as a reference data set, to estimate the relative influence of the other difficulties. This dataset looks like a chessboard, see Figure 2. There are two classes, and the number of dimensions and the number of points can be varied. For the reference chess dataset (a two-dimensional dataset) we obtained the following results; the values describe the percentage of points classified correctly.

Method   1K Points   10K Points   100K Points
SVM      95%         98.93%       99.739%
RF       98.2%       99.85%       99.984%

As we can see, good classification results could be obtained with both methods. We can also see that the classification result improves as the number of points increases.
• The large size of the produced models requires a large volume of memory; an RF for complex data can contain thousands of trees with millions of nodes.
3 Known difficulties in datasets
Our goal was to generate synthetic datasets which would help to stress all the advantages and disadvantages of the chosen data mining methods. There are some known characteristics of datasets which can make classification difficult for almost every data mining method: noise, crosstalk and inherent complexity [1].
Noisy data often occurs in real-world problems. Two types of noise are known: class noise and attribute noise. Attribute noise can have different causes, e.g. missing or unknown attribute values, incomplete values or wrong values. Class noise can result from either misclassification or contradictory examples (duplicate values). Both categories of noise can easily be simulated, through either slightly changing
4.2 Not-axis-parallel datasets
As mentioned in Section 3, non-axis-parallel data can be a challenge for a data mining method. To address this problem, the chess dataset was transformed: the rotation angle can be varied and the rotation plane can be chosen. For datasets with 10K points we obtained the following results.
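The rotation used to transform the chess dataset can be sketched as (illustrative helper; only the xy-plane case is shown):

```python
import math

def rotate_xy(points, degrees):
    """Rotate each point by the given angle in the xy-plane; further
    coordinates (for higher-dimensional datasets) are left unchanged."""
    a = math.radians(degrees)
    out = []
    for p in points:
        x, y = p[0], p[1]
        out.append([x * math.cos(a) - y * math.sin(a),
                    x * math.sin(a) + y * math.cos(a)] + list(p[2:]))
    return out

rotated = rotate_xy([[1.0, 0.0]], 90)
assert abs(rotated[0][0]) < 1e-9 and abs(rotated[0][1] - 1.0) < 1e-9
```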
Figure 2: 2-dimensional chess dataset with 10K points
Figure 3: 2-dimensional chess dataset with 1K points, rotated 30 degrees in the xy-plane
Figure 4: Chess dataset with class noise
Figure 5: Chess dataset with attribute noise
Method   2D 30 degrees   2D 1 degree   3D 30 degrees
SVM      98.34%          98.37%        90.88%
RF       99.24%          96.99%        82.5%

We can see that the rotation did not change the performance of SVM much, but it significantly reduced the accuracy of RF.

4.3 Noise in data
Noise in datasets, as described in Section 3, is a common classification problem in real datasets. First, we simulated class noise by labeling a certain percentage of the instances with the wrong class label. For class noise we obtained the following results:

Method   2D 1%    2D 2%    2D 5%    5D, 5%
SVM      97.71%   96.29%   92.78%   58.0%
RF       98.88%   97.83%   94.67%   54.25%

Here the RF algorithm achieved better results than SVM, but both algorithms found it difficult to classify the dataset with 5% noise and 5 attributes. To simulate attribute noise, we modified the attributes slightly.

Method   2D, 1K   2D, 10K   3D, 10K
SVM      94.5%    94.2%     90.29%
RF       95.5%    95.2%     92.15%

As we can see, attribute noise reduces the accuracy of both methods, but RF achieved slightly better results.

4.4 Big Datasets
To see the difference in the classification of big datasets, we generated datasets with 100K entries.

Method   3D, 100K   4D, 100K
SVM      98.188%    92.339%
RF       99.893%    97.378%

The classification results of SVM were not as good as those of the RF algorithm.
4.5 Datasets with a high number of attributes
To evaluate the performance of the methods on datasets with different numbers of attributes, we used the multidimensional sphere-in-a-sphere dataset. We tested the influence of a high number of attributes (in our case, a high number of dimensions) on the methods' ability to predict the class correctly. The sphere-in-a-sphere dataset is a binary classification problem: the points inside the smaller sphere are labeled 1 and the points in the outer sphere are labeled −1. We obtained the following results:

Method   2D, 1K   5D, 1K   10D, 10K   10D, 100K
SVM      98.5%    98.2%    97.7%      99.2%
RF       90.7%    96.6%    77.1%      81.4%

As we can see, the number of dimensions (attributes) has no significant effect on the performance of SVM, but the RF algorithm seems to be sensitive to an increasing number of attributes.
4.6 Cross talk dataset
To address the problem of cross talk in data, we constructed datasets with redundant attributes: additional attributes are added to the dataset but do not add any information for the classification. A reference chess dataset was modified through such a procedure; the datasets varied in the number of redundant attributes (r.a.). The table presents the classification results for datasets with different numbers of redundant attributes.

Method   1K, 1 r.a.   10K, 8 r.a.   100K, 16 r.a.
SVM      88.4%        54.96%        54%
RF       97.2%        99.6%         99.737%

We can see that the number of additional attributes had no significant effect on the classification result of RF; the number of points correlates with the classification accuracy. The cross talk in the data reduces the accuracy of SVM.
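Adding redundant attributes to an existing dataset can be sketched as (illustrative helper; the redundant values are uniform random and carry no class information):

```python
import random

def add_redundant_attributes(X, n_redundant, seed=0):
    """Append attributes with no relation to the class label,
    simulating crosstalk through redundant attributes."""
    rng = random.Random(seed)
    return [x + [rng.random() for _ in range(n_redundant)] for x in X]

X = [[0.1, 0.9], [0.4, 0.2]]
X_noisy = add_redundant_attributes(X, 8)
assert all(len(x) == 10 for x in X_noisy)
```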
4.7 Linearly not separable datasets
For testing linearly not separable data, we generated a multidimensional sphere-in-a-square dataset, see Figure 6. It consists of points which are separated into two classes: the points inside the sphere are labeled −1 and the points outside the sphere are labeled 1. The origin of the sphere is the point (0.5, 0.5, ...) for an n-dimensional sphere, and the radius was chosen to keep the ratio of points in the two classes at about 1:1. For the classification of the sphere-in-a-square dataset we achieved
5 Discussion
We could see that SVM and RF showed different results on some of the chosen datasets. While SVM handled not-axis-parallel datasets better, RF achieved better results on noisy data (both attribute and class noise). On big datasets there was no significant difference in the performance of the methods. SVM handled linearly not separable datasets better than RF. The cross talk in the data significantly reduced the accuracy of SVM but was not a problem for RF.
Figure 6: 2-dimensional sphere in a square
the following results.

Method   2D, 10K   3D, 10K   5D, 10K
SVM      98.1%     98.73%    97.45%
RF       99.01%    96.69%    95.8%

Both methods showed good classification results.
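A generator for the sphere-in-a-square dataset could be sketched as follows (illustrative; the fixed `radius` default stands in for the per-dimension tuning described above):

```python
import math
import random

def sphere_in_square(n_points, n_dims=2, radius=0.4, seed=0):
    """Points uniform in [0, 1]^d; label -1 inside the sphere around
    (0.5, ..., 0.5), +1 outside. The radius would be tuned per dimension
    to keep the class ratio near 1:1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x = [rng.random() for _ in range(n_dims)]
        dist = math.sqrt(sum((v - 0.5) ** 2 for v in x))
        data.append((x, -1 if dist <= radius else 1))
    return data

data = sphere_in_square(1000)
```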
Figure 7: 2-dimensional sphere in a sphere

6 Conclusions
We could see that datasets with different grades of difficulty can be constructed and used to compare two data mining methods. The main challenge is to identify which kinds of difficulties in a dataset should be used to evaluate a certain data mining method. Synthetic data can also help to evaluate the performance of a data mining method on different kinds of data. We could identify important characteristics of datasets which are difficult for the chosen data mining methods. To find all the weaknesses and strengths of the chosen methods, more experimental work would be needed. The problems which we discussed in Section 3 should be used as a basis for the construction of other datasets for future evaluation. It would also be interesting to use the generated datasets to evaluate other important data mining methods, e.g. Sparse Grid or Neural Network algorithms.

References
[1] P. Scott and E. Wilkins. Evaluating data mining procedures: techniques for generating artificial data sets. 41, 1999.
[2] Joseph O. Ogutu, Torben Schulz-Streeck, and Hans-Peter Piepho. A comparison of random forests, boosting and support vector machines for genomic selection. 2011.
[3] Yuchun Tang, Weilai Yang, and Sven Krasser. Support vector machines and random forests modeling for spam senders behavior analysis. 2008.
[4] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20, 1995.
[5] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, 2003.
[6] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 2005.
[7] Lutz Hamel. Knowledge Discovery with Support Vector Machines. John Wiley & Sons, 2009.
[8] Christopher M. Bishop. Pattern Recognition and Machine Learning. 2006.
[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, chapter Random Forests, pages 587-604. Springer, New York, NY, 2009.
[10] Richard A. Berk. Statistical Learning from a Regression Perspective, chapter Random Forests, pages 1-63. Springer, New York, NY, 2008.
[11] Claude Sammut and Geoffrey I. Webb, editors. Encyclopedia of Machine Learning, chapter Random Forests, page 828. Springer US, Boston, MA, 2010.
[12] Sung-Hyuk Cha and Charles Tappert. A genetic algorithm for constructing compact binary decision trees. Journal of Pattern Recognition Research, 2009.
[13] I. H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2011.
[14] Gilles Louppe. Understanding Random Forests. 2015.
[15] Leo Breiman. Random forests. 2001.
[16] Steve R. Gunn. Support vector machines for classification and regression. Technical report, 1998.
[17] School of Economics and Management, Beihang University. Classification with Class Overlapping: A Systematic Study. The 2010 International Conference on E-Business Intelligence. Atlantis Press, 2010.
[18] Sotiris Kotsiantis, Dimitris Kanellopoulos, and Panayiotis Pintelas. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30, 2006.
[19] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying support vector machines to imbalanced datasets. In J.-F. Boulicaut et al. (Eds.): ECML 2004, LNAI 3201, 2004.
[20] Taghi M. Khoshgoftaar, Moiz Golawala, and Jason Van Hulse. An empirical study of learning from imbalanced data using random forest. 2007.
[21] Chao Chen, Andy Liaw, and Leo Breiman. Using random forest to learn imbalanced data. 2004.