Large synthetic data sets to compare different data mining methods
Victoria Ivanova†, Yaroslav Nalivajko‡
Abstract— Data mining methods are used for knowledge discovery in data. Each data mining method has its own advantages and disadvantages. One approach to evaluating the performance of data mining methods is to use synthetic data with certain characteristics. In this paper we outline the main features of two common data mining methods, Support Vector Machines and Random Forests, and construct suitable datasets to evaluate both methods. We first explain the theory behind each method. Then we describe the artificial data sets we have created. Finally, we compare the methods in their ability to classify the constructed datasets.
Index Terms—Data Mining, Support Vector Machines, Random Forests, Classification, Synthetic
1 Introduction
The evaluation of the performance of a data mining method is an important task [1]. It can help to find an appropriate data mining method for a certain problem. One way to evaluate methods is to compare their performance on synthetic data. Support Vector Machines and Random Forests are common data mining methods which can be used for classification or regression. To evaluate the methods, we have written Python scripts which generate artificial datasets with different characteristics. The scripts allow the user to manipulate some parameters, so that the resulting complexity of the datasets varies. The goal was to construct datasets which would allow us to evaluate the performance of the chosen data mining methods.
The rest of the paper is organized as follows: Section 2 describes the theoretical foundations of the methods. Section 3 gives an overview of common difficulties in real-world datasets. In Section 4 we present the results of classification on some of the generated synthetic datasets. Section 5 summarizes the practical results we achieved. Finally, Section 6 presents the conclusions.
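To illustrate, a generator for the two-dimensional chess dataset used later in the paper could look as follows. This is a minimal sketch in plain Python, not the script actually used; the function name and the `cells` parameter are ours:

```python
import random

def chess_dataset(n_points, n_dims=2, cells=4, seed=0):
    """Sample points uniformly in [0, 1]^d and label them +1/-1
    by the parity of the grid cell they fall into (chessboard pattern)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x = [rng.random() for _ in range(n_dims)]
        parity = sum(int(v * cells) for v in x) % 2
        data.append((x, 1 if parity == 0 else -1))
    return data

points = chess_dataset(1000)
```

Varying `n_points`, `n_dims` and `cells` changes the size and complexity of the resulting dataset.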
Figure 1: Linearly separable training set with the separating hyperplane w^T x + b = 0 and maximal margin ρ [5]
test data, x = (x1, x2, ..., xn), where x ∈ R^n is the vector of attributes, and it is known to which class yi ∈ {1, −1} each sample belongs. The aim is to construct a classifier which optimally separates the data samples. For two linearly separable classes we can find a surface which is equidistant from the class boundaries where the classes are closest to each other. This decision surface is equidistant from the closest points of each class, which are called support vectors, and maximizes the distance between the classes. Such classifiers are called maximum-margin classifiers. We define the algorithm formally as described in [7]. We have:
2 Theoretical foundations of Support Vector Machines and Random Forests
Random Forests and Support Vector Machines are two common data mining methods. Both can be used for classification and regression, and both can be used for binary or multiclass classification. The performance of the methods is often compared and, depending on the problem, one of the methods outperforms the other [2], [3]. In the following, we outline the theory behind each method.
2.1 Support Vector Machines
The Support Vector Machines (SVM) method was first introduced by V. N. Vapnik in 1995 [4]. The SVC (Support Vector Classifier) and the SVR (Support Vector Regressor) are the algorithms for classification and regression, respectively [5]. SVC finds a hyperplane which separates the two classes (binary classification) with a maximal margin. The time complexity is O(m^3) and the space complexity is O(m^2), where m is the training set size [6]. This hyperplane is proven to have generalization ability: the solution also works on data with the same distribution as the test data, which guarantees high predictive ability [6]. There is the
• dot product space R^n
• target function f: R^n → {+1, −1} (binary classification)
• labeled, linearly separable training set: D = {(x̄1, y1), (x̄2, y2), ..., (x̄l, yl)} ⊆ R^n × {+1, −1}, where yi = f(x̄i)
A model f̂: R^n → {+1, −1} should be computed using D such that f̂(x̄) ≅ f(x̄) for all x̄ ∈ R^n. The optimal hyperplane w^T x + b = 0, where w is the weight vector and b is the bias, separates the data, see Figure 1. The aim of the algorithm is to maximize the margin between the hyperplane and the support planes. Without loss of generality the functional margin is fixed to 1 [5]. That results in: w^T xi + b ≥ 1 for yi = 1 and w^T xi + b ≤ −1 for yi = −1. The points which satisfy
Kernel function                     Category
k(x, y) = x · y                     linear
k(x, y) = (γ x · y + c0)^d          polynomial
k(x, y) = exp(−γ ||x − y||^2)       radial (Gauss)
k(x, y) = tanh(γ x · y + c0)        sigmoidal
The optimal hyperplane is computed with the kernel function: Σ_{i=1}^{n} αi* yi φ(xi)^T φ(x) = 0. The advantage of the kernel is that one need not consider the concrete form of the transformation φ, which does not have to be explicitly formulated in the higher-dimensional space. A kernel function must be symmetric, k(x, y) = k(y, x), and positive semidefinite, which means for x, y ∈ X in the interval [a, b] that ∫_a^b ∫_a^b k(x, y) g(x) g(y) dx dy ≥ 0 for all g: X → R. If these conditions hold, the Mercer kernel is defined: K = (k(xi, xj))_{i,j=1}^{n}, and the matrix constructed by the kernel function is symmetric and positive semidefinite. In practice both methods are combined: the kernel trick and the soft margin classifier.
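The Mercer conditions can be checked numerically for a concrete kernel on a finite sample. A small sketch, assuming numpy is available; `rbf_kernel` and `gram_matrix` are illustrative names of our own:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # the radial (Gauss) kernel from the table above
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    """Matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.default_rng(0).random((6, 2))
K = gram_matrix(X, rbf_kernel)
assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-8    # positive semidefinite
```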
these two equations with equality are called support vectors; we write x* for them. For the geometrically defined distance we have r* = (w^T x* + b)/||w||, so r* = 1/||w|| for yi = 1 and r* = −1/||w|| for yi = −1 [5]. The margin is ρ = 2 · r* = 2/||w||. The aim of SVM is the maximization of the margin ρ = 2/||w|| with respect to w and b such that yi(w^T xi + b) ≥ 1, i = 1, ..., n, which is equivalent to the minimization of (1/2)||w||^2, a convex function [7]. This optimization problem can be solved with the Lagrange multipliers method:
L(w, b, α) = (1/2) w^T w − Σ_{i=1}^{n} αi [yi(w^T xi + b) − 1]
where αi is the Lagrange multiplier and αi ≥ 0 [5]; for every constraint we have one α. The solutions of the Lagrange multipliers method are points that maximize L with respect to α and minimize L with respect to w and b. These solutions are saddle points on the graph of the Lagrangian, and this saddle point is unique [7]. We differentiate the Lagrangian with respect to w and b and set both results to zero. We obtain:
Σ_{i=1}^{n} αi yi = 0  and  w = Σ_{i=1}^{n} αi yi xi
Substituting these equations into the Lagrangian yields the corresponding dual problem: maximize W(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj xi^T xj, subject to Σ_{i=1}^{n} αi yi = 0 and αi ≥ 0 [5]. The Karush-Kuhn-Tucker conditions state that αi [yi(w^T xi* + b) − 1] = 0, where the xi* are the support vectors [7]. So only the support vectors correspond to non-zero αi; all other αi are zero. After the calculation of the Lagrange multipliers we can calculate the weight vector, w* = Σ_i αi yi xi, and the bias, b = 1 − w*^T xs for a support vector xs with ys = 1 [5]. The method described above cannot find a solution if the samples are not completely separable (the margins would be negative), which is often the case in real-world problems. There are two widely adopted approaches to address this problem: the soft-margin classifier and the kernel trick.
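As a tiny worked example of the formulas for w* and b, consider a two-point problem; the α values below are chosen by hand for this configuration, not produced by a solver:

```python
# Two support vectors x1=(1,0) with y1=+1 and x2=(-1,0) with y2=-1
# admit the dual solution alpha1 = alpha2 = 0.5.
support = [((1.0, 0.0), 1, 0.5), ((-1.0, 0.0), -1, 0.5)]

# w* = sum_i alpha_i * y_i * x_i
w = [sum(a * y * x[d] for x, y, a in support) for d in range(2)]

# b = 1 - w*^T x_s for a support vector x_s with y_s = 1
xs = support[0][0]
b = 1 - sum(wd * xd for wd, xd in zip(w, xs))

assert w == [1.0, 0.0] and b == 0.0  # hyperplane x1 = 0, margin 2/||w|| = 2
```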
2.2 Random Forests
The Random Forests (RF) algorithm is commonly used for classification and regression, but can easily be adapted for other tasks. RF is a substantial modification of Bootstrap AGGregatING (or bagging) [8] that builds a large collection of de-correlated trees and then averages them [9]. This allows RF to keep all the advantages of decision tree learning, such as handling continuous and discrete attributes and assessing model quality, while shedding its disadvantages through bagging. As a consequence, Random Forest is a popular data mining method and is implemented in a variety of packages.
2.2.1 Description
To build N trees we take N random subsets of the parameters and N random subsets of the learning set, and use them to build decision trees. Using only random subsets allows us to overcome the overfitting problem commonly met by the decision tree algorithm, and to work with a very large number of predictors and observations [10]. Later, for classification, regression, etc., all the trees vote; the result with the highest number of votes is then chosen.
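The voting step can be sketched as follows; the stub "trees" stand in for real decision trees and are purely illustrative:

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree votes; the class with the most votes wins.
    `trees` is a list of callables mapping a sample to a class label."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# three stub "trees" standing in for real decision trees
trees = [lambda x: 1, lambda x: -1, lambda x: 1]
assert forest_predict(trees, [0.3, 0.7]) == 1
```

For regression, the same structure would average the trees' numeric outputs instead of counting votes.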
2.1.1 Soft margin classifier
The soft margin classifier introduces slack variables ξi. The objective becomes the minimization of (1/2)||w||^2 + C Σ_{i=1}^{n} ξi subject to yi(w^T xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., n. The slack variable characterizes the distance between a misclassified sample and the separating hyperplane. The parameter C defines the compromise between complexity and the number of inseparable points; it is a "regularization" parameter and can be chosen by the user. The dual Lagrangian problem has the same solution in this case; the only difference is that 0 ≤ αi ≤ C. The Karush-Kuhn-Tucker conditions say that
2.2.2 Mechanism of work
The RF algorithm requires a learning set and two main arguments to start: a restriction on the tree depth and the number of trees.
The depth limitation is usually not necessary, because a high number of dimensions in the learning set (which leads to the need to limit the trees) is a rare situation for RF, where the number of dimensions in the learning set has usually already been greatly reduced.
At first, the number of trees can be chosen intuitively and then improved by observing the changes in the out-of-bag error. This value describes the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap training sample. The training and test error tend to level off after some number of trees has been fit. Usually, to
αi [yi(w^T xi + b) + ξi − 1] = 0, i = 1, ..., n, and γi ξi = 0, i = 1, ..., n, where the γi are the Lagrange multipliers for the constraints ξi ≥ 0 [5].
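The soft-margin objective can be evaluated directly for a toy configuration; this sketch (our own helper, not library code) shows how the slack variables and the parameter C enter:

```python
def soft_margin_objective(w, b, data, C):
    """0.5*||w||^2 + C * sum(xi_i), with xi_i = max(0, 1 - y_i*(w.x_i + b))."""
    slacks = [max(0.0, 1 - y * (sum(wd * xd for wd, xd in zip(w, x)) + b))
              for x, y in data]
    return 0.5 * sum(wd * wd for wd in w) + C * sum(slacks)

# the last point violates the margin: with w=(1,0), b=0 its slack is 1.5
data = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
assert soft_margin_objective([1.0, 0.0], 0.0, data, C=1.0) == 2.0
```

A larger C penalizes the same slack more heavily, pushing the optimizer toward fewer margin violations at the cost of a smaller margin.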
2.1.2
Multiclass SVM
In practice, problems with more than two classes are common. As SVM is a two-class classifier, there are different approaches to combining two-class SVMs into a multiclass classifier. The first approach is the one-versus-the-rest approach, which has several disadvantages, e.g. imbalance of the training set. Another approach is often called one-versus-one [8]. The main idea is to train K(K−1)/2 different two-class SVMs, where K is the number of classes, on all pairs of classes. The classification is then decided by the class which is chosen most often [8].
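The one-versus-one voting scheme can be sketched as follows; the stub pairwise classifiers are illustrative only:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(pairwise, x):
    """`pairwise` maps each class pair (a, b) to a binary classifier
    returning a or b; the most-voted class wins."""
    votes = Counter(clf(x) for clf in pairwise.values())
    return votes.most_common(1)[0][0]

classes = [0, 1, 2]                     # K = 3 -> K(K-1)/2 = 3 classifiers
pairwise = {(a, b): (lambda x, a=a: a)  # stub classifiers always voting `a`
            for a, b in combinations(classes, 2)}
assert len(pairwise) == 3
assert one_vs_one_predict(pairwise, None) == 0
```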
2.1.3 Kernel function
A kernel function is based on the inner product; it maps the given data to a feature space of higher (or infinite) dimension, in which the scalar product is transformed nonlinearly: k(x, y) = <Φ(x), Φ(y)>. There are different categories of kernel functions [7]:
build a forest for a starting training set with M samples and N dimensions, bagging uses M samples (with repetition) and √N or log2(N + 1) parameters for classification, and N/3 (with a minimum of five) for regression, randomly chosen for each tree [11]. Then we are free to use one of the decision tree building algorithms, like ID3, C4.5, CART, RI, IndCART, DB-CART, CHAID or MARS. The algorithms differ from each other in the method of choosing the next attribute, the one with maximal gain, as a root node [12], and in finding the optimal question for this parameter. In our case the tests are calculated in WEKA, which uses the RI algorithm [13].
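The parameter-subset sizes quoted above can be encoded in a small helper; this is an illustrative sketch of the rule of thumb, not a fixed prescription:

```python
import math

def feature_subset_size(n_dims, task="classification"):
    """Commonly suggested number of randomly chosen parameters per tree:
    sqrt(N) for classification, N/3 (minimum of five) for regression [11]."""
    if task == "classification":
        return max(1, round(math.sqrt(n_dims)))
    return max(5, n_dims // 3)

assert feature_subset_size(100) == 10
assert feature_subset_size(9, task="regression") == 5
```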
the values of the attributes or through inverting the class label. Noisy data should be preprocessed before classification, but it might be important to see whether a certain data mining method is sensitive to noise or not.
Crosstalk in data means that if there are many relationships within a data set, it might be difficult for a data mining method to identify the strong relationships among many weak ones [1]. This occurs, e.g., if a dataset has some redundant attributes; then it might be difficult for a classifier to detect the important attribute.
Data can be linearly separable or not. For a linear classifier it is easy to classify linearly separable data, but difficult to classify linearly inseparable data. SVM works well on linearly separable data [5]; the performance on not linearly separable data depends on the kernel function and the penalty parameter [16].
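The noise simulation described above can be sketched as follows, e.g. for class noise (illustrative helper, assuming ±1 labels):

```python
import random

def add_class_noise(labels, fraction, seed=0):
    """Flip the class label (+1 <-> -1) of a random fraction of the samples."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(labels)), int(fraction * len(labels)))
    noisy = list(labels)
    for i in idx:
        noisy[i] = -noisy[i]
    return noisy

labels = [1] * 100
noisy = add_class_noise(labels, 0.05)
assert noisy.count(-1) == 5
```

Attribute noise would be simulated analogously, by perturbing a fraction of the attribute values instead of the labels.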
Another problem that may occur in practice is overlapping data. Overlapping means that some samples of different classes have very similar characteristics [17].
Imbalanced data is a classification problem that often occurs in practice. An imbalanced dataset is a dataset in which the number of objects of one class dominates over another class. There are some known techniques to handle this problem: undersampling, oversampling and feature selection [18]. One can either under-sample the dominating class or over-sample the minority class. The imbalance of datasets is a known problem for SVM [19]. The classification of imbalanced data with the Random Forests algorithm has been studied by various researchers [20], [21].
Very large datasets with millions of entries can also be a challenge for a classification method, as can a dataset with too many attributes. The number of attributes critical for classification depends on the method used for classification.
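Random undersampling of the dominating class, one of the techniques mentioned above, can be sketched as (illustrative helper, assuming ±1 labels):

```python
import random

def undersample(data, seed=0):
    """Randomly drop samples of the majority class until both classes
    are equally frequent (one of the techniques mentioned in [18])."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == -1]
    major, minor = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    return rng.sample(major, len(minor)) + minor

data = [([i], 1) for i in range(90)] + [([i], -1) for i in range(10)]
balanced = undersample(data)
assert len(balanced) == 20
```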
2.2.3 Estimation of precision
After all trees are built, we calculate the out-of-bag error. The value of this parameter helps us to estimate how well the number of trees fits the training set. If the out-of-bag error decreases strongly, the number of trees is not sufficient and we can increase it. If the out-of-bag error increases, we have a dataset with a high noise level and can decrease the number of trees, which can improve the results. If the out-of-bag error does not change, the number of trees matches or exceeds the optimal value; we can try to reduce it to speed up the calculations.
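The tuning procedure described above can be summarized as a heuristic; this sketch encodes the three cases from the text (the doubling and halving steps are our own illustrative choice):

```python
def adjust_n_trees(n_trees, oob_before, oob_after, tol=1e-3):
    """Heuristic: grow the forest while the out-of-bag error still drops
    strongly; shrink it if the error rises (noisy data) or has leveled off."""
    if oob_before - oob_after > tol:
        return n_trees * 2           # still improving: add trees
    if oob_after > oob_before + tol:
        return max(1, n_trees // 2)  # noise: fewer trees may help
    return n_trees                   # leveled off: keep (or reduce for speed)

assert adjust_n_trees(100, 0.20, 0.15) == 200
assert adjust_n_trees(100, 0.15, 0.20) == 50
```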
2.2.4 Time Complexity
The time needed to build the model is close to the sum of the complexities of building the individual decision trees. If the trees all have the same complexity, it is the complexity of an individual tree times the number of trees built. If there are n instances and m attributes, the computational cost of building a tree is O(m · n · log(n)) [13]. If M trees are grown, the complexity is O(M · m · n · log(n)) [13].
2.2.5 Space Complexity
The random trees used by Random Forest are, to some extent, binary trees, so the space complexity can be estimated from the number of nodes in a tree. If a class can be decided with h decision nodes (where h represents the height of a tree), the number of nodes can be calculated as 2^h. The whole model's space complexity is O(M · 2^h) [14].
4 Practical results
To compare the performance of the methods, WEKA [13] was used, with its LibSVM and RandomForest implementations for SVM and RF respectively. For SVM the preprocessing of the datasets described in [5] was not done, because the datasets were generated in the appropriate format (arff files) and already had a proper interval ((−1, 1) or (0, 1)), so scaling was not necessary. To evaluate the data mining methods, k-fold cross-validation was used (with the default k = 10). To find appropriate parameters, CVParameterSelection in WEKA was used.
2.2.6 Disadvantages
• Normally random forests do not overfit as more trees are added, but on some datasets, especially noisy data, RF can overfit [15].
• A Random Forest's model description cannot be easily understood, unlike decision trees, where it is easy to comprehend the final structure of the tree.
4.1 Reference dataset
The n-dimensional chess dataset was used as a reference data set, to estimate the relative influence of the other difficulties. This dataset looks like a chessboard, see Figure 2. There are two classes, and the number of dimensions and the number of points can be varied. For the reference chess dataset (a two-dimensional dataset) we obtained the following results; the values describe the percentage of points classified correctly.

Method   1K Points   10K Points   100K Points
SVM      95%         98.93%       99.739%
RF       98.2%       99.85%       99.984%

As we can see, good classification results could be obtained with both methods. We can also see that the classification result improves as the number of points increases.
• The large size of the produced models requires a large volume of memory; an RF for complex data can contain thousands of trees with millions of nodes.
3 Known difficulties in datasets
Our goal was to generate synthetic datasets which would help to stress all the advantages and disadvantages of the chosen data mining methods. There are some known characteristics of datasets which can make classification difficult for almost every data mining method: noise, crosstalk and inherent complexity [1].
Noisy data often occurs in real-world problems. Two types of noise are known: class noise and attribute noise. Attribute noise can have different causes, e.g. missing or unknown attribute values, incomplete values or wrong values. Class noise can result from either misclassification or contradictory examples (duplicate values). Both categories of noise can easily be simulated, through either slightly changing
4.2 Not-axis-parallel datasets
As mentioned in Section 3, non-axis-parallel data can be a challenge for a data mining method. To address this problem, the chess dataset was transformed: the rotation angle can be varied and the rotation plane can be chosen. For datasets with 10K points we obtained the following results.
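The rotation used to transform the chess dataset can be sketched as (illustrative helper; only the xy-plane case is shown):

```python
import math

def rotate_xy(points, degrees):
    """Rotate each point by the given angle in the xy-plane; further
    coordinates (for higher-dimensional datasets) are left unchanged."""
    a = math.radians(degrees)
    out = []
    for p in points:
        x, y = p[0], p[1]
        out.append([x * math.cos(a) - y * math.sin(a),
                    x * math.sin(a) + y * math.cos(a)] + list(p[2:]))
    return out

rotated = rotate_xy([[1.0, 0.0]], 90)
assert abs(rotated[0][0]) < 1e-9 and abs(rotated[0][1] - 1.0) < 1e-9
```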
Figure 2: 2-dimensional chess dataset with 10K points
Figure 3: 2-dimensional chess dataset with 1K points, rotated 30 degrees in the xy-plane
Figure 4: Chess dataset with class noise
Figure 5: Chess dataset with attribute noise
Method   2D 30 degrees   2D 1 degree   3D 30 degrees
SVM      98.34%          98.37%        90.88%
RF       99.24%          96.99%        82.5%

We can see that the rotation did not change the performance of SVM much, but it significantly reduced the accuracy of RF.

4.3 Noise in data
Noise in datasets, as described in Section 3, is a common classification problem in real datasets. First, we simulated class noise by labeling a certain percentage of the instances with the wrong class label. For class noise we obtained the following results:

Method   2D 1%    2D 2%    2D 5%    5D, 5%
SVM      97.71%   96.29%   92.78%   58.0%
RF       98.88%   97.83%   94.67%   54.25%

Here the RF algorithm achieved better results than SVM, but both algorithms found it difficult to classify the dataset with 5% noise and 5 attributes. To simulate attribute noise, we modified the attributes slightly.

Method   2D, 1K   2D, 10K   3D, 10K
SVM      94.5%    94.2%     90.29%
RF       95.5%    95.2%     92.15%

As we can see, attribute noise reduces the accuracy of both methods, but RF achieved slightly better results.

4.4 Big Datasets
To see the difference in the classification of big datasets, we generated datasets with 100K entries.

Method   3D, 100K   4D, 100K
SVM      98.188%    92.339%
RF       99.893%    97.378%

The classification results of SVM were not as good as those of the RF algorithm.
4.5 Datasets with a high number of attributes
To evaluate the performance of the methods on datasets with different numbers of attributes, we used the multidimensional sphere-in-a-sphere dataset. We tested the influence of a high number of attributes (in our case, a high number of dimensions) on the methods' ability to predict the class correctly. The sphere-in-a-sphere dataset is a binary classification problem: the points inside the smaller sphere are labeled 1 and the points in the outer sphere are labeled −1. We obtained the following results:

Method   2D, 1K   5D, 1K   10D, 10K   10D, 100K
SVM      98.5%    98.2%    97.7%      99.2%
RF       90.7%    96.6%    77.1%      81.4%

As we can see, the number of dimensions (attributes) has no significant effect on the performance of SVM, but the RF algorithm seems to be sensitive to an increasing number of attributes.
4.6 Cross talk dataset
To address the problem of cross talk in data, we constructed datasets with redundant attributes: additional attributes are added to the dataset but do not add any information for the classification. A reference chess dataset was modified through such a procedure; the datasets varied in the number of redundant attributes (r.a.). The table presents the classification results for datasets with different numbers of redundant attributes.

Method   1K, 1 r.a.   10K, 8 r.a.   100K, 16 r.a.
SVM      88.4%        54.96%        54%
RF       97.2%        99.6%         99.737%

We can see that the number of additional attributes had no significant effect on the classification result of RF; the number of points correlates with the classification accuracy. The cross talk in the data reduces the accuracy of SVM.
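Adding redundant attributes to an existing dataset can be sketched as (illustrative helper; the redundant values are uniform random and carry no class information):

```python
import random

def add_redundant_attributes(X, n_redundant, seed=0):
    """Append attributes with no relation to the class label,
    simulating crosstalk through redundant attributes."""
    rng = random.Random(seed)
    return [x + [rng.random() for _ in range(n_redundant)] for x in X]

X = [[0.1, 0.9], [0.4, 0.2]]
X_noisy = add_redundant_attributes(X, 8)
assert all(len(x) == 10 for x in X_noisy)
```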
4.7 Linearly not separable datasets
For testing linearly not separable data, we generated a multidimensional sphere-in-a-square dataset, see Figure 6. It consists of points which are separated into two classes: the points inside the sphere are labeled −1 and the points outside the sphere are labeled 1. The origin of the sphere is the point (0.5, 0.5, ...) for an n-dimensional sphere, and the radius was chosen to keep the ratio of points in the two classes at about 1:1. For the classification of the sphere-in-a-square dataset we achieved
5 Discussion
We could see that SVM and RF showed different results on some of the chosen datasets. While SVM handled not-axis-parallel datasets better, RF achieved better results on noisy data (both attribute and class noise). On big datasets there was no significant difference in the performance of the methods. SVM handled linearly not separable datasets better than RF. The cross talk in the data significantly reduced the accuracy of SVM but was not a problem for RF.
Figure 6: 2-dimensional sphere in a square
the following results.

Method   2D, 10K   3D, 10K   5D, 10K
SVM      98.1%     98.73%    97.45%
RF       99.01%    96.69%    95.8%

Both methods showed good classification results.
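A generator for the sphere-in-a-square dataset could be sketched as follows (illustrative; the fixed `radius` default stands in for the per-dimension tuning described above):

```python
import math
import random

def sphere_in_square(n_points, n_dims=2, radius=0.4, seed=0):
    """Points uniform in [0, 1]^d; label -1 inside the sphere around
    (0.5, ..., 0.5), +1 outside. The radius would be tuned per dimension
    to keep the class ratio near 1:1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x = [rng.random() for _ in range(n_dims)]
        dist = math.sqrt(sum((v - 0.5) ** 2 for v in x))
        data.append((x, -1 if dist <= radius else 1))
    return data

data = sphere_in_square(1000)
```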
Figure 7: 2-dimensional sphere in a sphere

6 Conclusions
We could see that datasets with different grades of difficulty can be constructed and used to compare two data mining methods. The main challenge is to identify which kinds of difficulties in a dataset should be used to evaluate a certain data mining method. Synthetic data can also help to evaluate the performance of a data mining method on different kinds of data. We could identify important characteristics of datasets which are difficult for the chosen data mining methods. To find all the weaknesses and strengths of the chosen methods, more experimental work would be needed. The problems which we discussed in Section 3 should be used as a basis for the construction of other datasets for future evaluation. It would also be interesting to use the generated datasets to evaluate other important data mining methods, e.g. Sparse Grid or Neural Network algorithms.

References
[1] P. Scott and E. Wilkins. Evaluating data mining procedures: techniques for generating artificial data sets. 41, 1999.
[2] Joseph O. Ogutu, Torben Schulz-Streeck, and Hans-Peter Piepho. A comparison of random forests, boosting and support vector machines for genomic selection. 2011.
[3] Yuchun Tang, Weilai Yang, and Sven Krasser. Support vector machines and random forests modeling for spam senders behavior analysis. 2008.
[4] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20, 1995.
[5] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, 2003.
[6] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 2005.
[7] Lutz Hamel. Knowledge Discovery with Support Vector Machines. John Wiley & Sons, 2009.
[8] Christopher M. Bishop. Pattern Recognition and Machine Learning. 2006.
[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, chapter Random Forests, pages 587-604. Springer, New York, NY, 2009.
[10] Richard A. Berk. Statistical Learning from a Regression Perspective, chapter Random Forests, pages 1-63. Springer, New York, NY, 2008.
[11] Claude Sammut and Geoffrey I. Webb, editors. Encyclopedia of Machine Learning, chapter Random Forests, page 828. Springer US, Boston, MA, 2010.
[12] Sung-Hyuk Cha and Charles Tappert. A genetic algorithm for constructing compact binary decision trees. Journal of Pattern Recognition Research, 2009.
[13] I. H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, 2011.
[14] Gilles Louppe. Understanding Random Forests. 2015.
[15] Leo Breiman. Random forests. 2001.
[16] Steve R. Gunn. Support vector machines for classification and regression. Technical report, 1998.
[17] School of Economics and Management, Beihang University. Classification with Class Overlapping: A Systematic Study. The 2010 International Conference on E-Business Intelligence. Atlantis Press, 2010.
[18] Sotiris Kotsiantis, Dimitris Kanellopoulos, and Panayiotis Pintelas. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30, 2006.
[19] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying support vector machines to imbalanced datasets. In J.-F. Boulicaut et al. (Eds.): ECML 2004, LNAI 3201, 2004.
[20] Taghi M. Khoshgoftaar, Moiz Golawala, and Jason Van Hulse. An empirical study of learning from imbalanced data using random forest. 2007.
[21] Chao Chen, Andy Liaw, and Leo Breiman. Using random forest to learn imbalanced data. 2004.