Pre-Processing Methods for Imbalanced Data Set of Wilted Tree
Ahmet Murat Turk*1, Kemal Ozkan2
1 Department of Computer Engineering, Anadolu University, Eskisehir, Turkiye (e-mail: [email protected])
2 Department of Computer Engineering, Eskisehir Osmangazi University, Eskisehir, Turkiye (e-mail: [email protected])
Corresponding author's e-mail: [email protected]
ABSTRACT
Machine learning algorithms build a model from training data under the implicit assumption that the numbers of instances in the different classes are roughly equal. In real-world problems, data sets are usually unbalanced, and this can seriously degrade the resulting model. Research on imbalanced data sets has focused on over-sampling the minority class or under-sampling the majority class, and several methods that perform well on imbalanced data have been proposed recently, including modified support vector machines, rough-set-based minority-class-oriented rule learning methods, and cost-sensitive classifiers. Although these methods produce an artificially balanced training set, in some real-world problems the type of error is critical, since a false negative error is more expensive than a false positive one. For instance, when classifying satellite images to detect diseased trees, most trees in a forest are naturally expected to be healthy, and a classification algorithm is effective only if this critical information is not lost. One of the reasons trees in a forest become diseased is an insect epidemic. If the classification system fails to detect a wilted tree, the tree not only dries out; the insects carrying the disease also remain and can spread it further. Therefore, the main goal of this work is to minimize false negative errors. In this work, pre-processing methods for imbalanced data sets that steer the classification results toward minimizing false negative errors are discussed.
Keywords: Imbalanced dataset, oversampling, undersampling
1. INTRODUCTION
A data set is called imbalanced if it contains many more samples from one class than from the rest of the classes. Data sets are unbalanced when at least one class is represented by only a small number of training examples (the minority class) while the other classes make up the majority. On imbalanced data sets, classifiers can achieve good accuracy on the majority class but very poor accuracy on the minority class(es), due to the influence the larger majority class has on traditional training criteria [1]. Most standard classification algorithms aim to minimize the error rate, the percentage of incorrectly predicted class labels, and ignore the difference between types of misclassification errors. In particular, they implicitly assume that all misclassification errors cost the same.
In real-world problems such as forest management, when classifying satellite images to detect diseased trees, most of the trees in a forest are naturally expected to be healthy, and a classification algorithm is effective only if this critical information is not lost. One of the reasons trees in a forest become diseased is an insect epidemic. If the classification system cannot detect a wilted tree, the tree not only dries out; the insects carrying the disease also remain and can spread it to other trees. Insect infestation changes the physical and morphological characteristics of trees, but alterations such as changes in how sunlight is absorbed or reflected are almost impossible to detect directly. To detect areas under insect attack, satellite images with suitable spatial and spectral resolution have been used, since it is almost impossible to discover these plants with the human eye [2, 3]. Successful wilted-tree detection depends on the classification algorithm, and such algorithms still need to be improved [4]. In the study by [5], a hybrid intensity-hue-saturation smoothing filter-based intensity modulation (IHS-SFIM) pan-sharpening approach was used to obtain more spatially and spectrally accurate image segments; the training data of the 'Diseased tree' class were then synthetically over-sampled using the Synthetic Minority Over-sampling Technique (SMOTE), and a multi-scale object-based image classification approach was applied. Any data set collected for wilted trees is expected to be imbalanced in this way. Although the study [5] reports significant results, using SMOTE may introduce some drawbacks in the process of generating synthetic samples, and the synthetic minority-class samples receive no further attention after they are generated. To address these drawbacks and improve the classification accuracy of the SMOTE method, new research has emerged [6, 7].
Machine learning algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions. When the data are imbalanced, an over-sampling or under-sampling method is needed to build a fair model. The rest of this paper is organized as follows: the purpose and aim of pre-processing methods, followed by under-sampling and over-sampling methods.
2 PRE-PROCESSING FOR CLASSIFICATION
Classification is one of the essential topics in computer science. Supervised learning algorithms perform machine learning using a training set, and the system is then expected to predict the class of a new record according to its attribute values. A training set should consist of correctly identified observations. Mining highly unbalanced data sets, particularly in a cost-sensitive environment, is among the leading challenges for knowledge discovery and data mining [8, 9]. When one of the target classes of the training set is unbalanced, the learning algorithm fits itself to the class with far more instances. Assume there are 999 samples from one target class and 1 sample from the other. Even if every instance is classified as the class with 999 samples, the classification accuracy will be 99.9%, yet the most significant data in the set will be lost. For such situations, the cost of errors is calculated with the confusion matrix, illustrated in Table 1 below.
The evaluation metrics most relevant to imbalanced classes are recall (sensitivity), specificity, precision, F-measure, and the geometric mean (G-mean). Sensitivity and specificity are used to monitor the classification performance on each individual class. Precision is used in problems focused on high performance on only one class, while F-measure and G-mean are used when the performance on both the majority and minority classes needs to be high [10]. Performance assessment is usually based on the F-score, but in wilted-tree recognition the most significant records are the actual positives. Even if we classify an actual negative record as positive (a false positive), this matters far less than a false negative error.
Table 1 Confusion matrix

                      Actual Positive    Actual Negative
Predicted Positive    True Positive      False Positive
Predicted Negative    False Negative     True Negative
A false negative error means a diseased tree goes undetected; the tree not only dries out, but the insects carrying the disease also remain and can spread it further. Therefore, the main goal of this work is minimizing false negative errors.
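To make the definitions concrete, the metrics above follow directly from the four cells of Table 1. The following is a minimal Python sketch with assumed example counts (hypothetical values, not results from this paper):

tp, fn = 70, 4      # wilted trees detected / missed (assumed counts)
fp, tn = 30, 4235   # healthy trees mislabelled / correctly kept (assumed counts)

recall      = tp / (tp + fn)   # sensitivity: share of wilted trees found
specificity = tn / (tn + fp)   # share of healthy trees correctly kept
precision   = tp / (tp + fp)
f_measure   = 2 * precision * recall / (precision + recall)
g_mean      = (recall * specificity) ** 0.5
print(recall, specificity, precision, f_measure, g_mean)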
When classifying on imbalanced data sets, two strategies can be chosen:
- Over-sampling: inserting minority-class data points.
- Under-sampling: removing majority-class data points.
It is possible to combine over-sampling of the minority class with under-sampling of the majority class [11].
2.1. Under-Sampling
The most common pre-processing technique is random majority under-sampling (RUS), IN
RUS, Instances of the majority class are randomly discarded from the dataset. However, the
main drawback of under-sampling is that potentially useful information contained in these
ignored examples is neglected [1]. To not lose this useful information’s, other than random
under-sampling should be implemented. There are some researches which focuses picking up
values intelligently [12]. Since it is expected that number of diseased tree will be very less than
healty tree count, these strategies should be implement in this case. Processing satellite
images,metioned in Chapter 1, will obtain images attributes, basically an image consist on tree
main colour (red-green-blue). Comparison of healty and diseased trees will show that attributes
values of healty instances is higher than wilted tree hereby wilted trees look pale. As a
consequence some healthy tree instances will be extreme or outlier values when using statistics
test.
While balancing the training data set, records that belong to the majority class but are very similar, or very close in distance, to members of the minority class should be excluded from the set. In this work, the purpose of under-sampling is to find the samples that are close to the classification boundary and belong to the majority class. Omitting these values may increase false positive errors but decreases false negative errors. Several different methods for finding these values are suggested below.
2.1.1. Interquartile Range
The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the upper and lower quartiles [13]. It is commonly used as a filter to find outliers and extreme values; however, this also means those values have the greatest distance from the minority class, so they can be used as the majority-class data in the training set. Once this filter has been applied, the structure of the interquartile range makes it possible to apply it again and find new samples: after extracting the values marked as outliers, the Q1, Q2, and Q3 values change, and recomputing them marks other samples as outliers, which are then added to the training set. The main goal of the interquartile-range method is to define a set of similar values while extracting the dissimilar values we are looking for. Figure 1 shows the flow chart for sampling with the interquartile range.
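A minimal Python sketch of this iterative filter is given below, for a single majority-class attribute. The function name, the iteration count, and the conventional Tukey whisker factor k = 1.5 are assumptions made for illustration, not values specified by the method above.

import numpy as np

def iqr_undersample(values, n_iter=4, k=1.5):
    # Iteratively pull IQR outliers out of one majority-class attribute.
    # Each pass removes the outliers from the pool, so the quartiles shift
    # and the next pass can mark new samples, as described above.
    pool = np.asarray(values, dtype=float)
    selected = []
    for _ in range(n_iter):
        q1, q3 = np.percentile(pool, [25, 75])
        iqr = q3 - q1
        mask = (pool < q1 - k * iqr) | (pool > q3 + k * iqr)
        if not mask.any():           # no new outliers: quartiles stabilised
            break
        selected.append(pool[mask])  # most distinct values, kept for training
        pool = pool[~mask]
    return np.concatenate(selected) if selected else np.empty(0)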
2.1.2. Fuzzy Classification
A fuzzy classifier computes a membership value for each class instead of generating a single class label. This fuzzy value in [0, 1] readily indicates which samples belong to which class, and hence which samples should be extracted into the training set.
Since we have two classes, instances with a fuzzy membership value of 0.5 or less may be considered noise, as they may belong to both classes. In line with our goal, the instances with the largest membership values should be selected; in this way, a training set of distinct instances can be generated. Instances can be chosen as the top N by membership value, where N is the minority-class count, or by a threshold (e.g., membership value > 0.6).
The compensatory membership \mu_{\tilde{A}_i,comp}(x) of an object x with m linguistic variables in the given classes can be calculated with the following equation [14]:

\mu_{\tilde{A}_i,comp}(x) = \left( \prod_{j=1}^{m} \mu_{\tilde{A}_j}(x) \right)^{1-\gamma} \left( 1 - \prod_{j=1}^{m} \left( 1 - \mu_{\tilde{A}_j}(x) \right) \right)^{\gamma}, \quad x \in X, \; 0 \le \gamma \le 1   (1)

where \gamma is a control parameter with a default value of 0.5, \mu_{\tilde{A}_j}(x) is the membership value of the object x in a particular linguistic variable, and m is the number of variables.
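A small Python sketch of this selection step, assuming the per-class membership values have already been computed (e.g., with Eq. (1)); the helper name and its parameters are hypothetical:

import numpy as np

def select_by_membership(X_majority, membership, n_keep, threshold=0.5):
    # Discard ambiguous samples (membership <= threshold), then keep the
    # top-N majority samples with the largest membership in their own class.
    membership = np.asarray(membership, dtype=float)
    clear = np.flatnonzero(membership > threshold)
    top = clear[np.argsort(membership[clear])[::-1][:n_keep]]
    return X_majority[top]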
2.1.3. Clustering
Unlike classification, which analyzes class-labeled data objects, clustering analyzes data objects without consulting a known class label [15]. By splitting the raw data set into k pieces, adjacent values are regrouped according to their values. The clusters that do not contain any minority-class member (class W) are extracted into the training set.
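As a hedged sketch of this idea, using k-means from scikit-learn as one possible clustering algorithm (the function name, the choice of k, and the label encoding are assumptions; y is assumed to be a NumPy array of class labels):

import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, k=10, minority='W'):
    # Split the raw set into k clusters and keep only the majority samples
    # whose cluster contains no minority (class W) member.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    impure = np.unique(labels[y == minority])    # clusters touching class W
    keep = ~np.isin(labels, impure) & (y != minority)
    return X[keep]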
2.1.4. Computing Distance Between Instances
As an essential tool in data analysis, distance measurements between data points can be used to choose the most distinct values. The Euclidean, Manhattan, and Chebyshev distances are three basic measurements.
The Euclidean distance function measures the 'as-the-crow-flies' distance. The formula for this distance between a point X = (x_1, ..., x_n) and a point Y = (y_1, ..., y_n) is:

d = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }   (2)
The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding components. The formula for this distance between a point X = (x_1, ..., x_n) and a point Y = (y_1, ..., y_n) is:

d = \sum_{i=1}^{n} |x_i - y_i|   (3)
The Chebyshev distance is also called the maximum value distance. It examines the absolute magnitude of the differences between the coordinates of a pair of objects. This distance can be used for both ordinal and quantitative variables. The formula for this distance between a point X = (x_1, ..., x_n) and a point Y = (y_1, ..., y_n) is:

d = \max_{i=1,\dots,n} |x_i - y_i|   (4)
Before computing distances between instances, if the observation intervals of the attributes differ, the values should be normalized so that all attributes lie on a common scale. Otherwise, an attribute with larger values will dominate the distance regardless of its correlation with the classification attribute. The simplest method is rescaling the range of the features to [0, 1] or [-1, 1]; selecting the target range depends on the nature of the data. The general formula is:

x' = \frac{x - \min(x)}{\max(x) - \min(x)}   (5)

where x is an original value and x' is the normalized value.
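The three distances and the normalization of Eq. (5) are straightforward to implement; a minimal NumPy sketch with toy attribute values:

import numpy as np

def min_max(X):
    # Eq. (5), applied column-wise: rescale every attribute to [0, 1].
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def distances(a, b):
    # Eqs. (2)-(4): Euclidean, Manhattan and Chebyshev distances.
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return np.sqrt((d ** 2).sum()), d.sum(), d.max()

X = min_max([[120, 30], [80, 10], [100, 20]])   # toy attribute rows
print(distances(X[0], X[1]))                    # (1.414..., 2.0, 1.0)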
2.2. Over-Sampling
Several over-sampling methods and performance measurements have been defined for different imbalanced-data classification cases. These methods can be summed up as:
- Minority over-sampling with re-sampling: methods such as random over-sampling with replacement and focused over-sampling have been used [8, 11]. In random re-sampling, minority-class examples are randomly replicated, but this can lead to overfitting.
- Synthetic Minority Over-sampling Technique (SMOTE) [16]: one of the most popular algorithms, based on the synthetic generation of new samples from the known information, and combinations of the above techniques. SMOTE takes a training sample from the minority class, introduces new synthetic examples in the feature space between that sample and one or more of its nearest neighbours, and then repeats this process for the entire training set. SMOTE has two free parameters: the number of nearest neighbours and the percentage of new training samples to create.
As discussed earlier, in this work, to minimize false negative errors, the number of minority-class samples is increased with SMOTE up to the number of majority-class samples, i.e., the count of healthy-tree samples remaining after some of the majority samples have been omitted. For this work, with X = the count of W samples and Y = the count of the N class after pre-processing, the percentage parameter should be:

percentage = X / Y   (6)
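A minimal sketch of SMOTE's interpolation step, after [16] (the function name and parameter defaults are ours; in practice a library implementation such as the one in the imbalanced-learn package can be used instead):

import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    # Each synthetic sample lies on the segment between a random minority
    # sample and one of its k nearest minority-class neighbours.
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    if n_new <= 0:
        return np.empty((0, X_min.shape[1]))
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to sample i
        j = rng.choice(np.argsort(d)[1:k + 1])         # a nearest neighbour (not self)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.vstack(out)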
3 PROPOSED SAMPLING ALGORITHM
Satellite imagery of a forest yields a large, imbalanced data set, yet some values can be eliminated or pre-processed for better classification accuracy or a better performance score. To shrink false negative errors, we suggest applying one of the under-sampling methods discussed in Chapter 2.1 together with the SMOTE over-sampling method discussed in Chapter 2.2 to the raw data set; this yields a balanced data set with the most distinct values between the classes.
Figure 1 Under-sampling method based on IQR
Figure 2 Under-sampling based on fuzzy classification
Figure 3 Under-sampling based on clustering
Figure 4 Proposed sampling algorithm
The study [10] argues that the advantages of under-sampling and over-sampling techniques are independent of the underlying classifier and can be easily implemented. Still, these methods have some limitations: under-sampling may remove significant patterns and cause loss of useful information, and over-sampling may lead to over-fitting. With this framework we build a model that reduces these limitations.
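Putting the pieces together, a hedged end-to-end sketch of the proposed pipeline, reusing the hypothetical iqr_undersample and smote helpers from the sketches above on toy one-attribute data:

import numpy as np

rng = np.random.default_rng(0)
x_healthy = rng.normal(0.7, 0.10, 4265)      # toy attribute values, class N
X_wilted = rng.normal(0.3, 0.05, (74, 1))    # toy attribute values, class W

# Step 1: under-sample class N, keeping only its most distinct values.
kept = iqr_undersample(x_healthy, n_iter=4)

# Step 2: SMOTE class W up to the kept class-N size, cf. Eq. (6).
X_wilted_full = np.vstack([X_wilted, smote(X_wilted, len(kept) - len(X_wilted))])
print(len(kept), len(X_wilted_full))         # (roughly) balanced class sizes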
4 EXPERIMENTS
The data set used in the experiments was published by [5] and is available in the UCI Repository (https://archive.ics.uci.edu/ml/datasets/Wilt) under the name Wilt. It consists of image segments generated by segmenting the pan-sharpened image. The segments contain spectral information from the QuickBird multispectral image bands and texture information from the panchromatic (Pan) image band. There are few training samples for the 'diseased trees' class (74) and many for the 'other land cover' class (4265). The experimental results of the methods are given below.

Figure 5 Distribution of the data set

320 values were selected in 4 iterations. The confusion matrices below were obtained on the training set using 10-fold cross-validation, with the Support Vector Machine algorithm used in all experiments.
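A hedged sketch of this evaluation setup with scikit-learn (stand-in data generated with make_classification; in the actual experiments X and y are the pre-processed Wilt samples):

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=640, random_state=0)  # stand-in data

# 10-fold cross-validated predictions with an SVM, then the confusion matrix.
y_pred = cross_val_predict(SVC(), X, y, cv=10)
print(confusion_matrix(y, y_pred))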
Table 2 SMOTE classification results

              Actual W   Actual N
Predicted W   4291       440
Predicted N   47         3825
Table 3 Under-sampling based on IQR: results

              Actual W   Actual N
Predicted W   74         0
Predicted N   0          316
Table 4 Proposed algorithm: results

              Actual W   Actual N
Predicted W   319        4
Predicted N   0          316
CONCLUSIONS
Class imbalance is a hot topic currently being investigated by machine learning and data mining researchers, and various approaches have been proposed for solving the imbalance problem. However, there is no general approach suitable for all imbalanced data sets, and no unifying framework, since the expectations placed on a classifier depend on the problem. This work focused on minimizing false negative errors, but other research may have a different main goal and motivation. Developing robust classifiers or hybrid algorithms can be a point of interest for future research on imbalanced data sets.
REFERENCES
1. Ganganwar, V., An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2012. 2(4): p. 42-47.
2. Rullan-Silva, C., et al., Remote monitoring of forest insect defoliation: a review. Forest Systems, 2013. 22(3): p. 377-391.
3. Evans, K., et al., Future scenarios as a tool for collaboration in forest communities. S.A.P.I.EN.S, 2008. 1(2).
4. Hall, R.J., R.S. Skakun, and E.J. Arsenault, Remotely sensed data in the mapping of insect defoliation. In: Understanding forest disturbance and spatial pattern: remote sensing and GIS approaches, 2006: p. 85-111.
5. Johnson, B.A., R. Tateishi, and N.T. Hoan, A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing, 2013. 34(20): p. 6969-6982.
6. Dang, X.T., et al., A novel over-sampling method and its application to miRNA prediction. Journal of Biomedical Science and Engineering, 2013. 6(02): p. 236.
7. Xu, D.-D., Y. Wang, and L.-J. Cai, ISMOTE algorithm for imbalanced data sets. Journal of Computer Applications, 2011. 9: p. 025.
8. Japkowicz, N. and S. Stephen, The class imbalance problem: a systematic study. Intelligent Data Analysis, 2002. 6(5): p. 429-449.
9. Eason, G., B. Noble, and I. Sneddon, On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Philosophical Transactions of the Royal Society of London, Series A, 1955. 247(935): p. 529-551.
10. Elrahman, S.M.A. and A. Abraham, A review of class imbalance problem. Journal of Network and Innovative Computing, 2013. 1(2013): p. 332-340.
11. Ling, C.X. and C. Li, Data mining for direct marketing: problems and solutions. In: KDD, 1998.
12. Kubat, M. and S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection. In: ICML, 1997, Nashville, USA.
13. Upton, G. and I. Cook, Understanding Statistics. 1996: Oxford University Press.
14. Veryha, Y., Implementation of fuzzy classification in relational databases using conventional SQL querying. Information and Software Technology, 2005. 47(5): p. 357-364.
15. Han, J., M. Kamber, and J. Pei, Data Mining: Concepts and Techniques (Southeast Asia Edition). 2006: Morgan Kaufmann.
16. Chawla, N.V., et al., SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002. 16: p. 321-357.