Pre-Processing Methods for Imbalanced Data Set of Wilted Tree

Ahmet Murat Turk*1, Kemal Ozkan2
1 Department of Computer Engineering, Anadolu University, Eskisehir, Turkiye (E-mail: [email protected])
2 Department of Computer Engineering, Eskisehir Osmangazi University, Eskisehir, Turkiye (E-mail: [email protected])
Corresponding Author's e-mail: [email protected]

ABSTRACT
Machine learning algorithms build a model from training data, usually under the assumption that the numbers of instances in the different classes are nearly equal. In real-world problems, data sets are often unbalanced, and this can have a seriously negative effect on the built model. Research on imbalanced data sets focuses on over-sampling the minority class or under-sampling the majority class, and several methods have recently been proposed, such as modified support vector machines, rough-set-based minority-class-oriented rule learning methods, and cost-sensitive classifiers, that perform well on imbalanced data sets. Although these methods artificially provide a balanced training set, in some real-world problems the type of error is vital, since the cost of a false negative error is higher than that of a false positive error. For instance, when classifying satellite images to detect diseased trees, most of the trees in a forest are naturally expected to be healthy. A classification algorithm is considered effective only if this critical information is not lost. One of the reasons trees in a forest become diseased is insect epidemics. If the classification system cannot detect a wilted tree, not only does the tree dry out, but the insects carrying the disease remain and can spread it further. Therefore, the main goal of this work is to minimize false negative errors. In this work, pre-processing methods for imbalanced data sets that steer classification results toward minimizing false negative errors are discussed.

Keywords: Imbalanced dataset, oversampling, undersampling

1.
INTRODUCTION
A data set is called imbalanced if it contains many more samples from one class than from the rest of the classes. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while the other classes make up the majority. On imbalanced data sets, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria [1]. Most classification algorithms aim to minimize the error rate: the percentage of incorrect predictions of class labels. They ignore the difference between types of misclassification errors; in particular, they implicitly assume that all misclassification errors cost equally. In real-world problems such as forest management, when classifying satellite images to detect diseased trees, most of the trees in a forest are naturally expected to be healthy. A classification algorithm is considered effective only if this critical information is not lost. One of the reasons trees in a forest become diseased is insect epidemics. If the classification system cannot detect a wilted tree, not only does the tree dry out, but the insects carrying the disease remain and can spread it further. Insect infestation changes the physical and morphological characteristics of trees, but it is almost impossible to detect this alteration, such as changes in how sunlight is absorbed or reflected, with the human eye. To detect areas under insect attack, satellite images with appropriate spatial and spectral resolution have been used, since it is almost impossible to discover these plants by eye [2, 3]. The success of wilted tree detection depends on the success of the classification algorithm, and such algorithms need to be developed further [4].
In a study by [5], a hybrid intensity-hue-saturation smoothing filter-based intensity modulation (IHS-SFIM) pan-sharpening approach was used to obtain more spatially and spectrally accurate image segments; the training data of the 'Diseased tree' class was then synthetically oversampled using the Synthetic Minority Over-sampling Technique (SMOTE), and a multi-scale object-based image classification approach was applied. A data set of wilted trees obtained in this way is expected to be imbalanced. Although the study [5] reports significant results, using SMOTE may have some drawbacks in the process of generating synthetic samples, and synthetic minority class samples are not paid any attention after they are generated. Therefore, in order to address these drawbacks and improve the classification accuracy of the SMOTE method, new research has emerged [6, 7]. Machine learning algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions. When the data are imbalanced, an oversampling or undersampling method is needed to build a fair model. This paper is organized as follows: the purpose and aim of pre-processing methods, then under-sampling and over-sampling methods.

2 PRE-PROCESSING FOR CLASSIFICATION
Classification is one of the essential topics in computer science. Supervised learning algorithms perform machine learning using a training set, and the system is then expected to predict a new record's class according to its attribute values. A training set should consist of correctly identified observations. Mining highly unbalanced datasets, particularly in a cost-sensitive environment, is among the leading challenges for knowledge discovery and data mining [8, 9]. When one of the target classes of the training set is unbalanced, the learning algorithm fits itself to the class whose number of instances is much higher. Assume that there are 999 samples from one target class and 1 sample from the other.
Even if all instances are classified as the class with 999 samples, classification accuracy will be 99.9%; on the other hand, the most significant data in this set will be lost. For such situations, the cost of errors is calculated with the confusion matrix, illustrated in Table 1 below. The evaluation metrics most relevant to imbalanced classes are recall (sensitivity), specificity, precision, F-measure, and geometric mean (G-mean). Sensitivity and specificity are used to monitor the classification performance on each individual class. While precision is used in problems interested in high performance on only one class, F-measure and G-mean are used when the performance on both classes, majority and minority, needs to be high [10]. Performance is usually assessed by F-score, but in wilted tree recognition the most significant records are the actual positive ones. Even if we classify an actual negative record as positive (a false positive), this is less important than a false negative error.

Table 1 Confusion Matrix
                    Actual Positive   Actual Negative
Predicted Positive  True Positive     False Positive
Predicted Negative  False Negative    True Negative

A false negative error means failing to detect a diseased tree: not only does the tree dry out, but the insects carrying the disease remain and can spread it further. Therefore, the main goal of this work is minimizing false negative errors. When classifying imbalanced data sets, two strategies can be chosen:
Over-sampling: inserting minority class data points.
Under-sampling: removing majority class data points.
It is possible to combine over-sampling of the minority class with under-sampling of the majority class [11].

2.1. Under-Sampling
The most common pre-processing technique is random majority under-sampling (RUS). In RUS, instances of the majority class are randomly discarded from the dataset.
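Random majority under-sampling can be sketched as follows (a minimal sketch using NumPy; the function name and array layout are illustrative assumptions, not from the paper):

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Randomly discard majority-class instances until the classes are balanced."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # Keep only as many majority samples as there are minority samples.
    keep = rng.choice(maj_idx, size=min_idx.size, replace=False)
    idx = np.sort(np.concatenate([keep, min_idx]))
    return X[idx], y[idx]
```

For the 999-to-1 example above, this leaves a 1-to-1 training set, at the cost of discarding 998 majority instances.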
However, the main drawback of under-sampling is that potentially useful information contained in the ignored examples is neglected [1]. To avoid losing this useful information, methods other than random under-sampling should be implemented. There is research that focuses on picking values intelligently [12]. Since the number of diseased trees is expected to be far smaller than the number of healthy trees, such strategies should be implemented in this case. Processing the satellite images mentioned in Chapter 1 yields image attributes; basically, an image consists of three main colours (red-green-blue). A comparison of healthy and diseased trees shows that the attribute values of healthy instances are higher than those of wilted trees, whereby wilted trees look pale. As a consequence, some healthy tree instances will appear as extreme or outlier values under statistical tests. While balancing the training dataset, records which belong to the majority class but are very similar, or very close in distance, to members of the minority class should be excluded from the set. In this work, the purpose of under-sampling is to find the samples which are close to the classification curve and belong to the majority class. Omitting these values may increase false positive errors but decreases false negative errors. Here, some different methods are suggested to find these values.

2.1.1. Interquartile Range
The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the upper and lower quartiles [13]. It is commonly used as a filter to find outliers and extreme values; however, this also means these values have the highest distance from the minority class. Thus, it is possible to use such data for the majority class in the training set. Once this filter has been applied, owing to the interquartile range's structure, it can be applied again to find new samples.
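The iterative IQR filter described above can be sketched as follows (a minimal sketch on a single attribute; the function names and the conventional 1.5 multiplier are assumptions):

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of outliers by the interquartile-range rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def iqr_select(values, max_iterations=4, k=1.5):
    """Iteratively extract outliers: after each extraction the quartiles are
    recomputed on the remaining data, so new samples can be marked."""
    remaining = np.asarray(values, dtype=float)
    selected = []
    for _ in range(max_iterations):
        mask = iqr_outlier_mask(remaining, k)
        if not mask.any():
            break
        selected.append(remaining[mask])
        remaining = remaining[~mask]
    return np.concatenate(selected) if selected else np.empty(0)
```

Each pass of the loop corresponds to one application of the filter; the extracted majority-class samples are the ones farthest from the bulk of the distribution.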
After extracting the values marked as outliers, the Q1, Q2, and Q3 values change, and when we compute them again, some other samples are marked as outliers; these picked instances are added to the training set. The main goal of the interquartile range method is to define a set of similar values while extracting the dissimilar values we are looking for. Figure 1 shows the flow chart for sampling using the interquartile range.

2.1.2. Fuzzy Classification
A fuzzy classifier computes a membership value for each class instead of generating a class label. This fuzzy value in [0, 1] easily shows which samples belong to which class, and hence indicates the samples to be extracted into the training set. Since we have two different classes, instances with a fuzzy membership value of 0.5 or less may be considered noise, since they may pertain to both classes. Pursuant to our goal, the instances with the largest membership values should be selected. In this way, it is feasible to generate a training set with distinct instances. Instances can be chosen by top-N membership value, where N is the count of the minority class, or by a threshold value (e.g., membership value > 0.6). The membership μ_{A~i,comp}(x) of an object x with m linguistic variables to the given classes can be calculated by the following equation [14]:

μ_{A~i,comp}(x) = ( ∏_{j=1..m} μ_j(x) )^(1−γ) · ( 1 − ∏_{j=1..m} (1 − μ_j(x)) )^γ,  x ∈ X, 0 ≤ γ ≤ 1 (1)

where γ is a control parameter with a default value of 0.5, μ_j(x) is the membership value of object x to a particular variable, and m is the number of variables.

2.1.3. Clustering
Unlike classification, which analyzes class-labeled data objects, clustering analyzes data objects without consulting a known class label [15]. By splitting the raw data set into k pieces, adjacent values can be regrouped with respect to their values. The clusters which do not comprise any minority class member (class W) are extracted into the training set.

2.1.4.
Computing Distance Between Instances
As an essential method in data analysis, when choosing more distinct values, distance measurements between data points can be calculated. Euclidean, Manhattan, and Chebyshev distances are three of the basic measurements. The Euclidean distance function measures the 'as-the-crow-flies' distance. The formula for this distance between a point X = (x_1, x_2, ..., x_n) and a point Y = (y_1, y_2, ..., y_n) is:

d = sqrt( Σ_{i=1..n} (x_i − y_i)^2 ) (2)

The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the differences of their corresponding components:

d = Σ_{i=1..n} |x_i − y_i| (3)

The Chebyshev distance is also called the maximum value distance. It examines the absolute magnitude of the differences between the coordinates of a pair of objects. This distance can be used for both ordinal and quantitative variables:

d = max_{i=1..n} |x_i − y_i| (4)

Before computing distances between instances, if the observation intervals differ between attributes, the values should be normalized so that all attributes lie on a fixed scale. Otherwise, an attribute whose values are greater than the others will affect the distance regardless of its correlation with the classification attribute. The simplest method is rescaling the range of features to [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula is:

x′ = (x − min(x)) / (max(x) − min(x)) (5)

where x is an original value and x′ is the normalized value.

2.2. Over-Sampling
Several over-sampling methods and performance measurements have been defined for different imbalanced data classification cases.
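The distance measures and the min-max normalization of Section 2.1.4, equations (2)-(5), can be sketched as follows (a minimal sketch using NumPy):

```python
import numpy as np

def euclidean(x, y):
    """Equation (2): straight-line ('as-the-crow-flies') distance."""
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """Equation (3): grid-path distance, sum of component differences."""
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    """Equation (4): maximum absolute coordinate difference."""
    return np.max(np.abs(x - y))

def min_max_normalize(column):
    """Equation (5): rescale a feature column to the range [0, 1]."""
    column = np.asarray(column, dtype=float)
    return (column - column.min()) / (column.max() - column.min())
```

Normalizing each attribute column before applying a distance measure keeps a large-ranged attribute from dominating the distance.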
It is possible to sum up these methods as follows. Minority over-sampling with re-sampling: methods such as random oversampling with replacement and focused oversampling have been used [8, 11]. In random resampling, minority class examples are randomly replicated, but this can lead to overfitting. The Synthetic Minority Over-sampling Technique (SMOTE) [16] is one of the popular algorithms; it synthetically generates new samples based on the known information, and can be combined with the above techniques. SMOTE takes a training sample in the minority class and introduces new synthetic examples in the feature space between that training sample and one or more of its nearest neighbours in feature space, and then repeats this process for the entire training data set. SMOTE has two free parameters: the number of nearest neighbours and the percentage of new training samples to create. As discussed before, in this work, to minimize false negative errors, the number of minority class samples is increased with SMOTE to the number of majority class samples, i.e., the count of healthy tree samples remaining after omitting some samples from the majority class. For this work, the percentage parameter should be:

percentage = x / y (6)

where x is the count of class W samples and y is the count of class N samples after pre-processing.

3 PROPOSED SAMPLING ALGORITHM
Satellite imagery of a forest provides an imbalanced, large data set, yet we can eliminate or pre-process some values for better classification accuracy or a better score on the performance measure. To reduce false negative errors, it is suggested that one of the under-sampling methods discussed in Chapter 2.1 and the over-sampling method SMOTE discussed in Chapter 2.2 be applied to the raw data set; thus we can obtain a balanced data set with the most distinct values between the classes.
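The SMOTE interpolation step used in the proposed pipeline can be sketched from scratch as follows (a minimal sketch, not the reference implementation; the function name and the brute-force distance computation are assumptions):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority-class neighbours (SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                    # pick a minority sample
        nb = X_min[rng.choice(neighbours[i])]  # one of its nearest neighbours
        gap = rng.random()                     # interpolation factor in [0, 1]
        synthetic[j] = X_min[i] + gap * (nb - X_min[i])
    return synthetic
```

Following equation (6), n_new would be chosen so that the minority class W reaches the size of the majority class N remaining after under-sampling.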
Figure 1 Under-sampling method based on IQR. Figure 2 Under-sampling based on fuzzy classification. Figure 3 Under-sampling based on clustering. Figure 4 Proposed sampling algorithm.

The study [10] argues that the advantages of under-sampling and over-sampling techniques are independent of the underlying classifier and can be easily implemented. But these methods still have some limitations: under-sampling may remove significant patterns and cause loss of useful information, and over-sampling may lead to over-fitting. With this framework, we build a model which reduces these limitations.

4 EXPERIMENTS
The data set used in the experiments was published by [5] and is available in the UCI Repository under the name Wilt (https://archive.ics.uci.edu/ml/datasets/Wilt). It consists of image segments generated by segmenting the pan-sharpened image. The segments contain spectral information from the Quickbird multispectral image bands and texture information from the panchromatic (Pan) image band. There are few training samples for the 'diseased trees' class (74) and many for the 'other land cover' class (4265). Figure 5 shows the distribution of the data set; 320 values were selected in 4 iterations. Confusion matrices were obtained on the training set using cross-validation with 10 folds. The Support Vector Machine algorithm is used for the experiments.

Table 2 SMOTE classification results
              Actual W   Actual N
Predicted W   4291       440
Predicted N   47         3825

Table 3 Under-sampling based on IQR results
              Actual W   Actual N
Predicted W   74         0
Predicted N   0          316

Table 4 Proposed Algorithm
              Actual W   Actual N
Predicted W   319        4
Predicted N   0          316

CONCLUSIONS
Class imbalance is a hot topic being investigated recently by machine learning and data mining researchers. Various approaches have been proposed for solving the imbalance problem. However, there is no general approach proper for all imbalanced data sets and no unifying framework, since expectations from a classifier can depend on the problem.
In this work, the focus is on minimizing false negative errors, but in other research the main goal may differ. Developing classifiers which are robust, or hybrid algorithms, can be a point of interest for future research on imbalanced datasets.

REFERENCES
1. Ganganwar, V., An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2012. 2(4): p. 42-47.
2. Rullan-Silva, C., et al., Remote Monitoring of Forest Insect Defoliation - A Review. Forest Systems, 2013. 22(3): p. 377-391.
3. Evans, K., W. de Jong, and P. Cronkleton, Future Scenarios as a Tool for Collaboration in Forest Communities. S.A.P.I.EN.S, 2008. 1(2).
4. Hall, R.J., R.S. Skakun, and E.J. Arsenault, Remotely sensed data in the mapping of insect defoliation. Understanding forest disturbance and spatial pattern: Remote sensing and GIS approaches, 2006: p. 85-111.
5. Johnson, B.A., R. Tateishi, and N.T. Hoan, A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing, 2013. 34(20): p. 6969-6982.
6. Dang, X.T., et al., A novel over-sampling method and its application to miRNA prediction. Journal of Biomedical Science and Engineering, 2013. 6(02): p. 236.
7. Xu, D.-D., Y. Wang, and L.-J. Cai, ISMOTE algorithm for imbalanced data sets. Journal of Computer Applications, 2011. 9: p. 025.
8. Japkowicz, N. and S. Stephen, The class imbalance problem: A systematic study. Intelligent Data Analysis, 2002. 6(5): p. 429-449.
9. Eason, G., B. Noble, and I. Sneddon, On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Philosophical Transactions of the Royal Society of London, Series A, Mathematical and Physical Sciences, 1955. 247(935): p. 529-551.
10. Elrahman, S.M.A. and A. Abraham, A Review of Class Imbalance Problem. Journal of Network and Innovative Computing, 2013. 1(2013): p. 332-340.
11. Ling, C.X. and C. Li, Data Mining for Direct Marketing: Problems and Solutions. In KDD, 1998.
12. Kubat, M. and S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection. In ICML, 1997. Nashville, USA.
13. Upton, G. and I. Cook, Understanding Statistics. 1996: Oxford University Press.
14. Veryha, Y., Implementation of fuzzy classification in relational databases using conventional SQL querying. Information and Software Technology, 2005. 47(5): p. 357-364.
15. Han, J., M. Kamber, and J. Pei, Data Mining, Southeast Asia Edition: Concepts and Techniques. 2006: Morgan Kaufmann.
16. Chawla, N.V., et al., SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002. 16(1): p. 321-357.