Variance Based Numeric Feature Selection Method for Naïve Bayesian Approach

Yilin Kang and Narendra S. Chaudhari

Abstract— We consider features having numerical values and propose the use of variance as a criterion for feature selection in the Gaussian naïve Bayesian classifier. We evaluate the criterion on the benchmark Iris dataset and compare the results with those of the original naïve Bayesian algorithm. Our results show that this feature selection method reduces both the number of features and the error rate compared with the naïve Bayesian algorithm.

Index Terms— feature selection, variance, supervised learning, naïve Bayesian algorithm

Manuscript received Aug. 23, 2008; revised Oct. 30, 2008; accepted Nov. 7, 2008. Yilin Kang is a graduate student and project officer at Nanyang Technological University, Singapore (e-mail: ylkang@ntu.edu.sg). Narendra S. Chaudhari is with Nanyang Technological University, Singapore (e-mail: ASNarendra@ntu.edu.sg).

I. INTRODUCTION

We start with supervised learning using the naïve Bayesian algorithm. To define the problem, we start with a set of feature variables F_1, F_2, ..., F_n and an additional variable V that represents the class. In the model construction phase, the supervised learning problem takes as input training data consisting of values of the feature variables, say f_1, f_2, ..., f_n, together with the corresponding class value v. The model construction problem is thus to construct a mapping

g : F_1 \times F_2 \times \cdots \times F_n \to V .

After the model construction phase, when a new set of feature values f'_1, f'_2, ..., f'_n is given, g(f'_1, f'_2, \dots, f'_n) is used as the estimated value of the class variable (the class-label estimator), and we write v' = g(f'_1, f'_2, \dots, f'_n) to denote this estimate.

Working with a subset of the n features, say {F_1, ..., F_k} (k < n), reduces the variability of the class-label estimator and can lead to better out-of-sample predictions [1]. Moreover, a large number of features entails considerable computational cost. These two observations led us to use a subset of the features as the training data and to examine the resulting learning performance in the naïve Bayesian case.

In Section II, we review the Gaussian-based naïve Bayesian approach for features having numerical values. In Section III, we give a brief survey of existing feature selection techniques for the naïve Bayesian classifier; the survey includes approaches that handle both non-numerical and numerical features. In Section IV, we present our main method, which uses variance as a criterion for feature selection. In Section V, we describe the methodology used to evaluate our method, namely ten-fold cross-validation. In Section VI, we give the results of applying our method to the Iris dataset; for this dataset, our method shows that a single feature, Petallength, is good enough for classifying the instances into the three classes Setosa, Versicolour and Virginica. Conclusions are given in Section VII.
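As a concrete illustration of this setup (ours, not part of the original paper), the following minimal sketch constructs the mapping g with a Gaussian naïve Bayesian classifier on the Iris data used later in the paper. The use of scikit-learn and its GaussianNB and load_iris helpers is our assumption; the paper does not prescribe any particular implementation.

```python
# Minimal sketch of the supervised-learning setup described above.
# Assumption: scikit-learn is available; the paper names no library.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target          # feature values f_1..f_n and class values v

# Model construction phase: learn the mapping g from the training data.
g = GaussianNB().fit(X, y)

# Prediction phase: estimate the class value v' for a new instance f'_1..f'_n.
new_instance = [[5.1, 3.5, 1.4, 0.2]]  # illustrative feature values
v_hat = g.predict(new_instance)
print(iris.target_names[v_hat[0]])
```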
II. GAUSSIAN BASED NAÏVE BAYESIAN APPROACH

A. Naïve Bayesian Approach

Naïve Bayesian (NB) is a special form of Bayesian network that has been widely used for data classification because its predictive performance is competitive with state-of-the-art classifiers. It performs well over a wide range of classification problems, including medical diagnosis, text categorization, collaborative and email filtering, and information retrieval [2]. Compared with more sophisticated schemes, naïve Bayesian classifiers often perform better [3].

Naïve Bayesian learns from the training data the conditional probability of each attribute given the class label, and uses Bayes' rule to compute the probability of each class. Given a particular instance of the attributes, the class with the highest posterior probability is predicted. It is therefore a statistical algorithm that solves the problem by defining a probability distribution P(U). Since learning the complete distribution is in general impractical, all attributes are assumed to be conditionally independent given the class variable. The naïve Bayesian method classifies by taking the value of V with the highest probability given F:

v^* = \arg\max_{v \in \{v_1,\dots,v_k\}} P(V = v \mid F = f) = \arg\max_{v \in \{v_1,\dots,v_k\}} P(V = v) \prod_{i=1}^{n} P(F_i = f_i \mid V = v)    (1)

The following example illustrates the naïve Bayesian approach. We consider a small dataset, the weather symbolic data [4], shown in Table I. It records the weather on a set of days together with a decision for each day as to whether or not to play the game. There are four attributes: outlook, temperature, humidity and windy. The problem is to learn how to classify new days as play or don't play.

Table I: The weather symbolic data
No.  Outlook   Temperature  Humidity  Windy  Play
1    sunny     hot          high      FALSE  no
2    sunny     hot          high      TRUE   no
3    overcast  hot          high      FALSE  yes
4    rainy     mild         high      FALSE  yes
5    rainy     cool         normal    FALSE  yes
6    rainy     cool         normal    TRUE   no
7    overcast  cool         normal    TRUE   yes
8    sunny     mild         high      FALSE  no
9    sunny     cool         normal    FALSE  yes
10   rainy     mild         normal    FALSE  yes
11   sunny     mild         normal    TRUE   yes
12   overcast  mild         high      TRUE   yes
13   overcast  hot          normal    FALSE  yes
14   rainy     mild         high      TRUE   no

Consider a new instance: <Outlook = sunny, Temperature = cool, Humidity = high, Windy = true>. To decide whether to play, we evaluate formula (1); in our case we compute the following two quantities and compare them. The likelihood of yes is

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(true | yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053,

and the likelihood of no is

P(no) P(sunny | no) P(cool | no) P(high | no) P(true | no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206.

Since the likelihood given no is higher, the naïve Bayesian prediction for this new instance is no.
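The arithmetic above can be reproduced directly from the counts in Table I. The short sketch below is ours, not the authors'; the dataset literals are copied from Table I and the function names are illustrative.

```python
# Reproducing the hand calculation for the new instance
# <outlook=sunny, temperature=cool, humidity=high, windy=True>.
data = [
    # (outlook, temperature, humidity, windy, play)
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]

new_instance = ("sunny", "cool", "high", True)

def likelihood(play):
    """Prior P(play) times the product of P(F_i = f_i | play), as in formula (1)."""
    rows = [r for r in data if r[4] == play]
    score = len(rows) / len(data)
    for i, value in enumerate(new_instance):
        score *= sum(r[i] == value for r in rows) / len(rows)
    return score

print(round(likelihood("yes"), 4))  # 0.0053
print(round(likelihood("no"), 4))   # 0.0206
```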
B. Gaussian naïve Bayesian approach

The naïve Bayesian algorithm described above is not sufficient on its own, because it handles only nominal attribute values. Since numeric attributes carry information about the joint distribution over features, an effective way of handling them is to model them with a distribution. Because statistical clustering methods often assume that numeric attributes follow a normal distribution, the extension of the naïve Bayesian classifier to numeric attributes adopts the same assumption [4]. The probability density function of a Gaussian distribution is

f_N(X_i; \mu_c, \sigma_c) = \frac{1}{\sqrt{2\pi}\,\sigma_c} \exp\!\left( -\frac{(X_i - \mu_c)^2}{2\sigma_c^2} \right)    (2)

where \mu_c is the mean and \sigma_c is the standard deviation of the attribute values over the training instances whose class equals c.

The following example illustrates the Gaussian naïve Bayesian approach. Consider a slightly more complex form of the weather dataset, shown in Table II. It differs from the weather symbolic dataset in two attributes, temperature and humidity, which now take numeric values. Table III summarises these two numeric features.

Table II: The weather data with some numeric attributes
No.  Outlook   Temperature  Humidity  Windy  Play
1    sunny     85.0         85.0      FALSE  no
2    sunny     80.0         90.0      TRUE   no
3    overcast  83.0         86.0      FALSE  yes
4    rainy     70.0         96.0      FALSE  yes
5    rainy     68.0         80.0      FALSE  yes
6    rainy     65.0         70.0      TRUE   no
7    overcast  64.0         65.0      TRUE   yes
8    sunny     72.0         95.0      FALSE  no
9    sunny     69.0         70.0      FALSE  yes
10   rainy     75.0         80.0      FALSE  yes
11   sunny     75.0         70.0      TRUE   yes
12   overcast  72.0         90.0      TRUE   yes
13   overcast  81.0         75.0      FALSE  yes
14   rainy     71.0         91.0      TRUE   no

Table III: The numeric features of the weather data, with summary statistics
Temperature (yes): 83, 70, 68, 64, 69, 75, 75, 72, 81    Mean 73      Std. dev. 6.2
Temperature (no):  85, 80, 65, 72, 71                    Mean 74.6    Std. dev. 7.9
Humidity (yes):    86, 96, 80, 65, 70, 80, 70, 90, 75    Mean 79.1    Std. dev. 10.2
Humidity (no):     85, 90, 70, 95, 91                    Mean 86.2    Std. dev. 9.7

Given a new instance <Outlook = sunny, Temperature = 66, Humidity = 90, Windy = true>, we decide whether to play by using the Gaussian naïve Bayesian approach. To evaluate a yes outcome when the temperature has the value 66, we plug X_i = 66, \mu_c = 73 and \sigma_c = 6.2 into formula (2):

f(temperature = 66 | yes) = \frac{1}{\sqrt{2\pi} \times 6.2} \, e^{-\frac{(66-73)^2}{2 \times 6.2^2}} \approx 0.0340 .

In the same way, the probability density of a yes outcome when humidity = 90 is

f(humidity = 90 | yes) = \frac{1}{\sqrt{2\pi} \times 10.2} \, e^{-\frac{(90-79.1)^2}{2 \times 10.2^2}} \approx 0.0221 ,

and, using the statistics of the no class, f(temperature = 66 | no) ≈ 0.0279 and f(humidity = 90 | no) ≈ 0.0381. Using these densities, the likelihoods of yes and no for the new instance are

P(yes) P(sunny | yes) f(66 | yes) f(90 | yes) P(true | yes) = 9/14 × 2/9 × 0.0340 × 0.0221 × 3/9 ≈ 0.000036,

P(no) P(sunny | no) f(66 | no) f(90 | no) P(true | no) = 5/14 × 3/5 × 0.0279 × 0.0381 × 3/5 ≈ 0.000137.

Since the likelihood of no is higher, the Gaussian naïve Bayesian prediction for this instance is no, which agrees with the result obtained earlier in the symbolic naïve Bayesian example.

Other distributions can also be used with the naïve Bayesian approach, but the Gaussian distribution remains the most popular because of its good accuracy, low memory cost, and fast model construction and evaluation. All of this, however, rests on the assumption that the underlying distribution is normal. In reality, not every attribute follows a normal distribution; the assumption is an idealisation, and it gives rise to inaccuracies in classification.
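The densities and likelihoods above follow directly from formula (2) and the statistics in Table III. The sketch below is ours; the variable names are illustrative.

```python
# Gaussian naive Bayesian arithmetic for the new instance
# <outlook=sunny, temperature=66, humidity=90, windy=True>.
import math

def gaussian_density(x, mu, sigma):
    """Formula (2): Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# (mean, std. dev.) of each numeric attribute per class, from Table III.
temperature = {"yes": (73.0, 6.2), "no": (74.6, 7.9)}
humidity    = {"yes": (79.1, 10.2), "no": (86.2, 9.7)}
# Nominal probabilities per class: prior, P(sunny | class), P(windy=true | class).
nominal = {"yes": (9 / 14, 2 / 9, 3 / 9), "no": (5 / 14, 3 / 5, 3 / 5)}

for c in ("yes", "no"):
    prior, p_sunny, p_true = nominal[c]
    like = (prior * p_sunny
            * gaussian_density(66, *temperature[c])
            * gaussian_density(90, *humidity[c])
            * p_true)
    print(c, f"{like:.6f}")   # yes: 0.000036, no: 0.000137
```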
III. EXISTING FEATURE SELECTION METHODS

Many researchers have addressed the issue of removing redundant attributes for the naïve Bayesian classifier. Pazzani explored ways to improve the Bayesian classifier by searching for dependencies among attributes, proposing and evaluating methods that join two (or more) related attributes into a new compound attribute: forward sequential selection and joining, and backward sequential elimination and joining. The results show that the domains on which the most improvement occurs are those on which the naïve Bayesian classifier is significantly less accurate than a decision tree learner [5, 6].

Boosting of the naïve Bayesian classifier [7] has been investigated by applying a series of classifiers to the problem and paying more attention to the examples misclassified by their predecessors. It induces multiple classifiers in sequential trials by adaptively changing the distribution of the training set based on the performance of previously created classifiers. At the end of each trial, instance weights are adjusted to reflect the importance of each training example for the next induction trial; the adjustment increases the weights of misclassified training examples. The change of instance weights causes the learner to concentrate on different training examples in different trials, resulting in different classifiers, which are finally combined through voting to form a composite classifier. However, boosting does not improve the accuracy of the naïve Bayesian classifier as much as expected on a set of natural domains [8].

A wrapper approach that selects only the relevant features for NB was proposed by Langley and Sage [9]. It embeds the naïve Bayesian induction scheme within an algorithm that carries out a greedy search through the space of feature subsets. The hypothesis is that this approach improves asymptotic accuracy in domains that involve correlated features without reducing the rate of learning in domains that do not.

Friedman and Goldszmidt compared naïve Bayesian to Bayesian networks. Using unrestricted Bayesian networks, however, did not generally lead to improvements in accuracy and even reduced accuracy in some domains. This prompted them to propose a compromise representation, augmented naïve Bayesian, which combines some of the ability of Bayesian networks to represent dependence with the simplicity of naïve Bayesian. Since finding the best augmented Bayesian network is an NP-hard problem, they restrict the network to a tree topology and use a result by Chow and Liu [10] to efficiently find the best tree-augmented naïve Bayesian (TAN) classifier [11].

Keogh and Pazzani take a different approach to constructing tree-augmented Bayesian networks. Rather than directly attempting to approximate the underlying probability distribution, they concentrate solely on using the same representation to improve classification accuracy, constructing augmented Bayesian networks with a novel, more efficient search algorithm called SuperParent. They compare this to Friedman and Goldszmidt's approach [11] and show that approximating the underlying probability distribution is not the best way to improve classification accuracy [12].

Lazy Bayesian Rules (LBR) is justified by a variant of Bayes' theorem that supports a weaker conditional attribute independence assumption than naïve Bayesian requires. For each test example, it builds a most appropriate rule with a local naïve Bayesian classifier as its consequent. Lazy learning techniques can be viewed as a two-stage process. First, a subset of training cases is selected for the case to be classified (the k closest cases for nearest-neighbour or instance-based learning, those matching the decision rule for LAZYDT, and so on). Next, the selected cases are used as a training set for a learning technique that ultimately forms the classifier with which the test case is classified. For selecting such subsets, LBR seeks to exclude only those cases for which there is evidence that inclusion would harm the classifier used in the second stage, a naïve Bayesian classifier. Experiments show that, on average, this algorithm obtains lower error rates significantly more often than the reverse in comparison with a naïve Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes from Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm [13].
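To make the wrapper idea surveyed above concrete, the following sketch wires a greedy forward search over feature subsets around a naïve Bayesian learner. This is our illustration in the spirit of Langley and Sage [9], not their original algorithm; the use of scikit-learn and the stopping rule are assumptions.

```python
# Illustrative wrapper-style greedy forward feature selection around naive Bayes.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    # Evaluate each candidate feature added to the current subset.
    scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y, cv=10).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best:
        break                       # stop when no candidate improves accuracy
    best = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected feature indices:", selected, "accuracy:", round(best, 4))
```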
IV. USE OF VARIANCE FOR FEATURE SELECTION

To improve the performance of the naïve Bayesian classifier, we remove redundant attributes from the dataset and keep only the features that are most informative for the classification task. To achieve this, we use the variances that are already computed within the Gaussian naïve Bayesian procedure, and filter the features according to their variance under the different class values. The reasons for choosing variance as the filter criterion are as follows. First, calculating the variance is already a step in fitting the data to a Gaussian distribution, so the criterion saves time compared with other methods. Second, the larger the variance of a feature, the more widely its values are spread; a small or large change in such a feature influences the learning result much less than a comparable change in a feature with small variance, so the output is more sensitive to features with small variance. We therefore filter the features by removing the one with the largest variance, one at a time. Figure 1 shows the algorithm for numerical feature selection.

Figure 1. Feature Selection Algorithm: Variance
1. Normalize the numerical features.
2. Fit the numerical feature values G_ij (the values of feature G_i restricted to class value v_j) to Gaussian distributions, for i = 1, ..., m and j = 1, ..., k.
3. Calculate the average variance over the k class values for each feature.
4. Filter the features one by one, in order of largest average variance.

We first normalize all the numerical features so that they lie in the same domain [0, 1]; when we later measure the variances, features on the same domain are comparable. The second step fits the numerical data to Gaussian distributions, which is exactly the step performed by the Gaussian naïve Bayesian approach. Assume G = {G_1, ..., G_m} (m ≤ n) is the set of numerical feature variables (G ⊆ F) and V is the class variable with class values {v_1, ..., v_k} (k > 1); then k × m Gaussian distributions are estimated and, correspondingly, k × m variances are computed. For each class value there are m variances corresponding to the different features, and for each feature G_i (i ∈ {1, ..., m}) there are k variances corresponding to the different class values. The third step calculates the average variance for each feature: we sum the k variances and divide by k, obtaining m average variances. Finally, according to the value of the average variance, we filter the features one by one, starting from the largest average variance.
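A compact rendering of the four steps of Figure 1 on the Iris data is sketched below. This is our sketch; the variable names, the min–max normalization and the use of scikit-learn/NumPy are assumptions, and the paper tabulates standard deviations rather than variances in Table V.

```python
# Variance-based ranking of numeric features (Figure 1), sketched on Iris.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Step 1: normalize every numerical feature to [0, 1].
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Step 2: fit a Gaussian per (feature, class) pair; only the spread is needed here.
per_class_std = np.array([X_norm[y == c].std(axis=0) for c in np.unique(y)])

# Step 3: average over the k class values, one value per feature.
avg_spread = per_class_std.mean(axis=0)

# Step 4: removal order, largest average spread first.
removal_order = np.argsort(avg_spread)[::-1]
print(avg_spread.round(4))
print([iris.feature_names[i] for i in removal_order])
```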
V. EVALUATION USING TEN-FOLD CROSS-VALIDATION

We use ten-fold cross-validation to evaluate our algorithm. When a large amount of data is available, evaluating performance is not a problem: we can build a model on a large training set and try it out on another large test set. In reality, however, data is usually in short supply. On the one hand, we would like to use as much of the data as possible for training; on the other hand, we want to use as much of it as possible for testing. Several techniques exist to deal with this issue, and the choice among them is still debated, but cross-validation has gained wide acceptance and is the evaluation method of choice in most practical limited-data situations. Ten-fold cross-validation divides the data into ten parts of roughly equal size in which each class is represented in approximately the same proportion as in the full dataset. Each part is used in turn for testing while the remaining nine tenths are used for training, so the learning procedure is executed a total of ten times on different training sets. Finally, the ten error estimates are averaged to yield an overall error estimate. Extensive tests on numerous datasets with different learning techniques have shown that ten is about the right number of folds to get the best estimate of error [4].
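The evaluation protocol just described can be combined with the variance ranking of Section IV. The sketch below (ours) reports a ten-fold cross-validated error rate after each removal; exact figures depend on the fold assignment and implementation conventions, so they need not reproduce Table VI digit for digit.

```python
# Ten-fold cross-validated error rate of Gaussian naive Bayes on Iris,
# removing one feature at a time in order of decreasing average spread.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target

X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
avg_spread = np.mean([X_norm[y == c].std(axis=0) for c in np.unique(y)], axis=0)
order = list(np.argsort(avg_spread)[::-1])     # largest average spread first

kept = list(range(X.shape[1]))
while kept:
    acc = cross_val_score(GaussianNB(), X[:, kept], y, cv=10).mean()
    print(f"{len(kept)} feature(s): {', '.join(iris.feature_names[i] for i in kept)}"
          f" -> error rate {1 - acc:.4%}")
    kept.remove(order.pop(0))                  # drop the next highest-spread feature
```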
VI. EXPERIMENTAL EVALUATION

When we applied our method of variance-based removal of attributes to the Iris dataset, we obtained the interesting result that a single feature, Petallength, is good enough for classification. In this section we give the details of the numerical experiments performed for this purpose. The properties of the dataset are summarised in Table IV [14]. After normalising the numerical attributes, we fit the data to Gaussian distributions, as shown in Table V.

Table IV: Summary of the Iris dataset
Item                       Description
Area                       life
Predicted attribute        class of iris
Attribute characteristics  real
Number of instances        150 (50 in each of three classes)
Number of attributes       4, numerical
Number of classes          3
Missing values             no
Attribute information      sepal length in cm; sepal width in cm; petal length in cm; petal width in cm; class (Iris Setosa, Iris Versicolour, Iris Virginica)

Table V: Data fitting: Gaussian (standard deviation of each normalized feature per class)
Std. dev.    Sepallength  Sepalwidth  Petallength  Petalwidth
setosa       0.0938       0.1559      0.0293       0.0477
versicolor   0.1441       0.1241      0.0790       0.0769
virginica    0.1738       0.1312      0.0940       0.1118
Average      0.137        0.137       0.067        0.079

From the "Average" row we can easily see that

Sepallength > Sepalwidth > Petalwidth > Petallength

(in fact, the average for Sepallength exceeds that for Sepalwidth only in the fourth digit after the decimal point). We therefore filter the features in this order; the results are shown in Table VI.

Table VI: Classification results
Steps                   Error rate
Original feature set    5.3333%
Remove Sepallength      5.3333%
Remove Sepalwidth       4%
Remove Petalwidth       3.3333%

From Table VI we observe that with the original features the error rate of naïve Bayesian is 5.3333%, which is already very low compared with other classification methods. After removing the feature with the largest average variance (Sepallength), the error rate remains 5.3333%, exactly the same as with the original feature set. If we continue and remove the feature with the second largest average variance (Sepalwidth), the error rate drops to 4%: even after removing two of the four features, the error rate decreased. Removing the third feature, Petalwidth, reduces the error rate further, to 3.3333%. This means that we can use a single feature, Petallength, for classification instead of four, which clearly reduces both the modelling time and the evaluation time. The reduction from four features to one leads to higher accuracy on Iris under ten-fold cross-validation.

VII. CONCLUSIONS

The traditional naïve Bayesian classifier provides a simple and effective approach to learning a classifier. When it deals with numerical features, one of the most popular options is to fit the data with a Gaussian distribution. A number of approaches have been proposed to improve this algorithm, most of them aimed at alleviating its attribute independence assumption, which is often violated in the real world. Our proposal improves NB in a different way. When the number of features is large, we run the risk of overfitting, and both the modelling and evaluation times increase. To address this, we have proposed the use of variance to filter features. The Iris dataset is very well known, and the application of our method to it reduces the feature set to a single significant feature, Petallength. Removal of features with large variance is practically useful, especially when the classification error rate decreases as well. Such a removal is expected to yield simplified models that avoid problems such as overfitting, which arise when more complex models are used. Our approach is especially useful for constructing such simplified models.

REFERENCES

[1] G. Claeskens, C. Croux, and J. Van Kerckhoven, "An information criterion for variable selection in support vector machines," Journal of Machine Learning Research, vol. 9, pp. 541-558, 2008.
[2] Y. Yang and G. I. Webb, "A comparative study of discretization methods for naïve-Bayesian classifiers," in Proceedings of PKAW 2002, 2002, pp. 159-173.
[3] X. Zhu, "Semi-supervised learning literature survey," Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005, p. 4.
[4] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005.
[5] M. Pazzani, "Searching for attribute dependencies in Bayesian classifiers," in Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, 1995, pp. 424-429.
[6] M. Pazzani, "Constructive induction of Cartesian product attributes," in Proceedings of Information, Statistics and Induction in Science, Melbourne, Australia: World Scientific, 1996, pp. 66-77.
[7] C. Elkan, "Boosting and naive Bayesian learning," Technical Report CS97-557, Department of Computer Science and Engineering, University of California, San Diego, 1997.
[8] K. M. Ting and Z. Zheng, "Improving the performance of boosting for naive Bayesian classification," in Proceedings of PAKDD-99, Beijing, China, 1999, pp. 296-305.
[9] P. Langley and S. Sage, "Induction of selective Bayesian classifiers," in Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Seattle, WA: Morgan Kaufmann, 1994, pp. 399-406.
[10] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, pp. 462-467, 1968.
[11] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2, pp. 131-163, 1997.
[12] E. Keogh and M. Pazzani, "Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches," in Proceedings of the 7th International Workshop on AI and Statistics, Ft. Lauderdale, FL, 1999, pp. 225-230.
[13] Z. Zheng and G. I. Webb, "Lazy learning of Bayesian rules," Machine Learning, vol. 41, no. 1, pp. 53-84, 2000.
[14] A. Asuncion and D. J. Newman, UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, School of Information and Computer Science, 2007.