Variance Based Numeric Feature Selection
Method for Naïve Bayesian Approach
Yilin Kang and Narendra S. Chaudhari

Abstract— We consider features having numerical values. We propose the use of variance as a criterion for feature selection in the Gaussian naïve Bayesian classifier. We evaluate our criterion by applying it to the benchmark Iris dataset and comparing the result with the original naïve Bayesian algorithm. Our results show that this feature selection method reduces both the number of features and the error rate compared with the naïve Bayesian algorithm.
Index Terms—feature selection, variance, supervised learning,
Naïve Bayesian algorithm
I. INTRODUCTION
We start with supervised learning using the naïve Bayesian algorithm. To define the problem, we start with a set of variables, say F1, F2, ..., Fn, and an additional set, say V. The sets F1, F2, ..., Fn are used to represent the features, and the additional set V is used to represent the class. In the model construction phase, the supervised learning problem involves providing, as input, training data consisting of the values of the feature variables, say f1, f2, ..., fn, and the corresponding class value (represented by the value of our class variable, say v). Thus, the model construction problem is formulated as follows: construct a mapping g : F1 × F2 × ... × Fn → V such that g(f1, f2, ..., fn) minimizes the variability between this estimated value and the corresponding class value, v, as given in the training data.
After the model construction phase, when the next set of feature values, say f1', f2', ..., fn', is given, g(f1', f2', ..., fn') is used to represent the estimated value of the class variable (the class-label estimator), and we may use the notation v̂ to denote this value. Thus, v̂ = g(f1', f2', ..., fn'). Working with a subset of the n features, say {F1, ..., Fk} (k < n), reduces the variability of the class-label estimator and would lead to better out-of-sample predictions [1]. Secondly, a large number of features requires considerable computational cost. These two standpoints led us to consider using a subset of the features as the training data and to examine the resulting learning performance in the naïve Bayesian case.
In section II, we review the Gaussian based naïve Bayesian
approach for features having numerical values. In section III,
we give a brief survey of existing feature selection techniques
for the naïve Bayesian classifier. In our survey, we have included naïve Bayesian classifier based approaches that handle both non-numerical as well as numerical features. In section IV,
we give our main method that uses variance as a criterion for
feature selection. In section V, we give a note on the
methodology we used for evaluating our method. We use
ten-fold cross-validation for this purpose. In section VI, we
give results of applying our method on Iris dataset. For Iris
dataset, our method shows that only one feature, namely
Petallength is good enough for classifying it into three classes
namely Setosa, Versicolour, Virginica. Conclusions are given
in section VII.
Manuscript received Aug. 23, 2008; revised Oct. 30, 2008; accepted Nov. 7, 2008.
Yilin Kang is a graduate student as well as a project officer at Nanyang Technological University, Singapore (e-mail: ylkang@ntu.edu.sg).
Narendra S. Chaudhari is with Nanyang Technological University, Singapore (e-mail: ASNarendra@ntu.edu.sg).

II. GAUSSIAN BASED NAÏVE BAYESIAN APPROACH
A. Naïve Bayesian Approach
Naïve Bayesian (NB) is a special form of Bayesian Network
that has widely been used for data classification because of its
competitive predictive performance with state-of-the-art
classifiers. It performs well over a wide range of classification
problems including medical diagnosis, text categorization,
collaborative and email filtering and information retrieval [2].
Compared with more sophisticated schemes, naïve Bayesian
classifiers often perform better [3].
Naïve Bayesian learns, from the training data, the conditional probability of each attribute given the class label. It
uses Bayes rule to compute the probability of the classes. Given
the particular instance of the attributes, prediction of the class
is done by identifying the class with the highest posterior
probability. Therefore, it is a statistical algorithm which solves
the problem by defining a probability distribution P (U ) . Since
learning the complete distribution is in general impractical, the
assumption is made that all attributes are conditionally
independent given the class variable. The naive Bayesian
method classifies by taking the value of V with the highest probability given F, as expressed in the following formulation:

\hat{v} = \arg\max_{v \in \{v_1, \ldots, v_k\}} P(V = v \mid F = f) = \arg\max_{v \in \{v_1, \ldots, v_k\}} P(V = v) \prod_{i=1}^{n} P(F_i = f_i \mid V = v)    (1)
The following example illustrates the naïve Bayesian approach.
We consider a small dataset, namely weather symbolic [4]. It
is shown in Table I. It presents a set of days’ weather together
with a decision for each as to whether to play the game or not.
There are four attributes in this case: outlook, temperature,
humidity and windy. The problem is to learn how to classify
new days as play or don’t play.
Table I: The weather symbolic data

No.  Outlook   Temperature  Humidity  Windy  Play
1    sunny     hot          high      FALSE  no
2    sunny     hot          high      TRUE   no
3    overcast  hot          high      FALSE  yes
4    rainy     mild         high      FALSE  yes
5    rainy     cool         normal    FALSE  yes
6    rainy     cool         normal    TRUE   no
7    overcast  cool         normal    TRUE   yes
8    sunny     mild         high      FALSE  no
9    sunny     cool         normal    FALSE  yes
10   rainy     mild         normal    FALSE  yes
11   sunny     mild         normal    TRUE   yes
12   overcast  mild         high      TRUE   yes
13   overcast  hot          normal    FALSE  yes
14   rainy     mild         high      TRUE   no
Considering a new instance:
<Outlook = sunny, Temperature = cool, Humidity = high, Windy = true>
To figure out whether to play or not using naïve Bayesian, we compute formula (1). In our case, we compute the following two likelihoods and compare the results.
Likelihood of yes:
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(true | yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053
Likelihood of no:
P(no) P(sunny | no) P(cool | no) P(high | no) P(true | no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206
Obviously, the higher of the two is the likelihood given no. Hence, the learning result for the above new instance using naïve Bayesian is no.
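To make the computation above concrete, the following short Python sketch reproduces the two likelihoods for the instance <sunny, cool, high, true> from the counts in Table I. It is an illustrative sketch only; the data encoding and the function name are our own and are not part of the original paper.

# Weather symbolic data from Table I: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

def likelihood(instance, label):
    """P(label) times the product of P(attribute value | label), as in formula (1),
    with all probabilities estimated by relative frequencies."""
    rows = [r for r in data if r[-1] == label]
    prob = len(rows) / len(data)                  # class prior, e.g. 9/14 for "yes"
    for i, value in enumerate(instance):
        prob *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prob

new_instance = ("sunny", "cool", "high", True)
for label in ("yes", "no"):
    print(label, round(likelihood(new_instance, label), 4))
# Prints approximately 0.0053 for "yes" and 0.0206 for "no", so the prediction is "no".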
B. Gaussian Naïve Bayesian Approach
However, this naïve Bayesian algorithm is not enough: it can only deal with labeled data. As unlabeled data contain information about the joint distribution over the features, an effective way of classifying unlabeled data is to use some distribution to preprocess the data. Since statistical clustering methods often assume that numeric attributes have a normal distribution, the extension of the naïve Bayesian classifier to handle numeric attributes adopts the same assumption [4]. Thus the probability density function for a Gaussian distribution is given by the following expression:

f = N(X_i; \mu_c, \sigma_c) = \frac{1}{\sqrt{2\pi}\,\sigma_c} e^{-\frac{(X_i - \mu_c)^2}{2\sigma_c^2}}    (2)

where μ_c is the mean and σ_c is the standard deviation of the attribute values from the training instances whose class equals c. The following example illustrates the Gaussian naïve Bayesian approach.
Consider a slightly more complex form of the weather dataset, shown in Table II. It differs from the weather symbolic dataset in two attributes, temperature and humidity, both of which have numeric values. Table III gives a summary of these two numeric features.
Table II: The weather data with some numeric attributes

No.  Outlook   Temperature  Humidity  Windy  Play
1    sunny     85.0         85.0      FALSE  no
2    sunny     80.0         90.0      TRUE   no
3    overcast  83.0         86.0      FALSE  yes
4    rainy     70.0         96.0      FALSE  yes
5    rainy     68.0         80.0      FALSE  yes
6    rainy     65.0         70.0      TRUE   no
7    overcast  64.0         65.0      TRUE   yes
8    sunny     72.0         95.0      FALSE  no
9    sunny     69.0         70.0      FALSE  yes
10   rainy     75.0         80.0      FALSE  yes
11   sunny     75.0         70.0      TRUE   yes
12   overcast  72.0         90.0      TRUE   yes
13   overcast  81.0         75.0      FALSE  yes
14   rainy     71.0         91.0      TRUE   no
Table III: Summary of the numeric features of the weather data

Temperature (yes): 83, 70, 68, 64, 69, 75, 75, 72, 81; mean 73, std. dev. 6.2
Temperature (no):  85, 80, 65, 72, 71; mean 74.6, std. dev. 7.9
Humidity (yes):    86, 96, 80, 65, 70, 80, 70, 90, 75; mean 79.1, std. dev. 10.2
Humidity (no):     85, 90, 70, 95, 91; mean 86.2, std. dev. 9.7
Given a new instance:
<Outlook = sunny, Temperature = 66, Humidity = 90, Windy = true>
We figure out whether to play or not by using the Gaussian naïve Bayesian approach. If we are considering a yes outcome when temperature has a value of 66, we just need to plug X_i = 66, μ_c = 73, and σ_c = 6.2 into formula (2). The value of the probability density function is

f(temperature = 66 | yes) = 1/(√(2π) · 6.2) · e^(−(66 − 73)² / (2 · 6.2²)) ≈ 0.0340

By the same method, the probability density function of a yes outcome when humidity = 90 is

f(humidity = 90 | yes) = 1/(√(2π) · 10.2) · e^(−(90 − 79.1)² / (2 · 10.2²)) ≈ 0.0221

In the same way, we can also compute f(temperature = 66 | no) ≈ 0.0279 and f(humidity = 90 | no) ≈ 0.0381.
Using these probabilities for the new instance, the likelihoods of yes and no are computed as follows:

P(yes) P(sunny | yes) f(66 | yes) f(90 | yes) P(true | yes) = 9/14 × 2/9 × 0.0340 × 0.0221 × 3/9 ≈ 0.000036

P(no) P(sunny | no) f(66 | no) f(90 | no) P(true | no) = 5/14 × 3/5 × 0.0279 × 0.0381 × 3/5 ≈ 0.000137

Since the likelihood of no is higher, the learning result for the above instance using Gaussian naïve Bayesian is no.
We observe that this result is the same as what we calculated earlier in the example of the naïve Bayesian approach.
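As a sketch of how formula (2) enters this computation, the densities and likelihoods for the instance <sunny, 66, 90, true> can be reproduced in Python as follows. The helper function, dictionary names, and the reuse of the Table I and Table III statistics are our own illustrative choices, not code from the paper.

import math

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution, as in formula (2)."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Means and standard deviations of the numeric attributes, from Table III.
temp  = {"yes": (73.0, 6.2),  "no": (74.6, 7.9)}
humid = {"yes": (79.1, 10.2), "no": (86.2, 9.7)}

# Nominal-attribute probabilities and class priors, from the counts in Table I.
p_outlook_sunny = {"yes": 2 / 9, "no": 3 / 5}
p_windy_true    = {"yes": 3 / 9, "no": 3 / 5}
prior           = {"yes": 9 / 14, "no": 5 / 14}

for c in ("yes", "no"):
    like = (prior[c]
            * p_outlook_sunny[c]
            * gaussian_pdf(66, *temp[c])      # f(temperature = 66 | c)
            * gaussian_pdf(90, *humid[c])     # f(humidity = 90 | c)
            * p_windy_true[c])
    print(c, like)
# Yields approximately 0.000036 for "yes" and 0.000137 for "no": the prediction is again "no".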
In fact, other distributions can also be used with the naïve Bayesian approach. However, the Gaussian distribution remains the most popular one for naïve Bayesian because of its good accuracy, low memory cost, and speed in both the modeling and evaluation phases. All of this rests on the assumption that the underlying distribution is normal. In reality, not all cases follow this distribution; the assumption is an idealization, and this gives rise to inaccuracies in classification.
III. EXISTING FEATURE SELECTION METHODS
Many researchers have emphasized the issue of removing
redundant attributes for the Naïve Bayesian classifier.
Pazzani explored ways to improve the Bayesian classifier by searching for dependencies among attributes. He proposed and evaluated two methods of joining two (or more) related attributes into a new compound attribute: forward sequential selection and joining, and backward sequential elimination and joining. The results show that the domains on which the most improvement occurs are those on which the naïve Bayesian classifier is significantly less accurate than a decision tree learner [5, 6].
Boosting of the naïve Bayesian classifier [7] has been investigated by applying a series of classifiers to the problem, each paying more attention to the examples misclassified by its predecessor. It induces multiple classifiers in sequential trials
by adaptively changing the distribution of the training set based
on the performance of previously created classifiers. At the end
of each trial, instance weights are adjusted to reflect the
importance of each training example for the next induction
trial. The objective of the adjustment is to increase the weights
of misclassified training examples. Change of instance weights
causes the learner to concentrate on different training examples
in different trials, thus resulting in different classifiers. Finally,
the individual classifiers are combined through voting to form
a composite classifier. However, boosting does not improve the accuracy of the naive Bayesian classifier as much as might be expected in a set of natural domains [8].
A wrapper approach for feature subset selection, which selects only the relevant features for NB, was proposed by Langley and Sage [9]. It embeds the naive Bayesian induction scheme within an algorithm that carries out a greedy search through the space of features. They hypothesized that this approach would improve asymptotic accuracy in domains that involve correlated features without reducing the rate of learning in those that do not.
Friedman and Goldszmidt compared naïve Bayesian to
Bayesian networks. However, using unrestricted Bayesian
networks did not generally lead to improvements in accuracy
and even reduced accuracy in some domains. This prompted
them to propose a compromise representation that combines some of the Bayesian network's ability to represent dependence with the simplicity of naïve Bayesian. This representation is called the augmented naïve Bayesian network. Finding the best augmented
Bayesian network is an NP-hard problem. Friedman and
Goldszmidt deal with this difficulty by restricting the network
to be a tree topology and utilizing a result by Chow and Liu [10]
to efficiently find the best tree-augmented naïve Bayesian
(TAN) [11].
Keogh & Pazzani take a different approach to constructing
tree-augmented Bayesian networks. Rather than directly
attempting to approximate the underlying probability
distribution, they concentrate solely on using the same
representation to improve classification accuracy. Their method for constructing augmented Bayesian networks uses a novel, more efficient search algorithm called SuperParent. They
compare these to Friedman and Goldszmidt’s approach [11]
and show that approximating the underlying probability
distribution is not the best way to improve classification
accuracy [12].
Lazy Bayesian Rules (LBR) can be justified by a variant of
Bayes theorem which supports a weaker conditional attribute
independence assumption than is required by naive Bayesian.
For each test example, it builds a most appropriate rule with a
local naive Bayesian classifier as its consequent. Lazy learning
techniques can be viewed in terms of a two stage process. First,
a subset of training cases is selected for a case to be classified
(the k closest for nearest neighbor or instance-based learning,
those matching the decision rule for LAZYDT, and so on).
Next, these selected cases are used as a training set for a
learning technique that ultimately forms a classifier with which
the test case is classified. For selecting such subsets, they
proposed an approach that seeks to exclude from the subset
only those cases for which there is evidence that inclusion will
be harmful to the classifier used in the second stage, a naive
Bayesian classifier. Experiments show that, on average, this
algorithm obtains lower error rates significantly more often
than the reverse in comparison to a naive Bayesian classifier,
C4.5, a Bayesian tree learning algorithm, a constructive
Bayesian classifier that eliminates attributes and constructs
new attributes using Cartesian products of existing nominal
attributes, and a lazy decision tree learning algorithm [13].
IV. USE OF VARIANCE FOR FEATURE SELECTION
To improve the performance of the naive Bayesian classifier, we introduce an approach that removes redundant attributes from the data set. We choose only the features that are most informative for the classification task. To achieve this, we use the variance that is already calculated in the Gaussian naïve Bayesian fit.
We filter the features by the variance corresponding to each class value. The reasons why we choose variance as the filter criterion are:
 Calculating the variance is already a step in fitting the data to a Gaussian distribution, so it is time saving compared with other methods.
 The larger the variance, the more widely the data are spread. This means that a small or large change in such a feature does not influence the learning result much, compared with data of small variance. Hence, the output is more sensitive to features with small variance.
We filter the features by removing the one with the largest variance, one at a time. Figure 1 shows the algorithm for numerical feature selection.
1. Normalize the numerical features.
2. Fit the numerical features Gij, which correspond to the different class values, to Gaussian distributions, where Gij ⊆ Fi, i = 1, ..., m and j = 1, ..., k.
3. Calculate the average variance over the k class values for each feature.
4. Filter the features one by one, in order of largest variance.

Figure 1. Feature Selection Algorithm: Variance
We first normalize all the numerical features. This is because we need to make sure all the numerical features lie in the same domain of [0, 1]; later, when we measure the variance, features in the same domain are comparable. The second step fits the numerical data to Gaussian distributions; this is the step accomplished by the Gaussian naïve Bayesian approach. Assume G = {G1, ..., Gm} (m ≤ n) is the numerical feature variable set (G ⊆ F), and V is the class variable with class values {v1, ..., vk} (k > 1). Then k·m Gaussian distributions are estimated and, correspondingly, k·m variances are computed. For each class value, there are m variances corresponding to the different features; for each feature variable Gi (i = 1, ..., m), there are k variances corresponding to the different class values. The third step is to calculate the average variance for each feature: we simply sum the k variances and divide by k, obtaining m average variances. According to the value of the average variance, we filter the features one by one, starting from the largest value.
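A minimal Python sketch of the four steps of Figure 1 is given below. It assumes NumPy, and the function and variable names are ours; the paper does not prescribe a particular implementation. The sketch normalizes each numeric feature to [0, 1], computes the per-class variances, and returns the features ordered from largest to smallest average variance, i.e. in the order in which they would be removed.

import numpy as np

def rank_features_by_average_variance(X, y):
    """X: array of shape (n_samples, m) with numeric features; y: class labels.
    Returns feature indices ordered from largest to smallest average per-class variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Step 1: normalize every feature to the [0, 1] domain.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Step 2: variance of each feature within each class (k * m values); these are the
    # sigma^2 parameters that the Gaussian fit of formula (2) estimates.
    classes = np.unique(y)
    per_class_var = np.array([X[y == c].var(axis=0) for c in classes])   # shape (k, m)
    # Step 3: average the k variances for each feature.
    avg_var = per_class_var.mean(axis=0)                                 # shape (m,)
    # Step 4: filtering proceeds from the feature with the largest average variance.
    return np.argsort(avg_var)[::-1]

# Example call (hypothetical arrays): order = rank_features_by_average_variance(iris_X, iris_y)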
V. EVALUATION USING TEN FOLD CROSS VALIDATION
We use ten-fold cross-validation as the method for evaluating
our algorithm.
When a large amount of data is available, evaluating the performance is not a problem, because we can build a model on a large training set and try it out on another large test set. In reality, however, data is always in short supply. On the one hand, we would like to use as much of the data as possible for training; on the other hand, we want to use as much of it as possible for testing. Several techniques exist to deal with this issue, and the choice among them is still debated. One of these techniques is cross-validation, which has gained acceptance and is the evaluation method of choice in most practical limited-data situations.
Ten-fold cross-validation divides the training data into ten parts of roughly equal size, in which each class is represented in approximately the same proportions as in the full dataset. Each part is used in turn for testing while the remaining nine-tenths are used to train the learning scheme. Thus the learning procedure is executed a total of ten times on different training sets. Finally, the ten error estimates are averaged to yield an overall error estimate.
Extensive tests on numerous datasets, with different learning
techniques, have shown that ten is about the right number of
folds to get the best estimate of error [4].
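As an illustration of the evaluation procedure described above, the sketch below runs ten-fold cross-validation of a Gaussian naïve Bayesian classifier on the Iris data. The use of scikit-learn's StratifiedKFold and GaussianNB is our own choice of tooling for the sketch; the paper does not state which implementation was used.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

errors = []
# Stratified folds keep each class in roughly the same proportion as in the full dataset.
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

# The ten error estimates are averaged to yield an overall error estimate.
print("mean 10-fold error rate: %.4f%%" % (100 * np.mean(errors)))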
Table V: Data Fitting: Gaussian

Std. dev.    Sepallength  Sepalwidth  Petallength  Petalwidth
setosa       0.0938       0.1559      0.0293       0.0477
versicolor   0.1441       0.1241      0.0790       0.0769
virginica    0.1738       0.1312      0.0940       0.1118
Average      0.137        0.137       0.067        0.079
Table VI: Classification result

Steps                 Error rate
Original result       5.3333%
Remove Sepallength    5.3333%
Remove Sepalwidth     4%
Remove Petalwidth     3.3333%
VI. EXPERIMENTAL EVALUATION
When we applied our method of variance based removal of attributes to the Iris dataset, we obtained an interesting result: only one feature, namely Petallength, is good enough for classification. In this section, we give details of the numerical experiments we performed for this purpose. The properties of the dataset are summarized in Table IV [14].
After normalizing the numerical attributes, we fit these data to Gaussian distributions; the result is shown in Table V. From the "Average" row, we can easily read off the following ordering:
Sepallength > Sepalwidth > Petalwidth > Petallength
(As a matter of fact, Sepallength is only a bit larger than Sepalwidth, differing in the fourth digit after the decimal point.) We filter the features in the sequence above. The result is shown in Table VI.
Table IV: Summary of Iris

Item                    Description
Area                    life
Predicted attribute     class of iris
Attribute character     real
Missing values          no
Number of instances     150 (50 in each of three classes)
Number of attributes    4 numerical
Number of classes       3
Attribute information   sepal length in cm; sepal width in cm; petal length in cm;
                        petal width in cm; class: Iris Setosa, Iris Versicolour, Iris Virginica
From Table VI, we observe that if we use the original features, the error rate for naïve Bayesian is 5.3333%, which is already very low compared with other classification methods. After we remove the feature with the largest variance (Sepallength), the error rate remains 5.3333%, exactly the same as with the original feature set. If we continue and remove the feature with the second largest variance (Sepalwidth), the error rate drops to 4%: even after removing two of the four features, the error rate decreased. If we then remove the third feature, Petalwidth, the error rate decreases further to 3.3333%. This means we can use only one feature, Petallength, for classification rather than four, which definitely reduces the modeling and evaluation time. The reduction of features from four to one leads to a higher accuracy for Iris under 10-fold cross-validation.
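The experiment of Table VI can be approximated with the script below. This is a sketch under our own assumptions: it uses scikit-learn's GaussianNB and 10-fold cross-validation, ranks features by average per-class variance as in Figure 1, and removes them one at a time. The exact error rates depend on the fold assignment, so the numbers need not match Table VI digit for digit.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y, names = iris.data, iris.target, list(iris.feature_names)

# Normalize features to [0, 1] and rank them by average per-class variance (Figure 1).
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
avg_var = np.mean([Xn[y == c].var(axis=0) for c in np.unique(y)], axis=0)
removal_order = list(np.argsort(avg_var)[::-1])    # largest average variance first

def error_rate(cols):
    """Mean 10-fold cross-validation error of Gaussian NB on the selected columns."""
    scores = cross_val_score(GaussianNB(), Xn[:, cols], y, cv=10)
    return 100 * (1 - scores.mean())

kept = list(range(X.shape[1]))
print("all features: %.4f%% error" % error_rate(kept))
for f in removal_order[:-1]:                       # stop when a single feature remains
    kept.remove(f)
    print("after removing %s: %.4f%% error" % (names[f], error_rate(kept)))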
VII. CONCLUSIONS
The traditional naive Bayesian classifier provides a simple and effective approach to classification. When it deals with numerical features, one of the most popular approaches is to use a Gaussian distribution to fit the data. A number of approaches have been proposed to improve this algorithm. Most of them aim to alleviate its attribute independence assumption, since this assumption is often violated in the real world. Our proposal, however, improves NB in a different way.
When the number of features is large, we run the risk of overfitting, and both the modeling time and the evaluation time increase. To address this, in this paper we have proposed the use of variance to filter features.
The Iris dataset is very well known, and applying our method to it reduces the feature set to only one significant feature, namely Petallength.
Removal of features with large variance, especially when the classification error rate is reduced at the same time, is practically useful. Such a removal is expected to result in a simplified model, avoiding problems such as over-fitting that arise when more complex models are used. Our approach is especially useful for constructing such simplified models.
REFERENCES
[1] Gerda Claeskens, Christophe Croux, and Johan Van Kerckhoven, "An information criterion for variable selection in support vector machines," Journal of Machine Learning Research, vol. 9, 2008, pp. 541–558.
[2] Y. Yang and G. I. Webb, "A comparative study of discretization methods for naïve-Bayesian classifiers," in Proceedings of PKAW 2002, 2002, pp. 159–173.
[3] Xiaojin Zhu, "Semi-supervised learning literature survey," Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005, p. 4.
[4] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Publishers, 2005.
[5] M. Pazzani, "Searching for attribute dependencies in Bayesian classifiers," in Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, 1995, pp. 424–429.
[6] M. Pazzani, "Constructive induction of Cartesian product attributes," in Proceedings of Information, Statistics and Induction in Science, Melbourne, Australia: World Scientific, 1996, pp. 66–77.
[7] C. Elkan, "Boosting and naive Bayesian learning," Technical Report No. CS97-557, Department of Computer Science and Engineering, University of California, San Diego, 1997.
[8] K. Ming and Z. Zheng, "Improving the performance of boosting for naive Bayesian classification," in Proceedings of PAKDD-99, Beijing, China, 1999, pp. 296–305.
[9] P. Langley and S. Sage, "Induction of selective Bayesian classifiers," in Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Seattle, WA: Morgan Kaufmann, 1994, pp. 399–406.
[10] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, 1968, pp. 462–467.
[11] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2, 1997, pp. 131–163.
[12] E. Keogh and M. Pazzani, "Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches," in Proceedings of the Seventh International Workshop on AI and Statistics, Ft. Lauderdale, FL, 1999, pp. 225–230.
[13] Z. Zheng and G. I. Webb, "Lazy learning of Bayesian rules," Machine Learning, vol. 41, no. 1, 2000, pp. 53–84.
[14] A. Asuncion and D. J. Newman, UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA: University of California, School of Information and Computer Science, 2007.