CST: NEW SLICING TECHNIQUES TO IMPROVE CLASSIFICATION ACCURACY
OMAR A. A. SHIBA1
Computer Department, Faculty of Science, University of Sebha, Libya
E-mail: [email protected]
ABSTRACT
The classification problem is one of the core topics in data mining. Within the artificial intelligence community, classification has been shown to be an important modality in diverse problem-solving situations. The goal of this paper is to improve classification accuracy in data mining. The paper achieves this goal by introducing a new algorithm based on slicing techniques. The proposed approach is called the Case Slicing Technique (CST); the idea is to slice cases with respect to a slicing criterion. The paper also compares the proposed approach against other approaches to the classification problem, using ID3, K-NN and Naive Bayes in the comparison, and presents experimental results for CST on three real-world datasets. The main results show that the proposed approach is a promising classification method for decision-making in classification problems.
Keywords: Data Mining, Case-Slicing Technique,
Case Retrieval, Classification Accuracy.
1. INTRODUCTION
One of the problems addressed by machine learning is that of data classification. Since the 1960s, many algorithms for data classification have been proposed [1]. The problem of classification is defined as follows: the input data, referred to as the training set, consists of a set of records, each of which has multiple attributes or features. Each example in the training set is tagged with a class label, which may be either categorical or quantitative; classification with a quantitative class label is referred to as the regression-modeling problem. The training set is used to build a model of the class attribute based upon the other attributes, and this model is then used to predict the value of the class label for the test set. Some well-known techniques for classification include decision trees, such as the Induction of Decision Tree algorithm (ID3) [2] [3], K-Nearest Neighbor techniques [4] [5] and Bayesian learners such as Naive Bayes (NB) [6].
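To make this setup concrete, here is a toy example of how such a training set might be represented; the data itself is invented purely for illustration:

# Invented toy training set: each record is a feature vector plus a class label.
training_set = [
    # (attributes ...,             class label)
    (("sunny", 30.0, "high"),     "no"),
    (("rain",  18.5, "normal"),   "yes"),
    (("cloudy", 22.0, "normal"),  "yes"),
]
# A classifier builds a model of the class label from the other attributes,
# then predicts labels for unseen test records.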
This paper investigates whether a slicing technique can be applied successfully to the classification problem, and how it compares as a classification approach against the ID3, K-NN and Naive Bayes classification approaches.
The paper is organized as follows: the next section presents the selected classification algorithms. Section 3 describes the Case Slicing Technique. Section 4 presents experimental results that compare the performance of the algorithms. Finally, the conclusion is presented in Section 5.
2. SELECTED CLASSIFICATION
ALGORITHMS
This section gives only a brief description of the selected classification methods; more information can be found in [2], [4] and [6] for ID3, K-NN and NB respectively.
2.1 INDUCTION OF DECISION TREE
ALGORITHM (ID3)
ID3 is an algorithm introduced by Quinlan for inducing a classification model, also called a decision tree, from data. We are given a set of records. Each record has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents the category of the record. The problem is to determine a decision tree that, on the basis of answers to questions about the non-category attributes, correctly predicts the value of the category attribute. Usually the category attribute takes only the values {true, false}, {success, failure}, or something equivalent; in any case, one of its values will mean failure [2] [3].
The basic ideas behind ID3 are:

- In the decision tree, each node corresponds to a non-categorical attribute and each arc to a possible value of that attribute. A leaf of the tree specifies the expected value of the categorical attribute for the records described by the path from the root to that leaf. [This defines what a decision tree is.]

- In the decision tree, each node should be associated with the non-categorical attribute which is most informative among the attributes not yet considered in the path from the root. [This establishes what a "good" decision tree is.]

- Entropy is used to measure how informative a node is (see the sketch below). [This defines what we mean by "good"; this notion was introduced by Claude Shannon in information theory [7].]
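As an illustration of the last point, the following sketch shows how entropy and information gain can be computed for attribute selection. It is not taken from the paper; the record layout (dictionaries with a "class" key) is an assumption made for the example.

# Illustrative sketch: entropy-based attribute selection as used by ID3.
# The record layout (dicts with a "class" key) is an assumption for this example.
import math
from collections import Counter

def entropy(records):
    """Shannon entropy of the class-label distribution in `records`."""
    counts = Counter(r["class"] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(records, attribute):
    """Reduction in entropy obtained by splitting `records` on `attribute`."""
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(records) - remainder

# ID3 would place the attribute with the highest gain at the current node:
# best = max(candidate_attributes, key=lambda a: information_gain(records, a))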
2.2 K-NEAREST NEIGHBOR (K-NN)
The basic idea of the K-Nearest Neighbor algorithm (K-NN) is to compare every attribute of every case in the set of similar cases with the corresponding attribute of the input case. A numeric function is used to decide the value of the comparison, and the K-NN algorithm then selects and retrieves the case with the highest comparison value [4].

K-NN assumes each case x = {x1, x2, ..., xn, xc} is defined by a set of n (numeric or symbolic) features, where xc is x's class value. Given a query q and case library L, K-NN retrieves the set K of q's k most similar (i.e., least distant) cases in L and predicts their weighted majority class as the class of q [5]. Distance in K-NN is defined as in equation (1) below:
$$\mathrm{distance}(x, q) = \sqrt{\sum_{f=1}^{n} w_f \cdot \mathrm{difference}(x_f, q_f)^2} \qquad (1)$$

where $w_f$ is the parameterized weight value assigned to feature $f$, as in equation (2) below:

$$w_f = P(C \mid i_a) \qquad (2)$$

That is, the weight for feature $a$ for a class $C$ is the conditional probability that a case is a member of $C$ given the value $i_a$ of $a$, where $P(C \mid i_a)$ is defined in equation (3), and the difference between $x$ and $q$ can be calculated as in equation (4):

$$P(C \mid i_a) = \frac{\lvert\{\text{instances containing } i_a \text{ with class } C\}\rvert}{\lvert\{\text{instances containing } i_a\}\rvert} \qquad (3)$$

$$\mathrm{difference}(x_f, q_f) = \begin{cases} \lvert x_f - q_f \rvert & \text{if } f \text{ is numeric} \\ 0 & \text{if } f \text{ is symbolic and } x_f = q_f \\ 1 & \text{otherwise} \end{cases} \qquad (4)$$
Note that in its basic form K-NN assigns equal weights to all features (i.e., $w_f = 1$ for all $f$); equation (2) replaces these uniform weights with class-conditional ones.
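A minimal sketch of equations (1) and (4) follows. It is an illustration rather than the paper's implementation; the case representation as (feature-list, class-label) pairs is an assumption.

# Illustrative sketch of the weighted k-NN distance in equations (1) and (4).
# The case representation (list of feature values plus label) is an assumption.
import math

def difference(xf, qf):
    """Per-feature difference: absolute for numeric, 0/1 match for symbolic."""
    if isinstance(xf, (int, float)) and isinstance(qf, (int, float)):
        return abs(xf - qf)
    return 0.0 if xf == qf else 1.0

def distance(x, q, weights):
    """Weighted Euclidean distance between cases x and q (equation (1))."""
    return math.sqrt(sum(w * difference(xf, qf) ** 2
                         for w, xf, qf in zip(weights, x, q)))

def knn_classify(query, cases, weights, k=3):
    """Predict the majority class among the k least-distant cases."""
    nearest = sorted(cases, key=lambda c: distance(c[0], query, weights))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)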
2.3 NAÏVE BAYES (NB)
The estimates of the probability masses are used as input for a Naive Bayes classifier. This classifier simply computes the conditional probabilities of the different classes given the values of the attributes, and then selects the class with the highest conditional probability. If an instance is described with n attributes ai (i = 1...n), then the instance is assigned a class v from the set of possible classes V according to the maximum a posteriori (MAP) criterion; the Naive Bayes classifier can be defined as in equation (5):
$$v = \operatorname*{arg\,max}_{v_j \in V} \; p(v_j) \prod_{i=1}^{n} p(a_i \mid v_j) \qquad (5)$$
The conditional probabilities in the above formula are obtained from the estimates of the probability mass function using training data. This Bayes classifier minimizes the probability of classification error under the assumption that the attributes are independent given the class [6].
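The sketch below illustrates equation (5) with probabilities estimated as relative frequencies from training data. It is not the paper's implementation, and the Laplace smoothing is an added assumption to avoid zero-probability estimates.

# Illustrative Naive Bayes classifier implementing equation (5).
# Probabilities are relative frequencies; Laplace smoothing is an assumption
# added for robustness, not something specified in the paper.
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attr_index, class) -> value counts
    domains = defaultdict(set)            # attr_index -> observed values
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def classify_nb(record, class_counts, value_counts, domains):
    """Return arg max over classes of p(v) * prod_i p(a_i | v), equation (5)."""
    total = sum(class_counts.values())
    def score(v):
        s = class_counts[v] / total                    # p(v_j)
        for i, value in enumerate(record):
            num = value_counts[(i, v)][value] + 1      # Laplace smoothing
            den = class_counts[v] + len(domains[i])
            s *= num / den                             # p(a_i | v_j)
        return s
    return max(class_counts, key=score)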
3. THE CASE SLICING
ALGORITHM
Conceptually, the proposed method is a variation of the nearest neighbor algorithms and is called the Case Slicing Technique (CST). It compares new cases to the training cases in the data file, computing the similarity between the new cases and the training cases in order to classify the new cases. The proposed method is a classification technique based on slicing [9] [10] [11].

Case slicing means that we are interested in automatically obtaining the portion (features) of the case responsible for specific parts of the solution of the case at hand. By slicing the case with respect to important features, we can obtain a new case with a small number of features, or with only the important features. The proposed approach consists of a database with three calculation modules, as follows:

- A discretization module to compute discrete values for continuous values (a minimal sketch follows this list).

- A feature-weighting module to assign a weight to each feature.

- A distance-computation module to compute the distance between each pair of cases.
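As an illustration of the first module, the sketch below uses simple equal-width binning; the bin count and the equal-width strategy are assumptions, since the paper does not specify which discretization method CST uses.

# Illustrative equal-width discretization of a continuous feature.
# The number of bins and the equal-width strategy are assumptions;
# the paper does not specify which discretization method CST uses.
def discretize(values, n_bins=4):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against constant features
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Example: discretize([1.0, 2.5, 3.7, 9.9]) -> [0, 0, 1, 3]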
3.1 SLICING TECHNIQUE
The objective of the slicing technique is to optimize similarity matching so as to achieve the best classification results. The proposed approach adapts the slicing technique that has been used in programming languages, slicing the cases by selecting a subset of features with respect to the selected slicing criterion. The case classification algorithm based on slicing is shown in Figure 1 below.
Algorithm: Case Classification Based on Slicing
Input:  User's input problem specification
Output: Classified case
Begin
  While True do
    Discretize_Continuous_Values()
      {convert continuous values into discrete values}
    Assign_Weights_To_Features()
      {assign a weight to each feature in each case}
    Slice_Cases_wrt_Slicing_Criterion()
      {slice the cases with respect to the important features}
    Calculate_Distance()
      {find the distance between each pair of cases}
    Search_Closest_Case()
      {find the case closest to the case at hand}
    Return_Classified_Case()
      {assign a class label to the new case}
  Enddo
End.

Fig. 1. Case classification algorithm based on slicing
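To make the control flow of Figure 1 concrete, here is a hedged sketch of the pipeline. The helper structure mirrors the pseudocode; the slicing criterion (keeping features whose weight exceeds a threshold) is an assumption, since the paper does not fix a concrete criterion.

# Hedged sketch of the CST pipeline in Figure 1. The slicing criterion
# (keep features whose weight exceeds a threshold) is an assumption;
# the paper does not fix a concrete criterion.
def cst_classify(query, cases, weights, threshold=0.5):
    """Classify `query` against (features, label) training cases."""
    # Slice: indices of "important" features under the assumed criterion.
    important = [i for i, w in enumerate(weights) if w >= threshold]

    def slice_case(features):
        return [features[i] for i in important]

    def distance(a, b):
        # 0/1 mismatch distance over the sliced (discretized) features.
        return sum(x != y for x, y in zip(slice_case(a), slice_case(b)))

    # Retrieve the closest sliced case and return its class label.
    nearest = min(cases, key=lambda c: distance(c[0], query))
    return nearest[1]   # class label of the closest case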
3.1.1 A FORMAL DESCRIPTION OF
CASE SLICING TECHNIQUE
In this section we give a formal description of the Case Slicing Technique in a basic version, allowing for a detailed investigation of the approach.

Let

S = {C1, C2, C3, ..., Cn} be the set of cases in the case base, with S ≠ ∅;
Ci = {f1, f2, f3, ..., fm}, where m is the number of features in Ci;
λ = {Cs | Cs is a sliced case}, i.e. the set of all cases that contain one or more important feature(s);
I = {if1, if2, ..., ifk}, where k is the number of important features in I;
I ⊆ Ci ⊆ S and I ⊆ Cs ⊆ λ.
4. EXPERIMENTAL RESULTS
In this section, the results of several practical experiments are presented to examine the performance of the proposed approach and of the selected classification algorithms on real-world problems.
4.1 SELECTED DATASETS
In this paper, three real-world datasets, widely used in the machine-learning field, have been used to evaluate the case slicing technique. The three medical datasets, Cleveland Heart Disease (CLEV), Breast Cancer (BCO) and Hepatitis Domain (HEPA), were chosen from the UCI Repository of Machine Learning and Domain Theories [12]. Table 1 presents the main characteristics of these datasets, where B, C and D denote Boolean, continuous and discrete attributes respectively.
Table 1
Characteristics of the selected datasets

Dataset   No. of Data   Type & No. of Attributes   No. of Classes
CLEV      303           9D, 6C (15)                2
BCO       699           13B, 6C (19)               2
HEPA      155           13B, 6C (19)               2
4.2 EMPIRICAL RESULTS
We evaluated the performance of the case slicing technique by comparing it against the ID3, Naive Bayes and K-Nearest-Neighbor classifiers on a variety of datasets. The selected datasets are a good choice for testing and evaluating the slicing technique because they contain a good mixture of continuous, discrete and Boolean features. In all the experiments reported here we used 10-fold cross-validation, which consists of randomly dividing the data into 10 equally sized subgroups and performing ten different experiments. We separated one group, along with its original labels, as the validation set; another group was used as the starting training set; the remainder of the data was used as the test set. Each experiment consists of ten runs of the procedure described above, and the overall averages are the results reported here. The criterion for choosing the best classification approach is the highest classification accuracy. The results, given in Table 2, list the classification accuracies achieved by each approach on each dataset. (A sketch of a standard cross-validation loop is given after Table 2.)
Table 2
Classification accuracy (%) achieved by the different classification algorithms

Methods   CLEV    BCO     HEPA
ID3       71.20   94.30   67.74
NB        83.40   96.40   86.30
K-NN      71.20   97.10   92.90
CST       96.00   99.30   97.00
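For reference, a sketch of a standard 10-fold cross-validation loop is shown below. It follows the usual per-fold train/test split rather than the paper's exact three-way grouping, and the `classify` argument is a placeholder for any of the classifiers above.

# Sketch of standard 10-fold cross-validation accuracy estimation.
# This shows the usual train/test split per fold; the paper's exact
# three-way grouping is not reproduced. `classify` is a placeholder.
import random

def cross_validate(records, labels, classify, k=10, seed=0):
    """Average accuracy of `classify(train, test_record)` over k folds."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        test = set(fold)
        train = [(records[i], labels[i]) for i in indices if i not in test]
        correct = sum(classify(train, records[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k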
5. CONCLUSION
This paper has presented and discussed the Case Slicing Technique (CST), a new slicing-based approach to improving the classification task in data mining. CST was supported with experiments on three datasets, and the experiments show that using CST indeed improves classification accuracy. The paper also gave a brief description of a number of common classification algorithms used in data mining and in general AI, and presented a comparison between the proposed method and the other selected classification algorithms on medical domains. The proposed technique achieved competitive results, giving a very high classification accuracy.
REFERENCES
[1] Thamar, S. and Olac, F., "Improving Classifier Accuracy Using Unlabeled Data", in Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2001), Marbella, Spain, Sept. 2001.
[2] Quinlan, J. R. “Induction of Decision Trees.”
Machine Learning, Vol. 1, No. 1, pp. 81–106,
1986.
[3] Quinlan, J. R. C4.5: Programs for Machine
Learning, CA: Morgan Kaufmann Publishers,
Inc., 1993.
[4] Xiaoli, Q. A Case-Based Reasoning System for Bearing Design, Master's Thesis, Drexel University, June 1999.
[5] Wettschereck, D. and Aha, D. W. "Weighting Features", in Proceedings of the 1st International Conference on Case-Based Reasoning (ICCBR-95), pp. 347-358, 1995.
[6] Mitchell, T. M. Machine Learning, McGraw-Hill, New York, 1997.
[7] Cover, T. M. and Thomas, J. A. Elements of
Information Theory, Wiley, 1991.
[8] Thamar, S and Olac, F. “Improving Classification
Accuracy of Large Test Sets Using the Ordered
Classification Algorithm”, in Proceedings of
IBERAMIA-02, Lecture Notes in Computer
Science, 2002.
[9] Weiser, M. "Program Slicing." IEEE Transactions on Software Engineering, Vol. 10, No. 4, pp. 352-357, 1984.
[10] Vasconcelos, W. W. "Slicing Knowledge-Based Systems: Techniques and Applications." Knowledge-Based Systems, Elsevier, Vol. 13, pp. 177-198, 2000.
[11] Omar A. Shiba. "Using Code Slicing Technique to Improve Case Classification: A Suggested Case Slicing Approach." International Journal of Applied Computing & Informatics, Vol. 2, No. 1, pp. 24-37, 2003.
[12] Murphy, P. M. UCI Repositories of Machine Learning and Domain Theories [online]. University of California, Irvine, 1996. Available: http://www.isc.uci.edu/~mlearn/MLRepository.html [accessed 12 Nov. 2001].