International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835
Volume-3, Issue-4, Apr.-2016
A PERFORMANCE COMPARISON OF END, BAGGING AND
DAGGING META CLASSIFICATION ALGORITHMS
ANITA GANPATI
Computer Science Department, Himachal Pradesh University, Shimla, India.
E-mail: [email protected]
Abstract— Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for
processing large volumes of data. Classification is an important data mining technique with broad applications. Classification
is a supervised procedure that learns to classify new instances based on the knowledge learnt from a previously classified
training set of instances; it is used to assign each item in a dataset to one of a predefined set of classes or groups. There
are various meta classification algorithms, such as Decorate, Attribute Selected Classifier, Bagging, Dagging, Filtered
Classifier, LogitBoost, END, and Rotation Forest. In this research work, we have analysed the performance of three meta
classification algorithms, namely END, Bagging and Dagging. For comparing the three algorithms, two performance
parameters were used: classification accuracy and error rate. A dataset in the form of an ARFF file was downloaded from
the UCI machine learning repository; it contains 20000 instances and seventeen attributes. The simulations were done
using the WEKA open source tool. The experimental results show that the END algorithm performs better than the other
algorithms.
Keywords— Data Mining, Meta Classification, Bagging, Dagging, END, WEKA.
I. INTRODUCTION
Data mining techniques are used for extracting hidden knowledge from large databases. Data
mining tasks can be classified into two categories:
Descriptive and predictive data mining. Descriptive
data mining provides information to understand what
is happening inside the data without a predetermined
idea. Predictive data mining allows the user to submit
records with unknown field values, and the system
will guess the unknown values based on previous
patterns discovered from the database. The basic data mining techniques are clustering, association rule discovery, classification, sequential pattern discovery, and distributed data mining.
Classification and prediction are predictive models,
but clustering and association rules are descriptive
models. Classification and prediction are two forms
of data analysis that can be used to extract models
describing important data classes or to predict future
data trends [1]. The following meta classification algorithms were considered in this study:
I) END (Ensembles of Nested Dichotomies)
END is a meta classifier for handling multi-class datasets with two-class classifiers by building an ensemble of nested dichotomies. Many real-world classification problems are multi-class problems: they involve a nominal class variable that has more than two values. There are basically two approaches to tackling this type of problem. One is to adapt the learning algorithm to deal with multi-class problems directly; the other is to create several two-class problems and form a multi-class prediction based on the predictions obtained for the two-class problems. The latter approach is appealing because it does not involve any changes to the underlying two-class learning algorithm. Well-known examples of this type of approach are error-correcting output codes [2] and pairwise classification [3], and they often result in significant increases in accuracy. Recently, it has been shown that ensembles of nested dichotomies are a promising alternative to pairwise classification and standard error-correcting output codes [4].
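For illustration, a minimal sketch of how such an END ensemble can be trained and evaluated, assuming the WEKA 3 Java API (the file name dataset.arff is a placeholder; in WEKA's default configuration each ensemble member is a nested dichotomy built around a J48 tree):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.END;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ENDDemo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF dataset; the last attribute is taken as the class.
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // END builds an ensemble of randomly generated nested dichotomies;
        // each dichotomy recursively reduces the multi-class task to two-class problems.
        END end = new END();
        end.setNumIterations(10); // number of nested dichotomies in the ensemble

        // Estimate classification accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(end, data, 10, new Random(1));
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}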
II) Bagging
Bootstrap Aggregating, or Bagging, generates multiple bootstrap training sets from the original training set using sampling with replacement and uses each of them to build a classifier for inclusion in the ensemble. Given a set D of tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set Di of d tuples is sampled with replacement from the original set of tuples D. (The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample.) Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model Mi is learned for each training set Di. To classify an unknown tuple X, each classifier Mi returns its class prediction, which counts as one vote. Bagging can also be applied to the prediction of continuous values by taking the average of the classifiers' predictions for a given test tuple. The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data; it will not be considerably worse, and it is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it has been theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D [5].
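The same evaluation pattern applies to Bagging; a sketch under the same assumptions (WEKA 3 Java API, placeholder file name), with J48 chosen here as an illustrative base learner:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Each iteration draws a bootstrap sample (here as large as D itself,
        // i.e. 100 %) and fits one J48 tree to it; the trees vote at prediction time.
        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());
        bagger.setNumIterations(10);
        bagger.setBagSizePercent(100);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}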
III) Dagging
This meta classifier creates a number of disjoint, stratified folds out of the data and feeds each chunk of data to a copy of the supplied base classifier. Predictions are made via majority vote, since all the generated base classifiers are put into the Vote meta classifier. It is useful for base classifiers that are quadratic or worse in time behaviour with respect to the number of instances in the training data [6].
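A corresponding sketch for Dagging, with the caveat that the import path is version-dependent: weka.classifiers.meta.Dagging shipped with older WEKA releases, whereas in WEKA 3.7 and later it is provided by the optional "dagging" package:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.Dagging; // version-dependent: may require the optional "dagging" package
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DaggingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // The training data is split into 10 disjoint stratified folds; one copy
        // of the base classifier is trained per fold, and the copies are combined
        // by majority vote.
        Dagging dagger = new Dagging();
        dagger.setClassifier(new J48());
        dagger.setNumFolds(10);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(dagger, data, 10, new Random(1));
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}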
II. LITERATURE REVIEW

Satya Ranjan Dash et al. in their paper considered different classification techniques such as Bayesian classification and classification by decision tree induction from data mining, as well as classification techniques related to the fuzzy concepts of soft computing. The parameters used to compare the different algorithms were RMSE, ROC area, MAE, kappa statistic, time taken to build the model, relative absolute error, root relative squared error, and the percentage of correctly classified instances. After compiling the results, it was concluded that FuzzyRoughNN and ID3 are the best algorithms for classification and prediction on the post-operative patient dataset, with ID3 being the better of the two [7].

Pfahringer et al. [8] presented a novel meta-feature generation method in the context of meta-learning, based on procedures that compare the performance of individual base learners in a one-to-one manner. In addition to these new meta-features, they proposed a new meta-learner called Approximate Ranking Tree Forests (ART Forests), which performs very competitively when compared with several state-of-the-art meta-learners. The experimental results, based on a large collection of datasets, show that the proposed techniques can significantly improve the overall performance of meta-learning for algorithm ranking. A main point of this approach is that each performance figure of any base learner for any specific dataset is generated by optimizing the parameters of the base learner separately for each dataset.

G. Michael et al. [9] evaluated the performance of a set of meta classifier algorithms (AdaBoost, Attribute Selected Classifier, Bagging, Classification via Regression, Filtered Classifier, LogitBoost, and Multiclass Classifier). They applied a set of supervised machine learning schemes with meta classifiers to the selected dataset to predict the attack risk of a network environment. The trained models were then used for predicting the risk of attacks in a web server environment by network administrators or security experts. The prediction accuracy of the classifiers was evaluated using 10-fold cross-validation and the results were compared. The results indicated that the Bagging meta classifier outperforms the other meta classifiers in prediction. Moreover, measures such as the TP rate, FP rate, F-measure and ROC area were found to be higher for the normal class, helping professionals or administrators to assess the risk of attacks.

Hu H., Li J., Plank A., Wang H. and Daggard G. [10] discussed the rapid development of DNA microarray technology. Many classification methods have been used for microarray classification; SVMs, decision trees, Bagging, Boosting and Random Forest are commonly used methods. They conducted an experimental comparison of LibSVM, C4.5, Bagging C4.5, AdaBoosting C4.5, and Random Forest on seven microarray cancer datasets. The experimental results show that all the ensemble methods outperform C4.5, and that all five methods benefit in classification accuracy from data preprocessing, including gene selection and discretization. In addition to comparing the average accuracies of ten-fold cross-validation tests on the seven datasets, they used two statistical tests to validate the findings, observing that the Wilcoxon signed rank test is better suited than the sign test for this purpose.

Aman Kumar Sharma and Suruchi Sahni [11] conducted experiments in the WEKA environment using four algorithms, namely ID3, J48, Simple CART and Alternating Decision Tree, on a spam email dataset, and then compared the four algorithms in terms of classification accuracy. According to their simulation results, the J48 classifier outperforms ID3, CART and ADTree in terms of classification accuracy.

III. RESEARCH METHODOLOGY

In this research work, we have analysed the performance of three meta classification algorithms, namely END, Bagging and Dagging. A dataset in the form of an ARFF file was downloaded from the UCI machine learning repository; it contains 20000 instances and seventeen attributes. The WEKA data mining tool was used for analysing the performance of the classification algorithms. The parameters used for the evaluation of these algorithms were accuracy and error rate.
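To make the set-up concrete, a sketch of the full evaluation loop under the same assumptions as the earlier snippets (WEKA 3 Java API, default configurations for the three meta classifiers, placeholder file name); it reports the accuracy and error-rate measures analysed below:

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.Dagging;
import weka.classifiers.meta.END;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareMetaClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new END(), new Bagging(), new Dagging() };
        String[] names = { "END", "Bagging", "Dagging" };

        for (int i = 0; i < classifiers.length; i++) {
            // 10-fold cross-validation for each meta classifier.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifiers[i], data, 10, new Random(1));
            System.out.printf("%s: accuracy=%.2f%%  MAE=%.4f  RMSE=%.4f  RAE=%.2f%%  RRSE=%.2f%%%n",
                    names[i], eval.pctCorrect(),
                    eval.meanAbsoluteError(), eval.rootMeanSquaredError(),
                    eval.relativeAbsoluteError(), eval.rootRelativeSquaredError());
        }
    }
}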
IV. ANALYSIS OF RESULTS
The classification accuracy was analysed using parameters such as correctly classified instances, incorrectly classified instances, TP rate, FP rate, precision, recall, F-measure, ROC area, and kappa statistic. Table 1 presents the classification accuracy results for the three meta classification algorithms. It is evident from Table 1 that END has the highest classification accuracy (94.86 %).
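For reference, the threshold-based measures above follow the usual textbook definitions in terms of true/false positives and negatives (standard formulas, not reproduced from the paper):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall (TP rate)} = \frac{TP}{TP + FN},
\]
\[
\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{FP rate} = \frac{FP}{FP + TN}.
\]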
Table 1: Results of Accuracy Measure Parameter
It is evident from Fig. 1 that the END algorithm has the highest percentage of correctly classified instances, whereas the Dagging algorithm has the lowest percentage of correctly classified instances.

Fig. 1. Performance of END, Bagging and Dagging in terms of Accuracy

The error rate was analysed using the parameters Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE), and Root Relative Squared Error (RRSE).
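For reference, the standard definitions of these measures, with p_i the predicted value, a_i the actual value, \bar{a} the mean of the actual values, and n the number of test instances (standard formulas, not reproduced from the paper):

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert p_i - a_i \rvert, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2},
\]
\[
\mathrm{RAE} = \frac{\sum_{i=1}^{n} \lvert p_i - a_i \rvert}{\sum_{i=1}^{n} \lvert a_i - \bar{a} \rvert}, \qquad
\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}.
\]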
Table 2: Results of Error Rate

It is evident from Table 2 that the END algorithm has the lowest error rates, i.e. the lowest Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error, and Root Relative Squared Error, as compared to Bagging and Dagging.

CONCLUSIONS

Classification is a supervised procedure that learns to classify new instances based on the knowledge learnt from a previously classified training set of instances. In this research we have performed experiments in order to determine the classification accuracy and error rate of three algorithms, namely END, Bagging and Dagging, with the help of an open source data mining tool known as WEKA. From the experimental results, it can be concluded that the END algorithm performs better than the other algorithms in terms of both the accuracy measures and the error rate.
REFERENCES

[1] R. MahaLakshmi, "A Comparative analysis on persuasive meta classification strategy for web spam detection", International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN: 2249-9555, Vol. 2, No. 4, August 2012.
[2] Dietterich, T., Bakiri, G., "Solving multi-class learning problems via error-correcting output codes", Journal of Artificial Intelligence Research, Vol. 2, 1995, pp. 263-286.
[3] Fürnkranz, J., "Round robin classification", Journal of Machine Learning Research, Vol. 2, 2002, pp. 721-747.
[4] Dong, Lin, Eibe Frank and Stefan Kramer, "Ensembles of balanced nested dichotomies for multi-class problems", Knowledge Discovery in Databases: PKDD 2005, Springer Berlin Heidelberg, 2005, pp. 84-95.
[5] Noor A. Samsudin, Andrew P. Bradley, "Group-based Meta-classification", School of Information Technology and Electrical Engineering, The University of Queensland, Australia.
[6] P. Kalaiselvi, Dr. C. Nalini, "A Comparative Study of Meta Classifier Algorithms on Multiple Datasets", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, No. 3, March 2013.
[7] Satya Ranjan Dash, Satchidananda Dehuri, "Comparative Study of Different Classification Techniques for Post Operative Patient Dataset", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, No. 5, July 2013.
[8] Shilpa Dhanjibhai Serasiya, Neeraj Chaudhary, "Simulation of Various Classifications results using WEKA", International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Vol. 1, No. 3, August 2012.
[9] Michael, G., A. Kumaravel, and A. Chandrasekar, "Detection of malicious attacks by Meta classification algorithms", International Journal of Advanced Networking and Applications, Vol. 6, No. 5, 2015, pp. 2455-2459.
[10] Hu H., Li J., Plank A., Wang H. and Daggard G., "A Comparative Study of Classification Methods for Microarray Data Analysis", In Proc. Fifth Australasian Data Mining Conference, Sydney, Australia, 2006.
[11] Aman Kumar Sharma, Suruchi Sahni, "A Comparative Study of Classification Algorithms for Spam Email Data Analysis", IJCSE, Vol. 3, No. 5, 2011, pp. 1890-1895.
