International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: [email protected], [email protected]
Volume 2, Issue 1, January – February 2013
ISSN 2278-6856
An Efficient Feature Reduction Comparison of
Machine Learning Algorithms for Intrusion
Detection System
Upendra
Assistant Professor, CSE Department, NIT Raipur, C.G., India
Abstract: Organisations have come to recognize that applied
science in network security has become very important in
protecting their information. Intrusion detection presents an
important line of defence against the variety of attacks that can
compromise the security and proper functioning of an
organisation's information systems. In this paper we compare the
performance of intrusion detection algorithms. Evaluating the
performance of an Intrusion Detection System (IDS) for any
given security system configuration is necessary
to achieve real-time capability. We analyse two learning
algorithms (NB and C4.5) for the task of detecting intrusions
and compare their relative performance.
Keywords: intrusion detection, machine learning, C4.5,
NB, KDD 99
1. INTRODUCTION
An intrusion detection algorithm should consider the
composite properties of attack behaviors to improve
detection speed and detection accuracy. Because large
volumes of network data must be analyzed and better
detection accuracy is required, intrusion detection has become an
important research field for machine learning. In this
work we present the C4.5 decision tree algorithm for
intrusion detection based on machine learning. An
Intrusion Detection System (IDS) monitors the events
occurring in a computer system or network and analyzes
them for signs of possible incidents. The IDS was first
introduced in 1980 by James P. Anderson [1] and then
improved by D. Denning [2] in 1987. There are two basic
approaches to intrusion detection: Anomaly Detection and
Misuse Detection (signature-based ID) [3]. Anomaly
Detection is based on the assumption that an attacker's
behavior is different from a normal user's behavior [4].
In this paper, we present the application of machine
learning to intrusion detection. We analyse two learning
algorithms (C4.5 and NB) for the task of detecting
intrusions and compare their relative performance. The
only data set available for our intrusion detection
experiments is the KDD data set. The KDD data set [5]
contains 42 attributes (41 features plus the class label).
The classes in the KDD99 [6] dataset can be grouped into
five main classes: one normal class and four main
intrusion classes. Data mining is a collection of
techniques for the efficient automated discovery of
previously unknown, valid, novel, useful and
understandable patterns in large databases [18]. The field
of intrusion detection has received increasing attention
in recent years.
2. RELATED WORK
In 2007, Panda and Patra [7] presented a method using
naive Bayes to detect signatures of specific attacks. They
used the KDD99 dataset for their experiments. In the early
1980s, Stanford Research Institute (SRI) developed an
Intrusion Detection Expert System (IDES) that monitors
user behavior and detects suspicious events. Meng
Jianliang [8] used the K-Means algorithm to cluster and
analyze the data, applying an unsupervised learning
technique to intrusion detection. In 2010, Mohammadreza
Ektefa et al. [8] compared C4.5 with SVM, and the results
revealed that C4.5 performs better than SVM in terms of
network intrusion detection and false alarm rate. Zubair
A. Baig et al. (2011) proposed an AODE-based Intrusion
Detection System for computer networks. They suggested
that Naive Bayes (NB) does not accurately detect network
intrusions [9]. In 2010, Hai Nguyen et al. [10] applied
C4.5 and BayesNet to intrusion detection on the KDD
CUP'99 dataset. Jiong Zhang and Mohammad Zulkernine [11]
performed intrusion detection using the random forest
algorithm in an anomaly-based NIDS. Cuixiao Zhang,
Guobing Zhang and Shanshan Sun [12] used a mixed approach
for intrusion detection. Various paradigms, namely
Support Vector Machines [13], Neural Networks [14] and
K-means based clustering [15], have been applied to
intrusion detection because they have the advantage of
discovering useful knowledge that describes a user's or
program's behavior. There are two basic approaches to
intrusion detection: Anomaly Detection and Misuse
Detection (signature-based ID). Anomaly Detection is
based on the assumption that an attacker's behavior is
different from a normal user's behavior [16]. Shadmehr et
al. [17] showed that the performance of the Bayes
algorithm is better. Recently, research on machine
learning for intrusion detection has received much
attention in the computational intelligence community. In
an intrusion detection algorithm, immense volumes of audit
data must be analyzed in order to construct new detection
rules for the increasing number of novel attacks in
high-speed networks. An intrusion detection algorithm
should consider the composite properties of attack
behaviors to improve detection speed and detection
accuracy.
3. METHODOLOGY
3.1 Naïve Bayes (NB)
A Naive Bayes classifier [19],[20] is a simple probabilistic
classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence
assumptions. A more descriptive term for the underlying
probability model would be "independent feature model".
In 2004, an analysis of the Bayesian classification problem
showed that there are some theoretical reasons for the
apparently unreasonable efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other
classification methods in 2006 showed that Bayes
classification is outperformed by more current
approaches, such as boosted trees or random forests [25].
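The independence assumption described above can be illustrated with a minimal sketch of a categorical Naive Bayes classifier. This is not the WEKA implementation used in the experiments; the function names, the toy records and the use of Laplace (add-one) smoothing are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    # Distinct values seen for each feature, needed for Laplace smoothing
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict_naive_bayes(model, row):
    """Return the class maximising log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing (an assumption here) avoids zero probabilities
            score += math.log((count + 1) / (n_label + len(feature_values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy records (service, logged_in); the values are invented for illustration
rows = [("http", 1), ("http", 1), ("ftp", 0), ("private", 0), ("private", 0)]
labels = ["normal", "normal", "normal", "attack", "attack"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("private", 0))
```

Each feature contributes an independent factor to the class score, which is what makes training a single counting pass over the data.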
3.2 C4.5
Decision tree technology is a common, intuitive and
fast classification method [22]. J48 is the WEKA
implementation of the C4.5 decision tree algorithm
developed by John Ross Quinlan [23]. C4.5 is an
extension of Quinlan's earlier Iterative
Dichotomiser 3 (ID3) algorithm. J48 builds decision trees
from a set of labelled training data using the concept of
information entropy. A decision tree is a classifier
expressed as a recursive partition of the instance space.
It consists of nodes that form a rooted tree, meaning it is a
directed tree with a node called the root that has no
incoming edges. A node with outgoing edges is referred to
as an internal or test node. All
other nodes are called leaves (also known as terminal or
decision nodes). Decision trees [24] are one of the most
commonly used classification methods in supervised
learning approaches.
TABLE I. SELECTED 7 ATTRIBUTES WITH HIGHEST
INFORMATION GAIN
S No.   Feature name
1       service
2       src_bytes
3       dst_bytes
4       logged_in
5       count
6       dst_host_diff_srv_rate
7       dst_host_srv_diff_host_rate
8       Class
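Ranking attributes by information gain, as done to obtain the table above, can be sketched as follows. The helper names and the toy data are hypothetical, and the sketch assumes categorical (or already discretised) attribute values; it is not the WEKA attribute evaluator used in the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Invented toy data: one informative attribute and one uninformative attribute
labels = ["normal", "normal", "attack", "attack"]
columns = {
    "logged_in": [1, 1, 0, 0],                  # separates the classes perfectly
    "service": ["http", "ftp", "http", "ftp"],  # carries no class information
}
ranking = rank_features(columns, labels)
```

Keeping the top-ranked names from such a ranking gives a reduced attribute set; continuous attributes such as src_bytes would need to be discretised first under this sketch's assumptions.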
3.3 Evaluation
We constructed a confusion matrix (contingency table) to
evaluate the classifier’s performance.
TABLE II. A SAMPLE CONFUSION MATRIX
                         Predicted Class Positive   Predicted Class Negative
Actual Class Positive    A                          B
Actual Class Negative    C                          D
In this confusion matrix, the value A is called a true positive
and the value D is called a true negative. The value B is
referred to as a false negative and C is known as a false positive.
3.4 Accuracy
This is the most basic measure of the performance of a
learning method. This measure determines the percentage
of correctly classified instances. From the confusion
matrix, we can say that:

$$\text{Accuracy} = \frac{A + D}{A + B + C + D}$$

Precision is the ratio of the number of documents retrieved that
are relevant to the total number of documents that are
retrieved. Here "retrieved" corresponds to instances predicted
positive and "relevant" to instances that are actually positive.
Referring to the confusion matrix, we can
define precision and recall for our purposes as

$$\text{Precision} = \frac{A}{A + C}$$

Recall is the ratio of the number of documents retrieved that are
relevant to the total number of documents that are
relevant:

$$\text{Recall} = \frac{A}{A + B}$$
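The measures above follow directly from the four confusion-matrix cells A, B, C and D. The sketch below is illustrative only: the function names and the toy labels are assumptions, and the F-measure and false positive rate use their standard definitions (harmonic mean of precision and recall, and C / (C + D)), which the paper reports but does not spell out.

```python
def confusion_counts(actual, predicted, positive):
    """Count A (true pos.), B (false neg.), C (false pos.), D (true neg.)."""
    a = b = c = d = 0
    for y, y_hat in zip(actual, predicted):
        if y == positive:
            if y_hat == positive:
                a += 1
            else:
                b += 1
        else:
            if y_hat == positive:
                c += 1
            else:
                d += 1
    return a, b, c, d


def metrics(a, b, c, d):
    """Compute the evaluation measures from the four confusion-matrix cells."""
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    denom = precision + recall
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "recall": recall,
        # Standard definitions, assumed here rather than taken from the paper
        "f_measure": 2 * precision * recall / denom if denom else 0.0,
        "false_positive_rate": c / (c + d) if (c + d) else 0.0,
    }


# Invented toy labels: "attack" is treated as the positive class
actual = ["attack"] * 3 + ["normal"] * 5
predicted = ["attack", "attack", "normal", "attack",
             "normal", "normal", "normal", "normal"]
cells = confusion_counts(actual, predicted, positive="attack")
scores = metrics(*cells)
```

For a multi-class problem such as KDD99, these per-class values would typically be averaged over the classes to obtain figures like the "average" rates reported in the results tables.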
4. PERFORMANCE EVALUATION AND RESULTS
Tables III and IV show the performance of the C4.5
and NB classification methods in terms of accuracy,
learning time, error rate, average true positive rate,
average false positive rate, average precision, average
recall and average F-measure. The comparison is
performed for 41, 11 and 7 attributes. The C4.5 and NB
classifier models are built and tested on the dataset by
means of 10-fold cross-validation. The Java heap size
was set to 1024 MB for WEKA 3.6.2. The simulation
platform is an Intel™ Core i3-2100 processor (3.10 GHz)
system with 3 GB RAM and 500 GB of disk storage, running
the Microsoft Windows XP™ Service Pack 2 operating
system. C4.5 was evaluated on the dataset reduced to
7 features using the Information Gain
measure. The results of this test are summarized in the
following table.
TABLE III. RESULT OF C4.5 WITH SELECTED 7
ATTRIBUTES
Parameter                      Value
Accuracy                       99.8901 %
Learning Time                  12.67 sec
Error Rate                     0.1099 %
Average True Positive Rate     0.999
Average False Positive Rate    0.001
Average Precision              0.999
Average Recall                 0.999
Average F-Measure              0.999

Similar to the C4.5 tests, NB was also evaluated using
the 7 selected features of the dataset. The results of this
evaluation are summarized in the table below.

TABLE IV. RESULT OF NB WITH SELECTED 7
ATTRIBUTES
Parameter                      Value
Accuracy                       93.5698 %
Learning Time                  1.3 sec
Error Rate                     6.4302 %
Average True Positive Rate     0.936
Average False Positive Rate    0.054
Average Precision              0.949
Average Recall                 0.936
Average F-Measure              0.942

In this paper, we have reduced the feature set to 7
attributes and reported the results. We now compare the
results of the C4.5 and NB algorithms with the reduced 7
attributes; only then can we conclude which algorithm is
better suited to intrusion detection.
Figure 1 shows the comparison of
the accuracy of C4.5 and NB.

Figure 1. Comparison of accuracy for C4.5 and NB
using all attributes, the selected 11 attributes and the
selected 7 attributes.

From Figure 1 it is clear that the information gain
feature reduction method gives better accuracy, which
is desirable for a good Intrusion Detection System.
In the case of C4.5 in particular, accuracy is about 99 %
for all attributes, 11 attributes and 7 attributes.
We now compare the True Positive Rate of the C4.5 and
NB algorithms with the selected 7 attributes.

[Figure 2: chart; y-axis: True Positive Rate; x-axis: Attacks; series: C4.5 and NB]
Figure 2. Comparison of TPR for C4.5 and NB with
selected 7 attributes.

For a good IDS the True Positive Rate should be high.
Figure 2 shows that the True Positive Rate of the C4.5
algorithm is higher when we reduce the features of the
data set using information gain. In the case of C4.5 in
particular the True Positive Rate is close to 1 (0.999 in
Table III), and Figure 2 also shows that the
TPR of C4.5 is higher than that of the NB algorithm with
the selected 7 attributes.
We now compare the False Positive Rate of the C4.5 and
NB algorithms with the selected 7 attributes. The attacks
are grouped into four main categories (Probing; DoS,
denial of service; R2L, remote to local; and U2R, user to
root), alongside one normal class.

Figure 3. Comparison of FPR for C4.5 and NB using
selected 7 attributes.
For a better Intrusion Detection System, the false positive rate
should be low. Figure 3 shows that the FPR of the
C4.5 algorithm is lower when we reduce the data set to 7
features using information gain. In the case of C4.5 in
particular the FPR is close to 0 (0.001 in Table III). In the
case of the NB algorithm the False
Positive Rate is greater than 0 with the selected 7
attributes.

Figure 4. Comparison of time taken to build the model for
C4.5 and NB using all attributes, the selected 11 and 7
attributes.

From Figures 1, 2, 3 and 4 it is clear that the accuracy,
TPR and FPR of the C4.5 algorithm are better than those of
the NB algorithm. We can therefore say that feature reduction
using information gain is an effective technique.
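The 10-fold cross-validation protocol used for the evaluation in this section can be sketched as below. This is a simplified illustration, not WEKA's procedure (which may differ, for example by stratifying the folds); the function names and the majority-class stand-in classifier are hypothetical, and the sketch assumes at least as many samples as folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_fn, predict_fn, k=10, seed=0):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    folds = k_fold_indices(len(rows), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical stand-in classifier: always predicts the majority training class
def train_majority(rows, labels):
    return max(set(labels), key=labels.count)


def predict_majority(model, row):
    return model


rows = [[i] for i in range(20)]
labels = ["normal"] * 20
folds = k_fold_indices(len(rows))
mean_accuracy = cross_validate(rows, labels, train_majority, predict_majority)
```

Any classifier exposing a train function and a predict function of the same shape could be passed in place of the stand-in.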
5. CONCLUSIONS
In this paper we have presented the C4.5 technique with 7
selected attributes for intrusion detection, performed feature
set reduction and evaluated the resulting performance. From the
results, it is observed that after reducing the feature set
from 41 attributes to 11 and 7 attributes, the
overall performance of C4.5 is better than that of NB.
The comparison results show that,
in general, C4.5 has the highest classification
accuracy with the lowest error rate. C4.5
achieves better detection rates than NB and a higher true
positive rate. We also found that feature reduction
drastically decreased the learning time of the algorithms.
We evaluated two machine learning algorithms, C4.5 and
NB, on the reduced dataset built in this exercise. Based on
accuracy, true positive rate, false positive rate and error
rate, C4.5 was the better classifier, although NB required
less learning time (1.3 sec versus 12.67 sec with 7 attributes).
REFERENCES
[1.] James P. Anderson, “Computer Security Threat
Monitoring and Surveillance,” Technical Report,
James P. Anderson Co., Fort Washington,
Pennsylvania, USA, pp. 98–17, April 1980.
[2.] Dorothy E. Denning, “An Intrusion Detection
Model,” IEEE Transactions on Software Engineering
(TSE), Volume 13, No. 2, pp. 222–232, February
1987.
[3.] Li Min and Wang Dongliang, “Anomaly
Intrusion Detection Based on SOM,” IEEE WASE
International Conference on Information
Engineering, IEEE Computer Society, 2009, pp. 40–44.
[4.] Lida Rashidi, Sattar Hashem and Ali Hamzeh,
“Anomaly detection in categorical datasets using
Bayesian networks,” AICI’11 Proceedings of the
Third International Conference on Artificial
Intelligence and Computational Intelligence, Volume
Part II, Springer-Verlag, Berlin, Heidelberg, 2011,
pp. 610–619.
[5.] Knowledge Discovery in Databases DARPA
archive. Task Description, KDDCUP 1999 Dataset,
http://www.kdd.ics.uci.edu/databases/kddcup99/task.html
[6.] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and
Ali A. Ghorbani, “A Detailed Analysis of the KDD
CUP 99 Data Set,” Proceedings of the 2009 IEEE
Symposium on Computational Intelligence in
Security and Defense Applications (CISDA 2009),
IEEE 2009.
[7.] M. Panda, and M. R. Patra, “Network intrusion
detection using naive Bayes,” International Journal of
Computer Science and Network Security (IJCSNS),
Volume -7, No. 12, December 2007, pp. 258–263.
[8.] Meng Jianliang, Shang Haikun, “The application
on intrusion detection based on K-Means cluster
algorithm,” International Forum on Information
Technology and Application, 2009.
[9.] Zubair A. Baig, Abdulrhman S. Shaheen, and
Radwan AbdelAal, “An AODE-based Intrusion
Detection System for Computer Networks,” pp. 28–
35, IEEE 2011.
[10.] Hai Nguyen, Katrin Franke and Slobodan
Petrović, “Improving Effectiveness of Intrusion
Detection by Correlation Feature Selection,”
International Conference on Availability, Reliability
and Security, pp. 17–24, IEEE 2010.
[11.] Jiong Zhang and Mohammad Zulkernine,
“Anomaly based Network Intrusion detection with
unsupervised outlier detection,” School of Computing
Queen’s University, Kingston, Ontario, Canada.
IEEE International Conference ICC 2006, Volume-9,
pp. 2388-2393, 11-15 June 2006.
[12.] Cuixiao Zhang, Guobing Zhang, Shanshan Sen.,
“A mixed unsupervised clustering based Intrusion
detection model,” Third International Conference on
Genetic and Evolutionary Computing, 2009.
[13.] Jiaqi Jiang, Ru Li, Tianhong Zheng, Feiqin Su,
Haicheng Li, “A new intrusion detection system
using Class and Sample Weighted C-Support Vector
Machine,” Third International Conference on
Communications and Mobile Computing, IEEE
Computer Society, 2011, pp. 51–54.
[14.] E. T. Ferreira, G. A. Carrijo, R. de Oliveira and
N. V. S. Araujo, “Intrusion Detection System with
Wavelet and Neural Artificial Network Approach for
Network Computers,” IEEE Latin America
Transactions, Vol. 9, No. 5, September 2011, pp. 832–837.
[15.] Yang Zhong, Hirohumi Yamaki, Hiroki
Takakura, “A Grid-Based Clustering for Low-Overhead
Anomaly Intrusion Detection,” IEEE 2011,
pp. 17–24.
[16.] Lida Rashidi, Sattar Hashem and Ali Hamzeh,
“Anomaly detection in categorical datasets using
Bayesian networks,” AICI’11 Proceedings of the
Third International Conference on Artificial
Intelligence and Computational Intelligence, Volume
Part II, Springer-Verlag, Berlin, Heidelberg, 2011,
pp. 610–619.
[17.] R. Shadmehr and Z. D’Argenio, “A comparison
of a neural network based estimator and two
statistical estimators in a sparse and noisy
environment”, In IJCNN-90 Proceedings of the
international joint conference on neural networks,
289-292,Ann Arbor, Mi, IEEE Neural Networks
Council, 1990.
[18.] Julie M. David and Kannan Balakrishnan, (2010)
“Significance of Classification Techniques In
Prediction Of Learning Disabilities”, International
Journal of Artificial Intelligence & Applications
(IJAIA), Vol.1, No.4.
[19.] R. Dogaru, “A modified Naive Bayes classifier for
efficient implementations in embedded systems,”
Signals, Circuits and Systems (ISSCS), IEEE 10th
International Symposium on, Iasi, June 30, 2011 –
July 1, 2011, pp. 1–4.
[20.] Jiawei Han and Micheline Kamber, “Data Mining
Concepts and Techniques,” Second Edition, University
of Illinois at Urbana-Champaign, The Morgan
Kaufmann Series in Data Management Systems,
Elsevier 2007.
[21.] Rebecca G. Bace, “Intrusion Detection” Sams,
December 1999, 20th International Conference Very
Large Data Bases(VLDB), Morgan Kaufmann, pp.
487–499.
[22.] Pingchuan Ma, “Log Analysis-Based Intrusion
Detection via Unsupervised Learning” Master of
Science, School of Informatics, University of
Edinburgh, 2003.
[23.] John Ross Quinlan, (1993) “C4.5: Programs for
Machine Learning”, Morgan Kaufmann Publishers,
San Mateo,CA.1993.
[24.] Kamarulrifin Abd Jalil and Mohamad Noorman
Masrek, “Comparison of Machine Learning
Algorithms Performance in Detecting Network
Intrusion,” IEEE 2010 International Conference on
Networking and Information Technology, pp. 221–226.
[25.] Shaohua Teng, Hongle Du, Wei Zhang, Xiufen
Fu, “A Cooperative Network Intrusion Detection
Based on Heterogeneous Distance Function
Clustering,” 2010 14th IEEE International
Conference on Computer Supported Cooperative
Work in Design, pp. 140–145, 14–16 April 2010.