Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 1, January – February 2013 ISSN 2278-6856 An Efficient Feature Reduction Comparison of Machine Learning Algorithms for Intrusion Detection System Upendra Assistant Professor, CSE Department, NIT Raipur, C.G., India Abstract: Organisation has come to recognize that applied science in network security has become very important in protecting its information. Intrusion detection present an important line of defend against all variety of attacks that can compromise the security and proper functioning of information system initiative. In this paper we compared the performance of intrusion detection. The evaluation of the Intrusion Detection System (IDS) execution analysis for any given security system configuration improvement is necessary to achieve real time capability. We analyse two learning algorithms (NB and C4.5) for the task of detecting intrusions and compare their relative performances. Keywords: intrusion detection, machine learning, C4.5, NB, KDD 99 1. INTRODUCTION Intrusion detection algorithm should consider the composite properties of attack behaviors to improve the detection speed and detection accuracy. Analyze the large volume of network dataset and the better performances of detection accuracy, intrusion detection become an important research field for machine learning. In this work we have presented C4.5 decision tree algorithm for intrusion detection based on machine learning. The Intrusion Detection System (IDS) is Process of monitoring the events occurring in a computer system or network and analyzing them for signs of possible incidents. IDS was first introduced in 1980 by James. P. Anderson [1] and then improved by D. Denning [2] in 1987. They are two basic approaches for Intrusion Detection techniques, i.e. Anomaly Detection and Misuse Detection (signature-based ID) [3]. Anomaly Detection is basically based on assumption that attacker behavior is different from normal user's behavior [4]. In this paper, we present the application of machine learning to intrusion detection. We analyse two learning algorithms (C4.5 and NB) for the task of detecting intrusions and compare their relative performances. There is only available data set is KDD data set for the purpose of experiment for intrusion detection.KDD data set [5] contain 42 attributes. The classes in KDD99 [6] dataset can be categorized into five main classes such as one normal class and four main intrusion classes. Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases Volume 2, Issue 1 January - February 2013 [18]. The field of intrusion detection has received increasing attention in recent years. 2. RELATED WORK In 2007, Panda and Patra [7] determined a method using naive Bayes to detect signatures of specific attacks. They used KDD99 dataset for experiment, in the early 1980’s, Stanford Research Institute (SRI) developed an Intrusion Detection Expert System (IDES) that monitors user behavior and detects suspicious events. Meng Jianliang [8] used the K Mean algorithm to cluster and analyze the data. He used the unsupervised learning technique for the intrusion detection. Mohammadreza Ektefa et al., [8] in 2010, compared C4.5 with SVM and the results revealed that C4.5 algorithms better than SVM in detecting network intrusions and false alarm rate. Zubair A.Baig et al. (2011) proposed An AODE-based Intrusion Detection System for Computer Networks. They suggested that the Naive Bayes (NB) does not accurately detect network intrusions [9]. In 2010, Hai Nguyen et al. [10] applied C4.5 and BayesNet for intrusion detection on KDD CUP’99 Dataset. Jiong Zhang and Mohammad Zulkernine [11] done the intrusion detection using the random forest algorithms in anomaly based NIDS. Cuixio Zhang, Guobing Zhang, Shanshan Sun [12] used the missed approach for the intrusion detection. Various paradigms namely Support Vector Machine [13], Neural Networks[14], K-means based clustering[15] have been applied to intrusion detection because it has the advantage of discovering useful knowledge that describes a user’s or program’s behavior. They are two basic approaches for Intrusion Detection techniques, i.e. Anomaly Detection and Misuse Detection. (signature-based ID) Anomaly Detection is basically based on assumption that attacker behavior is different from normal user's behavior [16]. Shadmehr et al [17] showed that the performance of Bayes algorithm is better. Recently research on machine learning for intrusion detection has standard much attention in the computational intelligence community. In intrusion detection algorithm, immense strengths of audit data must be analyzed in order to conception new detection rules for increasing number of novel attacks in high speed network. Intrusion detection algorithm should consider the composite properties of attack behaviors to improve the detection speed and detection accuracy. Page 66 International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 1, January – February 2013 ISSN 2278-6856 Predicted Class Positive A 3. METHODOLOGY 3.1 Naïve Bayes (NB) A Naive Bays classifier [19],[20] is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests[25]. 3.2 C4.5 Decision tree technology is a common, intuitionist and fast classification method [22]. Decision tree J48 developed by Johan Ross Quinlan [23]. C4.5 is an extension of Quinlan's earlier the Interactive Dichotomizer3 (ID3) Algorithm. J48 builds decision trees from a set of labelled training data using the concept of information entropy.The Decision tree is a classifier expressed as a recursive partition of the instance space, consists of nodes that form a rooted tree, meaning it is a directed tree with a node called a root that has no incoming edges referred to as an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes). Decision trees [24] are one of the most commonly classification methods used in supervised learning approaches. TABLE I. SELECTED 7 ATTRIBUTE WITH HIGHEST INFORMATION GAIN S No. Feature name 1 2 3 4 5 6 service src_bytes dst_bytes logged_in count dst_host_diff_srv_rate 7 dst_host_srv_diff_host_rate 8 Class 3.3 Evaluation We constructed a confusion matrix (contingency table) to evaluate the classifier’s performance. TABLE II. A SAMPLE CONFUSION MATRIX Actual Class Positive C D Actual Class Negative In this confusion matrix, the value A is called a true positive and the value D is called a true negative. The value B is referred to as a false negative and C is known as false positive. 3.3 Accuracy This is the most basic measure of the performance of a learning method. This measure determines the percentage of correctly classified instances. From the confusion matrix, we can say that: A+ D Accuracy = —————— A + B + C+ D Precision = ratio of number of documents retrieved that are relevant to the total number of documents that are retrieved Referring from the confusion matrix, we can define precision and recall for our purposes as A Precision = ———— A+C Recall = ratio of number of documents retrieved that are relevant to the total number of documents that are relevant A Recall = ———— A+B (1) IV. performance Evaluation and Result The Tables III and IV Shows the performance of C4.5 and NB classification methods based on accuracy, Learning time, Error rate, Average true positive rate, Average False positive rate, Average precision and Average F-Measure respectively. The comparison is performed for 41, 11 and 7 attributes. The C4.5 and NB classifier models on the dataset are built and tested by means of 10-fold cross-validation. The Java Heap size was set to 1024 MB for WEKA 3.6.2, the simulation platform is an Intel™ Core i3-2100 processor system with 3 GB RAM under Microsoft Windows XP™ Service Pack-2 operating system, 3.10 GHz with 500 GB memory. C4.5 was evaluated on the dataset by taking into account 7 feature reductions using the Information Gain measure. The results of this test are summarized in the following table. TABLE III. RESULT OF C4.5 WITH SELECTED 7 ATTRIBUTES Parameter Value Accuracy 99.8901 % Learning Time Volume 2, Issue 1 January - February 2013 Predicted Class Negative B 12.67 sec Page 67 International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 1, January – February 2013 ISSN 2278-6856 Error Rate 0.1099 % Average True Positive Rate 0.999 Average False Positive Rate Average Precision 0.001 Average Recall 0.999 is desirable for good Intrusion Detection System. Especially in the case of C4.5 accuracy is 99 % for all attribute, 11 attribute and 7 attribute. Now we compare the True Positive Rate of the C4.5, and NB algorithm with selected 7 attributes. 0.999 0.999 Similar to the C4.5 tests, NB was also evaluated by taking into account 7 features of the dataset. The results of this evaluation are summarized in the table below. TABLE IV. RESULT OF NB WITH SELECTED 7 ATTRIBUTES Parameter Value Accuracy 93.5698 % Learning Time 1.3 sec Error Rate 6.4302 % Average True Positive Rate Average False Positive Rate Average Precision 0.936 Average Recall 0.936 Average F-Measure 0.054 True Positive Rate Average F-Measure 1.2 1 0.8 0.6 0.4 0.2 0 C4.5 NB Attacks Figure 2. Comparison of TPR for C4.5 and NB with selected 7 attribute. 0.949 0.942 In this paper, we have done the feature reduction to 7 attribute and gave the result. Now we compare the result of the C4.5 and NB algorithms with reduced 7 attribute than only we conclude that which one algorithm is good best for the intrusion detection. Now the figure 1. given below show the comparison of the accuracy of C4.5 and NB. For good IDS True Positive Rate should be high. Above figure 2. Shows that True Positive Rate of the C4.5 algorithm is higher when we reduce the feature of the data set using information gain. Especially in the case of C4.5 True Positive Rate is 1 and Figure 2. also shows that TPR of the C4.5 is higher than the NB algorithm with selected 7 attribute. Now we compare the False Positive Rate of the C4.5 and NB algorithm with selected 7 attributes. Types of attack group such as four main categories: (Probing, DoS: denial-of-service, R2L: remote to local, and U2R: user to root) and one normal class. Figure 1. Comparison of accuracy for C4.5 and NB using all attribute, selected 11 attribute and selected 7 attribute. From above figure1. It is clear that information gain feature reduction method gives the better accuracy which Volume 2, Issue 1 January - February 2013 Figure 3. Comparison of FPR for C4.5 and NB using selected 7 attribute. Page 68 International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 1, January – February 2013 ISSN 2278-6856 For a better Intrusion Detection System, false positive rate should be low. Above figure 3. Shows that FPR of the C4.5 algorithm is lower when we reduce the 7 feature of the data set using information gain. Especially in the case of C4.5 FPR is 0.In the case of NB algorithm False Positive Rate of the greater than 0 with selected 7 attribute. Figure 4. Comparison of time taken to build model for C4.5 and NB using all attribute, selected 11 and 7 attribute. From above figures 1, 2, 3 and 4 it is clear that C4.5 algorithm Accuracy, TPR and FPR is better than NB algorithms. So we can say that reduction of the feature using information gain is better technique. 4. CONCLUSIONS In this paper we have showed C4.5 selected 7 attribute technique for intrusion detection and performed feature set reduction and evaluated their performance. From the result, it is observed that after applying the feature selection from 41 attributes to 11 and 7 attributes. The overall performance of C4.5 has increased their performance than NB. The comparison results show that, in general, the C4.5 has the highest classification accuracy performance with the lowest error rate. C4.5 achieves better detection rates than NB and increased true positive rate. On the other hand, we also found that drastically decreased in learning time of the algorithm. We evaluated two machine learning algorithms C4.5 and NB on the dataset built in this exercise. Based on the accuracy, true positive rate, false positive rate, error rate and learning time C4.5 performed best classifier. Volume 2, Issue 1 January - February 2013 REFERENCES [1.] James P. Anderson, “Computer Security Threat Monitoring and Surveillance,” Technical Report, James P.Anderson Co., Fort Washington, Pennsylvania, USA, pp. 98–17, April 1980. [2.] Dorothy E. Denning, “An Intrusion Detection Model,” IEEE Transaction on Software Engineering (TSE), volume–13, No.2, pp.222–232, February 1987. [3.] LI Min and Wang Dongliang, “Anomaly Intrusion Detection Based on SOM,” IEEE WASE International Conference on Information Engineering, IEEE Computer Society, 2009, pp. 4044. [4.] Lida Rashidi,Sattar Hashem and Ali Hamzeh, “Anomaly detection in categorical datasets using Bayesian networks,” AICI’11 Proceedings of the Third International Conference on Artificial Intelligence and Computational Intelligence, Volume Part II, Springer-Verlag, Berlin ,Heidelberg, 2011, pp. 610–619. [5.] Knowledge Discovery in Databases DARPA archive. Task Description ,KDDCUP 1999 Dataset, http://www.kdd.ics.uci.edu/databases/kddcup99/task. html [6.] Mahbod Tavallaee,Ebrahim Bagheri,Wei Lu, and Ali A.Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set,” Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Application(CISDA 2009),IEEE 2009. [7.] M. Panda, and M. R. Patra, “Network intrusion detection using naive Bayes,” International Journal of Computer Science and Network Security (IJCSNS), Volume -7, No. 12, December 2007, pp. 258–263. [8.] Meng Jianliang, Shang Haikun, “The application on intrusion detection based on K-Means cluster algorithm,” International Forum on Information Technology and Application, 2009. [9.] Zubair A. Baig, Abdulrhman S. Shaheen, and Radwan AbdelAal, “An AODE-based Intrusion Detection System for Computer Networks,” pp. 28– 35, IEEE 2011. [10.] Hai Nguyen, Katrin Franke and Slobodan Petrovi’c, “Improving Effectiveness of Intrusion Detection by Correlation Feature Selection,” International Conference on Availability, Reliability and Security, pp. 17–24, IEEE 2010. [11.] Jiong Zhang and Mohhammad Zulkernine, “Anomaly based Network Intrusion detection with unsupervised outlier detection,” School of Computing Queen’s University, Kingston, Ontario, Canada. IEEE International Conference ICC 2006, Volume-9, pp. 2388-2393, 11-15 June 2006. Page 69 International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 1, January – February 2013 ISSN 2278-6856 [12.] Cuixiao Zhang, Guobing Zhang, Shanshan Sen., “A mixed unsupervised clustering based Intrusion detection model,” Third International Conference on Genetic and Evolutionary Computing, 2009. [13.] Jiaqi Jiang, Ru Li,Tianhong Zheng,Feiqin Su, Haicheng Li, “A new intrusion detection system using Class and Sample Weighted C-Support Vector Machine”, Third International Conference on Communications and Mobile Computing, IEEE Computer Society,2011,pp-51-54. [14.] E. T. Ferreira, G. A. Carrijo, R. de Oliveira and N. V. S. Araujo, “Intrusion Detection System with Wavelet and Neural Artifical Network Approach for Network Computers,” IEEE Latin America Transactions, Vol. 9, No. 5, September 2011,pp-832837. [15.] Yang Zhong, Hirohumi Yamaki, Hiroki Takakura, “A Grid-Based Clustering for LowOverhead Anomaly Intrusion Detection,” IEEE 2011, pp-17-24. [16.] Lida Rashidi,Sattar Hashem and Ali Hamzeh, “Anomaly detection in categorical datasets using bayesian networks,” AICI’11 Proceedings of the Third International Conference on Artificial Intelligence and Computational Intelligence, Volume Part II, Springer-Verlag, Berlin ,Heidelberg, 2011, pp.610–619. [17.] R. Shadmehr and Z. D’Argenio, “A comparison of a neural network based estimator and two statistical estimators in a sparse and noisy environment”, In IJCNN-90 Proceedings of the international joint conference on neural networks, 289-292,Ann Arbor, Mi, IEEE Neural Networks Council, 1990. [18.] Julie M. David and Kannan Balakrishnan, (2010) “Significance of Classification Techniques In Prediction Of Learning Disabilities”, International Journal of Artificial Intelligence & Applications (IJAIA), Vol.1, No.4. [19.] R.Dogaru ,“A modified Naive Bayes classifier for efficient implementations in embedded systems,” Signals Circuits and Systems (ISSCS), IEEE 10th International Symposium on Lasi, June 30, 2011 – July 1, 2011 , pp.1–4. [20.] Jiawei Han and Micheline kamber, “Data Mining Concepts and Techniques,”Second Edition,University of Illinois at Urbana-Champaign The Morgan Kaufmann Series in Data Management Systems, Elsevier 2007. [21.] Rebecca G. Bace, “Intrusion Detection” Sams, December 1999, 20th International Conference Very Large Data Bases(VLDB), Morgan Kaufmann, pp. 487–499. [22.] Pingchuan Ma, “Log Analysis-Based Intrusion Detection via Unsupervised Learning” Master of Science, School of Informatics, University of Edinburgh, 2003. Volume 2, Issue 1 January - February 2013 [23.] John Ross Quinlan, (1993) “C4.5: Programs for Machine Learning”, Morgan Kaufmann Publishers, San Mateo,CA.1993. [24.] Kamarulrifin Abd Jalil and Mohamad Noorman Masrek,“Comparison of Machine learning Algorithms Performance in Detecting Network Intrusion”, IEEE 2010 International Conference on Networking and Information Technology,pp.221226. [25.] Shaohua Teng, Hongle Du Wei Zhang, Xiufen Fu“A Cooperative Network Intrusion Detection Based on Heterogeneous Distance Function Clustering”,2010, 14th IEEE International Conference on Computer Supported Cooperative Work in Design. pp 140-145, 14-16 April 2010. Page 70