Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Benchmark
H. Güneş Kayacık, Nur Zincir-Heywood, Malcolm I. Heywood

Motivation
• Machine learning in detection.
• Raw data to high-level events.
• Need a set of features.
• Not “any” feature, “good” features.
• How do we quantify “good”?

The Data
• DARPA 98 and 99 datasets.
• Simulated activity.
• Network traffic connection records.
• 41 features per connection.
[Chart: connection counts. DoS1: 280790, DoS2: 107201, Normal: 97277]

The Data
• 494,000 connections in dataset.
• 23 Class Labels
    • 22 Attacks (DoS, probe, content based)
    • “Normal”
• 41 Features (few examples)
    • Duration
    • Service
    • Protocol
    • Data transfer
    • Failed login attempts
    • FTP commands
    • Root shells
    • “Su” attempts

Previous IDS Work
• Decision trees, neural nets, clustering, SVM, EC.
• High detection (98%), low FP (0.5%).
• Some attacks are detected better than others.
• Our task: Substantiate the performance of detectors.

Information Gain
From Data Mining Course at KDNuggets site [http://www.kdnuggets.com/dmcourse/data_mining_course]
• Used in decision trees.
• Which feature leads to the purest branching?
    Gain (“Windy”) = 0.02
    Gain (“Humidity”) = 0.971
    Gain (“Temperature”) = 0.571

Methodology
• Classes: 22 Attacks + 1 Normal
• Binary classification (Why?)
  For Class A:
    Feature values     Class      Label
    1, 0.5, 90, 8      Class A    1
    3, 0.01, 7, 9      Class B    0
    2, 0.1, 7, 10      Class A    1
    5, 0.2, 10, 1      Class C    0
• 23 Info. Gains per feature (vs. 1 Info. Gain per feature)

Max. Information Gain
• Some features are relevant, some are not.
• Features 20 and 21

For each class…
• Neptune (DoS) + smurf (DoS) + normal = 98%
[Figure: bar chart of Info. Gain (0 to 1) per class; x-axis: the 23 class labels]

Relevant Classes
[Figure: chart by class: normal, smurf, neptune, teardrop, land, ftp_write, back, buffer_overflow, guess_pwd, warezclient]
• 31 of the 41 features are most relevant for the 3 major classes.
• 9 features contributed very little.
• Relevant Features
    • Connection Size
    • Diff. Service Rate
    • Connection state

Conclusions
• Relevance analysis on KDD 99 dataset.
• Relevance measured by information gain.
• Key Points
    • Easy to classify 3 major classes.
    • Few features highly useful.
    • Few features completely useless.
• New measures and extended analysis.

Thank You!
• You can find more information about our research at: www.cs.dal.ca/projectx.
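
Appendix: sketch of the per-class information gain

The Methodology slide computes one information gain per feature and per class by relabeling the data as "this class vs. everything else". The following is a minimal sketch of that idea, not the authors' implementation. The function names (`entropy`, `information_gain`, `per_class_gains`) are hypothetical, and the code assumes feature values are discrete; continuous features would have to be binned first, and the slides do not say how that was done. The toy rows are the four records from the Methodology slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    on each distinct feature value (assumes discrete values)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def per_class_gains(rows, classes, feature_index):
    """One gain per class for a single feature, using a binary
    one-vs-rest relabeling (1 = target class, 0 = any other class)."""
    column = [row[feature_index] for row in rows]
    gains = {}
    for target in sorted(set(classes)):
        binary = [1 if c == target else 0 for c in classes]
        gains[target] = information_gain(column, binary)
    return gains


# Toy records from the Methodology slide (four features each).
rows = [(1, 0.5, 90, 8), (3, 0.01, 7, 9), (2, 0.1, 7, 10), (5, 0.2, 10, 1)]
classes = ["Class A", "Class B", "Class A", "Class C"]

gains = per_class_gains(rows, classes, feature_index=2)
best_class = max(gains, key=gains.get)  # class for which this feature is most informative
print(gains, best_class)
```

Taking the maximum over the per-class gains, as in the last lines, corresponds to the "Max. Information Gain" view of a feature. With only four records the numbers are illustrative only.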