Selecting Features for Intrusion Detection:
A Feature Relevance Analysis on KDD 99 Benchmark
H. Güneş Kayacık
Nur Zincir-Heywood
Malcolm I. Heywood
Motivation
• Machine learning in detection.
• Raw data → High level events
• Need a set of features
• Not “any” feature, “good” features
• How do we quantify “good”?
The Data
• DARPA 98 and 99 datasets.
• Simulated activity.
• Network traffic → connection records
• 41 features per connection.
[Chart: connections per class. DoS1: 280790, DoS2: 107201, Normal: 97277]
The Data
• 494,000 connections in the dataset.
• 23 Class Labels
  ▪ 22 Attacks (DoS, probe, content based)
  ▪ “Normal”
• 41 Features (a few examples)
  ▪ Duration
  ▪ Service
  ▪ Protocol
  ▪ Data transfer
  ▪ Failed login attempts
  ▪ FTP commands
  ▪ Root shells
  ▪ “Su” attempts
Previous IDS Work
• Decision trees, neural nets, clustering, SVM, EC
• High detection rate (98%), low false positive rate (0.5%)
• Some attacks are detected better than others.
• Our task: Substantiate the performance of detectors.
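The two rates quoted above can be written out explicitly. The following is a minimal sketch, assuming the usual definitions (detection rate as the fraction of attacks flagged, false positive rate as the fraction of normal connections wrongly flagged). The function names and the counts are hypothetical; the counts were chosen only so that the result matches the 98% and 0.5% on the slide.

```python
def detection_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of attack connections that the detector flags."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of normal connections that the detector wrongly flags."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts, chosen only to reproduce the rates quoted on the slide.
dr = detection_rate(true_positives=980, false_negatives=20)
fpr = false_positive_rate(false_positives=5, true_negatives=995)
print(f"detection rate = {dr:.1%}, false positive rate = {fpr:.1%}")
```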
Information Gain
From Data Mining Course at KDNuggets site [http://www.kdnuggets.com/dmcourse/data_mining_course]
• Used in decision trees.
• Which feature leads to the purest branching?
  ▪ Gain (“Windy”) = 0.02
  ▪ Gain (“Humidity”) = 0.971
  ▪ Gain (“Temperature”) = 0.571
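As a rough illustration of how such gains are computed: information gain is the entropy of the class labels minus the weighted entropy that remains after splitting on a feature. The sketch below assumes the slide's figures come from the five "sunny" days of the textbook weather dataset; that data is not shown on the slide, so treat the rows and the helper names `entropy` and `information_gain` as assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Assumed toy data: the five "sunny" days of the textbook weather dataset.
play = ["no", "no", "no", "yes", "yes"]
temperature = ["hot", "hot", "mild", "cool", "mild"]
humidity = ["high", "high", "high", "normal", "normal"]
windy = [False, True, False, False, True]

for name, column in [("Windy", windy), ("Humidity", humidity), ("Temperature", temperature)]:
    print(f"Gain({name}) = {information_gain(column, play):.3f}")
```

Humidity splits these five days into two pure groups, which is why its gain equals the full entropy of the labels.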
Methodology
• Classes: 22 Attacks + 1 Normal
• Binary classification (Why?)
For Class A:
  Record            Original class   Binary label
  1, 0.5, 90, 8     Class A          1
  3, 0.01, 7, 9     Class B          0
  2, 0.1, 7, 10     Class A          1
  5, 0.2, 10, 1     Class C          0
• 23 Info. Gains per feature (vs. 1 Info. Gain per feature)
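The relabeling step can be sketched as follows: each class in turn becomes label 1 and every other class becomes label 0, and the information gain of a feature is computed against that binary labeling, giving one gain per class. This is a sketch under assumptions, not the authors' implementation: the records are hypothetical, the feature is discrete (the slides do not say how continuous features were handled), and `per_class_gains` is an illustrative name.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())


def per_class_gains(feature_values, labels):
    """One information gain per class: the target class is relabeled 1, all others 0."""
    gains = {}
    for target in sorted(set(labels)):
        binary = [1 if label == target else 0 for label in labels]
        gains[target] = information_gain(feature_values, binary)
    return gains


# Hypothetical records: one discrete feature (protocol) and a class label.
protocol = ["icmp", "icmp", "tcp", "tcp", "tcp", "udp"]
labels = ["smurf", "smurf", "neptune", "neptune", "normal", "normal"]

class_gains = per_class_gains(protocol, labels)
for target, gain in class_gains.items():
    print(f"{target}: {gain:.3f}")
```

In this toy data the feature separates smurf perfectly but only partly separates neptune from normal, a distinction that a single multi-class gain would blur.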
Max. Information Gain
• Some relevant, some not
• Features 20 and 21
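Given one gain per class for each feature, the maximum over classes is a simple reduction. The sketch below uses made-up numbers and hypothetical feature names (`feature_a` and so on); it only shows the shape of the computation, including how a feature with no gain for any class would be flagged.

```python
# Hypothetical per-class gains for three features; the numbers are made up.
gains = {
    "feature_a": {"normal": 0.62, "smurf": 0.91, "neptune": 0.40},
    "feature_b": {"normal": 0.05, "smurf": 0.02, "neptune": 0.11},
    "feature_c": {"normal": 0.0, "smurf": 0.0, "neptune": 0.0},
}


def max_gain(per_class):
    """Return the (class, gain) pair with the largest gain for one feature."""
    best_class = max(per_class, key=per_class.get)
    return best_class, per_class[best_class]


summary = {feature: max_gain(per_class) for feature, per_class in gains.items()}
# Features whose best gain is zero carry no information about any class.
irrelevant = [feature for feature, (_, gain) in summary.items() if gain == 0.0]

for feature, (best_class, gain) in summary.items():
    print(f"{feature}: max gain {gain:.2f} (class {best_class})")
print("no gain for any class:", irrelevant)
```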
For each class…
• Neptune (DoS) + smurf (DoS) + normal = 98% of connections
[Chart: Info. Gain (y-axis, 0 to 1) for each class (x-axis): back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster]
Relevant Classes
[Chart: classes shown: normal, smurf, neptune, teardrop, land, ftp_write, back, buffer_overflow, guess_pwd, warezclient]
• 31 of 41 features are most relevant for the 3 major classes.
• 9 features contributed very little.
• Relevant Features
  ▪ Connection Size
  ▪ Diff. Service Rate
  ▪ Connection state
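One way a tally such as "31 of 41 features for 3 major classes" could be produced is to find, for every feature, the class with the highest gain and then count features per class. The sketch below assumes that reading of the slide; the gains, the feature names, and the helper `most_relevant_class` are hypothetical.

```python
from collections import Counter

# Hypothetical gains[feature][class]; the values are illustrative only.
gains = {
    "feature_a": {"normal": 0.62, "smurf": 0.91, "neptune": 0.40},
    "feature_b": {"normal": 0.35, "smurf": 0.02, "neptune": 0.11},
    "feature_c": {"normal": 0.10, "smurf": 0.55, "neptune": 0.20},
    "feature_d": {"normal": 0.0, "smurf": 0.0, "neptune": 0.0},
}


def most_relevant_class(per_class):
    """Class whose binary labeling gives this feature its highest gain."""
    return max(per_class, key=per_class.get)


# Skip features with zero gain everywhere: they are relevant to no class.
features_per_class = Counter(
    most_relevant_class(per_class)
    for per_class in gains.values()
    if max(per_class.values()) > 0
)

for cls, count in features_per_class.most_common():
    print(f"{cls}: {count} feature(s)")
```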
Conclusions
• Relevance analysis on KDD 99 dataset.
• Relevance → Information gain.
• Key Points
  ▪ Easy to classify 3 major classes.
  ▪ Few features highly useful.
  ▪ Few features completely useless.
• New measures and extended analysis.
Thank You!
• You can find more information about our research at: www.cs.dal.ca/projectx.