Data Mining Approaches for
Intrusion Detection
Wenke Lee and Salvatore J. Stolfo
Computer Science Department
Columbia University
Overview
• Intrusion detection and computer security
• Current intrusion detection approaches
• Our proposed approach
• Data mining
• Classification models for intrusion detection
• Mining patterns from audit data
• System architecture
• Current status
• Research plans
Overview
• Current intrusion detection approaches and
problems
• Our proposed approach
• Data mining
• Classification models for intrusion detection
• Mining patterns from audit data
• System architecture
• Current status
• Research plans
Intrusion Detection and Computer
Security
• Computer security goals: confidentiality,
integrity, and availability
• Intrusion is a set of actions aimed to
compromise these security goals
• Intrusion prevention (authentication,
encryption, etc.) alone is not sufficient
• Intrusion detection is needed
Intrusion Detection
• Primary assumption: user and program
activities can be monitored and modeled
• Key elements:
– Resources to be protected
– Models of the “normal” or “legitimate”
behavior on the resources
– Efficient methods that compare real-time
activities against the models and report
probable “intrusive” activities.
[Architecture diagram: in a Base Detection Agent, Audit Records pass through an Audit Data Preprocessor to become Activity Data; a Learning Agent's Inductive Learning Engine builds Detection Models (Rules) from the audit records; the (Base) Detection Engine applies the rules to the activity data and sends Evidence to a Meta Detection Agent, whose (Meta) Detection Engine uses a Decision Table to combine it with Evidence from Other Agents into a Final Assertion; a Decision Engine turns the assertion into an Action/Report.]
[Figure: raw tcpdump output is preprocessed into connection records, which feed the learning engine.]

tcpdump packets:
10:35:41.5 128.59.23.34.30 > 113.22.14.65.80 : . 512:1024(512) ack 1 win 9216
10:35:41.5 102.20.57.15.20 > 128.59.12.49.3241: . ack 1073 win 16384
10:35:41.6 128.59.25.14.2623 > 115.35.32.89.21: . ack 2650 win 16225

Connection records:
time      dur    src   dst   bytes   srv    …
10:35:41  1.2    A     B     42      http   …
10:35:41  0.5    C     D     22      user   …
10:35:41  10.2   E     F     1036    ftp    …
…         …      …     …     …       …      …
[Figure: a truss trace of a process (execve(“/usr/ucb/finger”, …), open(“/dev/zero”, …), mmap(…), …) is converted into a system call sequence (execve, open, mmap, …), from which learning builds a normal profile.]
Intrusion Detection
• Two categories of techniques:
– Misuse detection: use patterns of well-known
attacks to identify intrusions
– Anomaly detection: use deviation from normal
usage patterns to identify intrusions
Current Intrusion Detection
Approaches
• Misuse detection:
– Record the specific patterns of intrusions
– Monitor current audit trails (event sequences)
and pattern matching
– Report the matched events as intrusions
– Representation models: expert rules, Colored
Petri Net, and state transition diagrams
Current Intrusion Detection
Approaches
• Anomaly detection:
– Establishing the normal behavior profiles
– Observing and comparing current activities
with the (normal) profiles
– Reporting significant deviations as intrusions
– Statistical measures as behavior profiles:
ordinal and categorical (binary and linear)
Current Intrusion Detection
Approaches
• Main problems: manual and ad-hoc
– Misuse detection:
• Known intrusion patterns have to be hand-coded
• Unable to detect any new intrusions (that have no
matched patterns recorded in the system)
– Anomaly detection:
• Selecting the right set of system features to be
measured is ad hoc and based on experience
• Unable to capture sequential interrelation between
events
Our Proposed Approach
• A systematic framework to:
– Build good models:
• select appropriate features of audit data to build
intrusion detection models
– Build better models:
• architect a hierarchical detector system that
combines multiple detection models
– Build updated models:
• dynamically update and deploy new detection
system as needed
Our Proposed Approach
• Support for the feature selection and model
construction process:
– Apply data mining algorithms to find consistent
inter- and intra- audit record (event) patterns
– Use the features and time windows in the
discovered patterns to build detection models
– A support environment to semi-automate this
process
Our Proposed Approach
• Combining multiple detection models:
– Each (base) detector model monitors one aspect
of the system
– They can employ different techniques and be
independent of each other
– The learned (meta) detector combines evidence
from a number of base detectors
Our Proposed Approach
• An intelligent agent-based architecture:
– learning agents: continuously compute (learn)
the detection models
– detection agents: use the (updated) models to
detect intrusions
Data Mining
• KDD (Knowledge Discovery in Database):
– The process of identifying valid, useful and
understandable patterns in data
– Steps: understanding the application domain,
data preparation, data mining, interpretation,
and utilizing the discovered knowledge
– Data mining: applying specific algorithms to
extract patterns from data
Data Mining
• Relevant data mining algorithms:
– Classification: maps a data item into one of
several pre-defined categories
– Link analysis: determines relations between
fields in the database
– Sequence analysis: models sequence patterns
Data Mining
• Why is it applicable to intrusion detection?
– Normal and intrusive activities leave evidence
in audit data
– From the data-centric point of view, intrusion
detection is a data analysis process
– Successful applications in related domains, e.g.,
fraud detection, fault/alarm management
Building Classifiers for Intrusion
Detection
• Experiments in constructing classification
models for anomaly detection
• Two experiments:
– sendmail system call data
– network tcpdump data
• Use meta classifier to combine multiple
classification models
Classification Models on sendmail
• The data: sequence of system calls made by
sendmail.
• Classification models (rules): describe the
“normal” patterns of the system call sequences.
• The rule set is the normal profile of sendmail
• Detection: calculate the deviation from the profile
– a large number or high score of “violations” of the rules in
a new trace suggests an exploit
Classification Models on sendmail
• The sendmail data:
– Each trace has two columns: the process ids and
the system call numbers
– Normal traces: sendmail and sendmail daemon
– Abnormal traces: sscp (sunsendmailcp), syslog-remote,
syslog-local, decode, sm565a, and sm5x attacks.
Classification Models on sendmail
• Data preprocessing:
– Use sliding window to create sequence of
consecutive system calls
– Label the sequences to create training data:

  sequences (length 7)     class labels
  4 2 66 66 4 138 66       “normal”
  5 5 5 4 59 105 104       “abnormal”
  …                        …
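As a rough illustration of this preprocessing step (not the authors' actual tooling), the following Python sketch slides a length-7 window over a trace of system call numbers and attaches a class label; the trace contents here are made up:

def make_sequences(trace, label, n=7):
    """Slide a length-n window over a list of system call numbers
    and attach a class label to every resulting sequence."""
    return [(tuple(trace[i:i + n]), label) for i in range(len(trace) - n + 1)]

# Hypothetical traces; real data would come from the sendmail audit trails.
normal_trace = [4, 2, 66, 66, 4, 138, 66, 5, 5, 105]
attack_trace = [5, 5, 5, 4, 59, 105, 104, 106, 4, 4]

training_data = (make_sequences(normal_trace, "normal") +
                 make_sequences(attack_trace, "abnormal"))
for sequence, label in training_data[:3]:
    print(sequence, label)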
Classification Models on sendmail
• Experiment 1 - learning patterns of normal
sequences:
– Each record: n consecutive system calls plus a
class label, “normal” or “abnormal”
– Training data: sequences from 80% of the
normal traces plus some of the attack traces
– Testing data: traces not used in training
– Use RIPPER to learn specific rules for the
minority classes
sendmail Experiment 1
• Examples of output RIPPER rules:
– if the 2nd system call is vtimes and the 7th is
vtrace, then the sequence is “normal”
– if the 6th system call is lseek and the 7th is
sigvec, then the sequence is “normal”
–…
– if none of the above, then the sequence is
“abnormal”
sendmail Experiment 1
• Using the learned rules to analyze a new
trace:
– label all sequences according to the rules
– define a region as l consecutive sequences
– define an “abnormal” region as having more
“abnormal” sequences than normal ones
– calculate the percentage of “abnormal” regions
– the trace is “abnormal” if the percentage is
above a threshold
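A minimal Python sketch of this scoring scheme, assuming the per-sequence labels have already been produced by the learned rules; the region length and threshold below are illustrative, not the values used in the experiments:

def abnormal_region_percentage(labels, region_len):
    """labels: per-sequence "normal"/"abnormal" labels for one trace.
    A region of region_len consecutive sequences is "abnormal" if it
    contains more "abnormal" labels than "normal" ones."""
    regions = [labels[i:i + region_len]
               for i in range(len(labels) - region_len + 1)]
    abnormal = sum(1 for r in regions
                   if r.count("abnormal") > r.count("normal"))
    return 100.0 * abnormal / len(regions) if regions else 0.0

# Hypothetical labels for a new trace; a real run would first label every
# sequence with the learned RIPPER rule set.
labels = ["normal", "abnormal", "abnormal", "normal", "abnormal", "normal"]
score = abnormal_region_percentage(labels, region_len=3)
print("trace is abnormal" if score > 10.0 else "trace looks normal")  # illustrative threshold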
sendmail Experiment 1
• Hypothesis: need specific rules of “normal”
sequences to detect “unknown/new”
intrusions
• Some results using various normal vs.
abnormal distributions:
– Experiment A: 46% normal, length 11
– Experiment B: 46% normal, length 7
– Experiment C: 54% normal, length 11
– Experiment D: 54% normal, length 7
sendmail Experiment 1
• All 4 experiments:
– Training data includes sequences from intrusion traces
in Bold and Italic, and sequences from 80% of the
normal sendmail traces
– Percentage of abnormal “regions” of each trace
(showed in the table) is used as the intrusion indicator
– The output rule sets contain ~250 rules, each with 2 or
3 attribute tests. This compares with the total ~1,500
different sequences.
• Experiment A and B generate rules that characterize
“normal” sequences of length 11 and 7 respectively
• Experiment C and D generate rules that characterize
“abnormal” sequences of length 11 and 7 respectively
sendmail Experiment 1
traces            Forrest et al.   A      B      C      D
sscp-1            5.2              41.9   32.2   40.0   33.1
sscp-2            5.2              40.4   30.4   37.6   33.3
sscp-3            5.2              40.4   30.4   37.6   33.3
syslog-remote-1   5.1              30.8   21.2   30.3   21.9
syslog-remote-2   1.7              27.1   15.6   26.8   16.5
syslog-local-1    4.0              16.7   11.1   17.0   13.0
syslog-local-2    5.3              19.9   15.9   19.8   15.9
decode-1          0.3              4.7    2.1    3.1    2.1
decode-2          0.3              4.4    2.0    2.5    2.2
sm565a            0.6              11.7   8.0    1.1    1.0
sm5x              2.7              17.7   6.5    5.0    3.0
sendmail          0                1.0    0.1    0.2    0.3
sendmail daemon                    3.4    1.9    0.9    0.7
Anomaly detectors A and B perform better than misuse
detectors C and D.
Classification Models on sendmail
• Experiment 2 - learning to predict normal
system call:
– Each record: n-1 consecutive system calls plus
a class label, the nth or the middle system call
– Training data: sequences from 80% of the
normal traces (no abnormal traces)
– Testing data: traces not used in training
– Use RIPPER to learn rules
sendmail Experiment 2
• Examples of output RIPPER rules:
– if the 3rd system call is lstat and the 4th is
write, then the 7th is stat
– if the 1st system call is sigblock and the 4th is
bind, then the 7th is setsockopt
–…
– if none of the above, then the 7th is open
sendmail Experiment 2
• Using the learned rules to analyze a new
trace:
– predict system calls according to the rules
– if a rule is violated, the “violation” score is
increased by 100 times the accuracy of the rule
– the trace is “abnormal” if the violation score is
above a threshold
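The Python sketch below illustrates this violation scoring; the rules here are hypothetical (condition, predicted call, accuracy) triples rather than actual RIPPER output, and the trace data is made up:

def violation_score(sequences, rules):
    """sequences: (length-(n-1) call tuple, actual nth call) pairs for one trace.
    rules: (predicate, predicted_call, accuracy) triples, checked in order;
    the first matching rule makes the prediction."""
    score = 0.0
    for prefix, actual in sequences:
        for predicate, predicted, accuracy in rules:
            if predicate(prefix):
                if predicted != actual:
                    score += 100.0 * accuracy   # penalize by the rule's accuracy
                break
    return score

# Hypothetical rule set; the last entry plays the role of a default rule.
rules = [
    (lambda p: p[0] == "sigblock" and p[3] == "bind", "setsockopt", 0.95),
    (lambda p: True, "open", 0.52),
]
sequences = [
    (("sigblock", "open", "mmap", "bind", "lseek", "read"), "setsockopt"),
    (("lstat", "write", "stat", "open", "mmap", "close"), "close"),
]
print(violation_score(sequences, rules))   # flag the trace if this exceeds a threshold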
sendmail Experiment 2
• Some results:
– Experiment A: predict the 11th system call
– Experiment B: predict the middle system call in
a sequence of length 7
– Experiment C: predict the middle system call in
a sequence of length 11
– Experiment D: predict the 7th system call
sendmail Experiment 2
• All 4 experiments:
– Training data includes only the sequences from 80% of
the normal sendmail traces
– Output rules predict what should be the “normal” nth or
the middle system call
– Score of rule “violation” (mismatch) of each trace
(showed in the table) is used as the intrusion indicator
– The output rule sets contain ~250 rules, each with 2 or
3 attribute tests. This compares with the total ~1,500
different sequences.
sendmail Experiment 2
Traces            A      B      C      D
sscp-1            24.1   13.5   14.3   24.7
sscp-2            23.5   13.6   13.9   24.4
sscp-3            23.5   13.6   13.9   24.4
syslog-remote-1   19.3   11.5   13.9   24.0
syslog-remote-2   15.9   8.4    10.9   23.0
syslog-local-1    13.4   6.1    7.2    19.0
syslog-local-2    15.2   8.0    9.0    20.2
decode-1          9.4    3.9    2.4    11.3
decode-2          9.6    4.2    2.8    11.5
sm565a            14.4   8.1    9.4    20.6
sm5x              17.2   8.2    10.1   18.0
*sendmail         5.7    0.6    1.2    12.6
sendmail daemon   3.7    3.3    1.2    1.3
The 11th (A) and 4th (B) system calls are more predictable.
Classification Models on sendmail
• Lessons learned:
– Normal behavior can be established and used to
detect anomalous usage
– Need to collect near “complete” normal data in
order to build the “normal” model
– But how do we know when to stop collecting?
– Need tools to guide the audit data gathering
process
Classification Models on tcpdump
• The tcpdump data (part of a public data
visualization contest):
– Packets of incoming, out-going, and internal
broadcast traffic
– One trace of normal network traffic
– Three traces of network intrusions
Classification Models on tcpdump
• Data preprocessing:
– Extract the “connection” level features:
• Record connection attempts
• Monitor data packets and count: # of bytes in each
direction, resent rate, hole rate, etc.
• Watch how connection is terminated
Classification Models on tcpdump
• Data Preprocessing:
– Each record has:
• start time and duration
• participating hosts and ports (applications)
• statistics (e.g., # of bytes)
• flag: “normal” or a connection/termination error
• protocol: TCP or UDP
– Divide connections into 3 types: incoming, outgoing, and inter-lan
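As a sketch of what one such connection record might look like in code, with field names that are assumptions based on the list above rather than the authors' actual schema:

from dataclasses import dataclass

@dataclass
class ConnectionRecord:
    start_time: float   # seconds since the start of the trace
    duration: float     # seconds
    src_host: str
    dst_host: str
    dst_srv: int        # destination port/service, later used as the class label
    src_bytes: int
    dst_bytes: int
    flag: str           # "normal" or a connection/termination error code
    protocol: str       # "tcp" or "udp"
    direction: str      # "incoming", "outgoing", or "inter-lan"

rec = ConnectionRecord(0.0, 1.2, "A", "B", 80, 42, 512, "normal", "tcp", "outgoing")
print(rec.dst_srv, rec.direction)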
Classification Models on tcpdump
• Building classifier for each type of
connections:
– Use the destination service (port) as the class
label
– Training data: 80% of the normal connections
– Testing data: 20% of the normal connections
and connections in the 3 intrusion traces
– Apply RIPPER to learn rules
Classification Models on tcpdump
• The output RIPPER rules describe the
“normal” characteristics of the destination
services. The rule set is the profile of the
normal network traffic.
• Using the rules to analyze tcpdump traces:
– Examine each connection record according to
the rules
– Calculate the percentage of misclassification
(violation of a rule). This percentage is the
deviation from the profile.
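A small sketch of this scoring step, with the learned rules stubbed out by a hypothetical predictor that guesses the destination service from the other connection features:

def misclassification_rate(connections, predict_service):
    """Percentage of connection records whose actual destination service
    differs from the service predicted by the learned "normal" profile."""
    wrong = sum(1 for c in connections if predict_service(c) != c["dst_srv"])
    return 100.0 * wrong / len(connections)

# Stub predictor standing in for the learned RIPPER rules.
def predict_service(conn):
    return "http" if conn["dst_bytes"] > 100 else "smtp"

connections = [
    {"dst_srv": "http", "dst_bytes": 512},
    {"dst_srv": "smtp", "dst_bytes": 40},
    {"dst_srv": "ftp",  "dst_bytes": 2048},   # mismatch counts as a deviation
]
print("%.1f%% misclassified" % misclassification_rate(connections, predict_service))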
Classification Models on tcpdump
• Results - misclassification rate on each type
of connections:
Connection data   Normal   Intrusion1   Intrusion2   Intrusion3
Out-going         3.91%    3.81%        4.76%        3.71%
In-coming         4.68%    6.76%        7.47%        13.7%
Inter-lan         4%       22.65%       8.7%         7.86%
This model is not very effective in detecting intrusions
Classification Models on tcpdump
• Adding temporal features for better models:
– Examine all connections in the past n seconds,
and count:
• the number of connection errors, all other errors,
connections to system services, user applications,
and connection to the same service as the current
connection
• average duration and data bytes of all connections;
and the same averages of connections to the same
service.
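A minimal Python sketch of computing these temporal features, assuming connection records are dicts with illustrative field names; which counts and averages to include follows the list above:

def temporal_features(records, current, window=30.0):
    """Statistics over all connections that started within `window` seconds
    before the current connection."""
    recent = [r for r in records
              if current["start"] - window <= r["start"] < current["start"]]
    same_srv = [r for r in recent if r["dst_srv"] == current["dst_srv"]]

    def avg(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "n_conn_errors": sum(1 for r in recent if r["flag"] != "normal"),
        "n_same_srv": len(same_srv),
        "avg_duration": avg([r["duration"] for r in recent]),
        "avg_duration_same_srv": avg([r["duration"] for r in same_srv]),
    }

# Toy records; real temporal features would be appended to every connection
# record before running RIPPER.
conns = [
    {"start": 0.0,  "duration": 1.2,  "dst_srv": "http", "flag": "normal"},
    {"start": 5.0,  "duration": 0.5,  "dst_srv": "http", "flag": "syn_error"},
    {"start": 20.0, "duration": 10.2, "dst_srv": "ftp",  "flag": "normal"},
]
print(temporal_features(conns, {"start": 25.0, "dst_srv": "http"}))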
Classification Models on tcpdump
• Results of adding the temporal features, the
time window is 30 seconds:
Connection data   Normal   Intrusion1   Intrusion2   Intrusion3
Out-going         0.88%    2.54%        3.04%        2.32%
In-coming         0.31%    27.37%       27.42%       42.20%
Inter-lan         1.43%    20.48%       5.63%        6.80%
Adding temporal statistical features improves the
effectiveness of the detection models
[Plot: effects of time window length on misclassification rate. The x-axis is the time window in seconds (0 to 100); the y-axis is the misclassification rate (0 to 0.45); one curve each for the normal, attack1, attack2, and attack3 traces.]
How do we obtain the optimal time window length?
Classification Models on tcpdump
• Lessons learned:
– Data preprocessing requires extensive domain
knowledge
– Adding temporal features improves
classification accuracy
– Need tools to guide (temporal) feature selection
Building Classifiers for Intrusion
Detection
• Meta classifier that combines evidence from
multiple detection models:
– Build base classifiers that each model one
aspect of the system
– The meta learning task:
• each record has a collection of evidence from base
classifiers, and a class label “normal” or “abnormal”
on the state of the system
– Apply a learning algorithm to produce the meta
classifier
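A small sketch of this meta-learning step, using a decision-table style of meta-classifier (as in the architecture figure); the base detector outputs and labels below are made up:

from collections import Counter, defaultdict

def learn_decision_table(evidence_rows, labels):
    """For every combination of base-detector verdicts seen in training,
    remember the majority class label."""
    by_combo = defaultdict(Counter)
    for combo, label in zip(evidence_rows, labels):
        by_combo[tuple(combo)][label] += 1
    return {combo: counts.most_common(1)[0][0]
            for combo, counts in by_combo.items()}

def meta_predict(table, combo, default="abnormal"):
    # Unseen combinations fall back to a conservative default.
    return table.get(tuple(combo), default)

# Hypothetical verdicts from two base detectors (e.g., a system call model
# and a network traffic model) and the true state of the system.
evidence = [("ok", "ok"), ("alert", "ok"), ("ok", "alert"),
            ("alert", "alert"), ("ok", "ok")]
labels = ["normal", "normal", "abnormal", "abnormal", "normal"]

table = learn_decision_table(evidence, labels)
print(meta_predict(table, ("alert", "alert")))   # -> "abnormal"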
Mining Patterns from Audit Data
• Association rules: describe multi-feature
(attribute) correlation from a database
• X => Y , confidence, support:
– X and Y are subsets of the attribute values in a
record
– support is the percentage of records that contain
X and Y
– confidence is support(X+Y)/support(X)
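A minimal sketch of computing support and confidence for one candidate rule X => Y over a table of audit records (records as attribute/value dicts, with made-up data):

def support(records, itemset):
    """Fraction of records containing every attribute=value pair in itemset."""
    hits = sum(1 for r in records
               if all(r.get(k) == v for k, v in itemset.items()))
    return hits / len(records)

def rule_stats(records, x, y):
    """Return (confidence, support) for the rule X => Y."""
    s_xy = support(records, {**x, **y})
    s_x = support(records, x)
    return (s_xy / s_x if s_x else 0.0), s_xy

# Toy shell-history records, in the spirit of the .sh_history example below.
records = [
    {"command": "trn", "arg": "rec.humor"},
    {"command": "trn", "arg": "comp.lang.c"},
    {"command": "vi",  "arg": "paper.tex"},
    {"command": "trn", "arg": "rec.humor"},
]
conf, supp = rule_stats(records, {"command": "trn"}, {"arg": "rec.humor"})
print("trn => rec.humor  [conf=%.2f, supp=%.2f]" % (conf, supp))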
Association Rules
• Motivations:
– Audit data can be easily formatted into a
database table
– Program executions and user activities have
frequent correlation among system features
– Incremental updating of the rule set is easy
• An example from the .sh_history :
– trn => rec.humor, [0.3, 0.1]
– Meaning: 30% of the time when using trn, the
user is reading rec.humor; and reading this
newsgroup constitutes 10% of all sh commands
Mining Patterns from Audit Data
• Frequent Episodes: frequent events
occurring within a time window
• X => Y, confidence, support, window:
– X and Y are subsets of the attribute values in a
record
– support is the percentage of (sliding) windows
that contain X and Y
– confidence is support(X+Y)/support(X)
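A rough Python sketch of window-based support and confidence for one episode X => Y, treating events as (timestamp, name) pairs; for simplicity it ignores the ordering of events inside a window, which the full frequent-episodes algorithm would take into account:

def sliding_windows(events, width, step=1.0):
    """Event names in every window [t, t + width) across the event sequence.
    Events must be sorted by timestamp."""
    if not events:
        return []
    start, end = events[0][0], events[-1][0]
    result, t = [], start
    while t <= end:
        result.append({name for ts, name in events if t <= ts < t + width})
        t += step
    return result

def episode_stats(events, x, y, width):
    """Return (confidence, support) for the episode X => Y."""
    windows = sliding_windows(events, width)
    n_x = sum(1 for w in windows if set(x) <= w)
    n_xy = sum(1 for w in windows if set(x) | set(y) <= w)
    return (n_xy / n_x if n_x else 0.0), n_xy / len(windows)

# Toy web-log events (seconds, page), echoing the web-log example that follows.
events = [(0, "home"), (5, "research"), (12, "theory"),
          (40, "home"), (50, "courses"), (80, "home"), (85, "research")]
print(episode_stats(events, x=["home", "research"], y=["theory"], width=30))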
Frequent Episodes
• Motivation:
– Sequence information needs to be included in a
detection model
• An example from a department’s web log:
– home, research => theory, [0.2, 0.05], [30]
– Meaning: 20% of the time, after home and
research pages are visited (in that order), the
theory page is then visited within 30 seconds from
when home is visited; and visiting these three
pages constitutes 5% of all visits to the web site
Using the Mined Patterns
• Guide the audit data gathering process:
– Run a program under different settings
– For each run, calculate the association rules and
frequent episodes from its audit data
– Merge them into an aggregate rule set
– Stop gathering audit data when no rules can be
added from a new run
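A sketch of this stopping criterion, with the mining step stubbed out; the point is only the merge-until-stable loop, and the rule representation (hashable tuples) is an assumption:

def gather_until_stable(runs, mine_patterns):
    """Merge the patterns mined from each run into an aggregate rule set;
    stop when a new run contributes no new rules."""
    aggregate = set()
    for i, run in enumerate(runs, start=1):
        new_rules = mine_patterns(run) - aggregate
        if not new_rules:
            print("run %d: no new patterns, stop gathering audit data" % i)
            break
        aggregate |= new_rules
        print("run %d: added %d new patterns" % (i, len(new_rules)))
    return aggregate

# Stub miner: pretend each run's audit data yields a set of rules.
def mine(run):
    return {(item, "y") for item in run}

runs = [{"a"}, {"a", "b"}, {"b"}]
print(gather_until_stable(runs, mine))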
Using the Mined Patterns
• Support the feature selection process:
– System features in the association rules and
frequent episodes should be included in the
classification models
– Time window and features in the frequent
episodes suggest additional temporal features
should be considered
Using the Mined Patterns
• Alternatives and complement to
classification models:
– Examine new audit trace and calculate
“violation” scores: missing rules, new rules,
deviations in confidence and support, etc.
– Study the “unique” patterns in the trace of
suspected attack to further pin point the cause
of the intrusion alarms.
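A minimal sketch of scoring a new trace against the aggregate profile of mined patterns; both inputs map a rule to its (confidence, support), and the weights are arbitrary placeholders:

def pattern_violation_score(profile, trace_patterns,
                            w_missing=1.0, w_new=1.0, w_dev=0.5):
    """Score a trace by missing rules, new rules, and deviations in the
    confidence/support of shared rules."""
    missing = set(profile) - set(trace_patterns)     # expected but absent
    new = set(trace_patterns) - set(profile)         # never seen in normal data
    score = w_missing * len(missing) + w_new * len(new)
    for rule in set(profile) & set(trace_patterns):
        (c1, s1), (c2, s2) = profile[rule], trace_patterns[rule]
        score += w_dev * (abs(c1 - c2) + abs(s1 - s2))
    return score

# Hypothetical rule sets keyed by (antecedent, consequent) strings.
profile = {("trn", "rec.humor"): (0.3, 0.1),
           ("home,research", "theory"): (0.2, 0.05)}
trace = {("trn", "rec.humor"): (0.9, 0.4),
         ("src_srv=smtp", "flag=unwanted_syn_ack"): (1.0, 0.38)}
print(pattern_violation_score(profile, trace))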
Using the Mined Patterns
• tcpdump data revisited:
– How to select the right time window?
– Hypothesis: the appropriate window should
contain stable sets of frequent episodes
– Experiments: mine frequent episodes using
different window lengths, and count the number
of episodes
Results on time window length vs. # of episodes:
[Plot: number of episodes as a function of the time window in seconds (0 to 250). Three curves: raw episodes, episode rules with conf=0.8, and episode rules with conf=0.6; the y-axis (# of episodes) runs from 0 to 300.]
The optimal time window length for classification has
stable # of episodes
Using the Mined Patterns
• tcpdump data revisited:
– “unique” patterns in intrusion data may provide
some insights
– intrusion 3:
• one of the unique frequent episode rules:
– dst_srv=“auth” => flag=“unwanted_syn_ack”, [0.82, 0.1], [30]
• one of the unique association rules:
– src_srv=“smtp” => duration=0, flag=“unwanted_syn_ack”, dst_srv=“user_apps”, [1.0, 0.38]
Architecture Support
• Dedicated learning agents are responsible
for building detection models
• Base and meta detection agents are
equipped with learned models
• Detection agents provide new audit data to
the learning agents
• Learning agents dispatch updated models
• JAM (Java Agents for Meta-learning), built for
fraud detection, is the model for this architecture
[Architecture diagram (repeated from earlier): Audit Records → Audit Data Preprocessor → Activity Data → (Base) Detection Engine, which applies the Rules/Detection Models produced by the Learning Agent's Inductive Learning Engine; the base agent's Evidence and Evidence from Other Agents are combined by the Meta Detection Agent's (Meta) Detection Engine and Decision Table into a Final Assertion, which the Decision Engine turns into an Action/Report.]
Current Status
• Accomplished:
– Experiments on sendmail and tcpdump data
– Implementation of the association rules and the
frequent episodes algorithms. Testing on
medium size data sets (30,000+ records, each
with 6+ fields) has been completed.
– Design and 35% of the implementation of a
support environment for mining patterns from
audit data
– High-level system architecture design
Research Plans
• To be completed within the next year and a
half:
– Finish the implementation of the support
environment for mining patterns
– Experiments on using the algorithms and the
environment to gather audit data and select
features
– Experiments on building meta detection models
Research Plans
• To be completed within the next year and a
half:
– Detailed architecture design
– Implementing a prototype intrusion detection
system
– Final evaluation using “standard/public” data
sets
Conclusions
• We demonstrated the effectiveness of
classification models for intrusion detection
• We propose to use systematic data mining
approaches to select the relevant system
features to build better detection models
• We propose to use a (meta) learning agent-based
architecture to combine multiple models, and to
continuously update the detection models.